The File Formats FAQ

$Revision: 1.13 $, $Date: 2005/10/31 18:02:35 $

Introduction
Microsoft DOC and RTF
WordPerfect
StarOffice and OpenOffice
Other Word Processors
Adobe PostScript
Adobe PDF
TeX, LaTeX, DVI
HTML
Plain ASCII Text
INTERCHANGING

Introduction

Like standards, one of the (not so) nice things about file formats is that there are so many to choose from. People who live in the PC/Windows world too often incorrectly assume that the ".DOC" (Microsoft Word) and, to a lesser extent, ".RTF" (Rich Text Format) file types are universally interchangeable. Likewise, people in the Unix, Linux, and Solaris worlds assume that PostScript is universal. Both groups are completely wrong!

This document strives to clarify some of the problems with the various formats, and proposes workarounds, suggestions, and recommendations on better interchange formats. In the absence of REAL public standards (ones that are not "embraced and extended" in proprietary, perhaps DMCA-protected ways), it's difficult to find a good solution. And there won't be any public standards as long as customers of the big vendors passively accept what they pay small fortunes for without complaining. Downtrodden masses of the word processing world need to unite!

Microsoft DOC and RTF

These are two file formats that are tightly related, despite appearances to the contrary. Both are proprietary and both came from Microsoft. The DOC format is a binary one, while RTF appears to be an ASCII markup version of it (i.e. you can view RTF files with all their markup in textpad or notepad or emacs or vi).
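The binary-versus-markup difference is easy to see from the first few bytes of a file. What follows is only a rough sketch in Python, not a validator: it assumes an RTF file opens with the ASCII text {\rtf and that a binary DOC file of this era sits inside an OLE2 compound container, whose signature is the eight bytes D0 CF 11 E0 A1 B1 1A E1. The function names are made up for illustration, and the same OLE2 signature also appears on old Excel and PowerPoint files, so a match only means "probably a binary Office file".

```python
# First-bytes signatures; a heuristic, not a format validator.
RTF_MAGIC = b"{\\rtf"
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def sniff_word_format(data: bytes) -> str:
    """Guess from the leading bytes whether data is RTF or an OLE2 file."""
    if data.startswith(RTF_MAGIC):
        return "rtf"
    if data.startswith(OLE2_MAGIC):
        # OLE2 is a container: DOC uses it, but so do XLS and PPT.
        return "ole2"
    return "unknown"


def sniff_file(path: str) -> str:
    """Read just enough of a file on disk to apply sniff_word_format."""
    with open(path, "rb") as handle:
        return sniff_word_format(handle.read(8))
```

Calling sniff_file on a document before mailing it is a cheap way to find out what you are really about to send, whatever the file extension claims.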

DO NOT BELIEVE that RTF is an acceptable interchange format; the format is fundamentally flawed and has been described by leading CS authorities as "self-inconsistent" (in technical terms, a DTD could not be constructed for it). This, plus the lack of useful details on its structure from Microsoft, means it should not be used. It's ironic, but you are probably better off using the ".DOC" format instead if there is no other choice (read on; there are choices).

People who use both Word and WordPerfect will usually let you know in no uncertain terms that the big problem with the former is its lack of the "Reveal Codes" functionality. This means you cannot really see the structure of your document, and some of the hidden codes may make it unreadable in other word processing or translation programs. This functionality is made possible in WordPerfect because the foundation of their file format is solidly grounded in SGML (Standard Generalized Markup Language, the parent of HTML).

The "Word" DOC format is in fact highly variable: the version of Word that generated a document dictates which versions of Word can read it. Older versions will be unable to correctly read, render, or print DOC files generated with newer versions (though MS seems to have finally realized this is not a good thing and it's less likely to be a problem in the future). And with document interchange, especially outside our organization, you will have NO idea which version of Word, if any, your recipient will have!

Finally, there is a new feature promised for the version of Word that comes with Office 2003. It's called Information Rights [sic] Management and because the DMCA can likely block attempts to reverse-engineer the system, it could spell the end of non-Microsoft packages (WordPerfect, StarOffice and OpenOffice, etc.) from being able to read such documents. This is clearly engineered to cut out competition and inter-operability.

WordPerfect

I'm biased. I have believed for years that WordPerfect (WP) is a superior product to Word. Many of my colleagues who have used both concur. As alluded to above, WordPerfect has two good features going for it:

The ability to "Reveal Codes"; when you engage this function, the window splits in two with the bottom part showing the various codes (font changes, hard returns and page breaks, boxes, etc.). You can step through these, delete unwanted or unneeded codes, and gain much more control over your document. Often, such codes can be the root of many document interchange and formatting problems. As far as I know, this function is not available in Word or OpenOffice/StarOffice, or if it is, only in an extremely diluted form.
All versions of WordPerfect (starting with version 6) have the ability to read a document created in any other WP version. For example, my copy of WordPerfect 7.0 (on Linux, of course; yes, I've kept this old version alive even on Red Hat 9!) can easily read a WP file generated by WordPerfect Office 2000. Some advanced features may not translate, but the program won't balk or crash, and will render the content accurately.

This word processor has some added advantages. Its ability to render complex formulae, while still second fiddle to the wonderful (and free) TeX and LaTeX markup languages, has still made it a favourite of many Scientists. And it does possess some crude typesetting features, such as the ability to manually insert ligatures (e.g., for ff, fi, fl, ffi, and ffl) and do hand-crafted kerning (minute adjustment of the "leading" [as in lead weights printers used to use] space between characters). And finally, it is not restricted to the Windows platform, and has been available for Linux since before 1997.

Alas, the temporary (and rather costly) purchase of a minority stake in Corel by Microsoft in the early 2000s had the unfortunate side effect of curtailing the availability of WordPerfect for Linux. Corel dumped its Linux OS business, and let the Linux version of WordPerfect die a sad death shortly afterwards. So while you can still find old versions of WP Office 2000 for Linux, getting these to work on modern Linux distributions is problematic (Corel used their own tweaked version of Wine (the WINdows Emulator for Linux), one that is incompatible with modern versions, including the Crossover version). One can hope, now that Corel is no longer under the thumb of Microsoft, that they may yet resurrect this useful package for Linux users, but other than a brief "test" re-release of WP version 8 for Linux (one that has stability problems in modern environments, in my experience) they have shown no interest thus far in doing so. At this point, despite the compelling "Reveal Codes" function, I've given up on Corel and now use OpenOffice instead. Read on...

StarOffice and OpenOffice

When I originally wrote this document, this package was very much the new kid on the block. That's certainly not the case anymore, and I now use OpenOffice extensively, even to the point of making it my de facto standard as a word processor. StarOffice was purchased years ago by Sun Microsystems, and OpenOffice.org is the Open Source (free) project on which releases of StarOffice are based. This is similar to the relationship between Netscape/AOL and Mozilla.org for web browsers. This is very good as it means quality and stability steadily improve due to the hard work and wide scrutiny inherent in Open Source development.

Both StarOffice and OpenOffice have several compelling advantages:

They are (to my knowledge) the only office suites that run on ALL platforms: Windows, Mac, Linux, and Solaris.
The OpenOffice.org version is free! So is StarOffice for Educational use.
You don't have to replace Microsoft Office on your Windows machine to use OpenOffice or StarOffice; they can simply exist as an extra application for when you want to send that document to your favourite Linux geek :-)

Both StarOffice and OpenOffice provide Unix users (Solaris, Linux, etc.) with a much-needed ability to convert those annoying Word, Excel, and PowerPoint files that their colleagues occasionally (usually unthinkingly) send their way. The ability to import and export the usual .doc, .xls and .ppt files is now quite robust and effective, as much as any non-Microsoft office suite can be. In fact it seems better than WordPerfect, surprisingly enough.

My experience starting with StarOffice (5.2) and now OpenOffice (1.0.1, 1.1, and now 2.0) has been good, and is improving. They are successful at importing and rendering MS files almost all the time. At one point in the very distant past, I remember StarOffice 5.2 crashing on a particularly annoying file. My unconfirmed suspicion is that there was a self-inconsistency in the file, or some proprietary unpublished "extension" that causes such crashes. Of course to us Linux/Solaris users, if the application crashes, it will almost always leave our windowing system (X11) running smoothly, and it NEVER causes the operating system to crash!

Having said that, it's not all roses with OpenOffice/StarOffice. The ease-of-use factor, compared to, say, WordPerfect, is not quite what it should be. An oft-quoted example of this is how one changes the starting page number of a document (e.g., you've got a particular chapter of a book in a standalone file). In WordPerfect, it's just a few clicks to get to a dialog box where you can set this parameter. Heck, even in TeX or LaTeX, setting the page number (\pageno in plain TeX, or the page counter in LaTeX) is not that hard. But in OpenOffice, what one has to do for this supposedly simple task is incredibly obscure. It can only be done after you reach the second page of your document, and even then it's a rather advanced use of styles (especially if, as is usual with a chapter, you don't want the page number to appear on the first page thereof). I think I counted 12 or 14 steps in the process! Clearly (and with all due respect), human factors and usability need to influence the developers of OpenOffice a tad more in the future. I'm sure this will happen.

Windows users should really have StarOffice or OpenOffice installed in addition to their bundled office suite, so that they will be able to exchange documents with their Linux and Unix based colleagues. It's far preferable to have the sender of a document make sure it looks right in OpenOffice before sending; that way if there are any glitches in the conversion, they can be caught before they become a problem.

The native format for OpenOffice (and I believe StarOffice as of version 6 or 7) is a .SXW file, which is in essence a zip file containing a set of XML files, any embedded graphics, and a manifest and style guide. It's very open and in principle could be understood by any other software package (assuming the writers of said package are compelled to provide import capability for it).
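Because an .SXW file is just a zip archive, ordinary zip tools can look inside one. Here is a minimal sketch in Python using the standard zipfile module. It assumes the usual OpenOffice.org 1.x layout, in which the document body lives in a member called content.xml; the function names are invented for this example.

```python
import zipfile


def list_sxw_members(source):
    """Return the names of the files stored inside an .sxw archive.

    source may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        return archive.namelist()


def read_sxw_content(source):
    """Return the raw XML of the document body as text.

    Assumes the body is stored as content.xml and encoded in UTF-8.
    """
    with zipfile.ZipFile(source) as archive:
        return archive.read("content.xml").decode("utf-8")
```

Running list_sxw_members on a real document should show the XML files and the manifest described above. What you do with the XML afterwards is up to whatever parser you prefer.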

Other Word Processors

Most modern Linux distributions come with several other word processing (and office suite) alternatives, including:

AbiWord (and gnumeric for spreadsheets); and
KDE Office (kword, kspread and friends)

AbiWord is a full-fledged word processor, as is kword (part of koffice). Both are fairly capable, though definitely second fiddle to OpenOffice or StarOffice. AbiWord has the advantage of having a WordPerfect import filter, though it's not perfect (OpenOffice, as of the time of writing, doesn't yet have a WordPerfect import filter, but there is one in the works). If nothing else, it's good to have them available for problematic documents.

Adobe PostScript

The really nice feature about PostScript is that it is a public standard. Adobe owns it, but they put all their cards on the table by producing a book (the "red" one) detailing every last nuance and function of the file format. PostScript is actually a markup language for printers; you can examine a PS file in any text editor and see the commands that do various things (move, draw, download font, etc). PostScript wizards can even program this way, and produce compact files that do extensive computation before rendering -- in the printer!

PostScript and its clone, GhostScript, have become a de facto industry standard in the Unix and Macintosh worlds. You want a printer for a Unix system? Get a PostScript printer. This applies to Solaris, Linux, Compaq Alpha, HP, SGI, and the *BSD systems. It's almost universal. Even home Linux users (like yours truly) go this route; the Lexmark Optra inkjet series is PostScript capable.

However, the Windows world has never really adopted PostScript. You can get various tools from Adobe to deal with the format (these will cost you money), and Open Source tools such as gsview and GhostScript to view and print PS files on Windows (these tools are free). But unlike Linux, none of these come with the system, so the user has to go to the bother of installing them in order to get things to work.

That said, PS is for the most part a good document interchange format ONLY if you have an end product. You don't want to edit a PS file (unless you're a wizard) or try converting it to another format; it's not intended for that purpose.

The only "gotcha" with PostScript is that some older printers and previewers cannot handle the more recent versions of PostScript (usually version 3); however, this is increasingly rare as a problem with modern systems and printers, and most tools will allow you to set the version of PS used if you're producing such a file (e.g. Adobe Acrobat Reader can do versions 1-3).

Finally, an interesting twist: because ghostscript is so successful, it is used on Linux systems as the back-end of the printing driver, converting the input -- which is usually PostScript -- to the native format for the printer. Often this will be a HP (PCL) or Epson (ESC P 2) printer. However, watch out for printers that rely on proprietary drivers that are only available on Windows. These vendors are shooting themselves in their respective feet by not providing a driver (even a binary one) for Linux. The good news is that the tech people at Epson and HP are most definitely Linux-aware, and work with the appropriate Open Source groups. HP has even released a (binary) copy of its hpijs driver so their newer printers can be used on Linux systems at full capabilities (the author has a HP 990 Cxi printer at home at the time of writing this [October 2003]). No doubt HP's efforts were due in no small part to the short but fruitful period when Bruce Perens was HP's "Linux Czar", though I know there is a strong and dedicated Linux team behind the scenes at HP.

Adobe PDF

NOTE: When Adobe released version 6 of their reader initially only for Windows and Mac OS X, the Linux community protested (loudly). For a while, the only way Linux and *BSD users had to view newer PDF files was an Open Source program called xpdf. However, Adobe now makes Acrobat Reader 7 available for Linux, so the cautionary note formerly here is duly rescinded.

PDF is PS with compression and a few other nifty features. It's Adobe's newest format, and is proprietary to a large extent. However, they provide free readers (Acrobat) for all platforms (and I mean all) so once again they have managed to encourage use of their format as a de facto industry standard -- for the end product.

As well as the popular Acrobat Reader, Adobe makes (non-free) software such as Distiller, InDesign and other tools that allow you to generate PDF files. These are professional, well designed and very useful packages, albeit with a significant (mid-3-figure) cost.

Recent versions of ghostscript, gsview, and gv do in fact read most PDF documents directly. There are some, however, that they are unable to render, owing to the proprietary and binary nature of PDF. If in doubt, try it out before sending. In addition, there are development versions of two tools, pdftex and pdflatex, that will create a high quality PDF file directly from a TeX or LaTeX document.

OpenOffice and its commercial StarOffice cousin have a very nice Print to PDF function that produces excellent PDF documents.

PDF is NOT a good document interchange format. If you, for example, create a document with pdflatex, and send it to someone running Adobe InDesign, it will likely be an unmitigated disaster if they try any serious changes (e.g. substituting a different font). The ligatures, etc. that TeX and LaTeX insert will be arbitrarily converted to bizarre entities such as right curly braces "}" or nothing at all. You may end up with "}owing text" instead of "flowing text".

Historical Note from early 2002: The Acrobat 5 PDF-generating software for Windows systems has been found to be problematic; some of the output it produces has been reported as unreadable in older Acrobat readers. You may need to ask the recipients of your documents to upgrade to Acrobat Reader version 5 to overcome such issues (it is available for all platforms). Caveat Scriptor.

TeX, LaTeX, DVI

TeX and LaTeX (pronounced "tech" and "lah-tech") are document markup languages for real typesetting. Most word processors do not produce typeset quality output; their kerning is usually non-optimal, their hyphenation algorithms are crude, and their treatment of ligatures is, well, ... they generally don't use them.

By comparison, TeX was written from the ground up by a Stanford Professor of Computer Science called Donald Knuth because he wasn't satisfied with the existing process (back in the early 1980s) for publishing Mathematics books. The TeX language is extremely powerful and rich, but as with all such languages, it can be very complex. It's markup, so you end up seeing {\bf boldface} and {\it italics\/} as well as many other "tags" in the source file.

TeX is available for most any system. It comes with most Linux installations, is available for most other Unix systems, and is even available on Windows and Mac systems (e.g., PCTeX, though I'm told there are better implementations). LaTeX is Leslie Lamport's extension or layer on top of TeX. Both of these allow you, via these markup commands, to generate professionally typeset PostScript files and printouts. People use simple text editors like textpad, notepad, emacs, and vi to create their .TEX files or source documents.

Both TeX and LaTeX, when run on a source document, will produce a DeVice-Independent or .DVI file. This is meant as a "compiled" binary version of the main document, and can be converted to a variety of formats, though recently this will usually be PostScript. The Unix "xdvi" viewer can view DVI files directly on screen. I believe the PCTeX package includes a DVI viewer as well.

Scientists have been using TeX and LaTeX in academic publishing for over two decades now, and there is little sign that this trend will stop anytime soon. The quality, flexibility and superiority of the languages for markup of complex mathematical formulae is without equal anywhere else. Nothing can do as good a job. Plus, its hyphenation algorithm is only now -- after 17 years -- being approached or equalled by costly "professional" publication packages such as Adobe InDesign. And the main TeX and LaTeX systems themselves are free for the taking.

The TeX/LaTeX markup formats are the ideal document interchange medium for Scientists collaborating on publications or papers; they are plain text files and very amenable to transmission via e-mail, and compress nicely with freely available software such as GNU zip.

There is a set of utilities that comes bundled with the version of TeX and LaTeX on most Linux systems, which enables creation of PDF output directly from TeX or LaTeX source. These are "pdftex" and "pdflatex". For the most part they work just like their counterparts (the tex and latex commands) but they produce PDF output right away (no ".DVI" file). They also contain facilities that allow you to embed image files (JPEGs and the like) directly into a PDF.

HTML

Everyone knows that HTML is the language of the web. However, it's not usually thought of as a good document interchange format. But for many documents, especially when people are dealing with mutually incompatible (and even mutually hostile) packages such as Word and WordPerfect, HTML can be a viable way of interchanging files.

WordPerfect has an "Internet Publisher" option in the "File" menu that allows you to create a HTML version of the document you're working on. If you have such a document and want to circulate it among users with heterogeneous systems (Windows, Mac, Unix, Linux...) then this is one of the ideal formats. It's universally readable. And WordPerfect is fairly good at producing relatively "clean" HTML markup, without proprietary clutter.

Microsoft Word, on the other hand, while it can produce HTML output, will generally try to codify the MS extensions in various tags that do absolutely nothing in non-MS browsers such as Netscape, Mozilla, Opera, Lynx, Konqueror, W3M, and the mini-browsers now found on PDAs and cell phones. All these extensions do is inflate the size of the HTML file, often by 1000% (in one case I measured, it was 11 times larger than necessary!), and this slows down the rendering of the file in the browser (and hogs bandwidth for those of us still stuck using 56k modems). Plus, all those nifty extensions, fonts, etc. do not show up for your non-IE or non-MS audience. In the academic world, this audience is not small and may in fact be in the majority. Finally, much of what Word (or its web-enabled sibling, FrontPage) produces will only render correctly with a combination of a MS web server (IIS, not the more popular and free/open-source Apache) and a MS web browser (IE).

However, that said, a HTML file from Word is at least viewable universally, albeit with issues. A DOC file is not. So HTML in this sense is vastly preferable to MS's native format.

Plain ASCII Text

Once upon a time, before there was Word, StarOffice, OpenOffice, WordPerfect, or even WordStar, people communicated with each other, sent documents, and even printed out memos, reports, and more, in (wait for it... drumroll...) PLAIN TEXT. That's right; just the plain letters, numbers, and symbols you see on the keyboard under your fingers right now. It's actually surprising how much information can be clearly and effectively sent using plain text. Yes, you can of course "pretty it up" with fancy fonts, graphics, charts, and more. But do you need to? Think before you embellish.

And, while it may be stating the obvious, plain text is by far the most easily interchanged format there is. It's also the most compact (no space taken up with hidden codes) and can be viewed on ALL platforms. Plus, it's printable on just about every known printer on the planet. And it is trivial to use in e-mail; just type it in or include it.

Almost every word processor, text editor, clipboard program, etc. can handle simple ASCII text. The only gotcha I can think of is the issue of whether lines end with just a newline (linefeed, the unix way) or a carriage return and newline (cr/lf). These days most Unix text editors like emacs can transparently handle the two types of line termination so it's really not much of an issue. I have perl filters if anyone needs them, in ~pmurphy/bin/crlf and ~pmurphy/bin/uncrlf, on all NRAO main sites (CV, GB, AOC, TUC) that convert to/from these two types of file (the links here bring you to the perl scripts themselves on the web; they're GPL'd so feel free to download). If you see unsightly ^M's (control-M's) in your files, then these filters are for you.
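For readers without access to those perl filters, the same conversion can be sketched in a few lines of Python. This is not the code of the crlf and uncrlf scripts mentioned above, only an illustration that borrows their names. It works on raw bytes so that nothing else in the file is touched.

```python
def uncrlf(data: bytes) -> bytes:
    """Convert CR/LF line endings to the bare newline used on Unix."""
    return data.replace(b"\r\n", b"\n")


def crlf(data: bytes) -> bytes:
    """Convert bare newlines to CR/LF.

    Existing CR/LF pairs are normalized first so they are not doubled.
    """
    return uncrlf(data).replace(b"\n", b"\r\n")


def convert_file(src: str, dst: str, to_crlf: bool) -> None:
    """Read src in binary mode, convert it, and write the result to dst."""
    with open(src, "rb") as infile:
        data = infile.read()
    with open(dst, "wb") as outfile:
        outfile.write(crlf(data) if to_crlf else uncrlf(data))
```

Opening the files in binary mode matters here: in text mode Python may translate line endings on its own and hide the very characters you are trying to fix.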

INTERCHANGING

So what conclusions can one draw from this highly subjective, if (hopefully) informative, diatribe? =:->

The best formats for document interchange will depend on your target audience and what requirements the document has. In order of decreasing preference (best first), my choice would be:

Plain ascii text
html
PostScript (end product)
tex or latex source (great for collaboration)
OpenOffice sxw format (OASIS)
dvi
pdf (end product, but see cautionary note)
wordperfect
word (homogeneous [same word version] audience only!)
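If you like your opinions machine-readable, the ranking above can be treated as an ordered lookup: take the first format on the list that your recipient can actually open. The sketch below is only that, a sketch. The short labels and the function name are invented for this example, and it ignores the end-product versus editable distinction noted beside some of the entries.

```python
# The author's ranking, best first; the labels are hypothetical shorthand.
PREFERENCE = [
    "ascii",
    "html",
    "postscript",
    "tex",
    "sxw",
    "dvi",
    "pdf",
    "wordperfect",
    "word",
]


def best_format(recipient_can_read):
    """Return the highest-ranked format the recipient can read, or None."""
    for fmt in PREFERENCE:
        if fmt in recipient_can_read:
            return fmt
    return None
```

For example, best_format({"word", "pdf"}) picks "pdf", because it sits higher in the list than "word".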

So there you have it. I hope some of this turns out to be useful. Constructive suggestions for changes/enhancements are welcome. Flames? Well, that's your choice but I'll likely send them to /dev/null (for Windows users, that's NUL:).

Like the style this was written in? Here's the stylesheet:

<style type="text/css"> h2 { text-align:center; margin-top: 2em } h3,h4,h5,h6 { margin-left: 4%; margin-right: 4% } p { text-indent: 2em; margin-left: 4%; margin-right: 4% } ol, ul, dl { margin-left: 6%; margin-right: 6% } li { margin-bottom: 1em } </style>

Modified on Monday, 31-Oct-2005 13:02:35 EST