THE FILE FORMATS FAQ
$Revision: 1.13 $, $Date: 2005/10/31 18:02:35 $
Like standards, one of the (not so) nice things about file formats is
that there are so many to choose from. People who live in the
PC/Windows world too often incorrectly assume that the
This document will strive to clarify some of the problems with the various formats, and to propose workarounds, suggestions, and recommendations on better interchange formats. In the absence of REAL public standards (ones that are not "embraced and extended" in proprietary, perhaps DMCA-protected ways), it's difficult to find a good solution. And there won't be any public standards as long as customers of the big vendors passively accept what they pay small fortunes for without complaining. Downtrodden masses of the word processing world need to unite!
These are two file formats that are tightly related, despite
appearances to the contrary. Both are proprietary and both came from
DO NOT BELIEVE that
People who use both Word and WordPerfect will usually let you know in
no uncertain terms that the big problem with the former is its lack of the
"Reveal Codes" functionality. This means you cannot really see
the structure of your document, and some of the hidden codes may make it
unreadable in other word processing or translation programs. This
functionality is made possible in WordPerfect because the foundation of
their file format is solidly grounded in SGML (Standard Generalized Markup
Language, the parent of
Finally, there is a new feature promised for the version of Word that comes with Office 2003. It's called Information Rights [sic] Management, and because the DMCA can likely block attempts to reverse-engineer the system, it could spell the end of non-Microsoft packages (WordPerfect, StarOffice, OpenOffice, etc.) being able to read such documents. This is clearly engineered to cut out competition and interoperability.
I'm biased. I have believed for years that WordPerfect (WP) is a superior product to Word. Many of my colleagues who have used both concur. As alluded to above, WordPerfect has two good features going for it:
This word processor has some added advantages. Its ability to render complex formulae, while still second fiddle to the wonderful (and free) TeX and LaTeX markup languages, has made it a favourite of many Scientists. And it does possess some crude typesetting features, such as the ability to manually insert ligatures (e.g., for ff, fi, fl, ffi, and ffl) and do hand-crafted kerning (minute adjustment of the space between characters; not to be confused with "leading", the space between lines, named for the lead strips printers used to use). And finally, it is not restricted to the Windows platform, and has been available for Linux since before 1997.
Alas, the temporary (and rather costly) purchase of a minority stake in Corel by Microsoft in the early 2000s had the unfortunate side effect of curtailing the availability of WordPerfect for Linux. Corel dumped its Linux OS business, and let the Linux version of WordPerfect die a sad death shortly afterwards. So while you can still find old versions of WP Office 2000 for Linux, getting these to work on modern Linux distributions is problematic: Corel used their own tweaked version of Wine (the WINdows Emulator for Linux), one that is incompatible with modern versions, including the Crossover version. One can hope, now that Corel is no longer under the thumb of Microsoft, that they may yet resurrect this useful package for Linux users, but other than a brief "test" re-release of WP version 8 for Linux (one that has stability problems in modern environments, in my experience) they have shown no interest thus far in doing so. At this point, despite the compelling "Reveal Codes" function, I've given up on Corel and now use OpenOffice instead. Read on...
When I originally wrote this document, this package was very much the new kid on the block. That's certainly not the case anymore, and I now use OpenOffice extensively, even to the point of making it my de facto standard as a word processor. StarOffice was purchased years ago by Sun Microsystems, and OpenOffice.org is the Open Source (free) project on which releases of StarOffice are based. This is similar to the relationship between Netscape/AOL and Mozilla.org for web browsers. This is very good as it means quality and stability steadily improve due to the hard work and wide scrutiny inherent in Open Source development.
Both StarOffice and OpenOffice have several compelling advantages:
Both StarOffice and OpenOffice provide Unix users (Solaris, Linux,
etc.) with a much-needed ability to convert those annoying
Word and Excel and Powerpoint files that their colleagues occasionally
(usually unthinkingly) send their way. The ability to import and export
My experience starting with StarOffice (5.2) and now OpenOffice (1.0.1, 1.1, and now 2.0) has been good, and is improving. They are successful at importing and rendering MS files almost all the time. At one point in the very distant past, I remember StarOffice 5.2 crashing on a particularly annoying file. My unconfirmed suspicion is that there was a self-inconsistency in the file, or some proprietary unpublished "extension" that causes such crashes. Of course to us Linux/Solaris users, if the application crashes, it will almost always leave our windowing system (X11) running smoothly, and it NEVER causes the operating system to crash!
Having said that, it's not all roses with OpenOffice/StarOffice. The
ease-of-use factor, when compared to say WordPerfect, is not quite what
it should be. An oft-quoted example of this is how one changes the
starting page number of a document (e.g., you've got a
particular chapter of a book in a standalone file). In WordPerfect,
it's just a few clicks to get to a dialog box where you can set this
parameter. Heck, even in TeX or LaTeX, setting
Windows users should really have StarOffice or OpenOffice installed in addition to their bundled office suite, so that they will be able to exchange documents with their Linux and Unix based colleagues. It's far preferable to have the sender of a document make sure it looks right in OpenOffice before sending; that way if there are any glitches in the conversion, they can be caught before they become a problem.
The native format for OpenOffice (and I believe StarOffice as of
version 6 or 7) is a
Most modern Linux distributions come with several other word processing (and office suite) alternatives, including:
AbiWord is a full-fledged word processor, as is kword (part of koffice). Both are fairly capable, though definitely second fiddle to OpenOffice or StarOffice. AbiWord has the advantage of having a WordPerfect import filter, though it's not perfect (OpenOffice, as of the time of writing, doesn't yet have a WordPerfect import filter, but there is one in the works). If nothing else, it's good to have them available for problematic documents.
The really nice feature about PostScript is that it is a public
standard. Adobe owns it, but they put all their cards on the table by
producing a book (the "red" one) detailing every last nuance and function
of the file format. PostScript is actually a markup language for
printers; you can examine a
PostScript and its clone, GhostScript, have become a de facto industry standard in the Unix and Macintosh worlds. You want a printer for a Unix system? Get a PostScript printer. This applies to Solaris, Linux, Compaq Alpha, HP, SGI, and the *BSD systems. It's almost universal. Even home Linux users (like yours truly) go this route; the Lexmark Optra inkjet series is PostScript capable.
However, the Windows world has never really adopted PostScript. You
can get various tools from Adobe to deal with the format (these will cost
you money), and Open Source tools such as gsview and GhostScript to view
The only "gotcha" with PostScript is that some older printers and
previewers cannot handle the more recent versions of PostScript (usually
version 3); however, this is increasingly rare as a problem with modern
systems and printers, and most tools will allow you to set the version of
Finally, an interesting twist: because ghostscript is so successful,
it is used on Linux systems as the back-end of the printing driver,
converting the input -- which is usually PostScript -- to the native
format for the printer. Often this will be an HP (PCL) or Epson (ESC/P 2)
printer. However, watch out for printers that rely on proprietary
drivers that are only available on Windows. These vendors are shooting
themselves in their respective feet by not providing a driver (even a
binary one) for Linux. The good news is that the tech people at Epson
and HP are most definitely Linux-aware, and work with the appropriate
Open Source groups. HP has even released a (binary) copy of its
As well as the popular Acrobat Reader, Adobe makes (non-free) software
such as Distiller, InDesign and other tools that allow you to generate
Recent versions of
OpenOffice and its commercial StarOffice cousin have a very nice Print to PDF function that produces excellent PDF documents.
Historical Note from early 2002: The Acrobat 5
TeX and LaTeX (pronounced "tech" and "lah-tech") are document markup languages for real typesetting. Most word processors do not produce typeset quality output; their kerning is usually non-optimal, their hyphenation algorithms are crude, and their treatment of ligatures is, well, ... they generally don't use them.
By comparison, TeX was written from the ground up by a
Stanford Professor of Computer Science called Donald Knuth because he wasn't
satisfied with the existing process (back in the late 1970s) for
publishing Mathematics books. The TeX language is
extremely powerful and rich, but as with all such languages, it can be
very complex. It's markup so you end up seeing
TeX is available for most any system. It comes with
most Linux installations, is available for most other Unix systems, and is
even available on Windows and Mac systems (e.g.,
Both TeX and LaTeX, when run on a
source document, will produce a DeVice-Independent or
Scientists have been using TeX and LaTeX in academic publishing for over two decades now, and there is little sign that this trend will stop anytime soon. The quality, flexibility and superiority of the languages for markup of complex mathematical formulae is without equal anywhere else. Nothing can do as good a job. Plus, its hyphenation algorithm is only now -- after 17 years -- being approached or equalled by costly "professional" publication packages such as Adobe InDesign. And the main TeX and LaTeX systems themselves are free for the taking.
The TeX/LaTeX markup formats are the ideal document interchange medium for Scientists collaborating on publications or papers; they are plain text files and very amenable to transmission via e-mail, and compress nicely with freely available software such as GNU zip.
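As a rough illustration of how well such plain-text sources compress, here is a minimal sketch in Python using the standard gzip module. The function name compress_ratio and the sample document are my own inventions for illustration; the sample is deliberately repetitive, so real papers will compress less dramatically.

```python
import gzip


def compress_ratio(source: str) -> float:
    """Return compressed size divided by original size for a text document."""
    raw = source.encode("utf-8")
    packed = gzip.compress(raw)
    return len(packed) / len(raw)


# A made-up, highly repetitive LaTeX source used only as test input.
sample = (
    "\\documentclass{article}\n"
    "\\begin{document}\n"
    + "The quick brown fox jumps over the lazy dog.\n" * 200
    + "\\end{document}\n"
)

if __name__ == "__main__":
    print(f"compressed to {compress_ratio(sample):.1%} of original size")
```

The same effect can be had from the command line by running gzip on the .tex file before attaching it to an e-mail.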
There is a set of utilities that comes bundled with the version of
TeX and LaTeX on most Linux systems,
which enable creation of
Everyone knows that
WordPerfect has an "Internet Publisher" option in the
"File" menu that allows you to create a
Microsoft Word, on the other hand, while it can produce
However, that said, a
Once upon a time, before there was Word, StarOffice, OpenOffice,
WordPerfect, or even WordStar, people communicated with each other, sent
documents, and even printed out memos, reports, and more, in (wait for
And, while it may be stating the obvious, plain text is by far the most easily interchanged format there is. It's also the most compact (no space taken up with hidden codes) and can be viewed on ALL platforms. Plus, it's printable on just about every known printer on the planet. And it is trivial to use in e-mail; just type it in or include it.
Almost every word processor, text editor, clipboard program, etc. can
handle simple ASCII text. The only gotcha I can think of is the issue
of whether lines end with just a newline (linefeed, the Unix way) or a
carriage return and newline (CR/LF). These days most Unix text editors
like emacs can transparently handle the two types of line termination so
it's really not much of an issue. I have perl filters if anyone needs
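For anyone who wants to do the conversion by hand, here is a minimal sketch in Python (this is not the perl filters mentioned above; the function names to_unix and to_dos are mine, chosen for illustration). It works on bytes so that no character encoding needs to be assumed.

```python
def to_unix(data: bytes) -> bytes:
    """Convert CR/LF (and any stray lone CR) line endings to a bare LF."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def to_dos(data: bytes) -> bytes:
    """Convert any line endings to CR/LF.

    Normalizing to LF first avoids turning an existing CR/LF into CR/CR/LF.
    """
    return to_unix(data).replace(b"\n", b"\r\n")
```

Read the file in binary mode (open(path, "rb")), pass the contents through one of these, and write the result back out in binary mode so Python does not translate the line endings a second time.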
So what conclusions can one draw from this highly subjective, if (hopefully) informative, diatribe? =:->
The best formats for document interchange will depend on your target audience and what requirements the document has. In order of decreasing preference (best first), my choice would be:
So there you have it. I hope some of this turns out to be useful.
Constructive suggestions for changes/enhancements are welcome. Flames?
Well, that's your choice but I'll likely send them to
Like the style this was written in? Here's the