From nobody Tue Jun 2 14:13:06 1998 Path: newsfeed.cv.nrao.edu!newsgate.duke.edu!nntprelay.mathworks.com!news.idt.net!netnews.com!feed2.news.erols.com!erols!news.mindspring.net!news.mindspring.com!not-for-mail From: Daren Scot Wilson Newsgroups: sci.data.formats Subject: Advice on designing data file formats? Date: Mon, 01 Jun 1998 21:17:38 +0000 Organization: MindSpring Enterprises Lines: 24 Message-ID: <35731A72.26347540@pipeline.com> NNTP-Posting-Host: ip187.ann-arbor.mi.pub-ip.psi.net Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Server-Date: 2 Jun 1998 01:13:39 GMT X-Mailer: Mozilla 3.01 (X11; I; Linux 2.0.32 i586) Xref: newsfeed.cv.nrao.edu sci.data.formats:296 We are considering defining or adopting a "universal" data file format - something we can have our many experimental groups use, to allow re-use of software tools, allow engineers to carry their know-how with them between groups, etc. One possibility is HDF, but no one there knows about it, except me, and I'm no expert - just barely know it exists. I know more about FITS, used at NASA/JPL, but it's oriented mostly toward images. Both HDF and FITS would be overkill for us, we suspect, being far too general. Our data is almost all time series, with 1-300 channels of data. Are there simpler data formats we can make use of? If we design our own, is there a collection of street-smarts and accumulated wisdom to aid us? Any horror stories about what not to do? -- Daren Scot Wilson Member, ACM darenw@pipeline.com www.newcolor.com --------------------------- Find people of great energy and circulate it... 
-- michael crye From nobody Tue Jun 2 14:14:35 1998 Path: newsfeed.cv.nrao.edu!newsgate.duke.edu!nntp-out.monmouth.com!newspeer.monmouth.com!cpk-news-hub1.bbnplanet.com!news.bbnplanet.com!newsfeed.atl.bellsouth.net!news.acsu.buffalo.edu!srv1.drenet.dnd.ca!gee From: gee@mozart.dciem.dnd.ca (Thomas Gee) Newsgroups: sci.data.formats Subject: Re: Advice on designing data file formats? Date: 2 Jun 1998 09:27:47 GMT Organization: DCIEM Lines: 71 Message-ID: References: <35731A72.26347540@pipeline.com> NNTP-Posting-Host: 131.136.69.18 X-Newsreader: slrn (0.9.4.6 UNIX) Xref: newsfeed.cv.nrao.edu sci.data.formats:297 On Mon, 01 Jun 1998, Daren Scot Wilson wrote: >We are considering defining or adopting a "universal" data file format - >something we can have our many experimental groups use, to allow re-use >of software tools, allow engineers to carry their know-how with them >between groups, etc. > >One possibility is HDF, but no one there knows about it, except me, and >I'm no expert - just barely know it exists. I know more about FITS, >used at NASA/JPL, but it's oriented mostly toward images. Both HDF and >FITS would be overkill for us, we suspect, being far too general. Our >data is almost all time series, with 1-300 channels of data. > >Are there simpler data formats we can make use of? If we design our >own, is there a collection of street-smarts and accumulated wisdom to >aid us? Any horror stories about what not to do? Ah, the quest for the holy grail! I know it well. We looked quite seriously at HDF, and while I was quite impressed with its capabilities, I found the API to be excruciating. Expressive power comes at a price. There is a new Java interface that may make life simpler (if you are doing Java development). You need to look closely at what sort of price you are willing to pay when it comes to implementing a file format. The simplest and most flexible format is probably some sort of delimited text file. However, this is also the most bulky. 
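To make the delimited-text option concrete, here is a minimal sketch in Python (the channel names and two-channel example are invented for illustration; this is not any poster's actual format):

```python
import csv
import io

def write_series(f, channels, rows):
    # Header row names the channels; each following row is one sample:
    # time stamp first, then one value per channel.
    w = csv.writer(f, delimiter="\t")
    w.writerow(["time"] + channels)
    for t, values in rows:
        w.writerow([t] + list(values))

def read_series(f):
    # Returns the channel names and the samples as (time, [values]) pairs.
    r = csv.reader(f, delimiter="\t")
    header = next(r)
    samples = [(float(row[0]), [float(v) for v in row[1:]]) for row in r]
    return header[1:], samples

buf = io.StringIO()
write_series(buf, ["ch1", "ch2"], [(0.0, [1.5, 2.5]), (0.1, [1.6, 2.4])])
buf.seek(0)
channels, rows = read_series(buf)
```

The bulk comes from every value being spelled out in ASCII, but any tool that reads tab-separated text can consume the result.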
If you want to save size you will need a binary format, but this can add platform dependencies (e.g. big endian vs. little endian). To avoid such dependencies, you will need an independent format (e.g. XDR), but this requires conversion during both reading and writing, resulting in increased processing time. If you keep your format monolithic (that is, just a single logical chunk of data), you will end up with zillions of separate files being produced during analysis. If your platform cannot handle a large number of data files (like most PCs) [yes, I'm a Unix bigot], you will need a format that integrates independent chunks of data (like the HDF Vset), but this will greatly complicate the API. Currently, my shop supports about seven different file formats, most of which are proprietary to particular software packages. For each format we have a filter to export data to a simple text representation, and one to import text data into the format. Thus we can freely convert from one to another. However, the cost is having to remember how to use fourteen different conversion programs, and having to store and manipulate an enormous number of data files during analysis. (For example, a recent experiment and analysis involved acquiring and generating 10+ GB of data divided into hundreds of thousands of files.) Our new approach (currently under way) is to create a new standard based on the Zip archive format. Data is stored in individual text files with simple delimiters. These files are compressed and stored together in archives. Although writing is expensive (due to in-stream compression), the resulting files are very efficient in terms of size, and very flexible in terms of exporting to other programs. This format conforms to the Java Archive (jar) standard, and as most of our new software is developed in Java this makes the API not much more difficult than open(), read(), write() and close().
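Gee's shop does this in Java against the jar standard; the same idea can be sketched with Python's standard zipfile module (the entry names here are invented for illustration):

```python
import io
import zipfile

def save_run(f, tables):
    # One compressed entry per logical chunk of data, so a whole
    # experiment travels as a single archive instead of zillions of files.
    with zipfile.ZipFile(f, "w", zipfile.ZIP_DEFLATED) as z:
        for name, text in tables.items():
            z.writestr(name + ".tsv", text)

def load_table(f, name):
    # Pull one delimited-text table back out by name.
    with zipfile.ZipFile(f, "r") as z:
        return z.read(name + ".tsv").decode("utf-8")

buf = io.BytesIO()
save_run(buf, {"trial1": "time\tch1\n0.0\t1.5\n",
               "trial2": "time\tch1\n0.0\t9.9\n"})
recovered = load_table(buf, "trial2")
```

Compression pays back most of the bulk of the text representation, and any standard unzip tool can still extract the individual tables for other programs.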
I'd love to have a discussion here on the Net about file format design and choosing. Surely there is a great deal of collective wisdom we can pool! So, comments? Other insights? Tom --- gee@dciem.dnd.ca SAILSS Software Development Team DCIEM / Department of National Defence / Canada From nobody Tue Jun 2 14:15:06 1998 Path: newsfeed.cv.nrao.edu!newsgate.duke.edu!nntprelay.mathworks.com!peerfeed.ncal.verio.net!news.he.net!news.pagesat.net!decwrl!purdue!yuma!usenet From: "Tom Hilinski" Newsgroups: sci.data.formats Subject: Re: Advice on designing data file formats? Date: Tue, 2 Jun 1998 09:33:11 -0600 Organization: Colorado State University, Fort Collins, CO 80523 Lines: 38 Message-ID: <6l1606$gai@yuma.ACNS.ColoState.EDU> References: <35731A72.26347540@pipeline.com> NNTP-Posting-Host: ts3108.slip.colostate.edu Mime-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit X-Newsreader: Microsoft Outlook Express 4.72.3110.5 X-MimeOLE: Produced By Microsoft MimeOLE V4.72.3110.3 Xref: newsfeed.cv.nrao.edu sci.data.formats:300 I highly recommend checking into the NetCDF file format put out by NCAR (National Center for Atm. Research). I'm using it extensively on two software projects, and the C++ interface is a breeze. There are also C and Fortran APIs. The files are platform-independent, and the format is oriented towards self-described numeric data sets. The software for the API is free and easy to build. I'm using it on Win32 and Unix w/out incident. See: http://www.unidata.ucar.edu/packages/netcdf/ Daren Scot Wilson wrote in message <35731A72.26347540@pipeline.com>... >We are considering defining or adopting a "universal" data file format - >something we can have our many experimental groups use, to allow re-use >of software tools, allow engineers to carry their know-how with them >between groups, etc. > >One possibility is HDF, but no one there knows about it, except me, and >I'm no expert - just barely know it exists. 
I know more about FITS, >used at NASA/JPL, but it's oriented mostly toward images. Both HDF and >FITS would be overkill for us, we suspect, being far too general. Our >data is almost all time series, with 1-300 channels of data. > >Are there simpler data formats we can make use of? If we design our >own, is there a collection of street-smarts and accumulated wisdom to >aid us? Any horror stories about what not to do? > >-- > >Daren Scot Wilson >Member, ACM >darenw@pipeline.com >www.newcolor.com >--------------------------- > Find people of great energy and circulate it... > -- michael crye From nobody Tue Jun 2 14:19:38 1998 Path: newsfeed.cv.nrao.edu!not-for-mail From: Don Wells Newsgroups: sci.data.formats Subject: Re: Advice on designing data file formats? Date: 02 Jun 1998 10:55:57 -0400 Organization: nrao Lines: 82 Message-ID: References: <35731A72.26347540@pipeline.com> Reply-To: dwells@nrao.edu NNTP-Posting-Host: fits.cv.nrao.edu X-Attribution: Up X-No-Archive: yes X-BOFH-Archive: yes X-Newsreader: Gnus v5.5/Emacs 20.2 Xref: newsfeed.cv.nrao.edu sci.data.formats:298 >>>>> "Daren" == Daren Scot Wilson writes: Daren> We are considering defining or adopting a "universal" data Daren> file format - something we can have our many experimental Daren> groups use, to allow re-use of software tools, allow Daren> engineers to carry their know-how with them between groups, Daren> etc. Daren> One possibility is HDF, but no one there knows about it, Daren> except me, and I'm no expert - just barely know it exists. Daren> I know more about FITS, used at NASA/JPL, but it's oriented Daren> mostly toward images. Both HDF and FITS would be overkill Daren> for us, we suspect, being far too general. Our data is Daren> almost all time series, with 1-300 channels of data.. 'Time series' data are an example of _tables_, or _relations_ (as in relational databases). Both HDF and FITS are fine formats for tables. 
I am not aware of any other formats which deserve to be mentioned in the same breath with HDF and FITS for such applications. (That bold assertion is a challenge which I hope will smoke out any competitors so that I can broaden my knowledge.) Your characterization of FITS as being oriented toward images is misleading. It is true that the original motivation for FITS in the 70s was interchange of digital imagery and that imagery still dominates the dataflow when counted by total bits. However, the FITS community worked hard in the 80s to extend FITS to support tabular data, especially for our high-energy and interferometric applications. Large quantities of tabular information are recorded in astronomy today using FITS, using the 'BINTABLE' (binary table) "extension" format. BINTABLE encodes self-describing arrays of arbitrary structures, in which elements of the structures can be not only simple scalars and strings (for conventional tables) but also N-dimensional matrices (the original prototype of BINTABLE was named 'A3DTABLE'). BINTABLE has a mechanism for efficient support of variable-length fields. A wide range of even more sophisticated hierarchical and relational-database data situations can be represented by layering conventions on top of the basic BINTABLE design, and subsets of astronomy have reached agreements on such conventions. A state-of-the-art example of a design document for tabular data formatting in astronomy is at: http://hea-www.harvard.edu/~arots/asc/fits/ascfits.ps A key question to ask is whether your datasets have archival value (in general, astronomy datasets must be presumed to have archival value until proven otherwise). If you want your bits to survive in usable form for more than about a decade, you should use a format which is designed to survive. Even better, use a format which has _proven_ that it can survive changes in computing environments over long periods of time.
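For a sense of what "self-describing from the bitstream upward" means in FITS, here is a toy sketch of its plain-ASCII header: 80-character "KEYWORD = value" cards, terminated by an END card and blank-padded to the 2880-byte FITS logical record size. It covers only the simplest fixed-format cards and is no substitute for a real FITS library:

```python
def fits_card(keyword, value):
    # One 80-character header card: keyword in columns 1-8, '= ' in 9-10,
    # value right-justified to column 30 (fixed-format cards only).
    if isinstance(value, bool):
        body = "%-8s= %20s" % (keyword, "T" if value else "F")
    elif isinstance(value, str):
        body = "%-8s= '%s'" % (keyword, value)  # string opens in column 11
    else:
        body = "%-8s= %20s" % (keyword, value)
    return body.ljust(80)

def fits_header(cards):
    # Cards plus the terminating END card, blank-padded to a multiple
    # of the 2880-byte FITS logical record size.
    header = "".join(cards) + "END".ljust(80)
    return header + " " * ((-len(header)) % 2880)

h = fits_header([fits_card("SIMPLE", True),
                 fits_card("BITPIX", 16),
                 fits_card("NAXIS", 0)])
```

Because the header is ordinary ASCII with the layout fixed at the byte level, a programmer decades from now can read it with nothing more than a text dump, which is exactly the archival property Wells describes.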
There are very few data formats which have been designed ab initio to survive and which can also claim that their design is proven---FITS is one of that small set. The oldest FITS files [see http://www.cv.nrao.edu/fits/data/tests/incunabula/ ] are now 19 years old; they were produced by computer environments, such as 60-bit ones-complement machines with 6-bit character sets, which are long since deceased, but the FITS bitstreams are still readable by modern software. An archival format should be an open design, should be completely described from the bitstream upward (an Application Programming Interface [API] definition is not sufficient), and the description should be published in journals which are themselves archival. The format should not depend on assumptions that current [ephemeral!] 'standards' like Unix or XDR will survive. The bitstreams of the format should contain sufficient descriptive text to enable a future programmer to recognize the name of the format and to find the description in the journal. Although the FITS community continues to agree that astronomy's archival format must be defined at the bitstream level, not by an API, our community does in fact have a de facto standard API: http://legacy.gsfc.nasa.gov/docs/software/fitsio/fitsio.html I recommend that you use FITSIO/cfitsio for your applications. -Don -- Donald C. Wells Associate Scientist dwells@nrao.edu http://www.cv.nrao.edu/~dwells National Radio Astronomy Observatory +1-804-296-0277 520 Edgemont Road, Charlottesville, Virginia 22903-2475 USA From nobody Tue Jun 2 16:49:02 1998 Path: newsfeed.cv.nrao.edu!newsgate.duke.edu!nntprelay.mathworks.com!portc04.blue.aol.com!audrey01.news.aol.com!not-for-mail From: tonydortm@aol.com (Tony dortm) Newsgroups: sci.data.formats Subject: Re: Advice on designing data file formats? 
Date: 2 Jun 1998 20:41:56 GMT Lines: 30 Message-ID: <1998060220415600.QAA06385@ladder01.news.aol.com> NNTP-Posting-Host: ladder01.news.aol.com X-Admin: news@aol.com References: <35731A72.26347540@pipeline.com> Organization: AOL Bertelsmann Online GmbH & Co. KG http://www.germany.aol.com X-Newsreader: AOL Offline Reader Xref: newsfeed.cv.nrao.edu sci.data.formats:301 Greetings! The spectroscopy world uses a very simple file format originally based on XY-type data tables and now being extended for multidimensional data sets. As the holy grail is long-term storage and readability, the format is ASCII-based, but there are built-in algorithms which produce files that are generally smaller than the original binary files without any loss of digital resolution. These 'JCAMP-DX' file protocols as implemented for Infrared, Mass Spec, Chemical Structure and NMR data types are available from: http://www.isas-dortmund.de/jcamp An extension for NMR is being worked on and will be ready by the end of June, as is a new protocol for Ion Mobility Spectrometry. For more information, mail me directly: davies@isas-dortmund.de (Chairman, IUPAC-CPEP Working Party on Spectroscopic Data Standards) PS Our experience with netCDF has been very painful, but we are not Unix-based. There are plenty of utilities if you are Unix-based, but the netCDF implementations by the Analytical Instrument Association are "difficult" if you do not have extensive 'experience' and 'resources' available. NIST have apparently managed to make it run, but by implementing parts of a later 'version' than the official release. The most important decision to be made is not really the method of storing to disk but defining the experiment so that the so-called 'data dictionary' - the list of keywords and essential extra data - is complete. Hope this is of some help.
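The actual JCAMP-DX encodings are specified in the protocol documents above; the general idea of shrinking ASCII numeric tables without losing digital resolution can be sketched as delta encoding at a fixed resolution (the scale factor and the space-separated output here are an arbitrary illustration, not JCAMP-DX syntax):

```python
def delta_encode(values, scale=1000):
    # Round to integers at the chosen resolution, then write the first
    # value followed by successive differences; smooth traces give small
    # numbers, hence short ASCII, with no loss at that resolution.
    ints = [round(v * scale) for v in values]
    diffs = [ints[0]] + [b - a for a, b in zip(ints, ints[1:])]
    return " ".join(str(d) for d in diffs)

def delta_decode(text, scale=1000):
    # Re-accumulate the differences and scale back to physical values.
    total, out = 0, []
    for d in (int(p) for p in text.split()):
        total += d
        out.append(total / scale)
    return out

line = delta_encode([12.345, 12.347, 12.344])
roundtrip = delta_decode(line)
```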
Tony Davies From nobody Tue Jun 2 18:16:04 1998 Path: newsfeed.cv.nrao.edu!newsgate.duke.edu!nntprelay.mathworks.com!cam-news-hub1.bbnplanet.com!denver-news-feed1.bbnplanet.com!news.bbnplanet.com!coop.net!csn!nntp-xfer-2.csn.net!yuma!usenet From: "Tom Hilinski" Newsgroups: sci.data.formats Subject: Re: Advice on designing data file formats? Date: Tue, 2 Jun 1998 15:29:35 -0600 Organization: Colorado State University, Fort Collins, CO 80523 Lines: 27 Message-ID: <6l1qtk$6qu@yuma.ACNS.ColoState.EDU> References: <35731A72.26347540@pipeline.com> <1998060220415600.QAA06385@ladder01.news.aol.com> NNTP-Posting-Host: ts5110.slip.colostate.edu Mime-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit X-Newsreader: Microsoft Outlook Express 4.72.3110.5 X-MimeOLE: Produced By Microsoft MimeOLE V4.72.3110.3 Xref: newsfeed.cv.nrao.edu sci.data.formats:303 Tony dortm wrote in message <1998060220415600.QAA06385@ladder01.news.aol.com>... >Greetings! > >PS Our experience with netCDF has been very painful but we are not Unix based. >There are plenty of utilties is you are unix based but the netCDF >implementations by the Analytical Instrument Association are "difficult" if you >do not have extensive 'experience' and 'resources' available. NIST have >apparently managed to make it run but by implementing parts a later 'version' >than the official release. Don't know about the "implementations by the Analytical Instrument Association" but the latest versions of netCDF (3.4 as of this date) I find pretty painless. True, the C and Fortran calls are more numerous than the C++ calls, but in C++ it is quick and easy. Like any new format, there is a learning curve, but I think an average programmer could get going on this fairly well after a day of reading and playing. Also, HDF (mentioned in a previous post) understands netCDF. 
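For reference, a small multichannel time-series schema in netCDF's CDL text notation might look like the following (the names, dimension sizes, and units are invented; the ncgen utility turns such a description into a netCDF file):

```
netcdf timeseries {
dimensions:
        time = UNLIMITED ;
        channel = 4 ;
variables:
        double time(time) ;
                time:units = "seconds since start of run" ;
        float data(time, channel) ;
                data:long_name = "recorded channel values" ;
}
```

The schema travels inside the file itself, which is what makes netCDF self-describing.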
From nobody Tue Jun 9 10:58:01 1998 Path: newsfeed.cv.nrao.edu!newsgate.duke.edu!nntprelay.mathworks.com!news.maxwell.syr.edu!news-peer.sprintlink.net!news-backup-east.sprintlink.net!news.sprintlink.net!204.248.219.34!news.cioe.com!newsfeed.cioe.com!news.smithville.net!usenet From: Robb Matzke Newsgroups: sci.data.formats Subject: Re: Need to analyze different Data Formats Date: 08 Jun 1998 23:47:21 -0400 Organization: Spizella Software Lines: 15 Sender: robb@arborea.spizella.com Message-ID: References: Reply-To: matzke@llnl.gov NNTP-Posting-Host: pp236.smithville.net X-Newsreader: Gnus v5.5/XEmacs 20.4 - "Emerald" Xref: newsfeed.cv.nrao.edu sci.data.formats:308 Matin S Abdullah writes: > I need to analyze different "Universal" Data formats for our projects > Can any one point me to any page where I can start. The hierarchical data format current version: http://hdf.ncsa.uiuc.edu/ and next version: http://hdf.ncsa.uiuc.edu/nra/BigHDF/ -- Robb Matzke L-170 #include Lawrence Livermore National Laboratory Voice: +1 812 863 7211 Livermore, CA 94550 Fax: +1 812 863 7211 From nobody Sat Jun 13 03:17:38 1998 Path: newsfeed.cv.nrao.edu!newsgate.duke.edu!nntprelay.mathworks.com!newsfeed.internetmci.com!199.0.154.208!ais.net!ais.net!jamie!news.indiana.edu!mcmcnews.er.usgs.gov!news From: "Charley Hickman" Newsgroups: sci.data.formats Subject: Re: Need to analyze different Data Formats Date: 12 Jun 1998 20:59:21 GMT Organization: U.S. Geological Survey -- Rolla, Missouri Lines: 44 Message-ID: <01bd9644$f848b920$6ea02f90@mcpc0110.er.usgs.gov> References: NNTP-Posting-Host: mcpc0110.er.usgs.gov X-Newsreader: Microsoft Internet News 4.70.1155 Xref: newsfeed.cv.nrao.edu sci.data.formats:311 Matin S Abdullah wrote > Hi > I need to analyze different "Universal" Data formats for our projects > > The craiterion is > > 1) What advantage it gives me over flat files > 2) What is the time and space performance. > 3) How much is it supported in the scientific community. 
> > > Can any one point me to any page where I can start. > > > Matin > Computer Science Department > University of Houston Greetings, The U.S. Federal Geographic Data Committee (FGDC) has a Spatial Data Transfer Standard (SDTS), which is FGDC-STD 002, FIPS 173, and draft ANSI NCITS 320:1998. It accommodates spatial data in vector and raster forms. The SDTS web page is http://mcmcweb.er.usgs.gov/sdts/ All current SDTS implementations use ISO/IEC 8211:1994, which is a file-level, self-describing data format used for more than just spatial data. The ISO 8211 page is http://www.iso.ch/cate/d20305.html One of the extensions to the SDTS Raster Profile supports ISO/IEC 12087-5 Basic Image Interchange Format (BIIF), which is the successor to some older NATO and NRO imagery formats (NSIF, NITF). I believe that there is a BIIF web page, but I can't find the URL right now. Another extension to the SDTS Raster Profile supports GeoTIFF and TIFF, which are common industry standards. HTH, Charley Hickman U.S. Geological Survey - Rolla, Missouri, USA chickman@usgs.gov http://mapping.usgs.gov/ From nobody Mon Jun 15 11:48:31 1998 Path: newsfeed.cv.nrao.edu!newsgate.duke.edu!nntprelay.mathworks.com!newsfeed.internetmci.com!193.174.75.110!news-was.dfn.de!newsjunkie.ans.net!newsfeeds.ans.net!news.chips.ibm.com!mdnews.btv.ibm.com!lloydt From: lloydt@watson.ibm.com Newsgroups: sci.data.formats Subject: Re: Advice on designing data file formats? Date: 15 Jun 1998 14:58:25 GMT Organization: IBM T.J.
Watson Research Center Lines: 49 Message-ID: <6m3cqh$tsg$1@mdnews.btv.ibm.com> References: <35731A72.26347540@pipeline.com> NNTP-Posting-Host: krell.watson.ibm.com X-Newsreader: xrn 8.01 To: Daren Scot Wilson Originator: lloydt@watson.ibm.com Xref: newsfeed.cv.nrao.edu sci.data.formats:313 You might want to take a look at my survey paper that discusses these and related issues, http://www.research.ibm.com/people/l/lloydt/dm/home.htm It does have pointers to information about several different data formats and APIs that may be useful for you. -- In article <35731A72.26347540@pipeline.com>, Daren Scot Wilson writes: |> We are considering defining or adopting a "universal" data file format - |> something we can have our many experimental groups use, to allow re-use |> of software tools, allow engineers to carry their know-how with them |> between groups, etc. |> |> One possibility is HDF, but no one there knows about it, except me, and |> I'm no expert - just barely know it exists. I know more about FITS, |> used at NASA/JPL, but it's oriented mostly toward images. Both HDF and |> FITS would be overkill for us, we suspect, being far too general. Our |> data is almost all time series, with 1-300 channels of data. |> |> Are there simpler data formats we can make use of? If we design our |> own, is there a collection of street-smarts and accumulated wisdom to |> aid us? Any horror stories about what not to do? |> |> -- |> |> Daren Scot Wilson |> Member, ACM |> darenw@pipeline.com |> www.newcolor.com |> --------------------------- |> Find people of great energy and circulate it... |> -- michael crye ------------------------------------------------------------------------------ Lloyd A. Treinish lloydt@us.ibm.com IBM Thomas J. Watson Research Center 914-784-5038 (voice) Yorktown Heights, NY 10598 914-784-7667 (fax) http://www.research.ibm.com/people/l/lloydt