Do-It-Yourself Data Analysis

Abstract

The common single-dish spectral line and continuum data reduction packages let observers get on with the business of doing science with their chosen instruments. However, there are plenty of reasons why some of us need or simply prefer to use our own software tools to analyze data drawn from intermediate stages in the data stream or from the original data files. This lecture outlines the data paths from the telescope, discusses low-level data formats and how to read them, and offers a few programs in three different programming languages as starter examples. We also describe a number of operations that must be performed on the raw data, such as Van Vleck corrections, power transfer function linearization, and synchronization of data with antenna position. A number of Web resources are given.

Introduction

The scientific capabilities of the Arecibo and Green Bank instruments must be accessible to all astronomers without requiring detailed knowledge of the electronics, software, and observing techniques. Hence, the Observatories supply standard setups and data reduction software and procedures for some of the most commonly used observing modes. However, there are many cases where observers must or prefer to write their own data analysis software. This lecture is intended to present enough information to allow you to access the fundamental telescope data and process it with routines of your own design. Example programs in the IDL, glish/aips++, and Python programming languages are provided as templates.

The scientific software world has changed in rather fundamental ways over the last ten or fifteen years. A major operating system, Linux, is freely available and very widely supported, particularly in academia. Several powerful, general purpose commercial packages, notably IDL and Matlab, are available for data analysis, display, and creation of publication quality illustrations. These are often available as university services. Perhaps most important to astronomers is the expanding array of high level languages, tool kits, and support available on the Internet free of charge. One can now choose from astronomy-specific packages (CLASS, AIPS, AIPS++, IRAF, etc.), scripting languages (Python, glish, Perl, Tcl, Java, etc.), and major tool kits (PGPLOT, Tk, cfitsio, scientific libraries, etc.) without too much concern about reinventing wheels in the process of creating your own data analysis environment. The old debates about the best programming languages and methods (FORTRAN, C, C++, object-oriented programming, etc.) have largely been replaced by choices of the combination of scripting language, display package, high-level mathematical tool kit, and high-level astronomical package(s) that best suits your scientific needs. You should expect to continually change the mix of your analysis environment as new and better selections become available.

Not every astronomer nor even a majority of astronomers should be doing a lot of programming outside of the astronomical data reduction packages, but the science that you do and the observing techniques that you use should not be confined by the scope of these packages. Observational astronomy depends on innovation and improvements to existing techniques. If you do choose to assemble your own data analysis environment from a range of high and low level pieces, keep in mind that you are not trying to rewrite a complete analysis package. Writing software routines for specific scientific objectives is a much more limited task. If any of your inventions are of use to others, they can be cleaned up, documented, and "published" later on the Web.

One can write new analysis procedures at many levels from scripts within astronomical packages to low-level routines beginning with raw telescope data. Since many observers do not see the unprocessed data or realize that it is accessible without great effort, this lecture will concentrate on this lower data level. Because I know the GBT best I will draw most of my examples from this telescope.

Data Paths from the Telescope

Figure 1 shows an outline of the flow of data from the GBT IF electronics outputs to the data analysis software. There is a mix of data processing electronics provided by NRAO and university groups, and the data produced by the NRAO devices may be analyzed entirely by NRAO software or by more specialized software supplied by observers.

Figure 1. Data flow and analysis software at the GBT. Rounded boxes are components supplied by the user community.

Data formats written by the signal processing "back-ends" depend on how widely accessible the data are intended to be. It is generally easier and sometimes significantly faster to write data in binary format close to the native format produced by the electronics. However, these data require detailed external documentation and specialized software routines to read them. If the data are used by only one or two groups, a binary format is usually simple and efficient. We'll say a little more about binary formats later.

Data that are intended to be read by more than one or two programmers are generally best written in a standard, self-documenting format. In the astronomical world this is Flexible Image Transport System (FITS) or, more specifically, FITS Binary Table format. The original FITS was pure text data so you could read the whole file with a text reader or any program that reads ASCII (American Standard Code for Information Interchange) text. However, large numeric data sets required too much extra storage space and a lot of processing power to convert between ASCII and native machine binary formats so a hybrid ASCII/binary format standard was developed. The self-documentation is in ASCII headers, and the data arrays are recorded as binary integers or floating point numbers. This is probably the most common FITS format that you will see at various observatories.

Ideally, every bit of information about your astronomical observation (antenna position, bandwidth settings, cal state, etc.) is stored with the signal processor output so that you have all of the information necessary to interpret your data without need for a manually-written observing log. In an automated telescope system there are too many parameters to keep track of manually. At the GBT every subsystem (receiver, antenna, spectrometer, etc.) writes its own FITS file for every observing scan. These files are stored in the GBT data directory (check with GBT staff for this directory path) under the subdirectory of your project ID. For example, the antenna information for a scan of project AGBT02A_069 started at 23:58:48 UTC on April 26, 2003 is in the file

[data directory]/AGBT02A_069_01/Antenna/2003_04_26_23:58:48.fits

Please remember that observing data are proprietary for 18 months to the observer under whose program they were recorded, so accessing data other than your own without permission is a breach of professional ethics. The data under project AGBT02A_069 are from a public domain survey, so you are welcome to read any of its data files.

Collation of the information from all of the GBT subsystems is one of the functions of the "Filler" shown in Figure 1. When you use your own software to read these data files you will need to draw from several files for each scan. All files for one scan have the same date and time string in their file names. Work is underway to provide a collation utility that will create one FITS file per scan, but everything you learn here about reading FITS files will still apply.

FITS File Contents

There are a number of FITS file readers, both stand-alone and as functions in different programming languages. We'll discuss these below, but, first, just try reading one of the GBT files in a text window with something like the UNIX 'more' command. You should get something that looks like

$ more /AGBT02A_069_01/Antenna/2003_04_26_23:58:48.fits
SIMPLE  =                    T / file does conform to FITS stan
dard             BITPIX  =                    8 / number of bit
s per data pixel                  NAXIS   =                    
0 / number of data axes                            EXTEND  =   
                 T / FITS dataset may contain extensions       
   -------- etc. --------

The ASCII text does not contain end-of-line characters so the format looks a bit scrambled unless you set your text window to be exactly 80 characters wide. A bit further down in the file you'll find lines that begin to make sense, such as

DATE-OBS= '2003-04-26T23:58:48' / Manager parameter startTime  ...
TIMESYS = 'UTC     '           / time scale specification for D...
TELESCOP= 'NRAO_GBT'           / Green Bank Telescope (Robert C...
OBJECT  = 'LVM160. '           / Manager parameter source      ...
PROJID  = 'AGBT02A_069_01'     / Manager parameter projectId   ...

It's not too hard to write a little C program to print the ASCII header lines while skipping the binary tables. An example can be found in the file list_fits.c. The FITS rules used to write this program are:

1. All header lines ('cards') are 80 characters long.
2. Headers and binary tables occupy integer multiples of 2880 bytes, padded at the ends with blank lines or undefined binary data to make up the length.
3. The last non-blank ASCII header line contains only the END keyword.
4. The dimensions of the useful binary table are specified by the values of the NAXIS and NAXISn keywords.

FITS files are organized in Header Data Units (HDU's), each with an ASCII header and a binary table. The binary table may be omitted in an HDU, as is always the case in the first HDU.
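
Whether or not C is your language, the same logic is easy to sketch in a scripting language. Here is a minimal Python version, not the list_fits.c program itself, that applies the four rules above; it computes the data size from the BITPIX and NAXISn keywords and, for brevity, ignores any heap data indicated by a PCOUNT keyword.

    import sys

    def list_fits_headers(path):
        """Print the ASCII header cards of each HDU, skipping the
        binary data that follows (rules 1-4 above)."""
        with open(path, "rb") as f:
            while True:
                bitpix, dims, end_seen = 0, [], False
                while not end_seen:
                    block = f.read(2880)          # rule 2: 2880-byte blocks
                    if len(block) < 2880:
                        return                    # end of file
                    for i in range(0, 2880, 80):  # rule 1: 80-char cards
                        card = block[i:i + 80].decode("ascii", "replace")
                        key = card[:8].strip()
                        if key == "END":          # rule 3
                            end_seen = True
                            break
                        if card.strip():
                            print(card.rstrip())
                        if key == "BITPIX":
                            bitpix = int(card[10:30])
                        elif key.startswith("NAXIS") and key[5:].isdigit():
                            dims.append(int(card[10:30]))
                # Rule 4: size of the data that follows, then skip it,
                # rounded up to the next 2880-byte boundary.
                nbytes = abs(bitpix) // 8
                for d in dims:
                    nbytes *= d
                if not dims:
                    nbytes = 0                    # NAXIS = 0: no data array
                f.seek((nbytes + 2879) // 2880 * 2880, 1)

    if __name__ == "__main__":
        list_fits_headers(sys.argv[1])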

Whether you are interested in C programming or not, you should try the utility 'fv' for looking at FITS files. Its operation is fairly easy to figure out just by clicking buttons. Type fv at the UNIX prompt followed by the path name of your FITS file. By comparing the header keywords with the table organization in 'fv' you can get a pretty good idea how the table format is specified by the TTYPEn, TFORMn, and TUNITn keywords. If you're interested in the gory details of the FITS standard, have a look at Hanisch et al. (2001) and/or the Web sites [1] and [2].

Of course, there is no need to write your own C code routines to access FITS files. A standard access library, CFITSIO, is freely available at http://heasarc.gsfc.nasa.gov/docs/software/fitsio/fitsio.html. This library has been attached to many of the higher-level scripting languages and data analysis packages, from which it is easier to call the keyword and data access routines.
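
As a taste of the higher-level route, here is a minimal Python sketch using the astropy.io.fits interface (the descendant of the PyFITS package); the column name MNT_AZ is a hypothetical stand-in, so check the TTYPEn keywords or browse the file with 'fv' for the real names.

    from astropy.io import fits

    hdul = fits.open("2003_04_26_23:58:48.fits")
    print(hdul[0].header["DATE-OBS"])  # a keyword from the primary header
    table = hdul[1].data               # the first binary table HDU
    az = table["MNT_AZ"]               # one column as an array (hypothetical name)
    hdul.close()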

Doing Something with the Data

Probably the quickest way to learn how to make use of telescope data is to begin with example scripts for accessing and displaying the contents of different types of data files. There is no "best" high level language for this so we'll offer parallel examples in three languages available in Green Bank: glish, Python, and IDL. These have the tools necessary for most tasks that you might want to apply to your data, but they are not the only software environments that could be used to work directly with telescope data files.

Example code

Instructions and example scripts for working with telescope data FITS files in the glish, Python, and IDL languages are provided as text files; see the GBT telescope FITS file documentation.

Back-End and Telescope Data Notes

The basic data from a radio telescope is simply signal intensity as a function of time, frequency, antenna position and possibly one or more hardware states, such as the position of a beam switch. The method for synchronizing the intensity information with the independent variables varies somewhat from one telescope to the next. On the GBT the antenna position, local oscillator frequency, and signal intensities are sampled asynchronously with time tags so an averaging or interpolation operation is required to determine the antenna position and observing frequency at the mean time of each intensity sample.
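
To make the interpolation step concrete, here is a minimal Python sketch; the arrays are hypothetical stand-ins for the time tags and positions read from the antenna and back-end FITS files.

    import numpy as np

    # Hypothetical stand-ins: antenna time tags and positions from the
    # Antenna FITS file, and the mean times of the intensity samples.
    t_ant  = np.array([0.0, 1.0, 2.0, 3.0])              # seconds in scan
    az     = np.array([180.00, 180.05, 180.10, 180.15])  # degrees
    el     = np.array([45.00, 45.01, 45.02, 45.03])
    t_data = np.array([0.5, 1.5, 2.5])

    az_at_data = np.interp(t_data, t_ant, az)  # position at each mean time
    el_at_data = np.interp(t_data, t_ant, el)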

Hardware states are much more directly synchronized with intensity accumulation in real time: common switching signals send all data from a given combination of switch states to the same accumulator and data from different combinations to different accumulators. For example, if a spectrometer setup is to frequency switch with a two-second period and turn the calibration signal on and off with a synchronous one-second period, you will get four spectra from each IF channel corresponding to the state combinations (cal off, frequency 1), (cal on, frequency 1), (cal off, frequency 2), and (cal on, frequency 2). The state combinations are found in the STATE variables of the STATE header data unit of the back-end and LO FITS files. These correspond to the phase or state index in the DATA arrays.
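
In practice this means the back-end DATA array carries a state (or phase) axis. A hypothetical Python sketch, assuming the DATA array has shape (integrations, states, channels) with the states ordered as in the example above:

    import numpy as np

    # Hypothetical DATA array: 10 integrations, 4 switch states, and 1024
    # channels, with the states ordered as in the STATE HDU (see above).
    data = np.zeros((10, 4, 1024))

    caloff_f1 = data[:, 0, :]   # (cal off, frequency 1)
    calon_f1  = data[:, 1, :]   # (cal on,  frequency 1)
    caloff_f2 = data[:, 2, :]   # (cal off, frequency 2)
    calon_f2  = data[:, 3, :]   # (cal on,  frequency 2)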

The intensity measurements can be simple total power integrations of the full IF passband from the digital continuum receiver (DCR), accumulated spectra from the spectral processor, accumulated autocorrelation functions (ACF) from the spectrometer, or a high-speed stream of IF signal voltage samples from a VLBI, pulsar, or other direct-sampling device. The DCR and spectral processor outputs may be used more or less without further processing to derive system temperatures from the cal-on/cal-off measurements and spectral line or continuum intensities from on-source/off-source values. A few comments about processing direct IF signal voltage samples are given in the section on Processing Direct IF Voltage Samples below.

The output of the autocorrelation spectrometer requires three signal processing operations to turn the recorded ACF's into spectra: quantization corrections (often called van Vleck corrections), linearization, and Fourier Transform. At Arecibo these three operations are performed in real time so that the autocorrelation spectrometer output is corrected spectra that you can use directly. The GBT design choice was to record the raw autocorrelation data to retain maximum flexibility.

At first glance the quantization and linearization corrections appear to be shrouded in the esotericism of integrals of complicated error functions, but the basic ideas are pretty simple. To save on hardware cost and maximize signal processing speed an autocorrelator generally samples the IF voltage with only 2, 3, or 9 levels of quantization. Most of the spectral and intensity information of noise-dominated signals is retained, but distortions are introduced that must be corrected to recover the true autocorrelation and spectrum values.

These corrections are derived by starting with the assumption that the sampler input voltage has a normal probability distribution of amplitudes. From this we can compute the response of a given sampler to a range of known signal autocorrelation values over a range of sampler input levels. We can then fit an analytic function to a chosen accuracy to this two-dimensional array of true-to-measured correlation value ratios as a function of sampler input level and measured correlation values. This function is then the quantization correction function. The same is done to get the ratio of true sampler input power to measured output power as a function of measured output power to get the linearization function. These computations are straightforward but a bit tedious to program so the coefficients can be computed once and packaged as a module to be accessed by a higher level data analysis language, as has been done for the GBT spectrometer. One point to make here is that the quantization and linearization corrections are not necessarily a compute-and-forget black box. The assumptions of normal voltage distribution and symmetric, evenly spaced sampler levels are different from reality at some level. This might be an area for further improvement by someone working closely with the data.

Transforming the ACF to a power spectrum is pretty straightforward with any of a number of FFT tools. The main trick is to create a real, symmetric data array of twice the length of the measured ACF by reflecting the ACF around its zero-delay data value. An example of the application of the corrections and Fourier transform is shown in the plot_actsys code module listed in the section on Example Code.
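
The reflection trick looks like this in Python; a minimal sketch that assumes the quantization and linearization corrections have already been applied to the ACF.

    import numpy as np

    def acf_to_spectrum(acf):
        """Reflect an N-lag ACF about its zero-delay value to form a
        real, symmetric array of length 2N (the unmeasured lag N is set
        to zero), then Fourier transform to get the power spectrum."""
        n = len(acf)
        sym = np.concatenate([acf, [0.0], acf[-1:0:-1]])
        return np.fft.fft(sym).real[:n]

    # Sanity check: the ACF of pure white noise is a spike at zero delay,
    # and its spectrum is flat.
    print(acf_to_spectrum(np.array([1.0, 0.0, 0.0, 0.0])))  # [1. 1. 1. 1.]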

Binary Data Formats

If you are reading FITS data files, you don't have to worry much about the details of how the binary data is stored on disk. This is generally taken care of by the FITS reader code. However, if your data is a stream of IF voltage samples or in a format defined by a low-level language data structure, the details are important.

Binary data comes in three flavors: character, integer, and floating point. A character is always 8 bits (1 byte) long, where the translation from numeric value to alphabetic letter is standardized in the ASCII code table.

Integers can be 8, 16, 32, and sometimes 64 bits long (1, 2, 4, or 8 bytes), and they can be signed or unsigned. The distinction between signed and unsigned determines the interpreted data range. A 16-bit, unsigned integer can store values from 0 to 65535 while a signed, 16-bit number runs from -32768 to +32767. The binary bits in the unsigned value of 65535 are the same as in the signed value of -1.
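
You can verify that last statement in a couple of lines of Python with the standard struct module:

    import struct

    # Pack 65535 as an unsigned 16-bit integer, then reinterpret the
    # same two bytes as a signed 16-bit integer.
    b = struct.pack("<H", 65535)
    print(struct.unpack("<h", b)[0])   # prints -1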

Floating point binary numbers are almost always 32 bits (single precision) or 64 bits (double precision) long (4 or 8 bytes). Very rarely you may see 16- or 128-bit floating point numbers, but these are not generally supported by most computing languages. The allocation of bits between exponent and mantissa in the data word is specified by IEEE Standard 754.

To add variety to life, there is no standard on the order in which computers store the bytes of a multi-byte integer or floating point number on disk. Intel processors store the least significant byte first (called "little endian"), and Sun and Motorola processors store the most significant byte first ("big endian"). (See Gulliver's Travels by Jonathan Swift.) If you read data written by a Sun computer with an Intel machine or vice versa, you will see garbage unless your reading program is aware of the problem. You can reverse the byte swap by treating all bytes as characters and swapping the order in each 2-, 4-, or 8-byte word.
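
With numpy you can skip the hand-swapping by pinning the byte order in the data type when you read; a minimal sketch (the file name is hypothetical):

    import numpy as np

    # '<i2' pins little-endian signed 16-bit; '>i2' would be big-endian.
    # The read is then correct on either kind of machine.
    data = np.fromfile("samples.dat", dtype="<i2")

    # An array already in memory can also be byte-swapped directly:
    swapped = data.byteswap()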

Processing Direct IF Voltage Samples

The most basic form of data collection is to directly sample band-limited IF output voltages with an analog to digital (A/D) converter. If the data sampling rate is twice the IF bandwidth, all information is retained. You are then free to process and reprocess your data any way you like on a general-purpose computer - Fourier transform spectroscopy, autocorrelation functions, pulse period and dispersion searches, interference excision, etc. Since there is no data rate reduction in this type of data collection it uses a prodigious amount of disk space and requires considerable post-processing resources, but the rapid growth of processing power and disk sizes has moved this option into astronomically interesting bandwidths and integration times. Some pulsar searches and planetary radar are now done this way. One terabyte of disk space, costing about $5000, can store about 14 hours of directly-sampled data from a 10 MHz bandwidth using 8-bit samples.

If you would like to try your hand at extracting hydrogen line profiles from some directly sampled GBT data, you are welcome to have a go at the files on the local disk of workstation 'rfimit' at /export/raid/scratch. The data are 16-bit signed integers written by an Intel machine and sampled at 5 megasamples per second. Two data channels are interleaved so that alternating data words are from the same channel. The A/D converter has only 12 bits in its output word, so the least significant 4 bits of each 16-bit word should be discarded. Files named with an off.dat suffix are the 'off' scans to be paired with the 'on' scan files with the same name prefix.
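
As a starting point, here is a Python sketch under the assumptions just stated (the file name is hypothetical): read the little-endian words, discard the four undefined low bits, de-interleave, and average the power spectra of short blocks.

    import numpy as np

    raw = np.fromfile("lvm_on.dat", dtype="<i2", count=4_000_000)
    raw = raw >> 4                        # arithmetic shift keeps the sign
    volts = raw[0::2].astype(np.float64)  # one of the two interleaved channels

    # Average the power spectra of 4096-sample blocks.  At 5 Msps the band
    # covers 0-2.5 MHz, so the channel spacing is 2.5 MHz / 2048 ~ 1.2 kHz.
    nfft = 4096
    blocks = volts[: len(volts) // nfft * nfft].reshape(-1, nfft)
    spec = (np.abs(np.fft.rfft(blocks, axis=1)) ** 2).mean(axis=0)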

Web Resources

Keep in mind that the examples given in this lecture are only a few of many ways of accomplishing given tasks, none of which is necessarily the best way. Web searches with applicable keywords, like "Python crosscorrelation" or "IDL Doppler", can turn up a lot of useful tools. Here is a collection of Web sites worth a look.

glish:
 http://aips2.nrao.edu/docs/reference/Glish/Glish.html
 http://aips2.nrao.edu/docs/user/Refman/Refman.html
 http://aips2.nrao.edu/docs/notes/195/195.html

Python:
 http://www.python.org/
 http://pfdubois.com/numpy/
 http://www.python.org/topics/scicomp/
 http://astrosun.tn.cornell.edu/staff/loredo/statpy/
 http://scom.hud.ac.uk/scomdjd/public/python/pgplot/
 http://www.gb.nrao.edu/~esessoms/igkya/

IDL:
 http://www.rsinc.com/idl/
 http://idlastro.gsfc.nasa.gov/homepage.html
 http://cow.physics.wisc.edu/~craigm/idl/ephem.html
 http://idlastro.gsfc.nasa.gov/fitsio.html

References

Hanisch, R. J., Farris, A., Greisen, E. W., Pence, W. D., Schlesinger, B. M., Teuben, P. J., Thompson, R. W., & Warnock, A., III 2001, Astron. & Astrophys., 376, 359-380.