[Note: this proposal was approved by formal vote of the IAU FITS
	Working Group, and became effective in January 1990.]


        Floating Point Proposal for FITS

        Don Wells, Preben Grosbol       28 February 1989


	The Proposal


The Basic FITS, Random Groups and Generalized Extensions Agreements
are revised to add IEEE-754 32- and 64-bit floating point numbers to
the original set of FITS data types.  BITPIX=-32 and BITPIX=-64
signify 32- and 64-bit IEEE floating point numbers; the absolute value
of BITPIX is used for computing the sizes of data structures. The full
IEEE set of number forms is allowed for FITS interchange, including
all special values (e.g., the 'not-a-number' cases). The order of the
bytes will be sign and exponent first, followed by the mantissa bytes
in order of decreasing significance.  The BLANK keyword will be
ignored by FITS readers when BITPIX=-32 or -64.


	Discussion

The dynamic ranges of astronomical data have increased steadily in the
years since the original FITS Agreement (March 1979), which defined 8-, 16-
and 32-bit integers as the only data formats. All integer formats are
limited by an absolute error, whereas floating point formats can span a
much wider numerical range with only a relative error. Floating point
formats are desirable for the applications that need increased dynamic
range. Many astronomical data systems now use floating point in data
processing, and conversion to and from FITS integer formats sacrifices
accuracy and dynamic range and consumes significant computer time.  Almost
all of the new computers designed since about 1981 use the IEEE format
(see references below). We are proposing that FITS be revised to allow
transmission of 32- and 64-bit floating point data within the FITS format
using the IEEE standard. This Floating Point Agreement will also apply to
the Random Groups Extension and to those Generalized Extensions for which
BITPIX is not explicitly restricted (e.g., BITPIX=8 for XTENSION='TABLE').

In order to transmit floating point data in the main data matrix we must
add some new convention to the main FITS header. We propose to add two new
allowed values for the basic keyword BITPIX. The set of values specified
in the original FITS Agreement was 8, 16 and 32; we propose to add -32 and
-64 to indicate IEEE single- and double-precision floating point data.
Thus, the number of bits per pixel would be the absolute value of BITPIX.
This implies that the number of 2880-byte logical records used by the main
data matrix must be computed using the absolute value of BITPIX (this rule
was specified in the Generalized Extensions Agreement (Grosbol et al.
1988)).
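
As an illustration of the size rule above, the following is a minimal
Python sketch, not part of the Agreement. The function name data_records
is hypothetical, and the sketch assumes a simple main data matrix (it does
not handle Random Groups or extension parameter counts).

```python
import math

FITS_RECORD_BYTES = 2880  # logical record length of the FITS Agreement


def data_records(bitpix, naxes):
    """Return (data bytes, number of 2880-byte records) for a simple array.

    bitpix: one of 8, 16, 32, -32, -64
    naxes:  list of axis lengths (NAXIS1, NAXIS2, ...); empty means no data
    """
    if bitpix not in (8, 16, 32, -32, -64):
        raise ValueError("unsupported BITPIX: %r" % (bitpix,))
    if not naxes:
        return 0, 0
    npix = math.prod(naxes)
    # The sign of BITPIX only selects integer versus floating point;
    # the size computation uses the absolute value.
    nbytes = abs(bitpix) // 8 * npix
    nrecords = -(-nbytes // FITS_RECORD_BYTES)  # ceiling division
    return nbytes, nrecords
```

For example, a 100 x 100 array with BITPIX=-32 holds 40000 data bytes and
so occupies 14 logical records, the last of which is only partly filled.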

The IEEE standard specifies a variety of special exponent and mantissa
values in order to support the concepts of plus and minus infinity, plus
and minus zero, 'denormalized' numbers and 'not-a-number' (NaN). We propose
that all of these special cases should be fully accepted for FITS
interchange.

We propose that the order of the bytes should be sign and exponent first,
followed by the mantissa bytes in order of decreasing significance (i.e.,
the standard non-byte-swapped order).
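
The byte order described above can be sketched in Python with the standard
struct module, whose ">" prefix selects most-significant-byte-first
packing. The helper names below are hypothetical and the sketch is
illustrative only.

```python
import struct


def encode_ieee32(value):
    """Pack a float as 4 bytes: sign and exponent first, then mantissa."""
    return struct.pack(">f", value)


def encode_ieee64(value):
    """Pack a float as 8 bytes in the same most-significant-first order."""
    return struct.pack(">d", value)


def decode_ieee32(raw):
    """Unpack 4 bytes written by encode_ieee32."""
    return struct.unpack(">f", raw)[0]
```

With this ordering the value 1.0 is written as the four bytes 3F 80 00 00
(hexadecimal) when BITPIX=-32; the first byte carries the sign bit and the
seven high-order exponent bits.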

The BLANK keyword of the original FITS Agreement should be ignored by FITS
readers when BITPIX=-32 or -64 (the NaNs of the IEEE format will act as the
blank). FITS writers should not write the BLANK keyword if BITPIX=-32 or
-64.
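
A reader on an IEEE host might recognize blank pixels roughly as follows.
This is a sketch under stated assumptions: the function name read_pixels32
is hypothetical, and the use of None to represent a blank pixel is our
choice for illustration, not something the proposal specifies.

```python
import math
import struct


def read_pixels32(raw):
    """Decode big-endian IEEE 32-bit pixels; NaN pixels come back as None."""
    count = len(raw) // 4
    values = struct.unpack(">%df" % count, raw[:count * 4])
    # No BLANK keyword is consulted: the NaN bit patterns themselves
    # mark the blank pixels when BITPIX=-32.
    return [None if math.isnan(v) else v for v in values]
```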

The BSCALE and BZERO values should be applied by FITS readers if they
differ from 1.0 and 0.0. However, such usage should be regarded as bad practice,
and strongly discouraged, because of the risk of generating overflows and
underflows in FITS readers.
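
The following Python sketch shows one way a reader could apply BSCALE and
BZERO only when they differ from their defaults. The function name
read_scaled is hypothetical, and the usual FITS relation
(value = BZERO + BSCALE * pixel) is assumed.

```python
import struct


def read_scaled(raw, bscale=1.0, bzero=0.0):
    """Decode big-endian IEEE 32-bit pixels and apply BSCALE/BZERO."""
    count = len(raw) // 4
    values = struct.unpack(">%df" % count, raw[:count * 4])
    if bscale == 1.0 and bzero == 0.0:
        # Default case: no arithmetic is done on the pixels at all.
        return list(values)
    # Discouraged case: scaling is applied and may overflow or underflow.
    return [bzero + bscale * v for v in values]
```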


	Bibliography

"An Implementation Guide to a Proposed Standard for Floating-Point
Arithmetic", COMPUTER magazine, Vol. 13, No. 1, pp. 68-79, January 1980.

COMPUTER, Vol. 14, No. 3, March 1981.

ANSI/IEEE 754-1985, "IEEE Standard for Binary Floating Point Arithmetic".

ISDN-xxxxx?

Wells, D.C., Greisen, E.W., Harten, R.H.: 1981, Astron. Astrophys. Suppl.
Ser. 44, 363, "FITS: A Flexible Image Transport System".

Greisen, E.W., Harten, R.H.: 1981, Astron. Astrophys. Suppl. Ser. 44, 371,
"An Extension of FITS for Groups of Small Arrays of Data".

Grosbol, P., Harten, R.H., Greisen, E.W., Wells, D.C.: 1988, Astron.
Astrophys. Suppl. Ser. 73, 359-364, "Generalized Extensions and Blocking
Factors for FITS".

Harten, R.H., Grosbol, P., Greisen, E.W., Wells, D.C.: 1988, Astron.
Astrophys. Suppl. Ser. 73, 365, "The FITS Tables Extension".

        Implementation Notes

Note-1: Many data analysis systems today use floating point internally and
so conversion costs for FITS input and output will be reduced by this
proposal. For machines with IEEE hardware the conversion cost will be
nearly zero, especially when BZERO=0 and BSCALE=1 and the reader has two
DO-loops (one with and one without scaling).

Note-2: The interpretation of the IEEE sign, exponent and mantissa bit
fields is precisely specified by the five rules of Section 3.2.1 of
ANSI/IEEE Standard 754-1985. Algorithms for transformation between IEEE and
non-IEEE formats can be derived from these rules. The FITS Committees will
provide a central archive for conversion software for various
architectures.

FITS readers on non-IEEE systems should test the IEEE exponents for the
special values 0 and 255/2047 (for 32/64-bit numbers). Exponent zero
should be mapped to the local value for floating zero (exponent zero is
used by IEEE for 'denormalized' numbers as well as for both plus and minus
zero); exponent 255/2047 should be mapped to the local value used for
blank pixels (all IEEE infinity and NaN cases have exponent 255/2047). The
IEEE exponent field can be selected from the 32/64-bit values by masking
with the constant 7F800000/7FF0000000000000 hexadecimal. The two tests are
made by comparing the result against zero and against the mask.
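
The two exponent tests can be sketched as follows in Python, operating on
the raw 32- or 64-bit pattern held as an unsigned integer. The function
name classify and the returned labels are hypothetical; the masks are the
constants quoted above.

```python
EXP_MASK32 = 0x7F800000
EXP_MASK64 = 0x7FF0000000000000


def classify(bits, mask):
    """Classify a raw IEEE bit pattern using only its exponent field."""
    exponent = bits & mask
    if exponent == 0:
        # Plus zero, minus zero and all denormalized numbers.
        return "zero"
    if exponent == mask:
        # Plus and minus infinity and every NaN.
        return "blank"
    return "normal"
```

For example, classify(0x3F800000, EXP_MASK32) gives "normal" (the pattern
is 1.0), while classify(0xFFFFFFFF, EXP_MASK32) gives "blank".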

FITS writers executing on non-IEEE systems should map their local value
for blank pixels to an IEEE NaN value; we suggest that all bits set (minus
one in 32- and 64-bit two's complement integers) is a good choice. The
FITS writer should also map local floating point absolute values which
exceed the largest value of IEEE (approximately 10**38 for 32-bit, 10**308
for 64-bit) to IEEE-Infinity.  We recommend that absolute values smaller
than the smallest normalized IEEE value (approximately 10**-38 or
10**-308) be mapped to zero. These two tests are particularly important on
machines which use 'hexadecimal' normalization, with an exponent range of
approximately 10**75 in single-precision.
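
The writer-side mapping for the 32-bit case can be sketched in Python as
follows. The sketch stands in for a non-IEEE system by taking a local
value that is either a number or None (our hypothetical marker for a local
blank pixel); the function name to_ieee32_bits is also hypothetical.

```python
import struct

FLT_MAX = (2.0 - 2.0 ** -23) * 2.0 ** 127  # largest finite IEEE 32-bit value
FLT_MIN = 2.0 ** -126                      # smallest normalized 32-bit value
NAN_BITS = 0xFFFFFFFF                      # all bits set, as suggested above
INF_BITS = 0x7F800000
SIGN_BIT = 0x80000000


def to_ieee32_bits(value):
    """Map a local value (or None for blank) to an IEEE 32-bit pattern."""
    if value is None:
        return NAN_BITS
    sign = SIGN_BIT if value < 0 else 0
    magnitude = abs(value)
    if magnitude > FLT_MAX:
        return sign | INF_BITS  # too large: signed IEEE infinity
    if magnitude < FLT_MIN:
        return 0                # too small: plus zero
    return struct.unpack(">I", struct.pack(">f", value))[0]
```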

The DEC VAX architecture is an important special case. VAX systems use the
DEC-proprietary 'F' format for 32-bit floating data and their byte order
is 'swapped'. Once the swap has been corrected the F-format differs from
the IEEE 32-bit format only in that the exponent bias differs by one and
the (hidden) binary point is shifted by one bit. The two differences add,
and the total effect is that the two number systems differ by a factor of
4.0 (VAXF = 4.0 * IEEE32, IEEE32 = 0.25 * VAXF). This floating point
multiplication operation is little, if any, more expensive than the
conversion operation between integer and floating point which is required
by the original FITS Agreement. VAX systems should test the exponent field
on input to detect the infinity and NaN cases and map them to the local
blank code using the test specified previously. The minus zero and
denormalized cases may also need to be trapped.
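
The swap-and-scale relationship can be checked on an IEEE host with the
Python sketch below, which goes in the opposite direction to a VAX reader:
it takes four bytes of F-format data in VAX memory order and recovers the
value. It assumes the F-format layout in which the sign and exponent sit
in the first 16-bit word and each word is stored low byte first, so that
the swap is a byte swap within each 16-bit word. The function name
vaxf_to_float is hypothetical. Because the raw F-format bits are here
reinterpreted as IEEE, the factor of 4.0 appears as a multiplication by
0.25.

```python
import struct


def vaxf_to_float(raw):
    """Convert 4 bytes of VAX F data (VAX memory order) to a Python float."""
    # Correct the byte order: swap the two bytes of each 16-bit word.
    swapped = bytes([raw[1], raw[0], raw[3], raw[2]])
    bits = struct.unpack(">I", swapped)[0]
    exponent = (bits >> 23) & 0xFF
    if exponent == 0:
        # VAX zero (a set sign bit here is a VAX reserved operand, which
        # this sketch does not distinguish).
        return 0.0
    if exponent == 255:
        # A valid VAX number, but the same bits read as IEEE would be
        # infinity or NaN, so the simple scaling trick does not apply.
        raise ValueError("exponent 255 is not handled by this sketch")
    # The same bit pattern read as IEEE is 4.0 times the VAX value.
    return 0.25 * struct.unpack(">f", swapped)[0]
```

For example, the F-format value 1.0 is stored as the bytes 80 40 00 00
(hexadecimal); after the swap the pattern reads as 4.0 in IEEE, and the
scaling recovers 1.0.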

The DEC VAX architecture supports two different double precision formats,
'D' and 'G'. The 'G' format has the same relationship to IEEE 64-bit format
that the DEC 'F' format has to IEEE 32-bit (VAXG = 4.0 * IEEE64). The 'D'
format, however, requires a complicated transformation to convert to/from
IEEE 64-bit; programmers should consult archives of FITS code to obtain
suitable algorithms and implementations for this case.