XDF, the eXtensible Data Format for Scientific Data

Ed Shaya

This document describes an XML mark-up language for documents containing the major classes of scientific data. It allows the essential and common data components to be represented in a consistent manner, independent of the scientific specialty involved. The language makes use of the object models encompassed by modern programming languages. Such data representations would benefit from the widespread acceptance that XML enjoys and could bring about greater interdisciplinary information transfer. It is reasonable to expect that this approach would lead to clearer and more widespread public dissemination of scientific and technical explorations.

A large fraction of the scientific numeric data in existence can be classified into the following three categories:

1) A simple parameter set to a single value, possibly infinite, or to a range of values, possibly of infinite extent (e.g., x = 3.1 or 0 < y < 180).

2) Gridded samples of scalar or vector fields embedded in a continuous N-dimensional space. We shall refer to this class as field arrays.

3) Lists of items and values for a selection of their properties. Commonly, this class of data is called tables. An important complication is that the values of properties in tables may themselves be field arrays (e.g., an atlas).

Examples of N-dimensional spaces include physical space, projected space, time, wavelength, frequency, energy scales, or some other parameter space. Field arrays include tightly sampled grids such as spectra, images, animations, and time-series measurements. They also include sparsely sampled data such as event detections or pointed single-aperture measurements. The embedded space is almost always continuous; the sampling often is not. Because data always has some finite resolution, it is always gridded (sometimes variably), and both field arrays and tables can therefore be represented similarly as ordered lists of values.
Software can take advantage of this similarity by sharing input/output methods between the two. In fact, much of the data handling can be similar: subsetting, taking cross-sections, etc. However, the fundamental difference between the two categories is that tabulated properties rarely form a continuous space, so interpolation, and any analysis that depends on interpolation, is not sensible. For XDF, field arrays and tables are considered to be a single class of N-dimensional objects that can contain any combination of four distinct types of dimensions: continuous coordinates, discontinuous item and field spaces, and discontinuous vector components that usually refer to continuous coordinates.

A point of departure for the XDF format is a standardized way of associating the embedded space with the data, and of keeping data associated with the same space together in a consistent and organized manner. This has been a failing of most existing scientific data formats, probably because they are too flexible in how associated scales and axis descriptions may be given. It is often extremely difficult to create an application that can display axes along with all of the data without some user intervention, and too often the axis information is simply not present in the data file.

It is not intended for XDF to redefine all of the header data (metadata) associated with data, since that is mostly discipline related. Each discipline should include in the XDF document its own elements for describing the circumstances of the data collection and the information relevant for understanding the details. The advantages of XDF come primarily from having a standard core that directs the reading of the data and the relative positioning of multiple arrays. However, it should also be helpful to express the metadata of the older file formats in XML, because it can then reside in well-organized structures and benefit from standard interfaces for parsing and transformation.
The objects mentioned in the category 3 data class (tables) are objects in a broad sense. Sometimes they are simply things: stars, people, particles, etc. But sometimes they are subsets of things, such as locations on a thing; for instance, shells in the interior of a star at particular depths, or locations on the surface of the Earth within a scan from a down-looking satellite. Traditionally, the objects of a table are placed in the first column. For the human reader this is fine, because the first column can easily be scanned by eye. For machine-readable forms, it is preferable for the list of objects to become part of the metadata, along with the field types. The reason is that queries are most often formulated in terms of these quantities, and it takes much longer to read all of the data than to read just the metadata. XDF allows for, but does not insist on, extracting object lists from the data files and placing them in the XDF document as axes. The document can link to the rest of the table, which need not be read unless the query result indicates that the data there is needed.

One of the key concepts of object-oriented methodology is that data should be wrapped with the information necessary to read it and to make it useful. XML allows for this by including in data documents either references to applications or code in the form of ECMAScript or Java. An XML document can have references to files containing data, and different types of data files can be handled by different applications. This not only allows input and preliminary processing to be self-directed, but also allows some of the data to be generated on the end user's machine. XDF includes functions for calculating values along axes and, if necessary, for calculating positional information for every grid point. When applications are referenced or embedded within the data documents, the learning curve needed to begin working with scientific data is greatly reduced.
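The earlier point about extracting object lists into the document as axes can be sketched as follows. This is an illustrative fragment only; the element and attribute names (array, axis, value, data, axisId, href) are assumptions for the sake of the example, not a fixed XDF schema:

```xml
<!-- Hypothetical sketch: the object list lives in the document as an axis, -->
<!-- so a query over object names never has to touch the bulk data file. -->
<array name="starCatalog">
  <axis axisId="objects">
    <value>HD 10700</value>
    <value>HD 22049</value>
    <value>HD 95735</value>
    <axis axisId="fields">
      <value>parallax</value>
      <value>magnitude</value>
    </axis>
  </axis>
  <!-- The measurements themselves stay external and are read only on demand. -->
  <data href="starCatalog.dat"/>
</array>
```

A query such as "is HD 22049 in this catalog?" is answered entirely from the metadata; only a query that needs parallax or magnitude values forces a read of starCatalog.dat.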
Specifics

The XDF is a container for parameters, field arrays, and tables. Field arrays and tables are described by array elements. These contain axis information and data elements. Data elements are ordered lists of values of numeric or string types. The array and parameter elements can be grouped into structure elements. A simplified structure with an image consists of an array holding a list of values along one dimension, a list of values along the other dimension, information on the ordering of the data values and the record format, and then the data itself; further arrays of data may follow in the same structure. The structure element is not necessary for this example, but it becomes useful when more complicated sets of arrays are involved. A simplified table looks much the same: one axis lists the objects (say, Fred, Mary, and Ned), another lists the fields (address, phone number, birthday), and the data element holds the records (1212 Sycamore Rd/Gaithersburg MD 20934, 301.123.456, 9/23; 721 Rose Ave/Richmond VA 20712; etc.).

Why are the axis elements nested? This clearly shows the order of the data in the data element: the most nested axis has the fastest moving index when the data is read in. This form makes it possible for the XDF document to be transformed in a straightforward way (by XSLT, CSS, or Perl scripts) into an input program; the nesting of the ends of the read loops reflects the positions of the end tags of the axis elements. It should be possible to write a transformation script for any programming language that would work on any XDF document.

In more detail, an axis holds a values element, which in the simplest case is an explicit list such as 10 20 30 40 50 60 70 80 90. The id is useful if the same axis description is used again anywhere in the document; then one can simply refer to the axis by that id instead of repeating it. Alternatively, values elements can use a built-in indexer whose default values are start="1" and step="1", so that leaving both at their defaults puts the plain index numbers into the axes while the data element carries the measured values (e.g., 7.5 42.2 33.4 34.6 22.5 12.1 1.4 1.1 22.2). Values can also take a script, or point to a URL that will perform the calculation. The data elements can be filled by these same methods.
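The examples described above might look like the following sketches. The element and attribute names here (structure, array, axis, values, read, data, axisId, axisIdRef, start, step, count) are illustrative guesses reconstructed from the prose; an actual XDF document type definition would fix them precisely:

```xml
<!-- A simplified structure holding an image: two nested axes, then read/format -->
<!-- information, then the data. The innermost axis index moves fastest. -->
<structure>
  <array name="image">
    <axis axisId="x">
      <values>1 2 3 ... 512</values>      <!-- values along one dimension -->
      <axis axisId="y">
        <values>1 2 3 ... 512</values>    <!-- values along the other dimension -->
      </axis>
    </axis>
    <read/>   <!-- info on the ordering of the data values and record format -->
    <data>
      ... the data goes here ...
    </data>
  </array>
  <array name="errors">
    <!-- some other array of data -->
  </array>
</structure>

<!-- A simplified table: the objects form one axis, their fields the other. -->
<array name="addressBook">
  <axis axisId="person">
    <values>Fred Mary Ned</values>
    <axis axisId="property">
      <values>address phone birthday</values>
    </axis>
  </axis>
  <data>
    1212 Sycamore Rd/Gaithersburg MD 20934  301.123.456  9/23
    721 Rose Ave/Richmond VA 20712          ...          ...
  </data>
</array>

<!-- Forms of the values element: explicit list, reuse by reference, -->
<!-- or the built-in indexer (defaults start="1" step="1"). -->
<axis axisId="wavelength">
  <values>10 20 30 40 50 60 70 80 90</values>
</axis>
<axis axisIdRef="wavelength"/>               <!-- reuse the same axis description -->
<values start="10" step="10" count="9"/>     <!-- indexer form of the same list -->
```

The nesting makes the read order explicit: a generated input program would open one loop per axis element, with each loop closing where the corresponding axis end tag sits.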
In addition, the application is pointed to the proper read module through the notation/entity mechanism of XML. The data can be held in files of assorted binary formats, or in fixed-width-record text format, as long as there is reader code for that particular format. It is not yet clear what the best way is to handle rare datatypes. A reasonable approach is to provide an easy way to add new datatype readers to the application on the client side whenever a needed reader is found to be missing from its library: the application uses its own code in normal operation, but goes back to a central library, indicated by the notation, when it lacks the needed code. To do this, one includes in the internal document type definition an entity declaration of the form

  <!ENTITY datafile1 SYSTEM "http://machine.org/datafile1.dat" NDATA binaryFile>

This means that datafile1.dat on machine.org is non-parsable, is in the binaryFile notation, and appears in attributes of elements under the name datafile1. The Java class binaryReader is to be used to read it, and the line in the XDF document that includes this data file refers to the entity by that name.

On the binary, ASCII, mark-up debate

There is an ongoing debate about how best to use XML when faced with large data files. First of all, most existing large data files are in binary format, and one simply cannot mark these up with XML tags. Should they be converted to ASCII just to add XML tags? One argument against this is that one may lose precision on floating-point numbers by doing so. This is not so strong an argument, because one rarely needs full machine precision, and an extra character can be used to maintain full precision where it matters. Another argument is that adding tags adds considerable overhead and so requires larger storage space and longer transmission times. This argument would hold equally for ASCII data that is not marked up; but in any case it no longer holds.
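Putting the pieces of the mechanism together might look like the following sketch. The NOTATION declaration and the attribute name href are assumptions added for illustration; only the ENTITY declaration itself is given in the text:

```xml
<!DOCTYPE XDF [
  <!-- The notation names the format; its system identifier is assumed here
       to point the application at the Java reader class for that format. -->
  <!NOTATION binaryFile SYSTEM "binaryReader">
  <!-- datafile1.dat is declared as an unparsed entity in that notation. -->
  <!ENTITY datafile1 SYSTEM "http://machine.org/datafile1.dat" NDATA binaryFile>
]>
```

Inside the document, a data element would then refer to the entity by name, e.g. an attribute value of datafile1, and a validating application can look up the notation to select the correct reader, fetching it from the central library if it is not already installed.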
Recently an application for compression of XML documents, XMill, has been developed. It does a good job of compressing XML by taking full advantage of the fact that each element type is likely to contain a different datatype, and each datatype compresses well when the correct compression algorithm is applied to it. However, there is still a problem with marking up all of the data: the limited capability of present-day parsers to hold the many millions of separate nodes needed to contain and operate on such data. It would appear possible to solve this problem as well, if a parser could be written to work directly with the XMill format and its container systems. It is because of these problems with large data files, and because of the huge legacy of binary and ASCII data files that are not marked up, that XDF is careful to allow both varieties of data to be included. One can fully mark up the data and leave out formatting information, or one can use old-fashioned fixed-width formats for records, or one can express the binary format used.

Direct Access

For working with large datasets it is useful to read in only specific records, or possibly to go directly to specific bytes in the files. With the axis information in XDF it becomes possible to access the values for the most common queries directly. For field arrays, a section of the space can be specified, and the indices of the data values where that section lies can be calculated. For data that is not marked up, one can then go directly to the parts of the files where those values are found. For fully marked-up data there is no format information for direct file access; but the entire data set can be read in once and placed into a DOM, which can be written out to disk. The DOM provides indexing of the elements, and efficient direct access is then possible. Another tactic is to section the data into many smaller files. This can be reflected in the XDF axis elements by giving multiple values elements, each covering one section of an axis.
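Such a sectioning might be sketched as follows. The names here (values, valuesId, data, href, valuesIdRef) are illustrative assumptions; the point is only that each values element carries an id for its section of the axis, and each file's data element points back at the section it covers:

```xml
<array name="survey">
  <axis axisId="dec">
    <!-- One axis, split into two sections, each with its own id. -->
    <values valuesId="decNorth" start="0" step="1" count="90"/>
    <values valuesId="decSouth" start="-90" step="1" count="90"/>
  </axis>
  <!-- Each data element covers one section and lives in its own file. -->
  <data href="northFile" valuesIdRef="decNorth"/>
  <data href="southFile" valuesIdRef="decSouth"/>
</array>
```

A query for a southern declination then needs to open only southFile, leaving the other section untouched.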
The IDRefs then specify the appropriate sections of the axes for the data in each file.