XDF: The eXtensible Data Format Based on XML Concepts

Introduction. The eXtensible Data Format (XDF), now available as version 0.17, is being developed at GSFC’s Astronomical Data Center (ADC) as part of a project funded by the Applied Information Systems Research Program (AISRP) to bring XML applications to the astronomical data centers. XDF is being proposed as a cross-disciplinary solution for scientific data formats based on general mathematical principles, object-oriented models, and XML concepts. This article describes the need for XDF and its details, the potential value of XDF in the National Virtual Observatory program, and your options for getting involved with and using XDF.

Motivation. Among the most effective and robust of the new open technologies are the eXtensible Markup Language (XML) and its ancillary technologies. The global adoption of XML ensures a long usable lifetime for documents written in XML. It also ensures a future supply of programmers and systems analysts who understand these methods. The present is an opportune moment to redesign the basic foundations of our data handling.

The multiplicity problem. Until now NASA and other scientific institutions have had to deal with a multiplicity of data formats. Each format was home grown within its specialty and in many cases is based on technology of the early 1970s. There has not been a sufficiently strong reason to rally around one particular method for containing data, nor was any one system general enough to accommodate everyone’s needs. The result has been a confusing array of data formats. Even within specialties with adopted standard formats, subclasses of incompatible formats often form and thrive.

The transformations problem. At present, satellite data are converted five or six times before reaching end users because incompatible data formats are in use: (1) from satellite telemetry format to a science and engineering specialty format; (2) to an analysis software format such as FITS, HDF, or others; (3) to a word processing document format for publication; (4) to SGML at publishing houses; (5) to ASCII or FITS again at data centers; (6) to GIF or MPEG for Education/Public Outreach. Furthermore, each of these processing steps is distinct for each scientific discipline! Many (expensive) programmers are required to maintain these pathways. During this processing, most meta-data are lost, if indeed any had been included. In some cases, even the record of which satellite or mission obtained the data is dropped. An enormous amount of unnecessary work is required of NASA personnel, scientists, and editors to maintain value in the face of an intrinsically lousy system.

The archaic problem. There are a host of other weaknesses in existing data formats, most stemming from their vintage. Some of the more common problems are: restricted number of characters for variable names; restricted record length; restricted number of records; lack of hierarchical structure; inability to directly query multiple files; lack of pointers or a referencing mechanism; discipline-specific attributes; difficulty with extensibility or interoperability; insufficient self-description; no hyperlinking; not easily Web-enabled. Not all data formats have all of these problems, but every format has some of them.

Advantages of XML. As a result of research at the Astronomical Data Center (ADC) exploring XML technologies, we realized that a single XML data description format can, should, and inevitably will be an end-to-end solution for data formats of all varieties. Many crucial benefits would follow. Data could be transparently interchanged and presented by ubiquitous browsers. Meta-data terminology could be shared as widely as possible between specialties, with hyperlinks to authoritative definitions and usage tips. A single visualization/editing package could operate properly on any data set. The public would obtain and visualize data in the same format (reduced in quantity, not quality). Processing software would require only moderate modification for new missions. Meta-data could be carried along or linked to throughout the data life cycle. XML query methods would operate on all data, greatly enhancing data mining possibilities. Tables and plots in manuscripts would require far less editing during preparation for publication.

XML has already gained significant acceptance within e-commerce, industry and certain science disciplines as a data standard for interfacing between computer applications. The intense interest has come largely because the XML standard includes specifications of how an XML document should be parsed and represented within any computer, irrespective of architecture or operating system. This internal representation of an XML document, called the DOM (Document Object Model), allows a document to be accessed in the same way by different applications running on different computer platforms. XML parsers are now readily available for incorporation into applications software in all major programming languages.
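
As a small illustration of the DOM idea, the following Python sketch parses a short XML document with the standard library's `xml.dom.minidom` and reads it through DOM calls only. The document and its element and attribute names (`observation`, `target`, `flux`, `mission`, `units`) are invented for this example and are not taken from XDF.

```python
from xml.dom import minidom

# Hypothetical document: names are illustrative only, not XDF markup.
DOC = """<observation mission="ExampleSat">
  <target name="M31"/>
  <flux units="Jy">1.25</flux>
</observation>"""


def read_flux(xml_text):
    """Return (mission, target, flux value, units) using only DOM calls."""
    dom = minidom.parseString(xml_text)
    root = dom.documentElement
    target = root.getElementsByTagName("target")[0]
    flux = root.getElementsByTagName("flux")[0]
    value = float(flux.firstChild.data)  # text node inside <flux>
    return (
        root.getAttribute("mission"),
        target.getAttribute("name"),
        value,
        flux.getAttribute("units"),
    )


result = read_flux(DOC)
print(result)  # ('ExampleSat', 'M31', 1.25, 'Jy')
```

The same sequence of DOM calls (`getElementsByTagName`, `getAttribute`, and so on) is available in DOM implementations for other languages, which is the portability the paragraph above describes.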

XDF details. In the course of an AISRP-funded research project on applying XML technologies to the ADC repository, we found that the XML-based data description we were developing could easily be made extraordinarily general and powerful because of the extensibility and other features of XML. In fact we found that we could easily and fully describe any data set that we found, plus add totally novel features.

XDF allows the data themselves to be tagged in XML. Alternatively, XDF can act as a hybrid system in which an XML description accompanies data held as fixed-width ASCII, delimited ASCII, or binary. Such hybrids are useful when fully tagged data would be large enough to cause unacceptable transfer times, or when wrapping legacy data files. In either case XDF gives a full description of how the data are laid out, which makes it suitable for wrapping most existing data files.
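
The hybrid approach can be sketched in a few lines of Python. Here a small XML description states the delimiter and the field names and types, and a reader uses it to interpret an untagged, delimited ASCII payload. The `dataset`, `layout`, and `field` elements and their attributes are hypothetical stand-ins chosen for this sketch; the actual XDF DTD defines its own vocabulary.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout description (not the real XDF element names).
DESCRIPTION = """<dataset>
  <layout delimiter="|">
    <field name="star" type="string"/>
    <field name="vmag" type="float"/>
    <field name="nobs" type="integer"/>
  </layout>
</dataset>"""

# Untagged legacy-style data that the description above wraps.
RAW_DATA = "HD 1234|7.52|12\nHD 5678|9.10|3\n"

CONVERTERS = {"string": str, "float": float, "integer": int}


def read_records(description_xml, raw_text):
    """Interpret delimited ASCII text using the XML layout description."""
    layout = ET.fromstring(description_xml).find("layout")
    delimiter = layout.get("delimiter")
    fields = [(f.get("name"), CONVERTERS[f.get("type")])
              for f in layout.findall("field")]
    records = []
    for line in raw_text.splitlines():
        if not line:
            continue  # skip blank lines
        parts = line.split(delimiter)
        records.append({name: convert(part)
                        for (name, convert), part in zip(fields, parts)})
    return records


records = read_records(DESCRIPTION, RAW_DATA)
print(records[0])  # {'star': 'HD 1234', 'vmag': 7.52, 'nobs': 12}
```

Because the layout lives in the description rather than in the reader, the same function can read a differently delimited file once its description is changed.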

XDF uses the XML ID and IDREF attribute types to point between components. These references are used to align axis arrays, to merge data, or sometimes as shorthand to state that one piece of information is a copy of another.
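
A minimal Python sketch of the ID/IDREF idea follows, in which two arrays refer to one shared axis instead of repeating its values. The attribute names `axisId` and `axisIdRef` and the surrounding elements are assumptions made for illustration. Without a DTD the parser does not treat them as true ID/IDREF types, so the lookup table is built by hand.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: one axis declared once, referenced by two arrays.
DOC = """<dataset>
  <axis axisId="wavelength" units="nm">
    <value>400</value><value>500</value><value>600</value>
  </axis>
  <array name="flux" axisIdRef="wavelength"/>
  <array name="error" axisIdRef="wavelength"/>
</dataset>"""


def resolve_axes(xml_text):
    """Map each array name to the values of the axis it points to."""
    root = ET.fromstring(xml_text)
    # Index every axis by its identifier (the "ID" side).
    by_id = {axis.get("axisId"): axis for axis in root.iter("axis")}
    resolved = {}
    for array in root.iter("array"):
        axis = by_id[array.get("axisIdRef")]  # follow the "IDREF" side
        resolved[array.get("name")] = [float(v.text)
                                       for v in axis.findall("value")]
    return resolved


axes = resolve_axes(DOC)
print(axes["flux"])  # [400.0, 500.0, 600.0]
```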

XDF adopts a number of XML concepts that were not envisioned when the older data formats were invented. Recursive entity replacement allows document writers to enter shorthand notations that are expanded by the parser. We will be making use of this feature for the entry of scientific units. Entered in whatever system is most familiar, units are automatically transformed to a base set of units of Le Système International d’Unités (SI). In this way data values from different institutions can be automatically compared and merged properly.
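
Recursive entity replacement can be seen with any conforming XML parser. In the Python sketch below, the entity names (`m`, `km`, `kmps`) and the `measurement` markup are invented for illustration and are not XDF's unit definitions. The parser expands `&kmps;` through `&km;` down to `&m;` before the application ever sees the text.

```python
import xml.etree.ElementTree as ET

# Hypothetical unit shorthands declared in the internal DTD subset.
# &kmps; refers to &km;, which in turn refers to &m;.
DOC = """<!DOCTYPE measurement [
  <!ENTITY m "m">
  <!ENTITY km "1000 &m;">
  <!ENTITY kmps "&km; s^-1">
]>
<measurement>
  <velocity value="30">&kmps;</velocity>
</measurement>"""

velocity = ET.fromstring(DOC).find("velocity")
expanded_units = velocity.text  # entity already expanded by the parser
print(velocity.get("value"), expanded_units)  # 30 1000 m s^-1
```

Note that this sketch only expands the text of the units; rescaling the numeric value to base units would be a separate step in the application.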

The concept of a logical document composed of any number of physical files is fairly new. These physical files can be enveloped in the same manner that e-mail messages contain attachments.

A particularly useful feature to be developed is the incorporation of MathML, a W3C (World Wide Web Consortium) standard language for mathematical formulas, to replace an explicit number series when a formula is sufficient. Many of the common analytic tools (e.g. Mathematica, Mathcad, or Maple) can interpret MathML.
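
To make the idea concrete, the following Python sketch evaluates a small MathML content-markup expression, 400 + 100·i, to generate axis values that would otherwise be listed explicitly. It is not part of XDF, whose MathML support is still to be developed. The evaluator is a toy that handles only `cn`, `ci`, `apply`, `plus`, and `times`, and the index variable `i` is an assumption of this example.

```python
import operator
import xml.etree.ElementTree as ET
from functools import reduce

MATHML_NS = "{http://www.w3.org/1998/Math/MathML}"

# Content MathML for the expression 400 + 100 * i.
FORMULA = """<math xmlns="http://www.w3.org/1998/Math/MathML">
  <apply><plus/>
    <cn>400</cn>
    <apply><times/><cn>100</cn><ci>i</ci></apply>
  </apply>
</math>"""

# Only two operators are supported in this toy evaluator.
OPERATORS = {"plus": operator.add, "times": operator.mul}


def evaluate(node, variables):
    """Recursively evaluate a tiny subset of content MathML."""
    tag = node.tag.replace(MATHML_NS, "")
    if tag == "cn":                      # numeric constant
        return float(node.text)
    if tag == "ci":                      # named variable
        return variables[node.text.strip()]
    if tag == "apply":                   # operator followed by arguments
        op, *args = list(node)
        func = OPERATORS[op.tag.replace(MATHML_NS, "")]
        return reduce(func, (evaluate(arg, variables) for arg in args))
    raise ValueError(f"unsupported element: {tag}")


expression = ET.fromstring(FORMULA)[0]   # the outermost <apply>
axis_values = [evaluate(expression, {"i": i}) for i in range(3)]
print(axis_values)  # [400.0, 500.0, 600.0]
```

Three values are generated here, but the same short formula would describe an axis of any length, which is the saving over an explicit number series.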

XDF and the NVO. The National Virtual Observatory initiative within the astronomical community aims to enable cross-archive data search and retrieval capabilities. It will allow researchers to obtain data on any object in the sky from any observatory or NASA mission. It is already clear from NVO discussion groups that XML technologies will be called upon to facilitate the NVO functions. XDF in particular has been discussed as an important component of this.

Your involvement. From the home page http://xml.gsfc.nasa.gov/XDF one can peruse the XDF and FITSML DTD and Schema, or download our beta version of the XDF package, which is based on our XDF API. The package includes an alpha-level Perl/Tk table viewer. In addition, one can sign up for the XDF discussion group and announcements server.

In closing. The XDF language is a powerful data description that covers most forms of data transmitted throughout the many scientific communities represented at NASA and beyond. Using inheritance within XML Schema, XDF provides a core language for scientific data arrays, images, vector fields, tables, etc. Individual disciplines may add layers of semantics, each layer becoming more narrowly focused. Our Perl and Java XDF software packages can be used to easily add XDF I/O to applications. We hope to add C++ wrappers as well in the near future. We foresee eventual incorporation of XDF into an XML scientific manuscript language for publication in journals and books. This leads to an end-to-end solution with a single XML data format: starting with a satellite transmitting to ground stations, through processing pipelines, to scientists who include data in manuscripts that are published on the Web or in hard copy and then added to data center databases. With XDF used for the entire life cycle, data and meta-data are better preserved and more reliable.