XML at the ADC: A Next Generation Data Repository.

Ed Shaya, James Gass, James Blackwell, Brian Thomas, Brian Holmes (NASA/RITSS)
Cynthia Cheung (NASA/GSFC)

Summary:

The staff of the Astronomical Data Center (ADC at GSFC/NASA) is researching the benefits of using the eXtensible Markup Language (XML) for its repository of published astronomical data. Automated pipelines have been constructed for the flow of data from scientists and/or journal presses into XML documents. Data can be retrieved through the web via a variety of query methods. In the process, an XML software toolbox is being developed for the importation, enhancement, and distribution of tabular data and their associated metadata documents. A new data format has been designed that takes full advantage of the XML hierarchical view yet makes maximal use of previously standard keywords and parameters.

Introduction

It is expected that the ADC (http://adc.gsfc.nasa.gov/) and the Centre de Donnees astronomiques de Strasbourg, France (CDS, http://cdsweb.u-strasbg.fr/CDS.html) will be ingesting several thousand published astronomical tables each year in the near future. Storing and documenting such a large volume of data requires new procedures. In addition, the need for precise searches for specific data in a database that is growing to enormous size requires a rethinking of its management and organization. As it turns out, the recent acceptance of XML by the computer science community has resulted in numerous applications and standard practices that are well suited to the problems of large and highly complex data repositories such as the one at the ADC.

The Advantages of XML

The basic idea behind XML is fiendishly simple: mark up documents so that each piece of information has beginning and end tags that describe the type of information present. The start tag is set off from the text by "<" and ">" symbols and the end tag by "</" and ">", as they are in HTML. So a simple example of a piece of information, called an element, might be <author>Mark Twain</author>. Each field or trade can decide on its own standard tag names. Elements can be nested within elements, so that astronomers can have:
<star>
	<G-type>Sol</G-type>
	<luminosity units="solar luminosities">1.000</luminosity>
</star>

An "attribute" is introduced in the luminosity tag to modify the element type; in this case it provides the units of the contained value. Librarians can have mark up like:

<book>
	<author>Mark Twain</author>
	<title>Tom Sawyer</title>
</book>
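
Because every document follows the same tag syntax, a general-purpose parser can read either fragment without knowing its vocabulary in advance. As a minimal sketch, assuming Python and its standard xml.etree.ElementTree module (neither is part of the ADC toolbox described in this paper), the star example could be read like this:

```python
import xml.etree.ElementTree as ET

# The star fragment from the text, unchanged.
STAR_XML = """
<star>
    <G-type>Sol</G-type>
    <luminosity units="solar luminosities">1.000</luminosity>
</star>
"""

star = ET.fromstring(STAR_XML)

# Child elements are found by tag name; the text between the tags is the value.
name = star.find("G-type").text
luminosity = star.find("luminosity")

# Attributes are exposed on the element and looked up by name.
value = float(luminosity.text)
units = luminosity.get("units")

print(f"{name}: {value} {units}")
```

The same few calls would read the book fragment; only the tag names passed to find() would change.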

Now, this may not look all that revolutionary; there has been a specification like this since 1986 called SGML, from which XML was derived, and data providers have always marked up their documents in some form. What is new is the consensus among every major software manufacturer and data provider to conform to this one particular style of mark up. With everyone committed to the XML specification of the W3C of February 1998, software to read in and write out XML can be devised for general purposes and cross-platform performance is ensured.

There is now a W3C standard interface, the Document Object Model (DOM), through which applications access and/or modify data in an XML document. There is also a standard scripting language for converting XML documents into other types of documents such as HTML, PDF, or other XML languages. This is called the eXtensible Stylesheet Language (XSL). With XSL it is easy to tailor views on the data. Users can be fed selected pieces of data appropriate to them, perhaps only those for which the user has permission or clearance. The output format (styling) can differ significantly depending on the user's request and the type of media for display. The display can be formatted coarsely for small screens or with great finesse for high resolution printing.
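
To illustrate what DOM access looks like in practice, here is a minimal sketch using Python's xml.dom.minidom, a lightweight DOM implementation in the Python standard library. The choice of Python, the "lang" attribute, and the "year" element are assumptions made for this example and are not taken from the ADC software.

```python
from xml.dom import minidom

# The librarian's example from the text, parsed into a DOM tree.
doc = minidom.parseString(
    "<book><author>Mark Twain</author><title>Tom Sawyer</title></book>"
)

# Read: locate an element by tag name and take the data of its text node.
title = doc.getElementsByTagName("title")[0]
print(title.firstChild.data)

# Modify: add an attribute and append a new child element.
# The "lang" attribute and "year" element are illustrative only.
title.setAttribute("lang", "en")
year = doc.createElement("year")
year.appendChild(doc.createTextNode("1876"))
doc.documentElement.appendChild(year)

# Serialize the modified tree back to XML text.
updated = doc.documentElement.toxml()
print(updated)
```

The method names used here (getElementsByTagName, setAttribute, createElement, appendChild) come from the DOM specification itself, so equivalent code in another DOM-conforming library looks much the same.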

There is a standard way to link to or address sections of other XML documents. XML allows several files or sections of documents to be presented as a single web page. Finally, work is ongoing toward a standard query language. At the ADC we are presently using the XQL query language to help users find the information they want. Other sites are trying out some of the other contenders for the query standard.
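
The flavor of a path-based query over XML can be sketched as follows. This example does not use XQL; it substitutes the limited XPath subset built into Python's xml.etree.ElementTree, and the document, element names, and identifiers are hypothetical rather than drawn from the ADC repository.

```python
import xml.etree.ElementTree as ET

# Hypothetical repository extract; element names and values are illustrative only.
CATALOG_XML = """
<datasets>
    <dataset id="A1">
        <title>Bright Star Positions</title>
        <keyword>astrometry</keyword>
    </dataset>
    <dataset id="A2">
        <title>Galaxy Redshifts</title>
        <keyword>spectroscopy</keyword>
    </dataset>
    <dataset id="A3">
        <title>Proper Motions of Nearby Stars</title>
        <keyword>astrometry</keyword>
    </dataset>
</datasets>
"""

root = ET.fromstring(CATALOG_XML)

# Path expression: every dataset with a keyword child equal to "astrometry".
matches = root.findall("./dataset[keyword='astrometry']")
ids = [d.get("id") for d in matches]
titles = [d.findtext("title") for d in matches]

print(ids, titles)
```

The query names an element type and a condition on its content, which is only possible because the markup records what each piece of text is.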

Document Type Definition

One of our first tasks after defining the scope of the XML project was to develop a Document Type Definition (DTD), which specifies the structure and content of our metadata documents. It had to support the information in our legacy documents, allow for new content types expected in the future, and provide additional markup for query. This XML language would probably only be useful to the ADC and CDS because it contains far more information than would be found in typical tables of data. Additional work is ongoing between various astronomical data centers to define a simpler mark up for the interchange of data.

Our documents had to contain bibliographic information, detailed descriptions of the holdings, links to related information at other sites as well as within our database, and detailed information on the structure of each table, down to the meaning and format of each column of data.

The hierarchical structure of XML allowed the ADC XML documents, datasets.xml, to be well organized. Logically connected information could be next to (siblings), nested within (child/parent), or used to modify (attribute) each other. One tends to make use of deeply embedded structures with XML because one knows that there will be no difficulties in parsing them.

As search tools were developed, many parts of the structure had to be modified to enable clear and focused searching. Recently, we developed an advanced search tool (see poster in this session by Brian Holmes et al.) that allows one to search on any element type in our DTD. Viewed from this applet the importance of proper naming and organization of element types is seen more clearly.

As the legacy documents were translated into XML, additional changes were incorporated to be sure that all the information was contained somewhere and also to take into account information that was typically missing in our documents.

It is not yet clear if or when there will be a final DTD because thus far it has remained in continuous flux. As new applications are added to the XML programmer's toolbox, one sees the need to modify the document format to take best advantage of those developments. However, it is possible to make the pipeline adjust automatically to changes in the DTD so that no new programming is necessary each time the document structure changes.

Transforming Legacy Documents into XML

The conversion of ADC legacy documents into XML presented some challenging difficulties which we have now mostly overcome. The main difficulty was that the documents were in a form meant for human readability and not necessarily for automated processing. Pieces of information were delimited by carriage returns, indentation, a row of "=" signs, parentheses, or whatever made sense to human perception. In some cases no demarcation was used because the human eye is so good at distinguishing the information type simply by the word pattern.

To convert to XML we devised an XML mark up language that expresses in a natural way the type of data desired, where it should be in the document, its beginning and ending delimiters, and a character pattern to test if it is indeed what we want. A Perl script, txt2XML.pl, was written to interpret this set of rules, read in the plain text documents, and output XML documents. A graphical interface (using perl-TK) was added to help write the rules and to locate spots within the documents that deviate from the standards.
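
The rule-driven approach can be sketched in a few lines. The following is not txt2XML.pl, whose rules are themselves written in XML; it is a simplified Python illustration in which the Rule class, the three rules, and the sample text are all hypothetical. Each rule carries the four ingredients named above: the element type, its expected position (here, simply the order of the rules), its delimiters, and a pattern used to check the extracted value.

```python
import re
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Rule:
    tag: str      # element type to emit
    start: str    # literal text that precedes the value
    end: str      # literal text that follows the value
    pattern: str  # regular expression the whole value must match


# Hypothetical rules and legacy text, for illustration only.
RULES = [
    Rule("title", "Title:", "\n", r".+"),
    Rule("author", "Author:", "\n", r"[A-Z][\w.\- ]+"),
    Rule("year", "(", ")", r"\d{4}"),
]

LEGACY_TEXT = "Title: Catalog of Example Stars\nAuthor: A. Observer\nPublished (1999)\n"


def convert(text, rules, root_tag="dataset"):
    """Apply the rules in order; return the XML tree and a list of deviations."""
    root = ET.Element(root_tag)
    problems = []
    pos = 0  # each rule searches from where the previous match ended
    for rule in rules:
        begin = text.find(rule.start, pos)
        stop = text.find(rule.end, begin + len(rule.start)) if begin >= 0 else -1
        if begin < 0 or stop < 0:
            problems.append(f"{rule.tag}: delimiters not found")
            continue
        value = text[begin + len(rule.start):stop].strip()
        if not re.fullmatch(rule.pattern, value):
            problems.append(f"{rule.tag}: unexpected value {value!r}")
            continue
        ET.SubElement(root, rule.tag).text = value
        pos = stop
    return root, problems


root, problems = convert(LEGACY_TEXT, RULES)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
print(problems)
```

The list of problems plays the role that the graphical interface plays in the real tool: it points to the places where a document departs from the expected layout, so that either the rule or the document can be corrected.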

We located inconsistencies in several hundred of our documents with this graphical interface and, after correcting all of the documents that required only minor adjustments, we were able to completely transform 2445 of 2496 documents. The remaining 51 documents will require significant rewriting to bring them into conformance with the standards. We now have available one of the largest repositories of complex XML documents in existence today.

The eXtensible Data Format

Although most of the holdings of the ADC are tables of data, there are also a significant number of images, data cubes, and spectra. While developing the document structure for tables, we noticed an interesting thing. The XML description of the headings of a table could be viewed as values along an axis, much like one would have on a spectrum or a scientific image. And, as it happens, we were recently hampered in the development of a tool for displaying images and spectra because the standard formats for scientific data do not have a unique way of describing axes. With some minor changes to our schema for holding tables, a document type was defined that can describe data of any number of dimensions, standardize the inclusion of axes along each dimension, and handle vectors and animations, and that is actually a better way to handle tables. It also handles data descriptors (headers) better because of its superior support for metadata of this kind. This new extensible data format is called XDF. We are still working on the details of including binary data when that is needed, but it looks ready now to handle all of the data collected over the 23-year history of the ADC.
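
The idea that table headings are just values along an axis can be made concrete with a small data structure. The sketch below is not the XDF schema; the Axis and Array classes, their fields, and the sample values are hypothetical and only illustrate how one structure can hold both a table and a spectrum.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple


@dataclass
class Axis:
    name: str
    values: List[Any]  # one value per position along this dimension
    units: str = ""


@dataclass
class Array:
    axes: List[Axis]                   # one Axis per dimension
    cells: Dict[Tuple[int, ...], Any]  # index tuple -> stored value

    def get(self, **where):
        """Look up a cell by axis value, e.g. get(row=2, column="Vmag")."""
        index = tuple(axis.values.index(where[axis.name]) for axis in self.axes)
        return self.cells[index]


# A table: the column headings are simply the values along a "column" axis.
table = Array(
    axes=[Axis("row", [1, 2]), Axis("column", ["name", "Vmag"])],
    cells={(0, 0): "Star A", (0, 1): 4.5, (1, 0): "Star B", (1, 1): 6.1},
)

# A spectrum: one dimension whose axis carries wavelengths.
spectrum = Array(
    axes=[Axis("wavelength", [500.0, 510.0, 520.0], units="nm")],
    cells={(0,): 1.2, (1,): 1.5, (2,): 1.1},
)

print(table.get(row=2, column="Vmag"))
print(spectrum.get(wavelength=510.0))
```

Adding a third axis to the same structure would describe a data cube, which is the sense in which one document type can cover tables, spectra, images, and cubes alike.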

The ADC XML Data Repository Today

Now that we have an XML repository it is easy to make use of a number of XML applications that are already available. XML editors exist that allow one to easily make changes to documents by snapping in tags from a menu that is set by the DTD. These editors do not permit one to place tags in the wrong location or to leave out required information.

There are applications that create web forms so that documents can be created on the web by simply filling in a form. We are using form generators to allow the scientists who authored the tables to make changes to our documents.

We are using XSL to transform the XML documents into HTML for viewing in web browsers. Users can choose which parts of the documents they wish to see by choosing which sections of the XSL to use. XSL is also used to transform the XML output of XQL queries into HTML for better readability and to create links to obtain further information.
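
The effect of selecting sections during the transformation to HTML can be illustrated without XSL. The sketch below uses plain Python in place of a stylesheet, and its document, element names, and to_html function are hypothetical; it shows only the principle that the same XML source yields different views depending on which sections are requested.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical metadata document; element names are illustrative only.
DOC_XML = """
<dataset>
    <title>Example Catalog</title>
    <description>Positions of example stars.</description>
    <history>Converted from a legacy text file.</history>
</dataset>
"""


def to_html(xml_text, sections):
    """Render only the requested child elements as an HTML fragment."""
    root = ET.fromstring(xml_text)
    parts = []
    for child in root:
        if child.tag in sections:
            parts.append(f"<h2>{escape(child.tag)}</h2>")
            parts.append(f"<p>{escape(child.text or '')}</p>")
    return "\n".join(parts)


# One source document, one of many possible views.
html_view = to_html(DOC_XML, sections={"title", "description"})
print(html_view)
```

In an XSL stylesheet the same selection would be expressed declaratively, as templates matched against element types, rather than as a loop.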

The Future of the ADC XML Data Repository