XML at the ADC: A Next Generation Data Repository.
Ed Shaya, James Gass, James Blackwell, Brian Thomas, Brian
Holmes (NASA/RITSS)
Cynthia Cheung (NASA/GSFC)
Summary:
The staff of the Astronomical Data Center (ADC at GSFC/NASA) is researching
the benefits of using the eXtensible Markup Language (XML) for its
repository of published astronomical data. Automated pipelines have
been constructed for the flow of data from scientists and/or
journal presses into XML documents.
Data can be retrieved through the web via a variety
of query methods. In the process, an XML software toolbox is being
developed for importing, enhancing, and distributing tabular data and
their associated metadata documents.
A new data format has been designed that takes full advantage of the XML
hierarchical view yet makes maximal use of previously standard
keywords and parameters.
Introduction
It is expected that the ADC (http://adc.gsfc.nasa.gov/) and the Centre de Donnees astronomiques de
Strasbourg, France (CDS, http://cdsweb.u-strasbg.fr/CDS.html)
will be ingesting several thousand published astronomical tables each year
in the near future.
The task of storing and documenting such a large volume of data
requires new procedures to be developed. In addition, the need for
precise searches for specific data in a database that is growing to
enormous size requires a rethinking of its management and organization.
As it turns out, the recent acceptance of XML by the computer science
community has resulted in numerous applications and standard practices
that are well suited to the problems of large
and highly complex data repositories such as the one at the ADC.
The Advantages of XML
The basic idea behind XML is fiendishly simple: mark up documents so
that each piece of information has beginning and end tags that describe
the type of information present. The start tag is set off from
the text by "<" and ">" symbols and the end tag by "</" and ">", as in
HTML. A simple example of a piece of information, called an element, is
<author>Mark Twain</author>. Each field or trade can decide on
its own standard tag names. Elements can be nested within
elements, so that astronomers can have
<star>
<G-type>Sol</G-type>
<luminosity units="solar luminosities">1.000</luminosity>
</star>
An "attribute" is introduced in the luminosity tag
to qualify the element; in
this case it gives the units of the contained value.
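As a minimal sketch of how software can consume such markup, the star example above can be read with Python's standard xml.etree.ElementTree module. The function name describe_star is an assumption made for illustration; it is not part of any ADC tool.

```python
import xml.etree.ElementTree as ET

# The star example from the text, reproduced verbatim.
STAR_XML = """
<star>
<G-type>Sol</G-type>
<luminosity units="solar luminosities">1.000</luminosity>
</star>
"""


def describe_star(xml_text):
    """Return (name, luminosity, units) from a <star> element.

    Hypothetical helper for illustration only.
    """
    star = ET.fromstring(xml_text.strip())
    name = star.findtext("G-type")        # text of the nested element
    lum = star.find("luminosity")         # the element itself
    return name, float(lum.text), lum.get("units")  # value plus attribute


if __name__ == "__main__":
    print(describe_star(STAR_XML))
```

The element content carries the value, while the attribute carries information about the value, here its units.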
Librarians can have mark up like:
<book>
<author>Mark Twain</author>
<title>Tom Sawyer</title>
</book>.
Now, this may not look all that revolutionary;
there has been a specification like this since 1986 called SGML from
which XML was derived and data providers have always marked up their
documents in some form. What is new is the consensus
among every major software manufacturer and data provider to
conform to this one particular style of mark up. With everyone
committed to the W3C XML specification of February 1998,
software to read in and write out XML can be devised for general
purposes, and cross-platform compatibility is ensured.
There is now a W3C standard interface, the Document Object Model (DOM),
through which applications access and modify data in an XML document.
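To make the DOM idea concrete, here is a small sketch that uses xml.dom.minidom, the DOM implementation in the Python standard library, on the book example from above. The function name retitle is hypothetical and chosen only for this illustration.

```python
from xml.dom.minidom import parseString

# The book example from the text, written on one line.
BOOK_XML = "<book><author>Mark Twain</author><title>Tom Sawyer</title></book>"


def retitle(xml_text, new_title):
    """Replace the text of the first <title> element and return the new XML.

    Hypothetical helper: it shows DOM access and modification, nothing more.
    """
    doc = parseString(xml_text)
    title = doc.getElementsByTagName("title")[0]  # DOM lookup by tag name
    title.firstChild.data = new_title             # modify the text node
    return doc.documentElement.toxml()            # serialize the root element


if __name__ == "__main__":
    print(retitle(BOOK_XML, "The Adventures of Tom Sawyer"))
```

The same calls (getElementsByTagName, firstChild, data) are available in DOM implementations for other languages, which is the point of having a standard interface.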
There is also a standard language for converting XML documents
into other types of documents such as HTML, PDF, or other XML languages. This
is called the eXtensible Stylesheet Language (XSL). With XSL it is easy
to tailor views on the data. Users can be fed selected pieces of data
appropriate to them, perhaps only those for which the user has permission
or clearance. The output format (styling) can differ significantly depending
on the user's request and the type of media for display. The display can be
formatted coarsely for small screens or with great finesse for high
resolution printing.
There is a standard way to link or address sections of other XML documents.
XML allows several files or sections of documents to be presented as a
single web page.
Finally, work is ongoing on a standard query language.
At the ADC we are presently using the XQL query language to help users find
the information they want. Other sites are trying out some of the other
contenders for the query standard.
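The flavor of such a query can be sketched with the limited path syntax supported by Python's xml.etree.ElementTree. This is a stand-in and not XQL itself, and the repository, dataset, title, and keyword element names below are invented for the example rather than taken from the ADC DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical repository listing; element names and contents are made up.
REPOSITORY_XML = """
<repository>
  <dataset id="d1"><title>Bright Star Positions</title><keyword>astrometry</keyword></dataset>
  <dataset id="d2"><title>Galaxy Redshifts</title><keyword>spectroscopy</keyword></dataset>
  <dataset id="d3"><title>Proper Motions</title><keyword>astrometry</keyword></dataset>
</repository>
"""


def titles_with_keyword(xml_text, keyword):
    """Return titles of datasets having a <keyword> child equal to `keyword`.

    The keyword is pasted into the path expression, so this sketch assumes
    it contains no quote characters.
    """
    root = ET.fromstring(xml_text.strip())
    matches = root.findall(f".//dataset[keyword='{keyword}']")
    return [d.findtext("title") for d in matches]


if __name__ == "__main__":
    print(titles_with_keyword(REPOSITORY_XML, "astrometry"))
```

Because the query names element types rather than byte offsets or column numbers, it keeps working when unrelated parts of the document change.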
Document Type Definition
One of our first tasks after defining the scope of the XML project was
to develop a Document Type Definition which specifies the structure and
content of our metadata documents. It had to support the
information in our legacy documents, allow for new content types
expected in the future, and provide additional markup for query.
This XML language would probably only be useful to
the ADC and CDS because it contains far more information than would be found
in typical tables of data. Additional work is ongoing between various
astronomical data centers to define a simpler mark up for the
interchange of data.
Our documents had to contain bibliographic information, detailed
descriptions of the holdings, links to related information at other
sites as well as within our database, and detailed
information on the structure of each table, down to the meaning and format of
each column of data.
The hierarchical structure of XML allowed the ADC XML documents,
datasets.xml, to be well organized. Logically connected information
could be next to (siblings), nested within (child/parent), or
used to modify (attribute) each other. One tends to make use of deeply
embedded structures with XML because one knows that there will be no
difficulties in parsing them.
As search tools were developed, many parts
of the structure had to be modified to enable clear and focused
searching. Recently, we developed an advanced search tool (see poster in this
session by Brian Holmes et al.) that allows one to search on any
element type in our DTD. Viewed from this applet the importance of proper
naming and organization of element types is seen more clearly.
As the legacy documents were translated into XML, additional changes were
incorporated to be sure that all the information was contained somewhere
and also to take into account information that was typically missing in
our documents.
It is not yet clear if or when there will be a final DTD because thus far
it has remained in continuous flux. As new applications are added to
the XML programmer's toolbox, one sees a need to modify the document format to
take best advantage of those developments. However, it is possible to
make the pipeline adjust automatically to changes in the DTD so that
no new programming is necessary each time changes are made to the
document structure.
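One way a tool can follow a changing DTD, offered here only as a sketch and not as a description of the ADC pipeline, is to read the element types from the DTD itself instead of hard-coding them. The DTD fragment below is invented for the example, and the regular expression handles only simple ELEMENT declarations.

```python
import re

# Hypothetical DTD fragment; it is NOT the ADC DTD.
SAMPLE_DTD = """
<!ELEMENT dataset (title, author+, table*)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT table (field+)>
<!ELEMENT field (#PCDATA)>
<!ATTLIST field units CDATA #IMPLIED>
"""

# Matches the element name in declarations such as "<!ELEMENT title (...)>".
ELEMENT_DECL = re.compile(r"<!ELEMENT\s+([\w.:-]+)\s")


def element_types(dtd_text):
    """Return the declared element names in declaration order."""
    return ELEMENT_DECL.findall(dtd_text)


if __name__ == "__main__":
    # A search form or converter could build its field list from this.
    print(element_types(SAMPLE_DTD))
```

When an element type is added to or removed from the DTD, code driven by this list picks up the change on its next run without being edited.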
Transforming Legacy Documents into XML
The conversion of ADC legacy documents into XML presented some
difficulties, which we have now mostly overcome. The main
difficulty was that the documents were in a form meant for human
readability and not necessarily for automated processing. Pieces of
information were delimited by carriage returns, indentation, a row of
"=" signs, parentheses, or whatever made sense to human perception.
In some cases no demarcation was used because the human eye is so good
at distinguishing the information type simply by the word pattern.
To convert to XML we devised
an XML mark up language that expresses in a natural way the type of
data desired, where it should be in the document, its beginning and ending
delimiters, and a character pattern to test whether it is indeed what
we want. A Perl script, txt2XML.pl, was written to interpret this set
of rules, read in the plain text documents, and output XML documents.
A graphical interface (using Perl/Tk) was added to help write the rules
and to locate spots within the documents that deviate from the standards.
We located inconsistencies in several hundred of our documents with
this graphical interface, and after correcting all of the documents that
required only minor adjustments, we were able to completely transform
2445 of 2496 documents. The remaining 51 documents will require
significant rewriting to bring them up to standard.
We now have available one of the largest repositories of complex XML
documents in existence today.
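The rule-driven approach described in this section can be sketched in a few lines of Python. This is not txt2XML.pl and does not use its rule language. The rule fields (element, start, end, pattern), the sample text, and the function name text_to_xml are all assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical rules: element to emit, start and end delimiters, and a
# pattern that the extracted text must match in full.
RULES = [
    {"element": "title", "start": "Title:", "end": "\n", "pattern": r".+"},
    {"element": "author", "start": "Author:", "end": "\n", "pattern": r"[A-Z][\w.\- ]+"},
    {"element": "year", "start": "Year:", "end": "\n", "pattern": r"\d{4}"},
]


def text_to_xml(text, rules, root_name="dataset"):
    """Apply the rules to plain text; return (xml_string, problems)."""
    root = ET.Element(root_name)
    problems = []
    for rule in rules:
        begin = text.find(rule["start"])
        if begin < 0:
            problems.append(f"{rule['element']}: start delimiter not found")
            continue
        begin += len(rule["start"])
        end = text.find(rule["end"], begin)
        if end < 0:
            end = len(text)  # no end delimiter: take the rest of the text
        value = text[begin:end].strip()
        if re.fullmatch(rule["pattern"], value):
            ET.SubElement(root, rule["element"]).text = value
        else:
            problems.append(f"{rule['element']}: unexpected value {value!r}")
    return ET.tostring(root, encoding="unicode"), problems


# Made-up legacy text with one deliberate deviation (the year).
LEGACY = "Title: Example Star Catalog\nAuthor: A. Observer\nYear: 19xx\n"

if __name__ == "__main__":
    xml_text, problems = text_to_xml(LEGACY, RULES)
    print(xml_text)
    print(problems)
```

The list of problems plays the role that the graphical interface plays in the text: it points to the places where a document deviates from the expected form, so that either the rule or the document can be corrected.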
The eXtensible Data Format
Although most of the holdings of the ADC are tables of data, there are
also a significant number of images, data cubes, and spectra. While developing
the document structure for tables, we noticed an interesting thing. The
XML description of the headings of a table could be viewed as values
along an axis, much like one would have on a spectrum or a scientific image.
And, as it happens, we were
recently hampered in the development of a tool for displaying images and
spectra because the standard formats for scientific data do not have a unique
way of describing axes. With some minor changes to our
schema for holding tables, a document type was defined that
can describe data of any number of dimensions, standardize the
inclusion of axes along each dimension, and handle vectors and animations,
and that is actually a better way to handle tables. It also handles
data descriptors (headers) better because of its superior
support for metadata of this kind. This new extensible
data format is called XDF. We are still working on the details
of including binary data when that is needed, but it looks
ready now to handle all of the data collected over the 23-year
history of the ADC.
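The observation that table headings behave like values along an axis can be illustrated with a short sketch. The array, axis, value, and data element names are invented for this example and are not the XDF definition. The sketch also assumes the data are stored with the last axis varying fastest.

```python
import xml.etree.ElementTree as ET
from itertools import product

# Hypothetical document: a 2 x 3 table described by two axes.
ARRAY_XML = """
<array>
  <axis name="star"><value>A</value><value>B</value></axis>
  <axis name="column"><value>ra</value><value>dec</value><value>vmag</value></axis>
  <data>10.5 -3.2 7.1 200.1 45.0 9.8</data>
</array>
"""


def read_array(xml_text):
    """Return (axes, cells); cells maps a tuple of axis values to a number."""
    root = ET.fromstring(xml_text.strip())
    axes = [
        (ax.get("name"), [v.text for v in ax.findall("value")])
        for ax in root.findall("axis")
    ]
    flat = [float(x) for x in root.findtext("data").split()]
    # product() walks the axes with the last one varying fastest.
    keys = list(product(*(values for _, values in axes)))
    if len(keys) != len(flat):
        raise ValueError("data length does not match the axis sizes")
    return axes, dict(zip(keys, flat))


if __name__ == "__main__":
    axes, cells = read_array(ARRAY_XML)
    print(cells[("B", "vmag")])
```

Nothing in read_array is specific to two dimensions: adding a third axis element, for instance one for epoch, turns the table into a data cube that the same code reads unchanged.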
The ADC XML Data Repository Today
Now that we have an XML repository it is easy to make use of a number of
XML applications that are already available. XML editors exist that
allow one to easily make changes to documents by snapping in tags
from a menu that is set by the DTD. These editors do not permit one to
place tags in the wrong location or to leave out required information.
There are applications that create web forms so that documents can be
created on the web by simply filling in a form. We are using form
generators to allow the scientists who authored the tables to make changes
to our documents.
We are using XSL to transform the XML documents into HTML
for viewing on web browsers. The user can choose which parts of the
documents they wish to see by choosing sections of the XSL to use.
XSL is also used to transform the XML output of XQL queries into HTML for
better readability and to create links to further information.
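The idea of showing only the sections a user has chosen can be sketched in plain Python. This is a stand-in for an XSL style sheet, not XSL itself, and the dataset document, its element names, and the function to_html are made up for the example.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical metadata document; not an actual ADC record.
DATASET_XML = (
    "<dataset><title>Example Catalog</title><author>A. Observer</author>"
    "<abstract>Positions &amp; magnitudes.</abstract></dataset>"
)


def to_html(xml_text, sections):
    """Render only the requested child elements as an HTML definition list."""
    root = ET.fromstring(xml_text)
    rows = []
    for name in sections:                 # keep the order the user asked for
        for elem in root.findall(name):   # direct children with that tag
            rows.append(f"<dt>{escape(name)}</dt><dd>{escape(elem.text or '')}</dd>")
    return "<dl>" + "".join(rows) + "</dl>"


if __name__ == "__main__":
    print(to_html(DATASET_XML, ["title", "abstract"]))
```

In an XSL style sheet the same selection would be expressed declaratively as templates matching the chosen element types, which is what allows a browser or server to apply it without custom code.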
The Future of the ADC XML Data Repository
- Browsing the Repository - In the future XML can be sent to browsers
with XSL style sheets. The users will build their style
sheets by selecting the look and content of the pages they desire.
- Visualization - In the coming months we will be designing
visualization tools that allow one to extract interesting parts of
a dataset and then sort or plot them. One should be able to click on points in
the plot to request more detailed information. Most of this had been
developed before XML existed, but with XML and applets it can be highly
interactive and run distributed among the clients.
- Searching - Our advanced search page adapts to changes in the DTD.
This means that it can incorporate any DTD from any site. We are
working towards making it useable as a general data finding tool that can grab
data from many astronomical archives.
- Personal Repositories - It is now possible for individuals
to create their own repository by placing XML versions of their data on
web sites. An XQL or XSL processor can sit there as well to allow
detailed query and response. No programming necessary, almost.