NASA
  Beyond FITS - IAU General Assembly August 12 2000
   Ed Shaya (Astronomical Data Center)

Background

The adoption of XML (eXtensible Markup Language) by the World Wide Web Consortium (W3C) and the commercial software community has revolutionized the field of data management and interchange. This new vehicle for data transfer is a crystallization of the experiences and lessons of several fields over several years, including professional publishing in SGML, data and forms handling on the web, database query and distribution, and browser development. Its features include, first and foremost, widespread adoption and standardization of the specification, the parsers, and the style presentation languages. Another feature is self-description: a document links to its document type definition, which allows its structure to be validated. A document can also link to the applications needed to view it, or to additional files that should be included automatically.

Best practice in XML requires a clear separation between information content and display style. Documents contain the data and the informational tags that help readers understand and locate the data; each document then links to a separate document that provides display information, which can depend on the properties of the media.
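
As a rough illustration of this separation, the sketch below keeps a small content-only document apart from two display routines chosen per medium. The element names (`catalog`, `star`, `ra`, `dec`), the approximate coordinate values, and the two render functions are assumptions made up for this example; they are not part of any ADC language, and a real system would more likely use a stylesheet language than Python functions.

```python
import xml.etree.ElementTree as ET

# Hypothetical content document: data and descriptive tags only,
# with no display information. Coordinates are approximate.
DOC = """<catalog>
  <star name="Vega"><ra unit="deg">279.23</ra><dec unit="deg">38.78</dec></star>
  <star name="Sirius"><ra unit="deg">101.29</ra><dec unit="deg">-16.72</dec></star>
</catalog>"""


def rows(xml_text):
    """Extract (name, ra, dec) tuples from the content document."""
    root = ET.fromstring(xml_text)
    return [
        (s.get("name"), float(s.findtext("ra")), float(s.findtext("dec")))
        for s in root.findall("star")
    ]


# Two hypothetical "styles", kept apart from the data and chosen per medium.
def render_wide(data):
    """Fixed-width columns, e.g. for a desktop browser."""
    return "\n".join(f"{n:<8}{ra:>10.2f}{dec:>10.2f}" for n, ra, dec in data)


def render_narrow(data):
    """Compact one-line entries, e.g. for a handheld device."""
    return "\n".join(f"{n}: {ra:.1f}, {dec:.1f}" for n, ra, dec in data)


data = rows(DOC)
print(render_wide(data))
print(render_narrow(data))
```

The same parsed content feeds both renderers, so changing the presentation never requires touching the data document.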

At the ADC we have been developing an XML language for astronomical data centers. Recently we started experimenting with an eXtensible Data Format (XDF) that can encompass a wide class of scientific data. The goal is to examine the features needed for a common interchange format among the various branches of science. XDF encompasses complex hierarchical data structures, n-dimensional arrays merged with coordinate information, tables of any dimension merged with their field metadata, searchable and editable metadata, and extensibility to new features.

The rapid advance of computer science and technology is a serious obstacle to the success of any data format. What was optimal for data storage at the last IAU General Assembly three years ago has been made obsolete by rapid developments in the web, browsers, XML, Java, etc. And we need to face the hard reality that what I say today will be obsolete by the next General Assembly. One cannot freeze development and ignore technological advances, although that seems like the easy way, because that way leads to being sidestepped and forgotten as enterprising people create their own data formats that do take advantage of the latest and greatest technology. Rather, we need to work together to update constantly and, if needed, migrate to new formats, just as we have been doing all along whenever new storage media appear.

I. Requirements for Data Formats of the Future

  1. Basic Requirements

    1. Simple
      • Easy to read and write.
      • Short learning curve.
    2. Resistant to corruption.
      • Still usable if a few bits are lost.
      • Issue with compressed data.
    3. Nonproprietary
    4. Understood for 100 years.
      • In 20 years computers may not use base 2!
      • Plan to migrate!

  2. Data Interchange

    1. Viewable by scientists and the masses
      • In a browser.
      • On a palm pilot.
    2. Absorbed easily by a database.
    3. Translatable into other formats (languages).
    4. Standard query directly on file.
      • Database composed of files.
      • As scalable as your file system.

  3. Data clearly embedded in coordinate space.

    1. Scalar and vector fields have continuous space.
    2. Tables have parameter space.
    3. Make use of similarities yet distinguish differences.
      • Images, vectors, and tables are all N-D arrays.
      • General way to enter embedded space.
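
One way to picture "data embedded in coordinate space" is a single array type whose every axis carries its own coordinate values. The sketch below is only an illustration under that assumption; the `Axis` and `NDArray` classes and their fields are hypothetical names for this example and do not describe an existing format or API.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Axis:
    name: str
    unit: str
    values: Sequence  # coordinate value at each index along this axis


@dataclass
class NDArray:
    axes: List[Axis]
    data: list  # nested lists, one nesting level per axis

    def shape(self):
        return tuple(len(a.values) for a in self.axes)

    def at(self, *index):
        """Return (coordinates, value) for an integer index tuple."""
        cell = self.data
        for i in index:
            cell = cell[i]
        coords = {a.name: a.values[i] for a, i in zip(self.axes, index)}
        return coords, cell


# An image: two axes in a continuous (here pixel) space.
image = NDArray(
    axes=[Axis("y", "pixel", [0, 1]), Axis("x", "pixel", [0, 1, 2])],
    data=[[10, 11, 12], [20, 21, 22]],
)

# A table: the same structure, but the axes span a parameter space
# (row number and field name).
table = NDArray(
    axes=[Axis("row", "", [0, 1]), Axis("field", "", ["ra", "dec"])],
    data=[[279.23, 38.78], [101.29, -16.72]],
)

print(image.shape(), image.at(1, 2))
print(table.shape(), table.at(0, 1))
```

The image and the table share one access path; only the meaning of the axes differs, which is the similarity and the difference the list above asks a format to capture.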

  4. Encompass many fields of science.

    1. View astronomy as one branch of a tree.
    2. Exchange data with biologists, chemists, physicists.
    3. Possible because math is common to all.
    4. Perhaps MathML should be a starting point?
    5. Reflect the scientists' way of thinking:
      • ra and dec are a pair (grouping).
      • Ergs and joules are the same type (substitution).
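
The grouping and substitution ideas can be sketched as follows, assuming a small lookup table of conversion factors. The `Quantity` and `SkyPosition` classes and the `TO_BASE` table are hypothetical names for this example; the only physical fact used is that 1 erg equals 1e-7 joule.

```python
from dataclasses import dataclass

# Conversion factors to a base unit for each physical type.
# 1 erg = 1e-7 joule.
TO_BASE = {"energy": {"joule": 1.0, "erg": 1e-7}}


@dataclass(frozen=True)
class Quantity:
    """Substitution: ergs and joules are the same type, so either may appear."""
    value: float
    unit: str
    kind: str = "energy"

    def to(self, unit):
        factor = TO_BASE[self.kind][self.unit] / TO_BASE[self.kind][unit]
        return Quantity(self.value * factor, unit, self.kind)


@dataclass(frozen=True)
class SkyPosition:
    """Grouping: ra and dec travel together as one pair."""
    ra: float
    dec: float


flux_energy = Quantity(3.0e7, "erg")
print(flux_energy.to("joule"))

position = SkyPosition(ra=279.23, dec=38.78)
print(position)
```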

  5. Accommodate an object oriented approach.

    1. [Java, C++, Perl, C#, JavaScript]
    2. Objects have methods associated with them:
      • string.length()
      • array1.dimension()
      • table3.field21.units()
      • For large datasets: array.append.read(tile3)
      • Right click on icon to see methods.
    3. Data Type Casting
      • string, floats, array, axis, spectrum, etc
      • operators and functions perform accordingly
    4. Inheritance
      • Create new object types by adding features to old object types.
      • Layer with most general on bottom and each layer gets more specific
        • science (with math as a basis)
          • observational
            • astronomy
              • high energy
                • X-ray
                  • CCD X-ray detector
                    • ASCA mission
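
The layering above might be sketched as a class hierarchy in which each layer adds features to the one beneath it. The class names, attributes, and the `layers` helper below are hypothetical and only mirror the first few levels of the outline; they are not an existing interface.

```python
class ScienceData:
    """Most general layer: values with units."""

    def __init__(self, values, units):
        self.values = list(values)
        self.units = units

    def length(self):
        return len(self.values)

    def layers(self):
        """List the layers of this object, most general first."""
        names = [c.__name__ for c in type(self).__mro__ if c is not object]
        return names[::-1]


class ObservationalData(ScienceData):
    """Adds who or what made the observation."""

    def __init__(self, values, units, observer):
        super().__init__(values, units)
        self.observer = observer


class AstronomyData(ObservationalData):
    """Adds a position on the sky."""

    def __init__(self, values, units, observer, ra, dec):
        super().__init__(values, units, observer)
        self.ra = ra
        self.dec = dec


class XRayData(AstronomyData):
    """Adds a band label; a mission-specific class could extend this further."""

    band = "X-ray"


counts = XRayData([12, 15, 9], "counts", observer="ASCA", ra=279.23, dec=38.78)
print(counts.layers())
print(counts.length(), counts.units, counts.band)
```

Methods defined at the bottom layer, such as `length()`, remain available at every more specific layer without being rewritten.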

  6. Structured Data

    1. Hierarchical (tree) layout
      • Metadata organization. Trees are 2-D objects, so related information can be packed closer together than in a 1-D list.
      • Enables detailed or fuzzy query
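
A minimal sketch of detailed versus fuzzy query on a tree, using the path syntax supported by Python's standard `xml.etree.ElementTree`. The document and its element names (`dataset`, `observation`, `instrument`, `axis`, `calibration`) are made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical hierarchical metadata.
DOC = """<dataset>
  <observation mission="ASCA">
    <instrument name="CCD">
      <axis name="energy" unit="keV"/>
      <axis name="time" unit="s"/>
    </instrument>
  </observation>
  <calibration>
    <axis name="energy" unit="keV"/>
  </calibration>
</dataset>"""

root = ET.fromstring(DOC)

# Detailed query: follow one exact path down from the root.
detailed = root.findall("./observation/instrument/axis")

# Fuzzy query: any axis, at any depth, whose unit is keV.
fuzzy = root.findall(".//axis[@unit='keV']")

print([a.get("name") for a in detailed])
print(len(fuzzy))
```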

  7. Carry functions and variables?

    1. Operators: +, -, x, /, log, exp
    2. Functions: clip, indexGenerator
    3. Loops and Case Handling
    4. Carry information to recreate reduced data from raw + calibration data.

  8. Links, Variables, and Defaults

    1. Data may be a link to externally maintained data.
      • A file system becomes a distributed database.
      • An XML document can be a mini distributed database system.
      • Reduced data is always the latest version.
    2. Pointers (reference) can be used
      • For inheritance:
        • "Like THAT axis but add this value"
      • For relationships:
        • "Parallel to THAT axis"
      • Repeated sections are just referred to.
      • Full descriptions are default values.
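
The "like THAT axis but add this value" idea can be sketched as reference resolution: a short entry points at a full description, inherits its values as defaults, and overrides only what differs. The `resolve` function, the `ref` key, and the axis entries below are hypothetical names for this example, and the sketch assumes the references contain no cycles.

```python
def resolve(name, registry):
    """Return a full description, following 'ref' links and applying overrides.

    Assumes the chain of references contains no cycles.
    """
    entry = registry[name]
    if "ref" not in entry:
        return dict(entry)
    full = resolve(entry["ref"], registry)  # the referenced description supplies defaults
    full.update({k: v for k, v in entry.items() if k != "ref"})
    return full


# Hypothetical axis descriptions: one full, one expressed as a reference.
AXES = {
    "wavelength": {"unit": "nm", "start": 400.0, "step": 1.0, "count": 300},
    "wavelength_shifted": {"ref": "wavelength", "start": 450.0},
}

print(resolve("wavelength_shifted", AXES))
```

Only the changed value is stored in the referring entry, so a repeated section costs one pointer rather than a full copy.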

II. Advantages of XML

  1. Universal acceptance
  2. Separation of information and presentation
  3. Automatic validation
  4. File inclusion (Internal and External Entities)
  5. Hierarchical
  6. Parsers
  7. Stylesheet languages
  8. Field specific languages
  9. Extensible namespace

III. XDF

  1. Structures, arrays, parameters, axes
  2. Clear coordinate information
  3. Unrestrictive binary and ASCII formats.
  4. Examples: EOS, astronomy, biology, etc.
  5. OO Perl and Java application interfaces
  6. FITSML - adopt FITS keywords and an XML kernel
  7. Converters between FITS, FITSML, HDF, and CDF.

IV. Future of Data Formats

  1. User definable coordinate systems.
  2. Automatic archiving.
  3. Totally interoperable applications and query.
  4. Handling terabytes efficiently.

V. The Tower of Babel


  Project PI: Ed Shaya
  XML Staff: Jim Blackwell, Jim Gass, Brian Holmes & Brian Thomas
  NASA Official: Cynthia Y. Cheung