Tuesday, April 11, 2006

StAX of the future

In his blog, Paul Sandoz just posted an entry about what he would like to see in store for the StAX API. I worked with a number of people on the typed extensions to the XmlReader API in v2.0 of the .NET Framework. Adding typed extensions to a heavily text-oriented API (as all XML parsing APIs tend to be) is a challenge, but I agree with Paul that StAX is the place to add them. It is almost impossible to add a good typed extension to an API like SAX, where the parser is in control. But with an API like StAX, where the application is in control, typed access can be added without disrupting the original API.

Off the top of my head, these are the issues I remember being difficult when adding a typed API to XmlReader:

  • What happens when the content isn't already typed? One of my goals in the XmlReader APIs was that a client of the APIs should not need to know whether the data on the wire was typed. That way the client can use the same code for text XML and binary XML. It also means that the typed API extensions serve as potentially useful utility methods for all users of the API, not just users building on a parser for a non-text serialization.
  • How to handle comments and processing instructions? XQuery really made a mess of this, in my opinion. According to the XQuery data model, an element containing just 12 and an element containing 1 and 2 separated by a comment both have the decimal value 12. A user of the API should be able to do something like this:

    reader.readNext(ELEMENT, nsNone, "int");
    int value = reader.readValueAsInt();
    reader.require(END_ELEMENT);

    Should the comment be exposed after the value, or just lost entirely?
  • Dates... There are lots of ways to format dates, and all sorts of complications with time zones. The 'standard' for XML is ISO 8601, but that isn't the default for any of the date/time classes in Java/.NET/Python/etc., as far as I know. In .NET we settled on requiring ISO 8601, leaving callers to parse any other format from the text themselves.
  • Where to end. Where should the parser be positioned when the call returns? If you want to support skipping over comments and processing instructions, it must move at least to the close tag. But is it even necessary to leave it there? Why not skip past that? It really depends on the other methods of the API.
  • What happens if the content does not match what is expected? This may mean that the element has sub-elements, that the content is empty, or that it can't be returned as the type requested. Where is the parser positioned after the error?
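
For comparison, here is a minimal Java sketch of how several of the questions above play out when a typed read is layered on StAX's existing getElementText(). The readElementAsInt and parseFirstInt helpers are hypothetical names used only for illustration, not part of StAX, and the sketch assumes the text-based parser bundled with the JDK.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class TypedElementSketch {

    // Hypothetical helper (not a StAX method): reads the text content of the
    // current START_ELEMENT and parses it as an int.
    static int readElementAsInt(XMLStreamReader reader) throws XMLStreamException {
        reader.require(XMLStreamConstants.START_ELEMENT, null, null);
        // getElementText() concatenates character data, skips comments and
        // processing instructions, and stops on the matching END_ELEMENT.
        String text = reader.getElementText();
        try {
            return Integer.parseInt(text.trim());
        } catch (NumberFormatException e) {
            // At this point the cursor is already on END_ELEMENT.
            throw new XMLStreamException(
                    "Not an int: '" + text + "'", reader.getLocation(), e);
        }
    }

    // Hypothetical driver: parses a document whose root element is <int>.
    static int parseFirstInt(String xml) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        try {
            reader.nextTag(); // move from START_DOCUMENT to the root element
            int value = readElementAsInt(reader);
            reader.require(XMLStreamConstants.END_ELEMENT, null, "int");
            return value;
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(parseFirstInt("<int>12</int>"));              // prints 12
        System.out.println(parseFirstInt("<int>1<!-- note -->2</int>")); // prints 12
    }
}
```

With this approach the comment is simply lost, the cursor ends on END_ELEMENT, and a value that fails to parse is only reported after the cursor has moved to the close tag. An element with sub-elements makes getElementText() itself throw an XMLStreamException.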


I think StAX is a significant improvement over SAX, and would love to see this additional evolution happen. Today the XML APIs on the Java platform seem to be either too low-level (SAX) or too abstract (JAXB). StAX with some typed extensions and some helper methods to simplify its use in real code would go a long way toward filling some of that gap. The important evolution of current parsing/serializing APIs should be about simplifying the code that the client must build on top of the API. Most people writing code using these APIs are not XML gurus, and the API should make it easy for them to do the right thing, and hide more of XML's complexities.
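
As one example of the kind of helper method meant here, the "require ISO 8601" choice described for dates above could be sketched in Java roughly as follows. This assumes Java 8 or later for java.time; parseDateTime is a hypothetical name, and treating a value without a time zone as being in a caller-supplied default offset is an assumption of this sketch, not something any XML API prescribes. It does not try to cover every lexical form of xs:dateTime.

```java
import java.time.LocalDateTime;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.temporal.TemporalAccessor;

final class IsoDateTimeSketch {

    // Hypothetical helper: accepts only ISO 8601 date-time text. Anything
    // else fails with DateTimeParseException, leaving other formats to the
    // caller.
    static OffsetDateTime parseDateTime(String text, ZoneOffset defaultOffset) {
        // Try "with offset" first, then fall back to "no offset".
        TemporalAccessor parsed = DateTimeFormatter.ISO_DATE_TIME.parseBest(
                text.trim(), OffsetDateTime::from, LocalDateTime::from);
        if (parsed instanceof OffsetDateTime) {
            return (OffsetDateTime) parsed;
        }
        // Assumption: a value without a zone is interpreted in defaultOffset.
        return ((LocalDateTime) parsed).atOffset(defaultOffset);
    }

    public static void main(String[] args) {
        System.out.println(parseDateTime("2006-04-11T21:08:00Z", ZoneOffset.UTC));
        System.out.println(parseDateTime("2006-04-11T21:08:00", ZoneOffset.ofHours(-7)));
    }
}
```

A typed accessor on the reader could then combine the element text with a helper like this one, so that the date handling stays out of client code.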

2 Comments:

Blogger Umut Alev said...

XmlLite is the native rewrite of the managed XmlReader and XmlWriter. We thought about adding type support to XmlLite but decided a higher-level reader should do this.

http://windowssdk.msdn.microsoft.com/library/default.asp?url=/library/en-us/XMLLite/html/65c73fa3-be23-4a22-bbd1-81c0dc243b16.asp

9:08 PM  
Blogger Unknown said...

Actually there are now ongoing discussions wrt. adding typed accessors (and of course, writers) to Stax, on the stax_builders mailing list. You might be interested in reading the archives or joining? Hopefully we will finally get prototype implementations during the summer, with work on Stax 2.0 (etc.) starting after that or even concurrently.

Some comments wrt. current ideas: the idea is to expand on getElementText(), which means PIs and comments would be swallowed, and the cursor moved on top of END_ELEMENT. It means the operation is not idempotent (unlike getText()), but it allows optimal efficiency and simplifies state handling. And since getElementText() already does this, it's consistent with the existing Stax design.

Dates are challenging, of course. One thought is to support only dateTime by default, and let apps (or helper libs) deal with more obscure types.

There are lots of open issues, such as the ones you mention: parity between textual and binary XML, and if and how to handle streaming of large binary data (I hope yes, streaming or chunked). But it seems feasible to define something functional, useful and somewhat intuitive.

9:08 AM  
