Friday, February 24, 2006

SAX vs StAX a.k.a. Push vs Pull

I've been writing some code to load XML configuration files and populate some Java objects with the contents. None of the XML binding libraries work because both the schema for the config file and the Java objects are already defined, and while the mapping between the two is trivial, it isn't a brain-dead one-to-one mapping. I like to think I know a thing or two about programming with XML, so I set about implementing the code to load the config files. If I want to run on a clean Java 1.5 install, there are 3 options: SAX, DOM, StAX. If I want to run on a clean 1.4 install, StAX drops off that list. I realize there are other APIs out there, mostly DOM alternatives, but they aren't part of the standard install, so I wanted to avoid them.

The format was pretty trivial, and I knocked together a DOM implementation in an hour or two. Which quickly led me into my coworker's office to ponder whether any of the designers of the DOM API ever tried to use the beast they were building. I guess I'm still spoiled from years of working on Microsoft platforms. I wrote script code for this kind of thing all the time, and would use node.SelectSingleNode() and node.SelectNodes() as my primary way of navigating the DOM tree. In Java, I had to write dramatically more code. Worse, if I wanted code with half-decent error checking, the code rapidly bloated up with null checks. So I ended up writing a set of helper methods that allowed me to write SelectNode()-like code. Why isn't this just part of the APIs? I understand the controversy over query languages, etc... but come on people! Pick something reasonable and move on. More importantly, let developers move on to the real task at hand, rather than struggling to code basic tree-walks!
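
For a sense of what such helpers can look like, here is a minimal sketch, not the helpers described above. It leans on javax.xml.xpath, which ships with Java 5 but not 1.4, so on 1.4 the same methods would have to be hand-written tree walks. The class name DomSelect and its method names are made up for illustration, chosen to echo the Microsoft-style calls.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class; names echo SelectSingleNode()/SelectNodes().
final class DomSelect {

    // Returns the first node matching the XPath expression, or null if none.
    static Node selectSingleNode(Node context, String expr) {
        try {
            // A fresh XPath per call keeps the sketch simple; it is not the fastest option.
            return (Node) XPathFactory.newInstance().newXPath()
                    .evaluate(expr, context, XPathConstants.NODE);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + expr, e);
        }
    }

    // Returns all matching nodes; the list is empty (not null) when nothing matches.
    static NodeList selectNodes(Node context, String expr) {
        try {
            return (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expr, context, XPathConstants.NODESET);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + expr, e);
        }
    }

    // Folds the null check into one place: text of the match, or a default.
    static String selectString(Node context, String expr, String defaultValue) {
        Node n = selectSingleNode(context, expr);
        return n == null ? defaultValue : n.getTextContent();
    }

    // Convenience for trying the helpers out on an in-memory document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }
}
```

With something like this in place, a lookup collapses to one line, e.g. DomSelect.selectString(doc, "/config/server/port", "80"), and the null checks live in the helper rather than at every call site.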

These configs were small, so the memory footprint of the DOM isn't that big a deal, but I'd like to make it faster, which means using a parsing API rather than an in-memory-tree API. Since I want to support Java 1.4, I'm forced to use SAX. You will notice some reluctance in my voice. I've implemented a few XML loaders in my day. Not as many as many of you, but enough. Writing business logic on top of SAX is right up there with visiting the doctor, among my least favorite things. Why? Because the SAX API was designed to make the parser developer's life easier, not yours. Sure, SAX parsers are usually faster than StAX or the like, but only 1 in 1000 apps really benefits from that 5%, because hooking into a SAX parser is so damn much more work than using a StAX parser. When I'm writing something like my config loader, I want my code to be reasonably fast and as simple as possible. The config loader is not a high priority on my todo list, but it must be done. When using SAX, I have to turn the problem inside out, because the parser is in control, not my code. I'd give an example, but even a trivial example will take more time than I wanted to devote to this write-up.
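
For readers who haven't fought with SAX, here is a rough sketch of what "inside out" means in practice. It is not the loader discussed in this post: the config format (server elements with a name attribute and a port child) and the class name SaxConfigLoader are invented for illustration. The point is that the parser calls the handler, so everything the loader needs to remember between callbacks has to live in fields.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical format: <config><server name="a"><port>8080</port></server>...</config>
final class SaxConfigLoader extends DefaultHandler {
    private final Map<String, Integer> ports = new LinkedHashMap<String, Integer>();
    private final StringBuilder text = new StringBuilder();
    private String currentServer; // which <server> we are inside, if any
    private boolean inPort;       // whether characters() should be collected

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("server".equals(qName)) {
            currentServer = atts.getValue("name");
        } else if ("port".equals(qName)) {
            inPort = true;
            text.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times for one element, so append rather than assign.
        if (inPort) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("port".equals(qName)) {
            ports.put(currentServer, Integer.valueOf(text.toString().trim()));
            inPort = false;
        } else if ("server".equals(qName)) {
            currentServer = null;
        }
    }

    // Parses the XML and returns server name -> port, in document order.
    static Map<String, Integer> load(String xml) {
        SaxConfigLoader handler = new SaxConfigLoader();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse config", e);
        }
        return handler.ports;
    }
}
```

Even for two element types, the logic is scattered across three callbacks and two state fields. Each additional level of nesting in the format tends to add more of both.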

In no way do I mean to chastise the authors of SAX the way I wish I could the DOM API authors. When SAX was first defined, this was the normal way to hook up to parsers. Admittedly, James Clark's parsers have long had pull-model-like APIs, but most parsers had APIs like SAX. The problem is that the world has moved on. StAX-like APIs are just better. Part of why so many people need XML binding in Java is that the parsing APIs are so difficult to use. Worse, I think SAX actually encourages people to write fragile XML loaders that can't handle valid uses of XML. One example I've seen many times is that people assume an element will have a single text-node child. What if I want to add a comment? Or maybe use an entity to avoid retyping some common text? Sure, it is possible to implement this correctly in SAX, but since it is so damn hard to do the simpler parts, many people never get around to worrying about this... Actually, most developers who code up this kind of thing don't know enough about XML to even know that this is a danger.
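
The single-text-node assumption can be illustrated with a small sketch. The class and handler names here are hypothetical, and the handlers are deliberately stripped down to a one-element document. The fragile version keeps only the most recent characters() chunk; the sturdier one appends every chunk until the element closes, which is what the SAX contract calls for, since a parser is free to deliver text in pieces.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; both handlers assume a document with one text-only element.
final class SaxTextDemo {

    // Fragile: assumes characters() fires exactly once per element.
    static final class LastChunkHandler extends DefaultHandler {
        String value;

        @Override
        public void characters(char[] ch, int start, int length) {
            value = new String(ch, start, length); // overwrites any earlier chunk
        }
    }

    // Sturdier: accumulates every chunk and publishes the text at endElement.
    static final class AccumulatingHandler extends DefaultHandler {
        private final StringBuilder buf = new StringBuilder();
        String value;

        @Override
        public void characters(char[] ch, int start, int length) {
            buf.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            value = buf.toString();
            buf.setLength(0);
        }
    }

    static String readFragile(String xml) {
        LastChunkHandler h = new LastChunkHandler();
        parse(xml, h);
        return h.value;
    }

    static String readRobust(String xml) {
        AccumulatingHandler h = new AccumulatingHandler();
        parse(xml, h);
        return h.value;
    }

    private static void parse(String xml, DefaultHandler handler) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }
}
```

Feed both an element such as <name>Hello<!-- note --> World</name>: the accumulating handler yields the full text, while the fragile one typically ends up with only the part after the comment, because the comment splits the text into separate characters() calls.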

StAX isn't a total solution. It is definitely easier to use, but it suffers from some of the same problems. Where are the helper methods that read an element's content and return a String? Why would I need that, you ask? The example I used above, of using an entity or introducing a comment in the middle of some text, causes similar problems for StAX programmers as it does for SAX programmers. In StAX all you need is a single helper method to abstract it away and you are golden, while in SAX you need a much more complicated solution.
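
A minimal sketch of that single helper, written against javax.xml.stream.XMLStreamReader. The class name StaxText and both method names are invented for illustration. The helper expects the reader to be positioned on a START_ELEMENT and leaves it on the matching END_ELEMENT, gathering text across comments, CDATA sections and entity references along the way.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class for pulling text out of text-only elements.
final class StaxText {

    // Call while positioned on START_ELEMENT; returns positioned on its END_ELEMENT.
    static String readElementText(XMLStreamReader r) throws XMLStreamException {
        StringBuilder sb = new StringBuilder();
        while (r.hasNext()) {
            switch (r.next()) {
                case XMLStreamConstants.CHARACTERS:
                case XMLStreamConstants.CDATA:
                case XMLStreamConstants.SPACE:
                case XMLStreamConstants.ENTITY_REFERENCE:
                    sb.append(r.getText());
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    return sb.toString();
                case XMLStreamConstants.START_ELEMENT:
                    throw new XMLStreamException("Expected a text-only element", r.getLocation());
                default:
                    break; // comments and processing instructions are skipped
            }
        }
        throw new XMLStreamException("Unexpected end of document");
    }

    // Pull-style usage: my code drives the parser and asks for what it wants.
    // Returns the text of the first element with the given name, or null if absent.
    static String firstText(String xml, String elementName) {
        try {
            XMLStreamReader r = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (r.hasNext()) {
                if (r.next() == XMLStreamConstants.START_ELEMENT
                        && elementName.equals(r.getLocalName())) {
                    return readElementText(r);
                }
            }
            return null;
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }
}
```

Worth noting: XMLStreamReader also declares a getElementText() method that covers the text-only case, so depending on the implementation in use, a hand-rolled helper like this may only be needed when different behavior is wanted, such as tolerating child elements.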

Ultimately, I'm suggesting that Sun (or whoever decides these things... I'm still figuring out how the Java 'standardization' process works) start looking at how to make XML easier to use. You don't need XML language integration (although that might not hurt). You really just need better APIs.

2 Comments:

Anonymous said...

Well, I'm one of the guilty parties responsible for the W3C DOM API. You should have stuck around Microsoft, you could chastise me anytime you wanted by walking down the hall :-)

The theory IIRC was that usability wasn't a high priority; DOM was more of an "assembly language for the XML datamodel" on top of which people would build convenience libraries. For some reason that never happened. JDOM and XOM reinvented the low level rather than encapsulating it.

Also remember that DOM was designed as an abstraction of a lot of old SGML and HTML APIs; combine that with the political realities of Design by Committee, and you get lowest common denominator stuff like DOM.

I agree very much with the specific points about text nodes and the XPath methods. FWIW I almost quit the working group over the hideousness that is text nodes ... but alas, I was hosting the meeting where that was decided, and that would have been awkward :-) By the time I calmed down, I guess I had learned to stop worrying and love Worse is Better. As for XPath, I remember arguing that the Microsoft extensions such as SelectNodes() hit the sweet spot [this was many years before I was fitted for a Darth Vader helmet] but no, that doesn't expose the subtle differences between the DOM and XPath data models.

It's a really strange experience to revisit all this 8 years later as the program manager for XLinq (http://msdn.microsoft.com/netframework/future/linq/). The really scary thing is that even with years of experience with how ugly DOM is and the ability to learn from the others, XML drags one toward the same decisions unless you try really hard (and, ahem, have Anders H. kicking butt to keep it simple and clean). Mixed content forced us to back off on the No Text Nodes mantra (but now they're only exposed when absolutely necessary, not everywhere), and Namespaces are the gift that keeps on giving and giving and giving PAIN (again hopefully more in the corner cases in XLinq rather than everywhere). Anyway, there's another iteration of XML API design in progress in the scripting world (E4X), Java (Dolphin itself and XJ) and .NET (LINQ/XLinq). People should let the people in their community know about their pain points with the current stuff and try the new stuff as it becomes public, and provide feedback, constructive or otherwise.

6:54 PM  
Anonymous said...

I think that SAX has its charm when you understand how to write handlers that represent simple state machines (see the State Machine part of http://www-128.ibm.com/developerworks/xml/library/x-saxapi).

That is the fastest and most memory efficient approach you can get.

StAX has its charm especially when you need to read AND write data (this is what SAX isn't good for at all). You get a decent API that will allow you to do both.

Try it as Albert Einstein proposed:
Everything should be made as simple as possible, but not simpler.

9:50 PM  
