Thursday, October 21, 2004

Loving and Hating XML Namespaces

If any one of the W3C's family of XML specifications has caused me the most grief, XML Namespaces is probably it. XML itself is simple enough that most users aren't surprised by it until a week before they deploy their app, when testing reveals the one edge case they forgot. XSD is so complicated and confusing that everyone just expects to be confused, and mostly relies on tools or a local guru. XSLT takes some getting used to, but it is basically a closed system and most people figure out the basics, only having to ask for help after getting tripped up by the lack of support for side effects.

XML Namespaces is unique. Almost as soon as newcomers to XML try to use namespaces, they run into problems: their DOM code doesn't do what they expect, or their XPaths stop working. XML Namespaces is actually very simple, and in that simplicity is its downfall. It is too simple and easy.

Then there is the minor fact that the authors created a design that only really addressed how namespaces impact parsing XML. The design they chose has some very nasty consequences for random-access editing, and how to resolve those issues is left up to each implementation (or left unresolved). The authors do not mention how to handle namespaces in any context other than parsing XML. (To give them credit, parsing really was the only context that the XML language specification addresses either, so they have some decent precedent.)

Namespaces and your XML store
For example, load this document into your favorite XML store API (DOM, XmlBeans, etc.):
 <book title='Loving and Hating XML Namespaces'>
   <author>Derek Denny-Brown</author>
 </book>
Then add the attribute named "xmlns" with value "http://book" to the <book> element. What should happen? Should that change the namespaces of the <book> and <author> elements? Then what happens if someone adds the element <ISBN> (with no namespace) under <book>? Should the new element automatically acquire the namespace "http://book", or should the fact that you added it with no namespace mean that it preserves its association with the empty namespace?

In MSXML, we tried to completely disallow editing of namespace declarations, and mostly succeeded. There was one case that I missed, and we have never been able to fix it because so many people found it and exploited it. The W3C's XML DOM spec basically says that element and attribute namespaces are assigned when the nodes are created and never change, but it is not clear about what happens when a namespace declaration is edited.
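To make the question concrete, here is a minimal sketch using Python's xml.dom.minidom. It is chosen only because it ships with Python; it is not MSXML, and other DOM implementations may well behave differently. The sketch adds the "xmlns" attribute to the <book> sample and reports the element's namespace before the edit, after the edit, and after serializing and reparsing. The function name is made up for this example.

```python
from xml.dom import minidom

SOURCE = (
    "<book title='Loving and Hating XML Namespaces'>"
    "<author>Derek Denny-Brown</author>"
    "</book>"
)


def namespace_after_xmlns_edit():
    """Return <book>'s namespace before the edit, after it, and after a round trip."""
    doc = minidom.parseString(SOURCE)
    book = doc.documentElement
    before = book.namespaceURI

    # The edit from the text: add a default namespace declaration by hand.
    book.setAttribute("xmlns", "http://book")
    after = book.namespaceURI

    # Serialize the edited tree and parse it again from scratch.
    reparsed = minidom.parseString(doc.toxml()).documentElement
    return before, after, reparsed.namespaceURI


if __name__ == "__main__":
    print(namespace_after_xmlns_edit())
```

In minidom the in-memory element keeps the namespace it was created with, while the reparsed copy lands in "http://book", so the tree and its own serialization disagree about what the document means.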

Then there is the problem of edits that introduce elements in a namespace that does not have an existing namespace declaration:
<a xmlns:p="http://p/">
  <b>
    ...
      <c p:x="foo"/>
    ...
  </b>
</a>
If you add attribute "p:z" in namespace "bar" to element <b>, what should happen to the p:x attribute on <c>? Declaring prefix "p" as "bar" on <b> would silently rebind p:x as well. Should the implementation scan the entire content of <b> just in case there is a reference to prefix "p"?
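The scan in question might look something like the sketch below, again using Python's xml.dom.minidom as a stand-in. The helper name subtree_uses_prefix is hypothetical, and the check is deliberately simplified: it ignores descendants that redeclare the prefix themselves, and it cannot see prefixes used inside attribute values or text.

```python
from xml.dom import minidom


def subtree_uses_prefix(element, prefix):
    """Return True if any descendant element or attribute name uses `prefix`."""
    for node in element.getElementsByTagName("*"):
        if node.prefix == prefix:
            return True
        attrs = node.attributes
        for i in range(attrs.length):
            if attrs.item(i).prefix == prefix:
                return True
    return False


if __name__ == "__main__":
    doc = minidom.parseString('<a xmlns:p="http://p/"><b><c p:x="foo"/></b></a>')
    b = doc.getElementsByTagName("b")[0]
    print(subtree_uses_prefix(b, "p"))
```

Even this simplified version has to visit every node under <b>, which is the cost the question is worried about.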

Or what about conflicts? Add attribute "p:z" in namespace "bar" to the below sample... what should happen?
<a xmlns:p="http://p/" p:a="foo"/>
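As one data point on the conflict case, the sketch below tries it with Python's xml.dom.minidom, which performs no namespace fix-up when serializing. This shows one implementation's answer, not what the DOM spec requires, and the function name is invented for the example.

```python
from xml.dom import minidom


def conflicting_prefix_round_trip():
    """Add p:z in namespace "bar" where p is already bound to "http://p/"."""
    doc = minidom.parseString('<a xmlns:p="http://p/" p:a="foo"/>')
    a = doc.documentElement
    a.setAttributeNS("bar", "p:z", "1")

    # In memory, the new attribute carries the namespace it was created with.
    in_memory = a.getAttributeNodeNS("bar", "z").namespaceURI

    # After a round trip, the prefix is resolved against the existing declaration.
    reparsed = minidom.parseString(doc.toxml()).documentElement
    attr = reparsed.getAttributeNodeNS("http://p/", "z")
    return in_memory, (attr.namespaceURI if attr is not None else None)


if __name__ == "__main__":
    print(conflicting_prefix_round_trip())
```

Here the namespace "bar" exists only in memory; once the document is written out and read back, p:z belongs to "http://p/".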

Namespaces and XPath
I still see emails from confused users who had their app working, and then discovered that none of their XPaths work once they add namespaces. XML Namespaces completely muddies the waters about what the 'name' of an element is. From a naive point of view, you would assume that the name of <x:e> in the sample below is "x:e", but XML Namespaces says no, that is not the name, that is just the serialization. The real name is local-name 'e' with namespace 'foo'.
 <a xmlns:x="foo">
  <x:e/>
 </a>
This provides no end of confusion. The actual namespace declaration may be pages away. The XML Namespaces spec treats the prefix as a syntactic shorthand for the namespace, but the prefix is not actually replaced by the namespace, and there is no standard syntactic representation for a namespace and local-name pair. This leads to the problem that there is no clean way to write an XPath for <x:e> in the above sample other than string comparisons against the local-name() and namespace-uri() functions (i.e. *[local-name()='e'][namespace-uri()='foo']). This has proven to be a serious stumbling block for users of MSXML's and System.Xml's SelectNodes and SelectSingleNode DOM methods.
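The same stumbling block is easy to reproduce outside MSXML. The sketch below uses Python's xml.etree.ElementTree (an assumption of this example, not something from the APIs discussed above) to look up <x:e> three ways: by bare name, by a prefix bound through a mapping supplied separately from the path, and by ElementTree's own "{namespace}local-name" notation, which is a library convention rather than a standard.

```python
import xml.etree.ElementTree as ET

DOC = '<a xmlns:x="foo"><x:e/></a>'


def find_e():
    """Look up <x:e> by bare name, by prefix mapping, and by {uri}local notation."""
    root = ET.fromstring(DOC)

    # The naive path: matches only an element named "e" in no namespace.
    by_bare_name = root.find("e")

    # The prefix in the path is bound by this mapping, not by the document,
    # so it does not have to be "x".
    by_prefix_map = root.find("q:e", {"q": "foo"})

    # ElementTree's expanded-name notation.
    by_expanded_name = root.find("{foo}e")
    return by_bare_name, by_prefix_map, by_expanded_name


if __name__ == "__main__":
    print(find_e())
```

The bare-name lookup finds nothing, and the path string alone is never enough: the caller has to supply the namespace out of band, either as a mapping or spelled out in full.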

XML Namespaces and Parser Performance
When parsing XML without XML Namespaces, it is possible to return the tag name for a start tag as soon as all the characters in the name are parsed. XML Namespaces makes it a requirement that the entire tag, including all its attributes, be parsed first, because the namespace declaration that binds the element's prefix may appear anywhere among the attributes. A streamed parse of the tag that buffers only one attribute at a time is no longer possible. Most parsers buffered the whole tag anyway, but it is unfortunate that the option is completely ruled out. As XML is used more and more in scenarios where parsing performance matters, I get more and more complaints about this.
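A rough sketch of why the whole tag has to be read first: the declaration that binds the element's own prefix can sit after any number of other attributes. The function below is purely illustrative and not taken from any real parser. Its name and signature are assumptions, it takes attributes as already-tokenized (name, value) pairs, and it returns None for an unbound prefix instead of reporting an error.

```python
def resolve_start_tag(qname, attributes, inherited):
    """Resolve a start tag's qualified name to (namespace, local-name).

    attributes: list of (name, value) pairs in document order.
    inherited:  dict of prefix -> namespace from enclosing elements,
                with "" as the key for the default namespace.
    """
    scope = dict(inherited)

    # Every attribute must be examined before the element name can be
    # resolved, because a declaration may appear anywhere in the tag.
    for name, value in attributes:
        if name == "xmlns":
            scope[""] = value
        elif name.startswith("xmlns:"):
            scope[name[len("xmlns:"):]] = value

    prefix, _, local = qname.rpartition(":")
    return scope.get(prefix), local


if __name__ == "__main__":
    # Equivalent to <x:e a="1" xmlns:x="foo">: the binding comes last.
    print(resolve_start_tag("x:e", [("a", "1"), ("xmlns:x", "foo")], {}))
```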

So Why Did They Do It Like That?
XML Namespaces solved a very real problem, and at the time it appeared to many to be a good solution. The issue is that a given tag name may have different meanings in different documents. Does <address> refer to the street address of my house, or the IP address of my computer? The creators of the XML Namespaces specification wanted to provide a way to cleanly support aggregation of multiple tag vocabularies in a single document. The design of XML Namespaces actually makes many common copy/paste and inclusion scenarios very easy to implement, and does this without requiring the buffering of the entire document. The previous proposal would potentially have required costly buffering, and possibly transformation, of both the container and the inclusion.
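For a small illustration of the aggregation case, the sketch below puts both kinds of <address> in one document and reads them back with Python's xml.etree.ElementTree. The two namespace names ("urn:example:postal" and "urn:example:network"), the <contact> wrapper, and the sample values are all invented for this example.

```python
import xml.etree.ElementTree as ET

DOC = (
    '<contact xmlns:post="urn:example:postal" xmlns:net="urn:example:network">'
    "<post:address>1 Example Street</post:address>"
    "<net:address>192.0.2.1</net:address>"
    "</contact>"
)


def addresses_by_vocabulary():
    """Map each child's expanded name to its text content."""
    root = ET.fromstring(DOC)
    return {child.tag: child.text for child in root}


if __name__ == "__main__":
    for name, text in addresses_by_vocabulary().items():
        print(name, text)
```

Both elements share the local name "address", but their expanded names differ, so code written for one vocabulary never mistakes the other for its own.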

In many ways, the XML Namespaces design is quite elegant. It built on the existing XML language specification in such a way that you could implement XML Namespaces as a layer over existing parsing APIs. This cleverness is partly the source of many of its problems, but a design that could not be implemented so cleanly would never have been accepted by the community. I wish things had been done differently, but I'm guessing that a different design would have had less community acceptance.

Summary
Like my previous rant about XML, I'm not suggesting throwing the baby out with the bathwater. XML is a key standard used for all kinds of interop, semi-structured storage, and other tasks. Our goal should be to learn from our mistakes. What could we have done better? Why did we miss these issues when designing this in the first place? I have my own ideas for how I would redesign XML Namespaces, but I see no point in worrying about it now. Instead I struggle to design the best APIs I can: ones that are easy to use and that encourage fast, robust customer application code, given what we already have.

1 Comment:

Anonymous said...

I'm not suggesting (as you say) "worrying about it", but I for one sure would love to hear how you would (or could) redesign Namespaces -- or if you would just implement an entirely different mechanism (non-obtuse AF's? something else?).

How about, for example, for the common scenario of combining XHTML with SVG and MathML vocabularies in a document? Or for that matter, XSLT with WhateverMLs...

XML Namespaces never smelled right (to me anyway), especially with its tie-in to URI Theology. I have not one shred of doubt that one day the world will look back at the markup practice of Namespaces "curing" a vocabulary clash issue like it looks back at the medical practice of leeches "curing" sick patients (like they "cured" George Washington).

I'd be very interested in your thoughts on this...

6:35 PM  
