Tuesday, October 26, 2004

When is text not text?

Liam Quin posted a comment about my 'where xml goes astray' rant:
I don't buy your argument at all that
[1] people put garbage in their databases
[2] XML is to blame for not accepting garbage in element content.

If it's garbage, don't put it there. If you have a legitimate use for it, i.e. it's representing information, you can convert it, e.g. into elements.

Liam's argument is based on the presumption that the only reason to store a sequence of characters is that those characters are intended for display. If you did not intend to display the characters, then why worry about control characters, for example? While I might tend to side with him on the issue of storing invalid or incorrect surrogate pairs, there may be valid reasons (beyond just performance consideration) for that also. XML 1.1 validates my argument that control characters are useful and valid text content. The fact that it requires escaping them is exactly the kind of compromise that is appropriate and which should be the ultimate goal of a standard like XML.

An interesting example where the presumption that characters are about representing text is violated can easily be found in standard C libraries: strcspn(). This function takes 2 strings as arguments and "Scans string1 character by character, returning the number of characters read until the first occurrence of any character included in string2" according to the docs Google found for me (http://www.cplusplus.com/ref/cstring/strcspn.html). What if I want to marshal something like this over XML? How do I represent the 2nd parameter? (pre XML 1.1 that is...)
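To make the marshaling problem concrete, here is a minimal sketch using Python's standard library, whose bundled parser implements XML 1.0 only. The reject set, the `<arg>` element, and the `encoding='base64'` attribute are made-up conventions for illustration, not part of any real protocol.

```python
import base64
import xml.etree.ElementTree as ET

# Hypothetical reject set for a strcspn()-style call: TAB, BEL and ESC.
reject = "\t\x07\x1b"

def parses(xml_text):
    """Return True if the standard-library (XML 1.0) parser accepts the text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Raw control characters are not legal XML 1.0 content ...
raw_ok = parses("<arg>" + reject + "</arg>")
# ... and numeric character references to them are rejected as well.
ref_ok = parses("<arg>&#9;&#7;&#27;</arg>")

# The usual fallback: base64 the bytes and carry them as opaque text.
# The 'encoding' attribute is an assumed, application-level convention.
encoded = base64.b64encode(reject.encode("utf-8")).decode("ascii")
element = ET.fromstring("<arg encoding='base64'>" + encoded + "</arg>")
decoded = base64.b64decode(element.text).decode("utf-8")

print(raw_ok, ref_ok, decoded == reject)
```

The round trip works, but the receiver has to know out of band that the element content is base64 rather than text.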

Some argue that you just shouldn't use XML for this. I (and many others) argue that XML is 90+% there. When we were so close, why throw out these use cases? Is preserving purity of 'text' as something more complex than just a sequence of characters so important that we should disallow other interpretations?

Back to Liam's points. [1] the data isn't 'garbage', it just isn't text. It is a sequence of characters. There is an important difference and a format that is about information interop should make it possible to exchange sequences of characters without dropping back to barbaric base64 encoding. [2] I'm not 'blaming' XML. I'm wishing that we could have made 'better' choices for XML (1.0). We obviously learned something because XML 1.1 addresses much of this.

If instead of element content, he was referring to tag-names, then that is a slightly different beast. By carefully restricting what characters were allowed in tag-names, XML slows down parser implementations, and dates itself by being tied to a specific Unicode version. The growth of international commerce has required a constant adjustment of our definition for what is allowable as a 'name' and what are allowable characters in a name. Trying to restrict the definition of what constitutes a tag-name causes problems for a standard focused on interop, such as XML. Interop means the server I'm talking to may be running decades-old software. A client can't depend on the server to have been upgraded with a parser for XML 4.2; it must depend on some contractually agreed least-common-denominator, usually v1.0.

One lesson I learned from XML is that a standard like XML should be designed to be open. Version 1.0 should only restrict what is absolutely necessary to restrict. If you target scenarios that you haven't fleshed out yet, then make sure you are as open as you are willing to be. XML 1.0 targets information interop, but assumes that the information is text, not just character data, but actual textual data (targeting eventual textual rendering). I would argue that this mismatch is the single largest source of problems with current XML deployments. Forcing anything remotely resembling 'data' to be encoded, just in case it contains non-textual character sequences, is a very high cost that ultimately provides very little real gain.

Thursday, October 21, 2004

Loving and Hating XML Namespaces

If any one of the W3C's family of XML specifications has caused me the most grief, XML Namespaces is probably it. XML itself is simple enough that most users aren't surprised by it until a week before they deploy their app, when testing reveals the one edge case that they forgot. XSD is so complicated and confusing that everyone just expects to be confused, and mostly relies on tools or their local guru. XSLT takes some getting used to, but is basically a closed system and most people figure out the basics, only having to ask for help after getting tripped up by the lack of support for side-effects.

XML Namespaces is unique. Almost as soon as newcomers to XML try to use namespaces, they run into problems. Their DOM code doesn't do what they think, or their XPaths stop working, etc. XML Namespaces is actually very simple, and in that simplicity lies its downfall. It is too simple and easy.

Then there is the minor fact that the authors created a design that only really addressed how namespaces impact parsing XML. The design they chose has some very nasty impacts on random access editing. How those issues are resolved (or not) is left up to each implementation. The authors do not mention how to handle namespaces in any context other than parsing XML. (To give them credit, that really was the only context that the XML Language specification mentions either, so they have some decent precedent.)

Namespaces and your XML store
For example, load this document into your favorite XML store API (DOM/XmlBeans/etc):
 <book title='Loving and Hating XML Namespaces'>
   <author>Derek Denny-Brown</author>
 </book>
Then add the attribute named "xmlns" with value "http://book" to the <book> element. What should happen? Should that change the namespaces of the <book> and <author> elements? Then what happens if someone adds the element <ISBN> (with no namespace) under <book>? Should the new element automatically acquire the namespace "http://book", or should the fact that you added it with no namespace mean that it preserves its association with the empty namespace?

In MSXML, we tried to completely disallow editing of namespace declarations, and mostly succeeded. There was one case, which I missed, and we have never been able to fix it because so many people found it and exploited it. The W3C's XML DOM spec basically says that element/attribute namespaces are assigned when the nodes are created, and never change, but is not clear about what happens when a namespace declaration is edited.
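As one data point, here is a small sketch of how Python's xml.dom.minidom behaves when a default-namespace declaration is added as a plain attribute. This only shows what one implementation happens to do; it says nothing about MSXML, and other DOMs may differ.

```python
from xml.dom import minidom

doc = minidom.parseString(
    "<book title='Loving and Hating XML Namespaces'>"
    "<author>Derek Denny-Brown</author>"
    "</book>"
)
book = doc.documentElement

# Add a default-namespace declaration as if it were an ordinary attribute.
book.setAttribute("xmlns", "http://book")

# In memory, the element keeps the (empty) namespace it was created with.
in_memory_ns = book.namespaceURI

# Serialize and re-parse: the parser now sees the declaration and applies it
# to <book> and, by inheritance, to <author>.
reparsed = minidom.parseString(doc.toxml())
reparsed_book_ns = reparsed.documentElement.namespaceURI
reparsed_author_ns = reparsed.documentElement.firstChild.namespaceURI

print(in_memory_ns, reparsed_book_ns, reparsed_author_ns)
```

The in-memory tree and its own serialization disagree about which namespace the elements are in, which is exactly the kind of ambiguity the specifications leave open.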

Then there is the problem of edits that introduce elements in a namespace that does not have an existing namespace declaration:
<a xmlns:p="http://p/">
   <b>
      <c p:x="foo"/>
   </b>
</a>
If you add attribute "p:z" in namespace "bar" to element <b>, what should happen to the p:x attribute on <c>? Should the implementations scan the entire content of <b> just in case there is a reference to prefix "p"?

Or what about conflicts? Add attribute "p:z" in namespace "bar" to the below sample... what should happen?
<a xmlns:p="http://p/" p:a="foo"/>

Namespaces and XPath
I still see emails from confused users who had their app working, and then discovered that none of their XPaths work once they add namespaces. XML Namespaces completely muddies the waters about what the 'name' of an element is. From a naive point of view, you would assume that the name of <x:e> in the sample below is "x:e", but XML Namespaces says no, that is not the name, that is just the serialization. The real name is local-name 'e' with namespace 'foo'.
 <a xmlns:x="foo">
   <x:e/>
 </a>
This provides no end of confusion. The actual namespace declaration may be pages away. The XML Namespace spec treats the prefix as a syntactic short-hand for the namespace, but the prefix is not actually replaced by the namespace, and there is no standard syntactic representation for a namespace and local-name. This then leads to the problem that there is no clean way to write an XPath for <x:e> in the above sample other than string compares against the local-name() and namespace-uri() functions (i.e. *[local-name()='e'][namespace-uri()='foo']). This has proven to be a serious stumbling block for users of MSXML and System.Xml’s SelectNodes and SelectSingleNode DOM methods.
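The same stumbling block is easy to reproduce outside MSXML. Below is a minimal sketch with Python's xml.etree.ElementTree, assuming the sample is an <a> element with a single <x:e/> child. The `{uri}local` form is ElementTree's own notation, not a standard one, and the prefix "p" in the lookup is an arbitrary choice.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring('<a xmlns:x="foo"><x:e/></a>')
child = root[0]

# ElementTree stores the expanded name, not the "x:e" serialization.
expanded_name = child.tag

# A bare local name does not match the namespaced element.
by_local_name = root.find("e")

# A match needs the namespace: either {uri}local or a caller-supplied
# prefix map. The prefix used in the query need not match the document's.
by_expanded = root.find("{foo}e")
by_prefix_map = root.find("p:e", {"p": "foo"})

print(expanded_name, by_local_name, by_expanded is child, by_prefix_map is child)
```

Note that the query prefix is resolved by the caller's map, not by the declarations in the document, which is the step newcomers tend to miss.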

XML Namespaces and Parser Performance
When parsing XML without XML Namespaces, it is possible to return the tag name for a start tag as soon as all the characters in the name are parsed. XML Namespaces now makes it a requirement that the entire tag, including all its attributes, be parsed before the name can be resolved, because the namespace declaration for the element's prefix may appear in any of those attributes. A streamed parse of the tag, which only buffers one attribute at a time, is no longer possible. Most parsers buffered the whole tag anyway, but the fact that this option is completely ruled out is unfortunate. As XML becomes used more and more for scenarios where parsing performance matters, I get more and more complaints about this.
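A rough sketch of why the whole tag has to be seen first. The function `resolve_start_tag` and its argument shapes are hypothetical, invented for this illustration, and it ignores details such as the predeclared `xml` prefix.

```python
XMLNS = "xmlns"

def resolve_start_tag(qname, attributes, inherited):
    """Resolve an element's (namespace, local-name) pair.

    qname      -- tag name as written, e.g. "x:e"
    attributes -- list of (name, value) pairs in document order
    inherited  -- dict mapping prefix -> namespace URI from ancestors
    """
    scope = dict(inherited)
    # Every attribute must be inspected first: a declaration for the
    # element's own prefix may be the very last attribute in the tag.
    for name, value in attributes:
        if name == XMLNS:
            scope[""] = value
        elif name.startswith(XMLNS + ":"):
            scope[name[len(XMLNS) + 1:]] = value
    prefix, _, local = qname.rpartition(":")
    if prefix and prefix not in scope:
        raise ValueError("undeclared prefix: " + prefix)
    return scope.get(prefix, ""), local

# <x:e id="1" lang="en" xmlns:x="foo">: the prefix is declared last.
name = resolve_start_tag("x:e", [("id", "1"), ("lang", "en"), ("xmlns:x", "foo")], {})
print(name)
```

Until the loop over the attributes has finished, the function cannot say what namespace "x" refers to, so nothing about the element's real name can be reported early.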

So Why Did They Do It Like That?
XML Namespaces solved a very real problem, and at the time, to many it appeared to be a good solution. The issue is that a given tag name may have different meanings in different documents. Does <address> refer to the street address of my house, or the IP address of my computer? The creators of the XML Namespace specification wanted to provide a way to cleanly support aggregation of multiple tag vocabularies in a single document. The design of XML Namespaces actually makes many common copy/paste and inclusion scenarios very easy to implement, and does this without requiring the buffering of the entire document. The previous proposal would potentially have required costly buffering and transformation of both the container and the inclusion.

In many ways, the XML Namespace design is quite elegant. It built on the existing XML Language specification, in such a way that you could implement XML Namespaces as a layer over existing parsing APIs. In some ways, this cleverness is the source of many of its problems, but a design that could not be implemented so cleanly would never have been accepted by the community. I wish things had been done differently, but I’m guessing that a different design would have had less community acceptance.

Like my previous rant about XML, I’m not suggesting throwing the baby out with the bathwater. XML is a key standard used for all kinds of interop, semi-structured storage, and other tasks. Our goal should be to learn from our mistakes. What could we have done better? Why did we miss these issues when designing this in the first place? I have my own ideas for how I would redesign XML Namespaces, but I see no point in worrying about it now. Instead I struggle to design the best APIs that are easy to use, and encourage fast, robust customer application code, given what we already have.

Friday, October 15, 2004

When to use XML

Dare posted a writeup on when to use (and when not to use) XML: The XML Litmus Test: Understanding When and Why to Use XML (a summary is also available on msdn which doesn't have the connectivity issues that Dare's personal blog has). More people who build software using XML should read this and apply some of these ideas. I can't tell you how many times I get called in on a customer issue, where my first reaction is 'Why are they using XML?' I admit that I have an unusual perspective, as I have spent most of the last 9 years building XML/SGML software. XML is deceptive. It looks so easy, and many of the complexities are not obvious. More of the XML books should include something along the lines of Dare's writeup.

Tuesday, October 12, 2004

Where XML goes astray...

It seems like every programmer and their brother has picked up XML and is using it as the proverbial hammer to nail some solution. Sometimes it works, sometimes it doesn’t. A lot of people have written about how XML doesn’t scale, how XML isn’t the right solution for problem X, but for all those complaints, XML has helped solve a lot of problems. What is more interesting is to see which problems it appears to have gotten the most traction on.

First, some background: XML was originally designed as an evolution of SGML, a simplification that mostly matched a lot of then-existing common usage patterns. Most of its creators saw XML as evolving and expanding the role of SGML, namely text markup. XML was primarily intended to support taking a stream of text intended to be interpreted as a human readable document, and delineating portions according to some role. This sequence of characters is a paragraph. That sequence should be displayed with a link to some other information. Et cetera, et cetera. Much of the process in defining XML was based on the assumption that the text in an XML document would eventually be exposed for human consumption. You can see this in the rules for what characters are allowed in XML content, what are valid characters in Names, and even in “</tagname>” being required rather than just “</>”.

All of that is why I find it so interesting that XML has become so popular for such things as SOAP. XML was not designed with the SOAP scenarios in mind. Other examples of popular scenarios which deviate from XML’s original goals are configuration files, quick-n-dirty databases, and RDF. I’ll call these ‘data’ scenarios, as opposed to the ‘document’ scenarios for which XML was originally intended. In fact, I think it is safe to say that there is more usage of XML for ‘data’ scenarios than for ‘document’ scenarios, today. I choose the terms ‘data’ and ‘document’, because these are the terms that are most often used when this issue is discussed on the XML-DEV mailing list and at work. Personally, I dislike the terminology, because there are many cases where a single document mixes both usage patterns, and because (strictly speaking) documents are data.

As often happens when an existing tool is reused for a purpose beyond its original purposes, XML is not exactly a perfect fit. It is a surprisingly good fit, but far from perfect. In fact, one of the few things that mess with XML’s fit for these applications isn’t even something in the original XML specification; it got its own specification, released less than a year later: XML Namespaces.
The 2 main things that XML 1.0 (pre-Namespaces) mucked up: whitespace and allowed characters. I’ll go at these issues in the reverse order to how I just listed them.

Allowed Characters

The logic went something like this: XML is all about marking up text documents, so the characters in an XML document should conform to what Unicode says are reasonable for a text document. That rules out most control characters, and means that surrogate pairs should be checked. All sounds good until you see some of the consequences. For example, most databases allow any character in a text column. What happens when you publish your database as XML? What do you do about values that include control characters that the XML specification disallows? XML did not provide any escaping mechanism, and if you ask many XML experts they will tell you to base64 encode your data if it may include invalid characters. It gets worse.

The characters allowed in an XML name are far more limited. Basically, when designing XML, they allowed everything that Unicode (as defined then) considered a ‘letter’ or a ‘number’. Only 2 problems with that: (1) It turns out many characters common in Asian texts were left out of that category by the then-current Unicode specification. (2) The list of characters is sparse and random, making implementation slow and error prone. Issue (1) has been a significant problem for a number of customers I have worked with, and the only options are to either avoid those character ranges that are not allowed or to implement an application specific escaping mechanism. The fact that many early parsers (including some of Microsoft’s) did not correctly enforce the rules made the problem worse. I have looked at the code for uncounted XML parsers, and this is one of the areas that many parsers skimp on. The major supported parsers typically implement this properly, but it is still a source of constant bugs and unexpected complexity, as well as a constraint on performance.


Whitespace

When we were first coding up MSXML, whitespace was one of our perpetual nightmares. In hand-authored XML documents (the most common form of documents back then), there tended to be a great deal of whitespace. Humans have a hard time reading XML if everything is jammed on one line. We like a tag per line and indenting. All those extra characters, just there so that our feeble minds could make sense of this awkward jumble of characters, ended up contributing significantly to our memory footprint, and caused many problems for our users. Consider this example:

<customer>
	<name>Joe Schmoe</name>
	<addr>123 Seattle Ave</addr>
</customer>

A customer coming to XML from a database background would normally expect that the first child of the <customer> element would be the <name> element. I can’t count how many times I had to explain that it was actually a text node with the value newline+tab. For the first official release version of MSXML, we found an awkward compromise that confuses customers to this day, because it depends on some unexposed internal hints. It works great, so long as you don’t edit the DOM and write it out, expecting a pretty format, like the original version. It has been interesting to talk with people about this issue over the intervening years. I have had people claim that we violated the XML specification and had others thank us for saving them from having to care about all that extra noise in the DOM.
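The surprise is easy to see in any DOM that keeps every text node. Here is a minimal sketch with Python's xml.dom.minidom, assuming the sample is a <customer> element wrapping the two children on tab-indented lines. This is not how MSXML behaves; it simply shows the unfiltered view.

```python
from xml.dom import minidom, Node

doc = minidom.parseString(
    "<customer>\n"
    "\t<name>Joe Schmoe</name>\n"
    "\t<addr>123 Seattle Ave</addr>\n"
    "</customer>"
)
customer = doc.documentElement

# The first child is the newline+tab text node, not <name>.
first = customer.firstChild
first_is_text = first.nodeType == Node.TEXT_NODE
first_value = first.data

# Skipping non-element nodes gives the view a database user expects.
elements = [n.tagName for n in customer.childNodes
            if n.nodeType == Node.ELEMENT_NODE]

print(first_is_text, repr(first_value), len(customer.childNodes), elements)
```

Two elements turn into five child nodes, three of which exist only because a human wanted the file to be readable.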

The problem is that XML doesn’t know the difference between the above scenario and something more like: (this is using the html tag vocabulary)

<ul>
   <li><pre>
<b>this</b> is a test</pre></li>
</ul>

This last example is actually quite interesting. The whitespace between the <ul> and the <li> tags is not significant, yet the whitespace between the <pre> and <b> tags is significant. The only way to know this is to actually have an innate understanding of the semantics of the tag vocabulary. That means that there is effectively no universal answer, and it is up to the application to do the right thing… an almost universal guarantee of application bugs.
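Here is a sketch of what "the application doing the right thing" can look like, using Python's xml.dom.minidom. The `PRESERVE` set is the vocabulary knowledge the parser cannot have; it and the helper `strip_layout_whitespace` are assumptions made up for this example, and the input is an assumed layout of the HTML sample.

```python
from xml.dom import minidom, Node

# Assumption: the application knows which elements keep their whitespace.
PRESERVE = {"pre"}

def strip_layout_whitespace(node, preserving=False):
    """Remove whitespace-only text nodes outside whitespace-preserving elements."""
    if node.nodeType == Node.ELEMENT_NODE and node.tagName in PRESERVE:
        preserving = True
    for child in list(node.childNodes):
        if child.nodeType == Node.TEXT_NODE:
            if not preserving and not child.data.strip():
                node.removeChild(child)
        else:
            strip_layout_whitespace(child, preserving)

doc = minidom.parseString(
    "<ul>\n   <li><pre>\n<b>this</b> is a test</pre></li>\n</ul>"
)
strip_layout_whitespace(doc.documentElement)
result = doc.documentElement.toxml()
print(repr(result))
```

Change the contents of `PRESERVE` and the same document yields a different tree, which is the whole problem: the answer lives in the application, not in the XML.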

XML Namespaces

Namespaces is still, years after its release, a source of problems and disagreement. The XML Namespaces specification is simple and gets the job done with minimum fuss. The problem? It pushes an immense burden of complexity onto the APIs and XML reader/writer implementations. Supporting XML Namespaces introduces significant complexity in the parsers, because it forces parsers to parse the entire start-tag before returning any text information. It complicates XML stores, such as DOM implementations, because the XML Namespace specification only discusses parsing XML, and introduces a number of serious complications to edit scenarios. It complicates XML writers, because it introduces new constraints and ambiguities.

Then there is the issue of the ‘default namespace’. I still see regular emails from people confused about why their XPath doesn’t work because of namespace issues. Namespaces is possibly the single largest obstacle for people new to XML. So much else about XML seems common sense, and then XML Namespaces rears its ugly head. I still regularly argue how our code should handle odd edge cases introduced by namespaces.


Note that nowhere above do I talk about how XML should have handled these issues. In most cases, when the original decisions were made, they made sense to me. I like to believe that I have learned a lesson or two since, but who knows. My purpose in writing this was to educate people about where XML goes astray from what you expect. Proposing solutions is of no real use, since XML is a standard and isn’t changing significantly anytime soon. It is worth understanding where we made our worst mistakes to avoid making similar mistakes again. The above are some of the hard lessons I have learned, having been implementing XML APIs for customers for almost 7 years. These are not the only issues I have with the XML 1.0 specification; they are only the most glaring. If I could go back in time, these are the areas I would have attempted to influence in a different direction the most.

Saturday, October 09, 2004

musing on music

I've spent the last 2 hours listening to old dance vinyl I have from my days as a techno/trance dj, and man do I realize how much I miss this stuff. For various reasons, although I love electronic dance music, when I do go out, I mostly go to a local goth-industrial club. I've started wandering to some of the other local clubs and am struck with how out-of-touch with the scene I have become. Or I'm just old. Or the scene is just crap. I can't find a club that regularly plays electronic music I like and draws a decent crowd. I've been to Club Medusa, and when it was an out-of-town DJ, it rocked. The local DJs just don't do it for me, and the crowd totally turns me off. The Last Supper Club has much better music (upstairs) but the dance floor is tiny and it really seems to be hit or miss whether there is a decent crowd.

Back when I was in college (not here in Seattle), I used to track special events. An odd disadvantage of having a real job is that I don't have to save up for the really good nights... so I just don't track special events at all. Ever since I moved off Capitol Hill, it has become almost impossible to pick up a Stranger to look for upcoming events. I'm just adrift with my inadequate information.

Back to my original topic... listening to my old music. I have some 200 or so records, most of which are 8-9 years old, from when I DJed. Although I've had my own turntables for almost 7 years, I haven't really done much with any of that music in years. I recently moved and now have a decent setup for the first time in years... so I'm having fun pulling out my old records for a spin. Mostly, I'm just dropping them on and just letting them play through. It is damn cool to listen to this stuff. I forgot how much better some of this stuff is than what you typically get on some random dj compilation cd. Once I've worked my way through more of my collection, I'm going to start putting together mixes.

That raised an interesting question. Back when I would practice more regularly, I would make mix tapes for me to listen to... only I really have no desire to record to cassette anymore. There are only limited places I can listen to my recordings, and tapes just feel so old. So I was looking at the iRiver PMP-120. Archos has something similar... All I really need is something that can record MP3s, as well as playing them back. Yay! Excuse to buy new toys!

Friday, October 08, 2004

in the beginning...

Once upon a time, there was an unusual character, known to many as Dare who came and joined my team (working for the Evil Empire, I mean Microsoft). He and this other coworker, Joshua, had blogs, back before blogs were cool. Neat idea, I thought... I should do that. So now, something like 1-2 years later, I'm starting my own blog. Not on MSDN, because I don't view this as an aspect of work. Someday, I'll get around to writing a blog hosting cgi that does what I want... but while I try the whole concept out, I thought I'd use Google... I mean Blogger.

Me? I live in Seattle, in the city (near Gas Works Park). I work for Microsoft as a Lead Developer, lording over MSXML and System.Xml development, and even writing a class or two myself now and again. I'm a music addict, totally addicted to almost all kinds of electronic music. I ride a motorcycle as my primary transportation. I've been known to randomly dye my hair blue, simply because I'm bored. The centerpiece of my living room is a British-gothic antique sideboard with my SL-1200 turntables and a Rane mixer. Not much of a bio... deal.