It seems like every programmer and their brother has picked up XML and is using it as the proverbial hammer to nail some solution. Sometimes it works, sometimes it doesn’t. A lot of people have written about how XML doesn’t scale, or how XML isn’t the right solution for problem X, but for all those complaints, XML has helped solve a lot of problems. What is more interesting is which problems it has gained the most traction on.
First, some background: XML was originally designed as an evolution of SGML, a simplification that mostly matched a lot of then-existing common usage patterns. Most of its creators saw XML as evolving and expanding the role of SGML, namely text markup. XML was primarily intended to take a stream of text meant to be read as a human-readable document, and delineate portions of it according to some role. This sequence of characters is a paragraph. That sequence should be displayed with a link to some other information. Et cetera, et cetera. Much of the process of defining XML was based on the assumption that the text in an XML document would eventually be exposed for human consumption. You can see this in the rules for what characters are allowed in XML content, what characters are valid in Names, and even in “</tagname>” being required rather than just “</>”.
All of that is why I find it so interesting that XML has become so popular for such things as SOAP. XML was not designed with the SOAP scenarios in mind. Other examples of popular scenarios which deviate from XML’s original goals are configuration files, quick-n-dirty databases, and RDF. I’ll call these ‘data’ scenarios, as opposed to the ‘document’ scenarios for which XML was originally intended. In fact, I think it is safe to say that today there is more usage of XML for ‘data’ scenarios than for ‘document’ scenarios. I choose the terms ‘data’ and ‘document’ because these are the terms most often used when this issue is discussed on the XML-DEV mailing list and at work. Personally, I dislike the terminology, because there are many cases where a single document mixes both usage patterns, and because (strictly speaking) documents are
As often happens when an existing tool is reused beyond its original purpose, XML is not exactly a perfect fit. It is a surprisingly good fit, but far from perfect. In fact, one of the few things that mess with XML’s fit for these applications isn’t even in the original XML specification; it got its own specification, released less than a year later: XML Namespaces.
The two main things that XML 1.0 (pre-Namespaces) mucked up are whitespace and allowed characters. I’ll take these issues in the reverse of the order I just listed them.
The logic went something like this: XML is all about marking up text documents, so the characters in an XML document should conform to what Unicode says is reasonable for a text document. That rules out most control characters, and means that surrogate pairs should be checked. It all sounds good until you see some of the consequences. For example, most databases allow any character in a text column. What happens when you publish your database as XML? What do you do about values that include control characters the XML specification disallows? XML did not provide any escaping mechanism for them, and if you ask many XML experts, they will tell you to base64 encode your data if it may include invalid characters. It gets worse.
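To make the consequence concrete, here is a minimal sketch using Python's standard library parser (which is built on expat, not MSXML). The column value, the element name `value`, and the `encoding` attribute are hypothetical; the base64 step is simply the workaround described above, not a mechanism defined by XML itself.

```python
import base64
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Hypothetical database value containing the control character U+0001.
raw_value = "line one\x01line two"

# Neither the literal character nor a character reference to it is allowed.
literal_ok = is_well_formed("<value>%s</value>" % raw_value)
reference_ok = is_well_formed("<value>line one&#1;line two</value>")
print(literal_ok, reference_ok)

# Workaround: base64 encode the bytes so only safe characters appear.
encoded = base64.b64encode(raw_value.encode("utf-8")).decode("ascii")
doc = "<value encoding='base64'>%s</value>" % encoded
wrapped_ok = is_well_formed(doc)

# The consumer must know (by convention) to decode the content again.
decoded = base64.b64decode(ET.fromstring(doc).text).decode("utf-8")
print(wrapped_ok, decoded == raw_value)
```

The cost of the workaround is visible here: the content is no longer readable text, and both sides need an out-of-band agreement (the hypothetical `encoding` attribute in this sketch) about which values are encoded.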
The characters allowed in an XML name are far more limited. Basically, when designing XML, they allowed everything that Unicode (as defined then) considered a ‘letter’ or a ‘number’. There are only two problems with that: (1) It turns out many characters common in Asian texts were left out of that category by the then-current Unicode specification. (2) The list of characters is sparse and random, making implementation slow and error prone. Issue (1) has been a significant problem for a number of customers I have worked with, and the only options are either to avoid the character ranges that are not allowed or to implement an application-specific escaping mechanism. The fact that many early parsers (including some of Microsoft’s) did not correctly enforce the rules made the problem worse. I have looked at the code for uncounted XML parsers, and this is one of the areas that many parsers skimp on. The major supported parsers typically implement this properly, but it is still a source of constant bugs and unexpected complexity, as well as a constraint on performance.
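As an illustration of what an application-specific escaping mechanism can look like, here is a minimal sketch. Everything in it is an assumption made for the example: the function names, the `_xHHHH_` escape format, and above all the deliberately conservative rule that only ASCII letters, digits, `-`, `.` and `_` pass through unescaped. It does not reproduce the actual name-character tables from the XML specification.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical escape format: _xHHHH_ where HHHH is the code point in hex.
_ESCAPED = re.compile(r"_x([0-9A-F]{4,6})_")


def encode_name(raw):
    """Escape every character outside a conservative ASCII-only safe set."""
    out = []
    for i, ch in enumerate(raw):
        start_ok = ch.isascii() and ch.isalpha()
        rest_ok = start_ok or (ch.isascii() and ch.isdigit()) or ch in "-."
        # A literal "_" is kept unless it could be mistaken for an escape.
        plain_underscore = ch == "_" and raw[i + 1:i + 2] != "x"
        if plain_underscore or (start_ok if i == 0 else rest_ok):
            out.append(ch)
        else:
            out.append("_x%04X_" % ord(ch))
    return "".join(out)


def decode_name(encoded):
    """Reverse encode_name by expanding each _xHHHH_ escape."""
    return _ESCAPED.sub(lambda m: chr(int(m.group(1), 16)), encoded)


column = "1st name"                      # hypothetical database column name
element_name = encode_name(column)
print(element_name)

# The escaped form is accepted as an element name by the parser.
parsed = ET.fromstring("<%s>value</%s>" % (element_name, element_name))
print(decode_name(parsed.tag) == column)
```

The trade-off is the one described above: every producer and consumer of the document has to agree on the escaping convention, because nothing in XML itself knows about it.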
When we were first coding up MSXML, whitespace was one of our perpetual nightmares. In hand-authored XML documents (the most common form of documents back then), there tended to be a great deal of whitespace. Humans have a hard time reading XML if everything is jammed on one line. We like a tag per line and indenting. All those extra characters, there only so that our feeble minds could make sense of this awkward jumble of characters, ended up contributing significantly to our memory footprint, and caused many problems for our users. Consider this example:
<addr>123 Seattle Ave</addr>
A customer coming to XML from a database background would normally expect that the first child of the <customer> element would be the <name> element. I can’t count how many times I had to explain that it was actually a text node with the value newline+tab. For the first official release version of MSXML, we found an awkward compromise that confuses customers to this day, because it depends on some unexposed internal hints. It works great, so long as you don’t edit the DOM and write it out expecting a pretty format like the original version. It has been interesting to talk with people about this issue over the intervening years. I have had people claim that we violated the XML specification, and had others thank us for saving them from having to care about all that extra noise in the DOM.
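A minimal sketch of that surprise, using Python's standard xml.dom.minidom rather than MSXML. The record is hypothetical (the name value is invented for illustration), and minidom keeps every whitespace text node, which may differ from what other DOM implementations do by default.

```python
from xml.dom import minidom

# Hypothetical record, indented the way a person would write it by hand.
SOURCE = (
    "<customer>\n"
    "\t<name>Example Name</name>\n"
    "\t<addr>123 Seattle Ave</addr>\n"
    "</customer>"
)

doc = minidom.parseString(SOURCE)
customer = doc.documentElement

# The first child is a text node holding the newline and tab, not <name>.
first = customer.firstChild
print(first.nodeType == first.TEXT_NODE)
print(repr(first.data))

# Filtering down to element nodes gives the view a database user expects.
elements = [n for n in customer.childNodes if n.nodeType == n.ELEMENT_NODE]
print([e.tagName for e in elements])
```

Two elements were written, but the DOM holds five child nodes: the three whitespace runs around the elements are each stored as a text node.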
The problem is that XML doesn’t know the difference between the above scenario and something more like: (this is using the html tag vocabulary)
<b>this</b> is a test</pre></li>
This last example is actually quite interesting. The whitespace between the <ul> and the <li> tags is not significant, yet the whitespace between the <pre> and <b> tags is significant. The only way to know this is to have an innate understanding of the semantics of the tag vocabulary. That means that there is effectively no universal answer, and it is up to the application to do the right thing… an almost universal guarantee of application bugs.
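The following sketch shows what "up to the application" tends to mean in practice: the application supplies its own knowledge of the vocabulary. The helper name `strip_insignificant`, the `preserve` set, and the sample markup are all assumptions made for illustration; the parser itself has no way to derive that set.

```python
from xml.dom import minidom


def strip_insignificant(node, preserve=frozenset({"pre"})):
    """Remove whitespace-only text nodes, except inside preserved elements.

    The preserve set is knowledge about the tag vocabulary that only the
    application has.
    """
    for child in list(node.childNodes):
        if child.nodeType == child.TEXT_NODE:
            if not child.data.strip():
                node.removeChild(child)
        elif child.nodeType == child.ELEMENT_NODE:
            if child.tagName not in preserve:
                strip_insignificant(child, preserve)
            # Preserved elements are left untouched, whitespace and all.


# Hypothetical markup: indentation around <li>, a space inside <pre>.
SOURCE = "<ul>\n  <li><pre> <b>this</b> is a test</pre></li>\n</ul>"

doc = minidom.parseString(SOURCE)
strip_insignificant(doc.documentElement)
result = doc.documentElement.toxml()
print(result)
```

If the vocabulary changes, or a second vocabulary is mixed into the same document, the `preserve` set has to change with it, which is exactly where the bugs tend to come from.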
Namespaces is still, years after its release, a source of problems and disagreement. The XML Namespaces specification is simple and gets the job done with minimum fuss. The problem? It pushes an immense burden of complexity onto the APIs and XML reader/writer implementations. Supporting XML Namespaces introduces significant complexity in the parsers, because it forces parsers to parse the entire start-tag before returning any text information. It complicates XML stores, such as DOM implementations, because the XML Namespace specification only discusses parsing XML, and introduces a number of serious complications to edit scenarios. It complicates XML writers, because it introduces new constraints and ambiguities.
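One way to see why the whole start-tag has to be read first: the declaration that binds a prefix may be the last attribute on the very tag that uses it. A minimal sketch with Python's xml.etree.ElementTree follows; the URIs and element names are hypothetical.

```python
import xml.etree.ElementTree as ET

# The prefix "p" is used by the element name before it is declared, so the
# parser cannot report the element's real name until the tag is complete.
late = ET.fromstring('<p:item id="1" xmlns:p="urn:example:a"/>')
print(late.tag)

# Identical tag text up to the declaration, but a different element name.
other = ET.fromstring('<p:item id="1" xmlns:p="urn:example:b"/>')
print(other.tag)
print(late.tag == other.tag)
```

ElementTree reports names in `{uri}local` form, which hides the prefix entirely; that is one of several possible API answers to the question of what an element's name is once namespaces are involved.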
Then there is the issue of the ‘default namespace’. I still see regular emails from people confused about why their XPath doesn’t work because of namespace issues. Namespaces is possibly the single largest obstacle for people new to XML. So much else about XML seems common sense, and then XML Namespaces rears its ugly head. I still regularly argue about how our code should handle odd edge cases introduced by namespaces.
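The classic form of that confusion can be reproduced in a few lines. This sketch uses the limited XPath support in Python's xml.etree.ElementTree; the document, the namespace URI, and the prefix `o` are hypothetical.

```python
import xml.etree.ElementTree as ET

# No prefixes appear anywhere, yet every element is in a namespace.
DOC = '<order xmlns="urn:example:orders"><item>widget</item></order>'
root = ET.fromstring(DOC)

# The "obvious" path finds nothing: an unprefixed name in the query means
# "item in no namespace", not "item in the document's default namespace".
missing = root.find("item")
print(missing)

# The query needs its own prefix binding, unrelated to the document's.
ns = {"o": "urn:example:orders"}
found = root.find("o:item", ns)
print(found.text)
```

The prefix in the query does not have to match anything in the document; only the URI matters, which is rarely what newcomers expect.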
Note that nowhere above do I talk about how XML should have handled these issues. In most cases, when the original decisions were made, they made sense to me. I like to believe that I have learned a lesson or two since, but who knows. My purpose in writing this was to educate people about where XML goes astray from what you expect. Proposing solutions is of no real use, since XML is a standard and isn’t changing significantly anytime soon. It is worth understanding where we made our worst mistakes, to avoid making similar mistakes again. The above are some of the hard lessons I have learned from implementing XML APIs for customers for almost 7 years. These are not the only issues I have with the XML 1.0 specification; they are only the most glaring. If I could go back in time, these are the areas I would most have tried to influence in a different direction.