Where XML goes astray...
It seems like every programmer and their brother has picked up XML and is using it as the proverbial hammer to nail some solution. Sometimes it works, sometimes it doesn’t. A lot of people have written about how XML doesn’t scale, or how XML isn’t the right solution for problem X, but for all those complaints, XML has helped solve a lot of problems. What is more interesting is to see which problems it has gained the most traction on.
First, some background: XML was originally designed as an evolution of SGML, a simplification that mostly matched the common usage patterns of the time. Most of its creators saw XML as evolving and expanding the role of SGML, namely text markup. XML was primarily intended to take a stream of text meant to be read as a human-readable document and delineate portions of it according to some role. This sequence of characters is a paragraph. That sequence should be displayed with a link to some other information. Et cetera, et cetera. Much of the process of defining XML was based on the assumption that the text in an XML document would eventually be exposed for human consumption. You can see this in the rules for which characters are allowed in XML content, which characters are valid in Names, and even in “</tagname>” being required rather than just “</>”.
All of that is why I find it so interesting that XML has become so popular for such things as SOAP. XML was not designed with the SOAP scenarios in mind. Other popular scenarios which deviate from XML’s original goals are configuration files, quick-n-dirty databases, and RDF. I’ll call these ‘data’ scenarios, as opposed to the ‘document’ scenarios for which XML was originally intended. In fact, I think it is safe to say that today there is more usage of XML for ‘data’ scenarios than for ‘document’ scenarios. I chose the terms ‘data’ and ‘document’ because these are the terms most often used when this issue is discussed on the XML-DEV mailing list and at work. Personally, I dislike the terminology, because there are many cases where a single document mixes both usage patterns, and because (strictly speaking) documents are data.
As often happens when an existing tool is reused beyond its original purpose, XML is not exactly a perfect fit. It is a surprisingly good fit, but far from perfect. In fact, one of the few things that mess with XML’s fit for these applications isn’t even in the original XML specification; it got its own specification, released less than a year later: XML Namespaces.
The two main things that XML 1.0 (pre-Namespaces) mucked up are whitespace and allowed characters. I’ll take these issues in the reverse of the order I just listed them.
Allowed Characters
The logic went something like this: XML is all about marking up text documents, so the characters in an XML document should conform to what Unicode says is reasonable for a text document. That rules out most control characters, and means that surrogate pairs should be checked. It all sounds good until you see some of the consequences. For example, most databases allow any character in a text column. What happens when you publish your database as XML? What do you do about values that include control characters that the XML specification disallows? XML 1.0 does not provide any escaping mechanism for them (even a numeric character reference such as &#1; is rejected), and if you ask many XML experts, they will tell you to base64 encode your data if it may include invalid characters. It gets worse.
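To make the control-character problem concrete, here is a minimal Python sketch using the standard library's expat-based xml.etree.ElementTree. The helper names (is_xml10_char, encode_value, parses) and the enc attribute are hypothetical, invented for this illustration; the base64 fallback is simply the workaround mentioned above, not anything the XML specification defines.

```python
import base64
import xml.etree.ElementTree as ET


def is_xml10_char(ch):
    """True if ch matches the XML 1.0 'Char' production."""
    cp = ord(ch)
    return (cp in (0x9, 0xA, 0xD)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)


def encode_value(value):
    """Return (text, encoding): plain text when every character is legal,
    otherwise the UTF-8 bytes of the value as base64 (hypothetical scheme)."""
    if all(is_xml10_char(c) for c in value):
        return value, "text"
    return base64.b64encode(value.encode("utf-8")).decode("ascii"), "base64"


def parses(doc):
    """True if the parser accepts doc as well-formed XML."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


# A database value that uses U+0001 as a field delimiter.
raw = "field1\x01field2"

print(parses(f"<value>{raw}</value>"))              # raw control character
print(parses("<value>field1&#1;field2</value>"))    # character reference

text, enc = encode_value(raw)
doc = f'<value enc="{enc}">{text}</value>'
print(parses(doc))

# The consumer has to know about the private 'enc' convention to get the data back.
elem = ET.fromstring(doc)
restored = base64.b64decode(elem.text).decode("utf-8")
print(restored == raw)
```

The first two documents are rejected and only the base64 form parses, at the cost of every consumer needing to understand the private enc convention.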
The characters allowed in an XML name are far more limited. Basically, when designing XML, they allowed everything that Unicode (as defined then) considered a ‘letter’ or a ‘number’. There are only two problems with that: (1) It turns out many characters common in Asian texts were left out of that category by the then-current Unicode specification. (2) The list of characters is sparse and random, making implementation slow and error-prone. Issue (1) has been a significant problem for a number of customers I have worked with, and the only options are either to avoid the character ranges that are not allowed or to implement an application-specific escaping mechanism. The fact that many early parsers (including some of Microsoft’s) did not correctly enforce the rules made the problem worse. I have looked at the code for countless XML parsers, and this is one of the areas that many parsers skimp on. The major supported parsers typically implement this properly, but it is still a source of constant bugs and unexpected complexity, as well as a constraint on performance.
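As an illustration of what an application-specific escaping mechanism for names can look like, here is a rough Python sketch. The _xHHHH_ convention is borrowed in spirit from the one used by .NET's XmlConvert.EncodeName; the functions encode_name and decode_name are hypothetical, and the validity check is deliberately simplified to ASCII letters, digits, '_', '-' and '.' rather than the full (sparse) XML 1.0 name tables.

```python
import re
import xml.etree.ElementTree as ET

# Matches an escape such as _x0020_ (4 to 6 uppercase hex digits).
_ESCAPED = re.compile(r"_x([0-9A-F]{4,6})_")


def _allowed(ch, first):
    """Simplified name-character check (an assumption, not the real XML tables)."""
    if ch.isascii() and (ch.isalpha() or ch == "_"):
        return True
    return not first and ch.isascii() and (ch.isdigit() or ch in "-.")


def encode_name(name):
    """Escape every character that the simplified rule does not allow."""
    out = []
    for i, ch in enumerate(name):
        if ch == "_" and _ESCAPED.match(name, i):
            # A literal underscore that would be mistaken for an escape.
            out.append("_x005F_")
        elif _allowed(ch, first=(i == 0)):
            out.append(ch)
        else:
            out.append(f"_x{ord(ch):04X}_")
    return "".join(out)


def decode_name(encoded):
    """Reverse encode_name."""
    return _ESCAPED.sub(lambda m: chr(int(m.group(1), 16)), encoded)


column = "order id"
tag = encode_name(column)
print(tag)
element = ET.fromstring(f"<{tag}>42</{tag}>")
print(decode_name(element.tag) == column)
```

Both sides of the exchange have to agree on the convention, which is exactly why this kind of escaping stays application-specific.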
Whitespace
When we were first coding up MSXML, whitespace was one of our perpetual nightmares. In hand-authored XML documents (the most common form of documents back then), there tended to be a great deal of whitespace. Humans have a hard time reading XML if everything is jammed on one line. We like a tag per line and indenting. All those extra characters, there just so that our feeble minds could make sense of this awkward jumble of characters, ended up contributing significantly to our memory footprint, and caused many problems for our users. Consider this example:
<customer>
	<name>Joe Schmoe</name>
	<addr>123 Seattle Ave</addr>
</customer>
A customer coming to XML from a database background would normally expect the first child of the <customer> element to be the <name> element. I can’t count how many times I had to explain that it was actually a text node with the value newline+tab. For the first official release of MSXML, we found an awkward compromise that confuses customers to this day, because it depends on some unexposed internal hints. It works great, so long as you don’t edit the DOM and write it out expecting a pretty format like the original version. It has been interesting to talk with people about this issue over the intervening years. I have had people claim that we violated the XML specification, and had others thank us for saving them from having to care about all that extra noise in the DOM.
The problem is that XML doesn’t know the difference between the above scenario and something more like this (using the HTML tag vocabulary):
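The surprise with the <customer> example is easy to reproduce with any DOM that keeps whitespace. The sketch below uses Python's xml.dom.minidom rather than MSXML (an assumption made purely for illustration), with the document indented by one tab per child as described.

```python
from xml.dom import minidom

DOC = (
    "<customer>\n"
    "\t<name>Joe Schmoe</name>\n"
    "\t<addr>123 Seattle Ave</addr>\n"
    "</customer>"
)

root = minidom.parseString(DOC).documentElement

# The first child is the whitespace between <customer> and <name>.
first = root.firstChild
print(first.nodeType == first.TEXT_NODE)
print(repr(first.data))

# Every child, in document order: text, name, text, addr, text.
print([node.nodeName for node in root.childNodes])

# What a database-minded user usually wants: element children only.
elements = [n for n in root.childNodes if n.nodeType == n.ELEMENT_NODE]
print([e.tagName for e in elements])
```

Filtering for element nodes is the usual workaround, but it is the application, not the parser, that decides the whitespace was noise.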
<ul>
<li><pre>
<b>this</b> is a test</pre></li>
</ul>
This last example is actually quite interesting. The whitespace between the <ul> and the <li> tags is not significant, yet the whitespace between the <pre> and <b> tags is significant. The only way to know this is to have an innate understanding of the semantics of the tag vocabulary. That means that there is effectively no universal answer, and it is up to the application to do the right thing… an almost universal guarantee of application bugs.
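One way an application might encode that vocabulary knowledge is sketched below in Python with xml.dom.minidom. The PRESERVE set and the strip_insignificant function are hypothetical; the point is only that the list of whitespace-preserving elements has to come from somewhere outside the document text itself.

```python
from xml.dom import minidom

# Assumption: the application knows which elements of its vocabulary
# treat whitespace as content.
PRESERVE = {"pre"}


def strip_insignificant(node, preserve=PRESERVE):
    """Remove whitespace-only text nodes, except inside preserving elements."""
    if node.nodeType == node.ELEMENT_NODE and node.tagName in preserve:
        return  # leave everything below this element untouched
    for child in list(node.childNodes):
        if child.nodeType == child.TEXT_NODE and not child.data.strip():
            node.removeChild(child)
        else:
            strip_insignificant(child, preserve)


doc = minidom.parseString(
    "<ul>\n<li><pre>\n<b>this</b> is a test</pre></li>\n</ul>"
)
strip_insignificant(doc.documentElement)
result = doc.documentElement.toxml()
print(repr(result))
```

The newlines around <li> disappear while the newline inside <pre> survives, but only because PRESERVE happened to list the right element names.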
XML Namespaces
Namespaces is still, years after its release, a source of problems and disagreement. The XML Namespaces specification is simple and gets the job done with minimum fuss. The problem? It pushes an immense burden of complexity onto the APIs and XML reader/writer implementations. Supporting XML Namespaces introduces significant complexity in the parsers, because it forces them to parse the entire start-tag before returning any text information: the prefix on an element or attribute name may be declared by an xmlns attribute that appears later in the same tag. It complicates XML stores, such as DOM implementations, because the XML Namespaces specification only discusses parsing XML, and that introduces a number of serious complications for edit scenarios. It complicates XML writers, because it introduces new constraints and ambiguities.
Then there is the issue of the ‘default namespace’. I still see regular emails from people confused about why their XPath doesn’t work because of namespace issues. Namespaces is possibly the single largest obstacle for people new to XML. So much else about XML seems common sense, and then XML Namespaces rears its ugly head. I still regularly argue about how our code should handle odd edge cases introduced by namespaces.
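A small Python sketch with xml.etree.ElementTree shows both the parser and the writer side of this. The urn:example:a namespace URI and the element names are made up for the example.

```python
import xml.etree.ElementTree as ET

# The prefix 'a' is used on the element name and on an attribute
# before its declaration appears at the end of the start-tag.
late = ET.fromstring('<a:item a:id="1" xmlns:a="urn:example:a"/>')
print(late.tag)
print(late.attrib)

# A different prefix bound to the same URI names the same element.
other = ET.fromstring('<b:item b:id="1" xmlns:b="urn:example:a"/>')
print(late.tag == other.tag)

# The tree stores only expanded names, so the writer has to choose
# a prefix again when the element is serialized.
serialized = ET.tostring(late)
print(serialized)
```

The parser cannot report the element's name until it has seen the last attribute, and a writer working from the expanded names is free to pick a prefix other than the author's original one.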
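The default-namespace confusion can be reproduced in a few lines. This sketch uses the limited XPath subset in Python's xml.etree.ElementTree; the urn:example:crm URI and the prefix c are assumptions chosen for the example.

```python
import xml.etree.ElementTree as ET

DOC = '<customer xmlns="urn:example:crm"><name>Joe Schmoe</name></customer>'
root = ET.fromstring(DOC)

# The path looks right, but 'name' in the query means "name in no namespace".
unqualified = root.find("name")
print(unqualified)

# The query needs its own prefix binding, even though the document uses none.
qualified = root.find("c:name", {"c": "urn:example:crm"})
print(qualified.text)
```

The element is written without a prefix in the document, yet the query must invent one to reach it, which is the step that trips up most newcomers.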
Conclusion
Note that nowhere above do I talk about how XML should have handled these issues. In most cases, when the original decisions were made, they made sense to me. I like to believe that I have learned a lesson or two since, but who knows. My purpose in writing this was to educate people about where XML goes astray from what you expect. Proposing solutions is of no real use, since XML is a standard and isn’t changing significantly anytime soon. It is worth understanding where we made our worst mistakes, to avoid making similar mistakes again. The above are some of the hard lessons I have learned from implementing XML APIs for customers for almost 7 years. These are not the only issues I have with the XML 1.0 specification; they are only the most glaring. If I could go back in time, these are the areas I would most have attempted to influence in a different direction.

37 Comments:
Well, another issue of the same class is probably XML embeddability (reads as XML nonembeddablility).
And talking about your pessimistic conclusion - in fact XML *is* changing. For better or worse, XML 1.1 fixes the first problem and XPath 2.0 provides support for default namespaces. Of course both also bring additional complexity, such as schema typing in XPath, providing us with stuff to blog about several years later.
I personally use the notation "container style" or "overlay style" for data and document respectively.
Oleg, XML 1.1 addresses the problem with control characters, but does not provide a normative solution for how to encode invalid name characters in names, and does nothing to address the complexity of either whitespace or XML Namespaces. Honestly, I don't think there is anything that can really be done about either, short of a major overhaul. The confusion customers experience with prefixes vs. namespace URIs is something we are stuck with. Great for the consultants making money teaching this stuff, but not great for API designers or users. The fact that 'a:b' can mean one thing in my document, and something completely different in my XPath query, is just hard to wrap your head around. None of this is to say we should give up... just keep these issues in mind when building a new XML system, and make sure to handle both sides of the issue.
Nothing can be done, so learn lessons, get used to it and keep it in mind. I like such attitude!
Derek,
May we republish this splendid blog, under your byline of course, at sys-con.com/xml?
We'd need a brief author bio + contact e-mail.
Let me know, yes?
Thnx in advance! :)
--
Jeremy Geelan
Group Publisher, SYS-CON Media
http://sys-con.com
email: jeremy@sys-con.com
Web Services Edge 2005 East - International Web Services Conference & Expo
Hynes Convention Center, Boston, MA - February 15 - 17, 2005
Call for papers now open!
Tuesday -> 2/15, Conference & Expo
Wednesday -> 2/16, Conference & Expo
Thursday -> 2/17, Conference & Expo
http://sys-con.com/edge
Could you please explain in more detail the complexities introduced by XML namespaces? Specifically, what do you mean by: "It complicates XML stores, such as DOM implementations, because the XML Namespace specification only discusses parsing XML, and introduces a number of serious complications to edit scenarios. It complicates XML writers, because it introduces new constraints and ambiguities."
Moreover, when you say namespaces force parsers to parse the entire start-tag before returning any text information, are you talking about the performance overhead, or about something else as well? Please clarify.
Thanks,
Venkat
Enterprise best practice is to use XML whenever it is obviously unsuitable, such as as a syntax for scripting languages, build files, and configuration information. It is also an enterprise best practice to not use XML for what it is well designed for, such as document management.
I have a new post specifically on Namespaces that I will be posting soon.
I don't buy your argument at all that
[1] people put garbage in their databases
[2] XML is to blame for not accepting garbage in element content.
If it's garbage, don't put it there. If you have a legitimate use for it, i.e. it's representing information, you can convert it, e.g. into elements.
On the subject of whitespace, you don't need "intimate knowledge" of a vocabulary -- the DTD (or Schema) tells you which whitespace is significant, and in a standard way.
I won't claim XML to be perfect, but let's not invent problems with it.
Liam [Liam Quin, liam at w3 dot org]
Nice read. I've been battling with character encoding for XML myself. When you complicate it with database driven XML files and entries coming from other (badly encoded) sites, you just want to throw your hands up.
> I don't buy your argument at all that
> [1] people put garbage in their databases
Well, sad to say, but I work at a popular bookstore in Seattle and this issue actually happens. A legacy data file used control characters as delimiters and that gets loaded directly into the RDBMS. The service exposes the data as XML and those text characters cannot be represented. I think it was 0x01 or something like that.
I don't remember the resolution, but just wanted to mention that it does happen.
I have to also say that the problems presented aren't really with XML, but are design flaws arising from bad usage. I agree that people abuse XML by using it when something else would be more appropriate; that's true for everything, but I don't completely agree with your examples. XML as a language is perfectly suitable for databases and settings, but it may not be appropriate to use an XML parsing library which was designed with formatted text in mind, rather than one which was intended to deliver XML markup as a hierarchical structure of elements and attributes. The only criticisms that I have of XML myself are minor redundancies: the XML declaration doesn't need an extra question mark towards the end, and the common comment syntax takes 8 keystrokes to type. Also the lack of a single-line comment. I don't see any problem in the way whitespace is handled either; all whitespace should be delivered without significant alteration, and interpretation should be application-specific to allow for many possible applications.
How much of this blog entry is original? I smell the strong stench of plagiarism.
James, you amuse me. Baseless condemnation 3 years too late. Thank you for the Monday morning humour.
hey, i like xml ;-)
this is xml power
Yeah so on ! or what!