Tuesday, October 12, 2004

Where XML goes astray...

It seems like every programmer and their brother has picked up XML and is using it as the proverbial hammer to nail some solution. Sometimes it works, sometimes it doesn’t. A lot of people have written about how XML doesn’t scale, or how XML isn’t the right solution for problem X, but for all those complaints, XML has helped solve a lot of problems. What is more interesting is to see which problems it has gained the most traction on.

First, some background: XML was originally designed as an evolution of SGML, a simplification that mostly matched a lot of then-existing common usage patterns. Most of its creators saw XML as evolving and expanding the role of SGML, namely text markup. XML was primarily intended to support taking a stream of text intended to be interpreted as a human-readable document, and delineating portions according to some role. This sequence of characters is a paragraph. That sequence should be displayed with a link to some other information. Et cetera, et cetera. Much of the process of defining XML was based on the assumption that the text in an XML document would eventually be exposed for human consumption. You can see this in the rules for what characters are allowed in XML content, what are valid characters in Names, and even in “</tagname>” being required rather than just “</>”.

All of that is why I find it so interesting that XML has become so popular for such things as SOAP. XML was not designed with the SOAP scenarios in mind. Other examples of popular scenarios which deviate from XML’s original goals are configuration files, quick-n-dirty databases, and RDF. I’ll call these ‘data’ scenarios, as opposed to the ‘document’ scenarios for which XML was originally intended. In fact, I think it is safe to say that there is more usage of XML for ‘data’ scenarios than for ‘document’ scenarios today. I chose the terms ‘data’ and ‘document’ because these are the terms most often used when this issue is discussed on the XML-DEV mailing list and at work. Personally, I dislike the terminology, because there are many cases where a single document mixes both usage patterns, and because (strictly speaking) documents are data.

As often happens when an existing tool is reused for something beyond its original purpose, XML is not exactly a perfect fit. It is a surprisingly good fit, but far from perfect. In fact, one of the few things that mess with XML’s fit for these applications isn’t even in the original XML specification; it got its own specification, released less than a year later: XML Namespaces.
The two main things that XML 1.0 (pre-Namespaces) mucked up are whitespace and allowed characters. I’ll go at these issues in the reverse order to how I just listed them.

Allowed Characters

The logic went something like this: XML is all about marking up text documents, so the characters in an XML document should conform to what Unicode says are reasonable for a text document. That rules out most control characters, and means that surrogate pairs should be checked. It all sounds good until you see some of the consequences. For example, most databases allow any character in a text column. What happens when you publish your database as XML? What do you do about values that include control characters that the XML specification disallows? XML did not provide any escaping mechanism for them, and if you ask many XML experts they will tell you to base64 encode your data if it may include invalid characters. It gets worse.
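To make the control-character problem concrete, here is a minimal Python sketch. The character class follows the XML 1.0 `Char` production; the helper names (`has_invalid_xml_chars`, `to_xml_text`) and the "plain"/"base64" labels are made up for illustration, and the base64 fallback is just the workaround mentioned above, not something the XML specification defines.

```python
import base64
import re
import xml.etree.ElementTree as ET

# Anything outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def has_invalid_xml_chars(value: str) -> bool:
    """Return True if the value cannot appear as-is in XML 1.0 content."""
    return _INVALID_XML_CHARS.search(value) is not None


def to_xml_text(value: str):
    """Hypothetical helper: pick an encoding label plus the text to emit.

    The returned text still needs the usual escaping of & and < when written.
    """
    if has_invalid_xml_chars(value):
        raw = value.encode("utf-8", errors="surrogatepass")
        return "base64", base64.b64encode(raw).decode("ascii")
    return "plain", value


# A numeric character reference does not help: &#1; is not well-formed in XML 1.0.
try:
    ET.fromstring("<v>&#1;</v>")
    char_ref_accepted = True
except ET.ParseError:
    char_ref_accepted = False
```

A consumer then has to know, out of band, which values were base64 encoded, which is exactly the kind of application-specific convention the format was supposed to avoid.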

The characters allowed in an XML name are far more limited. Basically, when designing XML, they allowed everything that Unicode (as defined then) considered a ‘letter’ or a ‘number’. There are only two problems with that: (1) It turns out many characters common in Asian texts were left out of that category by the then-current Unicode specification. (2) The list of characters is sparse and random, making implementation slow and error prone. Issue (1) has been a significant problem for a number of customers I have worked with, and the only options are either to avoid the character ranges that are not allowed or to implement an application-specific escaping mechanism. The fact that many early parsers (including some of Microsoft’s) did not correctly enforce the rules made the problem worse. I have looked at the code for uncounted XML parsers, and this is one of the areas that many parsers skimp on. The major supported parsers typically implement this properly, but it is still a source of constant bugs and unexpected complexity, as well as a constraint on performance.
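As an illustration of what an application-specific escaping mechanism for names might look like, here is a small Python sketch. Everything in it is an assumption made for the example: the `_xHHHH_` escape format, the function names, and above all the deliberately conservative ASCII-only rule for which characters are kept. It does not implement the actual XML 1.0 Name production.

```python
import re

# Assumption: keep only a conservative ASCII subset; this is NOT the XML Name rule.
_START_OK = re.compile(r"[A-Za-z_]")
_REST_OK = re.compile(r"[A-Za-z0-9_.\-]")
_ESCAPED = re.compile(r"_x([0-9A-F]{4,6})_")


def escape_name(raw: str) -> str:
    """Turn an arbitrary non-empty string into a safe element or attribute name."""
    out = []
    for i, ch in enumerate(raw):
        allowed = (_START_OK if i == 0 else _REST_OK).fullmatch(ch) is not None
        # An underscore followed by "x" could be mistaken for an escape, so escape it too.
        if ch == "_" and raw[i + 1:i + 2] == "x":
            allowed = False
        out.append(ch if allowed else f"_x{ord(ch):04X}_")
    return "".join(out)


def unescape_name(name: str) -> str:
    """Reverse escape_name for names that were produced by it."""
    return _ESCAPED.sub(lambda m: chr(int(m.group(1), 16)), name)
```

For example, a database column called `unit price` would become `unit_x0020_price`. Both the producer and the consumer have to agree on the convention, which is the real cost of this approach.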

Whitespace

When we were first coding up MSXML, whitespace was one of our perpetual nightmares. In hand-authored XML documents (the most common form of documents back then), there tended to be a great deal of whitespace. Humans have a hard time reading XML if everything is jammed on one line. We like a tag per line and indenting. All those extra characters, just there so that our feeble minds could make sense of this awkward jumble of characters, ended up contributing significantly to our memory footprint, and caused many problems for our users. Consider this example:
<customer>
	<name>Joe Schmoe</name>
	<addr>123 Seattle Ave</addr>
</customer>

A customer coming to XML from a database background would normally expect that the first child of the <customer> element would be the <name> element. I can’t count how many times I had to explain that it was actually a text node with the value newline+tab. For the first official release version of MSXML, we found an awkward compromise that confuses customers to this day, because it depends on some unexposed internal hints. It works great, so long as you don’t edit the DOM and write it out expecting a pretty format like the original version. It has been interesting to talk with people about this issue over the intervening years. I have had people claim that we violated the XML specification, and had others thank us for saving them from having to care about all that extra noise in the DOM.
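The surprise is easy to reproduce with any DOM that keeps whitespace. The following sketch uses Python's standard `xml.dom.minidom` (not MSXML, whose behavior is described above) on the customer document, written with a newline and a tab before each child element.

```python
from xml.dom import minidom, Node

# The customer example, indented with one tab per child element.
xml_text = (
    "<customer>\n"
    "\t<name>Joe Schmoe</name>\n"
    "\t<addr>123 Seattle Ave</addr>\n"
    "</customer>"
)

doc = minidom.parseString(xml_text)
first = doc.documentElement.firstChild

# The first child is a text node holding newline+tab, not the <name> element.
first_is_text = first.nodeType == Node.TEXT_NODE
first_value = first.data

# Only after skipping the whitespace text nodes do we reach the elements.
element_names = [
    child.tagName
    for child in doc.documentElement.childNodes
    if child.nodeType == Node.ELEMENT_NODE
]
```

Here `<customer>` has five children (three whitespace text nodes and two elements), which is rarely what someone thinking in rows and columns expects.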

The problem is that XML doesn’t know the difference between the above scenario and something more like this (using the HTML tag vocabulary):
<ul>
	<li><pre>
<b>this</b> is a test</pre></li>
</ul>

This last example is actually quite interesting. The whitespace between the <ul> and the <li> tags is not significant, yet the whitespace between the <pre> and <b> tags is significant. The only way to know this is to actually have an innate understanding of the semantics of the tag vocabulary. That means that there is effectively no universal answer, and it is up to the application to do the right thing… an almost universal guarantee of application bugs.
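One way an application might encode that vocabulary knowledge is sketched below in Python with `xml.dom.minidom`. The `PRESERVE` set and the `strip_ignorable` function are hypothetical names for this example; the point is only that the list of whitespace-preserving elements has to come from the application, because the parser cannot infer it.

```python
from xml.dom import minidom, Node

# Assumption: the application knows that <pre> content must be kept verbatim.
PRESERVE = {"pre"}


def strip_ignorable(node, preserve=PRESERVE):
    """Remove whitespace-only text nodes, except inside preserved elements."""
    for child in list(node.childNodes):
        if child.nodeType == Node.TEXT_NODE:
            if child.data.strip() == "":
                node.removeChild(child)
        elif child.nodeType == Node.ELEMENT_NODE and child.tagName not in preserve:
            strip_ignorable(child, preserve)


doc = minidom.parseString(
    "<ul>\n\t<li><pre>\n<b>this</b> is a test</pre></li>\n</ul>"
)
strip_ignorable(doc.documentElement)
result = doc.documentElement.toxml()
```

The whitespace around `<li>` is dropped while the newline inside `<pre>` survives. Change the contents of `PRESERVE` and the same document yields a different tree, which is why no generic tool can get this right on its own.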

XML Namespaces

Namespaces is still, years after its release, a source of problems and disagreement. The XML Namespaces specification is simple and gets the job done with minimum fuss. The problem? It pushes an immense burden of complexity onto the APIs and XML reader/writer implementations. Supporting XML Namespaces introduces significant complexity in the parsers, because it forces parsers to parse the entire start-tag before returning any text information. It complicates XML stores, such as DOM implementations, because the XML Namespace specification only discusses parsing XML, and introduces a number of serious complications to edit scenarios. It complicates XML writers, because it introduces new constraints and ambiguities.
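The start-tag issue can be seen in a small Python sketch using the standard `xml.etree.ElementTree` module. The namespace URI `urn:example:one` and the element names are invented for the example. The prefix `a` is used in the element name before its declaration appears, later in the same start-tag, so a parser cannot report the element's full name until it has read every attribute.

```python
import xml.etree.ElementTree as ET

# "a:item" appears first; the declaration of prefix "a" is the last attribute.
xml_text = '<a:item id="42" xmlns:a="urn:example:one"/>'

root = ET.fromstring(xml_text)

# ElementTree reports the expanded name, with the prefix already resolved.
expanded_name = root.tag
item_id = root.get("id")
```

The prefix itself is gone from `root.tag`; only the URI and local name remain. That hints at the writer-side problem: when the tree is serialized again, the writer has to choose a prefix and decide where to declare it.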

Then there is the issue of the ‘default namespace’. I still see regular emails from people confused about why their XPath doesn’t work because of namespace issues. Namespaces is possibly the single largest obstacle for people new to XML. So much else about XML seems common sense, and then XML Namespaces rears its ugly head. I still regularly argue about how our code should handle odd edge cases introduced by namespaces.
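The classic default-namespace surprise looks like this in Python's `xml.etree.ElementTree`, which supports a subset of XPath. The URI `urn:example:crm` and the prefix `c` are made up for the example.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<customer xmlns="urn:example:crm"><name>Joe Schmoe</name></customer>'
)

# The document never uses a prefix, so the obvious query looks right but finds nothing.
unqualified = doc.find("name")

# The query needs its own prefix binding, even though the document has none.
ns = {"c": "urn:example:crm"}
qualified = doc.find("c:name", ns)
qualified_text = qualified.text
```

The first lookup returns `None` because an unprefixed name in the path means "no namespace", while the unprefixed `<name>` in the document inherits the default namespace. The prefix `c` exists only in the query, which is the same prefix-versus-URI confusion in miniature.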

Conclusion

Note that nowhere above do I talk about how XML should have handled these issues. In most cases, when the original decisions were made, they made sense to me. I like to believe that I have learned a lesson or two since, but who knows. My purpose in writing this was to educate people about where XML goes astray from what you expect. Proposing solutions is of no real use, since XML is a standard and isn’t changing significantly anytime soon. It is worth understanding where we made our worst mistakes to avoid making similar mistakes again. The above are some of the hard lessons I have learned, having been implementing XML APIs for customers for almost 7 years. These are not the only issues I have with the XML 1.0 specification; they are only the most glaring. If I could go back in time, these are the areas I would most have attempted to influence in a different direction.

81 Comments:

Blogger olegt said...

Well, another issue of the same class is probably XML embeddability (read: XML non-embeddability).

And talking about your pessimistic conclusion - in fact XML *is* changing. For better or worse XML 1.1 fixes the first problem and XPath 2.0 provides support for default namespaces. Of course, both also bring additional complexity, such as schema typing in XPath, providing us with stuff to blog about several years later.

4:25 AM  
Anonymous Anonymous said...

I personally use the notation "container style" or "overlay style" for data and document respectively.

2:10 PM  
Blogger derek said...

Oleg, XML 1.1 addresses the problem with control characters, but does not provide a normative solution for how to encode invalid name characters in names, and does nothing to address the complexity of either whitespace or XML Namespaces. Honestly, I don't think there is anything that can really be done about either, short of a major overhaul. The confusion customers experience with prefixes vs namespace URIs is something we are stuck with. Great for the consultants making money teaching this stuff, but not great for API designers or users. The fact that 'a:b' can mean one thing in my document, and something completely different in my XPath query, is just hard to wrap your head around. None of this is to say we should give up... just keep these issues in mind when building a new XML system, and make sure to handle both sides of the issue.

7:13 PM  
Blogger olegt said...

Nothing can be done, so learn lessons, get used to it and keep it in mind. I like such attitude!

3:15 AM  
Anonymous Anonymous said...

Derek,
May we republish this splendid blog, under your byline of course, at sys-con.com/xml?

We'd need a brief author bio + contact e-mail.

Let me know, yes?

Thnx in advance! :)

--
Jeremy Geelan
Group Publisher, SYS-CON Media
http://sys-con.com

email: jeremy@sys-con.com

Web Services Edge 2005 East - International Web Services Conference & Expo
Hynes Convention Center, Boston, MA - February 15 - 17, 2005

Call for papers now open!

Tuesday -> 2/15, Conference & Expo
Wednesday -> 2/16, Conference & Expo
Thursday -> 2/17, Conference & Expo

http://sys-con.com/edge

5:59 AM  
Anonymous Anonymous said...

Could you please explain in more detail the complexities introduced by XML namespaces. Specifically, what do you mean by: "It complicates XML stores, such as DOM implementations, because the XML Namespace specification only discusses parsing XML, and introduces a number of serious complications to edit scenarios. It complicates XML writers, because it introduces new constraints and ambiguities."
Moreover, when you say namespaces forces parsers to parse the entire start-tag before returning any text information, are you talking about the performance overhead or about something else also? Please clarify.

Thanks,
Venkat

7:44 AM  
Anonymous Anonymous said...

Enterprise best practice is to use XML whenever it is obviously unsuitable, such as as a syntax for scripting languages, build files, and configuration information. It is also an enterprise best practice to not use XML for what it is well designed for, such as document management.

3:13 AM  
Blogger derek said...

I have a new post specifically on Namespaces that I will be posting soon.

7:33 PM  
Anonymous Anonymous said...

I don't buy your argument at all that
[1] people put garbage in their databases
[2] XML is to blame for not accepting garbage in element content.

If it's garbage, don't put it there. If you have a legitimate use for it, i.e. it's representing information, you can convert it, e.g. into elements.

On the subject of whitespace, you don't need "intimate knowledge" of a vocabulary -- the DTD (or Schema) tells you which whitespace is significant, and in a standard way.

I won't claim XML to be perfect, but let's not invent problems with it.

Liam [Liam Quin, liam at w3 dot org]

5:09 PM  
Blogger Rakesh Pai said...

Nice read. I've been battling with character encoding for XML myself. When you complicate it with database driven XML files and entries coming from other (badly encoded) sites, you just want to throw your hands up.

3:50 AM  
Blogger Mike said...

> I don't buy your argument at all that
> [1] people put garbage in their databases
Well, sad to say, but I work at a popular bookstore in Seattle and this issue actually happens. A legacy data file used control characters as delimiters and that gets loaded directly into the RDBMS. The service exposes the data as XML and those text characters cannot be represented. I think it was 0x01 or something like that.

I don't remember the resolution, but just wanted to mention that it does happen.

12:24 AM  
Anonymous Anonymous said...

I have to also say that the problems presented aren't really with XML, but design flaws in bad usage. I agree that people abuse XML by using it when something else would be more appropriate, that's true for everything, but I don't completely agree with your examples. XML as a language is perfectly suitable for databases and settings, but it may not be appropriate to use an XML parsing library which was designed with formatted text in mind, rather than one which was intended to deliver XML markup as a hierarchical structure of elements and attributes. The only criticisms that I have of XML myself are minor redundancies, like the XML declaration not needing an extra question mark towards the end, and the common comment syntax that takes 8 keystrokes to type. Also the lack of a single-line comment. I don't see any problem in the way whitespace is handled either: all whitespace should be delivered without significant alteration, and interpretation should be application specific to allow for many possible applications.

6:47 AM  
Blogger James Justin Harrell said...

How much of this blog entry is original? I smell the strong stench of plagiarism.

8:13 AM  
Blogger derek said...

James, you amuse me. Baseless condemnation 3 years too late. Thank you for the Monday morning humour.

9:51 AM  