Tuesday, October 26, 2004

When is text not text?

Liam Quin posted a comment about my 'where xml goes astray' rant:
I don't buy your argument at all that
[1] people put garbage in their databases
[2] XML is to blame for not accepting garbage in element content.

If it's garbage, don't put it there. If you have a legitimate use for it, i.e. it's representing information, you can convert it, e.g. into elements.


Liam's argument rests on the presumption that the only reason to store a sequence of characters is that those characters are intended for display. If you did not intend to display the characters, then why worry about control characters, for example? While I tend to side with him on the issue of storing invalid or unpaired surrogates, there may be valid reasons (beyond just performance considerations) for that as well. XML 1.1 supports my argument that control characters are useful and valid text content. The fact that it requires escaping them is exactly the kind of compromise that is appropriate, and that should be the ultimate goal of a standard like XML.

An interesting example that violates the presumption that characters are about representing text can easily be found in the standard C library: strcspn(). This function takes two strings as arguments and "Scans string1 character by character, returning the number of characters read until the first occurrence of any character included in string2" according to the docs Google found for me (http://www.cplusplus.com/ref/cstring/strcspn.html). What if I want to marshal a call like this over XML? How do I represent the second parameter? (pre XML 1.1, that is...)
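To make the problem concrete, here is a rough sketch in Python. The strcspn() below is my own stand-in for the C function, the delimiter set and the <arg> element are made up for illustration, and I am assuming the parser behind xml.etree.ElementTree (expat) treats a document with no XML declaration as XML 1.0.

```python
import xml.etree.ElementTree as ET


def strcspn(s: str, reject: str) -> int:
    """Stand-in for C's strcspn: length of the prefix of s containing no char from reject."""
    for i, ch in enumerate(s):
        if ch in reject:
            return i
    return len(s)


# A perfectly reasonable second argument: unit separator and record separator.
# It is a sequence of characters, but nobody would call it "text".
reject = "\x1f\x1e"
record = "name\x1fvalue\x1erest"
prefix_len = strcspn(record, reject)  # "name" is 4 characters long


def parses_as_xml10(document: str) -> bool:
    """True if the XML 1.0 parser accepts the document as well-formed."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# Hypothetical <arg> element carrying the second parameter.
literal_ok = parses_as_xml10(f"<arg>{reject}</arg>")      # raw control characters
escaped_ok = parses_as_xml10("<arg>&#x1F;&#x1E;</arg>")   # character references
tab_ok = parses_as_xml10("<arg>&#x9;</arg>")              # tab is one of the few allowed
```

Both the literal and the escaped form are rejected, since XML 1.0 does not let you reference a character it does not allow in the first place. Only tab, line feed and carriage return get through.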

Some argue that you just shouldn't use XML for this. I (and many others) argue that XML is 90+% there. When we were so close, why throw out these use cases? Is preserving purity of 'text' as something more complex than just a sequence of characters so important that we should disallow other interpretations?

Back to Liam's points. [1] the data isn't 'garbage', it just isn't text. It is a sequence of characters. There is an important difference and a format that is about information interop should make it possible to exchange sequences of characters without dropping back to barbaric base64 encoding. [2] I'm not 'blaming' XML. I'm wishing that we could have made 'better' choices for XML (1.0). We obviously learned something because XML 1.1 addresses much of this.
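For comparison, this is roughly what the base64 fallback looks like in Python. The <arg> element, the "encoding" attribute and the helper names are all hypothetical, just one way of doing it, not anything XML itself prescribes.

```python
import base64
import xml.etree.ElementTree as ET


def marshal_chars(tag: str, chars: str) -> str:
    """Wrap an arbitrary character sequence in an element by base64-encoding its UTF-8 bytes."""
    el = ET.Element(tag, {"encoding": "base64"})  # hypothetical marker attribute
    el.text = base64.b64encode(chars.encode("utf-8")).decode("ascii")
    return ET.tostring(el, encoding="unicode")


def unmarshal_chars(document: str) -> str:
    """Reverse of marshal_chars: parse the element and decode its base64 payload."""
    el = ET.fromstring(document)
    return base64.b64decode(el.text).decode("utf-8")


reject = "\x1f\x1e"                  # the strcspn delimiter set from earlier
wire = marshal_chars("arg", reject)  # safe to ship, but opaque to every XML tool
round_tripped = unmarshal_chars(wire)
```

It works, but the payload is no longer readable or searchable by anything that understands XML, and both sides have to agree out of band that this particular element is encoded.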

If, instead of element content, he was referring to tag-names, then that is a slightly different beast. By carefully restricting which characters were allowed in tag-names, XML slows down parser implementations and dates itself by being tied to a specific Unicode version. The growth of international commerce has required constant adjustment of our definition of what is allowable as a 'name' and which characters are allowable in a name. Trying to restrict the definition of what constitutes a tag-name causes problems for a standard focused on interop, such as XML. Interop means the server I'm talking to may be running decades-old software. A client can't depend on the server having been upgraded with a parser for XML 4.2; it must depend on some contractually agreed least-common-denominator, usually v1.0.

One lesson I learned from XML is that a standard like XML should be designed to be open. Version 1.0 should only restrict what is absolutely necessary to restrict. If you target scenarios that you haven't fleshed out yet, then make sure you are as open as you are willing to be. XML 1.0 targets information interop, but assumes that the information is text: not just character data, but actual textual data (targeting eventual textual rendering). I would argue that this mismatch is the single largest source of problems with current XML deployments. Forcing anything remotely resembling 'data' to be encoded, just in case it contains non-textual character sequences, is a very high cost which ultimately provides very little real gain.

