Tuesday, February 28, 2006

Why is it called a Markup Language?

I have a few side interests which have recently found an odd synchronicity. The first interest is in using XML (or something like it) to describe UI. (I've long been a proponent of the idea, although I think most of the implementations utterly fail to fulfill the potential.) The second interest is in building a 'better' XML, or at least an alternative that might be better suited for how I see people actually using XML. The last abstract angle to this awkward web is my interest in building a better XML API. (Designing a better XML is actually just the flip side to designing a better API.) The odd synchronicity is that both tend to terrify me with how people often completely fail to understand what the 'Markup Language' in eXtensible Markup Language (XML) really means.

I'm a bit unusual, in that I came at XML from an SGML background, and yet I was a firm believer in XML's future as a data transport language. The SGML background means that I really do have a firm grasp on what it means to be a 'markup language'. SGML came about as a way to annotate or 'tag' some text, to describe higher level semantics that were not obvious from the text itself, at least not obvious to a computer. The <p> and <title> tags in HTML come to mind as perfect examples of this. Tags also provide a useful place to attach information that may help in presenting the text to a user. The 'target' attribute on the <a> element is a good example of that. What is important to note, though, is that these tags were just adding layers of meaning to an existing natural-language text. This is why it was called a Markup Language. Taken to an extreme (and it wasn't that much of an extreme for some users), you started by writing the document in plain language, and then you went back and added these tags. You were 'marking up' the text, just as your high school English teacher would add all those comments to your papers. What is rather interesting about SGML and XML is that while they are 'languages' for marking up plain text, they are themselves a framework on which other 'languages' are built (think XHTML, RSS, Atom, etc).

Why do I bring this up? Load up an Atom document in a text editor. How much of the document is markup and how much is text? I mean this as no criticism. This is where I envisioned XML leading 10 years ago, when I first heard about W3C's efforts to create XML. We are at a point where the real content of the document is in these 'tags', the things which were originally designed as annotations.

Let's switch tracks for a bit, to XML APIs. Working at Microsoft, I spent a number of years working with customers trying to use XML. The way XML was sold inside Microsoft meant that this shift from the primary data being the text to the primary data being the tags happened very early there. The team that was building the initial XML support libraries mostly came from a data-oriented background (probably because Adam Bosworth was the man who had put the team together... I just happened to have the dumb luck to have wandered across their path at the right time). Even before shipping Microsoft's first real XML library (MSXML), we were struggling with problems stemming from the fact that XML was designed for marking up text, more than it was designed for serializing data.

One of the most frustrating problems for XML API designers boils down to what is called 'mixed content'. Mixed content is what XML was designed for, where there is some text and various parts of that text have tags which layer on some further meaning; think <a>/<p>/<b>/<i>/etc in HTML. The <p> tag is the container, and the content of the <p> is mixed content, meaning it is a mix of text and tags. I also lump processing instructions and comments into this general problem for API designers. The reason these are such a problem is that they don't map to traditional programming models very well. Consider this simple problem: given the XML "<p>This is some <b>bold</b> text.</p>", what is the content of the <p> tag? Most APIs expose the content as a sequence of a text node, a <b> tag, and another text node, where that <b> tag itself has a text node as content. If I am looking at this from a marked-up text perspective, the problem with that description is that all the text inside <p> is equally part of that paragraph, so why is some of it more isolated from the <p> tag than other parts? Now compare that to this XHTML snippet: "<ul><li>Item 1</li><li>Item 2</li></ul>". There we clearly want to keep the 'Item 1' text separate from 'Item 2'; they are distinct values. So what is the 'value' of <ul>? Obviously, one can't use the same logic to report the value of <ul> as we did for <p>.
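To make the two cases concrete, here is a minimal sketch using Python's standard xml.dom.minidom module as a stand-in for a typical DOM-style API (the post does not name a particular one). The helper names describe_children and text_value are hypothetical and exist only for this illustration.

```python
from xml.dom import minidom


def describe_children(xml_text):
    """Return (kind, value) pairs for the direct children of the root element."""
    root = minidom.parseString(xml_text).documentElement
    children = []
    for child in root.childNodes:
        if child.nodeType == child.TEXT_NODE:
            children.append(("text", child.data))
        elif child.nodeType == child.ELEMENT_NODE:
            children.append(("element", child.tagName))
    return children


def text_value(node):
    """Concatenate all descendant text, ignoring the tags (hypothetical helper)."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(text_value(child) for child in node.childNodes)


para = "<p>This is some <b>bold</b> text.</p>"
items = "<ul><li>Item 1</li><li>Item 2</li></ul>"

# The paragraph is reported as text, element, text: 'bold' sits one level down.
print(describe_children(para))
# [('text', 'This is some '), ('element', 'b'), ('text', ' text.')]

# Flattening the text is the natural 'value' of the paragraph...
print(text_value(minidom.parseString(para).documentElement))
# This is some bold text.

# ...but the same rule runs the two list items together.
print(text_value(minidom.parseString(items).documentElement))
# Item 1Item 2
```

The same flattening rule that gives a sensible answer for <p> gives a useless one for <ul>, and the API has no way of knowing which case it is looking at.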

One could definitely argue that part of XML's power is that it so easily expresses both concepts in such a simple syntax. The problem is that the burden has been moved onto the API and the API user. When talking about applications such as HTML, where the majority of the effort is spent in authoring XML content to be processed by a limited number of tools, this trade-off makes absolute sense. But what about all these uses of XML where there is no 'author'? Where the 'user' is effectively the application developer? Now the tables have turned. This complexity has become an impediment to the usability of XML. Most XML APIs mimic the abstract design of XML, which means they expose this complexity, and that is part of why there is such a proliferation of XML binding tools. The other part of the reason is that XML is so flexible that there are many ways to say the same thing.

One way of summarizing the current situation is to say that the majority of XML users are not actually using XML as a 'markup language' at all. They are using it as a data serialization language.

So the problem I've struggled with for the last few years is how to design an XML API that reflects the fact that people use XML as a serialization language, not a real markup language. That would be easy, except for the fact that while most data is just data, some of it is markup... like XHTML. Worse, RSS and Atom are perfect examples of formats that mix both uses in one file. It seems to me that the key is that most applications actually know which parts of the data are data serialization and which are markup. What if there were a way to let the user switch back and forth to the appropriate API for the task at hand? The problem is that I've never quite figured out how to do this. The catch is that you sometimes need to peek ahead and then switch your view based on what you saw.
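As a rough illustration of the 'peek ahead, then switch views' idea, here is a small sketch built on Python's standard xml.etree.ElementTree. It is not a worked-out design: the function names (has_mixed_content, as_markup, as_data, view) are hypothetical, and the peek is a simple heuristic that treats an element as markup when non-whitespace text is interleaved with child elements. The data view also collapses repeated child tags into a single dictionary key, which a real binding would have to handle.

```python
import xml.etree.ElementTree as ET


def has_mixed_content(elem):
    """Peek ahead: does elem interleave non-whitespace text with child elements?"""
    if len(elem) == 0:
        return False
    if elem.text and elem.text.strip():
        return True
    return any(child.tail and child.tail.strip() for child in elem)


def as_markup(elem):
    """Markup view: the element's value is its text with the tags flattened away."""
    return "".join(elem.itertext())


def as_data(elem):
    """Data view: a leaf is a string, anything else maps child tag to child value."""
    if len(elem) == 0:
        return elem.text or ""
    return {child.tag: view(child) for child in elem}


def view(elem):
    """Choose the view per element, based on what the peek found."""
    return as_markup(elem) if has_mixed_content(elem) else as_data(elem)


# A simplified, namespace-free stand-in for a feed entry (not real Atom).
entry = ET.fromstring(
    "<entry><title>Hello</title>"
    "<content>This is <b>bold</b> text.</content></entry>"
)
print(view(entry))
# {'title': 'Hello', 'content': 'This is bold text.'}
```

In this sketch the library guesses per element. An application that already knows which parts are data and which are markup could call as_data or as_markup directly and skip the guess.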

Ultimately, I feel that one of the weights that may eventually lead to something replacing XML is this lack of distinction between markup and data serialization.

1 Comment:

Anonymous said...

We looked at this a bit in prevLife when we were generating APIs for a given schema. The idea was that if we saw that mixed content was allowed, we'd assume that the use case was basically textual markup, and emit APIs that supported that use. Otherwise, we'd assume that we were looking at data serialization and go with APIs that didn't work at all for textual markup.


We didn't get the chance to bake it enough before we had to ship, though.

10:17 AM  
