Monday, August 01, 2005

Microformats, etc

I was sitting in a group code review of some tests of the XML Editor in the forthcoming version of Visual Studio, and we had an interesting discussion. The original tester had implemented these particular tests to be data-driven, meaning that the inputs and expected outputs were separated from the code that implements the test. In this case, the inputs/outputs were stored in a text file. One of the testers commented that since we are the XML team, we should use XML for this, rather than a custom text-file format. I completely agreed with him, and I still do agree that this particular test would benefit from storing the data in XML. On the other hand, XML is not the solution to world hunger.

So when is it a good idea to use XML for your data? The easy answer is that you should use XML when it is likely to be easier (in the long run) than creating your own parser. Using XML carries some cost. XML is verbose, and a general-purpose XML parser will usually be slower than a parser written for one specific format. Why is XML all the rage then? Aside from the hype, there is one very good reason to use XML. For many types of data, it is easier to just load the data into a DOM and extract the information from that than it is to write a custom parser. That means less time spent debugging code and more time spent focusing on the problem at hand.
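To make the "load it into a DOM and extract what you need" point concrete, here is a minimal sketch in Python using the standard library's xml.dom.minidom. The document layout (the tests, test, input, and expected elements and the name attribute) is made up for illustration; it is not the format the actual Visual Studio tests used.

```python
from xml.dom.minidom import parseString

# Hypothetical test-data document; element and attribute names are illustrative.
TEST_DATA = """<tests>
  <test name="simple">
    <input>&lt;a&gt;&lt;/a&gt;</input>
    <expected>ok</expected>
  </test>
  <test name="unclosed">
    <input>&lt;a&gt;</input>
    <expected>error</expected>
  </test>
</tests>"""


def text_of(parent, tag):
    """Return the text content of the first <tag> element under parent."""
    node = parent.getElementsByTagName(tag)[0]
    return "".join(child.data for child in node.childNodes
                   if child.nodeType == child.TEXT_NODE)


def load_cases(xml_text):
    """Load the document into a DOM and pull out (name, input, expected) tuples."""
    dom = parseString(xml_text)
    return [(t.getAttribute("name"), text_of(t, "input"), text_of(t, "expected"))
            for t in dom.getElementsByTagName("test")]


if __name__ == "__main__":
    for name, test_input, expected in load_cases(TEST_DATA):
        print(name, repr(test_input), expected)
```

Escaping, encodings, and whitespace are all handled by the XML parser, so none of that has to be written or debugged by hand.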

As a counterexample, this weekend I was working on a simple file format that had three fields: a revision counter (an integer), a title (a string), and a binary blob. I thought about using XML, but didn’t. Why? Because it would have taken more lines of code to read and write XML than to just use text. I use XML formats all the time for other types of data. There is nothing better for quick-n-dirty hacks. I can use JScript/Java/C#/etc. to generate the data, and XSLT/etc. to post-process it. What is important is that I pick a format that lets me get on with my work.
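For a sense of how little code a format like that needs, here is a rough sketch in Python. The layout is my own assumption for illustration (the revision on the first line, the title on the second, and the raw blob bytes after that), and the function names are hypothetical.

```python
def write_record(revision, title, blob):
    """Serialize as: revision line, title line, then the raw blob bytes."""
    if "\n" in title:
        # Assumed restriction: the title must fit on a single line.
        raise ValueError("title must not contain a newline")
    header = f"{revision}\n{title}\n".encode("utf-8")
    return header + blob


def read_record(data):
    """Inverse of write_record: split off the two header lines, keep the rest."""
    revision_line, title_line, blob = data.split(b"\n", 2)
    return int(revision_line), title_line.decode("utf-8"), blob


if __name__ == "__main__":
    packed = write_record(7, "Release notes", b"\x00\x01\n\xff")
    print(read_record(packed))
```

An XML version would also have to encode the blob somehow (base64, for example), since XML cannot carry arbitrary binary bytes directly.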

I had an interesting discussion with someone about Google’s trick of sending the data as actual JavaScript code. I pointed out that script apps have used this trick for configuration for years. Ever tweaked the startup scripts of any Unix machine? Data stored as code is not new.

There is some discussion (see Dare’s commentary) that having all these custom vocabularies is bad, because it limits the ability of automated tools to reason about the data. While that is true, it completely ignores reality. Ever sit down at a table with a number of experts in a field that you do not know? They may be speaking English, but that doesn’t mean you understand what they are talking about. If you try to force them to speak in layman’s terms, the efficiency of the information exchange drops dramatically. Specific languages are sometimes necessary. Individual specialties within Math and Computer Science all have customized definitions of terms that sometimes conflict. Each specialty evolved its terminology to enable efficient, unambiguous communication between specialists in that field. Custom grammars are a necessity for efficient communication. Language reduces to the least common denominator of the intended listenership. If an application expects generic tools to process its data, then it should use a well-known standard. If local efficiency (of development or data) is more important, then use custom formats.