Thursday, May 24, 2007

Binary XML not quite so evil?

I've recently been spending more time writing code that uses my company's Efficient XML (the basis for the forthcoming W3C Efficient XML Interchange format). I've never been one to claim that Binary XML will replace XML; rather, the sweet spot we target for Efficient XML is places where the alternative is a custom binary protocol. I was reimplementing a tool that used Efficient XML to serialize some potentially large data structures. I was sure that I could do better and that using Efficient XML was overkill. (Yes... all programmers think they can do it better than the automagic tools.) So I sat down and implemented a custom binary format for the same information. I had the advantage of a working implementation that used Efficient XML, so the core implementation was pretty quick: a few hours. One of the goals for this tool is compactness of the result, and the data structures being written out had circular references and other things that ruled out any normal serialization tools I knew of. I did some quick tests on some really simple data, and it all looked good. Time to go home for the day.

The next morning, I picked it up and gave it some real data. Crash and burn. I spent the rest of the day tracking down bit alignment errors and all sorts of small compatibility bugs between the reader and the writer. By the end of the day I had it working on most of our data. It still failed on a few cases, but they generated multi-megabyte output. I had no idea how to debug this. I added tracing, but the trace files were too large to load into memory! By this point I had already spent more time on my custom format than it had taken to implement and debug the Efficient XML based code, and that code didn't have the benefit of existing, working code when I wrote it!

At this point it was at least good enough to evaluate the compactness of my custom format. My format was hand optimized, down to the bits. I expected to beat the Efficient XML encoded data hands-down. So I ran some tests. My custom encoding did beat Efficient XML for most samples... but not all of them! In fact, Efficient XML beat my hand coded format by 20% in one case! What was going on?!?

Well, I knew right away why Efficient XML was beating my code. I had skimped in one case to simplify the code. To achieve equivalent encoding, I was going to have to encode it the same way Efficient XML encodes such situations. The scenario is when you have a set of individually optional values. This is a pain to handle manually, and I have never seen a manual encoding that handles this optimally.
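
To make the optional-values case concrete, here is a minimal Python sketch. It does not show how Efficient XML encodes anything; it only compares two hand-rolled layouts for a record with several individually optional fields: a naive one that spends a whole flag byte per field, and a tighter one that packs one presence bit per field. The field names and the fixed 8-bit value width are made up for illustration.

```python
FIELDS = ["a", "b", "c", "d", "e", "f"]  # hypothetical optional fields
VALUE_BITS = 8  # assumption: every present value fits in 8 bits


def naive_size_bits(record):
    """Size when each field costs a whole flag byte, plus its value if present."""
    return sum(
        8 + (VALUE_BITS if record.get(name) is not None else 0)
        for name in FIELDS
    )


def packed_encode(record):
    """One presence bit per field, followed by the values that are present."""
    bits = [1 if record.get(name) is not None else 0 for name in FIELDS]
    for name in FIELDS:
        value = record.get(name)
        if value is not None:
            # Emit the value most significant bit first.
            bits.extend((value >> shift) & 1 for shift in range(VALUE_BITS - 1, -1, -1))
    return bits


def packed_decode(bits):
    """Inverse of packed_encode: read the presence bits, then the values."""
    present = bits[:len(FIELDS)]
    pos = len(FIELDS)
    record = {}
    for name, flag in zip(FIELDS, present):
        if flag:
            value = 0
            for bit in bits[pos:pos + VALUE_BITS]:
                value = (value << 1) | bit
            record[name] = value
            pos += VALUE_BITS
    return record


sample = {"b": 7, "e": 200}
encoded = packed_encode(sample)
assert packed_decode(encoded) == sample  # reader and writer agree
print(naive_size_bits(sample), len(encoded))  # prints "64 22"
```

Even in this toy version the reader and writer have to agree on field order, bit order, and value width, which is exactly the kind of detail that is easy to get subtly wrong by hand.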

So what does this tell us? That Efficient XML can truly be as compact as a custom binary format! Since the format is specified using XSD, there are a number of tools out there to help define and document the format. You can prototype the format in Text XML, and then switch to Efficient XML once the bugs are mostly ironed out. Alternatively, you can manually decode the Efficient XML stream to Text XML for debugging purposes. I've found this invaluable.

When using Efficient XML (rather than a custom binary format) you are programming against standard XML APIs. There is a cleaner separation of the bit-encoding from the rest of the encoding/decoding logic. I have long argued that XML can be a good fit for configuration files, simply because it means there is less parser logic, and it is easier to use standard tools to process your config files. Many of the same benefits apply to Efficient XML, with the caveat that you need to either use APIs that understand Efficient XML, or translate from Efficient XML to Text XML. The important point is that you have all the options available with very little effort, all while getting many of the benefits of a custom binary format.
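
As a small illustration of the "less parser logic" point, here is a sketch in Python using the standard library's `xml.etree.ElementTree` on a made-up config document. The element and attribute names are hypothetical. It reads Text XML only; reading Efficient XML would need an API that understands it, or a conversion step first.

```python
import xml.etree.ElementTree as ET

# Hypothetical config document; the element and attribute names are invented.
CONFIG_TEXT = """
<config>
  <server host="localhost" port="8080"/>
  <logging level="debug"/>
</config>
"""


def load_config(text):
    """Pull a few settings out of the config using the standard XML API."""
    root = ET.fromstring(text)
    server = root.find("server")
    logging_el = root.find("logging")
    return {
        "host": server.get("host"),
        "port": int(server.get("port")),
        "log_level": logging_el.get("level"),
    }


config = load_config(CONFIG_TEXT)
print(config)
```

All the tokenizing, quoting, and escaping rules live in the XML library; the application code only names the elements and attributes it cares about.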

Efficient XML is no panacea. It is not a replacement for all binary formats, just as XML is not the be-all/end-all. Efficient XML is an excellent choice when XML would be a viable choice, except for its verbosity. (Efficient XML is also faster than Text XML to generate and parse.) I have also played with auto-generating custom parsers for Efficient XML with a specific grammar. These can be blazingly fast and yet still work with conformant Efficient XML.

Lots of people like to talk about why they think Binary XML is a bad idea: (a) (b) (c). Most arguments against Binary XML focus on 2 points:
  1. Text is good, Binary is bad
  2. XML is defined as a textual format. Anything else isn't XML.
(1) There are some good reasons to recommend Text formats. Any text editor can be used to edit the data. It is easier to debug. In packet traces and other debug logs, it is easier to extract and investigate. I definitely agree that Text is easier than Binary to debug and apply generic tools to. But to compare Binary XML to a custom binary format is unfair. All you need is one conversion tool to convert any Binary XML to Text XML, and thus get all the benefits of text. In comparison, you would need a custom tool for every custom binary format. The extra effort means that the custom tool would likely never be written. With Binary XML, the tool is just a given. I have used this many times to great effect, for both Text XML and Efficient XML.

Summary: Text is better than Custom Binary, but Binary XML is more like Text than it is like Custom Binary.

(2) The XML spec does define XML as a Unicode stream of characters; no one can argue with that. But why then is it OK to talk about XML APIs, or the XML Infoset? When people talk about 'XML ' (or ' XML') they are talking about leveraging XML. In order to really do anything with XML, you need a parser, so unless you are writing an XML parser, you never deal with straight XML anyway. Binary XML just extends the existing XML domain to include a more compact encoding. You give up some of the benefits of a Text format, while gaining many of the benefits of a custom binary format.

Summary: Most software that 'uses' XML isn't interacting with the raw text stream, so why does this matter so much? Binary XML isn't XML, it is Binary XML.

Ultimately, Binary XML is not about replacing XML with some new binary encoding. It is about leveraging the many benefits of XML in situations that cannot use Text XML. Binary XML just extends the reach of all those existing XML tools, both for the application developer and the application user.

3 Comments:

Blogger Unknown said...

This comment has been removed by the author.

12:36 AM  
Blogger Unknown said...

Binary XML isn't that bad and can be extremely good in some scenarios, but it's a disaster when it comes to tools and APIs; I mean an almost complete lack thereof.

12:38 AM  
Anonymous Anonymous said...

Quote:
"Summary: Text is better than Custom Binary"

That's quite a statement :p
I'm sure it would be true if:

1. By "better than" you mean "not"

2. XML or [insert-your-favourite-text-format] is the center of the world.

To be honest if an auto-generated data-format beat you, it _only_ proves that you are either lazy or not very good at structuring data.

Binary XML is only useful as a replacement in cases where XML is already used, but it can obviously never beat a custom format.

4:04 AM  
