Monday, February 07, 2005

Think'n aloud about RDF

This weekend I was playing with ideas about implementing a structured editor, an editor that understands the syntactic structure of your document and implements edits at the structural level, rather than at the text/character level. Most often you hear of this idea relative to editors designed for programming languages. Once upon a time, there was a team at Microsoft that worked on a programming environment that included a structural editor. The head of that team wandered off to found a company devoted to the idea. Although some people have commented that a structural editor made a number of common tasks overly complicated, I’ve always found the idea appealing.

As I was playing with the idea, it occurred to me that what I was designing in my head had a lot in common with Haystack, an MIT uber semantic-web platform. So I tried to map my ideas to RDF. Mind you, I'm not a huge fan of RDF; it just looks like what the AI textbooks called 'Frames' 20 years ago. I was quickly struck by how limiting the simplistic data model of RDF is. I very quickly bumped into the fact that RDF's design, while great for metadata, is abysmal for actual text data.

The (simplified) abstract model of a paragraph in a word processor is that a paragraph is a sequence of sentences, each of which is a sequence of words, each of which is a sequence of characters. Formatting is applied to a span of characters. The edges of this span may or may not match the edges of words or sentences. How would you represent this in RDF? I can easily think of almost half a dozen ways to model this in RDF, and I don't like any of them.
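To make that model concrete, here is a minimal Python sketch of it. The names (`Span`, `word_spans`, `aligns_with_words`) are my own inventions for illustration, and splitting words on whitespace is a deliberate simplification; a real editor would need something far more careful.

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A half-open range of character offsets into the paragraph text."""
    start: int  # inclusive
    end: int    # exclusive


def word_spans(text: str) -> list[Span]:
    """Record the character offsets of each whitespace-separated word."""
    spans, pos = [], 0
    for word in text.split():
        start = text.index(word, pos)
        spans.append(Span(start, start + len(word)))
        pos = start + len(word)
    return spans


def aligns_with_words(fmt: Span, words: list[Span]) -> bool:
    """True if both edges of a formatting span fall on word edges."""
    starts = {w.start for w in words}
    ends = {w.end for w in words}
    return fmt.start in starts and fmt.end in ends


text = "Mary had a little lamb."
words = word_spans(text)

bold_aligned = Span(0, 4)    # covers exactly "Mary"
bold_unaligned = Span(2, 7)  # covers "ry ha", cutting through two words
```

The point of the sketch is that the formatting span lives at the character level while words and sentences live above it, so the two hierarchies overlap rather than nest. That overlap is exactly what makes the model awkward to express as a tree or as a pile of triples.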

A long while back, I was ranting to Dare about how I didn't like RDF because it encouraged 'little languages'. At first I was planning to blog about this, basically from an XML vs. RDF perspective, but I decided against it, because the more I thought about it, the more I realized XML suffers from much the same disease.

When a modern browser renders <p>Mary had a little lamb. The cat had a hat.</p>, it applies a whole set of complicated rules for how to interpret the sequence of characters inside the <p> element. First of all, there are two sentences, and there should be more space separating the sentences than separating the words within a sentence. Secondly, word wrap should only occur on word boundaries. I'm sure there are lots of other rules as well, but those are the obvious ones. What that means is that the browser is actually interpreting the content of that <p> element not as a sequence of characters, but as a data structure in its own right.

Back to RDF: how would you represent that simple bit of HTML in RDF? I think most people would start by just creating a paragraph resource that contains the text. But now all the structure that the HTML rendering engine is using is lost to any RDF application. Designing an RDF schema is a delicate art of balancing how much structure it is necessary to expose and how much to leave implied. Dates are another obvious example: a date stored as a single literal leaves its year, month, and day implied. This is all fine and good until a new customer comes along with a new scenario, and they want to be able to refer to word #2 of sentence #1. As far as I can tell, RDF just leaves them to their own devices at this point.
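Here is a rough sketch of that "paragraph resource with a text literal" approach, with triples as plain Python tuples rather than a real RDF library. The `ex:` vocabulary and the helper names are made up for illustration, and the sentence splitting is deliberately naive.

```python
# Triples as (subject, predicate, object) tuples.
# The "ex:" names are a hypothetical vocabulary, not a real schema.
coarse = [
    ("ex:p1", "rdf:type", "ex:Paragraph"),
    ("ex:p1", "ex:text", "Mary had a little lamb. The cat had a hat."),
]


def objects(graph, subject, predicate):
    """Return every object of triples matching the subject and predicate."""
    return [o for s, p, o in graph if s == subject and p == predicate]


def nth_word_by_reparsing(graph, paragraph, sentence_no, word_no):
    """Find word #word_no of sentence #sentence_no (both 1-based).

    The graph only knows about one opaque literal, so the application
    has to re-parse the text with its own private rules.
    """
    text = objects(graph, paragraph, "ex:text")[0]
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return sentences[sentence_no - 1].split()[word_no - 1]


word = nth_word_by_reparsing(coarse, "ex:p1", 1, 2)  # "had"
```

Nothing in the graph itself can name that word. Every application that wants it ends up writing its own version of `nth_word_by_reparsing`, which is the 'little language' problem all over again.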

This reminds me of two other issues I have with RDF. First, sequences. The way sequences are hacked into RDF is embarrassing. That is just such a blatant, ugly hack. The other issue I have is that there does not appear to be a way to treat a triple as a resource. What if I want to comment on someone else's comment? I have never seen a way to do that. Triples don't themselves have URIs, so there is no way to indicate that the target of one triple is another triple. Maybe I'm just too 'meta', but if I were going to define a format for metadata, I'd plan in a way for people to treat my metadata as data and annotate it with meta-metadata.
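For anyone who hasn't seen the sequence encoding, here is a small sketch of it. An rdf:Seq container records order through numbered membership properties (rdf:_1, rdf:_2, and so on), so the position lives in the name of the predicate. I'm using prefixed names as plain strings instead of full URIs, and `ex:s1` and `seq_items` are hypothetical names.

```python
import re

# Triples are an unordered set, so order has to be smuggled into
# the predicate names themselves.
seq = [
    ("ex:s1", "rdf:type", "rdf:Seq"),
    ("ex:s1", "rdf:_2", "had"),
    ("ex:s1", "rdf:_1", "Mary"),
    ("ex:s1", "rdf:_3", "a"),
]


def seq_items(graph, subject):
    """Recover the ordered members by parsing the index out of each predicate."""
    members = []
    for s, p, o in graph:
        match = re.fullmatch(r"rdf:_(\d+)", p)
        if s == subject and match:
            members.append((int(match.group(1)), o))
    return [o for _, o in sorted(members)]


items = seq_items(seq, "ex:s1")  # ["Mary", "had", "a"]
```

Having to run a regular expression over property names just to read a list back in order is the sort of thing I mean by a hack.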

So, if I ever find time to build my structured editor, I guess I won't be basing its internal data model on RDF.