Monday, February 07, 2005

Think'n aloud about RDF

This weekend I was playing with ideas about implementing a structured editor, an editor that understands the syntactic structure of your document and implements edits at the structural level, rather than at the text/character level. Most often you hear of this idea relative to editors designed for programming languages. Once upon a time, there was a team at Microsoft that worked on a programming environment that included a structural editor. The head of that team wandered off to found a company devoted to the idea. Although some people have commented that a structural editor made a number of common tasks overly complicated, I’ve always found the idea appealing.

As I was playing with the idea, it occurred to me that what I was designing in my head had a lot in common with Haystack, an MIT uber semantic-web platform. So I tried to map my ideas to RDF. Mind you, I'm not a huge fan of RDF; it just looks like what the AI textbooks called 'Frames' 20 years ago. I was quickly struck by how limiting the simplistic data model of RDF is. I very quickly bumped into the fact that RDF's design, while great for metadata, was abysmal for actual text data.

The (simplified) abstract model of a paragraph in a word processor is that a paragraph is a sequence of sentences, each of which is a sequence of words, each of which is a sequence of characters. Formatting is applied to a span of characters. The edges of this span may or may not match the edges of words or sentences. How would you represent this in RDF? I can easily think of almost half a dozen ways to model this in RDF, and I don't like any of them.
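To make that abstract model concrete, here is a minimal Python sketch of it. The names Span, Paragraph, word_bounds and aligned are made up for illustration, and the single-space joining between words and sentences is a simplifying assumption, not how a real word processor stores text.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """A formatting run over character offsets [start, end) of the paragraph text."""
    start: int
    end: int
    style: str


@dataclass
class Paragraph:
    # A paragraph is a list of sentences; a sentence is a list of words;
    # a word is a plain string, i.e. a sequence of characters.
    sentences: list
    spans: list = field(default_factory=list)

    def text(self):
        # Flatten back to characters, with one space between words and sentences.
        return " ".join(" ".join(words) for words in self.sentences)

    def word_bounds(self):
        # Character offsets [start, end) of every word in the flattened text.
        bounds, pos = [], 0
        for words in self.sentences:
            for word in words:
                bounds.append((pos, pos + len(word)))
                pos += len(word) + 1  # skip the separating space
        return bounds

    def aligned(self, span):
        # True only when both edges of the span fall on word edges.
        bounds = self.word_bounds()
        starts = {s for s, _ in bounds}
        ends = {e for _, e in bounds}
        return span.start in starts and span.end in ends


para = Paragraph([["Mary", "had", "a", "little", "lamb."],
                  ["The", "cat", "had", "a", "hat."]])
bold_word = Span(0, 4, "bold")      # covers exactly "Mary"
bold_partial = Span(2, 10, "bold")  # covers "ry had a", cutting through "Mary"
para.spans.extend([bold_word, bold_partial])
```

The second span is the awkward case: it is perfectly legal formatting, but it does not line up with any node in the sentence/word tree, which is what makes a pure tree or graph encoding uncomfortable.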

A long while back, I was ranting to Dare about how I didn't like RDF because it encouraged 'little languages'. At first I was planning to blog about this, basically from an XML vs RDF perspective, but decided against it, because the more I thought about it, the more I realized XML suffered from much the same disease.

When a modern browser renders <p>Mary had a little lamb. The cat had a hat.</p>, it applies a whole set of complicated rules for how to interpret the sequence of characters inside the <p> element. First of all, there are two sentences, and there should be more space separating the sentences than separating the words within a sentence. Secondly, word wrap should only occur on word boundaries. I'm sure there are lots of other rules as well, but those are the obvious ones. What that means is that the browser is actually interpreting the content of that <p> element not as a sequence of characters, but as a data structure in its own right.

Back to RDF; how would you render that simple bit of HTML in RDF? I think most people would start with just creating a paragraph resource that contained the text. But now all that structure that the HTML rendering engine is using is lost to any RDF application. Designing an RDF schema is a delicate art of balancing how much structure is necessary to expose and how much to leave implied. Dates are another obvious example. This is all fine and good, until a new customer comes along with a new scenario, and they want to be able to refer to word #2 of sentence #1. As far as I can tell, RDF just leaves them to their own devices at this point.
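Here is a rough Python sketch of that coarse "paragraph resource that contains the text" model. Triples are plain (subject, predicate, object) tuples rather than a real RDF library, and the example.org URIs, the text property, and the word_at helper are all hypothetical.

```python
# Triples as plain (subject, predicate, object) tuples; no RDF library is used.
# Every URI below is hypothetical.
EX = "http://example.org/doc#"

# Coarse model: one paragraph resource with the whole text as a single literal.
coarse = [
    (EX + "para1", EX + "text", "Mary had a little lamb. The cat had a hat."),
]


def word_at(triples, para, sentence_no, word_no):
    """Find word #word_no of sentence #sentence_no (1-based) by re-parsing the literal."""
    text = next(o for s, p, o in triples if s == para and p == EX + "text")
    # Naive split on ". "; the graph itself knows nothing about sentences or words.
    sentences = [s for s in text.split(". ") if s]
    return sentences[sentence_no - 1].split()[word_no - 1]


second_word = word_at(coarse, EX + "para1", 1, 2)
```

The application can dig the word out, but only by re-implementing the parsing rules itself. The word never becomes a resource, so no other triple can point at it.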

This reminds me of two other issues I have with RDF. First, sequences. The way sequences are hacked into RDF is embarrassing. That is just such a blatant, ugly hack. The other issue I have is that there does not appear to be a way to treat a triple as a resource. What if I want to comment on someone else's comment? I have never seen a way to do that. Triples don't themselves have URIs, so there is no way to indicate that the target of one triple is another triple. Maybe I'm just too 'meta', but if I'm going to define a format for metadata, I'd plan in a way for people to treat my metadata as data and annotate it with meta-metadata.

So, if I ever find time to build my structured editor, I guess I won't be modeling its internal data model on RDF.

3 Comments:

Anonymous said...

Thanks for the insight.

Would you list a few instances you think RDF is a useful application of?

1:17 PM  
Blogger derek said...

RDF works for simple cases where it is obvious, and largely unambiguous, how to map the conceptual model to a labeled graph. It could trivially model a card catalog, for example. The MIT Haystack project I mention in the entry manages to squeeze RDF for all it is worth, and provides some interesting examples. As I say above, I'm not a big fan of RDF. I keep coming back to thinking about RDF because I am very interested in the idea of data abstractions. I'm just, more and more, coming to the conclusion that RDF is too simplistic for its own good.

9:45 PM  
Blogger Danny said...

My first reaction was to consider modelling a paragraph in RDF rather a strange idea, but on second thoughts, I don't see why not. A paragraph has a parse tree, and a tree can be represented in a graph. RDF is essentially an entity-relation model: the characters can be the entities and the structure defined through relations.

Like you say, for editing purposes a paragraph can be seen as a sequence of characters. rdf:Seq is pretty lousy, but there's a more suitable construct - the RDF Collection. It uses Lisp-like modelling, i.e. first/rest. One way would be to represent each alphabetic character as a resource with a URI, with an individual paragraph being a list of these.
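As a rough illustration of the first/rest idea, here is a small Python sketch that encodes a run of characters as an RDF Collection using rdf:first, rdf:rest and rdf:nil. Triples are plain tuples, the "_:cell" blank-node labels and both helper functions are invented for the example, and the characters are kept as plain literals rather than given URIs, purely to keep it short.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FIRST, REST, NIL = RDF + "first", RDF + "rest", RDF + "nil"


def to_collection(items, prefix="_:cell"):
    """Encode items as first/rest triples; returns (head_node, triples)."""
    if not items:
        return NIL, []  # the empty list is just rdf:nil
    nodes = [f"{prefix}{i}" for i in range(len(items))]
    triples = []
    for i, item in enumerate(items):
        nxt = nodes[i + 1] if i + 1 < len(nodes) else NIL
        triples.append((nodes[i], FIRST, item))  # the value held by this cell
        triples.append((nodes[i], REST, nxt))    # link to the next cell, or nil
    return nodes[0], triples


def from_collection(head, triples):
    """Walk the rest links from head and collect the first values in order."""
    first = {s: o for s, p, o in triples if p == FIRST}
    rest = {s: o for s, p, o in triples if p == REST}
    out = []
    while head != NIL:
        out.append(first[head])
        head = rest[head]
    return out


head, triples = to_collection(list("Mary"))
chars = from_collection(head, triples)
```

Four characters cost eight triples here, which gives a feel for the granularity trade-off: the structure is fully exposed, but a whole document modelled this way gets large quickly.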

There are lots of possible ways of doing it; it would depend on the application. I don't think "until a new customer comes along with a new scenario" is a very strong argument against using RDF, as generally it fares pretty well compared to more rigid modelling techniques.

I don't think the problem is that RDF's conceptual design is bad for text data, only that the concrete constructions (like URIs and RDF/XML) are more oriented towards less granular kinds of applications.

RDF does have a lot in common with AI frames (as does object-oriented programming ;-) but there has been a fair bit of development in those 20 years, most notably the Web.

You can treat triples as resources through reification - it's not ideal, but this can fulfil at least most uses. If necessary you can add things like context at the application level (Named Graphs is a proposal that would make it easier, but there aren't likely to be any changes to the specs in the near future).
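For the "comment on a comment" case, reification looks roughly like the Python sketch below. It uses the standard rdf:Statement, rdf:subject, rdf:predicate and rdf:object terms, but the triples are plain tuples and the example.org namespace, the hasComment property, and the reify helper are all hypothetical.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
EX = "http://example.org/blog#"  # hypothetical namespace


def reify(node, triple):
    """Describe a triple as a resource using the RDF reification vocabulary."""
    s, p, o = triple
    return [
        (node, RDF + "type", RDF + "Statement"),
        (node, RDF + "subject", s),
        (node, RDF + "predicate", p),
        (node, RDF + "object", o),
    ]


# The original comment, stated as an ordinary triple.
comment = (EX + "post1", EX + "hasComment", "RDF sequences are a hack.")
graph = [comment] + reify(EX + "stmt1", comment)

# A comment on the comment: the reified statement can now be a subject.
graph.append((EX + "stmt1", EX + "hasComment", "Collections are an alternative."))
```

It takes four extra triples to make one triple addressable, and nothing ties the reified description to the original triple except convention, which is part of why it is "not ideal".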

It would make an interesting exercise to build and compare a structured editor based on, say, an object-oriented design (C#?), a list-oriented design (Lisp?), a graph-oriented design (RDF?) and a tree-oriented design (DOM?!)...

9:18 AM  
