Subject: notes on RDF profiling
From: Bruce D'Arcus <bruce.darcus@OpenDocument.us>
To: office-metadata <office-metadata@lists.oasis-open.org>
Date: Wed, 23 Aug 2006 10:54:10 -0400
Re: the business of RDF profiling, I'm just posting some issues to 
consider when we get to it. Basically, it's sort of the thinking that 
drove the design of my demo schema.

1)  triples

What is fundamental to RDF is the triple model, and the use of URIs to 
identify things (the elements of the triples).

What that core provides is a really simple, but really powerful, model 
to mix and merge data. E.g. it's the answer to the requirement for a 
metadata system that is robustly extensible, and which allows data to 
be articulated across the boundaries of functionality (a document 
description to reference a contact record, for example).

At the most basic, then, every statement about a resource has a value 
that is either a literal or a URI, where the URI references another 
resource.

The first then might look like:

	urn:x:1 --> x:title --> "A title"

... and the second:

	urn:x:1 --> x:author --> urn:person:y

... where the last urn is an identifier for a person record, which in 
turn might be:

	urn:person:y --> x:name --> "Jane Doe"

If you're thinking of a relational database, the first example is like 
a column in a "documents" table, while the second is like a foreign key 
reference to a row in a "people" table.

If you're thinking about objects, the first is a simple string 
attribute, while the second is a reference to another object.
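
A minimal sketch of the two kinds of statement, in plain Python rather 
than with any RDF library. The tuple layout, the Literal wrapper and 
the helper names are assumptions made for illustration; the URIs and 
property names are the ones from the examples above.

```python
from typing import NamedTuple


class Literal(NamedTuple):
    """A literal value, as opposed to a URI that names another resource."""
    value: str


# Each triple is (subject URI, predicate, object).
# The object is either a Literal or a URI string.
triples = [
    ("urn:x:1", "x:title", Literal("A title")),
    ("urn:x:1", "x:author", "urn:person:y"),
    ("urn:person:y", "x:name", Literal("Jane Doe")),
]


def objects(subject, predicate):
    """Return every object stated for (subject, predicate)."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def author_names(document):
    """Follow x:author links (like a foreign key) and collect x:name literals."""
    names = []
    for author in objects(document, "x:author"):
        if isinstance(author, Literal):
            continue  # a literal cannot be followed to another resource
        names.extend(lit.value for lit in objects(author, "x:name"))
    return names


print(author_names("urn:x:1"))
```

Because the author is a URI rather than a string, any other description 
can point at the same person record.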

Any RDF profile really must support this basic notion. Hell, even if it 
was a non-RDF metadata encoding, it should support it!

XMP, FWIW, does not really support the model, since it does not support 
values (objects) that are URIs. All properties are either literals or 
blank nodes ...

2)  blank nodes

Above, the person record is identified by a URI and thus separately 
described. This is a powerful approach because it means that the person 
description becomes a node in the graph, which other descriptions can 
also link to. It's good, normalized data design.

But practically speaking, one may not always want to identify something 
as a discrete -- linkable -- resource like this.

This is what blank nodes do: they allow you to have nested, anonymous 
resource descriptions. I include them in my demo schema, and XMP also 
includes them. RDF/XML also has a short-hand syntax for this (the 
rdf:parseType="Resource" attribute), which XMP also supports.
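
As a rough illustration of nesting, here is a Python sketch that 
flattens a nested description into triples and mints a local blank node 
label for each nested part. The flatten function, the dict layout and 
the "_:bN" labels are assumptions for the example, not part of any 
schema; plain strings stand in for literals.

```python
import itertools


def flatten(subject, description, counter=None):
    """Turn a nested dict into triples, one blank node per nested dict."""
    if counter is None:
        counter = itertools.count()
    triples = []
    for predicate, value in description.items():
        if isinstance(value, dict):
            # A blank node label is only meaningful inside this graph;
            # nothing outside can link to it the way it could to a URI.
            bnode = f"_:b{next(counter)}"
            triples.append((subject, predicate, bnode))
            triples.extend(flatten(bnode, value, counter))
        else:
            triples.append((subject, predicate, value))
    return triples


doc = {"x:title": "A title", "x:author": {"x:name": "Jane Doe"}}
for triple in flatten("urn:x:1", doc):
    print(triple)
```

The person description is still there, but only as an anonymous node 
hanging off the document.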

3) properties as elements or attributes?

The single biggest problem with RDF/XML for XML tools is that it makes 
no distinction between elements and attributes. Properties can be 
encoded in either way. This gives XML tools problems.

There's a rational history behind this (they wanted to allow embedding 
of properties within XHTML IIRC), but I don't think it makes any sense 
to allow attributes in ODF for this. It buys us nothing really, and 
adds significant complexity from an XML perspective.

Adobe, for some reason, allows both in XMP.
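
To make the tooling problem concrete, here is a small Python sketch 
using the standard library's ElementTree. The x namespace URI is made 
up for the example. Both fragments state the same triple, but a lookup 
that only checks child elements misses the attribute form.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
X = "http://example.org/x#"  # hypothetical namespace for the x: prefix

AS_ELEMENT = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:x="{X}">
  <rdf:Description rdf:about="urn:x:1">
    <x:title>A title</x:title>
  </rdf:Description>
</rdf:RDF>"""

AS_ATTRIBUTE = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:x="{X}">
  <rdf:Description rdf:about="urn:x:1" x:title="A title"/>
</rdf:RDF>"""


def titles_element_only(xml_text):
    """Find titles the way a simple element lookup would."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter(f"{{{X}}}title")]


def titles_both_forms(xml_text):
    """Check property attributes as well as child elements."""
    root = ET.fromstring(xml_text)
    found = []
    for desc in root.iter(f"{{{RDF}}}Description"):
        attr = desc.get(f"{{{X}}}title")
        if attr is not None:
            found.append(attr)
        found.extend(el.text for el in desc.findall(f"{{{X}}}title"))
    return found


print(titles_element_only(AS_ELEMENT), titles_element_only(AS_ATTRIBUTE))
print(titles_both_forms(AS_ELEMENT), titles_both_forms(AS_ATTRIBUTE))
```

Every XPath, XSLT template or schema pattern written against such data 
has to carry that second branch.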

4)  types

To plug RDF data into ontology-based systems, you type resources. You 
create a class like "meta:Document" and, if you like, write a little 
RDF schema fragment that gives further information about it ("it's a 
class that is like this other class, it has x, y, z natural language 
descriptions, etc.").

Types in RDF can either be assigned by replacing the rdf:Description 
wrapper with another term (like "meta:Document" above), or by using an 
rdf:type element with a URI rdf:resource attribute.

I personally think allowing typing makes sense, but there's one 
potential impedance mismatch with traditional OO programming languages 
one needs to take into account, which is that RDF descriptions can have 
many types. Also, because typing can be indicated in two ways, you need 
to account for this in XML tools.
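
A sketch of what "accounting for this" might look like in a non-RDF XML 
tool, again in Python with ElementTree. The meta namespace URI is 
invented for the example. The function returns a set of types per 
subject, since a description may have more than one.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
META = "http://example.org/meta#"  # hypothetical namespace

# Form 1: the wrapper element name is the type.
TYPED_NODE = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:meta="{META}">
  <meta:Document rdf:about="urn:x:1"/>
</rdf:RDF>"""

# Form 2: a generic wrapper plus an explicit rdf:type property.
TYPE_PROPERTY = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:meta="{META}">
  <rdf:Description rdf:about="urn:x:1">
    <rdf:type rdf:resource="{META}Document"/>
  </rdf:Description>
</rdf:RDF>"""


def types_of(xml_text):
    """Map each top-level subject to the set of its types, in either form."""
    types = {}
    for node in ET.fromstring(xml_text):
        subject = node.get(f"{{{RDF}}}about")
        found = types.setdefault(subject, set())
        if node.tag != f"{{{RDF}}}Description":
            # ElementTree tags look like "{namespace}local"; join the two
            # parts to get the type URI.
            found.add(node.tag[1:].replace("}", "", 1))
        for type_el in node.findall(f"{{{RDF}}}type"):
            found.add(type_el.get(f"{{{RDF}}}resource"))
    return types


print(types_of(TYPED_NODE) == types_of(TYPE_PROPERTY))
```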

In my bibliographic demo, I addressed this by using a more generic 
typed node "bib:Reference" and then indicating the subclass using a 
dc:type property. So from an RDF standpoint, there is effectively one 
type.

5)  reification

To be honest, I don't much understand this, which tells me it's a 
problem we really don't want to deal with. In general, it's the ability 
to make statements about statements, and it a) is seldom used in 
practice (or so I understand), and b) requires some syntactic 
gymnastics to support. A number of RDF experts now think that 
reification was a mistake: it adds too much complexity for too little 
gain.
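
For what it's worth, at the triple level the gymnastics look roughly 
like the Python sketch below: one statement becomes four triples about 
a new node before you can say anything about it. The rdf:Statement, 
rdf:subject, rdf:predicate and rdf:object terms are the standard 
reification vocabulary; the reify helper, the statement identifier and 
the x:assertedBy property are made up for illustration.

```python
def reify(statement_id, triple):
    """Describe a triple as a resource, using the RDF reification terms."""
    subject, predicate, obj = triple
    return [
        (statement_id, "rdf:type", "rdf:Statement"),
        (statement_id, "rdf:subject", subject),
        (statement_id, "rdf:predicate", predicate),
        (statement_id, "rdf:object", obj),
    ]


original = ("urn:x:1", "x:title", "A title")
graph = reify("urn:stmt:1", original)

# Only now can a statement be made about the statement itself.
# x:assertedBy is a hypothetical property.
graph.append(("urn:stmt:1", "x:assertedBy", "urn:person:y"))

for triple in graph:
    print(triple)
```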

If I understand right, XMP DOES support reification (described as 
"property qualifiers" in the XMP spec, p17). I don't think we should.

6)  containers and collections

Standard RDF containers are Seq, Alt and Bag. These are just ways to 
wrap properties. They're also one of the reasons non-RDF people scream 
about the syntax.

Most RDF experts I've talked to think these are also problematic. 
Indeed, in recent discussions about a so-called "RDF Lite" profile, 
most RDFers agree they could forego these structures, because the basic 
triple model (and maybe typing) achieves the same thing in practice.

One of the controversial things about XMP is not so much that it allows 
these, but that it *requires* them for any duplicate properties.
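
To show what the wrapping costs, here is a Python sketch of the triples 
behind an rdf:Seq. The rdf:Seq type and the numbered rdf:_1, rdf:_2 
membership properties are standard RDF; the seq_triples helper and the 
"_:seq" label are assumptions for the example.

```python
def seq_triples(subject, predicate, items, node="_:seq"):
    """Wrap several values of one property in an rdf:Seq container."""
    triples = [
        (subject, predicate, node),
        (node, "rdf:type", "rdf:Seq"),
    ]
    # Members hang off the container via rdf:_1, rdf:_2, ...
    for index, item in enumerate(items, start=1):
        triples.append((node, f"rdf:_{index}", item))
    return triples


for triple in seq_triples("urn:x:1", "x:author", ["urn:person:y", "urn:person:z"]):
    print(triple)
```

Two plain x:author triples would carry the same two values; the 
container adds an extra node and an extra level for every consumer to 
unwrap.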

I think for us it might make sense to allow them, but not in any way to 
encourage their use. E.g. I think the generic profile validator I wrote 
probably would say these are valid, but I don't think our documentation 
should talk about them.

Collections are ways to wrap multiple resources. From a modeling 
perspective, they can be useful.

One of the practical problems with both the containers and the 
collections is that right now, for example, SPARQL (the new RDF query 
language from the W3C) does not support them. That will come later, but 
it'll probably be a couple of years.

There's one problem that we at the bibliographic project would 
definitely need to get around, though, which is that author lists and 
such are ordered, while the RDF model is not. This would take some 
thought about how best to handle, but one suggestion from Ian Davis 
that I like is to allow a position or order property (which is how 
you'd do it in a relational database).
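
A quick Python sketch of the position-property idea. The x:position 
property name, the blank node labels and the sample names are 
assumptions for illustration. The triples are deliberately listed out 
of order, since nothing in the model guarantees an order.

```python
triples = [
    ("urn:x:1", "x:author", "_:a1"),
    ("urn:x:1", "x:author", "_:a2"),
    ("_:a2", "x:name", "Jane Doe"),
    ("_:a2", "x:position", 1),
    ("_:a1", "x:name", "John Smith"),
    ("_:a1", "x:position", 2),
]


def value(subject, predicate):
    """Return the single value of a property (assumes exactly one exists)."""
    return next(o for s, p, o in triples if s == subject and p == predicate)


def ordered_authors(document):
    """Recover the author list by sorting on the explicit position property."""
    authors = [o for s, p, o in triples if s == document and p == "x:author"]
    authors.sort(key=lambda author: value(author, "x:position"))
    return [value(author, "x:name") for author in authors]


print(ordered_authors("urn:x:1"))
```

The order lives in the data as an ordinary property, so no container or 
collection syntax is needed to get it back.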

So here we're left with pragmatic questions about what specific details 
we ought to support to achieve our goals. If we assume RDF tools, 
there's no real need to worry about this. But if we assume non-RDF 
tools, then we have to recognize that each feature we support adds 
corresponding complexity (to the spec, to the RELAX NG schema(s), to 
processing).

Of course, we still have to clarify what your "goals" are ;-)

Bruce