
Subject: Re: [office] Metadata subcommittee discussion
From: Bruce D'Arcus <bruce.darcus@OpenDocument.us>
To: office@lists.oasis-open.org
Date: Thu, 2 Feb 2006 16:11:57 -0500

Hi Rob,

On Feb 2, 2006, at 3:20 PM, robert_weir@us.ibm.com wrote:

> I tend to think of meta data as being of three types:
>
> 1) Fixed-schema like Dublin Core, a fixed set of bibliographic values 
> which apply to the document overall.  This is what ODF has today.   
> Rigid, but you always know what to expect.
>
> 2) Meta data to support semantic layering.  I'm thinking of cases 
> where there is a predefined schema for a particular use or industry 
> which can be used to annotate a document.  So, someone doing critical 
> commentary on ancient Greek texts might use one metadata schema, but a 
> lawyer reviewing testimony might use another.  A single document might 
> allow users to load multiple, pluggable metadata schemas to allow 
> multiple layers of semantic information at the same time, which a 
> clever editor could use via color coding, interlinear markings, etc.
>
> 3) Free, ad-hoc use.  Users can right click on any content or select 
> and add metadata in arbitrary schemas.  Or we allow arbitrary content 
> as child-elements and attributes of all ODF-defined markup.  Any given 
> editor may or may not understand these additional elements, but they 
> are obligated to save them back when the document is saved.
>
> So what are we trying to do?  A better way of doing #1?  A way to move 
> to #2 and at the same time redo #1 so it is more harmonious with how 
> we do #2?

I think that's how I view it, but let me flesh out where I think we're 
headed:

We say that while the existing DC metadata is a decent common 
denominator, there are all kinds of contexts in which users (which 
might include organizations) have need for more flexibility:

1) they might need to describe their documents as a whole in ways that 
go beyond DC and simple key/value statements. Let's say a publisher 
wants to include details of the production process that cannot be 
covered with DC alone.

2) they want to describe (or simply carry already-existing) metadata 
for different kinds of document objects. If I embed an image or 
spreadsheet data in a document, for example, it should be possible for 
that source metadata to be stored in the file wrapper. Likewise for the 
bibliographic use case that is my focus (more below).

3) the "layering" of richer semantics on top of document content. This 
was the example I posted last week from Brian Jones, where a user might 
highlight pieces of content in ways that enhance search functionality. 
This is also the approach of RDF/A:

<http://www.w3.org/2001/sw/BestPractices/HTML/2006-01-24-rdfa-primer>

There's a way to do all three fairly elegantly.

In order:

1) we adopt a set of rules for extension. Those rules are likely to be 
RDF or some subset (e.g. XMP).

This gives the predictability you note in the sense of the model, but 
opens up significant flexibility. It's the best of both worlds really.

2) We define a list of document content elements that can carry an 
attribute (or a set of attributes) pointing to metadata descriptions. 
Those descriptions conform to 1 and are stored in the package (more on 
this below); alternatively, the attribute might simply hold an 
identifier uri.

3) As with 2, we allow these (optional) attributes to be attached to 
style definitions, so that in tagging content with a given style, a 
user would be adding those richer semantics.
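
To make 2 and 3 a little more concrete, here is a rough Python sketch of 
how an application might resolve such links. Everything named in it is an 
assumption for illustration: the "urn:example:odf-meta" namespace, the 
"about" attribute, the simplified <doc>, <style> and <p> elements, and the 
"meta/refs.rdf" path are hypothetical and not part of ODF. An element 
either carries the attribute itself or picks it up from its style.

```python
import xml.etree.ElementTree as ET

META_NS = "urn:example:odf-meta"   # hypothetical namespace, not defined by ODF
ABOUT = "{%s}about" % META_NS      # hypothetical attribute name

# Simplified stand-in for a content file; real ODF markup looks different.
CONTENT = """
<doc xmlns:m="urn:example:odf-meta">
  <style name="LegalTerm" m:about="http://example.org/vocab/legal-term"/>
  <p m:about="meta/refs.rdf#doe2005">A cited passage.</p>
  <p style="LegalTerm">estoppel</p>
  <p>Plain text with no metadata.</p>
</doc>
""".strip()


def resolve_links(xml_text):
    """Return (text, target) pairs for paragraphs that carry a metadata link,
    either directly (point 2) or through their style (point 3)."""
    root = ET.fromstring(xml_text)
    styles = {s.get("name"): s.get(ABOUT) for s in root.iter("style")}
    links = []
    for p in root.iter("p"):
        target = p.get(ABOUT) or styles.get(p.get("style"))
        if target:
            links.append((p.text, target))
    return links


def is_package_relative(target):
    """Treat targets without a URI scheme as paths inside the package."""
    return "://" not in target


links = resolve_links(CONTENT)
for text, target in links:
    where = "in package" if is_package_relative(target) else "external identifier"
    print(f"{text!r} -> {target} ({where})")
```

The third paragraph has neither the attribute nor a style, so it produces 
no link at all, which is the point of keeping the attributes optional.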

> Something else?  If there is more of a consensus on part of this than 
> the whole, then maybe break it into stages.

My sense is the most controversial part is the precise details of 1. Do 
we adopt XMP as is (with its limitations)?  Do we work with Adobe to 
see if they can address some of these concerns? Do we instead simply 
take the existing ODF metadata support and extend it to support a 
richer subset of RDF? Or do we just say ODF metadata = RDF: the full 
model and syntax?

Each option has its trade-offs.

BTW, Patrick is interested in doing this in ways that would support 
creation of topic maps from ODF metadata. I'm thinking we ought to be 
able to do that.

> Also, "tagging" is becoming popular, with flickr, del.icio.us, etc. 
>  Isn't this just metadata?  Should we add a place for this?  

I think we should, per above.

It would be possible, in that case, to define templates with predefined 
"tag" terms, but to also associate those terms with a uri. That would 
offer the simplicity of tagging, but with the potential for more 
powerful solutions (e.g. you give a tag a uri, you could 
internationalize it, etc.).
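
As a rough illustration of that idea, the Python sketch below keys each 
tag by a uri and hangs per-language labels off it. The example.org uri, 
the labels, and the function names are all made up for the example; no 
such vocabulary exists in ODF.

```python
# Hypothetical tag vocabulary: one uri per tag, with labels per language.
TAGS = {
    "http://example.org/tags/invoice": {
        "en": "invoice",
        "de": "Rechnung",
        "fr": "facture",
    },
}


def label_for(uri, lang, fallback="en"):
    """Return the label for a tag in the requested language,
    falling back to a default language when no translation exists."""
    labels = TAGS[uri]
    return labels.get(lang, labels[fallback])


def uri_for(label):
    """Find the tag uri for a label typed in any language (case-insensitive)."""
    wanted = label.casefold()
    for uri, labels in TAGS.items():
        if wanted in (text.casefold() for text in labels.values()):
            return uri
    return None


uri = uri_for("Facture")
print(uri, label_for(uri, "de"))
```

The user still only types or picks a plain word; the uri is what gets 
stored, so the same tag can be displayed in another language later.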

> Also, at the package level are we sufficiently flexible?  Not quite 
> metadata in the way we've been thinking about, but what if an editor 
> wants to store an extra file in the zip?  Does the specification give 
> enough guidance on how to do that.  For example, I might want to 
> bundle up extrinsic metadata in a separate XML document, XLink'ed to 
> content in the main document XML.

From my perspective, I would want to say that ALL metadata statements 
would be stored apart from the content file, and the linking would 
happen from content to metadata.

I think to allow your #3 solution above now would be too complicated, 
and probably unnecessary. Keeping the metadata statements apart from 
the content and using uri links is an easy way to accomplish a lot.
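
A minimal Python sketch of that separation, under stated assumptions: the 
package is a plain zip, the metadata lives in a file I am calling 
"meta/refs.rdf" (a hypothetical path), and the content points at it with 
a made-up "ref" attribute on a made-up <cite> element. Only the RDF and 
Dublin Core namespace uris are real.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"

# Hypothetical content file: it holds only a link, no metadata statements.
CONTENT_XML = '<doc><cite ref="meta/refs.rdf#doe2005"/></doc>'

# Hypothetical metadata file stored apart from the content.
REFS_RDF = f'''<rdf:RDF xmlns:rdf="{RDF}" xmlns:dc="{DC}">
  <rdf:Description rdf:ID="doe2005">
    <dc:title>An Example Title</dc:title>
    <dc:creator>Jane Doe</dc:creator>
  </rdf:Description>
</rdf:RDF>'''


def build_package():
    """Write content and metadata as separate entries of one zip; return its bytes."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        z.writestr("content.xml", CONTENT_XML)
        z.writestr("meta/refs.rdf", REFS_RDF)
    return buf.getvalue()


def follow_link(package_bytes, ref):
    """Resolve 'path#fragment' to the matching rdf:Description as a dict."""
    path, _, fragment = ref.partition("#")
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as z:
        root = ET.fromstring(z.read(path))
    for desc in root.iter(f"{{{RDF}}}Description"):
        if desc.get(f"{{{RDF}}}ID") == fragment:
            # Strip the namespace from each property name for readability.
            return {child.tag.split("}")[1]: child.text for child in desc}
    return None


package = build_package()
with zipfile.ZipFile(io.BytesIO(package)) as z:
    content = ET.fromstring(z.read("content.xml"))
ref = content.find("cite").get("ref")
record = follow_link(package, ref)
print(ref, "->", record)
```

An application that does not understand the metadata file only has to 
preserve one attribute value and one extra zip entry.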

Let me outline the issues here from the standpoint of the bibliographic 
use case, just to give something quite concrete:

I am imagining three users collaborating on a paper, each using 
different ODF-compatible applications.

As they write the paper and add citations, the citations and 
bibliography are automatically generated from the embedded metadata.

Because the metadata is embedded, it's also portable. When the users 
pass the document around, the logic is always there so that the 
formatting can be regenerated.

And because the metadata is based on a standard model, it would also 
facilitate interoperability between different bibliographic 
applications.
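
To show what "regenerated from the embedded metadata" could mean in 
practice, here is a small Python sketch. The records, the citation keys, 
and the two formatting styles are invented for the example; a real 
implementation would read the records from the package and support real 
citation styles.

```python
# Hypothetical embedded records, keyed by the identifiers the citations point to.
RECORDS = {
    "doe2005": {"creator": "Doe", "date": "2005", "title": "An Example Title"},
    "roe1999": {"creator": "Roe", "date": "1999", "title": "Another Title"},
}

# Citation keys in the order they appear in the text.
CITED = ["roe1999", "doe2005", "roe1999"]


def author_date(cited):
    """Format each citation as (Author Year)."""
    return [f"({RECORDS[k]['creator']} {RECORDS[k]['date']})" for k in cited]


def numeric(cited):
    """Format each citation as [n], numbered by first appearance."""
    numbers = {}
    out = []
    for key in cited:
        numbers.setdefault(key, len(numbers) + 1)
        out.append(f"[{numbers[key]}]")
    return out


def bibliography(cited):
    """List each cited source once, sorted by creator."""
    keys = sorted(set(cited), key=lambda k: RECORDS[k]["creator"])
    return [
        f"{RECORDS[k]['creator']} ({RECORDS[k]['date']}). {RECORDS[k]['title']}."
        for k in keys
    ]


print(author_date(CITED))
print(numeric(CITED))
print(bibliography(CITED))
```

Because the document stores the records and the keys rather than the 
formatted strings, any application can switch from one style to the other 
without the authors retyping anything.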

So the authors finish the paper and send it to the publisher. The 
publisher can then extract all that metadata and make it available to 
search engines and journal providers.

Likewise -- and this is where Adobe and XMP come in -- it would be 
possible to embed that metadata in PDF files, and so make those 
finished documents:

	-	more easily searched
	-	richer in functionality (copy-and-paste text to an ODF 
document and the metadata is copied with it, links to further 
information, etc.)

Bruce