office-metadata message

Subject: Content duplication and ODF related RDF vocabulary
From: Svante Schubert <Svante.Schubert@Sun.COM>
To: office-metadata <office-metadata@lists.oasis-open.org>
Date: Mon, 18 Dec 2006 00:47:42 +0100
Svante Schubert wrote:
> Elias Torres wrote:
>> Patrick Durusau <patrick@durusau.net> wrote on 12/13/2006 03:43:55 PM:
>>
[...]
>>> Another use case is that I am processing all the metadata files only
>>> and not the content.xml files.
>>>
>>> Trivial example: All patient records are stored as ODF and the metadata
>>> for those files should include snomed:birthdate and snomed:age metadata
>>> statements, plus I assume snomed:insurer (I assume there is in the
>>> snomed vocabulary. Sorry John, could not resist.)
>>>
>>> In other words, if this data is actually missing from the file, the
>>> metadata properties are missing as well. I don't have to process
>>> content.xml to discover these errors.
>>>
>>> Depending on what metadata you store in the metadata files, like
>>> Bruce's bibliographic data, you could extract all that data in RDF
>>> without ever touching the content.xml files.
>>>
>>> Simply a question of how much overhead you think you will incur in
>>> processing a set of documents. Doing one to ten documents is probably
>>> trivial with either solution. Doing 100,000 documents or more, well, I
>>> think there would be performance differences.
>>>
>>> Granted that you and Bruce are arguing that people can choose one or
>>> the other in terms of representation. On the other hand, I don't see
>>> any tangible benefit to the choice. If we can indeed do with one what
>>> can be done with the other, my instincts say go with the one that we
>>> know is likely to scale.
>>>     
>>
>> [...]

>> RDF by nature deals very well with specifying
>> metadata externally from the content, so technically I can't argue
>> with an external only approach. However, content duplication is
>> something very important about which Svante, Barnd, John and others
>> have expressed concerns. I'm not sure who else, but at the moment you
>> are the only one stating it is not a problem.
>>
>> Svante wants to avoid content duplication but I believe he is not
>> necessarily for RDFa, so I look forward to seeing how he solves the
>> problem of duplicating content in meta.xml.
>>
>>   
> I believe everyone would like to avoid duplication, as it carries
> the risk of inconsistency.
> Elias might say, as so often, that because an application handles the
> data and no human edits it by hand, the risk of inconsistency between
> the duplicated data in content and metadata is about zero. Still, we
> have to be aware that there is a risk as soon as we have duplication.
> On the other hand, as soon as we start to deal with metadata, there
> is always a risk that RDF subject and RDF object are no longer
> consistent. For example, the name of the person in the content might
> be changed and no longer match the vCard data in the metadata.
>
Concerning inconsistency:
Sometimes parts of the content are equivalent to parts of the
metadata: the same string is used in both.
For such a string we have the following design options:

   1. store the string in the metadata, reference it from the content
   2. store the string in the content, reference it from the metadata
   3. store the string in the content and metadata (content duplication)
   4. store the string in the content and embed parts of the metadata as
      well (RDFa approach)

Here are some constraints I heard recently:

   1. Some people would like ODF applications that are not
      metadata-aware to still be able to show such content without
      parsing the metadata (dropping option 1)
   2. Metadata tools should be able to work with metadata without being
      bothered with parsing the content (dropping option 2)
   3. Avoid content duplication (dropping option 3)
   4. Avoid a mixture of content and metadata, so that both can be
      changed more independently (dropping option 4)

At first sight this seems unsolvable, but let us analyze request #3.
Is content duplication evil per se? If we save our work from the
hard disk to a USB stick, we have made a backup, which is data
duplication as well, and it is even recommended.
Of course, nobody is working on the backup, which might be the
difference, but who is working on the metadata and the content at the
same time?
A different example: if an airline sells flights over the Internet,
the data from its database is duplicated for multiple users in their
browsers, and certain mechanisms (e.g. two-phase commit) allow it to
be stored back consistently. We can therefore say that data
duplication is not evil per se.

We should treat the metadata as the model, which has to be kept
consistent with the working copy in the content.
This includes scenarios such as filling content that is used as a
template from its metadata during loading, as well as referring to
remote metadata (an external DB).

We might go on and even say that data duplication gives us some
further security.
When the content has been changed (by hand or via DOM) and has
accidentally become inconsistent with the metadata, the equivalent
string can be validated by comparing it with its representation in the
metadata. For example, imagine that only the name of an account owner
is shared between content and metadata, while further details about
the account are given only in the metadata.
If the owner name has been changed in the content (e.g. via DOM), but
the account details in the metadata have not, this can be detected by
comparing the strings, and a warning can be given to the user.
The common scenario will be that the metadata visible in the content
is only a subset of the whole data to be kept consistent, as there
will usually be more hidden metadata about the content.
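
The validation idea can be sketched in a few lines of Python. This is
only an illustration under assumptions: the function name
check_consistency, the plain dictionaries standing in for content and
metadata, and the sample ids and names are all hypothetical and not
part of any ODF proposal.

```python
def check_consistency(content_strings, metadata_strings, links):
    """Return warnings for linked strings that no longer match.

    content_strings: maps a content xml:id to the text found in the content
    metadata_strings: maps a metadata key to the string stored in the metadata
    links: (xml_id, metadata_key) pairs whose strings are expected to be equal
    """
    warnings = []
    for xml_id, meta_key in links:
        in_content = content_strings.get(xml_id)
        in_metadata = metadata_strings.get(meta_key)
        if in_content != in_metadata:
            warnings.append(
                f"{xml_id}: content has {in_content!r}, "
                f"metadata has {in_metadata!r}"
            )
    return warnings


# The owner name was edited in the content (e.g. via DOM) ...
content = {"owner1": "J. Smith"}
# ... but the account details in the metadata were left unchanged.
metadata = {"account/owner": "John Smith"}
links = [("owner1", "account/owner")]

for warning in check_consistency(content, metadata, links):
    print("Warning:", warning)
```

The check only needs the duplicated string on both sides; the hidden
account details in the metadata are never touched.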

Therefore, although in general I am against data duplication, in our
case I would even recommend it as a means of validation (in the case
of equal content and metadata strings).

Taking separate content and metadata as the design, I would like to
give a rough high-level draft proposal for the linking:
As stated, we keep the content and the metadata completely
independent, and do all the linking between content and metadata in
one special mapping file.

We might even use our own metadata for this mapping. Using an RDF
metadata file with an ODF vocabulary would give us the possibility to
describe the types of relation between arbitrary data in the package
and metadata (usually RDF files, but we should try to cover even more).

The mapping in RDF should be possible in general, as all data in the
content can be marked by an xml:id and addressed by a URI ref that
refers to the file in the ODF package and, in the case of a subset, to
its xml:id. The RDF metadata is qualified by URIs anyway and might
have an rdf:ID as well.
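
As a rough illustration of such URI refs, the following Python sketch
builds and splits package-relative references. The helper names and
the exact syntax (file path plus "#" plus xml:id) are assumptions made
for this example, not something the proposal has fixed.

```python
from urllib.parse import quote, unquote, urldefrag


def package_uri_ref(file_path, xml_id=None):
    """Build a package-relative URI ref, optionally pointing at an xml:id."""
    ref = quote(file_path)  # keeps "/" so sub-directories stay readable
    if xml_id is not None:
        ref += "#" + quote(xml_id, safe="")
    return ref


def split_uri_ref(ref):
    """Split a URI ref back into (file_path, xml_id or None)."""
    path, fragment = urldefrag(ref)
    return unquote(path), (unquote(fragment) if fragment else None)


print(package_uri_ref("content.xml", "owner1"))  # a fragment of a file
print(package_uri_ref("Pictures/logo.png"))      # a whole file
print(split_uri_ref("styles.xml#header1"))
```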

By defining our own ODF vocabulary we might give some semantics to the
relation between resources of the ODF package and metadata.
We might define new ODF predicates such as odf:resource, a general
description saying 'is part of ODF package', which could be
specialized into two sub-predicates, odf:equal (when content and
metadata are the same) and odf:unequal. Further refinements of
odf:unequal might be possible, for example odf:schema, to ease
validation of metadata in the content and to describe its type.
If the content is not equal to a metadata field but generated from
it, it could be marked as such (odf:generated), for example for
generated long and short citations. The generated content might of
course contain subfields that are equal to metadata. These ODF
predicates would be of enormous help for a default ODF application
plug-in handling metadata, especially when no plug-in for a certain
kind of metadata is installed.
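
To make this more concrete, here is a small Python sketch of how a
default plug-in might use such predicates. Everything in it is an
assumption for illustration: the namespace URI, the file names, the
ids, and the actions chosen per predicate are hypothetical, and the
mapping is shown as plain (subject, predicate, object) tuples rather
than parsed RDF.

```python
# Hypothetical namespace; the real URI would be defined with the vocabulary.
ODF = "urn:example:odf-meta#"

EQUAL = ODF + "equal"          # content string equals the metadata string
UNEQUAL = ODF + "unequal"      # related, but the strings differ
GENERATED = ODF + "generated"  # content was generated from the metadata

# Mapping entries: (content URI ref, predicate, metadata URI ref).
mapping = [
    ("content.xml#owner1", EQUAL, "meta/account.rdf#ownerName"),
    ("content.xml#cite7", GENERATED, "meta/bibliography.rdf#doe2006"),
    ("styles.xml#header1", UNEQUAL, "meta/project.rdf#confidentiality"),
]


def default_action(predicate):
    """Pick a fallback behaviour when no specialised plug-in is installed."""
    if predicate == EQUAL:
        return "compare strings and warn on mismatch"
    if predicate == GENERATED:
        return "regenerate content when the metadata changes"
    return "keep the link, leave the content untouched"


for subject, predicate, obj in mapping:
    print(f"{subject} -> {obj}: {default_action(predicate)}")
```

The point of the sketch is only that the predicate alone tells a
generic handler what it may safely do, without understanding the
metadata vocabulary on the object side.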

If the metadata describes several parts of the content (e.g. marks
them as important), the various xml:id values could be collected into
a group in this file.
[Note: There might be only one such file per ODF package, to ease
searching for metadata. If documents are embedded, these files would
be merged.]

Such an xmeta.xml file would make it possible to relate metadata
directly to any file in the ODF package.
This is promising, as there are customers who want to add metadata to
header and footer lines, which are stored in styles.xml.
Images, embedded scripts, and any other file imaginable could be
related directly to metadata, and even schemas for validating the
metadata in the content could be referenced.

This is a rough draft of one design possibility; detailed examples
will follow if there are not too many overlooked concerns.

Regards,
Svante