
Subject: Re: [office-metadata] [issue] split object literals
From: Michael Brauer <Michael.Brauer@Sun.COM>
To: "Bruce D'Arcus" <bdarcus@gmail.com>
Date: Wed, 10 Jan 2007 09:39:04 +0100

Hi Bruce,

Bruce D'Arcus wrote:
> 
> On Jan 8, 2007, at 9:10 AM, Svante Schubert wrote:
> 
>>> Second, as I mentioned to Michael off-list, this problem is 
>>> conceptually quite similar to a recent discussion in the TC about 
>>> handling split-lists. If I am right about that, I would suggest that 
>>> we treat it as a general issue rather than one specific to metadata.
>>
>> Right now, I see no real similarity between the metadata and the list
>> issues. Therefore I suggest that we wait to see how the list experts
>> intend to solve their issue, and then check whether there is a
>> similarity to a metadata use case.
> 
> David Faure was proposing an ID (IIRC list:id) that could be used on 
> multiple list nodes (so xml:id would not work), to denote that they are 
> in fact part of the same conceptual structure.
> 
> At the time I had mentioned that our work might overlap/conflict, but I 
> wasn't quite sure. Exactly how we address the possible overlap here 
> isn't that critical to me; but we do need to keep it in mind. Perhaps 
> Michael has some thoughts?

Well, I'm not sure whether there is an overlap. In one case, we want to 
define which counter and style is used by lists or numbered paragraphs. 
In the other case, we are looking for some markup that combines text 
spans into a single object. My understanding is that we are very close to 
having a solution for the list issue. I therefore propose to keep these 
two issues separate.

> 
>>> I see two reasonable solutions:
>>>
>>> 1) the in-context approach
>>>
>>> Use an attribute like object:id. One could use the same attribute for 
>>> lists as well, and so have a general solution to this problem. Where 
>>> a node has such an attribute, a processor would know to look for one 
>>> or more additional nodes with the same attribute value, and to treat 
>>> them as a single object (list, or property value).
>>
>> This is a valid approach. In the case of in-content meta data: Which 
>> of the nodes would actually carry the other meta data attributes? All 
>> nodes, or only the first?
> 
> Probably all relevant nodes.
> 
>> Would you use the same id then referencing the nodes from metadata in 
>> the package?
> 
> If referencing them from the package, probably just xml:id. Not sure; 
> this is just an idea I came up with this morning.
> 
>>> 2) the package approach
>>>
>>> This is what Michael and Svante have been talking about, where you'd 
>>> have pointers to the parts in some separate XML chunk. You'd need 
>>> xml:id attributes on the property nodes to do this association.
>>>
>>
>> Yes, that's the approach we are talking about. Using this approach, if 
>> you want to have in-content metadata, you may add the additional meta 
>> data to the new XML element that contains the links to the property 
>> nodes.
> 
> As I have been talking about this with Michael (if I understand the
> conversation correctly), the ONLY thing this XML does is link *split*
> property nodes. It would otherwise not be needed.

That's correct.
> 
> In other words, it is a solution to account for what I would hope would 
> be the exceptional case; not the common one.

First of all, what I was talking about were RDF subjects. Your examples 
are about RDF objects. So we may actually be talking about different things.

Regarding RDF subjects: Whether split subjects are the common case or 
the exceptional case depends on the documents in question and the use 
cases. What's important to me is that we cover this "exceptional" case.
We may of course provide some kind of shortcut for the case where the RDF 
subject in question consists of a single span. Whether this simplifies 
the processing of the metadata is something we would have to analyze, 
because we would then have two ways to address the subjects/objects (or 
three, if we provide in-content and package metadata), and all 
metadata-aware applications would have to implement all of them.

Regarding RDF objects: I do understand that there are use cases where an 
RDF object should be displayed or should even be editable within the 
content. Putting the RDF object into the content is one option. Other 
options, in my opinion, are to keep them separate from the content and to 
use text fields and/or XForms to display them or to create a binding 
between the two. Right now, I don't know what the best solution is, but 
I think we should consider them all. But in any case, we should discuss 
the case where the *RDF subject* is in the content separately from the 
case where the *RDF object* is in the content, or is displayed there. 
These are, in my opinion, two different aspects of the metadata feature 
that is to be defined. They must of course fit together, but the markup 
we define for the two aspects does not have to be the same.

I would further like to note that I think it is essential that RDF 
*subjects* can be split. Right now, I don't have an opinion on whether 
this also applies to RDF *objects*, because I don't even have a clear 
opinion on whether RDF objects should appear in the content, or should 
only be displayed there as mentioned above.

> 
>>> Both of these approaches would still need the meta attributes; they 
>>> are just two approaches to solving a very narrow problem specific to 
>>> using them in an office file format.
>>>
>>> Is the above characterization fair?
>>
>> Yes, it is, except that their naming is confusing. Since the two 
>> approaches can be combined with in-content and with package metadata.
> 
> One never has to worry about split property nodes in RDF/XML, which is 
> why I am focusing on it as a problem of the in-content encoding.

What do you mean by "worry"? That one doesn't have to care about 
split property nodes, or that there will be a solution?


> 
> Basically, what I am proposing here is two options:
> 
> Option 1 (what I was calling in-content)
> ========================================
> 
> <text:p>
>   <text:span object:id="xyz" meta:about="http://ex.net/x"
> meta:property="ex:title">Some </text:span>
> </text:p>
> <text:p>
>   <text:span object:id="xyz" meta:about="http://ex.net/x"
> meta:property="ex:title">Title</text:span>
> </text:p>
> 
> [note: no extra XML in the package; it's all in the content, and easy to 
> process with XSLT and such]
> 
> Option 2 (what I was calling in-package)
> ========================================
> 
> <text:p>
>   <text:span xml:id="x" meta:about="http://ex.net/x"
> meta:property="ex:title">Some </text:span>
> </text:p>
> <text:p>
>   <text:span xml:id="y" meta:about="http://ex.net/x"
> meta:property="ex:title">Title</text:span>
> </text:p>
> 
> ... and in the package, using something like Michael's example:
> 
> <office:meta-subject>
>   <office:part idref="x"/>
>   <office:part idref="y"/>
> </office:meta-subject>
> 
> [note: by default, basically, a processor would understand those as two 
> statements, and would have to look up in the package whether they would 
> need to merge/concatenate the literal content]

As said above, I was talking about subjects rather than objects. That's 
in fact the reason why I have called the above element <meta-subject> 
rather than <meta-object>.

The <meta-subject> element would actually appear in the content rather 
than in the metadata. That means I would consider the combination of 
text spans into a single metadata subject (or object) to be a real 
feature of the content, and not a metadata implementation detail. In 
other words: The metadata references a single subject. That this subject 
does not contain the text inline, but contains references to it, is the 
implementation detail.

For the RDF subject case, the content.xml would look like this:

<office:meta-subject xml:id="xyz">
   <office:part idref="x"/>
   <office:part idref="y"/>
</office:meta-subject>

<text:p>
   <text:span xml:id="x">Some </text:span>
</text:p>
<text:p>
   <text:span xml:id="y">Title</text:span>
</text:p>

The metadata would look like this:

<rdf:Description rdf:about="content.xml#xyz">
  <ex:title>My Title</ex:title>
</rdf:Description>
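
To illustrate how a processor might follow such a reference, here is a rough Python sketch. It takes the fragment identifier from the "about" URI, looks up the <office:meta-subject> element with that id, and concatenates the text of the parts it points to. This is only an illustration under assumptions: <office:meta-subject> and <office:part> are proposals from this thread and not existing ODF elements, the spans are assumed to be identified by xml:id so that idref can point at them, and the function name resolve_subject_text is made up for this example.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urldefrag

OFFICE = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
TEXT = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical content.xml fragment. <office:meta-subject> and <office:part>
# are proposed markup only; the spans are assumed to carry xml:id.
CONTENT = f"""<office:text xmlns:office="{OFFICE}" xmlns:text="{TEXT}">
  <office:meta-subject xml:id="xyz">
    <office:part idref="x"/>
    <office:part idref="y"/>
  </office:meta-subject>
  <text:p><text:span xml:id="x">Some </text:span></text:p>
  <text:p><text:span xml:id="y">Title</text:span></text:p>
</office:text>"""


def resolve_subject_text(content_xml, about_uri):
    """Return the concatenated text of the parts making up one subject."""
    root = ET.fromstring(content_xml)
    # Index every element that has an xml:id.
    by_id = {el.get(XML_ID): el for el in root.iter() if el.get(XML_ID)}
    # "content.xml#xyz" -> "xyz"
    fragment = urldefrag(about_uri).fragment
    subject = by_id[fragment]
    # Follow each <office:part idref="..."/> in the order given.
    parts = [by_id[p.get("idref")] for p in subject.findall(f"{{{OFFICE}}}part")]
    return "".join("".join(part.itertext()) for part in parts)


print(resolve_subject_text(CONTENT, "content.xml#xyz"))
```

Note that the order of the text is determined by the order of the <office:part> elements, not by where the spans appear in the document.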

In the in-package case, the metadata fragment would appear in some stream 
next to the content.xml. For the in-content case, one could simply move 
the metadata fragment into the content.xml and adapt the "about" URI. It 
is probably also possible to mix the metadata markup with the content (I 
assume that is what RDFa does), but I don't know what this would look 
like. Bruce, can you provide an example of this?

Anyway, I like the idea of identifying the spans that belong to a 
certain object by a single id, as is the case in your option 1:

<office:meta-subject xml:id="xyz"/>

<text:p>
   <text:span office:belongs-to="xyz">Some </text:span>
</text:p>
<text:p>
   <text:span office:belongs-to="xyz">Title</text:span>
</text:p>

The attribute that defines the id (in terms of XML) is xml:id. The 
office:belongs-to attributes are references to this id only. Although 
only a single id is used here, we cannot omit the <office:meta-subject> 
element, because we need to define the id that has to be unique.

The advantage of this example is that the application that saves the 
document does not need to know in advance how many <text:span> elements 
make up a subject. Since the <text:span> elements are created on demand 
while saving the content, this is actually a very strong advantage.
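
As a rough illustration of the reading side, the Python sketch below collects all spans that point to a subject and joins their text in document order. Again, this is only a sketch under assumptions: the office:belongs-to attribute and the <office:meta-subject> element are proposals from this thread, not existing ODF markup, and the function name collect_subjects is made up for this example.

```python
import xml.etree.ElementTree as ET

OFFICE = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
TEXT = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical content.xml fragment using the proposed office:belongs-to
# attribute; neither it nor <office:meta-subject> exists in ODF today.
CONTENT = f"""<office:text xmlns:office="{OFFICE}" xmlns:text="{TEXT}">
  <office:meta-subject xml:id="xyz"/>
  <text:p><text:span office:belongs-to="xyz">Some </text:span></text:p>
  <text:p><text:span office:belongs-to="xyz">Title</text:span></text:p>
</office:text>"""


def collect_subjects(content_xml):
    """Map each declared subject id to the joined text of its spans."""
    root = ET.fromstring(content_xml)
    # Only ids declared by an <office:meta-subject> element count as subjects.
    pieces = {
        el.get(XML_ID): []
        for el in root.iter(f"{{{OFFICE}}}meta-subject")
    }
    # root.iter() walks in document order, so the pieces stay in reading order.
    for span in root.iter(f"{{{TEXT}}}span"):
        ref = span.get(f"{{{OFFICE}}}belongs-to")
        if ref in pieces:
            pieces[ref].append("".join(span.itertext()))
    return {subject_id: "".join(texts) for subject_id, texts in pieces.items()}


print(collect_subjects(CONTENT))
```

The reader does not need to know the number of spans in advance either; it simply gathers whatever spans reference the id.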

If we stay with the RDF object example and further assume we actually 
need RDF objects in the content (and not only XForms or text fields), 
and if we further assume that these objects have to be splittable, then 
I would modify your examples above slightly:

First of all, I would move the <office:meta-subject> element into the
content.xml, as is the case for my subject example.

For the "in-content" option I would move the meta:about and 
meta:property attributes to that <office:meta-subject> element, which 
definitely should get a better name:

content.xml:

<office:meta-subject xml:id="xyz" meta:about="http://ex.net/x"
                     meta:property="ex:title"/>

<text:p>
   <text:span office:belongs-to="xyz">Some </text:span>
</text:p>
<text:p>
   <text:span office:belongs-to="xyz">Title</text:span>
</text:p>

I have used my second suggestion above to combine the <text:span> 
elements, but of course one could also use my original suggestion, which 
uses the <office:part> elements to do so.

For the "in-package" case I would actually assume that the metadata 
itself is in the package (let's say a meta.xml stream), but not in the 
content.xml. We may then use the id attribute of the 
<office:meta-subject> element to reference the subject in the 
content.xml from the metadata in the package:

meta.xml:

Something like

<rdf:Description rdf:about="http://ex.net/x">
  <ex:title rdf:resource="content.xml#xyz"/>
</rdf:Description>

[I'm not an RDF expert, so please don't take this example literally, but 
it is more or less the same as 
http://www.w3.org/TR/rdf-primer/#example4, except that the IRI has a
fragment identifier here]

content.xml:

<office:meta-subject xml:id="xyz"/>

<text:p>
   <text:span office:belongs-to="xyz">Some </text:span>
</text:p>
<text:p>
   <text:span office:belongs-to="xyz">Title</text:span>
</text:p>


> 
> Something just occurred to me, which is that RDF actually has more than 
> one kind of literal. There are also XML literals. We need to think about 
> that.

I think this depends on whether the use cases the SC has collected 
actually require that. My understanding is that the SC wants to define 
markup that can represent the use cases, and that RDF/XML (or RDFa) are 
options for this, but that it is not the goal of the SC to add full 
RDF/XML support to ODF.

> 
> Bruce

Michael



-- 
Michael Brauer, Technical Architect Software Engineering
StarOffice/OpenOffice.org
Sun Microsystems GmbH             Nagelsweg 55
D-20097 Hamburg, Germany          michael.brauer@sun.com
http://sun.com/staroffice         +49 40 23646 500
http://blogs.sun.com/GullFOSS