office-metadata message

Subject: Re: [office-metadata] Our discussion on the Wiki example
From: Elias Torres <eliast@us.ibm.com>
To: "Bruce D'Arcus" <bdarcus@gmail.com>
Date: Sun, 3 Dec 2006 21:41:37 -0500
I would like to remind us all that we don't have much time left, and only
meeting weekly doesn't help either. I don't mind answering any questions on
the issues and am willing to work with the rest of the task force on
tweaking/changing/fixing the proposal presented here by Bruce and me.
However, I would like to say that the only reason why we suggest an
RDFa-like approach is because we believe it has the chance of solving a
large percentage of the use case requirements in our charter without
resorting to custom XML schemas for each of the scenarios. If the
suggestions/questions presented to us indicate a fundamental enough change
in direction (like some of Svante's thoughts on structure and grammars) I
would like to request that those be presented as separate proposals to the
group so we can discuss them separately. I'm worried that we will spend our
little time discussing points in emails as opposed to a specific approach
or draft and never get anything accomplished. I've seen it happen before.

Also, I'm hoping that we can make a decision as a group on which proposal
we select moving forward, because I don't think it is productive to work on
2 or more proposals at the same time in order to reach a single
specification. Unless, of course, the current proposal is flawed or
insufficient with respect to our use cases and requirements.

Now onto Svante's email.

"Bruce D'Arcus" <bdarcus@gmail.com> wrote on 12/01/2006 04:37:57 PM:

>
> On Dec 1, 2006, at 1:53 PM, Svante Schubert wrote:
>
> > Instead of
> > "meta:about="urn:uuid:fe107eb0-7704-11db-9fe1-0800200c9a66", we
> > might use in our case meta:id="citation". It's mnemonic and the
> > value of the meta:id (which is not a xml:id as it does not have to
> > be unique, when expressing a type) would be offered during meta
> > data creation by the ODF application component, which is
> > responsible for this type of meta data.

A good point you make here is the fact that xml:id must be unique within an
XML document and that's not necessarily what we are after in our RDFa (or
should we call it ODFa? :D) approach. The meta:about attribute we are
suggesting is simply to denote the subject of the relationship we are
trying to establish within the content.

My experience with metadata tells me we have two basic options: we either
embed the metadata in the content or we store it separately and link to it
somehow. RDFa addresses the first, and our requirements and environment
suggest we look at the second. I think we need to treat them somewhat
separately. Let me explain why.

Approach #1

In content.xml

<meta about="foo.jpg" property="dc:creator">Elias Torres</meta>

In meta.xml

Nothing.

Approach #2

In content.xml:

<img src="foo.jpg"/>

In meta.xml

<rdf:Description about="foo.jpg">
      <dc:creator>Elias Torres</dc:creator>
</rdf:Description>

As you can see, they really don't have much in common. In #1 we need a way
to model and express our metadata needs, and we draw from RDFa one of the
ways of doing that (the about, rel, rev, property, content and datatype
attributes). In #2 we only need a way to identify objects/resources within
the document and leave it up to the meta.xml to contain all of the possible
information. The main reason why I like #1 is that there's a lot of data
already in the document that we would like to avoid duplicating, although
I don't believe we can avoid it 100% of the time (e.g. the content
issue).
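
To make the comparison concrete, here is a minimal Python sketch (not part
of the proposal) that reads both variants and reduces each to a
(subject, property, value) tuple. The function names, the use of Python's
xml.etree.ElementTree, and the namespace declarations added to the
Approach #2 snippet are assumptions made only for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace URIs, declared here so the Approach #2 snippet parses.
DC = "http://purl.org/dc/elements/1.1/"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Approach #1: the statement is embedded in content.xml.
content_1 = '<meta about="foo.jpg" property="dc:creator">Elias Torres</meta>'

# Approach #2: the statement lives in meta.xml.
meta_2 = (
    '<rdf:Description xmlns:rdf="%s" xmlns:dc="%s" about="foo.jpg">'
    '<dc:creator>Elias Torres</dc:creator>'
    '</rdf:Description>' % (RDF, DC)
)

def triples_embedded(xml_text):
    """Hypothetical reader for Approach #1 (one element, one statement)."""
    el = ET.fromstring(xml_text)
    return {(el.get("about"), el.get("property"), el.text)}

def triples_separate(xml_text):
    """Hypothetical reader for Approach #2 (child elements are properties)."""
    desc = ET.fromstring(xml_text)
    subject = desc.get("about")
    out = set()
    for child in desc:
        # ElementTree expands dc:creator to "{namespace-uri}creator".
        ns, local = child.tag[1:].split("}")
        prefix = "dc" if ns == DC else ns
        out.add((subject, prefix + ":" + local, child.text))
    return out

print(triples_embedded(content_1) == triples_separate(meta_2))
```

The markup differs, but both readers end up with the same statement about
foo.jpg, which is why the two approaches can be supported side by side.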

I think I failed to separate these two approaches enough on telecons but I
hope we can get back on track. Anyways, back to the question: does
meta:about/xml:id need to be mnemonic? My answer is no. Let me show by
example:

<link about="http://torrez.us/who#elias" rev="dc:creator" href="foo.jpg"/>

yields

<foo.jpg> dc:creator <http://torrez.us/who#elias>
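
A small sketch of why rev produces that statement, assuming the usual RDFa
reading where rel points from @about to @href and rev points the other way.
The helper name link_triple is hypothetical and handles only this one
element shape.

```python
import xml.etree.ElementTree as ET

def link_triple(xml_text):
    """Return (subject, predicate, object) for a single <link> element."""
    el = ET.fromstring(xml_text)
    about, href = el.get("about"), el.get("href")
    if el.get("rev") is not None:
        # rev reverses the direction: the href becomes the subject.
        return (href, el.get("rev"), about)
    # rel keeps the direction: the about value is the subject.
    return (about, el.get("rel"), href)

link = '<link about="http://torrez.us/who#elias" rev="dc:creator" href="foo.jpg"/>'
print(link_triple(link))
```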

As you can see the about attribute has nothing to do with mnemonics or
anything of that nature. It's about uniquely identifying resources in both
a closed and open world. The only reason why xml:id came into the
discussion is because we want to leverage things already identified in our
current documents such as <table table:name="table1">...</table>. In RDFa
and other HTML approaches we make use of both @id and @name to locate
things within the same document.

<meta about="table1" property="dc:creator">Elias Torres</meta>
<body>
<x />
<y />
<table name="table1">
...
</table>
</body>
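
As a rough illustration of locating the thing that about="table1" refers
to, the Python sketch below scans the document for an element whose id or
name attribute matches. The <doc> wrapper element and the resolve() helper
are assumptions added so the fragment parses on its own; real ODF would use
namespaced attributes such as table:name.

```python
import xml.etree.ElementTree as ET

# The fragment above, wrapped in a hypothetical <doc> root so it is well-formed.
doc = ET.fromstring(
    '<doc>'
    '<meta about="table1" property="dc:creator">Elias Torres</meta>'
    '<body><x/><y/><table name="table1"/></body>'
    '</doc>'
)

def resolve(root, ref):
    """Return the first element whose id or name equals ref, else None."""
    for el in root.iter():
        if el.get("id") == ref or el.get("name") == ref:
            return el
    return None

target = resolve(doc, doc.find("meta").get("about"))
print(target.tag)
```

The lookup does not depend on where the table sits in the body, only on
the identifier it carries.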

Also, is copy and paste of ODF content/source a requirement for our
metadata proposal? I didn't think it was. I hope we are not expecting
people to hand-write ODF (i.e. there is no need for mnemonics).

>
> But how is that it's "mnemonic" intrinsically valuable? I don't think
> it is.
>
> > By this it is imaginable that even the implementation of metadata
> > is being exchanged in meta.xml, without a byte changing in the
> > content.xml. Imagine implementations like vcard vs. hcard.

By us supporting both #1 and #2 we support people exchanging metadata using
different schemas or ontologies.

>
> Am not following here. Can you restate?
>
> > Second someone would like to link to meta data.
>

Again, by using URIs people can link to meta data. If we use xml:id="short"
we have no way of linking to resources (e.g. external documents or web
pages).

> I still don't know what this means.
>
> > Instead of referencing to a certain structure (e.g. third paragraph
> > of the body) a link to the type of meta data in the package is
> > closer on the desired.
>

BTW, I'm not advocating we reference structure. Referencing structure sucks
(e.g. 2nd paragraph, 3rd table after the 1st paragraph, etc). I'd like to
reference objects/resources.

> Sorry, again, am not understanding. Been a long day I guess.
>
> > Although I have no fool proof implementation by hand (XPointer?),
> > would such approach solve the problem of changing structure.
>
> Can you explain what you mean by "changing structure"? Is this is the
> split-paragraph example?

I'm not sure how linking to the type of metadata in the package solves the
linking problem at all. BTW, I have tried extensively to deal with
annotations in Office documents in a product we built for Life Sciences
organizations. I wrote plugins for Word, Excel and PowerPoint, and just
linking to structure never worked. I even tried this on HTML using my own
implementations of XPointer, but it simply does not work for changing
documents. It might for read-only versions of documents, but that's about
it.

>
> > This approach might exist aside of new introduced xml:id, which
> > could be generated by the user when the document is ready to
> > publish. xml:id should be stable similar to the API / interface of
> > a software and therefore handled with care.
> >
> > And finally, when our goal is to weave arbitrary metadata into ODF
> > in a most simple, generic way, I was distracted by @content - as
> > Bernd as well before <meta property="cal:dtstart"
> > content="20060508T1000-0500"> May 8th at 10am </meta> There is
> > detailed redundant information in the attribute and as well there
> > is a blob of data.

I would like to help us see that it's simply not feasible today to expect
all human-entered data to be machine readable. It's definitely the case for
dates, one of the most complicated pieces of data we deal with in computers
today. If I'm not making any sense, think about dates in Japan, where years
are often written by the Emperor's reign rather than as Gregorian years.
Anyways, let me try a few more examples to see if "more" data could be
extracted from the human-entered text.

<span property="amount" content="1000000">one million</span>

<span property="dtstart" content="15:05">5 past 3</span>

This year's <span property="net" content="-1600000">loss 1.6 million
dollars.</span>

I hope this is enough to understand that we should not be in the business
of removing the @content attribute, except for noting that it might be best
to re-use as much content from the text as possible.
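
A minimal sketch of that rule, assuming a consumer that prefers @content
when it is present and otherwise falls back to the element's text. The
helper name literal_value is hypothetical.

```python
import xml.etree.ElementTree as ET

def literal_value(xml_text):
    """Machine-readable value of one element: @content wins, text is the fallback."""
    el = ET.fromstring(xml_text)
    content = el.get("content")
    if content is not None:
        return content
    # No @content: re-use the human-entered text as the value.
    return "".join(el.itertext()).strip()

print(literal_value('<span property="amount" content="1000000">one million</span>'))
print(literal_value('<span property="location">my house.</span>'))
```

The first call returns the attribute value, the second returns the text,
so @content is only needed where the text alone is not machine readable.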

>
> Why is this a problem? One is for machines, and one for people.
>
> > In general I would rather prefer something like:
> >
> > <text:p meta:id="date">
> >     <text:span meta:id="month">May</text:span>
> >     <text:span meta:id="day">8th</text:span> at
> >     <text:span meta:id="time">10am</text:span>
> > </text:p>
>
> No, Svante. That's certainly not how you'd do it in RDFa. The ID is
> just that: an id that allows one to then associate something else
> with it (a link, metadata descriptions, etc.). It indicates no
> semantics at all. Using dumb strings of text for semantics is no more
> useful than just using styles.

I would like to look past a few of the minor issues with Svante's example.
I'll stay away from the naming issue, since I hope I've addressed that
earlier in my email. However, I would like to note a much more important
point that I would like everyone to study closely. Svante, I think you are
thinking too much about the structure of the data as opposed to specifying
metadata in a very granular way. In our RDFa date example we are NOT
focused on the structure. Let me give you an example:

<p about="event1">
  The party will start at <span property="cal:dtstart"
content="2006-12-12T15:05Z">5 past 3 on saturday</span> at <span
property="cal:location">my house.</span>
</p>

In RDF we are not focused on "structure", we are interested in the
statements made in the model. In the example above, we have two statements
being asserted.

<#event1> cal:dtstart "2006-12-12T15:05Z" .
<#event1> cal:location "my house" .

The statements stand completely on their own and they are not necessarily
part of a greater structure. We can even try to solve the
really-really-really hard moving content problem. Here we go:

<p about="event1">
  The party will start at <span property="cal:dtstart"
content="2006-12-12T15:05Z">5 past 3 on saturday</span>.
</p>
<p about="somethingelse">
  Some text about something else....

  ... and before I forget the party will be at <span about="event1"
property="cal:location">my house.</span>
</p>

If you notice, we moved the content around but in this case we were able to
maintain the metadata because the cut completely encompassed the <span>
element. There are specific reasons for us wanting to use RDF and RDFa. In
RDFa, triples are only generated when a rel, rev or property attribute is
encountered. This allows us to make statements about resources in
different parts of the document without having to worry about maintaining
structure, because the model extracted from the content.xml in the "changed"
example is isomorphic to the first one. I know there's an equivalent
scenario for Svante's example if done right (e.g. we use properties
ex:month, ex:day, ex:time). However, I think in his case he meant "date"
to denote (let's say) the dtstart of the document. This means that month,
day and time are dependent on their document location and any movement of
the content would absolutely destroy our metadata hopes. I guess we could
introduce some "merge" rules to try to solve this, but the higher-level the
structures, the more complicated the rules become; think of something
along the lines of XML diff/merge, all within the same document.
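
The claim that the two versions yield the same model can be checked with a
short Python sketch. It implements only a simplified subset of the RDFa
idea (an inherited @about, a statement per @property, @content preferred
over text) and is not a conforming RDFa parser. The extract() function, the
<body> wrapper, and writing the location property as cal:location to match
the statements listed earlier are all assumptions for illustration.

```python
import xml.etree.ElementTree as ET

original = """
<body>
  <p about="event1">
    The party will start at
    <span property="cal:dtstart" content="2006-12-12T15:05Z">5 past 3 on saturday</span>
    at <span property="cal:location">my house.</span>
  </p>
</body>
"""

changed = """
<body>
  <p about="event1">
    The party will start at
    <span property="cal:dtstart" content="2006-12-12T15:05Z">5 past 3 on saturday</span>.
  </p>
  <p about="somethingelse">
    Some text about something else....
    ... and before I forget the party will be at
    <span about="event1" property="cal:location">my house.</span>
  </p>
</body>
"""

def extract(xml_text):
    """Collect (subject, property, value) statements from the markup."""
    triples = set()

    def walk(el, subject):
        # The nearest @about (on this element or an ancestor) is the subject.
        subject = el.get("about", subject)
        prop = el.get("property")
        if prop is not None and subject is not None:
            value = el.get("content")
            if value is None:
                value = "".join(el.itertext()).strip()
            triples.add((subject, prop, value))
        for child in el:
            walk(child, subject)

    walk(ET.fromstring(xml_text), None)
    return triples

print(extract(original) == extract(changed))
```

Because the moved <span> carries its own about="event1", the set of
statements is the same before and after the move, even though the
surrounding paragraph is now about something else.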

>
> > or shorter using default namespace (and none for the attribute) as
> >
> > <text:p s="date">
> >     <a s="month">May</a>
> >     <a s="day">8th</a> at
> >     <a s="time">10am</a>
> > </text:p>
> >
> > By doing so, other software aside of the correct plugin, would have
> > a chance to interpret the data.

I think that if you look at the work that Dan Connolly has done in this
area, you'll believe me that, having this information:

<#event1> cal:dtstart "2006-12-12T15:05Z" .
<#event1> cal:location "my house" .

any plugin can interpret that data and do as it pleases (e.g. convert it to
some other format for display).

>
> OK, I think you need to step back and ask what problem are you trying
> to solve here? It seems to me you want to be able to bind a GUI to
> data, and then to particular application behavior (which could be a
> plug-in). E.g. say someone wants to add custom content processing;
> how do you do that? How does a plug-in know which content and which
> metadata to deal with?
>
> Is that right?
>
> If yes, the manifest can also be used for this, as well as some
> similar typing on custom fields.
>
> > For example the fall-back plugin of an ODF application (which
> > assist the user in showing / editing meta data, when the correct
> > plugin is not installed / found), would be able to assist the user.
> > Even more when the the possible set of data (e.g. all month) is
> > defined in an embedded grammar,
>
> OK, remember, what you are calling an "embedded grammar" is exactly
> what RDFa provides.
>
> > further features would be possible. For instance 'auto completion'
> > or 'drop down list' for the content of such a field are thinkable
> > for the future even for the fall-back plugin ( but most likely not
> > in it's first version).
>
> Sure, though where the metadata is is not significant; is it?
>
> Bruce

I hope I have given you decent answers/arguments to your questions and
proposals. I touched on a few of the reasons why we are suggesting RDF to
address the requirements of this task force. We are not trying to invent a
new unproven mechanism to embed structure in the documents because that
would reduce to embedding any XML within ODF XML. We want to focus on the
requirements and believe we have a solution for 80% of the requirements in
a standards-based proposal. We'll continue showing examples that address
the current use cases and requirements. For the sake of time, I would like
to see any new use cases and requirements added to the wiki, along with
more formal proposals that go with them, in order to guarantee progress.

-Elias