office-metadata message

Subject: Re: [office-metadata] Our discussion on the Wiki example

From: Svante Schubert <Svante.Schubert@Sun.COM>
To: Elias Torres <eliast@us.ibm.com>, "Bruce D'Arcus" <bdarcus@gmail.com>
Date: Mon, 04 Dec 2006 18:46:13 +0100

Hi Elias,

Elias Torres wrote:

I would like to remind us all that we don't have much time left and only
meeting weekly doesn't help either. I don't mind answering any questions on
the issues and am willing to work with the rest of the task force on
tweaking/changing/fixing the proposal presented here by Bruce and me.

It surprises me that your statement sounds like RDFa or nothing.
Especially as RDFa is not a standard. Even once it is on the way to becoming one, it will certainly be subject to change. OASIS/ISO won't be able to reference it yet.

But what really bothers me is that it was designed for XHTML, which is a flat format. RDFa is about embedding the metadata. ODF is a compound format and, anybody correct me if I am wrong, I was sure everybody had figured out that content and metadata should be separated (without redundancy).

However, I would like to say that the only reason why we suggest an
RDFa-like approach is because we believe it has the chance of solving a large
percentage of the use case requirements in our charter without resorting to
custom XML schemas for each of the scenarios.

Could you describe the type of scenario that is not possible?

If the suggestions/questions
presented to us indicate a fundamental enough change in direction (like
some of Svante's thoughts on structure and grammars) I would like to
request that those be presented as separate proposals to the group so we
can discuss separately.

I'm worried that we will spend our little time
discussing points in emails as opposed to a specific approach or draft and
never get anything accomplished; I've seen it happen before.

Also, I'm hoping that we can make a decision as a group towards which
proposal we select moving forward, because I don't think it is productive to
work on 2 or more proposals at the same time in order to reach a single
specification. Unless of course, the current proposal is flawed or
insufficient in respect to our use case and requirements.

Elias, here you state it is not productive to work on more than one proposal; later you ask for a different proposal to be written.
We might as well step back and define the scenarios we would like to show in the Wiki as examples.
For instance:

metadata contains/references additional data
metadata specifies a unique content
metadata specifies a class of content

and collect basic design decisions we agree on, like

No redundancy (no repetition of data from the content in the meta data)
RDF compatible
generic solution / coverage of use cases
simple solution

And afterwards we make proposals, perhaps based on an implementation like RDFa,
analyze their dis/advantages and choose one. That does not sound too complicated or time consuming.
As Bruce was so kind to start with one example, I commented on it and asked for changes. I see no delay with this process.

Now onto Svante's email.

"Bruce D'Arcus" <bdarcus@gmail.com> wrote on 12/01/2006 04:37:57 PM:

On Dec 1, 2006, at 1:53 PM, Svante Schubert wrote:

Instead of
meta:about="urn:uuid:fe107eb0-7704-11db-9fe1-0800200c9a66", we
might use in our case meta:id="citation". It's mnemonic and the
value of the meta:id (which is not an xml:id as it does not have to
be unique, when expressing a type) would be offered during meta
data creation by the ODF application component, which is
responsible for this type of meta data.


A good point you make here is the fact that xml:id must be unique within an
XML document and that's not necessarily what we are after in our RDFa (or
should we call it ODFa? :D) approach. The meta:about attribute we are
suggesting is simply to denote the subject of the relationship we are
trying to establish within the content.

My experience with metadata tells me we have two basic options: we either
embed the metadata in the content or we store separately and link to it
somehow. RDFa is addressing the first and our requirements and environment
suggest we look at the second. I think we need to treat them separately
somewhat. Let me explain why.

Approach #1

In content.xml

<meta about="foo.jpg" property="dc:creator">Elias Torres</meta>

In meta.xml

Nothing.

Approach #2

In content.xml:

<img src="foo.jpg"/>

In meta.xml

<rdf:Description about="foo.jpg">
      <dc:creator>Elias Torres</dc:creator>
</rdf:Description>

As you can see, they really don't have much in common. In #1 we need a
model to express our metadata needs, and we draw from RDFa one of the
ways of doing that (about, rel, rev, property, content, datatype
attributes). In #2 we only need a way to identify objects/resources within
the document and leave it up to the meta.xml to contain all of the possible
information. The main reason why I like #1 is because there's a lot of data
already in the document that we would like to avoid duplicating, except
that I don't believe we can avoid it 100% of the time (e.g. the content
issue).

I think I failed to separate these two approaches enough on telecons but I
hope we can get back on track. Anyways, back to the question: does
meta:about/xml:id need to be mnemonic? My answer is no. Let me show by
example:

<link about="http://torrez.us/who#elias" rev="dc:creator" href="foo.jpg"/>

yields

<foo.jpg> dc:creator <http://torrez.us/who#elias>

As you can see the about attribute has nothing to do with mnemonics or
anything of that nature. It's about uniquely identifying resources in both
a closed and open world. The only reason why xml:id came into the
discussion is because we want to leverage things already identified in our
current documents such as <table table:name="table1">...</table>. In RDFa
and other HTML approaches we make use of both @id and @name to locate
things within the same document.
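
To make the direction of the rev example above concrete, here is a minimal
Python sketch. The function name link_to_triple and its signature are
hypothetical, made up for illustration; it only assumes what the example
shows, namely that rel points from the about resource to the href, and rev
points the other way.

```python
def link_to_triple(about, href, rel=None, rev=None):
    """Turn the attributes of one <link> element into a
    (subject, predicate, object) triple."""
    if rel is not None:
        # rel: the "about" resource is the subject.
        return (about, rel, href)
    if rev is not None:
        # rev: the relationship is reversed, so "href" becomes the subject.
        return (href, rev, about)
    raise ValueError("a link needs either rel or rev to state a relationship")


triple = link_to_triple("http://torrez.us/who#elias", "foo.jpg", rev="dc:creator")
print(triple)
```

Nothing in the sketch depends on the identifiers being mnemonic; they are
treated as opaque strings.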

<meta about="table1" property="dc:creator">Elias Torres</meta>
<body>
<x />
<y />
<table name="table1">
...
</table>
</body>

Also, is ODF content/source copy and paste a requirement for our metadata
proposal? I didn't think it was. I hope we are not expecting people to hand
write ODF (e.g. no need for mnemonics).

Long story short, I said: do not require 'urn:uuid' for an internal reference between content and meta, as you have used it in the example.
A mnemonic approach is helpful for the writer and should be recommended, but it is not and cannot be required.
I prefer, as already stated, the approach of attribute references between content.xml and metadata, which is not one of your approaches above, neither #1 nor #2.

In content.xml:
============
..
<text:p meta:class="date">
    <text:span meta:class="month">May</text:span>
    <text:span meta:class="day">8th</text:span> at
    <text:span meta:class="time">10am</text:span>
</text:p>
..
[NOTE:
I changed meta:id to meta:class to avoid the impression that meta:id is unique. The naming 'meta:class' is not important for now.
And the value of meta:class is just an arbitrary string, here only provided as a mnemonic default string by a brave plugin programmer. ]

In meta package:
============
something RDF compatible

This is a very simple approach. Everything seems to be accomplished by it; what advantage do #1 or #2 have?
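
As a rough illustration of how a plugin might pick up such class references, here is a small Python sketch. It is only a sketch under assumptions: the namespace URI bound to the meta prefix is a placeholder, the attribute name meta:class is just the working name from the note above, and the helper nodes_with_class is hypothetical.

```python
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
META_NS = "urn:example:meta"  # placeholder URI, not taken from any specification

CONTENT = f"""<text:p xmlns:text="{TEXT_NS}" xmlns:meta="{META_NS}" meta:class="date">
    <text:span meta:class="month">May</text:span>
    <text:span meta:class="day">8th</text:span> at
    <text:span meta:class="time">10am</text:span>
</text:p>"""


def nodes_with_class(root, cls):
    """Return every element (root included) whose meta:class equals cls."""
    attr = f"{{{META_NS}}}class"  # ElementTree spells qualified names as {uri}local
    return [elem for elem in root.iter() if elem.get(attr) == cls]


root = ET.fromstring(CONTENT)
months = [elem.text for elem in nodes_with_class(root, "month")]
print(months)
```

Because the value is not unique, the lookup returns a list of nodes rather than a single node. What each class means is left to the RDF-compatible part in the meta package.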

But how is it that its being "mnemonic" is intrinsically valuable? I don't think
it is.

By this it is imaginable that even the implementation of metadata
is being exchanged in meta.xml, without a byte changing in the
content.xml. Imagine implementations like vcard vs. hcard.


By us supporting both #1 and #2 we support people exchanging metadata using
different schemas or ontologies.

Am not following here. Can you restate?

Second someone would like to link to meta data.


Again, by using URIs people can link to meta data. If we use xml:id="short" we
have no way of linking to resources (e.g. external documents or web pages).

I will explain the linking to meta data in more detail in a different mail, as this has little to do with the rest.

Summarized in two sentences, what I hope for with linking is the following:
I would like to be able to create a link into a document that points to a certain semantic, not to a structure.
For example, a link pointing to the node set of all XML nodes having a certain class of meta data, like Bruce's citation.

I still don't know what this means.

Instead of referencing a certain structure (e.g. third paragraph
of the body) a link to the type of meta data in the package is
closer to what is desired.


BTW, I'm not advocating we reference structure. Referencing structure sucks
(e.g. 2nd paragraph, 3rd table after the 1st paragraph, etc). I'd like to
reference objects/resources.

Sorry, again, am not understanding. Been a long day I guess.

Although I have no foolproof implementation at hand (XPointer?),
such an approach would solve the problem of changing structure.

Can you explain what you mean by "changing structure"? Is this the
split-paragraph example?


I'm not sure how linking to the type of metadata in the package solves the
linking problem at all. BTW, I have tried extensively to deal with
annotations in Office documents in a product we built for Life Sciences
organizations. I wrote plugins for Word, Excel and Powerpoint and just
linking to structure never worked. I even tried this on HTML using my own
implementations of XPointer but it simply does not work for changing
documents. It might for read-only versions of documents, but that's about
it.

Interesting; it is worth discussing this separately.

This approach might exist alongside a newly introduced xml:id, which
could be generated by the user when the document is ready to
publish. xml:id should be stable similar to the API / interface of
a software and therefore handled with care.

And finally, when our goal is to weave arbitrary metadata into ODF
in a most simple, generic way, I was distracted by @content, as
Bernd was before: <meta property="cal:dtstart"
content="20060508T1000-0500"> May 8th at 10am </meta>. There is
detailed redundant information in the attribute, and there
is a blob of data as well.


I would like to help us see that it's simply not feasible today to expect
all human-entered data to be machine readable. It's definitely the case for
dates, one of the most complicated pieces of data we deal with in computers
today. If I'm not making any sense, think about dates in Japan where they
don't use a Gregorian calendar and use Emperors' reign for their years.
Anyways, let me try a few more examples to see if "more" data could be
extracted from the human-entered text.

<span property="amount" content="1000000">one million</span>

<span property="dtstart" content="15:05">5 past 3</span>

This year's <span property="net" content="-1600000">loss 1.6 million
dollars.</span>

I hope this is enough to understand that we should not be in the business
of removing the @content attribute, except for noting that it might be best
to re-use as much content from the text as possible.

This is not too complicated.
Mapping one metadata representation to another should be a common problem, which is (more or less easily) solved. At least the mapping of a Gregorian calendar to the dates in Japan is quite simple.
And mapping the logic from written numbers to the decimal system is no rocket science, either.

Why is this a problem? One is for machines, and one for people.

In general I would rather prefer something like:

<text:p meta:id="date">
    <text:span meta:id="month">May</text:span>
    <text:span meta:id="day">8th</text:span> at
    <text:span meta:id="time">10am</text:span>
</text:p>

No, Svante. That's certainly not how you'd do it in RDFa. The ID is
just that: an id that allows one to then associate something else
with it (a link, metadata descriptions, etc.). It indicates no
semantics at all. Using dumb strings of text for semantics is no more
useful than just using styles.


I would like to look past a few of the minor issues with Svante's example.
I'll stay away from the naming issue, since I hope I've addressed that
earlier in my email. However, I would like to note a much more important
point that I would like everyone to study closely. Svante, I think you are
thinking too much about the structure of the data as opposed to specifying
metadata in a very granular way. In our RDFa date example we are NOT
focused on the structure. Let me give you an example:

<p about="event1">
  The party will start at <span property="cal:dtstart"
content="2006-12-12T15:05Z">5 past 3 on saturday</span> at <span
property="location">my house.</span>
</p>

In RDF we are not focused on "structure", we are interested in the
statements made in the model. In the example above, we have two statements
being asserted.

<#event1> cal:dtstart "2006-12-12T15:05Z" .
<#event1> cal:location "my house" .

The statements stand completely on their own and they are not necessarily
part of a greater structure. We can even try to solve the
really-really-really hard moving content problem. Here we go:

<p about="event1">
  The party will start at <span property="cal:dtstart"
content="2006-12-12T15:05Z">5 past 3 on saturday</span>.
</p>
<p about="somethingelse">
  Some text about something else....

  ... and before I forget the party will be at <span about="event1"
property="location">my house.</span>
</p>

If you notice, we moved the content around but in this case we were able to
maintain the metadata because the cut completely encompassed the <span>
element. There are specific reasons for us wanting to use RDF and RDFa. In
RDFa, triples are only generated when a rel, rev or property attribute is
encountered. This property allows us to make statements about resources in
different parts of the document without having to worry about maintaining
structure because the model extracted from the content.xml in the "changed"
example is isomorphic to the first one. I know there's an equivalent
scenario for Svante's scenario if done right (e.g. we use properties
ex:month, ex:day, ex:time). However, I think in his case, he meant "date"
to denote (let's say the dtstart of the document). This means that month,
day and time are dependent on their document location and any movement of
the content would absolutely destroy our metadata hopes. I guess we could
introduce some "merge" rules to try to solve this, but the higher-level
the structures, the more complicated the rules can become; think something
along the lines of XML diff/merge, all within the same document.
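
The claim that both versions yield the same model can be checked with a small
Python sketch. This is a simplified reading of the rules described above, not
an RDFa processor: the function extract_triples is hypothetical, it only
handles about, property and content, it keeps attribute values verbatim
(no prefix expansion), and the <body> wrapper is added only so that each
snippet parses as a single XML document.

```python
import xml.etree.ElementTree as ET


def extract_triples(xml_text):
    """Return the set of (subject, property, value) statements in xml_text."""
    triples = set()

    def walk(elem, subject):
        # An element's own about attribute overrides the inherited subject.
        subject = elem.get("about", subject)
        prop = elem.get("property")
        if prop is not None and subject is not None:
            # Prefer the machine-readable content attribute over the text.
            value = elem.get("content")
            if value is None:
                value = "".join(elem.itertext()).strip()
            triples.add((subject, prop, value))
        for child in elem:
            walk(child, subject)

    walk(ET.fromstring(xml_text), None)
    return triples


ORIGINAL = """<body><p about="event1">
  The party will start at <span property="cal:dtstart"
content="2006-12-12T15:05Z">5 past 3 on saturday</span> at <span
property="location">my house.</span>
</p></body>"""

MOVED = """<body><p about="event1">
  The party will start at <span property="cal:dtstart"
content="2006-12-12T15:05Z">5 past 3 on saturday</span>.
</p>
<p about="somethingelse">
  Some text about something else....
  ... and before I forget the party will be at <span about="event1"
property="location">my house.</span>
</p></body>"""

same_model = extract_triples(ORIGINAL) == extract_triples(MOVED)
print(same_model)
```

The second paragraph, about="somethingelse", contributes no statement of its
own because it carries no property attribute; the moved <span> keeps its
statement because it names its subject itself.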

or shorter using default namespace (and none for the attribute) as

<text:p s="date">
    <a s="month">May</a>
    <a s="day">8th</a> at
    <a s="time">10am</a>
</text:p>

By doing so, other software apart from the correct plugin would have
a chance to interpret the data.


I think that as you look at the work that Dan Connolly has done in the
area.

Anything in particular in mind?

You'll believe me that having this information:

<#event1> cal:dtstart "2006-12-12T15:05Z" .
<#event1> cal:location "my house" .

any plugin can interpret that data and do as it pleases (e.g. convert it to
some other format for display).

OK, I think you need to step back and ask what problem are you trying
to solve here? It seems to me you want to be able to bind a GUI to
data, and then to particular application behavior (which could be a
plug-in). E.g. say someone wants to add custom content processing;
how do you do that? How does a plug-in know which content and which
metadata to deal with?

Is that right?

If yes, the manifest can also be used for this, as well as some
similar typing on custom fields.

For example the fall-back plugin of an ODF application (which
assists the user in showing / editing meta data, when the correct
plugin is not installed / found), would be able to assist the user.
Even more when the possible set of data (e.g. all months) is
defined in an embedded grammar,

OK, remember, what you are calling an "embedded grammar" is exactly
what RDFa provides.

further features would be possible. For instance 'auto completion'
or 'drop down list' for the content of such a field are thinkable
for the future even for the fall-back plugin ( but most likely not
in its first version).

Sure, though where the metadata is is not significant; is it?

Bruce


I hope I have given you decent answers/arguments to your questions and
proposals. I touched on a few of the reasons why we are suggesting RDF to
address the requirements of this task force. We are not trying to invent a
new unproven mechanism to embed structure in the documents because that
would reduce to embedding any XML within ODF XML. We want to focus on the
requirements and believe we have a solution for 80% of the requirements in
a standards-based proposal. We'll continue showing examples that address
the current use cases and requirements. For the sake of time, I would like to
see any new use cases and requirements added to the wiki and more formal
proposals that go along with it, in order to guarantee progress.

Wiki is a good point, we should keep this up.

-Elias

Best regards,
Svante
