office-metadata message

Subject: Re: [office-metadata] Our discussion on the Wiki example
From: Elias Torres <eliast@us.ibm.com>
To: Svante Schubert <Svante.Schubert@Sun.COM>
Date: Mon, 4 Dec 2006 13:27:47 -0500
[minor nitpick, I'm having problems reading your emails, especially
differentiating your text vs mine, at least in the reply. If possible I'd
ask for a favor to use text-only messaging, if not I'll deal with it.]

Svante.Schubert@Sun.COM wrote on 12/04/2006 12:46:13 PM:

> Hi Elias,
>
> Elias Torres wrote:
> I would like to remind us all that we don't have much time left and only
> meeting weekly doesn't help either. I don't mind answering any questions
> on the issues and am willing to work with the rest of the task force on
> tweaking/changing/fixing the proposal presented here by Bruce and me.
>
> It surprises me that your statement sounds like RDFa or nothing,
> especially as RDFa is not a standard. Even once it is on its way to
> becoming one, it will certainly be subject to change. OASIS/ISO won't be
> able to reference it yet.

Sorry, I didn't mean to say that. I meant that our specific proposal
can only be twisted so much before it loses the benefits and capabilities
that address the existing requirements. I have also said several
times that we do not depend on the final outcome of the RDFa
specification in order to use it. We are borrowing the already
specified techniques and restating them in our spec. No dependency.

>
> But what really bothers me is that it was designed for XHTML being a
> flat format. RDFa is about embedding the meta data. ODF is compound
> and anybody correct me if I am wrong, I was sure everybody figured
> out that these should be separated (without redundancy).
>

That's fine, and it is why I said there may be two proposals. #1 allows
metadata both inside and outside of content.xml; #2 only needs
identification in content.xml, with all metadata in meta.xml.

If there was already agreement that no metadata should be specified in
content.xml then RDFa is no longer an option and I'm fine with that.

> However, I would like to say that the only reason why we suggest an
> RDFa-like approach is because we believe it has the chance of solving a
> large percentage of the use case requirements in our charter without
> resorting to custom XML schemas for each of the scenarios.
>
> Could you describe the type of scenario that is not possible?

> If the suggestions/questions
> presented to us indicate a fundamental enough change in direction (like
> some of Svante's thoughts on structure and grammars) I would like to
> request that those be presented as separate proposals to the group so we
> can discuss separately.
> I'm worried that we will spend our little time
> discussing points in emails as opposed to a specific approach or draft
> and never get anything accomplished; I've seen it happen before.
>
> Also, I'm hoping that we can make a decision as a group on which
> proposal we select moving forward, because I don't think it is productive
> to work on 2 or more proposals at the same time in order to reach a single
> specification. Unless, of course, the current proposal is flawed or
> insufficient with respect to our use cases and requirements.
>
> Elias, here you state it is not productive to work on >1 proposals,
> later you ask to write a different proposal.


I'm trying to say that we should have many fairly complete proposals now,
but once we move forward with one we stick to it, because if we don't reach
agreement at some point close to the deadline we'll never finish the
specification. It'd be very selfish of me to say we can only have 1
suggestion, hence 1 final draft. I am asking for many suggestions, but
let's decide at some point which one we are moving forward with.

> We might as well step back and define the scenarios, we would like
> to show in Wiki as examples:
> For instance:
> 1. metadata contains/reference additional data
> 2. metadata specifies a unique content
> 3. metadata specifies a class of content
> and collect basic design decisions we agree on, like
> 1. No redundancy (no repetition of data from the content in the
>    meta data)
> 2. RDF compatible
> 3. generic solution / coverage of use cases
> 4. simple solution
> And afterwards we make proposals, perhaps based on implementations like
> RDFa, analyze their dis/advantages and choose one. That does not sound too
> complicated nor time consuming.
> As Bruce was so kind to start with one example, I commented on it and asked
> for changes. I see no delay with this process.

I think that's a good set of criteria for us to evaluate specific
proposals. Let's carry on.

>
> Now onto Svante's email.
>
> "Bruce D'Arcus" <bdarcus@gmail.com> wrote on 12/01/2006 04:37:57 PM:
>
>
> On Dec 1, 2006, at 1:53 PM, Svante Schubert wrote:
>
>
> Instead of
> meta:about="urn:uuid:fe107eb0-7704-11db-9fe1-0800200c9a66", we
> might use in our case meta:id="citation". It's mnemonic and the
> value of the meta:id (which is not an xml:id as it does not have to
> be unique, when expressing a type) would be offered during meta
> data creation by the ODF application component, which is
> responsible for this type of meta data.
>
>
> A good point you make here is the fact that xml:id must be unique within
> an XML document and that's not necessarily what we are after in our RDFa (or
> should we call it ODFa? :D) approach. The meta:about attribute we are
> suggesting is simply to denote the subject of the relationship we are
> trying to establish within the content.
>
> My experience with metadata tells me we have two basic options: we either
> embed the metadata in the content or we store it separately and link to it
> somehow. RDFa is addressing the first, and our requirements and environment
> suggest we look at the second. I think we need to treat them somewhat
> separately. Let me explain why.
>
> Approach #1
>
> In content.xml
>
> <meta about="foo.jpg" property="dc:creator">Elias Torres</meta>
>
> In meta.xml
>
> Nothing.
>
> Approach #2
>
> In content.xml:
>
> <img src="foo.jpg"/>
>
> In meta.xml
>
> <rdf:Description about="foo.jpg">
>       <dc:creator>Elias Torres</dc:creator>
> </rdf:Description>
>
> As you can see, they really don't have much in common. In #1 we need a
> way to model and express our metadata needs, and we draw from RDFa one of
> the ways of doing that (about, rel, rev, property, content, datatype
> attributes). In #2 we only need a way to identify objects/resources within
> the document and leave it up to the meta.xml to contain all of the possible
> information. The main reason why I like #1 is because there's a lot of data
> already in the document that we would like to avoid duplicating, except
> that I don't believe we can avoid it 100% of the time (e.g. the content
> issue).
>
> I think I failed to separate these two approaches enough on telecons but I
> hope we can get back on track. Anyways, back to the question: does
> meta:about/xml:id need to be mnemonic? My answer is no. Let me show by
> example:
>
> <link about="http://torrez.us/who#elias" rev="dc:creator" href="foo.jpg"/>
>
> yields
>
> <foo.jpg> dc:creator <http://torrez.us/who#elias>
>
> As you can see the about attribute has nothing to do with mnemonics or
> anything of that nature. It's about uniquely identifying resources in both
> a closed and open world. The only reason why xml:id came into the
> discussion is because we want to leverage things already identified in our
> current documents such as <table table:name="table1">...</table>. In RDFa
> and other HTML approaches we make use of both @id and @name to locate
> things within the same document.
>
> <meta about="table1" property="dc:creator">Elias Torres</meta>
> <body>
> <x />
> <y />
> <table name="table1">
> ...
> </table>
> </body>
>
> Also, is ODF content/source copy and paste a requirement for our metadata
> proposal? I didn't think it was. I hope we are not expecting people to
> hand-write ODF (e.g. no need for mnemonics).
>
> Long story short, I said do not require 'urn:uuid' for an internal
> reference between content and meta, as you have used it in the example.
> A mnemonic approach is helpful for the writer and should be recommended,
> but it is not and cannot be required.
> I prefer - as already stated - the approach of attribute references
> between content.xml and metadata, which is not one of your
> approaches above - neither #1 nor #2.

#2 is the one that only uses ids in the content to refer to metadata in the
meta.xml. I guess I was not clear enough.
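
To make #2 concrete, here is a minimal Python sketch, assuming the foo.jpg
example quoted above. content.xml carries only identifiers and every
statement lives in meta.xml. The bar.jpg element, the function names, the
choice of `src` as the identifying attribute, and the use of a namespaced
rdf:about (the quoted example writes a bare `about`) are assumptions made
for illustration, not part of any proposal.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"

# Hypothetical content.xml fragment: identifiers only, no metadata.
CONTENT_XML = '<body><img src="foo.jpg"/><img src="bar.jpg"/></body>'

# Hypothetical meta.xml: all statements about the identified resources.
META_XML = f"""
<rdf:RDF xmlns:rdf="{RDF}" xmlns:dc="{DC}">
  <rdf:Description rdf:about="foo.jpg">
    <dc:creator>Elias Torres</dc:creator>
  </rdf:Description>
</rdf:RDF>
"""


def metadata_by_subject(meta_xml):
    """Map each rdf:about value to a list of (property, value) pairs."""
    index = {}
    for desc in ET.fromstring(meta_xml).iter(f"{{{RDF}}}Description"):
        subject = desc.get(f"{{{RDF}}}about")
        for prop in desc:
            index.setdefault(subject, []).append((prop.tag, prop.text))
    return index


def describe_content(content_xml, meta_xml, id_attr="src"):
    """Pair every identified element in the content with its metadata."""
    index = metadata_by_subject(meta_xml)
    return {
        el.get(id_attr): index.get(el.get(id_attr), [])
        for el in ET.fromstring(content_xml).iter()
        if el.get(id_attr) is not None
    }


result = describe_content(CONTENT_XML, META_XML)
```

Here `result` maps foo.jpg to its dc:creator statement and bar.jpg to an
empty list, since nothing in meta.xml describes it.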

>
> In content.xml:
> ============
> ..
> <text:p meta:class="date">
>     <text:span meta:class="month">May</text:span>
>     <text:span meta:class="day">8th</text:span> at
>     <text:span meta:class="time">10am</text:span>
> </text:p>
> ..
> [NOTE:
> I changed meta:id to meta:class to avoid the impression that meta:id
> is unique. The naming 'meta:class' is not important for now.
> The value of meta:id is just an arbitrary string, here only
> provided as a mnemonic default string by a brave plugin programmer. ]
>
>
> In meta package:
> ============
> something RDF compatible
>
> This is a very simple approach. Everything seems to be accomplished
> by it; what advantage do #1 or #2 have?


I'm missing the goal of your example. On our phone call you were interested
in structure, so I figured you are looking for structure, but I'm not sure
you are really solving the ability to reference resources. I don't see the
relationship between the content and the metadata. I don't see a unique
identifier that could be used in the metadata. I only see structure
information, with no information on parsing. For example, how do you
know which class attribute values are to be found, in what order, etc.? Do
you allow this (e.g. structured content included at a sub-level):

<text:p meta:class="date">
     <text:span meta:class="month">May</text:span>
     <text:span meta:class="day">8th</text:span> at
     <text:div>
       <text:span meta:class="time">10am</text:span>
     </text:div>
</text:p>

I'm very concerned with the rules necessary in order for someone to extract
the information from the page into a data model. We proposed the RDF model
because of its flexibility, and RDFa has done the work for us on how to
parse it. Microformats, on the other hand, take an approach more like yours:
custom structures are specified in one-off specifications, and custom parsers
are needed to understand each of them. RDFa is a specification of how to do
it generically. Maybe we have not gone into the part of the specification
that states how you parse the document for metadata. This is the main
reason why I don't want to "pollute" the RDFa approach with structure
information, because the whole thing breaks down. If structure is more
important, then we should try that in a different proposal.
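
As an illustration of what generic parsing rules provide, here is a minimal
Python sketch. It is not the RDFa processing model, only a simplified rule
set assumed for this example: the subject is the nearest `about` on the
element or an ancestor, a statement is produced only where a `property`
attribute appears, and the value is @content if present, otherwise the
element text. The function name and the extra <div> wrapper are
hypothetical; the markup follows the party example quoted further down.

```python
import xml.etree.ElementTree as ET


def extract_triples(xml_text):
    """Return the set of (subject, property, value) statements in xml_text."""
    triples = set()

    def walk(element, subject):
        # An `about` attribute sets the subject for this element and below.
        subject = element.get("about", subject)
        prop = element.get("property")
        if prop is not None and subject is not None:
            # Prefer the machine-readable @content over the human text.
            value = element.get("content")
            if value is None:
                value = "".join(element.itertext()).strip()
            triples.add((subject, prop, value))
        for child in element:
            walk(child, subject)

    walk(ET.fromstring(xml_text), None)
    return triples


ORIGINAL = """
<body>
  <p about="event1">
    The party will start at
    <span property="cal:dtstart" content="2006-12-12T15:05Z">5 past 3 on saturday</span>
    at <span property="location">my house.</span>
  </p>
</body>
"""

# Same statements after an edit: one span is wrapped in an extra element,
# the other is moved to a different paragraph and carries its own `about`.
MOVED = """
<body>
  <p about="event1">
    The party will start at
    <div><span property="cal:dtstart" content="2006-12-12T15:05Z">5 past 3 on saturday</span></div>.
  </p>
  <p about="somethingelse">
    ... and before I forget the party will be at
    <span about="event1" property="location">my house.</span>
  </p>
</body>
"""

same = extract_triples(ORIGINAL) == extract_triples(MOVED)
```

Under these assumed rules `same` is true: neither the extra nesting level
nor the move changes the extracted statements, and no per-vocabulary
parser is involved.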

>
>
> But how is that it's "mnemonic" intrinsically valuable? I don't think
> it is.
>
>
> By this it is imaginable that even the implementation of metadata
> is being exchanged in meta.xml, without a byte changing in the
> content.xml. Imagine implementations like vcard vs. hcard.
>
>
> By us supporting both #1 and #2 we support people exchanging metadata
> using different schemas or ontologies.
>
>
> Am not following here. Can you restate?
>
>
> Second someone would like to link to meta data.
>
>
> Again, by using URIs people can link to meta data. If we use
> xml:id="short" we have no way of linking to resources (e.g. external
> documents or web pages).
>
> I will explain the linking to meta data in more detail in a different
> mail as this has little to do with the rest.
>
> Drafted in two sentences, my wish for linking is the following:
> I would like to be able to create a link to a document pointing to
> certain semantics, not to a structure,
> like pointing to the node set of all XML nodes having a certain
> class of meta data like Bruce's citation.

>
>
> I still don't know what this means.
>
>
> Instead of referencing a certain structure (e.g. third paragraph
> of the body), a link to the type of meta data in the package is
> closer to what is desired.
>
>
> BTW, I'm not advocating we reference structure. Referencing structure
> sucks (e.g. 2nd paragraph, 3rd table after the 1st paragraph, etc). I'd
> like to reference objects/resources.
>
>
> Sorry, again, am not understanding. Been a long day I guess.
>
>
> Although I have no foolproof implementation at hand (XPointer?),
> such an approach would solve the problem of changing structure.
>
> Can you explain what you mean by "changing structure"? Is this the
> split-paragraph example?
>
>
> I'm not sure how linking to the type of metadata in the package solves the
> linking problem at all. BTW, I have tried extensively to deal with
> annotations in Office documents in a product we built for Life Sciences
> organizations. I wrote plugins for Word, Excel and PowerPoint and just
> linking to structure never worked. I even tried this on HTML using my own
> XPointer implementations but it simply does not work for changing
> documents. It might for read-only versions of documents, but that's about
> it.
>
>
> Interesting, it is worth discussing this separately.
> This approach might exist alongside a newly introduced xml:id, which
> could be generated by the user when the document is ready to
> publish. xml:id should be stable, similar to the API / interface of
> a software, and therefore handled with care.
>
> And finally, when our goal is to weave arbitrary metadata into ODF
> in the most simple, generic way, I was distracted by @content - as
> Bernd was before:
>
> <meta property="cal:dtstart"
> content="20060508T1000-0500"> May 8th at 10am </meta>
>
> There is detailed redundant information in the attribute, and there
> is also a blob of data.
>
>
> I would like to help us see that it's simply not feasible today to expect
> all human-entered data to be machine readable. It's definitely the case for
> dates, one of the most complicated pieces of data we deal with in computers
> today. If I'm not making any sense, think about dates in Japan where they
> don't use a Gregorian calendar and use Emperors' reigns for their years.
> Anyways, let me try a few more examples to see if "more" data could be
> extracted from the human-entered text.
>
> <span property="amount" content="1000000">one million</span>
>
> <span property="dtstart" content="15:05">5 past 3</span>
>
> This year's <span property="net" content="-1600000">loss 1.6 million
> dollars.</span>
>
> I hope this is enough to understand that we should not be in the business
> of removing the @content attribute, except for noting that it might be best
> to re-use as much content from the text as possible.
>
> This is not too complicated.
> Mapping meta data to one another should be a common problem, which
> is (more or less easily) solved. At least the mapping of a Gregorian
> calendar to the dates in Japan is quite simple.
> And mapping the logic from written numbers to the decimal system is
> no rocket science, either.

sigh.

>
>
> Why is this a problem? One is for machines, and one for people.
>
>
> In general I would rather prefer something like:
>
> <text:p meta:id="date">
>     <text:span meta:id="month">May</text:span>
>     <text:span meta:id="day">8th</text:span> at
>     <text:span meta:id="time">10am</text:span>
> </text:p>
>
> No, Svante. That's certainly not how you'd do it in RDFa. The ID is
> just that: an id that allows one to then associate something else
> with it (a link, metadata descriptions, etc.). It indicates no
> semantics at all. Using dumb strings of text for semantics is no more
> useful than just using styles.
>
>
> I would like to look past a few of the minor issues with Svante's example.
> I'll stay away from the naming issue, since I hope I've addressed that
> earlier in my email. However, I would like to note a much more important
> point that I would like everyone to study closely. Svante, I think you are
> thinking too much about the structure of the data as opposed to specifying
> metadata in a very granular way. In our RDFa date example we are NOT
> focused on the structure. Let me give you an example:
>
> <p about="event1">
>   The party will start at <span property="cal:dtstart"
> content="2006-12-12T15:05Z">5 past 3 on saturday</span> at <span
> property="location">my house.</span>
> </p>
>
> In RDF we are not focused on "structure", we are interested in the
> statements made in the model. In the example above, we have two statements
> being asserted.
>
> <#event1> cal:dtstart "2006-12-12T15:05Z" .
> <#event1> cal:location "my house" .
>
> The statements stand completely on their own and they are not necessarily
> part of a greater structure. We can even try to solve the
> really-really-really hard moving content problem. Here we go:
>
> <p about="event1">
>   The party will start at <span property="cal:dtstart"
> content="2006-12-12T15:05Z">5 past 3 on saturday</span>.
> </p>
> <p about="somethingelse">
>   Some text about something else....
>
>   ... and before I forget the party will be at <span about="event1"
> property="location">my house.</span>
> </p>
>
> If you notice, we moved the content around but in this case we were able to
> maintain the metadata because the cut completely encompassed the <span>
> element. There are specific reasons for us wanting to use RDF and RDFa. In
> RDFa, triples are only generated when a rel, rev or property attribute is
> encountered. This property allows us to make statements about resources in
> different parts of the document without having to worry about maintaining
> structure because the model extracted from the content.xml in the "changed"
> example is isomorphic to the first one. I know there's an equivalent
> scenario for Svante's scenario if done right (e.g. we use properties
> ex:month, ex:day, ex:time). However, I think in his case, he meant "date"
> to denote (let's say) the dtstart of the document. This means that month,
> day and time are dependent on their document location and any movement of
> the content would absolutely destroy our metadata hopes. I guess we could
> introduce some "merge" rules to try to solve this, but the higher-level
> the structures, the more complicated the rules can become; think something
> along the lines of XML diff/merge, all within the same document.
>
>
> or shorter using default namespace (and none for the attribute) as
>
> <text:p s="date">
>     <a s="month">May</a>
>     <a s="day">8th</a> at
>     <a s="time">10am</a>
> </text:p>
>
> By doing so, other software besides the correct plugin would have
> a chance to interpret the data.
>
>
> I think that as you look at the work that Dan Connolly has done in the
> area.
> Anything in particular in mind?

I believe you linked to this:
http://dig.csail.mit.edu/breadcrumbs/node/146

> You'll believe me that having this information:
>
> <#event1> cal:dtstart "2006-12-12T15:05Z" .
> <#event1> cal:location "my house" .
>
> any plugin can interpret that data and do as it pleases (e.g. convert it to
> some other format for display).
>
>
>
> OK, I think you need to step back and ask what problem are you trying
> to solve here? It seems to me you want to be able to bind a GUI to
> data, and then to particular application behavior (which could be a
> plug-in). E.g. say someone wants to add custom content processing;
> how do you do that? How does a plug-in know which content and which
> metadata to deal with?
>
> Is that right?
>
> If yes, the manifest can also be used for this, as well as some
> similar typing on custom fields.
>
>
> For example the fall-back plugin of an ODF application (which
> assists the user in showing / editing meta data, when the correct
> plugin is not installed / found), would be able to assist the user.
> Even more when the possible set of data (e.g. all months) is
> defined in an embedded grammar,
>
> OK, remember, what you are calling an "embedded grammar" is exactly
> what RDFa provides.
>
>
> further features would be possible. For instance 'auto completion'
> or 'drop down list' for the content of such a field are thinkable
> for the future even for the fall-back plugin (but most likely not
> in its first version).
>
> Sure, though where the metadata is is not significant; is it?
>
> Bruce
>
>
> I hope I have given you decent answers/arguments to your questions and
> proposals. I touched on a few of the reasons why we are suggesting RDF to
> address the requirements of this task force. We are not trying to invent a
> new unproven mechanism to embed structure in the documents because that
> would reduce to embedding any XML within ODF XML. We want to focus on the
> requirements and believe we have a solution for 80% of the requirements in
> a standards-based proposal. We'll continue showing examples that address
> the current use cases and requirements. For the sake of time, I would like to
> see any new use cases and requirements added to the wiki and more formal
> proposals that go along with them, in order to guarantee progress.
>
> Wiki is a good point, we should keep this up.

>
> -Elias
>
>
>
> Best regards,
> Svante
References:
- Re: [office-metadata] Our discussion on the Wiki example
  - From: Svante Schubert <Svante.Schubert@Sun.COM>