Subject: [office] OpenDocument metadata and XMP


OpenDocument TC members,

This posting is a commentary on metadata issues for OpenDocument and  
in particular how an XMP-like approach might address those issues. It  
is meant to further the already ongoing metadata discussion within  
the OpenDocument TC and only represents ideas, not any concrete  
proposal. This was written by Alan Lillich with some welcome early  
feedback from Duane Nickull and Bruce D'Arcus. The views expressed  
here do not constitute the official opinion of Adobe Systems, nor of  
Duane and Bruce.

Please post comments to the OpenDocument TC mailing list or to Alan  
Lillich (alillich@adobe.com) and Duane Nickull (dnickull@adobe.com).

I'm coming into this discussion somewhat late, and there is a lot of
ground to cover about OpenDocument, metadata in general, and XMP.
This is a long posting; hopefully it is coherent and useful. I have
tried not to specifically "make a case for XMP". Instead I have
tried to present an objective discussion of metadata issues in a
manner that will help the OpenDocument TC make decisions.

This posting is divided into sections:
1. Miscellaneous background
2. Decision factors for the OpenDocument TC
3. A suggested approach for OpenDocument
4. A description of XMP

======================================================================
1. Miscellaneous background

---------------------------------------------
Some background on the author:

I'm a software engineer with 27 years of work experience. I spent  
almost 10 years working on commercial Ada compilers and related  
software, and almost 10 years working for Apple on internals of the  
PowerPC Mac OS. I've been with Adobe almost 5 years, hired after XMP  
was first shipped with Acrobat 5 to take over development of the core  
XMP toolkit and help the other Adobe application teams incorporate  
support for XMP. I have a deep interest in shipping high quality,  
high volume commercial software. While I can't speak for the original  
design intentions behind XMP, I can address the value of XMP from the  
view of implementing its internals and helping client applications  
utilize it.

BTW - I have recently become a member of the OpenDocument TC,  
specifically to participate in this debate. I will however abstain  
from any vote concerning metadata in order to avoid the appearance of  
Adobe attempting to "push" XMP into OpenDocument.

--------------------------------------
About the Adobe XMP SDK:

If you've looked at the XMP SDK in the past, please look again.  
Earlier this year Adobe posted a significant update to the XMP  
Specification. This did not introduce significant changes to the XMP  
data model, but did significantly improve how it is described. The  
latest XMP spec has chapter 2 "XMP Data Model" and chapter 3 "XMP  
Storage Model". Adobe recently (October?) posted an entirely new  
implementation for the core XMP toolkit. This has a revamped API that  
is similar to the old but much easier to use and more complete. The  
code is a total rewrite; it is now smaller, faster, and more robust.

--------------------------------
Presumptions and bias:

A core part of making rational decisions is determining goals and  
placing valuations on choices. I've tried to avoid outright advocacy,  
but there certainly are presumptions and bias behind what is  
presented here.

One presumption is that we're talking about a solution that can be  
serialized as RDF. There is no presumption about how much of RDF is  
allowed. I do have a bias for a subset that retains expressive power  
while reducing implementation effort.

Perhaps the most significant presumption is that success of  
OpenDocument depends on the availability of a variety of high quality  
and low cost commercial applications. I suspect that business and  
government in the US and Europe will insist on the stability and  
support of commercial products. The completeness and quality of all  
applications, commercial or open source, depends quite a bit on the  
clarity and implementability of the OpenDocument specification. It  
needs to be easily, reliably, and consistently implemented. The ill  
effects that can arise if parts of the specification are unclear or  
hard to implement include:
	- Features might be too complex for mainstream users
	- Applications might be fragile or buggy
	- Applications might support private subsets, by intent or ignorance
	- The cost of implementation might reduce the variety of choice

A bias related to this presumption is that pragmatic choices are  
necessary. "Good enough" is not necessarily a four letter word. Time to  
market is important.

Pragmatic choices do not necessarily mean simplistic results. I have  
a strong bias for formal models that are reasonably simple, robust,  
and powerful.

Another presumption is that good software design will lead to  
application layering. In the case of metadata this means a core  
metadata toolkit that manages an application neutral metadata model,  
with client application logic layered on top. The core metadata  
toolkit provides a runtime model and API to the client code. The  
strength of the underlying formal model has a big effect on the cost  
of the core metadata toolkit, and on the richness of the client code  
that can be created above it. The design of the runtime model and API  
has a big effect on the cost to create rich and robust client code on  
top of it.

A final presumption is that a good data model with open extensibility  
is crucial. By that I mean extensibility within a well defined data  
model, not wide open anything-in-the-universe extensibility. End user  
appreciation of metadata is growing rapidly in breadth and  
sophistication. The value of OpenDocument to large organizations will  
be enhanced by open metadata extensibility. Examples of significant  
customer extension in the case of XMP include the ISO PDF/A standard  
(http://www.aiim.org/documents/standards/ISO_19005-1_(E).doc), and  
the IPTC extensions (http://www.iptc.org/IPTC4XMP/).

======================================================================
2. Decision factors for the OpenDocument TC

This section poses a bunch of questions that are hopefully relevant  
in designing a metadata solution for OpenDocument. I've tried to  
organize them in a more or less logical progression. Some of them  
might make more sense after reading the following section describing  
XMP.

- How quickly to move on new metadata?
There is an existing, albeit limited, metadata solution. Since a  
change is being contemplated, there is a lot to gain by getting it  
right. Is there a major release coming up that places a deadline or  
urgency on defining a better metadata solution?

- Will the new metadata allow open extension?
Can end users freely create new metadata elements, provided that they  
stay within a defined formal model?

- How are formal schema used?
Must end users provide a formal schema in order to use new metadata  
elements? If not required, is it allowed/supported? If not provided,  
what impact does this have on other aspects of general document  
checking? If formal schemas are not used, is the underlying data  
model explicit in the serialization? If formal schemas are not used,  
where are various kinds of errors detected?

- If formal schemas are used, what is the schema language?
RELAX NG is clearly a better schema language than XML Schema. Can XML  
Schema be used at all by those who insist on it?

- What is the formal model for the metadata?
What is the expressive capability of the formal model? Can it be  
easily taught to general users? Does it contain enough power for  
sophisticated users? Can sophisticated users reasonably work within  
any perceived limitations? Can it be implemented reliably, cheaply,  
and efficiently? Will it be easy for client applications to use? Are  
there existing implementations?

- Is the formal model based on RDF, or can it be expressed in RDF? If  
so, does it encompass all of RDF? If not all of RDF, what are the  
model constraints? Can any equivalent serialization of RDF be used?  
If so, what impact does that have on formal schemas?

- Does the formal model have a specific notion of reference? If so,  
does it work broadly for general local file system use, networked  
file use, Internet use? What happens to references as files are moved  
into and out of asset management systems? If there is a formal notion  
of reference, what characteristics of persistence and specificity  
does it have? How well does it satisfy local workflow needs?

- What kinds of "user standard" metadata features are layered on top  
of the formal model? Users want helpful visible features. They  
generally don't care if things are part of a formal model or part of  
conventions at higher levels. For example, a UI can make use of  
standard metadata elements to provide a rich browsing, searching, and  
discovery experience. It is not necessary to have every aspect  
ensconced in the formal model.

- How important is interaction with XMP? Is it important to create a  
document using OpenDocument then publish and distribute it as PDF? If  
so, how is the OpenDocument metadata mapped into XMP in the PDF? Is  
it important to import illustrations or images that contain XMP into  
OpenDocument files? If so, how is the XMP in those files mapped into
the OpenDocument metadata? How does it return to XMP when published
as PDF? This "how" includes both how the mapping is defined (how
well do the formal models mesh?) and how the mapping is implemented
(what software must run?). Is it important to work seamlessly with
3rd party asset management systems that recognize XMP?

- How important is interaction with other forms of metadata or other  
metadata systems? What other systems? How would the metadata be mapped?

- Are there things in XMP that are absolutely intolerable? Things  
that have no reasonable workaround? Does XMP place unacceptable  
limitations on possible future directions? Are there undesirable  
aspects of XMP that can reasonably be changed?

======================================================================
3. A suggested approach for OpenDocument

This is written with great trepidation. It is here for the sake of  
being concrete and complete, and to provide an honest suggestion.  
This is not a formal proposal from Adobe, nor an informal attempt to  
twist anyone's arm. It is nothing but one software engineer's  
suggestion - a software engineer with an obvious chance of being  
biased by personal experience.

I think the OpenDocument metadata effort could succeed by starting  
with XMP, understanding how to work within XMP, and only looking for  
truly necessary changes. This could be done reasonably quickly and  
easily. It saves a lot of abstract design effort, allowing the  
OpenDocument TC to concentrate on more concrete issues.

It would provide an RDF-based metadata model that has demonstrated
practical value, one that can be reliably, cheaply, and efficiently
implemented, with an existing public C++ implementation that matches
internal use at Adobe (not a toy freebie). Adobe does not have a
Java implementation at this time, though.

This would provide a solution that exports seamlessly to PDF, imports  
seamlessly from existing files containing XMP, and integrates  
seamlessly with other systems recognizing XMP.

Since XMP can be serialized as legitimate RDF, there is an argument  
for easy, if not seamless, incorporation into other RDF stores.  
Slight decoration or modification of the XMP in these cases should be  
reasonably easy. And probably not unique to XMP, since the universe  
of RDF usage is not uniform.

======================================================================
4. A description of XMP

This section primarily describes XMP as it exists today. The purpose  
is to make sure everyone understands what the XMP specification  
specifies, what it leaves unsaid, and what Adobe software can and  
cannot do, so that well informed choices can be made. There is no  
intent to imply that XMP is the best of all possible solutions.

You can break XMP into 4 distinct areas:
- The abstract data model, the kinds of metadata values and structures.
     - The specific data model used by standard properties.
     - The serialization syntax.
     - The rules for embedding in files.

The abstract data model is the most important part. It defines the  
kind of metadata values and concepts that can be represented. The  
data model used by standard properties is almost as  
important. Common modeling of standard properties is important for  
reliable data interchange.

The specific serialization syntax is not as important. As long as the  
mapping to the data model is well defined, it is reasonably easy to  
convert between different ways to write the metadata. Of course there  
are benefits and costs to any specific serialization. What I mean  
here is that the underlying formal data model defines what concepts  
can be expressed. How the data model is serialized in XML is not as  
important as the data model itself.

The file embedding rules are by far the least important here. It is  
important that metadata is embedded consistently for each file  
format, but these rules are specific to the format and not much  
related to the other areas.

The following subsections discuss aspects of the abstract data model.

-------------------------------------
The basic XMP data model

I've taken to describing the XMP data model as "qualified data  
structures". The basis is traditional C-like data structures: simple  
values, structs containing named fields, and arrays containing  
indexed items. These are natural concepts, easily explained even to  
novices, and can be composed into rich and complex data structures.  
Ignoring surrounding context and issues about alternative equivalent  
forms of RDF, here are some simple examples:

	<ns:UniqueID>74A9C2F643DC11DABBE284332F708B21</ns:UniqueID>

	<ns:ImageSize rdf:parseType="Resource">
		<ns:Height>900</ns:Height>
		<ns:Width>1600</ns:Width>
	</ns:ImageSize>

	<dc:subject>
		<rdf:Bag>
			<rdf:li>XMP</rdf:li>
			<rdf:li>example</rdf:li>
		</rdf:Bag>
	</dc:subject>

One of the main advantages of serializing XMP as RDF is that these
aspects of the data model become self-evident. The core XMP toolkit
knows that something is simple, or is a struct, or is an array
directly from the serialized RDF; no additional schema knowledge is
necessary. This allows new metadata to be freely and easily created
by customers. Files can be shared without having to carry along
schema descriptions. Similarly, client applications can freely and
easily create new metadata without creating formal schemas or
requiring changes in the core XMP toolkit. The client applications
and users understand their metadata; it is not necessary for the
core toolkit to do so. Granted, formal schemas are necessary for
automated checking, which is a good thing. The point here is that a
lot of effective work and sharing can be done without burdening
everyone with the overhead of creating formal schemas.
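
As a purely hypothetical illustration (the myco: namespace and the
property names are invented here), a customer could add properties
like the following with no schema anywhere, and a generic XMP
processor can still tell which values are simple and which are
arrays:

	<myco:ApprovalStatus>approved</myco:ApprovalStatus>

	<myco:Reviewers>
		<rdf:Bag>
			<rdf:li>J. Smith</rdf:li>
			<rdf:li>T. Jones</rdf:li>
		</rdf:Bag>
	</myco:Reviewers>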

The notion of arrays in XMP seems to be often misunderstood, causing  
controversy in the use of RDF Bag, Seq, or Alt containers. One point  
is that within XMP these are just used to denote traditional arrays.  
The broader aspects of RDF containers are not part of the XMP data  
model. For XMP the difference between Bag, Seq, and Alt is simply a  
sideband hint that the items in the array are an unordered  
collection, an ordered collection, or a weakly ordered list of  
alternatives. A common question is why use arrays at all instead of  
repeated properties like:
	<dc:subject>XMP</dc:subject>
	<dc:subject>example</dc:subject>

The basic answer is the point above about a self-evident data model
in the RDF serialization. What if a given file only contained one
dc:subject element? Is dc:subject a simple property or an array?
Most humans have a very specific notion about whether a property is
supposed to be unique (simple), or might have multiple values (an
array). Using explicit array notation in the serialization makes
this clear, which in turn makes it clear in the XMP toolkit API, and
in how client applications use that API. Client application code
becomes more complex and UI design more difficult if everything is
potentially an array.
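
Even with a single value, the explicit container keeps the
cardinality self-evident; a reader or toolkit can tell that
dc:subject is an array without consulting any schema:

	<dc:subject>
		<rdf:Bag>
			<rdf:li>XMP</rdf:li>
		</rdf:Bag>
	</dc:subject>

The Alt form works the same way. The standard XMP modeling of
dc:title, for example, is an Alt array of language alternatives
(the title strings here are just illustrative):

	<dc:title>
		<rdf:Alt>
			<rdf:li xml:lang="x-default">My Document</rdf:li>
			<rdf:li xml:lang="fr">Mon document</rdf:li>
		</rdf:Alt>
	</dc:title>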

------------------------------
XML markup in values

A small aside: The XMP data model does allow XML markup in values,  
but this is serialized with escaping. This is easier and more  
efficient to parse than use of rdf:parseType="Literal". The main  
difference is that with escaping the markup is not visible in the DOM  
of a generic XML parse. Having that visibility does not seem like a  
crucial feature. Having the markup be visible will also complicate  
formal schemas.

For example, a call like:
	xmp.SetProperty ( "Prop", "<elem>text</elem>" );
will get serialized as:
	<Prop>&lt;elem&gt;text&lt;/elem&gt;</Prop>
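
For contrast, the rdf:parseType="Literal" form would keep the markup
visible to a generic XML parser, at the cost of a more complicated
parse and more complicated formal schemas:

	<Prop rdf:parseType="Literal"><elem>text</elem></Prop>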

------------------------
Qualifiers in XMP

Qualifiers in XMP come from RDF; they are not part of traditional
programming data structures. In the XMP data model qualifiers can be
viewed as properties of properties, and qualification is fully
general and recursive. Qualifiers seem to be easily understood by
users, fit easily into the core toolkit API, and provide a
significant mechanism for growth and evolution. They do this by
allowing later addition of information in a self evident and well
structured way, without breaking clients using an earlier and
simpler view.

For an example I'll first use an XMP data model display instead of  
RDF. Let's accept the notion of the XMP use of dc:creator as an  
ordered array of names. This works for the vast majority of needs:
	dc:creator	(orderedArray)
		[1] = "Bruce D'Arcus"
Suppose we now want to add some annotation for Bruce's blog. By  
adding this as a qualifier older clients still work just fine. In  
fact they could even have been written to anticipate qualifiers and  
display them when found:
	dc:creator	(orderedArray)
		[1] = "Bruce D'Arcus"	(hasQualifiers)
			ns:blog = "http://netapps.muohio.edu/blogs/darcusb/darcusb/"	(isQualifier isURI)

The RDF serialization of XMP uses the rdf:value notation for  
qualifiers. This is unfortunately a bit ugly and complicates formal  
schemas since it makes the qualified element look like a struct. The  
presence of the rdf:value "field" is what says this is not really a  
struct. The original unqualified array item:
	<rdf:li>Bruce D'Arcus</rdf:li>
Adding the qualifier:
	<rdf:li rdf:parseType="Resource">
		<rdf:value>Bruce D'Arcus</rdf:value>
		<ns:blog rdf:resource="http://netapps.muohio.edu/blogs/darcusb/darcusb/"/>
	</rdf:li>

--------------------------
References in XMP

One aspect of programming data models, and of many others, that is
not a first class part of XMP is a notion of reference. By this I
mean that the XMP specification does not define references, and the
Adobe XMP toolkit does not contain specific API or logic for dealing
with references. References can be defined and used within XMP by
clients; they just are not a fundamental part of the data model. A
reference is some form of address along with a means to find what is
at that address. Having the address without being able to go there
isn't of much use.

The lack of a formal notion of reference does not at all say that  
references cannot be represented or used within XMP. Specific kinds  
of references can easily be used. The onus is on the users of those  
references to define their semantics and representation.
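
As a hypothetical sketch (ns:SourceDocument and the path are
invented for illustration), a client could represent a reference as
an ordinary value, using rdf:resource only as the sideband hint that
the value is a URI:

	<!-- Hypothetical client-defined reference, not a defined XMP property. -->
	<ns:SourceDocument rdf:resource="file:///projects/report/draft3.odt"/>

The client that writes such a property is responsible for defining
what the address means and how, or whether, to resolve it.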

In the qualifier example, the use of the rdf:resource notation does  
not constitute a formal reference. That is just sideband information  
that this particular simple value happens to be a URI. The XMP  
specification does not require any specific action for this. The  
Adobe XMP toolkit does not attempt to follow the URI, nor does it  
allow rdf:resource to be used as a general inclusion or redirection  
mechanism. All that said, the example qualifier is an informal form  
of reference in the sense of an address that can be understood and  
utilized by client software. A generic UI can even display it with a  
nice OpenWebPage button.

By avoiding a formal notion of reference XMP avoids being over  
constrained by picking a particular notion of address, or of being  
overly complex in order to support a totally generalized notion of  
address. An important distinction between actual XMP usage and  
typical RDF examples is that XMP operates primarily in a file system  
world while RDF examples are almost always Internet oriented.

This is an important distinction with significant practical aspects.  
Suppose a reference is stored as a file URL. What happens to that  
reference as the file is copied around a network, or emailed, or  
moved into and out of an asset management system? What are the  
privacy issues related to putting file URLs in metadata without the  
user's conscious knowledge?

There are other aspects of references that URIs typically lose, at
any rate URIs in the form of typical readable URLs. Like machine
addresses, a URL references the current content at some location;
it is all about the location regardless of content. It is incapable
of being used for wider search, it breaks if the content moves, and
it cannot detect changes to the content. A typical URL is not
persistent; it can't identify the content through time and space.
Nor is it specific; it can't detect differences between altered
forms of the content. Yes, general URIs can contain arbitrary
knowledge, but that knowledge isn't of much use without an agent to
perform lookup.

Consider a number of forms of reference to a book: title, edition,
printing, ISBN, Dewey Decimal number. Which of these is useful
depends on local context. XMP leaves the definition and processing
of references to clients; they are the ones with specific knowledge
of local context and workflow.

As a more concrete example, consider how compound documents are  
typically created and published by InDesign. This isn't specifically  
about InDesign's use of XMP, but does illustrate the changing nature  
of a reference. During the creation process images and illustrations  
are usually placed into the layout by file reference. This lets the  
separate image file be updated by a graphic artist while an editorial  
person is working on the layout. When published to PDF, the images  
are physically incorporated. The file reference is no longer needed,  
and often not even wanted because of privacy concerns - the PDF file  
might be sent to places that have no business knowing what source  
files were used. However, XMP from the images can be embedded in the  
PDF and attached to the relevant image objects.

---------------------------------------------------
Interaction with metadata repositories

This section has looked inward at the XMP data model. There has been
no mention of RDF triples or broader RDF implications. This is
intentional. In terms of providing immediate customer benefit, the
first order value of XMP is getting the metadata into files,
viewing/editing it within applications like Photoshop, and
viewing/editing/searching it with applications like Bridge. By
focusing inwards a number of simplifications can be made that make
the metadata more approachable, and that make implementations more
robust and less expensive.

That said, there is real value in being able to have interactions  
between XMP and other metadata systems. What this means for a given  
metadata repository depends on the internal data model of the  
repository, how that relates to the XMP data model, and the  
directions that metadata is moved between XMP and the repository.  
Information is potentially lost when moved from a less to a more  
constrained data model. Since XMP can be serialized using a subset of  
RDF, XMP can be ingested fairly easily into a general RDF store. It  
should be reasonably easy to transform the XMP if the particular  
usage of RDF by XMP is not what is preferred.
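
For illustration, the earlier dc:subject example would land in a
generic RDF store as roughly the following triples (an N-Triples
style shorthand; the blank node labels are arbitrary):

	_:doc  dc:subject  _:bag .
	_:bag  rdf:type    rdf:Bag .
	_:bag  rdf:_1      "XMP" .
	_:bag  rdf:_2      "example" .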

--------------------------
Latitude for change

I've seen well intentioned suggestions like: "Enhance XMP to fit with  
current RDF and XML best practices." People need to be very realistic  
about the feasibility of various kinds of change. XMP is a shipping  
technology, with hundreds of thousands if not millions of copies of  
applications using it. This includes 3 major generations of Adobe  
products. Backward compatibility is a major concern.

Global or implicit changes that would cause XMP to fail in existing  
file formats and applications are unlikely to happen. There would  
have to be some very compelling reason. Suppose a future version of  
XMP in Photoshop started writing dc:subject as repeated elements  
instead of an explicit array (rdf:Bag). The XMP in new files would  
not be accepted by any existing Adobe software, and probably not by  
any existing 3rd party software supporting XMP.

Global or implicit changes restricted to new file formats have a  
better chance of success. Suppose OpenDocument files were "NewXMP",  
using repeated elements and schema knowledge. No existing software  
specifically looks for XMP in OpenDocument files, so the exposure is  
less than the previous example. But there is software, especially 3rd  
party asset management systems, that use byte-oriented packet  
scanning to find XMP in arbitrary files. That software will not  
handle these new files.

Changes to XMP that are restricted to actual metadata usage, or  
otherwise under conscious user control, have a much better chance of  
being accepted. One example might be a user preference for a custom  
RDF serialization that is more amenable as input to a general RDF store.

------------------------
Plain XMP Syntax

I want to also give the OpenDocument TC a heads-up about something  
called Plain XMP. We will be posting a paper about this for review  
and discussion to the Adobe XMP web site in the near future. I want  
to emphasize that Adobe has made no decisions about this, we are  
simply looking for community review and feedback.

Plain XMP is being presented as a possible alternative serialization  
for the XMP data model, one that happens to be describable using XML  
Schema. The full XMP data model is represented; you can move back
and forth between the RDF form of XMP and Plain XMP without loss. This  
does not signal any intent by Adobe to abandon RDF. This is purely an  
attempt to satisfy conflicting customer desires.

Since XMP first shipped with Acrobat 5, Adobe has gotten feedback  
from a number of customers or potential adopters of XMP that they  
don't like RDF. Why they don't like RDF isn't really an issue here.  
The Customer Is Always Right. There seem to be 3 common  
"complaints" (pardon the term) - general FUD about RDF, a dislike of  
the RDF XML syntax, and a desire to use "standard XML tools". This  
last generally means using W3C XML Schema.

Granted, RELAX NG is a vastly superior schema language. The conflict  
between RDF and XML Schema can be viewed as the fault of shortcomings  
in XML Schema. Again that isn't the point. The Customer Is Always Right.

A reasonable usage model for Plain XMP might be to put the RDF form  
of XMP in current file types by default, and maybe let users choose  
Plain XMP. New file types could go either way, realizing that  
existing packet scanners won't recognize Plain XMP. Future XMP  
toolkits would accept either. Client software could ask for a  
serialization in either form.

Plain XMP might also be more amenable to XSLT transformation than  
RDF, especially when qualifiers are used. This could make it useful  
for getting XMP into or out of metadata repositories.

Here are the previous examples serialized as Plain XMP, again  
ignoring surrounding context. Yes, there is going to be controversy  
about the use of an attribute versus character data for values. This  
will be explained in the Plain XMP proposal. In essence, this avoids  
some XML Schema problems. Arguably, XML used for data is distinctly  
different from XML used for "traditional markup". The latter requires  
character data, the former does not.

	<ns:UniqueID value="74A9C2F643DC11DABBE284332F708B21"/>

	<ns:ImageSize kind="struct">
		<ns:Height value="900"/>
		<ns:Width value="1600"/>
	</ns:ImageSize>

	<dc:subject kind="bag">
		<item value="XMP"/>
		<item value="example"/>
	</dc:subject>

	<dc:creator kind="seq">  <!-- This form drops the isURI tagging of ns:blog. -->
		<item value="Bruce D'Arcus" ns:blog="http://netapps.muohio.edu/blogs/darcusb/darcusb/"/>
	</dc:creator>

	<dc:creator kind="seq">  <!-- This keeps the isURI tagging of ns:blog. -->
		<item value="Bruce D'Arcus">
			<ns:blog value="http://netapps.muohio.edu/blogs/darcusb/darcusb/" rdf:resource=""/>
		</item>
	</dc:creator>

======================================================================



