Subject: [office] OpenDocument metadata and XMP
OpenDocument TC members,

This posting is a commentary on metadata issues for OpenDocument and in particular how an XMP-like approach might address those issues. It is meant to further the already ongoing metadata discussion within the OpenDocument TC and only represents ideas, not any concrete proposal.

This was written by Alan Lillich with some welcome early feedback from Duane Nickull and Bruce D'Arcus. The views expressed here do not constitute the official opinion of Adobe Systems, nor of Duane and Bruce. Please post comments to the OpenDocument TC mailing list or to Alan Lillich (alillich@adobe.com) and Duane Nickull (dnickull@adobe.com).

I'm coming into this discussion somewhat late; there is a lot of ground to cover about OpenDocument, metadata in general, and XMP. This is a long posting; hopefully it is coherent and useful. I have tried not to specifically "make a case for XMP". Instead I have tried to present an objective discussion of metadata issues in a manner that will help the OpenDocument TC make decisions.

This posting is divided into sections:

   1. Miscellaneous background
   2. Decision factors for the OpenDocument TC
   3. A suggested approach for OpenDocument
   4. A description of XMP

======================================================================
1. Miscellaneous background

---------------------------------------------
Some background on the author:

I'm a software engineer with 27 years of work experience. I spent almost 10 years working on commercial Ada compilers and related software, and almost 10 years working for Apple on internals of the PowerPC Mac OS. I've been with Adobe almost 5 years, hired after XMP was first shipped with Acrobat 5 to take over development of the core XMP toolkit and help the other Adobe application teams incorporate support for XMP. I have a deep interest in shipping high quality, high volume commercial software. While I can't speak for the original design intentions behind XMP, I can address the value of XMP from the view of implementing its internals and helping client applications utilize it.

BTW - I have recently become a member of the OpenDocument TC, specifically to participate in this debate. I will however abstain from any vote concerning metadata in order to avoid the appearance of Adobe attempting to "push" XMP into OpenDocument.

--------------------------------------
About the Adobe XMP SDK:

If you've looked at the XMP SDK in the past, please look again. Earlier this year Adobe posted a significant update to the XMP Specification. This did not introduce significant changes to the XMP data model, but did significantly improve how it is described. The latest XMP spec has chapter 2 "XMP Data Model" and chapter 3 "XMP Storage Model". Adobe recently (October?) posted an entirely new implementation of the core XMP toolkit. This has a revamped API that is similar to the old one but much easier to use and more complete. The code is a total rewrite; it is now smaller, faster, and more robust.

--------------------------------
Presumptions and bias:

A core part of making rational decisions is determining goals and placing valuations on choices. I've tried to avoid outright advocacy, but there certainly are presumptions and bias behind what is presented here.

One presumption is that we're talking about a solution that can be serialized as RDF. There is no presumption about how much of RDF is allowed. I do have a bias for a subset that retains expressive power while reducing implementation effort.
Perhaps the most significant presumption is that the success of OpenDocument depends on the availability of a variety of high quality and low cost commercial applications. I suspect that business and government in the US and Europe will insist on the stability and support of commercial products. The completeness and quality of all applications, commercial or open source, depends quite a bit on the clarity and implementability of the OpenDocument specification. It needs to be easily, reliably, and consistently implemented. The ill effects that can arise if parts of the specification are unclear or hard to implement include:

- Features might be too complex for mainstream users
- Applications might be fragile or buggy
- Applications might support private subsets, by intent or ignorance
- The cost of implementation might reduce the variety of choice

A bias related to this presumption is that pragmatic choices are necessary. Good enough is not necessarily a four letter word. Time to market is important. Pragmatic choices do not necessarily mean simplistic results. I have a strong bias for formal models that are reasonably simple, robust, and powerful.

Another presumption is that good software design will lead to application layering. In the case of metadata this means a core metadata toolkit that manages an application neutral metadata model, with client application logic layered on top. The core metadata toolkit provides a runtime model and API to the client code. The strength of the underlying formal model has a big effect on the cost of the core metadata toolkit, and on the richness of the client code that can be created above it. The design of the runtime model and API has a big effect on the cost to create rich and robust client code on top of it.

A final presumption is that a good data model with open extensibility is crucial. By that I mean extensibility within a well defined data model, not wide open anything-in-the-universe extensibility. End user appreciation of metadata is growing rapidly in breadth and sophistication. The value of OpenDocument to large organizations will be enhanced by open metadata extensibility. Examples of significant customer extension in the case of XMP include the ISO PDF/A standard (http://www.aiim.org/documents/standards/ISO_19005-1_(E).doc) and the IPTC extensions (http://www.iptc.org/IPTC4XMP/).

======================================================================
2. Decision factors for the OpenDocument TC

This section poses a bunch of questions that are hopefully relevant in designing a metadata solution for OpenDocument. I've tried to organize them in a more or less logical progression. Some of them might make more sense after reading the following section describing XMP.

- How quickly to move on new metadata? There is an existing, albeit limited, metadata solution. Since a change is being contemplated, there is a lot to gain by getting it right. Is there a major release coming up that places a deadline or urgency on defining a better metadata solution?

- Will the new metadata allow open extension? Can end users freely create new metadata elements, provided that they stay within a defined formal model?

- How are formal schemas used? Must end users provide a formal schema in order to use new metadata elements? If not required, is it allowed/supported? If not provided, what impact does this have on other aspects of general document checking? If formal schemas are not used, is the underlying data model explicit in the serialization?
If formal schemas are not used, where are various kinds of errors detected?

- If formal schemas are used, what is the schema language? RELAX NG is clearly a better schema language than XML Schema. Can XML Schema be used at all by those who insist on it?

- What is the formal model for the metadata? What is the expressive capability of the formal model? Can it be easily taught to general users? Does it contain enough power for sophisticated users? Can sophisticated users reasonably work within any perceived limitations? Can it be implemented reliably, cheaply, and efficiently? Will it be easy for client applications to use? Are there existing implementations?

- Is the formal model based on RDF, or can it be expressed in RDF? If so, does it encompass all of RDF? If not all of RDF, what are the model constraints? Can any equivalent serialization of RDF be used? If so, what impact does that have on formal schemas?

- Does the formal model have a specific notion of reference? If so, does it work broadly for general local file system use, networked file use, and Internet use? What happens to references as files are moved into and out of asset management systems? If there is a formal notion of reference, what characteristics of persistence and specificity does it have? How well does it satisfy local workflow needs?

- What kinds of "user standard" metadata features are layered on top of the formal model? Users want helpful visible features. They generally don't care if things are part of a formal model or part of conventions at higher levels. For example, a UI can make use of standard metadata elements to provide a rich browsing, searching, and discovery experience. It is not necessary to have every aspect ensconced in the formal model.

- How important is interaction with XMP? Is it important to create a document using OpenDocument then publish and distribute it as PDF? If so, how is the OpenDocument metadata mapped into XMP in the PDF? Is it important to import illustrations or images that contain XMP into OpenDocument files? If so, how is the XMP in those files mapped into the OpenDocument metadata? How does it return to XMP when published as PDF? This "how" includes both how the mapping is defined (how well do the formal models mesh?) and how the mapping is implemented (what software must run?). Is it important to work seamlessly with 3rd party asset management systems that recognize XMP?

- How important is interaction with other forms of metadata or other metadata systems? What other systems? How would the metadata be mapped?

- Are there things in XMP that are absolutely intolerable? Things that have no reasonable workaround? Does XMP place unacceptable limitations on possible future directions? Are there undesirable aspects of XMP that can reasonably be changed?

======================================================================
3. A suggested approach for OpenDocument

This is written with great trepidation. It is here for the sake of being concrete and complete, and to provide an honest suggestion. This is not a formal proposal from Adobe, nor an informal attempt to twist anyone's arm. It is nothing but one software engineer's suggestion - a software engineer with an obvious chance of being biased by personal experience.

I think the OpenDocument metadata effort could succeed by starting with XMP, understanding how to work within XMP, and only looking for truly necessary changes. This could be done reasonably quickly and easily.
It saves a lot of abstract design effort, allowing the OpenDocument TC to concentrate on more concrete issues. It would provide an RDF-based metadata model that has demonstrated practical value - one that can be reliably, cheaply, and efficiently implemented, with an existing C++ public implementation that matches internal use at Adobe (not a toy freebie). Adobe does not have a Java implementation at this time though.

This would provide a solution that exports seamlessly to PDF, imports seamlessly from existing files containing XMP, and integrates seamlessly with other systems recognizing XMP. Since XMP can be serialized as legitimate RDF, there is an argument for easy, if not seamless, incorporation into other RDF stores. Slight decoration or modification of the XMP in these cases should be reasonably easy - and probably not unique to XMP, since the universe of RDF usage is not uniform.

======================================================================
4. A description of XMP

This section primarily describes XMP as it exists today. The purpose is to make sure everyone understands what the XMP specification specifies, what it leaves unsaid, and what Adobe software can and cannot do, so that well informed choices can be made. There is no intent to imply that XMP is the best of all possible solutions.

You can break XMP into 4 distinct areas:

- The abstract data model, the kinds of metadata values and structures.
- The specific data model used by standard properties.
- The serialization syntax.
- The rules for embedding in files.

The abstract data model is the most important part. It defines the kind of metadata values and concepts that can be represented. The data model used by standard properties is almost as important. Common modeling of standard properties is important for reliable data interchange.

The specific serialization syntax is not as important. As long as the mapping to the data model is well defined, it is reasonably easy to convert between different ways to write the metadata. Of course there are benefits and costs to any specific serialization. What I mean here is that the underlying formal data model defines what concepts can be expressed. How the data model is serialized in XML is not as important as the data model itself.

The file embedding rules are by far the least important here. It is important that metadata is embedded consistently for each file format, but these rules are specific to the format and not much related to the other areas.

The following subsections discuss aspects of the abstract data model.

-------------------------------------
The basic XMP data model

I've taken to describing the XMP data model as "qualified data structures". The basis is traditional C-like data structures: simple values, structs containing named fields, and arrays containing indexed items. These are natural concepts, easily explained even to novices, and can be composed into rich and complex data structures. Ignoring surrounding context and issues about alternative equivalent forms of RDF, here are some simple examples:

   <ns:UniqueID>74A9C2F643DC11DABBE284332F708B21</ns:UniqueID>

   <ns:ImageSize rdf:parseType="Resource">
      <ns:Height>900</ns:Height>
      <ns:Width>1600</ns:Width>
   </ns:ImageSize>

   <dc:subject>
      <rdf:Bag>
         <rdf:li>XMP</rdf:li>
         <rdf:li>example</rdf:li>
      </rdf:Bag>
   </dc:subject>

One of the main advantages of serializing XMP as RDF is that these aspects of the data model become self-evident.
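To make the client-side view of these structures concrete, here is a rough C++ sketch of how a client might create the same three values through a toolkit-style API. The names (SXMPMeta, SetProperty, SetStructField, AppendArrayItem, kXMP_NS_DC, kXMP_PropValueIsArray) follow my recollection of the Adobe XMP toolkit; treat the exact signatures and option flags as assumptions rather than verified SDK usage:

   // Hedged sketch: API names follow my memory of the Adobe XMP toolkit
   // (XMPCore); they are assumptions here, not verified against the SDK.
   #define TXMP_STRING_TYPE std::string
   #include <string>
   #include "XMP.hpp"
   #include "XMP.incl_cpp"

   static void BuildExamples ( SXMPMeta & meta, const char * nsURI )
   {
       // (A real program would call SXMPMeta::Initialize() first.)

       // Simple value
       meta.SetProperty ( nsURI, "UniqueID", "74A9C2F643DC11DABBE284332F708B21", 0 );

       // Struct with two fields (ns:ImageSize/ns:Height, ns:ImageSize/ns:Width)
       meta.SetStructField ( nsURI, "ImageSize", nsURI, "Height", "900", 0 );
       meta.SetStructField ( nsURI, "ImageSize", nsURI, "Width", "1600", 0 );

       // Unordered array, serialized as an rdf:Bag
       meta.AppendArrayItem ( kXMP_NS_DC, "subject", kXMP_PropValueIsArray, "XMP", 0 );
       meta.AppendArrayItem ( kXMP_NS_DC, "subject", kXMP_PropValueIsArray, "example", 0 );
   }
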
The core XMP toolkit knows that something is simple, or is a struct, or is an array directly from the serialized RDF; no additional schema knowledge is necessary. This allows new metadata to be freely and easily created by customers. Files can be shared without having to carry along schema descriptions. Similarly, client applications can freely and easily create new metadata without creating formal schemas or requiring change in the core XMP toolkit. The client applications and users understand their metadata; it is not necessary for the core toolkit to do so. Granted, formal schemas are necessary for automated checking, which is a good thing. The point here is that a lot of effective work and sharing can be done without burdening everyone with the overhead of creating formal schemas.

The notion of arrays in XMP seems to be often misunderstood, causing controversy in the use of RDF Bag, Seq, or Alt containers. One point is that within XMP these are just used to denote traditional arrays. The broader aspects of RDF containers are not part of the XMP data model. For XMP the difference between Bag, Seq, and Alt is simply a sideband hint that the items in the array are an unordered collection, an ordered collection, or a weakly ordered list of alternatives.

A common question is why use arrays at all instead of repeated properties like:

   <dc:subject>XMP</dc:subject>
   <dc:subject>example</dc:subject>

The basic answer is the point about a self-evident data model in the RDF serialization. What if a given file only contained 1 dc:subject element? Is dc:subject a simple property or an array? Most humans have a very specific notion about whether a property is supposed to be unique (simple), or might have multiple values (an array). Using explicit array notation in the serialization makes this clear. Which in turn makes it clear in the XMP toolkit API, and in how client applications use that API. Client application code becomes more complex and UI design more difficult if everything is potentially an array.

------------------------------
XML markup in values

A small aside: The XMP data model does allow XML markup in values, but this is serialized with escaping. This is easier and more efficient to parse than use of rdf:parseType="Literal". The main difference is that with escaping the markup is not visible in the DOM of a generic XML parse. Having that visibility does not seem like a crucial feature. Having the markup be visible will also complicate formal schemas. For example, a call like:

   xmp.SetProperty ( "Prop", "<elem>text</elem>" );

will get serialized as:

   <Prop>&lt;elem&gt;text&lt;/elem&gt;</Prop>

------------------------
Qualifiers in XMP

Qualifiers in XMP are from RDF; they are not part of traditional programming data structures. In the XMP data model qualifiers can be viewed as properties of properties. The XMP data model is fully general and recursive. Qualifiers seem to be easily understood by users, fit easily into the core toolkit API, and provide a significant mechanism for growth and evolution. They do this by allowing later addition of information in a self evident and well structured way, without breaking clients using an earlier and simpler view.

For an example I'll first use an XMP data model display instead of RDF. Let's accept the notion of the XMP use of dc:creator as an ordered array of names. This works for the vast majority of needs:

   dc:creator (orderedArray)
      [1] = "Bruce D'Arcus"

Suppose we now want to add some annotation for Bruce's blog.
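In client code, attaching such an annotation as a qualifier might look roughly like the following. This is again a hedged sketch in the style of the Adobe XMP toolkit; ComposeArrayItemPath, SetQualifier, and the kXMP_PropValueIsURI flag are from memory and should be treated as assumptions:

   // Hedged sketch: attach ns:blog as a qualifier of dc:creator[1].
   #define TXMP_STRING_TYPE std::string
   #include <string>
   #include "XMP.hpp"

   static void AddBlogQualifier ( SXMPMeta & meta, const char * nsURI )
   {
       // Assumes dc:creator[1] already exists, e.g. set with AppendArrayItem.
       // Compose the path "dc:creator[1]" for the first array item.
       std::string itemPath;
       SXMPUtils::ComposeArrayItemPath ( kXMP_NS_DC, "creator", 1, &itemPath );

       // Attach the blog address as a qualifier of that item. Flagging the
       // value as a URI is an assumption about the appropriate option bit.
       meta.SetQualifier ( kXMP_NS_DC, itemPath.c_str(), nsURI, "blog",
                           "http://netapps.muohio.edu/blogs/darcusb/darcusb/",
                           kXMP_PropValueIsURI );
   }
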
By adding this as a qualifier, older clients still work just fine. In fact they could even have been written to anticipate qualifiers and display them when found:

   dc:creator (isOrderedArray)
      [1] = "Bruce D'Arcus" (hasQualifiers)
         ns:blog = "http://netapps.muohio.edu/blogs/darcusb/darcusb/" (isQualifier isURI)

The RDF serialization of XMP uses the rdf:value notation for qualifiers. This is unfortunately a bit ugly and complicates formal schemas since it makes the qualified element look like a struct. The presence of the rdf:value "field" is what says this is not really a struct. The original unqualified array item:

   <rdf:li>Bruce D'Arcus</rdf:li>

Adding the qualifier:

   <rdf:li rdf:parseType="Resource">
      <rdf:value>Bruce D'Arcus</rdf:value>
      <ns:blog rdf:resource="http://netapps.muohio.edu/blogs/darcusb/darcusb/"/>
   </rdf:li>

--------------------------
References in XMP

One aspect of programming, and many other, data models that is not a first class part of XMP is a notion of reference. By this I mean that the XMP specification does not define references, and the Adobe XMP toolkit does not contain specific API or logic for dealing with references. References can be defined and used within XMP by clients; they just are not a fundamental part of the data model.

A reference is some form of address along with a means to find what is at that address. Having the address without being able to go there isn't of much use. The lack of a formal notion of reference does not at all say that references cannot be represented or used within XMP. Specific kinds of references can easily be used. The onus is on the users of those references to define their semantics and representation.

In the qualifier example, the use of the rdf:resource notation does not constitute a formal reference. That is just sideband information that this particular simple value happens to be a URI. The XMP specification does not require any specific action for this. The Adobe XMP toolkit does not attempt to follow the URI, nor does it allow rdf:resource to be used as a general inclusion or redirection mechanism. All that said, the example qualifier is an informal form of reference in the sense of an address that can be understood and utilized by client software. A generic UI can even display it with a nice OpenWebPage button.

By avoiding a formal notion of reference XMP avoids being over constrained by picking a particular notion of address, or being overly complex in order to support a totally generalized notion of address.

An important distinction between actual XMP usage and typical RDF examples is that XMP operates primarily in a file system world while RDF examples are almost always Internet oriented. This is an important distinction with significant practical aspects. Suppose a reference is stored as a file URL. What happens to that reference as the file is copied around a network, or emailed, or moved into and out of an asset management system? What are the privacy issues related to putting file URLs in metadata without the user's conscious knowledge?

There are other aspects of references that URIs typically lose, at any rate URIs in the form of typical readable URLs. Like machine addresses, a URL references the current content at some location, i.e. it is all about the location regardless of content. It is incapable of being used for wider search, it breaks if the content moves, it cannot detect changes to the content. A typical URL is not persistent, it can't identify the content through time and space.
Nor is it specific, it can't detect differences between altered forms of the content. Yes, general URIs can contain arbitrary knowledge, but that knowledge isn't of much use without an agent to perform lookup. Consider a number of forms of reference to a book: title, which edition, which printing, ISBN number, Dewey Decimal number. Which of these is useful depends on local context. XMP leaves the definition and processing of references to clients. They are the ones with specific knowledge of local context and workflow.

As a more concrete example, consider how compound documents are typically created and published by InDesign. This isn't specifically about InDesign's use of XMP, but does illustrate the changing nature of a reference. During the creation process images and illustrations are usually placed into the layout by file reference. This lets the separate image file be updated by a graphic artist while an editorial person is working on the layout. When published to PDF, the images are physically incorporated. The file reference is no longer needed, and often not even wanted because of privacy concerns - the PDF file might be sent to places that have no business knowing what source files were used. However, XMP from the images can be embedded in the PDF and attached to the relevant image objects.

---------------------------------------------------
Interaction with metadata repositories

This section has looked inward at the XMP data model. There has been no mention of RDF triples or broader RDF implications. This is intentional. In terms of providing immediate customer benefit, the first order value of XMP is getting the metadata into files, viewing/editing it within applications like Photoshop, and viewing/editing/searching it with applications like Bridge. By focusing inwards a number of simplifications can be made that make the metadata more approachable, and that make implementations more robust and less expensive.

That said, there is real value in being able to have interactions between XMP and other metadata systems. What this means for a given metadata repository depends on the internal data model of the repository, how that relates to the XMP data model, and the directions that metadata is moved between XMP and the repository. Information is potentially lost when moved from a less to a more constrained data model. Since XMP can be serialized using a subset of RDF, XMP can be ingested fairly easily into a general RDF store. It should be reasonably easy to transform the XMP if the particular usage of RDF by XMP is not what is preferred.

--------------------------
Latitude for change

I've seen well intentioned suggestions like: "Enhance XMP to fit with current RDF and XML best practices." People need to be very realistic about the feasibility of various kinds of change. XMP is a shipping technology, with hundreds of thousands if not millions of copies of applications using it. This includes 3 major generations of Adobe products. Backward compatibility is a major concern.

Global or implicit changes that would cause XMP to fail in existing file formats and applications are unlikely to happen. There would have to be some very compelling reason. Suppose a future version of XMP in Photoshop started writing dc:subject as repeated elements instead of an explicit array (rdf:Bag). The XMP in new files would not be accepted by any existing Adobe software, and probably not by any existing 3rd party software supporting XMP.

Global or implicit changes restricted to new file formats have a better chance of success. Suppose OpenDocument files were "NewXMP", using repeated elements and schema knowledge. No existing software specifically looks for XMP in OpenDocument files, so the exposure is less than the previous example. But there is software, especially 3rd party asset management systems, that uses byte-oriented packet scanning to find XMP in arbitrary files. That software will not handle these new files.

Changes to XMP that are restricted to actual metadata usage, or otherwise under conscious user control, have a much better chance of being accepted. One example might be a user preference for a custom RDF serialization that is more amenable as input to a general RDF store.

------------------------
Plain XMP Syntax

I want to also give the OpenDocument TC a heads-up about something called Plain XMP. We will be posting a paper about this for review and discussion to the Adobe XMP web site in the near future. I want to emphasize that Adobe has made no decisions about this, we are simply looking for community review and feedback.

Plain XMP is being presented as a possible alternative serialization for the XMP data model, one that happens to be describable using XML Schema. The full XMP data model is represented; you can move back and forth between the RDF form of XMP and Plain XMP without loss. This does not signal any intent by Adobe to abandon RDF. This is purely an attempt to satisfy conflicting customer desires.

Since XMP first shipped with Acrobat 5, Adobe has gotten feedback from a number of customers or potential adopters of XMP that they don't like RDF. Why they don't like RDF isn't really an issue here. The Customer Is Always Right. There seem to be 3 common "complaints" (pardon the term) - general FUD about RDF, a dislike of the RDF XML syntax, and a desire to use "standard XML tools". This last generally means using W3C XML Schema. Granted, RELAX NG is a vastly superior schema language. The conflict between RDF and XML Schema can be viewed as the fault of shortcomings in XML Schema. Again that isn't the point. The Customer Is Always Right.

A reasonable usage model for Plain XMP might be to put the RDF form of XMP in current file types by default, and maybe let users choose Plain XMP. New file types could go either way, realizing that existing packet scanners won't recognize Plain XMP. Future XMP toolkits would accept either. Client software could ask for a serialization in either form. Plain XMP might also be more amenable to XSLT transformation than RDF, especially when qualifiers are used. This could make it useful for getting XMP into or out of metadata repositories.

Here are the previous examples serialized as Plain XMP, again ignoring surrounding context. Yes, there is going to be controversy about the use of an attribute versus character data for values. This will be explained in the Plain XMP proposal. In essence, this avoids some XML Schema problems. Arguably, XML used for data is distinctly different from XML used for "traditional markup". The latter requires character data, the former does not.

   <ns:UniqueID value="74A9C2F643DC11DABBE284332F708B21"/>

   <ns:ImageSize kind="struct">
      <ns:Height value="900"/>
      <ns:Width value="1600"/>
   </ns:ImageSize>

   <dc:subject kind="bag">
      <item value="XMP"/>
      <item value="example"/>
   </dc:subject>

   <dc:creator kind="seq">
      <!-- This form drops the isURI tagging of ns:blog. -->
      <item value="Bruce D'Arcus" ns:blog="http://netapps.muohio.edu/blogs/darcusb/darcusb/"/>
   </dc:creator>

   <dc:creator kind="seq">
      <!-- This keeps the isURI tagging of ns:blog. -->
      <item value="Bruce D'Arcus">
         <ns:blog value="http://netapps.muohio.edu/blogs/darcusb/darcusb/" rdf:resource=""/>
      </item>
   </dc:creator>

======================================================================