office message

Subject: A few of specific examples
From: robert_weir@us.ibm.com
To: office@lists.oasis-open.org
Date: Tue, 3 Feb 2009 19:50:00 -0500
Some specific examples of how and why arbitrary proprietary extensions are 
evil.

Two common concerns among users are privacy and security.  The 
issue of personally-identifying metadata is increasingly in the news. 
Some products, like Microsoft Office, have a built-in operation that will 
remove such information from a Word document.  There are also third-party 
applications that will strip such metadata from a document.

So, suppose you want to write such an operation for an ODF document.  What 
do you do?  Simple enough: look to meta.xml, scrub extension elements under 
<office:meta>, etc.  The places where metadata is stored are deterministic. 
The standard is clear about where they are.  But allow arbitrary extensions 
everywhere, and you have no idea where the metadata is.  Writing a generic 
tool like this becomes far more difficult.  You can't tell 
whether an extension contains metadata, content, processing instructions, 
executable code, or whatever.
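
A minimal sketch of such a scrubber, in Python, assuming the meta.xml 
member has already been read out of the ODF package as a string.  The 
function name scrub_meta is hypothetical, and removing every child of 
<office:meta> is a deliberately blunt policy chosen for illustration, not 
something the standard prescribes.

```python
import xml.etree.ElementTree as ET

OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"

# Keep the conventional "office" prefix when the tree is serialized again.
ET.register_namespace("office", OFFICE_NS)


def scrub_meta(meta_xml):
    """Return meta.xml with every child of <office:meta> removed.

    Hypothetical helper: it only knows about metadata stored where the
    standard says it is stored, which is the point of the paragraph above.
    """
    root = ET.fromstring(meta_xml)
    # Collect the targets first so the tree is not modified while iterating.
    for meta in list(root.iter("{%s}meta" % OFFICE_NS)):
        for child in list(meta):
            meta.remove(child)
    return ET.tostring(root, encoding="unicode")
```

A tool like this cannot reach metadata hidden in an extension element 
elsewhere in the package, because nothing tells it that the element holds 
metadata.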

Similarly, there is the need to scan a document for viruses or malicious 
macros.  Remember all the Word viruses from a few years ago?  The risk is 
still there.  Antivirus vendors have been somewhat successful in 
addressing such risks with mail gateway filters which act in part by 
examining file attachments and scanning them for risky content.  As a 
policy, some companies will not allow any external document with a macro 
through their firewall.  So how would you do this for an ODF document? 
Well, ODF says scripts go into the <office:script> element.  So the simple 
solution is to scan for that element and if it exists, to flag the 
document as a higher risk.  But with arbitrary proprietary extensions, how 
do we know that they don't contain executable content?  How does the virus 
scanner handle arbitrary elements, which may contain metadata, content, 
processing instructions, scripts or anything?  The easiest solution would be 
to ban documents that contain extensions.  Is that what we want?
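
Both checks can be sketched in a few lines of Python.  This is only an 
illustration: has_scripts and foreign_elements are hypothetical names, the 
XML is assumed to be already extracted from the package, and the set of 
"known" namespaces is supplied by the caller rather than taken from the 
schema.

```python
import xml.etree.ElementTree as ET

OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"


def has_scripts(xml_text):
    """True if the document contains at least one <office:script> element."""
    root = ET.fromstring(xml_text)
    return root.find(".//{%s}script" % OFFICE_NS) is not None


def foreign_elements(xml_text, known_namespaces):
    """List the tags of elements whose namespace is not in known_namespaces.

    A gateway that cannot interpret extensions could reject any document
    for which this list is non-empty.
    """
    found = []
    for el in ET.fromstring(xml_text).iter():
        # ElementTree spells qualified names as "{namespace}local".
        ns = el.tag[1:].split("}")[0] if el.tag.startswith("{") else ""
        if ns not in known_namespaces:
            found.append(el.tag)
    return found
```

The first function is the simple, reliable check the standard makes 
possible.  The second can only report that something unknown is present; 
it cannot say whether that something is executable.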

Similarly, a search engine will want to find all text in a document for 
indexing.  Reading the ODF specification, it is clear what is content and 
what is not, so a proper indexer can be written.  But with arbitrary 
proprietary extensions, this task is impossible.  I would not know whether 
the extension elements should be indexed or not.
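
As a rough illustration of the indexing case, the following Python sketch 
collects the text of paragraphs and headings from content.xml.  The name 
extract_text is hypothetical, and the sketch is simplified: it ignores 
whitespace elements such as <text:s> and <text:tab>, and a paragraph 
nested inside another paragraph would be reported twice.

```python
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
BLOCK_TAGS = ("{%s}p" % TEXT_NS, "{%s}h" % TEXT_NS)


def extract_text(content_xml):
    """Return the text of each <text:p> and <text:h>, in document order."""
    root = ET.fromstring(content_xml)
    blocks = []
    for el in root.iter():
        if el.tag in BLOCK_TAGS:
            # itertext() joins the element's own text with that of its
            # descendants, e.g. nested <text:span> runs.
            blocks.append("".join(el.itertext()))
    return blocks
```

Anything stored in an element the indexer does not recognize is simply 
not seen by a walk like this.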

Also, consider a program that translates a document from one language to 
another, preserving all formatting and styles.  Reading the ODF spec, I can 
easily determine which elements are content and which are not and then run machine 
translation on just the content.  But with arbitrary proprietary 
extensions, I have no idea.  I risk doing a partial translation, if the 
extension elements represent user-visible content.
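
A sketch of the translation case in Python, under stated assumptions: 
translate_content is a hypothetical name, the translate argument stands in 
for whatever machine translation service is used, and text is translated 
one run at a time, which a real tool would likely avoid for quality 
reasons.  Only text inside <text:p> and <text:h> is touched; attributes 
such as style names are left alone.

```python
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
BLOCK_TAGS = ("{%s}p" % TEXT_NS, "{%s}h" % TEXT_NS)

# Keep the conventional "text" prefix when the tree is serialized again.
ET.register_namespace("text", TEXT_NS)


def _walk(el, inside, translate):
    # "inside" is True once we are within a paragraph or heading.
    inside = inside or el.tag in BLOCK_TAGS
    if inside and el.text:
        el.text = translate(el.text)
    for child in el:
        _walk(child, inside, translate)
        # A child's tail is text that belongs to the current element.
        if inside and child.tail:
            child.tail = translate(child.tail)


def translate_content(content_xml, translate):
    """Apply translate() to text runs in paragraphs and headings only."""
    root = ET.fromstring(content_xml)
    _walk(root, False, translate)
    return ET.tostring(root, encoding="unicode")
```

For example, translate_content(xml, str.upper) upper-cases the visible 
text and leaves the markup as it was.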

There is also the question of document referential integrity.  Suppose I 
want to write a program that takes a large ODF document and splits it up 
into chapters, one ODF document per chapter.  According to the ODF 
standard this is easy.  I can trace the style dependencies and duplicate 
what is needed and make several documents from a single ODF document. 
Similarly, I could take multiple ODF documents and combine them into a 
single document, merging the styles as needed.  But in the presence of 
arbitrary proprietary extensions I cannot do either of these operations 
safely, since I do not understand the semantics of these extensions. 
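
Tracing the style dependencies for one chapter might look like the Python 
sketch below.  It is a simplification resting on several assumptions: 
style_closure is a hypothetical name, style families are ignored, only 
attributes whose local name is style-name are treated as references, and 
only style:parent-style-name is followed from one style to another.

```python
import xml.etree.ElementTree as ET

STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"


def style_closure(chapter_xml, styles_xml):
    """Return the names of all styles a chapter needs, including parents."""
    # Styles referenced directly, e.g. via text:style-name or
    # table:style-name attributes.
    used = set()
    for el in ET.fromstring(chapter_xml).iter():
        for key, value in el.attrib.items():
            if key.endswith("}style-name"):
                used.add(value)

    # Map each style to its parent style (None if it has no parent).
    parents = {}
    for style in ET.fromstring(styles_xml).iter("{%s}style" % STYLE_NS):
        name = style.get("{%s}name" % STYLE_NS)
        parents[name] = style.get("{%s}parent-style-name" % STYLE_NS)

    # Follow parent links until no new styles turn up.
    pending = list(used)
    while pending:
        parent = parents.get(pending.pop())
        if parent and parent not in used:
            used.add(parent)
            pending.append(parent)
    return used
```

The walk works because the standard says which attributes are references. 
An extension element may carry references of its own, and a splitter has 
no way to know it should follow them.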

Now I can imagine a well-thought-out extensibility mechanism that would 
address the above concerns.  I'd certainly entertain any such proposals. 
But merely saying "The X in XML stands for eXtensibility" is not a 
considered engineering approach.  Extensibility requires that we think 
out issues such as versioning, content negotiation, fall-backs, 
namespacing, round-tripping, as well as offer clear guidelines for how 
extensions declare whether they contain translatable text, metadata, 
executable code, or other categories of importance.  The fail-safe 
approach is to remove this option until such time as we can do it right. 

If there is sufficient interest to work on this, we could create a new 
subcommittee on extensibility to work on developing a detailed proposal in 
this area, obviously for consideration post ODF 1.2.

-Rob
Follow-Ups:
- RE: [office] A few of specific examples
  - From: "Dennis E. Hamilton" <dennis.hamilton@acm.org>