office-comment message

Subject: Public Comment

From: comment-form@oasis-open.org
To: office-comment@lists.oasis-open.org
Date: 28 Apr 2006 18:19:11 -0000

Comment from: bnordgren@fs.fed.us

Name: Bryce Nordgren
Title: Physical Scientist
Organization: USDA Forest Service
Regarding Specification: ODF

I am completing an initial evaluation of the ODF specification with an eye to creating a Java library to manipulate a small subset of ODF files.

The most alarming discovery during this investigation is the wanton mixing of data model and data encoding. I am coming from the standpoint of the ISO 19100 series of standards, where the data model and encoding are so well distinguished that these two completely different aspects are generally specified in separate standards.

I am most keenly interested in seeing a presentation of the data model which ODF encodes. It is the data model which imposes constraints on content, not the encoding. Furthermore, no schema language is expressive enough to fully capture the content constraints (see the postscript to this comment for an example). Presenting the data model separately enables a more concise, intuitive (data models are generally pictorial, e.g. UML), and complete communication of the concepts and their relations. A data model allows humans to learn the abstractions before learning how these abstractions map onto a specific language: in this case, Relax-NG.

This is especially important to address your own concerns about "Full validation" (http://www.oasis-open.org/archives/office/200603/msg00078.html). Relax-NG is not capable of expressing the complete constraint set contained in the standard, for the excellent reason that it is not a programming language. Some constraints imposed by the text of the standard must always be checked (or implemented) outside of XML.

The data model is not an API, but it is a language-neutral base from which an API may be constructed in specific languages. These languages may be Java, C++, Python, XML Schema, Relax-NG, or tomorrow's favorite. Expressing the model independently of the encoding specification provides an external reference to which the normative Relax-NG schema defers. It also facilitates transition between encoding technologies (e.g. future migration from Relax-NG->whatever). But most importantly, and most immediately, a separate expression of the model aids software implementors in constructing, accepting, and manipulating valid documents in-core which are compatible with ODF encoding constraints: the software API and the file format derive from the same data model. You are certainly not responsible for defining someone else's API for them. However, failing to explicitly define your own data model is a shortcoming which will slow your format's acceptance. Everyone must re-invent your wheel.

If someone is willing to receive it, I do have an example UML model for some of the basic table and spreadsheet data items which I pulled together out of the spec. It's available either as a Poseidon model (there's a free version) or as a series of images. Just let me know where to send it.

Bryce Nordgren
USDA Forest Service

Postscript--example of Relax-NG shortcoming in table data structure:

The encoding specification (lines 3493-3511; p. 177 of the standard) expresses the idea that two "table-table-columns" pseudotypes may not appear consecutively in the file. Lines 3773-3787 (p. 188) demonstrate that there is no difference between the "table-table-header-columns" and "table-table-columns" pseudotypes, aside from the name of the containing element.

The implication is that non-header table columns should be maximally aggregated. However, the same constraint is not placed on the header columns. You can have as many consecutive "table-table-header-columns" pseudotypes as you wish, translating to as many <table:table-header-columns> elements as desired. The schema also permits ping-ponging back and forth between header column groups and non-header column groups. These are the only constraints imposed by the normative encoding scheme.
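As a rough illustration of that ping-ponging, the Python sketch below flattens a small table fragment into one header flag per column. The fragment is hypothetical and written only for this example; that the schema would accept it is an assumption based on the reading of the schema given above, not something verified against the normative Relax-NG. The helper names (column_flags, qname) are likewise made up, and the table:number-columns-repeated attribute is ignored for brevity.

```python
import xml.etree.ElementTree as ET

# Namespace URI of the ODF "table" prefix.
TABLE_NS = "urn:oasis:names:tc:opendocument:xmlns:table:1.0"

# Hypothetical fragment: header group, non-header group, header group again.
FRAGMENT = f"""
<table:table xmlns:table="{TABLE_NS}">
  <table:table-header-columns><table:table-column/></table:table-header-columns>
  <table:table-columns><table:table-column/></table:table-columns>
  <table:table-header-columns><table:table-column/></table:table-header-columns>
</table:table>
"""


def qname(local_name):
    """Build the ElementTree-style qualified name for a table element."""
    return f"{{{TABLE_NS}}}{local_name}"


def column_flags(table_elem):
    """Return one boolean per column, in document order: True for header columns."""
    flags = []

    def walk(elem, in_header):
        for child in elem:
            if child.tag == qname("table-column"):
                flags.append(in_header)
            elif child.tag == qname("table-header-columns"):
                walk(child, True)
            elif child.tag in (qname("table-columns"), qname("table-column-group")):
                # Plain containers keep whatever header state they inherited.
                walk(child, in_header)

    walk(table_elem, False)
    return flags


flags = column_flags(ET.fromstring(FRAGMENT))
print(flags)
```

The flattened flags show two header columns separated by a non-header column, which is exactly the arrangement the encoding alone does not rule out.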

The model represented by this encoding, however, is briefly described by the text in section 8.2.2.

In summary (combo of text and schema):
1] Header columns must be adjacent;
2] Header columns are different from non-header columns only by the fact that they are "flagged";
3] Header columns may span column groups; and
4] There may be only one set of header columns in a table.

Note how this clearly defines the model even more succinctly than the text of 8.2.2, which is basically a plain-language description of how to further constrain the encoding such that it adheres to a model which is never directly presented. From this, readers can easily understand what is and is not permissible without first wrestling through the awkward and inadequate expression of these constraints in a particular schema language.
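
The four-point summary above can be sketched as a tiny, encoding-independent model. This is only an illustration of the idea, not a proposed API: the Column class, its fields, and the header_columns_valid function are hypothetical names invented for this example, and the group labels are arbitrary strings.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Column:
    """A table column; a header column differs only by its flag (point 2)."""
    header: bool = False
    group: Optional[str] = None  # hypothetical label of the enclosing column group


def header_columns_valid(columns: List[Column]) -> bool:
    """Check points 1 and 4: header columns form at most one adjacent run.

    Group membership is deliberately not consulted, so a header run may
    span column groups (point 3).
    """
    runs = 0
    previous_was_header = False
    for column in columns:
        if column.header and not previous_was_header:
            runs += 1  # a new run of header columns starts here
        previous_was_header = column.header
    return runs <= 1


# One adjacent header run that crosses a group boundary: allowed.
spanning = [
    Column(header=True, group="a"),
    Column(header=True, group="b"),
    Column(group="b"),
]

# Two header runs separated by a non-header column: not allowed.
split = [Column(header=True), Column(), Column(header=True)]

print(header_columns_valid(spanning), header_columns_valid(split))
```

A check of this kind has to live outside the schema, which is the point of the comment: the same model can back both the in-core API and the file encoding.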