oiic-formation-discuss message

Subject: Canonical mappings.. let's start!

From: jose lorenzo <hozelda@yahoo.com>
To: oiic-formation-discuss@lists.oasis-open.org
Date: Wed, 25 Jun 2008 10:47:52 -0700 (PDT)

Before proceeding with the main discussion of this email, I [or insert your name here] want to state up front that I am willing to post to this mailing list examples of anything I get done (if anything) that might ordinarily be considered examples of the sort of work the TC or third parties would undertake. In doing this, I would be helping to advance the discussion and/or sell a point, and also the contribution would serve as research-in-advance to save the future TC some time while having my bit of influence.

This doesn't mean that I reject questions asking about the relationship to the TC charter. That is always the most relevant question to ask, and I may in fact not address it appropriately until asked, if asked.

*****

Mappings/algorithms that define a canonical mapping are very important. For almost any task, you (or the tool) benefit (saving time, etc.) whenever you can deal with a single unique member that represents a family of "equal" members. The main example given so far was to aid in testing. That is a great example, but certainly not the only one. All interactions across tools can become more sophisticated, more precise, and simpler to code if we can define how the relevant canonical form is reached.
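
To make the testing case concrete, here is a minimal sketch in Python. The two normalization rules (ignore surrounding whitespace in text, ignore attribute order) and the function name canonical_form are assumptions chosen for illustration, not anything ODF defines. The point is only that once every "equal" document maps to one representative, an equality test becomes a plain string comparison.

```python
import xml.etree.ElementTree as ET


def canonical_form(xml_text: str) -> str:
    """Map equivalent serializations to one representative string (toy rules)."""

    def normalize(elem):
        # Assumed rule 1: leading/trailing whitespace in text is insignificant.
        elem.text = (elem.text or "").strip() or None
        elem.tail = (elem.tail or "").strip() or None
        # Assumed rule 2: attribute order is insignificant, so sort by name.
        items = sorted(elem.attrib.items())
        elem.attrib.clear()
        elem.attrib.update(items)
        for child in elem:
            normalize(child)

    root = ET.fromstring(xml_text)
    normalize(root)
    return ET.tostring(root, encoding="unicode")


# Two members of the same "family": different layout and attribute order.
doc_a = '<doc b="2" a="1">\n  <p>hi</p>\n</doc>'
doc_b = '<doc a="1" b="2"><p>hi</p></doc>'
same = canonical_form(doc_a) == canonical_form(doc_b)
```

A test harness built this way never needs to know which variations are allowed; that knowledge lives entirely in the mapping.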

Any XML language could benefit from having canonical mappings (to suit the occasion, profile, etc). The standard XML mappings are most certainly not enough for many XML language+scenarios. ODF should standardize some or many such canonical mappings and/or the extension method for creating your own ODF canonical mapping. There may (?) be a lot of details to consider and to specify in order to allow such maps to be defined in interoperable ways. Additionally, having pre-defined maps with known semantics and which are correct/legal/consistent makes it easy for more tools to play the game since the tools can be smaller and easier to make, not requiring the ability to verify that the standard maps are consistent or that these make sense. Also, the tools wouldn't have to be capable of deriving any other important bit of information that a custom map would be missing as compared to a standardized map, whose properties would be well-known. Also, a standardized map might contain details that would not be describable in custom maps (based on state of the art).

Of all the ways we might want to define these canonicalization algorithms, some might be much more useful than others. One approach would be to define as much of the mappings as possible through a language of equations where one side describes many instances (from portions of a document) compactly while the other side describes the single instance that represents them. [We may or may not want to allow the single instance to lie outside of the set being mapped.] Additionally, the "algorithm" would describe any further constraints and semantics (if any) that would not otherwise be expressible in the map pattern rules language. Three candidates come to mind to define the mapping patterns: EBNF, XPath, and whatever Schematron uses.
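
As a rough sketch of the "equations" idea, assuming XPath is the pattern language: each rule below pairs a pattern (the side that describes many instances) with the single tag name chosen to represent them. The rule list, the tag names, and the function apply_rules are all made up for illustration, and Python's xml.etree.ElementTree supports only a limited subset of XPath, so this shows the shape of the approach rather than a real rules language.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule set for a made-up document language.
# Left: XPath pattern matching many instances. Right: the one representative.
RULES = [
    (".//b", "strong"),
    (".//bold", "strong"),
    (".//i", "em"),
]


def apply_rules(xml_text, rules=RULES):
    """Rewrite every element matched by a pattern to its representative tag."""
    root = ET.fromstring(xml_text)
    for pattern, representative in rules:
        # findall returns a list, so renaming while looping is safe.
        for elem in root.findall(pattern):
            elem.tag = representative
    return ET.tostring(root, encoding="unicode")


canonical = apply_rules("<p><b>x</b><bold>y</bold><i>z</i></p>")
```

In this toy rule set the representative ("strong") lies outside the set being mapped ("b", "bold"), which is one of the two options mentioned above; adding a rule such as (".//strong", "b") instead would pick a representative from inside the set.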

From my limited experience, I would say right now that XPath would be the preferred candidate, though I don't remember if I read that Schematron also uses plain old XPath. If Schematron uses something else, that is likely better, though XPath is probably better supported (current support may end up not being that important a consideration).

There are many things to consider. Does anyone want to take a crack at constructing a map for a very simple made up document language + instance? Feel free to explore as many new definitions and notations as necessary. Also, you may want to use your own "pseudo code" notations to demonstrate a concept or issue.

More importantly, does anyone have information on research or work that currently exists along these lines (eg, does anyone have experience with Schematron)?

I would like a definition that is close to XML if that would make it easier for tools to generate such mappings easily and interpret custom mappings easily. Users being able to define (generally, with GUI help and auto generation) such custom canonical forms (in whole or) in part would be very useful to many use cases. And if the rules can be based on inheritance (ie, "take a base ruleset and apply the following 'diff' ") and, further, be easily represented in a short and readable text file... well, it sounds like this would be great.
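
The inheritance idea ("take a base ruleset and apply the following 'diff'") can be sketched in a few lines of Python. Everything here is an assumption for illustration: rulesets are modelled as plain dictionaries from pattern to representative, and a value of None in the diff is an invented convention meaning "drop this rule".

```python
def derive_ruleset(base, diff):
    """Return a new ruleset: a copy of base with the diff applied.

    Assumed convention: a diff value of None removes the rule for that
    pattern; any other value adds or overrides it.
    """
    derived = dict(base)  # copy, so the base ruleset stays untouched
    for pattern, representative in diff.items():
        if representative is None:
            derived.pop(pattern, None)
        else:
            derived[pattern] = representative
    return derived


# Hypothetical base ruleset and a small user-defined diff on top of it.
base_rules = {".//b": "strong", ".//i": "em"}
user_diff = {".//i": None, ".//bold": "strong"}
custom_rules = derive_ruleset(base_rules, user_diff)
```

Because the diff only lists what changes, a custom canonical form stays short and readable even when the base ruleset is large, and a tool that already trusts the base only has to check the diff.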

Starting this work of exploring mapping construction and details now (to be conducted on a dev mailing list or website perhaps) is a good idea since there is no better time than the present, and all insight will help the charter creation work to some extent and will potentially really help the future TC. Also, getting some of this worked on now (research, including impl attempts) may help attract more techy people to the TC or at least to contribute within the 90 day period. Whatever we end up doing will likely only be in rough draft and incomplete form, so resources spent will be modest.. and the pressure would not be on.

[This last paragraph was intended to reinforce the ones at the top.]