oiic-formation-discuss message

Subject: RE: [oiic-formation-discuss] Welcome!


Rob,
 
For me this is the difference between a schema-based approach and tools like Examplotron and CAM.
 
Whereas the RNG / XSD provides a map of all possible instances that may ever occur, the CAM approach is about applying context, so that you have a clear definition, derived from it, of the specific interchange structure, rules and content envisioned for your particular situation.
 
This is akin to the difference between knowing the rules of chess and if a move is valid or not, as compared to having knowledge of specific chess openings and patterns that can allow you to actually win a game!
 
So what is it we are about here, in the context of documents?  Providing folks with the means to create interoperable exchanges, to share those patterns, and to test and verify them.
 
Obviously the CAM work is instructive here, but the domain of ODF is clearly one that remains unexplored in this regard.
 
Thanks, DW



-------- Original Message --------
Subject: Re: [oiic-formation-discuss] Welcome!
From: robert_weir@us.ibm.com
Date: Thu, June 05, 2008 1:53 pm
To: "Dave Pawson" <dave.pawson@gmail.com>
Cc: oiic-formation-discuss@lists.oasis-open.org


Hi Dave,

I don't want to take us too far down the implementation path on this list -- we're really supposed to be discussing the charter and terms of the proposed TC.  However, it is a fair point that we should be able to demonstrate feasibility, and to show that the proposed goals of the TC are reasonable and achievable.


>
> >> Perhaps NVDL would help there, certainly xproc will be a useful tool.
> >>
> >
> > To my knowledge, no ODF implementation today actually writes additional
> > content to an ODF document in a foreign namespace, although the standard
> > allows this.  But what I have seen is an application adding additional
> > attributes into an existing ODF namespace.  But this is a simple validity
> > error and is caught directly by any validating parser.  But Rick's
> > pre-validation NVDL is something we can add to our bag of tricks, in case it
> > ever comes up in the future.
>
> So a 'valid' file, with other namespaced content would be deemed
> invalid by a simple validating parser check?
> Are additional attributes (namespaced) also allowed?
>
>


It is a little more complicated than that.  For example, the math:math element in ODF 1.0 is defined to allow any content under it:

<define name="mathMarkup">
  <zeroOrMore>
    <choice>
      <attribute>
        <anyName/>
      </attribute>
      <text/>
      <element>
        <anyName/>
        <ref name="mathMarkup"/>
      </element>
    </choice>
  </zeroOrMore>
</define>

There are cleaner ways of doing this now, with NVDL, to describe compound documents, some defined with DTDs, some with XML Schema, some with Relax NG, etc., but that is what we had for ODF 1.0 back in 2005.  So it is quite possible for someone to place markup under math:math that is in a foreign namespace, and it would still validate.  But the text of the standard makes it clear that only MathML is allowed there, so we could use other logic to verify that this constraint is met.
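To give a concrete flavour of that "other logic" (this is just an illustrative sketch, not anything the TC has agreed on), a check for the math:math constraint might walk the element and flag anything outside the MathML namespace.  In Python with lxml it could look roughly like this; the function name and reporting format are made up:

from lxml import etree

MATHML_NS = "http://www.w3.org/1998/Math/MathML"

def check_math_content(content_xml_path):
    # Flag any element under math:math that is not in the MathML namespace.
    problems = []
    root = etree.parse(content_xml_path).getroot()
    for math_root in root.iter("{%s}math" % MATHML_NS):
        for el in math_root.iter():
            if not isinstance(el.tag, str):
                continue  # skip comments and processing instructions
            if etree.QName(el).namespace != MATHML_NS:
                problems.append("non-MathML element %s inside math:math" % el.tag)
    return problems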

There are a few other places where the RNG allows anything, but the text of the standard is more restrictive.

>
> >
> > When we talk about conformance with ODF, we're really talking about two
> > things, since the ODF standard defines document conformance as well as
> > application conformance.  The former is the easier one to test, and lends
> > itself to automation.
> >
> > A full check of ODF document conformance would need to do something like:
> >
> > 1) Verify the document file name extensions and/or MIME content type and
> > verify that it matches the contents of the underlying document.  An ODT file
> > containing a spreadsheet should be noted, for example.
> >
> > 2) Verify the correctness of the Zip container.  Is it actually following
> > the referenced Zip specification?
> >
> > 3) Verify the referential integrity of the package.  Does the manifest
> > reference files that don't exist, for example?  Are all the required parts
> > present?
>
> Generally a programming task?
> File name matching etc,
> What about the compression?
>
>


Checking the compression structures would be part of #2, I think.  I'd avoid any approach that simply uses PkZip or WinZip and passes anything that doesn't give an error.  We really need to verify the zip compression structures themselves.
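As a sketch of what checks 1 through 3 could look like as code (illustrative only; it leans on the Python standard library's zipfile module and only spot-checks a couple of required parts):

import zipfile

def check_package(path):
    problems = []
    with zipfile.ZipFile(path) as zf:
        # 2) Zip integrity: testzip() re-reads every member and verifies its CRC.
        bad = zf.testzip()
        if bad is not None:
            problems.append("corrupt zip member: %s" % bad)
        names = set(zf.namelist())
        # 1) MIME type: the 'mimetype' stream should match the file extension.
        if "mimetype" not in names:
            problems.append("missing mimetype stream")
        elif path.endswith(".odt") and \
                zf.read("mimetype") != b"application/vnd.oasis.opendocument.text":
            problems.append("extension does not match declared MIME type")
        # 3) Referential integrity: is the manifest there at all?
        if "META-INF/manifest.xml" not in names:
            problems.append("missing META-INF/manifest.xml")
    return problems

A fuller check would also parse META-INF/manifest.xml and compare its entries against the actual package contents, in both directions.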

> >
> > 4) Verify the Relax NG validity of each of the contained XML documents,
> > pre-processing as needed.
>
> How would you define pre-processing then?
> As needed seems a bit vague?
> 1. Remove all non ODF specified namespaced elements?
> 2. Remove all non ODF specified attributes?
>  (Or not, since there is a potential invalidity here?)
>  (what of namespaced attributes in non ODF namespaces?)
>
>


Yes and Yes.
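To illustrate the pre-processing step (again just a sketch; the namespace list is abbreviated and the function is not taken from any existing tool), one could strip foreign-namespace elements and attributes before handing the document to the Relax NG validator:

from lxml import etree

ODF_NAMESPACES = {
    "urn:oasis:names:tc:opendocument:xmlns:office:1.0",
    "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
    "urn:oasis:names:tc:opendocument:xmlns:table:1.0",
    # ... the remaining ODF namespaces would be listed here
}

def strip_foreign(tree):
    root = tree.getroot()
    # Collect first, then remove, so we do not mutate the tree while iterating.
    foreign = [el for el in root.iter()
               if isinstance(el.tag, str)
               and etree.QName(el).namespace not in ODF_NAMESPACES]
    for el in foreign:
        parent = el.getparent()
        if parent is not None:
            parent.remove(el)
    # Drop attributes in non-ODF namespaces; un-prefixed attributes are kept.
    for el in root.iter():
        for name in list(el.attrib):
            ns = etree.QName(name).namespace
            if ns is not None and ns not in ODF_NAMESPACES:
                del el.attrib[name]
    return tree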


> >
> > 5) Verify additional referential integrity constraints.  For example, the
> > content XML typically refers to named styles in the styles XML.  These
> > cross-document references need to be checked.
>
> Schematron sounds ideal for this.
>
>
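Schematron would indeed express check #5 declaratively.  Here is the same idea sketched procedurally, purely for illustration; the element and attribute names below are the usual ones, but a real check would cover automatic styles and the other reference types as well:

from lxml import etree

STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

def check_style_references(content_path, styles_path):
    content = etree.parse(content_path).getroot()
    styles = etree.parse(styles_path).getroot()
    # Styles may be defined in styles.xml or as automatic styles in content.xml.
    defined = {el.get("{%s}name" % STYLE_NS)
               for part in (styles, content)
               for el in part.iter("{%s}style" % STYLE_NS)}
    problems = []
    for el in content.iter():
        ref = el.get("{%s}style-name" % TEXT_NS)
        if ref is not None and ref not in defined:
            problems.append("unresolved style reference: %s" % ref)
    return problems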
> >
> > 6) Verify the various micro-formats contained in ODF.  There are some things
> > that are not easily expressable as a schema type, even using a regex.  For
> > example, spreadsheet formulas, with their hundreds of functions, some with
> > variable arguments, which could take cell ranges, named ranges, or constants
> > as parameters.  These are defined in the standard via EBNF.  A full
> > conformance test would take each of these attributes and verify that they
> > match the production rules defined by the EBNF.
>
> Has anyone created a full grammar for these?
> Is grammar based validation most appropriate?
> How to collect them for validation?
>


There are around 14 places in ODF that have some sort of micro-format.  These range from 3D transforms, to spreadsheet formulas, to SVG-like paths, etc.  In ODF 1.0 these are not all described in formal grammars.  But the intent for ODF 1.2 is that they will all have IETF-style EBNFs.

I don't know if there are more modern approaches to doing this, but when I was a student we would use lex/yacc to create scanners and parsers for each of these EBNFs, and call them from the appropriate spots.  Maybe there is a better way today?
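For what it's worth, here is the flavour of the thing as a toy hand-rolled recursive-descent checker (in Python) for a deliberately tiny formula-like grammar: numbers, cell references, and FUNC(arg;arg).  The real OpenFormula grammar is of course far larger; this only illustrates the idea of validating an attribute value against a grammar rather than a schema type:

import re

TOKEN = re.compile(r'\s*([A-Za-z][A-Za-z0-9.]*|\d+(?:\.\d+)?|[();])')

def check_formula(text):
    # Tokenize: names, numbers, and the punctuation '(', ')' and ';'.
    text = text.strip()
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            return False                      # unexpected character
        tokens.append(m.group(1))
        pos = m.end()

    i = 0
    def expr():
        nonlocal i
        if i >= len(tokens):
            return False
        tok = tokens[i]
        i += 1
        if re.fullmatch(r'\d+(?:\.\d+)?', tok) or re.fullmatch(r'[A-Z]+\d+', tok):
            return True                       # number or cell reference
        if re.fullmatch(r'[A-Za-z][A-Za-z0-9.]*', tok) \
                and i < len(tokens) and tokens[i] == '(':
            i += 1                            # function call: NAME ( expr (; expr)* )
            if not expr():
                return False
            while i < len(tokens) and tokens[i] == ';':
                i += 1
                if not expr():
                    return False
            if i < len(tokens) and tokens[i] == ')':
                i += 1
                return True
        return False

    return expr() and i == len(tokens)

# check_formula("SUM(A1;A2;3.5)") -> True; check_formula("SUM(A1;") -> False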

> Not an easy one by the sounds of it.
> More appropriately, how to formally define validity for these cases.
>

>
> >
> > 7) Other recommendations of the ODF standard, even where not conformance
> > requirements.  These should be checked, and warnings (not errors) emitted.
>
> Good. Second definition required, when warnings and when 'errors'
>
>  For example, we have a number of accessibility best practices that could be
> statically verified. Similarly, we can have portability warnings.  For
> example, a spreadsheet can have as many rows as it wishes, but for
> > portability we might recommend no more than 64K rows.
>
> Might? Shouldn't this spec be explicit? How to validate against 'might' :-)
> How to recognise these 'recommendations' in the spec?
>
>


By "might" I mean I'm too lazy to lookup whether we actually make that recommendation.  But I do know that David Wheeler has been putting similar portability recommendations into his OpenFormula drafts.    In any case, formal provisions of the standard will clearly state what is mandatory ("shall") as well as what is recommended ("should").  A reasonable mapping would be to consider violations of the former to be errors, and violations of the latter to be warnings.

> >
> > There are probably other pieces as well, but that's an outline of what we
> > could do for document conformance.  Ideally I'd like any such tool to be
> > event-driven (like SAX) and pluggable, so other modules can be independently
> > developed and later added.
>
> xproc seems a good candidate wrapper.
> Ordering and when to halt then becomes an issue.
>


I'm not familiar with xproc, but it looks interesting.
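To sketch the "event-driven and pluggable" idea from above (purely illustrative; the module interface is invented for this example): a SAX handler fans each parse event out to independently developed check modules, and their findings are collected at the end.

import xml.sax

class CheckModule:
    # Base class for a pluggable check; override only the events you care about.
    def start_element(self, name, attrs): pass
    def end_element(self, name): pass
    def characters(self, data): pass
    def findings(self): return []

class Dispatcher(xml.sax.ContentHandler):
    def __init__(self, modules):
        super().__init__()
        self.modules = modules
    def startElement(self, name, attrs):
        for m in self.modules:
            m.start_element(name, attrs)
    def endElement(self, name):
        for m in self.modules:
            m.end_element(name)
    def characters(self, content):
        for m in self.modules:
            m.characters(content)

def run_checks(xml_path, modules):
    # One parse of the document drives every registered check module.
    xml.sax.parse(xml_path, Dispatcher(modules))
    return [f for m in modules for f in m.findings()]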

> How to link in programmatic(or shell scripted) validation
> with xml based validation.
>
>
> Far more solid start Robert, thanks.
> Is there any requirement for an instance to 'look alike' in two
> implementations?
>
> I've heard that expressed as a definition of portability in the past
>


From the end-user's perspective, this is certainly the expectation: interoperability means that the document looks and behaves the same regardless of what ODF editor they use.  But not all uses of ODF involve end-users on a desktop with a display.  So the ODF standard does not say that "bold" text must be displayed with 200% font weight or else the implementation is not conformant.  If we did that, then a search engine that doesn't display the text at all, but uses the bold tag to increase the weight of the bold terms in its term index, would not be conformant.  And the screen reader that reads the bold text with vocal emphasis would not be conformant.  So ODF essentially says bold indicates bold, and an application should do whatever it does with bold text.

However, an application that has runtime semantics that are repugnant to the semantics of ODF should at least draw a warning.  For example, if an application takes bolded text, reverses the letters in those words and moves them into a footnote on the previous page, then that would certainly hurt interoperability.

Similarly, colors in ODF are expressed as RGB values.  So they are relative to a color model where the actual rendered colors will be device-dependent.  So a circle filled with 'red' will be whatever the device considers to be 'red', combined with whatever ambient lighting conditions add to the color.

Now we could strictly define absolute colors and the exact typographical meaning of "bold", and nail down every detail of how ODF renders, but in the end you would have something quite different than ODF.  It is a trade-off.  HTML's rendering model has much greater latitude than PDF does, especially when dealing with text flow and different window sizes.  HTML can reflow.  PDF just scales.  So which is more interoperable?  The pre-press person and the person trying to read the document on a Blackberry might respond differently.

That said, I think there is room for this proposed TC to tackle some of the rendering issues.  We're not going to turn ODF into PDF.  But we can certainly identify the areas where implementations' divergent renderings cause the greatest interoperability problems, and propose changes to the vendors and to the ODF TC to improve the situation.

In the end, interoperability problems can come from problems in the standard or problems in the application.  On the standard side we have:

1) Ambiguities — The specification may describe a feature in a way that is open to more than one interpretation. This may be caused by imprecise language, or by incomplete description of the feature. For example, if a specification defines sine and cosine functions, but fails to say whether their inputs are in degrees or radians, then those functions are ambiguous.

2) Out of scope features — The specification totally lacks description of a feature, making it out of scope for the standard. For example, neither ODF nor OOXML specifies the storage model, the syntax, or the semantics of embedded scripts. If a feature is out of scope, then there is no expectation of interoperability with that feature.

3) Undefined behaviors — These may be intentional or accidental. A specification may explicitly call out some behaviors as "undefined", "implementation-dependent" or "implementation-defined". This is often done in order to allow an implementation to implement the feature in the best performing way. For example, the sizes of integers are implementation-defined in the C/C++ programming languages, so implementations are free to take advantage of the capabilities of different machine architectures. Even a language like Java, which goes much further than many to ensure interoperability, has undefined behaviors in the area of multi-threading, for performance reasons. There is a trade-off here. A specification that specifies everything and leaves nothing to the discretion of the implementation will be unable to take advantage of the features of a particular platform. But a specification that leaves too much to the whim of the implementation will hinder interoperability.

4) Errors — These may range from typographical errors, to incorrect use of control language like "shall" or "shall not", to missing pages or sections in the specification, to inconsistency in provisions. If one part of the specification says X is required, and another says it is not, then implementations may vary in how feature X is treated.

5) Feature Creep — A standard can collapse under its own weight. There is often a trade-off between expressiveness of a standard (what features it can describe) and the ease of implementation. The ideal is to be very expressive as well as easy to implement. If a standard attempts to do everything that everyone could possibly want, and does so indiscriminately, then the unwieldy complexity of the standard will make it more difficult for implementations to implement, and this will hinder interoperability.

And on the application side we have:

1) Implementation bugs — Conformance to a standard, like any other product feature, gets weighed against a long list of priorities for any given product release. There is always more work to do than time to do it. Whether a high-quality implementation of a standard becomes a priority will depend on factors such as user-demand, competition, and for open source projects, the level of interest of developers contributing to the community.

2) Functional subsets — Even in heavily funded commercial ventures, standards support can be partial. Look at Microsoft's Internet Explorer, for example. How many years did it take to get reasonable CSS2 support? When an application supports only a subset of a standard, interoperability with applications that allow the full feature set of the standard, or a different subset of the standard, will suffer.

3) Functional supersets — Similarly, an application can extend the standard, often using mechanisms allowed and defined by the standard, to create functional supersets that, if poorly designed, can cause interoperability issues.

4) Varying conceptual models — For example, a traditional WYSIWYG word processor has a page layout that is determined by the metrics of the printer the document will eventually print to. But a web-based editor is free from those constraints. In fact, if the eventual target of the document is a web page, these constraints are irrelevant. So we have here a conceptual difference, where one implementation sees the printed page as a constraint on layout, and another application is in an environment where page width is more flexible. Document exchange between two editors with different conceptual models of page size will require extra effort to ensure interoperability.

(Users are also part of the interoperability equation.  A user who enters "see page 23" rather than using a dynamic link, or who right-aligns a page header by inserting 57 spaces, is creating a non-portable document, in the same way that a programmer writing C code that depends on the size of an integer is writing non-portable code.)

From a practical standpoint, what I've found, based on the interoperability workshop we had in Barcelona last year, was that the functional-subsets problem was the major contributor to rendering interoperability problems.  This was obviously a function of the relative maturity of the implementations.  Not everyone had implemented all of the standard.  We also found plenty of implementation bugs.  But I didn't see any cases where one vendor said "I thought the spec said X" and another said "I thought the spec said Y".  So that is why I think creating an ODF test suite is a worthwhile endeavor.

Regards

-Rob


