office-comment message

Subject: RE: [office-comment] Text in OpenFormula - inadequate for international use

From: Alex Brown <alexb@griffinbrown.co.uk>
To: "dwheeler@dwheeler.com" <dwheeler@dwheeler.com>, "office-comment@lists.oasis-open.org" <office-comment@lists.oasis-open.org>
Date: Wed, 5 May 2010 21:03:42 +0100

David hi

> The specification is specifically written to *permit* support of arbitrary
> Unicode/10646 text at run-time; if there is an example that *inhibits* this, it's
> an error, and we need to fix it.  Please let us know of any.  I don't know of
> any (your citation permits it, for example).

My contention is that it needs to *require* support for internationalized text, not just permit it (as an optional add-on to meeting our apparently indispensable Anglo-US needs).

> The primary purpose of all spreadsheet formulas is to calculate *numbers*,
> not *text*.  *No* spreadsheet implementation is good at processing text,
> because that's not what they're intended for.  Often, the rare text
> processing at all is stuff like "show the number of * as shown in this other
> cell", which does not depend on international characters.  For example, that
> there isn't even an iterator defined in the language, so a lot of text
> processing simply *can't* be done by spreadsheet formulas.
>
> Which means that I read your comment as, "this language is not very good at
> tasks it's not intended for".  Which is true, but that is true for all things made
> by mankind.

I don't think there's any expectation that OpenFormula needs to have super-sophisticated text handling. But what it does specify needs to be internationalized, interoperable and clear.

> It would be trivial to change the *specification* to require Unicode/ISO 10646
> support; just change a few "should"s to "shalls".  But that would not
> suddenly make *implementations* support it, esp. beyond the BMP.

What implementations? (genuine question). Isn't it quite hard to find a programming language these days that *doesn't* support Unicode?

Surely the principal draft implementation (in OO.o) already operates on Unicode text, since that is what is being supplied to OpenFormula evaluators by the documents containing the expressions.

>  Since
> it's intended primarily for numerical calculation, the few text processing
> functions are there for various trivial and historical purposes.  Producing a
> specification that is *not* implemented is a sham and a waste of everyone's
> time, and we really want to *avoid* that.

And we certainly don't want to be disenfranchising non-Western users for the sake of Western developer convenience. Supporting basic text functions for Unicode text is hardly rocket science. 

> There's an obvious compromise position, thankfully.  We could mandate
> Unicode/10646 support, at least for the BMP, in the "medium" group
> conformance clause.  That way, tiny implementations that implement a small
> set of functions could still meet *something*, and there'd be an obvious
> growth path.  I think that's the better approach, if it is to be mandated
> somewhere.

Something like that sounds like a way forward. I think the cleanest way to do it would be to allow implementations to define a Unicode character repertoire (which could, for example, constrain Unicode to its ASCII subset).
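
A minimal sketch of what a declared repertoire could look like, in Python. The predicate names and the idea of expressing a repertoire as a test on code points are assumptions made for illustration; nothing here is taken from the OpenFormula draft.

```python
# Hypothetical repertoires, each expressed as a predicate over a code point.
def ascii_only(cp: int) -> bool:
    return cp <= 0x7F


def bmp_only(cp: int) -> bool:
    return cp <= 0xFFFF


def within_repertoire(text: str, allowed) -> bool:
    """Return True if every character of text lies inside the declared repertoire."""
    return all(allowed(ord(ch)) for ch in text)


print(within_repertoire("SUM", ascii_only))        # plain ASCII text
print(within_repertoire("caf\u00e9", ascii_only))  # contains U+00E9
print(within_repertoire("caf\u00e9", bmp_only))    # U+00E9 is inside the BMP
```

A small implementation could then state "ASCII only" as its repertoire and still be tested against the same function definitions as a full-Unicode one.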
 
> > A thorough pass should be made of the text to remove references to ASCII
> > text (except for legacy purposes) and rebase text representation and
> > handling on Unicode.
> 
> The "limits" section requires a minimum number of characters, and I think
> the best way to make that sensible is to specify the minimum number of
> ASCII characters in a text (string) type.  Otherwise, the limits in practice
> would depend on the encoding (e.g., UTF-8 vs. UTF-16, are characters
> beyond BMP allowed, etc.), and they'd be hard for users to understand.

As a text-head I have to confess (and as I just posted to this list) I'm freaked out by the way text-handling descriptions in OF keep dipping into the physical world of bytes. It's as if (for a number-head) every discussion of numbers had a note worrying about byte counts and endianness!

I don't really get what "characters" means in the Limits section (or rather, I think I read it differently to what might be intended). It is stated that a formula has a max length of 1,024 characters. I understand this to mean that the string contains representations of no more than 1,024 Unicode scalar values (whereas some might say 1,024 "CC data elements" [1]). As I've said elsewhere, it's a problem that "character" is not defined.
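
To make the ambiguity concrete, here is a minimal Python sketch that counts the same text three ways. The sample string is an assumption chosen for illustration, not something from the draft.

```python
# Illustrative sample: "A", U+00E9 (e with acute), and U+1D11E
# (MUSICAL SYMBOL G CLEF, which lies beyond the BMP).
text = "A\u00e9\U0001D11E"

scalar_values = len(text)                          # Python 3 str counts code points
utf8_bytes = len(text.encode("utf-8"))             # 1 + 2 + 4 bytes
utf16_units = len(text.encode("utf-16-le")) // 2   # 1 + 1 + 2 (surrogate pair)

print(scalar_values, utf8_bytes, utf16_units)      # prints: 3 7 4
```

A limit of "1,024 characters" gives three different answers depending on which of these counts is meant, which is why the term needs a definition.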

Now I don't see how encoding comes into it. For an ODF host the encoding will be opaque to the XML processor and hence to the application, so by the time the data arrives at the function, all memory of how it was encoded in the document will be irrelevant or lost. This kind of limit needs to be expressed in the document format, not in OF, surely?

The letter "A" could have been encoded as the (ASCII-compatible) text sequence &#65; for example. I don't see how or why OpenFormula needs to care about that.
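
A minimal sketch of that point, using Python's standard XML parser. The element name "cell" is hypothetical and is not taken from ODF.

```python
import xml.etree.ElementTree as ET

# The same letter written literally and as a numeric character reference.
literal = ET.fromstring("<cell>A</cell>").text
by_reference = ET.fromstring("<cell>&#65;</cell>").text

# After parsing, the application sees identical text in both cases.
print(literal == by_reference)  # prints: True
```

Whatever consumes the parsed text has no way to tell which spelling the document used.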

- Alex.

[1] see http://www.open-std.org/JTC1/SC22/JSG/docs/m3/docs/jsgn313.txt 

