office-comment message

Subject: longer-than-expected strings (OpenFormula CD01)

From: Alex Brown <alexb@griffinbrown.co.uk>
To: "office-comment@lists.oasis-open.org" <office-comment@lists.oasis-open.org>
Date: Wed, 5 May 2010 19:23:03 +0100

Dear all,

A note for 5.20.25 (UNICODE) has:

----
Depending on the encoding of T the value returned may actually have to take more octets into account, for example in UTF-8 or UTF-16 encodings.
----

While it's half clear what this means, I don't understand why the "octets" mentioned should feature in this clause. The parameter for this function is "text" which is defined earlier in the spec as "a sequence of zero or more characters", not "octets".

It would make about as much sense to say "for larger Unicode values the number returned may be represented by more than one octet"! It may be true, but it's a sudden strange flip into the world of physical bytes.
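To illustrate the distinction, here is a minimal sketch in Python. It is not taken from the spec: the helper name unicode_value is hypothetical, and it assumes that UNICODE returns the code point of the first character of its argument. The point is only that the value at the character level is the same whatever the encoding, while the octet count is a property of a particular serialisation.

```python
def unicode_value(text: str) -> int:
    """Hypothetical stand-in for UNICODE: code point of the first character."""
    return ord(text[0])


euro = "\u20ac"  # EURO SIGN, a single character

# Logical (character) level: one character, one code point.
code_point = unicode_value(euro)

# Physical (octet) level: the length depends on the chosen encoding.
utf8_octets = len(euro.encode("utf-8"))
utf16_octets = len(euro.encode("utf-16-le"))

print(code_point, utf8_octets, utf16_octets)  # prints: 8364 3 2
```

The function never needs to know which encoding is in effect; only the (de)serialisation step does.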

In general the text of this draft seems to sit uneasily between the physical (octet) layer and the logical (character) level when describing operations on characters. Since a mention of "UTF-8 or UTF-16 encodings" is made here, it seems reasonable to assume that an OpenFormula implementation must have some knowledge of the prevailing encoding mechanism. How is this known?

I think it would be much clearer if the narrative text of this spec stopped trying to grapple with encoding issues sporadically and it was stated clearly in one place that there was a prevailing encoding in effect which operated for any (de)serialisation operations. Descriptions of text-processing could then be couched cleanly in terms of logical (text-level) operations.

- Alex.
