office-formula message

Subject: Re: [office-formula] "international" characters?
From: Patrick Durusau <patrick@durusau.net>
To: dwheeler@dwheeler.com
Date: Thu, 08 Jan 2009 07:59:56 -0500
David,

David A. Wheeler wrote:
> Patrick Durusau:
>   
>> The phrase "international characters" is used under small group (2.1.1) 
>> for named expression identifiers. It is noted that some applications may 
>> not display a glyph but an ISO 10646 code.
>>
>> But, under text (string) (4.1), it is noted that "implementations 
>> *should* support Unicode strings, but *shall* at least support strings 
>> of ASCII characters."
>>     
>
> Correct.  There is a _big_ difference in the level of support required
> for international strings in the spec:
> * The stream of text for OpenFormula itself is defined using
>    ISO 10646/Unicode and XML, using the normal conventions for them.
> * International identifiers _MUST_ be supported, so that people can use
>    variable names (names of named expressions)
>    with Chinese names or whatever.  The note is simply a note
>    that clarifies that implementations need not actually display the characters
>    as glyphs, they just have to process them correctly... that's valuable
>    for low-powered implementations, so that they can at least
>    interoperate correctly.
> * Values of type 'Text' (string), which are processed by an implementation
>    that implements the spec. Background: Most spreadsheets' non-blank cells
>    are either math formulas (using numbers), or labels
>    (which are out of scope of OpenFormula).  Text processing isn't a big use,
>    as far as I can tell.
>    There are some spreadsheet documents that process text, of course,
>    so there is some minimal support for it.  But ASCII-only support is actually
>    adequate for many of those kinds of uses, so there's no MANDATE
>    that implementations do more (though they may).
>
> Now, you can ask the question "Should the spec demand more?", and that's
> the RIGHT question to ask.  I think it'd be quite plausible to REQUIRE support
> for Text in the Medium or Large group.  But for a lot of people, spreadsheets
> are for calculating.  As long as labels and named expression names (variables)
> can be arbitrary characters, that's adequate for many, so the "Small" model
> at least shouldn't require it (in my mind).  Properly handling international
> string handling is often non-trivial (ask the Python developers!), and since
> it's not a common use, it seems extreme to ask that of everyone.
>
>   
Sure, and I accept the notion that a spreadsheet application could 
(though it is not necessarily a good idea) offer no text handling at all.

Let me try again:

Does OpenFormula require support for ISO 10646/Unicode for labels and 
named expressions?

(Note that I don't equate a "yes" answer to that question with 
"arbitrary characters" or "international identifiers," etc.)

Next, I think we agree that any conforming implementation can support 
TEXT functions or it can choose to omit TEXT functions.

Then, the next question is: If an implementation supports TEXT 
functions, what sort of input to those functions does it support? Yes?

That could be ISO 10646/Unicode or, it could choose to support some 
other input. Yes?

Part of my problem is that you mention ASCII, for example, but then 
concede later that any encoding could in fact be used for most of the 
TEXT functions. I assume in those cases that the TEXT functions 
"conform" to OpenFormula.

ASCII (see below) isn't required for conformance but simply may be 
commonly found. That's a note, not a normative statement. (Unless you 
want to define a layer of conformance where all labels and named 
expressions are in ISO 10646/Unicode and any TEXT function has input 
only in "ASCII" (assuming a normative reference).)

What apparently is required for conformance is 1) specification of an 
encoding (the locale), and 2) following the rules of that encoding for 
TEXT functions. Yes?

>> Question: In all cases are we talking about Unicode strings in one of 
>> the defined representations (UTF-8), etc.?
>>     
>
> Yes, though the specification is designed so that the specific encoding used
> for Text values (that are manipulated by the language)
> is up to the implementation - and thus not specified.
> Look at the function definitions - they're carefully written to NOT depend
> on which encoding is used.  Stuff like LEN counts the "number of characters"
> (not "number of bytes").  The "B" operators like LENB are carefully
> defined so that the actual encoding is opaque; they require certain
> properties of their answers, but again don't mandate an encoding.
>   
Well, are you sure? From LEN:

> Computes number of characters (/not/ the number of bytes) in /T/. If 
> /T/ is of type Number, it is automatically converted to Text, 
> including a fractional part and decimal separator if necessary. 
> Implementations that support ISO 10646 / Unicode /*shall*/ consider 
> any character in the Basic Multilingual Plane (BMP) basic plane as one 
> character, even if they occupy multiple bytes. (The BMP are the 
> characters numbered 0 through 65535 inclusive). Implementations 
> /*should*/ consider any character not in the BMP as one character as well.
>
Looks to me like the semantics of LEN depend upon ISO 10646/Unicode. Or 
at least the semantics of LEN under other encodings are undefined.
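
To make the character/byte distinction in the quoted LEN text concrete, 
here is a small Python sketch. The helper names (len_chars, len_bytes, 
len_utf16_units) are hypothetical and not part of OpenFormula. It only 
illustrates why a count of "characters" is tied to a notion of code 
point rather than to any byte encoding, and why a character outside the 
BMP is the case that gets only a "should".

```python
# Illustration only: "number of characters" versus "number of bytes".
# All function names here are hypothetical, not OpenFormula definitions.

def len_chars(text: str) -> int:
    """Count Unicode code points, as the quoted LEN wording describes."""
    return len(text)

def len_bytes(text: str, encoding: str) -> int:
    """Count bytes under a given encoding (what LEN is said NOT to return)."""
    return len(text.encode(encoding))

def len_utf16_units(text: str) -> int:
    """Count UTF-16 code units; a character outside the BMP takes two."""
    return len(text.encode("utf-16-le")) // 2

bmp = "\u65e5\u672c\u8a9e"   # three characters, all inside the BMP
non_bmp = "\U0001D11E"       # MUSICAL SYMBOL G CLEF, outside the BMP

print(len_chars(bmp), len_bytes(bmp, "utf-8"), len_utf16_units(bmp))
print(len_chars(non_bmp), len_bytes(non_bmp, "utf-8"), len_utf16_units(non_bmp))
```

An implementation that counts UTF-16 code units would agree with the 
character count for the BMP string but not for the non-BMP character.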

BTW, rather than saying "ASCII" you can say (0x00..0x7F) since in UTF-8, 
those are indistinguishable from "ASCII" as I think you are using the term.
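
A quick check of that claim, as a Python sketch (variable names are 
illustrative only): every code point in the 7-bit ASCII range, 0x00 
through 0x7F, encodes in UTF-8 to the single byte with the same value, 
and the first code point above that range already needs two bytes.

```python
# Illustration only: UTF-8 and ASCII agree byte-for-byte on 0x00..0x7F.
ascii_range = range(0x00, 0x80)

# Each code point in the range encodes in UTF-8 to one byte of equal value.
single_byte_identity = all(
    chr(cp).encode("utf-8") == bytes([cp]) for cp in ascii_range
)

# The ASCII codec and the UTF-8 codec produce identical output on the range.
codecs_agree = all(
    chr(cp).encode("ascii") == chr(cp).encode("utf-8") for cp in ascii_range
)

# The first code point past the range (0x80) takes two bytes in UTF-8.
bytes_for_0x80 = len(chr(0x80).encode("utf-8"))

print(single_byte_identity, codecs_agree, bytes_for_0x80)
```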

>  
>   
>> Question: Or, is the string (4.1) language meant to allow a non-Unicode 
>> based encoding?
>>     
>
> I don't know what you mean by a "non-Unicode based encoding".
> It would be my EXPECTATION that strings would be encoded as
> ASCII (if ASCII only), UTF-8, UTF-16, or UTF-32 (the latter two in
> some consistent endianness).  But the spec should still be satisfied if
> Latin-1, UTF-7, or something else were used.  (I'm not sure EBCDIC
> would be okay, but I don't think anyone's worrying about that!!)
>
> An application could even use different encodings, tagging each
> string with a different encoding.  That would complicate implementation
> of the "B" operators (like LEFTB and LENB), but it's fine.
>   
And where would that encoding be indicated?
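
To show what is at stake in that question, here is a purely hypothetical 
Python sketch of a per-string encoding tag. Nothing in it comes from the 
draft: the TaggedText type, its field names, and the assumption that a 
"B" function reports a byte count under the tagged encoding are all 
mine, for illustration only.

```python
# Hypothetical sketch: a text value carrying its own encoding tag.
# Neither the type nor the byte-count behaviour is defined by OpenFormula.
from dataclasses import dataclass

@dataclass(frozen=True)
class TaggedText:
    value: str     # the decoded characters
    encoding: str  # assumed per-string tag, e.g. "utf-8" or "latin-1"

    def length(self) -> int:
        """Character count; does not depend on the tag."""
        return len(self.value)

    def length_bytes(self) -> int:
        """Byte count under the tagged encoding (assumed LENB-like result)."""
        return len(self.value.encode(self.encoding))

a = TaggedText("caf\u00e9", "latin-1")
b = TaggedText("caf\u00e9", "utf-8")

print(a.length(), a.length_bytes())
print(b.length(), b.length_bytes())
```

The same four characters give different byte counts depending on the 
tag, so wherever the tag lives, two implementations would need to agree 
on it to get the same answer from a byte-oriented function.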

Hope you are having a great day!

Patrick

-- 
Patrick Durusau
patrick@durusau.net
Chair, V1 - US TAG to JTC 1/SC 34
Convener, JTC 1/SC 34/WG 3 (Topic Maps)
Editor, OpenDocument Format TC (OASIS), Project Editor ISO/IEC 26300
Co-Editor, ISO/IEC 13250-1, 13250-5 (Topic Maps)