Subject: Re: [office-formula] "international" characters?

From: "David A. Wheeler" <dwheeler@dwheeler.com>
To: office-formula@lists.oasis-open.org
Date: Wed, 07 Jan 2009 13:05:42 -0500 (EST)

Patrick Durusau:
> The phrase "international characters" is used under small group (2.1.1) 
> for named expression identifiers. It is noted that some applications may 
> not display a glyph but an ISO 10646 code.
> 
> But, under text (string) (4.1), it is noted that "implementations 
> *should* support Unicode strings, but *shall* at least support strings 
> of ASCII characters."

Correct.  There is a _big_ difference in the level of support required
for international strings in the spec:
* The stream of text for OpenFormula itself is defined using
   ISO 10646/Unicode and XML, using the normal conventions for them.
* International identifiers _MUST_ be supported, so that people can
   give their named expressions (variables) Chinese names or whatever.
   The note simply clarifies that implementations need not actually
   display the characters as glyphs; they just have to process them
   correctly.  That's valuable for low-powered implementations, so that
   they can at least interoperate correctly.
* Values of type 'Text' (string), which are processed by an implementation
   that implements the spec, only *have* to support ASCII; Unicode is a
   "should".  Background: Most spreadsheets' non-blank cells
   are either math formulas (using numbers), or labels
   (which are out of scope of OpenFormula).  Text processing isn't a big use,
   as far as I can tell.
   There are some spreadsheet documents that process text, of course,
   so there is some minimal support for it.  But ASCII-only support is actually
   adequate for many of those kinds of uses, so there's no MANDATE
   that implementations do more (though they may).
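
To make the split between identifiers and Text values concrete, here is
a minimal Python sketch.  Everything in it is hypothetical: the table
contents, the function names, and the choice to match names by exact
code points are assumptions made for illustration, not anything the
spec defines.

```python
# Sketch only: the names and values below are made up for illustration.
named_expressions = {
    "\u5229\u7387": 0.05,   # a named expression with a Chinese name
    "rate": 0.07,           # a named expression with an ASCII name
}

def lookup(name: str) -> float:
    """Resolve a named expression by comparing code points.

    No glyph is ever rendered here, so this works even on a system
    that could not display the name.
    """
    return named_expressions[name]

def within_ascii_minimum(text_value: str) -> bool:
    """True if a Text value stays inside the ASCII-only minimum."""
    return text_value.isascii()
```

With this sketch, lookup("\u5229\u7387") finds the entry without any
display support, while within_ascii_minimum("caf\u00e9") reports that
the value goes beyond what an ASCII-only implementation has to handle.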

Now, you can ask the question "Should the spec demand more?", and that's
the RIGHT question to ask.  I think it'd be quite plausible to REQUIRE Unicode
support for Text in the Medium or Large group.  But for a lot of people, spreadsheets
are for calculating.  As long as labels and named expression names (variables)
can be arbitrary characters, that's adequate for many, so the "Small" model
at least shouldn't require it (in my mind).  Properly handling international
strings is often non-trivial (ask the Python developers!), and since
text processing is not a common use, it seems extreme to ask that of everyone.

> Question: In all cases are we talking about Unicode strings in one of 
> the defined representations (UTF-8), etc.?

Yes, though the specification is designed so that the specific encoding used
for Text values (that are manipulated by the language)
is up to the implementation - and thus not specified.
Look at the function definitions - they're carefully written to NOT depend
on which encoding is used.  Stuff like LEN counts the "number of characters"
(not "number of bytes").  The "B" operators like LENB are carefully
defined so that the actual encoding is opaque; they require certain
properties of their answers, but again don't mandate an encoding.
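
A small Python sketch of why counting characters is independent of the
encoding while counting bytes is not.  The helper names are
hypothetical, "character" is taken to mean a Unicode code point (one
possible reading), and byte_len is NOT meant as a definition of LENB;
it only shows how much raw byte counts vary.

```python
def text_len(value: str) -> int:
    """Count characters (here: Unicode code points), LEN-style."""
    return len(value)

def byte_len(value: str, encoding: str) -> int:
    """Count the bytes the value occupies in one particular encoding."""
    return len(value.encode(encoding))

# 7 code points: n, a, i-with-diaeresis, v, e, space, one CJK character
sample = "na\u00efve \u6570"

char_count = text_len(sample)
byte_counts = {
    enc: byte_len(sample, enc)
    for enc in ("utf-8", "utf-16-le", "utf-32-le")
}
```

Here char_count is 7 whichever encoding is used for storage, while
byte_counts differs per encoding (10, 14, and 28 bytes respectively).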
 
> Question: Or, is the string (4.1) language meant to allow a non-Unicode 
> based encoding?

I don't know what you mean by a "non-Unicode based encoding".
It would be my EXPECTATION that strings would be encoded as
ASCII (if ASCII only), UTF-8, UTF-16, or UTF-32 (the latter two in
some consistent endianness).  But the spec should still be satisfied if
Latin-1, UTF-7, or something else were used.  (I'm not sure EBCDIC
would be okay, but I don't think anyone's worrying about that!!)

An application could even use different encodings, tagging each
string with a different encoding.  That would complicate implementation
of the "B" operators (like LEFTB and LENB), but it's fine.
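
As a rough illustration of per-string tagging, here is a Python sketch.
The TaggedText class and its helpers are hypothetical, not part of any
real implementation.  It shows that character-level operations still
agree across encodings, while the stored byte sizes do not, which is
where the "B" operators would need extra care.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaggedText:
    """Raw bytes plus the name of the encoding those bytes use."""
    data: bytes
    encoding: str

    @classmethod
    def from_str(cls, value: str, encoding: str) -> "TaggedText":
        return cls(value.encode(encoding), encoding)

    def chars(self) -> str:
        """Decode to characters using this string's own tag."""
        return self.data.decode(self.encoding)

def char_len(text: TaggedText) -> int:
    """Character count; the same for any tag that can hold the text."""
    return len(text.chars())

def same_text(a: TaggedText, b: TaggedText) -> bool:
    """Compare by characters, not by raw bytes."""
    return a.chars() == b.chars()

latin = TaggedText.from_str("caf\u00e9", "latin-1")   # stored in 4 bytes
utf8 = TaggedText.from_str("caf\u00e9", "utf-8")      # stored in 5 bytes
```

Both values have a char_len of 4 and same_text(latin, utf8) is true,
even though len(latin.data) and len(utf8.data) differ.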
 
> Question: If a non-Unicode based encoding, which definition of that 
> encoding are we using?

See above.
 
> I don't necessarily disagree but it does seem odd that Unicode support 
> is required for identifiers but optional under text.

Understood, but it's intentional.  The reason is that spreadsheets
typically don't manipulate a lot of text, but named expressions
(variables) _ARE_ widely used, and the ability to have names in
any language IS important to end-users.

If we were specifying a language which is COMMONLY used to manipulate
text values, then I'd feel very differently.

> Noting that Unicode "support" doesn't necessarily mean that you get a 
> meaningful display of the text.

True, but it seemed to be a point of confusion, so the note was added.

> (Can someone more familiar with display 
> issues comment on the usual behavior for missing characters? I seem to 
> recall that all you usually see are some glyph but I don't remember 
> which one. I don't recall it being the Unicode code number.)

Since we're not specifying the display, I think it's out of scope to do more
than make a short note.  I know that many programs display a box with
the 4-digit hex code inside when a character is inside the Unicode/ISO 10646
BMP but they have no glyph for it.  No clue what they do outside the BMP.
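
For what it's worth, that kind of fallback can be sketched in a few
lines of Python.  This is purely illustrative: fallback_label and
has_glyph are hypothetical names, brackets stand in for the drawn box,
and the 6-digit form for code points outside the BMP is my own
assumption, not something I know any program to do.

```python
from typing import Callable

def fallback_label(ch: str, has_glyph: Callable[[str], bool]) -> str:
    """Return the character itself, or a bracketed hex code if it
    cannot be displayed."""
    if has_glyph(ch):
        return ch
    code_point = ord(ch)
    if code_point <= 0xFFFF:
        # Inside the BMP: 4 hex digits, as in the boxes described above.
        return f"[{code_point:04X}]"
    # Outside the BMP: 6 hex digits (an assumption of this sketch).
    return f"[{code_point:06X}]"

def ascii_only(ch: str) -> bool:
    """Pretend font that only has glyphs for ASCII characters."""
    return ord(ch) < 128
```

With the pretend ASCII-only font, fallback_label("A", ascii_only)
returns the letter unchanged and fallback_label("\u6570", ascii_only)
returns the bracketed code 6570.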

--- David A. Wheeler

Follow-Ups:
- Re: [office-formula] "international" characters?
  - From: Patrick Durusau <patrick@durusau.net>

References:
- "international" characters?
  - From: Patrick Durusau <patrick@durusau.net>