Subject: Re: [office-formula] "international" characters?
Patrick Durusau:
> The phrase "international characters" is used under small group (2.1.1)
> for named expression identifiers. It is noted that some applications may
> not display a glyph but an ISO 10646 code.
>
> But, under text (string) (4.1), it is noted that "implementations
> *should* support Unicode strings, but *shall* at least support strings
> of ASCII characters."

Correct. There is a _big_ difference in the level of support required for international strings in the spec:

* The stream of text for OpenFormula itself is defined using ISO 10646/Unicode and XML, using the normal conventions for them.
* International identifiers _MUST_ be supported, so that people can use variable names (names of named expressions) with Chinese names or whatever. The note simply clarifies that implementations need not actually display the characters as glyphs; they just have to process them correctly. That's valuable for low-powered implementations, so that they can at least interoperate correctly.
* Values of type 'Text' (string), which are processed by an implementation that implements the spec. For these, only ASCII support is mandatory; Unicode support is recommended.

Background: Most spreadsheets' non-blank cells are either math formulas (using numbers), or labels (which are out of scope of OpenFormula). Text processing isn't a big use, as far as I can tell. There are some spreadsheet documents that process text, of course, so there is some minimal support for it. But ASCII-only support is actually adequate for many of those kinds of uses, so there's no MANDATE that implementations do more (though they may).

Now, you can ask the question "Should the spec demand more?", and that's the RIGHT question to ask. I think it'd be quite plausible to REQUIRE Unicode support for Text in the Medium or Large group. But for a lot of people, spreadsheets are for calculating. As long as labels and named expression names (variables) can be arbitrary characters, that's adequate for many, so the "Small" model at least shouldn't require it (in my mind).
Properly handling international strings is often non-trivial (ask the Python developers!), and since it's not a common use, it seems extreme to ask that of everyone.

> Question: In all cases are we talking about Unicode strings in one of
> the defined representations (UTF-8), etc.?

Yes, though the specification is designed so that the specific encoding used for Text values (that are manipulated by the language) is up to the implementation - and thus not specified. Look at the function definitions - they're carefully written to NOT depend on which encoding is used. Functions like LEN count the "number of characters" (not "number of bytes"). The "B" operators like LENB are carefully defined so that the actual encoding is opaque; they require certain properties of their answers, but again don't mandate an encoding.

> Question: Or, is the string (4.1) language meant to allow a non-Unicode
> based encoding?

I don't know what you mean by a "non-Unicode based encoding". It would be my EXPECTATION that strings would be encoded as ASCII (if ASCII only), UTF-8, UTF-16, or UTF-32 (the latter two in some consistent endianness). But the spec should still be satisfied if Latin-1, UTF-7, or something else were used. (I'm not sure EBCDIC would be okay, but I don't think anyone's worrying about that!!) An application could even use different encodings, tagging each string with its encoding. That would complicate implementation of the "B" operators (like LEFTB and LENB), but it's fine.

> Question: If a non-Unicode based encoding, which definition of that
> encoding are we using?

See above.

> I don't necessarily disagree but it does seem odd that Unicode support
> is required for identifiers but optional under text.

Understood, but it's intentional. The reason is that spreadsheets typically don't manipulate a lot of text, but named expressions (variables) _ARE_ widely used, and the ability to have names in any language IS important to end-users.
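
To make the earlier point about counting characters versus bytes concrete, here is a minimal Python sketch. It is only an analogy, not the spec's definition of LEN or LENB: the helper names `len_chars` and `len_bytes` are hypothetical, "character" is taken to mean a Unicode code point, and the three encodings are just examples of what an implementation might choose internally.

```python
# Hypothetical helpers: an analogy for "number of characters" versus
# "number of bytes". They are NOT the spec's LEN/LENB definitions.


def len_chars(text: str) -> int:
    """Count characters (here: Unicode code points), whatever the encoding."""
    return len(text)


def len_bytes(text: str, encoding: str) -> int:
    """Count bytes under one specific encoding chosen by an implementation."""
    return len(text.encode(encoding))


# Escapes are used so the sample does not depend on the source file's
# encoding: "na", U+00EF, "ve", space, U+4E2D, U+6587.
sample = "na\u00efve \u4e2d\u6587"

char_count = len_chars(sample)
byte_counts = {
    enc: len_bytes(sample, enc)
    for enc in ("utf-8", "utf-16-le", "utf-32-le")
}

print(char_count)
print(byte_counts)
```

For this sample the character count is 8 under every encoding, while the byte counts are 13, 16, and 32 respectively. That is why a character-based function can be specified without naming an encoding, whereas a byte-based one can only be constrained by its properties.
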
If we were specifying a language which is COMMONLY used to manipulate text values, then I'd feel very differently.

> Noting that Unicode "support" doesn't necessarily mean that you get a
> meaningful display of the text.

True, but it seemed to be a point of confusion, so the note was added.

> (Can someone more familiar with display
> issues comment on the usual behavior for missing characters? I seem to
> recall that all you usually see are some glyph but I don't remember
> which one. I don't recall it being the Unicode code number.)

Since we're not specifying the display, I think it's out of scope to do more than make a short note. I know that many programs display a box with the 4-digit hex code inside when a character is in the Unicode/ISO 10646 BMP but they have no glyph for it. No clue what they do outside the BMP.

--- David A. Wheeler