Subject: Re: [office-formula] "international" characters?
David,

I was reminded of your response by the "floor - ceiling" discussion on
the main list.

How is interoperability impacted when I go from an application that
supports Unicode for string values to an application that supports only
ASCII characters?

Shouldn't the rule be that all string values are Unicode (UTF-8/UTF-16)
and that applications are permitted to choose the subset from UTF-8 that
we normally identify as ASCII? (That being the reason it was so encoded
to deal with the most common case for those characters.)

That is to say that we define conformance as being the expression of
string values in either UTF-8 or UTF-16 and allow for the case where
only the "ASCII" subset is desired?

My thinking is that, for interoperability reasons, we would then say
that when a string includes characters that fall outside of the "ASCII"
subset, an application that doesn't support those characters simply
preserves the values that are in place, or perhaps signals in some
manner that the operation in question cannot be performed.

If, as you say below, the case of manipulation of string values beyond
the "ASCII" subset rarely comes up, then the odds of getting such an
error should be pretty remote.

Hope you are having a great day!

Patrick

PS: BTW, I did notice under 4.1 Text (String) we say: "A text value of
zero is termed the empty string." There are five (5) such usages in the
current draft. I don't particularly care for separate definition
sections, although they are allowed by the JTC 1 Directives. I would
prefer that we define all terms in place but using a standard style.
Since the definitions take different verbal forms in the text, it would
be necessary to isolate all of them first. I assume that most of the
definitions are consistent across the various parts of the draft, but
that needs to be checked.

David A. Wheeler wrote:
> Patrick Durusau:
>
>> The phrase "international characters" is used under small group (2.1.1)
>> for named expression identifiers.
>> It is noted that some applications may
>> not display a glyph but an ISO 10646 code.
>>
>> But, under text (string) (4.1), it is noted that "implementations
>> *should* support Unicode strings, but *shall* at least support strings
>> of ASCII characters."
>>
>
> Correct. There is a _big_ difference in the level of support required
> for international strings in the spec:
> * The stream of text for OpenFormula itself is defined using
>   ISO 10646/Unicode and XML, using the normal conventions for them.
> * International identifiers _MUST_ be supported, so that people can use
>   variable names (names of named expressions)
>   with Chinese names or whatever. The note is simply a note
>   that clarifies that implementations need not actually display the characters
>   as glyphs, they just have to process them correctly... that's valuable
>   for low-powered implementations, so that they can at least
>   interoperate correctly.
> * Values of type 'Text' (string), which are processed by an implementation
>   that implements the spec. Background: Most spreadsheets' non-blank cells
>   are either math formulas (using numbers), or labels
>   (which are out of scope of OpenFormula). Text processing isn't a big use,
>   as far as I can tell.
>   There are some spreadsheet documents that process text, of course,
>   so there is some minimal support for it. But ASCII-only support is actually
>   adequate for many of those kinds of uses, so there's no MANDATE
>   that implementations do more (though they may).
>
> Now, you can ask the question "Should the spec demand more?", and that's
> the RIGHT question to ask. I think it'd be quite plausible to REQUIRE support
> for Text in the Medium or Large group. But for a lot of people, spreadsheets
> are for calculating. As long as labels and named expression names (variables)
> can be arbitrary characters, that's adequate for many, so the "Small" model
> at least shouldn't require it (in my mind).
> Properly handling international
> strings is often non-trivial (ask the Python developers!), and since
> it's not a common use, it seems extreme to ask that of everyone.
>
>
>> Question: In all cases are we talking about Unicode strings in one of
>> the defined representations (UTF-8, etc.)?
>>
>
> Yes, though the specification is designed so that the specific encoding used
> for Text values (that are manipulated by the language)
> is up to the implementation - and thus not specified.
> Look at the function definitions - they're carefully written to NOT depend
> on which encoding is used. Stuff like LEN counts the "number of characters"
> (not "number of bytes"). The "B" operators like LENB are carefully
> defined so that the actual encoding is opaque; they require certain
> properties of their answers, but again don't mandate an encoding.
>
>
>> Question: Or, is the string (4.1) language meant to allow a non-Unicode
>> based encoding?
>>
>
> I don't know what you mean by a "non-Unicode based encoding".
> It would be my EXPECTATION that strings would be encoded as
> ASCII (if ASCII only), UTF-8, UTF-16, or UTF-32 (the latter two in
> some consistent endianness). But the spec should still be satisfied if
> Latin-1, UTF-7, or something else were used. (I'm not sure EBCDIC
> would be okay, but I don't think anyone's worrying about that!!)
>
> An application could even use different encodings, tagging each
> string with a different encoding. That would complicate implementation
> of the "B" operators (like LEFTB and LENB), but it's fine.
>
>
>> Question: If a non-Unicode based encoding, which definition of that
>> encoding are we using?
>>
>
> See above.
>
>
>> I don't necessarily disagree but it does seem odd that Unicode support
>> is required for identifiers but optional under text.
>>
>
> Understand, but it's intentional.
> The reason is that spreadsheets
> typically don't manipulate a lot of text, but named expressions
> (variables) _ARE_ widely used, and the ability to have names in
> any language IS important to end-users.
>
> If we were specifying a language which is COMMONLY used to manipulate
> text values, then I'd feel very differently.
>
>
>> Noting that Unicode "support" doesn't necessarily mean that you get a
>> meaningful display of the text.
>>
>
> True, but it seemed to be a point of confusion, so the note was added.
>
>
>> (Can someone more familiar with display
>> issues comment on the usual behavior for missing characters? I seem to
>> recall that all you usually see are some glyph but I don't remember
>> which one. I don't recall it being the Unicode code number.)
>>
>
> Since we're not specifying the display, I think it's out of scope to do more
> than make a short note. I know that many programs display a box with
> the 4-digit Hex code inside if it's inside the Unicode/ISO 10646 BMP, but
> not a glyph they can display. No clue what they do outside the BMP.
>
> --- David A. Wheeler

-- 
Patrick Durusau
firstname.lastname@example.org
Chair, V1 - US TAG to JTC 1/SC 34
Convener, JTC 1/SC 34/WG 3 (Topic Maps)
Editor, OpenDocument Format TC (OASIS), Project Editor ISO/IEC 26300
Co-Editor, ISO/IEC 13250-1, 13250-5 (Topic Maps)
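
Illustrative note on two points from the thread: a character count (the LEN style of function) does not depend on how a Text value is stored, while a byte count does; and an ASCII-only application can refuse an operation instead of corrupting a value. The Python sketch below is only a rough illustration under stated assumptions. The names char_len, byte_len, ascii_only_upper and UnsupportedCharacterError are hypothetical and do not come from the draft. "Character" is taken to mean a Unicode code point, which is one possible reading. byte_len is a plain encoded-byte count and is not claimed to match the draft's definition of LENB.

```python
class UnsupportedCharacterError(ValueError):
    """Hypothetical error: an ASCII-only application met a non-ASCII character."""


def char_len(text: str) -> int:
    """Length in characters (assumed here: Unicode code points).

    The result is the same whatever encoding the text is stored in.
    """
    return len(text)


def byte_len(text: str, encoding: str) -> int:
    """Length in bytes once the text is stored in one particular encoding."""
    return len(text.encode(encoding))


def ascii_only_upper(text: str) -> str:
    """Upper-case text the way a hypothetical ASCII-only application might.

    If the value contains anything outside ASCII, the operation is not
    performed and an error is signalled; the caller's value is untouched.
    """
    if not text.isascii():
        raise UnsupportedCharacterError("non-ASCII character in text value")
    return text.upper()


# Eight code points: "na", U+00EF, "ve", a space, U+4E2D, U+6587.
# Escapes are used so that the source itself stays within ASCII.
sample = "na\u00efve \u4e2d\u6587"

if __name__ == "__main__":
    print("characters:", char_len(sample))
    for enc in ("utf-8", "utf-16-le", "utf-32-le"):
        print(enc, "bytes:", byte_len(sample, enc))
    try:
        ascii_only_upper(sample)
    except UnsupportedCharacterError as err:
        print("operation not performed:", err)
```

For this sample the character count is 8 under every encoding, while the byte counts are 13 (UTF-8), 16 (UTF-16) and 32 (UTF-32). That is why a definition written in terms of "number of characters" can leave the encoding up to the implementation.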