Subject: Re: [office-formula] "international" characters?
Patrick Durusau wrote:
> How is interoperability impacted when I go from an application that
> supports Unicode for string values to an application that supports only
> ASCII characters?

I expect that the ASCII-only application can only handle text that is within the ASCII subset.

> Shouldn't the rule be that all string values are Unicode (UTF-8/UTF-16) ...

Huh? UTF-8 and UTF-16 are merely encodings. They don't specify what the permitted character values are during text processing. My expectation would be that some applications could only be trusted to portably handle the values 32 through 127, plus newline, for character values. Which has nothing to do with UTF-7, UTF-8, UTF-16, or UTF-32, since these are just different encoding systems for ranges of possible values far larger than this.

It might be easier to "pull up the rock" to see the _kind_ of implementation the spec is trying to permit at the low end. For low-end systems, where text processing is rare and is stuff like "concatenate asterisks", an implementation can choose to use fixed-width 8-bit characters to store Text values. The characters 0..127 are usually ASCII, and the characters represented with 128..255 might be mapped to radically different Unicode characters depending on the current locale. On a low-powered system, this is an "obvious" way to get efficient storage. This completely fails for most Asian languages, and it also handles multi-language text poorly. But note that this doesn't affect labels, the names of named expressions, and so on, so for many applications this is enough.

The goal is to allow simple implementations for cases where more sophisticated text processing isn't needed.

Granted, perhaps this is just not worth it. Maybe we should just require Text types to support fully internationalized characters... as long as a UTF-8-based implementation is possible, it's not TOO hard.
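To make the distinction between encodings and character values concrete, here is a minimal Python sketch. It is illustrative only and not part of the draft: the helper name is_portable_text and the choice of ISO 8859 code pages as stand-ins for "the current locale" are assumptions made for the example.

```python
# Illustrative sketch only; names and code pages are assumptions.

# 1. One character (code point U+00E9), three encodings.
#    The encoding changes the bytes, not the character value.
text = "\u00e9"                    # LATIN SMALL LETTER E WITH ACUTE
utf8 = text.encode("utf-8")        # b'\xc3\xa9'         (2 bytes)
utf16 = text.encode("utf-16-be")   # b'\x00\xe9'         (2 bytes)
utf32 = text.encode("utf-32-be")   # b'\x00\x00\x00\xe9' (4 bytes)

# 2. One 8-bit value, different characters depending on the code page.
#    This is the low-end "1 byte per character" storage scheme.
raw = bytes([0xE9])
as_latin1 = raw.decode("iso-8859-1")    # U+00E9
as_greek = raw.decode("iso-8859-7")     # U+03B9 GREEK SMALL LETTER IOTA
as_cyrillic = raw.decode("iso-8859-5")  # U+0449 CYRILLIC SMALL LETTER SHCHA


# 3. Hypothetical check for the subset suggested above as portable
#    everywhere: values 32 through 127, plus newline.
def is_portable_text(s: str) -> bool:
    return all(c == "\n" or 32 <= ord(c) <= 127 for c in s)


print(is_portable_text("***\n***"))  # True
print(is_portable_text(text))        # False
```

With the 8-bit scheme, the length of a Text value is simply its byte count, while a UTF-8 implementation has to count code points rather than bytes (len(utf8) is 2 for the single character above).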
The funny thing here is that creating a spec that permits simple implementation may be harder than creating a simpler spec that imposes more requirements on the implementation... :-).

> and that applications are permitted to choose the subset from UTF-8 that
> we normally identify as ASCII? (That being the reason it was so encoded
> to deal with the most common case for those characters.)
>
> That is to say that we define conformance as being the expression of
> string values in either UTF-8 or UTF-16 and allow for the case where
> only the "ASCII" subset is desired?

This seems to mix encoding during read/write with internal processing.

> Thinking that then for interoperability reasons we say that when a
> string includes characters that fall outside of the "ASCII" subset, that
> an application that doesn't support those characters simply preserves
> the values that are in place. Or perhaps signals in some manner that the
> operation in question cannot be performed.

I'm not sure I understand you here. A low-powered implementation can do a lot by simply encoding characters as 8-bit characters, 1 byte/character. This makes length, etc., easy to calculate. If they use a locale, they can handle non-ASCII text easily for many countries. Of course, they give up handling arbitrary internationalized text.

> If, as you say below, the case of manipulation of string values beyond
> the "ASCII" subset rarely comes up, then the odds of getting such an
> error should be pretty remote.
>
> Hope you are having a great day!
>
> Patrick
>
> PS: BTW, I did notice under 4.1 Text (String) we say: "A text value of
> zero is termed the empty string." There are five (5) such usages in the
> current draft. I don't particularly care for separate definition
> sections although they are allowed by the JTC 1 Directives. I would
> prefer that we define all terms in place but using a standard style.
> Since the definitions take different verbal forms in the text it would
> be necessary to isolate all of them first. I assume that most of the
> definitions are consistent across the various parts of the draft but
> that needs to be checked.

I agree with you, a standard style would be a good thing. I'm not a big fan of separate definition sections either - they can make the text harder to read and understand. If the same term has a _different_ meaning, then clearly we have a bug that needs fixing :-).

--- David A. Wheeler