office-formula message

Subject: Re: [office-formula] "international" characters?

From: "David A. Wheeler" <dwheeler@dwheeler.com>
To: Patrick Durusau <patrick@durusau.net>
Date: Tue, 20 Jan 2009 23:00:13 -0500

Patrick Durusau wrote:
> How is interoperability impacted when I go from an application that 
> supports Unicode for string values to an application that supports only 
> ASCII characters?

I expect that the ASCII-only application can only handle text that is 
within the ASCII subset.

> Shouldn't the rule be that all string values are Unicode (UTF-8/UTF-16) ...

Huh?  UTF-8 and UTF-16 are merely encodings.  They don't specify what 
the permitted character values are during text processing.
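
To make the distinction concrete, here is a rough Python sketch.  The 
sample string and variable names are my own assumptions, purely for 
illustration.  The same character values can be written out in several 
encodings, and decoding any of them gives back identical values:

```python
# Hypothetical sample text: three ASCII letters plus one non-ASCII character.
text = "caf\u00e9"

# The character values (code points) belong to the text itself,
# independent of how it is later serialized.
code_points = [ord(ch) for ch in text]

# Each encoding turns those same values into a different byte sequence.
encoded = {name: text.encode(name) for name in ("utf-8", "utf-16-le", "utf-32-le")}

for name, data in encoded.items():
    # Decoding recovers exactly the same character values every time.
    assert [ord(ch) for ch in data.decode(name)] == code_points
    print(name, len(data), "bytes")
```

The byte counts differ between encodings, but the set of values the 
application has to process does not.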

My expectation would be that some applications could only be trusted to 
portably handle the values 32 through 127, plus newline, for character 
values.  That has nothing to do with UTF-7, UTF-8, UTF-16, or UTF-32, 
since these are just different encoding systems for a range of possible 
values far larger than this.
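
As a sketch of what "portable" could mean in practice, a check like the 
following would do.  The names PORTABLE_VALUES and is_portable are 
hypothetical, and the range is simply the one given above, not anything 
taken from the draft:

```python
# Range as stated in this message: values 32 through 127, plus newline.
PORTABLE_VALUES = set(range(32, 128)) | {ord("\n")}


def is_portable(text: str) -> bool:
    """Return True if every character value lies in the portable subset."""
    return all(ord(ch) in PORTABLE_VALUES for ch in text)


print(is_portable("***\nabc"))   # only asterisks, newline, ASCII letters
print(is_portable("caf\u00e9"))  # contains a value above 127
```

Note that the check works on character values, so it gives the same 
answer no matter which encoding the text was read from.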

It might be easier to "pull up the rock" to see the _kind_ of 
implementation the spec is trying to permit at the low end.  For low-end 
systems, where text processing is rare and is stuff like "concatenate 
asterisks", an implementation can choose to use fixed-width 8-bit 
characters to store Text values.  The characters 0..127 are usually 
ASCII, and the characters represented with 128..255 might be mapped to 
radically different Unicode characters depending on the current locale.  
On a low-powered system, this is an "obvious" way to get efficient 
storage.  This completely fails for most Asian languages, and it also 
poorly handles multi-language text.  But note that this doesn't affect 
labels, the names of named expressions, and so on, so for many 
applications this is enough.  The goal is to allow simple 
implementations for cases where more sophisticated text processing isn't 
needed.
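
A small Python sketch of that locale dependence follows.  The three 
code pages stand in for "the current locale" and are my own choice for 
illustration, as is the helper name fits_in_code_page:

```python
# Text stored as fixed-width 8-bit characters: two asterisks and one high byte.
stored = bytes([0x2A, 0x2A, 0xE9])

# Hypothetical stand-ins for three different locales' 8-bit character sets.
for code_page in ("latin-1", "cp1251", "iso8859-7"):
    decoded = stored.decode(code_page)
    # The ASCII part is stable; the value behind byte 0xE9 changes per code page.
    print(code_page, [hex(ord(ch)) for ch in decoded])


def fits_in_code_page(text: str, code_page: str) -> bool:
    """Return True if every character of text has a slot in the code page."""
    try:
        text.encode(code_page)
        return True
    except UnicodeEncodeError:
        return False
```

With this helper, a string of asterisks fits in any of these code 
pages, while Japanese text does not fit in latin-1 at all, which is the 
"completely fails for most Asian languages" case.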

Granted, perhaps this is just not worth it.  Maybe we should just 
require Text types to support fully internationalized characters... as 
long as a UTF-8-based implementation is possible, it's not TOO hard. 
The funny thing here is that creating a spec that permits simple 
implementation may be harder than creating a simpler spec that imposes 
more requirements on the implementation... :-).

> and that applications are permitted to choose the subset from UTF-8 that 
> we normally identify as ASCII? (That being the reason it was so encoded 
> to deal with the most common case for those characters.)
> 
> That is to say that we define conformance as being the expression of 
> string values in either UTF-8 or UTF-16 and allow for the case where 
> only the "ASCII" subset is desired?

This seems to mix encoding during read/write with internal processing.

> Thinking that then for interoperability reasons we say that when a 
> string includes characters that fall outside of the "ASCII" subset, that 
> an application that doesn't support those characters simply preserves 
> the values that are in place. Or perhaps signals in some manner that the 
> operation in question cannot be performed.

I'm not sure I understand you here.

A low-powered implementation can do a lot by simply encoding characters 
as 8-bit characters, 1 byte/character.  This makes length, etc., easy to 
calculate.  If it uses a locale-specific character set, it can handle 
non-ASCII text easily for many countries.  Of course, it gives up 
handling arbitrary internationalized text.
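
To show why 1 byte/character makes length easy, here is a rough Python 
comparison.  The function names are hypothetical; the UTF-8 version 
counts every byte that is not a continuation byte:

```python
def length_8bit(stored: bytes) -> int:
    """Length of text stored at 1 byte/character: just the byte count."""
    return len(stored)


def length_utf8(stored: bytes) -> int:
    """Length of UTF-8 text: count bytes that do not look like 10xxxxxx."""
    return sum(1 for b in stored if (b & 0xC0) != 0x80)


sample = "caf\u00e9"  # hypothetical sample text, 4 characters
print(length_8bit(sample.encode("latin-1")))  # 4 bytes stored
print(length_utf8(sample.encode("utf-8")))    # 5 bytes stored, scanned once
```

The UTF-8 version is still only a few lines, but it has to scan the 
text, while the 8-bit version can answer from the stored size alone.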

> If, as you say below, the case of manipulation of string values beyond 
> the "ASCII" subset rarely comes up, then the odds of getting such an 
> error should be pretty remote.
> 
> Hope you are having a great day!
> 
> Patrick
> 
> PS: BTW, I did notice under 4.1 Text (String) we say: "A text value of 
> zero is termed the empty string." There are five (5) such usages in the 
> current draft. I don't particularly care for separate definition 
> sections although they are allowed by the JTC 1 Directives. I would 
> prefer that we define all terms in place but using a standard style. 
> Since the definitions take different verbal forms in the text it would 
> be necessary to isolate all of them first. I assume that most of the 
> definitions are consistent across the various parts of the draft but 
> that needs to be checked.

I agree with you: a standard style would be a good thing.  I'm not a big 
fan of separate definition sections either - they can make the text 
harder to read and understand.

If the same term has a _different_ meaning, then clearly we have a bug 
that needs fixing :-).

--- David A. Wheeler
