office-formula message

Subject: Fwd: Re: [office-formula] Summary of 2009-10-13 teleconference
From: "David A. Wheeler" <dwheeler@dwheeler.com>
To: office-formula@lists.oasis-open.org
Date: Wed, 14 Oct 2009 13:04:38 -0400 (EDT)

I said:
> > * The "Text" (String) type will simply be defined as something that can contain 0 or more "characters".  We will separately discuss what "character" means.  I personally think we should recommend, but not require, that implementations support a character set and encoding that permit any legal Unicode code point to be a character, but we didn't discuss that today.
 
Eike Rathke:
> I think we should say that implementations
> - shall support Unicode BMP and
> - should support the entire Unicode character range.
> 
> This comforts those that internally use UCS2 encoding only, though
> I don't know if there are implementations limited such. I don't think
> there would be implementations of OpenFormula that do not support
> Unicode BMP.

I get the impression that some implementations (at least Excel) *vary* the internal encoding that they use for characters, based on a platform setting, and that this fact leaks into the results of text processing.  For example, if a platform's setting says to use 8-bit ISO 8859-1 (Latin-1), then the implementation CANNOT represent the entire Unicode BMP internally, even though it CAN do so if a different platform setting is used, and even though the implementation IS capable of reading XML files containing the entire Unicode range.
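
As a rough illustration of how an internal encoding can leak into text results, here is a small Python sketch. It is not a description of how Excel or any other implementation works; the helper names (utf16_code_units, is_bmp_only) are made up for this example. It shows that a string containing a code point outside the BMP has a different "length" depending on whether you count code points or 16-bit code units.

```python
def utf16_code_units(text: str) -> int:
    """Count 16-bit code units, as a UTF-16-based implementation might."""
    return len(text.encode("utf-16-le")) // 2


def is_bmp_only(text: str) -> bool:
    """True if every code point lies in the Basic Multilingual Plane."""
    return all(ord(ch) <= 0xFFFF for ch in text)


# Hypothetical sample data for the illustration.
sample_bmp = "caf\u00e9"           # all four code points are in the BMP
sample_beyond = "a\U0001D11Eb"     # U+1D11E (musical G clef) is outside the BMP

for sample in (sample_bmp, sample_beyond):
    print(
        repr(sample),
        "code points:", len(sample),
        "UTF-16 code units:", utf16_code_units(sample),
        "BMP only:", is_bmp_only(sample),
    )
```

An implementation limited to the BMP would have to reject or alter the second string, which is the kind of difference a "shall support BMP, should support everything" rule would leave open.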

Now, we could say that implementations simply have to have a standards-compliant setting, but if people won't make that the NORMAL case, then there's a problem.  We don't want a "standards ghetto" that people won't want to use in real life.

> > * We will need to add a discussion noting that implementations may have a specific character set and character encoding as a setting, and that this may limit which characters may be included in strings.
> 
> I don't see how that would affect the specification other than we could
> note that some implementations are limited and thus results may differ
> if an ODF/OpenFormula document is read by such. I don't think that is
> the responsibility of the specification though.

It does if it affects interoperability.

> > Do we need to have a way to STORE this information in an OpenDocument file? If so, how?
> 
> I don't think so. Strings are stored in the encoding given by <?xml encoding="...">
> I don't see much benefit in storing the internal encoding of the
> generating implementation, other than readers would be required to
> convert from their internal Unicode encoding to that other encoding for
> functions such as CODE() and CHAR(). Doing so would impose a bunch of
> otherwise unnecessary conversion routines on implementations, maybe even
> including encodings not registered with IANA. However, we define those
> functions to ASCII for values 1<=N<=127 and to be
> implementation-dependent for values 128<=N<255 anyway.

The issue is that if a platform uses an internal representation that CANNOT represent arbitrary Unicode strings, then we need to tell the implementation what the character requirements ARE, so that data isn't lost.  And we must give that information before character data is converted to the internal format, else it's a pain.  For example, it's okay for an implementation to use a Latin-1 representation internally AS LONG AS all characters to be exchanged in that spreadsheet are in the Latin-1 set.

All of this is irrelevant if the implementation always uses an internal representation that can represent all Unicode code points (e.g., UTF-8 or UTF-16).  But this does not appear to be the case.
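
To make the Latin-1 case above concrete, here is a minimal Python sketch of the kind of check an implementation could run before converting a document's strings to a restricted internal encoding. The function name (unrepresentable) and the sample strings are hypothetical; nothing like this is currently specified.

```python
def unrepresentable(strings, encoding="latin-1"):
    """Return the sorted list of characters that `encoding` cannot represent."""
    missing = set()
    for s in strings:
        for ch in s:
            try:
                ch.encode(encoding)
            except UnicodeEncodeError:
                missing.add(ch)
    return sorted(missing)


# Hypothetical cell contents read from a document.
cells = ["caf\u00e9", "5 \u20ac", "plain ASCII"]

lost = unrepresentable(cells, "latin-1")
if lost:
    print("Latin-1 would lose:", [f"U+{ord(ch):04X}" for ch in lost])
else:
    print("Safe to use Latin-1 internally for this document")
```

If the document itself declared its character requirements up front, a check like this could be skipped, which is the argument for giving the information before conversion.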

> > * CHAR() and CODE() are *not* deprecated... they stay in.  Instead, they "normalize" to ASCII values, regardless of the internal representation.  I believe this only affects those who use a particular Arabic encoding that uses 0...127 for non-ASCII characters, and only when using those functions.
> 
> We still define 128..255 to be implementation-dependent, yes?

Yes, unless someone objects.
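
A sketch of what that split could look like, in Python. This is only an illustration under assumptions: the function names code and char, the high_encoding parameter, and the choice of a single-byte codec for 128..255 are all made up here, standing in for whatever implementation-dependent mapping an application actually uses.

```python
def code(text: str, high_encoding: str = "latin-1") -> int:
    """CODE()-like: number for the first character of `text`."""
    if not text:
        raise ValueError("CODE() needs a non-empty string")
    ch = text[0]
    cp = ord(ch)
    if 1 <= cp <= 127:
        return cp  # fixed: always the ASCII value
    # Anything else goes through the (assumed) implementation-chosen
    # 8-bit mapping; raises UnicodeEncodeError if it has no slot for `ch`.
    encoded = ch.encode(high_encoding)
    if len(encoded) != 1:
        raise ValueError("character has no single-byte code")
    return encoded[0]


def char(n: int, high_encoding: str = "latin-1") -> str:
    """CHAR()-like: character for the number `n`."""
    if 1 <= n <= 127:
        return chr(n)  # fixed: always the ASCII character
    if 128 <= n <= 255:
        # Implementation-dependent range, modelled by a single-byte codec.
        return bytes([n]).decode(high_encoding)
    raise ValueError("CHAR() argument out of range")


print(code("A"), char(65))             # same on every implementation
print(repr(char(128, "cp1252")))       # depends on the chosen mapping
print(repr(char(233, "latin-1")))      # depends on the chosen mapping
```

The point is that results for 1..127 never vary, while results for 128..255 change with the mapping, so portable documents should only rely on the first range.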

--- David A. Wheeler