office-formula message

Subject: CODE and CHAR should not be Unicode aware, proposing UNICODE and UNICHAR

From: Eike Rathke <erack@sun.com>
To: OASIS ODFF SC <office-formula@lists.oasis-open.org>
Date: Thu, 15 Feb 2007 13:47:12 +0100

Hi,

CODE and CHAR are currently defined to handle Unicode. For
interoperability (sigh) reasons I don't think that is a good idea. Ecma
doesn't say anything about it, but Excel versions up to Excel 2003
handle those functions differently, depending even on the system where
the document originated: for documents created on Windows it uses the
Windows-1252 ANSI code page, and for documents created on a Mac it uses
a Mac code page (I would have to look up in the Excel online help which
one exactly). Unicode is not supported. I don't know what Excel 2007
does, though. Anyone?
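
To illustrate the platform dependence, here is a rough Python sketch. It
assumes Mac OS Roman as the Mac code page (the text above does not pin
down which one Excel actually uses), and the helper names code_in and
char_in are made up for illustration; this is not Excel's implementation.

```python
# Sketch only: emulate CODE/CHAR against a fixed 8-bit code page.
# Assumption: "mac_roman" stands in for the unspecified Mac code page.

def code_in(text: str, codepage: str) -> int:
    """Return the byte value of the first character in the given code page."""
    return text[:1].encode(codepage)[0]


def char_in(number: int, codepage: str) -> str:
    """Return the character that a byte value maps to in the given code page."""
    return bytes([number]).decode(codepage)


print(code_in("é", "cp1252"))     # 233
print(code_in("é", "mac_roman"))  # 142
print(char_in(128, "cp1252"))     # €
print(char_in(128, "mac_roman"))  # Ä
```

The same character yields different codes, and the same code yields
different characters, depending on where the document came from.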

Furthermore, the Korean (and maybe Japanese, others?) localized Excel
versions seem (!) to support Unicode with these functions, but there the
CODE function delivers the collation point of a syllable character
instead of the Unicode value, which I consider sophisticated nonsense.
When such a document is loaded into an English Excel version the
functionality is lost and code 63 (question mark) is the result instead,
since Korean characters aren't present in cp1252. When the document is
then stored with an English Excel and loaded in a Korean version again,
the result is still broken (the stored value is displayed) unless the
formula is recalculated.
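
The question mark fallback can be reproduced with a small Python sketch.
The function name code_cp1252 is hypothetical, and the replacement
behaviour shown is that of Python's codec, used here only as a stand-in
for what an English Excel appears to do.

```python
# Sketch only: a CODE-like function restricted to cp1252.
# Characters missing from the code page are replaced by "?" (code 63).

def code_cp1252(text: str) -> int:
    """Return the cp1252 byte value of the first character, or 63 if unmappable."""
    return text[:1].encode("cp1252", errors="replace")[0]


print(code_cp1252("A"))   # 65
print(code_cp1252("한"))  # 63
```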

It seems right to restrict CODE and CHAR to a code page, though I don't
see a way to include the Windows/Mac differentiation in ODF. If we
define the Windows-1252 code page, Mac documents imported will output
garbage for those functions. An application can handle this when
importing an Excel document, but the information will be lost once
stored as ODF. Additionally, using a code page not matching the current
system's encoding may also lead to garbage with user input, so
applications may tend to use the current encoding instead, or you'd need
to map things twice. Gnumeric uses cp1252, KSpread Unicode, OOo the
system encoding. Taking this all together makes CODE and CHAR highly
unportable. I propose to define cp1252 to be used with these functions.
Opinions?

For a clean Unicode environment and portable documents I propose to add
two new functions UNICODE and UNICHAR. Objections?
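
As a minimal sketch of the intended behaviour, assuming UNICODE returns
the code point of the first character and UNICHAR is its inverse (the
exact semantics would still need to be specified), in Python terms with
the made-up names unicode_fn and unichar_fn:

```python
# Sketch only: code-point based counterparts to CODE and CHAR.
# Assumption: UNICODE looks at the first character of the string only.

def unicode_fn(text: str) -> int:
    """Return the Unicode code point of the first character."""
    return ord(text[0])


def unichar_fn(number: int) -> str:
    """Return the character for a Unicode code point."""
    return chr(number)


print(unicode_fn("한"))              # 54620
print(unichar_fn(8364))              # €
print(unichar_fn(unicode_fn("한")))  # 한
```

The results do not depend on any system code page, so a document would
give the same values wherever it is loaded.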

  Eike

-- 
Automatic string conversions considered dangerous. They are the GOTO statements
of spreadsheets.  --Robert Weir on the OpenDocument formula subcommittee's list.

Follow-Ups:
- Re: [office-formula] CODE and CHAR should not be Unicode aware, proposing UNICODE and UNICHAR
  - From: Andreas J Guelzow <aguelzow@math.concordia.ab.ca>