Subject: Re: [office-formula] combining characters


Robert Weir/Cambridge/IBM wrote:
> There is a Unicode FAQ on this question:
> http://unicode.org/faq/char_combmark.html#7
> 
> It lists three ways of looking at string length, and it looks like 
> Gnumeric and Calc are doing method #2.
> 
> Whatever we do, I think we need to ensure consistency across the range of 
> string manipulation functions, since they are typically used in 
> conjunction with each other. 
...
> I can certainly forgive LEFTB and such, since they are hitting the lower 
> level character representations.  But I think that the basic LEN,RIGHT, 
> LEFT, etc., functions we want to be working with user expectations on how 
> text works.  Not being a user of compose sequences, I am on uncertain 
> ground here.

I find it *unsurprising* that different sequences of codepoints would 
compare differently.  That doesn't mean we *should* do it that way, of 
course, but that's *unsurprising*.

Per that page, the basic choices are:
1. Code Units: the units (e.g., bytes) in the physical representation of the string.
2. Codepoints: the Unicode code points in the string.
3. Graphemes: what end-users consider as characters.
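To make those three ways of counting concrete, here's a tiny Python sketch
(illustrative only; Python's str happens to expose code points directly, and
I'm using UTF-8 for the byte view):

    import unicodedata
    s = "e\u0301"   # 'e' + U+0301 COMBINING ACUTE ACCENT: one character to the user
    print(len(s.encode("utf-8")))                 # 3  code units (UTF-8 bytes)
    print(len(s))                                 # 2  code points
    print(len(unicodedata.normalize("NFC", s)))   # 1  code point once composed to U+00E9

(Counting by #3, graphemes, takes a real grapheme-cluster segmentation
algorithm such as the one in UAX #29; the string above is a single grapheme.)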

I would expect that LEFTB and the other *B operators would work off #1, 
the *Code Units*, which would vary depending on the implementation.  The 
only requirement is that 0 is always the beginning, and that the output 
of FINDB be a "byte position" you could feed back to RIGHTB, LEFTB, etc. 
That doesn't mean that's necessarily right, but that would make sense.
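To make the "byte position" idea concrete, here's a rough Python sketch.  The
names mirror FINDB/LEFTB but are NOT meant as definitions of them; I'm
assuming a UTF-8 internal representation and 0-based byte positions, and real
implementations may differ on both:

    def findb(needle: str, haystack: str) -> int:
        # Byte position of needle within haystack (0 = beginning, -1 = not found),
        # assuming UTF-8 as the internal encoding.
        return haystack.encode("utf-8").find(needle.encode("utf-8"))

    def leftb(text: str, num_bytes: int) -> str:
        # Taking a byte prefix can split a multi-byte sequence;
        # errors="replace" just makes that visible instead of raising.
        return text.encode("utf-8")[:num_bytes].decode("utf-8", errors="replace")

    s = "caf\u00e9!"             # the precomposed 'é' takes 2 bytes in UTF-8
    pos = findb("\u00e9", s)     # 3: the byte position of 'é'
    print(leftb(s, pos))         # "caf"
    print(leftb(s, pos + 2))     # "café" -- feeding the byte position back in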

On the other hand, I would NOT expect that LEFT, RIGHT, LEN, etc., would 
be that tied to the byte representation; that suggests we should pick #2 
or #3.

If you want to do #3 (Graphemes), you essentially HAVE to normalize, 
either when reading in or when performing each operation.  It'd be 
efficient if the normalization occurred when reading the text in, but 
there's a side-effect: That *changes* what is processed.  You could even 
argue that it replaces one surprising behavior with another. 
Unfortunately, there seem to be multiple ways to normalize, too.  At the 
least there is NFC vs. NFD, and probably other issues as well.  So if we 
expect implementations to possibly normalize text as it is read in, I 
think we need to explicitly say so.
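For instance (Python again, illustrative only), NFC and NFD produce different
code point sequences for the same user-visible text, and they only compare
equal once both sides are put into the same form:

    import unicodedata
    nfc = unicodedata.normalize("NFC", "e\u0301")   # -> U+00E9 (precomposed)
    nfd = unicodedata.normalize("NFD", "\u00e9")    # -> U+0065 U+0301 (decomposed)
    print(nfc == nfd)                               # False: different code points
    print(unicodedata.normalize("NFC", nfd) == nfc) # True once both are NFC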

And then we go down the rat-hole, I'm sorry to say.  I've found a lot of 
material on the web.   Here are a few interesting pages related to 
normalization:
* http://www.w3.org/International/wiki/NormalizationProposal
This is a proposal which, as far as I can tell, is NOT accepted; excerpted below.
* http://annevankesteren.nl/2009/02/unicode-normalization
Blog commentary.
* http://www.macchiato.com/unicode/nfc-faq
NFC FAQ.
* http://www.unicode.org/unicode/faq/normalization.html#6
The Unicode normalization FAQ.
* http://www.unicode.org/reports/tr15/
Technical report on Unicode normalization (UAX #15).

A lot of material I've come up with strongly recommends at *least* 
making sure that string comparisons act as if all text was normalized 
(by either the comparison operator, or by normalizing text ahead of time):
* The Unicode 5 (Ch3) spec says the following as "C6":
"A process shall not assume that the interpretations of two 
canonical-equivalent character sequences are distinct."
(http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf#G29705).
* Here the W3C proposal suggests something similar (a sketch of its comparison rule follows the excerpt):
    1.  Core specifications, such as XML, HTML, and CSS, MUST define 
whether canonically equivalent sequences are considered identical or not.
          1. They SHOULD require canonical equivalence to be considered 
identical, since users often cannot control how their keystrokes are 
converted into Unicode code points (either initially or due to 
conversion from a legacy encoding by a separate process).
          2. If canonical equivalence is not required, the Specification 
MUST include a health warning and recommendation suggesting the use of 
consistent code point sequences, preferably in NFC.
    2. Specifications that perform string matching for equality MUST 
specify that the comparison is done as if all strings were converted to 
Unicode Normalization Form C prior to the comparison. Note that actual 
normalization is not required and that a variety of performance-boosting 
strategies are available here.
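A minimal sketch of that "as if converted to NFC" comparison rule (Python,
illustrative only; an implementation could just as well normalize on input
and then compare directly):

    import unicodedata

    def eq_nfc(a: str, b: str) -> bool:
        # Compare as if both strings had been converted to NFC first.
        return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

    print("\u00e9" == "e\u0301")         # False: raw code point comparison
    print(eq_nfc("\u00e9", "e\u0301"))   # True: canonically equivalent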

From what I can see, Apple normally uses a *variant* of NFD:
  http://lists.apple.com/archives/carbon-dev/2009/Aug/msg00002.html
whereas W3C prefers NFC:
   http://www.w3.org/TR/charmod-norm/
Ugh.

I'm only skimming the surface, and it's already knotty, sigh.

It seems absurd to burden "small" implementations with complex rules 
that they may not even use. Many spreadsheet files don't require text 
processing at all (labels being outside of scope).  But an international 
standard needs to be able to handle character data reasonably, too.

One possibility would be to simply require that "Small" implementations 
support ASCII text, and not necessarily more, for run-time processing.  In contrast, 
"Large" could be required to support (at run-time) all Unicode 
characters and do all character comparisons in one of the permitted 
normalizations (possibly by normalizing them while reading them in).  If 
we wanted implementations to be ABLE to normalize, we should explicitly 
say so, since this would mean that reading in a file and writing it back 
out could "appear" to change the text (though if your view is that only 
the normalized version matters, then it didn't change the text at all).
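As a sketch of what "normalize while reading in" could mean for a
hypothetical "Large" implementation (Python, illustrative only; the read_cell
name and the choice of NFC are mine), note that the round trip can change the
stored code points even though the user-visible text stays the same:

    import unicodedata

    def read_cell(raw: str) -> str:
        # Hypothetical "normalize on input" policy; NFC chosen arbitrarily here.
        return unicodedata.normalize("NFC", raw)

    stored = "e\u0301"           # as the file happened to encode it (decomposed)
    loaded = read_cell(stored)   # what the implementation works with / writes back
    print(stored == loaded)      # False: the code point sequence changed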

--- David A. Wheeler


