office-formula message

Subject: Re: [office-formula] combining characters

From: "Andreas J. Guelzow" <aguelzow@pyrshep.ca>
To: office-formula@lists.oasis-open.org
Date: Tue, 01 Sep 2009 23:16:28 -0600

On Tue, 2009-09-01 at 23:31 -0400, Robert Weir/Cambridge/IBM wrote:
> There is a Unicode FAQ on this question:
> 
> http://unicode.org/faq/char_combmark.html#7
> 
> It lists three ways of looking at string length, and it looks like 
> Gnumeric and Calc are doing method #2.

Note that I just described what OOo and Gnumeric are currently doing.
Personally I find it disturbing that LEN("é") depends on whether or not
é is e+0301 or a single code point.
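
To make the dependence concrete, here is a minimal Python sketch. It assumes
that Python's built-in len(), which counts code points, is a fair stand-in for
the code-point counting described above; it is not the code either application
actually runs.

```python
# Two encodings of the same visible character "é".
precomposed = "\u00e9"   # single code point U+00E9
decomposed = "e\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

# len() on a Python str counts code points, not displayed characters.
len_precomposed = len(precomposed)
len_decomposed = len(decomposed)

print(len_precomposed, len_decomposed)  # 1 2
```
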
> 
> Whatever we do, I think we need to ensure consistency across the range of 
> string manipulation functions, since they are typically used in 
> conjunction with each other. 
> 
> So in your example, would FIND("é","e") return 1?

I believe you mean FIND("e","é"). If é is e+0301 this returns 1 and
similarly 

>   Would LEFT("é",1) 
> return 'e'?  

yes
> And what would RIGHT("é",1) return?
an empty-looking string of length 1.  :-(


>   and UPPER("é")?.

A two code point string E+0301

>   And 
> LEN(UPPER("é"))?

2
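
The answers above can be reproduced with a small Python sketch that works
purely on code points. The helpers find, left and right are hypothetical
stand-ins written for this illustration, not the OOo or Gnumeric
implementations, and "é" is assumed to be stored as e+0301.

```python
s = "e\u0301"  # "é" stored as 'e' + U+0301 COMBINING ACUTE ACCENT

def find(needle, haystack):
    # 1-based position of needle; returns 0 when absent (a simplification,
    # a spreadsheet would report an error instead).
    return haystack.find(needle) + 1

def left(text, n):
    # First n code points.
    return text[:n]

def right(text, n):
    # Last n code points.
    return text[-n:] if n > 0 else ""

find_result = find("e", s)      # the bare 'e' is found at position 1
left_result = left(s, 1)        # 'e' without its accent
right_result = right(s, 1)      # the lone combining accent
upper_result = s.upper()        # 'E' followed by the combining accent
upper_len = len(upper_result)   # still two code points

print(find_result, repr(left_result), repr(right_result), upper_len)
```

The lone combining accent returned by right() is what shows up as the
empty-looking string of length 1.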

> 
> I think it is far more challenging to define these functions in an 
> intuitive and self-consistent fashion if we assume that the strings are 
> normalized. 

If we don't assume the strings to be normalized, there is nothing
intuitive happening here.
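
As a rough illustration of what normalization would buy, the sketch below
applies NFC via Python's standard unicodedata module before counting code
points. Choosing NFC is an assumption made for this example; nothing in this
thread fixes a normalization form.

```python
import unicodedata

def nfc(text):
    # Compose base letters and combining marks where a precomposed form exists.
    return unicodedata.normalize("NFC", text)

decomposed = "e\u0301"            # 'e' + U+0301 COMBINING ACUTE ACCENT
normalized = nfc(decomposed)      # becomes the single code point U+00E9

len_before = len(decomposed)
len_after = len(normalized)
upper_after = normalized.upper()  # single code point U+00C9
right_after = normalized[-1:]     # the whole accented letter

# NFC only helps when a precomposed code point exists:
# there is no precomposed 'q' with acute accent.
len_q = len(nfc("q\u0301"))

print(len_before, len_after, len(upper_after), len_q)  # 2 1 1 2
```

The last line shows the limit of this approach: sequences without a
precomposed form still count as more than one code point after NFC.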
> 
> However, if the apps themselves are not behaving in an intuitive and 
> self-consistent fashion, then that is another challenge.

Elsewhere in OpenFormula there are situations where strange function
combinations are defined simply because existing implementations have
them (for example B and BINOMDIST). So if we want to just encode what
implementations are doing I guess we can't ask for normalization.
(Personally I would rather have a rational collection of functions.)

Andreas

> 
> I can certainly forgive LEFTB and such, since they are hitting the lower 
> level character representations.  But I think that the basic LEN,RIGHT, 
> LEFT, etc., functions we want to be working with user expectations on how 
> text works.  Not being a user of compose sequences, I am on uncertain 
> ground here.  Does anyone have a good sense of this?  For example, when a 
> user enters a compose sequence, do they think of it as a short cut for 
> entering a single character?  Or do they think of it as a way of entering 
> multiple characters that may display as a single glyph, but behave like 
> multiple characters when doing string operations?
> 
> -Rob
> 
> Andreas J Guelzow <aguelzow@math.concordia.ab.ca> wrote on 09/01/2009 
> 11:19:23 AM:
> > 
> > Just for the record, in both OOo3.1 and Gnumeric 1.9.10 we have
> > =len("é")
> > evaluating as 2 if the accented e is entered using combining characters.
> > 
> > Andreas
> > -- 
> > Andreas J. Guelzow
> > Concordia University College of Alberta
> > 
-- 
Andreas J. Guelzow <aguelzow@pyrshep.ca>

References:
- combining characters
  - From: Andreas J Guelzow <aguelzow@math.concordia.ab.ca>
- Re: [office-formula] combining characters
  - From: Robert Weir/Cambridge/IBM <robert_weir@us.ibm.com>