Subject: Re: [office-formula] combining characters
On Tue, 2009-09-01 at 23:31 -0400, Robert Weir/Cambridge/IBM wrote:
> There is a Unicode FAQ on this question:
>
> http://unicode.org/faq/char_combmark.html#7
>
> It lists three ways of looking at string length, and it looks like
> Gnumeric and Calc are doing method #2.

Note that I just described what OOo and Gnumeric are currently doing.
Personally I find it disturbing that LEN("é") depends on whether or not é
is e+0301 or a single code point.

> Whatever we do, I think we need to ensure consistency across the range of
> string manipulation functions, since they are typically used in
> conjunction with each other.
>
> So in your example, would FIND("é","e") return 1?

I believe you mean FIND("e","é"). If é is e+0301 this returns 1 and
similarly

> Would LEFT("é",1)
> return 'e'?

yes

> And what would RIGHT("é",1) return?

an empty looking string of length 1. :-(

> and UPPER("é")?.

A two code point string E+0301

> And
> LEN(UPPER("é"))?

2

> I think it is far more challenging to define these functions in an
> intuitive and self-consistent fashion if we assume that the strings are
> normalized.

If we don't assume the strings to be normalized, there is nothing
intuitive happening here.

> However, if the apps themselves are not behaving in an intuitive and
> self-consistent fashion, then that is another challenge.

Elsewhere in OpenFormula there are situations where strange function
combinations are defined simply because existing implementations have them
(for example B and BINOMDIST). So if we want to just encode what
implementations are doing I guess we can't ask for normalization.
(Personally I would rather have a rational collection of functions.)

Andreas

> I can certainly forgive LEFTB and such, since they are hitting the lower
> level character representations. But I think that the basic LEN, RIGHT,
> LEFT, etc., functions we want to be working with user expectations on how
> text works. Not being a user of compose sequences, I am on uncertain
> ground here. Does anyone have a good sense of this? For example, when a
> user enters a compose sequence, do they think of it as a short cut for
> entering a single character? Or do they think of it as a way of entering
> multiple characters that may display as a single glyph, but behave like
> multiple characters when doing string operations?
>
> -Rob
>
> Andreas J Guelzow <aguelzow@math.concordia.ab.ca> wrote on 09/01/2009
> 11:19:23 AM:
>
> > Just for the record, in both OOo3.1 and Gnumeric 1.9.10 we have
> > =len("é")
> > evaluating as 2 if the accented e is entered using combining characters.
> >
> > Andreas
> > --
> > Andreas J. Guelzow
> > Concordia University College of Alberta

--
Andreas J. Guelzow <aguelzow@pyrshep.ca>
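
For anyone who wants to reproduce the answers above outside a spreadsheet, here is a minimal Python sketch. It is not OOo or Gnumeric code: the helpers LEN, LEFT, RIGHT, FIND and UPPER are hypothetical stand-ins that simply operate on code points, which is the behaviour described in this thread, and base_char_count is only a rough approximation of counting user-perceived characters (it skips code points with a nonzero combining class, not full grapheme cluster segmentation). Escape sequences are used so the two spellings of é cannot be confused.

```python
import unicodedata

# Two spellings of the same visible character.
DECOMPOSED = "e\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT
PRECOMPOSED = "\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE


# Hypothetical code-point-based stand-ins for the spreadsheet functions.
def LEN(text):
    return len(text)                  # Python str length counts code points


def LEFT(text, n):
    return text[:n]


def RIGHT(text, n):
    return text[-n:] if n > 0 else ""


def FIND(needle, haystack):
    pos = haystack.find(needle)
    return pos + 1 if pos >= 0 else None   # 1-based; None stands in for an error


def UPPER(text):
    return text.upper()


def base_char_count(text):
    # Approximation only: count code points that are not combining marks.
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


def nfc(text):
    # Normalization Form C composes e + U+0301 into U+00E9.
    return unicodedata.normalize("NFC", text)


if __name__ == "__main__":
    for label, s in (("decomposed", DECOMPOSED), ("precomposed", PRECOMPOSED)):
        print(label,
              "LEN:", LEN(s),
              "FIND e:", FIND("e", s),
              "LEFT:", ascii(LEFT(s, 1)),
              "RIGHT:", ascii(RIGHT(s, 1)),
              "UPPER:", ascii(UPPER(s)),
              "LEN after NFC:", LEN(nfc(s)),
              "base chars:", base_char_count(s))
```

With the decomposed input, RIGHT(s, 1) is the lone combining accent, which is the "empty looking string of length 1" mentioned above. After NFC normalization both spellings give the same results for every helper, which is the consistency argument for normalizing first.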