Subject: Re: [xacml] Unicode strings
All,

Hmm... I think I was mistaken about Java. On second thought, I think Java
strings do code point collation. Can anyone here confirm this?

Regards,
Erik

Erik Rissanen wrote:
> All,
>
> During the last call we had a discussion about string equality in XACML.
>
> This link contains answers to all the questions about Unicode string
> comparison which you have been too afraid to ask:
>
> http://www.unicode.org/unicode/reports/tr10/
>
> (I told you it was complex. :-))
>
> Some basic terminology can be found here:
>
> http://www.w3.org/TR/2005/REC-charmod-20050215/
>
> I cannot say that I have understood it in depth, so please correct me if
> I am wrong, but it appears that we have some choices for defining string
> comparisons in XACML:
>
> 1. Use Unicode code point collation.
>
> 2. Use the Unicode collation algorithm with the default Unicode
> collation element table (DUCET).
>
> 3. Use the Unicode collation algorithm with locale-specific collations.
>
> 4. Compare byte streams in some encoding, such as UTF-16, of the Unicode
> strings.
>
> The third option means that string comparisons would depend on the
> locale, which would give different results for different people, and we
> would have one more item of metadata to manage. This sounds dangerous for
> a security application such as XACML, and unnecessary since most strings
> won't be human language in the first place. So I suggest that we skip
> option 3.
>
> I don't like the fourth either, since it is either the same as number 1
> or might give "strange" results depending on the encoding we choose.
> (By strange I mean that the order can be very different from the
> Unicode code point collation, depending on where the encoding splits up
> the Unicode table.) Though it appears to me that comparison of Java
> strings does this with a UTF-16 encoding.
>
> 1 means that strings are compared by their Unicode code point
> representation. This appears to be the default in XQuery.
>
> I am not sure what 2 actually is. My impression is that it is a
> collation table intended to make it simple to define common human
> language collation tables as small deltas to this table. But I could be
> wrong.
>
> For 2 (I think) and 3 there is an implementation available here:
> http://www.icu-project.org/
>
> I propose that we use 1. It's simple and appears most suitable for the
> "machine readable stuff" we are dealing with. Though it means that, for
> instance, in Java string comparisons (other than equality) need some
> special treatment. I'm not sure if this would be a performance problem.
> Probably not. See example here: http://mindprod.com/jgloss/codepoint.html
>
> Another benefit of 1 is that (I think) it is independent of which
> version of Unicode is used.
>
> Regards,
> Erik
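
For anyone who wants to check the Java question, here is a minimal sketch that compares the same two strings both ways. The class name CodePointOrder and the helper compareByCodePoint are hypothetical names made up for this illustration, not part of any XACML implementation. As far as I know, String.compareTo works on char values (UTF-16 code units), so the two orders should only disagree when one string contains a supplementary character (above U+FFFF) and the other contains a character in the range U+E000 to U+FFFF at the same position.

```java
// Sketch only: CodePointOrder and compareByCodePoint are hypothetical names.
final class CodePointOrder {

    // Compare two strings by Unicode code point (option 1),
    // decoding surrogate pairs instead of comparing raw char values.
    static int compareByCodePoint(String a, String b) {
        int i = 0;
        int j = 0;
        while (i < a.length() && j < b.length()) {
            int cpA = a.codePointAt(i);
            int cpB = b.codePointAt(j);
            if (cpA != cpB) {
                return Integer.compare(cpA, cpB);
            }
            // Advance by one or two chars depending on the code point.
            i += Character.charCount(cpA);
            j += Character.charCount(cpB);
        }
        // One string is a prefix of the other: the shorter one sorts first.
        return Integer.compare(a.length() - i, b.length() - j);
    }

    public static void main(String[] args) {
        String bmp = "\uFF5E";          // U+FF5E, a single UTF-16 code unit
        String supp = "\uD800\uDC00";   // U+10000, a surrogate pair

        // UTF-16 code unit order: 0xFF5E is greater than the lead surrogate 0xD800.
        System.out.println("compareTo sign:          "
                + Integer.signum(bmp.compareTo(supp)));

        // Code point order: 0xFF5E is less than 0x10000.
        System.out.println("compareByCodePoint sign: "
                + Integer.signum(compareByCodePoint(bmp, supp)));
    }
}
```

If the two signs differ, plain String.compareTo is not code point collation, and ordering comparisons would need a helper along these lines. Equality is not affected, since two strings are equal code unit by code unit exactly when they are equal code point by code point.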