Subject: Unicode strings
All,

During the last call we discussed string equality in XACML. This link answers all the questions about Unicode string comparison that you have been too afraid to ask: http://www.unicode.org/unicode/reports/tr10/ (I told you it was complex. :-))

Some basic terminology can be found here: http://www.w3.org/TR/2005/REC-charmod-20050215/

I cannot say that I have understood it in depth, so please correct me if I am wrong, but it appears that we have some choices for defining string comparisons in XACML:

1. Use Unicode code point collation.
2. Use the Unicode Collation Algorithm with the Default Unicode Collation Element Table (DUCET).
3. Use the Unicode Collation Algorithm with locale-specific collations.
4. Compare the byte streams of the Unicode strings in some encoding, such as UTF-16.

The third option means that string comparisons would depend on the locale, which would give different results for different people, and we would have one more item of metadata to manage. This sounds dangerous for a security application such as XACML, and unnecessary since most strings won't be human language in the first place. So I suggest that we skip option 3.

I don't like the fourth either, since it is either the same as number 1 or might give "strange" results depending on the encoding we choose. (By strange I mean that the order can differ considerably from the Unicode code point collation, depending on where the encoding splits up the Unicode table.) That said, it appears to me that comparison of Java strings does exactly this with a UTF-16 encoding.

Option 1 means that strings are compared by their Unicode code point representation. This appears to be the default in XQuery.

I am not sure what 2 actually is. My impression is that it is a collation table intended to make it simple to define common human language collation tables as small deltas to this table. But I could be wrong.
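To make the difference between options 1 and 4 concrete, here is a minimal Java sketch. It is an illustration only, not a proposed implementation: the class name CodePointOrder and the helper compareByCodePoint are hypothetical. The one fact it relies on is that String.compareTo compares UTF-16 code units, so a character outside the Basic Multilingual Plane (stored as a surrogate pair starting in the range U+D800..U+DBFF) sorts before BMP characters in the range U+E000..U+FFFF, even though its code point is larger.

```java
final class CodePointOrder {

    // Hypothetical helper: compares two strings by Unicode code point (option 1).
    static int compareByCodePoint(String a, String b) {
        int i = 0;
        int j = 0;
        while (i < a.length() && j < b.length()) {
            int ca = a.codePointAt(i);
            int cb = b.codePointAt(j);
            if (ca != cb) {
                return Integer.compare(ca, cb);
            }
            // Advance by one or two chars, depending on whether the
            // code point is stored as a surrogate pair.
            i += Character.charCount(ca);
            j += Character.charCount(cb);
        }
        // One string is a prefix of the other: the shorter one sorts first.
        return Integer.compare(a.length() - i, b.length() - j);
    }

    public static void main(String[] args) {
        String bmp = "\uFF5E";        // U+FF5E, a single UTF-16 code unit
        String supp = "\uD834\uDD1E"; // U+1D11E, stored as a surrogate pair

        // UTF-16 code unit order (option 4 with UTF-16): 0xFF5E > 0xD834.
        System.out.println("compareTo > 0: " + (bmp.compareTo(supp) > 0));

        // Code point order (option 1): 0xFF5E < 0x1D11E.
        System.out.println("code point < 0: " + (compareByCodePoint(bmp, supp) < 0));
    }
}
```

For strings that contain only BMP characters below U+D800 the two orderings agree, so the special treatment only changes results when supplementary characters are compared against characters in U+E000..U+FFFF.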
For 2 (I think) and 3 there is an implementation available here: http://www.icu-project.org/

I propose that we use 1. It's simple and appears most suitable for the "machine readable stuff" we are dealing with. It does mean, though, that in Java for instance, string comparison (other than equality) needs some special treatment. I'm not sure whether this would be a performance problem. Probably not. See the example here: http://mindprod.com/jgloss/codepoint.html

Another benefit of 1 is that (I think) it is independent of which version of Unicode is used.

Regards,
Erik