OASIS Mailing List Archives

 



xacml message



Subject: Unicode strings


All,

During the last call we had a discussion about string equality in XACML.

This link contains answers to all the questions about Unicode string
comparison that you have been too afraid to ask:

http://www.unicode.org/unicode/reports/tr10/

(I told you it was complex. :-))

Some basic terminology can be found here:

http://www.w3.org/TR/2005/REC-charmod-20050215/

I cannot say that I have understood it in depth, so please correct me if
I am wrong, but it appears that we have some choices for defining string
comparisons in XACML:

1. Use Unicode code point collation.

2. Use the Unicode collation algorithm with the default Unicode
collation element table (DUCET).

3. Use the Unicode collation algorithm with locale-specific collations.

4. Compare byte streams in some encoding, such as UTF-16, of the Unicode
strings.

The third option means that string comparisons would depend on the
locale, which would give different results for different people and
leave us with one more item of metadata to manage. This sounds dangerous
for a security application such as XACML, and unnecessary since most
strings won't be human language in the first place. So I suggest that we
skip option 3.

I don't like the fourth either, since it is either the same as number 1
or might give "strange" results depending on the encoding we choose.
(By strange I mean that the order can differ considerably from the
Unicode code point collation, depending on where the encoding splits up
the Unicode table.) It appears to me, though, that comparison of Java
strings does this with a UTF-16 encoding.
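
To make the point about encodings concrete, here is a small Java sketch
(the class and method names are made up for illustration, and it assumes
Java 9 or later). It compares the same two strings by UTF-16 code units,
which is what String.compareTo does, by unsigned UTF-8 bytes, and by code
points. The two test characters are my own choice: U+FF5E lies inside the
Basic Multilingual Plane and U+1D11E lies outside it.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class EncodingOrderDemo {
    // Option 4 with UTF-8: compare the encoded bytes as unsigned values.
    static int compareUtf8Bytes(String a, String b) {
        return Arrays.compareUnsigned(
                a.getBytes(StandardCharsets.UTF_8),
                b.getBytes(StandardCharsets.UTF_8));
    }

    // Option 1: compare the sequences of Unicode code points.
    static int compareCodePoints(String a, String b) {
        return Arrays.compare(a.codePoints().toArray(),
                              b.codePoints().toArray());
    }

    public static void main(String[] args) {
        String bmp = "\uFF5E";         // U+FF5E, a single UTF-16 code unit
        String supp = "\uD834\uDD1E";  // U+1D11E, a surrogate pair in UTF-16

        // String.compareTo works on UTF-16 code units: 0xFF5E > 0xD834.
        System.out.println(Integer.signum(bmp.compareTo(supp)));             // 1
        // UTF-8 bytes: EF BD 9E sorts before F0 9D 84 9E.
        System.out.println(Integer.signum(compareUtf8Bytes(bmp, supp)));     // -1
        // Code points: 0xFF5E < 0x1D11E.
        System.out.println(Integer.signum(compareCodePoints(bmp, supp)));    // -1
    }
}
```

For this pair the UTF-8 byte order agrees with the code point order,
while the UTF-16 code unit order does not.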

Option 1 means that strings are compared by their Unicode code point
representation. This appears to be the default in XQuery.

I am not sure what 2 actually is. My impression is that it is a
collation table intended to make it simple to define common human
language collation tables as small deltas to this table. But I could be
wrong.

For 2 (I think) and 3 there is an implementation available here:
http://www.icu-project.org/

I propose that we use 1. It's simple and appears most suitable for the
"machine readable stuff" we are dealing with. It does mean, though, that
in Java, for instance, string comparison (other than equality) needs some
special treatment. I'm not sure if this would be a performance problem.
Probably not. See example here: http://mindprod.com/jgloss/codepoint.html
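
As a rough idea of what that special treatment could look like, here is a
minimal sketch of a code point comparison for Java strings. It is not
taken from the page linked above, and the class name CodePointOrder is
only an illustration. It walks both strings one code point at a time, so
it allocates nothing and should cost about the same as an ordinary
comparison.

```java
import java.util.Comparator;

final class CodePointOrder {
    // Usable wherever a Comparator<String> is expected, e.g. List.sort.
    static final Comparator<String> COMPARATOR = CodePointOrder::compare;

    static int compare(String a, String b) {
        int i = 0;
        int j = 0;
        while (i < a.length() && j < b.length()) {
            // codePointAt joins a valid surrogate pair into one code point.
            int ca = a.codePointAt(i);
            int cb = b.codePointAt(j);
            if (ca != cb) {
                return Integer.compare(ca, cb);
            }
            // Advance by one or two chars depending on the code point.
            i += Character.charCount(ca);
            j += Character.charCount(cb);
        }
        // One string is a prefix of the other: the shorter one sorts first.
        return Integer.compare(a.length() - i, b.length() - j);
    }
}
```

Equality is unaffected: two Java strings have the same code points exactly
when String.equals says they are equal, so only ordering comparisons would
need something like this.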

Another benefit of 1 is that (I think) it is independent of which
version of Unicode is used.

Regards,
Erik


