
Subject: Re: [xacml] Unicode strings
From: Erik Rissanen <erik@axiomatics.com>
To: XACML TC <xacml@lists.oasis-open.org>
Date: Mon, 22 Sep 2008 20:17:10 +0200
All,

Hmm... I think I was mistaken about Java. On second thought, I think
Java strings do code point collation. Can anyone here confirm this?
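
For anyone who wants to check, here is a minimal sketch. The class name
CodePointOrder and the method compareByCodePoint are made up for
illustration and are not part of any XACML or Java library. As far as I
can tell from the Java documentation, String.compareTo compares char
values, that is UTF-16 code units, so it should differ from code point
order only when a supplementary character (above U+FFFF) is compared with
a character in the range U+E000 to U+FFFF.

```java
final class CodePointOrder {

    // Hypothetical helper: compares two strings by Unicode code point
    // instead of by UTF-16 code unit.
    static int compareByCodePoint(String a, String b) {
        int i = 0;
        int j = 0;
        while (i < a.length() && j < b.length()) {
            int cpA = a.codePointAt(i);
            int cpB = b.codePointAt(j);
            if (cpA != cpB) {
                return Integer.compare(cpA, cpB);
            }
            // Advance by one or two chars, depending on the code point.
            i += Character.charCount(cpA);
            j += Character.charCount(cpB);
        }
        // One string is a prefix of the other: the shorter one sorts first.
        return Integer.compare(a.length() - i, b.length() - j);
    }

    public static void main(String[] args) {
        String bmp = "\uFFFD";         // U+FFFD, a single UTF-16 code unit
        String supp = "\uD800\uDC00";  // U+10000, encoded as a surrogate pair

        // Code unit order: 0xFFFD > 0xD800, so bmp sorts after supp.
        System.out.println(Integer.signum(bmp.compareTo(supp)));            // prints 1
        // Code point order: 0xFFFD < 0x10000, so bmp sorts before supp.
        System.out.println(Integer.signum(compareByCodePoint(bmp, supp)));  // prints -1
    }
}
```

If the two printed signs differ, as the comments suggest, then plain
compareTo is not code point collation, and something like the helper
above would be needed for the ordering functions only. Equality is
unaffected either way.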

Regards,
Erik


Erik Rissanen wrote:
> All,
>
> During the last call we had a discussion about string equality in XACML.
>
> This link contains answers to all the questions about Unicode string
> comparison which you have been too afraid to ask:
>
> http://www.unicode.org/unicode/reports/tr10/
>
> (I told you it was complex. :-))
>
> Some basic terminology can be found here:
>
> http://www.w3.org/TR/2005/REC-charmod-20050215/
>
> I cannot say that I have understood it in depth, so please correct me if
> I am wrong, but it appears that we have some choices for defining string
> comparisons in XACML:
>
> 1. Use Unicode code point collation.
>
> 2. Use the Unicode collation algorithm with the default Unicode
> collation element table (DUCET).
>
> 3. Use the Unicode collation algorithm with locale specific collations.
>
> 4. Compare byte streams in some encoding, such as UTF-16, of the Unicode
> strings.
>
> The third option means that string comparisons would depend on the
> locale, which would give different results for different people, and we
> would have one more item of metadata to manage. This sounds dangerous for
> a security application such as XACML, and unnecessary since most strings
> won't be human language in the first place. So I suggest that we skip
> option 3.
>
> I don't like the fourth either, since it is either the same as number 1
> or might give "strange" results depending on the encoding we choose.
> (By strange I mean that the order can differ considerably from the
> Unicode code point collation, depending on where the encoding splits up
> the Unicode table.) It appears to me, though, that comparison of Java
> strings does this with a UTF-16 encoding.
>
> Option 1 means that strings are compared by their Unicode code point
> representation. This appears to be the default in XQuery.
>
> I am not sure what option 2 actually is. My impression is that it is a
> collation table intended to make it simple to define common human
> language collation tables as small deltas to this table. But I could be
> wrong.
>
> For 2 (I think) and 3 there is an implementation available here:
> http://www.icu-project.org/
>
> I propose that we use 1. It's simple and appears most suitable for the
> "machine readable stuff" we are dealing with. It does mean, though, that
> in Java, for instance, string comparisons (other than equality) need some
> special treatment. I'm not sure if this would be a performance problem.
> Probably not. See example here: http://mindprod.com/jgloss/codepoint.html
>
> Another benefit of 1 is that (I think) it is independent of which
> version of Unicode is used.
>
> Regards,
> Erik
>
>
References:
- Unicode strings
  - From: Erik Rissanen <erik@axiomatics.com>