
xdi message


Subject: Re: [xdi] Minutes: XDI TC Telecon Friday 2014-03-21

This appears to be binary order over UTF-16 code units, not code point order and not UCA. Try it on "\ud800\udc00", "\u10000", and "\ue000" to verify.
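
A minimal sketch of that check, assuming nothing beyond the JDK. The class name CompareToDemo is made up for illustration. It compares the surrogate pair for U+10000 with U+E000, once with String.compareTo and once by code point, and also shows how Java reads the five-digit escape.

```java
class CompareToDemo {
    public static void main(String[] args) {
        String supplementary = "\uD800\uDC00"; // U+10000 written as a surrogate pair
        String privateUse = "\uE000";          // U+E000, a single BMP code unit

        // String.compareTo compares UTF-16 code units: 0xD800 < 0xE000.
        System.out.println(supplementary.compareTo(privateUse) < 0); // true

        // Comparing the code points gives the opposite answer: 0x10000 > 0xE000.
        System.out.println(Integer.compare(
                supplementary.codePointAt(0), privateUse.codePointAt(0)) > 0); // true

        // A Java Unicode escape takes exactly four hex digits, so the literal
        // below is U+1000 followed by the character '0', not U+10000.
        System.out.println("\u10000".length()); // 2
    }
}
```

So the two orders disagree exactly where a supplementary character meets a BMP character in the range E000-FFFF.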

Sent from my iPhone

On Mar 26, 2014, at 10:55, Markus Sabadello <markus.sabadello@xdi.org> wrote:

So currently in XDI2 I sort nodes and statements using an instance of TreeMap.

This uses String.compareTo for sorting:

The Java documentation says:

"The comparison is based on the Unicode value of each character in the strings. The character sequence represented by this String object is compared lexicographically to the character sequence represented by the argument string."

Is this consistent with what TR10 specifies?

What would ICU add here that Java itself doesn't already provide?
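
To make the difference concrete, here is a small sketch that sorts the same words with String.compareTo and with a collator. It uses the JDK's own java.text.Collator only so the example is self-contained; that class is not necessarily a full implementation of the TR10 algorithm, which is what ICU4J's collator targets. The class and method names are made up for illustration.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

class CollationDemo {
    // Sorts a copy of the input with String.compareTo (binary, code unit order).
    static String[] sortBinary(String... words) {
        String[] copy = words.clone();
        Arrays.sort(copy);
        return copy;
    }

    // Sorts a copy of the input with the JDK collator for the root locale.
    static String[] sortCollated(String... words) {
        String[] copy = words.clone();
        Arrays.sort(copy, Collator.getInstance(Locale.ROOT));
        return copy;
    }

    public static void main(String[] args) {
        // Binary order puts all uppercase ASCII letters before lowercase ones.
        System.out.println(Arrays.toString(sortBinary("b", "a", "B", "A"))); // [A, B, a, b]
        // Collation groups the two forms of "a" before the two forms of "b".
        System.out.println(Arrays.toString(sortCollated("b", "a", "B", "A")));
    }
}
```

Binary order is stable across locales and library versions, which matters for signatures; collation is meant for human-readable ordering and can change with the collation data in use.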


On Wed, Mar 26, 2014 at 6:38 PM, Martin, Will <Will.Martin@neustar.biz> wrote:
That's right. TR10 is the canonical entry point for the specification of Unicode sorting, regardless of encoding. The standard tool for implementing Unicode operations is at

These documents and tools have driven the most important, complex, and successful IR applications for more than 15 years.

Date: Wednesday, March 26, 2014 1:01 PM
To: Joseph Boyle <planetwork@josephboyle.net>
Cc: OASIS - XDI TC <xdi@lists.oasis-open.org>
Subject: Re: [xdi] Minutes: XDI TC Telecon Friday 2014-03-21

I don't really understand this table, but I was wondering whether the encoding is actually relevant to sorting.

How does sorting really work with Unicode?
Are we supposed to do byte sorting (in which case the encoding would matter)?
Or are we supposed to do character sorting? Isn't this something that should be specified somewhere in Unicode, independently of the encoding that is used?
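
As a rough illustration of why the encoding matters for byte sorting, the sketch below compares the same two strings by their unsigned bytes in UTF-8 and in UTF-16BE. It assumes Java 9 or later for Arrays.compareUnsigned. The class and method names are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ByteOrderDemo {
    // Compares two strings by their encoded bytes, treating each byte as unsigned.
    static int compareEncoded(String a, String b, Charset cs) {
        return Arrays.compareUnsigned(a.getBytes(cs), b.getBytes(cs));
    }

    public static void main(String[] args) {
        String supplementary = "\uD800\uDC00"; // U+10000
        String privateUse = "\uE000";          // U+E000

        // UTF-8: F0 90 80 80 versus EE 80 80, so U+10000 sorts after U+E000.
        System.out.println(
                compareEncoded(supplementary, privateUse, StandardCharsets.UTF_8) > 0); // true

        // UTF-16BE: D8 00 DC 00 versus E0 00, so U+10000 sorts before U+E000.
        System.out.println(
                compareEncoded(supplementary, privateUse, StandardCharsets.UTF_16BE) < 0); // true
    }
}
```

Unsigned byte order over UTF-8 agrees with code point order, while byte order over UTF-16 does not once supplementary characters are involved.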

Maybe this is what we're looking for. http://www.unicode.org/reports/tr10/


On Tue, Mar 25, 2014 at 8:46 AM, Joseph Boyle <planetwork@josephboyle.net> wrote:

On Mar 23, 2014, at 5:11 AM, Markus Sabadello <markus.sabadello@xdi.org> wrote:

Finally, we talked about Unicode and how the differences between UTF-8 and UTF-16 may affect ordering and therefore signatures. Joseph explained that Java internally uses UTF-16, whereas XDI serializations require UTF-8 encoding.

Markus will review the relevant XDI2 code sections to see if this is an issue. In Java, the popular ICU4J library may be needed to produce correct results.

I think it is enough to alter the comparison of the top digit of two code units so that D > E, D > F:

    | 0-C   D   EF
0-C |  =    <   <
D   |  >    =   >
EF  |  >    <   =

This procedure gives a result for unpaired surrogates rather than throwing an error, but screening out unpaired surrogates is a separate issue.
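
A possible Java sketch of this fix-up, written as a comparator that could be handed to a TreeMap. The names CodePointOrder, fixUp and CODE_POINT_ORDER are made up for illustration. One deliberate difference from the table: the sketch tests the surrogate range D800-DFFF rather than the whole top digit D, because D000-D7FF are ordinary BMP characters that should stay below E000. That narrowing is an assumption about the intent, not something stated in the thread.

```java
import java.util.Comparator;

class CodePointOrder {
    // Maps a UTF-16 code unit to a key whose numeric order matches code point
    // order: surrogates (D800-DFFF) are moved above E000-FFFF.
    static int fixUp(char c) {
        if (c >= 0xE000) return c - 0x800;   // E000-FFFF -> D800-F7FF
        if (c >= 0xD800) return c + 0x2000;  // D800-DFFF -> F800-FFFF
        return c;                            // 0000-D7FF unchanged
    }

    // Compares strings unit by unit, using the adjusted key at the first difference.
    static final Comparator<String> CODE_POINT_ORDER = (a, b) -> {
        int n = Math.min(a.length(), b.length());
        for (int i = 0; i < n; i++) {
            char x = a.charAt(i);
            char y = b.charAt(i);
            if (x != y) return Integer.compare(fixUp(x), fixUp(y));
        }
        // One string is a prefix of the other: the shorter one sorts first.
        return Integer.compare(a.length(), b.length());
    };

    public static void main(String[] args) {
        // U+10000 (surrogate pair) now sorts after U+E000, as in code point order.
        System.out.println(CODE_POINT_ORDER.compare("\uD800\uDC00", "\uE000") > 0); // true
    }
}
```

It could be used as `new TreeMap<String, Object>(CodePointOrder.CODE_POINT_ORDER)`. Like the procedure described above, it returns a result for unpaired surrogates instead of throwing.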
