Re: [xdi] Minutes: XDI TC Telecon Friday 2014-03-21

So currently in XDI2 I sort nodes and statements using an instance of TreeMap.
http://docs.oracle.com/javase/7/docs/api/java/util/TreeMap.html

This uses String.compareTo for sorting:
http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#compareTo%28java.lang.String%29

The Java documentation says:

"The comparison is based on the Unicode value of each character in the strings. The character sequence represented by this String object is compared lexicographically to the character sequence represented by the argument string."
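To make the quoted behaviour concrete, here is a minimal sketch (not the actual XDI2 code; the class and method names are made up for illustration). It assumes a default TreeMap with String keys. It suggests that "Unicode value of each character" means UTF-16 code unit, so a supplementary character stored as a surrogate pair sorts before a BMP character in the E000-FFFF range even though its code point is larger.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical demo class, not part of XDI2.
class TreeMapOrderDemo {

    // Returns the keys in the order a default TreeMap iterates them,
    // i.e. the natural String order defined by String.compareTo.
    static List<String> naturalOrder(String... keys) {
        TreeMap<String, Boolean> map = new TreeMap<>();
        for (String key : keys) {
            map.put(key, Boolean.TRUE);
        }
        return new ArrayList<>(map.keySet());
    }

    public static void main(String[] args) {
        String bmp = "\uFF5E";            // U+FF5E, a single UTF-16 code unit
        String supp = "\uD83D\uDE00";     // U+1F600, stored as a surrogate pair

        for (String key : naturalOrder(bmp, supp)) {
            // Print the code point of each key in TreeMap iteration order.
            System.out.println("U+" + Integer.toHexString(key.codePointAt(0)).toUpperCase());
        }
        // The surrogate pair comes first, because 0xD83D < 0xFF5E as code units,
        // although U+1F600 > U+FF5E as code points.
    }
}
```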

Is this consistent with what TR10 specifies?

What would ICU add here that Java itself doesn't already provide?

Markus

On Wed, Mar 26, 2014 at 6:38 PM, Martin, Will <Will.Martin@neustar.biz> wrote:

That's right. TR10 gives the canonical entry point for sort specification for Unicode, regardless of encoding. The standard tool for implementing Unicode operations is at

http://site.icu-project.org/

These documents and tools have driven the most important, complex, and successful information retrieval (IR) applications for more than 15 years.

Date: Wednesday, March 26, 2014 1:01 PM
To: Joseph Boyle <planetwork@josephboyle.net>
Cc: OASIS - XDI TC <xdi@lists.oasis-open.org>
Subject: Re: [xdi] Minutes: XDI TC Telecon Friday 2014-03-21

I don't really understand this table, but I was wondering whether the encoding is actually relevant to sorting.

How does sorting really work with Unicode?

Are we supposed to do byte sorting (in which case the encoding would matter)?

Or are we supposed to do character sorting? Isn't that something that should be specified somewhere in Unicode, independently of the encoding that is used?

Maybe this is what we're looking for. http://www.unicode.org/reports/tr10/
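As a rough illustration of how collation differs from plain character comparison, the sketch below uses the JDK's own java.text.Collator. This is only an assumption-laden stand-in: the JDK collator is locale-sensitive but is not guaranteed to follow any particular version of TR10, and the class and method names here are made up for the example. It shows that a collator may order "a" before "B", whereas String.compareTo orders "B" first because 0x42 < 0x61.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical demo class; uses only the JDK, not ICU4J.
class CollationDemo {

    // Compares two strings with the JDK collator for English.
    static int collate(String a, String b) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        return collator.compare(a, b);
    }

    public static void main(String[] args) {
        // Code unit comparison: upper-case letters precede lower-case ones.
        System.out.println("compareTo: " + Integer.signum("a".compareTo("B")));
        // Collation: ordered alphabetically first, case is a lower-level difference.
        System.out.println("collator:  " + Integer.signum(collate("a", "B")));
    }
}
```

Whether a locale-dependent ordering like this is desirable for something that feeds into signatures is a separate question; the sketch only shows that the two orderings are not the same thing.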

Markus

On Tue, Mar 25, 2014 at 8:46 AM, Joseph Boyle <planetwork@josephboyle.net> wrote:

On Mar 23, 2014, at 5:11 AM, Markus Sabadello <markus.sabadello@xdi.org> wrote:

Finally, we talked about Unicode and how the differences between UTF-8 and UTF-16 may affect ordering and therefore signatures. Joseph explained that Java internally uses UTF-16, whereas XDI serializations require UTF-8 encoding.

Markus will review the relevant XDI2 code sections to see if this is an issue. In Java, the popular ICU4J library may be needed to produce correct results.
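The difference mentioned in the minutes can be seen by comparing the encoded bytes directly. The sketch below is illustrative only (class and method names are invented, and it assumes well-formed strings, since getBytes replaces unpaired surrogates). It compares two strings byte by byte in UTF-8 and in UTF-16BE, treating bytes as unsigned.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for comparing encoded forms.
class EncodingOrderDemo {

    // Lexicographic comparison of two byte arrays, bytes treated as unsigned.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    static int compareUtf8(String a, String b) {
        return compareUnsigned(a.getBytes(StandardCharsets.UTF_8),
                               b.getBytes(StandardCharsets.UTF_8));
    }

    static int compareUtf16(String a, String b) {
        return compareUnsigned(a.getBytes(StandardCharsets.UTF_16BE),
                               b.getBytes(StandardCharsets.UTF_16BE));
    }

    public static void main(String[] args) {
        String bmp = "\uFF5E";         // UTF-8: EF BD 9E,    UTF-16BE: FF 5E
        String supp = "\uD83D\uDE00";  // UTF-8: F0 9F 98 80, UTF-16BE: D8 3D DE 00

        System.out.println("UTF-8 bytes:    " + Integer.signum(compareUtf8(bmp, supp)));
        System.out.println("UTF-16BE bytes: " + Integer.signum(compareUtf16(bmp, supp)));
        System.out.println("compareTo:      " + Integer.signum(bmp.compareTo(supp)));
    }
}
```

For this pair the UTF-8 comparison puts U+FF5E first, while the UTF-16BE comparison and String.compareTo put the surrogate pair first.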

I think it is enough to alter the comparison of the top digit of two code units so that D > E and D > F (strictly, only for the surrogate range D800-DFFF, since D000-D7FF are ordinary BMP characters):

        | 0-C   D    EF
    ----+---------------
    0-C |  =    <    <
    D   |  >    =    >
    EF  |  >    <    =

This procedure gives a result for unpaired surrogates rather than throwing an error, but screening out unpaired surrogates is a separate issue.
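One possible reading of this procedure, as a sketch rather than a tested implementation: remap each code unit before comparing, so that surrogates sort above E000-FFFF. The class name and the fixUp helper are invented for this example. The remapping is applied only to D800-DFFF, not to every code unit whose top digit is D, because D000-D7FF are ordinary BMP characters that should stay below E000.

```java
import java.util.Comparator;
import java.util.TreeMap;

// Hypothetical comparator: orders UTF-16 strings by code point
// (which matches UTF-8 byte order for well-formed strings).
class CodePointOrder implements Comparator<String> {

    // Remaps a code unit so that surrogates sort above all other BMP units.
    static int fixUp(char c) {
        if (c >= 0xE000) {
            return c - 0x800;   // E000..FFFF -> D800..F7FF
        }
        if (c >= 0xD800) {
            return c + 0x2000;  // D800..DFFF -> F800..FFFF
        }
        return c;               // 0000..D7FF unchanged
    }

    @Override
    public int compare(String a, String b) {
        int n = Math.min(a.length(), b.length());
        for (int i = 0; i < n; i++) {
            char x = a.charAt(i);
            char y = b.charAt(i);
            if (x != y) {
                // The first differing code unit decides, after remapping.
                return fixUp(x) - fixUp(y);
            }
        }
        return a.length() - b.length();
    }

    public static void main(String[] args) {
        TreeMap<String, String> map = new TreeMap<>(new CodePointOrder());
        map.put("\uD83D\uDE00", "U+1F600");
        map.put("\uFF5E", "U+FF5E");
        map.put("\uD7FF", "U+D7FF");
        // Values are printed in code point order of their keys.
        System.out.println(map.values());
    }
}
```

As noted above, an unpaired surrogate simply gets a position in the order instead of raising an error; rejecting such input would have to happen elsewhere.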
