Re: [xdi] Minutes: XDI TC Telecon Friday 2014-03-21

U+1000 vs D800 DC00: -51200
U+2000 vs D800 DC00: -47104
U+4000 vs D800 DC00: -38912
U+8000 vs D800 DC00: -22528
U+C000 vs D800 DC00: -6144
U+D000 vs D800 DC00: -2048
U+E000 vs D800 DC00: 2048
U+F000 vs D800 DC00: 6144

I made another mistake in previous mails: the surrogate range starts at D800, not D000.
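
For reference, a corrected version of the test program quoted further down, with each comparison using its own string, might look like the sketch below. The class name UnicodeSortTestFixed and the helper diff are illustrative and not part of the original program; the pair is built with Character.toChars(0x10000), which is assumed to be equivalent to the original D800 DC00 construction.

```java
class UnicodeSortTestFixed {
    // The surrogate pair D800 DC00, i.e. the UTF-16 encoding of U+10000.
    static final String PAIR = new String(Character.toChars(0x10000));

    // Result of String.compareTo between a single BMP character and the pair.
    static int diff(int bmpCodePoint) {
        String s = new String(Character.toChars(bmpCodePoint));
        return s.compareTo(PAIR);
    }

    public static void main(String[] args) {
        int[] samples = {0x1000, 0x2000, 0x4000, 0x8000, 0xC000, 0xD000, 0xE000, 0xF000};
        for (int cp : samples) {
            // Prints lines such as "U+1000 vs D800 DC00: -51200".
            System.out.printf("U+%04X vs D800 DC00: %d%n", cp, diff(cp));
        }
    }
}
```

The sign change between U+D000 and U+E000 reflects that compareTo compares UTF-16 code units: U+E000 and U+F000 sort after the pair even though the code point it encodes, U+10000, is larger.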

On Mar 27, 2014, at 12:52 PM, Markus Sabadello <markus.sabadello@gmail.com> wrote:

I think you forgot to use all the variables: every line compares stre000, never the other strings.

But this is really interesting!

Markus

On Thu, Mar 27, 2014 at 7:57 PM, Joseph Boyle <planetwork@josephboyle.net> wrote:

I think this shows compareTo respects Unicode code point order:

public class unicodesorttest {
    public static void main(String[] args) {
        int pair[] = {0xd800,0xdc00};
        String str1000 = new String("\u1000");
        String str2000 = new String("\u2000");
        String str4000 = new String("\u4000");
        String str8000 = new String("\u8000");
        String strc000 = new String("\uc000");
        String strd000 = new String("\ud000");
        String stre000 = new String("\ue000");
        String strf000 = new String("\uf000");
        String d800dc00 = new String(pair,0,2);

        System.out.printf("U+E000 vs U+E000: %d%n",stre000.compareTo(stre000));
        System.out.printf("D800 DC00 vs D800 DC00: %d%n",d800dc00.compareTo(d800dc00));

        System.out.printf("U+1000 vs D800 DC00: %d%n",stre000.compareTo(d800dc00));
        System.out.printf("U+2000 vs D800 DC00: %d%n",stre000.compareTo(d800dc00));
        System.out.printf("U+8000 vs D800 DC00: %d%n",stre000.compareTo(d800dc00));
        System.out.printf("U+C000 vs D800 DC00: %d%n",stre000.compareTo(d800dc00));
        System.out.printf("U+D000 vs D800 DC00: %d%n",stre000.compareTo(d800dc00));
        System.out.printf("U+E000 vs D800 DC00: %d%n",stre000.compareTo(d800dc00));
        System.out.printf("U+F000 vs D800 DC00: %d%n",stre000.compareTo(d800dc00));
    }
}

U+E000 vs U+E000: 0
D800 DC00 vs D800 DC00: 0
U+1000 vs D800 DC00: 2048
U+2000 vs D800 DC00: 2048
U+8000 vs D800 DC00: 2048
U+C000 vs D800 DC00: 2048
U+D000 vs D800 DC00: 2048
U+E000 vs D800 DC00: 2048
U+F000 vs D800 DC00: 2048

On Mar 26, 2014, at 11:02 AM, Joseph Boyle <planetwork@josephboyle.net> wrote:

This appears to be Unicode codepoint binary order and not UCA. Try it on "\ud800\udc00" (the surrogate pair that encodes U+10000; Java has no "\u10000" escape) and "\ue000" to verify.


On Mar 26, 2014, at 10:55, Markus Sabadello <markus.sabadello@xdi.org> wrote:

So currently in XDI2 I sort nodes and statements using an instance of TreeMap.
http://docs.oracle.com/javase/7/docs/api/java/util/TreeMap.html

This uses String.compareTo for sorting:
http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#compareTo%28java.lang.String%29

The Java documentation says:

"The comparison is based on the Unicode value of each character in the strings. The character sequence represented by this String object is compared lexicographically to the character sequence represented by the argument string."

Is this consistent with what TR10 specifies?

What would ICU add here that Java itself doesn't already provide?

Markus
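
To make the question above concrete, the sketch below contrasts the default TreeMap ordering (String.compareTo, a comparison of numeric character values) with a locale-aware ordering. It uses the JDK's java.text.Collator only as a readily available stand-in for collation in the spirit of TR10; it is not claimed to be a conformant UCA implementation, and it is not ICU. The class and method names are hypothetical.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;
import java.util.TreeMap;

class BinaryVsCollation {
    // Returns the keys in the order a TreeMap with the given comparator yields.
    // A null comparator selects the natural ordering, i.e. String.compareTo.
    static List<String> sortedKeys(Comparator<? super String> order, String... keys) {
        TreeMap<String, Boolean> map = new TreeMap<String, Boolean>(order);
        for (String k : keys) {
            map.put(k, Boolean.TRUE);
        }
        return new ArrayList<String>(map.keySet());
    }

    // Ordering by numeric character value, as a plain TreeMap<String, ...> does.
    static List<String> binaryOrder(String... keys) {
        return sortedKeys(null, keys);
    }

    // Ordering by the JDK collator for English (assumption: stand-in for UCA-style collation).
    static List<String> collatedOrder(String... keys) {
        return sortedKeys(Collator.getInstance(Locale.ENGLISH), keys);
    }

    public static void main(String[] args) {
        System.out.println(binaryOrder("a", "B"));   // [B, a]: 'B' (0x42) < 'a' (0x61)
        System.out.println(collatedOrder("a", "B")); // [a, B]: letters compared before case
    }
}
```

The binary order is fixed and locale-independent, while the collated order is language-sensitive, which is the kind of difference a collation library addresses.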

On Wed, Mar 26, 2014 at 6:38 PM, Martin, Will <Will.Martin@neustar.biz> wrote:

That's right. TR10 gives the canonical entry point for sort specification for Unicode, regardless of encoding. The standard tool for implementing Unicode operations is at

http://site.icu-project.org/

These documents and tools have driven the most important, complex, and successful IR applications for more than 15 years.

Date: Wednesday, March 26, 2014 1:01 PM
To: Joseph Boyle <planetwork@josephboyle.net>
Cc: OASIS - XDI TC <xdi@lists.oasis-open.org>
Subject: Re: [xdi] Minutes: XDI TC Telecon Friday 2014-03-21

I don't really understand this table, but I was wondering whether the encoding is actually relevant to sorting.

How does sorting really work with Unicode?

Are we supposed to do byte sorting (in which case the encoding would matter)?

Or are we supposed to do character sorting? Isn't that something that should be specified somewhere in Unicode, independently of the encoding that is used?

Maybe this is what we're looking for: http://www.unicode.org/reports/tr10/

Markus

On Tue, Mar 25, 2014 at 8:46 AM, Joseph Boyle <planetwork@josephboyle.net> wrote:

On Mar 23, 2014, at 5:11 AM, Markus Sabadello <markus.sabadello@xdi.org> wrote:

Finally, we talked about Unicode and how the differences between UTF-8 and UTF-16 may affect ordering and therefore signatures. Joseph explained that Java internally uses UTF-16, whereas XDI serializations require UTF-8 encoding.
Markus will review the relevant XDI2 code sections to see if this is an issue. In Java, the popular ICU4J library may be needed to produce correct results.
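
A minimal sketch of how the two encodings can disagree on order: it compares U+E000 with U+10000 once with String.compareTo (UTF-16 code units) and once by unsigned UTF-8 byte values. The class name and the compareUtf8 helper are hypothetical and are not taken from XDI2.

```java
import java.nio.charset.StandardCharsets;

class Utf8VsUtf16Order {
    // Compares two strings by the unsigned byte values of their UTF-8 encodings.
    static int compareUtf8(String a, String b) {
        byte[] x = a.getBytes(StandardCharsets.UTF_8);
        byte[] y = b.getBytes(StandardCharsets.UTF_8);
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            int d = (x[i] & 0xFF) - (y[i] & 0xFF);
            if (d != 0) {
                return d;
            }
        }
        return x.length - y.length;
    }

    public static void main(String[] args) {
        String e000 = "\uE000";                                 // UTF-8: EE 80 80
        String u10000 = new String(Character.toChars(0x10000)); // UTF-8: F0 90 80 80
        System.out.println(e000.compareTo(u10000));   // 2048: E000 > D800 in UTF-16
        System.out.println(compareUtf8(e000, u10000)); // -2: EE < F0 in UTF-8
    }
}
```

The two results have opposite signs, so a sort over Java strings and a sort over the serialized UTF-8 bytes can order these two values differently.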

I think it is enough to alter the comparison of the top digit of two code units so that D > E, D > F:

     | 0-C   D   EF
 0-C |  =    <   <
 D   |  >    =   >
 EF  |  >    <   =

This procedure gives a result for unpaired surrogates rather than throwing an error, but screening out unpaired surrogates is a separate issue.
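
The table above can be sketched as a comparator that remaps each code unit before comparing, so that surrogates sort above E000..FFFF. This is an illustration under stated assumptions, not code from XDI2: the names CodePointOrder and fixup are hypothetical, the offsets follow the commonly used UTF-16 fix-up trick, and the remapped range is D800..DFFF (the actual surrogate range) rather than every unit whose top digit is D.

```java
import java.util.Comparator;

class CodePointOrder {
    // Remaps a UTF-16 code unit so that surrogates sort above E000..FFFF.
    static int fixup(char c) {
        if (c >= 0xE000) {
            return c - 0x800;   // E000..FFFF -> D800..F7FF
        }
        if (c >= 0xD800) {
            return c + 0x2000;  // D800..DFFF -> F800..FFFF
        }
        return c;               // 0000..D7FF unchanged
    }

    // Compares strings unit by unit using the remapped values.
    // Unpaired surrogates get a result rather than an error.
    static final Comparator<String> COMPARATOR = new Comparator<String>() {
        public int compare(String a, String b) {
            int n = Math.min(a.length(), b.length());
            for (int i = 0; i < n; i++) {
                char x = a.charAt(i);
                char y = b.charAt(i);
                if (x != y) {
                    return fixup(x) - fixup(y);
                }
            }
            return a.length() - b.length();
        }
    };

    public static void main(String[] args) {
        String pair = new String(Character.toChars(0x10000)); // D800 DC00
        System.out.println(COMPARATOR.compare("\uD000", pair) < 0); // true
        System.out.println(COMPARATOR.compare("\uE000", pair) < 0); // true
        System.out.println(COMPARATOR.compare("\uF000", pair) < 0); // true
    }
}
```

A comparator like this could be passed to the TreeMap constructor so that the map orders well-formed strings by code point, which matches the byte order of their UTF-8 encodings.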
