Re: [xdi] Minutes: XDI TC Telecon Friday 2014-03-21

UTF-8 (and UTF-32) binary sort order is the same as Unicode codepoint order. However, UTF-16 binary order departs from codepoint order: U+E000-U+FFFF would sort above U+10000-U+10FFFF, because supplementary characters are encoded with surrogate code units in the range D800-DFFF. Sorting UTF-16 code units in the order 0000, ... , D7FF, E000, E001, ... , FFFF, D800, D801, ... , DFFF will give Unicode codepoint order for well-formed input. ICU is not needed for this.
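To make the discrepancy concrete, here is a minimal Java sketch that compares the same two strings once in UTF-16 code unit order (what String.compareTo does) and once as unsigned UTF-8 bytes. The class and method names are made up for illustration and are not taken from XDI2.

```java
import java.nio.charset.StandardCharsets;

final class Utf16VersusUtf8Order {

    // Compare two strings by their UTF-8 encodings, byte by byte,
    // treating each byte as an unsigned value (0..255).
    static int compareUtf8Bytes(String a, String b) {
        byte[] x = a.getBytes(StandardCharsets.UTF_8);
        byte[] y = b.getBytes(StandardCharsets.UTF_8);
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            int d = (x[i] & 0xFF) - (y[i] & 0xFF);
            if (d != 0) {
                return d;
            }
        }
        return x.length - y.length;
    }

    public static void main(String[] args) {
        // U+E000 is a single BMP code unit; U+10000 is a surrogate pair in UTF-16.
        String bmp = "\uE000";
        String supplementary = new String(Character.toChars(0x10000));

        // String.compareTo compares UTF-16 code units.
        System.out.println("UTF-16 code unit order: "
                + Integer.signum(bmp.compareTo(supplementary)));
        // UTF-8 byte order matches codepoint order.
        System.out.println("UTF-8 byte order:       "
                + Integer.signum(compareUtf8Bytes(bmp, supplementary)));
    }
}
```

For this pair the code unit comparison is positive (U+E000 sorts after U+10000) while the UTF-8 byte comparison is negative, so a signature computed over a list sorted with String.compareTo could differ from one computed over UTF-8 byte order.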

The Unicode Collation Algorithm (TR10) addresses issues such as treating an accented letter as equivalent to the plain letter plus a combining accent. It is a heavyweight algorithm, typically implemented with ICU, and it does not make sense to use it simply to order data for binary signature generation.
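As a rough illustration of how collation differs from binary ordering, the sketch below uses the JDK's built-in java.text.Collator as a stand-in (an assumption for brevity; it is not ICU and not a full TR10 implementation). The class and method names are hypothetical.

```java
import java.text.Collator;
import java.util.Locale;

final class CollationVersusBinary {

    // Binary ordering: compares UTF-16 code units, independent of locale.
    static int binary(String a, String b) {
        return a.compareTo(b);
    }

    // Linguistic ordering: depends on the locale's collation rules.
    static int collated(String a, String b) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        return collator.compare(a, b);
    }

    public static void main(String[] args) {
        // 'B' is U+0042 and 'a' is U+0061, so binary order puts "B" first,
        // while English collation puts "a" first.
        System.out.println("binary:   " + Integer.signum(binary("B", "a")));
        System.out.println("collated: " + Integer.signum(collated("B", "a")));
    }
}
```

The two orderings disagree even for plain ASCII, and the collated one varies with locale and rule version, which is why it looks like a poor fit for producing a reproducible byte sequence to sign.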

On Mar 26, 2014, at 10:38, "Martin, Will" <Will.Martin@neustar.biz> wrote:

That's right. TR10 is the canonical entry point for specifying sort order in Unicode, regardless of encoding. The standard tool for implementing Unicode operations is at

http://site.icu-project.org/

These documents and tools have driven the most important, complex, and successful IR applications for more than 15 years.

Date: Wednesday, March 26, 2014 1:01 PM
To: Joseph Boyle <planetwork@josephboyle.net>
Cc: OASIS - XDI TC <xdi@lists.oasis-open.org>
Subject: Re: [xdi] Minutes: XDI TC Telecon Friday 2014-03-21

I don't really understand this table, but I was wondering whether the encoding is actually relevant to sorting.

How does sorting really work with Unicode?

Are we supposed to do byte sorting (in which case the encoding would matter)?

Or are we supposed to do character sorting? Isn't that something that should be specified somewhere in Unicode, independently of the encoding that is used?

Maybe this is what we're looking for. http://www.unicode.org/reports/tr10/

Markus

On Tue, Mar 25, 2014 at 8:46 AM, Joseph Boyle <planetwork@josephboyle.net> wrote:

On Mar 23, 2014, at 5:11 AM, Markus Sabadello <markus.sabadello@xdi.org> wrote:

Finally, we talked about Unicode and how the differences between UTF-8 and UTF-16 may affect ordering and therefore signatures. Joseph explained that Java internally uses UTF-16, whereas XDI serializations require UTF-8 encoding.

Markus will review the relevant XDI2 code sections to see if this is an issue. In Java, the popular ICU4J library may be needed to produce correct results.

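I think it is enough to alter the comparison of two code units so that the surrogate range D800-DFFF sorts above E000-FFFF. In the table, the row is the first code unit, the column is the second, and "=" means the two are compared by value as usual:

          | 0000-D7FF  D800-DFFF  E000-FFFF
----------+--------------------------------
0000-D7FF |     =          <          <
D800-DFFF |     >          =          >
E000-FFFF |     >          <          =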

This procedure gives a result for unpaired surrogates rather than throwing an error, but screening out unpaired surrogates is a separate issue.
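The procedure above could be sketched in Java roughly as follows. This is only an illustration under stated assumptions: the class and method names are invented (nothing here comes from XDI2), the swap is applied to the surrogate code units D800-DFFF only, and the well-formedness check is kept separate from the comparison.

```java
final class CodePointOrder {

    // Remap a UTF-16 code unit so that surrogates sort above U+E000-U+FFFF.
    //   0000-D7FF stay as they are
    //   E000-FFFF shift down to D800-F7FF
    //   D800-DFFF shift up to F800-FFFF
    static int sortKey(char c) {
        if (c < 0xD800) {
            return c;
        }
        if (c >= 0xE000) {
            return c - 0x800;
        }
        return c + 0x2000;
    }

    // Compare two strings in Unicode codepoint order (for well-formed input)
    // by comparing remapped code units at the first position where they differ.
    static int compare(String a, String b) {
        int n = Math.min(a.length(), b.length());
        for (int i = 0; i < n; i++) {
            char x = a.charAt(i);
            char y = b.charAt(i);
            if (x != y) {
                return sortKey(x) - sortKey(y);
            }
        }
        return a.length() - b.length();
    }

    // Separate screening step: true if every surrogate is part of a
    // high-then-low pair.
    static boolean isWellFormed(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 >= s.length() || !Character.isLowSurrogate(s.charAt(i + 1))) {
                    return false;
                }
                i++; // skip the low surrogate that completes the pair
            } else if (Character.isLowSurrogate(c)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String supplementary = new String(Character.toChars(0x10000));
        System.out.println(Integer.signum(compare("\uE000", supplementary)));
        System.out.println(isWellFormed(supplementary));
    }
}
```

Since compare is a static method with the Comparator shape, it could be passed as CodePointOrder::compare wherever a Comparator<String> is expected. As noted, compare still returns a result for unpaired surrogates; a caller that wants to reject them would run isWellFormed first.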
