
Subject: Minutes of Special XRI TC call on Syntax Issue #4: XRI Normalization 5pm Pacific Tuesday 9/13
From: "Drummond Reed" <drummond.reed@cordance.net>
To: <xri@lists.oasis-open.org>
Date: Wed, 14 Sep 2005 13:15:10 -0700
XRI TC Members and Observers,

This email contains the minutes of the special TC call held yesterday at 5pm
Pacific time on Syntax Issue #4: XRI Normalization.

Attendees included Drummond Reed, Les Chasen, Nat Sakimura, Peter Davis, and
William Tan.

The minutes have been written up (together with the requirements and the
resulting proposal) on the XRI TC wiki on the issue page at:

	http://wiki.oasis-open.org/xri/Xri2Cd02/SynTax/I4XriNormalization

A copy of this wiki page, including the minutes (as the Discussion section
near the end) is included below for reference (although this page is frankly
much easier to read on the wiki).

=Drummond 


= Requirements/Proposal Page for Syntax Issue #4: XRI Normalization =

[[TableOfContents]]

== Introduction/Motivation ==
This issue was raised by Wil Tan, Les Chasen, and Sharon Nino at NeuStar due
to their implementation experience with [http://www.ietf.org/rfc/rfc3490.txt
Internationalized Domain Names]. Since normalization rules for widely used
infrastructure can be loosened but almost never tightened after adoption,
they recommend that the TC look carefully at specifying Unicode
Normalization Form KC (NFKC) for XRI normalization.

== Status ==
 * Version: 1
 * Action: Active proposal that needs discussion and closure.

== Requirements ==

 * Before XRIs gain wide adoption, establish a clear standard for the
critical issue of how XRIs that use the Unicode character set will be
normalized.
 * Reasonably minimize the discrepancy between user expectation of
equivalence and machine determination of equivalence of XRIs.
 * Minimize the normalization processing burden for the whole of XRI
infrastructure while also making sure no one point in the infrastructure
suffers too great a burden.
 * If possible, ensure that XRI normalization does not create
incompatibility with the IRI and URI specifications.

== Background ==
The following background and spec excerpts are very helpful in understanding
the issue and proposal.

 * The [http://www.ietf.org/rfc/rfc3987.txt IRI specification] requires
Unicode NFC normalization in the original encoding of an IRI (if it is not
already encoded), NOT on the conversion of an IRI to a URI (as some of us
expected). See section 3.1 steps 1a and 1b as well as section 5.3.2.2 (both
excerpted below for easy reference).

 * The [http://www.ietf.org/rfc/rfc3490.txt Internationalized Domain Names
(IDN) specifications] rely on the [http://www.ietf.org/rfc/rfc3454.txt
StringPrep specification], which requires the stricter NFKC rules that
normalize a wider set of "compatibility characters" which are allowed under
Unicode but which make identifier comparison more difficult for both humans
and machines. Note that the IDN specifications (and NFKC) apply to the
ireg-name component of an IRI in certain schemes. The reasons for requiring
NFKC are explained in section 4 of the StringPrep spec (excerpted below for
easy reference).
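
To make the difference between the two forms concrete, here is a minimal
sketch using Python's standard `unicodedata` module. The sample string is an
illustrative assumption (fullwidth Latin letters spelling "xri"), not
something taken from the XRI or IDN specs.

```python
import unicodedata

# Fullwidth Latin small letters x, r, i (U+FF58, U+FF52, U+FF49). These are
# compatibility characters: they look like "xri" but are different code points.
fullwidth = "\uff58\uff52\uff49"

nfc = unicodedata.normalize("NFC", fullwidth)
nfkc = unicodedata.normalize("NFKC", fullwidth)

# NFC only applies canonical mappings, so the fullwidth letters survive.
print(nfc == fullwidth)   # True
# NFKC also applies compatibility mappings, folding them to ASCII letters.
print(nfkc == "xri")      # True
```

Under NFC the two spellings remain distinct identifiers; under NFKC they
compare equal, which is the "stricter" behavior referred to above.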

=== Excerpt of Start Of Section 3.1 Of IRI Spec ===
{{{
   Applications MUST map IRIs to URIs by using the following two steps.

   Step 1.  Generate a UCS character sequence from the original IRI
            format.  This step has the following three variants,
            depending on the form of the input:

            a. If the IRI is written on paper, read aloud, or otherwise
               represented as a sequence of characters independent of
               any character encoding, represent the IRI as a sequence
               of characters from the UCS normalized according to
               Normalization Form C (NFC, [UTR15]).

            b. If the IRI is in some digital representation (e.g., an
               octet stream) in some known non-Unicode character
               encoding, convert the IRI to a sequence of characters
               from the UCS normalized according to NFC.

            c. If the IRI is in a Unicode-based character encoding (for
               example, UTF-8 or UTF-16), do not normalize (see section
               5.3.2.2 for details).  Apply step 2 directly to the
               encoded Unicode character sequence.

   Step 2.  For each character in 'ucschar' or 'iprivate', apply steps
            2.1 through 2.3 below.

       2.1.  Convert the character to a sequence of one or more octets
             using UTF-8 [RFC3629].

       2.2.  Convert each octet to %HH, where HH is the hexadecimal
             notation of the octet value.  Note that this is identical
             to the percent-encoding mechanism in section 2.1 of
             [RFC3986].  To reduce variability, the hexadecimal notation
             SHOULD use uppercase letters.

       2.3.  Replace the original character with the resulting character
             sequence (i.e., a sequence of %HH triplets).
}}}
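
The two steps in this excerpt can be sketched roughly as follows. This is
only an illustration under stated assumptions: the function name
`iri_to_uri` and the `needs_nfc` flag are hypothetical, and the test
`ord(ch) >= 0xA0` is a deliberate simplification of the exact 'ucschar' and
'iprivate' ranges defined in the IRI spec.

```python
import unicodedata

def iri_to_uri(iri: str, needs_nfc: bool = False) -> str:
    """Sketch of the section 3.1 mapping (hypothetical helper).

    needs_nfc=True models variants 1a/1b (non-digital or non-Unicode source);
    needs_nfc=False models variant 1c (already Unicode, do not normalize).
    """
    if needs_nfc:
        iri = unicodedata.normalize("NFC", iri)

    out = []
    for ch in iri:
        # Simplification: treat every code point at or above U+00A0 as
        # 'ucschar' or 'iprivate'. The real grammar excludes some ranges.
        if ord(ch) >= 0xA0:
            # Steps 2.1-2.3: UTF-8 octets, each written as uppercase %HH.
            out.append("".join(f"%{octet:02X}" for octet in ch.encode("utf-8")))
        else:
            out.append(ch)
    return "".join(out)

print(iri_to_uri("http://www.example.org/r\u00e9sum\u00e9.html"))
```

Note that with `needs_nfc=False` a decomposed input is percent-encoded as
is, so "e" followed by U+0301 yields a different URI than the precomposed
U+00E9 does.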

=== Excerpt of Section 5.3.2.2 of IRI Spec ===
{{{
5.3.2.2.  Character Normalization

   The Unicode Standard [UNIV4] defines various equivalences between
   sequences of characters for various purposes.  Unicode Standard Annex
   #15 [UTR15] defines various Normalization Forms for these
   equivalences, in particular Normalization Form C (NFC, Canonical
   Decomposition, followed by Canonical Composition) and Normalization
   Form KC (NFKC, Compatibility Decomposition, followed by Canonical
   Composition).

   Equivalence of IRIs MUST rely on the assumption that IRIs are
   appropriately pre-character-normalized rather than apply character
   normalization when comparing two IRIs.  The exceptions are conversion
   from a non-digital form, and conversion from a non-UCS-based
   character encoding to a UCS-based character encoding. In these cases,
   NFC or a normalizing transcoder using NFC MUST be used for
   interoperability.  To avoid false negatives and problems with
   transcoding, IRIs SHOULD be created by using NFC.  Using NFKC may
   avoid even more problems; for example, by choosing half-width Latin
   letters instead of full-width ones, and full-width instead of
   half-width Katakana.

   As an example, "http://www.example.org/r&#xE9;sum&#xE9;.html" (in XML
   Notation) is in NFC.  On the other hand,
   "http://www.example.org/re&#x301;sume&#x301;.html" is not in NFC.

   The former uses precombined e-acute characters, and the latter uses
   "e" characters followed by combining acute accents.  Both usages are
   defined as canonically equivalent in [UNIV4].

   Note: Because it is unknown how a particular sequence of characters
      is being treated with respect to character normalization, it would
      be inappropriate to allow third parties to normalize an IRI
      arbitrarily.  This does not contradict the recommendation that
      when a resource is created, its IRI should be as character
      normalized as possible (i.e., NFC or even NFKC).  This is similar
      to the uppercase/lowercase problems.  Some parts of a URI are case
      insensitive (domain name).  For others, it is unclear whether they
      are case sensitive, case insensitive, or something in between
      (e.g., case sensitive, but with a multiple choice selection if the
      wrong case is used, instead of a direct negative result).  The
      best recipe is that the creator use a reasonable capitalization
      and, when transferring the URI, capitalization never be changed.

   Various IRI schemes may allow the usage of Internationalized Domain
   Names (IDN) [RFC3490] either in the ireg-name part or elsewhere.
   Character Normalization also applies to IDNs, as discussed in section
   5.3.3.
}}}
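
The example in this excerpt can be reproduced with a short sketch. The
helper `is_nfc` is a hypothetical name introduced here for illustration; it
simply checks whether NFC normalization leaves a string unchanged.

```python
import unicodedata

# The two IRIs from the excerpt, written with Python escapes.
precomposed = "http://www.example.org/r\u00e9sum\u00e9.html"
decomposed = "http://www.example.org/re\u0301sume\u0301.html"

def is_nfc(s: str) -> bool:
    # A string is in NFC if normalizing it to NFC changes nothing.
    return unicodedata.normalize("NFC", s) == s

print(is_nfc(precomposed))  # True
print(is_nfc(decomposed))   # False

# Comparison without normalization, as the IRI spec requires at comparison
# time, treats the two as different identifiers (a false negative).
print(precomposed == decomposed)  # False

# Had the creator applied NFC up front, the two would have matched.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```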

=== Excerpt of Section 4 of StringPrep Spec ===
{{{
4. Normalization

   The output of the mapping step is optionally normalized using one of
   the Unicode normalization forms, as described in [UAX15].  A profile
   can specify one of two options for Unicode normalization:

   - no normalization

   - Unicode normalization with form KC

   A profile MAY choose to do no normalization.  However, such a profile
   can easily yield results that will be surprising to typical users,
   depending on the input mechanism they use.  For example, some input
   mechanisms enter compatibility characters that look exactly like the
   underlying characters, but have different code points.  Another
   example of where Unicode normalization helps create predictable
   results is with characters that have multiple combining diacritics:
   normalization orders those diacritics in a predictable fashion.

   On the other hand, Unicode normalization requires fairly large tables
   and somewhat complicated character reordering logic.  The size and
   complexity should not be considered daunting except in the most
   restricted of environments, and needs to be weighed against the
   problems of user surprise from comparing unnormalized strings.  Note
   that the tables used for normalization are not given in this
   document, but instead must be derived from the Unicode database, as
   described in [UAX15].

   There is a third form of normalization, Unicode normalization with
   form C.  If a profile is going to use a Unicode normalization, it
   MUST use Unicode normalization form KC.  Form KC maps many
   "compatibility characters" to their equivalents.  Some user interface
   systems make it possible to enter compatibility characters instead of
   the base equivalents.  Thus, using form KC instead of form C will
   cause more strings that users would expect to match to actually
   match.
}}}
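
Both situations described in this excerpt can be shown in a few lines. The
sample strings below are assumptions chosen for illustration: a base letter
with two combining dots entered in different orders, and the "fi" ligature
(U+FB01) as a compatibility character.

```python
import unicodedata

def nfkc(s: str) -> str:
    return unicodedata.normalize("NFKC", s)

# Multiple combining diacritics: dot above (U+0307) and dot below (U+0323)
# typed in either order look identical but differ as code point sequences.
dot_above_first = "q\u0307\u0323"
dot_below_first = "q\u0323\u0307"
print(dot_above_first == dot_below_first)              # False
print(nfkc(dot_above_first) == nfkc(dot_below_first))  # True

# Compatibility character: the single-code-point ligature versus "f" + "i".
ligature_word = "\ufb01le"
print(unicodedata.normalize("NFC", ligature_word) == "file")  # False
print(nfkc(ligature_word) == "file")                          # True
```

The first pair would also match under form C, since reordering of combining
marks is part of canonical normalization. Only the ligature case needs form
KC.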


== Proposal ==
''(Note that the following proposal was generated in the special TC call
held on this topic - see the Discussion section below.)''

Revise the appropriate sections of the Syntax spec to require NFKC
normalization as part of encoding an XRI in XRI normal form. This eliminates
the need to specify normalization in the transformation of an XRI into an
IRI, since this normalization will already have been done. Also, because
NFKC is stricter than NFC, it maintains full compatibility with IRI, since
XRIs transformed to IRIs will be a subset of all valid IRIs.
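
The compatibility claim rests on the fact that any string in NFKC is also in
NFC. A minimal sketch of that check follows. The function name
`nfkc_step_for_xri_normal_form` and the sample identifiers are hypothetical,
and the function covers only the normalization step, not the rest of the
encoding into XRI normal form.

```python
import unicodedata

def nfkc_step_for_xri_normal_form(native_xri: str) -> str:
    # Hypothetical helper: only the proposed NFKC step, nothing else.
    return unicodedata.normalize("NFKC", native_xri)

# Illustrative native inputs: decomposed accents, fullwidth letters, ligature.
samples = [
    "=re\u0301sume\u0301",
    "@\uff45\uff58\uff41\uff4d\uff50\uff4c\uff45",
    "=\ufb01le",
]

for native in samples:
    normalized = nfkc_step_for_xri_normal_form(native)
    # Applying NFC afterwards changes nothing, so the result already meets
    # the NFC expectation that the IRI spec places on newly created IRIs.
    assert unicodedata.normalize("NFC", normalized) == normalized
    print(ascii(native), "->", ascii(normalized))
```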


== Discussion ==

Following is a copy of the discussion from the minutes of a special TC call
on this topic held 5PM Pacific on 2005/09/13.

----
Discussion on the call quickly moved from: a) the issue of whether NFKC or
NFC should be specified on the conversion from an XRI in XRI normal form to
IRI normal form, to b) the issue of whether NFKC or NFC should be specified
on the conversion from a native XRI to an XRI in XRI normal form.

The latter approach matches the approach taken in the IRI spec, where
conversion of a "native" IRI (the IRI in a native application before any
encoding has been applied, or when conversion into UTF-8 is necessary)
requires normalization using NFC.

The attendees agreed that IRI was most likely motivated to use NFC by the
huge installed base of existing IRI implementations, which effectively
precluded them from specifying NFKC. However, XRI does not have this
installed base. So there was unanimous agreement that we would be doing
the world of XRI adopters a big service by specifying NFKC as the
normalization requirement *in the conversion of a native XRI into XRI normal
form*.

After lengthy discussion it was agreed that, for the reasons cited in the
third excerpt above (Section 4 of the StringPrep spec), this would be the
best approach for the entirety of XRI infrastructure, since it requires XRIs
to be normalized according to NFKC at their very earliest point of origin in
the infrastructure (the native user interfaces or applications that generate
them). This simplifies the processing and equivalence-checking burden on
every other participant in the infrastructure. As section 4 of the
StringPrep spec says:

"The size and complexity [of Unicode NFKC normalization] should not be
considered daunting [on an application] except in the most restricted of
environments, and needs to be weighed against the problems of user surprise
from comparing unnormalized strings."
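
As a rough illustration of that division of labor, the sketch below
normalizes once at the point of origin and lets everything downstream use
plain string equality. All names here (`at_origin`, `registry`, `resolve`,
the record value) are hypothetical and stand in for whatever a real
deployment would use.

```python
import unicodedata

def at_origin(native_xri: str) -> str:
    # Done once, by the user interface or application creating the XRI.
    return unicodedata.normalize("NFKC", native_xri)

# A downstream participant stores and looks up XRIs by exact match only.
registry = {at_origin("=r\u00e9sum\u00e9"): "record-1"}

def resolve(xri_in_normal_form: str):
    # No normalization tables needed here: a plain dictionary lookup.
    return registry.get(xri_in_normal_form)

# A user types the same name with combining accents; the origin fixes it up.
print(resolve(at_origin("=re\u0301sume\u0301")))  # record-1

# If the origin skipped normalization, the downstream lookup would miss.
print(resolve("=re\u0301sume\u0301"))  # None
```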

Thus our final recommendation is that the appropriate sections of the Syntax
spec be revised to require NFKC normalization as part of encoding an XRI in
XRI normal form. This eliminates the need to specify normalization in the
transformation of an XRI into an IRI, since this normalization will already
have been done. Also, because NFKC is stricter than NFC, it maintains full
compatibility with IRI, since XRIs transformed to IRIs will be a subset of
all valid IRIs.
----