Subject: Minutes of Special XRI TC call on Syntax Issue #4: XRI Normalization 5pm Pacific Tuesday 9/13
XRI TC Members and Observers,

This email contains the minutes of the special TC call held yesterday at 5pm Pacific time on Syntax Issue #4: XRI Normalization. Attendees included Drummond Reed, Les Chasen, Nat Sakimura, Peter Davis, and William Tan.

The minutes have been written up (together with the requirements and the resulting proposal) on the XRI TC wiki on the issue page at:

http://wiki.oasis-open.org/xri/Xri2Cd02/SynTax/I4XriNormalization

A copy of this wiki page, including the minutes (as the Discussion section near the end), is included below for reference (although this page is frankly much easier to read on the wiki).

=Drummond

= Requirements/Proposal Page for Syntax Issue #4: XRI Normalization =

[[TableOfContents]]

== Introduction/Motivation ==

This issue was raised by Wil Tan, Les Chasen, and Sharon Nino at NeuStar due to their implementation experience with [http://www.ietf.org/rfc/rfc3490.txt Internationalized Domain Names]. Since normalization rules for widely used infrastructure can be loosened but almost never tightened after adoption, they recommend that the TC look carefully at specifying Unicode Normalization Form KC (NFKC) for XRI normalization.

== Status ==

* Version: 1
* Action: Active proposal that needs discussion and closure.

== Requirements ==

* Before XRIs gain wide adoption, establish a clear standard for the critical issue of how XRIs that use the Unicode character set will be normalized.
* Reasonably minimize the discrepancy between user expectation of equivalence and machine determination of equivalence of XRIs.
* Minimize the normalization processing burden for the whole of XRI infrastructure while also making sure no one point in the infrastructure suffers too great a burden.
* If possible, ensure that XRI normalization does not create incompatibility with the IRI and URI specifications.

== Background ==

The following background and spec excerpts are very helpful in understanding the issue and proposal.

* The [http://www.ietf.org/rfc/rfc3987.txt IRI specification] requires Unicode NFC normalization in the original encoding of an IRI (if it is not already encoded), NOT on the conversion of an IRI to a URI (as some of us expected). See section 3.1 step 1a and 1b as well as section 5.3.2.2 (both excerpted below for easy reference).
* The [http://www.ietf.org/rfc/rfc3490.txt Internationalized Domain Names (IDN) specifications] rely on the [http://www.ietf.org/rfc/rfc3454.txt StringPrep specification], which requires the stricter NFKC rules that normalize a wider set of "compatibility characters" which are allowed under Unicode but which make identifier comparison more difficult for both humans and machines. Note that the IDN specifications (and NFKC) apply to the ireg-name component of an IRI in certain schemes. The reasons for requiring NFKC are explained in section 4 of the StringPrep spec (excerpted below for easy reference).

=== Excerpt of Start Of Section 3.1 Of IRI Spec ===

{{{
Applications MUST map IRIs to URIs by using the following two steps.

Step 1. Generate a UCS character sequence from the original IRI format. This step has the following three variants, depending on the form of the input:

  a. If the IRI is written on paper, read aloud, or otherwise represented as a sequence of characters independent of any character encoding, represent the IRI as a sequence of characters from the UCS normalized according to Normalization Form C (NFC, [UTR15]).

  b. If the IRI is in some digital representation (e.g., an octet stream) in some known non-Unicode character encoding, convert the IRI to a sequence of characters from the UCS normalized according to NFC.

  c. If the IRI is in a Unicode-based character encoding (for example, UTF-8 or UTF-16), do not normalize (see section 5.3.2.2 for details). Apply step 2 directly to the encoded Unicode character sequence.

Step 2. For each character in 'ucschar' or 'iprivate', apply steps 2.1 through 2.3 below.

  2.1. Convert the character to a sequence of one or more octets using UTF-8 [RFC3629].

  2.2. Convert each octet to %HH, where HH is the hexadecimal notation of the octet value. Note that this is identical to the percent-encoding mechanism in section 2.1 of [RFC3986]. To reduce variability, the hexadecimal notation SHOULD use uppercase letters.

  2.3. Replace the original character with the resulting character sequence (i.e., a sequence of %HH triplets).
}}}

=== Excerpt of Section 5.3.2.2 of IRI Spec ===

{{{
5.3.2.2. Character Normalization

The Unicode Standard [UNIV4] defines various equivalences between sequences of characters for various purposes. Unicode Standard Annex #15 [UTR15] defines various Normalization Forms for these equivalences, in particular Normalization Form C (NFC, Canonical Decomposition, followed by Canonical Composition) and Normalization Form KC (NFKC, Compatibility Decomposition, followed by Canonical Composition).

Equivalence of IRIs MUST rely on the assumption that IRIs are appropriately pre-character-normalized rather than apply character normalization when comparing two IRIs. The exceptions are conversion from a non-digital form, and conversion from a non-UCS-based character encoding to a UCS-based character encoding. In these cases, NFC or a normalizing transcoder using NFC MUST be used for interoperability. To avoid false negatives and problems with transcoding, IRIs SHOULD be created by using NFC. Using NFKC may avoid even more problems; for example, by choosing half-width Latin letters instead of full-width ones, and full-width instead of half-width Katakana.

As an example, "http://www.example.org/r&#xE9;sum&#xE9;.html" (in XML Notation) is in NFC. On the other hand, "http://www.example.org/re&#x301;sume&#x301;.html" is not in NFC. The former uses precombined e-acute characters, and the latter uses "e" characters followed by combining acute accents. Both usages are defined as canonically equivalent in [UNIV4].

Note: Because it is unknown how a particular sequence of characters is being treated with respect to character normalization, it would be inappropriate to allow third parties to normalize an IRI arbitrarily. This does not contradict the recommendation that when a resource is created, its IRI should be as character normalized as possible (i.e., NFC or even NFKC). This is similar to the uppercase/lowercase problems. Some parts of a URI are case insensitive (domain name). For others, it is unclear whether they are case sensitive, case insensitive, or something in between (e.g., case sensitive, but with a multiple choice selection if the wrong case is used, instead of a direct negative result). The best recipe is that the creator use a reasonable capitalization and, when transferring the URI, capitalization never be changed.

Various IRI schemes may allow the usage of Internationalized Domain Names (IDN) [RFC3490] either in the ireg-name part or elsewhere. Character Normalization also applies to IDNs, as discussed in section 5.3.3.
}}}

=== Excerpt of Section 4 of StringPrep Spec ===

{{{
4. Normalization

The output of the mapping step is optionally normalized using one of the Unicode normalization forms, as described in [UAX15]. A profile can specify one of two options for Unicode normalization:

  - no normalization

  - Unicode normalization with form KC

A profile MAY choose to do no normalization. However, such a profile can easily yield results that will be surprising to typical users, depending on the input mechanism they use. For example, some input mechanisms enter compatibility characters that look exactly like the underlying characters, but have different code points. Another example of where Unicode normalization helps create predictable results is with characters that have multiple combining diacritics: normalization orders those diacritics in a predictable fashion.

On the other hand, Unicode normalization requires fairly large tables and somewhat complicated character reordering logic. The size and complexity should not be considered daunting except in the most restricted of environments, and needs to be weighed against the problems of user surprise from comparing unnormalized strings.

Note that the tables used for normalization are not given in this document, but instead must be derived from the Unicode database, as described in [UAX15].

There is a third form of normalization, Unicode normalization with form C. If a profile is going to use a Unicode normalization, it MUST use Unicode normalization form KC. Form KC maps many "compatibility characters" to their equivalents. Some user interface systems make it possible to enter compatibility characters instead of the base equivalents. Thus, using form KC instead of form C will cause more strings that users would expect to match to actually match.
}}}

== Proposal ==

''(Note that the following proposal was generated in the special TC call held on this topic - see the Discussion section below.)''

Revise appropriate sections of the Syntax spec to require NFKC normalization as part of encoding an XRI in XRI normal form. This eliminates the need to specify normalization in the transformation of an XRI into an IRI, since this normalization will already have been done. Also, because NFKC is stricter than NFC, it maintains full compatibility with IRI, since XRIs transformed to IRIs will be a subset of all valid IRIs.

== Discussion ==

Following is a copy of the discussion from the minutes of a special TC call on this topic held 5PM Pacific on 2005/09/13.

----

Discussion on the call quickly moved from:

  a) the issue of whether NFKC or NFC should be specified on the conversion from an XRI in XRI normal form to IRI normal form, to

  b) the issue of whether NFKC or NFC should be specified on the conversion from a native XRI to an XRI in XRI normal form.

The latter approach matches the approach taken in the IRI spec, where conversion of a "native" IRI (the IRI in a native application before any encoding has been applied, or when conversion into UTF-8 is necessary) requires normalization using NFC.

The attendees agreed that IRI was most likely motivated to use NFC by the huge installed base of existing IRI implementations, which effectively precluded them from specifying NFKC. However, XRI does not have this installed base. So there was unanimous agreement that we would be doing the world of XRI adopters a big service by specifying NFKC as the normalization requirement *in the conversion of a native XRI into XRI normal form*.

After lengthy discussion it was agreed that, for the same reasons cited in excerpt #3 above (Section 4 of the StringPrep spec), this would be the best approach for the entirety of XRI infrastructure, since it requires XRIs to be normalized according to NFKC at their very earliest point of origin in the infrastructure (the native user interfaces or applications that generate them). This simplifies the processing and equivalence-checking burden on every other participant in the infrastructure. As section 4 of the StringPrep spec says: "The size and complexity [of Unicode NFKC normalization] should not be considered daunting [on an application] except in the most restricted of environments, and needs to be weighed against the problems of user surprise from comparing unnormalized strings."

Thus our final recommendation is that the appropriate sections of the Syntax spec be revised to require NFKC normalization as part of encoding an XRI in XRI normal form. This eliminates the need to specify normalization in the transformation of an XRI into an IRI, since this normalization will already have been done. Also, because NFKC is stricter than NFC, it maintains full compatibility with IRI, since XRIs transformed to IRIs will be a subset of all valid IRIs.

----
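
The effect of the recommendation can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not text from the Syntax spec: the helper names `to_xri_normal_form` and `percent_encode_non_ascii` are hypothetical, and the second helper approximates step 2 of section 3.1 of the IRI spec by percent-encoding every non-ASCII character rather than checking the exact 'ucschar' and 'iprivate' ranges.

```python
import unicodedata


def to_xri_normal_form(native: str) -> str:
    """Hypothetical helper: apply NFKC at the point where a native XRI is encoded."""
    return unicodedata.normalize("NFKC", native)


def percent_encode_non_ascii(text: str) -> str:
    """Simplified stand-in for step 2 of IRI section 3.1.

    Assumption: every non-ASCII character is encoded; the real rule applies
    only to characters in 'ucschar' or 'iprivate'.
    """
    out = []
    for ch in text:
        if ord(ch) < 0x80:
            out.append(ch)
        else:
            # UTF-8 octets written as uppercase %HH triplets.
            out.extend(f"%{octet:02X}" for octet in ch.encode("utf-8"))
    return "".join(out)


# Inputs a user interface might plausibly produce:
decomposed = "re\u0301sume\u0301"   # "e" followed by combining acute accents
fullwidth = "\uff41\uff42\uff43"    # full-width Latin letters a, b, c
halfwidth_ka = "\uff76"             # half-width Katakana KA

for native in (decomposed, fullwidth, halfwidth_ka):
    normalized = to_xri_normal_form(native)
    # A string in NFKC is also in NFC, so no second normalization pass
    # is needed before the IRI transformation.
    assert unicodedata.normalize("NFC", normalized) == normalized
    print(ascii(native), "->", ascii(normalized), "->",
          percent_encode_non_ascii(normalized))
```

NFC alone would leave the full-width and half-width inputs unchanged, while NFKC folds them to the characters users most likely intended. Because the NFKC result is already in NFC, the later XRI-to-IRI transformation only needs the percent-encoding step.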