
xdi message



Subject: Proposed XDI Identifier Canonicalization Rules for XDI Core


Because of scheduling conflicts, today's meeting was cut short and we did not get to this important topic, which I need to cover in the XDI Addressing section (the final section of Core, which I want to finish this week).

So I wanted to raise the proposal here, on the mailing list, for discussion.

Proposed XDI Identifier Canonicalization Rules

The Identifier Canonicalization section of XDI Namespaces will state the XDI identifier canonicalization rules. I propose the following rules for “maximum canonicalization” of XDI identifiers. The motivation is to ensure the fewest false positives and false negatives in XDI identifier matching.


The main thrust of the proposal is to take the issue of case sensitivity completely off the table for the ASCII character range by allowing only lowercase letters, digits, and three symbol characters (dot, dash, and underscore). This means uppercase ASCII characters MUST be escape encoded, which provides a way to preserve case for systems that need to translate other identifiers into and out of XDI addresses.
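To make the ASCII rule concrete, here is a minimal sketch in Python. It assumes the escape encoding is percent-encoding of the character's byte value, which the proposal does not spell out at this point, and it treats its input as a single identifier label rather than a full XDI address (XDI syntax characters are out of scope). The names `ALLOWED_ASCII` and `escape_ascii` are hypothetical and used only for illustration.

```python
import re
from urllib.parse import unquote

# Proposed ASCII repertoire: lowercase letters, digits, dot, dash, underscore.
ALLOWED_ASCII = re.compile(r"[a-z0-9._-]")


def escape_ascii(label: str) -> str:
    """Escape every ASCII character outside the proposed set.

    Assumption: "escape encoded" means percent-encoded with uppercase
    hex digits. Non-ASCII characters are passed through untouched,
    since their treatment is the open question in this thread.
    """
    out = []
    for ch in label:
        if ord(ch) < 128 and not ALLOWED_ASCII.fullmatch(ch):
            out.append(f"%{ord(ch):02X}")
        else:
            out.append(ch)
    return "".join(out)


# An identifier already in the proposed repertoire is unchanged.
print(escape_ascii("alice.smith-1_x"))

# Uppercase letters are escaped, so the original case can be recovered.
escaped = escape_ascii("Alice")
print(escaped, unquote(escaped))
```

Under these assumptions, an external identifier such as "Alice" and the identifier "alice" map to different canonical strings, so case is preserved without XDI matching ever having to compare letters case-insensitively.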


The one exception is percent-encoding, which I propose (again for purposes of maximum canonicalization) MUST use uppercase ASCII letters, following this advice in RFC 3987:


5.3.2.1. Case Normalization
For all IRIs, the hexadecimal digits within a percent-encoding triplet (e.g., "%3a" versus "%3A") are case-insensitive and therefore should be normalized to use uppercase letters for the digits A - F.
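As a rough illustration of that normalization step, the following Python sketch uppercases the hex digits of every well-formed percent-encoding triplet and leaves everything else alone. The function name `uppercase_triplets` is hypothetical.

```python
import re

# A percent sign followed by exactly two hexadecimal digits.
_TRIPLET = re.compile(r"%[0-9A-Fa-f]{2}")


def uppercase_triplets(s: str) -> str:
    """Normalize percent-encoding triplets to uppercase hex digits."""
    return _TRIPLET.sub(lambda m: m.group(0).upper(), s)


print(uppercase_triplets("a%3ab%e2%82%AC"))
```

Only the triplets are touched, so lowercase letters elsewhere in the identifier stay lowercase, and a stray "%" that is not followed by two hex digits is left as-is for a validator to reject.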


That leaves us with the question of normalization of Unicode characters above the ASCII range. There are two questions here: 1) case folding (case sensitivity), and 2) character normalization.


On the first question, section 5.3.2.1 of RFC 3987 says:


   Creating schemes that allow case-insensitive syntax components
   containing non-ASCII characters should be avoided. Case normalization
   of non-ASCII characters can be culturally dependent and is always a
   complex operation.  The only exception concerns non-ASCII host names
   for which the character normalization includes a mapping step derived
   from case folding.
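A small Python sketch of why this is hard: Python's built-in case operations follow the locale-independent Unicode default mappings, and even those do not round-trip for some non-ASCII letters. The helper name `survives_case_round_trip` is hypothetical.

```python
def survives_case_round_trip(s: str) -> bool:
    """Return True if uppercasing and then lowercasing gives back s."""
    return s.upper().lower() == s


# Plain ASCII letters round-trip cleanly.
print(survives_case_round_trip("abc"))

# U+0131 LATIN SMALL LETTER DOTLESS I uppercases to ASCII "I",
# which lowercases to ASCII "i", so the original letter is lost.
print(survives_case_round_trip("\u0131"))

# U+00DF LATIN SMALL LETTER SHARP S uppercases to the two letters "SS".
print(survives_case_round_trip("\u00df"))

# U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE lowercases to two code
# points: "i" followed by U+0307 COMBINING DOT ABOVE.
print(len("\u0130".lower()))
```

A Turkish-aware implementation would map these letters differently again, which is the kind of cultural dependence the RFC warns about.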



On the second question, section 5.3.2.2 of RFC 3987 says:


   The Unicode Standard [UNIV4] defines various equivalences between
   sequences of characters for various purposes.  Unicode Standard Annex
   #15 [UTR15] defines various Normalization Forms for these
   equivalences, in particular Normalization Form C (NFC, Canonical
   Decomposition, followed by Canonical Composition) and Normalization
   Form KC (NFKC, Compatibility Decomposition, followed by Canonical
   Composition).

   Equivalence of IRIs MUST rely on the assumption that IRIs are
   appropriately pre-character-normalized rather than apply character
   normalization when comparing two IRIs.  The exceptions are conversion
   from a non-digital form, and conversion from a non-UCS-based
   character encoding to a UCS-based character encoding. In these cases,
   NFC or a normalizing transcoder using NFC MUST be used for
   interoperability.  To avoid false negatives and problems with
   transcoding, IRIs SHOULD be created by using NFC.  Using NFKC may
   avoid even more problems; for example, by choosing half-width Latin
   letters instead of full-width ones, and full-width instead of
   half-width Katakana.
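To show the practical difference between the two forms, here is a short sketch using Python's standard `unicodedata` module. The wrapper names `nfc` and `nfkc` are just conveniences for this example.

```python
import unicodedata


def nfc(s: str) -> str:
    """Canonical decomposition followed by canonical composition."""
    return unicodedata.normalize("NFC", s)


def nfkc(s: str) -> str:
    """Compatibility decomposition followed by canonical composition."""
    return unicodedata.normalize("NFKC", s)


# Both forms compose "e" + U+0301 COMBINING ACUTE ACCENT into U+00E9.
decomposed = "e\u0301"
print(nfc(decomposed) == "\u00e9", nfkc(decomposed) == "\u00e9")

# Only NFKC folds compatibility variants:
# U+FF21 FULLWIDTH LATIN CAPITAL LETTER A becomes ASCII "A" ...
fullwidth_a = "\uff21"
print(nfc(fullwidth_a) == fullwidth_a, nfkc(fullwidth_a) == "A")

# ... and U+FF76 HALFWIDTH KATAKANA LETTER KA becomes U+30AB.
halfwidth_ka = "\uff76"
print(nfc(halfwidth_ka) == halfwidth_ka, nfkc(halfwidth_ka) == "\u30ab")
```

So NFC removes differences in how the same character is composed, while NFKC additionally collapses width variants like the ones named in the RFC text, at the cost of losing those distinctions.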


Joseph, on these two questions, what do you recommend?




