Subject: Proposed XDI Identifier Canonicalization Rules for XDI Core
The Identifier Canonicalization section of XDI Namespaces will specify the rules for canonicalizing XDI identifiers. I propose the following rules for “maximum canonicalization” of XDI identifiers. The motivation is to ensure the fewest false positives and false negatives in XDI identifier matching.
The main thrust of the proposal is to take the issue of case-sensitivity completely off the table for the ASCII character range by allowing only lowercase letters, digits, and three symbol characters (dot, dash, and underscore). This means uppercase ASCII characters MUST be escape encoded, which provides a way to preserve case sensitivity for systems that need to translate other identifiers into and out of XDI addresses.
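To make the restriction concrete, here is a minimal sketch of canonicalizing an ASCII-range identifier segment under these rules. The proposal does not fix the exact escape syntax, so this sketch assumes percent-encoding is the escape mechanism; the function name and the restriction to the ASCII range are illustrative assumptions, not part of the proposal text.

```python
def canonicalize_ascii(segment: str) -> str:
    """Allow only lowercase ASCII letters, digits, '.', '-', and '_';
    percent-encode every other ASCII character, including uppercase
    letters (assumed escape mechanism -- the proposal only says
    'escape encoded'). Handles the ASCII range only."""
    allowed = set("abcdefghijklmnopqrstuvwxyz0123456789.-_")
    out = []
    for ch in segment:
        if ch in allowed:
            out.append(ch)
        elif ord(ch) < 128:
            # uppercase hex digits, per the RFC 3987 advice quoted below
            out.append("%{:02X}".format(ord(ch)))
        else:
            raise ValueError("non-ASCII character; see the Unicode questions below")
    return "".join(out)

print(canonicalize_ascii("ExampleName"))   # 'E' and 'N' are escape encoded
print(canonicalize_ascii("foo.bar_1-2"))   # already canonical; unchanged
```

Note how escaping (rather than folding) uppercase letters keeps the round trip lossless: a system translating an external, case-sensitive identifier into an XDI address can recover the original case by unescaping.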
The one exception is percent-encoding, which I propose (again for purposes of maximum canonicalization) MUST use uppercase ASCII letters, following this advice in RFC 3987:
5.3.2.1. Case Normalization
For all IRIs, the hexadecimal digits within a percent-encoding triplet (e.g., "%3a" versus "%3A") are case-insensitive and therefore should be normalized to use uppercase letters for the digits A - F.
That leaves us with the question of normalization of Unicode characters above the ASCII range. There are two questions here: 1) case folding (case sensitivity), and 2) character normalization.
On the first question, section 5.3.2.1 of RFC 3987 says:
Creating schemes that allow case-insensitive syntax components containing non-ASCII characters should be avoided. Case normalization of non-ASCII characters can be culturally dependent and is always a complex operation. The only exception concerns non-ASCII host names for which the character normalization includes a mapping step derived from case folding.
On the second question, section 5.3.2.2 of RFC 3987 says:
The Unicode Standard [UNIV4] defines various equivalences between sequences of characters for various purposes. Unicode Standard Annex #15 [UTR15] defines various Normalization Forms for these equivalences, in particular Normalization Form C (NFC, Canonical Decomposition, followed by Canonical Composition) and Normalization Form KC (NFKC, Compatibility Decomposition, followed by Canonical Composition). Equivalence of IRIs MUST rely on the assumption that IRIs are appropriately pre-character-normalized rather than apply character normalization when comparing two IRIs. The exceptions are conversion from a non-digital form, and conversion from a non-UCS-based character encoding to a UCS-based character encoding. In these cases, NFC or a normalizing transcoder using NFC MUST be used for interoperability. To avoid false negatives and problems with transcoding, IRIs SHOULD be created by using NFC. Using NFKC may avoid even more problems; for example, by choosing half-width Latin letters instead of full-width ones, and full-width instead of half-width Katakana.
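To illustrate why the character-normalization question matters for matching, here is a small example of the NFC equivalence the quoted text describes, using Python's standard `unicodedata` module. Without a shared normalization form, two byte-for-byte different sequences that render identically would fail to match:

```python
import unicodedata

# U+00E9 (precomposed 'é') vs. 'e' + U+0301 (combining acute accent):
# canonically equivalent, but different code point sequences.
precomposed = "\u00e9"
decomposed = "e\u0301"

print(precomposed == decomposed)   # False: a false negative without normalization

# NFC (Canonical Decomposition followed by Canonical Composition)
# maps both to the single precomposed code point.
nfc_a = unicodedata.normalize("NFC", precomposed)
nfc_b = unicodedata.normalize("NFC", decomposed)
print(nfc_a == nfc_b)              # True after NFC
```

This is consistent with the quoted MUST/SHOULD structure: comparison assumes identifiers were created in NFC, rather than normalizing at comparison time.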
Joseph, on these two questions, what do you recommend?