Subject: Background on XRI I18N issues for tomorrow's TC call (long)
On tomorrow's telecon (7/10, 3PM PDT), the #1 issue we'd like to close is our overall 1.0 design approach to internationalization (I18N). Nat Sakimura has done the hard work on the Editor's TC of preparing a proposal that deals with what can be very daunting issues to those who don't deal with I18N on a regular basis (a hearty thanks, Nat!). Nat will be on the call to lead discussion of this issue (unfortunately Gabe, who has been working with Nat on this, is on vacation and will not be able to attend).

Several of the other editors, including Dave McAlpin and myself, have been researching a number of the related specs so we can fully understand the issues involved. Thankfully, the more we learn, the more we appreciate the elegance and power of Nat's proposal.

The purpose of this email is to help focus discussion tomorrow by:

a) summarizing Nat's proposal to everyone on the TC,
b) providing his actual first draft text for your reference, and
c) providing several excerpts from relevant specifications, collected for easy reference, to help illuminate and justify Nat's conclusions.

SUMMARY OF NAT'S PROPOSAL

The starting point of Nat's proposal is, to quote him directly, "IMHO XRI should be internationalized from the beginning. Introducing XRI and IXRI [an internationalized version of the XRI syntax based on IRI] separately will create unnecessary confusion and uncleanness in the implementation as well as no adoption in reality. We should design the system UTF-8 clean, and % escaping of non-ascii characters should happen only as the last resort."

The core issue is that because RFC 2396 and 2396bis are based strictly on the US-ASCII character set, they require escaping of any character outside of this set. So any URI scheme based directly on 2396 syntax cannot be internationalized, i.e., it cannot contain characters in a native script other than US-ASCII.
By contrast, an internationalized URI syntax (hereinafter referred to as IRI, after the Internet Draft of the same name - see http://www.ietf.org/internet-drafts/draft-duerst-iri-04.txt) solves this problem by enhancing generic 2396 URI syntax in three ways:

1) expanding the legal (unreserved) character set to include Unicode characters (think of this as the "human-readable IRI"),
2) specifying how those characters are to be encoded (in UTF-8) for purposes of machine processing (think of this as the phase transition from "human-readable IRI" to "machine-readable IRI"), and
3) specifying how this UTF-8 encoded string is escaped into a legal US-ASCII URI string (think of this as the phase transition from "machine-readable IRI" to "strict URI").

The point of the IRI spec is *not* to define a new URI scheme, but to define an internationalized version of RFC 2396 URI syntax from which new URI schemes can derive directly (instead of from 2396) in order to accommodate I18N character sets right from the start.

Nat's proposal is to do just that. Even though the IRI spec is not yet a referenceable RFC, it is relatively mature (on its fourth draft, just released). So Nat suggests we go ahead and inherit the changes in generic URI syntax that it proposes, thereby making XRI syntax fully internationalized from the start. Specifically, this means that our concept of an XRI would now be expanded to include 3 "levels" of XRIs:

IRI Level (highest): The XRI consists of the fully unescaped human-readable native character set (including reserved syntactic characters as defined in our 06 draft).

UTF-8 Level (intermediate): The XRI is fully UTF-8 encoded; however, that encoding is not escaped as per 2396 rules.

URI Level (lowest): The XRI is fully escaped to US-ASCII as per RFC 2396 escaping rules.
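
To make the three levels concrete, here is a minimal Python sketch. The XRI string and the function names are hypothetical, chosen only for illustration; they are not taken from Nat's proposal or from any XRI draft. The sketch escapes only non-ASCII octets and ignores reserved ASCII characters and any hostname handling.

```python
# Hypothetical example identifier; the authority and path are illustrative only.
IRI_LEVEL = "xri://example/r\u00e9sum\u00e9"  # human-readable, e-acute unescaped


def to_utf8_level(iri_level: str) -> bytes:
    """IRI level -> UTF-8 level: encode the characters as UTF-8 octets."""
    return iri_level.encode("utf-8")


def to_uri_level(utf8_level: bytes) -> str:
    """UTF-8 level -> URI level: write each non-ASCII octet as %HH."""
    return "".join(
        chr(octet) if octet < 0x80 else f"%{octet:02X}" for octet in utf8_level
    )


if __name__ == "__main__":
    utf8_level = to_utf8_level(IRI_LEVEL)
    print(IRI_LEVEL)                 # IRI level
    print(utf8_level)                # UTF-8 level (raw octets)
    print(to_uri_level(utf8_level))  # URI level (US-ASCII only)
```

The point of keeping the two functions separate is that a protocol able to carry UTF-8 can stop after the first one; percent-escaping is only needed when a strict 2396 URI is required.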

As Nat's text above explains, since more and more software is able to display IRI level XRIs, and more protocols and formats (like XML) are able to deal with UTF-8 level XRIs, it will only be in the "last resort", where the underlying protocols require strict 2396-conformant URIs, that the XRI will have to be "percent escaped" down to the URI level. (This is buttressed by the fact that the IETF itself, in RFC 2277, IETF Policy on Character Sets and Languages, recommends that all new IETF specs that deal with text support UTF-8 at a minimum - see the reference included below).

Finally, Nat's proposal reflects that UTF-8 does not currently have a way of encoding the metadata necessary to convert from the UTF-8 level back up to the IRI level. However, XRI cross-reference syntax (specifically the $ space) conveniently provides a way to encode such metadata following IETF rules for language identifiers, so Nat proposes a solution for this final step of the ladder.

Below is Nat's first draft proposed text, followed by annotated references from several of the key underlying specifications.

NAT'S FIRST DRAFT PROPOSED TEXT

2.3 Character Encoding and Internationalization

The basic character encoding of XRI is UTF-8, as recommended by [RFC2718]. Since XRI is a human readable identifier, the representation of the XRI in the underlying document should use the character encoding of the underlying document. However, this string must be converted to UTF-8 before any further processing. Thus, URI conversion must be made only after UTF-8 conversion.

In general, conversion between local language encoding representation and URI representation will require the following two steps.

1. Conversion between Local language encoding and UTF-8
2. Conversion between UTF-8 and URI

2.3.1 Local language encoding to UTF-8 conversion

To represent the glyph of UTF-8 string correctly, language information and font information may be required.
One shortcoming of UTF-8 is that it does not necessarily carry this information with it. On the other hand, local language encoding always has the language information associated with it. Thus, to make it possible to revert back to the local language representation, there has to be a way to record the language and font context. To accommodate this requirement, XRI facilitates the markup by use of cross references and the $l special identifier defined in Appendix B. Once the language and font context is set up, it will be valid until it is reset by another cross reference. [Note: It may be better to use the 14th plane of the ISO 10646].

Example:

xri://($l/en/Times).english.($l/en/Arial)string.($l/ja).japaneseString.($l/ko).koreanString.($l/ch).chineseString

When converting the local language encoding, it must be converted to a sequence of characters from the UCS normalized according to Normalization Form C.

2.3.2 Conversion between UTF-8 and URI

To convert UTF-8 to RFC2396 format, hostname and other parts need to be treated separately. For hostname, the conversion must use Punycode. For other parts, the conversion must use the escaping method defined in section 2.2.3.

REFERENCES

[Ed. note: these are numbered for easy reference on the phone call tomorrow]

[1] EXCERPT FROM RFC 2277 - IETF Policy on Character Sets and Languages (http://www.faqs.org/rfcs/rfc2277.html)

[Ed. note: I'm including this just to point out that IETF recommends we create a separate heading in a spec to call out our I18N design choices. They put it right up there with Security Considerations.]

6. Documenting internationalization decisions

In documents that deal with internationalization issues at all, a synopsis of the approaches chosen for internationalization SHOULD be collected into a section called "Internationalization considerations", and placed next to the Security Considerations section.
This provides an easy reference for those who are looking for advice on these issues when implementing the protocol.

[2] EXCERPT #1 FROM IRI INTERNET DRAFT 04

[Ed. note: this explains the relative maturity of IRIs, and in particular that they are already allowed in XML, XLink, and XML Schema.]

6.3 Format of URIs and IRIs in Documents and Protocols

[portions cut here]

Note: Some formats already accommodate IRIs, although they use different terminology. HTML 4.0 [HTML4] defines the conversion from IRIs to URIs as error-avoiding behavior. XML 1.0 [XML1], XLink [XLink], and XML Schema [XMLSchema] and specifications based upon them allow IRIs. Also, it is expected that all relevant new W3C formats and protocols will be required to handle IRIs [CharMod].

[3] EXCERPT #2 FROM IRI INTERNET DRAFT 04

[Ed. note: this is the section of the IRI spec that explains the use of UTF-8 as the standard encoding for an IRI. Note especially the second paragraph, where it points out that RFC 2718 recommends the use of UTF-8 encoding for all new URI schemes, and that this recommendation was followed in the URN spec, RFC 2141 (also excerpted below).]

6.4 Use of UTF-8 for Encoding Original Characters

This section discusses details and gives examples for point c) in Section 1.2. In order to be able to use IRIs, the URI corresponding to the IRI in question has to encode original characters into octets using UTF-8. This can be specified for all URIs of an URI scheme, or can apply to individual URIs for schemes that do not specify how to encode original characters. It can apply to the whole URI, or only some part.

For new URI schemes, using UTF-8 is recommended in [RFC2718]. Examples where this is already used are the URN syntax [RFC2141], IMAP URLs [RFC2192], and POP URLs [RFC2384]. On the other hand, the HTTP URL scheme does not specify how to encode original characters, and therefore IRIs only can be used for some HTTP URLs.
For example, for a document with a URI of http://www.example.org/r%C3%A9sum%C3%A9.html, it is possible to construct a corresponding IRI (in XML notation, see Section 1.4): http://www.example.org/r&#xE9;sum&#xE9;.html (&#xE9; stands for the e-acute character, and %C3%A9 is the UTF-8 encoded and escaped representation of that character). On the other hand, for a document with an URI of http://www.example.org/r%E9sum%E9.html, the escaped octets cannot be converted to actual characters in an IRI, because the escaping is not based on UTF-8.

The requirement for the use of UTF-8 applies to all parts of an URI, with the exception of the ihostname part. However, it is possible that the capability of IRIs to represent a wide range of characters directly is used just in some parts of the IRI (or IRI reference). The other parts of the IRI may only contain ASCII characters, or they may not be based on UTF-8. They may be based on another encoding, or they may directly encode raw binary data (see also [RFC2397]).

For example, it is possible to have an URI reference of http://www.example.org/r%E9sum%E9.xml#r%C3%A9sum%C3%A9, where the document name is encoded in iso-8859-1 based on server settings, but the fragment identifier is encoded in UTF-8 according to [XPointer]. The IRI corresponding to the above URI would be (in XML notation) http://www.example.org/r%E9sum%E9.xml#r&#xE9;sum&#xE9;.

[4] EXCERPT FROM RFC 2141 - URN SYNTAX (http://www.ietf.org/rfc/rfc2141.txt)

[Ed. note: this is the section of 2141 where UTF-8 encoding is specified for all URNs - in particular the last paragraph of this excerpt. The reference there ("[5]") is to the Unicode standard specifying UTF-8.]

2.2 Namespace Specific String Syntax

As required by RFC 1737, there is a single canonical representation of the NSS portion of an URN.
The format of this single canonical form follows:

    <NSS>         ::= 1*<URN chars>
    <URN chars>   ::= <trans> | "%" <hex> <hex>
    <trans>       ::= <upper> | <lower> | <number> | <other> | <reserved>
    <hex>         ::= <number> | "A" | "B" | "C" | "D" | "E" | "F" |
                      "a" | "b" | "c" | "d" | "e" | "f"
    <other>       ::= "(" | ")" | "+" | "," | "-" | "." |
                      ":" | "=" | "@" | ";" | "$" |
                      "_" | "!" | "*" | "'"

Depending on the rules governing a namespace, valid identifiers in a namespace might contain characters that are not members of the URN character set above (<URN chars>). Such strings MUST be translated into canonical NSS format before using them as protocol elements or otherwise passing them on to other applications. Translation is done by encoding each character outside the URN character set as a sequence of one to six octets using UTF-8 encoding [5], and the encoding of each of those octets as "%" followed by two characters from the <hex> character set above. The two characters give the hexadecimal representation of that octet.

[5] EXCERPT #3 FROM IRI INTERNET DRAFT 04

[Ed. note: This is the section of the IRI spec where they explain how to go from the IRI level down through the UTF-8 level to the URI level. We will probably need to include this completely in the XRI spec.]

3.1 Mapping of IRIs to URIs

This section defines how to map an IRI to a URI. Everything in this section applies also to IRI references and URI references, as well as components thereof (for example fragment identifiers).

This mapping has two purposes:

a) Syntactical: Many URI schemes and components define additional syntactical restrictions not captured in Section 2.2. Such restrictions can be applied to IRIs by noting that IRIs are only valid if they map to syntactically valid URIs. This means that such syntactical restrictions do not have to be defined again on the IRI level.

b) Interpretational: URIs identify resources in various ways. IRIs also identify resources.
When the IRI is used solely for identification purposes, it is not necessary to map the IRI to an URI (see Section 5). However, when an IRI is used for resource retrieval, the resource that the IRI locates is the same as the one located by the URI obtained after converting the IRI according to the procedure defined here. This means that there is no need to define resolution separately on the IRI level.

Applications MUST map IRIs to URIs using the following two steps.

Step 1) This step generates a UCS-based encoding from the original IRI format. This step has three variants, depending on the form of the input.

Variant A) If the IRI is written on paper or read out loud, or otherwise represented as a sequence of characters independent of any encoding: Represent the IRI as a sequence of characters from the UCS normalized according to Normalization Form C (NFC, [UTR15]).

Variant B) If the IRI is in some digital representation (e.g. an octet stream) in some known non-Unicode encoding: Convert the IRI to a sequence of characters from the UCS normalized according to NFC.

Variant C) If the IRI is in an Unicode-based encoding (for example UTF-8 or UTF-16): Do not normalize. Move directly to Step 2.

Step 2) If the IRI contains an 'ihostname' part, replace this 'ihostname' part by the part converted using the ToASCII operation specified in Section 4.1 of [RFC3490], with the flag UseSTD3ASCIIRules set to TRUE and the flag AllowUnassigned set to FALSE for creating IRIs and set to TRUE otherwise.

Step 3) For each character that is disallowed in URI references, apply steps 1) through 3) below. The disallowed characters consist of all non-ASCII characters allowed in IRIs.

1) Convert the character to a sequence of one or more octets using UTF-8 [RFCXXXX].

2) Convert each octet to %HH, where HH is the hexadecimal notation of the octet value. Note: This is identical to the escaping mechanism in Section 2.4.1 of [RFC2396].
Note: To reduce variability, the hexadecimal notation SHOULD use upper case letters.

3) Replace the original character by the resulting character sequence (i.e. a sequence of %HH triplets).
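
As a rough illustration of the mapping quoted above, the following Python sketch covers the NFC normalization of Step 1 (as an opt-in, since Variant C says not to normalize input that is already Unicode-encoded) and the per-character escaping of Step 3. The function name and the `normalize` parameter are assumptions made for this example. The 'ihostname' ToASCII conversion of Step 2 is deliberately left out, so this should be read as a sketch of the escaping step only, not a complete implementation of the draft.

```python
import unicodedata


def map_iri_to_uri(iri: str, normalize: bool = False) -> str:
    """Escape every non-ASCII character as upper-case %HH UTF-8 triplets.

    Hypothetical helper: the ihostname (ToASCII) step is not handled here.
    """
    if normalize:
        # Variants A and B of Step 1: bring the characters into NFC first.
        iri = unicodedata.normalize("NFC", iri)

    pieces = []
    for ch in iri:
        if ord(ch) < 0x80:
            # ASCII characters are left untouched by this step.
            pieces.append(ch)
        else:
            # 1) character -> UTF-8 octets, 2) each octet -> %HH (upper case),
            # 3) the triplets replace the original character.
            pieces.extend(f"%{octet:02X}" for octet in ch.encode("utf-8"))
    return "".join(pieces)


if __name__ == "__main__":
    print(map_iri_to_uri("http://www.example.org/r\u00e9sum\u00e9.html"))
    # A decomposed "e" + combining acute gives the same result once normalized.
    print(map_iri_to_uri("http://www.example.org/re\u0301sume\u0301.html",
                         normalize=True))
```

Both calls should yield the escaped form used in the draft's own example, http://www.example.org/r%C3%A9sum%C3%A9.html.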