Subject: Background on XRI I18N issues for tomorrow's TC call (long)
From: "Drummond Reed" <drummond.reed@onename.com>
To: <xri@lists.oasis-open.org>
Date: Thu, 10 Jul 2003 00:55:03 -0700

On tomorrow's telecon (7/10, 3PM PDT), the #1 issue we'd like to close
is our overall 1.0 design approach to internationalization (I18N). Nat
Sakimura has done the hard work on the Editor's TC of preparing a
proposal that deals with what can be very daunting issues to those who
don't deal with I18N on a regular basis (a hearty thanks, Nat!). Nat
will be on the call to lead discussion of this issue (unfortunately
Gabe, who has been working with Nat on this, is on vacation and will not
be able to attend).

Several of the other editors, including Dave McAlpin and myself, have
been researching a number of the related specs so we can fully
understand the issues involved. Thankfully, the more we learn, the more
we appreciate the elegance and power of Nat's proposal.

The purpose of this email is to help focus discussion tomorrow by: a)
summarizing Nat's proposal for everyone on the TC, b) providing his
actual first draft text for your reference, and c) collecting excerpts
from the relevant specifications that help illuminate and justify Nat's
conclusions.

SUMMARY OF NAT'S PROPOSAL

The starting point of Nat's proposal is, to quote him directly, "IMHO
XRI should be internationalized from the beginning. Introducing XRI and
IXRI [an internationalized version of the XRI syntax based on IRI]
separately will create unnecessary confusion and uncleanness in the
implementation as well as no adoption in reality. We should design the
system UTF-8 clean, and % escaping of non-ascii characters should happen
only as the last resort."

The core issue is that because RFC 2396 and 2396bis are based strictly
on the US-ASCII character set, they require escaping of any character
outside of this set. So any URI scheme based directly on 2396 syntax
cannot be internationalized, i.e., it cannot contain characters in a
native script other than US-ASCII. 

By contrast an internationalized URI syntax (hereinafter referred to as
IRI, after the Internet Draft of the same name - see
http://www.ietf.org/internet-drafts/draft-duerst-iri-04.txt) solves this
problem by enhancing generic 2396 URI syntax in three ways: 1) expanding
the legal (unreserved) character set to include Unicode characters
(think of this as the "human-readable IRI"), 2) specifying how those
characters are to be encoded (in UTF-8) for purposes of machine
processing (think of this as the phase transition from "human-readable
IRI" to "machine-readable IRI"), and 3) specifying how this UTF-8
encoded string is escaped into a legal US-ASCII URI string (think of
this as the phase transition from "machine-readable IRI" to "strict
URI").

The point of the IRI spec is *not* to define a new URI scheme, but to
define an internationalized version of RFC 2396 URI syntax from which
new URI schemes can derive directly (instead of from 2396) in order to
accommodate I18N character sets right from the start.

Nat's proposal is to do just that. Even though the IRI spec is not yet a
referenceable RFC, it is relatively mature (on its fourth draft, just
released). So Nat suggests we go ahead and inherit the changes in
generic URI syntax that it proposes, thereby making XRI syntax fully
internationalized from the start. Specifically this means that our
concept of an XRI would now be expanded to include 3 "levels" of XRIs:

IRI Level (highest): The XRI consists of the fully unescaped
human-readable native character set (including reserved syntactic
characters as defined in our 06 draft).

UTF-8 Level (intermediate): The XRI is fully UTF-8 encoded, however that
encoding is not escaped as per 2396 rules.

URI Level (lowest): The XRI is fully escaped to US-ASCII as per RFC 2396
escaping rules.
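
To make the three levels concrete, here is a minimal Python sketch. It
is only an illustration, not part of Nat's proposal: the XRI string and
its segment names are made up, and the set of characters left unescaped
is an assumption chosen for this example rather than the reserved set
from our 06 draft.

```python
from urllib.parse import quote

# Hypothetical XRI, used only for illustration.
# IRI level: native characters, here "r\u00e9sum\u00e9" with e-acute.
iri_level = "xri://example/r\u00e9sum\u00e9"

# UTF-8 level: the same characters encoded as UTF-8 octets, not escaped.
utf8_level = iri_level.encode("utf-8")

# URI level: every non-ASCII octet becomes %HH. The ASCII characters in
# `safe` (an assumed list) are passed through untouched.
uri_level = quote(utf8_level, safe=":/()$.*+=@!,;'-_~")

print(iri_level)
print(utf8_level)
print(uri_level)
```

Each step down loses nothing: the URI level string can be unescaped back
to the same UTF-8 octets, which decode back to the same characters.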

As Nat's text above explains, since more and more software is able to
display IRI level XRIs, and more protocols and formats (like XML) are
able to deal with UTF-8 level XRIs, it will only be in the "last resort"
where the underlying protocols require strict 2396-conformant URIs that
the XRI will have to be "percent escaped" down to the URI level. (This
is buttressed by the fact that the IETF itself, in RFC 2277, IETF Policy
on Character Sets and Languages, recommends that all new IETF specs that
deal with text support UTF-8 at a minimum - see the reference included
below).

Finally, Nat's proposal reflects that UTF-8 does not currently have a
way of encoding the metadata necessary to convert from the UTF-8 level
back up to the IRI level. However XRI cross-reference syntax
(specifically the $ space) conveniently provides a way to encode such
metadata following IETF rules for language identifiers, so Nat proposes
a solution for this final step of the ladder.

Below is Nat's first draft proposed text, followed by annotated
references from several of the key underlying specifications.

NAT'S FIRST DRAFT PROPOSED TEXT

2.3 Character Encoding and Internationalization
The basic character encoding of XRI is UTF-8 as per recommended by
[RFC2718]. Since XRI is a human readable identifier, the representation
of the XRI on the underlying document should use the character encoding
of the underlying document. However, this string must be converted to
UTF-8 before any further processing. Thus, URI conversion must be made
only after UTF-8 conversion. In general, conversion between local
language encoding representation and URI representation will require the
following two steps.  

  1.	Conversion between Local language encoding and UTF-8
  2.	Conversion between UTF-8 and URI

2.3.1 Local language encoding to UTF-8 conversion
To represent the glyph of UTF-8 string correctly, language information
and font information may be required. One short coming of UTF-8 is that
it does not necessarily carry these information with it. On the other
hand, local language encoding always has the language information
associated with it. Thus, to make it possible to revert back to the
local language representation, there has to be a way to record the
language and font context. To accommodate this requirement, XRI
facilitates the mark up by use of cross references and $l special
identifier defined in Appendix B. Once the language and font context is
set up, this will be valid until it is reset by another cross reference.
[Note: It may be better to use the the 14th plane of the ISO 10646]. 

Example: 
xri://($l/en/Times).english.($l/en/Arial)string.($l/ja).japaneseString.($l/ko).koreanString.($l/ch).chineseString

When converting the local language encoding, it must be converted to a
sequence of characters from the UCS normalized according to
Normalization Form C. 
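
[Ed. note: as a rough illustration of this first step, the Python sketch
below decodes octets from a known local encoding, normalizes to NFC, and
emits UTF-8. The function name local_to_utf8 is hypothetical and the
sample strings are mine, not Nat's. It does not attempt the $l language
and font markup described above.]

```python
import unicodedata


def local_to_utf8(raw: bytes, local_encoding: str) -> bytes:
    """Decode from a known local encoding, normalize to NFC, encode as UTF-8."""
    text = raw.decode(local_encoding)
    normalized = unicodedata.normalize("NFC", text)
    return normalized.encode("utf-8")


# "r\u00e9sum\u00e9" in ISO-8859-1: each e-acute is the single octet 0xE9.
latin1_octets = b"r\xe9sum\xe9"
print(local_to_utf8(latin1_octets, "iso-8859-1"))

# NFC matters when the input arrives decomposed: "e" followed by the
# combining acute accent U+0301 is composed into the single code point U+00E9.
decomposed = "re\u0301sume\u0301"
print(unicodedata.normalize("NFC", decomposed) == "r\u00e9sum\u00e9")
```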

2.3.2 Conversion between UTF-8 and URI
To convert UTF-8 to RFC2396 format, hostname and other parts needs to be
treated separately. For hostname, the conversion must use Punycode. For
other parts, the conversion must use the escaping method defined in
section 2.2.3. 
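
[Ed. note: a minimal Python sketch of treating the two parts separately.
It assumes the escaping method of section 2.2.3 is ordinary %HH escaping
of UTF-8 octets, and it uses Python's built-in "idna" codec, which
applies the RFC 3490 ToASCII operation (Punycode per label). The
function name to_uri_level and the sample host and path are
hypothetical.]

```python
from urllib.parse import quote


def to_uri_level(host: str, path: str) -> str:
    """Convert a host and a path to US-ASCII using different rules for each."""
    # Host: Punycode, label by label, via the standard "idna" codec.
    ascii_host = host.encode("idna").decode("ascii")
    # Other parts: encode as UTF-8, then percent-escape each non-ASCII octet.
    escaped_path = quote(path, safe="/")
    return ascii_host + escaped_path


print(to_uri_level("b\u00fccher.example", "/r\u00e9sum\u00e9"))
```

The point of the split is that a hostname has to stay resolvable in DNS,
which does not understand percent escapes, while the remaining parts only
need to survive as legal RFC 2396 characters.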


REFERENCES 
[Ed. note: these are numbered for easy reference on the phone call
tomorrow]


[1] EXCERPT FROM RFC 2277 - IETF Policy on Character Sets and Languages
(http://www.faqs.org/rfcs/rfc2277.html) 
[Ed. note: I'm including this just to point out that IETF recommends we
create a separate heading in a spec to call out our I18N design choices.
They put it right up there with Security Considerations.]

6.  Documenting internationalization decisions

   In documents that deal with internationalization issues at all, a
   synopsis of the approaches chosen for internationalization SHOULD be
   collected into a section called "Internationalization
   considerations", and placed next to the Security Considerations
   section.

   This provides an easy reference for those who are looking for advice
   on these issues when implementing the protocol.


[2] EXCERPT #1 FROM IRI INTERNET DRAFT 04
[Ed. note: this explains the relative maturity of IRIs, and in
particular that they are already allowed in XML, XLink, and XML Schema.]

6.3 Format of URIs and IRIs in Documents and Protocols

[portions cut here]

    Note: Some formats already accommodate IRIs, although they use
    different terminology.  HTML 4.0 [HTML4] defines the conversion from
    IRIs to URIs as error-avoiding behavior.  XML 1.0 [XML1], XLink
    [XLink], and XML Schema [XMLSchema] and specifications based upon
    them allow IRIs.  Also, it is expected that all relevant new W3C
    formats and protocols will be required to handle IRIs [CharMod].


[3] EXCERPT #2 FROM IRI INTERNET DRAFT 04

[Ed. note: this is the section of the IRI spec that explains the use of
UTF-8 as the standard encoding for an IRI. Note especially the second
paragraph, which points out that RFC 2718 recommends UTF-8 encoding for
all new URI schemes, and that this recommendation was followed in the
URN spec, RFC 2141 (also excerpted below).]

6.4 Use of UTF-8 for Encoding Original Characters

    This section discusses details and gives examples for point c) in
    Section 1.2.  In order to be able to use IRIs, the URI corresponding
    to the IRI in question has to encode original characters into octets
    using UTF-8.  This can be specified for all URIs of an URI scheme, or
    can apply to individual URIs for schemes that do not specify how to
    encode original characters.  It can apply to the whole URI, or only
    some part.

    For new URI schemes, using UTF-8 is recommended in [RFC2718].
    Examples where this is already used are the URN syntax [RFC2141],
    IMAP URLs [RFC2192], and POP URLs [RFC2384].  On the other hand, the
    HTTP URL scheme does not specify how to encode original characters,
    and therefore IRIs only can be used for some HTTP URLs.

    For example, for a document with a URI of
    http://www.example.org/r%C3%A9sum%C3%A9.html, it is possible to
    construct a corresponding IRI (in XML notation, see Section 1.4):
    http://www.example.org/r&#xE9;sum&#xE9;.html (&#xE9; stands for the
    e-acute character, and %C3%A9 is the UTF-8 encoded and escaped
    representation of that character).  On the other hand, for a document
    with an URI of http://www.example.org/r%E9sum%E9.html, the escaped
    octets cannot be converted to actual characters in an IRI, because
    the escaping is not based on UTF-8.
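
[Ed. note: the difference between the two example URIs can be checked
mechanically. The Python sketch below is mine, not the IRI draft's, and
the function name escaped_segment_to_iri is hypothetical. It unescapes
every %HH triplet, which is a simplification: a real conversion would
have to leave escaped reserved characters such as %2F alone.]

```python
from urllib.parse import unquote_to_bytes


def escaped_segment_to_iri(segment: str):
    """Return the unescaped characters, or None if the octets are not UTF-8."""
    octets = unquote_to_bytes(segment)
    try:
        return octets.decode("utf-8")
    except UnicodeDecodeError:
        # The escaping was based on some other encoding (e.g. iso-8859-1),
        # so the octets cannot be turned back into characters reliably.
        return None


print(escaped_segment_to_iri("r%C3%A9sum%C3%A9.html"))  # UTF-8 based escaping
print(escaped_segment_to_iri("r%E9sum%E9.html"))        # not UTF-8 based
```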

    The requirement for the use of UTF-8 applies to all parts of an URI,
    with the exception of the ihostname part.  However, it is possible
    that the capability of IRIs to represent a wide range of characters
    directly is used just in some parts of the IRI (or IRI reference).
    The other parts of the IRI may only contain ASCII characters, or they
    may not be based on UTF-8.  They may be based on another encoding, or
    they may directly encode raw binary data (see also [RFC2397]).

    For example, it is possible to have an URI reference of
    http://www.example.org/r%E9sum%E9.xml#r%C3%A9sum%C3%A9, where the
    document name is encoded in iso-8859-1 based on server settings, but
    the fragment identifier is encoded in UTF-8 according to [XPointer].
    The IRI corresponding to the above URI would be (in XML notation)
    http://www.example.org/r%E9sum%E9.xml#r&#xE9;sum&#xE9;.


[4] EXCERPT FROM RFC 2141 - URN SYNTAX
(http://www.ietf.org/rfc/rfc2141.txt) 
[Ed. note: this is the section of 2141 where UTF-8 encoding is specified
for all URNs - in particular the last paragraph of this excerpt. The
reference there ("[5]") is to the Unicode standard specifying UTF-8.]

2.2 Namespace Specific String Syntax

   As required by RFC 1737, there is a single canonical representation
   of the NSS portion of an URN.   The format of this single canonical
   form follows:

   <NSS>         ::= 1*<URN Chars>

  <URN Chars>    ::= <trans> | "%"  <hex> <hex>

  <trans>       ::= <upper> | <lower> | <number> | <other> | <reserved>

  <hex>          ::= <number> | "A" | "B" | "C" | "D" | "E" | "F" |
                     "a" | "b" | "c" | "d" | "e" | "f"

 <other>         ::= "(" | ")" | "+" | "," | "-" | "." |
                     ":" | "=" | "@" | ";" | "$" |
                     "_" | "!" | "*" | "'"

   Depending on the rules governing a namespace, valid identifiers in a
   namespace might contain characters that are not members of the URN
   character set above (<URN chars>).  Such strings MUST be translated
   into canonical NSS format before using them as protocol elements or
   otherwise passing them on to other applications. Translation is done
   by encoding each character outside the URN character set as a
   sequence of one to six octets using UTF-8 encoding [5], and the
   encoding of each of those octets as "%" followed by two characters
   from the <hex> character set above. The two characters give the
   hexadecimal representation of that octet.
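
[Ed. note: a minimal Python sketch of this translation, for reference
only. The names URN_OTHER, URN_TRANS and to_canonical_nss are
hypothetical. Because the excerpt above does not list the <reserved>
characters, the sketch assumes that anything outside letters, digits and
<other> gets escaped, including "%" itself. The choice of upper case hex
digits is also mine; the <hex> rule allows either case.]

```python
import string

# The <other> characters exactly as listed in the excerpt above.
URN_OTHER = "()+,-.:=@;$_!*'"

# Assumption: <reserved> is left out because the excerpt does not define it.
URN_TRANS = set(string.ascii_letters + string.digits + URN_OTHER)


def to_canonical_nss(nss: str) -> str:
    """Escape each character outside the URN character set as UTF-8 %HH octets."""
    out = []
    for ch in nss:
        if ch in URN_TRANS:
            out.append(ch)
        else:
            # One character may become several octets, each escaped separately.
            out.extend("%{:02X}".format(octet) for octet in ch.encode("utf-8"))
    return "".join(out)


print(to_canonical_nss("r\u00e9sum\u00e9:draft 1"))
```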


[5] EXCERPT #3 FROM IRI INTERNET DRAFT 04
[Ed. note: This is the section of the IRI spec where they explain how to
go from the IRI level down through the UTF-8 level to the URI level. We
will probably need to include this completely in the XRI spec.]

3.1 Mapping of IRIs to URIs

    This section defines how to map an IRI to a URI.  Everything in this
    section applies also to IRI references and URI references, as well as
    components thereof (for example fragment identifiers).

    This mapping has two purposes:

       a) Syntactical:  Many URI schemes and components define additional
          syntactical restrictions not captured in Section 2.2.  Such
          restrictions can be applied to IRIs by noting that IRIs are
          only valid if they map to syntactically valid URIs.  This means
          that such syntactical restrictions do not have to be defined
          again on the IRI level.

       b) Interpretational:  URIs identify resources in various ways.
          IRIs also identify resources.  When the IRI is used solely for
          identification purposes, it is not necessary to map the IRI to
          an URI (see Section 5).  However, when an IRI is used for
          resource retrieval, the resource that the IRI locates is the
          same as the one located by the URI obtained after converting
          the IRI according to the procedure defined here.  This means
          that there is no need to define resolution separately on the
          IRI level.

    Applications MUST map IRIs to URIs using the following two steps.

       Step 1) This step generates a UCS-based encoding from the original
          IRI format.  This step has three variants, depending on the
          form of the input.

             Variant A) If the IRI is written on paper or read out loud,
                or otherwise represented as a sequence of characters
                independent of any encoding: Represent the IRI as a
                sequence of characters from the UCS normalized according
                to Normalization Form C (NFC, [UTR15]).

             Variant B) If the IRI is in some digital representation
                (e.g.  an octet stream) in some known non-Unicode
                encoding: Convert the IRI to a sequence of characters
                from the UCS normalized according to NFC.

             Variant C) If the IRI is in an Unicode-based encoding (for
                example UTF-8 or UTF-16): Do not normalize.  Move
                directly to Step 2.

       Step 2) If the IRI contains an 'ihostname' part, replace this
          'ihostname' part by the part converted using the ToASCII
          operation specified in Section 4.1 of [RFC3490], with the flag
          UseSTD3ASCIIRules set to TRUE and the flag AllowUnassigned set
          to FALSE for creating IRIs and set to TRUE otherwise.

       Step 3) For each character that is disallowed in URI references,
          apply steps 1) through 3) below.  The disallowed characters
          consist of all non-ASCII characters allowed in IRIs.

             1) Convert the character to a sequence of one or more octets
                using UTF-8 [RFCXXXX].

             2) Convert each octet to %HH, where HH is the hexadecimal
                notation of the octet value.  Note: This is identical to
                the escaping mechanism in Section 2.4.1 of [RFC2396].
                Note: To reduce variability, the hexadecimal notation
                SHOULD use upper case letters.

             3) Replace the original character by the resulting character
                sequence (i.e.  a sequence of %HH triplets).