RE: [xri] Problem with XRI 1.0 syntax

Let us break the compatibility. We have to decide almost NOW.

What would be the best character, then?

Nat

From: Dave McAlpin [mailto:dave.mcalpin@epok.net]
Sent: Thursday, June 03, 2004 7:37 AM
To: xri@lists.oasis-open.org
Subject: [xri] Problem with XRI 1.0 syntax

An issue has come up around XRI syntax related to the use of the dot (“.”) character. Changes to RFC2396bis subsequent to XRI 1.0 approval (discussed below), reinforced by feedback from early implementers, require careful consideration by the TC. Here’s a fairly detailed look at the problem.

--Background

One of the core design goals for the XRI TC was to create a syntax that supports "human friendly" identifiers. Another was to follow as closely as possible existing standards and precedents. These two goals conflicted to some extent when defining semantics around the dot (".") character, particularly in the XRI authority component where resolution is normatively specified.

Dot is the traditional separator in DNS names (www.epok.net, for example), and is effectively a second level separator in the authority component of URIs. For instance, http://www.epok.net/foo can be broken down into scheme (http), authority (www.epok.net) and path (/foo). Authority can be further divided into www, epok and net. Dot is the character that serves as the second level separator.
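To make the two levels of separation concrete, here is a minimal sketch using Python's standard urllib.parse module. The variable names are only illustrative, and the final split on dot is ordinary string handling rather than anything the URI parser does by itself.

```python
from urllib.parse import urlsplit

# First-level breakdown performed by a generic URI parser.
parts = urlsplit("http://www.epok.net/foo")
scheme = parts.scheme        # "http"
authority = parts.netloc     # "www.epok.net"
path = parts.path            # "/foo"

# Second-level breakdown: dot separates the DNS labels inside the authority.
labels = authority.split(".")  # ["www", "epok", "net"]
```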

A similar requirement for a second level separator exists in an XRI that has an XRI authority component, i.e. an authority that starts with a global character. For example, in xri:@epok.seattle/foo, @epok.seattle is the XRI authority component, further separated by the dot between epok and seattle and by an implicit dot between @ and epok. Dot in this case serves the same purpose as it does in DNS - it delimits the resolution units in the XRI authority.

The XRI TC recognized that giving dot special semantics was somewhat in conflict with the human friendly goal of XRI. It meant, for example, that a very readable XRI like xri:=dave.mcalpin, while syntactically legal, might not match user expectations. A naïve user could easily view “dave.mcalpin” as a single token, while the resolution spec treats “dave” and “mcalpin” as two delegated tokens. The decision, however, was that the use of dot was so well ingrained by DNS that it was the only reasonable choice for a second level separator. To support equivalent syntax, we introduced something called a "relative cross reference", whose purpose was to allow a string that contained delimiters to be treated as a single token. The above example, then, became xri:=(dave.mcalpin), where the parens set off a single token for the purpose of resolution. The identifier xri:=dave%2Emcalpin accomplishes the same thing, but it clearly has a problem with human friendliness.
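The tokenization described above can be sketched roughly as follows. This is an illustration only, not the normative XRI 1.0 grammar: the function name split_xri_authority is hypothetical, and GLOBAL_CHARS lists only the two global characters that appear in this message ("@" and "=").

```python
# Assumption: only the global characters used in this message are listed.
GLOBAL_CHARS = "=@"

def split_xri_authority(authority: str) -> list[str]:
    """Split an XRI authority into its global character and resolution tokens.

    An unescaped dot outside parentheses ends a token; a parenthesized
    cross reference is kept whole, dots included.
    """
    if not authority or authority[0] not in GLOBAL_CHARS:
        raise ValueError("authority must start with a global character")
    tokens = [authority[0]]
    current = []
    depth = 0
    for ch in authority[1:]:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced parentheses")
        if ch == "." and depth == 0:
            # Top-level dot: close the current token.
            tokens.append("".join(current))
            current = []
        else:
            current.append(ch)
    if depth != 0:
        raise ValueError("unbalanced parentheses")
    tokens.append("".join(current))
    return tokens
```

With this sketch, "@epok.seattle" and "=dave.mcalpin" each yield a global character plus two tokens, while "=(dave.mcalpin)" and "=dave%2Emcalpin" each yield a global character plus one token.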

Feedback from early implementers, however, suggested that many users did in fact expect dot to be a normal character. Users very much preferred =dave.mcalpin, for example, to =(dave.mcalpin).

--Dot and RFC2396

For a number of reasons, the xri-authority component of an XRI (e.g. @epok.seattle in the XRI xri:@epok.seattle/foo) is not an authority from the perspective of generic URIs, as defined by RFC2396, but rather is part of the path. In RFC2396, dot is part of the unreserved character set, and specifically is unreserved for path components. RFC2396 is somewhat ambiguous as to whether or not scheme designers are allowed to apply special semantics to unreserved characters. Because so many useful characters are unreserved, however, in actual practice scheme designers (including the XRI TC) have often used unreserved characters as delimiters, effectively moving them into the reserved set for a particular scheme.

--Dot and RFC2396bis

A revision of RFC2396, generally referred to as RFC2396bis, has been in draft since October 2002 - in fact a substantial part of the XRI 1.0 specification was based on BNF from the RFC2396bis draft that was current last fall. However, a substantial change was made to the RFC2396bis draft this spring. This new draft is available at http://gbiv.com/protocols/uri/rev-2002/rfc2396bis.html.

Among other things, this new draft of RFC2396bis attempts to clarify which characters should and should not be used as delimiters by scheme designers. Specifically, sections 2.2 and 2.3 of RFC2396bis discuss reserved and unreserved characters. As noted above, this was a confusing part of the original 2396. In the endnotes, the authors note, "Section 2 on characters has been rewritten to explain what characters are reserved, when they are reserved, and why they are reserved even when not used as delimiters by the generic syntax. The mark characters that are typically unsafe to decode, including the exclamation mark ("!"), asterisk ("*"), single-quote ("'"), and open and close parentheses ("(" and ")"), have been moved to the reserved set in order to clarify the distinction between reserved and unreserved and hopefully answer the most common question of scheme designers."

RFC2396bis splits reserved characters into two sets - gen-delims and sub-delims. The gen-delim set is made up of characters that are used as delimiters in generic URI syntax (though not necessarily in all components of generic URIs). The sub-delim set is made up of characters that are not used in generic URI syntax but that are reserved for use as delimiters by designers of particular URI schemes. Taken together, gen-delims and sub-delims represent the reserved set, and section 2.2 says, "URIs that differ in the replacement of a reserved character with its corresponding percent-encoded octet are not equivalent."

Unfortunately, "." is not a character in the reserved set. Section 2.3 places "." (along with "-", "_" and "~") in the unreserved set and says, "URIs that differ in the replacement of an unreserved character with its corresponding percent-encoded octet are equivalent: they identify the same resource." In fact, it goes on to say that escaped unreserved characters in a URI "should be decoded to their corresponding unreserved character by URI normalizers".

It's very good to have clarity around this issue, but it's bad for XRIs because we treat dot as if it were reserved. We assume that an XRI with an escaped dot is NOT equivalent to an XRI in which dot is unescaped, which is clearly in conflict with 2396bis. Note that in this respect, 2396bis is very much in alignment with the implementation feedback mentioned above. Users requested XRIs in a form like xri:=dave.mcalpin, where dave.mcalpin is treated as a single token. This is legal under 2396bis (in fact, it’s required), but it is not possible under the XRI 1.0 spec.
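The conflict can be illustrated with a small sketch of the normalization step that 2396bis describes. The function name decode_unreserved is hypothetical; the unreserved set used here is letters, digits and the four marks "-", ".", "_" and "~". This is not a full URI normalizer, only the one step relevant to dot.

```python
import re
import string

# Unreserved characters: letters, digits, and the marks named in section 2.3.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

_PCT = re.compile(r"%([0-9A-Fa-f]{2})")

def decode_unreserved(uri: str) -> str:
    """Decode percent-encoded octets that correspond to unreserved characters.

    Percent-encoded reserved characters (and anything else) are left as-is.
    """
    def repl(match):
        ch = chr(int(match.group(1), 16))
        return ch if ch in UNRESERVED else match.group(0)
    return _PCT.sub(repl, uri)

escaped = "xri:=dave%2Emcalpin"
plain = "xri:=dave.mcalpin"

# A generic normalizer makes the two forms identical, so any processor that
# applies this step cannot preserve the distinction XRI 1.0 relies on.
same_after_normalizing = decode_unreserved(escaped) == plain
```

Parentheses, by contrast, are in the reserved set of the new draft, so a string such as "xri:=%28dave%29" passes through this step unchanged.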

--Implications

Unless there’s a change in 2396bis (and it doesn’t look like there will be), the XRI TC should reconsider our use of the dot character in XRI syntax. This is a serious issue because almost any corrective action will break backward compatibility with the 1.0 spec. On the other hand, compatibility with generic URI syntax is extremely important for XRIs, both because of the ubiquity of URI processors and because the defined resolution protocol depends on transforming an XRI into a syntactically legal URI.

I’m very interested in opinions and proposals from other members of the TC, but to me this seems like an issue we need to address. If we do end up breaking compatibility with 1.0, it’s much better to decide sooner rather than later.

Dave
