Subject: RE: [xri] Problem with XRI 1.0 syntax
Let us break the compatibility. We have to decide almost NOW. What would be the best
character, then?

Nat

From: Dave McAlpin [mailto:dave.mcalpin@epok.net]

An issue has come up around XRI syntax related to the
use of the dot (“.”) character. Changes to RFC2396bis subsequent to
XRI 1.0 approval (discussed below), reinforced by feedback from early
implementers, require careful consideration by the TC. Here’s a fairly
detailed look at the problem.

--Background

One of the core design goals for the XRI TC was to
create a syntax that supports "human friendly" identifiers. Another
was to follow as closely as possible existing standards and precedents. These
two goals conflicted to some extent when defining semantics around the dot
(".") character, particularly in the XRI authority component where
resolution is normatively specified.

Dot is the traditional separator in DNS names
(www.epok.net, for example), and is effectively a second level separator in the
authority component of URIs. For instance, http://www.epok.net/foo can be
broken down into scheme (http), authority (www.epok.net) and path (/foo).
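
The breakdown just described can be sketched in Python with the standard library's `urllib.parse`. This is only an illustration of the generic URI case; the helper name `decompose` is hypothetical, and splitting the authority on dot is the DNS convention under discussion, not something `urlsplit` does for you.

```python
from urllib.parse import urlsplit


def decompose(uri: str) -> dict:
    """Split a generic URI into scheme, authority and path,
    then split the authority on dots (the second level separator)."""
    parts = urlsplit(uri)
    return {
        "scheme": parts.scheme,
        "authority": parts.netloc,
        "path": parts.path,
        # Second level separation: each dot-delimited unit of the authority.
        "authority_labels": parts.netloc.split("."),
    }


if __name__ == "__main__":
    print(decompose("http://www.epok.net/foo"))
```
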
Authority can be further divided into www, epok and net. Dot is the character
that serves as the second level separator.

A similar requirement for a second level separator
exists in an XRI that has an XRI authority component, i.e. an authority that
starts with a global character. For example, in xri:@epok.seattle/foo,
@epok.seattle is the XRI authority component, further separated by the dot
between epok and seattle and by an implicit dot between @ and epok. Dot in this
case serves the same purpose as it does in DNS - it delimits the resolution
units in the XRI authority.

The XRI TC recognized that giving dot special
semantics was somewhat in conflict with the human friendly goal of XRI. It
meant, for example, that a very readable XRI like xri:=dave.mcalpin, while
syntactically legal, might not match user expectations. A naïve user could
easily view “dave.mcalpin” as a single token, while the resolution
spec treats “dave” and “mcalpin” as two delegated
tokens. The decision, however, was that the use of dot was so well ingrained by
DNS that it was the only reasonable choice for a second level separator. To
support equivalent syntax, we introduced something called a "relative
cross reference", whose purpose was to allow a string that contained
delimiters to be treated as a single token. The above example, then, became
xri:=(dave.mcalpin), where the parens set off a single token for the purpose of
resolution. The identifier xri:=dave%2Emcalpin accomplishes the same thing, but
it clearly has a problem with human friendliness.

Feedback from early implementers, however, suggested
that many users did in fact expect dot to be a normal character. Users very
much preferred =dave.mcalpin, for example, to =(dave.mcalpin).

--Dot and RFC2396

For a number of reasons, the xri-authority component
of an XRI (e.g. @epok.seattle in the XRI xri:@epok.seattle/foo) is not an
authority from the perspective of generic URIs, as defined by RFC2396, but
rather is part of the path. In RFC2396, dot is part of the unreserved character
set, and specifically is unreserved for path components. RFC2396 is somewhat
ambiguous as to whether or not scheme designers are allowed to apply special
semantics to unreserved characters. Because so many useful characters are
unreserved, however, in actual practice scheme designers (including the XRI TC)
have often used unreserved characters as delimiters, effectively moving them
into the reserved set for a particular scheme.

--Dot and RFC2396bis

A revision of RFC2396, generally referred to as
RFC2396bis, has been in draft since October 2002 - in fact a substantial part
of the XRI 1.0 specification was based on BNF from the RFC2396bis draft that
was current last fall. However, a substantial change was made to the RFC2396bis
draft this spring. This new draft is available at http://gbiv.com/protocols/uri/rev-2002/rfc2396bis.html.
Among other things, this new draft of RFC2396bis
attempts to clarify which characters should and should not be used as
delimiters by scheme designers. Specifically,
sections 2.2 and 2.3 of RFC2396bis discuss reserved and unreserved characters.
As noted above, this was a confusing part of the original 2396. In the
endnotes, the authors note, "Section 2 on characters has been rewritten to
explain what characters are reserved, when they are reserved, and why they are
reserved even when not used as delimiters by the generic syntax. The mark
characters that are typically unsafe to decode, including the exclamation mark
("!"), asterisk ("*"), single-quote ("'"), and
open and close parentheses ("(" and ")"), have been moved
to the reserved set in order to clarify the distinction between reserved and
unreserved and hopefully answer the most common question of scheme
designers."

RFC2396bis splits reserved characters into two sets -
gen-delims and sub-delims. The gen-delim set is made up of characters that are
used as delimiters in generic URI syntax (though not necessarily in all
components of generic URIs). The sub-delim set is made up of characters that
are not used in generic URI syntax but that are reserved for use as delimiters
by designers of particular URI schemes. Taken together, gen-delims and
sub-delims represent the reserved set, and section 2.2 says, "URIs that
differ in the replacement of a reserved character with its corresponding
percent-encoded octet are not equivalent."

Unfortunately, "." is not a character in
the reserved set. Section 2.3 places "." (along with "-",
"_" and "~") in the unreserved set and says, "URIs
that differ in the replacement of an unreserved character with its
corresponding percent-encoded octet are equivalent: they identify the same
resource." In fact, it goes on to say that escaped unreserved characters
in a URI "should be decoded to their corresponding unreserved character by
URI normalizers".

It's very good to have clarity around this issue, but
it's bad for XRIs because we treat dot as if it were reserved. We assume that
an XRI with an escaped dot is NOT equivalent to an XRI in which dot is
unescaped, which is clearly in conflict with 2396bis. Note that in this
respect, 2396bis is very much in alignment with the implementation feedback
mentioned above. Users requested XRIs in a form like xri:=dave.mcalpin, where
dave.mcalpin is treated as a single token. This is legal under 2396bis (in fact, it’s
required), but is not possible under the XRI 1.0 spec.

--Implications

Unless there’s a change in 2396bis (and it
doesn’t look like there will be), the XRI TC should reconsider our use of
the dot character in XRI syntax. This is a serious issue because almost any
corrective action will break backward compatibility with the 1.0 spec. On the
other hand, compatibility with generic URI syntax is extremely important for
XRIs, both because of the ubiquity of URI processors and because the defined
resolution protocol depends on transforming an XRI into a syntactically legal
URI.

I’m very interested in opinions and proposals
from other members of the TC, but to me this seems like an issue we need to
address. If we do end up breaking compatibility with 1.0, it’s much
better to decide sooner rather than later.

Dave
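
To make the conflict concrete, the equivalence rules quoted from sections 2.2 and 2.3 of RFC2396bis can be sketched as below. This is a minimal illustration under stated assumptions, not a full URI normalizer: the function names `decode_unreserved` and `equivalent` are hypothetical, the unreserved set is taken to be letters, digits, "-", ".", "_" and "~" as described in the message, and every other percent-encoded octet is simply left alone.

```python
import re
import string

# Assumed unreserved set per the section 2.3 text quoted above:
# letters, digits, "-", ".", "_" and "~".
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

_PCT_OCTET = re.compile(r"%([0-9A-Fa-f]{2})")


def decode_unreserved(uri: str) -> str:
    """Decode a percent-encoded octet only when it stands for an
    unreserved character; leave all other escapes untouched."""
    def _replace(match: re.Match) -> str:
        char = chr(int(match.group(1), 16))
        return char if char in UNRESERVED else match.group(0)
    return _PCT_OCTET.sub(_replace, uri)


def equivalent(a: str, b: str) -> bool:
    """Compare two identifiers after decoding escaped unreserved characters."""
    return decode_unreserved(a) == decode_unreserved(b)


if __name__ == "__main__":
    # Dot is unreserved, so the escaped and unescaped forms collapse together.
    print(equivalent("xri:=dave%2Emcalpin", "xri:=dave.mcalpin"))
    # Parentheses are reserved in the new draft, so their escapes are kept.
    print(equivalent("xri:=%28dave.mcalpin%29", "xri:=(dave.mcalpin)"))
```

Under these rules a normalizer is free to turn xri:=dave%2Emcalpin into xri:=dave.mcalpin, which erases the distinction XRI 1.0 relies on, while the parenthesized form stays distinct from its escaped counterpart because parentheses are in the reserved set.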