Please see comments inline.
An XRI can be in multiple normal forms at
once. For example, xri://@foo is (I believe, but maybe I need to be corrected)
in URI normal form, XRI normal form, and IRI normal form. Which one form is it
in? The answer is "mu" - it's in all forms at once.
1b. I think an HXRI should only have
URI normal form and thus the unescaping rules should always assume URI
normal form, and that addresses question #2, I think.
[Wil] You’re right, my attack vector was invalid.
1c. anyURI actually only accepts URIs (i.e.
URI normal form). Are you asking about situations where we don't have typing information
(like an XML document without an XSD, for example)? In general, XML documents
allow unescaped IRIs - so we can't really get any guidance about the form from
its placement in a document that allows UTF-8 charset.
[Wil] Right. So, we cannot rely on the context in which the
XRI reference appears.
I'm trying to think of a concrete example
of the "detecting the form" question - the only issues come up in
knowing whether an XRI has to be escaped or unescaped when
"normalizing" to a form (such as normalizing to URI normal form for
the purposes of resolution) - I *believe* there's not really any ambiguity (ie
a parser should be able to tell what to do) - but this requires me now to go
back into the syntax spec which I've not looked into for a while.
[Wil] In general, I think we should implement the parser by
accepting two categories: XRI-NF, and IRI/URI-NF. The former can be a bit loose
if some unreserved characters were percent encoded, but not if delimiters are
encoded. The latter case involves a little bit of guesswork by first testing if
the string is a syntactically valid URI. If so, assume URI-NF. Otherwise, for
example because it contains Unicode characters, we should attempt to parse it
as an IRI.
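
The URI-first, IRI-fallback guess described above could be sketched roughly as follows. This is only an illustration, not OpenXRI code: the name guess_form is hypothetical, and the check is a character-level approximation of RFC 3986 (allowed ASCII characters plus well-formed percent escapes), not a full grammar parse. It also does not attempt the XRI-NF category.

```python
import re

# ASCII characters RFC 3986 allows somewhere in a URI (unreserved,
# gen-delims, sub-delims), plus well-formed %XX escapes.
# Approximation: this checks characters only, not the URI grammar.
_URI_CHARS = re.compile(
    r"(?:[A-Za-z0-9._~:/?#\[\]@!$&'()*+,;=-]|%[0-9A-Fa-f]{2})*"
)


def guess_form(ref: str) -> str:
    """Guess 'URI-NF', 'IRI-NF' or 'invalid' for a reference string."""
    if _URI_CHARS.fullmatch(ref):
        return "URI-NF"
    # Mask non-ASCII characters with "~" (unreserved and not a hex digit,
    # so it cannot complete a percent escape), then re-test the rest.
    masked = "".join(ch if ord(ch) < 0x80 else "~" for ch in ref)
    if masked != ref and _URI_CHARS.fullmatch(masked):
        return "IRI-NF"
    return "invalid"


if __name__ == "__main__":
    for sample in ("xri://@foo", "xri://@caf\u00e9", "xri://@foo bar"):
        print(sample, "->", guess_form(sample))
```

Note that the sketch treats any non-ASCII character as acceptable in an IRI, which is looser than the ucschar ranges in RFC 3987.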
From my understanding, there is no sure way of telling if an
XRI reference is in IRI normal form. For example, @foo%25bar is in XRI-NF, but
could well be the result of transforming @foo%BAr to IRI-NF.
From: Tan, William [mailto:William.Tan@neustar.biz]
Sent: Wednesday, May 31, 2006 7:20
Subject: [xri] Normal Forms
I'm currently working on proper I18N in OpenXRI,
which includes handling XRI/IRI/URI-normal form properly. This has brought up
some issues that have been bugging me but I didn't get enough of a grip to even
begin to ask questions. I will attempt now.
1. The XRI syntax spec speaks of
XRI, IRI and URI normal forms. However, it does not provide recommendations on
the usage of these forms and the context in which each of these forms should be used.
Should implementations be applying algorithms to detect a particular normal
form? Or should they be told explicitly?
Should an HXRI (resolution spec) only accept URI normal form?
What about contexts that have the capacity to accept anyURI and have no problem
2. Converting an IRI reference
(which could be rogue) to XRI normal form may present a security problem
because it is done across the entire string before parsing. E.g.
Google has "xri://@google" and creates a
A malicious user could register @google%2Fsearch, and the XRI parser could not
tell whether %2F is a literal 3-character sequence within an I-name or a
percent encoding produced by XRI-to-IRI transformation.
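
To make the concern concrete, here is a small sketch of the general hazard of decoding a whole string before splitting it into segments. It uses Python's generic urllib.parse.unquote as a stand-in and does not implement the XRI transformation rules; the two helper names are hypothetical.

```python
from urllib.parse import unquote


def segments_decode_first(ref: str) -> list[str]:
    """Unescape the entire string, then split on '/'."""
    return unquote(ref).split("/")


def segments_parse_first(ref: str) -> list[str]:
    """Split on '/' first, then unescape each segment separately."""
    return [unquote(segment) for segment in ref.split("/")]


if __name__ == "__main__":
    ref = "@google%2Fsearch"
    # Decoding first turns the escaped slash into a real delimiter.
    print(segments_decode_first(ref))
    # Parsing first keeps the slash inside a single segment.
    print(segments_parse_first(ref))
```

With decode-first, the escaped slash becomes a segment delimiter and the name splits in two; with parse-first, it stays inside one segment. Whether the XRI conversion rules actually permit this confusion is a separate question.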
I may be wrong in any of these, so I'd like to defer
to those of you who have thought long and hard about XRI-IRI-URI
I would be more than willing to discuss this further
as I'm not convinced that the points above are clearly articulated, and I
suspect that I'm missing some point that may be obvious to others.