Proposal for a Telecon to close the Normal Forms issue

Please see comments inline.

An XRI can be in multiple normal forms at once. For example, xri://@foo is (I believe, but correct me if I'm wrong) in URI normal form, XRI normal form, and IRI normal form. Which one form is it in? The answer is "mu" - it's in all forms at once.

1b. I think an HXRI should only have URI normal form, so the unescaping rules should always assume URI normal form. I think that also addresses question #2.

[Wil] You’re right, my attack vector was invalid.

1c. anyURI actually only accepts URIs (i.e., URI normal form). Are you asking about situations where we don't have typing information (like an XML document without an XSD, for example)? In general, XML documents allow unescaped IRIs, so we can't really get any guidance about the form from its placement in a document that allows the UTF-8 charset.

[Wil] Right. So, we cannot rely on the context in which the XRI reference appears.

I'm trying to think of a concrete example of the "detecting the form" question. The only issues come up in knowing whether an XRI has to be escaped or unescaped when "normalizing" to a form (such as normalizing to URI normal form for the purposes of resolution). I *believe* there's not really any ambiguity (i.e., a parser should be able to tell what to do), but this requires me to go back into the syntax spec, which I haven't looked at for a while.

[Wil] In general, I think we should implement the parser by accepting two categories: XRI-NF, and IRI/URI-NF. The former can be a bit loose if some unreserved characters were percent encoded, but not if delimiters are encoded. The latter case involves a little bit of guesswork by first testing if the string is a syntactically valid URI. If so, assume URI-NF. Otherwise, for example because it contains Unicode characters, we should attempt to parse it as an IRI.
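A minimal Python sketch of the "guesswork" step for the IRI/URI-NF category described above. It is only a character-level approximation, not a full RFC 3986 or RFC 3987 grammar check, and it does not handle the XRI-NF category at all. The function name guess_form and its return labels are hypothetical, and treating any non-ASCII character as IRI-legal is a simplifying assumption.

```python
import re

# ASCII characters that may appear somewhere in a URI:
# RFC 3986 unreserved and reserved characters, plus "%".
_URI_CHARS = re.compile(r"[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]*")

# A "%" that is not followed by two hex digits is not a valid percent-encoding.
_BAD_PERCENT = re.compile(r"%(?![0-9A-Fa-f]{2})")


def guess_form(ref: str) -> str:
    """Guess 'URI-NF' or 'IRI-NF' for a reference (hypothetical helper)."""
    if _BAD_PERCENT.search(ref):
        return "invalid"
    # Step 1: if the string only uses URI characters, assume URI-NF.
    if _URI_CHARS.fullmatch(ref):
        return "URI-NF"
    # Step 2: otherwise, if the only offending characters are non-ASCII,
    # try IRI-NF. Assumption: any non-ASCII character is acceptable here,
    # which is looser than the ucschar ranges an IRI really allows.
    ascii_part = "".join(ch for ch in ref if ord(ch) < 128)
    if ascii_part != ref and _URI_CHARS.fullmatch(ascii_part):
        return "IRI-NF"
    return "invalid"
```

A real implementation would still need to run the actual URI or IRI parser on the result; this only decides which parser to try first.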

From my understanding, there is no sure way of telling if an XRI reference is in IRI normal form. For example, @foo%25bar is in XRI-NF, but could well be the result of transforming @foo%BAr to IRI-NF.

-Gabe

From: Tan, William [mailto:William.Tan@neustar.biz]
Sent: Wednesday, May 31, 2006 7:20 AM
To: xri@lists.oasis-open.org
Subject: [xri] Normal Forms

Hi list,

I'm currently working on proper I18N in OpenXRI, which includes handling XRI/IRI/URI-normal form properly. This has brought up some issues that have been bugging me but I didn't get enough of a grip to even begin to ask questions. I will attempt now.

1.    The XRI syntax spec speaks of XRI, IRI and URI normal forms. However, it does not provide recommendations on the usage of these forms and context in which each of these forms should appear:

      a.    Should implementations be applying algorithms to detect a particular normal form? Or should they be told explicitly?

      b.    Should a HXRI (resolution spec) only accept URI normal form?

      c.    What about contexts that have the capacity to accept anyURI and have no problem representing non-ASCII?

2.    Converting an IRI reference (which could be rogue) to XRI normal form may present a security problem because it is done across the entire string before parsing. E.g.

Google has "xri://@google" and creates a HXRI http://xri.net/@google/search. A malicious user could register @google%2Fsearch, and the XRI parser could not tell whether %2F is a literal 3-character sequence within an I-name or a percent-encoding introduced by the XRI-to-IRI transformation.
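To make the concern concrete, here is a small Python sketch of how the order of decoding and parsing changes the result. It uses generic percent-decoding from the standard library (urllib.parse.unquote) as a stand-in; it is not the XRI spec's conversion algorithm, and both helper names are hypothetical. Whether a real XRI parser is exposed to this depends on which normal form it assumes for its input.

```python
from urllib.parse import unquote


def segments_decode_first(ref: str) -> list[str]:
    """Decode the whole string, then split on "/" (the order at issue)."""
    return unquote(ref).split("/")


def segments_parse_first(ref: str) -> list[str]:
    """Split on "/" first, then decode each segment on its own."""
    return [unquote(segment) for segment in ref.split("/")]


ref = "@google%2Fsearch"
print(segments_decode_first(ref))  # ['@google', 'search']
print(segments_parse_first(ref))   # ['@google/search']
```

Decoding first turns the encoded %2F into a real delimiter, so the registered name becomes indistinguishable from @google/search; splitting first keeps it as a single segment.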

I may be wrong in any of these, so I'd like to defer to those of you who have thought long and hard about XRI-IRI-URI conversions.

I would be more than willing to discuss this further as I'm not convinced that the points above are clearly articulated, and I suspect that I'm missing some point that may be obvious to others.

Thanks.

=wil (http://xri.net/=wil)
