Subject: RE: [office] IRI vs URI Discussion Today (2010-09-13)

From: "Dennis E. Hamilton" <dennis.hamilton@acm.org>
To: "'Andreas J. Guelzow'" <andreas.guelzow@concordia.ab.ca>
Date: Mon, 13 Sep 2010 18:02:36 -0700

Good news:

 1. I am happy to report that the IRI-to-URI mapping in [RFC3987] only
converts the allowed Unicode characters that lie outside the Basic Latin
set.  So any Basic Latin characters and C0+C1 controls that appear must
already be valid in a URI (or already be %-encoded in a place where
%-encodings may appear).
    This makes some of the business with the mapping easier than I thought.

 2. The IRI specification [RFC3987] makes the valuable statement that "When
an IRI is used for resource retrieval, the resource that the IRI locates is
the same as the one located by the URI obtained after converting the IRI
according to the procedure defined here.  This means there is no need to
define resolution separately on the IRI level."  On the other hand, the
authors recommend against arbitrarily mapping back and forth: any mapping
or attempted inversion should be kept to the minimum necessary.
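
To make point 1 concrete, here is a minimal sketch of the character-level
step of the mapping, as I read it.  The function name iri_to_uri is
hypothetical, and the sketch assumes the input is already a valid IRI held
as a Unicode string; it does not validate which non-ASCII characters are
allowed and does not attempt any special handling of host names.

```python
def iri_to_uri(iri: str) -> str:
    """Sketch only: %-encode the UTF-8 bytes of each non-ASCII character.

    Every Basic Latin (ASCII) character, including any existing
    %-encoding, is passed through untouched.
    """
    out = []
    for ch in iri:
        if ord(ch) < 0x80:
            # ASCII is never altered by the mapping.
            out.append(ch)
        else:
            # One %HH triplet per byte of the character's UTF-8 encoding.
            out.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
    return "".join(out)


print(iri_to_uri("http://example.com/r\u00e9sum\u00e9"))
print(iri_to_uri("http://example.com/a%62c"))
```

Note that the second call returns its input unchanged: the existing %62 is
neither decoded nor re-encoded.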

 - Dennis 

PS: using %62 instead of the letter "b" is definitely not recommended.  It
should certainly not be done by software.  But if it is in an IRI that comes
into our possession, it is wise not to change it.  The security issues that
go with this sort of thing (as a way of obscuring something about a web site
or resource) might be handled by how it is presented, but not by
automatically adjusting it.

-----Original Message-----
From: Dennis E. Hamilton [mailto:dennis.hamilton@acm.org] 
Sent: Monday, September 13, 2010 10:36
To: 'Andreas J. Guelzow'
Cc: 'ODF TC List'
Subject: RE: [office] IRI vs URI Discussion Today (2010-09-13)

Yes, I intentionally used the %62 escape.  However, in URIs there is no
assurance that the 0x62 byte is intended to be the ASCII/ISO 646 encoding
for the letter "b".  That's why, among other reasons, the rule for URIs in
namespace declarations (and in some other cases) says that the namespaces
identified by URIs http://example.com/abc and http://example.com/a%62c are
different.

That's why it is important to urge that producers SHOULD NOT %-encode the
UTF-8 encoding of any Basic Latin characters that are freely usable in URIs
without any escaping, and that consumers SHOULD NOT decode any %-encoding
within IRIs in the markup of a consumed document.
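
A small illustration of why decoding is risky, using the two namespace
names above.  It only shows the effect of one common decoding call
(urllib.parse.unquote from the Python standard library); the variable names
are mine and nothing here is prescribed by any specification.

```python
from urllib.parse import unquote

ns_a = "http://example.com/abc"
ns_b = "http://example.com/a%62c"

# Namespace names are compared as literal character strings,
# so these two identify different namespaces.
same_as_written = (ns_a == ns_b)

# A consumer that decodes %-encodings first collapses the
# distinction and would treat the two names as identical.
same_after_decoding = (unquote(ns_a) == unquote(ns_b))

print(same_as_written, same_after_decoding)
```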

 - Dennis

FURTHER THOUGHTS

[ ... ]

Here's an odd case.  If for some reason the URI mapping of a non-URI IRI is
provided as the value of a markup item whose datatype is anyURI, no
%-encoding in it should be decoded before submission to a URI/IRI resolver.
In deciding whether two IRIs are the same or not, it is probably appropriate
to map them both to URIs and see if those are the same.  (The mapping should
do something rational for those parts of URIs that are not case-sensitive,
such as the letters for hexadecimal digits in a %-encoding.)
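
A rough sketch of that comparison, under my own assumptions: both IRIs are
mapped to URIs by %-encoding the UTF-8 bytes of non-ASCII characters, and
the only case-insensitive part handled is the pair of hexadecimal digits in
each %-encoding.  Scheme and host case are deliberately left alone here.
The names iri_to_uri, normalize_pct_case and same_iri are hypothetical.

```python
import re

# Matches one %-encoded octet such as %c3 or %A9.
_PCT_TRIPLET = re.compile(r"%[0-9A-Fa-f]{2}")


def iri_to_uri(iri: str) -> str:
    """%-encode the UTF-8 bytes of non-ASCII characters; keep ASCII as-is."""
    return "".join(
        ch if ord(ch) < 0x80
        else "".join(f"%{byte:02X}" for byte in ch.encode("utf-8"))
        for ch in iri
    )


def normalize_pct_case(uri: str) -> str:
    """Upper-case the hex digits of each %-encoding without decoding it."""
    return _PCT_TRIPLET.sub(lambda m: m.group(0).upper(), uri)


def same_iri(a: str, b: str) -> bool:
    """Compare two IRIs through their URI mappings, never decoding anything."""
    return normalize_pct_case(iri_to_uri(a)) == normalize_pct_case(iri_to_uri(b))


print(same_iri("http://example.com/r\u00e9", "http://example.com/r%c3%a9"))
print(same_iri("http://example.com/abc", "http://example.com/a%62c"))
```

The second comparison stays unequal on purpose: %62 is never decoded to
"b", consistent with the namespace example in the quoted message.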

I am tempted to say in regard to the consumption of ODF documents that
mapping to URIs MAY always be done before submission to a resolver, whether
or not IRIs are directly acceptable to the resolver.  Something tells me
this is a natural consequence of the way mapping of IRIs to URIs is defined,
but I am not 100% certain of that at this point.  I can't imagine an
interoperable case without this assurance, however.

[ ... ]
