Subject: RE: [office-formula] Summary 2010-09-07 - IRI vs URI
From: "Andreas J. Guelzow" <andreas.guelzow@concordia.ab.ca>
To: "dennis.hamilton@acm.org" <dennis.hamilton@acm.org>
Date: Wed, 8 Sep 2010 00:32:58 -0600

Hi Dennis,

When I hear the word "subset", I understand it as a mathematical term:

IRIs form a subset of URIs means that every IRI is a URI.
URIs form a subset of IRIs means that every URI is an IRI.

This is completely separate from any consideration of mappings between
those two sets.

On Tue, 2010-09-07 at 22:55 -0600, Dennis E. Hamilton wrote:
> Here's my extended analysis of what is involved with that.  
> 
> I believe the statement patently incorrect if read to mean that IRIs and
> URIs are co-extensive and there is nothing that needs to be done.

I don't know what "co-extensive" means in this context, but RFC 3987
makes it clear that every URI is an IRI: the syntax of IRIs, including
the use of components and reserved characters, is the same as that of
URIs, except that IRIs have additional unreserved characters. So there
are IRIs that are not URIs.

RFC 3987 also describes a (non-injective) mapping from IRIs to URIs. I
use the term non-injective here in the mathematical sense: several
distinct IRIs map to the same URI.
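
To make the non-injectivity concrete, here is a minimal Python sketch of
the core of that mapping. It is a simplification, not the full RFC 3987
procedure: it assumes the input is already a Unicode string and a
syntactically valid IRI, and it skips both the normalization step and the
optional ToASCII handling of host names. The function name iri_to_uri and
the example.org addresses are my own, for illustration only.

```python
def iri_to_uri(iri: str) -> str:
    """Percent-encode every non-ASCII character as its UTF-8 octets.

    Simplified sketch: no normalization, no special host-name handling.
    """
    out = []
    for ch in iri:
        if ord(ch) < 0x80:
            # ASCII characters, including existing %-escapes, pass through.
            out.append(ch)
        else:
            out.extend(f"%{b:02X}" for b in ch.encode("utf-8"))
    return "".join(out)


a = "http://example.org/r\u00e9sum\u00e9"   # an IRI that is not a URI
b = "http://example.org/r%C3%A9sum%C3%A9"   # a URI, and hence also an IRI

assert a != b
assert iri_to_uri(a) == b                          # two distinct IRIs ...
assert iri_to_uri(b) == b                          # ... map to the same URI
assert iri_to_uri(iri_to_uri(a)) == iri_to_uri(a)  # idempotent
```

The second assertion is the identity on URIs, the third the
idempotence; taken together with the first they show that the mapping
is not injective.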



>   The IRI
> specification would be very short and not have to pay attention to mappings
> and when and how they apply were it literally true.  They would also not
> require a separate grammar for IRIs.

Well, that is not quite correct since there are IRIs that are not URIs
(notwithstanding that according to RFC 3987 they will map to a valid
URI).

> 
> Whatever the case, I believe it is necessary to say exactly how we expect
> IRIs to be mapped to URIs, 

This is described in RFC3987 and a simple reference ought to suffice.

> what the admissable IRIs are in the case of
> relative references to package files and subdocuments, and what the
> corresponding manifest:full-path and Zip directory file names are.

It should be quite easy to describe these IRIs, namely exactly those
that map to a URI that was previously described as admissible.

> 
> GRANDFATHERED URIS VS. UNICODE IRIS
> 
> Here is the context in section 3.1, revealingly entitled "Mapping of IRIs to
> URIs":
> 
> "The above mapping from IRIs to URIs produces URIs fully conforming to
>    [RFC3986].  The mapping is also an identity transformation for URIs
>    and is idempotent;  applying the mapping a second time will not
>    change anything.  Every URI is by definition an IRI."
> 
> There is no question the mapping is idempotent as defined in [RFC3987],
> because the resulting URI has no disallowed ASCII-character encodings and so
> running the mapping again changes nothing.  That is to say, the mapping is
> an identity transformation for IRIs that are already well-formed URIs.
> 
> It is in that sense that I say IRIs are subsets of URIs

"subset" has an agreed upon mathematical meaning, we really should not
use "subset" where "superset" would be correct.

>  or, put better, the
> image of admissable IRIs

admissable? 

> that are not already syntactically well-formed URIs
> is a subset of the URIs. 

If they are not URIs, in which sense could they form a subset of URIs?

>  Also, there is the usual problem of mappings of
> this nature in that there is no assured inversion from an IRI-mapped URI
> back to an IRI that is not the URI.

This is only a problem if such an inversion is needed. (Note that the
transformation of section 3.2 in RFC 3987 does give a "best" IRI for
every URI.)
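
A rough Python sketch of the idea behind that transformation, under
stated assumptions: it only turns %-escaped octets back into characters
where they form valid UTF-8 sequences for non-ASCII characters, and it
leaves every other escape exactly as it was. Unlike the real section 3.2
procedure, it does not check whether a decoded character is actually
allowed in an IRI. The name uri_to_iri is hypothetical.

```python
import re

# One or more consecutive %-escapes, e.g. "%C3%A9".
_PCT_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def uri_to_iri(uri: str) -> str:
    """Decode %-escaped UTF-8 sequences for non-ASCII characters only."""

    def decode_run(match):
        text = match.group(0)
        raw = bytes.fromhex(text.replace("%", ""))
        out, i = [], 0
        while i < len(raw):
            if raw[i] >= 0x80:
                # Try the 2-, 3- and 4-octet UTF-8 forms in turn.
                for n in (2, 3, 4):
                    try:
                        ch = raw[i:i + n].decode("utf-8")
                    except UnicodeDecodeError:
                        continue
                    out.append(ch)
                    i += n
                    break
                else:
                    # Not valid UTF-8: keep the original escape.
                    out.append(text[3 * i:3 * i + 3])
                    i += 1
            else:
                # Escaped ASCII (such as %2F) stays escaped.
                out.append(text[3 * i:3 * i + 3])
                i += 1
        return "".join(out)

    return _PCT_RUN.sub(decode_run, uri)


assert uri_to_iri("http://example.org/r%C3%A9sum%C3%A9") == "http://example.org/r\u00e9sum\u00e9"
assert uri_to_iri("http://example.org/a%2Fb") == "http://example.org/a%2Fb"
assert uri_to_iri("http://example.org/%E9") == "http://example.org/%E9"
```

The last case is a URI whose escaped octet is not UTF-8 at all; it
simply stays as it is, and it is still an IRI.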
> 
> This is also the sense in which I believe the statement "Every URI is by
> definition an IRI" is at best misleading and at worse simply incorrect,

I really don't see how you can say that that statement is incorrect. Can
you give a single URI that is not an IRI (i.e., that does not satisfy the
IRI syntax)?


> since there are well-formed URIs that can never be produced from IRIs that
> are entirely in Unicode and that only %-encode Basic Latin characters that
> are not permitted as single-character <pchar>s.  It is further misleading if
> taken to mean that a syntactical IRI can be used where URIs are required.
> There are many places where well-formed URIs are required (e.g., the XML
> Schema anyURI datatype and elsewhere).  

And every IRI can be transformed into such a URI.
> 
> WHY FUSS ABOUT THIS?
> 
> To ensure that the mapping can be accurately inverted, it is necessary to
> restrict what %-encoded bytes are allowed in URIs and which are to be
> employed in reconstructing a non-URI IRI, if any, that is presumably the
> inverse mapping of the URI in hand.  This is a strong constraint, because it
> signifies to me that only Unicode is carried by IRIs (where URIs do not
> necessarily have that limitation on how %-encoded bytes with values greater
> than %7f are to be understood).  The considerations for assuring an inverse
> mapping are reflected in section 3.2 and perhaps elsewhere in [RFC3987].  
> 
> I assume, to satisfy the requirement that ODF support IRIs at all, one needs
> to ensure that the IRI using non-allowed URI characters can be recoverable
> to satisfy whatever use cases there are in mind by those who require that
> IRIs be supported in naming of package files and in URI references
> generally.  Since the requirement came from JTC1 National Body Japan, I
> presume that it is desirable to see the IRIs with the actual CJK characters
> whose Unicode code points are IRI-encoded in the naming of package files and
> in the introduction of URIs in various ways in ODF documents.  How packages
> could be extracted into file systems allowing CJK character encodings in
> file and directory names is, of course, outside of our control beyond
> providing interested parties a consistent way of interpreting the Zip file
> names that ODF restricts itself to.
> 
> It is on behalf of that requirement that I believe it is important to
> ascertain, within ODF, when an IRI must be mapped to a URI and the form of
> URI be used.  This matters, in particular, for any relative references that
> involves segments that are part of ODF Package manifest:full-path values and
> that need to match what is used in the Zip directory entry for a package
> file and/or the ODF notion of an identified package subdocument.  My
> recommendation is that the manifest:full-path always be fully IRI encoded
> (even though it is neither IRI nor URI) and likewise for the corresponding
> Zip directory entry, when there is one.  Furthermore, the only %-encodings
> should be what is required for this purpose and the only unencoded
> characters should be a limited subset of what is used in URIs.  My
> recommendation is to allow only non-empty segment names having only <pchar>s
> without ":", perhaps without "@", and with "/" as the segment separator.
> There should be no "." and ".." segments, since these only exist in URI
> references and are inappropriate in Zip directory file names).
> 

I am still at a loss to understand why such an inversion should be
needed. In fact, the absence of such an inversion is even more
justification for allowing IRIs.

Andreas

References:
- Summary 2010-09-07 of OpenFormula meeting
  - From: "David A. Wheeler" <dwheeler@dwheeler.com>
- RE: [office-formula] Summary 2010-09-07 - IRI vs URI
  - From: "Dennis E. Hamilton" <dennis.hamilton@acm.org>
- RE: [office-formula] Summary 2010-09-07 - IRI vs URI
  - From: "Andreas J. Guelzow" <andreas.guelzow@concordia.ab.ca>
- RE: [office-formula] Summary 2010-09-07 - IRI vs URI
  - From: "Dennis E. Hamilton" <dennis.hamilton@acm.org>