RE: [xri] Draft -07 feedback from another Visa person (responses cont'd)

I do not think that there is a good way of introducing “case insensitivity” to Unicode in general.

Case insensitivity is a form of normalization, which is very problematic for the international character set. IMHO, the only meaningful forms of general equivalence are bit-for-bit comparison and comparison of the resolved results from the same authority.
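
A small illustration of why this is hard, as a sketch only: the helper names below are made up for this example, and Python's NFC normalization and casefold() are stand-ins for "some Unicode-wide case-insensitive rule", not anything the draft specifies.

```python
import unicodedata

def bitwise_equal(a: str, b: str) -> bool:
    # Bit-for-bit comparison of the UTF-8 encodings.
    return a.encode("utf-8") == b.encode("utf-8")

def folded_equal(a: str, b: str) -> bool:
    # Hypothetical "case-insensitive" rule: NFC-normalize, then apply
    # full Unicode case folding. This is one possible rule among several.
    def fold(s: str) -> str:
        return unicodedata.normalize("NFC", s).casefold()
    return fold(a) == fold(b)

# German sharp s folds to "ss", so the strings differ even in length.
print(bitwise_equal("stra\u00dfe", "STRASSE"))  # False
print(folded_equal("stra\u00dfe", "STRASSE"))   # True

# Precomposed e-acute vs. "e" plus combining acute accent.
print(bitwise_equal("\u00e9", "e\u0301"))       # False
print(folded_equal("\u00e9", "e\u0301"))        # True
```

Whether the "True" answers are the right ones depends on which normalization and folding rules one picks, which is exactly the problem.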

Nat

-----Original Message-----
From: Dave McAlpin [mailto:dave.mcalpin@epokinc.com]
Sent: Friday, September 05, 2003 2:43 AM
To: 'Wachob, Gabe'; xri@lists.oasis-open.org
Subject: RE: [xri] Draft -07 feedback from another Visa person (responses cont'd)

-----Original Message-----
From: Wachob, Gabe [mailto:gwachob@visa.com]
Sent: Thursday, August 28, 2003 4:29 PM
To: 'xri@lists.oasis-open.org'
Subject: RE: [xri] Draft -07 feedback from another Visa person (responses cont'd)

Outlook continues to drive me up the wall - this should be the end of my initial responses to the comments from Terence Spielman.

Responses continue in this email:

550-551 I didn't get the meaning of persistent or re-assignable identifiers
out of this description.

Yes, this needs beefing up, as the inline comment mentions.

2.2.3.1 Is it allowable to escape unicode characters? For example, if one
wanted to express an international XRI in IA5 (ASCII)? In this
case, the %AB format described in 2.2.3.1 is insufficient to support
the expanded character width.

I'll defer this question to our resident unicode & escaping guru, Dave McAlpin.

I think step 5 in 2.2.3.2 addresses this when we specify “one escaped triplet for each octet in the UTF-8 encoding of the disallowed character”. Did you have something else in mind?
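
To make the "one escaped triplet per UTF-8 octet" rule concrete, here is a minimal sketch. The function name and the ALLOWED set are assumptions for illustration only; the real set of allowed characters is whatever the draft's ABNF says.

```python
import string

# Assumed, illustrative set of characters that need no escaping.
# This is NOT the character set defined by the XRI draft.
ALLOWED = set(string.ascii_letters + string.digits + "-._~:/@")

def escape_disallowed(text: str) -> str:
    """Replace each disallowed character with one %XX triplet
    per octet of its UTF-8 encoding."""
    out = []
    for ch in text:
        if ch in ALLOWED:
            out.append(ch)
        else:
            out.extend("%{:02X}".format(octet) for octet in ch.encode("utf-8"))
    return "".join(out)

# U+00E9 encodes to two octets, U+20AC (euro sign) to three.
print(escape_disallowed("xri:@example/caf\u00e9"))  # xri:@example/caf%C3%A9
print(escape_disallowed("\u20ac"))                  # %E2%82%AC
```

So a character wider than one octet simply produces more than one %XX triplet, and the result is pure ASCII.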

694 Does the lack of idempotency affect semantics or syntax? I would
hope it would only be syntax.

Again, this gets deferred to Dave McAlpin.

It affects semantics. If an XRI is inadvertently escaped twice and unescaped once, for example, the result might be semantically different from the original XRI (this depends, of course, on the original XRI). It’s essentially the same problem mentioned in section 2.4.2 of 2396, which says “implementers should be careful not to escape or unescape the same string more than once, since unescaping an already unescaped string might lead to misinterpreting a percent data character as another escaped character, or vice versa in the case of escaping an already escaped string.”
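
A quick demonstration of the problem, using Python's urllib.parse quote/unquote as a stand-in for the XRI escaping steps (an assumption; the draft's procedure has extra steps, but the percent handling behaves the same way):

```python
from urllib.parse import quote, unquote

# Raw data that happens to contain a literal percent sign.
raw = "a%2Fb"

once = quote(raw, safe="")    # correct single escape
twice = quote(once, safe="")  # accidental second escape

print(once)   # a%252Fb
print(twice)  # a%25252Fb

# Escaped twice, unescaped once: we do not get the raw data back.
print(unquote(twice))          # a%252Fb
print(unquote(twice) == raw)   # False

# Escaped once, unescaped twice: the literal "%2F" is misread as "/".
print(unquote(unquote(once)))  # a/b
```

In the last case a data string has silently become a path separator, which is a change in meaning rather than just in spelling.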

2.2.3.3 How about this as an alternative?
     Escape all current escapes (%s).
     Escape all syntactic elements with cross references (parens).
     Escape all parens.

Dave McAlpin has thought through the escaping issues quite a bit. We are trying to track the (as-yet-not-finalized) RFC 2396bis and IRI (internationalized resource identifiers) specs, and this adds some complexity with the benefit of aligning with emerging best practices and architectures. I'd leave it to Dave to explain exactly how he ended up with the escaping procedure we have.

I don’t understand the second step. Can you give an example of escaping “all syntactic elements with cross references”?

878-879 Why are XRI authorities compared in a case-insensitive manner?

That's a good question. Not sure, honestly. Dave? Drummond?

Mostly, I think, to make the comparison rules for XRIAuthority consistent with those for URIAuthority (as specified by section 6 of 2396). It may be confusing, though, in that it only applies to characters in the ALPHA production. That’s fine for URIAuthorities because they only allow characters in the ALPHA production, but the XRIAuthority can contain international characters. Is your objection that it’s odd that ‘e’ and ‘E’ are equivalent, but ‘e’ with an accent mark is not equivalent to ‘E’ with the same mark? If it is, then I agree. Is there a good way to specify case-insensitivity for all Unicode characters?
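
Here is roughly what an ALPHA-only rule does, as a sketch (the function names are hypothetical, and the authority strings are plain labels rather than full XRIs):

```python
def ascii_lower(s: str) -> str:
    # Lowercase only A-Z (the ALPHA production); leave everything else alone.
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in s)

def authority_equal_alpha_only(a: str, b: str) -> bool:
    # Case-insensitive for ASCII letters, exact match for all other characters.
    return ascii_lower(a) == ascii_lower(b)

print(authority_equal_alpha_only("Example", "eXAMPLE"))      # True
print(authority_equal_alpha_only("caf\u00e9", "CAF\u00c9"))  # False

# A Unicode-aware fold would treat the accented pair as equal.
print("caf\u00e9".casefold() == "CAF\u00c9".casefold())      # True
```

The middle result is the oddity in question: the unaccented letters match regardless of case, while the accented pair does not.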

Section 3 (I still need to do some reading)

Global:

Has there been any work on DECODING XRIs? It's not immediately
clear from the ABNF that decoding is unambiguous.

I believe the decoding is mechanical and unambiguous. Dave?

In general, the escaping/unescaping mirrors IRI work, along with one extra step for escaping () (parentheses). We definitely wanted to make sure the transformations were reversible.

I think the question is actually whether the BNF is unambiguous, i.e. does an XRI exist that could be interpreted in more than one way by the BNF? I’ve done some work in this area, but I certainly wouldn’t consider the BNF “proven” at this point.

In addition, aside from unresolvable references, is it possible
to canonicalize XRIs? This is a highly desirable feature
(for equivalence, at a minimum).

We talked quite a bit about this. The decision was made to be silent on canonicalization because equivalence is actually unambiguous given the rules stated. Now, that doesn't mean that it's at all obvious.

I do think giving names to the escaped vs. unescaped forms of XRI, at least, would be useful. Canonicalization would then just be transforming an identifier into one of those forms. We didn't want to mandate a single canonical form because different environments would need XRIs in different levels of escaping and it would be unfortunate to require a specific canonicalization form that would require otherwise-unneeded transformation.

Again, Dave McAlpin probably has better input on this.

A canonical representation might be useful for comparison, but it would involve a formal definition of things like “minimally escaped”, which would be fairly difficult to nail down. It would also depend on the existence of a canonical form for URIs used as cross-references. In other words, an XRI wouldn’t have a canonical form if it contained cross-references that didn’t define a canonical form.

Note that equivalence rules are generally problematic. The IRI proposal, for example, completely dodges the question of equivalence when it says, “There is no general rule or procedure to decide whether two arbitrary IRIs are equivalent or not… Each specification or application that uses IRIs has to decide on the appropriate criterion for IRI equivalence.” 2396bis notes that even terms like “different” and “equivalent” are fuzzy in the general spec and ultimately application dependent.

An XRI is not a URI (because of the expanded syntax). But
is an URI an XRI? (no, because of different scheme (xri)).
I think it would be nice for all URIs to be valid XRIs.

Well, by definition, not all URIs can be XRIs, because URIs have different schemes - XRIs must all have the "xri:" scheme. I think the goal of having all URIs easily and trivially transformable into XRIs (i.e. remove the scheme and insert xri:) is laudable, though it's unclear that this makes a lot of sense in many cases. This is because XRIs are structured, and resolution of XRIs (at the very least) gives special meaning to the first segment (the authority) -- not all URIs are hierarchical or treat the first "segment" specially. Examples include mailto:, uuid:, cid:, etc.

Note also that it’s trivial to convert any legal URI into an XRI by simply enclosing it in a cross-reference, e.g. mailto:bob@example.com -> xri:(mailto:bob@example.com), though I don’t know that that’s generally useful.
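
For comparison, the two transformations mentioned in this part of the thread can be sketched as follows. Both function names are hypothetical, neither validates its input, and the cross-reference version ignores any escaping the URI's own parentheses might need.

```python
def wrap_as_cross_reference(uri: str) -> str:
    # Enclose the whole URI, scheme included, in an XRI cross-reference.
    return "xri:(" + uri + ")"

def swap_scheme(uri: str) -> str:
    # Naive alternative: drop the original scheme and put "xri:" in its place.
    _scheme, _sep, rest = uri.partition(":")
    return "xri:" + rest

print(wrap_as_cross_reference("mailto:bob@example.com"))  # xri:(mailto:bob@example.com)
print(swap_scheme("mailto:bob@example.com"))              # xri:bob@example.com
```

The cross-reference form keeps the original scheme and can be reversed; the scheme swap throws the scheme away, so "bob@example.com" would then be read as XRI structure rather than as a mailbox.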

Hope that kicks off the conversation and gives us editors some good pointers on where we need to focus in cleaning up the language.

-Gabe
