Subject: RE: [xri] Draft -07 feedback from another Visa person (responses cont'd)
I do not think that there is a good way of introducing “case insensitivity” to Unicode in general. Case insensitivity is a form of normalization, which is very problematic for the international character set. IMHO, the only meaningful forms of general equivalence are bit-for-bit comparison or comparison of the resolved results from the same authority.
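To make the normalization concern concrete, here is a small Python sketch. It relies only on the standard library's Unicode tables and is just an illustration of why case folding and normalization are not a simple per-character mapping; nothing in it comes from the XRI draft.

```python
import unicodedata

# Case folding can change the length of a string: German sharp s folds to "ss".
folded = "ß".casefold()
print(len("ß"), len(folded))

# Two spellings that look identical can differ code point for code point:
# precomposed e-acute (U+00E9) versus "e" plus combining acute (U+0301).
composed = "\u00e9"
decomposed = "e\u0301"
print(composed == decomposed)  # a plain comparison treats them as different

# They compare equal only after a normalization step such as NFC.
print(unicodedata.normalize("NFC", decomposed) == composed)
```

A bit-for-bit comparison sidesteps all of this, at the cost of treating the two spellings above as distinct identifiers.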
Nat
-----Original Message-----
Outlook continues to drive me up the wall - this should be the end of my initial responses to the comments from Terence Spielman.
Responses continue in this email:
Yes, this needs beefing up, as the inline comment mentions.
I'll defer this question to our resident Unicode & escaping guru, Dave McAlpin.
I think step 5 in 2.2.3.2 addresses this when we specify “one escaped triplet for each octet in the UTF-8 encoding of the disallowed character”. Did you have something else in mind?
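As a rough illustration of "one escaped triplet for each octet in the UTF-8 encoding", here is a minimal Python sketch. The function name `escape_char` is hypothetical, and the sketch does not decide which characters are disallowed; it only shows the per-octet encoding of a single character.

```python
def escape_char(ch: str) -> str:
    """Return one %HH triplet per octet of the character's UTF-8 encoding."""
    return "".join(f"%{octet:02X}" for octet in ch.encode("utf-8"))

# U+00E9 is two octets in UTF-8, so it yields two triplets.
print(escape_char("\u00e9"))
# U+65E5 is three octets in UTF-8, so it yields three triplets.
print(escape_char("\u65e5"))
```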
Again, this gets deferred to Dave McAlpin.
It affects semantics. If an XRI is inadvertently escaped twice and unescaped once, for example, the result might be semantically different from the original XRI (this depends, of course, on the original XRI). It’s essentially the same problem mentioned in section 2.4.2 of 2396, which says “implementers should be careful not to escape or unescape the same string more than once, since unescaping an already unescaped string might lead to misinterpreting a percent data character as another escaped character, or vice versa in the case of escaping an already escaped string.”
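A quick way to see the problem is with generic percent-encoding. The sketch below uses Python's `urllib.parse.quote` and `unquote` as a stand-in for the escaping step; it is not the XRI escaping procedure itself, and the sample strings are made up.

```python
from urllib.parse import quote, unquote

original = "50%"                       # a literal percent sign used as data

once = quote(original, safe="")        # escaped once
twice = quote(once, safe="")           # inadvertently escaped a second time
round_trip = unquote(twice)            # unescaped only once

print(once, twice, round_trip)
print(round_trip == original)          # the round trip no longer matches

# The reverse mistake: unescaping a string that was never escaped.
# The "%41" here was meant as data, but it is read as an escaped "A".
misread = unquote("100%41")
print(misread)
```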
Dave McAlpin has thought through the escaping issues quite a bit. We are trying to track the (as-yet-not-finalized) RFC 2396bis and IRI (internationalized resource identifiers) specs, and this adds some complexity with the benefit of aligning with emerging best practices and architectures. I'd leave it to Dave to explain exactly how he ended up with the escaping procedure we have.
I don’t understand the second step. Can you give an example of escaping “all syntactic elements with cross references”?
That's a good question. Not sure, honestly. Dave? Drummond?
Mostly, I think, to make the comparison rules for XRIAuthority consistent with those for URIAuthority (as specified by section 6 of 2396). It may be confusing, though, in that it only applies to characters in the ALPHA production. That’s fine for URIAuthorities because they only allow characters in the ALPHA production, but the XRIAuthority can contain international characters. Is your objection that it’s odd that ‘e’ and ‘E’ are equivalent, but ‘e’ with an accent mark is not equivalent to ‘E’ with the same mark? If it is, then I agree. Is there a good way to specify case-insensitivity for all Unicode characters?
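The asymmetry described above can be sketched in Python as follows. This assumes ALPHA means only the ASCII letters A-Z and a-z, as in 2396; the helper names `alpha_fold` and `alpha_equivalent` are hypothetical and are not taken from the draft.

```python
import string

# Fold only ASCII letters; every other character is left untouched.
_ALPHA_FOLD = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)

def alpha_fold(s: str) -> str:
    """Lowercase only the characters in the ASCII ALPHA range."""
    return s.translate(_ALPHA_FOLD)

def alpha_equivalent(a: str, b: str) -> bool:
    """Compare two strings case-insensitively for ASCII letters only."""
    return alpha_fold(a) == alpha_fold(b)

print(alpha_equivalent("e", "E"))            # plain letters match
print(alpha_equivalent("\u00e9", "\u00c9"))  # accented letters do not
```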
I believe the decoding is mechanical and unambiguous. Dave?
In general, the escaping/unescaping mirrors IRI work, along with one extra step for escaping () (parentheses). We definitely wanted to make sure the transformations were reversible.
I think the question is actually whether the BNF is unambiguous, i.e. does an XRI exist that could be interpreted in more than one way by the BNF? I’ve done some work in this area, but I certainly wouldn’t consider the BNF “proven” at this point.
We talked quite a bit about this. The decision was made to be silent on canonicalization because equivalence is actually unambiguous given the rules stated. Now, that doesn't mean that it's at all obvious.
I do think giving names to the escaped vs. unescaped forms of an XRI, at least, would be useful. Canonicalization would then just be transforming an identifier into one of those forms. We didn't want to mandate a single canonical form because different environments would need XRIs in different levels of escaping, and it would be unfortunate to require a specific canonicalization form that would require otherwise-unneeded transformation.
Again, Dave McAlpin probably has better input on this.
A canonical representation might be useful for comparison, but it would involve a formal definition of things like “minimally escaped”, which would be fairly difficult to nail down. It would also depend on the existence of a canonical form for URIs used as cross-references. In other words, an XRI wouldn’t have a canonical form if it contained cross-references that didn’t define a canonical form.
Note that equivalence rules are generally problematic. The IRI proposal, for example, completely dodges the question of equivalence when it says, “There is no general rule or procedure to decide whether two arbitrary IRIs are equivalent or not… Each specification or application that uses IRIs has to decide on the appropriate criterion for IRI equivalence.” 2396bis notes that even terms like “different” and “equivalent” are fuzzy in the general spec and ultimately application dependent.
Well, by definition, all URIs can't be XRIs because URIs have different schemes - XRIs must all have the "xri:" scheme. I think the goal of having all URIs easily and trivially transformable into XRIs (i.e. remove the scheme and insert xri:) is laudable, though it's unclear that in many cases this makes a lot of sense. This is because XRIs are structured, and resolution of XRIs (at the very least) gives special meaning to the first segment (the authority) -- not all URIs are hierarchical or treat the first "segment" specially. Examples include mailto:, uuid:, cid:, etc.
Note also that it’s trivial to convert any legal URI into an XRI by simply enclosing it in a cross-reference, e.g. mailto:bob@example.com -> xri:(mailto:bob@example.com), though I don’t know that that’s generally useful.
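The cross-reference wrapping above amounts to a one-line string operation. In this Python sketch the function name `uri_to_xri` is hypothetical, and the sketch performs no validation or escaping of the URI; it only mirrors the mailto: example.

```python
def uri_to_xri(uri: str) -> str:
    """Wrap a URI in a cross-reference under the xri: scheme."""
    return f"xri:({uri})"

print(uri_to_xri("mailto:bob@example.com"))
```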
Hope that kicks off the conversation and gives us editors some good pointers on where we need to focus on cleaning up the language.
-Gabe