xri message

Subject: RE: [xri] Groups - xri-requirements-1.0-draft-05b.doc uploaded

From: "Wachob, Gabe" <gwachob@visa.com>
To: xri@lists.oasis-open.org
Date: Wed, 21 May 2003 12:26:14 -0700

An even better link about the problems with Unicode and (specifically) East Asian languages:

http://www-106.ibm.com/developerworks/unicode/library/u-secret.html?dwzone=unicode

-Gabe

> -----Original Message-----
> From: Sakimura, Nat [mailto:n-sakimura@nri.co.jp]
> Sent: Wednesday, May 21, 2003 3:10 AM
> To: Wachob, Gabe; xri@lists.oasis-open.org
> Cc: DI
> Subject: RE: [xri] Groups - xri-requirements-1.0-draft-05b.doc uploaded
>
> Thanks Gabe and Peter for your comments.
>
> The more I study this problem, the more it seems that the source of the problem actually is Unicode and ISO 10646-1. I wish we had made DIS 10646 ver.1.0 the ISO standard. Japan was pushing for it, but most people were not worried about the limitations of Unicode at the time, which became apparent within 5 years. Much of the problem with IRI also arises from it. Essentially, the problem comes from the fact that in Unicode (and thus UTF-8, ISO 10646-1), you cannot distinguish the language from the code unless you are lucky.
>
> Without going into the details, let me restate the original problem raised earlier on this ML in the following fashion.
>
> There seem to be two issues involved in IRI <-> URI conversion. The first is actually
>   (P1) local charset IRI to UTF-8 IRI conversion,
> and the second is
>   (P2) UTF-8 IRI to URI conversion.
>
> The issue around (P2) is much easier than (P1), so I will start from it.
>
> (P2) above can be stated as follows.
>   Define f() : IRI (UTF-8) -> URI escaping function.
>   Define g() : URI -> IRI conversion function.
>   Let u = f(i).
>   Then there exists an i such that i != g(u).
>
> The reason:
>   g() != f^-1() because of the following.
>
>   (a) g() must not result in an octet sequence that is not part of a strictly legal UTF-8 octet sequence: a URI may contain escaped sequences that did not originate from UTF-8, e.g., a sequence in iso-8859-1 encoding.
>   (b) There are further restrictions on the legal octet stream beyond UTF-8. E.g., half-width Japanese kana is not legal in IRI. These have to be re-escaped.
>
> This is valid in general, i.e., if we consider g() and f^-1() over the set U, which is the legal URI space, g() and f^-1() are not equal. (In other words, although f(g(u)) = u for any u, there exists an x such that g(f(x)) != x.)
>
> I was wondering if this is much of a concern for us. The IRI draft spec states that "the IRI resulting from this conversion may not be exactly the same as the original IRI (if there ever was one)", but we are not concerned about U, only about the subset of U that is derived by the conversion of IRIs.
>
> Proposition: let U' = f(I), where I is the set of legal IRIs. Then g(x) = f^-1(x) for any x in U'.
>
> Reason:
>   (1) Any escaped octet in this case is guaranteed to have come out of UTF-8. So, we do not have to worry about (a) above.
>   (2) The set "I" does not include any element referred to by (b) above.
>
> (I am just thinking in abstract logic. I am not familiar with BIDI and other problem spaces, but mathematically, the above looks good to me. One caveat: I have to check the escape function f() again to see if it is good. f() must be such that f^-1() exists.)
>
> Thus, the source of the problem seems to lie in the fact that the set of legal URIs is a superset of the set of URIs derived from IRIs. This indeed is a source of compatibility problems, as Peter says. This poses us a question:
>   "Do we really want to state things in terms of URI, at least temporarily?"
> My answer is NO. If we do that, we will be strangled in the same fashion as IRI is.
>
> The next problem, (P1), is much more complicated. The problem (P1) is actually a problem of UTF-8 and Unicode. The union of local charsets is larger than the UTF-8 space.
> Moreover, Unicode/UTF-8/ISO 10646 has no way of telling what language context it is in by itself. It has to be used in conjunction with some other information like locale, language flag, or font set.
>
> UTF-8 collapses several distinct but similar-looking local characters into one representation. For example, the first character of my last name, "Saki", cannot be represented correctly with it. To the eye of a "Latin character world" person, the difference may look like that between the straight and cursive "g", but it is not. If I submit an official document with the UTF-8 representation of my name, the government will not accept it because the name is different. Local charset to UTF-8 conversion is actually a many-to-one function. Thus, an inverse function does not exist. It is going to be a mapping. The situation becomes even more complicated when we mix several languages.
>
> This one will never be solved unless we abandon UTF-8, or Unicode for that matter. There are supplemental problems in searching through combined characters, etc. For example, Devanagari would have its own representation problem. I would like to draw Reva Modi's attention to it, who is probably more knowledgeable on these matters than I. I have already contacted Mr. Kobayashi, who is the only member of the Unicode Technical Committee from Japan, and will continue doing so. From what I found, Mr. Kobayashi is also a bit negative on the capability of Unicode for the sake of real I18N. (On the other hand, DIS 10646 ver.1.0 would have done it properly. Unicode and ISO 10646 were originally fixed-length encodings. One of the main reasons for pushing Unicode and not DIS 10646 ver.1.0 was that the latter was a variable-length encoding and perceived as inefficient. The shortcoming became apparent within 3 years, and this fixed-length thing was abandoned in 1996. Now, we have lost the
> fixed length-ness, and are facing the problems that DIS 10646 ver.1.0 would not have had. After all, at the time of the ISO 10646 voting, it seems the only multi-byte country with a long history of computerized local language processing was Japan. Majority voting does not always produce the correct result...) Today, even HTML ver.4 asks for a language switch like LANG="ja" and LANG="en". This is another indicator of the fact that the dream of Unified Character Coding is lost.
>
> Now, do we want to tackle this formidable problem? Hmmm. A good question.
>
> The bottom line is: if we decide to live in the UTF-8 world, then IRI seems to be OK. If we want real I18N, we probably are leaving the Unicode world.
>
> On the other hand, we MUST NOT point to IRI in the normative spec. We should extract the IRI spec and insert it into our spec until IRI actually becomes a spec. (If I remember correctly, this was the requirement for the use of the IRI proposed spec.)
>
> Nat
>
> -----Original Message-----
> From: Wachob, Gabe [mailto:gwachob@visa.com]
> Sent: Wednesday, May 21, 2003 1:49 AM
> To: xri@lists.oasis-open.org
> Subject: RE: [xri] Groups - xri-requirements-1.0-draft-05b.doc uploaded
>
> Nat-
> Actually, the IRI spec says that mapping from URIs to IRIs unambiguously requires context not present in a URI. In http://www.w3.org/International/iri-edit/draft-duerst-iri.html#URItoIRI the problem is demonstrated in the situation where you convert an IRI to a URI and back -- "the IRI resulting from this conversion may not be exactly the same as the original IRI". Maybe we can address this ambiguity - perhaps you can figure this out better than I have done so far.
>
> Also, as for the "defining a URI" vs. "defining an IRI" - I'm not sure how this plays out. We absolutely need to be able to use XRIs anywhere one would use a URI.
> However, we know that IRIs can always be mapped to URIs (and in the case where there are no non-URI characters in an IRI, the IRI is syntactically equivalent to the URI to start with).
>
> As for equivalence - that's something we *have* to discuss as the specifiers of a URI scheme. We don't have to say much - we can say that two XRI URIs are "equivalent" if they are octet-by-octet the same (though there are issues about unescaping sequences before or after the comparison). I suppose it gets trickier if you define XRIs as an IRI scheme.
>
> The other problem with relying on the IRI spec right now is that it's not a spec yet. It's still only a draft over at the IETF, and the IETF process is slow. I'm guessing we won't see a finalized IRI spec in 2003.
>
> Don't get me wrong - I think we should leverage IRIs somehow. I'd even be in favor of defining XRIs as an IRI scheme if we could ensure that would not cause any problems for those many places where URIs are called for (after conversion to the URI form). I just think it's more complicated than simply referring to the IRI spec (a lot more complicated).
>
> -Gabe
>
> > -----Original Message-----
> > From: Sakimura, Nat [mailto:n-sakimura@nri.co.jp]
> > Sent: Tuesday, May 20, 2003 12:06 AM
> > To: xri@lists.oasis-open.org
> > Subject: RE: [xri] Groups - xri-requirements-1.0-draft-05b.doc uploaded
> >
> > Gabe,
> >
> > Conceptually, IRI has a larger set than URI (IRI includes URI), but both are countable and thus can be mapped one to one, I think. Could you give me an example of mapping one URI to multiple IRIs please?
> >
> > Fundamentally, the question for us probably is "do we really want to be bound by this aging URI standard?" To me, the URI vs. IRI controversy is largely due to the backward compatibility issues.
> > If we think afresh, we probably would not choose URI to be the normative format because it is the source of a myriad of problems for I18N. Unicode is not perfect (some purists say that it is useless - it generally cannot distinguish among similar but distinct characters because these are collapsed into one), but is much cleaner. Resolution does not have to go through the transformation to URI. Our internationalized identifier should be able to be resolved directly.
> >
> > On equivalence: I think URI equivalence arguments do not affect us. This is because we have an abstract permanent identifier, which can be pretty restrictive in the allowed character set as we do not need human readability. To test the equivalence of two identifiers, we should resolve them to their permanent identifiers and compare those. To protect privacy, we might not want to expose the permanent identifier. In this case, the proxy should give out a True/False result. We have a much more powerful tool than URIs in this regard.
> >
> > Nat
> >
> > -----Original Message-----
> > From: Wachob, Gabe [mailto:gwachob@visa.com]
> > Sent: Friday, May 16, 2003 4:25 AM
> > To: 'Drummond Reed'; xri@lists.oasis-open.org
> > Subject: RE: [xri] Groups - xri-requirements-1.0-draft-05b.doc uploaded
> >
> > Drummond-
> > A few notes.
> >
> > First, in section 3.4.5 (you said 3.3.5) - "non-resolvable syntax" - what's the use case? Why do we need to *prevent* an attempt to resolve? Why would a software component resolve an identifier unless it needed to? It seems like there are only two cases: a piece of software needs to resolve the identifier, or it doesn't. This decision is based on application semantics, not the syntax of the identifier. How does marking an identifier as "non-resolvable" help at all?
> >
> > In section 3.4.6 (internationalization) - there is a discussion going on at the W3C TAG (issue named something like "IRIEverywhere") about where IRIs should appropriately be used. It is clear, for example, that IRIs cannot be used everywhere URIs can be used. The issue is whether *future* specs should refer to IRIs or URIs. An IRI can be "cast down" into a URI unambiguously, but because there are several ways to translate Unicode into ASCII, it's not always possible to unambiguously convert a URI back into an IRI (without some context like the encoding used to go from IRI to URI). So, while I think we should definitely address IRIs and XRIs, I don't think XRIs should expect to be solving the problems that IRIs have with their relationship to URIs. We *could* propose a way to encode the things that are needed to unambiguously convert a URI back into an IRI, but I'm guessing that would actually break the IRI spec. I'm going out beyond my competency here I think.
> >
> > Bottom line is that we either have to wait for the IRI things to shake out, or we have to tread new ground in i18n. I *definitely* want XRIs to be "i18n enabled", but I'm a little worried about us planning on achieving that in the short term by relying on IRIs.
> >
> > This document has come a LONG way and I think does a pretty good job of identifying why we are all here. Congrats and thanks to all those who contributed. I'm sure there will be more input and fixes to the doc, but I feel like we're very close to the "good enough" state where we can then concentrate on the syntax and resolution specs.
> >
> > -Gabe
> >
> > > -----Original Message-----
> > > From: Drummond Reed [mailto:drummond.reed@onename.com]
> > > Sent: Thursday, May 15, 2003 11:45 AM
> > > To: xri@lists.oasis-open.org
> > > Subject: RE: [xri] Groups - xri-requirements-1.0-draft-05b.doc uploaded
> > >
> > > First, let me note two reasons for posting v5b:
> > >
> > > 1) I found out from Marc Le Maitre this morning that leaving "Track Changes" on screwed up the section numbering, so it makes it difficult to talk about requirement numbers. Let's use v5b on the call today.
> > >
> > > 2) There was an MS Word cross-reference error (unfortunately not all that uncommon) in 3.4.7 that needed fixing.
> > >
> > > Please make any edits to this clean version after making sure "Track Changes" is turned on.
> > >
> > > I will review the key updates on the TC call this afternoon, but the major areas to review are:
> > >
> > > * Sections 2.1 - 2.3 of the Motivations section. These were rewritten for the third time to reflect the consensus regarding terminology.
> > >
> > > * Requirement 3.1.2 was rewritten to reflect the URN conformance topic as discussed on the list.
> > >
> > > * The original requirements section 3.3 was broken into the new sections 3.3 and 3.4 to reflect the clarifications in 2.2 and 2.3 about persistence and HFIs/MFIs.
> > >
> > > * 3.3.5 (Non-Resolvable Syntax) was added to reflect a requirement Marc Le Maitre has surfaced from the Namespace committee of the U.S. XML.gov working group.
> > >
> > > * 3.4.6 (Internationalization) was edited to reflect Nat's input regarding IRIs. We should discuss this on today's call.
> > >
> > > * The Glossary was updated and all TO DO's in it were finished.
> > >
> > > The only remaining TO DOs are a few entries in the informative glossary and Appendix A (Acknowledgments).
> > >
> > > Talk to everyone at 3pm PDT.
> > >
> > > =Drummond
> > >
> > > -----Original Message-----
> > > From: Drummond Reed
> > > Sent: Thursday, May 15, 2003 11:13 AM
> > > To: xri@lists.oasis-open.org
> > > Subject: [xri] Groups - xri-requirements-1.0-draft-05b.doc uploaded
> > >
> > > The document xri-requirements-1.0-draft-05b.doc has been submitted by Drummond Reed (drummond.reed@onename.com) to the Extensible Resource Identifier TC document repository.
> > >
> > > Document Description:
> > > v5b of XRI Requirements and Glossary - This is a CLEAN version with a faulty MS Word cross-reference fixed. Please submit any edits using this version.
> > >
> > > Download Document:
> > > http://www.oasis-open.org/apps/org/workgroup/xri/download.php/2050/xri-requirements-1.0-draft-05b.doc
> > >
> > > View Document Details:
> > > http://www.oasis-open.org/apps/org/workgroup/xri/document.php?document_id=2050
> > >
> > > PLEASE NOTE: If the above links do not work for you, your email application may be breaking the link into two pieces. You may be able to copy and paste the entire link address into the address field of your web browser.
> > >
> > > -OASIS Open Administration

You may leave a Technical Committee at any time by visiting http://www.oasis-open.org/apps/org/workgroup/xri/members/leave_workgroup.php
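
The (P2) argument in Nat's May 21 message can be illustrated with a minimal Python sketch. The names iri_to_uri (standing in for f) and uri_to_iri (standing in for g), and the example.org addresses, are assumptions made for illustration only; they are not taken from the IRI draft, and the further restrictions mentioned under (b) are not modeled here.

```python
import re

# A run of one or more %HH escapes, e.g. "%C3%A9".
_ESCAPED_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def iri_to_uri(iri: str) -> str:
    """f(): escape each non-ASCII character as its UTF-8 octets (%HH).

    ASCII characters, including escapes already present, pass through.
    """
    return "".join(
        ch if ord(ch) < 0x80
        else "".join(f"%{octet:02X}" for octet in ch.encode("utf-8"))
        for ch in iri
    )


def uri_to_iri(uri: str) -> str:
    """g(): unescape a run of %HH only if it is valid non-ASCII UTF-8."""

    def unescape(match: re.Match) -> str:
        run = match.group(0)
        octets = bytes.fromhex(run.replace("%", ""))
        try:
            text = octets.decode("utf-8")
        except UnicodeDecodeError:
            # Case (a): the octets did not originate from UTF-8
            # (e.g. iso-8859-1), so the escapes must be kept.
            return run
        if any(ord(ch) < 0x80 for ch in text):
            # Simplification: leave runs containing escaped ASCII untouched.
            return run
        return text

    return _ESCAPED_RUN.sub(unescape, uri)


original = "http://example.org/r\u00e9sum\u00e9"      # IRI with a literal e-acute
uri = iri_to_uri(original)                             # .../r%C3%A9sum%C3%A9
round_trip_ok = uri_to_iri(uri) == original            # True for this IRI

legacy = "http://example.org/r%E9sum%E9"               # iso-8859-1 escapes
legacy_kept = uri_to_iri(legacy) == legacy             # case (a): stays escaped

pre_escaped = "http://example.org/r%C3%A9sum%C3%A9"    # IRI that already used escapes
changed = uri_to_iri(iri_to_uri(pre_escaped)) != pre_escaped  # g(f(x)) != x
```

In this sketch the last case is the one the IRI draft warns about: two different IRIs (one with a literal character, one with its escaped form) map to the same URI, so converting back cannot recover which one was the original.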