xri message

Subject: RE: I18N strategy

From: "Sakimura, Nat" <n-sakimura@nri.co.jp>
To: "Wachob, Gabe" <gwachob@visa.com>
Date: Fri, 23 May 2003 10:47:33 +0900

I agree. But at the same time, since an IRI will probably not carry enough information by itself, we need to do a bit of tweaking on the resolution side. Unlike DNS, we may want an option for passing the resolution context (e.g. language) to the resolver.

Nat

-----Original Message-----
From: Wachob, Gabe [mailto:gwachob@visa.com]
Sent: Friday, May 23, 2003 3:14 AM
To: Sakimura, Nat
Subject: RE: I18N strategy

Well, I think the bottom line is that we can go forward with the IRI stuff for now - please do remind us when we are making assumptions that would limit our future flexibility with respect to UTF-8. I think that's all we can do for now.

-Gabe

> -----Original Message-----
> From: Sakimura, Nat [mailto:n-sakimura@nri.co.jp]
> Sent: Thursday, May 22, 2003 11:04 AM
> To: Wachob, Gabe
> Subject: RE: I18N strategy
>
> There actually is a shift-in mechanism to signify the language in Unicode. We can probably use it in theory, though the Unicode spec strongly discourages it. The problem is, "how does one insert that character?" Gabe's suggestion is easier to deal with in this respect, but it is not particularly human friendly when reading. Another variation of this is to make the LANG property structured similarly to the identifier itself. E.g.,
>
> GET xri://English/Japanese.Korean.Chinese/
> LANG //en/ja.ko.zh/
> FONT //New York/Osaka.Seoul.Shanghai/
>
> UTF-8 is fine, but there can be other encoding schemes which may become more popular in the future. My intent was only that the spec perhaps should not depend on the UTF-8 structure for the resolution algorithm, so that we leave open the possibility of resolving other encoding schemes. After all, we might end up resolving a base64-encoded ISO-2022 string.
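A minimal sketch of the structured LANG idea above, assuming the LANG value mirrors the identifier segment for segment ("/" between levels, "." between siblings). The function names are hypothetical and nothing here comes from an XRI draft; it only shows how a resolver could pair each label with its language hint.

```python
def split_segments(value):
    """Split '//en/ja.ko.zh/' into [['en'], ['ja', 'ko', 'zh']]."""
    return [level.split(".") for level in value.strip("/").split("/")]


def annotate(xri, lang):
    """Pair every label of the identifier with the language hint in the
    same position of the LANG value (hypothetical helper)."""
    if not xri.startswith("xri:"):
        raise ValueError("expected an xri: identifier")
    labels = split_segments(xri[len("xri:"):])
    hints = split_segments(lang)
    # The two structures must have the same shape, level by level.
    if [len(level) for level in labels] != [len(level) for level in hints]:
        raise ValueError("LANG structure does not match the identifier")
    return [
        (label, hint)
        for label_level, hint_level in zip(labels, hints)
        for label, hint in zip(label_level, hint_level)
    ]


pairs = annotate("xri://English/Japanese.Korean.Chinese/", "//en/ja.ko.zh/")
for label, hint in pairs:
    print(label, hint)
```

The identifier itself stays free of language markers; the cost is that the hint travels separately and must be kept in step with it.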
>
> -----Original Message-----
> From: Wachob, Gabe [mailto:gwachob@visa.com]
> Sent: Friday, May 23, 2003 1:57 AM
> To: Sakimura, Nat; Drummond Reed; Wachob, Gabe; xri@lists.oasis-open.org
> Cc: DI
> Subject: RE: I18N strategy
>
> I was wondering if it wouldn't be possible to insert language hints as part of the XRI syntax. Haven't really thought about this ..
>
> xri://english/,l=ja.japanese.,l=ko.korean.,l=ch.chinese
>
> Where ,l= is a special form which "switches" the "current language" (going from left to right - but this is potentially a problem too) to a new one. Everything would be understood to be in the "current language" until another language identifier is discovered.
>
> This is just a "top of the head" suggestion and I don't even know if it would address the bulk of the problems Nat has identified.
>
> It seems hard to justify not going with the IRI framework and accepting the limitations it has - but I *am* concerned that we are potentially alienating a huge segment of the world's population. On the other hand, I'm afraid that not following the IRI (and W3C Web Character Model - a very good read for people not yet familiar with I18N issues at http://www.w3.org/TR/charmod/ ) would mean that we would have to bite off a lot of work for ourselves, and might create all sorts of interoperability problems.
>
> Nat, I'm curious what "not relying on UTF-8" means. UTF-8 is just a way of interpreting a series of octets. Do you mean that we shouldn't have resolvers assume that character strings are in UTF-8? If so, how do resolvers know what encoding is being used in an XRI?
>
> It sounds like there is consensus on just moving ahead with IRI/Unicode - but I still don't feel quite comfortable that we've worked through this issue completely (or "beat this dead horse completely", depending on your point of view).
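The ",l=" language switch suggested in this message can be sketched as a left-to-right scan. This is only an illustration under assumptions: the default language for text before the first switch ("en" here) and the function name are hypothetical, and the example string is copied from the message as written.

```python
def split_language_runs(xri, default_lang="en"):
    """Return (language, text) runs for an identifier that uses the
    hypothetical ',l=' switch. Text before the first switch is tagged
    with default_lang, which is an assumption of this sketch."""
    scheme = "xri://"
    if not xri.startswith(scheme):
        raise ValueError("expected an xri:// identifier")
    first, *switched = xri[len(scheme):].split(",l=")
    runs = []
    if first:
        runs.append((default_lang, first))
    for chunk in switched:
        # The language code runs up to the first '.', the rest is the text.
        lang, _, text = chunk.partition(".")
        runs.append((lang, text))
    return runs


runs = split_language_runs(
    "xri://english/,l=ja.japanese.,l=ko.korean.,l=ch.chinese"
)
for lang, text in runs:
    print(lang, repr(text))
```

The scan makes the left-to-right concern concrete: every run depends on the switches before it, so a substring of the identifier cannot be interpreted without its prefix.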
>
> -Gabe
>
> > -----Original Message-----
> > From: Sakimura, Nat [mailto:n-sakimura@nri.co.jp]
> > Sent: Thursday, May 22, 2003 1:06 AM
> > To: Drummond Reed; Wachob, Gabe; xri@lists.oasis-open.org
> > Cc: DI
> > Subject: RE: I18N strategy
> >
> > My opinion for the time being is like this:
> >
> > (1) For the time being at least, we have to live with Unicode and UTF-8. This, in turn, means that we abandon the idea of using different languages in one identifier, but we must abide. This is a hard decision because we do federate. In XRI, something like xri://Arabic/Devanagari/Japanese.Korean.Chinese/ is possible as a concept, but unfortunately we cannot make it reliably human readable with Unicode.
> >
> > (2) Prepare for something else in the future, i.e., try not to depend on UTF-8. The resolver should be able to handle any byte stream.
> >
> > (3) We have to have some way of attaching the language information to it. Otherwise, the resolver cannot reliably search and resolve. For example, an HTTP-based resolver can do something like:
> >
> > GET xri://an-xri-identifier
> > LANG="ja"
> >
> > The resolver will probably need to advertise its capability to the clients.
> >
> > (4) Search thesaurus. This is purely optional, but some implementations may want to implement a character thesaurus feature so that variant forms of one character can be searched with a single input. For example, if one searches for u umlaut, both U+00FC and U+0075 + U+0308 are searched.
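For the u umlaut case in (4), Unicode normalization already treats the two forms as canonically equivalent. The sketch below uses Python's standard unicodedata module. The index and the record name are hypothetical, and normalization covers only canonical equivalents, not the broader variant-character thesaurus described above.

```python
import unicodedata


def thesaurus_key(text):
    """Fold canonically equivalent spellings onto one key (NFC form)."""
    return unicodedata.normalize("NFC", text)


precomposed = "\u00fc"   # U+00FC LATIN SMALL LETTER U WITH DIAERESIS
decomposed = "u\u0308"   # U+0075 followed by U+0308 COMBINING DIAERESIS

# Hypothetical lookup table keyed by the normalized form.
index = {thesaurus_key("m\u00fcller"): "record-1"}


def lookup(query):
    """Find a record whichever of the two spellings the client sent."""
    return index.get(thesaurus_key(query))


print(precomposed == decomposed)                                # the raw strings differ
print(thesaurus_key(precomposed) == thesaurus_key(decomposed))  # the keys match
print(lookup("mu\u0308ller"))
```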
> >
> > Nat
> >
> > -----Original Message-----
> > From: Drummond Reed [mailto:drummond.reed@onename.com]
> > Sent: Thursday, May 22, 2003 3:26 PM
> > To: Sakimura, Nat; Wachob, Gabe; xri@lists.oasis-open.org
> > Cc: DI
> > Subject: I18N strategy
> >
> > [Note thread change: was RE: [xri] Groups - xri-requirements-1.0-draft-05b.doc uploaded]
> >
> > Nat,
> >
> > Thank you very much for this exceptionally clear and cogent analysis of the I18N challenges we face. For those of us who are not specialists in I18N, it really helps to understand the major issues and the tradeoffs involved with both IRI and Unicode.
> >
> > There is no question in my mind (and never has been) that XRI must be internationalized to the greatest extent feasible at any particular point in time. I know Gabe feels the same way and I believe this is the consensus of the entire TC (anyone who disagrees, please speak up now).
> >
> > So the key decision we face is: what is "the greatest extent feasible" in May 2003? Based on your knowledge of the problem space, what strategy do you suggest the TC follow to make this decision?
> >
> > Also, as a procedural note, I suggest we add an Internationalization section to the XRI spec outline in which we document this strategy, as many in the I18N community will be interested specifically in how the spec handles this issue.
> >
> > =Drummond
> >
> > -----Original Message-----
> > From: Sakimura, Nat [mailto:n-sakimura@nri.co.jp]
> > Sent: Wednesday, May 21, 2003 3:10 AM
> > To: Wachob, Gabe; xri@lists.oasis-open.org
> > Cc: DI
> > Subject: RE: [xri] Groups - xri-requirements-1.0-draft-05b.doc uploaded
> >
> > Thanks Gabe and Peter for your comments.
> >
> > The more I study this problem, the more it seems that the source of the problem actually is Unicode and ISO 10646-1. I wish we had made DIS 10646 ver.1.0 the ISO standard.
> > Japan was pushing for it, but most people were not worried about the limitation of Unicode at the time, which became apparent in 5 years. Much of the problem of the IRI also arises from it. Essentially, the source of the problem comes from the fact that in Unicode (and thus UTF-8, ISO 10646-1), you cannot distinguish the language from the code unless you are lucky.
> >
> > Without going into the details, let me restate the original problem raised earlier on this ML in the following fashion.
> >
> > There seem to be two issues involved in IRI <-> URI conversion. The first is actually
> > (P1) local charset IRI to UTF-8 IRI conversion,
> > and the second is
> > (P2) UTF-8 IRI to URI conversion.
> >
> > The issue around (P2) is much easier than (P1), so I will start from it.
> >
> > (P2) above can be stated as follows.
> > Define f() : IRI (UTF-8) -> URI escaping function.
> > Define g() : URI -> IRI conversion function.
> > Let u = f(i).
> > Then there exists an i such that i != g(u).
> >
> > The reason:
> > g() != f^-1() because of the following.
> >
> > (a) g() must not result in an octet sequence that is not part of a strictly legal UTF-8 octet sequence: a URI may contain an escaped sequence that did not originate from UTF-8, e.g., a sequence in the iso-8859-1 encoding.
> > (b) There are further restrictions on the legal octet stream over the UTF-8. E.g., half-width Japanese kana is not legal in an IRI. These have to be re-escaped.
> >
> > This is valid in general, i.e., if we consider g() and f^-1() over the set U, which is the legal URI space, g() and f^-1() are not equal. (In other words, although f(g(u)) = u for any u, there exists an x such that g(f(x)) != x.)
> >
> > I was wondering if this is much of a concern for us.
> > The IRI draft spec states that "the IRI resulting from this conversion may not be exactly the same as the original IRI (if there ever was one)", but we are not concerned about U, only about a subset of U which is derived by the conversion of IRIs.
> >
> > Proposition: for any x which is an element of U' such that
> > U' = f(I), where I is the set of legal IRIs,
> > g(x) = f^-1(x).
> >
> > Reason:
> > (1) Any escaped octet in this case is guaranteed to have come out of UTF-8. So, we do not have to worry about (a) above.
> > (2) The set "I" does not include an element referred to by (b) above.
> >
> > (I am just thinking in abstract logic. I am not familiar with BIDI and other problem spaces, but mathematically, the above looks good to me. One caveat: I have to check the escape function f() again to see if it is good. f() must be such that f^-1() exists.)
> >
> > Thus, the source of the problem seems to lie in the fact that the set of legal URIs is a superset of the set of URIs derived from IRIs. This indeed is a source of compatibility problems, as Peter says. This poses us a question: "Do we really want to state things in terms of URI, at least temporarily?" My answer is NO. If we do that, we will be strangled in the same fashion as IRI is.
> >
> > The next problem (P1) is much more complicated. The problem (P1) is actually a problem of UTF-8 and Unicode. The union of local charsets is larger than the UTF-8 space. Moreover, Unicode/UTF-8/ISO 10646 has no way of telling what language context it is in by itself. It has to be used in conjunction with some other information like locale, language flag, or font set.
> >
> > UTF-8 collapses several distinct but similar-looking local characters into one representation. For example, the first character of my last name "Saki" cannot be represented correctly with it.
> > To the eye of a "Latin character world" person, the difference may look just like that between a straight and a cursive "g", but it is not. If I submit an official document with the UTF-8 representation of my name, the government will not accept it because the name is different. Local charset to UTF-8 conversion is actually a many-to-one function. Thus, an inverse function does not exist; going back is only a one-to-many mapping. The situation becomes even more complicated when we mix several languages.
> >
> > This one will never be solved unless we abandon UTF-8, or Unicode for that matter. There are supplemental problems in searching through combined characters etc. For example, Devanagari would have its own representation problem. I would like to draw Reva Modi's attention to it, who is probably more knowledgeable on these matters than I. I have already contacted Mr. Kobayashi, who is the only member of the Unicode Technical Committee from Japan, and will continue doing so. From what I found, Mr. Kobayashi is also a bit negative on the capability of Unicode for the sake of real I18N. (On the other hand, DIS 10646 ver.1.0 would have done it properly. Unicode and ISO 10646 were originally fixed length encodings. One of the main reasons for pushing Unicode and not DIS 10646 ver.1.0 was that the latter was a variable length encoding and perceived as inefficient. The shortcoming became apparent within 3 years, and this fixed length thing was abandoned in 1996. Now, we have lost the fixed length-ness, and are facing the problems that DIS 10646 ver.1.0 would not have had. After all, at the time of the ISO 10646 voting, it seems the only multi-byte country with a long history of computerized local language processing was Japan. Majority voting does not always produce the correct result...)
> > Today, even HTML ver.4 is asking for a language switch like LANG="ja" and LANG="en". This is another indicator of the fact that the dream of Unified Character Coding is lost.
> >
> > Now, do we want to tackle this formidable problem? Hmmm. A good question.
> >
> > The bottom line is: if we decide to live in the UTF-8 world, then IRI seems to be OK. If we want real I18N, we probably are leaving the Unicode world.
> >
> > On the other hand, we MUST NOT point to IRI in the normative spec. We should extract the IRI spec and insert it into our spec until IRI actually becomes a spec. (If I remember correctly, this was the requirement for the use of the proposed IRI spec.)
> >
> > Nat
> >
> > -----Original Message-----
> > From: Wachob, Gabe [mailto:gwachob@visa.com]
> > Sent: Wednesday, May 21, 2003 1:49 AM
> > To: xri@lists.oasis-open.org
> > Subject: RE: [xri] Groups - xri-requirements-1.0-draft-05b.doc uploaded
> >
> > Nat-
> > Actually, the IRI spec says that mapping from URIs to IRIs unambiguously requires context not present in a URI. In http://www.w3.org/International/iri-edit/draft-duerst-iri.html#URItoIRI the problem is demonstrated in the situation where you convert an IRI to a URI and back -- "the IRI resulting from this conversion may not be exactly the same as the original IRI". Maybe we can address this ambiguity - perhaps you can figure this out better than I have done so far.
> >
> > Also, as for "defining a URI" vs. "defining an IRI" - I'm not sure how this plays out. We absolutely need to be able to use XRIs anywhere one would use a URI. However, we know that IRIs can always be mapped to URIs (and in the case where there are no non-URI characters in an IRI, the IRI is syntactically equivalent to the URI to start with).
> >
> > As for equivalence - that's something we *have* to discuss as the specifiers of a URI scheme.
> > We don't have to say much - we can say that two XRI URIs are "equivalent" if they are octet-by-octet the same (though there are issues about unescaping sequences before or after the comparison). I suppose it gets trickier if you define XRIs as an IRI scheme.
> >
> > The other problem with relying on the IRI spec right now is that it's not a spec yet. It's still only a draft over at the IETF, and the IETF process is slow. I'm guessing we won't see a finalized IRI spec in 2003.
> >
> > Don't get me wrong - I think we should leverage IRIs somehow. I'd even be in favor of defining XRIs as an IRI scheme if we could ensure that would not cause any problems for those many places where URIs are called for (after conversion to the URI form). I just think it's more complicated than simply referring to the IRI spec (a lot more complicated).
> >
> > -Gabe
> >
> > > -----Original Message-----
> > > From: Sakimura, Nat [mailto:n-sakimura@nri.co.jp]
> > > Sent: Tuesday, May 20, 2003 12:06 AM
> > > To: xri@lists.oasis-open.org
> > > Subject: RE: [xri] Groups - xri-requirements-1.0-draft-05b.doc uploaded
> > >
> > > Gabe,
> > >
> > > Conceptually, IRI is a larger set than URI (IRI includes URI), but both are countable and thus can be mapped one to one, I think. Could you give me an example of mapping one URI to multiple IRIs please?
> > >
> > > Fundamentally, the question for us probably is "do we really want to be bound by this aging URI standard?" To me, the URI vs. IRI controversy is largely due to backward compatibility issues. If we think afresh, we probably would not choose URI to be the normative format because it is the source of a myriad of problems for I18N.
> > > Unicode is not perfect (some purists say that it is useless - it generally cannot distinguish among similar but distinct characters because these are collapsed into one), but is much cleaner. Resolution does not have to go through the transformation to URI. Our internationalized identifier should be able to be resolved directly.
> > >
> > > On equivalence: I think URI equivalence arguments do not affect us. This is because we have an abstract permanent identifier, which can be pretty restrictive in the allowed character set as we do not need the human readability. To test the equivalence of two identifiers, we should resolve them to their permanent identifiers and compare those. To protect privacy, we might not want to expose the permanent identifier. In this case, the proxy should give out a True/False result. We have a much more powerful tool than URIs in this regard.
> > >
> > > Nat
> > >
> > > -----Original Message-----
> > > From: Wachob, Gabe [mailto:gwachob@visa.com]
> > > Sent: Friday, May 16, 2003 4:25 AM
> > > To: 'Drummond Reed'; xri@lists.oasis-open.org
> > > Subject: RE: [xri] Groups - xri-requirements-1.0-draft-05b.doc uploaded
> > >
> > > Drummond-
> > > A few notes.
> > >
> > > First, in section 3.4.5 (you said 3.3.5) - "non-resolvable syntax" - what's the use case? Why do we need to *prevent* an attempt to resolve? Why would a software component resolve an identifier unless it needed to? It seems like there are only two cases: a piece of software needs to resolve the identifier, or it doesn't. This decision is based on application semantics, not the syntax of the identifier. How does marking an identifier as "non-resolvable" help at all?
> > >
> > > In section 3.4.6 (internationalization) - there is a discussion going on at the W3C TAG (issue named something like "IRIEverywhere") about where IRIs are appropriate. It is clear, for example, that IRIs cannot be used everywhere URIs can be used. The issue is whether *future* specs should refer to IRIs or URIs. An IRI can be "cast down" into a URI unambiguously, but because there are several ways to translate Unicode into ASCII, it's not always possible to unambiguously convert a URI back into an IRI (without some context like the encoding used to go from IRI to URI). So, while I think we should definitely address IRIs and XRIs, I don't think XRIs should expect to be solving the problems that IRIs have with their relationship to URIs. We *could* propose a way to encode the things that are needed to unambiguously convert a URI back into an IRI, but I'm guessing that would actually break the IRI spec. I'm going out beyond my competency here, I think.
> > >
> > > Bottom line is that we either have to wait for the IRI things to shake out, or we have to tread new ground in i18n. I *definitely* want XRIs to be "i18n enabled", but I'm a little worried about us planning on achieving that in the short term by relying on IRIs.
> > >
> > > This document has come a LONG way and I think does a pretty good job of identifying why we are all here. Congrats and thanks to all those who contributed. I'm sure there will be more input and fixes to the doc, but I feel like we're very close to the "good enough" state where we can then concentrate on the syntax and resolution specs.
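The point that a URI cannot be turned back into an IRI without knowing the encoding can be shown with Python's standard urllib.parse. This is a sketch only: the identifier "xri://example/café" and both helper names are made up for illustration, and the escaping is plain percent-encoding rather than the full conversion defined in the IRI draft.

```python
from urllib.parse import quote, unquote


def iri_to_uri(iri):
    """Percent-escape the UTF-8 octets of every non-ASCII character."""
    return quote(iri, safe=":/", encoding="utf-8")


def uri_to_iri(uri, encoding):
    """Unescape, interpreting the octets in the given encoding."""
    return unquote(uri, encoding=encoding, errors="strict")


uri = iri_to_uri("xri://example/caf\u00e9")
print(uri)

# The same URI yields different IRIs depending on the assumed encoding.
print(uri_to_iri(uri, "utf-8"))
print(uri_to_iri(uri, "iso-8859-1"))

# A URI whose escapes came from ISO-8859-1 is not valid UTF-8 at all.
legacy_uri = "xri://example/caf%E9"
try:
    uri_to_iri(legacy_uri, "utf-8")
    legacy_is_utf8 = True
except UnicodeDecodeError:
    legacy_is_utf8 = False
print(legacy_is_utf8)
```

Going from IRI to URI is deterministic once UTF-8 is fixed, while the reverse direction needs the encoding as out-of-band context. That missing context is what this message refers to.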
> > >
> > > -Gabe
> > >
> > > > -----Original Message-----
> > > > From: Drummond Reed [mailto:drummond.reed@onename.com]
> > > > Sent: Thursday, May 15, 2003 11:45 AM
> > > > To: xri@lists.oasis-open.org
> > > > Subject: RE: [xri] Groups - xri-requirements-1.0-draft-05b.doc uploaded
> > > >
> > > > First, let me note two reasons for posting v5b:
> > > >
> > > > 1) I found out from Marc Le Maitre this morning that leaving "Track Changes" on screwed up the section numbering, so it makes it difficult to talk about requirement numbers. Let's use v5b on the call today.
> > > >
> > > > 2) There was an MS Word cross-reference error (unfortunately not all that uncommon) in 3.4.7 that needed fixing.
> > > >
> > > > Please make any edits to this clean version after making sure "Track Changes" is turned on.
> > > >
> > > > I will review the key updates on the TC call this afternoon, but the major areas to review are:
> > > >
> > > > * Sections 2.1 - 2.3 of the Motivations section. These were rewritten for the third time to reflect the consensus regarding terminology.
> > > >
> > > > * Requirement 3.1.2 was rewritten to reflect the URN conformance topic as discussed on the list.
> > > >
> > > > * The original requirements section 3.3 was broken into the new sections 3.3 and 3.4 to reflect the clarifications in 2.2 and 2.3 about persistence and HFIs/MFIs.
> > > >
> > > > * 3.3.5 (Non-Resolvable Syntax) was added to reflect a requirement Marc Le Maitre has surfaced from the Namespace committee of the U.S. XML.gov working group.
> > > >
> > > > * 3.4.6 (Internationalization) was edited to reflect Nat's input regarding IRIs. We should discuss this on today's call.
> > > >
> > > > * The Glossary was updated and all TO DOs in it were finished.
> > > >
> > > > The only remaining TO DOs are a few entries in the informative glossary and Appendix A (Acknowledgments).
> > > >
> > > > Talk to everyone at 3pm PDT.
> > > >
> > > > =Drummond
> > > >
> > > > -----Original Message-----
> > > > From: Drummond Reed
> > > > Sent: Thursday, May 15, 2003 11:13 AM
> > > > To: xri@lists.oasis-open.org
> > > > Subject: [xri] Groups - xri-requirements-1.0-draft-05b.doc uploaded
> > > >
> > > > The document xri-requirements-1.0-draft-05b.doc has been submitted by Drummond Reed (drummond.reed@onename.com) to the Extensible Resource Identifier TC document repository.
> > > >
> > > > Document Description:
> > > > v5b of XRI Requirements and Glossary - This is a CLEAN version with a faulty MS Word cross-reference fixed. Please submit any edits using this version.
> > > >
> > > > Download Document:
> > > > http://www.oasis-open.org/apps/org/workgroup/xri/download.php/2050/xri-requirements-1.0-draft-05b.doc
> > > >
> > > > View Document Details:
> > > > http://www.oasis-open.org/apps/org/workgroup/xri/document.php?document_id=2050
> > > >
> > > > PLEASE NOTE: If the above links do not work for you, your email application may be breaking the link into two pieces. You may be able to copy and paste the entire link address into the address field of your web browser.
> > > >
> > > > -OASIS Open Administration

You may leave a Technical Committee at any time by visiting http://www.oasis-open.org/apps/org/workgroup/xri/members/leave_workgroup.php
