
xri message



Subject: RE: I18N strategy


Drummond,

This actually is pretty good in the sense that we do not have to bring in yet another syntactical structure. It is easy enough to type in as well.

Nonetheless, I still feel that we need some way of telling the server the default context separately from the identifier itself, just like HTTP, because many people would omit the language portion. In reality, I believe not much mixing of languages would happen, so a default context setting (in this case, lang and charset) should be pretty useful.

Nat

-----Original Message-----
From: Drummond Reed [mailto:drummond.reed@onename.com]
Sent: Friday, May 23, 2003 3:42 PM
To: Wachob, Gabe; Sakimura, Nat; xri@lists.oasis-open.org
Cc: DI
Subject: RE: I18N strategy

Gabe, it may be exactly the right track putting the language metadata "inline", i.e., in the XRI syntax. But it should take the same form as we concluded versioning metadata should, i.e., another form of cross-reference. In fact, it could even be a use case of a global xref that you and I once discussed. Back then the example I was using was a version date that would apply globally to all the reassignable identifiers in an XRI, essentially freezing them at a moment in time. The syntax to do that would be to make the entire XRI relative to a top-level xref that sets the datetime stamp, i.e.:

xri://($d/2000-01-12T)/www.example.com/:1234:4678

(Note that I'm using $ as the special spec context symbol.)

We could do the same thing with language metadata, and it could appear at any level. As you suggest, the default would be the same as the next higher level until you explicitly changed languages. So your example (assuming "l" was the letter in the $ namespace for language) would look like:

xri://($l/en).english.($l/ja).japanese.($l/ko).korean.,($l/ch).chinese

(It's not terribly human-friendly, but there's a good chance a UI would be able to discern the different languages being used to create the XRI and insert the language xrefs automatically.)
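To make the language-xref idea concrete, here is a minimal sketch in Python of how a client might carry the "current language" from left to right. It is only an illustration under assumptions: the `($l/xx)` token shape is taken from the example above, while the function names, the choice of "." and "/" as segment separators, and the `default_lang` parameter (standing in for the out-of-band default context Nat asks for) are hypothetical and not part of any XRI draft. The sample input drops the stray comma before the last xref.

```python
import re

# Assumed token shape: "($l/<tag>)" switches the current language.
LANG_XREF = re.compile(r"\(\$l/([A-Za-z-]+)\)")


def _collect(text, lang, result):
    """Split text on '.' or '/' and record each non-empty segment with lang."""
    for segment in re.split(r"[./]", text):
        if segment:
            result.append((lang, segment))


def tag_segments(xri, default_lang=None):
    """Return (language, segment) pairs, carrying the language left to right.

    default_lang is a hypothetical out-of-band default that applies until
    the first language xref appears.
    """
    if not xri.startswith("xri://"):
        raise ValueError("expected an xri:// identifier")
    body = xri[len("xri://"):]
    current = default_lang
    result = []
    pos = 0
    for match in LANG_XREF.finditer(body):
        # Text before this xref still belongs to the previous language.
        _collect(body[pos:match.start()], current, result)
        current = match.group(1)
        pos = match.end()
    _collect(body[pos:], current, result)
    return result


pairs = tag_segments(
    "xri://($l/en).english.($l/ja).japanese.($l/ko).korean.($l/ch).chinese"
)
```

With a default supplied, `tag_segments("xri://plain.($l/ja).nihongo", default_lang="en")` would tag `plain` as `en` and `nihongo` as `ja`, which is roughly how a default context and inline xrefs could coexist.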
Nat, how far would this go towards the issue you raised?

=Drummond

-----Original Message-----
From: Wachob, Gabe [mailto:gwachob@visa.com]
Sent: Thursday, May 22, 2003 9:57 AM
To: 'Sakimura, Nat'; Drummond Reed; Wachob, Gabe; xri@lists.oasis-open.org
Cc: DI
Subject: RE: I18N strategy

I was wondering if it wouldn't be possible to insert language hints as part of the XRI syntax. I haven't really thought this through, but something like:

xri://english/,l=ja.japanese.,l=ko.korean.,l=ch.chinese

where ,l= is a special form which "switches" the "current language" (going from left to right, though this is potentially a problem too) to a new one. Everything would be understood to be in the "current language" until another language identifier is encountered.

This is just a "top of the head" suggestion and I don't even know if it would address the bulk of the problems Nat has identified.

It seems hard to justify not going with the IRI framework and accepting the limitations it has, but I *am* concerned that we are potentially alienating a huge segment of the world's population. On the other hand, I'm afraid that not following the IRI (and the W3C Web Character Model, a very good read for people not yet familiar with I18N issues, at http://www.w3.org/TR/charmod/ ) would mean that we would have to bite off a lot of work for ourselves, and might create all sorts of interoperability problems.

Nat, I'm curious what "not relying on UTF-8" means. UTF-8 is just a way of interpreting a series of octets. Do you mean that we shouldn't have resolvers assume that character strings are in UTF-8? If so, how do resolvers know what encoding is being used in an XRI?

It sounds like there is consensus on just moving ahead with IRI/Unicode, but I still don't feel quite comfortable that we've worked through this issue completely (or "beat this dead horse completely", depending on your point of view).
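The question of how a resolver would know the encoding can be illustrated with a small Python sketch. It only shows that the same octets read differently (or not at all) depending on the charset assumed; the helper name `interpret` is hypothetical, and nothing here is meant to describe how an XRI resolver would actually behave.

```python
def interpret(octets, charset):
    """Decode octets under the assumed charset, or return None if they are invalid."""
    try:
        return octets.decode(charset)
    except UnicodeDecodeError:
        return None


# The same character, U+00E9 (e with acute accent), in two encodings.
utf8_octets = "\u00e9".encode("utf-8")          # two octets: C3 A9
latin1_octets = "\u00e9".encode("iso-8859-1")   # one octet: E9

as_utf8 = interpret(utf8_octets, "utf-8")            # the original character
as_latin1 = interpret(utf8_octets, "iso-8859-1")     # two unrelated characters
rejected = interpret(latin1_octets, "utf-8")         # not valid UTF-8 at all
```

Without some charset indication travelling with the identifier (or a fixed convention such as "always UTF-8"), a receiver cannot tell which of these readings was intended.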
-Gabe

> -----Original Message-----
> From: Sakimura, Nat [mailto:n-sakimura@nri.co.jp]
> Sent: Thursday, May 22, 2003 1:06 AM
> To: Drummond Reed; Wachob, Gabe; xri@lists.oasis-open.org
> Cc: DI
> Subject: RE: I18N strategy
>
> My opinion for the time being is like this:
>
> (1) For the time being at least, we have to live with Unicode and UTF-8. This, in turn, means that we abandon the idea of using different languages in one identifier, but we must abide. This is a hard decision because we do federate. In XRI, something like xri://Arabic/Devanagari/Japanese.Korean.Chinese/ is possible as a concept, but we cannot make it human readable reliably with Unicode, unfortunately.
>
> (2) Prepare for something else in the future, i.e., try not to depend on UTF-8. The resolver should be able to handle any byte stream.
>
> (3) We have to have some way of attaching the language information to it. Otherwise, the resolver cannot reliably search and resolve. For example, an HTTP-based resolver can do something like:
>
> GET xri://an-xri-identifier
> LANG="ja"
>
> The resolver will probably need to advertise its capability to the clients.
>
> (4) Search thesaurus. This is purely optional, but some implementations may want to implement a character thesaurus feature so that variant forms of one character can be searched by a single input. For example, if one searches for u-umlaut, both U+00FC and U+0075 + U+0308 are searched.
>
> Nat
>
> -----Original Message-----
> From: Drummond Reed [mailto:drummond.reed@onename.com]
> Sent: Thursday, May 22, 2003 3:26 PM
> To: Sakimura, Nat; Wachob, Gabe; xri@lists.oasis-open.org
> Cc: DI
> Subject: I18N strategy
>
> [Note thread change: was RE: [xri] Groups - xri-requirements-1.0-draft-05b.doc uploaded]
>
> Nat,
>
> Thank you very much for this exceptionally clear and cogent analysis of the I18N challenges we face.
> For those of us who are not specialists in I18N, it really helps to understand the major issues involved and the tradeoffs involved with both IRI and Unicode.
>
> There is no question in my mind (and never has been) that XRI must be internationalized to the greatest extent feasible at any particular point in time. I know Gabe feels the same way and I believe this is the consensus of the entire TC (anyone who disagrees, please speak up now).
>
> So the key decision we face is: what is "the greatest extent feasible" in May 2003? Based on your knowledge of the problem space, what strategy do you suggest the TC follow to make this decision?
>
> Also, as a procedural note, I suggest we add an Internationalization section to the XRI spec outline in which we document this strategy, as many in the I18N community will be interested specifically in how the spec handles this issue.
>
> =Drummond
>
> -----Original Message-----
> From: Sakimura, Nat [mailto:n-sakimura@nri.co.jp]
> Sent: Wednesday, May 21, 2003 3:10 AM
> To: Wachob, Gabe; xri@lists.oasis-open.org
> Cc: DI
> Subject: RE: [xri] Groups - xri-requirements-1.0-draft-05b.doc uploaded
>
> Thanks Gabe and Peter for your comments.
>
> The more I study this problem, the more it seems that the source of the problem actually is Unicode and ISO 10646-1. I wish we had made DIS 10646 ver.1.0 the ISO standard. Japan was pushing for it, but most people were not worried about the limitations of Unicode at the time, which became apparent in 5 years. Much of the problem of the IRI also arises from it. Essentially, the source of the problem comes from the fact that in Unicode (and thus UTF-8, ISO 10646-1), you cannot distinguish the language from the code unless you are lucky.
>
> Without going into the details, let me restate the original problem stated earlier in this ML in the following fashion.
>
> There seem to be two issues involved in IRI <-> URI conversion.
> The first is actually
>   (P1) local charset IRI to UTF-8 IRI conversion,
> and the second is
>   (P2) UTF-8 IRI to URI conversion.
>
> The issue around (P2) is much easier than (P1), so I will start from it.
>
> (P2) above can be stated as follows.
>   Define f() : IRI (UTF-8) -> URI escaping function.
>   Define g() : URI -> IRI conversion function.
>   Let u = f(i).
>   Then there exists an i such that i != g(u).
>
> The reason:
>   g() != f^-1() because of the following.
>
>   (a) g() must not result in an octet sequence that is not part of a strictly legal UTF-8 octet sequence: a URI may contain an escaped sequence that did not originate from UTF-8, e.g., a sequence in the iso-8859-1 encoding.
>   (b) There are further restrictions on the legal octet stream over UTF-8. E.g., half-width Japanese kana is not legal in IRI. These have to be re-escaped.
>
> This is valid in general, i.e., if we consider g() and f^-1() over the set U, which is the legal URI space, g() and f^-1() are not equal. (In other words, although f(g(u)) = u for any u, there exists an x such that g(f(x)) != x.)
>
> I was wondering if this is much of a concern for us. The IRI draft spec states that "the IRI resulting from this conversion may not be exactly the same as the original IRI (if there ever was one)", but we are not concerned about U, only about a subset of U which is derived by the conversion of IRIs.
>
> Proposition: for any x, which is an element of U' such that
>   U' = f(I) where I is the set of legal IRIs,
> then
>   g(x) = f^-1(x) for any x.
>
> Reason:
>   (1) Any escaped octet in this case is guaranteed to have come out of UTF-8. So, we do not have to worry about (a) above.
>   (2) The set "I" does not include an element referred to by (b) above.
>
> (I am just thinking in abstract logic. I am not familiar with BIDI and other problem spaces, but mathematically, the above looks good to me.
> One caveat: I have to check the escape function f() again to see if it is good. f() must be such that f^-1() exists.)
>
> Thus, the source of the problem seems to lie in the fact that the set of legal URIs is a superset of the set of URIs derived from IRIs. This indeed is a source of compatibility problems, like Peter says. This poses us a question: "Do we really want to state things in terms of URI, at least temporarily?" My answer is NO. If we do that, we will be strangled in the same fashion as IRI is.
>
> The next problem, (P1), is much more complicated. The problem (P1) is actually a problem of UTF-8 and Unicode. The union of local charsets is larger than the UTF-8 space. Moreover, Unicode/UTF-8/ISO 10646 has no way of telling what language context it is in by itself. It has to be used in conjunction with some other information like locale, language flag, or font set.
>
> UTF-8 is collapsing several distinct but similar-looking local characters into one representation. For example, the first character of my last name, "Saki", cannot be represented correctly with it. To the eye of a "Latin character world" person, it may just be the same as straight and cursive "g", but it is not. If I submit an official document with the UTF-8 representation of my name, the government will not accept it because the name is different. Local charset to UTF-8 conversion is actually a many-to-one function. Thus, an inverse function does not exist. It is going to be a mapping. The situation becomes even more complicated when we mix several languages.
>
> This one will never be solved unless we abandon UTF-8, or Unicode for that matter. There are supplemental problems in searching through combined characters etc. For example, Devanagari would have its own representation problem. I would like to draw Reva Modi's attention to it, who is probably more knowledgeable on these matters than I. I have contacted Mr.
> Kobayashi, who is the only member of the Unicode Technical Committee from Japan, already, and will continue doing so. From what I found, Mr. Kobayashi is also a bit negative on the capability of Unicode for the sake of real I18N. (On the other hand, DIS 10646 ver.1.0 would have done it properly. Unicode and ISO 10646 were originally fixed-length encodings. One of the main reasons for pushing Unicode and not DIS 10646 ver.1.0 was that the latter was a variable-length encoding and perceived as inefficient. The shortcoming became apparent within 3 years, and this fixed-length thing was abandoned in 1996. Now we have lost the fixed length-ness, and are facing the problems that DIS 10646 ver.1.0 would not have had. After all, at the time of the ISO 10646 voting, it seems the only multi-byte country with a long history of computerized local language processing was Japan. Majority voting does not always produce the correct result...) Today, even HTML ver.4 is asking for a language switch like LANG="ja" and LANG="en". This is another indicator of the fact that the dream of Unified Character Coding is lost.
>
> Now, do we want to tackle this formidable problem? Hmmm. A good question.
>
> The bottom line is: if we decide to live in the UTF-8 world, then IRI seems to be OK. If we want real I18N, we probably are leaving the Unicode world.
>
> On the other hand, we MUST NOT point to IRI in the normative spec. We should extract the IRI spec and insert it into our spec until IRI actually becomes a spec. (If I remember correctly, this was the requirement for the use of the IRI proposed spec.)
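The (P2) argument above can be sketched in Python. This is a deliberately simplified model, not the conversion defined in the IRI draft: `f` percent-escapes the UTF-8 octets of an IRI, and `g` unescapes a URI only if the result is strictly legal UTF-8, otherwise leaving the URI untouched (the draft works per escape sequence and has further restrictions, such as point (b), that are not modelled here). The sample identifiers are made up for illustration.

```python
from urllib.parse import quote, unquote_to_bytes


def f(iri):
    """IRI -> URI: percent-escape the UTF-8 octets of non-ASCII characters."""
    return quote(iri, safe="/:")


def g(uri):
    """URI -> IRI (simplified): unescape only if the octets are legal UTF-8."""
    try:
        return unquote_to_bytes(uri).decode("utf-8")
    except UnicodeDecodeError:
        # Point (a): the escapes did not originate from UTF-8, so keep them.
        return uri


iri = "xri://example/r\u00e9sum\u00e9"
uri = f(iri)                 # escapes C3 A9 for each accented character
round_trip = g(uri)          # inside U' = f(I), g undoes f

legacy_uri = "xri://example/r%E9sum%E9"   # escapes produced from iso-8859-1
kept = g(legacy_uri)                      # left escaped; the charset is unknown
guess = unquote_to_bytes(legacy_uri).decode("iso-8859-1")  # needs outside context
```

In this model the round trip succeeds for URIs that came from `f`, which matches the proposition about U', while the legacy URI shows why `g` cannot be a true inverse over the whole URI space.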
>
> Nat
>
> -----Original Message-----
> From: Wachob, Gabe [mailto:gwachob@visa.com]
> Sent: Wednesday, May 21, 2003 1:49 AM
> To: xri@lists.oasis-open.org
> Subject: RE: [xri] Groups - xri-requirements-1.0-draft-05b.doc uploaded
>
> Nat-
> Actually, the IRI spec says that mapping from URIs to IRIs unambiguously requires context not present in a URI. In http://www.w3.org/International/iri-edit/draft-duerst-iri.html#URItoIRI the problem is demonstrated in the situation where you convert an IRI to a URI and back -- "the IRI resulting from this conversion may not be exactly the same as the original IRI". Maybe we can address this ambiguity; perhaps you can figure this out better than I have done so far.
>
> Also, as for "defining a URI" vs. "defining an IRI", I'm not sure how this plays out. We absolutely need to be able to use XRIs anywhere one would use a URI. However, we know that IRIs can always be mapped to URIs (and in the case where there are no non-URI characters in an IRI, the IRI is syntactically equivalent to the URI to start with).
>
> As for equivalence, that's something we *have* to discuss as the specifiers of a URI scheme. We don't have to say much: we can say that two XRI URIs are "equivalent" if they are octet-by-octet the same (though there are issues about unescaping sequences before or after the comparison). I suppose it gets trickier if you define XRIs as an IRI scheme.
>
> The other problem with relying on the IRI spec right now is that it's not a spec yet. It's still only a draft over at the IETF, and the IETF process is slow. I'm guessing we won't see a finalized IRI spec in 2003.
>
> Don't get me wrong, I think we should leverage IRIs somehow. I'd even be in favor of defining XRIs as an IRI scheme if we could ensure that would not cause any problems for those many places where URIs are called for (after conversion to the URI form).
> I just think it's more complicated than simply referring to the IRI spec (a lot more complicated).
>
> -Gabe
>
> > -----Original Message-----
> > From: Sakimura, Nat [mailto:n-sakimura@nri.co.jp]
> > Sent: Tuesday, May 20, 2003 12:06 AM
> > To: xri@lists.oasis-open.org
> > Subject: RE: [xri] Groups - xri-requirements-1.0-draft-05b.doc uploaded
> >
> > Gabe,
> >
> > Conceptually, IRI has a larger set than URI (IRI includes URI), but both are countable and thus can be mapped one to one, I think. Could you give me an example of mapping one URI to multiple IRIs please?
> >
> > Fundamentally, the question for us probably is "do we really want to be bound by this aging URI standard?" To me, the URI vs. IRI controversy is largely due to backward compatibility issues. If we think afresh, we probably would not choose URI to be the normative format because it is the source of a myriad of problems for I18N. Unicode is not perfect (some purists say that it is useless, since it generally cannot distinguish among similar but distinct characters because these are collapsed into one), but it is much cleaner. Resolution does not have to go through the transformation to URI. Our internationalized identifier should be able to be resolved directly.
> >
> > On equivalence: I think URI equivalence arguments do not affect us. This is because we have an abstract permanent identifier, which can be pretty restrictive in the allowed character set as we do not need human readability. To test the equivalence of two identifiers, we should resolve them to the permanent identifier and compare. To protect privacy, we might not want to expose the permanent identifier. In this case, the proxy should give out a True/False result. We have a much more powerful tool than URIs in this regard.
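One reason plain octet comparison is tricky for internationalized identifiers is the u-umlaut case mentioned elsewhere in this thread: U+00FC and U+0075 followed by U+0308 render identically but differ octet by octet. The Python sketch below contrasts the two comparisons using the standard `unicodedata` module. Choosing NFC here is an assumption made for illustration; the thread does not settle on any normalization form, and the helper names and sample strings are hypothetical.

```python
import unicodedata


def same_octets(a, b):
    """Octet-by-octet comparison of the UTF-8 forms."""
    return a.encode("utf-8") == b.encode("utf-8")


def same_after_nfc(a, b):
    """Comparison after Unicode normalization to NFC (an assumed choice)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


precomposed = "m\u00fcller"     # U+00FC, a single code point
decomposed = "mu\u0308ller"     # U+0075 followed by combining U+0308

octet_equal = same_octets(precomposed, decomposed)
nfc_equal = same_after_nfc(precomposed, decomposed)
```

Resolving both forms to a restricted permanent identifier, as suggested above, would sidestep this comparison entirely, at the cost of requiring a resolution step.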
> >
> > Nat
> >
> > -----Original Message-----
> > From: Wachob, Gabe [mailto:gwachob@visa.com]
> > Sent: Friday, May 16, 2003 4:25 AM
> > To: 'Drummond Reed'; xri@lists.oasis-open.org
> > Subject: RE: [xri] Groups - xri-requirements-1.0-draft-05b.doc uploaded
> >
> > Drummond-
> > A few notes.
> >
> > First, in section 3.4.5 (you said 3.3.5), "non-resolvable syntax": what's the use case? Why do we need to *prevent* an attempt to resolve? Why would a software component resolve an identifier unless it needed to? It seems like there are only two cases: a piece of software needs to resolve the identifier, or it doesn't. This decision is based on application semantics, not the syntax of the identifier. How does marking an identifier as "non-resolvable" help at all?
> >
> > In section 3.4.6 (internationalization): there is a discussion going on at the W3C TAG (issue named something like "IRIEverywhere") where the appropriateness of where IRIs should be used is being discussed. It is clear, for example, that IRIs cannot be used everywhere URIs can be used. The issue is whether *future* specs should refer to IRIs or URIs. An IRI can be "cast down" into a URI unambiguously, but because there are several ways to translate Unicode into ASCII, it's not always possible to unambiguously convert a URI back into an IRI (without some context like the encoding used to go from IRI to URI). So, while I think we should definitely address IRIs and XRIs, I don't think XRIs should expect to be solving the problems that IRIs have with the relationship to URIs. We *could* propose a way to encode the things that are needed to unambiguously convert a URI back into an IRI, but I'm guessing that would actually break the IRI spec. I'm going out beyond my competency here, I think.
> > Bottom line is that we either have to wait for the IRI things to shake out, or we have to tread new ground in i18n. I *definitely* want XRIs to be "i18n enabled", but I'm a little worried about us planning on achieving that in the short term by relying on IRIs.
> >
> > This document has come a LONG way and I think does a pretty good job of identifying why we are all here. Congrats and thanks to all those who contributed. I'm sure there will be more input and fixes to the doc, but I feel like we're very close to the "good enough" state where we can then concentrate on the syntax and resolution specs.
> >
> > -Gabe
> >
> > > -----Original Message-----
> > > From: Drummond Reed [mailto:drummond.reed@onename.com]
> > > Sent: Thursday, May 15, 2003 11:45 AM
> > > To: xri@lists.oasis-open.org
> > > Subject: RE: [xri] Groups - xri-requirements-1.0-draft-05b.doc uploaded
> > >
> > > First, let me note two reasons for posting v5b:
> > >
> > > 1) I found out from Marc Le Maitre this morning that leaving "Track Changes" on screwed up the section numbering, so it makes it difficult to talk about requirement numbers. Let's use v5b on the call today.
> > >
> > > 2) There was an MS Word cross-reference error (unfortunately not all that uncommon) in 3.4.7 that needed fixing.
> > >
> > > Please make any edits to this clean version after making sure "Track Changes" is turned on.
> > >
> > > I will review the key updates on the TC call this afternoon, but the major areas to review are:
> > >
> > > * Sections 2.1 - 2.3 of the Motivations section. These were rewritten for the third time to reflect the consensus regarding terminology.
> > >
> > > * Requirement 3.1.2 was rewritten to reflect the URN conformance topic as discussed on the list.
> > >
> > > * The original requirements section 3.3 was broken into the new sections 3.3 and 3.4 to reflect the clarifications in 2.2 and 2.3 about persistence and HFIs/MFIs.
> > >
> > > * 3.3.5 (Non-Resolvable Syntax) was added to reflect a requirement Marc Le Maitre has surfaced from the Namespace committee of the U.S. XML.gov working group.
> > >
> > > * 3.4.6 (Internationalization) was edited to reflect Nat's input regarding IRIs. We should discuss this on today's call.
> > >
> > > * The Glossary was updated and all TO DO's in it were finished.
> > >
> > > The only remaining TO DOs are a few entries in the informative glossary and Appendix A (Acknowledgments).
> > >
> > > Talk to everyone at 3pm PDT.
> > >
> > > =Drummond
> > >
> > > -----Original Message-----
> > > From: Drummond Reed
> > > Sent: Thursday, May 15, 2003 11:13 AM
> > > To: xri@lists.oasis-open.org
> > > Subject: [xri] Groups - xri-requirements-1.0-draft-05b.doc uploaded
> > >
> > > The document xri-requirements-1.0-draft-05b.doc has been submitted by Drummond Reed (drummond.reed@onename.com) to the Extensible Resource Identifier TC document repository.
> > >
> > > Document Description:
> > > v5b of XRI Requirements and Glossary - This is a CLEAN version with a faulty MS Word cross-reference fixed. Please submit any edits using this version.
> > >
> > > Download Document:
> > > http://www.oasis-open.org/apps/org/workgroup/xri/download.php/2050/xri-requirements-1.0-draft-05b.doc
> > >
> > > View Document Details:
> > > http://www.oasis-open.org/apps/org/workgroup/xri/document.php?document_id=2050
> > >
> > > PLEASE NOTE: If the above links do not work for you, your email application may be breaking the link into two pieces. You may be able to copy and paste the entire link address into the address field of your web browser.
> > >
> > > -OASIS Open Administration

You may leave a Technical Committee at any time by visiting
http://www.oasis-open.org/apps/org/workgroup/xri/members/leave_workgroup.php
