
xri message



Subject: RE: [xri] Groups - xri-requirements-1.0-draft-05b.doc uploaded


OK, I found the correct spelling: it is Devanagari, not "Deva Nagali".

Nat

-----Original Message-----
From: Sakimura, Nat
Sent: Wednesday, May 21, 2003 7:10 PM
To: Wachob, Gabe; xri@lists.oasis-open.org
Cc: DI
Subject: RE: [xri] Groups - xri-requirements-1.0-draft-05b.doc uploaded

Thanks Gabe and Peter for your comments.

The more I study this problem, the more it seems that the source of the problem is actually Unicode and ISO 10646-1. I wish we had made DIS 10646 ver.1.0 the ISO standard. Japan was pushing for it, but most people were not worried about the limitations of Unicode at the time, which became apparent within 5 years. Much of the problem with IRI also arises from it.

Essentially, the problem comes from the fact that in Unicode (and thus UTF-8, ISO 10646-1), you cannot distinguish the language from the code unless you are lucky.

Without going into the details, let me restate the original problem raised earlier on this ML as follows. There seem to be two issues involved in IRI <-> URI conversion. The first is actually (P1) local charset IRI to UTF-8 IRI conversion, and the second is (P2) UTF-8 IRI to URI conversion. The issue around (P2) is much easier than (P1), so I will start with it.

(P2) above can be stated as follows.

Define f() : IRI (UTF-8) -> URI escaping function.
Define g() : URI -> IRI conversion function.
Let u = f(i). Then there exists an i such that i != g(u).

The reason: g() != f^-1() because of the following.

(a) g() must not result in an octet sequence that is not part of a strictly legal UTF-8 octet sequence: a URI may contain escaped sequences that did not originate from UTF-8, e.g., a sequence in iso-8859-1 encoding.

(b) There are further restrictions on the legal octet stream beyond UTF-8. E.g., half-width Japanese kana is not legal in IRI. These have to be re-escaped.

This is valid in general, i.e., if we consider g() and f^-1() over the set U, which is the legal URI space, g() and f^-1() are not equal.
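A minimal sketch of the (P2) asymmetry above, in Python. The names f and g follow the message; everything else (the SAFE set, the helper, the example.org addresses) is an assumption made for illustration and is not taken from the IRI draft. Only case (a) is modeled: escapes that are not legal UTF-8 stay escaped. The extra restrictions of case (b) are not modeled, and a run that mixes legal and illegal octets is simply left escaped as a whole.

```python
import re
from urllib.parse import quote

# Characters f() leaves untouched: URI punctuation plus "%", so that escapes
# already present in the input are not escaped a second time (an assumption).
SAFE = "/:?#[]@!$&'()*+,;=%~-._"

# One or more consecutive %XX escapes.
ESCAPED_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def f(iri):
    """IRI -> URI: percent-escape the UTF-8 octets of non-ASCII characters."""
    return quote(iri, safe=SAFE)


def g(uri):
    """URI -> IRI: unescape only runs that are strictly legal UTF-8."""

    def unescape(match):
        octets = bytes.fromhex(match.group(0).replace("%", ""))
        try:
            text = octets.decode("utf-8")
        except UnicodeDecodeError:
            # Case (a): the octets did not originate from UTF-8; keep escapes.
            return match.group(0)
        # ASCII characters stay escaped so reserved characters are not exposed.
        return "".join(
            c if ord(c) > 127 else "%{:02X}".format(ord(c)) for c in text
        )

    return ESCAPED_RUN.sub(unescape, uri)


utf8_uri = "http://example.org/caf%C3%A9"   # escapes that came from UTF-8
latin1_uri = "http://example.org/caf%E9"    # escapes that came from iso-8859-1

# f(g(u)) == u holds for both URIs ...
print(f(g(utf8_uri)) == utf8_uri, f(g(latin1_uri)) == latin1_uri)
# ... but g(f(x)) != x when x already carries UTF-8 escapes.
print(g(f(utf8_uri)) != utf8_uri)
```

In this sketch g() leaves latin1_uri untouched, which is why the set of legal URIs is strictly larger than the image of f().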
(In other words, although f(g(u)) = u for any u, there exists an x such that g(f(x)) != x).

I was wondering if this is much of a concern for us. The IRI draft spec states that "the IRI resulting from this conversion may not be exactly the same as the original IRI (if there ever was one)", but we are not concerned with U, only with the subset of U that is derived by the conversion of IRIs.

Proposition: for any x that is an element of U', where U' = f(I) and I is the set of legal IRIs, g(x) = f^-1(x).

Reason:

(1) Any escaped octet in this case is guaranteed to have come out of UTF-8. So, we do not have to worry about (a) above.

(2) The set "I" does not include an element referred to by (b) above.

(I am just thinking in abstract logic. I am not familiar with BIDI and other problem spaces, but mathematically, the above looks good to me. One caveat: I have to check the escape function f() again to see if it is good. f() must be such that f^-1() exists.)

Thus, the source of the problem seems to lie in the fact that the set of legal URIs is a superset of the set of URIs derived from IRIs. This indeed is a source of compatibility problems, as Peter says.

This poses a question for us: "Do we really want to state things in terms of URI, at least temporarily?" My answer is NO. If we do that, we will be strangled in the same fashion as IRI is.

The next problem, (P1), is much more complicated. (P1) is actually a problem of UTF-8 and Unicode. The union of local charsets is larger than the UTF-8 space. Moreover, Unicode/UTF-8/ISO 10646 has no way of telling by itself what language context it is in. It has to be used in conjunction with some other information like locale, language flag, or font set. UTF-8 collapses several distinct but similar-looking local characters into one representation. For example, the first character of my last name, "Saki", cannot be represented correctly with it.
To the eye of a "Latin character world" person, it may look like just the difference between a straight and a cursive "g", but it is not. If I submit an official document with the UTF-8 representation of my name, the government will not accept it because the name is different.

Local charset to UTF-8 conversion is actually a many-to-one function. Thus, an inverse function does not exist. It is going to be a mapping. The situation becomes even more complicated when we mix several languages. This one will never be solved unless we abandon UTF-8, or Unicode for that matter.

There are supplemental problems in searching through combined characters etc. For example, Deva Nagali (sp?) would have its own representation problem. I would like to draw Reva Modi's attention to it, who is probably more knowledgeable on these matters than I.

I have already contacted Mr. Kobayashi, who is the only member of the Unicode Technical Committee from Japan, and will continue doing so. From what I found, Mr. Kobayashi is also a bit negative on the capability of Unicode for the sake of real I18N.

(On the other hand, DIS 10646 ver.1.0 would have done it properly. Unicode and ISO 10646 were originally fixed-length encodings. One of the main reasons for pushing Unicode and not DIS 10646 ver.1.0 was that the latter was a variable-length encoding and perceived as inefficient. The shortcoming became apparent within 3 years, and this fixed-length approach was abandoned in 1996. Now, we have lost the fixed-length-ness, and are facing the problems that DIS 10646 ver.1.0 would not have had. After all, at the time of the ISO 10646 voting, it seems the only multi-byte country with a long history of computerized local language processing was Japan. Majority voting does not always produce the correct result...)

Today, even HTML ver.4 is asking for a language switch like LANG="ja" and LANG="en". This is another indicator of the fact that the dream of Unified Character Coding is lost.

Now, do we want to tackle this formidable problem? Hmmm.
A good question.

The bottom line is: if we decide to live in the UTF-8 world, then IRI seems to be OK. If we want real I18N, we are probably leaving the Unicode world.

On the other hand, we MUST NOT point to IRI in the normative spec. We should extract the IRI spec and insert it into our spec until IRI actually becomes a spec. (If I remember correctly, this was the requirement for the use of the proposed IRI spec.)

Nat

-----Original Message-----
From: Wachob, Gabe [mailto:gwachob@visa.com]
Sent: Wednesday, May 21, 2003 1:49 AM
To: xri@lists.oasis-open.org
Subject: RE: [xri] Groups - xri-requirements-1.0-draft-05b.doc uploaded

Nat-

Actually, the IRI spec says that mapping from URIs to IRIs unambiguously requires context not present in a URI. In http://www.w3.org/International/iri-edit/draft-duerst-iri.html#URItoIRI the problem is demonstrated in the situation where you convert an IRI to a URI and back -- "the IRI resulting from this conversion may not be exactly the same as the original IRI". Maybe we can address this ambiguity - perhaps you can figure this out better than I have done so far.

Also, as for "defining a URI" vs. "defining an IRI" - I'm not sure how this plays out. We absolutely need to be able to use XRIs anywhere one would use a URI. However, we know that IRIs can always be mapped to URIs (and in the case where there are no non-URI characters in an IRI, the IRI is syntactically equivalent to the URI to start with).

As for equivalence - that's something we *have* to discuss as the specifiers of a URI scheme. We don't have to say much - we can say that two XRI URIs are "equivalent" if they are octet-by-octet the same (though there are issues about unescaping sequences before or after the comparison). I suppose it gets trickier if you define XRIs as an IRI scheme.

The other problem with relying on the IRI spec right now is that it's not a spec yet. It's still only a draft over at the IETF, and the IETF process is slow.
I'm guessing we won't see a finalized IRI spec in 2003.

Don't get me wrong - I think we should leverage IRIs somehow. I'd even be in favor of defining XRIs as an IRI scheme if we could ensure that would not cause any problems for those many places where URIs are called for (after conversion to the URI form). I just think it's more complicated than simply referring to the IRI spec (a lot more complicated).

-Gabe

> -----Original Message-----
> From: Sakimura, Nat [mailto:n-sakimura@nri.co.jp]
> Sent: Tuesday, May 20, 2003 12:06 AM
> To: xri@lists.oasis-open.org
> Subject: RE: [xri] Groups - xri-requirements-1.0-draft-05b.doc uploaded
>
> Gabe,
>
> Conceptually, IRI has a larger set than URI (IRI includes URI), but both are countable and thus can be mapped one to one, I think. Could you give me an example of mapping one URI to multiple IRIs please?
>
> Fundamentally, the question for us probably is "do we really want to be bound by this aging URI standard?" To me, the URI vs. IRI controversy is largely due to backward compatibility issues. If we think afresh, we probably do not choose URI to be the normative format because it is the source of a myriad of problems for I18N. Unicode is not perfect (some purists say that it is useless - it generally cannot distinguish among similar but distinct characters because these are collapsed into one), but is much cleaner. Resolution does not have to go through the transformation to URI. Our internationalized identifier should be able to be resolved directly.
>
> On equivalence: I think URI equivalence arguments do not affect us. This is because we have an abstract permanent identifier, which can be pretty restrictive in the allowed character set as we do not need human readability. To test the equivalence of two identifiers, we should resolve to the permanent identifier and compare them. To protect privacy, we might not want to expose the permanent identifier.
> In this case, the proxy should give out a True/False result. We have a much more powerful tool than URIs in this regard.
>
> Nat
>
> -----Original Message-----
> From: Wachob, Gabe [mailto:gwachob@visa.com]
> Sent: Friday, May 16, 2003 4:25 AM
> To: 'Drummond Reed'; xri@lists.oasis-open.org
> Subject: RE: [xri] Groups - xri-requirements-1.0-draft-05b.doc uploaded
>
> Drummond-
> A few notes.
>
> First, in section 3.4.5 (you said 3.3.5) - "non-resolvable syntax" - what's the use case? Why do we need to *prevent* an attempt to resolve? Why would a software component resolve an identifier unless it needed to? It seems like there are only two cases: a piece of software needs to resolve the identifier, or it doesn't. This decision is based on application semantics, not the syntax of the identifier. How does marking an identifier as "non-resolvable" help at all?
>
> In section 3.4.6 (internationalization) - there is a discussion going on at the W3C TAG (issue named something like "IRIEverywhere") where the appropriateness of where IRIs should be used is being discussed. It is clear, for example, that IRIs cannot be used everywhere URIs can be used. The issue is whether *future* specs should refer to IRIs or URIs. An IRI can be "cast down" into a URI unambiguously, but because there are several ways to translate Unicode into ASCII, it's not always possible to unambiguously convert a URI back into an IRI (without some context like the encoding used to go from IRI to URI). So, while I think we should definitely address IRIs and XRIs, I don't think XRIs should expect to be solving the problems that IRIs have with the relationship to URIs. We *could* propose a way to encode the things that are needed to unambiguously convert a URI back into an IRI, but I'm guessing that would actually break the IRI spec. I'm going out beyond my competency here I think.
> Bottom line is that we either have to wait for the IRI things to shake out, or we have to tread new ground in i18n. I *definitely* want XRIs to be "i18n enabled", but I'm a little worried about us planning on achieving that in the short term by relying on IRIs.
>
> This document has come a LONG way and I think does a pretty good job of identifying why we are all here. Congrats and thanks to all those who contributed. I'm sure there will be more input and fixes to the doc, but I feel like we're very close to the "good enough" state where we can then concentrate on the syntax and resolution specs.
>
> -Gabe
>
> > -----Original Message-----
> > From: Drummond Reed [mailto:drummond.reed@onename.com]
> > Sent: Thursday, May 15, 2003 11:45 AM
> > To: xri@lists.oasis-open.org
> > Subject: RE: [xri] Groups - xri-requirements-1.0-draft-05b.doc uploaded
> >
> > First, let me note two reasons for posting v5b:
> >
> > 1) I found out from Marc Le Maitre this morning that leaving "Track Changes" on screwed up the section numbering, so it makes it difficult to talk about requirement numbers. Let's use v5b on the call today.
> >
> > 2) There was an MS Word cross-reference error (unfortunately not all that uncommon) in 3.4.7 that needed fixing.
> >
> > Please make any edits to this clean version after making sure "Track Changes" is turned on.
> >
> > I will review the key updates on the TC call this afternoon, but the major areas to review are:
> >
> > * Sections 2.1 - 2.3 of the Motivations section. These were rewritten for the third time to reflect the consensus regarding terminology.
> >
> > * Requirement 3.1.2 was rewritten to reflect the URN conformance topic as discussed on the list.
> >
> > * The original requirements section 3.3 was broken into the new sections 3.3 and 3.4 to reflect the clarifications in 2.2 and 2.3 about persistence and HFIs/MFIs.
> >
> > * 3.3.5 (Non-Resolvable Syntax) was added to reflect a requirement Marc Le Maitre has surfaced from the Namespace committee of the U.S. XML.gov working group.
> >
> > * 3.4.6 (Internationalization) was edited to reflect Nat's input regarding IRIs. We should discuss this on today's call.
> >
> > * The Glossary was updated and all TO DO's in it were finished.
> >
> > The only remaining TO DOs are a few entries in the informative glossary and Appendix A (Acknowledgments).
> >
> > Talk to everyone at 3pm PDT.
> >
> > =Drummond
> >
> > -----Original Message-----
> > From: Drummond Reed
> > Sent: Thursday, May 15, 2003 11:13 AM
> > To: xri@lists.oasis-open.org
> > Subject: [xri] Groups - xri-requirements-1.0-draft-05b.doc uploaded
> >
> > The document xri-requirements-1.0-draft-05b.doc has been submitted by Drummond Reed (drummond.reed@onename.com) to the Extensible Resource Identifier TC document repository.
> >
> > Document Description:
> > v5b of XRI Requirements and Glossary - This is a CLEAN version with a faulty MS Word cross-reference fixed. Please submit any edits using this version.
> >
> > Download Document:
> > http://www.oasis-open.org/apps/org/workgroup/xri/download.php/2050/xri-requirements-1.0-draft-05b.doc
> >
> > View Document Details:
> > http://www.oasis-open.org/apps/org/workgroup/xri/document.php?document_id=2050
> >
> > PLEASE NOTE: If the above links do not work for you, your email application may be breaking the link into two pieces. You may be able to copy and paste the entire link address into the address field of your web browser.
> >
> > -OASIS Open Administration

You may leave a Technical Committee at any time by visiting http://www.oasis-open.org/apps/org/workgroup/xri/members/leave_workgroup.php
