xri message
Subject: RE: [xri] Groups - xri-requirements-1.0-draft-05b.doc uploaded
- From: "Wachob, Gabe" <gwachob@visa.com>
- To: xri@lists.oasis-open.org
- Date: Wed, 21 May 2003 12:26:14 -0700
An even better link about the problems with Unicode and (specifically) East Asian languages:
http://www-106.ibm.com/developerworks/unicode/library/u-secret.html?dwzone=unicode
-Gabe
> -----Original Message-----
> From: Sakimura, Nat [mailto:n-sakimura@nri.co.jp]
> Sent: Wednesday, May 21, 2003 3:10 AM
> To: Wachob, Gabe; xri@lists.oasis-open.org
> Cc: DI
> Subject: RE: [xri] Groups -
> xri-requirements-1.0-draft-05b.doc uploaded
>
>
> Thanks Gabe and Peter for your comments.
>
> The more I study this problem, the more it seems that the source of
> the problem actually is Unicode and ISO 10646-1. I wish we
> had made DIS 10646 ver.1.0 the ISO standard. Japan was
> pushing for it, but most people were not worried about the
> limitations of Unicode at the time, which became apparent within 5
> years. Many of the problems of the IRI also arise from it.
> Essentially, the problem comes from the fact
> that in Unicode (and thus UTF-8, ISO 10646-1), you cannot
> tell the language from the code unless you are lucky.
>
> Without going into the details, let me state the original
> problem stated earlier in this ML in the following fashion.
>
> There seem to be two issues involved in IRI <-> URI conversion.
> The first is actually
> (P1) local charset IRI to UTF-8 IRI conversion,
> and the second is
> (P2) UTF-8 IRI to URI conversion.
>
> The issue around (P2) is much easier than (P1), so I will
> start from it.
>
> (P2) above can be stated as follows.
> Define f() : IRI (UTF-8) -> URI escaping function.
> Define g() : URI -> IRI conversion function.
> Let u = f(i).
> Then there exists an i such that i != g(u).
>
> The reason:
> g() != f^-1() because of the following.
>
> (a) g() must not result in an octet sequence that is
> not a strictly legal UTF-8 octet
> sequence: a URI may contain escaped sequences
> that did not originate from UTF-8,
> e.g., octets from the iso-8859-1 encoding.
> (b) There are further restrictions on the legal octet
> stream beyond those of UTF-8. E.g., half-width Japanese
> kana are not legal in an IRI. These have to be re-escaped.
>
> This is valid in general, i.e., if we consider g() and f^-1()
> over the set U, which is the legal URI space, g() and f^-1()
> are not equal.
> (In other words, although f(g(u)) = u for any u, there exists
> an x such that g(f(x)) != x).
>
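The (P2) argument above can be sketched in Python. This is only an illustration under assumptions: `iri_to_uri` and `uri_to_iri` are simplified stand-ins for f() and g(), not the procedures of the IRI draft, the sample strings are made up, and restriction (b) is not modeled at all.

```python
import re

# Matches one or more consecutive %XX escapes.
_ESCAPE_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")

def iri_to_uri(iri):
    """Stand-in for f(): escape each non-ASCII character as its UTF-8 octets."""
    out = []
    for ch in iri:
        if ord(ch) < 128:
            out.append(ch)  # ASCII, including an existing '%', passes through
        else:
            out.extend("%{:02X}".format(b) for b in ch.encode("utf-8"))
    return "".join(out)

def uri_to_iri(uri):
    """Stand-in for g(): unescape a run only if it is strictly legal UTF-8."""
    def unescape(match):
        run = match.group(0)
        octets = bytes.fromhex(run.replace("%", ""))
        try:
            text = octets.decode("utf-8")  # strict decoding
        except UnicodeDecodeError:
            return run  # reason (a): not UTF-8, e.g. iso-8859-1 octets
        if any(ord(ch) < 128 for ch in text):
            return run  # simplification: keep runs that contain escaped ASCII
        return text
    return _ESCAPE_RUN.sub(unescape, uri)

original = "caf%C3%A9"  # an IRI that already contains escapes
round_trip = uri_to_iri(iri_to_uri(original))
print(round_trip == original)  # False: g(f(i)) != i for this i
print(uri_to_iri("caf%E9"))    # caf%E9, the iso-8859-1 octet stays escaped
```

The first print shows one i for which i != g(f(i)); the second shows case (a), where g() has to leave a non-UTF-8 escape untouched.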
> I was wondering if this is much of a concern for us. The IRI
> draft spec states that "the IRI resulting from this
> conversion may not be exactly the same as the original IRI
> (if there ever was one)", but we are not concerned with all of
> U, only with the subset of U that is derived by converting IRIs.
>
> Proposition: let U' = f(I), where I is the set of legal IRIs.
> Then
> g(x) = f^-1(x) for any x in U'.
>
> Reason:
> (1) Any escaped octet in this case is guaranteed to
> have come out of UTF-8. So, we do not have to worry
> about (a) above.
> (2) The set "I" does not include an element referred by (b) above.
>
> (I am just thinking in abstract logic. I am not familiar
> with BIDI and other problem spaces, but mathematically, the
> above looks good to me. One caveat: I have to check the
> escape function f() again to see if it is good. f() must be
> such that f^-1() exists.)
>
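The caveat that f^-1() must exist can be checked on a small sample. The sketch below is an assumption-laden illustration: `f` is a stand-in built on `urllib.parse.quote`, the `xri://example/...` strings are hypothetical, and the check only covers the listed samples, not the whole set I.

```python
from urllib.parse import quote

def f(iri):
    """Stand-in for f(): escape non-ASCII characters as UTF-8 octets,
    leaving URI punctuation and any existing '%' untouched."""
    return quote(iri, safe=":/?#[]@!$&'()*+,;=%", encoding="utf-8")

def is_injective(func, domain):
    """True if func maps the distinct elements of domain to distinct values."""
    images = [func(x) for x in domain]
    return len(set(images)) == len(images)

# Hypothetical sample of IRIs that contain no escapes of their own.
plain = [
    "xri://example/caf\u00e9",
    "xri://example/\u65e5\u672c",
    "xri://example/cafe",
]
# The same sample plus an IRI that already carries the escaped form.
mixed = plain + ["xri://example/caf%C3%A9"]

print(is_injective(f, plain))  # True: f^-1 exists on this sample
print(is_injective(f, mixed))  # False: two IRIs collapse onto one URI
```

On the first sample an inverse exists, which is the situation the proposition needs. Once the set of IRIs also admits already-escaped forms, two IRIs share one image and f^-1() is no longer a function, so the proposition depends on how I is defined.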
> Thus, the source of the problem seems to lie in the fact that
> the set of legal URIs is a superset of the set of URIs derived from
> IRIs. This indeed is a source of compatibility problems, as
> Peter says. This poses a question for us:
> "Do we really want to state things in terms of URIs, at
> least temporarily?"
> My answer is NO. If we do that, we will be strangled in
> the same way that IRI is.
>
> The next problem, (P1), is much more complicated. The problem (P1)
> is actually a problem of UTF-8 and Unicode. The union of local
> charsets is larger than the UTF-8 space. Moreover,
> Unicode/UTF-8/ISO 10646 has no way of telling what language
> context it is in by itself. It has to be used in conjunction
> with some other information like a locale, language flag, or font set.
>
> UTF-8 collapses several distinct but similar-looking
> local characters into one representation. For example, the
> first character of my last name, "Saki", cannot be represented
> correctly with it. To the eye of a person from the "Latin character
> world", the difference may look like that between a straight and a
> cursive "g", but it is not. If I submit an official document with the UTF-8
> representation of my name, the government will not accept it
> because the name is different. Local charset to UTF-8
> conversion is actually a many-to-one function. Thus, an inverse
> function does not exist; going back is only a one-to-many mapping. The
> situation becomes even more complicated when we mix several languages.
>
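A small sketch of the narrower point that the originating charset (and with it the language context) cannot be recovered after conversion. It assumes Python's built-in `shift_jis` and `gb2312` codecs and uses one unified ideograph, U+76F4, as the sample; it does not reproduce the name-variant case described above.

```python
def to_unicode(charset, octets):
    """Convert octets in a local charset to a Unicode string."""
    return octets.decode(charset)

# The same unified ideograph as it would arrive from two local charsets.
samples = {
    "ja": ("shift_jis", "\u76f4".encode("shift_jis")),
    "zh": ("gb2312", "\u76f4".encode("gb2312")),
}

converted = {
    lang: to_unicode(charset, octets)
    for lang, (charset, octets) in samples.items()
}

# Different octets on the way in, identical code point on the way out.
print(samples["ja"][1] != samples["zh"][1])  # True
print(converted["ja"] == converted["zh"])    # True
```

After conversion, nothing in the string says whether it came from the Japanese or the Chinese source, so the language has to be carried alongside it (locale, language flag, or similar).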
> This one will never be solved unless we abandon UTF-8, or
> Unicode for that matter. There are supplemental problems in
> searching through combined characters etc. For example,
> Devanagari would have its own representation problem. I
> would like to draw the attention of Reva Modi to it, who is
> probably more knowledgeable on these matters than I. I have
> already contacted Mr. Kobayashi, who is the only member of the
> Unicode Technical Committee from Japan, and will
> continue doing so. From what I found, Mr. Kobayashi is also a
> bit negative on the capability of Unicode for the sake of
> real I18N. (On the other hand, DIS 10646 ver.1.0 would have
> done it properly. Unicode and ISO 10646 were originally fixed
> length encodings. One of the main reasons for pushing Unicode
> and not DIS 10646 ver.1.0 was that the latter was a variable
> length encoding and perceived as inefficient. The shortcoming
> became apparent within 3 years, and this fixed length thing
> was abandoned in 1996. Now, we have lost the
> fixed length-ness, and are facing the problems that DIS
> 10646 ver.1.0 would not have had. After all, at the time of
> ISO 10646 voting, it seems the only multi-byte country with
> a long history of computerized local language processing was
> Japan. Majority voting does not always produce the
> correct result...) Today, even HTML ver.4 is asking for
> a language switch like LANG="ja" and LANG="en". This is another
> indicator of the fact that the dream of Unified Character
> Coding is lost.
>
> Now, do we want to tackle this formidable problem? Hmmm. A
> good question.
>
> The bottom line is: if we decide to live in the UTF-8 world,
> then IRI seems to be OK. If we want real I18N, we probably
> are leaving the Unicode world.
>
> On the other hand, we MUST NOT point to IRI in the normative
> spec. We should extract the IRI spec and insert it into our
> spec until IRI actually becomes a spec. (If I remember
> correctly, this was the requirement for the use of IRI
> proposed spec.)
>
> Nat
>
> -----Original Message-----
> From: Wachob, Gabe [mailto:gwachob@visa.com]
> Sent: Wednesday, May 21, 2003 1:49 AM
> To: xri@lists.oasis-open.org
> Subject: RE: [xri] Groups -
> xri-requirements-1.0-draft-05b.doc uploaded
>
> Nat-
> Actually, the IRI spec says that mapping from URIs to
> IRIs unambiguously requires context not present in a URI. In
> http://www.w3.org/International/iri-edit/draft-duerst-iri.html#URItoIRI
> the problem is demonstrated in the situation where
> you convert an IRI to a URI and back -- "the IRI resulting
> from this conversion may not be exactly the same as the
> original IRI". Maybe we can address this ambiguity - perhaps
> you can figure this out better than I have done so far.
>
> Also, as for the "defining a URI" vs. "defining an IRI"
> - I'm not sure how this plays out. We absolutely need to be
> able to use XRIs anywhere one would use a URI. However, we
> know that IRIs can always be mapped to URIs (and in the case
> where there are no non-URI characters in an IRI, the IRI is
> syntactically equivalent to the URI to start with).
>
> As for equivalence - that's something we *have* to
> discuss as the specifiers of a URI scheme. We don't have to
> say much - we can say that two XRI URIs are "equivalent" if
> they are octet-by-octet the same (though there are issues
> about unescaping sequences before or after the comparison). I
> suppose it gets trickier if you define XRIs as an IRI scheme.
>
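The unescaping question for octet-by-octet equivalence can be made concrete with a short sketch. The two comparison functions and the `xri://example/...` strings are hypothetical; neither function is a proposed XRI rule.

```python
from urllib.parse import unquote_to_bytes

def same_octets(a, b):
    """Compare two URIs exactly as written, octet by octet."""
    return a.encode("ascii") == b.encode("ascii")

def same_after_unescaping(a, b):
    """Compare two URIs after turning every %XX escape into its octet."""
    return unquote_to_bytes(a) == unquote_to_bytes(b)

a = "xri://example/%7Euser"
b = "xri://example/~user"
print(same_octets(a, b))            # False
print(same_after_unescaping(a, b))  # True

# Unescaping first can also merge URIs that differ in structure:
# an escaped slash is data, an unescaped slash is a path separator.
c = "xri://example/a%2Fb"
d = "xri://example/a/b"
print(same_after_unescaping(c, d))  # True
```

Comparing before unescaping treats `%7E` and `~` as different, while comparing after unescaping also equates an escaped `/` with a path separator, so whichever order is chosen has to be stated explicitly.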
> The other problem with relying on the IRI spec right
> now is that it's not a spec yet. It's still only a draft over
> at the IETF, and the IETF process is slow. I'm guessing we
> won't see a finalized IRI spec in 2003.
>
> Don't get me wrong - I think we should leverage IRIs
> somehow. I'd even be in favor of defining XRIs as an IRI
> scheme if we could ensure that would not cause any problems
> for those many places where URIs are called for (after
> conversion to the URI form). I just think it's more
> complicated than simply referring to the IRI spec (a lot more
> complicated).
>
> -Gabe
>
> > -----Original Message-----
> > From: Sakimura, Nat [mailto:n-sakimura@nri.co.jp]
> > Sent: Tuesday, May 20, 2003 12:06 AM
> > To: xri@lists.oasis-open.org
> > Subject: RE: [xri] Groups -
> > xri-requirements-1.0-draft-05b.doc uploaded
> >
> >
> > Gabe,
> >
> > Conceptually, IRI has a larger set than URI (IRI includes URI), but both
> > are countable and thus can be mapped one to one, I think.
> > Could you give
> > me an example of mapping one URI to multiple IRIs please?
> >
> > Fundamentally, the question for us probably is "do we really
> > want to be bound by this aging URI standard?" To me, the URI vs. IRI
> > controversy is largely due to backward compatibility issues. If we think
> > afresh, we probably would not choose URI to be the normative format because
> > it is the source of a myriad of problems for I18N. Unicode is not
> > perfect (some purists say that it is useless - it generally cannot
> > distinguish among similar but distinct characters because these are
> > collapsed into one), but is much cleaner. Resolution does not have to go
> > through the transformation to URI. Our internationalized identifier
> > should be able to be resolved directly.
> >
> > On equivalence: I think URI equivalence arguments do not
> > affect us. This is because we have an abstract permanent identifier, which
> > can be pretty restrictive in the allowed character set as we do not need
> > human readability. To test the equivalence of two identifiers, we should
> > resolve them to the permanent identifier and compare. To protect
> > privacy, we might not want to expose the permanent identifier. In this
> > case, the proxy should give out a True/False result. We have a much
> > more powerful tool than URIs in this regard.
> >
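The equivalence-by-resolution idea can be sketched as follows. Everything here is hypothetical: the `EquivalenceProxy` class, the in-memory registry, and the identifier strings are placeholders for illustration, not part of any XRI draft.

```python
from typing import Dict, Optional

class EquivalenceProxy:
    """Answers equivalence questions without exposing permanent identifiers."""

    def __init__(self, registry: Dict[str, str]) -> None:
        # Maps a human-friendly identifier to its permanent identifier.
        self._registry = dict(registry)

    def _resolve(self, identifier: str) -> Optional[str]:
        """Private lookup; the permanent identifier never leaves the proxy."""
        return self._registry.get(identifier)

    def are_equivalent(self, a: str, b: str) -> bool:
        """True only if both identifiers resolve to the same permanent one."""
        permanent_a = self._resolve(a)
        permanent_b = self._resolve(b)
        return permanent_a is not None and permanent_a == permanent_b

proxy = EquivalenceProxy({
    "alias-one": "perm-0001",
    "alias-two": "perm-0001",
    "alias-three": "perm-0002",
})
print(proxy.are_equivalent("alias-one", "alias-two"))    # True
print(proxy.are_equivalent("alias-one", "alias-three"))  # False
```

The caller only ever sees a boolean, which matches the privacy point above. Unresolvable identifiers are treated as not equivalent in this sketch, which is a design choice rather than something the thread specifies.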
> > Nat
> >
> > -----Original Message-----
> > From: Wachob, Gabe [mailto:gwachob@visa.com]
> > Sent: Friday, May 16, 2003 4:25 AM
> > To: 'Drummond Reed'; xri@lists.oasis-open.org
> > Subject: RE: [xri] Groups -
> > xri-requirements-1.0-draft-05b.doc uploaded
> >
> > Drummond-
> > A few notes.
> >
> > First, in section 3.4.5 (you said 3.3.5) - "non-resolvable
> > syntax" - what's the use case? Why do we need to *prevent* an
> > attempt to resolve? Why would a software component resolve an identifier
> > unless it needed to? It seems like there are only two cases: a piece
> > of software needs to resolve the identifier, or it doesn't. This
> > decision is based on application semantics, not the syntax of the
> > identifier. How does
> > marking an identifier as "non-resolvable" help at all?
> >
> > In section 3.4.6 (internationalization) - there is a discussion
> > going on at the W3C TAG (issue named something like "IRIEverywhere")
> > about where it is appropriate to use IRIs.
> > It is clear, for example, that IRIs cannot be used everywhere
> > URIs can be used. The issue is whether *future* specs should refer to
> > IRIs or URIs. An IRI can be "cast down" into a URI unambiguously, but
> > because there are several ways to translate Unicode into ASCII, it's not
> > always possible to unambiguously convert a URI back into an IRI (without
> > some context like the encoding used to go from IRI to URI). So, while I
> > think we should definitely address IRIs and XRIs, I don't think XRIs
> > should expect to be solving the problems that IRIs have with their
> > relationship to URIs. We *could* propose a way to encode the things
> > that are needed to unambiguously convert a URI back into an
> > IRI, but I'm guessing that would actually break the IRI spec. I'm going
> > out beyond my competency here I think.
> > Bottom line is that we either have to wait for the IRI things to
> > shake out, or we have to tread new ground in i18n. I *definitely* want
> > XRIs to be "i18n enabled", but I'm a little worried about us
> > planning on achieving that in the short term by relying on IRIs.
> >
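Why a URI cannot be converted back without knowing the encoding can be shown with the standard library. This is a sketch under assumptions: the sample string is made up, and `urllib.parse.quote`/`unquote` stand in for whatever escaping a real producer used.

```python
from urllib.parse import quote, unquote

iri = "caf\u00e9"

# Two producers escape the same characters with different encodings.
as_utf8 = quote(iri, encoding="utf-8")
as_latin1 = quote(iri, encoding="iso-8859-1")
print(as_utf8)    # caf%C3%A9
print(as_latin1)  # caf%E9

# A consumer that guesses the wrong encoding gets a different string back.
wrong_guess = unquote(as_utf8, encoding="iso-8859-1")
print(wrong_guess == iri)  # False

# With the right context the round trip works in both cases.
print(unquote(as_utf8, encoding="utf-8") == iri)         # True
print(unquote(as_latin1, encoding="iso-8859-1") == iri)  # True
```

The URI by itself does not say which of the two encodings produced it, which is the missing context referred to above.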
> > This document has come a LONG way and I think does a pretty good
> > job of identifying why we are all here. Congrats and thanks
> > to all those who contributed. I'm sure there will be more input and fixes
> > to the doc, but I feel like we're very close to the "good enough" state
> > where we can then concentrate on the syntax and resolution specs.
> >
> > -Gabe
> >
> >
> > > -----Original Message-----
> > > From: Drummond Reed [mailto:drummond.reed@onename.com]
> > > Sent: Thursday, May 15, 2003 11:45 AM
> > > To: xri@lists.oasis-open.org
> > > Subject: RE: [xri] Groups -
> > > xri-requirements-1.0-draft-05b.doc uploaded
> > >
> > >
> > > First, let me note two reasons for posting v5b:
> > >
> > > 1) I found out from Marc Le Maitre this morning that leaving "Track
> > > Changes" on screwed up the section numbering, so it makes it difficult
> > > to talk about requirement numbers. Let's use v5b on the call today.
> > >
> > > 2) There was an MS Word cross-reference error (unfortunately not all
> > > that uncommon) in 3.4.7 that needed fixing.
> > >
> > > Please make any edits to this clean version after making sure "Track
> > > Changes" is turned on.
> > >
> > > I will review the key updates on the TC call this afternoon, but the
> > > major areas to review are:
> > >
> > > * Sections 2.1 - 2.3 of the Motivations section. These were rewritten
> > > for the third time to reflect the consensus regarding terminology.
> > >
> > > * Requirement 3.1.2 was rewritten to reflect the URN conformance topic
> > > as discussed on the list.
> > >
> > > * The original requirements section 3.3 was broken into the new sections
> > > 3.3 and 3.4 to reflect the clarifications in 2.2 and 2.3 about
> > > persistence and HFIs/MFIs.
> > >
> > > * 3.3.5 (Non-Resolvable Syntax) was added to reflect a requirement Marc
> > > Le Maitre has surfaced from the Namespace committee of the U.S. XML.gov
> > > working group.
> > >
> > > * 3.4.6 (Internationalization) was edited to reflect Nat's input
> > > regarding IRIs. We should discuss this on today's call.
> > >
> > > * The Glossary was updated and all TO DO's in it were finished.
> > >
> > > The only remaining TO DOs are a few entries in the informative glossary
> > > and Appendix A (Acknowledgments).
> > >
> > > Talk to everyone at 3pm PDT.
> > >
> > > =Drummond
> > >
> > > -----Original Message-----
> > > From: Drummond Reed
> > > Sent: Thursday, May 15, 2003 11:13 AM
> > > To: xri@lists.oasis-open.org
> > > Subject: [xri] Groups - xri-requirements-1.0-draft-05b.doc uploaded
> > >
> > > The document xri-requirements-1.0-draft-05b.doc has been submitted by
> > > Drummond Reed (drummond.reed@onename.com) to the Extensible Resource
> > > Identifier TC document repository.
> > >
> > > Document Description:
> > > v5b of XRI Requirements and Glossary - This is a CLEAN version with a
> > > faulty MS Word cross-reference fixed. Please submit any edits
> > > using this version.
> > >
> > > Download Document:
> > > http://www.oasis-open.org/apps/org/workgroup/xri/download.php/2050/xri-requirements-1.0-draft-05b.doc
> > >
> > > View Document Details:
> > > http://www.oasis-open.org/apps/org/workgroup/xri/document.php?document_id=2050
> > >
> > >
> > > PLEASE NOTE: If the above links do not work for you, your email
> > > application may be breaking the link into two pieces. You may be able
> > > to copy and paste
> > > the entire link address into the address field of your web browser.
> > >
> > > -OASIS Open Administration
> > >
> >
> > You may leave a Technical Committee at any time by visiting
> > http://www.oasis-open.org/apps/org/workgroup/xri/members/leave_workgroup.php