xri message
Subject: RE: I18N strategy
- From: "Sakimura, Nat" <n-sakimura@nri.co.jp>
- To: "Wachob, Gabe" <gwachob@visa.com>
- Date: Fri, 23 May 2003 10:47:33 +0900
I agree. But at the same time, since an IRI will probably not carry enough information by itself, we need to do a bit of tweaking on the resolution side. Unlike with DNS, we may want an option to pass the resolution context (e.g., language) to the resolver.
Nat
-----Original Message-----
From: Wachob, Gabe [mailto:gwachob@visa.com]
Sent: Friday, May 23, 2003 3:14 AM
To: Sakimura, Nat
Subject: RE: I18N strategy
Well, I think the bottom line is that we can go forward with the IRI stuff for now - please do remind us when we are making assumptions that would limit our future flexibility with respect to UTF-8. I think that's all we can do for now.
-Gabe
> -----Original Message-----
> From: Sakimura, Nat [mailto:n-sakimura@nri.co.jp]
> Sent: Thursday, May 22, 2003 11:04 AM
> To: Wachob, Gabe
> Subject: RE: I18N strategy
>
>
> There actually is a shift-in mechanism to signify the language
> in Unicode. We can probably use it in theory, though the
> Unicode spec strongly discourages it. The problem is, "how
> does one insert that character?" Gabe's suggestion is easier
> to deal with in this respect, but it is not particularly
> human friendly when reading. Another variation of this is to
> make the LANG property structured similarly to the identifier
> itself. E.g.,
>
> GET xri://English/Japanese.Korean.Chinese/
> LANG //en/ja.ko.zh/
> FONT //New York/Osaka.Seoul.Shanghai/
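The structured LANG idea above can be sketched as a small pairing routine. This is only an illustration under assumptions: the LANG line is taken literally from the example, `/` and `.` are assumed to be the only segment delimiters, and the function names are hypothetical, not part of any XRI draft.

```python
import re
from typing import List, Tuple

def split_segments(path: str) -> List[str]:
    """Split on the assumed delimiters '/' and '.', dropping empty pieces."""
    return [piece for piece in re.split(r"[/.]", path) if piece]

def pair_segments_with_languages(xri: str, lang: str) -> List[Tuple[str, str]]:
    """Pair each identifier segment with the language tag in the same position."""
    # Strip the scheme so both strings have the same '//...' shape.
    body = xri[len("xri:"):] if xri.startswith("xri:") else xri
    segments = split_segments(body)
    tags = split_segments(lang)
    if len(segments) != len(tags):
        raise ValueError("LANG structure does not match the identifier")
    return list(zip(segments, tags))

pairs = pair_segments_with_languages(
    "xri://English/Japanese.Korean.Chinese/", "//en/ja.ko.zh/"
)
print(pairs)
```

A mismatch in the number of segments raises an error here; a real resolver might instead fall back to a default language, which is a design choice the thread leaves open.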
>
> UTF-8 is fine, but there can be other encoding schemes which
> may become more popular in the future. My intent was only
> that the spec should perhaps not depend on the UTF-8
> structure for the resolution algorithm, so that we leave open
> the possibility of resolving other encoding schemes. After all,
> we might end up resolving a base64-encoded ISO-2022 string.
>
> -----Original Message-----
> From: Wachob, Gabe [mailto:gwachob@visa.com]
> Sent: Friday, May 23, 2003 1:57 AM
> To: Sakimura, Nat; Drummond Reed; Wachob, Gabe;
> xri@lists.oasis-open.org
> Cc: DI
> Subject: RE: I18N strategy
>
> I was wondering if it wouldn't be possible to insert language
> hints as part of the XRI syntax. Haven't really thought about this ..
>
> xri://english/,l=ja.japanese.,l=ko.korean.,l=ch.chinese
>
> Where ,l= is a special form which "switches" the "current
> language" (going from left to right - but this is a problem
> potentially too) to a new one. Everything would be understood
> to be in the "current language" until another language
> identifier is discovered.
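One possible reading of the `,l=` switch described above, written as a left-to-right scan. This is a sketch under assumptions: the token after `,l=` up to the next delimiter is taken to be the language code, `/` and `.` are assumed to delimit segments, and segments before the first switch get no language at all. The function name is hypothetical.

```python
import re
from typing import List, Optional, Tuple

SWITCH_PREFIX = ",l="  # the special form proposed in the message

def parse_language_switches(xri: str) -> List[Tuple[str, Optional[str]]]:
    """Walk the segments left to right, tracking the 'current language'."""
    body = xri[len("xri://"):] if xri.startswith("xri://") else xri
    current_language: Optional[str] = None  # unknown until the first switch
    result: List[Tuple[str, Optional[str]]] = []
    for token in re.split(r"[/.]", body):
        if not token:
            continue  # skip empty pieces produced by adjacent delimiters
        if token.startswith(SWITCH_PREFIX):
            # A switch token changes the language for everything after it.
            current_language = token[len(SWITCH_PREFIX):]
        else:
            result.append((token, current_language))
    return result

segments = parse_language_switches(
    "xri://english/,l=ja.japanese.,l=ko.korean.,l=ch.chinese"
)
print(segments)
```

The scan direction is fixed at left to right, which is exactly the point flagged as potentially problematic for right-to-left scripts.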
>
> This is just a "top of the head" suggestion and I don't even
> know if it would address the bulk of the problems Nat has identified.
>
> It seems hard to justify not going with the IRI framework and
> accepting the limitations it has - but I *am* concerned that
> we are potentially alienating a huge segment of the world's
> population. On the other hand, I'm afraid that not following
> the IRI (and W3C Web Character Model - a very good read for
> people not yet familiar with I18N issues at
> http://www.w3.org/TR/charmod/ ) would mean that we would have
> to bite off a lot of work for ourselves, and might create all
> sorts of interoperability problems.
>
> Nat, I'm curious what "not relying on UTF-8" means. UTF-8 is
> just a way of interpreting a series of octets. Do you mean
> that we shouldn't have resolvers assume that character
> strings are in UTF-8? If so, how do resolvers know what
> encoding is being used in an XRI?
>
> It sounds like there is consensus on just moving ahead with
> IRI/Unicode - but I still don't feel quite comfortable that
> we've worked through this issue completely (or "beat this
> dead horse completely", depending on your point of view).
>
> -Gabe
>
> > -----Original Message-----
> > From: Sakimura, Nat [mailto:n-sakimura@nri.co.jp]
> > Sent: Thursday, May 22, 2003 1:06 AM
> > To: Drummond Reed; Wachob, Gabe; xri@lists.oasis-open.org
> > Cc: DI
> > Subject: RE: I18N strategy
> >
> >
> > My opinion for the time being is like this:
> >
> > (1) For the time being at least, we have to live with Unicode
> > and UTF-8.
> > This, in turn, means that we abandon the idea of using
> > different languages in one identifier, but we must abide by it.
> > This is a hard decision because we do federate. In XRI, something
> > like xri://Arabic/Devanagari/Japanese.Korean.Chinese/ is
> > possible as a concept, but unfortunately we cannot make it
> > reliably human readable with Unicode.
> >
> > (2) Prepare for something else in the future, i.e., try not
> > to depend on UTF-8. The resolver should be able to handle any
> > byte stream.
> >
> > (3) We have to have some way of attaching the language
> > information to it. Otherwise, a resolver cannot reliably search
> > and resolve. For example, an HTTP-based resolver can do
> > something like:
> >
> > GET xri://an-xri-identifier
> > LANG="ja"
> >
> > A resolver will probably need to advertise its capabilities to
> > the clients.
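A toy model of point (3): resolution keyed by both the identifier and a language hint, plus a way for the resolver to advertise which language contexts it supports. Everything here is hypothetical (class name, method names, record values); it only illustrates the shape of the interaction, not a wire protocol.

```python
from typing import Dict, Optional, Set, Tuple

class ToyResolver:
    """In-memory stand-in for a resolver that accepts a language hint."""

    def __init__(self, records: Dict[Tuple[str, str], str]) -> None:
        # Keys are (identifier, language tag); values are resolution results.
        self._records = records

    def supported_languages(self) -> Set[str]:
        """Advertise which language contexts this resolver can handle."""
        return {lang for (_, lang) in self._records}

    def resolve(self, identifier: str, lang: str) -> Optional[str]:
        """Look up the identifier within the given language context."""
        return self._records.get((identifier, lang))

resolver = ToyResolver({
    ("xri://an-xri-identifier", "ja"): "record-for-ja",
    ("xri://an-xri-identifier", "en"): "record-for-en",
})
print(sorted(resolver.supported_languages()))
print(resolver.resolve("xri://an-xri-identifier", "ja"))
print(resolver.resolve("xri://an-xri-identifier", "ko"))
```

A client could check `supported_languages()` first and only then send the hint, which is one way to read "advertise its capabilities".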
> >
> > (4) Search thesaurus. This is purely optional, but some
> > implementations may want to implement a character thesaurus
> > feature so that variant forms of one character can be
> > found with a single input. For example, if one searches for u
> > umlaut, both U+00FC and U+0075 + U+0308 are searched.
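For the u-umlaut case specifically, Unicode normalization already folds the two spellings together. The sketch below uses Python's standard `unicodedata` module; the `search_key` helper is a hypothetical name. Note the limit: normalization covers canonically equivalent sequences only, so it would not by itself provide a broader thesaurus of variant characters.

```python
import unicodedata

precomposed = "\u00fc"   # U+00FC LATIN SMALL LETTER U WITH DIAERESIS
decomposed = "u\u0308"   # U+0075 followed by U+0308 COMBINING DIAERESIS

def search_key(text: str) -> str:
    """Fold canonically equivalent forms to one key using NFC."""
    return unicodedata.normalize("NFC", text)

# The raw code point sequences differ...
print(precomposed == decomposed)
# ...but both map to the same key after normalization.
print(search_key(precomposed) == search_key(decomposed))
```

Indexing and querying on `search_key(...)` rather than on the raw string lets a single input match both forms.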
> >
> > Nat
> >
> > -----Original Message-----
> > From: Drummond Reed [mailto:drummond.reed@onename.com]
> > Sent: Thursday, May 22, 2003 3:26 PM
> > To: Sakimura, Nat; Wachob, Gabe; xri@lists.oasis-open.org
> > Cc: DI
> > Subject: I18N strategy
> >
> > [Note thread change: was RE: [xri] Groups -
> > xri-requirements-1.0-draft-05b.doc uploaded]
> >
> > Nat,
> >
> > Thank you very much for this exceptionally clear and cogent
> > analysis of the I18N challenges we face. For those of us who are
> > not specialists in I18N, it really helps to understand the major
> > issues and the tradeoffs involved with both IRI and Unicode.
> >
> > There is no question in my mind (and never has been) that XRI must
> > be internationalized to the greatest extent feasible at any
> > particular point in time. I know Gabe feels the same way and I
> > believe this is the consensus of the entire TC (anyone who
> > disagrees, please speak up now).
> >
> > So the key decision we face is: what is "the greatest extent
> > feasible" in May 2003? Based on your knowledge of the problem
> > space, what strategy do you suggest the TC follow to make this
> > decision?
> >
> > Also, as a procedural note, I suggest we add an Internationalization
> > section to the XRI spec outline in which we document this strategy,
> > as many in the I18N community will be interested specifically in
> > how the spec handles this issue.
> >
> > =Drummond
> >
> > -----Original Message-----
> > From: Sakimura, Nat [mailto:n-sakimura@nri.co.jp]
> > Sent: Wednesday, May 21, 2003 3:10 AM
> > To: Wachob, Gabe; xri@lists.oasis-open.org
> > Cc: DI
> > Subject: RE: [xri] Groups -
> > xri-requirements-1.0-draft-05b.doc uploaded
> >
> > Thanks Gabe and Peter for your comments.
> >
> > The more I study this problem, the more it seems that the source
> > of the problem actually is Unicode and ISO 10646-1. I wish we had
> > made DIS 10646 ver.1.0 the ISO standard. Japan was pushing for it,
> > but most people were not worried about the limitations of Unicode
> > at the time, which became apparent in 5 years. Much of the problem
> > with IRI also arises from it. Essentially, the problem comes from
> > the fact that in Unicode (and thus UTF-8 and ISO 10646-1), you
> > cannot determine the language from the code unless you are lucky.
> >
> > Without going into the details, let me restate the original
> > problem raised earlier on this ML in the following fashion.
> >
> > There seem to be two issues involved in IRI <-> URI conversion.
> > The first is actually
> > (P1) local charset IRI to UTF-8 IRI conversion,
> > and the second is
> > (P2) UTF-8 IRI to URI conversion.
> >
> > The issue around (P2) is much easier than (P1), so I will
> > start with it.
> >
> > (P2) above can be stated as follows.
> > Define f() : IRI (UTF-8) -> URI escaping function.
> > Define g() : URI -> IRI conversion function.
> > Let u = f(i).
> > Then there exists an i such that i != g(u).
> >
> > The reason:
> > g() != f^-1() because of the following.
> >
> > (a) g() must not result in an octet sequence that is
> > not part of a strictly legal UTF-8 octet
> > sequence: a URI may contain escaped sequences
> > that did not originate from UTF-8,
> > e.g., a sequence in ISO-8859-1 encoding.
> > (b) There are further restrictions on the legal octet
> > stream beyond UTF-8. E.g., half-width Japanese
> > kana is not legal in IRI. These have to be re-escaped.
> >
> > This is valid in general, i.e., if we consider g() and f^-1() over
> > the set U, which is the legal URI space, g() and f^-1() are not
> > equal. (In other words, although f(g(u)) = u for any u, there
> > exists an x such that g(f(x)) != x).
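The asymmetry between f() and g() can be made concrete with a rough approximation. The two functions below are simplified stand-ins, not the algorithm from the IRI draft: `f` percent-escapes non-ASCII characters as UTF-8 octets and passes existing escapes through, and `g` unescapes only runs of escapes that decode as strictly valid, non-ASCII UTF-8 (restriction (a) above). The `SAFE` set and the example.org addresses are assumptions for illustration; restriction (b) is not modeled.

```python
import re
from urllib.parse import quote

# Characters f() leaves untouched: URI reserved/unreserved marks plus '%',
# so that escapes already present in the input are passed through.
SAFE = ":/?#[]@!$&'()*+,;=%-._~"

def f(iri: str) -> str:
    """IRI -> URI: percent-escape non-ASCII characters as UTF-8 octets."""
    return quote(iri, safe=SAFE)

def g(uri: str) -> str:
    """URI -> IRI: unescape only runs that are strictly valid non-ASCII UTF-8."""
    def unescape(match):
        raw = bytes.fromhex(match.group(0).replace("%", ""))
        try:
            text = raw.decode("utf-8")  # strict decoding
        except UnicodeDecodeError:
            return match.group(0)  # e.g. ISO-8859-1 octets stay escaped
        if any(ord(ch) < 128 for ch in text):
            return match.group(0)  # keep escaped ASCII such as %2F as is
        return text
    return re.sub(r"(?:%[0-9A-Fa-f]{2})+", unescape, uri)

plain = "http://example.org/\u00fc"          # literal u-umlaut
pre_escaped = "http://example.org/%C3%BC"    # same octets, already escaped
latin1_uri = "http://example.org/%FC"        # escape that is not UTF-8

print(g(f(plain)) == plain)              # round trip succeeds
print(g(f(pre_escaped)) == pre_escaped)  # round trip changes the input
print(g(latin1_uri) == latin1_uri)       # non-UTF-8 escape is left alone
```

In this sketch the round trip fails exactly when the input already carries an escape that g() is willing to unescape, which is one instance of g(f(x)) != x.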
> >
> > I was wondering if this is much of a concern for us, even though
> > the IRI draft spec states that "the IRI resulting from this
> > conversion may not be exactly the same as the original IRI (if
> > there ever was one)", because we are concerned not with U but
> > with a subset of U, which is derived by the conversion of IRIs.
> >
> > Proposition: for any x, which is an element of U' such that
> > U' = f(I) where I is the set of legal IRIs.
> > then,
> > g(x) = f^-1(x) for any x.
> >
> > Reason:
> > (1) Any escaped octet in this case is guaranteed to
> > have come out of UTF-8. So, we do not have to worry
> > about (a) above.
> > (2) The set "I" does not include an element referred by (b) above.
> >
> > (I am just thinking in abstract logic. I am not familiar with
> > BIDI and other problem spaces, but mathematically, the above looks
> > good to me. One caveat: I have to check the escape function f()
> > again to see if it is good. f() must be such that f^-1() exists.)
> >
> > Thus, the source of the problem seems to lie in the fact that the
> > set of legal URIs is a superset of the set of URIs derived from
> > IRIs. This indeed is a source of compatibility problems, as Peter
> > says. This poses a question for us:
> > "Do we really want to state things in terms of URI, at least
> > temporarily?"
> > My answer is NO. If we do that, we will be strangled in the same
> > fashion as IRI is.
> >
> > The next problem, (P1), is much more complicated. Problem (P1)
> > is actually a problem of UTF-8 and Unicode. The union of local
> > charsets is larger than the UTF-8 space. Moreover,
> > Unicode/UTF-8/ISO 10646 has no way of telling what language
> > context it is in by itself. It has to be used in conjunction with
> > some other information like locale, language flag, or font set.
> >
> > UTF-8 collapses several distinct but similar-looking local
> > characters into one representation. For example, the first
> > character of my last name, "Saki", cannot be represented correctly
> > with it. To the eye of a "Latin character world" person, the
> > difference may look like that between a straight and a cursive
> > "g", but it is not. If I submit an official document with the
> > UTF-8 representation of my name, the government will not accept it
> > because the name is different. Local charset to UTF-8 conversion
> > is actually a many-to-one function. Thus, an inverse function does
> > not exist; going back is a one-to-many mapping. The situation
> > becomes even more complicated when we mix several languages.
> >
> > This one will never be solved unless we abandon UTF-8, or Unicode
> > for that matter. There are supplemental problems in searching
> > through combined characters etc. For example, Devanagari would
> > have its own representation problem. I would like to draw the
> > attention of Reva Modi, who is probably more knowledgeable on
> > these matters than I, to it. I have already contacted Mr.
> > Kobayashi, who is the only member of the Unicode Technical
> > Committee from Japan, and will continue doing so. From what I
> > found, Mr. Kobayashi is also a bit negative on the capability of
> > Unicode for the sake of real I18N. (On the other hand, DIS 10646
> > ver.1.0 would have done it properly. Unicode and ISO 10646 were
> > originally fixed-length encodings. One of the main reasons for
> > pushing Unicode and not DIS 10646 ver.1.0 was that the latter was
> > a variable-length encoding and perceived as inefficient. The
> > shortcoming became apparent within 3 years, and this fixed-length
> > thing was abandoned in 1996. Now, we have lost the fixed
> > length-ness, and are facing the problems that DIS 10646 ver.1.0
> > would not have had. After all, at the time of the ISO 10646
> > voting, it seems the only multi-byte country with a long history
> > of computerized local language processing was Japan. Majority
> > voting does not always produce the correct result...) Today, even
> > HTML ver.4 is asking for a language switch like LANG="ja" and
> > LANG="en". This is another indicator of the fact that the dream of
> > Unified Character Coding is lost.
> >
> > Now, do we want to tackle this formidable problem? Hmmm. A good
> > question.
> >
> > The bottom line is: if we decide to live in the UTF-8 world, then
> > IRI seems to be OK. If we want real I18N, we are probably leaving
> > the Unicode world.
> >
> > On the other hand, we MUST NOT point to IRI in the normative spec.
> > We should extract the IRI spec and insert it into our spec until
> > IRI actually becomes a spec. (If I remember correctly, this was the
> > requirement for the use of the proposed IRI spec.)
> >
> > Nat
> >
> > -----Original Message-----
> > From: Wachob, Gabe [mailto:gwachob@visa.com]
> > Sent: Wednesday, May 21, 2003 1:49 AM
> > To: xri@lists.oasis-open.org
> > Subject: RE: [xri] Groups -
> > xri-requirements-1.0-draft-05b.doc uploaded
> >
> > Nat-
> > Actually, the IRI spec says that mapping from URIs to IRIs
> > unambiguously requires context not present in a URI. In
> > http://www.w3.org/International/iri-edit/draft-duerst-iri.html#URItoIRI
> > the problem is demonstrated in the situation where you convert an
> > IRI to a URI and back -- "the IRI resulting from this conversion
> > may not be exactly the same as the original IRI". Maybe we can
> > address this ambiguity - perhaps you can figure this out better
> > than I have done so far.
> >
> > Also, as for "defining a URI" vs. "defining an IRI" - I'm
> > not sure how this plays out. We absolutely need to be able to use
> > XRIs anywhere one would use a URI. However, we know that IRIs can
> > always be mapped to URIs (and in the case where there are no
> > non-URI characters in an IRI, the IRI is syntactically equivalent
> > to the URI to start with).
> >
> > As for equivalence - that's something we *have* to discuss as the
> > specifiers of a URI scheme. We don't have to say much - we can say
> > that two XRI URIs are "equivalent" if they are octet-by-octet the
> > same (though there are issues about unescaping sequences before or
> > after the comparison). I suppose it gets trickier if you define
> > XRIs as an IRI scheme.
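The "before or after unescaping" question can be shown with two tiny comparison functions. This is a sketch only: the function names and the `xri://example/...` identifiers are made up for illustration, and `unquote_to_bytes` is the standard `urllib.parse` helper that turns %XX escapes into raw octets.

```python
from urllib.parse import unquote_to_bytes

def octets_equal(a: str, b: str) -> bool:
    """Compare the identifiers octet by octet, exactly as written."""
    return a.encode("ascii") == b.encode("ascii")

def octets_equal_after_unescaping(a: str, b: str) -> bool:
    """Turn %XX escapes into raw octets first, then compare."""
    return unquote_to_bytes(a) == unquote_to_bytes(b)

tilde_escaped = "xri://example/%7Euser"
tilde_literal = "xri://example/~user"
slash_escaped = "xri://example/a%2Fb"
slash_literal = "xri://example/a/b"

print(octets_equal(tilde_escaped, tilde_literal))
print(octets_equal_after_unescaping(tilde_escaped, tilde_literal))
# Unescaping first also merges an escaped delimiter with a real one.
print(octets_equal_after_unescaping(slash_escaped, slash_literal))
```

The last comparison hints at why the order matters: unescaping everything before comparing erases the difference between a `/` that separates segments and an escaped `%2F` that is data inside a segment.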
> >
> > The other problem with relying on the IRI spec right now is that
> > it's not a spec yet. It's still only a draft over at the IETF, and
> > the IETF process is slow. I'm guessing we won't see a finalized
> > IRI spec in 2003.
> >
> > Don't get me wrong - I think we should leverage IRIs somehow.
> > I'd even be in favor of defining XRIs as an IRI scheme if we could
> > ensure that would not cause any problems for those many places where
> > URIs are called for (after conversion to the URI form). I just
> > think it's more complicated than simply referring to the IRI spec
> > (a lot more complicated).
> >
> > -Gabe
> >
> > > -----Original Message-----
> > > From: Sakimura, Nat [mailto:n-sakimura@nri.co.jp]
> > > Sent: Tuesday, May 20, 2003 12:06 AM
> > > To: xri@lists.oasis-open.org
> > > Subject: RE: [xri] Groups -
> > > xri-requirements-1.0-draft-05b.doc uploaded
> > >
> > >
> > > Gabe,
> > >
> > > Conceptually, IRI has a larger set than URI (IRI includes URI),
> > > but both are countable and thus can be mapped one to one, I
> > > think. Could you give me an example of mapping one URI to
> > > multiple IRIs please?
> > >
> > > Fundamentally, the question for us probably is "do we really
> > > want to be bound by this aging URI standard?" To me, the URI vs.
> > > IRI controversy is largely due to backward compatibility issues.
> > > If we think afresh, we probably would not choose URI to be the
> > > normative format because it is the source of a myriad of problems
> > > for I18N. Unicode is not perfect (some purists say that it is
> > > useless - it generally cannot distinguish among similar but
> > > distinct characters because these are collapsed into one), but is
> > > much cleaner. Resolution does not have to go through the
> > > transformation to URI. Our internationalized identifier should be
> > > able to be resolved directly.
> > >
> > > On equivalence: I think URI equivalence arguments do not affect
> > > us. This is because we have an abstract permanent identifier,
> > > which can be pretty restrictive in the allowed character set as
> > > we do not need human readability. To test the equivalence of two
> > > identifiers, we should resolve them to the permanent identifier
> > > and compare those. To protect privacy, we might not want to
> > > expose the permanent identifier. In this case, the proxy should
> > > give out a True/False result. We have a much more powerful tool
> > > than URIs in this regard.
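The equivalence-by-resolution idea can be sketched as a proxy that answers only True or False. All names and identifiers below are hypothetical, and the in-memory dictionary stands in for whatever resolution machinery would really sit behind the proxy.

```python
from typing import Dict, Optional

class EquivalenceProxy:
    """Answers equivalence questions without revealing permanent identifiers."""

    def __init__(self, permanent_ids: Dict[str, str]) -> None:
        # Maps each resolvable identifier to its (private) permanent identifier.
        self._permanent_ids = permanent_ids

    def _resolve(self, identifier: str) -> Optional[str]:
        return self._permanent_ids.get(identifier)

    def are_equivalent(self, first: str, second: str) -> bool:
        """True only when both identifiers resolve to the same permanent id."""
        a = self._resolve(first)
        b = self._resolve(second)
        return a is not None and a == b

proxy = EquivalenceProxy({
    "xri://example/alice": "permanent-0001",
    "xri://example/alice.smith": "permanent-0001",
    "xri://example/bob": "permanent-0002",
})
print(proxy.are_equivalent("xri://example/alice", "xri://example/alice.smith"))
print(proxy.are_equivalent("xri://example/alice", "xri://example/bob"))
```

Unresolvable identifiers are treated as not equivalent to anything, including themselves; that is an assumption of this sketch rather than something the message specifies.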
> > >
> > > Nat
> > >
> > > -----Original Message-----
> > > From: Wachob, Gabe [mailto:gwachob@visa.com]
> > > Sent: Friday, May 16, 2003 4:25 AM
> > > To: 'Drummond Reed'; xri@lists.oasis-open.org
> > > Subject: RE: [xri] Groups -
> > > xri-requirements-1.0-draft-05b.doc uploaded
> > >
> > > Drummond-
> > > A few notes.
> > >
> > > First, in section 3.4.5 (you said 3.3.5) - "non-resolvable
> > > syntax" - what's the use case? Why do we need to *prevent* an
> > > attempt to resolve? Why would a software component resolve an
> > > identifier unless it needed to? It seems like there are only two
> > > cases: a piece of software needs to resolve the identifier, or it
> > > doesn't. This decision is based on application semantics, not the
> > > syntax of the identifier. How does marking an identifier as
> > > "non-resolvable" help at all?
> > >
> > > In section 3.4.6 (internationalization) - there is a discussion
> > > going on at the W3C TAG (issue named something like
> > > "IRIEverywhere") about where IRIs should appropriately be used.
> > > It is clear, for example, that IRIs cannot be used everywhere
> > > URIs can be used. The issue is whether *future* specs should
> > > refer to IRIs or URIs. An IRI can be "cast down" into a URI
> > > unambiguously, but because there are several ways to translate
> > > Unicode into ASCII, it's not always possible to unambiguously
> > > convert a URI back into an IRI (without some context like the
> > > encoding used to go from IRI to URI). So, while I think we should
> > > definitely address IRIs and XRIs, I don't think XRIs should
> > > expect to be solving the problems that IRIs have with their
> > > relationship to URIs. We *could* propose a way to encode the
> > > things that are needed to unambiguously convert a URI back into
> > > an IRI, but I'm guessing that would actually break the IRI spec.
> > > I'm going out beyond my competency here I think.
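The dependence on encoding context when going from URI back to IRI is easy to reproduce. The sketch below uses the standard `urllib.parse.unquote`, which takes the encoding as a parameter; the escaped strings are arbitrary examples.

```python
from urllib.parse import unquote

escaped = "%C3%BC"

# The same escaped octets read differently under different assumed encodings.
as_utf8 = unquote(escaped, encoding="utf-8", errors="strict")
as_latin1 = unquote(escaped, encoding="iso-8859-1", errors="strict")
print(as_utf8)     # one character: u with diaeresis
print(as_latin1)   # two characters: A with tilde, then the one-quarter sign

# A single octet 0xFC is a u-umlaut in ISO-8859-1 but is not valid UTF-8.
single = "%FC"
print(unquote(single, encoding="iso-8859-1", errors="strict"))
try:
    unquote(single, encoding="utf-8", errors="strict")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print(utf8_ok)
```

Without knowing which encoding produced the escapes, a converter cannot tell which of these readings was intended.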
> > > Bottom line is that we either have to wait for the IRI things
> > > to shake out, or we have to tread new ground in i18n. I
> > > *definitely* want XRIs to be "i18n enabled", but I'm a little
> > > worried about us planning on achieving that in the short term by
> > > relying on IRIs.
> > >
> > > This document has come a LONG way and I think does a pretty good
> > > job of identifying why we are all here. Congrats and thanks to
> > > all those who contributed. I'm sure there will be more input and
> > > fixes to the doc, but I feel like we're very close to the "good
> > > enough" state where we can then concentrate on the syntax and
> > > resolution specs.
> > >
> > > -Gabe
> > >
> > >
> > > > -----Original Message-----
> > > > From: Drummond Reed [mailto:drummond.reed@onename.com]
> > > > Sent: Thursday, May 15, 2003 11:45 AM
> > > > To: xri@lists.oasis-open.org
> > > > Subject: RE: [xri] Groups -
> > > > xri-requirements-1.0-draft-05b.doc uploaded
> > > >
> > > >
> > > > First, let me note two reasons for posting v5b:
> > > >
> > > > 1) I found out from Marc Le Maitre this morning that leaving
> > > > "Track Changes" on screwed up the section numbering, so it makes
> > > > it difficult to talk about requirement numbers. Let's use v5b on
> > > > the call today.
> > > >
> > > > 2) There was an MS Word cross-reference error (unfortunately not
> > > > all that uncommon) in 3.4.7 that needed fixing.
> > > >
> > > > Please make any edits to this clean version after making sure
> > > > "Track Changes" is turned on.
> > > >
> > > > I will review the key updates on the TC call this afternoon, but
> > > > the major areas to review are:
> > > >
> > > > * Sections 2.1 - 2.3 of the Motivations section. These were
> > > > rewritten for the third time to reflect the consensus regarding
> > > > terminology.
> > > >
> > > > * Requirement 3.1.2 was rewritten to reflect the URN conformance
> > > > topic as discussed on the list.
> > > >
> > > > * The original requirements section 3.3 was broken into the new
> > > > sections 3.3 and 3.4 to reflect the clarifications in 2.2 and 2.3
> > > > about persistence and HFIs/MFIs.
> > > >
> > > > * 3.3.5 (Non-Resolvable Syntax) was added to reflect a
> > > > requirement Marc Le Maitre has surfaced from the Namespace
> > > > committee of the U.S. XML.gov working group.
> > > >
> > > > * 3.4.6 (Internationalization) was edited to reflect Nat's input
> > > > regarding IRIs. We should discuss this on today's call.
> > > >
> > > > * The Glossary was updated and all TO DO's in it were finished.
> > > >
> > > > The only remaining TO DOs are a few entries in the informative
> > > > glossary and Appendix A (Acknowledgments).
> > > >
> > > > Talk to everyone at 3pm PDT.
> > > >
> > > > =Drummond
> > > >
> > > > -----Original Message-----
> > > > From: Drummond Reed
> > > > Sent: Thursday, May 15, 2003 11:13 AM
> > > > To: xri@lists.oasis-open.org
> > > > Subject: [xri] Groups -
> > > > xri-requirements-1.0-draft-05b.doc uploaded
> > > >
> > > > The document xri-requirements-1.0-draft-05b.doc has been
> > > > submitted by Drummond Reed (drummond.reed@onename.com) to the
> > > > Extensible Resource Identifier TC document repository.
> > > >
> > > > Document Description:
> > > > v5b of XRI Requirements and Glossary - This is a CLEAN version
> > > > with a faulty MS Word cross-reference fixed. Please submit any
> > > > edits using this version.
> > > >
> > > > Download Document:
> > > > http://www.oasis-open.org/apps/org/workgroup/xri/download.php/2050/xri-requirements-1.0-draft-05b.doc
> > > >
> > > > View Document Details:
> > > > http://www.oasis-open.org/apps/org/workgroup/xri/document.php?document_id=2050
> > > >
> > > >
> > > > PLEASE NOTE: If the above links do not work for you, your email
> > > > application may be breaking the link into two pieces. You may be
> > > > able to copy and paste the entire link address into the address
> > > > field of your web browser.
> > > >
> > > > -OASIS Open Administration
> > > >
> > >
> > > You may leave a Technical Committee at any time by visiting
> > > http://www.oasis-open.org/apps/org/workgroup/xri/members/leave_workgroup.php