xri message

Subject: RE: [xri] I18n and $ tags (on the $l and $f proposals)

From: "Sakimura, Nat" <n-sakimura@nri.co.jp>
To: "Wachob, Gabe" <gwachob@visa.com>, <xri@lists.oasis-open.org>
Date: Tue, 29 Jul 2003 12:16:08 +0900

"center" in reality in Chinese/Japanese has same glyph, so, let me take "grain" as an example. Character "grain" share the same glyph with Japanese and traditional Chinese. Let us call this glyph as "T-grain". On the other hand, in simplified Chinese, the glyph becomes totally different looking, which even does not resemble at all. Let us call this glyph as "S-grain". This "S-grain" glyph is exactly same as the Japanese glyph representation of "Valley" character. So, let us call it "T-valley". As the result, the following cases occur: (1) When printed, "S-grain" and "T-valley" is the same. For humans, however, they are not equivalent as the semantics is completely different. Their character representation is different. So, this factor is properly addressed by the character by character comparison. (2) When printed, "S-grain" and "T-grain" is different. However, the semantics are the same, and they share the same "character code point". Character-by-character comparison states that they are equivalent. For humans, they are also equivalent as long as we have the language and font (whether the print is using simplified or traditional) context. So, this factor is also properly addressed by the character by character comparison. (3) When printed, "S-valley" and "T-valley" is the same. Glyph "S-valley" actually has two character code points : one for "valley" and one for "grain". Although "S-valley" looks the same in both instances when printed, human can usually distinguish them through context. Computers are usually dumb that it cannot understand the context, however, it really does not have to understand the context here because depending on the context, they are assigned different character code-point. So, the character by character comparison works well here, too. The reason I used S and T instead of C and J is because this problem is known as "ST problems -- Simplified/Traditional problems." 
It can happen between Chinese and Japanese, but more fundamentally, it is a problem within Chinese: whether to use a Simplified font or a Traditional font. Unicode was designed as an internal encoding, so it deals with this problem well. Actually, this ST problem did not come from Unicode. It popped up during domain name internationalization.

Nat

> -----Original Message-----
> From: Wachob, Gabe [mailto:gwachob@visa.com]
> Sent: Friday, July 25, 2003 9:02 AM
> To: Sakimura, Nat; xri@lists.oasis-open.org
> Subject: RE: [xri] I18n and $ tags (on the $l and $f proposals)
>
> Nat-
> I understand all the issues you describe below. However, I don't understand how your email responds to my question of how the use of different $f cross-references to distinguish two XRIs makes them the same for comparison, but different for semantic understanding by a human. It seems a fundamental tenet that if two identifiers are different in both the characters they contain (specifically, in the "$f" cross-reference) AND in their meanings to humans, they should NOT be considered equivalent.
>
> What I believe you are suggesting is that even if they differ in the characters they contain (in the $f cross-reference) they should be considered equivalent. But (and this is what I am unclear on) you also suggest that if they differ in the font used, they differ in meaning to the human user. If that's the case, then shouldn't they be considered different identifiers?
>
> I would favor a processor being able to declare equivalence only if the two identifiers are Unicode character-by-character equivalent (subject to case insensitivity on the authority part for ASCII characters). Anything beyond this base test of equivalence needs very strong support and I don't yet see that here.
>
> -Gabe
>
> > -----Original Message-----
> > From: Sakimura, Nat [mailto:n-sakimura@nri.co.jp]
> > Sent: Thursday, July 24, 2003 4:41 PM
> > To: Wachob, Gabe; Dave McAlpin; Wachob, Gabe; xri@lists.oasis-open.org
> > Subject: RE: [xri] I18n and $ tags (on the $l and $f proposals)
> >
> > What you do not understand probably is the distinction between character and glyph.
> >
> > A character is an absolute semantic code point which may have multiple representations as glyphs. Thus, even if the glyph is different due to either font or language, they were compressed into one character. The spirit of the ISO standard dictates that we should make comparisons and searches on this "character" and not on the glyph.
> >
> > Examples:
> > The font case is easy to depict and understand even in English.
> > center and <font face="Arial">center</font> are semantically equivalent.
> >
> > Language is rather difficult, but this is my shot:
> > <lang="en-US">center</lang> and <lang="en-GB">centre</lang> are semantically equivalent.
> >
> > Now, of course, "center" is not a character in English so they are not equivalent in the ISO 10646 world, but it depicts the Han unification well. Indeed, "center" in this case is a character in Chinese/Japanese/Korean, and <lang="ja">center</lang>, <lang="kr">center</lang>, etc. are given the same character code point because they share the same roots and have the same semantics. On the other hand, e-accent can be encoded as an e-accent character or as an e character + accent character. They share the same roots, and they are semantically equivalent, but they have two code points! (Now you start to see how broken this standard is. This is the source of complaints that many people have with regard to Unicode.)
> > Here comes the normalization, which states that a composed character sequence that has a corresponding single character should be converted to the single-character representation (i.e., e character + accent character should be converted to an e-accent character). In this case, it looks easy enough. However, in other languages, the formation of a composed character may depend on the location and context in which it is placed. Thus, normalization becomes extremely difficult. Do we want to require the resolver to have such knowledge? My answer was no. That is too heavyweight.
> >
> > Nat
> >
> > -----Original Message-----
> > From: Wachob, Gabe [mailto:gwachob@visa.com]
> > Sent: 2003/07/25 (Fri) 2:35
> > To: Sakimura, Nat; Dave McAlpin; Wachob, Gabe; xri@lists.oasis-open.org
> > Cc:
> > Subject: RE: [xri] I18n and $ tags (on the $l and $f proposals)
> >
> > I don't quite understand this proposal because if the font tags are important to the human understanding of an XRI, why are they insignificant for comparing two XRIs?
> >
> > In other words, if fonts have to do with the semantics of the XRI, shouldn't they be part of the comparison? If they are purely presentational (i.e., no semantics), then why are they part of the XRI spec?
> >
> > -Gabe
> >
> > -----Original Message-----
> > From: Sakimura, Nat [mailto:n-sakimura@nri.co.jp]
> > Sent: Thursday, July 24, 2003 3:22 AM
> > To: Dave McAlpin; Wachob, Gabe; xri@lists.oasis-open.org
> > Subject: RE: [xri] I18n and $ tags (on the $l and $f proposals)
> >
> > That's right. Now, I expect many European participants will oppose this very simplistic way of defining equivalence (partly due to the fact that it looks somewhat workable in their language space), but to me, this is good enough because I have no hope of defining 'glyph based equivalence' for every language on the earth.
> > If we really need complex equivalence, we should make use of an external "thesauri" service. It should not be in the core.
> >
> > Nat
> >
> > -----Original Message-----
> > From: Dave McAlpin [mailto:dave.mcalpin@epokinc.com]
> > Sent: Tuesday, July 22, 2003 11:51 PM
> > To: Sakimura, Nat; Wachob, Gabe; xri@lists.oasis-open.org
> > Subject: RE: [xri] I18n and $ tags (on the $l and $f proposals)
> >
> > Thanks Nat, this is very helpful. So your recommendation for equivalence is that we should 1) convert to UTF-8 if necessary, 2) remove any language-related tags (including font and glyph selector, if included), and 3) perform a character-by-character (i.e., codepoint-by-codepoint) comparison. Is that right?
> >
> > Dave
> >
> > -----Original Message-----
> > From: Sakimura, Nat [mailto:n-sakimura@nri.co.jp]
> > Sent: Monday, July 21, 2003 11:15 PM
> > To: Wachob, Gabe; xri@lists.oasis-open.org
> > Subject: RE: [xri] I18n and $ tags (on the $l and $f proposals)
> >
> > The reason $f came up beside $l was that, to represent the octet stream in a human-readable fashion, Unicode and hence ISO 10646 requires the following information:
> >
> > 1. Actual octet stream
> > 2. Language
> > 3. Glyph selector
> > 4. Font
> >
> > I know, this sounds like a sick joke, but this is the reality. (That's why I was grumbling earlier that I wish we had DIS 10646 ver.1 as the ISO standard.)
> >
> > I believe we can go pretty far with only 1 and 2, but I do not want to pretend that I know the problems other languages will encounter, so it is better to leave some room to preserve the original amount of information. You are right, it might not be used very often in real life, but some people may need it. Then, why should we remove it?
> >
> > As far as equivalence is concerned, I believe that we should be comparing either the actual octet stream itself or the terminal outcome of the resolution.
> > Equivalence is another huge topic involving normalization, which may be even harder than multilingualization itself. I would even go further and say that equivalence should not be dealing with normalization, but that might be a little too extreme. I suspect normalization will be a nightmare for implementers, because one has to have a mapping between the composed form and the decomposed form, and of course, you need the language and context information for this to happen. Also, when a new composed form is added, one has to add it to the mappings. It sounds too difficult to me.
> >
> > Nat
> >
> > -----Original Message-----
> > From: Wachob, Gabe [mailto:gwachob@visa.com]
> > Sent: Saturday, July 19, 2003 4:57 AM
> > To: xri@lists.oasis-open.org
> > Subject: RE: [xri] I18n and $ tags (on the $l and $f proposals)
> >
> > While I really think these proposals *could* be useful, I think they would be used (especially the $f one) in a relatively limited set of situations (i.e., those where the XRIs are presented to humans).
> >
> > That's a provocative statement I've just made. Some folks have in their minds that most (many?) XRIs will be presented to humans. Some folks (me included) believe most won't.
> >
> > What I truly believe is that for some applications of XRIs, a large proportion will be presented to humans, and for other applications, they won't be presented to humans. Of course, we see this sort of flexibility as a strength. But this sort of flexibility is also the source of tension when deciding when to include or exclude features.
> >
> > I sound like a broken record, but I want to make sure that we are addressing a *real* need and that the solution doesn't create more complexity than it tries to eliminate.
> >
> > For example, the $f/(+Arial) proposal looks good on the surface but there are several complicating factors:
> >
> > 1) You probably don't want a top-level +<font-name> entry because I could easily see a font name conflicting with another use of the same term. There are a ton of fanciful font names and I could easily see +Modern being ambiguous as a font name or something else. So we'd end up with +font/Modern, which would appear as $f/(+font/Modern).
> >
> > 2) Look how complicated the XRIs get... Even if you assume the font information is inserted by the UIs (and not presented to the user), this seems to complicate equivalence rules...
> >
> > 3) It seems that no matter what the structure is for font names, someone is going to have to manage a list of font names. Fonts are subject to intellectual property rights (at least in some places) and this tends to mean that there is no central registry of font names that everyone agrees on and that is managed. Fonts are considered "property" which is licensed (though there are "public domain" ones). This is not a problem directly, but leads (I believe) to a situation where the universe of fonts is rather scattered and hard to survey properly. Certainly not something we want to do anyway. Use of the +font namespace seems appropriate.
> >
> > So, we need to be very clear about the problems we are solving using this $f mechanism, because if the benefits don't outweigh the complexity, we shouldn't do them.
> >
> > What's the use case? How is this driven by internationalization concerns? If so, can we be more specific about the disambiguation we are trying to address? Without having the background of i18n, it strikes me as *really* odd to specify presentation information in the identifier -- I know others will have the same response.
> >
> > Outside of $f (to which I am specifically pushing back), I agree with Geoffrey that using + cross-references under other $ names (language, version syntax, etc.) is a Good Thing. They allow a great deal of flexibility at the cost of human readability/usability (which is a fine compromise for me, in the use cases I am biased towards).
> >
> > -Gabe
> >
> > > -----Original Message-----
> > > From: geoffrey.strongin@amd.com [mailto:geoffrey.strongin@amd.com]
> > > Sent: Monday, July 14, 2003 8:42 AM
> > > To: xri@lists.oasis-open.org
> > > Subject: RE: [xri] I18n and $ tags
> > >
> > > I like this. It really leverages the power of the + namespace.
> > >
> > > Geoffrey
> > >
> > > > -----Original Message-----
> > > > From: Drummond Reed [mailto:drummond.reed@onename.com]
> > > > Sent: Friday, July 11, 2003 11:58 PM
> > > > To: Dave McAlpin; xri@lists.oasis-open.org
> > > > Subject: RE: [xri] I18n and $ tags
> > > >
> > > > -----Original Message-----
> > > > From: Dave McAlpin [mailto:dave.mcalpin@epokinc.com]
> > > > Sent: Friday, July 11, 2003 3:57 PM
> > > > To: xri@lists.oasis-open.org
> > > > Subject: [xri] I18n and $ tags
> > > >
> > > > I assume internationalization does not apply to the $ tags. For example, there's no internationalized version of $v. Is this correct? Is this ok?
> > > >
> > > > Dave
> > > >
> > > > *****Drummond replies*****
> > > >
> > > > I think it's not only correct, but also a good thing.
> > > > There should be no need to internationalize the $ space for the following reason: IMHO, the purpose of the $ space is to provide a mechanism for extending the very limited set of reserved chars in RFC 2396 (which we've already had to bust out of in order to add support for xrefs and sub-segments) in order to have sufficient metadata (and extensibility) to describe identifiers in ways that are vital to the act of identification, i.e., language, font, version syntax, query syntax, resolvability, human-readable comment, etc.
> > > >
> > > > For this reason, I propose that in Appendix B we state a formal requirement that the vocabulary in the $ identifier space (note that I don't call it a namespace for the reasons I'm about to argue) be as terse as possible, not just to enforce compactness, but to reinforce that it is an extension of the reserved-symbol-space and not intended to carry linguistic-level semantics.
> > > >
> > > > For example, the $l (language) space should, as Nat proposed, use the two-letter codes for languages specified in ISO standard 639 referenced in RFC 1766. It should NOT use full-length equivalents.
> > > >
> > > > The proposed $f (font) space for font names would violate this rule if it used full-length English font names. (Furthermore, if we did that, it would beg for internationalization.) To avoid both problems, we should try to find a compact font name abbreviation registry that we can reference, similar to ISO 639 for language abbreviations.
> > > >
> > > > If we can't find one, and we don't want to create one (at least I don't want to), there is another solution - one that applies nicely to any $ space. In place of an exact, rigorously specified vocabulary, every $ space can also cross-reference common names in the + space. Here's an example of how that would work for a font name:
> > > >
> > > > xri:($l/fr).($f/(+Arial)).french-word-in-Arial-font/foo
> > > >
> > > > Rather than using "($f/Arial)", which would mean "Arial" was formally registered in the "$f" space, the segment "($f/(+Arial))" simply means "Arial" is a common name in the context of a font. I'm not a font expert, but I'd be willing to guess that a large percentage of typographic software would recognize that common name for a font. Furthermore, the XRI above would also tell the XRI parser that the common name "Arial" should be interpreted not just in the context of being a font, but specifically as being a French name for a font. That should reduce the chance of misinterpretation even further.
> > > >
> > > > Use of the + space for real-world common names for metadata like fonts means there is an easy way to apply the 80/20 rule, while leaving it open for the $f space to reference a more exhaustive and non-ambiguous font name abbreviation registry later.
> > > >
> > > > Again, I think this rule should be applied across the board to all $ spaces, including language, font, version syntax, query syntax, etc.
> > > >
> > > > =Drummond
> > > >
> > > > You may leave a Technical Committee at any time by visiting
> > > > http://www.oasis-open.org/apps/org/workgroup/xri/members/leave_workgroup.php
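
The three-step equivalence test that Dave summarizes in this thread (convert to UTF-8 if necessary, remove the language and font tags, compare code point by code point) can be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not anything the committee specified: the names `to_text`, `strip_presentation_tags`, and `equivalent` are hypothetical, the tags are assumed to appear as `($l/...)` and `($f/...)` cross-references as in Drummond's example XRI, and the optional NFC step is there only to show the composed versus decomposed e-accent case that Nat describes. The thread itself leans against requiring normalization, so it is off by default. Gabe's case-insensitivity rule for ASCII characters in the authority part is not covered.

```python
import unicodedata

# Assumption: language and font metadata appear as ($l/...) and ($f/...)
# cross-references, as in the example XRI quoted in this thread.
TAG_PREFIXES = ("($l/", "($f/")


def to_text(value):
    """Step 1: decode UTF-8 octets into code points; str input passes through."""
    return value.decode("utf-8") if isinstance(value, bytes) else value


def strip_presentation_tags(xri):
    """Step 2: drop ($l/...) and ($f/...) segments, honoring nested parentheses."""
    out = []
    i = 0
    while i < len(xri):
        if xri.startswith(TAG_PREFIXES, i):
            depth = 0
            j = i
            # Scan forward to the parenthesis that closes this cross-reference.
            while j < len(xri):
                if xri[j] == "(":
                    depth += 1
                elif xri[j] == ")":
                    depth -= 1
                    if depth == 0:
                        break
                j += 1
            i = j + 1
            # Also drop one "." separator that directly followed the tag.
            if i < len(xri) and xri[i] == ".":
                i += 1
        else:
            out.append(xri[i])
            i += 1
    return "".join(out)


def equivalent(a, b, normalize=False):
    """Step 3: compare code point by code point (Python str equality does this)."""
    a = strip_presentation_tags(to_text(a))
    b = strip_presentation_tags(to_text(b))
    if normalize:
        # Optional and hypothetical: fold e + combining accent into e-accent.
        a = unicodedata.normalize("NFC", a)
        b = unicodedata.normalize("NFC", b)
    return a == b
```

Under this sketch, `xri:($l/fr).($f/(+Arial)).word/foo` and `xri:word/foo` compare as equivalent, while a precomposed e-accent (U+00E9) and an e followed by a combining acute accent (U+0301) compare as different unless `normalize=True` is passed, which is exactly the extra burden on the resolver that the normalization discussion is about.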