xri message

Subject: RE: [xri] I18n and $ tags (on the $l and $f proposals)
From: "Sakimura, Nat" <n-sakimura@nri.co.jp>
To: "Wachob, Gabe" <gwachob@visa.com>, "Dave McAlpin" <dave.mcalpin@epokinc.com>, "Wachob, Gabe" <gwachob@visa.com>, <xri@lists.oasis-open.org>
Date: Fri, 25 Jul 2003 08:40:57 +0900
What you are probably missing is the distinction between a character and a glyph. 
 
A character is an absolute semantic code point which may have multiple representations as glyphs. 
Thus, even where the glyphs differ because of font or language, they were unified into one character. 
The spirit of the ISO standard dictates that we should compare and search on this "character" and not on the glyph. 
 
Examples: 
The font case is easy to depict and understand even in English. 
<font face="Times">center</font> and <font face="Arial">center</font> are semantically equivalent. 
 
The language case is rather more difficult, but this is my shot: 
<lang="en-US">center</lang> and <lang="en-GB">centre</lang> are semantically equivalent. 
 
Now, of course, "center" is not a single character in English, so the two are not equivalent in the ISO 10646 world, but the example depicts Han unification well. Indeed, "center" in this case is a single character in Chinese/Japanese/Korean, and <lang="ja">center</lang>, <lang="ko">center</lang>, etc. are given the same code point because they share the same roots and have the same semantics. 
 
On the other hand, e-accent can be encoded either as a single e-accent character or as an e character followed by an accent character. They share the same roots and are semantically equivalent, but they have two different code point representations! (Now you start to see how broken this standard is. This is the source of the complaints that many people have about Unicode.) 
 
Here normalization comes in: it states that a composed sequence that has a corresponding single character should be converted to the single-character representation (i.e., e character + accent character should be converted to the e-accent character). In this case, it looks easy enough. However, in other languages, the formation of a composed character may depend on the location and context in which it is placed. Thus, normalization becomes extremely difficult. Do we want to require the resolver to have such knowledge? My answer was no. That is too heavyweight. 
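
The e-accent case above can be reproduced in a few lines. This is only a minimal sketch, assuming Python and its standard unicodedata module; the helper names same_code_points and same_after_nfc are made up for illustration and are not part of any XRI proposal.

```python
import unicodedata

# The same "e-accent" written two ways.
COMPOSED = "\u00e9"      # one code point: LATIN SMALL LETTER E WITH ACUTE
DECOMPOSED = "e\u0301"   # two code points: "e" + COMBINING ACUTE ACCENT


def same_code_points(a: str, b: str) -> bool:
    """Plain code-point-by-code-point comparison, with no normalization."""
    return [ord(ch) for ch in a] == [ord(ch) for ch in b]


def same_after_nfc(a: str, b: str) -> bool:
    """Compare after converting both strings to the composed form (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(same_code_points(COMPOSED, DECOMPOSED))  # False
print(same_after_nfc(COMPOSED, DECOMPOSED))    # True
```

The second comparison only succeeds because the library carries the composition tables, which is the kind of knowledge a resolver would have to ship and keep up to date.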
 
Nat
 
 
	-----Original Message----- 
	From: Wachob, Gabe [mailto:gwachob@visa.com] 
	Sent: 2003/07/25 (Fri) 2:35 
	To: Sakimura, Nat; Dave McAlpin; Wachob, Gabe; xri@lists.oasis-open.org 
	Cc: 
	Subject: RE: [xri] I18n and $ tags (on the $l and $f proposals)
	
	
	I don't quite understand this proposal because if the font tags are important to the human understanding of an XRI, why are they insignificant for comparing two XRIs?
	 
	In other words, if fonts have to do with the semantics of the XRI, shouldn't they be part of the comparison? If they are purely presentational (ie no semantics), then why are they part of the XRI spec?
	 
	    -Gabe
		-----Original Message-----
		From: Sakimura, Nat [mailto:n-sakimura@nri.co.jp]
		Sent: Thursday, July 24, 2003 3:22 AM
		To: Dave McAlpin; Wachob, Gabe; xri@lists.oasis-open.org
		Subject: RE: [xri] I18n and $ tags (on the $l and $f proposals)
		
		
		That’s right. Now, I expect many European participants will oppose this very simplistic way of defining equivalence (partly due to the fact that it looks somewhat workable in their language space), but to me, this is good enough because I have no hope of defining ‘glyph based equivalence’ for every language on earth. If we really need complex equivalence, we should make use of an external “thesauri” service. It should not be in the core. 
		 
		Nat
		 
		-----Original Message-----
		From: Dave McAlpin [mailto:dave.mcalpin@epokinc.com] 
		Sent: Tuesday, July 22, 2003 11:51 PM
		To: Sakimura, Nat; Wachob, Gabe; xri@lists.oasis-open.org
		Subject: RE: [xri] I18n and $ tags (on the $l and $f proposals)
		 
		Thanks Nat, this is very helpful. So your recommendation for equivalence is that we should 1) convert to UTF-8 if necessary 2) remove any language related tags (including font and glyph selector, if included) and 3) perform a character by character (i.e. codepoint by codepoint) comparison. Is that right?
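
The three steps above could be sketched roughly as follows. This is an illustration only, assuming Python; the tag pattern is a guess based on the "($l/fr)" and "($f/(+Arial))" example quoted further down this thread, and the names strip_tags and xri_equivalent are hypothetical, not taken from any XRI draft.

```python
import re

# Hypothetical pattern for a "($l/...)" or "($f/...)" segment, allowing one
# nested cross-reference such as "(+Arial)" and an optional trailing ".".
_TAG = re.compile(r"\(\$[lf]/(?:\([^()]*\)|[^()]+)\)\.?")


def strip_tags(xri: str) -> str:
    """Step 2: remove language and font tags."""
    return _TAG.sub("", xri)


def xri_equivalent(a: bytes, b: bytes,
                   enc_a: str = "utf-8", enc_b: str = "utf-8") -> bool:
    # Step 1: bring both inputs to UTF-8, whatever encoding they arrived in.
    utf8_a = a.decode(enc_a).encode("utf-8")
    utf8_b = b.decode(enc_b).encode("utf-8")
    # Step 2: drop the presentation-related tags.
    text_a = strip_tags(utf8_a.decode("utf-8"))
    text_b = strip_tags(utf8_b.decode("utf-8"))
    # Step 3: code point by code point comparison, with no normalization.
    return [ord(c) for c in text_a] == [ord(c) for c in text_b]


tagged = "xri:($l/fr).($f/(+Arial)).word/foo".encode("utf-8")
plain = "xri:word/foo".encode("utf-8")
print(xri_equivalent(tagged, plain))  # True
```

Because no normalization is applied, a composed "é" and a decomposed "e" plus combining accent would still compare as different under this rule.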
		 
		Dave
			-----Original Message-----
			From: Sakimura, Nat [mailto:n-sakimura@nri.co.jp]
			Sent: Monday, July 21, 2003 11:15 PM
			To: Wachob, Gabe; xri@lists.oasis-open.org
			Subject: RE: [xri] I18n and $ tags (on the $l and $f proposals)
			The reason $f came up alongside $l is that, to represent the octet stream in a human-readable fashion, Unicode (and hence ISO 10646) requires the following information: 
			 
			1. Actual octet stream
			2. Language
			3. Glyph selector
			4. Font
			 
			I know, this sounds like a sick joke, but this is the reality. 
			(That’s why I was grumbling earlier that I wish we had DIS 10646 ver. 1 as the ISO standard.)
			 
			I believe we can go pretty far with only 1 and 2, but I do not want to pretend that I know the problems other languages will encounter, so it is better to leave some room to preserve the full original information. You are right, it might not be used very often in real life, but some people may need it. So why should we remove it? 
			 
			As far as equivalence is concerned, I believe that we should be comparing either the actual octet stream itself or the terminal outcome of the resolution. Equivalence is another huge topic involving normalization, which may be even harder than multilingualization itself. I would even go further and say that equivalence should not deal with normalization at all, but that might be a little too extreme. I suspect normalization will be a nightmare for implementers, because one has to have a mapping between the composed and decomposed forms and, of course, one needs the language and context information for this to work. Also, when a new composed form is added, one has to add it to the mappings. It sounds too difficult to me. 
			 
			Nat
			 
			-----Original Message-----
			From: Wachob, Gabe [mailto:gwachob@visa.com] 
			Sent: Saturday, July 19, 2003 4:57 AM
			To: xri@lists.oasis-open.org
			Subject: RE: [xri] I18n and $ tags (on the $l and $f proposals)
			 
			 
			While I really think these proposals *could* be useful, I think they would be used (especially the $f one) in a relatively limited set of situations (i.e. those where the XRIs are presented to humans).
			
			That's a provocative statement I've just made. Some folks have in their minds that most (many?) XRIs will be presented to humans. Some folks (me included) believe most won't.
			
			What I truly believe is that for some applications of XRIs, a large proportion will be presented to humans, and for other applications, they won't be presented to humans. Of course, we see this sort of flexibility as a strength. But this sort of flexibility is also the source of tension when deciding when to include or exclude features.
			
			I sound like a broken record, but I want to make sure that we are addressing a *real* need and that the solution doesn't create more complexity than it tries to eliminate.
			For example, the $f/(+Arial) proposal looks good on the surface but there are several complicating factors:
			1) You probably don't want a top-level +<font-name> entry, because I could easily see a font name conflicting with another use of the same term. There are a ton of fanciful font names, and I could easily see +Modern being ambiguous as a font name or something else. So we'd end up with +font/Modern, which would appear as $f/(+font/Modern). 
			2) Look how complicated the XRIs get... Even if you assume the font information is inserted by the UIs (and not presented to the user), this seems to complicate equivalence rules... 
			3) It seems that no matter what the structure is for font names, someone is going to have to manage a list of font names. Fonts are subject to intellectual property rights (at least in some places) and this tends to mean that there is no central registry of font names that everyone agrees on and is managed. Fonts are considered "property" which is licensed (though there are "public domain" ones). This is not a problem directly, but leads (I believe) to a situation where the universe of fonts is rather scattered and hard to survey properly. Certainly not something we want to do anyway. Use of the +font namespace seems appropriate. 
			So, we need to be very clear about the problems we are solving using this $f mechanism, because if the benefits don't outweigh the complexity, we shouldn't do it. 
			What's the use case? How is this driven by internationalization concerns? If so, can we be more specific about the disambiguation we are trying to address? Without having the background of i18n, it strikes me as *really* odd to specify presentation information in the identifier --  I know others will have the same response.
			Outside of $f (to which I am specifically pushing back), I agree with Geoffrey that using + cross references under other $ names (language, version syntax, etc) is a Good Thing. They allow a great deal of flexibility at the cost of human readability/usability (which is a fine compromise for me, in the use cases I am biased towards). 
			    -Gabe
			
			
			
			> -----Original Message-----
			> From: geoffrey.strongin@amd.com [mailto:geoffrey.strongin@amd.com]
			> Sent: Monday, July 14, 2003 8:42 AM
			> To: xri@lists.oasis-open.org
			> Subject: RE: [xri] I18n and $ tags
			>
			>
			> I like this.  It really leverages the power of the + namespace.
			>
			> Geoffrey
			>
			> > -----Original Message-----
			> > From: Drummond Reed [mailto:drummond.reed@onename.com]
			> > Sent: Friday, July 11, 2003 11:58 PM
			> > To: Dave McAlpin; xri@lists.oasis-open.org
			> > Subject: RE: [xri] I18n and $ tags
			> >
			> >
			> > -----Original Message-----
			> > From: Dave McAlpin [mailto:dave.mcalpin@epokinc.com]
			> > Sent: Friday, July 11, 2003 3:57 PM
			> > To: xri@lists.oasis-open.org
			> > Subject: [xri] I18n and $ tags
			> >
			> > I assume internationalization does not apply to the $ tags. For
			> > example, there's no internationalized version of $v. Is this
			> > correct? Is this ok?
			> >
			> > Dave
			> >
			> > *****Drummond replies*****
			> >
			> > I think it's not only correct, but also a good thing. There should be
			> > no need to internationalize the $ space for the following reason:
			> > IMHO, the purpose of the $ space is to provide a mechanism for
			> > extending the very limited set of reserved chars in 2396 (which we've
			> > already had to bust out of in order to add support for xrefs and
			> > sub-segments) in order to have sufficient metadata (and
			> > extensibility) to describe identifiers in ways that are vital to the
			> > act of identification, i.e., language, font, version syntax, query
			> > syntax, resolvability, human-readable comment, etc.
			> >
			> > For this reason, I propose that in Appendix B we state a formal
			> > requirement that the vocabulary in the $ identifier space (note that
			> > I don't call it a namespace for the reasons I'm about to argue) be as
			> > terse as possible, not just to enforce compactness, but to reinforce
			> > that it is an extension of the reserved-symbol-space and not intended
			> > to carry linguistic-level semantics.
			> >
			> > For example, the $l (language) space should, as Nat proposed, use the
			> > two-letter codes for languages specified in ISO standard 639
			> > referenced in RFC 1766. It should NOT use full-length equivalents.
			> >
			> > The proposed $f (font) space for font names would violate this rule
			> > if it used full-length English font names. (Furthermore, if we did
			> > that, it would beg for internationalization). To avoid both problems,
			> > we should try to find a compact font name abbreviation registry that
			> > we can reference, similar to ISO 639 for language abbreviations.
			> >
			> > If we can't find one, and we don't want to create one (at least I
			> > don't want to), there is another solution - one that applies nicely
			> > to any $ space. In place of an exact, rigorously specified
			> > vocabulary, every $ space can also cross-reference common names in
			> > the + space. Here's an example of how that would work for a font
			> > name:
			> >
			> >     xri:($l/fr).($f/(+Arial)).french-word-in-Arial-font/foo
			> >
			> > Rather than using "($f/Arial)", which would mean "Arial" was formally
			> > registered in the "$f" space, the segment "($f/(+Arial))" simply
			> > means "Arial" is a common name in the context of a font. I'm not a
			> > font expert, but I'd be willing to guess that a large percentage of
			> > typographic software would recognize that common name for a font.
			> > Furthermore, the xri above would also tell the XRI parser that the
			> > common name "Arial" should be interpreted not just in the context of
			> > being a font, but specifically being a French name for a font. That
			> > should reduce the chance of misinterpretation even further.
			> >
			> > Use of the + space for real-world common names for metadata like
			> > fonts means there is an easy way to apply the 80/20 rule, while
			> > leaving it open for the $f space to reference a more exhaustive and
			> > non-ambiguous font name abbreviation registry later.
			> >
			> > Again, I think this rule should be applied across the board to all $
			> > spaces, including language, font, version syntax, query syntax, etc.
			> >
			> > =Drummond
			> >
			> >
			> >
			> >
			> > You may leave a Technical Committee at any time by visiting
			> > http://www.oasis-open.org/apps/org/workgroup/xri/members/leave_workgroup.php
			
			
			