cgmo-webcgm message

Subject: Re: [cgmo-webcgm] implications of URI vs. IRI

From: Lofton Henderson <lofton@rockynet.com>
To: Benoit Bezaire <benoit@itedo.com>,cgmo-webcgm@lists.oasis-open.org
Date: Tue, 04 Oct 2005 17:06:04 -0600

At 05:09 PM 10/4/2005 -0400, Benoit Bezaire wrote:
>Hi Lofton,
>
>I just did a quick search... I think that URI is only restricting
>characters to US-ASCII; it has no control on the encoding (utf-8,
>utf-16 etc...).
>
>In XML syntax such as XHTML and SVG, files can have just about any
>encoding; I'm not aware of any special processing for the xlink:href
>attribute (i.e., this is a URI, change the encoding to _blah_). It
>wouldn't make any sense. The scope of the encoding is for the complete
>document.
>
>The above is not a fact, only my understanding.

It matches my understanding.  And it is clear that XML and/or URI (rfc3986) 
require "URI escaping" for non-ASCII characters in URIs, i.e., for 
characters that are outside of the ASCII repertoire.  And this is 
independent of the character-set encoding of the URI.

So finally, a URI from HTML into CGM containing a reference-by-name to "my 
object group" would be written like this:

<a href="http://example.org/myCGM.cgm#name(my%20object%20group)">blah</a>

and a WebCGM 'linkuri' first parameter would be this:

http://example.org/myCGM.cgm#name(my%20object%20group)
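
For anyone who wants to reproduce this, here is a minimal Python sketch of the escaping step, assuming the standard library's urllib.parse.quote is an acceptable stand-in for "URI escaping". The helper name name_fragment is hypothetical and only for illustration. The loop at the end illustrates that the escaped URI uses only ASCII characters, so it survives a round trip through any of the listed character encodings.

```python
from urllib.parse import quote

def name_fragment(object_name: str) -> str:
    """Build a name(...) fragment with URI escaping applied (hypothetical helper)."""
    # quote() percent-encodes the UTF-8 bytes of every character outside its
    # safe set; safe="" means "/" is escaped as well.
    return "name(" + quote(object_name, safe="") + ")"

base = "http://example.org/myCGM.cgm"  # example URL from this message
link_uri = base + "#" + name_fragment("my object group")
print(link_uri)  # http://example.org/myCGM.cgm#name(my%20object%20group)

# The escaped URI is restricted to the ASCII repertoire, so the same
# characters can be stored in any of these encodings and read back unchanged.
for encoding in ("ascii", "latin-1", "utf-8", "utf-16"):
    assert link_uri.encode(encoding).decode(encoding) == link_uri
```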

-Lofton.


>Tuesday, September 20, 2005, 2:45:48 PM, Lofton wrote:
>
>LH> All --
>
>LH> When I was putting together first unicode tests, Dieter also supplied me
>LH> with this nifty "advanced" test.  It gets into Japanese text for SF text
>LH> like APS ids and names.
>
>LH> It highlights an interesting implication of our decision to stick with URI
>LH> instead of switching to IRI.  URI encoding requires that any non-ASCII
>LH> characters are included by the "URI escaping mechanism", see WebCGM 3.1.1.4
>LH> [1], and the more detailed XML description [2].  Basically, get the
>LH> **UTF8** representation of the characters, and replace each byte in that
>LH> representation by the 3-character string %HH, where HH is the hex
>LH> representation of the byte.
>
>LH> So consider, for example, the 2-character id of the object in the
>LH> upper-left box, and its use in a link from the object in the
>LH> upper-right box.
>
>LH> If that id were the two characters c1c2, let's suppose that it could be
>LH> represented by the 4 utf8 bytes b1b2b3b4 (I'm just guessing about "4";
>LH> since UTF8 is variable length, it could be more).  Then to put that id
>LH> into a URI string, it would have to be the 12-character string:
>
>LH> %hh%hh%hh%hh
>
>LH> where the hh are the 4 pairs of hex digits that represent the 4
>LH> utf8 bytes. I.e., the CGM URI for the link would be:
>
>LH> #id(%hh%hh%hh%hh, view_context)
>
>LH> Side question.  Does URI (rfc3986 [3]) restrict only the character
>LH> repertoire of the URI, or does it restrict also the encoding? I.e., can a
>LH> URI be encoded in ascii, isoLatin1, or utf8, or utf16, or whatever, as
>LH> long as it restricts its repertoire to the URI repertoire?  I suspect
>LH> "yes", but I don't know the answer.  It would be interesting for someone
>LH> to research it.
>
>LH> Thoughts?
>
>LH> Regards,
>LH> -Lofton.
>
>LH> [1]
>LH> http://docs.oasis-open.org/webcgm/v2.0/WebCGM20-IC.html#webcgm_3_1_1_4
>LH> [2] http://www.w3.org/TR/2004/REC-xml-20040204/#sec-external-ent
>LH> [3] URI:  http://www.ietf.org/rfc/rfc3986.txt
>LH> [4] IRI:  http://www.ietf.org/rfc/rfc3987.txt
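
As a rough illustration of the escaping procedure described in the quoted message (take the UTF-8 bytes of each non-ASCII character and write each byte as %HH), here is a minimal Python sketch. The function name uri_escape_non_ascii and the two-character id are hypothetical, chosen only for illustration; this is not the id used in the test file. The sketch leaves ASCII characters untouched, so spaces and reserved characters would still need separate handling. Note that common Japanese characters occupy 3 bytes each in UTF-8, so a two-character id of this kind yields 6 escaped bytes (18 characters) rather than 4.

```python
from urllib.parse import unquote

def uri_escape_non_ascii(text: str) -> str:
    """Replace each non-ASCII character by the %HH form of its UTF-8 bytes."""
    parts = []
    for ch in text:
        if ord(ch) < 128:
            # ASCII characters are copied through unchanged in this sketch.
            parts.append(ch)
        else:
            # One %HH triplet per byte of the character's UTF-8 encoding.
            parts.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
    return "".join(parts)

object_id = "\u56f3\u5f62"  # hypothetical two-character Japanese id
escaped = uri_escape_non_ascii(object_id)
print(escaped)       # %E5%9B%B3%E5%BD%A2
print(len(escaped))  # 18

# Only the first parameter of the id(...) fragment is shown here.
fragment = "#id(" + escaped + ")"

# Decoding the escapes as UTF-8 recovers the original id.
assert unquote(escaped, encoding="utf-8") == object_id
```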

Follow-Ups:
- RE: [cgmo-webcgm] implications of URI vs. IRI
  - From: Dieter Weidenbrück <dieter@itedo.com>

References:
- Re: [cgmo-webcgm] implications of URI vs. IRI
  - From: Benoit Bezaire <benoit@itedo.com>