Subject: Re: [cgmo-webcgm] implications of URI vs. IRI
At 05:09 PM 10/4/2005 -0400, Benoit Bezaire wrote:
>Hi Lofton,
>
>I just did a quick search... I think that URI is only restricting
>characters to US-ASCII; it has no control on the encoding (utf-8,
>utf-16 etc...).
>
>In XML syntax such as XHTML and SVG, files can have just about any
>encoding; I'm not aware of any special processing for the xlink:href
>attribute (i.e., this is a URI, change the encoding to _blah_). It
>wouldn't make any sense. The scope of the encoding is for the complete
>document.
>
>The above is not a fact, only my understanding.

It matches my understanding. And it is clear that XML and/or URI
(rfc3986) require "URI escaping" for non-ASCII characters in URIs, i.e.,
for characters that are outside of the ASCII repertoire. And this is
independent of the character-set encoding of the URI.

So finally, a URI from HTML into CGM containing a reference-by-name to
"my object group" would be written like this:

<a href="http://example.org/myCGM.cgm#name(my%20object%20group)">blah</a>

and a WebCGM 'linkuri' first parameter would be this:

http://example.org/myCGM.cgm#name(my%20object%20group)

-Lofton.

>Tuesday, September 20, 2005, 2:45:48 PM, Lofton wrote:
>
>LH> All --
>
>LH> When I was putting together first unicode tests, Dieter also supplied me
>LH> with this nifty "advanced" test. It gets into Japanese text for SF text
>LH> like APS ids and names.
>
>LH> It highlights an interesting implication of our decision to stick with URI
>LH> instead of switching to IRI. URI encoding requires that any non-ASCII
>LH> characters are included by the "URI escaping mechanism", see WebCGM
>LH> 3.1.1.4 [1], and the more detailed XML description [2]. Basically, get the
>LH> **UTF8** representation of the characters, and replace each byte in that
>LH> representation by the 3-character string %HH, where HH is the hex
>LH> representation of the byte.
>
>LH> So consider, for example, the 2-character id of the object in the
>LH> upper-left box, and its use in a link from the object in the
>LH> upper-right box.
>
>LH> If that id were the two characters c1c2, let's suppose that it could be
>LH> represented by the 4 utf8 bytes b1b2b3b4 (I'm just guessing about "4";
>LH> since UTF8 is variable length, it could be more). Then to put that id
>LH> into a URI string, it would have to be the 12-character string:
>
>LH> %hh%hh%hh%hh
>
>LH> where the hh are the 4 pairs of hex digits that represent the 4
>LH> utf8 bytes. I.e., the CGM URI for the link would be:
>
>LH> #id(%hh%hh%hh%hh, view_context)
>
>LH> Side question. Does URI (rfc3986 [3]) restrict only the character
>LH> repertoire of the URI, or does it restrict also the encoding? I.e., can a
>LH> URI be encoded in ascii, isoLatin1, or utf8, or utf16, or whatever, as
>LH> long as it restricts its repertoire to the URI repertoire? I suspect
>LH> "yes", but I don't know the answer. It would be interesting for someone
>LH> to research it.
>
>LH> Thoughts?
>
>LH> Regards,
>LH> -Lofton.
>
>LH> [1] http://docs.oasis-open.org/webcgm/v2.0/WebCGM20-IC.html#webcgm_3_1_1_4
>LH> [2] http://www.w3.org/TR/2004/REC-xml-20040204/#sec-external-ent
>LH> [3] URI: http://www.ietf.org/rfc/rfc3986.txt
>LH> [4] IRI: http://www.ietf.org/rfc/rfc3987.txt
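
For anyone who wants to try the escaping rule discussed in this thread (take the UTF-8 bytes of the string and write each byte as %HH), here is a minimal Python sketch. It is only an illustration, not anything defined by WebCGM: the helper names uri_escape and webcgm_fragment and the two-character sample id are assumptions made up for the example. The only library call is urllib.parse.quote from the Python standard library, which leaves unreserved ASCII characters (letters, digits, "-", ".", "_", "~") as they are.

```python
from urllib.parse import quote, unquote


def uri_escape(text: str) -> str:
    """Percent-escape text: every UTF-8 byte outside the unreserved
    ASCII set becomes the 3-character string %HH."""
    # safe="" means no reserved character (such as "/" or space) is kept as-is.
    return quote(text, safe="", encoding="utf-8")


def webcgm_fragment(kind: str, value: str) -> str:
    """Build a fragment such as #name(...) or #id(...) (hypothetical helper)."""
    return f"#{kind}({uri_escape(value)})"


# The ASCII example from this message: only the spaces need escaping.
print(webcgm_fragment("name", "my object group"))
# prints: #name(my%20object%20group)

# A hypothetical 2-character Japanese id (U+56F3 U+9762).
sample_id = "\u56f3\u9762"
escaped = uri_escape(sample_id)
print(len(sample_id.encode("utf-8")), escaped)
# prints: 6 %E5%9B%B3%E9%9D%A2

# Unescaping with the same encoding recovers the original id.
assert unquote(escaped, encoding="utf-8") == sample_id
```

For this particular sample the two characters take 6 UTF-8 bytes, so the escaped id is 18 characters long rather than the 12 guessed at in the quoted message; the count depends on the characters, since UTF-8 is variable length.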