Subject: Re: [cgmo-webcgm] implications of URI vs. IRI
Hi Lofton,

I just did a quick search... I think that URI only restricts the characters to US-ASCII; it has no control over the encoding (UTF-8, UTF-16, etc.). In XML syntaxes such as XHTML and SVG, files can have just about any encoding; I'm not aware of any special processing for the xlink:href attribute (i.e., "this is a URI, change the encoding to _blah_"). It wouldn't make any sense. The scope of the encoding is the complete document.

The above is not a fact, only my understanding.

--
Benoit   mailto:benoit@itedo.com

Tuesday, September 20, 2005, 2:45:48 PM, Lofton wrote:

LH> All --

LH> When I was putting together the first Unicode tests, Dieter also supplied me
LH> with this nifty "advanced" test. It gets into Japanese text for SF text
LH> like APS ids and names.

LH> It highlights an interesting implication of our decision to stick with URI
LH> instead of switching to IRI. URI encoding requires that any non-ASCII
LH> characters are included by the "URI escaping mechanism", see WebCGM 3.1.1.4
LH> [1], and the more detailed XML description [2]. Basically, get the
LH> **UTF-8** representation of the characters, and replace each byte in that
LH> representation by the 3-character string %HH, where HH is the hex
LH> representation of the byte.

LH> So consider, for example, the 2-character id of the object in the
LH> upper-left box, and its use in a link from the object in the upper-right box.
LH> If that id were the two characters c1c2, let's suppose that it could be
LH> represented by the 4 UTF-8 bytes b1b2b3b4 (I'm just guessing about "4";
LH> since UTF-8 is variable length, it could be more). Then to put that id into
LH> a URI string, it would have to be the 12-character string:

LH> %hh%hh%hh%hh

LH> where the hh are the 4 pairs of hex digits that represent the 4
LH> UTF-8 bytes. I.e., the CGM URI for the link would be:

LH> #id(%hh%hh%hh%hh, view_context)

LH> Side question. Does URI (RFC 3986 [3]) restrict only the character
LH> repertoire of the URI, or does it also restrict the encoding? I.e., can a
LH> URI be encoded in ASCII, ISO Latin-1, UTF-8, UTF-16, or whatever, as long
LH> as it restricts its repertoire to the URI repertoire? I suspect "yes", but
LH> I don't know the answer. It would be interesting for someone to research it.

LH> Thoughts?

LH> Regards,
LH> -Lofton.

LH> [1] http://docs.oasis-open.org/webcgm/v2.0/WebCGM20-IC.html#webcgm_3_1_1_4
LH> [2] http://www.w3.org/TR/2004/REC-xml-20040204/#sec-external-ent
LH> [3] URI: http://www.ietf.org/rfc/rfc3986.txt
LH> [4] IRI: http://www.ietf.org/rfc/rfc3987.txt
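
For illustration, the escaping mechanism described in the quoted message (take the UTF-8 bytes of each non-ASCII character and write each byte as %HH) can be sketched in Python. This is only a sketch under stated assumptions: the function name escape_non_ascii and the sample id are hypothetical and are not taken from the test file, and view_context is just the placeholder used in the message. ASCII characters are passed through unchanged here, which is a simplification, since a real URI would also need reserved characters handled.

```python
from urllib.parse import unquote


def escape_non_ascii(text: str) -> str:
    """Replace each non-ASCII character by the %HH escapes of its UTF-8 bytes."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            # Simplification: ASCII characters are copied through unchanged.
            out.append(ch)
        else:
            # One %HH triple per byte of the character's UTF-8 representation.
            out.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
    return "".join(out)


# Hypothetical 2-character Japanese id (an assumption, not the id in the test).
object_id = "\u65e5\u672c"
escaped = escape_non_ascii(object_id)

# Fragment in the form shown in the message; view_context is a placeholder.
fragment = f"#id({escaped}, view_context)"

# The escaped form contains only ASCII characters, so the same character
# sequence can be serialized in whatever encoding the document uses.
as_ascii = fragment.encode("ascii")
as_utf16 = fragment.encode("utf-16")

print(escaped)
print(fragment)
print(unquote(escaped) == object_id)  # decoding the escapes as UTF-8 restores the id
```

For this particular sample id, each character takes 3 bytes in UTF-8, so the escaped id is 18 characters long rather than 12, which matches the remark that the byte count "could be more" than 4.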