
cgmo-webcgm message



Subject: Re: [cgmo-webcgm] implications of URI vs. IRI


Hi Lofton,

I just did a quick search. I think a URI only restricts its characters
to US-ASCII; it has no control over the encoding (utf-8, utf-16,
etc.).

In XML syntaxes such as XHTML and SVG, files can have just about any
encoding; I'm not aware of any special processing for the xlink:href
attribute (i.e., "this is a URI, change the encoding to _blah_"). It
wouldn't make any sense. The encoding applies to the complete
document.

The above is not a fact, only my understanding.
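
To illustrate the distinction between character repertoire and
encoding, here is a small Python sketch. The URI string and the helper
name serialize are made up for illustration. It only shows that the
same ASCII-repertoire string can be stored under several document
encodings and read back unchanged; it does not settle what RFC 3986
itself requires.

```python
def serialize(uri: str, encoding: str) -> bytes:
    """Store a URI string using the given document encoding (hypothetical helper)."""
    return uri.encode(encoding)


# Hypothetical already-escaped URI: every character is in the US-ASCII repertoire.
uri = "#id(%E6%97%A5%E6%9C%AC,view_context)"

for encoding in ("ascii", "latin-1", "utf-8", "utf-16"):
    raw = serialize(uri, encoding)
    # The stored bytes depend on the encoding, but the characters survive intact.
    assert raw.decode(encoding) == uri
    print(f"{encoding:8s} {len(raw):3d} bytes")
```

The ascii, latin-1 and utf-8 serializations are byte-for-byte identical
for such a string; only utf-16 differs, since it uses two bytes per
character.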

-- 
 Benoit   mailto:benoit@itedo.com


Tuesday, September 20, 2005, 2:45:48 PM, Lofton wrote:

LH> All --

LH> When I was putting together the first Unicode tests, Dieter also supplied me
LH> with this nifty "advanced" test.  It gets into Japanese text for SF text
LH> like APS ids and names.

LH> It highlights an interesting implication of our decision to stick with URI
LH> instead of switching to IRI.  URI encoding requires that any non-ASCII
LH> characters are included by the "URI escaping mechanism", see WebCGM 3.1.1.4
LH> [1], and the more detailed XML description [2].  Basically, get the
LH> **UTF8** representation of the characters, and replace each byte in that
LH> representation by the 3-character string %HH, where HH is the hex 
LH> representation of the byte.

LH> So consider, for example, the 2-character id of the object in the
LH> upper-left box, and its use in a link from the object in the upper-right box.

LH> If that id were the two characters c1c2, let's suppose that it could be
LH> represented by the 4 utf8 bytes b1b2b3b4 (I'm just guessing about "4";
LH> since UTF8 is variable length, it could be more).  Then to put that id into
LH> a URI string, it would have to be the 12-character string:

LH> %hh%hh%hh%hh

LH> where the hh are the 4 pairs of hex digits that represent the 4
LH> utf8 bytes. I.e., the CGM URI for the link would be:

LH> #id(%hh%hh%hh%hh, view_context)
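
As a rough illustration of the escaping step described above, here is a
minimal Python sketch. The id used (the two characters U+65E5 U+672C)
is a stand-in, not the id from Dieter's test file, and the helper name
escape_id is made up for this example. For this particular pair, each
character takes 3 UTF-8 bytes, so the escaped id is 18 characters
rather than 12, which fits the "it could be more" caveat.

```python
from urllib.parse import unquote


def escape_id(object_id: str) -> str:
    """Replace each non-ASCII character by %HH for every byte of its UTF-8 form."""
    parts = []
    for ch in object_id:
        if ord(ch) < 128:
            parts.append(ch)  # ASCII characters are left as they are in this sketch
        else:
            parts.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
    return "".join(parts)


object_id = "\u65e5\u672c"  # hypothetical two-character Japanese id
escaped = escape_id(object_id)

# Same shape as the link in the message, with the escaped id filled in.
link = f"#id({escaped}, view_context)"

print(len(object_id.encode("utf-8")))  # number of UTF-8 bytes in the id
print(escaped)
print(link)

# Unescaping with UTF-8 recovers the original id.
assert unquote(escaped) == object_id
```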

LH> Side question.  Does URI (rfc3986 [3]) restrict only the character
LH> repertoire of the URI, or does it restrict also the encoding? I.e., can a
LH> URI be encoded in ascii, isoLatin1, or utf8, or utf16, or whatever, as long
LH> as it restricts its repertoire to the URI repertoire?  I suspect "yes", but
LH> I don't know the answer.  It would be interesting for someone to research it.

LH> Thoughts?

LH> Regards,
LH> -Lofton.

LH> [1]
LH> http://docs.oasis-open.org/webcgm/v2.0/WebCGM20-IC.html#webcgm_3_1_1_4
LH> [2] http://www.w3.org/TR/2004/REC-xml-20040204/#sec-external-ent
LH> [3] URI:  http://www.ietf.org/rfc/rfc3986.txt
LH> [4] IRI:  http://www.ietf.org/rfc/rfc3987.txt




