cgmo-webcgm message

Subject: Re[4]: [cgmo-webcgm] implications of URI vs. IRI
From: Benoit Bezaire <benoit@itedo.com>
To: cgmo-webcgm@lists.oasis-open.org
Date: Tue, 11 Oct 2005 11:07:27 -0400
Hi Dieter,

Thanks for the example, we are talking about the same thing.

I understand that ATA and WebCGM have allowed spaces in URI fragments
for the last 10 years, but by my reading of RFC 2396 those
linkuris are illegal. Here is a quote from Section 4.1 of
http://www.ietf.org/rfc/rfc2396.txt:
"The character restrictions described in Section 2 for URI also apply
to the fragment in a URI-reference."

And Section 2 (see 2.4.3, "Excluded US-ASCII Characters") rules out
unescaped spaces.

That being said, your interpretation of the SVG wording sounds
acceptable. The sentence 'or must result in a URI reference after the
escaping procedure' seems to be saving us! I'm in favor of adding
wording to the spec to clarify this issue (the 3 bullet wording would
be good also).

I no longer have a preference on whether we should deprecate. On one
hand, I think that this is a can of worms and forcing escaping
simplifies things; on the other, I agree that long %HH sequences for
Asian names are not ideal.
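
To put a rough number on that cost, here is a small Python sketch (my own
illustration, not something taken from the WebCGM or SVG text) that escapes
the two-character name from Dieter's example. It assumes UTF-8 with no
byte-order mark, so it comes out shorter than a count that includes the
EF BB BF prefix.

```python
from urllib.parse import quote

name = "日本"  # the two-character id from the example quoted below

utf8_bytes = name.encode("utf-8")   # Python emits no byte-order mark here
escaped = quote(name, safe="")      # one %HH triplet per UTF-8 byte

print(len(utf8_bytes))                     # number of UTF-8 bytes
print(escaped, len(escaped))               # escaped form and its length
print(len(escaped.encode("utf-16-be")))    # size if the escaped form is stored as UTF-16
```

Without the byte-order mark this gives 6 UTF-8 bytes, 18 escaped characters,
and 36 bytes when the escaped string is itself stored as UTF-16.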

Allowing both is probably the least painful approach for users and
implementers at this time.
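
A minimal sketch of what "allowing both" could look like on the
implementation side, assuming the stored link string is escaped only when it
is handed to a URI resolver. The function name to_resolver_form and the set
of characters left untouched are my assumptions, not spec wording. Existing
%HH sequences are passed through unchanged, so a stray literal "%" in a name
would need separate handling.

```python
from urllib.parse import quote

# Characters left as-is: the RFC 2396 reserved and mark characters, plus "#"
# (fragment delimiter) and "%" (so existing %HH escapes are not escaped again).
# Letters and digits are always left alone by quote().
KEEP = ";/?:@&=+$,-_.!~*'()#%"

def to_resolver_form(link: str) -> str:
    """Escape a stored link string; hypothetical helper, not a WebCGM API."""
    return quote(link, safe=KEEP, encoding="utf-8")

raw = "http://www.cgmopen.org/abc.cgm#name(my name with blank)"
mixed = "http://www.cgmopen.org/abc.cgm#name(my name%20with%20blank)"

print(to_resolver_form(raw))
print(to_resolver_form(mixed))
print(to_resolver_form("#id(日本)"))
```

Both spellings of the link end up as the same escaped string, and running the
helper on an already escaped string leaves it unchanged.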

Regards,

-- 
 Benoit   mailto:benoit@itedo.com


Tuesday, October 11, 2005, 10:15:06 AM, Dieter wrote:

DW> Hi Benoit,

DW> see inline

>> -----Original Message-----
>> From: Benoit Bezaire [mailto:benoit@itedo.com] 
>> Sent: Tuesday, October 11, 2005 3:48 PM
>> To: cgmo-webcgm@lists.oasis-open.org
>> Subject: Re[2]: [cgmo-webcgm] implications of URI vs. IRI
>> 
>> Hi Dieter,
>> 
>> You said:
>> NOTE: If we required an escaped string inside the CGM now, 
>> this will make almost all existing files invalid ones as soon 
>> as a simple space is in a name attribute.
>> 
>> You are talking about the 'name' attribute within a URI only, correct?
>> Or, let me rephrase...
>> Files which have a name attribute (containing a space) that 
>> is used in a URI become invalid, right?
DW> I am referring to the link destination parameter of a linkuri attribute.
DW> Yes, something like (pseudo-code)

DW> linkuri "http://www.cgmopen.org/abc.cgm#name(my name with blank)" "some
DW> title" "_blank"

DW> would become illegal, and this is the form (without escaping) that has been
DW> used forever in the ATA and WebCGM environment (almost 10 years now).
 
>> 
>> I would be in favor of deprecating (i.e., authors should stop 
>> creating such files) the old behavior (no escaping) and 
>> adding 'a la' SVG wording to the spec. Like Dieter says, but 
>> with an emphasis on deprecating the old behavior.
DW> The way I understand the SVG wording is that both forms would be legal:

DW> http://www.cgmopen.org/abc.cgm#name(my name with blank)
DW> http://www.cgmopen.org/abc.cgm#name(my name%20with%20blank)

DW> I would NOT deprecate the first form, because it would force us to build
DW> long strings for japanese or similar characters, following the rules as
DW> described below.

DW> Do you read the SVG spec the same way, or am I wrong?

DW> Regards,
DW> Dieter

>> 
>> -- 
>>  Benoit   mailto:benoit@itedo.com
>> 
>>  
>> Thursday, October 6, 2005, 7:52:43 AM, Dieter wrote:
>> 
>> DW> All,
>> 
>> DW> I am not yet convinced that we are heading in the right 
>> direction here.
>> 
>> DW> Example:
>> DW> Let's assume we have the string "nihon" inside a linkUri: "id(日本)"
>> 
>> DW> using UTF-16 (big endian) this is: 65 e5 67 2c (4 Bytes) 
>> converted 
>> DW> to UTF-8: EF BB BF E6 97 A5 E6 9C AC (9 Bytes)
>> 
>> DW> and then you can apply escaping for all non-ascii chars
>> 
>> DW> %EF%BB%BF%E6%97%A5%E6%9C%AC (27 Bytes)
>> 
>> DW> and now we store it into the linkURI attribute, however, since 
>> DW> somewhere else in the file we have this string in japanese 
>> DW> characters as an ID, all non-graphical strings will be stored as
>> DW> UTF-16 (could be
>> DW> UTF-8 as well):
>> 
>> DW> I save the writing, you end up with 54 bytes.
>> 
>> DW> So we are moving from 4 bytes to 54 bytes.
>> 
>> DW> I hope that this accurately describes the procedure that has been
>> DW> discussed over the past couple of days.
>> 
>> DW> Comparison to SVG:
>> DW> In 5.3.2. [1], SVG says the following:
>> 
>> DW> "The value of the href attribute must be a URI reference 
>> as defined 
>> DW> in [RFC2396], or must result in a URI reference after the 
>> escaping 
>> DW> procedure described below is applied. The procedure is 
>> applied when 
>> DW> passing the URI reference to a URI resolver."
>> 
>> DW> Interesting to see the last sentence here. IMO this means, it is
>> DW> perfectly legal to store the URI reference using any encoding, as
>> DW> long as it will be transcoded to UTF-8 and escaped before 
>> passing it on to a URI resolver.
>> 
>> DW> This has always been my understanding, and this is how all of our
>> DW> products have been handling references.
>> 
>> DW> NOTE:
>> DW> If we required an escaped string inside the CGM now, this 
>> will make 
>> DW> almost all existing files invalid ones as soon as a 
>> simple space is 
>> DW> in a name attribute.
>> 
>> DW> RECOMMENDATION:
>> DW> Amend wording slightly to match what SVG is doing and allow for
>> DW> both styles, escaped and not escaped.
>> 
>> DW> Comments?
>> 
>> DW> Regards,
>> DW> Dieter
>> 
>> 
>> DW> [1] http://www.w3.org/TR/SVG11/struct.html#xlinkRefAttrs
>> 
>> 
>> >> -----Original Message-----
>> >> From: Lofton Henderson [mailto:lofton@rockynet.com]
>> >> Sent: Wednesday, October 05, 2005 1:06 AM
>> >> To: Benoit Bezaire; cgmo-webcgm@lists.oasis-open.org
>> >> Subject: Re: [cgmo-webcgm] implications of URI vs. IRI
>> >> 
>> >> At 05:09 PM 10/4/2005 -0400, Benoit Bezaire wrote:
>> >> >Hi Lofton,
>> >> >
>> >> >I just did a quick search... I think that URI is only restricting
>> >> >characters to US-ASCII; it has no control on the encoding (utf-8,
>> >> >utf-16 etc...).
>> >> >
>> >> >In XML syntax such as XHTML and SVG, files can have just 
>> about any 
>> >> >encoding; I'm not aware of any special processing for the 
>> xlink:href 
>> >> >attribute (i.e., this is a URI, change the encoding to 
>> _blah_). It 
>> >> >wouldn't make any sense. The scope of the encoding is for
>> >> the complete
>> >> >document.
>> >> >
>> >> >The above is not a fact, only my understanding.
>> >> 
>> >> It matches my understanding.  And it is clear that XML and/or URI
>> >> (rfc3986) require "URI escaping" for non-ASCII characters in URIs,
>> >> i.e., for character that are outside of the ASCII repertoire.  And
>> >> this is independent of the character-set encoding of the URI.
>> >> 
>> >> So finally, a URI from HTML into CGM containing a 
>> reference-by-name 
>> >> to "my object group" would be written like this:
>> >> 
>> >> <a
>> >> 
>> href="http://example.org/myCGM.cgm#name(my%20object%20group)">blah</a
>> >> >
>> >> 
>> >> and a WebCGM 'linkuri' first parameter would be this:
>> >> 
>> >> http://example.org/myCGM.cgm#name(my%20object%20group)
>> >> 
>> >> -Lofton.
>> >> 
>> >> 
>> >> >Tuesday, September 20, 2005, 2:45:48 PM, Lofton wrote:
>> >> >
>> >> >LH> All --
>> >> >
>> >> >LH> When I was putting together first unicode tests, Dieter also
>> >> >LH> supplied me with this nifty "advanced" test.  It gets
>> >> into Japanese
>> >> >LH> text for SF text like APS ids and names.
>> >> >
>> >> >LH> It highlights an interesting implication of our decision
>> >> to stick
>> >> >LH> with URI instead of switching to IRI.  URI encoding
>> >> requires that
>> >> >LH> any non-ASCII characters are included by the "URI escaping 
>> >> >LH> mechanism", see WebCGM
>> >> >3.1.1.4
>> >> >LH> [1], and the more detailed XML description [2].  
>> >> Basically, get the
>> >> >LH> **UTF8** representation of the characters, and replace
>> >> each byte in
>> >> >LH> that representation by the 3-character string %HH, where
>> >> HH is the
>> >> >LH> hex representation of the byte.
>> >> >
>> >> >LH> So consider, for example, the 2-character id of
>> >> the object in
>> >> >LH> the upper-left box, and its use in a link from the 
>> object in the
>> >> >upper-right box.
>> >> >
>> >> >LH> If that id were the two characters c1c2, let's suppose
>> >> that it could
>> >> >LH> be represented by the 4 utf8 bytes b1b2b3b4 (I'm just 
>> guessing 
>> >> >LH> about "4", since UTF8 is variable length, it could be
>> >> more).  Then
>> >> >LH> to put that id
>> >> >into
>> >> >LH> a URI string, it would have to be the 12-character string:
>> >> >
>> >> >LH> %hh%hh%hh%hh
>> >> >
>> >> >LH> where the hh are the 4 pairs of hex digits that
>> >> represent
>> >> >LH> the 4
>> >> >LH> utf8 bytes. I.e., the CGM URI for the link would be:
>> >> >
>> >> >LH> #id(%hh%hh%hh%hh, view_context)
>> >> >
>> >> >LH> Side question.  Does URI (rfc3986 [3]) restrict only the
>> >> character
>> >> >LH> repertoire of the URI, or does it restrict also the
>> >> encoding? I.e.,
>> >> >LH> can a URI be encoded in ascii, isoLatin1, or utf8, or 
>> utf16, or 
>> >> >LH> whatever, as
>> >> >long
>> >> >LH> as it restricts its repertoire to the URI repertoire? 
>>  I suspect
>> >> >"yes", but
>> >> >LH> I don't know the answer.  It would be interesting for 
>> someone to
>> >> >research it.
>> >> >
>> >> >LH> Thoughts?
>> >> >
>> >> >LH> Regards,
>> >> >LH> -Lofton.
>> >> >
>> >> >LH> [1]
>> >> >LH> http://docs.oasis-open.org/webcgm/v2.0/WebCGM20-IC.html#webcgm_3_1_1_4
>> >> >LH> [2] http://www.w3.org/TR/2004/REC-xml-20040204/#sec-external-ent
>> >> >LH> [3] URI:  http://www.ietf.org/rfc/rfc3986.txt
>> >> >LH> [4] IRI:  http://www.ietf.org/rfc/rfc3987.txt
Follow-Ups:
- RE: Re[4]: [cgmo-webcgm] implications of URI vs. IRI
  - From: Dieter Weidenbrück <dieter@itedo.com>
References:
- Re[2]: [cgmo-webcgm] implications of URI vs. IRI
  - From: Benoit Bezaire <benoit@itedo.com>
- RE: Re[2]: [cgmo-webcgm] implications of URI vs. IRI
  - From: Dieter Weidenbrück <dieter@itedo.com>