cgmo-webcgm message

Subject: RE: Re[6]: [cgmo-webcgm] implications of URI vs. IRI
From: Dieter Weidenbrück <dieter@itedo.com>
To: "'Benoit Bezaire'" <benoit@itedo.com>,<cgmo-webcgm@lists.oasis-open.org>
Date: Wed, 12 Oct 2005 07:52:57 +0200
All,

Benoit is right, this is important.

Consequences:
- if we go for "escaped only", most likely every file from the past will
  be invalid if it had a space or similar in it.
- if we go for "non-escaped only" we will have no change compared to
  WebCGM 1.0, however, we need to double-check whether this is in line
  with the RFC.

Questions:
- How did other authoring tools do this in the past?
- What do other viewer tools expect if they read an existing WebCGM file?

I think this information is urgently needed to understand the situation
a bit better.

Regards,
Dieter 

> -----Original Message-----
> From: Benoit Bezaire [mailto:benoit@itedo.com] 
> Sent: Wednesday, October 12, 2005 5:05 AM
> To: cgmo-webcgm@lists.oasis-open.org
> Subject: Re[6]: [cgmo-webcgm] implications of URI vs. IRI
> 
> Hi Lofton,
> 
> I think that some of your questions are answered in 2.4.2:
> 
> 2.4.2. When to Escape and Unescape
> 
>   A URI is always in an "escaped" form, since escaping or unescaping a
>   completed URI might change its semantics.
>   [...]
>   Because the percent "%" character always has the reserved purpose of
>   being the escape indicator, it must be escaped as "%25" in order to
>   be used as data within a URI.  Implementers should be careful not to
>   escape or unescape the same string more than once, since unescaping
>   an already unescaped string might lead to misinterpreting a percent
>   data character as another escaped character, or vice versa in the
>   case of escaping an already escaped string.
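
The double-escaping hazard described in the quoted section 2.4.2 can be reproduced with a short sketch. It uses Python's standard urllib.parse; the sample string "50% off" is only an illustration and does not come from any WebCGM file.

```python
from urllib.parse import quote, unquote

name = "50% off"                 # raw data containing a literal percent sign

once = quote(name, safe="")      # escape exactly once
twice = quote(once, safe="")     # escape the already escaped string again

print(once)                      # 50%25%20off
print(twice)                     # 50%2525%2520off

# One unescape restores the data only if the string was escaped once.
print(unquote(once) == name)     # True
print(unquote(twice) == name)    # False: one level of escaping remains

# Unescaping data that was never escaped silently changes it.
literal = "a%20b"                # raw name made of the characters a % 2 0 b
print(unquote(literal))          # a b
```

The last line shows why an interpreter has to know which form it was given before it unescapes anything.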
> 
> One last comment; this is _again_ a three way conversation 
> (Lofton, Dieter and myself)... everyone should be involved in 
> this conversation (users and implementers, what do you want), 
> you are all affected by this. We want a 'valid' solution that 
> will have little disruption on WebCGM 1.0 content; let's try 
> to work towards that goal.
> 
> Regards,
> 
> -- 
>  Benoit   mailto:benoit@itedo.com
> 
> 
> Tuesday, October 11, 2005, 7:05:19 PM, Lofton wrote:
> 
> LH> More...
> 
> LH> I am giving some more thought to the ambiguity problem about
> LH> "both" (i.e., both forms allowed in the fragment, linkuri, etc.,
> LH> a la SVG).
> 
> LH> Firstly, a possible solution.  One could always add a rule for
> LH> CGM interpreters, that any %hh 3-tuple in a fragment (or linkuri
> LH> 1st parameter, or ...) will be taken by the CGM interpreter as a
> LH> URI escaping sequence.  So caveat to WebCGM generators ...
> LH> although the 'name' ApsAttr might allow something like that as
> LH> part of the 'name' value, you had better not do it, because you
> LH> will create an ambiguity when you use that 'name' value in a
> LH> fragment (or linkuri, DOM, XCF) and will NOT get the result you
> LH> want.
> 
> LH> Secondly...
> 
> LH> There is still something about the SVG sentence that bothers
> LH> me, "...must be a URI reference as defined in [RFC2396], or must
> LH> result in a URI reference after the escaping procedure described
> LH> below is applied".  Specifically, was the *first* phrase ("must
> LH> be a URI reference as defined in [RFC2396]") meant to include
> LH> the case(s):
> 
> LH> 1.) it is all safe ASCII in its original data form, with no URI
> LH>     escaping needed or present?
> LH> 2.) or was it maybe unsafe, but is already URI escaped?
> LH> 3.) or both?
> 
> LH> e1 illustrates #1 (all safe, no problem characters, no escaping
> LH> needed or done).  e2 illustrates #2 (already escaped).
> 
> LH> e1)  <image href="rasterImage.png" .../>
> LH> e2)  <image href="raster%20image.png" .../>
> 
> LH> Are both valid in SVG?
> 
> LH> I'm going to reread 2396 again.  Chapter 2 talks about all this
> LH> stuff (as well as questions like local encoding), but it is not
> LH> light reading.  I'm also thinking to ask Chris about his memory
> LH> of the sentence, particularly the intent of its first phrase.
> 
> LH> -Lofton.
> 
> LH> At 01:10 PM 10/11/2005 -0600, Lofton Henderson wrote:
> LH> At 05:20 PM 10/11/2005 +0200, Dieter Weidenbrück wrote:
> LH> [...]
> LH> good, and agreed.
> 
> 
> LH> Not so fast!
> 
> LH> Actually, I do agree that we should use the SVG interpretation,
> LH> if possible.  I'm not sure how we ended up differently, since
> LH> Chris was consulting on and helping with this detail (it might be
> LH> the time difference -- 1999 for WebCGM 1.0 versus 2001 for SVG --
> LH> Chris and SVG might have figured it out properly in those two
> LH> years).
> 
> LH> My problem is:  exactly how to do it.  One logical method might
> LH> be an erratum on 1.0 -- logical because we ended up diverging
> LH> from SVG 1.0 on that detail, and didn't intend to.  (Would
> LH> require some action within W3C, to update the errata file that is
> LH> linked from the Status section of the WebCGM 1.0 Recommendation.)
> LH> An erratum (in the "both" direction) would mean that both forms
> LH> are valid 1.0 content, from the very beginning.
> 
> LH> Another possibility:  fix the language for 2.0, so that "both"
> LH> are allowed from 2.0 on.  (This makes 1.0 content problematic, if
> LH> both forms have been used.)
> 
> LH> About the question of "both"...
> 
> >> The sentence "...must be a URI reference as defined in [RFC2396],
> >> or must result in a URI reference after the escaping procedure
> >> described below is applied"
> >> 
> >> DW> The way I understand the SVG wording is that both forms would
> >> DW> be legal:
> >> 
> >> DW> http://www.cgmopen.org/abc.cgm#name(my name with blank)
> >> DW> http://www.cgmopen.org/abc.cgm#name(my%20name%20with%20blank)
> 
> 
> LH> RFC 2396 makes it clear (sections 2.3 and 2.4) that the presence
> LH> of % should tell a URI resolver that URI escaping is in effect --
> LH> % isn't a valid reserved (delimiter or subdelimiter) character,
> LH> nor a valid unreserved character, for the URI.
> 
> LH> However, % is a valid character in the repertoire of the 'name'
> LH> ApsAttr, right?  So "%myFunnyName%" is a valid 'name' ApsAttr in
> LH> a WebCGM instance, right?  And the 3-character "%20" is a valid
> LH> 'name' ApsAttr, right?
> 
> LH> So if WebCGM allowed "both", and you encountered a fragment:
> 
> LH> #name(a%20b) ,
> 
> LH> what would you give to the URI resolver?  Two choices:
> 
> LH> a%20b  [assumes that the generator already applied uri-escaping] 
> LH> a%2520b  [assumes that generator did NOT uri-escape already]
> 
> LH> [btw, hex for % is 0x25, so % as an actual URI character is
> LH> given to the URI resolver as %25]
> 
> LH> Thoughts?  (This gives me a headache!)
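
The two readings of #name(a%20b) above can be made concrete with a small sketch. The helper name readings() is hypothetical, and Python's urllib.parse stands in for whatever escaping routine a CGM interpreter would actually use.

```python
from urllib.parse import quote, unquote

def readings(stored):
    """Return (string sent to the resolver, name it denotes) for both readings."""
    # Reading 1: the generator already escaped, so pass the text through.
    already_escaped = (stored, unquote(stored))
    # Reading 2: the generator stored raw data, so escape it now.
    raw = quote(stored, safe="")
    not_escaped = (raw, unquote(raw))
    return already_escaped, not_escaped

for sent, meant in readings("a%20b"):
    print(sent, "->", repr(meant))
```

Under the first reading the link targets an object named "a b"; under the second it targets an object whose name is the five characters a%20b. Nothing in the stored string tells the two apart, which is the ambiguity being discussed.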
> 
> LH> -Lofton.
> 
> 
> LH> One more comment:
> LH> Spaces in "name" attributes have been allowed long before any 
> LH> linkURI and/or XML rules existed, thus nobody ever thought about 
> LH> this detail. Everything was stored in the CGM as the rules for 
> LH> non-graphical strings mandated.
> LH> One could say that this could have been clarified in WebCGM 1.0, 
> LH> however, I find it quite useful to have both forms available.
> 
> LH> Dieter
> 
> >> -----Original Message-----
> >> From: Benoit Bezaire [mailto:benoit@itedo.com]
> >> Sent: Tuesday, October 11, 2005 5:07 PM
> >> To: cgmo-webcgm@lists.oasis-open.org
> >> Subject: Re[4]: [cgmo-webcgm] implications of URI vs. IRI
> >> 
> >> Hi Dieter,
> >> 
> >> Thanks for the example, we are talking about the same thing.
> >> 
> >> I understand that ATA and WebCGM have allowed spaces in URI
> >> fragments for the last 10 years, but from my interpretation of
> >> RFC2396, those linkuris are illegal. Here is a quote from Section
> >> 4.1 of http://www.ietf.org/rfc/rfc2396.txt:
> >> "The character restrictions described in Section 2 for URI also
> >> apply to the fragment in a URI-reference."
> >> 
> >> And by reading Section 2, you end up reading that spaces are not 
> >> allowed.
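
A rough way to check a fragment against the RFC 2396 character rules is sketched here. The character sets follow the "reserved" and "unreserved" productions of RFC 2396 section 2; the function name disallowed_chars() is made up for illustration, and the check looks only at characters, not at the fragment's overall syntax.

```python
import re
import string

RESERVED = ";/?:@&=+$,"                       # RFC 2396 section 2.2
MARK = "-_.!~*'()"                            # RFC 2396 section 2.3
UNRESERVED = string.ascii_letters + string.digits + MARK
ALLOWED = set(RESERVED + UNRESERVED)
ESCAPED = re.compile(r"%[0-9A-Fa-f]{2}")      # a well-formed %HH triplet

def disallowed_chars(fragment):
    """List the characters that RFC 2396 does not allow in a fragment."""
    bad = []
    i = 0
    while i < len(fragment):
        if ESCAPED.match(fragment, i):        # skip a complete %HH escape
            i += 3
            continue
        if fragment[i] not in ALLOWED:
            bad.append(fragment[i])
        i += 1
    return bad

print(disallowed_chars("name(my name with blank)"))        # [' ', ' ', ' ']
print(disallowed_chars("name(my%20name%20with%20blank)"))  # []
```

The unescaped form fails only because of its spaces, which matches the reading of Section 2 given above.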
> >> 
> >> That being said, your interpretation of the SVG wording sounds
> >> acceptable. The sentence 'or must result in a URI reference after
> >> the escaping procedure' seems to be saving us! I'm in favor of
> >> adding wording to the spec to clarify this issue (the 3 bullet
> >> wording would be good also).
> >> 
> >> I no longer have a preference as to whether we should deprecate
> >> or not. On one side, I think that this is a can of worms and
> >> forcing escaping simplifies things; on the other, I agree that
> >> long %HH sequences for Asian names are not ideal.
> >> 
> >> Allowing both is probably the least painful approach for users
> >> and implementers at this time.
> >> 
> >> Regards,
> >> 
> >> --
> >>  Benoit   mailto:benoit@itedo.com
> >> 
> >> 
> >> Tuesday, October 11, 2005, 10:15:06 AM, Dieter wrote:
> >> 
> >> DW> Hi Benoit,
> >> 
> >> DW> see inline
> >> 
> >> >> -----Original Message-----
> >> >> From: Benoit Bezaire [mailto:benoit@itedo.com]
> >> >> Sent: Tuesday, October 11, 2005 3:48 PM
> >> >> To: cgmo-webcgm@lists.oasis-open.org
> >> >> Subject: Re[2]: [cgmo-webcgm] implications of URI vs. IRI
> >> >> 
> >> >> Hi Dieter,
> >> >> 
> >> >> You said:
> >> >> NOTE: If we required an escaped string inside the CGM now,
> >> >> this will make almost all existing files invalid ones as soon
> >> >> as a simple space is in a name attribute.
> >> >> 
> >> >> You are talking about the 'name' attribute within a URI only,
> >> >> correct?
> >> >> Or, let me rephrase...
> >> >> Files which have a name attribute (containing a space) that is
> >> >> used in a URI become invalid, right?
> >> DW> I am referring to the link destination parameter of a linkuri
> >> DW> attribute.
> >> DW> Yes, something like (pseudo-code)
> >> 
> >> DW> linkuri "http://www.cgmopen.org/abc.cgm#name(my name with blank)"
> >> DW> "some title" "_blank"
> >> 
> >> DW> would become illegal, and this is the form (without escaping)
> >> DW> that has been used forever in the ATA and WebCGM environment
> >> DW> (almost 10 years now).
> >>  
> >> >> 
> >> >> I would be in favor of deprecating (i.e., authors should stop
> >> >> creating such files) the old behavior (no escaping) and adding
> >> >> 'a la' SVG wording to the spec. Like Dieter says, but with an
> >> >> emphasis on deprecating the old behavior.
> >> DW> The way I understand the SVG wording is that both forms would
> >> DW> be legal:
> >> 
> >> DW> http://www.cgmopen.org/abc.cgm#name(my name with blank) 
> >> DW> http://www.cgmopen.org/abc.cgm#name(my%20name%20with%20blank)
> >> 
> >> DW> I would NOT deprecate the first form, because it would force
> >> DW> us to build long strings for Japanese or similar characters,
> >> DW> following the rules as described below.
> >> 
> >> DW> Do you read the SVG spec the same way, or am I wrong?
> >> 
> >> DW> Regards,
> >> DW> Dieter
> >> 
> >> >> 
> >> >> --
> >> >>  Benoit   mailto:benoit@itedo.com
> >> >> 
> >> >>  
> >> >> Thursday, October 6, 2005, 7:52:43 AM, Dieter wrote:
> >> >> 
> >> >> DW> All,
> >> >> 
> >> >> DW> I am not yet convinced that we are heading in the right
> >> >> DW> direction here.
> >> >> 
> >> >> DW> Example:
> >> >> DW> Let's assume we have the string "nihon" inside a linkUri:
> >> >> DW> "id(日本)"
> >> >> 
> >> >> DW> using UTF-16 (big endian) this is: 65 e5 67 2c (4 Bytes)
> >> >> DW> converted to UTF-8: EF BB BF E6 97 A5 E6 9C AC (9 Bytes)
> >> >> 
> >> >> DW> and then you can apply escaping for all non-ascii chars
> >> >> 
> >> >> DW> %EF%BB%BF%E6%97%A5%E6%9C%AC (27 Bytes)
> >> >> 
> >> >> DW> and now we store it into the linkURI attribute, however,
> >> >> DW> since somewhere else in the file we have this string in
> >> >> DW> Japanese characters as an ID, all non-graphical strings will
> >> >> DW> be stored as UTF-16 (could be UTF-8 as well):
> >> >> 
> >> >> DW> I'll spare you the writing; you end up with 54 bytes.
> >> >> 
> >> >> DW> So we are moving from 4 bytes to 54 bytes.
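
The byte counts in this example can be recomputed with a short sketch. It assumes the id is the two characters U+65E5 U+672C (which matches the UTF-16 bytes 65 e5 67 2c quoted above); the helper name sizes() is made up. The leading EF BB BF in the quoted UTF-8 sequence is the encoding of the byte order mark U+FEFF, so the sketch reports the sizes both with and without it.

```python
from urllib.parse import quote

def sizes(text):
    """Return (UTF-16 bytes, UTF-8 bytes, escaped chars, escaped as UTF-16 bytes)."""
    utf16 = text.encode("utf-16-be")         # stored form, no BOM added
    utf8 = text.encode("utf-8")              # intermediate form for escaping
    escaped = quote(text, safe="")           # %HH for every UTF-8 byte
    stored = escaped.encode("utf-16-be")     # escaped text held as UTF-16
    return len(utf16), len(utf8), len(escaped), len(stored)

name = "\u65e5\u672c"                        # the id from the example
print(sizes(name))                           # (4, 6, 18, 36)
print(sizes("\ufeff" + name))                # (6, 9, 27, 54)
```

Without the byte order mark the growth is from 4 bytes to 36; the 27 and 54 figures appear only when the mark is carried into the escaped string. Either way the escaped form is several times larger than the original.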
> >> >> 
> >> >> DW> I hope that this accurately describes the procedure that has
> >> >> DW> been discussed over the past couple of days.
> >> >> 
> >> >> DW> Comparison to SVG:
> >> >> DW> In 5.3.2. [1], SVG says the following:
> >> >> 
> >> >> DW> "The value of the href attribute must be a URI reference as
> >> >> DW> defined in [RFC2396], or must result in a URI reference after
> >> >> DW> the escaping procedure described below is applied. The
> >> >> DW> procedure is applied when passing the URI reference to a URI
> >> >> DW> resolver."
> >> >> 
> >> >> DW> Interesting to see the last sentence here. IMO this means, it
> >> >> DW> is perfectly legal to store the URI reference using any
> >> >> DW> encoding, as long as it will be transcoded to UTF-8 and
> >> >> DW> escaped before passing it on to a URI resolver.
> >> >> 
> >> >> DW> This has always been my understanding, and this is how all of
> >> >> DW> our products have been handling references.
> >> >> 
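
One way to read "escape when passing the URI reference to a URI resolver" is sketched here. This is an assumption about the procedure, not text from the SVG or WebCGM specifications: it escapes spaces, non-ASCII characters and other characters outside the RFC 2396 sets, and leaves '#', '%', '[' and ']' alone. The function name to_resolver() and the SAFE set are illustrative.

```python
from urllib.parse import quote

# Characters passed through unchanged: RFC 2396 reserved and mark
# characters, plus '#', '%', '[' and ']' (assumed, see the note above).
SAFE = ";/?:@&=+$,-_.!~*'()#%[]"

def to_resolver(stored):
    """Escape a stored link string just before it goes to a URI resolver."""
    return quote(stored, safe=SAFE, encoding="utf-8")

plain = "http://www.cgmopen.org/abc.cgm#name(my name with blank)"
escaped = "http://www.cgmopen.org/abc.cgm#name(my%20name%20with%20blank)"

print(to_resolver(plain) == escaped)     # True: raw form gets escaped
print(to_resolver(escaped) == escaped)   # True: escaped form is unchanged
print(to_resolver("#id(\u65e5\u672c)"))  # #id(%E6%97%A5%E6%9C%AC)
```

Because '%' is passed through, both stored forms resolve to the same URI, which is what allowing "both" would rely on. The cost is that a raw name containing a literal '%' cannot be told apart from an escape sequence.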
> >> >> DW> NOTE:
> >> >> DW> If we required an escaped string inside the CGM now, this
> >> >> will make
> >> >> DW> almost all existing files invalid ones as soon as a
> >> >> simple space is
> >> >> DW> in a name attribute.
> >> >> 
> >> >> DW> RECOMMENDATION:
> >> >> DW> Amend wording slightly to match watch SVG is doing and
> >> allow for
> >> >> DW> both styles, escaped and not escaped.
> >> >> 
> >> >> DW> Comments?
> >> >> 
> >> >> DW> Regards,
> >> >> DW> Dieter
> >> >> 
> >> >> 
> >> >> DW> [1] http://www.w3.org/TR/SVG11/struct.html#xlinkRefAttrs
> >> >> 
> >> >> 
> >> >> >> -----Original Message-----
> >> >> >> From: Lofton Henderson [mailto:lofton@rockynet.com]
> >> >> >> Sent: Wednesday, October 05, 2005 1:06 AM
> >> >> >> To: Benoit Bezaire; cgmo-webcgm@lists.oasis-open.org
> >> >> >> Subject: Re: [cgmo-webcgm] implications of URI vs. IRI
> >> >> >> 
> >> >> >> At 05:09 PM 10/4/2005 -0400, Benoit Bezaire wrote:
> >> >> >> >Hi Lofton,
> >> >> >> >
> >> >> >> >I just did a quick search... I think that URI is only
> >> >> >> >restricting characters to US-ASCII; it has no control on the
> >> >> >> >encoding (utf-8, utf-16 etc...).
> >> >> >> >
> >> >> >> >In XML syntax such as XHTML and SVG, files can have just about
> >> >> >> >any encoding; I'm not aware of any special processing for the
> >> >> >> >xlink:href attribute (i.e., this is a URI, change the encoding
> >> >> >> >to _blah_). It wouldn't make any sense. The scope of the
> >> >> >> >encoding is for the complete document.
> >> >> >> >
> >> >> >> >The above is not a fact, only my understanding.
> >> >> >> 
> >> >> >> It matches my understanding.  And it is clear that XML and/or
> >> >> >> URI (rfc3986) require "URI escaping" for non-ASCII characters
> >> >> >> in URIs, i.e., for characters that are outside of the ASCII
> >> >> >> repertoire.  And this is independent of the character-set
> >> >> >> encoding of the URI.
> >> >> >> 
> >> >> >> So finally, a URI from HTML into CGM containing a
> >> >> >> reference-by-name to "my object group" would be written like
> >> >> >> this:
> >> >> >> 
> >> >> >> <a href="http://example.org/myCGM.cgm#name(my%20object%20group)">blah</a>
> >> >> >> 
> >> >> >> and a WebCGM 'linkuri' first parameter would be this:
> >> >> >> 
> >> >> >> http://example.org/myCGM.cgm#name(my%20object%20group)
> >> >> >> 
> >> >> >> -Lofton.
> >> >> >> 
> >> >> >> 
> >> >> >> >Tuesday, September 20, 2005, 2:45:48 PM, Lofton wrote:
> >> >> >> >
> >> >> >> >LH> All --
> >> >> >> >
> >> >> >> >LH> When I was putting together first unicode tests, Dieter also
> >> >> >> >LH> supplied me with this nifty "advanced" test.  It gets into
> >> >> >> >LH> Japanese text for SF text like APS ids and names.
> >> >> >> >
> >> >> >> >LH> It highlights an interesting implication of our decision to
> >> >> >> >LH> stick with URI instead of switching to IRI.  URI encoding
> >> >> >> >LH> requires that any non-ASCII characters are included by the
> >> >> >> >LH> "URI escaping mechanism", see WebCGM 3.1.1.4 [1], and the
> >> >> >> >LH> more detailed XML description [2].  Basically, get the
> >> >> >> >LH> **UTF8** representation of the characters, and replace each
> >> >> >> >LH> byte in that representation by the 3-character string %HH,
> >> >> >> >LH> where HH is the hex representation of the byte.
> >> >> >> >
> >> >> >> >LH> So consider, for example, the 2-character id of the object in
> >> >> >> >LH> the upper-left box, and its use in a link from the object in
> >> >> >> >LH> the upper-right box.
> >> >> >> >
> >> >> >> >LH> If that id were the two characters c1c2, let's suppose that it
> >> >> >> >LH> could be represented by the 4 utf8 bytes b1b2b3b4 (I'm just
> >> >> >> >LH> guessing about "4", since UTF8 is variable length, it could
> >> >> >> >LH> be more).  Then to put that id into a URI string, it would
> >> >> >> >LH> have to be the 12-character string:
> >> >> >> >
> >> >> >> >LH> %hh%hh%hh%hh
> >> >> >> >
> >> >> >> >LH> where the hh are the 4 pairs of hex digits that represent the
> >> >> >> >LH> 4 utf8 bytes. I.e., the CGM URI for the link would be:
> >> >> >> >
> >> >> >> >LH> #id(%hh%hh%hh%hh, view_context)
> >> >> >> >
> >> >> >> >LH> Side question.  Does URI (rfc3986 [3]) restrict only the
> >> >> >> >LH> character repertoire of the URI, or does it restrict also the
> >> >> >> >LH> encoding? I.e., can a URI be encoded in ascii, isoLatin1, or
> >> >> >> >LH> utf8, or utf16, or whatever, as long as it restricts its
> >> >> >> >LH> repertoire to the URI repertoire?  I suspect "yes", but I
> >> >> >> >LH> don't know the answer.  It would be interesting for someone
> >> >> >> >LH> to research it.
> >> >> >> >
> >> >> >> >LH> Thoughts?
> >> >> >> >
> >> >> >> >LH> Regards,
> >> >> >> >LH> -Lofton.
> >> >> >> >
> >> >> >> >LH> [1] http://docs.oasis-open.org/webcgm/v2.0/WebCGM20-IC.html#webcgm_3_1_1_4
> >> >> >> >LH> [2] http://www.w3.org/TR/2004/REC-xml-20040204/#sec-external-ent
> >> >> >> >LH> [3] URI:  http://www.ietf.org/rfc/rfc3986.txt
> >> >> >> >LH> [4] IRI:  http://www.ietf.org/rfc/rfc3987.txt
> 
> 
> 
>
Follow-Ups:
- RE: Re[6]: [cgmo-webcgm] implications of URI vs. IRI
  - From: Lofton Henderson <lofton@rockynet.com>
- RE: Re[6]: [cgmo-webcgm] implications of URI vs. IRI
  - From: Lofton Henderson <lofton@rockynet.com>
References:
- Re[6]: [cgmo-webcgm] implications of URI vs. IRI
  - From: Benoit Bezaire <benoit@itedo.com>