Subject: RE: Re[6]: [cgmo-webcgm] implications of URI vs. IRI
All,

Benoit is right, this is important.

Consequences:
- if we go for "escaped only", most likely every file from the past
  will be invalid if it had a space or similar in it.
- if we go for "non-escaped only" we will have no change compared to
  WebCGM 1.0, however, we need to double-check whether this is in line
  with the RFC.

Questions:
- How did other authoring tools do this in the past?
- What do other viewer tools expect if they read an existing WebCGM
  file?

I think this information is urgently needed to understand the situation
a bit better.

Regards,
Dieter

> -----Original Message-----
> From: Benoit Bezaire [mailto:benoit@itedo.com]
> Sent: Wednesday, October 12, 2005 5:05 AM
> To: cgmo-webcgm@lists.oasis-open.org
> Subject: Re[6]: [cgmo-webcgm] implications of URI vs. IRI
>
> Hi Lofton,
>
> I think that some of your questions are answered in 2.4.2:
>
> 2.4.2. When to Escape and Unescape
>
> A URI is always in an "escaped" form, since escaping or unescaping a
> completed URI might change its semantics.
> [...]
> Because the percent "%" character always has the reserved purpose of
> being the escape indicator, it must be escaped as "%25" in order to
> be used as data within a URI. Implementers should be careful not to
> escape or unescape the same string more than once, since unescaping
> an already unescaped string might lead to misinterpreting a percent
> data character as another escaped character, or vice versa in the
> case of escaping an already escaped string.
>
> One last comment; this is _again_ a three way conversation
> (Lofton, Dieter and myself)... everyone should be involved in
> this conversation (users and implementers, what do you want),
> you are all affected by this. We want a 'valid' solution that
> will have little disruption on WebCGM 1.0 content; let's try
> to work towards that goal.
>
> Regards,
>
> --
> Benoit mailto:benoit@itedo.com
>
>
> Tuesday, October 11, 2005, 7:05:19 PM, Lofton wrote:
>
> LH> More...
>
> LH> I am giving some more thought to the ambiguity problem about
> LH> "both" (i.e., both forms allowed in the fragment, linkuri, etc.,
> LH> a la SVG.)
>
> LH> Firstly, a possible solution. One could always add a rule for CGM
> LH> interpreters, that any %hh 3-tuple in a fragment (or linkuri 1st
> LH> parameter, or ...) will be taken by the CGM interpreter as a URI
> LH> escaping sequence. So caveat to WebCGM generators ... although the
> LH> 'name' ApsAttr might allow something like that as part of the
> LH> 'name' value, you had better not do it, because you will create an
> LH> ambiguity when you use that 'name' value in a fragment (or
> LH> linkuri, DOM, XCF) and will NOT get the result you want.
>
> LH> Secondly...
>
> LH> There is still something about the SVG sentence that bothers me,
> LH> "...must be a URI reference as defined in [RFC2396], or must
> LH> result in a URI reference after the escaping procedure described
> LH> below is applied". Specifically, was the *first* phrase ("must be
> LH> a URI reference as defined in [RFC2396]") meant to include the
> LH> case(s):
>
> LH> 1.) it is all safe ASCII in its original data form, with no URI
> LH>     escaping needed or present?
> LH> 2.) or was it maybe unsafe, but is already URI escaped?
> LH> 3.) or both?
>
> LH> e1 illustrates #1 (all safe, no problem characters, no escaping
> LH> needed or done). e2 illustrates #2 (already escaped).
>
> LH> e1) <image href="rasterImage.png" .../>
> LH> e2) <image href="raster%20image.png" .../>
>
> LH> Are both valid in SVG?
>
> LH> I'm going to reread 2396 again. Chapter 2 talks about all this
> LH> stuff (as well as questions like local encoding), but it is not
> LH> light reading. I'm also thinking to ask Chris about his memory of
> LH> the sentence, particularly the intent of its first phrase.
>
> LH> -Lofton.
>
> LH> At 01:10 PM 10/11/2005 -0600, Lofton Henderson wrote:
> LH> At 05:20 PM 10/11/2005 +0200, Dieter Weidenbrück wrote:
> LH> [...]
> LH> good, and agreed.
>
>
> LH> Not so fast!
>
> LH> Actually, I do agree that we should use the SVG interpretation, if
> LH> possible. I'm not sure how we ended up differently, since Chris
> LH> was consulting on and helping with this detail (it might be the
> LH> time difference -- 1999 for WebCGM 1.0 versus 2001 for SVG --
> LH> Chris and SVG might have figured it out properly in those two
> LH> years).
>
> LH> My problem is: exactly how to do it. One logical method might be
> LH> an erratum on 1.0 -- logical because we ended up diverging from
> LH> SVG 1.0 on that detail, and didn't intend to. (Would require some
> LH> action within W3C, to update the errata file that is linked from
> LH> the Status section of the WebCGM 1.0 Recommendation.) An erratum
> LH> (in the "both" direction) would mean that both forms are valid 1.0
> LH> content, from the very beginning.
>
> LH> Another possibility: fix the language for 2.0, so that "both" are
> LH> allowed from 2.0 on. (This makes 1.0 content problematic, if both
> LH> forms have been used.)
>
> LH> About the question of "both"...
>
> >> The sentence "...must be a URI reference as defined in [RFC2396],
> >> or must result in a URI reference after the escaping procedure
> >> described below is applied"
> >>
> >> DW> The way I understand the SVG wording is that both forms would
> >> DW> be legal:
> >>
> >> DW> http://www.cgmopen.org/abc.cgm#name(my name with blank)
> >> DW> http://www.cgmopen.org/abc.cgm#name(my%20name%20with%20blank)
>
>
> LH> Rfc2396 makes it clear (section 2.3 and 2.4) that the presence of
> LH> % should tell a URI resolver that URI escaping is in effect -- %
> LH> isn't a valid reserved (delimiter or subdelimiter) character, nor
> LH> a valid unreserved character, for the URI.
>
> LH> However, % is a valid character in the repertoire of the 'name'
> LH> ApsAttr, right? So "%myFunnyName%" is a valid 'name' APSattr in a
> LH> WebCGM instance, right?
> LH> And the 3-character "%20" is a valid 'name' ApsAttr, right?
>
> LH> So if WebCGM allowed "both", and you encountered a fragment:
>
> LH> #name(a%20b) ,
>
> LH> what would you give to the URI resolver? Two choices:
>
> LH> a%20b [assumes that the generator already applied uri-escaping]
> LH> a%2520b [assumes that generator did NOT uri-escape already]
>
> LH> [btw, hex for % is 0x25, so % as an actual URI character is given
> LH> to URI resolver as %25]
>
> LH> Thoughts? (This gives me a headache!)
>
> LH> -Lofton.
>
>
> LH> One more comment:
> LH> Spaces in "name" attributes have been allowed long before any
> LH> linkURI and/or XML rules existed, thus nobody ever thought about
> LH> this detail. Everything was stored in the CGM as the rules for
> LH> non-graphical strings mandated.
> LH> One could say that this could have been clarified in WebCGM 1.0,
> LH> however, I find it quite useful to have both forms available.
>
> LH> Dieter
>
> >> -----Original Message-----
> >> From: Benoit Bezaire [mailto:benoit@itedo.com]
> >> Sent: Tuesday, October 11, 2005 5:07 PM
> >> To: cgmo-webcgm@lists.oasis-open.org
> >> Subject: Re[4]: [cgmo-webcgm] implications of URI vs. IRI
> >>
> >> Hi Dieter,
> >>
> >> Thanks for the example, we are talking about the same thing.
> >>
> >> I understand that ATA and WebCGM have allowed spaces in URI
> >> fragments for the last 10 years, but from my interpretation of
> >> RFC2396, those linkuris are illegal. Here is a quote from Section
> >> 4.1 of http://www.ietf.org/rfc/rfc2396.txt
> >> "The character restrictions described in Section 2 for URI also
> >> apply to the fragment in a URI-reference."
> >>
> >> And by reading Section 2, you end up reading that spaces are not
> >> allowed.
> >>
> >> That being said, your interpretation of the SVG wording sounds
> >> acceptable. The sentence 'or must result in a URI reference after
> >> the escaping procedure' seems to be saving us!
> >> I'm in favor of adding wording to the spec to clarify this issue
> >> (the 3 bullet wording would be good also).
> >>
> >> I no longer have a preference if we should deprecate or not.
> >> On one side, I think that this is a can of worms and forcing
> >> escaping simplifies things; on the other, I agree that long %HH for
> >> Asian names is not ideal.
> >>
> >> Allowing both is probably the less painful approach for users and
> >> implementers at this time.
> >>
> >> Regards,
> >>
> >> --
> >> Benoit mailto:benoit@itedo.com
> >>
> >>
> >> Tuesday, October 11, 2005, 10:15:06 AM, Dieter wrote:
> >>
> >> DW> Hi Benoit,
> >>
> >> DW> see inline
> >>
> >> >> -----Original Message-----
> >> >> From: Benoit Bezaire [mailto:benoit@itedo.com]
> >> >> Sent: Tuesday, October 11, 2005 3:48 PM
> >> >> To: cgmo-webcgm@lists.oasis-open.org
> >> >> Subject: Re[2]: [cgmo-webcgm] implications of URI vs. IRI
> >> >>
> >> >> Hi Dieter,
> >> >>
> >> >> You said:
> >> >> NOTE: If we required an escaped string inside the CGM now, this
> >> >> will make almost all existing files invalid ones as soon as a
> >> >> simple space is in a name attribute.
> >> >>
> >> >> You are talking about the 'name' attribute within a URI only,
> >> >> correct?
> >> >> Or, let me rephrase...
> >> >> Files which have a name attribute (containing a space) that is
> >> >> used in a URI become invalid, right?
> >> DW> I am referring to the link destination parameter of a linkuri
> >> DW> attribute.
> >> DW> Yes, something like (pseudo-code)
> >>
> >> DW> linkuri "http://www.cgmopen.org/abc.cgm#name(my name with blank)"
> >> DW> "some title" "_blank"
> >>
> >> DW> would become illegal, and this is the form (without escaping)
> >> DW> that has been used forever in the ATA and WebCGM environment
> >> DW> (almost 10 years now).
> >>
> >> >>
> >> >> I would be in favor of deprecating (i.e., authors should stop
> >> >> creating such files) the old behavior (no escaping) and adding
> >> >> 'a la' SVG wording to the spec. Like Dieter says, but with an
> >> >> emphasis on deprecating the old behavior.
> >> DW> The way I understand the SVG wording is that both forms would
> >> DW> be legal:
> >>
> >> DW> http://www.cgmopen.org/abc.cgm#name(my name with blank)
> >> DW> http://www.cgmopen.org/abc.cgm#name(my%20name%20with%20blank)
> >>
> >> DW> I would NOT deprecate the first form, because it would force us
> >> DW> to build long strings for Japanese or similar characters,
> >> DW> following the rules as described below.
> >>
> >> DW> Do you read the SVG spec the same way, or am I wrong?
> >>
> >> DW> Regards,
> >> DW> Dieter
> >>
> >> >>
> >> >> --
> >> >> Benoit mailto:benoit@itedo.com
> >> >>
> >> >>
> >> >> Thursday, October 6, 2005, 7:52:43 AM, Dieter wrote:
> >> >>
> >> >> DW> All,
> >> >>
> >> >> DW> I am not yet convinced that we are heading in the right
> >> >> DW> direction here.
> >> >>
> >> >> DW> Example:
> >> >> DW> Let's assume we have the string "nihon" inside a linkUri:
> >> >> DW> "id(日本)"
> >> >>
> >> >> DW> using UTF-16 (big endian) this is: 65 e5 67 2c (4 Bytes)
> >> >> DW> converted to UTF-8: EF BB BF E6 97 A5 E6 9C AC (9 Bytes)
> >> >>
> >> >> DW> and then you can apply escaping for all non-ascii chars
> >> >>
> >> >> DW> %EF%BB%BF%E6%97%A5%E6%9C%AC (27 Bytes)
> >> >>
> >> >> DW> and now we store it into the linkURI attribute, however, since
> >> >> DW> somewhere else in the file we have this string in Japanese
> >> >> DW> characters as an ID, all non-graphical strings will be stored
> >> >> DW> as UTF-16 (could be UTF-8 as well):
> >> >>
> >> >> DW> I save the writing, you end up with 54 bytes.
> >> >>
> >> >> DW> So we are moving from 4 bytes to 54 bytes.
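The byte counts in the example above can be checked with a short sketch. This is only an illustration in Python; the helper name percent_escape is made up for the sketch, and the BOM (EF BB BF) is included only because the quoted UTF-8 bytes include it. Without the BOM the same steps give 6, 18 and 36 bytes instead of 9, 27 and 54.

```python
# Sketch only: reproduces the byte counts quoted in the example above.
text = "\u65e5\u672c"  # the two characters written as 65 e5 67 2c in UTF-16BE

utf16 = text.encode("utf-16-be")           # 4 bytes
utf8_with_bom = text.encode("utf-8-sig")   # prepends EF BB BF, as in the example
utf8_plain = text.encode("utf-8")          # same text without the BOM


def percent_escape(data: bytes) -> str:
    """Hypothetical helper: replace every byte with the 3-character string %HH."""
    return "".join(f"%{b:02X}" for b in data)


escaped_with_bom = percent_escape(utf8_with_bom)
escaped_plain = percent_escape(utf8_plain)

for label, escaped in (("with BOM", escaped_with_bom), ("without BOM", escaped_plain)):
    stored = escaped.encode("utf-16-be")   # escaped string stored as UTF-16
    print(label, escaped, len(escaped), "chars,", len(stored), "bytes in UTF-16")
```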
> >> >>
> >> >> DW> I hope that this accurately describes the procedure that has
> >> >> DW> been discussed over the past couple of days.
> >> >>
> >> >> DW> Comparison to SVG:
> >> >> DW> In 5.3.2. [1], SVG says the following:
> >> >>
> >> >> DW> "The value of the href attribute must be a URI reference as
> >> >> DW> defined in [RFC2396], or must result in a URI reference after
> >> >> DW> the escaping procedure described below is applied. The
> >> >> DW> procedure is applied when passing the URI reference to a URI
> >> >> DW> resolver."
> >> >>
> >> >> DW> Interesting to see the last sentence here. IMO this means, it
> >> >> DW> is perfectly legal to store the URI reference using any
> >> >> DW> encoding, as long as it will be transcoded to UTF-8 and
> >> >> DW> escaped before passing it on to a URI resolver.
> >> >>
> >> >> DW> This has always been my understanding, and this is how all of
> >> >> DW> our products have been handling references.
> >> >>
> >> >> DW> NOTE:
> >> >> DW> If we required an escaped string inside the CGM now, this will
> >> >> DW> make almost all existing files invalid ones as soon as a
> >> >> DW> simple space is in a name attribute.
> >> >>
> >> >> DW> RECOMMENDATION:
> >> >> DW> Amend wording slightly to match what SVG is doing and allow
> >> >> DW> for both styles, escaped and not escaped.
> >> >>
> >> >> DW> Comments?
> >> >>
> >> >> DW> Regards,
> >> >> DW> Dieter
> >> >>
> >> >>
> >> >> DW> [1] http://www.w3.org/TR/SVG11/struct.html#xlinkRefAttrs
> >> >>
> >> >>
> >> >> >> -----Original Message-----
> >> >> >> From: Lofton Henderson [mailto:lofton@rockynet.com]
> >> >> >> Sent: Wednesday, October 05, 2005 1:06 AM
> >> >> >> To: Benoit Bezaire; cgmo-webcgm@lists.oasis-open.org
> >> >> >> Subject: Re: [cgmo-webcgm] implications of URI vs. IRI
> >> >> >>
> >> >> >> At 05:09 PM 10/4/2005 -0400, Benoit Bezaire wrote:
> >> >> >> >Hi Lofton,
> >> >> >> >
> >> >> >> >I just did a quick search... I think that URI is only
> >> >> >> >restricting characters to US-ASCII; it has no control on the
> >> >> >> >encoding (utf-8, utf-16 etc...).
> >> >> >> >
> >> >> >> >In XML syntax such as XHTML and SVG, files can have just
> >> >> >> >about any encoding; I'm not aware of any special processing
> >> >> >> >for the xlink:href attribute (i.e., this is a URI, change
> >> >> >> >the encoding to _blah_). It wouldn't make any sense. The
> >> >> >> >scope of the encoding is for the complete document.
> >> >> >> >
> >> >> >> >The above is not a fact, only my understanding.
> >> >> >>
> >> >> >> It matches my understanding. And it is clear that XML and/or
> >> >> >> URI (rfc3986) require "URI escaping" for non-ASCII characters
> >> >> >> in URIs, i.e., for characters that are outside of the ASCII
> >> >> >> repertoire. And this is independent of the character-set
> >> >> >> encoding of the URI.
> >> >> >>
> >> >> >> So finally, a URI from HTML into CGM containing a
> >> >> >> reference-by-name to "my object group" would be written like
> >> >> >> this:
> >> >> >>
> >> >> >> <a href="http://example.org/myCGM.cgm#name(my%20object%20group)">blah</a>
> >> >> >>
> >> >> >> and a WebCGM 'linkuri' first parameter would be this:
> >> >> >>
> >> >> >> http://example.org/myCGM.cgm#name(my%20object%20group)
> >> >> >>
> >> >> >> -Lofton.
> >> >> >>
> >> >> >>
> >> >> >> >Tuesday, September 20, 2005, 2:45:48 PM, Lofton wrote:
> >> >> >> >
> >> >> >> >LH> All --
> >> >> >> >
> >> >> >> >LH> When I was putting together first unicode tests, Dieter
> >> >> >> >LH> also supplied me with this nifty "advanced" test. It gets
> >> >> >> >LH> into Japanese text for SF text like APS ids and names.
> >> >> >> >
> >> >> >> >LH> It highlights an interesting implication of our decision
> >> >> >> >LH> to stick with URI instead of switching to IRI. URI
> >> >> >> >LH> encoding requires that any non-ASCII characters are
> >> >> >> >LH> included by the "URI escaping mechanism", see WebCGM
> >> >> >> >LH> 3.1.1.4 [1], and the more detailed XML description [2].
> >> >> >> >LH> Basically, get the **UTF8** representation of the
> >> >> >> >LH> characters, and replace each byte in that representation
> >> >> >> >LH> by the 3-character string %HH, where HH is the hex
> >> >> >> >LH> representation of the byte.
> >> >> >> >
> >> >> >> >LH> So consider, for example, the 2-character id of the object
> >> >> >> >LH> in the upper-left box, and its use in a link from the
> >> >> >> >LH> object in the upper-right box.
> >> >> >> >
> >> >> >> >LH> If that id were the two characters c1c2, let's suppose that
> >> >> >> >LH> it could be represented by the 4 utf8 bytes b1b2b3b4 (I'm
> >> >> >> >LH> just guessing about "4", since UTF8 is variable length, it
> >> >> >> >LH> could be more). Then to put that id into a URI string, it
> >> >> >> >LH> would have to be the 12-character string:
> >> >> >> >
> >> >> >> >LH> %hh%hh%hh%hh
> >> >> >> >
> >> >> >> >LH> where the hh are the 4 pairs of hex digits that represent
> >> >> >> >LH> the 4 utf8 bytes. I.e., the CGM URI for the link would be:
> >> >> >> >
> >> >> >> >LH> #id(%hh%hh%hh%hh, view_context)
> >> >> >> >
> >> >> >> >LH> Side question. Does URI (rfc3986 [3]) restrict only the
> >> >> >> >LH> character repertoire of the URI, or does it restrict also
> >> >> >> >LH> the encoding?
> >> >> >> >LH> I.e., can a URI be encoded in ascii, isoLatin1, or utf8, or
> >> >> >> >LH> utf16, or whatever, as long as it restricts its repertoire
> >> >> >> >LH> to the URI repertoire? I suspect "yes", but I don't know
> >> >> >> >LH> the answer. It would be interesting for someone to
> >> >> >> >LH> research it.
> >> >> >> >
> >> >> >> >LH> Thoughts?
> >> >> >> >
> >> >> >> >LH> Regards,
> >> >> >> >LH> -Lofton.
> >> >> >> >
> >> >> >> >LH> [1] http://docs.oasis-open.org/webcgm/v2.0/WebCGM20-IC.html#webcgm_3_1_1_4
> >> >> >> >LH> [2] http://www.w3.org/TR/2004/REC-xml-20040204/#sec-external-ent
> >> >> >> >LH> [3] URI: http://www.ietf.org/rfc/rfc3986.txt
> >> >> >> >LH> [4] IRI: http://www.ietf.org/rfc/rfc3987.txt
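
As a rough illustration of the "both forms" question discussed in this thread, the following Python sketch escapes a link only at the point where it is handed to a URI resolver and leaves existing %HH triplets untouched. The function name to_resolver_form and the SAFE character set are assumptions made for this sketch; they are not wording from WebCGM, SVG, or RFC 2396.

```python
from urllib.parse import quote, unquote

# Characters the sketch leaves alone: "%" (so existing %HH triplets survive)
# plus the punctuation used in the thread's examples. This set is an
# assumption for illustration, not taken from any specification.
SAFE = "%:/#(),"


def to_resolver_form(link: str) -> str:
    """Hypothetical helper: escape a stored link, keeping existing %HH as-is."""
    return quote(link, safe=SAFE)


plain = "http://www.cgmopen.org/abc.cgm#name(my name with blank)"
escaped = "http://www.cgmopen.org/abc.cgm#name(my%20name%20with%20blank)"

# Both stored forms end up as the same string for the resolver.
print(to_resolver_form(plain))
print(to_resolver_form(escaped))

# The ambiguity: a literal 'name' value "a%20b" is read as the escaped "a b".
print(unquote(to_resolver_form("#name(a%20b)")))

# The other reading treats "%" as data and escapes it as %25.
print(quote("a%20b", safe=""))
```

Under this rule an object whose 'name' is literally a%20b cannot be addressed, which is the caveat to generators raised earlier in the thread; treating "%" as data avoids that but double-escapes links that were already escaped.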