Subject: Re[6]: [cgmo-webcgm] implications of URI vs. IRI
Hi Lofton,

I think that some of your questions are answered in section 2.4.2 of RFC 2396:

   2.4.2. When to Escape and Unescape

   A URI is always in an "escaped" form, since escaping or unescaping a
   completed URI might change its semantics. [...]

   Because the percent "%" character always has the reserved purpose of
   being the escape indicator, it must be escaped as "%25" in order to
   be used as data within a URI. Implementers should be careful not to
   escape or unescape the same string more than once, since unescaping
   an already unescaped string might lead to misinterpreting a percent
   data character as another escaped character, or vice versa in the
   case of escaping an already escaped string.

One last comment: this is _again_ a three-way conversation (Lofton,
Dieter and myself)... Everyone should be involved in this conversation
(users and implementers, what do you want?); you are all affected by
this. We want a 'valid' solution that will cause little disruption to
WebCGM 1.0 content; let's try to work towards that goal.

Regards,

--
Benoit mailto:benoit@itedo.com

Tuesday, October 11, 2005, 7:05:19 PM, Lofton wrote:

LH> More...

LH> I am giving some more thought to the ambiguity problem about
LH> "both" (i.e., both forms allowed in the fragment, linkuri, etc.,
LH> a la SVG).

LH> Firstly, a possible solution. One could always add a rule for CGM
LH> interpreters, that any %hh 3-tuple in a fragment (or linkuri 1st
LH> parameter, or ...) will be taken by the CGM interpreter as a URI
LH> escaping sequence. So caveat to WebCGM generators ... although the
LH> 'name' ApsAttr might allow something like that as part of the
LH> 'name' value, you had better not do it, because you will create an
LH> ambiguity when you use that 'name' value in a fragment (or linkuri,
LH> DOM, XCF) and will NOT get the result you want.

LH> Secondly...
LH> There is still something about the SVG sentence that bothers me,
LH> "...must be a URI reference as defined in [RFC2396], or must result
LH> in a URI reference after the escaping procedure described below is
LH> applied". Specifically, was the *first* phrase ("must be a URI
LH> reference as defined in [RFC2396]") meant to include the case(s):

LH> 1.) it is all safe ASCII in its original data form, with no URI escaping needed or present?
LH> 2.) or was it maybe unsafe, but is already URI escaped?
LH> 3.) or both?

LH> e1 illustrates #1 (all safe, no problem characters, no escaping
LH> needed or done). e2 illustrates #2 (already escaped).

LH> e1) <image href="rasterImage.png" .../>
LH> e2) <image href="raster%20image.png" .../>

LH> Are both valid in SVG?

LH> I'm going to reread 2396 again. Chapter 2 talks about all this
LH> stuff (as well as questions like local encoding), but it is not
LH> light reading. I'm also thinking to ask Chris about his memory of
LH> the sentence, particularly the intent of its first phrase.

LH> -Lofton.

LH> At 01:10 PM 10/11/2005 -0600, Lofton Henderson wrote:

LH> At 05:20 PM 10/11/2005 +0200, Dieter Weidenbrück wrote:
LH> [...]
LH> good, and agreed.

LH> Not so fast!

LH> Actually, I do agree that we should use the SVG interpretation, if
LH> possible. I'm not sure how we ended up differently, since Chris was
LH> consulting on and helping with this detail (it might be the time
LH> difference -- 1999 for WebCGM 1.0 versus 2001 for SVG -- Chris and
LH> SVG might have figured it out properly in those two years).

LH> My problem is: exactly how to do it. One logical method might be
LH> an erratum on 1.0 -- logical because we ended up diverging from SVG
LH> 1.0 on that detail, and didn't intend to. (Would require some
LH> action within W3C, to update the errata file that is linked from
LH> the Status section of the WebCGM 1.0 Recommendation.) An erratum
LH> (in the "both" direction) would mean that both forms are valid 1.0
LH> content, from the very beginning.

LH> Another possibility: fix the language for 2.0, so that "both" are
LH> allowed from 2.0 on. (This makes 1.0 content problematic, if both
LH> forms have been used.)

LH> About the question of "both"...

>> The sentence '...must be a URI reference as defined in
>> [RFC2396], or must result in a URI reference after the escaping
>> procedure described below is applied"
>>
>> DW> The way I understand the SVG wording is that both forms would be legal:
>>
>> DW> http://www.cgmopen.org/abc.cgm#name(my name with blank)
>> DW> http://www.cgmopen.org/abc.cgm#name(my name%20with%20blank)

LH> Rfc2396 makes it clear (section 2.3 and 2.4) that the presence of
LH> % should tell a URI resolver that URI escaping is in effect -- %
LH> isn't a valid reserved (delimiter or subdelimiter) character, nor a
LH> valid unreserved character, for the URI.

LH> However, % is a valid character in the repertoire of the 'name'
LH> ApsAttr, right? So "%myFunnyName%" is a valid 'name' ApsAttr in a
LH> WebCGM instance, right? And the 3-character "%20" is a valid 'name'
LH> ApsAttr, right?

LH> So if WebCGM allowed "both", and you encountered a fragment:

LH> #name(a%20b),

LH> what would you give to the URI resolver? Two choices:

LH> a%20b [assumes that the generator already applied uri-escaping]
LH> a%2520b [assumes that the generator did NOT uri-escape already]

LH> [btw, hex for % is 0x25, so % as an actual URI character is given to the URI resolver as %25]

LH> Thoughts? (This gives me a headache!)

LH> -Lofton.

LH> One more comment:
LH> Spaces in "name" attributes were allowed long before any linkURI
LH> and/or XML rules existed, thus nobody ever thought about this
LH> detail. Everything was stored in the CGM as the rules for
LH> non-graphical strings mandated.
LH> One could say that this could have been clarified in WebCGM 1.0;
LH> however, I find it quite useful to have both forms available.

LH> Dieter

>> -----Original Message-----
>> From: Benoit Bezaire [mailto:benoit@itedo.com]
>> Sent: Tuesday, October 11, 2005 5:07 PM
>> To: cgmo-webcgm@lists.oasis-open.org
>> Subject: Re[4]: [cgmo-webcgm] implications of URI vs. IRI
>>
>> Hi Dieter,
>>
>> Thanks for the example, we are talking about the same thing.
>>
>> I understand that ATA and WebCGM have allowed spaces in URI
>> fragments for the last 10 years, but by my reading of RFC2396,
>> those linkuris are illegal. Here is a quote from Section 4.1 of
>> http://www.ietf.org/rfc/rfc2396.txt
>> "The character restrictions described in Section 2 for URI
>> also apply to the fragment in a URI-reference."
>>
>> And by reading Section 2, you end up reading that spaces are
>> not allowed.
>>
>> That being said, your interpretation of the SVG wording
>> sounds acceptable. The sentence 'or must result in a URI
>> reference after the escaping procedure' seems to be saving
>> us! I'm in favor of adding wording to the spec to clarify
>> this issue (the 3 bullet wording would be good also).
>>
>> I no longer have a preference as to whether we should deprecate
>> or not. On one side, I think that this is a can of worms and
>> forcing escaping simplifies things; on the other, I agree that
>> long %HH strings for Asian names are not ideal.
>>
>> Allowing both is probably the least painful approach for users
>> and implementers at this time.
>>
>> Regards,
>>
>> --
>> Benoit mailto:benoit@itedo.com
>>
>>
>> Tuesday, October 11, 2005, 10:15:06 AM, Dieter wrote:
>>
>> DW> Hi Benoit,
>>
>> DW> see inline
>>
>> >> -----Original Message-----
>> >> From: Benoit Bezaire [mailto:benoit@itedo.com]
>> >> Sent: Tuesday, October 11, 2005 3:48 PM
>> >> To: cgmo-webcgm@lists.oasis-open.org
>> >> Subject: Re[2]: [cgmo-webcgm] implications of URI vs. IRI
>> >>
>> >> Hi Dieter,
>> >>
>> >> You said:
>> >> NOTE: If we required an escaped string inside the CGM now, this will
>> >> make almost all existing files invalid ones as soon as a simple space
>> >> is in a name attribute.
>> >>
>> >> You are talking about the 'name' attribute within a URI only, correct?
>> >> Or, let me rephrase...
>> >> Files which have a name attribute (containing a space) that is used
>> >> in a URI become invalid, right?
>> DW> I am referring to the link destination parameter of a linkuri attribute.
>> DW> Yes, something like (pseudo-code)
>>
>> DW> linkuri "http://www.cgmopen.org/abc.cgm#name(my name with blank)"
>> DW> "some title" "_blank"
>>
>> DW> would become illegal, and this is the form (without escaping) that
>> DW> has been used forever in the ATA and WebCGM environment (almost 10 years now).
>>
>> >>
>> >> I would be in favor of deprecating (i.e., authors should stop
>> >> creating such files) the old behavior (no escaping) and adding 'a la'
>> >> SVG wording to the spec. Like Dieter says, but with an emphasis on
>> >> deprecating the old behavior.
>> DW> The way I understand the SVG wording is that both forms would be legal:
>>
>> DW> http://www.cgmopen.org/abc.cgm#name(my name with blank)
>> DW> http://www.cgmopen.org/abc.cgm#name(my name%20with%20blank)
>>
>> DW> I would NOT deprecate the first form, because it would force us to
>> DW> build long strings for Japanese or similar characters, following the
>> DW> rules as described below.
>>
>> DW> Do you read the SVG spec the same way, or am I wrong?
>>
>> DW> Regards,
>> DW> Dieter
>>
>> >>
>> >> --
>> >> Benoit mailto:benoit@itedo.com
>> >>
>> >>
>> >> Thursday, October 6, 2005, 7:52:43 AM, Dieter wrote:
>> >>
>> >> DW> All,
>> >>
>> >> DW> I am not yet convinced that we are heading in the right direction here.
>> >>
>> >> DW> Example:
>> >> DW> Let's assume we have the string "nihon" inside a linkUri: "id(日本)"
>> >>
>> >> DW> using UTF-16 (big endian) this is: 65 e5 67 2c (4 Bytes)
>> >> DW> converted to UTF-8: EF BB BF E6 97 A5 E6 9C AC (9 Bytes)
>> >>
>> >> DW> and then you can apply escaping for all non-ascii chars
>> >>
>> >> DW> %EF%BB%BF%E6%97%A5%E6%9C%AC (27 Bytes)
>> >>
>> >> DW> and now we store it into the linkURI attribute, however, since
>> >> DW> somewhere else in the file we have this string in Japanese
>> >> DW> characters as an ID, all non-graphical strings will be stored as
>> >> DW> UTF-16 (could be UTF-8 as well):
>> >>
>> >> DW> I'll spare you the writing; you end up with 54 bytes.
>> >>
>> >> DW> So we are moving from 4 bytes to 54 bytes.
>> >>
>> >> DW> I hope that this accurately describes the procedure that has been
>> >> DW> discussed over the past couple of days.
>> >>
>> >> DW> Comparison to SVG:
>> >> DW> In 5.3.2. [1], SVG says the following:
>> >>
>> >> DW> "The value of the href attribute must be a URI reference as defined
>> >> DW> in [RFC2396], or must result in a URI reference after the escaping
>> >> DW> procedure described below is applied. The procedure is applied when
>> >> DW> passing the URI reference to a URI resolver."
>> >>
>> >> DW> Interesting to see the last sentence here. IMO this means it is
>> >> DW> perfectly legal to store the URI reference using any encoding, as
>> >> DW> long as it will be transcoded to UTF-8 and escaped before
>> >> DW> passing it on to a URI resolver.
>> >>
>> >> DW> This has always been my understanding, and this is how all of our
>> >> DW> products have been handling references.
>> >>
>> >> DW> NOTE:
>> >> DW> If we required an escaped string inside the CGM now, this will make
>> >> DW> almost all existing files invalid ones as soon as a simple space is
>> >> DW> in a name attribute.
>> >>
>> >> DW> RECOMMENDATION:
>> >> DW> Amend wording slightly to match what SVG is doing and allow for
>> >> DW> both styles, escaped and not escaped.
>> >>
>> >> DW> Comments?
>> >>
>> >> DW> Regards,
>> >> DW> Dieter
>> >>
>> >>
>> >> DW> [1] http://www.w3.org/TR/SVG11/struct.html#xlinkRefAttrs
>> >>
>> >>
>> >> >> -----Original Message-----
>> >> >> From: Lofton Henderson [mailto:lofton@rockynet.com]
>> >> >> Sent: Wednesday, October 05, 2005 1:06 AM
>> >> >> To: Benoit Bezaire; cgmo-webcgm@lists.oasis-open.org
>> >> >> Subject: Re: [cgmo-webcgm] implications of URI vs. IRI
>> >> >>
>> >> >> At 05:09 PM 10/4/2005 -0400, Benoit Bezaire wrote:
>> >> >> >Hi Lofton,
>> >> >> >
>> >> >> >I just did a quick search... I think that URI is only restricting
>> >> >> >characters to US-ASCII; it has no control on the encoding (utf-8,
>> >> >> >utf-16 etc...).
>> >> >> >
>> >> >> >In XML syntax such as XHTML and SVG, files can have just about any
>> >> >> >encoding; I'm not aware of any special processing for the xlink:href
>> >> >> >attribute (i.e., this is a URI, change the encoding to _blah_). It
>> >> >> >wouldn't make any sense. The scope of the encoding is for the complete
>> >> >> >document.
>> >> >> >
>> >> >> >The above is not a fact, only my understanding.
>> >> >>
>> >> >> It matches my understanding. And it is clear that XML and/or URI
>> >> >> (rfc3986) require "URI escaping" for non-ASCII characters in URIs,
>> >> >> i.e., for characters that are outside of the ASCII repertoire. And
>> >> >> this is independent of the character-set encoding of the URI.
>> >> >>
>> >> >> So finally, a URI from HTML into CGM containing a reference-by-name
>> >> >> to "my object group" would be written like this:
>> >> >>
>> >> >> <a href="http://example.org/myCGM.cgm#name(my%20object%20group)">blah</a>
>> >> >>
>> >> >> and a WebCGM 'linkuri' first parameter would be this:
>> >> >>
>> >> >> http://example.org/myCGM.cgm#name(my%20object%20group)
>> >> >>
>> >> >> -Lofton.
>> >> >>
>> >> >>
>> >> >> >Tuesday, September 20, 2005, 2:45:48 PM, Lofton wrote:
>> >> >> >
>> >> >> >LH> All --
>> >> >> >
>> >> >> >LH> When I was putting together first unicode tests, Dieter also
>> >> >> >LH> supplied me with this nifty "advanced" test. It gets into Japanese
>> >> >> >LH> text for SF text like APS ids and names.
>> >> >> >
>> >> >> >LH> It highlights an interesting implication of our decision to stick
>> >> >> >LH> with URI instead of switching to IRI. URI encoding requires that
>> >> >> >LH> any non-ASCII characters are included by the "URI escaping
>> >> >> >LH> mechanism", see WebCGM 3.1.1.4 [1], and the more detailed XML
>> >> >> >LH> description [2]. Basically, get the **UTF8** representation of the
>> >> >> >LH> characters, and replace each byte in that representation by the
>> >> >> >LH> 3-character string %HH, where HH is the hex representation of the
>> >> >> >LH> byte.
>> >> >> >
>> >> >> >LH> So consider, for example, the 2-character id of the object in
>> >> >> >LH> the upper-left box, and its use in a link from the object in the
>> >> >> >LH> upper-right box.
>> >> >> >
>> >> >> >LH> If that id were the two characters c1c2, let's suppose that it could
>> >> >> >LH> be represented by the 4 utf8 bytes b1b2b3b4 (I'm just guessing
>> >> >> >LH> about "4", since UTF8 is variable length, it could be more). Then
>> >> >> >LH> to put that id into a URI string, it would have to be the
>> >> >> >LH> 12-character string:
>> >> >> >
>> >> >> >LH> %hh%hh%hh%hh
>> >> >> >
>> >> >> >LH> where the hh are the 4 pairs of hex digits that represent the 4
>> >> >> >LH> utf8 bytes. I.e., the CGM URI for the link would be:
>> >> >> >
>> >> >> >LH> #id(%hh%hh%hh%hh, view_context)
>> >> >> >
>> >> >> >LH> Side question. Does URI (rfc3986 [3]) restrict only the character
>> >> >> >LH> repertoire of the URI, or does it restrict also the encoding? I.e.,
>> >> >> >LH> can a URI be encoded in ascii, isoLatin1, or utf8, or utf16, or
>> >> >> >LH> whatever, as long as it restricts its repertoire to the URI
>> >> >> >LH> repertoire? I suspect "yes", but I don't know the answer. It would
>> >> >> >LH> be interesting for someone to research it.
>> >> >> >
>> >> >> >LH> Thoughts?
>> >> >> >
>> >> >> >LH> Regards,
>> >> >> >LH> -Lofton.
>> >> >> >
>> >> >> >LH> [1] http://docs.oasis-open.org/webcgm/v2.0/WebCGM20-IC.html#webcgm_3_1_1_4
>> >> >> >LH> [2] http://www.w3.org/TR/2004/REC-xml-20040204/#sec-external-ent
>> >> >> >LH> [3] URI: http://www.ietf.org/rfc/rfc3986.txt
>> >> >> >LH> [4] IRI: http://www.ietf.org/rfc/rfc3987.txt
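
For anyone who wants to experiment with the cases discussed in this thread, here is a minimal sketch in Python. It is only an illustration, not proposed spec behavior: the function names escape_always and escape_if_needed are made up for this note, and the "both forms" rule is approximated by simply leaving every "%" untouched. It shows the #name(a%20b) ambiguity and reproduces the 4-byte to 54-byte growth of the "nihon" example (that example's byte counts include a leading EF BB BF, a UTF-8 byte order mark; without it the escaped string is 18 characters).

```python
from urllib.parse import quote, unquote


def escape_always(name: str) -> str:
    """Treat the input as raw data (hypothetical helper).

    Every character outside the URI unreserved set, including "%",
    is converted to UTF-8 and replaced by %HH sequences.
    """
    return quote(name, safe="", encoding="utf-8")


def escape_if_needed(name: str) -> str:
    """Approximate the "both forms" reading (hypothetical helper).

    "%" is left alone, so existing %HH sequences pass through and
    only the remaining disallowed characters are escaped.
    """
    return quote(name, safe="%", encoding="utf-8")


if __name__ == "__main__":
    # An unescaped name with spaces, as used in existing WebCGM files.
    print(escape_always("my name with blank"))

    # The ambiguity: is "a%20b" a literal name or an escaped "a b"?
    print(escape_always("a%20b"))      # generator did NOT escape: a%2520b
    print(escape_if_needed("a%20b"))   # generator already escaped: a%20b
    print(unquote("a%20b"), "|", unquote("a%2520b"))

    # The "nihon" example, without and with the byte order mark.
    nihon = "\u65e5\u672c"
    print(len(nihon.encode("utf-16-be")), "bytes as UTF-16")
    print(escape_always(nihon))
    with_bom = escape_always("\ufeff" + nihon)
    print(with_bom, len(with_bom.encode("utf-16-be")), "bytes as UTF-16")
```

Under the "both forms" approximation, a 'name' whose literal value is the three characters %20 cannot be told apart from an escaped space, which is exactly the caveat to WebCGM generators raised above.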