Subject: Re[6]: [cgmo-webcgm] implications of URI vs. IRI
Hi Lofton,
I think that some of your questions are answered in 2.4.2:
2.4.2. When to Escape and Unescape
A URI is always in an "escaped" form, since escaping or unescaping a
completed URI might change its semantics.
[...]
Because the percent "%" character always has the reserved purpose of
being the escape indicator, it must be escaped as "%25" in order to
be used as data within a URI. Implementers should be careful not to
escape or unescape the same string more than once, since unescaping
an already unescaped string might lead to misinterpreting a percent
data character as another escaped character, or vice versa in the
case of escaping an already escaped string.
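The hazard described in 2.4.2 can be reproduced in a few lines. This is only a minimal sketch using Python's standard urllib.parse module; the sample string is an assumption chosen for illustration and is not taken from any WebCGM file.

```python
from urllib.parse import quote, unquote

# Hypothetical data string that contains a literal percent sign.
data = "a%20b"

# Escaping once turns the data "%" into "%25".
once = quote(data, safe="")
# Escaping the already escaped string changes it a second time.
twice = quote(once, safe="")

# One unescape restores the data; a second one misreads "%20" as a space.
restored = unquote(once)
over_unescaped = unquote(restored)

print(once)            # a%2520b
print(twice)           # a%252520b
print(restored)        # a%20b
print(over_unescaped)  # a b
```

Neither operation is idempotent, which is why a string has to be escaped or unescaped exactly once.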
One last comment: this is _again_ a three-way conversation (Lofton,
Dieter and myself). Everyone should be involved in this conversation
(users and implementers: what do you want?), since you are all affected
by this. We want a 'valid' solution that causes little disruption to
WebCGM 1.0 content; let's try to work towards that goal.
Regards,
--
Benoit mailto:benoit@itedo.com
Tuesday, October 11, 2005, 7:05:19 PM, Lofton wrote:
LH> More...
LH> I am giving some more thought to the ambiguity problem about
LH> "both" (i.e., both forms allowed in the fragment, linkuri, etc.,
LH> a la SVG).
LH> Firstly, a possible solution. One could always add a rule for
LH> CGM interpreters: any %hh 3-tuple in a fragment (or linkuri 1st
LH> parameter, or ...) will be taken by the CGM interpreter as a URI
LH> escaping sequence. So, a caveat to WebCGM generators: although the
LH> 'name' ApsAttr might allow something like that as part of the
LH> 'name' value, you had better not do it, because you will create an
LH> ambiguity when you use that 'name' value in a fragment (or linkuri,
LH> DOM, XCF) and will NOT get the result you want.
LH> Secondly...
LH> There is still something about the SVG sentence that bothers me:
LH> "...must be a URI reference as defined in [RFC2396], or must
LH> result in a URI reference after the escaping procedure described
LH> below is applied". Specifically, was the *first* phrase ("must be
LH> a URI reference as defined in [RFC2396]") meant to include the
LH> case(s):
LH> 1.) it is all safe ASCII in its original data form, with no URI escaping needed or present?
LH> 2.) or was it maybe unsafe, but is already URI escaped?
LH> 3.) or both?
LH> e1 illustrates #1 (all safe, no problem characters, no escaping
LH> needed or done). e2 illustrates #2 (already escaped).
LH> e1) <image href="rasterImage.png" .../>
LH> e2) <image href="raster%20image.png" .../>
LH> Are both valid in SVG?
LH> I'm going to reread 2396 again. Chapter 2 talks about all this
LH> stuff (as well as questions like local encoding), but it is not
LH> light reading. I'm also thinking of asking Chris about his memory
LH> of the sentence, particularly the intent of its first phrase.
LH> -Lofton.
LH> At 01:10 PM 10/11/2005 -0600, Lofton Henderson wrote:
LH> At 05:20 PM 10/11/2005 +0200, Dieter Weidenbrück wrote:
LH> [...]
LH> good, and agreed.
LH> Not so fast!
LH> Actually, I do agree that we should use the SVG interpretation,
LH> if possible. I'm not sure how we ended up differently, since
LH> Chris was consulting on and helping with this detail (it might be
LH> the time difference, 1999 for WebCGM 1.0 versus 2001 for SVG:
LH> Chris and SVG might have figured it out properly in those two
LH> years).
LH> My problem is: exactly how to do it. One logical method might be
LH> an erratum on 1.0, logical because we ended up diverging from SVG
LH> 1.0 on that detail, and didn't intend to. (This would require some
LH> action within W3C, to update the errata file that is linked from
LH> the Status section of the WebCGM 1.0 Recommendation.) An erratum
LH> (in the "both" direction) would mean that both forms are valid 1.0
LH> content, from the very beginning.
LH> Another possibility: fix the language for 2.0, so that "both" are
LH> allowed from 2.0 on. (This makes 1.0 content problematic, if both
LH> forms have been used.)
LH> About the question of "both"...
>> The sentence "...must be a URI reference as defined in
>> [RFC2396], or must result in a URI reference after the escaping
>> procedure described below is applied"
>>
>> DW> The way I understand the SVG wording is that both forms would be legal:
>>
>> DW> http://www.cgmopen.org/abc.cgm#name(my name with blank)
>> DW> http://www.cgmopen.org/abc.cgm#name(my%20name%20with%20blank)
LH> Rfc2396 makes it clear (section 2.3 and 2.4) that the
LH> presence of % should tell a URI resolver that URI escaping is in
LH> effect -- % isn't a valid reserved (delimiter or subdelimiter)
LH> character, nor a valid unreserved character, for the URI.
LH> However, % is a valid character in the repertoire of the
LH> 'name' ApsAttr, right? So "%myFunnyName%" is a valid 'name'
LH> APSattr in a WebCGM instance, right? And the 3-character "%20" is
LH> a valid 'name' ApsAttr, right?
LH> So if WebCGM allowed "both", and you encountered a fragment:
LH> #name(a%20b) ,
LH> what would you give to the URI resolver? Two choices:
LH> a%20b [assumes that the generator already applied uri-escaping]
LH> a%2520b [assumes that generator did NOT uri-escape already]
LH> [btw, hex for % is 0x25, so % as an actual URI character is given to URI resolver as %25]
LH> Thoughts? (This gives me a headache!)
LH> -Lofton.
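The two choices above can be made concrete with a small sketch. It assumes Python's standard urllib.parse module; the function name and its boolean flag are hypothetical and only model the two assumptions an interpreter could make about the generator.

```python
from urllib.parse import quote, unquote


def what_the_resolver_sees(fragment_arg, generator_already_escaped):
    """Return (string handed to the URI resolver, 'name' value it denotes)."""
    if generator_already_escaped:
        # Reading 1: the %hh sequences are URI escapes; pass them through.
        escaped = fragment_arg
    else:
        # Reading 2: the text is raw data, so "%" itself must become "%25".
        escaped = quote(fragment_arg, safe="")
    return escaped, unquote(escaped)


print(what_the_resolver_sees("a%20b", True))   # ('a%20b', 'a b')
print(what_the_resolver_sees("a%20b", False))  # ('a%2520b', 'a%20b')
```

The same five characters select the object named "a b" under one reading and the object named "a%20b" under the other, so an interpreter cannot tell which was meant without an extra rule.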
LH> One more comment:
LH> Spaces in "name" attributes were allowed long before any linkURI
LH> and/or XML rules existed, thus nobody ever thought about this
LH> detail. Everything was stored in the CGM as the rules for
LH> non-graphical strings mandated.
LH> One could say that this could have been clarified in WebCGM 1.0;
LH> however, I find it quite useful to have both forms available.
LH> Dieter
>> -----Original Message-----
>> From: Benoit Bezaire [mailto:benoit@itedo.com]
>> Sent: Tuesday, October 11, 2005 5:07 PM
>> To: cgmo-webcgm@lists.oasis-open.org
>> Subject: Re[4]: [cgmo-webcgm] implications of URI vs. IRI
>>
>> Hi Dieter,
>>
>> Thanks for the example, we are talking about the same thing.
>>
>> I understand that ATA and WebCGM have allowed spaces in URI
>> fragments for the last 10 years, but by my interpretation
>> of RFC2396, those linkuris are illegal. Here is a quote from
>> Section 4.1 of http://www.ietf.org/rfc/rfc2396.txt
>> "The character restrictions described in Section 2 for URI
>> also apply to the fragment in a URI-reference."
>>
>> And by reading Section 2, you end up reading that spaces are
>> not allowed.
>>
>> That being said, your interpretation of the SVG wording
>> sounds acceptable. The sentence 'or must result in a URI
>> reference after the escaping procedure' seems to be saving
>> us! I'm in favor of adding wording to the spec to clarify
>> this issue (the 3 bullet wording would be good also).
>>
>> I no longer have a preference on whether we should deprecate.
>> On one side, I think that this is a can of worms and forcing
>> escaping simplifies things; on the other, I agree that long
>> %HH sequences for Asian names are not ideal.
>>
>> Allowing both is probably the least painful approach for users
>> and implementers at this time.
>>
>> Regards,
>>
>> --
>> Benoit mailto:benoit@itedo.com
>>
>>
>> Tuesday, October 11, 2005, 10:15:06 AM, Dieter wrote:
>>
>> DW> Hi Benoit,
>>
>> DW> see inline
>>
>> >> -----Original Message-----
>> >> From: Benoit Bezaire [mailto:benoit@itedo.com]
>> >> Sent: Tuesday, October 11, 2005 3:48 PM
>> >> To: cgmo-webcgm@lists.oasis-open.org
>> >> Subject: Re[2]: [cgmo-webcgm] implications of URI vs. IRI
>> >>
>> >> Hi Dieter,
>> >>
>> >> You said:
>> >> NOTE: If we required an escaped string inside the CGM now, this will
>> >> make almost all existing files invalid ones as soon as a simple space
>> >> is in a name attribute.
>> >>
>> >> You are talking about the 'name' attribute within a URI only, correct?
>> >> Or, let me rephrase...
>> >> Files which have a name attribute (containing a space) that is used
>> >> in a URI become invalid, right?
>> DW> I am referring to the link destination parameter of a linkuri attribute.
>> DW> Yes, something like (pseudo-code)
>>
>> DW> linkuri "http://www.cgmopen.org/abc.cgm#name(my name with blank)"
>> DW> "some title" "_blank"
>>
>> DW> would become illegal, and this is the form (without escaping) that
>> DW> has been used forever in the ATA and WebCGM environment
>> DW> (almost 10 years now).
>>
>> >>
>> >> I would be in favor of deprecating (i.e., authors should stop
>> >> creating such files) the old behavior (no escaping) and adding 'a la'
>> >> SVG wording to the spec. Like Dieter says, but with an emphasis on
>> >> deprecating the old behavior.
>> DW> The way I understand the SVG wording is that both forms would be legal:
>>
>> DW> http://www.cgmopen.org/abc.cgm#name(my name with blank)
>> DW> http://www.cgmopen.org/abc.cgm#name(my%20name%20with%20blank)
>>
>> DW> I would NOT deprecate the first form, because it would force us to
>> DW> build long strings for Japanese or similar characters, following the
>> DW> rules as described below.
>>
>> DW> Do you read the SVG spec the same way, or am I wrong?
>>
>> DW> Regards,
>> DW> Dieter
>>
>> >>
>> >> --
>> >> Benoit mailto:benoit@itedo.com
>> >>
>> >>
>> >> Thursday, October 6, 2005, 7:52:43 AM, Dieter wrote:
>> >>
>> >> DW> All,
>> >>
>> >> DW> I am not yet convinced that we are heading in the right direction here.
>> >>
>> >> DW> Example:
>> >> DW> Let's assume we have the string "nihon" inside a linkUri: "id(日本)"
>> >>
>> >> DW> using UTF-16 (big endian) this is: 65 E5 67 2C (4 bytes)
>> >> DW> converted to UTF-8 (with a leading byte order mark, EF BB BF):
>> >> DW> EF BB BF E6 97 A5 E6 9C AC (9 bytes)
>> >>
>> >> DW> and then you can apply escaping for all non-ascii chars
>> >>
>> >> DW> %EF%BB%BF%E6%97%A5%E6%9C%AC (27 Bytes)
>> >>
>> >> DW> and now we store it into the linkURI attribute. However, since
>> >> DW> somewhere else in the file we have this string in Japanese
>> >> DW> characters as an ID, all non-graphical strings will be stored as
>> >> DW> UTF-16 (could be UTF-8 as well):
>> >>
>> >> DW> I will spare you the writing; you end up with 54 bytes.
>> >>
>> >> DW> So we are moving from 4 bytes to 54 bytes.
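The byte counts in this example can be checked mechanically. The sketch below uses only Python's standard library; the variable names are illustrative. It also shows the count without the byte order mark, which is an addition for comparison and not part of the example above.

```python
import codecs
from urllib.parse import quote

text = "日本"

utf16 = text.encode("utf-16-be")           # 65 E5 67 2C, no byte order mark
utf8 = text.encode("utf-8")                # E6 97 A5 E6 9C AC
utf8_with_bom = codecs.BOM_UTF8 + utf8     # EF BB BF prepended, as in the example

escaped = quote(utf8_with_bom, safe="")    # one %HH triple per byte
stored = escaped.encode("utf-16-be")       # escaped text stored as UTF-16

escaped_no_bom = quote(utf8, safe="")      # comparison: same steps without the BOM

print(len(utf16), len(utf8_with_bom), len(escaped), len(stored))  # 4 9 27 54
print(escaped)                                                    # %EF%BB%BF%E6%97%A5%E6%9C%AC
print(len(utf8), len(escaped_no_bom))                             # 6 18
```

Without the byte order mark the escaped form is 18 characters (36 bytes when stored as UTF-16), so the growth is still roughly a factor of nine over the original 4 bytes.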
>> >>
>> >> DW> I hope that this accurately describes the procedure
>> that has been
>> >> DW> discussed over the past couple of days.
>> >>
>> >> DW> Comparison to SVG:
>> >> DW> In 5.3.2. [1], SVG says the following:
>> >>
>> >> DW> "The value of the href attribute must be a URI reference as defined
>> >> DW> in [RFC2396], or must result in a URI reference after the escaping
>> >> DW> procedure described below is applied. The procedure is applied when
>> >> DW> passing the URI reference to a URI resolver."
>> >>
>> >> DW> Interesting to see the last sentence here. IMO this means it is
>> >> DW> perfectly legal to store the URI reference using any encoding, as
>> >> DW> long as it will be transcoded to UTF-8 and escaped before passing
>> >> DW> it on to a URI resolver.
>> >>
>> >> DW> This has always been my understanding, and this is how all of our
>> >> DW> products have been handling references.
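A rough sketch of "escape when passing to the resolver", in the spirit of the SVG sentence quoted above, could look like the following. The function name and the exact set of characters left untouched are assumptions made for illustration; the authoritative list is whatever the SVG/XLink text specifies. Python strings are already decoded, so the stored encoding (UTF-16, UTF-8, ...) does not matter at this point.

```python
from urllib.parse import quote

# Assumption: leave the RFC 2396 reserved characters and marks alone, plus
# "#", "%", "[" and "]", so that existing %HH escapes are not escaped again.
KEEP = ";/?:@&=+$,#%[]!*'()"


def escape_for_resolver(reference):
    """Percent-escape spaces and non-ASCII characters via their UTF-8 bytes."""
    return quote(reference, safe=KEEP, encoding="utf-8")


raw = "http://www.cgmopen.org/abc.cgm#name(my name with blank)"
pre_escaped = "http://www.cgmopen.org/abc.cgm#name(my%20name%20with%20blank)"

print(escape_for_resolver(raw))          # the escaped form
print(escape_for_resolver(pre_escaped))  # unchanged
print(escape_for_resolver("id(日本)"))    # id(%E6%97%A5%E6%9C%AC)
```

Because "%" is left alone, both stored forms end up as the same string at the resolver. The price is that a literal "%" in a name can no longer be told apart from an escape.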
>> >>
>> >> DW> NOTE:
>> >> DW> If we required an escaped string inside the CGM now, this will make
>> >> DW> almost all existing files invalid as soon as a simple space is
>> >> DW> in a name attribute.
>> >>
>> >> DW> RECOMMENDATION:
>> >> DW> Amend the wording slightly to match what SVG is doing and allow for
>> >> DW> both styles, escaped and not escaped.
>> >>
>> >> DW> Comments?
>> >>
>> >> DW> Regards,
>> >> DW> Dieter
>> >>
>> >>
>> >> DW> [1] http://www.w3.org/TR/SVG11/struct.html#xlinkRefAttrs
>> >>
>> >>
>> >> >> -----Original Message-----
>> >> >> From: Lofton Henderson [mailto:lofton@rockynet.com]
>> >> >> Sent: Wednesday, October 05, 2005 1:06 AM
>> >> >> To: Benoit Bezaire; cgmo-webcgm@lists.oasis-open.org
>> >> >> Subject: Re: [cgmo-webcgm] implications of URI vs. IRI
>> >> >>
>> >> >> At 05:09 PM 10/4/2005 -0400, Benoit Bezaire wrote:
>> >> >> >Hi Lofton,
>> >> >> >
>> >> >> >I just did a quick search... I think that URI only restricts
>> >> >> >characters to US-ASCII; it has no control over the encoding (utf-8,
>> >> >> >utf-16 etc...).
>> >> >> >
>> >> >> >In XML syntax such as XHTML and SVG, files can have just about any
>> >> >> >encoding; I'm not aware of any special processing for the xlink:href
>> >> >> >attribute (i.e., this is a URI, change the encoding to _blah_). It
>> >> >> >wouldn't make any sense. The scope of the encoding is the complete
>> >> >> >document.
>> >> >> >
>> >> >> >The above is not a fact, only my understanding.
>> >> >>
>> >> >> It matches my understanding. And it is clear that XML and/or URI
>> >> >> (rfc3986) require "URI escaping" for non-ASCII characters in URIs,
>> >> >> i.e., for characters that are outside of the ASCII repertoire. And
>> >> >> this is independent of the character-set encoding of the URI.
>> >> >>
>> >> >> So finally, a URI from HTML into CGM containing a reference-by-name
>> >> >> to "my object group" would be written like this:
>> >> >>
>> >> >> <a href="http://example.org/myCGM.cgm#name(my%20object%20group)">blah</a>
>> >> >>
>> >> >> and a WebCGM 'linkuri' first parameter would be this:
>> >> >>
>> >> >> http://example.org/myCGM.cgm#name(my%20object%20group)
>> >> >>
>> >> >> -Lofton.
>> >> >>
>> >> >>
>> >> >> >Tuesday, September 20, 2005, 2:45:48 PM, Lofton wrote:
>> >> >> >
>> >> >> >LH> All --
>> >> >> >
>> >> >> >LH> When I was putting together the first Unicode tests, Dieter also
>> >> >> >LH> supplied me with this nifty "advanced" test. It gets into Japanese
>> >> >> >LH> text for SF text like APS ids and names.
>> >> >> >
>> >> >> >LH> It highlights an interesting implication of our decision to stick
>> >> >> >LH> with URI instead of switching to IRI. URI encoding requires that
>> >> >> >LH> any non-ASCII characters are included by the "URI escaping
>> >> >> >LH> mechanism", see WebCGM 3.1.1.4 [1], and the more detailed XML
>> >> >> >LH> description [2]. Basically, get the **UTF8** representation of the
>> >> >> >LH> characters, and replace each byte in that representation by the
>> >> >> >LH> 3-character string %HH, where HH is the hex representation of the
>> >> >> >LH> byte.
>> >> >> >
>> >> >> >LH> So consider, for example, the 2-character id of the object in
>> >> >> >LH> the upper-left box, and its use in a link from the object in the
>> >> >> >LH> upper-right box.
>> >> >> >
>> >> >> >LH> If that id were the two characters c1c2, let's suppose that it could
>> >> >> >LH> be represented by the 4 utf8 bytes b1b2b3b4 (I'm just guessing
>> >> >> >LH> about "4"; since UTF8 is variable length, it could be more). Then
>> >> >> >LH> to put that id into a URI string, it would have to be the
>> >> >> >LH> 12-character string:
>> >> >> >
>> >> >> >LH> %hh%hh%hh%hh
>> >> >> >
>> >> >> >LH> where the hh are the 4 pairs of hex digits that represent the 4
>> >> >> >LH> utf8 bytes. I.e., the CGM URI for the link would be:
>> >> >> >
>> >> >> >LH> #id(%hh%hh%hh%hh, view_context)
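The byte-by-byte replacement described above can be sketched directly. This is only an illustration in Python; the function name is hypothetical, the sample characters are assumptions (the actual id in the test file is not reproduced here), and the sketch escapes non-ASCII characters only, exactly as the paragraph describes.

```python
def percent_escape_non_ascii(text):
    """Replace each non-ASCII character by %HH for every byte of its UTF-8 form."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)  # ASCII characters are copied unchanged
        else:
            out.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
    return "".join(out)


for sample in ("A", "é", "日", "日本"):
    print(sample, len(sample.encode("utf-8")), percent_escape_non_ascii(sample))
# A 1 A
# é 2 %C3%A9
# 日 3 %E6%97%A5
# 日本 6 %E6%97%A5%E6%9C%AC
```

UTF-8 uses three bytes for characters such as these two kanji, so an id made of two of them would need 6 bytes and an 18-character escaped string, not the 4 bytes and 12 characters guessed above.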
>> >> >> >
>> >> >> >LH> Side question. Does URI (rfc3986 [3]) restrict only the character
>> >> >> >LH> repertoire of the URI, or does it restrict also the encoding? I.e.,
>> >> >> >LH> can a URI be encoded in ascii, isoLatin1, or utf8, or utf16, or
>> >> >> >LH> whatever, as long as it restricts its repertoire to the URI
>> >> >> >LH> repertoire? I suspect "yes", but I don't know the answer. It would
>> >> >> >LH> be interesting for someone to research it.
>> >> >> >
>> >> >> >LH> Thoughts?
>> >> >> >
>> >> >> >LH> Regards,
>> >> >> >LH> -Lofton.
>> >> >> >
>> >> >> >LH> [1] http://docs.oasis-open.org/webcgm/v2.0/WebCGM20-IC.html#webcgm_3_1_1_4
>> >> >> >LH> [2] http://www.w3.org/TR/2004/REC-xml-20040204/#sec-external-ent
>> >> >> >LH> [3] URI: http://www.ietf.org/rfc/rfc3986.txt
>> >> >> >LH> [4] IRI: http://www.ietf.org/rfc/rfc3987.txt