
cgmo-webcgm message


Subject: Re[6]: [cgmo-webcgm] implications of URI vs. IRI


Hi Lofton,

I think that some of your questions are answered in 2.4.2:

2.4.2. When to Escape and Unescape

  A URI is always in an "escaped" form, since escaping or unescaping a
  completed URI might change its semantics.
  [...]
  Because the percent "%" character always has the reserved purpose of
  being the escape indicator, it must be escaped as "%25" in order to
  be used as data within a URI.  Implementers should be careful not to
  escape or unescape the same string more than once, since unescaping
  an already unescaped string might lead to misinterpreting a percent
  data character as another escaped character, or vice versa in the
  case of escaping an already escaped string.
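
A minimal sketch of the double-escaping hazard described in 2.4.2, assuming Python's standard `urllib.parse` as a stand-in for whatever escaping routine an implementation uses. The sample string `a%20b` is taken from later in this thread; the variable names are illustrative only.

```python
from urllib.parse import quote, unquote

# Literal data: the five characters a, %, 2, 0, b (no escaping intended).
data = "a%20b"

# Escaping once turns the data '%' into '%25'.
escaped_once = quote(data, safe="")
print(escaped_once)                      # a%2520b

# Escaping the already escaped string again changes it a second time.
escaped_twice = quote(escaped_once, safe="")
print(escaped_twice)                     # a%252520b

# Unescaping once recovers the data; unescaping twice corrupts it.
print(unquote(escaped_once))             # a%20b
print(unquote(unquote(escaped_once)))    # a b
```

Each escape or unescape step is only safe when the caller knows which form the string is currently in.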

One last comment; this is _again_ a three way conversation (Lofton,
Dieter and myself)... everyone should be involved in this conversation
(users and implementers, what do you want), you are all affected by
this. We want a 'valid' solution that will have little disruption on
WebCGM 1.0 content; let's try to work towards that goal.

Regards,

-- 
 Benoit   mailto:benoit@itedo.com


Tuesday, October 11, 2005, 7:05:19 PM, Lofton wrote:

LH> More...

LH> I am giving some more thought to the ambiguity problem
LH> about "both" (i.e., both forms allowed in the fragment, linkuri,
LH> etc., a la SVG.)

LH> Firstly, a possible solution.  One could always add a rule
LH> for CGM interpreters, that any %hh 3-tuple in a fragment (or
LH> linkuri 1st parameter, or ...) will be taken by the CGM interpreter
LH> as a URI escaping sequence.  So caveat to WebCGM generators ...
LH> although the 'name' ApsAttr might allow something like that as part
LH> of the 'name' value, you had better not do it, because you will
LH> create an ambiguity when you use that 'name' value in a fragment
LH> (or linkuri, DOM, XCF) and will NOT get the result you want.
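
A generator-side check for the caveat above could look like the following sketch. The function name `has_ambiguous_escape` is hypothetical, not part of WebCGM or any library; it only flags 'name' values containing a `%` followed by two hex digits, which an interpreter following the proposed rule would read as an escape sequence.

```python
import re

# A '%' followed by exactly two hex digits looks like a URI escape.
PERCENT_TRIPLET = re.compile(r"%[0-9A-Fa-f]{2}")

def has_ambiguous_escape(name: str) -> bool:
    """Return True if the name contains something that reads as %hh."""
    return PERCENT_TRIPLET.search(name) is not None

for candidate in ["my object group", "%myFunnyName%", "%20", "a%2Fb"]:
    print(repr(candidate), has_ambiguous_escape(candidate))
```

Under this check `%myFunnyName%` is harmless because no `%` is followed by two hex digits, while `%20` and `a%2Fb` would be flagged.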

LH> Secondly...

LH> There is still something about the SVG sentence that bothers
LH> me, "...must be a URI reference as defined in [RFC2396], or must
LH> result in a URI reference after the escaping procedure described
LH> below is applied".  Specifically, was the *first* phrase ("must be a
LH> URI reference as defined in [RFC2396]") meant to include
LH> the case(s):

LH> 1.) it is all safe ASCII in its original data form, with no URI escaping needed or present?
LH> 2.) or was it maybe unsafe, but is already URI escaped?
LH> 3.) or both?

LH> e1 illustrates #1 (all safe, no problem characters, no
LH> escaping needed or done).  e2 illustrates #2 (already escaped).

LH> e1)  <image href="rasterImage.png" .../>
LH> e2)  <image href="raster%20image.png" .../>

LH> Are both valid in SVG?

LH> I'm going to reread 2396 again.  Chapter 2 talks about all
LH> this stuff (as well as questions like local encoding), but it is
LH> not light reading.  I'm also thinking to ask Chris about his memory
LH> of the sentence, particularly the intent of its first phrase.

LH> -Lofton.

LH> At 01:10 PM 10/11/2005 -0600, Lofton Henderson wrote:
LH> At 05:20 PM 10/11/2005 +0200, Dieter Weidenbrück wrote:
LH> [...]
LH> good, and agreed.


LH> Not so fast!

LH> Actually, I do agree that we should use the SVG
LH> interpretation, if possible.  I'm not sure how we ended up
LH> differently, since Chris was consulting on and helping with this
LH> detail (it might be the time difference -- 1999 for WebCGM 1.0
LH> versus 2001 for SVG -- Chris and SVG might have figured it out
LH> properly in those two years).

LH> My problem is:  exactly how to do it.  One logical method
LH> might be an erratum on 1.0 -- logical because we ended up diverging
LH> from SVG 1.0 on that detail, and didn't intend to.  (Would require
LH> some action within W3C, to update the errata file that is linked
LH> from the Status section of the WebCGM 1.0 Recommendation.)  An
LH> erratum (in the "both" direction) would mean that both forms are
LH> valid 1.0 content, from the very beginning.

LH> Another possibility:  fix the language for 2.0, so that "both"
LH> are allowed from 2.0 on.  (This makes 1.0 content problematic, if
LH> both forms have been used.)

LH> About the question of "both"...

>> The sentence "...must be a URI reference as defined in
>> [RFC2396], or must result in a URI reference after the escaping
>> procedure described below is applied"
>> 
>> DW> The way I understand the SVG wording is that both forms would be legal:
>> 
>> DW> http://www.cgmopen.org/abc.cgm#name(my name with blank)
>> DW> http://www.cgmopen.org/abc.cgm#name(my%20name%20with%20blank)


LH> Rfc2396 makes it clear (section 2.3 and 2.4) that the
LH> presence of % should tell a URI resolver that URI escaping is in
LH> effect -- % isn't a valid reserved (delimiter or subdelimiter)
LH> character, nor a valid unreserved character, for the URI.

LH> However, % is a valid character in the repertoire of the
LH> 'name' ApsAttr, right?  So "%myFunnyName%" is a valid 'name'
LH> APSattr in a WebCGM instance, right?  And the 3-character "%20" is
LH> a valid 'name' ApsAttr, right?

LH> So if WebCGM allowed "both", and you encountered a fragment:

LH> #name(a%20b) ,

LH> what would you give to the URI resolver?  Two choices:

LH> a%20b  [assumes that the generator already applied uri-escaping]
LH> a%2520b  [assumes that generator did NOT uri-escape already]

LH> [btw, hex for % is 0x25, so % as an actual URI character is given to URI resolver as %25]
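
The two readings of `#name(a%20b)` can be made concrete with a short sketch. It assumes a file that happens to contain both an object named `a b` and one named `a%20b`; the set `names_in_file` is invented for illustration, and `urllib.parse` stands in for the URI resolver's escaping rules.

```python
from urllib.parse import quote, unquote

fragment_name = "a%20b"   # the text found between "name(" and ")"

# Reading 1: the generator already URI-escaped, so pass it on unchanged.
already_escaped = fragment_name
# Reading 2: the generator stored raw data, so the '%' must be escaped now.
raw_data = quote(fragment_name, safe="")

# Hypothetical file content: both names are legal 'name' values.
names_in_file = {"a b", "a%20b"}

target_1 = unquote(already_escaped)   # object addressed under reading 1
target_2 = unquote(raw_data)          # object addressed under reading 2

print(already_escaped, "->", repr(target_1))   # a%20b -> 'a b'
print(raw_data, "->", repr(target_2))          # a%2520b -> 'a%20b'
```

Both targets exist in the hypothetical file, so the same fragment selects a different object depending on which reading the interpreter picks.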

LH> Thoughts?  (This gives me a headache!)

LH> -Lofton.


LH> One more comment:
LH> Spaces in "name" attributes have been allowed long before any linkURI
LH> and/or XML rules existed, thus nobody ever thought about this detail.
LH> Everything was stored in the CGM as the rules for non-graphical
LH> strings mandated.
LH> One could say that this could have been clarified in WebCGM 1.0,
LH> however, I find it quite useful to have both forms available.

LH> Dieter 

>> -----Original Message-----
>> From: Benoit Bezaire [mailto:benoit@itedo.com] 
>> Sent: Tuesday, October 11, 2005 5:07 PM
>> To: cgmo-webcgm@lists.oasis-open.org
>> Subject: Re[4]: [cgmo-webcgm] implications of URI vs. IRI
>> 
>> Hi Dieter,
>> 
>> Thanks for the example, we are talking about the same thing.
>> 
>> I understand that ATA and WebCGM have allowed spaces in URI
>> fragments for the last 10 years, but from my interpretation
>> of RFC2396, those linkuris are illegal. Here is a quote from
>> Section 4.1 of http://www.ietf.org/rfc/rfc2396.txt:
>> "The character restrictions described in Section 2 for URI 
>> also apply to the fragment in a URI-reference."
>> 
>> And by reading Section 2, you end up reading that spaces are 
>> not allowed.
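
As a rough illustration of the Section 2 restriction, the sketch below tests whether a string uses only characters that RFC 2396 permits in a URI reference (reserved, unreserved, `%HH` escapes, and one `#` delimiter). It is a character-level check only, not a full grammar validator, and the function name is hypothetical.

```python
import re

# RFC 2396: unreserved = alphanum | mark   (mark = - _ . ! ~ * ' ( ) )
#           reserved   = ; / ? : @ & = + $ ,
# plus %HH escape sequences.
_URIC = r"(?:[A-Za-z0-9\-_.!~*'();/?:@&=+$,]|%[0-9A-Fa-f]{2})"
# An optional single '#' separates the URI from the fragment.
_URI_REF = re.compile(rf"{_URIC}*(?:#{_URIC}*)?")

def uses_only_rfc2396_characters(ref: str) -> bool:
    return _URI_REF.fullmatch(ref) is not None

unescaped = "http://www.cgmopen.org/abc.cgm#name(my name with blank)"
escaped = "http://www.cgmopen.org/abc.cgm#name(my%20name%20with%20blank)"
print(uses_only_rfc2396_characters(unescaped))  # False: contains spaces
print(uses_only_rfc2396_characters(escaped))    # True
```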
>> 
>> That being said, your interpretation of the SVG wording 
>> sounds acceptable. The sentence 'or must result in a URI 
>> reference after the escaping procedure' seems to be saving 
>> us! I'm in favor of adding wording to the spec to clarify 
>> this issue (the 3 bullet wording would be good also).
>> 
>> I no longer have a preference if we should deprecate or not. 
>> On one side, I think that this is a can of worms and forcing 
>> escaping simplifies things; on the other, I agree that long 
>> %HH for Asian names is not ideal.
>> 
>> Allowing both is probably the least painful approach for users
>> and implementers at this time.
>> 
>> Regards,
>> 
>> -- 
>>  Benoit   mailto:benoit@itedo.com
>> 
>> 
>> Tuesday, October 11, 2005, 10:15:06 AM, Dieter wrote:
>> 
>> DW> Hi Benoit,
>> 
>> DW> see inline
>> 
>> >> -----Original Message-----
>> >> From: Benoit Bezaire [mailto:benoit@itedo.com]
>> >> Sent: Tuesday, October 11, 2005 3:48 PM
>> >> To: cgmo-webcgm@lists.oasis-open.org
>> >> Subject: Re[2]: [cgmo-webcgm] implications of URI vs. IRI
>> >> 
>> >> Hi Dieter,
>> >> 
>> >> You said:
>> >> NOTE: If we required an escaped string inside the CGM now, this will
>> >> make almost all existing files invalid ones as soon as a simple space
>> >> is in a name attribute.
>> >> 
>> >> You are talking about the 'name' attribute within a URI only, correct?
>> >> Or, let me rephrase...
>> >> Files which have a name attribute (containing a space) that is used
>> >> in a URI become invalid, right?
>> DW> I am referring to the link destination parameter of a linkuri attribute.
>> DW> Yes, something like (pseudo-code)
>> 
>> DW> linkuri "http://www.cgmopen.org/abc.cgm#name(my name with blank)"
>> DW> "some title" "_blank"
>> 
>> DW> would become illegal, and this is the form (without escaping) that
>> DW> has been used forever in the ATA and WebCGM environment
>> DW> (almost 10 years now).
>>  
>> >> 
>> >> I would be in favor of deprecating (i.e., authors should stop
>> >> creating such files) the old behavior (no escaping) and adding 'a la'
>> >> SVG wording to the spec. Like Dieter says, but with an emphasis on
>> >> deprecating the old behavior.
>> DW> The way I understand the SVG wording is that both forms would be legal:
>> 
>> DW> http://www.cgmopen.org/abc.cgm#name(my name with blank)
>> DW> http://www.cgmopen.org/abc.cgm#name(my%20name%20with%20blank)
>> 
>> DW> I would NOT deprecate the first form, because it would force us to
>> DW> build long strings for Japanese or similar characters, following the
>> DW> rules as described below.
>> 
>> DW> Do you read the SVG spec the same way, or am I wrong?
>> 
>> DW> Regards,
>> DW> Dieter
>> 
>> >> 
>> >> -- 
>> >>  Benoit   mailto:benoit@itedo.com
>> >> 
>> >>  
>> >> Thursday, October 6, 2005, 7:52:43 AM, Dieter wrote:
>> >> 
>> >> DW> All,
>> >> 
>> >> DW> I am not yet convinced that we are heading in the right
>> >> DW> direction here.
>> >> 
>> >> DW> Example:
>> >> DW> Let's assume we have the string "nihon" inside a linkUri: "id(日本)"
>> >> 
>> >> DW> using UTF-16 (big endian) this is: 65 e5 67 2c (4 Bytes)
>> >> DW> converted to UTF-8: EF BB BF E6 97 A5 E6 9C AC (9 Bytes)
>> >> 
>> >> DW> and then you can apply escaping for all non-ascii chars
>> >> 
>> >> DW> %EF%BB%BF%E6%97%A5%E6%9C%AC (27 Bytes)
>> >> 
>> >> DW> and now we store it into the linkURI attribute, however, since
>> >> DW> somewhere else in the file we have this string in Japanese
>> >> DW> characters as an ID, all non-graphical strings will be stored as
>> >> DW> UTF-16 (could be UTF-8 as well):
>> >> 
>> >> DW> I save the writing, you end up with 54 bytes.
>> >> 
>> >> DW> So we are moving from 4 bytes to 54 bytes.
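
The byte counts in this example can be reproduced with a few lines of Python. This is only a sketch: it assumes the stored string is written as UTF-16 big endian without a leading byte order mark of its own. Note that the 9-byte UTF-8 figure in the message includes the three-byte mark EF BB BF; the sketch also shows the counts without it.

```python
from urllib.parse import quote

name = "\u65e5\u672c"  # the two characters for "nihon"

utf16_be = name.encode("utf-16-be")        # 65 E5 67 2C
utf8_with_bom = name.encode("utf-8-sig")   # EF BB BF + 6 bytes, as quoted
utf8_plain = name.encode("utf-8")          # E6 97 A5 E6 9C AC

# Percent-escape every byte, then store the ASCII result as UTF-16.
escaped_with_bom = quote(utf8_with_bom, safe="")
escaped_plain = quote(utf8_plain, safe="")
stored_with_bom = escaped_with_bom.encode("utf-16-be")
stored_plain = escaped_plain.encode("utf-16-be")

print(len(utf16_be))                                  # 4
print(len(utf8_with_bom), len(utf8_plain))            # 9 6
print(escaped_with_bom)                               # %EF%BB%BF%E6%97%A5%E6%9C%AC
print(len(escaped_with_bom), len(escaped_plain))      # 27 18
print(len(stored_with_bom), len(stored_plain))        # 54 36
```

With the mark included the growth is 4 to 54 bytes as stated; without it the same procedure gives 36 bytes.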
>> >> 
>> >> DW> I hope that this accurately describes the procedure 
>> that has been 
>> >> DW> discussed over the past couple of days.
>> >> 
>> >> DW> Comparison to SVG:
>> >> DW> In 5.3.2. [1], SVG says the following:
>> >> 
>> >> DW> "The value of the href attribute must be a URI reference as defined
>> >> DW> in [RFC2396], or must result in a URI reference after the escaping
>> >> DW> procedure described below is applied. The procedure is applied when
>> >> DW> passing the URI reference to a URI resolver."
>> >> 
>> >> DW> Interesting to see the last sentence here. IMO this means, it is
>> >> DW> perfectly legal to store the URI reference using any encoding, as
>> >> DW> long as it will be transcoded to UTF-8 and escaped before
>> >> DW> passing it on to a URI resolver.
>> >> 
>> >> DW> This has always been my understanding, and this is how all of our
>> >> DW> products have been handling references.
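
One way to picture "escape when passing to the resolver" is the sketch below. It is an assumption about how such a step might be written, not the SVG procedure verbatim: spaces and non-ASCII characters are converted to UTF-8 and escaped, while `%` and the RFC 2396 reserved and mark characters are left untouched so an already escaped reference passes through unchanged. The name `escape_for_resolver` is hypothetical.

```python
from urllib.parse import quote

# Characters left as-is: RFC 2396 reserved and mark characters, '#',
# and '%' (so existing %HH sequences are not escaped a second time).
KEEP = ";/?:@&=+$,-_.!~*'()#%"

def escape_for_resolver(stored: str) -> str:
    """Escape a stored link string just before handing it to a resolver."""
    return quote(stored, safe=KEEP, encoding="utf-8")

plain = "http://www.cgmopen.org/abc.cgm#name(my name with blank)"
already = "http://www.cgmopen.org/abc.cgm#name(my%20name%20with%20blank)"

print(escape_for_resolver(plain))
print(escape_for_resolver(already))
print(escape_for_resolver("#id(\u65e5\u672c)"))
```

Both stored forms yield the same resolver input. The price of leaving `%` alone is that a literal `%hh` inside a name can no longer be distinguished from an escape.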
>> >> 
>> >> DW> NOTE:
>> >> DW> If we required an escaped string inside the CGM now, this will make
>> >> DW> almost all existing files invalid ones as soon as a simple space is
>> >> DW> in a name attribute.
>> >> 
>> >> DW> RECOMMENDATION:
>> >> DW> Amend wording slightly to match what SVG is doing and allow for
>> >> DW> both styles, escaped and not escaped.
>> >> 
>> >> DW> Comments?
>> >> 
>> >> DW> Regards,
>> >> DW> Dieter
>> >> 
>> >> 
>> >> DW> [1] http://www.w3.org/TR/SVG11/struct.html#xlinkRefAttrs
>> >> 
>> >> 
>> >> >> -----Original Message-----
>> >> >> From: Lofton Henderson [mailto:lofton@rockynet.com]
>> >> >> Sent: Wednesday, October 05, 2005 1:06 AM
>> >> >> To: Benoit Bezaire; cgmo-webcgm@lists.oasis-open.org
>> >> >> Subject: Re: [cgmo-webcgm] implications of URI vs. IRI
>> >> >> 
>> >> >> At 05:09 PM 10/4/2005 -0400, Benoit Bezaire wrote:
>> >> >> >Hi Lofton,
>> >> >> >
>> >> >> >I just did a quick search... I think that URI is only restricting
>> >> >> >characters to US-ASCII; it has no control on the encoding (utf-8,
>> >> >> >utf-16 etc...).
>> >> >> >
>> >> >> >In XML syntax such as XHTML and SVG, files can have just about any
>> >> >> >encoding; I'm not aware of any special processing for the xlink:href
>> >> >> >attribute (i.e., this is a URI, change the encoding to _blah_). It
>> >> >> >wouldn't make any sense. The scope of the encoding is for the complete
>> >> >> >document.
>> >> >> >
>> >> >> >The above is not a fact, only my understanding.
>> >> >> 
>> >> >> It matches my understanding.  And it is clear that XML and/or URI
>> >> >> (rfc3986) require "URI escaping" for non-ASCII characters in URIs,
>> >> >> i.e., for characters that are outside of the ASCII repertoire.  And
>> >> >> this is independent of the character-set encoding of the URI.
>> >> >> 
>> >> >> So finally, a URI from HTML into CGM containing a reference-by-name
>> >> >> to "my object group" would be written like this:
>> >> >> 
>> >> >> <a href="http://example.org/myCGM.cgm#name(my%20object%20group)">blah</a>
>> >> >> 
>> >> >> and a WebCGM 'linkuri' first parameter would be this:
>> >> >> 
>> >> >> http://example.org/myCGM.cgm#name(my%20object%20group)
>> >> >> 
>> >> >> -Lofton.
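
On the receiving side, a viewer has to split off the fragment and undo the escaping to get back to the object name. The sketch below assumes a simplified fragment of the form `name(...)` with a single parameter; real WebCGM fragments can carry more, and the helper name is hypothetical.

```python
import re
from urllib.parse import urlsplit, unquote

# Simplified matcher: only the single-parameter form name(<objname>).
_NAME_FRAGMENT = re.compile(r"name\((.*)\)")

def object_name_from_link(link: str):
    """Return the unescaped object name addressed by the link, or None."""
    fragment = urlsplit(link).fragment   # still in escaped form
    match = _NAME_FRAGMENT.fullmatch(fragment)
    if match is None:
        return None
    return unquote(match.group(1))

link = "http://example.org/myCGM.cgm#name(my%20object%20group)"
print(object_name_from_link(link))   # my object group
```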
>> >> >> 
>> >> >> 
>> >> >> >Tuesday, September 20, 2005, 2:45:48 PM, Lofton wrote:
>> >> >> >
>> >> >> >LH> All --
>> >> >> >
>> >> >> >LH> When I was putting together first unicode tests, Dieter also
>> >> >> >LH> supplied me with this nifty "advanced" test.  It gets into Japanese
>> >> >> >LH> text for SF text like APS ids and names.
>> >> >> >
>> >> >> >LH> It highlights an interesting implication of our decision to stick
>> >> >> >LH> with URI instead of switching to IRI.  URI encoding requires that
>> >> >> >LH> any non-ASCII characters are included by the "URI escaping
>> >> >> >LH> mechanism", see WebCGM 3.1.1.4
>> >> >> >LH> [1], and the more detailed XML description [2].  Basically, get the
>> >> >> >LH> **UTF8** representation of the characters, and replace each byte in
>> >> >> >LH> that representation by the 3-character string %HH, where HH is the
>> >> >> >LH> hex representation of the byte.
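
The mechanism just described can be written out by hand in a few lines. This sketch escapes only non-ASCII characters, as the paragraph describes; a complete implementation would also escape ASCII characters that are not allowed in a URI, such as the space. The function name is illustrative.

```python
def uri_escape_non_ascii(text: str) -> str:
    """Replace each non-ASCII character by %HH for every one of its UTF-8 bytes."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)                       # ASCII passes through
        else:
            out.extend(f"%{b:02X}" for b in ch.encode("utf-8"))
    return "".join(out)

# A two-character Japanese id: each character is 3 bytes in UTF-8,
# so the escaped form is 6 x 3 = 18 characters long.
print(uri_escape_non_ascii("id(\u65e5\u672c)"))   # id(%E6%97%A5%E6%9C%AC)
```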
>> >> >> >
>> >> >> >LH> So consider, for example, the 2-character id of the object in
>> >> >> >LH> the upper-left box, and its use in a link from the object in the
>> >> >> >LH> upper-right box.
>> >> >> >
>> >> >> >LH> If that id were the two characters c1c2, let's suppose that it could
>> >> >> >LH> be represented by the 4 utf8 bytes b1b2b3b4 (I'm just guessing
>> >> >> >LH> about "4", since UTF8 is variable length, it could be more).  Then
>> >> >> >LH> to put that id into
>> >> >> >LH> a URI string, it would have to be the 12-character string:
>> >> >> >
>> >> >> >LH> %hh%hh%hh%hh
>> >> >> >
>> >> >> >LH> where the hh are the 4 pairs of hex digits that represent
>> >> >> >LH> the 4 utf8 bytes. I.e., the CGM URI for the link would be:
>> >> >> >
>> >> >> >LH> #id(%hh%hh%hh%hh, view_context)
>> >> >> >
>> >> >> >LH> Side question.  Does URI (rfc3986 [3]) restrict only the character
>> >> >> >LH> repertoire of the URI, or does it restrict also the encoding? I.e.,
>> >> >> >LH> can a URI be encoded in ascii, isoLatin1, or utf8, or utf16, or
>> >> >> >LH> whatever, as long
>> >> >> >LH> as it restricts its repertoire to the URI repertoire?  I suspect
>> >> >> >LH> "yes", but I don't know the answer.  It would be interesting for
>> >> >> >LH> someone to research it.
>> >> >> >
>> >> >> >LH> Thoughts?
>> >> >> >
>> >> >> >LH> Regards,
>> >> >> >LH> -Lofton.
>> >> >> >
>> >> >> >LH> [1] http://docs.oasis-open.org/webcgm/v2.0/WebCGM20-IC.html#webcgm_3_1_1_4
>> >> >> >LH> [2] http://www.w3.org/TR/2004/REC-xml-20040204/#sec-external-ent
>> >> >> >LH> [3] URI:  http://www.ietf.org/rfc/rfc3986.txt
>> >> >> >LH> [4] IRI:  http://www.ietf.org/rfc/rfc3987.txt




