RE: Re[6]: [cgmo-webcgm] implications of URI vs. IRI

All,

I like option 2, unescaped only.

Regards,

Forrest

From: Dieter Weidenbrück [mailto:dieter@itedo.com]
Sent: Wednesday, October 12, 2005 10:21 AM
To: 'Lofton Henderson'; 'Benoit Bezaire'; cgmo-webcgm@lists.oasis-open.org
Subject: RE: Re[6]: [cgmo-webcgm] implications of URI vs. IRI

Lofton,

good explanation, and the option "go for both" of course exists.

However, it would create a problem with existing data. A file containing

#name(abc%20def)

which is unescaped, would be changed to

#name(abc def)

unless some version checking is done. So for reading and writing we have to do additional checking and potential conversion in these cases.

If we simply go with unescaped only, we can avoid this.
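
To illustrate the existing-data problem, here is a minimal Python sketch. It assumes a hypothetical reader for the "both" option that percent-decodes every fragment it reads; the function name resolve_name and the sample object name are made up for illustration and are not part of any WebCGM API.

```python
from urllib.parse import unquote

# Hypothetical WebCGM 1.0 file: the object's 'name' is literally
# abc%20def, stored unescaped.
object_names = {"abc%20def"}

def resolve_name(fragment_arg, decode_escapes):
    """Return the name a viewer would look up for #name(<fragment_arg>)."""
    return unquote(fragment_arg) if decode_escapes else fragment_arg

# "unescaped only": the fragment text is used as-is and still matches.
unescaped_only = resolve_name("abc%20def", decode_escapes=False)

# "both": %20 is taken as an escape, so the lookup key becomes "abc def".
both = resolve_name("abc%20def", decode_escapes=True)

print(unescaped_only in object_names)  # True
print(both in object_names)            # False
```

Under these assumptions the old link only keeps working if the reader knows which convention the file was written with, which is the version checking mentioned above.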

I don't have strong feelings about this; however, I can't see any benefit coming from allowing both variants in the CGM. It only creates additional work, and it definitely would be a significant change of the CD at this point, which so far does not distinguish between two different cases.

So my impression is that

1. escaped only leads to a _lot_ of files from the past being illegal

2. unescaped only seems to be the smoothest way without losing anything

3. allowing for both creates additional work for the %xx cases

Again, my option would be 2.

Dieter

From: Lofton Henderson [mailto:lofton@rockynet.com]
Sent: Wednesday, October 12, 2005 5:08 PM
To: dieter@itedo.com; 'Benoit Bezaire'; cgmo-webcgm@lists.oasis-open.org
Subject: RE: Re[6]: [cgmo-webcgm] implications of URI vs. IRI

At 07:52 AM 10/12/2005 +0200, Dieter Weidenbrück wrote:

[...]
Consequences:
- if we go for "escaped only", most likely every file from the past will
be invalid if it had a space or similar in it.
- if we go for "non-escaped only" we will have no change compared to
WebCGM 1.0, however, we need to double-check whether this is in line
with the RFC.

Or if we go for "both" (like SVG), we need an understanding about resolving ambiguity. Suggestion: an interpreter should consider any potential URI-escaping string, i.e., any %hh triplet, to be URI-escaped data.

In other words, although %20 is in fact a valid 3-character ApsAttr 'name', don't put that in your fragment if you want your stuff to work!

About Dieter's "in line with the RFC" (above) ... I think it is okay, as the dialog with SVG suggests. Also, the RFC is somewhat vague, but it talks about when (in the information pipeline) escaping and unescaping happen. Basically, escaping happens "early", and unescaping happens "late". Early and late are defined relative to assembling the whole URI from its syntax components:

URI   = scheme ":" hier-part [ "?" query ] [ "#" fragment ]

I'm oversimplifying slightly, but... In other words, when building a URI, the components "scheme", "hier-part", "query", "fragment" need to be URI encoded before assembling them with the delimiters (":", ... , "#") into the whole URI. That way, the syntax is unambiguous for parsing and recovering the components (e.g., in case one of the delimiters is a data character in a component). Similarly, a processor must parse out the components before unescaping the data within a component.
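
A minimal Python sketch of that ordering, using the standard urllib.parse module. The helper names build_link and parse_link, the sample object name, and the assumption that the fragment always has the form name(...) are illustrative only, not taken from the WebCGM text.

```python
from urllib.parse import quote, unquote

def build_link(base, name):
    """Escape the object name first, then assemble the URI reference."""
    # safe="" so that every reserved character in the name, including
    # "#" and "%", is escaped before the "#" delimiter is added.
    return base + "#name(" + quote(name, safe="") + ")"

def parse_link(uri):
    """Split on the delimiter first, then unescape the component."""
    base, _, fragment = uri.partition("#")
    # Assumes the fragment has the form name(...).
    escaped_name = fragment[len("name("):-1]
    return base, unquote(escaped_name)

uri = build_link("http://example.org/myCGM.cgm", "part #7 (50%)")
print(uri)
print(parse_link(uri))
```

Because the "#" inside the name is escaped before assembly, the only literal "#" left in the result is the fragment delimiter, so the split in parse_link is unambiguous.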

The RFC is then somewhat vague, within these general constraints, about who is generating or processing the URI, when, and how. Which is good for us, IMO -- it lets us draw the lines (divide the responsibilities) where we please.

The suggestion that an interpreter should consider any potential URI-escaped string, i.e., any %hh triplet, to be URI-escaped data would in fact still allow "%20" to be a 'name' in a fragment. The fragment would have to look like:

#name(%2520)

The interpreter would treat the %25 as uri-escaped data. Unescaping %25, the string would become %20 (the RFC is clear to stop here -- don't try to unescape a 2nd time, using the just-unescaped "%").
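
As a quick check of the single-pass behaviour, here is a small Python sketch using urllib.parse.quote and unquote. Only the fragment syntax comes from the discussion; the variable names are illustrative.

```python
from urllib.parse import quote, unquote

name = "%20"                      # the literal 3-character 'name'
escaped = quote(name, safe="")    # "%" becomes "%25"
fragment = "#name(" + escaped + ")"
print(fragment)                   # #name(%2520)

once = unquote(escaped)           # correct: a single pass
twice = unquote(once)             # wrong: a second pass yields a space
print(repr(once))                 # '%20'
print(repr(twice))                # ' '
```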

I hope I'm reading this correctly, and would appreciate some backup/confirmation:
http://www.ietf.org/rfc/rfc3986.txt

-Lofton.

Questions:
- How did other authoring tools do this in the past?
- What do other viewer tools expect if they read an existing WebCGM file?

I think this information is urgently needed to understand the situation
a bit better.

Regards,
Dieter

> -----Original Message-----
> From: Benoit Bezaire [mailto:benoit@itedo.com]
> Sent: Wednesday, October 12, 2005 5:05 AM
> To: cgmo-webcgm@lists.oasis-open.org
> Subject: Re[6]: [cgmo-webcgm] implications of URI vs. IRI
>
> Hi Lofton,
>
> I think that some of your questions are answered in 2.4.2:
>
> 2.4.2. When to Escape and Unescape
>
>   A URI is always in an "escaped" form, since escaping or unescaping a
>   completed URI might change its semantics.
>   [...]
>   Because the percent "%" character always has the reserved purpose of
>   being the escape indicator, it must be escaped as "%25" in order to
>   be used as data within a URI. Implementers should be careful not to
>   escape or unescape the same string more than once, since unescaping
>   an already unescaped string might lead to misinterpreting a percent
>   data character as another escaped character, or vice versa in the
>   case of escaping an already escaped string.
>
> One last comment; this is _again_ a three way conversation
> (Lofton, Dieter and myself)... everyone should be involved in
> this conversation (users and implementers, what do you want),
> you are all affected by this. We want a 'valid' solution that
> will have little disruption on WebCGM 1.0 content; let's try
> to work towards that goal.
>
> Regards,
>
> --
> Benoit   mailto:benoit@itedo.com
>
>
> Tuesday, October 11, 2005, 7:05:19 PM, Lofton wrote:
>
> LH> More...
>
> LH> I am giving some more thought to the ambiguity problem about
> LH> "both" (i.e., both forms allowed in the fragment, linkuri, etc.,
> LH> a la SVG).
>
> LH> Firstly, a possible solution. One could always add a rule for CGM
> LH> interpreters, that any %hh 3-tuple in a fragment (or linkuri 1st
> LH> parameter, or ...) will be taken by the CGM interpreter as a URI
> LH> escaping sequence. So caveat to WebCGM generators ... although the
> LH> 'name' ApsAttr might allow something like that as part of the
> LH> 'name' value, you had better not do it, because you will create an
> LH> ambiguity when you use that 'name' value in a fragment (or
> LH> linkuri, DOM, XCF) and will NOT get the result you want.
>
> LH> Secondly...
>
> LH> There is still something about the SVG sentence that bothers me,
> LH> "...must be a URI reference as defined in [RFC2396], or must
> LH> result in a URI reference after the escaping procedure described
> LH> below is applied". Specifically, was the *first* phrase ("must be
> LH> a URI reference as defined in [RFC2396]") meant to include the
> LH> case(s):
>
> LH> 1.) it is all safe ASCII in its original data form, with no URI
> LH> escaping needed or present?
> LH> 2.) or was it maybe unsafe, but is already URI escaped?
> LH> 3.) or both?
>
> LH> e1 illustrates #1 (all safe, no problem characters, no escaping
> LH> needed or done). e2 illustrates #2 (already escaped).
>
> LH> e1) <image href="rasterImage.png" .../>
> LH> e2) <image href="raster%20image.png" .../>
>
> LH> Are both valid in SVG?
>
> LH> I'm going to reread 2396 again. Chapter 2 talks about all this
> LH> stuff (as well as questions like local encoding), but it is not
> LH> light reading. I'm also thinking to ask Chris about his memory of
> LH> the sentence, particularly the intent of its first phrase.
>
> LH> -Lofton.
>
> LH> At 01:10 PM 10/11/2005 -0600, Lofton Henderson wrote:
> LH> At 05:20 PM 10/11/2005 +0200, Dieter Weidenbrück wrote:
> LH> [...]
> LH> good, and agreed.
>
>
> LH> Not so fast!
>
> LH> Actually, I do agree that we should use the SVG interpretation,
> LH> if possible. I'm not sure how we ended up differently, since Chris
> LH> was consulting on and helping with this detail (it might be the
> LH> time difference -- 1999 for WebCGM 1.0 versus 2001 for SVG --
> LH> Chris and SVG might have figured it out properly in those two
> LH> years).
>
> LH> My problem is: exactly how to do it. One logical method might be
> LH> an erratum on 1.0 -- logical because we ended up diverging from
> LH> SVG 1.0 on that detail, and didn't intend to. (Would require some
> LH> action within W3C, to update the errata file that is linked from
> LH> the Status section of the WebCGM 1.0 Recommendation.) An erratum
> LH> (in the "both" direction) would mean that both forms are valid
> LH> 1.0 content, from the very beginning.
>
> LH> Another possibility: fix the language for 2.0, so that "both"
> LH> are allowed from 2.0 on. (This makes 1.0 content problematic, if
> LH> both forms have been used.)
>
> LH> About the question of "both"...
>
> >> The sentence '...must be a URI reference as defined in [RFC2396],
> >> or must result in a URI reference after the escaping procedure
> >> described below is applied'
> >>
> >> DW> The way I understand the SVG wording is that both forms would
> >> DW> be legal:
> >>
> >> DW> http://www.cgmopen.org/abc.cgm#name(my name with blank)
> >> DW> http://www.cgmopen.org/abc.cgm#name(my%20name%20with%20blank)
>
>
> LH> Rfc2396 makes it clear (section 2.3 and 2.4) that the presence of
> LH> % should tell a URI resolver that URI escaping is in effect -- %
> LH> isn't a valid reserved (delimiter or subdelimiter) character, nor
> LH> a valid unreserved character, for the URI.
>
> LH> However, % is a valid character in the repertoire of the 'name'
> LH> ApsAttr, right? So "%myFunnyName%" is a valid 'name' ApsAttr in a
> LH> WebCGM instance, right? And the 3-character "%20" is a valid
> LH> 'name' ApsAttr, right?
>
> LH> So if WebCGM allowed "both", and you encountered a fragment:
>
> LH> #name(a%20b) ,
>
> LH> what would you give to the URI resolver? Two choices:
>
> LH> a%20b [assumes that the generator already applied uri-escaping]
> LH> a%2520b [assumes that generator did NOT uri-escape already]
>
> LH> [btw, hex for % is 0x25, so % as an actual URI character is given
> LH> to URI resolver as %25]
>
> LH> Thoughts? (This gives me a headache!)
>
> LH> -Lofton.
>
>
> LH> One more comment:
> LH> Spaces in "name" attributes have been allowed long before any
> LH> linkURI and/or XML rules existed, thus nobody ever thought about
> LH> this detail. Everything was stored in the CGM as the rules for
> LH> non-graphical strings mandated.
> LH> One could say that this could have been clarified in WebCGM 1.0,
> LH> however, I find it quite useful to have both forms available.
>
> LH> Dieter
>
> >> -----Original Message-----
> >> From: Benoit Bezaire [mailto:benoit@itedo.com]
> >> Sent: Tuesday, October 11, 2005 5:07 PM
> >> To: cgmo-webcgm@lists.oasis-open.org
> >> Subject: Re[4]: [cgmo-webcgm] implications of URI vs. IRI
> >>
> >> Hi Dieter,
> >>
> >> Thanks for the example, we are talking about the same thing.
> >>
> >> I understand that ATA and WebCGM have allowed spaces in URI
> >> fragments for the last 10 years, but from my interpretation of
> >> RFC2396, those linkuris are illegal. Here is a quote from Section
> >> 4.1 of http://www.ietf.org/rfc/rfc2396.txt
> >> "The character restrictions described in Section 2 for URI also
> >> apply to the fragment in a URI-reference."
> >>
> >> And by reading Section 2, you end up reading that spaces are not
> >> allowed.
> >>
> >> That being said, your interpretation of the SVG wording sounds
> >> acceptable. The sentence 'or must result in a URI reference after
> >> the escaping procedure' seems to be saving us! I'm in favor of
> >> adding wording to the spec to clarify this issue (the 3 bullet
> >> wording would be good also).
> >>
> >> I no longer have a preference on whether we should deprecate or
> >> not. On one side, I think that this is a can of worms and forcing
> >> escaping simplifies things; on the other, I agree that long %HH
> >> for Asian names is not ideal.
> >>
> >> Allowing both is probably the least painful approach for users
> >> and implementers at this time.
> >>
> >> Regards,
> >>
> >> --
> >> Benoit   mailto:benoit@itedo.com
> >>
> >>
> >> Tuesday, October 11, 2005, 10:15:06 AM, Dieter wrote:
> >>
> >> DW> Hi Benoit,
> >>
> >> DW> see inline
> >>
> >> >> -----Original Message-----
> >> >> From: Benoit Bezaire [mailto:benoit@itedo.com]
> >> >> Sent: Tuesday, October 11, 2005 3:48 PM
> >> >> To: cgmo-webcgm@lists.oasis-open.org
> >> >> Subject: Re[2]: [cgmo-webcgm] implications of URI vs. IRI
> >> >>
> >> >> Hi Dieter,
> >> >>
> >> >> You said:
> >> >> NOTE: If we required an escaped string inside the CGM now,
> >> this will
> >> >> make almost all existing files invalid ones as soon as a
> >> simple space
> >> >> is in a name attribute.
> >> >>
> >> >> You are talking about the 'name' attribute within a URI
> >> only, correct?
> >> >> Or, let me rephrase...
> >> >> Files which have a name attribute (containing a space)
> >> that is used
> >> >> in a URI become invalid, right?
> >> DW> I am referring to the link destination parameter of a
> >> linkuri attribute.
> >> DW> Yes, something like (pseudo-code)
> >>
> >> DW> linkuri "http://www.cgmopen.org/abc.cgm#name(my name
> with blank)"
> >> DW> "some title" "_blank"
> >>
> >> DW> would become illegal, and this is the form (without
> >> escaping) that
> >> DW> has been used forever in the ATA and WebCGM environment
> >> (almost 10 years now).
> >>
> >> >>
> >> >> I would be in favor of deprecating (i.e., authors should stop
> >> >> creating such files) the old behavior (no escaping) and
> >> adding 'a la'
> >> >> SVG wording to the spec. Like Dieter says, but with an
> emphasis on
> >> >> deprecating the old behavior.
> >> DW> The way I understand the SVG wording is that both forms
> >> would be legal:
> >>
> >> DW> http://www.cgmopen.org/abc.cgm#name(my name with blank)
> >> DW> http://www.cgmopen.org/abc.cgm#name(my%20name%20with%20blank)
> >>
> >> DW> I would NOT deprecate the first form, because it would
> >> force us to
> >> DW> build long strings for japanese or similar characters,
> >> following the
> >> DW> rules as described below.
> >>
> >> DW> Do you read the SVG spec the same way, or am I wrong?
> >>
> >> DW> Regards,
> >> DW> Dieter
> >>
> >> >>
> >> >> --
> >> >> Benoit   mailto:benoit@itedo.com
> >> >>
> >> >>
> >> >> Thursday, October 6, 2005, 7:52:43 AM, Dieter wrote:
> >> >>
> >> >> DW> All,
> >> >>
> >> >> DW> I am not yet convinced that we are heading in the right
> >> >> direction here.
> >> >>
> >> >> DW> Example:
> >> >> DW> Let's assume we have the string "nihon" inside a
> >> >> DW> linkUri: "id(日本)"
> >> >>
> >> >> DW> using UTF-16 (big endian) this is: 65 e5 67 2c (4 Bytes)
> >> >> DW> converted to UTF-8: EF BB BF E6 97 A5 E6 9C AC (9 Bytes)
> >> >>
> >> >> DW> and then you can apply escaping for all non-ascii chars
> >> >>
> >> >> DW> %EF%BB%BF%E6%97%A5%E6%9C%AC (27 Bytes)
> >> >>
> >> >> DW> and now we store it into the linkURI attribute,
> however, since
> >> >> DW> somewhere else in the file we have this string in japanese
> >> >> DW> characters as an ID, all non-graphical strings will be
> >> stored as
> >> >> DW> UTF-16 (could be
> >> >> DW> UTF-8 as well):
> >> >>
> >> >> DW> I save the writing, you end up with 54 bytes.
> >> >>
> >> >> DW> So we are moving from 4 bytes to 54 bytes.
> >> >>
> >> >> DW> I hope that this accurately describes the procedure
> >> that has been
> >> >> DW> discussed over the past couple of days.
> >> >>
> >> >> DW> Comparison to SVG:
> >> >> DW> In 5.3.2. [1], SVG says the following:
> >> >>
> >> >> DW> "The value of the href attribute must be a URI reference
> >> >> as defined
> >> >> DW> in [RFC2396], or must result in a URI reference after the
> >> >> escaping
> >> >> DW> procedure described below is applied. The procedure is
> >> >> applied when
> >> >> DW> passing the URI reference to a URI resolver."
> >> >>
> >> >> DW> Interesting to see the last sentence here. IMO this
> >> means, it is
> >> >> DW> perfectly legal to store the URI reference using any
> >> encoding, as
> >> >> DW> long as it will be transcoded to UTF-8 and escaped before
> >> >> passing it on to a URI resolver.
> >> >>
> >> >> DW> This has always been my understanding, and this is how
> >> all of our
> >> >> DW> products have been handling references.
> >> >>
> >> >> DW> NOTE:
> >> >> DW> If we required an escaped string inside the CGM now, this
> >> >> will make
> >> >> DW> almost all existing files invalid ones as soon as a
> >> >> simple space is
> >> >> DW> in a name attribute.
> >> >>
> >> >> DW> RECOMMENDATION:
> >> >> DW> Amend wording slightly to match what SVG is doing and
> >> >> DW> allow for both styles, escaped and not escaped.
> >> >>
> >> >> DW> Comments?
> >> >>
> >> >> DW> Regards,
> >> >> DW> Dieter
> >> >>
> >> >>
> >> >> DW> [1] http://www.w3.org/TR/SVG11/struct.html#xlinkRefAttrs
> >> >>
> >> >>
> >> >> >> -----Original Message-----
> >> >> >> From: Lofton Henderson [mailto:lofton@rockynet.com]
> >> >> >> Sent: Wednesday, October 05, 2005 1:06 AM
> >> >> >> To: Benoit Bezaire; cgmo-webcgm@lists.oasis-open.org
> >> >> >> Subject: Re: [cgmo-webcgm] implications of URI vs. IRI
> >> >> >>
> >> >> >> At 05:09 PM 10/4/2005 -0400, Benoit Bezaire wrote:
> >> >> >> >Hi Lofton,
> >> >> >> >
> >> >> >> >I just did a quick search... I think that URI is only
> >> restricting
> >> >> >> >characters to US-ASCII; it has no control on the
> >> encoding (utf-8,
> >> >> >> >utf-16 etc...).
> >> >> >> >
> >> >> >> >In XML syntax such as XHTML and SVG, files can have just
> >> >> about any
> >> >> >> >encoding; I'm not aware of any special processing for the
> >> >> xlink:href
> >> >> >> >attribute (i.e., this is a URI, change the encoding to
> >> >> _blah_). It
> >> >> >> >wouldn't make any sense. The scope of the encoding is for
> >> >> >> the complete
> >> >> >> >document.
> >> >> >> >
> >> >> >> >The above is not a fact, only my understanding.
> >> >> >>
> >> >> >> It matches my understanding. And it is clear that XML
> >> and/or URI
> >> >> >> (rfc3986) require "URI escaping" for non-ASCII
> >> characters in URIs,
> >> >> >> i.e., for character that are outside of the ASCII
> >> repertoire. And
> >> >> >> this is independent of the character-set encoding of the URI.
> >> >> >>
> >> >> >> So finally, a URI from HTML into CGM containing a
> >> >> reference-by-name
> >> >> >> to "my object group" would be written like this:
> >> >> >>
> >> >> >> <a href="http://example.org/myCGM.cgm#name(my%20object%20group)">blah</a>
> >> >> >>
> >> >> >> and a WebCGM 'linkuri' first parameter would be this:
> >> >> >>
> >> >> >> http://example.org/myCGM.cgm#name(my%20object%20group)
> >> >> >>
> >> >> >> -Lofton.
> >> >> >>
> >> >> >>
> >> >> >> >Tuesday, September 20, 2005, 2:45:48 PM, Lofton wrote:
> >> >> >> >
> >> >> >> >LH> All --
> >> >> >> >
> >> >> >> >LH> When I was putting together first unicode tests,
> >> Dieter also
> >> >> >> >LH> supplied me with this nifty "advanced" test. It gets
> >> >> >> into Japanese
> >> >> >> >LH> text for SF text like APS ids and names.
> >> >> >> >
> >> >> >> >LH> It highlights an interesting implication of our decision
> >> >> >> to stick
> >> >> >> >LH> with URI instead of switching to IRI. URI encoding
> >> >> >> requires that
> >> >> >> >LH> any non-ASCII characters are included by the
> "URI escaping
> >> >> >> >LH> mechanism", see WebCGM
> >> >> >> >3.1.1.4
> >> >> >> >LH> [1], and the more detailed XML description [2].
> >> >> >> Basically, get the
> >> >> >> >LH> **UTF8** representation of the characters, and replace
> >> >> >> each byte in
> >> >> >> >LH> that representation by the 3-character string %HH, where
> >> >> >> HH is the
> >> >> >> >LH> hex representation of the byte.
> >> >> >> >
> >> >> >> >LH> So consider for example the 2-character id of the object
> >> >> >> >LH> in the upper-left box, and its use in a link from the
> >> >> >> >LH> object in the upper-right box.
> >> >> >> >
> >> >> >> >LH> If that id were the two characters c1c2, lets suppose
> >> >> >> that it could
> >> >> >> >LH> be represented by the 4 utf8 bytes b1b2b3b4 (I'm just
> >> >> guessing
> >> >> >> >LH> about "4", since UTF8 is variable length, it could be
> >> >> >> more). Then
> >> >> >> >LH> to put that id
> >> >> >> >into
> >> >> >> >LH> a URI string, it would have to be the
> 12-character string:
> >> >> >> >
> >> >> >> >LH> %hh%hh%hh%hh
> >> >> >> >
> >> >> >> >LH> where the hh are the 4 pairs of hex digits that represent
> >> >> >> >LH> the 4 utf8 bytes. I.e., the CGM URI for the link would be:
> >> >> >> >
> >> >> >> >LH> #id(%hh%hh%hh%hh, view_context)
> >> >> >> >
> >> >> >> >LH> Side question. Does URI (rfc3986 [3]) restrict only the
> >> >> >> character
> >> >> >> >LH> repertoire of the URI, or does it restrict also the
> >> >> >> encoding? I.e.,
> >> >> >> >LH> can a URI be encoded in ascii, isoLatin1, or utf8, or
> >> >> utf16, or
> >> >> >> >LH> whatever, as
> >> >> >> >long
> >> >> >> >LH> as it restricts its repertoire to the URI repertoire?
> >> >> I suspect
> >> >> >> >"yes", but
> >> >> >> >LH> I don't know the answer. It would be interesting for
> >> >> someone to
> >> >> >> >research it.
> >> >> >> >
> >> >> >> >LH> Thoughts?
> >> >> >> >
> >> >> >> >LH> Regards,
> >> >> >> >LH> -Lofton.
> >> >> >> >
> >> >> >> >LH> [0] http://docs.oasis-open.org/webcgm/v2.0/WebCGM20-IC.html#webcgm_3_1_1_4
> >> >> >> >LH> [1] http://www.w3.org/TR/2004/REC-xml-20040204/#sec-external-ent
> >> >> >> >LH> [3] URI: http://www.ietf.org/rfc/rfc3986.txt
> >> >> >> >LH> [4] IRI: http://www.ietf.org/rfc/rfc3987.txt
>
>
>
>
