Subject: RE: [cgmo-webcgm] UTF-8 & UTF-16 sequences (ISSUEs)
From: Dieter Weidenbruck <dieter@itedo.com>
To: "Lofton Henderson" <lofton@rockynet.com>,"Robert Orosz" <roboro@AUTO-TROL.com>,<cgmo-webcgm@lists.oasis-open.org>
Date: Sat, 25 Jun 2005 20:52:29 +0200

I strongly agree with Lofton.

> -----Original Message-----
> From: Lofton Henderson [mailto:lofton@rockynet.com]
> Sent: Saturday, June 25, 2005 7:28 PM
> To: Robert Orosz; cgmo-webcgm@lists.oasis-open.org
> Subject: RE: [cgmo-webcgm] UTF-8 & UTF-16 sequences (ISSUEs)
>
>
> Since two people have opted for "Erratum for 1.0", we should be clear
> about what the consequences are:
>
> A.) all existing 1.0 files that invoke UTFx, which have been
> 1.0-conforming since 1999, would become non-conforming on the day that
> the erratum becomes effective.  Thus, MetaCheck (once upgraded for the
> erratum) would declare all such existing 1.0 files non-conforming.  This
> is the way errata work in ISO and W3C.
>
> B.) all existing currently conforming 1.0 generators in the field would
> become non-conforming.
>
> C.) all existing currently conforming 1.0 viewers in the field would
> become non-conforming, and if some generators started putting out "new"
> 1.0 files, all existing 1.0 viewers in the field, which previously
> handled all 1.0 files fine, would now begin to malfunction on
> "conforming" 1.0 files.
>
> At 09:53 AM 6/24/2005 -0600, Robert Orosz wrote:
> >[...]
> >Regarding ISSUE 1, I think an erratum should be issued for WebCGM 1.0,
> >because it is, well, wrong.
>
> I guess I take a pragmatic view.  What is worse, that we are "wrong" in
> some formal sense?  Or that we create real chaos and confusion about
> valid 1.0 content and invalidate otherwise fine legacy 1.0
> implementations?
>
> >Essentially, you are proposing that the
> >sequences ESC 2/5 4/9 and ESC 2/5 4/12 are private codes for UTF-8 and
> >UTF-16 respectively.
>
> I don't see it that way.  For SF, the ISO 2022 sequences are used in the
> content ('id' string of BegMet).  They always were correct and remain
> correct:
>
> ESC 2/5 2/15 4/9
> ESC 2/5 2/15 4/12
>
> For CSL entries, WebCGM 1.0 normatively stated:
>
> the two-part parameter "'complete code', 4/9" means UTF8
> the two-part parameter "'complete code', 4/12" means UTF16
>
> It is unambiguous in the context of WebCGM 1.0.  It is, formally
> speaking, wrong with respect to how the clauses of CGM:1999 say these
> should be derived from the correct 2022 sequences...
>
> >However, section 6.3.4.3 of CGM:1999 states under the
> >"CHARACTER SETS INTENDED TO BE DESIGNATED AS COMPLETE CODES." paragraph:
> >
> >.... "If <F> is from column 3, the coding system is a private code. If
> ><F> is from columns 4 through 7, it is a code for which a designating
> >and invoking escape sequence has been registered in the International
> >Register Of Coded Character Sets To Be Used With Escape Sequences."
> >
> >So, the sequences ESC 2/5 3/9 and ESC 2/5 3/12 would have been acceptable
> >instead.
> >
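A minimal sketch of the column rule quoted above, assuming only what the quoted passage of CGM:1999 6.3.4.3 says (column 3 is private, columns 4 through 7 are registered). The function name `final_byte_kind` is hypothetical and is not taken from any specification or product.

```python
def final_byte_kind(final_byte: int) -> str:
    """Classify a final byte <F> by its column, per the rule quoted above.

    In column/row notation the column is the high nibble of the octet,
    so 4/9 (0x49) is in column 4 and 3/9 (0x39) is in column 3.
    """
    column = final_byte >> 4
    if column == 3:
        return "private"
    if 4 <= column <= 7:
        return "registered"
    return "not covered by the quoted rule"


# 4/9 and 4/12, the final bytes WebCGM 1.0 uses, fall in column 4.
print(final_byte_kind(0x49), final_byte_kind(0x4C))
# 3/9 and 3/12, the alternatives mentioned above, fall in column 3.
print(final_byte_kind(0x39), final_byte_kind(0x3C))
```

The sketch only applies the column test; it does not check whether a given final byte actually appears in the international register.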
> >Section 9.1.3 of CGM:1999 states:
> >
> >"A profile of ISO/IEC 8632 shall not specify any requirement that would
> >contradict or cause non-conformance to ISO/IEC 8632."
> >
> >In my opinion, WebCGM 1.0 is contradicting ISO/IEC 8632 by designating an
> >escape sequence with a final byte from column 4 as a private code for the
> >Character Set List element.  Issuing an erratum on this would alert users
> >that what is currently specified is wrong, and it is being deprecated and
> >changed in WebCGM 2.0.
>
> With chaotic practical implications for existing and new 1.0 content, and
> existing and new 1.0 implementations.  We should seriously consider,
> before we do that, whether being formally "correct" offers sufficient
> advantages to outweigh the very real pragmatic downside.
>
> Especially since WebCGM doesn't actually use ISO2022 (generalized
> intra-string character set *switching* via control sequences) in any real
> sense, but rather has a couple of mechanisms that are based on a very
> narrow technical detail of 2022 -- no one is going to write a 2022
> processor for WebCGM, which processor would likely get confused if a
> wrong sequence showed up in a string.
>
> Btw, I could point out several ways in which WebCGM arguably violates the
> CGM:1999 standard.
>
> Regards,
> -Lofton.
>
>
> >Regarding ISSUE 2, either Alt.2 or Alt.3 is fine with me.
> >
> >Regards,
> >
> >Rob
> >
> >-----Original Message-----
> >From: Lofton Henderson [mailto:lofton@rockynet.com]
> >Sent: Thursday, June 23, 2005 5:29 PM
> >To: cgmo-webcgm@lists.oasis-open.org
> >Subject: [cgmo-webcgm] UTF-8 & UTF-16 sequences (ISSUEs)
> >
> >
> >WebCGM TC,
> >
> >I have an action item to research "UTF-x sequence tails".  Thanks to
> >Forrest for providing me some references and some motivation, I have
> >gotten the information, and I make recommendations below.
> >
> >[1] http://www.unihan.com.cn/Cjk/ana18.htm
> >[2] http://www.unihan.com.cn/Cjk/ana19.htm
> >
> >At [1] and [2], we find the ISO/IEC 2022 escape sequences:
> >
> >UTF-8 implementation level 3:  ESC 2/5 2/15 4/9
> >UTF-16  implementation level 3:  ESC 2/5 2/15 4/12
> >
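A small sketch of how the column/row notation used in these sequences maps to octets, assuming the usual ISO 2022 convention that "c/r" denotes the octet 16*c + r and that ESC is 1/11 (0x1B). The helper names are hypothetical.

```python
def col_row_to_octet(token: str) -> int:
    """Convert one token such as '2/5' (or the mnemonic 'ESC') to an octet."""
    if token == "ESC":
        return 0x1B  # ESC is position 1/11
    col_text, row_text = token.split("/")
    col, row = int(col_text), int(row_text)
    if not (0 <= col <= 15 and 0 <= row <= 15):
        raise ValueError(f"column and row must be in 0..15: {token!r}")
    return col * 16 + row


def sequence_to_bytes(sequence: str) -> bytes:
    """Convert a whitespace-separated sequence such as 'ESC 2/5 2/15 4/9'."""
    return bytes(col_row_to_octet(token) for token in sequence.split())


UTF8_LEVEL3 = sequence_to_bytes("ESC 2/5 2/15 4/9")
UTF16_LEVEL3 = sequence_to_bytes("ESC 2/5 2/15 4/12")
print(UTF8_LEVEL3.hex(" "), UTF16_LEVEL3.hex(" "))
```

Written as ASCII characters after the ESC, the two sequences are `%/I` and `%/L`.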
> >At [3], I found a lucid explanation of this stuff, and particularly what
> >"implementation level 1,2,3" mean.  In the past, we chose implementation
> >level 3 (whether or not it was a well-considered decision is another
> >question).
> >
> >[3] http://www.nada.kth.se/i18n/ucs/unicode-iso10646-oview.html#3
> >
> >Separate the cases of non-graphical text (SF) and graphical text (S) in
> >WebCGM 1.0.
> >
> >Non-graphical text (SF):  T.14.5
> >-----
> >T.14.5 says that the metafile id (BEGIN METAFILE parameter, type SF)
> >shall have as its first 4 octets one of the 4-octet sequences above, to
> >declare for the whole metafile that SF is UTF-8 or UTF-16.
> >
> >Conclusion:  no problem here.
> >
> >Graphical text (S):  T.16.14
> >-----
> >T.16.14 takes the last character of the above 4-octet sequences as the
> >'tail', for use in the CHARACTER SET LIST (CSL) element.  So the two-part
> >data for CSL are specified as:
> >
> >UTF-8 implementation level 3:  'complete code', 4/9
> >UTF-16  implementation level 3:  'complete code', 4/12
> >
> >This was based on information in CGM:1999 section 6.3.4.3, that
> >characterizes the escape sequences for complete codes as:  ESC 2/5 I* F.
> >I* is zero or more "intermediate characters", and F is a single final
> >character.  WebCGM 1.0 took only F for the tail.  But CGM:1999 says:
> >
> > >The character set declaration ... consists of 'complete code' followed
> > >by a string consisting of those characters in the code's ISO 2022
> > >escape sequence which come after the first two characters, ESC  2/5.
> >
> >Conclusion:  WebCGM 1.0 is wrong for the CSL tails for UTF-8 and
> >UTF-16.  The CSL data should be:
> >
> >UTF-8 implementation level 3:  'complete code', 2/15 4/9
> >UTF-16  implementation level 3:  'complete code', 2/15 4/12
> >
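The difference between the two tail derivations can be sketched as follows. This is an illustration only, assuming the escape sequences are held as raw octets; the function names are hypothetical and do not come from CGM:1999 or WebCGM.

```python
ESC_2_5 = b"\x1b\x25"  # the first two characters, ESC 2/5


def csl_tail_cgm1999(escape_sequence: bytes) -> bytes:
    """Tail per the CGM:1999 wording: everything after ESC 2/5 (I* and F)."""
    if not escape_sequence.startswith(ESC_2_5):
        raise ValueError("not a 'complete code' escape sequence")
    return escape_sequence[len(ESC_2_5):]


def csl_tail_webcgm10(escape_sequence: bytes) -> bytes:
    """Tail as WebCGM 1.0 specified it: only the final character F."""
    if not escape_sequence.startswith(ESC_2_5):
        raise ValueError("not a 'complete code' escape sequence")
    return escape_sequence[-1:]


utf8_seq = b"\x1b\x25\x2f\x49"   # ESC 2/5 2/15 4/9
utf16_seq = b"\x1b\x25\x2f\x4c"  # ESC 2/5 2/15 4/12

print(csl_tail_cgm1999(utf8_seq), csl_tail_webcgm10(utf8_seq))
print(csl_tail_cgm1999(utf16_seq), csl_tail_webcgm10(utf16_seq))
```

The 1.0 form drops the intermediate character 2/15, which is the whole of the discrepancy discussed here.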
> >ISSUES:
> >===
> >ISSUE 1:  Should we issue an erratum for WebCGM 1.0?
> >
> >Alternatives:
> >Alt.1:  No
> >Alt.2:  Yes
> >
> >Recommendation for Issue 1:  Alt.1, No.
> >
> >Discussion:  CGM:1999 CSL is not really an implementation of ISO 2022,
> >but rather takes concepts and bits of escape sequences as parameters for
> >the CSL to designate character sets.  WebCGM 1.0 lists data that is to
> >be used to designate 6 character sets.  Though wrong according to
> >ISO2022, on the other hand these are effectively just tokens to select
> >the 6 char. sets, and it is unambiguous in the context of WebCGM.  To
> >change WebCGM 1.0 by erratum will invalidate existing WebCGM 1.0
> >products in the field, for new WebCGM 1.0 content.  And would cause
> >existing "valid" 1.0 content to become invalid.  It's not worth it, IMO.
> >
> >ISSUE 2:  Should we correct it for WebCGM 2.0?
> >
> >Alternatives:
> >Alt.1:  No
> >Alt.2:  Yes
> >Alt.3:  Yes, but do it by deprecation of the old 1.0 forms (2.0
> >generators shall generate only the 2.0 forms, 2.0 viewers shall accept
> >1.0 forms as well as 2.0 forms)
> >
> >Recommendation for Issue 2:  Alt.3, Yes, but by deprecation of old.
> >
> >Discussion:  If generators are writing 2.0 files, and they put out the
> >proper forms, then there really shouldn't be a problem with old (1.0)
> >viewers in the field -- they won't understand other 2.0 stuff anyway.
> >2.0 generators and 2.0 viewers will be using "correct" forms.
> >
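If Alt.3 were adopted, the behaviour of a 2.0 viewer and a 2.0 generator could look roughly like this. It is a sketch under that assumption only; the table and function names are hypothetical, and the tails are the 'complete code' CSL strings discussed in this thread.

```python
from typing import Tuple

# Corrected 2.0 tails (2/15 4/9 and 2/15 4/12) and deprecated 1.0 tails.
TAILS_2_0 = {b"/I": "UTF-8", b"/L": "UTF-16"}
TAILS_1_0_DEPRECATED = {b"I": "UTF-8", b"L": "UTF-16"}


def viewer_lookup(tail: bytes) -> Tuple[str, bool]:
    """Return (encoding, is_deprecated_form) for a 'complete code' tail.

    A 2.0 viewer accepts the 1.0 forms as well as the 2.0 forms.
    """
    if tail in TAILS_2_0:
        return TAILS_2_0[tail], False
    if tail in TAILS_1_0_DEPRECATED:
        return TAILS_1_0_DEPRECATED[tail], True
    raise ValueError(f"unrecognised 'complete code' tail: {tail!r}")


def generator_tail(encoding: str) -> bytes:
    """A 2.0 generator emits only the 2.0 form."""
    for tail, name in TAILS_2_0.items():
        if name == encoding:
            return tail
    raise ValueError(f"no 'complete code' tail for {encoding!r}")


print(viewer_lookup(b"I"), viewer_lookup(b"/I"), generator_tail("UTF-16"))
```

The deprecation flag lets a viewer or a validation tool warn about old-form content without rejecting it.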
> >Thoughts?
> >
> >-Lofton.
>
>
>