
Subject: RE: [cgmo-webcgm] UTF-8 & UTF-16 sequences (ISSUEs)
From: Lofton Henderson <lofton@rockynet.com>
To: Robert Orosz <roboro@AUTO-TROL.com>,cgmo-webcgm@lists.oasis-open.org
Date: Sat, 25 Jun 2005 11:27:33 -0600
Since two people have opted for "Erratum for 1.0", we should be clear what 
are the consequences:

A.) all existing 1.0 files that invoke UTFx, which have been 
1.0-conforming since 1999, would become non-conforming on the day that the 
erratum becomes effective.  Thus, MetaCheck (once upgraded for the erratum) 
would declare all existing 1.0 files non-conforming.  This is the way 
errata work in ISO and W3C.

B.) all existing currently conforming 1.0 generators in the field would 
become non-conforming.

C.) all existing currently conforming 1.0 viewers in the field would become 
non-conforming, and if some generators started putting out "new" 1.0 files, 
all existing 1.0 viewers in the field, which previously handled all 1.0 
files fine, would now begin to malfunction on "conforming" 1.0 files.

At 09:53 AM 6/24/2005 -0600, Robert Orosz wrote:
>[...]
>Regarding ISSUE 1, I think an erratum should be issued for WebCGM 1.0,
>because it is, well, wrong.

I guess I take a pragmatic view.  What is worse, that we are "wrong" in 
some formal sense?  Or that we create real chaos and confusion about valid 
1.0 content and invalidate otherwise fine legacy 1.0 implementations?

>Essentially, you are proposing that the
>sequences ESC 2/5 4/9 and ESC 2/5 4/12 are private codes for UTF-8 and
>UTF-16 respectively.

I don't see it that way.  For SF, the ISO 2022 sequences are used in the 
content ('id' string of BegMet).  They always were correct and remain correct:

ESC 2/5 2/15 4/9
ESC 2/5 2/15 4/12
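
To make the column/row notation concrete: position c/r is the octet 16*c + r, so the two sequences above are the bytes 1B 25 2F 49 and 1B 25 2F 4C. Here is a minimal Python sketch of that conversion, plus a check of the leading octets of a metafile id. The helper names (octet, sequence, sf_encoding) are hypothetical and come from neither WebCGM nor CGM:1999.

```python
from typing import Optional

ESC = 0x1B  # ESC sits at position 1/11 in column/row notation


def octet(pos: str) -> int:
    """Convert column/row notation such as '4/9' to its byte value."""
    col, row = (int(part) for part in pos.split("/"))
    if not (0 <= col <= 15 and 0 <= row <= 15):
        raise ValueError(f"bad position: {pos}")
    return 16 * col + row


def sequence(*positions: str) -> bytes:
    """Build ESC followed by the given positions."""
    return bytes([ESC] + [octet(p) for p in positions])


UTF8_SEQ = sequence("2/5", "2/15", "4/9")    # ESC % / I
UTF16_SEQ = sequence("2/5", "2/15", "4/12")  # ESC % / L


def sf_encoding(metafile_id: bytes) -> Optional[str]:
    """Hypothetical check: which encoding do the first 4 octets declare?"""
    if metafile_id.startswith(UTF8_SEQ):
        return "UTF-8"
    if metafile_id.startswith(UTF16_SEQ):
        return "UTF-16"
    return None  # no declaration found
```

For example, sf_encoding(UTF8_SEQ + b"my metafile") returns "UTF-8" under these assumptions.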

For CSL entries, WebCGM 1.0 normatively stated:

the two-part parameter "'complete code', 4/9" means UTF8
the two-part parameter "'complete code', 4/12" means UTF16

It is unambiguous in the context of WebCGM 1.0.  It is, formally speaking, 
wrong with respect to how the clauses of CGM:1999 say these should be 
derived from the correct 2022 sequences...
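
The difference between the two derivations can be sketched in a few lines of Python. This is only an illustration under my reading of the CGM:1999 wording (the tail is everything after ESC 2/5) versus the WebCGM 1.0 choice (the final character only). The function names are made up for the example.

```python
ESC_PERCENT = b"\x1b%"  # ESC 2/5, the common prefix of complete-code sequences


def cgm1999_tail(escape_sequence: bytes) -> bytes:
    """Everything after the first two characters (ESC 2/5)."""
    if not escape_sequence.startswith(ESC_PERCENT):
        raise ValueError("not a complete-code escape sequence")
    return escape_sequence[len(ESC_PERCENT):]


def webcgm10_tail(escape_sequence: bytes) -> bytes:
    """Only the final character F, as WebCGM 1.0 specified."""
    return cgm1999_tail(escape_sequence)[-1:]


def notation(data: bytes) -> str:
    """Render bytes in column/row notation, e.g. b'/I' -> '2/15 4/9'."""
    return " ".join(f"{b >> 4}/{b & 0x0F}" for b in data)


for name, seq in (("UTF-8", b"\x1b%/I"), ("UTF-16", b"\x1b%/L")):
    print(name,
          "| CGM:1999 tail:", notation(cgm1999_tail(seq)),
          "| WebCGM 1.0 tail:", notation(webcgm10_tail(seq)))
```

For UTF-8 this prints a CGM:1999 tail of 2/15 4/9 against a WebCGM 1.0 tail of 4/9. For UTF-16 it prints 2/15 4/12 against 4/12.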

>However, section 6.3.4.3 of CGM:1999 states under the
>"CHARACTER SETS INTENDED TO BE DESIGNATED AS COMPLETE CODES." paragraph:
>
>.... "If <F> is from column 3, the coding system is a private code. If <F>
>is from columns 4 through 7, it is a code for which a designating and
>invoking escape sequence has been registered in the International Register
>Of Coded Character Sets To Be Used With Escape Sequences."
>
>So, the sequences ESC 2/5 3/9 and ESC 2/5 3/12 would have been acceptable
>instead.
>
>Section 9.1.3 of CGM:1999 states:
>
>"A profile of ISO/IEC 8632 shall not specify any requirement that would
>contradict or cause non-conformance to ISO/IEC 8632."
>
>In my opinion, WebCGM 1.0 is contradicting ISO/IEC 8632 by designating an
>escape sequence with a final byte from column 4 as a private code for the
>Character Set List element.  Issuing an erratum on this would alert users
>that what is currently specified is wrong, and it is being deprecated and
>changed in WebCGM 2.0.

With chaotic practical implications for existing and new 1.0 content, and 
existing and new 1.0 implementations.  Before we do that, we should 
seriously consider whether being formally "correct" offers advantages 
that outweigh the very real pragmatic downside.

Especially since WebCGM doesn't actually use ISO 2022 (generalized 
intra-string character set *switching* via control sequences) in any real 
sense, but rather has a couple of mechanisms that are based on a very 
narrow technical detail of 2022.  No one is going to write a 2022 
processor for WebCGM, so there is no processor that would get confused if a 
wrong sequence showed up in a string.

Btw, I could point out several ways in which WebCGM arguably violates the 
CGM:1999 standard.

Regards,
-Lofton.


>Regarding ISSUE 2, either Alt.2 or Alt.3 is fine with me.
>
>Regards,
>
>Rob
>
>-----Original Message-----
>From: Lofton Henderson [mailto:lofton@rockynet.com]
>Sent: Thursday, June 23, 2005 5:29 PM
>To: cgmo-webcgm@lists.oasis-open.org
>Subject: [cgmo-webcgm] UTF-8 & UTF-16 sequences (ISSUEs)
>
>
>WebCGM TC,
>
>I have an action item to research "UTF-x sequence tails".  Thanks to
>Forrest for providing me some references and some motivation, I have gotten
>the information, and I make recommendations below.
>
>[1] http://www.unihan.com.cn/Cjk/ana18.htm
>[2] http://www.unihan.com.cn/Cjk/ana19.htm
>
>At [1] and [2], we find the ISO/IEC 2022 escape sequences:
>
>UTF-8 implementation level 3:  ESC 2/5 2/15 4/9
>UTF-16  implementation level 3:  ESC 2/5 2/15 4/12
>
>At [3], I found a lucid explanation of this stuff, and particularly what
>"implementation level 1,2,3" mean.  In the past, we chose implementation
>level 3 (whether or not it was a well-considered decision is another
>question).
>
>[3] http://www.nada.kth.se/i18n/ucs/unicode-iso10646-oview.html#3
>
>Separate the cases of non-graphical text (SF) and graphical text (S) in
>WebCGM 1.0.
>
>Non-graphical text (SF):  T.14.5
>-----
>T.14.5 says that the metafile id (BEGIN METAFILE parameter, type SF) shall
>have as its first 4 octets the 4-octet sequences above, to declare for the
>whole metafile that SF is UTF-8 or UTF-16.
>
>Conclusion:  no problem here.
>
>Graphical text (S):  T.16.14
>-----
>T.16.14 takes the last character of the above 4-octet sequences as the
>'tail', for use in the CHARACTER SET LIST (CSL) element.  So the two-part
>data for CSL are specified as:
>
>UTF-8 implementation level 3:  'complete code', 4/9
>UTF-16  implementation level 3:  'complete code', 4/12
>
>This was based on information in CGM:1999 section 6.3.4.3, that
>characterizes the escape sequences for complete codes as:  ESC 2/5 I* F.
>I* is zero or more "intermediate characters", and F is a single final
>character.  WebCGM 1.0 took only F for the tail.  But CGM:1999 says:
>
> >The character set declaration ... consists of 'complete code' followed by
> >a string consisting of those characters in the code's ISO 2022 escape
> >sequence which come after the first two characters, ESC  2/5.
>
>Conclusion:  WebCGM 1.0 is wrong for the CSL tails for UTF-8 and
>UTF-16.  The CSL data should be:
>
>UTF-8 implementation level 3:  'complete code', 2/15 4/9
>UTF-16  implementation level 3:  'complete code', 2/15 4/12
>
>ISSUES:
>===
>ISSUE 1:  Should we issue an erratum for WebCGM 1.0?
>
>Alternatives:
>Alt.1:  No
>Alt.2:  Yes
>
>Recommendation for Issue 1:  Alt.1, No.
>
>Discussion:  CGM:1999 CSL is not really an implementation of ISO 2022, but
>rather takes concepts and bits of escape sequences as parameters for the
>CSL to designate character sets.  WebCGM 1.0 lists data that is to be used
>to designate 6 character sets.  Though wrong according to ISO 2022, these
>are effectively just tokens to select the 6 character sets, and they are
>unambiguous in the context of WebCGM.  Changing WebCGM 1.0 by erratum would
>invalidate existing WebCGM 1.0 products in the field for new WebCGM 1.0
>content, and would cause existing "valid" 1.0 content to become invalid.
>It's not worth it, IMO.
>
>ISSUE 2:  Should we correct it for WebCGM 2.0?
>
>Alternatives:
>Alt.1:  No
>Alt.2:  Yes
>Alt.3:  Yes, but do it by deprecation of the old 1.0 forms (2.0 generators
>shall generate only the 2.0 forms, 2.0 viewers shall accept 1.0 forms as
>well as 2.0 forms)
>
>Recommendation for Issue 2:  Alt.3, Yes, but by deprecation of old.
>
>Discussion:  If generators are writing 2.0 files, and they put out the
>proper forms, then there really shouldn't be a problem with old (1.0)
>viewers in the field -- they won't understand other 2.0 stuff anyway.  2.0
>generators and 2.0 viewers will be using "correct" forms.
>
>Thoughts?
>
>-Lofton.
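
The Alt.3 deprecation policy for ISSUE 2 can be illustrated with a short Python sketch: a 2.0 viewer accepts both the 2.0 tails and the deprecated 1.0 tails, while a 2.0 generator emits only the 2.0 forms. The table and function names are hypothetical, and the sketch covers only the UTF-8 and UTF-16 'complete code' entries discussed in this thread.

```python
from typing import Tuple

# 'complete code' tails for the CHARACTER SET LIST, as raw bytes.
TAILS_2_0 = {b"/I": "UTF-8", b"/L": "UTF-16"}  # 2/15 4/9 and 2/15 4/12
TAILS_1_0 = {b"I": "UTF-8", b"L": "UTF-16"}    # 4/9 and 4/12 (deprecated)


def viewer_lookup(tail: bytes) -> Tuple[str, bool]:
    """2.0 viewer side: accept both forms; return (encoding, deprecated)."""
    if tail in TAILS_2_0:
        return TAILS_2_0[tail], False
    if tail in TAILS_1_0:
        return TAILS_1_0[tail], True
    raise ValueError(f"unknown complete-code tail: {tail!r}")


def generator_tail(encoding: str) -> bytes:
    """2.0 generator side: emit only the 2.0 form."""
    for tail, name in TAILS_2_0.items():
        if name == encoding:
            return tail
    raise ValueError(f"unsupported encoding: {encoding}")
```

Under these assumptions viewer_lookup(b"I") returns ("UTF-8", True), flagging the old 1.0 form as deprecated, while generator_tail("UTF-8") returns b"/I".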