Subject: RE: [cgmo-webcgm] UTF-8 & UTF-16 sequences (ISSUEs)
From: Robert Orosz <roboro@AUTO-TROL.com>
To: cgmo-webcgm@lists.oasis-open.org
Date: Fri, 24 Jun 2005 09:53:48 -0600
Lofton,

Regarding ISSUE 1, I think an erratum should be issued for WebCGM 1.0,
because it is, well, wrong.  Essentially, you are proposing that the
sequences ESC 2/5 4/9 and ESC 2/5 4/12 are private codes for UTF-8 and
UTF-16 respectively.  However, section 6.3.4.3 of CGM:1999 states under the
"CHARACTER SETS INTENDED TO BE DESIGNATED AS COMPLETE CODES." paragraph:

.... "If <F> is from column 3, the coding system is a private code. If <F>
is from columns 4 through 7, it is a code for which a designating and
invoking escape sequence has been registered in the International Register
Of Coded Character Sets To Be Used With Escape Sequences."

So, the sequences ESC 2/5 3/9 and ESC 2/5 3/12 would have been acceptable
instead.
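
To make the column rule concrete, here is a minimal Python sketch. It
assumes the usual ISO 2022 column/row notation, where c/r stands for the
byte value 16*c + r; the function name is hypothetical and the code is
illustrative only, not taken from CGM:1999.

```python
def classify_final_byte(final: int) -> str:
    """Classify the final byte <F> of an ESC 2/5 ... <F> sequence.

    Assumes column/row notation c/r == 16*c + r, so the column is the
    high nibble of the byte. Hypothetical helper, for illustration only.
    """
    column = final >> 4
    if column == 3:
        return "private"      # column 3: private code
    if 4 <= column <= 7:
        return "registered"   # columns 4-7: registered escape sequence
    raise ValueError(f"not a valid final byte: {final:#04x}")


# 4/9 and 4/12 (the WebCGM 1.0 tails) fall in column 4;
# 3/9 and 3/12 would fall in column 3.
for name, value in [("4/9", 0x49), ("4/12", 0x4C), ("3/9", 0x39), ("3/12", 0x3C)]:
    print(name, classify_final_byte(value))
```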

Section 9.1.3 of CGM:1999 states:

"A profile of ISO/IEC 8632 shall not specify any requirement that would
contradict or cause non-conformance to ISO/IEC 8632."

In my opinion, WebCGM 1.0 is contradicting ISO/IEC 8632 by designating an
escape sequence with a final byte from column 4 as a private code for the
Character Set List element.  Issuing an erratum on this would alert users
that what is currently specified is wrong, and it is being deprecated and
changed in WebCGM 2.0.

Regarding ISSUE 2, either Alt.2 or Alt.3 is fine with me.

Regards,

Rob

-----Original Message-----
From: Lofton Henderson [mailto:lofton@rockynet.com]
Sent: Thursday, June 23, 2005 5:29 PM
To: cgmo-webcgm@lists.oasis-open.org
Subject: [cgmo-webcgm] UTF-8 & UTF-16 sequences (ISSUEs)


WebCGM TC,

I have an action item to research "UTF-x sequence tails".  Thanks to 
Forrest for providing me some references and some motivation, I have gotten 
the information, and I make recommendations below.

[1] http://www.unihan.com.cn/Cjk/ana18.htm
[2] http://www.unihan.com.cn/Cjk/ana19.htm

At [1] and [2], we find the ISO/IEC 2022 escape sequences:

UTF-8 implementation level 3:  ESC 2/5 2/15 4/9
UTF-16  implementation level 3:  ESC 2/5 2/15 4/12
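
For readers less used to the column/row notation, the following Python
sketch converts such a sequence to raw octets. It assumes the usual
convention that c/r means the byte 16*c + r and that ESC is 0x1B; the
function name is hypothetical.

```python
ESC = 0x1B  # the ESCAPE control character


def parse_sequence(text: str) -> bytes:
    """Convert e.g. 'ESC 2/5 2/15 4/9' into its raw octets.

    Hypothetical helper; assumes column/row notation c/r == 16*c + r.
    """
    out = bytearray()
    for token in text.split():
        if token == "ESC":
            out.append(ESC)
        else:
            column, row = token.split("/")
            out.append(int(column) * 16 + int(row))
    return bytes(out)


print(parse_sequence("ESC 2/5 2/15 4/9"))   # UTF-8, implementation level 3
print(parse_sequence("ESC 2/5 2/15 4/12"))  # UTF-16, implementation level 3
```

In ASCII terms, the two sequences are ESC followed by "%/I" and "%/L"
respectively.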

At [3], I found a lucid explanation of this stuff, and particularly what 
"implementation level 1,2,3" mean.  In the past, we chose implementation 
level 3 (whether or not it was a well-considered decision is another
question).

[3] http://www.nada.kth.se/i18n/ucs/unicode-iso10646-oview.html#3

The cases of non-graphical text (SF) and graphical text (S) in WebCGM 1.0
need to be considered separately.

Non-graphical text (SF):  T.14.5
-----
T.14.5 says that the metafile id (BEGIN METAFILE parameter, type SF) shall 
have as its first 4 octets the 4-octet sequences above, to declare for the 
whole metafile that SF is UTF-8 or UTF-16.

Conclusion:  no problem here.
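
As an illustration of the T.14.5 rule, a reader could check the first
4 octets of the metafile id roughly as follows. This is a minimal Python
sketch; the names are hypothetical, and returning None when no
declaration is present is my assumption, not something T.14.5 states.

```python
# The 4-octet ISO 2022 sequences, written as raw bytes:
# ESC 2/5 2/15 4/9  -> 1B 25 2F 49
# ESC 2/5 2/15 4/12 -> 1B 25 2F 4C
SF_DECLARATIONS = {
    b"\x1b%/I": "UTF-8",
    b"\x1b%/L": "UTF-16",
}


def sf_encoding(metafile_id: bytes):
    """Return the declared SF encoding, or None if no declaration is found.

    Hypothetical helper: looks only at the first 4 octets of the id.
    """
    return SF_DECLARATIONS.get(metafile_id[:4])


print(sf_encoding(b"\x1b%/Iexample id"))
print(sf_encoding(b"no declaration"))
```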

Graphical text (S):  T.16.14
-----
T.16.14 takes the last character of the above 4-octet sequences as the 
'tail', for use in the CHARACTER SET LIST (CSL) element.  So the two-part 
data for CSL are specified as:

UTF-8 implementation level 3:  'complete code', 4/9
UTF-16  implementation level 3:  'complete code', 4/12

This was based on information in CGM:1999 section 6.3.4.3, that 
characterizes the escape sequences for complete codes as:  ESC 2/5 I* F.
I* is zero or more "intermediate characters", and F is a single final 
character.  WebCGM 1.0 took only F for the tail.  But CGM:1999 says:

>The character set declaration ... consists of 'complete code' followed by 
>a string consisting of those characters in the code's ISO 2022 escape 
>sequence which come after the first two characters, ESC  2/5.

Conclusion:  WebCGM 1.0 is wrong for the CSL tails for UTF-8 and 
UTF-16.  The CSL data should be:

UTF-8 implementation level 3:  'complete code', 2/15 4/9
UTF-16  implementation level 3:  'complete code', 2/15 4/12
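
The difference between the two readings can be sketched in Python. The
function names are hypothetical; the first follows the CGM:1999 wording
quoted above (everything after ESC 2/5), the second follows what WebCGM
1.0 did (final character only).

```python
PREFIX = b"\x1b%"  # ESC 2/5, the first two characters of the sequence


def csl_tail_cgm1999(sequence: bytes) -> bytes:
    """Tail per the CGM:1999 text: all characters after ESC 2/5."""
    if not sequence.startswith(PREFIX):
        raise ValueError("not a complete-code escape sequence")
    return sequence[len(PREFIX):]


def csl_tail_webcgm10(sequence: bytes) -> bytes:
    """Tail as WebCGM 1.0 specified it: only the final character F."""
    if not sequence.startswith(PREFIX):
        raise ValueError("not a complete-code escape sequence")
    return sequence[-1:]


utf8 = b"\x1b%/I"   # ESC 2/5 2/15 4/9
utf16 = b"\x1b%/L"  # ESC 2/5 2/15 4/12
print(csl_tail_cgm1999(utf8), csl_tail_webcgm10(utf8))
print(csl_tail_cgm1999(utf16), csl_tail_webcgm10(utf16))
```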

ISSUES:
===
ISSUE 1:  Should we issue an erratum for WebCGM 1.0?

Alternatives:
Alt.1:  No
Alt.2:  Yes

Recommendation for Issue 1:  Alt.1, No.

Discussion:  CGM:1999 CSL is not really an implementation of ISO 2022, but 
rather takes concepts and bits of escape sequences as parameters for the 
CSL to designate character sets.  WebCGM 1.0 lists data that is to be used 
to designate 6 character sets.  Though wrong according to ISO 2022, on the 
other hand these are effectively just tokens to select the 6 char. sets, 
and it is unambiguous in the context of WebCGM.  Changing WebCGM 1.0 by 
erratum would invalidate existing WebCGM 1.0 products in the field, for new 
WebCGM 1.0 content.  It would also cause existing "valid" 1.0 content to 
become invalid.  It's not worth it, IMO.

ISSUE 2:  Should we correct it for WebCGM 2.0?

Alternatives:
Alt.1:  No
Alt.2:  Yes
Alt.3:  Yes, but do it by deprecation of the old 1.0 forms (2.0 generators 
shall generate only the 2.0 forms, 2.0 viewers shall accept 1.0 forms as 
well as 2.0 forms)

Recommendation for Issue 2:  Alt.3, Yes, but by deprecation of old.

Discussion:  If generators are writing 2.0 files, and they put out the 
proper forms, then there really shouldn't be a problem with old (1.0) 
viewers in the field -- they won't understand other 2.0 stuff anyway.  2.0 
generators and 2.0 viewers will be using "correct" forms.
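
A rough Python sketch of how Alt.3 might look in an implementation: the
generator side emits only the 2.0 tails, while the viewer side accepts
both and flags the 1.0 forms as deprecated. All names here are
hypothetical, and only the two UTF cases discussed in this message are
covered.

```python
# CSL 'complete code' tails for the two UTF cases.
TAILS_2_0 = {"UTF-8": b"/I", "UTF-16": b"/L"}  # 2/15 4/9 and 2/15 4/12
TAILS_1_0 = {"UTF-8": b"I", "UTF-16": b"L"}    # 4/9 and 4/12 (deprecated)


def generator_tail(encoding: str) -> bytes:
    """A 2.0 generator writes only the 2.0 form."""
    return TAILS_2_0[encoding]


def viewer_decode_tail(tail: bytes):
    """A 2.0 viewer accepts both forms.

    Returns (encoding, deprecated), where deprecated is True for 1.0 forms.
    """
    for table, deprecated in ((TAILS_2_0, False), (TAILS_1_0, True)):
        for encoding, known_tail in table.items():
            if known_tail == tail:
                return encoding, deprecated
    raise ValueError(f"unrecognized CSL tail: {tail!r}")


print(generator_tail("UTF-8"))
print(viewer_decode_tail(b"/L"))
print(viewer_decode_tail(b"I"))
```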

Thoughts?

-Lofton.