cgmo-webcgm message

Subject: UTF-8 & UTF-16 sequences (ISSUEs)
From: Lofton Henderson <lofton@rockynet.com>
To: cgmo-webcgm@lists.oasis-open.org
Date: Thu, 23 Jun 2005 17:29:03 -0600
WebCGM TC,

I have an action item to research "UTF-x sequence tails".  Thanks to 
Forrest for providing me some references and some motivation, I have gotten 
the information, and I make recommendations below.

[1] http://www.unihan.com.cn/Cjk/ana18.htm
[2] http://www.unihan.com.cn/Cjk/ana19.htm

At [1] and [2], we find the ISO/IEC 2022 escape sequences:

UTF-8 implementation level 3:  ESC 2/5 2/15 4/9
UTF-16  implementation level 3:  ESC 2/5 2/15 4/12
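
For readers who do not think in ISO/IEC 2022 column/row notation, here is a
small sketch that turns the two sequences above into octets. The helper
names (col_row_to_byte, sequence_to_bytes) are made up for illustration and
do not come from any of the standards.

```python
def col_row_to_byte(notation: str) -> int:
    """Convert ISO/IEC 2022 'column/row' notation (e.g. '2/5') to an octet."""
    if notation == "ESC":
        return 0x1B  # ESC sits at position 1/11
    col, row = (int(part) for part in notation.split("/"))
    if not (0 <= col <= 15 and 0 <= row <= 15):
        raise ValueError(f"column/row out of range: {notation}")
    return col * 16 + row


def sequence_to_bytes(sequence: str) -> bytes:
    """Convert a space-separated sequence such as 'ESC 2/5 2/15 4/9'."""
    return bytes(col_row_to_byte(token) for token in sequence.split())


UTF8_L3 = sequence_to_bytes("ESC 2/5 2/15 4/9")
UTF16_L3 = sequence_to_bytes("ESC 2/5 2/15 4/12")

print(UTF8_L3.hex(" "))   # 1b 25 2f 49
print(UTF16_L3.hex(" "))  # 1b 25 2f 4c
```

In ASCII terms the two sequences are ESC % / I and ESC % / L.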

At [3], I found a lucid explanation of this stuff, and particularly what 
"implementation level 1,2,3" mean.  In the past, we chose implementation 
level 3 (whether or not it was a well-considered decision is another question).

[3] http://www.nada.kth.se/i18n/ucs/unicode-iso10646-oview.html#3

I treat separately the cases of non-graphical text (SF) and graphical 
text (S) in WebCGM 1.0.

Non-graphical text (SF):  T.14.5
-----
T.14.5 says that the metafile id (BEGIN METAFILE parameter, type SF) shall 
have as its first 4 octets one of the two 4-octet sequences above, to 
declare for the whole metafile that SF is UTF-8 or UTF-16.

Conclusion:  no problem here.
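
As a rough sketch of the T.14.5 rule, a reader could look at the first 4
octets of the metafile id string to learn the SF encoding. The function
name is hypothetical, and the sketch assumes the caller has already
extracted the raw octets of the string from the BEGIN METAFILE element
(any length prefix from the CGM encoding is assumed to be stripped).

```python
# 4-octet declarations from the escape sequences above.
SF_DECLARATIONS = {
    bytes([0x1B, 0x25, 0x2F, 0x49]): "UTF-8",   # ESC 2/5 2/15 4/9
    bytes([0x1B, 0x25, 0x2F, 0x4C]): "UTF-16",  # ESC 2/5 2/15 4/12
}


def sf_encoding_from_metafile_id(metafile_id: bytes):
    """Return 'UTF-8', 'UTF-16', or None if no declaration is present."""
    return SF_DECLARATIONS.get(metafile_id[:4])


print(sf_encoding_from_metafile_id(b"\x1b%/Iexample metafile"))  # UTF-8
print(sf_encoding_from_metafile_id(b"example metafile"))         # None
```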

Graphical text (S):  T.16.14
-----
T.16.14 takes the last character of the above 4-octet sequences as the 
'tail', for use in the CHARACTER SET LIST (CSL) element.  So the two-part 
data for CSL are specified as:

UTF-8 implementation level 3:  'complete code', 4/9
UTF-16  implementation level 3:  'complete code', 4/12

This was based on information in CGM:1999 section 6.3.4.3, which 
characterizes the escape sequences for complete codes as:  ESC 2/5 I* F.
Here I* is zero or more "intermediate characters", and F is a single final 
character.  WebCGM 1.0 took only F for the tail.  But CGM:1999 says:

>The character set declaration ... consists of 'complete code' followed by 
>a string consisting of those characters in the code's ISO 2022 escape 
>sequence which come after the first two characters, ESC  2/5.

Conclusion:  WebCGM 1.0 is wrong for the CSL tails for UTF-8 and 
UTF-16.  The CSL data should be:

UTF-8 implementation level 3:  'complete code', 2/15 4/9
UTF-16  implementation level 3:  'complete code', 2/15 4/12
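
To make the difference concrete, here is a sketch of the two tail rules
side by side: the CGM:1999 wording (everything after ESC 2/5) and what
WebCGM 1.0 actually did (final character F only). The function names are
invented for this illustration.

```python
ESC = 0x1B
COMPLETE_CODE = 0x25  # 2/5, second octet of a complete-code escape sequence


def cgm1999_tail(escape_sequence: bytes) -> bytes:
    """Tail per the CGM:1999 text quoted above: all octets after ESC 2/5."""
    if (len(escape_sequence) < 3
            or escape_sequence[0] != ESC
            or escape_sequence[1] != COMPLETE_CODE):
        raise ValueError("not an ESC 2/5 I* F sequence")
    return escape_sequence[2:]


def webcgm10_tail(escape_sequence: bytes) -> bytes:
    """Tail as WebCGM 1.0 specified it: the final character F only."""
    return cgm1999_tail(escape_sequence)[-1:]


utf8_l3 = bytes([0x1B, 0x25, 0x2F, 0x49])   # ESC 2/5 2/15 4/9
utf16_l3 = bytes([0x1B, 0x25, 0x2F, 0x4C])  # ESC 2/5 2/15 4/12

print(cgm1999_tail(utf8_l3), webcgm10_tail(utf8_l3))    # b'/I' b'I'
print(cgm1999_tail(utf16_l3), webcgm10_tail(utf16_l3))  # b'/L' b'L'
```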

ISSUES:
===
ISSUE 1:  Should we issue an erratum for WebCGM 1.0?

Alternatives:
Alt.1:  No
Alt.2:  Yes

Recommendation for Issue 1:  Alt.1, No.

Discussion:  CGM:1999 CSL is not really an implementation of ISO 2022, but 
rather takes concepts and bits of escape sequences as parameters for the 
CSL to designate character sets.  WebCGM 1.0 lists the data to be used 
to designate 6 character sets.  Though wrong according to ISO 2022, these 
are effectively just tokens to select the 6 char. sets, and they are 
unambiguous in the context of WebCGM.  Changing WebCGM 1.0 by erratum 
would invalidate existing WebCGM 1.0 products in the field with respect to 
new WebCGM 1.0 content, and would cause existing "valid" 1.0 content to 
become invalid.  It's not worth it, IMO.

ISSUE 2:  Should we correct it for WebCGM 2.0?

Alternatives:
Alt.1:  No
Alt.2:  Yes
Alt.3:  Yes, but do it by deprecation of the old 1.0 forms (2.0 generators 
shall generate only the 2.0 forms, 2.0 viewers shall accept 1.0 forms as 
well as 2.0 forms)

Recommendation for Issue 2:  Alt.3, Yes, but by deprecation of old.

Discussion:  If generators are writing 2.0 files, and they put out the 
proper forms, then there really shouldn't be a problem with old (1.0) 
viewers in the field -- they won't understand other 2.0 stuff anyway.  2.0 
generators and 2.0 viewers will be using "correct" forms.
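
A minimal sketch of what Alt.3 would mean in practice, assuming tails are
handled as raw octet strings: a 2.0 viewer accepts both forms and can flag
the 1.0 form as deprecated, while a 2.0 generator writes only the 2.0 form.
All names here are hypothetical, not taken from any WebCGM text or product.

```python
# 2.0 forms: the octets after ESC 2/5 (2/15 4/9 and 2/15 4/12).
CSL_TAILS_2_0 = {b"/I": "UTF-8", b"/L": "UTF-16"}
# Deprecated 1.0 forms: final character only (4/9 and 4/12).
CSL_TAILS_1_0 = {b"I": "UTF-8", b"L": "UTF-16"}


def viewer_resolve_tail(tail: bytes):
    """Return (charset, deprecated) for a 'complete code' CSL tail.

    Raises KeyError if the tail is neither form.
    """
    if tail in CSL_TAILS_2_0:
        return CSL_TAILS_2_0[tail], False
    return CSL_TAILS_1_0[tail], True


def generator_tail(charset: str) -> bytes:
    """A 2.0 generator emits only the 2.0 form."""
    for tail, name in CSL_TAILS_2_0.items():
        if name == charset:
            return tail
    raise ValueError(f"no complete-code tail for {charset}")


print(viewer_resolve_tail(b"/I"))  # ('UTF-8', False)
print(viewer_resolve_tail(b"L"))   # ('UTF-16', True)
print(generator_tail("UTF-16"))    # b'/L'
```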

Thoughts?

-Lofton.