cgmopen-members message

Subject: WebCGM and UTF16

From: Lofton Henderson <lofton@rockynet.com>
To: cgmopen-members@lists.oasis-open.org
Date: Wed, 15 Aug 2001 08:55:46 -0600

CGM Open Members,

Recently, a question came up about the use of Unicode UTF16 in WebCGM instances. The Unicode standard does not unambiguously specify the byte order of the two-byte codes of UTF16. For example, to represent the 6-character ASCII string "WebCGM" in UTF16, one byte of each two-byte code holds the same 7-bit ASCII code, and the other byte is zero (this is true of all of 8-bit ISOLatin1 also, not just its ASCII subset). So, would the data stream in a WebCGM instance be the 12-byte sequence:

Option a): 0 W 0 e 0 b 0 C 0 G 0 M

or is it:

Option b): W 0 e 0 b 0 C 0 G 0 M 0
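As a small illustration of the two candidate byte sequences, here is a Python sketch. It assumes that Python's "utf-16-be" and "utf-16-le" codecs (neither of which writes a BOM) correspond to Options a) and b); the helper name show is only for display and is not part of any WebCGM tool.

```python
# Sketch: the two possible byte orders for "WebCGM" in UTF16.
# Assumption: Python's BOM-less codecs stand in for Options a) and b).
text = "WebCGM"

option_a = text.encode("utf-16-be")  # zero byte first, then the ASCII code
option_b = text.encode("utf-16-le")  # ASCII code first, then the zero byte


def show(data: bytes) -> str:
    """Render bytes in the notation used above: '0' for a zero byte."""
    return " ".join(chr(b) if b else "0" for b in data)


print(len(option_a), show(option_a))
print(len(option_b), show(option_b))
```

Both results are 12 bytes long and contain the same byte values; only the order within each two-byte pair differs.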

This issue is discussed in section 2.7 of Unicode (see http://www.unicode.org/unicode/uni2book/ch02.pdf). An optional (not required) BOM (byte order mark) is defined, for use in circumstances where the order might otherwise be ambiguous.

Here is the ambiguity with regard to WebCGM parameters of type SF (non-graphical string) or S (graphical string) -- is the BOM:

1. prohibited?
2. or, required?
3. or, allowed but not required?

Implicit in #1 is that a single standard byte order is mandated for all UTF16 strings in all WebCGM instances. There are all sorts of flavors and questions associated with #2 and #3: what is the default order (if #3); does the BOM (the bytes 0xFE 0xFF or 0xFF 0xFE, depending on the order) have to occur in every string instance; ...?
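To make the BOM question concrete, here is a minimal Python sketch of what a reader would have to do under #2 or #3: inspect the first two bytes of a string parameter. The function name bom_byte_order is hypothetical, and the sketch says nothing about how an actual WebCGM interpreter behaves.

```python
import codecs

# Sketch only: classify the first two bytes of a UTF16 string parameter.
# codecs.BOM_UTF16_BE is b"\xfe\xff"; codecs.BOM_UTF16_LE is b"\xff\xfe".


def bom_byte_order(data: bytes):
    """Return 'big', 'little', or None when no BOM is present."""
    if data.startswith(codecs.BOM_UTF16_BE):
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "little"
    return None  # no BOM: the order must come from some other rule


print(bom_byte_order(b"\xfe\xff\x00W\x00e"))
print(bom_byte_order(b"\xff\xfeW\x00e\x00"))
print(bom_byte_order(b"\x00W\x00e"))
```

The last case is where the ambiguity lies: with no BOM present, the reader needs a mandated default order, which is what #1 provides.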

(Tutorial background. Recall that type SF strings are all of one character set in a given WebCGM instance, that this character set is ISOLatin1 by default, and that it may be changed to UTF8 or UTF16 by a 4-character esc [introducer] sequence at the start of the BegMF id string. Character sets of type S strings may be switched within a WebCGM using the normal Character Set List and (Alternate) Character Set Index mechanisms.)

We think that #1 is the correct WebCGM interpretation. The CGM binary encoding was specified with an unambiguous byte order, after considerable discussion (mid-1980s) about the endian issue. If you view the 16-bit UTF16 codes as CGM "words" (see section 5.3 of Part 3), then the correct representation of UTF16 codes in the WebCGM data stream is "big endian", i.e., Option (a) above:

0 W 0 e 0 b 0 C 0 G 0 M
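The "word" view can be sketched in Python as follows. This assumes only that a 16-bit word is written most significant byte first, which is what the struct format ">H" does; the function names utf16_words and write_words_big_endian are hypothetical and are not taken from the CGM standard or any implementation.

```python
import struct

# Sketch: treat each UTF16 code as one 16-bit word and write it
# most significant byte first (big endian), with no BOM.


def utf16_words(text: str) -> list:
    """Return the 16-bit UTF16 codes of text as integers."""
    raw = text.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(raw) // 2), raw))


def write_words_big_endian(words) -> bytes:
    """Serialize 16-bit words, high byte first."""
    return b"".join(struct.pack(">H", w) for w in words)


words = utf16_words("WebCGM")
stream = write_words_big_endian(words)
print([hex(w) for w in words])
print(stream)
```

For "WebCGM" each word has a zero high byte, so the stream is the Option (a) sequence shown above.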

This interpretation has been agreed to by the one implementation I know of that can generate UTF16.

Does anyone disagree with this interpretation and clarification?

Regards,
Lofton.

*******************

Lofton Henderson

1919 Fourteenth St., #604

Boulder, CO 80302

Phone: 303-449-8728

Email: lofton@rockynet.com

*******************
