cgmopen-members message
Subject: WebCGM and UTF16
- From: Lofton Henderson <lofton@rockynet.com>
- To: cgmopen-members@lists.oasis-open.org
- Date: Wed, 15 Aug 2001 08:55:46 -0600
CGM Open Members,
Recently, a question came up about the use of Unicode UTF16 in WebCGM
instances. The Unicode standard does not by itself fix the byte order of
the two-byte codes of UTF16. For example, to represent the 6-character
ASCII string "WebCGM" in UTF16, one byte of each two-byte code holds the
7-bit ASCII code and the other byte is zero (this is true of 8-bit
ISOLatin1 also, not just its left-hand ASCII half). So, would the data
stream in a WebCGM instance be the 12-byte sequence:
Option a): 0 W 0 e 0 b 0 C 0 G 0 M
or is it:
Option b): W 0 e 0 b 0 C 0 G 0 M 0
This issue is discussed in section 2.7 of Unicode (see
http://www.unicode.org/unicode/uni2book/ch02.pdf).
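As a small illustration of the two candidate layouts, here is a sketch using Python's standard "utf-16-be" and "utf-16-le" codecs, neither of which writes a BOM. The helper name show is only for this example; it prints a zero byte as 0 and every other byte as its ASCII character, matching the notation used above.

```python
text = "WebCGM"

# Big-endian: the zero (high) byte comes first in each two-byte code.
option_a = text.encode("utf-16-be")
# Little-endian: the ASCII (low) byte comes first.
option_b = text.encode("utf-16-le")


def show(data: bytes) -> str:
    """Render bytes in the notation of this message: 0 for a zero byte."""
    return " ".join(chr(b) if b else "0" for b in data)


print("Option a):", show(option_a))
print("Option b):", show(option_b))
```

Both sequences are 12 bytes long and contain the same values; only the order within each pair differs, which is why a reader cannot tell them apart without a rule or a marker.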
An optional (not required) BOM (byte order mark) is defined, for
use in circumstances where the order might otherwise be ambiguous.
Here is the ambiguity with regard to WebCGM parameters of type SF
(non-graphical string) or S (graphical string) -- is the BOM:
1. prohibited?
2. or, required?
3. or, allowed but not required?
Implicit in #1 is that a single standard order is mandated for all UTF16
strings in all WebCGM instances. There are all sorts of flavors and
questions associated with #2 and #3: what is the default (if #3);
does the BOM (U+FEFF, which a reader sees as 0xFFFE when the byte order
is reversed) have to occur in every string instance; ...?
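To make the BOM mechanism concrete, here is a sketch of how a reader could use it under options #2 or #3. The function name detect_byte_order is hypothetical and is not taken from WebCGM or any CGM library; the two constants come from Python's standard codecs module.

```python
import codecs
from typing import Optional


def detect_byte_order(data: bytes) -> Optional[str]:
    """Return "big" or "little" if the string starts with a BOM, else None."""
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "little"
    return None  # no BOM: the order must come from some other rule


with_bom = codecs.BOM_UTF16_LE + "WebCGM".encode("utf-16-le")
without_bom = "WebCGM".encode("utf-16-be")

print(detect_byte_order(with_bom))
print(detect_byte_order(without_bom))
```

The None case is the one that needs a default, which is the open question under #3.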
(Tutorial background. Recall that type SF strings are all of one
character set in a given WebCGM instance; that character set is ISOLatin1
by default, and may be changed to UTF8 or UTF16 by a 4-character escape
[introducer] sequence at the start of the BegMF id string.
Character sets of type S strings may be switched within a WebCGM using
the normal Character Set List and (Alternate) Character Set Index
mechanisms.)
We think that #1 is the correct WebCGM interpretation. The CGM
binary encoding was specified with an unambiguous byte order, after
considerable discussion (mid-1980s) about the endian issue. If you
view the 16-bit UTF16 codes as CGM "words" (see section 5.3
of Part 3), then the correct representation of UTF16 codes in the
WebCGM data stream is "big endian", i.e., Option a) above:
0 W 0 e 0 b 0 C 0 G 0 M
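The "word" view can be sketched as follows, assuming (as stated above) that each 16-bit code is written most significant byte first and that no BOM is emitted. The function names are hypothetical, and this is an illustration of the interpretation, not code from any CGM implementation.

```python
import struct
from typing import List


def utf16_code_units(text: str) -> List[int]:
    """Split text into 16-bit UTF16 codes (surrogate pairs included)."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))


def pack_as_big_endian_words(units: List[int]) -> bytes:
    """Write each 16-bit code as one word, high byte first, with no BOM."""
    return b"".join(struct.pack(">H", unit) for unit in units)


stream = pack_as_big_endian_words(utf16_code_units("WebCGM"))
print(stream)
```

The result is the same byte sequence that the "utf-16-be" codec produces for the string.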
This interpretation has been agreed by the one implementation I know of
that can generate UTF16.
Does anyone disagree with this interpretation and clarification?
Regards,
Lofton.
*******************
Lofton Henderson
1919 Fourteenth St., #604
Boulder, CO 80302
Phone: 303-449-8728
Email: lofton@rockynet.com
*******************