All,
I support this approach. We changed both IsoDraw 5
and IsoView 3 to write UTF-16 as described below.
Any older files that may have been written in
little-endian byte order can be read by IsoDraw and saved again as
big-endian.
Dieter Weidenbrück
ITEDO Software GmbH
----- Original Message -----
Sent: Wednesday, August 15, 2001 4:55 PM
Subject: WebCGM and UTF16
CGM Open Members,
Recently, a question came up about the
use of Unicode UTF16 in WebCGM instances. The byte order of the two-byte
codes of UTF16 is not unambiguously specified by the Unicode standard.
For example, to represent the 6 character ASCII string "WebCGM" in UTF16, the
same 7-bit ASCII codes are used for one byte of the UTF16 representation, and
the other byte is zero (this is true of 8-bit ISOLatin1 also, not just the LHS
ASCII subset). So, would the data stream in a WebCGM instance be the
12-byte sequence:
Option a): 0 W 0 e 0 b 0 C 0 G 0 M
or is it:
Option b): W 0 e 0 b 0 C 0 G 0 M 0
This issue is discussed in section 2.7 of Unicode (see
http://www.unicode.org/unicode/uni2book/ch02.pdf). An optional (not
required) BOM (byte order mark) is defined, for use in circumstances
where the order might otherwise be ambiguous.
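As a small illustration of the two options and of the BOM, here is a Python sketch. It uses only Python's standard codecs and is not specific to CGM; the variable names are just for illustration.

```python
import codecs

text = "WebCGM"

# Option a): big-endian, the zero (high-order) byte comes first.
option_a = text.encode("utf-16-be")
# Option b): little-endian, the ASCII (low-order) byte comes first.
option_b = text.encode("utf-16-le")

print(option_a.hex(" "))  # 00 57 00 65 00 62 00 43 00 47 00 4d
print(option_b.hex(" "))  # 57 00 65 00 62 00 43 00 47 00 4d 00

# The BOM is the character U+FEFF; the order of its two bytes
# tells a reader which byte order the rest of the string uses.
print(codecs.BOM_UTF16_BE.hex(" "))  # fe ff
print(codecs.BOM_UTF16_LE.hex(" "))  # ff fe
```

Both encodings are 12 bytes long and differ only in the order of each byte pair, which is why a reader cannot always tell them apart without a BOM or a mandated order.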
Here is the ambiguity with regard to WebCGM parameters of
type SF (non-graphical string) or S (graphical string) -- is the
BOM:
1. prohibited?
2. or, required?
3. or, allowed but not required?
Implicit in #1 is that a single standard order is mandated
for all UTF16 strings in all WebCGM instances. There are all sorts of
flavors and questions associated with #2 and #3: what is the default (if
#3); does the BOM (0xFEFF or 0xFFFE) have to occur in every string instance;
...?
(Tutorial background. Recall that type SF strings are all
of one character set in a given WebCGM instance, and that set is IsoLatin1 by
default, and may be changed to UTF8 or UTF16 by a 4-character esc [introducer]
sequence at the start of the BegMF id string. Character sets of type S
strings may be switched within a WebCGM using the normal Character Set List
and (Alternate) Character Set Index mechanisms.)
We think that #1 is the correct WebCGM interpretation. The CGM binary
encoding was specified with an unambiguous byte order, after
considerable discussion (mid-1980s) about the endian issue. If you view
the 16-bit UTF16 codes as CGM "words" (see section 5.3 of Part 3), then
the correct representation of UTF16 codes in the WebCGM data stream is
"big endian", i.e., Option (a) above:
0 W 0 e 0 b 0 C 0 G 0 M
This interpretation has
been agreed by the one implementation I know of that can generate
UTF16.
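A minimal sketch of what interpretation #1 could look like for string data, in Python. The helper names encode_webcgm_utf16 and decode_webcgm_utf16 are hypothetical and do not come from any CGM library. Rejecting a leading BOM on read is one strict choice made here for illustration; this message does not settle whether a reader should reject or tolerate one.

```python
def encode_webcgm_utf16(text: str) -> bytes:
    """Hypothetical helper: UTF-16 bytes in big-endian order, no BOM."""
    return text.encode("utf-16-be")


def decode_webcgm_utf16(data: bytes) -> str:
    """Hypothetical helper: decode string bytes assuming big-endian order.

    Assumption: under interpretation #1 a BOM is not expected, so a
    leading FE FF or FF FE is treated as an error rather than skipped.
    """
    if data[:2] in (b"\xfe\xff", b"\xff\xfe"):
        raise ValueError("unexpected BOM at start of UTF-16 string")
    return data.decode("utf-16-be")


encoded = encode_webcgm_utf16("WebCGM")
print(encoded.hex(" "))              # 00 57 00 65 00 62 00 43 00 47 00 4d
print(decode_webcgm_utf16(encoded))  # WebCGM
```

With a single mandated order, neither side needs to inspect the data to guess its byte order.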
Does anyone disagree with this interpretation and
clarification?
Regards, Lofton.
*******************
Lofton Henderson
1919 Fourteenth St., #604
Boulder, CO 80302
Phone: 303-449-8728
Email: lofton@rockynet.com
*******************