OASIS Mailing List Archives


cgmopen-members message



Subject: Re: WebCGM and UTF16


All,
 
I support this approach. We changed both IsoDraw 5 and IsoView 3 to write UTF-16 as described below.
Any older files that may have been written in little-endian byte order can be read by IsoDraw and saved again as big-endian.
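For readers who want to see what such a re-save amounts to at the byte level, here is a minimal sketch in Python. It is not IsoDraw's actual code; the function name `little_to_big` is hypothetical, and it assumes you have already isolated the raw UTF-16 payload of a single string (without any CGM length prefix or escape sequence).

```python
def little_to_big(payload: bytes) -> bytes:
    """Re-encode a little-endian UTF-16 payload as big-endian UTF-16.

    Hypothetical helper: assumes `payload` holds only the UTF-16 code
    units of one string, with no BOM and no CGM framing around it.
    """
    # Decoding first (rather than blindly swapping byte pairs) also
    # checks that the input is well-formed little-endian UTF-16.
    text = payload.decode("utf-16-le")
    return text.encode("utf-16-be")


legacy = "WebCGM".encode("utf-16-le")   # W 0 e 0 b 0 C 0 G 0 M 0
fixed = little_to_big(legacy)           # 0 W 0 e 0 b 0 C 0 G 0 M
```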
 
Dieter Weidenbrück
ITEDO Software GmbH
----- Original Message -----
Sent: Wednesday, August 15, 2001 4:55 PM
Subject: WebCGM and UTF16

CGM Open Members,

Recently, a question came up about the use of Unicode UTF16 in WebCGM instances.  The byte order of the two-byte codes of UTF16 is not unambiguously specified by the Unicode standard.  For example, to represent the 6 character ASCII string "WebCGM" in UTF16, the same 7-bit ASCII codes are used for one byte of the UTF16 representation, and the other byte is zero (this is true of 8-bit ISOLatin1 also, not just the LHS ASCII subset).  So, would the data stream in a WebCGM instance be the 12-byte sequence:

Option a):  0 W 0 e 0 b 0 C 0 G 0 M

or is it:

Option b):  W 0 e 0 b 0 C 0 G 0 M 0
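To make the two candidate byte sequences concrete, here is a small Python sketch that produces both. It only illustrates the byte order question; it says nothing about how the string is framed inside a CGM element. The variable names are illustrative.

```python
text = "WebCGM"

# Option a): big-endian, the zero (high-order) byte comes first.
option_a = text.encode("utf-16-be")

# Option b): little-endian, the ASCII (low-order) byte comes first.
option_b = text.encode("utf-16-le")

print(option_a)  # b'\x00W\x00e\x00b\x00C\x00G\x00M'
print(option_b)  # b'W\x00e\x00b\x00C\x00G\x00M\x00'
```

Both results are 12 bytes long and contain the same byte values; only the order within each 16-bit code differs.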

This issue is discussed in section 2.7 of the Unicode standard (see http://www.unicode.org/unicode/uni2book/ch02.pdf). An optional (not required) BOM (byte order mark) is defined, for use in circumstances where the byte order might otherwise be ambiguous.
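As a rough illustration of how a BOM resolves the order, the sketch below inspects the first two bytes of a UTF-16 byte string. The function name `bom_byte_order` is hypothetical; the two constants come from Python's standard `codecs` module.

```python
import codecs


def bom_byte_order(data: bytes):
    """Return "big", "little", or None if no UTF-16 BOM is present.

    Hypothetical helper for illustration only.
    """
    # The BOM is the character U+FEFF. Written big-endian it is the
    # bytes FE FF; written little-endian it is FF FE.
    if data.startswith(codecs.BOM_UTF16_BE):
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "little"
    return None


with_bom = codecs.BOM_UTF16_BE + "WebCGM".encode("utf-16-be")
without_bom = "WebCGM".encode("utf-16-be")
```

Without a BOM the function returns None, which is exactly the case where a fixed, agreed byte order is needed.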

Here is the ambiguity with regard to WebCGM parameters of type SF (non-graphical string) or S (graphical string) -- is the BOM:

1. prohibited?
2. or, required?
3. or, allowed but not required?

Implicit in #1 is that a single standard order is mandated for all UTF16 strings in all WebCGM instances.  There are all sorts of flavors and questions associated with #2 and #3:  what is the default (if #3); does the BOM (0xFEFF or 0xFFFE) have to occur in every string instance; ...?

(Tutorial background.  Recall that type SF strings are all of one character set in a given WebCGM instance, and that type is IsoLatin1 by default, and may be changed to UTF8 or UTF16 by a 4-character esc [introducer] sequence at the start of the BegMF id string.  Character sets of type S strings may be switched within a WebCGM using the normal Character Set List and (Alternate) Character Set Index mechanisms.)

We think that #1 is the correct WebCGM interpretation.  The CGM binary encoding was specified with an unambiguous byte order, after considerable discussion (mid-1980s) about the endian issue.  If you view each 16-bit UTF16 code as a CGM "word" (see section 5.3 of Part 3), then the correct representation of UTF16 codes in the WebCGM data stream is "big endian", i.e., Option (a) above:

0 W 0 e 0 b 0 C 0 G 0 M

This interpretation has been agreed by the one implementation I know of that can generate UTF16.
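Under interpretation #1 (fixed big-endian order, no BOM), writing and reading the character data could look roughly like the Python sketch below. The function names are hypothetical, the code handles only the UTF-16 payload and not the CGM string length or the escape sequence that selects the character set, and rejecting a leading FE FF or FF FE on input is my assumption about how a strict reader might behave, not something the text above mandates.

```python
def encode_utf16_payload(text: str) -> bytes:
    """Encode text as big-endian UTF-16 with no BOM (hypothetical helper)."""
    # "utf-16-be" never emits a BOM; plain "utf-16" would prepend one.
    return text.encode("utf-16-be")


def decode_utf16_payload(data: bytes) -> str:
    """Decode a big-endian, BOM-free UTF-16 payload (hypothetical helper)."""
    # Assumption: a strict reader treats a leading BOM as an error,
    # since interpretation #1 prohibits it.
    if data[:2] in (b"\xfe\xff", b"\xff\xfe"):
        raise ValueError("BOM not allowed: payload must be plain big-endian UTF-16")
    return data.decode("utf-16-be")


payload = encode_utf16_payload("WebCGM")
round_trip = decode_utf16_payload(payload)
```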

Does anyone disagree with this interpretation and clarification?

Regards,
Lofton.



*******************
Lofton Henderson
1919 Fourteenth St., #604
Boulder, CO   80302

Phone:  303-449-8728
Email:  lofton@rockynet.com
*******************

