Subject: RE: [cgmo-webcgm] UTF-8 & UTF-16 sequences (ISSUEs)
I strongly agree with Lofton.

> -----Original Message-----
> From: Lofton Henderson [mailto:lofton@rockynet.com]
> Sent: Saturday, June 25, 2005 7:28 PM
> To: Robert Orosz; cgmo-webcgm@lists.oasis-open.org
> Subject: RE: [cgmo-webcgm] UTF-8 & UTF-16 sequences (ISSUEs)
>
> Since two people have opted for "Erratum for 1.0", we should be clear what
> are the consequences:
>
> A.) all existing 1.0 files that invoke UTFx, which have been
> 1.0-conforming since 1999, would become non-conforming on the day that the
> erratum becomes effective. Thus, MetaCheck (once upgraded for the erratum)
> would declare all existing 1.0 files non-conforming. This is the way
> errata work in ISO and W3C.
>
> B.) all existing currently conforming 1.0 generators in the field would
> become non-conforming.
>
> C.) all existing currently conforming 1.0 viewers in the field would become
> non-conforming, and if some generators started putting out "new" 1.0 files,
> all existing 1.0 viewers in the field, which previously handled all 1.0
> files fine, would now begin to malfunction on "conforming" 1.0 files.
>
> At 09:53 AM 6/24/2005 -0600, Robert Orosz wrote:
> >[...]
> >Regarding ISSUE 1, I think an erratum should be issued for WebCGM 1.0,
> >because it is, well, wrong.
>
> I guess I take a pragmatic view. What is worse, that we are "wrong" in
> some formal sense? Or that we create real chaos and confusion about valid
> 1.0 content and invalidate otherwise fine legacy 1.0 implementations?
>
> >Essentially, you are proposing that the
> >sequences ESC 2/5 4/9 and ESC 2/5 4/12 are private codes for UTF-8 and
> >UTF-16 respectively.
>
> I don't see it that way. For SF, the ISO 2022 sequences are used in the
> content ('id' string of BegMet).
> They always were correct and remain correct:
>
>     ESC 2/5 2/15 4/9
>     ESC 2/5 2/15 4/12
>
> For CSL entries, WebCGM 1.0 normatively stated:
>
>     the two-part parameter "'complete code', 4/9" means UTF8
>     the two-part parameter "'complete code', 4/12" means UTF16
>
> It is unambiguous in the context of WebCGM 1.0. It is, formally speaking,
> wrong with respect to how the clauses of CGM:1999 say these should be
> derived from the correct 2022 sequences...
>
> >However, section 6.3.4.3 of CGM:1999 states under the
> >"CHARACTER SETS INTENDED TO BE DESIGNATED AS COMPLETE CODES." paragraph:
> >
> >.... "If <F> is from column 3, the coding system is a private code. If <F>
> >is from columns 4 through 7, it is a code for which a designating and
> >invoking escape sequence has been registered in the International Register
> >Of Coded Character Sets To Be Used With Escape Sequences."
> >
> >So, the sequences ESC 2/5 3/9 and ESC 2/5 3/12 would have been acceptable
> >instead.
> >
> >Section 9.1.3 of CGM:1999 states:
> >
> >"A profile of ISO/IEC 8632 shall not specify any requirement that would
> >contradict or cause non-conformance to ISO/IEC 8632."
> >
> >In my opinion, WebCGM 1.0 is contradicting ISO/IEC 8632 by designating an
> >escape sequence with a final byte from column 4 as a private code for the
> >Character Set List element. Issuing an erratum on this would alert users
> >that what is currently specified is wrong, and it is being deprecated and
> >changed in WebCGM 2.0.
>
> With chaotic practical implications for existing and new 1.0 content, and
> existing and new 1.0 implementations. We should seriously consider before
> we do that, whether being formally "correct" offers sufficient advantages
> that outweigh the very real pragmatic downside.
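
An illustrative aside on the 6.3.4.3 rule quoted above: in column/row notation the column is simply the high nibble of the byte, so the private-versus-registered distinction for a final byte can be checked mechanically. The following is a minimal Python sketch under that reading; the function names are made up for illustration and do not come from CGM:1999 or WebCGM.

```python
def column_row(byte: int) -> tuple:
    """Split a 7-bit code into column/row notation, e.g. 0x49 -> (4, 9)."""
    return byte >> 4, byte & 0x0F


def classify_final(byte: int) -> str:
    """Classify a complete-code final byte F per the quoted 6.3.4.3 wording.

    Assumption: final bytes are taken to lie in 3/0 .. 7/14 (0x30 .. 0x7E).
    """
    if not 0x30 <= byte <= 0x7E:
        return "not a final byte"
    column, _row = column_row(byte)
    # Column 3 is private use; columns 4 through 7 are registered codes.
    return "private" if column == 3 else "registered"
```

Under this sketch, 4/9 and 4/12 (the WebCGM 1.0 tails) come out as "registered", while 3/9 and 3/12 (the alternatives Rob mentions) come out as "private".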
>
> Especially since WebCGM doesn't actually use ISO2022 (generalized
> intra-string character set *switching* via control sequences) in any real
> sense, but rather has a couple of mechanisms that are based on a very
> narrow technical detail of 2022 -- no one is going to write a 2022
> processor for WebCGM, which processor would likely get confused if a wrong
> sequence showed up in a string.
>
> Btw, I could point out several ways in which WebCGM arguably violates the
> CGM:1999 standard.
>
> Regards,
> -Lofton.
>
> >Regarding ISSUE 2, either Alt.2 or Alt.3 are fine with me.
> >
> >Regards,
> >
> >Rob
> >
> >-----Original Message-----
> >From: Lofton Henderson [mailto:lofton@rockynet.com]
> >Sent: Thursday, June 23, 2005 5:29 PM
> >To: cgmo-webcgm@lists.oasis-open.org
> >Subject: [cgmo-webcgm] UTF-8 & UTF-16 sequences (ISSUEs)
> >
> >
> >WebCGM TC,
> >
> >I have an action item to research "UTF-x sequence tails". Thanks to
> >Forrest for providing me some references and some motivation, I have gotten
> >the information, and I make recommendations below.
> >
> >[1] http://www.unihan.com.cn/Cjk/ana18.htm
> >[2] http://www.unihan.com.cn/Cjk/ana19.htm
> >
> >At [1] and [2], we find the ISO/IEC 2022 escape sequences:
> >
> >UTF-8 implementation level 3: ESC 2/5 2/15 4/9
> >UTF-16 implementation level 3: ESC 2/5 2/15 4/12
> >
> >At [3], I found a lucid explanation of this stuff, and particularly what
> >"implementation level 1,2,3" mean. In the past, we chose implementation
> >level 3 (whether or not it was a well-considered decision is another
> >question).
> >
> >[3] http://www.nada.kth.se/i18n/ucs/unicode-iso10646-oview.html#3
> >
> >Separate the cases of non-graphical text (SF) and graphical text (S) in
> >WebCGM 1.0.
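
For readers less used to the column/row notation in the sequences above: each "c/r" token denotes the byte 16*c + r, and ESC is 0x1B. The following small Python sketch spells out the two sequences as actual octets; `parse_sequence` is a hypothetical helper written for this illustration, not part of any CGM toolkit.

```python
ESC = 0x1B  # the ESCAPE control character


def parse_sequence(notation: str) -> bytes:
    """Convert e.g. 'ESC 2/5 2/15 4/9' into the corresponding octets."""
    octets = []
    for token in notation.split():
        if token == "ESC":
            octets.append(ESC)
        else:
            column, row = token.split("/")
            octets.append(int(column) * 16 + int(row))
    return bytes(octets)


# The two implementation-level-3 sequences cited in the message.
UTF8_LEVEL3 = parse_sequence("ESC 2/5 2/15 4/9")    # ESC % / I
UTF16_LEVEL3 = parse_sequence("ESC 2/5 2/15 4/12")  # ESC % / L
```

In ASCII terms the sequences are ESC followed by "%/I" for UTF-8 and "%/L" for UTF-16.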
> >
> >Non-graphical text (SF): T.14.5
> >-----
> >T.14.5 says that the metafile id (BEGIN METAFILE parameter, type SF) shall
> >have as its first 4 octets the 4-octet sequences above, to declare for the
> >whole metafile that SF is UTF-8 or UTF-16.
> >
> >Conclusion: no problem here.
> >
> >Graphical text (S): T.16.14
> >-----
> >T.16.14 takes the last character of the above 4-octet sequences as the
> >'tail', for use in the CHARACTER SET LIST (CSL) element. So the two-part
> >data for CSL are specified as:
> >
> >UTF-8 implementation level 3: 'complete code', 4/9
> >UTF-16 implementation level 3: 'complete code', 4/12
> >
> >This was based on information in CGM:1999 section 6.3.4.3, that
> >characterizes the escape sequences for complete codes as: ESC 2/5 I* F.
> >I* is zero or more "intermediate characters", and F is a single final
> >character. WebCGM 1.0 took only F for the tail. But CGM:1999 says:
> >
> > >The character set declaration ... consists of 'complete code' followed by
> > >a string consisting of those characters in the code's ISO 2022 escape
> > >sequence which come after the first two characters, ESC 2/5.
> >
> >Conclusion: WebCGM 1.0 is wrong for the CSL tails for UTF-8 and
> >UTF-16. The CSL data should be:
> >
> >UTF-8 implementation level 3: 'complete code', 2/15 4/9
> >UTF-16 implementation level 3: 'complete code', 2/15 4/12
> >
> >ISSUES:
> >===
> >ISSUE 1: Should we issue an erratum for WebCGM 1.0?
> >
> >Alternatives:
> >Alt.1: No
> >Alt.2: Yes
> >
> >Recommendation for Issue 1: Alt.1, No.
> >
> >Discussion: CGM:1999 CSL is not really an implementation of ISO 2022, but
> >rather takes concepts and bits of escape sequences as parameters for the
> >CSL to designate character sets. WebCGM 1.0 lists data that is to be used
> >to designate 6 character sets. Though wrong according to ISO2022, on the
> >other hand these are effectively just tokens to select the 6 char. sets,
> >and it is unambiguous in the context of WebCGM.
> >To change WebCGM 1.0 by erratum will invalidate existing WebCGM 1.0
> >products in the field, for new WebCGM 1.0 content. And would cause
> >existing "valid" 1.0 content to become invalid. It's not worth it, IMO.
> >
> >ISSUE 2: Should we correct it for WebCGM 2.0?
> >
> >Alternatives:
> >Alt.1: No
> >Alt.2: Yes
> >Alt.3: Yes, but do it by deprecation of the old 1.0 forms (2.0 generators
> >shall generate only the 2.0 forms, 2.0 viewers shall accept 1.0 forms as
> >well as 2.0 forms)
> >
> >Recommendation for Issue 2: Alt.3, Yes, but by deprecation of old.
> >
> >Discussion: If generators are writing 2.0 files, and they put out the
> >proper forms, then there really shouldn't be a problem with old (1.0)
> >viewers in the field -- they won't understand other 2.0 stuff anyway. 2.0
> >generators and 2.0 viewers will be using "correct" forms.
> >
> >Thoughts?
> >
> >-Lofton.
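
To make the CSL tail discussion in the quoted thread concrete, here is a rough Python sketch of the two derivations (the CGM:1999 rule: everything after ESC 2/5; the WebCGM 1.0 reading: only the final character F) and of how an Alt.3-style 2.0 viewer might accept both forms while flagging the old one as deprecated. All names are hypothetical, tails are modeled as plain byte strings, and the actual binary encoding of the CHARACTER SET LIST element is not modeled.

```python
ESC = 0x1B
UTF8_SEQ = bytes([ESC, 0x25, 0x2F, 0x49])   # ESC 2/5 2/15 4/9
UTF16_SEQ = bytes([ESC, 0x25, 0x2F, 0x4C])  # ESC 2/5 2/15 4/12


def cgm1999_tail(seq: bytes) -> bytes:
    """Tail per the CGM:1999 wording: the characters after ESC 2/5."""
    if seq[:2] != bytes([ESC, 0x25]):
        raise ValueError("not a complete-code escape sequence")
    return seq[2:]


def webcgm10_tail(seq: bytes) -> bytes:
    """Tail as WebCGM 1.0 specified it: only the final character F."""
    return cgm1999_tail(seq)[-1:]


# Assumed lookup tables for a 2.0 viewer (illustration only).
TAILS_2_0 = {cgm1999_tail(UTF8_SEQ): "UTF-8", cgm1999_tail(UTF16_SEQ): "UTF-16"}
TAILS_1_0 = {webcgm10_tail(UTF8_SEQ): "UTF-8", webcgm10_tail(UTF16_SEQ): "UTF-16"}


def viewer_lookup(tail: bytes):
    """Return (encoding, deprecated) for a known tail, else None."""
    if tail in TAILS_2_0:
        return TAILS_2_0[tail], False
    if tail in TAILS_1_0:
        return TAILS_1_0[tail], True  # old 1.0 form: accepted but deprecated
    return None
```

A 2.0 generator in this sketch would emit only the keys of `TAILS_2_0`, i.e. 2/15 4/9 and 2/15 4/12, while the viewer keeps reading legacy 1.0 files that carry the single-character tails.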