Subject: RE: [oic] interoperability and non-latin scripts

Mingfei Jia (贾明飞),

Thank you. This is very useful.

I want to test my understanding of the impact of GB18030 on the package structure, the manifest, and relative URIs that refer to elements of the package.

Because octets 0x00 to 0x7F map to the same Unicode code points in ISO 646, UTF-8, and GB18030, there is no problem for the "ASCII" names that are standard parts of ODF Document packages: "mimetype", the MIME Type value, "Thumbnails/thumbnail.png", and "META-INF/manifest.xml" are correct regardless of the understood encoding.

Likewise, the manifest:full-path attribute values for the well-known ODF Document parts are the same regardless of the manifest.xml encoding: "content.xml", "meta.xml", and "styles.xml". The (MIME) media type names are also encoded in the common ISO 646 subset.

So long as ODF Document producers use only "ASCII" for the made-up names of additional package items, there should be no problem with the manifest, which is then unlikely to have any non-ISO 646 characters in it. The ODF Document producer must also always use "/" as the path separator, regardless of the local practice for the file system.

So long as there are no non-ISO 646 characters in full-path values, relative IRIs that refer to components of the package will be the same regardless of the encoding of the XML Document that contains the IRI as an attribute value.

ARE THESE ACCURATE ASSUMPTIONS?

MORE QUESTIONS:

1. I can see a problem with regard to the use of encryption in the package. If a password is entered in GB18030 and those octets are used in the key derivation process, another processor that uses UTF-8, UTF-16, or UTF-32 will fail to produce the same key even though the user enters the same password. Is there any convention among ODF implementations in Asia to ensure that a single character set and encoding is always used for the password, so that the key derivation process is repeatable?

2. I can see a problem where non-ISO 646 codes are used in Zip item names (that is, in full-path values) for other content. It seems to me that the encoding specified for manifest.xml should dictate the encoding of Zip item names (their "full-path" names), and that the processing of relative IRIs that refer to package components must include conversion of the segments of the full-path to the same encoding as used for manifest.xml. (Fragment references to ID values in the target component must instead be resolved using the encoding of the target XML document.) Is this your understanding of how this should work?

3. I think IRIs that refer beyond the package are a separate problem. I agree that they should be in the same encoding as the XML Document that has the IRI as an attribute value. How that is reconciled with the encoding supported by the host file system is not something that ODF can address. It will be a challenge for implementations.

 - Dennis

-----Original Message-----
From: Ming Fei Jia [mailto:jiamingf@cn.ibm.com]
Sent: Tuesday, February 17, 2009 07:46
To: dennis.hamilton@acm.org
Cc: oic@lists.oasis-open.org
Subject: RE: [oic] interoperability and non-latin scripts

Dennis,

First, to clarify: when I mentioned the Chinese encoding issues, I meant that we should add testing for documents in non-Latin encodings. Actually, those encodings are compatible with Unicode. If ODF allows non-Unicode encodings and ODF applications support them, there should be no issues.

> From:
>
> "Dennis E. Hamilton" <dennis.hamilton@acm.org>

[ ... ]

> 1. Do GB18030 and GB2312 character-set encodings all have corresponding
> character encodings in Unicode [4.0? 5.0?]?

Yes. GB18030 has the corresponding character encodings in Unicode 5.0. GB2312 is a subset of GB18030.

>
> 2. I am asking because XML is specified in terms of Unicode no matter what
> the encoding parameter is. I understand one might want to say
> encoding="GB2312" to ensure that text is confined to the characters and
> encodings of that specification to be useful in entry, display, printing and
> processing outside of the ODF package. Having a reliable "standard"
> mapping to Unicode is valuable, if available. (It also matters what version
> of XML 1.0 we specify as normative for ODF, in terms of what can appear in
> special types, such as xml:id, NCNAMEs, etc.)
>
> 3. How do you see this impacting use of IRIs and "full-path" names of Zip
> items? Can the "full-path" be carried in UTF-8 even though the coded
> characters are meant to be limited to those of GB2312 or GB18030? Likewise,
> would you expect that manifest.xml could have encoding="GB2312" (for
> example)?

You mean a mixed encoding within one XML file, where "full-path" is encoded in UTF-8 and the other text is encoded in GB2312 or GB18030? I have not verified that case; even if it works, I would not prefer it. I think the full text is generally encoded in a single encoding. If an IRI is in a non-Unicode encoding and the ODF application supports that encoding, it should work. Of course, the OS also needs to support that encoding.

[ ... ]
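
To make the ASCII assumption and question 1 above concrete, here is a minimal Python sketch. It is an illustration only, not the key derivation that ODF specifies: the helper names `encoded_forms` and `derive_key`, the salt, the iteration count, and the sample password are all assumptions made up for this example. It shows that an "ASCII" package name yields the same octets in UTF-8 and GB18030, while a non-ASCII password yields different octets and therefore a different PBKDF2 result.

```python
import hashlib


def encoded_forms(text, encodings=("utf-8", "gb18030")):
    """Return the octets of `text` under each named encoding."""
    return {enc: text.encode(enc) for enc in encodings}


def derive_key(password, encoding, salt=b"illustrative-salt", iterations=1024):
    """Hypothetical helper: PBKDF2-HMAC-SHA1 over the password octets.

    The salt and iteration count are placeholders chosen for this sketch,
    not values taken from any ODF package.
    """
    octets = password.encode(encoding)
    return hashlib.pbkdf2_hmac("sha1", octets, salt, iterations, dklen=16)


# An "ASCII" package name: identical octets in both encodings.
name_forms = encoded_forms("META-INF/manifest.xml")
print(name_forms["utf-8"] == name_forms["gb18030"])

# A non-ASCII password: different octets, hence different derived keys.
password = "密码"  # sample password, an assumption for illustration
pw_forms = encoded_forms(password)
print(len(pw_forms["utf-8"]), len(pw_forms["gb18030"]))
print(derive_key(password, "utf-8") == derive_key(password, "gb18030"))
```

If producer and consumer both convert the password to one agreed encoding before derivation (UTF-8 in this sketch, purely as an assumption), the derived key is repeatable regardless of how the password was entered.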