OASIS Mailing List Archives

oic message


Subject: RE: [oic] interoperability and non-latin scripts


Mingfei Jia (贾明飞),

Thank you.  This is very useful.

I want to test my understanding of the impact of GB18030 on the package
structure, manifest, and relative URIs that refer to elements of the
package.

Because octets 0x00 to 0x7F represent the same Unicode code points in
ISO 646, UTF-8, and GB18030, there is no problem for the "ASCII" names that
are standard parts of ODF Document packages: "mimetype", the MIME Type value,
"Thumbnails/thumbnail.png", and "META-INF/manifest.xml" are correct
regardless of the understood encoding.
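
A minimal Python sketch of this point, assuming Python's built-in "ascii",
"utf-8", and "gb18030" codecs; the helper name octets_by_encoding is only
illustrative:

```python
# Standard ODF package item names; every character is in the ISO 646 range.
STANDARD_NAMES = ["mimetype", "Thumbnails/thumbnail.png", "META-INF/manifest.xml"]

def octets_by_encoding(name):
    """Encode one name under each encoding and return the resulting octets."""
    return {enc: name.encode(enc) for enc in ("ascii", "utf-8", "gb18030")}

for name in STANDARD_NAMES:
    encoded = octets_by_encoding(name)
    # One distinct octet sequence means all three encodings agree.
    print(name, "identical octets:", len(set(encoded.values())) == 1)
```

This only checks the octets of the names themselves; it says nothing about
how a particular Zip implementation labels the encoding of item names.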

Likewise, the manifest:full-path attribute values for the well-known ODF
Document parts are the same regardless of the manifest.xml encoding:
"content.xml", "meta.xml", and "styles.xml".  The (MIME) media type names
are also encoded in the common ISO 646 subset.

So long as ODF Document producers only use "ASCII" for the made-up names of
additional package items, there should be no problem with the manifest,
which is unlikely to have any non-ISO646 characters in it.  The ODF Document
producer must also always use "/" for a path separator, regardless of the
local practice for the file system.

So long as there are no non-ISO646 characters in full-path values, relative
IRIs that refer to components of the package will be the same regardless of
the encoding of the XML Document that contains the IRI as an attribute
value.
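
To illustrate the contrast, here is a small Python sketch that
percent-encodes a full-path under two encodings.  The item names are
hypothetical, and using urllib.parse.quote to turn a path into URI form is
my assumption about how a processor might do it, not something ODF
specifies:

```python
from urllib.parse import quote

def percent_encode_path(full_path, encoding):
    """Percent-encode a package path, keeping "/" as the separator."""
    return quote(full_path, safe="/", encoding=encoding)

ascii_item = "Pictures/image1.png"   # hypothetical ASCII-only item name
cjk_item = "Pictures/图1.png"        # hypothetical non-ISO646 item name

for enc in ("utf-8", "gb18030"):
    # The ASCII path is unchanged; the CJK path depends on the encoding.
    print(enc, percent_encode_path(ascii_item, enc),
          percent_encode_path(cjk_item, enc))
```

The ASCII-only path comes out the same under both encodings, while the path
with a Chinese character yields different escaped octets.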

ARE THESE ACCURATE ASSUMPTIONS?

MORE QUESTIONS:

1. I can see a problem with regard to the use of encryption in the package.
If a password is entered in GB18030 and that is used in the key derivation
process, another processor that uses UTF-8, UTF-16, or UTF-32 will fail to
produce the same key even though the same password is entered by the user.
Is there any convention among ODF implementations in Asia to ensure that a
single character-set and encoding is always used for the password so that
the key derivation process is repeatable?

2. I can see a problem where non-ISO646 codes are used in Zip item names
(that is, in full-path values) for other content.  It seems to me that the
encoding specified for manifest.xml should dictate the encoding of Zip item
names (their "full-path" names), and the processing of relative IRIs that
refer to package components must include conversion of segments of the
full-path to the same encoding as used for the manifest.xml.  (Fragment
references to ID values in the target component must be resolved using the
encoding of the target XML document, instead.)  Is this your understanding
of how this should work?

3. I think IRIs that refer beyond the package are a separate problem.  I
agree that they should be in the same encoding as the XML Document that has
the IRI as an attribute value.  How that is reconciled with the encoding
supported for the host file system is not something that ODF can address.
It will be a challenge for implementations.
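
The concern in question 1 can be sketched in Python.  This is a generic
PBKDF2 example, not the ODF key derivation procedure; the password, salt,
iteration count, and key length are all made-up values for illustration:

```python
import hashlib

SALT = b"0123456789abcdef"   # hypothetical fixed salt, for comparison only
ITERATIONS = 1024            # hypothetical iteration count

def derive_key(password, encoding):
    """Derive a 16-octet key from the password octets in the given encoding."""
    password_octets = password.encode(encoding)
    return hashlib.pbkdf2_hmac("sha1", password_octets, SALT, ITERATIONS, dklen=16)

for password in ("secret", "密码"):   # ASCII-only and non-ASCII passwords
    keys = {enc: derive_key(password, enc)
            for enc in ("gb18030", "utf-8", "utf-16-le")}
    # The same typed password gives the same key only if the octets agree.
    print(password, "gb18030 == utf-8:", keys["gb18030"] == keys["utf-8"],
          "| utf-8 == utf-16-le:", keys["utf-8"] == keys["utf-16-le"])
```

An ASCII-only password produces the same key under GB18030 and UTF-8, but
not under UTF-16; a Chinese password produces a different key under each
encoding.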

 - Dennis



-----Original Message-----
From: Ming Fei Jia [mailto:jiamingf@cn.ibm.com] 
Sent: Tuesday, February 17, 2009 07:46
To: dennis.hamilton@acm.org
Cc: oic@lists.oasis-open.org
Subject: RE: [oic] interoperability and non-latin scripts

Dennis,

First, let me clarify: when I mentioned the Chinese encoding issues, I meant
that we should add tests for documents in non-Latin encodings. Those
encodings are in fact compatible with Unicode. If ODF allows non-Unicode
encodings and ODF applications support them, there should be no issues.


> From:
> 
> "Dennis E. Hamilton" <dennis.hamilton@acm.org>

[ ... ]
> 1. Do GB18030 and GB2312 character-set encodings all have corresponding
> character encodings in Unicode [4.0? 5.0?]?
Yes. GB18030 has the corresponding character encodings in Unicode 5.0.
GB2312 is a subset of GB18030.
> 
> 2. I am asking because XML is specified in terms of Unicode no matter what
> the encoding parameter is.  I understand one might want to say
> encoding="GB2312" to ensure that text is confined to the characters and
> encodings of that specification to be useful in entry, display, printing
> and processing outside of the ODF package.   Having a reliable "standard"
> mapping to Unicode is valuable, if available.  (It also matters what
> version of XML 1.0 we specify as normative for ODF, in terms of what can
> appear in special types, such as xml:id, NCNAMEs, etc.)
> 
> 3. How do you see this impacting use of IRIs and "full-path" names of Zip
> items?  Can the "full-path" be carried in UTF-8 even though the coded
> characters are meant to be limited to those of GB2312 or GB18030?
> Likewise, would you expect that manifest.xml could have encoding="GB2312"
> (for example)?
You mean a mixed encoding in one XML file, where "full-path" is encoded in
UTF-8 and the other text is encoded in GB2312 or GB18030? I have not
verified that case; even if it works, I would not prefer it. I think the
full text is generally encoded in a single encoding. If an IRI is in a
non-Unicode encoding and the ODF application supports that encoding, it
should work. Of course, the OS also needs to support that encoding.
[ ... ]


