ubl-dev message


Subject: RE: [ubl-dev] Two new documents re: UBL Methodology for Code List and Value Validation


At 2007-04-24 07:00 -0700, David RR Webber (XML) wrote:
>Good work - this is always a task to ensure everything is consistent.
>
>I noticed this little innocuous line ;-)
>
>  - use of UTF-8 encoding for all XML artefacts
>
>Which way did you go on country names?  Some had accented characters 
>- are those now just plain characters?

I didn't change the repertoire of characters, David, only the encoding.

>Just curious....

This came to me the last time I taught my XML syntax class just recently.

An XML document without an XML declaration with encoding is 
interpreted by a receiving system in any default encoding triggered 
by a higher-level protocol (if present).  It is a surprise to 
students to hear that UTF-8 is *not* necessarily the default encoding 
when an XML document does not have an XML declaration, though the 
default happens to be UTF-8 in most cases because higher-level 
protocols are not in play:

   http://www.w3.org/TR/2006/REC-xml-20060816/#charencoding

So the default encoding can be at the whim of any higher-level 
protocols that might be engaged to transmit the document (I'm 
thinking here perhaps of a Shift-JIS assumption in a Japanese 
transmission).  So there is an infinitesimal but not impossible risk 
of mismatch if I published committee artefacts without an XML declaration.

An XML document with an XML declaration with encoding declares the 
document is in the specific encoding mentioned ... so having this 
removes any risk in that regard.
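
To make that check concrete, here is a minimal sketch in Python (my choice for illustration only) that reports which encoding, if any, a document's XML declaration names. The helper name declared_encoding is hypothetical, and the sketch assumes the document starts with no byte-order mark and is in an ASCII-compatible encoding, so the declaration itself can be read as ASCII bytes.

```python
import re

# Captures the value of the encoding pseudo-attribute inside an XML
# declaration at the very start of the document.  [^>]*? cannot cross
# the closing "?>", so attributes on later elements are never matched.
_DECL = re.compile(
    rb'^<\?xml\s[^>]*?\bencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def declared_encoding(data: bytes):
    """Return the encoding named in the XML declaration, or None.

    Hypothetical helper: assumes no byte-order mark and an
    ASCII-compatible encoding.
    """
    match = _DECL.match(data)
    if match is None:
        return None
    return match.group(1).decode("ascii")


with_encoding = b'<?xml version="1.0" encoding="UTF-8"?>\n<Name>Canada</Name>'
version_only = b'<?xml version="1.0"?>\n<Name encoding="x">Canada</Name>'
no_declaration = b'<Name>Canada</Name>'

for sample in (with_encoding, version_only, no_declaration):
    print(declared_encoding(sample))
```

Only the first sample states its encoding; for the other two the receiver is left to defaults or to whatever a higher-level protocol says.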

But conformant XML processors are only required to support UTF-8 and 
UTF-16.  Some of the encodings I was using for convenience and for 
manual data entry were US-ASCII and ISO-8859-1.  While I'm sure most 
XML processors would support these, as an international standards 
artefact I thought it best to make no assumptions about the XML 
processors that might be working with the documents.  Again an 
infinitesimal but not impossible risk of a user not being able to 
function with the artefacts.

So, summed together, that indicates that committee-published artefacts:

  (1) - should have an XML declaration for encoding; and
  (2) - should declare the use of UTF-8 or UTF-16.

Coming to that conclusion on my own reminded me (Doh!) that there 
are two "additional document constraints" in UBL that say essentially 
the same thing, so I didn't have to go figuring this all out on my own:

   http://docs.oasis-open.org/ubl/os-UBL-2.0/UBL-2.0.html#d0e3610
   [IND2] All UBL instance documents MUST identify their character
   encoding within the XML declaration.
   [IND3] In conformance with ISO IEC ITU UN/CEFACT eBusiness
   Memorandum of Understanding Management Group (MOUMG) Resolution
   01/08 (MOU/MG01n83) as agreed to by OASIS, all UBL XML SHOULD be
   expressed using UTF-8.

So ... an XML document with accented letters in country names, 
expressed in UTF-8 encoding, still has accented letters in country 
names because the UTF-8 encoding encodes the entire Unicode 
repertoire and those accented letters are in the repertoire.

Thankfully, encoding is orthogonal to repertoire and the encoding 
decision does not impose any restrictions on characters needed in an 
XML document.
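
A small sketch in Python (again only for illustration) shows the same point: one country name serialized under three different encodings gives three different byte sequences, but a parser recovers the identical characters from each.  The name NAME, the element name and the helper as_document are my own assumptions, not taken from any UBL code list.  In the US-ASCII case the accented letter cannot be stored directly, so it is written as a numeric character reference.

```python
import xml.etree.ElementTree as ET

# Illustrative value only; \u00f4 is "o with circumflex".
NAME = "C\u00f4te d'Ivoire"


def as_document(encoding: str) -> bytes:
    """Serialize a one-element document in the named encoding.

    Characters the encoding cannot represent are emitted as numeric
    character references (for example &#244; under US-ASCII).
    """
    text = f'<?xml version="1.0" encoding="{encoding}"?><Name>{NAME}</Name>'
    return text.encode(encoding, errors="xmlcharrefreplace")


documents = {
    enc: as_document(enc) for enc in ("US-ASCII", "ISO-8859-1", "UTF-8")
}

# Different bytes on disk, same characters after parsing.
parsed = {enc: ET.fromstring(doc).text for enc, doc in documents.items()}

for enc, doc in documents.items():
    print(enc, len(doc), parsed[enc] == NAME)
```

The byte lengths differ from one encoding to the next, while the comparison against NAME holds for all three.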

I'll continue to use US-ASCII and ISO-8859-1 for my internal work (my 
editing software doesn't support UTF-8), but the more documents I 
touch that I'm producing for external consumption, the more careful 
I'm trying to be about explicitly having an XML declaration for UTF-8 
encoding.  I can write in my own encoding and use an XSLT identity 
transform to convert any of my files into UTF-8 for publishing as a 
committee document.
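
For readers without an XSLT processor at hand, a rough Python analogue of that re-encoding step is sketched below.  It is not the XSLT identity transform I use; it swaps in the standard library's ElementTree, which parses the source in whatever encoding it declares and writes it back out as UTF-8 with an explicit declaration.  The function name to_utf8 and the sample document are hypothetical.  One caveat I am sure of: ElementTree does not preserve a DOCTYPE or any comments and processing instructions outside the document element, so this is only suitable for simple files.

```python
import io
import xml.etree.ElementTree as ET


def to_utf8(source: bytes) -> bytes:
    """Re-serialize an XML document as UTF-8 with an XML declaration.

    Hypothetical helper; the parser honours the encoding declared in
    the source document.
    """
    tree = ET.ElementTree(ET.fromstring(source))
    out = io.BytesIO()
    # xml_declaration=True forces the declaration to be written.
    tree.write(out, encoding="UTF-8", xml_declaration=True)
    return out.getvalue()


# Working copy kept in ISO-8859-1: byte 0xF4 is "o with circumflex".
working_copy = (
    b'<?xml version="1.0" encoding="ISO-8859-1"?>'
    b"<Name>C\xf4te d'Ivoire</Name>"
)

published = to_utf8(working_copy)
print(published.decode("utf-8"))
```

The accented letter survives the conversion; only the bytes used to store it change.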

I hope this helps explain my rationale.  I apologize if it sounds 
pedantic, but I felt it necessary for the archive to spell out the 
reasoning so that readers understand the decision was not made lightly.

. . . . . . . . . . . . . . . . . Ken

--
World-wide corporate, govt. & user group XML, XSL and UBL training
RSS feeds:     publicly-available developer resources and training
G. Ken Holman                 mailto:gkholman@CraneSoftwrights.com
Crane Softwrights Ltd.          http://www.CraneSoftwrights.com/u/
Box 266, Kars, Ontario CANADA K0A-2E0    +1(613)489-0999 (F:-0995)
Male Cancer Awareness Aug'05  http://www.CraneSoftwrights.com/u/bc
Legal business disclaimers:  http://www.CraneSoftwrights.com/legal
