ubl-dev message

Subject: Re: Datatype Methodology
From: jon.bosak@sun.com
To: david@drrw.info
Date: Tue, 9 May 2006 12:32:36 -0700 (PDT)
Could we please restrict these discussions to ubl-dev?  I'm really
tired of seeing two copies of every post.

The W3C XML recommendation goes into agonizing detail about which
unicode characters are allowed and which are not.  That's where
these concerns are and should be addressed -- at the level of the
XML recommendation.  An application that does not support the
characters specified in the XML recommendation is not a conformant
XML application.  Full stop.

If an application purporting to be an XML application chokes on
particular unicode characters allowed by XML, then that
application is broken; fix it.
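
As a rough illustration of what "allowed by XML" means, here is a minimal sketch of the Char production from XML 1.0. The function names are made up for this example and come from no particular library.

```python
def is_xml_char(ch: str) -> bool:
    """Return True if ch matches the Char production of XML 1.0."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF        # excludes the surrogate block
        or 0xE000 <= cp <= 0xFFFD      # excludes U+FFFE and U+FFFF
        or 0x10000 <= cp <= 0x10FFFF   # supplementary planes
    )


def first_illegal_char(text: str):
    """Return (index, character) of the first non-XML character, or None."""
    for i, ch in enumerate(text):
        if not is_xml_char(ch):
            return i, ch
    return None
```

Under this definition the BIDI embedding controls and the musical notation scoping characters mentioned elsewhere in the thread are all legal XML characters, so a conformant processor has to accept them.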

If users persist in embedding characters in file names that aren't
legal in the target system, tell them to stop.

It is not the job of UBL to create mechanisms for specifying which
of the 30,000+ unicode characters are allowed in XML documents and
which are not.  That job belongs to other specification efforts.
Insofar as this is a real problem (and I'm not sure that it is),
it applies to all XML documents, not just UBL.  If a solution is
necessary, it should be developed at a level that applies to XML
in general.

Jon

   Date: Tue, 09 May 2006 05:50:07 -0700
   From: "David RR Webber (XML)" <david@drrw.info>
   Cc: ubl-dev@lists.oasis-open.org, ubl@lists.oasis-open.org,
	   CAM OASIS TC <cam@lists.oasis-open.org>

   Steve, 

   Right now the only way I'm aware of controlling this is thru the XML
   prologue and setting UTF-8, etc. 

   Like Bryan - we have found this problematic in production.  File
   attachments and file names are one area where people can create a
   filename on one O/S that is then not processable or causes problems
   elsewhere - especially when persisting into the backend database
   (e.g. Oracle) or when opening a file handle.

   The only way we have addressed this to date is to issue manual
   guidelines to submitters.  Because these characters can cause issues in
   the processing at various levels - failures can occur prior to or after
   the CAM step ; -) 

   It's a good thought though - to add the ability to filter on character
   codes via an exclusion table mechanism - that would then point up the
   problem - e.g. invalid character code found in element <dataitem123>
   etc.  And then a predicate applyCharacterFilter(/XPath/, filtername). 
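
A minimal sketch of how such an exclusion-table filter might look, assuming Python and the standard-library ElementTree. The function name apply_character_filter, the table layout, and the filter name "no-bidi-controls" are hypothetical. This is not how CAM implements or specifies anything, and ElementTree supports only a limited XPath subset.

```python
import xml.etree.ElementTree as ET

# Hypothetical exclusion table: filter name -> inclusive code point ranges.
EXCLUSION_TABLES = {
    "no-bidi-controls": [(0x202A, 0x202E)],
}


def apply_character_filter(root, path, filter_name):
    """Return (tag, 'U+XXXX') pairs for excluded characters found in the
    text of elements selected by path."""
    ranges = EXCLUSION_TABLES[filter_name]
    problems = []
    for elem in root.findall(path):
        for ch in elem.text or "":
            cp = ord(ch)
            if any(lo <= cp <= hi for lo, hi in ranges):
                problems.append((elem.tag, f"U+{cp:04X}"))
    return problems


doc = ET.fromstring(
    "<order><dataitem123>abc&#x202E;def</dataitem123><note>fine</note></order>"
)
for tag, code in apply_character_filter(doc, ".//dataitem123", "no-bidi-controls"):
    print(f"invalid character code {code} found in element <{tag}>")
# prints: invalid character code U+202E found in element <dataitem123>
```

The sketch only inspects the direct text of each selected element. A real filter would also need to decide how to treat attributes, tail text, and nested content.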

   DW


    -------- Original Message --------
   Subject: Re: [ubl-dev] SV: [ubl] Re: [ubl-dev] Datatype Methodology RE:
   [ubl-dev] SBS and  Restricted Data Types
   From: stephen.green@systml.co.uk
   Date: Tue, May 09, 2006 5:55 am
   To: ubl-dev@lists.oasis-open.org, ubl@lists.oasis-open.org

   Bryan, All,

   This raises an interesting point. There is surely an important need
   to specify in a trading agreement the character set to be used in
   the documents. I wonder whether even CAM has this :-) After all, should
   my application have to be able to support musical notation or
   hieroglyphics in a product description? Maybe there should be a way to
   specify a subset of a character set too (especially if it is Unicode we
   are talking about). I bet many have had problems when a character
   decodes to two characters in certain systems (e.g. the GBP sign): not
   good for translation to fixed width and/or EDI.
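
One common way a single character turns into two is a UTF-8 byte sequence being read as a single-byte encoding. The sketch below assumes that is the failure mode meant here and uses Latin-1 as the wrong encoding. The thread itself does not say which systems or encodings were involved.

```python
pound = "\u00a3"                        # the GBP sign, one character
utf8_bytes = pound.encode("utf-8")      # two bytes: 0xC2 0xA3
misread = utf8_bytes.decode("latin-1")  # each byte becomes its own character

print(len(pound), len(utf8_bytes), len(misread))  # prints "1 2 2"
```

A fixed-width field sized for one character would overflow or truncate on the misread value, which is the kind of problem that shows up when mapping to EDI.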

   All the best

   Steve

   Quoting Bryan  Rasmussen <BRS@itst.dk>:

   > I agree with not setting string length restrictions, I think it would be nice
   > to have string length minimums or constraints to require some content in an
   > element if the element is required, but it's not a big thing for me.
   >
   > Another thing though would be restricting characters that are not needed, as
   > per the recommendations in http://www.w3.org/TR/unicode-xml/#Suitable
   >
   > I think what should be restricted is (from document):
   >
   > U+202A .. U+202E BIDI embedding controls
   > (LRE, RLE, LRO, RLO, PDF) Strongly discouraged in [HTML 4.0]
   > U+206A .. U+206B Activate/Inhibit Symmetric swapping Deprecated in Unicode
   > U+206C .. U+206D Activate/Inhibit Arabic form shaping Deprecated in Unicode
   > U+206E .. U+206F Activate/Inhibit National digit shapes Deprecated in Unicode
   >
   > U+FFF9 .. U+FFFB Interlinear annotation characters Use ruby markup [Ruby]
   > U+FEFF Byte order mark / ZWNBSP Use only as byte order mark. Use U+2060 Word
   > Joiner instead of using U+FEFF as ZWNBSP
   > U+FFFC Object replacement character Use markup
   > U+1D173 .. U+1D17A Scoping for Musical Notation Use an appropriate markup
   > language
   > U+E0000 .. U+E007F Language Tag codepoints
   >
   > I don't want to restrict the use of line feeds etc. as is recommended in the
   > aforementioned document.
   >
   > Cheers,
   > Bryan Rasmussen
   >