Subject: RE: [ubl-dev] SV: [ubl] Re: [ubl-dev] Datatype Methodology RE: [ubl-dev] SBS and Restricted Data Types
Ken,

Thanks for the CRVL reference - I found http://xml.coverpages.org/DSDL-Part7-200502.pdf and it looks comprehensive. Are there any XML usage samples out there too?

Thanks, DW

p.s. Sorry, I forgot the XML prologue got deprecated - people still insist on putting <?xml version="1.0"?> everywhere though ;-)

-------- Original Message --------
Subject: RE: [ubl-dev] SV: [ubl] Re: [ubl-dev] Datatype Methodology RE: [ubl-dev] SBS and Restricted Data Types
From: "G. Ken Holman" <gkholman@CraneSoftwrights.com>
Date: Tue, May 09, 2006 4:06 pm
To: UBL-Dev <ubl-dev@lists.oasis-open.org>

At 2006-05-09 05:50 -0700, David RR Webber (XML) wrote:
>Right now the only way I'm aware of controlling this is thru the XML
>prologue and setting UTF-8, etc.

There is no XML prologue ... there was an SGML prologue but it doesn't exist in XML.

>Like Bryan - we have found this problematic in production. File
>attachments and file names are one area where people can create a
>filename on one O/S that is then not processable / gives problems -
>especially persisting into the backend database (e.g. Oracle) or during
>file handle opening.

I don't believe we have any UBL information items that name system resources, so this shouldn't be a problem.

>The only way we have addressed this to date is to issue manual
>guidelines to submitters. Because these characters can cause issues in
>the processing at various levels - failures can occur prior to or after
>the CAM step ;-)

Characters that are invalid in XML, and characters that are not appropriate according to the W3C Note http://www.w3.org/TR/2003/NOTE-unicode-xml-20030613/, should be avoided ... other characters might be undesirable but shouldn't cause system failures.

>It's a good thought though - to add the ability to filter on character
>codes via an exclusion table mechanism - that would then point up the
>problem - e.g. invalid character code found in element <dataitem123>
>etc. And then a predicate applyCharacterFilter(/XPath/, filtername).
The current work I'm familiar with is DSDL's CRVL: Character Repertoire Validation Language, where one can declare the characters that are considered acceptable for an instance:

http://www.jtc1sc34.org/repository/0593c.htm

But I don't believe it includes context, so I've just submitted that as a committee comment.

> -------- Original Message --------
>Subject: Re: [ubl-dev] SV: [ubl] Re: [ubl-dev] Datatype Methodology RE:
>[ubl-dev] SBS and Restricted Data Types
>From: stephen.green@systml.co.uk
>Date: Tue, May 09, 2006 5:55 am
>To: ubl-dev@lists.oasis-open.org, ubl@lists.oasis-open.org
>
>Bryan, All,
>
>This raises an interesting point. There is surely an important need
>to specify in a trading agreement the character set to be used in
>the documents.

The character set of an instance should be irrelevant, given that it is handled by the XML processor and not by the application. The XML processor delivers Unicode characters to the application regardless of the character set used in the XML declaration and the instance.

However, character repertoire might, indeed, have to be restricted by an implementation, e.g., "I only support Western European characters and not Hebrew, Arabic or Han-based languages". This can be declared in CRVL.

>I wonder whether even CAM has this :-) After all, should
>my application have to be able to support musical notation or
>hieroglyphics in a product description? Maybe there should be a way
>to specify a subset of a character set too (especially if it is
>Unicode we are talking about).
>I bet many have had problems when a character decodes to two characters
>in certain systems (e.g. the GBP sign): not good for translation to
>fixed width and/or EDI.

The XML-based application should be receiving the Unicode repertoire character for the GBP sign and not the encoding that was used to represent the character. Thankfully, using XML properly skirts many of the pitfalls and drawbacks of character sets.
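To illustrate the point about the XML processor delivering Unicode characters regardless of the instance's encoding, here is a minimal Python sketch. It is only an illustration: the element name Price and its content are made up for the example, and the standard library's xml.etree.ElementTree stands in for "the XML processor".

```python
import xml.etree.ElementTree as ET

# Hypothetical one-element document containing the pound sign (U+00A3).
body = "<Price>\u00a3100</Price>"

# The same document serialized in two different encodings.
as_utf8 = ('<?xml version="1.0" encoding="UTF-8"?>' + body).encode("utf-8")
as_latin1 = ('<?xml version="1.0" encoding="ISO-8859-1"?>' + body).encode("iso-8859-1")

# On the wire the pound sign is two bytes in UTF-8 and one in ISO-8859-1 ...
pound_utf8 = "\u00a3".encode("utf-8")
pound_latin1 = "\u00a3".encode("iso-8859-1")

# ... but the parser hands the application the same Unicode string either way.
utf8_value = ET.fromstring(as_utf8).text
latin1_value = ET.fromstring(as_latin1).text
print(utf8_value == latin1_value, len(utf8_value))

# The "one character becomes two" symptom appears when code bypasses the
# parser and decodes the raw bytes with the wrong character set.
mojibake = pound_utf8.decode("iso-8859-1")
print(len(mojibake))
```

The last two lines reproduce the kind of problem described in the quoted text: the trouble comes from handling the bytes outside the XML processor, not from the choice of encoding in the instance.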
Unfortunately, many Java programmers (including some of my clients) were not aware of this and have messed up working XML systems by inadvertently injecting character set problems without considering the issues.

>Quoting Bryan Rasmussen <BRS@itst.dk>:
>
> > I agree with not setting string length restrictions, I think it would be nice
> > to have string length minimums or constraints to require some content in an
> > element if the element is required, but it's not a big thing for me.
> >
> > Another thing though would be restricting characters that are not needed, as
> > per the recommendations in http://www.w3.org/TR/unicode-xml/#Suitable
> >
> > I think what should be restricted is (from document):
> >
> > U+202A .. U+202E  BIDI embedding controls (LRE, RLE, LRO, RLO, PDF): Strongly discouraged in [HTML 4.0]
> > U+206A .. U+206B  Activate/Inhibit Symmetric swapping: Deprecated in Unicode
> > U+206C .. U+206D  Activate/Inhibit Arabic form shaping: Deprecated in Unicode
> > U+206E .. U+206F  Activate/Inhibit National digit shapes: Deprecated in Unicode
> > U+FFF9 .. U+FFFB  Interlinear annotation characters: Use ruby markup [Ruby]
> > U+FEFF  Byte order mark / ZWNBSP: Use only as byte order mark. Use U+2060 Word Joiner instead of using U+FEFF as ZWNBSP
> > U+FFFC  Object replacement character: Use markup
> > U+1D173 .. U+1D17A  Scoping for Musical Notation: Use an appropriate markup language
> > U+E0000 .. U+E007F  Language Tag codepoints

I think those should just be assumed to be not allowed, without even having to indicate so in CRVL ... I added a question about this to the CRVL editor.

> > I don't want to restrict the use of line feeds etc. as is recommended in the
> > aforementioned document.

I do not see "line feed" mentioned in that document, but "line separator" is, which is quite different and not related to the characters I understand are used by processing systems ...
I'm not familiar with any systems that use the Unicode line separator, but it is recognized in XML 1.1 as a valid end of line character, translated to a new-line for the application, should it be used. I don't see a problem with including it in the list of unsuitable characters.

I hope this helps.

. . . . . . . . . . . Ken

--
G. Ken Holman mailto:gkholman@CraneSoftwrights.com
Crane Softwrights Ltd. http://www.CraneSoftwrights.com/u/
Box 266, Kars, Ontario CANADA K0A-2E0 +1(613)489-0999 (F:-0995)
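As a rough illustration of the exclusion-table idea discussed in this thread (reporting which element contains an unsuitable character), here is a minimal Python sketch. It is not CRVL and not a CAM feature: the function name apply_character_filter merely echoes the predicate suggested earlier in the thread, the table layout is an assumption, and the musical notation range is taken as U+1D173 .. U+1D17A as given in the W3C note. Attribute values are not checked.

```python
import xml.etree.ElementTree as ET

# Hypothetical exclusion table: inclusive code point ranges with a label,
# following the list of unsuitable characters quoted in this thread.
UNSUITABLE = [
    (0x202A, 0x202E, "BIDI embedding controls"),
    (0x206A, 0x206F, "deprecated formatting controls"),
    (0xFFF9, 0xFFFB, "interlinear annotation characters"),
    (0xFEFF, 0xFEFF, "byte order mark / ZWNBSP"),
    (0xFFFC, 0xFFFC, "object replacement character"),
    (0x1D173, 0x1D17A, "scoping for musical notation"),
    (0xE0000, 0xE007F, "language tag code points"),
]


def find_unsuitable(text):
    """Yield (code point, label) for each excluded character in text."""
    for ch in text:
        cp = ord(ch)
        for low, high, label in UNSUITABLE:
            if low <= cp <= high:
                yield cp, label
                break


def apply_character_filter(root):
    """Return (element path, code point, label) for every excluded character."""
    findings = []

    def walk(elem, path):
        here = f"{path}/{elem.tag}"
        # An element's character data is its own text plus the tail text
        # that follows each of its children.
        chunks = [elem.text or ""] + [child.tail or "" for child in elem]
        for chunk in chunks:
            for cp, label in find_unsuitable(chunk):
                findings.append((here, f"U+{cp:04X}", label))
        for child in elem:
            walk(child, here)

    walk(root, "")
    return findings


# Made-up instance: <dataitem123> contains U+202E (right-to-left override).
doc = ET.fromstring(
    "<Invoice><Note>plain text</Note>"
    "<dataitem123>abc\u202edef</dataitem123></Invoice>"
)
findings = apply_character_filter(doc)
for path, code_point, label in findings:
    print(f"unsuitable character {code_point} ({label}) found in element {path}")
```

The check runs on the character data delivered by the parser, so it works on Unicode characters and is independent of the encoding used in the instance.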