ubl-dev message

Subject: RE: [ubl-dev] SV: [ubl] Re: [ubl-dev] Datatype Methodology RE: [ubl-dev] SBS and Restricted Data Types


Ken, 

Thanks for the CRVL reference. I found
http://xml.coverpages.org/DSDL-Part7-200502.pdf and it looks
comprehensive.

Are there any XML usage samples out there too?  

Thanks, DW 

p.s. Sorry, I forgot the XML prologue got deprecated - people still
insist on putting <?xml version="1.0"?> everywhere though ; -)


 -------- Original Message --------
Subject: RE: [ubl-dev] SV: [ubl] Re: [ubl-dev] Datatype Methodology RE:
[ubl-dev] SBS and  Restricted Data Types
From: "G. Ken Holman" <gkholman@CraneSoftwrights.com>
Date: Tue, May 09, 2006 4:06 pm
To: UBL-Dev <ubl-dev@lists.oasis-open.org>

At 2006-05-09 05:50 -0700, David RR Webber (XML) wrote:
>Right now the only way I'm aware of controlling this is thru the XML
>prologue and setting UTF-8, etc.

There is no XML prologue ... there was an SGML prologue but it 
doesn't exist in XML.

>Like Bryan - we have found this problematic in production.  File
>attachments and file names are one area where people can create a
>filename on one O/S that is then not processable / gives problems -
>especially when persisting into the backend database (e.g. Oracle) or
>during file handle opening.

I don't believe we have any UBL information items that name system 
resources, so this shouldn't be a problem.

>The only way we have addressed this to date is to issue manual
>guidelines to submitters.  Because these characters can cause issues in
>the processing at various levels - failures can occur prior to or after
>the CAM step ; -)

Characters that are invalid in XML, or that the W3C Note 
http://www.w3.org/TR/2003/NOTE-unicode-xml-20030613/ describes as not 
appropriate, should be avoided ... other characters might be 
undesirable but shouldn't cause system failures.
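
As a minimal sketch of the first category, the check below tests a string 
against the Char production of XML 1.0 (#x9, #xA, #xD, #x20-#xD7FF, 
#xE000-#xFFFD, #x10000-#x10FFFF).  The function names and the sample file 
name are hypothetical and are not taken from UBL, CAM or CRVL.

```python
def is_xml10_char(ch):
    """True if ch matches the XML 1.0 'Char' production."""
    cp = ord(ch)
    return (cp in (0x9, 0xA, 0xD)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)


def invalid_xml10_chars(text):
    """Return (index, 'U+XXXX') for every character not allowed in XML 1.0."""
    return [(i, f"U+{ord(ch):04X}")
            for i, ch in enumerate(text)
            if not is_xml10_char(ch)]


# Hypothetical file name containing a NUL and a vertical tab.
bad = invalid_xml10_chars("file\x00name\x0b.txt")
print(bad)
```

A check like this could run before serialization, so that the problem is 
reported against the offending value rather than surfacing later as a 
parser failure.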

>It's a good thought though - to add the ability to filter on character
>codes via an exclusion table mechanism - that would then point up the
>problem - e.g. invalid character code found in element <dataitem123>
>etc.  And then a predicate applyCharacterFilter(/XPath/, filtername).

The current work I'm familiar with is DSDL's CRVL: Character 
Repertoire Validation Language where one can declare the characters 
that are considered acceptable for an instance:

  http://www.jtc1sc34.org/repository/0593c.htm

But I don't believe it includes context, so I've just submitted that 
as a committee comment.

>  -------- Original Message --------
>Subject: Re: [ubl-dev] SV: [ubl] Re: [ubl-dev] Datatype Methodology RE:
>[ubl-dev] SBS and  Restricted Data Types
>From: stephen.green@systml.co.uk
>Date: Tue, May 09, 2006 5:55 am
>To: ubl-dev@lists.oasis-open.org, ubl@lists.oasis-open.org
>
>Bryan, All,
>
>This raises an interesting point. There is surely an important need
>to specify in a trading agreement the character set to be used in
>the documents.

The character set of an instance should be irrelevant given that it is 
handled by the XML processor and not by the application.  The XML 
processor delivers Unicode characters to the application regardless 
of the character set used in the XML declaration and the instance.

However, character repertoire might, indeed, have to be restricted by 
an implementation, e.g., "I only support Western European characters 
and not Hebrew, Arabic or Han-based languages".  This can be declared in
CRVL.
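
The sketch below is not CRVL syntax; it only illustrates the idea of a 
repertoire restriction in Python, assuming the standard 
xml.etree.ElementTree parser.  The "Western European" ranges, the function 
names and the sample element names are all assumptions made for 
illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical "Western European" repertoire: tab, LF, CR, printable
# Basic Latin and Latin-1 Supplement. An illustrative assumption only,
# not a definition taken from CRVL or UBL.
ALLOWED_RANGES = [(0x09, 0x0A), (0x0D, 0x0D), (0x20, 0x7E), (0xA0, 0xFF)]


def in_repertoire(ch):
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ALLOWED_RANGES)


def repertoire_violations(xml_bytes):
    """Return (element tag, character) pairs outside the repertoire.

    Tail text is reported against the element it follows.
    Attribute values are not checked in this sketch.
    """
    root = ET.fromstring(xml_bytes)
    found = []
    for elem in root.iter():
        for ch in (elem.text or "") + (elem.tail or ""):
            if not in_repertoire(ch):
                found.append((elem.tag, ch))
    return found


# Hypothetical instance mixing Latin and Hebrew text.
doc = ("<Item><Description>Caf\u00e9 \u05e9\u05dc\u05d5\u05dd"
       "</Description></Item>").encode("utf-8")
violations = repertoire_violations(doc)
print(violations)
```

Because the check runs on the characters delivered by the parser, it is 
independent of the encoding used to transmit the instance.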

>I wonder whether even CAM has this :-) After all, should
>my application have to be able to support musical notation or
>hieroglyphics in a product description? Maybe there should be a way
>to specify a subset of a character set too (especially if it is
>Unicode we are talking about).
>I bet many have had problems when a character decodes to two
>characters in certain systems (e.g. the GBP sign): not good for
>translation to fixed width and/or EDI.

The XML-based application should be receiving the Unicode repertoire 
character for GBP sign and not the encoding of the character that was 
used to represent the character.
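
A small sketch of that behaviour, assuming Python's standard 
xml.etree.ElementTree parser (the element and attribute names are 
hypothetical): the same content is serialized in two encodings, and the 
application sees the same single Unicode character for the GBP sign in 
both cases.

```python
import xml.etree.ElementTree as ET

# The same hypothetical document serialized in two different encodings.
body = '<Price currency="GBP">\u00a3 10</Price>'
latin1_bytes = ('<?xml version="1.0" encoding="ISO-8859-1"?>'
                + body).encode("iso-8859-1")
utf8_bytes = ('<?xml version="1.0" encoding="UTF-8"?>'
              + body).encode("utf-8")

# On the wire the GBP sign is one byte in ISO-8859-1 and two in UTF-8.
pound_latin1 = "\u00a3".encode("iso-8859-1")
pound_utf8 = "\u00a3".encode("utf-8")

# The parser decodes according to the XML declaration and hands the
# application Unicode text either way.
from_latin1 = ET.fromstring(latin1_bytes).text
from_utf8 = ET.fromstring(utf8_bytes).text
print(from_latin1 == from_utf8, len(from_utf8))
```

The "two characters" symptom typically appears when something downstream 
treats the encoded bytes as if they were characters, not when the parsed 
text is used.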

Thankfully, using XML properly skirts many of the pitfalls and 
drawbacks of character sets.  Unfortunately, many Java programmers 
(including some of my clients) were not aware of this and have messed 
up working XML systems by inadvertently injecting character set 
problems without considering the issues.

>Quoting Bryan  Rasmussen <BRS@itst.dk>:
>
> > I agree with not setting string length restrictions, I think it
> > would be nice to have string length minimums or constraints to
> > require some content in an element if the element is required, but
> > it's not a big thing for me.
> >
> > Another thing though would be restricting characters that are not
> > needed, as per the recommendations in
> > http://www.w3.org/TR/unicode-xml/#Suitable
> >
> > I think what should be restricted is (from document):
> >
> > U+202A .. U+202E BIDI embedding controls
> > (LRE, RLE, LRO, RLO, PDF) Strongly discouraged in [HTML 4.0]
> > U+206A .. U+206B Activate/Inhibit Symmetric swapping Deprecated  in Unicode
> > U+206C .. U+206D Activate/Inhibit Arabic form shaping Deprecated in Unicode
> > U+206E .. U+206F Activate/Inhibit National digit shapes Deprecated in Unicode
> >
> > U+FFF9 .. U+FFFB Interlinear annotation characters Use ruby markup [Ruby]
> > U+FEFF Byte order mark / ZWNBSP Use only as byte order mark. Use U+2060 Word
> > Joiner instead of using U+FEFF as ZWNBSP
> > U+FFFC Object replacement character Use markup
> > U+1D173 .. U+1D17A Scoping for Musical Notation Use an appropriate markup
> > language
> > U+E0000 .. U+E007F Language Tag codepoints

I think those should just be assumed to be not allowed, without even 
having to indicate so in CRVL ... I sent a question about this to 
the CRVL editor.
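
Until a schema language covers this, an application-level check is one 
option.  The following is a minimal Python sketch, not an established 
tool: the ranges follow the list quoted above as given in the W3C Note 
(with the musical notation range taken as U+1D173..U+1D17A), and the 
function name and sample string are hypothetical.

```python
# Code point ranges the W3C Note lists as not suitable for use with markup.
DISCOURAGED_RANGES = [
    (0x202A, 0x202E),    # BIDI embedding controls
    (0x206A, 0x206F),    # deprecated formatting characters
    (0xFFF9, 0xFFFB),    # interlinear annotation characters
    (0xFEFF, 0xFEFF),    # byte order mark / ZWNBSP
    (0xFFFC, 0xFFFC),    # object replacement character
    (0x1D173, 0x1D17A),  # scoping for musical notation
    (0xE0000, 0xE007F),  # language tag code points
]


def find_discouraged(text):
    """Return (index, 'U+XXXX') for each discouraged character in text."""
    hits = []
    for i, ch in enumerate(text):
        cp = ord(ch)
        if any(lo <= cp <= hi for lo, hi in DISCOURAGED_RANGES):
            hits.append((i, f"U+{cp:04X}"))
    return hits


# Hypothetical description containing an RLO control and an object
# replacement character.
sample = "Widget\u202e desc\ufffc"
hits = find_discouraged(sample)
print(hits)
```

A byte order mark at the very start of an entity is consumed by the XML 
processor, so U+FEFF would only be reported here when it occurs inside 
the content.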

> > I don't want to restrict the use of line feeds etc. as is
> > recommended in the aforementioned document.

I do not see "line feed" mentioned in that document, but "line 
separator" is, which is quite different and not related to the 
characters I understand are used by processing systems.  I'm not 
familiar with any systems that use the Unicode line separator, but it 
is recognized in XML 1.1 as a valid end-of-line character, translated 
to a new-line for the application, should it be used.  I don't see a 
problem with including it in the list of unsuitable characters.

I hope this helps.

. . . . . . . . . . . Ken

--
Registration open for XSLT/XSL-FO training: Wash.,DC 2006-06-12/16
Also for XSLT/XSL-FO training:    Minneapolis, MN 2006-07-31/08-04
Also for XML/XSLT/XSL-FO training:Birmingham,England 2006-05-22/25
Also for XSLT/XSL-FO training:    Copenhagen,Denmark 2006-05-08/11
World-wide on-site corporate, govt. & user group XML/XSL training.
G. Ken Holman                 mailto:gkholman@CraneSoftwrights.com
Crane Softwrights Ltd.          http://www.CraneSoftwrights.com/u/
Box 266, Kars, Ontario CANADA K0A-2E0    +1(613)489-0999 (F:-0995)
Male Cancer Awareness Aug'05  http://www.CraneSoftwrights.com/u/bc
Legal business disclaimers:  http://www.CraneSoftwrights.com/legal
