Subject: Initial Comments on UN/CEFACT ATG2 Core Component Schema Module

From: Tim McGrath <tmcgrath@portcomm.com.au>
To: ubl@lists.oasis-open.org
Date: Wed, 01 Sep 2004 15:41:45 +0800

I was asked to co-ordinate any UBL comments to be submitted to UN/CEFACT as part of the review of their XML Naming and Design Rules, in particular on how the schema modules for Core Components and Unqualified Data Types compare to those used by UBL.

My initial comments are posted here to encourage debate within the UBL TC. Whether or not we submit a response, and what it may say, is to be decided on the teleconference calls on September 8th/9th. I have only addressed the ATG2 Core Component Schema; presumably the Unqualified Data Type Schema will have these issues plus others.

In addressing these schemas, it has to be conceded that some basic XML Naming and Design Rules differ (and apparently always will) between ATG2 and UBL, for example in the use of global and local type definitions. This means that we are not aiming for compatibility with schemas built using ATG2 rules - we are aiming for interoperability. That is, can a document whose schema is expressed in ATG2 form be mapped to the components (elements and attributes) used by UBL?

The following outlines the areas that need to be addressed for this interoperability to be achieved.

Naming Rules
--------------
The difference in NDRs between ATG2 and UBL manifests itself most significantly in the naming of attributes.

The ATG2 naming rules [R 117 and R 133 duplicate each other] state ..
"Each supplementary component xsd:attribute "name" MUST be the supplementary component name with the separators and spaces removed. "

UBL's Naming and Design Rule [ATN1] originally stated..
"Each CCT:SupplementaryComponent xsd:attribute “name” MUST be the ccts:SupplementaryComponent dictionary entry name property term and representation term, with the separators removed." - but this has been reviewed (see under point 2. of this section).

In their implementation these differences have an impact in three significant ways.

1. UBL has adopted abbreviations for "Identifier" (which must appear as "ID") and "Uniform Resource Identifier" (which must appear as "URI").
Any attributes that contain these abbreviations will have different names. However, these are obviously interoperable, as we should be able to map one name to the other. Unfortunately, the given ATG2 Core Component schema does not adhere to the ATG2 rule (nor do the fragment samples in the main body agree with the final schemas - but we can assume the final schemas are what was intended), for example:
<xsd:attribute name="amountCurrencyID" type="xsd:token" use="optional">
<xsd:attribute name="amountCurrencyCodeListVersionID" type="xsd:token" use="optional">
<xsd:attribute name="codeListID" type="xsd:token" use="optional">
<xsd:attribute name="codeListAgencyID" type="xsd:token" use="optional">
<xsd:attribute name="codeListVersionID" type="xsd:token" use="optional">
<xsd:attribute name="codeLanguageID" type="xsd:language" use="optional">
<xsd:attribute name="identificationSchemeID" type="xsd:token" use="optional">
<xsd:attribute name="identificationSchemeAgencyID" type="xsd:token" use="optional">
<xsd:attribute name="identificationSchemeVersionID" type="xsd:token" use="optional">
<xsd:attribute name="measureUnitCodeListVersionID" type="xsd:token" use="optional">
<xsd:attribute name="quantityUnitCodeListID" type="xsd:token" use="optional">
<xsd:attribute name="quantityUnitCodeListAgencyID" type="xsd:token" use="optional">
<xsd:attribute name="languageID" type="xsd:language" use="optional">
<xsd:attribute name="languageLocaleID" type="xsd:token" use="optional">
- should all have the letters "ID" replaced by "Identifier", and...
<xsd:attribute name="binaryObjectURI" type="xsd:anyURI" use="optional">
<xsd:attribute name="codeListSchemeURI" type="xsd:anyURI" use="optional">
<xsd:attribute name="identificationSchemeDataURI" type="xsd:anyURI" use="optional">
<xsd:attribute name="identificationSchemeURI" type="xsd:anyURI" use="optional">
- should all have the letters "URI" replaced by "UniformResourceIdentifier", and...
<xsd:attribute name="codeListUniformResourceID" type="xsd:anyURI" use="optional">
- should have the letters "UniformResourceID" replaced by "UniformResourceIdentifier".
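
Because the substitution is regular, the mapping from a fully spelled-out ATG2 name to the UBL form can be sketched mechanically. The Python below is only an illustration and is not part of either set of NDRs; the function name atg2_to_ubl_abbreviations is hypothetical, and it assumes the ATG2 names spell both terms out in full as rule [R 117/133] would require.

```python
# Longest term first, so "UniformResourceIdentifier" is replaced as a whole
# rather than being half-rewritten by the shorter "Identifier" rule.
_ABBREVIATIONS = [
    ("UniformResourceIdentifier", "URI"),
    ("Identifier", "ID"),
]


def atg2_to_ubl_abbreviations(name: str) -> str:
    """Apply UBL's two abbreviations to a fully spelled-out attribute name.

    Hypothetical helper: assumes the input follows ATG2 rule R 117/133.
    """
    for full, short in _ABBREVIATIONS:
        name = name.replace(full, short)
    return name


print(atg2_to_ubl_abbreviations("codeListAgencyIdentifier"))
print(atg2_to_ubl_abbreviations("codeListUniformResourceIdentifier"))
```

Names that contain neither term pass through unchanged, so the helper can be applied to every attribute name without a lookup table.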

2. UBL truncates redundant Object Class in names.
UBL has realized that rule [ATN1] is not adequate to define an attribute name. This is because merely using property term and representation term for an attribute's name will not make it unique. For example, CodeType would have two attributes called "name" and two called "URI". The solution UBL adopted was based on Gunther's position paper of April 2002 (http://www.oasis-open.org/apps/org/workgroup/ubl/ubl-ndrsc/download.php/1505/draft-stuhec-nameTrun-01.doc):

• If a BBIE (Basic Business Information Entity) defined in a ABIE (Aggregated Business Information Entity) with the same “Object Class Term” and same “Object Class Qualifier”, that this “Object Class Term” can be truncated from the BBIE

In effect, UBL has abbreviated the ATG2 rule [R 117/133], so that CodeType would have one "name" attribute (for the Code. Name) and one "codeListName" (for the Code List. Name). Again, this makes mapping between the two straightforward. However, there is one exception. The current ATG2 schema adds a new Object Class (or perhaps qualifies the Object Class) for the attribute LanguageID - so it is known to ATG2 as codeLanguageID and to UBL as languageID. This mapping cannot be assumed from the current CCTS.

3. UBL applies an additional naming rule when the Representation term is "text".
UBL has had a long standing rule ...

7.(b) The representation term “Text” will be considered the default representation term when a representation term does not appear.

[NB this rule has not made it into the latest NDR rules despite being listed as approved at the plenary meeting in May 2003 (http://lists.oasis-open.org/archives/ubl-ndrsc/200201/doc00005.doc)].
This rule means that the attribute known to ATG2 as "codeListAgencyNameText" is known in UBL as "codeListAgencyName". The same applies to...
<xsd:attribute name="binaryObjectFormatText" type="xsd:token" use="optional">
<xsd:attribute name="binaryObjectFilenameText" type="xsd:token" use="optional">
<xsd:attribute name="codeListNameText" type="xsd:token" use="optional">
<xsd:attribute name="codeNameText" type="xsd:token" use="optional">
<xsd:attribute name="dateTimeFormatText" type="xsd:token" use="optional">
<xsd:attribute name="identificationSchemeNameText" type="xsd:token" use="optional">
<xsd:attribute name="identificatonSchemeAgencyNameText" type="xsd:token" use="optional">
<xsd:attribute name="indicatorFormatText" type="xsd:token" use="optional">
<xsd:attribute name="numericFormatText" type="xsd:token" use="optional">
<xsd:attribute name="quantityUnitCodeListAgencyNameText" type="xsd:token" use="optional">
- and also to all annotation/documentation attributes that are represented by text fields.
Once again, because this is a regular rule, these attribute names could be mapped between UBL and ATG2.
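
As a rough sketch of how regular the rule is, the ATG2-to-UBL direction amounts to dropping a trailing "Text". The function name atg2_to_ubl_text_rule is hypothetical. It covers only this one rule; the Object Class truncation described under point 2 would still have to be applied separately.

```python
def atg2_to_ubl_text_rule(name: str) -> str:
    """Drop a trailing "Text" representation term from an attribute name.

    Hypothetical helper illustrating UBL rule 7(b) only; it does not
    perform the Object Class truncation described under point 2.
    """
    suffix = "Text"
    # Leave a bare "Text" alone so the result is never an empty name.
    if name.endswith(suffix) and len(name) > len(suffix):
        return name[: -len(suffix)]
    return name


print(atg2_to_ubl_text_rule("codeListAgencyNameText"))
print(atg2_to_ubl_text_rule("languageID"))
```

The reverse direction (UBL to ATG2) is not purely mechanical, since it needs to know which attributes have "Text" as their representation term.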

XSD Data Types
------------------
Not so easily mapped are the differences between the use of XSD data types for element and attribute values. There are three primary differences:

1. ATG2 is more restrictive than UBL for the values permitted in the following data types:
CodeTypes and IdentifierTypes are expressed as xsd:token. UBL has defined them as xsd:normalizedString.
TextTypes and NameTypes are expressed as xsd:token. UBL has defined them as xsd:string.
Just to remind myself (again) I looked up the definitions of these in the XSD specs (http://www.w3.org/TR/2001/REC-xmlschema-2-20010502/).
In XSD, a token is the set of strings that do not contain the line feed (#xA) nor tab (#x9) characters, that have no leading or trailing spaces (#x20) and that have no internal sequences of two or more spaces. A normalizedString is the set of strings that do not contain the carriage return (#xD), line feed (#xA) nor tab (#x9) characters. An xsd:token is derived from an xsd:normalizedString. A string is the set of finite-length sequences of characters. An xsd:normalizedString is derived from an xsd:string.
This means instances of UBL code, identifier, text and name values may not be legitimate for applications basing their received data on ATG2 data types.
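
To make the difference concrete, the definitions quoted above can be written as simple membership checks. This is a minimal sketch of the value-space definitions only, not a schema validator; the function names and the sample values are illustrative.

```python
def is_normalized_string(value: str) -> bool:
    """True if the value has no carriage return, line feed or tab."""
    return not any(ch in value for ch in "\r\n\t")


def is_token(value: str) -> bool:
    """True if the value also has no leading, trailing or doubled spaces.

    token is derived from normalizedString, so its restrictions are
    checked first.
    """
    return (
        is_normalized_string(value)
        and not value.startswith(" ")
        and not value.endswith(" ")
        and "  " not in value
    )


for sample in ["ABC", "AB ", "TITLE  WITH  GAPS", "two\nlines"]:
    print(repr(sample), is_normalized_string(sample), is_token(sample))
```

A space-padded value such as "AB " is inside the normalizedString value space but outside the token value space, which is the gap between the UBL and ATG2 definitions for codes and identifiers.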
As we discussed when this came up in UBL last year, the use of xsd:token also means that documents would not accept values such as "A SHADOW ON THE GLASS:VOL 1 A VIEW FROM THE MIRROR- IRVINE, IAN" or "VIRAGO BOOK OF SPIRITUALITY O- ANDERSON, SARAH" - both of which are real examples from a book publisher's EDI ordering system I worked on. I have also seen "- ALL ITEMS ON THIS OFFER TO PURCHASE TO BE SUPPLIED SUBJECT TO: CONDITIONS APPENDED TO THE EDI TRADING AGREEMENT BETWEEN US.:SHUTDOWN REPLACEMENTS:DELIVERY TIMES ARE 7:30AM TO 3:00PM, TUESDAY TO FRIDAY." used in the automotive industry. I am not sure why anyone would want to prevent this type of content.
As far as codes and identifiers go, UBL has also decided that preventing leading, trailing or duplicate embedded spaces is too prohibitive. Again, I don't have to look far to see where this won't work. If we take the example of Australian State Codes, these are 3 characters. New South Wales is "NSW" and Victoria is "VIC", but Western Australia is "WA ", not " WA" or "WA". Real application systems are built around the concept that spaces can be a legitimate part of a code or an identifier, so they will exchange these spaces wherever they appear in the data.
The principle of trying to enforce this type of content validation in a core component schema is bound to cause problems, which is why UBL has settled on xsd:normalizedString and xsd:string. It is a similar argument to why we err on the side of making BIEs optional rather than mandatory - let the customization that comes with implementation define things like presentational content - not the core schemas.
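
One way the restriction shows up in practice: xsd:token carries the whiteSpace facet value "collapse", so a schema-aware processor normalizes the value before it reaches the application, and any padding is lost. The sketch below emulates that normalization in Python; collapse_whitespace is a hypothetical name, and the sketch stands in for what an XSD processor does rather than calling one.

```python
import re


def collapse_whitespace(value: str) -> str:
    """Emulate the XSD whiteSpace="collapse" facet that applies to xsd:token."""
    replaced = re.sub(r"[\t\n\r]", " ", value)  # "replace" step: tab/LF/CR -> space
    squeezed = re.sub(r" {2,}", " ", replaced)  # runs of spaces become one space
    return squeezed.strip(" ")                  # drop leading/trailing spaces


# A 3-character, space-padded state code loses its padding.
print(repr(collapse_whitespace("WA ")))
print(repr(collapse_whitespace("NSW")))
```

A receiving system that keys on the 3-character form "WA " would then have to re-pad the value itself, which is exactly the kind of presentational concern the paragraph above argues should be left out of the core schemas.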

2. NumericType is a complexType whose values are expressed as xsd:decimal with an extended attribute for numericFormatText. UBL defines NumericType as a simpleType using xsd:decimal.
This means instances of ATG2 numeric values may have formatting instructions that UBL-based applications do not expect (even though the ATG2 comment discourages using this attribute).

3. UBL uses more built-in XSD data types.
In UBL we have the following rules:
[GXS3] Built-in XSD Simple Types SHOULD be used wherever possible.
[CTD7] For every ccts:CCT whose supplementary components are not equivalent to the properties of a built-in xsd:datatype, the ccts:CCT MUST be defined as a named xsd:complexType in the ccts:CCT schema module.
[CTD10] Each CCT:SupplementaryComponent xsd:attribute "type" MUST define the specific xsd:built-in Datatype or the user defined xsd:simpleType for the ccts:SupplementaryComponent of the ccts:CCT.
This means:
In UBL, DateTimeType is a simpleType expressed as an xsd:dateTime. In ATG2 schemas, DateTimeType is a complexType expressed as an xsd:string with an extended attribute for dateTimeFormatText.
In UBL, IndicatorType is a simpleType expressed as an xsd:boolean. In ATG2 schemas, IndicatorType is a complexType expressed as an xsd:string with an extended attribute for indicatorFormatText.
This means instances of ATG2 date time or indicator values may be formatted in a way that UBL-based applications do not expect. They may also have additional formatting instructions.
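
As a rough illustration of the indicator case, a UBL-based application expecting xsd:boolean would only recognize the four lexical forms that XSD defines for it. The value "Y" below is a hypothetical example of what a string-typed indicator could carry; it is not taken from the ATG2 schema.

```python
# The lexical forms XSD allows for xsd:boolean.
XSD_BOOLEAN_LEXICAL = {"true", "false", "1", "0"}


def is_xsd_boolean(value: str) -> bool:
    """True if the value is a legal xsd:boolean literal.

    xsd:boolean collapses whitespace, so surrounding XML whitespace
    characters are stripped before the comparison.
    """
    return value.strip(" \t\r\n") in XSD_BOOLEAN_LEXICAL


print(is_xsd_boolean("true"))  # legal in both representations
print(is_xsd_boolean("Y"))     # hypothetical string-typed indicator value
```

Any mapping from an ATG2 instance to a UBL one would therefore need an explicit conversion step for indicator values that are valid as xsd:string but not as xsd:boolean.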

Overall I see this as coming down to two issues...
1. The majority of concerns come down to a different set of NDRs. Do we want ATG2 to amend their rules to fit with UBL's? Presumably we have had as much (if not more) input into the ATG2 rules as anyone. Therefore, ATG2 has deliberated and chosen different rules to UBL. I see little point in re-submitting our rules to them again.
2. The remaining issues relate to choice of XSD datatypes. UBL has a proposal to align our use of datatypes with OAGIS 9.0.

OAG consider the built-in XSD type, "normalizedString", for all code, identifier and text components (where there is no specific built-in type, such as "language").
UBL consider the built-in XSD type, "normalizedString", for all text components (where there is no specific built-in type, such as "language").

This feels to me like a more appropriate strategy than using xsd:token everywhere and could be accommodated into UBL 1.1 with some caution for existing document instances.

-- 
regards
tim mcgrath
phone: +618 93352228  
postal: po box 1289   fremantle    western australia 6160