Schema Design Rules for UBL...and Maybe for You

Keywords: UBL, XML, W3C XML Schema, Schema, ebXML, EDI, B2B

Eve Maler
XML Standards Architect
Sun Microsystems, Inc.
Web Technologies and Standards
Burlington
Massachusetts
USA
eve.maler@sun.com
http://www.sun.com/xml

Biography

Eve Maler is an XML Standards Architect in Sun's Web Technologies and Standards group. She currently specializes in developing standards and vocabularies in the areas of XML web services security and B2B and promoting standards adoption both in the industry at large and in Sun products and programs.

Eve was a charter member of the World Wide Web Consortium working group that created XML, and for two years coordinated all of Sun's W3C activities. Eve co-edited the second edition of the XML 1.0 Recommendation, the Note defining the XML Pipeline Definition Language, and several other specifications published by W3C. She is chair of the Schema Naming and Design Rules subgroup of the OASIS UBL (Universal Business Language) effort. Eve also co-founded, formerly chaired, and is the coordinating editor for the OASIS SAML (Security Assertion Markup Language) committee.

Eve co-authored Developing SGML DTDs: From Text to Model to Markup, a book that is unique in providing a targeted methodology for DTD design. She also served for several years as maintainer of the popular DocBook DTD for software documentation.


Abstract


The OASIS UBL (Universal Business Language) effort has several interesting goals and constraints that must be taken into account in the structuring of the UBL Library schemas. This paper discusses some of the major rules developed for the design of the schemas: UBL's connection to the UN/CEFACT–ebXML Core Component Technical Specification, its choice of options for element and datatype definitions, and its solution for reusable code lists. These rules are presented in the hope that they may be found useful to others embarking on an effort to define a standard XML vocabulary.


Table of Contents


Introduction
UBL Overview
     UBL in a Nutshell
     UBL's Relationship to EDI and ebXML
     The UBL Schema Problem Space
UBL's Connection to Core Components
     Core Component and ISO/IEC 11179 Concepts
     Mapping UBL Components to Core Components
     Embedded Documentation
Element and Type Definition Style
     Common Styles of Schema Organization
         Russian Doll
         Salami Slice
         Venetian Blind
         Garden of Eden
     UBL's Choice of Style
Rules for Code List Reuse
     The Code List Problem Space
     The UBL Solution
Conclusion
Footnotes
Acknowledgements
Bibliography

Introduction

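This paper discusses some of the major schema design decisions made in the OASIS UBL effort. First, UBL and its particular requirements are introduced. Then three major areas of UBL schema design are presented: UBL's connection to Core Components, its element and type definition style, and its rules for code list reuse.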

UBL Overview

This section introduces UBL, puts it into the context of the ebXML electronic business initiative and traditional EDI, and discusses UBL's schema requirements.

UBL in a Nutshell

UBL is an emerging standard for XML document formats that encode business messages, such as purchase orders and invoices, and the common building blocks of such messages. The scope of UBL is B2B communication across all industry sectors and domains for all types of organizations, including small and medium-size enterprises.

UBL has a goal to reach acceptance as a legal standard for international trade, and it also has a commitment to remain royalty-free.

The UBL Library is grounded in the Core Component semantics developed as part of the ebXML initiative for electronic business, though it is not a deliverable of that initiative. It is being developed under the auspices of OASIS, and the UBL Technical Committee has formed liaisons with several major horizontal and vertical efforts that have been working towards similar ends.

UBL's first phase will deliver a component library and a carefully selected set of standard B2B document types in XSD (W3C XML Schema) form. The deliverables will also include a set of design rules governing the construction of these schemas and a simple version of a methodology for customizing the schemas to meet the needs of different trading communities.

The work of the UBL Technical Committee is broken down by SC (subcommittee):

In the year since the committee began its work, it has made a great deal of progress. Two public drafts of the UBL Library have been made available for review, and a third major public draft is expected near the time of the XML 2002 conference. For information on the status of UBL, downloadable copies of its draft deliverables, and a publicly archived record of the committee's email discussions, see the UBL website http://www.oasis-open.org/committees/ubl/.

UBL's Relationship to EDI and ebXML

Traditional EDI (Electronic Data Interchange) technology includes aspects of both infrastructure and business semantics.

[Figure: 05-01-02-fig01.gif]

The EDI mechanism exhibits some well-known pressure points. For example, point-to-point VANs (Value-Added Networks) are expensive and cumbersome. Also, the standard EDI data formats that convey the business document payloads are defined as a superset of everyone's needs, requiring subtractive contextualization (customization for particular business contexts) with an infinite number of possible variations. Finally, these special contexts are defined in MIGs (Message Implementation Guides), which use narrative text (a "soft" mechanism that is not machine-readable) to convey their intent.

The goal of the ebXML initiative (http://www.ebxml.org) has been to enable enterprises of any size, anywhere, to find each other electronically and conduct business by exchanging XML messages. It has developed a set of modular specifications for a B2B architecture that supports these ambitious goals.

[Figure: 05-01-02-fig02.gif]

The infrastructure specifications include "modern" versions of existing EDI architecture layers, such as the SOAP-friendly ebXML Message Service. Notably, they also include a new Registry/Repository component, which allows for submission, query, and retrieval of all types of e-business artifacts using rich metadata.

The payload specifications are syntax-neutral and abstract, and somewhat less mature than the others. The UN/CEFACT–ebXML Core Component specification is a system for creating idealized, business-context-free models for business information that can be mapped to traditional EDI syntax, XML syntax, or some other syntax entirely. The Context Methodology, which is part of the Core Component work, is an initial attempt to define formally describable methods for customizing and assembling components. Core Component development continues, as does further enhancement of the other parts of ebXML.

The UBL effort represents an explicit attempt to concretize the Core Components by mapping them to XML — specifically, an XSD representation — and to solve contextualization problems in an XSD environment.

The UBL Schema Problem Space

The schemas that define the UBL Library need to meet a number of interesting requirements:

This paper presents some highlights of the rules developed for the UBL Library that support these needs.

UBL's Connection to Core Components

UBL is committed to using the UN/CEFACT–ebXML Core Components as a semantic substrate. For this commitment to have significance, it is important for the UBL Library components to be expressed in a way that links back to the Core Components on which they are based. This section describes important Core Component concepts and how UBL realizes them in XML form.

Core Component and ISO/IEC 11179 Concepts

The UN/CEFACT–ebXML Core Components Technical Specification [CCTS] provides a framework for standardizing business information semantics in a flexible and yet interoperable way. The latest version is 1.85, released in September 2002. Additional supplemental documents will define actual catalogues of Core Components and their definitions.

Following are some key concepts for understanding the Core Component framework:

CC (core component)
A building block for the creation of a semantically correct and meaningful information exchange package. It contains only the information pieces necessary to describe a specific concept. An example of a CC might be a collection of information about an address.
Business context
The formal description of a specific business circumstance as identified by the values of a set of context categories, allowing different business circumstances to be uniquely distinguished. Examples of context categories are business process and geopolitical region; there are eight prescribed context categories in all.
BIE (business information entity)
A piece of business data or a group of pieces of business data with a unique business semantic definition. In other words, a BIE is a CC to which a particular business context has been applied. An example of a BIE might be a collection of information about an address in the U.S., where the geopolitical region is (one part of) the business context.
Basic vs. aggregate
A basic CC or BIE is one that constitutes a singular piece of business information. By contrast, an aggregate CC or BIE is a collection of several related pieces of business information. For example, an address would be aggregate, whereas a street name would be basic.
CCT (core component type)
A special low-level construct that carries the actual content of the business message. A CCT has a content component, plus a series of supplementary components that give essential extra meaning. An example of a CCT is an amount, where the content component is a number such as "12" and the essential supplementary component is a unit such as "Euro". There are ten prescribed kinds of CCTs in all.
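
To make the CCT idea concrete, here is a minimal sketch of how an amount might be rendered in XSD: the content component becomes the element content, and the supplementary component becomes an attribute. This is only an illustration under stated assumptions. The namespace URI, the AmountType name, and the currencyID attribute are hypothetical, and the sketch is not the UBL Library's own CCT module.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:cct="http://www.example.com/cct-sketch"
           targetNamespace="http://www.example.com/cct-sketch"
           elementFormDefault="qualified">

  <!-- Hypothetical rendering of the "amount" CCT -->
  <xs:complexType name="AmountType">
    <xs:simpleContent>
      <!-- Content component: the number itself, such as 12 -->
      <xs:extension base="xs:decimal">
        <!-- Supplementary component: the unit that gives the number its meaning -->
        <xs:attribute name="currencyID" type="xs:token" use="required"/>
      </xs:extension>
    </xs:simpleContent>
  </xs:complexType>

  <xs:element name="Amount" type="cct:AmountType"/>

</xs:schema>
```

Under these assumptions, an instance would read <cct:Amount currencyID="EUR">12</cct:Amount>, keeping the value and its essential qualifier together in one construct.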

The CC framework itself is based on the following concepts taken from the ISO/IEC 11179 [11179] specification for data dictionaries:

Object class term
The name of a logical data grouping.
Property term
The name of one characteristic or aspect of an object class. Sometimes the property term is the same as the representation term.
Representation term
The name of the set of valid values for a property. At the basic level, the set of CCTs provide a closed set of representations. At the aggregate level, it is possible to define any number of representations.
Qualifier term
One or more words that help define and differentiate a property term or representation term.

These concepts are used for unambiguously naming items in a data dictionary; there are also rules for defining items in a semantically clear and unique way. CCs and BIEs all have unique names based on this system, which are discussed further below.

CCs and BIEs can be seen both as objects in their own right and also as properties of other objects (higher logical groupings). In the case where an aggregate CC or BIE is useful as a property of another aggregate, an association CC or BIE must exist to document the relationship between the two.

Following is an example taken from the Core Components Technical Specification that shows all of these concepts:

[Figure: 05-01-02-fig03.gif]

Here, a person and an address are aggregate CCs; a Name, Birth, Street, Post Code, Town, and Country are basic CCs, each serving as properties of the aggregates; and the notions of a person's official address and residence are association CCs.

Names are assigned in this pattern:

[Figure: 05-01-02-fig04.gif]

Therefore, the names of our example components would be:

Component Name                    | CC Classification | ISO/IEC 11179 Classification and Description
--------------------------------- | ----------------- | --------------------------------------------
Person. Details                   | Aggregate         | Object class
Person. Name. Text                | Basic             | Property of Person. Details, represented as text
Person. Birth. Date               | Basic             | Property of Person. Details, represented as a date
Person. Residence. Address        | Association       | Property of Person. Details, represented as an address
Person. Official Address. Address | Association       | Property of Person. Details, represented as an address
Address. Details                  | Aggregate         | Object class that happens to be associated with the two association CCs listed in this table
Address. Street. Text             | Basic             | Property of Address. Details, represented as text
Address. Post Code. Text          | Basic             | Property of Address. Details, represented as text
Address. Town. Text               | Basic             | Property of Address. Details, represented as text
Address. Country. Identifier      | Basic             | Property of Address. Details, represented as an identifier
Text. Type                        | CCT               | Representation term that happens to be associated with several properties listed in this table
Date Time. Type                   | CCT               | Representation term that happens to be associated with one property listed in this table
Identifier. Type                  | CCT               | Representation term that happens to be associated with one property listed in this table

Table 1

Mapping UBL Components to Core Components

The UBL Library, in layering on top of the CC system, does two things at once: It adds business context and it maps syntax-neutral constructs to a real XML syntax.

It is impossible to use CC-based models without adding business context and therefore turning the CCs into BIEs. The UBL Library does add its own measure of business context, though it tries to add as little as possible so as to remain useful as a base for all trading communities. Most often UBL reflects a particular business process but not, for example, a particular geopolitical region or industry classification; these would be left to other trading communities to define as customizations that add more layers of business context. But regardless of how generic the context may be in the UBL Library prior to any customizations, the effect is that UBL deals only in BIEs, and we can now turn away from the awkward "CC/BIE" language in this paper.

UBL also adds syntax, specifically XML syntax as defined in XSD form, but this syntax is added as late as possible in the process. When the UBL Library Content subcommittee does its modeling work, it records the results in a spreadsheet, where each row defines an object class or property in a form that is as syntax-neutral as possible. The spreadsheet is thus a fairly abstract data dictionary. Following is a stylized excerpt:

[Figure: 05-01-02-fig05.gif]

While one of the UBL Library's schema modules is handcrafted to reflect a tuned XSD version of the CCTs with a rich set of complex and simple types, a Perl script is used to generate most of the other modules and all of their element and type definitions from the spreadsheet. The UBL design rules governing this process are as follows:

The names of the elements and types are not faithful reproductions of the ISO/IEC 11179 naming scheme; some truncation is applied so that elements are not tightly bound to each distinct parent element environment and to maintain brevity and clarity. Here are the major naming rules:

Here are the names of the XML constructs for the above example components. The CCTs have been left out of the table because their definition and naming are still in flux.

BIE Name                          | XML Name                 | Remarks
--------------------------------- | ------------------------ | -------
Person. Details                   | PersonType complex type  | All periods and spaces removed; "Details" replaced with "Type"
Person. Name. Text                | Name element             | "Text" elided because it is the default representation term; "Person" elided because the parent element's type indicates the object class unambiguously
Person. Birth. Date               | BirthDate element        |
Person. Residence. Address        | ResidenceAddress element |
Person. Official Address. Address | OfficialAddress element  | The repetition of "Address" as a property term and representation term has been collapsed, with only the latter remaining
Address. Details                  | AddressType complex type |
Address. Street. Text             | Street element           |
Address. Post Code. Text          | PostCode element         |
Address. Town. Text               | Town element             |
Address. Country. Identifier      | CountryID element        | "Identifier" shortened to "ID"

Table 2

Even though these rules result in fewer unique names in the XSD version than in the original dictionary, unique semantic definitions and software processing expectations can still be attached to all of the dictionary entries, and elements can therefore be neatly reused. Furthermore, additional entries can be created for other specific XPath-addressable XML environments, allowing for full specification of the intent behind every field in a business message.
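
As a rough illustration of where the names in Table 2 end up, the following sketch places them in a small schema. Several things here are assumptions made only for the sake of a compact example: the namespace URI, the use of built-in XSD types (xs:string, xs:date, xs:token) as stand-ins for CCT-based types, the default cardinalities, and the choice to declare the elements locally. It is not output of the UBL generation script.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:ex="http://www.example.com/naming-sketch"
           targetNamespace="http://www.example.com/naming-sketch"
           elementFormDefault="qualified">

  <!-- Person. Details -->
  <xs:complexType name="PersonType">
    <xs:sequence>
      <!-- Person. Name. Text: object class and default "Text" elided -->
      <xs:element name="Name" type="xs:string"/>
      <!-- Person. Birth. Date -->
      <xs:element name="BirthDate" type="xs:date"/>
      <!-- Person. Residence. Address -->
      <xs:element name="ResidenceAddress" type="ex:AddressType"/>
      <!-- Person. Official Address. Address: repeated "Address" collapsed -->
      <xs:element name="OfficialAddress" type="ex:AddressType"/>
    </xs:sequence>
  </xs:complexType>

  <!-- Address. Details -->
  <xs:complexType name="AddressType">
    <xs:sequence>
      <xs:element name="Street" type="xs:string"/>
      <xs:element name="PostCode" type="xs:string"/>
      <xs:element name="Town" type="xs:string"/>
      <!-- Address. Country. Identifier: "Identifier" shortened to "ID" -->
      <xs:element name="CountryID" type="xs:token"/>
    </xs:sequence>
  </xs:complexType>

</xs:schema>
```

Note how AddressType is defined once and serves both address properties of a person, which is the kind of reuse the truncated names are meant to allow.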

Embedded Documentation

The schemas that make up the UBL Library are, as already mentioned, mostly generated rather than handcrafted. The large amount of metadata associated with each resulting schema construct is not lost. Rather, it is provided in the form of xs:documentation elements containing particular XHTML elements with special class attribute values.

Following is an excerpted sample of an XSD construct, showing the transfer of many of the spreadsheet columns into schema documentation:

<xs:element name="Address" type="ubl:AddressType" id="UBL000007" minOccurs="0">
  <xs:annotation>
    <xs:documentation>
      <xhtml:div class="UBL_Definition">
        <xhtml:p>the particulars that identify and locate
                 the place of a particular party.</xhtml:p>
      </xhtml:div>
      <xhtml:div class="BIE_Dictionary_Entry_Name">
        <xhtml:p>Party. Address. Address</xhtml:p>
      </xhtml:div>
      ...
    </xs:documentation>
  </xs:annotation>
</xs:element>

Element and Type Definition Style

UBL's requirements for reusability, customization, and friendliness to XML software development have had an impact on its design rules for element declarations and type definitions. This section describes the major choices UBL has made in this area.

Common Styles of Schema Organization

XSD offers several ways of organizing element declarations and type definitions. Many of the differences hinge on whether these constructs are made available for reuse (for example, assembly of existing elements into new content model configurations) and modification (for example, specialization of existing types through the addition or subtraction of content model features) by schema modules that import or include the module in which the constructs are declared.

There is no such thing as an "anonymous" element; every element must ultimately be assigned a name, because the appearance of these names in tags is required by the XML Recommendation:

XSD:
<xs:element name="Recipe" ... />

Instance:
<Recipe>...</Recipe>

However, a name alone does not make the element's declaration reusable by other schema modules; the element must be both qualified with a namespace and globally declared (as a direct child of the xs:schema element) in order for it to be reusable in other schemas:

Inner XSD:
<xs:schema
  targetNamespace="http://www.example.com/recipe"
  elementFormDefault="qualified"
  ...>
...
  <xs:element name="Recipe" ... />
...
</xs:schema>

Outer XSD:
<xs:schema
  targetNamespace="http://www.example.com/magazine"
  ...>
  <xs:import namespace="http://www.example.com/recipe" ... />
...
</xs:schema>

Instance conforming to outer XSD:
...
<r:Recipe
  xmlns:r="http://www.example.com/recipe">
  ...
</r:Recipe>
...

If the element is declared locally (inside an xs:complexType element), it cannot be reused whether or not it is namespace-qualified, except indirectly by means of binding the whole type in which it is declared to an element in the outer schema:

Inner XSD:
<xs:complexType name="RecipeType">
  <xs:sequence>
    <xs:element name="Ingredient" ... />
  </xs:sequence>
</xs:complexType>

Outer XSD:
<xs:schema
  targetNamespace="http://www.example.com/magazine"
  xmlns:m="http://www.example.com/magazine"
  xmlns:r="http://www.example.com/recipe"
  elementFormDefault="qualified"
  ...>
  <xs:import namespace="http://www.example.com/recipe" ... />
  <xs:element name="MyRecipe" type="r:RecipeType" ... />
...
</xs:schema>

Instance:
<m:MyRecipe>
  <r:Ingredient>...</r:Ingredient>
</m:MyRecipe>

If the element is declared globally but is namespace-unqualified, it can be reused in other schema modules only by inclusion, in what is nearly a cut-and-paste fashion because the components are no longer recognizable as their original "selves."
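
The inclusion case just described can be sketched as follows. The file name recipe-nons.xsd, the Issue element, and the simplified xs:string content are assumptions chosen for illustration; the namespace URIs follow the earlier examples.

Inner XSD (hypothetical file recipe-nons.xsd, with no target namespace):

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- Globally declared, but not in any namespace -->
  <xs:element name="Recipe" type="xs:string"/>
</xs:schema>
```

Outer XSD:

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:m="http://www.example.com/magazine"
           targetNamespace="http://www.example.com/magazine"
           elementFormDefault="qualified">

  <!-- xs:include, not xs:import: the included declarations take on
       the target namespace of the including schema -->
  <xs:include schemaLocation="recipe-nons.xsd"/>

  <xs:element name="Issue">
    <xs:complexType>
      <xs:sequence>
        <!-- Recipe is now referenced as a magazine-namespace element -->
        <xs:element ref="m:Recipe" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

</xs:schema>
```

In an instance, the element appears as m:Recipe, so nothing identifies it as having come from the original recipe module.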

The situation is different with types, which can be anonymous (unnamed). Anonymous types are not reusable in other schema modules and are not even reusable in the same module in which they were defined; their use is limited to the one element in whose declaration they appear. Named types are fully reusable; just like reusable elements, they must be defined globally (as a direct child of the xs:schema element).

<xs:schema
  ...>
  <xs:complexType name="ReusableType">...</xs:complexType>
  <xs:element name="Element1OfReusableType" type="ReusableType" />
  <xs:element name="Element2OfReusableType" type="ReusableType" />
  <xs:element name="ElementOfAnonymousType">
    <xs:complexType>
      ...
    </xs:complexType>
  </xs:element>
</xs:schema>

The following sections describe the common mixtures of all these possibilities with a brief analysis of their different effects on the reusability of schema components. Most of these schema styles were first documented by Roger Costello in his "XML Schema: Best Practices" initiative (http://www.xfront.com/BestPracticesHomepage.html). All examples below assume an XML instance structure similar to the following:

<Recipe>
  <ID>...</ID>
  <Ingredient>
    <ID>...</ID>
    <Amount>...</Amount>
    <Name>...</Name>
  </Ingredient>
  <Ingredient>...</Ingredient>
  <Step>
    <ID>...</ID>
    <Description>...</Description>
  </Step>
  <Step>...</Step>
</Recipe>

Russian Doll

In the Russian Doll style, element declarations are nested progressively more deeply inside anonymous type definitions like the famous "matryoshka" nesting dolls, with most elements (all, if there is no top-level element) therefore declared locally:

<xs:schema
  ...>
  <xs:element name="Recipe">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="ID" type="cct:IDType" />
        <xs:element name="Ingredient" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="ID" type="cct:IDType" />
              <xs:element name="Amount" type="xs:string" />
              <xs:element name="Name" type="xs:string" />
            </xs:sequence>
          </xs:complexType>
        </xs:element>
        <xs:element name="Step" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="ID" type="cct:IDType" />
              <xs:element name="Description" type="xs:string" />
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>

The top-level element is typically namespace-qualified because a target namespace is supplied. The locally declared elements may be namespace-qualified or namespace-unqualified, but even if they are qualified they are not reusable.

Salami Slice

In the Salami Slice style, types are declared anonymously as before, but elements are all declared globally, such that examining any one element declaration gives you a complete element-oriented "slice" of schema, reminiscent of DTDs:

<xs:schema
  ...>

  <xs:element name="ID" type="cct:IDType" />

  <xs:element name="Recipe">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="ID" />
        <xs:element ref="Ingredient" maxOccurs="unbounded" />
        <xs:element ref="Step" maxOccurs="unbounded" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  <xs:element name="Ingredient">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="ID" />
        <xs:element ref="Amount" />
        <xs:element ref="Name" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  <xs:element name="Amount" type="xs:string" />

  <xs:element name="Name" type="xs:string" />

  <xs:element name="Step">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="ID" />
        <xs:element ref="Description" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  <xs:element name="Description" type="xs:string" />

</xs:schema>

The elements are typically namespace-qualified because a target namespace is supplied. Note that in Salami Slice there is only one ID element that is reused in several places in the same schema module, in contrast to Russian Doll where several identical local ID elements are declared.

Venetian Blind

The Venetian Blind style is the reverse of the Salami Slice style. The types are named and therefore global, but most of the elements (all, if there is no top-level element) are local:

<xs:schema
  ...>
  <xs:element name="Recipe" type="RecipeType" />

  <xs:complexType name="RecipeType">
    <xs:sequence>
      <xs:element name="ID" type="cct:IDType" />
      <xs:element name="Ingredient" type="IngredientType" maxOccurs="unbounded" />
      <xs:element name="Step" type="StepType" maxOccurs="unbounded" />
    </xs:sequence>
  </xs:complexType>

  <xs:complexType name="IngredientType">
    <xs:sequence>
      <xs:element name="ID" type="cct:IDType" />
      <xs:element name="Amount" type="xs:string" />
      <xs:element name="Name" type="xs:string" />
    </xs:sequence>
  </xs:complexType>

  <xs:complexType name="StepType">
    <xs:sequence>
      <xs:element name="ID" type="cct:IDType" />
      <xs:element name="Description" type="xs:string" />
    </xs:sequence>
  </xs:complexType>

</xs:schema>

As in the Russian Doll style, the top-level element is typically namespace-qualified because a target namespace is supplied, and the locally declared elements may be namespace-qualified or namespace-unqualified, but even if they are qualified they are not reusable. The name Venetian Blind reflects the fact that there are individual "slats" (like the salami "slices" except that examining any one type definition, not element declaration, gives you a complete slat) but also the fact that all the locally declared elements can be namespace-qualified or -unqualified (the slats can be "opened" or "closed") depending on the elementFormDefault attribute setting at the top of the schema or at the top of any outer schema.

Garden of Eden

The Garden of Eden style is the fourth logical style in this series, but has not been explored previously in the XML Schema Best Practices initiative. (The name, a reference to the biblical Adam and his act of naming all the cattle, birds, and beasts of the field, is the author's invention.) The elements are all globally declared, the types are named, content models always refer to global element declarations rather than providing the declarations in situ, and a target namespace is always set:

<xs:schema
  targetNamespace="http://www.example.com/recipe"
  ...>

  <xs:element name="Recipe" type="RecipeType" />

  <xs:complexType name="RecipeType">
    <xs:sequence>
      <xs:element ref="ID" />
      <xs:element ref="Ingredient" maxOccurs="unbounded" />
      <xs:element ref="Step" maxOccurs="unbounded" />
    </xs:sequence>
  </xs:complexType>

  <xs:element name="ID" type="cct:IDType" />

  <xs:element name="Ingredient" type="IngredientType" />

  <xs:complexType name="IngredientType">
    <xs:sequence>
      <xs:element ref="ID" />
      <xs:element ref="Amount" />
      <xs:element ref="Name" />
    </xs:sequence>
  </xs:complexType>

  <xs:element name="Amount" type="xs:string" />

  <xs:element name="Name" type="xs:string" />

  <xs:element name="Step" type="StepType" />

  <xs:complexType name="StepType">
    <xs:sequence>
      <xs:element ref="ID" />
      <xs:element ref="Description" />
    </xs:sequence>
  </xs:complexType>

  <xs:element name="Description" type="xs:string" />

</xs:schema>

In this style, both elements and types are guaranteed to be reusable in schema modules that import this one.

UBL's Choice of Style

Initially the UBL Naming and Design Rules subcommittee chose one flavor of the Venetian Blind style, in which types were global and qualified but most elements were local and were explicitly intended to be unqualified. Our reasoning at the time was that namespace-aware tools and knowledge were scarce.

However, we later reconsidered our decision and chose the Garden of Eden style instead, where both types and elements are global and qualified. Our reasoning for the change involved a better understanding of the scenarios for UBL Library reuse and customization, as well as an agreement that we were more concerned about the availability of type-aware tools than namespace-aware tools.

Following are our two basic reuse and customization use case scenarios:

  1. Specialization of UBL messages: This scenario assumes that an existing message is appropriate with regard to general business process and any other documented business context, but some of the context details need to be further specialized — for example, a global shoe manufacturers' association may want to extend order line items to account for shoe sizes and colors. The salient characteristic of this kind of reuse is the derivation of UBL types, since this is the basic tool for specialization provided by XSD. This scenario is what drives our choice to define named, global, reusable types that can be used as base types.
  2. Assembly of UBL components into new kinds of messages: This scenario assumes that the lower-level building blocks of UBL, such as addresses, are suitable for some business purpose (though specialization would of course also be allowed), but that a new kind of message — with a new configuration of building blocks — is required to support the overall process desired. The salient characteristic of this kind of reuse is the referencing of UBL elements in the assembled new message format, no matter whether the assembly method is static or dynamic. This scenario is what drives our choice to declare global, namespace-qualified elements.
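
The two scenarios can be sketched in a single customizing schema. This is an illustration under stated assumptions, not an excerpt from UBL or its customization methodology: the namespace URIs, the file name ubl-sketch.xsd, the ubl:OrderLineType base type, and every name in the shoe namespace are hypothetical, and ubl:Address is assumed to be globally declared as the Garden of Eden style requires.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:ubl="http://www.example.com/ubl"
           xmlns:shoe="http://www.example.com/shoe"
           targetNamespace="http://www.example.com/shoe"
           elementFormDefault="qualified">

  <xs:import namespace="http://www.example.com/ubl"
             schemaLocation="ubl-sketch.xsd"/>

  <!-- Scenario 1, specialization: derive from a named, global UBL type -->
  <xs:complexType name="ShoeOrderLineType">
    <xs:complexContent>
      <xs:extension base="ubl:OrderLineType">
        <xs:sequence>
          <xs:element ref="shoe:Size"/>
          <xs:element ref="shoe:Color"/>
        </xs:sequence>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>

  <xs:element name="ShoeOrderLine" type="shoe:ShoeOrderLineType"/>
  <xs:element name="Size" type="xs:string"/>
  <xs:element name="Color" type="xs:string"/>

  <!-- Scenario 2, assembly: reference a global, namespace-qualified
       UBL element directly in a new kind of message -->
  <xs:element name="DeliveryNotice" type="shoe:DeliveryNoticeType"/>

  <xs:complexType name="DeliveryNoticeType">
    <xs:sequence>
      <xs:element ref="ubl:Address"/>
      <xs:element ref="shoe:ShoeOrderLine" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>

</xs:schema>
```

In an instance of the assembled message, the address still appears as ubl:Address in the UBL namespace, so software that already understands the UBL element can recognize it inside the new format.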

It should be noted that with our flavor of the Venetian Blind style, assembly of UBL elements into new message formats is not impossible; typically the desired UBL type would be bound to an element in the outer schema's namespace. For example, if UBL's notion of "Address" were to be reused exactly as it is without change, the ubl:AddressType complex type would be bound to outer:Address. However, this method presents the strong possibility of unwanted side effects:

With the Garden of Eden style, the UBL Library offers a component library that is truly reusable for assembly and customizable for specialization. We can now be more explicit about the choice of mapping to XSD from the UBL abstract model that was described in Mapping UBL Components to Core Components (though the description below is an oversimplification of the process):

Rules for Code List Reuse

Business documents frequently use codes to convey information in a way that is semantically clear and very compact. For example, product orders might supply a color code to indicate which color of each product is desired, and addresses might supply a country code to indicate what country the address is located in. A code list is a kind of data dictionary writ small, so it is appropriate for UBL to combine the use of Core Component dictionaries with the use of code lists.

UBL wants to reuse existing code lists whenever possible, which has led to an exploration of ways to incorporate XSD schema modules that allow for such reuse. This section describes UBL's needs and its recommended solution.

The Code List Problem Space

There are several major producers and publishers of code lists, including, for example, ISO (the International Organization for Standardization) and UN/ECE (the United Nations Economic Commission for Europe). Often what is produced is much more than a flat list; it can be a complex hierarchical taxonomy or even a sophisticated graph-based ontology. This is particularly true in the case of product classifications, where the actual list of products might be the same from industry to industry (such as a set of chemicals), but the classifications layered on top of these products change radically depending on what kind of industry is involved (for example, a chemical manufacturer versus a purchaser in agribusiness). However, any one code compresses all of the associated meanings into a single value, and thus for any one code list there is a single valid set of values, even if each of the values has internal structure.

Many code lists are maintained simply as published lists of codes and definitions. Others have more robust electronic representations, for example in RDF or in the XML-based ebXML/OASIS Registry Information Model's [RIM] "classification scheme" structure. But in an environment where XSD is being used, a more schema-friendly representation is needed in addition, so that usage of codes can integrate properly with the rest of the definitions governing the business message format.

It may seem obvious how to handle code lists in XSD: create an enumeration of values in an xs:simpleType construct. This would certainly allow for a high degree of declarative validity checking, and it would work properly with the definitions governing the rest of the business information. However, the situation is more complicated than this:

The UBL Solution

The UBL Naming and Design Rules subcommittee examined several solutions from the perspective of semantic clarity, interoperability, external maintenance, validatability, friendliness to the UBL context methodology, upgradability, and readability. The winning solution, documented in the UBL Code List Rules [CL], works as follows:

  1. The code list producer first determines a set of essential identifying information to attach to the code list. Our rules for this identifying information are in flux because of recent changes to the Core Components Technical Specification, but it will include items such as an XML namespace name, a unique code list ID, a version number, an ID representing the code list maintenance agency, a URI identifying the normative electronic form of the code list, and so on.
  2. Then the code list producer writes an XSD schema module that conforms to a UBL-provided template. The module defines a complex type and associated simple types that encode and constrain both the code list values (to be represented as the content of an element, possibly with subelements but most likely as a string value) and the essential identifying information (to be represented as attributes on the element). Special embedded documentation of the type shown in Embedded Documentation must be included. Optionally, an element may be provided that binds the complex type to an element in the code list's XML namespace.
  3. The extent to which the schema module places syntactic constraints on the code list values is entirely up to the discretion of the code list producer, since maintenance and revision considerations come into play.
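
As a rough illustration of the steps above, a producer's code list module might look something like the following sketch. This is not the UBL-provided template: the namespace URI, the type and element names, the attribute names (codeListID, codeListVersionID, codeListAgencyID, codeListURI), and the three sample values are all assumptions chosen for illustration, and the required embedded documentation is omitted.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical code list module. Every name, URI, and value here is
     illustrative; none is taken from the UBL Code List Rules template. -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:cl="urn:example:codelist:country"
           targetNamespace="urn:example:codelist:country"
           elementFormDefault="qualified">

  <!-- Simple type constraining the code values. The producer decides how
       tight this is: an enumeration (as here, with a small sample of
       values) or a looser restriction such as a plain xs:token. -->
  <xs:simpleType name="CountryCodeContentType">
    <xs:restriction base="xs:token">
      <xs:enumeration value="BE"/>
      <xs:enumeration value="FR"/>
      <xs:enumeration value="US"/>
    </xs:restriction>
  </xs:simpleType>

  <!-- Complex type: the code value is the element content, and the
       identifying information travels as attributes. -->
  <xs:complexType name="CountryCodeType">
    <xs:simpleContent>
      <xs:extension base="cl:CountryCodeContentType">
        <xs:attribute name="codeListID" type="xs:token"/>
        <xs:attribute name="codeListVersionID" type="xs:token"/>
        <xs:attribute name="codeListAgencyID" type="xs:token"/>
        <xs:attribute name="codeListURI" type="xs:anyURI"/>
      </xs:extension>
    </xs:simpleContent>
  </xs:complexType>

  <!-- Optional: an element in the code list's own namespace bound to
       the complex type. -->
  <xs:element name="CountryCode" type="cl:CountryCodeType"/>

</xs:schema>
```

Keeping the value constraint in its own simple type means the producer can tighten or loosen it in a later revision without touching the complex type that consumers bind to.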

The UBL Library then binds the complex type to one of its own elements that is dedicated to holding only values from this one code list. The element is used as a mechanism for flexibility and extension inside a more generic element that represents a true basic BIE. For example:

<Address>
  ...

  <!-- outer code element; a basic BIE -->
  <CountryIdentificationCode>

    <!-- inner code element mapped to a foreign type -->
    <ISO3166CountryCode>BE</ISO3166CountryCode>
  </CountryIdentificationCode>

</Address>
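
On the schema side, the binding behind this instance might be declared roughly as follows. This is a sketch under stated assumptions rather than the actual UBL Library schema: it supposes a hypothetical producer module, CountryCodeList.xsd, that defines a complex type cl:CountryCodeType in the made-up namespace urn:example:codelist:country, and the UBL namespace URI and the CountryIdentificationCodeType name are likewise placeholders.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical fragment of a consuming library schema. Namespace URIs,
     the schemaLocation, and the type names are illustrative assumptions. -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:ubl="urn:example:ubl:library"
           xmlns:cl="urn:example:codelist:country"
           targetNamespace="urn:example:ubl:library"
           elementFormDefault="qualified">

  <!-- Pull in the externally maintained code list module. -->
  <xs:import namespace="urn:example:codelist:country"
             schemaLocation="CountryCodeList.xsd"/>

  <!-- Inner code element: declared in the library's namespace, but typed
       by the foreign complex type, so it holds only values from that
       one code list. -->
  <xs:element name="ISO3166CountryCode" type="cl:CountryCodeType"/>

  <!-- Outer code element (the basic BIE): a choice with a single branch
       today, which leaves room to add elements for other code lists. -->
  <xs:complexType name="CountryIdentificationCodeType">
    <xs:choice>
      <xs:element ref="ubl:ISO3166CountryCode"/>
    </xs:choice>
  </xs:complexType>

  <xs:element name="CountryIdentificationCode"
              type="ubl:CountryIdentificationCodeType"/>

</xs:schema>
```

The single-branch choice is one way to model the "flexibility and extension" role of the inner element: supporting an additional country code list would mean adding a second branch rather than changing the outer element.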

Because the rules apply mostly to major producers of code lists and not as much to the developers of UBL itself (because we are mostly code list consumers), we have attempted to make the rules as attractive as possible to implement, and hope to encourage conformance to them. The benefits might include:

These benefits amount to the creation of an XSD-based code list marketplace.

Conclusion

This paper has explored the major rules [NDR] governing UBL's schema design and their rationales. The common thread in these rules is the need to achieve a solution that is simultaneously intuitive, flexible, interoperable, and based on standardized semantics. At a time when much W3C XML Schema usage is still experimental in nature, particularly in the development of internationally standard XML vocabularies, the UBL Naming and Design Rules subcommittee has delved into many subtle issues involved in the art of "schemography." We hope the resulting rules may be helpful to others (who in turn, we hope, will help us by commenting on our work!).

This paper has not touched on some other important areas of UBL design for which we have developed rules, and still other areas will require attention as the UBL work continues and matures. Readers interested in UBL should make sure to follow its progress at http://www.oasis-open.org/committees/ubl/.

Footnotes

  1. The UN/ECE has done some work to define XML representations of some of its code lists. For more information, see http://www.unece.org/etrades/unedocs/repository/codelistintegration.htm.

  2. The term "schemographer" was coined by Murray Maloney in his work on SOX.

Acknowledgements

The author wishes to thank her fellow OASIS UBL Naming and Design Rules subcommittee members and former members — Bill Burcham, Mavis Cournane, Mark Crawford, Fabrice Desré, Arofan Gregory, Jessica Glace, Matthew Gertner, Michael Grimley, Eduardo Gutentag, Kris Ketels, Sue Probert, Mike Rawlins, Lisa Seaburg, Gunther Stuhec, and Paul Thorpe — for their dedication, knowledge, and skill in contributing to the rules described in this paper and to the author's understanding of the issues. The author would also like to thank Jon Bosak, chair of the OASIS UBL Technical Committee, for his plentiful support and encouragement.

Bibliography

[11179]
Information Technology — Metadata Registries. ISO/IEC 11179–1 through ISO/IEC 11179–6. International Organization for Standardization.
[CCTS]
Mark Crawford, et al. UN/CEFACT–ebXML Core Components Technical Specification. United Nations Centre for Trade Facilitation and Electronic Business (UN/CEFACT). Draft Version 1.85. 30-September-2002. http://xml.coverpages.org/CCTS-V1pt85-20020930.pdf.
[CL]
Eve Maler. OASIS UBL Code List Rules. OASIS. Document identifier wd-ublndrsc-codelist-nn. http://www.oasis-open.org/committees/ubl/ndrsc/.
[NDR]
Mavis Cournane, et al. OASIS UBL Naming and Design Rules. OASIS. Document identifier wd-ublndrsc-ndrdoc-nn. http://www.oasis-open.org/committees/ubl/ndrsc/.
[RIM]
OASIS/ebXML Registry Information Model v2.1. OASIS. June 2002. http://www.oasis-open.org/committees/regrep/documents/2.1/specs/ebrim_v2.1.pdf.