Subject: [ubl-ndrsc] Containership Proposal
Folks:

As per the discussion last Wednesday, here is a brief write-up of my arguments regarding containership.

Cheers,
Arofan

_____________ UBL Release Op70 and Containership ______________________________

Overview:

In the discussions about containership, a decision was made to wait until the Op70 release to see how the "normalization" of the LCSC modelling activities would translate into XML structures before making any decision about containership. Generally speaking, the resulting XML has produced a satisfactory level of containership. There are two areas where there are problems, however: at the very top level, among the children of document elements (Order, etc.); and in those cases where a child element can be repeated many times, producing a "list" of like elements.

These two cases are examined primarily in terms of their effect on XML processing, and whether they will prove sub-optimal from the point of view of XML processing with common tools and technologies. This argument also considers how easily the XML structures in these cases can be understood, and whether their usability might be enhanced by the existence of additional containers. The issue of whether these containers represent semantic constructs is left open for discussion, as there seems to be some disagreement on this point; it is assumed that this discussion will take place as the arguments presented here are considered.

Issues:

As currently structured, the immediate child elements of a UBL document are of two types: the "header" elements, appearing first in the document as a set of immediate children, and then a set of "item" elements, which in other vocabularies typically make up the "body" section of a document. This structuring is problematic for a number of reasons:

(1) Usability: It is easier to see the distinction between these two types of child elements if they are organized into two groups, a "header" and a "body".
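To make the two candidate shapes concrete, here is a minimal sketch in Python using the standard library's ElementTree. The element names (BuyerParty, SellerParty, LineItem, OrderHeader, OrderBody) are illustrative placeholders, not the actual Op70 names:

```python
import xml.etree.ElementTree as ET

# Current Op70 shape: header-type and item-type elements are siblings,
# all immediate children of the document element.
flat = ET.fromstring(
    "<Order>"
    "<BuyerParty/><SellerParty/>"
    "<LineItem/><LineItem/>"
    "</Order>"
)

# Proposed shape: two containers separate the header group from the body group.
contained = ET.fromstring(
    "<Order>"
    "<OrderHeader><BuyerParty/><SellerParty/></OrderHeader>"
    "<OrderBody><LineItem/><LineItem/></OrderBody>"
    "</Order>"
)

print([child.tag for child in flat])       # header and item elements mixed at one level
print([child.tag for child in contained])  # just two containers at the top level
```

The flat form prints all four element names at the top level; the contained form prints only the two container names, making the header/body distinction visible at a glance.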
Even if this is merely the result of traditional, presentation-based structuring of vocabularies, it is still the case that many developers (and other users) will find a document-level element broken out into two sections, header and body, easier to work with. This is not our primary argument here, but, as we will see below, it becomes more important when we look at the use of extensions.

(2) DOM Processing Efficiency: Because many common XML tools (notably XSLT and XSL-FO processors) use DOM structures to represent XML in memory, we need to look at how well optimized the existing structures are for this type of processing. When a specific element is selected from a DOM representation, the nodes of the DOM tree must be examined to find the desired node or nodes, often without recourse to the XML schema itself. This means that the processor must examine each immediate child of the root node, select those that match the selection criteria, then examine the immediate children of the matching nodes, and so on down the tree, until all matching nodes have been found.

With the existing Op70 structures, this is potentially a problem, particularly with large documents or large stylesheets. If I want to select an item-type element from the body, I must examine a handful of "header" elements before finding the matches in the "body" section below. This is not ideal, but it is not necessarily a problem, because there are not many header-type elements. The reverse case, however, is more problematic: if I wish to select a header-type element from a document with 200 items, I must examine not only each of the relatively few header elements, but also each of the 200 item elements. When the number of potential selections in an XSLT stylesheet is considered, it becomes clear that we may have a problem.
By comparison, the existence of containers for the header and body elements would allow the processor to examine far fewer children (two at the document level, and then at most the handful of header elements at the next level). To look briefly at the arithmetic: in the existing structures, given an instance with 7 header elements and 200 items, selecting a header element requires examining all 207 immediate children of the document element (plus however many nodes exist as children of each matched node). With header and body containers, the first selection examines 2 nodes, and then the 7 nodes inside the header container, for a total of 9 nodes examined.

While this cost will clearly vary with the number of items in the document instance, do we really want to design document structures that perform well only with small instances? There is no performance downside to adding a level of containership here, and only a very minor increase in the memory required to store the DOM tree being processed.

These same processing inefficiencies will exist with any element structure whose immediate children have cardinalities such as 1..n or 0..n.
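The arithmetic above can be checked with a short sketch, again using hypothetical element names and Python's standard-library ElementTree. The helper counts every immediate child inspected while resolving a child-axis selection, the way a schema-unaware DOM processor must:

```python
import xml.etree.ElementTree as ET

def children_examined(root, path):
    """Count immediate children inspected while resolving a child-axis path."""
    examined = 0
    nodes = [root]
    for step in path:
        matched = []
        for node in nodes:
            for child in node:      # every sibling must be looked at
                examined += 1
                if child.tag == step:
                    matched.append(child)
        nodes = matched
    return examined

# Build the two shapes: 7 header-type elements, 200 item-type elements.
headers = "".join(f"<Header{i}/>" for i in range(7))
items = "<LineItem/>" * 200

flat = ET.fromstring(f"<Order>{headers}{items}</Order>")
contained = ET.fromstring(
    f"<Order><OrderHeader>{headers}</OrderHeader>"
    f"<OrderBody>{items}</OrderBody></Order>"
)

print(children_examined(flat, ["Header3"]))                      # 207: every child scanned
print(children_examined(contained, ["OrderHeader", "Header3"]))  # 9: 2 containers + 7 header children
```

Selecting one header element in the flat shape touches all 207 siblings; in the contained shape it touches the 2 containers and then the 7 header children, matching the 207-versus-9 figures in the argument.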