ubl-lcsc message

Subject: RE: [ubl-lcsc] Re: Position Paper on List Containers
From: Chin Chee-Kai <cheekai@softml.net>
To: A Gregory <agregory@aeon-llc.com>
Date: Mon, 1 Sep 2003 19:31:58 +0800 (SGT)
On Mon, 1 Sep 2003, A Gregory wrote:

>>Tony/Tim:
>>
>>Just a note on the performance issues related to list containers:
>>
>>List containers improve any DOM-based processing by reducing the number
>>of iterations needed to match items *other* than the items in the list,
>>by allowing an entire large list of children to be ignored using a
>>single test, on the container, rather than on each of its children.
>>
>>This was the original reason for arguing that we needed list containers,
>>way back when.


Thanks for showing this.

Ok, let's talk about that.  Was there some misunderstanding
or confusion in the quotation?  Otherwise, I'll just look
at what's presented here.  As the assumed environment of 
processing and variables such as:

 - large/small sized document frequency, 
 - number of recurring items is large or small,
 - number of places requiring proposed containers is large or small,
 - DOM node selection methods & iteration algorithm used
 - the assumption that no caching, special optimization steps,
   etc are used but only schema design can be manipulated.

etc are not specified, it leaves open a wide scope of arbitrariness
that one can probably tune to match arguments either way.



>>List containers improve any DOM-based processing by reducing the number
>>of iterations needed to match items *other* than the items in the list,
>>by allowing an entire large list of children to be ignored using a
>>single test, on the container, rather than on each of its children.

May be it is the way it is written, but the paragraph seems to
have a merged/mixed-up view of phases of validating and processing
of an instance.  

For the purpose of analysing the claim above, let's take a look
at how documents are "consumed" into an application:

(A) On the assumption that an incoming document (not necessarily
even XML-related) is a blackbox, one must first see if it is
XML-parsable, failing which the document is discarded.  

(B) Then there is a schema validation phase, failing which the
document is again discarded (or error-processed, such as
flagging user, etc).

(C) Finally, application layer takes a properly schema-validated
document to "process".  "Processing" needs to still include
a preprocessing to do value checking, as not all incorrect 
data semantics can be filtered off by schema validation.




One could conceivably do some criss-crossed "fast-track"
processing between (B) & (C), but essentially their operations
can be layered as shown.

So, where does the "using a single test" go?  

It cannot be (B), because a schema validator must theoretically 
traverse through each and every instance node before concluding 
that the root DOM node is or isn't conformant to a given schema.  
In other words, if containers are used, the validator cannot skip the
contained children as it must confirm that the children are 
of the correct children type.

So there's no saving with containers, but more processing time
needed for the validator as there are now more new container types 
to schema-validate than without.


So it had to be at (C).  In an instance document, there's merit 
to claim the above "optimization" if:

   - (1) processing large recurring list of children is a
     redundant task that can be skipped as if they are proper,

   - (2) the number of "other children nodes" that are non-recurring
     is far larger than the instantiated recurring children nodes
     so that skipping over non-recurring children nodes to reach
     recurring children nodes constitutes a performance benefit.

But, for (1), during processing, since it occurs in (C), we know that
the contained children are already type-correct.  Yet as mentioned
in (C), the recurring nodes simply cannot be skipped over without 
processing, as they typically carry the most important information 
that prompted the need for the document in the first place.

An "OrderLine" in "Order" document, for instance, carries the 
essential ordered items.  These recurring children are essentially 
the "table" of the "form".  So they need to be "looked" at by the
processing application.  A part number such as "X999932G" might
be type-correct as a xsd:string and passed the schema-validtor,
but perhaps the application might decide that this isn't a valid
part number for the receiving company.

Thus, an application cannot gain advantage by any mechanism
that instructs it not to look at them, for when it does manage
to skip over those recurring children, it wouldn't know what
data is inside and thus cannot do things like update database,
generate responses, etc.

Not that it needs any experience to back it up, but we did this 
back in our XML Industrial Project, and so had an early taste of 
that necessity.



And for (2), let's look at some concrete numbers.  I'm sorry
that it will take very long to go through all 8 documents, so
while being a bit sloppy and not too scientific, I'll just use 
the "Order" as an example, and hope it is ok with the reader.

For e.g., from UBL-Order-0.81-draft-8.xsd, in "Order" document,
facts are:

- The total number of children is 33.

- Of these, the total number of non-recurring nodes such as ID, 
  Status Code,  GloballyUniqueIdentifier,.. etc is 29. 

- Of these 29 non-recurring nodes, 24 are 0..1 which means we can
  have documents that  have only 5 non-recurring children in the 
  worst case, and about 5+12 = 17 non-recurring children on a 
  rough average.   

- There are 4 potentially recurring children nodes.  

Without actual sampling frequency, we try to see if some
expected numbers could at least support the claim in (2).

The numbers show that on average, these 4 potentially recurring
nodes would have to contain on average, AT MOST 1 
((17 * 0.2) / 4 = 0.85) contained child in order for the 
non-recurring nodes to have an expected edge of 80/20 in numbers 
over recurring nodes.

If we say, well, 80/20 is too stringent and perhaps 50/50
is considered "overwhelming", then again, on average, these
4 potentially recurring nodes would have to cotnain
AT MOST 3 ((17 * 0.5) / 4 = 2.125)  children to order to
fulfil (2)'s assumption.

Note that these are expected numbers on UPPER bounds, not 
lower bounds.  In other words, if actual sampling frequencies
exceed these expected numbers most of the time, then (2)
would not be well supported enough by actual sampling facts.

May be it is just my claim, but there seems good chance that 
the upper bounds can be breached easily.


Can I be educated on why I should think these numbers are
convincing enough for containers to provide operating efficiency 
as claimed?




Best Regards,
Chin Chee-Kai
SoftML
Tel: +65-6820-2979
Fax: +65-6743-7875
Email: cheekai@SoftML.Net
http://SoftML.Net/
Follow-Ups:
- RE: [ubl-lcsc] Processing Efficiency Test WAS Re: Position Paper on List Containers
  - From: "A Gregory" <agregory@aeon-llc.com>
References:
- RE: [ubl-lcsc] Re: Position Paper on List Containers
  - From: "A Gregory" <agregory@aeon-llc.com>