ubl-lcsc message

Subject: RE: [ubl-lcsc] Processing Efficiency Test WAS Re: Position Paper on List Containers
From: "A Gregory" <agregory@aeon-llc.com>
To: "'Chin Chee-Kai'" <cheekai@SoftML.Net>
Date: Mon, 1 Sep 2003 07:32:24 -0700
Chee-Kai:

[Please note, I did a bit of testing and have included the results
below.]

I guess you were not party to the original discussion, which is much
narrower than the scope of your response. It is based on the behavior
(including my experience with optimizations) that is entirely limited to
your (C) below. Assume a document has been received and parsed/validated
as XML. There are, in my opinion, too many schemes to handle cross-nodal
and business logic validation to particularly design for them: we must
assume they are equal in all scenarios. 

The process efficiencies for containers are derived from the ubiquity of
DOM processing as seen in common XSLT processors (Saxon and Xalan) which
typically do not require a schema to perform their transformation
functions. (I won't go into the way in which cross-nodal logic can be
implemented as an XSLT transformation using schematron, but you may be
familiar with this approach. I don't know how prevalent this is, but
neither is it germane to the argument.) I cannot speak to the current
implementations of other DOM processors, but I suspect that they will
behave in similar fashion - I may be wrong about this, however, and so
will not argue this - it would really depend on optimization, as you
point out.

I don't want to argue this ad nauseum, either, but I believe that there
is a real processing efficiency here for large documents.

I guess the simple way to find out is to take a 1000+-item PO with and
without containers, and see how long it takes to perform an XSLT
transformation in identical circumstances (in this case, on my laptop
and using Saxon). Note that the following numbers are based *only* on
the inclusion of a header-level container and a list of line items
container. I have not gone through and included all of the rules
suggested by NDR, but only the two containers specified (I don't have
time, sadly.)

The XSLT I used grabbed one header item (the company name out of the
Buyer Party) and made a simple HTML out of it:

<html><p>[buyername]</p></html>

Thus, we are only processing 1 XPath here. 

My results:  


With Containers: 460 milliseconds

Without Containers: 470 milliseconds


Results are a net savings of 10 milliseconds when containers are used
for an XSLT that makes only a single match. 

(Admittedly, this is anything but a comprehensive test, but you will
find that your average XSLT process does a lot more than a single
lookup.)

OK - big deal - I can demonstrate a processing difference of 10
milliseconds out of a total processing time of under 500 - a bit better
than 2%. I suspect that this effect could be multiplied by the number of
XPath tests in the stylesheet.

So I continue my test with some more XPaths, to see if I am right. I add
3 more XPaths to my stylesheet: one more "header" call, and two calls
into the line items:


With Containers: 460 milliseconds

Without Containers: 510 milliseconds


Now we see my suspicion above borne out: we are looking at a processing
efficiency on the order of 10%. I would argue that this is significant.

And, presumably, the more XPaths we add, the greater the efficiency gain
will grow.

Note that there was 0 difference in prep time and in the time required
to build the trees (this was the same both with and without containers).
The hit was taken in pure processing time. Thus, the addition of a
couple of tags was minor - the processing penalty far outweighs it.

Now, we still must measure the relative importance of 10% processing
efficiency for just a transformation using XSLT, and I will confess that
I have not addresses the full scope of your response. But - since I
don't believe we have the resources to do comprehensive testing - I
still think my claims of significant efficiency gains in typical
processing scenarios with large document are borne out, all other things
being equal.

Cheers,

Arofan

-----Original Message-----
From: Chin Chee-Kai [mailto:cheekai@SoftML.Net] 
Sent: Monday, September 01, 2003 4:32 AM
To: A Gregory
Cc: 'Tim McGrath'; 'Anthony B. Coates'; ubl-lcsc@lists.oasis-open.org
Subject: RE: [ubl-lcsc] Re: Position Paper on List Containers

On Mon, 1 Sep 2003, A Gregory wrote:

>>Tony/Tim:
>>
>>Just a note on the performance issues related to list containers:
>>
>>List containers improve any DOM-based processing by reducing the
number
>>of iterations needed to match items *other* than the items in the
list,
>>by allowing an entire large list of children to be ignored using a
>>single test, on the container, rather than on each of its children.
>>
>>This was the original reason for arguing that we needed list
containers,
>>way back when.


Thanks for showing this.

Ok, let's talk about that.  Was there some misunderstanding
or confusion in the quotation?  Otherwise, I'll just look
at what's presented here.  As the assumed environment of 
processing and variables such as:

 - large/small sized document frequency, 
 - number of recurring items is large or small,
 - number of places requiring proposed containers is large or small,
 - DOM node selection methods & iteration algorithm used
 - the assumption that no caching, special optimization steps,
   etc are used but only schema design can be manipulated.

etc are not specified, it leaves open a wide scope of arbitrariness
that one can probably tune to match arguments either way.



>>List containers improve any DOM-based processing by reducing the
number
>>of iterations needed to match items *other* than the items in the
list,
>>by allowing an entire large list of children to be ignored using a
>>single test, on the container, rather than on each of its children.

May be it is the way it is written, but the paragraph seems to
have a merged/mixed-up view of phases of validating and processing
of an instance.  

For the purpose of analysing the claim above, let's take a look
at how documents are "consumed" into an application:

(A) On the assumption that an incoming document (not necessarily
even XML-related) is a blackbox, one must first see if it is
XML-parsable, failing which the document is discarded.  

(B) Then there is a schema validation phase, failing which the
document is again discarded (or error-processed, such as
flagging user, etc).

(C) Finally, application layer takes a properly schema-validated
document to "process".  "Processing" needs to still include
a preprocessing to do value checking, as not all incorrect 
data semantics can be filtered off by schema validation.




One could conceivably do some criss-crossed "fast-track"
processing between (B) & (C), but essentially their operations
can be layered as shown.

So, where does the "using a single test" go?  

It cannot be (B), because a schema validator must theoretically 
traverse through each and every instance node before concluding 
that the root DOM node is or isn't conformant to a given schema.  
In other words, if containers are used, the validator cannot skip the
contained children as it must confirm that the children are 
of the correct children type.

So there's no saving with containers, but more processing time
needed for the validator as there are now more new container types 
to schema-validate than without.


So it had to be at (C).  In an instance document, there's merit 
to claim the above "optimization" if:

   - (1) processing large recurring list of children is a
     redundant task that can be skipped as if they are proper,

   - (2) the number of "other children nodes" that are non-recurring
     is far larger than the instantiated recurring children nodes
     so that skipping over non-recurring children nodes to reach
     recurring children nodes constitutes a performance benefit.

But, for (1), during processing, since it occurs in (C), we know that
the contained children are already type-correct.  Yet as mentioned
in (C), the recurring nodes simply cannot be skipped over without 
processing, as they typically carry the most important information 
that prompted the need for the document in the first place.

An "OrderLine" in "Order" document, for instance, carries the 
essential ordered items.  These recurring children are essentially 
the "table" of the "form".  So they need to be "looked" at by the
processing application.  A part number such as "X999932G" might
be type-correct as a xsd:string and passed the schema-validtor,
but perhaps the application might decide that this isn't a valid
part number for the receiving company.

Thus, an application cannot gain advantage by any mechanism
that instructs it not to look at them, for when it does manage
to skip over those recurring children, it wouldn't know what
data is inside and thus cannot do things like update database,
generate responses, etc.

Not that it needs any experience to back it up, but we did this 
back in our XML Industrial Project, and so had an early taste of 
that necessity.



And for (2), let's look at some concrete numbers.  I'm sorry
that it will take very long to go through all 8 documents, so
while being a bit sloppy and not too scientific, I'll just use 
the "Order" as an example, and hope it is ok with the reader.

For e.g., from UBL-Order-0.81-draft-8.xsd, in "Order" document,
facts are:

- The total number of children is 33.

- Of these, the total number of non-recurring nodes such as ID, 
  Status Code,  GloballyUniqueIdentifier,.. etc is 29. 

- Of these 29 non-recurring nodes, 24 are 0..1 which means we can
  have documents that  have only 5 non-recurring children in the 
  worst case, and about 5+12 = 17 non-recurring children on a 
  rough average.   

- There are 4 potentially recurring children nodes.  

Without actual sampling frequency, we try to see if some
expected numbers could at least support the claim in (2).

The numbers show that on average, these 4 potentially recurring
nodes would have to contain on average, AT MOST 1 
((17 * 0.2) / 4 = 0.85) contained child in order for the 
non-recurring nodes to have an expected edge of 80/20 in numbers 
over recurring nodes.

If we say, well, 80/20 is too stringent and perhaps 50/50
is considered "overwhelming", then again, on average, these
4 potentially recurring nodes would have to cotnain
AT MOST 3 ((17 * 0.5) / 4 = 2.125)  children to order to
fulfil (2)'s assumption.

Note that these are expected numbers on UPPER bounds, not 
lower bounds.  In other words, if actual sampling frequencies
exceed these expected numbers most of the time, then (2)
would not be well supported enough by actual sampling facts.

May be it is just my claim, but there seems good chance that 
the upper bounds can be breached easily.


Can I be educated on why I should think these numbers are
convincing enough for containers to provide operating efficiency 
as claimed?




Best Regards,
Chin Chee-Kai
SoftML
Tel: +65-6820-2979
Fax: +65-6743-7875
Email: cheekai@SoftML.Net
http://SoftML.Net/
Follow-Ups:
- RE: [ubl-lcsc] Processing Efficiency Test WAS Re: Position Paper on List Containers
  - From: Chin Chee-Kai <cheekai@softml.net>
References:
- RE: [ubl-lcsc] Re: Position Paper on List Containers
  - From: Chin Chee-Kai <cheekai@softml.net>