Subject: RE: [ubl-lcsc] Processing Efficiency Test WAS Re: Position Paper on List Containers
Chee-Kai:

[Please note, I did a bit of testing and have included the results below.]

I guess you were not party to the original discussion, which is much narrower than the scope of your response. It is based on behavior (including my experience with optimizations) that is entirely limited to your (C) below. Assume a document has been received and parsed/validated as XML. There are, in my opinion, too many schemes for handling cross-nodal and business-logic validation to design for any one of them in particular: we must assume they are equal in all scenarios.

The processing efficiencies for containers derive from the ubiquity of DOM processing as seen in common XSLT processors (Saxon and Xalan), which typically do not require a schema to perform their transformation functions. (I won't go into the way in which cross-nodal logic can be implemented as an XSLT transformation using Schematron, but you may be familiar with this approach. I don't know how prevalent it is, but neither is it germane to the argument.) I cannot speak to the current implementations of other DOM processors, but I suspect that they behave in similar fashion. I may be wrong about this, however, and so will not argue it; it would really depend on optimization, as you point out.

I don't want to argue this ad nauseam either, but I believe that there is a real processing efficiency here for large documents. I guess the simple way to find out is to take a 1000+-item PO with and without containers and see how long it takes to perform an XSLT transformation in identical circumstances (in this case, on my laptop, using Saxon).

Note that the following numbers are based *only* on the inclusion of a header-level container and a line-item list container. I have not gone through and included all of the rules suggested by NDR, only the two containers specified (I don't have time, sadly).

The XSLT I used grabbed one header item (the company name out of the Buyer Party) and made a simple HTML document out of it:

    <html><p>[buyername]</p></html>

Thus, we are only processing one XPath here. My results:

    With containers:    460 milliseconds
    Without containers: 470 milliseconds

That is a net savings of 10 milliseconds when containers are used, for an XSLT that makes only a single match. (Admittedly, this is anything but a comprehensive test, but you will find that your average XSLT process does a lot more than a single lookup.)

OK - big deal - I can demonstrate a processing difference of 10 milliseconds out of a total processing time of under 500: a bit better than 2%. But I suspect that this effect is multiplied by the number of XPath tests in the stylesheet, so I continue my test with some more XPaths to see if I am right. I add three more XPaths to my stylesheet: one more "header" call, and two calls into the line items:

    With containers:    460 milliseconds
    Without containers: 510 milliseconds

Now we see my suspicion borne out: we are looking at a processing efficiency on the order of 10%. I would argue that this is significant. And, presumably, the more XPaths we add, the greater the efficiency gain grows.

Note that there was zero difference in prep time and in the time required to build the trees (these were the same both with and without containers). The hit was taken in pure processing time. Thus, the cost of adding a couple of container tags is minor; the processing penalty far outweighs it.
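For reference, here is a minimal sketch of the kind of single-XPath stylesheet I describe above. The element names (Order, Head, BuyerParty, and so on) are illustrative stand-ins, not the exact paths from the UBL 0.81 drafts or from my test:

    <?xml version="1.0"?>
    <xsl:stylesheet version="1.0"
                    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <!-- One XPath into the header: pull the buyer's company name. -->
      <xsl:template match="/">
        <html>
          <p>
            <xsl:value-of
                select="/Order/Head/BuyerParty/Party/PartyName/Name"/>
          </p>
        </html>
      </xsl:template>
    </xsl:stylesheet>

The mechanism behind the numbers is simple: with a header-level container (the hypothetical "Head" above), the processor tests one container node and never touches the sibling list of 1000+ line items; without containers, the header elements and every line item are siblings, so each header lookup must step past the whole list. Likewise, when a stylesheet walks the tree with apply-templates, a single empty template on the container - say, <xsl:template match="OrderLines"/>, name again illustrative - prunes the entire list with one test. That is the "single test, on the container" from the original discussion.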
Now, we still must measure the relative importance of a 10% processing efficiency for just an XSLT transformation, and I will confess that I have not addressed the full scope of your response. But - since I don't believe we have the resources to do comprehensive testing - I still think my claims of significant efficiency gains in typical processing scenarios with large documents are borne out, all other things being equal.

Cheers,

Arofan

-----Original Message-----
From: Chin Chee-Kai [mailto:cheekai@SoftML.Net]
Sent: Monday, September 01, 2003 4:32 AM
To: A Gregory
Cc: 'Tim McGrath'; 'Anthony B. Coates'; ubl-lcsc@lists.oasis-open.org
Subject: RE: [ubl-lcsc] Re: Position Paper on List Containers

On Mon, 1 Sep 2003, A Gregory wrote:

>>Tony/Tim:
>>
>>Just a note on the performance issues related to list containers:
>>
>>List containers improve any DOM-based processing by reducing the number
>>of iterations needed to match items *other* than the items in the list,
>>by allowing an entire large list of children to be ignored using a
>>single test, on the container, rather than on each of its children.
>>
>>This was the original reason for arguing that we needed list containers,
>>way back when.

Thanks for showing this. OK, let's talk about that. Was there some misunderstanding or confusion in the quotation? Otherwise, I'll just look at what's presented here.

The assumed processing environment and its variables, such as:

- the frequency of large versus small documents,
- whether the number of recurring items is large or small,
- whether the number of places requiring the proposed containers is large or small,
- the DOM node-selection methods and iteration algorithms used,
- the assumption that no caching, special optimization steps, etc. are used, and that only the schema design can be manipulated,

are not specified. This leaves open a wide scope of arbitrariness that one can probably tune to match arguments either way.

>>List containers improve any DOM-based processing by reducing the number
>>of iterations needed to match items *other* than the items in the list,
>>by allowing an entire large list of children to be ignored using a
>>single test, on the container, rather than on each of its children.

Maybe it is just the way it is written, but the paragraph seems to merge and mix up the phases of validating and processing an instance. For the purpose of analysing the claim above, let's look at how documents are "consumed" into an application:

(A) On the assumption that an incoming document (not necessarily even XML-related) is a black box, one must first see whether it is XML-parsable, failing which the document is discarded.

(B) Then there is a schema-validation phase, failing which the document is again discarded (or error-processed, such as by flagging the user, etc.).

(C) Finally, the application layer takes a properly schema-validated document and "processes" it. "Processing" still needs to include a preprocessing step for value checking, as not all incorrect data semantics can be filtered out by schema validation.

One could conceivably do some criss-crossed "fast-track" processing between (B) and (C), but essentially the operations can be layered as shown.

So, where does the "single test" go? It cannot be in (B), because a schema validator must theoretically traverse each and every instance node before concluding that the root DOM node is or isn't conformant to a given schema. In other words, if containers are used, the validator cannot skip the contained children, as it must confirm that they are of the correct types.
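To make this concrete, a container declaration would look something like the fragment below (names illustrative, not the actual text of the draft schema). Even with the wrapper in place, the validator must still descend into "OrderLines" and check every "OrderLine" against its declared type; nothing is skipped:

    <!-- Hypothetical container declaration; OrderLineType is assumed
         to be declared elsewhere in the schema. The validator still
         visits every contained OrderLine. -->
    <xsd:element name="OrderLines">
      <xsd:complexType>
        <xsd:sequence>
          <xsd:element name="OrderLine" type="OrderLineType"
                       maxOccurs="unbounded"/>
        </xsd:sequence>
      </xsd:complexType>
    </xsd:element>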
So there's no saving with containers at (B); rather, the validator needs more processing time, as there are now more container types to schema-validate than without.

So it has to be at (C). For an instance document, there is merit to the claimed "optimization" if:

(1) processing the large recurring list of children is a redundant task that can be skipped as if the children were known to be proper, or

(2) the number of "other children nodes" that are non-recurring is far larger than the number of instantiated recurring children nodes, so that skipping over non-recurring children nodes to reach recurring children nodes constitutes a performance benefit.

But, for (1): since processing occurs in (C), we already know that the contained children are type-correct. Yet, as mentioned under (C), the recurring nodes simply cannot be skipped over without processing, as they typically carry the most important information that prompted the need for the document in the first place. An "OrderLine" in an "Order" document, for instance, carries the essential ordered items. These recurring children are essentially the "table" of the "form", so they need to be looked at by the processing application. A part number such as "X999932G" might be type-correct as an xsd:string and pass the schema validator, but the application might decide that it isn't a valid part number for the receiving company. Thus, an application cannot gain an advantage from any mechanism that instructs it not to look at them: if it did manage to skip over those recurring children, it wouldn't know what data is inside and thus could not do things like update databases, generate responses, etc. Not that the point needs experience to back it up, but we did this back in our XML Industrial Project, and so had an early taste of that necessity.

And for (2), let's look at some concrete numbers. I'm sorry that it would take very long to go through all 8 document types, so, while being a bit sloppy and not too scientific, I'll just use "Order" as an example and hope that is OK with the reader. From UBL-Order-0.81-draft-8.xsd, the facts for the "Order" document are:

- The total number of children is 33.
- Of these, the number of non-recurring nodes (ID, Status Code, GloballyUniqueIdentifier, etc.) is 29.
- Of these 29 non-recurring nodes, 24 are 0..1, which means a document can have as few as 5 non-recurring children in the worst case, and about 5 + 12 = 17 non-recurring children on a rough average.
- There are 4 potentially recurring children nodes.

Without actual sampling frequencies, we can at least see whether some expected numbers could support the claim in (2). They show that, on average, these 4 potentially recurring nodes could contain AT MOST 1 child each ((17 * 0.2) / 4 = 0.85) for the non-recurring nodes to hold an expected 80/20 edge in numbers over the recurring nodes. If we say 80/20 is too stringent, and perhaps 50/50 already counts as "overwhelming", then, again on average, these 4 potentially recurring nodes could contain AT MOST 3 children each ((17 * 0.5) / 4 = 2.125) in order to fulfil (2)'s assumption.

Note that these are expected numbers on UPPER bounds, not lower bounds. In other words, if actual sampling frequencies exceed these expected numbers most of the time, then (2) is not well supported by the sampling facts. Maybe it is just my claim, but there seems a good chance that these upper bounds are breached easily.
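To picture what these numbers mean in an instance, compare the two shapes (again with illustrative names). Without containers, an average "Order" carries its roughly 17 non-recurring children alongside the recurring lines:

    <Order>
      <ID>PO-1234</ID>
      <IssueDate>2003-09-01</IssueDate>
      <!-- ... some 15 more non-recurring children ... -->
      <OrderLine>...</OrderLine>
      <OrderLine>...</OrderLine>
      <!-- one OrderLine per ordered item, easily dozens or hundreds -->
    </Order>

With containers, the recurring lines collapse into a single child of <Order>:

    <Order>
      <ID>PO-1234</ID>
      <IssueDate>2003-09-01</IssueDate>
      <!-- ... some 15 more non-recurring children ... -->
      <OrderLines>
        <OrderLine>...</OrderLine>
        <!-- the application must still open this container
             and read every line -->
      </OrderLines>
    </Order>

Once an order grows past a handful of lines, the recurring children dominate the child count either way, and per (1) the application must still descend into the container to process each line.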
Can I be educated on why I should think these numbers are convincing enough for containers to provide operating efficiency as claimed?

Best Regards,
Chin Chee-Kai
SoftML
Tel: +65-6820-2979
Fax: +65-6743-7875
Email: cheekai@SoftML.Net
http://SoftML.Net/