[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]
Subject: RE: [ubl-lcsc] Re: Position Paper on List Containers
On Mon, 1 Sep 2003, A Gregory wrote: >>Tony/Tim: >> >>Just a note on the performance issues related to list containers: >> >>List containers improve any DOM-based processing by reducing the number >>of iterations needed to match items *other* than the items in the list, >>by allowing an entire large list of children to be ignored using a >>single test, on the container, rather than on each of its children. >> >>This was the original reason for arguing that we needed list containers, >>way back when. Thanks for showing this. Ok, let's talk about that. Was there some misunderstanding or confusion in the quotation? Otherwise, I'll just look at what's presented here. As the assumed environment of processing and variables such as: - large/small sized document frequency, - number of recurring items is large or small, - number of places requiring proposed containers is large or small, - DOM node selection methods & iteration algorithm used - the assumption that no caching, special optimization steps, etc are used but only schema design can be manipulated. etc are not specified, it leaves open a wide scope of arbitrariness that one can probably tune to match arguments either way. >>List containers improve any DOM-based processing by reducing the number >>of iterations needed to match items *other* than the items in the list, >>by allowing an entire large list of children to be ignored using a >>single test, on the container, rather than on each of its children. May be it is the way it is written, but the paragraph seems to have a merged/mixed-up view of phases of validating and processing of an instance. For the purpose of analysing the claim above, let's take a look at how documents are "consumed" into an application: (A) On the assumption that an incoming document (not necessarily even XML-related) is a blackbox, one must first see if it is XML-parsable, failing which the document is discarded. (B) Then there is a schema validation phase, failing which the document is again discarded (or error-processed, such as flagging user, etc). (C) Finally, application layer takes a properly schema-validated document to "process". "Processing" needs to still include a preprocessing to do value checking, as not all incorrect data semantics can be filtered off by schema validation. One could conceivably do some criss-crossed "fast-track" processing between (B) & (C), but essentially their operations can be layered as shown. So, where does the "using a single test" go? It cannot be (B), because a schema validator must theoretically traverse through each and every instance node before concluding that the root DOM node is or isn't conformant to a given schema. In other words, if containers are used, the validator cannot skip the contained children as it must confirm that the children are of the correct children type. So there's no saving with containers, but more processing time needed for the validator as there are now more new container types to schema-validate than without. So it had to be at (C). In an instance document, there's merit to claim the above "optimization" if: - (1) processing large recurring list of children is a redundant task that can be skipped as if they are proper, - (2) the number of "other children nodes" that are non-recurring is far larger than the instantiated recurring children nodes so that skipping over non-recurring children nodes to reach recurring children nodes constitutes a performance benefit. But, for (1), during processing, since it occurs in (C), we know that the contained children are already type-correct. Yet as mentioned in (C), the recurring nodes simply cannot be skipped over without processing, as they typically carry the most important information that prompted the need for the document in the first place. An "OrderLine" in "Order" document, for instance, carries the essential ordered items. These recurring children are essentially the "table" of the "form". So they need to be "looked" at by the processing application. A part number such as "X999932G" might be type-correct as a xsd:string and passed the schema-validtor, but perhaps the application might decide that this isn't a valid part number for the receiving company. Thus, an application cannot gain advantage by any mechanism that instructs it not to look at them, for when it does manage to skip over those recurring children, it wouldn't know what data is inside and thus cannot do things like update database, generate responses, etc. Not that it needs any experience to back it up, but we did this back in our XML Industrial Project, and so had an early taste of that necessity. And for (2), let's look at some concrete numbers. I'm sorry that it will take very long to go through all 8 documents, so while being a bit sloppy and not too scientific, I'll just use the "Order" as an example, and hope it is ok with the reader. For e.g., from UBL-Order-0.81-draft-8.xsd, in "Order" document, facts are: - The total number of children is 33. - Of these, the total number of non-recurring nodes such as ID, Status Code, GloballyUniqueIdentifier,.. etc is 29. - Of these 29 non-recurring nodes, 24 are 0..1 which means we can have documents that have only 5 non-recurring children in the worst case, and about 5+12 = 17 non-recurring children on a rough average. - There are 4 potentially recurring children nodes. Without actual sampling frequency, we try to see if some expected numbers could at least support the claim in (2). The numbers show that on average, these 4 potentially recurring nodes would have to contain on average, AT MOST 1 ((17 * 0.2) / 4 = 0.85) contained child in order for the non-recurring nodes to have an expected edge of 80/20 in numbers over recurring nodes. If we say, well, 80/20 is too stringent and perhaps 50/50 is considered "overwhelming", then again, on average, these 4 potentially recurring nodes would have to cotnain AT MOST 3 ((17 * 0.5) / 4 = 2.125) children to order to fulfil (2)'s assumption. Note that these are expected numbers on UPPER bounds, not lower bounds. In other words, if actual sampling frequencies exceed these expected numbers most of the time, then (2) would not be well supported enough by actual sampling facts. May be it is just my claim, but there seems good chance that the upper bounds can be breached easily. Can I be educated on why I should think these numbers are convincing enough for containers to provide operating efficiency as claimed? Best Regards, Chin Chee-Kai SoftML Tel: +65-6820-2979 Fax: +65-6743-7875 Email: cheekai@SoftML.Net http://SoftML.Net/
[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]