Subject: RE: [ubl-lcsc] Processing Efficiency Test WAS Re: Position Paper on List Containers
Up front, I'd like to say 3 things before I address each of your
points:
(1) My model-theoretic numbers are technology neutral.
In other words, they are applicable regardless of whether you
are using Saxon/Xalan, or C++, C, other Java, Perl, XSLT or
no XSLT, 300MHz or 3.0GHz machines. This is because the
model-theoretic numbers give expected numbers on the incoming
side, and pitch each application's performance against itself
operating on different structures and values.
Those numbers, independent of technologies used in
implementation, are telling us (or perhaps I should say
me and not include you) that the claims on efficiency
quoted cannot be supported unless one prays very hard
that the documents are going to be very very structurally
benign, which won't and cannot be the case (quantification
of "very very" is found in my previous email).
(2) Your presentation about the tests and performance numbers
cannot be claimed to be representative of the whole.
I'd like to say I'm glad you shared the steps you performed and
the values you obtained, as that is what can be most convincing in
bringing us to certain conclusions. So while I
fully agree that you could do more tests with varying variables,
just one or a few tests cannot prove or disprove anything.
I guess I'm disappointed that the whole UBL TC has been led to argue
for containership for so long on the grounds of just a few
test cases. They can be the start of an entire suite of
what might be very convincing results, if that should happen,
but until it is proven and conclusively presented, one cannot
prove by extension, as in saying "2 is a prime number, 3 is a prime
number, therefore all integers are prime numbers".
(3) Zooming in on your test methodology, there is an implicit
assumption on the use of XSLT. This is not the same
thing as DOM-based processing, which doesn't
require XSLT. Your assumption that XSLT is the default
processing mechanism could be a practical one (based on
free software, open specs, many people to discuss with,
etc), but it is certainly not the only way, nor is it a normative
requirement of UBL that one MUST use XSLT when it comes to
data processing and transformation.
Furthermore, even for the one configuration you mentioned
(Saxon, XSLT on your notebook), we don't know your notebook's
CPU model, speed, cache size, RAM size, or what
XPaths you used in the XSLT. XPaths are extremely powerful
expressions, and two XPaths that differ ever so slightly in how
they are written can show greatly magnified performance differences.
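To illustrate the point about XPath phrasing, here is a minimal
sketch in Python using the standard library's ElementTree. It is
not the stylesheet or processor used in the test; the element names
(Order, BuyerParty, Name, OrderLine) are made up for illustration
and are not taken from the UBL schemas. Both paths return the same
buyer name, but the second one has to visit every element.

```python
import timeit
import xml.etree.ElementTree as ET

def build_order(n_lines):
    """Build a hypothetical order: one buyer name, then n_lines line items."""
    root = ET.Element("Order")
    buyer = ET.SubElement(root, "BuyerParty")
    ET.SubElement(buyer, "Name").text = "Acme"
    for i in range(n_lines):
        line = ET.SubElement(root, "OrderLine")
        ET.SubElement(line, "ID").text = str(i)
    return root

ORDER = build_order(1000)

def direct_path():
    # Child steps only: examines the root's children, then BuyerParty's children.
    return [e.text for e in ORDER.findall("BuyerParty/Name")]

def descendant_path():
    # Descendant search: walks every element in the tree looking for Name.
    return [e.text for e in ORDER.findall(".//Name")]

def time_both(repeat=200):
    """Return (direct_seconds, descendant_seconds) for `repeat` calls each."""
    return (timeit.timeit(direct_path, number=repeat),
            timeit.timeit(descendant_path, number=repeat))

if __name__ == "__main__":
    print(direct_path(), descendant_path())
    print("direct: %.4fs  descendant: %.4fs" % time_both())
```

The actual ratio will depend on the machine and the library, which
is exactly why the XPaths used need to be reported with the timings.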
> I guess you were not party to the original discussion, which is much
> narrower than the scope of your response. It is based on the behavior
> (including my experience with optimizations) that is entirely limited to
> your (C) below.
That's good and bad; good because I can be a fresh listener, but
bad because I've understood the background to the presented argument
about performance gains, and cannot find new evidence to support
the claims about containership performance benefits.
> Assume a document has been received and parsed/validated
> as XML. There are, in my opinion, too many schemes to handle cross-nodal
> and business logic validation to particularly design for them: we must
> assume they are equal in all scenarios.
I don't quite understand what is meant by "cross-nodal". But if
you mean processing with multiple UBL documents "on-hand" within
an application, I should point out, from the little bit of work
I have done (which shouldn't need any special mention), that one
can't run away from dealing with cross-document data
transformations even in a limited real-life scenario.
> The process efficiencies for containers are derived from the ubiquity of
> DOM processing as seen in common XSLT processors (Saxon and Xalan) which
> typically do not require a schema to perform their transformation
> functions. (I won't go into the way in which cross-nodal logic can be
> implemented as an XSLT transformation using schematron, but you may be
> familiar with this approach. I don't know how prevalent this is, but
> neither is it germane to the argument.) I cannot speak to the current
> implementations of other DOM processors, but I suspect that they will
> behave in similar fashion - I may be wrong about this, however, and so
> will not argue this - it would really depend on optimization, as you
> point out.
I suppose you might have meant it as a short form of
expression when you mixed DOM, XSLT and Saxon all in one breath.
As you know, DOM is just a model of data access built over
an abstract XML tree. XSLT is a transformation technology
that was initially really meant for presentational transformation.
It can be realized directly on the internal abstract
XML tree, or over a layer of DOM constructs (which incurs
further performance cost but gains the DOM access interface).
And Saxon is just one application that implements
XSLT.
Each of DOM, XSLT and Saxon introduces its own performance
penalties for different reasons. In your timing numbers,
there's no breakdown of the 460 milliseconds showing which portion
is attributable to the time needs of each layer, and
which portion is due entirely to the structure of the instance.
Only the latter portion would really argue for you
in terms of container benefits.
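As a rough sketch of the kind of breakdown I mean, the following
Python times the tree-building phase and the lookup phase
separately. It uses the standard library's ElementTree rather than
Saxon, so it says nothing about Saxon's internals; the function
name timed_phases and the sample document are my own assumptions.

```python
import time
import xml.etree.ElementTree as ET

def timed_phases(xml_text, path, runs=20):
    """Average the parse time and the lookup time separately over `runs`."""
    parse_s = query_s = 0.0
    hits = []
    for _ in range(runs):
        t0 = time.perf_counter()
        root = ET.fromstring(xml_text)   # phase 1: build the tree
        t1 = time.perf_counter()
        hits = root.findall(path)        # phase 2: evaluate one path
        t2 = time.perf_counter()
        parse_s += t1 - t0
        query_s += t2 - t1
    return {"parse_s": parse_s / runs,
            "query_s": query_s / runs,
            "hits": len(hits)}

if __name__ == "__main__":
    doc = "<Order><BuyerParty><Name>Acme</Name></BuyerParty></Order>"
    print(timed_phases(doc, "BuyerParty/Name"))
```

Running the same function on a containered and a non-containered
instance would show which phase, if any, the difference comes from.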
I don't mean to criticize the exercise, as I think some
timing numbers are better than none, given that all of us
are busy. However, assuming we are all interested in getting
to the bottom of the truth, and I'd want to support containers
if the numbers really argue for themselves, I think we cannot base
a conclusion on what might be programming delays (e.g. poor
implementation of loops), internal data structure inefficiencies
(e.g. no use of hashtables to cache already "hit" nodes),
poor programming constructs (e.g. lack of good use of macros
over functions), poor memory management (e.g. always relying
on garbage collection) etc etc.
> I don't want to argue this ad nauseum, either, but I believe that there
> is a real processing efficiency here for large documents.
I'd welcome the claim if there were real numbers to support it.
But sorry, so far I've only seen claim statements and
inconclusive timings based on a few samples. I don't think
one should draw a conclusion based on just that.
> I guess the simple way to find out is to take a 1000+-item PO with and
> without containers, and see how long it takes to perform an XSLT
> transformation in identical circumstances (in this case, on my laptop
> and using Saxon).
You cannot, because based on the argument put forth earlier,
the claimed advantage arises when "the other" non-recurring nodes
overwhelm the recurring nodes. I've just shown the list the
upper-bound numbers that ensure you have the condition that
makes containers "useful". And the upper bound given is about 3
in the best-case argument for containers (again, regardless
of technology and implementations used).
When it goes beyond that, and there's a practical requirement
to process all nodes, then the container element itself is dwarfed
by the 1000+ items and "the other nodes", so that its presence or
absence can easily be seen to lead to no conclusive
performance gains to speak of.
You also cannot REQUIRE use of XSLT. This becomes a
competition on "cleverness" in implementing XSLT, and is different
from processing UBL instances at stage (C) (based on my previous
layering model of processing). This inherent requirement of XSLT
as a normative form of comparison does implementors
no good.
Arguments based on REQUIREment of XSLT (and Saxon for that matter)
in processing therefore cannot be used to support proposed
containership rules, unless further results show processing
benefits for UBL instances that are independent of technologies
used (and are in harmony with model-theoretic expected figures).
> Note that the following numbers are based *only* on
> the inclusion of a header-level container and a list of line items
> container. I have not gone through and included all of the rules
> suggested by NDR, but only the two containers specified (I don't have
> time, sadly.)
>
> The XSLT I used grabbed one header item (the company name out of the
> Buyer Party) and made a simple HTML out of it:
>
> <html><p>[buyername]</p></html>
>
> Thus, we are only processing 1 XPath here.
>
> My results:
> With Containers: 460 milliseconds
> Without Containers: 470 milliseconds
>
> Results are a net savings of 10 milliseconds when containers are used
> for an XSLT that makes only a single match.
>
> (Admittedly, this is anything but a comprehensive test, but you will
> find that your average XSLT process does a lot more than a single
> lookup.)
>
> OK - big deal - I can demonstrate a processing difference of 10
> milliseconds out of a total processing time of under 500 - a bit better
> than 2%. I suspect that this effect could be multiplied by the number of
> XPath tests in the stylesheet.
>
> So I continue my test with some more XPaths, to see if I am right. I add
> 3 more XPaths to my stylesheet: one more "header" call, and two calls
> into the line items:
>
> With Containers: 460 milliseconds
> Without Containers: 510 milliseconds
Ok, this is one step towards clarity, but not sufficient as
I mentioned that XSLT/XPath is only ONE way, and your configuration
of stylesheet/XPath/Saxon/Your-notebook is only ONE of the ONE ways
of doing it.
> Now we see my suspicion above borne out: we are looking at a processing
> efficiency on the order of 10%. I would argue that this is significant.
Can you really attribute the full 10% purely to structural
differences, i.e. the presence/absence of containers?
Can you be absolutely certain that no internal shortcut
optimizations took place within that particular
implementation of Saxon/XSLT/XPath that led to one instance
of quickened timing, and that the same will still hold if
the complexity gets higher?
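For reference, the percentages under discussion follow from the
quoted timings as below. This is only the arithmetic, written as a
small Python helper (the name relative_saving is mine); it says
nothing about what caused the difference.

```python
def relative_saving(with_ms, without_ms):
    """Saving as a percentage of the slower (without-containers) time."""
    return (without_ms - with_ms) / without_ms * 100.0

if __name__ == "__main__":
    # Timings as quoted in the message being replied to.
    print(round(relative_saving(460, 470), 1))  # one XPath
    print(round(relative_saving(460, 510), 1))  # four XPaths
```

The first pair gives about 2.1% and the second about 9.8%, from a
single run each at millisecond precision.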
> And, presumably, the more XPaths we add, the greater the efficiency gain
> will grow.
Not yet. "2 is a prime, 3 is a prime" doesn't prove that
all integers are prime numbers. I'm impressed with your
boldness to claim such.
> Note that there was 0 difference in prep time and in the time required
> to build the trees (this was the same both with and without containers).
It can't be zero difference, simply because if you peek into the
internals of Saxon, time will be required at minimum to allocate
structures for the extra container elements and to "walk over"
them and into their children. The difference is probably too small
for the millisecond precision that you use to time the Saxon performance,
but the difference CAN be amplified when an instance has many
containers containing only 1 or 2 elements.
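A small sketch of the node-count effect, again in Python with
ElementTree and made-up element names (Order, ItemList, Item), not
the real UBL structures. It only counts elements; it does not
measure Saxon.

```python
import xml.etree.ElementTree as ET

def count_elements(root):
    """Count every element in the tree, including the root."""
    return sum(1 for _ in root.iter())

def flat(n_groups, per_group):
    """All items are direct children of the root (no containers)."""
    root = ET.Element("Order")
    for _ in range(n_groups * per_group):
        ET.SubElement(root, "Item")
    return root

def containered(n_groups, per_group):
    """Each group of items is wrapped in its own container element."""
    root = ET.Element("Order")
    for _ in range(n_groups):
        box = ET.SubElement(root, "ItemList")
        for _ in range(per_group):
            ET.SubElement(box, "Item")
    return root

if __name__ == "__main__":
    # One big container: one extra element out of about a thousand.
    print(count_elements(flat(1, 1000)), count_elements(containered(1, 1000)))
    # Many small containers: one extra element for every two items.
    print(count_elements(flat(100, 2)), count_elements(containered(100, 2)))
```

With one container around 1000 items the tree grows from 1001 to
1002 elements; with 100 containers of 2 items each it grows from
201 to 301, which is where the extra allocation and walking adds up.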
And since you ignored the Stage (B) schema parsing, you've
essentially ignored the time penalty that must necessarily be
incurred during that phase. That penalty is expected to be
much more measurable (larger), because a schema validator has
two sets of nodes to operate on: one set is just the instance
itself, and the other is the set of all UBL schemas, now
proliferated with the many, many container types.
> The hit was taken in pure processing time. Thus, the addition of a
> couple of tags was minor - the processing penalty far outweighs it.
See above; it is not conclusive that the delay observed was due
entirely to the presence/absence of container elements.
> Now, we still must measure the relative importance of 10% processing
> efficiency for just a transformation using XSLT, and I will confess that
> I have not addresses the full scope of your response. But - since I
> don't believe we have the resources to do comprehensive testing - I
> still think my claims of significant efficiency gains in typical
> processing scenarios with large document are borne out, all other things
> being equal.
Neither do I have many resources to do so. But as much as I've
disliked the idea right up front due to some of the numbers I could
foresee, I hate to leave the burden of coming up with containered
schemas entirely to Tim to bear. So I did end up spending
some working and weekend time on container schemas. What
I didn't know was that it was just to prove/disprove extrapolation
arguments based on a few tests though.
Best Regards,
Chin Chee-Kai
SoftML
Tel: +65-6820-2979
Fax: +65-6743-7875
Email: cheekai@SoftML.Net
http://SoftML.Net/