Subject: Version 8.0 Schema XPath summary and possibly important statistics and analysis
Hello all FPSC and LCSC members!
I thought this would be of interest to LCSC as well as being on the
critical path for FPSC. With the delivery of the 0.8Draft3.1 schemas, I
was able to prepare the XPath summaries used by FPSC.
Preparing the summaries revealed some interesting statistics that I thought
the LCSC should be aware of. After some description and links, I'll
summarize the statistics.
First, for those who don't know XPath 1.0, it describes a data model for
XML instances and a string syntax to address components of the data model:
http://www.w3.org/TR/1999/REC-xpath-19991116
Although it happens to be used by XSLT, XPath is independent of *all*
transformation technologies and thus, importantly, provides a
transformation-independent description of the content of XML documents.
FPSC bases its work on XPath addresses in order to be transformation
technology agnostic, thus allowing both proprietary and non-proprietary
approaches to understand what we describe in our UBL XML documents in an
unbiased fashion.
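As a small illustration of XPath addresses being usable outside any transformation technology, here is a sketch in Python. The instance and its element names are invented for the example (it is not a real UBL document). The standard library's ElementTree implements only a subset of XPath location paths, so the attribute is read through the element API rather than addressed with an "@" step.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified instance; NOT a real UBL document.
DOC = """<Order>
  <ID>A-100</ID>
  <OrderLine>
    <Quantity unitCode="EA">5</Quantity>
  </OrderLine>
</Order>"""

root = ET.fromstring(DOC)  # root is the <Order> element

# Paths given to ElementTree are relative to the element they are called on.
order_id = root.findtext("ID")
quantity = root.find("OrderLine/Quantity")
unit_code = quantity.get("unitCode")  # attribute read via the element API

print("/Order/ID ->", order_id)
print("/Order/OrderLine/Quantity ->", quantity.text)
print("/Order/OrderLine/Quantity/@unitCode ->", unit_code)
```

No stylesheet is involved here: the same addresses could be handed to any tool that understands XPath.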
To read how we document the XPath addresses, our practices are described in
this package:
http://www.oasis-open.org/committees/download.php/1810/fpscdoc-20030429-2000z.zip
The XPath summary files include one entry for every element with text and
every attribute that can occur in a UBL document instance,
without regard for document model constraints as expressed in the W3C
Schema. The summary is just that: a summary of all possible valid XPath
addresses *in isolation*, not a summary of the combinations of XPath addresses.
Where there is recursion, such as an OrderLine being a descendant of
OrderLine, the XPath algorithms halt and do not replicate any further in
that subtree. Cardinality (the number of times a given item can be
repeated) is not included in the XPath report, only the unique XPath
addresses to all possible individual information items.
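The enumeration described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual generation program: the content model is a hypothetical hand-written dictionary rather than something read from the W3C Schema files, and all names in it are invented.

```python
# Hypothetical toy content model (an assumption for this sketch):
# element name -> (has_text, attribute names, child element names)
MODEL = {
    "Order": (False, [], ["ID", "OrderLine"]),
    "ID": (True, ["schemeID"], []),
    "OrderLine": (False, [], ["Quantity", "OrderLine"]),  # recursive child
    "Quantity": (True, ["unitCode"], []),
}


def xpath_addresses(name, prefix="", ancestors=()):
    """Yield one address per text-bearing element and per attribute."""
    if name in ancestors:
        # Recursion detected: halt instead of replicating the subtree again.
        return
    path = f"{prefix}/{name}"
    has_text, attributes, children = MODEL[name]
    if has_text:
        yield path  # purely structural elements are not reported
    for attribute in attributes:
        yield f"{path}/@{attribute}"
    for child in children:
        yield from xpath_addresses(child, path, ancestors + (name,))


addresses = list(xpath_addresses("Order"))
for address in addresses:
    print(address)
```

Each address appears once, regardless of how many times the item may repeat, and the nested OrderLine inside OrderLine is not expanded.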
Here are the XPath address files for 0.8Draft3.1 in text and XML syntax:
http://www.oasis-open.org/committees/download.php/2742/UBL-XPath-0.8-draft-3-1.zip
Here are the XPath address files for 0.8Draft3.1 in HTML syntax (do *not*
try to load some of these *very* large files unless you have a *lot* of
memory on your machine; they work on my 1 GB machine, and they *might* work
if you have 512 MB, but some of them will surely crash your machine if you
have 256 MB or less):
http://www.oasis-open.org/committees/download.php/2743/UBL-XPathHTML-0.8-draft-3-1.zip
Changing the XPath generation algorithms for 0.8Draft3.1 gave me the chance
to regenerate the XPath addresses for 0.70 release for comparison purposes
in all formats:
http://www.oasis-open.org/committees/download.php/2741/UBL-XPath-0.7-v2.zip
Note that the XML files are what we call "key" files: their content is
supplied as numeric indexes into the XPath summary files. Applying a
stylesheet transform to one of these XML files formats the key
information, revealing how the XPath addresses are used in the
text-oriented field display. This proved very useful while developing the
0.7 stylesheets.
The statistics are quite interesting. Some design decisions in 0.8Draft3.1
have inflated the number of possible XPath addresses in a given UBL
instance tremendously. I gather this is because, for example, order lines
have been introduced where they were not used before, thus bringing in the
*entire* order line sub-tree definition into the given instance.
These tables represent the number and types of unique information items
possible based on the schema files; the information items are all those
elements where text can be entered and all attributes, and do not include
purely structural elements (the following is formatted with spaces,
requiring monospaced presentation):
Version 0.70:
doctype              all elements  elements with text  attributes  total information items
DespatchAdvice               4210                3516       26504                    30020
Invoice                      4512                3770       28309                    32079
Order                        1927                1622       11978                    13600
OrderCancellation              11                   9          67                       76
OrderResponse                4070                3408       25532                    28940
OrderResponseSimple            10                   8          63                       71
ReceiptAdvice                2767                2312       17412                    19724

Version 0.8Draft3.1:
doctype              all elements  elements with text  attributes  total information items
DespatchAdvice              23966               19833      154103                   173936
Invoice                     16924               14016      108817                   122833
Order                        3879                3249       24726                    27975
OrderCancellation              10                   8          57                       65
OrderChange                  3880                3250       24737                    27987
OrderResponse               18120               15036      116081                   131117
OrderResponseSimple             9                   7          53                       60
ReceiptAdvice               21155               17500      135977                   153477
Remember the above numbers do *not* reflect any constructs available
through the recursion that is defined in some of the models, nor any
repeated elements ... only unique non-recursed XPath addresses possible in
the instances.
Several totals have grown by a factor of roughly 4 to 6 (Invoice 3.8,
OrderResponse 4.5, DespatchAdvice 5.8), and ReceiptAdvice has grown by a
factor of almost 8. These XPath addresses are synthesized from an
analysis of the W3C Schema files, so I believe they do represent the actual
possible number of elements and attributes in the document models for the
instances.
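For anyone who wants to recheck the growth between the two versions, the arithmetic can be reproduced from the tables with a short Python sketch. The figures are copied from the tables in this message; the variable names are only illustrative. It also checks that each total equals elements with text plus attributes.

```python
# Tuples are (all elements, elements with text, attributes, total items),
# copied from the two tables in this message.
V070 = {
    "DespatchAdvice": (4210, 3516, 26504, 30020),
    "Invoice": (4512, 3770, 28309, 32079),
    "Order": (1927, 1622, 11978, 13600),
    "OrderCancellation": (11, 9, 67, 76),
    "OrderResponse": (4070, 3408, 25532, 28940),
    "OrderResponseSimple": (10, 8, 63, 71),
    "ReceiptAdvice": (2767, 2312, 17412, 19724),
}
V08 = {
    "DespatchAdvice": (23966, 19833, 154103, 173936),
    "Invoice": (16924, 14016, 108817, 122833),
    "Order": (3879, 3249, 24726, 27975),
    "OrderCancellation": (10, 8, 57, 65),
    "OrderChange": (3880, 3250, 24737, 27987),
    "OrderResponse": (18120, 15036, 116081, 131117),
    "OrderResponseSimple": (9, 7, 53, 60),
    "ReceiptAdvice": (21155, 17500, 135977, 153477),
}

# Consistency check: total = elements with text + attributes.
mismatches = [
    doctype
    for table in (V070, V08)
    for doctype, (_, with_text, attributes, total) in table.items()
    if with_text + attributes != total
]
print("rows with inconsistent totals:", mismatches)

# Growth factor of the total for document types present in both versions.
growth = {d: V08[d][3] / V070[d][3] for d in V070 if d in V08}
for doctype, factor in sorted(growth.items()):
    print(f"{doctype:<20} x{factor:.2f}")
```

OrderChange is left out of the comparison because it has no 0.70 counterpart.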
I thought the number of information items in Version 0.7 was too large to
fathom, let alone what I discovered in Version 0.8. While I admit there
will never be a Receipt Advice instance with 153,477 information items, all
Receipt Advice instances will be constructed from these 153,477 information
items (not including duplicate entries or recursive definitions).
I'm still reviewing the XPath files for accuracy (some bugs in my original
stylesheets were revealed by some of the new practices in the 0.8 XSD
files), but my checks so far all indicate my analysis programs are working
fine. The regenerated 0.7 files work just fine with the released 0.7
stylesheets, so I'm assuming the 0.8Draft3.1 files are correct ... if
anyone should detect any faults *please* let me know ASAP. Thanks very much!
I hope those with an interest will find the list of all available XPath
addresses in an instance to be useful ... perhaps even providing a
diagnostic role in determining whether some of the paths through an
instance even make sense.
Please let me know if you have any questions.
Thanks again!
................... Ken
P.S. On a side note, my XPath algorithms in January were implemented in
XSLT and took about 20 minutes to run for all 7 document types in
the 0.70 release on my 1.2GHz machine. I thought nothing of this amount of
time. Rerunning these same XSLT transforms on 0.8Draft3.1 took 5 hours and
1 minute to produce the result (thus my first inkling that things had grown
*very* large). I rewrote the XPath algorithms in a combination of XSLT and
SAX/Python and reduced the generation of the 0.70 files to about 90 seconds
and the generation of the 0.8Draft3.1 files to about 7 minutes. As I tell
my XSLT students, some algorithms are best *not* implemented in XSLT when
dealing with very large inputs.
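To give a flavour of why a streaming approach scales better, here is a minimal SAX sketch in Python. It is not the rewritten program mentioned above, which is not shown in this message and works from the schema files. This sketch instead collects the unique XPath addresses seen in a single, invented XML instance, holding only the current element stack in memory. The class name and sample data are hypothetical.

```python
import xml.sax


class PathCollector(xml.sax.ContentHandler):
    """Collect unique addresses of text-bearing elements and attributes."""

    def __init__(self):
        super().__init__()
        self.stack = []      # names of the currently open elements
        self.has_text = []   # parallel flags: non-whitespace text seen?
        self.paths = set()   # unique addresses found so far

    def _current_path(self):
        return "/" + "/".join(self.stack)

    def startElement(self, name, attrs):
        self.stack.append(name)
        self.has_text.append(False)
        for attribute in attrs.getNames():
            self.paths.add(f"{self._current_path()}/@{attribute}")

    def characters(self, content):
        if content.strip():
            self.has_text[-1] = True

    def endElement(self, name):
        if self.has_text.pop():
            self.paths.add(self._current_path())
        self.stack.pop()


# Hypothetical sample instance; NOT a real UBL document.
SAMPLE = b"""<Order>
  <ID>A-100</ID>
  <OrderLine><Quantity unitCode="EA">5</Quantity></OrderLine>
  <OrderLine><Quantity unitCode="EA">7</Quantity></OrderLine>
</Order>"""

collector = PathCollector()
xml.sax.parseString(SAMPLE, collector)
for path in sorted(collector.paths):
    print(path)
```

The repeated OrderLine contributes each address only once, and memory use depends on the nesting depth and the number of unique addresses rather than on the size of the input.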
--
Upcoming hands-on courses: in-house corporate training available;
North America public: XSL-FO Aug 4,2003; XSLT/XPath Aug 12, 2003
G. Ken Holman mailto:gkholman@CraneSoftwrights.com
Crane Softwrights Ltd. http://www.CraneSoftwrights.com/o/
Box 266, Kars, Ontario CANADA K0A-2E0 +1(613)489-0999 (F:-0995)
ISBN 0-13-065196-6 Definitive XSLT and XPath
ISBN 0-13-140374-5 Definitive XSL-FO
ISBN 1-894049-08-X Practical Transformation Using XSLT and XPath
ISBN 1-894049-11-X Practical Formatting Using XSL-FO
Member of the XML Guild of Practitioners: http://XMLGuild.info
Male Breast Cancer Awareness http://www.CraneSoftwrights.com/o/bc