
ubl-lcsc message



Subject: Version 8.0 Schema XPath summary and possibly important statistics and analysis


Hello all FPSC and LCSC members!

I thought this would be of interest to LCSC as well as being on the 
critical path for FPSC.  With the delivery of the 0.8Draft3.1 schemas, I 
was able to prepare the XPath summaries used by FPSC.

Preparing the summaries revealed some interesting statistics that I thought 
the LCSC should be aware of.  After some description and links, I'll 
summarize the statistics.

First, for those who don't know XPath 1.0, it describes a data model for 
XML instances and a string syntax to address components of the data model:

    http://www.w3.org/TR/1999/REC-xpath-19991116

Although it happens to be used by XSLT, XPath is independent of *all* 
transformation technologies and thus, importantly, provides a 
transformation-independent description of the content of XML documents.

FPSC bases its work on XPath addresses in order to be transformation 
technology agnostic, thus allowing both proprietary and non-proprietary 
approaches to understand what we describe in our UBL XML documents in an 
unbiased fashion.

To read how we document the XPath addresses, our practices are described in 
this package:

   http://www.oasis-open.org/committees/download.php/1810/fpscdoc-20030429-2000z.zip

The XPath summary files include one entry for every element that can 
contain text and every attribute that can exist in a UBL document instance, 
without regard for document model constraints as expressed in the W3C 
Schema.  The summary is just that: a summary of all possible valid XPath 
addresses *in isolation*, not a summary of the combinations of XPath addresses.

Where there is recursion, such as an OrderLine being a descendant of 
OrderLine, the XPath algorithms halt and do not replicate any further in 
that subtree.  Cardinality (the number of times a given item can be 
repeated) is not included in the XPath report, only the unique XPath 
addresses to all possible individual information items.
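
As a minimal sketch of the halting rule described above (not the actual 
programs used to produce the summaries), the following Python walks a small 
hypothetical content model and emits each unique address once.  The MODEL 
table, its element and attribute names, and the summarize function are all 
assumptions made for illustration and are not taken from the UBL schemas.  
The sketch also assumes that a recursive element is simply not descended 
into again:

```python
# Hypothetical toy content model, for illustration only:
# element name -> (child elements, attributes, may contain text)
MODEL = {
    "Order": (["ID", "OrderLine"], [], False),
    "OrderLine": (["ID", "Quantity", "OrderLine"], [], False),
    "ID": ([], ["schemeID"], True),
    "Quantity": ([], ["unitCode"], True),
}


def summarize(name, ancestors=()):
    """Yield each unique XPath address at or below `name`.

    Only elements that may contain text, and attributes, are reported;
    purely structural elements are walked but not listed.
    """
    path = "/" + "/".join(ancestors + (name,))
    children, attributes, has_text = MODEL[name]
    if has_text:
        yield path
    for attribute in attributes:
        yield path + "/@" + attribute
    for child in children:
        # Halt at recursion: do not descend into an element that is
        # already the current element or one of its ancestors.
        if child == name or child in ancestors:
            continue
        yield from summarize(child, ancestors + (name,))


addresses = list(summarize("Order"))
for address in addresses:
    print(address)
```

Cardinality plays no part in the sketch either: an OrderLine that may 
repeat contributes its addresses once, however many times it may occur.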

Here are the XPath address files for 0.8Draft3.1 in text and XML syntax:

   http://www.oasis-open.org/committees/download.php/2742/UBL-XPath-0.8-draft-3-1.zip

Here are the XPath address files for 0.8Draft3.1 in HTML syntax (do *not* 
try to load some of these *very* large files unless you have a *lot* of 
memory on your machine; they work on my 1GB machine, and they *might* work 
if you have 512MB, but some of them will surely crash your machine if you 
have 256MB or less):

   http://www.oasis-open.org/committees/download.php/2743/UBL-XPathHTML-0.8-draft-3-1.zip

Changing the XPath generation algorithms for 0.8Draft3.1 gave me the chance 
to regenerate the XPath addresses for the 0.70 release, in all formats, for 
comparison purposes:

   http://www.oasis-open.org/committees/download.php/2741/UBL-XPath-0.7-v2.zip

Note that the XML files are what we call "key" files: the content of each 
item in a key file is its numeric index into the XPath summary files.  
Applying a stylesheet to one of these XML files therefore displays those 
index numbers in the formatted result, revealing which XPath addresses the 
stylesheet uses for each displayed text field.  This proved very useful 
while developing the 0.7 stylesheets.

The statistics are quite interesting.  Some design decisions in 0.8Draft3.1 
have tremendously inflated the number of possible XPath addresses in a 
given UBL instance.  I gather this is because, for example, order lines 
have been introduced where they were not used before, bringing the 
*entire* order line sub-tree definition into the given instance.

These tables represent the number and types of unique information items 
possible based on the schema files; the information items are all those 
elements where text can be entered and all attributes, and do not include 
purely structural elements (the following is formatted with spaces, 
requiring monospaced presentation):

Version 0.70:

doctype              all elements  elements with text  attributes  total information items
DespatchAdvice               4210                3516       26504                    30020
Invoice                      4512                3770       28309                    32079
Order                        1927                1622       11978                    13600
OrderCancellation              11                   9          67                       76
OrderResponse                4070                3408       25532                    28940
OrderResponseSimple            10                   8          63                       71
ReceiptAdvice                2767                2312       17412                    19724

Version 0.8Draft3.1:

doctype              all elements  elements with text  attributes  total information items
DespatchAdvice              23966               19833      154103                   173936
Invoice                     16924               14016      108817                   122833
Order                        3879                3249       24726                    27975
OrderCancellation              10                   8          57                       65
OrderChange                  3880                3250       24737                    27987
OrderResponse               18120               15036      116081                   131117
OrderResponseSimple             9                   7          53                       60
ReceiptAdvice               21155               17500      135977                   153477

Remember the above numbers do *not* reflect any constructs available 
through the recursion that is defined in some of the models, nor any 
repeated elements ... only unique non-recursed XPath addresses possible in 
the instances.

Several of the totals have grown by a factor of roughly 4 to 6, and 
ReceiptAdvice has grown by a factor of almost 8 (from 19,724 to 153,477).  
These XPath addresses are synthesized from an analysis of the W3C Schema 
files, so I believe they do represent the actual possible number of 
elements and attributes in the document models for the instances.
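
The growth of each document type can be recomputed directly from the "total 
information items" column of the two tables above.  The following is a 
small Python sketch of that arithmetic; the TOTALS dictionary simply 
transcribes the table values, and the growth_factors name is a hypothetical 
helper chosen for illustration.  OrderChange is left out because it has no 
0.70 counterpart:

```python
# Total information items per document type, transcribed from the tables:
# doctype -> (Version 0.70 total, Version 0.8Draft3.1 total)
TOTALS = {
    "DespatchAdvice": (30020, 173936),
    "Invoice": (32079, 122833),
    "Order": (13600, 27975),
    "OrderCancellation": (76, 65),
    "OrderResponse": (28940, 131117),
    "OrderResponseSimple": (71, 60),
    "ReceiptAdvice": (19724, 153477),
}


def growth_factors(totals):
    """Return doctype -> (new total / old total)."""
    return {doctype: new / old for doctype, (old, new) in totals.items()}


factors = growth_factors(TOTALS)
for doctype, factor in sorted(factors.items(), key=lambda item: -item[1]):
    print(f"{doctype:<21}{factor:5.2f}")
```

Sorting by factor puts the document types that grew the most at the top of 
the listing.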

I thought the number of information items in Version 0.7 was too large to 
fathom, let alone what I discovered in Version 0.8.  While I admit there 
will never be a Receipt Advice instance with 153,477 information items, all 
Receipt Advice instances will be constructed from these 153,477 information 
items (not including duplicate entries or recursive definitions).

I'm still reviewing the XPath files for accuracy (some bugs in my original 
stylesheets were revealed by some of the new practices in the 0.8 XSD 
files), but my checks so far all indicate my analysis programs are working 
fine.  The regenerated 0.7 files work just fine with the released 0.7 
stylesheets, so I'm assuming the 0.8Draft3.1 files are correct ... if 
anyone should detect any faults *please* let me know ASAP.  Thanks very much!

I hope those with an interest will find the list of all available XPath 
addresses in an instance to be useful ... perhaps even providing a 
diagnostic role in determining whether some of the paths through an 
instance even make sense.

Please let me know if you have any questions.

Thanks again!

................... Ken

p.s. On a side note, my XPath algorithms in January were implemented in 
XSLT and took about 20 minutes to generate the summaries for all 7 document 
types in the 0.70 release on my 1.2GHz machine.  I thought nothing of this amount of 
time.  Rerunning these same XSLT transforms on 0.8Draft3.1 took 5 hours and 
1 minute to produce the result (thus my first inkling that things had grown 
*very* large).  I rewrote the XPath algorithms in a combination of XSLT and 
SAX/Python and reduced the generation of the 0.70 files to about 90 seconds 
and the generation of the 0.8Draft3.1 files to about 7 minutes.  As I tell 
my XSLT students, some algorithms are best *not* implemented in XSLT when 
dealing with very large inputs.
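
For readers unfamiliar with the streaming style that SAX offers, here is a 
minimal Python sketch.  It is not the program described above, which 
analyzes the W3C Schema files; it only assumes a simpler task, collecting 
the unique element and attribute addresses seen in a single XML instance, 
to show how a SAX handler keeps just a stack and a set in memory rather 
than a whole tree.  The PathCollector and collect_paths names and the 
sample document are hypothetical:

```python
import xml.sax


class PathCollector(xml.sax.ContentHandler):
    """Collect unique element and attribute addresses while streaming."""

    def __init__(self):
        super().__init__()
        self.stack = []     # names of the currently open elements
        self.paths = set()  # unique addresses seen so far

    def startElement(self, name, attrs):
        self.stack.append(name)
        path = "/" + "/".join(self.stack)
        self.paths.add(path)
        for attribute in attrs.getNames():
            self.paths.add(path + "/@" + attribute)

    def endElement(self, name):
        self.stack.pop()


def collect_paths(xml_bytes):
    """Parse the document and return its unique addresses, sorted."""
    handler = PathCollector()
    xml.sax.parseString(xml_bytes, handler)
    return sorted(handler.paths)


SAMPLE = (
    b'<Order><ID schemeID="x">1</ID>'
    b"<OrderLine><ID>2</ID></OrderLine>"
    b"<OrderLine><ID>3</ID></OrderLine></Order>"
)
paths = collect_paths(SAMPLE)
for path in paths:
    print(path)
```

The repeated OrderLine in the sample contributes its addresses only once, 
because the set discards duplicates as they arrive.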

--
Upcoming hands-on courses: in-house corporate training available;
North America public:  XSL-FO Aug 4,2003; XSLT/XPath Aug 12, 2003

G. Ken Holman                mailto:gkholman@CraneSoftwrights.com
Crane Softwrights Ltd.         http://www.CraneSoftwrights.com/o/
Box 266, Kars, Ontario CANADA K0A-2E0   +1(613)489-0999 (F:-0995)
ISBN 0-13-065196-6                      Definitive XSLT and XPath
ISBN 0-13-140374-5                              Definitive XSL-FO
ISBN 1-894049-08-X  Practical Transformation Using XSLT and XPath
ISBN 1-894049-11-X              Practical Formatting Using XSL-FO
Member of the XML Guild of Practitioners:    http://XMLGuild.info
Male Breast Cancer Awareness http://www.CraneSoftwrights.com/o/bc


