Subject: Version 8.0 Schema XPath summary and possibly important statistics and analysis
Hello all FPSC and LCSC members!

I thought this would be of interest to LCSC as well as being on the critical path for FPSC.

With the delivery of the 0.8Draft3.1 schemas, I was able to prepare the XPath summaries used by FPSC. Preparing the summaries revealed some interesting statistics that I thought the LCSC should be aware of. After some description and links, I'll summarize the statistics.

First, for those who don't know XPath 1.0, it describes a data model for XML instances and a string syntax to address components of the data model:

  http://www.w3.org/TR/1999/REC-xpath-19991116

Although it happens to be used by XSLT, XPath is independent of *all* transformation technologies and thus, importantly, provides a transformation-independent description of the content of XML documents.

FPSC bases its work on XPath addresses in order to be transformation technology agnostic, thus allowing both proprietary and non-proprietary approaches to understand what we describe in our UBL XML documents in an unbiased fashion.

Our practices for documenting the XPath addresses are described in this package:

  http://www.oasis-open.org/committees/download.php/1810/fpscdoc-20030429-2000z.zip

The XPath summary files include one entry for every element with text and every attribute that can possibly exist in a UBL document instance, without regard for document model constraints as expressed in the W3C Schema. The summary is just that: a summary of all possible valid XPath addresses *in isolation*, not a summary of the combinations of XPath addresses.

Where there is recursion, such as an OrderLine being a descendant of OrderLine, the XPath algorithms halt and do not replicate any further in that subtree.

Cardinality (the number of times a given item can be repeated) is not included in the XPath report, only the unique XPath addresses to all possible individual information items.
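As a rough illustration of the kind of enumeration described above, here is a minimal Python sketch. It is not the actual FPSC stylesheets or scripts: the `MODEL` dictionary is a made-up toy content model (not taken from the UBL schemas), the function name `xpath_summary` is hypothetical, and the recursion rule (stop when an element reappears among its own ancestors) is one reading of "halt and do not replicate any further in that subtree".

```python
# Toy content model, invented for illustration only (NOT the UBL schemas).
# Each element declares whether it holds text, its attributes, and its children.
MODEL = {
    "Order": {"text": False, "attrs": [], "children": ["ID", "OrderLine"]},
    "ID": {"text": True, "attrs": ["schemeID"], "children": []},
    "OrderLine": {"text": False, "attrs": [], "children": ["Quantity", "OrderLine"]},
    "Quantity": {"text": True, "attrs": ["unitCode"], "children": []},
}


def xpath_summary(model, root):
    """Return the unique XPath addresses of text elements and attributes.

    Purely structural elements are not reported, cardinality is ignored,
    and a recursive reference to an ancestor is not expanded again.
    """
    paths = []

    def walk(name, prefix, ancestors):
        path = f"{prefix}/{name}"
        decl = model[name]
        if decl["text"]:
            paths.append(path)                  # element with text
        for attr in decl["attrs"]:
            paths.append(f"{path}/@{attr}")     # attribute
        seen = ancestors | {name}
        for child in decl["children"]:
            if child in seen:
                continue                        # recursion: halt in this subtree
            walk(child, path, seen)

    walk(root, "", frozenset())
    return paths


if __name__ == "__main__":
    for address in xpath_summary(MODEL, "Order"):
        print(address)
```

In this toy model the nested OrderLine inside OrderLine is skipped, so each address appears once regardless of how deeply a real instance might nest.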
Here are the XPath address files for 0.8Draft3.1 in text and XML syntax:

  http://www.oasis-open.org/committees/download.php/2742/UBL-XPath-0.8-draft-3-1.zip

Here are the XPath address files for 0.8Draft3.1 in HTML syntax (do *not* try to load some of these *very* large files unless you have a *lot* of memory on your machine; they work on my 1GB machine, and they *might* work if you have 512MB, but some of them will surely crash your machine if you have 256MB or less):

  http://www.oasis-open.org/committees/download.php/2743/UBL-XPathHTML-0.8-draft-3-1.zip

Changing the XPath generation algorithms for 0.8Draft3.1 gave me the chance to regenerate the XPath addresses for the 0.70 release, in all formats, for comparison purposes:

  http://www.oasis-open.org/committees/download.php/2741/UBL-XPath-0.7-v2.zip

Note that the XML files are what we call "key" files, in that content is supplied in the key files as numeric indexes into the XPath summary files. Thus, applying a stylesheet transform on one of these XML files will format the key information, thus revealing text-oriented field display XPath utilization. This proved very useful while developing the 0.7 stylesheets.

The statistics are quite interesting. Some design decisions in 0.8Draft3.1 have inflated the number of possible XPath addresses in a given UBL instance tremendously. I gather this is because, for example, order lines have been introduced where they were not used before, thus bringing the *entire* order line sub-tree definition into the given instance.
These tables represent the number and types of unique information items possible based on the schema files; the information items are all those elements where text can be entered and all attributes, and do not include purely structural elements (the following is formatted with spaces, requiring monospaced presentation):

Version 0.70:

doctype              all elements  elements with text  attributes  total information items
DespatchAdvice               4210                3516       26504                    30020
Invoice                      4512                3770       28309                    32079
Order                        1927                1622       11978                    13600
OrderCancellation              11                   9          67                       76
OrderResponse                4070                3408       25532                    28940
OrderResponseSimple            10                   8          63                       71
ReceiptAdvice                2767                2312       17412                    19724

Version 0.8Draft3.1:

doctype              all elements  elements with text  attributes  total information items
DespatchAdvice              23966               19833      154103                   173936
Invoice                     16924               14016      108817                   122833
Order                        3879                3249       24726                    27975
OrderCancellation              10                   8          57                       65
OrderChange                  3880                3250       24737                    27987
OrderResponse               18120               15036      116081                   131117
OrderResponseSimple             9                   7          53                       60
ReceiptAdvice               21155               17500      135977                   153477

Remember the above numbers do *not* reflect any constructs available through the recursion that is defined in some of the models, nor any repeated elements ... only unique non-recursed XPath addresses possible in the instances.

By total information items, Invoice, OrderResponse and DespatchAdvice have grown by factors of roughly 4 to 6, and ReceiptAdvice has grown by a factor of almost 8.

These XPath addresses are synthesized from an analysis of the W3C Schema files, so I believe they do represent the actual possible number of elements and attributes in the document models for the instances.

I thought the number of information items in Version 0.7 was too large to fathom, let alone what I discovered in Version 0.8. While I admit there will never be a Receipt Advice instance with 153,477 information items, all Receipt Advice instances will be constructed from these 153,477 information items (not including duplicate entries or recursive definitions).
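The growth between the two versions can be checked directly from the "total information items" column of the tables above. The following is a minimal Python sketch; the figures are copied from the tables, while the names `V070`, `V08DRAFT31` and `growth_factors` are illustrative only.

```python
# Total information items per document type, copied from the tables above.
V070 = {
    "DespatchAdvice": 30020,
    "Invoice": 32079,
    "Order": 13600,
    "OrderCancellation": 76,
    "OrderResponse": 28940,
    "OrderResponseSimple": 71,
    "ReceiptAdvice": 19724,
}
V08DRAFT31 = {
    "DespatchAdvice": 173936,
    "Invoice": 122833,
    "Order": 27975,
    "OrderCancellation": 65,
    "OrderChange": 27987,
    "OrderResponse": 131117,
    "OrderResponseSimple": 60,
    "ReceiptAdvice": 153477,
}


def growth_factors(old, new):
    """Ratio new/old for every document type present in both versions."""
    return {doc: new[doc] / old[doc] for doc in sorted(old) if doc in new}


if __name__ == "__main__":
    for doc, factor in growth_factors(V070, V08DRAFT31).items():
        print(f"{doc:<20} {factor:5.2f}")
```

OrderChange is skipped because it has no 0.70 counterpart to compare against. The same function can be pointed at the element or attribute columns to see whether the growth is uniform across item types.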
I'm still reviewing the XPath files for accuracy (some bugs in my original stylesheets were revealed by some of the new practices in the 0.8 XSD files), but my checks so far all indicate my analysis programs are working fine. The regenerated 0.7 files work just fine with the released 0.7 stylesheets, so I'm assuming the 0.8Draft3.1 files are correct ... if anyone should detect any faults *please* let me know ASAP. Thanks very much!

I hope those with an interest will find the list of all available XPath addresses in an instance to be useful ... perhaps even providing a diagnostic role in determining whether some of the paths through an instance even make sense.

Please let me know if you have any questions.

Thanks again!

................... Ken

p.s. on a side note, my XPath algorithms in January were implemented in XSLT and took about 20 minutes to generate the summaries for all 7 document types in the 0.70 release on my 1.2GHz machine. I thought nothing of this amount of time. Rerunning these same XSLT transforms on 0.8Draft3.1 took 5 hours and 1 minute to produce the result (thus my first inkling that things had grown *very* large). I rewrote the XPath algorithms in a combination of XSLT and SAX/Python and reduced the generation of the 0.70 files to about 90 seconds and the generation of the 0.8Draft3.1 files to about 7 minutes. As I tell my XSLT students, some algorithms are best *not* implemented in XSLT when dealing with very large inputs.

--
Upcoming hands-on courses: in-house corporate training available;
North America public: XSL-FO Aug 4, 2003; XSLT/XPath Aug 12, 2003

G. Ken Holman mailto:gkholman@CraneSoftwrights.com
Crane Softwrights Ltd.
http://www.CraneSoftwrights.com/o/
Box 266, Kars, Ontario CANADA K0A-2E0 +1(613)489-0999 (F:-0995)
ISBN 0-13-065196-6 Definitive XSLT and XPath
ISBN 0-13-140374-5 Definitive XSL-FO
ISBN 1-894049-08-X Practical Transformation Using XSLT and XPath
ISBN 1-894049-11-X Practical Formatting Using XSL-FO
Member of the XML Guild of Practitioners: http://XMLGuild.info
Male Breast Cancer Awareness http://www.CraneSoftwrights.com/o/bc