Subject: Re: Code List Value Validation
Hey Fraser, thanks for picking up on this thread.

At 2006-04-07 13:09 +0100, Fraser Goffin wrote:
>its been a while since I have had a chance to catch up, so hi.

And last week I was on the road consulting, so only now am I getting around to responding. "Hi" back at you! Thanks for your patience.

>I have been re-reading your UBL Code List Value Validation Methodology (v0.4)

I hope to have a slightly modified v0.5 out when I get some new work from Tony, which he said in a UBL teleconference last month he hoped to get to this month.

>again while I've had a few days off (I know I need to get out more
>:-), a few questions if I may be so bold :-

Oh, please do, Fraser. We need feedback from actual users, as I've been addressing the problems from a geek's perspective.

>1. How has this work been received in UBL. Is it proceeding as
>planned. Do you think there will be any statement on adoption of
>this methodology any time soon ?

Thanks to Jon Bosak for addressing this in another post:

http://lists.oasis-open.org/archives/ubl-dev/200604/msg00008.html

>2. A genericode file contains ONE code list at ONE version right ??

Indeed it does. Well, actually, perhaps I see it slightly differently. I see a genericode file as containing a versioned set of codes. The UBL code list context association file will associate a *combination* of a number of sets of codes to make what I see traditionally as a "code list": the list of codes available for an information item. Since the code list association file aggregates sets of codes into a single code list for a given context (which may have many or only a single location in a document), this introduces a distinction between a code list and a set of codes. But that is based on my interpretation of a "code list", which from my outsider (of business) geek role may be incorrect. I had understood a "code list" to be the set of codes applicable for a given information item.
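As a concrete illustration, a minimal genericode file for one such versioned set of codes might look like the following. This is only a sketch: the list name, version, and values are hypothetical, and the namespace shown is from the later genericode 1.0 vocabulary, so adjust it to whichever genericode draft you are actually using.

```xml
<gc:CodeList xmlns:gc="http://docs.oasis-open.org/codelist/ns/genericode/1.0/">
  <Identification>
    <ShortName>DocumentStatusCode</ShortName>
    <Version>1.0</Version>
    <CanonicalUri>urn:example:codelist:documentstatuscode</CanonicalUri>
  </Identification>
  <ColumnSet>
    <Column Id="code" Use="required">
      <ShortName>Code</ShortName>
      <Data Type="normalizedString"/>
    </Column>
  </ColumnSet>
  <SimpleCodeList>
    <Row><Value ColumnRef="code"><SimpleValue>Draft</SimpleValue></Value></Row>
    <Row><Value ColumnRef="code"><SimpleValue>Final</SimpleValue></Value></Row>
  </SimpleCodeList>
</gc:CodeList>
```

Note the version lives in the file's identification metadata, not in the values, which is what lets the context association file aggregate several such versioned sets into one logical code list.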
For users of genericode files who use a single genericode file for all of the coded values for an information item (quite typical, I should think), then yes, a genericode file contains one code list at one version, as you ask. But given that I've proposed an information item's code list is the aggregate of a collection of sets of coded values, each set expressed in a separate genericode file, then a genericode file contains only a portion of a code list (possibly all of it), at a given version for that portion.

>3. In the doc you state that when an enumeration appears WITHIN a
>schema, TPs may legitimately operate a subset since all values are
>valid in the full set, but they may not add new values.

Indeed I am, because of the absolute necessity that in the UBL code list value validation methodology a "first pass" schema validation of the instance is successful before a "second pass" value validation is even attempted. This is based on the nature of XPath-based context testing: an XPath address knows nothing of a schema and works solely on the actual presence of information items in a well-formed instance of a structured document, not on the possible information items available in the creation of a structured document. Thus, the information items absolutely must be ensured to be properly located within a given instance to have confidence that the XPath addresses being used against the instance will not inadvertently pass assertions based on faulty placement. Without a successful first pass, there is no integrity to the second pass. So the first pass must have no errors, and there is no distinction between a schema enumeration error and a schema structural error (nor do I think validators should introduce such distinctions ... an instance is either schema valid or it isn't). Should trading partners attempt to agree upon and express tailored values beyond the schema's embedded enumeration for use in second-pass value validation, those values will prevent first-pass schema validation from being successful.
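The two-pass discipline described above can be sketched in a few lines. This is not the methodology's published implementation, just an illustration of the control flow: the schema validator is passed in as a callable (in practice something like an XSD validator), and the element names, values, and paths are hypothetical.

```python
# Sketch of the two-pass discipline: a structural first pass must
# succeed before the XPath-addressed second pass is attempted.
# The schema validator is any callable here, because the point is the
# control flow, not a particular validation library.
import xml.etree.ElementTree as ET

def two_pass_validate(xml_text, schema_valid, allowed, context_path):
    doc = ET.fromstring(xml_text)

    # First pass: without structural validity there is no integrity to
    # the second pass, so stop on any schema error at all.
    if not schema_valid(xml_text):
        return False, ["first-pass schema validation failed"]

    # Second pass: value validation at the addressed context. The path
    # test is only meaningful once placement is known to be correct.
    errors = [f"disallowed code: {e.text}"
              for e in doc.findall(context_path)
              if e.text not in allowed]
    return (not errors), errors
```

An instance with a tailored value outside the schema's enumeration never reaches the second pass, which is exactly why embedded enumerations cannot be extended by trading partners.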
>Am I correct in assuming that the same is true for a code list which
>is defined by a standards organisation (not UBL) but which is NOT
>embedded within schema ?

That is not something I would assume. That would make UBL too constraining for the real business world should trading partners need to go beyond the "initial set" of values supplied by the committee, with the realization that while things may work fine between them, either partner would have to agree on those same values with any third party wishing to work with their UBL documents. That UBL includes a limited number of enumerated values is a byproduct of the committee's agreed decision to incorporate UN/CEFACT's expression of these values, which happens to have been done through schema validation mechanisms.

>As you know in my situation we operate code-lists that are defined
>by a standards body and in most cases want to use them in all
>situations where there are equivalent semantics (including internal
>app integration) rather than create alternative bespoke lists.

Indeed ... and trading partners who wish to conform with the standards body, perhaps to guarantee blind interchange with another member of the standards body, will therefore be required to limit themselves to the published sets of standardized values. But by using the UBL methodology, that is a business decision and a clearly-documented technical practice, and the standards body can enforce validation with their standardized sets of coded values in the standardized sets of information item contexts without sacrificing flexibility when needed in exceptions. The methodology will not constrain two consensual trading partners from engaging in exceptions while still using other read-only artefacts considered sacrosanct. I see document information item values as being interpretive, while document information item structures are rigid.
Moreover, the methodology will also allow trading partners to subset sets of coded values, or to use different sets of coded values in different information item contexts, without violating schema validation and in ways not offered by schema expressions. Note that I am not advocating that a maverick user attempt to engage in blind interchange with a suite of values beyond the standardized set. A community of users has standardized a set of values because of a community-wide agreement upon the semantics represented by those values. Trading partners can agree upon the semantics represented by extended values. Maverick users cannot impose unknown or unaccepted values upon unsuspecting recipients. Thankfully, recipients who publish their acceptance of standardized values can use the artefacts published by the standards committee as the basis on which their systems validating acceptable input are built.

>Problem is, we do sometimes want/need to extend these lists often to
>provide higher fidelity mapping to our operational systems.

Indeed.

>Another example might be where a code identifies some high level
>semantic, but we want to be able to create a bunch of 'sub' codes to
>provide a more granular view - accepting that in 2-way translation
>there will be data loss.

Absolutely.

>Lobbying the standards body and getting a timely change/addition can
>be problematic ? - anyway - I digress :{)}

Indeed ... but any tardiness on their part will not prevent consensual trading partners from engaging how they wish.

>4a. For code lists where there is no established [complete and/or
>definitive] standard or where the semantics and values are TP
>relationship specific, the set of permissible values can be extended
>and/or restricted from an offered base set (if available - using
>your DocumentStatusCodeCodeType example) or the participating
>organisations can agree the set of values (and presumably the list
>ID to be used in XML instances). Is this correct ?

Absolutely.
If UBL left each and every coded information item totally empty, then there would be no out-of-the-box experience for inexperienced users looking to UBL for a starting point.

>4b. If I have many TPs and each has a slightly different
>relationship, might this cause me to need a separate genericode file
>for EACH code list that differs, however slightly, from another ?
>Is there a suggested low maintenance approach to this problem ?

How about a core genericode file for the common bits, a differential genericode file for the deltas, and multiple IDREF references in the code list context association to pull in the aggregates? This would allow you to version the common core separately from the differential bits.

>4c. Similarly to (4b.), if a custom code list is shared across
>service contracts for multiple TP relationships, but a need arises
>to create a new version with [say] some values added or removed, and
>we need to be able to operate both versions concurrently for some
>period of time, does this require a complete re-statement of the new
>code list (in a new .gc file) with a new version number even if the
>difference is ONE codified value (added/deleted/changed) amongst a
>set of 10,000 values ?

Hmmmmmmmmm ... probably ... I don't believe there are any delta operators expressed in genericode. But what if you were synthesizing your genericode as an XQuery result? You could manage your many values in database tables and have the query pull out what you needed based on your criteria, and the query result would be the XML instance suitable for use in the methodology.

>This also means that the implementation will repeat a lot of code. I
>guess I am wondering whether there is/should be a way of expressing
>a 'delta' of values ?

An interesting issue ... there are XML delta expressions out there ... perhaps one could express the differences as an operation against the XML syntax.
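The core-plus-delta aggregation just described can be sketched as follows. This is only an illustration under assumptions: it presumes simple genericode files whose codes appear as SimpleValue elements, and since genericode itself has no delta operators, the union is computed outside the files by whatever tooling consumes them; the file names are hypothetical.

```python
# Sketch: aggregate a common "core" genericode file with one or more
# per-trading-partner "delta" files into a single set of allowed values,
# so only the small delta file suffers version churn.
import xml.etree.ElementTree as ET

def codes_in(gc_path):
    """Collect every SimpleValue code in a (simple) genericode file."""
    return {sv.text for sv in ET.parse(gc_path).iter("SimpleValue")}

def allowed_values(core_path, *delta_paths):
    """Union the core set of codes with any number of delta sets."""
    allowed = codes_in(core_path)
    for path in delta_paths:
        allowed |= codes_in(path)
    return allowed
```

The same shape works for 4c: keep the 10,000-value list as the core and publish a one-value delta per revision, versioning each file independently.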
But my gut feel is that it would have to be managed somehow at source and emitted as an XML instance. I wonder if Tony can comment on his experiences with user requirements in this area.

>5. Continuing the theme of (4), we have some code lists which are
>both highly volatile (values added and deleted (rarely changed)
>every month) and are very large (e.g. > 50000 entries). An example
>is Vehicle Make/Model. Do you think this approach is suitable for
>this type of reference data (multiple 'active' versions, large
>number of values) ?

Indeed ... such values are now most likely somewhere in a database. Rather than implementing a methodology that incorporates direct database access to get at those 50,000 values, XML makes an ideal bridge as a concrete expression of the values *at a point in time*, and the metadata for that export would express some identifying information for audit and tracking purposes (perhaps a "version" number of the data set?). Then the XML instance evaporates at the end of the process, and the next time you want the values you do your XQuery to get your XML expression of the values, using genericode as the vocabulary.

>6. Can you explain the difference between the UBL 'CodeType' and
>'IdentifierType' in terms of what circumstances you would use either ?

That is an NDR issue and I'll confess that I do not know the nuance off the top of my head.

>If a schema identifies the ListID,

"schema" or "instance"?

>but we want to use a different one (to employ a 'richer set' of
>values), how would an industry standard schema accommodate this
>possibility such that the schema remains a standard and unchanged
>definition of the structural constraints (is this the
>CodeType/IdentifierType approach) ?

Sorry, not sure. Could you elaborate on what you are trying to ask here?

>7. Do you think that a skeleton context association file could be
>auto-generated ?

Absolutely ...
I did (though straightforward, it was a bit of a challenge):

http://www.oasis-open.org/archives/ubl/200602/msg00069.html

The "Garden of Eden" approach to the UBL NDR requires every element and data type to be global ... because of that I was able to use XSLT to process the schema expressions. I've successfully generated the XPath files for all of the document types, and I used the XPath file results to synthesize the above single "default" context association file for UBL 2. BTW, XSLT was so slow at some of the processes that I ended up rewriting some of the steps in Python/SAX.

>8. Changing tack slightly. I am interested in using genericode files
>for a number of purposes including value-based validation, UI
>generation (e.g. to populate UI controls such as list boxes), and
>transcoding between application specific codes. It would appear from
>the genericode materials that this would be feasible, do you agree ? :-

Absolutely! I chair the UBL HISC (Human Interface Subcommittee) and genericode files are absolutely appropriate for defining drop-down lists, etc. But not just in isolation ... context is also important when drop-down lists need to differ for different document contexts. We'll probably still be using the code list context association file and not just raw genericode files.

Sorry, Fraser, I'm not sure what original text you had below before it was mangled by a mail system somewhere:

>Std Code Std Desc Appl'n A Equiv Appl'n B Equiv UI Text
>(key) (key) (key)
>
>abc Std
>Widget def ghi Part No
>3321-7 (small widget)

>9. What is the suggested approach to deal with deprecated code
>values. Is this considered as a versioning issue both for standards
>based code-lists (embedded in schema or not) and custom code lists ?

My gut feel is yes.

>Should code lists include validity date/time values or other
>'active/deprecated' indicators ?

Tony, can you comment on which semantics in genericode might satisfy this requirement?

>10.
>Caller assertion of list version. If there is no matching
>version is it best to flag the validation failure (and possibly
>reject the message) - that is, 'trust' the caller assertion, or
>validate against the un-versioned complete list (similar point to the
>one we discussed earlier about whether to trust an xsi:schemaLocation
>attribute value) ?

I decided not. The way I implemented this is that if the instance doesn't state a version number for the coded value, the version number isn't important to the author of the instance and is therefore not validated. If, however, the instance does state a version number for the coded value, the implementation requires the version number to match. I think this is an acceptable conclusion: if I use a value and I don't care in the instance about which version of the list the value is from, then the version of the list being compared against is ignored. But if I use a value and I declare the value is from a particular version, it might be because the semantics behind that particular value from that particular version are important to me. I'm not sure ... have I answered your question?

>11. Devil's advocate: What's the difference in having to distribute
>the latest .gc file versus having to use the latest XSD with updated
>embedded enums ? (Ok, I think I know the answer to this one, but it
>would be good to have a quote from the 'championing' designer, for
>the benefit of my peer group and sceptical and untrusting bosses :-)

That the structural integrity of UBL isn't being changed by changing a bunch of allowable values, and therefore the redistribution of schemas that dictate the allowed structures shouldn't be required. Programs are built assuming both structural expectations on information and value expectations on content. Making a change to recognize new structures is more difficult, time consuming and error prone than making a change to recognize a new value in a given structure.
While you implied there wouldn't be structural changes in a new schema with updated embedded enumerations, the version of the schema would be new. If I claim my software supports a given version of the schema, I would probably have testing and other issues for new schemas to be installed in my system. My gut feel is that I could more easily mitigate the impact on my system by only needing to accommodate a new version of a set of values than by having to prove my system can handle a new version of a schema.

>Anthony: So that you are aware, I am attempting to stimulate
>interest in the use of genericode within the organisation that I
>work with (a large UK financial services company) from a number of
>potential perspectives. One of these is value-based validation,
>hence discussions with Ken i.r.o his work with UBL, but also for a
>more broadly accessible resource for reference data used for a
>variety of purposes such as UI generation, transcoding, etc..

Kewl!

I hope this has helped, Fraser. Thanks for all the feedback ... keep it coming! We need it! If you need anything I've said above explained, please ask ... it is late and I may not have caught everything.

. . . . . . . . . . . Ken

--
Registration open for XSLT/XSL-FO training: Wash.,DC 2006-06-12/16
Also for XML/XSLT/XSL-FO training: Birmingham,England 2006-05-22/25
Also for XSLT/XSL-FO training: Copenhagen,Denmark 2006-05-08/11
World-wide on-site corporate, govt. & user group XML/XSL training.
G. Ken Holman                 mailto:gkholman@CraneSoftwrights.com
Crane Softwrights Ltd.        http://www.CraneSoftwrights.com/u/
Box 266, Kars, Ontario CANADA K0A-2E0    +1(613)489-0999 (F:-0995)
Male Cancer Awareness Aug'05  http://www.CraneSoftwrights.com/u/bc
Legal business disclaimers:   http://www.CraneSoftwrights.com/legal