Subject: URI approach to code list metadata
From: jon.bosak@sun.com
To: ubl-clsc@lists.oasis-open.org
Date: Wed, 25 Feb 2004 08:18:24 -0800 (PST)

A new approach to the code list metadata problem was put forward
during our discussion of code lists in the UBL TC meeting in
Washington Monday afternoon.  Ken Holman took an action to contact
Eve Maler and Arofan Gregory for their input.  I also copied Bill
Burcham on Ken's note.  Eve and Bill have responded; this
correspondence is copied below.

Our plan now is to use the present message to put this material
in the CLSC archive for review by participants in today's TC
meeting.  We will meet at 2:30 p.m. today to review the discussion
so far and then be joined at 3 p.m. by Eve Maler and (we hope) by
Marty Burns, who is being contacted by phone.

Jon

##################################################################

Date: Mon, 23 Feb 2004 21:25:09 -0500
To: "Eve L. Maler" <eve.maler@sun.com>, "A Gregory" <agregory@aeon-llc.com>
From: "G. Ken Holman" <gkholman@CraneSoftwrights.com>
Subject: Candidate technical approach to code lists discussed today
Cc: Jon Bosak <Jon.Bosak@sun.com>

Good evening, Eve and Arofan, I hope this note finds each of you well.

A very interesting discussion was held today at the UBL F2F where one of
our biggest subject matter experts raised the "auditing" concern (yes, I
mean like a tax auditor or a financial auditor): an XML instance should
itself contain all of the business-related information, without relying on
an external document (such as a schema expression) to supply it.  Otherwise
the legal interpretation of the document may not be binding, because the
document doesn't stand alone as a complete set of information.

Supplementary components for code lists (agency, version, etc.) have been 
proposed in the past to be defaulted attributes and, in the rogue Montréal
methodology, in the data type URI encapsulated in external schema
fragments.  I gather, too, that an old proposal included qualified-name 
data where the namespace URI associated with the qualified name identified 
the coded value.

Qname data is difficult to support because tools such as SAX, DOM and XSLT 
don't have pre-built access to Qname decomposition when in data, while they 
all do when Qname decomposition is required of element types and attribute 
names.
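
The asymmetry can be illustrated with a minimal Python sketch, assuming the
standard-library ElementTree parser as a stand-in for the tools named above.
The namespace URI urn:example:codelist and the element names are
hypothetical and are not taken from UBL.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance: the same prefix appears once in an attribute
# *name* and once inside attribute *data*.
DOC = """<order xmlns:cur="urn:example:codelist">
  <amount cur:currency="USD">123.45</amount>
  <amount currency="cur:USD">123.45</amount>
</order>"""

root = ET.fromstring(DOC)
by_name, by_data = root.findall("amount")

# Prefix in the attribute name: the parser expands it to {uri}local.
name_keys = list(by_name.attrib)

# Prefix in the data: the parser returns the raw string untouched, so the
# application would have to resolve "cur" against the in-scope namespaces.
data_value = by_data.attrib["currency"]

print(name_keys)   # ['{urn:example:codelist}currency']
print(data_value)  # cur:USD
```

In the second case the application is left to look up the prefix itself,
which is the extra work referred to above.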

A proposal was floated today whereby the specialization of code list values 
is done as a two-step change:  the element type or attribute name of the 
information item as delivered out-of-the-box in UBL is either unqualified 
or qualified as a UBL label.  When a code value from a different list is 
required, the name of the information item is modified to be a Qname chosen 
by two trading partners.  The trading partners must modify their schemas to 
utilize the Qname, thus changing the namespace of the schemas to prevent 
ambiguity with the original UBL schemas.

Fine, but in addition, the namespace-aware well-formedness of the XML 
instance that contains coded values from private-use code lists *requires 
the qname to be used* which requires that name's namespace URI to be in the 
instance.  If we adopt the convention of putting the supplemental 
components into fields of the URI string, then *both* the XML instance and 
the schema expression used to validate the instance unambiguously identify 
the private-use code list and that identification contains all of the 
supplemental components for an auditor to be assured the value contained by 
the labeled information item is intended to be from the set identified by 
the components.

By not using defaulted attributes, we don't have any user information in 
the schema expression and to date not very many tools give access to W3C 
schema expression defaulted attributes.  By not using specified attributes 
we remove the restriction for an element that only one of the combination 
of #PCDATA and attributes be a code list data item (since attributes cannot 
have attributes).

An interesting benefit of this Qname approach is that if a dozen 
information items in an instance all need to use values from the 
private-use code list, that code list's URI string is defined only once in 
the document and all uses of it have names modified to point to the URI, 
and this exploitation of the prefix is very succinct.  Also if any element 
today or in the future needs more than one of #PCDATA or attributes to be a 
code list data item, the prefixes can be used on each item.

In effect, namespace prefixes become document-wide (or if the user wants, 
element-wide) proxies to germane information found in the namespace URI string.

Some questions:

(1) - has this been a strategy that has been considered in the past and 
discounted with more thought applied to it than the few hours we've had 
this afternoon when the idea was floated?

(2) - does this strategy have sufficient merit to put in the effort to try 
and get something brand new into UBL 1.0 (which we are reluctant to do, but 
it would be nice to the outside world to see that we've accomplished 
something, and gee this approach sounds very interesting technically and 
would be a new way of approaching things that we haven't seen before)?

An example:

UBL defines currency to be a coded value of an ISO 1999 set:

    xmlns:cur="...OASIS URN with supp components for ISO 1999 currency values"
    ...
    <amount cur:currency='USD'>123.45</amount>

Two trading partners agree that they want the 2002 values to be used:

    xmlns:mycur="...a private-use URN with supp components for ISO 2002 values"
    ...
    <amount mycur:currency='US'>123.45</amount>

Both the well-formed instance and the validating schema expression contain 
all of the information needed by an auditor that the values represented are 
posited to be from a particular value set.
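
A minimal sketch of what a tool could recover from the instance alone,
assuming Python's standard-library ElementTree.  The URN below and its
colon-separated field order (agency, list name, version) are invented for
illustration; they are not a UBL or OASIS convention.

```python
import xml.etree.ElementTree as ET

# Hypothetical private-use URN. The colon-separated layout
# (agency, list name, version) is an assumption made for this sketch.
DOC = """<invoice xmlns:mycur="urn:example:codelist:iso:currency:2002">
  <amount mycur:currency="US">123.45</amount>
</invoice>"""

def qualified_attributes(element):
    """Yield (namespace URI, local name, value) for namespaced attributes."""
    for key, value in element.attrib.items():
        if key.startswith("{"):
            uri, local = key[1:].split("}", 1)
            yield uri, local, value

def supplementary_components(uri):
    """Split the assumed URN layout into named fields."""
    agency, list_name, version = uri.split(":")[3:6]
    return {"agency": agency, "list": list_name, "version": version}

amount = ET.fromstring(DOC).find("amount")
uri, local, value = next(qualified_attributes(amount))
components = supplementary_components(uri)
print(uri, local, value, components)
```

The three-part result (namespace URI, local name, value) comes straight
from the parsed instance; no schema is consulted.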

To be compatible with UBL the private-use URN would have to have the 
required components in an acceptable form .... such a technique is not 
available to any existing validation processes as up until now the industry 
hasn't tried (that I've seen) to mandate form or format of namespace URIs.

There is a prevalent attitude of "we still don't know that we have it right 
so let's not rush this and let's put it out to version 1.1 and put out 
version 1.0 with the existing beta placebo 'no value validation' 
methodology."  .... but this qname label idea seems quite compelling so we 
are reluctant to just write it off.

Can we impose on you to take the time to share your thoughts with this 
brief technical overview?  I'm quite knackered after the day so I may have 
glossed over this too much, but hopefully you'll see where we were going 
with this.

Thanks so much for your time!

............................. Ken

==================================================================

Date: Wed, 25 Feb 2004 00:55:23 -0500
From: "Eve L. Maler" <eve.maler@sun.com>
Subject: Re: Candidate technical approach to code lists discussed today
To: "G. Ken Holman" <gkholman@CraneSoftwrights.com>
Cc: A Gregory <agregory@aeon-llc.com>, Jon Bosak <Jon.Bosak@sun.com>

Hi Ken,

As you probably know by now, I spoke with Jon today about the rough 
outlines of this approach.  I will try to answer your questions below, 
and offer any other thoughts that might be relevant.  (I did manage to 
finish reviewing the February 8 version of Marty Burns's paper on my 
plane ride on Monday, and had been planning to send a few of these 
comments to you folks anyway.)

G. Ken Holman wrote:

> Good evening, Eve and Arofan, I hope this note finds each of you well.
> 
> A very interesting discussion was held today at the UBL F2F where one of
> our biggest subject matter experts raised the "auditing" concern (yes, I
> mean like a tax auditor or a financial auditor): an XML instance should
> itself contain all of the business-related information, without relying
> on an external document (such as a schema expression) to supply it.
> Otherwise the legal interpretation of the document may not be binding,
> because the document doesn't stand alone as a complete set of
> information.

I could certainly be persuaded by this argument, since it's obviously a 
"hard" one in this case!

> Supplementary components for code lists (agency, version, etc.) have 
> been proposed in the past to be defaulted attributes and, in the rogue
> Montréal methodology, in the data type URI encapsulated in external
> schema fragments.  I gather, too, that an old proposal included 
> qualified-name data where the namespace URI associated with the 
> qualified name identified the coded value.
> 
> Qname data is difficult to support because tools such as SAX, DOM and 
> XSLT don't have pre-built access to Qname decomposition when in data, 
> while they all do when Qname decomposition is required of element types 
> and attribute names.

Way back when, we rejected QNames in content because of this reason and
others; you can find the rationale in the low score this mechanism got 
in our weighted analysis (now reproduced in the appendix to Marty's paper).

> A proposal was floated today whereby the specialization of code list 
> values is done as a two-step change:  the element type or attribute name 
> of the information item as delivered out-of-the-box in UBL is either 
> unqualified or qualified as a UBL label.  When a code value from a 
> different list is required, the name of the information item is modified 
> to be a Qname chosen by two trading partners.  The trading partners must 
> modify their schemas to utilize the Qname, thus changing the namespace 
> of the schemas to prevent ambiguity with the original UBL schemas.
> 
> Fine, but in addition, the namespace-aware well-formedness of the XML 
> instance that contains coded values from private-use code lists 
> *requires the qname to be used* which requires that name's namespace URI 
> to be in the instance.  If we adopt the convention of putting the 
> supplemental components into fields of the URI string, then *both* the 
> XML instance and the schema expression used to validate the instance 
> unambiguously identify the private-use code list and that identification 
> contains all of the supplemental components for an auditor to be assured 
> the value contained by the labeled information item is intended to be 
> from the set identified by the components.
> 
> By not using defaulted attributes, we don't have any user information in 
> the schema expression and to date not very many tools give access to W3C 
> schema expression defaulted attributes.  By not using specified 
> attributes we remove the restriction for an element that only one of the 
> combination of #PCDATA and attributes be a code list data item (since 
> attributes cannot have attributes).
> 
> An interesting benefit of this Qname approach is that if a dozen 
> information items in an instance all need to use values from the 
> private-use code list, that code list's URI string is defined only once 
> in the document and all uses of it have names modified to point to the 
> URI, and this exploitation of the prefix is very succinct.  Also if any 
> element today or in the future needs more than one of #PCDATA or 
> attributes to be a code list data item, the prefixes can be used on each 
> item.
> 
> In effect, namespace prefixes become document-wide (or if the user 
> wants, element-wide) proxies to germane information found in the 
> namespace URI string.

In talking with Jon, I noted that it's a slight abuse of XML namespaces, 
but understandable given that what you really need in the "second-order 
code" case (when they're in attributes) is a tripartite attribute 
structure -- (1) code list metadata with which to interpret the 
attribute value, (2) attribute name, and (3) attribute value.  Global 
attributes (with a namespace and a local part) give you that.

> Some questions:
> 
> (1) - has this been a strategy that has been considered in the past and 
> discounted with more thought applied to it than the few hours we've had 
> this afternoon when the idea was floated?

I would say no.  The closest thing to come to it was the old discredited 
QNames in content mechanism, which sought to (ab?)use XML namespaces in 
this way; however, at that point we hadn't yet started thinking about 
"articulated" namespaces that had all the necessary supplementary 
components jammed into them.

> (2) - does this strategy have sufficient merit to put in the effort to 
> try and get something brand new into UBL 1.0 (which we are reluctant to 
> do, but it would be nice to the outside world to see that we've 
> accomplished something, and gee this approach sounds very interesting 
> technically and would be a new way of approaching things that we haven't 
> seen before)?

I would say so.  Some feedback and questions for you, though:

- I understand the need for XML namespaces here, but am not entirely 
thrilled about the need for some application to "parse" a namespace URI 
to get at the parts.  It also really does kind of abuse the URN system.
I'd somewhat rather have a document header wherein there's a structure
that allows you to declare the parts in separate attributes or elements 
(whatever), and attach a namespace of your own choosing to it.

- From my reading of the February 8 document, I'm not clear on how the 
articulated namespaces get constructed by parties other than OASIS/the 
UBL TC.  Surely you don't want non-OASIS people defining urn:oasis:... 
URIs!  And I have a sort of similar comment on the construction of 
schemaLocation values.  Surely you can't mandate how and where people 
store their code list information!

- There may be a useless old artifact in the February 8 document based 
on an old idea during my tenure: the double-wrapped element.  I don't 
believe this is required any longer.

> An example:
> 
> UBL defines currency to be a coded value of an ISO 1999 set:
> 
>    xmlns:cur="...OASIS URN with supp components for ISO 1999 currency values"
>    ...
>    <amount cur:currency='USD'>123.45</amount>
> 
> Two trading partners agree that they want the 2002 values to be used:
> 
>    xmlns:mycur="...a private-use URN with supp components for ISO 2002 values"
>    ...
>    <amount mycur:currency='US'>123.45</amount>
> 
> Both the well-formed instance and the validating schema expression 
> contain all of the information needed by an auditor that the values 
> represented are posited to be from a particular value set.
> 
> To be compatible with UBL the private-use URN would have to have the 
> required components in an acceptable form .... such a technique is not 
> available to any existing validation processes as up until now the 
> industry hasn't tried (that I've seen) to mandate form or format of 
> namespace URIs.
> 
> There is a prevalent attitude of "we still don't know that we have it 
> right so let's not rush this and let's put it out to version 1.1 and put 
> out version 1.0 with the existing beta placebo 'no value validation' 
> methodology."  .... but this qname label idea seems quite compelling so 
> we are reluctant to just write it off.
> 
> Can we impose on you to take the time to share your thoughts with this 
> brief technical overview?  I'm quite knackered after the day so I may 
> have glossed over this too much, but hopefully you'll see where we were 
> going with this.
> 
> Thanks so much for your time!

I hope these thoughts aren't too disjointed -- I'm knackered myself! 
FYI, if you would like to chat about these matters further, or if you'd 
like to teleconference me in to a larger group gathering, here are my 
likely free times for the rest of the week (all ET):

Wed 25 Feb: 3-4:30pm, 5:30-8pm
Thu 26 Feb: 5pm-as late as you want to go :-)

If you think you'd like to grab me during one of these times, please try 
and call my cell phone ahead of time to confirm; I'm not sure how much 
I'll be near email tomorrow, but it's probably 0% of the time.

	Eve

==================================================================

Date: Wed, 25 Feb 2004 10:11:37 -0500
To: "Eve L. Maler" <Eve.Maler@sun.com>
From: "G. Ken Holman" <gkholman@CraneSoftwrights.com>
Subject: Re: Candidate technical approach to code lists discussed today
Cc: A Gregory <agregory@aeon-llc.com>, Jon Bosak <Jon.Bosak@sun.com>

Thanks so much, Eve, for taking from your busy time for a quick response.

At 2004-02-25 00:55 -0500, Eve L. Maler wrote:
>In talking with Jon, I noted that it's a slight abuse of XML namespaces

Hmmmmm ... given namespaces are used for identification of vocabularies, 
and supplementary components are utilized for identification, my gut feel 
was that it was an entirely appropriate use since it would be used to 
identify a vocabulary consisting of one element or one attribute and that 
vocabulary is being exploited many times in an instance.

>>(2) - does this strategy have sufficient merit to put in the effort to 
>>try and get something brand new into UBL 1.0 (which we are reluctant to 
>>do, but it would be nice to the outside world to see that we've 
>>accomplished something, and gee this approach sounds very interesting 
>>technically and would be a new way of approaching things that we haven't 
>>seen before)?
>
>I would say so.  Some feedback and questions for you, though:
>
>- I understand the need for XML namespaces here, but am not entirely 
>thrilled about the need for some application to "parse" a namespace URI to 
>get at the parts.

I'm not sure how often an application would need to parse that kind of 
information, though ... it is identifying the code list but it does not, in 
itself, constitute code list values or information that I think would be 
accessed in an application.

If an application needed to switch between one code list and another, it 
would be (I believe) sufficient to check the entire URI string, not 
selective components.
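
A minimal sketch of that whole-string check, assuming Python's
standard-library ElementTree.  Both URNs are hypothetical placeholders, and
each value set holds only the single value used in the example earlier in
this thread.

```python
import xml.etree.ElementTree as ET

# Hypothetical URNs; real identifiers would be agreed by the parties.
UBL_CUR = "urn:example:ubl:codelist:iso:currency:1999"
PRIVATE_CUR = "urn:example:private:codelist:iso:currency:2002"

# Each whole URI string maps to the values the receiver accepts.
ALLOWED = {UBL_CUR: {"USD"}, PRIVATE_CUR: {"US"}}

def currency_is_valid(xml_text):
    """Return True if the currency attribute's value belongs to its list."""
    amount = ET.fromstring(xml_text)
    for key, value in amount.attrib.items():
        if key.startswith("{"):
            uri, local = key[1:].split("}", 1)
            if local == "currency":
                # Whole-string lookup: the URI is treated as an opaque key.
                return value in ALLOWED.get(uri, set())
    return False

ubl_ok = currency_is_valid(
    f'<amount xmlns:cur="{UBL_CUR}" cur:currency="USD">123.45</amount>')
private_ok = currency_is_valid(
    f'<amount xmlns:mycur="{PRIVATE_CUR}" mycur:currency="US">123.45</amount>')
mismatch = currency_is_valid(
    f'<amount xmlns:mycur="{PRIVATE_CUR}" mycur:currency="USD">123.45</amount>')
print(ubl_ok, private_ok, mismatch)  # True True False
```

The URI is never split into fields here; it only selects which value set
applies.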

>It also really does kind of abuse the URN system.

Hmmmmm ... again, I'm a bit perplexed, since URNs are for naming
and we are uniquely identifying code lists by a multi-component name ....

>I'd somewhat rather have a document header wherein there's a structure 
>that allows you to declare the parts in separate attributes or elements 
>(whatever), and attach a namespace of your own choosing to it.

My sensitivity to that is that it "disturbs" (too strong?) the physical 
structure for an aspect of identification, not information.  Though, again, 
I'm not familiar enough with the applications to know if they would need
the supplementary components as information and not just identification.

>- From my reading of the February 8 document, I'm not clear on how the 
>articulated namespaces get constructed by parties other than OASIS/the UBL 
>TC.  Surely you don't want non-OASIS people defining urn:oasis:... URIs!

Oh!  Why not?  Could we not have a UBL URI with a private use indication 
and a structured order to the supplementary components (though I 
acknowledge there are no tools to validate constraints on the URI)?

>And I have a sort of similar comment on the construction of schemaLocation 
>values.  Surely you can't mandate how and where people store their code 
>list information!

Isn't that a totally separate issue?  For validation purposes that 
information can be kept in schema fragments and isn't needed to satisfy the 
requirement of having the information in the instance ... I think the 
schemas can bear the responsibility of schemaLocation information.

>- There may be a useless old artifact in the February 8 document based on 
>an old idea during my tenure: the double-wrapped element.  I don't believe 
>this is required any longer.

I'll look for that.

>I hope these thoughts aren't too disjointed -- I'm knackered myself!

:{)}

>here are my likely free times for the rest of the week (all ET):
>
>Wed 25 Feb: 3-4:30pm, 5:30-8pm
>Thu 26 Feb: 5pm-as late as you want to go :-)

Thanks for this ... I'm not sure how Jon wants the ball taken up at this point.

>If you think you'd like to grab me during one of these times, please try 
>and call my cell phone ahead of time to confirm; I'm not sure how much 
>I'll be near email tomorrow, but it's probably 0% of the time.

Noted, Eve ... thanks so much again for your guidance.

..................... Ken

##################################################################

Date: Tue, 24 Feb 2004 13:01:48 -0800 (PST)
From: jon.bosak@sun.com
To: Bill_Burcham@stercomm.com
Subject: Fwd: 

Hello Bill,

I'd like to get your opinion on this if you have a minute.  We're
going to try to resolve it tomorrow (Wednesday).

The requirement that drives this whole train of thought is that
the code list metadata has to be in the instance itself for legal
reasons.  The best ways we can think of to accomplish this are to
put the metadata right on the element itself (as a series of
attributes) or to stuff it into a structured namespace URI at the
top of the instance.  We're tending toward the second alternative,
but I'm queasy at the thought that there might be some XSD
"feature" that would make this untenable.  Our only other
alternative at this point would be to say that we're not
supporting code list validation in 1.0, which would be a shame
considering all the work that has gone into this.

If you could possibly be available to phone in some time on
Wednesday, that would be great; please let me know if this is a
possibility and, if so, when you could do it.  In any case, a
thumbs-up or -down by email would be much appreciated.

Jon

[Copy of message sent by Ken to Eve and Arofan omitted; see above]

==================================================================

From: "Burcham, Bill" <Bill_Burcham@stercomm.com>
To: "'jon.bosak@sun.com'" <jon.bosak@sun.com>
Subject: RE: 
Date: Wed, 25 Feb 2004 00:08:59 -0600

Hey Jon.  Traveling tonight.  Just arrived in Dublin, OH.  HQ and all that
:)

This is interesting.  I'm sure I don't follow the intricacies, but it looks
like we want two things:

1. no defaulting via schemas since the defaults aren't documented in the
instance doc
2. a "comment" in the instance doc, that an auditor can use to visually
verify that a code value comes from a restricted list

(1) seems easy to solve -- just make a UBL rule that says "no defaulting
allowed".  

Using a namespace URN to carry (2) is interesting, and the brevity-enhancing
trick of using namespace prefixes to refer to those URN's elsewhere in the
doc is nice.  I'm wondering what the net benefit of that is though, relative
to the downsides and relative to the alternatives.  

While the URN might be a handy place to stash a restricted set of code
values, it seems like a bit of a hack.  I worry that the list will get really
long in some cases and blow out some processor that isn't good at handling
really long URL's.  Could you get the same result with a stylized comment
above the URN?  E.g.

    <!-- a comment containing supp components for ISO 1999 currency values.
         Pertains to next namespace decl... -->
    xmlns:cur=".. A GUID for this 'semantic primitive' "
    ...
    <amount cur:currency='USD'>123.45</amount>

It still gives you the succinct referenceability within the instance doc (via
namespace prefix) but might ease the burden on processors (w.r.t. URN
length).   Also it solves another problem where you've got two codelists w/
the same set of values -- but different meanings (no examples come to mind
but I'm sure they're out there :) the GUID solves that.
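
A minimal sketch of reading such a stylized comment, assuming Python's
standard-library minidom, which keeps comment nodes in the tree.  Because an
XML comment cannot appear inside a start tag, the sketch places it
immediately before the element that carries the namespace declaration; the
urn:uuid value and the comment wording are hypothetical.

```python
from xml.dom import minidom

# Hypothetical instance: the comment precedes the element that declares
# the namespace, since comments are not allowed inside a start tag.
DOC = """<invoice>
  <!-- supp components: ISO currency values, 1999 -->
  <amount xmlns:cur="urn:uuid:00000000-0000-0000-0000-000000000001"
          cur:currency="USD">123.45</amount>
</invoice>"""

def comment_before(element):
    """Return the text of the nearest comment preceding `element`, if any."""
    node = element.previousSibling
    while node is not None:
        if node.nodeType == node.COMMENT_NODE:
            return node.data.strip()
        if node.nodeType == node.ELEMENT_NODE:
            return None  # another element intervenes; stop looking
        node = node.previousSibling
    return None

dom = minidom.parseString(DOC)
amount = dom.getElementsByTagName("amount")[0]
note = comment_before(amount)
print(note)  # supp components: ISO currency values, 1999
```

The link between comment and declaration is purely positional in this
sketch, so any tool that reorders or drops comments would lose it.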

That's all I can think of right now.  Can't call tomorrow unfortunately --
all day meetings.  

Best regards to you and the other UBL-ers.

-Bill