Subject: URI approach to code list metadata
A new approach to the code list metadata problem was put forward during our discussion of code lists in the UBL TC meeting in Washington Monday afternoon. Ken Holman took an action to contact Eve Maler and Arofan Gregory for their input. I also copied Bill Burcham on Ken's note. Eve and Bill have responded; this correspondence is copied below.

Our plan now is to use the present message to put this material in the CLSC archive for review by participants in today's TC meeting. We will meet at 2:30 p.m. today to review the discussion so far and then be joined at 3 p.m. by Eve Maler and (we hope) by Marty Burns, who is being contacted by phone.

Jon

##################################################################

Date: Mon, 23 Feb 2004 21:25:09 -0500
To: "Eve L. Maler" <eve.maler@sun.com>, "A Gregory" <agregory@aeon-llc.com>
From: "G. Ken Holman" <gkholman@CraneSoftwrights.com>
Subject: Candidate technical approach to code lists discussed today
Cc: Jon Bosak <Jon.Bosak@sun.com>

Good evening, Eve and Arofan, I hope this note finds each of you well.

A very interesting discussion was held today at the UBL F2F where one of our biggest subject matter experts raised the "auditing" concern (yes, I mean like a tax auditor or a financial auditor) in that an XML instance of information should contain all of the business-related information without relying on an external document (as in schema expression) containing such business-related information for the legal interpretation of a document not being binding because it doesn't stand alone as a complete set of information.

Supplementary components for code lists (agency, version, etc.) have been proposed in the past to be defaulted attributes and, in the rogue Montréal methodology in the data type URI encapsulated in external schema fragments. I gather, too, that an old proposal included qualified-name data where the namespace URI associated with the qualified name identified the coded value.
Qname data is difficult to support because tools such as SAX, DOM and XSLT don't have pre-built access to Qname decomposition when in data, while they all do when Qname decomposition is required of element types and attribute names.

A proposal was floated today whereby the specialization of code list values is done as a two-step change: the element type or attribute name of the information item as delivered out-of-the-box in UBL is either unqualified or qualified as a UBL label. When a code value from a different list is required, the name of the information item is modified to be a Qname chosen by two trading partners. The trading partners must modify their schemas to utilize the Qname, thus changing the namespace of the schemas to prevent ambiguity with the original UBL schemas.

Fine, but in addition, the namespace-aware well-formedness of the XML instance that contains coded values from private-use code lists *requires the qname to be used* which requires that name's namespace URI to be in the instance. If we adopt the convention of putting the supplemental components into fields of the URI string, then *both* the XML instance and the schema expression used to validate the instance unambiguously identify the private-use code list and that identification contains all of the supplemental components for an auditor to be assured the value contained by the labeled information item is intended to be from the set identified by the components.

By not using defaulted attributes, we don't have any user information in the schema expression and to date not very many tools give access to W3C schema expression defaulted attributes. By not using specified attributes we remove the restriction for an element that only one of the combination of #PCDATA and attributes be a code list data item (since attributes cannot have attributes).
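
The tooling point above can be illustrated with a rough sketch (an editorial illustration, not part of the original correspondence). It assumes Python's standard xml.etree.ElementTree parser, and the namespace URI is hypothetical: the thread does not define any field layout for the supplementary components. A namespace-qualified attribute name arrives from the parser already resolved to its URI, while a QName carried as data arrives as an unresolved string.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI: the thread does not define a field layout.
CODE_LIST_URI = "urn:example:codelist:iso:currency:1999"

INSTANCE = f"""<invoice xmlns:cur="{CODE_LIST_URI}">
  <amount cur:currency="USD">123.45</amount>
  <tax cur:currency="USD">10.00</tax>
</invoice>"""


def qualified_attributes(xml_text):
    """List (element, namespace URI, local name, value) for namespaced attributes."""
    rows = []
    for elem in ET.fromstring(xml_text).iter():
        for name, value in elem.attrib.items():
            # ElementTree reports a qualified attribute name as "{uri}local".
            if name.startswith("{"):
                uri, local = name[1:].split("}", 1)
                rows.append((elem.tag, uri, local, value))
    return rows


rows = qualified_attributes(INSTANCE)
for row in rows:
    print(row)

# A QName carried as data comes back as an unresolved string; the parser
# does not expand the prefix on the application's behalf.
qname_in_content = ET.fromstring(
    f'<currency xmlns:cur="{CODE_LIST_URI}">cur:USD</currency>'
).text
print(qname_in_content)
```

The namespace is declared once on the root element, yet every qualified attribute reports the full URI, which is the property the proposal relies on.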
An interesting benefit of this Qname approach is that if a dozen information items in an instance all need to use values from the private-use code list, that code list's URI string is defined only once in the document and all uses of it have names modified to point to the URI, and this exploitation of the prefix is very succinct. Also if any element today or in the future needs more than one of #PCDATA or attributes to be a code list data item, the prefixes can be used on each item.

In effect, namespace prefixes become document-wide (or if the user wants, element-wide) proxies to germane information found in the namespace URI string.

Some questions:

(1) - has this been a strategy that has been considered in the past and discounted with more thought applied to it than the few hours we've had this afternoon when the idea was floated?

(2) - does this strategy have sufficient merit to put in the effort to try and get something brand new into UBL 1.0 (which we are reluctant to do, but it would be nice to the outside world to see that we've accomplished something, and gee this approach sounds very interesting technically and would be a new way of approaching things that we haven't seen before)?

An example:

UBL defines currency to be a coded value of an ISO 1999 set:

  xmlns:cur="...OASIS URN with supp components for ISO 1999 currency values"
  ...
  <amount cur:currency='USD'>123.45</amount>

Two trading partners agree that they want the 2002 values to be used:

  xmlns:mycur="...a private-use URN with supp components for ISO 2002 values"
  ...
  <amount mycur:currency='US'>123.45</amount>

Both the well-formed instance and the validating schema expression contain all of the information needed by an auditor that the values represented are posited to be from a particular value set.

To be compatible with UBL the private-use URN would have to have the required components in an acceptable form ....
such a technique is not available to any existing validation processes as up until now the industry hasn't tried (that I've seen) to mandate form or format of namespace URIs.

There is a prevalent attitude of "we still don't know that we have it right so let's not rush this and let's put it out to version 1.1 and put out version 1.0 with the existing beta placebo 'no value validation' methodology." .... but this qname label idea seems quite compelling so we are reluctant to just write it off.

Can we impose on you to take the time to share your thoughts with this brief technical overview? I'm quite knackered after the day so I may have glossed over this too much, but hopefully you'll see where we were going with this.

Thanks so much for your time!

............................. Ken

==================================================================

Date: Wed, 25 Feb 2004 00:55:23 -0500
From: "Eve L. Maler" <eve.maler@sun.com>
Subject: Re: Candidate technical approach to code lists discussed today
To: "G. Ken Holman" <gkholman@CraneSoftwrights.com>
Cc: A Gregory <agregory@aeon-llc.com>, Jon Bosak <Jon.Bosak@sun.com>

Hi Ken,

As you probably know by now, I spoke with Jon today about the rough outlines of this approach. I will try to answer your questions below, and offer any other thoughts that might be relevant. (I did manage to finish reviewing the February 8 version of Marty Burns's paper on my plane ride on Monday, and had been planning to send a few of these comments to you folks anyway.)

G. Ken Holman wrote:
> Good evening, Eve and Arofan, I hope this note finds each of you well.
>
> A very interesting discussion was held today at the UBL F2F where one of our biggest subject matter experts raised the "auditing" concern (yes, I mean like a tax auditor or a financial auditor) in that an XML instance of information should contain all of the business-related information without relying on an external document (as in schema expression) containing such business-related information for the legal interpretation of a document not being binding because it doesn't stand alone as a complete set of information.

I could certainly be persuaded by this argument, since it's obviously a "hard" one in this case!

> Supplementary components for code lists (agency, version, etc.) have been proposed in the past to be defaulted attributes and, in the rogue Montréal methodology in the data type URI encapsulated in external schema fragments. I gather, too, that an old proposal included qualified-name data where the namespace URI associated with the qualified name identified the coded value.
>
> Qname data is difficult to support because tools such as SAX, DOM and XSLT don't have pre-built access to Qname decomposition when in data, while they all do when Qname decomposition is required of element types and attribute names.

Way back when, we rejected QNames in content because of this reason and others; you can find the rationale in the low score this mechanism got in our weighted analysis (now reproduced in the appendix to Marty's paper).

> A proposal was floated today whereby the specialization of code list values is done as a two-step change: the element type or attribute name of the information item as delivered out-of-the-box in UBL is either unqualified or qualified as a UBL label. When a code value from a different list is required, the name of the information item is modified to be a Qname chosen by two trading partners.
> The trading partners must modify their schemas to utilize the Qname, thus changing the namespace of the schemas to prevent ambiguity with the original UBL schemas.
>
> Fine, but in addition, the namespace-aware well-formedness of the XML instance that contains coded values from private-use code lists *requires the qname to be used* which requires that name's namespace URI to be in the instance. If we adopt the convention of putting the supplemental components into fields of the URI string, then *both* the XML instance and the schema expression used to validate the instance unambiguously identify the private-use code list and that identification contains all of the supplemental components for an auditor to be assured the value contained by the labeled information item is intended to be from the set identified by the components.
>
> By not using defaulted attributes, we don't have any user information in the schema expression and to date not very many tools give access to W3C schema expression defaulted attributes. By not using specified attributes we remove the restriction for an element that only one of the combination of #PCDATA and attributes be a code list data item (since attributes cannot have attributes).
>
> An interesting benefit of this Qname approach is that if a dozen information items in an instance all need to use values from the private-use code list, that code list's URI string is defined only once in the document and all uses of it have names modified to point to the URI, and this exploitation of the prefix is very succinct. Also if any element today or in the future needs more than one of #PCDATA or attributes to be a code list data item, the prefixes can be used on each item.
>
> In effect, namespace prefixes become document-wide (or if the user wants, element-wide) proxies to germane information found in the namespace URI string.

In talking with Jon, I noted that it's a slight abuse of XML namespaces, but understandable given that what you really need in the "second-order code" case (when they're in attributes) is a tripartite attribute structure -- (1) code list metadata with which to interpret the attribute value, (2) attribute name, and (3) attribute value. Global attributes (with a namespace and a local part) give you that.

> Some questions:
>
> (1) - has this been a strategy that has been considered in the past and discounted with more thought applied to it than the few hours we've had this afternoon when the idea was floated?

I would say no. The closest thing to come to it was the old discredited QNames in content mechanism, which sought to (ab?)use XML namespaces in this way; however, at that point we hadn't yet started thinking about "articulated" namespaces that had all the necessary supplementary components jammed into them.

> (2) - does this strategy have sufficient merit to put in the effort to try and get something brand new into UBL 1.0 (which we are reluctant to do, but it would be nice to the outside world to see that we've accomplished something, and gee this approach sounds very interesting technically and would be a new way of approaching things that we haven't seen before)?

I would say so. Some feedback and questions for you, though:

- I understand the need for XML namespaces here, but am not entirely thrilled about the need for some application to "parse" a namespace URI to get at the parts. It also really does kind of abuse the URN system. I'd somewhat rather have a document header wherein there's a structure that allows you to declare the parts in separate attributes or elements (whatever), and attach a namespace of your own choosing to it.

- From my reading of the February 8 document, I'm not clear on how the articulated namespaces get constructed by parties other than OASIS/the UBL TC. Surely you don't want non-OASIS people defining urn:oasis:... URIs!
  And I have a sort of similar comment on the construction of schemaLocation values. Surely you can't mandate how and where people store their code list information!

- There may be a useless old artifact in the February 8 document based on an old idea during my tenure: the double-wrapped element. I don't believe this is required any longer.

> An example:
>
> UBL defines currency to be a coded value of an ISO 1999 set:
>
>   xmlns:cur="...OASIS URN with supp components for ISO 1999 currency values"
>   ...
>   <amount cur:currency='USD'>123.45</amount>
>
> Two trading partners agree that they want the 2002 values to be used:
>
>   xmlns:mycur="...a private-use URN with supp components for ISO 2002 values"
>   ...
>   <amount mycur:currency='US'>123.45</amount>
>
> Both the well-formed instance and the validating schema expression contain all of the information needed by an auditor that the values represented are posited to be from a particular value set.
>
> To be compatible with UBL the private-use URN would have to have the required components in an acceptable form .... such a technique is not available to any existing validation processes as up until now the industry hasn't tried (that I've seen) to mandate form or format of namespace URIs.
>
> There is a prevalent attitude of "we still don't know that we have it right so let's not rush this and let's put it out to version 1.1 and put out version 1.0 with the existing beta placebo 'no value validation' methodology." .... but this qname label idea seems quite compelling so we are reluctant to just write it off.
>
> Can we impose on you to take the time to share your thoughts with this brief technical overview? I'm quite knackered after the day so I may have glossed over this too much, but hopefully you'll see where we were going with this.
>
> Thanks so much for your time!

I hope these thoughts aren't too disjointed -- I'm knackered myself!

FYI, if you would like to chat about these matters further, or if you'd like to teleconference me in to a larger group gathering, here are my likely free times for the rest of the week (all ET):

Wed 25 Feb: 3-4:30pm, 5:30-8pm
Thu 26 Feb: 5pm-as late as you want to go :-)

If you think you'd like to grab me during one of these times, please try and call my cell phone ahead of time to confirm; I'm not sure how much I'll be near email tomorrow, but it's probably 0% of the time.

Eve

==================================================================

Date: Wed, 25 Feb 2004 10:11:37 -0500
To: "Eve L. Maler" <Eve.Maler@sun.com>
From: "G. Ken Holman" <gkholman@CraneSoftwrights.com>
Subject: Re: Candidate technical approach to code lists discussed today
Cc: A Gregory <agregory@aeon-llc.com>, Jon Bosak <Jon.Bosak@sun.com>

Thanks so much, Eve, for taking from your busy time for a quick response.

At 2004-02-25 00:55 -0500, Eve L. Maler wrote:
>In talking with Jon, I noted that it's a slight abuse of XML namespaces

Hmmmmm ... given namespaces are used for identification of vocabularies, and supplementary components are utilized for identification, my gut feel was that it was an entirely appropriate use since it would be used to identify a vocabulary consisting of one element or one attribute and that vocabulary is being exploited many times in an instance.

>>(2) - does this strategy have sufficient merit to put in the effort to try and get something brand new into UBL 1.0 (which we are reluctant to do, but it would be nice to the outside world to see that we've accomplished something, and gee this approach sounds very interesting technically and would be a new way of approaching things that we haven't seen before)?
>
>I would say so. Some feedback and questions for you, though:
>
>- I understand the need for XML namespaces here, but am not entirely thrilled about the need for some application to "parse" a namespace URI to get at the parts.

I'm not sure how often an application would need to parse that kind of information, though ... it is identifying the code list but it does not, in itself, constitute code list values or information that I think would be accessed in an application. If an application needed to switch between one code list and another, it would be (I believe) sufficient to check the entire URI string, not selective components.

>It also really does kind of abuse the URN system.

Hmmmmm ... again, I'm a bit perplexed because since URNs are for naming, and we are uniquely identifying code lists by a multi-component name ....

>I'd somewhat rather have a document header wherein there's a structure that allows you to declare the parts in separate attributes or elements (whatever), and attach a namespace of your own choosing to it.

My sensitivity to that is that it "disturbs" (too strong?) the physical structure for an aspect of identification, not information. Though, again, I'm not familiar enough with the applications to know if they would need the supplementary components as information and not just identification.

>- From my reading of the February 8 document, I'm not clear on how the articulated namespaces get constructed by parties other than OASIS/the UBL TC. Surely you don't want non-OASIS people defining urn:oasis:... URIs!

Oh! Why not? Could we not have a UBL URI with a private use indication and a structured order to the supplementary components (though I acknowledge there are no tools to validate constraints on the URI)?

>And I have a sort of similar comment on the construction of schemaLocation values. Surely you can't mandate how and where people store their code list information!

Isn't that a totally separate issue? For validation purposes that information can be kept in schema fragments and isn't needed to satisfy the requirement of having the information in the instance ... I think the schemas can bear the responsibility of schemaLocation information.
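
The distinction drawn earlier in this reply, between identifying a code list by its whole URI string and extracting individual supplementary components from it, can be sketched roughly as follows (an editorial illustration, not part of the original correspondence). The urn:example prefix, the field names, and their order are all hypothetical; the thread leaves the structure of the URI undecided.

```python
# Hypothetical layout: a fixed prefix followed by colon-separated fields.
PREFIX = "urn:example:codelist:"
FIELDS = ("agency", "list_id", "version")

UBL_CURRENCY = PREFIX + "iso:currency:1999"
PRIVATE_CURRENCY = PREFIX + "iso:currency:2002"


def same_code_list(uri_a, uri_b):
    """Identification only: compare the complete URI strings."""
    return uri_a == uri_b


def supplementary_components(uri):
    """Information extraction: split the URI into named fields.

    Only needed if an application wants the components themselves,
    for example to display them to an auditor.
    """
    if not uri.startswith(PREFIX):
        raise ValueError(f"not a recognised code list URI: {uri!r}")
    parts = uri[len(PREFIX):].split(":")
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    return dict(zip(FIELDS, parts))


print(same_code_list(UBL_CURRENCY, PRIVATE_CURRENCY))
print(supplementary_components(PRIVATE_CURRENCY))
```

Switching between code lists needs only the string comparison; the parsing step, which is the part Eve is uneasy about, comes into play only when the components are wanted as data.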

>- There may be a useless old artifact in the February 8 document based on an old idea during my tenure: the double-wrapped element. I don't believe this is required any longer.

I'll look for that.

>I hope these thoughts aren't too disjointed -- I'm knackered myself!

:{)}

>here are my likely free times for the rest of the week (all ET):
>
>Wed 25 Feb: 3-4:30pm, 5:30-8pm
>Thu 26 Feb: 5pm-as late as you want to go :-)

Thanks for this ... I'm not sure how Jon wants the ball taken up at this point.

>If you think you'd like to grab me during one of these times, please try and call my cell phone ahead of time to confirm; I'm not sure how much I'll be near email tomorrow, but it's probably 0% of the time.

Noted, Eve ... thanks so much again for your guidance.

..................... Ken

##################################################################

Date: Tue, 24 Feb 2004 13:01:48 -0800 (PST)
From: jon.bosak@sun.com
To: Bill_Burcham@stercomm.com
Subject: Fwd:

Hello Bill,

I'd like to get your opinion on this if you have a minute. We're going to try to resolve it tomorrow (Wednesday).

The requirement that drives this whole train of thought is that the code list metadata has to be in the instance itself for legal reasons. The best ways we can think of to accomplish this are to put the metadata right on the element itself (as a series of attributes) or to stuff it into a structured namespace URI at the top of the instance. We're tending toward the second alternative, but I'm queasy at the thought that there might be some XSD "feature" that would make this untenable. Our only other alternative at this point would be to say that we're not supporting code list validation in 1.0, which would be a shame considering all the work that has gone into this.

If you could possibly be available to phone in some time on Wednesday, that would be great; please let me know if this is a possibility and, if so, when you could do it.
In any case, a thumbs-up or -down by email would be much appreciated.

Jon

[Copy of message sent by Ken to Eve and Arofan omitted; see above]

==================================================================

From: "Burcham, Bill" <Bill_Burcham@stercomm.com>
To: "'jon.bosak@sun.com'" <jon.bosak@sun.com>
Subject: RE:
Date: Wed, 25 Feb 2004 00:08:59 -0600

Hey Jon. Traveling tonight. Just arrived in Dublin, OH. HQ and all that :)

This is interesting. I'm sure I don't follow the intricacies, but it looks like we want two things:

1. no defaulting via schemas since the defaults aren't documented in the instance doc
2. a "comment" in the instance doc, that an auditor can use to visually verify that a code value comes from a restricted list

(1) seems easy to solve -- just make a UBL rule that says "no defaulting allowed".

Using a namespace URN to carry (2) is interesting, and the brevity-enhancing trick of using namespace prefixes to refer to those URNs elsewhere in the doc is nice. I'm wondering what the net benefit of that is though, relative to the downsides and relative to the alternatives.

While the URN might be a handy place to stash a restricted set of code values, it seems like a bit of a hack. I worry that the list will get really long in some cases and blow out some processor that isn't good at handling really long URLs. Could you get the same result with a stylized comment above the URN? E.g.

  <!-- a comment containing supp components for ISO 1999 currency values. Pertains to next namespace decl... -->
  xmlns:cur=".. A GUID for this 'semantic primitive' "
  ...
  <amount cur:currency='USD'>123.45</amount>

It still gives you the succinct referenceability within the instance doc (via namespace prefix) but might ease the burden on processors (w.r.t. URN length). Also it solves another problem where you've got two codelists w/ the same set of values -- but different meanings (no examples come to mind but I'm sure they're out there :) the GUID solves that.

That's all I can think of right now. Can't call tomorrow unfortunately -- all day meetings. Best regards to you and the other UBL-ers. -Bill