Subject: Re: [clr-dev] Philosophy behind the CVA contracts
From: "G. Ken Holman" <gkholman@CraneSoftwrights.com>
To: clr-dev@lists.oasis-open.org
Date: Sat, 05 Feb 2011 11:49:16 -0500

At 2011-02-05 09:04 -0600, ericdes wrote:
>I have a question I should have asked myself earlier... How does one 
>make the links between his own set of codes and the ones in the CVA 
>agreement (which might be different)?
>
>Example: A participant uses internally 'SP' as the country code for 
>Spain whereas he received documents with the code 'ES' for Spain (as 
>per a CVA agreement). How can the system know both codes relate to 
>the same 'Spain'?

Good question.  And I have a long essay answer that I hope you find 
useful; please bear with me while I go into detail.  I hope others 
will pitch in with their thoughts on this topic as well.

There are a couple of different camps in the XML world regarding 
semantics, or the meaning of information found in markup.  Some 
people believe that XML somehow conveys semantics between the angle 
brackets.  I disagree, in that I believe XML only conveys syntax and 
not semantics.  It is up to implementers of programs and stylesheets 
to interpret the semantics (meaning) represented by what is found in 
the syntax (XML).  Content is interpreted by the recipient, and 
hopefully that interpretation is the same as intended by the 
sender.  Sometimes it is obvious, sometimes it isn't, and sometimes 
(as in this example) people think it is obvious but there are nuances 
not considered.

That isn't to say XML can't help in that interpretation of 
semantics; I just don't believe that <countryCode>ES</countryCode> 
means anything more or less than <countryCode>SP</countryCode> means, 
which is nothing at all.  It is up to the programmer/stylesheet 
writer to say "okay, when the <countryCode> element contains the 
string 'ES' then I'm going to interpret that as the country of 
Spain".  Another programmer/stylesheet writer can say "okay, when the 
<countryCode> element contains the string 'SP' then I'm going to 
interpret that as the country of Spain".

Which illustrates why international bodies like ISO have been 
standardizing on the concrete codes (strings of characters) that 
represent globally-agreed-upon abstract semantics such as countries, 
currencies, payment means, etc.  For interoperability, when everyone 
in the world wants to agree on a single code for Spain, they can all 
agree on "ES" and no further explanation is necessary because the 
documentation for the country code "ES" unambiguously specifies the 
semantics behind what the code means.

That isn't to say other codes cannot be used for Spain, such as "SP" 
in your example, but is that interoperable if there is no global 
agreement that "SP" represents Spain?  A unilingual Spaniard may have 
no inkling that "SP" represents Spain.  A unilingual German may not 
see "GE" as representing Germany, since "DE" is the abbreviation of 
Deutschland, which is German for Germany.  A unilingual Dane may not 
see "DE" as representing Denmark, in place of the ISO "DK", because 
they will probably read "DE" as Germany.

So, to your question of "how can they both be related to be the same 
thing?" you are getting into the realm of knowledge management, 
ontologies, taxonomies, topic maps, etc.  Fortunately, there are 
syntactic mechanisms in both genericode and CVA in order to provide 
enough of this knowledge management to be useful.

First, the syntax.  If you create a genericode list of Anglicized 
country abbreviations, then the code values in your private list are 
"SP", "GE", "DE", etc.  One code per row in the genericode model.

Then you create a CVA file that specifies which document contexts 
you want constrained to your country codes.  You use XPath to specify 
the document context and you use a URI to your genericode file to 
impose the constraint.  If you want *both* "ES" and "SP" to represent 
Spain at the syntax level, then in the CVA file you point the one 
document context to two genericode files, and the validation will 
handle the union.
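
As a rough sketch of what "the validation will handle the union" 
means, the following treats each genericode file as a set of code 
values and accepts a value if any list bound to the context contains 
it.  The set names, the abbreviated contents and the use of a plain 
element name instead of an XPath address are assumptions made for 
illustration, not part of CVA or genericode.

```python
# Hypothetical in-memory stand-ins for two genericode files; the names and
# the abbreviated contents are assumptions made for illustration only.
ISO_CODES = {"ES", "DE", "DK"}        # small subset of the ISO alphabetic codes
PRIVATE_CODES = {"SP", "GE", "DE"}    # Anglicized private list from the example

# One document context (an XPath address in a real CVA file) bound to both
# lists, mirroring a CVA context that points to two genericode files.
CONTEXTS = {"countryCode": [ISO_CODES, PRIVATE_CODES]}


def is_allowed(context: str, value: str) -> bool:
    """Return True when the value appears in any list bound to the context."""
    return any(value in code_list for code_list in CONTEXTS.get(context, []))


print(is_allowed("countryCode", "ES"))  # found in the ISO list
print(is_allowed("countryCode", "SP"))  # found in the private list
print(is_allowed("countryCode", "XX"))  # found in neither list
```

Note that "DE" passes this check from either list, which is exactly 
why the syntax check alone cannot say which country is meant.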

Note that my example illustrates the critical importance of code list 
metadata:  the data that identifies the code list from which codes 
are taken.  If in your CVA you are allowing both code lists for 
<countryCode> and the XML simply says <countryCode>DE</countryCode>, 
then is the country Denmark or Germany?  The user needs to use code 
list metadata to disambiguate this ambiguous value:

   <countryCode listId="ISO3166-1">DE</countryCode> represents Germany
   <countryCode listId="EricsList">DE</countryCode> represents Denmark

Without that disambiguation, how will your program know, for example, 
what tariff rate might apply in an importation document?  BTW, this 
gets really important when the same two-letter code is used over time 
for two different countries.  At one point "CS" represented 
Czechoslovakia and in 2003 it represented Serbia and Montenegro.  In 
that case the disambiguation needs both the list *and* the 
publication date of that list to know what "CS" 
represents.  Interestingly, Serbia and Montenegro reverted to "YU" 
(Yugoslavia) in 2004[1] because of this very confusion, I suspect 
because programs did not accommodate release dates of country codes.
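
A minimal sketch of that disambiguation, assuming a program keeps a 
lookup keyed on the list identifier, the list's publication label and 
the code value.  The list identifiers come from the example above; 
the version labels and the table itself are hypothetical and stand in 
for whatever metadata the trading partners actually agree on.

```python
from typing import Optional

# Keys are (list identifier, list version label, code value).  The version
# labels are hypothetical placeholders, not real ISO edition identifiers.
MEANINGS = {
    ("ISO3166-1", "current", "DE"): "Germany",
    ("EricsList", "current", "DE"): "Denmark",
    ("ISO3166-1", "before-2003", "CS"): "Czechoslovakia",
    ("ISO3166-1", "2003", "CS"): "Serbia and Montenegro",
}


def interpret(list_id: str, version: str, code: str) -> Optional[str]:
    """Return the agreed meaning of a code, or None when it is unknown."""
    return MEANINGS.get((list_id, version, code))


print(interpret("ISO3166-1", "current", "DE"))
print(interpret("EricsList", "current", "DE"))
print(interpret("ISO3166-1", "2003", "CS"))
```

The code value alone is never the key; dropping either the list or 
the version from the key makes at least one of these entries 
ambiguous.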

The CVA/genericode contract between two trading partners exchanging a 
document dictates which syntactic values are allowed where in the 
XML.  The semantics (meaning) behind those agreed-upon values are 
part of a separate implicit or explicit agreement between the trading 
partners, outside of the XML syntax.  Both trading partners have to 
agree, perhaps even with a signature or legal undertaking, on the 
abstract concept of countries.  Which isn't to say that in 
CVA/genericode there isn't an opportunity to codify those agreed-upon 
abstract concepts for programmatic analysis, because there is and 
your question/scenario illustrates this very well.

Note that ISO has yet to publish their codes and associated semantics 
in genericode, but for the UBL project the UBL TC did this for the 
ISO list of standardized country codes:

http://docs.oasis-open.org/ubl/os-UBL-2.0/cl/gc/default/CountryIdentificationCode-2.0.gc

Note in this XML serialization how ISO has augmented the country code 
table (in the abstract) with both name (in English) and a numeric 
value for each coded value.  This is realized in the genericode table 
(in the syntax) as three values per row, associating the country name 
and numeric value with each code.

Why the numeric value?  *That* is where the semantics are 
codified.  This might seem confusing on first read:  "how are 
semantics represented by a number?".  Well, names for countries 
change over time, so something was needed as a long-term, 
linguistically-neutral representation of the concept of a 
country.  If a country changes its name, its alphabetic code might 
change to reflect cultural requirements dictated by that 
country.  Its numeric code, however, doesn't change because the 
semantics of that country defined by those boundaries at that 
location on the globe still is that country.

I've copied four extracts that I found using a Google search, 
documented in the interpretation of newsletters published by ISO 
regarding the maintenance of the ISO country code list.  Two 
illustrate a country changing its name, and two illustrate the need 
to represent a brand new country:

   http://3waylabs.com/zone/iso-country-codes
   List of changes applied, as specified in registration newsletters:
   ...
   Newsletter III-1, 1989-12-5:
     Burma deleted, Myanmar added (same numeric value, change of country name)
   ...
   Newsletter III-7, 1990-08-14
     Unification of Yemen, under new numeric code
   ...
   Newsletter III-10, 1990-08-14
     Kampuchea deleted, Cambodia added (same numeric value, change of name)
   ...
   Newsletter III-13, 1990-10-30
     Germany unified (DDR deleted, new name and numeric code for
     unified Germany)
   ...

So, Burma used to be represented as "BU" and is now represented as 
"MM" but the numeric code "104" never changed.  So the meaning of 
that set of boundaries at that location on the globe has been and 
still is represented linguistically-neutrally as "104".  Similarly, 
Kampuchea "KA" is now Cambodia "KH" but the numeric code 116 did not change.

Note in my excerpts that the unified Yemen and the unified Germany 
were given new numeric codes because the meaning of the new countries 
is in fact brand new with a new set of boundaries at a location on 
the globe.  There was no unified Germany before, so the new numeric 
code "276" was assigned, even though the "DE" code is reused.

So, a program looking at the numeric values associated with country 
codes in a taxonomy can use the numeric values as the long-term basis 
for the meaning of countries and have that meaning last longer than 
using the code abbreviations.
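
To illustrate, here is a small sketch of using the numeric value to 
carry a retired alphabetic code forward to its current equivalent.  
The two dictionaries are hypothetical, heavily abbreviated stand-ins 
for an older and a newer edition of the country code list, holding 
only the Burma/Myanmar example from the newsletter extract.

```python
from typing import Optional

# Hypothetical, abbreviated editions of the list: alphabetic code -> numeric.
OLD_EDITION = {"BU": "104"}   # Burma
NEW_EDITION = {"MM": "104"}   # Myanmar, same numeric value


def current_code(old_code: str) -> Optional[str]:
    """Map a code from the old edition to the new one via the numeric value."""
    numeric = OLD_EDITION.get(old_code)
    if numeric is None:
        return None  # the old edition does not know this code
    for code, value in NEW_EDITION.items():
        if value == numeric:
            return code
    return None  # no entry in the new edition carries this numeric value


print(current_code("BU"))
```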

But here is your opportunity to answer your question:  "how can both 
'ES' and 'SP' be related to be the same thing?"  I think your answer 
is in those numeric codes.  If your abstract Anglicized country code 
list associated "724" with "SP" then your program can see that that 
is the same "724" associated with "ES" in the ISO country code list.

So, your genericode file would syntactically express in the 
code-level metadata a column for the equivalent ISO numeric 
linguistically-neutral country code.

If you needed to check if one of your country codes represents the 
same meaning as one of the ISO country codes, your program would do a 
simple check of the associated numeric value of each found in each 
genericode file.  If they are the same, you can then conclude you are 
talking about the same country, regardless of what the code actually is.
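
The check could be sketched as follows.  The two XML strings are 
pared-down, genericode-like fragments: the namespace-qualified root 
element, the identification metadata and the column definitions of a 
real genericode file are omitted, and the column identifiers "code" 
and "numericcode" as well as the function names are assumptions made 
for this example.

```python
import xml.etree.ElementTree as ET

ISO_GC = """<CodeList><SimpleCodeList>
  <Row>
    <Value ColumnRef="code"><SimpleValue>ES</SimpleValue></Value>
    <Value ColumnRef="numericcode"><SimpleValue>724</SimpleValue></Value>
  </Row>
  <Row>
    <Value ColumnRef="code"><SimpleValue>DE</SimpleValue></Value>
    <Value ColumnRef="numericcode"><SimpleValue>276</SimpleValue></Value>
  </Row>
</SimpleCodeList></CodeList>"""

PRIVATE_GC = """<CodeList><SimpleCodeList>
  <Row>
    <Value ColumnRef="code"><SimpleValue>SP</SimpleValue></Value>
    <Value ColumnRef="numericcode"><SimpleValue>724</SimpleValue></Value>
  </Row>
  <Row>
    <Value ColumnRef="code"><SimpleValue>DE</SimpleValue></Value>
    <Value ColumnRef="numericcode"><SimpleValue>208</SimpleValue></Value>
  </Row>
</SimpleCodeList></CodeList>"""


def numeric_by_code(gc_xml: str) -> dict:
    """Build a mapping of alphabetic code -> numeric code from the rows."""
    mapping = {}
    for row in ET.fromstring(gc_xml).iter("Row"):
        cells = {v.get("ColumnRef"): v.findtext("SimpleValue")
                 for v in row.findall("Value")}
        mapping[cells["code"]] = cells["numericcode"]
    return mapping


def same_country(code_a: str, list_a: dict, code_b: str, list_b: dict) -> bool:
    """Two codes mean the same country when their numeric values match."""
    numeric_a = list_a.get(code_a)
    return numeric_a is not None and numeric_a == list_b.get(code_b)


iso = numeric_by_code(ISO_GC)
private = numeric_by_code(PRIVATE_GC)
print(same_country("SP", private, "ES", iso))  # both carry 724
print(same_country("DE", private, "DE", iso))  # 208 versus 276
```

The second call shows the other half of the point: identical strings 
from different lists are not treated as the same country unless the 
numeric values agree.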

Your question and ISO's example of the numeric codes are an excellent 
illustration of the dissociation of syntax and meaning.  The XML says 
"ES", the code list metadata specifies the ISO list and its year of 
publication, the associated numeric code is "724" and your program 
understands that to be Spain.  Another XML says "SP", the code list 
metadata specifies Eric's list, the associated numeric code is "724" 
and your program understands that to be Spain.

I doubt, however, that many programs go to this extent to be 
unambiguous (witness the back-step made by ISO regarding the "CS" 
code).  It must not be so very important to them to be that precise 
over the very long term of decades of changing representations of 
long-term semantics.

I hope this has helped.  I welcome any questions about this answer so 
that I can help to clarify anything.

And I hope anyone else on the list with a perspective on this will 
also speak up.  Especially if they disagree with what I've written 
and can correct me.

. . . . . . . . . . . .  Ken

[1] http://www.wipo.int/pct/guide/en/gdvol1/annexes/annexk/ax_k.pdf (page 10)

--
Contact us for world-wide XML consulting & instructor-led training
Crane Softwrights Ltd.          http://www.CraneSoftwrights.com/c/
G. Ken Holman                 mailto:gkholman@CraneSoftwrights.com
Legal business disclaimers:  http://www.CraneSoftwrights.com/legal