Subject: Re: [clr-dev] Philosophy behind the CVA contracts
At 2011-02-05 09:04 -0600, ericdes wrote:
>I have a question I should have asked myself earlier... How does one
>make the links between his own set of codes and the ones in the CVA
>agreement (which might be different)?
>
>Example: A participant uses internally 'SP' as the country code for
>Spain whereas he received documents with the code 'ES' for Spain (as
>per a CVA agreement). How can the system know both codes relate to
>the same 'Spain'?

Good question. And I have a long essay answer that I hope you find useful; please bear with me while I go into detail. I hope others will pitch in with their thoughts on this topic as well.

There are a couple of different camps in the XML world regarding semantics, or the meaning of information found in markup. Some people believe that XML somehow conveys semantics between the angle brackets. I disagree, in that I believe XML only conveys syntax and not semantics. It is up to implementers of programs and stylesheets to interpret the semantics (meaning) represented by what is found in the syntax (XML).

Content is interpreted by the recipient, and hopefully that interpretation is the same as intended by the sender. Sometimes it is obvious, sometimes it isn't, and sometimes (as in this example) people think it is obvious but there are nuances not considered.

Which isn't to say XML can't help in that interpretation of semantics; I just don't believe that <countryCode>ES</countryCode> means anything more or less than <countryCode>SP</countryCode> means, which is nothing at all. It is up to the programmer/stylesheet writer to say "okay, when the <countryCode> element contains the string 'ES' then I'm going to interpret that as the country of Spain". Another programmer/stylesheet writer can say "okay, when the <countryCode> element contains the string 'SP' then I'm going to interpret that as the country of Spain".
Which illustrates why international bodies like ISO have been standardizing on the concrete codes (strings of characters) that represent globally-agreed-upon abstract semantics such as countries, currencies, payment means, etc. For interoperability, when everyone in the world wants to agree on a single code for Spain, they can all agree on "ES" and no further explanation is necessary because the documentation for the country code "ES" unambiguously specifies the semantics behind what the code means.

That isn't to say other codes cannot be used for Spain, such as "SP" in your example, but is that interoperable if there is no global agreement that "SP" represents Spain? A unilingual Spaniard may have no inkling that "SP" represents Spain. A unilingual German may not see "GE" as representing Germany, since "DE" is the abbreviation of Deutschland, which is German for Germany. A unilingual Dane may not see "DE" as representing Denmark, instead of the ISO "DK", because they'll probably interpret that as "DE" for Germany.

So, to your question of "how can they both be related to be the same thing?" you are getting into the realm of knowledge management, ontologies, taxonomies, topic maps, etc. Fortunately, there are syntactic mechanisms in both genericode and CVA in order to provide enough of this knowledge management to be useful.

First, the syntax. If you create a genericode list of Anglicized country abbreviations, then the code values in your private list are "SP", "GE", "DE", etc. One code per row in the genericode model. Then you create a CVA file that specifies which document contexts you want constrained to your country codes. You use XPath to specify the document context and you use a URI to your genericode file to impose the constraint.

If you want *both* "ES" and "SP" to represent Spain at the syntax level, then in the CVA file you point the one document context to two genericode files, and the validation will handle the union.
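To make the "union" idea concrete, here is a minimal sketch in Python. It is not CVA or genericode tooling: the two sets are hypothetical stand-ins for the code columns of two genericode files, the sample document is invented, and `invalid_values` is an illustrative name. It only shows the effect described above, in which a value in the chosen document context passes if it appears in either list.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-ins for the code values of two genericode lists.
ISO_CODES = {"ES", "DE", "DK"}
ANGLICIZED_CODES = {"SP", "GE", "DE"}

# Invented sample document; only the <countryCode> name comes from the discussion.
DOC = (
    "<order>"
    "<buyer><countryCode>ES</countryCode></buyer>"
    "<seller><countryCode>SP</countryCode></seller>"
    "<shipTo><countryCode>XX</countryCode></shipTo>"
    "</order>"
)

def invalid_values(xml_text, context_path, code_lists):
    """Return values found at context_path that are in none of the code lists."""
    allowed = set().union(*code_lists)  # the union of all associated lists
    root = ET.fromstring(xml_text)
    return [el.text for el in root.findall(context_path) if el.text not in allowed]

rejected = invalid_values(DOC, ".//countryCode", [ISO_CODES, ANGLICIZED_CODES])
print(rejected)
```

Both "ES" and "SP" pass because each appears in one of the two lists; only the value found in neither list is reported. Note that this says nothing about what either code means, which is the point of the rest of this answer.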
Note that my example illustrates the critical importance of code list metadata: the data that identifies the code list from which codes are taken. If in your CVA you are allowing both code lists for <countryCode> and the XML simply says <countryCode>DE</countryCode>, then is the country Denmark or Germany? The user needs to use code list metadata to disambiguate this ambiguous value:

  <countryCode listId="ISO3166-1">DE</countryCode> represents Germany
  <countryCode listId="EricsList">DE</countryCode> represents Denmark

Without that disambiguation, how will your program know, for example, what tariff rate might apply in an importation document?

BTW, this gets really important when the same two-letter code is used over time for two different countries. At one point "CS" represented Czechoslovakia and in 2003 it represented Serbia and Montenegro. In that case the disambiguation needs both the list *and* the publication date of that list to know what "CS" represents. Interestingly, Serbia and Montenegro reverted to "YU" (Yugoslavia) in 2004[1] because of this very confusion, I suspect because programs did not accommodate release dates of country codes.

The CVA/genericode contract between two trading partners exchanging a document dictates which syntactic values are allowed where in the XML. The semantics (meaning) behind those agreed-upon values are part of a separate implicit or explicit agreement between the trading partners, outside of the XML syntax. Both trading partners have to agree, perhaps even with a signature or legal undertaking, on the abstract concept of countries.

Which isn't to say that in CVA/genericode there isn't an opportunity to codify those agreed-upon abstract concepts for programmatic analysis, because there is, and your question/scenario illustrates this very well.
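As an illustration of the code list metadata point made earlier, here is a small Python sketch of a recipient interpreting "DE" only in combination with its listId. The `MEANINGS` table and the `interpret` function are hypothetical, built from the two example lines above; a real program would derive the table from the genericode files themselves.

```python
import xml.etree.ElementTree as ET

# Hypothetical lookup keyed by (list identifier, code value),
# taken from the two example lines in this message.
MEANINGS = {
    ("ISO3166-1", "DE"): "Germany",
    ("EricsList", "DE"): "Denmark",
}

def interpret(xml_fragment):
    """Interpret a <countryCode> element using its listId metadata."""
    el = ET.fromstring(xml_fragment)
    list_id = el.get("listId")
    if list_id is None:
        # Without list metadata the value cannot be disambiguated.
        raise ValueError("ambiguous code %r: no listId given" % el.text)
    return MEANINGS[(list_id, el.text)]

print(interpret('<countryCode listId="ISO3166-1">DE</countryCode>'))
print(interpret('<countryCode listId="EricsList">DE</countryCode>'))
```

For a case like "CS", the key would presumably also need the version or publication date of the list, so that the same list identifier and code can resolve differently over time.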
Note that ISO has yet to publish their codes and associated semantics in genericode, but for the UBL project the UBL TC did this for the ISO list of standardized country codes:

  http://docs.oasis-open.org/ubl/os-UBL-2.0/cl/gc/default/CountryIdentificationCode-2.0.gc

Note in this XML serialization how ISO has augmented the country code table (in the abstract) with both name (in English) and a numeric value for each coded value. This is realized in the genericode table (in the syntax) as three values per row, associating the country name and numeric value with each code.

Why the numeric value? *That* is where the semantics are codified. This might seem confusing on first read: "how are semantics represented by a number?". Well, names for countries change over time, so something was needed as a long-term, linguistically-neutral representation of the concept of a country. If a country changes its name, its alphabetic code might change to reflect cultural requirements dictated by that country. Its numeric code, however, doesn't change because the semantics of that country defined by those boundaries at that location on the globe still is that country.

I've copied four extracts that I found using a Google search, documented in the interpretation of newsletters published by ISO regarding the maintenance of the ISO country code list. Two illustrate a country changing its name, and two illustrate the need to represent a brand new country:

  http://3waylabs.com/zone/iso-country-codes

  List of changes applied, as specified in registration newsletters:
  ...
  Newsletter III-1, 1989-12-5: Burma deleted, Myanmar added
    (same numeric value, change of country name)
  ...
  Newsletter III-7, 1990-08-14 Unification of Yemen, under new numeric code
  ...
  Newsletter III-10, 1990-08-14 Kampuchea deleted, Cambodia added
    (same numeric value, change of name)
  ...
  Newsletter III-13, 1990-10-30 Germany unified (DDR deleted,
    new name and numeric code for unified Germany)
  ...
So, Burma used to be represented as "BU" and is now represented as "MM" but the numeric code "104" never changed. So the meaning of that set of boundaries at that location on the globe has been and still is represented linguistically-neutrally as "104". Similarly, Kampuchea is now Cambodia "KH" but the numeric code "116" did not change.

Note in my excerpts that the unified Yemen and the unified Germany were given new numeric codes because the meaning of the new countries is in fact brand new with a new set of boundaries at a location on the globe. There was no unified Germany before, so the new numeric code "276" was assigned, even though the "DE" code is reused.

So, a program looking at the numeric values associated with country codes in a taxonomy can use the numeric values as the long-term basis for the meaning of countries and have that meaning last longer than using the code abbreviations.

But here is your opportunity to answer your question: "how can both 'ES' and 'SP' be related to be the same thing?" I think your answer is in those numeric codes. If your abstract Anglicized country code list associated "724" with "SP" then your program can see that that is the same "724" associated with "ES" in the ISO country code list.

So, your genericode file would syntactically express in the code-level metadata a column for the equivalent ISO numeric linguistically-neutral country code. If you needed to check if one of your country codes represents the same meaning as one of the ISO country codes, your program would do a simple check of the associated numeric value of each found in each genericode file. If they are the same, you can then conclude you are talking about the same country, regardless of what the code actually is.

Your question and ISO's example of the numeric codes are an excellent illustration of the dissociation of syntax and meaning.
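The "simple check of the associated numeric value" could look something like the following Python sketch. The two dictionaries are hypothetical stand-ins for the code and numeric columns read from each genericode file; the "724" and "276" values come from this message, the numeric "208" for Denmark is added here for illustration, and the private list's contents are assumed to be Eric's Anglicized codes.

```python
# Hypothetical code -> ISO numeric mappings, one per genericode file.
ISO_LIST = {"ES": "724", "DE": "276", "DK": "208"}
ERICS_LIST = {"SP": "724", "GE": "276", "DE": "208"}

def same_country(code_a, list_a, code_b, list_b):
    """True if both codes exist and share the same numeric value."""
    numeric_a = list_a.get(code_a)
    numeric_b = list_b.get(code_b)
    return numeric_a is not None and numeric_a == numeric_b

# "ES" (ISO) and "SP" (private) share numeric 724, so they mean the same country.
print(same_country("ES", ISO_LIST, "SP", ERICS_LIST))
# The same string "DE" means different countries in the two lists.
print(same_country("DE", ISO_LIST, "DE", ERICS_LIST))
```

The comparison never looks at the alphabetic code itself, only at the numeric value each list associates with it, which is why identical strings can differ in meaning and different strings can agree.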
The XML says "ES", the code list metadata specifies the ISO list and its year of publication, the associated numeric code is "724" and your program understands that to be Spain. Another XML says "SP", the code list metadata specifies Eric's list, the associated numeric code is "724" and your program understands that to be Spain.

I doubt, however, that many programs go to this extent to be unambiguous (witness the back-step made by ISO regarding the "CS" code). It must not be so very important to them to be that precise over the very long term of decades of changing representations of long-term semantics.

I hope this has helped. I welcome any questions about this answer so that I can help to clarify anything. And I hope anyone else on the list with a perspective on this will also speak up. Especially if they disagree with what I've written and can correct me.

. . . . . . . . . . . . Ken

[1] http://www.wipo.int/pct/guide/en/gdvol1/annexes/annexk/ax_k.pdf (page 10)

--
Contact us for world-wide XML consulting & instructor-led training
Crane Softwrights Ltd. http://www.CraneSoftwrights.com/c/
G. Ken Holman mailto:gkholman@CraneSoftwrights.com
Legal business disclaimers: http://www.CraneSoftwrights.com/legal