ubl-lcsc message

Subject: Re: [ubl-lcsc] Code sets for Document and LineItem Status
From: jon.bosak@sun.com
To: cheekai@softml.net
Date: Fri, 17 Oct 2003 17:26:26 -0700 (PDT)

I think that Chee Kai is right but that we can't touch this right
now.

I think that Chee Kai is right because a single code list cannot
be straightforwardly represented in XML in a case-insensitive way.
XML is inherently case-sensitive.  And the reason for that is that
the upper and lower case characters are different Unicode code
points.  The concept of case does not exist in the majority of the
world's languages; only in some alphabetic scripts is a semantic
identity imagined to inhere in codes that happen to be 32
positions apart.  XML does not natively make this assumption.  If
we were writing the codes in Chinese we would not be talking about
this.  So the argument for normalizing case assumes that case is
significant.
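
To make the code-point argument concrete, here is a minimal Python
sketch. The sample values ("en", "EN", and a two-character Chinese
string) are chosen only for illustration and come from no UBL
schema; the plain string comparison stands in for the exact,
case-sensitive matching that XML applies.

```python
# Upper- and lower-case ASCII letters are distinct Unicode code points.
upper, lower = "A", "a"
offset = ord(lower) - ord(upper)  # 97 - 65

# An exact (case-sensitive) comparison treats "en" and "EN" as different values.
same_when_compared_exactly = ("en" == "EN")

# Scripts without case, such as Chinese, are unaffected by case mapping.
han = "中文"
han_unchanged = (han.upper() == han and han.lower() == han)

print(offset, same_when_compared_exactly, han_unchanged)  # 32 False True
```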

However... it follows a fortiori that we have to respect the case
that these things have been given by their maintainers.  An
example near to hand is the language code list (ISO 639).  As I
pointed out to Ken a few days ago, in the actual printed
legally-paid-for paper version of ISO 639, the codes are in lower
case.  So if we say that the codes are now case-sensitive, that
is, that a difference in case signifies a semantic distinction,
then that list is going to have to stay in lower case to maintain
the semantics intended by the standard.  And if we're going to
take the opposite position and make the codes case-insensitive,
then case doesn't matter and there's no reason to change anything;
users can make the case anything they like, and implementers will
just have to bit-mask the difference.
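
What "bit-masking the difference" might look like is sketched below
in Python, assuming the codes contain only ASCII letters and digits.
The names ASCII_CASE_BIT, fold_ascii, and same_code are hypothetical
helpers invented for this example, not part of any UBL or ISO
tooling.

```python
ASCII_CASE_BIT = 0x20  # the single bit separating 'A' (0x41) from 'a' (0x61)


def fold_ascii(code: str) -> str:
    """Lower-case ASCII letters by setting the case bit; leave other characters alone."""
    return "".join(
        chr(ord(ch) | ASCII_CASE_BIT) if "A" <= ch <= "Z" else ch
        for ch in code
    )


def same_code(a: str, b: str) -> bool:
    """Compare two code values while ignoring ASCII case."""
    return fold_ascii(a) == fold_ascii(b)


print(same_code("en", "EN"))  # True
print(same_code("en", "fr"))  # False
```

The mask is valid only for ASCII letters. Codes that may contain
other alphabetic scripts would need a full Unicode case folding
(Python's str.casefold(), for instance) rather than a single bit.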

The suggestion to use numeric codes is interesting but doesn't
really solve the problem.  The problem is that mnemonic codes are
troublesome and represent a significant investment of intellectual
effort.  Numeric codes eliminate this effort by abandoning the
goal of easy recognition by some large body of users.  Numeric
codes are, in short, an admission of defeat.  Maybe in the end
that turns out to be the best we can do, but I'm not yet ready to
throw in the towel without considering a couple of obvious
alternatives.

I will note in passing that the numeric version of a list is
actually an entirely separate code whose members just happen to
map to the same referents.  I don't see a technical difference
between a numeric code list packaged with the alphabetic version
by ISO and a numeric code list developed by some separate agency
with a mandate to resolve its list to the same values.  (Hmmm.)  I
think this means that the alpha and numeric versions of a standard
that provides both have to be modeled as structurally distinct
lists, and users are going to have to explicitly agree on which of
these they are using, just as they would if the lists were
maintained by separate agencies.  Thus it is demonstrated that in
talking about alternatives, we are not in danger of losing the
purity of a single code list for each application; the standards
bodies themselves have already done that by providing officially
sanctioned, logically distinct variants.  They already require
users to choose among alternatives and convey that decision to
their trading partners.
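
One way to picture the alphabetic and numeric variants as
structurally distinct lists is the Python sketch below. The three
ISO 4217 currency entries are used only as a familiar example; the
CodeList class, the list identifiers, and the resolve function are
assumptions made for illustration, not a proposed UBL model.

```python
from dataclasses import dataclass


@dataclass
class CodeList:
    list_id: str  # hypothetical identifier that trading partners agree on
    codes: dict   # code value -> referent


# Two separate lists whose members happen to map to the same referents.
ALPHA = CodeList(
    "iso4217-alpha",
    {"USD": "US dollar", "EUR": "Euro", "JPY": "Japanese yen"},
)
NUMERIC = CodeList(
    "iso4217-numeric",
    {"840": "US dollar", "978": "Euro", "392": "Japanese yen"},
)


def resolve(code_list: CodeList, value: str) -> str:
    """Look a value up in the list the partners agreed on; raises KeyError otherwise."""
    return code_list.codes[value]


print(resolve(ALPHA, "USD") == resolve(NUMERIC, "840"))  # True
print("840" in ALPHA.codes)                              # False
```

A value is meaningful only together with the identifier of the list
it was drawn from, which is the explicit agreement between partners
described above.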

Anne says:

| After having spent some time looking for code list values on the
| web recently I would propose that we want to keep a safe distance
| from creating the appearance that we are maintaining code lists in
| any way.  Some of the suggested changes below may take us too much
| in that direction - might make it appear as though we are
| maintaining these codes, since it will be obvious looking at the
| values that we have changed them.

I agree.  I think that we should, insofar as possible, be ripping
the codes right out of the UN files and pasting those code points
verbatim into the code schemas.  This should also be the easiest
thing to do, which I find a significant point in favor of this
strategy.

Anne adds:

| Hopefully the TC that Jon mentioned will come about sooner than
| later.

Yeah, but we don't have the resources to be forming another TC
right now.

We need a set of consistent, royalty-free code lists for basic
trade parameters that we can make universally available to support
UBL.  I can think of several ways to do this legally and
relatively painlessly (meaning that someone like me could crank
out semantically identical replacements for the lists we've been
trying to work with at a rate of 50-100 elements a day) without
abandoning mnemonics for a large subset of users.  Perhaps
something along the lines suggested by Tim:

> Fortunately, I think we have some flexibility here.  With these
> EDIFACT codes, we can be 'based upon' .  So a middle ground would
> be to use the 'short description' e.g. the words
> "Accepted","Conditionally Accepted" and "Rejected".  I think this
> is better than inventing our own terms.

Maybe not that algorithm, precisely, but something like that.  It
should be remembered that no one can copyright an idea, only its
form of expression.  There's nothing private about the semantics
of these code lists; their meanings belong to everyone.
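
As a rough sketch of "something like that", the Python below derives
a mnemonic code value from each short description quoted by Tim. The
derivation rule (join the words, capitalizing each one) and the
function name code_from_description are assumptions made for this
example; they are not an agreed algorithm.

```python
import re


def code_from_description(description: str) -> str:
    """Build a space-free mnemonic token from a short description."""
    words = re.findall(r"[A-Za-z0-9]+", description)
    return "".join(word[:1].upper() + word[1:] for word in words)


descriptions = ["Accepted", "Conditionally Accepted", "Rejected"]
codes = [code_from_description(d) for d in descriptions]

# Derived codes are only usable if no two descriptions collapse to the same token.
assert len(set(codes)) == len(codes)

print(codes)  # ['Accepted', 'ConditionallyAccepted', 'Rejected']
```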

Perhaps this is something that someone could take on over the
winter break.  How many elements, roughly, would we guess are
contained in the standard lists that we've identified as basic to
UBL operation (aside from the two that we believe we've been
licensed to use by ISO)?  What's the size of this task, really?

Jon