Subject: Differences between xsd:token and xsd:normalizedString
From: "G. Ken Holman" <gkholman@CraneSoftwrights.com>
To: Universal Business Language <ubl@lists.oasis-open.org>
Date: Mon, 15 Jun 2009 07:29:03 -0700

Hi all,

In preparation for a technical discussion in tonight's Pacific call, 
I have some citations here regarding W3C Schema type definitions:

http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/#string
  - a string can have any sequence of valid XML characters

http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/#normalizedString
  - a normalized string cannot have carriage returns, line feeds or tabs
  - a normalized string can have any number of space characters, including
    contiguous sequences of space characters

http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/#token
  - a token cannot have carriage returns, line feeds or tabs
  - a token cannot have leading or trailing space characters
  - a token can have any number of singleton space characters, but not
    any contiguous sequences of more than one space character
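
A minimal sketch of the two value-space rules cited above, written in
Python purely for illustration. The function names are hypothetical,
and the leading/trailing-space check reflects the #token definition in
XML Schema Part 2. A real schema processor applies whitespace
processing to the lexical form before a value is tested, so this only
describes the values themselves:

```python
import re

# Characters excluded from the value space of both types: CR, LF, tab.
_FORBIDDEN = re.compile(r"[\r\n\t]")


def is_normalized_string(value: str) -> bool:
    """True if the value contains no carriage return, line feed or tab."""
    return _FORBIDDEN.search(value) is None


def is_token(value: str) -> bool:
    """True if the value is a normalized string with no leading or
    trailing space and no run of two or more spaces."""
    return (
        is_normalized_string(value)
        and not value.startswith(" ")
        and not value.endswith(" ")
        and "  " not in value
    )
```

Under these checks a value such as "AB  12" (two spaces) is a
normalized string but not a token, while "AB 12" is both.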

So ... I wondered if "token" should really have been called "tokens" 
because the semantics of a token value could be seen as the set of 
singleton-space-separated tokens in a string:  the string has been 
tokenized (reduced to tokens).  All along I've been trusting the name 
to imply that it was a single token when in fact it can contain more 
than one token.  But, then again, the W3C Schema spec is confusing, 
because at the start it claims "token represents tokenized strings" 
while it also states explicitly that the value space of token 
includes strings with singleton spaces.  Which is correct?  There is 
a mailing list where I can ask this, so I did last night and I got a 
brief response this morning:

   http://lists.w3.org/Archives/Public/xmlschema-dev/2009Jun/0032.html
   http://lists.w3.org/Archives/Public/xmlschema-dev/2009Jun/0033.html

Semantically, I think we are still where we want to be with UBL 
because even though most identifiers with spaces will have only one 
space, the entire value is the identifier.  Same with codes that 
users might decide will have spaces in them (who are we to restrict 
existing business practices?).  The value space of our values is not 
a set of space-separated tokens but a singleton value that may 
contain multiple spaces.  And we don't know that our users won't have 
sequences of spaces.  We are, however, asking our users not to use 
carriage returns, line feeds or tabs, which seems reasonable to me.
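
To make the point about sequences of spaces concrete, here is a hedged
sketch of the whitespace processing that XML Schema Part 2 associates
with the two types (whiteSpace "replace" for xsd:normalizedString,
"collapse" for xsd:token).  The facet behaviour comes from the W3C
Recommendation, not from this thread, and the function names and
sample identifier are made up for illustration:

```python
import re


def ws_replace(lexical: str) -> str:
    """whiteSpace="replace": each tab, line feed and carriage return
    becomes a single space; existing spaces are left alone."""
    return re.sub(r"[\t\n\r]", " ", lexical)


def ws_collapse(lexical: str) -> str:
    """whiteSpace="collapse": apply replace, then reduce every run of
    spaces to one space and strip leading and trailing spaces."""
    return re.sub(r" +", " ", ws_replace(lexical)).strip(" ")


# Hypothetical identifier with a deliberate double space.
identifier = "ACME  42"
as_normalized_string = ws_replace(identifier)   # double space kept
as_token = ws_collapse(identifier)              # double space lost
```

With xsd:normalizedString the user's double space survives schema
processing; with xsd:token it would be silently reduced to one space.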

Given the answer I got this morning, it seems to me that indeed 
"token" really is, semantically, "tokens" ... that is, a collection 
of non-white-space token values expressed as a space-separated 
string.  Certainly when our users are expressing a singleton code or 
identifier value containing spaces, this is just a normalized string 
and not a tokenized string according to the published W3C definitions 
cited above; it isn't a set of space-separated values even if the 
expression of that set happens to be the right sequence of characters.
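
The "tokens" reading can be sketched as a simple split on single
spaces.  This is only an illustration of the interpretation discussed
here, with a hypothetical helper name; it is not anything defined by
the W3C Recommendation or by UBL:

```python
def as_tokens(value: str) -> list:
    """Read a value as a list of single-space-separated tokens."""
    return value.split(" ") if value else []


# A value in the xsd:token value space splits cleanly.
clean = as_tokens("AB 12 X")

# A normalized string with a double space does not: the split yields
# an empty item, so it is not a well-formed list of tokens.
ragged = as_tokens("AB  12")
```

Here clean is ["AB", "12", "X"], whereas ragged is ["AB", "", "12"],
which is one way to see that an identifier with arbitrary spacing is a
single value rather than a tokenized list.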

So for the discussion tonight, the choice in UBL 2.0 to use 
xsd:normalizedString instead of xsd:token appears to me to have been 
the right choice because of the cardinality of syntactic values 
implied by the W3C definitions:  xsd:normalizedString is a singleton 
whereas xsd:token with embedded spaces is not.

. . . . . . . . . . . . Ken

--
XSLT/XQuery/XSL-FO hands-on training - Los Angeles, USA 2009-06-08
Crane Softwrights Ltd.          http://www.CraneSoftwrights.com/o/
Training tools: Comprehensive interactive XSLT/XPath 1.0/2.0 video
Video lesson:    http://www.youtube.com/watch?v=PrNjJCh7Ppg&fmt=18
Video overview:  http://www.youtube.com/watch?v=VTiodiij6gE&fmt=18
G. Ken Holman                 mailto:gkholman@CraneSoftwrights.com
Male Cancer Awareness Nov'07  http://www.CraneSoftwrights.com/o/bc
Legal business disclaimers:  http://www.CraneSoftwrights.com/legal