Re: [ubl] Differences between xsd:token and xsd:normalizedString

2009/6/15 G. Ken Holman <gkholman@cranesoftwrights.com>

Hi all,

In preparation for a technical discussion in tonight's Pacific call, I have some citations here regarding W3C Schema type definitions:

http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/#string
- a string can have any sequence of valid XML characters

http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/#normalizedString
- a normalized string cannot have carriage returns, line feeds or tabs
- a normalized string can have any number of space characters, including
contiguous sequences of space characters

http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/#token
- a token cannot have carriage returns, line feeds or tabs
- a token cannot have leading or trailing space characters
- a token can have any number of singleton space characters, but not
any contiguous sequences of more than one space character
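
As a rough illustration of the two value spaces cited above, here is a minimal Python sketch. The function names are hypothetical and chosen only for this example; the check for leading and trailing spaces comes from the W3C token definition itself rather than from the summary above, and the sketch does not check that every character is a valid XML character.

```python
import re

# Characters excluded from both value spaces:
# tab (#x9), line feed (#xA) and carriage return (#xD).
_FORBIDDEN = re.compile(r"[\t\n\r]")


def is_normalized_string(value: str) -> bool:
    """Return True if value lies in the xsd:normalizedString value space."""
    return _FORBIDDEN.search(value) is None


def is_token(value: str) -> bool:
    """Return True if value lies in the xsd:token value space."""
    if not is_normalized_string(value):
        return False
    # The W3C definition also excludes leading and trailing spaces.
    if value.startswith(" ") or value.endswith(" "):
        return False
    # No internal sequences of two or more spaces.
    return "  " not in value
```

Under these assumptions a value such as "AB  12" (two spaces) is a valid normalizedString but not a valid token, while "AB 12" (one space) is valid as both.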

So ... I wondered whether "token" should really have been called "tokens", because a token value could be read as a set of tokens separated by single spaces: the string has been tokenized (reduced to tokens). All along I have been trusting the name to imply a single token, when in fact a value can contain more than one. Then again, the W3C Schema spec is itself confusing on this point: it opens with "token represents tokenized strings", yet it explicitly states that the value space of token allows single spaces inside a value. Which reading is correct? There is a mailing list for questions like this, so I asked there last night and got a brief response this morning:

http://lists.w3.org/Archives/Public/xmlschema-dev/2009Jun/0032.html
http://lists.w3.org/Archives/Public/xmlschema-dev/2009Jun/0033.html

Semantically, I think we are still where we want to be with UBL: even though most identifiers with spaces will have only one space, the entire value is the identifier. The same goes for codes that users might decide will have spaces in them (who are we to restrict existing business practices?). The value space of our values is not a set of space-separated tokens but a single value that may contain multiple spaces, and we don't know that our users won't have sequences of spaces. We are, however, asking our users not to use carriage returns, line feeds or tabs, which seems reasonable to me.
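
To make the point about sequences of spaces concrete, here is a small Python sketch of the whitespace processing that XML Schema Part 2 assigns to each type through the whiteSpace facet: "replace" for xsd:normalizedString and "collapse" for xsd:token. The facet values are taken from the W3C recommendation, not from the discussion above; the function names and the sample identifier are hypothetical.

```python
import re


def ws_replace(lexical: str) -> str:
    """whiteSpace="replace": each tab, line feed and carriage return becomes a space."""
    return re.sub(r"[\t\n\r]", " ", lexical)


def ws_collapse(lexical: str) -> str:
    """whiteSpace="collapse": replace, then squeeze runs of spaces and trim both ends."""
    replaced = ws_replace(lexical)
    return re.sub(r" {2,}", " ", replaced).strip(" ")


identifier = "AB  12"  # hypothetical identifier with two consecutive spaces

print(repr(ws_replace(identifier)))   # 'AB  12' (as for xsd:normalizedString)
print(repr(ws_collapse(identifier)))  # 'AB 12'  (as for xsd:token)
```

In this sketch the identifier keeps both spaces under "replace", whereas "collapse" reduces them to one, so a schema-aware processor would see a different value under xsd:token than the one the user wrote.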

Given the answer I got this morning, it seems to me that "token" really is, semantically, "tokens" ... that is, a collection of non-white-space token values expressed as a space-separated string. Certainly when our users express a single code or identifier value that contains spaces, that value is a normalized string and not a tokenized string according to the published W3C definitions cited above; it isn't a set of space-separated values, even if its expression happens to be the same sequence of characters.

So for the discussion tonight, the choice in UBL 2.0 to use xsd:normalizedString instead of xsd:token appears to me to have been the right one, because of the cardinality of values implied by the W3C definitions: an xsd:normalizedString value is a singleton, whereas an xsd:token value with embedded spaces is not.

. . . . . . . . . . . . Ken

--
Crane Softwrights Ltd. http://www.CraneSoftwrights.com/o/
G. Ken Holman mailto:gkholman@CraneSoftwrights.com
