Subject: Re: Maybe it's too late, but...

It's certainly not too late.  The current status is that we have adopted
zeroOrMoreTokens/oneOrMoreTokens, but, as with any issue that we've decided,
the TC can reconsider if anybody comes up with new information or ideas, as
you have here.  I suggest you open a new issue of whether we should adopt
this proposal to replace zeroOrMoreTokens/oneOrMoreTokens.

I very much like your idea.  I think it's extremely clean and elegant.  I
would suggest a name like <tokenized> (goes well with <mixed>) or maybe
<split> (since it does the same job as split() in JavaScript, Perl, C# etc)
or maybe even <list>.

I think this can be made to work for the formalism as well as the
implementation.  At the moment the formalism considers the children of an
element to be a sequence of characters and elements.  We could change this
to be a sequence of strings and elements.  In the children of an element you
would never have two consecutive strings.  However, in a forest/orchard (or
whatever we call the thing we match against a pattern) it would be possible
to have consecutive strings.  The semantics of <data> and <value> would be
that they match a single string.  <text> would match zero or more strings.
The rule for <split> is then simply:

M[[<split> pattern </split>]](a, c, e) iff

+ a is {},
+ c is a single string s, and
+ M[[pattern]]({}, split(s), e)

where the split function takes a single string and returns a sequence of
zero or more strings by splitting on whitespace.
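
As a rough illustration of the rule above, here is a minimal Python sketch.
It assumes attributes are modelled as a dict, children as a list whose
strings are Python str values, and a pattern as a function of (a, c, e).
The names match_split and integers are made up for the example, and the
integer check is only an approximation of xsd:integer.

```python
import re

def split(s):
    """Split one string into zero or more strings on XML whitespace
    (space, tab, CR, LF)."""
    return re.findall(r"[^ \t\r\n]+", s)

def match_split(inner, a, c, e=None):
    """The <split> rule: a is {}, c is a single string s, and the inner
    pattern matches ({}, split(s), e)."""
    return (a == {}
            and len(c) == 1
            and isinstance(c[0], str)
            and inner({}, split(c[0]), e))

def integers(a, c, e):
    """Hypothetical inner pattern: zero or more integer-like strings.
    The regex is an assumption, not the real xsd:integer lexical check."""
    return a == {} and all(
        isinstance(x, str) and re.fullmatch(r"[+-]?[0-9]+", x) is not None
        for x in c
    )

print(match_split(integers, {}, ["1 2 3"]))     # one string, three tokens
print(match_split(integers, {}, ["1", "2"]))    # two strings: <split> rejects
```

Note that the inner pattern sees consecutive strings, which is exactly the
case that cannot arise among the children of an element.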

This provides a simple rationale for our rules restricting <data> and
<value>: they simply identify patterns which nothing can possibly match.

----- Original Message -----
From: "Kohsuke KAWAGUCHI" <kohsuke.kawaguchi@eng.sun.com>
To: "TREX Discussion List" <trex@lists.oasis-open.org>
Sent: Friday, May 25, 2001 7:47 AM
Subject: Maybe it's too late, but...

As for the list of tokens...

My implementation experiment reveals that the following syntax is very
easy for the implementation and has greater expressiveness.

  any RELAX NG pattern except element/attribute.

For example, things like

    <data type="xsd:integer" />


  <!-- foo cannot have element/attribute descendants -->
  <ref name="foo" />

By using this proposal, the current <zeroOrMoreToken> P
</zeroOrMoreToken> is expressed as


From the view point of implementations, a residual of a <token> pattern
by a string token S is defined as

function residual( <token> P </token>,  S ) {
   Let {t1,t2,..., tn} be tokenization of S.

   if( residual( residual( residual( P, t1 ), t2 ) ..., tn ) == <empty/> )
      return <empty/>
   else
      return <notAllowed/>
}

Therefore, <token> has the minimal impact on the complexity of the spec
(and implementations.)
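
The residual function for <token> could be sketched in Python roughly as
follows. This is only an illustration under assumptions: patterns are
modelled as tuples, the constructor names (data, token, group, choice,
one_or_more) are made up for the example, and only a handful of pattern
kinds are covered. One deliberate difference from the pseudocode: instead
of comparing the final residual with <empty/>, the sketch asks whether it
is nullable (can match nothing more), because the residual of a oneOrMore
is a choice that includes <empty/> rather than <empty/> itself.

```python
import re
from functools import reduce

EMPTY = ("empty",)
NOT_ALLOWED = ("notAllowed",)

def data(pred): return ("data", pred)   # pred: str -> bool
def token(p): return ("token", p)

def choice(p, q):
    if p == NOT_ALLOWED: return q
    if q == NOT_ALLOWED: return p
    return ("choice", p, q)

def group(p, q):
    if NOT_ALLOWED in (p, q): return NOT_ALLOWED
    if p == EMPTY: return q
    if q == EMPTY: return p
    return ("group", p, q)

def one_or_more(p):
    return NOT_ALLOWED if p == NOT_ALLOWED else ("oneOrMore", p)

def nullable(p):
    """True if p can match without consuming any further string."""
    kind = p[0]
    if kind == "empty": return True
    if kind == "group": return nullable(p[1]) and nullable(p[2])
    if kind == "choice": return nullable(p[1]) or nullable(p[2])
    if kind == "oneOrMore": return nullable(p[1])
    return False  # notAllowed, data, token

def tokenize(s):
    return re.findall(r"[^ \t\r\n]+", s)  # split on XML whitespace

def residual(p, s):
    """What remains of pattern p after it consumes the string s."""
    kind = p[0]
    if kind == "data":
        return EMPTY if p[1](s) else NOT_ALLOWED
    if kind == "choice":
        return choice(residual(p[1], s), residual(p[2], s))
    if kind == "group":
        first = group(residual(p[1], s), p[2])
        return choice(first, residual(p[2], s)) if nullable(p[1]) else first
    if kind == "oneOrMore":
        return group(residual(p[1], s), choice(p, EMPTY))
    if kind == "token":
        # Fold residual over the tokens of s, starting from the body P.
        rest = reduce(residual, tokenize(s), p[1])
        return EMPTY if nullable(rest) else NOT_ALLOWED
    return NOT_ALLOWED  # empty and notAllowed cannot consume a string

# Stand-in for <data type="xsd:integer"/>; the regex is an approximation.
INT = data(lambda s: re.fullmatch(r"[+-]?[0-9]+", s) is not None)
```

With these definitions, residual(token(group(INT, INT)), "1 2") gives
EMPTY, while residual(token(group(INT, INT)), "1") gives NOT_ALLOWED,
so a sequence of data inside the token pattern is handled without any
special case.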

I think the following are the problems with the current <oneOrMoreToken>.

- First, I thought of the possibility to parse <oneOrMoreToken> as
  the list datatype of XSD. But it is difficult because of the pattern
  like <oneOrMoreToken><ref name="..."/></oneOrMoreToken>

- Then I thought of the possibility to implement a datatype that keeps
  a pattern as its body. In this way, <oneOrMoreToken> can be
  implemented as

function residual( <oneOrMoreToken> P </oneOrMoreToken>,  S ) {
   Let {t0,t1,..., tn-1} be tokenization of S.

   for( i=0; i<n; i++ )
     if( residual( P, ti ) != <empty/> )
       return <notAllowed/>

   return <empty/>
}

- Then I found that there is really no reason to prohibit a sequence of
  data inside <oneOrMoreToken>. And in fact it is useful. The above
  implementation can correctly handle

      <data type="xsd:integer"/>

  The reason why we have to prohibit a sequence of data is we can't know
  how to split one big character sequence into sub-sequences.

  But as you see, oneOrMoreToken knows how to split them. So in fact
  there is no problem.

- By the above reasoning, there is no reason to prohibit plain
  <oneOrMore> within <oneOrMoreToken>. That implies <oneOrMoreToken>
  does not necessarily implement the "one-or-more" semantics.

  Instead, it can simply split one big string into sub-sequences.

- This observation leads me to this proposal.

E-Mail: kohsukekawaguchi@yahoo.com

