relax-ng message

Subject: Re: Maybe it's too late, but...

From: James Clark <jjc@jclark.com>
To: Kohsuke KAWAGUCHI <kohsuke.kawaguchi@eng.sun.com>,TREX Discussion List <trex@lists.oasis-open.org>
Date: Fri, 25 May 2001 14:11:39 +0700

It's certainly not too late.  The current status is that we have adopted
zeroOrMoreTokens/oneOrMoreTokens, but, as with any issue that we've decided,
the TC can reconsider if anybody comes up with new information or ideas, as
you have here.  I suggest you open a new issue of whether we should adopt
this proposal to replace zeroOrMoreTokens/oneOrMoreTokens.

I very much like your idea.  I think it's extremely clean and elegant.  I
would suggest a name like <tokenized> (goes well with <mixed>) or maybe
<split> (since it does the same job as split() in JavaScript, Perl, C# etc)
or maybe even <list>.

I think this can be made to work for the formalism as well as the
implementation.  At the moment the formalism considers the children of an
element to be a sequence of characters and elements.  We could change this
to be a sequence of strings and elements.  In the children of an element you
would never have two consecutive strings.  However, in a forest/orchard (or
whatever we call the thing we match against a pattern) it would be possible
to have consecutive strings.  The semantics of <data> and <value> would be
that they match a single string.  <text> would match zero or more strings.
The rule for <split> is then simply:

M[[<split> pattern </split>]](a, c, e) iff

+ a is {},
+ c is a single string s, and
+ M[[pattern]]({}, split(s), e)

where the split function takes a single string and returns a sequence of
zero or more strings by splitting on whitespace.

This provides a simple rationale for our rules restricting <data> and
<value>: they simply identify patterns which nothing can possibly match.
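
The rule for <split> given above can be sketched in a few lines of Python.
This is only an illustration under stated assumptions: a pattern is modelled
as a predicate over (attributes, content), content is a list of strings
(elements are left out), the environment e is omitted, and "whitespace" is
taken to mean the four XML whitespace characters. The names split_pattern,
value and text are hypothetical and chosen for this sketch only.

```python
import re

def split(s):
    # Tokens are maximal runs of non-whitespace characters, so an empty or
    # all-whitespace string yields an empty list.
    # Assumption: whitespace is space, tab, CR and LF (XML whitespace).
    return re.findall(r"[^ \t\r\n]+", s)

def value(literal):
    # <value>: matches a single string (simplified to exact comparison).
    return lambda attrs, content: not attrs and content == [literal]

def text():
    # <text>: matches zero or more strings.
    return lambda attrs, content: not attrs and all(
        isinstance(c, str) for c in content)

def split_pattern(inner):
    # <split>: a is {}, c is a single string s, and the inner pattern
    # matches ({}, split(s)).
    def match(attrs, content):
        return (not attrs
                and len(content) == 1
                and isinstance(content[0], str)
                and inner({}, split(content[0])))
    return match
```

Under these assumptions, split_pattern(text()) accepts the single string
"a b c", because split turns it into three consecutive strings, while
split_pattern(value("cm")) rejects "cm cm", because <value> matches exactly
one string.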

----- Original Message -----
From: "Kohsuke KAWAGUCHI" <kohsuke.kawaguchi@eng.sun.com>
To: "TREX Discussion List" <trex@lists.oasis-open.org>
Sent: Friday, May 25, 2001 7:47 AM
Subject: Maybe it's too late, but...



As for the list of tokens...

My implementation experiment shows that the following syntax is very
easy to implement and is more expressive.

<token>
  any RELAX NG pattern except element/attribute.
</token>



For example, things like

<token>
  <oneOrMore>
    <data type="xsd:integer" />
    <value>cm</value>
  </oneOrMore>
</token>

or

<token>
  <!-- foo cannot have element/attribute descendants -->
  <ref name="foo" />
</token>


With this proposal, the current <zeroOrMoreToken> P </zeroOrMoreToken>
is expressed as

<token>
  <zeroOrMore>
    P
  </zeroOrMore>
</token>




From an implementation viewpoint, the residual of a <token> pattern
by a string S is defined as

function residual( <token> P </token>,  S ) {
   Let {t1,t2,..., tn} be tokenization of S.

   if( residual( residual( residual( P, t1 ), t2 ) ..., tn ) == <empty/> )
      return <empty/>
   else
      return <notAllowed/>
}

Therefore, <token> has minimal impact on the complexity of the spec
(and of implementations).
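
The residual function above can be tried out with a small Python sketch.
Everything here is an assumption made for illustration: the pattern classes
are hypothetical and not taken from any RELAX NG implementation, Integer is
a simplified stand-in for <data type="xsd:integer"/>, and tokenization
splits on XML whitespace. One deliberate difference from the pseudo-code:
the final test asks whether the last residual is nullable (can match the
empty sequence) rather than comparing it to <empty/> literally, since the
residual of a <oneOrMore> after a complete item is not syntactically
<empty/>.

```python
import re
from dataclasses import dataclass

# Hypothetical pattern classes, for illustration only.
@dataclass(frozen=True)
class Empty: pass

@dataclass(frozen=True)
class NotAllowed: pass

@dataclass(frozen=True)
class Value:
    literal: str

@dataclass(frozen=True)
class Integer: pass  # simplified stand-in for xsd:integer

@dataclass(frozen=True)
class Group:
    first: object
    second: object

@dataclass(frozen=True)
class Choice:
    left: object
    right: object

@dataclass(frozen=True)
class OneOrMore:
    body: object

@dataclass(frozen=True)
class Token:
    body: object

def nullable(p):
    # True if p can match the empty sequence of tokens.
    if isinstance(p, Empty):
        return True
    if isinstance(p, Group):
        return nullable(p.first) and nullable(p.second)
    if isinstance(p, Choice):
        return nullable(p.left) or nullable(p.right)
    if isinstance(p, OneOrMore):
        return nullable(p.body)
    return False

def residual(p, t):
    # Residual of pattern p by a single token t.
    if isinstance(p, Value):
        return Empty() if t == p.literal else NotAllowed()
    if isinstance(p, Integer):
        return Empty() if re.fullmatch(r"[+-]?[0-9]+", t) else NotAllowed()
    if isinstance(p, Choice):
        return Choice(residual(p.left, t), residual(p.right, t))
    if isinstance(p, Group):
        head = Group(residual(p.first, t), p.second)
        if nullable(p.first):
            return Choice(head, residual(p.second, t))
        return head
    if isinstance(p, OneOrMore):
        return Group(residual(p.body, t), Choice(Empty(), p))
    return NotAllowed()  # Empty and NotAllowed consume nothing

def token_residual(p, s):
    # Residual of <token> P </token> by the string s.
    r = p.body
    for t in re.findall(r"[^ \t\r\n]+", s):  # assumed tokenization
        r = residual(r, t)
    return Empty() if nullable(r) else NotAllowed()
```

With the first example pattern from this message written as
Token(OneOrMore(Group(Integer(), Value("cm")))), the string "10 cm 20 cm"
yields Empty() and "10 cm 20" yields NotAllowed().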


I think the following are the problems with the current <oneOrMoreToken>.

- First, I considered treating <oneOrMoreToken> as the list datatype
  of XSD. But that is difficult because of patterns like
  <oneOrMoreToken><ref name="..."/></oneOrMoreToken>

- Then I considered implementing a datatype that keeps a pattern as
  its body. In this way, <oneOrMoreToken> can be implemented as

function residual( <oneOrMoreToken> P </oneOrMoreToken>,  S ) {
   Let {t0,t1,..., tn-1} be tokenization of S.

   for( i=0; i<n; i++ )
      if( residual( P, ti ) != <empty/> )
         return <notAllowed/>

   return <empty/>
}

- Then I found that there is really no reason to prohibit a sequence of
  data inside <oneOrMoreToken>. And in fact it is useful. The above
  implementation can correctly handle

  <oneOrMoreToken>
    <group>
      <data type="xsd:integer"/>
      <value>cm</value>
    </group>
  </oneOrMoreToken>

  The reason we have to prohibit a sequence of data is that we can't know
  how to split one big character sequence into sub-sequences.

  But as you see, oneOrMoreToken knows how to split them. So in fact
  there is no problem.

- By the above reasoning, there is no reason to prohibit plain
  <oneOrMore> within <oneOrMoreToken>. That implies <oneOrMoreToken>
  does not necessarily implement the "one-or-more" semantics.

  Instead, it can simply split one big string into sub-sequences.

- This observation leads me to this proposal.



regards,
----------------------
K.Kawaguchi
E-Mail: kohsukekawaguchi@yahoo.com

Follow-Ups:
- Re: Maybe it's too late, but...
  - From: MURATA Makoto <mura034@attglobal.net>
- Re: Maybe it's too late, but...
  - From: James Clark <jjc@jclark.com>

References:
- Datatype and identity constraints proposal of the day (17 May)
  - From: James Clark <jjc@jclark.com>
- Maybe it's too late, but...
  - From: Kohsuke KAWAGUCHI <kohsuke.kawaguchi@eng.sun.com>