relax-ng message

Subject: Maybe it's too late, but...

From: Kohsuke KAWAGUCHI <kohsuke.kawaguchi@eng.sun.com>
To: TREX Discussion List <trex@lists.oasis-open.org>
Date: Thu, 24 May 2001 17:47:40 -0700


As for the list of tokens...

My implementation experiments show that the following syntax is very
easy to implement and is more expressive.

<token>
  any RELAX NG pattern except element/attribute.
</token>



For example, things like

<token>
  <oneOrMore>
    <data type="xsd:integer" />
    <value>cm</value>
  </oneOrMore>
</token>

or

<token>
  <!-- foo cannot have element/attribute descendants -->
  <ref name="foo" />
</token>


With this proposal, the current <zeroOrMoreToken> P
</zeroOrMoreToken> is expressed as

<token>
  <zeroOrMore>
    P
  </zeroOrMore>
</token>




From the viewpoint of implementations, the residual of a <token> pattern
by a string token S is defined as

function residual( <token> P </token>,  S ) {
   Let {t1,t2,..., tn} be tokenization of S.
   
   if( residual( residual( residual( P, t1 ), t2 ) ..., tn ) == <empty/> )
      return <empty/>
   else
      return <notAllowed/>
}

Therefore, <token> has minimal impact on the complexity of the spec
(and of implementations).
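
The residual computation above can be sketched in Python. This is only an
illustration under several assumptions: patterns are plain tuples, tokens
are split on whitespace, is_integer is a loose stand-in for xsd:integer,
and the final "== <empty/>" test is replaced by a nullability check (so
that a leftover such as an optional repetition still counts as a match).
All function names are hypothetical.

```python
# Minimal derivative-style sketch; names and representation are illustrative.
EMPTY = ("empty",)
NOT_ALLOWED = ("notAllowed",)

def value(text):
    return ("value", text)

def data(check):
    return ("data", check)

def group(p, q):
    # Simplify while building so residuals stay small.
    if NOT_ALLOWED in (p, q):
        return NOT_ALLOWED
    if p == EMPTY:
        return q
    if q == EMPTY:
        return p
    return ("group", p, q)

def choice(p, q):
    if p == NOT_ALLOWED:
        return q
    if q == NOT_ALLOWED:
        return p
    return ("choice", p, q)

def one_or_more(p):
    return NOT_ALLOWED if p == NOT_ALLOWED else ("oneOrMore", p)

def zero_or_more(p):
    return choice(one_or_more(p), EMPTY)

def nullable(p):
    """True if the pattern can match zero remaining tokens."""
    kind = p[0]
    if kind == "empty":
        return True
    if kind == "group":
        return nullable(p[1]) and nullable(p[2])
    if kind == "choice":
        return nullable(p[1]) or nullable(p[2])
    if kind == "oneOrMore":
        return nullable(p[1])
    return False

def residual(p, tok):
    """Residual of pattern p after consuming one token."""
    kind = p[0]
    if kind == "value":
        return EMPTY if tok == p[1] else NOT_ALLOWED
    if kind == "data":
        return EMPTY if p[1](tok) else NOT_ALLOWED
    if kind == "choice":
        return choice(residual(p[1], tok), residual(p[2], tok))
    if kind == "group":
        first = group(residual(p[1], tok), p[2])
        if nullable(p[1]):
            return choice(first, residual(p[2], tok))
        return first
    if kind == "oneOrMore":
        return group(residual(p[1], tok), choice(p, EMPTY))
    return NOT_ALLOWED  # <empty/> and <notAllowed/> accept no token

def token_residual(p, s):
    """Residual of <token> p </token> by the string s."""
    for tok in s.split():
        p = residual(p, tok)
    return EMPTY if nullable(p) else NOT_ALLOWED

def is_integer(tok):
    # Loose stand-in for xsd:integer, for illustration only.
    try:
        int(tok)
        return True
    except ValueError:
        return False

# <token><oneOrMore> integer "cm" </oneOrMore></token>
lengths = one_or_more(group(data(is_integer), value("cm")))
print(token_residual(lengths, "10 cm 25 cm"))  # ('empty',)
print(token_residual(lengths, "10 cm 25"))     # ('notAllowed',)
```

The only <token>-specific code is token_residual; everything else is the
ordinary residual machinery, which is the point made above about the
small impact on implementations.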


I think the following are the problems with the current <oneOrMoreToken>.

- First, I considered parsing <oneOrMoreToken> as the list datatype
  of XSD. But that is difficult because of patterns
  like <oneOrMoreToken><ref name="..."/></oneOrMoreToken>

- Then I considered implementing a datatype that keeps
  a pattern as its body. In this way, <oneOrMoreToken> can be
  implemented as

function residual( <oneOrMoreToken> P </oneOrMoreToken>,  S ) {
   Let {t0,t1,..., tn-1} be the tokenization of S.
   
   for( i=0; i<n; i++ )
      if( residual( P, ti ) != <empty/> )
         return <notAllowed/>
   
   return <empty/>
}

- Then I found that there is really no reason to prohibit a sequence of
  data inside <oneOrMoreToken>. And in fact it is useful. The above
  implementation can correctly handle
  
  <oneOrMoreToken>
    <group>
      <data type="xsd:integer"/>
      <value>cm</value>
    </group>
  </oneOrMoreToken>
  
  The reason we have to prohibit a sequence of data is that we can't know
  how to split one big character sequence into sub-sequences.
  
  But as you see, oneOrMoreToken knows how to split them. So in fact
  there is no problem.

- By the above reasoning, there is no reason to prohibit plain
  <oneOrMore> within <oneOrMoreToken>. That implies <oneOrMoreToken>
  does not necessarily implement the "one-or-more" semantics.
  
  Instead, it can simply split one big string into sub-sequences.
  
- This observation leads me to this proposal.
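
The per-token check from the second bullet above can be sketched in
Python. This is a rough illustration, not the actual implementation: it
assumes tokens are split on whitespace, that P can be modelled as a
predicate on a single token, that True stands for <empty/> and False for
<notAllowed/>, and that an empty token list is rejected. The names
one_or_more_token and is_integer are hypothetical.

```python
def one_or_more_token(matches_token, s):
    """Accept s only if every whitespace-separated token matches P."""
    tokens = s.split()
    # Assumption: "one or more" requires at least one token.
    if not tokens:
        return False
    # Any token that P rejects makes the whole string fail.
    return all(matches_token(tok) for tok in tokens)

def is_integer(tok):
    # Loose stand-in for xsd:integer, for illustration only.
    try:
        int(tok)
        return True
    except ValueError:
        return False

print(one_or_more_token(is_integer, "1 2 3"))  # True
print(one_or_more_token(is_integer, "1 cm"))   # False
```

Here the loop only does the splitting; what each piece must look like is
left entirely to P, which is the observation the proposal builds on.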



regards,
----------------------
K.Kawaguchi
E-Mail: kohsukekawaguchi@yahoo.com

Follow-Ups:
- Re: Maybe it's too late, but...
  - From: James Clark <jjc@jclark.com>

References:
- Datatype and identity constraints proposal of the day (17 May)
  - From: James Clark <jjc@jclark.com>