Subject: Re: Maybe it's too late, but...
It's certainly not too late. The current status is that we have adopted
zeroOrMoreTokens/oneOrMoreTokens, but, as with any issue that we've decided,
the TC can reconsider if anybody comes up with new information or ideas, as
you have here. I suggest you open a new issue of whether we should adopt
this proposal to replace zeroOrMoreTokens/oneOrMoreTokens.
I very much like your idea. I think it's extremely clean and elegant. I
would suggest a name like <tokenized> (goes well with <mixed>) or maybe
<split> (since it does the same job as split() in JavaScript, Perl, C# etc)
or maybe even <list>.
I think this can be made to work for the formalism as well as the
implementation. At the moment the formalism considers the children of an
element to be a sequence of characters and elements. We could change this
to be a sequence of strings and elements. In the children of an element you
would never have two consecutive strings. However, in a forest/orchard (or
whatever we call the thing we match against a pattern) it would be possible
to have consecutive strings. The semantics of <data> and <value> would be
that they match a single string. <text> would match zero or more strings.
The rule for <split> is then simply:
M[[<split> pattern </split>]](a, c, e) iff
+ a is {},
+ c is a single string s, and
+ M[[pattern]]({}, split(s), e)
where the split function takes a single string and returns a sequence of
zero or more strings by splitting on whitespace.
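As a rough illustration of the rule above, here is a minimal Python sketch.
It assumes "whitespace" means the four XML whitespace characters (the rule
itself does not say), and match_inner is a hypothetical callable standing in
for M[[pattern]]; the environment e is left out for brevity.

```python
import re

# Assumption: "whitespace" means the XML whitespace characters.
XML_WHITESPACE = re.compile(r"[ \t\r\n]+")


def split(s):
    """Split one string into a sequence of zero or more non-empty strings."""
    return [tok for tok in XML_WHITESPACE.split(s) if tok]


def matches_split(match_inner, attributes, children):
    """Sketch of M[[<split> pattern </split>]](a, c, e), without e.

    match_inner(attributes, children) is a hypothetical stand-in for
    M[[pattern]].
    """
    # a must be the empty attribute set.
    if attributes:
        return False
    # c must be a single string s.
    if len(children) != 1 or not isinstance(children[0], str):
        return False
    # The inner pattern is matched against the split strings.
    return match_inner({}, split(children[0]))
```

Note that split may return consecutive strings, which is exactly the case
that can occur in a forest but never in the children of an element.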
This provides a simple rationale for our rules restricting <data> and
<value>: they simply identify patterns which nothing can possibly match.
----- Original Message -----
From: "Kohsuke KAWAGUCHI" <kohsuke.kawaguchi@eng.sun.com>
To: "TREX Discussion List" <trex@lists.oasis-open.org>
Sent: Friday, May 25, 2001 7:47 AM
Subject: Maybe it's too late, but...
As for the list of tokens...
My implementation experiment reveals that the following syntax is very
easy for the implementation and has greater expressiveness.
<token>
any RELAX NG pattern except element/attribute.
</token>
For example, things like
<token>
<oneOrMore>
<data type="xsd:integer" />
<value>cm</value>
</oneOrMore>
</token>
or
<token>
<!-- foo cannot have element/attribute descendants -->
<ref name="foo" />
</token>
By using this proposal, the current <zeroOrMoreToken> P
</zeroOrMoreToken> is expressed as
<token>
<zeroOrMore>
P
</zeroOrMore>
</token>
From the viewpoint of implementations, the residual of a <token> pattern
by a string token S is defined as
function residual( <token> P </token>, S ) {
  Let {t1,t2,..., tn} be tokenization of S.
  if( residual( residual( residual( P, t1 ), t2 ) ..., tn ) == <empty/> )
    return <empty/>
  else
    return <notAllowed/>
}
Therefore, <token> has minimal impact on the complexity of the spec
(and of implementations).
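To make the residual idea concrete, here is a small Python sketch under
stated assumptions. The pattern classes, token_residual and nullable are
hypothetical names, not taken from any implementation; is_integer is only a
simplified stand-in for xsd:integer; and the final check uses nullability
rather than a literal comparison with <empty/>, since the residual of a
<oneOrMore> is generally not syntactically <empty/>.

```python
import re
from dataclasses import dataclass
from typing import Callable


class Pattern:
    pass

@dataclass(frozen=True)
class Empty(Pattern):
    pass

@dataclass(frozen=True)
class NotAllowed(Pattern):
    pass

@dataclass(frozen=True)
class Value(Pattern):
    text: str

@dataclass(frozen=True)
class Data(Pattern):
    accepts: Callable[[str], bool]

@dataclass(frozen=True)
class Group(Pattern):
    first: Pattern
    second: Pattern

@dataclass(frozen=True)
class Choice(Pattern):
    left: Pattern
    right: Pattern

@dataclass(frozen=True)
class OneOrMore(Pattern):
    body: Pattern

@dataclass(frozen=True)
class Token(Pattern):
    body: Pattern


def nullable(p):
    """True if p matches the empty sequence of tokens."""
    if isinstance(p, Empty):
        return True
    if isinstance(p, Group):
        return nullable(p.first) and nullable(p.second)
    if isinstance(p, Choice):
        return nullable(p.left) or nullable(p.right)
    if isinstance(p, OneOrMore):
        return nullable(p.body)
    return False


def token_residual(p, tok):
    """Residual of p after consuming the single token tok."""
    if isinstance(p, Value):
        return Empty() if p.text == tok else NotAllowed()
    if isinstance(p, Data):
        return Empty() if p.accepts(tok) else NotAllowed()
    if isinstance(p, Choice):
        return Choice(token_residual(p.left, tok), token_residual(p.right, tok))
    if isinstance(p, Group):
        head = Group(token_residual(p.first, tok), p.second)
        if nullable(p.first):
            return Choice(head, token_residual(p.second, tok))
        return head
    if isinstance(p, OneOrMore):
        return Group(token_residual(p.body, tok), Choice(Empty(), p))
    return NotAllowed()  # Empty and NotAllowed consume nothing


def residual(p, s):
    """Residual of a Token pattern by the string s: Empty() or NotAllowed()."""
    rest = p.body
    for tok in s.split():  # tokenization of S
        rest = token_residual(rest, tok)
    return Empty() if nullable(rest) else NotAllowed()


def is_integer(tok):
    # Simplified stand-in for xsd:integer.
    return re.fullmatch(r"[+-]?[0-9]+", tok) is not None


# The <token><oneOrMore> integer "cm" </oneOrMore></token> example.
lengths = Token(OneOrMore(Group(Data(is_integer), Value("cm"))))
```

With these definitions, residual(lengths, "1 cm 2 cm") yields Empty(),
while residual(lengths, "1 cm 2") yields NotAllowed().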
I think the following are the problems with the current <oneOrMoreToken>.
- First, I thought of the possibility to parse <oneOrMoreToken> as
the list datatype of XSD. But it is difficult because of the pattern
like <oneOrMoreToken><ref name="..."/></oneOrMoreToken>
- Then I thought of the possibility to implement a datatype that keeps
a pattern as its body. In this way, <oneOrMoreToken> can be
implemented as
  function residual( <oneOrMoreToken> P </oneOrMoreToken>, S ) {
    Let {t0,t1,..., tn-1} be tokenization of S.
    for( i=0; i<n; i++ )
      if( residual( P, ti ) != <empty/> )
        return <notAllowed/>
    return <empty/>
  }
- Then I found that there is really no reason to prohibit a sequence of
data inside <oneOrMoreToken>. And in fact it is useful. The above
implementation can correctly handle
<oneOrMoreToken>
<group>
<data type="xsd:integer"/>
<value>cm</value>
</group>
</oneOrMoreToken>
The reason why we have to prohibit a sequence of data is we can't know
how to split one big character sequence into sub-sequences.
But as you see, oneOrMoreToken knows how to split them. So in fact
there is no problem.
- By the above reasoning, there is no reason to prohibit plain
<oneOrMore> within <oneOrMoreToken>. That implies <oneOrMoreToken>
does not necessarily implement the "one-or-more" semantics.
Instead, it can simply split one big string into sub-sequences.
- This observation leads me to this proposal.
regards,
----------------------
K.Kawaguchi
E-Mail: kohsukekawaguchi@yahoo.com