Subject: Re: Maybe it's too late, but...
It's certainly not too late. The current status is that we have adopted zeroOrMoreTokens/oneOrMoreTokens, but, as with any issue that we've decided, the TC can reconsider if anybody comes up with new information or ideas, as you have here. I suggest you open a new issue on whether we should adopt this proposal to replace zeroOrMoreTokens/oneOrMoreTokens.

I very much like your idea. I think it's extremely clean and elegant. I would suggest a name like <tokenized> (goes well with <mixed>) or maybe <split> (since it does the same job as split() in JavaScript, Perl, C# etc.) or maybe even <list>.

I think this can be made to work for the formalism as well as the implementation. At the moment the formalism considers the children of an element to be a sequence of characters and elements. We could change this to be a sequence of strings and elements. In the children of an element you would never have two consecutive strings. However, in a forest/orchard (or whatever we call the thing we match against a pattern) it would be possible to have consecutive strings. The semantics of <data> and <value> would be that they match a single string. <text> would match zero or more strings. The rule for <split> is then simply:

  M[[<split> pattern </split>]](a, c, e) iff
    + a is {},
    + c is a single string s, and
    + M[[pattern]]({}, split(s), e)

where the split function takes a single string and returns a sequence of zero or more strings by splitting on whitespace.

This provides a simple rationale for our rules restricting <data> and <value>: they simply identify patterns which nothing can possibly match.

----- Original Message -----
From: "Kohsuke KAWAGUCHI" <kohsuke.kawaguchi@eng.sun.com>
To: "TREX Discussion List" <trex@lists.oasis-open.org>
Sent: Friday, May 25, 2001 7:47 AM
Subject: Maybe it's too late, but...

As for the list of tokens...

My implementation experiment shows that the following syntax is very easy to implement and has greater expressiveness:
  <token>
    any RELAX NG pattern except element/attribute.
  </token>

For example, things like

  <token>
    <oneOrMore>
      <data type="xsd:integer" />
      <value>cm</value>
    </oneOrMore>
  </token>

or

  <token>
    <!-- foo cannot have element/attribute descendants -->
    <ref name="foo" />
  </token>

With this proposal, the current

  <zeroOrMoreToken> P </zeroOrMoreToken>

is expressed as

  <token> <zeroOrMore> P </zeroOrMore> </token>

From the viewpoint of implementations, the residual of a <token> pattern by a string token S is defined as

  function residual( <token> P </token>, S ) {
    Let {t1,t2,..., tn} be the tokenization of S.
    if( residual( residual( residual( P, t1 ), t2 ) ..., tn ) == <empty/> )
      return <empty/>
    else
      return <notAllowed/>
  }

Therefore, <token> has minimal impact on the complexity of the spec (and of implementations).

I think the following are the problems with the current <oneOrMoreToken>.

- First, I considered parsing <oneOrMoreToken> as the list datatype of XSD. But that is difficult because of patterns like

    <oneOrMoreToken><ref name="..."/></oneOrMoreToken>

- Then I considered implementing a datatype that keeps a pattern as its body. In this way, <oneOrMoreToken> can be implemented as

    function residual( <oneOrMoreToken> P </oneOrMoreToken>, S ) {
      Let {t0,t1,..., tn-1} be the tokenization of S.
      for( i=0; i<n; i++ )
        if( residual( P, ti ) != <empty/> )
          return <notAllowed/>   // some token fails to match P
      return <empty/>            // every token matched P
    }

- Then I found that there is really no reason to prohibit a sequence of data inside <oneOrMoreToken>. And in fact it is useful. The above implementation can correctly handle

    <oneOrMoreToken>
      <group>
        <data type="xsd:integer"/>
        <value>cm</value>
      </group>
    </oneOrMoreToken>

  The reason we have to prohibit a sequence of data is that we can't know how to split one big character sequence into sub-sequences. But as you see, oneOrMoreToken knows how to split them. So in fact there is no problem.
- By the above reasoning, there is no reason to prohibit plain <oneOrMore> within <oneOrMoreToken>. That implies <oneOrMoreToken> does not necessarily have to implement the "one-or-more" semantics. Instead, it can simply split one big string into sub-sequences.

- This observation leads me to this proposal.

regards,
----------------------
K.Kawaguchi
E-Mail: kohsukekawaguchi@yahoo.com
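
The residual function for <token> described above can be sketched in Python roughly as follows. This is only an illustrative sketch under several assumptions, not the implementation discussed in the thread: the class names (Token, Group, OneOrMore, ...) and the helpers group(), choice(), and zero_or_more() are hypothetical, only xsd:integer is supported for <data>, tokenization is taken to be Python's whitespace split(), and the final comparison with <empty/> is replaced by a nullable() check so that a leftover optional pattern such as <zeroOrMore> also counts as a match.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Empty: pass
@dataclass(frozen=True)
class NotAllowed: pass
@dataclass(frozen=True)
class Data: type: str
@dataclass(frozen=True)
class Value: text: str
@dataclass(frozen=True)
class Group: first: object; second: object
@dataclass(frozen=True)
class Choice: left: object; right: object
@dataclass(frozen=True)
class OneOrMore: body: object
@dataclass(frozen=True)
class Token: body: object

EMPTY, NOT_ALLOWED = Empty(), NotAllowed()

def zero_or_more(p):
    # <zeroOrMore> P </zeroOrMore> as a choice between oneOrMore and empty.
    return Choice(OneOrMore(p), EMPTY)

def nullable(p):
    # True if p can match the empty sequence of tokens.
    if isinstance(p, Empty): return True
    if isinstance(p, Group): return nullable(p.first) and nullable(p.second)
    if isinstance(p, Choice): return nullable(p.left) or nullable(p.right)
    if isinstance(p, OneOrMore): return nullable(p.body)
    return False

def group(a, b):
    # Simplifying constructor: drop <empty/>, propagate <notAllowed/>.
    if isinstance(a, NotAllowed) or isinstance(b, NotAllowed): return NOT_ALLOWED
    if isinstance(a, Empty): return b
    if isinstance(b, Empty): return a
    return Group(a, b)

def choice(a, b):
    # Simplifying constructor: a <notAllowed/> branch disappears.
    if isinstance(a, NotAllowed): return b
    if isinstance(b, NotAllowed): return a
    return Choice(a, b)

def residual(p, s):
    # What remains of pattern p after consuming the string s.
    if isinstance(p, Data):
        if p.type != "xsd:integer":
            raise ValueError("only xsd:integer is supported in this sketch")
        return EMPTY if re.fullmatch(r"[+-]?[0-9]+", s) else NOT_ALLOWED
    if isinstance(p, Value):
        return EMPTY if s == p.text else NOT_ALLOWED
    if isinstance(p, Group):
        skipped = residual(p.second, s) if nullable(p.first) else NOT_ALLOWED
        return choice(group(residual(p.first, s), p.second), skipped)
    if isinstance(p, Choice):
        return choice(residual(p.left, s), residual(p.right, s))
    if isinstance(p, OneOrMore):
        return group(residual(p.body, s), zero_or_more(p.body))
    if isinstance(p, Token):
        # Tokenize s, then take the residual of the body by each token in turn.
        rest = p.body
        for t in s.split():
            rest = residual(rest, t)
        return EMPTY if nullable(rest) else NOT_ALLOWED
    return NOT_ALLOWED  # Empty and NotAllowed cannot consume a string

# <token><oneOrMore><data type="xsd:integer"/><value>cm</value></oneOrMore></token>
lengths = Token(OneOrMore(Group(Data("xsd:integer"), Value("cm"))))
print(residual(lengths, "10 cm 20 cm") == EMPTY)      # True
print(residual(lengths, "10 cm 20") == NOT_ALLOWED)   # True
```

Under the same assumptions, Token(zero_or_more(P)) plays the role of <zeroOrMoreToken> P </zeroOrMoreToken>: it accepts the empty string as well as any whitespace-separated list of tokens that each match P.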