[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [Elist Home]
Subject: Alternative algorithm to restrict use of data/string in patterns.
This proposal is a rewrite of http://lists.oasis-open.org/archives/trex/200104/msg00031.html One of my concern to the current restriction is that it's probably not closed to boolean operations(union/intersection/difference). Mine is made to satisfy closedness property, and it exactly exclude those patterns which cause the problem to validators. No more, no less. What my algorithm detects is two consecutive transitions by string in content model automaton. <group> <string>abc</string> <string>def</string> </group> The above pattern is not good because it contains two consecutive transitions by string.("abc" first, then "def"). <group> <string>abc</string> <element name="foo" /> <string>def</string> </group> This pattern doesn't cause a problem for validators (although it may not be a good practice). To be precise, there are actually three types of transitions. The first is "element", which is a transition by XML element. The second is "anyText", which is a transition by <anyString/>. We have treat anyString differently because we allow two consecutive transitions by anyString. The third is "text", which is a transition by <data> or <string>. So what we don't allow is two consectuive transitions by - {"text","text"} - {"text","anyString"} - {"anyString","text"} These are problematic because to perform these transitions, we have to "split" one string into two, so that one is interpreted as "text" and the other as "anyString". The algorithm is basically simple: for all q of states in this automaton 1. what type of transition can enter q 2. what type of transition can leave q If there is a transition to q by "text" and another transition from q by "anyString", then there is a path that accepts {"text","anyString"} transitions. So the algorithm will signal an error in cases like this. My algorithm does this without explicitly constructing automaton. To capture the tri-state "element"/"text"/"anyString", I'd like to introduce "fpdsType". One important operation over fpdsType is "OR" operation. That is, fpdsType||fpdsType -> fpdsType. This operation is commutative: fpds1|fpds2 == fpds2|fpds1. Definition of OR of fpdsType ---------------------------- X|Y == Y|X X|X == X "text"|X == "text" "anyString"|"element" == "anyString" If pattern p can start with transition "element"(X) and "text"(Y), then what matters is "text"(X|Y). Or operation is used to extract what matters. function fpdsType _calcFPDS( pattern p ); This function computes what type of transitions can be possible as the first transition of the automaton that is equivalent to p. For example, _calcFPDS( <optional><element /></optional><data /> ) is "text", because this regexp has a "text" transition from its start state. It also has a "element" transition, but it doesn't matter. However, sometimes pattern is not enough to decide fpds. value of _calcFPDS( <optional><element/></optional> ) depends on the successive pattern. If it is succeeded by <data/> (as follows), _calcFPDS will be "text". <optional> <element /> </optional> <data /> But if it is succeeded by <element/>, then _calcFPDS will be "element". <optional> <element /> </optional> <element /> So to correctly calculate what kind of transitions can be possible, _calcFPDS has to receive fpdsType that succeeds pattern p. that yields function fpdsType calcFPDS( pattern p, fpdsType fpds ); And it is defined as follows: fpdsType calcFPDS( pattern p, fpdsType fpds ) { if(p is "<string>e</string>" or "<data>e</data>") return "text" if(p is "<group>p1 p2</group>") return calcFPDS(p1, calcFPDS(p2,fpds)) if(p is "<oneOrMore> p1 </oneOrMore>") return calcFPDS(p1,fpds) if(p is "<interleave> p1 p2 </interleave>") return calcFPDS(p1,fpds)||calcFPDS(p2,fpds) if(p is "<choice> p1 p2 </choice>") return calcFPDS(p1,fpds)||calcFPDS(p2,fpds) if(p is "<anyString />") return "anyString" if(p is "<attribute />") return fpds if(p is "<element />") return "element" if( p is <empty/> ) return fpds } Then the following function "check" actually enforces "two consecutive string transitions" constraint. function void check( pattern p, fpdsType fpds ); pattern "p" here is a left language that represents a state. "fpds" is a transition that leaves state "p". this algorithm computes what kind of transitions can reach state p, and checks its consistency with fpds. For every "<element> nc p </element>" pattern, check starts with check( p, "element" ) The second parameter is "element" because after reading p, there is no transition by text nor anyString. function void check( pattern p, fpdsType fpds ) { switch( p ) { case: p == <string> ... </string> or <data> ... </data> if( fpds=="anyString" or "text" ) this pattern is illegal! case: p == <group> p1 p2 </group> check( p2, fpds ); check( p1, calcFPDS(p2,fpds) ) case: p == <oneOrMore> p1 </oneOrMore> // <oOM>p1</oOM> => p1 p1? check( p1, fpds | calcFPDS(p1,fpds) ) case: p == <interleave> p1 p2 </interleave> check( p1, fpds ) check( p2, fpds ) if( p1 contains <data/> or <string/>, && p2 contains <anyString/> or <element />, or viceversa) this pattern is illegal! case: p == <choice> p1 p2 </choice> check( p1, fpds ) check( p2, fpds ) case: p == <anyString/> if( fpds=="text" ) this pattern is illegal. case: p == <attribute /> or <element /> do nothing; } Probably these two functions can be combined into one. I hope this one is easier to understand. -- Kohsuke KAWAGUCHI +1 650 786 0721 Sun Microsystems kohsuke.kawaguchi@eng.sun.com
[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [Elist Home]
Powered by eList eXpress LLC