relax-ng message

Subject: Alternative algorithm to restrict use of data/string in patterns.
From: Kohsuke KAWAGUCHI <kohsuke.kawaguchi@eng.sun.com>
To: TREX ML <trex@lists.oasis-open.org>
Date: Mon, 30 Apr 2001 14:34:52 -0700

This proposal is a rewrite of
http://lists.oasis-open.org/archives/trex/200104/msg00031.html


One of my concern to the current restriction is that it's probably not
closed to boolean operations(union/intersection/difference).

Mine is made to satisfy closedness property, and it exactly exclude
those patterns which cause the problem to validators. No more, no less.


What my algorithm detects is two consecutive transitions by string in
content model automaton.

<group>
  <string>abc</string>
  <string>def</string>
</group>

The above pattern is not good because it contains two consecutive
transitions by string.("abc" first, then "def").


<group>
  <string>abc</string>
  <element name="foo" />
  <string>def</string>
</group>

This pattern doesn't cause a problem for validators (although it may not
be a good practice).


To be precise, there are actually three types of transitions. The first
is "element", which is a transition by XML element. The second is
"anyText", which is a transition by <anyString/>. We have treat
anyString differently because we allow two consecutive transitions by
anyString. The third is "text", which is a transition by <data> or
<string>.

So what we don't allow is two consectuive transitions by

- {"text","text"}
- {"text","anyString"}
- {"anyString","text"}

These are problematic because to perform these transitions, we have to
"split" one string into two, so that one is interpreted as "text" and
the other as "anyString".


The algorithm is basically simple: 

for all q of states in this automaton

1. what type of transition can enter q
2. what type of transition can leave q

If there is a transition to q by "text" and another transition from q by
"anyString", then there is a path that accepts {"text","anyString"}
transitions. So the algorithm will signal an error in cases like this.

My algorithm does this without explicitly constructing automaton.



To capture the tri-state "element"/"text"/"anyString", I'd like to
introduce "fpdsType".

One important operation over fpdsType is "OR" operation. That is,
fpdsType||fpdsType -> fpdsType.

This operation is commutative: fpds1|fpds2 == fpds2|fpds1.

Definition of OR of fpdsType
----------------------------
X|Y == Y|X
X|X == X

"text"|X              == "text"
"anyString"|"element" == "anyString"

If pattern p can start with transition "element"(X) and "text"(Y), then
what matters is "text"(X|Y). Or operation is used to extract what
matters.



function fpdsType _calcFPDS( pattern p );

This function computes what type of transitions can be possible as the
first transition of the automaton that is equivalent to p.

For example, _calcFPDS( <optional><element /></optional><data /> ) is
"text", because this regexp has a "text" transition from its start state.
It also has a "element" transition, but it doesn't matter.

However, sometimes pattern is not enough to decide fpds.

value of _calcFPDS( <optional><element/></optional> ) depends on the
successive pattern.

If it is succeeded by <data/> (as follows), _calcFPDS will be "text".

<optional>
  <element />
</optional>
<data />

But if it is succeeded by <element/>, then _calcFPDS will be "element".

<optional>
  <element />
</optional>
<element />

So to correctly calculate what kind of transitions can be possible,
_calcFPDS has to receive fpdsType that succeeds pattern p.

that yields

function fpdsType calcFPDS( pattern p, fpdsType fpds );

And it is defined as follows:

fpdsType calcFPDS( pattern p, fpdsType fpds )
{
    if(p is "<string>e</string>" or "<data>e</data>")
      return "text"

    if(p is "<group>p1 p2</group>")
      return calcFPDS(p1, calcFPDS(p2,fpds))

    if(p is "<oneOrMore> p1 </oneOrMore>")
      return calcFPDS(p1,fpds)

    if(p is "<interleave> p1 p2 </interleave>")
      return calcFPDS(p1,fpds)||calcFPDS(p2,fpds)

    if(p is "<choice> p1 p2 </choice>")
      return calcFPDS(p1,fpds)||calcFPDS(p2,fpds)

    if(p is "<anyString />")
      return "anyString"

    if(p is "<attribute />")
      return fpds

    if(p is "<element />")
      return "element"

    if( p is <empty/> )
      return fpds
}



Then the following function "check" actually enforces "two consecutive
string transitions" constraint.


function void check( pattern p, fpdsType fpds );

pattern "p" here is a left language that represents a state.
"fpds" is a transition that leaves state "p".
this algorithm computes what kind of transitions can reach
state p, and checks its consistency with fpds.


For every "<element> nc p </element>" pattern, check starts with

check( p, "element" )

The second parameter is "element" because after reading p, there is no
transition by text nor anyString.


function void check( pattern p, fpdsType fpds ) {
    switch( p ) {
    case: p == <string> ... </string> or <data> ... </data>
        if( fpds=="anyString" or "text" )
            this pattern is illegal!
    
    case: p == <group> p1 p2 </group>
        check( p2, fpds );
        check( p1, calcFPDS(p2,fpds) )
    
    case: p == <oneOrMore> p1 </oneOrMore>
        // <oOM>p1</oOM> => p1 p1? 
        check( p1, fpds | calcFPDS(p1,fpds) )
    
    case: p == <interleave> p1 p2 </interleave>
        check( p1, fpds )
        check( p2, fpds )
        if( p1 contains <data/> or <string/>,
        &&  p2 contains <anyString/> or <element />, or viceversa)
            this pattern is illegal!
    
    case: p == <choice> p1 p2 </choice>
        check( p1, fpds )
        check( p2, fpds )
    
    case: p == <anyString/>
        if( fpds=="text" )
            this pattern is illegal.
    
    case: p == <attribute /> or <element />
        do nothing;
}


Probably these two functions can be combined into one.
I hope this one is easier to understand.

--
Kohsuke KAWAGUCHI                          +1 650 786 0721
Sun Microsystems                   kohsuke.kawaguchi@eng.sun.com
References:
- Restrictions on attribute patterns (attrInAttr)
  - From: Murata Makoto <mura034@attglobal.net>