relax-ng message

Subject: Alternative approach for key/keyref
From: Murata Makoto <mura034@attglobal.net>
To: relax-ng@lists.oasis-open.org
Date: Wed, 11 Jul 2001 00:32:53 +0900
Having seen 7.4 of the spec, I would like to point out another
possibility for keys and keyrefs.

Most proposals for keys utilize simplified path expressions.  Advanced 
features (e.g., multi-part keys) can be developed on top of such path 
expressions.  Path expressions are easy to evaluate.  Validators can 
easily validate documents.  More importantly, *application programs* 
can easily find out which element is a key.  

However, use of such path expressions together with RELAX NG is a
challenge.  We have to ensure datatypes of keys are unambiguous; that
is, for each key, any element/attribute of this key is required to
have a unique datatype.  How can we do this for RELAX NG grammars, which
may provide more than one interpretation per document?

We can examine datatype ambiguity by rewriting grammars [1].  Given a
grammar and a simple path expression, we create a grammar such that
(1) it accepts exactly what the original grammar accepts and (2) some
of its named patterns are marked, and (3) an element matches a marked
named pattern if and only if that element satisfies the path
expression.  Intuitively speaking, the new grammar simulates the
original grammar (in the bottom-up manner) as well as the path
expression (in the top-down manner).

Formally put, we rewrite grammars while constructing a DFA from path
expressions.  DFA construction is done by computing derivatives.  
We introduce a new named pattern for a named pattern and a state 
of the DFA.

We demonstrate such rewriting.  Consider a path expression
"/doc/foo@number" and a grammar as below:

<grammar ...>

  <start><ref name="doc"/></start>

  <define name="doc">
    <element name="doc">
      <ref name="foo"/>
      <ref name="bar"/>
    </element>
  </define>

  <define name="foo">
    <element name="foo">
      <attribute name="number">
        <data type="xsd:short"/>
      </attribute>
      <zeroOrMore>
        <ref name="foo1"/>
      </zeroOrMore>
    </element>
  </define>

  <define name="foo1">
    <element name="foo">
      <attribute name="number"/>
      <zeroOrMore>
        <ref name="foo1"/>
      </zeroOrMore>
    </element>
  </define>

  <define name="bar">
    <element name="bar">
      <ref name="foo1"/>
    </element>
  </define>

</grammar>


First, we have a start state, say #0.  Our <start> has only one child
<ref name="doc"/>.  There is only one <define> for "doc", which has
<element name="doc"> ...</element> as the child.

The derivative of /doc/foo@number by "doc" is foo@number.  
This derivative corresponds to a new state, say #1.  By combining 
the non-terminal in the original grammr "doc" and #1, we have 
a new non-terminal "doc#1".  A new <start> is:

    <start><ref name="doc#1"/></start>

Now consider the content model of the first <define> element.  It
contains <ref name="foo"/> and <ref name="bar"/>.  The element names
for these two named patterns are "foo" and "bar".

The derivative of foo@number by foo is @number.  This 
derivative corresponds to a new state, say #2.  The   
derivative of /foo@number by bar is the empty set, which 
corresponds to a deadend state, say #3.

Thus, we have a new <define> as below:  

  <define name="doc#1">
    <element name="doc">
      <ref name="foo#2"/>
      <ref name="bar#3"/>
    </element>
  </define>

Now, we consider the content model of the second <define> element.
It has <ref name="foo1"/> and <attribute name="number">.
There is only one <define> for "foo1", which has
<element name="foo"> ...</element> as the child.

The derivative of @number by "foo" is the empty set, which 
corresponds to #3.

  <define name="foo#2">
    <element name="foo">
      <attribute name="number">
        <data type="xsd:short"/>  <!-- match -->
      </attribute>
      <zeroOrMore>
        <ref name="foo1#3"/>
      </zeroOrMore>
    </element>
  </define>

The derivative of @number by @number is the empty pattern, which is a
final state.  Thus, <attribute name="number"> in the above <define> 
matches the given path expression.

Now, we consider the content model of the third <define> element.  As
does the previous <define>, it has <ref name="foo1"/> and
<attribute name="number">.  

The derivative of the empty set by "foo" (which is the element name
for "foo1") is the empty set (#3).

  <define name="foo1#3">
    <element name="foo">
      <key ="
      <attribute name="number"/>
      <zeroOrMore>
        <ref name="foo1#3"/>
      </zeroOrMore>
    </element>
  </define>

Since the derivative of the emptyset by <attribute name="number"/> 
is the empty set, this <attribute> does not match the given path 
expression.

Now, we consider bar#3.  The <define> element for the non-terminal "bar" 
has <element name="bar">.  The content model has 
<ref name="foo2"/>.

The derivateive of the empty set by "foo" (which is the element 
name for "foo2") is the empty set (#3).  Thus, we reference to
a named pattern "foo1#3", which is already defined.

Therefore, we have a new grammar as below:

<grammar...>

  <start><ref name="doc#1"/></start>

  <define name="doc#1">
    <element name="doc">
      <ref name="foo#2"/>
      <ref name="bar#3"/>
    </element>
  </define>

  <define name="foo#2">
    <element name="foo">
      <attribute name="number">  <!-- match -->
        <data type="xsd:short"/>
      </attribute>
      <zeroOrMore>
        <ref name="foo1#3"/>
      </zeroOrMore>
    </element>
  </define>

  <define name="foo1#3">
    <element name="foo">
      <key ="
      <attribute name="number"/>
      <zeroOrMore>
        <ref name="foo1#3"/>
      </zeroOrMore>
    </element>
  </define>

  <define name="bar#3">
    <element name="bar">
      <ref name="foo1#3"/>
    </element>
  </define>

</grammar>

We have found that only one <attribute> in this grammar matches the
given pattern.  Thus, the datatype for "doc/foo@number" is unique.

Suppose that we have changed the original grammar.  We replace 
<ref name="foo1"> in the second <define> with 
<ref name="foo">.

Then, from the path expression and the modified grammar, we get
a grammar as below.  Observe that this rewriting introduces 
two named patterns, namely foo#2 and foo#3.  The former matches 
the pattern, while the latter does not.


<grammar...>

  <start><ref name="doc#1"/></start>

  <define name="doc#1">
    <element name="doc">
      <ref name="foo#2"/>
      <ref name="bar#3"/>
    </element>
  </define>

  <define name="foo#2">
    <element name="foo">
      <attribute name="number">  <!-- match -->
        <data type="xsd:short"/>
      </attribute>
      <zeroOrMore>
        <ref name="foo#3"/>
      </zeroOrMore>
    </element>
  </define>

  <define name="foo#3">
    <element name="foo">
      <attribute name="number">
        <data type="xsd:short"/>
      </attribute>
      <zeroOrMore>
        <ref name="foo#3"/>
      </zeroOrMore>
    </element>
  </define>

  <define name="foo1#3">
    <element name="foo">
      <key ="
      <attribute name="number"/>
      <zeroOrMore>
        <ref name="foo1#3"/>
      </zeroOrMore>
    </element>
  </define>

  <define name="bar#3">
    <element name="bar">
      <ref name="foo1#3"/>
    </element>
  </define>

</grammar>



[1] Makoto Murata, Extended Path Expressions for XML, PODS 2001. 
http://www.trl.ibm.com/projects/xml/hedge/murataPODS01final.pdf

Cheers,

Makoto
Follow-Ups:
- Re: Alternative approach for key/keyref
  - From: James Clark <jjc@jclark.com>