relax-ng message

Subject: General concern about inter-grammar references

From: Kohsuke KAWAGUCHI <kohsuke.kawaguchi@eng.sun.com>
To: relax-ng@lists.oasis-open.org
Date: Tue, 10 Jul 2001 20:53:11 -0700


I have been thinking about the conversion from W3C XML Schema and SOX to
RELAX NG.

One of the problems I found is that those languages and RELAX NG differ
in how references work.


RELAX NG
--------
In RELAX NG, a grammar constitutes the basic unit of references; any
reference can be made as long as its source (<ref name="..."/>) and its
target (<define name="..."/>) are in the same grammar.

Grammars can then be organized into a tree structure. A grammar can have
multiple child grammars, but it can have at most one parent grammar.


parent <--           --> child

G1 -+- G2 --- G4
    |
    +- G3 -+- G5
           |
           +- G5'


Sometimes one grammar is loaded several times and therefore the same
grammar can appear more than once (like G5 and G5'). But conceptually
they are treated as different, although a smart implementation may
exploit this equality to achieve higher performance.


An inter-grammar reference is allowed only if its source and its target
are directly connected. So a reference between G1 and G2 is OK, but one
between G2 and G3 is not.

There is a further restriction. A grammar can refer to its parent grammar
as many times as it wants. But the parent grammar can refer to its child
grammar only once. So there is asymmetry here. This asymmetry can be
partially solved by using <withGrammar>. But this resolution is only
partial because you cannot write patterns like this:

<group>
 <choice>
  <ref name="G2:foo"/>  <!-- reference to "foo" in G2 -->
  <ref name="G3:bar"/>
 </choice>
 <choice>
  <ref name="G2:foo"/>
  <ref name="G3:bar"/>
 </choice>
</group>


And references between G2 and G3 or between G5 and G4 are still not
allowed.
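
As a rough illustration of the "directly connected" rule, the tree above can
be modelled as a parent table. This is only a sketch: the PARENT table and
can_reference are hypothetical names, not anything defined by RELAX NG, and
the "only once" restriction on parent-to-child references is not modelled.

```python
# Hypothetical model of the grammar tree drawn above; not part of RELAX NG itself.
# Each grammar maps to its parent (None for the root).
PARENT = {
    "G1": None,
    "G2": "G1",
    "G3": "G1",
    "G4": "G2",
    "G5": "G3",
    "G5'": "G3",  # second load of the same file, treated as a distinct grammar
}


def can_reference(source: str, target: str) -> bool:
    """True if source and target are the same grammar or directly connected."""
    if source == target:
        return True
    return PARENT[source] == target or PARENT[target] == source


for pair in [("G1", "G2"), ("G2", "G3"), ("G5", "G4"), ("G5", "G5'")]:
    print(pair, can_reference(*pair))
```

Only the G1/G2 pair is accepted here; siblings such as G2 and G3, and even
the two loads G5 and G5', are rejected because neither is the other's parent.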



W3C XML Schema / SOX
--------------------
In these languages, there is no basic unit of references. They do have
the concept of "schema", but the "schema" doesn't impose any restriction
on the references. Any inter-schema reference can be made between any
schemata.

A schema is designated by a URI, and a reference is made through a
(URI,local) pair, which is usually represented by a QName.

     S1
    /|\
   / | \
S2 --+-- S3
   \ | /
    \|/
     S4

So unlike RELAX NG, references are not tied to the tree structure.
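
By contrast, resolution through (URI,local) pairs behaves like a lookup in
one global table. The sketch below assumes made-up namespace URIs and
component names; expand_qname and resolve are hypothetical helpers, not part
of either language.

```python
# Hypothetical namespaces and component names, for illustration only.
components = {
    ("http://example.org/s2", "foo"): "definition of foo in S2",
    ("http://example.org/s3", "bar"): "definition of bar in S3",
}


def expand_qname(qname: str, prefixes: dict) -> tuple:
    """Turn a prefixed QName into its (URI, local) pair."""
    prefix, local = qname.split(":", 1)
    return prefixes[prefix], local


def resolve(uri: str, local: str) -> str:
    """Look up a component by its (URI, local) pair, wherever the reference is written."""
    return components[(uri, local)]


prefixes = {"s2": "http://example.org/s2", "s3": "http://example.org/s3"}
print(resolve(*expand_qname("s3:bar", prefixes)))
```

Nothing in the lookup depends on which schema contains the reference, which
is why any schema can refer to any other.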




Consequence(1)
--------------

Let's try to convert those languages into RELAX NG.  Due to the above
difference, we cannot convert each "schema" into its own grammar. As a
result, we are forced to create one monolithic grammar, which contains
all definitions.

This makes name collisions very likely, and avoiding them typically
results in pattern names like "{http://example.org/.../}bar".


Another problem is how to assemble the necessary files. Say the "schema" A
references a definition in B. The problem is, we cannot write
<include href="B.rng"/> in A.rng because C.rng might have a reference to
B, too.

If both A and C have references to B and we write <include href="B.rng"/>
in both A.rng and C.rng, then it causes a collision because B.rng is
included twice.

So converted RELAX NG files cannot contain <include> statements. Instead,
you have to create a hub file yourself that includes all the necessary
files, which is very tedious.
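
The double-include problem can be simulated with a small sketch. It assumes a
simplified model in which an include just copies the definitions of the
included file; the FILES table and the definition names "a", "b", "c" are
hypothetical.

```python
from collections import Counter

# Hypothetical file contents: which files each one includes and which names it defines.
FILES = {
    "A.rng": {"includes": ["B.rng"], "defines": ["a"]},
    "C.rng": {"includes": ["B.rng"], "defines": ["c"]},
    "B.rng": {"includes": [], "defines": ["b"]},
}


def expand(href):
    """Return every definition pulled in by href, repeating a file each time it is included."""
    names = list(FILES[href]["defines"])
    for child in FILES[href]["includes"]:
        names.extend(expand(child))
    return names


def collisions(hrefs):
    """Names that end up defined more than once when all hrefs are included together."""
    counts = Counter(name for href in hrefs for name in expand(href))
    return sorted(name for name, n in counts.items() if n > 1)


def hub_files(hrefs):
    """Each needed file exactly once: the list a hand-written hub file has to contain."""
    seen = []
    stack = list(hrefs)
    while stack:
        href = stack.pop()
        if href not in seen:
            seen.append(href)
            stack.extend(FILES[href]["includes"])
    return sorted(seen)


print(collisions(["A.rng", "C.rng"]))  # B.rng arrives twice, so "b" collides
print(hub_files(["A.rng", "C.rng"]))
```

The second function computes what the author of the hub file has to work out
by hand: the full set of files, each listed once.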




Consequence(2)
--------------

Forget about the conversion and consider three grammars: ext1, ext2, and
base.

"base" contains a set of definitions. "ext1" and "ext2" each add some
extra functionality to the "base" module.

base.rng
<define name="foo">
  <element name="base">...</element>
</define>

ext1.rng
<define name="foo" combine="choice">
  <element name="ext1">...</element>
</define>

ext2.rng
<define name="foo" combine="choice">
  <element name="ext2">...</element>
</define>


Now what you want is for "ext1" and "ext2" each to be usable on its own.
That is, you don't want to have to write

<include href="base.rng"/>
<include href="ext1.rng"/>

to use "ext1". Using "ext1" should be possible with just

<include href="ext1.rng"/>



Using "ext2" should be possible in the same way. Also, you want to write

<include href="ext1.rng"/>
<include href="ext2.rng"/>

to use both. But this cannot be done due to the restriction imposed on
inter-grammar references. I think Jeni Tennison gave us similar
feedback regarding XHTML m12n.





Conclusion
----------

It might be useful to name an included grammar and refer to it by using
the name.


<?xml version="1.0"?>
<grammar>

  <grammarRef name="table" href="..."/>
  <grammarRef name="list" href="..."/>
  
  <define name="foo">
    <choice>
      <ref name="xyz" grammar="table" />
      <ref name="abc" grammar="list" />
    </choice>
  </define>
</grammar>

If there are multiple <grammarRef> elements with the same name, then only
the first one is loaded and the rest are ignored.
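
The "first one wins" rule could be sketched as follows. This is only an
illustration of the proposal, not an existing implementation: the function
load_grammar_refs, the loader callback, and the file names are all
hypothetical.

```python
def load_grammar_refs(refs, loader):
    """refs: (name, href) pairs in document order. Load only the first href seen per name."""
    loaded = {}
    for name, href in refs:
        if name not in loaded:  # later <grammarRef>s with the same name are ignored
            loaded[name] = loader(href)
    return loaded


calls = []


def fake_loader(href):
    """Stand-in for real parsing; records which files were actually loaded."""
    calls.append(href)
    return f"<grammar from {href}>"


# Hypothetical hrefs; the second "table" entry must not be loaded.
refs = [("table", "table.rng"), ("list", "list.rng"), ("table", "other-table.rng")]
grammars = load_grammar_refs(refs, fake_loader)
print(calls)
```

Because each name is bound to one loaded grammar, two files that both declare
the same <grammarRef> would share it instead of colliding.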


I don't know if this proposal works well...

regards,
--
Kohsuke KAWAGUCHI                          +1 650 786 0721
Sun Microsystems                   kohsuke.kawaguchi@sun.com

Follow-Ups:
- Re: General concern about inter-grammar references
  - From: James Clark <jjc@jclark.com>
- Re: General concern about inter-grammar references
  - From: James Clark <jjc@jclark.com>
- Re: General concern about inter-grammar references
  - From: James Clark <jjc@jclark.com>
- Re: General concern about inter-grammar references
  - From: James Clark <jjc@jclark.com>
- Re: General concern about inter-grammar references
  - From: Kohsuke KAWAGUCHI <kohsuke.kawaguchi@eng.sun.com>