relax-ng message

Subject: More thoughts on canonical representation / NC equivalence

From: Kohsuke KAWAGUCHI <kohsuke.kawaguchi@eng.sun.com>
To: TREX ML <trex@lists.oasis-open.org>
Date: Mon, 07 May 2001 10:36:05 -0700


NC can be considered as a "black box" that receives (URI,local) pair and
returns Yes/No.

Now consider how can we decide that two NCs are equal:

One brute force approach is to feed all possible (URI,local) pairs to
NCs and see if their answers matches. Of course, there are infinite
number of (URI,local) pairs, so you can't do this.

However, since every NC consists of finite number of basic "rules", if
you select "probe points" nicely, then you *might* be able to tell the
equality (or subset relationship or whatever) by checking the answers of
probe points only.

So now the question is how to find those nice probe points.


Consider the following NC X:

<choice ns="foo">
  <name> a </name>
  <name> b </name>
</choice>


Only namespace URI "foo" and local names a and b are appeared. So it is
highly unlikely that X accepts (html,p) while it rejects (html,b).
"html" is just one of "other namespaces" for this NC.

This gives you a hint: you need a special symbol that represents "other
namespaces" and "other local names". This symbol is called MAGIC in jjc's
algorithm.


Algorithm "possibleNames" is the algorithm that computes probe points
for the given NC.

Probe points for the above X will be 

{ (foo,a),(foo,b) }

So if you have another NC Y and its probe points, then by merging probe
points and checking both by merged points, you can tell the various
relationship of two NCs. That's what jjc did.

Now go back to probe points.

Since they are just probe points, it is somewhat useless until you probe
NC by using them. Then why not annotate the set by the result of probing?

{ (foo,a)=>OK,(foo,b)=>OK }

name pairs not in this probe set are guaranteed to be "NG" (NC doesn't
accept them).

Sometimes, this annotation reveals that the probe points set is bigger
than necessary. For example,

<choice>
  <nsName ns="foo" />
  <anyName/>
</choice>

Probing points:
{ (MAGIC,MAGIC), (foo,a) }

Annotated PP:
{ (MAGIC,MAGIC)=>OK, (foo,MAGIC)=>OK }

If (MAGIC,MAGIC) is OK and (foo,MAGIC) is OK, then there is no need to
probe (foo,MAGIC) anymore. "foo" is just one of "other namespaces".

{ (MAGIC,MAGIC)=>OK }

There are several other rules to make the set smaller.


Now you see that annotated probe points are not just probe points --
rather, it is NC itself! This leads to the concept of "canonical
representation". This annotated set looks like uniquely representing a
NC, doesn't it.

In fact, it does.


One of the goodness of this canonical representation is that you can
construct XML of name class from this canonical representation. Moreover,
constructed XML is always simple.



--
Kohsuke KAWAGUCHI                          +1 650 786 0721
Sun Microsystems                   kohsuke.kawaguchi@eng.sun.com

References:
- Re: Guarded normal form (part 1)
  - From: James Clark <jjc@jclark.com>
- canonical representation of NC.
  - From: Kohsuke KAWAGUCHI <kohsuke.kawaguchi@eng.sun.com>