relax-ng message

Subject: Re: [relax-ng] Algebraic characterization of RELAX NG patterns
From: James Clark <jjc@jclark.com>
To: Murata Makoto <mura034@attglobal.net>, relax-ng@lists.oasis-open.org
Date: Mon, 22 Oct 2001 10:15:00 +0700
Thanks for taking the time to explain this.  I found it quite hard to
understand this.  I *think* I do now understand it.  Let me check my
understanding by attempting a more informal, intuitive explanation.
Hopefully, Murata-san will correct me if I have misunderstood.

Suppose we have some string regular expression R.  We can consider
this to define an equivalence relation over strings, where two strings
are equivalent if they always affect the match of R in exactly the
same way.  More precisely, two strings x and y are equivalent if and
only if for all strings pref and suff, either both pref + x + suff and
pref + y + suff match R or neither match R (+ means string
concatentation).  Now we can consider the abstract structure whose
members are equivalence classes of strings under this equivalence
relation, and which has a single operation corresponding to
concatenation on the underlying strings.  This abstract structure will
have the properties of a "monoid" i.e. there is an identity
(corresponding to the equivalence class containing the empty string)
and the operation is associative.  We also have a mapping from strings
to this abstract structure: each string maps to the equivalence class
that contains it.

Suppose we have a language that is not describable by a regular
expression, e.g. the classic example of the set of strings with
balanced parentheses.  We can consider the monoid for this in the same
way as with the regular expression.  Since the string with N open
parentheses will affect a match differently from the string with N+1
open parentheses, the monoid in this case will have an infinite number
of members.

In fact, a language having a corresponding monoid that is finite is
exactly equivalent to the language being regualar.

Now to extend this to RELAX NG we have to deal with

(a) attributes

(b) the fact that we have an infinite set of possible
attribute/element names

To deal with (b), we want to set things up so that we are dealing with
a finite alphabet just as in the string regular expression case. To do
this, instead of dealing particular names we deal with classes of
names that are equivalent as far as a RELAX NG schema is concerned.
For example, if we have

<element name="foo">
   <element name="bar">
     <empty/>
   </element>
  <zeroOrMore>
    <element>
      <anyName/>
      <empty/>
    </element>
  </zeroOrMore>
</element>

then we have three classes of names:

1. foo
2. bar
3. anything other than foo or bar

Thus we can treat this as if we had three names, for example, foo,
bar, baz; in dealing with a particular instance, we map any name that
isn't foo or bar onto baz.

This brings on to (a).  Attributes differ from elements in two ways:

1. Order isn't significant
2. You can't affect two attributes with the same name

There's no problem dealing with 1.  But our solution to (b) has an
impact on 2.  The solution to (b) makes us deal with classes of names
rather than individual names. Although we cannot have two attributes
with the same name, we can have two attributes whose name is from the
same class.  So what happens to 2?  This is where the proposed
restriction comes in: if we require that all infinite name classes are
in one or more, then we can replace 2 by

2'. The number of occurrences isn't significant (i.e. if you have one
attribute from a particular name class, having another attribute from
the same name class won't affect the match)

This is consistent with the solution to (b).


--On 21 October 2001 22:55 +0900 Murata Makoto <mura034@attglobal.net> 
wrote:

> Algebraic characterization of RELAX NG patterns
>
> 2001 10/21
> Murata Makoto
> EB2M-MRT@asahi-net.or.jp
>
> 1. Introduction
>
> I have argued that infinite-nameclass <attribute> should be allowed only
> when  it is a descendant of <oneOrMore>.  With this restriction, we have
> a nice  characterization theorem of RELAX NG patterns.
>
> 2. Regular languages and syntactic monoids
>
> We first consider regular languages.  It is well known that regular
> expressions and finite automata exactly capture regular languages.
> Meanwhile, another interesting characterization is given by syntactic
> monoids.
>
> In preparation, let $\Sigma$ be a finite set.  The set of strings over
> $\Sigma$ is denoted by $\Sigma^*$.
>
> Theorem 1:
>
> A subset $L$ of $\Sigma^*$ is regular if and only if
> $L$ is the inverse image of $X$ under $\phi$, where
>
>
>    - $N$ is a finite monoid; that is,
>
>         - $N$ is a finite set,
>
>         - $N$ has a binary operation $\circ$,
>
>         - $\circ$ has an identity $0$ (i.e.,
>           $0 \circ n =  n \circ 0 = n$ for every $n$
>           in $N$), and
>
>         - $\circ$ is associative; that is,
>           $(s_1 \circ s_2) \circ s_3 = s_1 \circ (s_2 \circ s_3)$,
>
>     - $\phi$ is a homomorphism from $\Sigma^*$ to $N$; that is,
>       $\phi(u v) = \phi(u) \circ \phi(v)$ for every $u, v$ in $\Sigma^*$,
>       and
>
>     - $X$ is a subset of $N$.
>
>
> 3. Generalization for RELAX NG patterns
>
> For sake of simplicity, we ignore <text>, <list>, <data>, <value>.  We
> also assume that any pattern is allowed as the child of <start>.
>
> In prepartion, let $\Delta$ be a finite set of disjoint name classes
> for attributes, and let $\Sigma$ be a finite set of disjoint name
> classes for elements.
>
> We can nicely generalize Theorem 1 as below.
>
> Theorem 2:
>
> A subset $L$ of $(2 ^ \Delta) \times \Sigma ^*$ is described by a
> RELAX NG pattern over $\Delta$ and $\Sigma$ if and only if $L$ is the
> inverse image of $X$ under $\xi \oplus \phi$, where
>
>     - $M$ is a finite set such that
>
>         - $M$ has a binary operation $+$,
>
>         - $+$ has an identity $0$ (i.e.,
>           $0 + m =  m + 0 = m$ for every $m$
>           in $M$),
>
>         - $+$ is associative; that is,
>           $(m_1 + m_2) + m_3 = m_1 + (m_2 + m_3)$,
>           and
>
>         - $+$ is commutative; that is,
>           $m_1 + m_2 = m_2 + m_1$, and
>
>         - $+$ is idempotent; that is, $m_1 + m_1 = m_1$
>
>     - $\xi$ is a homomorphism from $2^\Delta$ to $M$; that is,
>       $\xi(\Delta_1 \cup \Delta_2) = \xi(\Delta_1) + \xi(\Delta_2)$
>       for every subset $\Delta_1, \Delta_2$ of $\Sigma^*$
>
>     - $N$ is a finite monoid; that is,
>
>         - $N$ is a finite set,
>
>         - $N$ has a binary operation $\circ$,
>
>         - $\circ$ has an identity $1$ (i.e.,
>           $1 \circ n =  n \circ 1 = n$ for every $n$
>           in $N$), and
>
>         - $\circ$ is associative; that is,
>           $(n_1 \circ n_2) \circ n_3 = n_1 \circ (n_2 \circ n_3)$.
>
>     - $\phi$ is a homomorphism from $\Sigma^*$ to $N$, that is,
>       $\phi(u v) = \phi(u) \circ \phi(v)$ for every $u, v$ in $\Sigma^*$
>
>     - $\xi \oplus \phi$ is a mapping from $2^\Delta \times \Sigma^*$
>       to $M \times N$ defined as $\xi \oplus \phi(\Delta_1, u) =
>       (\xi(\Delta_1), \phi(, u))$
>
>     - $X$ is a subset of $M \times N$.
>
>
> Note: Observe that idempotency does not hold, if we allow
> infinite-nameclass <attribute> without an ascending <oneOrMore>,
>
> ----
> Murata Makoto EB2M-MRT@asahi-net.or.jp
>
> ----------------------------------------------------------------
> To subscribe or unsubscribe from this elist use the subscription
> manager: <http://lists.oasis-open.org/ob/adm.pl>
>
>
Follow-Ups:
- Re: [relax-ng] Algebraic characterization of RELAX NG patterns
  - From: MURATA Makoto <EB2M-MRT@asahi-net.or.jp>
References:
- [relax-ng] Algebraic characterization of RELAX NG patterns
  - From: Murata Makoto <mura034@attglobal.net>