relax-ng message

Subject: Re: [relax-ng] What's ambiguous for Relax NG?
From: "MURATA Makoto (FAMILY Given)" <EB2M-MRT@asahi-net.or.jp>
To: "Eric van der Vlist" <vdv@dyomedea.com>
Date: Sun, 06 Jul 2003 00:05:20 +0900

On Wed, 2 Jul 2003 22:02:10 +0200 (CEST)
"Eric van der Vlist" <vdv@dyomedea.com> wrote:

> W3C XML Schema has two different constraints to insure "determism".

Unambiguity  or determinism in RELAX NG is a very tricky issue.  We have unambiguity 
of regular expressions, unambiguity of regular hedge grammars, and datatype unambiguity.

First, there are three concepts about unambiguity or determinism of regular expressions: 
strongly-unambiguous, weakly-unammbiguous, and deterministic.  Informally, 
"deterministic" requires that you have one choice at a given point.  Meanwhile, 
"strongly-unambious" and "weakly-unambiguous" allows you to have more than 
one choice at a given point, but requires that, after you examine the entire 
string, you have one choice.  The difference between "strongly-unambiguous" 
and "weakly-unambiguous" is ambiguity about *.  For example, A** is weakly-
unambiguous, while it is not strongly-unambiguous.  More about this, see 

ftp://ftp.informatik.uni-freiburg.de/documents/papers/brueggem/habil.ps

which is referenced from XML 1.0.  WXS requires deterministic content models.

> start = element book{ (odd,even)*,odd? }

This is not deterministic, but is strongly-unambiguous.


> start = element book{ empty|odd|((odd,even)+,odd?) }

The same.  This is not deterministic, but is strongly-unambiguous.

> odd=element odd-page{empty}
> even=element even-page{empty}
> 
> }
> 
> Even if this second flavor cannot be written as WXS either, I don't think
> we can consider it as ambiguous since for any valid document we can say
> without ambiguity which branch of the alternative matches the instance
> document.

I do not understand your point.  It is true that if we have an 
empty sequence we have to choose the first branch (empty) and 
that if we have a sequence <odd/> we have to choose the second (odd). 
But if we have a sequence <odd/><even/><odd/>...., we do not know
whether we should choose the first odd or second odd in the third branch
((odd,even)+,odd?) when we encounter the second <odd/>.  

Second, we have unambiguity or determinism of regular hedge grammars.  Again, 
unambiguity allows you to make a decision after you examine the entire tree, 
while determinism requires that you make a decision at a given point.
The following is obviously ambiguous.

start = doc1 | doc 2
doc1 = element foo {empty}
doc2 = element foo {empty}

Moreover, there are at least four variations of determinism of  regular 
hedge grammars.  Deterministic top-down hedge automata, deterministic bottom-up hedge automata, 
deterministic top-down with one look-ahead (or single-type), and Kohsuke's 
determinism, which allows you to determine the type of an element when you 
encounter its END tag.  I think that the first and second are only theoretically 
interesting.  A variation of the third one is used by WXS.

> The same applies to patterns such as:
> 
> element foo{attribute bar{empty}} | element foo{element bar{empty}}
> 
> which can be rewritten as:
> 
> element foo{attribute bar{empty} | element bar{empty}}

I would say that this is ambiguity of regular tree languages.

Third, we have datatype unambiguity.  The choice of int and boolean is 
ambiguous, since 0 and 1 belongs to both.

We can combine these three kinds of unambiguity and introduce quite 
a few variations of unambiguity!  I would argue that different 
tools require different variations of ambiguity.  It does make 
sense to allow ambiguous regular expressions but require 
deterministic top-down hedge automata with one look-ahead.

> I am wondering if, given the flexibility of Relax NG, there are any cases
> left of ambiguous schemas which cannot be rewritten in a non ambiguous
> form with the exception of ambiguous data types such as:
> element foo{xsd:boolean|xsd:integer}

1) datatype ambiguity

Hopeless.

2) Unambiguity or determinism of regular expressions

Any regular expression can be rewritten as a strongly-unambiguous regular expression, 
but cannot be rewritten as a deterministic regular expression.  The following paper 
tries to overcome this problem by creating a loose regular expression.

http://citeseer.nj.nec.com/ahonen95automatic.html

3) Unambiguity or determinism of regular hedge grammars

Any regular hedge grammars can be rewritten  a deterministic bottom-up 
hedge automaton.  However, such rewrite creates an entirely different 
schema (a named pattern in a rewritten schema corresponds to a SET 
of named patterns in the original).  Any regular hedge grammars can also 
be rewritten as a deterministic grammar as defined by Kohsuke.  Hopefully, 
this rewrite does a better job in preserving most <define> in the original.  
It is usually impossible to convert a regular hedge grammar to  
a deterministic top-down hedge automaton.  Finally, it is not always possible 
to convert a regular hedge grammar to an equivalent deterministic top-down 
hedge automaton with one lookahead.  For example,

start = 
  element doc {(sectionWithPara, sectionWithoutPara) 
               | (sectionWithoutPara, sectionWithPara)}

sectionWithPara = element section{para}
sectionWithoutPara = element section{empty}
para = element para {empty}

cannot be written as a deterministic top-down hedge automaton with one lookahead.

Note that this is deterministic as defined by Kohsuke.

Hope this helps.

Cheers,

-- 
MURATA Makoto (FAMILY Given) <EB2M-MRT@asahi-net.or.jp>
Follow-Ups:
- Re: [relax-ng] What's ambiguous for Relax NG?
  - From: Eric van der Vlist <vdv@dyomedea.com>
- Re: [relax-ng] What's ambiguous for Relax NG?
  - From: "MURATA Makoto (FAMILY Given)" <EB2M-MRT@asahi-net.or.jp>
References:
- What's ambiguous for Relax NG?
  - From: "Eric van der Vlist" <vdv@dyomedea.com>