relax-ng message

Subject: Re: [relax-ng] What's ambiguous for Relax NG?
From: Eric van der Vlist <vdv@dyomedea.com>
To: "MURATA Makoto (FAMILY Given)" <EB2M-MRT@asahi-net.or.jp>
Date: 05 Jul 2003 17:39:36 +0200
On Sat, 2003-07-05 at 17:05, MURATA Makoto (FAMILY Given) wrote:
> On Wed, 2 Jul 2003 22:02:10 +0200 (CEST)
> "Eric van der Vlist" <vdv@dyomedea.com> wrote:
> 
> > W3C XML Schema has two different constraints to insure "determism".
> 
> Unambiguity  or determinism in RELAX NG is a very tricky issue.  We have unambiguity 
> of regular expressions, unambiguity of regular hedge grammars, and datatype unambiguity.

Do you classify unambiguity of name classes as unambiguity of regular
expressions?

> First, there are three concepts about unambiguity or determinism of regular expressions: 
> strongly-unambiguous, weakly-unammbiguous, and deterministic.  Informally, 
> "deterministic" requires that you have one choice at a given point.  Meanwhile, 
> "strongly-unambious" and "weakly-unambiguous" allows you to have more than 
> one choice at a given point, but requires that, after you examine the entire 
> string, you have one choice.  The difference between "strongly-unambiguous" 
> and "weakly-unambiguous" is ambiguity about *.  For example, A** is weakly-
> unambiguous, while it is not strongly-unambiguous.  More about this, see 
> 
> ftp://ftp.informatik.uni-freiburg.de/documents/papers/brueggem/habil.ps
> 
> which is referenced from XML 1.0.  WXS requires deterministic content models.
> 
> > start = element book{ (odd,even)*,odd? }
> 
> This is not deterministic, but is strongly-unambiguous.

That's a useful classification and clarifies what I was trying to
explain!

> > start = element book{ empty|odd|((odd,even)+,odd?) }
> 
> The same.  This is not deterministic, but is strongly-unambiguous.
> 
> > odd=element odd-page{empty}
> > even=element even-page{empty}
> > 
> > }
> > 
> > Even if this second flavor cannot be written as WXS either, I don't think
> > we can consider it as ambiguous since for any valid document we can say
> > without ambiguity which branch of the alternative matches the instance
> > document.
> 
> I do not understand your point.  

I think you have :-) . To take your words, I meant that it's unambiguous
but not deterministic!

> It is true that if we have an 
> empty sequence we have to choose the first branch (empty) and 
> that if we have a sequence <odd/> we have to choose the second (odd). 
> But if we have a sequence <odd/><even/><odd/>...., we do not know
> whether we should choose the first odd or second odd in the third branch
> ((odd,even)+,odd?) when we encounter the second <odd/>.  
> 
> Second, we have unambiguity or determinism of regular hedge grammars.  Again, 
> unambiguity allows you to make a decision after you examine the entire tree, 
> while determinism requires that you make a decision at a given point.

I am not sure I understand the difference between regular expressions
and hedge grammars. Do you use them as two different levels? Regular
expressions in general as opposed to hedge grammars which would be
regular expressions applied to tree structures? 

> The following is obviously ambiguous.
> 
> start = doc1 | doc 2
> doc1 = element foo {empty}
> doc2 = element foo {empty}
> 
> Moreover, there are at least four variations of determinism of  regular 
> hedge grammars.  Deterministic top-down hedge automata, deterministic bottom-up hedge automata, 
> deterministic top-down with one look-ahead (or single-type), and Kohsuke's 
> determinism, which allows you to determine the type of an element when you 
> encounter its END tag.  I think that the first and second are only theoretically 
> interesting.  A variation of the third one is used by WXS.
> 
> > The same applies to patterns such as:
> > 
> > element foo{attribute bar{empty}} | element foo{element bar{empty}}
> > 
> > which can be rewritten as:
> > 
> > element foo{attribute bar{empty} | element bar{empty}}
> 
> I would say that this is ambiguity of regular tree languages.
> 
> Third, we have datatype unambiguity.  The choice of int and boolean is 
> ambiguous, since 0 and 1 belongs to both.
> 
> We can combine these three kinds of unambiguity and introduce quite 
> a few variations of unambiguity!  I would argue that different 
> tools require different variations of ambiguity.  It does make 
> sense to allow ambiguous regular expressions but require 
> deterministic top-down hedge automata with one look-ahead.

You mean that it would be a good compromise between flexibility and
implementation and allow deterministic type assignment?

> > I am wondering if, given the flexibility of Relax NG, there are any cases
> > left of ambiguous schemas which cannot be rewritten in a non ambiguous
> > form with the exception of ambiguous data types such as:
> > element foo{xsd:boolean|xsd:integer}
> 
> 1) datatype ambiguity
> 
> Hopeless.

Unless people do really care about it!

If I write:

element foo{xsd:boolean{pattern="true|false"}|xsd:integer}

I can remove the datatype ambiguity. Maybe we would need an except
pattern to remove either the lexical or the value space of a datatype
from another datatype?

BTW, I have a question about Relax NG... Wouldn't an "except" pattern
have been useful out of the scope of datatypes? 

> 2) Unambiguity or determinism of regular expressions
> 
> Any regular expression can be rewritten as a strongly-unambiguous regular expression, 
> but cannot be rewritten as a deterministic regular expression.  The following paper 
> tries to overcome this problem by creating a loose regular expression.
> 
> http://citeseer.nj.nec.com/ahonen95automatic.html

That's what I was beginning to suspect!
[0-9]{9}[0-9x]
> 3) Unambiguity or determinism of regular hedge grammars
> 
> Any regular hedge grammars can be rewritten  a deterministic bottom-up 
> hedge automaton.  However, such rewrite creates an entirely different 
> schema (a named pattern in a rewritten schema corresponds to a SET 
> of named patterns in the original).  Any regular hedge grammars can also 
> be rewritten as a deterministic grammar as defined by Kohsuke.  Hopefully, 
> this rewrite does a better job in preserving most <define> in the original.  
> It is usually impossible to convert a regular hedge grammar to  
> a deterministic top-down hedge automaton.  Finally, it is not always possible 
> to convert a regular hedge grammar to an equivalent deterministic top-down 
> hedge automaton with one lookahead.  For example,
> 
> start = 
>   element doc {(sectionWithPara, sectionWithoutPara) 
>                | (sectionWithoutPara, sectionWithPara)}
> 
> sectionWithPara = element section{para}
> sectionWithoutPara = element section{empty}
> para = element para {empty}
> 
> cannot be written as a deterministic top-down hedge automaton with one lookahead.
> 
> Note that this is deterministic as defined by Kohsuke.
> 
> Hope this helps.

Yes, this helps a lot!

Thanks

Eric
> 
> Cheers,
-- 
If you have a XML document, you have its schema.
                                                  http://examplotron.org
------------------------------------------------------------------------
Upcoming Schema languages tutorial (registration open):
 - July 7th   (Portland, OR)      http://makeashorterlink.com/?K27A527A4
 - August 4th (Montreal, Canada)  http://makeashorterlink.com/?U28A217A4
------------------------------------------------------------------------
Eric van der Vlist       http://xmlfr.org            http://dyomedea.com
(W3C) XML Schema ISBN:0-596-00252-1 http://oreilly.com/catalog/xmlschema
------------------------------------------------------------------------
Follow-Ups:
- Re: [relax-ng] What's ambiguous for Relax NG?
  - From: "MURATA Makoto (FAMILY Given)" <EB2M-MRT@asahi-net.or.jp>
References:
- What's ambiguous for Relax NG?
  - From: "Eric van der Vlist" <vdv@dyomedea.com>
- Re: [relax-ng] What's ambiguous for Relax NG?
  - From: "MURATA Makoto (FAMILY Given)" <EB2M-MRT@asahi-net.or.jp>