[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]
Subject: Re: [relax-ng] What's ambiguous for Relax NG?
On Wed, 2 Jul 2003 22:02:10 +0200 (CEST) "Eric van der Vlist" <vdv@dyomedea.com> wrote: > W3C XML Schema has two different constraints to insure "determism". Unambiguity or determinism in RELAX NG is a very tricky issue. We have unambiguity of regular expressions, unambiguity of regular hedge grammars, and datatype unambiguity. First, there are three concepts about unambiguity or determinism of regular expressions: strongly-unambiguous, weakly-unammbiguous, and deterministic. Informally, "deterministic" requires that you have one choice at a given point. Meanwhile, "strongly-unambious" and "weakly-unambiguous" allows you to have more than one choice at a given point, but requires that, after you examine the entire string, you have one choice. The difference between "strongly-unambiguous" and "weakly-unambiguous" is ambiguity about *. For example, A** is weakly- unambiguous, while it is not strongly-unambiguous. More about this, see ftp://ftp.informatik.uni-freiburg.de/documents/papers/brueggem/habil.ps which is referenced from XML 1.0. WXS requires deterministic content models. > start = element book{ (odd,even)*,odd? } This is not deterministic, but is strongly-unambiguous. > start = element book{ empty|odd|((odd,even)+,odd?) } The same. This is not deterministic, but is strongly-unambiguous. > odd=element odd-page{empty} > even=element even-page{empty} > > } > > Even if this second flavor cannot be written as WXS either, I don't think > we can consider it as ambiguous since for any valid document we can say > without ambiguity which branch of the alternative matches the instance > document. I do not understand your point. It is true that if we have an empty sequence we have to choose the first branch (empty) and that if we have a sequence <odd/> we have to choose the second (odd). But if we have a sequence <odd/><even/><odd/>...., we do not know whether we should choose the first odd or second odd in the third branch ((odd,even)+,odd?) when we encounter the second <odd/>. Second, we have unambiguity or determinism of regular hedge grammars. Again, unambiguity allows you to make a decision after you examine the entire tree, while determinism requires that you make a decision at a given point. The following is obviously ambiguous. start = doc1 | doc 2 doc1 = element foo {empty} doc2 = element foo {empty} Moreover, there are at least four variations of determinism of regular hedge grammars. Deterministic top-down hedge automata, deterministic bottom-up hedge automata, deterministic top-down with one look-ahead (or single-type), and Kohsuke's determinism, which allows you to determine the type of an element when you encounter its END tag. I think that the first and second are only theoretically interesting. A variation of the third one is used by WXS. > The same applies to patterns such as: > > element foo{attribute bar{empty}} | element foo{element bar{empty}} > > which can be rewritten as: > > element foo{attribute bar{empty} | element bar{empty}} I would say that this is ambiguity of regular tree languages. Third, we have datatype unambiguity. The choice of int and boolean is ambiguous, since 0 and 1 belongs to both. We can combine these three kinds of unambiguity and introduce quite a few variations of unambiguity! I would argue that different tools require different variations of ambiguity. It does make sense to allow ambiguous regular expressions but require deterministic top-down hedge automata with one look-ahead. > I am wondering if, given the flexibility of Relax NG, there are any cases > left of ambiguous schemas which cannot be rewritten in a non ambiguous > form with the exception of ambiguous data types such as: > element foo{xsd:boolean|xsd:integer} 1) datatype ambiguity Hopeless. 2) Unambiguity or determinism of regular expressions Any regular expression can be rewritten as a strongly-unambiguous regular expression, but cannot be rewritten as a deterministic regular expression. The following paper tries to overcome this problem by creating a loose regular expression. http://citeseer.nj.nec.com/ahonen95automatic.html 3) Unambiguity or determinism of regular hedge grammars Any regular hedge grammars can be rewritten a deterministic bottom-up hedge automaton. However, such rewrite creates an entirely different schema (a named pattern in a rewritten schema corresponds to a SET of named patterns in the original). Any regular hedge grammars can also be rewritten as a deterministic grammar as defined by Kohsuke. Hopefully, this rewrite does a better job in preserving most <define> in the original. It is usually impossible to convert a regular hedge grammar to a deterministic top-down hedge automaton. Finally, it is not always possible to convert a regular hedge grammar to an equivalent deterministic top-down hedge automaton with one lookahead. For example, start = element doc {(sectionWithPara, sectionWithoutPara) | (sectionWithoutPara, sectionWithPara)} sectionWithPara = element section{para} sectionWithoutPara = element section{empty} para = element para {empty} cannot be written as a deterministic top-down hedge automaton with one lookahead. Note that this is deterministic as defined by Kohsuke. Hope this helps. Cheers, -- MURATA Makoto (FAMILY Given) <EB2M-MRT@asahi-net.or.jp>
[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]