Subject: Datatype assignment for TREX
I've been thinking about how to do datatype assignment for TREX. By "datatype assignment" I mean, given a TREX pattern and a document, assigning a datatype to each text node and attribute value (<anyString/> can be considered the ur-type). For some patterns and some documents, this requires lookahead (or multiple passes). Consider this pattern:

  <choice>
    <element name="b">
      <element name="c">
        <data type="xsd:int"/>
      </element>
      <element name="d">
        <empty/>
      </element>
    </element>
    <element name="b">
      <element name="c">
        <anyString/>
      </element>
      <element name="e">
        <empty/>
      </element>
    </element>
  </choice>

For example, in <b><c>1</c><d/></b> c has datatype xsd:int, but in <b><c>1</c><e/></b> c has no datatype (i.e. anyString).

The approach I have taken is to come up with a constraint on patterns that can be checked independently of any source document and which is sufficient to ensure that an implementation can always easily tell what the datatype of a text node or attribute value is. The basic idea is to require that it always be possible to determine the datatype of a text node or attribute value just using the names of the element and attribute ancestors.

To describe the algorithm we need the concept of the "direct descendants" of a pattern. The "direct descendants" of a pattern P are all the descendants of P that would be visited by a walk of the descendants of P that follows <ref> elements but does not look inside <element> and <attribute> elements and only looks at the first subpattern of a <concur> element. For example, assuming the following definition

  <define name="x">
    <element name="x"><empty/></element> <!-- 1 -->
  </define>

the direct descendants of this pattern:

  <choice>
    <ref name="x"/>
    <anyString/> <!-- 2 -->
    <element name="y"> <!-- 3 -->
      <element name="z"><empty/></element> <!-- 4 -->
    </element>
  </choice>

are the patterns labelled 1, 2 and 3, but not the pattern labelled 4.
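The walk that defines "direct descendants" can be sketched in a few lines of Python. This is only an illustration under some assumptions: patterns are parsed with xml.etree.ElementTree without a namespace (as in the examples in this message), `direct_descendants` and the `defines` dictionary (definition name to <define> element) are hypothetical names, and the `seen_refs` guard against recursive <ref> cycles is my own addition.

```python
import xml.etree.ElementTree as ET

def direct_descendants(pattern, defines):
    """Return the direct descendants of `pattern` in document order.

    The walk follows <ref> elements, does not look inside <element> or
    <attribute> elements, and only looks at the first subpattern of <concur>.
    `defines` maps a definition name to its <define> element (assumed layout).
    """
    found = []
    seen_refs = set()  # assumption: each definition is expanded at most once

    def visit(node):
        found.append(node)
        if node.tag in ("element", "attribute"):
            return  # visited, but its content is not examined
        if node.tag == "ref":
            name = node.get("name")
            if name in defines and name not in seen_refs:
                seen_refs.add(name)
                for child in defines[name]:
                    visit(child)
            return
        children = list(node)
        if node.tag == "concur":
            children = children[:1]  # first subpattern only
        for child in children:
            visit(child)

    visit(pattern)
    return found[1:]  # drop the pattern itself, keep only its descendants
```

Applied to the <choice> pattern above, this returns the <ref> element itself followed by the patterns labelled 1, 2 and 3; the pattern labelled 4 is never reached because the walk stops at the <element name="y"> boundary.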
The elements in TREX that can match sequences of characters are <anyString>, <string>, <data> and any element with a trex:role="datatype" attribute. Let's call these character elements. We say that a set of character elements is ambiguous if any of the following conditions apply:

  1. it contains two distinct <data> or trex:role="datatype" elements
  2. it contains both a <data> or trex:role="datatype" element and an <anyString> element
  3. (i) it contains both a <data> or trex:role="datatype" element and a <string> element, and (ii) the content of the <string> element may be a value of the datatype

Now we can describe the constraint on patterns, which I will call "easy datatype assignment". A pattern P has "easy datatype assignment" if the following conditions are all satisfied:

  1. The set of direct descendants of P that are character elements is not ambiguous.
  2. For any name x, take the direct descendants of P that are <element> elements with a name class that x matches; the pattern consisting of a choice of the content patterns of all such <element> elements must also have "easy datatype assignment".
  3. Same as 2, but for <attribute> elements instead of <element> elements.

For example, in determining whether the first example above has easy datatype assignment, we would look at (by applying step 2 with x="b"):

  <choice>
    <group>
      <element name="c">
        <data type="xsd:int"/>
      </element>
      <element name="d">
        <empty/>
      </element>
    </group>
    <group>
      <element name="c">
        <anyString/>
      </element>
      <element name="e">
        <empty/>
      </element>
    </group>
  </choice>

and (by applying step 2 with x="c"):

  <choice>
    <data type="xsd:int"/>
    <anyString/>
  </choice>

which does not have easy datatype assignment, because the set of direct descendant character elements is ambiguous.

There are a few subtleties to the implementation (to avoid infinite recursion and deal with wildcards), but it's basically quite straightforward. The main limitations are:

  1. It doesn't deal well with some uses of <concur>.
  2. Suppose I have a "foo" element which can contain either ints or strings according to the value of a "type" attribute:

  <choice>
    <element name="foo">
      <attribute name="type">
        <string>string</string>
      </attribute>
      <data type="xsd:string"/>
    </element>
    <element name="foo">
      <attribute name="type">
        <string>int</string>
      </attribute>
      <data type="xsd:int"/>
    </element>
  </choice>

This would not satisfy my constraint. (On the other hand, if the datatypes are explicit in the instance, then datatype assignment needn't involve schema processing at all.)

James
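A rough Python sketch of the "easy datatype assignment" check described above, for readers who want to experiment. It is not a full implementation and makes several simplifying assumptions: patterns are parsed with xml.etree.ElementTree without a namespace; name classes are limited to a literal `name` attribute (no wildcards); trex:role="datatype" elements are ignored; two <data> elements count as "distinct" when their `type` attributes differ; and condition 3(ii) is approximated conservatively by treating every <string> as a possible value of the datatype. The names `walk`, `is_ambiguous` and `has_easy_assignment` are hypothetical.

```python
import xml.etree.ElementTree as ET

CHAR_TAGS = ("anyString", "string", "data")

def walk(nodes, defines, seen=None):
    """Collect `nodes` plus their direct descendants: follow <ref>, stop at
    <element>/<attribute>, take only the first subpattern of <concur>."""
    seen = set() if seen is None else seen
    out = []
    for node in nodes:
        out.append(node)
        if node.tag in ("element", "attribute"):
            continue
        if node.tag == "ref":
            name = node.get("name")
            if name in defines and name not in seen:
                seen.add(name)
                out += walk(list(defines[name]), defines, seen)
            continue
        kids = list(node)
        out += walk(kids[:1] if node.tag == "concur" else kids, defines, seen)
    return out

def is_ambiguous(chars):
    """Conditions 1-3 on a set of character elements (3(ii) is conservative)."""
    types = {c.get("type") for c in chars if c.tag == "data"}
    if len(types) > 1:
        return True  # condition 1: two distinct datatypes
    others = {c.tag for c in chars} - {"data"}
    return bool(types) and bool(others)  # conditions 2 and 3

def has_easy_assignment(nodes, defines, checked=None):
    """`nodes` is a list of patterns treated as alternatives of one choice."""
    checked = set() if checked is None else checked
    key = frozenset(id(n) for n in nodes)
    if key in checked:
        return True  # already examined; this also stops infinite recursion
    checked.add(key)
    desc = walk(nodes, defines)
    if is_ambiguous([d for d in desc if d.tag in CHAR_TAGS]):
        return False
    for kind in ("element", "attribute"):  # steps 2 and 3
        content_by_name = {}
        for d in desc:
            if d.tag == kind:
                content_by_name.setdefault(d.get("name"), []).extend(list(d))
        for content in content_by_name.values():
            if not has_easy_assignment(content, defines, checked):
                return False
    return True
```

Called as `has_easy_assignment([pattern], {})` on either the first example or the "foo" example, the sketch reports that the constraint is not satisfied: in the first case because the merged content of "c" mixes <data> with <anyString/>, in the second because the merged content of "foo" contains two distinct datatypes.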