RE: [search-ws] CQL Query on structured data.

I’d really hate to see us try to shoehorn structure searching into CQL. I’d rather see support for XQuery as the query grammar in SRU than make CQL do everything that it already does plus everything that XQuery does.

Ralph

From: Kerry Blinco [mailto:kblinco@powerup.com.au]
Sent: Tuesday, September 09, 2008 5:47 AM
To: search-ws@lists.oasis-open.org
Subject: [search-ws] CQL Query on structured data.

Ray,

I am copying to you the unedited version of the FRED project (Federated Repositories for Education) (http://fred.usq.edu.au/) use case and a proposed solution based on proximity using element and an abstract tree search.

The PQL solutions are different and they need to state their own case I think - Hopefully we can get them engaged in this after next week.

The FRED description is something I could send to you quickly as requested.

LOM CQL Documentation from the FRED project

Problem Statement ś

Need to be able to perform under CQL queries sensitive to the structure of the underlying LOM; e.g.

* dc.creator = sanderson and the dc.date =2006 are contained within the same contribute container.

* x is author and y is editor and y edited at least 4 months after x authored

* dc.creator is in the second grandchild of the grandfather of a node with dc.date = 2006

Australian Education Stakeholders have indicated that such structural queries (at least in their simplest forms) are important.

In LOM in particular,

* The ordering of nodes is often undefined by the standard, and so searches cannot rely on relative order.

* Many elements can have an open number of children.

Note that such structural queries are perceived to compromise the abstractness of CQL queries. For instance, a query are contained within the same contribute entry does not make sense unless the underlying record can be represented in LOM: the proximity is evaluated specifically in the context of the LOM information model. Even if a "contribute" entry is not nominated, different schemas may arrange elements very differently. In LOM, the creator and publisher are both in the contribute container, and are closer to each other structurally than is the keyword. In Dublin Core, creator, publisher and keyword are all children of the root node, and so are equally close to each other structurally. For that reason, if a context set realises element proximity searches, it can only do so relative to a specific schema, and cannot claim to be schema neutral, the way normal CQL searches are.

On the other hand, such structural queries are specific only to the information model of LOM, and not to any binding of the information model to a specific presentation. They are not dependent on the container and leaf elements being presented through XML, RDF, or Language Independent Datatypes.

CQL Element Proximity ś

The proposed solution for FRED is to provide context sets that allow for searches of abstract structure. The solution below looks at a tree structure.

SOLUTION:

The CQL context set provides for proximity searches with unit=element. Proximity will address the simpler structural queries, but not the more complex:

* "x is author and y is editor and y edited at least 4 months after x authored" assumes XPath-like extraction of individual elements, rather than the fixed indexes of CQL (lom::contribute[child::role=author]::date >= lom::contribute[child::role=editor]::date + 000400Z ).

* The query "dc.creator is in the second grandchild of the grandfather of a node with dc.date = 2006" is also inconsistent with CQL: all indexes in CQL need to be related to a search term. So CQL might conceivable query dc.creator = sanderson as a second grandchild, but not dc.creator in general; this is again an XPath query (/descendant[child::date=2006]/parent/child/child[position()=2]/self::creator).

Neither CQL 1.1 nor CQL 1.2 define how elements are counted or what proximity means in the context of CQL indices. Neither CQL version outlines the extent to which the underlying structure of the source LOM may be preserved after elements are extracted into indexes. CQL 1.2 expressly states that, though the element unit is defined for the CQL context set, no semantics for the unit is defined. (Indeed, nor is semantics for any proximity unit defined.) However, CQL expressly allows that other context sets define semantics for proximity units.

This has the risk of leading to inconsistent notions of proximity between different projects and applications.

FRED will develop its own semantics for element proximity searches in the Australian Education CQL context set.

FRED may develop a sample implementation of element proximity searches.

There are two possible interpretations of element proximity which FRED could use: textual proximity, and structural proximity.

* Under textual proximity, elements are tokenised in the same way that words, sentences etc. are tokenised.

o A LOM tree is constructed out of the container and leaf nodes in the LOM document. Where the container may have an unordered number of subcontainers or leaf nodes, the children in the tree are ordered arbitrarily.

o The LOM tree is traversed in-order, and each leaf node visited is extracted as a token, in the order in which it has been visited.

o Proximity search counts the number of tokens in the tokenised tree between elements.

o The notion of a node is preserved in the search; the notion of node hierarchy is not.

* Under structural proximity, the lowest common ancestor of two elements is determined.

o A LOM tree is constructed, as above.

o The two elements being queries are identified in the tree as leaf nodes.

o The distance of the leaf nodes to their lowest common ancestor in the tree is determined.

o Node hierarchy is preserved.

o Proximity search counts how many levels in the tree the elements are removed from their common ancestor (i.e. how many branchings intervene between them).

The lowest common ancestor definition of proximity corresponds to the kinds of queries of interest for LOM, which rely on elements being in the same container. Unlike element tokenisation, it is not sensitive to the ordering of elements, or the number of elements an aggregate node may contain.

On the other hand, element tokenisation corresponds closely to the implementation of other proximity searches. That said, the proposed notion of structural proximity is not an unusual understanding of proximity in the XML context; cf. http://www.cs.fiu.edu/~vagelis/publications/Tkde-tree-search.pdf , where "Keyword Proximity Search in XML Trees" is understood explicitly in terms of Lowest Common Ancestor.)

Illustration:

Tree 1:

<a>

1

<c>

<g>

6

</g>

</c>

</a>

Tree 2:

<a>

<c>

<g>

6

</g>

</c>

1

</a>

In LOM, Trees 1 and 2 are identical, since LOM is order-insensitive.

Tree 3:

<a>

1

<c>

</c>

</a>

The queries we expect about LOM structure will typically concentrate on elements belonging to the same container, rather than on what other elements also belong to the container. Therefore, we would expect a query on the proximity of and <j/> to give the same result for Trees 1 and 3. (To give a LOM example: we expect x is author and y is editor and y edited at least 4 months after x authored to give the same answer whether or not a graphic designer is also defined in the Contribute node.

Tokenised LOM tree ś

* E.g. in an XML binding of LOM, the XML document is tokenised, with the token breaks being the element delimiters.

* Only leaf nodes are tokenised, and aggregate nodes are not considered tokens. E.g. in XML, consecutive delimiters count as a single token break.

* Tokenisation is exactly parallel to the tokenisation of text by word and sentence boundaries, already used for textual proximity searches.

* Proximity searches for elements use the well-established notion of distance between tokens.

The trees above tokenise as:

* Tree 1: 1 2 3 4 5 6 7

* Tree 2: 5 6 2 7 3 4 1

* Tree 3: 1 2 7

Search results are sensitive to accidents of ordering and of optional elements. So in the three trees, the distance between 1 and 7 could be 6, 3, or 2.

Lowest Common Ancestor ś

Work on algorithms to determine lowest common ancestor in XML has been done (http://www.cs.fiu.edu/~vagelis/publications/Tkde-tree-search.pdf); the following algorithm is straightforward:

* Each element in the LOM tree is assigned a position string: a dot-delimited string of position numbers describing the path from the root to the element. The leftmost position number is the position of the child of the root traversed in the path, relative to its siblings, expressed as an ordinal number; the next position number is the position of the child of the child of the root, relative to its siblings, and so forth. For example:

Tree 1:

<a>

1 1

<c>

<d>2</d> 2.1

<e>3</e> 2.2

<f>4</f> 2.3

<g>

<h>5</h> 2.4.1

6 2.4.2

</g>

<j>7</j> 2.5

</c>

</a>

* Since each element is assigned a position string in isolation, these position strings can be indexed externally, and proximity queries may be transacted by looking up the position strings, without direct reference to the XML or any other realisation of the LOM information model. This means that such searches are compatible with an index-based CQL infrastructure.

* The position strings of the two elements are compared, and the minimum common prefix determined. For example: a proximity query for <d> and <h>, involving position strings 2.1 and 2.4.1, has the minimum common prefix 2.

* The minimum suffix length following the common prefix is determined for the two elements' position strings. 2.1 and 2.4.1 share the prefix 2, and after that prefix have the suffixes .1 and .4.1, of length 1 and 2 respectively.

* The minimum suffix length is the closest distance from one of the elements to the lowest common ancestor, and represents the number of branchings in the LOM tree between elements. Since the minimum suffix length for <d> and <h> is 1, the two elements are contained in the same container, and no intervening aggregate elements are possible between them: the lowest common ancestor is the parent of one of the elements.

* If the position strings are 2.1.4.5.6 and 2.1.7.3, the common prefix is 2.1, and the minimum suffix length is 2 (.7.3). This means that there is an intervening node (aggregate element) between the two elements: 2.1.7; the lowest common ancestor is the grandparent of 2.1.7.3, 2.1. So 2.1.4.5.6 and 2.1.7.3 are less close than 2.1.7.1 and 2.1.7.3, or for that matter 2.1.4 and 2.1.7.3.

Let us illustrate this with LOM instances:

<lom>

<lifecycle> 1

<contribute> 1.1

<role>author</role> 1.1.1

<entity>Sanderson</entity> 1.1.2

<date>2006</date> 1.1.3

</contribute>

<contribute> 1.2

<role>publisher</role> 1.2.1

<entity>Fredericksen</entity> 1.2.2

<date>2007</date> 1.2.3

</contribute>

<contribute> 1.3

<role>initiator</role> 1.3.1

<entity>Johnson</entity> 1.3.2

<date>2004</date> 1.3.3

</contribute>

<contribute> 1.4

<role>terminator</role> 1.4.1

<entity>Pierceson</entity> 1.4.2

<date>2008</date> 1.4.3

</contribute>

</lom>

* The query "dc.creator = sanderson and the dc.date =2006 are contained within the same contribute entry" translates to: dc.creator = sanderson prox/unit=element/distance=1 dc.date=2006.

* "sanderson" and "2006" have the position strings 1.1.2 and 1.1.3, so their distance is 1 (.2, .3). They are as close as possible in the LOM tree, sharing a common parent, so they satisfy the "within the same contribute entry" requirement.

* By contrast, "sanderson" and "2008" have the position strings 1.1.2 and 1.4.3, so their distance is 2 (.1.2, .4.3). The date 2008 is not associated with contributor Sanderson, but with a different contributor.

Kerry Blinco
e-Framework and Standards Manager, Link Affiliates, University of Southern Queensland; and
Technical Standards Adviser to the Department of Education Employment and Workplace Relations (DEEWR). Australia.
Email:     kblinco@powerup.com.au
Phone:   +61 7 3871 2699
Ph (Mobile) :    +61 419 787 992

The information contained in this e-mail message and any files may
be confidential information, and may also be the subject of legal professional privilege.
If you think you may not be the intended recipient, or if you have received this e-mail in error,
please contact the sender immediately and delete all copies of this e-mail. If you are not the intended
recipient, you must not reproduce any part of this e-mail or disclose its contents to any other party.

This email represents the views of the individual sender, except where the sender expressly states otherwise.

search-ws message