Re: [search-ws] CQL Query on structured data.

On 19/09/2008, at 12:41 AM, Rob Sanderson wrote:

Or SPARQL.

Rob

On Thu, 2008-09-18 at 10:25 -0400, LeVan,Ralph wrote:

I’d really hate to see us try to shoehorn structure searching into
CQL. I’d rather see support for XQuery as the query grammar in SRU
than make CQL do everything that it already does plus everything that
XQuery does.

Ralph

From: Kerry Blinco [mailto:kblinco@powerup.com.au]
Sent: Tuesday, September 09, 2008 5:47 AM
To: search-ws@lists.oasis-open.org
Subject: [search-ws] CQL Query on structured data.

Ray,

I am copying to you the unedited version of the FRED project
(Federated Repositories for Education) (http://fred.usq.edu.au/) use
case and a proposed solution based on proximity using element and an
abstract tree search.

The PQL solutions are different and they need to state their own case
I think - Hopefully we can get them engaged in this after next week.

The FRED description is something I could send to you quickly as
requested.

LOM CQL Documentation from the FRED project

Problem Statement

We need to be able to perform, under CQL, queries that are sensitive to
the structure of the underlying LOM; e.g.

* dc.creator = sanderson and the dc.date =2006 are contained
within the same contribute container.

* x is author and y is editor and y edited at least 4 months after
x authored

* dc.creator is in the second grandchild of the grandfather of a
node with dc.date = 2006

Australian Education Stakeholders have indicated that such structural
queries (at least in their simplest forms) are important.

In LOM in particular,

* The ordering of nodes is often undefined by the standard, and so
searches cannot rely on relative order.

* Many elements can have an open number of children.

Note that such structural queries are perceived to compromise the
abstractness of CQL queries. For instance, a query such as "are
contained within the same contribute entry" does not make sense unless
the underlying record can be represented in LOM: the proximity is
evaluated specifically in the context of the LOM information model.
Even if a "contribute" entry is not nominated, different schemas may
arrange elements very differently. In LOM, the creator and publisher
are both in the contribute container, and are closer to each other
structurally than is the keyword. In Dublin Core, creator, publisher
and keyword are all children of the root node, and so are equally
close to each other structurally. For that reason, if a context set
realises element proximity searches, it can only do so relative to a
specific schema, and cannot claim to be schema neutral, the way normal
CQL searches are.

On the other hand, such structural queries are specific only to the
information model of LOM, and not to any binding of the information
model to a specific presentation. They are not dependent on the
container and leaf elements being presented through XML, RDF, or
Language Independent Datatypes.

CQL Element Proximity

The proposed solution for FRED is to provide context sets that allow
for searches of abstract structure. The solution below looks at a
tree structure.

SOLUTION:

The CQL context set provides for proximity searches with unit=element.
Proximity will address the simpler structural queries, but not the
more complex:

* "x is author and y is editor and y edited at least 4 months
after x authored" assumes XPath-like extraction of individual
elements, rather than the fixed indexes of CQL
(lom::contribute[child::role=author]::date >=
lom::contribute[child::role=editor]::date + 000400Z ).

* The query "dc.creator is in the second grandchild of the
grandfather of a node with dc.date = 2006" is also inconsistent with
CQL: all indexes in CQL need to be related to a search term. So CQL
might conceivably query dc.creator = sanderson as a second grandchild,
but not dc.creator in general; this is again an XPath query
(/descendant[child::date=2006]/parent/child/child[position()=2]/self::creator).

Neither CQL 1.1 nor CQL 1.2 defines how elements are counted or what
proximity means in the context of CQL indices. Neither CQL version
outlines the extent to which the underlying structure of the source
LOM may be preserved after elements are extracted into indexes. CQL
1.2 expressly states that, though the element unit is defined for the
CQL context set, no semantics for the unit is defined. (Indeed, nor is
semantics for any proximity unit defined.) However, CQL expressly
allows that other context sets define semantics for proximity units.

This has the risk of leading to inconsistent notions of proximity
between different projects and applications.

FRED will develop its own semantics for element proximity searches in
the Australian Education CQL context set.

FRED may develop a sample implementation of element proximity
searches.

There are two possible interpretations of element proximity which FRED
could use: textual proximity, and structural proximity.

* Under textual proximity, elements are tokenised in the same way
that words, sentences etc. are tokenised.

o A LOM tree is constructed out of the container and leaf
nodes in the LOM document. Where the container may have an unordered
number of subcontainers or leaf nodes, the children in the tree are
ordered arbitrarily.

o The LOM tree is traversed in-order, and each leaf node
visited is extracted as a token, in the order in which it has been
visited.

o Proximity search counts the number of tokens in the
tokenised tree between elements.

o The notion of a node is preserved in the search; the
notion of node hierarchy is not.

* Under structural proximity, the lowest common ancestor of two
elements is determined.

o A LOM tree is constructed, as above.

o The two elements being queried are identified in the tree
as leaf nodes.

o The distance of the leaf nodes to their lowest common
ancestor in the tree is determined.

o Node hierarchy is preserved.

o Proximity search counts how many levels in the tree the
elements are removed from their common ancestor (i.e. how many
branchings intervene between them).

The lowest common ancestor definition of proximity corresponds to the
kinds of queries of interest for LOM, which rely on elements being in
the same container. Unlike element tokenisation, it is not sensitive
to the ordering of elements, or the number of elements an aggregate
node may contain.

On the other hand, element tokenisation corresponds closely to the
implementation of other proximity searches. That said, the proposed
notion of structural proximity is not an unusual understanding of
proximity in the XML context (cf.
http://www.cs.fiu.edu/~vagelis/publications/Tkde-tree-search.pdf ,
where "Keyword Proximity Search in XML Trees" is understood explicitly
in terms of Lowest Common Ancestor).

Illustration:

Tree 1:

<a>
  1
  <c>
    <d>2</d>
    <e>3</e>
    <f>4</f>
    <g>
      <h>5</h>
      6
    </g>
    <j>7</j>
  </c>
</a>

Tree 2:

<a>
  <c>
    <g>
      <h>5</h>
      6
    </g>
    <d>2</d>
    <j>7</j>
    <e>3</e>
    <f>4</f>
  </c>
  1
</a>

In LOM, Trees 1 and 2 are identical, since LOM is order-insensitive.

Tree 3:

<a>
  1
  <c>
    <d>2</d>
    <j>7</j>
  </c>
</a>

The queries we expect about LOM structure will typically concentrate
on elements belonging to the same container, rather than on what other
elements also belong to the container. Therefore, we would expect a
query on the proximity of <d/> and <j/> to give the same result for
Trees 1 and 3. (To give a LOM example: we expect "x is author and y is
editor and y edited at least 4 months after x authored" to give the
same answer whether or not a graphic designer is also defined in the
Contribute node.)

Tokenised LOM tree

* E.g. in an XML binding of LOM, the XML document is tokenised,
with the token breaks being the element delimiters.

* Only leaf nodes are tokenised, and aggregate nodes are not
considered tokens. E.g. in XML, consecutive delimiters count as a
single token break.

* Tokenisation is exactly parallel to the tokenisation of text by
word and sentence boundaries, already used for textual proximity
searches.

* Proximity searches for elements use the well-established notion
of distance between tokens.

The trees above tokenise as:

* Tree 1: 1 2 3 4 5 6 7

* Tree 2: 5 6 2 7 3 4 1

* Tree 3: 1 2 7

Search results are sensitive to accidents of ordering and of optional
elements. So in the three trees, the distance between 1 and 7 could be
6, 3, or 2.
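
As a rough sketch of the tokenised reading (not FRED's implementation),
the Python below flattens each illustrative tree into its leaf tokens
in document order and measures distance as the difference in token
positions. The names leaf_tokens and token_distance are hypothetical,
and the trees are written on one line each for brevity.

```python
import xml.etree.ElementTree as ET

# The three illustrative trees from above, one line each.
TREES = {
    "Tree 1": "<a>1<c><d>2</d><e>3</e><f>4</f><g><h>5</h>6</g><j>7</j></c></a>",
    "Tree 2": "<a><c><g><h>5</h>6</g><d>2</d><j>7</j><e>3</e><f>4</f></c>1</a>",
    "Tree 3": "<a>1<c><d>2</d><j>7</j></c></a>",
}

def leaf_tokens(elem):
    """Return the leaf values under elem, in document order."""
    tokens = []
    # Text appearing before the first child element.
    if elem.text and elem.text.strip():
        tokens.append(elem.text.strip())
    for child in elem:
        tokens.extend(leaf_tokens(child))
        # Text appearing after this child, still inside elem.
        if child.tail and child.tail.strip():
            tokens.append(child.tail.strip())
    return tokens

def token_distance(xml_text, first, second):
    """Difference in token position between two leaf values."""
    tokens = leaf_tokens(ET.fromstring(xml_text))
    return abs(tokens.index(first) - tokens.index(second))

for name, xml_text in TREES.items():
    print(name, leaf_tokens(ET.fromstring(xml_text)),
          token_distance(xml_text, "1", "7"))
```

Under these assumptions the distance between 1 and 7 comes out as 6, 3
and 2 for the three trees, matching the figures above.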

Lowest Common Ancestor

Work on algorithms to determine lowest common ancestor in XML has been
done
(http://www.cs.fiu.edu/~vagelis/publications/Tkde-tree-search.pdf);
the following algorithm is straightforward:

* Each element in the LOM tree is assigned a position string: a
dot-delimited string of position numbers describing the path from the
root to the element. The leftmost position number is the position of
the child of the root traversed in the path, relative to its siblings,
expressed as an ordinal number; the next position number is the
position of the child of the child of the root, relative to its
siblings, and so forth. For example:

Tree 1:

<a>
  1                 1
  <c>
    <d>2</d>        2.1
    <e>3</e>        2.2
    <f>4</f>        2.3
    <g>
      <h>5</h>      2.4.1
      6             2.4.2
    </g>
    <j>7</j>        2.5
  </c>
</a>

* Since each element is assigned a position string in isolation,
these position strings can be indexed externally, and proximity
queries may be transacted by looking up the position strings, without
direct reference to the XML or any other realisation of the LOM
information model. This means that such searches are compatible with
an index-based CQL infrastructure.

* The position strings of the two elements are compared, and the
longest common prefix determined. For example: a proximity query for
<d> and <h>, involving position strings 2.1 and 2.4.1, has the longest
common prefix 2.

* The minimum suffix length following the common prefix is
determined for the two elements' position strings. 2.1 and 2.4.1 share
the prefix 2, and after that prefix have the suffixes .1 and .4.1, of
length 1 and 2 respectively.

* The minimum suffix length is the closest distance from one of
the elements to the lowest common ancestor, and represents the number
of branchings in the LOM tree between elements. Since the minimum
suffix length for <d> and <h> is 1, the two elements are contained in
the same container, and no intervening aggregate elements are possible
between them: the lowest common ancestor is the parent of one of the
elements.

* If the position strings are 2.1.4.5.6 and 2.1.7.3, the common
prefix is 2.1, and the minimum suffix length is 2 (.7.3). This means
that there is an intervening node (aggregate element) between the two
elements: 2.1.7; the lowest common ancestor is the grandparent of
2.1.7.3, 2.1. So 2.1.4.5.6 and 2.1.7.3 are less close than 2.1.7.1 and
2.1.7.3, or for that matter 2.1.4 and 2.1.7.3.
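
The comparison steps above can be sketched in a few lines of Python.
This is only an illustration of the prefix/suffix arithmetic as
described; the function name lca_distance is hypothetical. One
assumption beyond the text: if one position string is a prefix of the
other (one element is an ancestor of the other), the sketch returns 0.

```python
def lca_distance(pos_a, pos_b):
    """Minimum suffix length after the longest common prefix of two
    dot-delimited position strings."""
    a = pos_a.split(".")
    b = pos_b.split(".")
    # Count how many leading position numbers the two paths share.
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    # Shorter of the two remaining suffixes, counted in path steps.
    return min(len(a) - common, len(b) - common)

print(lca_distance("2.1", "2.4.1"))          # 1
print(lca_distance("2.1.4.5.6", "2.1.7.3"))  # 2
print(lca_distance("2.1.7.1", "2.1.7.3"))    # 1
print(lca_distance("2.1.4", "2.1.7.3"))      # 1
```

Because the function works on the position strings alone, it could be
run against an external index without access to the XML.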

Let us illustrate this with LOM instances:

<lom>
  <lifecycle>                             1
    <contribute>                          1.1
      <role>author</role>                 1.1.1
      <entity>Sanderson</entity>          1.1.2
      <date>2006</date>                   1.1.3
    </contribute>
    <contribute>                          1.2
      <role>publisher</role>              1.2.1
      <entity>Fredericksen</entity>       1.2.2
      <date>2007</date>                   1.2.3
    </contribute>
    <contribute>                          1.3
      <role>initiator</role>              1.3.1
      <entity>Johnson</entity>            1.3.2
      <date>2004</date>                   1.3.3
    </contribute>
    <contribute>                          1.4
      <role>terminator</role>             1.4.1
      <entity>Pierceson</entity>          1.4.2
      <date>2008</date>                   1.4.3
    </contribute>
  </lifecycle>
</lom>

* The query "dc.creator = sanderson and the dc.date =2006 are
contained within the same contribute entry" translates to: dc.creator
= sanderson prox/unit=element/distance=1 dc.date=2006.

* "sanderson" and "2006" have the position strings 1.1.2 and
1.1.3, so their distance is 1 (.2, .3). They are as close as possible
in the LOM tree, sharing a common parent, so they satisfy the "within
the same contribute entry" requirement.

* By contrast, "sanderson" and "2008" have the position strings
1.1.2 and 1.4.3, so their distance is 2 (.1.2, .4.3). The date 2008 is
not associated with contributor Sanderson, but with a different
contributor.
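
To make the worked example concrete, here is a minimal Python sketch
that builds an external index of position paths from the LOM instance
above and evaluates the two comparisons. It is not FRED's
implementation: the names build_index, distance and min_distance are
hypothetical, dc.creator and dc.date are assumed to map to the entity
and date leaves, and the sketch assumes containers hold only child
elements (no mixed content).

```python
import xml.etree.ElementTree as ET

LOM = """
<lom><lifecycle>
  <contribute><role>author</role><entity>Sanderson</entity><date>2006</date></contribute>
  <contribute><role>publisher</role><entity>Fredericksen</entity><date>2007</date></contribute>
  <contribute><role>initiator</role><entity>Johnson</entity><date>2004</date></contribute>
  <contribute><role>terminator</role><entity>Pierceson</entity><date>2008</date></contribute>
</lifecycle></lom>
"""

def build_index(root):
    """Map (tag, lowercased value) of each leaf to its position paths."""
    index = {}
    def walk(elem, path):
        for n, child in enumerate(elem, start=1):
            child_path = path + (n,)
            if len(child) == 0:  # leaf element: record its path
                key = (child.tag, (child.text or "").strip().lower())
                index.setdefault(key, []).append(child_path)
            else:                # container: descend
                walk(child, child_path)
    walk(root, ())
    return index

def distance(p, q):
    """Minimum suffix length after the longest common prefix."""
    common = 0
    for x, y in zip(p, q):
        if x != y:
            break
        common += 1
    return min(len(p) - common, len(q) - common)

def min_distance(index, key_a, key_b):
    """Smallest distance over all occurrences of the two indexed terms."""
    return min(distance(p, q) for p in index[key_a] for q in index[key_b])

index = build_index(ET.fromstring(LOM))
print(index[("entity", "sanderson")])                                   # [(1, 1, 2)]
print(min_distance(index, ("entity", "sanderson"), ("date", "2006")))  # 1
print(min_distance(index, ("entity", "sanderson"), ("date", "2008")))  # 2
```

Only the first pair is at distance 1, so only it would satisfy the
"within the same contribute entry" query.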

Kerry Blinco
e-Framework and Standards Manager, Link Affiliates, University of
Southern Queensland; and
Technical Standards Adviser to the Department of Education Employment
and Workplace Relations (DEEWR). Australia.
Email: kblinco@powerup.com.au
Phone: +61 7 3871 2699
Ph (Mobile) : +61 419 787 992

The information contained in this e-mail message and any files may
be confidential information, and may also be the subject of legal
professional privilege.
If you think you may not be the intended recipient, or if you have
received this e-mail in error,
please contact the sender immediately and delete all copies of this
e-mail. If you are not the intended
recipient, you must not reproduce any part of this e-mail or disclose
its contents to any other party.

This email represents the views of the individual sender, except where
the sender expressly states otherwise.

---------------------------------------------------------------------
To unsubscribe from this mail list, you must leave the OASIS TC that
generates this mail. Follow this link to all your TCs in OASIS at:
https://www.oasis-open.org/apps/org/workgroup/portal/my_workgroups.php
