[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]
Subject: RE: [search-ws] CQL Query on structured data.
I’d really hate to see us try to shoehorn structure searching
into CQL. I’d rather see support for XQuery as the query grammar in SRU than
make CQL do everything that it already does plus everything that XQuery does. Ralph From: Kerry Blinco
[mailto:kblinco@powerup.com.au] Ray, I am copying to you the unedited version of the FRED project
(Federated Repositories for Education) (http://fred.usq.edu.au/)
use case and a proposed solution based on proximity using element and an
abstract tree search. The PQL solutions are different and they need to state their
own case I think - Hopefully we can get them engaged in this after next week.
The FRED description is something I could send to you
quickly as requested. LOM CQL Documentation from the FRED project Problem Statement ¶ Need to be able to perform under CQL queries sensitive to
the structure of the underlying LOM; e.g. * dc.creator = sanderson and the dc.date
=2006 are contained within the same contribute container. * x is author and y is editor and y
edited at least 4 months after x authored * dc.creator is in the second grandchild
of the grandfather of a node with dc.date = 2006 Australian Education Stakeholders have indicated that such
structural queries (at least in their simplest forms) are important. In LOM in particular, * The ordering of nodes is often
undefined by the standard, and so searches cannot rely on relative order. * Many elements can have an open number
of children. Note that such structural queries are perceived to compromise
the abstractness of CQL queries. For instance, a query are contained within the
same contribute entry does not make sense unless the underlying record can be
represented in LOM: the proximity is evaluated specifically in the context of
the LOM information model. Even if a "contribute" entry is not
nominated, different schemas may arrange elements very differently. In LOM, the
creator and publisher are both in the contribute container, and are closer to
each other structurally than is the keyword. In Dublin Core, creator, publisher
and keyword are all children of the root node, and so are equally close to each
other structurally. For that reason, if a context set realises element
proximity searches, it can only do so relative to a specific schema, and cannot
claim to be schema neutral, the way normal CQL searches are. On the other hand, such structural queries are specific only
to the information model of LOM, and not to any binding of the information
model to a specific presentation. They are not dependent on the container
and leaf elements being presented through XML, RDF, or Language Independent
Datatypes. CQL Element Proximity ¶ The proposed solution for FRED is to provide context sets
that allow for searches of abstract structure. The solution below looks
at a tree structure. SOLUTION: The CQL context set provides for proximity searches with
unit=element. Proximity will address the simpler structural queries, but not
the more complex: * "x is author and y is editor and y
edited at least 4 months after x authored" assumes XPath-like extraction
of individual elements, rather than the fixed indexes of CQL
(lom::contribute[child::role=author]::date >= lom::contribute[child::role=editor]::date
+ 000400Z ). * The query "dc.creator is in the
second grandchild of the grandfather of a node with dc.date = 2006" is
also inconsistent with CQL: all indexes in CQL need to be related to a search
term. So CQL might conceivable query dc.creator = sanderson as a second
grandchild, but not dc.creator in general; this is again an XPath query
(/descendant[child::date=2006]/parent/child/child[position()=2]/self::creator). Neither CQL 1.1 nor CQL 1.2 define how elements are counted
or what proximity means in the context of CQL indices. Neither CQL version
outlines the extent to which the underlying structure of the source LOM may be
preserved after elements are extracted into indexes. CQL 1.2 expressly states
that, though the element unit is defined for the CQL context set, no semantics
for the unit is defined. (Indeed, nor is semantics for any proximity unit
defined.) However, CQL expressly allows that other context sets define
semantics for proximity units. This has the risk of leading to inconsistent notions of
proximity between different projects and applications. FRED will develop its own semantics for element proximity
searches in the Australian Education CQL context set. FRED may develop a sample implementation of element
proximity searches. There are two possible interpretations of element proximity
which FRED could use: textual proximity, and structural proximity. * Under textual proximity, elements are
tokenised in the same way that words, sentences etc. are tokenised. o A LOM tree is
constructed out of the container and leaf nodes in the LOM document. Where the
container may have an unordered number of subcontainers or leaf nodes, the
children in the tree are ordered arbitrarily. o The LOM tree is
traversed in-order, and each leaf node visited is extracted as a token, in the
order in which it has been visited. o Proximity search
counts the number of tokens in the tokenised tree between elements. o The notion of a
node is preserved in the search; the notion of node hierarchy is not. * Under structural proximity, the lowest
common ancestor of two elements is determined. o A LOM tree is
constructed, as above. o The two elements
being queries are identified in the tree as leaf nodes. o The distance of
the leaf nodes to their lowest common ancestor in the tree is determined. o Node hierarchy is
preserved. o Proximity search
counts how many levels in the tree the elements are removed from their common
ancestor (i.e. how many branchings intervene between them). The lowest common ancestor definition of proximity
corresponds to the kinds of queries of interest for LOM, which rely on elements
being in the same container. Unlike element tokenisation, it is not sensitive
to the ordering of elements, or the number of elements an aggregate node may
contain. On the other hand, element tokenisation corresponds closely
to the implementation of other proximity searches. That said, the proposed
notion of structural proximity is not an unusual understanding of proximity in
the XML context; cf. http://www.cs.fiu.edu/~vagelis/publications/Tkde-tree-search.pdf
, where "Keyword Proximity Search in XML Trees" is understood
explicitly in terms of Lowest Common Ancestor.) Illustration: Tree 1: <a> <b>1</b> <c> <d>2</d> <e>3</e> <f>4</f> <g> <h>5</h> <i>6</i> </g> <j>7</j> </c> </a> Tree 2: <a> <c> <g> <h>5</h> <i>6</i> </g> <d>2</d> <j>7</j> <e>3</e> <f>4</f> </c> <b>1</b> </a> In LOM, Trees 1 and 2 are identical, since LOM is
order-insensitive. Tree 3: <a> <b>1</b> <c> <d>2</d> <j>7</j> </c> </a> The queries we expect about LOM structure will typically
concentrate on elements belonging to the same container, rather than on what
other elements also belong to the container. Therefore, we would expect a query
on the proximity of <b/> and <j/> to give the same result for Trees
1 and 3. (To give a LOM example: we expect x is author and y is editor and y
edited at least 4 months after x authored to give the same answer whether or
not a graphic designer is also defined in the Contribute node. Tokenised LOM tree ¶ * E.g. in an XML binding of LOM, the XML
document is tokenised, with the token breaks being the element delimiters. * Only leaf nodes are tokenised, and
aggregate nodes are not considered tokens. E.g. in XML, consecutive delimiters
count as a single token break. * Tokenisation is exactly parallel to the
tokenisation of text by word and sentence boundaries, already used for textual
proximity searches. * Proximity searches for elements use the
well-established notion of distance between tokens. The trees above tokenise as: * Tree 1: 1 2 3 4 5 6 7 * Tree 2: 5 6 2 7 3 4 1 * Tree 3: 1 2 7 Search results are sensitive to accidents of ordering and of
optional elements. So in the three trees, the distance between 1 and 7 could be
6, 3, or 2. Lowest Common Ancestor ¶ Work on algorithms to determine lowest common ancestor in
XML has been done (http://www.cs.fiu.edu/~vagelis/publications/Tkde-tree-search.pdf);
the following algorithm is straightforward: * Each element in the LOM tree is
assigned a position string: a dot-delimited string of position numbers
describing the path from the root to the element. The leftmost position number
is the position of the child of the root traversed in the path, relative to its
siblings, expressed as an ordinal number; the next position number is the
position of the child of the child of the root, relative to its siblings, and
so forth. For example: Tree 1: <a> <b>1</b>
1 <c>
<d>2</d> 2.1
<e>3</e> 2.2
<f>4</f> 2.3 <g>
<h>5</h> 2.4.1
<i>6</i> 2.4.2 </g>
<j>7</j> 2.5 </c> </a> * Since each element is assigned a
position string in isolation, these position strings can be indexed externally,
and proximity queries may be transacted by looking up the position strings,
without direct reference to the XML or any other realisation of the LOM
information model. This means that such searches are compatible with an
index-based CQL infrastructure. * The position strings of the two
elements are compared, and the minimum common prefix determined. For example: a
proximity query for <d> and <h>, involving position strings 2.1 and
2.4.1, has the minimum common prefix 2. * The minimum suffix length following the
common prefix is determined for the two elements' position strings. 2.1 and
2.4.1 share the prefix 2, and after that prefix have the suffixes .1 and .4.1,
of length 1 and 2 respectively. * The minimum suffix length is the
closest distance from one of the elements to the lowest common ancestor, and
represents the number of branchings in the LOM tree between elements. Since the
minimum suffix length for <d> and <h> is 1, the two elements are
contained in the same container, and no intervening aggregate elements are
possible between them: the lowest common ancestor is the parent of one of the
elements. * If the position strings are 2.1.4.5.6
and 2.1.7.3, the common prefix is 2.1, and the minimum suffix length is 2
(.7.3). This means that there is an intervening node (aggregate element)
between the two elements: 2.1.7; the lowest common ancestor is the grandparent
of 2.1.7.3, 2.1. So 2.1.4.5.6 and 2.1.7.3 are less close than 2.1.7.1 and
2.1.7.3, or for that matter 2.1.4 and 2.1.7.3. Let us illustrate this with LOM instances: <lom> <lifecycle>
1 <contribute>
1.1 <role>author</role>
1.1.1
<entity>Sanderson</entity>
1.1.2 <date>2006</date>
1.1.3 </contribute> <contribute>
1.2 <role>publisher</role>
1.2.1
<entity>Fredericksen</entity> 1.2.2 <date>2007</date>
1.2.3 </contribute> <contribute>
1.3 <role>initiator</role>
1.3.1 <entity>Johnson</entity>
1.3.2 <date>2004</date>
1.3.3 </contribute> <contribute>
1.4
<role>terminator</role>
1.4.1
<entity>Pierceson</entity>
1.4.2 <date>2008</date>
1.4.3 </contribute> <lifecyle> </lom> * The query "dc.creator = sanderson
and the dc.date =2006 are contained within the same contribute entry"
translates to: dc.creator = sanderson prox/unit=element/distance=1
dc.date=2006. * "sanderson" and
"2006" have the position strings 1.1.2 and 1.1.3, so their distance
is 1 (.2, .3). They are as close as possible in the LOM tree, sharing a common
parent, so they satisfy the "within the same contribute entry"
requirement. * By contrast, "sanderson" and
"2008" have the position strings 1.1.2 and 1.4.3, so their distance
is 2 (.1.2, .4.3). The date 2008 is not associated with contributor Sanderson,
but with a different contributor. Kerry Blinco
|
[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]