search-ws message

Subject: Re: [search-ws] CQL Query on structured data.
From: "Ray Denenberg, Library of Congress" <rden@loc.gov>
To: <search-ws@lists.oasis-open.org>
Date: Thu, 18 Sep 2008 11:05:38 -0400
On the agenda for Monday.

----- Original Message ----- 
From: "LeVan,Ralph" <levan@oclc.org>
To: "Kerry Blinco" <kblinco@powerup.com.au>;
<search-ws@lists.oasis-open.org>
Sent: Thursday, September 18, 2008 10:25 AM
Subject: RE: [search-ws] CQL Query on structured data.


I'd really hate to see us try to shoehorn structure searching into CQL.  I'd
rather see support for XQuery as the query grammar in SRU than make CQL do
everything that it already does plus everything that XQuery does.



Ralph



From: Kerry Blinco [mailto:kblinco@powerup.com.au]
Sent: Tuesday, September 09, 2008 5:47 AM
To: search-ws@lists.oasis-open.org
Subject: [search-ws] CQL Query on structured data.



Ray,



I am copying to you the unedited version of the FRED project (Federated
Repositories for Education) (http://fred.usq.edu.au/) use case and a
proposed solution based on proximity using element and an abstract tree
search.



The PQL solutions are different and they need to state their own case I
think - Hopefully we can get them engaged in this after next week.



The FRED description  is something I could send to you quickly as requested.



LOM CQL  Documentation from the FRED project



Problem Statement ś



Need to be able to perform under CQL queries sensitive to the structure of
the underlying LOM; e.g.



    * dc.creator = sanderson and the dc.date =2006 are contained within the
same contribute container.

    * x is author and y is editor and y edited at least 4 months after x
authored

    * dc.creator is in the second grandchild of the grandfather of a node
with dc.date = 2006



Australian Education Stakeholders have indicated that such structural
queries (at least in their simplest forms) are important.



In LOM in particular,



    * The ordering of nodes is often undefined by the standard, and so
searches cannot rely on relative order.

    * Many elements can have an open number of children.



Note that such structural queries are perceived to compromise the
abstractness of CQL queries. For instance, a query are contained within the
same contribute entry does not make sense unless the underlying record can
be represented in LOM: the proximity is evaluated specifically in the
context of the LOM information model.  Even if a "contribute" entry is not
nominated, different schemas may arrange elements very differently. In LOM,
the creator and publisher are both in the contribute container, and are
closer to each other structurally than is the keyword. In Dublin Core,
creator, publisher and keyword are all children of the root node, and so are
equally close to each other structurally. For that reason, if a context set
realises element proximity searches, it can only do so relative to a
specific schema, and cannot claim to be schema neutral, the way normal CQL
searches are.



On the other hand, such structural queries are specific only to the
information model of LOM, and not to any binding of the information model to
a specific presentation.  They are not dependent on the container and leaf
elements being presented through XML, RDF, or Language Independent
Datatypes.

CQL Element Proximity ś



The proposed solution for FRED is to provide context sets that allow for
searches of abstract structure.  The solution below looks at a tree
structure.



SOLUTION:



The CQL context set provides for proximity searches with unit=element.
Proximity will address the simpler structural queries, but not the more
complex:



    * "x is author and y is editor and y edited at least 4 months after x
authored" assumes XPath-like extraction of individual elements, rather than
the fixed indexes of CQL (lom::contribute[child::role=author]::date >=
lom::contribute[child::role=editor]::date + 000400Z ).

    * The query "dc.creator is in the second grandchild of the grandfather
of a node with dc.date = 2006" is also inconsistent with CQL: all indexes in
CQL need to be related to a search term. So CQL might conceivable query
dc.creator = sanderson as a second grandchild, but not dc.creator in
general; this is again an XPath query
(/descendant[child::date=2006]/parent/child/child[position()=2]/self::creato
r).



Neither CQL 1.1 nor CQL 1.2 define how elements are counted or what
proximity means in the context of CQL indices. Neither CQL version outlines
the extent to which the underlying structure of the source LOM may be
preserved after elements are extracted into indexes. CQL 1.2 expressly
states that, though the element unit is defined for the CQL context set, no
semantics for the unit is defined. (Indeed, nor is semantics for any
proximity unit defined.) However, CQL expressly allows that other context
sets define semantics for proximity units.



This has the risk of leading to inconsistent notions of proximity between
different projects and applications.



FRED will develop its own semantics for element proximity searches in the
Australian Education CQL context set.



FRED may develop a sample implementation of element proximity searches.



There are two possible interpretations of element proximity which FRED could
use: textual proximity, and structural proximity.



    * Under textual proximity, elements are tokenised in the same way that
words, sentences etc. are tokenised.

          o A LOM tree is constructed out of the container and leaf nodes in
the LOM document. Where the container may have an unordered number of
subcontainers or leaf nodes, the children in the tree are ordered
arbitrarily.

          o The LOM tree is traversed in-order, and each leaf node visited
is extracted as a token, in the order in which it has been visited.

          o Proximity search counts the number of tokens in the tokenised
tree between elements.

          o The notion of a node is preserved in the search; the notion of
node hierarchy is not.

    * Under structural proximity, the lowest common ancestor of two elements
is determined.

          o A LOM tree is constructed, as above.

          o The two elements being queries are identified in the tree as
leaf nodes.

          o The distance of the leaf nodes to their lowest common ancestor
in the tree is determined.

          o Node hierarchy is preserved.

          o Proximity search counts how many levels in the tree the elements
are removed from their common ancestor (i.e. how many branchings intervene
between them).



The lowest common ancestor definition of proximity corresponds to the kinds
of queries of interest for LOM, which rely on elements being in the same
container. Unlike element tokenisation, it is not sensitive to the ordering
of elements, or the number of elements an aggregate node may contain.



On the other hand, element tokenisation corresponds closely to the
implementation of other proximity searches. That said, the proposed notion
of structural proximity is not an unusual understanding of proximity in the
XML context; cf.
http://www.cs.fiu.edu/~vagelis/publications/Tkde-tree-search.pdf , where
"Keyword Proximity Search in XML Trees" is understood explicitly in terms of
Lowest Common Ancestor.)



Illustration:



Tree 1:



    <a>

      <b>1</b>

      <c>

        <d>2</d>

        <e>3</e>

        <f>4</f>

        <g>

          <h>5</h>

          <i>6</i>

        </g>

        <j>7</j>

      </c>

    </a>



Tree 2:



    <a>

      <c>

        <g>

          <h>5</h>

          <i>6</i>

        </g>

        <d>2</d>

        <j>7</j>

        <e>3</e>

        <f>4</f>

      </c>

      <b>1</b>

    </a>



In LOM, Trees 1 and 2 are identical, since LOM is order-insensitive.



Tree 3:



    <a>

      <b>1</b>

      <c>

        <d>2</d>

        <j>7</j>

     </c>

    </a>



The queries we expect about LOM structure will typically concentrate on
elements belonging to the same container, rather than on what other elements
also belong to the container. Therefore, we would expect a query on the
proximity of <b/> and <j/> to give the same result for Trees 1 and 3. (To
give a LOM example: we expect x is author and y is editor and y edited at
least 4 months after x authored to give the same answer whether or not a
graphic designer is also defined in the Contribute node.

Tokenised LOM tree ś



    * E.g. in an XML binding of LOM, the XML document is tokenised, with the
token breaks being the element delimiters.

    * Only leaf nodes are tokenised, and aggregate nodes are not considered
tokens. E.g. in XML, consecutive delimiters count as a single token break.

    * Tokenisation is exactly parallel to the tokenisation of text by word
and sentence boundaries, already used for textual proximity searches.

    * Proximity searches for elements use the well-established notion of
distance between tokens.



The trees above tokenise as:



    * Tree 1: 1 2 3 4 5 6 7

    * Tree 2: 5 6 2 7 3 4 1

    * Tree 3: 1 2 7



Search results are sensitive to accidents of ordering and of optional
elements. So in the three trees, the distance between 1 and 7 could be 6, 3,
or 2.

Lowest Common Ancestor ś



Work on algorithms to determine lowest common ancestor in XML has been done
(http://www.cs.fiu.edu/~vagelis/publications/Tkde-tree-search.pdf); the
following algorithm is straightforward:



    * Each element in the LOM tree is assigned a position string: a
dot-delimited string of position numbers describing the path from the root
to the element. The leftmost position number is the position of the child of
the root traversed in the path, relative to its siblings, expressed as an
ordinal number; the next position number is the position of the child of the
child of the root, relative to its siblings, and so forth. For example:



    Tree 1:



        <a>

          <b>1</b>        1

          <c>

            <d>2</d>      2.1

            <e>3</e>      2.2

            <f>4</f>      2.3

            <g>

              <h>5</h>    2.4.1

              <i>6</i>    2.4.2

            </g>

            <j>7</j>      2.5

          </c>

        </a>



    * Since each element is assigned a position string in isolation, these
position strings can be indexed externally, and proximity queries may be
transacted by looking up the position strings, without direct reference to
the XML or any other realisation of the LOM information model. This means
that such searches are compatible with an index-based CQL infrastructure.

    * The position strings of the two elements are compared, and the minimum
common prefix determined. For example: a proximity query for <d> and <h>,
involving position strings 2.1 and 2.4.1, has the minimum common prefix 2.

    * The minimum suffix length following the common prefix is determined
for the two elements' position strings. 2.1 and 2.4.1 share the prefix 2,
and after that prefix have the suffixes .1 and .4.1, of length 1 and 2
respectively.

    * The minimum suffix length is the closest distance from one of the
elements to the lowest common ancestor, and represents the number of
branchings in the LOM tree between elements. Since the minimum suffix length
for <d> and <h> is 1, the two elements are contained in the same container,
and no intervening aggregate elements are possible between them: the lowest
common ancestor is the parent of one of the elements.

    * If the position strings are 2.1.4.5.6 and 2.1.7.3, the common prefix
is 2.1, and the minimum suffix length is 2 (.7.3). This means that there is
an intervening node (aggregate element) between the two elements: 2.1.7; the
lowest common ancestor is the grandparent of 2.1.7.3, 2.1. So 2.1.4.5.6 and
2.1.7.3 are less close than 2.1.7.1 and 2.1.7.3, or for that matter 2.1.4
and 2.1.7.3.



Let us illustrate this with LOM instances:



<lom>

  <lifecycle>                             1

    <contribute>                          1.1

      <role>author</role>                 1.1.1

      <entity>Sanderson</entity>          1.1.2

      <date>2006</date>                   1.1.3

    </contribute>

    <contribute>                          1.2

      <role>publisher</role>              1.2.1

      <entity>Fredericksen</entity>       1.2.2

      <date>2007</date>                   1.2.3

    </contribute>

    <contribute>                          1.3

      <role>initiator</role>              1.3.1

      <entity>Johnson</entity>            1.3.2

      <date>2004</date>                   1.3.3

    </contribute>

    <contribute>                          1.4

      <role>terminator</role>             1.4.1

      <entity>Pierceson</entity>          1.4.2

      <date>2008</date>                   1.4.3

    </contribute>

  <lifecyle>

</lom>



    * The query "dc.creator = sanderson and the dc.date =2006 are contained
within the same contribute entry" translates to: dc.creator = sanderson
prox/unit=element/distance=1 dc.date=2006.

    * "sanderson" and "2006" have the position strings 1.1.2 and 1.1.3, so
their distance is 1 (.2, .3). They are as close as possible in the LOM tree,
sharing a common parent, so they satisfy the "within the same contribute
entry" requirement.

    * By contrast, "sanderson" and "2008" have the position strings 1.1.2
and 1.4.3, so their distance is 2 (.1.2, .4.3). The date 2008 is not
associated with contributor Sanderson, but with a different contributor.



Kerry Blinco
e-Framework and Standards Manager, Link Affiliates, University of Southern
Queensland; and
Technical Standards Adviser to the Department of Education Employment and
Workplace Relations (DEEWR).  Australia.
Email:     kblinco@powerup.com.au
Phone:   +61 7 3871 2699
Ph (Mobile) :    +61 419 787 992

The information contained in this e-mail message and any files may
be confidential information, and may also be the subject of legal
professional privilege.
If you think you may not be the intended recipient, or if you have received
this e-mail in error,
please contact the sender immediately and delete all copies of this e-mail.
If you are not the intended
recipient, you must not reproduce any part of this e-mail or disclose its
contents to any other party.

This email represents the views of the individual sender, except where the
sender expressly states otherwise.
Follow-Ups:
- RE: [search-ws] CQL Query on structured data.
  - From: "LeVan,Ralph" <levan@oclc.org>
References:
- CQL Query on structured data.
  - From: Kerry Blinco <kblinco@powerup.com.au>
- RE: [search-ws] CQL Query on structured data.
  - From: "LeVan,Ralph" <levan@oclc.org>