search-ws message

Subject: RE: [search-ws] CQL Query on structured data.
From: Rob Sanderson <azaroth@liverpool.ac.uk>
To: "LeVan,Ralph" <levan@oclc.org>
Date: Thu, 18 Sep 2008 15:41:00 +0100

Or SPARQL.

Rob


On Thu, 2008-09-18 at 10:25 -0400, LeVan,Ralph wrote:
> I’d really hate to see us try to shoehorn structure searching into
> CQL.  I’d rather see support for XQuery as the query grammar in SRU
> than make CQL do everything that it already does plus everything that
> XQuery does.
> 
>  
> 
> Ralph
> 
>  
> 
> From: Kerry Blinco [mailto:kblinco@powerup.com.au] 
> Sent: Tuesday, September 09, 2008 5:47 AM
> To: search-ws@lists.oasis-open.org
> Subject: [search-ws] CQL Query on structured data.
> 
> 
>  
> 
> Ray,
> 
> 
>  
> 
> 
> I am copying to you the unedited version of the FRED project
> (Federated Repositories for Education) (http://fred.usq.edu.au/) use
> case and a proposed solution based on proximity using element and an
> abstract tree search.
> 
> 
>  
> 
> 
> The PQL solutions are different and they need to state their own case
> I think - Hopefully we can get them engaged in this after next week.  
> 
> 
>  
> 
> 
> The FRED description  is something I could send to you quickly as
> requested.
> 
> 
>  
> 
> 
> LOM CQL  Documentation from the FRED project 
> 
> 
>  
> 
> 
> Problem Statement ¶
> 
> 
>  
> 
> 
> Need to be able to perform under CQL queries sensitive to the
> structure of the underlying LOM; e.g.
> 
> 
>  
> 
> 
>     * dc.creator = sanderson and the dc.date =2006 are contained
> within the same contribute container.
> 
> 
>     * x is author and y is editor and y edited at least 4 months after
> x authored
> 
> 
>     * dc.creator is in the second grandchild of the grandfather of a
> node with dc.date = 2006 
> 
> 
>  
> 
> 
> Australian Education Stakeholders have indicated that such structural
> queries (at least in their simplest forms) are important.
> 
> 
>  
> 
> 
> In LOM in particular,
> 
> 
>  
> 
> 
>     * The ordering of nodes is often undefined by the standard, and so
> searches cannot rely on relative order.
> 
> 
>     * Many elements can have an open number of children. 
> 
> 
>  
> 
> 
> Note that such structural queries are perceived to compromise the
> abstractness of CQL queries. For instance, a query are contained
> within the same contribute entry does not make sense unless the
> underlying record can be represented in LOM: the proximity is
> evaluated specifically in the context of the LOM information model.
>  Even if a "contribute" entry is not nominated, different schemas may
> arrange elements very differently. In LOM, the creator and publisher
> are both in the contribute container, and are closer to each other
> structurally than is the keyword. In Dublin Core, creator, publisher
> and keyword are all children of the root node, and so are equally
> close to each other structurally. For that reason, if a context set
> realises element proximity searches, it can only do so relative to a
> specific schema, and cannot claim to be schema neutral, the way normal
> CQL searches are.
> 
> 
>  
> 
> 
> On the other hand, such structural queries are specific only to the
> information model of LOM, and not to any binding of the information
> model to a specific presentation.  They are not dependent on the
> container and leaf elements being presented through XML, RDF, or
> Language Independent Datatypes.
> 
> 
> CQL Element Proximity ¶  
> 
> 
>  
> 
> 
> The proposed solution for FRED is to provide context sets that allow
> for searches of abstract structure.  The solution below looks at a
> tree structure. 
> 
> 
>  
> 
> 
> SOLUTION: 
> 
> 
>  
> 
> 
> The CQL context set provides for proximity searches with unit=element.
> Proximity will address the simpler structural queries, but not the
> more complex:
> 
> 
>  
> 
> 
>     * "x is author and y is editor and y edited at least 4 months
> after x authored" assumes XPath-like extraction of individual
> elements, rather than the fixed indexes of CQL
> (lom::contribute[child::role=author]::date >=
> lom::contribute[child::role=editor]::date + 000400Z ).
> 
> 
>     * The query "dc.creator is in the second grandchild of the
> grandfather of a node with dc.date = 2006" is also inconsistent with
> CQL: all indexes in CQL need to be related to a search term. So CQL
> might conceivable query dc.creator = sanderson as a second grandchild,
> but not dc.creator in general; this is again an XPath query
> (/descendant[child::date=2006]/parent/child/child[position()=2]/self::creator). 
> 
> 
>  
> 
> 
> Neither CQL 1.1 nor CQL 1.2 define how elements are counted or what
> proximity means in the context of CQL indices. Neither CQL version
> outlines the extent to which the underlying structure of the source
> LOM may be preserved after elements are extracted into indexes. CQL
> 1.2 expressly states that, though the element unit is defined for the
> CQL context set, no semantics for the unit is defined. (Indeed, nor is
> semantics for any proximity unit defined.) However, CQL expressly
> allows that other context sets define semantics for proximity units.
> 
> 
>  
> 
> 
> This has the risk of leading to inconsistent notions of proximity
> between different projects and applications.
> 
> 
>  
> 
> 
> FRED will develop its own semantics for element proximity searches in
> the Australian Education CQL context set.
> 
> 
>  
> 
> 
> FRED may develop a sample implementation of element proximity
> searches.
> 
> 
>  
> 
> 
> There are two possible interpretations of element proximity which FRED
> could use: textual proximity, and structural proximity.
> 
> 
>  
> 
> 
>     * Under textual proximity, elements are tokenised in the same way
> that words, sentences etc. are tokenised.
> 
> 
>           o A LOM tree is constructed out of the container and leaf
> nodes in the LOM document. Where the container may have an unordered
> number of subcontainers or leaf nodes, the children in the tree are
> ordered arbitrarily.
> 
> 
>           o The LOM tree is traversed in-order, and each leaf node
> visited is extracted as a token, in the order in which it has been
> visited.
> 
> 
>           o Proximity search counts the number of tokens in the
> tokenised tree between elements.
> 
> 
>           o The notion of a node is preserved in the search; the
> notion of node hierarchy is not. 
> 
> 
>     * Under structural proximity, the lowest common ancestor of two
> elements is determined.
> 
> 
>           o A LOM tree is constructed, as above.
> 
> 
>           o The two elements being queries are identified in the tree
> as leaf nodes.
> 
> 
>           o The distance of the leaf nodes to their lowest common
> ancestor in the tree is determined.
> 
> 
>           o Node hierarchy is preserved.
> 
> 
>           o Proximity search counts how many levels in the tree the
> elements are removed from their common ancestor (i.e. how many
> branchings intervene between them). 
> 
> 
>  
> 
> 
> The lowest common ancestor definition of proximity corresponds to the
> kinds of queries of interest for LOM, which rely on elements being in
> the same container. Unlike element tokenisation, it is not sensitive
> to the ordering of elements, or the number of elements an aggregate
> node may contain.
> 
> 
>  
> 
> 
> On the other hand, element tokenisation corresponds closely to the
> implementation of other proximity searches. That said, the proposed
> notion of structural proximity is not an unusual understanding of
> proximity in the XML context; cf.
> http://www.cs.fiu.edu/~vagelis/publications/Tkde-tree-search.pdf ,
> where "Keyword Proximity Search in XML Trees" is understood explicitly
> in terms of Lowest Common Ancestor.)
> 
> 
>  
> 
> 
> Illustration:
> 
> 
>  
> 
> 
> Tree 1:
> 
> 
>  
> 
> 
>     <a>
> 
> 
>       <b>1</b>
> 
> 
>       <c>
> 
> 
>         <d>2</d>
> 
> 
>         <e>3</e>
> 
> 
>         <f>4</f>
> 
> 
>         <g>
> 
> 
>           <h>5</h>
> 
> 
>           <i>6</i>
> 
> 
>         </g>
> 
> 
>         <j>7</j>
> 
> 
>       </c>
> 
> 
>     </a>
> 
> 
>  
> 
> 
> Tree 2:
> 
> 
>  
> 
> 
>     <a>
> 
> 
>       <c>
> 
> 
>         <g>
> 
> 
>           <h>5</h>
> 
> 
>           <i>6</i>
> 
> 
>         </g>
> 
> 
>         <d>2</d>
> 
> 
>         <j>7</j>
> 
> 
>         <e>3</e>
> 
> 
>         <f>4</f>
> 
> 
>       </c>
> 
> 
>       <b>1</b>
> 
> 
>     </a>
> 
> 
>  
> 
> 
> In LOM, Trees 1 and 2 are identical, since LOM is order-insensitive.
> 
> 
>  
> 
> 
> Tree 3:
> 
> 
>  
> 
> 
>     <a>
> 
> 
>       <b>1</b>
> 
> 
>       <c>
> 
> 
>         <d>2</d>
> 
> 
>         <j>7</j>
> 
> 
>      </c>
> 
> 
>     </a>
> 
> 
>  
> 
> 
> The queries we expect about LOM structure will typically concentrate
> on elements belonging to the same container, rather than on what other
> elements also belong to the container. Therefore, we would expect a
> query on the proximity of <b/> and <j/> to give the same result for
> Trees 1 and 3. (To give a LOM example: we expect x is author and y is
> editor and y edited at least 4 months after x authored to give the
> same answer whether or not a graphic designer is also defined in the
> Contribute node.
> 
> 
> Tokenised LOM tree ¶
> 
> 
>  
> 
> 
>     * E.g. in an XML binding of LOM, the XML document is tokenised,
> with the token breaks being the element delimiters.
> 
> 
>     * Only leaf nodes are tokenised, and aggregate nodes are not
> considered tokens. E.g. in XML, consecutive delimiters count as a
> single token break.
> 
> 
>     * Tokenisation is exactly parallel to the tokenisation of text by
> word and sentence boundaries, already used for textual proximity
> searches.
> 
> 
>     * Proximity searches for elements use the well-established notion
> of distance between tokens. 
> 
> 
>  
> 
> 
> The trees above tokenise as:
> 
> 
>  
> 
> 
>     * Tree 1: 1 2 3 4 5 6 7
> 
> 
>     * Tree 2: 5 6 2 7 3 4 1
> 
> 
>     * Tree 3: 1 2 7 
> 
> 
>  
> 
> 
> Search results are sensitive to accidents of ordering and of optional
> elements. So in the three trees, the distance between 1 and 7 could be
> 6, 3, or 2.
> 
> 
> Lowest Common Ancestor ¶
> 
> 
>  
> 
> 
> Work on algorithms to determine lowest common ancestor in XML has been
> done
> (http://www.cs.fiu.edu/~vagelis/publications/Tkde-tree-search.pdf);
> the following algorithm is straightforward:
> 
> 
>  
> 
> 
>     * Each element in the LOM tree is assigned a position string: a
> dot-delimited string of position numbers describing the path from the
> root to the element. The leftmost position number is the position of
> the child of the root traversed in the path, relative to its siblings,
> expressed as an ordinal number; the next position number is the
> position of the child of the child of the root, relative to its
> siblings, and so forth. For example: 
> 
> 
>  
> 
> 
>     Tree 1:
> 
> 
>  
> 
> 
>         <a>
> 
> 
>           <b>1</b>        1
> 
> 
>           <c>
> 
> 
>             <d>2</d>      2.1
> 
> 
>             <e>3</e>      2.2
> 
> 
>             <f>4</f>      2.3
> 
> 
>             <g>
> 
> 
>               <h>5</h>    2.4.1
> 
> 
>               <i>6</i>    2.4.2
> 
> 
>             </g>
> 
> 
>             <j>7</j>      2.5
> 
> 
>           </c>
> 
> 
>         </a>
> 
> 
>  
> 
> 
>     * Since each element is assigned a position string in isolation,
> these position strings can be indexed externally, and proximity
> queries may be transacted by looking up the position strings, without
> direct reference to the XML or any other realisation of the LOM
> information model. This means that such searches are compatible with
> an index-based CQL infrastructure.
> 
> 
>     * The position strings of the two elements are compared, and the
> minimum common prefix determined. For example: a proximity query for
> <d> and <h>, involving position strings 2.1 and 2.4.1, has the minimum
> common prefix 2.
> 
> 
>     * The minimum suffix length following the common prefix is
> determined for the two elements' position strings. 2.1 and 2.4.1 share
> the prefix 2, and after that prefix have the suffixes .1 and .4.1, of
> length 1 and 2 respectively.
> 
> 
>     * The minimum suffix length is the closest distance from one of
> the elements to the lowest common ancestor, and represents the number
> of branchings in the LOM tree between elements. Since the minimum
> suffix length for <d> and <h> is 1, the two elements are contained in
> the same container, and no intervening aggregate elements are possible
> between them: the lowest common ancestor is the parent of one of the
> elements.
> 
> 
>     * If the position strings are 2.1.4.5.6 and 2.1.7.3, the common
> prefix is 2.1, and the minimum suffix length is 2 (.7.3). This means
> that there is an intervening node (aggregate element) between the two
> elements: 2.1.7; the lowest common ancestor is the grandparent of
> 2.1.7.3, 2.1. So 2.1.4.5.6 and 2.1.7.3 are less close than 2.1.7.1 and
> 2.1.7.3, or for that matter 2.1.4 and 2.1.7.3. 
> 
> 
>  
> 
> 
> Let us illustrate this with LOM instances:
> 
> 
>  
> 
> 
> <lom>
> 
> 
>   <lifecycle>                             1
> 
> 
>     <contribute>                          1.1
> 
> 
>       <role>author</role>                 1.1.1
> 
> 
>       <entity>Sanderson</entity>          1.1.2
> 
> 
>       <date>2006</date>                   1.1.3
> 
> 
>     </contribute>
> 
> 
>     <contribute>                          1.2 
> 
> 
>       <role>publisher</role>              1.2.1
> 
> 
>       <entity>Fredericksen</entity>       1.2.2
> 
> 
>       <date>2007</date>                   1.2.3
> 
> 
>     </contribute>
> 
> 
>     <contribute>                          1.3
> 
> 
>       <role>initiator</role>              1.3.1
> 
> 
>       <entity>Johnson</entity>            1.3.2
> 
> 
>       <date>2004</date>                   1.3.3
> 
> 
>     </contribute>
> 
> 
>     <contribute>                          1.4
> 
> 
>       <role>terminator</role>             1.4.1
> 
> 
>       <entity>Pierceson</entity>          1.4.2
> 
> 
>       <date>2008</date>                   1.4.3
> 
> 
>     </contribute>
> 
> 
>   <lifecyle>
> 
> 
> </lom>
> 
> 
>  
> 
> 
>     * The query "dc.creator = sanderson and the dc.date =2006 are
> contained within the same contribute entry" translates to: dc.creator
> = sanderson prox/unit=element/distance=1 dc.date=2006.
> 
> 
>     * "sanderson" and "2006" have the position strings 1.1.2 and
> 1.1.3, so their distance is 1 (.2, .3). They are as close as possible
> in the LOM tree, sharing a common parent, so they satisfy the "within
> the same contribute entry" requirement.
> 
> 
>     * By contrast, "sanderson" and "2008" have the position strings
> 1.1.2 and 1.4.3, so their distance is 2 (.1.2, .4.3). The date 2008 is
> not associated with contributor Sanderson, but with a different
> contributor. 
> 
> 
>  
> 
> 
> Kerry Blinco
> e-Framework and Standards Manager, Link Affiliates, University of
> Southern Queensland; and
> Technical Standards Adviser to the Department of Education Employment
> and Workplace Relations (DEEWR).  Australia.
> Email:     kblinco@powerup.com.au
> Phone:   +61 7 3871 2699              
> Ph (Mobile) :    +61 419 787 992
> 
> The information contained in this e-mail message and any files may
> be confidential information, and may also be the subject of legal
> professional privilege.  
> If you think you may not be the intended recipient, or if you have
> received this e-mail in error, 
> please contact the sender immediately and delete all copies of this
> e-mail. If you are not the intended 
> recipient, you must not reproduce any part of this e-mail or disclose
> its contents to any other party.
> 
> This email represents the views of the individual sender, except where
> the sender expressly states otherwise.
> 
>  
> 
> 
>  
> 
> 
> 
> 
>  
> 
>
Follow-Ups:
- Re: [search-ws] CQL Query on structured data.
  - From: Kerry Blinco <kblinco@powerup.com.au>
References:
- CQL Query on structured data.
  - From: Kerry Blinco <kblinco@powerup.com.au>
- RE: [search-ws] CQL Query on structured data.
  - From: "LeVan,Ralph" <levan@oclc.org>