search-ws message

Subject: Re: [search-ws] CQL Query on structured data.
From: "Ray Denenberg, Library of Congress" <rden@loc.gov>
To: <search-ws@lists.oasis-open.org>
Date: Thu, 18 Sep 2008 11:24:09 -0400
If XQuery is to be the approach, we need to do some work to ensure that it
fits within our model.   If it does, then I'm on board with that approach.

SPARQL, that's another story altogether. I'm not sure what Rob has in mind
with that suggestion.  Rob?

--Ray

----- Original Message ----- 
From: "Kerry Blinco" <kblinco@powerup.com.au>
To: "Rob Sanderson" <azaroth@liverpool.ac.uk>
Cc: "LeVan,Ralph" <levan@oclc.org>; <search-ws@lists.oasis-open.org>
Sent: Thursday, September 18, 2008 11:16 AM
Subject: Re: [search-ws] CQL Query on structured data.


Hi

I don't care about the answer so long as there is a way of expressing
an abstract expression of search of structure - this was just an
example of the problem

Kerry


On 19/09/2008, at 12:41 AM, Rob Sanderson wrote:

>
> Or SPARQL.
>
> Rob
>
>
> On Thu, 2008-09-18 at 10:25 -0400, LeVan,Ralph wrote:
>> I’d really hate to see us try to shoehorn structure searching into
>> CQL.  I’d rather see support for XQuery as the query grammar in SRU
>> than make CQL do everything that it already does plus everything that
>> XQuery does.
>>
>>
>>
>> Ralph
>>
>>
>>
>> From: Kerry Blinco [mailto:kblinco@powerup.com.au]
>> Sent: Tuesday, September 09, 2008 5:47 AM
>> To: search-ws@lists.oasis-open.org
>> Subject: [search-ws] CQL Query on structured data.
>>
>>
>>
>>
>> Ray,
>>
>>
>>
>>
>>
>> I am copying to you the unedited version of the FRED project
>> (Federated Repositories for Education) (http://fred.usq.edu.au/) use
>> case and a proposed solution based on proximity using element and an
>> abstract tree search.
>>
>>
>>
>>
>>
>> The PQL solutions are different and they need to state their own case
>> I think - Hopefully we can get them engaged in this after next week.
>>
>>
>>
>>
>>
>> The FRED description  is something I could send to you quickly as
>> requested.
>>
>>
>>
>>
>>
>> LOM CQL  Documentation from the FRED project
>>
>>
>>
>>
>>
>> Problem Statement ¶
>>
>>
>>
>>
>>
>> Need to be able to perform under CQL queries sensitive to the
>> structure of the underlying LOM; e.g.
>>
>>
>>
>>
>>
>>     * dc.creator = sanderson and the dc.date =2006 are contained
>> within the same contribute container.
>>
>>
>>     * x is author and y is editor and y edited at least 4 months
>> after
>> x authored
>>
>>
>>     * dc.creator is in the second grandchild of the grandfather of a
>> node with dc.date = 2006
>>
>>
>>
>>
>>
>> Australian Education Stakeholders have indicated that such structural
>> queries (at least in their simplest forms) are important.
>>
>>
>>
>>
>>
>> In LOM in particular,
>>
>>
>>
>>
>>
>>     * The ordering of nodes is often undefined by the standard,
>> and so
>> searches cannot rely on relative order.
>>
>>
>>     * Many elements can have an open number of children.
>>
>>
>>
>>
>>
>> Note that such structural queries are perceived to compromise the
>> abstractness of CQL queries. For instance, a query are contained
>> within the same contribute entry does not make sense unless the
>> underlying record can be represented in LOM: the proximity is
>> evaluated specifically in the context of the LOM information model.
>>  Even if a "contribute" entry is not nominated, different schemas may
>> arrange elements very differently. In LOM, the creator and publisher
>> are both in the contribute container, and are closer to each other
>> structurally than is the keyword. In Dublin Core, creator, publisher
>> and keyword are all children of the root node, and so are equally
>> close to each other structurally. For that reason, if a context set
>> realises element proximity searches, it can only do so relative to a
>> specific schema, and cannot claim to be schema neutral, the way
>> normal
>> CQL searches are.
>>
>>
>>
>>
>>
>> On the other hand, such structural queries are specific only to the
>> information model of LOM, and not to any binding of the information
>> model to a specific presentation.  They are not dependent on the
>> container and leaf elements being presented through XML, RDF, or
>> Language Independent Datatypes.
>>
>>
>> CQL Element Proximity ¶
>>
>>
>>
>>
>>
>> The proposed solution for FRED is to provide context sets that allow
>> for searches of abstract structure.  The solution below looks at a
>> tree structure.
>>
>>
>>
>>
>>
>> SOLUTION:
>>
>>
>>
>>
>>
>> The CQL context set provides for proximity searches with
>> unit=element.
>> Proximity will address the simpler structural queries, but not the
>> more complex:
>>
>>
>>
>>
>>
>>     * "x is author and y is editor and y edited at least 4 months
>> after x authored" assumes XPath-like extraction of individual
>> elements, rather than the fixed indexes of CQL
>> (lom::contribute[child::role=author]::date >=
>> lom::contribute[child::role=editor]::date + 000400Z ).
>>
>>
>>     * The query "dc.creator is in the second grandchild of the
>> grandfather of a node with dc.date = 2006" is also inconsistent with
>> CQL: all indexes in CQL need to be related to a search term. So CQL
>> might conceivable query dc.creator = sanderson as a second
>> grandchild,
>> but not dc.creator in general; this is again an XPath query
>> (/descendant[child::date=2006]/parent/child/child[position()=2]/
>> self::creator).
>>
>>
>>
>>
>>
>> Neither CQL 1.1 nor CQL 1.2 define how elements are counted or what
>> proximity means in the context of CQL indices. Neither CQL version
>> outlines the extent to which the underlying structure of the source
>> LOM may be preserved after elements are extracted into indexes. CQL
>> 1.2 expressly states that, though the element unit is defined for the
>> CQL context set, no semantics for the unit is defined. (Indeed,
>> nor is
>> semantics for any proximity unit defined.) However, CQL expressly
>> allows that other context sets define semantics for proximity units.
>>
>>
>>
>>
>>
>> This has the risk of leading to inconsistent notions of proximity
>> between different projects and applications.
>>
>>
>>
>>
>>
>> FRED will develop its own semantics for element proximity searches in
>> the Australian Education CQL context set.
>>
>>
>>
>>
>>
>> FRED may develop a sample implementation of element proximity
>> searches.
>>
>>
>>
>>
>>
>> There are two possible interpretations of element proximity which
>> FRED
>> could use: textual proximity, and structural proximity.
>>
>>
>>
>>
>>
>>     * Under textual proximity, elements are tokenised in the same way
>> that words, sentences etc. are tokenised.
>>
>>
>>           o A LOM tree is constructed out of the container and leaf
>> nodes in the LOM document. Where the container may have an unordered
>> number of subcontainers or leaf nodes, the children in the tree are
>> ordered arbitrarily.
>>
>>
>>           o The LOM tree is traversed in-order, and each leaf node
>> visited is extracted as a token, in the order in which it has been
>> visited.
>>
>>
>>           o Proximity search counts the number of tokens in the
>> tokenised tree between elements.
>>
>>
>>           o The notion of a node is preserved in the search; the
>> notion of node hierarchy is not.
>>
>>
>>     * Under structural proximity, the lowest common ancestor of two
>> elements is determined.
>>
>>
>>           o A LOM tree is constructed, as above.
>>
>>
>>           o The two elements being queries are identified in the tree
>> as leaf nodes.
>>
>>
>>           o The distance of the leaf nodes to their lowest common
>> ancestor in the tree is determined.
>>
>>
>>           o Node hierarchy is preserved.
>>
>>
>>           o Proximity search counts how many levels in the tree the
>> elements are removed from their common ancestor (i.e. how many
>> branchings intervene between them).
>>
>>
>>
>>
>>
>> The lowest common ancestor definition of proximity corresponds to the
>> kinds of queries of interest for LOM, which rely on elements being in
>> the same container. Unlike element tokenisation, it is not sensitive
>> to the ordering of elements, or the number of elements an aggregate
>> node may contain.
>>
>>
>>
>>
>>
>> On the other hand, element tokenisation corresponds closely to the
>> implementation of other proximity searches. That said, the proposed
>> notion of structural proximity is not an unusual understanding of
>> proximity in the XML context; cf.
>> http://www.cs.fiu.edu/~vagelis/publications/Tkde-tree-search.pdf ,
>> where "Keyword Proximity Search in XML Trees" is understood
>> explicitly
>> in terms of Lowest Common Ancestor.)
>>
>>
>>
>>
>>
>> Illustration:
>>
>>
>>
>>
>>
>> Tree 1:
>>
>>
>>
>>
>>
>>     <a>
>>
>>
>>       <b>1</b>
>>
>>
>>       <c>
>>
>>
>>         <d>2</d>
>>
>>
>>         <e>3</e>
>>
>>
>>         <f>4</f>
>>
>>
>>         <g>
>>
>>
>>           <h>5</h>
>>
>>
>>           <i>6</i>
>>
>>
>>         </g>
>>
>>
>>         <j>7</j>
>>
>>
>>       </c>
>>
>>
>>     </a>
>>
>>
>>
>>
>>
>> Tree 2:
>>
>>
>>
>>
>>
>>     <a>
>>
>>
>>       <c>
>>
>>
>>         <g>
>>
>>
>>           <h>5</h>
>>
>>
>>           <i>6</i>
>>
>>
>>         </g>
>>
>>
>>         <d>2</d>
>>
>>
>>         <j>7</j>
>>
>>
>>         <e>3</e>
>>
>>
>>         <f>4</f>
>>
>>
>>       </c>
>>
>>
>>       <b>1</b>
>>
>>
>>     </a>
>>
>>
>>
>>
>>
>> In LOM, Trees 1 and 2 are identical, since LOM is order-insensitive.
>>
>>
>>
>>
>>
>> Tree 3:
>>
>>
>>
>>
>>
>>     <a>
>>
>>
>>       <b>1</b>
>>
>>
>>       <c>
>>
>>
>>         <d>2</d>
>>
>>
>>         <j>7</j>
>>
>>
>>      </c>
>>
>>
>>     </a>
>>
>>
>>
>>
>>
>> The queries we expect about LOM structure will typically concentrate
>> on elements belonging to the same container, rather than on what
>> other
>> elements also belong to the container. Therefore, we would expect a
>> query on the proximity of <b/> and <j/> to give the same result for
>> Trees 1 and 3. (To give a LOM example: we expect x is author and y is
>> editor and y edited at least 4 months after x authored to give the
>> same answer whether or not a graphic designer is also defined in the
>> Contribute node.
>>
>>
>> Tokenised LOM tree ¶
>>
>>
>>
>>
>>
>>     * E.g. in an XML binding of LOM, the XML document is tokenised,
>> with the token breaks being the element delimiters.
>>
>>
>>     * Only leaf nodes are tokenised, and aggregate nodes are not
>> considered tokens. E.g. in XML, consecutive delimiters count as a
>> single token break.
>>
>>
>>     * Tokenisation is exactly parallel to the tokenisation of text by
>> word and sentence boundaries, already used for textual proximity
>> searches.
>>
>>
>>     * Proximity searches for elements use the well-established notion
>> of distance between tokens.
>>
>>
>>
>>
>>
>> The trees above tokenise as:
>>
>>
>>
>>
>>
>>     * Tree 1: 1 2 3 4 5 6 7
>>
>>
>>     * Tree 2: 5 6 2 7 3 4 1
>>
>>
>>     * Tree 3: 1 2 7
>>
>>
>>
>>
>>
>> Search results are sensitive to accidents of ordering and of optional
>> elements. So in the three trees, the distance between 1 and 7
>> could be
>> 6, 3, or 2.
>>
>>
>> Lowest Common Ancestor ¶
>>
>>
>>
>>
>>
>> Work on algorithms to determine lowest common ancestor in XML has
>> been
>> done
>> (http://www.cs.fiu.edu/~vagelis/publications/Tkde-tree-search.pdf);
>> the following algorithm is straightforward:
>>
>>
>>
>>
>>
>>     * Each element in the LOM tree is assigned a position string: a
>> dot-delimited string of position numbers describing the path from the
>> root to the element. The leftmost position number is the position of
>> the child of the root traversed in the path, relative to its
>> siblings,
>> expressed as an ordinal number; the next position number is the
>> position of the child of the child of the root, relative to its
>> siblings, and so forth. For example:
>>
>>
>>
>>
>>
>>     Tree 1:
>>
>>
>>
>>
>>
>>         <a>
>>
>>
>>           <b>1</b>        1
>>
>>
>>           <c>
>>
>>
>>             <d>2</d>      2.1
>>
>>
>>             <e>3</e>      2.2
>>
>>
>>             <f>4</f>      2.3
>>
>>
>>             <g>
>>
>>
>>               <h>5</h>    2.4.1
>>
>>
>>               <i>6</i>    2.4.2
>>
>>
>>             </g>
>>
>>
>>             <j>7</j>      2.5
>>
>>
>>           </c>
>>
>>
>>         </a>
>>
>>
>>
>>
>>
>>     * Since each element is assigned a position string in isolation,
>> these position strings can be indexed externally, and proximity
>> queries may be transacted by looking up the position strings, without
>> direct reference to the XML or any other realisation of the LOM
>> information model. This means that such searches are compatible with
>> an index-based CQL infrastructure.
>>
>>
>>     * The position strings of the two elements are compared, and the
>> minimum common prefix determined. For example: a proximity query for
>> <d> and <h>, involving position strings 2.1 and 2.4.1, has the
>> minimum
>> common prefix 2.
>>
>>
>>     * The minimum suffix length following the common prefix is
>> determined for the two elements' position strings. 2.1 and 2.4.1
>> share
>> the prefix 2, and after that prefix have the suffixes .1 and .4.1, of
>> length 1 and 2 respectively.
>>
>>
>>     * The minimum suffix length is the closest distance from one of
>> the elements to the lowest common ancestor, and represents the number
>> of branchings in the LOM tree between elements. Since the minimum
>> suffix length for <d> and <h> is 1, the two elements are contained in
>> the same container, and no intervening aggregate elements are
>> possible
>> between them: the lowest common ancestor is the parent of one of the
>> elements.
>>
>>
>>     * If the position strings are 2.1.4.5.6 and 2.1.7.3, the common
>> prefix is 2.1, and the minimum suffix length is 2 (.7.3). This means
>> that there is an intervening node (aggregate element) between the two
>> elements: 2.1.7; the lowest common ancestor is the grandparent of
>> 2.1.7.3, 2.1. So 2.1.4.5.6 and 2.1.7.3 are less close than 2.1.7.1
>> and
>> 2.1.7.3, or for that matter 2.1.4 and 2.1.7.3.
>>
>>
>>
>>
>>
>> Let us illustrate this with LOM instances:
>>
>>
>>
>>
>>
>> <lom>
>>
>>
>>   <lifecycle>                             1
>>
>>
>>     <contribute>                          1.1
>>
>>
>>       <role>author</role>                 1.1.1
>>
>>
>>       <entity>Sanderson</entity>          1.1.2
>>
>>
>>       <date>2006</date>                   1.1.3
>>
>>
>>     </contribute>
>>
>>
>>     <contribute>                          1.2
>>
>>
>>       <role>publisher</role>              1.2.1
>>
>>
>>       <entity>Fredericksen</entity>       1.2.2
>>
>>
>>       <date>2007</date>                   1.2.3
>>
>>
>>     </contribute>
>>
>>
>>     <contribute>                          1.3
>>
>>
>>       <role>initiator</role>              1.3.1
>>
>>
>>       <entity>Johnson</entity>            1.3.2
>>
>>
>>       <date>2004</date>                   1.3.3
>>
>>
>>     </contribute>
>>
>>
>>     <contribute>                          1.4
>>
>>
>>       <role>terminator</role>             1.4.1
>>
>>
>>       <entity>Pierceson</entity>          1.4.2
>>
>>
>>       <date>2008</date>                   1.4.3
>>
>>
>>     </contribute>
>>
>>
>>   <lifecyle>
>>
>>
>> </lom>
>>
>>
>>
>>
>>
>>     * The query "dc.creator = sanderson and the dc.date =2006 are
>> contained within the same contribute entry" translates to: dc.creator
>> = sanderson prox/unit=element/distance=1 dc.date=2006.
>>
>>
>>     * "sanderson" and "2006" have the position strings 1.1.2 and
>> 1.1.3, so their distance is 1 (.2, .3). They are as close as possible
>> in the LOM tree, sharing a common parent, so they satisfy the "within
>> the same contribute entry" requirement.
>>
>>
>>     * By contrast, "sanderson" and "2008" have the position strings
>> 1.1.2 and 1.4.3, so their distance is 2 (.1.2, .4.3). The date
>> 2008 is
>> not associated with contributor Sanderson, but with a different
>> contributor.
>>
>>
>>
>>
>>
>> Kerry Blinco
>> e-Framework and Standards Manager, Link Affiliates, University of
>> Southern Queensland; and
>> Technical Standards Adviser to the Department of Education Employment
>> and Workplace Relations (DEEWR).  Australia.
>> Email:     kblinco@powerup.com.au
>> Phone:   +61 7 3871 2699
>> Ph (Mobile) :    +61 419 787 992
>>
>> The information contained in this e-mail message and any files may
>> be confidential information, and may also be the subject of legal
>> professional privilege.
>> If you think you may not be the intended recipient, or if you have
>> received this e-mail in error,
>> please contact the sender immediately and delete all copies of this
>> e-mail. If you are not the intended
>> recipient, you must not reproduce any part of this e-mail or disclose
>> its contents to any other party.
>>
>> This email represents the views of the individual sender, except
>> where
>> the sender expressly states otherwise.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>
>
> ---------------------------------------------------------------------
> To unsubscribe from this mail list, you must leave the OASIS TC that
> generates this mail.  Follow this link to all your TCs in OASIS at:
> https://www.oasis-open.org/apps/org/workgroup/portal/my_workgroups.php
>

Kerry Blinco
e-Framework and Standards Manager, Link Affiliates, University of
Southern Queensland; and
Technical Standards Adviser to the Department of Education Employment
and Workplace Relations (DEEWR).  Australia.
Email:     kblinco@powerup.com.au
Phone:   +61 7 3871 2699
Ph (Mobile) :    +61 419 787 992

The information contained in this e-mail message and any files may
be confidential information, and may also be the subject of legal
professional privilege.
If you think you may not be the intended recipient, or if you have
received this e-mail in error,
please contact the sender immediately and delete all copies of this e-
mail. If you are not the intended
recipient, you must not reproduce any part of this e-mail or disclose
its contents to any other party.

This email represents the views of the individual sender, except
where the sender expressly states otherwise.
Follow-Ups:
- Re: [search-ws] CQL Query on structured data.
  - From: Rob Sanderson <azaroth@liverpool.ac.uk>
References:
- CQL Query on structured data.
  - From: Kerry Blinco <kblinco@powerup.com.au>
- RE: [search-ws] CQL Query on structured data.
  - From: "LeVan,Ralph" <levan@oclc.org>
- RE: [search-ws] CQL Query on structured data.
  - From: Rob Sanderson <azaroth@liverpool.ac.uk>
- Re: [search-ws] CQL Query on structured data.
  - From: Kerry Blinco <kblinco@powerup.com.au>