search-ws message

Subject: RE: [search-ws] CQL Query on structured data.
From: "LeVan,Ralph" <levan@oclc.org>
To: "Ray Denenberg, Library of Congress" <rden@loc.gov>,<search-ws@lists.oasis-open.org>
Date: Thu, 18 Sep 2008 11:14:24 -0400
Sadly, I won't be on the Monday call.

Ralph

> -----Original Message-----
> From: Ray Denenberg, Library of Congress [mailto:rden@loc.gov]
> Sent: Thursday, September 18, 2008 11:06 AM
> To: search-ws@lists.oasis-open.org
> Subject: Re: [search-ws] CQL Query on structured data.
> 
> On the agenda for Monday.
> 
> ----- Original Message -----
> From: "LeVan,Ralph" <levan@oclc.org>
> To: "Kerry Blinco" <kblinco@powerup.com.au>;
> <search-ws@lists.oasis-open.org>
> Sent: Thursday, September 18, 2008 10:25 AM
> Subject: RE: [search-ws] CQL Query on structured data.
> 
> 
> I'd really hate to see us try to shoehorn structure searching into CQL.
> I'd
> rather see support for XQuery as the query grammar in SRU than make CQL
> do
> everything that it already does plus everything that XQuery does.
> 
> 
> 
> Ralph
> 
> 
> 
> From: Kerry Blinco [mailto:kblinco@powerup.com.au]
> Sent: Tuesday, September 09, 2008 5:47 AM
> To: search-ws@lists.oasis-open.org
> Subject: [search-ws] CQL Query on structured data.
> 
> 
> 
> Ray,
> 
> 
> 
> I am copying to you the unedited version of the FRED project (Federated
> Repositories for Education) (http://fred.usq.edu.au/) use case and a
> proposed solution based on proximity using element and an abstract tree
> search.
> 
> 
> 
> The PQL solutions are different and they need to state their own case I
> think - Hopefully we can get them engaged in this after next week.
> 
> 
> 
> The FRED description  is something I could send to you quickly as
> requested.
> 
> 
> 
> LOM CQL  Documentation from the FRED project
> 
> 
> 
> Problem Statement ś
> 
> 
> 
> Need to be able to perform under CQL queries sensitive to the structure
> of
> the underlying LOM; e.g.
> 
> 
> 
>     * dc.creator = sanderson and the dc.date =2006 are contained within
> the
> same contribute container.
> 
>     * x is author and y is editor and y edited at least 4 months after
> x
> authored
> 
>     * dc.creator is in the second grandchild of the grandfather of a
> node
> with dc.date = 2006
> 
> 
> 
> Australian Education Stakeholders have indicated that such structural
> queries (at least in their simplest forms) are important.
> 
> 
> 
> In LOM in particular,
> 
> 
> 
>     * The ordering of nodes is often undefined by the standard, and so
> searches cannot rely on relative order.
> 
>     * Many elements can have an open number of children.
> 
> 
> 
> Note that such structural queries are perceived to compromise the
> abstractness of CQL queries. For instance, a query are contained within
> the
> same contribute entry does not make sense unless the underlying record
> can
> be represented in LOM: the proximity is evaluated specifically in the
> context of the LOM information model.  Even if a "contribute" entry is
> not
> nominated, different schemas may arrange elements very differently. In
> LOM,
> the creator and publisher are both in the contribute container, and are
> closer to each other structurally than is the keyword. In Dublin Core,
> creator, publisher and keyword are all children of the root node, and
> so are
> equally close to each other structurally. For that reason, if a context
> set
> realises element proximity searches, it can only do so relative to a
> specific schema, and cannot claim to be schema neutral, the way normal
> CQL
> searches are.
> 
> 
> 
> On the other hand, such structural queries are specific only to the
> information model of LOM, and not to any binding of the information
> model to
> a specific presentation.  They are not dependent on the container and
> leaf
> elements being presented through XML, RDF, or Language Independent
> Datatypes.
> 
> CQL Element Proximity ś
> 
> 
> 
> The proposed solution for FRED is to provide context sets that allow
> for
> searches of abstract structure.  The solution below looks at a tree
> structure.
> 
> 
> 
> SOLUTION:
> 
> 
> 
> The CQL context set provides for proximity searches with unit=element.
> Proximity will address the simpler structural queries, but not the more
> complex:
> 
> 
> 
>     * "x is author and y is editor and y edited at least 4 months after
> x
> authored" assumes XPath-like extraction of individual elements, rather
> than
> the fixed indexes of CQL (lom::contribute[child::role=author]::date >=
> lom::contribute[child::role=editor]::date + 000400Z ).
> 
>     * The query "dc.creator is in the second grandchild of the
> grandfather
> of a node with dc.date = 2006" is also inconsistent with CQL: all
> indexes in
> CQL need to be related to a search term. So CQL might conceivable query
> dc.creator = sanderson as a second grandchild, but not dc.creator in
> general; this is again an XPath query
> (/descendant[child::date=2006]/parent/child/child[position()=2]/self::c
> reato
> r).
> 
> 
> 
> Neither CQL 1.1 nor CQL 1.2 define how elements are counted or what
> proximity means in the context of CQL indices. Neither CQL version
> outlines
> the extent to which the underlying structure of the source LOM may be
> preserved after elements are extracted into indexes. CQL 1.2 expressly
> states that, though the element unit is defined for the CQL context
> set, no
> semantics for the unit is defined. (Indeed, nor is semantics for any
> proximity unit defined.) However, CQL expressly allows that other
> context
> sets define semantics for proximity units.
> 
> 
> 
> This has the risk of leading to inconsistent notions of proximity
> between
> different projects and applications.
> 
> 
> 
> FRED will develop its own semantics for element proximity searches in
> the
> Australian Education CQL context set.
> 
> 
> 
> FRED may develop a sample implementation of element proximity searches.
> 
> 
> 
> There are two possible interpretations of element proximity which FRED
> could
> use: textual proximity, and structural proximity.
> 
> 
> 
>     * Under textual proximity, elements are tokenised in the same way
> that
> words, sentences etc. are tokenised.
> 
>           o A LOM tree is constructed out of the container and leaf
> nodes in
> the LOM document. Where the container may have an unordered number of
> subcontainers or leaf nodes, the children in the tree are ordered
> arbitrarily.
> 
>           o The LOM tree is traversed in-order, and each leaf node
> visited
> is extracted as a token, in the order in which it has been visited.
> 
>           o Proximity search counts the number of tokens in the
> tokenised
> tree between elements.
> 
>           o The notion of a node is preserved in the search; the notion
> of
> node hierarchy is not.
> 
>     * Under structural proximity, the lowest common ancestor of two
> elements
> is determined.
> 
>           o A LOM tree is constructed, as above.
> 
>           o The two elements being queries are identified in the tree
> as
> leaf nodes.
> 
>           o The distance of the leaf nodes to their lowest common
> ancestor
> in the tree is determined.
> 
>           o Node hierarchy is preserved.
> 
>           o Proximity search counts how many levels in the tree the
> elements
> are removed from their common ancestor (i.e. how many branchings
> intervene
> between them).
> 
> 
> 
> The lowest common ancestor definition of proximity corresponds to the
> kinds
> of queries of interest for LOM, which rely on elements being in the
> same
> container. Unlike element tokenisation, it is not sensitive to the
> ordering
> of elements, or the number of elements an aggregate node may contain.
> 
> 
> 
> On the other hand, element tokenisation corresponds closely to the
> implementation of other proximity searches. That said, the proposed
> notion
> of structural proximity is not an unusual understanding of proximity in
> the
> XML context; cf.
> http://www.cs.fiu.edu/~vagelis/publications/Tkde-tree-search.pdf ,
> where
> "Keyword Proximity Search in XML Trees" is understood explicitly in
> terms of
> Lowest Common Ancestor.)
> 
> 
> 
> Illustration:
> 
> 
> 
> Tree 1:
> 
> 
> 
>     <a>
> 
>       <b>1</b>
> 
>       <c>
> 
>         <d>2</d>
> 
>         <e>3</e>
> 
>         <f>4</f>
> 
>         <g>
> 
>           <h>5</h>
> 
>           <i>6</i>
> 
>         </g>
> 
>         <j>7</j>
> 
>       </c>
> 
>     </a>
> 
> 
> 
> Tree 2:
> 
> 
> 
>     <a>
> 
>       <c>
> 
>         <g>
> 
>           <h>5</h>
> 
>           <i>6</i>
> 
>         </g>
> 
>         <d>2</d>
> 
>         <j>7</j>
> 
>         <e>3</e>
> 
>         <f>4</f>
> 
>       </c>
> 
>       <b>1</b>
> 
>     </a>
> 
> 
> 
> In LOM, Trees 1 and 2 are identical, since LOM is order-insensitive.
> 
> 
> 
> Tree 3:
> 
> 
> 
>     <a>
> 
>       <b>1</b>
> 
>       <c>
> 
>         <d>2</d>
> 
>         <j>7</j>
> 
>      </c>
> 
>     </a>
> 
> 
> 
> The queries we expect about LOM structure will typically concentrate on
> elements belonging to the same container, rather than on what other
> elements
> also belong to the container. Therefore, we would expect a query on the
> proximity of <b/> and <j/> to give the same result for Trees 1 and 3.
> (To
> give a LOM example: we expect x is author and y is editor and y edited
> at
> least 4 months after x authored to give the same answer whether or not
> a
> graphic designer is also defined in the Contribute node.
> 
> Tokenised LOM tree ś
> 
> 
> 
>     * E.g. in an XML binding of LOM, the XML document is tokenised,
> with the
> token breaks being the element delimiters.
> 
>     * Only leaf nodes are tokenised, and aggregate nodes are not
> considered
> tokens. E.g. in XML, consecutive delimiters count as a single token
> break.
> 
>     * Tokenisation is exactly parallel to the tokenisation of text by
> word
> and sentence boundaries, already used for textual proximity searches.
> 
>     * Proximity searches for elements use the well-established notion
> of
> distance between tokens.
> 
> 
> 
> The trees above tokenise as:
> 
> 
> 
>     * Tree 1: 1 2 3 4 5 6 7
> 
>     * Tree 2: 5 6 2 7 3 4 1
> 
>     * Tree 3: 1 2 7
> 
> 
> 
> Search results are sensitive to accidents of ordering and of optional
> elements. So in the three trees, the distance between 1 and 7 could be
> 6, 3,
> or 2.
> 
> Lowest Common Ancestor ś
> 
> 
> 
> Work on algorithms to determine lowest common ancestor in XML has been
> done
> (http://www.cs.fiu.edu/~vagelis/publications/Tkde-tree-search.pdf); the
> following algorithm is straightforward:
> 
> 
> 
>     * Each element in the LOM tree is assigned a position string: a
> dot-delimited string of position numbers describing the path from the
> root
> to the element. The leftmost position number is the position of the
> child of
> the root traversed in the path, relative to its siblings, expressed as
> an
> ordinal number; the next position number is the position of the child
> of the
> child of the root, relative to its siblings, and so forth. For example:
> 
> 
> 
>     Tree 1:
> 
> 
> 
>         <a>
> 
>           <b>1</b>        1
> 
>           <c>
> 
>             <d>2</d>      2.1
> 
>             <e>3</e>      2.2
> 
>             <f>4</f>      2.3
> 
>             <g>
> 
>               <h>5</h>    2.4.1
> 
>               <i>6</i>    2.4.2
> 
>             </g>
> 
>             <j>7</j>      2.5
> 
>           </c>
> 
>         </a>
> 
> 
> 
>     * Since each element is assigned a position string in isolation,
> these
> position strings can be indexed externally, and proximity queries may
> be
> transacted by looking up the position strings, without direct reference
> to
> the XML or any other realisation of the LOM information model. This
> means
> that such searches are compatible with an index-based CQL
> infrastructure.
> 
>     * The position strings of the two elements are compared, and the
> minimum
> common prefix determined. For example: a proximity query for <d> and
> <h>,
> involving position strings 2.1 and 2.4.1, has the minimum common prefix
> 2.
> 
>     * The minimum suffix length following the common prefix is
> determined
> for the two elements' position strings. 2.1 and 2.4.1 share the prefix
> 2,
> and after that prefix have the suffixes .1 and .4.1, of length 1 and 2
> respectively.
> 
>     * The minimum suffix length is the closest distance from one of the
> elements to the lowest common ancestor, and represents the number of
> branchings in the LOM tree between elements. Since the minimum suffix
> length
> for <d> and <h> is 1, the two elements are contained in the same
> container,
> and no intervening aggregate elements are possible between them: the
> lowest
> common ancestor is the parent of one of the elements.
> 
>     * If the position strings are 2.1.4.5.6 and 2.1.7.3, the common
> prefix
> is 2.1, and the minimum suffix length is 2 (.7.3). This means that
> there is
> an intervening node (aggregate element) between the two elements:
> 2.1.7; the
> lowest common ancestor is the grandparent of 2.1.7.3, 2.1. So 2.1.4.5.6
> and
> 2.1.7.3 are less close than 2.1.7.1 and 2.1.7.3, or for that matter
> 2.1.4
> and 2.1.7.3.
> 
> 
> 
> Let us illustrate this with LOM instances:
> 
> 
> 
> <lom>
> 
>   <lifecycle>                             1
> 
>     <contribute>                          1.1
> 
>       <role>author</role>                 1.1.1
> 
>       <entity>Sanderson</entity>          1.1.2
> 
>       <date>2006</date>                   1.1.3
> 
>     </contribute>
> 
>     <contribute>                          1.2
> 
>       <role>publisher</role>              1.2.1
> 
>       <entity>Fredericksen</entity>       1.2.2
> 
>       <date>2007</date>                   1.2.3
> 
>     </contribute>
> 
>     <contribute>                          1.3
> 
>       <role>initiator</role>              1.3.1
> 
>       <entity>Johnson</entity>            1.3.2
> 
>       <date>2004</date>                   1.3.3
> 
>     </contribute>
> 
>     <contribute>                          1.4
> 
>       <role>terminator</role>             1.4.1
> 
>       <entity>Pierceson</entity>          1.4.2
> 
>       <date>2008</date>                   1.4.3
> 
>     </contribute>
> 
>   <lifecyle>
> 
> </lom>
> 
> 
> 
>     * The query "dc.creator = sanderson and the dc.date =2006 are
> contained
> within the same contribute entry" translates to: dc.creator = sanderson
> prox/unit=element/distance=1 dc.date=2006.
> 
>     * "sanderson" and "2006" have the position strings 1.1.2 and 1.1.3,
> so
> their distance is 1 (.2, .3). They are as close as possible in the LOM
> tree,
> sharing a common parent, so they satisfy the "within the same
> contribute
> entry" requirement.
> 
>     * By contrast, "sanderson" and "2008" have the position strings
> 1.1.2
> and 1.4.3, so their distance is 2 (.1.2, .4.3). The date 2008 is not
> associated with contributor Sanderson, but with a different
> contributor.
> 
> 
> 
> Kerry Blinco
> e-Framework and Standards Manager, Link Affiliates, University of
> Southern
> Queensland; and
> Technical Standards Adviser to the Department of Education Employment
> and
> Workplace Relations (DEEWR).  Australia.
> Email:     kblinco@powerup.com.au
> Phone:   +61 7 3871 2699
> Ph (Mobile) :    +61 419 787 992
> 
> The information contained in this e-mail message and any files may
> be confidential information, and may also be the subject of legal
> professional privilege.
> If you think you may not be the intended recipient, or if you have
> received
> this e-mail in error,
> please contact the sender immediately and delete all copies of this e-
> mail.
> If you are not the intended
> recipient, you must not reproduce any part of this e-mail or disclose
> its
> contents to any other party.
> 
> This email represents the views of the individual sender, except where
> the
> sender expressly states otherwise.
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe from this mail list, you must leave the OASIS TC that
> generates this mail.  Follow this link to all your TCs in OASIS at:
> https://www.oasis-open.org/apps/org/workgroup/portal/my_workgroups.php
>
Follow-Ups:
- Re: [search-ws] CQL Query on structured data.
  - From: "Ray Denenberg, Library of Congress" <rden@loc.gov>
References:
- CQL Query on structured data.
  - From: Kerry Blinco <kblinco@powerup.com.au>
- RE: [search-ws] CQL Query on structured data.
  - From: "LeVan,Ralph" <levan@oclc.org>
- Re: [search-ws] CQL Query on structured data.
  - From: "Ray Denenberg, Library of Congress" <rden@loc.gov>