[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]
Subject: RE: [search-ws] CQL Query on structured data.
Or SPARQL. Rob On Thu, 2008-09-18 at 10:25 -0400, LeVan,Ralph wrote: > I’d really hate to see us try to shoehorn structure searching into > CQL. I’d rather see support for XQuery as the query grammar in SRU > than make CQL do everything that it already does plus everything that > XQuery does. > > > > Ralph > > > > From: Kerry Blinco [mailto:kblinco@powerup.com.au] > Sent: Tuesday, September 09, 2008 5:47 AM > To: search-ws@lists.oasis-open.org > Subject: [search-ws] CQL Query on structured data. > > > > > Ray, > > > > > > I am copying to you the unedited version of the FRED project > (Federated Repositories for Education) (http://fred.usq.edu.au/) use > case and a proposed solution based on proximity using element and an > abstract tree search. > > > > > > The PQL solutions are different and they need to state their own case > I think - Hopefully we can get them engaged in this after next week. > > > > > > The FRED description is something I could send to you quickly as > requested. > > > > > > LOM CQL Documentation from the FRED project > > > > > > Problem Statement ¶ > > > > > > Need to be able to perform under CQL queries sensitive to the > structure of the underlying LOM; e.g. > > > > > > * dc.creator = sanderson and the dc.date =2006 are contained > within the same contribute container. > > > * x is author and y is editor and y edited at least 4 months after > x authored > > > * dc.creator is in the second grandchild of the grandfather of a > node with dc.date = 2006 > > > > > > Australian Education Stakeholders have indicated that such structural > queries (at least in their simplest forms) are important. > > > > > > In LOM in particular, > > > > > > * The ordering of nodes is often undefined by the standard, and so > searches cannot rely on relative order. > > > * Many elements can have an open number of children. > > > > > > Note that such structural queries are perceived to compromise the > abstractness of CQL queries. For instance, a query are contained > within the same contribute entry does not make sense unless the > underlying record can be represented in LOM: the proximity is > evaluated specifically in the context of the LOM information model. > Even if a "contribute" entry is not nominated, different schemas may > arrange elements very differently. In LOM, the creator and publisher > are both in the contribute container, and are closer to each other > structurally than is the keyword. In Dublin Core, creator, publisher > and keyword are all children of the root node, and so are equally > close to each other structurally. For that reason, if a context set > realises element proximity searches, it can only do so relative to a > specific schema, and cannot claim to be schema neutral, the way normal > CQL searches are. > > > > > > On the other hand, such structural queries are specific only to the > information model of LOM, and not to any binding of the information > model to a specific presentation. They are not dependent on the > container and leaf elements being presented through XML, RDF, or > Language Independent Datatypes. > > > CQL Element Proximity ¶ > > > > > > The proposed solution for FRED is to provide context sets that allow > for searches of abstract structure. The solution below looks at a > tree structure. > > > > > > SOLUTION: > > > > > > The CQL context set provides for proximity searches with unit=element. > Proximity will address the simpler structural queries, but not the > more complex: > > > > > > * "x is author and y is editor and y edited at least 4 months > after x authored" assumes XPath-like extraction of individual > elements, rather than the fixed indexes of CQL > (lom::contribute[child::role=author]::date >= > lom::contribute[child::role=editor]::date + 000400Z ). > > > * The query "dc.creator is in the second grandchild of the > grandfather of a node with dc.date = 2006" is also inconsistent with > CQL: all indexes in CQL need to be related to a search term. So CQL > might conceivable query dc.creator = sanderson as a second grandchild, > but not dc.creator in general; this is again an XPath query > (/descendant[child::date=2006]/parent/child/child[position()=2]/self::creator). > > > > > > Neither CQL 1.1 nor CQL 1.2 define how elements are counted or what > proximity means in the context of CQL indices. Neither CQL version > outlines the extent to which the underlying structure of the source > LOM may be preserved after elements are extracted into indexes. CQL > 1.2 expressly states that, though the element unit is defined for the > CQL context set, no semantics for the unit is defined. (Indeed, nor is > semantics for any proximity unit defined.) However, CQL expressly > allows that other context sets define semantics for proximity units. > > > > > > This has the risk of leading to inconsistent notions of proximity > between different projects and applications. > > > > > > FRED will develop its own semantics for element proximity searches in > the Australian Education CQL context set. > > > > > > FRED may develop a sample implementation of element proximity > searches. > > > > > > There are two possible interpretations of element proximity which FRED > could use: textual proximity, and structural proximity. > > > > > > * Under textual proximity, elements are tokenised in the same way > that words, sentences etc. are tokenised. > > > o A LOM tree is constructed out of the container and leaf > nodes in the LOM document. Where the container may have an unordered > number of subcontainers or leaf nodes, the children in the tree are > ordered arbitrarily. > > > o The LOM tree is traversed in-order, and each leaf node > visited is extracted as a token, in the order in which it has been > visited. > > > o Proximity search counts the number of tokens in the > tokenised tree between elements. > > > o The notion of a node is preserved in the search; the > notion of node hierarchy is not. > > > * Under structural proximity, the lowest common ancestor of two > elements is determined. > > > o A LOM tree is constructed, as above. > > > o The two elements being queries are identified in the tree > as leaf nodes. > > > o The distance of the leaf nodes to their lowest common > ancestor in the tree is determined. > > > o Node hierarchy is preserved. > > > o Proximity search counts how many levels in the tree the > elements are removed from their common ancestor (i.e. how many > branchings intervene between them). > > > > > > The lowest common ancestor definition of proximity corresponds to the > kinds of queries of interest for LOM, which rely on elements being in > the same container. Unlike element tokenisation, it is not sensitive > to the ordering of elements, or the number of elements an aggregate > node may contain. > > > > > > On the other hand, element tokenisation corresponds closely to the > implementation of other proximity searches. That said, the proposed > notion of structural proximity is not an unusual understanding of > proximity in the XML context; cf. > http://www.cs.fiu.edu/~vagelis/publications/Tkde-tree-search.pdf , > where "Keyword Proximity Search in XML Trees" is understood explicitly > in terms of Lowest Common Ancestor.) > > > > > > Illustration: > > > > > > Tree 1: > > > > > > <a> > > > <b>1</b> > > > <c> > > > <d>2</d> > > > <e>3</e> > > > <f>4</f> > > > <g> > > > <h>5</h> > > > <i>6</i> > > > </g> > > > <j>7</j> > > > </c> > > > </a> > > > > > > Tree 2: > > > > > > <a> > > > <c> > > > <g> > > > <h>5</h> > > > <i>6</i> > > > </g> > > > <d>2</d> > > > <j>7</j> > > > <e>3</e> > > > <f>4</f> > > > </c> > > > <b>1</b> > > > </a> > > > > > > In LOM, Trees 1 and 2 are identical, since LOM is order-insensitive. > > > > > > Tree 3: > > > > > > <a> > > > <b>1</b> > > > <c> > > > <d>2</d> > > > <j>7</j> > > > </c> > > > </a> > > > > > > The queries we expect about LOM structure will typically concentrate > on elements belonging to the same container, rather than on what other > elements also belong to the container. Therefore, we would expect a > query on the proximity of <b/> and <j/> to give the same result for > Trees 1 and 3. (To give a LOM example: we expect x is author and y is > editor and y edited at least 4 months after x authored to give the > same answer whether or not a graphic designer is also defined in the > Contribute node. > > > Tokenised LOM tree ¶ > > > > > > * E.g. in an XML binding of LOM, the XML document is tokenised, > with the token breaks being the element delimiters. > > > * Only leaf nodes are tokenised, and aggregate nodes are not > considered tokens. E.g. in XML, consecutive delimiters count as a > single token break. > > > * Tokenisation is exactly parallel to the tokenisation of text by > word and sentence boundaries, already used for textual proximity > searches. > > > * Proximity searches for elements use the well-established notion > of distance between tokens. > > > > > > The trees above tokenise as: > > > > > > * Tree 1: 1 2 3 4 5 6 7 > > > * Tree 2: 5 6 2 7 3 4 1 > > > * Tree 3: 1 2 7 > > > > > > Search results are sensitive to accidents of ordering and of optional > elements. So in the three trees, the distance between 1 and 7 could be > 6, 3, or 2. > > > Lowest Common Ancestor ¶ > > > > > > Work on algorithms to determine lowest common ancestor in XML has been > done > (http://www.cs.fiu.edu/~vagelis/publications/Tkde-tree-search.pdf); > the following algorithm is straightforward: > > > > > > * Each element in the LOM tree is assigned a position string: a > dot-delimited string of position numbers describing the path from the > root to the element. The leftmost position number is the position of > the child of the root traversed in the path, relative to its siblings, > expressed as an ordinal number; the next position number is the > position of the child of the child of the root, relative to its > siblings, and so forth. For example: > > > > > > Tree 1: > > > > > > <a> > > > <b>1</b> 1 > > > <c> > > > <d>2</d> 2.1 > > > <e>3</e> 2.2 > > > <f>4</f> 2.3 > > > <g> > > > <h>5</h> 2.4.1 > > > <i>6</i> 2.4.2 > > > </g> > > > <j>7</j> 2.5 > > > </c> > > > </a> > > > > > > * Since each element is assigned a position string in isolation, > these position strings can be indexed externally, and proximity > queries may be transacted by looking up the position strings, without > direct reference to the XML or any other realisation of the LOM > information model. This means that such searches are compatible with > an index-based CQL infrastructure. > > > * The position strings of the two elements are compared, and the > minimum common prefix determined. For example: a proximity query for > <d> and <h>, involving position strings 2.1 and 2.4.1, has the minimum > common prefix 2. > > > * The minimum suffix length following the common prefix is > determined for the two elements' position strings. 2.1 and 2.4.1 share > the prefix 2, and after that prefix have the suffixes .1 and .4.1, of > length 1 and 2 respectively. > > > * The minimum suffix length is the closest distance from one of > the elements to the lowest common ancestor, and represents the number > of branchings in the LOM tree between elements. Since the minimum > suffix length for <d> and <h> is 1, the two elements are contained in > the same container, and no intervening aggregate elements are possible > between them: the lowest common ancestor is the parent of one of the > elements. > > > * If the position strings are 2.1.4.5.6 and 2.1.7.3, the common > prefix is 2.1, and the minimum suffix length is 2 (.7.3). This means > that there is an intervening node (aggregate element) between the two > elements: 2.1.7; the lowest common ancestor is the grandparent of > 2.1.7.3, 2.1. So 2.1.4.5.6 and 2.1.7.3 are less close than 2.1.7.1 and > 2.1.7.3, or for that matter 2.1.4 and 2.1.7.3. > > > > > > Let us illustrate this with LOM instances: > > > > > > <lom> > > > <lifecycle> 1 > > > <contribute> 1.1 > > > <role>author</role> 1.1.1 > > > <entity>Sanderson</entity> 1.1.2 > > > <date>2006</date> 1.1.3 > > > </contribute> > > > <contribute> 1.2 > > > <role>publisher</role> 1.2.1 > > > <entity>Fredericksen</entity> 1.2.2 > > > <date>2007</date> 1.2.3 > > > </contribute> > > > <contribute> 1.3 > > > <role>initiator</role> 1.3.1 > > > <entity>Johnson</entity> 1.3.2 > > > <date>2004</date> 1.3.3 > > > </contribute> > > > <contribute> 1.4 > > > <role>terminator</role> 1.4.1 > > > <entity>Pierceson</entity> 1.4.2 > > > <date>2008</date> 1.4.3 > > > </contribute> > > > <lifecyle> > > > </lom> > > > > > > * The query "dc.creator = sanderson and the dc.date =2006 are > contained within the same contribute entry" translates to: dc.creator > = sanderson prox/unit=element/distance=1 dc.date=2006. > > > * "sanderson" and "2006" have the position strings 1.1.2 and > 1.1.3, so their distance is 1 (.2, .3). They are as close as possible > in the LOM tree, sharing a common parent, so they satisfy the "within > the same contribute entry" requirement. > > > * By contrast, "sanderson" and "2008" have the position strings > 1.1.2 and 1.4.3, so their distance is 2 (.1.2, .4.3). The date 2008 is > not associated with contributor Sanderson, but with a different > contributor. > > > > > > Kerry Blinco > e-Framework and Standards Manager, Link Affiliates, University of > Southern Queensland; and > Technical Standards Adviser to the Department of Education Employment > and Workplace Relations (DEEWR). Australia. > Email: kblinco@powerup.com.au > Phone: +61 7 3871 2699 > Ph (Mobile) : +61 419 787 992 > > The information contained in this e-mail message and any files may > be confidential information, and may also be the subject of legal > professional privilege. > If you think you may not be the intended recipient, or if you have > received this e-mail in error, > please contact the sender immediately and delete all copies of this > e-mail. If you are not the intended > recipient, you must not reproduce any part of this e-mail or disclose > its contents to any other party. > > This email represents the views of the individual sender, except where > the sender expressly states otherwise. > > > > > > > > > > > >
[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]