[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]
Subject: RE: [search-ws] CQL Query on structured data.
Sadly, I won't be on the Monday call. Ralph > -----Original Message----- > From: Ray Denenberg, Library of Congress [mailto:email@example.com] > Sent: Thursday, September 18, 2008 11:06 AM > To: firstname.lastname@example.org > Subject: Re: [search-ws] CQL Query on structured data. > > On the agenda for Monday. > > ----- Original Message ----- > From: "LeVan,Ralph" <email@example.com> > To: "Kerry Blinco" <firstname.lastname@example.org>; > <email@example.com> > Sent: Thursday, September 18, 2008 10:25 AM > Subject: RE: [search-ws] CQL Query on structured data. > > > I'd really hate to see us try to shoehorn structure searching into CQL. > I'd > rather see support for XQuery as the query grammar in SRU than make CQL > do > everything that it already does plus everything that XQuery does. > > > > Ralph > > > > From: Kerry Blinco [mailto:firstname.lastname@example.org] > Sent: Tuesday, September 09, 2008 5:47 AM > To: email@example.com > Subject: [search-ws] CQL Query on structured data. > > > > Ray, > > > > I am copying to you the unedited version of the FRED project (Federated > Repositories for Education) (http://fred.usq.edu.au/) use case and a > proposed solution based on proximity using element and an abstract tree > search. > > > > The PQL solutions are different and they need to state their own case I > think - Hopefully we can get them engaged in this after next week. > > > > The FRED description is something I could send to you quickly as > requested. > > > > LOM CQL Documentation from the FRED project > > > > Problem Statement ¶ > > > > Need to be able to perform under CQL queries sensitive to the structure > of > the underlying LOM; e.g. > > > > * dc.creator = sanderson and the dc.date =2006 are contained within > the > same contribute container. > > * x is author and y is editor and y edited at least 4 months after > x > authored > > * dc.creator is in the second grandchild of the grandfather of a > node > with dc.date = 2006 > > > > Australian Education Stakeholders have indicated that such structural > queries (at least in their simplest forms) are important. > > > > In LOM in particular, > > > > * The ordering of nodes is often undefined by the standard, and so > searches cannot rely on relative order. > > * Many elements can have an open number of children. > > > > Note that such structural queries are perceived to compromise the > abstractness of CQL queries. For instance, a query are contained within > the > same contribute entry does not make sense unless the underlying record > can > be represented in LOM: the proximity is evaluated specifically in the > context of the LOM information model. Even if a "contribute" entry is > not > nominated, different schemas may arrange elements very differently. In > LOM, > the creator and publisher are both in the contribute container, and are > closer to each other structurally than is the keyword. In Dublin Core, > creator, publisher and keyword are all children of the root node, and > so are > equally close to each other structurally. For that reason, if a context > set > realises element proximity searches, it can only do so relative to a > specific schema, and cannot claim to be schema neutral, the way normal > CQL > searches are. > > > > On the other hand, such structural queries are specific only to the > information model of LOM, and not to any binding of the information > model to > a specific presentation. They are not dependent on the container and > leaf > elements being presented through XML, RDF, or Language Independent > Datatypes. > > CQL Element Proximity ¶ > > > > The proposed solution for FRED is to provide context sets that allow > for > searches of abstract structure. The solution below looks at a tree > structure. > > > > SOLUTION: > > > > The CQL context set provides for proximity searches with unit=element. > Proximity will address the simpler structural queries, but not the more > complex: > > > > * "x is author and y is editor and y edited at least 4 months after > x > authored" assumes XPath-like extraction of individual elements, rather > than > the fixed indexes of CQL (lom::contribute[child::role=author]::date >= > lom::contribute[child::role=editor]::date + 000400Z ). > > * The query "dc.creator is in the second grandchild of the > grandfather > of a node with dc.date = 2006" is also inconsistent with CQL: all > indexes in > CQL need to be related to a search term. So CQL might conceivable query > dc.creator = sanderson as a second grandchild, but not dc.creator in > general; this is again an XPath query > (/descendant[child::date=2006]/parent/child/child[position()=2]/self::c > reato > r). > > > > Neither CQL 1.1 nor CQL 1.2 define how elements are counted or what > proximity means in the context of CQL indices. Neither CQL version > outlines > the extent to which the underlying structure of the source LOM may be > preserved after elements are extracted into indexes. CQL 1.2 expressly > states that, though the element unit is defined for the CQL context > set, no > semantics for the unit is defined. (Indeed, nor is semantics for any > proximity unit defined.) However, CQL expressly allows that other > context > sets define semantics for proximity units. > > > > This has the risk of leading to inconsistent notions of proximity > between > different projects and applications. > > > > FRED will develop its own semantics for element proximity searches in > the > Australian Education CQL context set. > > > > FRED may develop a sample implementation of element proximity searches. > > > > There are two possible interpretations of element proximity which FRED > could > use: textual proximity, and structural proximity. > > > > * Under textual proximity, elements are tokenised in the same way > that > words, sentences etc. are tokenised. > > o A LOM tree is constructed out of the container and leaf > nodes in > the LOM document. Where the container may have an unordered number of > subcontainers or leaf nodes, the children in the tree are ordered > arbitrarily. > > o The LOM tree is traversed in-order, and each leaf node > visited > is extracted as a token, in the order in which it has been visited. > > o Proximity search counts the number of tokens in the > tokenised > tree between elements. > > o The notion of a node is preserved in the search; the notion > of > node hierarchy is not. > > * Under structural proximity, the lowest common ancestor of two > elements > is determined. > > o A LOM tree is constructed, as above. > > o The two elements being queries are identified in the tree > as > leaf nodes. > > o The distance of the leaf nodes to their lowest common > ancestor > in the tree is determined. > > o Node hierarchy is preserved. > > o Proximity search counts how many levels in the tree the > elements > are removed from their common ancestor (i.e. how many branchings > intervene > between them). > > > > The lowest common ancestor definition of proximity corresponds to the > kinds > of queries of interest for LOM, which rely on elements being in the > same > container. Unlike element tokenisation, it is not sensitive to the > ordering > of elements, or the number of elements an aggregate node may contain. > > > > On the other hand, element tokenisation corresponds closely to the > implementation of other proximity searches. That said, the proposed > notion > of structural proximity is not an unusual understanding of proximity in > the > XML context; cf. > http://www.cs.fiu.edu/~vagelis/publications/Tkde-tree-search.pdf , > where > "Keyword Proximity Search in XML Trees" is understood explicitly in > terms of > Lowest Common Ancestor.) > > > > Illustration: > > > > Tree 1: > > > > <a> > > <b>1</b> > > <c> > > <d>2</d> > > <e>3</e> > > <f>4</f> > > <g> > > <h>5</h> > > <i>6</i> > > </g> > > <j>7</j> > > </c> > > </a> > > > > Tree 2: > > > > <a> > > <c> > > <g> > > <h>5</h> > > <i>6</i> > > </g> > > <d>2</d> > > <j>7</j> > > <e>3</e> > > <f>4</f> > > </c> > > <b>1</b> > > </a> > > > > In LOM, Trees 1 and 2 are identical, since LOM is order-insensitive. > > > > Tree 3: > > > > <a> > > <b>1</b> > > <c> > > <d>2</d> > > <j>7</j> > > </c> > > </a> > > > > The queries we expect about LOM structure will typically concentrate on > elements belonging to the same container, rather than on what other > elements > also belong to the container. Therefore, we would expect a query on the > proximity of <b/> and <j/> to give the same result for Trees 1 and 3. > (To > give a LOM example: we expect x is author and y is editor and y edited > at > least 4 months after x authored to give the same answer whether or not > a > graphic designer is also defined in the Contribute node. > > Tokenised LOM tree ¶ > > > > * E.g. in an XML binding of LOM, the XML document is tokenised, > with the > token breaks being the element delimiters. > > * Only leaf nodes are tokenised, and aggregate nodes are not > considered > tokens. E.g. in XML, consecutive delimiters count as a single token > break. > > * Tokenisation is exactly parallel to the tokenisation of text by > word > and sentence boundaries, already used for textual proximity searches. > > * Proximity searches for elements use the well-established notion > of > distance between tokens. > > > > The trees above tokenise as: > > > > * Tree 1: 1 2 3 4 5 6 7 > > * Tree 2: 5 6 2 7 3 4 1 > > * Tree 3: 1 2 7 > > > > Search results are sensitive to accidents of ordering and of optional > elements. So in the three trees, the distance between 1 and 7 could be > 6, 3, > or 2. > > Lowest Common Ancestor ¶ > > > > Work on algorithms to determine lowest common ancestor in XML has been > done > (http://www.cs.fiu.edu/~vagelis/publications/Tkde-tree-search.pdf); the > following algorithm is straightforward: > > > > * Each element in the LOM tree is assigned a position string: a > dot-delimited string of position numbers describing the path from the > root > to the element. The leftmost position number is the position of the > child of > the root traversed in the path, relative to its siblings, expressed as > an > ordinal number; the next position number is the position of the child > of the > child of the root, relative to its siblings, and so forth. For example: > > > > Tree 1: > > > > <a> > > <b>1</b> 1 > > <c> > > <d>2</d> 2.1 > > <e>3</e> 2.2 > > <f>4</f> 2.3 > > <g> > > <h>5</h> 2.4.1 > > <i>6</i> 2.4.2 > > </g> > > <j>7</j> 2.5 > > </c> > > </a> > > > > * Since each element is assigned a position string in isolation, > these > position strings can be indexed externally, and proximity queries may > be > transacted by looking up the position strings, without direct reference > to > the XML or any other realisation of the LOM information model. This > means > that such searches are compatible with an index-based CQL > infrastructure. > > * The position strings of the two elements are compared, and the > minimum > common prefix determined. For example: a proximity query for <d> and > <h>, > involving position strings 2.1 and 2.4.1, has the minimum common prefix > 2. > > * The minimum suffix length following the common prefix is > determined > for the two elements' position strings. 2.1 and 2.4.1 share the prefix > 2, > and after that prefix have the suffixes .1 and .4.1, of length 1 and 2 > respectively. > > * The minimum suffix length is the closest distance from one of the > elements to the lowest common ancestor, and represents the number of > branchings in the LOM tree between elements. Since the minimum suffix > length > for <d> and <h> is 1, the two elements are contained in the same > container, > and no intervening aggregate elements are possible between them: the > lowest > common ancestor is the parent of one of the elements. > > * If the position strings are 188.8.131.52.6 and 184.108.40.206, the common > prefix > is 2.1, and the minimum suffix length is 2 (.7.3). This means that > there is > an intervening node (aggregate element) between the two elements: > 2.1.7; the > lowest common ancestor is the grandparent of 220.127.116.11, 2.1. So 18.104.22.168.6 > and > 22.214.171.124 are less close than 126.96.36.199 and 188.8.131.52, or for that matter > 2.1.4 > and 184.108.40.206. > > > > Let us illustrate this with LOM instances: > > > > <lom> > > <lifecycle> 1 > > <contribute> 1.1 > > <role>author</role> 1.1.1 > > <entity>Sanderson</entity> 1.1.2 > > <date>2006</date> 1.1.3 > > </contribute> > > <contribute> 1.2 > > <role>publisher</role> 1.2.1 > > <entity>Fredericksen</entity> 1.2.2 > > <date>2007</date> 1.2.3 > > </contribute> > > <contribute> 1.3 > > <role>initiator</role> 1.3.1 > > <entity>Johnson</entity> 1.3.2 > > <date>2004</date> 1.3.3 > > </contribute> > > <contribute> 1.4 > > <role>terminator</role> 1.4.1 > > <entity>Pierceson</entity> 1.4.2 > > <date>2008</date> 1.4.3 > > </contribute> > > <lifecyle> > > </lom> > > > > * The query "dc.creator = sanderson and the dc.date =2006 are > contained > within the same contribute entry" translates to: dc.creator = sanderson > prox/unit=element/distance=1 dc.date=2006. > > * "sanderson" and "2006" have the position strings 1.1.2 and 1.1.3, > so > their distance is 1 (.2, .3). They are as close as possible in the LOM > tree, > sharing a common parent, so they satisfy the "within the same > contribute > entry" requirement. > > * By contrast, "sanderson" and "2008" have the position strings > 1.1.2 > and 1.4.3, so their distance is 2 (.1.2, .4.3). The date 2008 is not > associated with contributor Sanderson, but with a different > contributor. > > > > Kerry Blinco > e-Framework and Standards Manager, Link Affiliates, University of > Southern > Queensland; and > Technical Standards Adviser to the Department of Education Employment > and > Workplace Relations (DEEWR). Australia. > Email: firstname.lastname@example.org > Phone: +61 7 3871 2699 > Ph (Mobile) : +61 419 787 992 > > The information contained in this e-mail message and any files may > be confidential information, and may also be the subject of legal > professional privilege. > If you think you may not be the intended recipient, or if you have > received > this e-mail in error, > please contact the sender immediately and delete all copies of this e- > mail. > If you are not the intended > recipient, you must not reproduce any part of this e-mail or disclose > its > contents to any other party. > > This email represents the views of the individual sender, except where > the > sender expressly states otherwise. > > > > > > > > > > > > > > --------------------------------------------------------------------- > To unsubscribe from this mail list, you must leave the OASIS TC that > generates this mail. Follow this link to all your TCs in OASIS at: > https://www.oasis-open.org/apps/org/workgroup/portal/my_workgroups.php >