Subject: Re: ContentBasedQuery questions/ideas.
Dan and Matt,

The discussion you're having is very helpful to me. I don't have a
really good feeling for where people are coming from on this topic, so
keep the examples and use cases coming. Here's what I'm seeing so far:

1) I understand the general requirement for what's presented in
Appendix D of ebRS. In my mind, it's assuming the following interfaces:

   a) An interface for a Client to tell a Registry how to "index" an
      XML document.

   b) A way to represent the "index" using existing RIM data structures
      (i.e. Classifications).

The assumption is that the document to be indexed is XML, so presumably
the Registry is capable of opening it and creating the "index".
Arbitrary Clients can then Query the result by searching for and using
the Classifications and ClassificationSchemes created in Step b).

2) Matt's example (assuming a Word document) involves the following
interfaces:

   a) An interface for a Client to tell a Registry how to "index" a
      document. Let's call this an Index Creation Request.

   b) Existence of a component that knows how to open and read a Word
      document, understand what is being asked by an Index Creation
      Request, and create an "index". Can we call this the
      ContentHandler? Should we assume that every Registry has a
      ContentHandler for XML documents?

What Matt's example hasn't done yet is explain how the "index" might be
represented in the Registry. Do we need new structures, or can the
"index" created by the ContentHandler be represented by existing RIM
data structures (e.g. Slots or Classifications)?

3) I understand that Matt wants to design something that will be upward
extensible to direct query of submitted repository items using a
"Content Handler" architecture. My understanding is that this group
doesn't want to rule out such an extension to direct query on
repository items, but that the primary use case driving our near-term
work is finding workable solutions to items 1) and 2) above.
Matt -- it would really help me if you took your example to the next
step to show how the "index" would be represented in RIM. Would we need
additional data structures?

4) Assume that I'm a User communicating with a Registry through some
Client software. Also assume that my Client is NOT the same Client that
requested the index creation and NOT the same Client that submitted the
repository item being indexed! I'm hoping that whatever solution we
come up with will allow my Client to take advantage of these new
indexes without needing to understand the specifics of each
ContentHandler that may have been used to create the indexes, and
without having to learn any new Query Syntax that isn't already part of
RIM.

-- Len

At 06:11 PM 9/5/01, Matthew MacKenzie wrote:
>Dan,
>
>I've responded inline as well.
>
>Cheers,
>
>Matt
>
>On Wednesday 05 September 2001 14:47, Dan Chang wrote:
><snipped />
> >
> > Team,
> >
> > I have started looking more closely at ContentBasedQuery, and have a few
> > questions for those of you that may have been more closely involved with
> > the registry specifications. Please excuse my ignorance if I am off in
> > left field, here are my observations and questions:
> >
> > 1. Does ContentBasedQuery need to fall under the FilterQuery umbrella? (I
> > think not, just checking.)
> >
> > ==> I think it should. Our focus is on the registry not the repository.
> > ==> That is, a user is expected to query through the registry not
> > ==> directly on the repository. Therefore, content-based query should
> > ==> supplement and be part of filter query, not be used independently.
>
>Fair enough.
>
> >
> > 2. In the RS spec, appendix D talks about a syntax for defining
> > "Classification Indexes".
> > I read this over and over and don't really see how
> > these details relate to content based queries, they seem to relate more to
> > defining how a registry implementer might build side tables in their RDBMS
> > so that a keyed query could take place (e.g. id-1 LIKE fo% AND id-2 LIKE
> > %ar). Does the ContentBasedQuery spec have to address the needs of the SQL
> > implementer, or should it remain technology neutral with appendixes for
> > implementation specific issues (if available)?
> >
> > ==> Our last discussion/agreement was that, to make things simpler and
> > ==> more uniform, content-based query will be based on content index
> > ==> expressed in XPath.
>
>And that is a great decision if you can guarantee that the content will
>always be XML, or that there will always be an XML mapping for content. I
>would like to strike this requirement in favour of a content-type-neutral
>approach which I alluded to as being specified via a content handler
>interface. Of course, XPath syntax can be extended to cover index expression
>for other formats, provided that the other formats are structured in some way
>(e.g. Images could utilize paths to refer to embedded metadata
>/Image/Metadata/DateTaken, Word processing files could use a path to
>represent sections, paragraphs, headings, metadata, etceteras). The question
>is whether we would want to use XPath for addressing data that is not XML.
>
> >
> > 3. How does everyone feel about a "Content Handler" architecture that
> > would allow for a content based query to span more than just XML
> > documents? The registry's self describing CPP could contain a list of
> > mime types that can be content searched, and the implementation of each
> > content type other than XML/HTML and plain text could be left up to
> > registry vendors.
> >
> > ==> Content handler sounds good.
> >
>
>In that case, I guess that the major decision is how we should express
>indexes for arbitrary content-types.
>I have no problem with using XPath and
>having the available paths for content that is non-xml exposed via an entry
>in the registry CPP, e.g.
>
><ContentBasedQuery>
>  <Type mime="application/msword" name="MS Word">
>    <Mapping occurence="1" path="/Document/Wordcount"
>             label="Word Count" />
>    <Mapping occurence="*" path="/Document/h1" label="Level one heading" />
>    ...
>  </Type>
>  <Type />
>  <Type />
></ContentBasedQuery>
>
>The neat thing about this is that users could specify an arbitrary mime type
>when submitting an object to the registry, and have a custom handler deal
>with its indexing and queries for content based queries.
>
>Are non-XML content-types in the RE content in scope?
>
>--
>Matthew MacKenzie
>XML Global
>
><quote>
>Canada Bill Jone's Motto:
>  It's morally wrong to allow suckers to keep their money.
>Supplement:
>  A .44 magnum beats four aces.
></quote>
>
>----------------------------------------------------------------
>To subscribe or unsubscribe from this elist use the subscription
>manager: <http://lists.oasis-open.org/ob/adm.pl>

**************************************************************
Len Gallagher                     LGallagher@nist.gov
NIST                              Work: 301-975-3251
Bldg 820 Room 562                 Home: 301-424-1928
Gaithersburg, MD 20899-8970 USA   Fax:  301-948-6213
**************************************************************
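
To make the open question in this thread concrete (how a ContentHandler's
"index" might be carried by existing RIM-style structures), here is a
minimal Python sketch. It reads a mapping fragment shaped like the one in
Matt's message and turns values a handler has extracted into plain
name/value pairs, loosely in the spirit of Slots. Everything beyond the XML
element and attribute names quoted above is an assumption made for
illustration: the function names, the shape of the extracted data, and the
choice of name/value pairs are hypothetical and are not part of ebRS, RIM,
or any proposal made in the thread.

```python
import xml.etree.ElementTree as ET

# Fragment adapted from the example in the quoted message. The "..." and the
# empty <Type /> placeholders are omitted here. The attribute spelling
# "occurence" follows the message as written.
CPP_FRAGMENT = """
<ContentBasedQuery>
  <Type mime="application/msword" name="MS Word">
    <Mapping occurence="1" path="/Document/Wordcount" label="Word Count" />
    <Mapping occurence="*" path="/Document/h1" label="Level one heading" />
  </Type>
</ContentBasedQuery>
"""


def load_mappings(xml_text):
    """Return {mime type: [(path, label, occurrence), ...]}.

    Hypothetical helper: it only reads the fragment above, not a real CPP.
    """
    root = ET.fromstring(xml_text)
    mappings = {}
    for type_el in root.findall("Type"):
        mime = type_el.get("mime")
        if mime is None:
            # Skip placeholder <Type /> elements that carry no mime type.
            continue
        mappings[mime] = [
            (m.get("path"), m.get("label"), m.get("occurence"))
            for m in type_el.findall("Mapping")
        ]
    return mappings


def index_as_pairs(mime, extracted, mappings):
    """Turn handler output into (label, value) pairs.

    `extracted` maps a path to the list of values a content handler found
    for it. This is an assumed shape, chosen only for this sketch.
    """
    pairs = []
    for path, label, _occurrence in mappings.get(mime, []):
        for value in extracted.get(path, []):
            pairs.append((label, value))
    return pairs


mappings = load_mappings(CPP_FRAGMENT)

# Made-up values standing in for what a Word ContentHandler might report.
extracted = {
    "/Document/Wordcount": ["1250"],
    "/Document/h1": ["Introduction", "Scope"],
}

pairs = index_as_pairs("application/msword", extracted, mappings)
for label, value in pairs:
    print(f"{label}: {value}")
```

A Client that did not create the index would only need to see the resulting
label/value pairs, not the handler that produced them, which is the property
asked for in point 4) above. Whether Slots, Classifications, or something new
is the right carrier is exactly the question the thread leaves open.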