regrep-query message

Subject: Re: ContentBasedQuery questions/ideas.
From: Len Gallagher <LGallagher@nist.gov>
To: regrep-query@lists.oasis-open.org
Date: Thu, 06 Sep 2001 10:32:23 -0400

Dan and Matt,

The discussion you're having is very helpful to me. I don't have a really 
good feeling for where people are coming from on this topic, so keep the 
examples and use cases coming.

Here's what I'm seeing so far:

1) I understand the general requirement for what's presented in Appendix D 
of ebRS. In my mind, it's assuming the following interfaces:

  a) An interface for a Client to tell a Registry how to "index" an XML 
document.

  b) A way to represent the "index" using existing RIM data structures 
(i.e. Classifications)

The assumption is that the document to be indexed is XML, so presumably the 
Registry is capable of opening it and creating the "index". Arbitrary 
Clients can then Query the result by searching for and using the 
Classifications and ClassificationSchemes created in Step b).
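
A minimal sketch of point 1, assuming the "index" request is just a set of
named path expressions and that each match becomes one Classification-like
record. The names Classification, build_index, the rule names and the sample
document are hypothetical and not taken from ebRS or RIM. Python's standard
ElementTree supports only a subset of XPath, so the paths are written
relative to the document root.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass(frozen=True)
class Classification:
    """Hypothetical stand-in for a RIM Classification on a registered object."""
    scheme: str      # name of the (hypothetical) ClassificationScheme
    value: str       # value extracted from the document
    object_id: str   # id of the repository item being indexed


def build_index(object_id, xml_text, index_rules):
    """Apply each named path to the document and emit one record per match."""
    root = ET.fromstring(xml_text)
    records = []
    for scheme, path in index_rules.items():
        for element in root.findall(path):
            text = (element.text or "").strip()
            if text:  # skip empty elements
                records.append(Classification(scheme, text, object_id))
    return records


# Hypothetical document and index definition supplied by a Client.
doc = (
    "<Order>"
    "<Buyer><Country>CA</Country></Buyer>"
    "<Item><Category>Widgets</Category></Item>"
    "<Item><Category>Gears</Category></Item>"
    "</Order>"
)
rules = {"BuyerCountry": "./Buyer/Country", "ItemCategory": "./Item/Category"}
index = build_index("urn:example:order-1", doc, rules)
```

Any Client could then search these records by scheme and value without ever
opening the original document.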

2) Matt's example (assuming a Word document) involves the following interfaces:

  a) An interface for a Client to tell a Registry how to "index" a 
document. Let's call this an Index Creation Request.

  b) Existence of a component that knows how to open and read a Word 
document, understand what is being asked by an Index Creation Request, and 
create an "index". Can we call this the ContentHandler? Should we assume 
that every Registry has a ContentHandler for XML documents?

What Matt's example hasn't done yet is explain how the "index" might be 
represented in the Registry. Do we need new structures, or can the "index" 
created by the ContentHandler be represented by existing RIM data 
structures (e.g. Slots or Classifications)?
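
One possible shape for point 2, sketched under stated assumptions: a
ContentHandler is selected by MIME type, and the "index" it produces is
stored as Slot-like name/value lists. The method names, the Registry class
and the field names WordCount and FirstLine are hypothetical. A plain-text
handler stands in for a Word handler here, because reading Word files would
need a third-party library.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class ContentHandler(ABC):
    """Hypothetical interface: one handler per MIME type."""
    mime_type: str

    @abstractmethod
    def create_index(self, content: bytes, requested: List[str]) -> Dict[str, List[str]]:
        """Serve an Index Creation Request: return values for the requested fields."""


class PlainTextHandler(ContentHandler):
    mime_type = "text/plain"

    def create_index(self, content, requested):
        text = content.decode("utf-8")
        available = {
            "WordCount": lambda: [str(len(text.split()))],
            "FirstLine": lambda: text.splitlines()[:1],
        }
        # Fields the handler does not know are silently ignored.
        return {name: available[name]() for name in requested if name in available}


class Registry:
    def __init__(self):
        self._handlers: Dict[str, ContentHandler] = {}
        self.slots: Dict[str, Dict[str, List[str]]] = {}  # object id -> Slot-like values

    def register_handler(self, handler: ContentHandler) -> None:
        self._handlers[handler.mime_type] = handler

    def index(self, object_id, mime_type, content, requested):
        handler = self._handlers.get(mime_type)
        if handler is None:
            raise LookupError("no ContentHandler for " + mime_type)
        self.slots[object_id] = handler.create_index(content, requested)
        return self.slots[object_id]


registry = Registry()
registry.register_handler(PlainTextHandler())
slots = registry.index(
    "urn:example:doc-1", "text/plain",
    b"Title line\nsome body text here", ["WordCount", "FirstLine"],
)
```

Whether Slots or Classifications are the better target is exactly the open
question above; the sketch only shows that the handler's output can be kept
in a structure the Registry already understands.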

3) I understand that Matt wants to design something that will be upward 
extensible to direct query of submitted repository items using a "Content 
Handler" architecture. My understanding is that this group doesn't want to 
rule out such an extension to direct query on repository items, but that 
the primary use cases driving our near-term work are workable solutions to 
items 1) and 2) above.

Matt -- it would really help me if you took your example to the next step 
to show how the "index" would be represented in RIM. Would we need 
additional data structures?

4) Assume that I'm a User communicating with a Registry through some Client 
software. Also assume that my Client is NOT the same Client that requested 
the index creation and NOT the same Client that submitted the repository 
item being indexed! I'm hoping that whatever solution we come up with will 
allow my Client to take advantage of these new indexes without needing to 
understand the specifics of each ContentHandler that may have been used to 
create the indexes, and without having to learn any new Query Syntax that 
isn't already part of RIM.
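
To illustrate the requirement in point 4, here is a minimal sketch in which
every index entry has the same shape regardless of which ContentHandler
produced it, so one generic lookup serves any Client. IndexEntry,
find_objects and the sample entries are hypothetical and are not RIM
structures.

```python
from typing import List, NamedTuple


class IndexEntry(NamedTuple):
    object_id: str
    scheme: str
    value: str


def find_objects(entries: List[IndexEntry], scheme: str, value: str) -> List[str]:
    """Return ids of objects classified under scheme with the given value."""
    return sorted({e.object_id for e in entries if e.scheme == scheme and e.value == value})


entries = [
    IndexEntry("urn:example:order-1", "Country", "CA"),  # e.g. indexed from an XML document
    IndexEntry("urn:example:memo-7", "Country", "CA"),   # e.g. indexed by a Word ContentHandler
    IndexEntry("urn:example:memo-8", "Country", "US"),
]
matches = find_objects(entries, "Country", "CA")
```

The querying Client never learns which handler created an entry; it only
needs the scheme name and the value.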

-- Len



At 06:11 PM 9/5/01, Matthew MacKenzie wrote:
>Dan,
>
>I've responded inline as well.
>
>Cheers,
>
>Matt
>
>On Wednesday 05 September 2001 14:47, Dan Chang wrote:
><snipped />
> >
> > Team,
> >
> > I have started looking more closely at ContentBasedQuery, and have a few
> > questions for those of you that may have been more closely involved with
> > the registry specifications.  Please excuse my ignorance if I am off in
> > left field; here are my observations and questions:
> >
> > 1.  Does ContentBasedQuery need to fall under the FilterQuery umbrella? (I
> > think not, just checking.)
> > ==> I think it should. Our focus is on the registry not the repository.
> > ==> That is, a user is expected to query through the registry not directly
> > ==> on the repository. Therefore, content-based query should supplement
> > ==> and be part of filter query, not be used independently.
>
>Fair enough.
>
> >
> > 2.  In the RS spec, appendix D talks about a syntax for defining
> > "Classification Indexes".  I read this over and over and don't really see
> > how these details relate to content based queries; they seem to relate
> > more to defining how a registry implementer might build side tables in
> > their RDBMS so that a keyed query could take place (e.g. id-1 LIKE fo% AND
> > id-2 LIKE %ar).  Does the ContentBasedQuery spec have to address the needs
> > of the SQL implementer, or should it remain technology neutral with
> > appendixes for implementation specific issues (if available)?
> > ==> Our last discussion/agreement was that, to make things simpler and
> > ==> more uniform, content-based query will be based on content index
> > ==> expressed in XPath.
>
>And that is a great decision if you can guarantee that the content will
>always be XML, or that there will always be an XML mapping for content.  I
>would like to strike this requirement in favour of a content-type-neutral
>approach which I alluded to as being specified via a content handler
>interface.  Of course, XPath syntax can be extended to cover index expression
>for other formats, provided that the other formats are structured in some way
>(e.g. images could utilize paths to refer to embedded metadata such as
>/Image/Metadata/DateTaken, and word processing files could use a path to
>represent sections, paragraphs, headings, metadata, etc.).  The question
>is whether we would want to use XPath for addressing data that is not XML.
>
>
> >
> > 3.  How does everyone feel about a "Content Handler" architecture that
> > would allow for a content based query to span more than just XML
> > documents?  The registry's self describing CPP could contain a list of
> > mime types that can be content searched, and the implementation of each
> > content type other than XML/HTML and plain text could be left up to
> > registry vendors.
> > ==> Content handler sounds good.
> >
>
>In that case, I guess that the major decision is how we should express
>indexes for arbitrary content-types.  I have no problem with using XPath and
>having the available paths for content that is non-xml exposed via an entry
>in the registry CPP, e.g.
>
><ContentBasedQuery>
>         <Type mime="application/msword" name="MS Word">
>                 <Mapping occurence="1" path="/Document/Wordcount"
>                          label="Word Count" />
>                 <Mapping occurence="*" path="/Document/h1"
>                          label="Level one heading" />
>                 ...
>         </Type>
>         <Type />
>         <Type />
></ContentBasedQuery>
>
>The neat thing about this is that users could specify an arbitrary mime type
>when submitting an object to the registry, and have a custom handler deal
>with its indexing and queries for content based queries.
>
>Are non-XML content-types in the RE content in scope?
>
>--
>Matthew MacKenzie
>XML Global
>
><quote>
>Canada Bill Jones' Motto:
>   It's morally wrong to allow suckers to keep their money.
>Supplement:
>   A .44 magnum beats four aces.
></quote>
>

**************************************************************
Len Gallagher                             LGallagher@nist.gov
NIST                                      Work: 301-975-3251
Bldg 820  Room 562                        Home: 301-424-1928
Gaithersburg, MD 20899-8970 USA           Fax: 301-948-6213
**************************************************************
Follow-Ups:
- Re: ContentBasedQuery questions/ideas.
  - From: Matthew MacKenzie <matt@xmlglobal.com>
References:
- Re: ContentBasedQuery questions/ideas.
  - From: Dan Chang <dtchang@us.ibm.com>
- Re: ContentBasedQuery questions/ideas.
  - From: Matthew MacKenzie <matt@xmlglobal.com>