regrep-query message

Subject: Re: ContentBasedQuery questions/ideas.
From: Matthew MacKenzie <matt@xmlglobal.com>
To: Len Gallagher <LGallagher@nist.gov>, regrep-query@lists.oasis-open.org
Date: Thu, 06 Sep 2001 10:56:49 -0700

Len,

I've responded inline.  

Regards,

Matt

On Thursday 06 September 2001 07:32, Len Gallagher wrote:
> Dan and Matt,
>
> The discussion you're having is very helpful to me. I don't have a really
> good feeling for where people are coming from on this topic, so keep the
> examples and use cases coming.
>
> Here's what I'm seeing so far:
>
> 1) I understand the general requirement for what's presented in Appendix D
> of ebRS. In my mind, it's assuming the following interfaces:
>
>   a) An interface for a Client to tell a Registry how to "index" an XML
> document.
>
>   b) A way to represent the "index" using existing RIM data structures
> (i.e. Classifications)
>
> The assumption is that the document to be indexed is XML, so presumably the
> Registry is capable of opening it and creating the "index". Arbitrary
> Clients can then Query the result by searching for and using the
> Classifications and ClassificationSchemes created in Step b).
>
> 2) Matt's example (assuming a Word document) involves the following
> interfaces:
>
>   a) An interface for a Client to tell a Registry how to "index" a
> document. Let's call this an Index Creation Request.

Yes, although I would like to clarify: the index creation request happens 
once, and its required parameters would be the MIME type and the handler.

e.g.

<IndexCreationRequest
	mimeType="application/x-pdf"
	handler="vendorRecognizedString" />

Ideally, the IndexCreationRequest would just signal the registry to begin 
creating the index, and not give any instructions on how to do so.  The 
handlers would hopefully be defined through server configuration so that we 
don't have language-specific issues rearing their ugly heads (for 
instance, handler="com.xmlglobal.ebxml.registry.handlers.Pdf").  I suggest 
having the handler attribute just in case the vendor wants to offer 
alternative indexing methods for any given MIME type.
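
To make the server-configuration idea concrete, here is a rough Python sketch
(not a proposal for the spec).  Every name in it (HANDLERS, resolve_handler,
the "default" key, and the two stub handler functions) is made up for
illustration.  The only point is that the request carries an opaque string
and the registry's own configuration maps it to an implementation.

```python
from typing import Callable, Dict, Tuple

# A handler takes raw document bytes and returns key=value index data.
Handler = Callable[[bytes], Dict[str, str]]


def pdf_default(content: bytes) -> Dict[str, str]:
    # Stub: a real handler would parse the PDF; this only records its size.
    return {"ByteCount": str(len(content))}


def pdf_vendor(content: bytes) -> Dict[str, str]:
    # Stub for an alternative, vendor-specific indexing method.
    return {"ByteCount": str(len(content)), "Method": "vendor"}


# Hypothetical server-side configuration: (mimeType, handler string) -> code.
HANDLERS: Dict[Tuple[str, str], Handler] = {
    ("application/x-pdf", "default"): pdf_default,
    ("application/x-pdf", "vendorRecognizedString"): pdf_vendor,
}


def resolve_handler(mime_type: str, handler: str) -> Handler:
    """Look up the handler the registry configured for this request."""
    try:
        return HANDLERS[(mime_type, handler)]
    except KeyError:
        raise LookupError(f"no handler {handler!r} for {mime_type!r}") from None
```

The client never sees a class name, so nothing language-specific leaks into
the request.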



>
>   b) Existence of a component that knows how to open and read a Word
> document, understand what is being asked by an Index Creation Request, and
> create an "index". Can we call this the ContentHandler? Should we assume
> that every Registry has a ContentHandler for XML documents?

That is a good definition of what I think a content handler is.  I would want 
to specify that an ebXML registry must support at least text/xml and 
application/xml.

>
> What Matt's example hasn't done yet is explain how the "index" might be
> represented in the Registry. Do we need new structures, or can the "index"
> created by the ContentHandler be represented by existing RIM data
> structures (e.g. Slots or Classifications)?

I was thinking Slots.  No matter what ContentHandler is being used, the data 
to be indexed will be rather simple: key=value data.  The only issue I see 
is that a Slot seems to be a container for a List or Collection, when what 
we really need is a Map interface.  Maybe the Slot interface can be modified 
to support List or Map type data?

>
> 3) I understand that Matt wants to design something that will be upward
> extensible to direct query of submitted repository items using a "Content
> Handler" architecture. My understanding is that this group doesn't want to
> rule out such an extension to direct query on repository items, but that
> the primary use cases driving our near-term work are workable solutions to
> items 1) and 2) above.
>
> Matt -- it would really help me if you took your example to the next step
> to show how the "index" would be represented in RIM. Would we need
> additional data structures?

I addressed that above.  I think that, initially, a modified Slot could 
hold the indexes.  If you wish, I could work out a proposed interface change.
I obviously don't want to remove the List functionality already present, so 
maybe what we are looking at is making Slot abstract and having a ListSlot 
and a MapSlot.  Ideas?
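
As a starting point for discussion, here is a minimal Python sketch of that
split.  It is not the RIM Slot interface: the constructor arguments and the
values() and get() methods are assumptions I made up to show the shape of
the idea.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable, List


class Slot(ABC):
    """Abstract base: a named container attached to a registry entry."""

    def __init__(self, name: str) -> None:
        self.name = name

    @abstractmethod
    def values(self) -> List[str]:
        """Return the stored values as a flat list."""


class ListSlot(Slot):
    """Keeps the List-style behaviour already present."""

    def __init__(self, name: str, items: Iterable[str] = ()) -> None:
        super().__init__(name)
        self._items = list(items)

    def values(self) -> List[str]:
        return list(self._items)


class MapSlot(Slot):
    """Holds key=value index data produced by a content handler."""

    def __init__(self, name: str, entries: Dict[str, str]) -> None:
        super().__init__(name)
        self._entries = dict(entries)

    def get(self, key: str) -> str:
        return self._entries[key]

    def values(self) -> List[str]:
        return list(self._entries.values())
```

A handler's output would then be stored as something like
MapSlot("contentIndex", {"/Document/Wordcount": "1200"}), while existing
code that only needs values() keeps working against either subclass.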

>
> 4) Assume that I'm a User communicating with a Registry through some Client
> software. Also assume that my Client is NOT the same Client that requested
> the index creation and NOT the same Client that submitted the repository
> item being indexed! I'm hoping that whatever solution we come up with will
> allow my Client to take advantage of these new indexes without needing to
> understand the specifics of each ContentHandler that may have been used to
> create the indexes, and without having to learn any new Query Syntax that
> isn't already part of RIM.

Of course!  That is where the new entry in the Registry's self-describing CPP 
comes in.  It allows the registry admin or owner to define the available keys 
for content query for each supported MIME type.  The client could simply look 
into this document and fill a list box or something with the keys that can 
serve as context for a search.  A RegistryEntryFilter (terminology?) might 
need to execute prior to the content query to ensure that the content query 
is executed only on entries holding one of a given set of MIME types.
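
Roughly, the client side could look like the Python sketch below.  The XML
fragment is modelled on the <ContentBasedQuery> example from my earlier
message to Dan (quoted further down).  The function names and the shape of
the entry dictionaries ("id", "mimeType") are assumptions for illustration
only.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Set

# Sample CPP fragment, modelled on the earlier <ContentBasedQuery> example.
CPP_FRAGMENT = """
<ContentBasedQuery>
  <Type mime="application/msword" name="MS Word">
    <Mapping path="/Document/Wordcount" label="Word Count" />
    <Mapping path="/Document/h1" label="Level one heading" />
  </Type>
</ContentBasedQuery>
"""


def searchable_keys(cpp_xml: str) -> Dict[str, Dict[str, str]]:
    """Map each MIME type to its {label: path} search keys."""
    root = ET.fromstring(cpp_xml)
    return {
        t.get("mime"): {m.get("label"): m.get("path") for m in t.findall("Mapping")}
        for t in root.findall("Type")
    }


def filter_by_mime(entries: List[dict], mime_types: Set[str]) -> List[dict]:
    """Keep only entries whose MIME type can be content-searched."""
    return [e for e in entries if e["mimeType"] in mime_types]
```

The labels returned by searchable_keys() are what would go into the list
box, and filter_by_mime() stands in for the filter step that runs before
the content query.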


>
> -- Len
>
> At 06:11 PM 9/5/01, Matthew MacKenzie wrote:
> >Dan,
> >
> >I've responded inline as well.
> >
> >Cheers,
> >
> >Matt
> >
> >On Wednesday 05 September 2001 14:47, Dan Chang wrote:
> ><snipped />
> >
> > > Team,
> > >
> > > I have started looking more closely at ContentBasedQuery, and have a
> > > few questions for those of you that may have been more closely involved
> > > with the
> > > registry specifications.  Please excuse my ignorance if I am off in
> > > left field, here are my observations and questions:
> > >
> > > 1.  Does ContentBasedQuery need to fall under the FilterQuery umbrella?
> > > (I think not, just checking.)
> > > ==> I think it should. Our focus is on the registry not the repository.
> > > That is,
> > > ==> a user is expected to query through the registry not directly on
> > > the repository.
> > > ==> Therefore, content-based query should supplement and be part of
> > > filter query, not be
> > > ==> used independently.
> >
> >Fair enough.
> >
> > > 2.  In the RS spec, appendix D talks about a syntax for defining
> > > "Classification Indexes".  I read this over and over and don't really
> > > see how
> > > these details relate to content based queries, they seem to relate more
> > > to defining how a registry implementer might  build side tables in
> > > their RDBMS
> > >
> > > so that a keyed query could take place (e.g. id-1 LIKE fo% AND id-2
> > > LIKE %ar).  Does the ContentBasedQuery spec have to address the needs
> > > of the SQL
> > >
> > > implementer, or should it remain technology neutral with appendixes for
> > > implementation specific issues (if available)?
> > > ==> Our last discussion/agreement was that, to make things simpler and
> > > more uniform,
> > > ==> content-based query will be based on content index expressed in
> > > XPath.
> >
> >And that is a great decision if you can guarantee that the content will
> >always be XML, or that there will always be an XML mapping for content.  I
> >would like to strike this requirement in favour of a content-type-neutral
> >approach which I alluded to as being specified via a content handler
> >interface.  Of course, XPath syntax can be extended to cover index
> > expression for other formats, provided that the other formats are
> > structured in some way (e.g. Images could utilize paths to refer to
> > embedded metadata
> >/Image/Metadata/DateTaken, Word processing files could use a path to
> >represent sections, paragraphs, headings, metadata, etc.).  The
> > question is whether we would want to use XPath for addressing data that
> > is not XML.
> >
> > > 3.  How does everyone feel about a "Content Handler" architecture that
> > > would
> > > allow for a content based query to span more than just XML documents? 
> > > The registry's self describing CPP could contain a list of mime types
> > > that can be
> > > content searched, and the implementation of each content type other
> > > than XML/HTML and plain text could be left up to registry vendors.
> > > ==> Content handler sounds good.
> >
> >In that case, I guess that the major decision is how we should express
> >indexes for arbitrary content-types.  I have no problem with using XPath
> > and having the available paths for content that is non-xml exposed via an
> > entry in the registry CPP, e.g.
> >
> ><ContentBasedQuery>
> >         <Type mime="application/msword" name="MS Word">
> >                 <Mapping occurrence="1" path="/Document/Wordcount"
> >                          label="Word Count" />
> >                 <Mapping occurrence="*" path="/Document/h1"
> >                          label="Level one heading" />
> >                 ...
> >         </Type>
> >         <Type />
> >         <Type />
> ></ContentBasedQuery>
> >
> >The neat thing about this is that users could specify an arbitrary mime
> > type when submitting an object to the registry, and have a custom handler
> > deal with its indexing and queries for content based queries.
> >
> >Are non-XML content-types in the RE content in scope?
> >
> >--
> >Matthew MacKenzie
> >XML Global
> >
> ><quote>
> >Canada Bill Jones' Motto:
> >   It's morally wrong to allow suckers to keep their money.
> >Supplement:
> >   A .44 magnum beats four aces.
> ></quote>
> >
> >----------------------------------------------------------------
> >To subscribe or unsubscribe from this elist use the subscription
> >manager: <http://lists.oasis-open.org/ob/adm.pl>
>
> **************************************************************
> Len Gallagher                             LGallagher@nist.gov
> NIST                                      Work: 301-975-3251
> Bldg 820  Room 562                        Home: 301-424-1928
> Gaithersburg, MD 20899-8970 USA           Fax: 301-948-6213
> **************************************************************
>

-- 
Matthew MacKenzie
XML Global

<quote>
Heuristics are bug ridden by definition.  If they didn't have bugs,
then they'd be algorithms.
</quote>