regrep-query message

Subject: Re: ContentBasedQuery questions/ideas.
From: Dan Chang <dtchang@us.ibm.com>
To: Matthew MacKenzie <matt@xmlglobal.com>
Date: Wed, 05 Sep 2001 17:22:37 -0700

Matt,

Thank you for your thoughts. I agree with them in general. I think you have
a good start on CBQ.

I believe non-XML content types are in scope, but I think we ought to put
XML contents as first priority for obvious reasons.

Regards,  Dan

Metadata Management Technology and Standard
IBM DBTI for e-Business
Notes:     Dan Chang/Santa Teresa/IBM@IBMUS
Internet:  dtchang@us.ibm.com
VM:          IBMUSM50(DTCHANG)
Phone:    (408)-463-2319


                                                                                                         
From:    Matthew MacKenzie <matt@xmlglobal.com>
To:      Dan Chang/Santa Teresa/IBM@IBMUS
cc:      regrep-query@lists.oasis-open.org
Date:    09/05/01 03:11 PM
Subject: Re: ContentBasedQuery questions/ideas.



Dan,

I've responded inline as well.

Cheers,

Matt

On Wednesday 05 September 2001 14:47, Dan Chang wrote:
<snipped />
>
> Team,
>
> I have started looking more closely at ContentBasedQuery, and have a few
> questions for those of you that may have been more closely involved with
> the registry specifications.  Please excuse my ignorance if I am off in
> left field, here are my observations and questions:
>
> 1.  Does ContentBasedQuery need to fall under the FilterQuery umbrella?
> (I think not, just checking.)
> ==> I think it should. Our focus is on the registry not the repository.
> ==> That is, a user is expected to query through the registry not
> ==> directly on the repository. Therefore, content-based query should
> ==> supplement and be part of filter query, not be used independently.

Fair enough.

>
> 2.  In the RS spec, appendix D talks about a syntax for defining
> "Classification Indexes".  I read this over and over and don't really see
> how these details relate to content based queries, they seem to relate
> more to defining how a registry implementer might build side tables in
> their RDBMS so that a keyed query could take place (e.g. id-1 LIKE fo%
> AND id-2 LIKE %ar).  Does the ContentBasedQuery spec have to address the
> needs of the SQL implementer, or should it remain technology neutral with
> appendixes for implementation specific issues (if available)?
> ==> Our last discussion/agreement was that, to make things simpler and
> ==> more uniform, content-based query will be based on content index
> ==> expressed in XPath.

And that is a great decision if you can guarantee that the content will
always be XML, or that there will always be an XML mapping for content.  I
would like to strike this requirement in favour of a content-type-neutral
approach, which I alluded to as being specified via a content handler
interface.  Of course, XPath syntax can be extended to cover index
expression for other formats, provided that the other formats are
structured in some way (e.g. images could use paths to refer to embedded
metadata such as /Image/Metadata/DateTaken; word processing files could use
paths to represent sections, paragraphs, headings, metadata, etc.).  The
question is whether we would want to use XPath for addressing data that is
not XML.
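
To make the content handler idea concrete, here is a minimal Python sketch.
Nothing in it comes from the registry specifications: the ContentHandler
interface, the evaluate() method, the HANDLERS table and the use of JSON as
a stand-in for "structured non-XML content" are all assumptions made for
illustration. It only shows that the same path expression can be resolved
by different handlers depending on the content type.

```python
import json
import xml.etree.ElementTree as ET
from abc import ABC, abstractmethod
from typing import List


class ContentHandler(ABC):
    """Hypothetical interface: resolve a path expression against content."""

    mime_type: str

    @abstractmethod
    def evaluate(self, content: bytes, path: str) -> List[str]:
        """Return the text values addressed by `path` within `content`."""


class XmlHandler(ContentHandler):
    mime_type = "text/xml"

    def evaluate(self, content: bytes, path: str) -> List[str]:
        root = ET.fromstring(content)
        # ElementTree implements only a subset of XPath and has no absolute
        # paths, so check the root step by hand and search below it.
        steps = path.strip("/").split("/")
        if steps[0] != root.tag:
            return []
        if len(steps) == 1:
            return [root.text or ""]
        return [el.text or "" for el in root.findall("/".join(steps[1:]))]


class JsonHandler(ContentHandler):
    # Stand-in for any structured, non-XML format (an assumption).
    mime_type = "application/json"

    def evaluate(self, content: bytes, path: str) -> List[str]:
        node = json.loads(content)
        for step in path.strip("/").split("/"):
            if not isinstance(node, dict) or step not in node:
                return []
            node = node[step]
        return [str(node)]


# Hypothetical handler table keyed by MIME type.
HANDLERS = {h.mime_type: h for h in (XmlHandler(), JsonHandler())}


def query(mime_type: str, content: bytes, path: str) -> List[str]:
    """Dispatch a path query to the handler registered for `mime_type`."""
    handler = HANDLERS.get(mime_type)
    if handler is None:
        raise LookupError(f"no content handler for {mime_type}")
    return handler.evaluate(content, path)
```

Under these assumptions, query() with the path /Image/Metadata/DateTaken
can be answered for either content type; whether real XPath semantics should
be promised for non-XML content is exactly the open question.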


>
> 3.  How does everyone feel about a "Content Handler" architecture that
> would allow for a content based query to span more than just XML
> documents?  The registry's self describing CPP could contain a list of
> mime types that can be content searched, and the implementation of each
> content type other than XML/HTML and plain text could be left up to
> registry vendors.
> ==> Content handler sounds good.
>

In that case, I guess that the major decision is how we should express
indexes for arbitrary content-types.  I have no problem with using XPath
and having the available paths for content that is non-XML exposed via an
entry in the registry CPP, e.g.

<ContentBasedQuery>
    <Type mime="application/msword" name="MS Word">
        <Mapping occurence="1" path="/Document/Wordcount"
                 label="Word Count" />
        <Mapping occurence="*" path="/Document/h1"
                 label="Level one heading" />
        ...
    </Type>
    <Type />
    <Type />
</ContentBasedQuery>

The neat thing about this is that users could specify an arbitrary MIME
type when submitting an object to the registry, and have a custom handler
deal with its indexing and with content based queries against it.
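
As a rough illustration of how a registry might consume such a CPP entry,
the Python sketch below reads a well-formed version of the fragment and
builds a lookup of searchable paths per MIME type. The function names
load_mappings() and is_searchable() are hypothetical, the image/jpeg entry
is invented for the example, and the attribute names are copied as written
in the fragment above.

```python
import xml.etree.ElementTree as ET

# Well-formed stand-in for the CPP fragment: the "..." elision and the empty
# <Type /> placeholders are left out, and the image/jpeg entry is made up.
CPP_FRAGMENT = """
<ContentBasedQuery>
  <Type mime="application/msword" name="MS Word">
    <Mapping occurence="1" path="/Document/Wordcount" label="Word Count" />
    <Mapping occurence="*" path="/Document/h1" label="Level one heading" />
  </Type>
  <Type mime="image/jpeg" name="JPEG image">
    <Mapping occurence="1" path="/Image/Metadata/DateTaken" label="Date taken" />
  </Type>
</ContentBasedQuery>
"""


def load_mappings(cpp_xml):
    """Return {mime type: {path: (label, occurence)}} from a CPP fragment."""
    mappings = {}
    for type_el in ET.fromstring(cpp_xml).findall("Type"):
        mime = type_el.get("mime")
        if mime is None:  # skip empty <Type /> placeholders
            continue
        mappings[mime] = {
            m.get("path"): (m.get("label"), m.get("occurence"))
            for m in type_el.findall("Mapping")
        }
    return mappings


def is_searchable(mappings, mime, path):
    """True if the registry advertises `path` as indexed for `mime`."""
    return path in mappings.get(mime, {})
```

A client could call load_mappings(CPP_FRAGMENT) once and then use
is_searchable() to check a query path before submitting it; how the registry
actually advertises this information is still to be decided.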

Are non-XML content-types in the RE content in scope?

--
Matthew MacKenzie
XML Global

<quote>
Canada Bill Jones' Motto:
  It's morally wrong to allow suckers to keep their money.
Supplement:
  A .44 magnum beats four aces.
</quote>