
cmis message



Subject: [OASIS Issue Tracker] Commented: (CMIS-86) Provide a new service that will allow search crawlers to efficiently navigate a CMIS repository.



    [ http://tools.oasis-open.org/issues/browse/CMIS-86?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=10149#action_10149 ] 

Ryan McVeigh commented on CMIS-86:
----------------------------------

A couple of comments from my colleagues at Oracle:

Contents of the change log

The following text seems to indicate that events cannot be omitted from the stream even if later made irrelevant:

"The order in which the CmisChangedObjectType instances appear in the output set is the order in which the events happened, oldest first, for each instance of the content described by CmisChangedObjectType. For example, if an item was created at time t, updated at time t+1 and then deleted at time t+2, the order in which the events appear in the output set is create at t, update at t+1, delete at t+2, though these events would not necessarily be grouped together in a page of responses or even in the same response page. This is done so that the service consumer can process the events in the order they appear in the result set without having to remember what events it has already processed for an object."

I think we should relax this and allow repository optimizations such as the following:
- If object x is deleted later in time, only that deletion needs to be reported; its creation and updates can be omitted. We still need to keep the delete even for an object created after the changeToken, because an initial full crawl has a view of the content that does not exactly correspond to that changeToken (it may have captured changes that happened between the start and end of the crawl);
- When there is a sequence of creation and updates on object x, only the last update needs to be represented. This may look a bit awkward since the crawler may have to treat an update as a creation in its index, but this can save quite a few updates to the index for frequently changed items. Actually it seems that making a distinction between creation and update is not strictly necessary for the crawler.
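
To make the two optimizations above concrete, here is a minimal sketch in Python. It is only an illustration of the proposal, not anything defined by the CMIS draft: the ChangeEvent record, its field names, the change type strings, and compact_change_log are all hypothetical, and the input is assumed to already be ordered oldest first. Only the most recent event per object is kept, so a final delete hides the earlier create and updates, and a run of create/update events collapses to the last one.

```python
from typing import Iterable, List, NamedTuple


class ChangeEvent(NamedTuple):
    """Hypothetical stand-in for a change log entry (not the CMIS schema)."""
    object_id: str
    change_type: str  # assumed values: "created", "updated", "deleted"
    time: int         # logical timestamp; larger means later


def compact_change_log(events: Iterable[ChangeEvent]) -> List[ChangeEvent]:
    """Keep only the latest event for each object, oldest first.

    Assumes `events` is already ordered oldest first, so a later event
    for the same object simply replaces the earlier one.
    """
    latest = {}
    for event in events:
        latest[event.object_id] = event
    # Re-sort by time so the consumer still sees events in order.
    return sorted(latest.values(), key=lambda e: e.time)


log = [
    ChangeEvent("x", "created", 1),
    ChangeEvent("y", "created", 2),
    ChangeEvent("x", "updated", 3),
    ChangeEvent("y", "deleted", 4),
    ChangeEvent("x", "updated", 5),
]
compacted = compact_change_log(log)
```

With this input, the five events reduce to two: the delete of y and the last update of x. A crawler consuming the compacted log would have to treat "updated" as "create or update", which is the awkwardness noted above.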
 
------------------
REST binding

In the sample response, the single entry has all 3 types of cmis:changedObject. I assume that this is just to demonstrate the 3 possible forms, and that a real entry would contain only one of them. I think the sample would be much more informative if it contained one entry of each type. I would especially like to see the minimum set of information that must be included for a deleted item.

It seems strange that each entry in this collection is not a 'normal' entry as you would find in the other collections. I would expect the <cmis:properties> tag to remain a child of entry. cmis:changedObject should contain only the 'new' information relevant to the notion of change (so only the type of change and the time of change). Since both of those properties are simple, short, and strongly typed, I would expect them to be represented as attributes of cmis:changedObject.
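
As a rough sketch of the shape suggested above, the Python snippet below parses a made-up entry in which cmis:properties stays a direct child of entry and cmis:changedObject carries only two attributes. Everything specific here is an assumption for illustration: the attribute names changeType and changeTime, the sample id and timestamp, and the CMIS namespace URI (a placeholder, not the real one). Only the Atom namespace is the real one.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
CMIS_NS = "urn:example:cmis"  # placeholder URI, NOT the real CMIS namespace

# Hypothetical entry: attribute names and values are invented for illustration.
ENTRY_XML = f"""
<entry xmlns="{ATOM_NS}" xmlns:cmis="{CMIS_NS}">
  <id>urn:example:doc-1</id>
  <cmis:changedObject changeType="deleted" changeTime="2009-01-15T10:00:00Z"/>
  <cmis:properties/>
</entry>
""".strip()


def read_change(entry_xml: str) -> dict:
    """Extract the change information from one hypothetical entry."""
    entry = ET.fromstring(entry_xml)
    changed = entry.find(f"{{{CMIS_NS}}}changedObject")
    properties = entry.find(f"{{{CMIS_NS}}}properties")
    return {
        "id": entry.findtext(f"{{{ATOM_NS}}}id"),
        "change_type": changed.get("changeType"),
        "change_time": changed.get("changeTime"),
        # True when cmis:properties is a direct child of entry.
        "properties_under_entry": properties is not None,
    }


change = read_change(ENTRY_XML)
```

The point of this layout is that a consumer that already understands ordinary entries could read cmis:properties as usual and only needs two extra attribute lookups for the change information.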

> Provide a new service that will allow search crawlers to efficiently navigate a CMIS repository.
> ------------------------------------------------------------------------------------------------
>
>                 Key: CMIS-86
>                 URL: http://tools.oasis-open.org/issues/browse/CMIS-86
>             Project: OASIS Content Management Interoperability Services TC
>          Issue Type: New Feature
>          Components: Domain Model, REST/AtomPub Binding, Schema, Web Services Binding
>    Affects Versions: Draft 0.50
>            Reporter: Gregory Melahn
>            Assignee: Ethan Gur-esh
>             Fix For: Draft 0.6
>
>
> CMIS needs to allow repositories to expose what information inside the repository has changed in an efficient manner for applications of interest, like search crawlers, to facilitate incremental indexing of a repository.
> In theory, a search crawler could index the content of a CMIS repository by using the navigation mechanisms already defined as part of the proposed specification. For example, a crawler engine could start at the root collection and, using the REST bindings, progressively navigate through the folders, get the document content and metadata, and index that content. It could use the CMIS date/time stamps to more efficiently do this by querying for documents modified since the last crawl.
> But there are problems with this approach. First, there is no mechanism for knowing what has been deleted from the repository, so the indexed content would contain 'dead' references. Second, there is no standard way to get the access control information needed to filter the search results so the search consumer only sees the content (s)he is supposed to see. Third, each indexer would solve the crawling of the repository in a different way (for example, one could use query and one could use navigation) causing different performance and scalability characteristics that would be hard to control in such a system. Finally, the cost of indexing an entire repository can be prohibitive for large content, or content that changes often, requiring support for incremental crawling and paging results.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://tools.oasis-open.org/issues/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

