OASIS Mailing List Archives

provision message


Subject: Large Data Set Handling in SPML 2.0...



 

One of the issues discussed during the SPML 1.0 effort was how to handle the large data sets that a search request may return. In other words, if a search returns 100,000 entries, the client may want to get the first 10,000 entries in one response, request the next 10,000 in another response, and so on until all 100,000 entries have been returned.
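A minimal sketch of that paged exchange from the client's side, assuming the client detects the end of the result set by receiving a short page. The `request_page` helper is a hypothetical stand-in for whatever search operation the protocol would define, not an SPML call:

```python
# Sketch of client-side paging over a large search result.
# request_page(offset, size) is a hypothetical helper standing in for
# an SPML search exchange; it returns up to `size` entries.

def fetch_all(request_page, page_size=10_000):
    """Collect a large result set one page at a time."""
    entries = []
    offset = 0
    while True:
        page = request_page(offset, page_size)
        entries.extend(page)
        if len(page) < page_size:    # short page => no more data
            break
        offset += page_size
    return entries

# Example with a fake 25,000-entry result set:
data = list(range(25_000))
pages_served = []

def request_page(offset, size):
    pages_served.append(offset)
    return data[offset:offset + size]

result = fetch_all(request_page)
assert result == data
assert pages_served == [0, 10_000, 20_000]   # three round trips
```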

 

Other protocols typically handle this with some kind of iterator that is passed back and forth to track the state of the data transfer. This was one of the approaches discussed during the SPML 1.0 effort, and an iterator approach is also proposed in the IBM submission for SPML 2.0.
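The iterator idea can be illustrated as an opaque token the server hands back with each partial result. This is a sketch only; the `search`/`iterate` operation names, the token format, and the in-memory offset table are all assumptions, not the SPML wire format:

```python
import uuid

# Sketch of an iterator-token exchange: the server returns an opaque
# iterator id alongside each partial result, and the client presents
# that id to fetch the next page. The server must remember per-iterator
# state (here, the next offset) -- which is what makes the search stateful.

class Server:
    def __init__(self, entries, page_size):
        self.entries = entries
        self.page_size = page_size
        self.iterators = {}              # iterator id -> next offset

    def search(self, query=None):
        return self._page(offset=0)

    def iterate(self, iterator_id):
        offset = self.iterators.pop(iterator_id)
        return self._page(offset)

    def _page(self, offset):
        page = self.entries[offset:offset + self.page_size]
        next_offset = offset + self.page_size
        if next_offset < len(self.entries):
            token = str(uuid.uuid4())
            self.iterators[token] = next_offset   # server keeps state
            return page, token
        return page, None                         # None => result complete

server = Server(entries=list(range(25)), page_size=10)
page, token = server.search()
collected = list(page)
while token is not None:
    page, token = server.iterate(token)
    collected.extend(page)
assert collected == list(range(25))
```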

 

While using an iterator in this fashion is a reasonable approach, it raises other issues that must be addressed. The use of an iterator implies that the search is stateful, even though HTTP is not a stateful protocol.

 

The kinds of questions this leads to are:

1) How long should search results be kept before they are no longer available for iterating? In other words, what happens when the client never asks for the rest of the data within a specified time period?

2) Is that time period specified by the client or the server?

3) How much of the data from the search on the underlying resource should be cached?

4) Is that cache size specified by the client or the server?

5) What are the security implications? What should happen if a client asks for the next iteration of a search initiated by a different client? Is that considered a security breach, and is preventing it in scope?

 

None of these issues means that we should not use an iterator approach; they are simply issues we have to deal with. Some of them will be implementation-specific, but we should note those cases in the specification.

 

Does anyone have any other suggested solutions for handling large data sets that they would like to put forward?

 

Jeff Bohren

OpenNetwork Technologies


