

Subject: Scheduled termination, heartbeats and dependent objects


Hello all,

Following on to some discussion in the WSN group, Steve Graham has asked me to share some of my thoughts on scheduled termination and other resource lifetime issues.  Here is an attempt to do so.

As we all know, one of the major differences between distributed computing and local computing is uncertainty in communication.  If I indirect through a valid pointer, I can expect the hardware to retrieve something from memory.  If not, I've got bigger problems than a bad pointer.  If I send a message to an endpoint, however, there are any number of perfectly ordinary reasons why that might fail.  In a local environment, I can assume that at least all the logical components of the system are in place.  In a distributed environment I can't.

This impacts resource lifetimes directly.  If I ask to destroy a resource in a local environment, I can assume it's gone.  In a distributed environment, any reason messaging may fail is a reason a resource may leak.  So we need a robust way of cleaning up resources that, for whatever reason, are no longer needed.  The only tool we have is messaging, and I know of two basic (and very similar) approaches to determining whether a resource is still needed: scheduled termination and heartbeating.

In scheduled termination, a resource consumer negotiates a termination time with the provider.  Absent any further communication, the provider may assume that after that time the consumer no longer needs the resource, and both parties know this.  The consumer may extend or terminate its use of the resource by sending a subsequent message.  Though I haven't seen it done, it would also be possible for the provider simply to require renewals at a given fixed interval.
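
To make the mechanics concrete, here is a rough Python sketch of the provider-side bookkeeping for scheduled termination.  The class and method names (ScheduledTerminationTable, renew, sweep) are mine and purely illustrative, not taken from any spec:

    import time

    class ScheduledTerminationTable:
        """Provider-side bookkeeping: each resource carries an absolute
        termination time; silence past that time means the resource goes."""

        def __init__(self):
            self._termination = {}   # resource_id -> termination time (epoch seconds)

        def create(self, resource_id, lifetime_seconds):
            # Consumer and provider agree on an initial termination time.
            self._termination[resource_id] = time.time() + lifetime_seconds

        def renew(self, resource_id, lifetime_seconds):
            # A subsequent message from the consumer pushes the time out.
            self._termination[resource_id] = time.time() + lifetime_seconds

        def destroy(self, resource_id):
            # Explicit destruction by the consumer.
            self._termination.pop(resource_id, None)

        def sweep(self):
            # Run periodically by the provider: anything past its termination
            # time is presumed unneeded and reclaimed.
            now = time.time()
            expired = [rid for rid, t in self._termination.items() if t <= now]
            for rid in expired:
                del self._termination[rid]
            return expired

The consumer's "extend" and "terminate" messages map onto renew and destroy; everything else is the provider acting on silence.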

Heartbeating is typically used in the related case of determining whether a particular server is alive.  The server agrees to send out messages (generally multicast) spaced no more than an agreed interval apart (in some variations, the heartbeat message contains a "time until next heartbeat" field, allowing for a variable interval between heartbeats).  If a client does not hear from a server for more than a given number of heartbeat periods, it assumes that the server is down.  It's not hard to see that a variation of this could work in the resource world: the consumer sends the provider periodic heartbeats, and if the provider misses too many of them, it assumes the resource is no longer needed.
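
A similarly rough sketch of the provider-side view of consumer heartbeats; the interval and the "how many missed heartbeats is too many" threshold (max_missed below) are policy choices I've invented for illustration:

    import time

    class HeartbeatMonitor:
        """Provider-side record of when each resource's consumer last
        checked in; max_missed is an arbitrary policy threshold."""

        def __init__(self, interval_seconds, max_missed=3):
            self.interval = interval_seconds
            self.max_missed = max_missed
            self._last_heard = {}    # resource_id -> time of last heartbeat

        def heartbeat(self, resource_id):
            # Called whenever a heartbeat message arrives from the consumer.
            self._last_heard[resource_id] = time.time()

        def presumed_unneeded(self, resource_id):
            # If more than max_missed heartbeat periods have passed with no
            # message, assume the consumer no longer needs the resource.
            last = self._last_heard.get(resource_id)
            if last is None:
                return False
            return time.time() - last > self.interval * self.max_missed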

Viewed this way, the main difference between scheduled termination and heartbeating is who determines the interval: the provider or the consumer.  In either scheme the interval between renewals/heartbeats may be fixed in advance or determined with each message.  In both schemes, the provider may erroneously conclude that a consumer has disappeared.

Now suppose that a particular consumer needs a large number of resources from a provider on an all-or-nothing basis.  When the consumer is done with a particular operation, it will want to release all of these resources.  If for whatever reason the consumer fails, we would like the provider to be able to detect this and release all resources associated with the consumer.  We would definitely not like to have to send a renew/heartbeat/destroy message for each resource individually.

The solution in this case still involves periodic messages, but we would like to send as few of these as possible.  One approach would be to create a "parent" resource for the resources to be treated as a group.  The consumer and provider then cooperate to track this single resource.  If the consumer destroys this resource, or if the provider does not receive the necessary renew/heartbeat messages, the provider destroys the entire group of resources.
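
A minimal sketch of the parent/dependent idea, again with invented names: the provider tracks only the parent, and destroying the parent (for any of the reasons above) releases every dependent:

    class ParentResource:
        """One tracked 'parent' stands in for a whole group of resources;
        all names here are illustrative."""

        def __init__(self, release):
            self._children = set()
            self._release = release      # callback that frees one dependent resource
            self._destroyed = False

        def add_child(self, resource_id):
            self._children.add(resource_id)

        def destroy(self):
            # Invoked when the consumer explicitly destroys the parent, or when
            # the provider stops seeing renew/heartbeat messages for it.
            if self._destroyed:
                return
            self._destroyed = True
            for rid in self._children:
                self._release(rid)
            self._children.clear()

The renew/heartbeat traffic then covers only the parent, so the per-operation messaging cost stays constant no matter how many dependents there are.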

I believe that many existing systems do essentially this, though it is usually not phrased in these terms.  For example, the consumer may establish a session context via a TCP connection to a provider.  If the consumer terminates the connection, or the connection is dropped for whatever reason, the session is destroyed and the provider frees all resources associated with the consumer.  I don't think it's too much of a stretch to view the session as a parent object with the other resources dependent on it.

For what it's worth, TCP is essentially using a heartbeat mechanism under the covers, and this is one reason why I made a point of describing heartbeating.  Often a process will monitor the heartbeats of another and destroy local resources associated with that process if heartbeats fail.

This all suggests a two-tiered approach to resource lifetimes:
  1. Primitive lifetime management mechanisms.  A resource is destroyed when
    • The consumer explicitly requests destruction.
    • A recognized external event occurs, e.g., TCP informs an application that a connection has been terminated.
    • A scheduled termination time is reached without a renewal.
    • A given number of heartbeats is missed.
  2. Lifetime management by dependency.  A resource is destroyed when its parent resource is destroyed.
I'm not yet convinced that this will cover all lifetime scenarios, but it does allow large collections of resources to be treated efficiently as a group.
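
Putting the two tiers together, here is one last illustrative sketch (field and function names are mine) of how a provider might decide when a resource dies, and how destruction cascades from a parent to its dependents:

    import time

    def should_destroy(resource, now=None):
        # Tier 1 (primitive) rules for a single resource; the dictionary keys
        # are invented field names, not anything from a spec.
        now = time.time() if now is None else now
        if resource.get("explicit_destroy_requested"):
            return True
        if resource.get("connection_terminated"):             # external event, e.g. TCP
            return True
        term = resource.get("termination_time")
        if term is not None and now >= term:                  # scheduled termination lapsed
            return True
        if resource.get("missed_heartbeats", 0) >= resource.get("max_missed_heartbeats", 3):
            return True
        return False

    def destroy_with_dependents(resource_id, resources, children):
        # Tier 2: when a parent goes, its dependents (and theirs) go with it.
        # children maps a parent id to the set of ids that depend on it.
        resources.pop(resource_id, None)
        for child_id in children.pop(resource_id, set()):
            destroy_with_dependents(child_id, resources, children)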


