
Subject: Re: [wsrf] Scheduled termination, heartbeats and dependent objects
From: Samuel Meder <meder@mcs.anl.gov>
To: David Hull <dmh@tibco.com>
Date: Wed, 08 Sep 2004 13:03:42 -0500

See comment inline.

On Fri, 2004-09-03 at 11:11, David Hull wrote:
> Hello all,
> 
> Following on to some discussion in the WSN group, Steve Graham has
> asked me to share some of my thoughts on scheduled termination and
> other resource lifetime issues.  Here is an attempt to do so.
> 
> As we all know, one of the major differences between distributed
> computing and local computing is uncertainty in communication.  If I
> indirect through a valid pointer, I can expect the hardware to
> retrieve something from memory.  If not, I've got bigger problems than
> a bad pointer.  If I send a message to an endpoint, however, there are
> any number of perfectly ordinary reasons why that might fail.  In a
> local environment, I can assume that at least all the logical
> components of the system are in place.  In a distributed environment I
> can't.
> 
> This impacts resource lifetimes directly.  If I ask to destroy a
> resource in a local environment, I can assume it's gone.  In a
> distributed environment, any reason messaging may fail is a reason a
> resource may leak.  So we need a robust way of cleaning up resources
> that, for whatever reason, are no longer needed.  The only tool we
> have is messaging, and I know of two basic (and very similar)
> approaches to determining whether a resource is still needed:
> scheduled termination and heartbeating.
> 
> In scheduled termination, a resource consumer negotiates a termination
> time.  Absent any further communication, the resource provider may
> assume that the consumer no longer needs the resource, and both
> parties know this.  The consumer may extend or terminate a use of a
> resource by sending a subsequent message.  Though I haven't seen it
> done, it would also be possible for the provider simply to require
> renewals at a given fixed interval.
> 
> Heartbeating is typically used in the related case of determining
> whether a particular server is alive.  The server agrees to send out
> messages (generally multicast) at no longer than an agreed interval
> (in some variations, the heartbeat message contains a "time until next
> heartbeat" field, allowing for a variable interval between
> heartbeats).  If a client does not hear from a server for more than a
> given number of heartbeat periods, it assumes that the server is
> down.  It's not hard to see that a variation of this could work in the
> resource world: The consumer sends the provider periodic heartbeats,
> and if the provider misses too many heartbeats, it assumes the
> resource is no longer needed. 
> 
> Viewed this way, the main difference between scheduled termination and
> heartbeating is who determines the interval, whether the provider or
> consumer.  In either scheme the interval between renewal/heartbeat may
> be fixed in advance or determined with each message.  In both schemes,
> the provider may erroneously think a consumer has disappeared.

I believe that scheduled termination captures both of these scenarios.
Ultimately the constraints on the interval are determined by the
intersection of consumer-side and provider-side policy. If the provider
needs control, it can just enforce an allowed renewal interval (and
possibly advertise this via policy), whereas the consumer can control the
interval since it is the one sending the messages.

I don't see any need to introduce another mechanism for supporting
heartbeats.
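
To make the point concrete, here is a minimal sketch of how scheduled
termination alone can behave like a heartbeat. It is illustrative only:
the Provider and LeasedResource classes, the max_renewal_interval
parameter, and the choice to model the provider's policy as an upper
bound on the granted interval are all assumptions made for this example,
not anything defined by WSRF.

```python
from dataclasses import dataclass


@dataclass
class LeasedResource:
    """A resource governed by a scheduled termination time (hypothetical)."""
    termination_time: float  # absolute time, in seconds
    destroyed: bool = False


class Provider:
    """Hypothetical provider that caps how far ahead a consumer may renew."""

    def __init__(self, max_renewal_interval: float):
        # Assumed provider-side policy: no lease extends further than this.
        self.max_renewal_interval = max_renewal_interval

    def _grant(self, now: float, requested_interval: float) -> float:
        # The effective interval is the intersection of both policies.
        return now + min(requested_interval, self.max_renewal_interval)

    def create(self, now: float, requested_interval: float) -> LeasedResource:
        return LeasedResource(self._grant(now, requested_interval))

    def renew(self, res: LeasedResource, now: float,
              requested_interval: float) -> float:
        # A renewal that arrives too late is rejected.
        if res.destroyed or now >= res.termination_time:
            raise LookupError("resource has already terminated")
        res.termination_time = self._grant(now, requested_interval)
        return res.termination_time

    def sweep(self, res: LeasedResource, now: float) -> None:
        # Absent a renewal, the provider may assume the resource is unneeded.
        if not res.destroyed and now >= res.termination_time:
            res.destroyed = True


provider = Provider(max_renewal_interval=30.0)
res = provider.create(now=0.0, requested_interval=120.0)
first_deadline = res.termination_time

# The consumer chooses to renew every 10 seconds: in effect, a heartbeat.
for t in (10.0, 20.0, 30.0):
    provider.renew(res, now=t, requested_interval=120.0)

provider.sweep(res, now=59.0)
alive_before_deadline = not res.destroyed
provider.sweep(res, now=60.0)  # no renewal arrived after t=30
```

The consumer asks for 120 seconds each time, but the provider grants at
most 30, so the consumer has to keep renewing for the resource to
survive. The renewals play the role of heartbeats without any separate
mechanism.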

/Sam

> Now suppose that a particular consumer needs a large number of
> resources from a provider on an all-or-nothing basis.  When the
> consumer is done with a particular operation, it will want to release
> all of these resources.  If for whatever reason the consumer fails, we
> would like the provider to be able to detect this and release all
> resources associated with the consumer.  We would definitely not like
> to have to send a renew/heartbeat/destroy message for each resource
> individually.
> 
> The solution in this case still involves periodic messages, but we
> would like to send as few of these as possible.  One approach would be
> to create a "parent" resource for the resources to be treated as a
> group.  The consumer and provider then cooperate to track this single
> resource.  If the consumer destroys this resource, or if the provider
> does not receive the necessary renew/heartbeat messages, the provider
> destroys the entire group of resources.
> 
> I believe that many existing systems do essentially this, though it is
> usually not phrased in these terms.  For example, the consumer may
> establish a session context via a TCP connection to a provider.  If
> the consumer terminates the connection, or the connection is dropped
> for whatever reason, the session is destroyed and the provider frees
> all resources associated with the consumer.  I don't think it's too
> much of a stretch to view the session as a parent object with the
> other resources dependent on it.
> 
> For what it's worth, TCP is essentially using a heartbeat mechanism
> under the covers, and this is one reason why I made a point of
> describing heartbeating.  Often a process will monitor the heartbeats
> of another and destroy local resources associated with that process if
> heartbeats fail.
> 
> This all suggests a two-tiered approach to resource lifetimes:
>      1. Primitive lifetime management mechanisms.  A resource is
>         destroyed when
>               * The consumer explicitly requests destruction.
>               * A recognized external event occurs, e.g., TCP informs
>                 an application that a connection has been terminated.
>               * A scheduled termination time is reached without a
>                 renewal
>               * A given number of heartbeats is missed.
>      2. Lifetime management by dependency.  A resource is destroyed
>         when its parent resource is destroyed.
> I'm not yet convinced that this will cover all lifetime scenarios, but
> it does allow large collections of resources to be treated efficiently
> as a group.
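
The two-tiered scheme quoted above can be sketched roughly as follows.
This is only an illustration under assumptions: the Registry and
Resource classes and their method names are hypothetical, only explicit
destruction and scheduled termination are modelled as primitive
mechanisms, and destruction by dependency is modelled as a simple
cascade from parent to children.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Resource:
    name: str
    parent: Optional[str] = None
    termination_time: Optional[float] = None  # None: no scheduled termination
    children: List[str] = field(default_factory=list)


class Registry:
    """Hypothetical provider-side table of live resources."""

    def __init__(self) -> None:
        self.resources: Dict[str, Resource] = {}

    def create(self, name: str, parent: Optional[str] = None,
               termination_time: Optional[float] = None) -> None:
        if parent is not None:
            self.resources[parent].children.append(name)
        self.resources[name] = Resource(name, parent, termination_time)

    def renew(self, name: str, new_termination_time: float) -> None:
        # Tier 1: one renewal message extends the tracked resource.
        self.resources[name].termination_time = new_termination_time

    def destroy(self, name: str) -> None:
        # Tier 1: explicit destruction. Tier 2: dependents go with the parent.
        res = self.resources.pop(name, None)
        if res is None:
            return  # already gone
        for child in res.children:
            self.destroy(child)
        if res.parent in self.resources:
            self.resources[res.parent].children.remove(name)

    def sweep(self, now: float) -> None:
        # Tier 1: scheduled termination reached without a renewal.
        expired = [n for n, r in self.resources.items()
                   if r.termination_time is not None
                   and now >= r.termination_time]
        for n in expired:
            self.destroy(n)


reg = Registry()
reg.create("session", termination_time=60.0)
for i in range(3):
    reg.create(f"buffer-{i}", parent="session")

reg.renew("session", 120.0)   # a single message keeps the whole group alive
reg.sweep(now=100.0)
count_after_renewal = len(reg.resources)

reg.sweep(now=120.0)          # the consumer went silent; the group is released
count_after_expiry = len(reg.resources)
```

Only the parent carries a termination time, so the consumer sends one
renewal per interval regardless of how many dependent resources exist.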
-- 
Sam Meder <meder@mcs.anl.gov>
The Globus Alliance - University of Chicago
630-252-1752