wsrf message

Subject: Re: [wsrf] Scheduled termination, heartbeats and dependent objects

From: David Hull <dmh@tibco.com>
Date: Wed, 08 Sep 2004 14:37:07 -0400

Samuel Meder wrote:

See comment inline.

On Fri, 2004-09-03 at 11:11, David Hull wrote:

Hello all,

Following on to some discussion in the WSN group, Steve Graham has
asked me to share some of my thoughts on scheduled termination and
other resource lifetime issues.  Here is an attempt to do so.

As we all know, one of the major differences between distributed
computing and local computing is uncertainty in communication.  If I
indirect through a valid pointer, I can expect the hardware to
retrieve something from memory.  If not, I've got bigger problems than
a bad pointer.  If I send a message to an endpoint, however, there are
any number of perfectly ordinary reasons why that might fail.  In a
local environment, I can assume that at least all the logical
components of the system are in place.  In a distributed environment I
can't.

This impacts resource lifetimes directly.  If I ask to destroy a
resource in a local environment, I can assume it's gone.  In a
distributed environment, any reason messaging may fail is a reason a
resource may leak.  So we need a robust way of cleaning up resources
that, for whatever reason, are no longer needed.  The only tool we
have is messaging, and I know of two basic (and very similar)
approaches to determining whether a resource is still needed:
scheduled termination and heartbeating.

In scheduled termination, a resource consumer negotiates a termination
time.  Absent any further communication, the resource provider may
assume that the consumer no longer needs the resource, and both
parties know this.  The consumer may extend or terminate a use of a
resource by sending a subsequent message.  Though I haven't seen it
done, it would also be possible for the provider simply to require
renewals at a given fixed interval.
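To make the scheduled-termination exchange concrete, here is a minimal provider-side sketch. Everything in it is an assumption made for illustration: the `Provider` and `Lease` names, the `MAX_GRANT` policy cap, and the use of plain floating-point seconds for time. It is not the WS-ResourceLifetime interface, only the shape of the negotiation: the consumer asks for a termination time, the provider answers with the time it actually grants, and a periodic sweep reclaims anything not renewed.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Lease:
    """Provider-side record of a resource's scheduled termination time."""
    termination_time: float  # seconds on the provider's clock

    def is_expired(self, now: float) -> bool:
        # Absent a renewal, the provider may reclaim the resource from here on.
        return now >= self.termination_time


class Provider:
    # Hypothetical provider-side policy: never grant more than this many seconds.
    MAX_GRANT = 240.0

    def __init__(self) -> None:
        self.leases: Dict[str, Lease] = {}

    def set_termination_time(self, resource_id: str, requested: float, now: float) -> float:
        """Handle a create or renew request; return the termination time granted."""
        granted = min(requested, now + self.MAX_GRANT)
        self.leases[resource_id] = Lease(granted)
        return granted

    def destroy(self, resource_id: str) -> None:
        # Explicit early termination requested by the consumer.
        self.leases.pop(resource_id, None)

    def sweep(self, now: float) -> List[str]:
        """Destroy and return every resource whose termination time has passed."""
        expired = [rid for rid, lease in self.leases.items() if lease.is_expired(now)]
        for rid in expired:
            del self.leases[rid]
        return expired
```

The reply value matters here: both parties end up knowing the same termination time, which is what lets the consumer decide when it must renew.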

Heartbeating is typically used in the related case of determining
whether a particular server is alive.  The server agrees to send out
messages (generally multicast) at no longer than an agreed interval
(in some variations, the heartbeat message contains a "time until next
heartbeat" field, allowing for a variable interval between
heartbeats).  If a client does not hear from a server for more than a
given number of heartbeat periods, it assumes that the server is
down.  It's not hard to see that a variation of this could work in the
resource world: The consumer sends the provider periodic heartbeats,
and if the provider misses too many heartbeats, it assumes the
resource is no longer needed. 
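The resource-world variation of heartbeating might look roughly like the sketch below, seen from the provider's side. The class name, the `miss_limit` parameter, and the optional `next_in` field (standing in for "time until next heartbeat") are all hypothetical; real heartbeat protocols differ in how they count missed periods.

```python
from typing import Optional


class HeartbeatMonitor:
    """Provider-side view of the heartbeats sent by one consumer."""

    def __init__(self, interval: float, miss_limit: int, now: float) -> None:
        self.interval = interval      # agreed maximum gap between heartbeats, in seconds
        self.miss_limit = miss_limit  # silent intervals tolerated before giving up
        self.last_seen = now

    def on_heartbeat(self, now: float, next_in: Optional[float] = None) -> None:
        # One-way message: record the arrival; no reply is sent.
        self.last_seen = now
        if next_in is not None:
            # Variation in which the message announces the next interval.
            self.interval = next_in

    def missed(self, now: float) -> int:
        # Whole intervals that have elapsed since the last heartbeat.
        return int((now - self.last_seen) // self.interval)

    def presumed_gone(self, now: float) -> bool:
        # The resource is assumed to be no longer needed once the limit is reached.
        return self.missed(now) >= self.miss_limit
```

Note that nothing here is negotiated: the sender decides when to send and what interval to announce, and the receiver only observes.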

Viewed this way, the main difference between scheduled termination and
heartbeating is who determines the interval: the provider or the
consumer.  In either scheme the interval between renewals or heartbeats
may be fixed in advance or determined with each message.  In both
schemes, the provider may erroneously conclude that a consumer has
disappeared.


I believe that scheduled termination captures both of these scenarios.
Ultimately the constraints on the interval are determined by the
intersection of consumer-side and producer-side policy. If the producer
needs control, it can simply enforce an allowed renewal interval (and
possibly advertise this via policy), whereas the consumer can control
the interval since it is the one sending the messages.

A couple of points:

* Heartbeat is effectively an out-only operation. In scheduled termination, you say "I'd like this for another 5 minutes" and I say "well, you can have it for 4." In heartbeating, you say "I've got it for another 5 minutes." and <= 5 minutes later you say "I've got it for another 5 minutes (or whatever)." You don't care what I reply. Describing this as a request/reply is confusing at best.
* Because the heartbeating entity acts unilaterally, it had better be talking about resources it owns. In other words, it's OK to send a heartbeat that says "my database is still up". It's not such a good idea to send a heartbeat that says "I own your disk for another 5 minutes." That's what leasing is for -- "May I have your disk for another 5 minutes?" "Sure." This asymmetry may affect policy negotiation.
* Heartbeats may be multicast (and in some environments usually are). There are cases where it is feasible and not surprisingly more efficient for a process to multicast "I'm alive" every so often than for all of its counterparties to ask it repeatedly "will you be alive for the next interval?". Does scheduled termination handle multicast heartbeats?
* It might be nice to know who's driving. One problem I have with the current scheduled termination semantics is that it's hard to tell what's going on: The lessor suggests a termination time. The lessee may use that, or ignore it, use it as a hint or whatever. This potentially repeats with each renewal. I can imagine arbitrary amounts of toolkit logic dedicated to guessing a good initial termination time, but why bother if you can know that the lessee is just going to be heartbeating to you at regular intervals no matter what you do?

I'm pretty sure there's a distinction to be made between scheduled termination and heartbeating. I'm even more sure that there are ways other than scheduled termination to handle the general problem, even though there will always be periodic pinging going on at some level -- the relevant questions are whether and how this pinging is visible at the resource level.

If you grant that a realistic framework will have to handle more than just scheduled termination, it doesn't seem harmful to distinguish between the two mechanisms.

I don't see any need to introduce another mechanism for supporting
heartbeats.

/Sam

Now suppose that a particular consumer needs a large number of
resources from a provider on an all-or-nothing basis.  When the
consumer is done with a particular operation, it will want to release
all of these resources.  If for whatever reason the consumer fails, we
would like the provider to be able to detect this and release all
resources associated with the consumer.  We would definitely not like
to have to send a renew/heartbeat/destroy message for each resource
individually.

The solution in this case still involves periodic messages, but we
would like to send as few of these as possible.  One approach would be
to create a "parent" resource for the resources to be treated as a
group.  The consumer and provider then cooperate to track this single
resource.  If the consumer destroys this resource, or if the provider
does not receive the necessary renew/heartbeat messages, the provider
destroys the entire group of resources.
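A rough sketch of the parent-resource idea follows, assuming a hypothetical `GroupingProvider` that keeps a deadline only on the parent. The names and the deadline representation are invented for illustration. The point is that a single renew or destroy message covers every dependent resource.

```python
from typing import Dict, List, Set


class GroupingProvider:
    """Tracks lifetime on parent resources only; children follow their parent."""

    def __init__(self) -> None:
        self.parent_deadline: Dict[str, float] = {}
        self.children: Dict[str, Set[str]] = {}

    def create_parent(self, parent_id: str, deadline: float) -> None:
        self.parent_deadline[parent_id] = deadline
        self.children[parent_id] = set()

    def add_child(self, parent_id: str, child_id: str) -> None:
        self.children[parent_id].add(child_id)

    def renew(self, parent_id: str, deadline: float) -> None:
        # One message keeps the whole group alive.
        if parent_id in self.parent_deadline:
            self.parent_deadline[parent_id] = deadline

    def destroy_parent(self, parent_id: str) -> List[str]:
        """Destroy the parent; return the ids of everything released with it."""
        if parent_id not in self.parent_deadline:
            return []
        del self.parent_deadline[parent_id]
        released = sorted(self.children.pop(parent_id))
        return [parent_id] + released

    def sweep(self, now: float) -> List[str]:
        """Release every group whose parent was not renewed in time."""
        overdue = [p for p, d in self.parent_deadline.items() if now >= d]
        released: List[str] = []
        for parent_id in overdue:
            released.extend(self.destroy_parent(parent_id))
        return released
```

With three dependent resources under one parent, the consumer sends one renewal per interval instead of three, and a failed consumer costs the provider one expiry check instead of three.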

I believe that many existing systems do essentially this, though it is
usually not phrased in these terms.  For example, the consumer may
establish a session context via a TCP connection to a provider.  If
the consumer terminates the connection, or the connection is dropped
for whatever reason, the session is destroyed and the provider frees
all resources associated with the consumer.  I don't think it's too
much of a stretch to view the session as a parent object with the
other resources dependent on it.

For what it's worth, TCP is essentially using a heartbeat mechanism
under the covers, and this is one reason why I made a point of
describing heartbeating.  Often a process will monitor the heartbeats
of another and destroy local resources associated with that process if
heartbeats fail.

This all suggests a two-tiered approach to resource lifetimes:
     1. Primitive lifetime management mechanisms.  A resource is
        destroyed when
              * The consumer explicitly requests destruction.
              * A recognized external event occurs, e.g., TCP informs
                an application that a connection has been terminated.
              * A scheduled termination time is reached without a
                renewal.
              * A given number of heartbeats is missed.
     2. Lifetime management by dependency.  A resource is destroyed
        when its parent resource is destroyed.
I'm not yet convinced that this will cover all lifetime scenarios, but
it does allow large collections of resources to be treated efficiently
as a group.
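The two tiers can be written down as a single decision function. This is only a sketch: the `Resource` fields and the `Reason` enum are hypothetical stand-ins for whatever state a real provider keeps, and the order in which the primitive conditions are checked is an arbitrary choice.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Reason(Enum):
    EXPLICIT = "consumer requested destruction"
    EXTERNAL_EVENT = "recognized external event"
    LEASE_EXPIRED = "termination time reached without renewal"
    HEARTBEATS_MISSED = "heartbeat limit reached"
    PARENT_DESTROYED = "parent resource destroyed"


@dataclass
class Resource:
    name: str
    parent: Optional["Resource"] = None
    destroy_requested: bool = False
    connection_lost: bool = False             # e.g. TCP reported a dropped connection
    termination_time: Optional[float] = None  # None: no scheduled termination
    missed_heartbeats: int = 0
    miss_limit: Optional[int] = None          # None: not heartbeat-monitored


def destruction_reason(res: Resource, now: float) -> Optional[Reason]:
    """Return why the resource should be destroyed, or None to keep it."""
    # Tier 1: primitive lifetime management mechanisms.
    if res.destroy_requested:
        return Reason.EXPLICIT
    if res.connection_lost:
        return Reason.EXTERNAL_EVENT
    if res.termination_time is not None and now >= res.termination_time:
        return Reason.LEASE_EXPIRED
    if res.miss_limit is not None and res.missed_heartbeats >= res.miss_limit:
        return Reason.HEARTBEATS_MISSED
    # Tier 2: lifetime management by dependency.
    if res.parent is not None and destruction_reason(res.parent, now) is not None:
        return Reason.PARENT_DESTROYED
    return None
```

A child with no lifetime state of its own is kept or destroyed purely according to its parent, which is what allows a large group to be managed through one tracked resource.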
