I have a couple of use cases that would definitely benefit
from this "better" lifetime management of resources. Internally, we
already have a similar two-level lifetime management scheme for
resources.
One comment, though it may be too much of a local
detail: when resources are "accidentally" freed because, e.g., the expected
heartbeats were not received, and the user comes back later wanting to
find out what happened, we'd like a way to report what happened and
possibly show the last states of the resources (we may have to re-allocate, or
partially allocate, the resources for the rest of the operation). This may be a matter
of local robustness or of keeping history, but it is very important to our cases.
Maybe a "grace period" in the lifetime management
mechanism would help in case of abnormal or forced termination.
Hello all,
Following on to some discussion in the WSN group,
Steve Graham has asked me to share some of my thoughts on scheduled
termination and other resource lifetime issues. Here is an attempt to do
so.
As we all know, one of the major differences between distributed
computing and local computing is uncertainty in communication. If I
indirect through a valid pointer, I can expect the hardware to retrieve
something from memory. If not, I've got bigger problems than a bad
pointer. If I send a message to an endpoint, however, there are any
number of perfectly ordinary reasons why that might fail. In a local
environment, I can assume that at least all the logical components of the
system are in place. In a distributed environment I can't.
This
impacts resource lifetimes directly. If I ask to destroy a resource in a
local environment, I can assume it's gone. In a distributed environment,
any reason messaging may fail is a reason a resource may leak. So we
need a robust way of cleaning up resources that, for whatever reason, are no
longer needed. The only tool we have is messaging, and I know of two
basic (and very similar) approaches to determining whether a resource is still
needed: scheduled termination and heartbeating.
In scheduled
termination, a resource consumer negotiates a termination time. Absent
any further communication, the resource provider may assume that the consumer
no longer needs the resource, and both parties know this. The consumer
may extend or terminate a use of a resource by sending a subsequent
message. Though I haven't seen it done, it would also be possible for
the provider simply to require renewals at a given fixed
interval.
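To make the scheduled-termination idea concrete, here is a minimal sketch in Python. All names (LeasedResource, renew, is_expired) are illustrative, not from any existing specification; the clock is injectable so the behavior is easy to demonstrate.

```python
import time

class LeasedResource:
    """Illustrative resource with a negotiated termination time (a lease)."""

    def __init__(self, lease_seconds, clock=time.monotonic):
        self._clock = clock
        self._lease_seconds = lease_seconds
        # Absent further communication, the provider may reclaim the
        # resource once this time passes.
        self._expires_at = clock() + lease_seconds

    def renew(self, lease_seconds=None):
        """Consumer extends its use; may also negotiate a new interval."""
        if lease_seconds is not None:
            self._lease_seconds = lease_seconds
        self._expires_at = self._clock() + self._lease_seconds

    def terminate(self):
        """Consumer explicitly ends its use early."""
        self._expires_at = self._clock()

    def is_expired(self):
        """Provider-side check: safe to reclaim once this returns True."""
        return self._clock() >= self._expires_at
```

The provider never has to trust that a destroy message will arrive; it only has to check is_expired() when it sweeps its resources.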
Heartbeating is typically used in the related case of
determining whether a particular server is alive. The server agrees to
send out messages (generally multicast) at no longer than an agreed interval
(in some variations, the heartbeat message contains a "time until next
heartbeat" field, allowing for a variable interval between heartbeats).
If a client does not hear from a server for more than a given number of
heartbeat periods, it assumes that the server is down. It's not hard to
see that a variation of this could work in the resource world: The consumer
sends the provider periodic heartbeats, and if the provider misses too many
heartbeats, it assumes the resource is no longer needed.
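The provider-side variant of heartbeating can be sketched the same way. Again the names are hypothetical; the optional next_period argument models the "time until next heartbeat" field mentioned above.

```python
import time

class HeartbeatMonitor:
    """Illustrative provider-side monitor: the consumer is presumed gone
    after it misses more than max_missed heartbeat periods."""

    def __init__(self, period_seconds, max_missed, clock=time.monotonic):
        self._period = period_seconds
        self._max_missed = max_missed
        self._clock = clock
        self._last_beat = clock()

    def beat(self, next_period=None):
        """Record a heartbeat; the message itself may carry a variable
        'time until next heartbeat' interval."""
        if next_period is not None:
            self._period = next_period
        self._last_beat = self._clock()

    def presumed_dead(self):
        missed = (self._clock() - self._last_beat) / self._period
        return missed > self._max_missed
```

Note the symmetry with scheduled termination: here the consumer sends the periodic messages and the provider does the bookkeeping, rather than the other way around.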
Viewed this
way, the main difference between scheduled termination and heartbeating is who
determines the interval, whether the provider or consumer. In either
scheme the interval between renewal/heartbeat may be fixed in advance or
determined with each message. In both schemes, the provider may
erroneously think a consumer has disappeared.
Now suppose that a
particular consumer needs a large number of resources from a provider on an
all-or-nothing basis. When the consumer is done with a particular
operation, it will want to release all of these resources. If for
whatever reason the consumer fails, we would like the provider to be able to
detect this and release all resources associated with the consumer. We
would definitely not like to have to send a renew/heartbeat/destroy
message for each resource individually.
The solution in this case still
involves periodic messages, but we would like to send as few of these as
possible. One approach would be to create a "parent" resource for the
resources to be treated as a group. The consumer and provider then
cooperate to track this single resource. If the consumer destroys this
resource, or if the provider does not receive the necessary renew/heartbeat
messages, the provider destroys the entire group of resources.
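A sketch of the "parent" idea, with hypothetical names: the consumer and provider track only the parent, and destroying it cascades to every dependent resource.

```python
class Resource:
    """Illustrative resource; destruction is just a flag here."""

    def __init__(self, name):
        self.name = name
        self.destroyed = False

    def destroy(self):
        self.destroyed = True

class ParentResource(Resource):
    """Destroying the parent, whether by explicit request or because its
    renewals/heartbeats lapsed, destroys the entire group."""

    def __init__(self, name):
        super().__init__(name)
        self.children = []

    def add_child(self, resource):
        self.children.append(resource)
        return resource

    def destroy(self):
        for child in self.children:
            child.destroy()
        super().destroy()
```

Only the parent needs renew/heartbeat traffic; the children carry no lifetime protocol of their own.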
I
believe that many existing systems do essentially this, though it is usually
not phrased in these terms. For example, the consumer may establish a
session context via a TCP connection to a provider. If the consumer
terminates the connection, or the connection is dropped for whatever reason,
the session is destroyed and the provider frees all resources associated with
the consumer. I don't think it's too much of a stretch to view the
session as a parent object with the other resources dependent on
it.
For what it's worth, TCP is essentially using a heartbeat mechanism
under the covers, and this is one reason why I made a point of describing
heartbeating. Often a process will monitor the heartbeats of another and
destroy local resources associated with that process if heartbeats
fail.
This all suggests a two-tiered approach to resource
lifetimes:
- Primitive lifetime management mechanisms. A resource is destroyed
when
- The consumer explicitly requests destruction.
- A recognized external event occurs, e.g., TCP informs an application
that a connection has been terminated.
- A scheduled termination time is reached without a renewal.
- A given number of heartbeats is missed.
- Lifetime management by dependency. A resource is destroyed when
its parent resource is destroyed.
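The two tiers compose naturally: any of the primitive triggers invokes the same destroy path, and dependency then propagates the destruction. A minimal sketch, with illustrative names (the recorded reason string is an assumption, included to show where a trigger could be logged):

```python
class ManagedResource:
    """Illustrative resource under two-tiered lifetime management."""

    def __init__(self, parent=None):
        self.destroyed = False
        self.reason = None
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def destroy(self, reason="explicit request"):
        """Tier 1: any primitive trigger (explicit request, external
        event such as a dropped TCP connection, missed renewal, missed
        heartbeats) calls destroy with the appropriate reason.
        Tier 2: destruction cascades to all dependent resources."""
        if self.destroyed:
            return
        self.destroyed = True
        self.reason = reason
        for child in self.children:
            child.destroy(reason="parent destroyed")
```

Whichever primitive mechanism fires, the dependency tier does the rest, so the per-resource bookkeeping stays cheap.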
I'm not yet convinced that this
will cover all lifetime scenarios, but it does allow large collections of
resources to be treated efficiently as a
group.