Hi Mark,
Comments interleaved:
Mark Little wrote:
Alastair Green wrote:
I'm sorry, but I don't get it.
1. Replay is never sent from the Coordinator to the Participant.
I never said that, did I?
No, you didn't. I was establishing a premise, that we are only
interested in participant-driven recovery, and (in the following
points) that the principle of coordinator-driven recovery is infinite
retry of the same messages until they get through. All of which begs the
question: "why does AT not use the same principles for
participant-driven recovery?".
I spent time on this because you said the following:
"Different type of failure. I interpret the resend of Prepared on comms
failure to be in the case where the sender (the participant) knows that
the original Prepared wasn't delivered. [AG: that is, a
participant-detected failure.] My original statement above referring to
comms failures is more: there was a network partition after Prepared
was successfully delivered and this partition has been healed. In the
meantime, the coordinator committed, couldn't contact the participant
because of the network partition and so must go into some form of
recovery mode. [AG: which is to say, coordinator-side recovery] From
the coordinator's perspective, there is no way for it to distinguish
between a network partition and the failure of the machine on which the
participant resides. From the participant's perspective, there is a
difference, though the resolution is the same: it initiates a Replay
message."
[AG: there is no difference between the perception of failure of the C
by the P, or the P by the C. If the net is down or divided, either
sender may get a comm failure.]
My point was: what is the relevance of being a C or a P? The behaviour
needed (indefinite retry) is identical for Prepare/Prepared, and
Prepared/Commit | Rollback. Both sides should have the same retry
behaviour.
2. If the Coordinator never receives Prepared, it resends Prepare. If it
never gets Committed back, it resends Commit (your scenario). In each
case it does so as often as it wants until it gets the corresponding
*ed message (Prepared or Committed) back.
Sure. But that's top-down (coordinator driven recovery). I thought what
we were discussing was bottom-up (participant driven recovery). Can you
confirm that is your reading of the original issue too?
Absolutely. I am trying to establish we all agree on the principle of
infinite retry-driven guaranteed message delivery. It's followed C to
P, but it isn't followed in P to C, which is the precise point of this
issue.
3. If the Participant fails and recovers, it knows that it may not have
sent Prepared (it could fail between the log write and the message
send), and must communicate the semantic "prepared". A message exists
that carries exactly that semantic: Prepared.
Or, it could send Replay ;-)?
Yes, we could, but why on earth would we define a new message for the
same semantic?
If the Participant tries to send Prepared (before or after crash
recovery) and the message send fails to its knowledge (one
interpretation of comms time out), it resends Prepared.
Sure. No argument there: if it knows the Prepared failed to be
delivered, then it can obviously resend for an implementation-defined
(potentially infinite) time. It could then periodically keep retrying.
Or, it could send Replay later.
Having an extra message requires justification. Saying it exists is not
a justification. We don't have two messages with subtly differing
semantics (or odder still, with identical semantics) for
coordinator-driven recovery.
If the Participant never receives Commit or Rollback (another
interpretation of comms time out), it again resends Prepared.
Or Replay.
In other words, the Participant sends and resends Prepared until it
gets Commit or Rollback, across all failures and for all time.
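As a rough illustration of the rule just stated (resend Prepared until
Commit or Rollback arrives), here is a minimal Python sketch. The names
`resend_prepared_until_outcome`, `send`, `poll_outcome` and
`max_attempts` are hypothetical and not part of the AT specification; a
real participant would also wait between attempts and would have logged
its prepared state before the first send.

```python
from typing import Callable, Optional

OUTCOMES = ("Commit", "Rollback")

def resend_prepared_until_outcome(
    send: Callable[[str], None],
    poll_outcome: Callable[[], Optional[str]],
    max_attempts: Optional[int] = None,
) -> Optional[str]:
    """Send Prepared repeatedly until Commit or Rollback is observed.

    max_attempts=None means retry for ever; a bound is only useful in tests.
    """
    attempts = 0
    while max_attempts is None or attempts < max_attempts:
        send("Prepared")          # same message every time, first send or resend
        attempts += 1
        outcome = poll_outcome()  # None models a comms time out
        if outcome in OUTCOMES:
            return outcome
    return None

# Usage: the first two attempts time out, the third yields Commit.
sent = []
replies = iter([None, None, "Commit"])
result = resend_prepared_until_outcome(sent.append, lambda: next(replies))
```

The loop does not care why the outcome is missing (lost Prepared, lost
Commit, or a restart in between), which is the point being argued.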
4. The OTS replay_completion is not a precedent. OTS uses RPCs, not
one-way messages. This makes retry behaviour more difficult to model.
But if we strip that aside, we see that OTS does exactly the opposite
of AT: it does not tolerate communications failure if the prepared
semantic fails to get through, and it does not cause premature abort
after a recoverable failure in the prepared state. In my view, both OTS
and AT are wrong: *there is no reason to treat comms failure and crash
recovery differently, either in mechanism of retrying or in effect on
transaction outcome.*
In OTS we say Vote vote = resource.prepare(), and the Vote enumeration
tells us whether it's prepared, readonly or rollback. The operation is
not idempotent -- a communications failure that prevents the vote
returning will cause transaction abort. I think this is wrong and
arbitrary, i.e. it is a bad precedent and should not be copied.
Correctly, AT does not copy this feature, and tolerates this failure
(comms time out = resend Prepared).
I think you're definitely misinterpreting my reference to
replay_completion: I'm talking only about the bottom-up recovery
scenario, which is exactly the same scenario this issue describes.
In OTS the failure to send (or a failure to deliver) the vote from P to
C will cause a comms time out at the coordinator end. OTS treats a
comms failure from P to C as causing an abort. AT treats a comms
failure from P to C (failure to send or deliver, if there is enough
acking going on in the transport) as the occasion for a resend of
Prepared. That is the difference I was pointing out. The point of
detection in OTS is different, but in both cases we are talking about a
message failing to get from P to C.
I agree that in OTS a C to P message failure could also cause
transaction abort (a product of the decision to prohibit C to P retries
of prepare in OTS). I wasn't trying to comment on that.
If the participant fails in OTS then it can't tell when it failed (did
it ever return from the prepare operation, i.e. send back the Vote?)
So, it has to send a message to say: "I am prepared"
(replay_completion), and it will receive a status. It may also get a
replay of commit or rollback, as these operations can be duplicated
(they are idempotent).
replay_completion is defined as being "a hint to the coordinator" that
the prepared participant has never received commit or rollback. As a
hint it cannot affect the state or the behaviour of the coordinator,
other than to stimulate a replay of commit or rollback, speeding things
up. Its semantic is: "I am prepared". (The additional semantic "And
once I failed" is irrelevant.) Correctly, in OTS replaying the
prepared semantic never causes transaction abort, as it wrongly can in
AT.
I think you're mixing issues, which can only lead to confusion. Let's
keep this strictly at the issue in hand. It'll make it easier for
everyone else to follow.
I think you missed the point of what this section said. This issue 052
concerns the fact that resending Prepared causes different C behaviour
than sending Replay. OTS correctly makes the Vote returned on prepare
(the normal send) have exactly the same semantics as replay_completion
(the retry send). AT does not follow this good precedent of OTS.
The only reason for the existence of replay_completion as a distinct
operation is because you can't return the response/return value of an
RPC twice.
If OTS had modelled this using one ways, it would have ended up with
two interfaces (simplified, and forgetting my IDL syntax, and changing
the real names to save looking them up):
enum Vote { Commit, Readonly, Rollback };  // Commit = Prepared

interface coordinator
{
    oneway void vote(in Vote v);
};

interface resource
{
    oneway void prepare();
    oneway void commit();
    oneway void rollback();
};
Our failure scenario would then logically be:

    C invokes resource.prepare()
    P invokes coordinator.vote (Vote.Commit)
    P fails
    P invokes coordinator.vote (Vote.Commit)

In AT this appears as

    C sends Prepare
    P sends Prepared
    P fails
    P resends Prepared

The separate message replay_completion is an artefact of RPC, not of
the requirements of the transaction protocol.
The correct behaviour for AT is to resend Prepared in the face of comms
failures, and after crash recovery.
I disagree. The correct behaviour is to send Replay.
It seems to me that this is just an assertion: you haven't yet
presented a single argument as to why this should be the case.
I think you agree that Replay should not have a different semantic than
Prepared (the point of this issue 052). If that is true we will create
a Replay row in the CV state table that is identical in all respects to
the Prepared row, other than the fact that it has the label Replay. Why
should we do this? It is truly pointless.
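To make the point about the duplicate row concrete, here is a small
Python sketch of a coordinator-view dispatch table. The states, actions
and the `CV_TABLE` layout are simplified assumptions for illustration and
are not copied from the AT state tables. The only thing it shows is that
once Replay carries the same semantic as Prepared, its row is a copy and
the coordinator cannot react differently to it.

```python
from typing import Dict, Tuple

# Hypothetical, simplified coordinator-view row: state -> (action, next state).
PREPARED_ROW: Dict[str, Tuple[str, str]] = {
    "Preparing": ("record vote", "Preparing"),
    "Committing": ("resend Commit", "Committing"),
    "Aborting": ("resend Rollback", "Aborting"),
}

CV_TABLE: Dict[str, Dict[str, Tuple[str, str]]] = {
    "Prepared": PREPARED_ROW,
    # A Replay row with no early abort is just the Prepared row relabelled.
    "Replay": dict(PREPARED_ROW),
}

def on_message(state: str, message: str) -> Tuple[str, str]:
    """Look up the coordinator's reaction to an inbound participant message."""
    return CV_TABLE[message][state]

# True when the two rows agree in every state.
identical = all(
    on_message(s, "Prepared") == on_message(s, "Replay") for s in PREPARED_ROW
)
```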
This is the relevance of the OTS comparison. OTS has to have two ways
of sending the semantic, because it uses an RPC to get back the
semantic in normal operation, and must define a separate operation
(message) for recovery. But the reaction to the semantic "prepared" is
identical.
AT does not use RPC; every message is a one-way. It does not need two
messages for one semantic. The only semantic difference between
Prepared and Replay, if you accept that early abort is wrong, is that
Replay additionally conveys the secondary, irrelevant meaning: "I have
failed and recovered". As this is truly secondary (has no effect on the
receiver, which acts identically on receipt) it is truly irrelevant.
This can be proved. If a P implementation resends Prepared on recovery
(and no-one can stop it doing that) then transactions will complete
correctly and with no diminution in QoS.
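A sketch of what such a P implementation might do on restart, in Python.
The log format, the event names and the function
`resend_prepared_on_recovery` are assumptions made for the example, not
anything defined by AT.

```python
from typing import Callable, Dict, Iterable, List, Tuple

def resend_prepared_on_recovery(
    log_records: Iterable[Tuple[str, str]],
    send: Callable[[str, str], None],
) -> List[str]:
    """After a restart, resend Prepared for every transaction still in doubt.

    log_records is a hypothetical durable log of (transaction id, event)
    pairs, where event is "prepared", "committed" or "rolled back".
    """
    last_event: Dict[str, str] = {}
    for txid, event in log_records:
        last_event[txid] = event  # the latest record for a transaction wins
    in_doubt = sorted(t for t, e in last_event.items() if e == "prepared")
    for txid in in_doubt:
        # The participant cannot know whether the original Prepared was sent,
        # so it sends the same message again rather than a distinct Replay.
        send(txid, "Prepared")
    return in_doubt

# Usage: t1 was prepared before the crash and has no outcome; t2 completed.
log = [("t1", "prepared"), ("t2", "prepared"), ("t2", "committed")]
sent = []
in_doubt = resend_prepared_on_recovery(log, lambda t, m: sent.append((t, m)))
```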
Having to cater for Replay is either a waste of implementers' time, or
it creates an inconsistency, which I think we both agree has no evident
rationale. Maybe someone else can provide a rationale, but I haven't
seen one yet.
Alastair
Mark.
Alastair
Mark Little wrote:
Alastair Green wrote:
Hi Mark,
Just one point:
Mark Little wrote:
Since it crashed in Prepared Success state we should be able to assume
that the participant obeyed the rules
and made its decision to be able to commit durable. Hence, this Replay
message should be interpreted as a), though the semantic of "have
recovered" shouldn't exclude the fact that the failure may have been in
the network and not the participant service itself (for instance).
One might think so, but in fact when the Participant experiences a
comms time out it Resends Prepared (PV state table).
Which begs the question: if that works for comm failures, why do we do
something different for process failures which are recovered?
Different type of failure. I interpret the resend of Prepared on comms
failure to be in the case where the sender (the participant) knows that
the original Prepared wasn't delivered. My original statement above
referring to comms failures is more: there was a network partition
after Prepared was successfully delivered and this partition has been
healed. In the meantime, the coordinator committed, couldn't contact
the participant because of the network partition and so must go into
some form of recovery mode. From the coordinator's perspective, there
is no way for it to distinguish between a network partition and the
failure of the machine on which the participant resides. From the
participant's perspective, there is a difference, though the resolution
is the same: it initiates a Replay message.
I just wanted to make sure our definition of failure didn't preclude
partitions.
The implication of the two events for the Coordinator, as you point
out, should be identical (we are ensuring that the Prepared Success
state is communicated to the Coordinator).
But these are different scenarios. As a slight (related) aside: the OTS
works fine with replay_completion on the RecoveryCoordinator, so there
is precedent for Replay.
Mark.
Alastair
Alastair