Subject: Re: [ws-tx] Issue 007 - WS-C: Make Register/RegisterResponse retriable

From: Alastair Green <alastair.green@choreology.com>
To: Andrew Wilkinson3 <awilkinson@uk.ibm.com>
Date: Mon, 19 Dec 2005 15:21:21 +0000

Andrew,

This discussion reminds me of the old (and very wise) joke about the farmer, who on being asked how to get to X, replies "If I were you, I wouldn't start from here".

I agree that it "works" (it is safe from a consistency standpoint), but it's very easy to do better, and the status quo stops working when we turn our attention to registrations for the AT Completion protocol, and to the operation of BA mixed outcome. I'd turn this around: what's the big deal about introducing Participant identifiers?

I think you, Ian and Max are right that registration can always be made to work safely for WS-AT 2PC and WS-BA PC and CC, with one (inessential) qualification, which I will come back to. I do not think it is good design, because it prohibits a feature required for failure-tolerant transactional recoverable applications, such as BPM engines.

[Making it work does not require banning duplicate Registers, and I think you cannot meaningfully introduce a prohibition on such duplicates, as the transport may fox you, at minimum. It is true that an implementation can police application-induced retries (although, paradoxically, to do so has to track participant ids internally, which makes one wonder why all the effort to avoid evincing them on the wire).]

First, the "abort because of re-registration" may occur even if the doubly-registered EPRs are stable. Absent observation of reply-tos, this situation (which can arise simply through duplicate delivery and a lost response) can cause an unnecessary abort (the "shadow" Completed for the first registration will never get through to the Coordinator).

If the EPR alters on re-registration (which is quite realistic for recoverable BPM engines, or other recoverable applications) then the Coordinator completion case can be handled by redirecting replies to ensure that the shadow traffic from the stale registration plays back correctly. Again, if reply-tos are ignored then the outcome will be consistent. That would require no spec change in the strictest sense: the Participant could ignore the redirect, and the transaction would eventually fail by timeout. (My diagram 2.)

In the case of Participant completion, the absence of one of the expected Completed messages will also cause a failure by timeout. Again, a consistent outcome, but why pre-ordain failure? (My diagram 3.)
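
A minimal sketch of this failure mode, assuming a hypothetical in-memory coordinator that has no participant identifier to compare against. The class, its method names and the example EPR are illustrative only and are not taken from WS-Coordination or WS-BA.

```python
from itertools import count


class NaiveCoordinator:
    """Hypothetical coordinator that cannot recognise a repeated Register."""

    def __init__(self):
        self._slots = count(1)
        self.completed = {}  # registration slot -> has Completed arrived?

    def register(self, participant_epr):
        # Every Register, including a retried or redelivered one,
        # creates a fresh registration with its own coordinator-side slot.
        slot = next(self._slots)
        self.completed[slot] = False
        return slot  # stands in for the RegisterResponse

    def on_completed(self, slot):
        self.completed[slot] = True

    def outcome_at_timeout(self):
        # A single missing Completed is enough to collapse the activity.
        return "close" if all(self.completed.values()) else "fail"


coord = NaiveCoordinator()
lost = coord.register("http://example.org/participant")   # response never arrives
known = coord.register("http://example.org/participant")  # retry or duplicate delivery
coord.on_completed(known)  # the participant only knows the second registration
result = coord.outcome_at_timeout()  # the shadow registration never completes
```

Under these assumptions the outcome stays consistent, but the activity fails even though the only real participant completed.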

The qualification is this: it is assumed that the failure of any one Participant to go Complete in a timely fashion is good enough to trigger collapse of the activity by timeout.

[This could be problematic. One could have a mixed outcome protocol that allows "rolling confirms": i.e. we selectively confirm (close in BA terms) one or more participants. By so doing we have decided that those participants are already sufficient to end up with a good outcome. Others may still be added to the confirm set. But the absence of a Completed from an unconfirmed participant is not enough to blow away the transaction. Now, we could then say: fine, if we never hear back from a registered Participant we will try to cancel it; if we fail to do so we will ignore that failure and let the rest of the transaction proceed to completion. This is OK in the case being considered, because the inability to receive Completed relates to a shadow Participant, which we would ideally wish to eliminate from consideration anyway. (All of what I have described is compatible with WS-BA, because it does not state any rules about permissible modes of partial completion.)]

So, pushing your argument to the limit: it is possible to make all modes of failure resulting from duplicate registration of a participant with varying EPRs "work", in the sense of preserving consistency, including in mixed outcome cases, for WS-AT 2PC and WS-BA's two protocols.

Peter and I have been proceeding from a different criterion of success: that avoiding unnecessary aborts in these cases is worthwhile. As I see it you are proceeding from the premise that these are corner cases -- they can occur, they must be catered for in terms of guaranteeing consistency, and that pre-ordained failure in two situations is acceptable.

The failure condition requires one of two circumstances: either plain duplication and message loss with stable EPRs but no mandatory use of wsa:ReplyTo; or a) the application has checkpointed, attempted registration, and then failed, and b) it has recovered and attempted to re-register with a different EPR, in which case, for Participant Completion, use of reply-to won't help.

I am still very unhappy with a protocol that doesn't permit successful operation in the face of these cases. It is easy to tolerate and overcome these comm/application failures, and we should help to enable that failure tolerance. Otherwise, building truly reliable, transactional recoverable applications is impeded.

I would not change my vote on the basis of the "corner case" argument, and I would ensure that my implementation avoided this problem by using extension participant identifiers when self-interoperating, if defeated in a vote in the TC.

[Parenthetically, the change implied thus far by your thinking would be simply to remove AlreadyRegistered from WS-Coordination (it cannot stay, because in your scheme you have no way of detecting duplicates, and you don't need any such mechanism, so a fault that depends on duplicate detection cannot be processed).]

However, we have not yet exhausted the argument.

What about WS-AT Completion Protocol? As I understand it, the original design intent of the fault AlreadyRegistered was to notify the requester that someone else has registered to act as the terminator of the transaction (Initiator in WS-AT CP terminology).

Am I right in thinking that users of the WS-AT Completion Protocol are intended to register upfront? If so, and it is desired to prevent duplicate registration, then the WS-AT CP Participant must be capable of being screened for duplication. If it is not, then we run the risk that a second Register will be permitted and processed when it in fact results from two concurrent attempts at registration, which are not intended to be allowed. We can either stop trying to police that, or we need ... to be able to identify Participants.
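
A rough sketch of the screening described here, assuming a hypothetical identifier carried on each Register for the Completion protocol. `CompletionRegistrar`, `initiator_id` and the exception class are invented names; the exception merely stands in for the AlreadyRegistered fault.

```python
class AlreadyRegistered(Exception):
    """Stands in for the AlreadyRegistered fault."""


class CompletionRegistrar:
    """Hypothetical registrar that allows one terminator per transaction."""

    def __init__(self):
        self.initiator_id = None

    def register(self, initiator_id):
        if self.initiator_id is None:
            self.initiator_id = initiator_id
            return "registered"
        if self.initiator_id == initiator_id:
            # Same party retrying, or the transport redelivering: answer as before.
            return "registered"
        # A different party is trying to become the terminator.
        raise AlreadyRegistered(initiator_id)


registrar = CompletionRegistrar()
first = registrar.register("initiator-A")
retry = registrar.register("initiator-A")  # harmless duplicate
try:
    registrar.register("initiator-B")      # genuine second registrant
    rejected = False
except AlreadyRegistered:
    rejected = True
```

Without some identifier to compare, the second and third calls would look the same to the registrar, which is the policing problem raised above.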

Finally, I still remain to be persuaded that you can properly build systems using mixed outcome without the ability to correlate registrations with the application traffic which stimulates/accompanies registration. I would like to see other people's thoughts on this latter point: the discussion on this aspect has been one-sided thus far. (See issue 014 for a description of the problem.)

If we introduce participant identifiers then every case can be handled, and we end up with cleaner, less convoluted, implementations, if we choose to take advantage.
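
One possible shape for this, sketched under the assumption that each Register carries a stable participant identifier chosen by the registrant. The names below are hypothetical and the sketch ignores persistence, security and the wire format.

```python
class IdentifyingCoordinator:
    """Hypothetical coordinator keyed by participant identifier."""

    def __init__(self):
        self.registrations = {}  # participant_id -> most recent EPR
        self.completed = set()   # participant_ids that have sent Completed

    def register(self, participant_id, participant_epr):
        is_duplicate = participant_id in self.registrations
        # A re-registration replaces the stale EPR instead of adding a shadow.
        self.registrations[participant_id] = participant_epr
        return "duplicate" if is_duplicate else "new"

    def on_completed(self, participant_id):
        self.completed.add(participant_id)

    def outcome_at_timeout(self):
        return "close" if self.completed == set(self.registrations) else "fail"


coord = IdentifyingCoordinator()
before_crash = coord.register("p1", "http://example.org/old-address")
after_recovery = coord.register("p1", "http://example.org/new-address")
coord.on_completed("p1")
result = coord.outcome_at_timeout()
```

Here a participant that recovers and re-registers with a different EPR still counts once, so a single Completed is enough for the activity to proceed.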

Yours,

Alastair

Andrew Wilkinson3 wrote:

Alastair,

> So, introducing participant identifiers will resolve both Issue 007
> and Issue 014.
>
> It is a more elegant approach that does not create forced
> implementation choices that are required to work around its absence
> for the duplicate registration problem. It is the only solution that
> will make BA MixedOutcome or BA Participant Completion registration
> workable (at least, the only one thus far proposed).


I believe that it is possible to create an interoperable implementation of 
BA Participant Completion without being able to detect duplicate 
registrations. In the event of a coordinator receiving two register 
messages for the same participant, either because the register was retried 
or the transport delivered the message twice, it will expect the receipt 
of two completed messages. Should one of these messages not be forthcoming 
the coordinator may opt to complete in failure - sending cancel to the 
participant from which it has not received a completed message and 
compensate to the one from which such a message has been received.

Assuming that the participant implementation has been coded such that it 
does not retry sending of a register message (and we could mandate this in 
the spec if needed) what we're coping with here is an error in the 
transport and, as such, completing in failure is entirely appropriate. 
What's important is that the outcome is consistent and I believe that the 
BA state table ensures that this will be the case.

Andy
