ws-tx message

Subject: RE: [ws-tx] Commentary on Issue 007 - WS-C: Make Register/RegisterResponse retriable

From: "Peter Furniss" <peter.furniss@choreology.com>
To: "'Max Feingold'" <Max.Feingold@microsoft.com>,<ws-tx@lists.oasis-open.org>
Date: Wed, 8 Feb 2006 22:48:00 -0000

I think we may have a terminology/concept difference on "durable" and "volatile" identifiers, which is confusing some of this discussion.

The participant identifiers we were talking about were intended to be an unambiguous identifier for a state entity. If the entity, as a set of state, disappeared in a crash, then anything that reappeared in the same system would have a different identifier.

One can of course have an identifier for the thing that stays there, regardless of its state information. An example of this is the Jini participant identifier, which is essentially static (and can be registered in multiple transactions over time, but only once in each; Jini passes the txid on ALL messages). But Jini also defines a "crash count", which is incremented after a crash/state loss. The participant identifier we were proposing is effectively Jini id + crash count (or that + txid) - it is the identifier of the enrolled thing.

Used in this sense, the example below just doesn't happen - you DON'T have the same id after the crash.
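As a rough illustration of the "id + crash count (+ txid)" idea, here is a minimal Python sketch. The class and field names (EnrolledId, static_id, crash_count, tx_id) are invented for this example and come from neither Jini nor the WS-TX specifications; it also assumes the crash counter itself survives a crash.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnrolledId:
    """Identifies the enrolled state entity, not the long-lived service."""
    static_id: str    # Jini-style identifier that stays the same across crashes
    crash_count: int  # incremented after every crash / state loss
    tx_id: str        # transaction the entity is enrolled in


class Participant:
    def __init__(self, static_id: str) -> None:
        self.static_id = static_id
        self.crash_count = 0  # assumed to be kept on stable storage

    def crash_and_recover(self) -> None:
        # Enrolment state is lost; only the static id and the counter remain.
        self.crash_count += 1

    def enrolled_id(self, tx_id: str) -> EnrolledId:
        return EnrolledId(self.static_id, self.crash_count, tx_id)


p = Participant("P")
first = p.enrolled_id("T")
retry = p.enrolled_id("T")        # plain retry: same enrolled entity
p.crash_and_recover()
after_crash = p.enrolled_id("T")  # same service, different enrolled entity

print(first == retry, first == after_crash)  # prints: True False
```

A retry before the crash carries an equal identifier, while anything sent after the crash carries a different one, which is the property relied on above.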

And with an id like that, nothing needs to invent spuriously different endpoints (with volatile, opaque identifiers).

Peter

-----Original Message-----
From: Max Feingold [mailto:Max.Feingold@microsoft.com]
Sent: 27 January 2006 02:27
To: ws-tx@lists.oasis-open.org
Subject: [ws-tx] Commentary on Issue 007 - WS-C: Make Register/RegisterResponse retriable

Hello.

I’d like to add some written content on issue 007, in order to clarify the point I made verbally during our last TC call: that durable participant identifiers in isolation do not add any value to registration retry scenarios in WS-AT.

In an earlier message [1] to this list, I introduced the details of two important scenarios where a transaction manager (TM) will send a registration retry in the context of WS-AT.

The first is the positive case: a subordinate TM opts to send a second registration message because its first registration message did not receive a response (either because the Register message was lost or because the RegisterResponse message was lost). In this case, the expectation is that the dropped message will not prevent the transaction from committing.

The second is the negative case: a subordinate TM successfully registers for 2PC, then fails and recovers. Because WS-AT presumes abort, the recovering TM will have forgotten about its membership in the existing transaction. Consequently, any participants it might have accumulated during the active phase before the failure will also have been forgotten. Because WS-AT is a disconnected protocol and the TM has no recollection of prior events, no immediate action will be taken by any node in the transaction tree. As the active phase proceeds, the TM may be re-infected with the same transaction. If that occurs, the TM will naturally attempt to register with a coordinator[2] in order to (re)join the transaction. In this case, the expectation is that to avoid transaction tree splits (and therefore data corruption) the transaction needs to be aborted.

In my earlier message[1], I explained how a participant implementation can ensure that each of these scenarios is processed correctly in the context of the current specifications. To summarize, the participant simply needs to use a unique volatile identifier in each registration EPR in order to distinguish which individual registration is being targeted by subsequent protocol messages composed by its coordinator(s). This technique allows the participant to send an arbitrary number of registration requests while retaining correctness (in the context of the two scenarios mentioned above) and imposing no special requirements on the behavior of a WS-AT coordinator.
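A minimal Python sketch of that technique follows, under stated assumptions: the class and method names are hypothetical, the identifier is assumed to travel inside the registration EPR (for instance as a reference parameter), and replying "Aborted" to a message aimed at an unknown enlistment is this sketch's reading of presumed abort rather than a quotation from the specification.

```python
import uuid


class Participant:
    """Hypothetical subordinate TM; enlistment state is deliberately volatile."""

    def __init__(self) -> None:
        self.enlistments = {}  # registration id -> transaction id (in memory only)

    def new_registration(self, tx_id: str) -> str:
        # A fresh identifier for every Register message that is sent.
        reg_id = uuid.uuid4().hex
        self.enlistments[reg_id] = tx_id
        return reg_id

    def crash_and_recover(self) -> None:
        # Presumed abort: nothing about active transactions is remembered.
        self.enlistments = {}

    def on_prepare(self, reg_id: str) -> str:
        # A Prepare aimed at an enlistment we do not remember means we are
        # amnesiac about that registration, so the transaction must abort.
        if reg_id not in self.enlistments:
            return "Aborted"
        return "Prepared"


# Positive case: a retry creates a second enlistment that P still remembers.
p = Participant()
first = p.new_registration("T")
second = p.new_registration("T")
positive = (p.on_prepare(first), p.on_prepare(second))

# Negative case: the enlistment created before the crash is unknown afterwards.
q = Participant()
old = q.new_registration("T")
q.crash_and_recover()
new = q.new_registration("T")
negative = q.on_prepare(old)

print(positive, negative)  # prints: ('Prepared', 'Prepared') Aborted
```

The coordinator needs no special behaviour here: it simply addresses each enlistment through the EPR it was given.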

What I would like to do now is explain why this technique _must_ still be used even if durable participant identifiers are used in registration. In other words, _adding durable participant identifiers does not add value to WS-AT’s registration retry semantics_.

To illustrate this argument, I’ve outlined the behavior that we would observe when durable participant identifiers are used, but the volatile identifiers mentioned above (or an equivalent) are not used.

As before, P is a participant, C a coordinator and T a transaction.

1.      P registers for durable 2PC on C for transaction T.
        a.      P provides C with durable identifier IP.
2.      P accumulates a number of participants during the active phase.
3.      P fails and recovers, and is re-infected with T by a local application.
4.      P (re-)registers for durable 2PC on C for transaction T.
        a.      P provides C with the same durable identifier IP.
5.      C recognizes IP and responds to P with a successful RegisterResponse message.
        a.      C continues to think of P’s multiple registrations as a single enlistment.

At this point, there is nothing preventing the transaction from committing. This violates the principles outlined above for the re-infection scenario. Data corruption will result.
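The five steps above can be simulated with a small Python sketch. The Coordinator class, its register method and the example EPR address are hypothetical stand-ins, not part of WS-C. The sketch compares what C observes in the harmless retry case with what it observes in the re-infection case.

```python
class Coordinator:
    """Hypothetical coordinator that folds duplicates by durable identifier."""

    def __init__(self) -> None:
        self.enlistments = {}  # (tx_id, durable_id) -> participant EPR

    def register(self, tx_id: str, durable_id: str, epr: str) -> str:
        key = (tx_id, durable_id)
        if key not in self.enlistments:
            self.enlistments[key] = epr
        # Step 5: a recognized identifier still gets a successful response.
        return "RegisterResponse"


EPR = "http://p.example/2pc"  # made-up address, same before and after the crash


def retry_after_lost_response():
    c = Coordinator()
    replies = [c.register("T", "IP", EPR), c.register("T", "IP", EPR)]
    return replies, len(c.enlistments)


def reregister_after_crash():
    c = Coordinator()
    first = c.register("T", "IP", EPR)
    # P crashes here, forgets T and its accumulated participants,
    # recovers, and is re-infected with T (steps 2 to 4).
    second = c.register("T", "IP", EPR)
    return [first, second], len(c.enlistments)


print(retry_after_lost_response() == reregister_after_crash())  # prints: True
```

From C's side the two histories are identical, so C has no basis for aborting in one case and committing in the other.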

One can imagine a couple of variations on this behavior in order to attempt to address the problem:

1.      C could abort the transaction when it recognizes P’s duplicate registration in step 5.
        a.      This is self-defeating, as the entire purpose of the feature is to allow registration retries. In other words, this breaks the positive first scenario.
        b.      If the original C is different from the second C, the latter will not know to abort the transaction.
2.      C could send an augmented RegisterResponse indicating “already registered”. P could detect this condition and abort if it only remembers sending one registration message.
        a.      This is not a full solution: it _stops working_ as soon as the recovered P needs to send more than one Register message (e.g. it falls into the deliberate retry pattern exemplified by the positive first scenario).
        b.      If the original C is different from the second C, no augmented response will be sent. Only P can truly know (or deduce) that it has registered twice, but forgotten the state associated with the first enlistment.
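To make the weakness of variation 2 concrete, here is a hedged Python sketch. The "AlreadyRegistered" reply, the class names and the response_lost flag are all invented for illustration; no such response exists in the current specifications.

```python
class Coordinator:
    def __init__(self) -> None:
        self.seen = set()  # (tx_id, durable_id) pairs already registered

    def register(self, tx_id: str, durable_id: str) -> str:
        key = (tx_id, durable_id)
        if key in self.seen:
            return "AlreadyRegistered"  # hypothetical augmented response
        self.seen.add(key)
        return "Registered"


class Participant:
    def __init__(self, durable_id: str) -> None:
        self.durable_id = durable_id
        self.sent = {}  # tx_id -> Register messages remembered (volatile)

    def crash_and_recover(self) -> None:
        self.sent = {}

    def register(self, c: Coordinator, tx_id: str, response_lost: bool = False) -> str:
        self.sent[tx_id] = self.sent.get(tx_id, 0) + 1
        reply = c.register(tx_id, self.durable_id)
        if response_lost:
            return "NoResponse"
        # Heuristic from variation 2: "already registered" plus a memory of
        # only one Register message means we must have forgotten something.
        if reply == "AlreadyRegistered" and self.sent[tx_id] == 1:
            return "Abort"
        return "Continue"


# Case A: a single Register after recovery; the heuristic catches the amnesia.
c, p = Coordinator(), Participant("IP")
p.register(c, "T")
p.crash_and_recover()
case_a = p.register(c, "T")

# Case B: the recovered P has to retry; the heuristic is silently defeated.
c, p = Coordinator(), Participant("IP")
p.register(c, "T")
p.crash_and_recover()
p.register(c, "T", response_lost=True)
case_b = p.register(c, "T")

print(case_a, case_b)  # prints: Abort Continue
```

In case B the recovered P remembers two Register messages, attributes the "already registered" reply to its own retry, and carries on with a transaction that should have been aborted.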

The bottom line is that the participant needs to own the problem of noticing that it has become amnesiac. It does not work to attempt to push the problem off onto a coordinator, which is essentially what the durable participant id proposal does. To make these scenarios work correctly, some kind of per-enlistment identifier scheme needs to be used at the participant, in order to distinguish individual enlistments created by individual registrations. This is in effect the same scheme that was already discussed [1]. It can be implemented easily, cleanly and interoperably with the current specifications.

I conclude therefore that WS-AT does not benefit from durable participant identifiers, for the stated purpose of allowing registration retries to occur safely.

For those coordination protocols that do happen to need participant identifiers, in order to enable some other feature, it is trivial to use the extensibility provided by WS-C to add them as a feature of those protocols. That is _precisely_ why this extensibility was created: to allow the design and use of features that do not make sense for all coordination protocols.

Thanks,

-mfeingol

[1] http://lists.oasis-open.org/archives/ws-tx/200512/msg00223.html

[2] Due to the vagaries of transaction propagation, the coordinator may not be the same one as before.
