ws-tx message

Subject: Re: [ws-tx] Issue 016: WS-C: ReplaceParticipant
From: Ian Robinson <ian_robinson@uk.ibm.com>
To: Alastair Green <alastair.green@choreology.com>
Date: Wed, 21 Dec 2005 15:01:55 +0000




Alastair,
Deployments that claim to provide high availability must (and can) do so
today by providing EPRs that are externally stable (i.e. stable from the
perspective of the remote, foreign requester). The proxy gateway example I
used previously, requiring no knowledge or cooperation by the requester, is
more or less equivalent to the indirect IOR in CORBA - which is a well
established pattern. This is an infrastructure rather than an application
concern so it is not an unreasonable burden.
On the sauce front, the wsa:ReplyTo MIH on non-terminal protocol messages
is primarily a mechanism to enable protocol messages to be delivered after
a participant has forgotten a transaction (e.g. to enable an AT participant
to respond wsat:Aborted to a duplicate wsat:Rollback) but it is also a
safeguard that ensures that, once agreement protocol messages have started,
they can always be completed regardless of the high-availability capability
of the deployments involved in the activity. Providing a more general
purpose capability to refresh incoherent EPRs is, I believe, more a matter
for the WS-A WG and beyond the scope of WS-Tx.
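
A minimal sketch of the ReplyTo safeguard described above, assuming a toy in-memory model: the Message and Participant classes, the tx_id field and the example addresses are all hypothetical and are not taken from any WS-AT implementation. It shows a participant answering a duplicate Rollback with Aborted by way of the ReplyTo address once it has forgotten the transaction.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Message:
    action: str                      # e.g. "Rollback" or "Aborted"
    tx_id: str                       # hypothetical transaction identifier
    reply_to: Optional[str] = None   # stands in for the wsa:ReplyTo EPR


@dataclass
class Participant:
    # tx_id -> coordinator address logged at registration time
    known: dict = field(default_factory=dict)
    # (destination, message) pairs, standing in for the network
    outbox: list = field(default_factory=list)

    def on_rollback(self, msg: Message) -> None:
        # Handling Rollback forgets the transaction, so a duplicate
        # Rollback finds no logged coordinator address.
        coordinator = self.known.pop(msg.tx_id, None)
        # Without a logged address, the ReplyTo is the only way to answer.
        target = coordinator or msg.reply_to
        if target is None:
            return  # nowhere to send the response
        self.outbox.append((target, Message("Aborted", msg.tx_id)))
```

Delivering the same Rollback twice to a participant that knows the transaction sends the first Aborted to the logged coordinator address and the second to the ReplyTo address.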

Regards,
Ian Robinson
STSM, WebSphere Messaging and Transactions Architect
IBM Hursley Lab, UK
ian_robinson@uk.ibm.com


                                                                           
From: Alastair Green <alastair.green@choreology.com>
To: Ian Robinson/UK/IBM@IBMGB
cc: marchadr@wellsfargo.com, Mark Little <mark.little@jboss.com>,
    peter.furniss@choreology.com, ws-tx@lists.oasis-open.org
Date: 19/12/2005 15:55
Subject: Re: [ws-tx] Issue 016: WS-C: ReplaceParticipant




Ian,

You are introducing a very strong constraint on implementation and
deployment if you (implicitly) assume that EPRs must be externally stable.

What is sauce for the goose is sauce for the gander. If redirection in
long-running activity registration isn't really necessary, then why is
it necessary to provide a redirection mechanism for any of the
coordination protocol messages at all?

It is true that I can work out a way of avoiding EPR replacement if I
try hard enough. But we then lose in terms of implementation and
deployment simplicity. In our experience, some of the issues involved
with this in e.g. recovery situations can be quite fraught. What happens
when the stateless gateway crashes? Who repairs the link to the true
endpoint? On what stimulus? Do we make the gateway stateful? How do we
deal with volatile staling, etc. etc.?

Putting an address array in the hands of the counterpart can
considerably abridge that set of problems or the execution of the
message and code paths required to solve them.
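
As a rough sketch of what holding an address array could look like on the counterpart's side (the function name and the send callback are hypothetical, and the transport is assumed to signal an unreachable address by raising ConnectionError or returning False):

```python
from typing import Callable, Iterable, Optional


def send_with_fallback(
    addresses: Iterable[str],
    send: Callable[[str], bool],
) -> Optional[str]:
    """Try each known address in order; return the first that accepts."""
    for address in addresses:
        try:
            if send(address):
                return address
        except ConnectionError:
            # This address is unreachable; fall through to the next one.
            continue
    # No address in the array could be reached.
    return None
```

The caller decides the order, for example most recent address first, or the originally registered address first and then any later ones.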

We can end up putting secondary internal distributed protocols into
operation, when simple redirection will handle 99% of cases. We should
reserve EPR reclamation for the circumstances where it is unavoidable.
It is a cost, both in terms of set up and in terms of network message
duplication, and in terms of added fragility, which can often be walked
around. This is an area where I believe there is no one obvious answer,
and where implementation (and run-time) variation should be assumed.

I agree with the design principle behind the redirectable coordination
protocol messages (notifications with reply-to). I would like it to
apply, mutatis mutandis, to Register/RegisterResponse. That is a
consistent approach. It is permissive with respect to implementation
strategies.

Once again, with participant identification, address reswap via replay
of R/RR is not hard. R/RR become more like other retriable exchanges,
and that's a win.
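
To illustrate, here is a sketch of a retriable Register under the assumption that each participant carries an identifier (the participant_id parameter and the Coordinator class are hypothetical; no such identifier is defined in the current drafts):

```python
from dataclasses import dataclass, field


@dataclass
class Coordinator:
    # participant identifier -> most recently registered address
    participants: dict = field(default_factory=dict)

    def register(self, participant_id: str, address: str) -> str:
        """Idempotent Register: a replay overwrites the stored address."""
        self.participants[participant_id] = address
        return "RegisterResponse"
```

Replaying Register with the same identifier and a new address leaves a single registration that points at the new address, so the exchange can be retried like any other.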

I hesitate to say all roads lead to Rome, but plenty of them do.

Alastair


Ian Robinson wrote:
>
>
> Just to clarify one point:
> Alastair wrote:
> "if no coord protocol messages have flowed then a recovering participant
> can re-register, to communicate a new address. If any coord protocol
> messages have flowed then a recovering participant can replay, e.g.
> resend Prepared, to communicate a new address."
>
> In the case where coord protocols have started then the Participant
> should resend Prepared, as you say. In the case where coord protocols have
> not yet started but the participant has failed and had to be moved to a
> different address then it is reasonable (certainly for short-duration
> activities) for the activity as a whole to fail as a consequence of the
> original participant being unavailable to respond to protocol messages. I
> think the real question here is how to think about long-running
> activities, where failure is more likely before the activity
> completion-agreement protocol has started.
> If the participant has a "stable" EPR then the problem does not occur
> (certainly, there is then no need to "replace" it). But what does it mean
> for an EPR to be "stable" over a long period? It might be tempting to
> invent some new WS-Addressing terminology - e.g. a stable EPR is one
> whose address remains coherent throughout the lifetime of the service it
> references, etc - but I think this begins to stray beyond the scope of
> WS-Tx and we should not do this. But it is certainly possible to build
> interoperable WS-Addressing infrastructure which is fault tolerant. There
> are many ways to achieve this, for example by only exposing (in the
> wsa:Address) the logical address of the corporate gateway server that
> typically sits in front of a Participant. Such gateways can afford to be
> stateless and are typically highly available; such gateways are also part
> of the environment that created the exported EPR and can be considered to
> have knowledge of the structure of the exported EPR, including any
> ReferenceParameters. If the server, behind the gateway, that hosts the
> Participant state fails and the Participant is logically moved to another
> server then it should not be necessary to have to update the registration
> in the external Coordinator. The routing can be a detail of the
> WS-Addressing function in the gateway. Gateways may suffer outages too but
> they always come back on-line at the same address (if you want to stay in
> business :-)).
> My point is to illustrate that it is not a "requirement" to be able
> to specify a mechanism to replace EPRs - that is just one proposed
> solution to the requirement to be able to provide a fault tolerant
> solution for long-running activities.
>
> Regards,
> Ian Robinson
> STSM, WebSphere Messaging and Transactions Architect
> IBM Hursley Lab, UK
> ian_robinson@uk.ibm.com
>
>
>

> From: Alastair Green <alastair.green@choreology.com>
> To: Mark Little <mark.little@jboss.com>
> cc: marchadr@wellsfargo.com, Ian Robinson/UK/IBM@IBMGB,
>     peter.furniss@choreology.com, ws-tx@lists.oasis-open.org
> Date: 14/12/2005 18:37
> Subject: Re: [ws-tx] Issue 016: WS-C: ReplaceParticipant
>
>
>
>
> Ian, Dan, Mark
>
> I had missed the relatively clear statement in Section 9 of WS-AT that
> you point out. It's not half-hidden, though it is very terse.
>
> The notion that you MAY use a new one, but don't have to (e.g. you can
> build up a battery of primary, secondary etc addresses, and try to
> failback etc) seems right to me. Most likely you want to use the most
> recent, but there is nothing to stop you using the old one (or to use the
> old one first, and then to try the new one). I think the wording of the
> existing text is too restrictive in its implications -- it doesn't make it
> clear that this could be used to proactively redirect on recovery -- but
> it's correct normatively.
>
> I see no problem with this mechanism, so long as the retriable
> Register/RegisterResponse is available with exactly the characteristics
> of my revised proposed solution to 007.
>
> Assuming that solution, if no coord protocol messages have flowed then a
> recovering participant can re-register, to communicate a new address.
>
> If any coord protocol messages have flowed then a recovering participant
> can replay, e.g. resend Prepared, to communicate a new address.
>
> If the other end is communicable, it will respond. This will happen as
> fast and as effectively as a response to ReplaceParticipant -- not
> messier, but neater, because there is no special new message. The style of
> using replay of a real message to stimulate a recovery of the conversation
> is preferable in my mind to having special messages saying "I'm
> recovering, where are we up to?". (Cf. the question, why have a Replay
> message in WS-AT?)
>
> Finally, on the placement of this stuff (Coord versus the referencing
> specs).
>
> The existing issues Peter and I have raised include moving all the
> general statements about notifications and terminal notifications into
> WS-Coordination and out of WS-AT, and then having WS-AT and WS-BA
> reference them.
>
> This should include this section: i.e. we define that non-terminal
> notifications which contain a replyTo can be responded to by a subsequent
> message in the exchange using that EPR. This is incorporated by reference
> when WS-AT or WS-BA say: the following messages are notifications in the
> terms defined by WS-C, the following ones are terminal notifications in
> those terms.
>
> (If you wanted to harden this you could define base schema types which
> are notifications and terminal notifications in WS-Coordination, and
> define all coordination protocol messages as extensions of them in the
> referencing spec schemas. The pros and cons of XMLery of this kind are not
> my specialism, so I shall light that blue touchpaper and retire to a safe
> distance.)
>
> We could include a BTP-style Redirect, (i.e. the bilateral version of
> Mark's proposed ReplaceParticipant), which becomes feasible if you have
> participant and coordinator identification, but that seems heavy-handed.
> The beauties of the current scheme are that it is self-identifying
> because it uses the channel or link established by R/RR; that it requires
> no new message; that it optimizes network traffic, and that it is a
> no-change (other than perhaps minor editorial) resolution to this issue
> (assuming necessary change on R/RR as discussed under 007).
>
> Peter's point (that address replacement can lead to permanent loss of
> connectivity if both sides just move and leave no "forwarding" address)
> is very important: you want old addresses to be forwarding addresses, at
> least. But that is a warning to implementers, not an enforceable
> normative statement.
>
> In sum I think we should ponder the existing wording to see if there is
> anything normative that needs adding, and to consider whether the
> examples and recommendations section should be a bit wider, to better
> surface and explain the uses of this feature including the rationale
> behind Mark's issue.
>
> Alastair
>
> Mark Little wrote:
>       Yes, but that by itself does not help if the failure and recovery
>       occur before notification messages are exchanged. The replace
>       message may help in that case, except that if the coordinator
>       hasn't begun the coordination protocol, the response to replay may
>       be nothing, in which case we don't achieve much in the way of
>       failure resiliency. Of course, the recovered participant could
>       simply keep retrying replay until it triggered a response, as in
>       the example Ian outlined, but that seems messy and inefficient to
>       me.
>
>       Mark.
>
>
>       marchadr@wellsfargo.com wrote:
>
>             Looks like this is already mentioned a bit in the WS-AT spec:
>
>             "Notification messages are addressed by both coordinators and
>             participants using the Endpoint
>             References initially obtained during the
>             Register-RegisterResponse exchange. If a wsa:ReplyTo header
>             is present in a notification message it MAY be used by the
>             recipient, for example in cases where a Coordinator or
>             Participant has forgotten a transaction that is completed and
>             needs to respond to a resent
>             protocol message. Permanent loss of connectivity between a
>             coordinator and a participant in an in-doubt
>             state can result in data corruption."
>
>             - Dan
>
>             -----Original Message-----
>             From: Marchant, Dan R.
>             Sent: Wednesday, December 14, 2005 6:43 AM
>             To: ian_robinson@uk.ibm.com; alastair.green@choreology.com
>             Cc: peter.furniss@choreology.com; ws-tx@lists.oasis-open.org
>             Subject: RE: [ws-tx] Issue 016: WS-C: ReplaceParticipant
>
>
>             +1 for using the ReplyTo.
>
>             The replyTo could be an endpoint that virtualizes the
>             specific endpoints within the EPR, creating a cleaner
>             failover and recovery scenario.
>
>             My 2 cents,
>
>             Dan
>
>
>             -----Original Message-----
>             From: Ian Robinson [mailto:ian_robinson@uk.ibm.com]
>             Sent: Wednesday, December 14, 2005 6:15 AM
>             To: Alastair Green
>             Cc: Peter Furniss; ws-tx@lists.oasis-open.org
>             Subject: Re: [ws-tx] Issue 016: WS-C: ReplaceParticipant
>
>
>
>
>
>
>             As you say, section 9 of WS-AT deals with this situation. I
>             believe the text is already appropriately worded. Essentially,
>             the registered EPR is good until it isn't; if the registered
>             EPR becomes "stale" in some way then the ReplyTo EPR is the
>             means by which the EPR can be "refreshed". There is
>             deliberately no requirement to replace the registered EPR with
>             the ReplyTo EPR - this allows an implementation to log the
>             registered EPR and to continue to use it throughout the
>             transaction and across any failures.
>             The following sequence illustrates how EPR replacement is
>             supported:
>
>             Participant A registers EPR Pa.
>             Coordinator C1 sends Prepare to Pa and it responds Prepared.
>             Participant A's environment suffers a disastrous failure and
>             the participant is recovered at a different address.
>             C1 tries to send Commit to Pa but Pa is no longer addressable.
>             C1 retries the Commit.
>             Meanwhile Pa is recovered at Pa' and resends Prepared to C1
>             with Pa' as the ReplyTo MAP.
>             C1, having determined that Pa is not responding, replaces Pa
>             with Pa' and resends Commit (per the AT state table).
>             The transaction proceeds to a successful conclusion.
>
>
>             Regards,
>             Ian Robinson
>             STSM, WebSphere Messaging and Transactions Architect
>             IBM Hursley Lab, UK
>             ian_robinson@uk.ibm.com
>
>
>             From: Alastair Green <alastair.green@choreology.com>
>             To: Peter Furniss <peter.furniss@choreology.com>
>             cc: ws-tx@lists.oasis-open.org
>             Date: 13/12/2005 19:04
>             Subject: Re: [ws-tx] Issue 016: WS-C: ReplaceParticipant
>
>
>
>
>             Mark,
>
>             This is an interesting issue, and dovetails with a couple of
>             questions on the Register/RegisterResponse per se.
>
>             The first point is: we need to make it clear when you have to
>             stop retrying Register. You shouldn't send it if you've
>             received RegisterResponse.
>
>             If we make R/RR a standard one-way MEP, which I favour, then
>             we can use the notification/terminal notification nomenclature
>             to state this.
>
>             Then we come to your address replacement issue per se.
>
>             In BTP we ended up with a message, REDIRECT, which either the
>             Superior (Coordinator) or Inferior (Participant) could send to
>             the other, saying: this is entity Foo, please send my messages
>             to this new address. To do this one needs an identity, so one
>             can say: "I am Foo". If you have a Coordinator identifier and a
>             Participant identifier, then this is easy.
>
>             However, I think we already have this (bidirectional) feature
>             in the WS-AT and WS-BA protocols in another form, albeit
>             somewhat tucked away.
>
>             In Section 9 on use of WS-A Headers, it is stated that a
>             non-terminal notification has to have a reply-to address. I
>             presume (there is no statement on this, and that needs fixing,
>             for sure) that this field only makes sense if I am trying to
>             redirect subsequent traffic. In other words, I send a standard
>             message but qualify it with the added semantic: "I've moved".
>             If the receiver sees this, I assume they should overwrite the
>             old EPR they have, and continue as normal.
>
>             Such an address replacement means that redirection is
>             accomplished as a by-product of recovery-driven replay of
>             messages, or because the load balancer has done a reshuffle --
>             it doesn't really matter why.
>
>             This is neat, because it avoids having to communicate
>             identifiers for redirection (they are still needed for the
>             original register as per other discussions).
>
>             Therefore, I believe that this issue could be resolved by
>             supplementing and expanding the WS-Coord spec's statements on
>             MEPs, types of messages etc, with a statement that a
>             non-terminal notification reply-to should supplant the
>             previously held EPR for the next and subsequent messages in
>             the conversation, and we're done.
>
>             It is probably obvious, but I see no very good reason why
>             redirection (address replacement) should be limited to the
>             Participant end.
>
>
>             Alastair
>
>
>             Peter Furniss wrote:
>                  This is hereby identified as ws-tx issue 016
>
>                  Please follow up to this message or otherwise ensure your
>                  subject line starts "Issue 016 - "
>                               (after any Re:, [ws-tx] etc)
>
>
>                  Issue name -- WS-C: ReplaceParticipant
>
>                  Owner: Mark Little [mailto:mark.little@jboss.com]
>
>                  Target document and draft:
>
>                  Protocol:  Coord
>
>                  Artifact:  spec / schema
>
>                  Draft:
>
>                  Coord spec working draft uploaded 2005-12-02
>
>                  Link to the document referenced:
>
>
>
>                  http://www.oasis-open.org/committees/download.php/15738/WS-Coordination-2005-11-22.pdf
>
>
>                  Issue Type
>
>                  Design
>
>                  Issue Details
>
>                  In order to coordinate long running interactions, it is
>                  necessary to tolerate failures and recovery situations
>                  within the scope of an activity (long running activity).
>                  Once a participant is registered with a coordinator, the
>                  current specification implicitly mandates that recovery
>                  requires it to come back up on the same EPR in order that
>                  the coordinator can subsequently drive it through whatever
>                  protocol is used (e.g., 2PC). However, recovery on the same
>                  EPR cannot be guaranteed and is at best an implementation
>                  choice. Failure to recover on the same EPR will ultimately
>                  lead to more coordinated activities terminating in a
>                  failure state (e.g., aborting) because participants cannot
>                  be reached, even if they failed and recovered prior to the
>                  start of execution of the coordinator's protocol.
>
>                  Proposed Resolution:
>
>                  That we add a ReplaceParticipant operation that allows a
>                  registering service to instruct the coordinator service to
>                  replace one EPR with another EPR. Because EPRs are not
>                  currently comparable, a resolution of issue 7 or 14 is
>                  relevant to this issue.
>
References:
- Re: [ws-tx] Issue 016: WS-C: ReplaceParticipant
  - From: Alastair Green <alastair.green@choreology.com>