ws-rx message

Subject: RE: [ws-rx] i0019 - a formal proposal - take 2

From: Matthew Lovett <MLOVETT@uk.ibm.com>
To: "Stefan Batres" <stefanba@microsoft.com>, ws-rx@lists.oasis-open.org
Date: Thu, 8 Sep 2005 14:46:25 +0100

Hi Stefan,

The justification and scenarios for issue 19 seem clear to me, but to reiterate:

Issue 19 is about sequence termination on Fault. There are 2 cases:

Faults raised at the RMS:
For some reason the RMS has a fault. It could be a bad SequenceAcknowledgement message (sent from the RMD), it could be that the sequence is about to time out, it could be caused by the phase of the moon. The RMS has an interest in shutting down the sequence in a graceful manner. With the protocol as specified, all that the RMS can do is:
a) Send an AckRequested, wait for a SequenceAcknowledgement, assume that it is the accurate final ack state, and terminate the sequence
b) Assume that the last SequenceAcknowledgement it saw is the accurate final ack state, and terminate the sequence
(To terminate the sequence the RMS could reply with a SequenceTerminated fault, or could send a TerminateSequence message... though it is not clear if the RMS is allowed to send a terminate sequence for a sequence that contains gaps - that leads to issue 28)
All of these approaches are fragile. There is no way for the RMS to be sure of the final ack state.
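
To make the fragility concrete, here is a minimal Python sketch. It is not taken from the spec or from the proposal; the function name `unacked_messages` and the sample ack ranges are invented for illustration. It shows how an RMS that trusts the last SequenceAcknowledgement it happened to see (option b) can reach a different conclusion than one that knew the true final ack state.

```python
def unacked_messages(highest_sent, ack_ranges):
    """Return message numbers in 1..highest_sent not covered by any ack range.

    ack_ranges is a list of inclusive (lower, upper) pairs, modelled loosely
    on the ranges carried in a SequenceAcknowledgement.
    """
    acked = set()
    for lower, upper in ack_ranges:
        acked.update(range(lower, upper + 1))
    return [n for n in range(1, highest_sent + 1) if n not in acked]


# The last SequenceAcknowledgement the RMS happened to see (hypothetical data).
last_seen_ack = [(1, 3)]
# What the RMD actually held when the sequence ended (hypothetical data).
true_final_ack = [(1, 3), (5, 5)]

print(unacked_messages(5, last_seen_ack))   # [4, 5]
print(unacked_messages(5, true_final_ack))  # [4]
```

With the stale view the RMS believes message 5 was lost, even though the RMD received it.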

Faults raised at the RMD:
For some reason the RMD has a fault. It could be a bad message (sent from the RMS), it could be that the sequence is about to time out, it could be caused by the phase of the moon. The RMD has an interest in ensuring that the RMS can shut down the sequence in a graceful manner. With the protocol as specified, all that the RMD can do is unilaterally terminate the sequence - discarding the final ack state.
You could argue that when the RMD terminates the sequence it should send a SequenceTerminated fault to the RMS, and that the fault message should contain the final ack state. However, that is fragile... if that message does not make it to the RMS then the RMS cannot ask for it to be retransmitted, and any subsequent AckRequested message will fail as the RMD no longer maintains state for that sequence id.
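
The lost-fault case can be sketched as a toy model. This is an illustration under assumptions, not an implementation of the spec: the `Rmd` class, its method names and the dictionary-shaped replies are all hypothetical, and the `UnknownSequence` reply simply stands in for whatever fault the RMD returns once it holds no state for the sequence id.

```python
class Rmd:
    """Toy RMD that keeps per-sequence ack ranges (hypothetical model)."""

    def __init__(self):
        self.sequences = {}  # sequence id -> list of (lower, upper) ack ranges

    def terminate_on_fault(self, seq_id):
        # Unilateral termination: the state is discarded, so the final ack
        # state travels only inside this one fault message.
        final_ack = self.sequences.pop(seq_id)
        return {"fault": "SequenceTerminated", "final_ack": final_ack}

    def ack_requested(self, seq_id):
        # Once the state is gone the RMD can no longer answer.
        if seq_id not in self.sequences:
            return {"fault": "UnknownSequence"}
        return {"ack": self.sequences[seq_id]}


rmd = Rmd()
rmd.sequences["seq-1"] = [(1, 3), (5, 5)]

fault = rmd.terminate_on_fault("seq-1")
# Assume the fault message is lost in transit and never reaches the RMS.

# The RMS falls back to AckRequested, but the RMD has forgotten the sequence.
reply = rmd.ack_requested("seq-1")
print(reply)  # {'fault': 'UnknownSequence'}
```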

These issues are caused by the specification failing to define sequence shutdown protocol and semantics. Doug's proposal addresses this by defining a new state for sequences - a state where the sequence is closed but not yet terminated. Either end of the protocol can cause the sequence to enter this state (the RMS can send a CloseSequence message, the RMD may unilaterally close the sequence). Once in this state the RMS is free to request ack state, and the RMD is able to deliver it. In addition, the ack state that is exchanged for closed sequences is explicitly tagged as final, so the RMS can be sure of the final state of the sequence.
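
The closed-but-not-terminated state described above can be sketched as a small state machine. This is a rough Python illustration, not the wire protocol from Doug's proposal: the `Sequence` class, its method names and the `(numbers, final)` return shape are assumptions made for the example.

```python
from enum import Enum


class State(Enum):
    OPEN = "open"
    CLOSED = "closed"
    TERMINATED = "terminated"


class Sequence:
    """Toy RMD-side sequence with an intermediate closed state."""

    def __init__(self):
        self.state = State.OPEN
        self.received = set()

    def deliver(self, number):
        # New messages are only accepted while the sequence is open.
        if self.state is not State.OPEN:
            return False
        self.received.add(number)
        return True

    def close(self):
        # Triggered by a CloseSequence from the RMS, or by the RMD itself.
        if self.state is State.OPEN:
            self.state = State.CLOSED

    def acknowledgement(self):
        # Ack state stays available until termination; it is tagged as
        # final once the sequence is closed.
        if self.state is State.TERMINATED:
            raise LookupError("no state held for this sequence")
        return sorted(self.received), self.state is State.CLOSED

    def terminate(self):
        self.state = State.TERMINATED


seq = Sequence()
for n in (1, 2, 3, 5):
    seq.deliver(n)
seq.close()
late = seq.deliver(4)            # rejected: the sequence is closed
first = seq.acknowledgement()    # ([1, 2, 3, 5], True)
second = seq.acknowledgement()   # same answer if the first reply is lost
```

Because the state survives until `terminate`, the RMS can ask again if a reply goes missing, which is exactly what the unilateral-termination case cannot offer.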

I'm not going to reiterate the detail of the proposal - Doug's message #5 does a good job of that - but I hope this message provides the scenarios that the list has been asking for.

Thanks,

Matt
--
Matt Lovett, WebSphere MQ Development
Email: mlovett@uk.ibm.com
Phone: 248310 (internal), +44 (0)1962 818310 (external)

"Stefan Batres" <stefanba@microsoft.com> wrote on 06/09/2005 18:13:27:

> Doug,
>
> What I’m trying to do is to identify the set of use cases this feature attempts to address – you might disagree with the set I’ve identified and that is perfectly valid. It is our job though to motivate changes to the contributed specs. If you disagree with the way I’ve characterized the set of use cases for this feature then it would really help if you could write down for me how you characterize the use cases vs. the protocol as submitted. I hope you can take doing this seriously; I don’t think it is a good design process to add features to the protocol simply because we think they are helpful and refuse to do the leg work of 1) defining the characteristics of the use cases where the features are helpful, 2) comparing that against the contributed documents and 3) going through the exercise of identifying real world use cases that match said characteristics.
>
> --Stefan
>
>
> From: Doug Davis [mailto:dug@us.ibm.com]
> Sent: Monday, September 05, 2005 5:16 PM
> To: ws-rx@lists.oasis-open.org
> Subject: RE: [ws-rx] i0019 - a formal proposal - take 2
>
> Stefan,
> I disagree with the premise of your note. The use cases for this feature are not limited to the cases you've mentioned, nor are they limited to the cases I or anyone else has mentioned. So trying to fit all possible use cases into the scope you defined just doesn't fly for me. The reason behind why the RMS wants to get an accurate and final ack state could be just about anything - and as tempting as it is to rattle off yet another possible reason why this feature would be useful I'd prefer to not let the conversation get bogged down in an attempt to limit the scope of this feature. As I've mentioned, if as an implementor you don't think you'll ever need this _optional_ feature then don't send it.
> thanks
> -Doug
>
> "Stefan Batres" <stefanba@microsoft.com>
> 09/05/2005 07:30 PM
> To: Doug Davis/Raleigh/IBM@IBMUS, <ws-rx@lists.oasis-open.org>
> cc:
> Subject: RE: [ws-rx] i0019 - a formal proposal - take 2
>
> A quick correction to my comment below:
>
> Note that thus far, we’ve managed to describe exactly one scenario that fits the #2 description: [RMD] has separate state stores for session state and messages – the latter fails but the former is still operable.
>
> The scenario we’ve talked about is where the RMD uses separate state stores, not the RMS.
>
> --Stefan
>
> From: Stefan Batres [mailto:stefanba@microsoft.com]
> Sent: Thursday, September 01, 2005 10:40 AM
> To: Doug Davis; ws-rx@lists.oasis-open.org
> Subject: RE: [ws-rx] i0019 - a formal proposal - take 2
>
> Doug,
>
> I apologize if my rant below is a bit too cryptic, let me try again:
>
> 1. When a catastrophic failure occurs (e.g. RMD amnesia), an RMS has to react in some way; it could return an error to the user or it can engage in a recovery mechanism of some sort. I don’t believe you are trying to prescribe what the RMS’s reaction ought to be.
> 2. As you’ve said time and again, this proposal is about getting the RMS an accurate ack set in cases where:
>    1. A full ack set will never be possible (or at least not in a reasonable amount of time),
>    2. There are messages that have been sent and for which no ack has been received and
>    3. The problem that prevents a full ack set doesn’t prevent the exchange of protocol messages.
>
> The point I was trying to make is that given #1 above, #2 is an optimization for a case that will be relatively rare. Note that I don’t question for a second the correctness of your proposal – what concerns me is adding elements to the protocol for this specific case, #2, especially since apps will have to deal with #1 anyway.
>
> Note that thus far, we’ve managed to describe exactly one scenario that fits the #2 description: RMS has separate state stores for session state and messages – the latter fails but the former is still operable.
>
> --Stefan
>
> From: Doug Davis [mailto:dug@us.ibm.com]
> Sent: Wednesday, August 31, 2005 3:58 AM
> To: ws-rx@lists.oasis-open.org
> Subject: RE: [ws-rx] i0019 - a formal proposal - take 2
>
> I'm having a hard time following this. It sounds like you're saying because the proposal does not solve all RM related problems you don't want to have it in our 'bag of tricks' at all. Following that logic, why should we distinguish between SequenceTerminated Fault and any other Fault? We do it because we want to provide as much information back to the RMS as possible. What it uses this information for is up to it.
> As I've said many times before, this proposal does not suggest ANY recovery scheme. What I've done (outside of the proposal itself) is discuss how I _think_ an RMS might use this information in some error recovery mechanism but this proposal itself does not suggest one. This proposal simply provides a mechanism for the RMS to get an accurate accounting of the state of the sequence - that's it. How the RMS uses this information is up to it. If for nothing else it may choose to simply log the information - that alone is invaluable to someone trying to figure out what's going on. And I'm having a hard time understanding why providing an _optional_ mechanism that could aid in the RMS getting an accurate accounting of the state of the sequence (without having to call up the RMD's admin) is a bad thing.
> thanks,
> -Doug
>
> "Stefan Batres" <stefanba@microsoft.com>
> 08/31/2005 01:48 AM
> To: Doug Davis/Raleigh/IBM@IBMUS, <ws-rx@lists.oasis-open.org>
> cc:
> Subject: RE: [ws-rx] i0019 - a formal proposal - take 2
>
> Doug,
>
> You mention a specific situation: An RMD experiences a failure that prevents it from receiving application messages. I agree in so far as saying that in such a failure case this proposal could be helpful in that it helps the RMS to engage in recovery of some sort (either inform applications that a specific message was not sent or open a new sequence, assuming ordering is not important). But this is not the only failure case that applications will want to deal with (with or without help from the protocol).
> Consider the case where connectivity is lost for long enough for both sequences to expire or consider the case where the destination suffers a loss of session state. In such failure modes this solution is not helpful – yet applications will need a recovery strategy of some sort. It might be that it is application specific, or it might be that a general failure recovery specification is created and ratified at some point. The important idea is that the only way to deal with all failure modes is at a higher level. This proposal leverages the protocol to optimize recovery in specific circumstances that should be relatively rare. RM implementations should not be required to support failure mode recovery mechanisms that either don’t apply to them or that they choose to implement in a uniform way at a higher level.
>
> Thanks
>
> --Stefan
> >
> From: Doug Davis [mailto:dug@us.ibm.com]
> Sent: Tuesday, August 30, 2005 1:08 PM
> To: ws-rx@lists.oasis-open.org
> Subject: RE: [ws-rx] i0019 - a formal proposal - take 2
>
> Yet more comments. :-)
> -Doug
>
> "Stefan Batres" <stefanba@microsoft.com>
> 08/30/2005 03:35 PM
> To: Doug Davis/Raleigh/IBM@IBMUS, <ws-rx@lists.oasis-open.org>
> cc:
> Subject: RE: [ws-rx] i0019 - a formal proposal - take 2
>
> Doug,
>
> Some more comments and thoughts on your proposal:
>
> <dug>... When or why an RMS uses CloseSequence is up to it to decide. All we know is that it wants to shut things down and get an accurate ACK from the RMD.</dug>
>
> I still have not heard of a plausible reason why an RMS “wants to shut things down” and the current spec presents a problem. Comparing the spec as it stands today vs. the spec + this proposal:
>
> TODAY: RMS wants to end the sequence so it sends a LastMessage and must wait for a complete set of acks; this might require retransmitting messages. Once a full set of acks is received RMS sends TerminateSequence.
>
> TODAY + THIS PROPOSAL: RMS wants to end the sequence so it sends Close, waits for a CloseResponse, possibly retransmitting the Close. Once a CloseResponse is received RMS sends TerminateSequence.
>
> The problem with the TODAY scenario, as I’ve heard it in this forum, is that the RMS might have to wait unacceptably long between sending LastMessage and getting a full ack range. But if getting some messages or acks across proves difficult; why would the RMS expect that getting Close across would be any easier?
>
> <dug>
> 1 - I don't believe your text is accurate in that Close is supposed to be used in cases where the sequence needs to end due to something going wrong. You've described a case where the sequence is functioning just fine - and while Close can be used in those cases as well, it provides no additional value.
> 2 - Sending a Close and sending application data can have quite a different set of features executed so I don't think it's hard to imagine cases where RM messages can get processed just fine but application messages run into problems. I believe Chris mentioned on some call the notion of two different persistent stores - one for RM data and one for app-data. It's possible that the app-data one is running into problems.
> 3 - Using the CloseSequence operation is optional - if you feel that, as an RMS implementor, you'll never see its usefulness then you're free to never implement/send it. However, I'd hate to remove this option for those of us who do see value in it.
> </dug>
>
> <dug>The case that I keep thinking about is one where the RMD is actually a cluster of machines and when a sequence gets created it has an affinity to a certain server in the cluster - meaning it processes all of the messages for that sequence. If that server starts to have problems, and for some reason it just can't seem to process any new app messages then the RMS can close down the sequence and start up a new one. Hopefully, the new sequence will be directed to a different server in the cluster. </dug>
>
> There are two problems with this scenario and the proposed solution.
>
> 1. If an RMD has sequence-to-machine affinity that should be strictly the RMD’s decision and the RMD’s problem. The RMS is autonomous; this proposal puts expectations on the RMS’ behavior based on particularities of the RMD implementation. To be clear, I’ll note that affinity can be achieved in two ways:
>    i. By performing stateful routing at the RMD; basically the RMD has to remember every active sequence and what machine it has affinity to. In this case it would be simple to change the RMD’s routing table when a machine fails.
>    ii. By generating different EPRs for each machine. For affinity to function this way two things are necessary:
>       1. Some sort of endpoint resolution mechanism would have to be devised for the RMS to learn the EPR that it should target.
>       2. A mechanism for migrating that EPR.
>    Clearly 1) and 2) are outside the scope of the TC and, in my view, this proposal might be defining 2) in an informal way that is specific to WS-RM.
>
> 2. If the RMS somehow guesses that there is a problem on the EPR to which it is sending its messages and somehow decides that Closing the sequence and starting a new one is the right course of action, ordering guarantees are compromised.
>
> <dug> I probably didn't state the problem very well. I didn't intend to claim that the RMS knew about this affinity, but instead it knew that something was wrong with the current sequence and in order to try to fix the situation it decided to try another sequence. The affinity bit was thrown in there to explain why starting a new sequence _might_ fix the problem.
>
> I should also point out that while a lot of these discussions have focused on InOrder+ExactlyOnce DA, this feature is still useful in other DAs. For example, if the DA is just ExactlyOnce - having an accurate accounting of the ACKs allows a subsequent sequence to send just the gaps from the first, so getting an accurate list of the gaps becomes critical. And this of course leads us to the discussion of how to determine the DA in use - which I think might be part of issues 6, 9, 24 and 27.
> </dug>
>
> Finally, I agree with you that considering a gap-filling mechanism would be a good thing for this TC to do.
>
> --Stefan
