Stefan:
Gaps in sequences will happen for the very
same reason that "AtLeastOnce" DA makes provision for hopelessly-lost
messages ("... or else an error will be raised.") for whatever
reason that may be.
Of course you can always make sure that the
current state of a sequence is fully acknowledged before sending the next
message on it and close a sequence at the first sign of trouble. But the price
of this no-gap policy would then be very high in performance loss.
Trying to reduce the opportunities for
message loss in various ways is welcome, but at the end of the day RM must still
plan for the worst case (which it already does with its DA error cases: there
is only so much the protocol can do, and beyond that the best we can ask is to accurately
report on errors, so that DAs are best fulfilled even in that case.)
Jacques
From: Stefan Batres [mailto:stefanba@microsoft.com]
Sent: Wednesday, September 07, 2005 4:38 PM
To: Doug Davis; ws-rx@lists.oasis-open.org
Subject: RE: [ws-rx] i0019 - a formal proposal - take 2
Doug et al,
This, IMHO, is an example
of what I'm talking about. Is the very-large-message problem something
that this proposal is suited for? I don't think so (note that it
doesn't fit with "#1"); we're speculating that it might
be that an RMS wants to gracefully complete a sequence on which there are
messages that have not been acknowledged because transmission failed due to the
size of the messages. In my view, the problem in this case is not how to
complete a sequence with holes, but rather, dealing with very-large-messages.
For instance, a way to address that problem would be with a message
fragmentation protocol that would work on top of WS-RM. Another way might be to
fragment the messages at the app layer and reflect that in the contract. Yet
another way could be to limit the size of messages via policy. As with the use
case given before involving machine affinity and machine failure [1], there are
more appropriate ways to address this use case.
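As a rough illustration of the fragmentation option mentioned above, the sketch below splits a large payload into numbered fragments that could each be sent as an ordinary message and rebuilt at the receiver. The function names, the `(index, total, chunk)` tuple layout, and the size limit are assumptions made for this example; none of them comes from WS-RM or from any proposal in this thread.

```python
def fragment(payload: bytes, max_size: int):
    """Split payload into (index, total, chunk) tuples of at most max_size bytes.

    Hypothetical helper: the tuple layout is an assumption for illustration.
    """
    if max_size <= 0:
        raise ValueError("max_size must be positive")
    # An empty payload still produces one (empty) fragment.
    chunks = [payload[i:i + max_size] for i in range(0, len(payload), max_size)] or [b""]
    total = len(chunks)
    return [(index, total, chunk) for index, chunk in enumerate(chunks, start=1)]


def reassemble(fragments):
    """Rebuild the payload once every fragment has arrived, in any order."""
    if not fragments:
        raise ValueError("no fragments")
    total = fragments[0][1]
    by_index = {index: chunk for index, _, chunk in fragments}
    missing = [i for i in range(1, total + 1) if i not in by_index]
    if missing:
        raise ValueError(f"missing fragments: {missing}")
    return b"".join(by_index[i] for i in range(1, total + 1))
```

In this sketch a lost fragment is just a small message that can be retransmitted, rather than a very large message that fails as a whole.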
[1] http://www.oasis-open.org/apps/org/workgroup/ws-rx/email/archives/200508/msg00303.html
--Stefan
From: Doug Davis [mailto:dug@us.ibm.com]
Sent: Wednesday, September 07, 2005 5:50 AM
To: ws-rx@lists.oasis-open.org
Subject: RE: [ws-rx] i0019 - a formal proposal - take 2
Stefan,
The proposal attempts to address the two issues, i019 and i028. Perhaps
you're looking to close the issues as invalid? During the course of these
discussions several use cases have been mentioned as possible situations in
which the issues i019 and i028 will occur. If you think those
use cases fit into the "#1" you mentioned, and you believe that case
to be rare then ok. I don't see the situations i019 and i028 talk about
as being rare nor do I think your "#1" is the only case. I
believe people have mentioned cases much less catastrophic, such as extremely
large messages that just cannot be delivered due to some network issue (sadly
something I run into quite a bit), that would still warrant the need for this
solution. But, I don't see the need to iterate all of them since the
entire point of the spec is that networks are not reliable and problems will
occur. So running into one that prevents us from getting 100% guaranteed
complete delivery every time isn't hard for me to imagine. But that's
just me.
thanks,
-Doug
"Stefan Batres"
<stefanba@microsoft.com>
09/06/2005 01:13 PM
To: Doug Davis/Raleigh/IBM@IBMUS, <ws-rx@lists.oasis-open.org>
cc:
Subject: RE: [ws-rx] i0019 - a formal proposal - take 2
Doug,
What I'm trying to do is to identify the set of use cases this feature attempts to
address - you might disagree with the set I've identified and that
is perfectly valid. It is our job though to motivate changes to the contributed
specs. If you disagree with the way I've characterized the set of use
cases for this feature then it would really help if you could write down for me
how you characterize the use cases vs. the protocol as submitted. I hope you
can take this seriously; I don't think it is a good design process
to add features to the protocol simply because we think they are helpful and
refuse to do the legwork of 1) defining the characteristics of the use cases
where the features are helpful, 2) comparing those against the contributed
documents, and 3) going through the exercise of identifying real-world use cases
that match said characteristics.
--Stefan
From: Doug Davis
[mailto:dug@us.ibm.com]
Sent: Monday, September 05, 2005 5:16 PM
To: ws-rx@lists.oasis-open.org
Subject: RE: [ws-rx] i0019 - a formal proposal - take 2
Stefan,
I disagree with the premise of your note. The use cases for this
feature are not limited to the cases you've mentioned, nor are they limited to
the cases I or anyone else has mentioned. So trying to fit all possible
use cases into the scope you defined just doesn't fly for me. The reason
behind why the RMS wants to get an accurate and final ack state could be just
about anything - and as tempting as it is to rattle off yet another possible
reason why this feature would be useful, I'd prefer not to let the conversation
get bogged down in an attempt to limit the scope of this feature. As I've
mentioned, if as an implementor you don't think you'll ever need this
_optional_ feature then don't send it.
thanks
-Doug
"Stefan Batres"
<stefanba@microsoft.com>
09/05/2005 07:30 PM
To: Doug Davis/Raleigh/IBM@IBMUS, <ws-rx@lists.oasis-open.org>
cc:
Subject: RE: [ws-rx] i0019 - a formal proposal - take 2
A quick correction to my comment below:
Note that thus far, we've managed to describe exactly one scenario that
fits the #2 description: [RMD] has separate state stores for session state and
messages - the latter fails but the former is still operable.
The scenario we've talked about is where the RMD uses separate state
stores, not the RMS.
--Stefan
From: Stefan Batres [mailto:stefanba@microsoft.com]
Sent: Thursday, September 01, 2005 10:40 AM
To: Doug Davis; ws-rx@lists.oasis-open.org
Subject: RE: [ws-rx] i0019 - a formal proposal - take 2
Doug,
I apologize if my rant below is a bit too cryptic; let me try again:
1. When a catastrophic failure occurs (e.g. RMD amnesia), an RMS has to react
in some way: it could return an error to the user or it could engage in a recovery
mechanism of some sort. I don't believe you are trying to prescribe what
the RMS's reaction ought to be.
2. As you've said time and again, this proposal is about getting the RMS
an accurate ack set in cases where: (a) a full ack set will never be possible
(or at least not in a reasonable amount of time), (b) there are messages that
have been sent and for which no ack has been received, and (c) the problem that
prevents a full ack set doesn't prevent the exchange of protocol
messages.
The point I was trying to make is that given #1 above, #2 is an optimization
for a case that will be relatively rare. Note that I don't question for a
second the correctness of your proposal - what concerns me is adding
elements to the protocol for this specific case, #2, especially since apps will
have to deal with #1 anyway.
Note that thus far, we've managed to describe exactly one scenario that
fits the #2 description: RMS has separate state stores for session state and
messages - the latter fails but the former is still operable.
--Stefan
From: Doug Davis [mailto:dug@us.ibm.com]
Sent: Wednesday, August 31, 2005 3:58 AM
To: ws-rx@lists.oasis-open.org
Subject: RE: [ws-rx] i0019 - a formal proposal - take 2
I'm having a hard time following this. It sounds like you're saying
because the proposal does not solve all RM related problems you don't want to
have it in our 'bag of tricks' at all. Following that logic, why should
we distinguish between SequenceTerminated Fault and any other Fault? We
do it because we want to provide as much information back to the RMS as
possible. What it uses this information for is up to it.
As I've said many times before, this proposal does not suggest ANY recovery
scheme. What I've done (outside of the proposal itself) is discuss how I
_think_ an RMS might use this information in some error recovery mechanism but
this proposal itself does not suggest one. This proposal simply provides
a mechanism for the RMS to get an accurate accounting of the state of the sequence
- that's it. How the RMS uses this information is up to it. If for
nothing else it may choose to simply log the information - that alone is
invaluable to someone trying to figure out what's going on. And I'm
having a hard time understanding why providing an _optional_ mechanism that
could aid in the RMS getting an accurate accounting of the state of the
sequence (without having to call up the RMD's admin) is a bad thing.
thanks,
-Doug
"Stefan Batres"
<stefanba@microsoft.com>
08/31/2005 01:48 AM
To: Doug Davis/Raleigh/IBM@IBMUS, <ws-rx@lists.oasis-open.org>
cc:
Subject: RE: [ws-rx] i0019 - a formal proposal - take 2
Doug,
You mention a specific situation: An RMD experiences a failure that prevents it
from receiving application messages. I agree in so far as saying that in such a
failure case this proposal could be helpful in that it helps the RMS to engage
in recovery of some sort (either inform applications that a specific message
was not sent or open a new sequence, assuming ordering is not important). But
this is not the only failure case that applications will want to deal with
(with or without help from the protocol).
Consider the case where connectivity is lost for long enough for both sequences
to expire or consider the case where the destination suffers a loss of session
state. In such failure modes this solution is not helpful - yet
applications will need a recovery strategy of some sort. It might be that it is
application specific, or it might be that a general failure recovery
specification is created and ratified at some point. The important idea is that
the only way to deal with all failure modes is at a higher level. This proposal
leverages the protocol to optimize recovery in specific circumstances that
should be relatively rare. RM implementations should not be required to support
failure mode recovery mechanisms that either don't apply to them or that
they choose to implement in a uniform way at a higher level.
Thanks
--Stefan
From: Doug Davis [mailto:dug@us.ibm.com]
Sent: Tuesday, August 30, 2005 1:08 PM
To: ws-rx@lists.oasis-open.org
Subject: RE: [ws-rx] i0019 - a formal proposal - take 2
Yet more comments. :-)
-Doug
"Stefan Batres"
<stefanba@microsoft.com>
08/30/2005 03:35 PM
To: Doug Davis/Raleigh/IBM@IBMUS, <ws-rx@lists.oasis-open.org>
cc:
Subject: RE: [ws-rx] i0019 - a formal proposal - take 2
Doug,
Some more comments and thoughts on your proposal:
<dug>... When or why an RMS uses CloseSequence is up to it to decide.
All we know is that it wants to shut things down and get an accurate ACK from
the RMD.</dug>
I still have not heard of a plausible reason why an RMS "wants to shut
things down" and the current spec presents a problem. Comparing the spec
as it stands today vs. the spec + this proposal:
- TODAY: RMS wants to end the sequence so it
sends a LastMessage and must wait for a complete set of acks; this might
require retransmitting messages. Once a full set of acks is received RMS
sends TerminateSequence.
- TODAY + THIS PROPOSAL: RMS wants to end the
sequence so it sends Close, waits for a CloseResponse, possibly
retransmitting the Close. Once a CloseResponse is received RMS sends
TerminateSequence.
The problem with the TODAY scenario, as I've heard it in this forum, is
that the RMS might have to wait unacceptably long between sending LastMessage
and getting a full ack range. But if getting some messages or acks across
proves difficult, why would the RMS expect that getting Close across would be
any easier?
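The two flows compared above can be sketched as a toy simulation. It assumes the failure mode the proposal targets: application messages stop getting through while protocol messages still do. Whether that assumption holds in practice is exactly the question being raised here. `ToyRMD`, `end_today` and `end_with_close` are hypothetical names invented for this sketch and are not part of the spec or the proposal.

```python
class ToyRMD:
    """Hypothetical destination that only tracks received message numbers."""

    def __init__(self, app_path_ok=True):
        self.received = set()
        self.app_path_ok = app_path_ok   # False: application messages are lost
        self.closed = False

    def deliver(self, number):
        """Application message: accepted only while the app path works."""
        if self.app_path_ok and not self.closed:
            self.received.add(number)
        return set(self.received)        # ack state as seen by the RMS

    def close(self):
        """Protocol message: ASSUMED to get through; returns the final acks."""
        self.closed = True
        return set(self.received)


def end_today(rmd, sent, max_rounds=3):
    """TODAY: retransmit until every sent message is acked, then terminate."""
    acked = set()
    for _ in range(max_rounds):
        for number in sorted(sent - acked):
            acked = rmd.deliver(number)
        if acked == sent:
            return "TerminateSequence sent", acked
    return "still waiting for acks", acked


def end_with_close(rmd):
    """TODAY + THIS PROPOSAL: Close, take the final ack state, then terminate."""
    final_acks = rmd.close()
    return "TerminateSequence sent", final_acks
```

With messages 1, 2 and 4 received before the application path fails, `end_today` never reaches a full ack set for messages 1 to 5, while `end_with_close` ends the sequence knowing that 3 and 5 were not acknowledged.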
<dug> 1 - I don't believe your text is accurate in that Close is supposed
to be used in cases where the sequence needs to end due to something going
wrong. You've described a case where the sequence is functioning just
fine - and while Close can be used in those cases as well, it provides no
additional value. 2 - Sending a Close and sending application data can
have quite a different set of features executed so I don't think it's hard to
imagine cases where RM messages can get processed just fine but application
messages run into problems. I believe Chris mentioned on some call the
notion of two different persistent stores - one for RM data and one for
app-data. It's possible that the app-data one is running into problems.
3 - Using the CloseSequence operation is optional - if you feel that, as an
RMS implementor, you'll never see its usefulness then you're free to never
implement/send it. However, I'd hate to remove this option for those of us
who do see value in it. </dug>
<dug>The case that I keep thinking about is one where the RMD is actually
a cluster of machines and when a sequence gets created it has an affinity to a
certain server in the cluster - meaning it processes all of the messages for
that sequence. If that server starts to have problems, and for some reason it
just can't seem to process any new app messages then the RMS can close down the
sequence and start up a new one. Hopefully, the new sequence will be directed
to a different server in the cluster. </dug>
There are two problems with this scenario and the proposed solution.
1. If an RMD has sequence-to-machine affinity, that should be strictly the
   RMD's decision and the RMD's problem. The RMS is autonomous; this proposal
   puts expectations on the RMS' behavior based on particularities of the RMD
   implementation. To be clear, I'll note that affinity can be achieved in
   two ways:
   i. By performing stateful routing at the RMD; basically the RMD has to
      remember every active sequence and what machine it has affinity to. In
      this case it would be simple to change the RMD's routing table when a
      machine fails.
   ii. By generating different EPRs for each machine. For affinity to function
      this way two things are necessary:
      1. Some sort of endpoint resolution mechanism would have to be devised
         for the RMS to learn the EPR that it should target.
      2. A mechanism for migrating that EPR.
      Clearly 1) and 2) are outside the scope of the TC and, in my view, this
      proposal might be defining 2) in an informal way that is specific to
      WS-RM.
2. If the RMS somehow guesses that there is a problem on the EPR to which it
   is sending its messages and somehow decides that Closing the sequence and
   starting a new one is the right course of action, ordering guarantees
   are compromised.
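The stateful-routing option (i) above can be pictured with a minimal sketch: the RMD keeps a table from sequence identifier to machine and re-points entries when a machine fails, without the RMS being involved. `AffinityRouter`, its methods, and the assignment policy are assumptions made for this illustration only. How sequence state becomes available on the replacement machine is deliberately left out.

```python
class AffinityRouter:
    """Hypothetical RMD-side router mapping sequence IDs to machines."""

    def __init__(self, machines):
        self.machines = list(machines)
        self.routes = {}                 # sequence id -> machine name

    def route(self, sequence_id):
        """Return the machine for a sequence, assigning one on first use."""
        if sequence_id not in self.routes:
            if not self.machines:
                raise RuntimeError("no machines available")
            # Arbitrary assignment policy for the sketch: spread by table size.
            index = len(self.routes) % len(self.machines)
            self.routes[sequence_id] = self.machines[index]
        return self.routes[sequence_id]

    def fail(self, machine):
        """Drop a failed machine and re-point its sequences to a survivor."""
        self.machines.remove(machine)
        if not self.machines:
            raise RuntimeError("no machines left to take over")
        for sequence_id, current in list(self.routes.items()):
            if current == machine:
                self.routes[sequence_id] = self.machines[0]
```

The RMS keeps sending to the same endpoint throughout; only the RMD's internal table changes.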
<dug> I probably didn't state the problem very well. I didn't
intend to claim that the RMS knew about this affinity, but instead it knew that
something was wrong with the current sequence and in order to try to fix the
situation it decided to try another sequence. The affinity bit was thrown
in there to explain why starting a new sequence _might_ fix the problem.
I should also point out that while a lot of these discussions have focused on
InOrder+ExactlyOnce DA, this feature is still useful in other DAs. For
example, if the DA is just ExactlyOnce - having an accurate accounting of the
ACKs allows a subsequent sequence to send just the gaps from the first, so
getting an accurate list of the gaps becomes critical. And this of course
leads us to the discussion of how to determine the DA in use - which I think
might be part of issues 6, 9, 24 and 27.
</dug>
Finally, I agree with you that considering a gap-filling mechanism would be a
good thing for this TC to do.
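The gap accounting discussed above, where a follow-on sequence resends only the messages the first one never had acknowledged, can be sketched as follows. The sketch assumes the final ack state is available as inclusive (lower, upper) ranges, in the spirit of the acknowledgement ranges WS-RM carries, and that message numbers start at 1. `find_gaps` is a hypothetical helper name.

```python
def find_gaps(ack_ranges, highest_sent):
    """Return message numbers in 1..highest_sent not covered by any ack range.

    ack_ranges: iterable of inclusive (lower, upper) pairs (assumed layout).
    highest_sent: the highest message number the RMS sent on the sequence.
    """
    acked = set()
    for lower, upper in ack_ranges:
        acked.update(range(lower, upper + 1))
    return [n for n in range(1, highest_sent + 1) if n not in acked]
```

The returned list is what a new sequence would need to carry; any ordering guarantee across the two sequences is outside this sketch.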
--Stefan