Edwin,
Sounds like a sensible approach. I assume that the configuration
for this behaviour is on a "per operation" basis? Does this not
couple the process definition more tightly to the
deployment descriptor?
Cheers,
-Ron
Edwin Khodabakchian wrote:
Ron,
Our implementation can
be configured to either 1) resume and throw an exception on the
reply or 2) do nothing (useful when all operations are
idempotent and can be retried without side effects). The behavior is
configurable through the deployment descriptor.
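As a rough illustration of the two options, here is a minimal Python sketch. It is not our engine's code: the names `RecoveryPolicy`, `ReplyConnectionLost` and `recover_instance` are hypothetical, and it assumes the recovered instance is represented simply as the list of activity names that had not yet completed when the crash happened.

```python
from enum import Enum


class RecoveryPolicy(Enum):
    """Hypothetical per-deployment setting (names are illustrative only)."""
    FAULT_ON_REPLY = "fault-on-reply"  # option 1: resume, fault at the <reply>
    DO_NOTHING = "do-nothing"          # option 2: leave the instance alone


class ReplyConnectionLost(Exception):
    """Hypothetical fault: the <reply> has no open client connection."""


def recover_instance(pending_activities, policy):
    """Resume a crashed instance; return the activities that were executed.

    pending_activities: names of the activities not yet completed at crash
    time, in execution order (an assumption of this sketch).
    """
    if policy is RecoveryPolicy.DO_NOTHING:
        # Rely on the client retrying an idempotent operation.
        return []
    executed = []
    for activity in pending_activities:
        if activity == "rep":
            # The synchronous connection did not survive the crash.
            raise ReplyConnectionLost("no open connection for <reply>")
        executed.append(activity)
    return executed
```

With `FAULT_ON_REPLY`, the fault surfaces at the reply, where a fault handler in the process could react to it; with `DO_NOTHING`, nothing is re-executed.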
Edwin
Folks,
I was recently given an interesting question from one of my
development teams, and I thought it would be of interest to this group,
since it touches on universal implementation issues.
The question is based on the following scenario: given a process
something like this:
<sequence>
<receive name="rcv" ... />
<assign name="as1" ... />
<invoke name="inv" ... />
<assign name="as2" ... />
<reply name="rep" ... />
</sequence>
The <receive> and <reply> activities are part of a
request-response MEP, bound to SOAP, so that the request-response is
synchronous (uses the same connection for request and response).
Simple enough. But suppose that during execution of an instance of
the above process, somewhere after the <receive> activity is
completed but before the <reply> activity is done, the BPEL
engine suffers a crash. Since we have full state persistence,
recovery is simple enough. We can therefore finish creating the reply,
but this is rather useless, since the client connection is lost.
So what is the right thing to do under these circumstances? Should
the engine, upon recovery in this situation, fault the running
activity? Should it continue to the reply activity, and presumably
fault because the connection is closed?
What of the client program? It sees that the HTTP connection closed
while awaiting a response to the request. It might reasonably resend
the request (HTTP being what it is). If this is the expected behaviour,
might it not be appropriate for the BPEL engine offering the service
our client is using to "roll back" upon recovery, or otherwise
compensate the completed activities in the sequence (not shown in the
process above), back to the point of the <receive> activity, and
restart the receive?
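To make the "roll back to the receive" idea concrete, here is a minimal Python sketch. It assumes the engine has persisted the list of completed activities in completion order, and that each activity with side effects has a registered undo action. The function name `roll_back_to_receive` and the `compensators` mapping are hypothetical, not part of any BPEL engine.

```python
def roll_back_to_receive(completed, compensators, receive_name="rcv"):
    """Undo completed activities in reverse order, stopping at the receive.

    completed: activity names in completion order, starting with the receive.
    compensators: maps an activity name to a zero-argument undo callable;
    activities with no entry (e.g. plain assigns) are skipped.
    Returns the names that were actually compensated, in undo order.
    """
    undone = []
    for name in reversed(completed):
        if name == receive_name:
            break  # back at the receive; the instance can wait for a resend
        undo = compensators.get(name)
        if undo is not None:
            undo()
            undone.append(name)
    return undone


# Usage: the crash happened after "as2" completed but before "rep".
log = []
undone = roll_back_to_receive(
    ["rcv", "as1", "inv", "as2"],
    {"inv": lambda: log.append("compensated inv")},
)
```

Here only the invoke has a compensator, so only it is undone. Whether the invoked partner can be compensated at all is the hard part, and this sketch does not address it.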
I know that some of these complexities are the result of using
unreliable messaging, and you get what you pay for, right? On the other
hand, this illustrates some interesting states that a BPEL
implementation might have to deal with, which aren't discussed in the
specification. At the very least, we have some unspecified faults to
deal with -- presumably implementation specific.
So what are other implementers doing in this case? Generating a
fault of one sort or another, or performing more heroic efforts to
recover from the crash? I'm just interested in general approaches,
since we don't want to require NDAs here! My development team is busy
trying to create some recovery mechanisms for the scenario above, based
on some sort of client/server interaction (client retries being the
most likely sort). These guys are pretty clever, so I wouldn't doubt
that they could invent something that, in many cases, actually recovers
from the crash scenario above.
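One possible shape for a client-retry mechanism, sketched in Python. This is only a guess at the general approach, not what the team is building. It assumes the client attaches the same unique request id to every resend, and that the engine stores each finished reply under that id (in a real engine, alongside the persisted process state). `ReplyCache` and `handle` are hypothetical names.

```python
class ReplyCache:
    """Deduplicate resent requests by a client-supplied request id."""

    def __init__(self):
        # In-memory for the sketch; a real engine would persist this.
        self._replies = {}

    def handle(self, request_id, payload, process):
        """Return the stored reply for a resend, else run the process once."""
        if request_id in self._replies:
            return self._replies[request_id]
        reply = process(payload)
        self._replies[request_id] = reply
        return reply


# Usage: the second call simulates the client resending after a lost connection.
runs = []

def run_process(payload):
    runs.append(payload)
    return "reply to " + payload

cache = ReplyCache()
first = cache.handle("req-1", "order-42", run_process)
second = cache.handle("req-1", "order-42", run_process)
```

The process body runs once, and the resend gets the reply that the first connection never delivered. A request that crashed mid-flight (no stored reply yet) would still need one of the recovery policies discussed in this thread.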
Thoughts? Is anyone else concerned about crash recovery, perhaps
with different scenarios?
-Ron