Re: [wsbpel-implement] Fault tolerance considerations

Subject: Re: [wsbpel-implement] Fault tolerance considerations

From: Ron Ten-Hove <Ronald.Ten-Hove@Sun.COM>

To: edwink@collaxa.com

Date: Wed, 15 Oct 2003 07:52:44 -0700

Edwin,

Sounds like a sensible approach. I assume that the configuration for this behaviour is on a "per operation" basis? Does this not produce a higher amount of coupling between the process definition and the deployment descriptor?

Cheers,
-Ron

Edwin Khodabakchian wrote:

Ron,

Our implementation can be configured to either 1) resume and throw an exception on the reply, or 2) do nothing (useful when all operations are idempotent and can be retried without side effects). The behavior is configurable through the deployment descriptor.
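
The two options above can be sketched as a small recovery routine. This is only an illustration: the policy names, the `DESCRIPTOR` dictionary, and the operation names are hypothetical and do not reflect any real deployment descriptor format, and keying the setting per operation is an assumption made purely for the example.

```python
from enum import Enum


class RecoveryPolicy(Enum):
    FAULT_ON_REPLY = "faultOnReply"  # option 1: resume, then fault at the reply
    DO_NOTHING = "doNothing"         # option 2: do not resume; rely on client retry


class ReplyFault(Exception):
    """Raised when a recovered instance reaches a reply whose connection is gone."""


# Hypothetical per-operation settings, standing in for a deployment descriptor.
DESCRIPTOR = {
    "placeOrder": RecoveryPolicy.FAULT_ON_REPLY,
    "getQuote": RecoveryPolicy.DO_NOTHING,  # assumed idempotent, safe to retry
}


def recover(operation, remaining_activities, descriptor=DESCRIPTOR):
    """Resume a crashed instance according to the policy for its operation.

    Returns the activities executed during recovery, or raises ReplyFault
    when the policy says to fault at the reply.
    """
    policy = descriptor.get(operation, RecoveryPolicy.FAULT_ON_REPLY)
    if policy is RecoveryPolicy.DO_NOTHING:
        return []  # nothing is resumed
    executed = []
    for activity in remaining_activities:
        if activity == "reply":
            raise ReplyFault(f"{operation}: client connection lost before reply")
        executed.append(activity)
    return executed
```

With these assumed settings, `recover("getQuote", ["inv", "as2", "reply"])` resumes nothing, while the same call for `"placeOrder"` raises `ReplyFault` once it reaches the reply.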

Edwin
From: Ron Ten-Hove [mailto:Ronald.Ten-Hove@Sun.COM]
Sent: Tuesday, October 14, 2003 4:25 PM
To: bpel implementation
Subject: [wsbpel-implement] Fault tolerance considerations

Folks,

I was recently asked an interesting question by one of my development teams, and I thought it would be of interest to this group, since it touches on universal implementation issues.

The question is based on the following scenario: given a process something like this:
<sequence>
  <receive name="rcv" ... />
  <assign  name="as1" ... />
  <invoke  name="inv" ... />
  <assign  name="as2" ... />
  <reply   name="rep" ... />
</sequence>
  
The <receive> and <reply> activities are part of a request-response MEP, bound to SOAP, so that the request-response is synchronous (uses the same connection for request and response).

    Simple enough. But suppose that during execution of an instance of the above process, somewhere after the <receive> activity has completed but before the <reply> activity is done, the BPEL engine suffers a crash. Since we have full state persistence, recovery is simple enough. We can therefore finish creating the reply, but this is rather useless, since the client connection is lost.

    So what is the right thing to do under these circumstances? Should the engine, upon recovery in this situation, fault the running activity? Should it continue to the reply activity, and presumably fault because the connection is closed?

    What of the client program? It sees that the HTTP connection closed while awaiting a response to the request. It might reasonably resend the request (HTTP being what it is). If this is the expected behaviour, might it not be appropriate for the BPEL engine offering the service our client is using to, upon recovery, "roll back" or otherwise compensate the completed activities in the sequence (the compensation is not shown in the process above), back to the point of the <receive> activity, and restart the receive?
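
    The "roll back to the <receive>" idea can be sketched as undoing the completed activities in reverse order. This is a minimal illustration, not a statement of what any engine does: the function name, the activity names passed in, and the `compensators` mapping are all hypothetical.

```python
def roll_back_to_receive(completed, compensators):
    """Compensate completed activities in reverse order, back to the receive.

    completed: activity names in execution order, starting with the receive.
    compensators: maps an activity name to a zero-argument undo callable
                  (a hypothetical stand-in for compensation handlers).
    Returns the names that were compensated, in the order they were undone.
    """
    undone = []
    for name in reversed(completed[1:]):  # skip the receive itself
        undo = compensators.get(name)
        if undo is not None:  # activities with nothing to undo are skipped
            undo()
            undone.append(name)
    return undone
```

After the call, the instance would be waiting at the receive again, ready for the client's resent request.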

    I know that some of these complexities are the result of using unreliable messaging, and you get what you pay for, right? On the other hand, this illustrates some interesting states that a BPEL implementation might have to deal with, which aren't discussed in the specification. At the very least, we have some unspecified faults to deal with -- presumably implementation specific.

    So what are other implementers doing in this case? Generating a fault of one sort or another, or performing more heroic efforts to recover from the crash? I'm just interested in general approaches, since we don't want to require NDAs here! My development team is busy trying to create some recovery mechanisms for the scenario above, based on some sort of client/server interaction (client retries being the most likely sort). These guys are pretty clever, so I wouldn't doubt that they could invent something that, in many cases, actually recovers from the crash scenario above.
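
    One possible shape for a retry-based mechanism is sketched below. It assumes, purely for illustration, that the client attaches a request id and that the engine persists the finished reply under that id, so a retried request gets the stored reply instead of running the process again. The `Engine` class, `call_with_retry`, and the request id are hypothetical names, not part of BPEL or of any product.

```python
class ConnectionLost(Exception):
    """The client saw its connection close before a reply arrived."""


class Engine:
    """Toy engine that keeps replies keyed by a client-supplied request id."""

    def __init__(self):
        self.replies = {}  # stands in for persisted state that survives a crash
        self.runs = 0      # how many times the process body actually ran

    def handle(self, request_id, payload, crash_before_reply=False):
        if request_id not in self.replies:
            self.runs += 1
            self.replies[request_id] = f"processed:{payload}"  # placeholder body
        if crash_before_reply:
            # Models a crash followed by recovery: the reply was completed and
            # persisted, but the original connection is gone.
            raise ConnectionLost(request_id)
        return self.replies[request_id]


def call_with_retry(engine, request_id, payload, crash_first=True, attempts=3):
    """Client side: resend the same request id until a reply comes back."""
    for attempt in range(attempts):
        crash = crash_first and attempt == 0
        try:
            return engine.handle(request_id, payload, crash_before_reply=crash)
        except ConnectionLost:
            continue
    raise ConnectionLost(request_id)
```

In this sketch the first attempt loses its connection, the second returns the stored reply, and the process body runs only once.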

    Thoughts? Is anyone else concerned about crash recovery, perhaps with different scenarios?

-Ron