Re: [wsbpel-implement] Fault tolerance considerations

Well, I have the same problem, and you do not need a crash to see it. The problem is that BPEL prescribes receive-reply as the implementation of a synchronous WSDL operation, when in practice you cannot enforce that. You just need to add a wait of a week between the receive and the reply, and I’m sure you do not want to keep the connection open for that long.

I opened issue 17 (Asynchronous operations) a while back, but have not had time to pursue it. IMHO the receive / reply pair requires an asynchronous WSDL binding (one that does not require the connection to remain open). In theory, you could define such a binding, but nobody would be able to use it because, first, it is not WS-I compliant and, second, it does not fit most WSDL implementation frameworks.
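
To make "a binding that does not require the connection to remain open" a bit more concrete, here is a rough sketch in Python. It is purely illustrative: the class `AsyncService`, the `reply_to` callback and the correlation ID are all assumptions made up for this example, not part of BPEL, WSDL, or any existing binding. The request is accepted and acknowledged at once, and the reply travels later over a separate path.

```python
import uuid


class AsyncService:
    """Hypothetical service that decouples the reply from the request connection."""

    def __init__(self):
        # correlation_id -> (reply_to callback, request payload)
        self.pending = {}

    def receive(self, payload, reply_to):
        """Accept the request and return immediately; the connection may now close."""
        correlation_id = str(uuid.uuid4())
        self.pending[correlation_id] = (reply_to, payload)
        return correlation_id

    def reply(self, correlation_id):
        """Deliver the reply later, over a new path, matched by correlation ID."""
        reply_to, payload = self.pending.pop(correlation_id)
        reply_to(correlation_id, f"processed:{payload}")


# Usage: the client keeps only the correlation ID while it waits.
inbox = {}
service = AsyncService()
cid = service.receive("order", lambda c, r: inbox.__setitem__(c, r))
service.reply(cid)  # could happen a week later
```

The point of the sketch is only that nothing is held open between `receive` and `reply`; how the reply address would be expressed in a real binding is exactly the open question.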

It may be that WS-Routing provides a solution to this issue by allowing a reverse message path for the reply. However, I have not had time to study this alternative.

In any case, I’m also interested in seeing (reading) how others are tackling this implementation issue.

Regards,

Mike Marin

-----Original Message-----
From: Ron Ten-Hove [mailto:Ronald.Ten-Hove@Sun.COM]
Sent: Tuesday, October 14, 2003 4:25 PM
To: bpel implementation
Subject: [wsbpel-implement] Fault tolerance considerations

Folks,

I was recently given an interesting question from one of my development teams, and I thought it would be of interest to this group, since it touches on universal implementation issues.

The question is based on the following scenario: given a process something like this:

<sequence>
  <receive name="rcv" ... />
  <assign  name="as1" ... />
  <invoke  name="inv" ... />
  <assign  name="as2" ... />
  <reply   name="rep" ... />
</sequence>

The <receive> and <reply> activities are part of a request-response MEP, bound to SOAP, so that the request-response is synchronous (uses the same connection for request and response).

    Simple enough. But suppose that during execution of an instance of the above process, somewhere after the <receive> activity has completed but before the <reply> activity is done, the BPEL engine suffers a crash. Since we have full state persistence, recovery is simple enough. We can therefore finish creating the reply, but this is rather useless, since the client connection is lost.

    So what is the right thing to do under these circumstances? Should the engine, upon recovery in this situation, fault the running activity? Should it continue to the reply activity, and presumably fault because the connection is closed?

    What of the client program? It sees that the HTTP connection closed while awaiting a response to the request. It might reasonably resend the request (HTTP being what it is). If this is the expected behaviour, might it not be appropriate for the BPEL engine offering the service to, upon recovery, "roll back" or otherwise compensate the completed activities in the sequence (compensation is not shown in the process above), back to the point of the <receive> activity, and restart the receive?

    I know that some of these complexities are the result of using unreliable messaging, and you get what you pay for, right? On the other hand, this illustrates some interesting states that a BPEL implementation might have to deal with, which aren't discussed in the specification. At the very least, we have some unspecified faults to deal with -- presumably implementation specific.

    So what are other implementers doing in this case? Generating a fault of one sort or another, or performing more heroic efforts to recover from the crash? I'm just interested in general approaches, since we don't want to require NDAs here! My development team is busy trying to create some recovery mechanisms for the scenario above, based on some sort of client/server interaction (client retries being the most likely sort). These guys are pretty clever, so I wouldn't doubt that they could invent something that, in many cases, actually recovers from the crash scenario above.
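
One possible shape for the client-retry idea, as a minimal Python sketch. It assumes the client tags each request with a unique message ID and reuses it on resend; that ID, the `Engine` class, and the dict standing in for persisted state are all hypothetical and come from neither the BPEL specification nor the process above. The recovered engine finishes the instance, stores the reply under the message ID, and answers the retried request from the store instead of running the sequence again.

```python
class Engine:
    """Hypothetical engine that answers client retries from persisted replies."""

    def __init__(self):
        self.replies = {}      # message_id -> reply; stands in for durable storage
        self.invocations = 0   # how many times the sequence actually ran

    def run_instance(self, message_id, payload):
        """Stand-in for receive/assign/invoke/assign; persists the reply."""
        self.invocations += 1
        reply = payload.upper()            # placeholder for the real work
        self.replies[message_id] = reply   # persisted before sending
        return reply

    def on_request(self, message_id, payload, connection_alive=True):
        # A resend of a request we already completed: return the stored reply.
        if message_id in self.replies:
            return self.replies[message_id]
        reply = self.run_instance(message_id, payload)
        if not connection_alive:
            # The reply exists, but the original connection is gone.
            raise ConnectionError("client connection lost before reply")
        return reply


# Usage: the first attempt loses its connection, the retry gets the stored reply.
engine = Engine()
try:
    engine.on_request("msg-1", "order", connection_alive=False)
except ConnectionError:
    pass
retry_reply = engine.on_request("msg-1", "order")
```

In this sketch the <invoke> runs once even though the client sent the request twice, which avoids the need to compensate completed activities. It does nothing for a client that never retries, or one that retries with a fresh ID.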

    Thoughts? Is anyone else concerned about crash recovery, perhaps with different scenarios?

-Ron
