Subject: Failure and recovery
Like the other message this was written over the Atlantic and available in hard-copy at the Mt Laurel meeting but not discussed yet.

I've been looking at the Failure and recovery section of the original BEA input to see how much modification will be needed to fit with where we are now. This isn't a general reworking of that, rather some points to stir into the pot.

The distinction between "site" and communication/timeout failure is sound, but I don't quite agree with the communication/timeout distinction. We can assume the underlying "transport" protocol avoids delivering corrupted messages, but should assume that it will report communication failures when it gives up whatever internal recovery/reattempt mechanisms it has. We shouldn't consider our own (BTP-level) timeouts in this.

Note a remote site failure just appears as a communication failure at the other end - the difference is a site failure involves some restoration of local state from persistent information where it happened. Recovery from both communication and state failure involves some exchanges with the partner to re-establish aligned state.

I think we want to follow the common presumed-abort pattern, but also to allow business transactions to continue to progress to success over interruptions (including e.g. scheduled closure of a service - this would certainly be expected in a truly long-running transaction). Presume-abort is designed to minimise the occasions on which the implementations MUST log (persist transactional state information), whereas continuing the active phase implies that some aspects of that state were persisted earlier. I think we can cope with this by *requiring* only the presume-abort logging, but making the protocol do the right thing in restoring state at any phase, if both parties have retained the necessary information. If one hasn't persisted its state, then the recovery exchanges from the other side will discover this, and abort their side of the atom too.
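As a rough illustration of that last point, here is a minimal Python sketch of how a recovery exchange might settle one branch once each side has reported what it retained. None of this comes from the BTP drafts: the names `State`, `Outcome` and `resolve_recovery` are hypothetical, `None` stands for "no persisted state for this branch", and the reading that a participant with no record has already applied a confirm is an assumption drawn from the usual presume-abort pattern.

```python
from enum import Enum
from typing import Optional


class State(Enum):
    """Hypothetical recovered state of one side of a branch."""
    ACTIVE = "active"          # work in progress, no vote yet
    READY = "ready"            # participant only: ready vote was logged
    CONFIRMING = "confirming"  # coordinator only: confirm decision was logged


class Outcome(Enum):
    CONTINUE = "continue"            # both sides active: branch work resumes
    COMPLETE_VOTE = "complete-vote"  # recovery finishes the voting exchange
    CONFIRM = "confirm"
    CANCEL = "cancel"


def resolve_recovery(coordinator: Optional[State],
                     participant: Optional[State]) -> Outcome:
    """Decide how a branch proceeds after both sides exchange their state.

    None means that side retained no persisted state for the branch.
    """
    if coordinator is State.READY or participant is State.CONFIRMING:
        raise ValueError("state not valid for that role")
    if coordinator is State.CONFIRMING:
        if participant is State.ACTIVE:
            raise ValueError("confirm decided without a ready vote")
        # Participant is READY (confirm still to be applied) or has no
        # record (assumed: it applied the confirm and deleted its log).
        return Outcome.CONFIRM
    if coordinator is None or participant is None:
        # Presume-abort: a missing record on either side cancels the branch.
        return Outcome.CANCEL
    if participant is State.READY:
        # Coordinator was still collecting votes; accept the vote.
        return Outcome.COMPLETE_VOTE
    return Outcome.CONTINUE
```

For example, `resolve_recovery(State.ACTIVE, None)` gives `Outcome.CANCEL`, matching the case where one side has not persisted its state and the other aborts its side of the atom too.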
Required presume-abort logging is:

- participant eager-logs just before sending vote-ready
- participant eager-logs complete (= log delete) after applying a confirm and before replying to it
- participant eager-logs complete between receiving cancel and replying to it
- coordinator eager-logs just before sending confirm
- coordinator lazy-logs complete after receiving confirmed
- coordinator logs complete on deciding to cancel

All of those are for a single atom, or rather for a single branch (single coordinator-participant relationship), but obviously can be combined within the coordinator.

There is no need to *require* eager-logs that avoid re-asking the partner in the event of recovery - it's an implementation option whether it is better to do the log-write (always) to save the network exchange after recovery (rare, one assumes).

The differences from regular presume-abort behaviour are:

- either side can send that it is in the active state - if the partner is also active, the work of the branch can continue (this might require some kind of application resynchronization - we have NOT so far made any provision for this - but note that if your application can resynchronize and resume after interruption, there *must* be some way to resume BTP or it was futile to resynch the application)
- the voting exchange can be completed by the recovery (most/some presume-abort systems will force rollback if a recovery message arrives at the coordinator while it was collecting votes)

I'm hoping this can all be expressed in the state table, with some annotations, though I haven't got there yet.

Peter

------------------------------------------
Peter Furniss
Technical Director, Choreology Ltd
email: peter.furniss@choreology.com
phone: +44 20 7670 1679
direct: +44 20 7670 1783
mobile: 07951 536168
13 Austin Friars, London EC2N 2JX