Subject: redirection and recovery
As promised at the JavaOne f-t-f, this email addresses the concept of message redirection, primarily for recovery purposes. As I'll show, however, it can be used for other reasons as well. Rather than describe it in terms of specific message names and explicit parameters, I'll talk at a higher level of abstraction.

Let's first look at the case of failures. There are essentially two failure scenarios we need to consider:

(i) a participant failure.
(ii) a coordinator failure.

1: Participant Failures

When a participant fails is important to this discussion. If it fails prior to being prepared by the coordinator (but not necessarily by itself), then the coordinator need not record any information about it and will cancel. If the participant had prepared itself, through an autonomous vote for example, then it should have recorded sufficient information to eventually confirm/undo, and to recover (e.g., the coordinator reference). If it didn't prepare, then no recovery will occur. [Note from current heuristic discussions: if an autonomously prepared participant cancels, it needs to remember this decision until it can find out the true coordinator decision, so recovery in the event of a subsequent failure is still required.]

Assuming recovery (and that the coordinator/coordinator factory has not failed), when the participant recovers it will contact the coordinator to determine the status. If there is no record of the "transaction" then a cancel is presumed, and potentially a heuristic (contradiction in BTP terms) may have occurred. If there is a record of the transaction then either:

(i) the transaction is still inflight and the outcome is not known as yet.
(ii) the transaction has confirmed and the participant can confirm too.

If the participant has managed to recover on the same address as it had when it failed, then all outstanding references to it held by the coordinator will remain valid, and no further work need happen.
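The participant-side decision just described can be sketched as follows. This is only an illustration under stated assumptions: the names `TxStatus`, `RecoveryAction`, `recovery_action` and `is_contradiction` are hypothetical and are not BTP message names, and the contradiction check assumes that a contradiction means an autonomous decision that differs from the eventual outcome.

```python
from enum import Enum
from typing import Optional


class TxStatus(Enum):
    """What the coordinator reports for a transaction it still has a record of."""
    INFLIGHT = "inflight"
    CONFIRMED = "confirmed"


class RecoveryAction(Enum):
    """What a recovered participant should do next (hypothetical names)."""
    WAIT = "wait"        # outcome not known yet, ask again later
    CONFIRM = "confirm"  # coordinator confirmed, participant confirms too
    CANCEL = "cancel"    # no record at the coordinator: presumed cancel


def recovery_action(status: Optional[TxStatus]) -> RecoveryAction:
    """Map the coordinator's answer to an action; None means 'no record'."""
    if status is None:
        return RecoveryAction.CANCEL
    if status is TxStatus.CONFIRMED:
        return RecoveryAction.CONFIRM
    return RecoveryAction.WAIT


def is_contradiction(autonomous: Optional[RecoveryAction],
                     outcome: RecoveryAction) -> bool:
    """Assumption: a contradiction is an autonomous confirm/cancel that
    differs from the outcome learned from the coordinator."""
    if autonomous is None or outcome is RecoveryAction.WAIT:
        return False
    return autonomous is not outcome
```

For example, a participant that autonomously confirmed and then finds no record at the coordinator gets `recovery_action(None)`, a presumed cancel, and `is_contradiction` reports the mismatch.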
However, if it could not recover on the same location (e.g., the machine it was originally on no longer exists) then we potentially have two problems:

(a) when an inflight transaction tries to terminate, it has assembled an intentions list, which is the list of enlisted participants. When it tries to contact the participant on its old address it will no longer be able to, and may fail the transaction.
(b) a confirmed transaction has persisted its intentions list (essentially the transaction log now), and it then needs to send either cancel or confirm to each entry. Until all of the entries on the list have been contacted, the log cannot be garbage collected. If the participant recovers on a different location, gets the status, and commits itself as it should, the coordinator will not (without some help) know that one of the entries on its list has gone so that it can prune the list. Instead it will continue to try to deliver the response to the old participant address, and continually receive an indication that the delivery failed.

Therefore, for (a) and (b), the transaction must somehow be able to tie up the information representing a recovered participant with that representing the original participant in its intentions list, so that it can replace the original entry.

Let's assume that there is a redirection message that sends the old participant reference and the new participant reference to some end-point. The end-point may be:

(i) the coordinator, in which case it can replace the old reference with the new.
(ii) some external entity, e.g., a UDDI service. If the coordinator finds that it cannot contact a participant then it may go to this entity to determine the updated location. This is important when we consider the coordinator failure scenario later.

2: Coordinator Failure

When a coordinator fails, outstanding references to it held by participants (failed or not) may no longer be immediately usable.
A participant that wants to enquire as to the status of the transaction, e.g., because it has failed and recovered, or because it simply thinks too long a period has passed since it last had communication with the coordinator, will now find that it cannot talk to the coordinator. How this manifests itself will depend upon the infrastructure used to disseminate messages (e.g., if CORBA IIOP then a COMM_FAILURE exception will be thrown to the sender). Until (and unless) the coordinator recovers on the same reference, a participant cannot know the outcome of the transaction. If the coordinator eventually recovers on the same reference, then no further work is necessary.

However, just as with participant recovery, it may well be that the coordinator cannot recover on the same reference. Therefore, there is a requirement for coordinators to be able to recover on potentially different locations, and to redirect all participant messages to the new location. How does this occur?

Let's first assume that all participants either have not failed, or have recovered on the same location, i.e., all references to participants held by the coordinator remain valid. In that case, the recovered coordinator must send a redirection message to each participant, identifying the old coordinator reference and the new coordinator reference for them to use. Once obtained, a participant can use the new coordinator address to make enquiries and to (possibly) complete the BTP.

However, what if a participant fails and comes back on a different location? Then, as mentioned earlier, it is necessary to use some third-party location mechanism to break this bootstrap problem where coordinator and participant recover on different locations. Which third-party is used will depend upon a number of factors including the message infrastructure (e.g., CosNaming for CORBA, or UDDI for SOAP), the application semantics, security (e.g., which UDDI service) etc.
The third-party to use may well be implicitly encoded within the coordinator/participant address. Assuming the presence of this third party, a recovering entity (coordinator or participant) may publish its new address (and associate it with the old address) in this party. Likewise, an entity (coordinator or participant) that finds it has an invalid address for another entity may contact this party and ask for the up-to-date address.

Note, the third-party could actually be located at the original reference, and use a forwarding protocol for incoming messages meant for an old coordinator/participant. When the recovered coordinator/participant receives these messages, it may short-cut the responses, i.e., rather than sending them back through the receive path it may send the responses directly and include a redirection for the receiver to update its out-of-date references. Obviously this is only possible if the original location continues to exist (potentially forever).

What about participants that the coordinator doesn't know about (one shots)? The coordinator site fails and recovers elsewhere; the transaction has cancelled, and does not need to send cancels to participants (and in fact in this case it did not even know about the one-shot since, for example, the message went astray). However, a one-shot participant needs to determine whether it has caused a contradiction, but cannot find the original coordinator. The third-party approach helps to solve this problem.

3: Redirection for other reasons

Given that a redirection protocol exists, we can use it for reasons other than failure recovery. Suppose a coordinator or participant wants to migrate to another location (e.g., the coordinator starts off on a laptop and the battery starts to fail before the transaction can complete, therefore necessitating moving the coordinator); it can now inform interested parties via this redirection mechanism. Likewise, it may be possible to use redirection to aid load balancing.
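To make the two redirection end-points from sections 1 and 2 a little more concrete, here is a minimal sketch. It assumes references are plain strings and that the third party is a simple old-to-new address map. `RedirectionRegistry`, `publish`, `resolve` and `apply_redirection` are hypothetical names for illustration only; they are not BTP messages, nor a UDDI or CosNaming API.

```python
from typing import Dict, List


class RedirectionRegistry:
    """Stand-in for the third party: maps an old address to its replacement."""

    def __init__(self) -> None:
        self._forward: Dict[str, str] = {}

    def publish(self, old: str, new: str) -> None:
        """A recovered (or migrating) entity associates its new address with the old one."""
        if old == new:
            raise ValueError("old and new address are identical")
        self._forward[old] = new

    def resolve(self, address: str) -> str:
        """Follow redirections until an address with no further redirect is found."""
        seen = {address}
        while address in self._forward:
            address = self._forward[address]
            if address in seen:
                raise ValueError("redirection cycle detected")
            seen.add(address)
        return address


def apply_redirection(intentions: List[str], old: str, new: str) -> bool:
    """Coordinator side: replace the old participant reference in the
    intentions list in place. Returns True if an entry was replaced."""
    replaced = False
    for i, ref in enumerate(intentions):
        if ref == old:
            intentions[i] = new
            replaced = True
    return replaced
```

A coordinator that cannot deliver to a participant could call `resolve` on the stale reference and then `apply_redirection` on its intentions list. The same registry would serve a one-shot participant looking for a relocated coordinator, and a migrating entity that publishes its new address before moving.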
Note, cohesion coordinator and atom coordinator require redirection equally.

Mark.

----------------------------------------------
Dr. Mark Little (mark@arjuna.com)
Transactions Architect, HP Arjuna Labs
Phone +44 191 2064538
Fax +44 191 2064203