Coordinator interface

As promised, here is the proposed coordinator interface, which will allow interoperable coordinator web services to be provided. What we are after is a service that handles the creation and management of atom coordinators. Although there are several possible ways in which this can be accomplished, I'll propose just one, which I believe is the simplest and most intuitive:

(i) begin: creates and begins a new atom and returns its id. If the creator of the atom wants it to have a limited lifetime then a timeout parameter can be supplied: if the atom has not completed (got past prepare) when the timeout expires, it is automatically undone. This allows the service to manage its own resources (atoms), especially in cases of failure and potential denial-of-service attacks. I would find it hard to believe that any publicly available service would simply allow a user to create atoms that live forever: it would then be relatively straightforward to call the service many times deliberately in order to restrict access for other users and (eventually) use up all available resources. Suggestion: we have a General exception that can be returned if non-specific errors occur; this method may throw this exception.
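To make the timeout behaviour concrete, here is a minimal in-memory sketch in Python. Everything in it is illustrative rather than proposed: the class name Coordinator, the expire method, the id format, the injectable clock, and the choice to raise GeneralError for a non-positive timeout are all my assumptions, not part of the interface.

```python
import itertools
import time


class GeneralError(Exception):
    """Non-specific failure (the 'General exception' suggested in the text)."""


class Coordinator:
    """Hypothetical in-memory sketch, not a wire-level definition."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock              # injectable so expiry can be exercised in tests
        self._ids = itertools.count(1)
        self._atoms = {}                 # atom id -> {"state": ..., "deadline": ...}

    def begin(self, timeout=None):
        """Create a new active atom and return its id."""
        if timeout is not None and timeout <= 0:
            # Assumption: a meaningless timeout is reported as a general error.
            raise GeneralError("timeout must be positive")
        atom_id = f"atom-{next(self._ids)}"
        deadline = None if timeout is None else self._clock() + timeout
        self._atoms[atom_id] = {"state": "active", "deadline": deadline}
        return atom_id

    def expire(self):
        """Undo every atom that is still active past its deadline; return their ids."""
        now = self._clock()
        undone = []
        for atom_id, atom in self._atoms.items():
            overdue = atom["deadline"] is not None and now >= atom["deadline"]
            if atom["state"] == "active" and overdue:
                atom["state"] = "undone"
                undone.append(atom_id)
        return undone
```

A real service would presumably run something like expire periodically, or schedule one timer per atom; the sketch leaves that choice open.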

(ii) prepare: takes an atom id, attempts to prepare the atom, and returns the result. Depending on the outcome of the timeouts discussion, it is possible that prepare could result in a heuristic outcome, and so this method should throw the corresponding exception (TwoPhaseViolation). In addition, if the id refers to an atom that the service does not know about, the InvalidAtom exception is thrown.

(iii) confirm: takes an id of a previously prepared atom and attempts to confirm it. If the atom has not been previously prepared then the NotPrepared exception is thrown. If a heuristic occurs, then the TwoPhaseViolation exception is thrown. In addition, if the id refers to an atom that the service does not know about, the InvalidAtom exception is thrown.

(iv) undo: takes an atom id and attempts to undo it (the atom need not have been previously prepared). If a heuristic occurs, then the TwoPhaseViolation exception is thrown. In addition, if the id refers to an atom that the service does not know about, the InvalidAtom exception is thrown.
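The prepare, confirm and undo rules above can be sketched as a small state machine. This is only a sketch under stated assumptions: the AtomService class, the state names, the id format and the return values are hypothetical, and because the sketch has no participants it never produces a heuristic, so TwoPhaseViolation is declared but not raised. How undo should treat an atom that has already completed is not covered above; the sketch simply leaves such an atom unchanged.

```python
class InvalidAtom(Exception):
    """The service has no record of the supplied atom id."""


class NotPrepared(Exception):
    """confirm was called on an atom that is not in the prepared state."""


class TwoPhaseViolation(Exception):
    """Heuristic outcome; declared for completeness, never raised in this sketch."""


class AtomService:
    """Hypothetical in-memory sketch of the begin/prepare/confirm/undo rules."""

    def __init__(self):
        # atom id -> "active" | "prepared" | "confirmed" | "undone"
        self._state = {}

    def begin(self):
        atom_id = f"atom-{len(self._state) + 1}"
        self._state[atom_id] = "active"
        return atom_id

    def _lookup(self, atom_id):
        if atom_id not in self._state:
            raise InvalidAtom(atom_id)
        return self._state[atom_id]

    def prepare(self, atom_id):
        # Returns the resulting state; only an active atom actually changes.
        if self._lookup(atom_id) == "active":
            self._state[atom_id] = "prepared"
        return self._state[atom_id]

    def confirm(self, atom_id):
        # Assumption: anything other than "prepared" is reported as NotPrepared.
        if self._lookup(atom_id) != "prepared":
            raise NotPrepared(atom_id)
        self._state[atom_id] = "confirmed"
        return "confirmed"

    def undo(self, atom_id):
        # Allowed from active or prepared; completed atoms are left as they are.
        if self._lookup(atom_id) in ("active", "prepared"):
            self._state[atom_id] = "undone"
        return self._state[atom_id]
```

For example, calling confirm on a freshly begun atom raises NotPrepared, while begin, prepare, confirm in that order ends in the confirmed state.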

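(v) getStatus: takes an atom id and returns the current status of that atom. The values described here are:

StatusUnknown: the service cannot determine the status at this point. This should (hopefully) be a transient condition, and a subsequent call to getStatus should eventually result in a different outcome.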

StatusConfirmed: the atom has been confirmed. Since the service may discard an atom once it has completed, this value is not guaranteed to be returned forever. If an application wants to guarantee that it knows the outcome of an atom then it should enlist its own participant.

StatusUndone: the atom has been undone. Since the service may discard an atom once it has completed, this value is not guaranteed to be returned forever. If an application wants to guarantee that it knows the outcome of an atom then it should enlist its own participant.

StatusNoAtom: the service has no knowledge of the supplied atom. This may mean that it never existed, or that it has finished and been tidied up. If the caller was a previously enlisted participant then it can know that the atom was undone, since otherwise the service (atom) would either be active, or maintaining a durable log of participants it had not yet told to confirm (cf. a transaction's intentions list/log).
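As an illustration of how a caller might interpret these status values, here is a small Python sketch. The Status enum, the get_status helper, the records dictionary and outcome_for_participant are hypothetical names of mine; only the four status names come from the text, and the interface may well define further values.

```python
from enum import Enum


class Status(Enum):
    UNKNOWN = "StatusUnknown"
    CONFIRMED = "StatusConfirmed"
    UNDONE = "StatusUndone"
    NO_ATOM = "StatusNoAtom"


def get_status(records, atom_id):
    """Look up an atom; a discarded or never-created atom reads as NO_ATOM."""
    return records.get(atom_id, Status.NO_ATOM)


def outcome_for_participant(status):
    """What an enlisted participant that has not yet been told to confirm may conclude.

    Returns "confirmed", "undone", or None when it has to ask again later.
    """
    if status is Status.CONFIRMED:
        return "confirmed"
    if status in (Status.UNDONE, Status.NO_ATOM):
        # NO_ATOM implies undone here: a confirming atom would still be logged.
        return "undone"
    return None


records = {"atom-1": Status.CONFIRMED}
before = get_status(records, "atom-1")   # Status.CONFIRMED
del records["atom-1"]                    # the service tidies up the completed atom
after = get_status(records, "atom-1")    # Status.NO_ATOM
```

The last lines show why a plain caller cannot rely on seeing StatusConfirmed forever: once the record is discarded, the same id reads as StatusNoAtom.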

(vi) enlistParticipant: takes an atom id and a participant reference (URL), and enlists that participant in the specified atom. If the atom is no longer in the active phase then the Inactive exception is thrown. If the service has no record of the atom then the InvalidAtom exception is thrown.
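A minimal sketch of the two enlistParticipant checks follows. The dictionary layout of an atom, the function name and the example URL are assumptions made for illustration only.

```python
class InvalidAtom(Exception):
    """The service has no record of the supplied atom id."""


class Inactive(Exception):
    """The atom exists but is no longer in the active phase."""


def enlist_participant(atoms, atom_id, participant_url):
    """Enlist participant_url in the atom, or raise InvalidAtom / Inactive."""
    atom = atoms.get(atom_id)
    if atom is None:
        raise InvalidAtom(atom_id)
    if atom["state"] != "active":
        raise Inactive(atom_id)
    atom["participants"].append(participant_url)


# Hypothetical data: one active atom and one that has already been prepared.
atoms = {
    "atom-1": {"state": "active", "participants": []},
    "atom-2": {"state": "prepared", "participants": []},
}
enlist_participant(atoms, "atom-1", "http://shop.example/participant")
```

Enlisting in "atom-2" would raise Inactive, and any id missing from the dictionary would raise InvalidAtom.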

(vii) recover: takes an atom id, an old participant reference, and a new participant reference. This instructs the service to run recovery on the specified atom and to replace the specified old participant with the new one. This is used when a failed participant recovers at a different location, and it can be used to drive recovery more quickly than would otherwise be the case, i.e., drive it from the recovering participant. [This really depends upon what we think the failure scenarios are, but in our experience this is useful: even if we allow a participant to re-register itself in, say, UDDI, it requires the atom to periodically check that service to find the new location. "Periodically" may not be fast enough in some cases, and a hint from a recovered participant to drive recovery now would be desirable.]
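The replacement step of recover might look like the sketch below. The proposal does not say which errors recover reports, so the use of InvalidAtom for an unknown atom and GeneralError for an unknown old reference is my assumption, as are the data layout and the return value. "Running recovery" is reduced here to returning the atom state that the service would need to replay to the new location.

```python
class InvalidAtom(Exception):
    """The service has no record of the supplied atom id."""


class GeneralError(Exception):
    """Non-specific failure (the 'General exception' suggested in the text)."""


def recover(atoms, atom_id, old_ref, new_ref):
    """Swap old_ref for new_ref and report what should be replayed to new_ref."""
    atom = atoms.get(atom_id)
    if atom is None:
        raise InvalidAtom(atom_id)
    participants = atom["participants"]
    if old_ref not in participants:
        # Assumption: an unknown old reference is a non-specific error.
        raise GeneralError(f"{old_ref} is not enlisted in {atom_id}")
    participants[participants.index(old_ref)] = new_ref
    # A real service would now contact new_ref; the sketch only returns the state.
    return new_ref, atom["state"]


# Hypothetical data: a prepared atom whose first participant has moved.
atoms = {
    "atom-1": {
        "state": "prepared",
        "participants": ["http://old.example/p", "http://other.example/p"],
    }
}
result = recover(atoms, "atom-1", "http://old.example/p", "http://new.example/p")
```

Replacing the reference in place keeps the other participants untouched, which matches the idea that only the relocated participant drives this call.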
