Subject: Re: Need volunteer to draft definition of reliable messaging, was RE: reliable messaging - hop by hop
More rejoinders, MWS2

Regards,
Marty

*************************************************************************************
Martin W. Sachs
IBM T. J. Watson Research Center
P. O. B. 704
Yorktown Hts, NY 10598
914-784-7287; IBM tie line 863-7287
Notes address: Martin W Sachs/Watson/IBM
Internet address: mwsachs @ us.ibm.com
*************************************************************************************

Dan Weinreb <dlw@exceloncorp.com> on 09/03/2001 03:16:19 AM

Please respond to Dan Weinreb <dlw@exceloncorp.com>

To: Martin W Sachs/Watson/IBM@IBMUS
cc: david@drummondgroup.com, rberwanger@btrade.com, ebxml-msg@lists.oasis-open.org
Subject: Re: Need volunteer to draft definition of reliable messaging, was RE: reliable messaging - hop by hop

   Date: Mon, 03 Sep 2001 01:35:22 -0400
   From: Martin W Sachs <mwsachs@us.ibm.com>

   MWS: Disk head crashes are not necessarily insurmountable.

Yes, certainly. I was just suggesting that our model of the network should allow that a message might be delivered after an unbounded delay, and I was trying for a colorful illustration of what might cause a delay as long as several weeks. It could be a hardware failure in a network board rather than a disk; it really doesn't matter. Yes, you can use redundant hardware to make things more reliable, but I don't think we want to assume that the entire network has, in fact, done so.

MWS2: We need to state the assumptions that the implementation must follow to assure that the STATED goals of reliable messaging will be met. If the entire network doesn't meet the assumptions, then our stated goals will not be fulfilled. The weaker we make the goals, the less value there will be to reliable messaging.

   MWS: I said "unlikely", not "impossible".
   If you consider the length of the time window between persisting the message and sending the ACK and compare that to the total interval of time during which network partitions can occur, the fraction of network partitions that occur between persisting the message and sending the ACK is pretty small.

That's not exactly the relevant window. The problem arises when a network partition happens between (t1) the time when the message is read from the network by the To Party MSH and (t2) the time when the acknowledgement from the To Party MSH is received by the From Party MSH.

MWS2: If the partition happens between the time the To Party MSH reads the message and the time it sends the ACK, for a single hop, the To Party MSH will know that it couldn't send the ACK and (hopefully) retry later. For multihop, the partition may of course happen beyond the first hop. For the multihop case, I agree that the window is longer than I first thought.

("The problem" is when a message is actually received by the To Party but the From Party doesn't get any acknowledgement.)

(The problem also arises if the partition happens after the To Party receives the message but before it persists it.)

(The problem also arises if the To Party is able to transmit the acknowledgment into the network, but the network fails before the acknowledgment is read by the From Party.)

   ...Unless, of course, the network partition is correlated with persisting the message. I think we can safely ignore the correlation.

I agree with that.

   In any case, I agree with David Burdett's prior suggestion that we can require the delivery failure notification but recommend that, in case of delivery failure notification, the status of the message be requested out-of-band, because there are low-probability events that could cause delivery failure to be recognized although the message was delivered and persisted. This is far better than giving up on delivery failure notification because of possible pathological cases.
(I assume that when you say "delivery failure notification" here you are referring to the notification BY the From Party MSH TO the From Party Application, rather than any particular "DFN message" in the network.)

MWS2: Yes, that is what I was referring to.

Yes, we can say that in the "problem" case, where the From Party is left uncertain, there can be some out-of-band way to resolve what's going on, e.g. a phone call (or some other "network" than the one that's partitioned!) and then a way to manually tell the MSHs what was learned during the phone call. (This is a little bit like the concept of "heuristic commit" in the XA protocol for two-phase commits, where a participant in a distributed transaction can be in doubt about the outcome, and a manual override can take place.)

Adding a way to do out-of-band resolution of the uncertainty is fine. I'm just not comfortable with coming to conclusions based on the *absence* of acknowledgment messages, *without* the out-of-band resolution.

MWS2: I agree since I have been able to come up with a better way to eliminate the problem case.

- Dan
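
The cases discussed in this thread can be illustrated with a small simulation. This is only a sketch under stated assumptions: the names `PartitionAt`, `SenderView`, `attempt`, `send_with_retries`, and `resolve_out_of_band` are hypothetical and do not come from the ebXML Message Service specification, and the partition is assumed to outlast every retry. It shows that the From Party MSH sees the same thing (no acknowledgment) whether or not the To Party MSH persisted the message, so the absence of an acknowledgment alone cannot settle the outcome.

```python
from enum import Enum


class PartitionAt(Enum):
    """Hypothetical points at which the network partition strikes."""
    NONE = "no partition"
    BEFORE_RECEIVE = "before the To Party MSH reads the message"
    BEFORE_PERSIST = "after the read, before the message is persisted"
    BEFORE_ACK_ARRIVES = "after persisting, before the ack reaches the sender"


class SenderView(Enum):
    """What the From Party MSH can observe on its own."""
    ACKNOWLEDGED = "acknowledged"
    IN_DOUBT = "in doubt"


def attempt(partition):
    """Simulate one send. Returns (ack_received, persisted_at_receiver)."""
    if partition is PartitionAt.NONE:
        return True, True
    if partition is PartitionAt.BEFORE_ACK_ARRIVES:
        # The message was persisted, but the ack never arrives.
        return False, True
    # The message was lost before it could be persisted.
    return False, False


def send_with_retries(partition, retries=3):
    """Return the sender's view plus the receiver's true state.

    The second element is hidden from the sender in a real system;
    it is returned here only so the cases can be compared.
    Assumption: the partition lasts through every retry.
    """
    persisted = False
    for _ in range(retries + 1):
        ack, stored = attempt(partition)
        persisted = persisted or stored
        if ack:
            return SenderView.ACKNOWLEDGED, persisted
    return SenderView.IN_DOUBT, persisted


def resolve_out_of_band(view, receiver_says_persisted):
    """Settle an in-doubt message using what was learned out of band."""
    if view is SenderView.ACKNOWLEDGED or receiver_says_persisted:
        return "delivered"
    return "not delivered"


if __name__ == "__main__":
    for p in PartitionAt:
        view, persisted = send_with_retries(p)
        result = resolve_out_of_band(view, persisted)
        print(f"{p.name:20} sender sees {view.value:12} -> {result}")
```

In this sketch, `BEFORE_PERSIST` and `BEFORE_ACK_ARRIVES` both leave the sender `IN_DOUBT`, yet they resolve differently once the receiver's state is known, which is the reason for the out-of-band status request.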