Subject: Re: Need volunteer to draft definition of reliable messaging, was RE: reliable messaging - hop by hop
Date: Mon, 03 Sep 2001 01:35:22 -0400
From: Martin W Sachs <mwsachs@us.ibm.com>

MWS: Disk head crashes are not necessarily insurmountable.

Yes, certainly. I was just suggesting that our model of the network should allow that a message might be delivered after an unbounded delay, and I was trying for a colorful illustration of what might cause a delay as long as several weeks. It could be a hardware failure in a network board rather than a disk; it really doesn't matter. Yes, you can use redundant hardware to make things more reliable, but I don't think we want to assume that the entire network has, in fact, done so.

MWS: I said "unlikely", not "impossible". If you consider the length of the time window between persisting the message and sending the ACK and compare that to the total interval of time during which network partitions can occur, the fraction of network partitions that occur between persisting the message and sending the ACK is pretty small.

That's not exactly the relevant window. The problem arises when a network partition happens between (t1) the time when the message is read from the network by the To Party MSH and (t2) the time when the acknowledgement from the To Party MSH is received by the From Party MSH. ("The problem" is when a message is actually received by the To Party but the From Party doesn't get any acknowledgement.)

(The problem also arises if the partition happens after the To Party receives the message but before it persists it.)

(The problem also arises if the To Party is able to transmit the acknowledgment into the network, but the network fails before the acknowledgment is read by the From Party.)

...Unless, of course, the network partition is correlated with persisting the message. I think we can safely ignore the correlation.

I agree with that.
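
The window between t1 and t2 can be illustrated with a minimal sketch, assuming a simplified exchange in which a partition begins at exactly one point and never heals. The `Step` names and the functions `outcome` and `is_problem_case` are hypothetical and are not taken from any MSH specification.

```python
from enum import Enum


class Step(Enum):
    """Points in the exchange at which a partition may begin (hypothetical names)."""
    BEFORE_MESSAGE_READ = 0            # before t1: To Party MSH has not read the message
    AFTER_READ_BEFORE_PERSIST = 1      # after t1, message read but not yet persisted
    AFTER_PERSIST_BEFORE_ACK_SENT = 2  # the narrower window discussed in the quoted text
    ACK_IN_NETWORK = 3                 # ACK transmitted, not yet read by the From Party MSH
    AFTER_ACK_RECEIVED = 4             # after t2


def outcome(partition_at):
    """Return (message_read_by_to_party, ack_read_by_from_party)."""
    received = partition_at.value >= Step.AFTER_READ_BEFORE_PERSIST.value
    acked = partition_at is Step.AFTER_ACK_RECEIVED
    return received, acked


def is_problem_case(partition_at):
    """True when the To Party got the message but the From Party got no ACK."""
    received, acked = outcome(partition_at)
    return received and not acked


for step in Step:
    print(step.name, outcome(step), is_problem_case(step))
```

Under these assumptions, three of the five points fall in the problem window, not only the one between persisting the message and sending the ACK.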
In any case, I agree with David Burdett's prior suggestion that we can require the delivery failure notification but recommend that in case of delivery failure notification, the status of the message be requested out-of-band because there are low-probability events that might occur which could cause delivery failure to be recognized although the message was delivered and persisted. This is far better than giving up on delivery failure notification because of possible pathological cases.

(I assume that when you say "delivery failure notification" here you are referring to the notification BY the From Party MSH TO the From Party Application, rather than any particular "DFN message" in the network.)

Yes, we can say that in the "problem" case, where the From Party is left uncertain, there can be some out-of-band way to resolve what's going on, e.g. a phone call (or some other "network" than the one that's partitioned!) and then a way to manually tell the MSHs what was learned during the phone call. (This is a little bit like the concept of "heuristic commit" in the XA protocol for two-phase commits, where a participant in a distributed transaction can be in doubt about the outcome, and a manual override can take place.)

Adding a way to do out-of-band resolution of the uncertainty is fine. I'm just not comfortable with coming to conclusions based on the *absence* of acknowledgment messages, *without* the out-of-band resolution.

- Dan
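
One way to picture the out-of-band resolution is a From Party MSH that records a missing acknowledgment as "in doubt" rather than as a confirmed failure. The following is only a sketch under that assumption; the `Status` values and the `OutboundMessage` class are hypothetical and do not come from any MSH specification or from the XA protocol.

```python
from enum import Enum


class Status(Enum):
    SENT = "sent"
    ACKNOWLEDGED = "acknowledged"
    IN_DOUBT = "in doubt"  # no ACK arrived; delivery is unknown, not disproved
    RESOLVED_DELIVERED = "resolved: delivered"
    RESOLVED_NOT_DELIVERED = "resolved: not delivered"


class OutboundMessage:
    """Hypothetical record kept by the From Party MSH for one message."""

    def __init__(self, message_id):
        self.message_id = message_id
        self.status = Status.SENT

    def on_ack(self):
        # An ACK settles the question, even if it arrives very late.
        self.status = Status.ACKNOWLEDGED

    def on_retries_exhausted(self):
        # The absence of an ACK is not treated as proof of non-delivery.
        if self.status is Status.SENT:
            self.status = Status.IN_DOUBT

    def resolve_out_of_band(self, was_delivered):
        # Manual override, e.g. after a phone call to the other party.
        if self.status is not Status.IN_DOUBT:
            raise ValueError("only an in-doubt message can be resolved manually")
        self.status = (
            Status.RESOLVED_DELIVERED if was_delivered
            else Status.RESOLVED_NOT_DELIVERED
        )


msg = OutboundMessage("msg-1")
msg.on_retries_exhausted()     # the application would be notified at this point
msg.resolve_out_of_band(True)  # the operator learned the message did arrive
print(msg.status.value)
```

In this sketch the notification to the From Party Application would accompany the `IN_DOUBT` state, and only the manual step moves the message to a definite outcome.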