ebxml-msg message

Subject: Re: Need volunteer to draft definition of reliable messaging,wasRE:reliable messaging - hop by hop

From: Dan Weinreb <dlw@exceloncorp.com>
To: mwsachs@us.ibm.com
Date: Mon, 03 Sep 2001 03:16:19 -0400 (EDT)

   Date: Mon, 03 Sep 2001 01:35:22 -0400
   From: Martin W Sachs <mwsachs@us.ibm.com>

   MWS:  Disk head crashes are not necessarily insurmountable.  

Yes, certainly.  I was just suggesting that our model of the network
should allow that a message might be delivered after an unbounded
delay, and I was trying for a colorful illustration of what might
cause a delay as long as several weeks.  It could be a hardware
failure in a network board rather than a disk; it really doesn't
matter.  Yes, you can use redundant hardware to make things more
reliable, but I don't think we want to assume that the entire network
has, in fact, done so.

   MWS:  I said "unlikely", not "impossible".  If you consider the length
   of the time window between persisting the message and sending the ACK and
   compare that to the total interval of time during which network partitions
   can occur, the fraction of network partitions that occur between
   persisting the message and sending the ACK is pretty small.  

That's not exactly the relevant window.  The problem arises when a
network partition happens between (t1) the time when the message is
read from the network by the To Party MSH and (t2) the time when the
acknowledgement from the To Party MSH is received by the From Party
MSH.

("The problem" is when a message is actually received by the To Party
but the From Party doesn't get any acknowledgement.)  (The problem
also arises if the partition happens after the To Party receives the
message but before it persists it.)  (The problem also arises if the
To Party is able to transmit the acknowledgment into the network, but
the network fails before the acknowledgment is read by the From
Party.)
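
To make the cases above concrete, here is a rough sketch in Python. The
names (Partition, run) are made up for illustration and are not taken
from the ebXML spec; the model only tracks whether the To Party MSH read
the message and whether the From Party MSH ever saw an acknowledgment.

```python
from enum import Enum, auto

class Partition(Enum):
    """Hypothetical points at which the network might partition."""
    NONE = auto()                       # no partition at all
    BEFORE_READ = auto()                # message never reaches the To Party MSH
    AFTER_READ_BEFORE_PERSIST = auto()  # read from the network, not yet persisted
    AFTER_PERSIST_BEFORE_ACK = auto()   # persisted, ack not yet transmitted
    ACK_IN_FLIGHT = auto()              # ack transmitted, never read by From Party

def run(partition):
    """Return (received_by_to_party, ack_seen_by_from_party)."""
    received = partition is not Partition.BEFORE_READ
    ack_seen = partition is Partition.NONE
    return received, ack_seen

# Every case in which the From Party MSH sees no acknowledgment,
# mapped to whether the To Party actually received the message.
no_ack_cases = {p: run(p)[0] for p in Partition if not run(p)[1]}

for point, received in no_ack_cases.items():
    print(f"{point.name}: no ack seen, message received = {received}")
```

In this toy model the From Party observes the same thing (no ack) in
all four partition cases, yet the message was received in three of
them, so the missing ack by itself does not say which case occurred.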

								...Unless,
   of course, the network partition is correlated with persisting the
   message.  I think we can safely ignore the correlation.  

I agree with that.
							    In any case,
   I agree with David Burdett's prior suggestion that we can require
   the delivery failure notification but recommend that in case of
   delivery failure notification, the status of the message be requested
   out-of-band because there are low-probability events that might occur
   which could cause delivery failure to be recognized although the
   message was delivered and persisted. This is far better than giving
   up on delivery failure notification because of possible pathological
   cases.

(I assume that when you say "delivery failure notification" here you
are referring to the notification BY the From Party MSH TO the From
Party Application, rather than any particular "DFN message" in the
network.)  Yes, we can say that in the "problem" case, where the From
Party is left uncertain, there can be some out-of-band way to resolve
what's going on, e.g. a phone call (or some other "network" than the
one that's partitioned!) and then a way to manually tell the MSHs
what was learned during the phone call.

(This is a little bit like the concept of "heuristic commit" in the XA
protocol for two-phase commits, where a participant in a distributed
transaction can be in doubt about the outcome, and a manual override
can take place.)

Adding a way to do out-of-band resolution of the uncertainty is fine.
I'm just not comfortable with coming to conclusions based on the
*absence* of acknowledgment messages, *without* the out-of-band
resolution.
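
As a minimal sketch of what I have in mind, assuming a From Party MSH
that tracks one status per outbound message: the class, method, and
status names below are hypothetical, not part of any spec. The point is
only that running out of retries leads to "in doubt" rather than
"failed", and that a final answer comes from either a late
acknowledgment or a manual, out-of-band resolution.

```python
from enum import Enum

class Status(Enum):
    SENT = "sent"                    # waiting for an acknowledgment
    ACKED = "acked"                  # acknowledgment received
    IN_DOUBT = "in doubt"            # no ack; outcome unknown
    DELIVERED = "delivered"          # settled out of band: it arrived
    NOT_DELIVERED = "not delivered"  # settled out of band: it did not

class OutboundMessage:
    """Hypothetical sender-side record for one message."""

    def __init__(self, message_id):
        self.message_id = message_id
        self.status = Status.SENT

    def on_ack(self):
        # An ack may arrive after an unbounded delay, so a late ack
        # is still allowed to settle an in-doubt message.
        if self.status in (Status.SENT, Status.IN_DOUBT):
            self.status = Status.ACKED

    def on_retries_exhausted(self):
        # Absence of an ack is not treated as proof of non-delivery.
        if self.status is Status.SENT:
            self.status = Status.IN_DOUBT

    def resolve_out_of_band(self, delivered):
        # Record what an operator learned, e.g. from a phone call.
        if self.status is not Status.IN_DOUBT:
            raise ValueError("only in-doubt messages can be resolved manually")
        self.status = Status.DELIVERED if delivered else Status.NOT_DELIVERED

msg = OutboundMessage("msg-001")
msg.on_retries_exhausted()
print(msg.status.value)
msg.resolve_out_of_band(delivered=True)
print(msg.status.value)
```

Whether the From Party Application should be notified on entering the
in-doubt state, and in what words, is a separate question that this
sketch leaves open.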

- Dan
