ebxml-msg message

Subject: Re: Need volunteer to draft definition of reliable messaging,wasRE:reliable messaging - hop by hop
From: Martin W Sachs <mwsachs@us.ibm.com>
To: Dan Weinreb <dlw@exceloncorp.com>
Date: Mon, 03 Sep 2001 21:01:49 -0400

More rejoinders, MWS2

Regards,
Marty

*************************************************************************************

Martin W. Sachs
IBM T. J. Watson Research Center
P. O. B. 704
Yorktown Hts, NY 10598
914-784-7287;  IBM tie line 863-7287
Notes address:  Martin W Sachs/Watson/IBM
Internet address:  mwsachs @ us.ibm.com
*************************************************************************************



Dan Weinreb <dlw@exceloncorp.com> on 09/03/2001 03:16:19 AM

Please respond to Dan Weinreb <dlw@exceloncorp.com>

To:   Martin W Sachs/Watson/IBM@IBMUS
cc:   david@drummondgroup.com, rberwanger@btrade.com,
      ebxml-msg@lists.oasis-open.org
Subject:  Re: Need volunteer to draft definition of reliable messaging,was
      RE:reliable messaging - hop by hop



   Date: Mon, 03 Sep 2001 01:35:22 -0400
   From: Martin W Sachs <mwsachs@us.ibm.com>

   MWS:  Disk head crashes are not necessarily insurmountable.

Yes, certainly.  I was just suggesting that our model of the network
should allow that a message might be delivered after an unbounded
delay, and I was trying for a colorful illustration of what might
cause a delay as long as several weeks.  It could be a hardware
failure in a network board rather than a disk; it really doesn't
matter.  Yes, you can use redundant hardware to make things more
reliable, but I don't think we want to assume that the entire network
has, in fact, done so.

MWS2:  We need to state the assumptions that the implementation
must follow to assure that the STATED goals of reliable messaging will
be met.  If the entire network doesn't meet the assumptions, then our
stated goals will not be fulfilled.  The weaker we make the goals, the
less value there will be to reliable messaging.

   MWS:  I said "unlikely", not "impossible".  If you consider the length
   of the time window between persisting the message and sending the ACK
   and compare that to the total interval of time during which network
   partitions can occur, the fraction of network partitions that occur
   between persisting the message and sending the ACK is pretty small.

That's not exactly the relevant window.  The problem arises when a
network partition happens between (t1) the time when the message is
read from the network by the To Party MSH and (t2) the time when the
acknowledgement from the To Party MSH is received by the From Party
MSH.

MWS2:  If the partition happens between the time the To Party MSH
reads the message and the time it sends the ACK, then for a single hop
the To Party MSH will know that it couldn't send the ACK and (hopefully)
retry later.  For multihop, the partition may of course happen
beyond the first hop.  For the multihop case, I agree that the window
is longer than I first thought.

("The problem" is when a message is actually received by the To Party
but the From Party doesn't get any acknowledgement.)  (The problem
also arises if the partition happens after the To Party receives the
message but before it persists it.)  (The problem also arises if the
To Party is able to transmit the acknowledgment into the network, but
the network fails before the acknowledgment is read by the From
Party.)
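
A minimal sketch of the cases listed above, assuming (as the thread
implies) that the To Party MSH persists a message before acknowledging
it.  The function names are hypothetical and are not taken from the
ebXML Message Service specification.  It enumerates the possible
histories of one send attempt and groups them by what the From Party
MSH can observe:

```python
from itertools import product


def sender_observation(received, persisted, ack_arrived):
    """Return what the From Party MSH can see for one send attempt."""
    # Assumption: an ACK only exists if the message was received and persisted.
    if received and persisted and ack_arrived:
        return "ack"
    return "no ack"


def enumerate_outcomes():
    """Group every consistent history by the sender's observation."""
    outcomes = {}
    for received, persisted, ack_arrived in product([False, True], repeat=3):
        # Skip impossible histories: nothing is persisted unless it was
        # received, and no ACK arrives unless the message was persisted.
        if persisted and not received:
            continue
        if ack_arrived and not persisted:
            continue
        observation = sender_observation(received, persisted, ack_arrived)
        outcomes.setdefault(observation, []).append(
            (received, persisted, ack_arrived)
        )
    return outcomes


if __name__ == "__main__":
    for observation, histories in enumerate_outcomes().items():
        print(observation, histories)
```

Under these assumptions, three different histories (never received,
received but not persisted, persisted but the ACK was lost) all look
the same to the From Party MSH, which is why the absence of an ACK by
itself does not say whether the message was delivered.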

                                         ...Unless,
   of course, the network partition is correlated with persisting the
   message.  I think we can safely ignore the correlation.

I agree with that.
                                       In any case,
   I agree with David Burdett's prior suggestion that we can require
   the delivery failure notification but recommend that, in case of
   delivery failure notification, the status of the message be requested
   out of band, because low-probability events might occur that could
   cause delivery failure to be recognized although the message was
   delivered and persisted. This is far better than giving up on
   delivery failure notification because of possible pathological cases.

(I assume that when you say "delivery failure notification" here you
are referring to the notification BY the From Party MSH TO the From
Party Application, rather than any particular "DFN message" in the
network.)

MWS2:  Yes, that is what I was referring to.

Yes, we can say that in the "problem" case, where the From
Party is left uncertain, there can be some out-of-band way to resolve
what's going on, e.g. a phone call (or some other "network" than the
one that's partitioned!) and then a way to manually tell the MSHs
what was learned during the phone call.

(This is a little bit like the concept of "heuristic commit" in the XA
protocol for two-phase commits, where a participant in a distributed
transaction can be in doubt about the outcome, and a manual override
can take place.)

Adding a way to do out-of-band resolution of the uncertainty is fine.
I'm just not comfortable with coming to conclusions based on the
*absence* of acknowledgment messages, *without* the out-of-band
resolution.
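
One way to picture this position is a sender-side record that never
concludes "not delivered" from silence alone.  The sketch below is
only an illustration: the class, the state names, and the retry count
are all hypothetical and do not come from the ebXML specification.
When retries are exhausted the message becomes "in doubt", and only an
ACK or an explicit out-of-band answer moves it to a final state:

```python
from enum import Enum


class Status(Enum):
    SENT = "sent, awaiting ACK"
    ACKED = "acknowledged"
    IN_DOUBT = "in doubt"
    DELIVERED = "delivered (resolved out of band)"
    NOT_DELIVERED = "not delivered (resolved out of band)"


class OutboundMessage:
    """Hypothetical From Party MSH bookkeeping for one message."""

    def __init__(self, message_id, max_attempts=3):
        self.message_id = message_id
        self.max_attempts = max_attempts
        self.attempts = 1  # the initial send counts as the first attempt
        self.status = Status.SENT

    def on_ack(self):
        # An ACK settles the question, even if it arrives late.
        if self.status in (Status.SENT, Status.IN_DOUBT):
            self.status = Status.ACKED

    def on_timeout(self):
        # Silence leads to a retry or to IN_DOUBT, never to NOT_DELIVERED.
        if self.status is not Status.SENT:
            return
        if self.attempts < self.max_attempts:
            self.attempts += 1
        else:
            self.status = Status.IN_DOUBT

    def resolve_out_of_band(self, delivered):
        # Record what was learned by phone call or some other channel.
        if self.status is not Status.IN_DOUBT:
            raise ValueError("only an in-doubt message can be resolved")
        self.status = Status.DELIVERED if delivered else Status.NOT_DELIVERED
```

For example, with max_attempts=2, two timeouts leave the message in
doubt, and a later call to resolve_out_of_band(True) records that the
To Party did in fact receive it.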

MWS2:  I agree since I have been able to come up with a better way to
eliminate the problem case.

- Dan
