Subject: Re: T2 Retry with Delivery Receipt
Dan,

Very well said. I just want to point out that until the MSG team comes up with an upper-level abstract interface that permits back-to-back MSHs with no intervening function, back-to-back MSHs are simply impossible. There has to be some function in between them.

Regards,
Marty

*************************************************************************************
Martin W. Sachs
IBM T. J. Watson Research Center
P. O. B. 704
Yorktown Hts, NY 10598
914-784-7287; IBM tie line 863-7287
Notes address: Martin W Sachs/Watson/IBM
Internet address: mwsachs @ us.ibm.com
*************************************************************************************

Dan Weinreb <dlw@exceloncorp.com> on 09/14/2001 12:09:25 AM
Please respond to "Dan Weinreb" <dlw@exceloncorp.com>
To: david.burdett@commerceone.com
cc: Martin W Sachs/Watson/IBM@IBMUS, ebxml-msg@lists.oasis-open.org
Subject: Re: T2 Retry with Delivery Receipt

   Date: Thu, 13 Sep 2001 14:16:57 -0700
   From: "Burdett, David" <david.burdett@commerceone.com>

   You also cannot reasonably guarantee that the B2 MSH would NEVER lose
   data when it crashed.

This is why we have a formal failure model. Yes, in the real world, you just can't ever guarantee anything at all with 100% certainty. But when we talk about making something "reliable", we come up with a failure model, pretend that the real world conforms to the failure model, and do everything we can to make the real world actually behave like the failure model, to the point where failures other than the modelled failures are too rare to worry about. Then we design the system to be able to recover from the modelled failures.

So, for example, if a host has a transactional persistent memory, we assume that its transaction system works, and that any commit will be atomic, consistent, isolated, and durable. We do not worry about coming back up from a crash and finding an internally inconsistent state in the persistent store because half of the changes got committed and half didn't.
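As an aside, the atomic-commit property assumed above can be sketched in a few lines. This is only an illustrative sketch (the function name and JSON state format are mine, not anything from the spec): the write-then-rename idiom guarantees that a recovering process sees either the complete old state or the complete new state, never a mixture.

```python
import json
import os
import tempfile

def commit_state(path, state):
    """Atomically replace the persistent state file at `path`.

    A crash before os.replace() leaves the old state intact; a crash
    after it leaves the new state intact. At no point can a recovering
    process observe half of the changes committed and half not.
    """
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(path)))
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())   # force the bytes to stable storage
        os.replace(tmp, path)      # atomic rename on POSIX filesystems
    except BaseException:
        os.unlink(tmp)
        raise
```

Of course, this only rules out the half-committed-state failure; it does nothing against the loss of the storage medium itself, which is exactly the distinction drawn below.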
Sure, it *can* happen, but it's not part of our failure model, so we don't claim to be reliable in the face of it, and we assume that it doesn't happen, which will be fine as long as such failures are really very rare (see below).

   This in fact suggests a really nasty use case. Suppose:

   1. The B2 MSH forwards the message to APP2.
   2. The B2 MSH catches fire and as a result loses both its database and
      recovery log files, and so CANNOT recover the fact that it previously
      forwarded a message to APP2.

Well, if you like that one, how about this one: we have a From MSH talking directly to a To MSH, with no intermediaries at all. The message gets sent successfully, the To MSH persists it and commits, the To MSH sends an appropriate acknowledgement to the From MSH, and the From MSH reports success. But before the application can read the message, the To MSH catches fire and loses its database and recovery log files, and there aren't any backups, so the effect is exactly as if the message had never been delivered to the To MSH at all.

The answer is that "MSH X catches fire and suffers an irrecoverable total media failure" is not part of our failure model. (Nor is catastrophic Byzantine CPU failure, as I've been pointing out.) We do not claim to be reliable in the face of that. (Nor does anybody else who is trying to do anything like what we're doing!) A message-passing system of the kind we're talking about needs to have a persistent transactional store that it can depend on.

So there are two possible answers:

(1) Catching fire with total irrecoverable media failure is so rare that we don't care about it. (And please don't tell me that there's no degree of rareness so small that we don't care about it. Clearly there is some such limit. Take the probability that you will be hit by an H-bomb sometime during your life, divide by one million, and certainly we don't have to worry about a hazard with that probability.)
(2) Serious users make this failure unlikely, by using redundant disks, sufficient backups with offsite storage, and so on.
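The persist-then-acknowledge discipline described above can be sketched as follows. This is a hypothetical illustration (the class and method names are mine, not from the ebXML MS spec, and a dict stands in for the durable transactional store): the To MSH commits the message to its store *before* acknowledging, so a crash at any point either loses the message (no ack was sent, so the sender retries) or preserves it (and a duplicate retry is detected by message id and simply re-acknowledged rather than re-delivered).

```python
class ToMSH:
    """Receiving MSH sketch: persist before acknowledging."""

    def __init__(self):
        # Stands in for a durable, transactional persistent store.
        # (A real MSH would use something like commit_state above.)
        self.store = {}

    def receive(self, msg_id, payload):
        if msg_id in self.store:
            # Duplicate retry from the sender: our earlier ack was lost.
            # Re-acknowledge, but do NOT deliver the message twice.
            return "ack"
        self.store[msg_id] = payload   # persist and commit first...
        return "ack"                   # ...only then acknowledge
```

The ordering is the whole point: if the crash happens before the commit, no ack was ever sent and the From MSH's retry recovers the message; if it happens after, the message is already durable. What the sketch cannot defend against is the store itself burning down, which is precisely why that failure has to be excluded from the model or made rare by answer (2).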