Subject: Re: T2 Retry with Delivery Receipt
Dan,

Very well said. I just want to point out that until the MSG team comes up with an upper-level abstract interface that permits back-to-back MSHs with no intervening function, back-to-back MSHs are simply impossible. There has to be some function in between them.

Regards,
Marty

*************************************************************************************
Martin W. Sachs
IBM T. J. Watson Research Center
P. O. B. 704
Yorktown Hts, NY 10598
914-784-7287; IBM tie line 863-7287
Notes address: Martin W Sachs/Watson/IBM
Internet address: mwsachs @ us.ibm.com
*************************************************************************************

Dan Weinreb <dlw@exceloncorp.com> on 09/14/2001 12:09:25 AM
Please respond to "Dan Weinreb" <dlw@exceloncorp.com>
To: david.burdett@commerceone.com
cc: Martin W Sachs/Watson/IBM@IBMUS, ebxml-msg@lists.oasis-open.org
Subject: Re: T2 Retry with Delivery Receipt

   Date: Thu, 13 Sep 2001 14:16:57 -0700
   From: "Burdett, David" <david.burdett@commerceone.com>

   You also cannot reasonably guarantee that the B2 MSH would NEVER lose
   data when it crashed.

This is why we have a formal failure model. Yes, in the real world, you just can't ever guarantee anything at all with 100% certainty. But when we talk about making something "reliable", we come up with a failure model, pretend that the real world conforms to the failure model, and do everything we can to make the real world actually behave like the failure model, to the point where failures other than the modelled failures are too rare to worry about. Then we design the system to be able to recover from the modelled failures.

So, for example, if a host has a transactional persistent memory, we assume that its transaction system works, and that any commit will be atomic, consistent, isolated, and durable. We do not worry about coming back up from a crash and finding an internally inconsistent state in the persistent store because half of the changes got committed and half didn't.
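As an aside, the atomic-commit property assumed above can be sketched in a few lines. This is only an illustrative sketch (the function name and JSON state format are mine, not anything from the spec): the write-then-rename idiom guarantees that a recovering process sees either the complete old state or the complete new state, never a mixture.

```python
import json
import os
import tempfile

def commit_state(path, state):
    """Atomically replace the persistent state file at `path`.

    A crash before os.replace() leaves the old state intact; a crash
    after it leaves the new state intact. At no point can a recovering
    process observe half of the changes committed and half not.
    """
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(path)))
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())   # force the bytes to stable storage
        os.replace(tmp, path)      # atomic rename on POSIX filesystems
    except BaseException:
        os.unlink(tmp)
        raise
```

Of course, this only rules out the half-committed-state failure; it does nothing against the loss of the storage medium itself, which is exactly the distinction drawn below.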
Sure, it *can* happen, but it's not part of our failure model, so we don't claim to be reliable in the face of it, and we assume that it doesn't happen, which will be fine as long as such failures are really very rare (see below).

   This in fact suggests a really nasty use case. Suppose:

   1. The B2 MSH forwards the message to APP2.
   2. The B2 MSH catches fire and as a result loses both its database and
      recovery log files, and so CANNOT recover the fact that it previously
      forwarded a message to APP2.

Well, if you like that one, how about this one: we have a From MSH talking directly to a To MSH, with no intermediaries at all. The message gets sent successfully, the To MSH persists it and commits, the To MSH sends an appropriate acknowledgement to the From MSH, and the From MSH reports success. But before the application can read the message, the To MSH catches fire and loses its database and recovery log files, and there aren't any backups, so the effect is exactly as if the message had never been delivered to the To MSH at all.

The answer is that "MSH X catches fire and suffers an irrecoverable total media failure" is not part of our failure model. (Nor is catastrophic Byzantine CPU failure, as I've been pointing out.) We do not claim to be reliable in the face of that. (Nor does anybody else who is trying to do anything like what we're doing!) A message-passing system of the kind we're talking about needs to have a persistent transactional store that it can depend on.

So there are two possible answers:

(1) Catching fire with total irrecoverable media failure is so rare that we don't care about it. (And please don't tell me that there's no degree of rareness so small that we don't care about it. Clearly there is some such limit. Take the probability that you will be hit by an H-bomb sometime during your life, divide by one million, and certainly we don't have to worry about a hazard with that probability.)
(2) Serious users make this failure unlikely, by using redundant disks, sufficient backups with offsite storage, and so on.
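The persist-then-acknowledge discipline described above can be sketched as follows. This is a hypothetical illustration (the class and method names are mine, not from the ebXML MS spec, and a dict stands in for the durable transactional store): the To MSH commits the message to its store *before* acknowledging, so a crash at any point either loses the message (no ack was sent, so the sender retries) or preserves it (and a duplicate retry is detected by message id and simply re-acknowledged rather than re-delivered).

```python
class ToMSH:
    """Receiving MSH sketch: persist before acknowledging."""

    def __init__(self):
        # Stands in for a durable, transactional persistent store.
        # (A real MSH would use something like commit_state above.)
        self.store = {}

    def receive(self, msg_id, payload):
        if msg_id in self.store:
            # Duplicate retry from the sender: our earlier ack was lost.
            # Re-acknowledge, but do NOT deliver the message twice.
            return "ack"
        self.store[msg_id] = payload   # persist and commit first...
        return "ack"                   # ...only then acknowledge
```

The ordering is the whole point: if the crash happens before the commit, no ack was ever sent and the From MSH's retry recovers the message; if it happens after, the message is already durable. What the sketch cannot defend against is the store itself burning down, which is precisely why that failure has to be excluded from the model or made rare by answer (2).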