Subject: Re: T2 Retry with Delivery Receipt
We have gotten pretty far afield from what reliable messaging is for. Reliable messaging assures the From application that the message got to its destination and was persisted. More importantly, it assures that if the message did not get to its destination, the From application knows about the failure.

Reliable messaging should not in any way, shape, or form concern itself with content. Even if the MSH performs some content services (e.g. signing), those services are on behalf of the application and are outside the scope of the MS spec and reliable messaging. Example: failure to validate a signature is NOT a reliable-messaging function. The RM protocol should indicate that the message got where it was supposed to go and was persisted. Some other mechanism should tell the From application that the message failed validation. That mechanism is probably an error message defined in the collaboration protocol (e.g. a BPSS instance document). An RM retry won't accomplish anything for this one. Correcting the certificate, or whatever is needed, must result in sending a new message with a new message ID, not an RM retry.

Regards,
Marty

*************************************************************************************
Martin W. Sachs
IBM T. J. Watson Research Center
P. O. B. 704
Yorktown Hts, NY 10598
914-784-7287; IBM tie line 863-7287
Notes address: Martin W Sachs/Watson/IBM
Internet address: mwsachs @ us.ibm.com
*************************************************************************************

Dan Weinreb <dlw@exceloncorp.com> on 09/19/2001 10:16:12 AM
Please respond to Dan Weinreb <dlw@exceloncorp.com>
To: david@drummondgroup.com
cc: Chris.Ferris@sun.com, ebxml-msg@lists.oasis-open.org
Subject: Re: T2 Retry with Delivery Receipt

Date: Tue, 18 Sep 2001 15:34:58 -0500
From: David Fischer <david@drummondgroup.com>

<rhetoric-mode>
"I don't want to" is not a valid reason. "It's too complicated" is almost as bad (how hard is it to concatenate two strings?).
We can allow retries, Chris just doesn't want to. Why? The reason is "It wouldn't do any good". If the reason the message didn't get through is that the (unreliable) transport layer dropped it, the regular ("hop-to-hop") retry mechanism exists to deal with that problem. There is no need to impose a second retry mechanism on top of the first one; or, if there is, then there is also a need for a third and fourth layer, and so on.

You said:

<df>retries do not guarantee success and never will. The question is what to do when those failures occur.</df>

But what are you saying we should do? You seem to be saying that we should retry some more.
</rhetoric-mode>

OK, OK, you're not really saying that. And I don't really believe that retries don't do any good under any scenario. I think the case for end-to-end retry should be made by clearly stating the scenarios where end-to-end retry adds value that hop-to-hop retry does not.

Let's consider why retrying the *same* message (same message ID, same digital signature, same contents; just as you say, everything the same except certain fields that are specific to the hop-to-hop layer of communication) *ever* does *any* good. If it failed the first time, why won't it just keep on failing and failing? I can see two categories of reason:

(1) There are *random* *transient* failures that happen often enough to worry about. Simply trying again has a good chance of succeeding.

(2) Something in the external environment changes before the retry. I think that's what you had in mind when you said "it might be manual" and "It might be now or after a fix."

The "unreliable IM" is an example of (1) that isn't handled by hop-to-hop retry and would be handled by automatic, right-now end-to-end retry. It's still not clear that a convincing use case for this has been presented. What are the scenarios in which (2) provides the justification for the retry?
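Category (1) is what a bounded automatic retry loop addresses. A minimal sketch of the idea (the function and exception names are illustrative, not from any spec): the *same* message object is resubmitted unchanged, with backoff between attempts, and the failure is surfaced to the From application only after the attempts are exhausted.

```python
import random
import time

class TransientError(Exception):
    """A failure the sender believes may not recur, e.g. the
    unreliable transport dropped the message."""

def send_with_retry(send, message, retries=3, base_delay=0.01):
    """Retry the *same* message (same message ID, same signature,
    same contents) a bounded number of times.  This only helps for
    category (1): random transient failures, where simply trying
    again has a good chance of succeeding."""
    for attempt in range(retries + 1):
        try:
            return send(message)
        except TransientError:
            if attempt == retries:
                raise  # give up; the From application learns of the failure
            # Exponential backoff with jitter between attempts.
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))
```

Note that nothing in this loop can help with category (2): if the failure is deterministic (expired certificate, wrong URL), every attempt fails the same way.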
David F, you presented some "example use cases", but some of them aren't what we need as scenarios, because they are effects rather than causes, e.g. "a DFN sent" or "an Error Message sent". What I think of as a "scenario" has to explain why they were sent: what actually went wrong in the first place? So let me try some scenarios. I think scenarios break down into two categories: those in which the From party gets some kind of negative reply, and those in which the From party times out.

Suppose I send a purchase order to Staples and digitally sign it with a private key, and in the ds:KeyInfo I send a certificate with the corresponding public key, but unfortunately this certificate expired a few days ago. The To Party sees that the certificate has expired, so the digital signature is no good, so it rejects the message. Automatic retries are clearly pointless. The From people could transmit a new certificate out-of-band to the To people and tell them to force their MSH to use the new certificate on the existing message, but this seems implausible for various reasons. Or the From side could obtain a new certificate and then send the message with the new certificate. But then it's not the same message, as defined above. Should it have the same message ID? (I don't have an answer to this.)

Suppose Staples changes its address. I sent a purchase order to Staples, and the CPA says to use HTTP to www.staples.com, and upon trying that I get an HTTP 404 (no such URL), or even a DNS error ("there's no such host name as www.staples.com"). Automatic retries do no good. But if administrators at the From host install a new CPA, then retrying the exact same message could succeed.

Suppose Staples's MSH machine has run out of disk space and rejects the incoming message. Automatic retries could solve this, by simply retrying until ordinary work frees up disk space, or the administrators at Staples add a new disk.
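The certificate scenario turns on the distinction drawn above: an RM retry resends the message unchanged, while fixing the problem produces a different message. A toy sketch (all names hypothetical; whether the corrected message should keep the old message ID is left open in the mail above, and this sketch assumes a new one purely for illustration):

```python
import uuid

def rm_retry(message):
    # An RM retry resends the message unchanged: same message ID,
    # same signature, same contents.
    return message

def corrected_resend(old_message, new_certificate):
    # After fixing the problem (e.g. obtaining a fresh certificate),
    # the From side sends what is, by the definition above, a *new*
    # message; here it gets a new message ID (an assumption, not a
    # position the spec has taken).
    return {"message_id": str(uuid.uuid4()),
            "payload": old_message["payload"],
            "certificate": new_certificate}
```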
On the other hand, the hop-to-hop retry mechanism could do that just as well. But this brings up the question of when retries ought to time out. You could say that knowing when to really give up sometimes requires manual intervention or special knowledge; no simple time-interval value in a CPA can substitute for intelligent ways of deciding how long to retry. You might posit that a retry mechanism operating at the end-to-end level is better positioned to allow this kind of intelligence to be brought to bear than a hop-to-hop retry mechanism.

Related scenario: Staples installs a new release of its MSH software, the new release has bugs that cause it to wrongly reject messages; we retry after Staples goes back to the old release or installs a fix. Similarly, an administrator at Staples messes with the configuration settings so that our messages are wrongly rejected, etc.

(The From MSH might have some kind of fancy features allowing administrators finer control over retry. There might be commands like "stop retransmitting this message but keep it in the MSH so that we can commence retransmitting later". None of this would be part of the normative protocol specification. David F, I get the impression that you have in mind something like this.)

Then there are timeout scenarios, e.g. what you called "lack of DR". Chris said "If the DR is sent reliably, then its absence is significant cause for concern." I agree, but we still have to figure out how to react if a DR does not appear after a "reasonable timeout". What scenarios might produce this? Actually, we don't really need a "scenario" as such. Reliable messaging still allows for the possibility that the sender still (after any given time interval) does not know whether the message has actually been delivered yet. So a DR can take longer than any "reasonable timeout" even if there has been no failure.
If the From side wants to learn whether the message was ever received, it can either just keep waiting, or it can send a message, which might be exactly the same as the original message, or might be a Message Status Request.

You mentioned "XML text corruption in transit". If we are really concerned about data corruption that's not caught by the TCP checksum, then we really need to add an error-correcting code as part of our own protocol. If we don't add one, then we're clearly operating under the assumption that the transport layer can be trusted never to deliver corrupted data. (Our failure model for the transport layer is that it's "unreliable" in the sense that it can drop messages, but it always detects data corruption and discards such messages, so it never delivers us corrupted bits.)

-- Dan
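That failure model can be made concrete with a small sketch: attach an integrity digest to the payload on the way out, and on the way in, discard any frame whose digest doesn't match, so corruption is converted into a drop and the retry machinery handles it. (The framing below is invented for illustration and is not part of any real ebXML header; a digest detects corruption but, unlike a true error-correcting code, cannot repair it.)

```python
import hashlib

def wrap(payload: bytes) -> bytes:
    """Prefix the payload with a SHA-256 digest (illustrative framing)."""
    return hashlib.sha256(payload).digest() + payload

def unwrap(frame: bytes):
    """Return the payload, or None if the digest doesn't match --
    modelling a transport that detects corruption and *drops* the
    message rather than delivering corrupted bits."""
    digest, payload = frame[:32], frame[32:]
    if hashlib.sha256(payload).digest() != digest:
        return None  # treat as a dropped message
    return payload
```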