Subject: Re: T2 Retry with Delivery Receipt
We have gotten pretty far afield from what reliable messaging is for. Reliable messaging assures the From application that the message got to its destination and was persisted. More importantly, it assures that if the message did not get to its destination, the From application knows about the failure.

Reliable messaging should not in any way, shape, or form concern itself with content. Even if the MSH performs some content services (e.g. signing), those services are on behalf of the application and are outside the scope of the MS spec and reliable messaging. Example: failure to validate a signature is NOT a reliable-messaging function. The RM protocol should indicate that the message got where it was supposed to go and was persisted. Some other mechanism should tell the From application that the message failed validation. That mechanism is probably an error message defined in the collaboration protocol (e.g. a BPSS instance document). An RM retry won't accomplish anything for this one. Correcting the certificate, or whatever is needed, must result in sending a new message with a new message ID, not an RM retry.

Regards,
Marty

*************************************************************************************
Martin W. Sachs
IBM T. J. Watson Research Center
P. O. B. 704
Yorktown Hts, NY 10598
914-784-7287; IBM tie line 863-7287
Notes address: Martin W Sachs/Watson/IBM
Internet address: mwsachs @ us.ibm.com
*************************************************************************************

Dan Weinreb <dlw@exceloncorp.com> on 09/19/2001 10:16:12 AM
Please respond to Dan Weinreb <dlw@exceloncorp.com>
To: david@drummondgroup.com
cc: Chris.Ferris@sun.com, ebxml-msg@lists.oasis-open.org
Subject: Re: T2 Retry with Delivery Receipt

Date: Tue, 18 Sep 2001 15:34:58 -0500
From: David Fischer <david@drummondgroup.com>

<rhetoric-mode>
"I don't want to" is not a valid reason. "It's too complicated" is almost as bad (how hard is it to concatenate two strings?).
We can allow retries, Chris just doesn't want to. Why? The reason is "It wouldn't do any good". If the reason the message didn't get through is that the (unreliable) transport layer dropped it, the regular ("hop-to-hop") retry mechanism exists to deal with that problem. There is no need to impose a second retry mechanism on top of the first one; or, if there is, then there is also a need for a third and fourth layer, and so on.

You said:

<df>retries do not guarantee success and never will. The question is what to do when those failures occur.</df>

But what are you saying we should do? You seem to be saying that we should retry some more.
</rhetoric-mode>

OK, OK, you're not really saying that. And I don't really believe that retries don't do any good under any scenario. I think the case for end-to-end retry should be made by clearly stating the scenarios where end-to-end retry adds value that hop-to-hop retry does not.

Let's consider why retrying the *same* message (same message ID, same digital signature, same contents; just as you say, everything the same except certain fields that are specific to the hop-to-hop layer of communication) *ever* does *any* good. If it failed the first time, why won't it just keep on failing and failing? I can see two categories of reason:

(1) There are *random* *transient* failures that happen often enough to worry about. Simply trying again has a good chance of succeeding.

(2) Something in the external environment changes before the retry. I think that's what you had in mind when you said "it might be manual" and "It might be now or after a fix."

The "unreliable IM" is an example of (1) that isn't handled by hop-to-hop retry and would be handled by automatic, right-now end-to-end retry. It's still not clear that a convincing use case for this has been presented. What are the scenarios in which (2) provides the justification for the retry?
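Category (1) is what a bounded automatic retry loop addresses. A minimal sketch of the idea (the function and exception names are illustrative, not from any spec): the *same* message object is resubmitted unchanged, with backoff between attempts, and the failure is surfaced to the From application only after the attempts are exhausted.

```python
import random
import time

class TransientError(Exception):
    """A failure the sender believes may not recur, e.g. the
    unreliable transport dropped the message."""

def send_with_retry(send, message, retries=3, base_delay=0.01):
    """Retry the *same* message (same message ID, same signature,
    same contents) a bounded number of times.  This only helps for
    category (1): random transient failures, where simply trying
    again has a good chance of succeeding."""
    for attempt in range(retries + 1):
        try:
            return send(message)
        except TransientError:
            if attempt == retries:
                raise  # give up; the From application learns of the failure
            # Exponential backoff with jitter between attempts.
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))
```

Note that nothing in this loop can help with category (2): if the failure is deterministic (expired certificate, wrong URL), every attempt fails the same way.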
David F, you presented some "example use cases", but some of them aren't what we need as scenarios, because they are effects rather than causes, e.g. "a DFN sent" or "an Error Message sent". What I think of as a "scenario" has to explain why they were sent: what actually went wrong in the first place? So let me try some scenarios. I think scenarios break down into two categories: those in which the From party gets some kind of negative reply, and those in which the From party times out.

Suppose I send a purchase order to Staples and digitally sign it with a private key, and in the ds:KeyInfo I send a certificate with the corresponding public key, but unfortunately this certificate expired a few days ago. The To Party sees that the certificate has expired, so the digital signature is no good, so it rejects the message. Automatic retries are clearly pointless. The From people could transmit a new certificate out-of-band to the To people and tell them to force their MSH to use the new certificate on the existing message, but this seems implausible for various reasons. Or the From side could obtain a new certificate and then send the message with the new certificate. But then it's not the same message, as defined above. Should it have the same message ID? (I don't have an answer to this.)

Suppose Staples changes its address. I sent a purchase order to Staples, and the CPA says to use HTTP to www.staples.com, and upon trying that I get an HTTP 404 (no such URL), or even a DNS error ("there's no such host name as www.staples.com"). Automatic retries do no good. But if administrators at the From host install a new CPA, then retrying the exact same message could succeed.

Suppose Staples's MSH machine has run out of disk space and rejects the incoming message. Automatic retries could solve this, by simply retrying until ordinary work frees up disk space, or the administrators at Staples add a new disk.
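The certificate scenario turns on the distinction drawn above: an RM retry resends the message unchanged, while fixing the problem produces a different message. A toy sketch (all names hypothetical; whether the corrected message should keep the old message ID is left open in the mail above, and this sketch assumes a new one purely for illustration):

```python
import uuid

def rm_retry(message):
    # An RM retry resends the message unchanged: same message ID,
    # same signature, same contents.
    return message

def corrected_resend(old_message, new_certificate):
    # After fixing the problem (e.g. obtaining a fresh certificate),
    # the From side sends what is, by the definition above, a *new*
    # message; here it gets a new message ID (an assumption, not a
    # position the spec has taken).
    return {"message_id": str(uuid.uuid4()),
            "payload": old_message["payload"],
            "certificate": new_certificate}
```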
On the other hand, the hop-to-hop retry mechanism could do that just as well. But this brings up the question of when retries ought to time out. You could say that knowing when to really give up sometimes requires manual intervention or special knowledge; no simple time-interval value in a CPA can substitute for intelligent ways of deciding how long to retry. You might posit that a retry mechanism operating at the end-to-end level is better positioned to allow this kind of intelligence to be brought to bear than a hop-to-hop retry mechanism.

Related scenario: Staples installs a new release of its MSH software, the new release has bugs that cause it to wrongly reject messages; we retry after Staples goes back to the old release or installs a fix. Similarly, an administrator at Staples messes with the configuration settings so that our messages are wrongly rejected, etc.

(The From MSH might have some kind of fancy features allowing administrators finer control over retry. There might be commands like "stop retransmitting this message but keep it in the MSH so that we can commence retransmitting later". None of this would be part of the normative protocol specification. David F, I get the impression that you have in mind something like this.)

Then there are timeout scenarios, e.g. what you called "lack of DR". Chris said "If the DR is sent reliably, then its absence is significant cause for concern." I agree, but we still have to figure out how to react if a DR does not appear after a "reasonable timeout". What scenarios might produce this? Actually, we don't really need a "scenario" as such. Reliable messaging still allows for the possibility that the sender still (after any given time interval) does not know whether the message has actually been delivered yet. So a DR can take longer than any "reasonable timeout" even if there has been no failure.
If the From side wants to learn whether the message was ever received, it can either just keep waiting, or it can send a message, which might be exactly the same as the original message, or might be a Message Status Request.

You mentioned "XML text corruption in transit". If we are really concerned about data corruption that's not caught by the TCP checksum, then we really need to add an error-correcting code as part of our own protocol. If we don't add one, then we're clearly operating under the assumption that the transport layer can be trusted never to deliver corrupted data. (Our failure model for the transport layer is that it's "unreliable" in the sense that it can drop messages, but it always detects data corruption and discards such messages, so it never delivers us corrupted bits.)

-- Dan
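That failure model can be made concrete with a small sketch: attach an integrity digest to the payload on the way out, and on the way in, discard any frame whose digest doesn't match, so corruption is converted into a drop and the retry machinery handles it. (The framing below is invented for illustration and is not part of any real ebXML header; a digest detects corruption but, unlike a true error-correcting code, cannot repair it.)

```python
import hashlib

def wrap(payload: bytes) -> bytes:
    """Prefix the payload with a SHA-256 digest (illustrative framing)."""
    return hashlib.sha256(payload).digest() + payload

def unwrap(frame: bytes):
    """Return the payload, or None if the digest doesn't match --
    modelling a transport that detects corruption and *drops* the
    message rather than delivering corrupted bits."""
    digest, payload = frame[:32], frame[32:]
    if hashlib.sha256(payload).digest() != digest:
        return None  # treat as a dropped message
    return payload
```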