ebxml-msg message

Subject: RE: T2 Retry with Delivery Receipt
From: Martin W Sachs <mwsachs@us.ibm.com>
To: David Fischer <david@drummondgroup.com>
Date: Thu, 20 Sep 2001 08:57:59 -0400

David,

We can't avoid this tangent as long as you continue to assert that NRR,
among other things, has to be considered a transmission error check.

Regards,
Marty

*************************************************************************************

Martin W. Sachs
IBM T. J. Watson Research Center
P. O. B. 704
Yorktown Hts, NY 10598
914-784-7287;  IBM tie line 863-7287
Notes address:  Martin W Sachs/Watson/IBM
Internet address:  mwsachs @ us.ibm.com
*************************************************************************************



David Fischer <david@drummondgroup.com> on 09/20/2001 01:23:08 AM

To:   Martin W Sachs/Watson/IBM@IBMUS
cc:   ebxml-msg@lists.oasis-open.org
Subject:  RE: T2 Retry with Delivery Receipt



Marty, I really don't want to get into this discussion since it is so far
out of scope for ebXML.  However, if you are interested, please see the
links below.  The truth is that no one knows how many transmission errors
go undetected, and those who have done studies suggest, for sensitive
data, exactly what Dan proposed: software checksums.

These reports suggest that there is actually a high rate of undetected
errors on TCP/IP networks ("high" meaning that a one-megabyte file
transfer could have as much as an even chance of containing an error).
I don't know for sure, nor am I particularly concerned, since I can
digitally sign anything important and detect errors through software
means.  Even a signature is not 100%, but it is close enough.
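
To make the disagreement concrete, here is a minimal sketch of one
well-known blind spot.  TCP's checksum is the 16-bit ones' complement sum
of RFC 1071, and because it is a sum it cannot notice two aligned 16-bit
words being swapped in transit.  The payload bytes are hypothetical, and
SHA-256 stands in for the "software checksum" or signature discussed
above; it is an illustration, not a measurement of real error rates.

```python
import hashlib


def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement sum of 16-bit words, as in RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)  # end-around carry
    return ~total & 0xFFFF


# Hypothetical payload: the last two 16-bit words ("02" and "10") swap places.
sent = b"QTY=0210"
received = b"QTY=1002"

print(internet_checksum(sent) == internet_checksum(received))  # True: not detected
print(hashlib.sha256(sent).digest() == hashlib.sha256(received).digest())  # False: detected
```

A sum is order-insensitive by construction, which is why this class of
error slips through; a cryptographic digest has no such structure.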

Please, let's not go off on this tangent!

Regards,

David Fischer
Drummond Group.

http://citeseer.nj.nec.com/stone00when.html
http://www.academ.com/nanog/feb1997/dynamics.html

-----Original Message-----
From: Martin W Sachs [mailto:mwsachs@us.ibm.com]
Sent: Wednesday, September 19, 2001 10:08 PM
To: David Fischer
Cc: Dan Weinreb; ebxml-msg@lists.oasis-open.org
Subject: RE: T2 Retry with Delivery Receipt



You are still postulating that there are transmission errors that won't be
caught by either TCP or the underlying physical transport.  You need to
make a convincing case that they can happen.

Regards,
Marty




"David Fischer" <david@drummondgroup.com> on 09/19/2001 10:43:42 PM

To:   Martin W Sachs/Watson/IBM@IBMUS
cc:   "Dan Weinreb" <dlw@exceloncorp.com>, <ebxml-msg@lists.oasis-open.org>
Subject:  RE: T2 Retry with Delivery Receipt



Actually, NRR is exactly for making sure that the message arrived intact,
either as protection from transmission failures or from security breaches
(e.g. a man-in-the-middle attack).  It might be pretty bad if I ordered 2
items and a transmission failure changed that to 1,000,002 (actually it
would be more binary than that).  The signature assures the To Party that
the message did not change, and NRR assures the From Party, in turn, that
it did not change (round trip).
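
A simplified model of the round trip described above, assuming the
receipt just echoes a SHA-256 digest of what the To Party received.  This
is not the ebXML acknowledgment format, and the function names are
hypothetical; the point is only that the From Party compares the echoed
digest with its own digest of what it sent.

```python
import hashlib
import hmac


def digest(payload: bytes) -> str:
    return hashlib.sha256(payload).hexdigest()


def make_receipt(received: bytes) -> dict:
    """Hypothetical To-side receipt: reports the digest of what actually arrived."""
    return {"digest": digest(received)}


def receipt_matches(sent: bytes, receipt: dict) -> bool:
    """From side: did the To Party get exactly what we sent?"""
    return hmac.compare_digest(digest(sent), receipt["digest"])


order = b"<Quantity>2</Quantity>"
garbled = b"<Quantity>1000002</Quantity>"

print(receipt_matches(order, make_receipt(order)))    # True
print(receipt_matches(order, make_receipt(garbled)))  # False
```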

David Fischer
Drummond Group.

-----Original Message-----
From: Martin W Sachs [mailto:mwsachs@us.ibm.com]
Sent: Wednesday, September 19, 2001 9:26 PM
To: David Fischer
Cc: Dan Weinreb; ebxml-msg@lists.oasis-open.org
Subject: RE: T2 Retry with Delivery Receipt



NRR is a non-repudiation function.  It is not intended as a transmission
error detector.  Someone will have to make a convincing case that TCP is
not sufficient for detecting transmission errors.

Regards,
Marty




David Fischer <david@drummondgroup.com> on 09/19/2001 10:46:21 AM

To:   Dan Weinreb <dlw@exceloncorp.com>
cc:   ebxml-msg@lists.oasis-open.org
Subject:  RE: T2 Retry with Delivery Receipt



Very good.  I agree with most of it.

One comment about checksums.  We already have a transmission
error-catching mechanism called NRR.

On the whole, I think this is very good.  The point is that there are
some scenarios which would require a retry.  But I prefer to phrase the
question differently: why would an IM *ever* stop a retry from the end?
It is not the job of an IM to tell the ends what they may or may not
send.  Since it is easy to differentiate between an IM retry and an end
retry, it's not "why would we" but rather "why wouldn't we"?
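
One hypothetical way an IM could tell the two kinds of retry apart, in
the spirit of the "concatenate two strings" remark later in this thread:
assume the From party stamps each end-to-end attempt with a counter
(`end_attempt`, an assumption, not a field in the specification) and the
IM keys its duplicate elimination on the MessageId plus that counter.  A
hop-level retransmission repeats the key and is suppressed; an end retry
produces a new key and is forwarded.

```python
class IntermediaryDedup:
    """Hypothetical IM duplicate filter keyed on MessageId + end-to-end attempt."""

    def __init__(self) -> None:
        self.seen: set[str] = set()

    def should_forward(self, message_id: str, end_attempt: int) -> bool:
        key = f"{message_id}#{end_attempt}"  # the two concatenated strings
        if key in self.seen:
            return False  # same key: a hop-level retransmission, drop it
        self.seen.add(key)
        return True  # new key: first copy, or a deliberate end retry


im = IntermediaryDedup()
print(im.should_forward("po-123@example.com", 1))  # True: first transmission
print(im.should_forward("po-123@example.com", 1))  # False: hop retry duplicate
print(im.should_forward("po-123@example.com", 2))  # True: end retry passes through
```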

This may all be moot, since built-in problems with multi-hop (like not
allowing end-to-end retries) could force IMs out of the picture.  I'd
rather not, but...?

Regards,

David Fischer
Drummond Group.

-----Original Message-----
From: Dan Weinreb [mailto:dlw@exceloncorp.com]
Sent: Wednesday, September 19, 2001 9:16 AM
To: david@drummondgroup.com
Cc: Chris.Ferris@sun.com; ebxml-msg@lists.oasis-open.org
Subject: Re: T2 Retry with Delivery Receipt


   Date: Tue, 18 Sep 2001 15:34:58 -0500
   From: David Fischer <david@drummondgroup.com>

<rhetoric-mode>

   "I don't want to" is not a valid reason.  "It's too complicated" is
   almost as bad (how hard is it to concatenate two strings?).  We can
   allow retries; Chris just doesn't want to.  Why?

The reason is "It wouldn't do any good".

If the reason the message didn't get through is that the (unreliable)
transport layer dropped it, the regular ("hop-to-hop") retry mechanism
exists to deal with that problem.  There is no need to impose a second
retry mechanism on top of the first one: or, if there is, then there
is also a need for a third and fourth layer and so on.

You said:

   <df>retries do not guarantee success and never will.  The question is
   what to do when those failures occur.</df>

But what are you saying we should do?  You seem to be saying that we
should retry some more.

</rhetoric-mode>

OK, OK, you're not really saying that.  And I don't really believe
that they don't do any good under any scenarios.  I think the case for
end-to-end retry should be made by clearly stating the scenarios where
end-to-end retry adds value that hop-to-hop retry does not.

Let's consider why retrying the *same* message (same message ID, same
digital signature, same contents, just as you say, everything the same
except certain fields that are specific to the hop-to-hop layer of
communication) *ever* does *any* good.  If it failed the first time,
why won't it just keep on failing and failing?  I can see two
categories of reason:

(1) There are *random* *transient* failures that happen often enough
to worry about.  Simply trying again has a good chance of succeeding.

(2) Something in the external environment changes before the retry.
I think that's what you had in mind when you said "it might be manual"
and "It might be now or after a fix."

The "unreliable IM" is an example of (1) that isn't handled by
hop-to-hop retry and would be handled by automatic, right-now
end-to-end retry.  It's still not clear that a convincing use
case for this has been presented.
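
A toy simulation of the "unreliable IM" case, with every class and
function hypothetical: the IM acknowledges the hop and then loses the
message (say, a crash before it is persisted).  Hop-to-hop retry is
satisfied by the acknowledgment, so only an end-to-end retry, triggered
here by the absence of a delivery receipt, recovers the message.

```python
from dataclasses import dataclass, field


@dataclass
class UnreliableIM:
    """Hypothetical IM that acks each hop but silently loses the first N messages."""
    losses_remaining: int = 1
    forwarded: list[str] = field(default_factory=list)

    def receive(self, msg_id: str) -> bool:
        if self.losses_remaining > 0:
            self.losses_remaining -= 1  # e.g. crash after ack, before persisting
        else:
            self.forwarded.append(msg_id)  # passed on toward the To party
        return True  # the hop acknowledgment goes back either way


def send_hop(im: UnreliableIM, msg_id: str, max_hop_retries: int = 3) -> bool:
    """Hop-to-hop reliability: retransmit only while no hop ack comes back."""
    for _ in range(max_hop_retries):
        if im.receive(msg_id):
            return True
    return False


def send_end_to_end(im: UnreliableIM, msg_id: str, max_end_retries: int = 3) -> int:
    """Resend the same message until a DR arrives; return attempts used, or -1."""
    for attempt in range(1, max_end_retries + 1):
        send_hop(im, msg_id)
        if msg_id in im.forwarded:  # stands in for a DR from the To party
            return attempt
    return -1


hop_only = UnreliableIM()
print(send_hop(hop_only, "m1"), hop_only.forwarded)  # True []  (acked, never delivered)
print(send_end_to_end(UnreliableIM(), "m1"))          # 2
```

Whether IMs like this are common enough to justify end-to-end retry is
exactly the open question; the sketch only shows that hop-to-hop retry
cannot cover it.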

What are the scenarios in which (2) provides the justification for the
retry?  David F, you presented some "example use cases", but some of
them aren't what we need as scenarios, because they are effects rather
than causes, e.g. "a DFN sent" or "an Error Message sent".  What I
think of as a "scenario" has to explain why they were sent: what
actually went wrong in the first place?

So let me try some scenarios.  I think scenarios break down into two
categories: those in which the From party gets some kind of negative
reply, and those where the From party times out.

Suppose I send a purchase order to Staples and I digitally sign it
with a private key, and in the ds:keyInfo I send a certificate with
the corresponding public key, but unfortunately this certificate
expired a few days ago.  The To Party sees that the certificate has
expired, so the digital signature is no good, so it rejects the
message.  Automatic retries are clearly pointless.  The From people
could transmit a new certificate out-of-band to the To people and tell
them to force their MSH to use the new certificate on the existing
message, but this seems kind of implausible for various reasons.  Or
the From side could obtain a new certificate, and then send the
message with the new certificate.  But then it's not the same message,
as defined above.  Should it have the same messageID?  (I don't have
an answer to this.)

Suppose Staples changes its address.  I sent a purchase order to
Staples, and the CPA says to use HTTP to www.staples.com, and upon
trying that I get an HTTP 404 (no such URL), or even a DNS error
("there's no such host name as 'www.staples.com'").  Automatic retries
do no good.  But if administrators at the From host install a new CPA,
then retrying the exact same message could succeed.

Suppose Staples's MSH machine has run out of disk space and rejects
the incoming message.  Automatic retries could solve this, by simply
retrying until ordinary work frees up disk space, or the
administrators at Staples add a new disk.  On the other hand, the
hop-to-hop retry mechanism could do that just as well.  But this
brings up a question as to when retries ought to time out.  You could
say that knowing when to really give up sometimes requires manual
intervention or special knowledge; no simple time-interval value in a
CPA can substitute for intelligent ways of deciding how long to retry.
You might posit that a retry mechanism operating at the end-to-end
level is better positioned to allow this kind of intelligence to be
brought to bear than a hop-to-hop retry mechanism.

Related scenario: Staples installs a new release of its MSH software,
the new release has bugs that cause it to wrongly reject messages; we
retry after Staples goes back to the old release or installs a fix.
Similarly, an administrator at Staples messes with the configuration
settings so that our messages are wrongly rejected, etc.

(The From MSH might have some kind of fancy features allowing
administrators finer control over retry.  There might be commands like
"stop retransmitting this message but keep it in the MSH so that we
can commence retransmitting later".  None of this would be part of the
normative protocol specification.  David F, I get the impression that you
have in mind something like this.)

Then there are timeout scenarios, e.g. what you called "lack of DR".
Chris said "If the DR is sent reliably, then its absence is
significant cause for concern."  I agree, but we still have to figure
out how to react if a DR does not appear after a "reasonable timeout".
What scenarios might produce this?  Actually, we don't really need a
"scenario" as such.  Reliable messaging still allows for the
possibility that the sender (after any given time interval) still does
not know whether the message has actually been delivered.  So a DR
can take longer than any "reasonable timeout" even if there has been
no failure.  If the From side wants to learn whether the message was
ever received, it can either just keep waiting or send a message, which
might be exactly the same as the original message, or might be a
Message Status Request.
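
The From side's options after a DR timeout can be sketched as a small
decision function.  The status values are simplified assumptions loosely
modeled on a status-request reply, not the specification's enumeration,
and the `query_status` and `resend` callbacks are hypothetical hooks into
the From MSH.

```python
from enum import Enum, auto
from typing import Callable


class Status(Enum):
    RECEIVED = auto()        # To side has the message; the DR is just late
    NOT_RECOGNIZED = auto()  # To side has never seen this MessageId
    NO_ANSWER = auto()       # the status request got no reply either


def on_dr_timeout(
    message_id: str,
    query_status: Callable[[str], Status],
    resend: Callable[[str], None],
) -> str:
    """Hypothetical handling once a 'reasonable timeout' for the DR expires."""
    status = query_status(message_id)
    if status is Status.RECEIVED:
        return "keep waiting for the DR"
    if status is Status.NOT_RECOGNIZED:
        resend(message_id)  # exactly the same message, same MessageId
        return "resent"
    return "keep waiting, or escalate to an administrator"


resent: list[str] = []
print(on_dr_timeout("m1", lambda _: Status.NOT_RECOGNIZED, resent.append), resent)
# resent ['m1']
```

The design choice worth noting is that a timeout alone never proves
loss; the sketch asks before resending, which is one of the two options
Dan lists.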

You mentioned "XML text corruption in transit".  If we are really
concerned about data corruption that's not caught by the TCP checksum,
then we really need to add an error-correcting code as part of our own
protocol.  If we don't add one, then we're clearly operating under the
assumption that the transport layer can be trusted to never deliver
corrupted data.  (Our failure model for the transport layer is that
it's "unreliable" in the sense that it can drop messages, but it
always detects data corruption and discards such messages, so it never
delivers us corrupted bits.)

-- Dan


----------------------------------------------------------------
To subscribe or unsubscribe from this elist use the subscription
manager: <http://lists.oasis-open.org/ob/adm.pl>