ebxml-msg message

Subject: Re: Guaranteed duplicate elimination vs. upper bound on delays
From: Martin W Sachs <mwsachs@us.ibm.com>
To: Dan Weinreb <dlw@exceloncorp.com>
Date: Wed, 15 Aug 2001 09:18:45 -0400
Comments below.

Regards,
Marty

*************************************************************************************

Martin W. Sachs
IBM T. J. Watson Research Center
P. O. B. 704
Yorktown Hts, NY 10598
914-784-7287;  IBM tie line 863-7287
Notes address:  Martin W Sachs/Watson/IBM
Internet address:  mwsachs @ us.ibm.com
*************************************************************************************



Dan Weinreb <dlw@exceloncorp.com> on 08/15/2001 01:28:24 AM

Please respond to "Dan Weinreb" <dlw@exceloncorp.com>

To:   Martin W Sachs/Watson/IBM@IBMUS
cc:   ebxml-msg@lists.oasis-open.org
Subject:  Re: Guaranteed duplicate elimination vs. upper bound on delays



   Date: Mon, 13 Aug 2001 16:38:12 -0400
   From: Martin W Sachs <mwsachs@us.ibm.com>

   As I recall, there is a time to live associated with each IP packet, which
   helps TCP manage these things.  I agree that a message service time to live
   would help by killing the really long-delayed messages.

MWS:  I agree with the paragraphs above and below.  But the basic idea is the
same.  Get rid of the stuff that has been hanging around too long before it
causes trouble.

Actually I think that in real life, the time-to-live field in the IP
packet isn't really used as a measure of realtime, but as a hop count,
decremented by each router, mainly in order to get rid of looping
packets that can arise during unusual circumstances such as when
network traffic is proceeding while the routers are changing their
configurations/tables/etc.
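
To make the hop-count behaviour concrete, here is a minimal sketch in
Python.  It is only an illustration: the function name, the router
names and the starting value of 64 are assumptions made for the example
and are not taken from any specification.

```python
import itertools


def forward(ttl, route):
    """Walk a packet along `route`, an iterable of router names.

    Each router decrements the TTL by one and discards the packet if the
    TTL has reached zero.  Returns (delivered, hops_taken).
    """
    hops = 0
    for _router in route:
        ttl -= 1
        hops += 1
        if ttl == 0:
            # The hop budget is used up, so this router drops the packet.
            return False, hops
    return True, hops


# A normal three-hop path delivers the packet.
print(forward(64, ["r1", "r2", "r3"]))

# A routing loop A -> B -> C -> A ... is cut off once the budget runs out.
print(forward(64, itertools.cycle(["A", "B", "C"])))
```

Note that no clock is consulted anywhere: the looping packet dies because
it ran out of hops, not because any amount of real time has passed.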

The story on how TCP does this, unfortunately, appears to be
complicated.  See http://www.lcg.org/sock-faq/detail.php3?id=13, and
also RFC 1337 and especially the Appendix to RFC 1185 (search for
"The scheme finally adopted for TCP combines features of both these
proposals.  TCP uses three mechanisms:").

The key thing seems to be the TIME_WAIT state.  The following
quotation is from the sock-faq, from Richard Stevens, who, as they
say, "wrote the book" on TCP (several books actually):

   The reason that the duration of the TIME_WAIT state is 2*MSL is that
   the maximum amount of time a packet can wander around a network is
   assumed to be MSL seconds. The factor of 2 is for the round-trip. The
   recommended value for MSL is 120 seconds, but Berkeley-derived
   implementations normally use 30 seconds instead. This means a
   TIME_WAIT delay between 1 and 4 minutes. Solaris 2.x does indeed use
   the recommended MSL of 120 seconds.
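
The arithmetic behind "between 1 and 4 minutes" can be checked with a
few lines of Python.  This is just a worked calculation using the two
MSL values from the quotation; the function name is made up for the
example.

```python
def time_wait_seconds(msl_seconds):
    """TIME_WAIT lasts 2 * MSL: one MSL for each direction of the round trip."""
    return 2 * msl_seconds


for label, msl in [("Berkeley-derived", 30), ("recommended", 120)]:
    minutes = time_wait_seconds(msl) / 60
    print(f"{label}: MSL = {msl} s, TIME_WAIT = {minutes:g} min")
```

With an MSL of 30 seconds the delay is 60 seconds (1 minute), and with
120 seconds it is 240 seconds (4 minutes).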

As far as I can tell the 120 seconds is arbitrary and has nothing to
do with the IP time-to-live feature.  Unfortunately for us, we are not
dealing with IP routers but potentially with store-and-forward
mailers, which might accept and store a message, and then suffer a
head crash requiring spare parts that might not arrive for weeks,
especially if the poor high-tech company is on credit hold and the MIS
guy is on vacation, and then it might come back up and finally forward
the message a month later.

MWS:  That means that if the application cares, it also needs a time to
live.  That should be a parameter of the BPSS spec if it isn't already.
Including time-to-live (persist duration) helps by allowing the MSH to
clear out earlier the junk that arrives too late for MSH duplicate
detection, but I'm beginning to think that if that's its main purpose,
persistDuration must be supplied by the application.

-- Dan
Follow-Ups:
- Re: Guaranteed duplicate elimination vs. upper bound on delays
  - From: christopher ferris <chris.ferris@east.sun.com>