ebxml-msg message

Subject: Re: T2 Retry with Delivery Receipt
From: Dan Weinreb <dlw@exceloncorp.com>
To: ebxml-msg@lists.oasis-open.org
Date: Thu, 13 Sep 2001 11:29:19 -0400 (EDT)

I have another idea about how to clarify the hop-to-hop/end-to-end
debate.

First of all, whichever of the two points of view you take, there is
such a thing as "MSH-level failure" to deliver a message.  For
example, suppose From Party F talks to To Party T directly, with no
IM's at all.  F tries sending the message and it fails, or F doesn't
get an Ack from T.  Therefore F does all of its retries; sadly, all of
them fail or don't result in an Ack.  There's nothing more to try, so
the From MSH returns an "MSH-level failure" error code (or exception,
whatever) to the From Application.
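
To make the simple case concrete, here is a minimal sketch in Python.
The names (Outcome, send_with_retries, send_once) are mine, purely for
illustration, and are not from the MS Spec.  I assume send_once returns
True only when an Ack came back.

```python
from enum import Enum
from typing import Callable


class Outcome(Enum):
    DELIVERED = "delivered"      # an Ack came back from the To MSH
    MSH_FAILURE = "msh-failure"  # retries exhausted, delivery unconfirmed


def send_with_retries(send_once: Callable[[], bool], max_retries: int) -> Outcome:
    """Send once, then retry up to max_retries more times.

    send_once() is a hypothetical hook: it performs one transmission and
    returns True only if an Ack was received for it.
    """
    for _ in range(1 + max_retries):
        if send_once():
            return Outcome.DELIVERED
    # Nothing more to try at the MSH level; report to the From Application.
    return Outcome.MSH_FAILURE
```

Note that MSH_FAILURE here means "could not confirm", not "was not
delivered": the message may have arrived and only the Acks were lost.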

This is what some people have been calling "DFN", but "DFN" has also
been used as the name of a message, so I want to avoid that
terminological confusion.  The "MSH-level failure" means "I, the MSH,
have tried to get your message through, but after trying everything
within my power, I cannot confirm that your message got through."

(There are also cases where the MSH can promise that the message did
not get through, but those are too easy to be of interest here so I'll
ignore them.)

At this point, any further retrying has to be at a higher level than
the MS layer at all, i.e. the application layer.  For example, maybe
the human administrators reconfigure things, pick a new HTTP server
because they decide the old one is too broken for words, negotiate a
new CPA, and then try again.

But I think that if this kind of "application-level retry" is done,
the new message is considered "new" by the MSH, which doesn't know
that the application level thinks of it as a retry.  So what about the
possibility that the first message actually did get through, and you
might be causing a duplication?  The only answer here is that you must
resolve the doubt before doing the application-level retry.

This clearly (IMHO) is what MessageStatus is for: to get a positive or
negative resolution from the To Party.  There's nothing wrong with
MessageStatus.  Distributed systems have long had things like this.
For example, in two-phase commit protocols, there is a message where
one of the resource managers can ask the transaction manager "please
tell me what the outcome of transaction 1324 turned out to be".  It's
just like that.  Anyway, there is no way for the From MSH to resolve
its doubt about whether the message was delivered until it (the From
MSH) can actually communicate with the To MSH.
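
Here is a rough sketch of "resolve the doubt first, then retry" in
Python.  Everything named here is hypothetical: query_status stands in
for a MessageStatus exchange and returns None when the To MSH cannot be
reached, and the two-value Status enum is a simplification, not the
set of status values in the spec.

```python
from enum import Enum
from typing import Callable, Optional


class Status(Enum):
    RECEIVED = "received"
    NOT_RECEIVED = "not-received"


def application_level_retry(
    query_status: Callable[[str], Optional[Status]],
    resend_as_new: Callable[[], str],
    original_id: str,
) -> Optional[str]:
    """Return the id of the newly sent message, or None if no resend was needed."""
    status = query_status(original_id)
    if status is None:
        # Cannot talk to the To MSH, so the doubt cannot be resolved yet.
        raise RuntimeError("cannot reach the To MSH; doubt is unresolved")
    if status is Status.RECEIVED:
        # The first message got through after all; resending would duplicate it.
        return None
    # Negative resolution: safe to send what the MSH will treat as a new message.
    return resend_as_new()
```

The point of the ordering is that resend_as_new is never called while
the outcome of the original message is still unknown.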

OK, now let's introduce IM's into the picture, assuming the overall
model of "reliable IM's" advocated by Chris and Colleen.  Suppose F
talks to T through a chain of IM's: F <-> IM1 <-> IM2 <-> T.  IM1
tries to send the message to IM2, and IM1 exhausts all its retries
without success.  What happens now?

In the "reliable IM" model, if communication between any two adjacent
MSH's fails, that constitutes MSH failure for the original request,
and the F MSH should return a failure code (or exception) to the F
application.  It's just like the simple case of MSH failure that I
started out with.  The only recourse at this point is
application-level retry as discussed above.  There is no point in
doing an "end-to-end retry", because retrying has already been
attempted and has proven inadequate.  You'd just be beating a dead
horse.  There is no failure mode that would be repaired by an
end-to-end retry that would not have already been taken care of by the
hop-to-hop retry.  That's because the "reliable IM" axiom says that
there aren't any such failure modes in the failure model.  The only
failure modes are in the network between the MSH's, and the hop-to-hop
retries take care of that.  (Of course you have to tune the retry
parameters based on just how flakey the network is.)
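
As a toy illustration of the "reliable IM" model (my own sketch in
Python, with hypothetical names), treat each hop as a function that
attempts one transmission and returns True if it was Acked.  The nodes
themselves never lose anything; only the links can fail.

```python
from typing import Callable, List


def deliver_over_chain(links: List[Callable[[], bool]], retries_per_hop: int) -> bool:
    """Push one message across every hop in order, e.g. F->IM1, IM1->IM2, IM2->T.

    links[i]() attempts a single transmission over hop i and returns True
    if that transmission was Acked.  Each hop gets 1 + retries_per_hop tries.
    """
    for attempt_hop in links:
        # any() stops at the first Acked attempt, so a hop stops retrying on success.
        if not any(attempt_hop() for _ in range(1 + retries_per_hop)):
            # This hop exhausted its retries: MSH failure for the whole request.
            return False
    return True
```

Under this failure model an end-to-end retry would only re-run the
same per-hop retries over the same links, which is why it buys nothing
that tuning retries_per_hop does not.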

Again, I am not taking sides here.  I just want to point out that the
model Chris and Colleen are advocating is entirely consistent and
works just fine, as long as you accept the "reliable IM" axiom.

On the other hand, suppose you assume the "unreliable IM" model
advocated by David F and Marty.  In this case, end-to-end retry is
useful, because it recovers from "IM failure".  "IM failure" is
different from the network failure that the hop-to-hop retries take
care of.  F would do an end-to-end retry either because some kind of
"delivery failed" message was sent back to it, or else because it
times out.

(Of course end-to-end retry implies that there must be something like
a "retry count" field in the message, which in turn means there would
be two different kinds of message identity, as we've discussed.)
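
A small Python sketch of what the two kinds of identity might look
like.  The field names (message_id, retry_count) and the classes are
hypothetical, not taken from the spec: message_id stays the same across
end-to-end retries, while the pair (message_id, retry_count) names one
particular transmission.

```python
from dataclasses import dataclass
from typing import Set, Tuple


@dataclass(frozen=True)
class Envelope:
    message_id: str   # end-to-end identity: unchanged across retries
    retry_count: int  # 0 for the first attempt, then 1, 2, ...

    @property
    def transmission_id(self) -> Tuple[str, int]:
        """Identity of this one attempt, distinct for every end-to-end retry."""
        return (self.message_id, self.retry_count)


class ToParty:
    """Passes each message_id to the application at most once."""

    def __init__(self) -> None:
        self.delivered: Set[str] = set()

    def receive(self, env: Envelope) -> bool:
        """Return True if passed to the application, False if a duplicate."""
        if env.message_id in self.delivered:
            return False
        self.delivered.add(env.message_id)
        return True
```

Duplicate elimination at T keys on message_id, so a retry that arrives
after the original also got through is dropped rather than delivered
twice.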

Chris, it seems to me that you'd have to agree that hop-to-hop isn't
adequate if someone were to provide a use case that met all of the
following criteria:

- It's clearly and compellingly something that we must support.
- The node in question must, for some reason, be treated as an
  IM at the ebXML MS level; it cannot, for whatever reason,
  be treated as if it were an SMTP store-and-forward mailer
  at the underlying transport/communication layer.
- The node is unreliable, e.g. it can drop (or duplicate)
  messages for reasons other than the network being its
  usual flakey message-dropping self.

To put it another way, I think your (Chris) position is that there
isn't ever going to be a compelling reason to take an unreliable node
and treat it as an IM.

I would like to add that protocol-translating gateways are very much
*not* the same thing as IM's.  Providing a use case in which we need
to use an unreliable protocol-translating gateway does *not*
contradict the hop-to-hop approach for dealing with IM's.  To knock
down hop-to-hop, someone must come up with a genuine IM use case,
corresponding to Figure 8-2 (in section 8.5.4, page 28, of the MS
Spec).