ebxml-msg message

Subject: Re: T2 Retry with Delivery Receipt

From: Christopher Ferris <chris.ferris@sun.com>
To: Dan Weinreb <dlw@exceloncorp.com>
Date: Thu, 13 Sep 2001 12:35:33 -0400

Dan,

Please see below.

Chris

Dan Weinreb wrote:
> 
> I have another idea about how to clarify the hop-to-hop/end-to-end
> debate.
> 
> First of all, whichever of the two points of view you take, there is
> such a thing as "MSH-level failure" to deliver a message.  For
> example, suppose From Party F talks to To Party T directly, with no
> IM's at all.  F tries sending the message and it fails, or F doesn't
> get an Ack from T.  Therefore F does all of its retries; sadly, all of
> them fail or don't result in an Ack.  There's nothing more to try, so
> the From MSH returns an "MSH-level failure" error code (or exception,
> whatever) to the From Application.
> 
> This is what some people have been calling "DFN", but "DFN" has also
> been used as the name of a message, so I want to avoid that
> terminological confusion.  The "MSH-level failure" means "I, the MSH,
> have tried to get your message through, but after trying everything
> within my power, I cannot confirm that your message got through."

Yes, we need to clear this confusion up. I think that there are cases
where a DFN "exception" thrown by an MSH back to the sending application
needs to be converted into a DFN message that is sent to the original
"From Party", probably reliably, so that it knows that there are
problems afoot.
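
To make the exception/message distinction concrete, here is a minimal
sketch, not anything taken from the spec. All names in it
(MshLevelFailure, send_with_retries, forward_or_notify, the dict keys)
are hypothetical. It assumes send_once returns True when an Ack comes
back, and notify stands in for whatever would actually send the DFN
message.

```python
class MshLevelFailure(Exception):
    """Raised when every retry is exhausted without an Ack (hypothetical name)."""


def send_with_retries(send_once, message, retries=3):
    """Make the first attempt plus up to `retries` resends.

    Returns the zero-based index of the attempt that was acknowledged.
    """
    for attempt in range(1 + retries):
        if send_once(message):  # assumed: True means an Ack was received
            return attempt
    # Nothing more to try at the MSH level.
    raise MshLevelFailure(message["message_id"])


def forward_or_notify(send_once, message, notify, retries=3):
    """Sending-application view: convert the exception into a DFN message."""
    try:
        send_with_retries(send_once, message, retries)
        return True
    except MshLevelFailure:
        # Illustrative DFN message addressed to the original From Party.
        notify({
            "type": "DeliveryFailure",
            "to": message["from"],
            "ref_to_message_id": message["message_id"],
        })
        return False
```

The point of the split is that the exception stays local to the node
whose retries ran out, while the DFN message is what travels back.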

> 
> (There are also cases where the MSH can promise that the message did
> not get through, but those are too easy to be of interest here so I'll
> ignore them.)

;-)

> 
> At this point, any further retrying has to be at a higher level than
> the MS layer at all, i.e. the application layer.  For example, maybe
> the human administrators reconfigure things, pick a new HTTP server
> because they decide the old one is too broken for words, negotiate a
> new CPA, and then try again.

Right, there are bigger problems here than a simple retry (which has
already been attempted, most likely repeatedly) will resolve.

> 
> But I think that if this kind of "application-level retry" is done,
> the new message is considered "new" by the MSH, which doesn't know
> that the application level thinks of it as a retry.  So what about the
> possibility that the first message actually did get through, and you
> might be causing a duplication?  The only answer here is that you must
> resolve the doubt before doing the application-level retry.

Correct.

> 
> This clearly (IMHO) is what MessageStatus is for: to get a positive or
> negative resolution from the TO party.  There's nothing wrong with
> MessageStatus.  Distributed systems have long had things like this.

Exactly!

> For example, in two-phase commit protocols, there is a message where
> one of the resource managers can ask the transaction manager "please
> tell me what the outcome of transaction 1324 turned out to be".  It's
> just like that.  Anyway, there is no way for the From MSH to resolve
> its doubt about whether the message was delivered until it (the From
> MSH) can actually communicate with the To MSH.

Agreed.
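
As a rough sketch of "resolve the doubt before the application-level
retry": the function and status strings below are placeholders of my
own, not quoted from the spec. query_status is assumed to return None
when the To MSH cannot be reached, and send_new is assumed to hand the
payload to the MSH as a brand-new message.

```python
def application_level_retry(query_status, send_new, payload, old_message_id):
    """Resend only after the To party confirms the original did not arrive."""
    status = query_status(old_message_id)
    if status is None:
        # Cannot talk to the To MSH yet, so the doubt cannot be resolved.
        return "in-doubt"
    if status == "received":
        # The first message got through; resending would duplicate it.
        return "already-delivered"
    # Negative resolution: safe to send. The MSH sees this as a new message.
    return send_new(payload)
```

The "in-doubt" branch is the important one: until the From side can
actually reach the To MSH, no resend is safe.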

> 
> OK, now let's introduce IM's into the picture, assuming the overall
> model of "reliable IM"'s advocated by Chris and Colleen.  Suppose F
> talks to T through a chain of IM's: F <-> IM1 <-> IM2 <-> T.  IM1
> tries to send the message to IM2, and IM1 exhausts all his retries
> without success.  What happens now?
> 
> In the "reliable IM" model, if communication between any two adjacent
> MSH's fails, that constitutes MSH failure for the original request,
> and the F MSH should return a failure code (or exception) to the F
> application.  It's just like the simple case of MSH failure that I

Agreed.

> started out with.  The only recourse at this point is
> application-level retry as discussed above.  There is no point in
> doing an "end-to-end retry", because retrying has already been
> attempted and has proven inadequate.  You'd just be beating a dead

My point exactly. It should be noted that there is nothing preventing
the intermediary "application" from doing a resend, triggering a new
set of retries on the part of the sending intermediate MSH node. Nothing
at all! Nothing new is needed in the protocol, but we could certainly
make this more explicit in the specification. We don't need to say
*how* this is accomplished, but we could in a non-normative note
say that the application could say to the MSH "resend MessageId 123"
or it could simply pass the message back to the MSH as a new message
with the same MessageId. We really don't care, nor should we be
dictating, how this is accomplished, if at all.

We decided in Tokyo that the intermediary node had a "routing
application" that did the work of routing and forwarding of the
message to the next node in the message path.

> horse.  There is no failure mode that would be repaired by an
> end-to-end retry that would not have already been taken care of by the
> hop-to-hop retry.  That's because the "reliable IM" axiom says that
> there aren't any such failure modes in the failure model.  The only
> failure modes are in the network between the MSH's, and the hop-to-hop
> retries take care of that.  (Of course you have to tune the retry
> parameters based on just how flakey the network is.)

Right. That's a well stated description of my position.
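
A toy model of that position, under the "reliable IM" axiom: the only
thing that can fail is a link between adjacent MSHs, and each hop does
its own retries. Everything here is illustrative; deliver_along_path
and scripted are made-up names, and each link is just a callable that
returns True when that hop's Ack arrives.

```python
def deliver_along_path(links, retries=2):
    """Walk the path F -> IM1 -> IM2 -> T one hop at a time.

    Returns None if every hop was acknowledged, otherwise the index of
    the hop that exhausted its retries (an MSH-level failure for the
    original request).
    """
    for hop, link in enumerate(links):
        # any() stops at the first acknowledged attempt on this hop.
        if not any(link() for _ in range(1 + retries)):
            return hop
    return None


def scripted(outcomes):
    """Test helper: a link that replays fixed outcomes, then keeps failing."""
    it = iter(outcomes)
    return lambda: next(it, False)


# The second hop loses one transmission, but its own retry recovers it.
path = [scripted([True]), scripted([False, True]), scripted([True])]
result = deliver_along_path(path)
```

In this model an end-to-end retry from F would only replay the same
hops with the same retry budget, which is why it adds nothing unless
the nodes themselves can fail.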

> 
> Again, I am not taking sides here.  I just want to point out that the
> model Chris and Colleen are advocating is entirely consistent and
> works just fine, as long as you accept the "reliable IM" axiom.
> 
> On the other hand, suppose you assume the "unreliable IM" model
> advocated by David F and Marty.  In this case, end-to-end retry is
> useful, because it recovers from "IM failure".  "IM failure" is
> different from the network failure that the hop-to-hop retries take
> care of.  F would do an end-to-end retry either because some kind of
> "delivery failed" message was sent back to it, or else because it
> times out.

Yes, but I think that for our purposes, we can stipulate that an IM
MUST be reliable. If we want to tackle the unreliable case, then I think
we're talking about a horse of a different color and that would be
new functionality that should be considered as out of scope for 1.1.

> 
> (Of course end-to-end retry implies that there must be something like
> a "retry count" field in the message so that there should be two
> different kinds of message identity, as we've discussed.)

Correct. We would also need to consider the implications for
processing intermediaries that do MORE than merely route a message
to its next hop in the message path.
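
Purely for illustration of the "two kinds of identity" idea, and not a
proposal for 1.1: in the sketch below message_id names the logical
message and (message_id, retry_count) names one particular
transmission of it. The Receiver class and its fields are hypothetical.

```python
class Receiver:
    """Toy To MSH that delivers each logical message at most once."""

    def __init__(self):
        self.delivered = set()      # logical identity: message_id
        self.transmissions = set()  # transmission identity: (message_id, retry_count)

    def receive(self, message_id, retry_count, payload, deliver):
        # Every end-to-end attempt is recorded as a distinct transmission.
        self.transmissions.add((message_id, retry_count))
        if message_id in self.delivered:
            # An earlier attempt already reached the application.
            return "duplicate"
        self.delivered.add(message_id)
        deliver(payload)
        return "delivered"
```

An intermediary that does more than route would need a rule for which
of the two identities it keys on, which is part of why this gets
complicated.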

> 
> Chris, it seems to me that you'd have to agree that hop-to-hop isn't
> adequate if someone were to provide a use case that met all of the
> following criteria:
> 
> - It's clearly and compellingly something that we must support.
> - The node in question must, for some reason, be treated as an
>   IM at the ebXML MS level; it cannot, for whatever reason,
>   be treated as if it were an SMTP store-and-forward mailer
>   at the underlying transport/communication layer.
> - The node is unreliable, e.g. it can drop (or duplicate)
>   messages for reasons other than the network being its
>   usual flakey message-dropping self.

Agreed, but I think that we can say that the intermediary node 
MUST be reliable. If we want to tackle this case, then let's do 
so in the next iteration. It is a more complicated problem than
many of us believe.

> 
> To put it another way, I think your (Chris) position is that there
> isn't ever going to be compelling reason to take an unreliable node
> and treat it as an IM.

Right. At least for the purposes of this version of the spec.

> 
> I would like to add that protocol-translating gateways are very much
> *not* the same thing as IM's.  Providing a use case in which we need
> to use an unreliable protocol-translating gateway does *not*
> contradict the hop-to-hop approach for dealing with IM's.  To knock
> down hop-to-hop, someone must come up with a genuine IM use case,
> corresponding to Figure 8-2 (in section 8.5.4, page 28, of the MS
> Spec).


References:
- Re: T2 Retry with Delivery Receipt
  - From: Dan Weinreb <dlw@exceloncorp.com>