OASIS Mailing List ArchivesView the OASIS mailing list archive below
or browse/search using MarkMail.


Help: OASIS Mailing Lists Help | MarkMail Help

ebxml-msg message

[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]

Subject: RE: [ebxml-msg] a more in-depth look at the relayed multi-hop

inline <JD>

From: Pim van der Eijk [mailto:pim.vandereijk@oasis-open.org]
Sent: Tuesday, March 04, 2008 1:58 AM
To: Durand, Jacques R.; ebxml-msg@lists.oasis-open.org
Subject: RE: [ebxml-msg] a more in-depth look at the relayed multi-hop

Hello Jacques,
I agree that the proposal presents many challenges. I also don't claim it is better than other proposals yet, before understanding the implications more fully. But I also think it has some advantages from a functionality point of view. More comments in-line.


From: Durand, Jacques R. [mailto:JDurand@us.fujitsu.com]
Sent: 04 March 2008 03:22
To: ebxml-msg@lists.oasis-open.org
Subject: [ebxml-msg] a more in-depth look at the relayed multi-hop

Here is a more detailed analysis of the "ack relay" multi-hop technique, highlighting some "challenges" it presents.
Two major points are developed here:
(1) AtMostOnce delivery assurance
(2) a (critical) analysis of the Ack relay algorithms.
1. AtMostOnce delivery assurance
In the relayed acks technique, AtMostOnce (duplicate elimination) cannot be supported end-to-end
based on RM sequence numbers. It cannot be delegated to the string of RM modules on the way.
For example while the message is transiting in the intermediary,
between the removal and the insertion of a new RM header, it could be duplicated during a crash
recovery, and the dup be assigned a new RM seq number, undetectable afterward.
The only way to support it end-to-end is to implement it based on eb:MessageId, in the ebMS processor.
Not being able to rely on seq numbers reduces the efficiency and scalability of the dup check,
and forces its implementation at ebMS processing level.
This is at odds with existing endpoint MSHs that conform to Core V3 and that will naturally implement
it in the RM module, based on an efficient dup check using seq numbers.
These endpoint MSHs would not be able to enforce reliable AtMostOnce in a multi-hop configuration.
1)  If this can happen in an intermediary, it could happen in any sending MSH too, if there is a crash between the ebMS packaging (ebMS MessageId) and the WS-RM handling (WS-RM MessageId).  In that case the sending MSH could send an ebMS message twice, with different RM IDs. 
<JD> but a major difference, is that the Sending party is responsible for this duplication, not the RM layer for which the contract starts when the message is "sent" (defined in Fig 1 of WS-RM as the time the message is passed to RMS) and ends when "delivered" (although the delivery part is not really covered in WS-RM protocol).
If the message is "sent" twice, the Sending MSH itself should "extend" the RM contract and at least detect this as required in Core V3, section 8.2.2. i.e. do its part in case of failure before the RM module. Same thing on the Receiver side: receiving MSH could  crash and never deliver to Consumer: with WS-RM that is not the responsibility of the RM layer. But everything that happens between the initial "send"  and the final "deliver" is under the RM contract - which we all are striving to fulfill end-to-end.
Of course we could require of an MSH Intermediary that it also "extends" the RM contract, but that is yet one more feature to ask of the ebMS layer.
2)  We could add a column "ebMS_ID" to the Message Mapping table, and require the pair (ebMS_ID, To) to be unique in the table. Any attempt to forward a message with the same ebMS MessageId to the same destination would then fail. The intermediary in effect would do ebMS duplicate elimination. Not sure if we want it to ... 
<JD> possible - but again one more responsibility for the ebMS intermediary.
3)  The intermediary is a single "node". The receiving and sending/forwarding parts of it would probably be using a single underlying database, quite possibly (if this is the core interconnection of a large community) a clustered/high available version of a DBMS.  Updates to the Message Mapping Table and inserts in the RM sequence tables (with an AutoIncrement sequence number) could be implemented as transactions.
I see a much more serious problem with my proposal with "In Order" delivery.  If the intermediary receives A:1 and A:4, it cannot forward A:4 until it has received A:2 and A:3 (because it doesn't know if those messages are to the same destination or not). So there could be a significant amount of resequencing (see http://www.enterpriseintegrationpatterns.com/Resequencer.html).  This can be done (I know a MOM production system that does exactly this ..), but the overhead due to queuing is huge.  In this case the simpler situtation where there is a one-to-one mapping between In-Sequences and Out-Sequences (as in Sander's original proposal) is preferable.  Conformance profiles could specify that the intermediary does not support in-order end-to-end messaging.  
<JD> I saw the support for "InOrder" not as critical to achieve for multi-hop (e.g. is not part of the basic conformance profiles of core V3) so I did not insist on it. Most of the troubles for implementing relayed acks come from the possibility for an intermediary to "split" an RM sequence into several outbound RM sequences.
2. The relay algorithms.
The "mapping" table shown in Pim illustration makes the relay technique look deceptively simple and easy.
But it is only one side of the picture.
- First, it should be noted that, unlike a routing table which is in general a read-only
structure, this is a data structure with a very high rate of updates.
This makes scalability much harder to guarantee (e.g. database indexes are costly to maintain
when update rate is high, and simply ineffective when too high).
- It appears that one of the two major options in implementing the relayed acks,
will not work. These two options were:
(Op1) when forwarding an RM message from sequence S1 to sequence S2, the intermediary
keeps the same sequence number for the message (so only seq ID is changing), so that the Acks
received for S2 only need a seq ID substitution before being forwarded back for S1.
(Op2)when forwarding an RM message from sequence S1 to sequence S2, the intermediary
assigns a new sequence number to the message. the Acks received for S2 then need be mapped to
the sequence numbers for S1.
- The reason why Op1 won't work is that by splitting the message flow on S1 into possibly
several new outgoing RM sequences (S2, S3...) through an intermediary, it would be impossible to keep
contiguous seq numbers for each one of these outgoing sequences. This contiguity is
fundamental to the notion of "sequence numbers": RM modules are hardwired to control this
by-1 increment, required by RM spec (Section 3.7: "RM Source MUST assign each message within
a Sequence a MessageNumber element that increments by 1 from an initial value of 1.")
- So we are left with Op2, which requires significant work to convert an Ack into another Ack
when doing relay. To understand the kind of mapping that needs be done for each Ack, here
is a snapshot of the situation the intermediary will face at any time
(reusing here the sequence IDs in Pim's table: sequence A is split into seq C and seq D
by the intermediary. The number "3" in A:3 is the seq number of the message)
- at some point in time, the "mapping table" in the MSH intermediary will look like:
(the arrow --> means the intermediary has forwarded the message to the next sequence)
A:1 --> C:1 (acked)
A:2 --> D:1(no Ack yet)
(A:3 not received yet)
A:5 --> C:2 (acked)
A:4 (received late...) --> C:3 (no Ack yet)
A:6 --> D:2 (acked)
A:3 (received late...) --> C:4 (acked)
A:7 --> D:3 (no Ack yet)
A:8 missing
A:9 --> D:4 (acked)
The RM Acks for sequence A will look like:
At time message A:6 is received by intermediary, and its "relay ack" received for seq D:
<wsrm:AcknowledgementRange Upper="1" Lower="1"/>
<wsrm:AcknowledgementRange Upper="6" Lower="5"/>
At time message A:9 is received by intermediary, and its "relay ack" received for seq D:
<wsrm:AcknowledgementRange Upper="1" Lower="1"/>
<wsrm:AcknowledgementRange Upper="3" Lower="3"/>
<wsrm:AcknowledgementRange Upper="6" Lower="5"/>
<wsrm:AcknowledgementRange Upper="9" Lower="9"/>
The RM ack is in fact a set of "acked intervals" that cover the entire sequence,
(meaning every past message is in fact reacknowledged in every Ack)
and although over time we can expect the emergence of a "solid" single interval for the oldest messages
(meaning the laggards have been received, they have all been acknowledged at last etc.)
the reality is that the status of the most recent messages will always be a bit messy as illustrated here:
some "sequence gaps", several ack ranges, until these recent messages settle in turn into a stable,
solid Acked interval covering all past messages. Some refer to such a datastructure, as "foamy".
The bottom line is that each Ack creation for seq A will require the mapping and merging of a similar
set of ack ranges from sequence C and sequence D. And the "foam" from Acks of C and D is adding up in
Acks for A (meaning, if seq A splits into 100 smaller sequences, and each one of these has an average
of 3 ackranges at any time in their Acks, then seq A will have an average of 300 ackranges in everyone of
its acks). This also means that if the RM destination of sequence D is down for some time,
its non-acknowledged message numbers will reflect in all consolidated Acks for sequence A,
and will fragment these Acks.
E.g. in our example, for 9 messages sent over seq A the intermediaries will not be able to do less
than 3 ack ranges as long as D server is down. Imagine what the Acks will look like after 9999 messages
are sent...
(it may very well be that several MSHs senders of messages for sequence D, alltogether generate a few
thousand messages before realizing that D server is down and before stopping their sending - it would take just a
few minutes. As soon as that happens, intermediaries may find themselves having to deal with Acks,
each one of them containing thousands of ackranges !)
This can happen in situations without intermediaries too.  In one actual project a network problem caused 80% of all connections to one system to fail.  Eventually all messages got delivered, but massively out of sequence, and some messages hours late ...  
<JD> Right. but here, if you "split" the sequence A into C and D, and D destination is down but not C destination, and A messages are forwarded to C and  D alternatively, then you'll end-up with many "gaps" in the sequence A: every other message will not be acknowledged for a while, yet you still need to ack C messages, so you may endup with "huges" Acks (many ackranges) for seq A.
The general point is that creating such RM Acks requires a fairly complex mapping algorithm for the intermediary,
that is not very robust in case of server or connection downtime, and the performance of which
can degrade quite quickly. The cost of this algorithm shows for each Ack at each intermediary.
To the complexity of the algorithm, add its reliance on a transient state info in the intermediary
(the ever-changing mapping table), and you get a solution with a high price to pay in development effort,
testing/validation effort, and for a run-time robustness and scalability that is brittle.
In effect the proposal is a about updating a N:M mapping between sequences based on a one-to-one mapping between messages in those sequences. The easy/naive implementation of this is no doubt incredibly slow ...
Compare this with a solution where the intermediary has ...nothing to do at all in terms of RM processing,
after an end-to-end single RM sequence has been established between two parties.
Yes, but that proposal lacks the generality of being able to deliver some messages in a sequence, and forwarding others to multiple destinations. I agree that perhaps the price to pay to support that generality is too high ..  
<JD> the other proposal has its challenges too... but these challenges are more on the initial handshaking for establishing sequences, not on the actual usage of the sequence over time. We still need to get "at the bottom" of that one too, though. Before this I think we need to clarify some basic routing questions that concern both proposals (e.g. routing of Receipt & Error signals in responses to user messages)
I think we need to be aware of all this before even considering this solution as a general candidate.
I believe there might be some cases where it has merits, but for several other multi-hop configurations
these are clearly serious challenges.

[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]