virtio-comment message



Subject: Re: RE: Re: Re: Re: [PATCH v2 06/11] transport-fabrics: introduce command set




On 6/9/23 01:01, Parav Pandit wrote:

From: Stefan Hajnoczi <stefanha@redhat.com>
Sent: Thursday, June 8, 2023 12:41 PM

For stream protocols, this always works fine.
For keyed protocols, for example RDMA, the target side needs to use
ibv_post_recv to post a receive buffer of the larger size
(sizeof(virtio_of_command_connect) + sizeof(virtio_of_connect)). If the
target uses ibv_post_recv to receive only
sizeof(CMD) + sizeof(DESC) * 1, the initiator's RDMA SEND fails.
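
As a rough libibverbs sketch of the target-side receive posting described
above (assuming an already created QP and a registered MR covering 'buf';
the virtio_of_* struct definitions come from this series and are not shown):

#include <stdint.h>
#include <stddef.h>
#include <infiniband/verbs.h>

/* Post one receive WR whose single SGE covers 'len' bytes of 'buf'.
 * For the keyed (RDMA) case, 'len' must be at least
 * sizeof(struct virtio_of_command_connect) + sizeof(struct virtio_of_connect),
 * otherwise the incoming SEND overruns the buffer and the receive completes
 * with an error.
 */
static int post_connect_recv(struct ibv_qp *qp, void *buf, size_t len,
                             uint32_t lkey)
{
        struct ibv_sge sge = {
                .addr   = (uintptr_t)buf,
                .length = (uint32_t)len,
                .lkey   = lkey,         /* lkey of the MR covering 'buf' */
        };
        struct ibv_recv_wr wr = {
                .wr_id   = (uintptr_t)buf,
                .sg_list = &sge,
                .num_sge = 1,
        };
        struct ibv_recv_wr *bad_wr = NULL;

        return ibv_post_recv(qp, &wr, &bad_wr);
}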

I read that "A RC connection is very similar to a TCP connection" in
the NVIDIA documentation
(https://docs.nvidia.com/networking/display/RDMAAwareProgrammingv17/Transport+Modes)
and expected SOCK_STREAM semantics for RDMA SEND.

Are you saying ibv_post_send() fails when the receiver's work
request sg_list size is smaller (fewer bytes) than the sender's?


Yes, it will fail.
The receiver gets a CQE with status 'IBV_WC_LOC_LEN_ERR'; see
https://www.rdmamojo.com/2013/02/15/ibv_poll_cq/

Parav: Can you confirm that this is expected?

ibv_post_send() will not fail because it is a queuing interface.
But the send operation itself will fail: the send (requester) side gets an error completion, which moves the QP to the error state.
The receive QP also moves to the error state.
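
For example, a minimal sketch of how this shows up at completion time
(assuming libibverbs): the post calls return 0, and the error only appears
in the work completion status.

#include <stdio.h>
#include <infiniband/verbs.h>

/* Drain completions and report errors. With a too-small receive buffer the
 * receiver typically sees IBV_WC_LOC_LEN_ERR here even though the earlier
 * ibv_post_recv()/ibv_post_send() calls returned 0; once the QP is in the
 * error state, later WRs complete with IBV_WC_WR_FLUSH_ERR.
 */
static void drain_cq(struct ibv_cq *cq)
{
        struct ibv_wc wc;

        while (ibv_poll_cq(cq, 1, &wc) > 0) {
                if (wc.status != IBV_WC_SUCCESS)
                        fprintf(stderr, "wr_id %llu failed: %s\n",
                                (unsigned long long)wc.wr_id,
                                ibv_wc_status_str(wc.status));
        }
}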

This makes it hard to inline payloads as I was suggesting before :(.

What I was suggesting in the other thread is that if we want to inline the payload, we should do the following:
an RDMA WRITE followed by an RDMA SEND. That way, a block write command's actual data can be placed directly in, say, a 4K memory region on the target.

This way, the sender and receiver work with constant-size buffers on the send and receive queues.
RDMA is message-based, not byte-stream-based.
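
A rough sketch of that WRITE-then-SEND sequence (assuming libibverbs; how
the initiator learns the target's remote_addr/rkey for the 4K data region
is not shown here):

#include <stdint.h>
#include <infiniband/verbs.h>

/* Sketch: place the block write payload on the target with RDMA WRITE,
 * then post a constant-size SEND carrying only the command. The two WRs
 * are chained, so on an RC QP the SEND is executed after the WRITE.
 */
static int post_write_then_send(struct ibv_qp *qp,
                                struct ibv_sge *data_sge, /* payload, e.g. up to 4K */
                                uint64_t remote_addr, uint32_t rkey,
                                struct ibv_sge *cmd_sge)  /* fixed-size command */
{
        struct ibv_send_wr send_wr = {
                .wr_id      = 2,
                .sg_list    = cmd_sge,
                .num_sge    = 1,
                .opcode     = IBV_WR_SEND,
                .send_flags = IBV_SEND_SIGNALED,
        };
        struct ibv_send_wr write_wr = {
                .wr_id   = 1,
                .next    = &send_wr,
                .sg_list = data_sge,
                .num_sge = 1,
                .opcode  = IBV_WR_RDMA_WRITE,
                .wr.rdma = {
                        .remote_addr = remote_addr,
                        .rkey        = rkey,
                },
        };
        struct ibv_send_wr *bad_wr = NULL;

        return ibv_post_send(qp, &write_wr, &bad_wr);
}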

An inline RDMA write is often called an eager buffer, similar to a PCIe write-combining buffer.

Both likely do not work at scale, as buffer sharing becomes difficult across multiple connections.
It is a memory vs. performance trade-off.
But doable.

We should start by first establishing the data transfer model covering the 512B to 1M range, and take up the optimizations as extensions.



Hi, Parav

What do you think about another RDMA inline proposal in
'[PATCH v2 11/11] transport-fabrics: support inline data for keyed transmission'?

1. Use the feature command to get the target max recv buffer size, for example 16K.
2. Use the feature command to set the initiator max recv buffer size, for example 16K.

If the size of the payload is less than the max recv buffer size, a single RDMA SEND is enough. For example, when virtio-blk writes 8K: 16 + 8192 < 16384, so a single RDMA SEND is fine.
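
On the initiator side, the decision then reduces to a simple size check,
roughly like this (the names here are illustrative, not from the spec; the
16 bytes stand for the command size in the example above):

#include <stdbool.h>
#include <stddef.h>

/* Return true if command + payload fit into the peer's receive buffer and
 * can be carried in a single RDMA SEND; otherwise fall back to the keyed
 * (memory region based) transfer.
 * Example from above: 16 + 8192 < 16384, so a single SEND is enough.
 */
static bool use_inline_send(size_t cmd_len, size_t payload_len,
                            size_t peer_max_recv_buf)
{
        return cmd_len + payload_len <= peer_max_recv_buf;
}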

--
zhenwei pi

