Subject: RE: Re: Re: Re: [PATCH v2 06/11] transport-fabrics: introduce command set


> From: Stefan Hajnoczi <stefanha@redhat.com>
> Sent: Thursday, June 8, 2023 12:41 PM

> > > > For the stream protocol, it always works fine.
> > > > For a keyed protocol, for example RDMA, the target side needs to use
> > > > ibv_post_recv to receive the full size (sizeof(virtio_of_command_connect)
> > > > + sizeof(virtio_of_connect)). If the target uses ibv_post_recv to receive
> > > > only sizeof(CMD) + sizeof(DESC) * 1, the initiator fails in RDMA SEND.
> > >
> > > I read that "A RC connection is very similar to a TCP connection" in
> > > the NVIDIA documentation
> > > (https://docs.nvidia.com/networking/display/RDMAAwareProgrammingv17/
> > > Transport+Modes) and expected SOCK_STREAM semantics for RDMA SEND.
> > >
> > > Are you saying ibv_post_send() fails when the receiver's work
> > > request sg_list size is smaller (fewer bytes) than the sender's?
> > >
> >
> > Yes, it will fail.
> > The receiver gets a CQE with status 'IBV_WC_LOC_LEN_ERR', see
> > https://www.rdmamojo.com/2013/02/15/ibv_poll_cq/
> 
> Parav: Can you confirm that this is expected?
> 
ibv_post_send() itself will not fail, because it is only a queuing interface.
But the send operation will fail via an error completion on the send (requester) side, which moves the QP to the error state.
The receive queue also moves to the error state.
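
To make that concrete, here is a rough libibverbs sketch of the requester side (the names drain_send_cq/cq/qp are placeholders of mine, not from the patch series): the post succeeds, the error only shows up when polling the CQ, and the QP is in the error state afterwards.

/* Sketch only: ibv_post_send() just queues the WR; the failure surfaces
 * later as an error completion on the requester's CQ, after which the QP
 * is in IBV_QPS_ERR and remaining WRs get flushed. The receiver side is
 * the one that sees IBV_WC_LOC_LEN_ERR. */
#include <stdio.h>
#include <infiniband/verbs.h>

static void drain_send_cq(struct ibv_cq *cq, struct ibv_qp *qp)
{
        struct ibv_wc wc;
        struct ibv_qp_attr attr;
        struct ibv_qp_init_attr init_attr;

        while (ibv_poll_cq(cq, 1, &wc) > 0) {
                if (wc.status != IBV_WC_SUCCESS)
                        fprintf(stderr, "send wr %llu failed: %s\n",
                                (unsigned long long)wc.wr_id,
                                ibv_wc_status_str(wc.status));
        }

        if (!ibv_query_qp(qp, &attr, IBV_QP_STATE, &init_attr) &&
            attr.qp_state == IBV_QPS_ERR)
                fprintf(stderr, "QP moved to the error state\n");
}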

> This makes it hard to inline payloads as I was suggesting before :(.

What I was suggesting in the other thread is that if we want to inline the payload, we should do the following:
an RDMA WRITE followed by an RDMA SEND. That way, the actual data of a block write command can be placed directly in, say, a 4K buffer on the target.

This way, the sender and the receiver work with constant-size buffers in the send and receive queues.
RDMA is message based, not byte-stream based.
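
Roughly, the posting side of that model could look like the sketch below (assumptions of mine, not from the patch series: remote_addr/rkey describe a 4K target buffer advertised at connect time, and the data/command buffers are already registered). Chaining the WRITE before the SEND on the same RC QP keeps both queues at a fixed message size while still moving the bulk data.

/* Sketch: push the data with an RDMA WRITE into the target's preallocated
 * 4K region, then post a constant-size SEND carrying only the command.
 * On an RC QP the WRs execute in order, so the command arrives after the
 * data has been placed. */
#include <stdint.h>
#include <infiniband/verbs.h>

int post_write_then_send(struct ibv_qp *qp,
                         void *data, uint32_t data_len, struct ibv_mr *data_mr,
                         void *cmd, uint32_t cmd_len, struct ibv_mr *cmd_mr,
                         uint64_t remote_addr, uint32_t rkey)
{
        struct ibv_sge data_sge = {
                .addr = (uintptr_t)data, .length = data_len, .lkey = data_mr->lkey,
        };
        struct ibv_sge cmd_sge = {
                .addr = (uintptr_t)cmd, .length = cmd_len, .lkey = cmd_mr->lkey,
        };
        struct ibv_send_wr send_wr = {
                .wr_id = 2, .sg_list = &cmd_sge, .num_sge = 1,
                .opcode = IBV_WR_SEND, .send_flags = IBV_SEND_SIGNALED,
        };
        struct ibv_send_wr write_wr = {
                .wr_id = 1, .next = &send_wr, .sg_list = &data_sge, .num_sge = 1,
                .opcode = IBV_WR_RDMA_WRITE,
                .wr.rdma.remote_addr = remote_addr, .wr.rdma.rkey = rkey,
        };
        struct ibv_send_wr *bad_wr;

        return ibv_post_send(qp, &write_wr, &bad_wr);
}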

The inline RDMA write buffer is often called an eager buffer, similar to a PCIe write-combine buffer.

Neither is likely to work at scale, as buffer sharing becomes difficult across multiple connections.
It is a memory vs. performance trade-off, but doable.

We should start by first establishing the data transfer model covering the 512B to 1M range, and take up the optimizations as extensions.



