Subject: RE: Re: [virtio-comment] [PROPOSAL] Virtio Over Fabrics(TCP/RDMA)



> From: zhenwei pi <pizhenwei@bytedance.com>
> Sent: Thursday, April 27, 2023 4:21 AM
> 
> On 4/25/23 13:03, Parav Pandit wrote:
> >
> >
> [...]
> >
> > I briefly looked at your rdma command descriptor example, which is not
> > aligned to 16B. Perf-wise it will be worse than nvme rdma fabrics.
> >
> 
> Hi,
> I'm confused here, could you please give me more hint?
> 1. The size of the command descriptor (as I defined it in the example) is larger
> than the command size of nvme rdma; the extra overhead leads to worse performance
> than nvme over rdma.
> 
Which structure?

I am guessing from the header file that you have,

virtio_of_command_vring
followed by
virtio_of_vring_desc[cmd.ndesc] where cmd.opcode = virtio_of_op_vring

If so, it seems fine to me.
However, the lack of the actual command in the virtio_of_command_vring struct is not so good.
Such indirection overhead only reduces performance, since the blk storage target side does not receive constant-size data.
And even if it does somehow, it requires a two-level protocol parser.
This can be simplified since you are not starting with any history here; the abstraction point could possibly be virtio commands rather than the vring.
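
For reference, here is a rough userspace-style sketch of how I read that layout: a vring command followed by cmd.ndesc descriptors, parsed in two steps. The struct fields below are placeholders from my reading of [1], not the actual definitions.

#include <stdint.h>
#include <stddef.h>
#include <string.h>

struct virtio_of_command_vring {
	uint16_t opcode;	/* virtio_of_op_vring */
	uint16_t command_id;
	uint16_t ndesc;		/* number of descriptors that follow */
	/* ... remaining fields omitted ... */
};

struct virtio_of_vring_desc {
	uint64_t addr;
	uint32_t length;
	uint16_t id;
	uint16_t flags;
};

/*
 * Two-level parse on the target: the command header first, then a
 * variable number of trailing descriptors, so the inbound message is
 * not a constant-size command. Endianness handling is omitted here.
 */
int parse_vring_cmd(const uint8_t *buf, size_t len,
		    struct virtio_of_command_vring *cmd,
		    const struct virtio_of_vring_desc **descs)
{
	if (len < sizeof(*cmd))
		return -1;
	memcpy(cmd, buf, sizeof(*cmd));
	if (len < sizeof(*cmd) + (size_t)cmd->ndesc * sizeof(**descs))
		return -1;
	/* This pointer is only naturally aligned if sizeof(*cmd) is a
	 * multiple of 8, which is where the alignment concern comes in. */
	*descs = (const struct virtio_of_vring_desc *)(buf + sizeof(*cmd));
	return 0;
}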

I don't see a need for the desc to have an id and flags the way it is drafted over the RDMA fabrics.
What I had in mind was something like:
struct virtio_of_descriptor {
	le64 addr;
	le32 len;
	union {
		le32 rdma_key;
		le32 id;	/* + reserved */
		le32 tcp_desc_id;
	};
};

We can possibly define appropriate virtio fabric descriptors; at that point, the abstraction point is not literally taking the vring across the fabric.

Depending on the use case, maybe starting with either TCP or RDMA makes sense, instead of cooking everything at once.

> 2. The command size not being aligned to 16B leads to a performance issue on the RDMA
> SEND operation. My colleague Zhuo helped me test the performance of sending
> 16/24/32 bytes:
> taskset -c 30 ib_send_bw -d mlx5_2 -i 1 -x 3 -s 16 -t 1 xx.xx.xx.xx
> taskset -c 30 ib_send_bw -d mlx5_2 -i 1 -x 3 -s 24 -t 1 xx.xx.xx.xx
> taskset -c 30 ib_send_bw -d mlx5_2 -i 1 -x 3 -s 32 -t 1 xx.xx.xx.xx
> The QPS seems almost the same.
> 
Structure [1] places the subsequent vring_desc[] descriptors at addresses that are not 8B-aligned, which results in partial writes of the desc.
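
Purely as an illustration, assuming a command header whose size is not a multiple of 16 (the 20 bytes below is a made-up number, not the actual size in [1]), the trailing 16-byte descriptors land at offsets that are not even 8B-aligned:

#include <stdio.h>

int main(void)
{
	const unsigned int hdr = 20;	/* hypothetical command header size */
	const unsigned int desc = 16;	/* size of one vring descriptor */

	for (unsigned int i = 0; i < 4; i++) {
		unsigned int off = hdr + i * desc;
		printf("desc[%u] at offset %u: 8B aligned=%s, 16B aligned=%s\n",
		       i, off, off % 8 ? "no" : "yes", off % 16 ? "no" : "yes");
	}
	return 0;
}

Padding the command to a 16B multiple (or defining fixed 16B descriptors as above) keeps every descriptor naturally aligned.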

It is hard to say from the ib_send_bw test what is actually being done.
I remember mlx5 has cache-aligned accesses, NOP WQE segments, and more.

I also don't see the 'id' field coming back in the response command_status.
So why transmit it over the fabric if it is not used?
Did I miss the id on the completion side?

[1] https://github.com/pizhenwei/linux/blob/7a13b310d1338c462f8e0b13d39a571645bc4698/include/uapi/linux/virtio_of.h#L129

