virtio-comment message



Subject: Re: [virtio-comment] RE: [virtio-dev] RE: [virtio-comment] RE: [virtio-dev] RE: [virtio-comment] [PATCH v2] virtio-net: support setting coalescing params for multiple vqs




On 2024/1/24 9:18, Parav Pandit wrote:
From: Heng Qi <hengqi@linux.alibaba.com>
Sent: Wednesday, January 24, 2024 6:31 PM


On 2024/1/22 1:03, Parav Pandit wrote:
From: Heng Qi <hengqi@linux.alibaba.com>
Sent: Monday, January 22, 2024 8:27 AM

On 2024/1/20 5:59, Parav Pandit wrote:
From: Heng Qi <hengqi@linux.alibaba.com>
Sent: Wednesday, January 17, 2024 10:22 AM

On 2024/1/15 9:21, Parav Pandit wrote:
From: virtio-comment@lists.oasis-open.org
<virtio-comment@lists.oasis-open.org> On Behalf Of Heng Qi
Sent: Monday, January 15, 2024 6:36 PM

Currently, each time the driver attempts to update the
coalescing parameters for a vq, it needs to kick the device and
wait for the ctrlq response to return.
It does not need to wait. This is a driver limitation that does
not use the queue as a "queue".
Such a driver limitation should be removed in the driver. It does
not qualify as a limitation.

Yes, we don't have to wait.

But in general, for user commands, it is necessary to obtain the
final results synchronously.
Yes. A user-initiated command can enqueue the request to the cvq and
go to sleep for several microseconds to milliseconds.
The user command cannot return before the final result is obtained.
And the wait is not the problem this patch solves.

By not holding the rtnl lock, other contexts that need to enqueue
requests, such as netdim, can make progress.

I would like to see the use of the rtnl lock changed.

Inside virtnet_rx_dim_work() there is an rtnl lock call.
A virtio_device-level lock should be used for the cvq instead. :)

In addition, I have made the netdim commands batched and
asynchronous; you can refer to this patch:
https://lore.kernel.org/all/1705410693-118895-4-git-send-email-hengqi@linux.alibaba.com/

In the driver patch listed above, the motivation "to optimize the CPU
overhead of the DIM worker caused by the guest being busy waiting for
the command response result"

is not right,
because the guest is still busy waiting.
There is no busy waiting for guests; see get_cvq_work().

Ok, it is not always busy waiting, but sometimes it is.

Busy waiting will only occur when a user command or dim command cannot find an available buffer for the cvq.

User commands still use polling mode for now; I have not tried to optimize this. The current focus is on improving dim performance.

virtnet_cvq_response() should not have a flag..

The flag is mainly used to identify whether it is a user command. If so, the previous polling mode is still maintained.


Whoever acquires the OS-global rtnl lock ends up calling virtnet_cvq_response(), checking the flag, and releasing it.
It shouldn't be done this way with try-lock etc.
The rtnl lock is not supposed to protect a low-level driver's ctrlq response flag.

To summarize, we now want to make some improvements to the cvq:
1. A reasonable timeout in busy-waiting mode, or an interrupt-based scheme, etc.
2. Batch processing (the core problem is how to obtain results from user commands synchronously).
3. Removing rtnl_lock's protection of the ctrlq.

Yes, these improve the processing efficiency of the cvq and avoid OS hangs.
I think 1 and 3 need both short-term and long-term work, and we need to try.
I don't see any demand for batching user commands.

So for now, let's move forward with optimizing dim.



Without batching, due to the rtnl lock, every VQ command is serialized as one
outstanding command at a time in virtnet_rx_dim_work().
Because of this, the device is unable to benefit from DMA batching at large scale.
Adding dim commands is now asynchronous, and the device will receive
batches of commands.

I am comparing the code without the rtnl lock against the code you posted.

Okay, I can post a patch later based on the code without rtnl_lock.

The reference code of patch [1] needs to use the version without the OS-global rtnl lock as the baseline.

[1] https://lore.kernel.org/all/1705410693-118895-4-git-send-email-hengqi@linux.alibaba.com/

This will enable the driver to enqueue multiple cvq commands without
waiting for the previous one.
This will also enable the device to find natural coalescing across
multiple commands.

When batched user commands occur, ensuring synchronization is a
concern.
The following path is observed: 1. The driver kicks the device; 2.
After the device receives the kick, CPU scheduling occurs and it
DMAs multiple buffers multiple times; 3. The device completes
processing and replies with a response.
When large-queue devices issue multiple requests and kick the
device frequently, this often interrupts the work of the device-side CPU.
When there are large devices and multiple driver notifications from a
CPU that is N times faster than the device-side CPU, the device
may find natural coalescing on the commands of a given cvq.

First we have to solve ctrlq batch-adding of user (ethtool) commands.
Even if processed in a batched way on the device side, the number of
kicks and the number of backend DMAs is not reduced.
Driver notifications are PCI writes, so they should not hamper the
device side, which can ignore them when it does not care about them.

Driver notifications need to be processed by the DPU, which
interferes with the CPU on the DPU.

I was asking: is there any way for your DPU to ignore these
notifications while the previous one is pending?
From your above description, it seems there isn't.
While the device is processing the request, additional kicks are ignored.

Ok, that is good. In that case the kicks are not interfering, right?

YES.

When a VM sends many DIM requests, the requests are serialized and processed one at a time, and the number of kicks increases.

Even when we add commands asynchronously (such as this patch https://lore.kernel.org/all/1705410693-118895-4-git-send-email-hengqi@linux.alibaba.com/),
the device needs to process each command present in the ctrlq individually.
Each command requires the device CPU to wait for the DMA operation to complete before replying. Although kicks are reduced, the device's DMA operations for batched commands are serialized, which affects efficiency.


In addition, I would like to extend the VIRTIO_NET_F_VQ_NOTF_COAL feature to support the new VIRTIO_NET_CTRL_NOTF_COAL_VQS_SET command.
I think this is ok.

Thanks!


