virtio-comment message

Subject: Re: [virtio-comment] Re: [PATCH RESEND] virtio-net: support setting coalescing params for multiple vqs


On Tue, Dec 26, 2023 at 5:18 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Tue, Dec 26, 2023 at 11:32:12AM +0800, Jason Wang wrote:
> > > On Fri, Dec 22, 2023 at 2:45 PM Heng Qi <hengqi@linux.alibaba.com> wrote:
> > >
> > >
> > >
> > > On 2023/12/22 10:40, Jason Wang wrote:
> > > > On Thu, Dec 21, 2023 at 12:55 PM Heng Qi <hengqi@linux.alibaba.com> wrote:
> > > >>
> > > >>
> > > >> On 2023/12/21 11:37, Jason Wang wrote:
> > > >>> On Wed, Dec 20, 2023 at 10:40 PM Heng Qi <hengqi@linux.alibaba.com> wrote:
> > > >>>> Currently, each time the driver attempts to update the coalescing parameters
> > > >>>> for a vq, it must kick the device and wait for the ctrlq response to return.
> > > >>>>
> > > >>>> If a command can only update the parameters of one vq, then when the
> > > >>>> parameters are updated frequently (such as by netdim), the ctrlq on the
> > > >>>> device is kicked frequently, which increases the device's CPU scheduling
> > > >>>> overhead as well as the number and cost of device DMA operations.
> > > >>>>
> > > >>>> Merging the updated parameters of multiple vqs into one command effectively
> > > >>>> reduces the number of device kicks and device DMA operations.
> > > >>>>
> > > >>>> Test results show that this greatly improves the efficiency of the ctrlq in
> > > >>>> responding to multiple vq coalescing parameter updates issued by the driver.
> > > >>> netdim is per virtqueue, so to make use of this, you need to batch the
> > > >>> netdim requests first.
> > > >> Yes. I think the code makes this more intuitive.
> > > >> Please view the following PoC test code alongside the current code
> > > >> of upstream virtio netdim:
> > > >>
> > > >> +struct ctrl_coal_vq_entry {
> > > >> +       u16 vqn;
> > > >> +       u16 reserved;
> > > >> +       int usecs;
> > > >> +       int packets;
> > > >> +};
> > > >> +
> > > >> +struct virtio_net_ctrl_mrg_coal_vq {
> > > >> +       int num_entries;
> > > >> +       struct ctrl_coal_vq_entry coal_vqs[];
> > > >> +};
> > > >> +
> > > >>
> > > >> @@ -313,12 +335,18 @@ struct virtnet_info {
> > > >>
> > > >>           struct control_buf *ctrl;
> > > >>
> > > >> +       struct virtio_net_ctrl_mrg_coal_vq *ctrl_coal_vqs;
> > > >> +
> > > >>
> > > >>
> > > >> @@ -4390,6 +4586,7 @@ static int virtnet_alloc_queues(struct virtnet_info *vi)
> > > >>           if (!vi->rq)
> > > >>                   goto err_rq;
> > > >>
> > > >> +       vi->ctrl_coal_vqs = kzalloc(sizeof(struct virtio_net_ctrl_mrg_coal_vq) +
> > > >> +                                   vi->max_queue_pairs * sizeof(struct ctrl_coal_vq_entry),
> > > >> +                                   GFP_KERNEL);
> > > >>
> > > >>
> > > >> +       struct scatterlist sg;
> > > >> +       for (i = 0; i < vi->curr_queue_pairs; i++) {
> > > >> +                vi->ctrl_coal_vqs->coal_vqs[i].vqn = rxq2vq(i);
> > > >> +                vi->ctrl_coal_vqs->coal_vqs[i].usecs = 8;
> > > >> +                vi->ctrl_coal_vqs->coal_vqs[i].packets = 16;
> > > >> +        }
> > > >> +        vi->ctrl_coal_vqs->num_entries = i;
> > > >> +        sg_init_one(&sg, vi->ctrl_coal_vqs,
> > > >> +                    sizeof(struct virtio_net_ctrl_mrg_coal_vq) +
> > > >> +                    vi->ctrl_coal_vqs->num_entries * sizeof(struct ctrl_coal_vq_entry));
> > > >> +        vi->req_num++;
> > > >> +        if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_NOTF_COAL,
> > > >> +                              VIRTIO_NET_CTRL_NOTF_COAL_VQS_SET, &sg))
> > > >>
> > > >>    #define VIRTIO_NET_CTRL_NOTF_COAL_VQ_GET               3
> > > >> +#define VIRTIO_NET_CTRL_NOTF_COAL_VQS_SET              4
> > > >>
> > > >>>    And if you do that, you can batch the commands
> > > >>> as well.
> > > >>
> > > >> We have tried batching commands:
> > > >>
> > > >> +               for (i = 0; i < vi->curr_queue_pairs; i++) {
> > > >> +                       err = virtnet_pre_send_cmd(vi, &vi->vq_ctrls[i], 8, 16);
> > > >> +               }
> > > >> +
> > > >> +               if (unlikely(!virtqueue_kick(vi->cvq)))
> > > >> +                       return vi->ctrl->status == VIRTIO_NET_OK;
> > > >> +
> > > >> +               for (i = 0; i < vi->curr_queue_pairs; i++) {
> > > >> +                       while (!virtqueue_is_broken(vi->cvq) &&
> > > >> +                              !virtqueue_get_buf(vi->cvq, &tmp))
> > > >> +                               cpu_relax();
> > > >> +               }
> > > >> +       }
> > > >>
> > > >>
> > > >> +
> > > >> +static bool virtnet_pre_send_cmd(struct virtnet_info *vi,
> > > >> +                                struct vq_coal *vq_ctrls,
> > > >> +                                u32 max_usecs, u32 max_packets)
> > > >> +{
> > > >> +       struct scatterlist hdr, out, stat;
> > > >> +       struct scatterlist *sgs[4];
> > > >> +       unsigned out_num = 0;
> > > >> +       int ret;
> > > >> +
> > > >> +       BUG_ON(!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ));
> > > >> +
> > > >> +       vq_ctrls->hdr.class = VIRTIO_NET_CTRL_NOTF_COAL;
> > > >> +       vq_ctrls->hdr.cmd = VIRTIO_NET_CTRL_NOTF_COAL_VQ_SET;
> > > >> +       sg_init_one(&hdr, &vq_ctrls->hdr, sizeof(vq_ctrls->hdr));
> > > >> +       sgs[out_num++] = &hdr;
> > > >> +
> > > >> +       vq_ctrls->coal_vq.vqn = cpu_to_le16(0);
> > > >> +       vq_ctrls->coal_vq.coal.max_usecs = cpu_to_le32(max_usecs);
> > > >> +       vq_ctrls->coal_vq.coal.max_packets = cpu_to_le32(max_packets);
> > > >> +       sg_init_one(&out, &vq_ctrls->coal_vq, sizeof(vq_ctrls->coal_vq));
> > > >> +       sgs[out_num++] = &out;
> > > >> +
> > > >> +       vq_ctrls->status = ~0;
> > > >> +       sg_init_one(&stat, &vq_ctrls->status, sizeof(vq_ctrls->status));
> > > >> +       sgs[out_num] = &stat;
> > > >> +
> > > >> +       BUG_ON(out_num + 1 > ARRAY_SIZE(sgs));
> > > >> +       ret = virtqueue_add_sgs(vi->cvq, sgs, out_num, 1, vi, GFP_ATOMIC);
> > > >> +       if (ret < 0) {
> > > >> +               dev_warn(&vi->vdev->dev,
> > > >> +                        "Failed to add sgs for command vq: %d.\n", ret);
> > > >> +               return false;
> > > >> +       }
> > > >>
> > > >>
> > > >> Overall, batched reqs are sufficient, because the current major overhead
> > > >> is the number of DMAs.
> > > >> For example, for a device with 256 queues:
> > > >>
> > > >> with the current upstream code, the overhead is 256 kicks + 256*8 DMAs;
> > > >> the overhead of batched cmds is 1 kick + 256*8 DMAs.
> > > > This can be further optimized by using IN_ORDER. The DMA can be batched as well.
> > >
> > > DMA batching reduces the number of DMA requests, but not the actual number
> > > of DMAs.
> >
> > Well, if we do IN_ORDER, then:
> >
> > 1 DMA for the avail idx.
> >
> > At most 2 DMAs are needed for all descriptors.
> >
> > The driver can allocate an array of cvq commands, so the cvq commands
> > could be fetched in at most 2 DMAs.
> >
> > etc.
>
> For most users, all this happens maybe once per VM start.  It's not
> reasonable to start micro-optimizing this until it takes so much
> time that it slows VM start measurably.  With each DMA taking at most
> an extra microsecond, I just do not see the benefit; even with 1000s
> of DMAs, this will take at most milliseconds.

Probably. It is just an example of a possible optimization.
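
As a rough sketch of that idea (hypothetical structures, not from any
patch): the driver could keep the pending cvq commands in one contiguous
array, so that with IN_ORDER the device can fetch the descriptors and
the command payloads in a small, fixed number of DMA reads instead of
one read per command:

#include <linux/types.h>
#include <linux/virtio_net.h>

/* Hypothetical layout, for illustration only. */
struct cvq_cmd_slot {
        struct virtio_net_ctrl_hdr hdr;         /* class/cmd of this command */
        struct virtio_net_ctrl_coal_vq coal_vq; /* per-vq parameters */
        __u8 status;                            /* written back by the device */
};

struct cvq_cmd_array {
        __le16 num_cmds;                /* commands queued in this batch */
        __le16 reserved;
        struct cvq_cmd_slot slots[];    /* one contiguous slot per command */
};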

To make the proposal more general, a generic container command would
be better than an ad-hoc batch command just for coalescing.
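
For example, such a container could look like this (a hypothetical
layout for illustration, not a spec proposal): one cvq command that
wraps N inner commands, each with its own class/command, so coalescing,
ARFS, vlan/mac filter updates, etc. could all be batched the same way:

#include <linux/types.h>

/*
 * Hypothetical container command, for illustration only: the device
 * walks the entries and executes each one as if it had arrived as a
 * separate command.
 */
struct virtio_net_ctrl_container_entry {
        __u8 class;     /* inner command class, e.g. VIRTIO_NET_CTRL_NOTF_COAL */
        __u8 command;   /* inner command, e.g. VIRTIO_NET_CTRL_NOTF_COAL_VQ_SET */
        __le16 len;     /* length of data[] in bytes */
        __u8 data[];    /* inner command payload */
};

struct virtio_net_ctrl_container {
        __le16 num_entries;     /* number of inner commands that follow */
        __le16 reserved;
        /* struct virtio_net_ctrl_container_entry entries follow, back to back */
};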

Thanks

>
>
> > > For our DPU, how much to batch and when to end the batch also need to be
> > > evaluated in detail.
> > >
> > > >
> > > > Or even using packed virtqueue.
> > > >
> > > >> The overhead of batched reqs is 1 kick + 8 DMAs.
> > > > If this is a real issue for your architecture, we should not make it
> > > > work only for coalescing. Any per-vq tweaking may suffer from that.
> > >
> > > YES.
> > >
> > > > E.g. if you want to do ARFS in the future. So it should be, e.g., a
> > > > container command for cvq instead of just the coalescing command
> > > > itself.
> > > >
> > > > We need to determine whether the existing facility is sufficient.
> > >
> > > We considered updating to a generic ctrlq batch command, but so far we
> > > have only seen this requirement for netdim and aRFS.
> >
> > Two use cases should be sufficient to justify it.
> >
> > We do have others, for example the vlan/mac filters.
> >
> >
> > > At the same time, aRFS has not yet entered the community.
> > > We will try to add a batch request interface for it, and it may have a
> > > dedicated flow vq to ensure that its command issuances, which outnumber
> > > netdim's, are responded to in time.
> > > So for now we plan to advance the batching of coalescing parameter
> > > requests first.
> >
> > Thanks
>


