
Subject: RE: [virtio-comment] [PATCH 5/5] virtio-pci: implement VIRTIO_F_QUEUE_STATE


> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> Sent: Monday, September 11, 2023 2:17 PM

[..]
> > The hypervisor needs to do the right setup anyway for using PCI-spec-defined
> access control and other semantics, which is outside the scope of [1].
> > It is outside primarily because proposal [1] is not migrating the whole "PCI
> device".
> > It is migrating the virtio device, so that we can migrate from a PCI VF member
> to some software-based device too.
> > And vice versa.
> Since you talked about P2P: the IOMMU is basically for address space isolation. For
> security reasons, it is usually suggested to pass through all devices in one IOMMU
> group to a single guest.
> 
An IOMMU group is an OS concept; there is no need to mix it in here.

> That means, if you want the VF to perform P2P with the PF where the AQ
> resides, you have to place them in the same IOMMU group and pass through
> them all to a guest. So how can this AQ serve other purposes?
> >
A PF resides on the hypervisor. One or more VFs are passed through to the VM.
When one wants to do nesting, maybe one of the VFs can take on the admin role for its peer VFs.
Such an extension is only needed for nesting.

For non-nesting, which is the known common case to us, such an extension is not needed.
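
For illustration, here is a rough C-style sketch of how the owner/member
addressing looks in an administration command, going by the administration
command chapter as I recall it. The struct and field names below are mine
and only illustrative; the spec text is authoritative.

#include <stdint.h>

/* Rough sketch only -- see the administration command chapter for the
 * authoritative layout. The owner driver (today, the PF driver) queues
 * this on its admin VQ; group_member_id selects the VF being
 * administered, so the member VF never has to expose the facility
 * itself. */
struct admin_cmd_sketch {
	uint16_t opcode;           /* which administration command */
	uint16_t group_type;       /* e.g. the SR-IOV group */
	uint64_t group_member_id;  /* member (VF) number within the group */
	/* device-readable command-specific data follows, then the
	 * device-writable status and command-specific result */
};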

> >>> QOS is such a broad term that it is hard to debate unless you get to a
> >>> specific
> >> point.
> >> E.g., there can be hundreds or thousands of VMs; how many admin vqs
> >> are required to serve them during LM? To converge, with no timeout.
> > How many RSS queues are required to reach 800Gbps NIC performance at what
> queue depth and at what interrupt moderation level?
> > Such details are outside the scope of the virtio specification.
> > Those are implementation details of the device.
> >
> > Similarly here for the AQ too.
> > The inherent ability of the AQ to queue commands and execute them out of order
> in the device is the fundamental reason the AQ was introduced.
> > And one can have more AQs to do unrelated work, mainly for the hypervisor
> owner device which wants to enqueue unrelated commands in parallel.
> As pointed out above, insufficient RSS capabilities may cause performance
> overhead, not a failure; the device still stays functional.
If UDP packets are dropped, even applications that do not retry can fail.

> But too few AQs serving too high a volume of VMs may be a problem.
It is left to the device implementation to meet the needed scale requirement.

> Yes, the number of AQs is negotiable, but how many exactly should the HW
> provide?
Again, it is outside the scope. It is left to the device implementation, like many other performance aspects.

> 
> Naming a number or an algorithm for the ratio of devices / num_of_AQs is
> beyond this topic, but I made my point clear.
Sure. It is beyond.
And it is not a concern either.

> >
> >>>> E.g., do you need 100 admin vqs for 1000 VMs? How do you decide the
> >>>> number in the HW implementation and how does the driver get informed?
> >>> Usually just one AQ is enough, as proposal [1] is built around
> >>> inherent
> >> downtime reduction.
> >>> You can ask a similar question for RSS: how does the hw device know how
> >>> many RSS queues are needed?
> >>> The device exposes the number of supported AQs that the driver is free to use.
> >> RSS is not a must for the transition, though there may be performance overhead.
> >> But if the host cannot finish Live Migration in due time, then
> >> it is a failed LM.
> > It can abort the LM and restore it by resuming the device.
> Abort means failure.
> >
> >>> Most sane sysadmins do not migrate 1000 VMs at the same time for
> >>> obvious
> >> reasons.
> >>> But when such requirements arise, a device may support it.
> >>> Just like how a net device can support from 1 to 32K txqueues at the spec level.
> >> The orchestration layer may do that for host upgrade or power-saving.
> >> And the VMs may be required to migrate together, for example:
> >> a cluster of VMs in the same subnet.
> >>
> > Sure. AQ of depth 1K can support 1K outstanding commands at a time for
> 1000 member devices.
> PCI transition is FIFO, 
I do not understand what "PCI transition" is.

> can depth = 1K introduce significant latency?
AQ command execution is not done serially. There is enough text in the AQ chapter about this, as I recall.
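
To make the non-serial point concrete, below is a minimal driver-side
sketch. The helpers aq_submit()/aq_poll_used(), the opcode value and the
token scheme are hypothetical placeholders, not any real driver API; the
only point is that completions are reaped by token, in whatever order the
device finishes them.

/* Hypothetical placeholders for however a driver posts a command on the
 * admin VQ and reaps a completion. Not a real API. */
struct aq;                                   /* opaque admin VQ handle */
int aq_submit(struct aq *aq, int member_id, int opcode); /* returns a token */
int aq_poll_used(struct aq *aq);             /* token of next completed cmd */
void handle_completion(int token);

#define CMD_READ_DEVICE_CONTEXT 1            /* illustrative opcode value */

void migrate_members(struct aq *aq, int nr_members)
{
	int inflight = 0;
	int vf;

	/* Queue one command per member device without waiting in between. */
	for (vf = 0; vf < nr_members; vf++) {
		aq_submit(aq, vf, CMD_READ_DEVICE_CONTEXT);
		inflight++;
	}

	/* The device is free to execute and complete these out of order;
	 * the driver matches each completion by its token, not by the
	 * order of submission. */
	while (inflight-- > 0)
		handle_completion(aq_poll_used(aq));
}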

> And a 1K depth is
> almost identical to 2 x 500 queue depths, so it is still the same problem: how many
> resources does the HW need to reserve to serve the worst case?
> 
You didn't describe the problem.
A virtqueue is generic infrastructure to execute commands, be they admin commands, control commands, flow filter commands, or SCSI commands.
How many to execute in parallel and how many queues to have are device-implementation specific.

> Let's forget the numbers, the point is clear.
Ok. I agree with you.
The number of AQs and their depth matter for this discussion, and their performance characterization is outside the spec.
Design-wise, the key thing is to have a queuing interface between the driver and the device for device migration commands.
This enables both entities to execute things in parallel.

This is fully covered in [1].
So let's improve [1].

[1] https://lists.oasis-open.org/archives/virtio-comment/202309/msg00061.html
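
Purely as a back-of-the-envelope illustration of why the exact AQ count and
depth are an implementation choice and not a spec matter, here is a toy
calculation. Every number in it (members, commands per member, depth,
per-command latency) is an assumption made up for the example, not a spec
or hardware figure.

#include <stdio.h>

int main(void)
{
	/* All numbers are made-up assumptions for illustration only. */
	int members           = 1000;  /* member devices migrating together */
	int cmds_per_member   = 4;     /* e.g. suspend, context, dirty pages */
	int queue_depth       = 1024;  /* commands kept in flight on one AQ */
	double cmd_latency_us = 100.0; /* assumed per-command latency */

	int total_cmds = members * cmds_per_member;
	/* With queue_depth commands overlapped, wall time is roughly
	 * (total_cmds / queue_depth) rounds of cmd_latency_us each. */
	double rounds = (double)total_cmds / queue_depth;

	printf("~%.1f ms of AQ time for %d commands\n",
	       rounds * cmd_latency_us / 1000.0, total_cmds);
	return 0;
}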

> >
> >> Let's not introduce new frangibility.
> > I don't see any frangibility added by [1].
> > If you see one, please let me know.
> The resource and latency issues explained above.
> >
> >>>>>> Remember a CSP migrates all VMs on a host for power saving or an upgrade.
> >>>>> I am not sure why the migration reason has any influence on the design.
> >>>> Because this design is for live migration.
> >>>>> The CSPs that we had discussed care more about performance and
> >>>>> hence
> >>>> prefer passthrough instead of mediation and don't seem to be doing
> >>>> any nesting.
> >>>>> CPUs don't have support for 3 levels of page table nesting either.
> >>>>> I agree that there could be other users who care for nested functionality.
> >>>>>
> >>>>> Anyway, nesting and non-nesting are two different requirements.
> >>>> The LM facility should serve both.
> >>> I don't see how the PCI spec lets you do it.
> >>> The PCI community already handed this over to the SR-PCIM interface, outside
> >>> of the
> >> PCI spec domain.
> >>> Hence, it is done over the admin queue for passthrough devices.
> >>>
> >>> If you can explain how your proposal addresses passthrough support
> >>> without
> >> mediation and also does DMA, I am very interested to learn that.
> >> Do you mean nested?
> > Before nesting, I would just like to see basic single-level passthrough that is functional
> and performant, like [1].
> I think we have discussed this: the nested guest is not aware of the admin
> vq and cannot access it, because the admin vq is a host facility.

A nested guest VM is not aware of it and should not be.
The VM hosting the nested VM is aware of how to execute administration commands using the owner device.

At present, for the PCI transport, the owner device is the PF.

In the future, for nesting, maybe another peer VF can be delegated such a task and perform administration commands.

For bare metal, maybe some other admin device, like a DPU, can take that role.

> >

> >> Why can this series not support nested?
> > I don't see all the aspects that I covered in series [1], ranging from FLR, device
> context migration, virtio-level reset, dirty page tracking, P2P support, etc.,
> covered in a device/vq suspend-resume piece.
> >
> > [1]
> > https://lists.oasis-open.org/archives/virtio-comment/202309/msg00061.html
> We have discussed many other issues in this thread.
> >
> >>>> And it does not serve bare-metal live migration either.
> >>> A bare-metal migration seems a distant theory, as one needs a side cpu and
> >> memory accessor apart from the device accessor.
> >>> But if that somehow exists, there will be a similar admin device to
> >>> migrate it;
> >> maybe TDISP will own this whole piece one day.
> >> Bare-metal live migration requires other components like firmware, OS,
> >> and partitioning; that's why device live migration should not be a
> blocker.
> > Device migration is not a blocker.
> > In fact, it facilitates this future, in case that happens, where a side cpu like a
> DPU or a similar sideband virtio admin device can migrate over its admin vq.
> >
> > Long ago, when admin commands were discussed, this was discussed too,
> where an admin device may not be an owner device.
> The admin vq cannot migrate itself; therefore bare metal cannot be migrated
> by the admin vq.
Maybe I was not clear. The admin commands are executed by some device other than the PF.
Above, I call it an admin device, which can be a DPU, some other dedicated admin device, or something else.
A large part of the non-virtio infrastructure at the platform, BIOS, CPU, and memory level needs to evolve before virtio can utilize it.

We don't need to cook it all now; as long as we have administration commands, it's good.
The real credit for detaching the administration commands from the admin vq goes to Michael. :)
We would like to utilize this in the future for the DPU case, where the admin device is not the PCI PF.
Eswitch, PF migration, etc. may utilize it in the future when needed.

