Subject: Re: [virtio-comment] [PATCH 5/5] virtio-pci: implement VIRTIO_F_QUEUE_STATE
On 9/11/2023 5:05 PM, Parav Pandit wrote:
So implement AQ on the "admin" VF? Does this require the HW to reserve dedicated resources for every VF?

> From: Zhu, Lingshan <lingshan.zhu@intel.com> Sent: Monday, September 11, 2023 2:17 PM
> [..]
> Hypervisor needs to do the right setup anyway for using PCI-spec-defined access control and other semantics, which is outside the scope of [1]. It is outside primarily because proposal [1] is not migrating the whole "PCI device". It is migrating the virtio device, so that we can migrate from a PCI VF member to some software-based device too. And vice versa.
>> Since you talked about P2P: IOMMU is basically for address space isolation. For security reasons, it is usually suggested to pass through all devices in one IOMMU group to a single guest.
> An IOMMU group is an OS concept; there is no need to mix it in here.
>> That means, if you want the VF to perform P2P with the PF where the AQ resides, you have to place them in the same IOMMU group and pass them all through to a guest. So how does this AQ serve other purposes?
> A PF resides on the hypervisor. One or more VFs are passed through to the VM. When one wants to do nesting, maybe one of the VFs can take the role of admin for its peer VF. Such an extension is only needed for nesting; for non-nesting, being the known common case to us, such an extension is not needed.
So expensive, and overkill? And then a VF may be managed by both the PF and its admin "VF"?
> QoS is such a broad term that it is hard to debate unless you get to a specific point.
>> E.g., there can be hundreds or thousands of VMs; how many admin vqs are required to serve them during LM, so that it converges without timing out?
> How many RSS queues are required to reach 800Gbps NIC performance, at what queue depth, at what interrupt moderation level? Such details are outside the scope of the virtio specification. Those are implementation details of the device. Similarly here for the AQ too. The inherent nature of the AQ, to queue commands and execute them out of order in the device, is the fundamental reason the AQ was introduced. And one can have more AQs to do unrelated work, mainly from the hypervisor owner device that wants to enqueue unrelated commands in parallel.
>> As pointed out above, insufficient RSS capabilities may cause performance overhead, but not a failure; the device stays functional.
> If UDP packets are dropped, even applications that do no retry can fail.
UDP is not reliable by design, and performance overhead does not mean failure.
>> But too few AQs serving too high a volume of VMs may be a problem.
> It is left for the device to implement the needed scale requirement.
Yes, so how many HW resources should the HW implementation reserve to serve the worst case? Half of the board's resources?
>> I agree we can skip this issue, but the point is clear, and this is not only a performance issue. Yes, the number of AQs is negotiable, but how many exactly should the HW provide?
> Again, it is outside the scope. It is left to the device implementation, like many other performance aspects.
It can lead to a failed LM.
>> Naming a number, or an algorithm for the ratio of devices to num_of_AQs, is beyond this topic, but I have made my point clear.
> Sure. It is beyond it. And it is not a concern either.
It is a concern: the user expects the LM process to succeed rather than fail.
>> E.g., do you need 100 admin vqs for 1000 VMs? How do you decide the number in the HW implementation, and how does the driver get informed?
> Usually just one AQ is enough, as proposal [1] is built around inherent downtime reduction. You can ask a similar question for RSS: how does the hw device know how many RSS queues are needed? :) The device exposes the number of supported AQs that the driver is free to use.
>> RSS is not a must for the transition, though lacking it may mean performance overhead. But if the host cannot finish the live migration in the due time, then it is a failed LM.
> It can abort the LM and restore the device by resuming it.
>> An abort means failure.
> Most sane sysadmins do not migrate 1000 VMs at the same time, for obvious reasons. But when such requirements arise, a device may support it, just like how a net device can support from 1 to 32K txqueues at the spec level.
>> The orchestration layer may do that for a host upgrade or for power-saving. And the VMs may be required to migrate together, for example a cluster of VMs in the same subnet.
> Sure. An AQ of depth 1K can support 1K outstanding commands at a time for 1000 member devices.
>> PCI transition is FIFO,
> I do not understand what "PCI transition" is.
PCI data flow.
>> Can depth = 1K introduce significant latency?
> AQ command execution is not done serially. There is enough text in the AQ chapter, as I recall.
Then it requires more HW resources; I don't see the difference.
>> So the question is: how many to serve the worst case? Does the HW vendor need to reserve half of the board's resources? And a depth of 1K is almost identical to 2 x 500 queue depths, so it is still the same problem: how many resources does the HW need to reserve to serve the worst case?
> You didn't describe the problem. A virtqueue is generic infrastructure to execute commands, be it an admin command, control command, flow filter command, or SCSI command. How many to execute in parallel and how many queues to have are device-implementation specific.
>> I am not sure why [1] is a must; certain issues discussed in this thread remain unsolved for [1]. Let's forget the numbers; the point is clear.
> Ok. I agree with you. The number of AQs and their depth matter for this discussion, and their performance characterization is outside the spec. Design-wise, the key thing is to have the queuing interface between driver and device for device migration commands. This enables both entities to execute things in parallel. This is fully covered in [1]. So let's improve [1].
> [1] https://lists.oasis-open.org/archives/virtio-comment/202309/msg00061.html
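As a side note on the queuing-vs-serial argument: the claimed benefit can be shown with a toy model. This is purely illustrative, not spec text; the batch model, function names, and the numbers below are my own assumptions, not taken from [1] or from the virtio specification.

```python
# Toy model (illustrative only): why a queuing interface between driver
# and device reduces total migration command time versus a register-style
# interface where the driver issues one command and waits for completion.
import math

def serial_time(num_cmds, cmd_latency):
    # Register-style interface: issue a command, wait, repeat.
    return num_cmds * cmd_latency

def queued_time(num_cmds, cmd_latency, queue_depth):
    # Admin-queue-style interface: up to queue_depth commands are
    # outstanding at once and may complete out of order, so the total
    # time scales with the number of batches, not the number of commands.
    batches = math.ceil(num_cmds / queue_depth)
    return batches * cmd_latency

if __name__ == "__main__":
    n, lat = 1000, 1          # 1000 member devices, 1 unit per command
    print(serial_time(n, lat))        # 1000 units
    print(queued_time(n, lat, 1024))  # 1 unit: one batch covers all
```

Under this toy model, a single AQ of depth 1K turns 1000 serial round-trips into one batch; whether the device can actually complete 1K commands concurrently is, as said above, an implementation question.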
By the way, do you see anything we need to improve in this series?
>> The VM does not talk to the admin vq either; the admin vq is a host facility, the host owns it. Let's not introduce new fragility.
> I don't see any fragility added by [1]. If you see one, please let me know.
>> The resource and latency issues explained above. Remember, a CSP migrates all VMs on a host for power-saving or an upgrade.
> I am not sure why the migration reason has any influence on the design.
>> Because this design is for live migration.
> The CSPs that we have discussed care more for performance and hence prefer passthrough instead of mediation, and they don't seem to be doing any nesting. The CPU doesn't have support for 3 levels of page table nesting either. I agree that there could be other users who care for nested functionality. Anyway, nesting and non-nesting are two different requirements.
>> The LM facility should serve both.
> I don't see how the PCI spec lets you do it. The PCI community already handed this over to the SR-PCIM interface, outside of the PCI spec domain. Hence, it is done over the admin queue for passthrough devices. If you can explain how your proposal addresses passthrough support without mediation and also does DMA, I am very interested to learn that.
>> Do you mean nested?
> Before nesting, I would just like to see basic single-level passthrough that is functional and performant, like [1].
>> I think we have discussed this: the nested guest is not aware of the admin vq and cannot access it, because the admin vq is a host facility.
> A nested guest VM is not aware and should not be. The VM hosting the nested VM is aware of how to execute administrative commands using the owner device.
> At present, for the PCI transport, the owner device is the PF. In future, for nesting, maybe another peer VF can be delegated such a task, and it can perform administration commands.
Then it may run into the problems explained above.
> For bare metal, maybe some other admin device, like a DPU, can take that role.
So [1] is not ready.
>> Why can this series not support nesting?
> I don't see all the aspects that I covered in series [1], ranging from FLR, device context migration, virtio-level reset, dirty page tracking, P2P support, etc., covered in some device/vq suspend-resume piece.
> [1] https://lists.oasis-open.org/archives/virtio-comment/202309/msg00061.html
>> We have discussed many other issues in this thread. And it does not serve bare-metal live migration either.
> A bare-metal migration seems a distant theory, as one needs a side CPU and a memory accessor apart from the device accessor. But somehow, if that exists, there will be a similar admin device to migrate it; maybe TDISP will own this whole piece one day.
>> Bare-metal live migration requires other components like firmware, OS, and partitioning; that's why the device live migration should not be a blocker.
> Device migration is not a blocker. In fact, it facilitates this future, in case it happens, where a side CPU, like a DPU or a similar sideband virtio admin device, can migrate it over its admin vq. Long ago, when admin commands were discussed, this was discussed too: an admin device may not be an owner device.
>> The admin vq cannot migrate itself; therefore bare metal cannot be migrated by the admin vq.
> Maybe I was not clear. The admin commands are executed by some other device than the PF.
From a SW perspective, it should be the admin vq and the device it resides on.
> In the above I call it an admin device, which can be a DPU, maybe some other dedicated admin device, or something else. A large part of the non-virtio infrastructure at the platform, BIOS, CPU, and memory level needs to evolve before virtio can utilize it.
A virtio device should be self-contained, not dependent on other components.
> We don't need to cook it all now; as long as we have administration commands, it is good. The real credit owner for detaching the administration commands from the admin vq is Michael. :) We would like to utilize this in future for the DPU case, where the admin device is not the PCI PF. Eswitch, PF migration, etc., may utilize it in future when needed.
Again, the design should not rely on other host components. And it is not about credit; this is about a reliable, working outcome.