Subject: Re: [virtio-comment] [PATCH 5/5] virtio-pci: implement VIRTIO_F_QUEUE_STATE




On 9/11/2023 3:30 PM, Parav Pandit wrote:
From: Zhu, Lingshan <lingshan.zhu@intel.com>
Sent: Monday, September 11, 2023 12:48 PM

On 9/11/2023 3:07 PM, Parav Pandit wrote:
From: Zhu, Lingshan <lingshan.zhu@intel.com>
Sent: Monday, September 11, 2023 12:28 PM
I don't see in his proposal how all the supported features and functionality are achieved.
I will include the in-flight descriptor tracker and dirty-page tracking in V2; is anything else missed?
It can migrate the device itself; why don't you think so? Can you name some issues we can work on for improvement?
I would like to see a proposal similar to [1] that can work without mediation, in case you want to combine the two use cases under one.
Else, I don't see a need to merge the two.

Dirty page tracking, peer-to-peer, downtime, no-mediation, and FLRs are all covered in [1] for passthrough cases.
We are introducing basic facilities; feel free to re-use them in the admin vq solution.
Basic facilities are added in [1] for passthrough devices.
You can leverage them in your v2 for supporting p2p devices, dirty page tracking, passthrough support, shorter downtime and more.
Basic facilities had better not depend on others, but the admin vq can re-use the basic facilities.

For P2P, what if the devices are placed in different IOMMU groups?

If you want to implement LM by admin vq, the facilities in my series can be re-used. E.g., forward your suspend to the SUSPEND bit.
Just VQ suspend is not enough...
This series contains device SUSPEND and the queue state accessor.
MST required in-flight descriptor tracking, which will be included in the next version.
For passthrough more than that is needed.
Dirty page tracking will be addressed too; what else should we work on?
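For reference, here is a minimal sketch of how a driver could use the facilities named above (device SUSPEND plus the queue state accessor) to save device state. The SUSPEND bit value and the virtio_* helpers are illustrative assumptions, not the spec or patch wording:

/* Illustrative sketch only: the SUSPEND bit value, the per-queue state
 * accessors and the virtio_* helpers are assumptions, not spec text. */
#include <stdint.h>
#include <errno.h>

#define VIRTIO_CONFIG_S_SUSPEND  0x40   /* assumed bit in the device status field */

struct virtio_dev;                       /* opaque transport handle */

/* Hypothetical wrappers around PCI common configuration accesses. */
extern uint8_t  virtio_get_status(struct virtio_dev *vdev);
extern void     virtio_set_status(struct virtio_dev *vdev, uint8_t status);
extern int      virtio_wait_status_bit(struct virtio_dev *vdev, uint8_t bit);
extern void     virtio_select_queue(struct virtio_dev *vdev, uint16_t idx);
extern uint16_t virtio_get_queue_avail_state(struct virtio_dev *vdev);
extern uint16_t virtio_get_queue_used_state(struct virtio_dev *vdev);

struct vq_state {
    uint16_t avail_idx;   /* next descriptor index the device would fetch */
    uint16_t used_idx;    /* next used index the device would write */
};

/* Suspend the device, then snapshot each queue's state for migration. */
static int save_queue_states(struct virtio_dev *vdev,
                             struct vq_state *out, uint16_t nvqs)
{
    uint8_t status = virtio_get_status(vdev);

    virtio_set_status(vdev, status | VIRTIO_CONFIG_S_SUSPEND);
    if (!virtio_wait_status_bit(vdev, VIRTIO_CONFIG_S_SUSPEND))
        return -ETIMEDOUT;              /* device never acknowledged SUSPEND */

    for (uint16_t i = 0; i < nvqs; i++) {
        virtio_select_queue(vdev, i);
        out[i].avail_idx = virtio_get_queue_avail_state(vdev);
        out[i].used_idx  = virtio_get_queue_used_state(vdev);
    }
    return 0;
}

An admin-vq based flow could drive the same sequence from the owner device instead of the guest driver, which is the re-use point being made above.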
The admin queue of the member device is migrated like any other queue using [1] above.
2) won't work in a nested environment, or we need complicated SR-IOV emulation in order for it to work

Poking at the device from the driver to migrate it is not going to work if the driver lives within the guest.
This is by design, to allow live migration to work in the nested layer.
And it's the way we've handled CPU and MMU. Is anything different for virtio here?
Nested and non-nested use cases likely cannot be addressed by a single solution/interface.

I think Ling Shan's proposal addressed them both.

I don't see how all the above points are covered.
Why?


And how do you migrate nested VMs by admin vq?

Hypervisor = level 1.
VM = level 2.
Nested VM = level 3.
The level-2 VM takes care of migrating the level-3 composed device using its sw composition, or maybe using some kind of mediation as you proposed.
So, the nested VM is not aware of the admin vq, or does not have access to the admin vq, right?
Right. It is not aware.

How many admin vqs, and how much bandwidth, are reserved for migrating all the VMs?

It does not matter, because the number of AQs is configurable and the device and driver can decide how many to use.
I am not sure which BW you are talking about.
There are many bandwidths one can regulate: at the network level, PCI level, VM level, etc.
It matters because of QoS, and the downtime must converge.
QoS is such a broad term that it is hard to debate unless you get to a specific point.
E.g., there can be hundreds or thousands of VMs; how many admin vqs are required to serve them during LM? To converge, there must be no timeout.
E.g., do you need 100 admin vqs for 1000 VMs? How do you decide the number in the HW implementation, and how does the driver get informed?
Usually just one AQ is enough, as proposal [1] is built around inherent downtime reduction.
You can ask a similar question for RSS: how does a hw device know how many RSS queues are needed? :)
The device exposes the number of supported AQs that the driver is free to use.
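As a rough illustration of that discovery point: assuming the device advertises a supported admin VQ count somewhere in its configuration (the field accessor and the sizing policy below are hypothetical), the driver could size its AQ usage like this:

/* Illustrative sketch only: the advertised AQ-count accessor and the
 * sizing policy are assumptions, not anything defined by the spec. */
#include <stdint.h>

struct virtio_pf;        /* opaque owner (PF) device handle */

/* Hypothetical accessor for an advertised "number of supported admin VQs". */
extern uint16_t virtio_pf_get_admin_queue_num(struct virtio_pf *pf);

/*
 * Decide how many admin VQs to actually use when migrating num_members
 * member devices: one AQ is usually enough, but a driver may spread
 * commands over more queues, capped by what the device supports.
 */
static uint16_t pick_admin_queue_count(struct virtio_pf *pf, uint32_t num_members)
{
    uint16_t supported = virtio_pf_get_admin_queue_num(pf);
    uint16_t wanted = 1;                      /* default: a single AQ */

    /* Example policy: one extra AQ per 256 members migrated in parallel. */
    wanted += (uint16_t)(num_members / 256);

    return wanted < supported ? wanted : supported;
}

The policy here is purely an example; the point is only that the device advertises a maximum and the driver chooses within it.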
RSS is not a must for the transition, though there may be performance overhead.
But if the host cannot finish live migration in the due time, then it is a failed LM.

Most sane sysadmins do not migrate 1000 VMs at the same time, for obvious reasons.
But when such requirements arise, a device may support it.
Just like how a net device can support from 1 to 32K txqueues at the spec level.
The orchestration layer may do that for a host upgrade or power-saving.
And the VMs may be required to migrate together, for example:
a cluster of VMs in the same subnet.

Let's not introduce new fragility.

Remember, a CSP migrates all VMs on a host for power-saving or upgrade.
I am not sure why the migration reason has any influence on the design.
Because this design is for live migration.
The CSPs that we have discussed care more about performance, and hence prefer passthrough instead of mediation, and don't seem to be doing any nesting.
The CPU doesn't support 3 levels of page table nesting either.
I agree that there could be other users who care for nested functionality.

Anyway, nesting and non-nesting are two different requirements.
The LM facility should serve both.
I don't see how the PCI spec lets you do it.
The PCI community already handed this over to the SR-PCIM interface, outside of the PCI spec domain.
Hence, it is done over the admin queue for passthrough devices.

If you can explain how your proposal addresses passthrough support without mediation and also does DMA, I am very interested to learn that.
Do you mean nested? Why can't this series support nested?

And it does not serve bare-metal live migration either.
A bare-metal migration seems a distant theory, as one needs a side CPU and memory accessor apart from the device accessor.
But if that somehow comes to exist, there will be a similar admin device to migrate it; maybe TDDISP will own this whole piece one day.
Bare-metal live migration requires other components like firmware, OS and partitioning; that's why device live migration should not be a blocker.




