virtio-comment message



Subject: Re: [virtio-comment] [PATCH 5/5] virtio-pci: implement VIRTIO_F_QUEUE_STATE




On 9/11/2023 6:21 PM, Parav Pandit wrote:
From: Zhu, Lingshan <lingshan.zhu@intel.com>
Sent: Monday, September 11, 2023 3:03 PM
So implement the AQ on the "admin" VF? This requires the HW to reserve dedicated resources for every VF?
So expensive, overkill?

And a VF may be managed by both the PF and its admin "VF"?
Yes.
It's a bit chaotic: as you can see, if the nested (L2 guest) VF can be managed by both the L1 guest VF and the host PF, that means two owners of the L2 VF.

If UDP packets are dropped, even the application can fail if it does not retry.
UDP is not reliable, and performance overhead does not mean failure.
It largely depends on the application.
I have seen iperf over UDP fail on packet drops and never recover.
A retransmission over UDP can fail.
That depends on the workload: if it chooses UDP, it is aware of the possibility of losing packets. But anyway, LM is expected to complete successfully in due time.

But too few AQs serving too high a volume of VMs may be a problem.
It is left to the device to implement the needed scale requirement.
Yes, so how many HW resources should the HW implementation reserve to serve the worst case? Half of the board resources?
The board designer can decide how to manage the resources.
Administration commands are explicit instructions to the device.
It knows for how many member devices dirty tracking is ongoing and which device contexts are being read/written.
Still, does the board designer need to prepare for the worst case? How can that challenge be met?

An admin command can even fail with the EAGAIN error code when the device is out of resources, and the software can retry the command.
As demonstrated, this series is as reliable as the config space functionality, so maybe fewer possibilities to fail?
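(For illustration only, the retry behavior described above could look roughly like the driver-side sketch below; the helpers, types, and status constants are hypothetical placeholders for whatever submission interface an implementation provides, not spec-defined APIs.)

/* Minimal sketch, not actual driver code: retry an admin command
 * when the device reports it is temporarily out of resources.
 * Types, helpers, and constants are hypothetical placeholders. */
#define ADMIN_STATUS_OK      0   /* command completed successfully */
#define ADMIN_STATUS_EAGAIN  1   /* device busy / out of resources, retry */

struct admin_cmd;                              /* opaque command descriptor */

int admin_cmd_submit(struct admin_cmd *cmd);   /* assumed submit helper */
void small_delay(void);                        /* assumed back-off helper */

static int admin_cmd_submit_retry(struct admin_cmd *cmd, int max_tries)
{
        int status, i;

        for (i = 0; i < max_tries; i++) {
                status = admin_cmd_submit(cmd);
                if (status != ADMIN_STATUS_EAGAIN)
                        return status;         /* success or a hard error */
                small_delay();                 /* back off before retrying */
        }
        return ADMIN_STATUS_EAGAIN;            /* still busy after max_tries */
}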

The key part is that all of this happens outside of the VM's downtime.
The majority of the work in proposal [1] is done while the VM is _live_.
Hence, the resource consumption or reservation is significantly less.
It still depends on the volume of VMs and devices; the orchestration layer needs to migrate the last round of dirty pages and states even after the VM has been suspended.


Naming a number or an algorithm for the ratio of devices to num_of_AQs is beyond this topic, but I have made my point clear.
Sure. It is beyond.
And it is not a concern either.
It is; the user expects the LM process to succeed rather than fail.
I still fail to understand why the LM process would fail.
The migration process is slow, but the downtime is not in [1].
If I recall correctly, the downtime is around 300 ms, so don't let the bandwidth or the number of admin vqs become a bottleneck, which may introduce more possibilities to fail.

Can a depth of 1K introduce significant latency?
AQ command execution is not done serially. There is enough text in the AQ chapter, as I recall.
Then it requires more HW resources; I don't see the difference.
Difference compared to what, multiple AQs?
If so, sure.
A device that prefers to do only one AQ command at a time can surely work with fewer resources and do one at a time.
I think we are discussing the same issue as the "resources for the worst case" problem above.

And a 1K depth is almost identical to 2 x 500 queue depths, so it is still the same problem: how many resources does the HW need to reserve to serve the worst case?

You didn't describe the problem.
A virtqueue is generic infrastructure to execute commands, be it an admin command, a control command, a flow filter command, or a SCSI command.
How many to execute in parallel and how many queues to have are device implementation specific.
So the question is: how many are needed to serve the worst case? Does the HW vendor need to reserve half of the board resources?
No. It does not need to.
same as above

Let's forget the numbers; the point is clear.
Ok. I agree with you.
The number of AQs and their depth matter for this discussion, and their performance characterization is outside the spec.
Design-wise, the key thing is to have a queuing interface between the driver and the device for device migration commands.
This enables both entities to execute things in parallel.
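(As a rough illustration of that parallelism, again only with hypothetical helpers rather than any spec-defined driver API, the driver could keep several migration-related admin commands in flight on the queue and only then wait for their completions, as in the sketch below.)

/* Sketch only: post several migration-related admin commands back to back,
 * then collect their completions, instead of issuing one and waiting for it
 * before issuing the next. admin_cmd_post() and admin_cmd_wait() are
 * assumed helpers. */
struct admin_cmd;                              /* opaque command descriptor */

int admin_cmd_post(struct admin_cmd *cmd);     /* add the command to the AQ */
int admin_cmd_wait(struct admin_cmd *cmd);     /* wait for its completion */

static int run_migration_cmds(struct admin_cmd **cmds, int n)
{
        int i, err;

        for (i = 0; i < n; i++) {              /* post everything first ... */
                err = admin_cmd_post(cmds[i]);
                if (err)
                        return err;
        }
        for (i = 0; i < n; i++) {              /* ... then reap completions */
                err = admin_cmd_wait(cmds[i]);
                if (err)
                        return err;
        }
        return 0;
}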

This is fully covered in [1].
So let's improve [1].

[1] https://lists.oasis-open.org/archives/virtio-comment/202309/msg00061.html
I am not sure why [1] is a must? There are certain issues discussed in this thread for [1] that remain unsolved.

By the way, do you see anything we need to improve in this series?
In [1], the device context needs to become richer as we progress through the v1/v2 versions.

[..]

A nested guest VM is not aware and should not be.
The VM hosting the nested VM is aware of how to execute administrative commands using the owner device.
The VM does not talk to the admin vq either; the admin vq is a host facility, and the host owns it.
The admin VQ is owned by whichever device has it.
As I explained before, it is on the owner device.
If needed, one can have it on more than the owner device.
For a VM_A which is hosting another VM_B, VM_A can have a peer VF with an AQ act as the admin device or migration manager device.
So two or more owners own the same device; isn't that a conflict?

At present, for the PCI transport, the owner device is the PF.

In the future, for nesting, maybe another peer VF can be delegated such a task and it can perform administration commands.
Then it may run into the problems explained above.
For bare metal, maybe some other admin device like a DPU can take that role.
So [1] is not ready.
Why can this series not support nesting?
I don't see all the aspects that I covered in series [1], ranging from FLR, device context migration, virtio-level reset, dirty page tracking, P2P support, etc., covered in some device/vq suspend-resume piece.
[1] https://lists.oasis-open.org/archives/virtio-comment/202309/msg00061.html
We have discussed many other issues in this thread.
And it does not serve bare-metal live migration either.
A bare-metal migration seems a distant theory, as one needs a side CPU and a memory accessor apart from the device accessor.
But if that somehow exists, there will be a similar admin device to migrate it; maybe TDISP will own this whole piece one day.
Bare-metal live migration requires other components like firmware, the OS, and partitioning; that's why the device live migration should not be a blocker.
Device migration is not a blocker.
In fact, it facilitates this future, in case it happens, where a side CPU such as a DPU or a similar sideband virtio admin device can migrate over its admin vq.
Long ago, when admin commands were discussed, it was also discussed that an admin device may not be an owner device.
The admin vq cannot migrate itself; therefore bare metal cannot be migrated by the admin vq.
Maybe I was not clear. The admin commands are executed by some device other than the PF.
From the SW perspective, it should be the admin vq and the device it resides on.
Above I call it an admin device, which can be a DPU, maybe some other dedicated admin device, or something else.
A large part of the non-virtio infrastructure at the platform, BIOS, CPU, and memory level needs to evolve before virtio can utilize it.
A virtio device should be self-contained, not dependent on other components.
We don't need to cook it all now; as long as we have administration commands, it's good.
The real credit owner for detaching the administration commands from the admin vq is Michael. :) We would like to utilize this in the future for the DPU case where the admin device is not the PCI PF.
Eswitch, PF migration, etc. may utilize it in the future when needed.
Again, the design should not rely on other host components.
It does not. It relies on the administration commands.
I remember you mentioned using DPU infrastructure to perform bare-metal live migration?

And it is not about the credit; it is about a reliable work outcome.
I didn't follow the comment.


