

Subject: RE: [virtio-comment] [PATCH 5/5] virtio-pci: implement VIRTIO_F_QUEUE_STATE


> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> Sent: Tuesday, September 12, 2023 12:58 PM
> To: Parav Pandit <parav@nvidia.com>; Jason Wang <jasowang@redhat.com>
> Cc: Michael S. Tsirkin <mst@redhat.com>; eperezma@redhat.com;
> cohuck@redhat.com; stefanha@redhat.com; virtio-comment@lists.oasis-
> open.org; virtio-dev@lists.oasis-open.org
> Subject: Re: [virtio-comment] [PATCH 5/5] virtio-pci: implement
> VIRTIO_F_QUEUE_STATE
> 
> 
> 
> On 9/12/2023 2:47 PM, Parav Pandit wrote:
> >> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >> Sent: Tuesday, September 12, 2023 12:04 PM
> >>
> >>
> >> On 9/12/2023 1:58 PM, Parav Pandit wrote:
> >>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >>>> Sent: Tuesday, September 12, 2023 9:37 AM
> >>>>
> >>>> On 9/11/2023 6:21 PM, Parav Pandit wrote:
> >>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >>>>>> Sent: Monday, September 11, 2023 3:03 PM
> >>>>>> So implement AQ on the "admin" VF? This requires the HW to reserve
> >>>>>> dedicated resources for every VF?
> >>>>>> So expensive; overkill?
> >>>>>>
> >>>>>> And a VF may be managed by both the PF and its admin "vf"?
> >>>>> Yes.
> >>>> it's a bit chaotic, as you can see: if the nested (L2 guest) VF can be
> >>>> managed by both the L1 guest VF and the host PF, that means two owners
> >>>> of the L2 VF.
> >>> This is nesting.
> >>> When you do M-level nesting, does any CPU in the world handle its own
> >>> page tables in isolation from the next level and also perform equally well?
> >> Not exactly. In nesting, the L1 guest is the host/infrastructure emulator
> >> for L2, so L2 is expected to do nothing with the host directly. Something
> >> like an L2 VF managed by both the L1 VF and the host PF can lead to
> >> operational and security issues.
> >>>>>>> If UDP packets are dropped, even an application that does not retry can fail.
> >>>>>> UDP is not reliable, and performance overhead does not mean failure.
> >>>>> It largely depends on the application.
> >>>>> I have seen iperf UDP fail on packet drop and never recover.
> >>>>> A retransmission over UDP can fail.
> >>>> That depends on the workload; if it chooses UDP, it is aware of the
> >>>> possibility of losing packets. But anyway, LM is expected to
> >>>> complete successfully within the due time.
> >>> And LM also depends on the workload. :)
> >> Exactly! That's the point: how to meet the requirements!
> >>> It is pointless to discuss performance characteristics as a reason to
> >>> use AQ or not.
> >> How do we meet the QoS requirement during LM?
> > By following [1], where a large part of the device context and dirty page
> > tracking is done while the VM is running.
> We still need to migrate the last round of dirty pages and device states when
> the VM freezes. That can still be large if a big number of VMs is taken into
> consideration, and that is where the ~300 ms due time rules.
> >
> >>> No, a board designer does not need to.
> >>> As explained already, if the board wants to support a single command of
> >>> the AQ, sure.
> >> Same as above, the QoS question. For example, how do we avoid the
> >> situation where half of the VMs can be migrated and the others time out?
> > Why would this happen?
> > A timeout is not related to the AQ, in case that happens.
> Explained above.
> > A timeout can happen with config registers too. And it can be even far
> > harder for board designers to support 384 PCI reads in parallel within a
> > timeout.
> When the VM freezes, the virtio functionality, for example virtio-net
> transactions, is suspended as well, so there are no TLPs for networking
> traffic buffers.
The config register mediated operations done by the host itself are TLPs flowing for the several hundreds of VMs in the example you took.
In your example you took 1000 VMs freezing simultaneously, for which you need to finish the config cycles in some 300 msec.
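(A rough, purely hypothetical back-of-the-envelope, with assumed numbers rather than measurements: if each mediated config register access costs on the order of 10 microseconds end to end, and the final save of one VF needs about 100 such accesses plus polling, then 1000 VFs need roughly 1000 x 100 x 10 us = 1 second of serialized register traffic, well past a 300 msec budget, whereas queued commands can be batched and completed in parallel.)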

> 
> The on-device Live Migration facility can use the full PCI device bandwidth for
> migration.
So can admin commands.
However, the big difference is: registers do not scale with a large number of VFs.
Admin commands scale easily.

I probably should not repeat what is already captured in the admin commands commit log and cover letter.
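For illustration only, the scaling argument in one picture: each admin command names its target member device in the command header, so a single PF-owned queue can carry migration commands for any number of VFs, with no per-VF register bank. The layout below is a loose sketch after the administration command framing in the virtio spec; treat the exact field sizes, offsets and status values as assumptions, not normative text.

#include <stdint.h>

/* Loose sketch of an admin command as carried over the admin virtqueue.
 * The member id in the header, not a register address, selects the VF.  */
struct admin_cmd_hdr {
        uint16_t opcode;           /* e.g. "read device context"          */
        uint16_t group_type;       /* e.g. the SR-IOV group               */
        uint8_t  reserved[12];
        uint64_t group_member_id;  /* which member VF this command targets */
        /* command-specific data follows in the descriptor chain           */
};

struct admin_cmd_result {
        uint16_t status;           /* OK, busy/again, error ...           */
        uint16_t status_qualifier;
        uint8_t  reserved[4];
        /* command-specific result follows                                 */
};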

> 
> That is the difference with the admin vq.
I don't know what difference you are talking about.
PCI device bandwidth for migration is available with both admin commands and some config registers.
BW != timeout.

> >
> > I am still not able to follow your point in asking about unrelated QoS
> > questions.
> Explained above: it has to meet the due time requirement, and many VMs can
> be migrated simultaneously; in that situation, they have to race for the admin
> vq resources/bandwidth.
> >
> >>>>> An admin command can even fail with an EAGAIN error code when the
> >>>>> device is out of resources, and software can retry the command.
> >>>> As demonstrated, this series is as reliable as the config space
> >>>> functionalities, so maybe there are fewer possibilities of failure?
> >>> Huh. Config space has a far higher failure rate for the PCI transport
> >>> due to the inherent nature of PCI timeouts, reads, and polling.
> >>> For any bulk data transfer, a virtqueue is the spec-defined approach.
> >>> This was debated for more than a year; you can check some 2021 emails.
> >>>
> >>> You can see from the patches that the data transfer done in [1] over
> >>> registers is snail slow.
> >> Do you often observe virtio PCI config space failing? Or does the admin
> >> vq need to transfer data through PCI?
> > Admin commands need to transfer bulk data across thousands of VFs in
> > parallel without baking registers into PCI.
> So you agree that actually PCI config space is very unlikely to fail? It is
> reliable.
> 
No, I do not agree. It can fail, and it is very hard for board designers.
AQs are a more reliable way to transport bulk data in a scalable manner for tens of member devices.
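To make the "retry on EAGAIN" point from earlier in the thread concrete, here is a minimal driver-side sketch. submit_admin_cmd(), small_backoff() and the status constants are made-up placeholder names for illustration, not the spec's or Linux's identifiers.

#include <errno.h>

struct admin_queue;                /* opaque, hypothetical driver objects */
struct admin_cmd;

enum { ADMIN_STATUS_OK = 0, ADMIN_STATUS_EAGAIN = 1 };

/* Assumed helpers: post the command and wait for its completion status,
 * and back off briefly (sleep or relax) before retrying.                 */
extern int  submit_admin_cmd(struct admin_queue *aq, struct admin_cmd *cmd);
extern void small_backoff(void);

int submit_admin_cmd_retry(struct admin_queue *aq, struct admin_cmd *cmd,
                           int max_retries)
{
        for (int tries = 0; tries < max_retries; tries++) {
                int status = submit_admin_cmd(aq, cmd);
                if (status != ADMIN_STATUS_EAGAIN)
                        return status;   /* OK or a hard error: stop here  */
                small_backoff();         /* device was out of resources    */
        }
        return -EBUSY;                   /* gave up after max_retries      */
}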

> Please allow me to provide an extreme example: is one single admin vq
> limitless, such that it can serve the migration of hundreds to thousands of
> VMs?
It is left to the device implementation, just like RSS and multi-queue support.
Is one queue enough for an 800 Gbps to 10 Mbps link?
The answer is: not in the scope of the specification. The spec provides the framework to scale this way, but does not impose it on the device.
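Purely illustrative, under the assumption that a device exposes more than one admin queue: a driver can spread per-member commands across them the same way a multi-queue net driver spreads flows. All names below are hypothetical; nothing here is mandated by the spec.

#include <stdint.h>

struct admin_queue;   /* hypothetical driver object */

/* Pick an admin queue for a given member device by simple round-robin on
 * its id; how many queues exist is a device/driver implementation choice. */
static struct admin_queue *pick_admin_queue(struct admin_queue **aqs,
                                            unsigned int num_aqs,
                                            uint64_t member_id)
{
        return aqs[member_id % num_aqs];
}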

> If not, two or three or what number?
It really does not matter; it is the wrong point to discuss here.
The number of queues and command execution depend on the device implementation.
A financial transaction application can time out when the device queuing delay for a virtio-net rx queue is long.
And we don't put details about such things in the specification.
The spec takes the requirements and provides a driver-device interface to implement and scale.

I still don't follow the motivation behind the question.
Is your question: how many admin queues are needed to migrate N member devices? If so, it is implementation specific.
It is similar to how such things depend on the implementation for the 30 virtio device types.

And if you are implying that, because it is implementation specific, the administration queue should not be used and some configuration register should be used instead,
then you should propose a config register interface to post virtqueue descriptors that way for the 30 device types!

