Subject: Re: [virtio-comment] [PATCH 5/5] virtio-pci: implement VIRTIO_F_QUEUE_STATE
On 9/11/2023 5:05 PM, Parav Pandit wrote:
So implement AQ on the "admin" VF? Does this require the HW to reserve dedicated resources for every VF?

> From: Zhu, Lingshan <lingshan.zhu@intel.com> Sent: Monday, September 11, 2023 2:17 PM
> [..]
> Hypervisor needs to do the right setup anyway for using PCI-spec-defined access control and other semantics, which is outside the scope of [1]. It is outside primarily because proposal [1] is not migrating the whole "PCI device". It is migrating the virtio device, so that we can migrate from a PCI VF member to some software-based device too. And vice versa.
>> Since you talked about P2P: IOMMU is basically for address space isolation. For security reasons, it is usually suggested to pass through all devices in one IOMMU group to a single guest.
> An IOMMU group is an OS concept; there is no need to mix it in here.
>> That means, if you want the VF to perform P2P with the PF where the AQ resides, you have to place them in the same IOMMU group and pass them all through to a guest. So how does this AQ serve other purposes?
> A PF resides on the hypervisor. One or more VFs are passed through to the VM. When one wants to do nesting, maybe one of the VFs can take the role of admin for its peer VF. Such an extension is only needed for nesting; for non-nesting, being the known common case to us, such an extension is not needed.
So expensive, and overkill? And then a VF may be managed by both the PF and its admin "VF"?
> QoS is such a broad term that it is hard to debate unless you get to a specific point.
>> E.g., there can be hundreds or thousands of VMs; how many admin vqs are required to serve them during LM, so that it converges without timing out?
> How many RSS queues are required to reach 800Gbps NIC performance, at what queue depth, at what interrupt moderation level? Such details are outside the scope of the virtio specification. Those are implementation details of the device. Similarly here for the AQ too. The inherent nature of the AQ, to queue commands and execute them out of order in the device, is the fundamental reason the AQ was introduced. And one can have more AQs to do unrelated work, mainly from the hypervisor owner device that wants to enqueue unrelated commands in parallel.
>> As pointed out above, insufficient RSS capabilities may cause performance overhead, but not a failure; the device stays functional.
> If UDP packets are dropped, even applications that do no retry can fail.
UDP is not reliable by design, and performance overhead does not mean failure.
>> But too few AQs serving too high a volume of VMs may be a problem.
> It is left for the device to implement the needed scale requirement.
Yes, so how many HW resources should the HW implementation reserve to serve the worst case? Half of the board's resources?
>> I agree we can skip this issue, but the point is clear, and this is not only a performance issue. Yes, the number of AQs is negotiable, but how many exactly should the HW provide?
> Again, it is outside the scope. It is left to the device implementation, like many other performance aspects.
It can lead to a failed LM.
>> Naming a number, or an algorithm for the ratio of devices to num_of_AQs, is beyond this topic, but I have made my point clear.
> Sure. It is beyond it. And it is not a concern either.
It is a concern: the user expects the LM process to succeed rather than fail.
>> E.g., do you need 100 admin vqs for 1000 VMs? How do you decide the number in the HW implementation, and how does the driver get informed?
> Usually just one AQ is enough, as proposal [1] is built around inherent downtime reduction. You can ask a similar question for RSS: how does the hw device know how many RSS queues are needed? :) The device exposes the number of supported AQs that the driver is free to use.
>> RSS is not a must for the transition, though lacking it may mean performance overhead. But if the host cannot finish the live migration in the due time, then it is a failed LM.
> It can abort the LM and restore the device by resuming it.
>> An abort means failure.
> Most sane sysadmins do not migrate 1000 VMs at the same time, for obvious reasons. But when such requirements arise, a device may support it, just like how a net device can support from 1 to 32K txqueues at the spec level.
>> The orchestration layer may do that for a host upgrade or for power-saving. And the VMs may be required to migrate together, for example a cluster of VMs in the same subnet.
> Sure. An AQ of depth 1K can support 1K outstanding commands at a time for 1000 member devices.
>> PCI transition is FIFO,
> I do not understand what "PCI transition" is.
PCI data flow.
>> Can depth = 1K introduce significant latency?
> AQ command execution is not done serially. There is enough text in the AQ chapter, as I recall.
Then it requires more HW resources; I don't see the difference.
>> So the question is: how many to serve the worst case? Does the HW vendor need to reserve half of the board's resources? And a depth of 1K is almost identical to 2 x 500 queue depths, so it is still the same problem: how many resources does the HW need to reserve to serve the worst case?
> You didn't describe the problem. A virtqueue is generic infrastructure to execute commands, be it an admin command, control command, flow filter command, or SCSI command. How many to execute in parallel and how many queues to have are device-implementation specific.
>> I am not sure why [1] is a must; certain issues discussed in this thread remain unsolved for [1]. Let's forget the numbers; the point is clear.
> Ok. I agree with you. The number of AQs and their depth matter for this discussion, and their performance characterization is outside the spec. Design-wise, the key thing is to have the queuing interface between driver and device for device migration commands. This enables both entities to execute things in parallel. This is fully covered in [1]. So let's improve [1].
> [1] https://lists.oasis-open.org/archives/virtio-comment/202309/msg00061.html
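As a side note on the queuing-vs-serial argument: the claimed benefit can be shown with a toy model. This is purely illustrative, not spec text; the batch model, function names, and the numbers below are my own assumptions, not taken from [1] or from the virtio specification.

```python
# Toy model (illustrative only): why a queuing interface between driver
# and device reduces total migration command time versus a register-style
# interface where the driver issues one command and waits for completion.
import math

def serial_time(num_cmds, cmd_latency):
    # Register-style interface: issue a command, wait, repeat.
    return num_cmds * cmd_latency

def queued_time(num_cmds, cmd_latency, queue_depth):
    # Admin-queue-style interface: up to queue_depth commands are
    # outstanding at once and may complete out of order, so the total
    # time scales with the number of batches, not the number of commands.
    batches = math.ceil(num_cmds / queue_depth)
    return batches * cmd_latency

if __name__ == "__main__":
    n, lat = 1000, 1          # 1000 member devices, 1 unit per command
    print(serial_time(n, lat))        # 1000 units
    print(queued_time(n, lat, 1024))  # 1 unit: one batch covers all
```

Under this toy model, a single AQ of depth 1K turns 1000 serial round-trips into one batch; whether the device can actually complete 1K commands concurrently is, as said above, an implementation question.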
By the way, do you see anything we need to improve in this series?
>> The VM does not talk to the admin vq either; the admin vq is a host facility, the host owns it. Let's not introduce new fragility.
> I don't see any fragility added by [1]. If you see one, please let me know.
>> The resource and latency issues explained above. Remember, a CSP migrates all VMs on a host for power-saving or an upgrade.
> I am not sure why the migration reason has any influence on the design.
>> Because this design is for live migration.
> The CSPs that we have discussed care more for performance and hence prefer passthrough instead of mediation, and they don't seem to be doing any nesting. The CPU doesn't have support for 3 levels of page table nesting either. I agree that there could be other users who care for nested functionality. Anyway, nesting and non-nesting are two different requirements.
>> The LM facility should serve both.
> I don't see how the PCI spec lets you do it. The PCI community already handed this over to the SR-PCIM interface, outside of the PCI spec domain. Hence, it is done over the admin queue for passthrough devices. If you can explain how your proposal addresses passthrough support without mediation and also does DMA, I am very interested to learn that.
>> Do you mean nested?
> Before nesting, I would just like to see basic single-level passthrough that is functional and performant, like [1].
>> I think we have discussed this: the nested guest is not aware of the admin vq and cannot access it, because the admin vq is a host facility.
> A nested guest VM is not aware and should not be. The VM hosting the nested VM is aware of how to execute administrative commands using the owner device.
> At present, for the PCI transport, the owner device is the PF. In future, for nesting, maybe another peer VF can be delegated such a task, and it can perform administration commands.
Then it may run into the problems explained above.
> For bare metal, maybe some other admin device, like a DPU, can take that role.
So [1] is not ready.
>> Why can this series not support nesting?
> I don't see all the aspects that I covered in series [1], ranging from FLR, device context migration, virtio-level reset, dirty page tracking, P2P support, etc., covered in some device/vq suspend-resume piece.
> [1] https://lists.oasis-open.org/archives/virtio-comment/202309/msg00061.html
>> We have discussed many other issues in this thread. And it does not serve bare-metal live migration either.
> A bare-metal migration seems a distant theory, as one needs a side CPU and a memory accessor apart from the device accessor. But somehow, if that exists, there will be a similar admin device to migrate it; maybe TDISP will own this whole piece one day.
>> Bare-metal live migration requires other components like firmware, OS, and partitioning; that's why the device live migration should not be a blocker.
> Device migration is not a blocker. In fact, it facilitates this future, in case it happens, where a side CPU, like a DPU or a similar sideband virtio admin device, can migrate it over its admin vq. Long ago, when admin commands were discussed, this was discussed too: an admin device may not be an owner device.
>> The admin vq cannot migrate itself; therefore bare metal cannot be migrated by the admin vq.
> Maybe I was not clear. The admin commands are executed by some other device than the PF.
From a SW perspective, it should be the admin vq and the device it resides on.
> In the above I call it an admin device, which can be a DPU, maybe some other dedicated admin device, or something else. A large part of the non-virtio infrastructure at the platform, BIOS, CPU, and memory level needs to evolve before virtio can utilize it.
A virtio device should be self-contained, not dependent on other components.
> We don't need to cook it all now; as long as we have administration commands, it is good. The real credit owner for detaching the administration commands from the admin vq is Michael. :) We would like to utilize this in future for the DPU case, where the admin device is not the PCI PF. Eswitch, PF migration, etc., may utilize it in future when needed.
Again, the design should not rely on other host components. And it is not about credit; this is about a reliable, working outcome.