Subject: Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
On 2021/7/20 8:27 PM, Max Gurtovoy wrote:
On 7/20/2021 6:02 AM, Jason Wang wrote:
On 2021/7/19 8:43 PM, Stefan Hajnoczi wrote:
On Fri, Jul 16, 2021 at 10:03:17AM +0800, Jason Wang wrote:
On 2021/7/15 6:01 PM, Stefan Hajnoczi wrote:
On Thu, Jul 15, 2021 at 09:35:13AM +0800, Jason Wang wrote:
On 2021/7/14 11:07 PM, Stefan Hajnoczi wrote:
On Wed, Jul 14, 2021 at 06:29:28PM +0800, Jason Wang wrote:
On 2021/7/14 5:53 PM, Stefan Hajnoczi wrote:
On Tue, Jul 13, 2021 at 08:16:35PM +0800, Jason Wang wrote:
On 2021/7/13 6:00 PM, Stefan Hajnoczi wrote:
On Tue, Jul 13, 2021 at 11:27:03AM +0800, Jason Wang wrote:
On 2021/7/12 5:57 PM, Stefan Hajnoczi wrote:
On Mon, Jul 12, 2021 at 12:00:39PM +0800, Jason Wang wrote:
On 2021/7/11 4:36 PM, Michael S. Tsirkin wrote:
On Fri, Jul 09, 2021 at 07:23:33PM +0200, Eugenio Perez Martin wrote:

Eugenio Perez Martin wrote:
As I see it, the feature can be driven entirely by the VMM as long as it intercepts the relevant configuration space (PCI, MMIO, etc.) for the guest's reads and writes, and presents it as coherent and transparent to the guest. Some use cases I can imagine with a physical device (or vp_vdpa device) with VIRTIO_F_STOP:
1) The VMM chooses not to pass the feature flag. The guest cannot stop the device, so any write to this flag is an error/undefined.
2) The VMM passes the flag to the guest. The guest can stop the device.
2.1) The VMM stops the device to perform a live migration, and the guest does not write to STOP at any moment of the LM. It resets the destination device with the state, and then initializes the device.
2.2) The guest stops the device and, when STOP(32) is set, the source VMM migrates the device status. The destination VMM recognizes the bit, so it sets the bit in the destination too after device initialization.
2.3) The device is not initialized by the guest, so it doesn't matter what bit the HW has, but the VM can be migrated.
Am I missing something? Thanks!

Michael S. Tsirkin wrote:
If I understand correctly, this is all driven from the driver inside the guest, so for this to work the guest must be running and already have initialised the driver.

Eugenio Perez Martin wrote:
Yes.

Michael S. Tsirkin wrote:
It's doable like this. It's all a lot of hoops to jump through though. It's also not easy for devices to implement.

Jason Wang wrote:
It just requires a new status bit. Anything that makes you think it's hard to implement? E.g. for a networking device, it should be sufficient to use this bit + the virtqueue state.

Stefan Hajnoczi wrote:
Why don't we design the feature in a way that is useable by VMMs and implementable by devices in a simple way?

Jason Wang wrote:
It uses common technology like register shadowing, without any further stuff. Or do you have any other ideas? (I think we all know migration will be very hard if we simply pass through those state registers.)

Stefan Hajnoczi wrote:
If an admin virtqueue is used instead of the STOP Device Status field bit then there's no need to re-read the Device Status field in a loop until the device has stopped.

Jason Wang wrote:
Probably not. Let me clarify several points:
- This proposal has nothing to do with the admin virtqueue. Actually, the admin virtqueue could be used for carrying any basic device facility, like the status bit. E.g. I'm going to post patches that use the admin virtqueue as a "transport" for device slicing at the virtio level.
- Even if we had introduced the admin virtqueue, we would still need a per-function interface for this. This is a must for nested virtualization; we can't always expect things like the PF to be assignable to the L1 guest.
- According to the proposal, there's no need for the device to complete all the consumed buffers; the device can choose to expose those in-flight descriptors in a device-specific way and set the STOP bit. This means that if we have a device-specific in-flight descriptor reporting facility, the device can set the STOP bit almost immediately.
- If we don't go with the basic device facility but use an admin-virtqueue-specific method, we still need to clarify how it works with the device status state machine; it would be some kind of sub-state, which looks much more complicated than the current proposal.

Stefan Hajnoczi wrote:
Adding the STOP bit as a Device Status bit is a small and clean VIRTIO spec change. I like that. When migrating a guest with many VIRTIO devices, a busy-waiting approach extends downtime if implemented sequentially (stopping one device at a time).

Jason Wang wrote:
Well, you need some kind of waiting for sure; the device/DMA needs some time to be stopped. The downtime is determined by the specific virtio implementation, which is hard to restrict at the spec level. We could clarify that the device must set the STOP bit within e.g. 100ms.

Stefan Hajnoczi wrote:
It can be implemented concurrently (setting the STOP bit on all devices and then looping until all their Device Status fields have the bit set), but this becomes more complex to implement.

Jason Wang wrote:
I still don't get what kind of complexity you are worried about here.

Stefan Hajnoczi wrote:
I'm a little worried about adding a new bit that requires busy waiting...

Jason Wang wrote:
Busy waiting is not something introduced by this patch. From 4.1.4.3.2 Driver Requirements: Common configuration structure layout: "After writing 0 to device_status, the driver MUST wait for a read of device_status to return 0 before reinitializing the device." It was already required for at least one transport; we need to do something similar when introducing a basic facility.

Stefan Hajnoczi wrote:
On the other hand, devices need time to stop and that time can be unbounded. For example, software virtio-blk/scsi implementations cannot immediately cancel in-flight I/O requests on Linux hosts. The natural interface for long-running operations is virtqueue requests.
Stefan Hajnoczi wrote:
That's why I mentioned the alternative of using an admin virtqueue instead of a Device Status bit. Send a command to set the STOP status bit to the admin virtqueue. The device will mark the command buffer used after it has successfully stopped the device.

Jason Wang wrote:
So I'm not against the admin virtqueue. As said before, the admin virtqueue could be used for carrying the device status bit. AFAIK, they are not mutually exclusive, since they are trying to solve different problems:
Device status - a basic device facility.
Admin virtqueue - a transport/device-specific way to implement (part of) the device facility.

Stefan Hajnoczi wrote:
Although you mentioned that the stopped state needs to be reflected in the Device Status field somehow, I'm not sure about that, since the driver typically doesn't need to know whether the device is being migrated. In fact, the VMM would need to hide this bit, and it's safer to keep it out-of-band instead of risking exposing it by accident.

Jason Wang wrote:
The guest won't see the real device status bit; the VMM will shadow the device status bit in this case. E.g. with the current vhost-vDPA, the vDPA device behaves like a vhost device and the guest is unaware of the migration. The STOP status bit is set by QEMU on the real virtio hardware, but the guest will only see DRIVER_OK without STOP. It's not hard to implement the nested case on top; see the discussion initiated by Eugenio about how to expose VIRTIO_F_STOP to the guest for nested live migration. The VMM may choose to hide or expose the capability; it's useful for migrating a nested guest. If we design an interface that can't be used in the nested environment, it's not an ideal interface.

Stefan Hajnoczi wrote:
In addition, stateful devices need to load/save non-trivial amounts of data. They need DMA to do this efficiently, so an admin virtqueue is a good fit again.

Jason Wang wrote:
I don't get the point here. You still need to address exactly the same issues for the admin virtqueue: the unbounded time in freezing the device, and the interaction with the virtio device status state machine.
Stefan Hajnoczi wrote:
Device state can be large, so a register interface would be a bottleneck. DMA is needed. I think a virtqueue is a good fit for saving/loading device state.

Jason Wang wrote:
So this patch doesn't mandate a register interface, does it?

Stefan Hajnoczi wrote:
You're right, not this patch. I mentioned it because your other patch series ("[PATCH] virtio-pci: implement VIRTIO_F_RING_STATE") implements it as a register interface. Yes, virtqueues are a pretty good interface that works across transports (PCI/MMIO/etc.) thanks to the standard vring memory layout.

Jason Wang wrote:
And DMA doesn't mean a virtqueue; it could be a transport-specific method. I think we need to start from defining the state of one specific device and see what is the best interface.

Stefan Hajnoczi wrote:
virtio-blk might be the simplest. I think virtio-net has more device state, and virtio-scsi is definitely more complex than virtio-blk. First we need agreement on whether "device state" encompasses the full state of the device or just state that is unknown to the VMM.

Jason Wang wrote:
I think we've discussed this in the past. It can't work, since:
1) The state and its format must be clearly defined in the spec.
2) We need to maintain migration compatibility and debug-ability.

Stefan Hajnoczi wrote:
Some devices need implementation-specific state. They should still be able to live migrate even if it means cross-implementation migration and debug-ability is not possible.

Jason Wang wrote:
I think we need to revisit this conclusion. Migration compatibility is pretty important, especially considering that the software stack has spent a huge amount of effort in maintaining it. Say a virtio hardware device breaks this: it means we lose all the advantages of being a standard device. If we can't do live migration among:
1) different backends, e.g. migrating from virtio hardware to virtio software;
2) different vendors;
we fail to qualify as a standard device, and the customer is in fact locked to the vendor implicitly.

Stefan Hajnoczi wrote:
My virtiofs device implementation is backed by an in-memory file system; the device state includes the contents of each file. Your virtiofs device implementation uses Linux file handles to keep track of open files; the device state includes Linux file handles (but not the contents of each file), so the destination host can access the same files on shared storage. Cornelia's virtiofs device implementation is backed by an object storage HTTP API; the device state includes API object IDs. The device state is implementation-dependent. There is no standard representation, and it's not possible to migrate between device implementations. This is why I think it's necessary to allow implementation-specific device state representations.

Jason Wang wrote:
How are they supposed to migrate? So if I understand correctly, virtio-fs is not designed to be migratable? (Having a check on the current virtio-fs support in QEMU, it looks to me like it has a migration blocker.) Or you probably mean you don't support cross-backend migration. This sounds like a drawback; it's actually not a standard device but a vendor/implementation-specific device. It would bring a lot of trouble, not only for the implementation but for the management. Maybe we can start by adding migration support for some specific backend and go from there.

Stefan Hajnoczi wrote:
See above. Stateful devices may require an implementation-defined device state representation.

Jason Wang wrote:
3) It is not a proper uAPI design.

Stefan Hajnoczi wrote:
I never understood this argument. The Linux uAPI passes through lots of opaque data from devices to userspace. Allowing an implementation-specific device state representation is nothing new. VFIO already does it.

Jason Wang wrote:
I think we've already had a lot of discussion about VFIO, but without a conclusion. Maybe we need the verdict from Linus or Greg (not sure if it's too late). But that's not related to virtio and this thread. What you propose here conflicts with the efforts of virtio. I think we all agree that we should define the state in the spec. Assuming this is correct:
1) Why do we still offer opaque migration state to userspace?
2) How can it be integrated into the current VMM (QEMU) virtio devices' migration byte stream?
So my point still stands: it's not a standard device if we do this.

Stefan Hajnoczi wrote:
Opaque data like D-Bus VMState: https://qemu.readthedocs.io/en/latest/interop/dbus-vmstate.html

Jason Wang wrote:
Actually, I meant how to keep the opaque state compatible with all the existing devices that can do migration. E.g. we want to live migrate virtio-blk among any backends (from a hardware device to a software backend).

Max Gurtovoy wrote:
I prefer that we handle HW-to-SW migration in the future.
Yes, that's very important and one of the key advantages of virtio.
Max Gurtovoy wrote:
We're still debating other basic stuff.

Jason Wang wrote:
We can't do VFIO full passthrough for migration anyway; some kind of mdev is required, but that duplicates the current vp_vdpa driver.

Stefan Hajnoczi wrote:
That's basically the difference between vhost/vDPA's selective passthrough approach and VFIO's full passthrough approach. I'm not sure that's true. Generic VFIO PCI migration can probably be achieved without mdev:
1. Define a migration PCI Capability that indicates support for VFIO_REGION_TYPE_MIGRATION. This allows the PCI device to implement the migration interface in hardware instead of an mdev driver.

Jason Wang wrote:
So I think it still depends on the driver to implement the migration state, which is vendor-specific.

Stefan Hajnoczi wrote:
The current VFIO migration interface depends on a device-specific software mdev driver, but here I'm showing that the physical device can implement the migration interface so that no device-specific driver code is needed.

Jason Wang wrote:
This is not what I read from the patch:

 * device_state: (read/write)
 *      - The user application writes to this field to inform the vendor driver
 *        about the device state to be transitioned to.
 *      - The vendor driver should take the necessary actions to change the
 *        device state. After successful transition to a given state, the
 *        vendor driver should return success on write(device_state, state)
 *        system call. If the device state transition fails, the vendor driver
 *        should return an appropriate -errno for the fault condition.

The vendor driver needs to mediate between the uAPI and the actual device.

Max Gurtovoy wrote:
We have been building an infrastructure for VFIO PCI devices over the last few months. It should hopefully be merged in kernel 5.15.
Ok.
Jason Wang wrote:
Note that it's just a uAPI definition, not something defined in the PCI spec.

Stefan Hajnoczi wrote:
Yes, that's why I mentioned Changpeng Liu's idea of turning the uAPI into a standard PCI Capability to eliminate the need for device-specific drivers.

Jason Wang wrote:
Ok. Out of curiosity, the patch was merged without any real users in Linux. This is very bad, since we lose the chance to audit the whole design.

Stefan Hajnoczi wrote:
I agree. It would have helped to have a complete vision for how live migration should work, along with demos. I don't see any migration code in samples/vfio-mdev/ :(.

Jason Wang wrote:
Right.

Max Gurtovoy wrote:
Creating a standard is not related to Linux nor VFIO.
I fully agree here.
With the proposal that I've sent, we can develop a migration driver and a virtio device that will support it (an NVIDIA virtio-blk SNAP device). And you can build live migration support into the virtio_vdpa driver (if a vDPA migration protocol is implemented).
Right, vp_vdpa fits naturally for this. But I don't see much value in a dedicated migration driver, do you?
Thanks
Stefan Hajnoczi wrote:
2. The VMM either uses the migration PCI Capability directly from userspace, or core VFIO PCI code advertises VFIO_REGION_TYPE_MIGRATION to userspace so migration can proceed in the same way as with VFIO/mdev drivers.
3. The PCI Capability is not passed through to the guest.

Jason Wang wrote:
This brings trouble in the nested environment.

Stefan Hajnoczi wrote:
It depends on the device splitting/management design. If L0 wishes to let L1 manage the VFs then it would need to expose a management device. Since the migration interface is generic (not device-specific), a generic management device solves this for all devices.

Jason Wang wrote:
Right, but it's a burden to expose the management device, or it may just not work.

Thanks,
Stefan