virtio-dev message

Subject: Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit

From: Jason Wang <jasowang@redhat.com>
To: Max Gurtovoy <mgurtovoy@nvidia.com>, Stefan Hajnoczi <stefanha@redhat.com>
Date: Wed, 21 Jul 2021 11:09:20 +0800


On 2021/7/20 8:27, Max Gurtovoy wrote:

On 7/20/2021 6:02 AM, Jason Wang wrote:
On 2021/7/19 8:43, Stefan Hajnoczi wrote:
On Fri, Jul 16, 2021 at 10:03:17AM +0800, Jason Wang wrote:
On 2021/7/15 6:01, Stefan Hajnoczi wrote:
On Thu, Jul 15, 2021 at 09:35:13AM +0800, Jason Wang wrote:
On 2021/7/14 11:07, Stefan Hajnoczi wrote:
On Wed, Jul 14, 2021 at 06:29:28PM +0800, Jason Wang wrote:
On 2021/7/14 5:53, Stefan Hajnoczi wrote:
On Tue, Jul 13, 2021 at 08:16:35PM +0800, Jason Wang wrote:
On 2021/7/13 6:00, Stefan Hajnoczi wrote:
On Tue, Jul 13, 2021 at 11:27:03AM +0800, Jason Wang wrote:
On 2021/7/12 5:57, Stefan Hajnoczi wrote:
On Mon, Jul 12, 2021 at 12:00:39PM +0800, Jason Wang wrote:
On 2021/7/11 4:36, Michael S. Tsirkin wrote:
On Fri, Jul 09, 2021 at 07:23:33PM +0200, Eugenio Perez Martin wrote:
If I understand correctly, this is all
driven from the driver inside the guest, so for this to work the guest must be running and already have initialised the driver.
Yes.
As I see it, the feature can be driven entirely by the VMM as long as it intercepts the relevant configuration space (PCI, MMIO, etc.) from the guest's reads and writes, and presents it as coherent and transparent for the guest. Some use cases I can imagine with a physical device (or
vp_vdpa device) with VIRTIO_F_STOP:
1) The VMM chooses not to pass the feature flag. The guest cannot stop the device, so any write to this flag is an error/undefined.
2) The VMM passes the flag to the guest. The guest can stop the device.
2.1) The VMM stops the device to perform a live migration, and the guest does not write to STOP at any moment of the LM. It resets the destination device with the state, and then initializes the device.
2.2) The guest stops the device and, when STOP(32) is set, the source VMM migrates the device status. The destination VMM realizes the bit, so it sets the bit in the destination too after device initialization.
2.3) The device is not initialized by the guest so it doesn't matter
what bit the HW has, but the VM can be migrated.

Am I missing something?

Thanks!
It's doable like this. It's all a lot of hoops to jump through though.
It's also not easy for devices to implement.
It just requires a new status bit. Anything that makes you think it's hard
to implement?
E.g. for a networking device, it should be sufficient to use this bit + the
virtqueue state.
Why don't we design the feature in a way that is usable by VMMs
and implementable by devices in a simple way?
It uses common techniques like register shadowing without any further
stuff.

Or do you have any other ideas?
(I think we all know migration will be very hard if we simply pass through
those state registers).
If an admin virtqueue is used instead of the STOP Device Status field bit then there's no need to re-read the Device Status field in a loop
until the device has stopped.
Probably not. Let me clarify several points:
- This proposal has nothing to do with the admin virtqueue. Actually, the admin virtqueue could be used for carrying any basic device facility like the status bit. E.g. I'm going to post patches that use the admin virtqueue as a "transport"
for device slicing at the virtio level.
- Even if we had introduced the admin virtqueue, we would still need a per-function interface for this. This is a must for nested virtualization; we can't
always expect that things like the PF can be assigned to the L1 guest.
- According to the proposal, there's no need for the device to complete all the consumed buffers; the device can choose to expose those in-flight descriptors in a device-specific way and set the STOP bit. This means that, if we have the device-specific in-flight descriptor reporting facility, the device can
set the STOP bit almost immediately.
- If we don't go with the basic device facility but use an admin-virtqueue-specific method, we still need to clarify how it works with the device status state machine; it will be some kind of sub-states, which looks
much more complicated than the current proposal.
When migrating a guest with many VIRTIO devices a busy waiting approach extends downtime if implemented sequentially (stopping one device at a
time).
Well. You need some kind of waiting for sure; the device/DMA needs some time
to be stopped. The downtime is determined by a specific virtio
implementation, which is hard to restrict at the spec level. We can
clarify that the device must set the STOP bit within e.g. 100ms.
It can be implemented concurrently (setting the STOP bit on all devices and then looping until all their Device Status fields have the
bit set), but this becomes more complex to implement.
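To make the concurrent variant concrete, here is a minimal sketch in Python rather than real driver or VMM code. `FakeDevice`, its `polls_to_stop` parameter, and `stop_all` are hypothetical names made up for illustration; the only values taken from the thread and the spec are STOP as bit 32 of the Device Status field and DRIVER_OK as bit 4. The timeout is an assumption, since the proposal under discussion does not define one.

```python
import time

DRIVER_OK = 4   # Device Status bit defined by the virtio spec
STOP = 32       # value quoted in this thread as STOP(32)


class FakeDevice:
    """Hypothetical device that needs a few status reads before it stops."""

    def __init__(self, polls_to_stop):
        self.status = DRIVER_OK
        self._stop_requested = False
        self._remaining = polls_to_stop

    def write_status(self, value):
        # The device only records the request; it sets STOP itself later.
        if value & STOP:
            self._stop_requested = True
        self.status = value & ~STOP

    def read_status(self):
        # Each read models the device making some progress toward stopping.
        if self._stop_requested:
            if self._remaining > 0:
                self._remaining -= 1
            else:
                self.status |= STOP
        return self.status


def stop_all(devices, timeout_s=1.0, poll_interval_s=0.0):
    """Request STOP on every device first, then poll all of them together."""
    for dev in devices:
        dev.write_status(dev.read_status() | STOP)

    pending = list(devices)
    deadline = time.monotonic() + timeout_s
    while pending:
        # Keep only the devices that have not reported STOP yet.
        pending = [d for d in pending if not (d.read_status() & STOP)]
        if pending and time.monotonic() >= deadline:
            raise TimeoutError(f"{len(pending)} device(s) did not stop")
        if pending:
            time.sleep(poll_interval_s)
```

Because every device receives the request before any polling starts, the total wait in this sketch is bounded by the slowest device rather than by the sum over all devices.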
I still don't get what kind of complexity you are worried about here.
I'm a little worried about adding a new bit that requires busy
waiting...
Busy wait is not something that is introduced in this patch:
4.1.4.3.2 Driver Requirements: Common configuration structure layout
After writing 0 to device_status, the driver MUST wait for a read of
device_status to return 0 before reinitializing the device.
Since it was required for at least one transport, we need to do something
similar when introducing a basic facility.
Adding the STOP bit as a Device Status bit is a small and clean VIRTIO
spec change. I like that.
On the other hand, devices need time to stop and that time can be unbounded. For example, software virtio-blk/scsi implementations cannot immediately cancel in-flight I/O requests on Linux hosts.
The natural interface for long-running operations is virtqueue requests. That's why I mentioned the alternative of using an admin virtqueue
instead of a Device Status bit.
So I'm not against the admin virtqueue. As said before, the admin virtqueue
could be used for carrying the device status bit.
Send a command to set the STOP status bit to the admin virtqueue. The device will make the command buffer used after it has successfully stopped the device.
AFAIK, they are not mutually exclusive, since they are trying to solve
different problems.

Device status - basic device facility
Admin virtqueue - transport/device specific way to implement (part of) the
device facility
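As a rough illustration of how the admin virtqueue could carry the status bit, the sketch below models a "set STOP" command whose buffer is only marked used once the device has stopped, so the submitter waits on a completion instead of re-reading the status field. Everything here is hypothetical: `AdminQueueDevice`, `admin_submit_stop`, `tick`, and `stop_via_admin_queue` are invented names, and no admin command format is defined in this thread. A `concurrent.futures.Future` stands in for the used-buffer notification.

```python
from concurrent.futures import Future

DRIVER_OK = 4   # Device Status bit defined by the virtio spec
STOP = 32       # value quoted in this thread as STOP(32)


class AdminQueueDevice:
    """Hypothetical device that completes a STOP command once it is idle."""

    def __init__(self, inflight):
        self.status = DRIVER_OK
        self.inflight = inflight      # in-flight requests still to drain
        self._stop_cmd = None

    def admin_submit_stop(self):
        # Queue the (made-up) "set STOP" command and return its completion.
        self._stop_cmd = Future()
        return self._stop_cmd

    def tick(self):
        # One unit of device-side work: drain a request, or finish stopping.
        if self.inflight > 0:
            self.inflight -= 1
        elif self._stop_cmd is not None and not self._stop_cmd.done():
            self.status |= STOP
            # Marking the command buffer "used" is modelled as completion.
            self._stop_cmd.set_result(self.status)


def stop_via_admin_queue(dev, max_ticks=1000):
    """Submit the STOP command and run the device until it completes."""
    completion = dev.admin_submit_stop()
    for _ in range(max_ticks):
        if completion.done():
            return completion.result()
        dev.tick()
    raise TimeoutError("STOP command did not complete")
```

In this model the status bit and the admin command are two views of the same facility: the command completes exactly when the device sets STOP in its status.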
Although you mentioned that the stopped state needs to be reflected in the Device Status field somehow, I'm not sure about that since the driver typically doesn't need to know whether the device is being
migrated.
The guest won't see the real device status bit. The VMM will shadow the device
status bit in this case.
E.g. with the current vhost-vDPA, vDPA behaves like a vhost device; the guest is
unaware of the migration.
The STOP status bit is set by Qemu on the real virtio hardware. But the guest will only
see DRIVER_OK without STOP.
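The shadowing described above can be sketched as a small model. This is only an illustration under stated assumptions, not Qemu code: `ShadowedStatus` and its methods are hypothetical names, reset handling is left out, and the `expose_stop` switch stands for the VMM's choice of whether to offer the feature to the guest (the nested case).

```python
DRIVER_OK = 4   # Device Status bit defined by the virtio spec
STOP = 32       # value quoted in this thread as STOP(32)


class ShadowedStatus:
    """Hypothetical VMM-side shadow of one device's status register."""

    def __init__(self, expose_stop=False):
        self.hw_status = 0            # what the real device holds
        self.expose_stop = expose_stop

    def guest_read(self):
        # Hide STOP from the guest unless the VMM chose to expose it.
        if self.expose_stop:
            return self.hw_status
        return self.hw_status & ~STOP

    def guest_write(self, value):
        if not self.expose_stop:
            # Drop the guest's STOP bit and keep the VMM-owned one.
            value = (value & ~STOP) | (self.hw_status & STOP)
        self.hw_status = value

    def vmm_stop(self):
        # The VMM stops the real device, e.g. ahead of live migration.
        self.hw_status |= STOP
```

With `expose_stop=False` the guest keeps reading DRIVER_OK while the hardware holds DRIVER_OK | STOP; with `expose_stop=True` the bit passes straight through.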
It's not hard to implement nesting on top; see the discussion initiated by Eugenio about how to expose VIRTIO_F_STOP to the guest for nested live
migration.
In fact, the VMM would need to hide this bit and it's safer to
keep it out-of-band instead of risking exposing it by accident.
See above, the VMM may choose to hide or expose the capability. It's useful for
migrating a nested guest.
If we design an interface that can't be used in the nested environment, it's
not an ideal interface.
In addition, stateful devices need to load/save non-trivial amounts of data. They need DMA to do this efficiently, so an admin virtqueue is a
good fit again.
I don't get the point here. You still need to address exactly the same issues for the admin virtqueue: the unbounded time in freezing the device, the
interaction with the virtio device status state machine.
Device state can be large so a register interface would be a
bottleneck. DMA is needed. I think a virtqueue is a good fit for
saving/loading device state.
So this patch doesn't mandate a register interface, does it?
You're right, not this patch. I mentioned it because your other patch series ("[PATCH] virtio-pci: implement VIRTIO_F_RING_STATE") implements
it as a register interface.
And DMA
doesn't mean a virtqueue; it could be a transport-specific method.
Yes, although virtqueues are a pretty good interface that works across transports (PCI/MMIO/etc) thanks to the standard vring memory layout.
I think we need to start from defining the state of one specific device and
see what is the best interface.
virtio-blk might be the simplest. I think virtio-net has more device
state and virtio-scsi is definitely more complex than virtio-blk.
First we need agreement on whether "device state" encompasses the full
state of the device or just state that is unknown to the VMM.
I think we've discussed this in the past. It can't work since:

1) The state and its format must be clearly defined in the spec
2) We need to maintain migration compatibility and debug-ability
Some devices need implementation-specific state. They should still be
able to live migrate even if it means cross-implementation migration and
debug-ability is not possible.
I think we need to re-visit this conclusion. Migration compatibility is pretty important, especially considering the software stack has spent a huge
amount of effort in maintaining it.

Say a virtio hardware device breaks this; that means we will lose all the
advantages of being a standard device.

If we can't do live migration among:
1) different backends, e.g. migrate from virtio hardware to software
2) different vendors
We fail to stay a standard device and the customer is in fact implicitly locked in by
the vendor.
My virtiofs device implementation is backed by an in-memory filesystem.
The device state includes the contents of each file.

Your virtiofs device implementation uses Linux file handles to keep
track of open files. The device state includes Linux file handles (but
not the contents of each file) so the destination host can access the
same files on shared storage.
Cornelia's virtiofs device implementation is backed by an object storage
HTTP API. The device state includes API object IDs.

The device state is implementation-dependent. There is no standard
representation and it's not possible to migrate between device
implementations. How are they supposed to migrate?
So if I understand correctly, virtio-fs is not designed to be migratable?
(Having checked the current virtio-fs support in qemu, it looks to me like it has a migration blocker.)
This is why I think it's necessary to allow implementation-specific
device state representations.
Or you probably mean you don't support cross-backend migration. This sounds like a drawback, and it's actually not a standard device but a vendor/implementation-specific device.
It would bring a lot of trouble, not only for the implementation but for the management. Maybe we can start by adding migration support for some specific backend and go from there.
3) Not a proper uAPI design
I never understood this argument. The Linux uAPI passes through lots of
opaque data from devices to userspace. Allowing an
implementation-specific device state representation is nothing new. VFIO
already does it.
I think we've already had a lot of discussion for VFIO but without a
conclusion. Maybe we need the verdict from Linus or Greg (not sure if it's
too late). But that's not related to virtio and this thread.
What you propose here kind of conflicts with the efforts of virtio. I think we all agree that we should define the state in the spec. Assuming
this is correct:

1) why do we still offer opaque migration state to userspace?
See above. Stateful devices may require an implementation-defined device
state representation.
So my point still stands: it's not a standard device if we do this.
2) how can it be integrated into the current VMM (Qemu) virtio devices'
migration byte stream?
Opaque data like D-Bus VMState:
https://qemu.readthedocs.io/en/latest/interop/dbus-vmstate.html
Actually, I meant how to keep the opaque state compatible with all the existing devices that can do migration.
E.g. we want to live migrate virtio-blk among any backends (from a hardware device to a software backend).
I prefer we'll handle HW to SW migration in the future.



Yes, that's very important and one of the key advantages of virtio.

We're still debating on other basic stuff.
That's
basically the difference between the vhost/vDPA's selective passthrough
approach and VFIO's full passthrough approach.
We can't do VFIO full passthrough for migration anyway; some kind of mdev is
required but it duplicates the current vp_vdpa driver.
I'm not sure that's true. Generic VFIO PCI migration can probably be
achieved without mdev:
1. Define a migration PCI Capability that indicates support for
   VFIO_REGION_TYPE_MIGRATION. This allows the PCI device to implement
   the migration interface in hardware instead of an mdev driver.
So I think it still depends on the driver to implement migration state, which is
vendor specific.
The current VFIO migration interface depends on a device-specific
software mdev driver but here I'm showing that the physical device can
implement the migration interface so that no device-specific driver code
is needed.
This is not what I read from the patch:

 * device_state: (read/write)
 *      - The user application writes to this field to inform the vendor driver
 *        about the device state to be transitioned to.
 *      - The vendor driver should take the necessary actions to change the
 *        device state. After successful transition to a given state, the
 *        vendor driver should return success on write(device_state, state)
 *        system call. If the device state transition fails, the vendor driver
 *        should return an appropriate -errno for the fault condition.

The vendor driver needs to mediate between the uAPI and the actual device.
We've been building an infrastructure for VFIO PCI devices over the last few months.
It should hopefully be merged in kernel 5.15.

Ok.

Note that it's just a uAPI definition, not something defined in the PCI
spec.
Yes, that's why I mentioned Changpeng Liu's idea to turn the uAPI into a
standard PCI Capability to eliminate the need for device-specific
drivers.
Ok.
Out of curiosity, the patch was merged without any real users in Linux.
This is very bad since we lose the chance to audit the whole design.
I agree. It would have helped to have a complete vision for how live
migration should work along with demos. I don't see any migration code
in samples/vfio-mdev/ :(.
Right.
Creating a standard is not related to Linux nor VFIO.



I fully agree here.

With the proposal that I've sent, we can develop a migration driver and a virtio device that will support it (NVIDIA virtio-blk SNAP device).
And you can build live migration support in the virtio_vdpa driver (if a VDPA migration protocol is implemented).

Right, vp_vdpa fits naturally for this. But I don't see much value in a dedicated migration driver, do you?


Thanks

2. The VMM either uses the migration PCI Capability directly from
   userspace or core VFIO PCI code advertises VFIO_REGION_TYPE_MIGRATION
   to userspace so migration can proceed in the same way as with
   VFIO/mdev drivers.
3. The PCI Capability is not passed through to the guest.
This brings troubles in the nested environment.
It depends on the device splitting/management design. If L0 wishes to
let L1 manage the VFs then it would need to expose a management device.
Since the migration interface is generic (not device-specific) a generic
management device solves this for all devices.
Right, but it's a burden to expose the management device, or it may just not work.
Thanks
Stefan
