virtio-comment message

Subject: Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit

From: Jason Wang <jasowang@redhat.com>
To: Stefan Hajnoczi <stefanha@redhat.com>
Date: Thu, 22 Jul 2021 10:08:51 +0800


å 2021/7/21 äå6:42, Stefan Hajnoczi åé:

On Wed, Jul 21, 2021 at 10:52:15AM +0800, Jason Wang wrote:

å 2021/7/20 äå6:19, Stefan Hajnoczi åé:

On Tue, Jul 20, 2021 at 11:02:42AM +0800, Jason Wang wrote:

å 2021/7/19 äå8:43, Stefan Hajnoczi åé:

On Fri, Jul 16, 2021 at 10:03:17AM +0800, Jason Wang wrote:

å 2021/7/15 äå6:01, Stefan Hajnoczi åé:

On Thu, Jul 15, 2021 at 09:35:13AM +0800, Jason Wang wrote:

å 2021/7/14 äå11:07, Stefan Hajnoczi åé:

On Wed, Jul 14, 2021 at 06:29:28PM +0800, Jason Wang wrote:

å 2021/7/14 äå5:53, Stefan Hajnoczi åé:

On Tue, Jul 13, 2021 at 08:16:35PM +0800, Jason Wang wrote:

å 2021/7/13 äå6:00, Stefan Hajnoczi åé:

On Tue, Jul 13, 2021 at 11:27:03AM +0800, Jason Wang wrote:

å 2021/7/12 äå5:57, Stefan Hajnoczi åé:

On Mon, Jul 12, 2021 at 12:00:39PM +0800, Jason Wang wrote:

å 2021/7/11 äå4:36, Michael S. Tsirkin åé:

On Fri, Jul 09, 2021 at 07:23:33PM +0200, Eugenio Perez Martin wrote:

           If I understand correctly, this is all
driven from the driver inside the guest, so for this to work
the guest must be running and already have initialised the driver.

Yes.

As I see it, the feature can be driven entirely by the VMM as long as
it intercept the relevant configuration space (PCI, MMIO, etc) from
guest's reads and writes, and present it as coherent and transparent
for the guest. Some use cases I can imagine with a physical device (or
vp_vpda device) with VIRTIO_F_STOP:

1) The VMM chooses not to pass the feature flag. The guest cannot stop
the device, so any write to this flag is an error/undefined.
2) The VMM passes the flag to the guest. The guest can stop the device.
2.1) The VMM stops the device to perform a live migration, and the
guest does not write to STOP in any moment of the LM. It resets the
destination device with the state, and then initializes the device.
2.2) The guest stops the device and, when STOP(32) is set, the source
VMM migrates the device status. The destination VMM realizes the bit,
so it sets the bit in the destination too after device initialization.
2.3) The device is not initialized by the guest so it doesn't matter
what bit has the HW, but the VM can be migrated.

Am I missing something?

Thanks!

It's doable like this. It's all a lot of hoops to jump through though.
It's also not easy for devices to implement.

It just requires a new status bit. Anything that makes you think it's hard
to implement?

E.g for networking device, it should be sufficient to use this bit + the
virtqueue state.

Why don't we design the feature in a way that is useable by VMMs
and implementable by devices in a simple way?

It use the common technology like register shadowing without any further
stuffs.

Or do you have any other ideas?

(I think we all know migration will be very hard if we simply pass through
those state registers).

If an admin virtqueue is used instead of the STOP Device Status field
bit then there's no need to re-read the Device Status field in a loop
until the device has stopped.

Probably not. Let me to clarify several points:

- This proposal has nothing to do with admin virtqueue. Actually, admin
virtqueue could be used for carrying any basic device facility like status
bit. E.g I'm going to post patches that use admin virtqueue as a "transport"
for device slicing at virtio level.
- Even if we had introduced admin virtqueue, we still need a per function
interface for this. This is a must for nested virtualization, we can't
always expect things like PF can be assigned to L1 guest.
- According to the proposal, there's no need for the device to complete all
the consumed buffers, device can choose to expose those inflight descriptors
in a device specific way and set the STOP bit. This means, if we have the
device specific in-flight descriptor reporting facility, the device can
almost set the STOP bit immediately.
- If we don't go with the basic device facility but using the admin
virtqueue specific method, we still need to clarify how it works with the
device status state machine, it will be some kind of sub-states which looks
much more complicated than the current proposal.

When migrating a guest with many VIRTIO devices a busy waiting approach
extends downtime if implemented sequentially (stopping one device at a
time).

Well. You need some kinds of waiting for sure, the device/DMA needs sometime
to be stopped. The downtime is determined by a specific virtio
implementation which is hard to be restricted at the spec level. We can
clarify that the device must set the STOP bit in e.g 100ms.

         It can be implemented concurrently (setting the STOP bit on all
devices and then looping until all their Device Status fields have the
bit set), but this becomes more complex to implement.

I still don't get what kind of complexity did you worry here.

I'm a little worried about adding a new bit that requires busy
waiting...

Busy wait is not something that is introduced in this patch:

4.1.4.3.2 Driver Requirements: Common configuration structure layout

After writing 0 to device_status, the driver MUST wait for a read of
device_status to return 0 before reinitializing the device.

Since it was required for at least one transport. We need do something
similar to when introducing basic facility.

Adding the STOP but as a Device Status bit is a small and clean VIRTIO
spec change. I like that.

On the other hand, devices need time to stop and that time can be
unbounded. For example, software virtio-blk/scsi implementations since
cannot immediately cancel in-flight I/O requests on Linux hosts.

The natural interface for long-running operations is virtqueue requests.
That's why I mentioned the alternative of using an admin virtqueue
instead of a Device Status bit.

So I'm not against the admin virtqueue. As said before, admin virtqueue
could be used for carrying the device status bit.

Send a command to set STOP status bit to admin virtqueue. Device will make
the command buffer used after it has successfully stopped the device.

AFAIK, they are not mutually exclusive, since they are trying to solve
different problems.

Device status - basic device facility

Admin virtqueue - transport/device specific way to implement (part of) the
device facility

Although you mentioned that the stopped state needs to be reflected in
the Device Status field somehow, I'm not sure about that since the
driver typically doesn't need to know whether the device is being
migrated.

The guest won't see the real device status bit. VMM will shadow the device
status bit in this case.

E.g with the current vhost-vDPA, vDPA behave like a vhost device, guest is
unaware of the migration.

STOP status bit is set by Qemu to real virtio hardware. But guest will only
see the DRIVER_OK without STOP.

It's not hard to implement the nested on top, see the discussion initiated
by Eugenio about how expose VIRTIO_F_STOP to guest for nested live
migration.

        In fact, the VMM would need to hide this bit and it's safer to
keep it out-of-band instead of risking exposing it by accident.

See above, VMM may choose to hide or expose the capability. It's useful for
migrating a nested guest.

If we design an interface that can be used in the nested environment, it's
not an ideal interface.

In addition, stateful devices need to load/save non-trivial amounts of
data. They need DMA to do this efficiently, so an admin virtqueue is a
good fit again.

I don't get the point here. You still need to address the exact the similar
issues for admin virtqueue: the unbound time in freezing the device, the
interaction with the virtio device status state machine.

Device state state can be large so a register interface would be a
bottleneck. DMA is needed. I think a virtqueue is a good fit for
saving/loading device state.

So this patch doesn't mandate a register interface, isn't it?

You're right, not this patch. I mentioned it because your other patch
series ("[PATCH] virtio-pci: implement VIRTIO_F_RING_STATE") implements
it a register interface.

And DMA
doesn't means a virtqueue, it could be a transport specific method.

Yes, although virtqueues are a pretty good interface that works across
transports (PCI/MMIO/etc) thanks to the standard vring memory layout.

I think we need to start from defining the state of one specific device and
see what is the best interface.

virtio-blk might be the simplest. I think virtio-net has more device
state and virtio-scsi is definitely more complext than virtio-blk.

First we need agreement on whether "device state" encompasses the full
state of the device or just state that is unknown to the VMM.

I think we've discussed this in the past. It can't work since:

1) The state and its format must be clearly defined in the spec
2) We need to maintain migration compatibility and debug-ability

Some devices need implementation-specific state. They should still be
able to live migrate even if it means cross-implementation migration and
debug-ability is not possible.

I think we need to re-visit this conclusion. Migration compatibility is
pretty important, especially consider the software stack has spent a huge
mount of effort in maintaining them.

Say a virtio hardware would break this, this mean we will lose all the
advantages of being a standard device.

If we can't do live migration among:

1) different backends, e.g migrate from virtio hardware to migrate software
2) different vendors

We failed to say as a standard device and the customer is in fact locked by
the vendor implicitly.

My virtiofs device implementation is backed by an in-memory file system.
The device state includes the contents of each file.

Your virtiofs device implementation uses Linux file handles to keep
track of open files. The device state includes Linux file handles (but
not the contents of each file) so the destination host can access the
same files on shared storage.

Cornelia's virtiofs device implementation is backed by an object storage
HTTP API. The device state includes API object IDs.

The device state is implementation-dependent. There is no standard
representation and it's not possible to migrate between device
implementations. How are they supposed to migrate?

So if I understand correclty, virtio-fs is not desigined to be migrate-able?

(Having a check on the current virtio-fs support in qemu, it looks to me it
has a migration blocker).

The code does not support live migration but it's on the roadmap. Max
Reitz added Linux file handle support to virtiofsd. That was the first
step towards being able to migrate the device's state.


A dumb question, how do qemu know it is connected to virtiofsd?

virtiofsd is a vhost-user-fs device. QEMU doesn't know if it's connected
to virtiofsd or another implementation.

That's my understanding. So this answers my questions basically: therecould be a common migration implementation for each virtio-fs devicewhich implies that we only need to migrate the common device specificstate but not implementation specific state.

This is why I think it's necessarily to allow implementation-specific
device state representations.

Or you probably mean you don't support cross backend migration. This sounds
like a drawback and it's actually not a standard device but a
vendor/implementation specific device.

It would bring a lot of troubles, not only for the implementation but for
the management. Maybe we can start from adding the support of migration for
some specific backend and start from there.

Yes, it's complicated. Some implementations could be compatible, but
others can never be compatible because they have completely different
state.

The virtiofsd implementation is the main one for virtiofs and the device
state representation can be published, even standardized. Others can
implement it to achieve migration compatibility.

But it must be possible for implementations that have completely
different state to migrate too. virtiofsd isn't special.

3) Not a proper uAPI desgin

I never understood this argument. The Linux uAPI passes through lots of
opaque data from devices to userspace. Allowing an
implementation-specific device state representation is nothing new. VFIO
already does it.

I think we've already had a lots of discussion for VFIO but without a
conclusion. Maybe we need the verdict from Linus or Greg (not sure if it's
too late). But that's not related to virito and this thread.

What you propose here is kind of conflict with the efforts of virtio. I
think we all aggree that we should define the state in the spec. Assuming
this is correct:

1) why do we still offer opaque migration state to userspace?

See above. Stateful devices may require an implementation-defined device
state representation.

So my point stand still, it's not a standard device if we do this.

These "non-standard devices" still need to be able to migrate.


See other thread, it breaks the effort of having a spec.

   How
should we do that?


I think the main issue is that, to me it's not a virtio device but a device
that is using virtio queue with implementation specific state. So it can't
be migrated by the virtio subsystem but through a vendor/implementation
specific migration driver.

Okay. Are you thinking about a separate set of vDPA APIs and vhost
ioctls so the VMM can save/load implementation-specific device state?

Probably not, I think the question is can we define the virtio-fs devicestate in the spec? If yes (and I think the answer is yes), we're fine.If not, it looks like we need to improve the spec or design.

These separate APIs just need to be called as part of the standard
VIRTIO stop and vq save/load lifecycle.



Yes, if they are virtio standard, we need to invent them.

2) how can it be integrated into the current VMM (Qemu) virtio devices'
migration bytes stream?

Opaque data like D-Bus VMState:
https://qemu.readthedocs.io/en/latest/interop/dbus-vmstate.html

Actually, I meant how to keep the opaque state which is compatible with all
the existing device that can do migration.

E.g we want to live migration virtio-blk among any backends (from a hardware
device to a software backend).

There was a series of email threads last year where migration
compatibility was discussed:

https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg02620.html

I proposed an algorithm for checking migration compatibility between
devices. The source and destination device can have different
implementations (e.g. hardware, software, etc).

It involves picking an identifier like virtio-spec.org/pci/virtio-net
for the device state representation and device parameters for aspects of
the device that vary between instances (e.g. tso=on|off).

It's more complex than today's live migration approach in libvirt and
QEMU. Today libvirt configures the source and destination in a
compatible manner (thanks to knowledge of the device implementation) and
then QEMU transfers the device state.

Part of the point of defining a migration compatibility algorithm is
that it's possible to lift the assumptions out of libvirt so that
arbitrary device implementations can be supported (hardware, software,
etc) without putting knowledge about every device/VMM implementation
into libvirt.

(The other advantage is that this allows orchestration software to
determine migration compatibility before starting a migration.)


This looks like another independent issues and I fully agree to have a
better migration protocol. But using that means we break the migration
compatibility with the existing device which is used for more than 10 years.
We still need to make the migration from/to the existing virtio device to
work.

I agree that migrating to/from existing devices needs to work. It should
be possible to transition without breaking migration.

That's
basically the difference between the vhost/vDPA's selective passthrough
approach and VFIO's full passthrough approach.

We can't do VFIO full pasthrough for migration anyway, some kind of mdev is
required but it's duplicated with the current vp_vdpa driver.

I'm not sure that's true. Generic VFIO PCI migration can probably be
achieved without mdev:
1. Define a migration PCI Capability that indicates support for
       VFIO_REGION_TYPE_MIGRATION. This allows the PCI device to implement
       the migration interface in hardware instead of an mdev driver.

So I think it still depend on the driver to implement migrate state which is
vendor specific.

The current VFIO migration interface depends on a device-specific
software mdev driver but here I'm showing that the physical device can
implement the migration interface so that no device-specific driver code
is needed.

This is not what I read from the patch:

  Â* device_state: (read/write)
  Â*ÂÂÂÂÂ - The user application writes to this field to inform the vendor
driver
  Â*ÂÂÂÂÂÂÂ about the device state to be transitioned to.
  Â*ÂÂÂÂÂ - The vendor driver should take the necessary actions to change the
  Â*ÂÂÂÂÂÂÂ device state. After successful transition to a given state, the
  Â*ÂÂÂÂÂÂÂ vendor driver should return success on write(device_state, state)
  Â*ÂÂÂÂÂÂÂ system call. If the device state transition fails, the vendor
driver
  Â*ÂÂÂÂÂÂÂ should return an appropriate -errno for the fault condition.

Vendor driver need to mediate between the uAPI and the actual device.

Yes, that's the current state of VFIO migration. If a hardware interface
(e.g. PCI Capability) is defined that maps to this API then no
device-specific drivers would be necessary because core VFIO PCI code
can implement the uAPI by talking to the hardware.


As we discussed, it would be very hard. The device state is implementation
specific which may not fit for the Capability. (PCIE has already had VF
migration state in the SR-IOV extended capability).

2. The VMM either uses the migration PCI Capability directly from
       userspace or core VFIO PCI code advertises VFIO_REGION_TYPE_MIGRATION
       to userspace so migration can proceed in the same way as with
       VFIO/mdev drivers.
3. The PCI Capability is not passed through to the guest.

This brings troubles in the nested environment.

It depends on the device splitting/management design. If L0 wishes to
let L1 manage the VFs then it would need to expose a management device.
Since the migration interface is generic (not device-specific) a generic
management device solves this for all devices.

Right, but it's a burden to expose the management device or it may just
won't work.

A single generic management device is not a huge burden and it may turn
out that keeping the management device out-of-band is actually a
desirable feature if the device owner does not wish to expose the
stop/save/load functionality for some reason.


VMM are free to hide those features from guest. Management can just do
-device virtio-pci,state=false

Having management device works for L0 but not suitable for L(x>0). A per
function device interface is a must for the nested virt to work in a simple
and easy way.

You are right, a per function interface is simplest. I'm not experienced
enough with SR-IOV and nested virtualization to have a strong opinion in
this area.

Yes, and they can co-exist, the admin virtqueue works for L0, but weneed hide those via per function API for nested.

That's why I start from proposing the basic facility instead of anactual transport (PCI or admin virtqueue) implementation.


Thanks


Stefan

Follow-Ups:
- Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  - From: Stefan Hajnoczi <stefanha@redhat.com>

References:
- Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  - From: Stefan Hajnoczi <stefanha@redhat.com>
- Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  - From: Jason Wang <jasowang@redhat.com>
- Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  - From: Stefan Hajnoczi <stefanha@redhat.com>
- Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  - From: Jason Wang <jasowang@redhat.com>
- Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  - From: Stefan Hajnoczi <stefanha@redhat.com>
- Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  - From: Jason Wang <jasowang@redhat.com>
- Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  - From: Stefan Hajnoczi <stefanha@redhat.com>
- Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  - From: Jason Wang <jasowang@redhat.com>
- Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  - From: Stefan Hajnoczi <stefanha@redhat.com>
- Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  - From: Jason Wang <jasowang@redhat.com>
- Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit
  - From: Stefan Hajnoczi <stefanha@redhat.com>