virtio-dev message



Subject: Re: [virtio-comment] [PATCH V2 2/2] virtio: introduce STOP status bit


On Thu, Jul 22, 2021 at 03:33:10PM +0800, Jason Wang wrote:
> 
> On 2021/7/21 6:20, Stefan Hajnoczi wrote:
> > On Wed, Jul 21, 2021 at 10:29:17AM +0800, Jason Wang wrote:
> > > On 2021/7/20 4:50, Stefan Hajnoczi wrote:
> > > > On Tue, Jul 20, 2021 at 11:04:55AM +0800, Jason Wang wrote:
> > > > > On 2021/7/19 8:45, Stefan Hajnoczi wrote:
> > > > > > On Fri, Jul 16, 2021 at 11:53:13AM +0800, Jason Wang wrote:
> > > > > > > On 2021/7/16 10:03, Jason Wang wrote:
> > > > > > > > On 2021/7/15 6:01, Stefan Hajnoczi wrote:
> > > > > > > > > On Thu, Jul 15, 2021 at 09:35:13AM +0800, Jason Wang wrote:
> > > > > > > > > > On 2021/7/14 11:07, Stefan Hajnoczi wrote:
> > > > > > > > > > > On Wed, Jul 14, 2021 at 06:29:28PM +0800, Jason Wang wrote:
> > > > > > > > > > > > On 2021/7/14 5:53, Stefan Hajnoczi wrote:
> > > > > > > > > > > > > On Tue, Jul 13, 2021 at 08:16:35PM +0800, Jason Wang wrote:
> > > > > > > > > > > > > > On 2021/7/13 6:00, Stefan Hajnoczi wrote:
> > > > > > > > > > > > > > > On Tue, Jul 13, 2021 at 11:27:03AM +0800, Jason Wang wrote:
> > > > > > > > > > > > > > > > On 2021/7/12 5:57, Stefan Hajnoczi wrote:
> > > > > > > > > > > > > > > > > On Mon, Jul 12, 2021 at 12:00:39PM +0800, Jason Wang wrote:
> > > > > > > > > > > > > > > > > > On 2021/7/11 4:36, Michael S. Tsirkin wrote:
> > > > > > > > > > > > > > > > > > > On Fri, Jul 09, 2021 at 07:23:33PM +0200, Eugenio Perez Martin wrote:
> > > > > > > > > > > > > > > > > > > > > > If I understand correctly, this is all
> > > > > > > > > > > > > > > > > > > > > > driven from the
> > > > > > > > > > > > > > > > > > > > > > driver inside
> > > > > > > > > > > > > > > > > > > > > > the guest, so
> > > > > > > > > > > > > > > > > > > > > > for this to work
> > > > > > > > > > > > > > > > > > > > > > the guest must
> > > > > > > > > > > > > > > > > > > > > > be running and
> > > > > > > > > > > > > > > > > > > > > > already have
> > > > > > > > > > > > > > > > > > > > > > initialised the
> > > > > > > > > > > > > > > > > > > > > > driver.
> > > > > > > > > > > > > > > > > > > > > Yes.
> > > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > As I see it, the feature
> > > > > > > > > > > > > > > > > > > > can be driven entirely
> > > > > > > > > > > > > > > > > > > > by the VMM as long as
> > > > > > > > > > > > > > > > > > > > it intercepts the
> > > > > > > > > > > > > > > > > > > > relevant configuration
> > > > > > > > > > > > > > > > > > > > space (PCI, MMIO, etc)
> > > > > > > > > > > > > > > > > > > > from
> > > > > > > > > > > > > > > > > > > > guest's reads and
> > > > > > > > > > > > > > > > > > > > writes, and present it
> > > > > > > > > > > > > > > > > > > > as coherent and
> > > > > > > > > > > > > > > > > > > > transparent
> > > > > > > > > > > > > > > > > > > > for the guest. Some use
> > > > > > > > > > > > > > > > > > > > cases I can imagine with
> > > > > > > > > > > > > > > > > > > > a physical device (or
> > > > > > > > > > > > > > > > > > > > vp_vdpa device) with VIRTIO_F_STOP:
> > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > 1) The VMM chooses not
> > > > > > > > > > > > > > > > > > > > to pass the feature
> > > > > > > > > > > > > > > > > > > > flag. The guest cannot
> > > > > > > > > > > > > > > > > > > > stop
> > > > > > > > > > > > > > > > > > > > the device, so any write to this flag is an error/undefined.
> > > > > > > > > > > > > > > > > > > > 2) The VMM passes the
> > > > > > > > > > > > > > > > > > > > flag to the guest. The
> > > > > > > > > > > > > > > > > > > > guest can stop the
> > > > > > > > > > > > > > > > > > > > device.
> > > > > > > > > > > > > > > > > > > > 2.1) The VMM stops the
> > > > > > > > > > > > > > > > > > > > device to perform a live
> > > > > > > > > > > > > > > > > > > > migration, and the
> > > > > > > > > > > > > > > > > > > > guest does not write to
> > > > > > > > > > > > > > > > > > > > STOP in any moment of
> > > > > > > > > > > > > > > > > > > > the LM. It resets the
> > > > > > > > > > > > > > > > > > > > destination device with
> > > > > > > > > > > > > > > > > > > > the state, and then
> > > > > > > > > > > > > > > > > > > > initializes the device.
> > > > > > > > > > > > > > > > > > > > 2.2) The guest stops the
> > > > > > > > > > > > > > > > > > > > device and, when
> > > > > > > > > > > > > > > > > > > > STOP(32) is set, the
> > > > > > > > > > > > > > > > > > > > source
> > > > > > > > > > > > > > > > > > > > VMM migrates the device
> > > > > > > > > > > > > > > > > > > > status. The destination
> > > > > > > > > > > > > > > > > > > > VMM realizes the bit,
> > > > > > > > > > > > > > > > > > > > so it sets the bit in
> > > > > > > > > > > > > > > > > > > > the destination too
> > > > > > > > > > > > > > > > > > > > after device
> > > > > > > > > > > > > > > > > > > > initialization.
> > > > > > > > > > > > > > > > > > > > 2.3) The device is not
> > > > > > > > > > > > > > > > > > > > initialized by the guest
> > > > > > > > > > > > > > > > > > > > so it doesn't matter
> > > > > > > > > > > > > > > > > > > > what bit has the HW, but the VM can be migrated.
> > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > Am I missing something?
> > > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > > Thanks!
> > > > > > > > > > > > > > > > > > > It's doable like this. It's
> > > > > > > > > > > > > > > > > > > all a lot of hoops to jump
> > > > > > > > > > > > > > > > > > > through though.
> > > > > > > > > > > > > > > > > > > It's also not easy for devices to implement.
> > > > > > > > > > > > > > > > > > It just requires a new status
> > > > > > > > > > > > > > > > > > bit. Anything that makes you
> > > > > > > > > > > > > > > > > > think it's hard
> > > > > > > > > > > > > > > > > > to implement?
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > E.g for networking device, it
> > > > > > > > > > > > > > > > > > should be sufficient to use this
> > > > > > > > > > > > > > > > > > bit + the
> > > > > > > > > > > > > > > > > > virtqueue state.
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > Why don't we design the
> > > > > > > > > > > > > > > > > > > feature in a way that is
> > > > > > > > > > > > > > > > > > > useable by VMMs
> > > > > > > > > > > > > > > > > > > and implementable by devices in a simple way?
> > > > > > > > > > > > > > > > > > It uses common techniques like
> > > > > > > > > > > > > > > > > > register shadowing, without any further stuff.
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > Or do you have any other ideas?
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > (I think we all know migration
> > > > > > > > > > > > > > > > > > will be very hard if we simply
> > > > > > > > > > > > > > > > > > pass through
> > > > > > > > > > > > > > > > > > those state registers).
> > > > > > > > > > > > > > > > > If an admin virtqueue is used
> > > > > > > > > > > > > > > > > instead of the STOP Device Status
> > > > > > > > > > > > > > > > > field
> > > > > > > > > > > > > > > > > bit then there's no need to re-read
> > > > > > > > > > > > > > > > > the Device Status field in a loop
> > > > > > > > > > > > > > > > > until the device has stopped.
> > > > > > > > > > > > > > > > Probably not. Let me clarify several points:
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > - This proposal has nothing to do with the
> > > > > > > > > > > > > > > > admin virtqueue. Actually, the admin
> > > > > > > > > > > > > > > > virtqueue could be used for carrying any
> > > > > > > > > > > > > > > > basic device facility like the status
> > > > > > > > > > > > > > > > bit. E.g. I'm going to post patches that
> > > > > > > > > > > > > > > > use the admin virtqueue as a "transport"
> > > > > > > > > > > > > > > > for device slicing at the virtio level.
> > > > > > > > > > > > > > > > - Even if we had introduced the admin
> > > > > > > > > > > > > > > > virtqueue, we would still need a per-function
> > > > > > > > > > > > > > > > interface for this. That is a must for
> > > > > > > > > > > > > > > > nested virtualization; we can't
> > > > > > > > > > > > > > > > always expect things like the PF to be assignable to the L1 guest.
> > > > > > > > > > > > > > > > - According to the proposal, there's no
> > > > > > > > > > > > > > > > need for the device to complete all
> > > > > > > > > > > > > > > > the consumed buffers; the device can choose
> > > > > > > > > > > > > > > > to expose those inflight descriptors
> > > > > > > > > > > > > > > > in a device-specific way and set the
> > > > > > > > > > > > > > > > STOP bit. This means that, if we have the
> > > > > > > > > > > > > > > > device-specific in-flight descriptor
> > > > > > > > > > > > > > > > reporting facility, the device can
> > > > > > > > > > > > > > > > set the STOP bit almost immediately.
> > > > > > > > > > > > > > > > - If we don't go with the basic device
> > > > > > > > > > > > > > > > facility but instead use the admin virtqueue
> > > > > > > > > > > > > > > > specific method, we still need
> > > > > > > > > > > > > > > > to clarify how it works with the
> > > > > > > > > > > > > > > > device status state machine; it will be
> > > > > > > > > > > > > > > > some kind of sub-state, which looks
> > > > > > > > > > > > > > > > much more complicated than the current proposal.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > When migrating a guest with many
> > > > > > > > > > > > > > > > > VIRTIO devices a busy waiting
> > > > > > > > > > > > > > > > > approach
> > > > > > > > > > > > > > > > > extends downtime if implemented
> > > > > > > > > > > > > > > > > sequentially (stopping one device at
> > > > > > > > > > > > > > > > > a
> > > > > > > > > > > > > > > > > time).
> > > > > > > > > > > > > > > > Well, you need some kind of waiting for
> > > > > > > > > > > > > > > > sure; the device/DMA needs some time
> > > > > > > > > > > > > > > > to be stopped. The downtime is determined by a specific virtio
> > > > > > > > > > > > > > > > implementation, which is hard to
> > > > > > > > > > > > > > > > restrict at the spec level. We can
> > > > > > > > > > > > > > > > clarify that the device must set the STOP bit within e.g. 100ms.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > It can be implemented
> > > > > > > > > > > > > > > > > concurrently (setting the STOP bit
> > > > > > > > > > > > > > > > > on all
> > > > > > > > > > > > > > > > > devices and then looping until all
> > > > > > > > > > > > > > > > > their Device Status fields have the
> > > > > > > > > > > > > > > > > bit set), but this becomes more complex to implement.
> > > > > > > > > > > > > > > > I still don't get what kind of complexity you're worried about here.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > I'm a little worried about adding a new bit that requires busy
> > > > > > > > > > > > > > > > > waiting...
> > > > > > > > > > > > > > > > Busy wait is not something that is introduced in this patch:
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > 4.1.4.3.2 Driver Requirements: Common
> > > > > > > > > > > > > > > > configuration structure layout
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > After writing 0 to device_status, the
> > > > > > > > > > > > > > > > driver MUST wait for a read of
> > > > > > > > > > > > > > > > device_status to return 0 before reinitializing the device.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > Since it was required for at least one
> > > > > > > > > > > > > > > > transport, we need to do something
> > > > > > > > > > > > > > > > similar when introducing a basic facility.
> > > > > > > > > > > > > > > Adding the STOP bit as a Device Status bit
> > > > > > > > > > > > > > > is a small and clean VIRTIO
> > > > > > > > > > > > > > > spec change. I like that.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > On the other hand, devices need time to stop and that time can be
> > > > > > > > > > > > > > > unbounded. For example, software
> > > > > > > > > > > > > > > virtio-blk/scsi implementations
> > > > > > > > > > > > > > > cannot immediately cancel in-flight I/O requests on Linux hosts.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > The natural interface for long-running
> > > > > > > > > > > > > > > operations is virtqueue requests.
> > > > > > > > > > > > > > > That's why I mentioned the alternative of using an admin virtqueue
> > > > > > > > > > > > > > > instead of a Device Status bit.
> > > > > > > > > > > > > > So I'm not against the admin virtqueue. As said
> > > > > > > > > > > > > > before, admin virtqueue
> > > > > > > > > > > > > > could be used for carrying the device status bit.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Send a command to set STOP status bit to admin
> > > > > > > > > > > > > > virtqueue. Device will make
> > > > > > > > > > > > > > the command buffer used after it has
> > > > > > > > > > > > > > successfully stopped the device.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > AFAIK, they are not mutually exclusive, since
> > > > > > > > > > > > > > they are trying to solve
> > > > > > > > > > > > > > different problems.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Device status - basic device facility
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Admin virtqueue - transport/device specific way
> > > > > > > > > > > > > > to implement (part of) the
> > > > > > > > > > > > > > device facility
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Although you mentioned that the stopped
> > > > > > > > > > > > > > > state needs to be reflected in
> > > > > > > > > > > > > > > the Device Status field somehow, I'm not sure about that since the
> > > > > > > > > > > > > > > driver typically doesn't need to know whether the device is being
> > > > > > > > > > > > > > > migrated.
> > > > > > > > > > > > > > The guest won't see the real device status bit.
> > > > > > > > > > > > > > VMM will shadow the device
> > > > > > > > > > > > > > status bit in this case.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > E.g. with the current vhost-vDPA, vDPA behaves
> > > > > > > > > > > > > > like a vhost device, and the guest is
> > > > > > > > > > > > > > unaware of the migration.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > The STOP status bit is set by QEMU on the real virtio
> > > > > > > > > > > > > > hardware, but the guest will only
> > > > > > > > > > > > > > see DRIVER_OK without STOP.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > It's not hard to implement the nested case on top;
> > > > > > > > > > > > > > see the discussion initiated
> > > > > > > > > > > > > > by Eugenio about how to expose VIRTIO_F_STOP to the guest for nested live
> > > > > > > > > > > > > > migration.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > In fact, the VMM would need to hide
> > > > > > > > > > > > > > > this bit and it's safer to
> > > > > > > > > > > > > > > keep it out-of-band instead of risking exposing it by accident.
> > > > > > > > > > > > > > See above, VMM may choose to hide or expose the
> > > > > > > > > > > > > > capability. It's useful for
> > > > > > > > > > > > > > migrating a nested guest.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > If we design an interface that can't be used in
> > > > > > > > > > > > > > a nested environment, it's
> > > > > > > > > > > > > > not an ideal interface.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > In addition, stateful devices need to
> > > > > > > > > > > > > > > load/save non-trivial amounts of
> > > > > > > > > > > > > > > data. They need DMA to do this efficiently,
> > > > > > > > > > > > > > > so an admin virtqueue is a
> > > > > > > > > > > > > > > good fit again.
> > > > > > > > > > > > > > I don't get the point here. You still need to
> > > > > > > > > > > > > > address exactly the same
> > > > > > > > > > > > > > issues for the admin virtqueue: the unbounded time for
> > > > > > > > > > > > > > freezing the device, and the
> > > > > > > > > > > > > > interaction with the virtio device status state machine.
> > > > > > > > > > > > > Device state can be large, so a register interface would be a
> > > > > > > > > > > > > bottleneck. DMA is needed. I think a virtqueue is a good fit for
> > > > > > > > > > > > > saving/loading device state.
> > > > > > > > > > > > So this patch doesn't mandate a register interface, does it?
> > > > > > > > > > > You're right, not this patch. I mentioned it because your other patch
> > > > > > > > > > > series ("[PATCH] virtio-pci: implement VIRTIO_F_RING_STATE")
> > > > > > > > > > > implements
> > > > > > > > > > > it as a register interface.
> > > > > > > > > > > 
> > > > > > > > > > > > And DMA
> > > > > > > > > > > > doesn't mean a virtqueue; it could be a transport-specific method.
> > > > > > > > > > > Yes, although virtqueues are a pretty good interface that works across
> > > > > > > > > > > transports (PCI/MMIO/etc) thanks to the standard vring memory layout.
> > > > > > > > > > > 
> > > > > > > > > > > > I think we need to start by defining the state of one
> > > > > > > > > > > > specific device and
> > > > > > > > > > > > see what the best interface is.
> > > > > > > > > > > virtio-blk might be the simplest. I think virtio-net has more device
> > > > > > > > > > > state and virtio-scsi is definitely more complex than virtio-blk.
> > > > > > > > > > > 
> > > > > > > > > > > First we need agreement on whether "device state" encompasses the full
> > > > > > > > > > > state of the device or just state that is unknown to the VMM.
> > > > > > > > > > I think we've discussed this in the past. It can't work since:
> > > > > > > > > > 
> > > > > > > > > > 1) The state and its format must be clearly defined in the spec
> > > > > > > > > > 2) We need to maintain migration compatibility and debug-ability
> > > > > > > > > Some devices need implementation-specific state. They should still be
> > > > > > > > > able to live migrate even if it means cross-implementation migration and
> > > > > > > > > debug-ability is not possible.
> > > > > > > > I think we need to re-visit this conclusion. Migration compatibility is
> > > > > > > > pretty important, especially considering the software stack has spent a
> > > > > > > > huge amount of effort in maintaining it.
> > > > > > > > 
> > > > > > > > If virtio hardware were to break this, it would mean we lose all the
> > > > > > > > advantages of being a standard device.
> > > > > > > > 
> > > > > > > > If we can't do live migration among:
> > > > > > > > 
> > > > > > > > 1) different backends, e.g. migrating from virtio hardware to virtio
> > > > > > > > software
> > > > > > > > 2) different vendors
> > > > > > > > 
> > > > > > > > We would fail as a standard device, and the customer would in fact be
> > > > > > > > locked in by the vendor implicitly.
> > > > > > > > 
> > > > > > > > 
> > > > > > > > > > 3) Not a proper uAPI design
> > > > > > > > > I never understood this argument. The Linux uAPI passes through lots of
> > > > > > > > > opaque data from devices to userspace. Allowing an
> > > > > > > > > implementation-specific device state representation is nothing new. VFIO
> > > > > > > > > already does it.
> > > > > > > > I think we've already had a lot of discussion for VFIO but without a
> > > > > > > > conclusion. Maybe we need the verdict from Linus or Greg (not sure if
> > > > > > > > it's too late). But that's not related to virtio and this thread.
> > > > > > > > 
> > > > > > > > What you propose here is somewhat in conflict with the efforts of virtio. I
> > > > > > > > think we all agree that we should define the state in the spec.
> > > > > > > > Assuming this is correct:
> > > > > > > > 
> > > > > > > > 1) why do we still offer opaque migration state to userspace?
> > > > > > > > 2) how can it be integrated into the current VMM (Qemu) virtio devices'
> > > > > > > > migration bytes stream?
> > > > > > > > 
> > > > > > > > We should standardize everything that is visible to the driver in order to
> > > > > > > > be a standard device. That's the power of virtio.
> > > > > > > > 
> > > > > > > > 
> > > > > > > > > > > That's
> > > > > > > > > > > basically the difference between the vhost/vDPA's selective
> > > > > > > > > > > passthrough
> > > > > > > > > > > approach and VFIO's full passthrough approach.
> > > > > > > > > > We can't do VFIO full passthrough for migration anyway; some kind
> > > > > > > > > > of mdev is
> > > > > > > > > > required, but that duplicates the current vp_vdpa driver.
> > > > > > > > > I'm not sure that's true. Generic VFIO PCI migration can probably be
> > > > > > > > > achieved without mdev:
> > > > > > > > > 1. Define a migration PCI Capability that indicates support for
> > > > > > > > >    VFIO_REGION_TYPE_MIGRATION. This allows the PCI device to implement
> > > > > > > > >    the migration interface in hardware instead of an mdev driver.
> > > > > > > > So I think it still depends on the driver to implement migration state,
> > > > > > > > which is vendor-specific.
> > > > > > > > 
> > > > > > > > Note that it's just a uAPI definition, not something defined in the PCI
> > > > > > > > spec.
> > > > > > > > 
> > > > > > > > Out of curiosity, the patch was merged without any real users in
> > > > > > > > Linux. This is very bad since we lose the chance to audit the whole
> > > > > > > > design.
> > > > > > > > 
> > > > > > > > 
> > > > > > > > > 2. The VMM either uses the migration PCI Capability directly from
> > > > > > > > >    userspace or core VFIO PCI code advertises
> > > > > > > > > VFIO_REGION_TYPE_MIGRATION
> > > > > > > > >    to userspace so migration can proceed in the same way as with
> > > > > > > > >    VFIO/mdev drivers.
> > > > > > > > > 3. The PCI Capability is not passed through to the guest.
> > > > > > > > This brings trouble in a nested environment.
> > > > > > > > 
> > > > > > > > Thanks
> > > > > > > > 
> > > > > > > > 
> > > > > > > > > Changpeng Liu originally mentioned the idea of defining a migration PCI
> > > > > > > > > Capability.
> > > > > > > > > 
> > > > > > > > > > > For example, some of the
> > > > > > > > > > > virtio-net state is available to the VMM with vhost/vDPA because it
> > > > > > > > > > > intercepts the virtio-net control virtqueue.
> > > > > > > > > > > 
> > > > > > > > > > > Also, we need to decide to what degree the device state representation
> > > > > > > > > > > is standardized in the VIRTIO specification.
> > > > > > > > > > I think all the states must be defined in the spec otherwise the device
> > > > > > > > > > can't claim it supports migration at virtio level.
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > > I think it makes sense to
> > > > > > > > > > > standardize it if it's possible to convey all necessary
> > > > > > > > > > > state and device
> > > > > > > > > > > implementors can easily implement this device state representation.
> > > > > > > > > > I suspect it's highly device-specific. E.g. can we standardize device (GPU)
> > > > > > > > > > memory?
> > > > > > > > > For devices that have little internal state it's possible to define a
> > > > > > > > > standard device state representation.
> > > > > > > > > 
> > > > > > > > > For other devices, like virtio-crypto, virtio-fs, etc it becomes
> > > > > > > > > difficult because the device implementation contains state that will be
> > > > > > > > > needed but is very specific to the implementation. These devices *are*
> > > > > > > > > migratable but they don't have standard state. Even here there is a
> > > > > > > > > spectrum:
> > > > > > > > > - Host OS-specific state (e.g. Linux struct file_handles)
> > > > > > > > > - Library-specific state (e.g. crypto library state)
> > > > > > > > > - Implementation-specific state (e.g. sshfs inode state for virtio-fs)
> > > > > > > > > 
> > > > > > > > > This is why I think it's necessary to support both standard device state
> > > > > > > > > representations and implementation-specific device state
> > > > > > > > > representations.
> > > > > > > Having two ways will bring extra complexity. That's why I suggest:
> > > > > > > 
> > > > > > > - having a general facility for the virtqueue to be migrated
> > > > > > > - leaving the device-specific state device-specific, so the device can
> > > > > > > choose a convenient way or interface.
> > > > > > I don't think we have a choice. For stateful devices it can be
> > > > > > impossible to define a standard device state representation.
> > > > > Let me clarify, I agree we can't have a standard device state for all kinds
> > > > > of device.
> > > > > 
> > > > > That's why I tend to leave them device-specific (but not
> > > > > implementation-specific).
> > > > Unfortunately device state is sometimes implementation-specific. Not
> > > > because the device is proprietary, but because the actual state is
> > > > meaningless to other implementations.
> > > > 
> > > > I mentioned virtiofs as an example where file system backends can be
> > > > implemented in completely different ways so the device state cannot be
> > > > migrated between implementations.
> > > 
> > > So let me clarify my understanding, we had two kinds of states:
> > > 
> > > 1) implementation specific state that is not noticeable by the driver
> > > 2) device specific state that is noticeable by the driver
> > > 
> > > We have no interest in 1).
> > > 
> > > For 2) it's what needs to be defined in the spec. If we fail to generalize
> > > the device specific state, it can't be used by a standard virtio driver. Or
> > > maybe you can give a concrete example of how virtio-fs fails in doing this?
> > 2) is what I mean when I say a "stateful" device. I agree, 1) is not
> > relevant to this discussion because we don't need to migrate internal
> > device state that the driver cannot interact with.
> > 
> > The virtiofs device has an OPEN request for opening a file. Live
> > migration must transfer the list of open files from the source to the
> > destination device so the driver can continue accessing files it
> > previously had open.
> > 
> > However, the result of the OPEN request is a number similar to a POSIX
> > fd, not the full device-internal state associated with an open file.
> > After migration the driver expects to continue using the number to
> > operate on the file. We must transfer the open file state to the
> > destination device.
> > 
> > Different device implementations may have completely different concepts
> > of open file state:
> > 
> > - An in-memory file system. The list of open files is a list of
> >    in-memory inodes. We'll also need to transfer the entire contents of
> >    the files/directories since it's in-memory and not shared with the
> >    destination device.
> > 
> > - A passthrough Linux file system. We need to transfer the Linux file
> >    handles (see open_by_handle_at(2)) so the destination device can open
> >    the inodes on the underlying shared host file system.
> > 
> > - A distributed object storage API. We need to transfer the list of
> >    open object IDs so the destination device can perform I/O to the same
> >    objects.
> > 
> > - Anyone can create a custom virtiofs device implementation and it will
> >    rely on different open file state.
> 
> 
> So it looks to me you want to propose migration drivers for different
> implementations. It looks to me it's better to go with something
> easier:
> 
> 1) Have a common driver-visible state defined in the spec, and use it
> for migration
> 2) It's the job of the device or backend to "map" the driver-visible
> state to its implementation-specific state.
> 
> If 1) is insufficient, we should extend it until it satisfies 2)
> 
> For virtio-fs, it looks like the issue is that the implementation needs to
> associate objects in different namespaces (guest vs host).
> 
> For the above cases:
> 
> 1) For the in-memory file system, if I understand correctly, it can be
> accessed directly by the guest via a transport-specific way (e.g. a BAR). In
> this case, it's driver-noticeable state, so all the memory must be migrated
> to the destination.

The virtiofs DAX Window feature is optional. Also, the DAX Window BAR
only exposes file contents, not file system metadata. The guest driver
or the BAR contents are not enough to migrate the state of the device.

> 2) For the passthrough, it's the job of the implementation to do the mapping.
> I think it doesn't differ a lot from the current migration among shared
> storage of block devices. (Note that open_by_handle_at() requires
> CAP_DAC_READ_SEARCH, which I'm not sure can be used.)

It's different from block devices because the FUSE protocol
is stateful. Commands like LOOKUP, OPEN, OPENDIR create objects that are
referred to by a temporary number (similar to POSIX fds). Due to POSIX
file system semantics there's no way to recreate the mapping correctly
without Linux file handles (because files can be deleted, renamed, moved
while they are still open or their inode is known). Shared block devices
do not have this issue.

> 3) For distributed storage, the implementation should implement the
> association between driver object and distributed storage object (I think
> each such API should have something like a UUID for a global namespace) and
> provide a way for reverse lookup.
> 4) For other implementation, it should be the same as 3)
> 
> It's up to the management layer to decide whether we can do
> cross-implementation migration, but QEMU should migrate the common
> driver-visible state instead of implementation-specific state.
> 
> We can have a dedicated feature flag for this and block the migration for
> the devices without this feature.
> 
> I think we don't want to end up with several migration drivers for
> virtio-fs.
>
> > I imagine virtio-gpu and virtio-crypto might have similar situations
> > where an object created through a virtqueue request has device-internal
> > state associated with it that must be migrated.
> 
> 
> So the point still stands. The device internal state should be restored from
> the device-specific state defined in the spec. The spec would guarantee the
> minimal device-specific state; the implementation should use it
> to restore its implementation-specific state.
> 
> 
> > 
> > > > > But we can generalize the virtqueue state for sure.
> > > > I agree and also that some device types can standardize their device
> > > > state representations. But I think it's a technical requirement to
> > > > support implementation-specific state for device types where
> > > > cross-implementation migration is not possible.
> > > 
> > > A question here: if the driver depends on implementation-specific state,
> > > how can we make sure the driver can work with other implementations? If we're
> > > sure that a single driver can work for all kinds of implementations, it
> > > means we have device-specific state, not implementation-specific state.
> > I think this is confusing stateless and stateful devices. You are
> > describing a stateless device here. I'll try to define a stateful
> > device:
> > 
> > A stateful device maintains state that the driver operates on indirectly
> > via standard requests. For example, the virtio-crypto device has
> > CREATE_SESSION requests and a session ID is returned to the driver so
> > further requests can be made on the session object. It may not be
> > possible to replay, reconnect, or restart the device without losing
> > state.
> 
> 
> If it's impossible to do any of these, that state should be noticeable by
> the driver, and it is not implementation-specific but device-specific,
> which could be defined in the spec.
> 
> 
> > 
> > I hope that this description, together with the virtiofs specifics
> > above, make the problem clearer.
> 
> 
> Yes.
> 
> 
> > 
> > > > I'm not saying the implementation-specific state representation has to
> > > > be a binary blob. There could be an identifier registry to ensure live
> > > > migration compatibility checks can be performed. There could also be a
> > > > standard binary encoding for migration data.
> > > 
> > > Yes, such requirements have been well studied in the past. There should be
> > > plenty of protocols to do this.
> > > 
> > > 
> > > >    But the contents will be
> > > > implementation-specific for some devices.
> > > 
> > > If we allow this, it breaks the spec effort of having a standard device.
> > > And it will block real customers.
> > If we forbid this then devices for which migration is technically
> > possible will be unmigratable. Both users and device implementors will
> > find other solutions, like VFIO, so I don't think we can stop them even
> > if we tried.
> 
> 
> The difference is standard device vs vendor specific device. We can't avoid
> the implementation specific state for vendor specific hardware.
> 
> But for a standard device, we don't want to end up with migration drivers. (Do
> we want vendors to ship vendor-specific migration drivers for NVMe
> devices?)
> 
> 
> > 
> > I recognize that opaque device state poses a risk to migration
> > compatibility, because device implementors may arbitrarily use opaque
> > state when a standard is available.
> > 
> > However, the way to avoid this scenario is by:
> > 
> > 1. Making the standard migration approach the easiest to implement
> >     because everything has been taken care of. It will save implementors
> >     the headache of defining and coding their own device state
> >     representations and versioning.
> > 
> > 2. Educate users about migration compatibility so they can identify
> >     when implementors are locking in their users.
> 
> 
> For a vendor-specific device, this may work. But for standard devices like
> virtio, we should go further.
> 
> The device states should be defined clearly in the spec. We should re-visit
> the design if those states contain anything that is
> implementation-specific.

Can you describe how migrating virtiofs devices should work? I think
that might be quicker than if I reply to each of your points because our
views are still quite far apart.

Stefan

Attachment: signature.asc
Description: PGP signature


