virtio-comment message

Subject: Re: [PATCH v3 2/2] virtio: introduce STOP status bit
From: Stefan Hajnoczi <stefanha@redhat.com>
To: Eugenio Perez Martin <eperezma@redhat.com>
Date: Wed, 1 Dec 2021 10:21:28 +0000
On Mon, Nov 29, 2021 at 05:55:57PM +0100, Eugenio Perez Martin wrote:
> On Mon, Nov 29, 2021 at 11:29 AM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >
> > On Thu, Nov 25, 2021 at 10:57:28AM +0800, Jason Wang wrote:
> > >
> > > å 2021/11/24 äå7:20, Stefan Hajnoczi åé:
> > > > On Tue, Nov 23, 2021 at 06:00:20PM +0100, Eugenio Perez Martin wrote:
> > > > > On Tue, Nov 23, 2021 at 1:17 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > > > On Thu, Nov 18, 2021 at 08:58:05PM +0100, Eugenio Perez Martin wrote:
> > > > > > > On Thu, Nov 18, 2021 at 5:00 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > > > > > On Thu, Nov 11, 2021 at 07:58:12PM +0100, Eugenio PÃrez wrote:
> > > > > > > > > +the driver MAY change avail_idx in the case of split virtqueue, but the new
> > > > > > > > > +avail_idx MUST be within used_idx and used_idx plus virtqueue size.
> > > > > > > > I'm trying to understand how this would work. Available buffers may be
> > > > > > > > consumed out-of-order unless VIRTIO_F_IN_ORDER was negotiated, so the
> > > > > > > > avail ring could contain something like:
> > > > > > > >
> > > > > > > >    avail.ring = [Used, Not used, Used, Not used, ...]
> > > > > > > >                                                  ^--- avail.idx
> > > > > > > >
> > > > > > > > There are num_not_used = avail.idx - used.idx requests that are "Not
> > > > > > > > used" in avail.ring.
> > > > > > > >
> > > > > > > > Does this mean the driver can rewind avail.idx by counting the number of
> > > > > > > > "Not used" buffers and skipping "Used" buffers until it reaches
> > > > > > > > num_not_used "Not used" buffers?
> > > > > > > >
> > > > > > > I'm going to also drop the "resume" part for the next version because
> > > > > > > it adds extra complexity not actually needed, and it can be achieved
> > > > > > > with a full reset in a simpler way.
> > > > > > >
> > > > > > > But I'll explain it below with your examples. Long story short, the
> > > > > > > driver only can rewind the available descriptors that are still in the
> > > > > > > available ring, and the device must flush the ones that cannot recover
> > > > > > > from the ring.
> > > > > > >
> > > > > > > > I think there is a known issue with this approach:
> > > > > > > >
> > > > > > > > Imagine a vring with 4 elements:
> > > > > > > >
> > > > > > > >    avail.ring = [0,        1,    2,    3   ]
> > > > > > > >                  Not used, used, used, used
> > > > > > > >                                             ^--- avail.idx
> > > > > > > >
> > > > > > > > Since the device has used 3 buffers the driver now has space to make
> > > > > > > > more buffers available. avail.idx wraps back to the start of the ring
> > > > > > > > and the driver overwrites the first element ("Not used"):
> > > > > > > >
> > > > > > > >    avail.ring = [1,        N/A,  N/A,  N/A]
> > > > > > > >                  Not used, N/A,  N/A,  N/A
> > > > > > > >                           ^--- avail.idx
> > > > > > > >
> > > > > > > > Since vring descriptor 0 is still in use the driver chose descriptor 1
> > > > > > > > for the new available buffer.
> > > > > > > >
> > > > > > > > Now we stop the device, knowing there are two buffers available that
> > > > > > > > have not been used. But avail.ring[] actually only contains the new
> > > > > > > > buffer (vring descriptor 1) that we made available because we overwrote
> > > > > > > > the old avail.ring[] element (vring descriptor 0).
> > > > > > > >
> > > > > > > > What now? Where does the device reset its internal avail_idx to?
> > > > > > > To be on the same page, in qemu the device maintains two "internal
> > > > > > > avail idx": shadow_avail_idx (last seen in the available ring, could
> > > > > > > be 4 in this case) and last_avail_idx (next descriptor to fetch from
> > > > > > > avail, 2). The device must forget shadow_avail_idx and flush the
> > > > > > > descriptors that cannot recover (0). So last_avail_idx is now 3. Now
> > > > > > > it can stop.
> > > > > > >
> > > > > > > The proposal allows the device to fail descriptor 0 in a
> > > > > > > device-specific way, but I think now it was a bad choice.
> > > > > > >
> > > > > > > The driver cannot move the device's last_avail_idx in this operation:
> > > > > > > The device is simply forced to flush used ones to the used ring or
> > > > > > > descriptor ring in a packed vq case. So the device's internal
> > > > > > > avail_idx == used_idx == 3. When the device resumes, it's still 3.
> > > > > > >
> > > > > > > The device must keep its last_avail_idx through stop and resume cycle.
> > > > > > Are you saying that all buffers avail->ring[i % ring_size] must be
> > > > > > completed by the device before the STOP bit is reported where i <=
> > > > > > last_avail_idx?
> > > > > >
> > > > > > This means the driver can modify avail->ring[i % ring_size] where
> > > > > > avail_idx >= i > used_idx.
> > > > > >
> > > > > Yes, That's correct. The driver could also decide to modify the
> > > > > descriptor table instead of the avail ring to do so, but I think the
> > > > > point is clear now.
> > > > >
> > > > > Somehow it is thought after the premise that the out of order
> > > > "Somehow it is thought after the premise" == "there is a fundamental
> > > > design assumption"?
> > > >
> > > > > descriptors are descriptors that the device must wait to complete
> > > > > before the pause anyway. Depending on the device, it might prefer to
> > > > > cancel them, to wait for them, etc. The interesting descriptors to
> > > > > rewind are the ones that have not reached the device (i > used_idx).
> > > > > The driver can do whatever it wants with them.
> > > > >
> > > > > If we assume all the in-flight descriptors are idempotent and we
> > > > > expose a way for the device to expose them, the model is way more
> > > > > simpler than this.
> > > > The constraint that the device has to mark all previously seen "avail"
> > > > buffers as "used" is problematic. It makes STOP visible to the driver
> > > > when the device has to fail requests.
> > >
> > >
> > > I think we need some clarification here on the driver. For doing migration,
> > > some kind of mediation is a must.
> > >
> > > As we've discussed in the previous versions of this proposal, the VMM
> > > usually won't advertise the STOP feature to guest if we don't want to do
> > > nested live migration (if we do we can shadow it anyhow).
> > >
> > > So from the guest point of view it won't see neither STOP nor the inflight
> > > descriptors.
> >
> > That's not how I understand STOP semantics. See below.
> >
> > > > That is incompatible with how
> > > > devices behave across live migration today. If you want to use STOP for
> > > > live migration then it's probably necessary to rethink this constraint.
> > > >
> > > > QEMU's virtio-blk and virtio-scsi device models put failed requests onto
> > > > a list so they can be retried after problems with the underlying storage
> > > > have been resolved (e.g. more disk space becomes available and ENOSPC
> > > > requests can be retried).
> > >
> > >
> > > A question, I think for those "failure" it's actually not visible from the
> > > drive? If this is true, from the spec point of view, there are still
> > > inflight. The VMM may choose to migrate them to the destination and
> > > re-submit them there. This works more like vhost re-connection.
> >
> > That's how I would like STOP to work, but the semantics seem to be
> > different. Eugenio can correct me if this is wrong:
> >
> > All avail descriptors before the last used descriptor must be marked
> > used before the device reports the STOP bit. For example:
> >
> >   avail.ring = [1, 2, 3, 4]
> >   used.ring = [3]
> >
> > The driver writes the STOP bit. Now the device MUST complete 1 and 2
> > before reporting the STOP bit. Therefore we cannot keep 1 and 2
> > in flight but it can keep 4 in flight. The problem is that this
> > conflicts with the virtio-blk/scsi failed requests behavior where 1 and
> > 2 should be kept in flight and migrated.
> >
> 
> (Only answering to your example use case at this part of the mail)
> 
> My intention was a little bit less rigid actually, but it does not
> meet the blk use case anyway.
> 
> In that case, the device should be free to not mark descriptor 2 as
> used, since the device will start on last_avail_idx == 1, and it will
> read it again after the reset.

last_avail_idx must be 3 since the device already saw avail.ring[0],
avail.ring[1], and avail.ring[2]. How can last_avail_idx be 1?

Also, what does "reset" mean? I don't think a VIRTIO device reset is
part of this process, just setting and clearing the STOP bit.

> And that requirement was intended to be
> removed once we implement a standard / device specific way to report
> them differently. It's loosely expressed as "Depending on the device,
> ... as long as the driver can recover its normal operation if it
> resumes the device without the need of resetting it".
> 
> Although I thought this freedom would help devices to implement stop
> semantincs, to track overridden descriptors could actually be way
> worse than simply split them as in flight == (used_idx,
> last_avail_idx) or available (last_avail_idx, avail_idx).
> 
> I still think that, ideally, the device should be able to report
> differently the descriptors that are not-rewindables (for example,
> in-flight writes, because rewind them leave the device in an
> inconsistent state) and rewindable (not started writes, reads) to the
> driver. Just for the sake of flexibility. But potentially overridden
> descriptors complicate it, so it's probably not worth it. And our
> intended use case (live migration) has no use for it, so I think it's
> better to stick with in flight vs avail.
> 
> (Now adding my view of Jason's point on top)
> 
> At this moment, blk is able to detect ENOSPC because the device is in
> qemu's code, software based. If the device is out of qemu, it will
> need either:
> * A way to signal the error condition to qemu, so it can start the
> migration to solve it.
> * Another process to monitor available space so it can react & migrate.
> 
> Since you pointed out a queue of failed requests, I will go with the
> first method. The data queues of the device reach directly the guest,
> so the device cannot use them to signal ENOSPC: To deliver it via
> VirtQueue will skip qemu. This is already outside of VirtIO. How would
> that work in the nested migration case, for example? The only way I
> can think at this moment is to use shadow virtqueue from the beginning
> of qemu operation.
> 
> Once qemu receives that signal, the guest would only see that
> descriptor id 1 has been used. For the next revision, it will see no
> descriptor used.

I don't understand the idea here. I'll wait until the next revision of
this series to think through virtio-blk again.

Stefan
Attachment: signature.asc
Description: PGP signature
Follow-Ups:
- Re: [PATCH v3 2/2] virtio: introduce STOP status bit
  - From: Eugenio Perez Martin <eperezma@redhat.com>