[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]
Subject: Re: [PATCH v3 2/2] virtio: introduce STOP status bit
On Thu, Dec 02, 2021 at 10:40:57AM +0800, Jason Wang wrote: > On Tue, Nov 30, 2021 at 12:56 AM Eugenio Perez Martin > <eperezma@redhat.com> wrote: > > > > On Mon, Nov 29, 2021 at 11:29 AM Stefan Hajnoczi <stefanha@redhat.com> wrote: > > > > > > On Thu, Nov 25, 2021 at 10:57:28AM +0800, Jason Wang wrote: > > > > > > > > å 2021/11/24 äå7:20, Stefan Hajnoczi åé: > > > > > On Tue, Nov 23, 2021 at 06:00:20PM +0100, Eugenio Perez Martin wrote: > > > > > > On Tue, Nov 23, 2021 at 1:17 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > > > > > > > On Thu, Nov 18, 2021 at 08:58:05PM +0100, Eugenio Perez Martin wrote: > > > > > > > > On Thu, Nov 18, 2021 at 5:00 PM Stefan Hajnoczi <stefanha@redhat.com> wrote: > > > > > > > > > On Thu, Nov 11, 2021 at 07:58:12PM +0100, Eugenio PÃrez wrote: > > > > > > > > > > +the driver MAY change avail_idx in the case of split virtqueue, but the new > > > > > > > > > > +avail_idx MUST be within used_idx and used_idx plus virtqueue size. > > > > > > > > > I'm trying to understand how this would work. Available buffers may be > > > > > > > > > consumed out-of-order unless VIRTIO_F_IN_ORDER was negotiated, so the > > > > > > > > > avail ring could contain something like: > > > > > > > > > > > > > > > > > > avail.ring = [Used, Not used, Used, Not used, ...] > > > > > > > > > ^--- avail.idx > > > > > > > > > > > > > > > > > > There are num_not_used = avail.idx - used.idx requests that are "Not > > > > > > > > > used" in avail.ring. > > > > > > > > > > > > > > > > > > Does this mean the driver can rewind avail.idx by counting the number of > > > > > > > > > "Not used" buffers and skipping "Used" buffers until it reaches > > > > > > > > > num_not_used "Not used" buffers? > > > > > > > > > > > > > > > > > I'm going to also drop the "resume" part for the next version because > > > > > > > > it adds extra complexity not actually needed, and it can be achieved > > > > > > > > with a full reset in a simpler way. > > > > > > > > > > > > > > > > But I'll explain it below with your examples. Long story short, the > > > > > > > > driver only can rewind the available descriptors that are still in the > > > > > > > > available ring, and the device must flush the ones that cannot recover > > > > > > > > from the ring. > > > > > > > > > > > > > > > > > I think there is a known issue with this approach: > > > > > > > > > > > > > > > > > > Imagine a vring with 4 elements: > > > > > > > > > > > > > > > > > > avail.ring = [0, 1, 2, 3 ] > > > > > > > > > Not used, used, used, used > > > > > > > > > ^--- avail.idx > > > > > > > > > > > > > > > > > > Since the device has used 3 buffers the driver now has space to make > > > > > > > > > more buffers available. avail.idx wraps back to the start of the ring > > > > > > > > > and the driver overwrites the first element ("Not used"): > > > > > > > > > > > > > > > > > > avail.ring = [1, N/A, N/A, N/A] > > > > > > > > > Not used, N/A, N/A, N/A > > > > > > > > > ^--- avail.idx > > > > > > > > > > > > > > > > > > Since vring descriptor 0 is still in use the driver chose descriptor 1 > > > > > > > > > for the new available buffer. > > > > > > > > > > > > > > > > > > Now we stop the device, knowing there are two buffers available that > > > > > > > > > have not been used. But avail.ring[] actually only contains the new > > > > > > > > > buffer (vring descriptor 1) that we made available because we overwrote > > > > > > > > > the old avail.ring[] element (vring descriptor 0). > > > > > > > > > > > > > > > > > > What now? Where does the device reset its internal avail_idx to? > > > > > > > > To be on the same page, in qemu the device maintains two "internal > > > > > > > > avail idx": shadow_avail_idx (last seen in the available ring, could > > > > > > > > be 4 in this case) and last_avail_idx (next descriptor to fetch from > > > > > > > > avail, 2). The device must forget shadow_avail_idx and flush the > > > > > > > > descriptors that cannot recover (0). So last_avail_idx is now 3. Now > > > > > > > > it can stop. > > > > > > > > > > > > > > > > The proposal allows the device to fail descriptor 0 in a > > > > > > > > device-specific way, but I think now it was a bad choice. > > > > > > > > > > > > > > > > The driver cannot move the device's last_avail_idx in this operation: > > > > > > > > The device is simply forced to flush used ones to the used ring or > > > > > > > > descriptor ring in a packed vq case. So the device's internal > > > > > > > > avail_idx == used_idx == 3. When the device resumes, it's still 3. > > > > > > > > > > > > > > > > The device must keep its last_avail_idx through stop and resume cycle. > > > > > > > Are you saying that all buffers avail->ring[i % ring_size] must be > > > > > > > completed by the device before the STOP bit is reported where i <= > > > > > > > last_avail_idx? > > > > > > > > > > > > > > This means the driver can modify avail->ring[i % ring_size] where > > > > > > > avail_idx >= i > used_idx. > > > > > > > > > > > > > Yes, That's correct. The driver could also decide to modify the > > > > > > descriptor table instead of the avail ring to do so, but I think the > > > > > > point is clear now. > > > > > > > > > > > > Somehow it is thought after the premise that the out of order > > > > > "Somehow it is thought after the premise" == "there is a fundamental > > > > > design assumption"? > > > > > > > > > > > descriptors are descriptors that the device must wait to complete > > > > > > before the pause anyway. Depending on the device, it might prefer to > > > > > > cancel them, to wait for them, etc. The interesting descriptors to > > > > > > rewind are the ones that have not reached the device (i > used_idx). > > > > > > The driver can do whatever it wants with them. > > > > > > > > > > > > If we assume all the in-flight descriptors are idempotent and we > > > > > > expose a way for the device to expose them, the model is way more > > > > > > simpler than this. > > > > > The constraint that the device has to mark all previously seen "avail" > > > > > buffers as "used" is problematic. It makes STOP visible to the driver > > > > > when the device has to fail requests. > > > > > > > > > > > > I think we need some clarification here on the driver. For doing migration, > > > > some kind of mediation is a must. > > > > > > > > As we've discussed in the previous versions of this proposal, the VMM > > > > usually won't advertise the STOP feature to guest if we don't want to do > > > > nested live migration (if we do we can shadow it anyhow). > > > > > > > > So from the guest point of view it won't see neither STOP nor the inflight > > > > descriptors. > > > > > > That's not how I understand STOP semantics. See below. > > > > > > > > That is incompatible with how > > > > > devices behave across live migration today. If you want to use STOP for > > > > > live migration then it's probably necessary to rethink this constraint. > > > > > > > > > > QEMU's virtio-blk and virtio-scsi device models put failed requests onto > > > > > a list so they can be retried after problems with the underlying storage > > > > > have been resolved (e.g. more disk space becomes available and ENOSPC > > > > > requests can be retried). > > > > > > > > > > > > A question, I think for those "failure" it's actually not visible from the > > > > drive? If this is true, from the spec point of view, there are still > > > > inflight. The VMM may choose to migrate them to the destination and > > > > re-submit them there. This works more like vhost re-connection. > > > > > > That's how I would like STOP to work, but the semantics seem to be > > > different. Eugenio can correct me if this is wrong: > > > > > > All avail descriptors before the last used descriptor must be marked > > > used before the device reports the STOP bit. For example: > > > > > > avail.ring = [1, 2, 3, 4] > > > used.ring = [3] > > > > > > The driver writes the STOP bit. Now the device MUST complete 1 and 2 > > > before reporting the STOP bit. Therefore we cannot keep 1 and 2 > > > in flight but it can keep 4 in flight. The problem is that this > > > conflicts with the virtio-blk/scsi failed requests behavior where 1 and > > > 2 should be kept in flight and migrated. > > Ok, so I think I get your comments on the vhost. Regarding the failed > requests behaviour, it looks like it's an implementation specific > feature which is out of the virtio spec. If we want to preserve the > behaviour, we need to extend the virtio spec. > > Some quick thoughts: > > 1) extend the virtio-blk error codes > 2) allow to configure the behaviour (report, ignore, stop) on error > via config space or control vq > 3) signal the error via config interrupt > > With all of the above, it may provide a virtio-blk that is fully > compatible with what is provided by qemu. Considering its complexity, > I wonder if we can start from something simple and build the features > gradually. If I was not wrong, we can start by exposing something to > make it work like vhost-(user)-blk. When all of those facilities were > implemented in the spec, vhost-vDPA got those facilities as well. Then > it looks to me it's sufficient to define: > > 1) STOP > 2) indices state synchronization > 3) inflight descriptors report (is this a must?) > > Or even 3) could be optional, to make things easier, having 1) and 2) > makes it sufficient to migrate the networking devices. And we can do > 3) on top? I suggest: 1. STOP 2. vhost-style get_vring_base() to fetch last_avail_idx 3. Device state save/load Under this model the driver can rewind the avail ring, but only back to last_avail_idx. This gives the device the freedom to keep in-flight requests like virtio-blk/scsi's failed requests list. Information about those requests will be transferred as part of device state save/load. This patch series just needs to define STOP. Rewinding descriptors requires a new get_vring_base()-style operation while the device is in the stopped state. This can be proposed separately. Device state save/load can also be proposed separately. Stefan
Attachment:
signature.asc
Description: PGP signature
[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]