Subject: Re: [virtio] Re: [virtio-comment] Problems with VIRTIO-4 and writeback only disks
From: Stefan Hajnoczi <stefanha@redhat.com>
To: Rusty Russell <rusty@au1.ibm.com>
Date: Wed, 11 Sep 2013 10:20:52 +0200
On Wed, Sep 11, 2013 at 12:49:26PM +0930, Rusty Russell wrote:
> James Bottomley <jbottomley@parallels.com> writes:
> > [resending to virtio-comment; it looks like I'm not subscribed to
> > virtio-dev ... how do you subscribe?]
> 
> Mail to virtio-dev-subscribe@lists.oasis-open.org, or via
>         https://www.oasis-open.org/mlmanage/
> 
> BTW, I've moved this to virtio@ since it's core business, with virtio-comment
> cc'd.
> 
> > Sorry, I don't have a copy of the original email to reply to:
> >
> > https://lists.oasis-open.org/archives/virtio-comment/201308/msg00078.html
> >
> > The part that concerns me is this:
> >
> >> +5. The cache mode should be read from the writeback field of the configuration
> >> +  if the VIRTIO_BLK_F_CONFIG_WCE feature is available; the driver can also
> >> +  write to the field in order to toggle the cache between writethrough (0)
> >> +  and writeback (1) mode.
> >> +  If the feature is not available, the driver can instead look at the result
> >> +  of negotiating VIRTIO_BLK_F_WCE: the cache will be in writeback mode after
> >> +  reset if and only if VIRTIO_BLK_F_WCE is negotiated[30]
> >
> > The questions are twofold and have to do with Write Back only disks (to
> > date we've seen quite a few ATA devices like this and a huge number of
> > USB devices):
> >
> >      1. If the guest doesn't negotiate WCE, what do you do on the host
> >         (flush on every write is one possible option; run unsafe and
> >         hope the host doesn't crash is another).

The default WCE=0 semantics should be that the host ensures every write
reaches stable storage.

This can optionally be overridden in the host.  It might be useful, for
example during guest OS installation where you throw away the image if
installation is interrupted by a power failure.
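
A minimal sketch of that default, assuming a file-backed device on a
POSIX host.  The names handle_write and unsafe_override are made up for
illustration and do not come from any hypervisor; os.fsync() stands in
for whatever sync primitive the host actually uses.

```python
import os
import tempfile


def handle_write(fd, offset, data, writeback, unsafe_override=False):
    """Service one guest write against the backing file descriptor `fd`.

    Returns (bytes_written, synced).  With writeback=False (WCE=0) the data
    is forced to stable storage before the request is reported complete,
    unless the hypothetical host-side unsafe_override is set.
    """
    written = os.pwrite(fd, data, offset)
    synced = False
    if not writeback and not unsafe_override:
        os.fsync(fd)  # fdatasync() would also do where it is available
        synced = True
    return written, synced


def demo():
    """Run three writes against a temporary file and report what was synced."""
    fd, path = tempfile.mkstemp()
    try:
        return [
            handle_write(fd, 0, b"abcd", writeback=False),
            handle_write(fd, 4, b"ef", writeback=True),
            handle_write(fd, 6, b"gh", writeback=False, unsafe_override=True),
        ]
    finally:
        os.close(fd)
        os.unlink(path)
```

Only the first write in demo() is synced before completion; the second
runs in writeback mode and the third uses the host override.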

> >      2. If the guest asks to toggle the device from writeback (1) to
> >         writethrough (0) mode, what do you do?  Refuse the toggle would
> >         be reasonable or flip back into whatever mode you were using to
> >         handle 1. is also possible.

I don't think there is a reasonable way to refuse since the WCE toggle
is implemented as a configuration space field.  It's hard to return an
error from configuration space stores - virtio-net moved to a control
virtqueue in order to support configuration updates properly.

The transition from writeback (1) to writethrough (0) mode should be
allowed and the host uses the same solution as for #1.  I think your
suggestion is a good idea.
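
Roughly, and only as a sketch: the class and method names below are
hypothetical and model the writeback config field on the host side,
assuming a backend that can only do writeback itself.

```python
WRITETHROUGH = 0
WRITEBACK = 1


class BlkDeviceModel:
    """Hypothetical host-side model of the writeback config field."""

    def __init__(self):
        self.writeback = WRITEBACK
        self.sync_every_write = False

    def config_store_writeback(self, value):
        # A config space store has no way to report failure, so the
        # toggle is always accepted, even on a writeback-only backend.
        self.writeback = WRITEBACK if value else WRITETHROUGH
        # Writethrough is emulated the same way as for a guest that never
        # negotiated WCE: sync after every write.
        self.sync_every_write = self.writeback == WRITETHROUGH

    def config_load_writeback(self):
        # The guest reads back the mode it last stored.
        return self.writeback
```

The store never fails; the host just changes how it completes later
writes.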

> I thought about this more after the call.  If we look at block device
> implementations on the host:
> 
> 1) Dumb device (ie. no flush support).
>    - Get write request, write() to backing file.  Repeat.
>    - If guest crashes it always sees in order, if host crashes you're
>      out of luck.
> 
> 2) Dumb device which tries to handle host crashes.
>    - No one wants this: requires an fdatasync() after every write.
> 
> 3) Smart device.  Uses AIO/threads to service requests.
>    - Needs flushes otherwise if guest crashes it can see out of order.
>    - Flushes must wait for outstanding requests.
> 
> 4) Smart device which tries to handle host crashes.
>    - Flushes must fdatasync() after waiting.
> 
> The interesting question is between 3 & 4:
> - Do we differentiate 3 and 4 from the guest side?
> - Or do we ban 3 and insist on 4?  Knowing that there are no guarantees that an
>   implementation will actually hit the metal (eg. crappy underlying
>   device or crappy non-barrier filesystem).
> 
> Whatever we do, I don't see why we'd want to toggle WCE after
> negotiation.  If you implement a smart device, you'd need to drop to a
> single thread, but you'd definitely lose host-crash reliability.

I think this classification doesn't correspond to the actual semantics
of disks.  My understanding is that:

If the host submits multiple requests then ordering is not guaranteed.
WCE=0 does not imply that requests become ordered.  Therefore comments
about dropping to a single thread don't appear correct to me.

For example, the host wants to ensure that write A reaches the disk
before write B.  With WCE=0 the host must wait for write A to complete
before submitting write B.

I also don't think you lose host-crash reliability by dropping to WCE=0.
The guest initiated the WCE 1 -> 0 change and therefore it understands
the rules for reaching stable storage.  The guest OS or application
would wait for write A to complete before issuing write B if A -> B
ordering is necessary.
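
To illustrate the ordering rule, here is a toy model in which a thread
pool plays the part of a device that may complete requests in any order.
device_write and ordered_writes are invented names, and the random delay
is only there to make reordering possible.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor


def device_write(log, name):
    """Pretend to write `name`; completion time is unpredictable."""
    time.sleep(random.uniform(0, 0.01))
    log.append(name)


def ordered_writes(names):
    """Submit writes one at a time, waiting for each to complete."""
    log = []
    with ThreadPoolExecutor(max_workers=4) as pool:
        for name in names:
            # Waiting for completion before the next submit is what
            # provides A -> B ordering; the cache mode does not.
            pool.submit(device_write, log, name).result()
    return log
```

Submitting every request up front and collecting the results afterwards
could complete them in any order, whatever the cache mode is.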

Finally, let's not worry about broken storage stacks that do not
propagate flushes.  Let's specify virtio-blk WCE to work like real disks
and then hypervisors can let users restrict themselves to safe modes if
the stack doesn't support all modes.

For example, it was typical to run legacy guests (old LVM) with WCE=0
since the guest storage stack did not propagate flushes.  That's a
*configuration* choice but at the spec level all we need to do is:
1. Make guests that are unaware of WCE default to WCE=0.
2. Expose WCE toggling to guests that are aware.
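
A small sketch of those two rules as feature-bit checks.  The bit
positions (9 and 11) are the ones I believe the Linux virtio_blk.h
header uses and should be checked against the spec; the two function
names are hypothetical.

```python
# Assumed bit positions; verify against the spec before relying on them.
VIRTIO_BLK_F_WCE = 1 << 9
VIRTIO_BLK_F_CONFIG_WCE = 1 << 11


def cache_mode_after_reset(negotiated):
    """Return 1 (writeback) only if the guest negotiated WCE, else 0."""
    return 1 if negotiated & VIRTIO_BLK_F_WCE else 0


def can_toggle(negotiated):
    """Only guests that negotiated CONFIG_WCE may write the config field."""
    return bool(negotiated & VIRTIO_BLK_F_CONFIG_WCE)
```

A guest that negotiates neither bit gets mode 0 and no toggle, which is
rule 1.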

Stefan