virtio-comment message



Subject: Re: [virtio] Re: [virtio-comment] Problems with VIRTIO-4 and writeback only disks


Stefan Hajnoczi <stefanha@redhat.com> writes:
> On Wed, Sep 11, 2013 at 12:49:26PM +0930, Rusty Russell wrote:
>> James Bottomley <jbottomley@parallels.com> writes:
>> > [resending to virtio-comment; it looks like I'm not subscribed to
>> > virtio-dev ... how do you subscribe?]
>> 
>> Mail to virtio-dev-subscribe@lists.oasis-open.org, or via
>>         https://www.oasis-open.org/mlmanage/
>> 
>> BTW, I've moved this to virtio@ since it's core business, with virtio-comment
>> cc'd.
>> 
>> > Sorry, I don't have a copy of the original email to reply to:
>> >
>> > https://lists.oasis-open.org/archives/virtio-comment/201308/msg00078.html
>> >
>> > The part that concerns me is this:
>> >
>> >> +5. The cache mode should be read from the writeback field of the configuration
>> >> +  if the VIRTIO_BLK_F_CONFIG_WCE feature is available; the driver can also
>> >> +  write to the field in order to toggle the cache between writethrough (0)
>> >> +  and writeback (1) mode.
>> >> +  If the feature is not available, the driver can instead look at the result
>> >> +  of negotiating VIRTIO_BLK_F_WCE: the cache will be in writeback mode after
>> >> +  reset if and only if VIRTIO_BLK_F_WCE is negotiated[30]
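
(For concreteness, the quoted rule reduces to something like the toy
Python sketch below.  The feature-bit values are my reading of the spec
-- FLUSH/WCE is bit 9, CONFIG_WCE is bit 11 -- and the
read_config_writeback callback is invented for illustration.)

```python
# Toy sketch of the quoted cache-mode rule; bit values assumed from the
# spec, read_config_writeback is a made-up accessor for the config field.
VIRTIO_BLK_F_WCE = 1 << 9
VIRTIO_BLK_F_CONFIG_WCE = 1 << 11

def cache_is_writeback(negotiated, read_config_writeback):
    """Return True if the cache is in writeback mode after reset."""
    if negotiated & VIRTIO_BLK_F_CONFIG_WCE:
        # Mode lives in the writeback config field (0 = writethrough,
        # 1 = writeback); the driver may also write it to toggle.
        return read_config_writeback() == 1
    # No CONFIG_WCE: fall back to whether WCE itself was negotiated.
    return bool(negotiated & VIRTIO_BLK_F_WCE)
```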
>> >
>> > The questions are twofold and have to do with Write Back only disks (to
>> > date we've seen quite a few ATA devices like this and a huge number of
>> > USB devices):
>> >
>> >      1. If the guest doesn't negotiate WCE, what do you do on the host
>> >         (flush on every write is one possible option; run unsafe and
>> >         hope the host doesn't crash is another).
>
> The default WCE=0 semantics should be that the host ensures every write
> reaches stable storage.

Here's the problem: I don't think anyone will really implement this.

lguest certainly doesn't flush every write, nor does bhyve.  Xen
famously didn't.  I can't see where qemu does it either, but it could be
buried in the aio stuff?
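
For reference, "flush on every write" would mean something like this on
the host side (a toy Python sketch, not taken from any of the
implementations above):

```python
import os

def handle_write(fd, offset, data):
    """WCE=0 done honestly: sync before completing the request."""
    os.pwrite(fd, data, offset)
    os.fdatasync(fd)      # one sync per write -- the cost nobody pays
    return len(data)      # only now report completion to the guest
```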

> This can optionally be overridden in the host.  It might be useful, for
> example during guest OS installation where you throw away the image if
> installation is interrupted by a power failure.

If no one does it, and they don't have to, let's just be honest in the
spec and specify that we don't expect them to do sync writes.

>> >      2. If the guest asks to toggle the device from writeback (1) to
>> >         writethrough (0) mode, what do you do?  Refuse the toggle would
>> >         be reasonable or flip back into whatever mode you were using to
>> >         handle 1. is also possible.
>
> I don't think there is a reasonable way to refuse since the WCE toggle
> is implemented as a configuration space field.  It's hard to return an
> error from configuration space stores - virtio-net moved to a control
> virtqueue in order to support configuration updates properly.
>
> The transition from writeback (1) to writethrough (0) mode should be
> allowed and the host uses the same solution as for #1.  I think your
> suggestion is a good idea.
>
>> I thought about this more after the call.  If we look at block device
>> implementations on the host:
>> 
>> 1) Dumb device (ie. no flush support).
>>    - Get write request, write() to backing file.  Repeat.
>>    - If guest crashes it always sees in order, if host crashes you're
>>      out of luck.
>> 
>> 2) Dumb device which tries to handle host crashes.
>>    - No one wants this: it requires an fdatasync() after every write.
>> 
>> 3) Smart device.  Uses AIO/threads to service requests.
>>    - Needs flushes otherwise if guest crashes it can see out of order.
>>    - Flushes must wait for outstanding requests.
>> 
>> 4) Smart device which tries to handle host crashes.
>>    - Flushes must fdatasync() after waiting.
>> 
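(Device 4 above is roughly this, as a toy Python sketch with invented
names -- threads standing in for whatever AIO machinery a real backend
uses:)

```python
import os, threading

class SmartBackend:
    """Toy model of device 4: threaded writes; flush drains then syncs."""
    def __init__(self, fd):
        self.fd = fd
        self.outstanding = []

    def submit_write(self, offset, data):
        t = threading.Thread(target=os.pwrite, args=(self.fd, data, offset))
        t.start()
        self.outstanding.append(t)

    def flush(self):
        # Flushes must wait for outstanding requests...
        for t in self.outstanding:
            t.join()
        self.outstanding.clear()
        # ...and must fdatasync() after waiting -- this is what
        # separates device 4 from device 3.
        os.fdatasync(self.fd)
```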
>> The interesting question is between 3 & 4:
>> - Do we differentiate 3 and 4 from the guest side?
>> - Or do we ban 3 and insist on 4?  Knowing that there are no guarantees that an
>>   implementation will actually hit the metal (eg. crappy underlying
>>   device or crappy non-barrier filesystem).
>> 
>> Whatever we do, I don't see why we'd want to toggle WCE after
>> negotiation.  If you implement a smart device, you'd need to drop to a
>> single thread, but you'd definitely lose host-crash reliability.
>
> I think this classification doesn't correspond to the actual semantics
> of disks.  My understanding is that:
>
> If the host submits multiple requests then ordering is not guaranteed.
> WCE=0 does not imply that requests become ordered.  Therefore comments
> about dropping to a single thread don't appear correct to me.
>
> For example, the host wants to ensure that write A reaches the disk
> before write B.  With WCE=0 the host must wait for write A to complete
> before submitting write B.

Right, I had missed that subtlety.
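
In other words the ordering discipline belongs to the submitter, not
the device.  Something like this toy sketch (helpers invented):

```python
import os, threading

def submit(fd, offset, data):
    """Async write; with WCE=0, completion implies stable storage."""
    done = threading.Event()
    def worker():
        os.pwrite(fd, data, offset)
        os.fdatasync(fd)   # WCE=0: completion == durable
        done.set()
    threading.Thread(target=worker).start()
    return done

def write_a_then_b(fd, a, b):
    # WCE=0 gives per-request durability, not ordering across requests:
    # to order A before B, wait for A's completion before submitting B.
    submit(fd, *a).wait()
    return submit(fd, *b)
```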

> I also don't think you lose host-crash reliability by dropping to WCE=0.
> The guest initiated the WCE 1 -> 0 change and therefore it understands
> the rules for reaching stable storage.  The guest OS or application
> would wait for write A to complete before issuing write B if A -> B
> ordering is necessary.

But how would this guarantee be implemented on the host without syncing
after every write?  Ok, technically it could batch updates to the used
ring and do a single fsync before that, but that doesn't seem much of a
win.
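
Concretely, the batching I mean is roughly this (toy sketch; the used
ring is just modelled as a list of completed request ids):

```python
import os

def complete_batch(fd, finished, used_ring):
    """One fdatasync() covers a whole batch of finished writes; only
    then are their completions published to the (list-modelled) used
    ring."""
    os.fdatasync(fd)            # single sync for the whole batch
    used_ring.extend(finished)  # now safe to tell the guest
```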

> Finally, let's not worry about broken storage stacks that do not
> propagate flushes.  Let's specify virtio-blk WCE to work like real disks
> and then hypervisors can let users restrict themselves to safe modes if
> the stack doesn't support all modes.

But they won't get host-crash resilience under any circumstances, right?
Certainly if the host fs doesn't support barriers they won't...

> For example, it was typical to run legacy guests (old LVM) with WCE=0
> since the guest storage stack did not propagate flushes.  That's a
> *configuration* choice but at the spec level all we need to do is:
> 1. Make guests that are unaware of WCE default to WCE=0.
> 2. Expose WCE toggling to guests that are aware.

That's one reason I prefer the simplified version: no-WCE means no
host-crash guarantees, with-WCE means it hits the metal.

Whether it's really sane to toggle WCE is another question, but it's
currently a feature bit so we can just not offer it.  Qemu seems not to
offer it by default.

Cheers,
Rusty.


