Subject: Re: [virtio] Re: [virtio-comment] Problems with VIRTIO-4 and writeback only disks
Stefan Hajnoczi <firstname.lastname@example.org> writes:
> On Wed, Sep 11, 2013 at 12:49:26PM +0930, Rusty Russell wrote:
>> James Bottomley <email@example.com> writes:
>> > [resending to virtio-comment; it looks like I'm not subscribed to
>> > virtio-dev ... how do you subscribe?]
>>
>> Mail to firstname.lastname@example.org, or via
>> https://www.oasis-open.org/mlmanage/
>>
>> BTW, I've moved this to virtio@ since it's core business, with
>> virtio-comment cc'd.
>>
>> > Sorry, I don't have a copy of the original email to reply to:
>> >
>> > https://lists.oasis-open.org/archives/virtio-comment/201308/msg00078.html
>> >
>> > The part that concerns me is this:
>> >
>> >> +5. The cache mode should be read from the writeback field of the configuration
>> >> +   if the VIRTIO_BLK_F_CONFIG_WCE feature is available; the driver can also
>> >> +   write to the field in order to toggle the cache between writethrough (0)
>> >> +   and writeback (1) mode.
>> >> +   If the feature is not available, the driver can instead look at the result
>> >> +   of negotiating VIRTIO_BLK_F_WCE: the cache will be in writeback mode after
>> >> +   reset if and only if VIRTIO_BLK_F_WCE is negotiated
>> >
>> > The questions are twofold and have to do with Write Back only disks (to
>> > date we've seen quite a few ATA devices like this and a huge number of
>> > USB devices):
>> >
>> > 1. If the guest doesn't negotiate WCE, what do you do on the host
>> >    (flush on every write is one possible option; run unsafe and
>> >    hope the host doesn't crash is another).
>
> The default WCE=0 semantics should be that the host ensures every write
> reaches stable storage.

Here's the problem: I don't think anyone will really implement this.
lguest certainly doesn't flush every write, nor does bhyve.  Xen
famously didn't.  I can't see where qemu does it either, but it could
be buried in the aio stuff?

> This can optionally be overridden in the host.
> It might be useful, for example during guest OS installation, where
> you throw away the image if installation is interrupted by a power
> failure.

If no one does it, and they don't have to, let's just be honest in the
spec and specify that we don't expect them to do sync writes.

>> > 2. If the guest asks to toggle the device from writeback (1) to
>> >    writethrough (0) mode, what do you do?  Refusing the toggle would
>> >    be reasonable; flipping back into whatever mode you were using to
>> >    handle 1. is also possible.
>
> I don't think there is a reasonable way to refuse since the WCE toggle
> is implemented as a configuration space field.  It's hard to return an
> error from configuration space stores - virtio-net moved to a control
> virtqueue in order to support configuration updates properly.
>
> The transition from writeback (1) to writethrough (0) mode should be
> allowed and the host uses the same solution as for #1.  I think your
> suggestion is a good idea.
>
>> I thought about this more after the call.  If we look at block device
>> implementations on the host:
>>
>> 1) Dumb device (ie. no flush support).
>>    - Get write request, write() to backing file.  Repeat.
>>    - If guest crashes it always sees in order; if host crashes you're
>>      out of luck.
>>
>> 2) Dumb device which tries to handle host crashes.
>>    - No one wants this: requires a fdatasync() after every write.
>>
>> 3) Smart device.  Uses AIO/threads to service requests.
>>    - Needs flushes, otherwise if guest crashes it can see out of order.
>>    - Flushes must wait for outstanding requests.
>>
>> 4) Smart device which tries to handle host crashes.
>>    - Flushes must fdatasync() after waiting.
>>
>> The interesting question is between 3 & 4:
>> - Do we differentiate 3 and 4 from the guest side?
>> - Or do we ban 3 and insist on 4?  Knowing that there are no
>>   guarantees that an implementation will actually hit the metal
>>   (eg. crappy underlying device or crappy non-barrier filesystem).
>>
>> Whatever we do, I don't see why we'd want to toggle WCE after
>> negotiation.  If you implement a smart device, you'd need to drop to
>> a single thread, but you'd definitely lose host-crash reliability.
>
> I think this classification doesn't correspond to the actual semantics
> of disks.  My understanding is that:
>
> If the host submits multiple requests then ordering is not guaranteed.
> WCE=0 does not imply that requests become ordered.  Therefore comments
> about dropping to a single thread don't appear correct to me.
>
> For example, the host wants to ensure that write A reaches the disk
> before write B.  With WCE=0 the host must wait for write A to complete
> before submitting write B.

Right, I had missed that subtlety.

> I also don't think you lose host-crash reliability by dropping to
> WCE=0.  The guest initiated the WCE 1 -> 0 change and therefore it
> understands the rules for reaching stable storage.  The guest OS or
> application would wait for write A to complete before issuing write B
> if A -> B ordering is necessary.

But how would this guarantee be implemented on the host without syncing
after every write?  Ok, technically it could batch updates to the used
ring and do a single fsync before that, but that doesn't seem much of a
win.

> Finally, let's not worry about broken storage stacks that do not
> propagate flushes.  Let's specify virtio-blk WCE to work like real
> disks, and then hypervisors can let users restrict themselves to safe
> modes if the stack doesn't support all modes.

But they won't get host-crash resilience under any circumstances,
right?  Certainly if the host fs doesn't support barriers they won't...

> For example, it was typical to run legacy guests (old LVM) with WCE=0
> since the guest storage stack did not propagate flushes.  That's a
> *configuration* choice, but at the spec level all we need to do is:
> 1. Make guests that are unaware of WCE default to WCE=0.
> 2. Expose WCE toggling to guests that are aware.
That's one reason I prefer the simplified version: no-WCE means no
host-crash guarantees, with-WCE means it hits the metal.

Whether it's really sane to toggle WCE is another question, but it's
currently a feature bit so we can just not offer it.  Qemu seems not to
offer it by default.

Cheers,
Rusty.