[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]
Subject: Re: [virtio] [PATCH] virtio-blk: restore VIRTIO_BLK_F_FLUSH and VIRTIO_BLK_F_CONFIG_WCE
On 08/07/2015 11:30, Michael S. Tsirkin wrote: > On Thu, Jul 02, 2015 at 04:20:22PM +0200, Paolo Bonzini wrote: >> VIRTIO_BLK_F_CONFIG_WCE is important in order to achieve good performance >> (up to 2x, though more realistically +30-40%) in latency-bound workloads. >> However, it was removed by mistake together with VIRTIO_BLK_F_FLUSH. >> >> In addition, even removing VIRTIO_BLK_F_FLUSH was probably not a great >> idea, because it simplifies simple drivers (e.g. firmware) that are okay >> with a writethrough cache but still need data to persist after power loss. >> What really should have been removed is just the possibility that devices >> not propose VIRTIO_BLK_F_FLUSH, but even that only deserves a "SHOULD" in >> the new world of conformance statements. >> >> Restore these, with the following changes: >> >> * clarify and use conformance statements in order to define writeback >> and writethrough caching according to what is commonly done by high-end >> storage. >> >> * clarify (with conformance statements) the influence of the >> VIRTIO_BLK_F_FLUSH feature on caching and how to proceed if only one of >> VIRTIO_BLK_F_FLUSH and VIRTIO_BLK_F_CONFIG_WCE is negotiated. >> >> * strengthen the requirement for persisting writes to MUST after >> a VIRTIO_BLK_T_FLUSH request (and in other cases too involving the >> new features). >> The suggested behavior upon feature negotiation is okay for the Linux >> implementation of virtio1, even after the implementation is modified to >> support the two new features. >> >> This fixes VIRTIO-144. >> >> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> >> --- >> conformance.tex | 2 + >> content.tex | 130 +++++++++++++++++++++++++++++++++++++++++++++++--------- >> 2 files changed, 111 insertions(+), 21 deletions(-) >> >> diff --git a/conformance.tex b/conformance.tex >> index 29c6ba8..8753cd6 100644 >> --- a/conformance.tex >> +++ b/conformance.tex >> @@ -102,6 +102,7 @@ A network driver MUST conform to the following normative statements: >> A block driver MUST conform to the following normative statements: >> >> \begin{itemize} >> +\item \ref{drivernormative:Device Types / Block Device / Device Initialization} >> \item \ref{drivernormative:Device Types / Block Device / Device Operation} >> \end{itemize} >> >> @@ -205,6 +206,7 @@ A network device MUST conform to the following normative statements: >> A block device MUST conform to the following normative statements: >> >> \begin{itemize} >> +\item \ref{devicenormative:Device Types / Block Device / Device Initialization} >> \item \ref{devicenormative:Device Types / Block Device / Device Operation} >> \end{itemize} >> >> diff --git a/content.tex b/content.tex >> index 3b12263..e58a0bd 100644 >> --- a/content.tex >> +++ b/content.tex >> @@ -3723,8 +3723,13 @@ device except where noted. >> >> \item[VIRTIO_BLK_F_BLK_SIZE (6)] Block size of disk is in \field{blk_size}. >> >> +\item[VIRTIO_BLK_F_FLUSH (9)] Cache flush command support. >> + >> \item[VIRTIO_BLK_F_TOPOLOGY (10)] Device exports information on optimal I/O >> alignment. >> + >> +\item[VIRTIO_BLK_F_CONFIG_WCE (11)] Device can toggle its cache between writeback >> + and writethrough modes. >> \end{description} >> >> \subsubsection{Legacy Interface: Feature bits}\label{sec:Device Types / Block Device / Feature bits / Legacy Interface: Feature bits} >> @@ -3733,16 +3738,12 @@ device except where noted. >> \item[VIRTIO_BLK_F_BARRIER (0)] Device supports request barriers. >> >> \item[VIRTIO_BLK_F_SCSI (7)] Device supports scsi packet commands. >> - >> -\item[VIRTIO_BLK_F_FLUSH (9)] Cache flush command support. >> - >> -\item[VIRTIO_BLK_F_CONFIG_WCE (11)] Device can toggle its cache between writeback >> - and writethrough modes. >> \end{description} >> >> -VIRTIO_BLK_F_FLUSH was also called VIRTIO_BLK_F_WCE: Legacy drivers >> -MUST only negotiate this feature if they are capable of sending >> -VIRTIO_BLK_T_FLUSH commands. >> +\begin{note} >> + In the legacy interface, VIRTIO_BLK_F_FLUSH was also >> + called VIRTIO_BLK_F_WCE. >> +\end{note} >> >> \subsection{Device configuration layout}\label{sec:Device Types / Block Device / Device configuration layout} >> >> @@ -3771,7 +3772,7 @@ struct virtio_blk_config { >> // optimal (suggested maximum) I/O size in blocks >> le32 opt_io_size; >> } topology; >> - u8 reserved; >> + u8 writeback; >> }; >> \end{lstlisting} >> >> @@ -3801,21 +3802,52 @@ according to the native endian of the guest rather than >> \field{topology} struct can be read to determine the physical block size and optimal >> I/O lengths for the driver to use. This also does not affect the units >> in the protocol, only performance. >> + >> +\item If the VIRTIO_BLK_F_CONFIG_WCE feature is negotiated, the cache >> + mode can be read or set through the \field{writeback} field. 0 corresponds >> + to a writethrough cache, 1 to a writeback cache\footnote{Consistent with >> + \ref{devicenormative:Device Types / Block Device / Device Operation}, >> + a writethrough cache can be defined broadly as a cache that commits >> + writes to non-volatile storage before reporting their completion. >> + For example, a battery-backed writeback cache actually counts as >> + writethrough according to this definition.}. The cache mode after >> + reset is undefined. >> \end{enumerate} >> >> +\drivernormative{\subsubsection}{Device Initialization}{Device Types / Block Device / Device Initialization} >> + >> +Drivers SHOULD NOT negotiate VIRTIO_BLK_F_FLUSH if they are incapable of >> +sending VIRTIO_BLK_T_FLUSH commands. >> + >> +If the VIRTIO_BLK_F_CONFIG_WCE feature is not negotiated, but >> +VIRTIO_BLK_F_FLUSH was proposed by the device, the driver MAY deduce >> +the presence of a writethrough cache. Otherwise, the driver SHOULD >> +assume presence of a writeback cache. >> + >> +\devicenormative{\subsubsection}{Device Initialization}{Device Types / Block Device / Device Initialization} >> + >> +Devices SHOULD always propose VIRTIO_BLK_F_FLUSH, and MUST propose it >> +if they propose VIRTIO_BLK_F_CONFIG_WCE. > > s/propose/offer/ Okay. >> + >> +If VIRTIO_BLK_F_CONFIG_WCE is negotiated but the VIRTIO_BLK_F_FLUSH feature >> +is not, the device MUST set \field{writeback} to 0 as soon as the driver >> +sets the FEATURES_OK status bit. > > So this field is dual use, sometimes it's written by driver, sometimes - > by a device. A bit messy I think, will cause races or at least > confusion. It doesn't cause races, because the device can set it before returning from the instruction that wrote FEATURES_OK. > What is this in aid of? > > If VIRTIO_BLK_F_FLUSH is not negotiated devices already commit writes. > > Why also set writeback to 0? So that it's consistent with the status of the device. The driver may set it to 1 to prevent the device from committing the writes. >> + >> \subsubsection{Legacy Interface: Device Initialization}\label{sec:Device Types / Block Device / Device Initialization / Legacy Interface: Device Initialization} >> >> -The \field{reserved} field used to be called \field{writeback}. If the >> -VIRTIO_BLK_F_CONFIG_WCE feature is offered, the cache mode can be >> -read from \field{writeback}; the >> -driver can also write to the field in order to toggle the cache >> -between writethrough (0) and writeback (1) mode. If the feature is >> -not available, the driver can instead look at the result of >> -negotiating VIRTIO_BLK_F_FLUSH: the cache will be in writeback mode >> -after reset if and only if VIRTIO_BLK_F_FLUSH is negotiated. >> +Because legacy devices do not have FEATURES_OK, legacy device behavior >> +differs around feature negotiation; legacy devices never modify >> +\field{writeback} as a result of a driver's write to the status register. >> +In particular, some legacy drivers wrote to \field{writeback} between >> +setting the DRIVER status bit and setting the DRIVER_OK status bit. >> +It would be a bug for the device to overwrite \field{writeback} when >> +the DRIVER_OK status bit is set. > > Can this be a MUST NOT statement? > And when is it ok to write then? For legacy, never. For modern, when you get FEATURES_OK. >> + >> +Legacy devices must also never modify \field{writeback} as a result of > > We can not require anything from legacy devices. They are out there. They didn't as far as I know, and if you want to implement both modern and legacy this is a deviation from the requirements of modern devices. >> +a driver's write to the guest features registers, for example if the >> +driver sets the VIRTIO_BLK_F_CONFIG_WCE guest feature bit but does not >> +set the VIRTIO_BLK_F_FLUSH bit. > > Is it ever ok for device to modify the field? Yes, see above regarding FEATURES_OK. That's the only case. >> >> -Some older legacy devices did not operate in writethrough mode even >> -after a driver announced lack of support for VIRTIO_BLK_F_FLUSH. >> >> \subsection{Device Operation}\label{sec:Device Types / Block Device / Device Operation} >> >> @@ -3865,14 +3897,67 @@ A driver SHOULD accept the VIRTIO_BLK_F_RO feature if offered. >> A driver MUST set \field{sector} to 0 for a VIRTIO_BLK_T_FLUSH request. >> A driver SHOULD NOT include any data in a VIRTIO_BLK_T_FLUSH request. >> >> +If the VIRTIO_BLK_F_CONFIG_WCE feature is negotiated, the driver MAY >> +switch to writethrough or writeback mode by writing respectively 0 and >> +1 to the \field{writeback} field. After writing a 0 to \field{writeback}, >> +the driver MUST NOT assume that any volatile writes have been committed >> +to non-volatile storage. > > I think volatile here clashes with its use for buffers. > How about we say e.g. "permanent device backend storage" here? > Same below. Okay, or even just s/non-volatile/permanent/. >> + >> \devicenormative{\subsubsection}{Device Operation}{Device Types / Block Device / Device Operation} >> >> A device MUST set the \field{status} byte to VIRTIO_BLK_S_IOERR >> for a write request if the VIRTIO_BLK_F_RO feature if offered, and MUST NOT >> write any data. >> >> -Upon receipt of a VIRTIO_BLK_T_FLUSH request, the device SHOULD ensure >> -that any writes which were completed are committed to non-volatile storage. >> +A write is considered volatile when it is submitted; the contents of >> +sectors covered by a volatile are undefined until the write becomes stable. >> +A write becomes stable once it is completed and one or more of the >> +following conditions is true: >> + >> +\begin{enumerate} >> +\item\label{item:flush1} neither VIRTIO_BLK_F_CONFIG_WCE nor >> + VIRTIO_BLK_F_FLUSH feature were negotiated, but VIRTIO_BLK_F_FLUSH was >> + proposed by the device; >> + >> +\item\label{item:flush2} the VIRTIO_BLK_F_CONFIG_WCE feature was negotiated and the >> + \field{writeback} field in configuration space was 0 \textbf{all the time between >> + the submission of the write and its completion}; >> + >> +\item\label{item:flush3} a VIRTIO_BLK_T_FLUSH request is sent \textbf{after the write is >> + completed} and is completed itself. >> +\end{enumerate} >> + >> +The device MUST ensure that stable writes are committed to >> +non-volatile storage, > > So in particular at the moment QEMU violates this requirement. It doesn't. Why do you say it does? > How would it avoid violating it? E.g. I can use a file in tmpfs as a backend. > Why do drivers have to know? > I think it's useful to pretend to guest data is stored, and then cheat. > In case of tmpfs flushes are cheap, so I don't see what do we > gain by requiring that drivers skip it with such force. > > How about we say something like "permanent device backend storage" > or even "device backend storage"? > >> before reporting completion of the write >> +(cases~\ref{item:flush1} and~\ref{item:flush2}) or the flush >> +(case~\ref{item:flush3}). Failure to do so can cause data loss >> +in case of a crash. >> + >> +% ** Not using MAY/MAY NOT intentionally, this is not optional behavior ** >> +If the driver changes \field{writeback} between the submission of the write >> +and its completion, the write can be either volatile or stable when >> +its completion is reported; the exact behavior is undefined. > > So MUST then? No, definitely not MUST. "The write MUST be either X or not X" is a tautology. "The write MAY be either X or not X" also does not make sense, because there's no way for an implementation to do neither X nor not X. Tertium non datur. It's describing a case where the behavior is undefined. The RFC does not include a case for this, so I used "can". >> + >> +% According to the device requirements for device initialization: >> +% Propose(CONFIG_WCE) => Propose(FLUSH). >> +% >> +% After reversing the implication: >> +% not Propose(FLUSH) => not Propose(CONFIG_WCE). >> + >> +If VIRTIO_BLK_F_FLUSH was not proposed by the >> + device\footnote{Note that in this case, according to >> + \ref{devicenormative:Device Types / Block Device / Device Initialization}, >> + the device will not have proposed VIRTIO_BLK_F_CONFIG_WCE either.}, the >> +device MAY also commit writes to non-volatile storage before reporting >> +their completion. Unlike case~\ref{item:flush1}, however, this is not >> +an absolute requirement of the specification. > > Pls drop "absolute", let's use specific terms from an RFC. "absolute requirement" comes from RFC2119. It means "it's not a MUST", except it's valid English. > Maybe this one is a SHOULD? No, it's definitely a MAY. "One vendor may choose to include the item because a particular marketplace requires it or because the vendor feels that it enhances the product while another vendor may omit the same item" is a perfect description of this. Also, the driver can recognize the situation (and though it cannot do anything about it, it can log something). >> + >> +\begin{note} >> + An implementation that does not propose VIRTIO_BLK_F_FLUSH and does not commit >> + completed writes will not be resilient to data loss in case of crashes. >> + Not proposing VIRTIO_BLK_F_FLUSH is an absolute requirement >> + for implementations that do not wish to be safe against such data losses. > > I think this note confuses more than it clarifies. > > It seems to imply VIRTIO_BLK_F_FLUSH somehow means "I will not lose > your data on crash". That requirement is just too strong, devices often have > no way to know, and guests often don't care. I don't think it's too strong. Of course it is more like "I will not lose your data on crash, unless the user screwed up". > Why is it tied to VIRTIO_BLK_F_FLUSH at all? Is this an optimization? > > >> +\end{note} >> >> \subsubsection{Legacy Interface: Device Operation}\label{sec:Device Types / Block Device / Device Operation / Legacy Interface: Device Operation} >> When using the legacy interface, transitional devices and drivers >> @@ -3907,6 +3992,9 @@ serve as data consistency guarantee. Only a VIRTIO_BLK_T_FLUSH request >> does that. >> \end{note} >> >> +Some older legacy devices did not commit completed writes to non-volatile >> +storage, even if VIRTIO_BLK_F_FLUSH was proposed but not negotiated. >> + > > Anything transitional drivers can do it work around that? Always accept VIRTIO_BLK_F_FLUSH. > If no, this doesn't seem like a useful bit of info to me. I can remove it, but it was in the existing document and there is a workaround. Paolo
[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]