OASIS Mailing List ArchivesView the OASIS mailing list archive below
or browse/search using MarkMail.

 


Help: OASIS Mailing Lists Help | MarkMail Help

virtio message

[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]


Subject: Re: [virtio] [PATCH] virtio-blk: restore VIRTIO_BLK_F_FLUSH and VIRTIO_BLK_F_CONFIG_WCE


On 08/07/2015 11:30, Michael S. Tsirkin wrote:
> On Thu, Jul 02, 2015 at 04:20:22PM +0200, Paolo Bonzini wrote:
>> VIRTIO_BLK_F_CONFIG_WCE is important in order to achieve good performance
>> (up to 2x, though more realistically +30-40%) in latency-bound workloads.
>> However, it was removed by mistake together with VIRTIO_BLK_F_FLUSH.
>>
>> In addition, even removing VIRTIO_BLK_F_FLUSH was probably not a great
>> idea, because it simplifies simple drivers (e.g. firmware) that are okay
>> with a writethrough cache but still need data to persist after power loss.
>> What really should have been removed is just the possibility that devices
>> not propose VIRTIO_BLK_F_FLUSH, but even that only deserves a "SHOULD" in
>> the new world of conformance statements.
>>
>> Restore these, with the following changes:
>>
>> * clarify and use conformance statements in order to define writeback
>> and writethrough caching according to what is commonly done by high-end
>> storage.
>>
>> * clarify (with conformance statements) the influence of the
>> VIRTIO_BLK_F_FLUSH feature on caching and how to proceed if only one of
>> VIRTIO_BLK_F_FLUSH and VIRTIO_BLK_F_CONFIG_WCE is negotiated.
>>
>> * strengthen the requirement for persisting writes to MUST after
>> a VIRTIO_BLK_T_FLUSH request (and in other cases too involving the
>> new features).
>> The suggested behavior upon feature negotiation is okay for the Linux
>> implementation of virtio1, even after the implementation is modified to
>> support the two new features.
>>
>> This fixes VIRTIO-144.
>>
>> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
>> ---
>>  conformance.tex |   2 +
>>  content.tex     | 130 +++++++++++++++++++++++++++++++++++++++++++++++---------
>>  2 files changed, 111 insertions(+), 21 deletions(-)
>>
>> diff --git a/conformance.tex b/conformance.tex
>> index 29c6ba8..8753cd6 100644
>> --- a/conformance.tex
>> +++ b/conformance.tex
>> @@ -102,6 +102,7 @@ A network driver MUST conform to the following normative statements:
>>  A block driver MUST conform to the following normative statements:
>>  
>>  \begin{itemize}
>> +\item \ref{drivernormative:Device Types / Block Device / Device Initialization}
>>  \item \ref{drivernormative:Device Types / Block Device / Device Operation}
>>  \end{itemize}
>>  
>> @@ -205,6 +206,7 @@ A network device MUST conform to the following normative statements:
>>  A block device MUST conform to the following normative statements:
>>  
>>  \begin{itemize}
>> +\item \ref{devicenormative:Device Types / Block Device / Device Initialization}
>>  \item \ref{devicenormative:Device Types / Block Device / Device Operation}
>>  \end{itemize}
>>  
>> diff --git a/content.tex b/content.tex
>> index 3b12263..e58a0bd 100644
>> --- a/content.tex
>> +++ b/content.tex
>> @@ -3723,8 +3723,13 @@ device except where noted.
>>  
>>  \item[VIRTIO_BLK_F_BLK_SIZE (6)] Block size of disk is in \field{blk_size}.
>>  
>> +\item[VIRTIO_BLK_F_FLUSH (9)] Cache flush command support.
>> +
>>  \item[VIRTIO_BLK_F_TOPOLOGY (10)] Device exports information on optimal I/O
>>      alignment.
>> +
>> +\item[VIRTIO_BLK_F_CONFIG_WCE (11)] Device can toggle its cache between writeback
>> +    and writethrough modes.
>>  \end{description}
>>  
>>  \subsubsection{Legacy Interface: Feature bits}\label{sec:Device Types / Block Device / Feature bits / Legacy Interface: Feature bits}
>> @@ -3733,16 +3738,12 @@ device except where noted.
>>  \item[VIRTIO_BLK_F_BARRIER (0)] Device supports request barriers.
>>  
>>  \item[VIRTIO_BLK_F_SCSI (7)] Device supports scsi packet commands.
>> -
>> -\item[VIRTIO_BLK_F_FLUSH (9)] Cache flush command support.
>> -
>> -\item[VIRTIO_BLK_F_CONFIG_WCE (11)] Device can toggle its cache between writeback
>> -    and writethrough modes.
>>  \end{description}
>>  
>> -VIRTIO_BLK_F_FLUSH was also called VIRTIO_BLK_F_WCE: Legacy drivers
>> -MUST only negotiate this feature if they are capable of sending
>> -VIRTIO_BLK_T_FLUSH commands.
>> +\begin{note}
>> +  In the legacy interface, VIRTIO_BLK_F_FLUSH was also
>> +  called VIRTIO_BLK_F_WCE.
>> +\end{note}
>>  
>>  \subsection{Device configuration layout}\label{sec:Device Types / Block Device / Device configuration layout}
>>  
>> @@ -3771,7 +3772,7 @@ struct virtio_blk_config {
>>                  // optimal (suggested maximum) I/O size in blocks
>>                  le32 opt_io_size;
>>          } topology;
>> -        u8 reserved;
>> +        u8 writeback;
>>  };
>>  \end{lstlisting}
>>  
>> @@ -3801,21 +3802,52 @@ according to the native endian of the guest rather than
>>    \field{topology} struct can be read to determine the physical block size and optimal
>>    I/O lengths for the driver to use. This also does not affect the units
>>    in the protocol, only performance.
>> +
>> +\item If the VIRTIO_BLK_F_CONFIG_WCE feature is negotiated, the cache
>> +  mode can be read or set through the \field{writeback} field.  0 corresponds
>> +  to a writethrough cache, 1 to a writeback cache\footnote{Consistent with
>> +    \ref{devicenormative:Device Types / Block Device / Device Operation},
>> +    a writethrough cache can be defined broadly as a cache that commits
>> +    writes to non-volatile storage before reporting their completion.
>> +    For example, a battery-backed writeback cache actually counts as
>> +    writethrough according to this definition.}.  The cache mode after
>> +  reset is undefined.
>>  \end{enumerate}
>>  
>> +\drivernormative{\subsubsection}{Device Initialization}{Device Types / Block Device / Device Initialization}
>> +
>> +Drivers SHOULD NOT negotiate VIRTIO_BLK_F_FLUSH if they are incapable of
>> +sending VIRTIO_BLK_T_FLUSH commands.
>> +
>> +If the VIRTIO_BLK_F_CONFIG_WCE feature is not negotiated, but
>> +VIRTIO_BLK_F_FLUSH was proposed by the device, the driver MAY deduce
>> +the presence of a writethrough cache.  Otherwise, the driver SHOULD
>> +assume presence of a writeback cache.
>> +
>> +\devicenormative{\subsubsection}{Device Initialization}{Device Types / Block Device / Device Initialization}
>> +
>> +Devices SHOULD always propose VIRTIO_BLK_F_FLUSH, and MUST propose it
>> +if they propose VIRTIO_BLK_F_CONFIG_WCE.
> 
> s/propose/offer/

Okay.

>> +
>> +If VIRTIO_BLK_F_CONFIG_WCE is negotiated but the VIRTIO_BLK_F_FLUSH feature
>> +is not, the device MUST set \field{writeback} to 0 as soon as the driver
>> +sets the FEATURES_OK status bit.
> 
> So this field is dual use, sometimes it's written by driver, sometimes -
> by a device. A bit messy I think, will cause races or at least
> confusion.

It doesn't cause races, because the device can set it before returning
from the instruction that wrote FEATURES_OK.

> What is this in aid of?
> 
> If VIRTIO_BLK_F_FLUSH is not negotiated devices already commit writes.
> 
> Why also set writeback to 0?

So that it's consistent with the status of the device.  The driver may
set it to 1 to prevent the device from committing the writes.

>> +
>>  \subsubsection{Legacy Interface: Device Initialization}\label{sec:Device Types / Block Device / Device Initialization / Legacy Interface: Device Initialization}
>>  
>> -The \field{reserved} field used to be called \field{writeback}.  If the
>> -VIRTIO_BLK_F_CONFIG_WCE feature is offered, the cache mode can be
>> -read from \field{writeback}; the
>> -driver can also write to the field in order to toggle the cache
>> -between writethrough (0) and writeback (1) mode.  If the feature is
>> -not available, the driver can instead look at the result of
>> -negotiating VIRTIO_BLK_F_FLUSH: the cache will be in writeback mode
>> -after reset if and only if VIRTIO_BLK_F_FLUSH is negotiated.
>> +Because legacy devices do not have FEATURES_OK, legacy device behavior
>> +differs around feature negotiation; legacy devices never modify
>> +\field{writeback} as a result of a driver's write to the status register.
>> +In particular, some legacy drivers wrote to \field{writeback} between
>> +setting the DRIVER status bit and setting the DRIVER_OK status bit.
>> +It would be a bug for the device to overwrite \field{writeback} when
>> +the DRIVER_OK status bit is set.
> 
> Can this be a MUST NOT statement?
> And when is it ok to write then?

For legacy, never.  For modern, when you get FEATURES_OK.

>> +
>> +Legacy devices must also never modify \field{writeback} as a result of
> 
> We can not require anything from legacy devices. They are out there.

They didn't as far as I know, and if you want to implement both modern
and legacy this is a deviation from the requirements of modern devices.

>> +a driver's write to the guest features registers, for example if the
>> +driver sets the VIRTIO_BLK_F_CONFIG_WCE guest feature bit but does not
>> +set the VIRTIO_BLK_F_FLUSH bit.
> 
> Is it ever ok for device to modify the field?

Yes, see above regarding FEATURES_OK.  That's the only case.

>>  
>> -Some older legacy devices did not operate in writethrough mode even
>> -after a driver announced lack of support for VIRTIO_BLK_F_FLUSH.
>>  
>>  \subsection{Device Operation}\label{sec:Device Types / Block Device / Device Operation}
>>  
>> @@ -3865,14 +3897,67 @@ A driver SHOULD accept the VIRTIO_BLK_F_RO feature if offered.
>>  A driver MUST set \field{sector} to 0 for a VIRTIO_BLK_T_FLUSH request.
>>  A driver SHOULD NOT include any data in a VIRTIO_BLK_T_FLUSH request.
>>  
>> +If the VIRTIO_BLK_F_CONFIG_WCE feature is negotiated, the driver MAY
>> +switch to writethrough or writeback mode by writing respectively 0 and
>> +1 to the \field{writeback} field.  After writing a 0 to \field{writeback},
>> +the driver MUST NOT assume that any volatile writes have been committed
>> +to non-volatile storage.
> 
> I think volatile here clashes with its use for buffers.
> How about we say e.g.  "permanent device backend storage" here?
> Same below.

Okay, or even just s/non-volatile/permanent/.

>> +
>>  \devicenormative{\subsubsection}{Device Operation}{Device Types / Block Device / Device Operation}
>>  
>>  A device MUST set the \field{status} byte to VIRTIO_BLK_S_IOERR
>>  for a write request if the VIRTIO_BLK_F_RO feature if offered, and MUST NOT
>>  write any data.
>>  
>> -Upon receipt of a VIRTIO_BLK_T_FLUSH request, the device SHOULD ensure
>> -that any writes which were completed are committed to non-volatile storage.
>> +A write is considered volatile when it is submitted; the contents of
>> +sectors covered by a volatile are undefined until the write becomes stable.
>> +A write becomes stable once it is completed and one or more of the
>> +following conditions is true:
>> +
>> +\begin{enumerate}
>> +\item\label{item:flush1} neither VIRTIO_BLK_F_CONFIG_WCE nor
>> +  VIRTIO_BLK_F_FLUSH feature were negotiated, but VIRTIO_BLK_F_FLUSH was
>> +  proposed by the device;
>> +
>> +\item\label{item:flush2} the VIRTIO_BLK_F_CONFIG_WCE feature was negotiated and the
>> +  \field{writeback} field in configuration space was 0 \textbf{all the time between
>> +  the submission of the write and its completion};
>> +
>> +\item\label{item:flush3} a VIRTIO_BLK_T_FLUSH request is sent \textbf{after the write is
>> +  completed} and is completed itself.
>> +\end{enumerate}
>> +
>> +The device MUST ensure that stable writes are committed to
>> +non-volatile storage,
> 
> So in particular at the moment QEMU violates this requirement.

It doesn't.  Why do you say it does?

> How would it avoid violating it?  E.g. I can use a file in tmpfs as a backend.
> Why do drivers have to know?
> I think it's useful to pretend to guest data is stored, and then cheat.
> In case of tmpfs flushes are cheap, so I don't see what do we
> gain by requiring that drivers skip it with such force.
> 
> How about we say something like "permanent device backend storage"
> or even "device backend storage"?
> 
>> before reporting completion of the write
>> +(cases~\ref{item:flush1} and~\ref{item:flush2}) or the flush
>> +(case~\ref{item:flush3}).  Failure to do so can cause data loss
>> +in case of a crash.
>> +
>> +% ** Not using MAY/MAY NOT intentionally, this is not optional behavior **
>> +If the driver changes \field{writeback} between the submission of the write
>> +and its completion, the write can be either volatile or stable when
>> +its completion is reported; the exact behavior is undefined.
> 
> So MUST then?

No, definitely not MUST.  "The write MUST be either X or not X" is a
tautology.  "The write MAY be either X or not X" also does not make
sense, because there's no way for an implementation to do neither X nor
not X.  Tertium non datur.

It's describing a case where the behavior is undefined.  The RFC does
not include a case for this, so I used "can".

>> +
>> +% According to the device requirements for device initialization:
>> +%   Propose(CONFIG_WCE) => Propose(FLUSH).
>> +%
>> +% After reversing the implication:
>> +%   not Propose(FLUSH) => not Propose(CONFIG_WCE).
>> +
>> +If VIRTIO_BLK_F_FLUSH was not proposed by the
>> +  device\footnote{Note that in this case, according to
>> +  \ref{devicenormative:Device Types / Block Device / Device Initialization},
>> +  the device will not have proposed VIRTIO_BLK_F_CONFIG_WCE either.}, the
>> +device MAY also commit writes to non-volatile storage before reporting
>> +their completion.  Unlike case~\ref{item:flush1}, however, this is not
>> +an absolute requirement of the specification.
> 
> Pls drop "absolute", let's use specific terms from an RFC.

"absolute requirement" comes from RFC2119.  It means "it's not a MUST",
except it's valid English.

> Maybe this one is a SHOULD?

No, it's definitely a MAY.  "One vendor may choose to include the item
because a particular marketplace requires it or because the vendor feels
that it enhances the product while another vendor may omit the same
item" is a perfect description of this.  Also, the driver can recognize
the situation (and though it cannot do anything about it, it can log
something).

>> +
>> +\begin{note}
>> +  An implementation that does not propose VIRTIO_BLK_F_FLUSH and does not commit
>> +  completed writes will not be resilient to data loss in case of crashes.
>> +  Not proposing VIRTIO_BLK_F_FLUSH is an absolute requirement
>> +  for implementations that do not wish to be safe against such data losses.
> 
> I think this note confuses more than it clarifies.
> 
> It seems to imply VIRTIO_BLK_F_FLUSH somehow means "I will not lose
> your data on crash". That requirement is just too strong, devices often have
> no way to know, and guests often don't care.

I don't think it's too strong.  Of course it is more like "I will not
lose your data on crash, unless the user screwed up".

> Why is it tied to VIRTIO_BLK_F_FLUSH at all? Is this an optimization?
> 
> 
>> +\end{note}
>>  
>>  \subsubsection{Legacy Interface: Device Operation}\label{sec:Device Types / Block Device / Device Operation / Legacy Interface: Device Operation}
>>  When using the legacy interface, transitional devices and drivers
>> @@ -3907,6 +3992,9 @@ serve as data consistency guarantee.  Only a VIRTIO_BLK_T_FLUSH request
>>  does that.
>>  \end{note}
>>  
>> +Some older legacy devices did not commit completed writes to non-volatile
>> +storage, even if VIRTIO_BLK_F_FLUSH was proposed but not negotiated.
>> +
> 
> Anything transitional drivers can do it work around that?

Always accept VIRTIO_BLK_F_FLUSH.

> If no, this doesn't seem like a useful bit of info to me.

I can remove it, but it was in the existing document and there is a
workaround.

Paolo


[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]