Subject: [PATCH 5/6] virtio-block: Maintain block device spec in separate file
Move virtio block device specification to its own file similar to recent virtio devices.

Fixes: https://github.com/oasis-tcs/virtio-spec/issues/153
Signed-off-by: Parav Pandit <parav@nvidia.com>
---
 content.tex      | 1315 +---------------------------------------------
 virtio-block.tex | 1315 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 1316 insertions(+), 1314 deletions(-)
 create mode 100644 virtio-block.tex

diff --git a/content.tex b/content.tex
index 0a25765..9f9180b 100644
--- a/content.tex
+++ b/content.tex
@@ -4598,1320 +4598,7 @@ \subsubsection{Legacy Interface: Framing Requirements}\label{sec:Device
 See \ref{sec:Basic Facilities of a Virtio Device / Virtqueues / Message Framing}.

-\section{Block Device}\label{sec:Device Types / Block Device}
-
-The virtio block device is a simple virtual block device (ie.
-disk). Read and write requests (and other exotic requests) are
-placed in one of its queues, and serviced (probably out of order) by the
-device except where noted.
-
-\subsection{Device ID}\label{sec:Device Types / Block Device / Device ID}
-  2
-
-\subsection{Virtqueues}\label{sec:Device Types / Block Device / Virtqueues}
-\begin{description}
-\item[0] requestq1
-\item[\ldots]
-\item[N-1] requestqN
-\end{description}
-
- N=1 if VIRTIO_BLK_F_MQ is not negotiated, otherwise N is set by
- \field{num_queues}.
-
-\subsection{Feature bits}\label{sec:Device Types / Block Device / Feature bits}
-
-\begin{description}
-\item[VIRTIO_BLK_F_SIZE_MAX (1)] Maximum size of any single segment is
-    in \field{size_max}.
-
-\item[VIRTIO_BLK_F_SEG_MAX (2)] Maximum number of segments in a
-    request is in \field{seg_max}.
-
-\item[VIRTIO_BLK_F_GEOMETRY (4)] Disk-style geometry specified in
-    \field{geometry}.
-
-\item[VIRTIO_BLK_F_RO (5)] Device is read-only.
-
-\item[VIRTIO_BLK_F_BLK_SIZE (6)] Block size of disk is in \field{blk_size}.
-
-\item[VIRTIO_BLK_F_FLUSH (9)] Cache flush command support.
- -\item[VIRTIO_BLK_F_TOPOLOGY (10)] Device exports information on optimal I/O - alignment. - -\item[VIRTIO_BLK_F_CONFIG_WCE (11)] Device can toggle its cache between writeback - and writethrough modes. - -\item[VIRTIO_BLK_F_MQ (12)] Device supports multiqueue. - -\item[VIRTIO_BLK_F_DISCARD (13)] Device can support discard command, maximum - discard sectors size in \field{max_discard_sectors} and maximum discard - segment number in \field{max_discard_seg}. - -\item[VIRTIO_BLK_F_WRITE_ZEROES (14)] Device can support write zeroes command, - maximum write zeroes sectors size in \field{max_write_zeroes_sectors} and - maximum write zeroes segment number in \field{max_write_zeroes_seg}. - -\item[VIRTIO_BLK_F_LIFETIME (15)] Device supports providing storage lifetime - information. - -\item[VIRTIO_BLK_F_SECURE_ERASE (16)] Device supports secure erase command, - maximum erase sectors count in \field{max_secure_erase_sectors} and - maximum erase segment number in \field{max_secure_erase_seg}. - -\item[VIRTIO_BLK_F_ZONED(17)] Device is a Zoned Block Device, that is, a device - that follows the zoned storage device behavior that is also supported by - industry standards such as the T10 Zoned Block Command standard (ZBC r05) or - the NVMe(TM) NVM Express Zoned Namespace Command Set Specification 1.1b - (ZNS). For brevity, these standard documents are referred as "ZBD standards" - from this point on in the text. - -\end{description} - -\subsubsection{Legacy Interface: Feature bits}\label{sec:Device Types / Block Device / Feature bits / Legacy Interface: Feature bits} - -\begin{description} -\item[VIRTIO_BLK_F_BARRIER (0)] Device supports request barriers. - -\item[VIRTIO_BLK_F_SCSI (7)] Device supports scsi packet commands. -\end{description} - -\begin{note} - In the legacy interface, VIRTIO_BLK_F_FLUSH was also - called VIRTIO_BLK_F_WCE. 
-\end{note} - -\subsection{Device configuration layout}\label{sec:Device Types / Block Device / Device configuration layout} - -The \field{capacity} of the device (expressed in 512-byte sectors) is always -present. The availability of the others all depend on various feature -bits as indicated above. - -The field \field{num_queues} only exists if VIRTIO_BLK_F_MQ is set. This field specifies -the number of queues. - -The parameters in the configuration space of the device \field{max_discard_sectors} -\field{discard_sector_alignment} are expressed in 512-byte units if the -VIRTIO_BLK_F_DISCARD feature bit is negotiated. The \field{max_write_zeroes_sectors} -is expressed in 512-byte units if the VIRTIO_BLK_F_WRITE_ZEROES feature -bit is negotiated. The parameters in the configuration space of the device -\field{max_secure_erase_sectors} \field{secure_erase_sector_alignment} are expressed -in 512-byte units if the VIRTIO_BLK_F_SECURE_ERASE feature bit is negotiated. - -If the VIRTIO_BLK_F_ZONED feature is negotiated, then in -\field{virtio_blk_zoned_characteristics}, -\begin{itemize} -\item \field{zone_sectors} value is expressed in 512-byte sectors. -\item \field{max_append_sectors} value is expressed in 512-byte sectors. -\item \field{write_granularity} value is expressed in bytes. -\end{itemize} - -The \field{model} field in \field{zoned} may have the following values: - -\begin{lstlisting} -#define VIRTIO_BLK_Z_NONE 0 -#define VIRTIO_BLK_Z_HM 1 -#define VIRTIO_BLK_Z_HA 2 -\end{lstlisting} - -Depending on their design, zoned block devices may follow several possible -models of operation. The three models that are standardized for ZBDs are -drive-managed, host-managed and host-aware. - -While being zoned internally, drive-managed ZBDs behave exactly like regular, -non-zoned block devices. For the purposes of virtio standardization, -drive-managed ZBDs can always be treated as non-zoned devices. 
These devices -have the VIRTIO_BLK_Z_NONE model value set in the \field{model} field in -\field{zoned}. - -Devices that offer the VIRTIO_BLK_F_ZONED feature while reporting the -VIRTIO_BLK_Z_NONE zoned model are drive-managed zoned block devices. In this -case, the driver treats the device as a regular non-zoned block device. - -Host-managed zoned block devices have their LBA range divided into Sequential -Write Required (SWR) zones that require some additional handling by the host -for correct operation. All write requests to SWR zones are required be -sequential and zones containing some written data need to be reset before that -data can be rewritten. Host-managed devices support a set of ZBD-specific I/O -requests that can be used by the host to manage device zones. Host-managed -devices report VIRTIO_BLK_Z_HM in the \field{model} field in \field{zoned}. - -Host-aware zoned block devices have their LBA range divided to Sequential -Write Preferred (SWP) zones that support random write access, similar to -regular non-zoned devices. However, the device I/O performance might not be -optimal if SWP zones are used in a random I/O pattern. SWP zones also support -the same set of ZBD-specific I/O requests as host-managed devices that allow -host-aware devices to be managed by any host that supports zoned block devices -to achieve its optimum performance. Host-aware devices report VIRTIO_BLK_Z_HA -in the \field{model} field in \field{zoned}. - -Both SWR zones and SWP zones are sometimes referred as sequential zones. - -During device operation, sequential zones can be in one of the following states: -empty, implicitly-open, explicitly-open, closed and full. The state machine that -governs the transitions between these states is described later in this document. - -SWR and SWP zones consume volatile device resources while being in certain -states and the device may set limits on the number of zones that can be in these -states simultaneously. 
- -Zoned block devices use two internal counters to account for the device -resources in use, the number of currently open zones and the number of currently -active zones. - -Any zone state transition from a state that doesn't consume a zone resource to a -state that consumes the same resource increments the internal device counter for -that resource. Any zone transition out of a state that consumes a zone resource -to a state that doesn't consume the same resource decrements the counter. Any -request that causes the device to exceed the reported zone resource limits is -terminated by the device with a "zone resources exceeded" error as defined for -specific commands later. - -\begin{lstlisting} -struct virtio_blk_config { - le64 capacity; - le32 size_max; - le32 seg_max; - struct virtio_blk_geometry { - le16 cylinders; - u8 heads; - u8 sectors; - } geometry; - le32 blk_size; - struct virtio_blk_topology { - // # of logical blocks per physical block (log2) - u8 physical_block_exp; - // offset of first aligned logical block - u8 alignment_offset; - // suggested minimum I/O size in blocks - le16 min_io_size; - // optimal (suggested maximum) I/O size in blocks - le32 opt_io_size; - } topology; - u8 writeback; - u8 unused0; - u16 num_queues; - le32 max_discard_sectors; - le32 max_discard_seg; - le32 discard_sector_alignment; - le32 max_write_zeroes_sectors; - le32 max_write_zeroes_seg; - u8 write_zeroes_may_unmap; - u8 unused1[3]; - le32 max_secure_erase_sectors; - le32 max_secure_erase_seg; - le32 secure_erase_sector_alignment; - struct virtio_blk_zoned_characteristics { - le32 zone_sectors; - le32 max_open_zones; - le32 max_active_zones; - le32 max_append_sectors; - le32 write_granularity; - u8 model; - u8 unused2[3]; - } zoned; -}; -\end{lstlisting} - - -\subsubsection{Legacy Interface: Device configuration layout}\label{sec:Device Types / Block Device / Device configuration layout / Legacy Interface: Device configuration layout} -When using the legacy interface, 
transitional devices and drivers -MUST format the fields in struct virtio_blk_config -according to the native endian of the guest rather than -(necessarily when not using the legacy interface) little-endian. - - -\subsection{Device Initialization}\label{sec:Device Types / Block Device / Device Initialization} - -\begin{enumerate} -\item The device size can be read from \field{capacity}. - -\item If the VIRTIO_BLK_F_BLK_SIZE feature is negotiated, - \field{blk_size} can be read to determine the optimal sector size - for the driver to use. This does not affect the units used in - the protocol (always 512 bytes), but awareness of the correct - value can affect performance. - -\item If the VIRTIO_BLK_F_RO feature is set by the device, any write - requests will fail. - -\item If the VIRTIO_BLK_F_TOPOLOGY feature is negotiated, the fields in the - \field{topology} struct can be read to determine the physical block size and optimal - I/O lengths for the driver to use. This also does not affect the units - in the protocol, only performance. - -\item If the VIRTIO_BLK_F_CONFIG_WCE feature is negotiated, the cache - mode can be read or set through the \field{writeback} field. 0 corresponds - to a writethrough cache, 1 to a writeback cache\footnote{Consistent with - \ref{devicenormative:Device Types / Block Device / Device Operation}, - a writethrough cache can be defined broadly as a cache that commits - writes to persistent device backend storage before reporting their - completion. For example, a battery-backed writeback cache actually - counts as writethrough according to this definition.}. The cache mode - after reset can be either writeback or writethrough. The actual - mode can be determined by reading \field{writeback} after feature - negotiation. 
- -\item If the VIRTIO_BLK_F_DISCARD feature is negotiated, - \field{max_discard_sectors} and \field{max_discard_seg} can be read - to determine the maximum discard sectors and maximum number of discard - segments for the block driver to use. \field{discard_sector_alignment} - can be used by OS when splitting a request based on alignment. - -\item If the VIRTIO_BLK_F_WRITE_ZEROES feature is negotiated, - \field{max_write_zeroes_sectors} and \field{max_write_zeroes_seg} can - be read to determine the maximum write zeroes sectors and maximum - number of write zeroes segments for the block driver to use. - -\item If the VIRTIO_BLK_F_MQ feature is negotiated, \field{num_queues} field - can be read to determine the number of queues. - -\item If the VIRTIO_BLK_F_SECURE_ERASE feature is negotiated, - \field{max_secure_erase_sectors} and \field{max_secure_erase_seg} can be read - to determine the maximum secure erase sectors and maximum number of - secure erase segments for the block driver to use. - \field{secure_erase_sector_alignment} can be used by OS when splitting a - request based on alignment. - -\item If the VIRTIO_BLK_F_ZONED feature is negotiated, the fields in - \field{zoned} can be read by the driver to determine the zone - characteristics of the device. All \field{zoned} fields are read-only. - -\end{enumerate} - -\drivernormative{\subsubsection}{Device Initialization}{Device Types / Block Device / Device Initialization} - -Drivers SHOULD NOT negotiate VIRTIO_BLK_F_FLUSH if they are incapable of -sending VIRTIO_BLK_T_FLUSH commands. - -If neither VIRTIO_BLK_F_CONFIG_WCE nor VIRTIO_BLK_F_FLUSH are -negotiated, the driver MAY deduce the presence of a writethrough cache. -If VIRTIO_BLK_F_CONFIG_WCE was not negotiated but VIRTIO_BLK_F_FLUSH was, -the driver SHOULD assume presence of a writeback cache. - -The driver MUST NOT read \field{writeback} before setting -the FEATURES_OK \field{device status} bit. 
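Illustration only, not part of the patch: the cache-mode rules in the driver requirements above can be sketched in C roughly as follows. The feature bit numbers (9 and 11) and the meaning of the \field{writeback} values come from the quoted text; the enum, the helper names and the `negotiated` parameter are hypothetical and chosen just for this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Feature bit numbers as listed in the "Feature bits" section. */
#define VIRTIO_BLK_F_FLUSH       9
#define VIRTIO_BLK_F_CONFIG_WCE 11

/* Hypothetical result type; values mirror the writeback field (0/1). */
enum blk_cache_mode {
    BLK_CACHE_WRITETHROUGH = 0,
    BLK_CACHE_WRITEBACK    = 1
};

/* Hypothetical helper: test one bit in the negotiated feature set. */
static bool has_feature(uint64_t negotiated, unsigned bit)
{
    return ((negotiated >> bit) & 1u) != 0;
}

/*
 * negotiated: feature bits accepted by both sides (assumed representation).
 * writeback:  value of the config field; only consulted when
 *             VIRTIO_BLK_F_CONFIG_WCE was negotiated, and the caller is
 *             assumed to have read it after setting FEATURES_OK.
 */
static enum blk_cache_mode blk_cache_mode(uint64_t negotiated, uint8_t writeback)
{
    if (has_feature(negotiated, VIRTIO_BLK_F_CONFIG_WCE))
        return writeback ? BLK_CACHE_WRITEBACK : BLK_CACHE_WRITETHROUGH;
    if (has_feature(negotiated, VIRTIO_BLK_F_FLUSH))
        return BLK_CACHE_WRITEBACK;      /* driver SHOULD assume writeback */
    return BLK_CACHE_WRITETHROUGH;       /* driver MAY deduce writethrough */
}
```

A driver following this sketch would consult the config field only in the first branch, which matches the rule that \field{writeback} is not read before FEATURES_OK is set.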
- -Drivers MUST NOT negotiate the VIRTIO_BLK_F_ZONED feature if they are incapable -of supporting devices with the VIRTIO_BLK_Z_HM, VIRTIO_BLK_Z_HA or -VIRTIO_BLK_Z_NONE zoned model. - -If the VIRTIO_BLK_F_ZONED feature is offered by the device with the -VIRTIO_BLK_Z_HM zone model, then the VIRTIO_BLK_F_DISCARD feature MUST NOT be -offered by the driver. - -If the VIRTIO_BLK_F_ZONED feature and VIRTIO_BLK_F_DISCARD feature are both -offered by the device with the VIRTIO_BLK_Z_HA or VIRTIO_BLK_Z_NONE zone model, -then the driver MAY negotiate these two bits independently. - -If the VIRTIO_BLK_F_ZONED feature is negotiated, then -\begin{itemize} -\item if the driver that can not support host-managed zoned devices - reads VIRTIO_BLK_Z_HM from the \field{model} field of \field{zoned}, the - driver MUST NOT set FEATURES_OK flag and instead set the FAILED bit. - -\item if the driver that can not support zoned devices reads VIRTIO_BLK_Z_HA - from the \field{model} field of \field{zoned}, the driver - MAY handle the device as a non-zoned device. In this case, the - driver SHOULD ignore all other fields in \field{zoned}. -\end{itemize} - -\devicenormative{\subsubsection}{Device Initialization}{Device Types / Block Device / Device Initialization} - -Devices SHOULD always offer VIRTIO_BLK_F_FLUSH, and MUST offer it -if they offer VIRTIO_BLK_F_CONFIG_WCE. - -If VIRTIO_BLK_F_CONFIG_WCE is negotiated but VIRTIO_BLK_F_FLUSH -is not, the device MUST initialize \field{writeback} to 0. - -The device MUST initialize padding bytes \field{unused0} and -\field{unused1} to 0. - -If the device that is being initialized is a not a zoned device, the device -SHOULD NOT offer the VIRTIO_BLK_F_ZONED feature. - -The VIRTIO_BLK_F_ZONED feature cannot be properly negotiated without -FEATURES_OK bit. Legacy devices MUST NOT offer VIRTIO_BLK_F_ZONED feature bit. 
- -If the VIRTIO_BLK_F_ZONED feature is not accepted by the driver, -\begin{itemize} -\item the device with the VIRTIO_BLK_Z_HA or VIRTIO_BLK_Z_NONE zone model SHOULD - proceed with the initialization while setting all zoned characteristics - fields to zero. - -\item the device with the VIRTIO_BLK_Z_HM zone model MUST fail to set the - FEATURES_OK device status bit when the driver writes the Device Status - field. -\end{itemize} - -If the VIRTIO_BLK_F_ZONED feature is negotiated, then the \field{model} field in -\field{zoned} struct in the configuration space MUST be set by the device -\begin{itemize} -\item to the value of VIRTIO_BLK_Z_NONE if it operates as a drive-managed - zoned block device or a non-zoned block device. - -\item to the value of VIRTIO_BLK_Z_HM if it operates as a host-managed zoned - block device. - -\item to the value of VIRTIO_BLK_Z_HA if it operates as a host-aware zoned - block device. -\end{itemize} - -If the VIRTIO_BLK_F_ZONED feature is negotiated and the device \field{model} -field in \field{zoned} struct is VIRTIO_BLK_Z_HM or VIRTIO_BLK_Z_HA, - -\begin{itemize} -\item the \field{zone_sectors} field of \field{zoned} MUST be set by the device - to the size of a single zone on the device. All zones of the device have the - same size indicated by \field{zone_sectors} except for the last zone that - MAY be smaller than all other zones. The driver can calculate the number of - zones on the device as - \begin{lstlisting} - nr_zones = (capacity + zone_sectors - 1) / zone_sectors; - \end{lstlisting} - and the size of the last zone as - \begin{lstlisting} - zs_last = capacity - (nr_zones - 1) * zone_sectors; - \end{lstlisting} - -\item The \field{max_open_zones} field of the \field{zoned} structure MUST be - set by the device to the maximum number of zones that can be open on the - device (zones in the implicit open or explicit open state). A value - of zero indicates that the device does not have any limit on the number of - open zones. 
- -\item The \field{max_active_zones} field of the \field{zoned} structure MUST - be set by the device to the maximum number zones that can be active on the - device (zones in the implicit open, explicit open or closed state). A value - of zero indicates that the device does not have any limit on the number of - active zones. - -\item the \field{max_append_sectors} field of \field{zoned} MUST be set by - the device to the maximum data size of a VIRTIO_BLK_T_ZONE_APPEND request - that can be successfully issued to the device. The value of this field MUST - NOT exceed the \field{seg_max} * \field{size_max} value. A device MAY set - the \field{max_append_sectors} to zero if it doesn't support - VIRTIO_BLK_T_ZONE_APPEND requests. - -\item the \field{write_granularity} field of \field{zoned} MUST be set by the - device to the offset and size alignment constraint for VIRTIO_BLK_T_OUT - and VIRTIO_BLK_T_ZONE_APPEND requests issued to a sequential zone of the - device. - -\item the device MUST initialize padding bytes \field{unused2} to 0. -\end{itemize} - -\subsubsection{Legacy Interface: Device Initialization}\label{sec:Device Types / Block Device / Device Initialization / Legacy Interface: Device Initialization} - -Because legacy devices do not have FEATURES_OK, transitional devices -MUST implement slightly different behavior around feature negotiation -when used through the legacy interface. In particular, when using the -legacy interface: - -\begin{itemize} -\item the driver MAY read or write \field{writeback} before setting - the DRIVER or DRIVER_OK \field{device status} bit - -\item the device MUST NOT modify the cache mode (and \field{writeback}) - as a result of a driver setting a status bit, unless - the DRIVER_OK bit is being set and the driver has not set the - VIRTIO_BLK_F_CONFIG_WCE driver feature bit. 
- -\item the device MUST NOT modify the cache mode (and \field{writeback}) - as a result of a driver modifying the driver feature bits, for example - if the driver sets the VIRTIO_BLK_F_CONFIG_WCE driver feature bit but - does not set the VIRTIO_BLK_F_FLUSH bit. -\end{itemize} - - -\subsection{Device Operation}\label{sec:Device Types / Block Device / Device Operation} - -The driver queues requests to the virtqueues, and they are used by -the device (not necessarily in order). Each request except -VIRTIO_BLK_T_ZONE_APPEND is of form: - -\begin{lstlisting} -struct virtio_blk_req { - le32 type; - le32 reserved; - le64 sector; - u8 data[]; - u8 status; -}; -\end{lstlisting} - -The type of the request is either a read (VIRTIO_BLK_T_IN), a write -(VIRTIO_BLK_T_OUT), a discard (VIRTIO_BLK_T_DISCARD), a write zeroes -(VIRTIO_BLK_T_WRITE_ZEROES), a flush (VIRTIO_BLK_T_FLUSH), a get device ID -string command (VIRTIO_BLK_T_GET_ID), a secure erase -(VIRTIO_BLK_T_SECURE_ERASE), or a get device lifetime command -(VIRTIO_BLK_T_GET_LIFETIME). - -\begin{lstlisting} -#define VIRTIO_BLK_T_IN 0 -#define VIRTIO_BLK_T_OUT 1 -#define VIRTIO_BLK_T_FLUSH 4 -#define VIRTIO_BLK_T_GET_ID 8 -#define VIRTIO_BLK_T_GET_LIFETIME 10 -#define VIRTIO_BLK_T_DISCARD 11 -#define VIRTIO_BLK_T_WRITE_ZEROES 13 -#define VIRTIO_BLK_T_SECURE_ERASE 14 -\end{lstlisting} - -The \field{sector} number indicates the offset (multiplied by 512) where -the read or write is to occur. This field is unused and set to 0 for -commands other than read, write and some zone operations. - -VIRTIO_BLK_T_IN requests populate \field{data} with the contents of sectors -read from the block device (in multiples of 512 bytes). VIRTIO_BLK_T_OUT -requests write the contents of \field{data} to the block device (in multiples -of 512 bytes). - -The \field{data} used for discard, secure erase or write zeroes commands -consists of one or more segments. 
The maximum number of segments is -\field{max_discard_seg} for discard commands, \field{max_secure_erase_seg} for -secure erase commands and \field{max_write_zeroes_seg} for write zeroes -commands. -Each segment is of form: - -\begin{lstlisting} -struct virtio_blk_discard_write_zeroes { - le64 sector; - le32 num_sectors; - struct { - le32 unmap:1; - le32 reserved:31; - } flags; -}; -\end{lstlisting} - -\field{sector} indicates the starting offset (in 512-byte units) of the -segment, while \field{num_sectors} indicates the number of sectors in each -discarded range. \field{unmap} is only used in write zeroes commands and allows -the device to discard the specified range, provided that following reads return -zeroes. - -VIRTIO_BLK_T_GET_ID requests fetch the device ID string from the device into -\field{data}. The device ID string is a NUL-padded ASCII string up to 20 bytes -long. If the string is 20 bytes long then there is no NUL terminator. - -The \field{data} used for VIRTIO_BLK_T_GET_LIFETIME requests is populated -by the device, and is of the form - -\begin{lstlisting} -struct virtio_blk_lifetime { - le16 pre_eol_info; - le16 device_lifetime_est_typ_a; - le16 device_lifetime_est_typ_b; -}; -\end{lstlisting} - -The \field{pre_eol_info} specifies the percentage of reserved blocks -that are consumed and will have one of these values: - -\begin{lstlisting} -/* Value not available */ -#define VIRTIO_BLK_PRE_EOL_INFO_UNDEFINED 0 -/* < 80% of reserved blocks are consumed */ -#define VIRTIO_BLK_PRE_EOL_INFO_NORMAL 1 -/* 80% of reserved blocks are consumed */ -#define VIRTIO_BLK_PRE_EOL_INFO_WARNING 2 -/* 90% of reserved blocks are consumed */ -#define VIRTIO_BLK_PRE_EOL_INFO_URGENT 3 -/* All others values are reserved */ -\end{lstlisting} - -The \field{device_lifetime_est_typ_a} refers to wear of SLC cells and is provided -in increments of 10%, with 0 meaning undefined, 1 meaning up-to 10% of lifetime -used, and so on, thru to 11 meaning estimated lifetime exceeded. 
-All values above 11 are reserved. - -The \field{device_lifetime_est_typ_b} refers to wear of MLC cells and is provided -with the same semantics as \field{device_lifetime_est_typ_a}. - -The final \field{status} byte is written by the device: either -VIRTIO_BLK_S_OK for success, VIRTIO_BLK_S_IOERR for device or driver -error or VIRTIO_BLK_S_UNSUPP for a request unsupported by device: - -\begin{lstlisting} -#define VIRTIO_BLK_S_OK 0 -#define VIRTIO_BLK_S_IOERR 1 -#define VIRTIO_BLK_S_UNSUPP 2 -\end{lstlisting} - -The status of individual segments is indeterminate when a discard or write zero -command produces VIRTIO_BLK_S_IOERR. A segment may have completed -successfully, failed, or not been processed by the device. - -The following requirements only apply if the VIRTIO_BLK_F_ZONED feature is -negotiated. - -In addition to the request types defined for non-zoned devices, the type of the -request can be a zone report (VIRTIO_BLK_T_ZONE_REPORT), an explicit zone open -(VIRTIO_BLK_T_ZONE_OPEN), a zone close (VIRTIO_BLK_T_ZONE_CLOSE), a zone finish -(VIRTIO_BLK_T_ZONE_FINISH), a zone_append (VIRTIO_BLK_T_ZONE_APPEND), a zone -reset (VIRTIO_BLK_T_ZONE_RESET) or a zone reset all -(VIRTIO_BLK_T_ZONE_RESET_ALL). 
- -\begin{lstlisting} -#define VIRTIO_BLK_T_ZONE_APPEND 15 -#define VIRTIO_BLK_T_ZONE_REPORT 16 -#define VIRTIO_BLK_T_ZONE_OPEN 18 -#define VIRTIO_BLK_T_ZONE_CLOSE 20 -#define VIRTIO_BLK_T_ZONE_FINISH 22 -#define VIRTIO_BLK_T_ZONE_RESET 24 -#define VIRTIO_BLK_T_ZONE_RESET_ALL 26 -\end{lstlisting} - -Requests of type VIRTIO_BLK_T_OUT, VIRTIO_BLK_T_ZONE_OPEN, -VIRTIO_BLK_T_ZONE_CLOSE, VIRTIO_BLK_T_ZONE_FINISH, VIRTIO_BLK_T_ZONE_APPEND, -VIRTIO_BLK_T_ZONE_RESET or VIRTIO_BLK_T_ZONE_RESET_ALL may be completed by the -device with VIRTIO_BLK_S_OK, VIRTIO_BLK_S_IOERR or VIRTIO_BLK_S_UNSUPP -\field{status}, or, additionally, with VIRTIO_BLK_S_ZONE_INVALID_CMD, -VIRTIO_BLK_S_ZONE_UNALIGNED_WP, VIRTIO_BLK_S_ZONE_OPEN_RESOURCE or -VIRTIO_BLK_S_ZONE_ACTIVE_RESOURCE ZBD-specific status codes. - -Besides the request status, VIRTIO_BLK_T_ZONE_APPEND requests return the -starting sector of the appended data back to the driver. For this reason, -the VIRTIO_BLK_T_ZONE_APPEND request has the layout that is extended to have -the \field{append_sector} field to carry this value: - -\begin{lstlisting} -struct virtio_blk_req_za { - le32 type; - le32 reserved; - le64 sector; - u8 data[]; - le64 append_sector; - u8 status; -}; -\end{lstlisting} - -\begin{lstlisting} -#define VIRTIO_BLK_S_ZONE_INVALID_CMD 3 -#define VIRTIO_BLK_S_ZONE_UNALIGNED_WP 4 -#define VIRTIO_BLK_S_ZONE_OPEN_RESOURCE 5 -#define VIRTIO_BLK_S_ZONE_ACTIVE_RESOURCE 6 -\end{lstlisting} - -Requests of the type VIRTIO_BLK_T_ZONE_REPORT are reads and requests of the type -VIRTIO_BLK_T_ZONE_APPEND are writes. VIRTIO_BLK_T_ZONE_OPEN, -VIRTIO_BLK_T_ZONE_CLOSE, VIRTIO_BLK_T_ZONE_FINISH, VIRTIO_BLK_T_ZONE_RESET and -VIRTIO_BLK_T_ZONE_RESET_ALL are non-data requests. - -Zone sector address is a 64-bit address of the first 512-byte sector of the -zone. 
- -VIRTIO_BLK_T_ZONE_OPEN, VIRTIO_BLK_T_ZONE_CLOSE, VIRTIO_BLK_T_ZONE_FINISH and -VIRTIO_BLK_T_ZONE_RESET requests make the zone operation to act on a particular -zone specified by the zone sector address in the \field{sector} of the request. - -VIRTIO_BLK_T_ZONE_RESET_ALL request acts upon all applicable zones of the -device. The \field{sector} value is not used for this request. - -In ZBD standards, the VIRTIO_BLK_T_ZONE_REPORT request belongs to "Zone -Management Receive" command category and VIRTIO_BLK_T_ZONE_OPEN, -VIRTIO_BLK_T_ZONE_CLOSE, VIRTIO_BLK_T_ZONE_FINISH and -VIRTIO_BLK_T_ZONE_RESET/VIRTIO_BLK_T_ZONE_RESET_ALL requests are categorized as -"Zone Management Send" commands. VIRTIO_BLK_T_ZONE_APPEND is categorized -separately from zone management commands and is the only request that uses -the \field{append_secctor} field \field{virtio_blk_req_za} to return -to the driver the sector at which the data has been appended to the zone. - -VIRTIO_BLK_T_ZONE_REPORT is a read request that returns the information about -the current state of zones on the device starting from the zone containing the -\field{sector} of the request. The report consists of a header followed by zero -or more zone descriptors. - -A zone report reply has the following structure: - -\begin{lstlisting} -struct virtio_blk_zone_report { - le64 nr_zones; - u8 reserved[56]; - struct virtio_blk_zone_descriptor zones[]; -}; -\end{lstlisting} - -The device sets the \field{nr_zones} field in the report header to the number of -fully transferred zone descriptors in the data buffer. - -A zone descriptor has the following structure: - -\begin{lstlisting} -struct virtio_blk_zone_descriptor { - le64 z_cap; - le64 z_start; - le64 z_wp; - u8 z_type; - u8 z_state; - u8 reserved[38]; -}; -\end{lstlisting} - -The zone descriptor field \field{z_type} \field{virtio_blk_zone_descriptor} -indicates the type of the zone. 
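Illustration only, not part of the patch: a driver could decode a zone report reply along the following lines. The sketch assumes the two structures above are laid out without padding, which gives a 64-byte header (le64 plus 56 reserved bytes) and 64-byte descriptors (three le64 fields, two u8 fields and 38 reserved bytes). The `zone_info` type and the function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define ZONE_REPORT_HDR_SIZE 64u  /* le64 nr_zones + u8 reserved[56] */
#define ZONE_DESC_SIZE       64u  /* 3 * le64 + 2 * u8 + u8 reserved[38] */

/* Hypothetical host-endian copy of one zone descriptor. */
struct zone_info {
    uint64_t cap;    /* z_cap   */
    uint64_t start;  /* z_start */
    uint64_t wp;     /* z_wp    */
    uint8_t  type;   /* z_type  */
    uint8_t  state;  /* z_state */
};

/* Decode a little-endian 64-bit value from an unaligned byte buffer. */
static uint64_t get_le64(const uint8_t *p)
{
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--)
        v = (v << 8) | p[i];
    return v;
}

/*
 * Decode up to max_out descriptors from a report buffer of len bytes.
 * Returns the number of descriptors written to out[]. The count is
 * limited by nr_zones from the header, by the buffer length and by
 * max_out, whichever is smallest.
 */
static size_t parse_zone_report(const uint8_t *buf, size_t len,
                                struct zone_info *out, size_t max_out)
{
    if (len < ZONE_REPORT_HDR_SIZE)
        return 0;

    uint64_t nr = get_le64(buf);
    size_t fit = (len - ZONE_REPORT_HDR_SIZE) / ZONE_DESC_SIZE;
    if (nr > fit)
        nr = fit;
    if (nr > max_out)
        nr = max_out;

    for (size_t i = 0; i < nr; i++) {
        const uint8_t *d = buf + ZONE_REPORT_HDR_SIZE + i * ZONE_DESC_SIZE;
        out[i].cap   = get_le64(d);
        out[i].start = get_le64(d + 8);
        out[i].wp    = get_le64(d + 16);
        out[i].type  = d[24];
        out[i].state = d[25];
    }
    return (size_t)nr;
}
```

Decoding byte by byte avoids relying on host endianness or on the alignment of the data buffer.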
- -The following zone types are available: - -\begin{lstlisting} -#define VIRTIO_BLK_ZT_CONV 1 -#define VIRTIO_BLK_ZT_SWR 2 -#define VIRTIO_BLK_ZT_SWP 3 -\end{lstlisting} - -Read and write operations into zones with the VIRTIO_BLK_ZT_CONV (Conventional) -type have the same behavior as read and write operations on a regular block -device. Any block in a conventional zone can be read or written at any time and -in any order. - -Zones with VIRTIO_BLK_ZT_SWR can be read randomly, but must be written -sequentially at a certain point in the zone called the Write Pointer (WP). With -every write, the Write Pointer is incremented by the number of sectors written. - -Zones with VIRTIO_BLK_ZT_SWP can be read randomly and should be written -sequentially, similarly to SWR zones. However, SWP zones can accept random write -operations, that is, VIRTIO_BLK_T_OUT requests with a start sector different -from the zone write pointer position. - -The field \field{z_state} of \field{virtio_blk_zone_descriptor} indicates the -state of the device zone. - -The following zone states are available: - -\begin{lstlisting} -#define VIRTIO_BLK_ZS_NOT_WP 0 -#define VIRTIO_BLK_ZS_EMPTY 1 -#define VIRTIO_BLK_ZS_IOPEN 2 -#define VIRTIO_BLK_ZS_EOPEN 3 -#define VIRTIO_BLK_ZS_CLOSED 4 -#define VIRTIO_BLK_ZS_RDONLY 13 -#define VIRTIO_BLK_ZS_FULL 14 -#define VIRTIO_BLK_ZS_OFFLINE 15 -\end{lstlisting} - -Zones of the type VIRTIO_BLK_ZT_CONV are always reported by the device to be in -the VIRTIO_BLK_ZS_NOT_WP state. Zones of the types VIRTIO_BLK_ZT_SWR and -VIRTIO_BLK_ZT_SWP can not transition to the VIRTIO_BLK_ZS_NOT_WP state. - -Zones in VIRTIO_BLK_ZS_EMPTY (Empty), VIRTIO_BLK_ZS_IOPEN (Implicitly Open), -VIRTIO_BLK_ZS_EOPEN (Explicitly Open) and VIRTIO_BLK_ZS_CLOSED (Closed) state -are writable, but zones in VIRTIO_BLK_ZS_RDONLY (Read-Only), VIRTIO_BLK_ZS_FULL -(Full) and VIRTIO_BLK_ZS_OFFLINE (Offline) state are not. 
The write pointer -value (\field{z_wp}) is not valid for Read-Only, Full and Offline zones. - -The zone descriptor field \field{z_cap} contains the maximum number of 512-byte -sectors that are available to be written with user data when the zone is in the -Empty state. This value shall be less than or equal to the \field{zone_sectors} -value in \field{virtio_blk_zoned_characteristics} structure in the device -configuration space. - -The zone descriptor field \field{z_start} contains the zone sector address. - -The zone descriptor field \field{z_wp} contains the sector address where the -next write operation for this zone should be issued. This value is undefined -for conventional zones and for zones in VIRTIO_BLK_ZS_RDONLY, -VIRTIO_BLK_ZS_FULL and VIRTIO_BLK_ZS_OFFLINE state. - -Depending on their state, zones consume resources as follows: -\begin{itemize} -\item a zone in VIRTIO_BLK_ZS_IOPEN and VIRTIO_BLK_ZS_EOPEN state consumes one - open zone resource and, additionally, - -\item a zone in VIRTIO_BLK_ZS_IOPEN, VIRTIO_BLK_ZS_EOPEN and - VIRTIO_BLK_ZS_CLOSED state consumes one active resource. -\end{itemize} - -Attempts for zone transitions that violate zone resource limits must fail with -VIRTIO_BLK_S_ZONE_OPEN_RESOURCE or VIRTIO_BLK_S_ZONE_ACTIVE_RESOURCE -\field{status}. - -Zones in the VIRTIO_BLK_ZS_EMPTY (Empty) state have the write pointer value -equal to the sector address of the zone. In this state, the entire capacity of -the zone is available for writing. A zone can transition from this state to -\begin{itemize} -\item VIRTIO_BLK_ZS_IOPEN when a successful VIRTIO_BLK_T_OUT request or - VIRTIO_BLK_T_ZONE_APPEND with a non-zero data size is received for the zone. - -\item VIRTIO_BLK_ZS_EOPEN when a successful VIRTIO_BLK_T_ZONE_OPEN request is - received for the zone -\end{itemize} - -When a VIRTIO_BLK_T_ZONE_RESET request is issued to an Empty zone, the request -is completed successfully and the zone stays in the VIRTIO_BLK_ZS_EMPTY state. 
-
-Zones in the VIRTIO_BLK_ZS_IOPEN (Implicitly Open) state transition from
-this state to
-\begin{itemize}
-\item VIRTIO_BLK_ZS_EMPTY when a successful VIRTIO_BLK_T_ZONE_RESET request is
-  received for the zone,
-
-\item VIRTIO_BLK_ZS_EMPTY when a successful VIRTIO_BLK_T_ZONE_RESET_ALL request
-  is received by the device,
-
-\item VIRTIO_BLK_ZS_EOPEN when a successful VIRTIO_BLK_T_ZONE_OPEN request is
-  received for the zone,
-
-\item VIRTIO_BLK_ZS_CLOSED when a successful VIRTIO_BLK_T_ZONE_CLOSE request is
-  received for the zone,
-
-\item VIRTIO_BLK_ZS_CLOSED implicitly by the device when another zone is
-  entering the VIRTIO_BLK_ZS_IOPEN or VIRTIO_BLK_ZS_EOPEN state and the number
-  of currently open zones is at \field{max_open_zones} limit,
-
-\item VIRTIO_BLK_ZS_FULL when a successful VIRTIO_BLK_T_ZONE_FINISH request is
-  received for the zone.
-
-\item VIRTIO_BLK_ZS_FULL when a successful VIRTIO_BLK_T_OUT or
-  VIRTIO_BLK_T_ZONE_APPEND request that causes the zone to reach its writable
-  capacity is received for the zone.
-\end{itemize}
-
-Zones in the VIRTIO_BLK_ZS_EOPEN (Explicitly Open) state transition from
-this state to
-\begin{itemize}
-\item VIRTIO_BLK_ZS_EMPTY when a successful VIRTIO_BLK_T_ZONE_RESET request is
-  received for the zone,
-
-\item VIRTIO_BLK_ZS_EMPTY when a successful VIRTIO_BLK_T_ZONE_RESET_ALL request
-  is received by the device,
-
-\item VIRTIO_BLK_ZS_EMPTY when a successful VIRTIO_BLK_T_ZONE_CLOSE request is
-  received for the zone and the write pointer of the zone has the value equal
-  to the start sector of the zone,
-
-\item VIRTIO_BLK_ZS_CLOSED when a successful VIRTIO_BLK_T_ZONE_CLOSE request is
-  received for the zone and the zone write pointer is larger then the start
-  sector of the zone,
-
-\item VIRTIO_BLK_ZS_FULL when a successful VIRTIO_BLK_T_ZONE_FINISH request is
-  received for the zone,
-
-\item VIRTIO_BLK_ZS_FULL when a successful VIRTIO_BLK_T_OUT or
-  VIRTIO_BLK_T_ZONE_APPEND request that causes the zone to reach its writable
-  capacity is received for the zone.
-\end{itemize}
-
-When a VIRTIO_BLK_T_ZONE_EOPEN request is issued to an Explicitly Open zone, the
-request is completed successfully and the zone stays in the VIRTIO_BLK_ZS_EOPEN
-state.
-
-Zones in the VIRTIO_BLK_ZS_CLOSED (Closed) state transition from this state
-to
-\begin{itemize}
-\item VIRTIO_BLK_ZS_EMPTY when a successful VIRTIO_BLK_T_ZONE_RESET request is
-  received for the zone,
-
-\item VIRTIO_BLK_ZS_EMPTY when a successful VIRTIO_BLK_T_ZONE_RESET_ALL request
-  is received by the device,
-
-\item VIRTIO_BLK_ZS_IOPEN when a successful VIRTIO_BLK_T_OUT request or
-  VIRTIO_BLK_T_ZONE_APPEND with a non-zero data size is received for the zone.
-
-\item VIRTIO_BLK_ZS_EOPEN when a successful VIRTIO_BLK_T_ZONE_OPEN request is
-  received for the zone,
-\end{itemize}
-
-When a VIRTIO_BLK_T_ZONE_CLOSE request is issued to a Closed zone, the request
-is completed successfully and the zone stays in the VIRTIO_BLK_ZS_CLOSED state.
-
-Zones in the VIRTIO_BLK_ZS_FULL (Full) state transition from this state to
-VIRTIO_BLK_ZS_EMPTY when a successful VIRTIO_BLK_T_ZONE_RESET request is
-received for the zone or a successful VIRTIO_BLK_T_ZONE_RESET_ALL request is
-received by the device.
-
-When a VIRTIO_BLK_T_ZONE_FINISH request is issued to a Full zone, the request
-is completed successfully and the zone stays in the VIRTIO_BLK_ZS_FULL state.
-
-The device may automatically transition zones to VIRTIO_BLK_ZS_RDONLY
-(Read-Only) or VIRTIO_BLK_ZS_OFFLINE (Offline) state from any other state. The
-device may also automatically transition zones in the Read-Only state to the
-Offline state. Zones in the Offline state may not transition to any other state.
-Such automatic transitions usually indicate hardware failures. The previously
-written data may only be read from zones in the Read-Only state. Zones in the
-Offline state can not be read or written.
-
-VIRTIO_BLK_S_ZONE_UNALIGNED_WP is set by the device when the request received
-from the driver attempts to perform a write to an SWR zone and at least one of
-the following conditions is met:
-
-\begin{itemize}
-\item the starting sector of the request is not equal to the current value of
-  the zone write pointer.
-
-\item the ending sector of the request data multiplied by 512 is not a multiple
-  of the value reported by the device in the field \field{write_granularity}
-  in the device configuration space.
-\end{itemize}
-
-VIRTIO_BLK_S_ZONE_OPEN_RESOURCE is set by the device when a zone operation or
-write request received from the driver can not be handled without exceeding the
-\field{max_open_zones} limit value reported by the device in the configuration
-space.
-
-VIRTIO_BLK_S_ZONE_ACTIVE_RESOURCE is set by the device when a zone operation or
-write request received from the driver can not be handled without exceeding the
-\field{max_active_zones} limit value reported by the device in the configuration
-space.
-
-A zone transition request that leads to both the \field{max_open_zones} and the
-\field{max_active_zones} limits to be exceeded is terminated by the device with
-VIRTIO_BLK_S_ZONE_ACTIVE_RESOURCE \field{status} value.
-
-The device reports all other error conditions related to zoned block model
-operation by setting the VIRTIO_BLK_S_ZONE_INVALID_CMD value in
-\field{status} of \field{virtio_blk_req} structure.
-
-\drivernormative{\subsubsection}{Device Operation}{Device Types / Block Device / Device Operation}
-
-The driver SHOULD check if the content of the \field{capacity} field has
-changed upon receiving a configuration change notification.
-
-A driver MUST NOT submit a request which would cause a read or write
-beyond \field{capacity}.
-
-A driver SHOULD accept the VIRTIO_BLK_F_RO feature if offered.
-
-A driver MUST set \field{sector} to 0 for a VIRTIO_BLK_T_FLUSH request.
-A driver SHOULD NOT include any data in a VIRTIO_BLK_T_FLUSH request.
-
-The length of \field{data} MUST be a multiple of 512 bytes for VIRTIO_BLK_T_IN
-and VIRTIO_BLK_T_OUT requests.
-
-The length of \field{data} MUST be a multiple of the size of struct
-virtio_blk_discard_write_zeroes for VIRTIO_BLK_T_DISCARD,
-VIRTIO_BLK_T_SECURE_ERASE and VIRTIO_BLK_T_WRITE_ZEROES requests.
-
-The length of \field{data} MUST be 20 bytes for VIRTIO_BLK_T_GET_ID requests.
-
-VIRTIO_BLK_T_DISCARD requests MUST NOT contain more than
-\field{max_discard_seg} struct virtio_blk_discard_write_zeroes segments in
-\field{data}.
-
-VIRTIO_BLK_T_SECURE_ERASE requests MUST NOT contain more than
-\field{max_secure_erase_seg} struct virtio_blk_discard_write_zeroes segments in
-\field{data}.
-
-VIRTIO_BLK_T_WRITE_ZEROES requests MUST NOT contain more than
-\field{max_write_zeroes_seg} struct virtio_blk_discard_write_zeroes segments in
-\field{data}.
-
-If the VIRTIO_BLK_F_CONFIG_WCE feature is negotiated, the driver MAY
-switch to writethrough or writeback mode by writing respectively 0 and
-1 to the \field{writeback} field. After writing a 0 to \field{writeback},
-the driver MUST NOT assume that any volatile writes have been committed
-to persistent device backend storage.
-
-The \field{unmap} bit MUST be zero for discard commands. The driver
-MUST NOT assume anything about the data returned by read requests after
-a range of sectors has been discarded.
-
-A driver MUST NOT assume that individual segments in a multi-segment
-VIRTIO_BLK_T_DISCARD or VIRTIO_BLK_T_WRITE_ZEROES request completed
-successfully, failed, or were processed by the device at all if the request
-failed with VIRTIO_BLK_S_IOERR.
-
-The following requirements only apply if the VIRTIO_BLK_F_ZONED feature is
-negotiated.
-
-A zone sector address provided by the driver MUST be a multiple of 512 bytes.
-
-When forming a VIRTIO_BLK_T_ZONE_REPORT request, the driver MUST set a sector
-within the sector range of the starting zone to report to \field{sector} field.
-It MAY be a sector that is different from the zone sector address.
-
-In VIRTIO_BLK_T_ZONE_OPEN, VIRTIO_BLK_T_ZONE_CLOSE, VIRTIO_BLK_T_ZONE_FINISH and
-VIRTIO_BLK_T_ZONE_RESET requests, the driver MUST set \field{sector} field to
-point at the first sector in the target zone.
-
-In VIRTIO_BLK_T_ZONE_RESET_ALL request, the driver MUST set the field
-\field{sector} to zero value.
-
-The \field{sector} field of the VIRTIO_BLK_T_ZONE_APPEND request MUST specify
-the zone sector address of the zone to which data is to be appended at the
-position of the write pointer. The size of the data that is appended MUST be a
-multiple of \field{write_granularity} bytes and MUST NOT exceed the
-\field{max_append_sectors} value provided by the device in
-\field{virtio_blk_zoned_characteristics} configuration space structure.
-
-Upon a successful completion of a VIRTIO_BLK_T_ZONE_APPEND request, the driver
-MAY read the starting sector location of the written data from the request
-field \field{append_sector}.
-
-All VIRTIO_BLK_T_OUT requests issued by the driver to sequential zones and
-VIRTIO_BLK_T_ZONE_APPEND requests MUST have:
-
-\begin{enumerate}
-\item the data size that is a multiple of the number of bytes reported
-  by the device in the field \field{write_granularity} in the
-  \field{virtio_blk_zoned_characteristics} configuration space structure.
-
-\item the value of the field \field{sector} that is a multiple of the number of
-  bytes reported by the device in the field \field{write_granularity} in the
-  \field{virtio_blk_zoned_characteristics} configuration space structure.
-
-\item the data size that will not exceed the writable zone capacity when its
-  value is added to the current value of the write pointer of the zone.
-
-\end{enumerate}
-
-\devicenormative{\subsubsection}{Device Operation}{Device Types / Block Device / Device Operation}
-
-The device MAY change the content of the \field{capacity} field during
-operation of the device. When this happens, the device SHOULD trigger a
-configuration change notification.
-
-A device MUST set the \field{status} byte to VIRTIO_BLK_S_IOERR
-for a write request if the VIRTIO_BLK_F_RO feature if offered, and MUST NOT
-write any data.
-
-The device MUST set the \field{status} byte to VIRTIO_BLK_S_UNSUPP for
-discard, secure erase and write zeroes commands if any unknown flag is set.
-Furthermore, the device MUST set the \field{status} byte to
-VIRTIO_BLK_S_UNSUPP for discard commands if the \field{unmap} flag is set.
-
-For discard commands, the device MAY deallocate the specified range of
-sectors in the device backend storage.
-
-For write zeroes commands, if the \field{unmap} is set, the device MAY
-deallocate the specified range of sectors in the device backend storage,
-as if the discard command had been sent. After a write zeroes command
-is completed, reads of the specified ranges of sectors MUST return
-zeroes. This is true independent of whether \field{unmap} was set or clear.
-
-The device SHOULD clear the \field{write_zeroes_may_unmap} field of the
-virtio configuration space if and only if a write zeroes request cannot
-result in deallocating one or more sectors. The device MAY change the
-content of the field during operation of the device; when this happens,
-the device SHOULD trigger a configuration change notification.
-
-A write is considered volatile when it is submitted; the contents of
-sectors covered by a volatile write are undefined in persistent device
-backend storage until the write becomes stable. A write becomes stable
-once it is completed and one or more of the following conditions is true:
-
-\begin{enumerate}
-\item\label{item:flush1} neither VIRTIO_BLK_F_CONFIG_WCE nor
-  VIRTIO_BLK_F_FLUSH feature were negotiated, but VIRTIO_BLK_F_FLUSH was
-  offered by the device;
-
-\item\label{item:flush2} the VIRTIO_BLK_F_CONFIG_WCE feature was negotiated and the
-  \field{writeback} field in configuration space was 0 \textbf{all the time between
-  the submission of the write and its completion};
-
-\item\label{item:flush3} a VIRTIO_BLK_T_FLUSH request is sent \textbf{after the write is
-  completed} and is completed itself.
-\end{enumerate}
-
-If the device is backed by persistent storage, the device MUST ensure that
-stable writes are committed to it, before reporting completion of the write
-(cases~\ref{item:flush1} and~\ref{item:flush2}) or the flush
-(case~\ref{item:flush3}). Failure to do so can cause data loss
-in case of a crash.
-
-If the driver changes \field{writeback} between the submission of the write
-and its completion, the write could be either volatile or stable when
-its completion is reported; in other words, the exact behavior is undefined.
-
-% According to the device requirements for device initialization:
-% Offer(CONFIG_WCE) => Offer(FLUSH).
-%
-% After reversing the implication:
-% not Offer(FLUSH) => not Offer(CONFIG_WCE).
-
-If VIRTIO_BLK_F_FLUSH was not offered by the
-  device\footnote{Note that in this case, according to
-  \ref{devicenormative:Device Types / Block Device / Device Initialization},
-  the device will not have offered VIRTIO_BLK_F_CONFIG_WCE either.}, the
-device MAY also commit writes to persistent device backend storage before
-reporting their completion. Unlike case~\ref{item:flush1}, however, this
-is not an absolute requirement of the specification.
-
-\begin{note}
-  An implementation that does not offer VIRTIO_BLK_F_FLUSH and does not commit
-  completed writes will not be resilient to data loss in case of crashes.
-  Not offering VIRTIO_BLK_F_FLUSH is an absolute requirement
-  for implementations that do not wish to be safe against such data losses.
-\end{note}
-
-If the device is backed by storage providing lifetime metrics (such as eMMC
-or UFS persistent storage), the device SHOULD offer the VIRTIO_BLK_F_LIFETIME
-flag. The flag MUST NOT be offered if the device is backed by storage for which
-the lifetime metrics described in this document cannot be obtained or for which
-such metrics have no useful meaning. If the metrics are offered, the device MUST NOT
-send any reserved values, as defined in this specification.
-
-\begin{note}
-  The device lifetime metrics \field{pre_eol_info}, \field{device_lifetime_est_a}
-  and \field{device_lifetime_est_b} are discussed in the JESD84-B50 specification.
-
-  The complete JESD84-B50 is available at the JEDEC website (https://www.jedec.org)
-  pursuant to JEDEC's licensing terms and conditions. This information is provided to
-  simplfy passthrough implementations from eMMC devices.
-\end{note}
-
-If the VIRTIO_BLK_F_ZONED feature is not negotiated, the device MUST reject
-VIRTIO_BLK_T_ZONE_REPORT, VIRTIO_BLK_T_ZONE_OPEN, VIRTIO_BLK_T_ZONE_CLOSE,
-VIRTIO_BLK_T_ZONE_FINISH, VIRTIO_BLK_T_ZONE_APPEND, VIRTIO_BLK_T_ZONE_RESET and
-VIRTIO_BLK_T_ZONE_RESET_ALL requests with VIRTIO_BLK_S_UNSUPP status.
-
-The following device requirements only apply if the VIRTIO_BLK_F_ZONED feature
-is negotiated.
-
-If a request of type VIRTIO_BLK_T_ZONE_OPEN, VIRTIO_BLK_T_ZONE_CLOSE,
-VIRTIO_BLK_T_ZONE_FINISH or VIRTIO_BLK_T_ZONE_RESET is issued for a Conventional
-zone (type VIRTIO_BLK_ZT_CONV), the device MUST complete the request with
-VIRTIO_BLK_S_ZONE_INVALID_CMD \field{status}.
-
-If the zone specified by the VIRTIO_BLK_T_ZONE_APPEND request is not a SWR zone,
-then the request SHALL be completed with VIRTIO_BLK_S_ZONE_INVALID_CMD
-\field{status}.
-
-The device handles a VIRTIO_BLK_T_ZONE_OPEN request by attempting to change the
-state of the zone with the \field{sector} address to VIRTIO_BLK_ZS_EOPEN. If the
-transition to this state can not be performed, the request MUST be completed
-with VIRTIO_BLK_S_ZONE_INVALID_CMD \field{status}. If, while processing this
-request, the available zone resources are insufficient, then the zone state does
-not change and the request MUST be completed with
-VIRTIO_BLK_S_ZONE_OPEN_RESOURCE or VIRTIO_BLK_S_ZONE_ACTIVE_RESOURCE value in
-the field \field{status}.
-
-The device handles a VIRTIO_BLK_T_ZONE_CLOSE request by attempting to change the
-state of the zone with the \field{sector} address to VIRTIO_BLK_ZS_CLOSED. If
-the transition to this state can not be performed, the request MUST be completed
-with VIRTIO_BLK_S_ZONE_INVALID_CMD value in the field \field{status}.
-
-The device handles a VIRTIO_BLK_T_ZONE_FINISH request by attempting to change
-the state of the zone with the \field{sector} address to VIRTIO_BLK_ZS_FULL. If
-the transition to this state can not be performed, the zone state does not
-change and the request MUST be completed with VIRTIO_BLK_S_ZONE_INVALID_CMD
-value in the field \field{status}.
-
-The device handles a VIRTIO_BLK_T_ZONE_RESET request by attempting to change the
-state of the zone with the \field{sector} address to VIRTIO_BLK_ZS_EMPTY state.
-If the transition to this state can not be performed, the zone state does not
-change and the request MUST be completed with VIRTIO_BLK_S_ZONE_INVALID_CMD
-value in the field \field{status}.
-
-The device handles a VIRTIO_BLK_T_ZONE_RESET_ALL request by transitioning all
-sequential device zones in VIRTIO_BLK_ZS_IOPEN, VIRTIO_BLK_ZS_EOPEN,
-VIRTIO_BLK_ZS_CLOSED and VIRTIO_BLK_ZS_FULL state to VIRTIO_BLK_ZS_EMPTY state.
-
-Upon receiving a VIRTIO_BLK_T_ZONE_APPEND request or a VIRTIO_BLK_T_OUT
-request issued to a SWR zone in VIRTIO_BLK_ZS_EMPTY or VIRTIO_BLK_ZS_CLOSED
-state, the device attempts to perform the transition of the zone to
-VIRTIO_BLK_ZS_IOPEN state before writing data. This transition may fail due to
-insufficient open and/or active zone resources available on the device. In this
-case, the request MUST be completed with VIRTIO_BLK_S_ZONE_OPEN_RESOURCE or
-VIRTIO_BLK_S_ZONE_ACTIVE_RESOURCE value in the \field{status}.
-
-If the \field{sector} field in the VIRTIO_BLK_T_ZONE_APPEND request does not
-specify the lowest sector for a zone, then the request SHALL be completed with
-VIRTIO_BLK_S_ZONE_INVALID_CMD value in \field{status}.
-
-A VIRTIO_BLK_T_ZONE_APPEND request or a VIRTIO_BLK_T_OUT request that has the
-data range that exceeds the remaining writable capacity for the zone, then the
-request SHALL be completed with VIRTIO_BLK_S_ZONE_INVALID_CMD value in
-\field{status}.
-
-If a request of the type VIRTIO_BLK_T_ZONE_APPEND is completed with
-VIRTIO_BLK_S_OK status, the field \field{append_sector} in
-\field{virtio_blk_req_za} MUST be set by the device to contain the first sector
-of the data written to the zone.
-
-If a request of the type VIRTIO_BLK_T_ZONE_APPEND is completed with a status
-other than VIRTIO_BLK_S_OK, the value of \field{append_sector} field in
-\field{virtio_blk_req_za} is undefined.
-
-A VIRTIO_BLK_T_ZONE_APPEND request that has the data size that exceeds
-\field{max_append_sectors} configuration space value, then,
-\begin{itemize}
-\item if \field{max_append_sectors} configuration space value is reported as
-  zero by the device, the request SHALL be completed with VIRTIO_BLK_S_UNSUPP
-  \field{status}.
-
-\item if \field{max_append_sectors} configuration space value is reported as
-  a non-zero value by the device, the request SHALL be completed with
-  VIRTIO_BLK_S_ZONE_INVALID_CMD \field{status}.
-\end{itemize}
-
-If a VIRTIO_BLK_T_ZONE_APPEND request, a VIRTIO_BLK_T_IN request or a
-VIRTIO_BLK_T_OUT request issued to a SWR zone has the range that has sectors in
-more than one zone, then the request SHALL be completed with
-VIRTIO_BLK_S_ZONE_INVALID_CMD value in the field \field{status}.
-
-A VIRTIO_BLK_T_OUT request that has the \field{sector} value that is not aligned
-with the write pointer for the zone, then the request SHALL be completed with
-VIRTIO_BLK_S_ZONE_UNALIGNED_WP value in the field \field{status}.
-
-In order to avoid resource-related errors while opening zones implicitly, the
-device MAY automatically transition zones in VIRTIO_BLK_ZS_IOPEN state to
-VIRTIO_BLK_ZS_CLOSED state.
-
-All VIRTIO_BLK_T_OUT requests or VIRTIO_BLK_T_ZONE_APPEND requests issued
-to a zone in the VIRTIO_BLK_ZS_RDONLY state SHALL be completed with
-VIRTIO_BLK_S_ZONE_INVALID_CMD \field{status}.
-
-All requests issued to a zone in the VIRTIO_BLK_ZS_OFFLINE state SHALL be
-completed with VIRTIO_BLK_S_ZONE_INVALID_CMD value in the field \field{status}.
-
-The device MUST consider the sectors that are read between the write pointer
-position of a zone and the end of the last sector of the zone as unwritten data.
-The sectors between the write pointer position and the end of the last sector
-within the zone capacity during VIRTIO_BLK_T_ZONE_FINISH request processing are
-also considered unwritten data.
-
-When unwritten data is present in the sector range of a read request, the device
-MUST process this data in one of the following ways -
-
-\begin{enumerate}
-\item Fill the unwritten data with a device-specific byte pattern. The
-configuration, control and reporting of this byte pattern is beyond the scope
-of this standard. This is the preferred approach.
-
-\item Fail the request. Depending on the driver implementation, this may prevent
-the device from becoming operational.
-\end{enumerate}
-
-If both the VIRTIO_BLK_F_ZONED and VIRTIO_BLK_F_SECURE_ERASE features are
-negotiated, then
-
-\begin{enumerate}
-\item the field \field{secure_erase_sector_alignment} in the configuration space
-of the device MUST be a multiple of \field{zone_sectors} value reported in the
-device configuration space.
-
-\item the data size in VIRTIO_BLK_T_SECURE_ERASE requests MUST be a multiple of
-\field{zone_sectors} value in the device configuration space.
-\end{enumerate}
-
-The device MUST handle a VIRTIO_BLK_T_SECURE_ERASE request in the same way it
-handles VIRTIO_BLK_T_ZONE_RESET request for the zone range specified in the
-VIRTIO_BLK_T_SECURE_ERASE request.
-
-\subsubsection{Legacy Interface: Device Operation}\label{sec:Device Types / Block Device / Device Operation / Legacy Interface: Device Operation}
-When using the legacy interface, transitional devices and drivers
-MUST format the fields in struct virtio_blk_req
-according to the native endian of the guest rather than
-(necessarily when not using the legacy interface) little-endian.
-
-When using the legacy interface, transitional drivers
-SHOULD ignore the used length values.
-\begin{note}
-Historically, some devices put the total descriptor length,
-or the total length of device-writable buffers there,
-even when only the status byte was actually written.
-\end{note}
-
-The \field{reserved} field was previously called \field{ioprio}. \field{ioprio}
-is a hint about the relative priorities of requests to the device:
-higher numbers indicate more important requests.
-
-\begin{lstlisting}
-#define VIRTIO_BLK_T_FLUSH_OUT 5
-\end{lstlisting}
-
-The command VIRTIO_BLK_T_FLUSH_OUT was a synonym for VIRTIO_BLK_T_FLUSH;
-a driver MUST treat it as a VIRTIO_BLK_T_FLUSH command.
-
-\begin{lstlisting}
-#define VIRTIO_BLK_T_BARRIER 0x80000000
-\end{lstlisting}
-
-If the device has VIRTIO_BLK_F_BARRIER
-feature the high bit (VIRTIO_BLK_T_BARRIER) indicates that this
-request acts as a barrier and that all preceding requests SHOULD be
-complete before this one, and all following requests SHOULD NOT be
-started until this is complete.
-
-\begin{note} A barrier does not flush
-caches in the underlying backend device in host, and thus does not
-serve as data consistency guarantee. Only a VIRTIO_BLK_T_FLUSH request
-does that.
-\end{note}
-
-Some older legacy devices did not commit completed writes to persistent
-device backend storage when VIRTIO_BLK_F_FLUSH was offered but not
-negotiated. In order to work around this, the driver MAY set the
-\field{writeback} to 0 (if available) or it MAY send an explicit flush
-request after every completed write.
-
-If the device has VIRTIO_BLK_F_SCSI feature, it can also support
-scsi packet command requests, each of these requests is of form:
-
-\begin{lstlisting}
-/* All fields are in guest's native endian. */
-struct virtio_scsi_pc_req {
-        u32 type;
-        u32 ioprio;
-        u64 sector;
-        u8 cmd[];
-        u8 data[][512];
-#define SCSI_SENSE_BUFFERSIZE 96
-        u8 sense[SCSI_SENSE_BUFFERSIZE];
-        u32 errors;
-        u32 data_len;
-        u32 sense_len;
-        u32 residual;
-        u8 status;
-};
-\end{lstlisting}
-
-A request type can also be a scsi packet command (VIRTIO_BLK_T_SCSI_CMD or
-VIRTIO_BLK_T_SCSI_CMD_OUT). The two types are equivalent, the device
-does not distinguish between them:
-
-\begin{lstlisting}
-#define VIRTIO_BLK_T_SCSI_CMD 2
-#define VIRTIO_BLK_T_SCSI_CMD_OUT 3
-\end{lstlisting}
-
-The \field{cmd} field is only present for scsi packet command requests,
-and indicates the command to perform. This field MUST reside in a
-single, separate device-readable buffer; command length can be derived
-from the length of this buffer.
-
-Note that these first three (four for scsi packet commands)
-fields are always device-readable: \field{data} is either device-readable
-or device-writable, depending on the request. The size of the read or
-write can be derived from the total size of the request buffers.
-
-\field{sense} is only present for scsi packet command requests,
-and indicates the buffer for scsi sense data.
-
-\field{data_len} is only present for scsi packet command
-requests, this field is deprecated, and SHOULD be ignored by the
-driver. Historically, devices copied data length there.
-
-\field{sense_len} is only present for scsi packet command
-requests and indicates the number of bytes actually written to
-the \field{sense} buffer.
-
-\field{residual} field is only present for scsi packet command
-requests and indicates the residual size, calculated as data
-length - number of bytes actually transferred.
-
-\subsubsection{Legacy Interface: Framing Requirements}\label{sec:Device Types / Block Device / Legacy Interface: Framing Requirements}
-
-When using legacy interfaces, transitional drivers which have not
-negotiated VIRTIO_F_ANY_LAYOUT:
-
-\begin{itemize}
-\item MUST use a single 8-byte descriptor containing \field{type},
-  \field{reserved} and \field{sector}, followed by descriptors
-  for \field{data}, then finally a separate 1-byte descriptor
-  for \field{status}.
-
-\item For SCSI commands there are additional constraints.
-  \field{sense} MUST reside in a
-  single separate device-writable descriptor of size 96 bytes,
-  and \field{errors}, \field{data_len}, \field{sense_len} and
-  \field{residual} MUST reside a single separate
-  device-writable descriptor.
-\end{itemize}
-
-See \ref{sec:Basic Facilities of a Virtio Device / Virtqueues / Message Framing}.
-
+\input{virtio-block.tex}
 \input{virtio-console.tex}
 \input{virtio-entropy.tex}
 \input{virtio-mem-balloon.tex}
diff --git a/virtio-block.tex b/virtio-block.tex
new file mode 100644
index 0000000..60cdde6
--- /dev/null
+++ b/virtio-block.tex
@@ -0,0 +1,1315 @@
+\section{Block Device}\label{sec:Device Types / Block Device}
+
+The virtio block device is a simple virtual block device (ie.
+disk). Read and write requests (and other exotic requests) are
+placed in one of its queues, and serviced (probably out of order) by the
+device except where noted.
+
+\subsection{Device ID}\label{sec:Device Types / Block Device / Device ID}
+  2
+
+\subsection{Virtqueues}\label{sec:Device Types / Block Device / Virtqueues}
+\begin{description}
+\item[0] requestq1
+\item[\ldots]
+\item[N-1] requestqN
+\end{description}
+
+ N=1 if VIRTIO_BLK_F_MQ is not negotiated, otherwise N is set by
+ \field{num_queues}.
+
+\subsection{Feature bits}\label{sec:Device Types / Block Device / Feature bits}
+
+\begin{description}
+\item[VIRTIO_BLK_F_SIZE_MAX (1)] Maximum size of any single segment is
+  in \field{size_max}.
+
+\item[VIRTIO_BLK_F_SEG_MAX (2)] Maximum number of segments in a
+  request is in \field{seg_max}.
+
+\item[VIRTIO_BLK_F_GEOMETRY (4)] Disk-style geometry specified in
+  \field{geometry}.
+
+\item[VIRTIO_BLK_F_RO (5)] Device is read-only.
+
+\item[VIRTIO_BLK_F_BLK_SIZE (6)] Block size of disk is in \field{blk_size}.
+
+\item[VIRTIO_BLK_F_FLUSH (9)] Cache flush command support.
+
+\item[VIRTIO_BLK_F_TOPOLOGY (10)] Device exports information on optimal I/O
+  alignment.
+
+\item[VIRTIO_BLK_F_CONFIG_WCE (11)] Device can toggle its cache between writeback
+  and writethrough modes.
+
+\item[VIRTIO_BLK_F_MQ (12)] Device supports multiqueue.
+
+\item[VIRTIO_BLK_F_DISCARD (13)] Device can support discard command, maximum
+  discard sectors size in \field{max_discard_sectors} and maximum discard
+  segment number in \field{max_discard_seg}.
+
+\item[VIRTIO_BLK_F_WRITE_ZEROES (14)] Device can support write zeroes command,
+  maximum write zeroes sectors size in \field{max_write_zeroes_sectors} and
+  maximum write zeroes segment number in \field{max_write_zeroes_seg}.
+
+\item[VIRTIO_BLK_F_LIFETIME (15)] Device supports providing storage lifetime
+  information.
+
+\item[VIRTIO_BLK_F_SECURE_ERASE (16)] Device supports secure erase command,
+  maximum erase sectors count in \field{max_secure_erase_sectors} and
+  maximum erase segment number in \field{max_secure_erase_seg}.
+
+\item[VIRTIO_BLK_F_ZONED(17)] Device is a Zoned Block Device, that is, a device
+  that follows the zoned storage device behavior that is also supported by
+  industry standards such as the T10 Zoned Block Command standard (ZBC r05) or
+  the NVMe(TM) NVM Express Zoned Namespace Command Set Specification 1.1b
+  (ZNS). For brevity, these standard documents are referred as "ZBD standards"
+  from this point on in the text.
+
+\end{description}
+
+\subsubsection{Legacy Interface: Feature bits}\label{sec:Device Types / Block Device / Feature bits / Legacy Interface: Feature bits}
+
+\begin{description}
+\item[VIRTIO_BLK_F_BARRIER (0)] Device supports request barriers.
+
+\item[VIRTIO_BLK_F_SCSI (7)] Device supports scsi packet commands.
+\end{description}
+
+\begin{note}
+  In the legacy interface, VIRTIO_BLK_F_FLUSH was also
+  called VIRTIO_BLK_F_WCE.
+\end{note}
+
+\subsection{Device configuration layout}\label{sec:Device Types / Block Device / Device configuration layout}
+
+The \field{capacity} of the device (expressed in 512-byte sectors) is always
+present. The availability of the others all depend on various feature
+bits as indicated above.
+
+The field \field{num_queues} only exists if VIRTIO_BLK_F_MQ is set. This field specifies
+the number of queues.
+
+The parameters in the configuration space of the device \field{max_discard_sectors}
+\field{discard_sector_alignment} are expressed in 512-byte units if the
+VIRTIO_BLK_F_DISCARD feature bit is negotiated. The \field{max_write_zeroes_sectors}
+is expressed in 512-byte units if the VIRTIO_BLK_F_WRITE_ZEROES feature
+bit is negotiated. The parameters in the configuration space of the device
+\field{max_secure_erase_sectors} \field{secure_erase_sector_alignment} are expressed
+in 512-byte units if the VIRTIO_BLK_F_SECURE_ERASE feature bit is negotiated.
+
+If the VIRTIO_BLK_F_ZONED feature is negotiated, then in
+\field{virtio_blk_zoned_characteristics},
+\begin{itemize}
+\item \field{zone_sectors} value is expressed in 512-byte sectors.
+\item \field{max_append_sectors} value is expressed in 512-byte sectors.
+\item \field{write_granularity} value is expressed in bytes.
+\end{itemize} + +The \field{model} field in \field{zoned} may have the following values: + +\begin{lstlisting} +#define VIRTIO_BLK_Z_NONE 0 +#define VIRTIO_BLK_Z_HM 1 +#define VIRTIO_BLK_Z_HA 2 +\end{lstlisting} + +Depending on their design, zoned block devices may follow several possible +models of operation. The three models that are standardized for ZBDs are +drive-managed, host-managed and host-aware. + +While being zoned internally, drive-managed ZBDs behave exactly like regular, +non-zoned block devices. For the purposes of virtio standardization, +drive-managed ZBDs can always be treated as non-zoned devices. These devices +have the VIRTIO_BLK_Z_NONE model value set in the \field{model} field in +\field{zoned}. + +Devices that offer the VIRTIO_BLK_F_ZONED feature while reporting the +VIRTIO_BLK_Z_NONE zoned model are drive-managed zoned block devices. In this +case, the driver treats the device as a regular non-zoned block device. + +Host-managed zoned block devices have their LBA range divided into Sequential +Write Required (SWR) zones that require some additional handling by the host +for correct operation. All write requests to SWR zones are required be +sequential and zones containing some written data need to be reset before that +data can be rewritten. Host-managed devices support a set of ZBD-specific I/O +requests that can be used by the host to manage device zones. Host-managed +devices report VIRTIO_BLK_Z_HM in the \field{model} field in \field{zoned}. + +Host-aware zoned block devices have their LBA range divided to Sequential +Write Preferred (SWP) zones that support random write access, similar to +regular non-zoned devices. However, the device I/O performance might not be +optimal if SWP zones are used in a random I/O pattern. SWP zones also support +the same set of ZBD-specific I/O requests as host-managed devices that allow +host-aware devices to be managed by any host that supports zoned block devices +to achieve its optimum performance. 
Host-aware devices report VIRTIO_BLK_Z_HA +in the \field{model} field in \field{zoned}. + +Both SWR zones and SWP zones are sometimes referred to as sequential zones. + +During device operation, sequential zones can be in one of the following states: +empty, implicitly-open, explicitly-open, closed and full. The state machine that +governs the transitions between these states is described later in this document. + +SWR and SWP zones consume volatile device resources while being in certain +states and the device may set limits on the number of zones that can be in these +states simultaneously. + +Zoned block devices use two internal counters to account for the device +resources in use: the number of currently open zones and the number of currently +active zones. + +Any zone state transition from a state that doesn't consume a zone resource to a +state that consumes the same resource increments the internal device counter for +that resource. Any zone transition out of a state that consumes a zone resource +to a state that doesn't consume the same resource decrements the counter. Any +request that causes the device to exceed the reported zone resource limits is +terminated by the device with a "zone resources exceeded" error as defined for +specific commands later. 
+ +\begin{lstlisting} +struct virtio_blk_config { + le64 capacity; + le32 size_max; + le32 seg_max; + struct virtio_blk_geometry { + le16 cylinders; + u8 heads; + u8 sectors; + } geometry; + le32 blk_size; + struct virtio_blk_topology { + // # of logical blocks per physical block (log2) + u8 physical_block_exp; + // offset of first aligned logical block + u8 alignment_offset; + // suggested minimum I/O size in blocks + le16 min_io_size; + // optimal (suggested maximum) I/O size in blocks + le32 opt_io_size; + } topology; + u8 writeback; + u8 unused0; + u16 num_queues; + le32 max_discard_sectors; + le32 max_discard_seg; + le32 discard_sector_alignment; + le32 max_write_zeroes_sectors; + le32 max_write_zeroes_seg; + u8 write_zeroes_may_unmap; + u8 unused1[3]; + le32 max_secure_erase_sectors; + le32 max_secure_erase_seg; + le32 secure_erase_sector_alignment; + struct virtio_blk_zoned_characteristics { + le32 zone_sectors; + le32 max_open_zones; + le32 max_active_zones; + le32 max_append_sectors; + le32 write_granularity; + u8 model; + u8 unused2[3]; + } zoned; +}; +\end{lstlisting} + + +\subsubsection{Legacy Interface: Device configuration layout}\label{sec:Device Types / Block Device / Device configuration layout / Legacy Interface: Device configuration layout} +When using the legacy interface, transitional devices and drivers +MUST format the fields in struct virtio_blk_config +according to the native endian of the guest rather than +(necessarily when not using the legacy interface) little-endian. + + +\subsection{Device Initialization}\label{sec:Device Types / Block Device / Device Initialization} + +\begin{enumerate} +\item The device size can be read from \field{capacity}. + +\item If the VIRTIO_BLK_F_BLK_SIZE feature is negotiated, + \field{blk_size} can be read to determine the optimal sector size + for the driver to use. This does not affect the units used in + the protocol (always 512 bytes), but awareness of the correct + value can affect performance. 
+ +\item If the VIRTIO_BLK_F_RO feature is set by the device, any write + requests will fail. + +\item If the VIRTIO_BLK_F_TOPOLOGY feature is negotiated, the fields in the + \field{topology} struct can be read to determine the physical block size and optimal + I/O lengths for the driver to use. This also does not affect the units + in the protocol, only performance. + +\item If the VIRTIO_BLK_F_CONFIG_WCE feature is negotiated, the cache + mode can be read or set through the \field{writeback} field. 0 corresponds + to a writethrough cache, 1 to a writeback cache\footnote{Consistent with + \ref{devicenormative:Device Types / Block Device / Device Operation}, + a writethrough cache can be defined broadly as a cache that commits + writes to persistent device backend storage before reporting their + completion. For example, a battery-backed writeback cache actually + counts as writethrough according to this definition.}. The cache mode + after reset can be either writeback or writethrough. The actual + mode can be determined by reading \field{writeback} after feature + negotiation. + +\item If the VIRTIO_BLK_F_DISCARD feature is negotiated, + \field{max_discard_sectors} and \field{max_discard_seg} can be read + to determine the maximum discard sectors and maximum number of discard + segments for the block driver to use. \field{discard_sector_alignment} + can be used by OS when splitting a request based on alignment. + +\item If the VIRTIO_BLK_F_WRITE_ZEROES feature is negotiated, + \field{max_write_zeroes_sectors} and \field{max_write_zeroes_seg} can + be read to determine the maximum write zeroes sectors and maximum + number of write zeroes segments for the block driver to use. + +\item If the VIRTIO_BLK_F_MQ feature is negotiated, \field{num_queues} field + can be read to determine the number of queues. 
+ +\item If the VIRTIO_BLK_F_SECURE_ERASE feature is negotiated, + \field{max_secure_erase_sectors} and \field{max_secure_erase_seg} can be read + to determine the maximum secure erase sectors and maximum number of + secure erase segments for the block driver to use. + \field{secure_erase_sector_alignment} can be used by OS when splitting a + request based on alignment. + +\item If the VIRTIO_BLK_F_ZONED feature is negotiated, the fields in + \field{zoned} can be read by the driver to determine the zone + characteristics of the device. All \field{zoned} fields are read-only. + +\end{enumerate} + +\drivernormative{\subsubsection}{Device Initialization}{Device Types / Block Device / Device Initialization} + +Drivers SHOULD NOT negotiate VIRTIO_BLK_F_FLUSH if they are incapable of +sending VIRTIO_BLK_T_FLUSH commands. + +If neither VIRTIO_BLK_F_CONFIG_WCE nor VIRTIO_BLK_F_FLUSH are +negotiated, the driver MAY deduce the presence of a writethrough cache. +If VIRTIO_BLK_F_CONFIG_WCE was not negotiated but VIRTIO_BLK_F_FLUSH was, +the driver SHOULD assume presence of a writeback cache. + +The driver MUST NOT read \field{writeback} before setting +the FEATURES_OK \field{device status} bit. + +Drivers MUST NOT negotiate the VIRTIO_BLK_F_ZONED feature if they are incapable +of supporting devices with the VIRTIO_BLK_Z_HM, VIRTIO_BLK_Z_HA or +VIRTIO_BLK_Z_NONE zoned model. + +If the VIRTIO_BLK_F_ZONED feature is offered by the device with the +VIRTIO_BLK_Z_HM zone model, then the VIRTIO_BLK_F_DISCARD feature MUST NOT be +offered by the driver. + +If the VIRTIO_BLK_F_ZONED feature and VIRTIO_BLK_F_DISCARD feature are both +offered by the device with the VIRTIO_BLK_Z_HA or VIRTIO_BLK_Z_NONE zone model, +then the driver MAY negotiate these two bits independently. 
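As a non-normative illustration, the cache-mode deductions stated earlier in this section (for the VIRTIO_BLK_F_CONFIG_WCE and VIRTIO_BLK_F_FLUSH features) might be sketched on the driver side as follows. The enum and function names are hypothetical and are not defined by this specification; the two booleans stand for the outcome of feature negotiation.

```c
#include <stdbool.h>

/* Hypothetical driver-side view of the device cache mode. */
enum blk_cache_mode {
    BLK_CACHE_WRITETHROUGH,
    BLK_CACHE_WRITEBACK,
    BLK_CACHE_READ_CONFIG   /* read the writeback field after FEATURES_OK */
};

/*
 * Decide how the driver learns the cache mode, given which of the two
 * features were negotiated.
 */
static enum blk_cache_mode
blk_deduce_cache_mode(bool config_wce_negotiated, bool flush_negotiated)
{
    if (config_wce_negotiated)
        return BLK_CACHE_READ_CONFIG;   /* actual mode is in writeback */
    if (flush_negotiated)
        return BLK_CACHE_WRITEBACK;     /* driver SHOULD assume writeback */
    return BLK_CACHE_WRITETHROUGH;      /* driver MAY deduce writethrough */
}
```

When the sketch returns BLK_CACHE_READ_CONFIG, the driver would read \field{writeback} only after setting the FEATURES_OK \field{device status} bit.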
+ +If the VIRTIO_BLK_F_ZONED feature is negotiated, then +\begin{itemize} +\item if a driver that cannot support host-managed zoned devices + reads VIRTIO_BLK_Z_HM from the \field{model} field of \field{zoned}, the + driver MUST NOT set the FEATURES_OK flag and instead set the FAILED bit. + +\item if a driver that cannot support zoned devices reads VIRTIO_BLK_Z_HA + from the \field{model} field of \field{zoned}, the driver + MAY handle the device as a non-zoned device. In this case, the + driver SHOULD ignore all other fields in \field{zoned}. +\end{itemize} + +\devicenormative{\subsubsection}{Device Initialization}{Device Types / Block Device / Device Initialization} + +Devices SHOULD always offer VIRTIO_BLK_F_FLUSH, and MUST offer it +if they offer VIRTIO_BLK_F_CONFIG_WCE. + +If VIRTIO_BLK_F_CONFIG_WCE is negotiated but VIRTIO_BLK_F_FLUSH +is not, the device MUST initialize \field{writeback} to 0. + +The device MUST initialize padding bytes \field{unused0} and +\field{unused1} to 0. + +If the device that is being initialized is not a zoned device, the device +SHOULD NOT offer the VIRTIO_BLK_F_ZONED feature. + +The VIRTIO_BLK_F_ZONED feature cannot be properly negotiated without the +FEATURES_OK bit. Legacy devices MUST NOT offer the VIRTIO_BLK_F_ZONED feature bit. + +If the VIRTIO_BLK_F_ZONED feature is not accepted by the driver, +\begin{itemize} +\item the device with the VIRTIO_BLK_Z_HA or VIRTIO_BLK_Z_NONE zone model SHOULD + proceed with the initialization while setting all zoned characteristics + fields to zero. + +\item the device with the VIRTIO_BLK_Z_HM zone model MUST fail to set the + FEATURES_OK device status bit when the driver writes the Device Status + field. 
+\end{itemize} + +If the VIRTIO_BLK_F_ZONED feature is negotiated, then the \field{model} field in +\field{zoned} struct in the configuration space MUST be set by the device +\begin{itemize} +\item to the value of VIRTIO_BLK_Z_NONE if it operates as a drive-managed + zoned block device or a non-zoned block device. + +\item to the value of VIRTIO_BLK_Z_HM if it operates as a host-managed zoned + block device. + +\item to the value of VIRTIO_BLK_Z_HA if it operates as a host-aware zoned + block device. +\end{itemize} + +If the VIRTIO_BLK_F_ZONED feature is negotiated and the device \field{model} +field in \field{zoned} struct is VIRTIO_BLK_Z_HM or VIRTIO_BLK_Z_HA, + +\begin{itemize} +\item the \field{zone_sectors} field of \field{zoned} MUST be set by the device + to the size of a single zone on the device. All zones of the device have the + same size indicated by \field{zone_sectors} except for the last zone that + MAY be smaller than all other zones. The driver can calculate the number of + zones on the device as + \begin{lstlisting} + nr_zones = (capacity + zone_sectors - 1) / zone_sectors; + \end{lstlisting} + and the size of the last zone as + \begin{lstlisting} + zs_last = capacity - (nr_zones - 1) * zone_sectors; + \end{lstlisting} + +\item The \field{max_open_zones} field of the \field{zoned} structure MUST be + set by the device to the maximum number of zones that can be open on the + device (zones in the implicit open or explicit open state). A value + of zero indicates that the device does not have any limit on the number of + open zones. + +\item The \field{max_active_zones} field of the \field{zoned} structure MUST + be set by the device to the maximum number of zones that can be active on the + device (zones in the implicit open, explicit open or closed state). A value + of zero indicates that the device does not have any limit on the number of + active zones. 
+ +\item the \field{max_append_sectors} field of \field{zoned} MUST be set by + the device to the maximum data size of a VIRTIO_BLK_T_ZONE_APPEND request + that can be successfully issued to the device. The value of this field MUST + NOT exceed the \field{seg_max} * \field{size_max} value. A device MAY set + the \field{max_append_sectors} to zero if it doesn't support + VIRTIO_BLK_T_ZONE_APPEND requests. + +\item the \field{write_granularity} field of \field{zoned} MUST be set by the + device to the offset and size alignment constraint for VIRTIO_BLK_T_OUT + and VIRTIO_BLK_T_ZONE_APPEND requests issued to a sequential zone of the + device. + +\item the device MUST initialize padding bytes \field{unused2} to 0. +\end{itemize} + +\subsubsection{Legacy Interface: Device Initialization}\label{sec:Device Types / Block Device / Device Initialization / Legacy Interface: Device Initialization} + +Because legacy devices do not have FEATURES_OK, transitional devices +MUST implement slightly different behavior around feature negotiation +when used through the legacy interface. In particular, when using the +legacy interface: + +\begin{itemize} +\item the driver MAY read or write \field{writeback} before setting + the DRIVER or DRIVER_OK \field{device status} bit + +\item the device MUST NOT modify the cache mode (and \field{writeback}) + as a result of a driver setting a status bit, unless + the DRIVER_OK bit is being set and the driver has not set the + VIRTIO_BLK_F_CONFIG_WCE driver feature bit. + +\item the device MUST NOT modify the cache mode (and \field{writeback}) + as a result of a driver modifying the driver feature bits, for example + if the driver sets the VIRTIO_BLK_F_CONFIG_WCE driver feature bit but + does not set the VIRTIO_BLK_F_FLUSH bit. +\end{itemize} + + +\subsection{Device Operation}\label{sec:Device Types / Block Device / Device Operation} + +The driver queues requests to the virtqueues, and they are used by +the device (not necessarily in order). 
Each request except +VIRTIO_BLK_T_ZONE_APPEND is of form: + +\begin{lstlisting} +struct virtio_blk_req { + le32 type; + le32 reserved; + le64 sector; + u8 data[]; + u8 status; +}; +\end{lstlisting} + +The type of the request is either a read (VIRTIO_BLK_T_IN), a write +(VIRTIO_BLK_T_OUT), a discard (VIRTIO_BLK_T_DISCARD), a write zeroes +(VIRTIO_BLK_T_WRITE_ZEROES), a flush (VIRTIO_BLK_T_FLUSH), a get device ID +string command (VIRTIO_BLK_T_GET_ID), a secure erase +(VIRTIO_BLK_T_SECURE_ERASE), or a get device lifetime command +(VIRTIO_BLK_T_GET_LIFETIME). + +\begin{lstlisting} +#define VIRTIO_BLK_T_IN 0 +#define VIRTIO_BLK_T_OUT 1 +#define VIRTIO_BLK_T_FLUSH 4 +#define VIRTIO_BLK_T_GET_ID 8 +#define VIRTIO_BLK_T_GET_LIFETIME 10 +#define VIRTIO_BLK_T_DISCARD 11 +#define VIRTIO_BLK_T_WRITE_ZEROES 13 +#define VIRTIO_BLK_T_SECURE_ERASE 14 +\end{lstlisting} + +The \field{sector} number indicates the offset (multiplied by 512) where +the read or write is to occur. This field is unused and set to 0 for +commands other than read, write and some zone operations. + +VIRTIO_BLK_T_IN requests populate \field{data} with the contents of sectors +read from the block device (in multiples of 512 bytes). VIRTIO_BLK_T_OUT +requests write the contents of \field{data} to the block device (in multiples +of 512 bytes). + +The \field{data} used for discard, secure erase or write zeroes commands +consists of one or more segments. The maximum number of segments is +\field{max_discard_seg} for discard commands, \field{max_secure_erase_seg} for +secure erase commands and \field{max_write_zeroes_seg} for write zeroes +commands. 
+Each segment is of form: + +\begin{lstlisting} +struct virtio_blk_discard_write_zeroes { + le64 sector; + le32 num_sectors; + struct { + le32 unmap:1; + le32 reserved:31; + } flags; +}; +\end{lstlisting} + +\field{sector} indicates the starting offset (in 512-byte units) of the +segment, while \field{num_sectors} indicates the number of sectors in each +discarded range. \field{unmap} is only used in write zeroes commands and allows +the device to discard the specified range, provided that following reads return +zeroes. + +VIRTIO_BLK_T_GET_ID requests fetch the device ID string from the device into +\field{data}. The device ID string is a NUL-padded ASCII string up to 20 bytes +long. If the string is 20 bytes long then there is no NUL terminator. + +The \field{data} used for VIRTIO_BLK_T_GET_LIFETIME requests is populated +by the device, and is of the form + +\begin{lstlisting} +struct virtio_blk_lifetime { + le16 pre_eol_info; + le16 device_lifetime_est_typ_a; + le16 device_lifetime_est_typ_b; +}; +\end{lstlisting} + +The \field{pre_eol_info} specifies the percentage of reserved blocks +that are consumed and will have one of these values: + +\begin{lstlisting} +/* Value not available */ +#define VIRTIO_BLK_PRE_EOL_INFO_UNDEFINED 0 +/* < 80% of reserved blocks are consumed */ +#define VIRTIO_BLK_PRE_EOL_INFO_NORMAL 1 +/* 80% of reserved blocks are consumed */ +#define VIRTIO_BLK_PRE_EOL_INFO_WARNING 2 +/* 90% of reserved blocks are consumed */ +#define VIRTIO_BLK_PRE_EOL_INFO_URGENT 3 +/* All others values are reserved */ +\end{lstlisting} + +The \field{device_lifetime_est_typ_a} refers to wear of SLC cells and is provided +in increments of 10%, with 0 meaning undefined, 1 meaning up-to 10% of lifetime +used, and so on, thru to 11 meaning estimated lifetime exceeded. +All values above 11 are reserved. + +The \field{device_lifetime_est_typ_b} refers to wear of MLC cells and is provided +with the same semantics as \field{device_lifetime_est_typ_a}. 
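The encoding of \field{device_lifetime_est_typ_a} and \field{device_lifetime_est_typ_b} described above can be decoded with a small helper. The following is only a sketch: the function name and the negative return codes are hypothetical, and the input is assumed to have already been converted from little-endian to CPU byte order.

```c
#include <stdint.h>

/* Hypothetical return codes for values that carry no percentage. */
#define BLK_LIFETIME_UNDEFINED (-1)
#define BLK_LIFETIME_EXCEEDED  (-2)
#define BLK_LIFETIME_RESERVED  (-3)

/*
 * Map a device_lifetime_est_typ_a/b value to the upper bound, in percent,
 * of the device lifetime already used.
 */
static int blk_lifetime_percent_used(uint16_t est)
{
    if (est == 0)
        return BLK_LIFETIME_UNDEFINED;  /* 0: value not available */
    if (est <= 10)
        return (int)est * 10;           /* 1..10: up to 10%..100% used */
    if (est == 11)
        return BLK_LIFETIME_EXCEEDED;   /* 11: estimated lifetime exceeded */
    return BLK_LIFETIME_RESERVED;       /* above 11: reserved */
}
```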
+ +The final \field{status} byte is written by the device: either +VIRTIO_BLK_S_OK for success, VIRTIO_BLK_S_IOERR for device or driver +error or VIRTIO_BLK_S_UNSUPP for a request unsupported by device: + +\begin{lstlisting} +#define VIRTIO_BLK_S_OK 0 +#define VIRTIO_BLK_S_IOERR 1 +#define VIRTIO_BLK_S_UNSUPP 2 +\end{lstlisting} + +The status of individual segments is indeterminate when a discard or write zero +command produces VIRTIO_BLK_S_IOERR. A segment may have completed +successfully, failed, or not been processed by the device. + +The following requirements only apply if the VIRTIO_BLK_F_ZONED feature is +negotiated. + +In addition to the request types defined for non-zoned devices, the type of the +request can be a zone report (VIRTIO_BLK_T_ZONE_REPORT), an explicit zone open +(VIRTIO_BLK_T_ZONE_OPEN), a zone close (VIRTIO_BLK_T_ZONE_CLOSE), a zone finish +(VIRTIO_BLK_T_ZONE_FINISH), a zone_append (VIRTIO_BLK_T_ZONE_APPEND), a zone +reset (VIRTIO_BLK_T_ZONE_RESET) or a zone reset all +(VIRTIO_BLK_T_ZONE_RESET_ALL). + +\begin{lstlisting} +#define VIRTIO_BLK_T_ZONE_APPEND 15 +#define VIRTIO_BLK_T_ZONE_REPORT 16 +#define VIRTIO_BLK_T_ZONE_OPEN 18 +#define VIRTIO_BLK_T_ZONE_CLOSE 20 +#define VIRTIO_BLK_T_ZONE_FINISH 22 +#define VIRTIO_BLK_T_ZONE_RESET 24 +#define VIRTIO_BLK_T_ZONE_RESET_ALL 26 +\end{lstlisting} + +Requests of type VIRTIO_BLK_T_OUT, VIRTIO_BLK_T_ZONE_OPEN, +VIRTIO_BLK_T_ZONE_CLOSE, VIRTIO_BLK_T_ZONE_FINISH, VIRTIO_BLK_T_ZONE_APPEND, +VIRTIO_BLK_T_ZONE_RESET or VIRTIO_BLK_T_ZONE_RESET_ALL may be completed by the +device with VIRTIO_BLK_S_OK, VIRTIO_BLK_S_IOERR or VIRTIO_BLK_S_UNSUPP +\field{status}, or, additionally, with VIRTIO_BLK_S_ZONE_INVALID_CMD, +VIRTIO_BLK_S_ZONE_UNALIGNED_WP, VIRTIO_BLK_S_ZONE_OPEN_RESOURCE or +VIRTIO_BLK_S_ZONE_ACTIVE_RESOURCE ZBD-specific status codes. + +Besides the request status, VIRTIO_BLK_T_ZONE_APPEND requests return the +starting sector of the appended data back to the driver. 
For this reason, +the VIRTIO_BLK_T_ZONE_APPEND request has the layout that is extended to have +the \field{append_sector} field to carry this value: + +\begin{lstlisting} +struct virtio_blk_req_za { + le32 type; + le32 reserved; + le64 sector; + u8 data[]; + le64 append_sector; + u8 status; +}; +\end{lstlisting} + +\begin{lstlisting} +#define VIRTIO_BLK_S_ZONE_INVALID_CMD 3 +#define VIRTIO_BLK_S_ZONE_UNALIGNED_WP 4 +#define VIRTIO_BLK_S_ZONE_OPEN_RESOURCE 5 +#define VIRTIO_BLK_S_ZONE_ACTIVE_RESOURCE 6 +\end{lstlisting} + +Requests of the type VIRTIO_BLK_T_ZONE_REPORT are reads and requests of the type +VIRTIO_BLK_T_ZONE_APPEND are writes. VIRTIO_BLK_T_ZONE_OPEN, +VIRTIO_BLK_T_ZONE_CLOSE, VIRTIO_BLK_T_ZONE_FINISH, VIRTIO_BLK_T_ZONE_RESET and +VIRTIO_BLK_T_ZONE_RESET_ALL are non-data requests. + +Zone sector address is a 64-bit address of the first 512-byte sector of the +zone. + +VIRTIO_BLK_T_ZONE_OPEN, VIRTIO_BLK_T_ZONE_CLOSE, VIRTIO_BLK_T_ZONE_FINISH and +VIRTIO_BLK_T_ZONE_RESET requests make the zone operation act on a particular +zone specified by the zone sector address in the \field{sector} of the request. + +VIRTIO_BLK_T_ZONE_RESET_ALL request acts upon all applicable zones of the +device. The \field{sector} value is not used for this request. + +In ZBD standards, the VIRTIO_BLK_T_ZONE_REPORT request belongs to the "Zone +Management Receive" command category and VIRTIO_BLK_T_ZONE_OPEN, +VIRTIO_BLK_T_ZONE_CLOSE, VIRTIO_BLK_T_ZONE_FINISH and +VIRTIO_BLK_T_ZONE_RESET/VIRTIO_BLK_T_ZONE_RESET_ALL requests are categorized as +"Zone Management Send" commands. VIRTIO_BLK_T_ZONE_APPEND is categorized +separately from zone management commands and is the only request that uses +the \field{append_sector} field of \field{virtio_blk_req_za} to return +to the driver the sector at which the data has been appended to the zone. 
+ +VIRTIO_BLK_T_ZONE_REPORT is a read request that returns the information about +the current state of zones on the device starting from the zone containing the +\field{sector} of the request. The report consists of a header followed by zero +or more zone descriptors. + +A zone report reply has the following structure: + +\begin{lstlisting} +struct virtio_blk_zone_report { + le64 nr_zones; + u8 reserved[56]; + struct virtio_blk_zone_descriptor zones[]; +}; +\end{lstlisting} + +The device sets the \field{nr_zones} field in the report header to the number of +fully transferred zone descriptors in the data buffer. + +A zone descriptor has the following structure: + +\begin{lstlisting} +struct virtio_blk_zone_descriptor { + le64 z_cap; + le64 z_start; + le64 z_wp; + u8 z_type; + u8 z_state; + u8 reserved[38]; +}; +\end{lstlisting} + +The zone descriptor field \field{z_type} of \field{virtio_blk_zone_descriptor} +indicates the type of the zone. + +The following zone types are available: + +\begin{lstlisting} +#define VIRTIO_BLK_ZT_CONV 1 +#define VIRTIO_BLK_ZT_SWR 2 +#define VIRTIO_BLK_ZT_SWP 3 +\end{lstlisting} + +Read and write operations into zones with the VIRTIO_BLK_ZT_CONV (Conventional) +type have the same behavior as read and write operations on a regular block +device. Any block in a conventional zone can be read or written at any time and +in any order. + +Zones with VIRTIO_BLK_ZT_SWR can be read randomly, but must be written +sequentially at a certain point in the zone called the Write Pointer (WP). With +every write, the Write Pointer is incremented by the number of sectors written. + +Zones with VIRTIO_BLK_ZT_SWP can be read randomly and should be written +sequentially, similarly to SWR zones. However, SWP zones can accept random write +operations, that is, VIRTIO_BLK_T_OUT requests with a start sector different +from the zone write pointer position. + +The field \field{z_state} of \field{virtio_blk_zone_descriptor} indicates the +state of the device zone. 
+ +The following zone states are available: + +\begin{lstlisting} +#define VIRTIO_BLK_ZS_NOT_WP 0 +#define VIRTIO_BLK_ZS_EMPTY 1 +#define VIRTIO_BLK_ZS_IOPEN 2 +#define VIRTIO_BLK_ZS_EOPEN 3 +#define VIRTIO_BLK_ZS_CLOSED 4 +#define VIRTIO_BLK_ZS_RDONLY 13 +#define VIRTIO_BLK_ZS_FULL 14 +#define VIRTIO_BLK_ZS_OFFLINE 15 +\end{lstlisting} + +Zones of the type VIRTIO_BLK_ZT_CONV are always reported by the device to be in +the VIRTIO_BLK_ZS_NOT_WP state. Zones of the types VIRTIO_BLK_ZT_SWR and +VIRTIO_BLK_ZT_SWP can not transition to the VIRTIO_BLK_ZS_NOT_WP state. + +Zones in VIRTIO_BLK_ZS_EMPTY (Empty), VIRTIO_BLK_ZS_IOPEN (Implicitly Open), +VIRTIO_BLK_ZS_EOPEN (Explicitly Open) and VIRTIO_BLK_ZS_CLOSED (Closed) state +are writable, but zones in VIRTIO_BLK_ZS_RDONLY (Read-Only), VIRTIO_BLK_ZS_FULL +(Full) and VIRTIO_BLK_ZS_OFFLINE (Offline) state are not. The write pointer +value (\field{z_wp}) is not valid for Read-Only, Full and Offline zones. + +The zone descriptor field \field{z_cap} contains the maximum number of 512-byte +sectors that are available to be written with user data when the zone is in the +Empty state. This value shall be less than or equal to the \field{zone_sectors} +value in \field{virtio_blk_zoned_characteristics} structure in the device +configuration space. + +The zone descriptor field \field{z_start} contains the zone sector address. + +The zone descriptor field \field{z_wp} contains the sector address where the +next write operation for this zone should be issued. This value is undefined +for conventional zones and for zones in VIRTIO_BLK_ZS_RDONLY, +VIRTIO_BLK_ZS_FULL and VIRTIO_BLK_ZS_OFFLINE state. + +Depending on their state, zones consume resources as follows: +\begin{itemize} +\item a zone in VIRTIO_BLK_ZS_IOPEN and VIRTIO_BLK_ZS_EOPEN state consumes one + open zone resource and, additionally, + +\item a zone in VIRTIO_BLK_ZS_IOPEN, VIRTIO_BLK_ZS_EOPEN and + VIRTIO_BLK_ZS_CLOSED state consumes one active resource. 
+\end{itemize} + +Attempts for zone transitions that violate zone resource limits must fail with +VIRTIO_BLK_S_ZONE_OPEN_RESOURCE or VIRTIO_BLK_S_ZONE_ACTIVE_RESOURCE +\field{status}. + +Zones in the VIRTIO_BLK_ZS_EMPTY (Empty) state have the write pointer value +equal to the sector address of the zone. In this state, the entire capacity of +the zone is available for writing. A zone can transition from this state to +\begin{itemize} +\item VIRTIO_BLK_ZS_IOPEN when a successful VIRTIO_BLK_T_OUT request or + VIRTIO_BLK_T_ZONE_APPEND with a non-zero data size is received for the zone. + +\item VIRTIO_BLK_ZS_EOPEN when a successful VIRTIO_BLK_T_ZONE_OPEN request is + received for the zone +\end{itemize} + +When a VIRTIO_BLK_T_ZONE_RESET request is issued to an Empty zone, the request +is completed successfully and the zone stays in the VIRTIO_BLK_ZS_EMPTY state. + +Zones in the VIRTIO_BLK_ZS_IOPEN (Implicitly Open) state transition from +this state to +\begin{itemize} +\item VIRTIO_BLK_ZS_EMPTY when a successful VIRTIO_BLK_T_ZONE_RESET request is + received for the zone, + +\item VIRTIO_BLK_ZS_EMPTY when a successful VIRTIO_BLK_T_ZONE_RESET_ALL request + is received by the device, + +\item VIRTIO_BLK_ZS_EOPEN when a successful VIRTIO_BLK_T_ZONE_OPEN request is + received for the zone, + +\item VIRTIO_BLK_ZS_CLOSED when a successful VIRTIO_BLK_T_ZONE_CLOSE request is + received for the zone, + +\item VIRTIO_BLK_ZS_CLOSED implicitly by the device when another zone is + entering the VIRTIO_BLK_ZS_IOPEN or VIRTIO_BLK_ZS_EOPEN state and the number + of currently open zones is at \field{max_open_zones} limit, + +\item VIRTIO_BLK_ZS_FULL when a successful VIRTIO_BLK_T_ZONE_FINISH request is + received for the zone. + +\item VIRTIO_BLK_ZS_FULL when a successful VIRTIO_BLK_T_OUT or + VIRTIO_BLK_T_ZONE_APPEND request that causes the zone to reach its writable + capacity is received for the zone. 
+\end{itemize} + +Zones in the VIRTIO_BLK_ZS_EOPEN (Explicitly Open) state transition from +this state to +\begin{itemize} +\item VIRTIO_BLK_ZS_EMPTY when a successful VIRTIO_BLK_T_ZONE_RESET request is + received for the zone, + +\item VIRTIO_BLK_ZS_EMPTY when a successful VIRTIO_BLK_T_ZONE_RESET_ALL request + is received by the device, + +\item VIRTIO_BLK_ZS_EMPTY when a successful VIRTIO_BLK_T_ZONE_CLOSE request is + received for the zone and the write pointer of the zone has the value equal + to the start sector of the zone, + +\item VIRTIO_BLK_ZS_CLOSED when a successful VIRTIO_BLK_T_ZONE_CLOSE request is + received for the zone and the zone write pointer is larger than the start + sector of the zone, + +\item VIRTIO_BLK_ZS_FULL when a successful VIRTIO_BLK_T_ZONE_FINISH request is + received for the zone, + +\item VIRTIO_BLK_ZS_FULL when a successful VIRTIO_BLK_T_OUT or + VIRTIO_BLK_T_ZONE_APPEND request that causes the zone to reach its writable + capacity is received for the zone. +\end{itemize} + +When a VIRTIO_BLK_T_ZONE_OPEN request is issued to an Explicitly Open zone, the +request is completed successfully and the zone stays in the VIRTIO_BLK_ZS_EOPEN +state. + +Zones in the VIRTIO_BLK_ZS_CLOSED (Closed) state transition from this state +to +\begin{itemize} +\item VIRTIO_BLK_ZS_EMPTY when a successful VIRTIO_BLK_T_ZONE_RESET request is + received for the zone, + +\item VIRTIO_BLK_ZS_EMPTY when a successful VIRTIO_BLK_T_ZONE_RESET_ALL request + is received by the device, + +\item VIRTIO_BLK_ZS_IOPEN when a successful VIRTIO_BLK_T_OUT request or + VIRTIO_BLK_T_ZONE_APPEND with a non-zero data size is received for the zone. + +\item VIRTIO_BLK_ZS_EOPEN when a successful VIRTIO_BLK_T_ZONE_OPEN request is + received for the zone, +\end{itemize} + +When a VIRTIO_BLK_T_ZONE_CLOSE request is issued to a Closed zone, the request +is completed successfully and the zone stays in the VIRTIO_BLK_ZS_CLOSED state. 
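The transitions listed above for zones in the Empty, Implicitly Open, Explicitly Open and Closed states can be summarized, for the single-zone management requests only, by the following non-normative sketch. The function name and the ZONE_NEXT_UNSPECIFIED marker are hypothetical. The sketch assumes the request completes successfully, ignores the zone resource limits, and returns the marker for any combination that the lists above do not mention.

```c
#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_BLK_T_ZONE_OPEN    18
#define VIRTIO_BLK_T_ZONE_CLOSE   20
#define VIRTIO_BLK_T_ZONE_FINISH  22
#define VIRTIO_BLK_T_ZONE_RESET   24

#define VIRTIO_BLK_ZS_EMPTY   1
#define VIRTIO_BLK_ZS_IOPEN   2
#define VIRTIO_BLK_ZS_EOPEN   3
#define VIRTIO_BLK_ZS_CLOSED  4
#define VIRTIO_BLK_ZS_FULL    14

/* Hypothetical marker: transition not listed in the text above. */
#define ZONE_NEXT_UNSPECIFIED (-1)

/*
 * Next state of a zone after a successful zone management request.
 * wp_at_start is true when the write pointer equals the zone start sector.
 */
static int zone_next_state(uint8_t state, uint32_t type, bool wp_at_start)
{
    if (state != VIRTIO_BLK_ZS_EMPTY && state != VIRTIO_BLK_ZS_IOPEN &&
        state != VIRTIO_BLK_ZS_EOPEN && state != VIRTIO_BLK_ZS_CLOSED)
        return ZONE_NEXT_UNSPECIFIED;

    switch (type) {
    case VIRTIO_BLK_T_ZONE_RESET:
        return VIRTIO_BLK_ZS_EMPTY;     /* an Empty zone stays Empty */
    case VIRTIO_BLK_T_ZONE_OPEN:
        return VIRTIO_BLK_ZS_EOPEN;     /* an Explicitly Open zone stays so */
    case VIRTIO_BLK_T_ZONE_CLOSE:
        if (state == VIRTIO_BLK_ZS_IOPEN || state == VIRTIO_BLK_ZS_CLOSED)
            return VIRTIO_BLK_ZS_CLOSED;
        if (state == VIRTIO_BLK_ZS_EOPEN)
            return wp_at_start ? VIRTIO_BLK_ZS_EMPTY : VIRTIO_BLK_ZS_CLOSED;
        return ZONE_NEXT_UNSPECIFIED;   /* Empty + CLOSE is not listed */
    case VIRTIO_BLK_T_ZONE_FINISH:
        if (state == VIRTIO_BLK_ZS_IOPEN || state == VIRTIO_BLK_ZS_EOPEN)
            return VIRTIO_BLK_ZS_FULL;
        return ZONE_NEXT_UNSPECIFIED;   /* not listed for Empty or Closed */
    default:
        return ZONE_NEXT_UNSPECIFIED;
    }
}
```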
+ +Zones in the VIRTIO_BLK_ZS_FULL (Full) state transition from this state to +VIRTIO_BLK_ZS_EMPTY when a successful VIRTIO_BLK_T_ZONE_RESET request is +received for the zone or a successful VIRTIO_BLK_T_ZONE_RESET_ALL request is +received by the device. + +When a VIRTIO_BLK_T_ZONE_FINISH request is issued to a Full zone, the request +is completed successfully and the zone stays in the VIRTIO_BLK_ZS_FULL state. + +The device may automatically transition zones to VIRTIO_BLK_ZS_RDONLY +(Read-Only) or VIRTIO_BLK_ZS_OFFLINE (Offline) state from any other state. The +device may also automatically transition zones in the Read-Only state to the +Offline state. Zones in the Offline state may not transition to any other state. +Such automatic transitions usually indicate hardware failures. The previously +written data may only be read from zones in the Read-Only state. Zones in the +Offline state can not be read or written. + +VIRTIO_BLK_S_ZONE_UNALIGNED_WP is set by the device when the request received +from the driver attempts to perform a write to an SWR zone and at least one of +the following conditions is met: + +\begin{itemize} +\item the starting sector of the request is not equal to the current value of + the zone write pointer. + +\item the ending sector of the request data multiplied by 512 is not a multiple + of the value reported by the device in the field \field{write_granularity} + in the device configuration space. +\end{itemize} + +VIRTIO_BLK_S_ZONE_OPEN_RESOURCE is set by the device when a zone operation or +write request received from the driver can not be handled without exceeding the +\field{max_open_zones} limit value reported by the device in the configuration +space. + +VIRTIO_BLK_S_ZONE_ACTIVE_RESOURCE is set by the device when a zone operation or +write request received from the driver can not be handled without exceeding the +\field{max_active_zones} limit value reported by the device in the configuration +space. 
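The two VIRTIO_BLK_S_ZONE_UNALIGNED_WP conditions given above for writes to an SWR zone might be checked on the device side roughly as follows. This is a sketch under stated assumptions: the function name is hypothetical, "ending sector" is taken to mean the first sector past the request data, and a \field{write_granularity} of zero is treated as imposing no alignment constraint. Overflow of the byte computation is not handled.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true when a write of nr_sectors 512-byte sectors starting at
 * sector should complete with VIRTIO_BLK_S_ZONE_UNALIGNED_WP.
 * z_wp is the current write pointer of the target SWR zone and
 * write_granularity is the configuration space value, in bytes.
 */
static bool swr_write_is_unaligned(uint64_t sector, uint64_t nr_sectors,
                                   uint64_t z_wp, uint32_t write_granularity)
{
    uint64_t end_bytes;

    /* Condition 1: the write does not start at the write pointer. */
    if (sector != z_wp)
        return true;

    /* Condition 2: the end of the data is not granularity-aligned. */
    end_bytes = (sector + nr_sectors) * 512;
    return write_granularity != 0 && end_bytes % write_granularity != 0;
}
```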
+ +A zone transition request that leads to both the \field{max_open_zones} and the +\field{max_active_zones} limits to be exceeded is terminated by the device with +VIRTIO_BLK_S_ZONE_ACTIVE_RESOURCE \field{status} value. + +The device reports all other error conditions related to zoned block model +operation by setting the VIRTIO_BLK_S_ZONE_INVALID_CMD value in +\field{status} of \field{virtio_blk_req} structure. + +\drivernormative{\subsubsection}{Device Operation}{Device Types / Block Device / Device Operation} + +The driver SHOULD check if the content of the \field{capacity} field has +changed upon receiving a configuration change notification. + +A driver MUST NOT submit a request which would cause a read or write +beyond \field{capacity}. + +A driver SHOULD accept the VIRTIO_BLK_F_RO feature if offered. + +A driver MUST set \field{sector} to 0 for a VIRTIO_BLK_T_FLUSH request. +A driver SHOULD NOT include any data in a VIRTIO_BLK_T_FLUSH request. + +The length of \field{data} MUST be a multiple of 512 bytes for VIRTIO_BLK_T_IN +and VIRTIO_BLK_T_OUT requests. + +The length of \field{data} MUST be a multiple of the size of struct +virtio_blk_discard_write_zeroes for VIRTIO_BLK_T_DISCARD, +VIRTIO_BLK_T_SECURE_ERASE and VIRTIO_BLK_T_WRITE_ZEROES requests. + +The length of \field{data} MUST be 20 bytes for VIRTIO_BLK_T_GET_ID requests. + +VIRTIO_BLK_T_DISCARD requests MUST NOT contain more than +\field{max_discard_seg} struct virtio_blk_discard_write_zeroes segments in +\field{data}. + +VIRTIO_BLK_T_SECURE_ERASE requests MUST NOT contain more than +\field{max_secure_erase_seg} struct virtio_blk_discard_write_zeroes segments in +\field{data}. + +VIRTIO_BLK_T_WRITE_ZEROES requests MUST NOT contain more than +\field{max_write_zeroes_seg} struct virtio_blk_discard_write_zeroes segments in +\field{data}. 
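The \field{data} length requirements above lend themselves to a simple driver-side check before a request is queued. The following is a sketch only: the function name and the BLK_SEGMENT_SIZE constant are hypothetical, with the constant assumed to equal the size of struct virtio_blk_discard_write_zeroes (le64 + le32 + le32 = 16 bytes). The caller is assumed to pass the limit matching the request type (\field{max_discard_seg}, \field{max_secure_erase_seg} or \field{max_write_zeroes_seg}) as max_seg.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define VIRTIO_BLK_T_IN            0
#define VIRTIO_BLK_T_OUT           1
#define VIRTIO_BLK_T_GET_ID        8
#define VIRTIO_BLK_T_DISCARD       11
#define VIRTIO_BLK_T_WRITE_ZEROES  13
#define VIRTIO_BLK_T_SECURE_ERASE  14

/* Hypothetical: sizeof(struct virtio_blk_discard_write_zeroes). */
#define BLK_SEGMENT_SIZE 16

/* Check the length of the data portion of a request against the rules above. */
static bool blk_data_len_ok(uint32_t type, size_t len, uint32_t max_seg)
{
    switch (type) {
    case VIRTIO_BLK_T_IN:
    case VIRTIO_BLK_T_OUT:
        return len % 512 == 0;
    case VIRTIO_BLK_T_GET_ID:
        return len == 20;
    case VIRTIO_BLK_T_DISCARD:
    case VIRTIO_BLK_T_SECURE_ERASE:
    case VIRTIO_BLK_T_WRITE_ZEROES:
        return len % BLK_SEGMENT_SIZE == 0 &&
               len / BLK_SEGMENT_SIZE <= max_seg;
    default:
        return true;    /* no length rule is stated above for other types */
    }
}
```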
+ +If the VIRTIO_BLK_F_CONFIG_WCE feature is negotiated, the driver MAY +switch to writethrough or writeback mode by writing respectively 0 and +1 to the \field{writeback} field. After writing a 0 to \field{writeback}, +the driver MUST NOT assume that any volatile writes have been committed +to persistent device backend storage. + +The \field{unmap} bit MUST be zero for discard commands. The driver +MUST NOT assume anything about the data returned by read requests after +a range of sectors has been discarded. + +A driver MUST NOT assume that individual segments in a multi-segment +VIRTIO_BLK_T_DISCARD or VIRTIO_BLK_T_WRITE_ZEROES request completed +successfully, failed, or were processed by the device at all if the request +failed with VIRTIO_BLK_S_IOERR. + +The following requirements only apply if the VIRTIO_BLK_F_ZONED feature is +negotiated. + +A zone sector address provided by the driver MUST be a multiple of 512 bytes. + +When forming a VIRTIO_BLK_T_ZONE_REPORT request, the driver MUST set a sector +within the sector range of the starting zone to report to \field{sector} field. +It MAY be a sector that is different from the zone sector address. + +In VIRTIO_BLK_T_ZONE_OPEN, VIRTIO_BLK_T_ZONE_CLOSE, VIRTIO_BLK_T_ZONE_FINISH and +VIRTIO_BLK_T_ZONE_RESET requests, the driver MUST set \field{sector} field to +point at the first sector in the target zone. + +In VIRTIO_BLK_T_ZONE_RESET_ALL request, the driver MUST set the field +\field{sector} to zero value. + +The \field{sector} field of the VIRTIO_BLK_T_ZONE_APPEND request MUST specify +the zone sector address of the zone to which data is to be appended at the +position of the write pointer. The size of the data that is appended MUST be a +multiple of \field{write_granularity} bytes and MUST NOT exceed the +\field{max_append_sectors} value provided by the device in +\field{virtio_blk_zoned_characteristics} configuration space structure. 
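The VIRTIO_BLK_T_ZONE_APPEND requirements above could be verified by a driver before submission along these lines. The function and parameter names are hypothetical; zone_start is assumed to be the zone sector address of the target zone, and a \field{write_granularity} of zero is conservatively rejected by this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Check a zone append request: sector must be the zone sector address,
 * data_bytes must be a multiple of write_granularity (in bytes) and must
 * not exceed max_append_sectors (in 512-byte sectors).
 */
static bool zone_append_ok(uint64_t sector, uint64_t zone_start,
                           uint64_t data_bytes, uint32_t write_granularity,
                           uint32_t max_append_sectors)
{
    if (sector != zone_start)
        return false;
    if (write_granularity == 0 || data_bytes % write_granularity != 0)
        return false;
    return data_bytes <= (uint64_t)max_append_sectors * 512;
}
```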
+ +Upon a successful completion of a VIRTIO_BLK_T_ZONE_APPEND request, the driver +MAY read the starting sector location of the written data from the request +field \field{append_sector}. + +All VIRTIO_BLK_T_OUT requests issued by the driver to sequential zones and +VIRTIO_BLK_T_ZONE_APPEND requests MUST have: + +\begin{enumerate} +\item the data size that is a multiple of the number of bytes reported + by the device in the field \field{write_granularity} in the + \field{virtio_blk_zoned_characteristics} configuration space structure. + +\item the value of the field \field{sector} that is a multiple of the number of + bytes reported by the device in the field \field{write_granularity} in the + \field{virtio_blk_zoned_characteristics} configuration space structure. + +\item the data size that will not exceed the writable zone capacity when its + value is added to the current value of the write pointer of the zone. + +\end{enumerate} + +\devicenormative{\subsubsection}{Device Operation}{Device Types / Block Device / Device Operation} + +The device MAY change the content of the \field{capacity} field during +operation of the device. When this happens, the device SHOULD trigger a +configuration change notification. + +A device MUST set the \field{status} byte to VIRTIO_BLK_S_IOERR +for a write request if the VIRTIO_BLK_F_RO feature is offered, and MUST NOT +write any data. + +The device MUST set the \field{status} byte to VIRTIO_BLK_S_UNSUPP for +discard, secure erase and write zeroes commands if any unknown flag is set. +Furthermore, the device MUST set the \field{status} byte to +VIRTIO_BLK_S_UNSUPP for discard commands if the \field{unmap} flag is set. + +For discard commands, the device MAY deallocate the specified range of +sectors in the device backend storage. + +For write zeroes commands, if the \field{unmap} is set, the device MAY +deallocate the specified range of sectors in the device backend storage, +as if the discard command had been sent. 
After a write zeroes command +is completed, reads of the specified ranges of sectors MUST return +zeroes. This is true independent of whether \field{unmap} was set or clear. + +The device SHOULD clear the \field{write_zeroes_may_unmap} field of the +virtio configuration space if and only if a write zeroes request cannot +result in deallocating one or more sectors. The device MAY change the +content of the field during operation of the device; when this happens, +the device SHOULD trigger a configuration change notification. + +A write is considered volatile when it is submitted; the contents of +sectors covered by a volatile write are undefined in persistent device +backend storage until the write becomes stable. A write becomes stable +once it is completed and one or more of the following conditions is true: + +\begin{enumerate} +\item\label{item:flush1} neither VIRTIO_BLK_F_CONFIG_WCE nor + VIRTIO_BLK_F_FLUSH feature were negotiated, but VIRTIO_BLK_F_FLUSH was + offered by the device; + +\item\label{item:flush2} the VIRTIO_BLK_F_CONFIG_WCE feature was negotiated and the + \field{writeback} field in configuration space was 0 \textbf{all the time between + the submission of the write and its completion}; + +\item\label{item:flush3} a VIRTIO_BLK_T_FLUSH request is sent \textbf{after the write is + completed} and is completed itself. +\end{enumerate} + +If the device is backed by persistent storage, the device MUST ensure that +stable writes are committed to it, before reporting completion of the write +(cases~\ref{item:flush1} and~\ref{item:flush2}) or the flush +(case~\ref{item:flush3}). Failure to do so can cause data loss +in case of a crash. + +If the driver changes \field{writeback} between the submission of the write +and its completion, the write could be either volatile or stable when +its completion is reported; in other words, the exact behavior is undefined. 
+ +% According to the device requirements for device initialization: +% Offer(CONFIG_WCE) => Offer(FLUSH). +% +% After reversing the implication: +% not Offer(FLUSH) => not Offer(CONFIG_WCE). + +If VIRTIO_BLK_F_FLUSH was not offered by the + device\footnote{Note that in this case, according to + \ref{devicenormative:Device Types / Block Device / Device Initialization}, + the device will not have offered VIRTIO_BLK_F_CONFIG_WCE either.}, the +device MAY also commit writes to persistent device backend storage before +reporting their completion. Unlike case~\ref{item:flush1}, however, this +is not an absolute requirement of the specification. + +\begin{note} + An implementation that does not offer VIRTIO_BLK_F_FLUSH and does not commit + completed writes will not be resilient to data loss in case of crashes. + Not offering VIRTIO_BLK_F_FLUSH is an absolute requirement + for implementations that do not wish to be safe against such data losses. +\end{note} + +If the device is backed by storage providing lifetime metrics (such as eMMC +or UFS persistent storage), the device SHOULD offer the VIRTIO_BLK_F_LIFETIME +flag. The flag MUST NOT be offered if the device is backed by storage for which +the lifetime metrics described in this document cannot be obtained or for which +such metrics have no useful meaning. If the metrics are offered, the device MUST NOT +send any reserved values, as defined in this specification. + +\begin{note} + The device lifetime metrics \field{pre_eol_info}, \field{device_lifetime_est_a} + and \field{device_lifetime_est_b} are discussed in the JESD84-B50 specification. + + The complete JESD84-B50 is available at the JEDEC website (https://www.jedec.org) + pursuant to JEDEC's licensing terms and conditions. This information is provided to + simplify passthrough implementations from eMMC devices. 
+\end{note} + +If the VIRTIO_BLK_F_ZONED feature is not negotiated, the device MUST reject +VIRTIO_BLK_T_ZONE_REPORT, VIRTIO_BLK_T_ZONE_OPEN, VIRTIO_BLK_T_ZONE_CLOSE, +VIRTIO_BLK_T_ZONE_FINISH, VIRTIO_BLK_T_ZONE_APPEND, VIRTIO_BLK_T_ZONE_RESET and +VIRTIO_BLK_T_ZONE_RESET_ALL requests with VIRTIO_BLK_S_UNSUPP status. + +The following device requirements only apply if the VIRTIO_BLK_F_ZONED feature +is negotiated. + +If a request of type VIRTIO_BLK_T_ZONE_OPEN, VIRTIO_BLK_T_ZONE_CLOSE, +VIRTIO_BLK_T_ZONE_FINISH or VIRTIO_BLK_T_ZONE_RESET is issued for a Conventional +zone (type VIRTIO_BLK_ZT_CONV), the device MUST complete the request with +VIRTIO_BLK_S_ZONE_INVALID_CMD \field{status}. + +If the zone specified by the VIRTIO_BLK_T_ZONE_APPEND request is not a SWR zone, +then the request SHALL be completed with VIRTIO_BLK_S_ZONE_INVALID_CMD +\field{status}. + +The device handles a VIRTIO_BLK_T_ZONE_OPEN request by attempting to change the +state of the zone with the \field{sector} address to VIRTIO_BLK_ZS_EOPEN. If the +transition to this state can not be performed, the request MUST be completed +with VIRTIO_BLK_S_ZONE_INVALID_CMD \field{status}. If, while processing this +request, the available zone resources are insufficient, then the zone state does +not change and the request MUST be completed with +VIRTIO_BLK_S_ZONE_OPEN_RESOURCE or VIRTIO_BLK_S_ZONE_ACTIVE_RESOURCE value in +the field \field{status}. + +The device handles a VIRTIO_BLK_T_ZONE_CLOSE request by attempting to change the +state of the zone with the \field{sector} address to VIRTIO_BLK_ZS_CLOSED. If +the transition to this state can not be performed, the request MUST be completed +with VIRTIO_BLK_S_ZONE_INVALID_CMD value in the field \field{status}. + +The device handles a VIRTIO_BLK_T_ZONE_FINISH request by attempting to change +the state of the zone with the \field{sector} address to VIRTIO_BLK_ZS_FULL. 
If +the transition to this state can not be performed, the zone state does not +change and the request MUST be completed with VIRTIO_BLK_S_ZONE_INVALID_CMD +value in the field \field{status}. + +The device handles a VIRTIO_BLK_T_ZONE_RESET request by attempting to change the +state of the zone with the \field{sector} address to VIRTIO_BLK_ZS_EMPTY state. +If the transition to this state can not be performed, the zone state does not +change and the request MUST be completed with VIRTIO_BLK_S_ZONE_INVALID_CMD +value in the field \field{status}. + +The device handles a VIRTIO_BLK_T_ZONE_RESET_ALL request by transitioning all +sequential device zones in VIRTIO_BLK_ZS_IOPEN, VIRTIO_BLK_ZS_EOPEN, +VIRTIO_BLK_ZS_CLOSED and VIRTIO_BLK_ZS_FULL state to VIRTIO_BLK_ZS_EMPTY state. + +Upon receiving a VIRTIO_BLK_T_ZONE_APPEND request or a VIRTIO_BLK_T_OUT +request issued to a SWR zone in VIRTIO_BLK_ZS_EMPTY or VIRTIO_BLK_ZS_CLOSED +state, the device attempts to perform the transition of the zone to +VIRTIO_BLK_ZS_IOPEN state before writing data. This transition may fail due to +insufficient open and/or active zone resources available on the device. In this +case, the request MUST be completed with VIRTIO_BLK_S_ZONE_OPEN_RESOURCE or +VIRTIO_BLK_S_ZONE_ACTIVE_RESOURCE value in the \field{status}. + +If the \field{sector} field in the VIRTIO_BLK_T_ZONE_APPEND request does not +specify the lowest sector for a zone, then the request SHALL be completed with +VIRTIO_BLK_S_ZONE_INVALID_CMD value in \field{status}. + +A VIRTIO_BLK_T_ZONE_APPEND request or a VIRTIO_BLK_T_OUT request that has the +data range that exceeds the remaining writable capacity for the zone, then the +request SHALL be completed with VIRTIO_BLK_S_ZONE_INVALID_CMD value in +\field{status}. 
+ +If a request of the type VIRTIO_BLK_T_ZONE_APPEND is completed with +VIRTIO_BLK_S_OK status, the field \field{append_sector} in +\field{virtio_blk_req_za} MUST be set by the device to contain the first sector +of the data written to the zone. + +If a request of the type VIRTIO_BLK_T_ZONE_APPEND is completed with a status +other than VIRTIO_BLK_S_OK, the value of \field{append_sector} field in +\field{virtio_blk_req_za} is undefined. + +A VIRTIO_BLK_T_ZONE_APPEND request that has the data size that exceeds +\field{max_append_sectors} configuration space value, then, +\begin{itemize} +\item if \field{max_append_sectors} configuration space value is reported as + zero by the device, the request SHALL be completed with VIRTIO_BLK_S_UNSUPP + \field{status}. + +\item if \field{max_append_sectors} configuration space value is reported as + a non-zero value by the device, the request SHALL be completed with + VIRTIO_BLK_S_ZONE_INVALID_CMD \field{status}. +\end{itemize} + +If a VIRTIO_BLK_T_ZONE_APPEND request, a VIRTIO_BLK_T_IN request or a +VIRTIO_BLK_T_OUT request issued to a SWR zone has the range that has sectors in +more than one zone, then the request SHALL be completed with +VIRTIO_BLK_S_ZONE_INVALID_CMD value in the field \field{status}. + +A VIRTIO_BLK_T_OUT request that has the \field{sector} value that is not aligned +with the write pointer for the zone, then the request SHALL be completed with +VIRTIO_BLK_S_ZONE_UNALIGNED_WP value in the field \field{status}. + +In order to avoid resource-related errors while opening zones implicitly, the +device MAY automatically transition zones in VIRTIO_BLK_ZS_IOPEN state to +VIRTIO_BLK_ZS_CLOSED state. + +All VIRTIO_BLK_T_OUT requests or VIRTIO_BLK_T_ZONE_APPEND requests issued +to a zone in the VIRTIO_BLK_ZS_RDONLY state SHALL be completed with +VIRTIO_BLK_S_ZONE_INVALID_CMD \field{status}. 
+ +All requests issued to a zone in the VIRTIO_BLK_ZS_OFFLINE state SHALL be +completed with VIRTIO_BLK_S_ZONE_INVALID_CMD value in the field \field{status}. + +The device MUST consider the sectors that are read between the write pointer +position of a zone and the end of the last sector of the zone as unwritten data. +The sectors between the write pointer position and the end of the last sector +within the zone capacity during VIRTIO_BLK_T_ZONE_FINISH request processing are +also considered unwritten data. + +When unwritten data is present in the sector range of a read request, the device +MUST process this data in one of the following ways - + +\begin{enumerate} +\item Fill the unwritten data with a device-specific byte pattern. The +configuration, control and reporting of this byte pattern is beyond the scope +of this standard. This is the preferred approach. + +\item Fail the request. Depending on the driver implementation, this may prevent +the device from becoming operational. +\end{enumerate} + +If both the VIRTIO_BLK_F_ZONED and VIRTIO_BLK_F_SECURE_ERASE features are +negotiated, then + +\begin{enumerate} +\item the field \field{secure_erase_sector_alignment} in the configuration space +of the device MUST be a multiple of \field{zone_sectors} value reported in the +device configuration space. + +\item the data size in VIRTIO_BLK_T_SECURE_ERASE requests MUST be a multiple of +\field{zone_sectors} value in the device configuration space. +\end{enumerate} + +The device MUST handle a VIRTIO_BLK_T_SECURE_ERASE request in the same way it +handles VIRTIO_BLK_T_ZONE_RESET request for the zone range specified in the +VIRTIO_BLK_T_SECURE_ERASE request. 
+ +\subsubsection{Legacy Interface: Device Operation}\label{sec:Device Types / Block Device / Device Operation / Legacy Interface: Device Operation} +When using the legacy interface, transitional devices and drivers +MUST format the fields in struct virtio_blk_req +according to the native endian of the guest rather than +(necessarily when not using the legacy interface) little-endian. + +When using the legacy interface, transitional drivers +SHOULD ignore the used length values. +\begin{note} +Historically, some devices put the total descriptor length, +or the total length of device-writable buffers there, +even when only the status byte was actually written. +\end{note} + +The \field{reserved} field was previously called \field{ioprio}. \field{ioprio} +is a hint about the relative priorities of requests to the device: +higher numbers indicate more important requests. + +\begin{lstlisting} +#define VIRTIO_BLK_T_FLUSH_OUT 5 +\end{lstlisting} + +The command VIRTIO_BLK_T_FLUSH_OUT was a synonym for VIRTIO_BLK_T_FLUSH; +a driver MUST treat it as a VIRTIO_BLK_T_FLUSH command. + +\begin{lstlisting} +#define VIRTIO_BLK_T_BARRIER 0x80000000 +\end{lstlisting} + +If the device has VIRTIO_BLK_F_BARRIER +feature the high bit (VIRTIO_BLK_T_BARRIER) indicates that this +request acts as a barrier and that all preceding requests SHOULD be +complete before this one, and all following requests SHOULD NOT be +started until this is complete. + +\begin{note} A barrier does not flush +caches in the underlying backend device in host, and thus does not +serve as data consistency guarantee. Only a VIRTIO_BLK_T_FLUSH request +does that. +\end{note} + +Some older legacy devices did not commit completed writes to persistent +device backend storage when VIRTIO_BLK_F_FLUSH was offered but not +negotiated. In order to work around this, the driver MAY set the +\field{writeback} to 0 (if available) or it MAY send an explicit flush +request after every completed write. 
+ +If the device has VIRTIO_BLK_F_SCSI feature, it can also support +scsi packet command requests, each of these requests is of form: + +\begin{lstlisting} +/* All fields are in guest's native endian. */ +struct virtio_scsi_pc_req { + u32 type; + u32 ioprio; + u64 sector; + u8 cmd[]; + u8 data[][512]; +#define SCSI_SENSE_BUFFERSIZE 96 + u8 sense[SCSI_SENSE_BUFFERSIZE]; + u32 errors; + u32 data_len; + u32 sense_len; + u32 residual; + u8 status; +}; +\end{lstlisting} + +A request type can also be a scsi packet command (VIRTIO_BLK_T_SCSI_CMD or +VIRTIO_BLK_T_SCSI_CMD_OUT). The two types are equivalent, the device +does not distinguish between them: + +\begin{lstlisting} +#define VIRTIO_BLK_T_SCSI_CMD 2 +#define VIRTIO_BLK_T_SCSI_CMD_OUT 3 +\end{lstlisting} + +The \field{cmd} field is only present for scsi packet command requests, +and indicates the command to perform. This field MUST reside in a +single, separate device-readable buffer; command length can be derived +from the length of this buffer. + +Note that these first three (four for scsi packet commands) +fields are always device-readable: \field{data} is either device-readable +or device-writable, depending on the request. The size of the read or +write can be derived from the total size of the request buffers. + +\field{sense} is only present for scsi packet command requests, +and indicates the buffer for scsi sense data. + +\field{data_len} is only present for scsi packet command +requests, this field is deprecated, and SHOULD be ignored by the +driver. Historically, devices copied data length there. + +\field{sense_len} is only present for scsi packet command +requests and indicates the number of bytes actually written to +the \field{sense} buffer. + +\field{residual} field is only present for scsi packet command +requests and indicates the residual size, calculated as data +length - number of bytes actually transferred. 
+ +\subsubsection{Legacy Interface: Framing Requirements}\label{sec:Device Types / Block Device / Legacy Interface: Framing Requirements} + +When using legacy interfaces, transitional drivers which have not +negotiated VIRTIO_F_ANY_LAYOUT: + +\begin{itemize} +\item MUST use a single 8-byte descriptor containing \field{type}, + \field{reserved} and \field{sector}, followed by descriptors + for \field{data}, then finally a separate 1-byte descriptor + for \field{status}. + +\item For SCSI commands there are additional constraints. + \field{sense} MUST reside in a + single separate device-writable descriptor of size 96 bytes, + and \field{errors}, \field{data_len}, \field{sense_len} and + \field{residual} MUST reside a single separate + device-writable descriptor. +\end{itemize} + +See \ref{sec:Basic Facilities of a Virtio Device / Virtqueues / Message Framing}. + + -- 2.26.2
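
Reader's note (not part of the patch): the driver requirements on the length of \field{data} moved by this patch can be read as a simple pre-submission check. The following is only a minimal sketch under stated assumptions. The function name `data_len_ok` and its `max_seg` parameter are hypothetical. The request type values and the 16-byte size of struct virtio_blk_discard_write_zeroes are assumptions taken from the request definitions elsewhere in the block device section, not from the text above. Treating a non-empty flush as invalid is stricter than the SHOULD NOT in the text.

```python
# Hypothetical driver-side length check; all names here are illustrative.
SECTOR_SIZE = 512
GET_ID_SIZE = 20
DWZ_SEGMENT_SIZE = 16  # assumed sizeof(struct virtio_blk_discard_write_zeroes)

# Request type values, assumed from the block device request definitions.
VIRTIO_BLK_T_IN = 0
VIRTIO_BLK_T_OUT = 1
VIRTIO_BLK_T_FLUSH = 4
VIRTIO_BLK_T_GET_ID = 8
VIRTIO_BLK_T_DISCARD = 11
VIRTIO_BLK_T_WRITE_ZEROES = 13
VIRTIO_BLK_T_SECURE_ERASE = 14

SEGMENTED_TYPES = (
    VIRTIO_BLK_T_DISCARD,
    VIRTIO_BLK_T_WRITE_ZEROES,
    VIRTIO_BLK_T_SECURE_ERASE,
)


def data_len_ok(req_type, data_len, max_seg=None):
    """Return True if data_len meets the driver length rules for req_type.

    max_seg stands in for max_discard_seg, max_write_zeroes_seg or
    max_secure_erase_seg, depending on the request type; None skips
    the segment-count check.
    """
    if req_type in (VIRTIO_BLK_T_IN, VIRTIO_BLK_T_OUT):
        # Reads and writes carry a whole number of 512-byte units.
        return data_len % SECTOR_SIZE == 0
    if req_type == VIRTIO_BLK_T_GET_ID:
        return data_len == GET_ID_SIZE
    if req_type == VIRTIO_BLK_T_FLUSH:
        # The text says SHOULD NOT include data; this sketch is strict.
        return data_len == 0
    if req_type in SEGMENTED_TYPES:
        if data_len % DWZ_SEGMENT_SIZE != 0:
            return False
        segments = data_len // DWZ_SEGMENT_SIZE
        return max_seg is None or segments <= max_seg
    # Other request types are outside the scope of this sketch.
    return True
```

A driver would call such a check with the matching limit read from configuration space, for example `data_len_ok(VIRTIO_BLK_T_DISCARD, len(buf), max_seg=max_discard_seg)`, before placing the request on a queue.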