OASIS Mailing List ArchivesView the OASIS mailing list archive below
or browse/search using MarkMail.

 


Help: OASIS Mailing Lists Help | MarkMail Help

virtio-dev message

[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]


Subject: Re: [virtio-dev] [RFC PATCH] virtio-blk: add zoned block device specification


On Thu, 2022-06-02 at 08:23 +0100, Stefan Hajnoczi wrote:
> (Resending from my mailing list subscriber address so this makes it onto
> the list.)
> 
> On Thu, 2 Jun 2022 at 00:56, Dmitry Fomichev <dmitry.fomichev@wdc.com> wrote:
> > 
> > Introduce support for Zoned Block Devices to virtio.
> > 
> > Zoned Block Devices (ZBDs) aim to achieve a better capacity, latency
> > and/or cost characteristics compared to commonly available block
> > devices by getting the entire LBA space of the device divided to block
> > regions that are much larger than the LBA size. These regions are
> > called zones and they can only be written sequentially. More details
> > about ZBDs can be found at
> > 
> > https://zonedstorage.io/docs/introduction/zoned-storageÂ;.
> > 
> > In its current form, the virtio protocol for block devices (virtio-blk)
> > is not aware of ZBDs but it allows the guest to successfully scan a
> > host-managed drive provided by the host. As the result, the
> > host-managed drive appears at the guest as a regular drive that will
> > operate erroneously under the most common write workloads.
> > 
> > To fix this, the virtio-blk protocol needs to be extended to add the
> > capabilities to convey the zone characteristics of host ZBDs to the
> > guest and to provide the support for ZBD-specific commands - Report
> > Zones, four zone operations and (optionally) Zone Append. The proposed
> > standard extension aims to provide this functionality.
> > 
> > This patch extends the virtio-blk section of virtio specification with
> > the minimum set of requirements that are necessary to support ZBDs.
> > The resulting device model is a subset of the models defined in ZAC/ZBC
> > and ZNS standards documents. The included functionality mirrors
> > the existing Linux kernel block layer ZBD support and should be
> > sufficient to handle the host-managed and host-aware HDDs that are on
> > the market today as well as ZNS SSDs that are entering the market at
> > the moment of this patch submission.
> > 
> > I have developed a proof of concept patch series that adds ZBD support
> > to virtio-blk Linux kernel driver by implementing the protocol
> > extensions defined in the spec patch. I would like to receive the
> > initial feedback on this specification patch before posting that series
> > to the block LKML.
> > 
> > I would like to thank the following people for their useful feedback
> > and suggestions while working on the initial iterations of this patch.
> > 
> > Damien Le Moal <damien.lemoal@opensource.wdc.com>
> > Matias BjÃrling <Matias.Bjorling@wdc.com>
> > Niklas Cassel <Niklas.Cassel@wdc.com>
> > Hans Holmberg <Hans.Holmberg@wdc.com>
> > 
> > Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
> > ---
> > Âcontent.tex | 686 +++++++++++++++++++++++++++++++++++++++++++++++++++-
> > Â1 file changed, 685 insertions(+), 1 deletion(-)
> > 
> > diff --git a/content.tex b/content.tex
> > index 7508dd1..8ae7578 100644
> > --- a/content.tex
> > +++ b/content.tex
> > @@ -4557,6 +4557,11 @@ \subsection{Feature bits}\label{sec:Device Types /
> > Block Device / Feature bits}
> > ÂÂÂÂÂ maximum erase sectors count in \field{max_secure_erase_sectors} and
> > ÂÂÂÂÂ maximum erase segment number in \field{max_secure_erase_seg}.
> > 
> > +\item[VIRTIO_BLK_F_ZONED(17)] Device is a Zoned Block Device, that is, a
> > device
> > +ÂÂÂÂÂÂ that behaves as defined by the T10 Zoned Block Command standard (ZBC
> > r05) or
> > +ÂÂÂÂÂÂ the NVMe(TM) NVM Express Zoned Namespace Command Set Specification
> > 1.1b
> > +ÂÂÂÂÂÂ (ZNS).
> 
> The rest of the spec doesn't mention ZBC/ZNS but at least the zone
> state machine is likely to be derived from these standards. Have you
> reviewed the license/IP terms of VIRTIO, ZBC, and ZNS to check that
> it's okay to do this? It's probably okay but I wanted to mention it
> because it would be a shame to discover an IP issue after the VIRTIO
> spec has been released.

I've read the virtio LICENSE.md file and visited the links that are included in
this file.

I am not a legal expert, but I don't expect any IP-related issues to arise by
including this patch. The T10 committee is an INCITS entity which is a part of
ANSI, a public standards body. This is the committee that develops SCSI standards
that are already supported by the virtio spec and ZBC is simply another SCSI
standard, alongside with SPC, SBC and so on. The member companies of T10 usually
have a legal review process to make sure that all the standards that are being
released are not encumbered by any patent claims, etc.

ZNS is a product of NVM Express, an open standard consortium. Here is the page on
their site that has the link to ZNS command set specification pdf file -

https://nvmexpress.org/developers/nvme-command-set-specifications/

This consortium also has a legal review process and I don't foresee any IP-
related difficulties that would arise from following this specification for the
purposes of making a virtio ZBD standard.

> 
> Also, the wording suggests to me that the spec will refer to ZBC/ZNS
> instead of defining the behavior itself, but actually you've written a
> full self-contained spec (which is great!). Maybe change "behaves as
> defined by" to "follows the zoned storage device model that is also
> supported by industry standards such as".

Agreed, the wording here could be more precise. This could be "follows the zoned
storage device behavior that is also supported by industry standards such as". I
would avoid using the word "model" here because it usually refers specifically to
host-managed vs. host-aware vs. drive-managed device models in ZBD world.

> 
> 
> > +
> > Â\end{description}
> > 
> > Â\subsubsection{Legacy Interface: Feature bits}\label{sec:Device Types /
> > Block Device / Feature bits / Legacy Interface: Feature bits}
> > @@ -4589,6 +4594,31 @@ \subsection{Device configuration
> > layout}\label{sec:Device Types / Block Device /
> > Â\field{max_secure_erase_sectors} \field{secure_erase_sector_alignment} are
> > expressed
> > Âin 512-byte units if the VIRTIO_BLK_F_SECURE_ERASE feature bit is
> > negotiated.
> > 
> > +If the VIRTIO_BLK_F_ZONED feature is negotiated, then in
> > +\field{virtio_blk_zoned_characteristics},
> > +\begin{itemize}
> > +\item \field{zone_sectors} value is expressed in 512-byte sectors.
> > +\item \field{max_append_sectors} value is expressed in 512-byte sectors.
> > +\item \field{write_granularity} value is expressed in bytes.
> > +\end{itemize}
> > +
> > +The \field{model} field in \field{zoned} may have the following values:
> > +VIRTIO_BLK_Z_HM(1) and VIRTIO_BLK_Z_HA(2).
> > +
> > +\begin{lstlisting}
> > +#define VIRTIO_BLK_Z_HMÂÂÂÂÂÂÂ 1
> > +#define VIRTIO_BLK_Z_HAÂÂÂÂÂÂÂ 2
> > +\end{lstlisting}
> > +
> > +If the VIRTIO_BLK_F_ZONED feature is negotiated, then
> > +\begin{itemize}
> > +\item The value of VIRTIO_BLK_Z_HM MUST be set by the device if it operates
> > +ÂÂÂ as a host-managed zoned block device.
> 
> The term host-managed is not defined. Please either define it or
> include a reference to the zoned storage device model (maybe
> zonedstorage.io?).

OK, indeed. There is some text later in the patch describing the differences
between Sequential Write Required (SWR) and Sequential Write Preferred (SWP)
zones, but the wording that relates HM to SWR and HA to SWP is missing, will add
it here.

> 
> > +
> > +\item The value of VIRTIO_BLK_Z_HA MUST be set by the device if it operates
> > +ÂÂÂ as a host-aware zoned block device.
> 
> The term host-aware is not defined.
> 
> > +\end{itemize}
> 
> The VIRTIO spec is organized into normative sections containing
> MUST/MAY/SHOULD and descriptive sections. MUST/MAY/SHOULD need to go
> in drivernormative or devicenormative sections and can't be used in
> descriptive sections.
> (I believe the reason for this is to give driver/device implementors a
> definitive list of normative statements they can easily review without
> re-reading the entire text of the specification.)


Got it, will reword.

> +
>  \begin{lstlisting}
>  struct virtio_blk_config {
>          le64 capacity;
> @@ -4623,6 +4653,15 @@ \subsection{Device configuration
layout}\label{sec:Device Types / Block Device /
>          le32 max_secure_erase_sectors;
>          le32 max_secure_erase_seg;
>          le32 secure_erase_sector_alignment;
> +        struct virtio_blk_zoned_characteristics {
> +                le32 zone_sectors;
> +                le32 max_open_zones;
> +                le32 max_active_zones;
> +                le32 max_append_sectors;
> +                le32 write_granularity;
> +                u8 model;
> +                u8 unused2[3];
> +        } zoned;
>  };
>  \end{lstlisting}
>
> @@ -4686,6 +4725,10 @@ \subsection{Device Initialization}\label{sec:Device
Types / Block Device / Devic
>      \field{secure_erase_sector_alignment} can be used by OS when splitting a
>      request based on alignment.
>
> +\item If the VIRTIO_BLK_F_ZONED feature is negotiated, the fields in
> +    \field{zoned} can be read by the driver to determine the zone
> +    characteristics of the device. All \field{zoned} fields are read-only.
> +
>  \end{enumerate}
>
>  \drivernormative{\subsubsection}{Device Initialization}{Device Types / Block
Device / Device Initialization}
> @@ -4701,6 +4744,24 @@ \subsection{Device Initialization}\label{sec:Device
Types / Block Device / Devic
>  The driver MUST NOT read \field{writeback} before setting
>  the FEATURES_OK \field{device status} bit.
>
> +Drivers SHOULD NOT negotiate VIRTIO_BLK_F_ZONED feature if they are incapable
> +of supporting devices with the VIRTIO_BLK_Z_HM or VIRTIO_BLK_Z_HA zoned model.

> This raises the question whether devices can offer VIRTIO_BLK_F_ZONED
> but fall back to regular block device operation if the driver does not
> negotiate the bit?

It could be useful (for testing/development, etc.) to allow the device to offer
the VIRTIO_BLK_F_ZONED feature even when the actual backing block device is not
zoned. In this case, it the driver is capable of handling ZBDs, it will acceptÂit
and the i/o should be operational via virtio_blk_zoned_req requests with
VIRTIO_BLK_Z_NONE model. Therefore there is a SHOULD NOT. Perhaps it is worth it
to add this explanation to the text... or not to allow this at all and change the
text to "must not".

Regarding fallbacks, there are clauses below that define these situations and
they mostly boil down to the following -
1. if the device offers VIRTIO_BLK_F_ZONED with an HM backing device and the
driver doesn't negotiate the feature, the device needs to disallow the device to
operate by not setting FEATURES_OK bit.
2. if the device offers VIRTIO_BLK_F_ZONED with an *HA* backing device and the
driver doesn't negotiate the feature, the device needs to present the device as
non-zoned because in this case it will still work.

> +
> +If the VIRTIO_BLK_F_ZONED feature is offered by the device, the
> +VIRTIO_BLK_F_DISCARD feature MUST NOT be offered.
> 
> This statement partially precludes that because it requires devices to
> disable VIRTIO_BLK_F_DISCARD if they advertize VIRTIO_BLK_F_ZONED. I
> think there is a way to allow VIRTIO_BLK_F_DISCARD in
> !VIRTIO_BLK_F_ZONED mode while forbidding it in VIRTIO_BLK_F_ZONED
> mode: the device can fail to set FEATURES_OK when the driver writes
> the Device Status field with VIRTIO_BLK_F_DISCARD &&
> VIRTIO_BLK_F_ZONED. That way a host-aware device in
> !VIRTIO_BLK_F_ZONED mode could allow VIRTIO_BLK_F_DISCARD. Is this
> useful?

Good catch, I verified that HA devices can usually handle discard. In this case,
the clause above becomes too strict. Perhaps it should be changed to

If the VIRTIO_BLK_F_ZONED feature is offered by the device with the
VIRTIO_BLK_Z_HM zone model, then
 - the VIRTIO_BLK_F_DISCARD feature must not be offered by the driver.
 - if the driver sets both VIRTIO_BLK_F_DISCARD and VIRTIO_BLK_F_ZONED feature
bits, the device must fail by not setting FEATURES_OK bit.

> 
> > +
> > +If the VIRTIO_BLK_F_ZONED feature is negotiated, then
> > +\begin{itemize}
> > +\item If the driver that can not support host-managed zoned devices
> > +ÂÂÂ reads VIRTIO_BLK_Z_HM from the \field{model} field of \field{zoned}, the
> > +ÂÂÂ driver MUST NOT set FEATURES_OK flag and instead set the FAILED bit.
> > +
> > +\item If the driver that can not support support zoned devices reads
> > +ÂÂÂ VIRTIO_BLK_Z_HA from the \field{model} field of \field{zoned}, the
> > driver
> > +ÂÂÂ MAY present the device to the guest as a non-zoned device. In this case,
> > the
> 
> Please avoid using the terms "guest" or "host" in the virtualization
> sense and instead use "driver" and "device". VIRTIO can be used in
> non-virtualization use cases and it's clearer to avoid virtualization
> terminology.

OK, will change this.

> 
> > +ÂÂÂ driver SHOULD ignore all other fields in \field{zoned}.
> 
> Why would the driver negotiate VIRTIO_BLK_F_ZONED if it doesn't
> support zoned devices?

This covers cases when the driver can recognize the VIRTIO_BLK_F_ZONED feature,
but it is incapable to support ZBDs for some reason. For example, the Linux
virtio-blk driver could be up to date with ZBD support, but if the kernel is
built without CONFIG_BLK_DEV_ZONED option, it will not be able to handle HM
devices. But it still can deal with HA drives in this case.

> 
> > +\end{itemize}
> > +
> > Â\devicenormative{\subsubsection}{Device Initialization}{Device Types / Block
> > Device / Device Initialization}
> > 
> > ÂDevices SHOULD always offer VIRTIO_BLK_F_FLUSH, and MUST offer it
> > @@ -4712,6 +4773,77 @@ \subsection{Device Initialization}\label{sec:Device
> > Types / Block Device / Devic
> > ÂThe device MUST initialize padding bytes \field{unused0} and
> > Â\field{unused1} to 0.
> > 
> > +If the device that is being initialized is a not a zoned device, the device
> > MUST
> > +NOT offer the VIRTIO_BLK_F_ZONED feature.
> > +
> > +If the VIRTIO_BLK_F_ZONED bit is not set by the driver,
> > +\begin{itemize}
> > +\item the device with the VIRTIO_BLK_Z_HA zone model SHOULD proceed with the
> > +ÂÂÂ initialization while setting all zoned topology fields to zero.
> > +
> > +\item the device with the VIRTIO_BLK_Z_HM zone model MUST report the device
> > +ÂÂÂ capacity in \field{capacity} in the configuration space as zero to
> > prevent
> > +ÂÂÂ the use of the device that is incorrectly recognized by the driver as
> > +ÂÂÂ a non-zoned device.
> 
> I suggest changing this to "MUST fail to set the FEATURES_OK device
> status bit when the driver writes the Device Status field". That way
> the driver gets a clear error instead of seeing a virtio-blk device
> with 0 capacity.

OK. I was under impression that only the driver can do this. It's certainly
better to fail the initialization in this case.

> 
> 
> > +\end{itemize}
> > +
> > +If the VIRTIO_BLK_F_ZONED feature is negotiated,
> > +\begin{itemize}
> > +\item \field{zoned} can be read by the driver to discover the size of a
> > single
> > +ÂÂÂ zone on the device. All zones of the device have the same size indicated
> > by
> > +ÂÂÂ the \field{zone_sectors} field of \field{zoned} except for the last zone
> > +ÂÂÂ that MAY be smaller than all other zones. The driver can calculate the
> > +ÂÂÂ number of zones on the device as
> > +ÂÂÂ \begin{lstlisting}
> > +ÂÂÂÂÂÂÂ nr_zones = (capacity + zone_sectors - 1) / zone_sectors;
> > +ÂÂÂ \end{lstlisting}
> > +ÂÂÂ and the size of the last zone as
> > +ÂÂÂ \begin{lstlisting}
> > +ÂÂÂÂÂÂÂ zs_last = capacity - (nr_zones - 1) * zone_sectors;
> > +ÂÂÂ \end{lstlisting}
> > +
> > +\item Zones consume volatile device resources while being in certain states
> > and
> > +ÂÂÂ the device MAY set limits on the number of zones that can be in these
> > states
> > +ÂÂÂ simultaneously.
> > +
> > +ÂÂÂ Zoned block devices use two internal counters to account for the device
> > +ÂÂÂ resources in use, the number of currently open zones and the number of
> > +ÂÂÂ currently active zones.
> > +
> > +ÂÂÂ Any zone state transition from a state that doesn't consume a zone
> > resource
> > +ÂÂÂ to a state that consumes the same resource increments the internal
> > device
> > +ÂÂÂ counter for that resource. Any zone transition out of a state that
> > consumes
> > +ÂÂÂ a zone resource to a state that doesn't consume the same resource
> > decrements
> > +ÂÂÂ the counter. Any request that causes the device to exceed the reported
> > zone
> > +ÂÂÂ resource limits is terminated by the device with an error.
> 
> "an error" is vague. I was wondering what the exact error should be.
> Please either mention the error codes here or add "as defined for
> specific commands later" so it's clear that the device doesn't get the
> choose the error code but they are defined by the spec.

Right. The error is defined later in the text, it's worth mentioning it here.

> 
> > +
> > +\item The \field{max_open_zones} field of the \field{zoned} structure can be
> > +ÂÂÂ read by the driver to discover the maximum number of zones that can be
> > open
> > +ÂÂÂ on the device (zones in the implicit open or explicit open state). A
> > value
> > +ÂÂÂ of zero indicates that the device does not have any limit on the number
> > of
> > +ÂÂÂ open zones.
> > +
> > +\item The \field{max_active_zones} field of the \field{zoned} structure can
> > be
> > +ÂÂÂ read by the driver to discover the maximum number zones that can be
> > active
> > +ÂÂÂ on the device (zones in the implicit open, explicit open or closed
> > state).
> > +ÂÂÂ A value of zero indicates that the device does not have any limit on the
> > +ÂÂÂ number of active zones.
> > +
> > +\item the \field{max_append_sectors} field of \field{zoned} can be read by
> > the
> > +ÂÂÂ driver to get the maximum data size of a VIRTIO_BLK_T_ZONE_APPEND
> > request
> > +ÂÂÂ issued to the device. The value of this field MUST NOT exceed the
> > +ÂÂÂ \field{seg_max} * \field{size_max} value. A device MAY set the
> > +ÂÂÂ \field{max_append_sectors} to zero if it doesn't support
> > +ÂÂÂ VIRTIO_BLK_T_ZONE_APPEND requests.
> 
> The MUST NOT/MAY statements need to be in a drivernormative section.
> There are more instances below.

Yea, I need to sort this out everywhere in the patch.

> 
> 
> > +
> > +\item the \field{write_granularity} field of \field{zoned} can be read by
> > the
> > +ÂÂÂ driver to discover the offset and size alignment constraint for
> > +ÂÂÂ VIRTIO_BLK_T_OUT and VIRTIO_BLK_T_ZONE_APPEND requests issued to
> > +ÂÂÂ a sequential zone of the device.
> > +
> > +\item the device MUST initialize padding bytes \field{unused2} to 0.
> > +\end{itemize}
> > +
> > Â\subsubsection{Legacy Interface: Device Initialization}\label{sec:Device
> > Types / Block Device / Device Initialization / Legacy Interface: Device
> > Initialization}
> > 
> > ÂBecause legacy devices do not have FEATURES_OK, transitional devices
> > @@ -4738,7 +4870,8 @@ \subsubsection{Legacy Interface: Device
> > Initialization}\label{sec:Device Types /
> > Â\subsection{Device Operation}\label{sec:Device Types / Block Device / Device
> > Operation}
> > 
> > ÂThe driver queues requests to the virtqueues, and they are used by
> > -the device (not necessarily in order). Each request is of form:
> > +the device (not necessarily in order). If the VIRTIO_BLK_F_ZONED feature
> > +is not negotiated, then each request is of form:
> > 
> > Â\begin{lstlisting}
> > Âstruct virtio_blk_req {
> > @@ -4853,6 +4986,331 @@ \subsection{Device Operation}\label{sec:Device Types
> > / Block Device / Device Ope
> > Âcommand produces VIRTIO_BLK_S_IOERR. A segment may have completed
> > Âsuccessfully, failed, or not been processed by the device.
> > 
> > +The following requirements only apply if the VIRTIO_BLK_F_ZONED feature is
> > +negotiated.
> > +
> > +Each request is of form:
> > +
> > +\begin{lstlisting}
> > +struct virtio_blk_zoned_req {
> > +ÂÂÂÂÂÂÂ le32 type;
> > +ÂÂÂÂÂÂÂ le32 reserved;
> > +ÂÂÂÂÂÂÂ le64 sector;
> > +ÂÂÂÂÂÂÂ union zoned_params {
> > +ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ struct {
> > +ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ /* ALL zone operation flag */
> > +ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ __u8 all;
> > +ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ __u8 unused1[3];
> > +ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ } mgmt_send;
> > +ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ struct {
> > +ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ /* Partial zone report flag */
> > +ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ __u8 partial;
> > +ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ __u8 unused2[3];
> > +ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ } mgmt_receive;
> > +ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ struct {
> > +ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ __u8 unused3[4];
> > +ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ } append;
> > +ÂÂÂÂÂÂÂ } zone;
> > +ÂÂÂÂÂÂÂ u8 data[];
> > +ÂÂÂÂÂÂÂ le64 zone_result;
> > +ÂÂÂÂÂÂÂ u8 status;
> > +ÂÂÂÂÂÂÂ u8 reserved1[3];
> 
> Please make status the last byte so device implementations can reuse
> their existing request completion code to handle virtio_blk_zoned_req.
> Either reserved1[] can be removed or it can be moved before status.

OK.

> 
> Here is the non-union representation of the structs. Note that this
> means existing request processing code can handle IN/OUT/FLUSH even
> when VIRTIO_BLK_F_ZONED is negotiated. We don't need two different
> code paths for these common commands:
> 
> /* IN/OUT/FLUSH/DISCARD/WRITE_ZEROES */
> struct virtio_blk_req {
> ÂÂÂÂÂÂÂ le32 type;
> ÂÂÂÂÂÂÂ le32 reserved;
> ÂÂÂÂÂÂÂ le64 sector;
> ÂÂÂÂÂÂÂ u8 data[];
> ÂÂÂÂÂÂÂ u8 status;
> };
> 
> /* ZONE_APPEND */
> struct virtio_blk_req_zone_append {
> ÂÂÂÂÂÂÂ le32 type;
> ÂÂÂÂÂÂÂ le32 reserved;
> ÂÂÂÂÂÂÂ le64 sector;
> ÂÂÂÂÂÂÂ u8 data[];
> ÂÂÂÂÂÂÂ le64 zone_result;
> ÂÂÂÂÂÂÂ u8 status;
> };
> 
> /* ZONE_OPEN, ZONE_CLOSE, ZONE_FINISH, ZONE_RESET */
> struct virtio_blk_req_zone_mgmt_send {
> ÂÂÂÂÂÂÂ le32 type;
> ÂÂÂÂÂÂÂ le32 reserved;
> ÂÂÂÂÂÂÂ le64 sector;
> ÂÂÂÂÂÂÂ /* ALL zone operation flag */
> ÂÂÂÂÂÂÂ __u8 all;
> ÂÂÂÂÂÂÂ __u8 unused1[3];
> ÂÂÂÂÂÂÂ u8 data[];
> ÂÂÂÂÂÂÂ u8 status;
> };
> 
> /* ZONE_REPORT */
> struct virtio_blk_req_zone_mgmt_receive {
> ÂÂÂÂÂÂÂ le32 type;
> ÂÂÂÂÂÂÂ le32 reserved;
> ÂÂÂÂÂÂÂ le64 sector;
> ÂÂÂÂÂÂÂ /* Partial zone report flag */
> ÂÂÂÂÂÂÂ __u8 partial;
> ÂÂÂÂÂÂÂ __u8 unused2[3];
> ÂÂÂÂÂÂÂ u8 data[];
> ÂÂÂÂÂÂÂ u8 status;
> };

We felt that it is less confusing to have a single ZBD request structure so there
is 1:1 relationship between the VIRTIO_BLK_F_ZONED bit value and the choice of
the request -

VIRTIO_BLK_F_ZONED off ->  virtio_blk_req
VIRTIO_BLK_F_ZONED on ->  virtio_blk_zoned_req

We also want to make sure that zone management/send/receive and append requests
are always the same size and it should be easier with a single request
structure.Â

It is possible to use a simpler, but less structured union here -

struct virtio_blk_zoned_req {
        le32 type;
        le32 reserved;
        le64 sector;
        union zoned_params {
	        /* ALL zone operation flag */
	        __u8 all;
	        /* Partial zone report flag */
	        __u8 partial;
	        __u32 rsvd1;
        } zone;
        u8 data[];
        le64 zone_result;
        u8 reserved1[3];
        u8 status;
}

Using the single code path is very important and the PoC Linux driver code that I
have sketched up achieves that. Since the request header is identical for the
both cases, I hope that it can be done in QEMU too.

> 
> > +};
> > +\end{lstlisting}
> > +
> > +In addition to the request types defined for non-zoned devices, the type of
> > the
> > +request can be a zone report (VIRTIO_BLK_T_ZONE_REPORT), an explicit zone
> > open
> > +(VIRTIO_BLK_T_ZONE_OPEN), an explicit zone close (VIRTIO_BLK_T_ZONE_CLOSE),
> > a
> > +zone finish (VIRTIO_BLK_T_ZONE_FINISH), a zone_append
> > (VIRTIO_BLK_T_ZONE_APPEND)
> > +or a zone reset (VIRTIO_BLK_T_ZONE_RESET).
> > +
> > +\begin{lstlisting}
> > +#define VIRTIO_BLK_T_ZONE_APPENDÂÂÂ 15
> > +#define VIRTIO_BLK_T_ZONE_REPORTÂÂÂ 16
> > +#define VIRTIO_BLK_T_ZONE_OPENÂÂÂÂÂ 18
> > +#define VIRTIO_BLK_T_ZONE_CLOSEÂÂÂÂ 20
> > +#define VIRTIO_BLK_T_ZONE_FINISHÂÂÂ 22
> > +#define VIRTIO_BLK_T_ZONE_RESETÂÂÂÂ 24
> > +\end{lstlisting}
> > +
> > +Requests of the type VIRTIO_BLK_T_ZONE_REPORT are reads and requests of the
> > type
> 
> Here "reads" means data transfers from the device to the driver rather
> than disk reads. I found that confusing. Maybe
> "VIRTIO_BLK_T_ZONE_REPORT transfer data from the device"?

Sure, your wording is better.

> For a cur
> > +VIRTIO_BLK_T_ZONE_APPEND are writes. VIRTIO_BLK_T_ZONE_OPEN,
> > +VIRTIO_BLK_T_ZONE_CLOSE, VIRTIO_BLK_T_ZONE_FINISH and
> > VIRTIO_BLK_T_ZONE_RESET
> > +are non-data requests.
> > +
> > +In ZBD standards, the VIRTIO_BLK_T_ZONE_REPORT request belongs to "Zone
> 
> "ZBD standards" is not defined. I think this refers to the zoned
> storage device model and its embodiment in the ZBC/ZNS specs.

> The configuration space fields already allow the driver to calculate
> the total number of zones. What is the purpose of partial=0, it
> doesn't seem to provide any new information?

Both ZBC and ZNS have this categorization. "ZBD standards" can be defined earlier
in the text as a catchall for ZBC/ZNS.

If partial is 0, the nr_zones will contain the number of zones that potentially
can be reported starting from the request "sector" position and until the end of
sector range of the device. It does not necessarily equal the total number of
zones on the drive. 

> 
> When the device marks a virtqueue buffer used, it indicates the number
> of bytes written (struct virtq_used_elem's len field). The driver can
> use this field to determine the number of descriptors filled in by the
> device. Is the nr_zones field really necessary?

This field can help when the buffer is larger than the returned number of
descriptors, but nr_zones is still needed for the partial=0 case above.

> 
> 
> > +
> > +A zone descriptor has the following structure:
> > +
> > +\begin{lstlisting}
> > +struct virtio_blk_zone_descriptor {
> > +ÂÂÂÂÂÂÂ le64ÂÂ z_cap;
> > +ÂÂÂÂÂÂÂ le64ÂÂ z_start;
> > +ÂÂÂÂÂÂÂ le64ÂÂ z_wp;
> > +ÂÂÂÂÂÂÂ u8ÂÂÂÂ z_type;
> > +ÂÂÂÂÂÂÂ u8ÂÂÂÂ z_state;
> > +ÂÂÂÂÂÂÂ u8ÂÂÂÂ reserved[38];
> > +};
> > +\end{lstlisting}
> > +
> > +The zone descriptor field \field{z_type} \field{virtio_blk_zone_descriptor}
> > +indicates the type of the zone. The available types are
> > VIRTIO_BLK_ZT_CONV(1),
> > +VIRTIO_BLK_ZT_SWR(2) or VIRTIO_BLK_ZT_SWP(3).
> > +
> > +\begin{lstlisting}
> > +#define VIRTIO_BLK_ZT_CONVÂÂÂÂ 1
> > +#define VIRTIO_BLK_ZT_SWRÂÂÂÂÂ 2
> > +#define VIRTIO_BLK_ZT_SWPÂÂÂÂÂ 3
> > +\end{lstlisting}
> > +
> > +Read and write operations into zones with the VIRTIO_BLK_ZT_CONV
> > (Conventional)
> > +type have the same behavior as read and write operations on a regular block
> > +device. Any block in a conventional zone can be read or written at any time
> > and
> > +in any order.
> > +
> > +Zones with VIRTIO_BLK_ZT_SWR (Sequential Write Required or SWR) can be read
> > +randomly, but MUST be written sequentially at a certain point in the zone
> > called
> > +the Write Pointer (WP). With every write, the Write Pointer is incremented
> > by
> > +the number of sectors written.
> > +
> > +Zones with VIRTIO_BLK_ZT_SWP (Sequential Write Preferred or SWP) can be read
> > +randomly and SHOULD be written sequentially, similarly to SWR zones.
> > However,
> > +SWP zones can accept random write operations, that is, VIRTIO_BLK_T_OUT
> > requests
> > +with a start sector different from the zone write pointer position.
> > +
> > +The field \field{z_state} of \field{virtio_blk_zone_descriptor} indicates
> > the
> > +state of the device zone. The available zone states are
> > VIRTIO_BLK_ZS_NOT_WP(0),
> > +VIRTIO_BLK_ZS_EMPTY(1), VIRTIO_BLK_ZS_IOPEN(2), VIRTIO_BLK_ZS_EOPEN(3),
> > +VIRTIO_BLK_ZS_CLOSED(4), VIRTIO_BLK_ZS_RDONLY(13), VIRTIO_BLK_ZS_FULL(14)
> > and
> > +VIRTIO_BLK_ZS_OFFLINE(15).
> > +
> > +\begin{lstlisting}
> > +#define VIRTIO_BLK_ZS_NOT_WPÂÂ 0
> > +#define VIRTIO_BLK_ZS_EMPTYÂÂÂ 1
> > +#define VIRTIO_BLK_ZS_IOPENÂÂÂ 2
> > +#define VIRTIO_BLK_ZS_EOPENÂÂÂ 3
> > +#define VIRTIO_BLK_ZS_CLOSEDÂÂ 4
> > +#define VIRTIO_BLK_ZS_RDONLYÂÂ 13
> > +#define VIRTIO_BLK_ZS_FULLÂÂÂÂ 14
> > +#define VIRTIO_BLK_ZS_OFFLINEÂ 15
> > +\end{lstlisting}
> > +
> > +Zones of the type VIRTIO_BLK_ZT_CONV are always reported by the device to be
> > in
> > +the VIRTIO_BLK_ZS_NOT_WP state. Zones of the types VIRTIO_BLK_ZT_SWR and
> > +VIRTIO_BLK_ZT_SWP can not transition to the VIRTIO_BLK_ZS_NOT_WP state.
> > +
> > +Zones in VIRTIO_BLK_ZS_EMPTY (Empty), VIRTIO_BLK_ZS_IOPEN (Implicitly Open),
> > +VIRTIO_BLK_ZS_EOPEN (Explicitly Open) and VIRTIO_BLK_ZS_CLOSED (Closed)
> > state
> > +are writable, but zones in VIRTIO_BLK_ZS_RDONLY (Read-Only),
> > VIRTIO_BLK_ZS_FULL
> > +(Full) and VIRTIO_BLK_ZS_OFFLINE (Offline) state are not. The write pointer
> > +value (\field{z_wp}) is not valid for Read-Only, Full and Offline zones.
> > +
> > +The zone descriptor field \field{z_cap} contains the maximum number of 512-
> > byte
> > +sectors that are available to be written with user data when the zone is in
> > the
> > +Empty state. This value shall be less than or equal to the
> > \field{zone_sectors}
> > +value in \field{virtio_blk_zoned_characteristics} structure in the device
> > +configuration space.
> > +
> > +The zone descriptor field \field{z_start} contains the 64-bit address of the
> > +first 512-byte sector of the zone.
> > +
> > +The zone descriptor field \field{z_wp} contains the 64-bit sector address
> > where
> > +the next write operation for this zone should be issued. This value is
> > undefined
> > +for conventional zones and for zones in VIRTIO_BLK_ZS_RDONLY,
> > VIRTIO_BLK_ZS_FULL
> > +and VIRTIO_BLK_ZS_OFFLINE state.
> > +
> > +Depending on their state, zones consume resources as follows:
> > +\begin{itemize}
> > +\item a zone in VIRTIO_BLK_ZS_IOPEN and VIRTIO_BLK_ZS_EOPEN state consumes
> > one
> > +ÂÂÂ open zone resource and, additionally,
> > +
> > +\item a zone in VIRTIO_BLK_ZS_IOPEN, VIRTIO_BLK_ZS_EOPEN and
> > +ÂÂÂ VIRTIO_BLK_ZS_CLOSED state consumes one active resource.
> > +\end{itemize}
> > +
> > +Attempts for zone transitions that violate zone resource limits MUST fail
> > with
> > +VIRTIO_BLK_S_ZONE_OPEN_RESOURCE or VIRTIO_BLK_S_ZONE_ACTIVE_RESOURCE
> > +\field{zone_result}.
> 
> Why use zone_result instead of the status field? The _S_ naming
> suggests this is a status field constant, not a zone_result constant.

zone_result only a valid ZBD error if the status is VIRTIO_BLK_S_IOERR. If the
status is VIRTIO_BLK_S_OK, this field may contain ALBA for Zone Append request.
It is possible to just define the ZBD-specific error codes following
VIRTIO_BLK_S_UNSUPP, but I think it is cleaner to create a separate numbering for
ZBD-specific codes and keep the non-zoned errors as-is in their impressive
simplicity.

The ZBD error codes all currently have _S_ZONE_ naming, but it might be better to
use something like _ZR_ for example, i.e.

#define VIRTIO_BLK_ZR_INVALID_CMD     0
#define VIRTIO_BLK_ZR_UNALIGNED_WP    1
#define VIRTIO_BLK_ZR_OPEN_RESOURCE   2
#define VIRTIO_BLK_ZR_ACTIVE_RESOURCE 3

> 
> 
> > +
> > +Zones in the VIRTIO_BLK_ZS_EMPTY (Empty) state have the write pointer value
> > +equal to the start sector of the zone. In this state, the entire capacity of
> > the
> > +zone is available for writing. A zone can transition from this state to
> > +\begin{itemize}
> > +\item VIRTIO_BLK_ZS_IOPEN when a successful VIRTIO_BLK_T_OUT request or
> > +ÂÂÂ VIRTIO_BLK_T_ZONE_APPEND with a non-zero data size is received for the
> > zone.
> > +
> > +\item VIRTIO_BLK_ZS_EOPEN when a successful VIRTIO_BLK_T_ZONE_OPEN request
> > is
> > +ÂÂÂ received for the zone
> > +\end{itemize}
> > +
> > +When a VIRTIO_BLK_T_ZONE_RESET request is issued to an Empty zone, the
> > request
> > +is completed successfully and the zone stays in the VIRTIO_BLK_ZS_EMPTY
> > state.
> > +
> > +Zones in the VIRTIO_BLK_ZS_IOPEN (Implicitly Open) state can transition from
> > +this state to
> > +\begin{itemize}
> > +\item VIRTIO_BLK_ZS_EMPTY when a successful VIRTIO_BLK_T_ZONE_RESET request
> > is
> > +ÂÂÂ received for the zone,
> > +
> > +\item VIRTIO_BLK_ZS_EOPEN when a successful VIRTIO_BLK_T_ZONE_OPEN request
> > is
> > +ÂÂÂ received for the zone,
> > +
> > +\item VIRTIO_BLK_ZS_CLOSED when a successful VIRTIO_BLK_T_ZONE_CLOSE request
> > is
> > +ÂÂÂ received for the zone,
> > +
> > +\item VIRTIO_BLK_ZS_CLOSED implicitly by the device when another zone is
> > +ÂÂÂ entering the VIRTIO_BLK_ZS_IOPEN or VIRTIO_BLK_ZS_EOPEN state and the
> > number
> > +ÂÂÂ of currently open zones is at \field{max_open_zones} limit,
> > +
> > +\item VIRTIO_BLK_ZS_FULL when a successful VIRTIO_BLK_T_ZONE_FINISH request
> > is
> > +ÂÂÂ received for the zone.
> > +
> > +\item VIRTIO_BLK_ZS_FULL when a successful VIRTIO_BLK_T_OUT or
> > +ÂÂÂ VIRTIO_BLK_T_ZONE_APPEND request that causes the zone to reach its
> > writable
> > +ÂÂÂ capacity is received for the zone.
> > +\end{itemize}
> > +
> > +Zones in the VIRTIO_BLK_ZS_EOPEN (Explicitly Open) state can transition from
> > +this state to
> > +\begin{itemize}
> > +\item VIRTIO_BLK_ZS_EMPTY when a successful VIRTIO_BLK_T_ZONE_RESET request
> > is
> > +ÂÂÂ received for the zone,
> > +
> > +\item VIRTIO_BLK_ZS_EMPTY when a successful VIRTIO_BLK_T_ZONE_CLOSE request
> > is
> > +ÂÂÂ received for the zone and the write pointer of the zone has the value
> > equal
> > +ÂÂÂ to the start sector of the zone,
> > +
> > +\item VIRTIO_BLK_ZS_CLOSED when a successful VIRTIO_BLK_T_ZONE_CLOSE request
> > is
> > +ÂÂÂ received for the zone and the zone write pointer is larger then the
> > start
> > +ÂÂÂ sector of the zone,
> > +
> > +\item VIRTIO_BLK_ZS_FULL when a successful VIRTIO_BLK_T_ZONE_FINISH request
> > is
> > +ÂÂÂ received for the zone,
> > +
> > +\item VIRTIO_BLK_ZS_FULL when a successful VIRTIO_BLK_T_OUT or
> > +ÂÂÂ VIRTIO_BLK_T_ZONE_APPEND request that causes the zone to reach its
> > writable
> > +ÂÂÂ capacity is received for the zone.
> > +\end{itemize}
> > +
> > +When a VIRTIO_BLK_T_ZONE_EOPEN request is issued to an Explicitly Open zone,
> > the
> > +request is completed successfully and the zone stays in the
> > VIRTIO_BLK_ZS_EOPEN
> > +state.
> > +
> > +Zones in the VIRTIO_BLK_ZS_CLOSED (Closed) state can transition from this
> > state
> > +to
> > +\begin{itemize}
> > +\item VIRTIO_BLK_ZS_EMPTY when a successful VIRTIO_BLK_T_ZONE_RESET request
> > is
> > +ÂÂÂ received for the zone,
> > +
> > +\item VIRTIO_BLK_ZS_IOPEN when a successful VIRTIO_BLK_T_OUT request or
> > +ÂÂÂ VIRTIO_BLK_T_ZONE_APPEND with a non-zero data size is received for the
> > zone.
> > +
> > +\item VIRTIO_BLK_ZS_EOPEN when a successful VIRTIO_BLK_T_ZONE_OPEN request
> > is
> > +ÂÂÂ received for the zone,
> > +\end{itemize}
> > +
> > +When a VIRTIO_BLK_T_ZONE_CLOSE request is issued to a Closed zone, the
> > request
> > +is completed successfully and the zone stays in the VIRTIO_BLK_ZS_CLOSED
> > state.
> > +
> > +Zones in the VIRTIO_BLK_ZS_FULL (Full) state can transition from this state
> > to
> > +VIRTIO_BLK_ZS_EMPTY when a successful VIRTIO_BLK_T_ZONE_RESET request is
> > +received for the zone
> > +
> > +When a VIRTIO_BLK_T_ZONE_FINISH request is issued to a Full zone, the
> > request
> > +is completed successfully and the zone stays in the VIRTIO_BLK_ZS_FULL
> > state.
> > +
> > +The device MAY automatically transition zones to VIRTIO_BLK_ZS_RDONLY
> > +(Read-Only) or VIRTIO_BLK_ZS_OFFLINE (Offline) state from any other state.
> > The
> > +device MAY also automatically transition zones in the Read-Only state to the
> > +Offline state. Zones in the Offline state MAY NOT transition to any other
> > state.
> > +Such automatic transitions usually indicate hardware failures. The
> > previously
> > +written data may only be read from zones in the Read-Only state. Zones in
> > the
> > +Offline state can not be read or written.
> 
> What is the status code when this is attempted?

It is stated elsewhere -
All requests issued to a zone in the VIRTIO_BLK_ZS_OFFLINE state SHALL be
completed with VIRTIO_BLK_S_IOERR \field{status} and
VIRTIO_BLK_S_ZONE_INVALID_CMD value in the field \field{zone_result}.





[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]