OASIS Mailing List ArchivesView the OASIS mailing list archive below
or browse/search using MarkMail.

 


Help: OASIS Mailing Lists Help | MarkMail Help

virtio-comment message

[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]


Subject: Re: [virtio-dev] [RFC PATCH] virtio-blk: add zoned block device specification


(Resent from my subscriber email.)

On Fri, 3 Jun 2022 at 03:30, Dmitry Fomichev <Dmitry.Fomichev@wdc.com> wrote:
>
> On Thu, 2022-06-02 at 08:23 +0100, Stefan Hajnoczi wrote:
> > (Resending from my mailing list subscriber address so this makes it onto
> > the list.)
> >
> > On Thu, 2 Jun 2022 at 00:56, Dmitry Fomichev <dmitry.fomichev@wdc.com> wrote:
> > >
> > > Introduce support for Zoned Block Devices to virtio.
> > >
> > > Zoned Block Devices (ZBDs) aim to achieve a better capacity, latency
> > > and/or cost characteristics compared to commonly available block
> > > devices by getting the entire LBA space of the device divided to block
> > > regions that are much larger than the LBA size. These regions are
> > > called zones and they can only be written sequentially. More details
> > > about ZBDs can be found at
> > >
> > > https://zonedstorage.io/docs/introduction/zoned-storage .
> > >
> > > In its current form, the virtio protocol for block devices (virtio-blk)
> > > is not aware of ZBDs but it allows the guest to successfully scan a
> > > host-managed drive provided by the host. As the result, the
> > > host-managed drive appears at the guest as a regular drive that will
> > > operate erroneously under the most common write workloads.
> > >
> > > To fix this, the virtio-blk protocol needs to be extended to add the
> > > capabilities to convey the zone characteristics of host ZBDs to the
> > > guest and to provide the support for ZBD-specific commands - Report
> > > Zones, four zone operations and (optionally) Zone Append. The proposed
> > > standard extension aims to provide this functionality.
> > >
> > > This patch extends the virtio-blk section of virtio specification with
> > > the minimum set of requirements that are necessary to support ZBDs.
> > > The resulting device model is a subset of the models defined in ZAC/ZBC
> > > and ZNS standards documents. The included functionality mirrors
> > > the existing Linux kernel block layer ZBD support and should be
> > > sufficient to handle the host-managed and host-aware HDDs that are on
> > > the market today as well as ZNS SSDs that are entering the market at
> > > the moment of this patch submission.
> > >
> > > I have developed a proof of concept patch series that adds ZBD support
> > > to virtio-blk Linux kernel driver by implementing the protocol
> > > extensions defined in the spec patch. I would like to receive the
> > > initial feedback on this specification patch before posting that series
> > > to the block LKML.
> > >
> > > I would like to thank the following people for their useful feedback
> > > and suggestions while working on the initial iterations of this patch.
> > >
> > > Damien Le Moal <damien.lemoal@opensource.wdc.com>
> > > Matias Bjørling <Matias.Bjorling@wdc.com>
> > > Niklas Cassel <Niklas.Cassel@wdc.com>
> > > Hans Holmberg <Hans.Holmberg@wdc.com>
> > >
> > > Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
> > > ---
> > >  content.tex | 686 +++++++++++++++++++++++++++++++++++++++++++++++++++-
> > >  1 file changed, 685 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/content.tex b/content.tex
> > > index 7508dd1..8ae7578 100644
> > > --- a/content.tex
> > > +++ b/content.tex
> > > @@ -4557,6 +4557,11 @@ \subsection{Feature bits}\label{sec:Device Types /
> > > Block Device / Feature bits}
> > >       maximum erase sectors count in \field{max_secure_erase_sectors} and
> > >       maximum erase segment number in \field{max_secure_erase_seg}.
> > >
> > > +\item[VIRTIO_BLK_F_ZONED(17)] Device is a Zoned Block Device, that is, a
> > > device
> > > +       that behaves as defined by the T10 Zoned Block Command standard (ZBC
> > > r05) or
> > > +       the NVMe(TM) NVM Express Zoned Namespace Command Set Specification
> > > 1.1b
> > > +       (ZNS).
> >
> > The rest of the spec doesn't mention ZBC/ZNS but at least the zone
> > state machine is likely to be derived from these standards. Have you
> > reviewed the license/IP terms of VIRTIO, ZBC, and ZNS to check that
> > it's okay to do this? It's probably okay but I wanted to mention it
> > because it would be a shame to discover an IP issue after the VIRTIO
> > spec has been released.
>
> I've read the virtio LICENSE.md file and visited the links that are included in
> this file.
>
> I am not a legal expert, but I don't expect any IP-related issues to arise by
> including this patch. The T10 committee is an INCITS entity which is a part of
> ANSI, a public standards body. This is the committee that develops SCSI standards
> that are already supported by the virtio spec and ZBC is simply another SCSI
> standard, alongside with SPC, SBC and so on. The member companies of T10 usually
> have a legal review process to make sure that all the standards that are being
> released are not encumbered by any patent claims, etc.
>
> ZNS is a product of NVM Express, an open standard consortium. Here is the page on
> their site that has the link to ZNS command set specification pdf file -
>
> https://nvmexpress.org/developers/nvme-command-set-specifications/
>
> This consortium also has a legal review process and I don't foresee any IP-
> related difficulties that would arise from following this specification for the
> purposes of making a virtio ZBD standard.
>
> >
> > Also, the wording suggests to me that the spec will refer to ZBC/ZNS
> > instead of defining the behavior itself, but actually you've written a
> > full self-contained spec (which is great!). Maybe change "behaves as
> > defined by" to "follows the zoned storage device model that is also
> > supported by industry standards such as".
>
> Agreed, the wording here could be more precise. This could be "follows the zoned
> storage device behavior that is also supported by industry standards such as". I
> would avoid using the word "model" here because it usually refers specifically to
> host-managed vs. host-aware vs. drive-managed device models in ZBD world.
>
> >
> >
> > > +
> > >  \end{description}
> > >
> > >  \subsubsection{Legacy Interface: Feature bits}\label{sec:Device Types /
> > > Block Device / Feature bits / Legacy Interface: Feature bits}
> > > @@ -4589,6 +4594,31 @@ \subsection{Device configuration
> > > layout}\label{sec:Device Types / Block Device /
> > >  \field{max_secure_erase_sectors} \field{secure_erase_sector_alignment} are
> > > expressed
> > >  in 512-byte units if the VIRTIO_BLK_F_SECURE_ERASE feature bit is
> > > negotiated.
> > >
> > > +If the VIRTIO_BLK_F_ZONED feature is negotiated, then in
> > > +\field{virtio_blk_zoned_characteristics},
> > > +\begin{itemize}
> > > +\item \field{zone_sectors} value is expressed in 512-byte sectors.
> > > +\item \field{max_append_sectors} value is expressed in 512-byte sectors.
> > > +\item \field{write_granularity} value is expressed in bytes.
> > > +\end{itemize}
> > > +
> > > +The \field{model} field in \field{zoned} may have the following values:
> > > +VIRTIO_BLK_Z_HM(1) and VIRTIO_BLK_Z_HA(2).
> > > +
> > > +\begin{lstlisting}
> > > +#define VIRTIO_BLK_Z_HM        1
> > > +#define VIRTIO_BLK_Z_HA        2
> > > +\end{lstlisting}
> > > +
> > > +If the VIRTIO_BLK_F_ZONED feature is negotiated, then
> > > +\begin{itemize}
> > > +\item The value of VIRTIO_BLK_Z_HM MUST be set by the device if it operates
> > > +    as a host-managed zoned block device.
> >
> > The term host-managed is not defined. Please either define it or
> > include a reference to the zoned storage device model (maybe
> > zonedstorage.io?).
>
> OK, indeed. There is some text later in the patch describing the differences
> between Sequential Write Required (SWR) and Sequential Write Preferred (SWP)
> zones, but the wording that relates HM to SWR and HA to SWP is missing, will add
> it here.
>
> >
> > > +
> > > +\item The value of VIRTIO_BLK_Z_HA MUST be set by the device if it operates
> > > +    as a host-aware zoned block device.
> >
> > The term host-aware is not defined.
> >
> > > +\end{itemize}
> >
> > The VIRTIO spec is organized into normative sections containing
> > MUST/MAY/SHOULD and descriptive sections. MUST/MAY/SHOULD need to go
> > in drivernormative or devicenormative sections and can't be used in
> > descriptive sections.
> > (I believe the reason for this is to give driver/device implementors a
> > definitive list of normative statements they can easily review without
> > re-reading the entire text of the specification.)
>
>
> Got it, will reword.
>
> > +
> >  \begin{lstlisting}
> >  struct virtio_blk_config {
> >          le64 capacity;
> > @@ -4623,6 +4653,15 @@ \subsection{Device configuration
> layout}\label{sec:Device Types / Block Device /
> >          le32 max_secure_erase_sectors;
> >          le32 max_secure_erase_seg;
> >          le32 secure_erase_sector_alignment;
> > +        struct virtio_blk_zoned_characteristics {
> > +                le32 zone_sectors;
> > +                le32 max_open_zones;
> > +                le32 max_active_zones;
> > +                le32 max_append_sectors;
> > +                le32 write_granularity;
> > +                u8 model;
> > +                u8 unused2[3];
> > +        } zoned;
> >  };
> >  \end{lstlisting}
> >
> > @@ -4686,6 +4725,10 @@ \subsection{Device Initialization}\label{sec:Device
> Types / Block Device / Devic
> >      \field{secure_erase_sector_alignment} can be used by OS when splitting a
> >      request based on alignment.
> >
> > +\item If the VIRTIO_BLK_F_ZONED feature is negotiated, the fields in
> > +    \field{zoned} can be read by the driver to determine the zone
> > +    characteristics of the device. All \field{zoned} fields are read-only.
> > +
> >  \end{enumerate}
> >
> >  \drivernormative{\subsubsection}{Device Initialization}{Device Types / Block
> Device / Device Initialization}
> > @@ -4701,6 +4744,24 @@ \subsection{Device Initialization}\label{sec:Device
> Types / Block Device / Devic
> >  The driver MUST NOT read \field{writeback} before setting
> >  the FEATURES_OK \field{device status} bit.
> >
> > +Drivers SHOULD NOT negotiate VIRTIO_BLK_F_ZONED feature if they are incapable
> > +of supporting devices with the VIRTIO_BLK_Z_HM or VIRTIO_BLK_Z_HA zoned model.
>
> > This raises the question whether devices can offer VIRTIO_BLK_F_ZONED
> > but fall back to regular block device operation if the driver does not
> > negotiate the bit?
>
> It could be useful (for testing/development, etc.) to allow the device to offer
> the VIRTIO_BLK_F_ZONED feature even when the actual backing block device is not
> zoned. In this case, it the driver is capable of handling ZBDs, it will accept it
> and the i/o should be operational via virtio_blk_zoned_req requests with
> VIRTIO_BLK_Z_NONE model. Therefore there is a SHOULD NOT. Perhaps it is worth it
> to add this explanation to the text... or not to allow this at all and change the
> text to "must not".
>
> Regarding fallbacks, there are clauses below that define these situations and
> they mostly boil down to the following -
> 1. if the device offers VIRTIO_BLK_F_ZONED with an HM backing device and the
> driver doesn't negotiate the feature, the device needs to disallow the device to
> operate by not setting FEATURES_OK bit.
> 2. if the device offers VIRTIO_BLK_F_ZONED with an *HA* backing device and the
> driver doesn't negotiate the feature, the device needs to present the device as
> non-zoned because in this case it will still work.
>
> > +
> > +If the VIRTIO_BLK_F_ZONED feature is offered by the device, the
> > +VIRTIO_BLK_F_DISCARD feature MUST NOT be offered.
> >
> > This statement partially precludes that because it requires devices to
> > disable VIRTIO_BLK_F_DISCARD if they advertize VIRTIO_BLK_F_ZONED. I
> > think there is a way to allow VIRTIO_BLK_F_DISCARD in
> > !VIRTIO_BLK_F_ZONED mode while forbidding it in VIRTIO_BLK_F_ZONED
> > mode: the device can fail to set FEATURES_OK when the driver writes
> > the Device Status field with VIRTIO_BLK_F_DISCARD &&
> > VIRTIO_BLK_F_ZONED. That way a host-aware device in
> > !VIRTIO_BLK_F_ZONED mode could allow VIRTIO_BLK_F_DISCARD. Is this
> > useful?
>
> Good catch, I verified that HA devices can usually handle discard. In this case,
> the clause above becomes too strict. Perhaps it should be changed to
>
> If the VIRTIO_BLK_F_ZONED feature is offered by the device with the
> VIRTIO_BLK_Z_HM zone model, then
>  - the VIRTIO_BLK_F_DISCARD feature must not be offered by the driver.
>  - if the driver sets both VIRTIO_BLK_F_DISCARD and VIRTIO_BLK_F_ZONED feature
> bits, the device must fail by not setting FEATURES_OK bit.

Sounds good.


>
> >
> > > +
> > > +If the VIRTIO_BLK_F_ZONED feature is negotiated, then
> > > +\begin{itemize}
> > > +\item If the driver that can not support host-managed zoned devices
> > > +    reads VIRTIO_BLK_Z_HM from the \field{model} field of \field{zoned}, the
> > > +    driver MUST NOT set FEATURES_OK flag and instead set the FAILED bit.
> > > +
> > > +\item If the driver that can not support support zoned devices reads
> > > +    VIRTIO_BLK_Z_HA from the \field{model} field of \field{zoned}, the
> > > driver
> > > +    MAY present the device to the guest as a non-zoned device. In this case,
> > > the
> >
> > Please avoid using the terms "guest" or "host" in the virtualization
> > sense and instead use "driver" and "device". VIRTIO can be used in
> > non-virtualization use cases and it's clearer to avoid virtualization
> > terminology.
>
> OK, will change this.
>
> >
> > > +    driver SHOULD ignore all other fields in \field{zoned}.
> >
> > Why would the driver negotiate VIRTIO_BLK_F_ZONED if it doesn't
> > support zoned devices?
>
> This covers cases when the driver can recognize the VIRTIO_BLK_F_ZONED feature,
> but it is incapable to support ZBDs for some reason. For example, the Linux
> virtio-blk driver could be up to date with ZBD support, but if the kernel is
> built without CONFIG_BLK_DEV_ZONED option, it will not be able to handle HM
> devices. But it still can deal with HA drives in this case.

I see, thanks.


>
> >
> > > +\end{itemize}
> > > +
> > >  \devicenormative{\subsubsection}{Device Initialization}{Device Types / Block
> > > Device / Device Initialization}
> > >
> > >  Devices SHOULD always offer VIRTIO_BLK_F_FLUSH, and MUST offer it
> > > @@ -4712,6 +4773,77 @@ \subsection{Device Initialization}\label{sec:Device
> > > Types / Block Device / Devic
> > >  The device MUST initialize padding bytes \field{unused0} and
> > >  \field{unused1} to 0.
> > >
> > > +If the device that is being initialized is a not a zoned device, the device
> > > MUST
> > > +NOT offer the VIRTIO_BLK_F_ZONED feature.
> > > +
> > > +If the VIRTIO_BLK_F_ZONED bit is not set by the driver,
> > > +\begin{itemize}
> > > +\item the device with the VIRTIO_BLK_Z_HA zone model SHOULD proceed with the
> > > +    initialization while setting all zoned topology fields to zero.
> > > +
> > > +\item the device with the VIRTIO_BLK_Z_HM zone model MUST report the device
> > > +    capacity in \field{capacity} in the configuration space as zero to
> > > prevent
> > > +    the use of the device that is incorrectly recognized by the driver as
> > > +    a non-zoned device.
> >
> > I suggest changing this to "MUST fail to set the FEATURES_OK device
> > status bit when the driver writes the Device Status field". That way
> > the driver gets a clear error instead of seeing a virtio-blk device
> > with 0 capacity.
>
> OK. I was under impression that only the driver can do this. It's certainly
> better to fail the initialization in this case.
>
> >
> >
> > > +\end{itemize}
> > > +
> > > +If the VIRTIO_BLK_F_ZONED feature is negotiated,
> > > +\begin{itemize}
> > > +\item \field{zoned} can be read by the driver to discover the size of a
> > > single
> > > +    zone on the device. All zones of the device have the same size indicated
> > > by
> > > +    the \field{zone_sectors} field of \field{zoned} except for the last zone
> > > +    that MAY be smaller than all other zones. The driver can calculate the
> > > +    number of zones on the device as
> > > +    \begin{lstlisting}
> > > +        nr_zones = (capacity + zone_sectors - 1) / zone_sectors;
> > > +    \end{lstlisting}
> > > +    and the size of the last zone as
> > > +    \begin{lstlisting}
> > > +        zs_last = capacity - (nr_zones - 1) * zone_sectors;
> > > +    \end{lstlisting}
> > > +
> > > +\item Zones consume volatile device resources while being in certain states
> > > and
> > > +    the device MAY set limits on the number of zones that can be in these
> > > states
> > > +    simultaneously.
> > > +
> > > +    Zoned block devices use two internal counters to account for the device
> > > +    resources in use, the number of currently open zones and the number of
> > > +    currently active zones.
> > > +
> > > +    Any zone state transition from a state that doesn't consume a zone
> > > resource
> > > +    to a state that consumes the same resource increments the internal
> > > device
> > > +    counter for that resource. Any zone transition out of a state that
> > > consumes
> > > +    a zone resource to a state that doesn't consume the same resource
> > > decrements
> > > +    the counter. Any request that causes the device to exceed the reported
> > > zone
> > > +    resource limits is terminated by the device with an error.
> >
> > "an error" is vague. I was wondering what the exact error should be.
> > Please either mention the error codes here or add "as defined for
> > specific commands later" so it's clear that the device doesn't get the
> > choose the error code but they are defined by the spec.
>
> Right. The error is defined later in the text, it's worth mentioning it here.
>
> >
> > > +
> > > +\item The \field{max_open_zones} field of the \field{zoned} structure can be
> > > +    read by the driver to discover the maximum number of zones that can be
> > > open
> > > +    on the device (zones in the implicit open or explicit open state). A
> > > value
> > > +    of zero indicates that the device does not have any limit on the number
> > > of
> > > +    open zones.
> > > +
> > > +\item The \field{max_active_zones} field of the \field{zoned} structure can
> > > be
> > > +    read by the driver to discover the maximum number zones that can be
> > > active
> > > +    on the device (zones in the implicit open, explicit open or closed
> > > state).
> > > +    A value of zero indicates that the device does not have any limit on the
> > > +    number of active zones.
> > > +
> > > +\item the \field{max_append_sectors} field of \field{zoned} can be read by
> > > the
> > > +    driver to get the maximum data size of a VIRTIO_BLK_T_ZONE_APPEND
> > > request
> > > +    issued to the device. The value of this field MUST NOT exceed the
> > > +    \field{seg_max} * \field{size_max} value. A device MAY set the
> > > +    \field{max_append_sectors} to zero if it doesn't support
> > > +    VIRTIO_BLK_T_ZONE_APPEND requests.
> >
> > The MUST NOT/MAY statements need to be in a drivernormative section.
> > There are more instances below.
>
> Yea, I need to sort this out everywhere in the patch.
>
> >
> >
> > > +
> > > +\item the \field{write_granularity} field of \field{zoned} can be read by
> > > the
> > > +    driver to discover the offset and size alignment constraint for
> > > +    VIRTIO_BLK_T_OUT and VIRTIO_BLK_T_ZONE_APPEND requests issued to
> > > +    a sequential zone of the device.
> > > +
> > > +\item the device MUST initialize padding bytes \field{unused2} to 0.
> > > +\end{itemize}
> > > +
> > >  \subsubsection{Legacy Interface: Device Initialization}\label{sec:Device
> > > Types / Block Device / Device Initialization / Legacy Interface: Device
> > > Initialization}
> > >
> > >  Because legacy devices do not have FEATURES_OK, transitional devices
> > > @@ -4738,7 +4870,8 @@ \subsubsection{Legacy Interface: Device
> > > Initialization}\label{sec:Device Types /
> > >  \subsection{Device Operation}\label{sec:Device Types / Block Device / Device
> > > Operation}
> > >
> > >  The driver queues requests to the virtqueues, and they are used by
> > > -the device (not necessarily in order). Each request is of form:
> > > +the device (not necessarily in order). If the VIRTIO_BLK_F_ZONED feature
> > > +is not negotiated, then each request is of form:
> > >
> > >  \begin{lstlisting}
> > >  struct virtio_blk_req {
> > > @@ -4853,6 +4986,331 @@ \subsection{Device Operation}\label{sec:Device Types
> > > / Block Device / Device Ope
> > >  command produces VIRTIO_BLK_S_IOERR.  A segment may have completed
> > >  successfully, failed, or not been processed by the device.
> > >
> > > +The following requirements only apply if the VIRTIO_BLK_F_ZONED feature is
> > > +negotiated.
> > > +
> > > +Each request is of form:
> > > +
> > > +\begin{lstlisting}
> > > +struct virtio_blk_zoned_req {
> > > +        le32 type;
> > > +        le32 reserved;
> > > +        le64 sector;
> > > +        union zoned_params {
> > > +                struct {
> > > +                        /* ALL zone operation flag */
> > > +                        __u8 all;
> > > +                        __u8 unused1[3];
> > > +                } mgmt_send;
> > > +                struct {
> > > +                        /* Partial zone report flag */
> > > +                        __u8 partial;
> > > +                        __u8 unused2[3];
> > > +                } mgmt_receive;
> > > +                struct {
> > > +                        __u8 unused3[4];
> > > +                } append;
> > > +        } zone;
> > > +        u8 data[];
> > > +        le64 zone_result;
> > > +        u8 status;
> > > +        u8 reserved1[3];
> >
> > Please make status the last byte so device implementations can reuse
> > their existing request completion code to handle virtio_blk_zoned_req.
> > Either reserved1[] can be removed or it can be moved before status.
>
> OK.
>
> >
> > Here is the non-union representation of the structs. Note that this
> > means existing request processing code can handle IN/OUT/FLUSH even
> > when VIRTIO_BLK_F_ZONED is negotiated. We don't need two different
> > code paths for these common commands:
> >
> > /* IN/OUT/FLUSH/DISCARD/WRITE_ZEROES */
> > struct virtio_blk_req {
> >         le32 type;
> >         le32 reserved;
> >         le64 sector;
> >         u8 data[];
> >         u8 status;
> > };
> >
> > /* ZONE_APPEND */
> > struct virtio_blk_req_zone_append {
> >         le32 type;
> >         le32 reserved;
> >         le64 sector;
> >         u8 data[];
> >         le64 zone_result;
> >         u8 status;
> > };
> >
> > /* ZONE_OPEN, ZONE_CLOSE, ZONE_FINISH, ZONE_RESET */
> > struct virtio_blk_req_zone_mgmt_send {
> >         le32 type;
> >         le32 reserved;
> >         le64 sector;
> >         /* ALL zone operation flag */
> >         __u8 all;
> >         __u8 unused1[3];
> >         u8 data[];
> >         u8 status;
> > };
> >
> > /* ZONE_REPORT */
> > struct virtio_blk_req_zone_mgmt_receive {
> >         le32 type;
> >         le32 reserved;
> >         le64 sector;
> >         /* Partial zone report flag */
> >         __u8 partial;
> >         __u8 unused2[3];
> >         u8 data[];
> >         u8 status;
> > };
>
> We felt that it is less confusing to have a single ZBD request structure so there
> is 1:1 relationship between the VIRTIO_BLK_F_ZONED bit value and the choice of
> the request -
>
> VIRTIO_BLK_F_ZONED off ->  virtio_blk_req
> VIRTIO_BLK_F_ZONED on ->  virtio_blk_zoned_req
>
> We also want to make sure that zone management/send/receive and append requests
> are always the same size and it should be easier with a single request
> structure.
>
> It is possible to use a simpler, but less structured union here -
>
> struct virtio_blk_zoned_req {
>         le32 type;
>         le32 reserved;
>         le64 sector;
>         union zoned_params {
>                 /* ALL zone operation flag */
>                 __u8 all;
>                 /* Partial zone report flag */
>                 __u8 partial;
>                 __u32 rsvd1;
>         } zone;
>         u8 data[];
>         le64 zone_result;
>         u8 reserved1[3];
>         u8 status;
> }
>
> Using the single code path is very important and the PoC Linux driver code that I
> have sketched up achieves that. Since the request header is identical for the
> both cases, I hope that it can be done in QEMU too.

There are two possibilities for optimizing the request layout:
1. For drivers and devices that implement both !VIRTIO_BLK_F_ZONED and
VIRTIO_BLK_F_ZONED. Here it's best to share the same request layout
for commands available in both modes. In other words IN and OUT
requests to zoned devices should use struct virtio_blk_req so code can
be shared and only commands unique to zoned devices use struct
virtio_blk_zoned_req.
2. For drivers and devices that implement only one of
!VIRTIO_BLK_F_ZONED and VIRTIO_BLK_F_ZONED. Here consistent request
layout is best so the code parses all commands in the same way. In
other words, IN and OUT requests to zoned devices use struct
virtio_blk_zoned_req so that the implementation doesn't have to bother
with struct virtio_blk_req at all.

I think you chose #2 and I gravitated towards #1. Which one is optimal
depends on the use case. Feel free to stick with #2.


>
> >
> > > +};
> > > +\end{lstlisting}
> > > +
> > > +In addition to the request types defined for non-zoned devices, the type of
> > > the
> > > +request can be a zone report (VIRTIO_BLK_T_ZONE_REPORT), an explicit zone
> > > open
> > > +(VIRTIO_BLK_T_ZONE_OPEN), an explicit zone close (VIRTIO_BLK_T_ZONE_CLOSE),
> > > a
> > > +zone finish (VIRTIO_BLK_T_ZONE_FINISH), a zone_append
> > > (VIRTIO_BLK_T_ZONE_APPEND)
> > > +or a zone reset (VIRTIO_BLK_T_ZONE_RESET).
> > > +
> > > +\begin{lstlisting}
> > > +#define VIRTIO_BLK_T_ZONE_APPEND    15
> > > +#define VIRTIO_BLK_T_ZONE_REPORT    16
> > > +#define VIRTIO_BLK_T_ZONE_OPEN      18
> > > +#define VIRTIO_BLK_T_ZONE_CLOSE     20
> > > +#define VIRTIO_BLK_T_ZONE_FINISH    22
> > > +#define VIRTIO_BLK_T_ZONE_RESET     24
> > > +\end{lstlisting}
> > > +
> > > +Requests of the type VIRTIO_BLK_T_ZONE_REPORT are reads and requests of the
> > > type
> >
> > Here "reads" means data transfers from the device to the driver rather
> > than disk reads. I found that confusing. Maybe
> > "VIRTIO_BLK_T_ZONE_REPORT transfer data from the device"?
>
> Sure, your wording is better.
>
> > For a cur
> > > +VIRTIO_BLK_T_ZONE_APPEND are writes. VIRTIO_BLK_T_ZONE_OPEN,
> > > +VIRTIO_BLK_T_ZONE_CLOSE, VIRTIO_BLK_T_ZONE_FINISH and
> > > VIRTIO_BLK_T_ZONE_RESET
> > > +are non-data requests.
> > > +
> > > +In ZBD standards, the VIRTIO_BLK_T_ZONE_REPORT request belongs to "Zone
> >
> > "ZBD standards" is not defined. I think this refers to the zoned
> > storage device model and its embodiment in the ZBC/ZNS specs.
>
> > The configuration space fields already allow the driver to calculate
> > the total number of zones. What is the purpose of partial=0, it
> > doesn't seem to provide any new information?
>
> Both ZBC and ZNS have this categorization. "ZBD standards" can be defined earlier
> in the text as a catchall for ZBC/ZNS.

Great.

> If partial is 0, the nr_zones will contain the number of zones that potentially
> can be reported starting from the request "sector" position and until the end of
> sector range of the device. It does not necessarily equal the total number of
> zones on the drive.

The driver knows zone_sectors and nr_zones, so it can calculate
nr_zones - (sector / zone_sectors) itself. Why send a ZONE_REPORT
partial=0 request to the device to get this value?


>
> >
> > When the device marks a virtqueue buffer used, it indicates the number
> > of bytes written (struct virtq_used_elem's len field). The driver can
> > use this field to determine the number of descriptors filled in by the
> > device. Is the nr_zones field really necessary?
>
> This field can help when the buffer is larger than the returned number of
> descriptors, but nr_zones is still needed for the partial=0 case above.
>
> >
> >
> > > +
> > > +A zone descriptor has the following structure:
> > > +
> > > +\begin{lstlisting}
> > > +struct virtio_blk_zone_descriptor {
> > > +        le64   z_cap;
> > > +        le64   z_start;
> > > +        le64   z_wp;
> > > +        u8     z_type;
> > > +        u8     z_state;
> > > +        u8     reserved[38];
> > > +};
> > > +\end{lstlisting}
> > > +
> > > +The zone descriptor field \field{z_type} \field{virtio_blk_zone_descriptor}
> > > +indicates the type of the zone. The available types are
> > > VIRTIO_BLK_ZT_CONV(1),
> > > +VIRTIO_BLK_ZT_SWR(2) or VIRTIO_BLK_ZT_SWP(3).
> > > +
> > > +\begin{lstlisting}
> > > +#define VIRTIO_BLK_ZT_CONV     1
> > > +#define VIRTIO_BLK_ZT_SWR      2
> > > +#define VIRTIO_BLK_ZT_SWP      3
> > > +\end{lstlisting}
> > > +
> > > +Read and write operations into zones with the VIRTIO_BLK_ZT_CONV
> > > (Conventional)
> > > +type have the same behavior as read and write operations on a regular block
> > > +device. Any block in a conventional zone can be read or written at any time
> > > and
> > > +in any order.
> > > +
> > > +Zones with VIRTIO_BLK_ZT_SWR (Sequential Write Required or SWR) can be read
> > > +randomly, but MUST be written sequentially at a certain point in the zone
> > > called
> > > +the Write Pointer (WP). With every write, the Write Pointer is incremented
> > > by
> > > +the number of sectors written.
> > > +
> > > +Zones with VIRTIO_BLK_ZT_SWP (Sequential Write Preferred or SWP) can be read
> > > +randomly and SHOULD be written sequentially, similarly to SWR zones.
> > > However,
> > > +SWP zones can accept random write operations, that is, VIRTIO_BLK_T_OUT
> > > requests
> > > +with a start sector different from the zone write pointer position.
> > > +
> > > +The field \field{z_state} of \field{virtio_blk_zone_descriptor} indicates
> > > the
> > > +state of the device zone. The available zone states are
> > > VIRTIO_BLK_ZS_NOT_WP(0),
> > > +VIRTIO_BLK_ZS_EMPTY(1), VIRTIO_BLK_ZS_IOPEN(2), VIRTIO_BLK_ZS_EOPEN(3),
> > > +VIRTIO_BLK_ZS_CLOSED(4), VIRTIO_BLK_ZS_RDONLY(13), VIRTIO_BLK_ZS_FULL(14)
> > > and
> > > +VIRTIO_BLK_ZS_OFFLINE(15).
> > > +
> > > +\begin{lstlisting}
> > > +#define VIRTIO_BLK_ZS_NOT_WP   0
> > > +#define VIRTIO_BLK_ZS_EMPTY    1
> > > +#define VIRTIO_BLK_ZS_IOPEN    2
> > > +#define VIRTIO_BLK_ZS_EOPEN    3
> > > +#define VIRTIO_BLK_ZS_CLOSED   4
> > > +#define VIRTIO_BLK_ZS_RDONLY   13
> > > +#define VIRTIO_BLK_ZS_FULL     14
> > > +#define VIRTIO_BLK_ZS_OFFLINE  15
> > > +\end{lstlisting}
> > > +
> > > +Zones of the type VIRTIO_BLK_ZT_CONV are always reported by the device to be
> > > in
> > > +the VIRTIO_BLK_ZS_NOT_WP state. Zones of the types VIRTIO_BLK_ZT_SWR and
> > > +VIRTIO_BLK_ZT_SWP can not transition to the VIRTIO_BLK_ZS_NOT_WP state.
> > > +
> > > +Zones in VIRTIO_BLK_ZS_EMPTY (Empty), VIRTIO_BLK_ZS_IOPEN (Implicitly Open),
> > > +VIRTIO_BLK_ZS_EOPEN (Explicitly Open) and VIRTIO_BLK_ZS_CLOSED (Closed)
> > > state
> > > +are writable, but zones in VIRTIO_BLK_ZS_RDONLY (Read-Only),
> > > VIRTIO_BLK_ZS_FULL
> > > +(Full) and VIRTIO_BLK_ZS_OFFLINE (Offline) state are not. The write pointer
> > > +value (\field{z_wp}) is not valid for Read-Only, Full and Offline zones.
> > > +
> > > +The zone descriptor field \field{z_cap} contains the maximum number of 512-
> > > byte
> > > +sectors that are available to be written with user data when the zone is in
> > > the
> > > +Empty state. This value shall be less than or equal to the
> > > \field{zone_sectors}
> > > +value in \field{virtio_blk_zoned_characteristics} structure in the device
> > > +configuration space.
> > > +
> > > +The zone descriptor field \field{z_start} contains the 64-bit address of the
> > > +first 512-byte sector of the zone.
> > > +
> > > +The zone descriptor field \field{z_wp} contains the 64-bit sector address
> > > where
> > > +the next write operation for this zone should be issued. This value is
> > > undefined
> > > +for conventional zones and for zones in VIRTIO_BLK_ZS_RDONLY,
> > > VIRTIO_BLK_ZS_FULL
> > > +and VIRTIO_BLK_ZS_OFFLINE state.
> > > +
> > > +Depending on their state, zones consume resources as follows:
> > > +\begin{itemize}
> > > +\item a zone in VIRTIO_BLK_ZS_IOPEN and VIRTIO_BLK_ZS_EOPEN state consumes
> > > one
> > > +    open zone resource and, additionally,
> > > +
> > > +\item a zone in VIRTIO_BLK_ZS_IOPEN, VIRTIO_BLK_ZS_EOPEN and
> > > +    VIRTIO_BLK_ZS_CLOSED state consumes one active resource.
> > > +\end{itemize}
> > > +
> > > +Attempts for zone transitions that violate zone resource limits MUST fail
> > > with
> > > +VIRTIO_BLK_S_ZONE_OPEN_RESOURCE or VIRTIO_BLK_S_ZONE_ACTIVE_RESOURCE
> > > +\field{zone_result}.
> >
> > Why use zone_result instead of the status field? The _S_ naming
> > suggests this is a status field constant, not a zone_result constant.
>
> zone_result only a valid ZBD error if the status is VIRTIO_BLK_S_IOERR. If the
> status is VIRTIO_BLK_S_OK, this field may contain ALBA for Zone Append request.
> It is possible to just define the ZBD-specific error codes following
> VIRTIO_BLK_S_UNSUPP, but I think it is cleaner to create a separate numbering for
> ZBD-specific codes and keep the non-zoned errors as-is in their impressive
> simplicity.
>
> The ZBD error codes all currently have _S_ZONE_ naming, but it might be better to
> use something like _ZR_ for example, i.e.
>
> #define VIRTIO_BLK_ZR_INVALID_CMD     0
> #define VIRTIO_BLK_ZR_UNALIGNED_WP    1
> #define VIRTIO_BLK_ZR_OPEN_RESOURCE   2
> #define VIRTIO_BLK_ZR_ACTIVE_RESOURCE 3

IMO multiple levels of result codes adds complexity in the driver
implementation, especially when zoned and non-zoned code is shared. It
can make error handling and reporting in drivers awkward (e.g. do
error messages for zoned devices include a separate zone_result field
while non-zoned devices just have the status field in their error
messages?). I would use status codes but if you really want to keep
the zoned error codes separate then _ZR_ naming is good.

Attachment: signature.asc
Description: PGP signature



[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]