OASIS Mailing List ArchivesView the OASIS mailing list archive below
or browse/search using MarkMail.

 


Help: OASIS Mailing Lists Help | MarkMail Help

virtio-dev message

[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]


Subject: Re: [virtio-dev] [RFC PATCH] virtio-blk: add zoned block device specification


On Fri, 2022-06-03 at 09:08 +0100, Stefan Hajnoczi wrote:
> (Resent from my subscriber email.)
> 
> On Fri, 3 Jun 2022 at 03:30, Dmitry Fomichev <Dmitry.Fomichev@wdc.com> wrote:
> > 
> > On Thu, 2022-06-02 at 08:23 +0100, Stefan Hajnoczi wrote:
> > > (Resending from my mailing list subscriber address so this makes it onto
> > > the list.)
> > > 
> > > On Thu, 2 Jun 2022 at 00:56, Dmitry Fomichev <dmitry.fomichev@wdc.com>
> > > wrote:
> > > > 
> > > > Introduce support for Zoned Block Devices to virtio.
> > > > 
> > > > Zoned Block Devices (ZBDs) aim to achieve a better capacity, latency
> > > > and/or cost characteristics compared to commonly available block
> > > > devices by getting the entire LBA space of the device divided to block
> > > > regions that are much larger than the LBA size. These regions are
> > > > called zones and they can only be written sequentially. More details
> > > > about ZBDs can be found at
> > > > 
> > > > https://zonedstorage.io/docs/introduction/zoned-storageÂ;.
> > > > 
> > > > In its current form, the virtio protocol for block devices (virtio-blk)
> > > > is not aware of ZBDs but it allows the guest to successfully scan a
> > > > host-managed drive provided by the host. As the result, the
> > > > host-managed drive appears at the guest as a regular drive that will
> > > > operate erroneously under the most common write workloads.
> > > > 
> > > > To fix this, the virtio-blk protocol needs to be extended to add the
> > > > capabilities to convey the zone characteristics of host ZBDs to the
> > > > guest and to provide the support for ZBD-specific commands - Report
> > > > Zones, four zone operations and (optionally) Zone Append. The proposed
> > > > standard extension aims to provide this functionality.
> > > > 
> > > > This patch extends the virtio-blk section of virtio specification with
> > > > the minimum set of requirements that are necessary to support ZBDs.
> > > > The resulting device model is a subset of the models defined in ZAC/ZBC
> > > > and ZNS standards documents. The included functionality mirrors
> > > > the existing Linux kernel block layer ZBD support and should be
> > > > sufficient to handle the host-managed and host-aware HDDs that are on
> > > > the market today as well as ZNS SSDs that are entering the market at
> > > > the moment of this patch submission.
> > > > 
> > > > I have developed a proof of concept patch series that adds ZBD support
> > > > to virtio-blk Linux kernel driver by implementing the protocol
> > > > extensions defined in the spec patch. I would like to receive the
> > > > initial feedback on this specification patch before posting that series
> > > > to the block LKML.
> > > > 
> > > > I would like to thank the following people for their useful feedback
> > > > and suggestions while working on the initial iterations of this patch.
> > > > 
> > > > Damien Le Moal <damien.lemoal@opensource.wdc.com>
> > > > Matias BjÃrling <Matias.Bjorling@wdc.com>
> > > > Niklas Cassel <Niklas.Cassel@wdc.com>
> > > > Hans Holmberg <Hans.Holmberg@wdc.com>
> > > > 
> > > > Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
> > > > ---
> > > > Âcontent.tex | 686 +++++++++++++++++++++++++++++++++++++++++++++++++++-
> > > > Â1 file changed, 685 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/content.tex b/content.tex
> > > > index 7508dd1..8ae7578 100644
> > > > --- a/content.tex
> > > > +++ b/content.tex
> > 

<snip>

> > > 
> > > Here is the non-union representation of the structs. Note that this
> > > means existing request processing code can handle IN/OUT/FLUSH even
> > > when VIRTIO_BLK_F_ZONED is negotiated. We don't need two different
> > > code paths for these common commands:
> > > 
> > > /* IN/OUT/FLUSH/DISCARD/WRITE_ZEROES */
> > > struct virtio_blk_req {
> > > ÂÂÂÂÂÂÂ le32 type;
> > > ÂÂÂÂÂÂÂ le32 reserved;
> > > ÂÂÂÂÂÂÂ le64 sector;
> > > ÂÂÂÂÂÂÂ u8 data[];
> > > ÂÂÂÂÂÂÂ u8 status;
> > > };
> > > 
> > > /* ZONE_APPEND */
> > > struct virtio_blk_req_zone_append {
> > > ÂÂÂÂÂÂÂ le32 type;
> > > ÂÂÂÂÂÂÂ le32 reserved;
> > > ÂÂÂÂÂÂÂ le64 sector;
> > > ÂÂÂÂÂÂÂ u8 data[];
> > > ÂÂÂÂÂÂÂ le64 zone_result;
> > > ÂÂÂÂÂÂÂ u8 status;
> > > };
> > > 
> > > /* ZONE_OPEN, ZONE_CLOSE, ZONE_FINISH, ZONE_RESET */
> > > struct virtio_blk_req_zone_mgmt_send {
> > > ÂÂÂÂÂÂÂ le32 type;
> > > ÂÂÂÂÂÂÂ le32 reserved;
> > > ÂÂÂÂÂÂÂ le64 sector;
> > > ÂÂÂÂÂÂÂ /* ALL zone operation flag */
> > > ÂÂÂÂÂÂÂ __u8 all;
> > > ÂÂÂÂÂÂÂ __u8 unused1[3];
> > > ÂÂÂÂÂÂÂ u8 data[];
> > > ÂÂÂÂÂÂÂ u8 status;
> > > };
> > > 
> > > /* ZONE_REPORT */
> > > struct virtio_blk_req_zone_mgmt_receive {
> > > ÂÂÂÂÂÂÂ le32 type;
> > > ÂÂÂÂÂÂÂ le32 reserved;
> > > ÂÂÂÂÂÂÂ le64 sector;
> > > ÂÂÂÂÂÂÂ /* Partial zone report flag */
> > > ÂÂÂÂÂÂÂ __u8 partial;
> > > ÂÂÂÂÂÂÂ __u8 unused2[3];
> > > ÂÂÂÂÂÂÂ u8 data[];
> > > ÂÂÂÂÂÂÂ u8 status;
> > > };
> > 
> > We felt that it is less confusing to have a single ZBD request structure so
> > there
> > is 1:1 relationship between the VIRTIO_BLK_F_ZONED bit value and the choice
> > of
> > the request -
> > 
> > VIRTIO_BLK_F_ZONED off ->Â virtio_blk_req
> > VIRTIO_BLK_F_ZONED on ->Â virtio_blk_zoned_req
> > 
> > We also want to make sure that zone management/send/receive and append
> > requests
> > are always the same size and it should be easier with a single request
> > structure.
> > 
> > It is possible to use a simpler, but less structured union here -
> > 
> > struct virtio_blk_zoned_req {
> > ÂÂÂÂÂÂÂ le32 type;
> > ÂÂÂÂÂÂÂ le32 reserved;
> > ÂÂÂÂÂÂÂ le64 sector;
> > ÂÂÂÂÂÂÂ union zoned_params {
> > ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ /* ALL zone operation flag */
> > ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ __u8 all;
> > ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ /* Partial zone report flag */
> > ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ __u8 partial;
> > ÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ __u32 rsvd1;
> > ÂÂÂÂÂÂÂ } zone;
> > ÂÂÂÂÂÂÂ u8 data[];
> > ÂÂÂÂÂÂÂ le64 zone_result;
> > ÂÂÂÂÂÂÂ u8 reserved1[3];
> > ÂÂÂÂÂÂÂ u8 status;
> > }
> > 
> > Using the single code path is very important and the PoC Linux driver code
> > that I
> > have sketched up achieves that. Since the request header is identical for the
> > both cases, I hope that it can be done in QEMU too.
> 
> There are two possibilities for optimizing the request layout:
> 1. For drivers and devices that implement both !VIRTIO_BLK_F_ZONED and
> VIRTIO_BLK_F_ZONED. Here it's best to share the same request layout
> for commands available in both modes. In other words IN and OUT
> requests to zoned devices should use struct virtio_blk_req so code can
> be shared and only commands unique to zoned devices use struct
> virtio_blk_zoned_req.
> 2. For drivers and devices that implement only one of
> !VIRTIO_BLK_F_ZONED and VIRTIO_BLK_F_ZONED. Here consistent request
> layout is best so the code parses all commands in the same way. In
> other words, IN and OUT requests to zoned devices use struct
> virtio_blk_zoned_req so that the implementation doesn't have to bother
> with struct virtio_blk_req at all.
> 
> I think you chose #2 and I gravitated towards #1. Which one is optimal
> depends on the use case. Feel free to stick with #2.
> 

OK. We certainly considered #1, but we found that in Linux virtio-blk driver
implementation, the request size is fixed. There is the member called vblk-
>tag_set.cmd_size and it is set to sizeof(struct virtblk_req) + <sg list size> as
a part of the device scan routine. Since the Zone Append request has to be larger
than the regular (non-zoned) one by at least 8 bytes to accommodate the ALBA
field, we can't mix and match the two for the same device. Here is how this
device initialization part can be modified to handle the VIRTIO_BLK_F_ZONED
feature bit if we stick with #2 -


@@ -810,13 +1287,18 @@ static int virtblk_probe(struct virtio_device *vdev)
        }
 
        memset(&vblk->tag_set, 0, sizeof(vblk->tag_set));
-       vblk->tag_set.ops = &virtio_mq_ops;
        vblk->tag_set.queue_depth = queue_depth;
        vblk->tag_set.numa_node = NUMA_NO_NODE;
        vblk->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
-       vblk->tag_set.cmd_size =
-               sizeof(struct virtblk_req) +
-               sizeof(struct scatterlist) * VIRTIO_BLK_INLINE_SG_CNT;
+       if (virtio_has_feature(vdev, VIRTIO_BLK_F_ZONED)) {
+               vblk->tag_set.ops = &virtio_mq_zoned_ops;
+               vblk->tag_set.cmd_size = sizeof(struct virtblk_zoned_req);
+       } else {
+               vblk->tag_set.ops = &virtio_mq_ops;
+               vblk->tag_set.cmd_size = sizeof(struct virtblk_req);
+       };
+       vblk->tag_set.cmd_size +=
+                       sizeof(struct scatterlist) * VIRTIO_BLK_INLINE_SG_CNT;
        vblk->tag_set.driver_data = vblk;
        vblk->tag_set.nr_hw_queues = vblk->num_vqs;

> 
> > 
> > > 
> > > > +};
> > > > +\end{lstlisting}
> > > > +
> > > > +In addition to the request types defined for non-zoned devices, the type
> > > > of
> > > > the
> > > > +request can be a zone report (VIRTIO_BLK_T_ZONE_REPORT), an explicit
> > > > zone
> > > > open
> > > > +(VIRTIO_BLK_T_ZONE_OPEN), an explicit zone close
> > > > (VIRTIO_BLK_T_ZONE_CLOSE),
> > > > a
> > > > +zone finish (VIRTIO_BLK_T_ZONE_FINISH), a zone_append
> > > > (VIRTIO_BLK_T_ZONE_APPEND)
> > > > +or a zone reset (VIRTIO_BLK_T_ZONE_RESET).
> > > > +
> > > > +\begin{lstlisting}
> > > > +#define VIRTIO_BLK_T_ZONE_APPENDÂÂÂ 15
> > > > +#define VIRTIO_BLK_T_ZONE_REPORTÂÂÂ 16
> > > > +#define VIRTIO_BLK_T_ZONE_OPENÂÂÂÂÂ 18
> > > > +#define VIRTIO_BLK_T_ZONE_CLOSEÂÂÂÂ 20
> > > > +#define VIRTIO_BLK_T_ZONE_FINISHÂÂÂ 22
> > > > +#define VIRTIO_BLK_T_ZONE_RESETÂÂÂÂ 24
> > > > +\end{lstlisting}
> > > > +
> > > > +Requests of the type VIRTIO_BLK_T_ZONE_REPORT are reads and requests of
> > > > the
> > > > type
> > > 
> > > Here "reads" means data transfers from the device to the driver rather
> > > than disk reads. I found that confusing. Maybe
> > > "VIRTIO_BLK_T_ZONE_REPORT transfer data from the device"?
> > 
> > Sure, your wording is better.
> > 
> > > For a cur
> > > > +VIRTIO_BLK_T_ZONE_APPEND are writes. VIRTIO_BLK_T_ZONE_OPEN,
> > > > +VIRTIO_BLK_T_ZONE_CLOSE, VIRTIO_BLK_T_ZONE_FINISH and
> > > > VIRTIO_BLK_T_ZONE_RESET
> > > > +are non-data requests.
> > > > +
> > > > +In ZBD standards, the VIRTIO_BLK_T_ZONE_REPORT request belongs to "Zone
> > > 
> > > "ZBD standards" is not defined. I think this refers to the zoned
> > > storage device model and its embodiment in the ZBC/ZNS specs.
> > 
> > > The configuration space fields already allow the driver to calculate
> > > the total number of zones. What is the purpose of partial=0, it
> > > doesn't seem to provide any new information?
> > 
> > Both ZBC and ZNS have this categorization. "ZBD standards" can be defined
> > earlier
> > in the text as a catchall for ZBC/ZNS.
> 
> Great.
> 
> > If partial is 0, the nr_zones will contain the number of zones that
> > potentially
> > can be reported starting from the request "sector" position and until the end
> > of
> > sector range of the device. It does not necessarily equal the total number of
> > zones on the drive.
> 
> The driver knows zone_sectors and nr_zones, so it can calculate
> nr_zones - (sector / zone_sectors) itself. Why send a ZONE_REPORT
> partial=0 request to the device to get this value?
> 

As Damien mentioned in the other reply, the partial bit is needed for the cases
when this calculation is not possible, i.e. when the zone size varies. Yes, we
can omit it for now and possibly add a new feature bit in the future if there's
need to handle ZBDs with varying zone sizes.

> 
<snip>

> > zone_result only a valid ZBD error if the status is VIRTIO_BLK_S_IOERR. If
> > the
> > status is VIRTIO_BLK_S_OK, this field may contain ALBA for Zone Append
> > request.
> > It is possible to just define the ZBD-specific error codes following
> > VIRTIO_BLK_S_UNSUPP, but I think it is cleaner to create a separate numbering
> > for
> > ZBD-specific codes and keep the non-zoned errors as-is in their impressive
> > simplicity.
> > 
> > The ZBD error codes all currently have _S_ZONE_ naming, but it might be
> > better to
> > use something like _ZR_ for example, i.e.
> > 
> > #define VIRTIO_BLK_ZR_INVALID_CMDÂÂÂÂ 0
> > #define VIRTIO_BLK_ZR_UNALIGNED_WPÂÂÂ 1
> > #define VIRTIO_BLK_ZR_OPEN_RESOURCEÂÂ 2
> > #define VIRTIO_BLK_ZR_ACTIVE_RESOURCE 3
> 
> IMO multiple levels of result codes adds complexity in the driver
> implementation, especially when zoned and non-zoned code is shared. It
> can make error handling and reporting in drivers awkward (e.g. do
> error messages for zoned devices include a separate zone_result field
> while non-zoned devices just have the status field in their error
> messages?). I would use status codes but if you really want to keep
> the zoned error codes separate then _ZR_ naming is good.

Ok, let's just define the new zone error codes following VIRTIO_BLK_S_UNSUPP.
This is more straightforward and, as an added benefit, the ALBA field can be
named more appropriately in this case.


[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]