OASIS Mailing List ArchivesView the OASIS mailing list archive below
or browse/search using MarkMail.

 


Help: OASIS Mailing Lists Help | MarkMail Help

virtio-comment message

[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]


Subject: Re: [virtio-dev] [RFC PATCH v2] virtio-blk: add zoned block device specification


On Sun, Jun 12, 2022 at 01:58:36AM +0000, Dmitry Fomichev wrote:
> On Sat, 2022-06-11 at 21:09 +0100, Stefan Hajnoczi wrote:
> > On Thu, Jun 09, 2022 at 08:43:21PM -0400, Dmitry Fomichev wrote:
> > > +Host-managed zoned block devices have their LBA range divided to Sequential
> > > +Write Required (SWR) zones that require some additional handling from the
> > > host
> > > +for sustainable operation. All write requests to SWR zones must be
> > > sequential
> > > +and the zones with some data need to be reset before that data can be
> > > rewritten.
> > > +Host-managed devices support a set of ZBD-specific I/O requests that can be
> > > used
> > > +by the host to manage device zones. Host-managed devices report
> > > VIRTIO_BLK_Z_HM
> > > +value in the \field{model} field in \field{zoned}.
> > 
> > One of
> > "report the VIRTIO_BLK_Z_HM value in ..."
> > "report a VIRTIO_BLK_Z_HM value in ..."
> > "report VIRTIO_BLK_Z_HM in ..."
> > reads more naturally.
> 
> OK.
> 
> > 
> > > +
> > > +Host-aware zoned block devices have their LBA range divided to Sequential
> > > +Write Preferred (SWP) zones that support the random write access, similar to
> > > +regular non-zoned devices. However, the device I/O performance might not be
> > > +optimal if SWP zones are used in a random I/O pattern. SWP zones also
> > > support
> > > +the same set of ZBD-specific I/O requests as host-managed devices that allow
> > > +host-aware devices to be managed by any host that supports zoned block
> > > devices
> > > +to achieve its optimum performance. Host-aware devices report
> > > VIRTIO_BLK_Z_HA
> > > +value in the \field{model} field in \field{zoned}.
> > 
> > Same as above.
> 
> Sure.
> 
> > 
> > > +
> > > +During device operation, SWR and SWP zones can be in one of the following
> > > states:
> > > +empty, implicitly-open, explicitly-open, closed and full. The state machine
> > > that
> > > +governs the transitions between these states is described later in this
> > > document.
> > > +
> > > +SWR and SWP zones consume volatile device resources while being in certain
> > > +states and the device may set limits on the number of zones that can be in
> > > these
> > > +states simultaneously.
> > > +
> > > +Zoned block devices use two internal counters to account for the device
> > > +resources in use, the number of currently open zones and the number of
> > > currently
> > > +active zones.
> > > +
> > > +Any zone state transition from a state that doesn't consume a zone resource
> > > to a
> > > +state that consumes the same resource increments the internal device counter
> > > for
> > > +that resource. Any zone transition out of a state that consumes a zone
> > > resource
> > > +to a state that doesn't consume the same resource decrements the counter.
> > > Any
> > > +request that causes the device to exceed the reported zone resource limits
> > > is
> > > +terminated by the device with a "zone resources exceeded" error as defined
> > > for
> > > +specific commands later.
> > > +
> > >  \begin{lstlisting}
> > >  struct virtio_blk_config {
> > >          le64 capacity;
> > > @@ -4623,6 +4694,15 @@ \subsection{Device configuration
> > > layout}\label{sec:Device Types / Block Device /
> > >          le32 max_secure_erase_sectors;
> > >          le32 max_secure_erase_seg;
> > >          le32 secure_erase_sector_alignment;
> > > +        struct virtio_blk_zoned_characteristics {
> > > +                le32 zone_sectors;
> > > +                le32 max_open_zones;
> > > +                le32 max_active_zones;
> > > +                le32 max_append_sectors;
> > > +                le32 write_granularity;
> > > +                u8 model;
> > > +                u8 unused2[3];
> > > +        } zoned;
> > >  };
> > >  \end{lstlisting}
> > >  
> > > @@ -4686,6 +4766,10 @@ \subsection{Device Initialization}\label{sec:Device
> > > Types / Block Device / Devic
> > >      \field{secure_erase_sector_alignment} can be used by OS when splitting a
> > >      request based on alignment.
> > >  
> > > +\item If the VIRTIO_BLK_F_ZONED feature is negotiated, the fields in
> > > +    \field{zoned} can be read by the driver to determine the zone
> > > +    characteristics of the device. All \field{zoned} fields are read-only.
> > > +
> > >  \end{enumerate}
> > >  
> > >  \drivernormative{\subsubsection}{Device Initialization}{Device Types / Block
> > > Device / Device Initialization}
> > > @@ -4701,6 +4785,33 @@ \subsection{Device Initialization}\label{sec:Device
> > > Types / Block Device / Devic
> > >  The driver MUST NOT read \field{writeback} before setting
> > >  the FEATURES_OK \field{device status} bit.
> > >  
> > > +Drivers SHOULD NOT negotiate VIRTIO_BLK_F_ZONED feature if they are
> > > incapable
> > > +of supporting devices with the VIRTIO_BLK_Z_HM or VIRTIO_BLK_Z_HA zoned
> > > model.
> > > +
> > > +Drivers MAY operate with VIRTIO_BLK_F_ZONED feature negotiated when the
> > > device
> > > +reports VIRTIO_BLK_Z_NONE zoned model for testing and development.
> > 
> > Specifying a particular use case ("testing and development") could be
> > interpreted to imply that other use cases MAY NOT operate with
> > VIRTIO_BLK_F_ZONED.
> > 
> > I suggest dropping the use case from the sentence or moving that sub-point
> > outside the normative section to where the text describes VIRTIO_BLK_Z_NONE.
> > Something like "Devices that offer VIRTIO_BLK_F_ZONED with VIRTIO_BLK_Z_NONE
> > commonly do so for testing and development purposes".
> 
> Good point, will do.
> 
> > 
> > > +Each request is of form:
> > > +
> > > +\begin{lstlisting}
> > > +struct virtio_blk_zoned_req {
> > > +        le32 type;
> > > +        le32 reserved;
> > > +        le64 sector;
> > > +        struct {
> > > +                /* ALL zone operation flag */
> > > +                le32 mgmt_send_all:1;
> > 
> > Bit-fields cannot be used in external interfaces (like hardware
> > interfaces) because the C standard says (N1570 6.7.2.1 11):
> > 
> >   The order of allocation of bit-fields within a unit (high-order to
> >   low-order or low-order to high-order) is implementation-defined.
> > 
> > So this C syntax does not define a specific binary representation, it's
> > up to the compiler or system ABI.
> > 
> 
> Right... I was confused by the fact that virtio_blk_discard_write_zeroes struct
> in the same spec happily uses bit-fields for the unmap flag. Perhaps, that
> flag could be also defined using a bit mask... :)

Ah, good point. I was wrong. See 1.4 Structure Specifications in the
VIRTIO spec:

  Bit-fields within integer fields are always listed in order, from the
  least significant to the most significant bit. The bit-fields are
  considered unsigned integers of the specified width with the next in
  significance relationship of the bits preserved.

So it is okay to use bit-fields in the spec.

It pushes the problem onto implementors of the spec. They cannot
bit-fields in implementations where the C compiler uses a representation
that differs from the VIRTIO spec :).

> 
> > Please use a le32 field and define a bit number:
> > 
> >   #define VIRTIO_BLK_ZONE_MGMT_F_SEND_ALL (1u << 0)
> >   le32 flags;
> > 
> > > +                le32 reserved:31;
> > > +        } zone;
> > > +        u8 data[];
> > > +        le64 append_sector;
> > > +        u8 reserved1[3];
> > > +        u8 status;
> > > +};
> > > +\end{lstlisting}
> > 
> 
> Yep, the ends up to be pretty simple. Using a le32 should be sufficient for
> future command flag additions like the partial flag, etc.
> 
> > Now that the status field is at the end of the struct the zoned commands
> > can be treated like any other struct virtio_blk_req.
> > 
> > The contents of the data[] field for ZONE_OPEN, ZONE_CLOSE, ZONE_FINISH,
> > and ZONE_RESET requests is:
> > 
> >   #define VIRTIO_BLK_ZONE_MGMT_F_SEND_ALL (1u << 0)
> >   struct virtio_blk_zoned_mgmt_send {
> >       le32 flags;
> >   };
> > 
> > The contents of the data[] field for ZONE_APPEND is:
> > 
> >   u8 payload[];            /* driver -> device */
> > 
> > followed by:
> > 
> >   le64 append_sector;      /* device -> driver */
> > 
> > The contents of the data[] field for ZONE_REPORT is struct
> > virtio_blk_zone_report.
> > 
> > The reason I'm pushing for getting rid of struct virtio_blk_zoned_req is
> > that the zbd-specific fields are not common to all ZONE_* requests. I
> > don't see a reason to define a common request layout anymore.
> > 
> > Also, it would make the zoned command design similar to struct
> > virtio_blk_discard_write_zeroes where request metadata is a separate
> > struct that is located in struct virtio_blk_req's data[] field instead
> > of a new top-level struct virtio_blk_discard_write_zeroes_request that
> > duplicates struct virtio_blk_req fields.
> 
> My concern is that this approach requires quite a few of runtime checks to be
> added in the hot path - if (alba), if (flags), request opcode checks... My PoC
> code that uses virtio_blk_zoned_req adds zero checks, both for zoned and non-
> zoned operation.
> 
> With discard, there is no other way to handle it except to introduce
> virtio_blk_discard_write_zeroes because the number of incoming discard segments
> is variable, but for zoned operation it should be sufficient to use a simpler
> solution with the virtio_blk_zoned_req which seems to be less fragile in terms of
> implementation.

Can you describe where new runtime checks are required?

The content of data[] is specific to each request type. Driver and
device implementations already have conditional branches for handling
different request types. The data[] layout code goes into those branches
and it should therefore not require new runtime checks. In other words,
the ZONE_APPEND code knows statically at compile-time that the
append_sectors field is needed.

So I think any runtime checks that may exist in driver code right now
are accidental and can be avoided.

Stefan

Attachment: signature.asc
Description: PGP signature



[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]