Subject: Re: [virtio-dev] virtio packed ring concerns


On Wed, 2020-01-29 at 09:41 +0000, Stefan Hajnoczi wrote:
> On Tue, Jan 28, 2020 at 05:26:07PM +0000, Walker, Benjamin wrote:
> > Hi all - I'm a core maintainer for the SPDK project, which includes a
> > vhost-user target implementation that handles both virtio-blk and
> > virtio-scsi. Recently we had some engineers go off and attempt to add
> > support for packed virtqueues, as defined here:
> > https://github.com/oasis-tcs/virtio-spec/blob/master/packed-ring.tex
> > 
> > While reviewing the patches they submitted I noticed two separate
> > issues. The first is that, by spec, the ring cannot function correctly
> > under certain circumstances. The second is a performance issue. I'm
> > going to try to keep the two entirely separate.
> > 
> > Comments are very welcome here, and I'm hoping I'm just missing some
> > critical piece of information that resolves the whole thing, but testing
> > so far shows the problem is real.
> > 
> > First, the functional issue. The problem scenario involves out-of-order
> > completions during ring wrap-around. Here are the most relevant sections
> > of the specification:
> > 
> > ~~~
> > Note: after reading driver descriptors and starting their processing in
> > order, the device might complete their processing out of order.  Used
> > device descriptors are written in the order in which their processing is
> > complete.
> > 
> > The counter maintained by the driver is called the Driver Ring Wrap
> > Counter. The driver changes the value of this counter each time it makes
> > available the last descriptor in the ring (after making the last
> > descriptor available).
> > 
> > The counter maintained by the device is called the Device Ring Wrap
> > Counter. The device changes the value of this counter each time it uses
> > the last descriptor in the ring (after marking the last descriptor
> > used).
> > 
> > To mark a descriptor as available, the driver sets the
> > VIRTQ_DESC_F_AVAIL bit in Flags to match the internal Driver Ring Wrap
> > Counter.  It also sets the VIRTQ_DESC_F_USED bit to match the inverse
> > value (i.e. to not match the internal Driver Ring Wrap Counter).
> > 
> > To mark a descriptor as used, the device sets the VIRTQ_DESC_F_USED bit
> > in Flags to match the internal Device Ring Wrap Counter.  It also sets
> > the VIRTQ_DESC_F_AVAIL bit to match the same value.
> > ~~~
> > 
> > Imagine a packed ring with 4 descriptors. At some earlier time, say the
> > first 3 descriptors were submitted and completed, in order. Then two
> > more descriptors were submitted (so they went into slot 3 and then slot
> > 0). Each time through the ring, the "ring wrap counter" toggles from 1
> > to 0 or vice versa. The avail and used flags are initialized to 1 at
> > start up and the ring wrap counter is 0.
> 
> Both the Driver and Device Ring Wrap Counters are initialized to 1
> according to the spec, not 0.  This doesn't matter for the rest of the
> discussion though.  I just wanted to mention it in case your
> implementation has the value inverted.
> 
> > So when submitting into slot 3, the driver flips the avail bit from 1
> > to 0. When submitting into slot 0 after the ring wraps, it flips the
> > avail bit from 0 to 1. The spec says it also sets the used flag, but
> > the used flag is already the correct value. That's all solid so far.
> > 
> > The device is then polling on slot 3 waiting for new descriptors to be
> > submitted. Note here that the device must maintain its own copy of what
> > it thinks is the driver's ring wrap counter value to detect descriptors
> > being made available, which isn't called out in the specification at
> > all but all actual implementations do. Regardless, once it sees the
> > avail flag in the descriptor in slot 3 flip, it begins processing the
> > descriptor. It then moves on to poll slot 0 for incoming descriptors
> > (and updates its expected avail ring wrap counter value internally). In
> > this case, it sees the bit flip for slot 0 and begins processing the
> > descriptor. Descriptor processing is asynchronous and may complete out
> > of order, at least for storage devices.
> > 
> > So let's say that the descriptors do, in fact, complete out of order,
> > such that the descriptor in slot 0 completes first and the descriptor
> > in slot 3 completes second. The specification is very clear that this
> > is allowed.
> > 
> > When completing the descriptor in slot 0, the device has not yet
> > completed the last descriptor in the ring (they're out of order!), so
> > the ring wrap counter is still 0. It then proceeds to write both the
> > avail flag and the used flag to 0. But the driver is expecting the used
> > flag to get flipped to 1 to indicate a completion. This is broken.
> 
> You're saying used descriptors overwrite the same descriptor slot where
> they were made available by the driver.  This is not the case.  The
> device overwrites the descriptor ring in completion order.  See my
> response to your concern about performance for an illustration.

Thank you! Now I get it! Here's the portion of the spec that describes this
behavior. I agonized over it for hours trying to decipher what it was saying,
and even sent it to several other engineers, and we still misunderstood it.

~~~
The driver then notifies the device. When the device has finished
processing the buffer, it writes a used device descriptor
including the Buffer ID into the Descriptor Ring (overwriting a
driver descriptor previously made available), and sends a
used event notification.

The Descriptor Ring is used in a circular manner: the driver writes
descriptors into the ring in order. After reaching the end of the ring,
the next descriptor is placed at the head of the ring.  Once the ring is
full of driver descriptors, the driver stops sending new requests and
waits for the device to start processing descriptors and to write out
some used descriptors before making new driver descriptors
available.

Similarly, the device reads descriptors from the ring in order and
detects that a driver descriptor has been made available.  As
processing of descriptors is completed, used descriptors are
written by the device back into the ring.
~~~

Note that the wording says the "driver writes descriptors into the ring in
order" - it doesn't say the device does. It does say the device "reads
descriptors from the ring in order", but it doesn't say it writes them back in
order. The only hint of the behavior you're describing is in the first
paragraph, where it says "(overwriting a driver descriptor previously made
available)" - why would it be worded that way if the device were updating a
descriptor in place rather than completely replacing it? But unfortunately I
wasn't able to piece it all together correctly.
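
To make this concrete for anyone else who trips over the same wording, here's
a minimal C sketch of the device side under this model. The descriptor layout
and flag bits are straight from the spec; the state structure, the names
(device_ring_state, device_write_used, and so on), and the omission of memory
barriers, endianness handling, and notifications are simplifications of my
own.

~~~
#include <stdint.h>
#include <stdbool.h>

/* Flag bits from the virtio spec: bit 7 is AVAIL, bit 15 is USED. */
#define VIRTQ_DESC_F_AVAIL (1 << 7)
#define VIRTQ_DESC_F_USED  (1 << 15)

/* Packed ring descriptor layout per the spec. */
struct vring_packed_desc {
    uint64_t addr;
    uint32_t len;
    uint16_t id;
    uint16_t flags;
};

/* Hypothetical device-side state: one cursor for reading driver
 * descriptors in order, a separate cursor for writing used descriptors
 * in completion order, and a wrap counter for each. */
struct device_ring_state {
    struct vring_packed_desc *ring;
    uint16_t size;        /* number of descriptors in the ring */
    uint16_t next_read;   /* next slot to poll for an available descriptor */
    uint16_t next_write;  /* next slot to overwrite with a used descriptor */
    bool avail_wrap;      /* expected value of the Driver Ring Wrap Counter */
    bool used_wrap;       /* Device Ring Wrap Counter */
};

/* Available: AVAIL matches the driver's wrap counter, USED does not. */
static bool desc_is_avail(const struct vring_packed_desc *d, bool wrap)
{
    bool avail = !!(d->flags & VIRTQ_DESC_F_AVAIL);
    bool used = !!(d->flags & VIRTQ_DESC_F_USED);

    return avail == wrap && used != wrap;
}

/* Write a used descriptor into the next *write* slot, in completion
 * order, overwriting whatever driver descriptor happens to be there.
 * This is what the spec means by "overwriting a driver descriptor
 * previously made available" - the device does not write back into the
 * slot the request was submitted from. */
static void device_write_used(struct device_ring_state *dev,
                              uint16_t buf_id, uint32_t bytes_written)
{
    struct vring_packed_desc *d = &dev->ring[dev->next_write];
    /* Used: both AVAIL and USED match the Device Ring Wrap Counter. */
    uint16_t flags = dev->used_wrap ?
        (VIRTQ_DESC_F_AVAIL | VIRTQ_DESC_F_USED) : 0;

    d->id = buf_id;
    d->len = bytes_written;
    d->flags = flags;  /* a real device publishes flags last, after a barrier */

    if (++dev->next_write == dev->size) {
        dev->next_write = 0;
        dev->used_wrap = !dev->used_wrap;  /* toggle after the last slot */
    }
}
~~~

With two independent cursors, out-of-order completion is no problem:
whichever request finishes first gets the next write slot, and the driver
consumes used descriptors from that same sequence.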

> 
> > I wrote a unit test for the SPDK implementation and it is indeed broken
> > in this scenario, causing the driver to hang. I've also at least looked
> > through the code in the DPDK and Linux kernel implementations, and they
> > appear broken in various ways as well. It appears that DPDK may decide
> > the ring is corrupt and "drop the packet" in this scenario, which is
> > data corruption for storage, but I haven't yet run a test to confirm.
> > The Linux kernel implementation looks to me like it will never consider
> > the descriptor complete and will eventually hang.
> > 
> > The device doesn't actually need to track a ring wrap counter for the
> > used flag at all. The device does need to track a ring wrap counter for
> > the expected value of the avail bit, to mirror the driver-side value
> > while polling for incoming descriptors. But to mark descriptors as
> > used, it simply needs to set the used flag to the current value of the
> > avail flag in that descriptor.
> > 
> > ----------------------
> > 
> > Moving on to my separate concerns about performance. Critically, it's
> > important to note that storage devices these days are inherently
> > parallel pieces of hardware. You can submit a large number of commands
> > at once and the device actually does process some number of them
> > simultaneously. That means things complete out of order all the time.
> > For example, imagine submitting two commands to a NAND flash storage
> > device where one is a 1MB read and the second is a 4K read. Let's say
> > the data these are reading is on entirely separate NAND chips
> > internally. That 4K read will complete ahead of the 1MB read, and it
> > would be awful to force that small read to wait on the larger one.
> > 
> > But with the packed ring design, there's only one ring for both
> > submission and completion, so descriptor slots can only be re-used in
> > order. Even if the device implementation is able to complete the 4K
> > read out of order ahead of time and mark the descriptor as used, a
> > driver that's polling for completions won't see it until the earlier
> > descriptor completes. A driver using interrupts may be able to see it
> > if the interrupt tells it specifically which slot to look at, but it
> > can't re-use the descriptor slot for a new command because submissions
> > must go in order. While the bug I outlined in the first part of my
> > email is fixable, this design issue is not.
> 
> The spec says:
> 
>   When the device has finished processing the buffer, it writes a used
>   device descriptor including the Buffer ID into the Descriptor Ring
>   (overwriting a driver descriptor previously made available)
> 
>   ...
> 
>   Note: after reading driver descriptors and starting their processing
>   in order, the device might complete their processing out of order.
>   Used device descriptors are written in the order in which their
>   processing is complete.
> 
> Say the ring has the following requests:
> 
>   | 1M (Avail) | 4K (Avail) |
> 
> The device sees both requests and completes the 4K request.  The device
> overwrites the oldest available descriptor:
> 
>   | 4K (Used)  | 4K (Avail/seen) |
> 
> The driver polls the first descriptor for completions.  Therefore, the
> driver can process the 4K request completion while the 1M request is
> still pending.  Now the driver can submit the next request while the 1M
> request is in flight:
> 
>   | 8K (Avail) | 4K (Avail/seen) |
> 
> With this mental model of how the packed ring layout works I think
> neither of the issues you've raised is a problem.  Have I missed
> anything?
> 

Now that I see that used descriptors are written sequentially into the ring in
completion order, replacing whatever driver descriptor was previously in the
slot, I no longer have performance concerns.
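
For completeness, the driver side of the same sketch I posted above - again
with hypothetical names, and with notifications and barriers omitted - only
ever needs to poll the one slot where the next used descriptor will land:

~~~
/* Hypothetical driver-side completion state, continuing the earlier
 * sketch. */
struct driver_ring_state {
    struct vring_packed_desc *ring;
    uint16_t size;
    uint16_t next_used;  /* slot where the next used descriptor will appear */
    bool used_wrap;      /* expected value of the Device Ring Wrap Counter */
};

/* Returns the Buffer ID of a completed request, or -1 if none is
 * ready. Because the device writes used descriptors into the ring in
 * completion order, polling this single slot is sufficient even when
 * requests complete out of order. */
static int driver_poll_used(struct driver_ring_state *drv)
{
    const struct vring_packed_desc *d = &drv->ring[drv->next_used];
    bool avail = !!(d->flags & VIRTQ_DESC_F_AVAIL);
    bool used = !!(d->flags & VIRTQ_DESC_F_USED);

    /* Used: AVAIL == USED == the device's wrap counter. */
    if (avail != used || used != drv->used_wrap)
        return -1;

    if (++drv->next_used == drv->size) {
        drv->next_used = 0;
        drv->used_wrap = !drv->used_wrap;
    }
    return d->id;
}
~~~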

> Stefan


