Subject: Re: [virtio-dev] virtio packed ring concerns


On Tue, Jan 28, 2020 at 05:26:07PM +0000, Walker, Benjamin wrote:
> Hi all - I'm a core maintainer for the SPDK project, which includes a vhost-user 
> target implementation that handles both virtio-blk and virtio-scsi. Recently we
> had some engineers go off and attempt to add support for packed virtqueues, as
> defined here: 
> https://github.com/oasis-tcs/virtio-spec/blob/master/packed-ring.tex
> 
> While reviewing the patches they submitted I noticed two separate issues. The
> first is that the ring, as specified, cannot function correctly under certain
> circumstances. The second is a performance issue. I'm going to try to keep
> the two entirely separate.
> 
> Comments are very welcome here, and I'm hoping I'm just missing some critical
> piece of information that resolves the whole thing, but testing so far
> suggests this is a real problem.
> 
> First, the functional issue. The problem scenario involves out-of-order
> completions during ring wrap-around. Here are the most relevant sections of the
> specification:
> 
> ~~~
> Note: after reading driver descriptors and starting their processing in order,
> the device might complete their processing out of order.  Used device
> descriptors are written in the order in which their processing is complete.
> 
> The counter maintained by the driver is called the Driver Ring Wrap Counter. The
> driver changes the value of this counter each time it makes available the last
> descriptor in the ring (after making the last descriptor available).
> 
> The counter maintained by the device is called the Device Ring Wrap
> Counter.  The device changes the value of this counter
> each time it uses the last descriptor in
> the ring (after marking the last descriptor used).
> 
> To mark a descriptor as available, the driver sets the VIRTQ_DESC_F_AVAIL bit in
> Flags to match the internal Driver Ring Wrap Counter.  It also sets the
> VIRTQ_DESC_F_USED bit to match the inverse value (i.e. to not match the internal
> Driver Ring Wrap Counter).
> 
> To mark a descriptor as used, the device sets the VIRTQ_DESC_F_USED bit in Flags
> to match the internal Device Ring Wrap Counter.  It also sets the
> VIRTQ_DESC_F_AVAIL bit to match the *same* value.
> ~~~
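
For what it's worth, those marking rules boil down to something like the
following C sketch (the descriptor layout and flag bits are from the spec;
the helper names and simplifications are mine):

~~~
#include <stdbool.h>
#include <stdint.h>

/* Flag bits from the packed ring spec. */
#define VIRTQ_DESC_F_AVAIL (1 << 7)
#define VIRTQ_DESC_F_USED  (1 << 15)

/* Descriptor layout from the spec (little-endian fields in practice). */
struct pvirtq_desc {
    uint64_t addr;
    uint32_t len;
    uint16_t id;
    uint16_t flags;
};

/* Driver: avail bit matches the Driver Ring Wrap Counter, used bit is
 * the inverse.  Other flag bits and the required write barrier before
 * the flags store are omitted for brevity. */
static void driver_mark_avail(struct pvirtq_desc *d, bool driver_wrap)
{
    d->flags = driver_wrap ? VIRTQ_DESC_F_AVAIL : VIRTQ_DESC_F_USED;
}

/* Device: both bits match the Device Ring Wrap Counter. */
static void device_mark_used(struct pvirtq_desc *d, bool device_wrap)
{
    d->flags = device_wrap ? (VIRTQ_DESC_F_AVAIL | VIRTQ_DESC_F_USED) : 0;
}
~~~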
> 
> Imagine a packed ring with 4 descriptors. At some earlier time let's say the
> first 3 descriptors were submitted and completed, in order. Then, two more
> descriptors were submitted (so they went into slot 3 and then 0). Each time
> through the ring, the "ring wrap counter" toggles from 1 to 0 or vice versa. The
> avail and used flags are initialized to 1 at startup and the ring wrap counter
> is 0.

Both the Driver and Device Ring Wrap Counters are initialized to 1
according to the spec, not 0.  This doesn't matter for the rest of the
discussion though.  I just wanted to mention it in case your
implementation has the value inverted.
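
Concretely, the initial state looks like this (a sketch; the variable names
are mine):

~~~
#include <stdbool.h>

/* Both wrap counters start at 1 per the spec, so on the first lap the
 * driver marks descriptors available with avail = 1 and used = 0. */
bool driver_ring_wrap_counter = true;
bool device_ring_wrap_counter = true;
~~~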

> So when submitting into slot 3, the driver flips the avail bit from 1 to 0. When
> submitting into slot 0 after the ring-wrap, it flips the avail bit from 0 to 1.
> The spec says it also sets the used flag, but the used flag is already the
> correct value. That's all solid so far.
> 
> The device is then polling on slot 3 waiting for new descriptors to be
> submitted. Note here that the device must maintain its own copy of what it
> thinks is the driver's ring wrap counter value in order to detect descriptors
> being made available; this isn't called out in the specification at all, but
> all actual implementations do it. Regardless, once it sees the avail flag in
> the descriptor in
> slot 3 flip, it begins processing the descriptor. It then moves on to poll slot
> 0 for incoming descriptors (and updates its expected avail ring wrap counter
> value internally). In this case, it sees the bit flip for slot 0 and begins
> processing the descriptor. Descriptor processing is asynchronous and may
> complete out of order, at least for storage devices.
> 
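That device-side polling with a shadow copy of the driver's wrap counter
looks something like this (a C sketch; the names are illustrative, not taken
from SPDK or DPDK):

~~~
#include <stdbool.h>
#include <stdint.h>

#define VIRTQ_DESC_F_AVAIL (1 << 7)
#define VIRTQ_DESC_F_USED  (1 << 15)

struct pvirtq_desc {
    uint64_t addr;
    uint32_t len;
    uint16_t id;
    uint16_t flags;
};

struct device_poll_state {
    uint16_t next_avail;   /* next slot to check for a new descriptor */
    uint16_t size;         /* ring size */
    bool expected_wrap;    /* shadow of the Driver Ring Wrap Counter */
};

/* A descriptor is available when avail matches the driver's wrap
 * counter and used is its inverse. */
static bool desc_is_avail(const struct pvirtq_desc *d, bool wrap)
{
    bool avail = !!(d->flags & VIRTQ_DESC_F_AVAIL);
    bool used  = !!(d->flags & VIRTQ_DESC_F_USED);
    return avail == wrap && used != wrap;
}

/* Returns the slot of a newly available descriptor, or -1 if none. */
static int device_poll_avail(struct device_poll_state *s,
                             const struct pvirtq_desc *ring)
{
    if (!desc_is_avail(&ring[s->next_avail], s->expected_wrap))
        return -1;
    int slot = s->next_avail;
    if (++s->next_avail == s->size) {
        s->next_avail = 0;
        s->expected_wrap = !s->expected_wrap;  /* mirror the driver */
    }
    return slot;   /* caller starts processing this slot */
}
~~~
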
> So let's say that the descriptors do, in fact, complete out of order, such that
> the descriptor in slot 0 completes first and then the descriptor in slot 3
> completes second. The specification is very clear that this is allowed.
> 
> When completing the descriptor in slot 0, the device has not yet completed the
> last descriptor in the ring (they're out of order!). So the ring wrap counter is
> still 0. So it then proceeds to write both the avail flag and the used flag to
> 0. But the driver is expecting the used flag to get flipped to 1 to indicate a
> completion. This is broken.

You're saying used descriptors overwrite the same descriptor slot where
they were made available by the driver.  This is not the case.  The
device overwrites the descriptor ring in completion order.  See my
response to your concern about performance for an illustration.
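In code, the completion path keeps its own write position into the ring,
something like this rough C sketch (one descriptor per request; the naming
is mine, not from any particular implementation):

~~~
#include <stdbool.h>
#include <stdint.h>

#define VIRTQ_DESC_F_AVAIL (1 << 7)
#define VIRTQ_DESC_F_USED  (1 << 15)

struct pvirtq_desc {
    uint64_t addr;
    uint32_t len;
    uint16_t id;
    uint16_t flags;
};

struct device_used_state {
    struct pvirtq_desc *ring;
    uint16_t size;
    uint16_t used_idx;  /* next slot to write, advances in completion order */
    bool wrap;          /* Device Ring Wrap Counter */
};

/* Called once per completed request, in completion order.  The Buffer
 * ID tells the driver which request finished; the slot written is
 * simply the next one, not the slot the request was made available in.
 * Real code needs a write barrier before the flags store. */
static void device_write_used(struct device_used_state *s,
                              uint16_t buffer_id, uint32_t len)
{
    struct pvirtq_desc *d = &s->ring[s->used_idx];

    d->id  = buffer_id;
    d->len = len;
    d->flags = s->wrap ? (VIRTQ_DESC_F_AVAIL | VIRTQ_DESC_F_USED) : 0;

    if (++s->used_idx == s->size) {
        s->used_idx = 0;
        s->wrap = !s->wrap;  /* toggles after using the last slot */
    }
}
~~~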

> I wrote a unit test for the SPDK implementation and it is indeed broken in this
> scenario, causing the driver to hang. I've also at least looked through the code
> in DPDK and in the Linux kernel implementations and they appear broken in
> various ways as well. It appears that DPDK may decide the ring is corrupt and
> "drop the packet" in this scenario, which is data corruption for storage, but I
> haven't yet run a test to confirm. The Linux kernel implementation looks to me
> like it will never consider the descriptor complete, and will eventually hang.
> 
> The device doesn't actually need to track a ring wrap counter for the used flag
> at all. The device does need to track a ring wrap counter for the expected value
> of the avail bit, to mirror the driver side value while polling for incoming
> descriptors. But to mark descriptors as used, it simply needs to set the used
> flag to the current value of the avail flag in that descriptor.
> 
> ----------------------
> 
> Moving on to my separate concerns about performance. It's important to note
> that storage devices these days are inherently parallel pieces of hardware.
> You can submit a large number of commands at once and the device actually
> does process some number of them simultaneously. That means things
> complete out of order all the time. For example, imagine submitting two commands
> to a NAND flash storage device where one is a 1MB read and the second is a
> 4K read. Let's say the data they are reading is on entirely separate NAND
> chips internally. That 4K read will complete ahead of the 1MB read, and it
> would be awful to force the small read to wait on the larger one.
> 
> But with the packed ring design, there's only one ring for both submission and
> completion, so descriptor slots can only be re-used in order. Even if the device
> implementation is able to complete the 4K read early, out of order, and
> mark the descriptor as used, a driver that's polling for completions won't see
> it until the earlier descriptor completes. A driver using interrupts may be able
> to see it if the interrupt tells it specifically which slot to look at, but it
> can't re-use the descriptor slot for a new command because submissions must go
> in order. While the bug I outlined in the first part of my email is fixable,
> this design issue is not.

The spec says:

  When the device has finished processing the buffer, it writes a used
  device descriptor including the Buffer ID into the Descriptor Ring
  (overwriting a driver descriptor previously made available)

  ...

  Note: after reading driver descriptors and starting their processing
  in order, the device might complete their processing out of order.
  Used device descriptors are written in the order in which their
  processing is complete.

Say the ring has the following requests:

  | 1M (Avail) | 4K (Avail) |

The device sees both requests and completes the 4K request.  The device
overwrites the oldest available descriptor:

  | 4K (Used)  | 4K (Avail/seen) |

The driver polls the first descriptor for completions.  Therefore, the
driver can process the 4K request completion while the 1M request is
still pending.  Now the driver can submit the next request while the 1M
request is in flight:

  | 8K (Avail) | 4K (Avail/seen) |
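
On the driver side, completion polling advances its own used index, so the
4K completion written into the first slot is picked up immediately.  Roughly
(another C sketch; names are mine):

~~~
#include <stdbool.h>
#include <stdint.h>

#define VIRTQ_DESC_F_AVAIL (1 << 7)
#define VIRTQ_DESC_F_USED  (1 << 15)

struct pvirtq_desc {
    uint64_t addr;
    uint32_t len;
    uint16_t id;
    uint16_t flags;
};

struct driver_used_state {
    uint16_t used_idx;  /* next slot where a completion will appear */
    uint16_t size;
    bool wrap;          /* shadow of the Device Ring Wrap Counter */
};

/* Returns the Buffer ID of the next completed request, or -1 if the
 * device hasn't written one yet.  Because completions are written in
 * completion order, the 4K request above is returned while the 1M
 * request is still in flight. */
static int driver_poll_used(struct driver_used_state *s,
                            const struct pvirtq_desc *ring)
{
    const struct pvirtq_desc *d = &ring[s->used_idx];
    bool avail = !!(d->flags & VIRTQ_DESC_F_AVAIL);
    bool used  = !!(d->flags & VIRTQ_DESC_F_USED);

    /* A used descriptor has both bits equal to the device wrap counter. */
    if (avail != s->wrap || used != s->wrap)
        return -1;

    int id = d->id;
    if (++s->used_idx == s->size) {
        s->used_idx = 0;
        s->wrap = !s->wrap;
    }
    return id;
}
~~~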

With this mental model of how the packed ring layout works, I think
neither of the issues you've raised is a problem.  Have I missed
anything?

Stefan
