Subject: virtio packed ring concerns


Hi all - I'm a core maintainer for the SPDK project, which includes a vhost-user 
target implementation that handles both virtio-blk and virtio-scsi. Recently we
had some engineers go off and attempt to add support for packed virtqueues, as
defined here: 
https://github.com/oasis-tcs/virtio-spec/blob/master/packed-ring.tex

While reviewing the patches they submitted, I noticed two separate issues. The
first is a functional issue: as specified, the ring cannot work correctly under
certain circumstances. The second is a performance issue. I'm going to try to
keep the two entirely separate.

Comments are very welcome here, and I'm hoping I'm just missing some critical
piece of information that resolves the whole thing, but so far my testing shows
that this is a real problem.

First, the functional issue. The problem scenario involves out-of-order
completions during ring wrap-around. Here are the most relevant sections of the
specification:

~~~
Note: after reading driver descriptors and starting their processing in order,
the device might complete their processing out of order.  Used device
descriptors are written in the order in which their processing is complete.

The counter maintained by the driver is called the Driver Ring Wrap Counter. The
driver changes the value of this counter each time it makes available the last
descriptor in the ring (after making the last descriptor available).

The counter maintained by the device is called the Device Ring Wrap
Counter.  The device changes the value of this counter
each time it uses the last descriptor in
the ring (after marking the last descriptor used).

To mark a descriptor as available, the driver sets the VIRTQ_DESC_F_AVAIL bit in
Flags to match the internal Driver Ring Wrap Counter.  It also sets the
VIRTQ_DESC_F_USED bit to match the inverse value (i.e. to not match the internal
Driver Ring Wrap Counter).

To mark a descriptor as used, the device sets the VIRTQ_DESC_F_USED bit in Flags
to match the internal Device Ring Wrap Counter.  It also sets the
VIRTQ_DESC_F_AVAIL bit to match the \emph{same} value.
~~~
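
For reference, here's a minimal sketch of the packed descriptor layout and the
two flag bits as I read the spec; the struct and macro names are mine, not
taken from any particular implementation.

~~~
#include <stdint.h>

/* Layout of a packed ring descriptor (all fields little-endian per the spec). */
struct pvirtq_desc {
    uint64_t addr;   /* guest-physical address of the buffer */
    uint32_t len;    /* length of the buffer */
    uint16_t id;     /* buffer id */
    uint16_t flags;  /* includes the AVAIL and USED bits below */
};

#define VIRTQ_DESC_F_AVAIL (1 << 7)   /* driver toggles to match its wrap counter */
#define VIRTQ_DESC_F_USED  (1 << 15)  /* device toggles to mark the descriptor used */
~~~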

Imagine a packed ring with 4 descriptors. At some earlier time, let's say the
first 3 descriptors were submitted and completed, in order. Then two more
descriptors were submitted (so they went into slot 3 and then slot 0). Each time
through the ring, the "ring wrap counter" toggles from 1 to 0 or vice versa. In
this example the avail and used flags are initialized to 1 at startup, and both
the driver's and the device's ring wrap counters start at 0.

So when submitting into slot 3, the driver flips the avail bit from 1 to 0. When
submitting into slot 0 after the ring-wrap, it flips the avail bit from 0 to 1.
The spec says it also sets the used flag, but the used flag is already the
correct value. That's all solid so far.

The device is then polling on slot 3, waiting for new descriptors to be
submitted. Note that the device must maintain its own copy of what it thinks
the driver's ring wrap counter value is in order to detect descriptors being
made available; this isn't called out in the specification at all, but all
actual implementations do it. Regardless, once the device sees the avail flag
in the descriptor in slot 3 flip, it begins processing that descriptor. It then
moves on to poll slot 0 for incoming descriptors (and internally updates its
expected avail ring wrap counter value). In this case, it sees the bit flip for
slot 0 and begins processing that descriptor too. Descriptor processing is
asynchronous and may complete out of order, at least for storage devices.
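
To make that device-side bookkeeping concrete, here's roughly what I mean,
reusing the descriptor sketch above. All of the names are hypothetical, and
this assumes a single-threaded poller with memory barriers omitted.

~~~
#include <stdbool.h>
#include <stdint.h>

/* Device-side state: the device mirrors the driver's ring wrap counter so it
 * can tell when the next slot has been made available. */
struct device_ring_state {
    struct pvirtq_desc *descs;      /* the shared descriptor ring */
    uint16_t size;                  /* number of descriptors in the ring */
    uint16_t next_avail_idx;        /* next slot we expect the driver to fill */
    bool     expected_avail_wrap;   /* shadow copy of the driver's wrap counter */
};

/* A slot is available when its AVAIL bit matches the expected driver wrap
 * counter and its USED bit does not (i.e. the two bits differ). */
static bool desc_is_avail(const struct device_ring_state *st, uint16_t idx)
{
    uint16_t flags = st->descs[idx].flags;
    bool avail = !!(flags & VIRTQ_DESC_F_AVAIL);
    bool used  = !!(flags & VIRTQ_DESC_F_USED);

    return avail == st->expected_avail_wrap && used != st->expected_avail_wrap;
}

/* After starting to process a slot, move on to the next one; flip the shadow
 * wrap counter every time we pass the end of the ring. */
static void device_advance_avail(struct device_ring_state *st)
{
    if (++st->next_avail_idx == st->size) {
        st->next_avail_idx = 0;
        st->expected_avail_wrap = !st->expected_avail_wrap;
    }
}
~~~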

So let's say that the descriptors do, in fact, complete out of order, such that
the descriptor in slot 0 completes first and then the descriptor in slot 3
completes second. The specification is very clear that this is allowed.

When completing the descriptor in slot 0, the device has not yet completed the
last descriptor in the ring (they're out of order!), so its Device Ring Wrap
Counter is still 0. Following the spec, it then writes both the avail flag and
the used flag to 0. But the driver is expecting the used flag to flip to 1 to
indicate a completion. This is broken.
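
Boiled down to just the flag writes on slot 0, the failure looks like this.
This is a standalone toy that follows the spec text literally; it is not the
SPDK code.

~~~
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

int main(void)
{
    /* Slot 0 as the driver left it after the wrap: avail = 1, used = 0
     * (the driver's wrap counter flipped to 1 when it filled slot 3). */
    bool avail = true, used = false;

    /* The device completes slot 0 first. Its ring wrap counter is still 0
     * because it has not yet used slot 3, the last descriptor. Taking the
     * spec text literally, it writes both bits to that counter value. */
    bool device_wrap = false;
    used  = device_wrap;
    avail = device_wrap;

    /* The driver is waiting for the used flag to flip to 1, matching the
     * avail value it wrote. It never will, so the completion is missed. */
    printf("slot 0 after completion: avail=%d used=%d (driver expects used=1)\n",
           avail, used);
    assert(!used);
    return 0;
}
~~~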

I wrote a unit test for the SPDK implementation and it is indeed broken in this
scenario, causing the driver to hang. I've also at least looked through the code
in DPDK and in the Linux kernel implementations and they appear broken in
various ways as well. It appears that DPDK may decide the ring is corrupt and
"drop the packet" in this scenario, which is data corruption for storage, but I
haven't yet run a test to confirm. The Linux kernel implementation looks to me
like it will never consider the descriptor as complete and eventually hang.

The device doesn't actually need to track a ring wrap counter for the used flag
at all. The device does need to track a ring wrap counter for the expected value
of the avail bit, to mirror the driver side value while polling for incoming
descriptors. But to mark descriptors as used, it simply needs to set the used
flag to the current value of the avail flag in that descriptor.
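
In code, the fix I'm suggesting looks roughly like this, again using the
hypothetical types from the sketches above:

~~~
/* Mark a descriptor used without consulting any device-side wrap counter:
 * copy the descriptor's current AVAIL bit into the USED bit so the two bits
 * end up equal, which is what a used descriptor looks like to the driver. */
static void mark_desc_used(struct pvirtq_desc *desc)
{
    uint16_t flags = desc->flags;

    if (flags & VIRTQ_DESC_F_AVAIL)
        flags |= VIRTQ_DESC_F_USED;
    else
        flags &= ~VIRTQ_DESC_F_USED;

    /* A real implementation would also write id/len first and issue a write
     * barrier before publishing the flags; I'm omitting that here. */
    desc->flags = flags;
}
~~~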

----------------------

Moving on to my separate concern about performance. Critically, storage devices
these days are inherently parallel pieces of hardware. You can submit a large
number of commands at once, and the device really does process some number of
them simultaneously. That means commands complete out of order all the time.
For example, imagine submitting two commands to a NAND flash storage device,
where one is a 1MB read and the other is a 4K read. Let's say the data they're
reading lives on entirely separate NAND chips internally. The 4K read will
complete ahead of the 1MB read, and it would be awful to force that small read
to wait on the larger one.

But with the packed ring design, there's only one ring for both submission and
completion, so descriptor slots can only be re-used in order. Even if the
device implementation completes the 4K read early, out of order, and marks its
descriptor as used, a driver that's polling for completions won't see it until
the earlier descriptor completes. A driver using interrupts may be able to see
it if the interrupt tells it specifically which slot to look at, but it still
can't re-use that descriptor slot for a new command, because submissions must
go in order. While the bug I outlined in the first part of this email is
fixable, this design issue is not.
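
For illustration, here's roughly what in-order polling on the driver side looks
like, again with hypothetical names and with the wrap counter handling left out
for brevity:

~~~
#include <stdbool.h>
#include <stdint.h>

struct driver_ring_state {
    struct pvirtq_desc *descs;
    uint16_t size;
    uint16_t next_used_idx;   /* next slot to poll for a completion */
};

/* The driver only ever examines the next slot in ring order, so a slot the
 * device completed out of order further ahead stays invisible (and cannot be
 * refilled) until every earlier slot has also been used. */
static void drain_completions(struct driver_ring_state *st)
{
    for (;;) {
        uint16_t flags = st->descs[st->next_used_idx].flags;
        bool avail = !!(flags & VIRTQ_DESC_F_AVAIL);
        bool used  = !!(flags & VIRTQ_DESC_F_USED);

        /* Next slot not used yet: stop, even if a later slot already is. */
        if (avail != used)
            break;

        /* ...hand the buffer id for this slot back to the block layer... */

        if (++st->next_used_idx == st->size)
            st->next_used_idx = 0;
    }
}
~~~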

Thanks,
Ben

