Subject: virtio packed ring concerns
Hi all -

I'm a core maintainer for the SPDK project, which includes a vhost-user target implementation that handles both virtio-blk and virtio-scsi. Recently some of our engineers attempted to add support for packed virtqueues, as defined here:

https://github.com/oasis-tcs/virtio-spec/blob/master/packed-ring.tex

While reviewing the patches they submitted I noticed two separate issues. The first is a functional issue: under certain circumstances the ring, as specified, cannot work correctly. The second is a performance issue. I'm going to keep the two entirely separate. Comments are very welcome here, and I'm hoping I'm just missing some critical piece of information that resolves the whole thing, but testing so far is showing a real problem.

First, the functional issue. The problem scenario involves out-of-order completions during ring wrap-around. Here are the most relevant sections of the specification:

~~~
Note: after reading driver descriptors and starting their processing in order, the device might complete their processing out of order. Used device descriptors are written in the order in which their processing is complete.

The counter maintained by the driver is called the Driver Ring Wrap Counter. The driver changes the value of this counter each time it makes available the last descriptor in the ring (after making the last descriptor available).

The counter maintained by the device is called the Device Ring Wrap Counter. The device changes the value of this counter each time it uses the last descriptor in the ring (after marking the last descriptor used).

To mark a descriptor as available, the driver sets the VIRTQ_DESC_F_AVAIL bit in Flags to match the internal Driver Ring Wrap Counter. It also sets the VIRTQ_DESC_F_USED bit to match the inverse value (i.e. to not match the internal Driver Ring Wrap Counter).
To mark a descriptor as used, the device sets the VIRTQ_DESC_F_USED bit in Flags to match the internal Device Ring Wrap Counter. It also sets the VIRTQ_DESC_F_AVAIL bit to match the same value.
~~~

Imagine a packed ring with 4 descriptors. At some earlier time, the first 3 descriptors were submitted and completed, in order. Then two more descriptors were submitted, going into slot 3 and then, after the wrap, slot 0. Each time through the ring, the ring wrap counter toggles from 1 to 0 or vice versa. Say the avail and used flags were initialized to 1 at start-up and the driver's ring wrap counter is currently 0. When submitting into slot 3, the driver flips the avail bit from 1 to 0. When submitting into slot 0 after the ring wrap, it flips the avail bit from 0 to 1. The spec says it also sets the used flag, but the used flag already has the correct value. All of that is solid so far.

The device, meanwhile, is polling slot 3, waiting for new descriptors. Note that the device must maintain its own copy of what it believes the driver's ring wrap counter to be in order to detect descriptors being made available - this isn't called out in the specification at all, but all actual implementations do it. Regardless, once the device sees the avail flag flip in the descriptor in slot 3, it begins processing that descriptor. It then moves on to poll slot 0 for incoming descriptors (updating its expected avail ring wrap counter value internally), sees the bit flip there, and begins processing that descriptor too.

Descriptor processing is asynchronous and may complete out of order, at least for storage devices. So suppose the descriptors do, in fact, complete out of order: the descriptor in slot 0 completes first and the descriptor in slot 3 completes second. The specification is very clear that this is allowed. When completing the descriptor in slot 0, the device has not yet used the last descriptor in the ring (they're out of order!).
So its ring wrap counter is still 0, and it proceeds to write both the avail flag and the used flag to 0. But the driver is expecting the used flag to flip to 1 to indicate a completion. This is broken.

I wrote a unit test for the SPDK implementation and it is indeed broken in this scenario, causing the driver to hang. I've also at least looked through the DPDK and Linux kernel implementations, and they appear broken in various ways as well. It appears that DPDK may decide the ring is corrupt and "drop the packet" in this scenario - which is data corruption for storage - but I haven't yet run a test to confirm. The Linux kernel implementation looks to me like it will never consider the descriptor complete and will eventually hang.

The device doesn't actually need to track a ring wrap counter for the used flag at all. It does need to track a ring wrap counter for the expected value of the avail bit, mirroring the driver-side value while polling for incoming descriptors. But to mark a descriptor as used, it simply needs to set the used flag to the current value of the avail flag in that descriptor.

----------------------

Moving on to my separate concern about performance. Critically, it's important to note that storage devices these days are inherently parallel pieces of hardware. You can submit a large number of commands at once, and the device really does process some number of them simultaneously. That means commands complete out of order all the time. For example, imagine submitting two commands to a NAND flash storage device: a 1MB read and a 4KB read, where the data they target lives on entirely separate NAND chips internally. The 4KB read will complete well ahead of the 1MB read, and it would be awful to force that small read to wait on the larger one. But with the packed ring design there is only one ring for both submission and completion, so descriptor slots can only be re-used in order.
Even if the device implementation is able to complete the 4KB read out of order ahead of time and mark its descriptor as used, a driver that's polling for completions won't see it until the earlier descriptor completes. A driver using interrupts may be able to see it, if the interrupt tells it specifically which slot to look at, but it still can't re-use that descriptor slot for a new command because submissions must go in order. While the bug I outlined in the first part of this email is fixable, this design issue is not.

Thanks,
Ben
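P.S. For concreteness, here is roughly the device-side completion path I'm proposing for the functional fix. This is a sketch only - the struct and function names are made up, not SPDK's actual code - but the bit positions match the spec's VIRTQ_DESC_F_AVAIL (bit 7) and VIRTQ_DESC_F_USED (bit 15):

```c
#include <stdint.h>

#define VIRTQ_DESC_F_AVAIL (1u << 7)
#define VIRTQ_DESC_F_USED  (1u << 15)

/* Hypothetical packed-ring descriptor layout. */
struct pq_desc {
    uint64_t addr;
    uint32_t len;
    uint16_t id;
    uint16_t flags;
};

/* Device side: mark a descriptor used WITHOUT consulting any
 * device-side used-flag wrap counter. Just copy the avail bit the
 * driver wrote into the used bit, so the two always end up equal
 * regardless of completion order. */
static inline void device_mark_used(struct pq_desc *d)
{
    uint16_t flags = d->flags;

    if (flags & VIRTQ_DESC_F_AVAIL)
        flags |= VIRTQ_DESC_F_USED;
    else
        flags &= ~VIRTQ_DESC_F_USED;

    /* In real code this would be a release store, performed only
     * after len and id have been written. */
    d->flags = flags;
}
```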