[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]
Subject: VM memory protection and zero-copy transfers.
Hello everyone,

The traditional virtio model relies on the host being able to access the entire memory of the guest VM. Virtio is also used in system configurations where the devices are implemented not by the host (which may not exist as such in the case of a Type-1 hypervisor) but by another, unprivileged guest VM. In such a configuration, the guest VM memory-sharing requirement raises security concerns.

The following proposal removes that requirement by introducing an alternative model in which the interactions between the virtio driver and the virtio device are mediated by the hypervisor. The concept applies to both Type-1 and Type-2 hypervisors; in the write-up below, the "host" thus refers either to the host OS or to the guest VM that executes the virtio device.

The main objective is to keep the memory of the VM that runs the driver isolated from the memory of the host or VM that runs the device, while still allowing zero-copy transfers between the two domains. The operations that control the exchange of the virtio buffers are handled by hypervisor code that sits between the device and the driver.

Unlike the regular virtio model, the virtqueues allocated by the driver are not shared with the device directly. Instead, the hypervisor allocates a separate set of virtqueues of the same sizes as the original ones and shares this second set with the device. These hypervisor-allocated virtqueues are referred to as the "shadow virtqueues".

During device operation, the hypervisor copies descriptors between the driver's virtqueues and the shadow virtqueues as the buffers cycle between the driver and the device. Whenever the driver adds buffers to the available ring, the hypervisor validates the descriptors, dynamically grants the I/O buffers to the host or VM that runs the device, and then copies these descriptors to the shadow virtqueue's available ring.
At the other end, when the device returns buffers through the shadow virtqueue's used ring, the hypervisor unmaps these buffers from the host's address space and copies the descriptors to the driver's used ring.

Although the virtio buffers can be allocated anywhere in the guest memory and are not necessarily page-aligned, the memory-sharing granularity is constrained by the page size. So when a buffer is mapped into the host address space, the hypervisor may end up sharing more memory than what is strictly needed.

The cost of granting the memory dynamically as the virtio transfers go is significant, though: we measured up to 40% performance degradation when using this dynamic buffer-granting mechanism. We also compared this solution to other approaches that we have seen elsewhere, for instance using the swiotlb mechanism along with the VIRTIO_F_ACCESS_PLATFORM feature bit to force a copy of the I/O buffers into a statically shared memory region. In that case, the same set of benchmarks shows an even bigger performance degradation, up to 60%, compared to the original virtio performance.

Although the shadow virtqueue concept looks fairly simple, one point has not been covered yet: indirect descriptors. To support indirect descriptors, the following two options were considered initially:

1. Grant the indirect descriptor as-is to the host while it is on the available ring. This introduces a security issue: a compromised guest OS can modify the indirect descriptor after it has been pushed to the available ring, which would cause the device to fault while trying to access arbitrary memory that was not actually granted. Note that in the shadow virtqueue model, there is no need for the device to validate the descriptors in the available rings, because the hypervisor already performed such checks before granting the memory.

2. Follow the same logic that is used for the "normal" descriptors and introduce shadow indirect descriptors. This would require the hypervisor to provision a memory pool from which to allocate these shadow indirect descriptors, and determining the size of this pool may not be trivial. Additionally, indirect descriptors can be as large as the driver wants them to be, which could force the hypervisor to copy an arbitrarily large amount of data.

An alternative approach consists in introducing a new virtio feature bit. This feature bit, when set by the device, instructs the driver to allocate indirect descriptors using dedicated memory pages that hold no other data than the indirect descriptors. Since a correct virtio driver implementation does not modify an indirect descriptor once it has been pushed to the device, the pages where the indirect descriptors lie can later be remapped read-only in the guest address space. This allows the hypervisor to validate the content of the indirect descriptor, grant it to the host (along with all the buffers it references), and keep it mapped read-only in the guest address space for as long as it is granted to the host (i.e. until the indirect descriptor is returned through the used ring).

The present proposal has some obvious drawbacks, but we believe that memory protection will not come for free. We know that there are other folks out there who are trying to address this issue of memory sharing between VMs, so we would be pleased to hear what you think about this approach. Additionally, we would like to know whether a feature bit similar to the one discussed here could be considered for addition to the virtio standard.

Looking forward to hearing from you.

Baptiste