Subject: Re: [virtio-comment] [PATCH] Introduce VIRTIO_F_ISOLATE_INDIRECT_DESC feature


On Mon, Feb 27, 2023 at 3:54 PM Afsa, Baptiste <Baptiste.Afsa@harman.com> wrote:
>
> > Hi Baptiste,
> >
> > This is not how the shadow virtqueue currently works in qemu, but it
> > is doable.
> >
> > QEMU offers a new address space to the device, including all the guest
> > memory and the shadow vrings in qemu memory. By default, qemu
> > translates each descriptor to this new address space and copies it into
> > the shadow vring, making sure it's a valid guest address.
> >
> > Net CVQ is a special case, as qemu also copies the buffer so that a
> > malicious guest cannot change it and make the vdpa device's and qemu's
> > view of it go out of sync. For performance and simplicity, this copy is
> > done in a region mapped at device start for this purpose, rather than
> > mapped per descriptor.
> >
> > It is technically possible to copy the buffer content into a memory
> > region too. Once a suitable size is agreed on (or given by a cmdline
> > parameter, for example), SVQ already supports backing off in case this
> > region is full. Mallocing and mapping this region dynamically is also
> > possible.
> >
>
> Hello Eugenio,
>
> Thanks for the insight. It does have some similarities with what we have done on
> our side.
>
> We have also tried the approach of copying the buffer content into a
> shared memory region, but we did that on the guest side using things like
> swiotlb or restricted DMA pools. This works well when buffers are small,
> but when they get larger, we get better performance by granting the memory
> dynamically.
>
> > > This new feature bit helps support indirect
> > > descriptors in this context.
> > >
> > > When an indirect descriptor is added to a virtqueue, we need to "share" the
> > > indirect descriptor table in addition to the buffers themselves. To do this we
> > > have two options:
> > >
> > >   - Share a shadow copy of the table with the device, but this requires some
> > >     kind of dynamic memory allocation in the hypervisor.
> > >
> > >   - Grant the indirect descriptor table from the guest as-is. Note that in our
> > >     use case we do not translate buffer addresses when importing them into the
> > >     device address space.
> > >
> >
> > The plan with indirect descriptors is similar to the CVQ buffer treatment:
> > preallocate a chunk of memory big enough to hold a copy of the indirect
> > table and then translate each descriptor. The current dynamic
> > allocation scheme is very simple, but it should work as long as there
> > is enough memory. As commented before, backoff is already available
> > for other cases.
>
> Ok. This is one of the options we considered initially, but rejected
> because we wanted to avoid dynamic memory allocation.
>
> However, we are reconsidering this idea now. We initially overlooked the fact
> that the spec limits the size of an indirect descriptor table to the queue
> size. Having this limit allows us to figure out the maximum size of a single
> indirect descriptor table, and hence the total size needed for the memory pool
> that will store the copies of the indirect descriptor tables.
>
> The issue we have now is that this limitation does not seem to be enforced
> by Linux virtio drivers today. I came across:
>
> https://github.com/oasis-tcs/virtio-spec/issues/122
>
> which looks like a good base for us to build upon, but I'm not sure what the
> status of this issue is. Do you know whether there is any plan regarding this?
>

I was not aware of that lack of enforcement, as -net already uses
MAX_SKB_FRAGS. Is it actually possible for a Linux driver to send an
indirect table with more than 256 indirect descriptors?

When SVQ implements indirect descriptors in qemu, it may make sense to
assume each indirect table holds at most 1024 indirect descriptors. If I'm
not wrong, that is a limit qemu already enforces.
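
To make the arithmetic concrete, something like the sketch below is what I
have in mind. It is not actual QEMU code: the struct follows the split-ring
descriptor layout from the spec, but the helper names and the 1024 ceiling
constant are made up for illustration.

#include <stddef.h>
#include <stdint.h>

struct vring_desc {                     /* split-ring descriptor, per the spec */
    uint64_t addr;
    uint32_t len;
    uint16_t flags;
    uint16_t next;
};

#define SVQ_MAX_INDIRECT_ENTRIES 1024u  /* assumed qemu-enforced ceiling */

/* Worst-case bytes needed to shadow one indirect descriptor table. */
static inline size_t indirect_table_max_bytes(uint16_t queue_size)
{
    uint32_t entries = queue_size < SVQ_MAX_INDIRECT_ENTRIES ?
                       queue_size : SVQ_MAX_INDIRECT_ENTRIES;
    return (size_t)entries * sizeof(struct vring_desc);
}

/* Worst case for the whole pool, if every ring entry carried an indirect table. */
static inline size_t indirect_pool_max_bytes(uint16_t queue_size)
{
    return (size_t)queue_size * indirect_table_max_bytes(queue_size);
}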

Another useful property of SVQ is that it does not strictly follow the
guest's avail_idx. It may make some descriptors / indirect tables available
and then wait for completion: it disables the guest's kick notification and
waits for the device to mark any of them as used. When there is space
again, it makes more descriptors / tables available.

That property means it is not mandatory to allocate worst-case memory for
all indirect tables, only for some of them. The right value is still to be
agreed on, though.
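
In other words, the pool can be smaller than the worst case; SVQ simply
stops consuming the guest's avail ring when it runs out of slots. A rough
sketch of that backoff, with made-up types and names:

#include <stddef.h>

struct svq_pool {
    void   **free_slots;  /* stack of pointers into the preallocated area */
    unsigned nfree;       /* number of valid entries in free_slots        */
};

/* Grab a scratch buffer for one indirect table copy, or NULL to back off. */
static void *svq_pool_get(struct svq_pool *p)
{
    if (p->nfree == 0) {
        return NULL;  /* caller stops reading the guest avail ring for now */
    }
    return p->free_slots[--p->nfree];
}

/* Called from the used-ring handler once the device is done with a chain. */
static void svq_pool_put(struct svq_pool *p, void *slot)
{
    p->free_slots[p->nfree++] = slot;
}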

Would that help your use case?

> > Going a step further with current qemu's SVQ, a feature that would
> > help is allowing the device to translate buffer addresses with a
> > different IOVA than the vring and indirect table. But this is far from
> > the standard's point of view, since the standard does not even have an
> > address space concept.
>
> I'm not sure I see what this would bring compared to the current design.
>

To avoid the need to modify devices, SVQ vrings are offered to the device
as regular vrings. Because of this, all buffer addresses must be translated
by qemu. In the indirect table case, this multiplies the number of
translations.

If we can make the device interpret the vring address one way and
buffer / indirect table addresses another, this translation becomes just a
memcpy.
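
To illustrate the difference, here is a rough sketch. The translate helper
is hypothetical and the descriptor struct is just the split-ring layout
from the spec; it is not how qemu spells any of this today.

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

struct vring_desc {                     /* split-ring descriptor, per the spec */
    uint64_t addr;
    uint32_t len;
    uint16_t flags;
    uint16_t next;
};

/* Hypothetical helper: map a guest address to its device-visible IOVA. */
extern uint64_t svq_translate_addr(uint64_t guest_addr);

static void shadow_indirect_table(struct vring_desc *dst,
                                  const struct vring_desc *src,
                                  unsigned num, bool device_translates)
{
    if (device_translates) {
        /* Device resolves guest buffer addresses itself: one plain copy. */
        memcpy(dst, src, num * sizeof(*src));
        return;
    }

    for (unsigned i = 0; i < num; i++) {
        /* Otherwise one translation per entry, as SVQ has to do today. */
        dst[i] = src[i];
        dst[i].addr = svq_translate_addr(src[i].addr);
    }
}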

Since you mention later in your mail that you're actually shadowing
the vrings, how do you deal with this? Do you translate each
descriptor or do you instruct the device to translate addresses
differently?

Thanks!

> > Do you think both modes can converge?
>
> I think so since we are now thinking of using a copy of the indirect descriptor
> table. This would eliminate the need for any additional feature bit.
>
> > > The issue with the second option is that we need to ensure that the driver will
> > > not modify the indirect descriptor table while it is on the device side.
> > > Otherwise the device could attempt to access the buffers using the new addresses,
> > > which would not correspond to the ones the hypervisor used when granting memory.
> > >
> > > Since this is something that a correct driver implementation would not do, this
> > > led to the idea of remapping the table read-only while it is used by the device.
> > > Because an indirect descriptor table may not fill an entire page, we needed a
> > > way to ensure that the OS would not re-use the rest of these pages for other
> > > purposes while we have made them read-only.
> > >
> >
> > How do you guarantee it for the vring itself?
>
> For the vrings we do exactly what you described for QEMU. The virtqueues that
> the device sees are not the virtqueues allocated by the driver but the shadow
> copies managed by the hypervisor. If the driver touches the descriptors after
> they have been copied to the shadow virtqueue, the hypervisor will ignore these
> changes.
>
> Thanks.
>


