

Subject: Re: [virtio-dev] Re: Constraining where a guest may allocate virtio accessible resources


On 18.06.20 15:29, Stefan Hajnoczi wrote:
> On Wed, Jun 17, 2020 at 08:01:14PM +0200, Jan Kiszka wrote:
>> On 17.06.20 19:31, Alex Bennée wrote:
>>>
>>> Hi,
>>>
>>> This follows on from the discussion in the last thread I raised:
>>>
>>>   Subject: Backend libraries for VirtIO device emulation
>>>   Date: Fri, 06 Mar 2020 18:33:57 +0000
>>>   Message-ID: <874kv15o4q.fsf@linaro.org>
>>>
>>> To support the concept of a VirtIO backend having limited visibility of
>>> a guest's memory space there needs to be some mechanism to limit where
>>> that guest may place things. A simple VirtIO device can be expressed
>>> purely in virt resources, for example:
>>>
>>>    * status, feature and config fields
>>>    * notification/doorbell
>>>    * one or more virtqueues
>>>
>>> Using a PCI backend the location of everything but the virtqueues is
>>> controlled by the mapping of the PCI device, so it is something that is
>>> controllable by the host/hypervisor. However the guest is free to
>>> allocate the virtqueues anywhere in the virtual address space of system
>>> RAM.
>>>
>>> In theory this shouldn't matter because sharing virtual pages is just a
>>> matter of putting the appropriate translations in place. However there
>>> are multiple ways the host and guest may interact:
>>>
>>> * QEMU TCG
>>>
>>> QEMU sees a block of system memory in its virtual address space that
>>> has a one to one mapping with the guest's physical address space. If
>>> QEMU wants to share a subset of that address space it can only
>>> realistically do it for a contiguous region of its address space, which
>>> implies the guest must use a contiguous region of its physical address
>>> space.
>>>
>>> * QEMU KVM
>>>
>>> The situation here is broadly the same - although both QEMU and the
>>> guest are seeing their own virtual views of a linear address space
>>> which may well actually be a fragmented set of physical pages on the
>>> host.
>>>
>>> KVM based guests have additional constraints if they ever want to access
>>> real hardware in the host, as you need to ensure any address accessed by
>>> the guest can eventually be translated into an address that can
>>> physically access the bus the device is on (for device pass-through).
>>> The area also has to be DMA coherent so updates from a bus are reliably
>>> visible to software accessing the same address space.
>>>
>>> * Xen (and other type-1's?)
>>>
>>> Here the situation is a little different because the guest explicitly
>>> makes its pages visible to other domains by way of grant tables. The
>>> guest is still free to use whatever parts of its address space it wishes
>>> to. Other domains then request access to those pages via the hypervisor.
>>>
>>> In theory the requester is free to map the granted pages anywhere in
>>> its own address space. However there are differences between the
>>> architectures on how well this is supported.
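>>>
>>> For illustration, the grant step for a single virtqueue page from a
>>> Linux frontend looks roughly like this - a minimal sketch, with the
>>> wrapper name invented and error handling omitted:
>>>
>>>     #include <xen/grant_table.h>
>>>     #include <asm/xen/page.h>        /* virt_to_gfn() */
>>>
>>>     /* Grant the backend domain read/write access to one page of the
>>>      * virtqueue. The returned grant reference is what the backend
>>>      * uses to map the page into its own address space. */
>>>     static int grant_vring_page(domid_t backend_domid, void *vring_page)
>>>     {
>>>             return gnttab_grant_foreign_access(backend_domid,
>>>                                                virt_to_gfn(vring_page),
>>>                                                0 /* read-write */);
>>>     }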
>>>
>>> So I think this makes a case for having a mechanism by which the guest
>>> can restrict its allocation to a specific area of the guest physical
>>> address space. The question is then what is the best way to inform the
>>> guest kernel of the limitation?
>>>
>>> Option 1 - Kernel Command Line
>>> ==============================
>>>
>>> This isn't without precedent - the kernel supports options like "memmap"
>>> which, with the appropriate amount of crafting, can be used to carve out
>>> sections of bad RAM from the physical address space. Other formulations
>>> can be used to mark specific areas of the address space as particular
>>> types of memory.
>>>
>>> However there are cons to this approach as it then becomes a job for
>>> whatever builds the VMM command lines to ensure that both the backend and
>>> the kernel know where things are. It is also very Linux centric and
>>> doesn't solve the problem for other guest OSes. Considering the rest of
>>> VirtIO can be made discoverable this seems like it would be a backward
>>> step.
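>>>
>>> For example, something along these lines reserves a 16 MiB window at
>>> guest physical address 1 GiB which the kernel will then leave alone
>>> (the address and size are just illustrative, and the "$" usually needs
>>> escaping when the line goes through GRUB):
>>>
>>>     memmap=16M$0x40000000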
>>>
>>> Option 2 - Additional Platform Data
>>> ===================================
>>>
>>> This would mean extending something like the device tree or ACPI tables
>>> to define regions of memory that would inform the low level memory
>>> allocation routines where they could allocate from. There is already
>>> the concept of "dma-ranges" in device tree, which can be a per-device
>>> property that defines the region of space that is DMA coherent for a
>>> device.
>>>
>>> There is the question of how you tie the regions declared here to the
>>> eventual instantiation of the VirtIO devices.
>>>
>>> For a fully distributed set of backends (one backend per device per
>>> worker VM) you would need several different regions. Would each region
>>> be tied to a particular device, or would they just be a set of areas the
>>> guest allocates from in sequence?
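>>>
>>> As a purely illustrative sketch - the reserved-memory and "memory-region"
>>> mechanics below exist, but wiring them up to virtio vring allocation is
>>> the hypothetical part - such a region could be described as:
>>>
>>>     reserved-memory {
>>>         #address-cells = <1>;
>>>         #size-cells = <1>;
>>>         ranges;
>>>
>>>         /* window the backend is prepared to share */
>>>         virtio_shared: virtio-shm@80000000 {
>>>             compatible = "shared-dma-pool";
>>>             reg = <0x80000000 0x1000000>;   /* 16 MiB */
>>>             no-map;
>>>         };
>>>     };
>>>
>>>     virtio@a0000000 {
>>>         compatible = "virtio,mmio";
>>>         reg = <0xa0000000 0x200>;
>>>         interrupts = <42>;
>>>         /* hypothetical: allocate vrings from the window above */
>>>         memory-region = <&virtio_shared>;
>>>     };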
>>>
>>> Option 3 - Abusing PCI Regions
>>> ==============================
>>>
>>> One of the reasons to use the VirtIO PCI backend is to help with
>>> automatic probing and setup. Could we define a new PCI region which on
>>> the backend just maps to RAM but from the front-end's point of view is a
>>> region it can allocate its virtqueues in? Could we go one step further
>>> and just let the host define and allocate the virtqueues in the reserved
>>> PCI space and pass their base address somehow?
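>>>
>>> Purely as a sketch of what that could look like on the wire, reusing
>>> the existing virtio PCI capability layout (the new cfg_type value and
>>> its semantics here are entirely hypothetical):
>>>
>>>     #include <linux/virtio_pci.h>   /* struct virtio_pci_cap */
>>>
>>>     /* Hypothetical capability type: "allocate your virtqueues inside
>>>      * the BAR window described by this capability". The value would
>>>      * need to be assigned by the spec. */
>>>     #define VIRTIO_PCI_CAP_QUEUE_MEM_CFG 0x42
>>>
>>>     /* The generic layout already carries everything needed: 'bar',
>>>      * 'offset' and 'length' describe the RAM-backed window the
>>>      * device exposes for vring placement. */
>>>     struct virtio_pci_queue_mem_cap {
>>>         struct virtio_pci_cap cap; /* cfg_type = ..._QUEUE_MEM_CFG */
>>>     };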
>>>
>>> Option 4 - Extend VirtIO Config
>>> ===============================
>>>
>>> Another approach would be to extend the VirtIO configuration and
>>> start-up handshake to supply these limitations to the guest. This could
>>> be handled by the addition of a feature bit (VIRTIO_F_HOST_QUEUE?) and
>>> additional configuration information.
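>>>
>>> A minimal sketch of what that extra configuration might carry, purely
>>> following the naming suggested above (nothing here is in the spec, and
>>> the bit number is made up):
>>>
>>>     #include <linux/types.h>   /* __le64 */
>>>
>>>     /* If VIRTIO_F_HOST_QUEUE is negotiated, the device exposes the
>>>      * window the driver must place its virtqueues in. */
>>>     #define VIRTIO_F_HOST_QUEUE 41   /* illustrative bit number */
>>>
>>>     struct virtio_queue_window {
>>>         __le64 addr;   /* guest-physical base of the allowed window */
>>>         __le64 size;   /* length of the window in bytes */
>>>     };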
>>>
>>> One problem I can foresee is that device initialisation is usually done
>>> fairly late in the start-up of a kernel, by which time any memory zoning
>>> restrictions will likely need to have informed the kernel's low level
>>> memory management. Does that mean we would have to combine such a
>>> feature with another method anyway?
>>>
>>> Option 5 - Additional Device
>>> ============================
>>>
>>> The final approach would be to tie the allocation of virtqueues to
>>> memory regions as defined by additional devices. For example the
>>> proposed IVSHMEMv2 spec offers the ability for the hypervisor to present
>>> a fixed non-mappable region of the address space. Other proposals like
>>> virtio-mem allow for hot plugging of "physical" memory into the guest
>>> (conveniently treatable as separate shareable memory objects for QEMU
>>> ;-).
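>>>
>>> For the virtio-mem variant the QEMU side of such a shareable region
>>> might look something like the following (illustrative only; it assumes
>>> a QEMU with virtio-mem-pci and a suitable -m ...,maxmem=... setup):
>>>
>>>     -object memory-backend-file,id=vmem0,share=on,mem-path=/dev/shm/guest-vqs,size=1G \
>>>     -device virtio-mem-pci,id=vm0,memdev=vmem0,requested-size=256M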
>>>
>>
>> I think you forgot one approach: virtual IOMMU. That is the advanced
>> form of the grant table approach. The backend still "sees" the full
>> address space of the frontend, but it will not be able to access all of
>> it and there might even be a translation going on. Well, like IOMMUs work.
>>
>> However, this implies dynamics that are under guest control, namely of
>> the frontend guest. And such dynamics can be counterproductive for
>> certain scenarios. That's where these static windows of shared memory
>> came up.
> 
> Yes, I think IOMMU interfaces are worth investigating more too. IOMMUs
> are now widely implemented in Linux and virtualization software. That
> means guest modifications aren't necessary and unmodified guest
> applications will run.
> 
> Applications that need the best performance can use a static mapping
> while applications that want the strongest isolation can map/unmap DMA
> buffers dynamically.
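> 
> For illustration, from a Linux guest driver's point of view those two
> modes are roughly the difference between a one-off coherent allocation
> and per-buffer map/unmap. A minimal sketch using the standard DMA API
> (the wrapper names are invented):
> 
>     #include <linux/dma-mapping.h>
> 
>     /* Static: allocate the vring once at probe time. With a vIOMMU this
>      * establishes a single long-lived translation the backend can use. */
>     static void *alloc_vring(struct device *dev, size_t size, dma_addr_t *dma)
>     {
>             return dma_alloc_coherent(dev, size, dma, GFP_KERNEL);
>     }
> 
>     /* Dynamic: map and unmap around each transfer. Stronger isolation,
>      * but every map/unmap may mean a vIOMMU exit to the host. */
>     static int send_buf(struct device *dev, void *buf, size_t len)
>     {
>             dma_addr_t dma = dma_map_single(dev, buf, len, DMA_TO_DEVICE);
> 
>             if (dma_mapping_error(dev, dma))
>                     return -ENOMEM;
>             /* ... hand 'dma' to the device and wait for completion ... */
>             dma_unmap_single(dev, dma, len, DMA_TO_DEVICE);
>             return 0;
>     }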

I do not yet see how you can model a static, non-guest-controlled window
with an IOMMU.

And an IOMMU implies guest modifications as well (you need its driver). It
just happens to already be there in newer guests. A virtio shared memory
transport could be introduced similarly.

But the biggest challenge would be matching what a static mode allows
for: a trivial hypervisor-side model. Otherwise, we would only be
achieving a simpler security model by adding complexity elsewhere.

I'm not arguing against vIOMMU per se. It's there, and it is and will be
widely used. It just doesn't solve all the issues.

Jan

-- 
Siemens AG, Corporate Technology, CT RDA IOT SES-DE
Corporate Competence Center Embedded Linux

