Subject: Re: [virtio-dev] Re: Constraining where a guest may allocate virtio accessible resources
On 18.06.20 15:29, Stefan Hajnoczi wrote:
> On Wed, Jun 17, 2020 at 08:01:14PM +0200, Jan Kiszka wrote:
>> On 17.06.20 19:31, Alex Bennée wrote:
>>>
>>> Hi,
>>>
>>> This follows on from the discussion in the last thread I raised:
>>>
>>>   Subject: Backend libraries for VirtIO device emulation
>>>   Date: Fri, 06 Mar 2020 18:33:57 +0000
>>>   Message-ID: <874kv15o4q.fsf@linaro.org>
>>>
>>> To support the concept of a VirtIO backend having limited visibility
>>> of a guest's memory space, there needs to be some mechanism to limit
>>> where that guest may place things. A simple VirtIO device can be
>>> expressed purely in virtual resources, for example:
>>>
>>>  * status, feature and config fields
>>>  * notification/doorbell
>>>  * one or more virtqueues
>>>
>>> Using a PCI backend, the location of everything but the virtqueues
>>> is controlled by the mapping of the PCI device, so it is something
>>> the host/hypervisor can control. However, the guest is free to
>>> allocate the virtqueues anywhere in the guest physical address space
>>> of system RAM.
>>>
>>> In theory this shouldn't matter, because sharing virtual pages is
>>> just a matter of putting the appropriate translations in place.
>>> However, there are multiple ways the host and guest may interact:
>>>
>>> * QEMU TCG
>>>
>>>   QEMU sees a block of system memory in its virtual address space
>>>   that has a one-to-one mapping with the guest's physical address
>>>   space. If QEMU wants to share a subset of that address space, it
>>>   can only realistically do so for a contiguous region of its own
>>>   address space, which implies the guest must use a contiguous
>>>   region of its physical address space.
>>>
>>> * QEMU KVM
>>>
>>>   The situation here is broadly the same, although both QEMU and the
>>>   guest see their own virtual views of a linear address space which
>>>   may in fact be a fragmented set of physical pages on the host.
>>>
>>>   KVM-based guests have additional constraints if they ever want to
>>>   access real hardware on the host: any address accessed by the
>>>   guest must eventually be translatable into an address that can
>>>   physically reach the bus the device sits on (for device
>>>   pass-through). The area also has to be DMA coherent, so updates
>>>   from a bus are reliably visible to software accessing the same
>>>   address space.
>>>
>>> * Xen (and other type-1s?)
>>>
>>>   Here the situation is a little different, because the guest
>>>   explicitly makes its pages visible to other domains by way of
>>>   grant tables. The guest is still free to use whatever parts of its
>>>   address space it wishes. Other domains then request access to
>>>   those pages via the hypervisor.
>>>
>>>   In theory the requester is free to map the granted pages anywhere
>>>   in its own address space, but the architectures differ in how well
>>>   this is supported.
>>>
>>> So I think this makes a case for a mechanism by which the guest can
>>> restrict its allocations to a specific area of the guest physical
>>> address space. The question is then: what is the best way to inform
>>> the guest kernel of the limitation?
>>>
>>> Option 1 - Kernel Command Line
>>> ==============================
>>>
>>> This isn't without precedent - the kernel supports options like
>>> "memmap" which, with the appropriate amount of crafting, can be used
>>> to carve sections of bad RAM out of the physical address space.
>>> Other formulations can be used to mark specific areas of the address
>>> space as particular types of memory.
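>>>
>>> For example, something like the following (address and size are
>>> illustrative; the syntax is from the kernel's kernel-parameters
>>> documentation) reserves 16M at the 1G mark so the kernel never
>>> allocates from it:
>>>
>>>   memmap=16M$0x40000000
>>>
>>> (The "$" usually needs escaping in bootloader configuration files.)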
>>>
>>> However, there are cons to this approach: it becomes the job of
>>> whatever builds the VMM command lines to ensure that both the
>>> backend and the kernel know where things are. It is also very
>>> Linux-centric and doesn't solve the problem for other guest OSes.
>>> Considering the rest of VirtIO can be made discoverable, this seems
>>> like it would be a backward step.
>>>
>>> Option 2 - Additional Platform Data
>>> ===================================
>>>
>>> This would mean extending something like the device tree or the ACPI
>>> tables to define regions of memory that inform the low-level memory
>>> allocation routines where they may allocate from. There is already
>>> the concept of "dma-ranges" in the device tree, which can be a
>>> per-device property defining the region of space that is DMA
>>> coherent for a device.
>>>
>>> There is the question of how you tie the regions declared here to
>>> the eventual instantiation of the VirtIO devices.
>>>
>>> For a fully distributed set of backends (one backend per device per
>>> worker VM) you would need several different regions. Would each
>>> region be tied to a device, or would they just be a set of areas the
>>> guest allocates from in sequence?
>>>
>>> Option 3 - Abusing PCI Regions
>>> ==============================
>>>
>>> One of the reasons to use the VirtIO PCI backend is to help with
>>> automatic probing and setup. Could we define a new PCI region which
>>> on the backend side just maps to RAM, but which from the front-end's
>>> point of view is a region it can allocate its virtqueues in? Could
>>> we go one step further and just let the host define and allocate the
>>> virtqueues in the reserved PCI space and pass their base addresses
>>> somehow?
>>>
>>> Option 4 - Extend VirtIO Config
>>> ===============================
>>>
>>> Another approach would be to extend the VirtIO configuration and
>>> start-up handshake to supply these limitations to the guest. This
>>> could be handled by the addition of a feature bit
>>> (VIRTIO_F_HOST_QUEUE?) and additional configuration information
>>> (sketched below).
>>>
>>> One problem I can foresee is that device initialisation is usually
>>> done fairly late in the start-up of a kernel, by which time any
>>> memory zoning restrictions will likely need to have informed the
>>> kernel's low-level memory management. Does that mean we would have
>>> to combine such a feature with another method anyway?
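>>>
>>> A minimal sketch of what that could look like - the feature bit
>>> number, struct and field names are purely illustrative, not part of
>>> any spec:
>>>
>>>   #include <linux/types.h>  /* __le64, as in the virtio headers */
>>>
>>>   /* Hypothetical: "my virtqueues must live in this window". */
>>>   #define VIRTIO_F_HOST_QUEUE  40  /* made-up feature bit number */
>>>
>>>   struct virtio_host_queue_cfg {
>>>           __le64 region_addr;  /* guest physical base of window */
>>>           __le64 region_len;   /* window size in bytes */
>>>   };
>>>
>>> A driver that accepts VIRTIO_F_HOST_QUEUE would then be expected to
>>> place its virtqueues (and perhaps its buffers) inside
>>> [region_addr, region_addr + region_len).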
>>>
>>> Option 5 - Additional Device
>>> ============================
>>>
>>> The final approach would be to tie the allocation of virtqueues to
>>> memory regions defined by additional devices. For example, the
>>> proposed IVSHMEMv2 spec offers the ability for the hypervisor to
>>> present a fixed, non-mappable region of the address space. Other
>>> proposals like virtio-mem allow for hot-plugging of "physical"
>>> memory into the guest (conveniently treatable as separate shareable
>>> memory objects for QEMU ;-).
>>>
>>
>> I think you forgot one approach: the virtual IOMMU. That is the
>> advanced form of the grant table approach. The backend still "sees"
>> the full address space of the frontend, but it will not be able to
>> access all of it, and there might even be a translation going on -
>> well, like IOMMUs work.
>>
>> However, this implies dynamics that are under guest control, namely
>> of the frontend guest. And such dynamics can be counterproductive
>> for certain scenarios. That's where these static windows of shared
>> memory came up.
>
> Yes, I think IOMMU interfaces are worth investigating more too.
> IOMMUs are now widely implemented in Linux and virtualization
> software. That means guest modifications aren't necessary and
> unmodified guest applications will run.
>
> Applications that need the best performance can use a static mapping,
> while applications that want the strongest isolation can map/unmap
> DMA buffers dynamically.

I do not yet see how you can model a static, non-guest-controlled
window with an IOMMU. And an IOMMU implies guest modifications as well
(you need its driver); it just happens to be there already in newer
guests. A virtio shared memory transport could be introduced similarly.
But the biggest challenge would be ensuring that a static mode still
allows for a trivial hypervisor-side model. Otherwise, we would only be
trying to achieve a simpler secure model by adding complexity
elsewhere.

I'm not arguing against the vIOMMU per se. It's there, it is and will
be widely used. It just doesn't solve all the issues.

Jan

-- 
Siemens AG, Corporate Technology, CT RDA IOT SES-DE
Corporate Competence Center Embedded Linux