virtio-dev message



Subject: Re: Constraining where a guest may allocate virtio accessible resources


On Fri, Jun 19, 2020 at 06:35:39PM +0100, Alex Bennée wrote:
> 
> Stefan Hajnoczi <stefanha@redhat.com> writes:
> 
> > On Wed, Jun 17, 2020 at 06:31:15PM +0100, Alex Bennée wrote:
> >> This follows on from the discussion in the last thread I raised:
> >> 
> >>   Subject: Backend libraries for VirtIO device emulation
> >>   Date: Fri, 06 Mar 2020 18:33:57 +0000
> >>   Message-ID: <874kv15o4q.fsf@linaro.org>
> >> 
> >> To support the concept of a VirtIO backend having limited visibility of
> >
> > It's unclear what we're discussing. Does "VirtIO backend" mean
> > vhost-user devices?
> >
> > Can you describe what you are trying to do?
> >
> 
> Yes - although eventually the vhost-user device might be hosted in a
> separate VM. See this contrived architecture diagram:
> 
>                                                    |                                                     
>                 Secure World                       |          Non-secure world                 
>                                                    |   +--------------------+  +---------------+  
>                                                    |   |c1AB                |  |cGRE           |  
>                                                    |   |                    |  |               |  
>                                                    |   |     Primary OS     |  |   Secondary   |  
>                                                    |   |      (android)     |  |      VM       |  
>          +--------------+                          |   |                    |  |               |  
>          |cYEL          |                          |   |                    |  |   (Backend)   |  
>          |              |                          |   |                    |  +---------------+  
>          |              |                          |   |                    |                   
>          |  Untrusted   |                          |   |                    |                   
>          |              |                          |   |                    |  +---------------+  
>    EL0   |   Service    |                          |   |                    |  |cGRE           |
>     .    |              |                          |   |                    |  |               |
>     .    |              |                          :   | +----------------+ |  |   Secondary   |
>     .    |              |                          |   | |{io} VirtIO     | |  |      VM       |
>    EL1   |              |                          |   | |                | |  |               |
>          |  (Backend)   |                          |   | +----------------+ |  |   (Backend)   |
>          +--------------+                          |   +----------------+---+  +---------------+
>                                                    |                                        
>          +-------------------------------------+   |   +---------------------------------------+
>          |cPNK                                 |   |   |cGRE                                   |
>    EL2   |        Secure Hypervisor            |   |   |          Non-secure Hypervisor        |
>          |                                     |   |   |                                       |
>          +-------------------------------------+   |   +---------------------------------------+
>                                                    +-----------------------------------------------
>          +-------------------------------------------------------------------------------------+
>          |cRED                                                                                 |
>    EL3   |                                  Secure Firmware                                    |
>          |                                                                                     |
>          +-------------------------------------------------------------------------------------+
>   ----=-----------------------------------------------------------------------------------------   
>          +------------------------+ +-------------------------+ +------------------------------+
>          | c444                   | | {s}                c444 | | {io}                    c444 |
>    HW    |        Compute         | |         Storage         | |             I/O              |
>          |  (CPUs, GPUs, Accel)   | |  (Flash, Secure Flash)  | |   Network, USB, Peripherals  |
>          |                        | |                         | |                              |
>          +------------------------+ +-------------------------+ +------------------------------+
> 
> Here the primary OS is connected to the world through VirtIO devices
> (acting as a common HAL). Each individual device might have a secondary
> VM associated with it. Some devices might be virtual - for example a 3rd
> party DRM module. It would be untrusted, so it doesn't run as part of the
> secure firmware, but it might still need to access secure resources like
> a key store or a video port.
> 
> All of these backends should only have access to the minimum
> amount of the primary OS's memory space that they need to fulfil their
> job. While the non-secure hypervisor could be something like KVM it's
> likely the secure one will be a much more lightweight type-1 hypervisor.

This is possible with vhost-user + virtio-vhost-user:

Expose a subset of the Primary VM's memory over vhost-user to the
Secondary VM. Normally all guest RAM is exposed over vhost-user, but
it's simple to expose only a subset. The DMA region needs to be its own
file descriptor (not a larger memfd that also contains other guest RAM)
since vhost-user uses file descriptor passing to share memory.
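
For illustration, here is a rough sketch (the function and payload layout are
made up, this is not actual QEMU or libvhost-user code) of how the Primary
side could create a dedicated DMA region as its own memfd and hand it to the
Secondary over a UNIX domain socket with SCM_RIGHTS, the same fd-passing
mechanism vhost-user relies on:

  /* Hypothetical sketch: create a dedicated DMA region and pass its fd
   * to the Secondary over a UNIX domain socket (as vhost-user does). */
  #define _GNU_SOURCE
  #include <stdint.h>
  #include <string.h>
  #include <sys/mman.h>
  #include <sys/socket.h>
  #include <sys/uio.h>
  #include <unistd.h>

  static int send_dma_region(int sock, size_t size, uint64_t gpa)
  {
      int fd = memfd_create("dma-region", MFD_ALLOW_SEALING);
      if (fd < 0 || ftruncate(fd, size) < 0)
          return -1;

      /* Payload: the GPA and size the region occupies in the guest. */
      uint64_t payload[2] = { gpa, size };
      struct iovec iov = { .iov_base = payload, .iov_len = sizeof(payload) };

      /* Ancillary data carries the file descriptor itself. */
      char cbuf[CMSG_SPACE(sizeof(int))];
      struct msghdr msg = {
          .msg_iov = &iov, .msg_iovlen = 1,
          .msg_control = cbuf, .msg_controllen = sizeof(cbuf),
      };
      struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
      cmsg->cmsg_level = SOL_SOCKET;
      cmsg->cmsg_type = SCM_RIGHTS;
      cmsg->cmsg_len = CMSG_LEN(sizeof(int));
      memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

      return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
  }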

The "Guest Physical Addresses" in virtqueues don't need to be
translated, they can be the same GPAs used inside the guest. The
Secondary just have access to GPAs outside the memfd region(s) that have
been provided by the Primary.
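
On the Secondary side the check is then just a bounds test against the
regions it was given. A minimal sketch, with an invented region table rather
than any particular library's structures:

  /* Illustrative only: map a GPA from a descriptor to a local pointer.
   * Anything outside the shared regions is simply unreachable. */
  #include <stddef.h>
  #include <stdint.h>

  struct mem_region {          /* hypothetical region table entry      */
      uint64_t gpa_start;      /* guest physical base of the region    */
      uint64_t size;           /* length of the region in bytes        */
      uint8_t *mmap_addr;      /* where the Secondary mmapped the fd   */
  };

  void *gpa_to_va(struct mem_region *regions, size_t nregions,
                  uint64_t gpa, uint64_t len)
  {
      for (size_t i = 0; i < nregions; i++) {
          struct mem_region *r = &regions[i];
          if (gpa >= r->gpa_start && gpa + len <= r->gpa_start + r->size)
              return r->mmap_addr + (gpa - r->gpa_start);
      }
      return NULL;   /* GPA was never shared by the Primary */
  }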

The guest drivers in the Primary will need to copy buffers to the DMA
memory region if existing applications are not aware of DMA address
constraints.
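
Very roughly, the bounce step looks like this (names invented; a real guest
driver would use the kernel's DMA API / swiotlb machinery rather than
open-coding a bump allocator):

  /* Sketch: copy an application buffer into the restricted DMA window
   * before queueing it, since the application's pages may live at GPAs
   * the backend cannot see. */
  #include <stddef.h>
  #include <stdint.h>
  #include <string.h>

  struct dma_window {
      void     *va;      /* kernel mapping of the restricted region */
      uint64_t  gpa;     /* its guest physical base address         */
      size_t    next;    /* trivial bump-allocator offset           */
      size_t    size;
  };

  /* Returns the GPA to place in the descriptor, or 0 on failure. */
  static uint64_t bounce_in(struct dma_window *w, const void *buf, size_t len)
  {
      if (w->next + len > w->size)
          return 0;                          /* window exhausted */
      memcpy((char *)w->va + w->next, buf, len);
      uint64_t gpa = w->gpa + w->next;
      w->next += len;
      return gpa;
  }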

Nikos Dragazis has DPDK and SPDK code with virtio-vhost-user support, so
you could use that as a starting point to add DMA constraints. The
Secondary either emulates a virtio-net (DPDK) device or a virtio-scsi
(SPDK) device:

https://ndragazis.github.io/spdk.html

The missing pieces are:

1. A way to associate memory backends with vhost-user devices so the
   Primary knows which memory to expose:

     -object memory-backend-memfd,id=foo,...
     -device vhost-user-scsi-pci,memory-backend[0]=foo,...

   Now the vhost-user-scsi-pci device will only expose the 'foo' memfd
   over vhost-user, not all of guest RAM.

2. A way to communicate per-device DMA address restrictions to the
   Primary OS and the necessary driver memory allocation changes and/or
   bounce buffers.
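
   Purely as a hypothetical sketch of point 2 (no such field exists in the
   VirtIO spec today), the device could advertise a DMA window in its
   config space and the guest driver would allocate or bounce into it:

     /* Entirely hypothetical config-space layout a device could use to
      * advertise the GPA range its backend is able to reach. */
     #include <stdint.h>

     struct virtio_dma_window_cfg {
         uint64_t window_base;   /* lowest GPA the backend can access */
         uint64_t window_size;   /* length of the reachable region    */
     };

     /* Guest-side check when placing virtqueues or bouncing buffers. */
     static inline int addr_in_window(const struct virtio_dma_window_cfg *c,
                                      uint64_t gpa, uint64_t len)
     {
         return gpa >= c->window_base &&
                gpa + len <= c->window_base + c->window_size;
     }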

Adding those two things on top of virtio-vhost-user will do what you've
described in the diagram.

> My interest in including TCG in this mix is for early prototyping and
> ease of debugging when working with this towering array of layers ;-)

Once TCG works with vhost-user it will also work with virtio-vhost-user.

> >> a guest's memory space there needs to be some mechanism to limit
> >> where that guest may place things.
> >
> > Or an enforcing IOMMU? In other words, an IOMMU that only gives access
> > to memory that has been put forth for DMA.
> >
> > This was discussed recently in the context of the ongoing
> > vfio-over-socket work ("RFC: use VFIO over a UNIX domain socket to
> > implement device offloading" on qemu-devel). The idea is to use the VFIO
> > protocol but over UNIX domain sockets to another host userspace process
> > instead of over ioctls to the kernel VFIO drivers. This would allow
> > arbitrary devices to be emulated in a separate process from QEMU. As a
> > first step I suggested DMA_READ/DMA_WRITE protocol messages, even though
> > this will have poor performance.
> 
> This is still mediated by a kernel though right?

The host kernel? The guest kernel?

It's possible to have a vIOMMU in the guest and no host kernel IOMMU.

The guest kernel needs to support the vIOMMU. The infrastructure is there
in Linux since the vIOMMU can already be used for device passthrough.

> > I think finding a solution for an enforcing IOMMU is preferable to
> > guest cooperation. The problem with guest cooperation is that you may be
> > able to get new VIRTIO guest drivers to restrict where the virtqueues
> > are placed, but what about applications (e.g. O_DIRECT disk I/O, network
> > packets) with memory buffers at arbitrary addresses?
> 
> The virtqueues are the simple case but yes it gets complex for the rest
> of the data - the simple case is handled by a bounce buffer which the
> guest then copies from into its own secure address space.
> 
> > Modifying guest applications to honor buffer memory restrictions is too
> > disruptive for most use cases.
> >
> >> A simple VirtIO device can be
> >> expressed purely in virt resources, for example:
> >> 
> >>    * status, feature and config fields
> >>    * notification/doorbell
> >>    * one or more virtqueues
> >> 
> >> Using a PCI backend, the location of everything but the virtqueues is
> >> controlled by the mapping of the PCI device, so it is something that is
> >> controllable by the host/hypervisor. However the guest is free to
> >> allocate the virtqueues anywhere in the virtual address space of system
> >> RAM.
> >> 
> >> In theory this shouldn't matter because sharing virtual pages is just a
> >> matter of putting the appropriate translations in place. However there
> >> are multiple ways the host and guest may interact:
> >> 
> >> * QEMU TCG
> >> 
> >> QEMU sees a block of system memory in its virtual address space that
> >> has a one to one mapping with the guest's physical address space. If QEMU
> >> wants to share a subset of that address space it can only realistically
> >> do it for a contiguous region of its address space, which implies the
> >> guest must use a contiguous region of its physical address space.
> >
> > This paragraph doesn't reflect my understanding. There can be multiple
> > RAMBlocks. There isn't necessarily just 1 contiguous piece of RAM.
> >
> >> 
> >> * QEMU KVM
> >> 
> >> The situation here is broadly the same - although both QEMU and the
> >> guest are seeing their own virtual views of a linear address space
> >> which may well actually be a fragmented set of physical pages on the
> >> host.
> >
> > I don't understand the "although" part. Isn't the situation the same as
> > with TCG, where guest physical memory ranges can cross RAMBlock
> > boundaries?
> 
> You are correct - I was over-simplifying. This is why I was thinking
> about the virtio-mem device. That would have its own RAMBlock which
> could be the only one with an associated shared memory object.
> 
> >> KVM based guests have additional constraints if they ever want to access
> >> real hardware in the host as you need to ensure any address accessed by
> >> the guest can eventually be translated into an address that can
> >> physically access the bus that the device is on (for device
> >> pass-through). The area also has to be DMA coherent so updates from a
> >> bus are reliably visible to software accessing the same address space.
> >
> > I'm surprised about the DMA coherency sentence. Don't VFIO and other
> > userspace I/O APIs provide the DMA APIs allowing applications to deal
> > with caches/coherency?
> 
> Yes - but the kernel has to ensure the buffers used by these APIs are
> allocated in regions that meet the requirements.

Is coherency an issue for software devices? Normally software device
implementations need to respect memory ordering (e.g. by using memory
barriers), but beyond that there is nothing else to worry about.
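
For example, the device side of a split virtqueue needs an acquire barrier
between reading the avail index and reading the ring entries it covers. A
simplified sketch (not taken from any particular implementation, no wrap
handling):

  #include <stdint.h>

  struct vring_avail {
      uint16_t flags;
      uint16_t idx;            /* written last by the driver */
      uint16_t ring[];
  };

  /* Returns the next available descriptor head, or UINT16_MAX if none. */
  static uint16_t pop_avail(struct vring_avail *avail,
                            uint16_t *last_seen, uint16_t qsize)
  {
      /* Acquire: only read ring[] after observing the updated idx. */
      uint16_t idx = __atomic_load_n(&avail->idx, __ATOMIC_ACQUIRE);
      if (idx == *last_seen)
          return UINT16_MAX;               /* nothing new */
      return avail->ring[(*last_seen)++ % qsize];
  }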

> >
> >> 
> >> * Xen (and other type-1's?)
> >> 
> >> Here the situation is a little different because the guest explicitly
> >> makes its pages visible to other domains by way of grant tables. The
> >> guest is still free to use whatever parts of its address space it wishes
> >> to. Other domains then request access to those pages via the hypervisor.
> >> 
> >> In theory the requester is free to map the granted pages anywhere in
> >> its own address space. However there are differences between the
> >> architectures on how well this is supported.
> >> 
> >> So I think this makes a case for having a mechanism by which the guest
> >> can restrict its allocation to a specific area of the guest physical
> >> address space. The question is then what is the best way to inform the
> >> guest kernel of the limitation?
> >
> > As mentioned above, I don't think it's possible to do this without
> > modifying applications - which is not possible in many use cases.
> > Instead we could improve IOMMU support so that this works transparently.
> 
> Yes. So the IOMMU allows the guest to mark all the pages associated with
> a particular device and its transactions, but how do we map that to the
> userspace view, which is controlled in software?

The userspace device implementation in the Secondary? vhost-user
supports vIOMMU address translation. It asks the vIOMMU to translate an
IOVA to a GPA. The problem is that the device can still access all of
guest RAM. I'm not aware of a fast and secure interface for changing
mmaps in another process :( so it seems tricky to achieve this.
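
Roughly speaking (the structures below are invented, not libvhost-user's
actual API), the Secondary keeps an IOTLB cache and on a miss asks the
Primary for a translation, e.g. via a VHOST_USER_SLAVE_IOTLB_MSG-style
request:

  /* Sketch of IOVA -> GPA translation with a flat IOTLB cache.  The miss
   * path (asking the Primary and waiting for an update) is omitted. */
  #include <stddef.h>
  #include <stdint.h>

  struct iotlb_entry {
      uint64_t iova, size, gpa;
      int      perm;                      /* read/write permissions */
  };

  int iotlb_translate(const struct iotlb_entry *tlb, size_t n,
                      uint64_t iova, uint64_t *gpa_out)
  {
      for (size_t i = 0; i < n; i++) {
          if (iova >= tlb[i].iova && iova < tlb[i].iova + tlb[i].size) {
              *gpa_out = tlb[i].gpa + (iova - tlb[i].iova);
              return 1;
          }
      }
      return 0;   /* miss: request a translation from the Primary, retry */
  }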

If you care more about security than performance you can add DMA
read/write messages to the vhost-user protocol. The Secondary will then
perform each DMA read/write by sending a message to the Primary,
including the data that needs to be transferred.
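
To make that concrete, a hypothetical wire format for such messages might
look like this (the request numbers and struct are made up; nothing like
this is in the vhost-user protocol today):

  /* Hypothetical DMA read/write requests carried over the vhost-user
   * socket instead of touching guest memory directly. */
  #include <stdint.h>

  enum {
      VHOST_USER_DMA_READ  = 100,   /* made-up request numbers */
      VHOST_USER_DMA_WRITE = 101,
  };

  struct vhost_user_dma_msg {
      uint32_t request;     /* VHOST_USER_DMA_READ or _WRITE        */
      uint64_t gpa;         /* guest physical address to access     */
      uint32_t len;         /* number of bytes                      */
      uint8_t  data[];      /* payload for writes / reply for reads */
  };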

To get slightly better performance this could be enhanced with
traditional shared memory regions for the virtqueues so that at least
the vring can be accessed via shared memory. Then only indirect
descriptor tables and the actual data buffers need to take the slow
path through the Primary.

Stefan



