Subject: Re: [virtio-dev] Enabling hypervisor agnosticism for VirtIO backends


On Mon, Aug 23, 2021 at 10:20:29AM +0900, AKASHI Takahiro wrote:
> Hi Matias,
> 
> On Sat, Aug 21, 2021 at 04:08:20PM +0200, Matias Ezequiel Vara Larsen wrote:
> > Hello,
> > 
> > On Fri, Aug 20, 2021 at 03:05:58PM +0900, AKASHI Takahiro wrote:
> > > Hi Matias,
> > > 
> > > On Thu, Aug 19, 2021 at 11:11:55AM +0200, Matias Ezequiel Vara Larsen wrote:
> > > > Hello Alex,
> > > > 
> > > > I can tell you about my experience working on a PoC (library)
> > > > that allows the implementation of virtio-devices that are hypervisor/OS agnostic.
> > > 
> > > What hypervisor are you using for your PoC here?
> > > 
> > 
> > I am using an in-house hypervisor, which is similar to Jailhouse.
> > 
> > > > I focused on two use cases:
> > > > 1. A type-1 hypervisor in which the backend is running as a VM. This
> > > > is an in-house hypervisor that does not support VMExits.
> > > > 2. Linux user-space. In this case, the library is just used for
> > > > communication between threads. This use case exists merely for testing.
> > > > 
> > > > I have chosen virtio-mmio as the way to exchange information
> > > > between the frontend and backend. I found it hard to synchronize
> > > > access to the virtio-mmio layout without VMExits. I had to add some extra bits to allow
> > > 
> > > Can you explain how MMIOs to registers in the virtio-mmio layout
> > > (which I think means a configuration space?) will be propagated to the BE?
> > > 
> > 
> > In this PoC, the BE guest is created with a fixed number of memory
> > regions, one representing each device. The BE initializes these regions and then waits
> > for the FEs to begin the initialization.
> 
> Let me ask in another way: when the FE tries to write a register
> in the configuration space, say QueueSel, how is the BE notified of this event?
> 
In my PoC, the BE is never notified when the FE writes to a register. For example, QueueSel is only used in one of the
steps of the device-status configuration, and the BE is only notified when the
FE reaches that step. When the FE is setting up the vrings, it sets the address, sets QueueSel, and
then blocks until the BE has read the values. The BE reads the values and resumes the FE, which moves on to the next step.
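
To make that concrete, below is a rough FE-side sketch in C of that handshake.
The two virtio-mmio offsets are the standard ones, but the extra SYNC word and
the hv_*() calls are hypothetical names standing in for the extra bits and the
platform-layer hypercalls of my PoC:

  #include <stdint.h>

  #define VIRTIO_MMIO_QUEUE_SEL       0x030   /* standard virtio-mmio offset */
  #define VIRTIO_MMIO_QUEUE_DESC_LOW  0x080   /* standard virtio-mmio offset */
  #define VIRTIO_MMIO_EXTRA_SYNC      0x200   /* extra word, NOT in the spec */

  /* Platform-layer hooks (hypothetical names): a notification hypercall and
   * a way to block/yield until the BE resumes us. */
  void hv_notify_backend(void);
  void hv_wait_for_backend(void);

  static inline void mmio_write32(volatile uint8_t *base, uint32_t off, uint32_t val)
  {
      *(volatile uint32_t *)(base + off) = val;
  }

  static inline uint32_t mmio_read32(volatile uint8_t *base, uint32_t off)
  {
      return *(volatile uint32_t *)(base + off);
  }

  /* FE: publish the vring address for queue 'sel', then block until the BE
   * has picked the values up and cleared the sync word. */
  void fe_setup_queue(volatile uint8_t *mmio, uint32_t sel, uint32_t desc_lo)
  {
      mmio_write32(mmio, VIRTIO_MMIO_QUEUE_SEL, sel);
      mmio_write32(mmio, VIRTIO_MMIO_QUEUE_DESC_LOW, desc_lo);

      mmio_write32(mmio, VIRTIO_MMIO_EXTRA_SYNC, 1);  /* "values are ready" */
      hv_notify_backend();                            /* kick the BE guest  */
      while (mmio_read32(mmio, VIRTIO_MMIO_EXTRA_SYNC) != 0)
          hv_wait_for_backend();                      /* BE clears the word */
  }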

> > > > the front-end and back-end to synchronize, which is required
> > > > during the device-status initialization. These extra bits would not be
> > > > needed if the hypervisor supported VMExits, e.g., KVM.
> > > > 
> > > > Each guest has a memory region that is shared with the backend. 
> > > > This memory region is used by the frontend to allocate the io-buffers. This region also 
> > > > maps the virtio-mmio layout that is initialized by the backend. For the moment, this region 
> > > > is defined when the guest is created. One limitation is that the memory for io-buffers is fixed. 
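> > > >
> > > > Roughly, the per-device shared region looks like this (the sizes here
> > > > are made up for illustration; only the split between the register
> > > > layout and the io-buffer pool matters):
> > > >
> > > >   #include <stdint.h>
> > > >
> > > >   struct shared_region {
> > > >       uint8_t mmio[0x200];                 /* virtio-mmio register layout,
> > > >                                             * initialized by the backend  */
> > > >       uint8_t io_buffers[2 * 1024 * 1024]; /* io-buffer pool used by the
> > > >                                             * frontend, fixed at guest
> > > >                                             * creation time               */
> > > >   };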
> > > 
> > > So in summary, you have a single memory region that is used
> > > for the virtio-mmio layout and io-buffers (I think they are for payload),
> > > and you assume that the region will be (at least for now) statically
> > > shared between FE and BE so that you can avoid calling 'mmap' every
> > > time the payload is accessed.
> > > Correct?
> > >
> > 
> > Yes, it is.
> > 
> > > If so, it can be an alternative solution for the memory access issue,
> > > and a similar technique is used in some implementations:
> > > - (Jailhouse's) ivshmem
> > > - Arnd's fat virtqueue
> > >
> > > In either case, however, you will have to allocate payload from the region
> > > and so you will see some impact on FE code (at least at some low level).
> > > (In ivshmem, dma_ops in the kernel is defined for this purpose.)
> > > Correct?
> > 
> > Yes, it is. The FE implements a sort of malloc() to manage the allocation of io-buffers from that
> > memory region.
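> >
> > Something like this trivial bump allocator (hypothetical names, no freeing
> > or locking, just to show the idea):
> >
> >   #include <stddef.h>
> >   #include <stdint.h>
> >
> >   static uint8_t *iobuf_base;  /* start of the io-buffer area in the region */
> >   static size_t   iobuf_size;  /* fixed when the guest is created           */
> >   static size_t   iobuf_next;  /* offset of the next free byte              */
> >
> >   void *iobuf_alloc(size_t len)
> >   {
> >       len = (len + 63) & ~(size_t)63;     /* keep buffers 64-byte aligned */
> >       if (len > iobuf_size - iobuf_next)
> >           return NULL;                    /* fixed-size pool exhausted    */
> >       void *p = iobuf_base + iobuf_next;
> >       iobuf_next += len;
> >       return p;
> >   }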
> > 
> > Rethinking the VMExits, I am not sure how this mechanism could be used when both the FE and
> > the BE are VMs. The use of VMExits would require involving the hypervisor.
> 
> Maybe I misunderstand something. Are FE/BE not VMs in your PoC?
> 

Yes, both are VMs. I meant that, in the case where both are VMs AND a VMExit
mechanism is used, such a mechanism would require the hypervisor to
forward the traps. In my PoC, both are VMs BUT there is no VMExit
mechanism.

Matias
> -Takahiro Akashi
> 
> > Matias
> > > 
> > > -Takahiro Akashi
> > > 
> > > > At some point, the guest shall be able to balloon this region. Notifications between
> > > > the frontend and the backend are implemented by using a hypercall. The hypercall
> > > > mechanism and the memory allocation are abstracted away by a platform layer that
> > > > exposes a hypervisor/OS-agnostic interface.
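> > > >
> > > > That interface is roughly of the following shape (the names here are
> > > > illustrative, not the actual PoC API):
> > > >
> > > >   #include <stddef.h>
> > > >
> > > >   /* One implementation of these ops per hypervisor/OS port; the rest
> > > >    * of the code only ever calls through the pointers. */
> > > >   struct platform_ops {
> > > >       /* map (or return) the shared region of device 'dev_id' */
> > > >       void *(*map_shared_region)(unsigned int dev_id, size_t *len);
> > > >       /* notify the other side, e.g. a hypercall in the type-1 case
> > > >        * or an eventfd/futex in the Linux user-space case */
> > > >       void  (*notify_peer)(unsigned int dev_id);
> > > >       /* block until the other side notifies us */
> > > >       void  (*wait_for_peer)(unsigned int dev_id);
> > > >   };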
> > > > 
> > > > I split the backend into a virtio-device driver and a
> > > > backend driver. The virtio-device driver handles the virtqueues, and the
> > > > backend driver gets packets from the virtqueue for
> > > > post-processing. For example, in the case of virtio-net, the backend
> > > > driver would decide whether the packet goes to the hardware or to another
> > > > virtio-net device. The virtio-device drivers may be
> > > > implemented in different ways, e.g., using a single thread, multiple threads,
> > > > or one thread for all the virtio-devices.
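> > > >
> > > > In code terms the split looks roughly like this (illustrative names
> > > > only, not the actual PoC interface):
> > > >
> > > >   #include <stddef.h>
> > > >
> > > >   /* Implemented by the backend driver (e.g. the virtio-net backend). */
> > > >   struct backend_driver_ops {
> > > >       /* Called by the virtio-device driver for each buffer chain popped
> > > >        * off the virtqueue; for virtio-net this is where the packet is
> > > >        * routed to the hardware or to another virtio-net device. */
> > > >       void (*process)(void *buf, size_t len);
> > > >   };
> > > >
> > > >   /* The virtio-device driver side: walk the available ring and hand
> > > >    * the buffers over. The threading model (one thread per device,
> > > >    * several threads, or one thread for everything) lives in this loop. */
> > > >   void virtio_device_poll(struct backend_driver_ops *ops);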
> > > > 
> > > > In this PoC, I just tackled two very simple use-cases. These
> > > > use-cases allowed me to extract some requirements for a hypervisor to
> > > > support virtio.
> > > > 
> > > > Matias
> > > > 
> > > > On Wed, Aug 04, 2021 at 10:04:30AM +0100, Alex Bennée wrote:
> > > > > Hi,
> > > > > 
> > > > > One of the goals of Project Stratos is to enable hypervisor agnostic
> > > > > backends so we can enable as much re-use of code as possible and avoid
> > > > > repeating ourselves. This is the flip side of the front end where
> > > > > multiple front-end implementations are required - one per OS, assuming
> > > > > you don't just want Linux guests. The resultant guests are trivially
> > > > > movable between hypervisors modulo any abstracted paravirt type
> > > > > interfaces.
> > > > > 
> > > > > In my original thumbnail sketch of a solution I envisioned vhost-user
> > > > > daemons running in a broadly POSIX-like environment. The interface to
> > > > > the daemon is fairly simple requiring only some mapped memory and some
> > > > > sort of signalling for events (on Linux this is eventfd). The idea was a
> > > > > stub binary would be responsible for any hypervisor specific setup and
> > > > > then launch a common binary to deal with the actual virtqueue requests
> > > > > themselves.
> > > > > 
> > > > > Since that original sketch we've seen an expansion in the sort of ways
> > > > > backends could be created. There is interest in encapsulating backends
> > > > > in RTOSes or unikernels for solutions like SCMI. The interest in Rust
> > > > > has prompted ideas of using the trait interface to abstract differences
> > > > > away as well as the idea of bare-metal Rust backends.
> > > > > 
> > > > > We have a card (STR-12) called "Hypercall Standardisation" which
> > > > > calls for a description of the APIs needed from the hypervisor side to
> > > > > support VirtIO guests and their backends. However we are some way off
> > > > > from that at the moment as I think we need to at least demonstrate one
> > > > > portable backend before we start codifying requirements. To that end I
> > > > > want to think about what we need for a backend to function.
> > > > > 
> > > > > Configuration
> > > > > =============
> > > > > 
> > > > > In the type-2 setup this is typically fairly simple because the host
> > > > > system can orchestrate the various modules that make up the complete
> > > > > system. In the type-1 case (or even type-2 with delegated service VMs)
> > > > > we need some sort of mechanism to inform the backend VM about key
> > > > > details about the system:
> > > > > 
> > > > >   - where virt queue memory is in its address space
> > > > >   - how it's going to receive (interrupt) and trigger (kick) events
> > > > >   - what (if any) resources the backend needs to connect to
> > > > > 
> > > > > Obviously you can gloss over configuration issues by having static
> > > > > configurations and baking the assumptions into your guest images; however,
> > > > > this isn't scalable in the long term. The obvious solution seems to be
> > > > > extending a subset of Device Tree data to user space but perhaps there
> > > > > are other approaches?
> > > > > 
> > > > > Before any virtio transactions can take place the appropriate memory
> > > > > mappings need to be made between the FE guest and the BE guest.
> > > > > Currently the whole of the FE guest's address space needs to be visible
> > > > > to whatever is serving the virtio requests. I can envision 3 approaches:
> > > > > 
> > > > >  * BE guest boots with memory already mapped
> > > > > 
> > > > >  This would entail the guest OS knowing which parts of its Guest Physical
> > > > >  Address space are already taken up and avoiding clashes. I would assume
> > > > >  in this case you would want a standard interface to userspace to then
> > > > >  make that address space visible to the backend daemon.
> > > > > 
> > > > >  * BE guest boots with a hypervisor handle to memory
> > > > > 
> > > > >  The BE guest is then free to map the FE's memory to where it wants in
> > > > >  the BE's guest physical address space. To activate the mapping will
> > > > >  require some sort of hypercall to the hypervisor. I can see two options
> > > > >  at this point:
> > > > > 
> > > > >   - expose the handle to userspace for daemon/helper to trigger the
> > > > >     mapping via existing hypercall interfaces. If using a helper you
> > > > >     would have a hypervisor specific one to avoid the daemon having to
> > > > >     care too much about the details or push that complexity into a
> > > > >     compile-time option for the daemon, which would result in different
> > > > >     binaries albeit from a common source base.
> > > > > 
> > > > >   - expose a new kernel ABI to abstract the hypercall differences away
> > > > >     in the guest kernel. In this case the userspace would essentially
> > > > >     ask for an abstract "map guest N memory to userspace ptr" and let
> > > > >     the kernel deal with the different hypercall interfaces. This of
> > > > >     course assumes the majority of BE guests would be Linux kernels and
> > > > >     leaves the bare-metal/unikernel approaches to their own devices.
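> > > > >
> > > > > To make that second option concrete, a purely hypothetical ioctl-style
> > > > > ABI could look something like this (no such interface exists today;
> > > > > the names are invented for illustration):
> > > > >
> > > > >   #include <linux/ioctl.h>
> > > > >   #include <linux/types.h>
> > > > >
> > > > >   /* Hypothetical request: "map guest N's memory into my address space". */
> > > > >   struct map_guest_mem {
> > > > >       __u32 guest_id;   /* which FE guest                */
> > > > >       __u32 pad;
> > > > >       __u64 gpa;        /* guest physical address to map */
> > > > >       __u64 len;        /* length of the mapping         */
> > > > >   };
> > > > >
> > > > >   #define MAP_GUEST_MEM _IOW('M', 0x00, struct map_guest_mem)
> > > > >
> > > > >   /*
> > > > >    * Userspace would then do roughly:
> > > > >    *
> > > > >    *     ioctl(fd, MAP_GUEST_MEM, &req);
> > > > >    *     ptr = mmap(NULL, req.len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
> > > > >    *
> > > > >    * and the kernel would hide whichever Xen/KVM/other hypercall is
> > > > >    * needed to back the mapping.
> > > > >    */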
> > > > > 
> > > > > Operation
> > > > > =========
> > > > > 
> > > > > The core of the operation of VirtIO is fairly simple. Once the
> > > > > vhost-user feature negotiation is done it's a case of receiving update
> > > > > events and parsing the resultant virt queue for data. The vhost-user
> > > > > specification handles a bunch of setup before that point, mostly to
> > > > > detail where the virt queues are and to set up FDs for memory and event
> > > > > communication. This is where the envisioned stub process would be
> > > > > responsible for getting the daemon up and ready to run. This is
> > > > > currently done inside a big VMM like QEMU but I suspect a modern
> > > > > approach would be to use the rust-vmm vhost crate. It would then either
> > > > > communicate with the kernel's abstracted ABI or be re-targeted as a
> > > > > build option for the various hypervisors.
> > > > > 
> > > > > One question is how to best handle notification and kicks. The existing
> > > > > vhost-user framework uses eventfd to signal the daemon (although QEMU
> > > > > is quite capable of simulating them when you use TCG). Xen has its own
> > > > > IOREQ mechanism. However, latency is an important factor and having
> > > > > events go through the stub would add quite a lot.
> > > > > 
> > > > > Could we consider the kernel internally converting IOREQ messages from
> > > > > the Xen hypervisor to eventfd events? Would this scale with other kernel
> > > > > hypercall interfaces?
> > > > > 
> > > > > So any thoughts on what directions are worth experimenting with?
> > > > > 
> > > > > -- 
> > > > > Alex Bennée
> > > > > 

