

Subject: Re: [virtio-dev] Enabling hypervisor agnosticism for VirtIO backends


Hello,

On Fri, Aug 20, 2021 at 03:05:58PM +0900, AKASHI Takahiro wrote:
> Hi Matias,
> 
> On Thu, Aug 19, 2021 at 11:11:55AM +0200, Matias Ezequiel Vara Larsen wrote:
> > Hello Alex,
> > 
> > I can tell you my experience from working on a PoC (library) 
> > to allow the implementation of virtio-devices that are hypervisor/OS agnostic. 
> 
> What hypervisor are you using for your PoC here?
> 

I am using an in-house hypervisor, which is similar to Jailhouse.

> > I focused on two use cases:
> > 1. A type-1 hypervisor in which the backend is running as a VM. This
> > is an in-house hypervisor that does not support VMExits.
> > 2. Linux user-space. In this case, the library is just used for
> > communication between threads. The goal of this use case is merely testing.
> > 
> > I have chosen virtio-mmio as the way to exchange information
> > between the frontend and backend. I found it hard to synchronize the
> > access to the virtio-mmio layout without VMExits. I had to add some extra bits to allow 
> 
> Can you explain how MMIOs to registers in the virtio-mmio layout
> (which I think means the configuration space?) will be propagated to the BE?
> 

In this PoC, the BE guest is created with a fixed number of memory regions,
one for each device. The BE initializes these regions and then waits
for the FEs to begin the initialization.
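
Roughly, each per-device region looks like the sketch below (the names and
exact layout are a simplification for illustration, not the actual PoC code):

/* Sketch of one per-device shared region (illustrative names/layout). */
#include <stdint.h>

#define VIRTIO_MMIO_LAYOUT_SIZE 0x200  /* virtio-mmio register block */

struct device_region {
    /* virtio-mmio register layout, initialized by the BE */
    uint8_t mmio[VIRTIO_MMIO_LAYOUT_SIZE];

    /* extra bits used to synchronize the FE and the BE during
     * device-status initialization, since there is no VMExit to
     * trap the status register writes */
    volatile uint32_t fe_status_seq;  /* bumped by the FE after a status update */
    volatile uint32_t be_status_ack;  /* bumped by the BE once it has seen it */

    /* the rest of the region is used by the FE for io-buffers */
    uint8_t iobuf[];
};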

> > the front-end and back-end to synchronize, which is required
> > during the device-status initialization. These extra bits would not be
> > needed if the hypervisor supported VMExits, e.g., KVM.
> > 
> > Each guest has a memory region that is shared with the backend. 
> > This memory region is used by the frontend to allocate the io-buffers. This region also 
> > maps the virtio-mmio layout that is initialized by the backend. For the moment, this region 
> > is defined when the guest is created. One limitation is that the memory for io-buffers is fixed. 
> 
> So in summary, you have a single memory region that is used
> for the virtio-mmio layout and io-buffers (I think they are for payload)
> and you assume that the region will be (at least for now) statically
> shared between FE and BE so that you can eliminate an 'mmap' every
> time you access the payload.
> Correct?
>

Yes, it is.

> If so, it can be an alternative solution for the memory access issue,
> and a similar technique is used in some implementations:
> - (Jailhouse's) ivshmem
> - Arnd's fat virtqueue
>
> In either case, however, you will have to allocate the payload from the region,
> and so you will see some impact on the FE code (at least at some low level).
> (In ivshmem, dma_ops in the kernel is defined for this purpose.)
> Correct?

Yes, it is. The FE implements a sort of malloc() to manage the allocation of io-buffers from that
memory region.
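
For reference, that allocator does not need to be anything sophisticated; a
simple bump allocator over the io-buffer area of the shared region is enough
for the PoC. A minimal sketch (illustrative, not the actual code):

/* Minimal bump allocator over the shared io-buffer area (sketch only). */
#include <stddef.h>
#include <stdint.h>

static uint8_t *iobuf_base;  /* start of the io-buffer area in the shared region */
static size_t   iobuf_size;  /* fixed when the guest is created */
static size_t   iobuf_off;   /* current allocation offset */

void iobuf_init(void *base, size_t size)
{
    iobuf_base = base;
    iobuf_size = size;
    iobuf_off  = 0;
}

/* Returns a buffer inside the shared region, or NULL if the fixed area is full. */
void *iobuf_alloc(size_t len)
{
    size_t aligned = (len + 63) & ~(size_t)63;  /* keep buffers cache-line aligned */

    if (iobuf_off + aligned > iobuf_size)
        return NULL;

    void *p = iobuf_base + iobuf_off;
    iobuf_off += aligned;
    return p;
}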

Thinking again about the VMExits, I am not sure how this mechanism could be used when both the FE and
the BE are VMs. Using VMExits may require involving the hypervisor.

Matias
> 
> -Takahiro Akashi
> 
> > At some point, the guest shall be able to balloon this region. Notifications between
> > the frontend and the backend are implemented by using a hypercall. The hypercall
> > mechanism and the memory allocation are abstracted away by a platform layer that
> > exposes an interface that is hypervisor/OS agnostic.
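
(To picture that platform layer: it is essentially a small ops table that each
port, the in-house hypervisor or Linux user space, fills in. The sketch below
is simplified and the names are illustrative, not the real interface.)

/* Hypervisor/OS-agnostic platform interface (simplified sketch). */
#include <stddef.h>

struct platform_ops {
    /* map/obtain the shared region backing a given device */
    void *(*get_device_region)(unsigned int dev_id, size_t *size);

    /* notify the other side: a hypercall on the type-1 port, a
     * thread-signalling primitive on the Linux user-space port */
    void (*notify)(unsigned int dev_id);

    /* block until the other side notifies us */
    void (*wait)(unsigned int dev_id);
};

/* Each port provides its own implementation at build time. */
extern const struct platform_ops *platform;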
> > 
> > I split the backend into a virtio-device driver and a
> > backend driver. The virtio-device driver handles the virtqueues and the
> > backend driver gets packets from the virtqueue for
> > post-processing. For example, in the case of virtio-net, the backend
> > driver would decide if the packet goes to the hardware or to another
> > virtio-net device. The virtio-device drivers may be
> > implemented in different ways, such as using a single thread, multiple threads,
> > or one thread for all the virtio-devices.
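
(To make that split a bit more concrete, the boundary between the two looks
roughly like this; the helper names are made up for illustration.)

/* The virtio-device driver owns the virtqueues and hands complete buffers
 * to a backend driver for post-processing (sketch only). */
#include <stddef.h>

struct virtqueue;  /* provided by the virtio-device driver */
void *virtqueue_pop(struct virtqueue *vq, size_t *len);             /* hypothetical */
void  virtqueue_push(struct virtqueue *vq, void *buf, size_t len);  /* hypothetical */

struct backend_ops {
    /* e.g. for virtio-net: decide whether the packet goes to the
     * hardware or to another virtio-net device */
    void (*process)(void *buf, size_t len);
};

void virtio_device_poll(struct virtqueue *vq, const struct backend_ops *be)
{
    size_t len;
    void *buf;

    while ((buf = virtqueue_pop(vq, &len)) != NULL) {
        be->process(buf, len);
        virtqueue_push(vq, buf, len);  /* return the buffer to the FE */
    }
}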
> > 
> > In this PoC, I just tackled two very simple use-cases. These
> > use-cases allowed me to extract some requirements for a hypervisor to
> > support virtio.
> > 
> > Matias
> > 
> > On Wed, Aug 04, 2021 at 10:04:30AM +0100, Alex Bennée wrote:
> > > Hi,
> > > 
> > > One of the goals of Project Stratos is to enable hypervisor agnostic
> > > backends so we can enable as much re-use of code as possible and avoid
> > > repeating ourselves. This is the flip side of the front end where
> > > multiple front-end implementations are required - one per OS, assuming
> > > you don't just want Linux guests. The resultant guests are trivially
> > > movable between hypervisors modulo any abstracted paravirt type
> > > interfaces.
> > > 
> > > In my original thumbnail sketch of a solution I envisioned vhost-user
> > > daemons running in a broadly POSIX like environment. The interface to
> > > the daemon is fairly simple requiring only some mapped memory and some
> > > sort of signalling for events (on Linux this is eventfd). The idea was a
> > > stub binary would be responsible for any hypervisor specific setup and
> > > then launch a common binary to deal with the actual virtqueue requests
> > > themselves.
> > > 
> > > Since that original sketch we've seen an expansion in the sort of ways
> > > backends could be created. There is interest in encapsulating backends
> > > in RTOSes or unikernels for solutions like SCMI. The interest in Rust
> > > has prompted ideas of using the trait interface to abstract differences
> > > away, as well as the idea of bare-metal Rust backends.
> > > 
> > > We have a card (STR-12) called "Hypercall Standardisation" which
> > > calls for a description of the APIs needed from the hypervisor side to
> > > support VirtIO guests and their backends. However we are some way off
> > > from that at the moment as I think we need to at least demonstrate one
> > > portable backend before we start codifying requirements. To that end I
> > > want to think about what we need for a backend to function.
> > > 
> > > Configuration
> > > =============
> > > 
> > > In the type-2 setup this is typically fairly simple because the host
> > > system can orchestrate the various modules that make up the complete
> > > system. In the type-1 case (or even type-2 with delegated service VMs)
> > > we need some sort of mechanism to inform the backend VM about key
> > > details about the system:
> > > 
> > >   - where virt queue memory is in its address space
> > >   - how it's going to receive (interrupt) and trigger (kick) events
> > >   - what (if any) resources the backend needs to connect to
> > > 
> > > Obviously you can elide configuration issues by having static
> > > configurations and baking the assumptions into your guest images; however,
> > > this isn't scalable in the long term. The obvious solution seems to be
> > > extending a subset of Device Tree data to user space, but perhaps there
> > > are other approaches?
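
Whatever the transport ends up being (Device Tree, command line, ...), the
per-device information the BE needs boils down to something like the struct
below; this is purely illustrative, just to make the list above concrete.

/* Minimum per-device configuration the BE needs (illustrative only). */
#include <stdint.h>

struct be_device_config {
    uint64_t virtq_mem_addr;  /* where the virt queue memory sits in the BE's address space */
    uint64_t virtq_mem_size;
    uint32_t notify_irq;      /* how the BE receives (interrupt) events */
    uint32_t kick_id;         /* how the BE triggers (kick) events, e.g. a doorbell/hypercall id */
    char     resource[64];    /* backing resource, e.g. a block image or a network interface */
};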
> > > 
> > > Before any virtio transactions can take place the appropriate memory
> > > mappings need to be made between the FE guest and the BE guest.
> > > Currently the whole of the FE guest's address space needs to be visible
> > > to whatever is serving the virtio requests. I can envision 3 approaches:
> > > 
> > >  * BE guest boots with memory already mapped
> > > 
> > >  This would entail the guest OS knowing which parts of its Guest Physical
> > >  Address space are already taken up and avoiding clashes. I would assume
> > >  in this case you would want a standard interface to userspace to then
> > >  make that address space visible to the backend daemon.
> > > 
> > >  * BE guest boots with a hypervisor handle to memory
> > > 
> > >  The BE guest is then free to map the FE's memory to where it wants in
> > >  the BE's guest physical address space. Activating the mapping will
> > >  require some sort of hypercall to the hypervisor. I can see two options
> > >  at this point:
> > > 
> > >   - expose the handle to userspace for daemon/helper to trigger the
> > >     mapping via existing hypercall interfaces. If using a helper you
> > >     would have a hypervisor-specific one to avoid the daemon having to
> > >     care too much about the details, or push that complexity into a
> > >     compile-time option for the daemon, which would result in different
> > >     binaries albeit from a common source base.
> > > 
> > >   - expose a new kernel ABI to abstract the hypercall differences away
> > >     in the guest kernel. In this case the userspace would essentially
> > >     ask for an abstract "map guest N memory to userspace ptr" and let
> > >     the kernel deal with the different hypercall interfaces. This of
> > >     course assumes the majority of BE guests would be Linux kernels and
> > >     leaves the bare-metal/unikernel approaches to their own devices.
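
As a strawman for the second option, the abstract ABI could be as small as a
single ioctl on a character device. Everything below is hypothetical (the
names and numbers are made up), just to make the idea concrete:

/* Hypothetical ioctl for an abstract "map guest N memory to userspace" ABI. */
#include <stdint.h>
#include <sys/ioctl.h>

struct guest_mem_map {       /* made-up name */
    uint32_t guest_id;       /* which FE guest to map */
    uint64_t guest_gpa;      /* start of the FE region (guest physical) */
    uint64_t size;           /* length of the region */
    uint64_t user_addr;      /* filled in by the kernel: userspace pointer */
};

#define GUEST_MEM_MAP _IOWR('V', 0xC0, struct guest_mem_map)  /* made-up number */

/* Userspace would then do something like:
 *
 *     struct guest_mem_map m = { .guest_id = N, .guest_gpa = gpa, .size = sz };
 *     ioctl(fd, GUEST_MEM_MAP, &m);
 *
 * and the kernel picks the right hypercall (Xen, KVM, ...) behind it. */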
> > > 
> > > Operation
> > > =========
> > > 
> > > The core of the operation of VirtIO is fairly simple. Once the
> > > vhost-user feature negotiation is done it's a case of receiving update
> > > events and parsing the resultant virt queue for data. The vhost-user
> > > specification handles a bunch of setup before that point, mostly to
> > > detail where the virt queues are and to set up FDs for memory and event
> > > communication. This is where the envisioned stub process would be
> > > responsible for getting the daemon up and ready to run. This is
> > > currently done inside a big VMM like QEMU but I suspect a modern
> > > approach would be to use the rust-vmm vhost crate. It would then either
> > > communicate with the kernel's abstracted ABI or be re-targeted as a
> > > build option for the various hypervisors.
> > > 
> > > One question is how to best handle notification and kicks. The existing
> > > vhost-user framework uses eventfd to signal the daemon (although QEMU
> > > is quite capable of simulating them when you use TCG). Xen has its own
> > > IOREQ mechanism. However, latency is an important factor, and having
> > > events go through the stub would add quite a lot of it.
> > > 
> > > Could we consider the kernel internally converting IOREQ messages from
> > > the Xen hypervisor to eventfd events? Would this scale with other kernel
> > > hypercall interfaces?
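
For what it's worth, the kernel side of such a conversion would be conceptually
tiny; something along these lines (pseudo-kernel code, glossing over how the
IOREQ ring is actually consumed, so treat it as a sketch of the idea only):

/* Sketch: in-kernel bridge turning Xen IOREQ notifications into eventfd
 * signals for the userspace daemon (illustrative, not real kernel code). */
#include <linux/eventfd.h>

struct ioreq_bridge {
    struct eventfd_ctx *kick;  /* eventfd registered by the daemon */
};

/* Called from the (hypothetical) IOREQ upcall/interrupt path. */
static void ioreq_bridge_notify(struct ioreq_bridge *b)
{
    /* Wake the vhost-user style daemon exactly as a KVM ioeventfd would. */
    eventfd_signal(b->kick, 1);
}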
> > > 
> > > So any thoughts on what directions are worth experimenting with?
> > > 
> > > -- 
> > > Alex Bennée
> > > 
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> > > For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
> > > 

