Subject: [RFC 3/3] virtio-iommu: future work
From: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
To: iommu@lists.linux-foundation.org, kvm@vger.kernel.org, virtualization@lists.linux-foundation.org, virtio-dev@lists.oasis-open.org
Date: Fri, 7 Apr 2017 20:17:47 +0100
Here I propose a few ideas for extensions and optimizations. This is all
very exploratory, feel free to correct mistakes and suggest more things.

	I.   Linux host
	     1. vhost-iommu
	     2. VFIO nested translation
	II.  Page table sharing
	     1. Sharing IOMMU page tables
	     2. Sharing MMU page tables (SVM)
	     3. Fault reporting
	     4. Host implementation with VFIO
	III. Relaxed operations
	IV.  Misc


  I. Linux host
  =============

  1. vhost-iommu
  --------------

An advantage of virtualizing an IOMMU using virtio is that it allows us to
hoist a lot of the emulation code into the kernel using vhost, and avoid
returning to userspace for each request. The mainline kernel already
implements vhost-net, vhost-scsi and vhost-vsock, and a lot of core code
could be reused.

Introducing vhost in a simplified scenario 1 (removed guest userspace
pass-through, irrelevant to this example) gives us the following:

  MEM____pIOMMU________PCI device____________                    HARDWARE
            |                                \
  ----------|-------------+-------------+-----\--------------------------
            |             :     KVM     :      \
       pIOMMU drv         :             :       \                  KERNEL
            |             :             :     net drv
          VFIO            :             :       /
            |             :             :      /
       vhost-iommu_________________________virtio-iommu-drv
                          :             :
  --------------------------------------+-------------------------------
                 HOST                   :             GUEST


Introducing vhost in scenario 2, userspace now only handles the device
initialisation part, and most runtime communication is handled in kernel:

  MEM__pIOMMU___PCI device                                     HARDWARE
         |         |
  -------|---------|------+-------------+-------------------------------
         |         |      :     KVM     :
    pIOMMU drv     |      :             :                         KERNEL
             \__net drv   :             :
                   |      :             :
                  tap     :             :
                   |      :             :
              _vhost-net________________________virtio-net drv
         (2) /            :             :           / (1a)
            /             :             :          /
   vhost-iommu________________________________virtio-iommu drv
                          :             : (1b)
  ------------------------+-------------+-------------------------------
                 HOST                   :             GUEST

(1) a. Guest virtio driver maps ring and buffers.
    b. Map requests are relayed to the host the same way.
(2) To access any guest memory, vhost-net must query the IOMMU. We can
    reuse the existing TLB protocol for this. TLB commands are written to
    and read from the vhost-net fd.

As defined in Linux/include/uapi/linux/vhost.h, the vhost msg structure
has everything needed for map/unmap operations:

	struct vhost_iotlb_msg {
		__u64	iova;
		__u64	size;
		__u64	uaddr;
		__u8	perm; /* R/W */
		__u8	type;
	#define VHOST_IOTLB_MISS
	#define VHOST_IOTLB_UPDATE	/* MAP */
	#define VHOST_IOTLB_INVALIDATE	/* UNMAP */
	#define VHOST_IOTLB_ACCESS_FAIL
	};

	struct vhost_msg {
		int type;
		union {
			struct vhost_iotlb_msg iotlb;
			__u8 padding[64];
		};
	};

The vhost-iommu device associates a virtual device ID to a TLB fd. We
should be able to use the same commands for [vhost-net <-> virtio-iommu]
and [virtio-net <-> vhost-iommu] communication. A virtio-net device
would open a socketpair and hand one side to vhost-iommu.

If vhost_msg is ever used for a purpose other than TLB, we'll have some
trouble, as there will be multiple clients that want to read/write the
vhost fd. A multicast transport method will be needed. Until then, this
can work.

Details of operations would be:

(1) Userspace sets up vhost-iommu as with other vhost devices, by using
standard vhost ioctls. Userspace starts by describing the system topology
via ioctl:

	ioctl(iommu_fd, VHOST_IOMMU_ADD_DEVICE, struct
	      vhost_iommu_add_device)

	#define VHOST_IOMMU_DEVICE_TYPE_VFIO
	#define VHOST_IOMMU_DEVICE_TYPE_TLB

	struct vhost_iommu_add_device {
		__u8 type;
		__u32 devid;
		union {
			struct vhost_iommu_device_vfio {
				int vfio_group_fd;
			};
			struct vhost_iommu_device_tlb {
				int fd;
			};
		};
	};

(2) VIRTIO_IOMMU_T_ATTACH(address space, devid)

vhost-iommu creates an address space if necessary, finds the device along
with the relevant operations. If type is VFIO, operations are done on a
container, otherwise they are done on single devices.

(3) VIRTIO_IOMMU_T_MAP(address space, virt, phys, size, flags)

Turn phys into an hva using the vhost mem table.

- If type is TLB, either preload with VHOST_IOTLB_UPDATE or store the
  mapping locally and wait for the TLB to ask for it with a
  VHOST_IOTLB_MISS.
- If type is VFIO, turn it into a VFIO_IOMMU_MAP_DMA (might need to
  introduce a shortcut in the external user API of VFIO).

(4) VIRTIO_IOMMU_T_UNMAP(address space, virt, size, flags)

- If type is TLB, send a VHOST_IOTLB_INVALIDATE.
- If type is VFIO, turn it into VFIO_IOMMU_UNMAP_DMA.

(5) VIRTIO_IOMMU_T_DETACH(address space, devid)

Undo whatever was done in (2).
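
To make steps (3) and (4) a bit more concrete, here is a rough sketch of
the dispatch vhost-iommu could perform. Everything prefixed with ex_ or
EX_ is a placeholder of mine and not an existing API: the mem table is
reduced to a flat array, and the "action" values only name the real
command each branch would eventually issue. For the TLB case I arbitrarily
chose the preload variant (VHOST_IOTLB_UPDATE) rather than waiting for a
VHOST_IOTLB_MISS.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical device types, standing in for VHOST_IOMMU_DEVICE_TYPE_*. */
enum ex_dev_type { EX_DEV_VFIO, EX_DEV_TLB };

/* Hypothetical backend actions selected by steps (3) and (4). */
enum ex_action {
	EX_ACT_ERROR,
	EX_ACT_VFIO_MAP_DMA,	/* would become VFIO_IOMMU_MAP_DMA */
	EX_ACT_VFIO_UNMAP_DMA,	/* would become VFIO_IOMMU_UNMAP_DMA */
	EX_ACT_TLB_UPDATE,	/* would become VHOST_IOTLB_UPDATE */
	EX_ACT_TLB_INVALIDATE,	/* would become VHOST_IOTLB_INVALIDATE */
};

/* One region of a simplified vhost mem table: GPA range -> hva. */
struct ex_mem_region {
	uint64_t gpa;
	uint64_t size;
	uint64_t hva;
};

struct ex_backend_op {
	enum ex_action action;
	uint64_t iova;
	uint64_t size;
	uint64_t uaddr;		/* only meaningful for map/update */
};

/* Turn a guest-physical range into an hva; returns 0 on success. */
static int ex_gpa_to_hva(const struct ex_mem_region *tbl, size_t n,
			 uint64_t gpa, uint64_t size, uint64_t *hva)
{
	assert(tbl && hva);
	for (size_t i = 0; i < n; i++) {
		if (gpa < tbl[i].gpa || size > tbl[i].size)
			continue;
		if (gpa - tbl[i].gpa <= tbl[i].size - size) {
			*hva = tbl[i].hva + (gpa - tbl[i].gpa);
			return 0;
		}
	}
	return -1;
}

/* Step (3): MAP, dispatched on the type given at ADD_DEVICE time. */
static struct ex_backend_op ex_handle_map(enum ex_dev_type type,
					  const struct ex_mem_region *tbl,
					  size_t n, uint64_t virt,
					  uint64_t phys, uint64_t size)
{
	struct ex_backend_op op = { EX_ACT_ERROR, virt, size, 0 };

	if (ex_gpa_to_hva(tbl, n, phys, size, &op.uaddr))
		return op;
	op.action = (type == EX_DEV_VFIO) ? EX_ACT_VFIO_MAP_DMA
					  : EX_ACT_TLB_UPDATE;
	return op;
}

/* Step (4): UNMAP needs no address translation. */
static struct ex_backend_op ex_handle_unmap(enum ex_dev_type type,
					    uint64_t virt, uint64_t size)
{
	struct ex_backend_op op = { EX_ACT_ERROR, virt, size, 0 };

	op.action = (type == EX_DEV_VFIO) ? EX_ACT_VFIO_UNMAP_DMA
					  : EX_ACT_TLB_INVALIDATE;
	return op;
}
```

A MAP whose phys range is not covered by the mem table yields
EX_ACT_ERROR, which a real implementation would turn into a failure
status for the guest.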


  2. VFIO nested translation
  --------------------------

For my current kvmtool implementation, I am putting each VFIO group in a
different container during initialization. We cannot detach a group from a
container at runtime without first resetting all devices in that group. So
the best way to provide dynamic address spaces right now is one container
per group. The drawback is that we need to maintain multiple sets of page
tables even if the guest wants to put all devices in the same address
space. Another disadvantage is when implementing bypass mode, we need to
map the whole address space at the beginning, then unmap everything on
attach. Adding nested support would be a nice way to provide dynamic
address spaces while keeping groups tied to a container at all times.

A physical IOMMU may offer nested translation. In this case, address
spaces are managed by two page directories instead of one. A guest-
virtual address is translated into a guest-physical one using what we'll
call here "stage-1" (s1) page tables, and the guest-physical address is
translated into a host-physical one using "stage-2" (s2) page tables.

                             s1      s2
                         GVA --> GPA --> HPA

There isn't a lot of support in Linux for nesting IOMMU page directories
at the moment (though SVM support is coming, see II). VFIO does have a
"nesting" IOMMU type, which doesn't mean much at the moment. The ARM SMMU
code uses this to decide whether to manage the container with s2 page
tables instead of s1, but even then we still only have a single stage and
it is assumed that IOVA=GPA.

Another model that would help with dynamically changing address spaces is
nesting VFIO containers:

                           Parent  <---------- map/unmap
                          container
                         /   |     \
                        /   group   \
                     Child         Child  <--- map/unmap
                   container     container
                    |   |             |
                 group group        group

At the beginning all groups are attached to the parent container, and
there is no child container. Doing map/unmap on the parent container maps
stage-2 page tables (map GPA -> HVA and pin the page -> HPA). User should
be able to choose whether they want all devices attached to this container
to be able to access GPAs (bypass mode, as it currently is) or simply
block all DMA (in which case there is no need to pin pages here).

At some point the guest wants to create an address space and attaches
children to it. Using an ioctl (to be defined), we can derive a child
container from the parent container, and move groups from parent to child.

This returns a child fd. When the guest maps something in this new address
space, we can do a map ioctl on the child container, which maps stage-1
page tables (map GVA -> GPA).

A page table walk may access multiple levels of tables (pgd, p4d, pud,
pmd, pt). With nested translation, each access to a table during the
stage-1 walk requires a stage-2 walk. This makes a full translation costly
so it is preferable to use a single stage of translation when possible.
Folding two stages into one is simple with a single container, as shown in
the kvmtool example. The host keeps track of GPA->HVA mappings, so it can
fold the full GVA->HVA mapping before sending the VFIO request. With
nested containers however, the IOMMU driver would have to do the folding
work itself. Keeping a copy of the stage-2 mappings created on the parent
container, it would fold them into the actual stage-2 page tables when
receiving a map request on the child container (note that software folding
is not possible when stage-1 pgd is managed by the guest, as described in
next section).
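
As an illustration of the folding step, here is a minimal sketch that
combines one stage-1 request (GVA -> GPA) with a software copy of the
stage-2 mappings (GPA -> HVA). The structures and the ex_fold() name are
hypothetical. I assume the stage-2 copy is a flat array of
non-overlapping ranges, and a request that crosses several stage-2 ranges
is split into several folded chunks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct ex_s2_map { uint64_t gpa, size, hva; };	/* stage-2: GPA -> HVA */
struct ex_folded { uint64_t gva, size, hva; };	/* folded:  GVA -> HVA */

/*
 * Fold one stage-1 request (gva -> gpa, size) against the stage-2 copy.
 * Returns the number of chunks written to out, or -1 if part of the GPA
 * range has no stage-2 mapping or out is too small.
 */
static int ex_fold(const struct ex_s2_map *s2, size_t n,
		   uint64_t gva, uint64_t gpa, uint64_t size,
		   struct ex_folded *out, size_t max)
{
	size_t count = 0;

	assert(s2 && out);
	while (size) {
		size_t i;

		/* Find the stage-2 range containing the current GPA. */
		for (i = 0; i < n; i++)
			if (gpa >= s2[i].gpa && gpa - s2[i].gpa < s2[i].size)
				break;
		if (i == n || count == max)
			return -1;

		uint64_t off = gpa - s2[i].gpa;
		uint64_t chunk = s2[i].size - off;

		if (chunk > size)
			chunk = size;

		out[count].gva = gva;
		out[count].size = chunk;
		out[count].hva = s2[i].hva + off;
		count++;

		gva += chunk;
		gpa += chunk;
		size -= chunk;
	}
	return (int)count;
}
```

Each resulting chunk is what would be written into the single hardware
stage. This only works while the host sees every stage-1 map request.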

I don't know if nested VFIO containers are a desirable feature at all. I
find the concept cute on paper, and it would make it easier for userspace
to juggle with address spaces, but it might require some invasive changes
in VFIO, and people have been able to use the current API for IOMMU
virtualization so far.


  II. Page table sharing
  ======================

  1. Sharing IOMMU page tables
  ----------------------------

VIRTIO_IOMMU_F_PT_SHARING

This is independent of the nested mode described in I.2, but relies on a
similar feature in the physical IOMMU: having two stages of page tables,
one for the host and one for the guest.

When this is supported, the guest can manage its own s1 page directory, to
avoid sending MAP/UNMAP requests. Feature VIRTIO_IOMMU_F_PT_SHARING allows
a driver to give a page directory pointer (pgd) to the host and send
invalidations when removing or changing a mapping. In this mode, three
requests are used: probe, attach and invalidate. An address space cannot
be using the MAP/UNMAP interface and PT_SHARING at the same time.

Device and driver first need to negotiate which page table format they
will be using. This depends on the physical IOMMU, so the request contains
a negotiation part to probe the device capabilities.

(1) Driver attaches devices to address spaces as usual, but a flag
    VIRTIO_IOMMU_ATTACH_F_PRIVATE (working title) tells the device not to
    create page tables for use with the MAP/UNMAP API. The driver intends
    to manage the address space itself.

(2) Driver sends a PROBE_TABLE request. It sets len > 0 with the size of
    pg_format array.

	VIRTIO_IOMMU_T_PROBE_TABLE

	struct virtio_iommu_req_probe_table {
		le32	address_space;
		le32	flags;
		le32	len;
	
		le32	nr_contexts;
		struct {
			le32	model;
			u8	format[64];
		} pg_format[len];
	};

Introducing a probe request is more flexible than advertising those
features in virtio config, because capabilities are dynamic, and depend on
which devices are attached to an address space. Within a single address
space, devices may support different numbers of contexts (PASIDs), and
some may not support recoverable faults.

(3) Device responds success with all page table formats implemented by the
    physical IOMMU in pg_format. 'model' 0 is invalid, so driver can
    initialize the array to 0 and deduce from there which entries have
    been filled by the device.

Using a probe method seems preferable over trying to attach every possible
format until one sticks. For instance, with an ARM guest running on an x86
host, PROBE_TABLE would return the Intel IOMMU page table format, and the
guest could use that page table code to handle its mappings, hidden behind
the IOMMU API. This requires that the page-table code is reasonably
abstracted from the architecture, as is done with drivers/iommu/io-pgtable
(an x86 guest could use any format implemented by io-pgtable, for example).
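
On the driver side, handling the PROBE_TABLE reply could look roughly
like the sketch below. It relies on the 'model' 0 convention from step
(3). The structure mirrors pg_format but uses host-endian fields for
brevity (a real driver would convert from little-endian), and
ex_pick_format() as well as the list of supported models are made up for
the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct ex_pg_format {
	uint32_t model;		/* 0 means "not filled by the device" */
	uint8_t format[64];
};

/*
 * Return the index of the first entry filled by the device whose model
 * the driver can handle, or -1 if there is none.
 */
static int ex_pick_format(const struct ex_pg_format *fmt, size_t len,
			  const uint32_t *supported, size_t nr_supported)
{
	assert(fmt && supported);
	for (size_t i = 0; i < len; i++) {
		if (fmt[i].model == 0)
			continue;	/* slot left untouched by the device */
		for (size_t j = 0; j < nr_supported; j++)
			if (fmt[i].model == supported[j])
				return (int)i;
	}
	return -1;
}
```

The driver would then copy the selected model and its parameters into the
ATTACH_TABLE request.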

(4) If the driver is able to use this format, it sends the ATTACH_TABLE
    request.

	VIRTIO_IOMMU_T_ATTACH_TABLE

	struct virtio_iommu_req_attach_table {
		le32	address_space;
		le32	flags;
		le64	table;
	
		le32	nr_contexts;
		/* Page-table format description */
	
		le32	model;
		u8	config[64];
	};


    'table' is a pointer to the page directory. 'nr_contexts' isn't used
    here.

    For both ATTACH and PROBE, 'flags' are the following (and will be
    explained later):

	VIRTIO_IOMMU_ATTACH_TABLE_F_INDIRECT	(1 << 0)
	VIRTIO_IOMMU_ATTACH_TABLE_F_NATIVE	(1 << 1)
	VIRTIO_IOMMU_ATTACH_TABLE_F_FAULT	(1 << 2)

Now 'model' is a bit tricky. We need to specify all possible page table
formats and their parameters. I'm not well-versed in x86, s390 or other
IOMMUs, so I'll just focus on the ARM world for this example. We basically
have two page table models, with a multitude of configuration bits:

	* ARM LPAE
	* ARM short descriptor

We could define a high-level identifier per page-table model, such as:

	#define PG_TABLE_ARM	0x1
	#define PG_TABLE_X86	0x2
	...

And each model would define its own structure. On ARM 'format' could be a
simple u32 defining a variant, LPAE 32/64 or short descriptor. It could
also contain additional capabilities. Then depending on the variant,
'config' would be:

	struct pg_config_v7s {
		le32	tcr;
		le32	prrr;
		le32	nmrr;
		le32	asid;
	};
	
	struct pg_config_lpae {
		le64	tcr;
		le64	mair;
		le32	asid;
	
		/* And maybe TTB1? */
	};

	struct pg_config_arm {
		le32	variant;
		union ...;
	};

I am really uneasy with describing all those nasty architectural details
in the virtio-iommu specification. We certainly won't start describing the
content bit-by-bit of tcr or mair here, but just declaring these fields
might be sufficient.

(5) Once the table is attached, the driver can simply write the page
    tables and expect the physical IOMMU to observe the mappings without
    any additional request. When changing or removing a mapping, however,
    the driver must send an invalidate request.

	VIRTIO_IOMMU_T_INVALIDATE

	struct virtio_iommu_req_invalidate {
		le32	address_space;
		le32	context;
		le32	flags;
		le64	virt_addr;
		le64	range_size;
	
		u8	opaque[64];
	};

    'flags' may be:

    VIRTIO_IOMMU_INVALIDATE_T_VADDR: invalidate a single VA range
      from 'context' (context is 0 when !F_INDIRECT).

    And with context tables only (explained below):

    VIRTIO_IOMMU_INVALIDATE_T_SINGLE: invalidate all mappings from
      'context' (context is 0 when !F_INDIRECT). virt_addr and range_size
      are ignored.

    VIRTIO_IOMMU_INVALIDATE_T_TABLE: with F_INDIRECT, invalidate entries
      in the table that changed. Device reads the table again, compares it
      to previous values, and invalidates all mappings for contexts that
      changed. context, virt_addr and range_size are ignored.
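
For INVALIDATE_T_TABLE, the device-side comparison could be sketched as
follows. I assume here that the device keeps a shadow copy of the context
table from the last time it read it. Both the shadow copy and the
ex_diff_context_table() helper are illustrative, not part of the
proposal.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Compare the guest's context table with the device's shadow copy.
 * Changed context IDs are stored in 'changed' (which must have room for
 * nr_contexts entries) and the shadow copy is updated. Returns the
 * number of contexts whose mappings must be invalidated.
 */
static size_t ex_diff_context_table(const uint64_t *table, uint64_t *shadow,
				    size_t nr_contexts, uint32_t *changed)
{
	size_t n = 0;

	assert(table && shadow && changed);
	for (size_t ctx = 0; ctx < nr_contexts; ctx++) {
		if (table[ctx] == shadow[ctx])
			continue;
		shadow[ctx] = table[ctx];
		changed[n++] = (uint32_t)ctx;
	}
	return n;
}
```

Each ID returned would then be handled like an INVALIDATE_T_SINGLE for
that context.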

IOMMUs may offer hints and quirks in their invalidation packets. The
opaque structure in invalidate would allow transporting those. This
depends on the page table format and as with architectural page-table
definitions, I really don't want to have those details in the spec itself.


  2. Sharing MMU page tables (SVM)
  --------------------------------

The guest can share process page-tables with the physical IOMMU. To do
that, it sends PROBE_TABLE with (F_INDIRECT | F_NATIVE | F_FAULT). The
page table format is implicit, so the pg_format array can be empty (unless
the guest wants to query some specific property, e.g. number of levels
supported by the pIOMMU?). If the host answers with success, guest can
send its MMU page table details with ATTACH_TABLE and (F_NATIVE |
F_INDIRECT | F_FAULT) flags.

F_FAULT means that the host communicates page requests from device to the
guest, and the guest can handle them by mapping virtual address in the
fault to pages. It is only available with VIRTIO_IOMMU_F_EVENT_QUEUE (see
below).

F_NATIVE means that the pIOMMU pgtable format is the same as guest MMU
pgtable format.

F_INDIRECT means that 'table' pointer is a context table, instead of a
page directory. Each slot in the context table points to a page directory:

                       64              2 1 0
          table ----> +---------------------+
                      |       pgd       |0|1|<--- context 0
                      |       ---       |0|0|<--- context 1
                      |       pgd       |0|1|
                      |       ---       |0|0|
                      |       ---       |0|0|
                      +---------------------+
                                         | \___Entry is valid
                                         |______reserved

Question: do we want per-context page table format, or can it stay global
for the whole indirect table?

A context table makes it possible to provide multiple address spaces for a
single device. In the simplest form, without F_INDIRECT we have a single
address space per device, but some devices may implement more, for
instance devices with the PCI PASID extension.

A slot's position in the context table gives an ID, between 0 and
nr_contexts. The guest can use this ID to have the device target a
specific address space with DMA. The mechanism to do that is
device-specific. For a PCI device, the ID is a PASID, and PCI doesn't
define a specific way of using them for DMA, it's the device driver's
concern.
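
A minimal sketch of the context table entry layout drawn above, with bit
0 as the valid bit, bit 1 reserved and the remaining bits holding the pgd
pointer. This assumes the pgd is aligned so that its two low bits are
free, and the ex_ names are mine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EX_CTX_VALID	(1ULL << 0)
#define EX_CTX_RESERVED	(1ULL << 1)
#define EX_CTX_PGD_MASK	(~(EX_CTX_VALID | EX_CTX_RESERVED))

/* Build a valid entry pointing to pgd; the reserved bit stays 0. */
static uint64_t ex_ctx_entry(uint64_t pgd)
{
	return (pgd & EX_CTX_PGD_MASK) | EX_CTX_VALID;
}

/*
 * Look up the page directory of context 'ctx'. Returns 0 and fills *pgd
 * on success, -1 if ctx is out of range or the slot is not valid.
 */
static int ex_ctx_lookup(const uint64_t *table, uint32_t nr_contexts,
			 uint32_t ctx, uint64_t *pgd)
{
	assert(table && pgd);
	if (ctx >= nr_contexts)
		return -1;
	if (!(table[ctx] & EX_CTX_VALID))
		return -1;
	*pgd = table[ctx] & EX_CTX_PGD_MASK;
	return 0;
}
```

With a PCI device, 'ctx' would be the PASID carried by the transaction.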


  3. Fault reporting
  ------------------

VIRTIO_IOMMU_F_EVENT_QUEUE

With this feature, an event virtqueue (1) is available. For now it will
only be used for fault handling, but I'm calling it eventq so that other
asynchronous features can piggy-back on it. Device may report faults and
page requests by sending buffers via the used ring.

	#define VIRTIO_IOMMU_T_FAULT	0x05

	struct virtio_iommu_evt_fault {
		struct virtio_iommu_evt_head {
			u8 type;
			u8 reserved[3];
		};
	
		u32 address_space;
		u32 context;
	
		u64 vaddr;
		u32 flags;	/* Access details: R/W/X */
	
		/* In the reply: */
		u32 reply;	/* Fault handled, or failure */
		u64 paddr;
	};

Driver must send the reply via the request queue, with the fault status
in 'reply', and the mapped page in 'paddr' on success.

Existing fault handling interfaces such as PRI have a tag (PRG) that
identifies a page request (or group thereof) when sending a reply. I
wonder if this would be useful to us, but it seems like the
(address_space, context, vaddr) tuple is sufficient to identify a page
fault, provided the device doesn't send duplicate faults. Duplicate faults
could be required if they have a side effect, for instance implementing a
poor man's doorbell. If this is desirable, we could add a fault_id field.
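
To show what matching on the tuple implies, here is a rough sketch of how
the device could track outstanding faults and pair them with replies. The
table layout and function names are hypothetical. Note how a second fault
with the same tuple has to be rejected, which is exactly the case a
fault_id field would solve.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct ex_fault_key {
	uint32_t address_space;
	uint32_t context;
	uint64_t vaddr;
};

struct ex_pending_fault {
	struct ex_fault_key key;
	bool in_use;
};

static bool ex_key_eq(const struct ex_fault_key *a,
		      const struct ex_fault_key *b)
{
	return a->address_space == b->address_space &&
	       a->context == b->context && a->vaddr == b->vaddr;
}

/*
 * Remember a fault before reporting it on the eventq. Returns the slot
 * used, -1 for a duplicate tuple, or -2 if the table is full.
 */
static int ex_fault_track(struct ex_pending_fault *tbl, size_t n,
			  struct ex_fault_key key)
{
	int free_slot = -2;

	assert(tbl);
	for (size_t i = 0; i < n; i++) {
		if (tbl[i].in_use && ex_key_eq(&tbl[i].key, &key))
			return -1;
		if (!tbl[i].in_use && free_slot < 0)
			free_slot = (int)i;
	}
	if (free_slot >= 0) {
		tbl[free_slot].key = key;
		tbl[free_slot].in_use = true;
	}
	return free_slot;
}

/* Match a driver reply to its fault; returns the slot, or -1 if none. */
static int ex_fault_complete(struct ex_pending_fault *tbl, size_t n,
			     struct ex_fault_key key)
{
	assert(tbl);
	for (size_t i = 0; i < n; i++) {
		if (tbl[i].in_use && ex_key_eq(&tbl[i].key, &key)) {
			tbl[i].in_use = false;
			return (int)i;
		}
	}
	return -1;
}
```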


  4. Host implementation with VFIO
  --------------------------------

The VFIO interface for sharing page tables is being worked on at the
moment by Intel. Other virtual IOMMU implementations will most likely let
guest manage full context tables (PASID tables) themselves, giving the
context table pointer to the pIOMMU via a VFIO ioctl.

For the architecture-agnostic virtio-iommu however, we shouldn't have to
implement all possible formats of context table (they are at least
different between ARM SMMU and Intel IOMMU, and will certainly be extended
in future physical IOMMU architectures.) In addition, most users might
only care about having one page directory per device, as SVM is a luxury
at the moment and few devices support it. For these reasons, we should
allow passing single page directories via VFIO, using structures very
similar to those described above, whilst reusing the VFIO channel
developed for Intel vIOMMU.

	* VFIO_SVM_INFO: probe page table formats
	* VFIO_SVM_BIND: set pgd and arch-specific configuration

There is a drawback to letting the pIOMMU driver manage the guest's
context table. During a page table walk, the pIOMMU translates the context
table pointer using the stage-2 page tables. The context table must
therefore be mapped in guest-physical space by the pIOMMU driver. One
solution is to let the pIOMMU driver reserve some GPA space upfront using
the iommu and sysfs resv API [1]. The host would then carve that region
out of the guest-physical space using a firmware mechanism (for example DT
reserved-memory node).


  III. Relaxed operations
  =======================

VIRTIO_IOMMU_F_RELAXED

Adding an IOMMU dramatically reduces performance of a device, because
map/unmap operations are costly and produce a lot of TLB traffic. For
significant performance improvements, the device might allow the driver to
sacrifice safety for speed. In this mode, the driver does not need to send
UNMAP requests. The semantics of MAP change and are more complex to
implement. Given a MAP([start:end] -> phys, flags) request:

(1) If [start:end] isn't mapped, request succeeds as usual.
(2) If [start:end] overlaps an existing mapping [old_start:old_end], we
    unmap [max(start, old_start):min(end, old_end)] and replace it with
    [start:end].
(3) If [start:end] overlaps an existing mapping that matches the new map
    request exactly (same flags, same phys address), the old mapping is
    kept.

This squashing could be performed by the guest. The driver can catch unmap
requests from the DMA layer, and only relay map requests for (1) and (2).
A MAP request is therefore able to split and partially override an
existing mapping, which isn't allowed in non-relaxed mode. UNMAP requests
are unnecessary, but are now allowed to split or carve holes in mappings.
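
The three cases can be sketched as a classification of the new request
against one existing mapping. This is only an illustration under a few
assumptions of mine: ranges are half-open ('end' is exclusive), "matches
exactly" means same range, same phys and same flags, and a real
implementation would walk an interval tree rather than compare against a
single mapping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum ex_map_case {
	EX_MAP_NEW,	/* case (1): nothing mapped in the range */
	EX_MAP_REPLACE,	/* case (2): overlap is unmapped, then remapped */
	EX_MAP_KEEP,	/* case (3): identical mapping, nothing to do */
};

struct ex_mapping {
	uint64_t start;
	uint64_t end;	/* exclusive, an assumption of this sketch */
	uint64_t phys;
	uint32_t flags;
};

/*
 * Classify a relaxed MAP request against one existing mapping (or NULL).
 * For EX_MAP_REPLACE, [*ov_start:*ov_end] is the part of the old mapping
 * to tear down: [max(start, old_start):min(end, old_end)].
 */
static enum ex_map_case ex_classify_map(const struct ex_mapping *old,
					const struct ex_mapping *req,
					uint64_t *ov_start, uint64_t *ov_end)
{
	assert(req && ov_start && ov_end);
	if (!old)
		return EX_MAP_NEW;

	uint64_t s = req->start > old->start ? req->start : old->start;
	uint64_t e = req->end < old->end ? req->end : old->end;

	if (s >= e)
		return EX_MAP_NEW;
	if (req->start == old->start && req->end == old->end &&
	    req->phys == old->phys && req->flags == old->flags)
		return EX_MAP_KEEP;

	*ov_start = s;
	*ov_end = e;
	return EX_MAP_REPLACE;
}
```

Only EX_MAP_NEW and EX_MAP_REPLACE would result in a request being
relayed to the device.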

In this model, a MAP request may take longer, but we may have a net gain
by removing a lot of redundant requests. Squashing series of map/unmap
performed by the guest for the same mapping improves temporal reuse of
IOVA mappings, which I can observe by simply dumping IOMMU activity of a
virtio device. It reduces the number of TLB invalidations to the strict
minimum while keeping correctness of DMA operations (provided the device
obeys its driver). There is a good read on the subject of optimistic
teardown in paper [2].

This model is completely unsafe. A stale DMA transaction might access a
page long after the device driver in the guest unmapped it and
decommissioned the page. The DMA transaction might hit into a completely
different part of the system that is now reusing the page. Existing
relaxed implementations attempt to mitigate the risk by setting a timeout
on the teardown. Unmap requests from device drivers are not discarded
entirely, but buffered and sent at a later time. Paper [2] reports good
results with a 10ms delay.
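
The buffering described above could be sketched with a small FIFO of
deferred unmaps. The queue size, the millisecond clock and all ex_ names
are assumptions made for the example. Because the delay is constant,
deadlines are ordered and expired entries can simply be popped from the
head.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EX_QUEUE_LEN 64

struct ex_unmap {
	uint64_t virt;
	uint64_t size;
	uint64_t deadline_ms;
};

struct ex_unmap_queue {
	struct ex_unmap slot[EX_QUEUE_LEN];
	size_t head;
	size_t count;
};

/* Buffer an unmap; returns -1 if full (caller must then unmap right away). */
static int ex_defer_unmap(struct ex_unmap_queue *q, uint64_t virt,
			  uint64_t size, uint64_t now_ms, uint64_t delay_ms)
{
	assert(q);
	if (q->count == EX_QUEUE_LEN)
		return -1;

	struct ex_unmap *u = &q->slot[(q->head + q->count) % EX_QUEUE_LEN];

	u->virt = virt;
	u->size = size;
	u->deadline_ms = now_ms + delay_ms;
	q->count++;
	return 0;
}

/* Pop up to 'max' unmaps whose deadline has passed; returns how many. */
static size_t ex_flush_expired(struct ex_unmap_queue *q, uint64_t now_ms,
			       struct ex_unmap *out, size_t max)
{
	size_t n = 0;

	assert(q && out);
	while (q->count && n < max &&
	       q->slot[q->head].deadline_ms <= now_ms) {
		out[n++] = q->slot[q->head];
		q->head = (q->head + 1) % EX_QUEUE_LEN;
		q->count--;
	}
	return n;
}
```

Each entry returned by ex_flush_expired() would be sent as a regular
UNMAP request. The sketch ignores the case where the same range is mapped
again before its deadline, which would need the pending unmap cancelled.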

We could add a way for device and driver to negotiate a vulnerability
window to mitigate the risk of DMA attacks. Driver might not accept a
window at all, since it requires more infrastructure to keep delayed
mappings. In my opinion, it should be made clear that regardless of the
duration of this window, any driver accepting F_RELAXED feature makes the
guest completely vulnerable, and the choice boils down to either isolation
or speed, not a bit of both.


  IV. Misc
  ========

I think we have enough to go on for a while. To improve MAP throughput, I
considered adding a MAP_SG request depending on a feature bit, with
variable size:

	struct virtio_iommu_req_map_sg {
		struct virtio_iommu_req_head;
		u32	address_space;
		u32	nr_elems;
		u64	virt_addr;
		u64	size;
		u64	phys_addr[nr_elems];
	};

This would create the following mappings:

	virt_addr		-> phys_addr[0]
	virt_addr + size	-> phys_addr[1]
	virt_addr + 2 * size	-> phys_addr[2]
	...
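
In other words, the device would expand a MAP_SG request as in this small
sketch (ex_expand_map_sg() and struct ex_map are illustrative names, and
fields are host-endian for brevity):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct ex_map {
	uint64_t virt;
	uint64_t phys;
	uint64_t size;
};

/*
 * Expand a MAP_SG request into nr_elems individual mappings of equal
 * size. 'out' must have room for nr_elems entries. Returns nr_elems.
 */
static size_t ex_expand_map_sg(uint64_t virt_addr, uint64_t size,
			       const uint64_t *phys_addr, uint32_t nr_elems,
			       struct ex_map *out)
{
	assert(phys_addr && out);
	for (uint32_t i = 0; i < nr_elems; i++) {
		out[i].virt = virt_addr + (uint64_t)i * size;
		out[i].phys = phys_addr[i];
		out[i].size = size;
	}
	return nr_elems;
}
```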

This would avoid the overhead of multiple map commands. We could try to
find a more cunning format to compress virtually-contiguous mappings with
different (phys, size) pairs as well. But Linux drivers rarely prefer
map_sg() functions over regular map(), so I don't know if the whole map_sg
feature is worth the effort. All we would gain is a few bytes anyway.

My current map_sg implementation in the virtio-iommu driver adds a batch
of map requests to the queue and kicks the host once. That might be enough
of an optimization.


Another invasive optimization would be adding grouped requests. By adding
two flags in the header, L and G, we can group sequences of requests
together, and have one status at the end, either 0 if all requests in the
group succeeded, or the status of the first request that failed. This is
all in-order. Requests in a group follow each other, there is no sequence
identifier.

	                       ___ L: request is last in the group
	                      /  _ G: request is part of a group
	                     |  /
	                     v v
	31                   9 8 7      0
	+--------------------------------+ <------- RO descriptor
	|        res0       |0|1|  type  |
	+--------------------------------+
	|            payload             |
	+--------------------------------+
	|        res0       |0|1|  type  |
	+--------------------------------+
	|            payload             |
	+--------------------------------+
	|        res0       |0|1|  type  |
	+--------------------------------+
	|            payload             |
	+--------------------------------+
	|        res0       |1|1|  type  |
	+--------------------------------+
	|            payload             |
	+--------------------------------+ <------- WO descriptor
	|        res0           | status |
	+--------------------------------+

This adds some complexity on the device, since it must unroll whatever was
done by successful requests in a group as soon as one fails, and reject
all subsequent ones. A group of requests is an atomic operation. As with
map_sg, this change mostly saves space and virtio descriptors.
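
A minimal sketch of the header encoding and of the group status rule,
following the bit layout in the diagram (type in bits 7:0, G in bit 8, L
in bit 9). The macro and function names are placeholders, and the
per-request status values are passed in as an array instead of actually
executing requests.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EX_HDR_TYPE_MASK	0xffu
#define EX_HDR_G		(1u << 8)	/* request is part of a group */
#define EX_HDR_L		(1u << 9)	/* request is last in the group */

/* Build the header of a grouped request. */
static uint32_t ex_group_hdr(uint8_t type, int last)
{
	return (type & EX_HDR_TYPE_MASK) | EX_HDR_G | (last ? EX_HDR_L : 0);
}

/* A group is a run of G requests where only the final one has L set. */
static int ex_group_well_formed(const uint32_t *hdr, size_t n)
{
	assert(hdr);
	if (n == 0)
		return 0;
	for (size_t i = 0; i < n; i++) {
		int last = (hdr[i] & EX_HDR_L) != 0;

		if (!(hdr[i] & EX_HDR_G) || last != (i == n - 1))
			return 0;
	}
	return 1;
}

/*
 * Group status: 0 if all requests succeeded, else the status of the
 * first failure. *nr_unroll is the number of earlier, successful
 * requests that the device has to undo.
 */
static uint8_t ex_group_status(const uint8_t *status, size_t n,
			       size_t *nr_unroll)
{
	assert(status && nr_unroll);
	*nr_unroll = 0;
	for (size_t i = 0; i < n; i++) {
		if (status[i]) {
			*nr_unroll = i;
			return status[i];
		}
	}
	return 0;
}
```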


[1] https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-kernel-iommu_groups
[2] vIOMMU: Efficient IOMMU Emulation
    N. Amit, M. Ben-Yehuda, D. Tsafrir, A. Schuster