Subject: Re: [virtio-dev] Re: [PATCH v1 6/6] vhost-user: add VFIO based accelerators support


On Wed, Feb 7, 2018 at 8:43 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Sun, Feb 04, 2018 at 01:49:46PM -0800, Alexander Duyck wrote:
>> On Thu, Jan 25, 2018 at 9:57 PM, Tiwei Bie <tiwei.bie@intel.com> wrote:
>> > On Fri, Jan 26, 2018 at 11:41:27AM +0800, Jason Wang wrote:
>> >> On Jan 26, 2018 at 07:59, Michael S. Tsirkin wrote:
>> >> > > The virtual IOMMU isn't supported by the accelerators for now,
>> >> > > because vhost-user currently lacks an efficient way to share the
>> >> > > guest's IOMMU table with the vhost backend. That's why the
>> >> > > software implementation of virtual IOMMU support in the
>> >> > > vhost-user backend can't support dynamic mapping well.
>> >> > What exactly is meant by that? vIOMMU seems to work for people.
>> >> > It's not that fast if you change mappings all the time, but
>> >> > e.g. dpdk within the guest doesn't do that.
>> >>
>> >> Yes, the software implementation supports dynamic mapping for sure. I
>> >> think the point is that the current vhost-user backend cannot program
>> >> the hardware IOMMU, so it cannot make a hardware accelerator cooperate
>> >> with the software vIOMMU.
>> >
>> > The vhost-user backend can program the hardware IOMMU. Currently
>> > the vhost-user backend (or more precisely, the vDPA driver in the
>> > vhost-user backend) will use the memory table (delivered by the
>> > VHOST_USER_SET_MEM_TABLE message) to program the IOMMU via vfio,
>> > and that's why accelerators can use the GPA (guest physical
>> > address) in descriptors directly.
>> >
>> > Theoretically, we could use the IOVA mapping info (delivered by
>> > the VHOST_USER_IOTLB_MSG message) to program the IOMMU, and
>> > accelerators would then be able to use IOVAs. But the problem is
>> > that in vhost-user QEMU won't push all the IOVA mappings to the
>> > backend directly; the backend needs to ask for that info when it
>> > encounters a new IOVA. Such a design won't work well for dynamic
>> > mappings anyway and couldn't be supported by hardware
>> > accelerators.
>> >
>> >> I think
>> >> that's another argument for implementing the offload path inside QEMU,
>> >> which has complete support for VFIO cooperating with the vIOMMU.
>> >
>> > Yes, that's exactly what we want. After revisiting the
>> > last paragraph in the commit message, I found it's not
>> > really accurate. The practicability of dynamic mapping
>> > support is a common issue for QEMU; it also exists for
>> > vfio (hw/vfio in QEMU). If QEMU needs to trap all the
>> > map/unmap events, the data path performance can't be
>> > high. If we want to thoroughly fix this issue, especially
>> > for vfio (hw/vfio in QEMU), we need to have the offload
>> > path Jason mentioned in QEMU. And I think accelerators
>> > could use it too.
>> >
>> > Best regards,
>> > Tiwei Bie
>>
>> I wonder if we couldn't look at coming up with an altered security
>> model for the IOMMU drivers to address some of the performance issues
>> seen with a typical hardware IOMMU.
>>
>> In the case of most network devices, we seem to be moving toward a
>> model where the Rx pages are mapped for an extended period of time
>> and see a fairly high rate of reuse. As a result, pages mapped as
>> writable or read/write by the device are left mapped for an extended
>> period of time, while Tx pages, which are read-only, are frequently
>> mapped and unmapped since they come from some other location in the
>> kernel beyond the driver's control.
>>
>> If we were to somehow come up with a model where the read-only (Tx)
>> pages had access to a pre-allocated memory-mapped address, while the
>> read/write (descriptor ring) and write-only (Rx) pages were provided
>> with dynamic addresses, we might be able to come up with a solution
>> that would allow for fairly high network performance while at least
>> protecting against memory corruption. The only issue it would open up
>> is that the device would have the ability to read any/all memory on
>> the guest. I was wondering about doing something like this with the
>> vIOMMU and VFIO for the Intel NICs, since an interface like igb,
>> ixgbe, ixgbevf, i40e, or i40evf would probably show pretty good
>> performance under such a model as long as the writable pages were
>> being tracked by the vIOMMU. It could even allow for live migration
>> support if the vIOMMU provided the info needed for migratable/dirty
>> page tracking and we held off on migrating any of the dynamically
>> mapped pages until after they were either unmapped or an FLR reset
>> the device.
>>
>> Thanks.
>>
>> - Alex
>
>
>
> It might be a good idea to change the IOMMU instead - how about a
> variant of strict mode in the Intel IOMMU driver which forces an IOTLB
> flush after invalidating a writable mapping but not a read-only one?
> Not sure what the name would be - relaxed-ro?
>
> This is probably easier than poking at the drivers and net core.
>
> Keeping the RX pages mapped in the IOMMU was envisioned for XDP.
> That might be a good place to start.
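
For reference, the memory-table programming described above boils down
to one VFIO_IOMMU_MAP_DMA call per region, with the GPA reused as the
IOVA so the device can keep using guest physical addresses. A minimal
sketch, assuming a VFIO container fd that already has the device's
group attached and the type1 IOMMU selected; "mem_region" is just an
illustrative stand-in for a vhost-user memory-table entry:

#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Illustrative stand-in for one VHOST_USER_SET_MEM_TABLE region. */
struct mem_region {
        uint64_t guest_phys_addr;  /* GPA */
        uint64_t userspace_addr;   /* HVA in the backend process */
        uint64_t memory_size;
};

/* Map one guest memory region so the device can DMA using GPAs. */
static int map_region(int container_fd, const struct mem_region *r)
{
        struct vfio_iommu_type1_dma_map map;

        memset(&map, 0, sizeof(map));
        map.argsz = sizeof(map);
        map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
        map.vaddr = r->userspace_addr;   /* backing host virtual address */
        map.iova  = r->guest_phys_addr;  /* IOVA == GPA */
        map.size  = r->memory_size;

        return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
}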

My plan is to update the Intel IOMMU driver first, since that seems like
something that shouldn't require too much expertise in the operation of
the IOMMU to accomplish. My idea was more along the lines of something
like an "iommu=read-only-pt" or maybe "iommu=pt-ro" option where the Tx
data would be identity mapped, while the descriptor rings and Rx data
would go through the dynamic mapping setup. The idea is loosely based on
the existing "iommu=pt" option that is normally used on the host if you
want to avoid the cost of dynamic mapping. Basically we just need to
keep an eye on the number of mappings that the device can write to.
Ideally, if we leave the Tx buffers identity mapped, we never have to
write an update to any mapping, which means not having to jump into the
hypervisor to deal with the update. The fact that most of the drivers
already leave the Rx buffers and descriptor rings statically mapped
should essentially take care of the rest for us. What this would become
is a version of "iommu=pt" for users who care about preventing the
device from possibly corrupting memory but would still like better
performance, at the cost of the device being able to read any/all
memory on the system.
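
Roughly, the policy would look something like the following sketch.
The names pt_ro_map_page and dynamic_iommu_map are purely illustrative
(not existing intel-iommu code); the real change would live in the
Intel IOMMU driver's mapping path:

#include <linux/dma-mapping.h>
#include <linux/io.h>

/*
 * Hypothetical "iommu=pt-ro" policy: buffers the device may only read
 * (Tx, DMA_TO_DEVICE) stay identity mapped, so mapping them never
 * requires an IOMMU page-table or IOTLB update. Buffers the device may
 * write (Rx buffers, descriptor rings) still get dynamic IOVAs, so
 * device writes are confined to pages we explicitly expose.
 */
static dma_addr_t pt_ro_map_page(struct device *dev, struct page *page,
                                 unsigned long offset, size_t size,
                                 enum dma_data_direction dir)
{
        phys_addr_t paddr = page_to_phys(page) + offset;

        if (dir == DMA_TO_DEVICE)
                return (dma_addr_t)paddr;  /* identity mapped, read-only */

        /* dynamic_iommu_map() stands in for the existing dynamic path. */
        return dynamic_iommu_map(dev, paddr, size, dir);
}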

As for whether it is strict or not, I don't know how much we would need
to worry about that for the migration case. Essentially, a deferred
IOTLB flush would result in us having extra pages marked as dirty and
non-migratable, but we would need to see how much overhead there is in
the migration to deal with those extra pages versus the cost of having
to do an IOTLB flush on every unmap call.

Anyway, this is an idea that just occurred to me the other day, so I
still need to do some more research into how easy or difficult
implementing a solution like this would be.

Thanks.

- Alex

