
Subject: Re: [virtio-comment] Live Migration of Virtio Virtual Function


On Fri, Aug 20, 2021 at 10:17:05AM +0800, Jason Wang wrote:
> 
> On 2021/8/19 10:58 PM, Michael S. Tsirkin wrote:
> > On Thu, Aug 19, 2021 at 10:44:46AM +0800, Jason Wang wrote:
> > > > The PF device will have an option to quiesce/freeze the VF device.
> > > 
> > > Is such design a must? If no, why not simply introduce those functions in
> > > the VF?
> > Many IOMMUs only support protections at the function level.
> > Thus we need ability to have one device (e.g. a PF)
> > to control migration of another (e.g. a VF).
> 
> 
> So as discussed previously, the only possible "advantage" is that the DMA is
> isolated.
> 
> 
> > This is because allowing VF to access hypervisor memory used for
> > migration is not a good idea.
> > For IOMMUs that support subfunctions, these "devices" could be
> > subfunctions.
> > 
> > The only alternative is to keep things in device memory which
> > does not need an IOMMU.
> > I guess we'd end up with something like a VQ in device memory which might
> > be tricky from multiple points of view, but yes, this could be
> > useful and people did ask for such a capability in the past.
> 
> 
> I assume the spec already supports this. We probably need some clarification
> at the transport layer. But it's as simple as setting an MMIO area as the
> virtqueue address?

Several issues:
- We do not support changing the VQ address; devices would need to support
  changing memory addresses.
- Ordering becomes tricky. E.g. when the device reads a descriptor in VQ
  memory, that read no longer flushes out writes into a buffer that is
  potentially in RAM. We might also need even stronger barriers on the
  driver side: we used dma_wmb() but it would probably need to be wmb().
- Reading multibyte structures from device memory is slow. To get reasonable
  performance we might need to mark this device memory WB or WC, which
  generally makes things even trickier.
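
To make the barrier point concrete, here is a rough driver-side sketch
(my illustration only, not from the spec or from any existing driver;
the structure and function names are invented) of posting a descriptor
to a ring that lives behind MMIO:

#include <linux/io.h>
#include <linux/types.h>

/* Hypothetical descriptor layout for a ring living in device memory. */
struct dev_mem_desc {
	__le64 addr;
	__le32 len;
	__le32 flags;
};

static void post_desc(struct dev_mem_desc __iomem *ring, u16 idx,
		      dma_addr_t buf, u32 len)
{
	/*
	 * The buffer payload was just written into ordinary RAM.
	 * With the ring in RAM, dma_wmb() is enough to order those
	 * payload writes before the descriptor update.  With the ring
	 * behind MMIO, the descriptor update is an MMIO write, so a
	 * full wmb() is likely needed instead.
	 */
	wmb();

	writeq(buf, &ring[idx].addr);
	writel(len, &ring[idx].len);
}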


> Except for dirty bit tracking, we don't have bulk data that needs to be
> transferred during migration. So a virtqueue is not a must even in this case.

Main traffic is write tracking.
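
To put a number on it, here is a purely hypothetical dirty-log layout
(nothing like this is in the spec; invented here for illustration) that
shows why tracking dominates the traffic:

#include <stdint.h>

/*
 * Hypothetical dirty log: one bit per 4K guest page.  64GB of guest
 * RAM -> 16M pages -> a 2MB bitmap per pre-copy pass, which is why
 * write tracking dominates the migration traffic.
 */
#define DIRTY_PAGE_SHIFT 12

struct dirty_log {
	uint64_t nr_pages;
	unsigned long bitmap[];		/* nr_pages bits */
};

/* Whoever does the tracking sets one bit per page written. */
static inline void log_write(struct dirty_log *log, uint64_t gpa)
{
	uint64_t pfn = gpa >> DIRTY_PAGE_SHIFT;

	log->bitmap[pfn / (8 * sizeof(unsigned long))] |=
		1UL << (pfn % (8 * sizeof(unsigned long)));
}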


> 
> > 
> > > If yes, what's the reason for making virtio different (e.g. VCPU live
> > > migration is not designed like that)?
> > I think the main difference is that we need the PF's help for memory
> > tracking for pre-copy migration anyway.
> 
> 
> Such memory tracking is not a must. KVM uses software-assisted
> techniques (write protection) and it works very well.

So page-fault support is absolutely a viable option IMHO.
To work well we would need VIRTIO_F_PARTIAL_ORDER - there was not
a lot of excitement, but sure, I will finalize and repost it.


However, we need support for reporting and handling faults.
Again this is data-path stuff and needs to be under
hypervisor control, so I guess we get right back
to having this in the PF?





> For virtio,
> technologies like the shadow virtqueue have been used by DPDK and prototyped by
> Eugenio.

That's OK, but since it affects performance 100% of the
time while active, I think we cannot rely on this as the only solution.
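
For context, this is roughly how I understand the shadow virtqueue
approach (only my sketch of the idea behind the DPDK/Eugenio work, not
their code; all helpers below are placeholders):

#include <stdint.h>
#include <linux/virtio_ring.h>

struct shadow_vq {
	struct vring guest;	/* ring the guest driver owns */
	struct vring shadow;	/* ring actually handed to the device */
};

/* Placeholder helpers; the real ones live in the vhost/vDPA backend. */
int guest_has_new_avail(struct vring *vr);
struct vring_desc pop_guest_desc(struct vring *vr);
void push_shadow_desc(struct vring *vr, struct vring_desc *d);
uint64_t translate_gpa(uint64_t gpa);
void notify_device(struct shadow_vq *svq);

/*
 * Every request is copied and translated by the hypervisor, which is
 * what lets it log the pages the device writes -- and also why the
 * extra cost is paid 100% of the time while tracking is active.
 */
static void relay_avail(struct shadow_vq *svq)
{
	while (guest_has_new_avail(&svq->guest)) {
		struct vring_desc d = pop_guest_desc(&svq->guest);

		d.addr = translate_gpa(d.addr);
		push_shadow_desc(&svq->shadow, &d);
	}
	notify_device(svq);
}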


> Even if we want to go with hardware technology, we have many alternatives
> (as we've discussed in the past):
> 
> 1) IOMMU dirty bit (e.g. modern IOMMUs have an EA bit for logging external
> device writes)
> 2) Write protection via IOMMU or device MMU
> 3) Address space ID for isolating DMAs

Unfortunately, some systems support none of the above.

Also, some systems might have a limited number of PASIDs,
so burning an extra PASID per VF, halving their
number, might not be great as the only option.


> 
> Using the physical function is sub-optimal compared to all of the above since:
> 
> 1) it is limited to a specific transport or implementation, and it doesn't work
> for devices or transports without a PF
> 2) the virtio-level function is not self-contained; this makes any feature
> that is tied to the PF impossible to use in a nested layer
> 3) it is more complicated than leveraging the existing facilities provided by
> the platform or transport

I think I disagree with 2 and 3 above, simply because controlling VFs through
a PF is how all other devices do this. About 1 - well, this is
just about us being smart and writing this in a way that is
generic enough, right? E.g. include options for PASIDs too.
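
Just to illustrate what "generic enough" could mean, a purely
hypothetical command layout (nothing like this is in the spec today):

#include <stdint.h>

/*
 * Hypothetical owner-device command: sent through whatever owns the
 * group (a PF today), it names the target function explicitly and can
 * optionally carry a PASID, so the same commands keep working once
 * address space IDs are in the picture.
 */
struct mig_admin_cmd {
	uint16_t opcode;	/* e.g. freeze VF, start/stop write tracking */
	uint16_t flags;
#define MIG_ADMIN_F_PASID	(1 << 0)
	uint32_t member_id;	/* VF or subfunction number */
	uint32_t pasid;		/* valid only if MIG_ADMIN_F_PASID is set */
	uint32_t reserved;
	uint64_t data_addr;	/* command-specific payload, e.g. a bitmap */
	uint32_t data_len;
};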

Note that support for cross-device addressing is useful
even outside of migration.  We also have things like
priority where it is useful to adjust properties of
a VF on the fly while it is active. Again, the normal way
all devices do this is through a PF. Yes, a bunch of tricks
in QEMU is possible, but having a driver in the host kernel
that just handles it in a contained way is much cleaner.


> Considering that (P)ASID will be ready very soon, working around the platform
> limitation via the PF is not a good idea to me. Especially considering it's not
> a must and we have already prototyped the software-assisted technology.

Well PASID is just one technology.


> 
> >   Might as well integrate
> > the rest of state in the same channel.
> 
> 
> That's another question. I think for the functions that are a must for doing
> live migration, introducing them in the function itself is the most natural
> way since we did all the other facilities there. This makes it easier for the
> function to be used in a nested layer.
> 
> And using the channel in the PF does not come for free. It requires
> synchronization in the software, or even QoS.
> 
> Or we can just separate the dirty page tracking into the PF (but we need to
> define it as a basic facility for future extension).

Well maybe just start focusing on write tracking, sure.
Once there's a proposal for this we can see whether
adding other state there is easier or harder.


> 
> > 
> > Another answer is that CPUs trivially switch between
> > functions by switching the active page tables. For PCI DMA
> > it is all much trickier since the page tables can be separate
> > from the device, and assumed to be mostly static.
> 
> 
> I don't see much difference; the page table is also separate from the CPU.
> If the device supports state save and restore we can schedule multiple
> VMs/VCPUs on the same device.

It's just that the performance is terrible. If you keep losing packets,
migration might as well not be live.

> 
> > So if you want to create something like the VMCS then
> > again you either need some help from another device or
> > put it in device memory.
> 
> 
> For CPU virtualization, the states can be saved and restored via MSRs. For
> virtio, accessing them via registers is also possible and much simpler.
> 
> Thanks

My guess is performance is going to be bad. MSRs are part of the
same CPU that is executing the accesses...
