virtio-comment message

Subject: Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration

From: "Michael S. Tsirkin" <mst@redhat.com>
To: Parav Pandit <parav@nvidia.com>
Date: Wed, 18 Oct 2023 05:56:12 -0400

On Wed, Oct 18, 2023 at 09:48:55AM +0000, Parav Pandit wrote:
> > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > Sent: Wednesday, October 18, 2023 2:13 PM
> > 
> > On 10/18/2023 3:20 PM, Parav Pandit wrote:
> > >> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > >> Sent: Wednesday, October 18, 2023 12:22 PM
> > >>
> > >> On 10/18/2023 2:41 PM, Parav Pandit wrote:
> > >>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > >>>> Sent: Wednesday, October 18, 2023 12:06 PM
> > >>>>
> > >>>> On 10/18/2023 1:02 PM, Parav Pandit wrote:
> > >>>>>> From: virtio-comment@lists.oasis-open.org
> > >>>>>> <virtio-comment@lists.oasis-open.org> On Behalf Of Zhu, Lingshan
> > >>>>>> Sent: Monday, October 16, 2023 3:18 PM
> > >>>>>>
> > >>>>>> On 10/13/2023 7:54 PM, Parav Pandit wrote:
> > >>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > >>>>>>>> Sent: Friday, October 13, 2023 3:14 PM
> > >>>>>>>>>>>> How do you transfer the ownership?
> > >>>>>>>>>>> An additional ownership delegation by a new admin command.
> > >>>>>>>>>> if you think this can work, do you want to cook a patch to
> > >>>>>>>>>> implement this before you submitting this live migration series?
> > >>>>>>>>> I answered this already above.
> > >>>>>>>> talk is cheap, show me your patch
> > >>>>>>> Huh. We presented the infrastructure that migrates, 30+ device
> > >>>>>>> types,
> > >>>>>> covering device context ideas from Oracle.
> > >>>>>>> Covering P2P, supporting device_reset, FLR, dirty page tracking.
> > >>>>>>>
> > >>>>>>> Please have some respect for other members who covered more
> > >>>>>>> ground than
> > >>>>>> your series.
> > >>>>>>> What more? Apply the same nested concept on the member device as
> > >>>>>> Michael suggested, it is nested virtualization maintain exact
> > >>>>>> same
> > >> semantics.
> > >>>>>>> So a VF is mapped as PF to the L1 guest.
> > >>>>>>> L1 guest can enable SR-IOV on it, and map one VF to L2 guest.
> > >>>>>>>
> > >>>>>>> This nested work can be extended in future, once first level
> > >>>>>>> nesting is
> > >>>>>> covered.
> > >>>>>>>> Answer all questions above, if you think a management VF can
> > >>>>>>>> work, please show me your patch.
> > >>>>>>> The idea evolves from technical debate then pointing fingers
> > >>>>>>> like your
> > >>>>>> comment.
> > >>>>>>> I think a positive discussion with Michael and a pointer to the
> > >>>>>>> paper from
> > >>>>>> Jason gave a good direction of doing _right_ nesting that follows
> > >>>>>> two
> > >>>> principles.
> > >>>>>>> a. efficiency property
> > >>>>>>> b. equivalence property
> > >>>>>>>
> > >>>>>>> (c. resource control is natural already)
> > >>>>>>>
> > >>>>>>> Both apply at VMM and at VM level enabling recursive
> > >>>>>>> virtualization, by
> > >>>>>> having VF that can act as PF inside the guest.
> > >>>>>>> [1] https://dl.acm.org/doi/pdf/10.1145/361011.361073
> > >>>>>> Please just show me your patch resolving these opens, how about
> > >>>>>> start from defining virtio-fs device context and your management VF?
> > >>>>> As answered, device context infrastructure is done, per device
> > >>>>> specific device-
> > >>>> context will be defined incrementally.
> > >>>>> I will not be including virtio-fs in this series. It will be done
> > >>>>> incrementally in
> > >>>> future utilizing the infrastructure built in this series.
> > >>>> Done? How do you conclude this? You just tell me what is the full
> > >>>> set of virtio-fs device context now and how to migrate them.
> > >>>>
> > >>>> You can't? You refuse or you don't? Do you expect the HW designer to
> > >>>> figure it out by themselves?
> > >>> I won't be able to tell now as I don't think it is necessary for this series.
> > >>> If one out of 30 devices cannot migrate because of unimaginable
> > >>> amount of
> > >> complexity has been placed there, may be one will not implement it as
> > >> member device.
> > >>>   From experience of migratable complex gpu devices, rdma devices
> > >>> (stateful
> > >> having hundred thousand of stateful QPs), my understanding is complex
> > >> state of virtio-fs can be defined and migratable.
> > >>> Mlx5 driver consist of 150,000 lines of code and that device is
> > >>> migratable
> > >> with complex state.
> > >>> So I am optimistic that virtio-fs can be migratable too.
> > >>> It does not have to be limited by my limited creativity of 2023.
> > >>> May be I am wrong, in that case one will not implement passthrough
> > >>> virtio-fs
> > >> device.
> > >> your series wants to migrate device context, but doesn't define
> > >> device context, does this sound reasonable?
> > > Device generic context is defined at [1] and also the infrastructure for defining
> > the device context in parallel by multiple people can be done post the work of
> > [1].
> > >
> > > Per each device type context will be defined incrementally post this work.
> > >
> > > [1]
> > > https://lists.oasis-open.org/archives/virtio-comment/202310/msg00190.html
> > This is not post of the work, you should define them before you use them in this
> > series.
> > 
> I don't agree to boil the ocean in this patch series.
> No practical spec development community does it.
> As long as we feel comfortable that the device context framework is extensible, it is fine.
> If virtio-fs seems very hard, maybe someone will come up with a new lightweight FS device. I really don't know.
> 
> > And you need to prove why admin vq are better than registers solution if you
> > want a merge.
> Michael already responded on the practical aspects.
> Since you may claim I didn't answer, the technical details are below.
> 
> In my view, admin commands and the AQ are better for the following reasons:
> 
> Functionally better:
> 1. When the live migration registers are located on the VF itself, the VMM does not have control of them.
> These registers are reset on FLR and device reset because they are virtio registers of the device.
> Hence, the VMM loses the state for the job that the VMM was supposed to do.
> Therefore, passthrough mode cannot depend on these registers.
> 
> 2. Any bulk data transfer of device context and dirty page tracking requires DMA.
> Hence that DMA must be done through a device other than the VF itself.
> If it is on the VF itself, there are two problems.
> 2.a. VF device reset and FLR will clear them, and the device context is lost.
> 
> 2.b. The DMA occurs at the PCI RID level.
> The IOMMU cannot bifurcate the DMA of one RID into two different address spaces, those of the guest and the hypervisor.
> This requires PASID support.
> Using PASID has the following problems.
> 2.b.1 PASID is typically not used by kernel software. It is only meant for user processes.
> Hence reserving a PASID for kernel work won't be acceptable to the upstream kernel.
> 2.b.2 If this is somehow done, then when the VF itself supports PASID, vPASID support is now required.
> This is again not where the industry is going in the other forums that I am part of. Hence, it would be a failure for virtio, and I do not recommend the vPASID route.
> 2.b.3 One widely used CPU seems to have dropped the support due to a limitation of an instruction around PASID.
> So it cannot be used there, which further limits virtio passthrough users.
> 
> Even if 2.b.2 and 2.b.3 are somehow overcome in theory, #1 and 2.a are functional problems.
> 
> Scale wise better:
> 3. Admin commands and the admin vq are used _only_ when one issues a device migration command.
> One does not migrate VMs every few msec.
> Hence such functionality is better implemented in a way that is efficient for performance but does not consume on-chip memory.
> Admin commands and the admin vq satisfy that.
> 
> 4. Once the software matures further, admin commands would prefer a completion interrupt instead of polling.
> How to get a notification/interrupt? Well, the virtqueue defines this already.
> Should we replicate that in some PF registers?
> We could. But once you put all the functionality of admin commands and the AQ in registers, the whole thing becomes yet another register_q.
> 
> 5. Can these registers be placed in the PF to overcome #1 and #2 for passthrough?
> In theory, yes.
> In practice, no, as there are many commands that flow, which need to scale to a reasonable number of VFs.
> Admin commands over the admin vq provide this generic facility.
> 
> 6. Most modern devices that attempt to scale cut down their register footprint; registers are used only for the main bootstrap and init-time config work.
> Even in virtio spec, one can read:
> "Device configuration space is generally used for rarely changing or initialization-time parameters."
> 
> Adding some additional registers to a PF device config space for non init time parameters does not make sense.
> 
> 7. Additionally, nested virtualization should be done by truly nesting the device at the right abstraction point of the owner-member relationship.
> This follows the two principles of (a) efficiency and (b) equivalence that the paper Jason pointed to describes.
> And if we ask for a nested VF extension, we will get our guidance from PCI-SIG on whether it should be done and whether it matches the rest of the ecosystem components that do or do not support nesting.


For completeness, and to shorten the thread, can you please list known
issues/use cases that are addressed by the status bit interface and how
you plan for them to be addressed?
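
To make points 1 and 2.a of the quoted argument concrete, here is a minimal sketch of the difference between keeping migration state in the VF's own registers and keeping it on the owner device. It is a toy model only: the class names, the `dirty_tracking` field and the `admin_cmd_*` methods are hypothetical and are not taken from the virtio spec or from either patch series.

```python
from dataclasses import dataclass, field


@dataclass
class MemberVF:
    """Toy VF: everything stored on the VF is cleared by FLR/device reset."""
    registers: dict = field(default_factory=dict)

    def function_level_reset(self) -> None:
        # Assumption: FLR returns every register on the VF to its default.
        self.registers.clear()


@dataclass
class OwnerPF:
    """Toy owner device that keeps per-member migration state."""
    member_state: dict = field(default_factory=dict)

    def admin_cmd_set_dirty_tracking(self, vf_id: int, enabled: bool) -> None:
        # Hypothetical admin command: state is held by the owner, keyed by member id.
        self.member_state.setdefault(vf_id, {})["dirty_tracking"] = enabled

    def admin_cmd_get_dirty_tracking(self, vf_id: int) -> bool:
        return self.member_state.get(vf_id, {}).get("dirty_tracking", False)


def run_demo():
    vf = MemberVF()
    pf = OwnerPF()

    # Option A: the migration control lives in a register on the VF itself.
    vf.registers["dirty_tracking"] = True
    # Option B: the migration control is set through the owner device.
    pf.admin_cmd_set_dirty_tracking(vf_id=1, enabled=True)

    # The guest triggers an FLR on the VF while migration is in progress.
    vf.function_level_reset()

    on_vf = vf.registers.get("dirty_tracking", False)
    on_owner = pf.admin_cmd_get_dirty_tracking(vf_id=1)
    return on_vf, on_owner


if __name__ == "__main__":
    print(run_demo())  # (False, True)
```

In this model the setting stored on the VF does not survive the reset, while the one held by the owner does. It says nothing about how a register-based design might preserve state across FLR, which is part of what the question above asks to have listed.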
