virtio-comment message

Subject: RE: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration

From: Parav Pandit <parav@nvidia.com>
To: "Zhu, Lingshan" <lingshan.zhu@intel.com>, "Michael S. Tsirkin" <mst@redhat.com>, Jason Wang <jasowang@redhat.com>
Date: Thu, 19 Oct 2023 09:13:16 +0000

> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> Sent: Thursday, October 19, 2023 2:40 PM
> 
> On 10/19/2023 5:01 PM, Parav Pandit wrote:
> >> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >> Sent: Thursday, October 19, 2023 1:45 PM
> >>
> >> On 10/18/2023 5:48 PM, Parav Pandit wrote:
> >>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >>>> Sent: Wednesday, October 18, 2023 2:13 PM
> >>>>
> >>>> On 10/18/2023 3:20 PM, Parav Pandit wrote:
> >>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >>>>>> Sent: Wednesday, October 18, 2023 12:22 PM
> >>>>>>
> >>>>>> On 10/18/2023 2:41 PM, Parav Pandit wrote:
> >>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >>>>>>>> Sent: Wednesday, October 18, 2023 12:06 PM
> >>>>>>>>
> >>>>>>>> On 10/18/2023 1:02 PM, Parav Pandit wrote:
> >>>>>>>>>> From: virtio-comment@lists.oasis-open.org
> >>>>>>>>>> <virtio-comment@lists.oasis-open.org> On Behalf Of Zhu,
> >>>>>>>>>> Lingshan
> >>>>>>>>>> Sent: Monday, October 16, 2023 3:18 PM
> >>>>>>>>>>
> >>>>>>>>>> On 10/13/2023 7:54 PM, Parav Pandit wrote:
> >>>>>>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >>>>>>>>>>>> Sent: Friday, October 13, 2023 3:14 PM
> >>>>>>>>>>>>>>>> How do you transfer the ownership?
> >>>>>>>>>>>>>>> An additional ownership delegation by a new admin
> >> command.
> >>>>>>>>>>>>>> if you think this can work, do you want to cook a patch
> >>>>>>>>>>>>>> to implement this before you submitting this live migration
> series?
> >>>>>>>>>>>>> I answered this already above.
> >>>>>>>>>>>> talk is cheap, show me your patch
> >>>>>>>>>>> Huh. We presented the infrastructure that migrates, 30+
> >>>>>>>>>>> device types,
> >>>>>>>>>> covering device context ideas from Oracle.
> >>>>>>>>>>> Covering P2P, supporting device_reset, FLR, dirty page tracking.
> >>>>>>>>>>>
> >>>>>>>>>>> Please have some respect for other members who covered more
> >>>>>>>>>>> ground than
> >>>>>>>>>> your series.
> >>>>>>>>>>> What more? Apply the same nested concept on the member
> >>>>>>>>>>> device as
> >>>>>>>>>> Michael suggested, it is nested virtualization maintain exact
> >>>>>>>>>> same
> >>>>>> semantics.
> >>>>>>>>>>> So a VF is mapped as PF to the L1 guest.
> >>>>>>>>>>> L1 guest can enable SR-IOV on it, and map one VF to L2 guest.
> >>>>>>>>>>>
> >>>>>>>>>>> This nested work can be extended in future, once first level
> >>>>>>>>>>> nesting is
> >>>>>>>>>> covered.
> >>>>>>>>>>>> Answer all questions above, if you think a management VF
> >>>>>>>>>>>> can work, please show me your patch.
> >>>>>>>>>>> The idea evolves from technical debate then pointing fingers
> >>>>>>>>>>> like your
> >>>>>>>>>> comment.
> >>>>>>>>>>> I think a positive discussion with Michael and a pointer to
> >>>>>>>>>>> the paper from
> >>>>>>>>>> Jason gave a good direction of doing _right_ nesting that
> >>>>>>>>>> follows two
> >>>>>>>> principles.
> >>>>>>>>>>> a. efficiency property
> >>>>>>>>>>> b. equivalence property
> >>>>>>>>>>>
> >>>>>>>>>>> (c. resource control is natural already)
> >>>>>>>>>>>
> >>>>>>>>>>> Both apply at VMM and at VM level enabling recursive
> >>>>>>>>>>> virtualization, by
> >>>>>>>>>> having VF that can act as PF inside the guest.
> >>>>>>>>>>> [1] https://dl.acm.org/doi/pdf/10.1145/361011.361073
> >>>>>>>>>> Please just show me your patch resolving these opens, how
> >>>>>>>>>> about start from defining virtio-fs device context and your
> management VF?
> >>>>>>>>> As answered, device context infrastructure is done, per device
> >>>>>>>>> specific device-
> >>>>>>>> context will be defined incrementally.
> >>>>>>>>> I will not be including virtio-fs in this series. It will be
> >>>>>>>>> done incrementally in
> >>>>>>>> future utilizing the infrastructure build in this series.
> >>>>>>>> Done? How do you conclude this? You just tell me what is the
> >>>>>>>> full set of virtio-fs device context now and how to migrate them.
> >>>>>>>>
> >>>>>>>> You cant? you refuse or you don't? Do you expect the HW
> >>>>>>>> designer to figure out by themself?
> >>>>>>> I won't be able to tell now as I don't think it is necessary for this series.
> >>>>>>> If one out of 30 devices cannot migrate because of unimaginable
> >>>>>>> amount of
> >>>>>> complexity has been placed there, may be one will not implement
> >>>>>> it as member device.
> >>>>>>>     From experience of migratable complex gpu devices, rdma
> >>>>>>> devices (stateful
> >>>>>> having hundred thousand of stateful QPs), my understanding is
> >>>>>> complex state of virtio-fs can be defined and migratable.
> >>>>>>> Mlx5 driver consist of 150,000 lines of code and that device is
> >>>>>>> migratable
> >>>>>> with complex state.
> >>>>>>> So I am optimistic that virtio-fs can be migratable too.
> >>>>>>> It does not have to limited by my limited creativity of 2023.
> >>>>>>> May be I am wrong, in that case one will not implement
> >>>>>>> passthrough virtio-fs
> >>>>>> device.
> >>>>>> your series wants to migrate device context, but doesn't define
> >>>>>> device context, does this sounds reasonable?
> >>>>> Device generic context is defined at [1] and also the
> >>>>> infrastructure for defining
> >>>> the device context in parallel by multiple people can be done post
> >>>> the work of [1].
> >>>>> Per each device type context will be defined incrementally post this
> work.
> >>>>>
> >>>>> [1]
> >>>>> https://lists.oasis-open.org/archives/virtio-comment/202310/msg00190.html
> >>>> This is not post of the work, you should define them before you use
> >>>> them in this series.
> >>>>
> >>> I don't agree to cook ocean in this patch series.
> >>> No practical spec devel community does it.
> >>> As long as we feel comfortable that device context framework is
> >>> extendible, it
> >> is fine.
> >>> If virtio-fs seems very hard, may be one will come with a new light
> >>> weight FS
> >> device. I really don't know.
> >> so you want to migrate device context, but refuse to define them?
> >>>> And you need to prove why admin vq are better than registers
> >>>> solution if you want a merge.
> >>> Michael already responded the practical aspects.
> >>> Since you may claim, I didn't answer, below is the technical details.
> >>>
> >>> Why admin commands and aq is better is because of below reasons in
> >>> my
> >> view:
> >>> Functionally better:
> >>> 1. When the live migration registers are located on the VF itself,
> >>> VMM does
> >> not have control of it.
> >>> These registers reset, on FLR and device reset because these are
> >>> virtio
> >> registers of the device.
> >>> Hence, VMM lost the state for the job that VMM was supposed to do.
> >>> Therefore, passthrough mode cannot depend on these registers.
> >>>
> >>> 2. Any bulk data transfer of device context and dirty page tracking
> >>> requires
> >> DMA.
> >>> Hence those DMA must happen to the device which is different than VF
> itself.
> >>> If it is on the VF itself, it has two problems.
> >>> 2.a. VF device reset and FLR will clear them, and device context is lost.
> >>>
> >>> 2.b. the DMA occurs at the PCI RID level.
> >>> IOMMU cannot bifurcate the DMA of one RID to two different address
> >>> space
> >> of guest and hypervisor.
> >>> This requires PASID support.
> >>> Using PASID has following problems.
> >>> 2.b.1 PASID typically not used by the kernel software. It is only
> >>> meant for the
> >> user processes.
> >>> Hence for kernel work a reserving PASID won't be acceptable upstream
> >> kernel.
> >>> 2.b.2 Somehow if this is done, When the VF itself supports PASID, it
> >>> required
> >> now vPASID support.
> >>> This is again not where industry is going in other forums where I am part of.
> >> Hence, it will be failure for virtio. Hence, I do not recommend vPASID route.
> >>> 2.b.3 One of the widely used cpu seems to have dropped the support
> >>> due to
> >> limitation of an instruction around PASID.
> >>> So it cannot be used there, this further limits virtio passthrough users.
> >>>
> >>> Even if somehow 2.b.2 and 2.b.3 is overcome in theory, #1 and 2.a is
> >> functional problems.
> >>> Scale wise better:
> >>> 3. Admin command and admin vq are used _only_ when one does device
> >> migration command.
> >>> One does not migrate VMs every few msec.
> >>> Hence such functionality to be better be done which is efficient for
> >> performance, but without consuming on-chip memory.
> >>> Admin command and admin vq satisfy those.
> >>>
> >>> 4. Once the software matures further, admin command would prefer
> >> completion interrupt, instead of poll.
> >>> How to get notification/interrupt? Well, virtqueue defines this already.
> >>> Should we replicate that in some PF registers?
> >>> It can be. But once you put all the functionalities of admin command
> >>> and aq
> >> in registers the whole thing becomes yet another register_q.
> >>> 5. Can these registers be placed in the PF to overcome #1 and #2 for
> >> passthrough?
> >>> In theory yes.
> >>> In practice, no, as there are many commands that flow, which needs
> >>> to scale
> >> to reasonable number of VFs.
> >>> Admin commands over admin vq provides this generic facility.
> >>>
> >>> 6. Most modern devices who attempts to scale, cut down their
> >>> register
> >> footprint, registers are used only for main bootstrap, init time config work.
> >>> Even in virtio spec, one can read:
> >>> "Device configuration space is generally used for rarely changing or
> >> initialization-time parameters."
> >>> Adding some additional registers to a PF device config space for non
> >>> init time
> >> parameters does not make sense.
> >>> 7. Additionally, a nested virtualization should be done by truly
> >>> nesting the
> >> device at right abstraction point of owner-member relationship.
> >>> This follows two principles of (a) efficiency and (b) equivalency of
> >>> what Jason
> >> paper pointed.
> >>> And we ask for nested VF extension we will get our guidance from
> >>> PCI-SIG, of
> >> why it should be done if it is matching with rest of the ecosystem
> >> components that support/don't support the nesting.
> >> It they are true, shall we refactor virtio-pci common cfg
> >> functionalities to use admin vq?
> > For non-backward compatible SIOV device of the future, yes, virtio-pci
> common config (non init registers) should be moved to a vq, located on the
> member device directly.
> Oh, really? Quite interesting, do you want to move all config space fields in VF
> to admin vq? Have a plan?
Not in my plan for the spec 1.4 time frame.
I do not want to divert the discussion; I would like to focus on the device migration phases.
Let's please discuss this in some other dedicated thread.
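
As a rough illustration of points 1 and 2.a quoted above, here is a toy model of why migration state kept in registers on the VF does not survive a guest-initiated FLR, while state recorded through the owner device does. This is only a sketch under stated assumptions: MemberVF, OwnerPF, admin_cmd and the "dirty_tracking_enabled" key are hypothetical names invented for illustration, not fields or commands from the spec or from this series.

```python
from dataclasses import dataclass, field


@dataclass
class MemberVF:
    """Toy VF: every register it owns is cleared by FLR or device reset."""
    regs: dict = field(default_factory=dict)

    def function_level_reset(self) -> None:
        # In passthrough mode the guest driver can trigger this at any time.
        self.regs.clear()


@dataclass
class OwnerPF:
    """Toy owner device: keeps per-member migration state outside the VF."""
    member_state: dict = field(default_factory=dict)

    def admin_cmd(self, member_id: int, key: str, value: int) -> None:
        # Hypothetical stand-in for an admin command addressed to one member.
        self.member_state.setdefault(member_id, {})[key] = value


def demo() -> tuple:
    vf = MemberVF()
    pf = OwnerPF()

    # Option A: a hypothetical migration control register located on the VF.
    vf.regs["dirty_tracking_enabled"] = 1
    # Option B: the same setting recorded through the owner device.
    pf.admin_cmd(member_id=1, key="dirty_tracking_enabled", value=1)

    # Guest-initiated FLR on the VF.
    vf.function_level_reset()

    on_vf = vf.regs.get("dirty_tracking_enabled")             # lost with the reset
    on_pf = pf.member_state[1].get("dirty_tracking_enabled")  # still present
    return on_vf, on_pf


if __name__ == "__main__":
    print(demo())
```

In this model the VMM's setting is gone from the VF after the reset but is still held by the owner, which is the property the passthrough case relies on.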

Follow-Ups:
- Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  - From: "Zhu, Lingshan" <lingshan.zhu@intel.com>
- Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  - From: "Michael S. Tsirkin" <mst@redhat.com>

References:
- [PATCH v1 0/8] Introduce device migration support commands
  - From: Parav Pandit <parav@nvidia.com>
- Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  - From: "Zhu, Lingshan" <lingshan.zhu@intel.com>
- RE: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration
  - From: Parav Pandit <parav@nvidia.com>