virtio-comment message

Subject: RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration

From: Parav Pandit <parav@nvidia.com>
To: "Zhu, Lingshan" <lingshan.zhu@intel.com>, "Michael S. Tsirkin" <mst@redhat.com>
Date: Wed, 11 Oct 2023 11:43:24 +0000

> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> Sent: Wednesday, October 11, 2023 3:55 PM

> >>>>> I don't have a strong opinion on keeping or removing it, as most
> >>>>> stakeholders
> >>>>> have a clear view of the requirements now.
> >>>>> Let me know.
> >>>> So some people use VFs with VFIO. Hence the module name.  This
> >>>> sentence by itself seems to have zero value for the spec. Just drop it.
> >>> Ok. Will drop.
> >> So why not build your admin vq live migration on our config space
> >> solution, get out of the troubles, to make your life easier?
> >>
> > This question of yours is completely unrelated to this reply, or you misunderstood
> > what dropping the commit log means.
> if you can rebase admin vq LM on our basic facilities, I think you don't need to
> talk about vfio in the first place, so I ask you to re-consider Jason's proposal.
I don't really know why you are upset with the vfio term.
It is the cloud operator's use case, and it is listed to indicate how the proposal fits such a use case.
If for some reason you don't like vfio, fine. Ignore it and move on.

I already answered that I will remove it from the commit log, because the requirements are now well understood by the committee.

Your comment is again unrelated (repeated) to your past two questions.

I explained to you the technical problem: the admin commands (not admin VQ) of the basic facilities cannot be done using config registers without a mediation layer.

> >
> > Dropping link to vfio does not drop the requirement.
> > I am ok to drop because requirements are clear of passthrough of member
> device.
> > Vfio is not a trouble at all.
> > Admin command is not a trouble either.
> >
> > The pure technical reason is: all the functionalities proposed cannot be done
> in any other existing way.
> > Why? For below reasons.
> > 1. device context, and write records (aka dirty page addresses) is
> > huge which cannot be shared using config registers at scale of 4000
> > member devices
> dirty page tracking will be implemented in V2, actually I have the patch right
> now.
That is yet again an invitation to non-collaboration mode.
Without reviewing v0 and v1, you want to show dirty page tracking in some other way.

But ok, that is your non-cooperative mode of working. Cannot help further.

> inflight descriptor tracking will be implemented by Eugenio in V2.
When we have a near-complete proposal from two device vendors, you want to push something to an unknown future without reviewing the work.
That does not make sense.

You are still in the mode of _take_ what we did, with near zero explanation.
You asked why the passthrough proposal cannot take advantage of in_band config registers.
I explained the technical reasons listed here.

So please don't jump to conclusions before finishing the discussion on how both sides can take advantage of each other.

Let's please do that.

> There are no scale problem as I repeated for many time, they are per-device
> basic facilities, just migrate the VF by its own facility, so there are no 40000
> member devices, this is not per PF.
> 
I explained that the device reset, FLR, etc. flows cannot work when the controlling and controlled functions are a single entity in passthrough mode.
The scale problem is that one needs to duplicate the registers on each VF.
The industry is moving away from the register interface in many _real_ hw device implementations.
Some examples are IMS, SIOV, NVMe, and more.

> The device context can be read from config space or trapped, like shadow
There are 1 million flows of the net device flow filters in progress.
Each flow is 64B in size.
Total size is 64MB.
I don't see how one can read such an amount of memory using config registers.
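
As a rough back-of-envelope sketch of the sizing above: the flow count and flow size are the figures quoted in this mail, while the 4-byte register width (and the function names) are assumptions made only for illustration, not something taken from the spec or from either proposal.

```python
# Back-of-envelope sizing for the flow-filter context discussed above.
# REGISTER_WIDTH_BYTES is an assumption for illustration only.

FLOW_COUNT = 1_000_000       # flows per device (figure from the mail)
FLOW_SIZE_BYTES = 64         # bytes per flow (figure from the mail)
REGISTER_WIDTH_BYTES = 4     # assumed: one 32-bit register read per access


def context_size_bytes(flows: int, flow_size: int) -> int:
    """Total bytes of flow-filter state to transfer."""
    return flows * flow_size


def register_reads_needed(total_bytes: int, reg_width: int) -> int:
    """Number of register-sized reads needed to move total_bytes (rounded up)."""
    return -(-total_bytes // reg_width)  # ceiling division


if __name__ == "__main__":
    total = context_size_bytes(FLOW_COUNT, FLOW_SIZE_BYTES)
    reads = register_reads_needed(total, REGISTER_WIDTH_BYTES)
    print(f"context size: {total} bytes, register reads: {reads}")
```

Under these assumptions a single device needs millions of individual register reads for the flow state alone, before multiplying by the number of member devices.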

> control vq which is already done, that is basic virtualization.

There is nothing like "basic virtualization".
What is proposed here is fulfilling the requirement of passthrough mode.

Your comment is implying, "I don't care for passthrough requirements, do non_passthrough".

The discussion should be,
How can we leverage a common framework for passthrough and mediated modes?
Can we? If so, which are the pieces?

For me it is frankly very weird to take a native virtio member device, convert it into a mediated device using giant software, and after that convolution get a virtio device.
But for the nested case you have the use case.
So if we focus positively on how the two use cases can use some common functionality, that will be great.

> If you want to migrate device context, you need to specify device context for
> every type of device, net maybe easy, how do you see virtio-fs?
Virtio-fs will have its own device context too.
Every device has some sort of backend, to a varying degree.
Net is a widely used and moderately complex device.
Fs is slightly stateful but less complex than net, as it has far fewer control operations.
In fact, virtio-fs already discusses migrating the device-side state, as listed in the device context.
So the virtio-fs device will have its own device context defined.

The infrastructure and basic facilities are set up in this series, so that one can easily extend them to all current and new device types.

> And we are migrating stateless devices, or no? How do you migrate virtio-fs?
> > 2. sharing such large context and write addresses in parallel for
> > multiple devices cannot be done using single register file
> see above
> > 3. These registers cannot be residing in the VF because VF can undergo
> > FLR, and device reset which must clear these registers
> 
> do you mean you want to audit all PCI features? When FLR, the device is reset,

> do you expect a device remember anything after FLR?
Not at all. VF member device will not remember anything after FLR.
> Do you want to trap FLR? Why?
This proposal does _not_ want to trap the FLR in the hypervisor virtio driver.

When one does the mediation-based design, it must trap/emulate/fake the FLR.
It helps to address the case of nested as you mentioned.
> 
> Why FLR block or conflict with live migration?
It does not block or conflict.

The whole point is, when you put live migration functionality on the VF itself, you just cannot FLR this device.
One must trap the FLR, do a fake FLR, and build the whole infrastructure to not FLR the device.
The above is not a passthrough device.

> 
> > 4. When VF does the DMA, all dma occurs in the guest address space, not in
> hypervisor space; any flr and device reset must stop such dma.
> > And device reset and flr are controlled by the guest (not mediated by
> hypervisor).
> if the guest reset the device, it is totally reasonable operation, and the guest
> own the risk, right?
Sure, but the guest still expects its dirty pages and device context to be migrated across device_reset.
Device_reset will lose all this information within the device if done without mediation and special care.

So, to avoid that, one now needs a fake device reset too, and must build the infrastructure to not reset.
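
To illustrate the point about reset, here is a toy model only. The class and method names (`MemberDevice`, `OwnerTracker`, `dma_write`, `flr`) are hypothetical and do not come from the virtio spec or from either proposal; the sketch just contrasts migration state kept inside the VF with state kept outside it when the guest resets the device.

```python
# Toy model only: all names here are hypothetical, not virtio spec terms.

class OwnerTracker:
    """Write records held outside the VF (e.g. on the owner device side)."""

    def __init__(self):
        self.dirty_pages = set()

    def record(self, page: int) -> None:
        self.dirty_pages.add(page)


class MemberDevice:
    """A VF whose internal state is fully cleared by a guest-triggered FLR/reset."""

    def __init__(self):
        self.dirty_pages = set()  # migration state kept on the VF itself

    def dma_write(self, page: int, tracker: OwnerTracker) -> None:
        self.dirty_pages.add(page)   # tracked inside the VF
        tracker.record(page)         # tracked outside the VF

    def flr(self) -> None:
        # An unmediated FLR or device reset clears everything inside the VF.
        self.dirty_pages.clear()


def run_scenario():
    tracker = OwnerTracker()
    vf = MemberDevice()
    vf.dma_write(0x1000, tracker)
    vf.dma_write(0x2000, tracker)
    vf.flr()  # guest resets the device while migration is in progress
    return vf.dirty_pages, tracker.dirty_pages
```

In this model the write records kept inside the VF are gone after the reset, while those kept outside it are still available, which is the behavior the guest expects without the hypervisor having to fake the reset.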

The fundamental concept of the passthrough proposal is:

all the native virtio functionalities are between the guest driver and the actual device.

> and still, do you want to audit every PCI features? at least you didn't do that in
> your series.
Can you please list which PCI features audit you are talking about?

Keep in mind that with all the mediation, one now must equally audit this whole giant software stack too.
So maybe it is fine for those who are ok with it.

> For migration, you know the hypervisor takes the ownership of the device in the
> stop_window.
I do not know what stop_window means.
Do you mean stop_copy of vfio, or is it a qemu term?

> > 5. Any PASID to separate out admin vq on the VF does not work for three
> > reasons.
> > R_1: device flr and device reset must stop all the dmas.
> > R_2: PASID by most leading vendors is still not mature enough
> > R_3: One also needs to do inversion to not expose the PASID capability of
> > the member PCI device
> see above and what if guest shutdown? the same answer, right?
Not sure I follow.
If the guest shuts down, the guest-specific shutdown APIs are called.

With passthrough device, R_1 just works as is.
R_3 is not needed as they are directly given to the guest.
R_2 platform dependency is not needed either.

> >
> >> Actually you don't see any technical problems in our config space
> >> proposal, right?
> > In config registers method, for passthrough I clearly see the technical
> problems (functional and scale) listed above.
> > Due to which config registers cannot reside on the VF and cannot scale either.
> so see above answers.
