virtio-comment message

Subject: RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
From: Parav Pandit <parav@nvidia.com>
To: Jason Wang <jasowang@redhat.com>
Date: Wed, 1 Nov 2023 03:31:54 +0000

> From: Jason Wang <jasowang@redhat.com>
> Sent: Wednesday, November 1, 2023 6:04 AM
> 
> On Tue, Oct 31, 2023 at 1:30 PM Parav Pandit <parav@nvidia.com> wrote:
> >
> >
> >
> > > From: Jason Wang <jasowang@redhat.com>
> > > Sent: Tuesday, October 31, 2023 7:05 AM
> > >
> > > On Mon, Oct 30, 2023 at 12:47 PM Parav Pandit <parav@nvidia.com> wrote:
> > > >
> > > >
> > > >
> > > > > From: virtio-comment@lists.oasis-open.org
> > > > > <virtio-comment@lists.oasis-open.org> On Behalf Of Jason Wang
> > > > >
> > > > > On Thu, Oct 26, 2023 at 11:45 AM Parav Pandit <parav@nvidia.com> wrote:
> > > > > >
> > > > > >
> > > > > >
> > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > Sent: Thursday, October 26, 2023 6:16 AM
> > > > > > >
> > > > > > > On Wed, Oct 25, 2023 at 3:03 PM Parav Pandit <parav@nvidia.com> wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > Sent: Wednesday, October 25, 2023 6:59 AM
> > > > > > > > > > For passthrough PASID assignment vq is not needed.
> > > > > > > > >
> > > > > > > > > How do you know that?
> > > > > > > > Because for passthrough, the hypervisor is not involved in
> > > > > > > > dealing with the VQ at all.
> > > > > > >
> > > > > > > Ok, so if I understand correctly, you are saying your design
> > > > > > > can't work for the case of PASID assignment.
> > > > > > >
> > > > > > No. PASID assignment will happen from the guest for its own
> > > > > > use, and device migration will just work fine because the device
> > > > > > context will capture this.
> > > > >
> > > > > It's not about device context. We're discussing "passthrough", no?
> > > > >
> > > > Not sure we are discussing the same thing.
> > > > A member device is passed through to the guest, dealing with its own
> > > > PASIDs and the virtio interface for some VQ assignment to a PASID.
> > > > So the VQ context captured by the hypervisor will have some PASID
> > > > attached to this VQ.
> > > > The device context will be updated.
> > > >
> > > > > You want all virtio stuff to be "passthrough", but assigning a
> > > > > PASID to a specific virtqueue in the guest must be trapped.
> > > > >
> > > > No. PASID assignment to a specific virtqueue in the guest must go
> > > > directly from guest to device.
> > >
> > > This works like setting CR3, you can't simply let it go from guest to host.
> > >
> > > Host IOMMU driver needs to know the PASID to program the IO page
> > > tables correctly.
> > >
> > This will be done by the IOMMU.
> >
> > > > When the guest IOMMU needs to communicate anything for this PASID,
> > > > it will come through its proper IOMMU channel/hypercall.
> > >
> > > Let's say using PASID X for queue 0, this knowledge is beyond the
> > > IOMMU scope but belongs to virtio. Or please explain how it can work
> > > when it goes directly from guest to device.
> > >
> > We have yet to see a spec for PASID to VQ assignment.
> 
> It has one.
> 
> > OK, for theory's sake, say it is there.
> >
> > The virtio driver will assign the PASID directly from the guest driver to the device using a
> > create_vq(pasid=X) command.
> > The same process is somehow attached to the PASID by the guest OS.
> > The whole PASID range is known to the hypervisor when the device is handed
> > over to the guest VM.
> 
> How can it know?
> 
> > So the PASID mapping is set up by the hypervisor IOMMU at this point.
> 
> You disallow the PASID to be virtualized here. What's more, such a PASID
> passthrough has security implications.
>
No. The virtio spec does not disallow it. At least, this series certainly does not.
My main point is that the virtio device interface will not be the source of the hypercall that programs the IOMMU in the hypervisor.
That is something to be done on the IOMMU side.

> Again, we are talking about different things, I've tried to show you that there are
> cases that passthrough can't work but if you think the only way for migration is
> to use passthrough in every case, you will probably fail.
> 
I didn't say the only way for migration is passthrough.
Passthrough is clearly one way.
Other ways may be possible.

> >
> > > > Virtio device is not the conduit for this exchange.
> > > >
> > > > > >
> > > > > > > >
> > > > > > > > > There is work ongoing to make vPASID work for the
> > > > > > > > > guest, like vSVA.
> > > > > > > > > Virtio doesn't differ from other devices.
> > > > > > > > Passthrough does not run like SVA.
> > > > > > >
> > > > > > > Great, you find another limitation of "passthrough" by yourself.
> > > > > > >
> > > > > > No. It is not a limitation; it simply does not need complex
> > > > > > SVA to split the device for unrelated usage.
> > > > >
> > > > > How can you limit the user in the guest to not use vSVA?
> > > > >
> > > > He he, I am not limiting; again a misunderstanding or wrong attribution.
> > > > I explained that the hypervisor does not need SVA for passthrough.
> > > > The guest can do anything it wants from the guest OS with the member
> > > > device.
> > >
> > > Ok, so the point still stands, see above.
> >
> > I don't think so. The guest owns its PASID space
> 
> Again, vPASID to PASID can't be done in hardware unless I am missing some recent
> features of IOMMUs.
> 
CPU vendors have different ways of doing vPASID to pPASID mapping.
It is still an early space for virtio.

> > and directly communicates like any other device attribute.
> >
> > >
> > > >
> > > > > >
> > > > > > > > Each passthrough device has PASIDs from its own space, fully
> > > > > > > > managed by the guest.
> > > > > > > > Some CPUs required vPASID, and SIOV is not going this way anymore.
> > > > > > >
> > > > > > > Then how to migrate? Invent a full set of something else
> > > > > > > through another giant series like this to migrate to the SIOV thing?
> > > > > > > That's a mess for sure.
> > > > > > >
> > > > > > SIOV will for sure reuse most or all parts of this work, almost
> > > > > > entirely as is.
> > > > > > vPASID is a CPU/platform-specific thing, not part of the SIOV devices.
> > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > > If at all it is done, it will be done from the guest
> > > > > > > > > > by the driver using the virtio interface.
> > > > > > > > >
> > > > > > > > > Then you need to trap. Such things couldn't be passed
> > > > > > > > > through to guests directly.
> > > > > > > > >
> > > > > > > > Only the PASID capability is trapped. PASID allocation and
> > > > > > > > usage is directly from the guest.
> > > > > > >
> > > > > > > How can you achieve this? Assigning a PASID to a device is
> > > > > > > completely device (virtio) specific. How can you use a general layer
> > > > > > > without the knowledge of virtio to trap that?
> > > > > > When one wants to map vPASID to pPASID, a platform needs to be
> > > > > > involved.
> > > > >
> > > > > I'm not talking about how to map vPASID to pPASID, it's out of
> > > > > the scope of virtio. I'm talking about assigning a vPASID to a
> > > > > specific virtqueue or other virtio function in the guest.
> > > > >
> > > > That can be done in the guest. The key is that the guest won't know that it
> > > > is dealing with a vPASID.
> > > > It will follow the same principle of equivalence from your paper, where the
> > > > virtio software layer will assign a PASID to a VQ and communicate it to the device.
> > > >
> > > > Anyway, all of this is just a digression from the current series.
> > >
> > > It's not. You mentioned that only MSI-X is trapped; I am giving you another one.
> > >
> > PASID access from the guest is to be done fully by the guest IOMMU.
> > Not by virtio devices.
> >
> > > >
> > > > > You need a virtio specific queue or capability to assign a PASID
> > > > > to a specific virtqueue, and that can't be done without trapping
> > > > > and without virtio specific knowledge.
> > > > >
> > > > I disagree. In future, PASID assignment to a virtqueue from the guest
> > > > virtio driver to the device is a uniform method,
> > > > whether it is a PF assigning a PASID to its own VQ, or a VF driver in the
> > > > guest assigning a PASID to a VQ.
> > > >
> > > > All the same.
> > > > Only the IOMMU layer hypercalls will know how to deal with PASID
> > > > assignment at the platform layer to set up the domain tables, etc.
> > > >
> > > > And this is way beyond our device migration discussion.
> > > > In any case, if you were implying that VQ to PASID assignment
> > > > _may_ need trap+emulation, and hence that the whole device migration
> > > > depends on some trap+emulation, then I surely do not agree.
> > >
> > > See above.
> > >
> > Yeah, I disagree with that implication.
> >
> > > >
> > > > The PASID equivalent in the mlx5 world is ODP_MR+PD isolating the guest
> > > > process, and all of that has just worked on the efficiency and
> > > > equivalence principles for a decade now, without any trap+emulation.
> > > >
> > > > > > When a virtio passthrough device is in the guest, it has all its PASIDs
> > > > > > accessible.
> > > > > >
> > > > > > All of this is a large deviation from the current discussion of this
> > > > > > series, so I will keep it short.
> > > > > >
> > > > > > >
> > > > > > > > Regardless, it is not relevant to passthrough mode, as PASID
> > > > > > > > is yet another resource.
> > > > > > > > And for some CPUs, if it is trapped, it is in a generic layer
> > > > > > > > that does not require virtio involvement.
> > > > > > > > So the virtio interface asking to trap something because a
> > > > > > > > generic facility has done so is not the approach.
> > > > > > >
> > > > > > > This misses the point of PASID. How to use PASID is totally
> > > > > > > device specific.
> > > > > > Sure, and how to virtualize vPASID/pPASID is platform specific,
> > > > > > as a single PASID can be used by multiple devices and processes.
> > > > >
> > > > > See above, I think we're talking about different things.
> > > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > > > Capabilities of #2 are generic across all PCI devices,
> > > > > > > > > > so they will be handled by the HV.
> > > > > > > > > > The ATS/PRI cap is also handled in a generic manner by the HV
> > > > > > > > > > and the PCI device.
> > > > > > > > >
> > > > > > > > > No, ATS/PRI requires the cooperation from the vIOMMU.
> > > > > > > > > You can simply do ATS/PRI passthrough but with an emulated
> > > > > > > > > vIOMMU.
> > > > > > > > And that is not the reason for a virtio device to build
> > > > > > > > trap+emulation for passthrough member devices.
> > > > > > >
> > > > > > > vIOMMU is emulated by hypervisor with a PRI queue,
> > > > > > PRI requests arrive on the PF for the VF.
> > > > >
> > > > > Shouldn't it arrive at platform IOMMU first? The path should be
> > > > > PRI
> > > > > -> RC -> IOMMU -> host -> Hypervisor -> vIOMMU PRI -> guest IOMMU.
> > > > >
> > > > The above sequence seems right.
> > > >
> > > > > And things will be more complicated when (v)PASID is used. So
> > > > > you can't simply let PRI go directly to the guest with the current
> > > > > architecture.
> > > > >
> > > > In the current architecture of the PCI VF, PRI does not go directly to the guest.
> > > > (And that is not a reason to trap and emulate other things.)
> > >
> > > Ok, so beyond MSI-X we need to trap PRI, and we will probably trap
> > > other things in the future like PASID assignment.
> > PRI etc. all belong to the generic PCI 4K config space region.
> 
> It's not about the capability, it's about the whole process of PRI request
> handling. We've agreed that the PRI request needs to be trapped by the
> hypervisor and then delivered to the vIOMMU.
>
 
> > Trap+emulation is done in a generic manner without involving virtio or other
> > device types.
> >
> > >
> > > >
> > > > > >
> > > > > > > how can you pass
> > > > > > > through a hardware PRI request to a guest directly without
> > > > > > > trapping it then?
> > > > > > > What's more, PCIe allows the PRI to be done in a vendor
> > > > > > > (virtio) specific way, so you want to break this rule? Or
> > > > > > > you want to blacklist ATS/PRI for virtio?
> > > > > > >
> > > > > > I was aware of only the PCI-SIG way of PRI.
> > > > > > Do you have a reference to the ECN that enables a vendor
> > > > > > specific way of PRI? I would like to read it.
> > > > >
> > > > > I mean it doesn't forbid us to build a virtio specific interface
> > > > > for I/O page fault report and recovery.
> > > > >
> > > > So the PRI of PCI does not allow it. It is an ODP kind of technique you meant above.
> > > > Yes, one can build it.
> > > > Ok. It is unrelated to device migration, so I will park this good discussion for
> > > > later.
> > >
> > > That's fine.
> > >
> > > >
> > > > > > This will be very good to eliminate IOMMU PRI limitations.
> > > > >
> > > > > Probably.
> > > > >
> > > > > > PRI will directly go to the guest driver, and the guest would
> > > > > > interact with the IOMMU to service the paging request through IOMMU APIs.
> > > > >
> > > > > With PASID, it can't go directly.
> > > > >
> > > > When the request contains a PASID, it can.
> > > > But again, these PCI-SIG extensions of PASID are not related to
> > > > device migration, so I am deferring it.
> > > >
> > > > > > PRI in a vendor-specific way needs a separate discussion. It
> > > > > > is not related to live migration.
> > > > >
> > > > > PRI itself is not related. But the point is, you can't simply
> > > > > pass through ATS/PRI now.
> > > > >
> > > > Ah ok. The whole 4K PCI config space, where the ATS/PRI capabilities
> > > > are located, is trapped+emulated by the hypervisor.
> > > > So?
> > > > So do we start emulating virtio interfaces too for passthrough?
> > > > No.
> > > > Can one still continue to trap+emulate?
> > > > Sure, why not?
> > >
> > > Then let's not limit your proposal to be used by "passthrough" only?
> > One can possibly build some variant of the existing virtio member device
> > using the same owner and member scheme.
> 
> It's not about the member/owner, it's about e.g whether the hypervisor can
> trap and emulate.
> 
> I've pointed out that what you invent here is actually a partial new transport, for
> example, a hypervisor can trap and use things like device context in PF to bypass
> the registers in VF. This is the idea of transport commands/q.
>
I will not mix in transport commands, which are mainly useful for actual device operation for SIOV, and only for backward compatibility, and even then optionally.
One may still choose to have the virtio common and device config in MMIO, of course, at lower scale.

Anyway, mixing the migration context with an actual SIOV-specific thing is not correct, as the device context is read and written as incremental values.

> > If some admin commands are missing for that, maybe one can add them.
> 
> I would then build the device context commands on top of the transport
> commands/q, then it would be complete.
> 
> > No need to step on toes of use cases as they are different...
> >
> > > I've shown you that
> > >
> > > 1) you can't easily say you can pass through all the virtio
> > > facilities
> > > 2) how ambiguous for terminology like "passthrough"
> > >
> > It is not; it is well defined in v3 and v2.
> > One can continue to argue and keep defining the variant and still call it data
> > path acceleration and then claim it as passthrough ...
> > But I won't debate this anymore, as it is just a non-technical aspect of least
> > interest.
> 
> You use this terminology in the spec, which is all about technical matters, and you think
> how to define it is a non-technical matter. This is self-contradictory. If you fail,
> it probably means it's ambiguous.
> Let's not use that terminology.
>
What it means is described in the theory of operation.
 
> > We have technical tasks and more improved specs to update going forward.
> 
> It's a burden to do the synchronization.
We have discussed this.
In the current proposal the member device is not bifurcated, so it implements the necessary pieces.
Feature != burden.

> 
> > Working on extension for device specific contexts to enrich it.
> 
> Again, making the proposal to be general is much more beneficial.

Yes, it is general, and as with any other device type, each has its extensions.
The infrastructure is covered in v3.