virtio-comment message



Subject: RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration


> From: Jason Wang <jasowang@redhat.com>
> Sent: Thursday, November 16, 2023 9:50 AM
> 
> On Thu, Nov 16, 2023 at 1:39 AM Parav Pandit <parav@nvidia.com> wrote:
> >
> > > From: Jason Wang <jasowang@redhat.com>
> > > Sent: Monday, November 13, 2023 9:03 AM
> > >
> > > On Thu, Nov 9, 2023 at 2:25 PM Parav Pandit <parav@nvidia.com> wrote:
> > > >
> > > >
> > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > Sent: Tuesday, November 7, 2023 9:35 AM
> > > > >
> > > > > On Mon, Nov 6, 2023 at 3:05 PM Parav Pandit <parav@nvidia.com>
> wrote:
> > > > > >
> > > > > >
> > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > Sent: Monday, November 6, 2023 12:05 PM
> > > > > > >
> > > > > > > On Thu, Nov 2, 2023 at 2:10 PM Parav Pandit
> > > > > > > <parav@nvidia.com>
> > > wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > Sent: Thursday, November 2, 2023 9:56 AM
> > > > > > > > >
> > > > > > > > > On Wed, Nov 1, 2023 at 11:32 AM Parav Pandit
> > > > > > > > > <parav@nvidia.com>
> > > > > wrote:
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > > > Sent: Wednesday, November 1, 2023 6:04 AM
> > > > > > > > > > >
> > > > > > > > > > > On Tue, Oct 31, 2023 at 1:30 PM Parav Pandit
> > > > > > > > > > > <parav@nvidia.com>
> > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > > > > > Sent: Tuesday, October 31, 2023 7:05 AM
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Mon, Oct 30, 2023 at 12:47 PM Parav Pandit
> > > > > > > > > > > > > <parav@nvidia.com>
> > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > From: virtio-comment@lists.oasis-open.org
> > > > > > > > > > > > > > > <virtio-comment@lists.oasis-open.org> On
> > > > > > > > > > > > > > > Behalf Of Jason Wang
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Thu, Oct 26, 2023 at 11:45 AM Parav
> > > > > > > > > > > > > > > Pandit <parav@nvidia.com>
> > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > > > > > > > > > Sent: Thursday, October 26, 2023 6:16 AM
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On Wed, Oct 25, 2023 at 3:03 PM Parav
> > > > > > > > > > > > > > > > > Pandit <parav@nvidia.com>
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > From: Jason Wang
> > > > > > > > > > > > > > > > > > > <jasowang@redhat.com>
> > > > > > > > > > > > > > > > > > > Sent: Wednesday, October 25, 2023
> > > > > > > > > > > > > > > > > > > 6:59 AM
> > > > > > > > > > > > > > > > > > > > For passthrough PASID assignment
> > > > > > > > > > > > > > > > > > > > vq is not
> > > needed.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > How do you know that?
> > > > > > > > > > > > > > > > > > Because for passthrough, the
> > > > > > > > > > > > > > > > > > hypervisor is not involved in dealing
> > > > > > > > > > > > > > > > > > with VQ at
> > > > > > > > > > > > > > > > > all.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Ok, so if I understand correctly, you
> > > > > > > > > > > > > > > > > are saying your design can't work for
> > > > > > > > > > > > > > > > > the case of PASID
> > > assignment.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > No. PASID assignment will happen from the
> > > > > > > > > > > > > > > > guest for its own use and device
> > > > > > > > > > > > > > > migration will just work fine because device
> > > > > > > > > > > > > > > context will capture
> > > > > > > this.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > It's not about device context. We're
> > > > > > > > > > > > > > > discussing "passthrough",
> > > > > > > no?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > Not sure, we are discussing same.
> > > > > > > > > > > > > > A member device is passthrough to the guest,
> > > > > > > > > > > > > > dealing with its own PASIDs and
> > > > > > > > > > > > > virtio interface for some VQ assignment to PASID.
> > > > > > > > > > > > > > So VQ context captured by the hypervisor, will
> > > > > > > > > > > > > > have some PASID attached to
> > > > > > > > > > > > > this VQ.
> > > > > > > > > > > > > > Device context will be updated.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > You want all virtio stuff to be
> > > > > > > > > > > > > > > "passthrough", but assigning a PASID to a
> > > > > > > > > > > > > > > specific virtqueue in the guest must be
> > > > > > > trapped.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > No. PASID assignment to a specific virtqueue
> > > > > > > > > > > > > > in the guest must go directly
> > > > > > > > > > > > > from guest to device.
> > > > > > > > > > > > >
> > > > > > > > > > > > > This works like setting CR3, you can't simply
> > > > > > > > > > > > > let it go from guest to
> > > > > > > host.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Host IOMMU driver needs to know the PASID to
> > > > > > > > > > > > > program the IO page tables correctly.
> > > > > > > > > > > > >
> > > > > > > > > > > > This will be done by the IOMMU.
> > > > > > > > > > > >
> > > > > > > > > > > > > > When guest iommu may need to communicate
> > > > > > > > > > > > > > anything for this PASID, it will
> > > > > > > > > > > > > come through its proper IOMMU channel/hypercall.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Let's say using PASID X for queue 0, this
> > > > > > > > > > > > > knowledge is beyond the IOMMU scope but belongs
> > > > > > > > > > > > > to virtio. Or please explain how it can work
> > > > > > > > > > > > > when it goes directly from guest to
> > > > > device.
> > > > > > > > > > > > >
> > > > > > > > > > > > We are yet to ever see spec for PASID to VQ assignment.
> > > > > > > > > > >
> > > > > > > > > > > It has one.
> > > > > > > > > > >
> > > > > > > > > > > > For ok for theory sake it is there.
> > > > > > > > > > > >
> > > > > > > > > > > > Virtio driver will assign the PASID directly from
> > > > > > > > > > > > guest driver to device using a
> > > > > > > > > > > create_vq(pasid=X) command.
> > > > > > > > > > > > Same process is somehow attached the PASID by the guest
> OS.
> > > > > > > > > > > > The whole PASID range is known to the hypervisor
> > > > > > > > > > > > when the device is handed
> > > > > > > > > > > over to the guest VM.
> > > > > > > > > > >
> > > > > > > > > > > How can it know?
> > > > > > > > > > >
> > > > > > > > > > > > So PASID mapping is setup by the hypervisor IOMMU
> > > > > > > > > > > > at this
> > > point.
> > > > > > > > > > >
> > > > > > > > > > > You disallow the PASID to be virtualized here.
> > > > > > > > > > > What's more, such a PASID passthrough has security
> implications.
> > > > > > > > > > >
> > > > > > > > > > No. virtio spec is not disallowing. At least for sure,
> > > > > > > > > > this series is not the
> > > > > > > one.
> > > > > > > > > > My main point is, virtio device interface will not be
> > > > > > > > > > the source of hypercall to
> > > > > > > > > program IOMMU in the hypervisor.
> > > > > > > > > > It is something to be done by IOMMU side.
> > > > > > > > >
> > > > > > > > > So unless vPASID can be used by the hardware you need to
> > > > > > > > > trap the mapping from a PASID to a virtqueue. Then you
> > > > > > > > > need virtio specific
> > > > > > > knowledge.
> > > > > > > > >
> > > > > > > > vPASID by hardware is unlikely to be used by hw PCI EP
> > > > > > > > devices at least in any
> > > > > > > near term future.
> > > > > > > > This requires either vPASID to pPASID table in device or in IOMMU.
> > > > > > >
> > > > > > > So we are on the same page.
> > > > > > >
> > > > > > > Claiming a method that can only work for passthrough or
> > > > > > > emulation is not
> > > > > good.
> > > > > > > We all know virtualization is passthrough + emulation.
> > > > > > Again, I agree but I wont generalize it here.
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > Again, we are talking about different things, I've
> > > > > > > > > > > tried to show you that there are cases that
> > > > > > > > > > > passthrough can't work but if you think the only way
> > > > > > > > > > > for migration is to use passthrough in every case,
> > > > > > > > > > > you will
> > > > > > > > > probably fail.
> > > > > > > > > > >
> > > > > > > > > > I didn't say only way for migration is passthrough.
> > > > > > > > > > Passthrough is clearly one way.
> > > > > > > > > > Other ways may be possible.
> > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > > Virtio device is not the conduit for this exchange.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > There are works ongoing to make
> > > > > > > > > > > > > > > > > > > vPASID work for the guest like
> > > > > > > > > > > > > vSVA.
> > > > > > > > > > > > > > > > > > > Virtio doesn't differ from other devices.
> > > > > > > > > > > > > > > > > > Passthrough do not run like SVA.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Great, you find another limitation of
> > > > > > > > > > > > > > > > > "passthrough" by
> > > > > > > yourself.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > No. it is not the limitation it is just
> > > > > > > > > > > > > > > > the way it does not need complex SVA to
> > > > > > > > > > > > > > > split the device for unrelated usage.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > How can you limit the user in the guest to not use
> vSVA?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > He he, I am not limiting, again
> > > > > > > > > > > > > > misunderstanding or wrong
> > > > > > > attribution.
> > > > > > > > > > > > > > I explained that hypervisor for passthrough
> > > > > > > > > > > > > > does not need
> > > SVA.
> > > > > > > > > > > > > > Guest can do anything it wants from the guest
> > > > > > > > > > > > > > OS with the member
> > > > > > > > > > > device.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Ok, so the point stills, see above.
> > > > > > > > > > > >
> > > > > > > > > > > > I don't think so. The guest owns its PASID space
> > > > > > > > > > >
> > > > > > > > > > > Again, vPASID to PASID can't be done hardware unless
> > > > > > > > > > > I miss some recent features of IOMMUs.
> > > > > > > > > > >
> > > > > > > > > > Cpu vendors have different way of doing vPASID to pPASID.
> > > > > > > > >
> > > > > > > > > At least for the current version of major IOMMU vendors,
> > > > > > > > > such translation (aka PASID remapping) is not
> > > > > > > > > implemented in the hardware so it needs to be trapped first.
> > > > > > > > >
> > > > > > > > Right. So it is really far in future, atleast few years away.
> > > > > > > >
> > > > > > > > > > It is still an early space for virtio.
> > > > > > > > > >
> > > > > > > > > > > > and directly communicates like any other device attribute.
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Each passthrough device has PASID from
> > > > > > > > > > > > > > > > > > its own space fully managed by the
> > > > > > > > > > > > > > > > > guest.
> > > > > > > > > > > > > > > > > > Some cpu required vPASID and SIOV is
> > > > > > > > > > > > > > > > > > not going this way
> > > > > > > > > anmore.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Then how to migrate? Invent a full set
> > > > > > > > > > > > > > > > > of something else through another giant
> > > > > > > > > > > > > > > > > series like this to migrate to the SIOV
> > > > > > > > > thing?
> > > > > > > > > > > > > > > > > That's a mess for
> > > > > > > > > > > > > > > sure.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > SIOV will for sure reuse most or all parts
> > > > > > > > > > > > > > > > of this work, almost entirely
> > > > > > > > > > > as_is.
> > > > > > > > > > > > > > > > vPASID is cpu/platform specific things not
> > > > > > > > > > > > > > > > part of the SIOV
> > > > > > > devices.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > If at all it is done, it will be
> > > > > > > > > > > > > > > > > > > > done from the guest by the driver
> > > > > > > > > > > > > > > > > > > > using virtio
> > > > > > > > > > > > > > > > > > > interface.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Then you need to trap. Such things
> > > > > > > > > > > > > > > > > > > couldn't be passed through to guests
> > > > > > > > > > > > > > > > > directly.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Only PASID capability is trapped.
> > > > > > > > > > > > > > > > > > PASID allocation and usage is directly
> > > > > > > > > > > > > > > > > > from
> > > > > > > > > > > > > > > > > guest.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > How can you achieve this? Assigning a
> > > > > > > > > > > > > > > > > PAISD to a device is completely
> > > > > > > > > > > > > > > > > device(virtio) specific. How can you use
> > > > > > > > > > > > > > > > > a general layer without the knowledge of
> > > > > > > > > > > > > > > > > virtio to trap
> > > that?
> > > > > > > > > > > > > > > > When one wants to map vPASID to pPASID a
> > > > > > > > > > > > > > > > platform needs to be
> > > > > > > > > > > > > involved.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I'm not talking about how to map vPASID to
> > > > > > > > > > > > > > > pPASID, it's out of the scope of virtio. I'm
> > > > > > > > > > > > > > > talking about assigning a vPASID to a
> > > > > > > > > > > > > > > specific virtqueue or other virtio function
> > > > > > > > > > > > > > > in the
> > > > > > > guest.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > That can be done in the guest. The key is
> > > > > > > > > > > > > > guest wont know that it is dealing
> > > > > > > > > > > > > with vPASID.
> > > > > > > > > > > > > > It will follow the same principle from your
> > > > > > > > > > > > > > paper of equivalency, where virtio
> > > > > > > > > > > > > software layer will assign PASID to VQ and
> > > > > > > > > > > > > communicate to
> > > > > device.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Anyway, all of this just digression from current series.
> > > > > > > > > > > > >
> > > > > > > > > > > > > It's not, as you mention that only MSI-X is
> > > > > > > > > > > > > trapped, I give you another
> > > > > > > > > one.
> > > > > > > > > > > > >
> > > > > > > > > > > > PASID access from the guest to be done fully by
> > > > > > > > > > > > the guest
> > > IOMMU.
> > > > > > > > > > > > Not by virtio devices.
> > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > You need a virtio specific queue or
> > > > > > > > > > > > > > > capability to assign a PASID to a specific
> > > > > > > > > > > > > > > virtqueue, and that can't be done without
> > > > > > > > > > > > > > > trapping and without virito specific
> > > > > knowledge.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > I disagree. PASID assignment to a virqueue in
> > > > > > > > > > > > > > future from guest virtio driver to
> > > > > > > > > > > > > device is uniform method.
> > > > > > > > > > > > > > Whether its PF assigning PASID to VQ of self,
> > > > > > > > > > > > > > Or VF driver in the guest assigning PASID to VQ.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > All same.
> > > > > > > > > > > > > > Only IOMMU layer hypercalls will know how to
> > > > > > > > > > > > > > deal with PASID assignment at
> > > > > > > > > > > > > platform layer to setup the domain etc table.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > And this is way beyond our device migration discussion.
> > > > > > > > > > > > > > By any means, if you were implying that
> > > > > > > > > > > > > > somehow vq to PASID assignment
> > > > > > > > > > > > > _may_ need trap+emulation, hence whole device
> > > > > > > > > > > > > migration to depend on some
> > > > > > > > > > > > > trap+emulation, than surely, than I do not agree to it.
> > > > > > > > > > > > >
> > > > > > > > > > > > > See above.
> > > > > > > > > > > > >
> > > > > > > > > > > > Yeah, I disagree to such implying.
> > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > PASID equivalent in mlx5 world is ODP_MR+PD
> > > > > > > > > > > > > > isolating the guest process and
> > > > > > > > > > > > > all of that just works on efficiency and
> > > > > > > > > > > > > equivalence principle already for a decade now
> > > > > > > > > > > > > without any
> > > trap+emulation.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > When virtio passthrough device is in
> > > > > > > > > > > > > > > > guest, it has all its PASID
> > > > > > > > > > > accessible.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > All these is large deviation from current
> > > > > > > > > > > > > > > > discussion of this series, so I will keep
> > > > > > > > > > > > > > > it short.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Regardless it is not relevant to
> > > > > > > > > > > > > > > > > > passthrough mode as PASID is yet
> > > > > > > > > > > > > > > > > > another
> > > > > > > > > > > > > > > > > resource.
> > > > > > > > > > > > > > > > > > And for some cpu if it is trapped, it
> > > > > > > > > > > > > > > > > > is generic layer, that does not
> > > > > > > > > > > > > > > > > > require virtio
> > > > > > > > > > > > > > > > > involvement.
> > > > > > > > > > > > > > > > > > So virtio interface asking to trap
> > > > > > > > > > > > > > > > > > something because generic facility has
> > > > > > > > > > > > > > > > > > done
> > > > > > > > > > > > > > > > > in not the approach.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > This misses the point of PASID. How to
> > > > > > > > > > > > > > > > > use PASID is totally device
> > > > > > > > > > > > > specific.
> > > > > > > > > > > > > > > > Sure, and how to virtualize vPASID/pPASID
> > > > > > > > > > > > > > > > is platform specific as single PASID
> > > > > > > > > > > > > > > can be used by multiple devices and process.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > See above, I think we're talking about different things.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Capabilities of #2 is generic
> > > > > > > > > > > > > > > > > > > > across all pci devices, so it will
> > > > > > > > > > > > > > > > > > > > be handled by the
> > > > > > > > > > > > > > > > > > > HV.
> > > > > > > > > > > > > > > > > > > > ATS/PRI cap is also generic manner
> > > > > > > > > > > > > > > > > > > > handled by the HV and PCI
> > > > > > > > > > > > > device.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > No, ATS/PRI requires the cooperation
> > > > > > > > > > > > > > > > > > > from the
> > > > > vIOMMU.
> > > > > > > > > > > > > > > > > > > You can simply do ATS/PRI
> > > > > > > > > > > > > > > > > > > passthrough but with an emulated
> > > > > > > > > > > vIOMMU.
> > > > > > > > > > > > > > > > > > And that is not the reason for virtio
> > > > > > > > > > > > > > > > > > device to build
> > > > > > > > > > > > > > > > > > trap+emulation for
> > > > > > > > > > > > > > > > > passthrough member devices.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > vIOMMU is emulated by hypervisor with a
> > > > > > > > > > > > > > > > > PRI queue,
> > > > > > > > > > > > > > > > PRI requests arrive on the PF for the VF.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Shouldn't it arrive at platform IOMMU first?
> > > > > > > > > > > > > > > The path should be PRI -> RC -> IOMMU -> host ->
> > > > > > > > > > > > > > > Hypervisor -> vIOMMU PRI -> guest IOMMU.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > Above sequence seems right.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > And things will be more complicated when
> > > > > > > > > > > > > > > (v)PASID is
> > > used.
> > > > > > > > > > > > > > > So you can't simply let PRI go directly to
> > > > > > > > > > > > > > > the guest with the current
> > > > > > > > > > > architecture.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > In current architecture of the pci VF, PRI
> > > > > > > > > > > > > > does not go directly to the
> > > > > > > > > guest.
> > > > > > > > > > > > > > (and that is not reason to trap and emulate other things).
> > > > > > > > > > > > >
> > > > > > > > > > > > > Ok, so beyond MSI-X we need to trap PRI, and we
> > > > > > > > > > > > > will probably trap other things in the future
> > > > > > > > > > > > > like PASID
> > > assignment.
> > > > > > > > > > > > PRI etc all belong to generic PCI 4K config space region.
> > > > > > > > > > >
> > > > > > > > > > > It's not about the capability, it's about the whole
> > > > > > > > > > > process of PRI request handling. We've agreed that
> > > > > > > > > > > the PRI request needs to be trapped by the
> > > > > > > > > > > hypervisor and then delivered to the
> > > > > vIOMMU.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > > Trap+emulation done in generic manner without
> > > > > > > > > > > > involving virtio or other
> > > > > > > > > > > device types.
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > how can you pass through a hardware PRI
> > > > > > > > > > > > > > > > > request to a guest directly without
> > > > > > > > > > > > > > > > > trapping it
> > > > > > > > > > > > > then?
> > > > > > > > > > > > > > > > > What's more, PCIE allows the PRI to be
> > > > > > > > > > > > > > > > > done in a vendor
> > > > > > > > > > > > > > > > > (virtio) specific way, so you want to break this rule?
> > > > > > > > > > > > > > > > > Or you want to blacklist ATS/PRI
> > > > > > > > > > > > > > > for virtio?
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I was aware of only pci-sig way of PRI.
> > > > > > > > > > > > > > > > Do you have a reference to the ECN that
> > > > > > > > > > > > > > > > enables vendor specific way of PRI? I
> > > > > > > > > > > > > > > would like to read it.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I mean it doesn't forbid us to build a
> > > > > > > > > > > > > > > virtio specific interface for I/O page fault report and
> recovery.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > So PRI of PCI does not allow. It is ODP kind
> > > > > > > > > > > > > > of technique you meant
> > > > > > > > > above.
> > > > > > > > > > > > > > Yes one can build.
> > > > > > > > > > > > > > Ok. unrelated to device migration, so I will
> > > > > > > > > > > > > > park this good discussion for
> > > > > > > > > > > later.
> > > > > > > > > > > > >
> > > > > > > > > > > > > That's fine.
> > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > This will be very good to eliminate IOMMU
> > > > > > > > > > > > > > > > PRI
> > > limitations.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Probably.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > PRI will directly go to the guest driver,
> > > > > > > > > > > > > > > > and guest would interact with IOMMU
> > > > > > > > > > > > > > > to service the paging request through IOMMU APIs.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > With PASID, it can't go directly.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > When the request consist of PASID in it, it can.
> > > > > > > > > > > > > > But again these PCI-SIG extensions of PASID
> > > > > > > > > > > > > > are not related to device
> > > > > > > > > > > > > migration, so I am differing it.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > For PRI in vendor specific way needs a
> > > > > > > > > > > > > > > > separate discussion. It is not related to
> > > > > > > > > > > > > > > live migration.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > PRI itself is not related. But the point is,
> > > > > > > > > > > > > > > you can't simply pass through ATS/PRI now.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > Ah ok. the whole 4K PCI config space where
> > > > > > > > > > > > > > ATS/PRI capabilities are located
> > > > > > > > > > > > > are trapped+emulated by hypervisor.
> > > > > > > > > > > > > > So?
> > > > > > > > > > > > > > So do we start emulating virito interfaces too
> > > > > > > > > > > > > > for
> > > passthrough?
> > > > > > > > > > > > > > No.
> > > > > > > > > > > > > > Can one still continue to trap+emulate?
> > > > > > > > > > > > > > Sure why not?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Then let's not limit your proposal to be used by
> "passthrough"
> > > > > only?
> > > > > > > > > > > > One can possibly build some variant of the
> > > > > > > > > > > > existing virtio member device
> > > > > > > > > > > using same owner and member scheme.
> > > > > > > > > > >
> > > > > > > > > > > It's not about the member/owner, it's about e.g
> > > > > > > > > > > whether the hypervisor can trap and emulate.
> > > > > > > > > > >
> > > > > > > > > > > I've pointed out that what you invent here is
> > > > > > > > > > > actually a partial new transport, for example, a
> > > > > > > > > > > hypervisor can trap and use things like device
> > > > > > > > > > > context in PF to bypass the registers in VF. This is
> > > > > > > > > > > the idea of
> > > > > > > > > transport commands/q.
> > > > > > > > > > >
> > > > > > > > > > I will not mix transport commands which are mainly
> > > > > > > > > > useful for actual device
> > > > > > > > > operation for SIOV only for backward compatibility that
> > > > > > > > > too
> > > optionally.
> > > > > > > > > > One may still choose to have virtio common and device
> > > > > > > > > > config in MMIO
> > > > > > > > > ofcourse at lower scale.
> > > > > > > > > >
> > > > > > > > > > Anyway, mixing migration context with actual SIOV
> > > > > > > > > > specific thing is not correct
> > > > > > > > > as device context is read/write incremental values.
> > > > > > > > >
> > > > > > > > > SIOV is transport level stuff, the transport virtqueue
> > > > > > > > > is designed in a way that is general enough to cover it.
> > > > > > > > > Let's not shift
> > > > > concepts.
> > > > > > > > >
> > > > > > > > Such TVQ is only for backward compatible vPCI composition.
> > > > > > > > For ground up work such TVQ must not be done through the
> > > > > > > > owner
> > > > > device.
> > > > > > >
> > > > > > > That's the idea actually.
> > > > > > >
> > > > > > > > Each SIOV device to have its own channel to communicate
> > > > > > > > directly to the
> > > > > > > device.
> > > > > > > >
> > > > > > > > > One thing that you ignore is that, hypervisor can use
> > > > > > > > > what you invented as a transport for VF, no?
> > > > > > > > >
> > > > > > > > No. by design,
> > > > > > >
> > > > > > > It works like hypervisor traps the virito config and
> > > > > > > forwards it to admin virtqueue and starts the device via device
> context.
> > > > > > It needs more granular support than the management framework
> > > > > > of device
> > > > > context.
> > > > >
> > > > > It doesn't otherwise it is a design defect as you can't recover
> > > > > the device context in the destination.
> > > > >
> > > > > Let me give you an example:
> > > > >
> > > > > 1) in the case of live migration, dst receive migration byte
> > > > > flows and convert them into device context
> > > > > 2) in the case of transporting, hypervisor traps virtio config
> > > > > and convert them into the device context
> > > > >
> > > > > I don't see anything different in this case. Or can you give me an
> example?
> > > > In #1 dst received byte flows one or multiple times.
> > >
> > > How can this be different?
> > >
> > > Transport can also receive initial state incrementally.
> > >
> > Transport is just a simple register RW interface without any caching layer in-
> between.
> > More below.
> > > > And byte flows can be large.
> > >
> > > So when doing transport, it is not that large, that's it. If it can
> > > work with large byte flow, why can't it work for small?
> > Write context can be used (abused) for a different purpose.
> > Read cannot because it is meant to be incremental.
> 
> Well, the hypervisor can just cache what it reads since the last read, what's wrong with it?
> 
But the hypervisor does not know what changed, so it has to do guesswork to find out what to query.
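To illustrate the point, below is a rough sketch of the source-side pre-copy loop as I picture it. All function names here are hypothetical, for illustration only; they are not the admin commands defined in this series.

#include <stddef.h>
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical helpers, for illustration only. */
size_t admin_dev_ctx_read(int owner_pf, uint16_t member_id,
                          uint8_t *buf, size_t len);   /* incremental read */
void   send_to_destination(const uint8_t *buf, size_t len);
bool   migration_converged(void);

static void precopy_loop(int owner_pf, uint16_t member_id)
{
        uint8_t buf[4096];

        while (!migration_converged()) {
                /* The device returns only TLVs whose values changed since the
                 * previous read (e.g. used/avail indices), so the hypervisor
                 * never has to guess which fields to re-query. */
                size_t n = admin_dev_ctx_read(owner_pf, member_id,
                                              buf, sizeof(buf));
                if (n > 0)
                        send_to_destination(buf, n);
        }
}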

> > One can invent a cheap command to read it.
> 
> For sure, but it's not the context here.
>
It is.  
> >
> >
> > >
> > > > So it does not always contain everything. It only contains the new
> > > > delta of the
> > > device context.
> > >
> > > Isn't it just how current PCI transport does?
> > >
> > No. PCI transport has explicit API between device and driver to read or write
> at specific offset and value.
> 
> The point is that they are functional equivalents.
> 
I disagree.
There are two different functionalities.

Functionality_1: an explicit request to read or write a specific field
Functionality_2: read whatever has changed since the last read

Should one merge 1 and 2 and complicate the command?
I prefer not to.

Having two different commands also helps debugging, to differentiate between mgmt. commands and guest-initiated commands. :)
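To make the distinction concrete, a rough sketch of what the two kinds of commands could look like. The structure and field names are hypothetical, purely illustrative; they are not the layouts proposed in this series or defined in the spec.

#include <stdint.h>

typedef uint16_t le16;
typedef uint32_t le32;

/* Functionality_1: explicit, transport-style access at a caller-chosen offset. */
struct virtio_admin_cmd_cfg_access {
        le16 group_member_id;   /* member (e.g. VF) being accessed */
        le16 space;             /* e.g. common config or device config */
        le32 offset;            /* byte offset the caller explicitly asks for */
        le32 length;            /* number of bytes to read or write */
        /* followed by 'length' bytes for a write,
         * or returning 'length' bytes for a read */
};

/* Functionality_2: incremental device context read for migration.
 * No offset/length: the device decides what to return, as a stream of
 * TLV records covering only the fields that changed since the last read. */
struct virtio_admin_cmd_dev_ctx_read {
        le16 group_member_id;
        le16 reserved;
};

struct virtio_dev_ctx_tlv {
        le32 type;              /* which piece of device context */
        le32 length;            /* length of 'value' in bytes */
        uint8_t value[];        /* current value of that field */
};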

> >
> > > Guest configure the following one by one:
> > >
> > > 1) vq size
> > > 2) vq addresses
> > > 3) MSI-X
> > >
> > > etc?
> > >
> > I think you interpreted "incremental" differently than I described.
> > In the device context read, the incremental is:
> >
> > If the hypervisor driver has read the device context twice, the second read
> won't return any new data if nothing changed.
> 
> See above.
>
Yeah, two separate commands are needed.
 
> > For example, if the RSS configuration didn't change between two reads, the
> second read won't return the TLV for the RSS context.
> >
> > While for transport the need is: when the guest asks, the device must return it
> regardless of whether it changed.
> >
> > So the notion of incremental is not by address, but by value.
> >
> > > > For example, VQ configuration is exchanged once between src and dst.
> > > > But VQ avail and used index may be updated multiple times.
> > >
> > > If it can work with multiple times of updating, why can't it work if
> > > we just update it once?
> > Functionally it can work.
> 
> I think you answer yourself.
>
Yes, I don't like abusing the command.
 
> > Performance wise, one does not want to update multiple times, unless there
> is a change.
> >
> > Read as explained above is not meant to return same content again.
> >
> > >
> > > > So here hypervisor do not want to read any specific set of fields
> > > > and
> > > hypervisor is not parsing them either.
> > > > It is just a byte stream for it.
> > >
> > > Firstly, spec must define the device context format, so hypervisor
> > > can understand which byte is what otherwise you can't maintain
> > > migration compatibility.
> > Device context is defined already in the latest version.
> >
> > > Secondly, you can't mandate how the hypervisor is written.
> > >
> > > >
> > > > As opposed to that, in case of transport, the guest explicitly
> > > > asks to read or
> > > write specific bytes.
> > > > Therefore, it is not incremental.
> > >
> > > I'm totally lost. Which part of the transport is not incremental?
> > >
> > > >
> > > > Additionally, if hypervisor has put the trap on virtio config, and
> > > > because the memory device already has the interface for virtio
> > > > config,
> > > >
> > > > Hypervisor can directly write/read from the virtual config to the
> > > > member's
> > > config space, without going through the device context, right?
> > >
> > > If it can do it or it can choose to not. I don't see how it is
> > > related to the discussion here.
> > >
> It is. I don't see the point of the hypervisor not using the native interface provided
> by the member device.
> 
> It really depends on the case, and I see how it duplicates with the functionality
> that is provided by both:
> 
> 1) The existing PCI transport
> 
> or
> 
> 2) The transport virtqueue
> 
I would like to conclude that we disagree in our approaches.
The PCI transport is for the member device to communicate directly between the guest driver and the device.
This is uniform across PFs, VFs, and SIOV.

Admin commands are transport independent and their task is device migration.
One does not replace the other.

The transport virtqueue will never transport driver notifications, hence it does not qualify as a "transport".

For the vdpa case, there is no need for extra admin commands, as the mediation layer can directly use the interface available from the member device itself.

You continue to want to overload admin commands for a dual purpose, which does not make sense to me.
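To show how I see the two planes staying separate, a rough sketch follows. Every function name and mode value below is hypothetical and only for illustration; only the general split between the guest-facing transport and the owner-facing admin commands reflects what is being proposed.

#include <stdint.h>
#include <stddef.h>

/* Hypothetical helpers, for illustration only. */
void   admin_cmd_set_mode(int owner_pf, uint16_t member_id, int mode);
size_t admin_cmd_dev_ctx_read(int owner_pf, uint16_t member_id,
                              uint8_t *buf, size_t len);
void   admin_cmd_dev_ctx_write(int owner_pf, uint16_t member_id,
                               const uint8_t *buf, size_t len);

/* Management plane: the hypervisor migration driver talks only to the owner
 * device via admin commands; it does not mediate the member's virtio interface. */
static void migrate_source(int owner_pf, uint16_t member_id,
                           uint8_t *buf, size_t bufsz)
{
        admin_cmd_set_mode(owner_pf, member_id, /* stop */ 1);
        size_t n = admin_cmd_dev_ctx_read(owner_pf, member_id, buf, bufsz);
        /* ... transfer 'n' bytes of device context to the destination ... */
        (void)n;
}

static void migrate_destination(int owner_pf, uint16_t member_id,
                                const uint8_t *buf, size_t n)
{
        admin_cmd_dev_ctx_write(owner_pf, member_id, buf, n);
        /* Writes take effect when the member is moved back to active mode. */
        admin_cmd_set_mode(owner_pf, member_id, /* active */ 0);
}

/* Control plane: the guest driver keeps using the member device's own PCI
 * transport (config space, BARs, virtqueues) directly; nothing is proxied
 * through the owner. */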

> >
> >  > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > > > it is not good idea to overload management commands with
> > > > > > > > actual run time
> > > > > > > guest commands.
> > > > > > > > The device context read writes are largely for incremental updates.
> > > > > > >
> > > > > > > It doesn't matter if it is incremental or not but
> > > > > > >
> > > > > > It does because you want different functionality only for
> > > > > > purpose of backward
> > > > > compatibility.
> > > > > > That also if the device does not offer them as portion of MMIO BAR.
> > > > >
> > > > > I don't see how it is related to the "incremental part".
> > > > >
> > > > > >
> > > > > > > 1) the function is there
> > > > > > > 2) hypervisor can use that function if they want and virtio
> > > > > > > (spec) can't forbid that
> > > > > > >
> > > > > > It is not about forbidding or supporting.
> > > > > > Its about what functionality to use for management plane and
> > > > > > guest
> > > plane.
> > > > > > Both have different needs.
> > > > >
> > > > > People can have different views, there's nothing we can prevent
> > > > > a hypervisor from using it as a transport as far as I can see.
> > > > For device context write command, it can be used (or probably
> > > > abused) to do
> > > write but I fail to see why to use it.
> > >
> > > The function is there, you can't prevent people from doing that.
> > >
> > One can always mess things up. :)
> > It is not prevented. It is just not the right way to use the interface.
> >
> > > > Because member device already has the interface to do config
> > > > read/write and
> > > it is accessible to the hypervisor.
> > >
> > > Well, it looks self-contradictory again. Are you saying another set
> > > of commands that is similar to device context is needed for non-PCI
> transport?
> > >
> > All this non-PCI transport discussion is just meaningless.
> > Let MMIO bring the concept of a member device; at that point it makes
> sense to discuss.
> 
> It's not necessarily MMIO. For example the SIOV, which I don't think can use the
> existing PCI transport.
> 
> > PCI SIOV is also the PCI device at the end.
> 
> We don't want to end up with two sets of commands to save/load SRIOV and
> SIOV at least.
> 
This proposal ensures that SRIOV and SIOV devices are treated equally.
How a brand-new, non-compatible SIOV device transports this is outside the scope of this work.

> Thanks
> 
> 
> 
> >
> > > >
> > > > The read as_is using device context cannot be done because the
> > > > caller is not
> > > explicitly asking what to read.
> > > > And the interface does not have it, because member device has it.
> > > >
> > > > So lets find the need if incremental bit is needed in the
> > > > device_Context read
> > > command or not or a bits to ask explicitly what to read optionally.
> > > >
> > > > >
> > > > > >
> > > > > > > >
> > > > > > > > For VF driver it has own direct channel via its own BAR to
> > > > > > > > talk to the
> > > > > device.
> > > > > > > So no need to transport via PF.
> > > > > > > > For SIOV for backward compat vPCI composition, it may be needed.
> > > > > > > > Hard to say, if that can be memory mapped as well on the
> > > > > > > > BAR of the
> > > PF.
> > > > > > > > We have seen one device supporting it outside of the virtio.
> > > > > > > > For scale anyway, one needs to use the device own cvq for
> > > > > > > > complex
> > > > > > > configuration.
> > > > > > >
> > > > > > > That's the idea but I meant your current proposal overlaps
> > > > > > > those
> > > functions.
> > > > > > >
> > > > > > Not really. One can have simple virtio config space access
> > > > > > read/write
> > > > > functionality, in addition to what is done here.
> > > > > > And that is still fine. One is doing proxying for guest.
> > > > > > Management plane is doing more than just register proxy.
> > > > >
> > > > > See above, let's figure out whether it is possible as a transport first then.
> > > > >
> > > > Right. lets figure out.
> > > >
> > > > I would still promote to not mix management command with transport
> > > command.
> > >
> > > It's not a mixing, it's just because they are functional equivalents.
> > >
> > It is not.
> > I clarified the fundamental difference between the two.
> > One is explicit read and write.
> > The other is: return read data only on change.
> > For write, it is explicit set and it does not take effect until the mode is changed
> back to active.
> >
> > > > Commands are cheap in nature. For transport if needed, they can be
> > > > explicit
> > > commands.
> > >
> > > It will be a partial duplication of what is being proposed here.
> >
> > There is always some overlap between management plane (hypervisor
> set/get) and control plane (guest driver get/set).
> > >
> > > Thanks
> > >
> > >
> > >
> > > >
> > > > > >
> > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > > If for that is some admin commands are missing,
> > > > > > > > > > > > may be one can add
> > > > > > > > > them.
> > > > > > > > > > >
> > > > > > > > > > > I would then build the device context commands on
> > > > > > > > > > > top of the transport commands/q, then it would be complete.
> > > > > > > > > > >
> > > > > > > > > > > > No need to step on toes of use cases as they are different...
> > > > > > > > > > > >
> > > > > > > > > > > > > I've shown you that
> > > > > > > > > > > > >
> > > > > > > > > > > > > 1) you can't easily say you can pass through all
> > > > > > > > > > > > > the virtio facilities
> > > > > > > > > > > > > 2) how ambiguous for terminology like "passthrough"
> > > > > > > > > > > > >
> > > > > > > > > > > > It is not, it is well defined in v3, v2.
> > > > > > > > > > > > One can continue to argue and keep defining the
> > > > > > > > > > > > variant and still call it data
> > > > > > > > > > > path acceleration and then claim it as passthrough ...
> > > > > > > > > > > > But I won't debate this anymore as its just
> > > > > > > > > > > > non-technical aspects of least
> > > > > > > > > > > interest.
> > > > > > > > > > >
> > > > > > > > > > > You use this terminology in the spec which is all
> > > > > > > > > > > about technical, and you think how to define it is a
> > > > > > > > > > > matter of non-technical. This is self-contradictory.
> > > > > > > > > > > If you fail, it probably means it's
> > > > > > > ambiguous.
> > > > > > > > > > > Let's don't use that terminology.
> > > > > > > > > > >
> > > > > > > > > > What it means is described in theory of operation.
> > > > > > > > > >
> > > > > > > > > > > > We have technical tasks and more improved specs to
> > > > > > > > > > > > update going
> > > > > > > > > forward.
> > > > > > > > > > >
> > > > > > > > > > > It's a burden to do the synchronization.
> > > > > > > > > > We have discussed this.
> > > > > > > > > > In current proposed the member device is not
> > > > > > > > > > bifurcated,
> > > > > > > > >
> > > > > > > > > It is. Part of the functions were carried via the PCI
> > > > > > > > > interface, some are carried via owner. You end up with
> > > > > > > > > two drivers to drive the
> > > > > > > devices.
> > > > > > > > >
> > > > > > > > Nop.
> > > > > > > > All admin work of device migration is carried out via the
> > > > > > > > owner
> > > device.
> > > > > > > > All guest triggered work is carried out using VF itself.
> > > > > > >
> > > > > > > Guests don't (or can't) care about how the hypervisor is structured.
> > > > > > For passthrough mode, it just cannot be structured inside the VF.
> > > > >
> > > > > Well, again, we are talking about different things.
> > > > >
> > > > > >
> > > > > > > So we're discussing the view of device, member devices needs
> > > > > > > to server for
> > > > > > >
> > > > > > > 1) request from the transport (it's guest in your context)
> > > > > > > 2) request from the owner
> > > > > >
> > > > > > Doing #2 of the owner on the member device functionality do
> > > > > > not work when
> > > > > hypervisor do not have access to the member device.
> > > > >
> > > > > I don't get here, isn't 2) just what we invent for admin commands?
> > > > > Driver sends commands to the owner, owner forward those requests
> > > > > to the member?
> > > > I am lost with the term "driver" without the notion of a guest/hypervisor
> prefix.
> > > >
> > > > In one model,
> > > > Member device does everything through its native interface =
> > > > virtio config
> > > and device space, cvq, data vqs etc.
> > > > Here member device do not forward anything to its owner.
> > > >
> > > > The live migration hypervisor driver who has the knowledge of live
> > > > migration
> > > flow, accesses the owner device and get the side band member's
> > > information to control it.
> > > > So member driver do not forward anything here to owner driver.
> > > >
> >


