Subject: Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
On Thu, Nov 16, 2023 at 1:39 AM Parav Pandit <parav@nvidia.com> wrote:
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Monday, November 13, 2023 9:03 AM
> >
> > On Thu, Nov 9, 2023 at 2:25 PM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > > > From: Jason Wang <jasowang@redhat.com>
> > > > Sent: Tuesday, November 7, 2023 9:35 AM
> > > >
> > > > On Mon, Nov 6, 2023 at 3:05 PM Parav Pandit <parav@nvidia.com> wrote:
> > > > >
> > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > Sent: Monday, November 6, 2023 12:05 PM
> > > > > >
> > > > > > On Thu, Nov 2, 2023 at 2:10 PM Parav Pandit <parav@nvidia.com> wrote:
> > > > > > >
> > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > Sent: Thursday, November 2, 2023 9:56 AM
> > > > > > > >
> > > > > > > > On Wed, Nov 1, 2023 at 11:32 AM Parav Pandit <parav@nvidia.com> wrote:
> > > > > > > > >
> > > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > > Sent: Wednesday, November 1, 2023 6:04 AM
> > > > > > > > > >
> > > > > > > > > > On Tue, Oct 31, 2023 at 1:30 PM Parav Pandit <parav@nvidia.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > > > > Sent: Tuesday, October 31, 2023 7:05 AM
> > > > > > > > > > > >
> > > > > > > > > > > > On Mon, Oct 30, 2023 at 12:47 PM Parav Pandit <parav@nvidia.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > From: virtio-comment@lists.oasis-open.org <virtio-comment@lists.oasis-open.org> On Behalf Of Jason Wang
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Thu, Oct 26, 2023 at 11:45 AM Parav Pandit <parav@nvidia.com> wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > > > > > > > > Sent: Thursday, October 26, 2023 6:16 AM
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Wed, Oct 25, 2023 at 3:03 PM Parav Pandit <parav@nvidia.com> wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > > > > > > > > > > > > > Sent: Wednesday, October 25, 2023 6:59 AM
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > For passthrough PASID assignment vq is not needed.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > How do you know that?
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Because for passthrough, the hypervisor is not involved in dealing with VQ at all.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Ok, so if I understand correctly, you are saying your design can't work for the case of PASID assignment.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > No.
> > > > > > > > > > > > > > > PASID assignment will happen from the guest for its own use, and device migration will just work fine because the device context will capture this.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > It's not about device context. We're discussing "passthrough", no?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Not sure; we are discussing the same. A member device is passthrough to the guest, dealing with its own PASIDs and a virtio interface for some VQ assignment to PASID. So the VQ context captured by the hypervisor will have some PASID attached to this VQ. The device context will be updated.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > You want all virtio stuff to be "passthrough", but assigning a PASID to a specific virtqueue in the guest must be trapped.
> > > > > > > > > > > > >
> > > > > > > > > > > > > No. PASID assignment to a specific virtqueue in the guest must go directly from guest to device.
> > > > > > > > > > > >
> > > > > > > > > > > > This works like setting CR3; you can't simply let it go from guest to host. The host IOMMU driver needs to know the PASID to program the IO page tables correctly.
> > > > > > > > > > >
> > > > > > > > > > > This will be done by the IOMMU.
> > > > > > > > > > > > >
> > > > > > > > > > > > > When the guest iommu needs to communicate anything for this PASID, it will come through its proper IOMMU channel/hypercall.
> > > > > > > > > > > >
> > > > > > > > > > > > Let's say using PASID X for queue 0: this knowledge is beyond the IOMMU scope but belongs to virtio. Or please explain how it can work when it goes directly from guest to device.
> > > > > > > > > > >
> > > > > > > > > > > We are yet to ever see a spec for PASID to VQ assignment.
> > > > > > > > > >
> > > > > > > > > > It has one.
> > > > > > > > >
> > > > > > > > > Ok, for theory's sake it is there.
> > > > > > > > > > >
> > > > > > > > > > > The virtio driver will assign the PASID directly from the guest driver to the device using a create_vq(pasid=X) command. The same process is somehow attached to the PASID by the guest OS. The whole PASID range is known to the hypervisor when the device is handed over to the guest VM.
> > > > > > > > > >
> > > > > > > > > > How can it know?
> > > > > > > > > > >
> > > > > > > > > > > So PASID mapping is set up by the hypervisor IOMMU at this point.
> > > > > > > > > >
> > > > > > > > > > You disallow the PASID to be virtualized here. What's more, such a PASID passthrough has security implications.
> > > > > > > > >
> > > > > > > > > No. The virtio spec is not disallowing it. At least for sure, this series is not the one.
> > > > > > > > > My main point is, the virtio device interface will not be the source of the hypercall to program the IOMMU in the hypervisor. It is something to be done by the IOMMU side.
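For concreteness, the create_vq(pasid=X) idea mentioned above could look something like the sketch below. This is purely hypothetical; no such command exists in the virtio spec today, and the structure layout and name are invented for illustration only (le16/le32 are the spec's usual little-endian field types):

  /* Hypothetical guest control command binding a PASID to a virtqueue.
   * Not part of the virtio spec; layout invented for illustration. */
  struct virtio_ctrl_vq_set_pasid {
          le16 vq_index;   /* index of the virtqueue to bind */
          le16 padding;
          le32 pasid;      /* PASID as the guest sees it (a vPASID) */
  };

The open question in this thread is exactly where such a command would be handled: if the platform cannot remap vPASID to pPASID in hardware, the hypervisor needs virtio-specific knowledge to trap it; if it can, the command can go guest-to-device directly.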
> > > > > > > >
> > > > > > > > So unless vPASID can be used by the hardware you need to trap the mapping from a PASID to a virtqueue. Then you need virtio specific knowledge.
> > > > > > >
> > > > > > > vPASID by hardware is unlikely to be used by hw PCI EP devices at least in any near term future. This requires either a vPASID to pPASID table in the device or in the IOMMU.
> > > > > >
> > > > > > So we are on the same page. Claiming a method that can only work for passthrough or emulation is not good. We all know virtualization is passthrough + emulation.
> > > > >
> > > > > Again, I agree, but I won't generalize it here.
> > > > > > > > > >
> > > > > > > > > > Again, we are talking about different things. I've tried to show you that there are cases where passthrough can't work, but if you think the only way for migration is to use passthrough in every case, you will probably fail.
> > > > > > > > >
> > > > > > > > > I didn't say the only way for migration is passthrough. Passthrough is clearly one way. Other ways may be possible.
> > > > > > > > > The virtio device is not the conduit for this exchange.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > There are works ongoing to make vPASID work for the guest like vSVA. Virtio doesn't differ from other devices.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Passthrough does not run like SVA.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Great, you find another limitation of "passthrough" by yourself.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > No, it is not a limitation; it is just that it does not need complex SVA to split the device for unrelated usage.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > How can you limit the user in the guest to not use vSVA?
> > > > > > > > > > > > >
> > > > > > > > > > > > > He he, I am not limiting; again misunderstanding or wrong attribution. I explained that the hypervisor for passthrough does not need SVA. The guest can do anything it wants from the guest OS with the member device.
> > > > > > > > > > > >
> > > > > > > > > > > > Ok, so the point still stands, see above.
> > > > > > > > > > >
> > > > > > > > > > > I don't think so. The guest owns its PASID space
> > > > > > > > > >
> > > > > > > > > > Again, vPASID to PASID can't be done in hardware unless I miss some recent features of IOMMUs.
> > > > > > > > >
> > > > > > > > > Cpu vendors have different ways of doing vPASID to pPASID.
> > > > > > > >
> > > > > > > > At least for the current version of major IOMMU vendors, such translation (aka PASID remapping) is not implemented in the hardware, so it needs to be trapped first.
> > > > > > >
> > > > > > > Right. So it is really far in the future, at least a few years away. It is still an early space for virtio.
> > > > > > > > > > >
> > > > > > > > > > > and directly communicates like any other device attribute.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Each passthrough device has PASIDs from its own space fully managed by the guest. Some cpus required vPASID, and SIOV is not going this way anymore.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Then how to migrate? Invent a full set of something else through another giant series like this to migrate to the SIOV thing? That's a mess for sure.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > SIOV will for sure reuse most or all parts of this work, almost entirely as_is. vPASID is a cpu/platform specific thing, not part of the SIOV devices.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > If at all it is done, it will be done from the guest by the driver using the virtio interface.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Then you need to trap. Such things couldn't be passed through to guests directly.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Only the PASID capability is trapped. PASID allocation and usage is directly from the guest.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > How can you achieve this? Assigning a PASID to a device is completely device(virtio) specific. How can you use a general layer without the knowledge of virtio to trap that?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > When one wants to map vPASID to pPASID a platform needs to be involved.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I'm not talking about how to map vPASID to pPASID; it's out of the scope of virtio. I'm talking about assigning a vPASID to a specific virtqueue or other virtio function in the guest.
> > > > > > > > > > > > >
> > > > > > > > > > > > > That can be done in the guest. The key is the guest won't know that it is dealing with vPASID.
> > > > > > > > > > > > > It will follow the same principle from your paper of equivalency, where the virtio software layer will assign a PASID to a VQ and communicate it to the device.
> > > > > > > > > > > > > Anyway, all of this is just a digression from the current series.
> > > > > > > > > > > >
> > > > > > > > > > > > It's not; as you mention that only MSI-X is trapped, I give you another one.
> > > > > > > > > > >
> > > > > > > > > > > PASID access from the guest is to be done fully by the guest IOMMU. Not by virtio devices.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > You need a virtio specific queue or capability to assign a PASID to a specific virtqueue, and that can't be done without trapping and without virtio specific knowledge.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I disagree. PASID assignment to a virtqueue in the future, from guest virtio driver to device, is a uniform method. Whether it's the PF assigning a PASID to a VQ of itself, or the VF driver in the guest assigning a PASID to a VQ: all the same. Only IOMMU layer hypercalls will know how to deal with PASID assignment at the platform layer to set up the domain etc. tables.
> > > > > > > > > > > > > And this is way beyond our device migration discussion. By any means, if you were implying that somehow VQ to PASID assignment _may_ need trap+emulation, and hence the whole device migration is to depend on some trap+emulation, then surely I do not agree to it.
> > > > > > > > > > > >
> > > > > > > > > > > > See above.
> > > > > > > > > > >
> > > > > > > > > > > Yeah, I disagree with such an implication.
> > > > > > > > > > > > >
> > > > > > > > > > > > > The PASID equivalent in the mlx5 world is ODP_MR+PD isolating the guest process, and all of that has just worked on the efficiency and equivalence principle for a decade now without any trap+emulation.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > When a virtio passthrough device is in the guest, it has all its PASIDs accessible.
> > > > > > > > > > > > > > > All this is a large deviation from the current discussion of this series, so I will keep it short.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Regardless, it is not relevant to passthrough mode as PASID is yet another resource. And for some cpus, if it is trapped, it is a generic layer that does not require virtio involvement.
> > > > > > > > > > > > > > > > > So the virtio interface asking to trap something because a generic facility has done it is not the approach.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > This misses the point of PASID. How to use PASID is totally device specific.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Sure, and how to virtualize vPASID/pPASID is platform specific as a single PASID can be used by multiple devices and processes.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > See above, I think we're talking about different things.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Capabilities of #2 are generic across all pci devices, so they will be handled by the HV. The ATS/PRI cap is also handled in a generic manner by the HV and PCI device.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > No, ATS/PRI requires the cooperation from the vIOMMU. You can simply do ATS/PRI passthrough but with an emulated vIOMMU.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > And that is not the reason for the virtio device to build trap+emulation for passthrough member devices.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > vIOMMU is emulated by hypervisor with a PRI queue,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > PRI requests arrive on the PF for the VF.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Shouldn't it arrive at the platform IOMMU first? The path should be PRI -> RC -> IOMMU -> host -> Hypervisor -> vIOMMU -> PRI -> guest IOMMU.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Above sequence seems right.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > And things will be more complicated when (v)PASID is used. So you can't simply let PRI go directly to the guest with the current architecture.
> > > > > > > > > > > > >
> > > > > > > > > > > > > In the current architecture of the pci VF, PRI does not go directly to the guest. (And that is not a reason to trap and emulate other things.)
> > > > > > > > > > > >
> > > > > > > > > > > > Ok, so beyond MSI-X we need to trap PRI, and we will probably trap other things in the future like PASID assignment.
> > > > > > > > > > >
> > > > > > > > > > > PRI etc. all belong to the generic PCI 4K config space region.
> > > > > > > > > >
> > > > > > > > > > It's not about the capability, it's about the whole process of PRI request handling. We've agreed that the PRI request needs to be trapped by the hypervisor and then delivered to the vIOMMU.
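A rough sketch of the hypervisor-side reflection step in the PRI path agreed above; every type, helper, and name here is hypothetical, and no real IOMMU or VMM API is implied:

  /* All types and functions below are invented for illustration. */
  struct page_req {
          u32 dev_id;   /* requester ID (BDF) */
          u32 pasid;    /* PASID carried in the request, if any */
          u64 addr;     /* faulting IOVA */
  };

  void reflect_page_request(struct vm *vm, struct page_req *pr)
  {
          /* The host cannot resolve a fault of an assigned VF by itself;
           * rewrite the host-visible IDs into guest-visible ones and
           * inject the request into the emulated vIOMMU. */
          pr->dev_id = guest_bdf_of(vm, pr->dev_id);  /* host BDF -> guest BDF */
          pr->pasid  = vpasid_of(vm, pr->pasid);      /* pPASID -> vPASID */

          /* The guest IOMMU driver services the fault and responds via
           * the vIOMMU's PRI queue; the hypervisor then completes the
           * PRI response upstream. */
          viommu_report_page_request(vm, pr);
  }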
> > > > > > > > > > >
> > > > > > > > > > > Trap+emulation done in a generic manner, without involving virtio or other device types.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > how can you pass through a hardware PRI request to a guest directly without trapping it then? What's more, PCIE allows the PRI to be done in a vendor (virtio) specific way, so you want to break this rule? Or you want to blacklist ATS/PRI for virtio?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I was aware of only the pci-sig way of PRI. Do you have a reference to the ECN that enables a vendor specific way of PRI? I would like to read it.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I mean it doesn't forbid us to build a virtio specific interface for I/O page fault report and recovery.
> > > > > > > > > > > > >
> > > > > > > > > > > > > So PRI of PCI does not allow it. It is the ODP kind of technique you meant above. Yes, one can build it. Ok, unrelated to device migration, so I will park this good discussion for later.
> > > > > > > > > > > >
> > > > > > > > > > > > That's fine.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > This will be very good to eliminate IOMMU PRI limitations.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Probably.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > PRI will directly go to the guest driver, and the guest would interact with the IOMMU to service the paging request through IOMMU APIs.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > With PASID, it can't go directly.
> > > > > > > > > > > > >
> > > > > > > > > > > > > When the request has a PASID in it, it can. But again, these PCI-SIG extensions of PASID are not related to device migration, so I am deferring it.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > PRI in a vendor specific way needs a separate discussion. It is not related to live migration.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > PRI itself is not related. But the point is, you can't simply pass through ATS/PRI now.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Ah ok. The whole 4K PCI config space where the ATS/PRI capabilities are located is trapped+emulated by the hypervisor. So? So do we start emulating virtio interfaces too for passthrough? No. Can one still continue to trap+emulate?
> > > > > > > > > > > > > Sure, why not?
> > > > > > > > > > > >
> > > > > > > > > > > > Then let's not limit your proposal to be used by "passthrough" only?
> > > > > > > > > > >
> > > > > > > > > > > One can possibly build some variant of the existing virtio member device using the same owner and member scheme.
> > > > > > > > > >
> > > > > > > > > > It's not about the member/owner, it's about e.g. whether the hypervisor can trap and emulate.
> > > > > > > > > > I've pointed out that what you invent here is actually a partial new transport; for example, a hypervisor can trap and use things like device context in the PF to bypass the registers in the VF. This is the idea of transport commands/q.
> > > > > > > > >
> > > > > > > > > I will not mix in transport commands, which are mainly useful for actual device operation, for SIOV only for backward compatibility, and that too optionally. One may still choose to have virtio common and device config in MMIO, of course at lower scale.
> > > > > > > > > Anyway, mixing migration context with the actual SIOV specific thing is not correct, as device context is read/write incremental values.
> > > > > > > >
> > > > > > > > SIOV is transport level stuff; the transport virtqueue is designed in a way that is general enough to cover it. Let's not shift concepts.
> > > > > > >
> > > > > > > Such a TVQ is only for backward compatible vPCI composition. For ground-up work such a TVQ must not be done through the owner device.
> > > > > >
> > > > > > That's the idea actually.
> > > > > > >
> > > > > > > Each SIOV device is to have its own channel to communicate directly to the device.
> > > > > > > >
> > > > > > > > One thing that you ignore is that the hypervisor can use what you invented as a transport for the VF, no?
> > > > > > >
> > > > > > > No. by design,
> > > > > >
> > > > > > It works like the hypervisor traps the virtio config, forwards it to the admin virtqueue and starts the device via the device context.
> > > > >
> > > > > It needs more granular support than the management framework of device context.
> > > >
> > > > It doesn't, otherwise it is a design defect as you can't recover the device context in the destination.
> > > > Let me give you an example:
> > > > 1) in the case of live migration, dst receives migration byte flows and converts them into device context
> > > > 2) in the case of transporting, the hypervisor traps virtio config and converts it into the device context
> > > > I don't see anything different in this case. Or can you give me an example?
> > >
> > > In #1 dst receives byte flows one or multiple times.
> >
> > How can this be different? Transport can also receive initial state incrementally.
>
> Transport is just a simple register RW interface without any caching layer in-between. More below.
> > >
> > > And byte flows can be large.
> >
> > So when doing transport, it is not that large, that's it. If it can work with a large byte flow, why can't it work for a small one?
>
> A write context can be used (abused) for a different purpose. A read cannot, because it is meant to be incremental.
Well, the hypervisor can just cache what it has read since the last read, what's wrong with it?

> One can invent a cheap command to read it.

For sure, but it's not the context here.

> > > So it does not always contain everything. It only contains the new delta of the device context.
> >
> > Isn't that just how the current PCI transport does it?
>
> No. The PCI transport has an explicit API between device and driver to read or write at a specific offset and value.

The point is that they are functional equivalents.

> > The guest configures the following one by one:
> > 1) vq size
> > 2) vq addresses
> > 3) MSI-X
> > etc?
>
> I think you interpreted "incremental" differently than I described. In the device context read, the incremental is:
> If the hypervisor driver has read the device context twice, the second read won't return any new data if nothing changed.

See above.

> For example, if the RSS configuration didn't change between two reads, the second read won't return the TLV for the RSS context.
> While for transport the need is: when the guest asks, the device must return it regardless of any change.
> So the notion of incremental is not by address, but by value.
> >
> > For example, VQ configuration is exchanged once between src and dst.
> > > But VQ avail and used index may be updated multiple times.
> >
> > If it can work with multiple times of updating, why can't it work if we just update it once?
>
> Functionally it can work.

I think you answered it yourself.

> Performance wise, one does not want to update multiple times, unless there is a change.
> Read, as explained above, is not meant to return the same content again.
> > > So here the hypervisor does not want to read any specific set of fields, and the hypervisor is not parsing them either.
> > > It is just a byte stream for it.
> >
> > Firstly, the spec must define the device context format, so the hypervisor can understand which byte is what; otherwise you can't maintain migration compatibility.
>
> Device context is defined already in the latest version.
> > Secondly, you can't mandate how the hypervisor is written.
> > > As opposed to that, in the case of transport, the guest explicitly asks to read or write specific bytes.
> > > Therefore, it is not incremental.
> >
> > I'm totally lost. Which part of the transport is not incremental?
> > > Additionally, if the hypervisor has put the trap on virtio config, and because the member device already has the interface for virtio config,
> > > the hypervisor can directly write/read from the virtual config to the member's config space, without going through the device context, right?
> >
> > It can do it, or it can choose not to. I don't see how it is related to the discussion here.
>
> It is. I don't see the point of the hypervisor not using the native interface provided by the member device.

It really depends on the case, and I see how it duplicates the functionality that is provided by both:
1) The existing PCI transport or
2) The transport virtqueue

> > > > > > > it is not a good idea to overload management commands with actual run time guest commands.
> > > > > > > The device context read/writes are largely for incremental updates.
> > > > > >
> > > > > > It doesn't matter if it is incremental or not but
> > > > >
> > > > > It does, because you want different functionality only for the purpose of backward compatibility.
> > > > > That too, only if the device does not offer them as a portion of the MMIO BAR.
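To make the "incremental by value" semantics described above concrete, here is a minimal sketch of how a migration driver might drain the device context through the owner device; the command wrapper and helper names are hypothetical, not the series' actual API:

  /* Hypothetical admin-command wrappers; signatures invented for this sketch. */
  int admin_dev_ctx_read(struct dev *owner, u16 member_id, void *buf, size_t len);
  bool member_is_stopped(struct dev *owner, u16 member_id);
  void migration_stream_send(int dst_fd, const void *buf, size_t len);

  static void drain_device_context(struct dev *owner, u16 member_id, int dst_fd)
  {
          u8 buf[4096];
          int len;

          for (;;) {
                  /* Per the semantics above, a read returns only TLVs whose
                   * value changed since the previous read; an unchanged RSS
                   * context TLV, for example, is not returned again. */
                  len = admin_dev_ctx_read(owner, member_id, buf, sizeof(buf));
                  if (len > 0)
                          migration_stream_send(dst_fd, buf, len);  /* forward delta */
                  else if (member_is_stopped(owner, member_id))
                          break;  /* device stopped and context fully drained */
          }
  }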
> > > >
> > > > I don't see how it is related to the "incremental part".
> > > > > >
> > > > > > 1) the function is there
> > > > > > 2) the hypervisor can use that function if they want, and virtio (spec) can't forbid that
> > > > >
> > > > > It is not about forbidding or supporting. It's about what functionality to use for the management plane and the guest plane. Both have different needs.
> > > >
> > > > People can have different views; there's nothing we can do to prevent a hypervisor from using it as a transport as far as I can see.
> > >
> > > The device context write command can be used (or probably abused) to do a write, but I fail to see why to use it.
> >
> > The function is there, you can't prevent people from doing that.
>
> One can always mess up itself. :)
> It is not prevented. It is just not the right way to use the interface.
> > > Because the member device already has the interface to do config read/write and it is accessible to the hypervisor.
> >
> > Well, it looks self-contradictory again. Are you saying another set of commands similar to device context is needed for non-PCI transport?
>
> All this non-pci transport discussion is just meaningless. Let MMIO bring the concept of a member device; at that point something makes sense to discuss.

It's not necessarily MMIO. For example the SIOV, which I don't think can use the existing PCI transport.

> PCI SIOV is also a PCI device in the end.

We don't want to end up with two sets of commands to save/load SRIOV and SIOV at least.

Thanks

> > > The read as_is using device context cannot be done because the caller is not explicitly asking what to read. And the interface does not have it, because the member device has it.
> > > So let's figure out whether an incremental bit is needed in the device_context read command, or optionally bits to ask explicitly what to read.
> > > > > > >
> > > > > > > The VF driver has its own direct channel via its own BAR to talk to the device. So no need to transport via the PF.
> > > > > > > For SIOV, for backward compat vPCI composition, it may be needed. Hard to say if that can be memory mapped as well on the BAR of the PF. We have seen one device supporting it outside of virtio. For scale anyway, one needs to use the device's own cvq for complex configuration.
> > > > > >
> > > > > > That's the idea, but I meant your current proposal overlaps those functions.
> > > > >
> > > > > Not really. One can have simple virtio config space access read/write functionality, in addition to what is done here. And that is still fine. One is doing proxying for the guest. The management plane is doing more than just register proxying.
> > > >
> > > > See above, let's figure out whether it is possible as a transport first then.
> > >
> > > Right, let's figure it out.
> > > I would still promote not mixing management commands with transport commands.
> >
> > It's not mixing; it's just that they are functional equivalents.
>
> It is not. I clarified the fundamental difference between the two.
> One is explicit read and write.
> The other is: return read data on change.
> For write, it is an explicit set, and it does not take effect until the mode is changed back to active.
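And for the write side described just above, a sketch of the ordering, again with invented command names: device context writes are staged while the member is stopped and only take effect on the transition back to active:

  /* Hypothetical flow: context writes are staged, not applied immediately. */
  admin_mode_set(owner, member_id, MEMBER_MODE_STOP);
  admin_dev_ctx_write(owner, member_id, ctx_buf, ctx_len);  /* staged only */
  admin_mode_set(owner, member_id, MEMBER_MODE_ACTIVE);     /* applied here */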
> > > Commands are cheap in nature. For transport, if needed, they can be explicit commands.
> >
> > It will be a partial duplication of what is being proposed here.
>
> There is always some overlap between the management plane (hypervisor set/get) and the control plane (guest driver get/set).
> >
> > Thanks
> > > > > > > > > > >
> > > > > > > > > > > If for that some admin commands are missing, maybe one can add them.
> > > > > > > > > >
> > > > > > > > > > I would then build the device context commands on top of the transport commands/q; then it would be complete.
> > > > > > > > > > >
> > > > > > > > > > > No need to step on the toes of use cases as they are different...
> > > > > > > > > > > >
> > > > > > > > > > > > I've shown you that
> > > > > > > > > > > > 1) you can't easily say you can pass through all the virtio facilities
> > > > > > > > > > > > 2) how ambiguous terminology like "passthrough" is
> > > > > > > > > > >
> > > > > > > > > > > It is not, it is well defined in v3, v2. One can continue to argue and keep defining the variant and still call it data path acceleration and then claim it as passthrough ... But I won't debate this anymore as it's just non-technical aspects of least interest.
> > > > > > > > > >
> > > > > > > > > > You use this terminology in the spec, which is all about technical matters, and you think how to define it is a non-technical matter. This is self-contradictory. If you fail, it probably means it's ambiguous. Let's not use that terminology.
> > > > > > > > >
> > > > > > > > > What it means is described in the theory of operation.
> > > > > > > > > > >
> > > > > > > > > > > We have technical tasks and more improved specs to update going forward.
> > > > > > > > > >
> > > > > > > > > > It's a burden to do the synchronization.
> > > > > > > > >
> > > > > > > > > We have discussed this. In the current proposal the member device is not bifurcated,
> > > > > > > >
> > > > > > > > It is. Part of the functions is carried via the PCI interface, some are carried via the owner. You end up with two drivers to drive the devices.
> > > > > > >
> > > > > > > Nope. All admin work of device migration is carried out via the owner device. All guest triggered work is carried out using the VF itself.
> > > > > >
> > > > > > Guests don't (or can't) care about how the hypervisor is structured.
> > > > >
> > > > > For passthrough mode, it just cannot be structured inside the VF.
> > > >
> > > > Well, again, we are talking about different things.
> > > > > >
> > > > > > So we're discussing the view of the device; member devices need to serve
> > > > > > 1) requests from the transport (it's the guest in your context)
> > > > > > 2) requests from the owner
> > > > >
> > > > > Doing #2 of the owner on the member device functionality does not work when the hypervisor does not have access to the member device.
> > > >
> > > > I don't get it here; isn't 2) just what we invented for admin commands? The driver sends commands to the owner, and the owner forwards those requests to the member?
> > > I am lost with the term "driver" without the notion of a guest/hypervisor prefix.
> > >
> > > In one model, the member device does everything through its native interface = virtio config and device space, cvq, data vqs etc. Here the member device does not forward anything to its owner.
> > >
> > > The live migration hypervisor driver, which has the knowledge of the live migration flow, accesses the owner device and gets the member's sideband information to control it. So the member driver does not forward anything here to the owner driver.