virtio-comment message

Subject: RE: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration

From: Parav Pandit <parav@nvidia.com>
To: Jason Wang <jasowang@redhat.com>
Date: Tue, 24 Oct 2023 10:01:54 +0000

> From: Jason Wang <jasowang@redhat.com>
> Sent: Tuesday, October 24, 2023 10:27 AM
> 
> On Mon, Oct 23, 2023 at 12:43 PM Parav Pandit <parav@nvidia.com> wrote:
> >
> >
> >
> > > From: Jason Wang <jasowang@redhat.com>
> > > Sent: Monday, October 23, 2023 9:15 AM
> > >
> > > > On Wed, Oct 18, 2023 at 6:23 PM Parav Pandit <parav@nvidia.com> wrote:
> > > >
> > > >
> > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > Sent: Wednesday, October 18, 2023 3:26 PM
> > > >
> > > > > For completeness, and to shorten the thread, can you please list
> > > > > known issues/use cases that are addressed by the status bit
> > > > > interface and how you plan for them to be addressed?
> > > >
> > > > I will avoid listing known issues for a moment for status bit in this email.
> > > >
> > > > Status bit interface helps in following good ways.
> > > > 1. suspend/resume the device fully by the guest by negotiating the
> > > > new
> > > feature.
> > > > This can be useful in the guest-controlled PM flows of suspend/resume.
> > > > I still think for this, only feature bit is necessary, and
> > > > device_status
> > > modification is not needed.
> > >
> > > Which feature bit did you mean here?
> > >
> > A new feature bit to indicate the guest that device supports suspend and
> resume, hence, there is no need to reset the device and destroy resources like
> how it is done today.
> 
> Well, I don't see how it is different from what LingShan proposed.
The difference is that, in passthrough mode, it will be fully controlled by the guest VM without involving the hypervisor.
It will work even when device migration is ongoing.
What Lingshan proposed involved messing with the device status.
It should be a separate register, as Jingchen proposed, or no register at all if the PCI transport supports it.
> 
> >
> > > > D0->D3 and D3->D0 transition of the pci can suspend and resume the
> > > > > device
> > > which can preserve the last device_status value before entering D3.
> > >
> > > It's not only about the device status. I would not repeat the
> > > question I've asked in another thread.
> > >
> > > What's more, if you really want to suspend/freeze at PCI level and
> > > deal with PCI specific issues like P2P.  You should really try to
> > > leverage or invent a PCI mechanism instead of trying to carry such
> > > semantics via a virtio specific stuff like adminq. Solving transport
> > > specific problems at the virtio level is a layer violation.
> > >
> > PCI spec has already defined what it needs to.
> 
> If PCI spec has good support for suspend/resume, why bother inventing
> mechanisms in virtio?
> 
Because virtio today does not know whether PCI-level suspend/resume will actually work; in the past it has not worked even when the PM capability was exposed.
So only a feature bit is needed.
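
As a rough illustration of that decision (a sketch only, not taken from the patch series): the feature bit tells the guest driver whether PCI PM can be trusted to preserve device state. The names `VIRTIO_F_PM_HYPOTHETICAL`, `Device`, and `guest_suspend_strategy` are assumptions made for this example; the thread only refers to an "_F_PM" bit and assigns it no value.

```python
from dataclasses import dataclass

# Hypothetical bit value; the thread does not assign a number to "_F_PM".
VIRTIO_F_PM_HYPOTHETICAL = 1 << 0


@dataclass
class Device:
    has_pci_pm_cap: bool      # PCI PM capability is exposed by the device
    negotiated_features: int  # feature bits negotiated by driver and device


def guest_suspend_strategy(dev: Device) -> str:
    """Pick how the guest driver suspends the device (illustrative only)."""
    pm_negotiated = bool(dev.negotiated_features & VIRTIO_F_PM_HYPOTHETICAL)
    if dev.has_pci_pm_cap and pm_negotiated:
        # The device promised that D0->D3 and D3->D0 preserve its state.
        return "pci-pm"
    # Without the feature bit, the PM capability alone is not trusted:
    # fall back to resetting the device and re-creating resources.
    return "reset-and-reinit"
```

With the PM capability exposed but the bit not negotiated, the sketch falls back to the reset path, which is the case the feature bit is meant to disambiguate.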

> > SR-PCIM interface is already concluded being outside of PCI-spec by the pci-
> sig.
> > And no, there is no layer violation.
> >
> > Any non PCI member device can always implement necessary STOP mode as
> no-op.
> >
> > And all of those talk make sense when one creates MMIO based member
> device, until that point is just objections...
> 
> They are different layers:
> 
> 1) suspend/resume at virtio level
> 2) suspend/resume at transport level
> 
> We need both of them to satisfy different cases. Just as we need to reset at both
> virtio and VF(FLR). Lingshan proposes 1) while it looks to me you propose 2) via
> virtio adminq but you said it has been supported by PCI which is then a
> duplication.
> 
#1 is needed and is to be owned by the guest driver in passthrough mode.
I didn't propose #2.
I proposed #2 be controlled by the VMM/hypervisor (via admin cmd), which is in charge of the VM suspend/resume flow.

> >
> > > > (Like preserving all rest of the fields of common and other device config).
> > > > This is orthogonal and needed regardless of device migration.
> > > >
> > > > 2. If one does not want to passthrough a member device, but build
> > > > a mediation-based device on top of existing virtio device, It can
> > > > be useful with
> > > mediating software.
> > > > Here the mediating software has ample duplicated knowledge of what
> > > > the
> > > member device already has.
> > >
> > > It is the way the hypervisors are doing for not only virtio but also
> > > for CPU and MMU as well.
> > >
> > Not really, vcpus and VMCS and more are part of the hardware support.
> 
> That's not the context here. Hypervisors need to know almost every detail to
> make CPU virtualization work. 
CPU virtualization is accelerated for first-level nesting, including interrupts.

> That's the fact, and it works for virtio as well for years.
> 
> What's more, nothing prevents us from inventing something similar in virtio to
> speed up the context switch or migration if necessary.
The major difference between CPU virtualization and network device virtualization is that the former flow is controlled by software, while the latter is controlled by the network, which is not predictable.
Hence, context switching can mostly work in theory but not perform well with varied workloads.
Most production users prefer dedicated/isolated, non-context-switched RX.

> 
> > 2 level nested page tables is hw support.
> > Anything beyond 2 level nesting, likely involves hypervisor.
> 
> Needs emulation/trap for sure. That's the point.
> 
> >
> > > > This can fulfil the nested requirement differently provided a
> > > > platform support
> > > it.
> > > > (PASID limitation will be practical blocker here).
> > >
> > > I don't think PASID is a blocker. It is only a blocker if you want to do
> passthrough.
> > >
> > Even without passthrough, one needs to steer the hypervisor DMA to non
> guest memory.
> > And guest driver must not be able to attack (read/write) from that memory.
> > I don't see how one can do this without PASID. As all DMAs are tagged using
> only RID.
> 
> There are a lot of other ways, but in order to converge, we can leave it for
> future discussions.
> 
So first-level passthrough seems to be a basic requirement to support, operated under VMM control.

Second-level nesting can be emulated or accelerated to follow the principles of the paper you pointed to.

> What's more, if we design virtio for the future, PASID must be considered as a
> way as we all know it would come for sure.
> 
In the future, PASID should be fully controlled by the guest, to continue like today.
PASID-based bifurcation is still an open question to me.

> >
> > > >
> > > > How do I plan to address the above two?
> > > > a. #1 to be addressed by having the _F_PM bit, when the bit is
> > > > negotiated PCI
> > > PM drives the state.
> > >
> > > We can't duplicate every transport specific feature in virtio. This
> > > is a layer violation again. We should reuse the PCI facility here.
> > >
> > It is reused by having the feature bit to indicate that device supports
> suspend/resume.
> > If from Day_1, if the PCI PM bits used, it would not require the feature bit.
> > But that was not the case.
> > So the guest driver do not know if using the PCI PM bit is enough to decide, if
> suspend/resume by guest will work or not.
> > Hence the feature bit.
> 
> Anyhow you need to update the driver if it has an issue. In the update, you can
> check and use PCI PM. If it doesn't have PCI PM, you can only suspend/resume
> at virtio level. Defining transport semantics at the virtio level breaks the layers.
> 
This series does not define transport semantics at the virtio level.
It only defines virtio-level semantics of what is to be done or not done.

> >
> > > > This will work orthogonal to VMM side migration and will co-exist
> > > > with VMM
> > > based device migration.
> 
> Actually not, if PF can suspend VF via PCI facilities, that would be no layer
> violation any more.
> 
There is no such PCI facility. PCI capabilities are not supposed to contain complex commands of the device-migration kind.
I explained this in the discussion with Michael.

> > > >
> > > > b. nested use case:
> > > > L0 VMM maps a VF to L1 guest as PF with emulated SR-IOV capability.
> > > > L1 guest to enable SR-IOV and mapping the VF to L2 guest.
> > >
> > > Let me ask it again here, how can you migrate L2 using L1 "emulated"
> > > PF? Emulation?
> > >
> > Emulation is one way as most nested platform components do.
> 
> That's the point, you can't avoid emulation.
It is applicable only after the first level.
The first level must be able to benefit without emulation, as the rest of the system modules do today.
