Subject: RE: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration


> From: Jason Wang <jasowang@redhat.com>
> Sent: Wednesday, October 25, 2023 6:59 AM

[..]
> > > resume, hence, there is no need to reset the device and destroy
> > > resources like how it is done today.
> > >
> > > Well, I don't see how it is different from what LingShan proposed.
> > The difference is, in passthrough mode, it will be fully controlled by the guest
> VM without involving hypervisor.
> 
> How does a reset in guest work but not suspend? You can choose to pass through
> the suspend to the guest, and use save/load to migrate it.
> 
> > It will work even when device migration is ongoing.
> > What Lingshan proposed involved messing with the device status.
> 
> Your proposal messes with the PCI semantics (as you want to rule the
> behaviours like P2P).
> 
> > It should be separate register like how Jingchen proposed or not have register
> at all if the pci transport support it.
> 
> It should not, then you will end up defining the interaction with the status state
> machine.
> 
Treating all registers equally and synchronizing them in the device is a better model that does not bifurcate the device.
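
To make that concrete, here is a minimal sketch (hypothetical names, not the field layout defined by this patch series) of a device-context record in which device_status is saved and restored uniformly with the rest of the common configuration fields, instead of being split out into a separate suspend register:

#include <stdint.h>

/* Hypothetical device-context entry (illustrative only): the common
 * configuration registers, including device_status, are captured as one
 * uniform group that the device synchronizes during migration, rather
 * than special-casing a separate suspend/stop register.
 */
struct dev_ctx_common_cfg {
	uint32_t device_feature_select;
	uint32_t device_feature;
	uint32_t driver_feature_select;
	uint32_t driver_feature;
	uint16_t msix_config;
	uint16_t num_queues;
	uint8_t  device_status;      /* saved/restored like any other field */
	uint8_t  config_generation;
	uint16_t queue_select;
	/* ... remaining common configuration and per-queue fields ... */
};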

> > >
> > > >
> > > > > > D0->D3 and D3->D0 transition of the pci can suspend and resume
> > > > > > the device which can preserve the last device_status value before
> > > > > > entering D3.
> > > > >
> > > > > It's not only about the device status. I would not repeat the
> > > > > question I've asked in another thread.
> > > > >
> > > > > What's more, if you really want to suspend/freeze at PCI level
> > > > > and deal with PCI specific issues like P2P.  You should really
> > > > > try to leverage or invent a PCI mechanism instead of trying to
> > > > > carry such semantics via a virtio specific stuff like adminq.
> > > > > Solving transport specific problems at the virtio level is a layer violation.
> > > > >
> > > > PCI spec has already defined what it needs to.
> > >
> > > If PCI spec has good support for suspend/resume, why bother
> > > inventing mechanisms in virtio?
> > >
> > Because virtio today does not know if the PCI level suspend/resume
> > will actually work or not,
> 
> It's not the charge of virtio to know about this. Otherwise how many PCI stuffs
> needs virtio to understand? PCIE supports various capabilities.
> 
> > because in past it has not worked even if the PM capability was exposed.
> 
> Let's fix the hypervisor but last time I checked, suspend/hibernation works at
> least for virtio-net.
Because it destroyed the resources and re-created them.
It didn't resume from where it left off. Ideally it should have done that.
Even if you fix the hypervisor, the guest does not know that it is fixed in the hypervisor, so the guest does not know when to skip the current reset flow.
Hence the feature bit is needed.
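
A minimal sketch of the guest-side logic the bit enables (VIRTIO_F_PM is a hypothetical name and bit number, used only to illustrate the argument):

#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_F_PM 41  /* hypothetical feature bit number */

static bool virtio_has_feature(uint64_t features, unsigned int bit)
{
	return features & (1ULL << bit);
}

/* Guest suspend path: rely on PCI PM (D0 -> D3 -> D0) preserving device
 * state only when the device has advertised that this actually works;
 * otherwise keep today's flow of resetting the device and re-creating
 * the virtqueues on resume.
 */
static void virtio_guest_suspend(uint64_t negotiated_features)
{
	if (virtio_has_feature(negotiated_features, VIRTIO_F_PM)) {
		/* enter D3; device_status and queue state are preserved */
	} else {
		/* legacy path: reset device, free vqs, re-create on resume */
	}
}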

> 
> > So only a feature bit is needed.
> >
> > > > SR-PCIM interface is already concluded being outside of PCI-spec
> > > > by the pci-
> > > sig.
> > > > And no, there is no layer violation.
> > > >
> > > > Any non PCI member device can always implement necessary STOP mode
> > > > as
> > > no-op.
> > > >
> > > > And all of those talk make sense when one creates MMIO based
> > > > member
> > > device, until that point is just objections...
> > >
> > > They are different layers:
> > >
> > > 1) suspend/resume at virtio level
> > > 2) suspend/resume at transport level
> > >
> > > We need both of them to satisfy different cases. Just as we need to
> > > reset at both virtio and VF(FLR). Lingshan proposes 1) while it
> > > looks to me you propose 2) via virtio adminq but you said it has
> > > been supported by PCI which is then a duplication.
> > >
> > #1 is needed and to be owned by the guest driver in passthrough. I
> > didn't propose #2.
> > I proposed #2 be controlled by the vmm/hypervisor (via admin cmd) who is in
> charge of vm suspend/resume flow.
> 
> So you're saying it's the virtio level suspend but you want to limit PCI
> transactions in P2P. That's not the suspend/resume at virtio level for sure.
> 
Every virtio-level operation translates to its underlying transport construct,
be it driver notification, device notification moderation, or vq DMA.

Similarly, mode setting translates to its transport binding.

> >
> > > >
> > > > > > (Like preserving all rest of the fields of common and other device
> config).
> > > > > > This is orthogonal and needed regardless of device migration.
> > > > > >
> > > > > > 2. If one does not want to passthrough a member device, but
> > > > > > build a mediation-based device on top of existing virtio
> > > > > > device, It can be useful with
> > > > > mediating software.
> > > > > > Here the mediating software has ample duplicated knowledge of
> > > > > > what the
> > > > > member device already has.
> > > > >
> > > > > It is the way the hypervisors are doing for not only virtio but
> > > > > also for CPU and MMU as well.
> > > > >
> > > > Not really, vcpus and VMCS and more are part of the hardware support.
> > >
> > > That's not the context here. Hypervisors need to know almost every
> > > detail to make CPU virtualization work.
> > Cpu virtualization is accelerated for 1st level nesting including interrupts.
> >
> > > That's the fact, and it works for virio as well for years.
> > >
> > > What's more, nothing prevents us from inventing something similar in
> > > virtio to speed up the context switch or migration if necessary.
> > The major difference with cpu virtualization with nw device virtualization is,
> former flow is controlled by the sw, the later one is controlled by the network
> which is not predictable.
> 
> The guest behaviour is also unpredictable, and guests may share memories with
> others. I don't see your point.
> 
> > Hence, and context switching can mostly work in theory and not perform well
> with varied workload.
> 
> I don't think so, vCPU context is much more complicated than most of the virtio
> devices. I don't see why it can't work for simple virtio devices.
> 

So try to switch an RQ between two VMs at a 100Gbps packet rate without a packet drop and see how it performs.

> > Most production users prefer dedicated/isolated non_context switched rx.
> 
> I don't think you can cover "most production users" here. Such use cases are
> limited with the missing save/load mechanism.
And they apparently have been blocked since 2021, when these device migration efforts started.
Not any more.
> 
> >
> > >
> > > > 2 level nested page tables is hw support.
> > > > Anything beyond 2 level nesting, likely involves hypervisor.
> > >
> > > Needs emulation/trap for sure. That's the point.
> > >
> > > >
> > > > > > This can fulfil the nested requirement differently provided a
> > > > > > platform support
> > > > > it.
> > > > > > (PASID limitation will be practical blocker here).
> > > > >
> > > > > I don't think PASID is a blocker. It is only a blocker if you
> > > > > want to do
> > > passthrough.
> > > > >
> > > > Even without passthrough, one needs to steer the hypervisor DMA to
> > > > non
> > > guest memory.
> > > > And guest driver must not be able to attack (read/write) from that
> memory.
> > > > I don't see how one can do this without PASID. As all DMAs are
> > > > tagged using
> > > only RID.
> > >
> > > There are a lot of other ways, but in order to converge, we can
> > > leave it for future discussions.
> > >
> > So, first level passthrough seems a basic requirement to support to operate
> from vmm control.
> >
> > 2nd level nesting can be emulated or accelerated to follow the principles of
> the paper you pointed.
> >
> > > What's more, if we design virtio for the future, PASID must be
> > > considered as a way as we all know it would come for sure.
> > >
> > For future PASID be fully controlled by the guest to continue like today.
> > PASID based bifurcation is still open question to me.
> 
> It is by design, e.g devices can have secondary PASID. It's not hard to
> understand. And it's much simpler than doing "bifurcation" in PF.
> 
> >
> > > >
> > > > > >
> > > > > > How to I plan to address above two?
> > > > > > a. #1 to be addressed by having the _F_PM bit, when the bit is
> > > > > > negotiated PCI
> > > > > PM drives the state.
> > > > >
> > > > > We can't duplicate every transport specific feature in virtio.
> > > > > This is a layer violation again. We should reuse the PCI facility here.
> > > > >
> > > > It is reused by having the feature bit to indicate that device
> > > > supports
> > > suspend/resume.
> > > > If from Day_1, if the PCI PM bits used, it would not require the feature bit.
> > > > But that was not the case.
> > > > So the guest driver do not know if using the PCI PM bit is enough
> > > > to decide, if
> > > suspend/resume by guest will work or not.
> > > > Hence the feature bit.
> > >
> > > Anyhow you need to update the driver if it has an issue. In the
> > > update, you can check and use PCI PM. If it doesn't have PCI PM, you
> > > can only suspend/resume at virtio level. Defining transport semantics at the
> virtio level breaks the layers.
> > >
> > This series does not define transport semantics at virtio level.
> 
> Don't you want to limit P2P in those states?
> 
At the virtio level, they are not defined.
The virtio-to-transport binding defines them, just as every virtio construct has a transport binding, from notification and DMA to SR-IOV and everything else.

> > It only defines virtio level semantics of what to be done/not done.
> >
> > > >
> > > > > > This will work orthogonal to VMM side migration and will
> > > > > > co-exist with VMM
> > > > > based device migration.
> > >
> > > Actually not, if PF can suspend VF via PCI facilities, that would be
> > > no layer violation any more.
> > >
> > There is no such PCI facility.
> 
> If you want to make passthrough work without layer violation, you need either:
> 
> 1) invent them in the PCI
> 
This will follow the paper you pointed to and all the principles listed there.

> or
> 
> 2) Trap and let hypervisor to control how to implement the suspend, for
> example hypervisor can choose to control the PM of VF
> 
> > PCI capabilities is not supposed to contain device migration kind of complex
> commands.
> 
> We're discussing suspending here, no? Talking about PCI, even if capabilities are
> not, it doesn't mean we can't extend PCI to use others. Anyhow, this is really
> irrelevant to the discussion here.
A PCI capability cannot contain virtio-specific complex RW registers.
A vendor-defined capability was done, which is largely RO content, and that is ok.
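
For reference, the virtio vendor-specific capability defined by the spec is essentially read-only routing information telling the driver which BAR and offset to find a structure at; roughly, in a C rendering of the virtio 1.0 layout:

#include <stdint.h>

/* Rough rendering of the virtio PCI vendor-specific capability header
 * (virtio 1.0 layout): it only tells the driver where to find a
 * structure, i.e. mostly static, read-only routing information.
 */
struct virtio_pci_cap {
	uint8_t  cap_vndr;    /* PCI_CAP_ID_VNDR */
	uint8_t  cap_next;    /* next capability pointer */
	uint8_t  cap_len;     /* capability length */
	uint8_t  cfg_type;    /* which structure this points to */
	uint8_t  bar;         /* BAR holding the structure */
	uint8_t  padding[3];
	uint32_t offset;      /* offset within the BAR (little-endian) */
	uint32_t length;      /* length of the structure, in bytes */
};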

> Virtio does virtio not PCI, you can't invent new features in virtio in order to be
> able to extend or fix the function of PCI.
Virtio needs to live with the limitations of PCI and also needs to extend PCI when it needs to.

> 
> > I explained in the discussion with Michael.
> >
> > > > > >
> > > > > > b. nested use case:
> > > > > > L0 VMM maps a VF to L1 guest as PF with emulated SR-IOV capability.
> > > > > > L1 guest to enable SR-IOV and mapping the VF to L2 guest.
> > > > >
> > > > > Let me ask it again here, how can you migrate L2 using L1 "emulated"
> > > > > PF? Emulation?
> > > > >
> > > > Emulation is one way as most nested platform components do.
> > >
> > > That's the point, you can't avoid emulation.
> > It is applicable only after first level.
> > First level must be able to take the benefit without emulation like rest of the
> system modules do today.
> 
> You can't avoid traps and emulation. So the key is what/when/where to trap,
> this is my logic of questions .
> 
I propose to do the nesting of the VF and follow the same model as 2-level nested page tables, which actually works in the hw.

> You want to pass through virtio facilities without trap and emulation, you need
> to justify that.

For the first level, it is clear that passthrough works without trap and emulation, like a CPU page table walkthrough.
I don't know what you mean by justification, but it is the requirement for passthrough.
N-level nesting is a secondary requirement; PCI-SIG should be consulted for it if needed.

