virtio-comment message



Subject: RE: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration



> From: Jason Wang <jasowang@redhat.com>
> Sent: Wednesday, October 18, 2023 6:23 AM
> 
> On Tue, Oct 17, 2023 at 11:46 AM Parav Pandit <parav@nvidia.com> wrote:
> >
> >
> >
> > > From: Jason Wang <jasowang@redhat.com>
> > > Sent: Tuesday, October 17, 2023 7:41 AM
> > >
> > > On Fri, Oct 13, 2023 at 2:40 PM Parav Pandit <parav@nvidia.com> wrote:
> > > >
> > > >
> > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > Sent: Friday, October 13, 2023 6:48 AM
> > > > >
> > > > > On Thu, Oct 12, 2023 at 7:37 PM Parav Pandit <parav@nvidia.com>
> > > > > wrote:
> > > > > > As Michael said, software-based nesting is used.
> > > > >
> > > > > I've pointed out in another thread when hardware has less
> > > > > abstraction level than nesting, trap/emulation is a must.
> > > > >
> > > > > > See if actual hw-based devices can implement it or not. Many
> > > > > > components of the CPU cannot do N-level nesting either, but maybe
> > > > > > virtio can. I don't know how yet.
> > > > >
> > > > > I would not repeat the lessons given by Gerald J. Popek and Robert
> > > > > P. Goldberg [1] in 1974, but I think you miss a lot of fundamental
> > > > > things in the methodology of virtualization.
> > > > Weekend is coming. I will read it.
> > > >
> > > > > For example, nesting is a very important criterion to examine
> > > > > whether an architecture is well designed for virtualization.
> > > > >
> > > >
> > > > In my reading of a leading OS vendor's documentation, I learned that
> > > > the OS vendor does not recommend nested virtualization for
> > > > production at [1].
> > > > Snippet:
> > > > "In addition, Red Hat does not recommend using nested virtualization
> > > > in production user environments, due to various limitations in
> > > > functionality. Instead, nested virtualization is primarily intended
> > > > for development and testing scenarios."
> > > >
> > > > [1]
> > > > https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/configuring_and_managing_virtualization/creating-nested-virtual-machines_configuring-and-managing-virtualization
> > > >
> > > > A second leading hypervisor listed nested virtualization as not to
> > > > be used for "performance sensitive applications".
> > >
> > > Another concept shift.
> > >
> > > I'm not going to comment on the choice of individual distros. But the
> > > point is whether we can deploy nested virtualization easily under a
> > > specific hardware architecture. In this regard, the above is a good
> > > example.
> > >
> > And most such nesting seems to be for non-production use, helpful for
> > debugging and more.
> 
> I'm asking you to google, but you refuse to spend one minute to do that
> while spending several days debating this fact:
> 
> https://cloud.google.com/compute/docs/instances/nested-virtualization/overview
> 
> Please don't waste the time of both of us.

I showed the link from Red Hat, and the other one is Hyper-V.
You showed a link from Google Cloud.

There are no representatives from Google or Microsoft here in the discussion to speak for nesting.

I assume that you, as part of Red Hat, can show some production use, but Red Hat's public documentation says non-production.

Regardless, I want to emphasize that I am not against the nested use case.

I am highlighting that any L2 nesting today involves hw emulation in the ecosystem.
If this is incorrect, please point to the datasheet (not high-level user documentation).

And for L2 nesting, virtio doing hw emulation is fine with me.
If one wants to improve that too, let's have a proper nested VF.
Let's discuss it in the other thread, where you have many questions.


> 
> >
> > And nesting does not work without trap + emulation for > 2 levels of
> > nesting outside of virtio, as far as I understand.
> 
> Read the above link.
> 
> > Like Intel PML. How many levels of nesting are done by hw for PML?
> >
> > > Again, just a simple google search will tell you that instances
> > > supporting nesting have been available from almost all the major
> > > cloud vendors for a while.
> > >
> > From CPU datasheets, it does not appear that hw is able to do such
> > nesting.
> 
> For PML, it's up to the CPU vendor to consider a good way to be
> self-virtualized. If it's not, it's a design defect. This is not the place
> to discuss the design choices of a specific CPU vendor; if you are really
> interested in this, you can go back in the archive to figure out why AMD
> nesting was done much earlier than Intel's.

In the Google link you posted, I read "VMs powered by AMD processors are not supported".
I wish they had been able to utilize it.

> 
> >
> > > >
> > > > I want to repeat and emphasize that I am not ignoring the nested case.
> > > >
> > > > An extension for nesting would be that the VF presented to the
> > > > guest, itself having the SR-IOV capability, can work as-is as
> > > > proposed here.
> > >
> > > How can a VF have the SR-IOV capability?
> > >
> > One option is by trap + emulation.
> 
> Great.
> 
> > The second is having it actually on the VF, which would follow the true
> > definition of nesting.
> 
> How is VF allowed to have SR-IOV capability by the spec?
>
To support nesting, PCI-SIG can extend it.
 
> >
> > > > Michael presented the idea of the dummy PF, which is to represent
> > > > the VF as a dummy PF that can do SR-IOV with one VF.
> > >
> > > Why do we need the complicated SR-IOV emulation at the nesting level?
> > You have to add complexity one way or the other.
> 
> How? I've demonstrated that you won't end up with such complications if
> everything is self-contained.
The primary problem with self-contained is that it does not fit the requirements of passthrough.
How can we build a self-contained interface, without mediation, when the device context and dirty pages are lost on device reset/FLR?
Also, the DMA occurs in the guest.
We need a facility like PML, where the pages are logged at the VMM level; in the virtio case, to the owner PF. A sketch follows.
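
A minimal sketch of that flow, assuming a hypothetical owner-PF interface
(vf_dirty_log_read() and struct dirty_rec are illustrative names, not
commands defined by this series): the VMM drains the write records that the
owner PF logged on behalf of the passthrough VF, analogous to PML, and
folds them into its migration dirty bitmap.

#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12

struct dirty_rec {
	uint64_t gpa;	/* guest-physical address the VF wrote via DMA */
};

/* Illustrative backend, assumed to fetch logged write records from the
 * owner PF for the given member VF; returns 0 once the log is drained. */
size_t vf_dirty_log_read(int owner_pf_fd, uint16_t vf_id,
			 struct dirty_rec *recs, size_t max);

/* Fold the device-side write log into the VMM's dirty bitmap so the
 * migration flow re-sends pages the VF dirtied behind the CPU's back. */
static void sync_vf_dirty_pages(int owner_pf_fd, uint16_t vf_id,
				uint8_t *dirty_bitmap, uint64_t npages)
{
	struct dirty_rec recs[256];
	size_t i, n;

	while ((n = vf_dirty_log_read(owner_pf_fd, vf_id, recs, 256)) > 0) {
		for (i = 0; i < n; i++) {
			uint64_t pfn = recs[i].gpa >> PAGE_SHIFT;

			if (pfn < npages)
				dirty_bitmap[pfn / 8] |= 1u << (pfn % 8);
		}
	}
}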

> 
> > And here it does not look complicated, because it uses all the existing
> > defined constructs available at the VMM and GVM level.
> > It follows both of the principles you listed from the paper, i.e. (a)
> > the efficiency and (b) the equivalence property.
> 
> In order to achieve (b), you need to have many PFs and many levels, which
> is an obviously unnecessary complication.
> 
This is what you wanted: to follow the paper.
It does not need many PFs; at L0 there is one PF and N VFs.
At L1, one VF is given an emulated config space that contains the SR-IOV capability, as sketched below.
This L1 VF allows creating new VFs, one of which will be passed to L2.
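
For illustration, a minimal sketch of what L0 could synthesize in the L1
VF's emulated config space. The field layout follows the SR-IOV extended
capability (capability ID 0x0010) defined by the PCIe specification; the
struct name and its use for emulation here are assumptions of this
discussion, not something defined by this patch series.

#include <stdint.h>

/* SR-IOV extended capability, as L0 could expose it to the L1 guest. */
struct sriov_ext_cap {
	uint32_t cap_header;		/* 0x00: ID 0x0010, version, next */
	uint32_t sriov_caps;		/* 0x04: SR-IOV Capabilities */
	uint16_t sriov_control;		/* 0x08: bit 0 is VF Enable */
	uint16_t sriov_status;		/* 0x0a */
	uint16_t initial_vfs;		/* 0x0c */
	uint16_t total_vfs;		/* 0x0e: e.g. 1 for a dummy PF */
	uint16_t num_vfs;		/* 0x10: written by the L1 driver */
	uint16_t func_dep_link;		/* 0x12 */
	uint16_t first_vf_offset;	/* 0x14 */
	uint16_t vf_stride;		/* 0x16 */
	uint16_t rsvd;			/* 0x18 */
	uint16_t vf_device_id;		/* 0x1a */
	uint32_t supported_page_sizes;	/* 0x1c */
	uint32_t system_page_size;	/* 0x20 */
	uint32_t vf_bar[6];		/* 0x24-0x38: VF BAR0..BAR5 */
	uint32_t vf_migration_state;	/* 0x3c */
};

L0 traps config space accesses to this region; when the L1 driver sets VF
Enable, L0 can back the new L1 VFs with real L0 VFs from the owner PF.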

> >
> > > How can you make sure such a design allows live migration to be done
> > > at any level?
> > >
> > I will propose a design that is practical and has a real use case.
> > I will not propose theoretical work that no one will implement.
> 
> Again, it's only a problem if you want to do everything in a passthrough
> mode, which does not follow the methodology proven by [1]. It's not a
> problem if you stick to trapping.
>
I didn't understand, but I don't see the point of discussing passthrough vs. non-passthrough.

 
> >
> > > E.g., in LN you had a PF and a VF. How do you migrate the PF at this
> > > level? Do you want two PFs at the L(N-1) level?
> > >
> > Likely yes, as a dummy PF with emulated capabilities.
> 
> Ok, so you will have N PFs in L0, which is unrealistic: not only because
> of resource limitations, but also because there's no way for the
> hypervisor to know how many levels of nesting are being used.
>
Only one PF in L0, and an emulated PF in L1, similar to how the rest of the ecosystem's platform components are doing it.
When the whole platform commits to N-level nesting, it makes sense for virtio to align.
For example, CPU vendors committing to N-level nested page table traversal on PCI reads/writes, posted interrupts at N levels, and PML logging at N levels.
At that point, virtio support for N-level nesting makes sense.

> >
> > > > You need support from the platform too; I guess the TC can extend
> > > > it. Maybe a different interface is more suitable for the nested
> > > > case, which does not have performance needs.
> > >
> > > I disagree; it's about whether the performance can satisfy the
> > > requirements at N levels.
> > >
> > > >
> > > > How about a nested user having the AQ located on the VF, so that
> > > > the mediation sw can operate admin commands on itself?
> > >
> > > I would go with such complicated architecture.
> > >
> > You likely meant you wouldn't, right?
> 
> Right.
> 
> >
> > Also, following your paper, which clearly highlights: "execution of a
> > privileged instruction in the VM occurs, which would have the effect of
> > changing machine resources".
> > In the passthrough case it is not a privileged instruction, because the
> > resource is not composed by the machine; it is already done by the
> > device.
> 
> How do you know that? With save/load of the device state, you can
> schedule/share a VF among multiple VMs. Do you then still want to pass
> through everything?
You cannot share a VF among multiple VMs, as each VM has its own memory boundary, isolated by the IOMMU and MMU.
Incoming PCI requests with a specific RID cannot be split between two different guest VMs, as sketched below.
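
A conceptual sketch of why (the names are illustrative, not a real IOMMU
API): the IOMMU selects exactly one translation domain per requester ID,
so every DMA from a given VF lands in a single guest's address space.

#include <stdint.h>

struct iommu_domain;	/* one guest's DMA address space */

/* Conceptual only: one context-table entry per RID points at a single
 * translation domain, so a VF's DMA cannot target two VMs at once. */
struct iommu_ctx_entry {
	struct iommu_domain *domain;
};

struct iommu {
	struct iommu_ctx_entry context_table[1 << 16];	/* indexed by RID */
};

static struct iommu_domain *domain_for_dma(struct iommu *iommu,
					   uint16_t rid)
{
	/* Exactly one domain per requester ID; there is no key besides
	 * the RID by which the hardware could demultiplex the traffic. */
	return iommu->context_table[rid].domain;
}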

> Let's just not invent a mechanism that can only work for a very
> limited use case.
> 
The use case you are calling limited is a common one for passthrough users.

> > Hence, for such a cvq operation, a trap is not to be done for the
> > member virtio device.
> >
> > It would make sense to trap the cvq for a non-virtio device, where the
> > cvq is composed as part of the machine resource.
> >
> > > > Device mode commands will not be applicable there; instead, some
> > > > other things need to be done.
> > > > So non-passthrough-mode software can possibly make use of it?
> > >
> > > It would be a great burden if you
> > >
> > > 1) use passthrough in L0
> > > 2) use trap/emulation in L(N+1)
> > >
> > How is this different from Intel PML hw?
> 
> Let me clarify my point; I meant:
> 
> You can't simply use passthrough in order to live migrate at any level.
> So what you can do is:
> 
> 1) use passthrough to the VF in L0
> 2) use trap/emulation for the PF/VF in L1 and LN
> 
> Isn't this much more complicated than simply having a self-contained
> device for the VF, where you don't need the composition of a PF at any
> level? No?
>
The problem with self-contained is that it is not able to do even #1.
 
> >
> > > >
> > > > > That is to say, for any CPU/hypervisor vendor, the architecture
> > > > > should be designed to run any level of nesting instead of just an
> > > > > awkward 2 levels (but what you proposed cannot work for even 2).
> > > > Huh, claiming _not_working for a corner case while the text for
> > > > that case is missing is not a healthy discussion.
> > > >
> > > > > For x86 and KVM, any level of nesting has been possible for about
> > > > > 10 years.
> > > > >
> > > > I didn't find hw PML support in x86 for N- or even 3-level nesting.
> > > > Did I miss it?
> > > > I didn't find hw for nested page tables with up to N-level walking
> > > > on PCIe reads/writes in any CPU. Did I miss it?
> > >
> > > You first need to ask why it is a must to achieve nested
> > > virtualization. All of those obstacles come up only if you want to
> > > use "passthrough" at every level.
> > >
> > > > Have you seen nesting in hw work at N levels?
> > >
> > > Again, hardware can't have endless resources for endless levels.
> > Can you please list two or three features that are done in hw for > 2
> > levels?
> 
> Why do I need to do this? What I'm saying is that hardware doesn't need
> to be designed for N levels. What it needs is to make sure it satisfies
> the requirements proven by [1].
>
You need it because you want to follow the three principles listed in the paper, i.e. efficiency, equivalence, and resource control.
 
> >
> > > Trap and
> > > emulation is a must for achieving nested virtualization. If you try
> > > to invent a passthrough method that can work for any level, you will
> > > probably fail.
> >
> > It at least follows the design principles of the paper you suggested.
> 
> I don't see it this way; see the above reply. The paper is about trap and
> emulation for sure, but you propose to pass through everything.
> 
> > I don't see the point of designing something for N-level nesting in the
> > first go when the rest of the ecosystem is not there to support it at
> > the hw level.
> 
> Your design complicates nesting a lot. We have a hands-on methodology,
> well studied since the 1970s, which you refuse to start from. You may
> then end up with a lot of issues.
> 
I don't think so. When the hw ecosystem is built for nesting, it makes sense for virtio to do nesting acceleration.
Otherwise, the method used for other nesting is enough for virtio.

> What's more, your design is incomplete, as it can't be used for migrating:
> 
> 1) the owner
Owner migration is not a requirement. That is just silly.
If one wants to migrate the owner, an admin virtio device can be present outside of the owner to migrate it.

> 2) virtio devices that aren't structured as owner/member
> 
As of spec 1.3, they are structured for the PCI SR-IOV group type.
The MMIO transport is just missing out on the advancements happening on the PCI transport.
If there is user interest, one will do it for MMIO too.

