virtio-comment message

Subject: Re: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration

From: Jason Wang <jasowang@redhat.com>
To: Parav Pandit <parav@nvidia.com>
Date: Wed, 18 Oct 2023 08:52:51 +0800

On Tue, Oct 17, 2023 at 11:46 AM Parav Pandit <parav@nvidia.com> wrote:
>
>
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Tuesday, October 17, 2023 7:41 AM
> >
> > On Fri, Oct 13, 2023 at 2:40 PM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > >
> > > > From: Jason Wang <jasowang@redhat.com>
> > > > Sent: Friday, October 13, 2023 6:48 AM
> > > >
> > > > > On Thu, Oct 12, 2023 at 7:37 PM Parav Pandit <parav@nvidia.com>
> > > > > wrote:
> > > > > As Michael said, software based nesting is used..
> > > >
> > > > I've pointed out in another thread when hardware has less
> > > > abstraction level than nesting, trap/emulation is a must.
> > > >
> > > > > See if actual hw based devices can implement it or not. Many
> > > > > components of
> > > > cpu cannot do N level nesting either, but may be virtio can.
> > > > > > I don't know how yet.
> > > >
> > > > I would not repeat the lessons given by Gerald J. Popek and Robert P.
> > > > Goldberg[1] in 1976, but I think you miss a lot of fundamental
> > > > things in the methodology of virtualization.
> > > Weekend is coming. I will read it.
> > >
> > > > For example, nesting is a very important criteria to examine whether
> > > > an architecture is well designed for virtualization.
> > > >
> > >
> > > In my reading of a leading OS vendor documentation, I leant that OS vendor
> > do not recommend nested virtualization for production at [1].
> > > Snippet:
> > > "In addition, Red Hat does not recommend using nested virtualization in
> > production user environments, due to various limitations in functionality.
> > Instead, nested virtualization is primarily intended for development and testing
> > scenarios."
> > >
> > > [1]
> > > https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux
> > > /8/html/configuring_and_managing_virtualization/creating-nested-virtua
> > > l-machines_configuring-and-managing-virtualization
> > >
> > > 2nd leading hypervisor listed nested virtualization to be not used for
> > "performance sensitive applications".
> >
> > Another concept shift.
> >
> > I'm not going to comment on the choice for individual distros. But the points are
> > whether we can deploy a nesting virtualization easily under a specific hardware
> > architecture. In this regard, the above is a good example.
> >
> And most of such nesting seems for non production use, helpful for debugging and more.

I asked you to google this, but you refuse to spend one minute doing
that and instead spend several days debating this fact:

https://cloud.google.com/compute/docs/instances/nested-virtualization/overview

Please don't waste the time of both of us.

>
> And the nesting is not working without trap + emulation for > 2 level of nesting outside of virtio as far as I understand.

Read the above link.

> Like Intel PML. How many levels of nesting is done by hw for PML?
>
> > Again, just a simple google will tell you the instances that support nesting have
> > been available for almost all the major cloud vendors for a while.
> >
> From cpu data sheets, it does not appear that hw is able to do such nesting.

For PML, it's up to the CPU vendor to consider a good way to be self
virtualized. If it's not, it's a design defect. This is not the place
to discuss the design choices of a specific CPU vendor. If you are
really interested in this, you can go back through the archive to
figure out why AMD nesting was done much earlier than Intel's.

>
> > >
> > > I want to repeat and emphasize that I am not ignoring the nested case.
> > >
> > > An extension for nesting would be the VF presented to the guest itself with
> > SR-IOV capability can work as_is as proposed here.
> >
> > How can a VF have the SR-IOV capability?
> >
> One option is by trap + emulation.

Great.

> Second is having it actually on the VF, which will follow the true definition of nesting.

How is VF allowed to have SR-IOV capability by the spec?

>
> > > Michael presented the idea of the dummy PF, which is to represent the VF as
> > dummy PF which can do the SR-IOV with one VF.
> >
> > Why do we need the complicated SR-IOV emulation at the nesting level?
> You have to complicate one way or the other.

How? I've demonstrated that you won't end up with such complications
if everything is self contained.

> And here it does not look complicated because it uses all existing defined constructs available at VMM and GVM level.
> It follows both the principles you listed in the paper, i.e. (a) efficiency and (b) equivalence property.

In order to achieve (b), you need to have many PFs and many levels
which is an obvious unnecessary complication.

>
> > How can you make sure such a design can result in a live migration to be done at
> > any levels?
> >
> I will propose design that is practical and has some use case.
> I will not propose theoretical work that no one will implement.

Again, this only matters if you want to do everything in passthrough
mode, which is not the methodology proven by [1]. It doesn't matter
if you stick to trapping.

>
> > E.g in LN, you had a PF and a VF. How to migrate the PF to this level?
> > You want two PFs in the L(N-1) level?
> >
> Likely yes as dummy PF with emulated caps.

Ok, so you will have N PFs in L0 which is unrealistic. Not only
because of the limitation of the resources but also because there's no
way for the hypervisor to know how many levels of nesting are being
used.

>
> > > You need the support from the platform too, I guess TC can extend it.
> > > May be a different interface more suitable for nested case which do not have
> > performance needs.
> >
> > I disagree, it's about if the performance can satisfy the requirement at N level.
> >
> > >
> > > How about a nested user to have AQ located on the VF so that mediation sw
> > can operate admin commands over self?
> >
> > I would go with such complicated architecture.
> >
> You like meant, you wouldn't, Right?

Right.

>
> Also, following your paper which clearly highlights, "execution of privileged instruction in vm occurs, which would have effect of changing machine resources".
> In the passthrough case it is not the privileged instruction because the resource is not composed by the the machine, it is already done by the device".

How do you know that? With save/load of a device state, you can
schedule/share a VF among multiple VMs. Then you still want to pass
through everything? Let's just not invent a mechanism that can only
work for a very limited use case.

> Hence for such cvq operation trap is not to be done for member virtio device.
>
> It would make sense to trap cvq for non virtio device, where cvq is composed as part of the machine resource.
>
> > > Device mode commands will not be applicable there, instead some other
> > things to be done.
> > > So non passthrough mode software possibly can make use of it?
> >
> > It would be a great burden if you
> >
> > 1) use passthrough in L0
> > 2) use trap/emulation in L(N+1)
> >
> How is this different than Intel PML hw?

Let me clarify my point. What I meant is:

You can't simply use passthrough in order to live migrate at any
level. So what you can do is:

1) using passthrough to VF in L0
2) using trap/emulation for PF/VF in L1 and LN

Isn't this much more complicated than simply having a self-contained
device for the VF? Then you don't need the composition of a PF at any
level. No?

>
> > >
> > > > That is to say for any CPU/hypervisor vendors, the architecture
> > > > should be designed to run any levels of nesting instead of just an
> > > > awkward 2 levels (but what you proposed can not work for even 2).
> > > Huh, some missing text for corner case as making claim, _not_working in not a
> > healthy discussion.
> > >
> > > > For x86 and KVM, any level of
> > > > nesting has been done for about 10 years ago.
> > > >
> > > I didn't find hw for PML support in x86 for N or 3 level nesting. Did I miss?
> > > I didn't find hw for nested page tables upto N level walking on the PCIe
> > read/writes in any cpu. Did I miss?
> >
> > You need first asking why it is a must to achieve nested virtualization. All of
> > those obstacles come only if you want to use "passthrough" for any levels.
> >
> > > Have you seen nesting in hw works at N level?
> >
> > Again, hardware can't have endless resources for endless levels.
> Can you please list two or 3 hw features that are in hw, for > 2 levels?

Why do I need to do this? What I'm saying is that hardware doesn't
need to be designed for N levels. What it needs is to make sure it
satisfies the requirements proved by [1].

>
> > Trap and
> > emulation is a must for achieving nesting virtualization. If you try to invent a
> > passthrough method that can work for any level, you will probably fail
>
> It at least follows the design principle of the paper you suggested.

I don't see it this way, see the above reply. The paper is for trap
and emulation for sure but you propose to pass through everything.

> I don't see a point of designing something for N level nesting in first go when rest eco system is not there to support it at hw level.

Your design complicates the nesting a lot. We have a hands-on
methodology, well studied since the 1970s, which you refuse to start
with. So you may end up with a lot of issues.

What's more, your design is incomplete as it can't be used for migrating:

1) owner
2) virtio devices that are not structured as owner/member

That's why I see this as incomplete and immature.

Thanks
