Subject: Re: [virtio-comment] Re: [PATCH 09/11] transport-pci: Describe PCI MMR dev config registers




On 4/12/2023 9:48 PM, Jason Wang wrote:
On Wed, Apr 12, 2023 at 10:23 PM Parav Pandit <parav@nvidia.com> wrote:



From: Jason Wang <jasowang@redhat.com>
Sent: Wednesday, April 12, 2023 2:15 AM

On Wed, Apr 12, 2023 at 1:55 PM Parav Pandit <parav@nvidia.com> wrote:



From: Jason Wang <jasowang@redhat.com>
Sent: Wednesday, April 12, 2023 1:38 AM

A modern device says FEATURE_1 must be offered and must be negotiated by the driver.
Legacy has MAC as an RW area (the hypervisor can do it).
The reset flow is different between legacy and modern.

Just to make sure we're on the same page: we're talking in the
context of mediation. Without mediation, your proposal can't work.

Right.

So in this case, the guest driver is not talking to the device
directly. Qemu needs to trap whatever it wants to achieve the
mediation:

I prefer to avoid picking a specific sw component here, but yes, QEMU can trap.

1) It's perfectly fine that Qemu negotiated VERSION_1 but presented
a mediated legacy device to guests.
Right, but if VERSION_1 is negotiated, the device will work as V_1 with a 12B
virtio_net_hdr.
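(For reference, a minimal sketch of the two header layouts as defined by the virtio-net spec; only the field widths matter for the 10B-vs-12B point, endianness handling is omitted.)

#include <stdint.h>

/* Legacy virtio-net header: 10 bytes when MRG_RXBUF is not negotiated. */
struct virtio_net_hdr_legacy {
    uint8_t  flags;
    uint8_t  gso_type;
    uint16_t hdr_len;
    uint16_t gso_size;
    uint16_t csum_start;
    uint16_t csum_offset;
};

/* With VERSION_1, num_buffers is always present: 12 bytes. */
struct virtio_net_hdr_v1 {
    uint8_t  flags;
    uint8_t  gso_type;
    uint16_t hdr_len;
    uint16_t gso_size;
    uint16_t csum_start;
    uint16_t csum_offset;
    uint16_t num_buffers;
};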

A shadow virtqueue could be used here. And we have many more issues without
a shadow virtqueue, more below.


2) For MAC and Reset, Qemu can trap and do anything it wants.

The idea is not to poke into the fields even though such sw can.
MAC is RW in legacy.
MAC is RO in 1.x.

So QEMU cannot turn an RO register into an RW one.

It can be done via the control vq. Trap the MAC write and forward it via the
control virtqueue.
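(A hypothetical sketch of that trap-and-forward path. The control command class/value are the standard VIRTIO_NET_CTRL_MAC / VIRTIO_NET_CTRL_MAC_ADDR_SET from the virtio-net spec; the device handle and the cvq_submit() helper are made-up placeholders for whatever the mediation layer uses.)

#include <stddef.h>
#include <stdint.h>

#define VIRTIO_NET_CTRL_MAC          1
#define VIRTIO_NET_CTRL_MAC_ADDR_SET 1

struct virtio_net_ctrl_hdr {
    uint8_t class;
    uint8_t cmd;
};

/* Placeholder: submit a control command and wait for VIRTIO_NET_OK. */
int cvq_submit(void *dev, const void *hdr, size_t hdr_len,
               const void *data, size_t data_len);

/* The guest writes the legacy RW MAC config field; instead of touching
 * the (read-only in 1.x) device config space, forward it over the CVQ. */
static int handle_legacy_mac_write(void *dev, const uint8_t mac[6])
{
    struct virtio_net_ctrl_hdr hdr = {
        .class = VIRTIO_NET_CTRL_MAC,
        .cmd   = VIRTIO_NET_CTRL_MAC_ADDR_SET,
    };

    return cvq_submit(dev, &hdr, sizeof(hdr), mac, 6);
}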

This proposal is not implementing a vDPA mediator, which requires far deeper knowledge in the hypervisor.

It's not related to vDPA; it's about a common technique that is used
in virtualization. You trap and emulate the status, so why can't you
do that for the others?

Such mediation works fine for vDPA, and it is up to the vDPA layer to do it. Not relevant here.


The proposed solution in this series enables it and avoids per-field sw
interpretation and mediation in parsing values, etc.

I don't think it's possible. See the discussion about ORDER_PLATFORM and
ACCESS_PLATFORM in previous threads.

I have read the previous thread.
The hypervisor will be limited to those platforms where ORDER_PLATFORM is not needed.

So you introduce a bunch of new facilities that only work on some
specific archs. This breaks the architecture independence of virtio
since 1.0.
The spec as defined today does not work for a transitional PCI device under virtualization; it only works in the limited PF case.
Hence this update. More below.

The root cause is that legacy is not fit for a hardware
implementation; any kind of hardware that tries to offer the legacy
function will finally run into those corner cases which require extra
interfaces, and which may finally end up as a (partial) duplication of
the modern interface.

I agree with you. We cannot change legacy.
What is being added here is to enable the legacy transport via MMIO or AQ, using the notification region.

I will comment where you listed the 3 options.

And this is a PCI transitional device that uses the standard platform DMA anyway, so ACCESS_PLATFORM is not related.

So which type of transactions does this device use when it is accessed via the
legacy MMIO BAR? Translated requests or not?

The device uses the configured PCI transport-level addresses, because it's a PCI device.

For example, a device may have implemented, say, only BAR2, and a small portion of BAR2 points to the legacy MMIO config registers.

We're discussing spec changes, not a specific implementation here. Why
can't the device use BAR0? Do you see any restriction in the spec?

No restriction.
Forcing it to use BAR0 is the restrictive method.
A mediating hypervisor sw will be able to read/write it when BAR0 is exposed towards the guest VM as IOBAR 0.
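(A rough sketch of the kind of forwarding meant here; purely illustrative, the names and the trap plumbing are assumptions rather than part of this series. The hypervisor traps the guest's accesses to the emulated IOBAR 0 and replays them against the legacy MMIO register region, wherever the capability says that region lives.)

#include <stdint.h>

static uint32_t legacy_io_read(volatile uint8_t *legacy_regs,
                               uint32_t offset, unsigned size)
{
    switch (size) {
    case 1:  return *(volatile uint8_t  *)(legacy_regs + offset);
    case 2:  return *(volatile uint16_t *)(legacy_regs + offset);
    default: return *(volatile uint32_t *)(legacy_regs + offset);
    }
}

static void legacy_io_write(volatile uint8_t *legacy_regs, uint32_t offset,
                            uint32_t val, unsigned size)
{
    switch (size) {
    case 1:  *(volatile uint8_t  *)(legacy_regs + offset) = val; break;
    case 2:  *(volatile uint16_t *)(legacy_regs + offset) = val; break;
    default: *(volatile uint32_t *)(legacy_regs + offset) = val; break;
    }
}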

So I don't think it can work:

1) This is very dangerous unless the spec mandates the size (this is
also tricky since the page size varies among arches) for any
BAR/capability, which is not what virtio wants; the spec leaves that
flexibility to the implementation:

E.g

"""
The driver MUST accept a cap_len value which is larger than specified here.
"""
cap_len refers to the length of the PCI capability structure as defined by the PCI spec. The length of the region inside the BAR is carried in the le32 length field.

So the new MMIO region can be of any size and placed anywhere in the BAR.
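(For reference, the capability layout being discussed, as defined by struct virtio_pci_cap in virtio 1.1; virtio 1.2 additionally carves an id byte out of the padding. cap_len covers the capability structure itself, while length describes the region inside the BAR.)

#include <stdint.h>

struct virtio_pci_cap {
    uint8_t  cap_vndr;    /* Generic PCI field: PCI_CAP_ID_VNDR */
    uint8_t  cap_next;    /* Generic PCI field: next capability pointer */
    uint8_t  cap_len;     /* Length of this capability structure */
    uint8_t  cfg_type;    /* Identifies the structure type */
    uint8_t  bar;         /* Which BAR the region lives in */
    uint8_t  padding[3];  /* Pad to a full dword */
    uint32_t offset;      /* le32: offset of the region within the BAR */
    uint32_t length;      /* le32: length of the region within the BAR */
};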

For LM, the BAR length and number should be the same between the two PCI VFs. But that is orthogonal to this point. Such checks will be done anyway.


2) A blocker for live migration (and compatibility): the hypervisor
should not assume the size of any capability, so in any case it
should have a fallback for the case where the BAR can't be assigned.

I agree that the hypervisor should not assume.
For LM, such compatibility checks will be done anyway.
So it is not a blocker; all that is needed is that they match on the two sides.

Let me summarize; we have three ways currently:

1) legacy MMIO BAR via capability:

Pros:
- allows some flexibility to place the MMIO BAR other than BAR0
Cons:
- new device ID
Not needed, as Michael suggested. An existing transitional or non-transitional device can expose this optional capability and its attached MMIO region.

Spec changes are similar to #2.
- non-trivial spec changes which end up in the tricky cases that try
to work around legacy to fit a hardware implementation
- works only for the case of virtualization with the help of
mediation; can't work for bare metal
For bare-metal PFs, usually thin hypervisors are used that do very minimal setup. But I agree that bare metal is relatively less important.

- only works for some specific archs without SVQ

That is a legacy limitation that we don't worry about.

2) allow BAR0 to be MMIO for a transitional device

Pros:
- very minor change to the spec
Spec-changes-wise, they are similar to #1.
- works for virtualization (and it works even without dedicated
mediation for some setups)
I am not aware of where it can work without mediation. Do you know any specific kernel version where it actually works?

- works for bare metal for some setups (without mediation)
Cons:
- only works for some specific archs without SVQ
- BAR0 is required

Neither is a limitation, as they mainly come from the legacy side of things.

3) modern device mediation for legacy

Pros:
- no changes in the spec
Cons:
- requires a mediation layer in order to work on bare metal
- requires datapath mediation like SVQ to work for virtualization

A spec change is still required for net and blk because a modern device does not understand legacy, even with a mediation layer:
FEATURE_1, and the RW capability via CVQ, which is not really owned by the hypervisor.
A guest may be legacy or non-legacy, so mediation shouldn't always be done.

Compared to method 2), the only advantage of method 1) is the
flexibility around BAR0, but it has too many disadvantages. If we only
care about virtualization, modern devices are sufficient. Then why
bother with that?

So that a single stack, which doesn't always know which driver version is running in the guest, can utilize it. Otherwise 1.x also ends up doing mediation when the guest driver is 1.x and the device is a transitional PCI VF.

So (1) and (2) are equivalent; one is just more flexible. If you know more valid cases where BAR0 as MMIO can work as-is, that option is open.

We can draft the spec so that the MMIO BAR SHOULD be exposed in BAR0.

