virtio-comment message

Subject: Re: [PATCH v2 0/2] transport-pci: Introduce legacy registers access using AQ
From: Jason Wang <jasowang@redhat.com>
To: Parav Pandit <parav@nvidia.com>
Date: Tue, 9 May 2023 11:44:30 +0800
On Tue, May 9, 2023 at 1:08 AM Parav Pandit <parav@nvidia.com> wrote:
>
>
>
> On 5/7/2023 10:23 PM, Jason Wang wrote:
> > On Sun, May 7, 2023 at 9:44 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> >>
> >> On Sat, May 06, 2023 at 10:31:30AM +0800, Jason Wang wrote:
> >>> On Sat, May 6, 2023 at 8:02 AM Parav Pandit <parav@nvidia.com> wrote:
> >>>>
> >>>> This short series introduces legacy registers access commands for the owner
> >>>> group member PCI PF to access the legacy registers of the member VFs.
> >>>>
> >>>> If in future any SIOV devices need to support legacy registers, they
> >>>> can be easily supported using the same commands by using the group
> >>>> member identifiers of the future SIOV devices.
> >>>>
> >>>> More details as overview, motivation, use case are further described
> >>>> below.
> >>>>
> >>>> Patch summary:
> >>>> --------------
> >>>> patch-1 adds administrative virtqueue commands
> >>>> patch-2 adds its conformance section
> >>>>
> >>>> This short series is on top of latest work [1] from Michael.
> >>>> It uses the newly introduced administrative virtqueue facility with 3 new
> >>>> commands which uses the existing virtio_admin_cmd.
> >>>>
> >>>> [1] https://lists.oasis-open.org/archives/virtio-comment/202305/msg00112.html
> >>>>
> >>>> Usecase:
> >>>> --------
> >>>> 1. A hypervisor/system needs to provide transitional
> >>>>     virtio devices to the guest VM at scale of thousands,
> >>>>     typically, one to eight devices per VM.
> >>>>
> >>>> 2. A hypervisor/system needs to provide such devices using a
> >>>>     vendor agnostic driver in the hypervisor system.
> >>>>
> >>>> 3. A hypervisor system prefers to have single stack regardless of
> >>>>     virtio device type (net/blk) and be future compatible with a
> >>>>     single vfio stack using SR-IOV or other scalable device
> >>>>     virtualization technology to map PCI devices to the guest VM.
> >>>>     (as transitional or otherwise)
> >>>>
> >>>> Motivation/Background:
> >>>> ----------------------
> >>>> The existing virtio transitional PCI device is missing support for
> >>>> PCI SR-IOV based devices. Currently it does not work beyond
> >>>> PCI PF, or as software emulated device in reality. Currently it
> >>>> has below cited system level limitations:
> >>>>
> >>>> [a] PCIe spec citation:
> >>>> VFs do not support I/O Space and thus VF BARs shall not indicate I/O Space.
> >>>>
> >>>> [b] cpu arch citation:
> >>>> Intel 64 and IA-32 Architectures Software Developer's Manual:
> >>>> The processor's I/O address space is separate and distinct from
> >>>> the physical-memory address space. The I/O address space consists
> >>>> of 64K individually addressable 8-bit I/O ports, numbered 0 through FFFFH.
> >>>>
> >>>> [c] PCIe spec citation:
> >>>> If a bridge implements an I/O address range,...I/O address range will be
> >>>> aligned to a 4 KB boundary.
> >>>>
> >>>> Overview:
> >>>> ---------
> >>>> Above usecase requirements can be solved by PCI PF group owner accessing
> >>>> its group member PCI VFs legacy registers using an admin virtqueue of
> >>>> the group owner PCI PF.
> >>>>
> >>>> Two new admin virtqueue commands are added which read/write PCI VF
> >>>> registers.
> >>>>
> >>>> The third command suggested by Jason queries the VF device's driver
> >>>> notification region.
> >>>>
> >>>> Software usage example:
> >>>> -----------------------
> >>>> One way to use and map to the guest VM is by using vfio driver
> >>>> framework in Linux kernel.
> >>>>
> >>>>                  +----------------------+
> >>>>                  |pci_dev_id = 0x100X   |
> >>>> +---------------|pci_rev_id = 0x0      |-----+
> >>>> |vfio device    |BAR0 = I/O region     |     |
> >>>> |               |Other attributes      |     |
> >>>> |               +----------------------+     |
> >>>> |                                            |
> >>>> +   +--------------+     +-----------------+ |
> >>>> |   |I/O BAR to AQ |     | Other vfio      | |
> >>>> |   |rd/wr mapper  |     | functionalities | |
> >>>> |   +--------------+     +-----------------+ |
> >>>> |                                            |
> >>>> +------+-------------------------+-----------+
> >>>>         |                         |
> >>>>    +----+------------+       +----+------------+
> >>>>    | +-----+         |       | PCI VF device A |
> >>>>    | | AQ  |-------------+---->+-------------+ |
> >>>>    | +-----+         |   |   | | legacy regs | |
> >>>>    | PCI PF device   |   |   | +-------------+ |
> >>>>    +-----------------+   |   +-----------------+
> >>>>                          |
> >>>>                          |   +----+------------+
> >>>>                          |   | PCI VF device N |
> >>>>                          +---->+-------------+ |
> >>>>                              | | legacy regs | |
> >>>>                              | +-------------+ |
> >>>>                              +-----------------+
> >>>>
> >>>> 2. Virtio pci driver to bind to the listed device id and
> >>>>     use it as native device in the host.
> >>>>
> >>>> 3. Use it in a light weight hypervisor to run bare-metal OS.
> >>>>
> >>>> Please review.
> >>>>
> >>>> Fixes: https://github.com/oasis-tcs/virtio-spec/issues/167
> >>>> Signed-off-by: Parav Pandit <parav@nvidia.com>
> >>>>
> >>>> ---
> >>>> changelog:
> >>>> v1->v2:
> >>>> - addressed comments from Michael
> >>>> - added theory of operation
> >>>> - grammar corrections
> >>>> - removed group fields description from individual commands as
> >>>>    it is already present in generic section
> >>>> - added endianness normative for legacy device registers region
> >>>> - renamed the file to drop vf and add legacy prefix
> >>>> - added overview in commit log
> >>>> - renamed subsection to reflect command
> >>>
> >>> So as replied in V1, I think it's not a good idea to invent commands
> >>> for a partial transport just for legacy devices. It's better either:
> >>>
> >>> 1) rebase or collaborate this work on top of the transport virtqueue
> >>>
> >>> or
> >>>
> >>> 2) having a PCI over admin virtqueue transport, since this proposal
> >>> has already had BAR access, we can add config space access then it is
> >>> self-contained so we don't need to go through every corner case like
> >>> inventing dedicated commands to accessing some function that is
> >>> duplicated with capabilities. It will become yet another transport and
> >>> legacy support is just a good byproduct.
> >>>
> >>> Thanks
> >>
> >>
> >> I thought so too originally. Unfortunately I now think that no, legacy is not
> >> going to be a byproduct of transport virtqueue for modern -
> >> it is different enough that it needs dedicated commands.
> >
> > If you mean the transport virtqueue, I think some dedicated commands
> > for legacy are needed. Then it would be a transport that supports
> > transitional devices. It would be much better than having commands for
> > a partial transport like this patch did.
> >
> >> Consider simplest case, multibyte fields. Legacy needs multibyte write,
> >> modern does not even need multibyte read.
> >
> > I'm not sure I will get here, since we can't expose admin vq to
> > guests, it means we need some software mediation. So if we just
> > implement what PCI allows us, then everything would be fine (even if
> > some method is not used).
> >
> > Thanks
>
> The fundamental reason for not accessing the 1.x VF and SIOV device
> registers, config space, feature bits through PF is: it requires PF
> device mediation. VF and SIOV devices are first-class citizens in the PCIe
> spec and deserve direct interaction with the device.

Unless I'm missing something obvious, SIOV requires mediation (or
composition) for sure. Otherwise you break compatibility.

>
> Hence, the transport we build should keep this in mind for the coming
> future.

For transport virtqueue, it's not specific to PCI. It could be used in
a much broader use case.

> So if each VF has its own configq or cmdq, it totally makes sense to me
> as a bootstrap interface to transport the existing config space interface.
> The problem is that it is not backward compatible;
> hence a device has no way of knowing when to support both or only the new configq.

Providing compatibility in software is much simpler than
inventing new hardware interfaces, isn't it? (e.g. if we want to
provide compatibility for VF on a SIOV device). And inventing a new
hardware interface for compatibility might not always work; it may
undermine the advantages of the new hardware (like scalability).

>
> So ever-growing fields and their optional placement on a configq don't
> really help a device builder to build it efficiently (there is not much
> predictability).

A config queue is not the only choice; we have a lot of other choices
(for example, PASID may help to reduce the on-chip resources).

>
> Instead, if we say that what exists today in config space stays in
> config space and anything additional goes on the new q, then it is
> deterministic behavior for sizing up the scale.

Just to be clear, if we have PCI over adminq, the VF's config space could
be the minimal one required for PCI spec compliance. The real config space
is accessed via the admin virtqueue.
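
As a rough illustration of that split, here is a toy model in which software mediation forwards a VF's config accesses over a queue owned by the PF. Everything in it is an assumption made for illustration: the class names, the method names and the 256-byte size are hypothetical and come from neither the spec nor this series, and the "queue" is just an in-memory dictionary rather than a real virtqueue.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ToyPfAdminQueue:
    """Hypothetical stand-in for the PF's admin virtqueue.

    A real device would consume descriptors; this toy keeps one
    bytearray of "real" config space per VF, keyed by member id.
    """
    space_size: int = 256  # assumed size, illustration only
    vf_config: Dict[int, bytearray] = field(default_factory=dict)

    def _space(self, vf_id: int) -> bytearray:
        return self.vf_config.setdefault(vf_id, bytearray(self.space_size))

    def cfg_read(self, vf_id: int, offset: int, length: int) -> bytes:
        space = self._space(vf_id)
        if offset < 0 or length < 0 or offset + length > len(space):
            raise ValueError("access outside config space")
        return bytes(space[offset:offset + length])

    def cfg_write(self, vf_id: int, offset: int, data: bytes) -> None:
        space = self._space(vf_id)
        if offset < 0 or offset + len(data) > len(space):
            raise ValueError("access outside config space")
        space[offset:offset + len(data)] = data


@dataclass
class MediatedVf:
    """Software mediation layer: the guest-visible accessors only
    forward to the PF's queue, tagged with this VF's member id."""
    aq: ToyPfAdminQueue
    vf_id: int

    def read_config(self, offset: int, length: int) -> bytes:
        return self.aq.cfg_read(self.vf_id, offset, length)

    def write_config(self, offset: int, data: bytes) -> None:
        self.aq.cfg_write(self.vf_id, offset, data)


aq = ToyPfAdminQueue()
vf1 = MediatedVf(aq, vf_id=1)
vf2 = MediatedVf(aq, vf_id=2)
vf1.write_config(0x10, b"\x34\x12")
print(vf1.read_config(0x10, 2).hex(), vf2.read_config(0x10, 2).hex())
```

The point of the sketch is only that the per-VF state lives behind the PF's queue and is selected by a member identifier, so the VF itself needs to expose very little.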

>
> For example, a PCI device who wants to support 100 VFs, can easily size
> its memory to 30 bytes * 100 reserved for supporting config space.

Those capabilities (30 bytes) can be accessed via admin virtqueue. So
we don't need to place them in the config space.

> And new 40 bytes * 100 fields doesn't have to be in the resident memory.
>
> If we have optional configq/cmdq for transport, then 30*100 bytes are
> used (reserved) as 3000/(30+40) = 42 VFs.
>
> Only if some VFs use configq, more VFs can be deployed.

I don't understand here.
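
For reference, the arithmetic in the quoted example can be reproduced with the minimal sketch below. It rests on one reading of the quoted text, which may not be the intended one: the device reserves a fixed memory budget sized for 100 VFs at 30 bytes each, and if every VF must instead keep both the 30 existing bytes and the 40 new bytes resident, fewer VFs fit. The function and variable names are made up for illustration.

```python
def max_vfs(budget_bytes: int, bytes_per_vf: int) -> int:
    """Number of VFs whose per-VF state fits in a fixed memory budget."""
    if bytes_per_vf <= 0:
        raise ValueError("bytes_per_vf must be positive")
    return budget_bytes // bytes_per_vf


EXISTING_BYTES = 30   # per-VF config space fields that exist today (from the quoted text)
NEW_BYTES = 40        # per-VF new fields (from the quoted text)
PLANNED_VFS = 100     # VFs the device was sized for (from the quoted text)

# Assumed: the budget is exactly what was reserved for the existing fields.
budget = EXISTING_BYTES * PLANNED_VFS

vfs_existing_only = max_vfs(budget, EXISTING_BYTES)
vfs_all_resident = max_vfs(budget, EXISTING_BYTES + NEW_BYTES)

print(budget, vfs_existing_only, vfs_all_resident)
```

Under that reading, the same 3000 bytes cover 100 VFs when only the existing fields are resident and 42 VFs when all 70 bytes per VF must be resident.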

> It is hard to
> build scale this way. Therefore the suggestion is to place new attributes on
> a new config/cmd/transport q, and the old ones stay as-is.

Just to be sure we're on the same page: both your proposal and mine
are based on an adminq on the PF, not the VF. The reason is obvious:
an adminq per VF won't work without PASID, since it would have security
issues.

>
> The legacy infra is unfortunately for the exception path due to
> history; hence they are different commands, as Michael suggests.
>

Thanks