Subject: RE: [virtio-comment] RE: [PATCH v2 1/2] transport-pci: Introduce legacy registers access commands


> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Monday, May 22, 2023 5:35 PM

> 
> iokit is from my previous work experience not virtio.
> For virtio we had a lot of trouble with hacks like this on windows.
I don't know why a Windows driver would need to talk to the PF driver; until now the AQ didn't even exist.
> That's an important platform.
> 
I am not sure how important it is.
What I have learnt is that a decade has passed, and even after that virtio drivers are not part of the OS distro.

> > > If you are mapping VF BAR, this has to be done by VF driver, and
> > > please just avoid pain for everyone by exposing the necessary info
> > > through the VF itself.
> > > A capability seems perfect for this.
> > >
> > Why not the MMIO region as proposed in v0?
> 
> Are you asking what were issues with v0? The issue with v0 was not MMIO as
> such at least for me.  The issue was that it was not giving us any way to work
> around bugs in the legacy interface. AQ was a fix for that.
> 
It doesn't need to, because the guest is going to do whatever it does anyway.
The clear requirement is: not to fix legacy, but to support legacy.
Fixing legacy is the role of 1.x, not the role of the legacy interface itself tunneled over AQ/MMIO.

And even if one desires to coalesce MMIO writes, it can still be done for the MAC address.
We discussed this.
We also discussed that kernels as old as 2.6.32 use the CVQ for the MAC address.

And the device can even apply the MAC address once the count of byte writes reaches 6, if it wishes to.
> 
> > Capability based read/write requires extra redirection, it is less optimal than
> MMIO.
> >
> > > An alternative is not to map VF BAR at all, steal space from PF BAR instead.
> > Come an iokit will be able to access this PF BAR without binding to it?
> 
> I don't get what you are saying here.  You have to bind a PF driver.
> Otherwise you can not send AQ command. That driver can map PF BAR into
> applications.
> 
This is the usual model that we follow. When the PF driver is serving and emulating, it provides the shared platform for other devices' legacy functionality.
It is for legacy, not for any other newer interface.

And for the purpose of BAR mapping, such a BAR in the VF can be mapped too.
You didn't answer why a VF BAR is a problem while suggesting a PF BAR instead.
Such a VF BAR already exists; asking for new resource creation doesn't fit your "simple" scheme suggestion either.

> 
> > > Guest is not accessing this anyway. So I really do not see why not
> > > from software point of view. There's a hardware reason? Could you
> > > talk to hardware guys about this? You objected to AQ too then
> > > hardware guys told you it is not a big deal.
> > I didn't object to AQ on hw reason.
> > I explained AQ is not the best compared to MMIO because it is slow
> compared to MMIO by law of physics as it needs to move more bits than
> MMIO.
> > And so, the capability based tunneling too is slow to MMIO.
> > >
> > > > > So we have legacy emulation send commands to VF or to PF.  Okay.
> > > > > But let us avoid the need for VF driver to send commands to PF to
> initialize.
> > > > > Just get all information it needs from VF itself.
> > > > >
> > > > >
> > > > > Maybe it's a good idea to reuse existing notification
> > > > > capability, or maybe a new one, but let's avoid making VF driver
> > > > > depend on PF
> > > commands.
> > > > >
> > > > We agreed in v1 on Jason's suggestion to have the AQ command and
> > > > yet you and Jason hinder this in v2 with this exact repeated question.
> > > > Lets please avoid this and move forward.
> > >
> > > How was I supposed to raise this issue on v1 if it was not there?
> > > Stop arguing in circles and things will move forward.
> > v1 version clearly had AQ commands.
> >
> > Please revisit your comments.
> > v0 -> you suggested:
> >  - "why not AQ, like transport vq, it can coalesce partial mac writes".
> >
> > v1-> over AQ, you suggested:
> >  - to "drop the size"
> > - summarized to split 2 commands to 4 commands to split device and config
> area.
> >
> > After that summary, you propose, why not self-contained in VF itself? -> circle
> back to v0 method.
> > (Instead of mmio, suggesting capability, but not explaining why mmio is not
> fine. Which is better than cap).
> >
> > Then now you propose "have BAR in the PF which is owned by other PF
> driver", with least reasoning.
> > And it convolutes with your suggestion of PF driver access...
> >
> > So, I am bit lost on your circular proposal, or it was just a probe question...
> > Not sure.
> 
> You did move things to AQ however not completely.  You kept notification in
> MMIO. If everything was in AQ we would be done as far as I am concerned
> (though Jason still does not like duplication of functionality with transport vq,
> but that is a separate issue).
> 
It is a wrong design point to ask to "move everything to AQ", because the AQ cannot be the data link layer for everything.
By the same logic one can ask: why don't I put my 1 GB of data inside the AQ itself? The AQ should do everything...
Why not have the device build MBs of shared memory for MMIO, queue data, and so on; one shared memory should do everything.

And a list like that can continue with no end.

We have seen the perf numbers with the IO BAR and the memory BAR for PCI devices in the lab.
For the PCI transport, perf-wise, the memory BAR is best for notification, the IO BAR is second, and the AQ is third.
They are an order of magnitude apart in latency.
So using the AQ to send doorbells only looks good in the PDF, much like INTx looks good in the PDF with SR-IOV but doesn't work well perf-wise.

> As notification is in VF MMIO I am trying to help you by suggesting other ways
> to keep things consistent. If it's in VF I think it follows that where it is has to be
> in VF as well. Not in AQ.
Huh.
"If notification for the VF is in the VF, hence other config registers must also be in the VF" -- such blanket statements do not help.

I explained in the past that notifications are latency sensitive, and they are best served using MMIO.

Can one implement legacy registers as MMIO as proposed in v0?
Yes, but it comes at a high cost to build such hw.
Is there a better, more optimal option for legacy?
Yes, i.e. use the AQ, which is what we wrote in the cover letter of the AQ series.

b. Use an MMIO register pair on the VF itself, for example:
legacy_register_addr = offset + size + operation + completion
legacy_register_data = register content

An example sequence is:
1. hv sw writes legacy_register_addr = offset=0x2, size=0x4, operation=read
2. hv sw polls for legacy_register_addr = offset=0x2, size=0x4, operation=read, cmpl=1
3. hv sw reads the register data from legacy_register_data

This way there is a small register footprint on the device. It is self-contained, and it works for the device reset too.
Less memory footprint on the device as well.

Does this solve the legacy interface problems? No. Because that is not the objective.

Is this small enough to carry in the spec without disrupting it too much? -> yes.
Is this lighter for sw and device to implement? -> yes.
Does it work at scale with many VFs? -> yes.
Does it work as-is with legacy sw? -> no.

> 
> And look if that just serves to annoy I can just point out issues and complain
> without suggestions. Was trying to be helpful here really.
> 
Your comments and suggestions are helpful.
Some of them follow a loop that doesn't help.
Again, thanks a lot for the inputs and reviews.

The design discussion helped to produce two solutions here, beyond what was started in v0.

> > >
> > > If we have trouble converging on the notification thing, how about
> > > we make progress without it for now?  Merge a slow version that
> > > sends kicks through aq, then work on the best way to make notifications
> faster, separately.
> > >
> > Sure, we can differ, but the VF has the notification BAR to utilize.
> > So why not use it?
> 
> Are you asking why is merging a smaller partial patch then adding functionality
> a good idea?
> 
I am asking: a 1.x PCI device already has the VF notification BAR, so why don't we use it? Why build an additional one in the device?

> > > VF(member) driver can't implement AQ commands at all. They are sent
> > > to
> > > PF(owner) thus by PF(owner) driver.  In virt configs, VMM can trap
> > > legacy access and talk to owner driver to send them.  It can also
> > > talk to member driver to send notifications.  You can I guess have
> > > VMM get member BAR offsets from owner and pass them to member for
> > > mapping. This raises all kind of questions of trust.
> > Only the member driver has access to the owner driver via kernel API.
> 
> Assuming you can predict existance of specific kernel APIs on all OSes is IMO a
> fool's errand. 
A specific kernel API is not expected.
A framework is expected in the OS.
And sw improvises as use cases arise.

Driver-to-driver communication may be possible in WDM using [1].
Outside of virtio, I can imagine that PRI-level functionality would also need an interaction between two drivers.

[1] https://learn.microsoft.com/en-us/windows-hardware/drivers/kernel/different-ways-of-handling-irps-cheat-sheet#scenario-3-forward-with-a-completion-routine

> > Part of the VF BAR information at the PCI level is also on the PF.
> > There is no trust break here. A kernel exposes single self-contained object to
> user space VMM.
> 
> That is ok.  The trust break is if PF driver is in userspace. You then are trusting
> userspace to pass offsets back to kernel VF driver.
> 
> 
I don't see how MMIO over the PF is somehow more secure than the AQ over the PF, regardless of user or kernel space.
I don't see how this is a problem.
Even today, when/if the IO BAR is trapped by user space, the forwarding layer to the kernel can modify the offset and value without any AQ etc.

I don't even see how an OS can spawn SR-IOV VFs from a user space driver in the kernel.

Look, one can even hand the PF from user space to yet another guest VM and add ten other complexities to put oneself at a dead end.

> 
> > > If you can map BAR from PF that would be simplest, VMM then only
> > > pokes at PF and not VF. If not then we really should expose this
> > > info n the VF if at all possible.
> > >
> > There is clearly no need to use PF for notification when the notification
> method already exists on the VF.
> 
> All this looks backwards with mixed up functionality.  What VF has is modern
> BAR. If you insist on reusing modern functionality for legacy when it suits you
> then maybe we should reuse it for everything, that would be extending
> transport vq to support legacy somehow.
> 
You bring back the same point that we already closed in the discussion.
It is not about suiting me.
It is what is there in the spec and in the device for a transitional device (not specific to 1.x).
The design criteria are:
a. Use as much as possible from the 1.x and transitional device for the legacy support, so that the spec changes and the device-side and sw-side changes are minimal.
The AQ fits this criterion very well.

b. Avoid inventing new interfaces between driver and device, specifically at the PCI device level, for things that are often implemented in hw.
Here also the AQ scores well.
The two-register MMIO example also works well.

And the notification also works well, as it reuses the transitional device's notification.

3 out of 4 functionalities are done on the AQ rather than a TVQ.
So it is time to stop asking to replicate the AQ functionality on some new TVQ.
This was discussed in depth.

> 
> > In fact, it is yet extra steering entry for the hw to perform such mapping.
> >
> > To iterate what presented yday in [1] to Jason.
> >
> > [1]
> > https://lists.oasis-open.org/archives/virtio-comment/202305/msg00298.h
> > tml
> >
> > 1. legacy register access via AQ (v1)
> > Pros:
> > a. Light weight for hypervisor and devices (mainly PCI) to implement.
> > b. Enables sw to coalesce some device specific registers, if needed.
> > Cons:
> > a. Not self-contained, requires PF's AQ which is anyway designed for such
> purpose.
> >
> > 2. Legacy registers access via new MMIO region (v0 + sane reset)
> 
> I think I would buy this one if it was actually in BAR0 exactly legacy with no
> changes, no reset no nothing. a hack like Jason described that at least makes
> legacy linux drivers work with no software glue.  That is horrible, fragile but
> engineering history is full of hacks like this.
> If you need software in the middle then let's please make it robust.
> 
Your inclination to see things only from the sw pov is not helpful here.
The feedback given here from the PCI device side needs to be considered too.

This approach doesn't work well at scale for the device reset, which is a one-way signal.

Hence, I proposed either the AQ over the PF or the two-register scheme above over MMIO on the VF.

Does that work for you?

> 
> > Pros:
> > a. smaller code in slow register access than AQ b. Sw cannot coalesce
> > some device specific registers.
> > c. Self-contained in the VF
> >
> > Cons:
> > a. Relatively burdensome for the device as it requires more RW registers at
> scale, but it can be indirect register.
> 
> what I keep saying
> 
> b. a ton of legacy bugs which hypervisor can't detect or work around
>    (reset is just one issue)
> 
> 
> > 3. legacy registers tunneling with additional indirection via PCI
> > capability
> 
> 
> I don't know what's this one.

The above example of legacy_register_addr and legacy_register_data on the VF.

