Subject: Re: [virtio-comment] Re: [PATCH 00/11] Introduce transitional mmr pci device


On Mon, Apr 03, 2023 at 03:36:25PM +0000, Parav Pandit wrote:
> 
> 
> > From: virtio-comment@lists.oasis-open.org <virtio-comment@lists.oasis-
> > open.org> On Behalf Of Michael S. Tsirkin
> 
> > > Transport vq for legacy MMR purpose seems fine with its latency and DMA
> > overheads.
> > > Your question was about "scalability".
> > > After your latest response, I am unclear what "scalability" means.
> > > Do you mean saving the register space in the PCI device?
> > 
> > yes that's how you used scalability in the past.
> >
> Ok. I am aligned.
>  
> > > If yes, then no, for legacy guests scalability is not required, because the
> > legacy registers are a subset of 1.x.
> > 
> > Weird. What does guest being legacy have to do with a wish to save registers
> > on the host hardware?
> Because legacy has a subset of the registers of 1.x, so no additional registers are expected on the legacy side.
> 
> > You don't have so many legacy guests as modern
> > guests? Why?
> > 
> This isn't true.
> 
> There is a trade-off: up to a certain N, MMR-based register access is fine.
> This is because 1.x exposes a superset of the legacy registers.
> Beyond a certain point the device will have difficulty doing MMR for both legacy and 1.x.
> At that point, legacy over tvq can scale better, but with much higher latency, an order of magnitude higher compared to MMR.
> If tvq were the only transport for these register accesses, it would hurt at lower scale too, due to its primarily non-register-access nature.
> And scale is relative from device to device.

Wow! Why an order of magnitude?

> > >
> > > > > > And presumably it can all be done in firmware ...
> > > > > > Is there actual hardware that can't implement transport vq but
> > > > > > is going to implement the mmr spec?
> > > > > >
> > > > > Nvidia and Marvell DPUs implement MMR spec.
> > > >
> > > > Hmm implement it in what sense exactly?
> > > >
> > > I do not follow the question.
> > > The proposed series will be implemented as PCI SR-IOV devices using the MMR
> > spec.
> > >
> > > > > Transport VQ has very high latency and DMA overheads for 2 to 4
> > > > > bytes
> > > > read/write.
> > > >
> > > > How many of these 2 byte accesses trigger from a typical guest?
> > > >
> > > Mostly during VM boot time: 20 to 40 register read/write accesses.
> > 
> > That is not a lot! How long does a DMA operation take then?
> > 
> > > > > And before discussing "why not that approach", let's finish
> > > > > reviewing "this
> > > > approach" first.
> > > >
> > > > That's a weird way to put it. We don't want so many ways to do
> > > > legacy if we can help it.
> > > Sure, so let's finish the review of the current proposal details.
> > > At the moment
> > > a. I don't see any visible gain from transport VQ other than the device reset part I
> > explained.
> > 
> > For example, we do not need a new range of device IDs and existing drivers can
> > bind on the host.
> >
> So, unlikely, due to the already discussed limitation of feature negotiation.
> An existing transitional driver would also look for an IO BAR, which is the second limitation.

Some confusion here.
If you have a transitional driver you do not need a legacy device.



> > > b. it can be a path with high latency and DMA overheads on the virtqueue for
> > small read/write accesses.
> > 
> > numbers?
> It depends on the implementation, but at minimum, writes and reads can pay an order of magnitude more, in the 10 msec range.

A single VQ roundtrip takes a minimum of 10 milliseconds? This is indeed
completely unworkable for transport vq. Points:
- Even for memory-mapped access you are saying a single access takes 1 millisecond?
  Extremely slow. Why?
- Why is DMA 10x more expensive? I expect it to be 2x more expensive:
  a normal read goes cpu -> device -> cpu, while DMA does
  cpu -> device -> memory -> device -> cpu (see the rough sketch below).
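
To make the 2x expectation concrete, here is a rough back-of-envelope
sketch; the per-hop cost is a made-up placeholder, not a number from this
thread or from any measurement:

/* Hypothetical cost model: each cpu<->device or device<->memory
 * transfer counts as one "hop" with the same assumed cost. */
#include <stdio.h>

int main(void)
{
	double hop_us = 0.5;               /* assumed cost of one hop, purely illustrative */

	double mmr_read_us = 2 * hop_us;   /* cpu -> device -> cpu */
	double dma_read_us = 4 * hop_us;   /* cpu -> device -> memory -> device -> cpu */

	printf("MMR read ~%.1f us, DMA read ~%.1f us (%.0fx)\n",
	       mmr_read_us, dma_read_us, dma_read_us / mmr_read_us);
	return 0;
}

With equal hop costs this comes out at 2x, not 10x, which is why the
10 msec figure needs explaining.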

The reason I am asking is that it is important for transport vq to have
a workable design.


But let me guess: is there a chance that you are talking about an
interrupt-driven design? *That* is going to be slow, though I don't think
10 msec, more like 10 usec. But I expect transport vq to typically
work by (adaptive?) polling, mostly avoiding interrupts; a rough sketch follows.
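
For what it's worth, a minimal sketch of the polling-with-interrupt-fallback
idea; the helpers vq_poll_used(), vq_enable_irq(), vq_wait_irq() and
cpu_relax() are hypothetical stand-ins for whatever the transport vq driver
would actually provide, and the spin budget is arbitrary:

#include <stdbool.h>

/* Hypothetical helpers -- not an existing driver API. */
bool vq_poll_used(void *vq);    /* true once the device posted a used entry */
void vq_enable_irq(void *vq);   /* re-arm the completion interrupt */
void vq_wait_irq(void *vq);     /* sleep until the interrupt fires */
void cpu_relax(void);

void tvq_wait_reply(void *vq)
{
	/* Busy-poll for a short budget first: a register-sized reply should
	 * complete in microseconds, so most requests never take an interrupt. */
	for (int spins = 0; spins < 10000; spins++) {
		if (vq_poll_used(vq))
			return;
		cpu_relax();
	}

	/* Fall back to interrupts only if the device is slow to respond. */
	vq_enable_irq(vq);
	if (!vq_poll_used(vq))      /* re-check to close the race with re-arming */
		vq_wait_irq(vq);
}

An adaptive variant would tune the spin budget based on recent completion
times instead of hardcoding it.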

-- 
MST


