OASIS Mailing List ArchivesView the OASIS mailing list archive below
or browse/search using MarkMail.


Help: OASIS Mailing Lists Help | MarkMail Help

virtio-dev message

[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]

Subject: Re: [virtio-dev] RFC: Doorbell suppression, packed-ring mode and hardware offload

On Mon, Feb 11, 2019 at 08:52:01AM +0000, David Riddoch wrote:
> On 11/02/2019 07:33, Michael S. Tsirkin wrote:
> > On Fri, Feb 01, 2019 at 02:23:55PM +0000, David Riddoch wrote:
> > > All,
> > > 
> > > I'd like to propose a small extension to the packed virtqueue mode.  My
> > > proposal is to add an offset/wrap field, written by the driver, indicating
> > > how many available descriptors have been added to the ring.
> > > 
> > > The reason for wanting this is to improve performance of hardware devices.
> > > Because of high read latency over a PCIe bus, it is important for hardware
> > > devices to read multiple ring entries in parallel.  It is desirable to know
> > > how many descriptors are available prior to issuing these reads, else you
> > > risk fetching descriptors that are not yet available.  As well as wasting
> > > bus bandwidth this adds complexity.
> > > 
> > > I'd previously hoped that VIRTIO_F_NOTIFICATION_DATA would solve this
> > > problem,
> > Right. And this seems like the ideal solution latency-wise since
> > this pushes data out to device without need for round-trips
> > over PCIe.
> Yes, and I'm not proposing getting rid of it.  I'd expect a PCIe device to
> use both features together.
> > > but we still have a problem.  If you rely on doorbells to tell you
> > > how many descriptors are available, then you have to keep doorbells enabled
> > > at all times.
> > I would say right now there are two modes and device can transition
> > between them at will:
> > 
> > 1. read each descriptor twice - once speculatively, once
> >     to get the actual data
> > 
> >     optimal for driver suboptimal for device
> You might read each descriptor multiple times in some scenarios. Reading
> descriptors in batches is hugely important given the latency and overheads
> of PCIe (and lack of adjacent data fetching that caching gives you).
> > 2. enable notification for each descritor and rely on
> >     these notifications
> > 
> >     optimal for device suboptimal for driver
> > 
> > > This can result in a very high rate of doorbells with some
> > > drivers, which can become a severe bottleneck (because x86 CPUs can't emit
> > > MMIOs at very high rates).
> > Interesting. Is there any data you could share to help guide the design?
> > E.g. what's the highest rate of MMIO writes supported etc?
> On an E3-1270 v5 @ 3.60GHz, max rate across all cores is ~18M/s.
> On an E5-2637 v4 @ 3.50GHz, max rate on PCIe-local socket is ~14M/s.  On
> PCIe-remote socket ~8M/s.

Is that mega bytes? Or million writes?
With 32 bit writes are we limited to 2M then?

> This doesn't just impose a limit on aggregate packet rate: If you hit this
> bottleneck then the CPU core is back-pressured by the MMIO writes, and so
> instr/cycle takes a huge hit.
> > > The proposed offset/wrap field allows devices to disable doorbells when
> > > appropriate, and determine the latest fill level via a PCIe read.
> > This kind of looks like a split ring though, does it not?
> I don't agree, because the ring isn't split.  Split-ring is very painful for
> hardware because of the potentially random accesses to the descriptor table
> (including pointer chasing) which inhibit batching of descriptor fetches, as
> well as causing complexity in implementation.

That's different with IN_ORDER, right?

> Packed ring is a huge improvement on both dimensions, but introduces a
> specific pain point for hardware offload.
> > The issue is
> > we will again need to bounce more cache lines to communicate.
> You'll only bounce cache lines if the device chooses to read the idx.  A PV
> device need not offer this feature.  A PCIe device will, but the cost to the
> driver of writing an in-memory idx is much smaller than the cost of an MMIO,
> which is always a strongly ordered barrier on x86/64.
> With vDPA you ideally would have this feature enabled, and the device would
> sometimes be PV and sometimes PCIe.  The PV device would ignore the new idx
> field and so cache lines would not bounce then either.
> Ie. The only time cache lines are shared is when sharing with a PCIe device,
> which is the scenario when this is a win.

True. OTOH on non-x86 you will need more write barriers :( It would be
natural to say driver can reuse the barrier before flags update, but note
that that would mean hardware still needs to support re-fetching if
flags value is wrong. Such re-fetch is probably rare so fine by me, but it
does add a bit more complexity.

> > So I wonder: what if we made a change to spec that would allow prefetch
> > of ring entries?  E.g. you would be able to read at random and if you
> > guessed right then you can just use what you have read, no need to
> > re-fetch?
> Unless I've misunderstood I think this would imply that the driver would
> have to ensure strict ordering for each descriptor it wrote, which would
> impose a cost to the driver.  At the moment a write barrier is only needed
> before writing the flags of the first descriptor in a batch.

On non-x86 right?  OTOH the extra data structure also adds more ordering

> > > I suggest the best place to put this would be in the driver area,
> > > immediately after the event suppression structure.
> > Could you comment on why is that a good place though?
> The new field is written by the driver, as are the other fields in the
> driver area.  Also I expect that devices might be able to read this new idx
> together with the interrupt suppression fields in a single read, reducing
> PCIe overheads.

OK then you need it in the same aligned 256 byte boundary.

> Placing the new field immediately after the descriptor ring would also work,
> but lose the benefit of combining reads, and potentially cause drivers to
> allocate a substantially bigger buffer (as I expect the descriptor ring is
> typically a power-of-2 size and drivers allocate multiples of page size).
> Placing the new field in a separate data structure is undesirable, as it
> would require devices to store a further 64bits per virt-queue.
> Thanks,
> David
> > > Presumably we would like this to be an optional feature, as implementations
> > > of packed mode already exist in the wild.  How about
> > > If I prepare a patch to the spec is there still time to get this into v1.1?
> > Any new feature would require another round of public review.
> > I personally think it's better to instead try to do 1.2 soon after,
> > e.g. we could try to target quarterly releases.
> > 
> > But ultimately that would be up to the TC vote.
> -- 
> David Riddoch  <driddoch@solarflare.com> -- Chief Architect, Solarflare
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org

[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]