Subject: Re: [virtio-dev] RFC: Doorbell suppression, packed-ring mode and hardware offload


On Wed, Feb 13, 2019 at 06:33:50PM +0800, Jason Wang wrote:
> 
> > On 2019/2/13 1:35 PM, Michael S. Tsirkin wrote:
> > On Tue, Feb 12, 2019 at 04:47:08PM +0000, David Riddoch wrote:
> > > > > > > I would prefer to have the write barrier before writing the idx.
> > > > > > Well that's driver overhead for something device might never utilise in
> > > > > > a given workload. If we are optimizing let's optimize for speed.
> > > > > I think doing the barrier before writing idx is best for speed (see below).
> > > > I don't see it below:(
> > > Sorry, I'm not being clear. If the write barrier is before the idx, then a
> > > PV device can read the idx, do a single rmb and read a whole bunch of
> > > descriptors. As things stand today a PV device has to do an rmb for each
> > > descriptor that it reads.
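For concreteness, the device side with the barrier before the idx
would look something like this - untested split-ring sketch, with
READ_ONCE and the last_seen counter purely illustrative:

	u16 idx = READ_ONCE(avail->idx);
	if (idx == last_seen)
		return 0;	/* nothing new */
	rmb();	/* one barrier covers the whole batch */
	while (last_seen != idx)
		process(&desc[avail->ring[last_seen++ % size]]);

as opposed to an rmb() per descriptor inside the loop today.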
> > > 
> > > I'm not sure, but this may be what Jason meant when he said "prefetch".
> 
> 
> Yes.
> 
> 
> > Oh. Right. So for PV it's easy to convert an rmb to a data dependency
> > instead.  It remains to be seen whether an extra cache miss per
> > batch is cheaper or more expensive than such a dependency
> > per descriptor.
> > 
> > I hope Jason can experiment and let us know.
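For reference, the conversion would be along these lines - untested,
and dependent_ptr() is a made-up stand-in for a helper in the spirit
of the proposed dependent_ptr_mb(), since a plain control dependency
orders loads against later stores but not against later loads:

	u16 flags = READ_ONCE(desc->flags);
	if (!is_avail(wrap, flags))
		return false;
	/* derive the pointer from the flags value so the body reads
	 * carry a data dependency on the flags read; note that a
	 * literal 'flags & 0' would be optimized away by the compiler */
	d = dependent_ptr(desc, flags);
	addr = READ_ONCE(d->addr);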
> > 
> > 
> 
> To clarify, I actually played with something like:
> 
> 
> if (desc_is_avail((last_avail + 32) % size))
>     [last_avail, last_avail + 32] is avail [1]
> else if (desc_is_avail((last_avail + 1) % size))
>     last_avail is avail
> else
>     no new
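(For reference, desc_is_avail() for the packed ring would be along
these lines - untested sketch, taking a descriptor pointer rather
than a ring index, with the AVAIL/USED bit numbers as in the
virtio 1.1 spec:

	static bool desc_is_avail(const struct vring_packed_desc *d,
				  bool wrap)
	{
		u16 flags = READ_ONCE(d->flags);
		bool avail = flags & (1 << VRING_PACKED_DESC_F_AVAIL);
		bool used = flags & (1 << VRING_PACKED_DESC_F_USED);

		/* available iff AVAIL matches the wrap counter
		 * and USED does not */
		return avail == wrap && used != wrap;
	}
)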
> 
> 
> And it looks like [1] depends on the driver doing:
> 
> for all descriptors
>     update desc (but do not yet write the flag that makes it available)
> 
> wmb()
> 
> for all descriptors
>     update desc.flags
> 
> 
> This means if last_avail + 32 is avail, no need to check flags of
> [last_avail, last_avail + 31] since the descriptor content (except for the
> flag) is guaranteed to be correct.
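On the driver side that would be roughly the following - rough sketch,
ignoring ring wrap and descriptor chains; publish_batch() and
struct buf are made up:

	void publish_batch(struct vring_packed_desc *ring, u16 head,
			   int n, const struct buf *bufs, u16 flags)
	{
		int i;

		/* write all descriptor bodies first ... */
		for (i = 0; i < n; i++) {
			ring[head + i].addr = bufs[i].addr;
			ring[head + i].len = bufs[i].len;
			ring[head + i].id = bufs[i].id;
		}
		/* ... one barrier ... */
		wmb();
		/* ... then publish everything by writing the flags */
		for (i = 0; i < n; i++)
			ring[head + i].flags = flags;
	}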

We can try looking into that, sure. It's not just the wmb, though,
which is free on x86 anyway. There are also cases such as chained
descriptors to consider, where we do not want the device to see a
partial chain.


> I get ~10% PPS improvement.
> 
> Thanks

I suspect most of the gain isn't in the check itself, though. It's
probably from reading the descriptors in a batch and maybe loop
unrolling. Something like:

__attribute__((optimize("unroll-loops")))
static bool fast_get_descriptors(struct vq *vq, struct desc *desc)
{
	u16 aflags = 0xffff;
	u16 oflags = 0x0;
	int i;

	/* copy X descriptors from the ring into desc[] */
	...

	for (i = 0; i < X; ++i) {
		aflags &= desc[i].flags;
		oflags |= desc[i].flags;
	}
	/* AND == OR iff all X flag words are identical */
	if (aflags != oflags)
		return false;

	return is_avail(vq->wrapcount, aflags);
}


I also think 32 is very aggressive; I would start with 8.


With the above, you could try:

	if (fast_get_descriptors(vq, desc))
		[last_avail, last_avail + 32] is avail [1]
	else if (desc_is_avail((last_avail + 1) % size))
		last_avail is avail
	else
		no new


-- 
MST

