virtio-dev message

Subject: Re: [virtio-dev] packed ring layout proposal

From: "Pierre Pfister (ppfister)" <ppfister@cisco.com>
To: "Michael S. Tsirkin" <mst@redhat.com>
Date: Tue, 20 Sep 2016 08:33:02 +0000

More inline, but first I want to propose something.

One of the expensive operation in device workload is to map physical address found in the descriptor to
the virtual address.

The reason being that there are multiple memory regions.

When there are only a few regions, optimal implementation uses a loop.
Otherwise, a tree lookup can be used.
In any case, it is costy.

So here is the proposal.

The driver numbers memory regions and provides this numbering to the device, such that
both device and driver have the same memory region indexing.

Then, instead of transmitting '__le64 addr' in the descriptor, we put 





>> 
>> I think the simplified ring layout is a good idea. 
>> It reduces the number of indirections, and will probably help pre-fetching data in order to avoid cache misses.
>> 
>> I am not completely convinced by the flag-based approach instead of the index-based approach though.
>> With the index-based approach, you have only one-read to do in order
>> to know up to what point in the queue you can go.
>> With this flag, every-time you look at a new descriptor, before doing anything you will have to look at the flags.
>> Meaning that you will have to wait for the result of a test on a very newly fetched piece of memory.
>> 
>> So maybe it would be interesting to still have a used_index and available_index (put in different cache lines).
> 
> Problem is, index access is guaranteed to be a cache miss.
> You need to stall waiting for descriptor length anyway -
> go ahead and implement an alternative, and prove me wrong,
> but my testing shows removing the index speeds things up.
> 
> Find the prototype here:
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/tools/virtio/ringtest/ring.c
> 

Actually what you say makes a lot of sense.
And it's really easy to just prefetch next descriptors (which would include de flags).

I had a few reasons why having the index was handy, but nothing so important.

>> 
>> First of all, if I understand correctly, VRING_DESC_F_NEXT doesn't use a 'next' field, and rather just
>> say that the next descriptor in the queue is part of the same message. I think this is really, really, good.
>> 
>> Paolo proposed in a later mail to drop VRING_DESC_F_NEXT altogether, and I think
>> he is right at least in the sense that having two solutions for doing almost the same thing 
>> is not necessary (IMHO, even harmful).
> 
> I'd like to see some code and numbers to back this up.  In particular,
> testing dpdk performance with/without indirect descriptors would be
> interesting.
> If the optimal things depends on workload or the type of device in use,
> then it might be necessary.

There is a gain when indirect descriptors are enabled, but I think this is just due to the fact that
the number of descriptors is virtually increased.
I will run some tests this week and try to compare 512 w/o indirect Vs 256 with indirect.

> 
>> But, VRING_DESC_F_INDIRECT adds a layer of indirection and makes prefetching a bit
>> more complicated.
>> The reason why VRING_DESC_F_INDIRECT was useful in the first place is that virtio 1.0 
>> didn't have enough descriptors in the main queue (it is equal to the queue size).
> 
> rings are physically contigious, and allocating large physically
> contigious buffers is problematic for many guests.
> 

If having physically contiguous rings is very important, could we allocate multiple arrays instead of one ?

For example, allocating 8 rings of 1024 descriptos;
struct desc *current_desc = &vring[idx & 0x3c00][idx & 0x3ff];

Just like indirect descriptors, this has two levels of indirection, but unlike indirect, this the lookup
is from non-shared memory and is guaranteed to not change.


> 
>> And if delay spent in the queue is an issue. I think it's driver's job to avoid bufferbloat if necessary 
>> by adapting the number of descriptors is allows the device to use.
>> 
>>> 
>>> 
>>> * Batching descriptors:
>>> 
>>> virtio 1.0 allows passing a batch of descriptors in both directions, by
>>> incrementing the used/avail index by values > 1.  We can support this by
>>> tagging first/middle/last descriptors in the flags field. For
>>> comptibility, polarity is reversed so bit is set for non-first and
>>> non-last descriptors in a batch:
>>> 
>>> #define BATCH_NOT_FIRST 0x0010
>>> #define BATCH_NOT_LAST  0x0020
>>> 
>>> In other words only descriptor in batch is 0000.
>>> 
>>> Batch does not have to be processed together, so
>>> !first/!last flags can be changed when descriptor is used.
>>> 
>> 
>> I don't exactly understand why this is necessary.
>> Is there any added value by knowing that a bunch of messages have been written at 
>> the same time.
> 
> virtio net mergeable buffers pretty much rely on this.
> 

One other way to do a batch commit is to set the DESC_HW flag of the first committed
descriptor after the following.


e.g. to batch 3 descriptors for hardware.

==> [...] [HW] [SW] [SW] [SW] [SW] [SW] [...] 
==> [...] [HW] [SW] [HW] [SW] [SW] [SW] [...] 
==>  [...] [HW] [SW] [HW] [HW] [SW] [SW] [...] 
==> [...] [HW] [HW] [HW] [HW] [SW] [SW] [...]

Follow-Ups:
- Re: [virtio-dev] packed ring layout proposal
  - From: "Pierre Pfister (ppfister)" <ppfister@cisco.com>

References:
- packed ring layout proposal
  - From: "Michael S. Tsirkin" <mst@redhat.com>
- Re: [virtio-dev] packed ring layout proposal
  - From: "Pierre Pfister (ppfister)" <ppfister@cisco.com>
- Re: [virtio-dev] packed ring layout proposal
  - From: "Michael S. Tsirkin" <mst@redhat.com>