virtio-comment message

Subject: RE: [virtio-comment] Hardware friendly proposals from Intel for packed-ring-layout
From: "Dhanoa, Kully" <kully.dhanoa@intel.com>
To: "'virtio-comment@lists.oasis-open.org'" <virtio-comment@lists.oasis-open.org>
Date: Fri, 1 Sep 2017 08:35:38 +0000
Adding background to Tiwei's comments on reason why tail pointer is good for hardware implementation:

Assuming no tail pointer:

1. Hardware would have to speculatively read descriptors and check their validity by checking that DESC_HW=1.
	1.1 Yes Hardware could request a large number of descriptors at a time, making the pcie read response transfer (i.e. read 			descriptors) an efficient PCIe transfer.

The problems are as follows:

2. Issue 1: Wasting PCIe bandwidth
	2.1 Although the PCIe read responses may be efficient transfers, if they contain invalid descriptors (DESC_HW=0), we have wasted 		PCIe bandwidth. This can be a problem when trying to maximize the performance possible from a design.

3. Issue 2: Wasting Hardware memory resources
	3.1 When issuing PCie read requests for descriptors, the hardware must reserve inadvance memory to store the descriptors.
	3.2. Given PCIe read latencies can be in the order of 1us, this memory is reserved for that length of time.
	3.3. For hardware, 1us is a very long time and for FPGAs, memory is not as plentiful/cheap as in a PC.
	3.4 So reserving memory for descriptors that may end up being invalid is a waste. Ultimately, this could effect performance if a 		large number of invalid descriptors are being read.

So it is better for hardware to know which queues (and hence guests) have descriptors available and fetch those only.
The argument above is biased towards Tx (transfer of packets from guest to device) but does also apply for Rx.

Tail pointer resides in the hardware and so the hardware always knows how many descriptors are available for each queue (no need to waste PCIe bandwidth to determine this) and so can fetch only those valid descriptors.

Regards
Kully
-----Original Message-----
From: Bie, Tiwei
Sent: Friday, September 1, 2017 6:35 AM
To: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>; virtio-comment@lists.oasis-open.org; mst@redhat.com; Dhanoa, Kully <kully.dhanoa@intel.com>; Liang, Cunming <cunming.liang@intel.com>; Gray, Mark D <mark.d.gray@intel.com>
Subject: Re: [virtio-comment] Hardware friendly proposals from Intel for packed-ring-layout

On Thu, Aug 31, 2017 at 11:21:50AM +0100, Stefan Hajnoczi wrote:
> On Fri, Aug 25, 2017 at 04:32:26PM +0100, Stefan Hajnoczi wrote:
> > On Thu, Aug 24, 2017 at 10:11:33PM +0800, Tiwei Bie wrote:
> > > On Thu, Aug 24, 2017 at 03:32:15PM +0200, Paolo Bonzini wrote:
> > > > On 24/08/2017 15:11, Tiwei Bie wrote:
> > > > > On Thu, Aug 24, 2017 at 02:10:34PM +0200, Paolo Bonzini wrote:
> > > > >> On 24/08/2017 13:53, Tiwei Bie wrote:
> > > > >>>
> > > > >>> * In addition to the DESC_HW flag, each virtio queue has a tail pointer
> > > > >>>     - Driver creates suitable (i.e. multiple of cacheline) descriptors,
> > > > >>>       then performs MMIO write to tail pointer.
> > > > >>
> > > > >> If I understand correctly, the tail pointer is the value that 
> > > > >> is written to the MMIO register.  If that is the case, this 
> > > > >> is unfortunately bad for virtualization.  Virt prefers a 
> > > > >> doorbell register where the value doesn't matter.  This is because:
> > > > >>
> > > > >> 1) the value is not available directly and computing it 
> > > > >> requires instruction decoding, which in turn requires walking 
> > > > >> page tables
> > > > >>
> > > > >> 2) if the value doesn't matter, the hypervisor can simply 
> > > > >> wake up a userspace thread that processes the virtio queue 
> > > > >> without bothering to pass the value.
> > > > >>
> > > > >> On the other hand, writing a tail pointer _before_ the MMIO 
> > > > >> write may cost a cache miss.  Hence the packed ring layout 
> > > > >> proposal replaced the tail pointer write with lookahead on the ring buffer's DESC_HW flags.
> > > > >> The idea is that lookahead is cheaper, because hopefully the 
> > > > >> first non-DESC_HW buffer will be in the same cache line as 
> > > > >> the last DESC_HW buffer.
> > > > > 
> > > > > Thank you so much for such quick and detailed reply!
> > > > > 
> > > > > Yeah, we know it's a bit tricky to support the tail pointer in 
> > > > > software. But it's really helpful for the hardware implementation.
> > > > > So we want more discussions on this.
> > > > > 
> > > > > How about having this feature be switchable at runtime, so 
> > > > > it's possible to be enabled after migrating to a hardware 
> > > > > backend, or disabled after migrating to a software backend. So 
> > > > > for the software backend, it can still use the DESC_HW based mechanism.
> > > > > 
> > > > > It's just some rough thoughts, and we haven't thought about 
> > > > > the implementation details. What's your thoughts on this?
> > > > 
> > > > Why is lookahead bad for hardware?  Can a PCIe device use burst 
> > > > reads to retrieve many 2-byte descriptor in a single TLP transaction?
> > > > 
> > > 
> > > I'm not a hardware engineer, so what I said may be not accurate.
> > > Kully (Cc'ed in this thread) can provide more details if necessary.
> > > 
> > > From my understanding, in current design hardware will need to 
> > > keep issuing TLP transactions at a certain pace to check whether 
> > > the descriptor is available or not. And it's possible that there 
> > > is no available descriptor. In this case, there will be a lot of 
> > > TLP transactions wasted. That is to say, the doorbell (MMIO write) 
> > > is helpful for the hardware.
> > > 
> > > So maybe we can combine both of them, similar to the notification 
> > > support in virtio1.0, we can have the tail pointer be optional, 
> > > the backend is free to choose to enable (and use) it or disable it.
> > > And the software backend doesn't need to support it and just need 
> > > to keep it disabled. Any thoughts?
> > 
> > Isn't the VIRTIO 1.0 "4.1.2.3 Notification structure layout" still 
> > going to be available in the new ring layout?
> > 
> > That means the device already has a doorbell and does not need to 
> > keep issuing bus transactions to poll the DESC_HW bit.
> 
> I think I misunderstood the discussion:
> 
> The VIRTIO 1.0 "4.1.2.3 Notification structure layout" doorbell does 
> not indicate how many descriptors are available for the device.
> 
> This means the device needs to read the DESC_HW bit for each 
> descriptor until it reaches the first descriptor without DESC_HW set.
> 
> If the doorbell contained the number of descriptors then the device 
> could fetch exactly N descriptors instead of N + 1.  Maybe it could 
> also use fewer bus transactions (I'm not familiar with low-level PCIe).
> 
> So there does seem to be a performance advantage if the VIRTIO 1.0
> "4.1.2.3 Notification structure layout" is modified to include the 
> number of descriptors available.  Seems like a good idea.
> 

Yeah, you are right! :-) We are proposing making the doorbell indicate how many descriptors are available for the device. This idea was raised by our hardware engineer. I'm also not really familiar with the low-level PCIe. Besides, sorry for not making a very clear/thorough proposal on this.. And many thanks for Michael's and Paolo's great feedbacks and suggestions on this!

Best regards,
Tiwei Bie
---------------------------------------------------------------------
Intel Corporation (UK) Limited
Registered No. 1134945 (England)
Registered Office: Pipers Way, Swindon SN3 1RJ
VAT No: 860 2173 47

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.