Subject: Re: [virtio-dev] [PATCH v8] virtio_net: support for split transport header


On Sat, Oct 08, 2022 at 12:37:45PM +0800, Jason Wang wrote:
> On Thu, Sep 29, 2022 at 3:04 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Thu, Sep 29, 2022 at 09:48:33AM +0800, Jason Wang wrote:
> > > On Wed, Sep 28, 2022 at 9:39 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> > > >
> > > > On Mon, Sep 26, 2022 at 04:06:17PM +0800, Jason Wang wrote:
> > > > > > Jason I think the issue with previous proposals is that they conflict
> > > > > > with VIRTIO_F_ANY_LAYOUT. We have repeatedly found that giving the
> > > > > > driver flexibility in arranging the packet in memory is beneficial.
> > > > >
> > > > >
> > > > > Yes, but I didn't find how it conflicts with ANY_LAYOUT. The device can
> > > > > just choose not to split the header when the layout doesn't fit header
> > > > > splitting. (And this seems to be the case even if we're using buffers.)
> > > >
> > > > Well, the spec says:
> > > >
> > > >         indicates to both the device and the driver that no
> > > >         assumptions were made about framing.
> > > >
> > > > if the device assumes that descriptor boundaries are where the
> > > > driver wants the packet to be stored, that is clearly
> > > > an assumption.
> > >
> > > Yes, but what I want to say is, the device can choose not to split the
> > > packet if the framing doesn't fit. Does it still comply with the above
> > > description?
> > >
> > > Thanks
> >
> > The point of ANY_LAYOUT is to give drivers maximum flexibility.
> > For example, if driver wants to split the header at some specific
> > offset this is already possible without extra functionality.
> 
> I'm not sure how this would work without support from the device.
> This probably can only work if:
> 
> 1) the driver knows what kind of packet it can receive
> 2) the protocol has a fixed header length
> 
> This is probably not true, considering:
> 
> 1) TCP and UDP have different header lengths
> 2) IPv6 has a variable-length header
> 
> 
> >
> > Let's keep it that way.
> >
> > Now, let's formulate some of the problems with the current way.
> >
> >
> >
> > A- mergeable buffers are even more flexible, since a single packet
> >   is built up of multiple buffers. And in theory the device can
> >   choose an arbitrary set of buffers to store a packet.
> >   So you could supply a small buffer for headers followed by a bigger
> >   one for payload, in theory even without any changes.
> >   Problem 1: However, since this is not how devices currently operate,
> >   a feature bit would be helpful.
> 
> How do we know the bigger buffer is sufficient for the packet? If we
> try to allocate 64K (not even sufficient for the future) it defeats
> the purpose of the mergeable buffer:
> 
> header buffer #1
> payload buffer #1
> header buffer #2
> payload buffer #2
> 
> Is the device expected to
> 
> 1) fill the payload into header buffer #2? This defeats the effort to
> make the payload page-aligned.
> 2) skip header buffer #2? In this case, the device assumes the framing,
> which breaks ANY_LAYOUT.
> 
> >
> >   Problem 2: Also, in the past we found it useful to be able to figure out whether
> >   a packet fits in a single buffer without looking at the header.
> >   For this reason, we have this text:
> >
> >         If a receive packet is spread over multiple buffers, the device
> >         MUST use all buffers but the last (i.e. the first \field{num_buffers} -
> >         1 buffers) completely up to the full length of each buffer
> >         supplied by the driver.
> >
> >   if we want to keep this optimization and allow using a separate
> >   buffer for headers, then I think we could rely on the feature bit
> >   from Problem 1 and just make an exception for the first buffer.
> >   Also num_buffers is then always >= 2, maybe state this to avoid
> >   confusion.
> >
> >
> >
> >
> >
> > B- without mergeable, there's no flexibility. In particular, there
> > cannot be uninitialized space between header and data.
> 
> I have two questions:
> 
> 1) Why is this not a problem for mergeable? There's no guarantee that
> the header is exactly the length the driver allocates for the header
> buffer anyhow.
> 
> E.g. the header length could be smaller than the header buffer; the
> device still needs to skip part of the space in the header buffer.
> 
> 2) It should be the responsibility of the driver to handle the
> uninitialized space; it should do whatever is necessary for
> security. More below.
> 


We've talked quite a bit about split header so far, but there still seem to
be some open issues, so let's recap.

I. Method Discussion Review

In order to adapt to Eric's TCP receive interface and achieve zero copy, the
header and payload are required to be stored separately, and the payload must
be stored in a page-aligned way. Therefore, we have discussed several options
for split header, as follows:

1. Method A (depends on the descriptor chain)
|                         receive buffer                            |
|              0th descriptor                      | 1st descriptor |
| virtnet hdr | mac | ip hdr | tcp hdr|<-- hold -->|      payload   |
Method A uses a buffer plus a separate page when allocating the receive
buffer. In this way, we can ensure that every payload is placed
independently in a page, which is very beneficial for the zerocopy
implemented by the upper layer.

The advantage of method A is that the implementation is clearer: it can support normal
header split and the rollback conditions, and it can also easily support XDP. The downside is
that devices operating directly on the descriptor chain may cause a layering violation,
and may also affect performance.
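
To make the layout concrete, here is a minimal driver-side sketch of how method A
might post the two-part receive buffer, using the Linux virtqueue API only for
illustration (hdr_buf_len, the hold space, and the error handling are assumptions,
not part of the proposal):

static int method_a_post_rx_buffer(struct virtqueue *vq, unsigned int hdr_buf_len)
{
        struct scatterlist sg[2];
        /* 0th descriptor: virtnet hdr + transport headers + hold space */
        void *hdr_buf = kmalloc(hdr_buf_len, GFP_ATOMIC);
        /* 1st descriptor: a full page, so the payload starts page-aligned */
        struct page *payload_page = alloc_page(GFP_ATOMIC);

        if (!hdr_buf || !payload_page)
                return -ENOMEM;         /* cleanup omitted in this sketch */

        sg_init_table(sg, 2);
        sg_set_buf(&sg[0], hdr_buf, hdr_buf_len);
        sg_set_page(&sg[1], payload_page, PAGE_SIZE, 0);

        /* If the device can split, headers land in sg[0] and the payload at
         * offset 0 of the page; otherwise it rolls back and fills the chain
         * linearly, as with any receive buffer today. */
        return virtqueue_add_inbuf(vq, sg, 2, hdr_buf, GFP_ATOMIC);
}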

2. Method B (depends on mergeable buffers)
|                   receive buffer (page)                                 | receive buffer (page) | 
| <-- offset(hold) --> | virtnet hdr | mac | ip hdr | tcp hdr|<-- hold -->|         payload       | 
^
|
pointer to device

Method B is based on your previous suggestion: it is implemented on top of
mergeable buffers, filling a separate page each time.

If split header is negotiated and the packet can be successfully split by the device,
the device needs to find at least two buffers, namely two pages: one for the virtio-net header
and the transport header, and the other for the payload.

The advantage of method B is that it relies on mergeable buffers instead of the descriptor chain.
It overcomes the shortcomings of method A and achieves the goal of having the device focus
on buffers instead of descriptors. Its disadvantage is that it wastes memory.
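
A rough device-side sketch of method B may help (pure pseudocode; can_split(),
next_rx_page() and the other names are made up, since the device model is
implementation-specific):

/* Method B: every receive buffer is a full page, so a split packet
 * always consumes at least two pages. */
if (can_split(pkt)) {
        page1 = next_rx_page();              /* offset(hold) + virtnet hdr + transport hdrs */
        write_at(page1, hold_off, vnet_hdr, transport_hdrs(pkt));
        page2 = next_rx_page();              /* payload, starting on a page boundary */
        write_at(page2, 0, payload(pkt));
        num_buffers = 2 + extra_pages(pkt);  /* larger payloads take more pages */
} else {
        fill_linearly(pkt);                  /* no split: behave as today */
}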

3. Method C (depends on mergeable buffers)
| small buffer | data buffer (page) | small buffer | data buffer (page) | small buffer | data buffer (page) |

Method B fills a separate page each time, while method C fills small buffers and
page buffers separately: method C puts the header in a small buffer and the payload in a page.

The advantage of method C is that separate buffers are filled for header and data
respectively, which reduces the memory waste of method B. However, with this method it is
difficult to balance the number of header buffers against the number of data buffers, and
an unreasonable ratio will affect performance. For example, in a scenario with a large
number of large packets, too many header buffers will hurt performance, and in a scenario
with a large number of small packets, too many data buffers will also hurt performance.
At the same time, if some protocols with a large number of packets do not support split
header, the mere existence of the header buffers will also affect performance.
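
To illustrate the ratio problem, here is a hypothetical refill loop for method C
(all names are made up; the point is only that hdr_per_data is a knob with no
obviously correct value):

#define HDR_BUF_SIZE 256              /* virtnet hdr + worst-case transport headers */
static unsigned int hdr_per_data = 1; /* header buffers posted per data page */

static void method_c_refill(struct receive_queue *rq)
{
        while (!rx_ring_full(rq)) {
                for (unsigned int i = 0; i < hdr_per_data; i++)
                        post_small_buffer(rq, HDR_BUF_SIZE);  /* header buffer */
                post_page_buffer(rq);                         /* data buffer (page) */
        }
}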

II. Points of agreement and disagreement

1. What we have now agreed upon:
None of the three methods breaks VIRTIO_F_ANY_LAYOUT; they all store the virtio-net header
and the packet header together.

We have now agreed to relax the following in the split header scenario,
	"indicates to both the device and the driver that no assumptions were made about framing."
because when a bigger packet arrives and a data buffer is not enough to store it,
the device must either skip the next header buffer, breaking what the spec says above,
or not skip the header buffer and give up making the payload page-aligned.
Therefore, all three methods need to relax the above requirement.
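
Written out as device-side pseudocode, the dilemma looks like this (names made up):

/* A large packet has filled data buffer #1 and still has bytes left;
 * the next posted buffer is header buffer #2. */
if (skip_header_buffers) {
        buf = next_data_buffer();  /* payload stays page-aligned, but the device
                                    * just made an assumption about framing */
} else {
        buf = next_buffer();       /* no framing assumption, but the rest of the
                                    * payload lands in a header buffer and loses
                                    * page alignment */
}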

2. What we haven't yet agreed upon:
We have not had a more precise discussion of which approach to take; we are
still bouncing between approaches.
At present, all three approaches seem to achieve our requirements, but each has
advantages and disadvantages. Should we focus on the most important criteria,
such as performance, to choose one? It seems a little difficult to cover everything.

III. Two forms of implementing receive zerocopy

Eric's TCP receive interface requires the header and payload to be stored in separate
buffers, and the payload to be stored in a page-aligned way.
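
For reference, this is roughly how userspace consumes that interface (a minimal
sketch based on the TCP_ZEROCOPY_RECEIVE getsockopt; error handling and the mmap
of the socket are trimmed, and header availability may vary by kernel version):

#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <linux/tcp.h>  /* struct tcp_zerocopy_receive, TCP_ZEROCOPY_RECEIVE */

/* Map up to 'len' bytes of the TCP receive queue at 'addr' (obtained from
 * mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0)) without copying.  Whole,
 * page-aligned payload pages are remapped; anything else is reported in
 * recv_skip_hint and must be read with recv().  This is why split header
 * wants the payload page-aligned. */
static ssize_t zc_receive(int fd, void *addr, size_t len)
{
        struct tcp_zerocopy_receive zc;
        socklen_t zc_len = sizeof(zc);

        memset(&zc, 0, sizeof(zc));
        zc.address = (__u64)(unsigned long)addr;
        zc.length = len;

        if (getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, &zc, &zc_len))
                return -1;
        return zc.length;       /* bytes now mapped at addr */
}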

Now, io_uring also proposes a new receive zerocopy method, which requires the header
and payload to be stored in separate buffers, but does not require the payload to be
page-aligned:
https://lore.kernel.org/io-uring/20221007211713.170714-1-jonathan.lemon@gmail.com/T/#m678770d1fa7040fd76ed35026b93dfcbf25f6196


Thanks.

> > If we had flexibility here, this could be
> > helpful for alignment, security, etc.
> > Unfortunately, our hands are tied:
> >
> >
> >         \field{len} is particularly useful
> >         for drivers using untrusted buffers: if a driver does not know exactly
> >         how much has been written by the device, the driver would have to zero
> >         the buffer in advance to ensure no data leakage occurs.
> >
> >         For example, a network driver may hand a received buffer directly to
> >         an unprivileged userspace application.  If the network device has not
> >         overwritten the bytes which were in that buffer, this could leak the
> >         contents of freed memory from other processes to the application.
> 
> This would be a bug in the driver. For example, if the driver wants
> to implement zerocopy, it must guarantee that every byte was written
> by the device before mapping the buffer to userspace; if it can't, it
> should do the copy instead.
> 
> >
> >
> > so all buffers have to be initialized completely up to the reported
> > used length.
> >
> > It could be that the guarantee is not relevant in some use cases.
> > We would have to specify that this is an exception to the rule and
> > explain that drivers must be careful about information leaks.
> > Let's say there's a feature bit that adds uninitialized space
> > somewhere. How much was added? We can find out by parsing the
> > packet, but once you start doing that just to assemble the skb
> > you have already lost performance.
> 
> I don't get it here; those uninitialized spaces look like just
> tailroom for the skb.
> 
> Thanks
> 
> > So lots of spec work, some security risks, and unclear performance.
> >
> >
> >
> >
> > Is the above a fair summary?
> >
> >
> >
> > If yes, I would say let's address A, but make sure we ask drivers
> > to clear the new feature bit if there's no mergeable
> > (as opposed to e.g. failing probe) so we can add
> > support for !mergeable down the road.
> >
> >
> >
> >
> >
> > --
> > MST
> >

