virtio-comment message

Subject: RE: [virtio-comment] Re: [virtio-dev] [PATCH v8] virtio_net: support for split transport header


Hi Heng Qi,

Sorry for joining this conversation a little late.
Your email has a very useful summary.

Unfortunately, non-text content (HTML content) doesn't get archived.
So, I am changing the format to text to capture your useful comments.
If you can change your email client settings to text mode, it will be easier to converse.

We have an equal interest in efficient split header support and would like to work on it together with you.
Please find responses under the "Response" tag below, at the end of the email, to avoid top posting.

From: virtio-comment@lists.oasis-open.org <virtio-comment@lists.oasis-open.org> On Behalf Of hengqi
Sent: Tuesday, January 31, 2023 4:23 AM
To: virtio-dev <virtio-dev@lists.oasis-open.org>; virtio-comment <virtio-comment@lists.oasis-open.org>
Cc: Michael S. Tsirkin <mst@redhat.com>; Jason Wang <jasowang@redhat.com>; Cornelia Huck <cohuck@redhat.com>; Kangjie Xu <kangjie.xu@linux.alibaba.com>; Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Subject: [virtio-comment] Re: [virtio-dev] [PATCH v8] virtio_net: support for split transport header

Hi, all.

Split header is a technique with important applications. For example, Eric (https://lwn.net/Articles/754681/)
and Jonathan Lemon (https://lore.kernel.org/io-uring/20221007211713.170714-1-jonathan.lemon@gmail.com/T/#m678770d1fa7040fd76ed35026b93dfcbf25f6196)
each implement zero-copy technology; they all have one thing in common: the header and
the payload need to be in separate buffers, and Eric's method requires the payload to be page-aligned.

We implemented zero-copy in the virtio-net driver according to Eric's method. The commands and
environment are as follows:
# environment
VM1<---->vhost-user<->OVS<->vhost-user<---->VM2
CPU model name: Intel(R) Xeon(R) CPU E5-2682 v4 @ 2.50GHz
kernel version 6.0

# commands (linux/tools/testing/selftests/net)
./tcp_mmap -s -z -4 -p 1000 &
./tcp_mmap -H 10.0.0.2 -z -4 -p 1000

The performance data is as follows (implemented according to the split header v7 version,
https://lists.oasis-open.org/archives/virtio-dev/202209/msg00004.html):
# direct copy
17.6604 s 10.08 s
# zero copy
1.9 GB/s 3.3 GB/s

We discussed a lot before; the core point is the choice between method A and method C, and we seem
unable to reach an agreement on this point. Given the above summary and the previous discussion (https://lists.oasis-open.org/archives/virtio-dev/202210/msg00017.html),
how can we resolve this conflict and let this important feature continue?
I really need your help. Cc Jason, Michael, Cornelia, Xuan.

Thanks.
------------------------------------------------------------------
From: Heng Qi <mailto:hengqi@linux.alibaba.com>
Date: Thursday, Oct 20, 2022, 16:34
To: Jason Wang <mailto:jasowang@redhat.com>
Cc: Michael S. Tsirkin <mailto:mst@redhat.com>; Xuan Zhuo <mailto:xuanzhuo@linux.alibaba.com>; Virtio-Dev <mailto:virtio-dev@lists.oasis-open.org>; Kangjie Xu <mailto:kangjie.xu@linux.alibaba.com>
Subject: Re: [virtio-dev] [PATCH v8] virtio_net: support for split transport header

On Sat, Oct 08, 2022 at 12:37:45PM +0800, Jason Wang wrote:
> On Thu, Sep 29, 2022 at 3:04 PM Michael S. Tsirkin <mailto:mst@redhat.com> wrote:
> >
> > On Thu, Sep 29, 2022 at 09:48:33AM +0800, Jason Wang wrote:
> > > On Wed, Sep 28, 2022 at 9:39 PM Michael S. Tsirkin <mailto:mst@redhat.com> wrote:
> > > >
> > > > On Mon, Sep 26, 2022 at 04:06:17PM +0800, Jason Wang wrote:
> > > > > > Jason, I think the issue with previous proposals is that they conflict
> > > > > > with VIRTIO_F_ANY_LAYOUT. We have repeatedly found that giving the
> > > > > > driver flexibility in arranging the packet in memory is beneficial.
> > > > >
> > > > >
> > > > > Yes, but I didn't find how it conflicts with any_layout. The device can just
> > > > > choose not to split the header when the layout doesn't fit header splitting.
> > > > > (And this seems to be the case even if we're using buffers.)
>Â>Â>Â>
> > > > Well, the spec says:
> > > >
> > > >          indicates to both the device and the driver that no
> > > >          assumptions were made about framing.
> > > >
> > > > if device assumes that descriptor boundaries are where
> > > > driver wants packet to be stored that is clearly
> > > > an assumption.
> > >
> > > Yes, but what I want to say is, the device can choose not to split the
> > > packet if the framing doesn't fit. Does it still comply with the above
> > > description?
> > >
> > > Thanks
> >
> > The point of ANY_LAYOUT is to give drivers maximum flexibility.
> > For example, if the driver wants to split the header at some specific
> > offset this is already possible without extra functionality.
>
> I'm not sure how this would work without support from the device.
> This probably can only work if:
>
> 1) the driver knows what kind of packet it can receive
> 2) the protocol has a fixed header length
>
> This is probably not true, considering:
>
> 1) TCP and UDP have different header lengths
> 2) IPv6 has a variable-length header
>
>
> >
> > Let's keep it that way.
> >
> > Now, let's formulate some of the problems with the current way.
> >
> >
> >
> > A- mergeable buffers is even more flexible, since a single packet
> >   is built up of multiple buffers. And in theory the device can
> >   choose an arbitrary set of buffers to store a packet.
> >   So you could supply a small buffer for headers followed by a bigger
> >   one for payload, in theory even without any changes.
> >   Problem 1: However, since this is not how devices currently operate,
> >   a feature bit would be helpful.
>
> How do we know the bigger buffer is sufficient for the packet? If we
> try to allocate 64K (not sufficient even for the future) it breaks the
> effort of the mergeable buffer:
>
> header buffer #1
> payload buffer #1
> header buffer #2
> payload buffer #2
>
> Is the device expected to
>
> 1) fill payload in header buffer #2; this breaks the effort that we
> want to make payload page aligned
> 2) skip header buffer #2; in this case, the device assumes the framing
> when it breaks any layout
>
> >
> >   Problem 2: Also, in the past we found it useful to be able to figure out whether
> >   a packet fits in a single buffer without looking at the header.
> >   For this reason, we have this text:
> >
> >          If a receive packet is spread over multiple buffers, the device
> >          MUST use all buffers but the last (i.e. the first \field{num_buffers} -
> >          1 buffers) completely up to the full length of each buffer
> >          supplied by the driver.
> >
> >   if we want to keep this optimization and allow using a separate
> >   buffer for headers, then I think we could rely on the feature bit
> >   from Problem 1 and just make an exception for the first buffer.
> >   Also num_buffers is then always >= 2; maybe state this to avoid
> >   confusion.
> >
> >
> >
> >
> >
> > B- without mergeable, there's no flexibility. In particular, there can
> > not be uninitialized space between header and data.
>
> I had two questions
>
> 1) why is this not a problem for mergeable? There's no guarantee that
> the header is just the length of what the driver allocates for the header
> buffer anyhow
>
> E.g. the header length could be smaller than the header buffer; the
> device still needs to skip part of the space in the header buffer.
>
> 2) it should be the responsibility of the driver to handle the
> uninitialized space; it should do anything that is necessary for
> security, more below
>


We've talked a bit more about split header so far, but there still seem to
be some issues, so let's recap.

I. Method Discussion Review

In order to adapt to Eric's tcp receive interface to achieve zero copy, the
header and payload are required to be stored separately, and the payload is
stored in a page-aligned way. Therefore, we have discussed several options
for split header as follows:

1: method A ( depends on the descriptor chain )
|                         receive buffer                            |
|              0th descriptor                      | 1st descriptor |
| virtnet hdr | mac | ip hdr | tcp hdr|<-- hold -->|     payload    |
Method A uses a buffer plus a separate page when allocating the receive
buffer. In this way, we can ensure that all payloads can be put
independently in a page, which is very beneficial for the zerocopy
implemented by the upper layer.

The advantage of method A is that the implementation is clearer: it can support normal
header split and the rollback conditions. It can also easily support XDP. The downside is
that devices operating directly on the descriptor chain may cause a layering violation,
and may also affect performance.

2. method B ( depends on mergeable buffer )
|                   receive buffer (page)                                   | receive buffer (page) |
| <-- offset(hold) --> | virtnet hdr | mac | ip hdr | tcp hdr|<-- hold -->|         payload       |
^
|
pointer to device

Method B is based on your previous suggestion; it is implemented based
on mergeable buffers, filling a separate page each time.

If split header is negotiated and the packet can be successfully split by the device,
the device needs to find at least two buffers, namely two pages: one for the virtio-net header
and transport header, and the other for the payload.

The advantage of method B is that it relies on mergeable buffers instead of the descriptor chain.
It overcomes the shortcomings of method A and can achieve the purpose of the device focusing
on the buffer instead of the descriptor. Its disadvantage is that it causes memory waste.

3. method C ( depends on mergeable buffer )
| small buffer | data buffer (page) | small buffer | data buffer (page) | small buffer | data buffer (page) |

Method B fills a separate page each time, while method C needs to fill the small buffers and
page buffers separately. Method C puts the header in a small buffer and the payload in a page.

The advantage of method C is that separate buffers are filled for header and data respectively,
which reduces the memory waste of method B. However, with this method it is difficult to balance
the number of filled header buffers and data buffers, and an unreasonable proportion will
affect performance. For example, in a scenario with a large number of large packets,
too many header buffers will affect performance, and in a scenario with a large number of small
packets, too many data buffers can also affect performance. At the same time, if some protocols
with a large number of packets do not support split header, the existence of the header buffers
will also affect performance.

II. Points of agreement and disagreement

1. What we have now agreed upon:
None of the three methods break VIRTIO_F_ANY_LAYOUT; they make the virtio net header and
the packet header stored together.

We have now agreed to relax the following in the split header scenario:
 "indicates to both the device and the driver that no assumptions were made about framing."
because when a bigger packet comes and a data buffer is not enough to store it,
the device either chooses to skip the next header buffer, breaking what the spec says above,
or chooses not to skip the header buffer and then cannot make the payload page aligned.
Therefore, all three methods need to relax the above requirement.

2. What we haven't yet agreed upon:
We don't have a more precise discussion of which approach to take,
but are still bouncing between approaches.
At present, all three approaches seem to achieve our requirements, but each has advantages
and disadvantages. Should we focus on the most important points, such as performance, to choose?
It seems a little difficult to cover everything.

III. Two forms of implementing receive zerocopy

Eric's tcp receive interface requires that the header and payload are stored in separate buffers, and that the payload is
stored in a page-aligned way.

Now, io_uring also proposes a new receive zerocopy method, which requires header and payload
to be stored in separate buffers, but does not require the payload to be page aligned:
https://lore.kernel.org/io-uring/20221007211713.170714-1-jonathan.lemon@gmail.com/T/#m678770d1fa7040fd76ed35026b93dfcbf25f6196

Response....

Page alignment requirements should not come from the virtio spec.
There are a variety of cases which may use non-page-aligned data buffers.
a. A kernel-only consumer, which doesn't have an mmap requirement, can use them.
b. A VQ accessible directly in user space may also use them without page alignment.
c. On a system with a 64k page size, page-aligned memory has a fair amount of wastage.
d. The io_uring example you pointed to also has a non-page-aligned use.

So let the driver deal with alignment restrictions, outside of the virtio spec.

In header-data split cases, data buffer utilization is more important than the tiny header buffers' utilization.
What if the headers did not interfere with the data buffers at all?

In other words, say a given RQ is optionally linked to a circular queue of header buffers.
All header buffers are of the same size, supplied one time.
This header size and the circular queue address are configured once, at RQ creation time.

With this, the device doesn't need to process a header buffer size for every single incoming packet.
Data buffers can continue as chains, or merged mode can be supported.
When a received packet's header cannot fit, the packet continues as-is in the data buffer.
The virtio net header, as suggested, indicates usage of the header buffer offset/index.

This method has a few benefits for performance and buffer efficiency, as below.
1. Data buffers can be directly mapped at best utilization.
2. The device doesn't need to match up per-packet header sizes and descriptor sizes, which is efficient for the device to implement.
3. No need to keep reposting the header buffers; only the tail index needs to be updated.
This directly gives a 50% cycle reduction in buffer traversal on the driver side on the rx path.
4. Ability to share this header buffer queue among multiple RQs if needed.
5. In the future there may be an extension to place tiny whole packets that fit in the header buffer, so that it also contains the rest of the data.
6. The device can always fall back to placing the packet header in the data buffer when a header buffer is not available or is smaller than a newer protocol's header.
7. Because the header buffers come from virtually contiguous memory and are not intermixed with data buffers, there are no small per-header allocations.
8. It also works in both chained and merged mode.
9. Memory utilization for an RQ of depth 256 with a 4K page size: data buffers = 1M, and header buffers = 256 * 128 bytes = only ~3% of the data buffer memory.
So, in the worst case, when no packet uses the header buffers, the wastage is only 3%.
When a high number of packets larger than 4K use the header buffers, say 8K packets, header buffer utilization is at 50%, so the wastage is only 1.5%.
At a 1500 MTU merged-buffer data buffer size, header buffer memory is also < 10% of the data buffer memory.
All 3 cases are in a very manageable range of buffer utilization.

If we like this approach, it is not difficult to get there by crafting and modifying the feature bits and the virtio net header from your v7 version.

