Subject: RE: [virtio-comment] Re: [PATCH V2 4/6] virtio-pci: implement VIRTIO_F_QUEUE_STATE
> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> Sent: Thursday, November 16, 2023 3:45 PM
>
> On 11/16/2023 1:35 AM, Parav Pandit wrote:
> >
> >> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >> Sent: Monday, November 13, 2023 2:56 PM
> >>
> >> On 11/10/2023 8:31 PM, Parav Pandit wrote:
> >>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >>>> Sent: Friday, November 10, 2023 1:22 PM
> >>>>
> >>>> On 11/9/2023 6:25 PM, Parav Pandit wrote:
> >>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >>>>>> Sent: Thursday, November 9, 2023 3:39 PM
> >>>>>>
> >>>>>> On 11/9/2023 2:28 PM, Parav Pandit wrote:
> >>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >>>>>>>> Sent: Tuesday, November 7, 2023 3:02 PM
> >>>>>>>>
> >>>>>>>> On 11/6/2023 6:52 PM, Parav Pandit wrote:
> >>>>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >>>>>>>>>> Sent: Monday, November 6, 2023 2:57 PM
> >>>>>>>>>>
> >>>>>>>>>> On 11/6/2023 12:12 PM, Parav Pandit wrote:
> >>>>>>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> >>>>>>>>>>>> Sent: Monday, November 6, 2023 9:01 AM
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 11/3/2023 11:50 PM, Parav Pandit wrote:
> >>>>>>>>>>>>>> From: virtio-comment@lists.oasis-open.org
> >>>>>>>>>>>>>> <virtio-comment@lists.oasis-open.org> On Behalf Of Zhu, Lingshan
> >>>>>>>>>>>>>> Sent: Friday, November 3, 2023 8:27 PM
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On 11/3/2023 7:35 PM, Parav Pandit wrote:
> >>>>>>>>>>>>>>>> From: Zhu Lingshan <lingshan.zhu@intel.com>
> >>>>>>>>>>>>>>>> Sent: Friday, November 3, 2023 4:05 PM
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> This patch adds two new le16 fields to the common
> >>>>>>>>>>>>>>>> configuration structure to support VIRTIO_F_QUEUE_STATE
> >>>>>>>>>>>>>>>> in the PCI transport layer.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
> >>>>>>>>>>>>>>>> ---
> >>>>>>>>>>>>>>>>  transport-pci.tex | 18 ++++++++++++++++++
> >>>>>>>>>>>>>>>>  1 file changed, 18 insertions(+)
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> diff --git a/transport-pci.tex b/transport-pci.tex
> >>>>>>>>>>>>>>>> index a5c6719..3161519 100644
> >>>>>>>>>>>>>>>> --- a/transport-pci.tex
> >>>>>>>>>>>>>>>> +++ b/transport-pci.tex
> >>>>>>>>>>>>>>>> @@ -325,6 +325,10 @@ \subsubsection{Common configuration structure layout}\label{sec:Virtio Transport
> >>>>>>>>>>>>>>>>         /* About the administration virtqueue. */
> >>>>>>>>>>>>>>>>         le16 admin_queue_index;  /* read-only for driver */
> >>>>>>>>>>>>>>>>         le16 admin_queue_num;    /* read-only for driver */
> >>>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>>> +       /* Virtqueue state */
> >>>>>>>>>>>>>>>> +       le16 queue_avail_state;  /* read-write */
> >>>>>>>>>>>>>>>> +       le16 queue_used_state;   /* read-write */
> >>>>>>>>>>>>>>> This tiny interface for 128 virtio-net queues through register
> >>>>>>>>>>>>>>> reads and writes does not work effectively.
> >>>>>>>>>>>>>>> There are in-flight out-of-order descriptors for block as well.
> >>>>>>>>>>>>>>> Hence toy registers like this do not work.
> >>>>>>>>>>>>>> Do you know there is a queue_select? Why does this not work?
> >>>>>>>>>>>>>> Do you know how other queue-related fields work?
> >>>>>>>>>>>>> :)
> >>>>>>>>>>>>> Yes. If you notice, a critical spec bug fix related to queue_reset
> >>>>>>>>>>>>> was done when it was introduced so that live migration can
> >>>>>>>>>>>>> _actually_ work.
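For reference, a non-normative C rendering of the tail of the common configuration structure with the quoted hunk applied. The field names and read/write annotations come from the hunk above; the struct name, the le16 typedef, and the paraphrased meaning of the two state fields are illustrative assumptions only, with the precise semantics defined elsewhere in this series.

    #include <stdint.h>
    typedef uint16_t le16;   /* 16-bit little-endian field, as in the spec's notation */

    struct virtio_pci_common_cfg_tail {
            le16 queue_select;       /* read-write: selects which virtqueue the
                                        per-queue fields below refer to */
            /* ... existing per-queue fields (size, enable, addresses) elided ... */

            /* About the administration virtqueue. */
            le16 admin_queue_index;  /* read-only for driver */
            le16 admin_queue_num;    /* read-only for driver */

            /* Virtqueue state (this patch): presumably the selected queue's
               available/used progress, read out while the device is SUSPENDed
               and written back before RESUME */
            le16 queue_avail_state;  /* read-write */
            le16 queue_used_state;   /* read-write */
    };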
> >>>>>>>>>>>>> When queue_select is done for 128 queues serially, it takes a lot
> >>>>>>>>>>>>> of time to read those slow register interfaces for this + in-flight
> >>>>>>>>>>>>> descriptors + more.
> >>>>>>>>>>>> interesting, virtio has worked in this pattern for many years, right?
> >>>>>>>>>>> All these years 400Gbps and 800Gbps virtio was not present, and such
> >>>>>>>>>>> numbers of queues were not in hw.
> >>>>>>>>>> The registers are control path in config space; how do 400G or 800G affect them??
> >>>>>>>>> Because those are the ones that in practice require a large number of VQs.
> >>>>>>>>>
> >>>>>>>>> You are asking for per-VQ register commands to modify things dynamically,
> >>>>>>>>> one vq at a time, serializing all the operations.
> >>>>>>>>> It does not scale well with high q count.
> >>>>>>>> This is not dynamic, it only happens on SUSPEND and RESUME.
> >>>>>>>> This is the same mechanism by which virtio initializes a virtqueue,
> >>>>>>>> working for many years.
> >>>>>>> No. When the virtio driver initializes it for the first time, there is no
> >>>>>>> active traffic that gets lost.
> >>>>>>> This is because the interface is not yet up and not part of the network yet.
> >>>>>>> The resume must be fast enough, because the remote node is sending packets.
> >>>>>>> Hence it is different from driver-init-time queue enable.
> >>>>>> I am not sure any packets arrive before a link announce at the destination side.
> >>>>> I think it can.
> >>>>> Because there is no notification of member device link down to the remote side.
> >>>>> The L4 and L5 protocols have no knowledge that the node they are interacting
> >>>>> with is behind some layers of switches.
> >>>>> So keeping this time low is desired.
> >>>> The NIC should broadcast itself first, so that other peers in the network know
> >>>> how to send a message to it (for example its mac, to route to it).
> >>>>
> >>>> This is necessary; for example VIRTIO_NET_F_GUEST_ANNOUNCE, and similar
> >>>> mechanisms have worked in in-market products for years.
> >>>>
> >>>> This is out of the topic anyway.
> >>>>>>>>>> See the virtio common cfg, you will find the max number of vqs there,
> >>>>>>>>>> num_queues.
> >>>>>>>>> :)
> >>>>>>>>> Sure. Those values at high q count have an effect.
> >>>>>>>> the driver needs to initialize them anyway.
> >>>>>>> That is before the traffic starts from the remote end.
> >>>>>> see above, that needs a link announce and this is after re-initialization
> >>>>>>>>>>> Device didn't support LM.
> >>>>>>>>>>> Many limitations existed all these years and the TC is improving and
> >>>>>>>>>>> expanding them.
> >>>>>>>>>>> So all these years do not matter.
> >>>>>>>>>> Not sure what you are talking about, haven't we initialized the device
> >>>>>>>>>> and vqs in config space for years?????? What's wrong with this mechanism?
> >>>>>>>>>> Are you questioning virtio-pci fundamentals???
> >>>>>>>>> Don't point to an inefficient past to establish a similar inefficient future.
> >>>>>>>> interesting, you know this is a one-time thing, right?
> >>>>>>>> and you are aware this has been there for years.
> >>>>>>>>>>>>>> Like how to set a queue size and enable it?
> >>>>>>>>>>>>> Those are meant to be used before the DRIVER_OK stage as they are
> >>>>>>>>>>>>> init-time registers.
> >>>>>>>>>>>>> Not to keep abusing them..
> >>>>>>>>>>>> don't you need to set queue_size at the destination side?
> >>>>>>>>>>> No.
> >>>>>>>>>>> But the src/dst does not matter.
> >>>>>>>>>>> Queue_size is to be set before DRIVER_OK like the rest of the registers,
> >>>>>>>>>>> as all queues must be created before the driver_ok phase.
> >>>>>>>>>>> Queue_reset was a last-moment exception.
> >>>>>>>>>> create a queue? Nvidia specific?
> >>>>>>>>>>
> >>>>>>>>> Huh. No.
> >>>>>>>>> Do git log and realize what happened with queue_reset.
> >>>>>>>> You didn't answer the question: does the spec even define "create a vq"?
> >>>>>>> Enabled/created = tomato/tomato when discussing the spec in a non-normative
> >>>>>>> email conversation.
> >>>>>>> It's irrelevant.
> >>>>>> Then let's not debate "enable a vq" vs "create a vq" anymore.
> >>>>>>> All I am saying is, when we know the limitations of the transport, and when
> >>>>>>> the industry is moving toward not introducing more and more on-die registers
> >>>>>>> for the once-in-a-lifetime work of device migration, we just use the optimal
> >>>>>>> command and queue interface that is native to virtio.
> >>>>>> PCI config space has its own limitations, and the admin vq has its advantages,
> >>>>>> but that does not apply to all use cases.
> >>>>>>
> >>>>> There was recent work emulating the SR-IOV capability and allowing a VM to
> >>>>> enable SR-IOV in [1].
> >>>>> This is the option I mentioned a few weeks ago.
> >>>>>
> >>>>> So with admin commands and admin virtqueues, even the nested model will work
> >>>>> using [1].
> >>>>> [1] https://netdevconf.info/0x17/sessions/talk/unleashing-sr-iov-offload-on-virtual-machines.html
> >>>> We should take this into consideration once it is standardized in the spec,
> >>>> maybe not now; there can always be many workarounds to solve one problem.
> >>> Sure, until that point the admin commands are able to meet the need well.
> >>> And when spec changes in the transport occur (if needed), the current admin
> >>> commands and admin vq also fit very well, following [1] above.
> >> we have pointed out lots of problems with the admin vq based live migration
> >> proposal, I won't repeat them here
> > I don't see any.
> > Nested is already solved using the above.
> I don't see how, do you mind working out the patches?

Once the base series is completed, nested cases can be addressed.
I won't be able to work on the patches for it until we finish the first-level
virtualization.

> > Long time ago, you mentioned some QoS issue, which anyway exists in the device
> > register method too.
> > Can you please list them, if there is anything other than QoS and nesting?
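To make the disagreement above concrete, here is a rough, hypothetical sketch of how a hypervisor-side driver might save and restore virtqueue state through the proposed registers. The accessor helpers, register offsets, and the exact ordering around SUSPEND/RESUME and DRIVER_OK are illustrative assumptions, not taken from the spec or this series; the point is the per-queue, queue_select-serialized register access pattern being debated (for 128 queues, a few hundred selects and reads on save, and as many writes on restore).

    #include <stdint.h>

    /* Hypothetical MMIO accessors; a real driver would use its transport's
     * helpers (e.g. ioread16/iowrite16 in Linux) against the common cfg BAR. */
    extern uint16_t cfg_read16(uint32_t offset);
    extern void     cfg_write16(uint32_t offset, uint16_t val);

    /* Placeholder offsets, not the real common configuration layout. */
    enum {
            OFF_QUEUE_SELECT      = 0x16,
            OFF_QUEUE_AVAIL_STATE = 0x40,
            OFF_QUEUE_USED_STATE  = 0x42,
    };

    struct vq_state {
            uint16_t avail;
            uint16_t used;
    };

    /* Save: with the device SUSPENDed, walk every queue through queue_select
     * and read the two state fields.  This is the serial, one-queue-at-a-time
     * register pattern the thread is arguing about. */
    static void save_queue_state(struct vq_state *out, uint16_t num_queues)
    {
            for (uint16_t q = 0; q < num_queues; q++) {
                    cfg_write16(OFF_QUEUE_SELECT, q);
                    out[q].avail = cfg_read16(OFF_QUEUE_AVAIL_STATE);
                    out[q].used  = cfg_read16(OFF_QUEUE_USED_STATE);
            }
    }

    /* Restore: on the destination, write the saved state back through the same
     * selector, alongside the usual pre-DRIVER_OK queue setup (size, addresses,
     * enable), then let the device RESUME and announce itself. */
    static void restore_queue_state(const struct vq_state *in, uint16_t num_queues)
    {
            for (uint16_t q = 0; q < num_queues; q++) {
                    cfg_write16(OFF_QUEUE_SELECT, q);
                    cfg_write16(OFF_QUEUE_AVAIL_STATE, in[q].avail);
                    cfg_write16(OFF_QUEUE_USED_STATE, in[q].used);
            }
    }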