Subject: Re: [virtio-comment] Re: [PATCH V2 4/6] virtio-pci: implement VIRTIO_F_QUEUE_STATE


On Fri, Nov 17, 2023 at 6:06 PM Parav Pandit <parav@nvidia.com> wrote:
>
>
> > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > Sent: Friday, November 17, 2023 3:32 PM
> > To: Parav Pandit <parav@nvidia.com>; jasowang@redhat.com;
> > mst@redhat.com; eperezma@redhat.com; cohuck@redhat.com;
> > stefanha@redhat.com
> > Cc: virtio-comment@lists.oasis-open.org
> > Subject: Re: [virtio-comment] Re: [PATCH V2 4/6] virtio-pci: implement
> > VIRTIO_F_QUEUE_STATE
> >
> >
> >
> > On 11/16/2023 6:21 PM, Parav Pandit wrote:
> > >> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > >> Sent: Thursday, November 16, 2023 3:45 PM
> > >>
> > >> On 11/16/2023 1:35 AM, Parav Pandit wrote:
> > >>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > >>>> Sent: Monday, November 13, 2023 2:56 PM
> > >>>>
> > >>>>
> > >>>>
> > >>>> On 11/10/2023 8:31 PM, Parav Pandit wrote:
> > >>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > >>>>>> Sent: Friday, November 10, 2023 1:22 PM
> > >>>>>>
> > >>>>>>
> > >>>>>> On 11/9/2023 6:25 PM, Parav Pandit wrote:
> > >>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > >>>>>>>> Sent: Thursday, November 9, 2023 3:39 PM
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> On 11/9/2023 2:28 PM, Parav Pandit wrote:
> > >>>>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > >>>>>>>>>> Sent: Tuesday, November 7, 2023 3:02 PM
> > >>>>>>>>>>
> > >>>>>>>>>> On 11/6/2023 6:52 PM, Parav Pandit wrote:
> > >>>>>>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > >>>>>>>>>>>> Sent: Monday, November 6, 2023 2:57 PM
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On 11/6/2023 12:12 PM, Parav Pandit wrote:
> > >>>>>>>>>>>>>> From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > >>>>>>>>>>>>>> Sent: Monday, November 6, 2023 9:01 AM
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> On 11/3/2023 11:50 PM, Parav Pandit wrote:
> > >>>>>>>>>>>>>>>> From: virtio-comment@lists.oasis-open.org
> > >>>>>>>>>>>>>>>> <virtio-comment@lists.oasis-open.org> On Behalf Of
> > >>>>>>>>>>>>>>>> Zhu, Lingshan
> > >>>>>>>>>>>>>>>> Sent: Friday, November 3, 2023 8:27 PM
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> On 11/3/2023 7:35 PM, Parav Pandit wrote:
> > >>>>>>>>>>>>>>>>>> From: Zhu Lingshan <lingshan.zhu@intel.com>
> > >>>>>>>>>>>>>>>>>> Sent: Friday, November 3, 2023 4:05 PM
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> This patch adds two new le16 fields to common
> > >>>>>>>>>>>>>>>>>> configuration structure to support
> > >>>>>>>>>>>>>>>>>> VIRTIO_F_QUEUE_STATE in PCI transport
> > >>>>>> layer.
> > >>>>>>>>>>>>>>>>>> Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
> > >>>>>>>>>>>>>>>>>> ---
> > >>>>>>>>>>>>>>>>>>           transport-pci.tex | 18 ++++++++++++++++++
> > >>>>>>>>>>>>>>>>>>           1 file changed, 18 insertions(+)
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> diff --git a/transport-pci.tex b/transport-pci.tex
> > >>>>>>>>>>>>>>>>>> index
> > >>>>>>>>>>>>>>>>>> a5c6719..3161519 100644
> > >>>>>>>>>>>>>>>>>> --- a/transport-pci.tex
> > >>>>>>>>>>>>>>>>>> +++ b/transport-pci.tex
> > >>>>>>>>>>>>>>>>>> @@ -325,6 +325,10 @@ \subsubsection{Common
> > >>>> configuration
> > >>>>>>>>>>>>>> structure
> > >>>>>>>>>>>>>>>>>> layout}\label{sec:Virtio Transport
> > >>>>>>>>>>>>>>>>>>                   /* About the administration virtqueue. */
> > >>>>>>>>>>>>>>>>>>                   le16 admin_queue_index;         /* read-only for
> > >> driver
> > >>>>>> */
> > >>>>>>>>>>>>>>>>>>                   le16 admin_queue_num;         /* read-only for
> > >> driver
> > >>>>>> */
> > >>>>>>>>>>>>>>>>>> +
> > >>>>>>>>>>>>>>>>>> +  /* Virtqueue state */
> > >>>>>>>>>>>>>>>>>> +        le16 queue_avail_state;         /* read-write */
> > >>>>>>>>>>>>>>>>>> +        le16 queue_used_state;          /* read-write */
> > >>>>>>>>>>>>>>>>> This tiny interface for 128 virtio net queues through
> > >>>>>>>>>>>>>>>>> register reads/writes does
> > >>>>>>>>>>>>>>>> not work effectively.
> > >>>>>>>>>>>>>>>>> There are in-flight out-of-order descriptors for block also.
> > >>>>>>>>>>>>>>>>> Hence toy registers like this do not work.
> > >>>>>>>>>>>>>>>> Do you know there is a queue_select? Why does this not
> > >> work?
> > >>>>>>>>>>>>>>>> Do you know how other queue related fields work?
> > >>>>>>>>>>>>>>> :)
> > >>>>>>>>>>>>>>> Yes. If you notice, the critical queue_reset-related spec bug
> > >>>>>>>>>>>>>>> fix was done when it
> > >>>>>>>>>>>>>> was introduced so that live migration can _actually_ work.
> > >>>>>>>>>>>>>>> When queue_select is done for 128 queues serially, it
> > >>>>>>>>>>>>>>> takes a lot of time to
> > >>>>>>>>>>>>>> read that slow register interface for this + inflight
> > >>>>>>>>>>>>>> descriptors +
> > >>>>>> more.
> > >>>>>>>>>>>>>> Interesting; virtio has worked in this pattern for many years, right?
> > >>>>>>>>>>>>> All these years, 400Gbps and 800Gbps virtio was not
> > >>>>>>>>>>>>> present, and such numbers of
> > >>>>>>>>>>>> queues were not in hw.
> > >>>>>>>>>>>> The registers are control path in config space; how do 400G or
> > >>>>>>>>>>>> 800G
> > >>>>>> affect them?
> > >>>>>>>>>>> Because those are the ones that in practice require a large number of
> > VQs.
> > >>>>>>>>>>>
> > >>>>>>>>>>> You are asking for per-VQ register commands to modify things
> > >>>>>>>>>>> dynamically via
> > >>>>>>>>>> this one vq at a time, serializing all the operations.
> > >>>>>>>>>>> It does not scale well with high q count.
> > >>>>>>>>>> This is not dynamic; it only happens at SUSPEND and
> > RESUME.
> > >>>>>>>>>> This is the same mechanism by which virtio initializes a virtqueue,
> > >>>>>>>>>> and it has worked for many years.
> > >>>>>>>>> No. When the virtio driver initializes it for the first time,
> > >>>>>>>>> there is no active traffic
> > >>>>>>>> that gets lost.
> > >>>>>>>>> This is because the interface is not yet up and not part of
> > >>>>>>>>> the network
> > >>>> yet.
> > >>>>>>>>> The resume must be fast enough, because the remote node is
> > >>>>>>>>> sending
> > >>>>>>>> packets.
> > >>>>>>>>> Hence it is different from driver init time queue enable.
> > >>>>>>>> I am not sure any packets arrive before a link announce at the
> > >>>>>>>> destination
> > >>>>>> side.
> > >>>>>>> I think they can.
> > >>>>>>> Because there is no notification of the member device's link-down
> > >>>>>>> sent to the
> > >>>>>> remote side.
> > >>>>>>> The L4 and L5 protocols have no knowledge that the node they
> > >>>>>>> are
> > >>>>>> interacting with is behind some layers of switches.
> > >>>>>>> So keeping this time low is desired.
> > >>>>>> The NIC should broadcast itself first, so that other peers in
> > >>>>>> the network know (for example, its MAC, to route to it) how to send a
> > >>>>>> message to
> > >> it.
> > >>>>>> This is necessary; for example, VIRTIO_NET_F_GUEST_ANNOUNCE and
> > >>>>>> similar mechanisms have worked in products on the market for years.
> > >>>>>>
> > >>>>>> This is out of the topic anyway.
> > >>>>>>>>>>>> See the virtio common cfg; you will find the max number of
> > >>>>>>>>>>>> vqs there: num_queues.
> > >>>>>>>>>>> :)
> > >>>>>>>>>>> Sure. Those values have an effect at high q count.
> > >>>>>>>>>> The driver needs to initialize them anyway.
> > >>>>>>>>> That is before the traffic starts from the remote end.
> > >>>>>>>> See above; that needs a link announce, and this is after
> > >>>>>>>> re-initialization.
> > >>>>>>>>>>>>> Device didn't support LM.
> > >>>>>>>>>>>>> Many limitations existed all these years and TC is
> > >>>>>>>>>>>>> improving and expanding
> > >>>>>>>>>>>> them.
> > >>>>>>>>>>>>> So all these years do not matter.
> > >>>>>>>>>>>> Not sure what you are talking about; haven't we initialized
> > >>>>>>>>>>>> the device and vqs in config space for years?????? What's
> > >>>>>>>>>>>> wrong with this
> > >>>>>>>> mechanism?
> > >>>>>>>>>>>> Are you questioning virtio-pci fundamentals???
> > >>>>>>>>>>> Don't point to an inefficient past to establish a similarly inefficient
> > future.
> > >>>>>>>>>> Interesting; you know this is a one-time thing, right?
> > >>>>>>>>>> And you are aware this has been there for years.
> > >>>>>>>>>>>>>>>> Like how to set a queue size and enable it?
> > >>>>>>>>>>>>>>> Those are meant to be used before DRIVER_OK stage as
> > >>>>>>>>>>>>>>> they are init time
> > >>>>>>>>>>>>>> registers.
> > >>>>>>>>>>>>>>> Not to keep abusing them.
> > >>>>>>>>>>>>>> don't you need to set queue_size at the destination side?
> > >>>>>>>>>>>>> No.
> > >>>>>>>>>>>>> But the src/dst does not matter.
> > >>>>>>>>>>>>> Queue_size is to be set before DRIVER_OK like the rest of the
> > >>>>>>>>>>>>> registers, as all
> > >>>>>>>>>>>> queues must be created before the driver_ok phase.
> > >>>>>>>>>>>>> Queue_reset was a last-moment exception.
> > >>>>>>>>>>>> create a queue? Nvidia specific?
> > >>>>>>>>>>>>
> > >>>>>>>>>>> Huh. No.
> > >>>>>>>>>>> Do git log and realize what happened with queue_reset.
> > >>>>>>>>>> You didn't answer the question: does the spec even
> > >>>>>>>>>> define "create a
> > >>>>>>>> vq"?
> > >>>>>>>>> Enabled/created = tomato/tomato when discussing the spec in
> > >>>>>>>>> non-normative
> > >>>>>>>> email conversation.
> > >>>>>>>>> It's irrelevant.
> > >>>>>>>> Then let's not debate "enable a vq" vs. "create a vq" anymore.
> > >>>>>>>>> All I am saying is, when we know the limitations of the
> > >>>>>>>>> transport and when the industry is moving toward not introducing
> > >>>>>>>>> more and more on-die registers
> > >>>>>>>> for the once-in-a-lifetime work of device migration, we just use the
> > >>>>>>>> optimal command and queue interface that is native to virtio.
> > >>>>>>>> PCI config space has its own limitations, and admin vq has its
> > >>>>>>>> advantages, but that does not apply to all use cases.
> > >>>>>>>>
> > >>>>>>> There was recent work done emulating the SR-IOV capability and
> > >>>>>>> allowing a VM to
> > >>>>>> enable SR-IOV in [1].
> > >>>>>>> This is the option I mentioned few weeks ago.
> > >>>>>>>
> > >>>>>>> So with admin commands and admin virtqueues, even a nested model
> > >>>>>>> will work
> > >>>>>> using [1].
> > >>>>>>> [1]
> > >>>>>>> https://netdevconf.info/0x17/sessions/talk/unleashing-sr-iov-offload-on-virtual-machines.html
> > >>>>>> We should take this into consideration once it is standardized in
> > >>>>>> the spec, though maybe not now; there can always be many workarounds to
> > >>>>>> solve one
> > >>>> problem.
> > >>>>> Sure, until that point the admin commands are able to meet the
> > >>>>> need
> > >> well.
> > >>>>> And when spec changes in the transport occur (if needed), the current
> > >>>>> admin
> > >>>> command and admin vq also fit very well, following [1] above.
> > >>>> we have pointed out lots of problems with the admin vq based live migration
> > >>>> proposal; I won't repeat them here.
> > >>> I don't see any.
> > >>> Nested is already solved using above.
> > >> I don't see how; do you mind working out the patches?
> > > Once the base series is completed, nested cases can be addressed.
> > > I won't be able to work on the patches for it until we finish the first-level
> > virtualization.
> > As you know, nested is supported well in current virtio, so please don't break it.
>
> And the same comment repeats. :)
> Expect the same response...
> Sorry, no: the virtio specification does not support device migration today.
> Nothing is broken by adding new features.
>
> Above [1] has the right proposal that Jason's paper pointed out. Please use it.

I was involved in the design in [1], and I don't see a connection to
the discussion here:

1) It is based on vDPA in L0.
2) It doesn't address the nesting issue; it requires a proper design
in the virtio spec to support migration in the nesting layer.

Thanks
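
For readers skimming the archive: the interface under debate is the pair of
per-queue state registers that the quoted patch adds next to the existing
queue_select-addressed fields in the PCI common configuration structure.
Below is a rough C sketch of the tail of that structure with the proposal
applied. The pre-existing field names follow the published virtio spec, the
last two come from the quoted patch, and the little-endian le16/le64 fields
are shown as plain fixed-width integers, so treat this as an illustration
rather than the normative layout.

#include <stdint.h>

/* Sketch only: tail of the common configuration structure as it would look
 * with the quoted patch applied. Offsets, padding and endianness handling
 * are omitted for brevity. */
struct virtio_pci_common_cfg_tail {
        /* ... device/driver feature, status and generation fields ... */

        uint16_t queue_select;          /* read-write: selects the queue below */
        uint16_t queue_size;            /* read-write */
        uint16_t queue_msix_vector;     /* read-write */
        uint16_t queue_enable;          /* read-write */
        uint16_t queue_notify_off;      /* read-only for driver */
        uint64_t queue_desc;            /* read-write */
        uint64_t queue_driver;          /* read-write */
        uint64_t queue_device;          /* read-write */
        uint16_t queue_notify_data;     /* read-only for driver */
        uint16_t queue_reset;           /* read-write */

        /* About the administration virtqueue. */
        uint16_t admin_queue_index;     /* read-only for driver */
        uint16_t admin_queue_num;       /* read-only for driver */

        /* Proposed by the patch: virtqueue state of the selected queue. */
        uint16_t queue_avail_state;     /* read-write */
        uint16_t queue_used_state;      /* read-write */
};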
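
The scaling point argued back and forth above is that these fields are
reached through queue_select, so saving the state of every queue is a
strictly serialized sequence of register accesses. A minimal sketch of that
driver-side save loop follows; the accessor names and register offsets are
hypothetical placeholders for illustration and do not come from the patch,
the spec, or any existing driver.

#include <stdint.h>

struct vq_state {
        uint16_t avail_state;
        uint16_t used_state;
};

/* Hypothetical accessors for the mapped common configuration capability. */
extern void     common_cfg_write16(uintptr_t offset, uint16_t val);
extern uint16_t common_cfg_read16(uintptr_t offset);

/* Placeholder register offsets, for illustration only. */
enum {
        OFF_QUEUE_SELECT      = 0x16,
        OFF_QUEUE_AVAIL_STATE = 0x40,
        OFF_QUEUE_USED_STATE  = 0x42,
};

/* Save the avail/used state of every virtqueue while the device is
 * suspended. With e.g. 128 queues this is 3 * 128 register accesses
 * performed one after another, which is the cost the thread is debating. */
static void save_all_queue_states(struct vq_state *out, uint16_t num_queues)
{
        for (uint16_t i = 0; i < num_queues; i++) {
                common_cfg_write16(OFF_QUEUE_SELECT, i);
                out[i].avail_state = common_cfg_read16(OFF_QUEUE_AVAIL_STATE);
                out[i].used_state  = common_cfg_read16(OFF_QUEUE_USED_STATE);
        }
}

Whether a few hundred such accesses at suspend/resume time is acceptable, or
whether the same state should instead travel over admin commands and admin
virtqueues, is exactly the disagreement in the quoted exchange.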


