virtio-comment message



Subject: Re: [virtio-comment] Re: [PATCH V2 4/6] virtio-pci: implement VIRTIO_F_QUEUE_STATE


On Fri, Nov 17, 2023 at 06:02:14PM +0800, Zhu, Lingshan wrote:
> 
> 
> On 11/16/2023 6:21 PM, Parav Pandit wrote:
> > > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > Sent: Thursday, November 16, 2023 3:45 PM
> > > 
> > > On 11/16/2023 1:35 AM, Parav Pandit wrote:
> > > > > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > > > Sent: Monday, November 13, 2023 2:56 PM
> > > > > 
> > > > > 
> > > > > 
> > > > > On 11/10/2023 8:31 PM, Parav Pandit wrote:
> > > > > > > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > > > > > Sent: Friday, November 10, 2023 1:22 PM
> > > > > > > 
> > > > > > > 
> > > > > > > On 11/9/2023 6:25 PM, Parav Pandit wrote:
> > > > > > > > > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > > > > > > > Sent: Thursday, November 9, 2023 3:39 PM
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > On 11/9/2023 2:28 PM, Parav Pandit wrote:
> > > > > > > > > > > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > > > > > > > > > Sent: Tuesday, November 7, 2023 3:02 PM
> > > > > > > > > > > 
> > > > > > > > > > > On 11/6/2023 6:52 PM, Parav Pandit wrote:
> > > > > > > > > > > > > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > > > > > > > > > > > Sent: Monday, November 6, 2023 2:57 PM
> > > > > > > > > > > > > 
> > > > > > > > > > > > > On 11/6/2023 12:12 PM, Parav Pandit wrote:
> > > > > > > > > > > > > > > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > > > > > > > > > > > > > Sent: Monday, November 6, 2023 9:01 AM
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > On 11/3/2023 11:50 PM, Parav Pandit wrote:
> > > > > > > > > > > > > > > > > From: virtio-comment@lists.oasis-open.org
> > > > > > > > > > > > > > > > > <virtio-comment@lists.oasis-open.org> On Behalf Of Zhu,
> > > > > > > > > > > > > > > > > Lingshan
> > > > > > > > > > > > > > > > > Sent: Friday, November 3, 2023 8:27 PM
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > On 11/3/2023 7:35 PM, Parav Pandit wrote:
> > > > > > > > > > > > > > > > > > > From: Zhu Lingshan <lingshan.zhu@intel.com>
> > > > > > > > > > > > > > > > > > > Sent: Friday, November 3, 2023 4:05 PM
> > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > This patch adds two new le16 fields to the common
> > > > > > > > > > > > > > > > > > > configuration structure to support VIRTIO_F_QUEUE_STATE
> > > > > > > > > > > > > > > > > > > in the PCI transport layer.
> > > > > > > > > > > > > > > > > > > Signed-off-by: Zhu Lingshan <lingshan.zhu@intel.com>
> > > > > > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > > > > > >           transport-pci.tex | 18 ++++++++++++++++++
> > > > > > > > > > > > > > > > > > >           1 file changed, 18 insertions(+)
> > > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > > diff --git a/transport-pci.tex b/transport-pci.tex
> > > > > > > > > > > > > > > > > > > index
> > > > > > > > > > > > > > > > > > > a5c6719..3161519 100644
> > > > > > > > > > > > > > > > > > > --- a/transport-pci.tex
> > > > > > > > > > > > > > > > > > > +++ b/transport-pci.tex
> > > > > > > > > > > > > > > > > > > @@ -325,6 +325,10 @@ \subsubsection{Common configuration structure
> > > > > > > > > > > > > > > > > > > layout}\label{sec:Virtio Transport
> > > > > > > > > > > > > > > > > > >                   /* About the administration virtqueue. */
> > > > > > > > > > > > > > > > > > >                   le16 admin_queue_index;         /* read-only for driver */
> > > > > > > > > > > > > > > > > > >                   le16 admin_queue_num;         /* read-only for driver */
> > > > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > > > +	/* Virtqueue state */
> > > > > > > > > > > > > > > > > > > +        le16 queue_avail_state;         /* read-write */
> > > > > > > > > > > > > > > > > > > +        le16 queue_used_state;          /* read-write */
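
For readers skimming the quoted diff, a rough C view of the queue_select-indexed
window of the common configuration structure, with the two proposed fields
appended, is sketched below. The field list is abridged and the layout
simplified; only the two queue state fields come from this patch, the rest is
paraphrased context and the spec text remains authoritative.

#include <stdint.h>

typedef uint16_t le16;   /* little-endian fields, following the spec's struct notation */
typedef uint64_t le64;

/* Abridged sketch of the queue_select-indexed part of the PCI common
 * configuration structure, with the two fields this patch appends.
 * Field order/offsets are simplified here; the spec text, not this
 * sketch, is authoritative. */
struct common_cfg_queue_window_sketch {
        le16 queue_select;         /* read-write: selects the queue addressed below */
        le16 queue_size;           /* read-write */
        le16 queue_msix_vector;    /* read-write */
        le16 queue_enable;         /* read-write */
        le16 queue_notify_off;     /* read-only for driver */
        le64 queue_desc;           /* read-write */
        le64 queue_driver;         /* read-write */
        le64 queue_device;         /* read-write */
        le16 queue_notify_data;    /* read-only for driver */
        le16 queue_reset;          /* read-write */

        /* Added by this patch when VIRTIO_F_QUEUE_STATE is negotiated: */
        le16 queue_avail_state;    /* read-write: available-ring state of the selected queue */
        le16 queue_used_state;     /* read-write: used-ring state of the selected queue */
};
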
> > > > > > > > > > > > > > > > > > This tiny interface for 128 virtio net queues through
> > > > > > > > > > > > > > > > > > register read writes does not work effectively.
> > > > > > > > > > > > > > > > > > There are inflight out of order descriptors for block also.
> > > > > > > > > > > > > > > > > > Hence toy registers like this do not work.
> > > > > > > > > > > > > > > > > Do you know there is a queue_select? Why does this not work?
> > > > > > > > > > > > > > > > > Do you know how other queue-related fields work?
> > > > > > > > > > > > > > > > :)
> > > > > > > > > > > > > > > > Yes. If you notice, the queue_reset-related critical spec bug
> > > > > > > > > > > > > > > > fix was done when it was introduced so that live migration can
> > > > > > > > > > > > > > > > _actually_ work.
> > > > > > > > > > > > > > > > When queue_select is done for 128 queues serially, it takes a
> > > > > > > > > > > > > > > > lot of time to read those slow register interfaces for this +
> > > > > > > > > > > > > > > > inflight descriptors + more.
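
To make the "serially" concern concrete, here is a minimal driver-side sketch
of the register pattern being debated. The accessor names, offsets, and backing
array are hypothetical stand-ins, not spec or driver code; the point is only
that each queue's state is reached one queue_select write at a time.

#include <stdint.h>

/* Stand-in for the device's common configuration window; in a real
 * driver these would be MMIO accessors, not an array.  Names and
 * offsets here are illustrative, not normative. */
static uint16_t fake_common_cfg[0x30];

enum {
        OFF_QUEUE_SELECT      = 0x0b,   /* index into fake_common_cfg, not a spec offset */
        OFF_QUEUE_AVAIL_STATE = 0x20,   /* hypothetical location of the proposed field   */
        OFF_QUEUE_USED_STATE  = 0x21,   /* hypothetical location of the proposed field   */
};

static void     cfg_write16(unsigned off, uint16_t val) { fake_common_cfg[off] = val; }
static uint16_t cfg_read16(unsigned off)                { return fake_common_cfg[off]; }

struct vq_state { uint16_t avail, used; };

/* The pattern under discussion: saving state for every queue takes one
 * queue_select write plus per-queue reads, strictly serialized.  With
 * 128 queues and slow register reads, the total time grows as
 * 128 * (per-queue register round trips). */
static void save_all_queue_state(struct vq_state *out, uint16_t num_queues)
{
        for (uint16_t q = 0; q < num_queues; q++) {
                cfg_write16(OFF_QUEUE_SELECT, q);   /* select queue q */
                out[q].avail = cfg_read16(OFF_QUEUE_AVAIL_STATE);
                out[q].used  = cfg_read16(OFF_QUEUE_USED_STATE);
        }
}
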
> > > > > > > > > > > > > > > interesting, virtio has worked in this pattern for many years, right?
> > > > > > > > > > > > > > All these years 400Gbps and 800Gbps virtio was not present, and
> > > > > > > > > > > > > > this number of queues was not in hw.
> > > > > > > > > > > > > The registers are control path in config space, how do 400G or
> > > > > > > > > > > > > 800G affect them??
> > > > > > > > > > > > Because those are the ones that in practice require a large number of VQs.
> > > > > > > > > > > > 
> > > > > > > > > > > > You are asking for per-VQ register commands to modify things
> > > > > > > > > > > > dynamically via this, one vq at a time, serializing all the operations.
> > > > > > > > > > > > It does not scale well with a high q count.
> > > > > > > > > > > This is not dynamic, it only happens on SUSPEND and RESUME.
> > > > > > > > > > > This is the same mechanism by which virtio initializes a virtqueue,
> > > > > > > > > > > and it has worked for many years.
> > > > > > > > > > No. When the virtio driver initializes it for the first time, there
> > > > > > > > > > is no active traffic that gets lost.
> > > > > > > > > > This is because the interface is not yet up and not part of the
> > > > > > > > > > network yet.
> > > > > > > > > > The resume must be fast enough, because the remote node is sending
> > > > > > > > > > packets.
> > > > > > > > > > Hence it is different from driver init time queue enable.
> > > > > > > > > I am not sure any packets arrive before a link announce at the
> > > > > > > > > destination side.
> > > > > > > > I think it can.
> > > > > > > > Because there is no notification of member device link down to the
> > > > > > > > remote side.
> > > > > > > > The L4 and L5 protocols have no knowledge that the node they are
> > > > > > > > interacting with is behind some layers of switches.
> > > > > > > > So keeping this time low is desired.
> > > > > > > The NIC should broadcast itself first, so that other peers in the
> > > > > > > network know how to send a message to it (for example, its MAC to
> > > > > > > route to it).
> > > > > > > This is necessary; for example VIRTIO_NET_F_GUEST_ANNOUNCE, a similar
> > > > > > > mechanism, has worked in production for years.
> > > > > > > 
> > > > > > > This is off-topic anyway.
> > > > > > > > > > > > > See the virtio common cfg, you will find the max number of
> > > > > > > > > > > > > vqs is there, num_queues.
> > > > > > > > > > > > :)
> > > > > > > > > > > > Sure. Those values at a high q count affect it.
> > > > > > > > > > > the driver needs to initialize them anyway.
> > > > > > > > > > That is before the traffic starts from remote end.
> > > > > > > > > see above, that needs a link announce and this is after
> > > > > > > > > re-initialization
> > > > > > > > > > > > > > Device didn't support LM.
> > > > > > > > > > > > > > Many limitations existed all these years and TC is improving
> > > > > > > > > > > > > > and expanding them.
> > > > > > > > > > > > > > So all these years do not matter.
> > > > > > > > > > > > > Not sure what you are talking about; haven't we initialized
> > > > > > > > > > > > > the device and vqs in config space for years?????? What's
> > > > > > > > > > > > > wrong with this mechanism?
> > > > > > > > > > > > > Are you questioning virtio-pci fundamentals???
> > > > > > > > > > > > Don't point to an inefficient past to establish a similarly inefficient future.
> > > > > > > > > > > interesting, you know this is a one-time thing, right?
> > > > > > > > > > > and you are aware that this has been there for years.
> > > > > > > > > > > > > > > > > Like how to set a queue size and enable it?
> > > > > > > > > > > > > > > > Those are meant to be used before the DRIVER_OK stage as they
> > > > > > > > > > > > > > > > are init-time registers.
> > > > > > > > > > > > > > > > Not to keep abusing them.
> > > > > > > > > > > > > > > don't you need to set queue_size at the destination side?
> > > > > > > > > > > > > > No.
> > > > > > > > > > > > > > But the src/dst does not matter.
> > > > > > > > > > > > > > Queue_size is to be set before DRIVER_OK like the rest of the
> > > > > > > > > > > > > > registers, as all queues must be created before the driver_ok
> > > > > > > > > > > > > > phase.
> > > > > > > > > > > > > > Queue_reset was a last-moment exception.
> > > > > > > > > > > > > create a queue? Nvidia specific?
> > > > > > > > > > > > > 
> > > > > > > > > > > > Huh. No.
> > > > > > > > > > > > Do git log and realize what happened with queue_reset.
> > > > > > > > > > > You didn't answer the question: does the spec even define
> > > > > > > > > > > "create a vq"?
> > > > > > > > > > Enabled/created = tomato/tomato when discussing the spec in
> > > > > > > > > > non-normative email conversation.
> > > > > > > > > > It's irrelevant.
> > > > > > > > > Then let's not debate "enable a vq" vs. "create a vq" anymore.
> > > > > > > > > > All I am saying is, when we know the limitations of the transport
> > > > > > > > > > and when the industry is moving toward not introducing more and
> > > > > > > > > > more on-die registers for the once-in-a-lifetime work of device
> > > > > > > > > > migration, we just use the optimal command and queue interface
> > > > > > > > > > that is native to virtio.
> > > > > > > > > PCI config space has its own limitations, and admin vq has its
> > > > > > > > > advantages, but that does not apply to all use cases.
> > > > > > > > > 
> > > > > > > > There was recent work done emulating the SR-IOV cap and allowing a
> > > > > > > > VM to enable SR-IOV in [1].
> > > > > > > > This is the option I mentioned a few weeks ago.
> > > > > > > > 
> > > > > > > > So with admin commands and admin virtqueues, even the nested model
> > > > > > > > will work using [1].
> > > > > > > > [1]
> > > > > > > > https://netdevconf.info/0x17/sessions/talk/unleashing-sr-iov-offload-on-virtual-machines.html
> > > > > > > We should take this into consideration once it is standardized in
> > > > > > > the spec, maybe not now; there can always be many workarounds to
> > > > > > > solve one problem.
> > > > > > Sure, until that point the admin commands are able to satisfy the need
> > > > > > well.
> > > > > > And when spec changes in the transport occur (if needed), the current
> > > > > > admin command and admin vq also fit very well, following the above [1].
> > > > > we have pointed out lots of problems with the admin vq based live
> > > > > migration proposal; I won't repeat them here.
> > > > I don't see any.
> > > > Nested is already solved using the above.
> > > I don't see how; do you mind working out the patches?
> > Once the base series is completed, nested cases can be addressed.
> > I won't be able to work on the patches for it until we finish the first-level virtualization.
> As you know, nested is supported well in current virtio, so please don't
> break it.

So for nesting, it seems cleaner to support sending commands through the
device itself.  You aren't going to fit VQ state in a 16-bit register in
the general case though, and will have to resort to DMA. And if you are
doing that, then please just use the admin command format (it does not
have to be a VQ) and then we can all make peace finally.

-- 
MST
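
As a rough illustration of the suggestion above (reuse the admin command
format as a DMA-able request/response, whether or not an admin VQ carries
it), a sketch could look like the following. The structure names, the opcode
value, and the field layout are placeholders, not definitions from the
specification.

#include <stdint.h>

/* Placeholder opcode; not a value assigned by the specification. */
#define ADMIN_CMD_VQ_STATE_GET  0x1000u

/* Sketch of an admin-command-style request carried over DMA rather than
 * through per-field registers.  The general shape (opcode, group,
 * member id, command-specific data) follows the admin command idea;
 * exact field names and sizes here are illustrative only. */
struct admin_cmd_vq_state_req {
        uint16_t opcode;            /* e.g. ADMIN_CMD_VQ_STATE_GET            */
        uint16_t group_type;        /* group the target member belongs to     */
        uint64_t group_member_id;   /* e.g. the member device being migrated  */
        uint16_t vq_index;          /* which virtqueue's state is requested   */
};

/* Sketch of the reply: unlike a single 16-bit register, a DMA-able
 * buffer has room for richer state, e.g. in-flight descriptor records
 * for devices that complete out of order. */
struct admin_cmd_vq_state_resp {
        uint16_t status;
        uint16_t avail_state;       /* next available index          */
        uint16_t used_state;        /* next used index               */
        uint16_t num_inflight;      /* number of records that follow */
        /* followed by num_inflight in-flight descriptor records */
};
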


