Subject: Re: [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands


On Wed, Nov 22, 2023 at 12:30 AM Parav Pandit <parav@nvidia.com> wrote:
>
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Tuesday, November 21, 2023 12:25 PM
> >
> > On Fri, Nov 17, 2023 at 10:48 PM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > >
> > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > Sent: Friday, November 17, 2023 7:31 PM
> > > > To: Parav Pandit <parav@nvidia.com>
> > > >
> > > > On Fri, Nov 17, 2023 at 01:03:03PM +0000, Parav Pandit wrote:
> > > > >
> > > > >
> > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > Sent: Friday, November 17, 2023 6:02 PM
> > > > > >
> > > > > > On Fri, Nov 17, 2023 at 12:11:15PM +0000, Parav Pandit wrote:
> > > > > > >
> > > > > > >
> > > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > > Sent: Friday, November 17, 2023 5:35 PM
> > > > > > > > To: Parav Pandit <parav@nvidia.com>
> > > > > > > >
> > > > > > > > On Fri, Nov 17, 2023 at 11:45:20AM +0000, Parav Pandit wrote:
> > > > > > > > >
> > > > > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > > > > Sent: Friday, November 17, 2023 5:04 PM
> > > > > > > > > >
> > > > > > > > > > On Fri, Nov 17, 2023 at 11:05:16AM +0000, Parav Pandit wrote:
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > > > > > > Sent: Friday, November 17, 2023 4:30 PM
> > > > > > > > > > > >
> > > > > > > > > > > > On Fri, Nov 17, 2023 at 10:03:47AM +0000, Parav Pandit wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > > > > > > > > > > > > Sent: Friday, November 17, 2023 3:30 PM
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On 11/16/2023 7:59 PM, Michael S. Tsirkin wrote:
> > > > > > > > > > > > > > > On Thu, Nov 16, 2023 at 06:28:07PM +0800, Zhu, Lingshan wrote:
> > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > >> On 11/16/2023 1:51 PM, Michael S. Tsirkin wrote:
> > > > > > > > > > > > > > >>> On Thu, Nov 16, 2023 at 05:29:54AM +0000, Parav Pandit wrote:
> > > > > > > > > > > > > > >>>> We should expose a limit of the device in
> > > > > > > > > > > > > > >>>> the proposed WRITE_RECORD_CAP_QUERY command:
> > > > > > > > > > > > > > >>>> how much range it can track, so that a
> > > > > > > > > > > > > > >>>> future provisioning framework can use it.
> > > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > > >>>> I will cover this in v5 early next week.
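
A rough sketch of what such a capability reply could carry; the struct and
field names below are hypothetical, not taken from these patches:

    /* Hypothetical sketch only; names are illustrative, not from the
     * proposal. The idea is that the device reports how much it can track. */
    #include <stdint.h>

    struct write_record_cap_result {
        uint64_t max_track_range_len;   /* largest guest-physical range the
                                           device can record writes for */
        uint64_t supported_page_sizes;  /* bitmask of trackable granularities */
    };
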
> > > > > > > > > > > > > > >>> I do worry about how this can even work
> > > > > > > > > > > > > > >>> though. If you want a generic device you do
> > > > > > > > > > > > > > >>> not get to dictate how much memory the VM has.
> > > > > > > > > > > > > > >>>
> > > > > > > > > > > > > > >>> Aren't we talking bit per page? With 1TByte
> > > > > > > > > > > > > > >>> of memory to track
> > > > > > > > > > > > > > >>> -> 256Gbit -> 32Gbit -> 8Gbyte per VF?
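
For scale, a quick back-of-the-envelope on the bit-per-page math; the
4 KiB page size and 256-VF count below are illustrative assumptions:

    /* Dirty-bitmap sizing: one bit per 4 KiB page. The 1 TiB guest size
     * comes from the discussion; the 256-VF count is an assumption. */
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t mem_bytes    = 1ULL << 40;        /* 1 TiB guest memory */
        uint64_t pages        = mem_bytes / 4096;  /* 2^28 pages */
        uint64_t bitmap_bytes = pages / 8;         /* one bit per page */

        printf("bitmap per VF: %llu MiB\n",
               (unsigned long long)(bitmap_bytes >> 20));         /* 32 MiB */
        printf("256 VFs total: %llu GiB\n",
               (unsigned long long)((bitmap_bytes * 256) >> 30)); /* 8 GiB */
        return 0;
    }
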
> > > > > > > > > > > > > > >>>
> > > > > > > > > > > > > > >>> And you happily say "we'll address this in
> > > > > > > > > > > > > > >>> the future" while at the same time fighting
> > > > > > > > > > > > > > >>> tooth and nail against adding single-bit
> > > > > > > > > > > > > > >>> status registers because of scalability?
> > > > > > > > > > > > > > >>>
> > > > > > > > > > > > > > >>>
> > > > > > > > > > > > > > >>> I have a feeling that doing this completely
> > > > > > > > > > > > > > >>> theoretically is problematic.
> > > > > > > > > > > > > > >>> Maybe you have it all laid out neatly in
> > > > > > > > > > > > > > >>> your head, but I suspect not all of the TC
> > > > > > > > > > > > > > >>> can picture it clearly enough based just on
> > > > > > > > > > > > > > >>> spec text.
> > > > > > > > > > > > > > >>>
> > > > > > > > > > > > > > >>> We do sometimes ask for a POC implementation
> > > > > > > > > > > > > > >>> in Linux/QEMU to demonstrate how things work
> > > > > > > > > > > > > > >>> before merging code.
> > > > > > > > > > > > > > >>> We skipped this for admin things so far but
> > > > > > > > > > > > > > >>> I think it's a good idea to start doing it here.
> > > > > > > > > > > > > > >>>
> > > > > > > > > > > > > > >>> What makes me pause a bit before saying
> > > > > > > > > > > > > > >>> "please do a PoC" is all the opposition that
> > > > > > > > > > > > > > >>> seems to exist to even using admin commands
> > > > > > > > > > > > > > >>> in the 1st place. I think once we finally
> > > > > > > > > > > > > > >>> stop arguing about whether to use admin
> > > > > > > > > > > > > > >>> commands at all, then a PoC will be needed
> > > > > > > > > > > > > > >>> before merging.
> > > > > > > > > > > > > > >> We have POR products that implement the
> > > > > > > > > > > > > > >> approach in my series. There are multiple
> > > > > > > > > > > > > > >> generations of products in the market, running
> > > > > > > > > > > > > > >> in customers' data centers for years.
> > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > >> Back in 2019, when we started working on vDPA,
> > > > > > > > > > > > > > >> we sent some samples of products (e.g., Cascade
> > > > > > > > > > > > > > >> Glacier) and the datasheet; you can find live
> > > > > > > > > > > > > > >> migration facilities there, including suspend,
> > > > > > > > > > > > > > >> vq state and other features.
> > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > >> And there is a reference in DPDK live
> > > > > > > > > > > > > > >> migration; I have provided this page before:
> > > > > > > > > > > > > > >> https://doc.dpdk.org/guides-21.11/vdpadevs/ifc.html
> > > > > > > > > > > > > > >> It has been working for a long, long time.
> > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > >> So if we let the facts speak, if we want to
> > > > > > > > > > > > > > >> see whether the proposal is proven to work, I
> > > > > > > > > > > > > > >> would say: they have been POR for years, and
> > > > > > > > > > > > > > >> customers have already deployed them for years.
> > > > > > > > > > > > > > > And I guess what you are trying to say is that
> > > > > > > > > > > > > > > this patchset we are reviewing here should be
> > > > > > > > > > > > > > > held to the same standard and there should be
> > > > > > > > > > > > > > > a PoC? Sounds reasonable.
> > > > > > > > > > > > > > Yes, and the in-market products are POR; the
> > > > > > > > > > > > > > series just improves the design. For example, our
> > > > > > > > > > > > > > series also uses registers to track vq state, but
> > > > > > > > > > > > > > with improvements over CG or BSC. So I think they
> > > > > > > > > > > > > > are proven to work.
> > > > > > > > > > > > >
> > > > > > > > > > > > > If you prefer to go the route of POR, production,
> > > > > > > > > > > > > and proven documents etc., there is a ton of it,
> > > > > > > > > > > > > across multiple types of products, that I can dump
> > > > > > > > > > > > > here with open-source code, documentation and more.
> > > > > > > > > > > > > Let me know what you would like to see.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Michael has requested some performance
> > > > > > > > > > > > > comparisons; not all are ready to share yet.
> > > > > > > > > > > > > Some are ready, and I will share them in the
> > > > > > > > > > > > > coming weeks.
> > > > > > > > > > > > >
> > > > > > > > > > > > > And the vdpa DPDK code you published did not have
> > > > > > > > > > > > > basic CVQ support when I last looked at it.
> > > > > > > > > > > > > Do you know when it was added?
> > > > > > > > > > > >
> > > > > > > > > > > > It's good enough for PoC I think, CVQ or not.
> > > > > > > > > > > > The problem with CVQ generally, is that VDPA wants
> > > > > > > > > > > > to shadow CVQ it at all times because it wants to
> > > > > > > > > > > > decode and cache the content. But this problem has
> > > > > > > > > > > > nothing to do with dirty tracking even though it
> > > > > > > > > > > > also
> > > > > > > > > > mentions "shadow":
> > > > > > > > > > > > if device can report it's state then there's no need
> > > > > > > > > > > > to shadow
> > > > CVQ.
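
To make the shadow-CVQ point concrete, a minimal sketch of the interception
idea; the names here are illustrative, not QEMU or kernel vDPA APIs:

    #include <stdint.h>

    /* Illustrative only: not a real QEMU/vDPA interface. */
    struct ctrl_cmd {
        uint8_t cls;    /* command class, e.g. MAC or MQ */
        uint8_t cmd;    /* command within that class */
    };

    static struct ctrl_cmd cached_state;  /* stand-in for decoded state */

    /* Shadowing means every guest control command is seen here, so its
     * effect can be cached and replayed on a destination device; if the
     * device could report its own state, this step would be unnecessary. */
    static int shadow_cvq_handle(const struct ctrl_cmd *c)
    {
        cached_state = *c;  /* decode and remember what the guest set */
        /* ...then forward the command to the real device's CVQ... */
        return 0;
    }
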
> > > > > > > > > > >
> > > > > > > > > > > For the performance numbers with the pre-copy and
> > > > > > > > > > > device context of
> > > > > > > > > > patches posted 1 to 5, the downtime reduction of the VM
> > > > > > > > > > is 3.71x with active traffic on 8 RQs at 100Gbps port speed.
> > > > > > > > > >
> > > > > > > > > > Sounds good. Can you please post a bit more detail?
> > > > > > > > > > Which configs are you comparing, and what was the
> > > > > > > > > > result for each of them?
> > > > > > > > >
> > > > > > > > > Common config: 8+8 tx and rx queues.
> > > > > > > > > Port speed: 100Gbps
> > > > > > > > > QEMU 8.1
> > > > > > > > > Libvirt 7.0
> > > > > > > > > GVM: CentOS 7.4
> > > > > > > > > Device: virtio VF hardware device
> > > > > > > > >
> > > > > > > > > Config_1: virtio suspend/resume similar to what Lingshan
> > > > > > > > > has, largely vdpa stack
> > > > > > > > > Config_2: Device context method of admin commands
> > > > > > > >
> > > > > > > > OK, that sounds good. The weird thing here is that you
> > > > > > > > measure "downtime". What exactly do you mean here?
> > > > > > > > I am guessing it's the time to retrieve state on the source
> > > > > > > > and re-program device state on the destination? And this is
> > > > > > > > 3.71x out of how long?
> > > > > > > Yes. Downtime is the time during which the VM is not
> > > > > > > responding or receiving packets, which involves
> > > > > > > reprogramming the device.
> > > > > > > 3.71x is a relative figure for this discussion.
> > > > > >
> > > > > > Oh interesting. So VM state movement, including reprogramming
> > > > > > the CPU, is dominated by reprogramming this single NIC, by a
> > > > > > factor of almost 4?
> > > > > Yes.
> > > >
> > > > Could you post some numbers too then? I want to know whether that
> > > > would imply that VM boot is slowed down significantly too. If yes,
> > > > that's another motivation for PCI transport 2.0.
> > > It went from 1.8 sec down to 480 msec.
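
Those two figures are consistent with the ratio quoted earlier, as a quick
check shows:

    #include <stdio.h>

    int main(void)
    {
        /* 1.8 s down to 480 ms is a ~3.75x reduction, in line with the
         * ~3.71x downtime figure quoted above. */
        printf("%.2fx\n", 1.8 / 0.48);
        return 0;
    }
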
> >
> > Well, there's work ongoing to reduce the downtime of the shadow virtqueue.
> >
> > Eugenio or Si-wei may share an exact number, but it should be several
> > hundred ms.
> >
> Shadow vq is not applicable at all as a comparison point because there is no virtio-specific QEMU or similar software involved here.

I don't get the point.

Shadow virtqueue is virtio-specific for sure, and the core logic is
decoupled from the vDPA logic. If not, it's a bug and we need to fix it.

Thanks


>
> Anyway, the requested numbers are supplied for the device-context-based migration over the admin vq proposed here.
>
>
> > But it seems the shadow virtqueue itself is not the major factor; rather,
> > it is the time spent on programming vendor-specific mappings, for example.
> >
> > Thanks
> >
> > > The time didn't come from the PCI side or the boot side.
> > >
> > > For the PCI side of things, you would want to compare PCI vs.
> > > non-PCI device based VM boot time.
> > >
>


