

Subject: RE: [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands



> From: Jason Wang <jasowang@redhat.com>
> Sent: Wednesday, November 22, 2023 9:50 AM
> 
> On Wed, Nov 22, 2023 at 12:30 AM Parav Pandit <parav@nvidia.com> wrote:
> >
> >
> > > From: Jason Wang <jasowang@redhat.com>
> > > Sent: Tuesday, November 21, 2023 12:25 PM
> > >
> > > On Fri, Nov 17, 2023 at 10:48 PM Parav Pandit <parav@nvidia.com>
> wrote:
> > > >
> > > >
> > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > Sent: Friday, November 17, 2023 7:31 PM
> > > > > To: Parav Pandit <parav@nvidia.com>
> > > > >
> > > > > On Fri, Nov 17, 2023 at 01:03:03PM +0000, Parav Pandit wrote:
> > > > > >
> > > > > >
> > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > Sent: Friday, November 17, 2023 6:02 PM
> > > > > > >
> > > > > > > On Fri, Nov 17, 2023 at 12:11:15PM +0000, Parav Pandit wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > > > Sent: Friday, November 17, 2023 5:35 PM
> > > > > > > > > To: Parav Pandit <parav@nvidia.com>
> > > > > > > > >
> > > > > > > > > On Fri, Nov 17, 2023 at 11:45:20AM +0000, Parav Pandit wrote:
> > > > > > > > > >
> > > > > > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > > > > > Sent: Friday, November 17, 2023 5:04 PM
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Nov 17, 2023 at 11:05:16AM +0000, Parav Pandit
> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > > > > > > > Sent: Friday, November 17, 2023 4:30 PM
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Fri, Nov 17, 2023 at 10:03:47AM +0000, Parav
> > > > > > > > > > > > > Pandit
> > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > > > > > > > > > > > > > Sent: Friday, November 17, 2023 3:30 PM
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On 11/16/2023 7:59 PM, Michael S. Tsirkin wrote:
> > > > > > > > > > > > > > > > On Thu, Nov 16, 2023 at 06:28:07PM +0800,
> > > > > > > > > > > > > > > > Zhu, Lingshan
> > > > > > > wrote:
> > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > >> On 11/16/2023 1:51 PM, Michael S. Tsirkin wrote:
> > > > > > > > > > > > > > > >>> On Thu, Nov 16, 2023 at 05:29:54AM
> > > > > > > > > > > > > > > >>> +0000, Parav Pandit
> > > > > > > wrote:
> > > > > > > > > > > > > > > >>>> We should expose a limit of the device
> > > > > > > > > > > > > > > >>>> in the proposed WRITE_RECORD_CAP_QUERY
> > > > > > > > > > > > > > > >>>> command, indicating how much range it can track.
> > > > > > > > > > > > > > > >>>> So that a future provisioning framework can use it.
> > > > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > > > >>>> I will cover this in v5 early next week.
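Purely as an illustration (not taken from the posted series; the field names
below are hypothetical), a capability result along these lines could carry
such a limit:

    #include <stdint.h>

    /* Hypothetical sketch only -- the actual WRITE_RECORD_CAP_QUERY result
     * layout is whatever the posted series defines. Fields would be
     * little-endian on the wire, per virtio convention. */
    struct write_record_cap_result_example {
            /* Hypothetical: maximum guest physical address range, in bytes,
             * for which the device can record writes. */
            uint64_t max_tracked_range_bytes;
            /* Hypothetical: supported recording granularities;
             * bit N set means 2^N-byte pages are supported. */
            uint64_t supported_page_size_bitmap;
    };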
> > > > > > > > > > > > > > > >>> I do worry about how this can even work though.
> > > > > > > > > > > > > > > >>> If you want a generic device you do not
> > > > > > > > > > > > > > > >>> get to dictate how much memory the VM has.
> > > > > > > > > > > > > > > >>>
> > > > > > > > > > > > > > > >>> Aren't we talking a bit per page? With
> > > > > > > > > > > > > > > >>> 1TByte of memory to track
> > > > > > > > > > > > > > > >>> -> 256Gbit -> 32Gbit -> 8Gbyte per VF?
> > > > > > > > > > > > > > > >>>
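As a rough back-of-the-envelope sketch (assuming 4 KiB pages and one dirty
bit per page; the exact figures quoted above depend on the tracking
granularity chosen):

    #include <stdio.h>
    #include <stdint.h>

    /* Back-of-the-envelope bitmap cost for one dirty bit per 4 KiB page. */
    int main(void)
    {
            const uint64_t tracked_bytes = 1ULL << 40; /* 1 TiB of guest memory */
            const uint64_t page_size = 4096;           /* assumed granularity */

            uint64_t pages = tracked_bytes / page_size; /* 256M pages */
            uint64_t bitmap_bytes = pages / 8;          /* one bit per page */

            printf("%llu MiB of bitmap per TiB tracked, per VF\n",
                   (unsigned long long)(bitmap_bytes >> 20)); /* prints 32 */
            return 0;
    }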
> > > > > > > > > > > > > > > >>> And you happily say "we'll address this in the future"
> > > > > > > > > > > > > > > >>> while at the same time fighting tooth
> > > > > > > > > > > > > > > >>> and nail against adding single-bit
> > > > > > > > > > > > > > > >>> status registers because of scalability?
> > > > > > > > > > > > > > > >>>
> > > > > > > > > > > > > > > >>>
> > > > > > > > > > > > > > > >>> I have a feeling that doing this completely
> > > > > > > > > > > > > > > >>> theoretically like this is problematic.
> > > > > > > > > > > > > > > >>> Maybe you have it all laid out neatly in
> > > > > > > > > > > > > > > >>> your head but I suspect not all of the TC
> > > > > > > > > > > > > > > >>> can picture it clearly enough based just
> > > > > > > > > > > > > > > >>> on spec text.
> > > > > > > > > > > > > > > >>>
> > > > > > > > > > > > > > > >>> We do sometimes ask for POC
> > > > > > > > > > > > > > > >>> implementation in linux / qemu to
> > > > > > > > > > > > > > > >>> demonstrate how things work before
> > > > > > > > > > > > > > > >>> merging code.
> > > > > > > > > > > > > > > >>> We skipped this for admin things so far
> > > > > > > > > > > > > > > >>> but I think it's a good idea to start doing it here.
> > > > > > > > > > > > > > > >>>
> > > > > > > > > > > > > > > >>> What makes me pause a bit before saying
> > > > > > > > > > > > > > > >>> please do a PoC is all the opposition
> > > > > > > > > > > > > > > >>> that seems to exist to even using admin
> > > > > > > > > > > > > > > >>> commands in the 1st place. I think once
> > > > > > > > > > > > > > > >>> we finally stop arguing about whether to
> > > > > > > > > > > > > > > >>> use admin commands at all then a PoC
> > > > > > > > > > > > > > > >>> will be needed before merging.
> > > > > > > > > > > > > > > >> We have POR products that implement
> > > > > > > > > > > > > > > >> the approach in my series.
> > > > > > > > > > > > > > > >> They are multiple generations of
> > > > > > > > > > > > > > > >> products in the market, running in
> > > > > > > > > > > > > > > >> customers' data centers for years.
> > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > >> Back in 2019, when we started working on
> > > > > > > > > > > > > > > >> vDPA, we sent some samples of a
> > > > > > > > > > > > > > > >> product (e.g., Cascade Glacier) and the
> > > > > > > > > > > > > > > >> datasheet; you can find live migration
> > > > > > > > > > > > > > > >> facilities there, including suspend, vq
> > > > > > > > > > > > > > > >> state and other features.
> > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > >> And there is a reference in DPDK live
> > > > > > > > > > > > > > > >> migration; I have provided this page
> > > > > > > > > > > > > > > >> before:
> > > > > > > > > > > > > > > >> https://doc.dpdk.org/guides-21.11/vdpadevs/ifc.html
> > > > > > > > > > > > > > > >> It has been working for a long, long time.
> > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > >> So if we let the facts speak: if we want
> > > > > > > > > > > > > > > >> to see whether the proposal is proven to
> > > > > > > > > > > > > > > >> work, I would say they have been POR for
> > > > > > > > > > > > > > > >> years, and customers have already deployed
> > > > > > > > > > > > > > > >> them for years.
> > > > > > > > > > > > > > > > And I guess what you are trying to say is
> > > > > > > > > > > > > > > > that this patchset we are reviewing here
> > > > > > > > > > > > > > > > should be held to the same standard and
> > > > > > > > > > > > > > > > there should be a PoC? Sounds reasonable.
> > > > > > > > > > > > > > > Yes, and the in-market products are POR;
> > > > > > > > > > > > > > > the series just improves the design. For
> > > > > > > > > > > > > > > example, our series also uses registers to
> > > > > > > > > > > > > > > track vq state, but with improvements over
> > > > > > > > > > > > > > > CG or BSC. So I think they are proven to work.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > If you prefer to go the route of POR and
> > > > > > > > > > > > > > production and proven documents etc., there is
> > > > > > > > > > > > > > a ton of it, across multiple types of products,
> > > > > > > > > > > > > > that I can dump here with open-source code,
> > > > > > > > > > > > > > documentation and more.
> > > > > > > > > > > > > > Let me know what you would like to see.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Michael has requested some performance
> > > > > > > > > > > > > > comparisons; not all are ready to share yet.
> > > > > > > > > > > > > > Some are ready, and I will share them in the
> > > > > > > > > > > > > > coming weeks.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > And the vdpa dpdk you published did not have
> > > > > > > > > > > > > > basic CVQ support when I last looked at it.
> > > > > > > > > > > > > > Do you know when it was added?
> > > > > > > > > > > > >
> > > > > > > > > > > > > It's good enough for a PoC I think, CVQ or not.
> > > > > > > > > > > > > The problem with CVQ generally is that vDPA
> > > > > > > > > > > > > wants to shadow CVQ at all times because it
> > > > > > > > > > > > > wants to decode and cache the content. But this
> > > > > > > > > > > > > problem has nothing to do with dirty tracking,
> > > > > > > > > > > > > even though it also mentions "shadow":
> > > > > > > > > > > > > if the device can report its state then there's
> > > > > > > > > > > > > no need to shadow CVQ.
> > > > > > > > > > > >
> > > > > > > > > > > > For the performance numbers with the pre-copy and
> > > > > > > > > > > > device context patches posted as 1 to 5, the
> > > > > > > > > > > > downtime reduction of the VM is 3.71x with active
> > > > > > > > > > > > traffic on 8 RQs at 100Gbps port speed.
> > > > > > > > > > >
> > > > > > > > > > > Sounds good. Can you please post a bit more detail?
> > > > > > > > > > > Which configs are you comparing, and what was the
> > > > > > > > > > > result on each of them?
> > > > > > > > > >
> > > > > > > > > > Common config: 8+8 tx and rx queues.
> > > > > > > > > > Port speed: 100Gbps
> > > > > > > > > > QEMU 8.1
> > > > > > > > > > Libvirt 7.0
> > > > > > > > > > GVM: CentOS 7.4
> > > > > > > > > > Device: virtio VF hardware device
> > > > > > > > > >
> > > > > > > > > > Config_1: virtio suspend/resume similar to what
> > > > > > > > > > Lingshan has, largely vdpa stack
> > > > > > > > > > Config_2: Device context method of admin commands
> > > > > > > > >
> > > > > > > > > OK that sounds good. The weird thing here is that you
> > > > > > > > > measure "downtime". What exactly do you mean here?
> > > > > > > > > I am guessing it's the time to retrieve state on the
> > > > > > > > > source and re-program device state on the destination?
> > > > > > > > > And this is 3.71x out of how long?
> > > > > > > > Yes. Downtime is the time during which the VM is not
> > > > > > > > responding or receiving packets, which involves
> > > > > > > > reprogramming the device.
> > > > > > > > 3.71x is a relative figure for this discussion.
> > > > > > >
> > > > > > > Oh interesting. So VM state movement including reprogramming
> > > > > > > the CPU is dominated by reprogramming this single NIC, by a
> > > > > > > factor of
> > > almost 4?
> > > > > > Yes.
> > > > >
> > > > > Could you post some numbers too then? I want to know whether
> > > > > that would imply that VM boot is slowed down significantly too.
> > > > > If yes, that's another motivation for pci transport 2.0.
> > > > It went from 1.8 sec down to 480 msec.
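(For reference: 1.8 sec down to 480 msec is roughly a 3.75x reduction, in
line with the ~3.71x downtime figure quoted earlier in the thread.)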
> > >
> > > Well, there's work ongoing to reduce the downtime of the shadow
> > > virtqueue.
> > >
> > > Eugenio or Si-wei may share an exact number, but it should be
> > > several hundred ms.
> > >
> > Shadow vq is not applicable at all as a comparison point because there is
> > no virtio-specific QEMU or other software involved here.
> 
> I don't get the point.
> 
> Shadow virtqueue is virtio specific for sure, and the core logic is decoupled
> from the vDPA logic. If not, it's a bug and we need to fix it.
>
The base requirement is that the software does not mediate any virtio interfaces (config, cvq, data vqs).
Hence, for a direct-mapped device, shadow vq is not applicable at all, so there is no comparison point.
 
> Thanks
> 
> 
> >
> > Anyway, the requested numbers have been supplied for the device context
> > based migration over the admin vq proposed here.
> >
> >
> > > But it seems the shadow virtqueue itself is not the major factor, but rather
> > > the time spent on programming vendor-specific mappings, for example.
> > >
> > > Thanks
> > >
> > > > The time didn't come from the pci side or the boot side.
> > > >
> > > > For the pci side of things, you would want to compare the pci vs non-pci
> > > > device based VM boot time.
> > > >
> >


