OASIS Mailing List Archives

virtio-comment message


Subject: Re: [virtio-comment] Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands


On Wed, Nov 22, 2023 at 12:28 PM Parav Pandit <parav@nvidia.com> wrote:
>
>
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Wednesday, November 22, 2023 9:50 AM
> >
> > On Wed, Nov 22, 2023 at 12:30 AM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > >
> > > > From: Jason Wang <jasowang@redhat.com>
> > > > Sent: Tuesday, November 21, 2023 12:25 PM
> > > >
> > > > On Fri, Nov 17, 2023 at 10:48 PM Parav Pandit <parav@nvidia.com> wrote:
> > > > >
> > > > >
> > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > Sent: Friday, November 17, 2023 7:31 PM
> > > > > > To: Parav Pandit <parav@nvidia.com>
> > > > > >
> > > > > > On Fri, Nov 17, 2023 at 01:03:03PM +0000, Parav Pandit wrote:
> > > > > > >
> > > > > > >
> > > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > > Sent: Friday, November 17, 2023 6:02 PM
> > > > > > > >
> > > > > > > > On Fri, Nov 17, 2023 at 12:11:15PM +0000, Parav Pandit wrote:
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > > > > Sent: Friday, November 17, 2023 5:35 PM
> > > > > > > > > > To: Parav Pandit <parav@nvidia.com>
> > > > > > > > > >
> > > > > > > > > > On Fri, Nov 17, 2023 at 11:45:20AM +0000, Parav Pandit wrote:
> > > > > > > > > > >
> > > > > > > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > > > > > > Sent: Friday, November 17, 2023 5:04 PM
> > > > > > > > > > > >
> > > > > > > > > > > > > On Fri, Nov 17, 2023 at 11:05:16AM +0000, Parav Pandit wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > From: Michael S. Tsirkin <mst@redhat.com>
> > > > > > > > > > > > > > Sent: Friday, November 17, 2023 4:30 PM
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Fri, Nov 17, 2023 at 10:03:47AM +0000, Parav Pandit wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > From: Zhu, Lingshan <lingshan.zhu@intel.com>
> > > > > > > > > > > > > > > > Sent: Friday, November 17, 2023 3:30 PM
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On 11/16/2023 7:59 PM, Michael S. Tsirkin wrote:
> > > > > > > > > > > > > > > > > > On Thu, Nov 16, 2023 at 06:28:07PM +0800, Zhu, Lingshan wrote:
> > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > >> On 11/16/2023 1:51 PM, Michael S. Tsirkin wrote:
> > > > > > > > > > > > > > > > > >>> On Thu, Nov 16, 2023 at 05:29:54AM +0000, Parav Pandit wrote:
> > > > > > > > > > > > > > > > > >>>> We should expose a limit of the device in the proposed
> > > > > > > > > > > > > > > > > >>>> WRITE_RECORD_CAP_QUERY command, that how much range it can track.
> > > > > > > > > > > > > > > > > >>>> So that future provisioning framework can use it.
> > > > > > > > > > > > > > > > >>>>
> > > > > > > > > > > > > > > > >>>> I will cover this in v5 early next week.
> > > > > > > > > > > > > > > > > >>> I do worry about how this can even work though.
> > > > > > > > > > > > > > > > > >>> If you want a generic device you do not get to
> > > > > > > > > > > > > > > > > >>> dictate how much memory VM has.
> > > > > > > > > > > > > > > > >>>
> > > > > > > > > > > > > > > > >>> Aren't we talking bit per page? With
> > > > > > > > > > > > > > > > >>> 1TByte of memory to track
> > > > > > > > > > > > > > > > >>> -> 256Gbit -> 32Gbit -> 8Gbyte per VF?
> > > > > > > > > > > > > > > > >>>
> > > > > > > > > > > > > > > > > >>> And you happily say "we'll address this in the future"
> > > > > > > > > > > > > > > > > >>> while at the same time fighting tooth and nail against
> > > > > > > > > > > > > > > > > >>> adding single bit status registers because scalability?
> > > > > > > > > > > > > > > > >>>
> > > > > > > > > > > > > > > > >>>
> > > > > > > > > > > > > > > > > >>> I have a feeling doing this completely theoretical
> > > > > > > > > > > > > > > > > >>> like this is problematic.
> > > > > > > > > > > > > > > > > >>> Maybe you have it all laid out neatly in your head
> > > > > > > > > > > > > > > > > >>> but I suspect not all of TC can picture it clearly
> > > > > > > > > > > > > > > > > >>> enough based just on spec text.
> > > > > > > > > > > > > > > > >>>
> > > > > > > > > > > > > > > > > >>> We do sometimes ask for POC implementation in
> > > > > > > > > > > > > > > > > >>> linux / qemu to demonstrate how things work before
> > > > > > > > > > > > > > > > > >>> merging code.
> > > > > > > > > > > > > > > > > >>> We skipped this for admin things so far but I think
> > > > > > > > > > > > > > > > > >>> it's a good idea to start doing it here.
> > > > > > > > > > > > > > > > >>>
> > > > > > > > > > > > > > > > > >>> What makes me pause a bit before saying please do a
> > > > > > > > > > > > > > > > > >>> PoC is all the opposition that seems to exist to even
> > > > > > > > > > > > > > > > > >>> using admin commands in the 1st place. I think once
> > > > > > > > > > > > > > > > > >>> we finally stop arguing about whether to use admin
> > > > > > > > > > > > > > > > > >>> commands at all then a PoC will be needed before
> > > > > > > > > > > > > > > > > >>> merging.
> > > > > > > > > > > > > > > > > >> We have POR productions that implemented the approach
> > > > > > > > > > > > > > > > > >> in my series.
> > > > > > > > > > > > > > > > > >> They are multiple generations of productions in market
> > > > > > > > > > > > > > > > > >> and running in customers' data centers for years.
> > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > >> Back in 2019 when we started working on vDPA, we sent
> > > > > > > > > > > > > > > > > >> some samples of production (e.g., Cascade Glacier) and
> > > > > > > > > > > > > > > > > >> the datasheet; you can find live migration facilities
> > > > > > > > > > > > > > > > > >> there, including suspend, vq state and other features.
> > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > >> And there is a reference in DPDK live migration, I
> > > > > > > > > > > > > > > > > >> have provided this page before:
> > > > > > > > > > > > > > > > > >> https://doc.dpdk.org/guides-21.11/vdpadevs/ifc.html,
> > > > > > > > > > > > > > > > > >> it has been working for a long, long time.
> > > > > > > > > > > > > > > > >>
> > > > > > > > > > > > > > > > > >> So if we let the facts speak, if we want to see if the
> > > > > > > > > > > > > > > > > >> proposal is proven to work, I would say: They are POR
> > > > > > > > > > > > > > > > > >> for years, customers already deployed them for years.
> > > > > > > > > > > > > > > > > > And I guess what you are trying to say is that this
> > > > > > > > > > > > > > > > > > patchset we are reviewing here should be held to the
> > > > > > > > > > > > > > > > > > same standard and there should be a PoC? Sounds
> > > > > > > > > > > > > > > > > > reasonable.
> > > > > > > > > > > > > > > > > Yes and the in-marketing productions are POR, the
> > > > > > > > > > > > > > > > > series just improves the design, for example, our
> > > > > > > > > > > > > > > > > series also uses registers to track vq state, but with
> > > > > > > > > > > > > > > > > improvements over CG or BSC. So I think they are
> > > > > > > > > > > > > > > > > proven to work.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > If you prefer to go the route of POR and production and
> > > > > > > > > > > > > > > > proven documents etc, there is a ton of it for multiple
> > > > > > > > > > > > > > > > types of products that I can dump here, with open-source
> > > > > > > > > > > > > > > > code, documentation and more.
> > > > > > > > > > > > > > > Let me know what you would like to see.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Michael has requested some performance comparisons, not
> > > > > > > > > > > > > > > > all are ready to share yet.
> > > > > > > > > > > > > > > > Some are present that I will share in coming weeks.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > And all the vdpa dpdk you published did not have basic
> > > > > > > > > > > > > > > > CVQ support when I last looked at it.
> > > > > > > > > > > > > > > > Do you know when it was added?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > It's good enough for PoC I think, CVQ or not.
> > > > > > > > > > > > > > > The problem with CVQ generally, is that VDPA wants to
> > > > > > > > > > > > > > > shadow CVQ at all times because it wants to decode and
> > > > > > > > > > > > > > > cache the content. But this problem has nothing to do
> > > > > > > > > > > > > > > with dirty tracking even though it also mentions
> > > > > > > > > > > > > > > "shadow": if device can report its state then there's
> > > > > > > > > > > > > > > no need to shadow CVQ.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > For the performance numbers with the pre-copy and device
> > > > > > > > > > > > > > context of patches posted 1 to 5, the downtime reduction of
> > > > > > > > > > > > > > the VM is 3.71x with active traffic on 8 RQs at 100Gbps
> > > > > > > > > > > > > > port speed.
> > > > > > > > > > > >
> > > > > > > > > > > > > Sounds good, can you please post a bit more detail?
> > > > > > > > > > > > > Which configs are you comparing, and what was the result
> > > > > > > > > > > > > on each of them?
> > > > > > > > > > >
> > > > > > > > > > > Common config: 8+8 tx and rx queues.
> > > > > > > > > > > Port speed: 100Gbps
> > > > > > > > > > > QEMU 8.1
> > > > > > > > > > > Libvirt 7.0
> > > > > > > > > > > GVM: Centos 7.4
> > > > > > > > > > > Device: virtio VF hardware device
> > > > > > > > > > >
> > > > > > > > > > > Config_1: virtio suspend/resume similar to what
> > > > > > > > > > > Lingshan has, largely vdpa stack
> > > > > > > > > > > Config_2: Device context method of admin commands
> > > > > > > > > >
> > > > > > > > > > > OK that sounds good. The weird thing here is that you measure
> > > > > > > > > > > "downtime".
> > > > > > > > > > > What exactly do you mean here?
> > > > > > > > > > > I am guessing it's the time to retrieve on source and
> > > > > > > > > > > re-program device state on destination? And this is 3.71x out
> > > > > > > > > > > of how long?
> > > > > > > > > > Yes. Downtime is the time during which the VM is not responding
> > > > > > > > > > or receiving packets, which involves reprogramming the device.
> > > > > > > > > 3.71x is relative time for this discussion.
> > > > > > > >
> > > > > > > > > Oh interesting. So VM state movement including reprogramming
> > > > > > > > > the CPU is dominated by reprogramming this single NIC, by a
> > > > > > > > > factor of almost 4?
> > > > > > > Yes.
> > > > > >
> > > > > > Could you post some numbers too then?  I want to know whether
> > > > > > that would imply that VM boot is slowed down significantly too.
> > > > > > If yes that's another motivation for pci transport 2.0.
> > > > > It was 1.8 sec down to 480msec.
> > > >
> > > > Well, there's work ongoing to reduce the downtime of the shadow
> > > > virtqueue.
> > > >
> > > > Eugenio or Si-wei may share an exact number, but it should be
> > > > several hundreds of ms.
> > > >
> > > Shadow vq is not applicable at all as a comparison point because there is
> > > no virtio specific qemu etc software involved here.
> >
> > I don't get the point.
> >
> > Shadow virtqueue is virtio specific for sure and the core logic is decoupled
> > from the vDPA logic. If not, it's a bug and we need to fix it.
> >
> The base requirement is that the software is not mediating any virtio interfaces (config, cvq, data vqs).

I think we agree that any proposal should work in both passthrough and
non-passthrough. No?

Otherwise we circle back.

Thanks
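
For readers following the numbers quoted in this thread, here is a minimal sketch of the two calculations being discussed: the size of a one-bit-per-page write-record bitmap, and the downtime ratio implied by "1.8 sec down to 480msec". The 4 KiB page size, the flat bitmap layout, and the function names are assumptions for illustration only; nothing here models the actual format of the proposed WRITE_RECORD_CAP_QUERY command.

```python
# Illustrative arithmetic only. Page size and bitmap layout are assumptions,
# not taken from the patch series under discussion.

def dirty_bitmap_bytes(mem_bytes: int, page_size: int = 4096) -> int:
    """Bytes needed to track mem_bytes at one bit per page (assumed layout)."""
    pages = -(-mem_bytes // page_size)  # ceiling division: partial pages count
    return -(-pages // 8)               # 8 page bits per byte, rounded up


def downtime_ratio(before_ms: int, after_ms: int) -> float:
    """How many times shorter the downtime became."""
    return before_ms / after_ms


if __name__ == "__main__":
    tib = 1 << 40
    size = dirty_bitmap_bytes(tib)
    print(f"bitmap for 1 TiB at 4 KiB pages: {size / (1 << 20):.0f} MiB")
    print(f"downtime ratio 1800 ms -> 480 ms: {downtime_ratio(1800, 480):.2f}x")
```

Under these assumptions a 1 TiB guest needs a 32 MiB bitmap per tracked device, which is smaller than the per-VF estimate floated in the quoted exchange; a different page size or tracking granularity changes the result proportionally. The rounded 1.8 s and 480 ms figures give 3.75x, consistent with the 3.71x reported from the unrounded measurements.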


