Subject: Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands


On Wed, Nov 08, 2023 at 12:28:36PM +0800, Jason Wang wrote:
> On Tue, Nov 7, 2023 at 3:05 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Tue, Nov 07, 2023 at 12:04:29PM +0800, Jason Wang wrote:
> > > > > > Each virtio and non-virtio device that wants to report dirty pages
> > > > > > will do it its own way.
> > > > > >
> > > > > > > 3) inventing it in the virtio layer will be deprecated in the future
> > > > > > > for sure, as platform will provide much rich features for logging
> > > > > > > e.g it can do it per PASID etc, I don't see any reason virtio need
> > > > > > > to compete with the features that will be provided by the platform
> > > > > > Can you bring the CPU vendors and their commitment to the virtio TC, with timelines,
> > > > > > so that the virtio TC can omit this?
> > > > >
> > > > > Why do we need to bring CPU vendors into the virtio TC? Virtio needs to be
> > > > > built on top of the transport or platform. There's no need to duplicate their job.
> > > > > Especially considering that virtio can't do better than them.
> > > > >
> > > > I wanted to see a strong commitment from the CPU vendors to support dirty page tracking.
> > >
> > > The RFC of IOMMUFD support can go back to early 2022. Intel, AMD and
> > > ARM are all supporting that now.
> > >
> > > > And the work seems to have started for some platforms.
> > >
> > > Let me quote from the above link:
> > >
> > > """
> > > Today, AMD Milan (or more recent) supports it while ARM SMMUv3.2
> > > alongside VT-D rev3.x also do support.
> > > """
> > >
> > > > Without such a platform commitment, having virtio also skip it would not work.
> > >
> > > Is the above sufficient? I'm a little bit more familiar with VT-d; the
> > > hw feature has been there for years.
> >
> >
> > Repeating myself - I'm not sure that will work well for all workloads.
> 
> I think this comment applies to this proposal as well.

Yes - some systems might be better off with platform tracking.
And I think supporting shadow vq better would be nice too.

> > Definitely KVM did
> > not scan PTEs. It used pagefaults with bit per page and later as VM size
> > grew switched to PLM.  This interface is analogous to PLM,
> 
> I think you meant PML actually. And this doesn't work like PML. To
> behave like PML it would need to:
>
> 1) organize the log buffers as a queue with indices
> 2) have the device suspend (like the #vmexit in PML) if it runs out of buffers
> 3) have the device send a notification to the driver if it runs out of buffers
> 
> I don't see any of the above in this proposal. If we do that it would
> be less problematic than what is being proposed here.
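
Just to make the comparison concrete, here is roughly what those three
points would amount to if sketched in C. All names are made up for
illustration; this is not what the patch proposes:

#include <stdint.h>

/* Hypothetical PML-style log ring, for illustration only. */
struct example_dirty_log_ring {
        uint64_t addr[512]; /* page addresses written by the device via DMA */
        uint16_t head;      /* device write index, wraps at 512 */
        uint16_t tail;      /* driver read index */
};

/*
 * 1) entries form a queue indexed by head/tail;
 * 2) when head would catch up with tail the device suspends logging,
 *    the analogue of the #vmexit PML takes when its 512-entry buffer fills;
 * 3) the device then notifies the driver, which drains the ring and
 *    lets the device resume.
 */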

What is common between this and PML is that you get the addresses
directly without scanning megabytes of bitmaps or worse -
hundreds of megabytes of page tables.

The data structure is different but I don't see why it is critical.

I agree that I don't see out-of-buffer notifications either, which implies
the device has to maintain something like a bitmap internally.  Which I
guess could be fine, but it is not clear to me how large that bitmap has
to be. How does the device know? This needs to be addressed.
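
To give a sense of scale (assuming 4 KiB tracking granularity and a
made-up 1 TiB guest, nothing measured):

#include <stdio.h>

int main(void)
{
        unsigned long long guest_mem = 1ULL << 40;      /* assume 1 TiB of guest memory */
        unsigned long long pages     = guest_mem >> 12; /* 4 KiB pages */
        unsigned long long bitmap    = pages / 8;       /* one bit per page */

        /* prints "bitmap: 32 MiB" for these numbers */
        printf("bitmap: %llu MiB\n", bitmap >> 20);
        return 0;
}

So a plain per-page bitmap is tens of megabytes of on-device state per
terabyte of guest memory, and how the device learns that size in the
first place is exactly the open question.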


> Even if we manage to do that, it doesn't mean we won't have issues.
> 
> 1) For many reasons it can neither see nor log via GPA, so this
> requires a traversal of the vIOMMU mapping tables by the hypervisor
> afterwards, which would be expensive and would need synchronization
> with guest modifications of the IO page table, which looks very hard.

The vIOMMU is fast enough to be used on the data path but not fast enough
for dirty tracking? Hard to believe.  If that is true and you want to speed
up the vIOMMU then you implement an efficient data structure for it.

> 2) There are a lot of special or reserved IOVA ranges (for example the
> interrupt areas on x86) that need special care. That care is architectural
> and beyond the scope or knowledge of the virtio device; it belongs to
> the platform IOMMU. Things get even more complicated when SVA is
> enabled.

SVA being what here?

> And there could be other architecture-specific knowledge (e.g.
> PAGE_SIZE) that might be needed. There's no easy way to deal with
> those cases.

Good point about page size actually - using 4k unconditionally
is a waste of resources.


> We wouldn't need to care about any of them if it is done at the platform
> IOMMU level.

If someone logs at the IOMMU level then nothing needs to be done
in the spec at all. This is about a capability at the device level.


> > what Lingshan
> > proposed is analogous to a bit per page - the problem unfortunately is
> > that you can't easily set a bit by DMA.
> >
> 
> I'm not saying bit/bytemap is the best, but it has been used by real
> hardware. And we have many other options.
> 
> > So I think this dirty tracking is a good option to have.
> >
> >
> >
> > > >
> > > > > > i.e. in first year of 2024?
> > > > >
> > > > > Why does it matter in 2024?
> > > > Because users need to use it now.
> > > >
> > > > >
> > > > > > If not, we are better off offering this, and when/if platform support arrives, sure,
> > > > > > this feature can be disabled/not used/not enabled.
> > > > > >
> > > > > > > 4) if the platform support is missing, we can use software or
> > > > > > > leverage transport for assistance like PRI
> > > > > > All of these are theoretical.
> > > > > > Our experiment shows PRI performance is 21x slower than the page fault
> > > > > > rate achieved by the CPU.
> > > > > > It simply does not even pass a simple 10 Gbps test.
> > > > >
> > > > > If you stick to the wire speed during migration, it can converge.
> > > > Do you have perf data for this?
> > >
> > > No, but it's not hard to imagine the worst case: a small program
> > > that dirties every page via a NIC.
> > >
> > > > In the internal tests we don't see this happening.
> > >
> > > downtime = dirty_rate * PAGE_SIZE / migration_speed
> > >
> > > So if we get a very high dirty rate (e.g. from a high speed NIC), we can't
> > > satisfy the downtime requirement. Or, if you do see convergence, it may be
> > > because of the auto-converge support in hypervisors like KVM, which
> > > throttles the vCPUs so that you can't reach wire speed.
> >
> > Will only work for some device types.
> >
> 
> Yes, that's the point. Parav said he doesn't see the issue; that's
> probably because he is testing virtio-net, so the vCPU is
> automatically throttled. It doesn't mean it can work for other virtio
> devices.

Only for TX, and I'm pretty sure they had the foresight to test RX, not
just TX, but let's confirm. Parav, did you test both directions?
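
And just to put rough numbers on the downtime formula quoted above (all
values are assumed for illustration, nothing here is measured):

#include <stdio.h>

int main(void)
{
        double page_size       = 4096.0;   /* bytes */
        double dirty_rate      = 2.0e6;    /* pages/s, assumed for a NIC touching distinct pages */
        double migration_speed = 12.5e9;   /* bytes/s, i.e. a fully used 100 Gbps link */

        /* downtime = dirty_rate * PAGE_SIZE / migration_speed */
        printf("downtime: %.2f s\n", dirty_rate * page_size / migration_speed);
        return 0;
}

With those made-up numbers the downtime already lands in the hundreds of
milliseconds, which is why convergence depends so much on whether
something throttles the dirtying side.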

> >
> >
> > > >
> > > > >
> > > > > > There is no requirement for mandating PRI either.
> > > > > > So it is unusable.
> > > > >
> > > > > It's not about mandating, it's about doing things in the correct layer. If PRI is
> > > > > slow, PCI can evolve for sure.
> > > > You should try.
> > >
> > > Not my duty; I just want to make sure things are done in the correct
> > > layer, and that if something does need to be done in virtio, there's
> > > nothing obviously wrong with it.
> >
> > Yea, but vague questions alone don't help to make sure either way.
> 
> I don't think it's vague. As I have explained, if something in virtio
> slows down PRI, we can try to fix it.

I don't believe you are going to make PRI fast. No one has managed so far.

> Missing functionality in
> the platform or transport is not a good excuse to try to work around it
> in virtio. It's a layering violation, and we have never had any feature
> like this in the past.

Yes missing functionality in the platform is exactly why virtio
was born in the first place.

> >
> > > > In the current state, it is mandating.
> > > > And if you think PRI is the only way,
> > >
> > > I don't; it's just an example of where virtio can leverage either the
> > > transport or the platform. Or, if it's a fault in virtio that slows
> > > down PRI, then that is something we can fix.
> > >
> > > > then you should propose, in the dirty page tracking series that you listed above, not to do dirty page tracking but rather to depend on PRI, right?
> > >
> > > No, the point is to not duplicate work, especially considering virtio
> > > can't do better than the platform or transport.
> >
> > If someone says they tried, and the platform's migration support does
> > not work for them, and they want to build a solution in virtio, then
> > what exactly is the objection?
> 
> The discussion is to figure out whether virtio can do this easily and
> correctly; then we can reach a conclusion. I've stated some issues
> above, and I've asked other questions related to them which are still
> not answered.
> 
> I think we had a very hard time with bypassing the IOMMU in the past,
> and we don't want to repeat that.
> 
> We've gone through several methods of logging dirty pages in the past
> (each with pros and cons), but this proposal never explains why it
> chooses one of them and not the others. The spec needs to find the best
> path, not just a possible path with no rationale for why it was chosen.

Adding more rationale isn't a bad thing.
In particular, if the platform supplies dirty tracking, how does the
driver decide which to use: the platform or the device capability?
A bit of discussion around this is a good idea.
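
For example, something along these lines (purely a sketch, all names
invented here, not a proposal for spec text):

#include <stdbool.h>

enum dirty_track_backend {
        DIRTY_TRACK_PLATFORM,   /* IOMMU / IOMMUFD based logging */
        DIRTY_TRACK_DEVICE,     /* the admin commands proposed in this series */
        DIRTY_TRACK_SOFTWARE,   /* shadow vq or another software fallback */
};

/* One possible policy: prefer the platform when it covers all of the
 * device's DMA, fall back to the device capability, then to software.
 * Whether the spec recommends an order or leaves it to the driver is
 * exactly the discussion being asked for. */
enum dirty_track_backend pick_dirty_tracking(bool platform_covers_dma,
                                             bool device_has_capability)
{
        if (platform_covers_dma)
                return DIRTY_TRACK_PLATFORM;
        if (device_has_capability)
                return DIRTY_TRACK_DEVICE;
        return DIRTY_TRACK_SOFTWARE;
}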


> > virtio is here in the
> > first place because emulating devices didn't work well.
> 
> I don't understand this point. We have supported emulated devices for
> years. I'm pretty sure a lot of issues could be uncovered if this
> proposal were prototyped with an emulated device first.
> 
> Thanks

virtio was originally PV, as opposed to emulation. That there's now
hardware virtio and you call a software implementation "an emulation" is
very meta.


> 
> 
> 
> 
> >
> > --
> > MST
> >


