virtio-comment message



Subject: RE: [PATCH v3 6/8] admin: Add theory of operation for write recording commands


> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Wednesday, November 15, 2023 1:30 PM
> 
> On Thu, Nov 09, 2023 at 06:26:44AM +0000, Parav Pandit wrote:
> >
> > > From: Jason Wang <jasowang@redhat.com>
> > > Sent: Wednesday, November 8, 2023 9:59 AM
> > >
> > > On Tue, Nov 7, 2023 at 3:05 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> > > >
> > > > On Tue, Nov 07, 2023 at 12:04:29PM +0800, Jason Wang wrote:
> > > > > > > > Each virtio and non-virtio device that wants to report
> > > > > > > > its dirty pages will do so in its own way.
> > > > > > > >
> > > > > > > > > 3) inventing it in the virtio layer will be deprecated
> > > > > > > > > in the future for sure, as the platform will provide much
> > > > > > > > > richer features for logging, e.g. it can do it per PASID
> > > > > > > > > etc. I don't see any reason virtio needs to compete with
> > > > > > > > > the features that will be provided by the platform.
> > > > > > > > Can you bring the CPU vendors and their commitment to the
> > > > > > > > virtio TC, with timelines, so that the virtio TC can omit this?
> > > > > > >
> > > > > > > Why do we need to bring CPU vendors into the virtio TC? Virtio
> > > > > > > needs to be built on top of the transport or platform. There's
> > > > > > > no need to duplicate their job.
> > > > > > > Especially considering that virtio can't do better than them.
> > > > > > >
> > > > > > I wanted to see a strong commitment from the CPU vendors to
> > > > > > support dirty page tracking.
> > > > >
> > > > > The RFC of IOMMUFD support goes back to early 2022. Intel, AMD
> > > > > and ARM are all supporting that now.
> > > > >
> > > > > > And the work seems to have started for some platforms.
> > > > >
> > > > > Let me quote from the above link:
> > > > >
> > > > > """
> > > > > Today, AMD Milan (or more recent) supports it while ARM SMMUv3.2
> > > > > alongside VT-D rev3.x also do support.
> > > > > """
> > > > >
> > > > > > Without such a platform commitment, virtio skipping it would
> > > > > > not work either.
> > > > >
> > > > > Is the above sufficient? I'm a little more familiar with
> > > > > VT-d; the hw feature has been there for years.
> > > >
> > > >
> > > > Repeating myself - I'm not sure that will work well for all workloads.
> > >
> > > I think this comment applies to this proposal as well.
> > >
> > > > Definitely KVM did
> > > > not scan PTEs. It used page faults with a bit per page and later,
> > > > as VM size grew, switched to PLM. This interface is analogous to PLM,
> > >
> > > I think you meant PML actually. And it doesn't work like PML. To
> > > behave like PML it needs to
> > >
> > > 1) organize the log buffers as a queue with indices
> > > 2) suspend the device (as with a #vmexit in PML) if it runs out of
> > > buffers
> > > 3) notify the driver if it runs out of buffers
> > >
> > > I don't see any of the above in this proposal. If we did that, it
> > > would be less problematic than what is being proposed here.
> > >
> > In this proposal, it's slightly different from PML.
> > The log buffer is a write record kept by the device; the device keeps
> > recording, and the owner driver queries the recorded pages.
> > The device can internally do PML or a different implementation as it
> > finds suitable.
> 
> I personally like it that this detail is hidden inside the device.
> One important piece of functionality that PML has and that this does not
> have is the ability to interrupt the host, e.g. if it is running low on
> space to record this info. Want to add it in some way?
Page tracking using a PML equivalent can be an additional method.
It could possibly live as an independent feature as well as an extension of this one.

One trade-off to deal with in that approach is that, when an IOTLB flush is needed, the hypervisor needs to query a partial range.
This requires searching the log buffer and creating holes in it.
And the hypervisor needs to do that search and also maintain a shadow of the log to work around the problem.
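
To make this concrete, here is a minimal sketch in C of that partial-range search; the record layout and all names are invented for illustration and are not part of this series:

    /* Hypothetical write-record log, not the layout defined in the series. */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    struct write_record {
        uint64_t iova;   /* start of a recorded write */
        uint64_t len;    /* length of the recorded write */
        bool     valid;  /* cleared records become holes in the log */
    };

    /* On an IOTLB flush of [start, start + len), the hypervisor must scan
     * the whole log for overlapping records and consume them, leaving
     * holes behind that later queries have to skip. */
    static size_t flush_range(struct write_record *log, size_t n,
                              uint64_t start, uint64_t len)
    {
        size_t hits = 0;

        for (size_t i = 0; i < n; i++) {
            if (!log[i].valid)
                continue;                  /* skip an existing hole */
            if (log[i].iova < start + len &&
                start < log[i].iova + log[i].len) {
                log[i].valid = false;      /* punch a hole in the log */
                hits++;                    /* record consumed for this flush */
            }
        }
        return hits;
    }

The linear scan, and the holes it leaves behind, are exactly the cost being traded off here.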

Using a VQ for out-of-order records generates too many writes.

In the current device-based query interface there are zero PCI writes of the kind PML would require.
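
As a rough illustration of that pull model (the structures and names below are hypothetical, not the commands defined in this series):

    #include <stdint.h>

    /* Driver-supplied request: which IOVA range to query. */
    struct wr_query_req {
        uint64_t iova;          /* start of the range being queried */
        uint64_t len;           /* length of the range */
    };

    /* The device fills the response buffer that the driver supplied with
     * the command, so the device performs no unsolicited writes to host
     * memory. */
    struct wr_query_resp {
        uint32_t num_records;   /* number of entries in records[] */
        uint32_t reserved;
        struct {
            uint64_t iova;
            uint64_t len;
        } records[];            /* written ranges found in the query window */
    };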

I would say we should add the PML-like mechanism incrementally once the first round of features is done.
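
If/when we do, a PML-style extension could follow the three requirements listed earlier in the thread (indexed queue, suspend when full, notify when low). A minimal sketch, with invented names and sizes:

    #include <stdint.h>

    #define LOG_QUEUE_ENTRIES 512u   /* hypothetical ring size */
    #define LOG_LOW_WATERMARK 32u    /* hypothetical notification threshold */

    struct log_entry {
        uint64_t iova;   /* start of the dirtied range */
        uint64_t len;    /* length of the dirtied range */
    };

    struct log_queue {
        struct log_entry ring[LOG_QUEUE_ENTRIES];
        uint32_t head;   /* producer index, advanced by the device */
        uint32_t tail;   /* consumer index, advanced by the driver */
    };

    static void notify_driver_low_on_buffers(void)
    {
        /* e.g. raise an interrupt/notification (placeholder) */
    }

    /* Device-side pseudologic. Returns 0 on success, -1 when the ring is
     * full and the device must suspend, like a PML #vmexit. */
    static int log_write(struct log_queue *q, uint64_t iova, uint64_t len)
    {
        uint32_t used = q->head - q->tail;

        if (used == LOG_QUEUE_ENTRIES)
            return -1;                       /* 2) full: suspend */

        q->ring[q->head % LOG_QUEUE_ENTRIES] =
            (struct log_entry){ .iova = iova, .len = len };
        q->head++;                           /* 1) indexed queue */

        if (LOG_QUEUE_ENTRIES - (used + 1) <= LOG_LOW_WATERMARK)
            notify_driver_low_on_buffers();  /* 3) notify when low */

        return 0;
    }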

> E.g. a special command that is only used if the device is low on buffers.
> 
> 
> > > Even if we manage to do that, it doesn't mean we won't have issues.
> > >
> > > 1) For many reasons it can neither see nor log via GPA, so this
> > > requires a traversal of the vIOMMU mapping tables by the hypervisor
> > > afterwards. That would be expensive and would need synchronization
> > > with the guest's modification of the IO page table, which looks very
> > > hard.
> > > 2) There are a lot of special or reserved IOVA ranges (for example
> > > the interrupt areas on x86) that need special care. That care is
> > > architectural and beyond the scope or knowledge of the virtio
> > > device; it belongs to the platform IOMMU.
> > > Things would be more complicated when SVA is enabled. And there
> > > could be other architecture-specific knowledge (e.g.
> > > PAGE_SIZE) that might be needed. There's no easy way to deal with
> > > those cases.
> > >
> >
> > Current and future iommufd and OS interfaces can likely support this
> > already.
> > In the current proposal, multiple ranges are supplied to the device;
> > the reserved ranges are not part of them.
> >
> > > We wouldn't need to care about all of them if it is done at the
> > > platform IOMMU level.
> > >
> > I agree that when the platform IOMMU has support, and if it is better,
> > it should be the hypervisor's first choice.
> > Mainly because the D bit of the page is already there, rather than a
> > special PML queue or a racy bitmap like what was proposed in another
> > series.
> 
> BTW your bitmap is also racy if there's a vIOMMU, unless the hypervisor
> is very careful to empty the bitmap when mappings change.
> You should document this requirement.
> 
When to query the dirty page log is the hypervisor's decision; map/unmap, IOTLB flush, etc. are hard to document in the spec.
We can write some guiding notes for the hypervisor, but not a requirement.
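
As an example of such a guiding note, expressed as hypervisor pseudologic (the helper names are placeholders, not spec-defined interfaces): drain the device's write records for a range before completing the unmap, so a stale record is never matched against a future mapping of the same IOVA.

    #include <stdint.h>

    /* Placeholders standing in for real hypervisor/device plumbing. */
    static void device_write_records_query_and_clear(uint64_t iova,
                                                     uint64_t len)
    {
        /* Issue the (hypothetical) query command for [iova, iova + len)
         * and merge the result into the hypervisor's dirty tracking. */
        (void)iova; (void)len;
    }

    static void iommu_flush_iotlb(uint64_t iova, uint64_t len)
    {
        (void)iova; (void)len;
    }

    static void viommu_unmap(uint64_t iova, uint64_t len)
    {
        /* 1. Drain the device's records while the old mapping is valid. */
        device_write_records_query_and_clear(iova, len);

        /* 2. Only then flush the IOTLB and complete the unmap. */
        iommu_flush_iotlb(iova, len);
    }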

> 
> --
> MST


