virtio-comment message


Subject: RE: [PATCH v3 6/8] admin: Add theory of operation for write recording commands


> From: Jason Wang <jasowang@redhat.com>
> Sent: Thursday, November 16, 2023 9:49 AM
> 
> On Thu, Nov 16, 2023 at 1:42 AM Parav Pandit <parav@nvidia.com> wrote:
> >
> >
> > > From: Jason Wang <jasowang@redhat.com>
> > > Sent: Monday, November 13, 2023 9:02 AM
> > >
> > > On Thu, Nov 9, 2023 at 3:59 PM Michael S. Tsirkin <mst@redhat.com>
> wrote:
> > > >
> > > > On Thu, Nov 09, 2023 at 11:31:27AM +0800, Jason Wang wrote:
> > > > > On Wed, Nov 8, 2023 at 4:17 PM Michael S. Tsirkin
> > > > > <mst@redhat.com>
> > > wrote:
> > > > > >
> > > > > > On Wed, Nov 08, 2023 at 12:28:36PM +0800, Jason Wang wrote:
> > > > > > > On Tue, Nov 7, 2023 at 3:05 PM Michael S. Tsirkin
> > > > > > > <mst@redhat.com>
> > > wrote:
> > > > > > > >
> > > > > > > > On Tue, Nov 07, 2023 at 12:04:29PM +0800, Jason Wang wrote:
> > > > > > > > > > > > Each virtio and non-virtio device that wants to
> > > > > > > > > > > > report its dirty pages will do it in its own way.
> > > > > > > > > > > >
> > > > > > > > > > > > > 3) inventing it in the virtio layer will be
> > > > > > > > > > > > > deprecated in the future for sure, as platform
> > > > > > > > > > > > > will provide much rich features for logging e.g
> > > > > > > > > > > > > it can do it per PASID etc, I don't see any
> > > > > > > > > > > > > reason virtio need to compete with the features
> > > > > > > > > > > > > that will be provided by the platform
> > > > > > > > > > > > Can you bring the CPU vendors and their commitment to
> > > > > > > > > > > > the virtio TC with timelines so that the virtio TC can omit this?
> > > > > > > > > > >
> > > > > > > > > > > Why do we need to bring CPU vendors in the virtio TC?
> > > > > > > > > > > Virtio needs to be built on top of transport or
> > > > > > > > > > > platform. There's
> > > no need to duplicate their job.
> > > > > > > > > > > Especially considering that virtio can't do better than them.
> > > > > > > > > > >
> > > > > > > > > > I wanted to see a strong commitment from the CPU
> > > > > > > > > > vendors to support dirty page tracking.
> > > > > > > > >
> > > > > > > > > The RFC of IOMMUFD support can go back to early 2022.
> > > > > > > > > Intel, AMD and ARM are all supporting that now.
> > > > > > > > >
> > > > > > > > > > And the work seems to have started for some platforms.
> > > > > > > > >
> > > > > > > > > Let me quote from the above link:
> > > > > > > > >
> > > > > > > > > """
> > > > > > > > > Today, AMD Milan (or more recent) supports it while ARM
> > > > > > > > > SMMUv3.2 alongside VT-D rev3.x also do support.
> > > > > > > > > """
> > > > > > > > >
> > > > > > > > > > Without such platform commitment, virtio also skipping
> > > > > > > > > > it would
> > > not work.
> > > > > > > > >
> > > > > > > > > Is the above sufficient? I'm a little bit more familiar
> > > > > > > > > with vtd, the hw feature has been there for years.
> > > > > > > >
> > > > > > > >
> > > > > > > > Repeating myself - I'm not sure that will work well for all workloads.
> > > > > > >
> > > > > > > I think this comment applies to this proposal as well.
> > > > > >
> > > > > > Yes - some systems might be better off with platform tracking.
> > > > > > And I think supporting shadow vq better would be nice too.
> > > > >
> > > > > For shadow vq, did you mean the work that is done by Eugenio?
> > > >
> > > > Yes.
> > >
> > > That's exactly why vDPA starts with shadow virtqueue. We've
> > > evaluated various possible approaches, each of them have their
> > > shortcomings and shadow virtqueue is the only one that doesn't
> > > require any additional hardware features to work in every platform.
> > >
> > > >
> > > > > >
> > > > > > > > Definitely KVM did
> > > > > > > > not scan PTEs. It used pagefaults with bit per page and
> > > > > > > > later as VM size grew switched to PLM.  This interface is
> > > > > > > > analogous to PLM,
> > > > > > >
> > > > > > > I think you meant PML actually. And it doesn't work like
> > > > > > > PML. To behave like PML it needs to
> > > > > > >
> > > > > > > 1) log buffers were organized as a queue with indices
> > > > > > > 2) device needs to suspend (as a #vmexit in PML) if it runs
> > > > > > > out of the buffers
> > > > > > > 3) device need to send a notification to the driver if it
> > > > > > > runs out of the buffer
> > > > > > >
> > > > > > > I don't see any of the above in this proposal. If we do that
> > > > > > > it would be less problematic than what is being proposed here.
> > > > > >
> > > > > > What is common between this and PML is that you get the
> > > > > > addresses directly without scanning megabytes of bitmaps or
> > > > > > worse - hundreds of megabytes of page tables.
> > > > >
> > > > > Yes, it has overhead but this is the method we use for vhost and
> > > > > KVM
> > > (earlier).
> > > > >
> > > > > To me the  important advantage of PML is that it uses limited
> > > > > resources on the host which
> > > > >
> > > > > 1) doesn't require resources in the device
> > > > > 2) doesn't scale as the guest memory increases. (but this
> > > > > advantage doesn't exist in neither this nor bitmap)
> > > >
> > > > it seems 2 exactly exists here.
> > >
> > > Actually not, Parav said the device needs to reserve sufficient
> > > resources in another thread.
> > The device resource reservation starts only when the device migration starts.
> > i.e. with WRITE_RECORDS_START command of patch 7 in the series.
> 
> Right, but this is not the question, see below.
> 
> >
> > >
> > > >
> > > >
> > > > > >
> > > > > > The data structure is different but I don't see why it is critical.
> > > > > >
> > > > > > I agree that I don't see out of buffers notifications too
> > > > > > which implies device has to maintain something like a bitmap internally.
> > > > > > Which I guess could be fine but it is not clear to me how
> > > > > > large that bitmap has to be. How does the device know? Needs to be
> addressed.
> > > > >
> > > > > This is the question I asked Parav in another thread. Using host
> > > > > memory as a queue with notification (like PML) might be much better.
> > > >
> > > > Well if queue is what you want to do you can just do it internally.
> > >
> > > Then it's not the proposal here, Parav has explained it in another
> > > reply, and as explained it lacks a lot of other facilities.
> > >
> > PML is yet another option that requires small pci writes.
> > In the current proposal, there are no small PCI writes.
> > It is a query interface from the device.
> 
> Well, you've explained in another thread that actually it needs small PCI writes.
>
No. There may be some misunderstanding.
 
> E.g during IOTLB invalidation ...
> 
This is not part of the virtio interface.

> >
> > > > Problem of course is that it might overflow and cause things like
> > > > packet drops.
> > >
> > > Exactly like PML. So sticking to wire speed should not be a general
> > > goal in the context of migration. It can be done if the speed of the
> > > migration interface is faster than the virtio device that needs to be migrated.
> > May not have to be.
> > Speed of page recording should be fast enough.
> > It usually improves with subsequent generation.
> 
> If you have something better, let's propose it from the start.
>
It is unproven that current proposal cannot be done by the device.
So I prefer to do this incrementally when we get there.
 
> > >
> > > >
> > > >
> > > > > >
> > > > > >
> > > > > > > Even if we manage to do that, it doesn't mean we won't have issues.
> > > > > > >
> > > > > > > 1) For many reasons it can neither see nor log via GPA, so
> > > > > > > this requires a traversal of the vIOMMU mapping tables by
> > > > > > > the hypervisor afterwards, it would be expensive and need
> > > > > > > synchronization with the guest modification of the IO page
> > > > > > > table which
> > > looks very hard.
> > > > > >
> > > > > > vIOMMU is fast enough to be used on data path but not fast
> > > > > > enough for dirty tracking?
> > > > >
> > > > > We set up SPTEs or using nesting offloading where the PTEs could
> > > > > be iterated by hardware directly which is fast.
> > > >
> > > > There's a way to have hardware find dirty PTEs for you quickly?
> > >
> > > Scanning PTEs on the host is faster and more secure than scanning
> > > guests, that's what I want to say:
> > >
> > > 1) the guest page could be swapped out but not the host one.
> > > 2) no guest triggerable behavior
> > >
> >
> > On a mapping change, the device page tracking table is to be consulted and flushed.
> >
> > > > I don't know how it's done. Do tell.
> > > >
> > > >
> > > > > This is not the case here where software needs to iterate the IO
> > > > > page tables in the guest which could be slow.
> > > > >
> > > > > > Hard to believe.  If true and you want to speed up vIOMMU then
> > > > > > you implement an efficient datastructure for that.
> > > > >
> > > > > Besides the issue of performance, it's also racy, assuming we
> > > > > are logging
> > > IOVA.
> > > > >
> > > > > 0) device log IOVA
> > > > > 1) hypervisor fetches IOVA from log buffer
> > > > > 2) guest map IOVA to a new GPA
> > > > > 3) hypervisor traverse guest table to get IOVA to new GPA
> > > > >
> > > > > Then we lost the old GPA.
> > > >
> > > > Interesting and a good point.
> > >
> > > Note that PML logs at GPA as it works at L1 of EPT.
> > >
> > > > And by the way e.g. vhost has the same issue.  You need to flush
> > > > dirty tracking info when changing the mappings somehow.
> > >
> > > It's not,
> > >
> > > 1) memory translation is done by vhost
> > > 2) vhost knows GPA and it doesn't log via IOVA.
> > >
> > > See this for example, and DPDK has similar fixes.
> > >
> > > commit cc5e710759470bc7f3c61d11fd54586f15fdbdf4
> > > Author: Jason Wang <jasowang@redhat.com>
> > > Date:   Wed Jan 16 16:54:42 2019 +0800
> > >
> > >     vhost: log dirty page correctly
> > >
> > >     Vhost dirty page logging API is designed to sync through GPA. But we
> > >     try to log GIOVA when device IOTLB is enabled. This is wrong and may
> > >     lead to missing data after migration.
> > >
> > >     To solve this issue, when logging with device IOTLB enabled, we will:
> > >
> > >     1) reuse the device IOTLB translation result of GIOVA->HVA mapping to
> > >        get HVA, for writable descriptor, get HVA through iovec. For used
> > >        ring update, translate its GIOVA to HVA
> > >     2) traverse the GPA->HVA mapping to get the possible GPA and log
> > >        through GPA. Pay attention this reverse mapping is not guaranteed
> > >        to be unique, so we should log each possible GPA in this case.
> > >
> > >     This fix the failure of scp to guest during migration. In -next, we
> > >     will probably support passing GIOVA->GPA instead of GIOVA->HVA.
> > >
> > >     Fixes: 6b1e6cc7855b ("vhost: new device IOTLB API")
> > >     Reported-by: Jintack Lim <jintack@cs.columbia.edu>
> > >     Cc: Jintack Lim <jintack@cs.columbia.edu>
> > >     Signed-off-by: Jason Wang <jasowang@redhat.com>
> > >     Acked-by: Michael S. Tsirkin <mst@redhat.com>
> > >     Signed-off-by: David S. Miller <davem@davemloft.net>
> > >
> > > None of the above is what virtio does right now.
> > >
> > > > Parav what's the plan for this? Should be addressed in the spec too.
> > > >
> > >
> > > AFAIK, there's no easy/efficient way to do that. I hope I was wrong.
> > >
> >
> > The query interface in this proposal works on a granular boundary to read
> > and clear.
> > This will ensure that the mapping stays consistent.
> >
> > > >
> > > >
> > > > > >
> > > > > > > 2) There are a lot of special or reserved IOVA ranges (for
> > > > > > > example the interrupt areas in x86) that need special care
> > > > > > > which is architectural and where it is beyond the scope or
> > > > > > > knowledge of the virtio device but the platform IOMMU.
> > > > > > > Things would be more complicated when SVA is enabled.
> > > > > >
> > > > > > SVA being what here?
> > > > >
> > > > > For example, IOMMU may treat interrupt ranges differently
> > > > > depending on whether SVA is enabled or not. It's very hard and
> > > > > unnecessary to teach devices about this.
> > > >
> > > > Oh, shared virtual memory. So what you are saying here? virtio
> > > > does not care, it just uses some addresses and if you want it to
> > > > it can record writes somewhere.
> > >
> > > One example, PCI allows devices to send translated requests, how can
> > > a hypervisor know it's a PA or IOVA in this case? We probably need a
> > > new bit. But it's not the only thing we need to deal with.
> > >
> > > By definition, interrupt ranges and other reserved ranges should not
> > > belong to dirty pages. And the logging should be done before the DMA
> > > where there's no way for the device to know whether or not an IOVA
> > > is valid or not. It would be more safe to just not report them from
> > > the source instead of leaving it to the hypervisor to deal with but
> > > this seems impossible at the device level. Otherwise the hypervisor
> > > driver needs to communicate with the (v)IOMMU to be reached with the
> > > interrupt(MSI) area, RMRR area etc in order to do the correct things
> > > or it might have security implications. And those areas don't make
> > > sense at L1 when vSVA is enabled. What's more, when vIOMMU could be
> > > fully offloaded, there's no easy way to fetch that information.
> > >
> > There cannot be logging before the DMA.
> 
> Well, I don't see how this is related to the issue above. Logging after the DMA
> doesn't mean the device can know what sits behind an IOVA, no?
> 
The device does not know what sits behind an IOVA.
The IOVA to GPA mapping is maintained by the IOMMU.
Hence, the device records the IOVAs and the hypervisor queries them.
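
To make this split concrete, here is a minimal, purely illustrative sketch
(not taken from the patch series) of the hypervisor-side flow under these
assumptions: the device only returns dirty IOVA ranges, and the hypervisor
resolves IOVA to GPA from the IOMMU mapping it already maintains. The helper
names (query_device_write_records(), iommu_iova_to_gpa(),
mark_gpa_range_dirty()) are hypothetical.

/* Hypothetical hypervisor-side query flow: the device reports dirty IOVA
 * ranges; the hypervisor owns the IOVA->GPA mapping and the migration
 * dirty bitmap.
 */
#include <stdint.h>
#include <stddef.h>

struct iova_range {
        uint64_t iova;
        uint64_t len;
};

/* Assumed helpers provided by the hypervisor/IOMMU layer. */
extern size_t query_device_write_records(struct iova_range *buf, size_t max);
extern int iommu_iova_to_gpa(uint64_t iova, uint64_t *gpa);
extern void mark_gpa_range_dirty(uint64_t gpa, uint64_t len);

static void sync_device_dirty_log(void)
{
        struct iova_range recs[256];
        size_t n, i;

        /* Read and clear the device's write records in batches. */
        while ((n = query_device_write_records(recs, 256)) != 0) {
                for (i = 0; i < n; i++) {
                        uint64_t gpa;

                        /* Translation is done by the hypervisor, not the device. */
                        if (iommu_iova_to_gpa(recs[i].iova, &gpa) == 0)
                                mark_gpa_range_dirty(gpa, recs[i].len);
                }
        }
}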

> > The only requirement is that before the mapping changes, the dirty page
> > tracking is synced.
> >
> > In most common cases where the perf is critical, such mappings won't change
> > so often dynamically anyway.
> 
> I've explained the issue in another reply.
> 
> >
> > > Again, it's hard to bypass or even duplicate the functionality of
> > > the platform or we need to step into every single detail of a
> > > specific transport, architecture or IOMMU to figure out whether or
> > > not logging at virtio is correct which is awkward and unrealistic.
> > > This proposal suffers from an exact similar issue when inventing
> > > things like freeze/stop where I've pointed out other branches of issues as
> well.
> > >
> > It is an incorrect attribution that the platform is duplicated here.
> > The proposal feeds the data to the platform as needed without replicating it.
> >
> > I do agree that there is overlap between the IOMMU tracking dirty pages in
> > the PTEs and the device supplying its dirty tracking via its own interface.
> > Both are consolidated at the hypervisor level.
> >
> > > >
> > > > > >
> > > > > > > > > And there could be other architecture-specific knowledge (e.g
> > > > > > > PAGE_SIZE) that might be needed. There's no easy way to deal
> > > > > > > with those cases.
> > > > > >
> > > > > > Good point about page size actually - using 4k unconditionally
> > > > > > is a waste of resources.
> > > > >
> > > > > Actually, they are more than just PAGE_SIZE, for example, PASID and
> others.
> > > >
> > > > what does pasid have to do with it? anyway, just give driver
> > > > control over page size.
> > >
> > > For example, two virtqueues have two PASIDs assigned. How can a
> > > hypervisor know which specific IOVA belongs to which PASID? For
> > > platform IOMMU, they are handy as it talks to the transport. But I
> > > don't think we need to duplicate every transport specific address space
> feature in core virtio layer:
> > >
> > PASID to vq assignment won't be duplicated.
> > It is configured fully by the guest without consulting the hypervisor at the
> > device level.
> > The guest IOMMU would consult the hypervisor to set up any PASID mapping as
> > part of any mapping method.
> >
> > > 1) translated/untranslated request
> > > 2) request w/ and w/o PASID
> > >
> > > >
> > > > > >
> > > > > >
> > > > > > > We wouldn't need to care about all of them if it is done at
> > > > > > > platform IOMMU level.
> > > > > >
> > > > > > If someone logs at IOMMU level then nothing needs to be done
> > > > > > in the spec at all. This is about capability at the device level.
> > > > >
> > > > > True, but my question is whether or not it can be done at the
> > > > > device level easily.
> > > >
> > > > there's no "easily" about live migration ever.
> > >
> > > I think I've stated sufficient issues to demonstrate how hard it is if
> > > virtio wants to do it.
> > > And I've given the link that it is possible to do that in IOMMU
> > > without those issues. So in this context doing it in virtio is much harder.
> > >
> > > > For example on-device iommus are a thing.
> > >
> > > I'm not sure that's the way to go considering the platform IOMMU
> > > evolves very quickly.
> > >
> > > >
> > > > > >
> > > > > >
> > > > > > > > what Lingshan
> > > > > > > > proposed is analogous to bit per page - problem
> > > > > > > > unfortunately is you can't easily set a bit by DMA.
> > > > > > > >
> > > > > > >
> > > > > > > I'm not saying bit/bytemap is the best, but it has been used
> > > > > > > by real hardware. And we have many other options.
> > > > > > >
> > > > > > > > So I think this dirty tracking is a good option to have.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > > i.e. in first year of 2024?
> > > > > > > > > > >
> > > > > > > > > > > Why does it matter in 2024?
> > > > > > > > > > Because users need to use it now.
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > If not, we are better off offering this, and
> > > > > > > > > > > > when/if platform support is available, sure,
> > > > > > > > > > > > this feature can be disabled/not used/not enabled.
> > > > > > > > > > > >
> > > > > > > > > > > > > 4) if the platform support is missing, we can
> > > > > > > > > > > > > use software or leverage transport for
> > > > > > > > > > > > > assistance like PRI
> > > > > > > > > > > > All of these are in theory.
> > > > > > > > > > > > Our experiment shows PRI performance is 21x slower
> > > > > > > > > > > > than page fault rate
> > > > > > > > > > > done by the cpu.
> > > > > > > > > > > > It simply does not even pass a simple 10Gbps test.
> > > > > > > > > > >
> > > > > > > > > > > If you stick to the wire speed during migration, it can
> converge.
> > > > > > > > > > Do you have perf data for this?
> > > > > > > > >
> > > > > > > > > No, but it's not hard to imagine the worst case. Wrote a
> > > > > > > > > small program that dirty every page by a NIC.
> > > > > > > > >
> > > > > > > > > > In the internal tests we don't see this happening.
> > > > > > > > >
> > > > > > > > > downtime = dirty_rates * PAGE_SIZE / migration_speed
> > > > > > > > >
> > > > > > > > > So if we get very high dirty rates (e.g by a high speed
> > > > > > > > > NIC), we can't satisfy the requirement of the downtime.
> > > > > > > > > Or if you see the converge, you might get help from the
> > > > > > > > > auto converge support by the hypervisors like KVM where
> > > > > > > > > it tries to throttle the VCPU then you can't reach the wire speed.
> > > > > > > >
> > > > > > > > Will only work for some device types.
> > > > > > > >
> > > > > > >
> > > > > > > Yes, that's the point. Parav said he doesn't see the issue,
> > > > > > > it's probably because he is testing a virtio-net and so the
> > > > > > > vCPU is automatically throttled. It doesn't mean it can work
> > > > > > > for other virtio devices.
> > > > > >
> > > > > > Only for TX, and I'm pretty sure they had the foresight to
> > > > > > test RX not just TX but let's confirm. Parav did you test both directions?
> > > > >
> > > > > RX speed somehow depends on the speed of refill, so throttling
> > > > > helps more or less.
> > > >
> > > > It doesn't depend on the speed of refill, you just underrun and drop
> > > > packets. Then your nice 10usec latency becomes more like 10sec.
> > >
> > > I miss your point here. If the driver can't achieve wire speed
> > > without dirty page tracking, it can neither when dirty page tracking is
> enabled.
> > >
> > > >
> > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > There is no requirement for mandating PRI either.
> > > > > > > > > > > > So it is unusable.
> > > > > > > > > > >
> > > > > > > > > > > It's not about mandating, it's about doing things in
> > > > > > > > > > > the correct layer. If PRI is slow, PCI can evolve for sure.
> > > > > > > > > > You should try.
> > > > > > > > >
> > > > > > > > > Not my duty, I just want to make sure things are done in
> > > > > > > > > the correct layer, and once it needs to be done in the
> > > > > > > > > virtio, there's nothing obviously wrong.
> > > > > > > >
> > > > > > > > Yea but just vague questions don't help to make sure either way.
> > > > > > >
> > > > > > > I don't think it's vague, I have explained, if something in
> > > > > > > the virtio slows down the PRI, we can try to fix them.
> > > > > >
> > > > > > I don't believe you are going to make PRI fast. No one managed so far.
> > > > >
> > > > > So it's the fault of PRI not virtio, but it doesn't mean we need
> > > > > to do it in virtio.
> > > >
> > > > I keep saying with this approach we would just say "e1000
> > > > emulation is slow and encumbered this is the fault of e1000" and
> > > > never get virtio at all.  Assigning blame only gets you so far.
> > >
> > > I think we are discussing different things. My point is virtio needs
> > > to leverage the functionality provided by transport or platform
> > > (especially considering they evolve faster than virtio). It seems to
> > > me it's hard even to duplicate some basic function of platform IOMMU in
> virtio.
> > >
> > Not duplicated. Feeding into the platform.
> >
> > > >
> > > > > >
> > > > > > > Missing functions in
> > > > > > > platform or transport is not a good excuse to try to
> > > > > > > workaround it in the virtio. It's a layer violation and we
> > > > > > > never had any feature like this in the past.
> > > > > >
> > > > > > Yes missing functionality in the platform is exactly why
> > > > > > virtio was born in the first place.
> > > > >
> > > > > Well the platform can't do device specific logic. But that's not
> > > > > the case of dirty page tracking which is device logic agnostic.
> > > >
> > > > Not true platforms have things like NICs on board and have for
> > > > many years. It's about performance really.
> > >
> > > I've stated sufficient issues above. And one more obvious issue for
> > > device initiated page logging is that it needs a lot of extra or
> > > unnecessary PCI transactions which will throttle the performance of
> > > the whole system (and lead to other issues like QOS). So I can't believe it has
> good performance overall.
> > > Logging via IOMMU or using shadow virtqueue doesn't need any extra
> > > PCI transactions at least.
> > >
> > In the current proposal, it does not require PCI transactions, as there is
> > only a hypervisor-initiated query interface.
> 
> Such query requires at least several transactions, no?
>
It depends on how much memory is unmapped.

For the pinned-memory case with a HW IOMMU, mapping/unmapping is not a frequent event.
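
As a rough illustration of the ordering this relies on (sync the device's
write records before a mapping change takes effect), a hypervisor's unmap
path could be structured along the lines below. This is only a hedged sketch
with assumed helper names, not text from the proposal.

#include <stdint.h>

/* Assumed helpers: sync_device_dirty_log() as sketched earlier in this
 * thread, iommu_unmap() standing in for the hypervisor's IOMMU unmap path.
 */
extern void sync_device_dirty_log(void);
extern void iommu_unmap(uint64_t iova, uint64_t len);

static void unmap_with_dirty_sync(uint64_t iova, uint64_t len)
{
        /* Drain write records while they still translate via the old mapping. */
        sync_device_dirty_log();
        /* Only now may the IOVA->GPA mapping change. */
        iommu_unmap(iova, len);
}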
 
> Or to make things more clear, could you list the steps how a hypervisor is
> expected to do the querying?
>
It is listed in the other email thread where you described the race condition.
 
> > It is a trade off of using svq + pasid vs using something from the device.
> >
> > Again, both have different use cases and value. One uses the CPU and one
> > uses the device.
> 
> That's for sure.
> 
> > Depending on how much power one wants to spend where.
> 
> But as the author of the proposal, you need to elaborate more on how you
> expect for a hypervisor instead of letting the reviewer guess.
The hypervisor would look in the spec for how to use the functionality and find it there.
But I will add a short description of the write records read command.
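
Until that description is added, here is a purely illustrative sketch of what
one entry of such a query result could carry; the structure name and layout
are assumptions for discussion, not the contents of the series (in spec terms
the fields would be le64; plain C types are used here).

#include <stdint.h>

/* Hypothetical entry returned by a "read and clear write records" query. */
struct write_record {
        uint64_t iova;  /* start of the range written by the device */
        uint64_t len;   /* length of the written range, in bytes */
};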

