virtio-comment message



Subject: Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands


On Thu, Nov 16, 2023 at 1:42 AM Parav Pandit <parav@nvidia.com> wrote:
>
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Monday, November 13, 2023 9:02 AM
> >
> > On Thu, Nov 9, 2023 at 3:59 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> > >
> > > On Thu, Nov 09, 2023 at 11:31:27AM +0800, Jason Wang wrote:
> > > > On Wed, Nov 8, 2023 at 4:17 PM Michael S. Tsirkin <mst@redhat.com>
> > wrote:
> > > > >
> > > > > On Wed, Nov 08, 2023 at 12:28:36PM +0800, Jason Wang wrote:
> > > > > > On Tue, Nov 7, 2023 at 3:05 PM Michael S. Tsirkin <mst@redhat.com>
> > wrote:
> > > > > > >
> > > > > > > On Tue, Nov 07, 2023 at 12:04:29PM +0800, Jason Wang wrote:
> > > > > > > > > > > Each virtio and non-virtio device that wants to report
> > > > > > > > > > > its dirty pages
> > > > > > > > > > will do so in its own way.
> > > > > > > > > > >
> > > > > > > > > > > > 3) inventing it in the virtio layer will be
> > > > > > > > > > > > deprecated in the future for sure, as platform will
> > > > > > > > > > > > provide much rich features for logging e.g it can do
> > > > > > > > > > > > it per PASID etc, I don't see any reason virtio need
> > > > > > > > > > > > to compete with the features that will be provided
> > > > > > > > > > > > by the platform
> > > > > > > > > > > Can you bring the CPU vendors and their commitment to
> > > > > > > > > > > the virtio TC with timelines
> > > > > > > > > > so that the virtio TC can omit this?
> > > > > > > > > >
> > > > > > > > > > Why do we need to bring CPU vendors in the virtio TC?
> > > > > > > > > > Virtio needs to be built on top of transport or platform. There's
> > no need to duplicate their job.
> > > > > > > > > > Especially considering that virtio can't do better than them.
> > > > > > > > > >
> > > > > > > > > > > I wanted to see a strong commitment from the CPU vendors to
> > support dirty page tracking.
> > > > > > > >
> > > > > > > > The RFC of IOMMUFD support can go back to early 2022. Intel,
> > > > > > > > AMD and ARM are all supporting that now.
> > > > > > > >
> > > > > > > > > And the work seems to have started for some platforms.
> > > > > > > >
> > > > > > > > Let me quote from the above link:
> > > > > > > >
> > > > > > > > """
> > > > > > > > Today, AMD Milan (or more recent) supports it while ARM
> > > > > > > > SMMUv3.2 alongside VT-D rev3.x also do support.
> > > > > > > > """
> > > > > > > >
> > > > > > > > > Without such platform commitment, virtio also skipping it would
> > not work.
> > > > > > > >
> > > > > > > > Is the above sufficient? I'm a little bit more familiar with
> > > > > > > > vtd, the hw feature has been there for years.
> > > > > > >
> > > > > > >
> > > > > > > Repeating myself - I'm not sure that will work well for all workloads.
> > > > > >
> > > > > > I think this comment applies to this proposal as well.
> > > > >
> > > > > Yes - some systems might be better off with platform tracking.
> > > > > And I think supporting shadow vq better would be nice too.
> > > >
> > > > For shadow vq, did you mean the work that is done by Eugenio?
> > >
> > > Yes.
> >
> > That's exactly why vDPA starts with shadow virtqueue. We've evaluated various
> > possible approaches; each of them has its shortcomings, and shadow
> > virtqueue is the only one that doesn't require any additional hardware features
> > to work on every platform.
> >
> > >
> > > > >
> > > > > > > Definitely KVM did
> > > > > > > not scan PTEs. It used pagefaults with bit per page and later
> > > > > > > as VM size grew switched to PLM.  This interface is analogous
> > > > > > > to PLM,
> > > > > >
> > > > > > I think you meant PML actually. And it doesn't work like PML. To
> > > > > > behave like PML it needs to
> > > > > >
> > > > > > 1) log buffers organized as a queue with indices
> > > > > > 2) the device needs to suspend (like a #vmexit in PML) if it runs out
> > > > > > of buffers
> > > > > > 3) the device needs to send a notification to the driver if it runs
> > > > > > out of buffers
> > > > > >
> > > > > > I don't see any of the above in this proposal. If we do that it
> > > > > > would be less problematic than what is being proposed here.
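
To make the comparison with PML concrete, a log queue along those lines could
look roughly like the sketch below. This is purely illustrative: the structure,
the field names and the 512-entry size are assumptions made here, not anything
taken from the proposal or from the spec.

    /* Illustrative only -- not part of the proposal or of the virtio spec. */
    #include <stdint.h>
    #include <stdbool.h>

    #define LOG_RING_ENTRIES 512    /* arbitrary; PML itself uses a 512-entry buffer */

    struct write_log_ring {
        uint64_t addr[LOG_RING_ENTRIES]; /* one dirtied page address per entry */
        uint32_t producer;               /* advanced by the device */
        uint32_t consumer;               /* advanced by the driver after draining */
    };

    /* Device side: record one dirtied page.  Returns false when the ring is
     * full, at which point the device would pause DMA and notify the driver
     * (the analogue of the PML-full VM exit). */
    static bool log_dirty_page(struct write_log_ring *r, uint64_t page_addr)
    {
        if (r->producer - r->consumer == LOG_RING_ENTRIES)
            return false;
        r->addr[r->producer % LOG_RING_ENTRIES] = page_addr;
        r->producer++;
        return true;
    }
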
> > > > >
> > > > > What is common between this and PML is that you get the addresses
> > > > > directly without scanning megabytes of bitmaps or worse - hundreds
> > > > > of megabytes of page tables.
> > > >
> > > > Yes, it has overhead but this is the method we use for vhost and KVM
> > (earlier).
> > > >
> > > > To me the important advantage of PML is that it uses limited
> > > > resources on the host which
> > > >
> > > > 1) doesn't require resources in the device
> > > > 2) uses resources that don't grow as the guest memory increases (but this
> > > > advantage exists in neither this proposal nor the bitmap approach)
> > >
> > > it seems 2 exactly exists here.
> >
> > Actually not, Parav said the device needs to reserve sufficient resources in
> > another thread.
> The device resource reservation starts only when the device migration starts.
> i.e. with WRITE_RECORDS_START command of patch 7 in the series.

Right, but this is not the question, see below.

>
> >
> > >
> > >
> > > > >
> > > > > The data structure is different but I don't see why it is critical.
> > > > >
> > > > > I agree that I don't see out of buffers notifications too which
> > > > > implies device has to maintain something like a bitmap internally.
> > > > > Which I guess could be fine but it is not clear to me how large
> > > > > that bitmap has to be. How does the device know? Needs to be addressed.
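
As a rough sizing illustration, assuming a plain bit-per-4 KiB-page map (which
the proposal does not spell out, so the numbers are only indicative):

    bitmap_size = guest_memory / page_size / 8
                = 1 TiB / 4 KiB / 8  = 32 MiB of device state per 1 TiB tracked
                  (a 256 GiB guest would need about 8 MiB)
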
> > > >
> > > > This is the question I asked Parav in another thread. Using host
> > > > memory as a queue with notification (like PML) might be much better.
> > >
> > > Well if queue is what you want to do you can just do it internally.
> >
> > Then it's not the proposal here, Parav has explained it in another reply, and as
> > explained it lacks a lot of other facilities.
> >
> PML is yet another option that requires small pci writes.
> In the current proposal, there are no small PCI writes.
> It is a query interface from the device.

Well, you've explained in another thread that actually it needs small
PCI writes.

E.g during IOTLB invalidation ...

>
> > > Problem of course is that it might overflow and cause things like
> > > packet drops.
> >
> > Exactly like PML. So sticking to wire speed should not be a general goal in the
> > context of migration. It can be done if the speed of the migration interface is
> > faster than the virtio device that needs to be migrated.
> May not have to be.
> Speed of page recording should be fast enough.
> It usually improves with subsequent generation.

If you have something better, let's propose it from the start.

> >
> > >
> > >
> > > > >
> > > > >
> > > > > > Even if we manage to do that, it doesn't mean we won't have issues.
> > > > > >
> > > > > > 1) For many reasons it can neither see nor log via GPA, so this
> > > > > > requires a traversal of the vIOMMU mapping tables by the
> > > > > > hypervisor afterwards; that would be expensive and would need
> > > > > > synchronization with the guest's modifications of the IO page table,
> > which looks very hard.
> > > > >
> > > > > vIOMMU is fast enough to be used on data path but not fast enough
> > > > > for dirty tracking?
> > > >
> > > > We set up SPTEs or use nesting offloading, where the PTEs can be
> > > > iterated by hardware directly, which is fast.
> > >
> > > There's a way to have hardware find dirty PTEs for you quickly?
> >
> > Scanning PTEs on the host is faster and more secure than scanning guests, that's
> > what I want to say:
> >
> > 1) the guest page could be swapped out but not the host one.
> > 2) no guest triggerable behavior
> >
>
> The device's page tracking table is to be consulted and flushed on a mapping change.
>
> > > I don't know how it's done. Do tell.
> > >
> > >
> > > > This is not the case here where software needs to iterate the IO
> > > > page tables in the guest which could be slow.
> > > >
> > > > > Hard to believe.  If true and you want to speed up vIOMMU then you
> > > > > implement an efficient datastructure for that.
> > > >
> > > > Besides the issue of performance, it's also racy, assuming we are logging
> > IOVA.
> > > >
> > > > 0) device logs the IOVA
> > > > 1) hypervisor fetches the IOVA from the log buffer
> > > > 2) guest remaps the IOVA to a new GPA
> > > > 3) hypervisor traverses the guest table and resolves the IOVA to the new GPA
> > > >
> > > > Then we have lost the old GPA.
> > >
> > > Interesting and a good point.
> >
> > Note that PML logs at GPA as it works at L1 of EPT.
> >
> > > And by the way e.g. vhost has the same issue.  You need to flush dirty
> > > tracking info when changing the mappings somehow.
> >
> > It's not,
> >
> > 1) memory translation is done by vhost
> > 2) vhost knows GPA and it doesn't log via IOVA.
> >
> > See this for example, and DPDK has similar fixes.
> >
> > commit cc5e710759470bc7f3c61d11fd54586f15fdbdf4
> > Author: Jason Wang <jasowang@redhat.com>
> > Date:   Wed Jan 16 16:54:42 2019 +0800
> >
> >     vhost: log dirty page correctly
> >
> >     Vhost dirty page logging API is designed to sync through GPA. But we
> >     try to log GIOVA when device IOTLB is enabled. This is wrong and may
> >     lead to missing data after migration.
> >
> >     To solve this issue, when logging with device IOTLB enabled, we will:
> >
> >     1) reuse the device IOTLB translation result of GIOVA->HVA mapping to
> >        get HVA, for writable descriptor, get HVA through iovec. For used
> >        ring update, translate its GIOVA to HVA
> >     2) traverse the GPA->HVA mapping to get the possible GPA and log
> >        through GPA. Pay attention this reverse mapping is not guaranteed
> >        to be unique, so we should log each possible GPA in this case.
> >
> >     This fix the failure of scp to guest during migration. In -next, we
> >     will probably support passing GIOVA->GPA instead of GIOVA->HVA.
> >
> >     Fixes: 6b1e6cc7855b ("vhost: new device IOTLB API")
> >     Reported-by: Jintack Lim <jintack@cs.columbia.edu>
> >     Cc: Jintack Lim <jintack@cs.columbia.edu>
> >     Signed-off-by: Jason Wang <jasowang@redhat.com>
> >     Acked-by: Michael S. Tsirkin <mst@redhat.com>
> >     Signed-off-by: David S. Miller <davem@davemloft.net>
> >
> > All of the above is not what virtio did right now.
> >
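
For reference, the scheme that commit describes -- translating the write back
from HVA to every GPA that maps to it and logging through GPA -- boils down to
something like the simplified sketch below. It is not the actual vhost code;
the structures and names are condensed for illustration, and regions are
assumed to be page-aligned.

    #include <stdint.h>
    #include <stddef.h>

    /* Simplified view of one slot of the guest memory table (GPA <-> HVA). */
    struct mem_region {
        uint64_t gpa;
        uint64_t hva;
        uint64_t size;
    };

    /* Marks one guest page dirty in the migration log (bitmap, ring, ...; not shown). */
    void log_gpa(uint64_t gpa);

    /* Log a write of 'len' bytes at host virtual address 'hva' through GPA:
     * walk the GPA->HVA table and log every GPA whose region overlaps the
     * written range, since the reverse mapping is not guaranteed to be
     * unique, exactly as the commit message notes. */
    static void log_write_hva(const struct mem_region *regions, size_t nregions,
                              uint64_t hva, uint64_t len)
    {
        for (size_t i = 0; i < nregions; i++) {
            const struct mem_region *r = &regions[i];

            if (hva + len <= r->hva || hva >= r->hva + r->size)
                continue;   /* no overlap with this region */

            uint64_t start = hva > r->hva ? hva : r->hva;
            uint64_t end   = hva + len < r->hva + r->size ? hva + len
                                                          : r->hva + r->size;

            /* iterate page by page, aligned within the page-aligned region */
            for (uint64_t a = start & ~(uint64_t)4095; a < end; a += 4096)
                log_gpa(r->gpa + (a - r->hva));
        }
    }
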
> > > Parav what's the plan for this? Should be addressed in the spec too.
> > >
> >
> > AFAIK, there's no easy/efficient way to do that. I hope I was wrong.
> >
>
> The query interface in this proposal reads and clears at a granular boundary.
> This will ensure that the mapping stays consistent.
>
> > >
> > >
> > > > >
> > > > > > 2) There are a lot of special or reserved IOVA ranges (for
> > > > > > example the interrupt areas in x86) that need special care. This
> > > > > > is architectural and beyond the scope or knowledge of the virtio
> > > > > > device, but not of the platform IOMMU. Things would be
> > > > > > more complicated when SVA is enabled.
> > > > >
> > > > > SVA being what here?
> > > >
> > > > For example, IOMMU may treat interrupt ranges differently depending
> > > > on whether SVA is enabled or not. It's very hard and unnecessary to
> > > > teach devices about this.
> > >
> > > Oh, shared virtual memory. So what are you saying here? virtio does
> > > not care, it just uses some addresses and if you want it to it can
> > > record writes somewhere.
> >
> > One example: PCI allows devices to send translated requests, so how can a
> > hypervisor know whether a logged address is a PA or an IOVA in this case? We
> > probably need a new bit. But it's not the only thing we need to deal with.
> >
> > By definition, interrupt ranges and other reserved ranges should not be reported
> > as dirty pages. And the logging would have to be done before the DMA, where there
> > is no way for the device to know whether an IOVA is valid or not. It would be
> > safer to just not report them from the source instead of leaving it to the
> > hypervisor to deal with, but this seems impossible at the device level. Otherwise
> > the hypervisor driver needs to communicate with the (v)IOMMU to learn about the
> > interrupt (MSI) area, RMRR area etc. in order to do the correct thing, or it might
> > have security implications. And those areas don't make sense at L1 when vSVA
> > is enabled. What's more, when the vIOMMU could be fully offloaded, there's no
> > easy way to fetch that information.
> >
> There cannot be logging before the DMA.

Well, I don't see how this is related to the issue above. Logging
after the DMA doesn't mean the device can know what sits behind an
IOVA, no?

> The only requirement is that the dirty page tracking is synced before the mapping changes.
>
> In most common cases where the perf is critical, such a mapping won't change so often dynamically anyway.

I've explained the issue in another reply.
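
To spell out the ordering being discussed here (sync the device's dirty log
before the mapping changes), a hypervisor-side unmap path would have to look
roughly like the sketch below. The helper names are hypothetical, not commands
from the proposal or existing APIs.

    #include <stdint.h>

    /* Hypothetical helpers -- placeholders for this illustration only. */
    void query_and_clear_device_dirty_log(uint64_t iova, uint64_t len);
    void mark_gpas_dirty_for_range(uint64_t iova, uint64_t len); /* via the old IOVA->GPA mapping */
    void remove_iova_mapping(uint64_t iova, uint64_t len);
    void invalidate_device_iotlb(uint64_t iova, uint64_t len);

    /* Called when the guest unmaps or remaps [iova, iova + len). */
    static void viommu_unmap_range(uint64_t iova, uint64_t len)
    {
        /* 1. Drain the device's dirty records for the range while the old
         *    IOVA->GPA translation is still known... */
        query_and_clear_device_dirty_log(iova, len);

        /* 2. ...and resolve them to GPAs now; after the remap the old GPA
         *    can no longer be recovered (the race described earlier). */
        mark_gpas_dirty_for_range(iova, len);

        /* 3. Only then tear down the mapping and invalidate the IOTLB.
         *    Steps 1 and 3 are the per-invalidation device interactions
         *    ("small PCI writes") being debated in this thread. */
        remove_iova_mapping(iova, len);
        invalidate_device_iotlb(iova, len);
    }
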

>
> > Again, it's hard to bypass or even duplicate the functionality of the platform;
> > otherwise we need to step into every single detail of a specific transport,
> > architecture or IOMMU to figure out whether or not logging at the virtio level
> > is correct, which is awkward and unrealistic. This proposal suffers from a very
> > similar issue when inventing things like freeze/stop, where I've pointed out
> > other branches of issues as well.
> >
> It is an incorrect attribution that the platform is duplicated here.
> The device feeds the data to the platform as needed without replicating it.
>
> I do agree that there is overlap between the IOMMU tracking dirty pages and storing them in the PTEs vs. the device supplying its dirty tracking via its own interface.
> Both are consolidated at the hypervisor level.
>
> > >
> > > > >
> > > > > > And there could be other architecture-specific knowledge (e.g.
> > > > > > PAGE_SIZE) that might be needed. There's no easy way to deal
> > > > > > with those cases.
> > > > >
> > > > > Good point about page size actually - using 4k unconditionally is
> > > > > a waste of resources.
> > > >
> > > > Actually, there is more than just PAGE_SIZE; for example, PASID and others.
> > >
> > > what does pasid have to do with it? anyway, just give driver control
> > > over page size.
> >
> > For example, two virtqueues have two PASIDs assigned. How can a hypervisor
> > know which specific IOVA belongs to which PASID? For the platform IOMMU this
> > is handy, as it talks to the transport. But I don't think we need to duplicate
> > every transport-specific address space feature in the core virtio layer:
> >
> PASID to vq assignment won't be duplicated.
> It is configured fully by the guest without consulting the hypervisor at the device level.
> The guest IOMMU would consult the hypervisor to set up any PASID mapping as part of any mapping method.
>
> > 1) translated/untranslated request
> > 2) request w/ and w/o PASID
> >
> > >
> > > > >
> > > > >
> > > > > > We wouldn't need to care about all of them if it is done at
> > > > > > platform IOMMU level.
> > > > >
> > > > > If someone logs at IOMMU level then nothing needs to be done in
> > > > > the spec at all. This is about capability at the device level.
> > > >
> > > > True, but my question is whether or not it can be done at the device level
> > easily.
> > >
> > > there's no "easily" about live migration ever.
> >
> > I think I've stated sufficient issues to demonstrate how hard it is for virtio
> > to do it. And I've given the link showing that it is possible to do that in the
> > IOMMU without those issues. So in this context doing it in virtio is much harder.
> >
> > > For example on-device iommus are a thing.
> >
> > I'm not sure that's the way to go considering the platform IOMMU evolves very
> > quickly.
> >
> > >
> > > > >
> > > > >
> > > > > > > what Lingshan
> > > > > > > proposed is analogous to bit per page - problem unfortunately
> > > > > > > is you can't easily set a bit by DMA.
> > > > > > >
> > > > > >
> > > > > > I'm not saying bit/bytemap is the best, but it has been used by
> > > > > > real hardware. And we have many other options.
> > > > > >
> > > > > > > So I think this dirty tracking is a good option to have.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > >
> > > > > > > > > > > i.e. in first year of 2024?
> > > > > > > > > >
> > > > > > > > > > Why does it matter in 2024?
> > > > > > > > > Because users need to use it now.
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > If not, we are better off offering this, and when/if
> > > > > > > > > > > platform support arrives, sure,
> > > > > > > > > > this feature can be disabled/not used/not enabled.
> > > > > > > > > > >
> > > > > > > > > > > > 4) if the platform support is missing, we can use
> > > > > > > > > > > > software or leverage transport for assistance like
> > > > > > > > > > > > PRI
> > > > > > > > > > > All of these are in theory.
> > > > > > > > > > > Our experiment shows PRI performance is 21x slower
> > > > > > > > > > > than page fault rate
> > > > > > > > > > done by the cpu.
> > > > > > > > > > > It simply does not even pass a simple 10Gbps test.
> > > > > > > > > >
> > > > > > > > > > If you stick to the wire speed during migration, it can converge.
> > > > > > > > > Do you have perf data for this?
> > > > > > > >
> > > > > > > > No, but it's not hard to imagine the worst case: write a
> > > > > > > > small program that dirties every page via a NIC.
> > > > > > > >
> > > > > > > > > In the internal tests we don't see this happening.
> > > > > > > >
> > > > > > > > downtime = dirty_rate * PAGE_SIZE / migration_speed
> > > > > > > >
> > > > > > > > So if we get a very high dirty rate (e.g. from a high speed
> > > > > > > > NIC), we can't satisfy the downtime requirement. Or, if you do
> > > > > > > > see it converge, you might be getting help from the auto-converge
> > > > > > > > support in hypervisors like KVM, which tries to throttle the
> > > > > > > > vCPU, and then you can't reach the wire speed.
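
A rough worked example of that formula, with purely illustrative numbers:

    dirty_rate      ~ a 100 Gbit/s NIC writing 4 KiB pages
                    ~ 12.5 GB/s / 4 KiB  ~= 3 million pages/s
    migration_speed = 10 Gbit/s = 1.25 GB/s

    downtime ~ 3e6 pages/s * 4 KiB / 1.25 GB/s ~= 10

i.e. every second of guest runtime dirties roughly ten seconds' worth of
migration traffic, so pre-copy cannot converge unless either the workload or
the device is throttled.
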
> > > > > > >
> > > > > > > Will only work for some device types.
> > > > > > >
> > > > > >
> > > > > > Yes, that's the point. Parav said he doesn't see the issue; it's
> > > > > > probably because he is testing virtio-net and so the vCPU is
> > > > > > automatically throttled. It doesn't mean it can work for other
> > > > > > virtio devices.
> > > > >
> > > > > Only for TX, and I'm pretty sure they had the foresight to test RX
> > > > > not just TX but let's confirm. Parav did you test both directions?
> > > >
> > > > RX speed somehow depends on the speed of refill, so throttling helps
> > > > more or less.
> > >
> > > It doesn't depend on the speed of refill; you just underrun and drop
> > > packets. Then your nice 10 usec latency becomes more like 10 sec.
> >
> > I'm missing your point here. If the driver can't achieve wire speed without dirty
> > page tracking, it can't do so when dirty page tracking is enabled either.
> >
> > >
> > > > >
> > > > > > >
> > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > There is no requirement for mandating PRI either.
> > > > > > > > > > > So it is unusable.
> > > > > > > > > >
> > > > > > > > > > It's not about mandating, it's about doing things in the
> > > > > > > > > > correct layer. If PRI is slow, PCI can evolve for sure.
> > > > > > > > > You should try.
> > > > > > > >
> > > > > > > > Not my duty; I just want to make sure things are done in the
> > > > > > > > correct layer, and that once it needs to be done in virtio,
> > > > > > > > there's nothing obviously wrong.
> > > > > > >
> > > > > > > Yea but just vague questions don't help to make sure either way.
> > > > > >
> > > > > > I don't think it's vague. I have explained that if something in
> > > > > > virtio slows down PRI, we can try to fix it.
> > > > >
> > > > > I don't believe you are going to make PRI fast. No one managed so far.
> > > >
> > > > So it's the fault of PRI, not virtio, but it doesn't mean we need to
> > > > do it in virtio.
> > >
> > > I keep saying with this approach we would just say "e1000 emulation is
> > > slow and encumbered this is the fault of e1000" and never get virtio
> > > at all.  Assigning blame only gets you so far.
> >
> > I think we are discussing different things. My point is virtio needs to leverage
> > the functionality provided by transport or platform (especially considering they
> > evolve faster than virtio). It seems to me it's hard even to duplicate some basic
> > function of platform IOMMU in virtio.
> >
> Not duplicated. Feeding into the platform.
>
> > >
> > > > >
> > > > > > Missing functions in the
> > > > > > platform or transport are not a good excuse to try to work around
> > > > > > them in virtio. It's a layer violation and we never had any
> > > > > > feature like this in the past.
> > > > >
> > > > > Yes missing functionality in the platform is exactly why virtio
> > > > > was born in the first place.
> > > >
> > > > Well the platform can't do device-specific logic. But that's not the
> > > > case for dirty page tracking, which is agnostic to device logic.
> > >
> > > Not true, platforms have had things like NICs on board for many
> > > years. It's about performance really.
> >
> > I've stated sufficient issues above. And one more obvious issue for device
> > initiated page logging is that it needs a lot of extra or unnecessary PCI
> > transactions which will throttle the performance of the whole system (and lead
> > to other issues like QOS). So I can't believe it has good performance overall.
> > Logging via IOMMU or using shadow virtqueue doesn't need any extra PCI
> > transactions at least.
> >
> In the current proposal, it does not require PCI transactions, as there is only a hypervisor-initiated query interface.

Such query requires at least several transactions, no?

Or to make things more clear, could you list the steps how a
hypervisor is expected to do the querying?

> It is a trade-off between using SVQ + PASID vs. using something from the device.
>
> Again, both have different use cases and value. One uses the CPU and one uses the device.

That's for sure.

> Depending on how much power one wants to spend where.

But as the author of the proposal, you need to elaborate more on how
you expect a hypervisor to use it instead of letting the reviewer guess.

Thanks



>
> > > So I'd like Parav to publish some
> > > experiment results and/or some estimates.
> > >
> >
> > That's fine, but the above equation (used by QEMU) is sufficient to demonstrate
> > how hard it is to stick to wire speed in this case.
> >
> > >
> > > > >
> > > > > > >
> > > > > > > > > In the current state, it is mandating.
> > > > > > > > > And if you think PRI is the only way,
> > > > > > > >
> > > > > > > > I don't, it's just an example where virtio can leverage
> > > > > > > > either the transport or the platform. Or if it's a fault in virtio
> > > > > > > > that slows down the PRI, then it is something we can address.
> > > > > > > >
> > > > > > > > >  then you should propose that in the dirty page tracking series that
> > you listed above, to not do dirty page tracking. Rather depend on PRI, right?
> > > > > > > >
> > > > > > > > No, the point is to not duplicate works especially
> > > > > > > > considering virtio can't do better than platform or transport.
> > > > > > >
> > > > > > > If someone says they tried and platform's migration support
> > > > > > > does not work for them and they want to build a solution in
> > > > > > > virtio then what exactly is the objection?
> > > > > >
> > > > > > The discussion is to make sure whether virtio can do this easily
> > > > > > and correctly, then we can have a conclusion. I've stated some
> > > > > > issues above, and I've asked other questions related to them
> > > > > > which are still not answered.
> > > > > >
> > > > > > I think we had a very hard time in bypassing IOMMU in the past
> > > > > > that we don't want to repeat.
> > > > > >
> > > > > > We've gone through several methods of logging dirty pages in the
> > > > > > past (each with pros/cons), but this proposal never explains why
> > > > > > it chooses one of them but not others. Spec needs to find the
> > > > > > best path instead of just a possible path without any rationale about
> > why.
> > > > >
> > > > > Adding more rationale isn't a bad thing.
> > > > > In particular if platform supplies dirty tracking then how does
> > > > > driver decide which to use platform or device capability?
> > > > > A bit of discussion around this is a good idea.
> > > > >
> > > > >
> > > > > > > virtio is here in the
> > > > > > > first place because emulating devices didn't work well.
> > > > > >
> > > > > > I don't understand here. We have supported emulated devices for years.
> > > > > > I'm pretty sure a lot of issues could be uncovered if this
> > > > > > proposal can be prototyped with an emulated device first.
> > > > > >
> > > > > > Thanks
> > > > >
> > > > > virtio was originally PV as opposed to emulation. That there's now
> > > > > hardware virtio and you call software implementation "an
> > > > > emulation" is very meta.
> > > >
> > > > Yes but I don't see how it relates to dirty page tracking. When we
> > > > find a way it should work for both software and hardware devices.
> > > >
> > > > Thanks
> > >
> > > It has to work well on a variety of existing platforms. If it does
> > > then sure, why would we roll our own.
> >
> > If virtio can do that in an efficient way without any issues, I agree.
> > But it seems not.
> >
> > Thanks


