Subject: Re: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
On Thu, Nov 16, 2023 at 1:42 AM Parav Pandit <parav@nvidia.com> wrote:
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Monday, November 13, 2023 9:02 AM
> >
> > On Thu, Nov 9, 2023 at 3:59 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> > >
> > > On Thu, Nov 09, 2023 at 11:31:27AM +0800, Jason Wang wrote:
> > > > On Wed, Nov 8, 2023 at 4:17 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> > > > >
> > > > > On Wed, Nov 08, 2023 at 12:28:36PM +0800, Jason Wang wrote:
> > > > > > On Tue, Nov 7, 2023 at 3:05 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> > > > > > >
> > > > > > > On Tue, Nov 07, 2023 at 12:04:29PM +0800, Jason Wang wrote:

> > > > > > > > > > > Each virtio and non virtio device that wants to report its dirty pages will do it its own way.

> > > > > > > > > > > > 3) inventing it in the virtio layer will be deprecated in the future for sure, as the platform will provide much richer features for logging, e.g. it can do it per PASID etc. I don't see any reason virtio needs to compete with the features that will be provided by the platform

> > > > > > > > > > > Can you bring the cpu vendors and their commitment to the virtio tc with timelines so that the virtio TC can omit it?

> > > > > > > > > > Why do we need to bring CPU vendors into the virtio TC?
> > > > > > > > > >
> > > > > > > > > > Virtio needs to be built on top of the transport or platform. There's no need to duplicate their job. Especially considering that virtio can't do better than them.

> > > > > > > > > I wanted to see a strong commitment from the cpu vendors to support dirty page tracking.

> > > > > > > > The RFC of IOMMUFD support goes back to early 2022. Intel, AMD and ARM are all supporting that now.

> > > > > > > > > And the work seems to have started for some platforms.

> > > > > > > > Let me quote from the above link:
> > > > > > > >
> > > > > > > > """
> > > > > > > > Today, AMD Milan (or more recent) supports it while ARM
> > > > > > > > SMMUv3.2 alongside VT-D rev3.x also do support.
> > > > > > > > """

> > > > > > > > > Without such platform commitment, virtio also skipping it would not work.

> > > > > > > > Is the above sufficient? I'm a little bit more familiar with vtd, where the hw feature has been there for years.

> > > > > > > Repeating myself - I'm not sure that will work well for all workloads.

> > > > > > I think this comment applies to this proposal as well.

> > > > > Yes - some systems might be better off with platform tracking. And I think supporting shadow vq better would be nice too.

> > > > For shadow vq, did you mean the work that is done by Eugenio?

> > > Yes.

> > That's exactly why vDPA starts with shadow virtqueue. We've evaluated various possible approaches; each of them has its shortcomings, and shadow virtqueue is the only one that doesn't require any additional hardware features to work on every platform.

> > > > > > > Definitely KVM did not scan PTEs. It used pagefaults with bit per page and later, as VM size grew, switched to PLM.
> > > > > > > This interface is analogous to PLM,

> > > > > > I think you meant PML actually. And it doesn't work like PML. To behave like PML it needs to
> > > > > >
> > > > > > 1) organize log buffers as a queue with indices
> > > > > > 2) have the device suspend (as a #vmexit in PML) if it runs out of buffers
> > > > > > 3) have the device send a notification to the driver if it runs out of buffers
> > > > > >
> > > > > > I don't see any of the above in this proposal. If we do that it would be less problematic than what is being proposed here.

> > > > > What is common between this and PML is that you get the addresses directly without scanning megabytes of bitmaps or worse - hundreds of megabytes of page tables.

> > > > Yes, it has overhead but this is the method we use for vhost and KVM (earlier).
> > > >
> > > > To me the important advantage of PML is that it uses limited resources on the host which
> > > >
> > > > 1) doesn't require resources in the device
> > > > 2) doesn't scale as the guest memory increases (but this advantage exists in neither this proposal nor the bitmap)

> > > it seems 2 exactly exists here.

> > Actually not, Parav said the device needs to reserve sufficient resources in another thread.

> The device resource reservation starts only when the device migration starts.
> i.e. with the WRITE_RECORDS_START command of patch 7 in the series.

Right, but this is not the question, see below.

> > > > > The data structure is different but I don't see why it is critical.
> > > > >
> > > > > I agree that I don't see out of buffers notifications too, which implies the device has to maintain something like a bitmap internally. Which I guess could be fine but it is not clear to me how large that bitmap has to be. How does the device know? Needs to be addressed.

> > > > This is the question I asked Parav in another thread. Using host memory as a queue with notification (like PML) might be much better.

> > > Well if queue is what you want to do you can just do it internally.

> > Then it's not the proposal here, Parav has explained it in another reply, and as explained it lacks a lot of other facilities.

> PML is yet another option that requires small PCI writes.
> In the current proposal, there are no small PCI writes.
> It is a query interface from the device.

Well, you've explained in another thread that actually it needs small PCI writes. E.g. during IOTLB invalidation ...

> > > Problem of course is that it might overflow and cause things like packet drops.

> > Exactly like PML. So sticking to wire speed should not be a general goal in the context of migration. It can be done if the speed of the migration interface is faster than the virtio device that needs to be migrated.

> May not have to be.
> Speed of page recording should be fast enough.
> It usually improves with subsequent generations.

If you have something better, let's propose it from the start.
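For reference, the PML-like scheme I keep referring to would look roughly like the sketch below: a log ring in host memory that the device fills and the driver drains, with a suspend-and-notify path on overflow. All names here are made up for illustration; none of this is from Parav's series or the virtio spec.

#include <stdint.h>

#define LOG_RING_ENTRIES 512

/* A dirty-address log ring living in host memory. */
struct dirty_log_ring {
        uint64_t addr[LOG_RING_ENTRIES]; /* dirtied page addresses */
        uint32_t producer;               /* advanced by the device */
        uint32_t consumer;               /* advanced by the driver */
};

/* Stub: stop the offending virtqueue and raise an interrupt. */
static void device_suspend_and_notify(void)
{
}

/* Device-side step, run for each page the device writes via DMA. */
static int device_log_write(struct dirty_log_ring *ring, uint64_t page_addr)
{
        uint32_t next = (ring->producer + 1) % LOG_RING_ENTRIES;

        if (next == __atomic_load_n(&ring->consumer, __ATOMIC_ACQUIRE)) {
                /*
                 * Ring full: the analogue of a PML-full #vmexit. The
                 * device suspends and notifies the driver to drain the
                 * ring, so no address is ever lost.
                 */
                device_suspend_and_notify();
                return -1;
        }

        ring->addr[ring->producer] = page_addr;
        /* Publish the entry before the new producer index. */
        __atomic_store_n(&ring->producer, next, __ATOMIC_RELEASE);
        return 0;
}

The overflow path is what bounds the device-side resources: the device only needs the ring and two indices, not a bitmap that grows with guest memory.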
> > > > > > Even if we manage to do that, it doesn't mean we won't have issues.
> > > > > >
> > > > > > 1) For many reasons it can neither see nor log via GPA, so this requires a traversal of the vIOMMU mapping tables by the hypervisor afterwards, it would be expensive and need synchronization with the guest modification of the IO page table which looks very hard.

> > > > > vIOMMU is fast enough to be used on data path but not fast enough for dirty tracking?

> > > > We set up SPTEs or use nesting offloading where the PTEs could be iterated by hardware directly, which is fast.

> > > There's a way to have hardware find dirty PTEs for you quickly?

> > Scanning PTEs on the host is faster and more secure than scanning guests, that's what I want to say:
> >
> > 1) the guest page could be swapped out but not the host one.
> > 2) no guest triggerable behavior

> Device page tracking table to be consulted to flush on mapping change.

> > > I don't know how it's done. Do tell.

> > > > This is not the case here where software needs to iterate the IO page tables in the guest, which could be slow.

> > > > > Hard to believe. If true and you want to speed up vIOMMU then you implement an efficient datastructure for that.

> > > > Besides the issue of performance, it's also racy, assuming we are logging IOVA.
> > > >
> > > > 0) device logs IOVA
> > > > 1) hypervisor fetches IOVA from log buffer
> > > > 2) guest maps IOVA to a new GPA
> > > > 3) hypervisor traverses the guest table to get IOVA to new GPA
> > > >
> > > > Then we lost the old GPA.

> > > Interesting and a good point.

> > Note that PML logs at GPA as it works at L1 of EPT.

> > > And by the way e.g. vhost has the same issue. You need to flush dirty tracking info when changing the mappings somehow.

> > It's not,
> >
> > 1) memory translation is done by vhost
> > 2) vhost knows GPA and it doesn't log via IOVA.
> >
> > See this for example, and DPDK has similar fixes.
> >
> > commit cc5e710759470bc7f3c61d11fd54586f15fdbdf4
> > Author: Jason Wang <jasowang@redhat.com>
> > Date:   Wed Jan 16 16:54:42 2019 +0800
> >
> >     vhost: log dirty page correctly
> >
> >     Vhost dirty page logging API is designed to sync through GPA. But we
> >     try to log GIOVA when device IOTLB is enabled. This is wrong and may
> >     lead to missing data after migration.
> >
> >     To solve this issue, when logging with device IOTLB enabled, we will:
> >
> >     1) reuse the device IOTLB translation result of GIOVA->HVA mapping to
> >        get HVA, for writable descriptor, get HVA through iovec. For used
> >        ring update, translate its GIOVA to HVA
> >     2) traverse the GPA->HVA mapping to get the possible GPA and log
> >        through GPA. Pay attention this reverse mapping is not guaranteed
> >        to be unique, so we should log each possible GPA in this case.
> >
> >     This fix the failure of scp to guest during migration. In -next, we
> >     will probably support passing GIOVA->GPA instead of GIOVA->HVA.
> >
> >     Fixes: 6b1e6cc7855b ("vhost: new device IOTLB API")
> >     Reported-by: Jintack Lim <jintack@cs.columbia.edu>
> >     Cc: Jintack Lim <jintack@cs.columbia.edu>
> >     Signed-off-by: Jason Wang <jasowang@redhat.com>
> >     Acked-by: Michael S. Tsirkin <mst@redhat.com>
> >     Signed-off-by: David S. Miller <davem@davemloft.net>
> >
> > All of the above is not what virtio did right now.
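In code, step 2) of that commit message boils down to something like the sketch below, reconstructed from the commit text. The types and helpers are illustrative, not the actual vhost source.

#include <stdint.h>

struct mem_region {
        uint64_t gpa;   /* guest physical start */
        uint64_t hva;   /* host virtual start */
        uint64_t size;
};

/* Stub: set the pages covering [gpa, gpa+len) in the dirty bitmap. */
static void log_dirty_gpa(uint64_t gpa, uint64_t len)
{
        (void)gpa;
        (void)len;
}

/*
 * Log a write known only by HVA (obtained from the GIOVA->HVA IOTLB
 * translation) by walking the GPA->HVA table in reverse. The reverse
 * mapping is not guaranteed to be unique: several GPA ranges may alias
 * one HVA range, so every match must be logged or the destination can
 * miss dirty pages (the scp-after-migration failure in the commit).
 */
static void log_write_hva(const struct mem_region *table, int n,
                          uint64_t hva, uint64_t len)
{
        for (int i = 0; i < n; i++) {
                const struct mem_region *r = &table[i];

                if (hva >= r->hva && hva + len <= r->hva + r->size)
                        log_dirty_gpa(r->gpa + (hva - r->hva), len);
        }
}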
> > > Parav what's the plan for this? Should be addressed in the spec too.

> > AFAIK, there's no easy/efficient way to do that.

I hope I was wrong.

> The query interface in this proposal works on a granular boundary to read and clear.
> This will ensure that the mapping stays consistent.

> > > > > > 2) There are a lot of special or reserved IOVA ranges (for example the interrupt areas in x86) that need special care which is architectural and where it is beyond the scope or knowledge of the virtio device but the platform IOMMU. Things would be more complicated when SVA is enabled.

> > > > > SVA being what here?

> > > > For example, IOMMU may treat interrupt ranges differently depending on whether SVA is enabled or not. It's very hard and unnecessary to teach devices about this.

> > > Oh, shared virtual memory. So what are you saying here? virtio does not care, it just uses some addresses and if you want it to it can record writes somewhere.

> > One example, PCI allows devices to send translated requests, how can a hypervisor know whether it's a PA or an IOVA in this case? We probably need a new bit. But it's not the only thing we need to deal with.
> >
> > By definition, interrupt ranges and other reserved ranges should not belong to dirty pages. And the logging should be done before the DMA, where there's no way for the device to know whether or not an IOVA is valid. It would be safer to just not report them from the source instead of leaving it to the hypervisor to deal with, but this seems impossible at the device level. Otherwise the hypervisor driver needs to communicate with the (v)IOMMU to learn the interrupt (MSI) area, RMRR area etc. in order to do the correct things, or it might have security implications. And those areas don't make sense at L1 when vSVA is enabled. What's more, when the vIOMMU could be fully offloaded, there's no easy way to fetch that information.

> There cannot be logging before the DMA.

Well, I don't see how this is related to the issue above. Logging after the DMA doesn't mean the device can know what sits behind an IOVA, no?

> The only requirement is that before the mapping changes, the dirty page tracking be synced.
>
> In most common cases where the perf is critical, such mappings won't change so often dynamically anyway.

I've explained the issue in another reply.

> > Again, it's hard to bypass or even duplicate the functionality of the platform, or we need to step into every single detail of a specific transport, architecture or IOMMU to figure out whether or not logging at virtio is correct, which is awkward and unrealistic. This proposal suffers from an exactly similar issue when inventing things like freeze/stop, where I've pointed out other branches of issues as well.

> It is incorrect attribution that the platform is duplicated here.
> It feeds the data to the platform as needed without replicating.
>
> I do agree that there is overlap between the IOMMU tracking dirty state and storing it in the per PTE vs the device supplying its dirty tracking via its own interface.
> Both are consolidated at the hypervisor level.

> > > > > > And there could be other architecture specific knowledge (e.g. PAGE_SIZE) that might be needed. There's no easy way to deal with those cases.

> > > > > Good point about page size actually - using 4k unconditionally is a waste of resources.
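As an aside, the translated-request and reserved-range points above, the page size point here, and the PASID point that follows all end up in the same place: the record format. To be unambiguous, each logged address would need to carry something like the layout below. The struct, field names and flags are my invention purely to illustrate the problem; nothing like this exists in the series or the spec today.

#include <stdint.h>

/* Hypothetical write record; not from the series or the spec. */
struct write_record {
        uint64_t addr;          /* meaning depends on the flags below */
        uint32_t len;
        uint32_t pasid;         /* needed once vqs can use distinct PASIDs */
        uint16_t flags;
#define WR_F_TRANSLATED  (1 << 0) /* addr is post-IOMMU (PA), not IOVA */
#define WR_F_PASID_VALID (1 << 1) /* pasid field is meaningful */
        uint16_t page_shift;    /* per-record page size, not a fixed 4k */
};

Without something like the translated flag, a hypervisor receiving an address from a device that issued translated requests cannot even tell which address space to walk.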
> > > > Actually, they are more than just PAGE_SIZE, for example, PASID and others.

> > > what does pasid have to do with it? anyway, just give driver control over page size.

> > For example, two virtqueues have two PASIDs assigned. How can a hypervisor know which specific IOVA belongs to which PASID? For the platform IOMMU, they are handy as it talks to the transport. But I don't think we need to duplicate every transport specific address space feature in the core virtio layer:

> PASID to vq assignment won't be duplicated.
> It is configured fully by the guest without consulting the hypervisor at the device level.
> Guest IOMMU would consult the hypervisor to set up any PASID mapping as part of any mapping method.

> > 1) translated/untranslated request
> > 2) request w/ and w/o PASID

> > > > > > We wouldn't need to care about all of them if it is done at the platform IOMMU level.

> > > > > If someone logs at IOMMU level then nothing needs to be done in the spec at all. This is about capability at the device level.

> > > > True, but my question is whether or not it can be done at the device level easily.

> > > there's no "easily" about live migration ever.

> > I think I've stated sufficient issues to demonstrate how hard it is for virtio to do it. And I've given the link showing that it is possible to do that in the IOMMU without those issues. So in this context doing it in virtio is much harder.

> > > For example on-device iommus are a thing.

> > I'm not sure that's the way to go considering the platform IOMMU evolves very quickly.

> > > > > > > what Lingshan proposed is analogous to bit per page - problem unfortunately is you can't easily set a bit by DMA.

> > > > > > I'm not saying bit/bytemap is the best, but it has been used by real hardware. And we have many other options.

> > > > > > > So I think this dirty tracking is a good option to have.

> > > > > > > > > > > i.e. in first year of 2024?

> > > > > > > > > > Why does it matter in 2024?

> > > > > > > > > Because users need to use it now.

> > > > > > > > > > > If not, we are better off to offer this, and when/if platform support is there, sure, this feature can be disabled/not used/not enabled.

> > > > > > > > > > > > 4) if the platform support is missing, we can use software or leverage transport for assistance like PRI

> > > > > > > > > > > All of these are in theory.
> > > > > > > > > > > Our experiment shows PRI performance is 21x slower than the page fault rate done by the cpu.
> > > > > > > > > > > It simply does not even pass a simple 10Gbps test.

> > > > > > > > > > If you stick to the wire speed during migration, it can converge.

> > > > > > > > > Do you have perf data for this?

> > > > > > > > No, but it's not hard to imagine the worst case. Write a small program that dirties every page via a NIC.

> > > > > > > > > In the internal tests we don't see this happening.
> > > > > > > > downtime = dirty_rates * PAGE_SIZE / migration_speed
> > > > > > > >
> > > > > > > > So if we get very high dirty rates (e.g. by a high speed NIC), we can't satisfy the requirement of the downtime. Or if you see it converge, you might get help from the auto converge support of hypervisors like KVM, where it tries to throttle the vCPU, and then you can't reach the wire speed.

> > > > > > > Will only work for some device types.

> > > > > > Yes, that's the point. Parav said he doesn't see the issue, it's probably because he is testing a virtio-net and so the vCPU is automatically throttled. It doesn't mean it can work for other virtio devices.

> > > > > Only for TX, and I'm pretty sure they had the foresight to test RX not just TX but let's confirm. Parav did you test both directions?

> > > > RX speed somehow depends on the speed of refill, so throttling helps more or less.

> > > It doesn't depend on speed of refill, you just underrun and drop packets. Then your nice 10usec latency becomes more like 10sec.

> > I miss your point here. If the driver can't achieve wire speed without dirty page tracking, it can neither when dirty page tracking is enabled.
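To put rough, assumed numbers on the downtime formula quoted above: a 100 Gbps NIC can dirty memory at about 12.5 GB/s, which at a 4 KiB PAGE_SIZE is roughly 3 million pages per second. If the migration channel is also 100 Gbps, memory is dirtied at least as fast as it can be copied, so the remaining dirty set never shrinks and pre-copy cannot converge without throttling something. For virtio-net TX that throttling happens for free via the vCPU; for device-initiated DMA such as RX, or for other device types, there is nothing for auto-converge to throttle.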
> > > > > > > > > > > There is no requirement for mandating PRI either.
> > > > > > > > > > > So it is unusable.

> > > > > > > > > > It's not about mandating, it's about doing things in the correct layer. If PRI is slow, PCI can evolve for sure.

> > > > > > > > > You should try.

> > > > > > > > Not my duty, I just want to make sure things are done in the correct layer, and once it needs to be done in virtio, there's nothing obviously wrong.

> > > > > > > Yea but just vague questions don't help to make sure either way.

> > > > > > I don't think it's vague, I have explained, if something in virtio slows down the PRI, we can try to fix it.

> > > > > I don't believe you are going to make PRI fast. No one managed so far.

> > > > So it's the fault of PRI, not virtio, but it doesn't mean we need to do it in virtio.

> > > I keep saying with this approach we would just say "e1000 emulation is slow and encumbered, this is the fault of e1000" and never get virtio at all. Assigning blame only gets you so far.

> > I think we are discussing different things. My point is virtio needs to leverage the functionality provided by the transport or platform (especially considering they evolve faster than virtio). It seems to me it's hard even to duplicate some basic functions of the platform IOMMU in virtio.

> Not duplicated. Feeding into the platform.

> > > > > > Missing functions in platform or transport is not a good excuse to try to work around it in virtio. It's a layer violation and we never had any feature like this in the past.

> > > > > Yes, missing functionality in the platform is exactly why virtio was born in the first place.

> > > > Well the platform can't do device specific logic. But that's not the case of dirty page tracking, which is device logic agnostic.

> > > Not true, platforms have things like NICs on board and have for many years. It's about performance really.

> > I've stated sufficient issues above. And one more obvious issue for device-initiated page logging is that it needs a lot of extra or unnecessary PCI transactions which will throttle the performance of the whole system (and lead to other issues like QOS). So I can't believe it has good performance overall. Logging via IOMMU or using shadow virtqueue doesn't need any extra PCI transactions, at least.

> In the current proposal, it does not require PCI transactions, as there is only a hypervisor-initiated query interface.

Such a query requires at least several transactions, no? Or to make things more clear, could you list the steps for how a hypervisor is expected to do the querying?

> It is a trade off of using svq + pasid vs using something from the device.
>
> Again, both have different use cases and value. One uses the cpu and one uses the device.

That's for sure.

> Depending how much power one wants to spend where..

But as the author of the proposal, you need to elaborate more on what you expect from a hypervisor instead of letting the reviewer guess.
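For example, the level of detail I'd expect, spelled out per iteration of the pre-copy phase (this is my guess at the flow, since the series doesn't say):

0) the hypervisor posts a read-and-clear query for an address range to the admin virtqueue, which costs at least one PCI write for the doorbell;
1) the device DMAs the accumulated write records into the response buffer and completes the command (an interrupt or a polled completion);
2) the hypervisor translates each reported address and sets the corresponding bits in the migration dirty bitmap;
3) repeat for every range, every sync round, until the remaining dirty set is small enough to stop the device and do the final copy.

Each round is doorbell writes plus completion handling plus the record DMA, which is the PCI traffic I was referring to.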
Thanks

> > > So I'd like Parav to publish some experiment results and/or some estimates.

> > That's fine, but the above equation (used by Qemu) is sufficient to demonstrate how hard it is to stick to wire speed in this case.

> > > > > > > > > In the current state, it is mandating.
> > > > > > > > > And if you think PRI is the only way,

> > > > > > > > I don't, it's just an example where virtio can leverage either the transport or the platform. Or if it's a fault in virtio that slows down the PRI, then it is something we can fix.

> > > > > > > > > then you should propose that in the dirty page tracking series that you listed above, to not do dirty page tracking. Rather depend on PRI, right?

> > > > > > > > No, the point is to not duplicate work, especially considering virtio can't do better than the platform or transport.

> > > > > > > If someone says they tried and the platform's migration support does not work for them and they want to build a solution in virtio then what exactly is the objection?

> > > > > > The discussion is to make sure whether virtio can do this easily and correctly, then we can have a conclusion. I've stated some issues above, and I've asked other questions related to them which are still not answered.
> > > > > >
> > > > > > I think we had a very hard time in bypassing the IOMMU in the past that we don't want to repeat.
> > > > > >
> > > > > > We've gone through several methods of logging dirty pages in the past (each with pros/cons), but this proposal never explains why it chooses one of them and not the others. The spec needs to find the best path instead of just a possible path, without any rationale about why.

> > > > > Adding more rationale isn't a bad thing.
> > > > > In particular, if the platform supplies dirty tracking then how does the driver decide whether to use the platform or the device capability? A bit of discussion around this is a good idea.

> > > > > > > virtio is here in the first place because emulating devices didn't work well.

> > > > > > I don't understand here. We have supported emulated devices for years. I'm pretty sure a lot of issues could be uncovered if this proposal were prototyped with an emulated device first.
> > > > > >
> > > > > > Thanks

> > > > > virtio was originally PV as opposed to emulation. That there's now hardware virtio and you call a software implementation "an emulation" is very meta.

> > > > Yes, but I don't see how it relates to dirty page tracking. When we find a way it should work for both software and hardware devices.
> > > >
> > > > Thanks

> > > It has to work well on a variety of existing platforms. If it does then sure, why would we roll our own.

> > If virtio can do that in an efficient way without any issues, I agree. But it seems not.
> >
> > Thanks