Subject: RE: [PATCH v3 6/8] admin: Add theory of operation for write recording commands
> From: Jason Wang <jasowang@redhat.com>
> Sent: Thursday, November 16, 2023 9:49 AM
>
> On Thu, Nov 16, 2023 at 1:42 AM Parav Pandit <parav@nvidia.com> wrote:
> > > From: Jason Wang <jasowang@redhat.com>
> > > Sent: Monday, November 13, 2023 9:02 AM
> > >
> > > On Thu, Nov 9, 2023 at 3:59 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> > > > On Thu, Nov 09, 2023 at 11:31:27AM +0800, Jason Wang wrote:
> > > > > On Wed, Nov 8, 2023 at 4:17 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> > > > > > On Wed, Nov 08, 2023 at 12:28:36PM +0800, Jason Wang wrote:
> > > > > > > On Tue, Nov 7, 2023 at 3:05 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> > > > > > > > On Tue, Nov 07, 2023 at 12:04:29PM +0800, Jason Wang wrote:
> > > > > > > > > > > > Each virtio and non-virtio device that wants to report its dirty pages will do it its own way.
> > > > > > > > > > > >
> > > > > > > > > > > > > 3) inventing it in the virtio layer will be deprecated in the future for sure, as the platform will provide much richer features for logging, e.g. it can do it per PASID etc. I don't see any reason virtio needs to compete with the features that will be provided by the platform
> > > > > > > > > > > >
> > > > > > > > > > > > Can you bring the cpu vendors and a commitment to the virtio TC with timelines so that the virtio TC can omit it?
> > > > > > > > > > >
> > > > > > > > > > > Why do we need to bring CPU vendors into the virtio TC? Virtio needs to be built on top of the transport or platform. There's no need to duplicate their job. Especially considering that virtio can't do better than them.
> > > > > > > > > >
> > > > > > > > > > I wanted to see a strong commitment from the cpu vendors to support dirty page tracking.
> > > > > > > > >
> > > > > > > > > The RFC of IOMMUFD support can go back to early 2022. Intel, AMD and ARM are all supporting that now.
> > > > > > > > >
> > > > > > > > > > And the work seems to have started for some platforms.
> > > > > > > > >
> > > > > > > > > Let me quote from the above link:
> > > > > > > > >
> > > > > > > > > """
> > > > > > > > > Today, AMD Milan (or more recent) supports it while ARM SMMUv3.2 alongside VT-D rev3.x also do support.
> > > > > > > > > """
> > > > > > > > >
> > > > > > > > > > Without such platform commitment, virtio also skipping it would not work.
> > > > > > > > >
> > > > > > > > > Is the above sufficient? I'm a little bit more familiar with vtd, the hw feature has been there for years.
> > > > > > > >
> > > > > > > > Repeating myself - I'm not sure that will work well for all workloads.
> > > > > > >
> > > > > > > I think this comment applies to this proposal as well.
> > > > > >
> > > > > > Yes - some systems might be better off with platform tracking. And I think supporting shadow vq better would be nice too.
> > > > >
> > > > > For shadow vq, did you mean the work that is done by Eugenio?
> > > >
> > > > Yes.
> > >
> > > That's exactly why vDPA starts with shadow virtqueue. We've evaluated various possible approaches; each of them has its shortcomings, and shadow virtqueue is the only one that doesn't require any additional hardware features to work on every platform.
> > > > > > > > Definitely KVM did not scan PTEs. It used pagefaults with bit per page and later as VM size grew switched to PLM. This interface is analogous to PLM,
> > > > > > >
> > > > > > > I think you meant PML actually. And it doesn't work like PML. To behave like PML it needs to
> > > > > > >
> > > > > > > 1) log buffers organized as a queue with indices
> > > > > > > 2) the device needs to suspend (as a #vmexit in PML) if it runs out of buffers
> > > > > > > 3) the device needs to send a notification to the driver if it runs out of buffers
> > > > > > >
> > > > > > > I don't see any of the above in this proposal. If we do that it would be less problematic than what is being proposed here.
> > > > > >
> > > > > > What is common between this and PML is that you get the addresses directly without scanning megabytes of bitmaps or worse - hundreds of megabytes of page tables.
> > > > >
> > > > > Yes, it has overhead but this is the method we use for vhost and KVM (earlier).
> > > > >
> > > > > To me the important advantage of PML is that it uses limited resources on the host which
> > > > >
> > > > > 1) doesn't require resources in the device
> > > > > 2) doesn't scale as the guest memory increases (but this advantage exists in neither this proposal nor the bitmap)
> > > >
> > > > it seems 2 exactly exists here.
> > >
> > > Actually not, Parav said the device needs to reserve sufficient resources in another thread.
> >
> > The device resource reservation starts only when the device migration starts, i.e. with the WRITE_RECORDS_START command of patch 7 in the series.
>
> Right, but this is not the question, see below.
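[Editorial note: to make the PML-style behavior described above concrete, here is a minimal sketch of such a log queue. The structure, sizes and names are purely illustrative assumptions, not part of this proposal or of any spec.]

#include <stdint.h>
#include <stdbool.h>

/* Illustrative only: a PML-like dirty-address log shared between device and driver. */
struct dirty_log_queue {
        uint64_t entries[512];  /* one address per page written by DMA */
        uint32_t producer;      /* advanced by the device after each logged write */
        uint32_t consumer;      /* advanced by the driver as it drains entries */
};

/* Device side, conceptually: on overflow it must suspend DMA and notify the driver. */
static bool log_write(struct dirty_log_queue *q, uint64_t addr)
{
        uint32_t next = (q->producer + 1) % 512;

        if (next == q->consumer)
                return false;   /* queue full: suspend and send a notification */
        q->entries[q->producer] = addr;
        q->producer = next;
        return true;
}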
> > > > > > > The data structure is different but I don't see why it is critical.
> > > > > >
> > > > > > I agree that I don't see out-of-buffers notifications either, which implies the device has to maintain something like a bitmap internally. Which I guess could be fine but it is not clear to me how large that bitmap has to be. How does the device know? Needs to be addressed.
> > > > >
> > > > > This is the question I asked Parav in another thread. Using host memory as a queue with notification (like PML) might be much better.
> > > >
> > > > Well if a queue is what you want to do you can just do it internally.
> > >
> > > Then it's not the proposal here, Parav has explained it in another reply, and as explained it lacks a lot of other facilities.
> >
> > PML is yet another option that requires small pci writes. In the current proposal, there are no small PCI writes. It is a query interface from the device.
>
> Well, you've explained in another thread that actually it needs small PCI writes.

No. There may be some misunderstanding.

> E.g during IOTLB invalidation ...

This is not part of the virtio interface.

> > > > Problem of course is that it might overflow and cause things like packet drops.
> > >
> > > Exactly like PML. So sticking to wire speed should not be a general goal in the context of migration. It can be done if the speed of the migration interface is faster than the virtio device that needs to be migrated.
> >
> > May not have to be. Speed of page recording should be fast enough. It usually improves with subsequent generations.
>
> If you have something better, let's propose it from the start.

It is unproven that the current proposal cannot be done by the device. So I prefer to do this incrementally when we get there.

> > > > > > > Even if we manage to do that, it doesn't mean we won't have issues.
> > > > > > >
> > > > > > > 1) For many reasons it can neither see nor log via GPA, so this requires a traversal of the vIOMMU mapping tables by the hypervisor afterwards; it would be expensive and would need synchronization with the guest's modification of the IO page table, which looks very hard.
> > > > > >
> > > > > > vIOMMU is fast enough to be used on the data path but not fast enough for dirty tracking?
> > > > >
> > > > > We set up SPTEs or use nesting offloading where the PTEs can be iterated by hardware directly, which is fast.
> > > >
> > > > There's a way to have hardware find dirty PTEs for you quickly?
> > >
> > > Scanning PTEs on the host is faster and more secure than scanning the guest, that's what I want to say:
> > >
> > > 1) the guest page could be swapped out but not the host one.
> > > 2) no guest-triggerable behavior
> >
> > Device page tracking table to be consulted to flush on mapping change.
>
> I don't know how it's done. Do tell.

> > > > > This is not the case here, where software needs to iterate the IO page tables in the guest, which could be slow.
> > > > > >
> > > > > > Hard to believe. If true and you want to speed up vIOMMU then you implement an efficient datastructure for that.
> > > > >
> > > > > Besides the issue of performance, it's also racy, assuming we are logging IOVA.
> > > > >
> > > > > 0) device logs IOVA
> > > > > 1) hypervisor fetches IOVA from log buffer
> > > > > 2) guest maps IOVA to a new GPA
> > > > > 3) hypervisor traverses guest table to get IOVA to new GPA
> > > > >
> > > > > Then we lost the old GPA.
> > > >
> > > > Interesting and a good point.
> > >
> > > Note that PML logs at GPA as it works at L1 of EPT.
> > > >
> > > > And by the way e.g. vhost has the same issue. You need to flush dirty tracking info when changing the mappings somehow.
> > >
> > > It's not,
> > >
> > > 1) memory translation is done by vhost
> > > 2) vhost knows GPA and it doesn't log via IOVA.
> > >
> > > See this for example, and DPDK has similar fixes.
> > >
> > > commit cc5e710759470bc7f3c61d11fd54586f15fdbdf4
> > > Author: Jason Wang <jasowang@redhat.com>
> > > Date:   Wed Jan 16 16:54:42 2019 +0800
> > >
> > >     vhost: log dirty page correctly
> > >
> > >     Vhost dirty page logging API is designed to sync through GPA. But we
> > >     try to log GIOVA when device IOTLB is enabled. This is wrong and may
> > >     lead to missing data after migration.
> > >
> > >     To solve this issue, when logging with device IOTLB enabled, we will:
> > >
> > >     1) reuse the device IOTLB translation result of GIOVA->HVA mapping to
> > >     get HVA, for writable descriptor, get HVA through iovec. For used
> > >     ring update, translate its GIOVA to HVA
> > >     2) traverse the GPA->HVA mapping to get the possible GPA and log
> > >     through GPA. Pay attention this reverse mapping is not guaranteed
> > >     to be unique, so we should log each possible GPA in this case.
> > >
> > >     This fix the failure of scp to guest during migration. In -next, we
> > >     will probably support passing GIOVA->GPA instead of GIOVA->HVA.
> > >
> > >     Fixes: 6b1e6cc7855b ("vhost: new device IOTLB API")
> > >     Reported-by: Jintack Lim <jintack@cs.columbia.edu>
> > >     Cc: Jintack Lim <jintack@cs.columbia.edu>
> > >     Signed-off-by: Jason Wang <jasowang@redhat.com>
> > >     Acked-by: Michael S. Tsirkin <mst@redhat.com>
> > >     Signed-off-by: David S. Miller <davem@davemloft.net>
> > >
> > > All of the above is not what virtio did right now.
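[Editorial note: the core idea of that vhost fix, reduced to a sketch. This is a simplified illustration only, not the actual kernel code: when a write is known by HVA, walk the GPA->HVA memory table and log every GPA that can map to it, since the reverse mapping is not guaranteed to be unique.]

#include <stdint.h>
#include <stddef.h>

struct mem_region { uint64_t gpa, hva, size; };

/* Log a written HVA range through GPA by checking every GPA->HVA region. */
static void log_write_hva(const struct mem_region *regions, size_t n,
                          uint64_t hva, uint64_t len,
                          void (*log_gpa)(uint64_t gpa, uint64_t len))
{
        for (size_t i = 0; i < n; i++) {
                const struct mem_region *r = &regions[i];
                uint64_t start = hva > r->hva ? hva : r->hva;
                uint64_t end   = (hva + len) < (r->hva + r->size) ?
                                 (hva + len) : (r->hva + r->size);

                if (start < end) /* region overlaps the written HVA range */
                        log_gpa(r->gpa + (start - r->hva), end - start);
        }
}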
> > > > Parav what's the plan for this? Should be addressed in the spec too.
> > >
> > > AFAIK, there's no easy/efficient way to do that. I hope I was wrong.
> >
> > The query interface in this proposal works on a granular boundary to read and clear. This will ensure that the mapping is consistent.
>
> > > > > > > 2) There are a lot of special or reserved IOVA ranges (for example the interrupt areas in x86) that need special care which is architectural and where it is beyond the scope or knowledge of the virtio device but the platform IOMMU. Things would be more complicated when SVA is enabled.
> > > > > >
> > > > > > SVA being what here?
> > > > >
> > > > > For example, the IOMMU may treat interrupt ranges differently depending on whether SVA is enabled or not. It's very hard and unnecessary to teach devices about this.
> > > >
> > > > Oh, shared virtual memory. So what are you saying here? virtio does not care, it just uses some addresses and if you want it to it can record writes somewhere.
> > >
> > > One example, PCI allows devices to send translated requests, how can a hypervisor know whether it's a PA or an IOVA in this case? We probably need a new bit. But it's not the only thing we need to deal with.
> > >
> > > By definition, interrupt ranges and other reserved ranges should not belong to dirty pages. And the logging should be done before the DMA, where there's no way for the device to know whether or not an IOVA is valid. It would be safer to just not report them from the source instead of leaving it to the hypervisor to deal with, but this seems impossible at the device level. Otherwise the hypervisor driver needs to communicate with the (v)IOMMU to learn the interrupt (MSI) area, RMRR area etc. in order to do the correct thing, or it might have security implications. And those areas don't make sense at L1 when vSVA is enabled. What's more, when the vIOMMU can be fully offloaded, there's no easy way to fetch that information.
> >
> > There cannot be logging before the DMA.
>
> Well, I don't see how this is related to the issue above. Logging after the DMA doesn't mean the device can know what sits behind an IOVA, no?

Device does not know what sits behind an IOVA. The IOVA to GPA mapping is maintained by the IOMMU. Hence, the device records them and the hypervisor queries them.

> > The only requirement is that before the mapping changes, the dirty page tracking is synced. In most common cases where the perf is critical, such mapping won't change so often dynamically anyway.
>
> I've explained the issue in another reply.
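[Editorial note: the ordering requirement stated above - drain the device's write records before an IOVA mapping is torn down - could look roughly like the sketch below. All function and type names here are hypothetical placeholders, not interfaces from this series.]

#include <stdint.h>

struct dev_ctx; /* hypothetical handle for the device being migrated */

extern void block_iova_range(struct dev_ctx *d, uint64_t iova, uint64_t len);
extern void query_write_records(struct dev_ctx *d, uint64_t iova, uint64_t len,
                                void (*mark_dirty)(uint64_t gpa, uint64_t len));
extern void iommu_unmap_range(struct dev_ctx *d, uint64_t iova, uint64_t len);
extern void mark_gpa_dirty(uint64_t gpa, uint64_t len);

/* Sync the device's write records while the old IOVA->GPA translation still exists. */
static void unmap_and_sync(struct dev_ctx *d, uint64_t iova, uint64_t len)
{
        block_iova_range(d, iova, len);                    /* no new writes land here */
        query_write_records(d, iova, len, mark_gpa_dirty); /* read-and-clear with old mapping */
        iommu_unmap_range(d, iova, len);                   /* only now complete the unmap */
}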
> > > Again, it's hard to bypass or even duplicate the functionality of the platform, or we need to step into every single detail of a specific transport, architecture or IOMMU to figure out whether or not logging at the virtio level is correct, which is awkward and unrealistic. This proposal suffers from an exactly similar issue when inventing things like freeze/stop, where I've pointed out other branches of issues as well.
> >
> > It is an incorrect attribution that the platform is duplicated here. It feeds the data to the platform as needed without replicating.
> >
> > I do agree that there is overlap between the IOMMU tracking dirty pages and storing them in the per-PTE bits vs the device supplying its dirty tracking via its own interface. Both are consolidated at the hypervisor level.
>
> > > > > > > And there could be other architecture-specific knowledge (e.g. PAGE_SIZE) that might be needed. There's no easy way to deal with those cases.
> > > > > >
> > > > > > Good point about page size actually - using 4k unconditionally is a waste of resources.
> > > > >
> > > > > Actually, they are more than just PAGE_SIZE, for example, PASID and others.
> > > >
> > > > what does pasid have to do with it? anyway, just give driver control over page size.
> > >
> > > For example, two virtqueues have two PASIDs assigned. How can a hypervisor know which specific IOVA belongs to which PASID? For the platform IOMMU, they are handy as it talks to the transport. But I don't think we need to duplicate every transport-specific address space feature in the core virtio layer:
> >
> > PASID to vq assignment won't be duplicated. It is configured fully by the guest without consulting the hypervisor at the device level. The guest IOMMU would consult the hypervisor to set up any PASID mapping as part of any mapping method.
>
> > > 1) translated/untranslated request
> > > 2) request w/ and w/o PASID
> > >
> > > > > > > We wouldn't need to care about all of them if it is done at the platform IOMMU level.
> > > > > >
> > > > > > If someone logs at the IOMMU level then nothing needs to be done in the spec at all. This is about capability at the device level.
> > > > >
> > > > > True, but my question is whether or not it can be done at the device level easily.
> > > >
> > > > there's no "easily" about live migration ever.
> > >
> > > I think I've stated sufficient issues to demonstrate how hard it is for virtio to do it. And I've given the link showing that it is possible to do that in the IOMMU without those issues. So in this context doing it in virtio is much harder.
> > > >
> > > > For example on-device iommus are a thing.
> > >
> > > I'm not sure that's the way to go considering the platform IOMMU evolves very quickly.
> > > > > > > >
> > > > > > > > what Lingshan proposed is analogous to bit per page - problem unfortunately is you can't easily set a bit by DMA.
> > > > > > >
> > > > > > > I'm not saying bit/bytemap is the best, but it has been used by real hardware. And we have many other options.
> > > > > > > >
> > > > > > > > So I think this dirty tracking is a good option to have.
> > > > > > > > > > > >
> > > > > > > > > > > > i.e. in first year of 2024?
> > > > > > > > > > >
> > > > > > > > > > > Why does it matter in 2024?
> > > > > > > > > > Because users need to use it now.
> > > > > > > > > > > >
> > > > > > > > > > > > If not, we are better off to offer this, and when/if platform support is there, sure, this feature can be disabled/not used/not enabled.
> > > > > > > > > > > >
> > > > > > > > > > > > > 4) if the platform support is missing, we can use software or leverage transport for assistance like PRI
> > > > > > > > > > > >
> > > > > > > > > > > > All of these are in theory. Our experiment shows PRI performance is 21x slower than the page fault rate done by the cpu. It simply does not even pass a simple 10Gbps test.
> > > > > > > > > > >
> > > > > > > > > > > If you stick to the wire speed during migration, it can converge.
> > > > > > > > > >
> > > > > > > > > > Do you have perf data for this?
> > > > > > > > >
> > > > > > > > > No, but it's not hard to imagine the worst case. Write a small program that dirties every page via a NIC.
> > > > > > > > > >
> > > > > > > > > > In the internal tests we don't see this happening.
> > > > > > > > >
> > > > > > > > > downtime = dirty_rates * PAGE_SIZE / migration_speed
> > > > > > > > >
> > > > > > > > > So if we get very high dirty rates (e.g. from a high speed NIC), we can't satisfy the downtime requirement. Or if you do see it converge, you might be getting help from the auto-converge support in hypervisors like KVM, where it tries to throttle the vCPU; then you can't reach the wire speed.
> > > > > > > >
> > > > > > > > Will only work for some device types.
> > > > > > >
> > > > > > > Yes, that's the point. Parav said he doesn't see the issue, it's probably because he is testing a virtio-net and so the vCPU is automatically throttled. It doesn't mean it can work for other virtio devices.
> > > > > >
> > > > > > Only for TX, and I'm pretty sure they had the foresight to test RX not just TX, but let's confirm. Parav did you test both directions?
> > > > >
> > > > > RX speed somehow depends on the speed of refill, so throttling helps more or less.
> > > >
> > > > It doesn't depend on the speed of refill, you just underrun and drop packets. Then your nice 10usec latency becomes more like 10sec.
> > >
> > > I miss your point here. If the driver can't achieve wire speed without dirty page tracking, it can neither when dirty page tracking is enabled.
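[Editorial note: rough, illustrative numbers for the downtime/convergence formula quoted above; the figures are assumptions made for the sake of the arithmetic, not measurements from this thread.]

    migration bandwidth:   25 Gbit/s  ~= 3.1 GB/s  ~= 760k 4KiB pages/s
    worst-case dirty rate: a 100 Gbit/s NIC receiving into guest memory can
                           touch up to ~12.5 GB/s ~= 3M 4KiB pages/s
    => pages can be dirtied roughly 4x faster than they are copied, so pre-copy
       never converges unless the device (or vCPU throttling) slows the workload
       down; and with a 300 ms downtime budget, only ~0.9 GB (3.1 GB/s * 0.3 s)
       of remaining dirty memory can be transferred in the final stop-and-copy.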
> > > > ...
> > > > > > > > > > > > There is no requirement for mandating PRI either. So it is unusable.
> > > > > > > > > > >
> > > > > > > > > > > It's not about mandating, it's about doing things in the correct layer. If PRI is slow, PCI can evolve for sure.
> > > > > > > > > >
> > > > > > > > > > You should try.
> > > > > > > > >
> > > > > > > > > Not my duty, I just want to make sure things are done in the correct layer, and once it needs to be done in virtio, there's nothing obviously wrong.
> > > > > > > >
> > > > > > > > Yea but just vague questions don't help to make sure either way.
> > > > > > >
> > > > > > > I don't think it's vague, I have explained: if something in virtio slows down the PRI, we can try to fix it.
> > > > > >
> > > > > > I don't believe you are going to make PRI fast. No one managed so far.
> > > > >
> > > > > So it's the fault of PRI not virtio, but it doesn't mean we need to do it in virtio.
> > > >
> > > > I keep saying with this approach we would just say "e1000 emulation is slow and encumbered, this is the fault of e1000" and never get virtio at all. Assigning blame only gets you so far.
> > >
> > > I think we are discussing different things. My point is virtio needs to leverage the functionality provided by the transport or platform (especially considering they evolve faster than virtio). It seems to me it's hard even to duplicate some basic functions of the platform IOMMU in virtio.
> >
> > Not duplicated. Feeding into the platform.
> > > > ...
> > > > > > > Missing functions in the platform or transport are not a good excuse to try to work around it in virtio. It's a layer violation and we never had any feature like this in the past.
> > > > > >
> > > > > > Yes, missing functionality in the platform is exactly why virtio was born in the first place.
> > > > >
> > > > > Well, the platform can't do device-specific logic. But that's not the case for dirty page tracking, which is device-logic agnostic.
> > > >
> > > > Not true, platforms have things like NICs on board and have for many years. It's about performance really.
> > >
> > > I've stated sufficient issues above. And one more obvious issue for device-initiated page logging is that it needs a lot of extra or unnecessary PCI transactions which will throttle the performance of the whole system (and lead to other issues like QoS). So I can't believe it has good performance overall. Logging via the IOMMU or using a shadow virtqueue doesn't need any extra PCI transactions, at least.
> >
> > In the current proposal, it does not require PCI transactions, as there is only a hypervisor-initiated query interface.
>
> Such a query requires at least several transactions, no?

It depends on how much memory is unmapped. For the pinned-memory case with a hw IOMMU, mapping/unmapping is not a frequent event.

> Or to make things more clear, could you list the steps for how a hypervisor is expected to do the querying?

It is listed in the other email thread where you described the race condition.

> > It is a trade-off of using svq + pasid vs using something from the device. Again, both have different use cases and value. One uses the cpu and one uses the device.
>
> That's for sure.
>
> > Depending how much power one wants to spend where..
>
> But as the author of the proposal, you need to elaborate more on what you expect from a hypervisor instead of letting the reviewer guess.

The hypervisor would look for ways to implement the functionality and find them in the spec. But I will add a short description of the write records read command.
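[Editorial note: pending that description, a plausible reading of the flow discussed in this thread is sketched below. The command and function names are placeholders, not the actual admin command layout from the patch series.]

#include <stdint.h>
#include <stdbool.h>

struct write_record { uint64_t iova; uint64_t len; };  /* placeholder layout */

/* Hypothetical helpers standing in for the admin-queue commands and the
 * hypervisor's own IOVA->GPA lookup; none of these names are from the spec. */
extern void dev_write_records_start(void);
extern void dev_write_records_stop(void);
extern int  dev_read_and_clear_records(uint64_t iova, uint64_t len,
                                        struct write_record *recs, int max);
extern bool iova_to_gpa(uint64_t iova, uint64_t *gpa);
extern void mark_gpa_dirty(uint64_t gpa, uint64_t len);

/* One pre-copy iteration: drain the device's write records for a range,
 * translate each IOVA with the hypervisor's current mapping, and feed the
 * result into the migration dirty bitmap. */
static void precopy_iteration(uint64_t iova_start, uint64_t iova_len)
{
        struct write_record recs[256];
        int n;

        while ((n = dev_read_and_clear_records(iova_start, iova_len,
                                               recs, 256)) > 0) {
                for (int i = 0; i < n; i++) {
                        uint64_t gpa;
                        if (iova_to_gpa(recs[i].iova, &gpa))
                                mark_gpa_dirty(gpa, recs[i].len);
                }
        }
}
/* dev_write_records_start() would be issued once before the first iteration
 * and dev_write_records_stop() after the device has been suspended. */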