Subject: RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration


> From: Jason Wang <jasowang@redhat.com>
> Sent: Tuesday, October 17, 2023 7:12 AM
> 
> > On Fri, Oct 13, 2023 at 2:36 PM Parav Pandit <parav@nvidia.com> wrote:
> >
> >
> > > From: Jason Wang <jasowang@redhat.com>
> > > Sent: Friday, October 13, 2023 6:46 AM
> >
> > [..]
> > > > > > > It's still not clear to me how this is done.
> > > > > > >
> > > > > > > 1) guest starts FLR
> > > > > > > 2) adminq freeze the VF
> > > > > > > 3) FLR is done
> > > > > > >
> > > > > > > If the freezing doesn't wait for the FLR, does it mean we
> > > > > > > need to migrate to a state like FLR is pending? If yes, do
> > > > > > > we need to migrate the other sub states like this? If not, why?
> > > > > > >
> > > > > > In most practical cases #2 followed by #1 should not happen as
> > > > > > on the source
> > > > > side the expected mode change is from active to stop.
> > > > >
> > > > > How does the hypervisor know if a guest is doing what without trapping?
> > > > >
> > > > Hypervisor does not know. The device knows, being the recipient of #1 and
> #2.
> > >
> > > We are discussing the possibility in software/driver side isn't it?
> > >
> > > 1) is initiated from the guest
> > > 2) is initiated from the hypervisor
> > >
> > > Both are softwares, and you're saying 2) should not happen after 1)
> > > since the device knows what is being done by guests? How can devices
> > > control software behaviour?
> > >
> > Devices do not control software behavior.
> > i.e., either the hypervisor can initiate a device mode change to stop (not freeze) or the
> guest can initiate FLR.
> > The device knows which is initiated first, as the single recipient of both.
> > Therefore, device responds accordingly.
> > For example, in the sequence you described, a device will delay the mode
> change command response until the FLR is completed.
> 
> Finally but ok.
> 
> >
> >
> > > The only possible thing is to make sure 3) is done before 2). That
> > > is what I'm asking but you are saying freeze doesn't need to wait for FLR...
> > >
> > I think I responded in previous email further down on synchronization point
> being fw.
> > I meant to say software does not need to wait for initiation of the freeze mode
> command.
> 
> For software, did you mean the hypervisor?
> 
Yes.

> > Just the command will complete at right time.
> >
> > This is anyway very corner case.
> > On source hypervisor as written in the theory of operation, the sequence is
> active->stop->freeze.
> > When mode change is done to stop, the vcpus are already suspended.
> 
> The problem here is not the vcpu but the when FLR is being done since it may
> change the device context.
> 
Once the freeze mode transition is completed, the hypervisor software reads the final device context to migrate.

> >
> > I agree FLR may have been initiated and driver is waiting now for 100msec.
> 
> For driver, did you mean the driver in the guest?
> 
Yes; the 100 msec wait time applies to the guest driver.

> >
> > So yes, device single entity synchronized it.
> >
> > > >
> > > > > > But ok, since the active to freeze mode change is allowed, let's
> > > > > > discuss
> > > above.
> > > > > >
> > > > > > A device is the single synchronization point for any device
> > > > > > reset, FLR or admin
> > > > > command operation.
> > > > >
> > > > > So you agree we need synchronization? And I'm not sure I get the
> > > > > meaning of synchronization point, do you mean the
> > > > > synchronization between freeze/stop and virtio facilities?
> > > > >
> > > > Synchronization means handling two events in parallel, such as FLR and
> another.
> > >
> > > Great. So we have a perfect race:
> > >
> > > 1) guest initiates FLR
> > > 2) device start FLR
> > > 3) hypervisor stop and freeze the device
> > > 4) device is freeze
> > > 5) hypervisor read device context A
> > > 6) migrate device context A
> > > 8) migration is done
> > > 9) FLR is done
> > > 10) hypervisor read device context B
> > >
> > > So we end up with inconsistent device context, no? Dest want B or
> > > A+B, but you give A.
> > >
> > Since #1 and #2 are done before #3, the device knows to finish the FLR, hence
> #9 is completed before #4.
> 
> Ok, that's my understanding and that's why I'm asking, but you said freeze/stop
> doesn't need to wait for FLR before.
> 
The hypervisor side does not need to wait to issue the freeze/stop command.
It just completes later from the device if an FLR was ongoing in this corner case.
I covered this in v2 now.
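As a rough, non-normative illustration of that corner case (all names below are made up, not from the spec or any driver), the device-side serialization could look like this: the freeze command is accepted immediately, but its completion is posted only once any in-flight FLR finishes.

#include <stdbool.h>

/* Hypothetical device-firmware sketch; illustrative names only. */
struct member_dev {
    bool flr_in_progress;  /* guest-initiated FLR is running */
    bool freeze_pending;   /* freeze command received, completion held */
};

void enter_freeze_mode(struct member_dev *d);           /* hypothetical */
void complete_admin_cmd(struct member_dev *d, int rc);  /* hypothetical */

void on_admin_freeze_cmd(struct member_dev *d)
{
    if (d->flr_in_progress) {
        /* Accept the command but post its completion only after FLR. */
        d->freeze_pending = true;
        return;
    }
    enter_freeze_mode(d);
    complete_admin_cmd(d, 0);
}

void on_guest_flr_done(struct member_dev *d)
{
    d->flr_in_progress = false;
    if (d->freeze_pending) {
        d->freeze_pending = false;
        enter_freeze_mode(d);
        complete_admin_cmd(d, 0);  /* response arrives late but correct */
    }
}

With this shape the hypervisor never polls for FLR; it simply observes a slightly delayed command completion.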

> >
> > Alternatively, in above sequence when destination sees #10, it can
> immediately finish the FLR as the dest device is not under FLR, treating it as a no-op.
> >
> > Both ways to handle are fine. (and rare in practice, but yes, its possible).
> >
> > I will write both the options in the device requirements.
> >
> > > >
> > > > > > So, the migration driver does not need to wait for FLR to complete.
> > > > >
> > > > > I'm confused, you said below that device context could be changed by
> FLR.
> > > > >
> > > > Yes.
> > > > > If FLR needs to clear device context, we can have a race where
> > > > > device context is cleared when we are trying to read it?
> > > > >
> > > > I didn't say clear the context.
> > > > FLR updates the device context.
> > >
> > > In what sense?
> > >
> > Indicating a new device context and discarding the old one.
> 
> For example, what will queue_address have after an FLR?
> 
All zeros.
It probably does not matter, because the device context captures this FLR notion.
device_status shows that the device is under reset after the FLR.
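A minimal sketch of such a post-FLR snapshot, with illustrative field names only:

#include <stdint.h>

/* Hypothetical post-FLR device-context snapshot. */
struct vq_ctx {
    uint64_t queue_address;   /* reads back as all zeros after FLR */
    uint16_t queue_enable;    /* 0: queue disabled by the reset    */
};

struct post_flr_ctx {
    uint8_t device_status;    /* 0: device is under reset after FLR */
    struct vq_ctx vq0;        /* likewise for every queue of the member */
};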

> > I am glad you asked this. I wanted to get the basic part captured before
> adding this optimization.
> 
> Ok.
> 
> > Probably it is good to add it now in the v2 as we crossed this stage now.
> >
> > > > Device is serving the device context read write commands, serving
> > > > FLR, answering mode change command, So device knows the best how
> > > > to avoid
> > > any race.
> > >
> > > You want to leave those details for the vendor to figure out? If
> > > devices know everything, why do we need device normative?
> > >
> > Device knows its implementation.
> > Implementation guidelines to be in the normative.
> > I will add it to the normative.
> >
> > > I see issues at least for FLR, I'm pretty sure there are others. If a
> > > design requires us to audit all the possible conflicts between
> > > virtio facilities and transport. It's a strong hint of layer
> > > violation and when it happens it for sure may hit a lot of problems that are
> very hard to find or debug thus we should drop such a design.
> > > I suggest using the RFC tag since the next version (if there is one)
> > > as I see it is immature in many ways.
> > >
> > The technical committee audits the required touch points, like the rest of the industry
> committees that I have participated in.
> > I disagree with your above point.
> > If you do not want to review, that is fine.
> 
> I don't want to hold my breath if I see something that is obviously wrong. Using
> RFC may help people to know that it is a draft that has something to be
> improved before it can be merged.
> 
> > We are reviewing with other members and also contributed by them.
> >
> > > What's more, solving races is much easier if the device
> > > functionality is self contained. For example, for a self contained
> > > device with the transport as the single interface, we can leverage
> > > from transport
> > > (PCI) for dealing with races, arbitration, ordering, QOS etc which
> > > is probably required in the internal channel between the owner and
> > > the member. But all of these were missed in your series and even if
> > > you can I'm not sure it's worthwhile to reinvent all of them.
> > >
> > At the end there is one physical device serving owner and member devices.
> 
> This doesn't happen yet. For example, a VF with adminq that can be isolated
> with PASID makes some sense.
PASID is for user-space processes.
Kernel space does not consume a PASID.
Depending on PASID can work only for the mediation approach.
Last I heard, the CPU or kernel took that support out, as some special instruction did not work for that CPU.

> 
> > So a claim like things are on the VF hence you magically get 200% QoS
> guarantee is a myth.
> 
> That's not my point, I'm saying VF could benefit from the e.g QOS support in
> PCIE. I'm not saying it's perfect.
> 
It is not about the QoS support in PCIe.
It is that the restriction to respond within N usec, and the burden on the device to keep a rarely used element in always-available memory, are wasteful.

> >
> > Quoting "all of these" is also incorrect.
> >
> > Things are added gradually, first functionality with reasonable performance,
> followed by the notion and extension for QoS.
> > By definition of PCI transport for SR-IOV there is internal channel.
> >
> > It is a reasonably good proposal in its current form.
> > The few race conditions that you highlight are extremely rare in nature.
> 
> It's not rare since there's no way to know what the guest is doing.
I will be practical. It is rare, because a production-environment guest is interested in running traffic, not in constantly engaging in device resets.
And if it does, taking a longer time to migrate is also fine, because the guest is driving the device for production apps.
> It's actually the critical part for live migration to be correct. You are proposing
> migration so it must cover all those cases to make sure there is no case to make
> your proposal a dead end.
Yes, it should be correct.
> 
> > Suggestions are welcome to improve.
> 
> I have given some and I will give more.
> 
Sure, that is very helpful as usual.

> > There were a couple of them by Michael too; I am addressing them in the v2.
> >
> > > For example, for the architecture like owner/member, if the virtio
> > > or transport facility could be controlled via device internal
> > > channels besides the transport, such a channel may complicate the
> synchronization a lot.
> > Two vendors who actually make the hw SR-IOV devices are authoring these, and
> others are also reviewing.
> > So I am more confident that it is solid enough.
> > Also, a similar design has been seen with another device for more than a year,
> as GPL code integrated with QEMU for a year now and with the upstream kernel.
> >
> > > The device needs to
> > > be able to handle or synchronize requests from both PCI and owner in
> parallel.
> > > There are just too many possible races and most of my questions so
> > > far come from this viewpoint. I wouldn't go further for other stuff
> > > since I believe I've spotted sufficient issues and that's why I must
> > > stop at this patch before looking at the rest.
> > It is your call to stop or progress.
> > I find your reviews useful to improve this proposal, so I will fix them.
> 
> My point is to make the theory correct before looking at the others as I had a
> lot of questions (as demonstrated in this thread). I think it's not hard to
> understand as the rest of the series are based on the theory.
> 
> >
> > >
> > > Admin commands are fine if it does real administrative jobs such as
> > > provisioning since such work is beyond the core virtio functionality.
> > >
> > > Again, the goal of virtio spec is to have a device with sufficient
> > > guidelines that is easy to implement but not leave the vendors to
> > > waste their engineering resources in figuring or fuzzing the corner cases.
> > I have not seen an industry standard spec or a software that does not have
> corner cases.
> 
> Corner cases are probably not accurate. I meant, for you, it's probably a corner
> case, but for me it's kind of obvious.
> 
> > The spec proposal is from > 1 device vendors.
> 
> That's good but it doesn't mean it doesn't have any (major) issues.
> E.g vendors may choose to just implement part of the PCIE capabilities so they
> don't do audits for the rest.
> 
> >
> > I will focus on more practical aspects to progress and improve this spec.
> > >
> > > >
> > > > > > When admin cmd freeze the VF it can expect FLR_completed VF.
> > > > >
> > > > > We need to explain why and how about the resume? For example, is
> > > > > resuming required to wait for the completion of FLR, if not, why?
> > >
> > > This question is ignored.
> > >
> > I probably missed. Sorry about it.
> > No, the driver does not need to wait for FLR to finish to issue resume
> > command,
> 
> Good but I want to know if stop/freeze->active requires to wait for the
> completion of FLR. I guess the answer is yes.
> 
Yes.
> > as this is typically done on the destination member device, which should not be
> under FLR.
> > I will write up the requirements further.
> >
> > > > > In another thread you are saying that the PCI composition is
> > > > > done by hypervisor, so passthrough is really confusing at least for me.
> > > > >
> > > > I explained what vPCI composition is done there.
> > > > PCI config space and msix side of composition is done.
> > > > The whole virtio interface is not composed.
> > >
> > > You need to describe this somewhere, no? That's what I'm saying.
> > >
> > Mostly not. What is not done is not written.
> >
> > > And passthrough is misleading here.
> > >
> > Passthrough is mentioned in theory of operation.
> > It is not present in requirements section.
> > So, it is fine.
> 
> I suggest documenting or defining the "passthrough" methodology somewhere.
> Michael tries to define it in another thread, if it's ok, let's use that. We can't
> require people to read VFIO code in order to know what happens in the virtio
> spec.
> 
Yes. I captured it in v2 in the assumptions section.

> >
> > > >
> > > > > > Ok. I assume "reset flow" is clear to you now that it points to section
> 2.4.
> > > > > > This section is not normative section, so using an extra word
> > > > > > like "flow" does
> > > > > not confuse anyone.
> > > > > > I will link to the section anyway.
> > > > >
> > > > > Probably, but you mention FLR flow as well.
> > > > As I said, not repeating the PCIe spec here. The reader knows what
> > > > FLR of the
> > > PCIe transport is.
> > >
> > > Ok, I'm not a native speaker, but I really don't know the difference
> > > between "FLR" and "FLR flow".
> > >
> > Let's keep it simple. I will write it as FLR, as the PCI transport has it as FLR.
> 
> Ok.
> 
Done in v2.

> >
> > > >
> > > > >
> > > > > >
> > > > > > > >
> > > > > > > > > > and may also undergo PCI function level
> > > > > > > > > > +reset(FLR) flow.
> > > > > > > > >
> > > > > > > > > Why is only FLR special here? I've asked FRS but you
> > > > > > > > > ignore the
> > > question.
> > > > > > > > >
> > > > > > > > FLR is special to bring clarity that guest owns the VF
> > > > > > > > doing FLR, hence
> > > > > > > hypervisor cannot mediate any registers of the VF.
> > > > > > >
> > > > > > > It's not about mediation at all, it's about how the device
> > > > > > > can implement what you want here correctly.
> > > > > > >
> > > > > > > See my above question.
> > > > > > >
> > > > > > Ok. it is clear that live migration commands cannot stay on
> > > > > > the member device
> > > > > because the member device can undergo device reset and FLR flows
> > > > > owned by the guest.
> > > > >
> > > > > I disagree, hypervisors can emulate FLR and never send FLR to real
> devices.
> > > > >
> > > > That would be some other trap alternative that needs to dissect
> > > > the device,
> > > and building infrastructure for such dissection is not desired in the listed use
> case.
> > >
> > > Do you need to trap FLR or not? You're saying the hypervisor is in
> > > charge of vPCI, how is this differ to what you proposed? If not, how
> > > can vPCI be composed?
> > >
> > The live migration driver does not need to trap FLR.
> 
> Maybe I misunderstood your vPCI composition, but it's really helpful to
> document how it is expected to be done.
> 
We can probably have it in the cover letter as hypervisors may evolve and do more passthrough work than done today.

> >
> > > I believe you need to document how vpci is supposed to be done,
> > > since I believe your proposal can only work with such specific types
> > > of PCI composition. This is one of the important things that is missed in this
> series.
> > >
> > I don't see a need to describe vPCI composition as there may be more than
> one way to do it.
> 
> More than one way for sure, but this contradicts what you say: you said you
> don't trap FLR ...
> 
Don't trap the FLR in the live migration driver.

> People like me may wonder for example why FLR is mentioned, as FLR can be
> trapped and emulated.
It can be, but that emulation requires knowledge of device-specific things.
This may be fine in a stack which composes the virtio device using non-virtio devices.
> 
> Another example, when a device can be saved and restored, the hypervisor may
> schedule the device among multiple VMs, in that case, trapping FLR is a must.
> 
I think you meant sharing when you say scheduling.
Such sharing can work when the data path is also trapped.
Maybe one can do it using PASID and queue assignment.
But then it is not passthrough.
So these are two different use cases; in that case the whole PCI config space is fully replicated and hence FLR never reaches the real device.
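For that second, fully-trapped use case, here is a hypervisor-side sketch of how an emulated FLR never reaches the member device. The helpers and the vdev type are hypothetical; the PCI_EXP_* values mirror the standard PCIe Device Control register layout.

#include <stdint.h>

#define PCI_EXP_DEVCTL          0x08    /* PCIe cap: Device Control     */
#define PCI_EXP_DEVCTL_BCR_FLR  0x8000  /* Initiate Function Level Reset */

struct vdev;                                     /* hypothetical vPCI device */
void reset_emulated_state(struct vdev *v);       /* hypothetical helper */
void emulated_cfg_write(struct vdev *v, uint32_t off, uint32_t val);
uint32_t pcie_cap_off(struct vdev *v);           /* hypothetical helper */

void vpci_cfg_write(struct vdev *v, uint32_t off, uint32_t val)
{
    if (off == pcie_cap_off(v) + PCI_EXP_DEVCTL &&
        (val & PCI_EXP_DEVCTL_BCR_FLR)) {
        reset_emulated_state(v);  /* absorb the FLR in software */
        return;                   /* the real VF never sees it  */
    }
    emulated_cfg_write(v, off, val);
}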


> > What I think is worth describing is that the whole PCI device is not stored in the
> device context.
> > I will try to add a short description around it.
> >
> > > >
> > > > So your disagreement is fine for non-passthrough devices.
> > > >
> > > > > > (and hypervisor is not involved in these two flows, hence the
> > > > > > admin command
> > > > > interface is designed such that it can fulfill the above requirements).
> > > > > >
> > > > > > Theory of operation brings out this clarity. Please notice
> > > > > > that it is in
> > > > > introductory section with an example.
> > > > > > Not normative line.
> > > > > >
> > > > > > > >
> > > > > > > > > > Such flows must comply to the PCI standard and also
> > > > > > > > > > +virtio specification;
> > > > > > > > >
> > > > > > > > > This seems unnecessary and obvious as it applies to all
> > > > > > > > > other PCI and virtio functionality.
> > > > > > > > >
> > > > > > > > Great. But your comment is contradictory.
> > > > > > > >
> > > > > > > > > What's more, for the things that need to be
> > > > > > > > > synchronized, I don't see any descriptions in this patch. And if it
> doesn't need, why?
> > > > > > > > With which operation should it be synchronized and why?
> > > > > > > > Can you please be specific?
> > > > > > >
> > > > > > > See my above question regarding FLR. And it may have others
> > > > > > > which I haven't had time to audit.
> > > > > > >
> > > > > > Ok. When you get a chance to audit, let's discuss at that time.
> > > > >
> > > > > Well, I'm not the author of this series, it should be your job
> > > > > otherwise it would be too late.
> > > > >
> > > > As the author, I will cover what we think. If you have specific
> > > > points that add value,
> > > please share and I will look into it.
> > >
> > > I've pointed out sufficient issues. I have a lot of others but I
> > > don't want to have a giant thread once again.
> > >
> > I see the following things to improve in the requirements, which I will do in v2.
> >
> > 1. Document race around FLR and admin commands for really rare corner
> case.
> > 2. Some text around not migrating the PCI device registers
> > 3. Interaction with PM commands
> >
> > > >
> > > > > For example, how is the power management interaction with the
> > > freeze/stop?
> > > > >
> > > > Power management is owned by the guest, like any other virtio interface.
> > > > So freeze/stop do not interfere with it.
> > >
> > > I don't think this is a good answer. I'm asking how the PM interacts
> > > with freeze/stop, you answer it works well.
> >
> > >
> > > I'm not obliged to design hardware for you but figuring out the bad
> > > design for virtio. I'm not convinced with a proposal that misses a
> > > lot of obvious critical cases and for sure it's not my job to solve them.
> > >
> > I am not asking you to solve.
> 
> My point is that, it's better for you to have some investigation on the PM
> instead of me.
> 
Yes, I updated the device requirement in v2.

> >
> > > I've demonstrated the possible races with FLR. So did the PM. For
> > > example, if VF is in D3cold state, can we still read its device context?
> > I think yes, but I will double check.
> >  If yes, is it a violation of the PCIE spec? If not, why?
> 
> So you are emulating the state instead of a real suspension?
> 
No, not emulating the state. It is present in the device at the PCI level, under guest and member device control.
When the controlling function (owner PF) administers the member device to freeze, it puts the whole device in freeze, so a D3->D0 transition cannot happen.
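A non-normative sketch of that gating, with made-up names:

#include <stdint.h>

enum pm_state   { D0, D3HOT };             /* illustrative subset       */
enum admin_mode { ACTIVE, STOP, FREEZE };  /* owner-PF controlled mode  */

struct member_pm {
    enum admin_mode mode;
    enum pm_state   pm;
};

void on_pm_ctrl_write(struct member_pm *d, enum pm_state next)
{
    /* While frozen by the owner PF, the guest cannot bring the
     * function back to D0; the device holds the transition. */
    if (d->mode == FREEZE && next == D0)
        return;
    d->pm = next;   /* normal guest-controlled D0/D3hot change */
}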

> > No, because the device context is owned by the owner device and not the VF. The SR-
> PCIM interface has defined it to be outside the scope of the PCIe spec.
> >
> > > How about other states? Can the device be freezed in the middle of
> > > PM state transitions? If yes, how can it work without migrating PCI
> > > states?
> > I will double check, but it is unlikely; it should be similar to the FLR case, to avoid
> treating the device differently.
> 
> The reason why I see it is different from FLR is that
> 
> 1) D3cold requires the VF to be off the power
> 2) State transition might take more time than FLR does; PCI seems to only cover
> the minimum delay but not the maximum, which may have implications for
> downtime
> 
D3cold is not controlled by the guest driver.
The PM register can change D0 to D3hot.

> >
> > > Well, I meant we need a more precise definition of each state
> > > otherwise it could be ambiguous (as I pointed above).
> > Ok. so, few things about read and other messages, I will add.
> >
> > >
> > > >
> > > > > >
> > > > > > > >
> > > > > > > > In "stop" mode, the device wont process descriptors.
> > > > > > >
> > > > > > > If the device won't process descriptors, why still allow it
> > > > > > > to receive
> > > > > notifications?
> > > > > > Because notification may still arrive and if the device may
> > > > > > update any counters as part of
> > > > >
> > > > > Which counters did you mean here?
> > > > >
> > > > The counter that Xuan is adding, and any other state that the device
> > > > may have to
> > > update as a result of a driver notification.
> > > > For example caching the posted avail index in the notification.
> > >
> > > A link to those proposals?
> > [1]
> > https://lists.oasis-open.org/archives/virtio-comment/202310/msg00048.h
> > tml
> >
> 
> I don't see how this is related to "posted avail index" etc.
Driver notification counters are updated.

> 
> > > If the device must depend on those cached features to work it's
> > > really fragile. If not, we don't need to care about them.
> > It is not dependent.
> > It is the infrastructure to enable it.
> > Same for other shared memory region accesses.
> >
> > >
> > > >
> > > > > > it which needs to be migrated or store the received notification.
> > > > > >
> > > > > > > Or does it really matter if the device can receive or not here?
> > > > > > >
> > > > > > From device point of view, the device is given the chance to
> > > > > > update its device
> > > > > context as part of notifications or access to it.
> > > > >
> > > > > This is in conflict with what you said above " Device cannot
> > > > > process the queue ..."
> > > > >
> > > > No, it does not.
> > > > Device context is updated within the device without accessing the
> > > > queue
> > > memory of the guest.
> > >
> > > This is not documented or explained anywhere?
> > >
> > Why should it be explained?
> > The device is not accessing the guest memory -> this is mentioned in stop mode.
> 
> Isn't it hard to see the difference between the following two?
> 
I am not following your question.
> 1) In stop mode, device is not accessing guest memory
> 2) device context is updated without accessing the queue memory of the guest
> 
Device context is read/written by the owner PF, so it does not touch the guest memory.

> 1) is to define the stop mode, 2) is to define the behaviour of device context
> 
> Or are you saying device context can only be fetched after the device is
> stopped?
> 
No, the device context can be fetched in all 3 modes.
It is weird for the hypervisor to fetch the device context while a mode transition is in progress.

> > Hence, there is no need to write above.
> >
> > > >
> > > > > Maybe you can give a concrete example.
> > > > >
> > > > The above one.
> > > >
> > > > > >
> > > > > > > >
> > > > > > > > > > + the member device context
> > > > > > > > >
> > > > > > > > > I don't think we define "device context" anywhere.
> > > > > > > > >
> > > > > > > > It is defined further in the description.
> > > > > > >
> > > > > > > Like this?
> > > > > > >
> > > > > > > """
> > > > > > >  +The member device has a device context which the owner
> > > > > > > driver can
> > > > > > > +either read or write. The member device context consist of
> > > > > > > +any
> > > > > > > device  +specific data which is needed by the device to
> > > > > > > resume its operation  +when the device mode """
> > > > > > >
> > > > > > Yes.
> > > > > > Further patch-3 adds the device context and also add the link
> > > > > > to it in the
> > > > > theory of operation section so reader can read more detail about it.
> > > > > >
> > > > > > > "Any" is probably too hard for vendors to implement. And in
> > > > > > > patch 3 I only see virtio device context. Does this mean we
> > > > > > > don't need transport
> > > > > > > (PCI) context at all? If yes, how can it work?
> > > > > > >
> > > > > > Right. PCI member device is present at source and destination
> > > > > > with its layout,
> > > > > only the virtio device context is transferred.
> > > > > > Which part cannot work?
> > > > >
> > > > > It is explained in another thread where you are saying the PCI
> > > > > requires mediation. I think any author should not ignore such
> > > > > important assumptions in both the change log and the patch.
> > > > >
> > > > > And again, the more I review the more I see how narrow this
> > > > > series can be
> > > used:
> > > > >
> > > > I explained this before and also covered in the cover letter.
> > > >
> > > > > 1) Only works for SR-IOV member device like VF
> > > > It can be extended to SIOV member device in future.
> > > > Today these are the only type of member device virtio has.
> > >
> > > That is exactly what I want to say, it can only work for the
> > > owner/member model. It can't work when the virtio device is not
> > > structured like that. And you missed that most of the existing
> > > virtio devices are not implemented in this model. It means they
> > > can't be migrated with a pure virtio specific extension. For you,
> > > SR-IOV is all but this is not true for virtio. PCI is not the only transport and
> SR-IOV is not the only architecture in PCI.
> > >
> > Each transport will have its own way to handle it.
> > When an MMIO owner-member relationship arises, one will be able to do
> so as well.
> > In fact other transports will likely miss out as they have not established such
> pace.
> >
> > > And I'm pretty sure the owner/member is not the only requirement,
> > > there are a lot of other assumptions which are missed in this series.
> > >
> > One proposal does not do everything.
> > It is just impractical.
> 
> For other assumptions, I meant:
> 
> 1) how vpci is composed, if it can be composed as vhost, why do we need to
> mention "passthrough"
I am lost; above you said you want to capture how vPCI is composed.
Here you ask why to mention passthrough.

> 2) the cap/bar layout, for example if a cap shares BARs with others, it can't be
> "passthrough", no?
Why not? The cap exposes all the things to the guest.

> 
> >
> > > >
> > > > > 2) Mediate PCI but not virtio which is tricky
> > > > > 3) Can only work for a specific BAR/capability register layout
> > > > >
> > > > > Only 1) is described in the change log.
> > > > >
> > > > > The other important assumptions like 2) and 3) are not
> > > > > documented
> > > anywhere.
> > > > > And this patch never explains why 2) and 3) is needed or why it
> > > > > can be used for subsystems other than VFIO/Linux.
> > > > >
> > > > Since I am not mentioning vfio now, I will refrain from mentioning
> > > > others as well. :)
> > >
> > > It's not about VFIO at all. It's about to let people know under
> > > which case this proposal could work. Otherwise if a vendor develops
> > > a BAR/cap which is not at page boundary. How could you make it work with
> your proposal here?
> > >
> > The vendor is a cloud operator which is building the device, so it will always work;
> it has the matching capabilities on source and destination.
> 
> I meant, for example, if common_cfg shares a BAR with others but doesn't own
> a page exclusively, you need to trap, no?
> 
For a passthrough device, why such a restriction?

> >
> > > >
> > > > > >
> > > > > > > >
> > > > > > > > > >and device configuration space may change. \\
> > > > > > > > > > +\hline
> > > > > > > > >
> > > > > > > > > I still don't get why we need a "stop" state in the middle.
> > > > > > > > >
> > > > > > > > All pci devices which belong to a single guest VM are not
> > > > > > > > stopped
> > > > > atomically.
> > > > > > > > Hence, one device which is in freeze mode, may still
> > > > > > > > receive driver notifications from other pci device,
> > > > > > >
> > > > > > > Device may choose to ignore those notifications, no?
> > > > > > >
> > > > > > > > or it may experience a read from the shared memory and get
> > > > > > > > garbage
> > > > > data.
> > > > > > >
> > > > > > > Could you give me an example for this?
> > > > > > >
> > > > > > Section 2.10 Shared Memory Regions.
> > > > >
> > > > > How can it experience a read in this case?
> > > > >
> > > > MMIO read/write can be initiated by the peer device while the
> > > > device is in
> > > stopped state.
> > >
> > > Ok, but what I want to say is how it can get the garbage data here?
> > >
> > If the device mode is changed to freeze while it is being read by the peer
> device, it can get garbage data or the last data,
> > which may not be the one that is expected.
> > So first all the initiator devices are stopped, ensuring that they do not make any
> requests.
> >
> > And if there are requests, they get a proper answer.
> 
> Ok.
> 
> >
> > > >
> > > > > Btw, shared regions are tricky for hardware.
> > > > >
> > > > > >
> > > > > > > > And things can break.
> > > > > > > > Hence the stop mode, ensures that all the devices get
> > > > > > > > enough chance to stop
> > > > > > > themselves, and later when freezed, to not change anything
> internally.
> > > > > > > >
> > > > > > > > > > +0x2   & Freeze &
> > > > > > > > > > + In this mode, the member device does not accept any
> > > > > > > > > > +driver notifications,
> > > > > > > > >
> > > > > > > > > This is too vague. Is the device allowed to be freezed
> > > > > > > > > in the middle of any virtio or PCI operations?
> > > > > > > > >
> > > > > > > > > For example, in the middle of feature negotiation etc.
> > > > > > > > > It may cause implementation specific sub-states which
> > > > > > > > > can't be
> > > migrated easily.
> > > > > > > > >
> > > > > > > > Yes. It is allowed in the middle of feature negotiation, for sure.
> > > > > > > > It is a passthrough device, hence the hypervisor layer does not
> > > > > > > > get to see sub-
> > > > > state.
> > > > > > > >
> > > > > > > > Not sure why you comment, why it cannot be migrated easily.
> > > > > > > > The device context already covers this sub-state.
> > > > > > >
> > > > > > > 1) driver writes driver_features
> > > > > > > 2) driver sets FEAUTRES_OK
> > > > > > >
> > > > > > > 3) device receive driver_features
> > > > > > > 4) device validating driver_features
> > > > > > > 5) device clears FEATURES_OK
> > > > > > >
> > > > > > > 6) driver read stats and realize FEATURES_OK is being
> > > > > > > cleared
> > > > > > >
> > > > > > > Is it valid to be frozen of the above?
> > > > > > No. device mode is frozen when hypervisor is sure that no more
> > > > > > access by the
> > > > > guest will be done.
> > > > >
> > > > > How, you don't trap so 1) and 2) are posted, how can hypervisor
> > > > > know if there's inflight transactions to any registers?
> > > > >
> > > > Because hypervisor has stopped the vcpus which are issuing them.
> > >
> > > MMIO are posted. vCPU is stopped but the transactions are inflight.
> > > How could the hypervisor/device know if there's any inflight PCIE
> > > transactions here? So I can imagine what happens in fact is the TLP
> > > for freezing is ordered with the TLP for posted MMIO. This is
> > > probably guaranteed for typical PCIE setup but how about the relaxed
> ordering?
> >
> > Vcpus do not generate relaxed-ordering MMIOs.
> > In pci spec: " If this bit is Set, the Function is permitted to set
> > the Relaxed Ordering bit in the Attributes field of transactions it initiates".
> >
> > Function initiates RO requests, not the vcpu.
> > Hence, it is fine.
> >
> 
> Ok.
> 
> > > >
> > > > > > What can happen between #2 and #3 is that the device mode may change to
> stop.
> > > > >
> > > > > Why can't be freezed in this case? It's really hard to deduce
> > > > > why it can't just from your above descriptions.
> > > > >
> > > > On the source hypervisor, the mode changes are active->stop->freeze.
> > > > Hence when freeze is done, the hypervisor knows that all inflight
> > > > has been
> > > stopped by now.
> > >
> > > Ok, but how about freezing between 3) and 4). If we allow it, do we
> > > need to migrate to this state? If yes, how can it work with your
> > > device context? If not, shouldn't we document this?
> > >
> > Maybe some of these are implementation details. I am not sure they belong in the
> spec.
> 
> The point is to make sure that your deivce context covers this case.
> If it can't be covered, it's a design defect.
> 
> > Like an RSS update while packets are received; such implementation details are
> not part of the spec.
> 
> This is definitely different, the driver can choose to synchronize or the end user
> can tolerate the possible out of order packets in this case.
> 
Right, but it is not defined in the spec.

> This is not the case here, if freezing between 3) and 4) is allowed, your current
> device context can't cover this case and guests can't tolerate such kinds of
> errors after migration for sure.
> 
Ok. I will add the text around this in v3.
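For reference, here is a sketch (illustrative field names only, not from the spec) of the kind of sub-state the device context would need to carry for a freeze landing between steps 3) and 4):

#include <stdint.h>
#include <stdbool.h>

/* Hypothetical: enough context to migrate a freeze that lands after
 * driver_features is latched but before validation has finished. */
struct negotiation_ctx {
    uint64_t driver_features;  /* as last written by the guest        */
    uint8_t  device_status;    /* FEATURES_OK may still be set here   */
    bool     validated;        /* false: the destination device re-runs
                                  validation before acting on FEATURES_OK */
};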

> >
> > > >
> > > > > Even if it had, is it even possible to list all the places where
> > > > > freezing is prohibited? We don't want to end up with a spec that
> > > > > is hard to implement or leave the vendor to figure out those tricky parts.
> > > > >
> > > > The general idea is not prohibiting the freeze/stop mode.
> > > > If the device needs more time, let device take time to do it.
> > >
> > > Ok, it means:
> > >
> > > 1) there're conditions from stop to freeze, then what are they?
> > No, there isn't a condition.
> > Maybe I didn't follow the question.
> 
> E.g under which condition could the device change the status from active to
> stop etc. That's something I keep asking with a concrete example (e.g FLR).
> 
The device mode is changed by the owner driver from active to stop. This is the admin mode.
FLR does not change the mode to stop/freeze, because FLR acts on the guest-driver-controlled operational state of the device.
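To illustrate that separation (again with hypothetical names): FLR resets the guest-visible operational state, while the owner-controlled administrative mode is left untouched.

#include <stdint.h>

/* Illustrative: the two pieces of member-device state are independent. */
struct member_state {
    int     admin_mode;      /* owner PF controlled: active/stop/freeze */
    uint8_t device_status;   /* guest driver controlled                 */
};

void on_guest_flr(struct member_state *s)
{
    s->device_status = 0;    /* FLR resets the operational state */
    /* s->admin_mode is intentionally untouched: only a mode-change
     * admin command from the owner driver modifies it. */
}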

> > > 2) how much time at most? E.g FLR takes at most 100ms.
> > From the driver side it is 100 msec; for the device side it can be less too.
> > As soon as the FLR is done, or enough of it is recorded, the stop can continue.
> >
> > > 3) If it needs more time, can this time satisfy the downtime requirement?
> > >
> > A guest VM for all practical purposes is not busy doing FLR; it is a corner
> case, yet we have to cover it.
> 
> Corner case in what sense? A loop in a simple shell script can trigger this easily.
> 
Sure.
Which is not the practical application of the guest VM.
Hence, it is a corner case.

> > And yes, it satisfies the downtime requirements, because the VM is already not
> interested in the packets; it is busy doing the FLR.
> 
> Well, it has subtle differences. VM may have more than one interface, just one
> of the interfaces is doing FLR.
> 
Sure, it can. But in that case the time of that VF stop is not critical.
The VM is busy with non-critical work.

> >
> > > >
> > > >
> > > > > > And in stop mode, device context would capture #5 or #4,
> > > > > > depending where is
> > > > > device at that point.
> > > > > >
> > > > > > > >
> > > > > > > > > And what's more, the above state machine seems to be
> > > > > > > > > virtio specific, but you don't explain the interaction
> > > > > > > > > with the device status state
> > > > > > > machine.
> > > > > > > > First, above is not a state machine.
> > > > > > >
> > > > > > > So how do readers know if a state can go to another state and when?
> > > > > > >
> > > > > > Not sure what you mean by reader. Can you please explain.
> > > > >
> > > > > The people who read virtio spec.
> > > > >
> > > > So question is "how reader knows if a state can go to another
> > > > state and
> > > when"?
> > > > It is described and listed in the table when a mode can change.
> > >
> > > It's not only "if" but also "when". Your table partially answers the
> > > "if '' but not "when". I think you should know now the state
> > > transition is conditional. So let's try our best to ease the life of the vendor.
> > What do you mean by when?
> > I do not understand "mode change is conditional"; it is not based on a
> condition.
> > [..]
> 
> See above.
> 
> >
> > > > > Let's define the synchronization point first. And it
> > > > > demonstrates at least devices need to synchronize between the
> > > > > free/stop and virtio device status machine which is not as easy as what
> is done in this patch.
> > > > >
> > > > Synchronization point = device.
> > >
> > > This is obvious as we can't rule stuff outside virtio, and we are
> > > talking about devices not drivers here. But the spec needs
> > > sufficient guidance/normative for the vendor to implement. It's more
> > > than just saying "device is synchronization point".
> > >
> > The requirements are already covering what device needs to do.
> > Some interaction points are missing, as I acked above, I will add them.
> >
> > [..]
> > > > > Until virtio reset, this is how virtio works now. I've pointed
> > > > > out that it may cause extra troubles when trying to resume, but
> > > > > you don't tell me what's wrong to keep that?
> > > > >
> > > > If kept, hypervisor may not be able to decide when to change the
> > > > mode from
> > > active->stop.
> > >
> > > Why? It is simply done when mgmt requires a migration?
> > >
> > Mgmt is a bit higher-level entity. Underneath, the software layers may wait until
> the time is right to migrate.
> 
> I don't understand, anyhow the migration request could not be sent to the
> device directly without the assistance in hypervisor.
> 
> > The fundamental point is, the device context is expected to return the
> incremental value, that is, the content changed since last time.
> > So once all changed content is read, it's empty.
> 
> You can't easily define an incremental value for all types of states or structures:
> 
> 1) device with complicated states like RAM or other
> 2) the device state has complicated data structures
> 
What I parse is that the device context is a complicated structure.
So it will be defined incrementally as it becomes more mature.
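A minimal sketch of the read-to-clear delta semantics described above, assuming a hypothetical TLV-style context layout:

#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical TLV-with-dirty-bit model of delta reads. */
struct ctx_field {
    uint16_t       type;   /* e.g. VQ config, RSS, flow filter */
    uint16_t       len;
    bool           dirty;  /* changed since the previous read?  */
    const uint8_t *data;
};

size_t read_ctx_delta(struct ctx_field *f, size_t n, uint8_t *buf)
{
    size_t out = 0;
    for (size_t i = 0; i < n; i++) {
        if (!f[i].dirty)
            continue;
        memcpy(buf + out, &f[i].type, sizeof(uint16_t));
        memcpy(buf + out + 2, &f[i].len, sizeof(uint16_t));
        memcpy(buf + out + 4, f[i].data, f[i].len);
        out += 4 + f[i].len;
        f[i].dirty = false;   /* read-to-clear: next read returns less */
    }
    return out;
}

With this shape, the second read after Freeze returns only the fields dirtied since the Active-mode read.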

> >
> > > What's more important, PCI allows multiple common_cfgs. So the
> > > hypervisor can choose to reserve one common_cfg for live migration.
> > > In this case we don't have to read to clear semantics.
> > Common_cfg does not serve a large device context, nor does it serve DMA.
> 
> Well, I'd think e.g the address of the descriptor table is part of the device
> context, and it can be read some common_cfg.
It can be read, but we are talking about not saving 64 VQ tables, RSS, flow filters, and statistics all in some common config registers.

> 
> >
> > >
> > > Or, are you saying the value read from common_cfg is not device context?
> > The value of common config is part of the device context that represents
> current common config.
> >
> > > Isn't this conflict with your vague definition of device context?
> > >
> > You mentioned you stop at this patch,
> 
> Stop means stopping commenting.
> 
> > so likely you didn't read the device context patch, hence you call it vague.
> > So I don't know what you mean by vague.
> 
> So in this patch you define device context as:
> 
> "The member device context consist of any device specific data which is needed
> by the device to resume its operation"
> 
> So the address of the descriptor table satisfy this definition? If not, why?
> 
Address of the descriptor table is part of device context.

> > Please let me know what additional thing you want to see in the device
> context after you reach that patch.
> >
> >
> > > > We can opt for a mode where full device context is read in each
> > > > mode
> > > without clearing it.
> > > > But then it can be very specific to a version of qemu, which we
> > > > are avoiding
> > > here.
> > > >
> > > > > > 2. device context returns incremental value from the previous
> > > > > > read. So, it
> > > > > needs to clear it.
> > > > >
> > > > > I don't understand here. This is not the case for most of the devices.
> > > > >
> > > > Not sure which devices you mean here with "most of the devices".
> > > > Device context functions like a write record pages (aka dirty pages).
> > >
> > > It's definitely different. We want to migrate dirty pages lively
> > > which can consume a lot of bandwidth. So reporting delta makes a lot
> > > of sense here since it would have a lot of rounds of syncing and it
> > > doesn't result in blockers resuming.
> > >
> > Write records are reported as delta from the previous read.
> >
> > > For device context, how many rounds of syncing did you expect, and
> > > if we have N rounds, we need to restore N rounds in order to resume?
> > > Do you want to live migrating device states? If it's only 1 or 2 rounds, why
> bother?
> > >
> > Live migrate the device context. Typically in current software using it, it is 2
> rounds.
> 
> If it's just 2 rounds, why bother with delta? It is only helpful if we want to live
> migrate some device with giant states over several rounds, and in that case we
> should leave it as a device-specific state.
> 
The number of rounds matters less. The number of things a device needs to set up is large.
And a VM may have many devices, so just like pre-copy of dirty pages, it is an extension for the device state (context) to be pre-copied.
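A sketch of that two-round flow from the source hypervisor's side; the admin_* wrappers and mode constants are hypothetical stand-ins for the admin commands discussed in this series:

#include <stddef.h>

struct vdev;                                         /* hypothetical */
size_t admin_read_ctx(struct vdev *v, void *buf);    /* delta read   */
void   admin_set_mode(struct vdev *v, int mode);
void   send_to_destination(const void *buf, size_t len);

void precopy_device_ctx(struct vdev *v, void *buf)
{
    size_t n;

    n = admin_read_ctx(v, buf);         /* round 1: device Active,   */
    send_to_destination(buf, n);        /* guest still running       */

    admin_set_mode(v, 1 /* STOP */);    /* vCPUs already suspended   */
    admin_set_mode(v, 2 /* FREEZE */);  /* context can't change now  */

    n = admin_read_ctx(v, buf);         /* round 2: small final delta */
    send_to_destination(buf, n);
}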

> > The interface is generic that if needed more rounds are possible.
> >
> > Even devices for most practical purposes will implement 2 rounds.
> >
> > > And for the delta, how do you know you can easily define deltas for
> > > every type of device, especially the ones with complicated internal
> > > states? Defining states has already been demonstrated as a
> > > complicated task for some devices like virtio-FS and you want to complicate
> it furtherly?
> > >
> > What is your question? If you say virtio-fs has complicated state, maybe it
> should not have existed in the virtio spec in the first place.
> 
> We have more than just FS that can't work for live migration. Crypto and GPU
> are two other examples, and I'm pretty sure we have more.
> 
The industry has migrated GPU, RDMA, NVMe, virtio-net, and virtio-blk devices, and suspended/resumed GPU devices too.
So I can imagine it will happen as more devices adopt the device context.

> Until we figure out how they can, we can't say a device context work for all
> types. No?
> 
No. 
> 1) Trying to define a format that works for all types of devices
> 2) Leave the states to be defined by individual device types
> 
> Which method is easy?
> 
Best of both, i.e., generic fields for generic virtio items like VQ config, common config, and the device config area,
and device-specific context fields like RSS, flow filters, and counters.
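Illustratively (tag names and values are made up, not from the spec), such a split tag space could look like:

/* Hypothetical TLV tag split: generic virtio items in a common range,
 * device-type-specific items in their own range. */
enum ctx_tag {
    /* generic, common to all virtio device types */
    CTX_COMMON_CFG   = 0x0001,  /* device_status, negotiated features */
    CTX_VQ_CFG       = 0x0002,  /* one entry per virtqueue            */
    CTX_DEVICE_CFG   = 0x0003,  /* device configuration space area    */
    /* device-type specific, e.g. for virtio-net */
    CTX_NET_RSS      = 0x0100,
    CTX_NET_FLOW_FLT = 0x0101,
    CTX_NET_COUNTERS = 0x0102,
};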

> > But I beg to differ.
> > Virtio-fs guest-side state won't be changed as part of it.
> > Virtio-fs is the first device which has considered and listed migrating the
> device state.
> > So it should be possible.
> 
> I wouldn't repeat the discussion of virtio-FS migration here; you can search the
> archives for more details.
> 
> But the point is obvious: it's really hard to say a simple device context can work
> for all types of devices. We should allow device-specific state definitions. This
> seems to be agreed by Michael and LingShan.
> 
Sure. I covered this in v2 at [2].
Device specific state definitions will be able to grow.

[2] https://lists.oasis-open.org/archives/virtio-comment/202310/msg00190.html

> >
> > > What is proposed in this series is an ad-hoc optimization for a
> > > specific deivce type within a specific subsystem (e.g VFIO) in a
> > > specific operating system which is not the general.
> > >
> > Oh now you mention vfio. Not me. :)
> >
> > I am not going to comment on this. It is not ad-hoc.
> 
> You need to justify how it is not. Based on the current discussion, you have
> demonstrated a lot of assumptions in order to make your proposal work.
> 
Listed in v2 now.

> > It uses a technique similar to the dirty page tracking present in CPU hw and other
> devices.
> >
> > > As demsonsted many times, starting from something simple and stupid
> > > is the most easy way.
> > >
> >
> > > > Whatever is already returned is/should not be repeated in
> > > > subsequent reads,
> > > though device can choose to do so.
> > > >
> > > > > >
> > > > > > > > And which software stack may find this useful?
> > > > > > > > Is there any existing software that can utilize it?
> > > > > > >
> > > > > > > Libvirt.
> > > > > > >
> > > > > > Does libvirt restore on migration failure?
> > > > >
> > > > > Yes.
> > > > >
> > > > Ok. the device will be able to resume when it is marked active.
> > > > The device context returned  is the incremental delta as explained above.
> > >
> > > I disagree, see my above reply.
> > I replied above.
> >
> > >
> > > >
> > > > > >
> > > > > > > > Why that device context present with the software
> > > > > > > > vanished, in your
> > > > > > > assumption, if it is?
> > > > > > > >
> > > > > > > > > > Typically, on
> > > > > > > > > > +the source hypervisor, the owner driver reads the
> > > > > > > > > > +device context once when the device is in
> > > > > > > > > > +\field{Active} or \field{Stop} mode and later once
> > > > > > > > > > +the member device is in
> > > > > \field{Freeze} mode.
> > > > > > > > >
> > > > > > > > > Why need the read while device context could be changed?
> > > > > > > > > Or is the dirty page part of the device context?
> > > > > > > > >
> > > > > > > > It is not part of the dirty page.
> > > > > > > > It needs to read in the active/stop mode, so that it can
> > > > > > > > be shared with
> > > > > > > destination hypervisor, which will pre-setup the complex
> > > > > > > context of the device, while it is still running on the source side.
> > > > > > >
> > > > > > > Is such a method used by any hypervisor?
> > > > > > Yes. qemu which uses vfio interface uses it.
> > > > >
> > > > > Ok, such software technology could be used for all types of
> > > > > devices, I don't see any advantages to mention it here unless it's unique
> to virtio.
> > > > >
> > > > It is theory of operation that brings the clarity and rationale.
> > >
> > > I think it's not. Since it's not something that is unique to virtio.
> > >
> > > > So I will keep it.
> > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > > > +
> > > > > > > > > > +Typically, the device context is read and written one
> > > > > > > > > > +time on the source and the destination hypervisor
> > > > > > > > > > +respectively once the device is in \field{Freeze} mode.
> > > > > > > > > > +On the destination hypervisor, after writing the
> > > > > > > > > > +device context, when the device mode set to
> > > > > > > > > > +\field{Active}, the device uses the most recently set
> > > > > > > > > > +device context and resumes the device
> > > > > > > > > operation.
> > > > > > > > >
> > > > > > > > > There's no context sequence, so this is obvious. It's
> > > > > > > > > the semantic of all other existing interfaces.
> > > > > > > > >
> > > > > > > > Can you please say which existing interfaces you mean here?
> > > > > > >
> > > > > > > For any common cfg member. E.g queue_addr.
> > > > > > >
> > > > > > > The driver wrote 100 different values to queue_addr and the
> > > > > > > device used the value written last time.
> > > > > > >
> > > > > > o.k. I don't see any problem in stating what is done, which is
> > > > > > less vague. :)
> > > > > >
> > > > > > > >
> > > > > > > > > > +
> > > > > > > > > > +In an alternative flow, on the source hypervisor the
> > > > > > > > > > +owner driver may choose to read the device context
> > > > > > > > > > +first time while the device is in \field{Active} mode
> > > > > > > > > > +and second time once the device is in \field{Freeze}
> > > > > > > > > mode.
> > > > > > > > >
> > > > > > > > > Who is going to synchronize the device context with
> > > > > > > > > possible configuration from the driver?
> > > > > > > > >
> > > > > > > > Not sure I understand the question.
> > > > > > > > If I understand you right, do you mean that, When
> > > > > > > > configuration change is done by the guest driver, how does
> > > > > > > > device
> > > context change?
> > > > > > > >
> > > > > > >
> > > > > > > Yes.
> > > > > > >
> > > > > > > > If so, device context reading will reflect the new configuration.
> > > > > > >
> > > > > > > How do you do that? For example:
> > > > > > >
> > > > > > > static inline void vp_iowrite64_twopart(u64 val,
> > > > > > >                                         __le32 __iomem *lo,
> > > > > > >                                         __le32 __iomem *hi) {
> > > > > > >         vp_iowrite32((u32)val, lo);
> > > > > > >         vp_iowrite32(val >> 32, hi); }
> > > > > > >
> > > > > > > Is it ok to be freezed in the middle of two vp_iowrite()?
> > > > > > >
> > > > > > Yes. the device context
> VIRTIO_DEV_CTX_PCI_COMMON_RUNTIME_CFG
> > > > > section captures the partial value.
> > > > >
> > > > > There's no way for the device to know whether or not it's a
> > > > > partial value or
> > > not.
> > > > > No?
> > > > >
> > > > Device does not need to know, because when the guest VM and the
> > > > device are
> > > resumed on the destination, the guest VM will continue with
> > > writing the 2nd part.
> > > >
> > > > > >
> > > > > > > >
> > > > > > > > > > Similarly, on the
> > > > > > > > > > +destination hypervisor writes the device context
> > > > > > > > > > +first time while the device is still running in
> > > > > > > > > > +\field{Active} mode on the source hypervisor and
> > > > > > > > > > +writes the device context second time while the
> > > > > > > > > > +device is in
> > > > > > > > > \field{Freeze} mode.
> > > > > > > > > > +This flow may result in very short setup time as the
> > > > > > > > > > +device context likely have minimal changes from the
> > > > > > > > > > +previously written device
> > > > > > > context.
> > > > > > > > >
> > > > > > > > > Is the hypervisor who is in charge of doing the
> > > > > > > > > comparison and writing only the delta?
> > > > > > > > >
> > > > > > > > The spec commands allow doing so. So the possibility exists,
> > > > > > > > spec-
> > > wise.
> > > > > > >
> > > > > > > There are various optimizations for migration for sure, I
> > > > > > > don't think mentioning any specific one is good.
> > > > > > >
> > > > > > The text is informative text similar to,
> > > > > >
> > > > > > " However, some devices benefit from the ability to find out
> > > > > > the amount of available data in the queue without accessing
> > > > > > the virtqueue in
> > > > > memory"
> > > > > >
> > > > > > " To help with these optimizations, when
> > > > > > VIRTIO_F_NOTIFICATION_DATA has
> > > > > been negotiated".
> > > > > >
> > > > > > Is this the only optimization in virtio? No, but we still
> > > > > > mention the rationale of
> > > > > why it exists.
> > > > >
> > > > > The above is a good example as it explain
> > > > > VIRTIO_F_NOTIFICATION_DATA is the only way without accessing the
> > > > > virtqueue. But this is not the case of
> > > migration.
> > > > > You said it's just a possibility but not a must which is not the
> > > > > case for VIRTIO_F_NOTIFICATION_DATA.
> > > > >
> > > > It is one of the optimizations. The comparison is about being
> > > > one_of_example or
> > > not.
> > >
> > > I don't get this.
> > Theory of operation describes a flow of how things are done and how the
> constructs help achieve it.
> 
> Immature optimization doesn't belong in theory for sure. I see your delta
> reporting as immature in many ways. That's the point.
> 
> Thanks
> 
> > And it is not the end of the list.
> > That does not mean one should not write those.


