virtio-comment message

Subject: Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
From: "Michael S. Tsirkin" <mst@redhat.com>
To: Parav Pandit <parav@nvidia.com>
Date: Wed, 11 Oct 2023 16:14:00 -0400
On Wed, Oct 11, 2023 at 10:47:23AM +0000, Parav Pandit wrote:
> 
> 
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Wednesday, October 11, 2023 8:44 AM
> > 
> > On Tue, Oct 10, 2023 at 3:19âPM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > >
> > > > From: Jason Wang <jasowang@redhat.com>
> > > > Sent: Tuesday, October 10, 2023 11:21 AM
> > > >
> > > > On Mon, Oct 9, 2023 at 6:06âPM Parav Pandit <parav@nvidia.com> wrote:
> > > > >
> > > > > > From: Jason Wang <jasowang@redhat.com>
> > > > > > Sent: Monday, October 9, 2023 2:19 PM
> > > > > >
> > > > > > Adding LingShan.
> > > > > >
> > > > > Thanks for adding him.
> > > > >
> > > > > > Parav, if you want any specific people to comment, please do cc them.
> > > > > >
> > > > > Sure, will cc them in v2 as now I see there is interest in the review.
> > > > >
> > > > > > On Sun, Oct 8, 2023 at 7:26âPM Parav Pandit <parav@nvidia.com> wrote:
> > > > > > >
> > > > > > > One or more passthrough PCI VF devices are ubiquitous for
> > > > > > > virtual machines usage using generic kernel framework such as vfio [1].
> > > > > >
> > > > > > Mentioning a specific subsystem in a specific OS may mislead the
> > > > > > user to think it can only work in that setup. Let's not do that,
> > > > > > virtio is not only used for Linux and VFIO.
> > > > > >
> > > > > Not really. it is an example in the cover letter.
> > > > > It is not the only use case.
> > > > > A use case gives a crisp clarity of what UAPI it needs to fulfil.
> > > > > So I will keep it. It is anyway written as one use case.
> > > > >
> > > > > > >
> > > > > > > A passthrough PCI VF device is fully owned by the virtual
> > > > > > > machine device driver.
> > > > > >
> > > > > > Is this true? Even VFIO needs to mediate PCI stuff. Or how do
> > > > > > you define "passthrough" here?
> > > > > >
> > > > > Other than PCI config registers and due to some legacy, msix.
> > > > > The "device interface" side is not mediated.
> > > > > The definition of passthrough here is: To not mediate a device
> > > > > type specific
> > > > and virtio specific interfaces for modern and future devices.
> > > >
> > > > Ok, but what's the difference between "device type specific" and
> > > > "virtio specific interfaces". Maybe an example for this?
> > > >
> > > Virtio device specific means: cvq of crypto device, cvq of net device, flow filter
> > vqs of net device etc.
> > > Virtio specific interface: virtio driver notifications, virtio virtqueue and
> > configuration mediation etc.
> > >
> > > > >
> > > > > > > This passthrough device controls its own device reset flow,
> > > > > > > basic functionality as PCI VF function level reset
> > > > > >
> > > > > > How about other PCI stuff? Or Why is FLR special?
> > > > > FLR is special for the readers to get the clarity that FLR is also
> > > > > done by the
> > > > guest driver hence, the device migration commands do not
> > > > interact/depend with FLR flow.
> > > >
> > > > It's still not clear to me how this is done.
> > > >
> > > > 1) guest starts FLR
> > > > 2) adminq freeze the VF
> > > > 3) FLR is done
> > > >
> > > > If the freezing doesn't wait for the FLR, does it mean we need to
> > > > migrate to a state like FLR is pending? If yes, do we need to
> > > > migrate the other sub states like this? If not, why?
> > > >
> > > In most practical cases #2 followed by #1 should not happen as on the source
> > side the expected is mode change to stop from active.
> > 
> > How does the hypervisor know if a guest is doing what without trapping?
> >
> Hypervisor does not know. The device knows being the recipient of #1 and #2.
>  
> > > But ok, since we active to freeze mode change is allowed, lets discuss above.
> > >
> > > A device is the single synchronization point for any device reset, FLR or admin
> > command operation.
> > 
> > So you agree we need synchronization? And I'm not sure I get the meaning of
> > synchronization point, do you mean the synchronization between freeze/stop
> > and virtio facilities?
> >
> Synchronization means, handling two events in parallel such as FLR and other.
>  
> > > So, the migration driver do not need to wait for FLR to complete.
> > 
> > I'm confused, you said below that device context could be changed by FLR.
> > 
> Yes.
> > If FLR needs to clear device context, we can have a race where device context is
> > cleared when we are trying to read it?
> > 
> I didnât say clear the context.
> FLR updates the device context.
> Device is serving the device context read write commands, serving FLR, answering mode change command,
> So device knows the best how to avoid any race.

Heh well but if drivers depend on specific behaviour then we really
need to document that in the spec.

> > > When admin cmd freeze the VF it can expect FLR_completed VF.
> > 
> > We need to explain why and how about the resume? For example, is resuming
> > required to wait for the completion of FLR, if not, why?
> > 
> > > Secondly since the FLR is local to the source, intermediate sub state does not
> > migrate.
> > >
> > > But I agree, it is worth to have the text capturing this.
> > >
> > > > >
> > > > > >
> > > > > > > and rest of the virtio device functionality such as control
> > > > > > > vq,
> > > > > >
> > > > > > What do you mean by "rest of"?
> > > > > >
> > > > > As given in the example cvq.
> > > > >
> > > > > > Which part is not controlled and why?
> > > > > Not controlled because as states, it is passthrough device.
> > > > >
> > > > > > > config space access, data path descriptors handling.
> > > > > > >
> > > > > > > Additionally, VM live migration using a precopy method is also
> > > > > > > widely
> > > > used.
> > > > > >
> > > > > > Why is this mentioned here?
> > > > > >
> > > > > Huh. You should be positive for bringing clarity to the readers on
> > > > understanding the use case.
> > > > > And you seem opposite, but ok.
> > > > >
> > > > > As stated, it for the reader to understand the use case and see
> > > > > how proposed
> > > > commands addresses the use case.
> > > >
> > > > The problem is that the hardware features should be designed for a
> > > > general purpose instead of a specific technology if it can. The only
> > > > missing part for post copy is the page fault.
> > > >
> > > Ok. The use case and requirement of member device passthrough is clear to
> > most reviewers now.
> > 
> > In another thread you are saying that the PCI composition is done by hypervisor,
> > so passthrough is really confusing at least for me.
> >
> I explained there what vPCI composition is done there.
> PCI config space and msix side of composition is done.
> The whole virtio interface is not composed.
>  
> > > Ok. I assume "reset flow" is clear to you now that it points to section 2.4.
> > > This section is not normative section, so using an extra word like "flow" does
> > not confuse anyone.
> > > I will link to the section anyway.
> > 
> > Probably, but you mention FLR flow as well.
> As I said, not repeating the PCIe spec here. The reader knows what FLR of the PCIe transport.

What I worry about however, is what happens for example if FLR
is triggered while an admin command is in progress.
This applies to things like legacy admin commands by the way.


> > 
> > >
> > > > >
> > > > > > > and may also undergo PCI function level
> > > > > > > +reset(FLR) flow.
> > > > > >
> > > > > > Why is only FLR special here? I've asked FRS but you ignore the question.
> > > > > >
> > > > > FLR is special to bring clarity that guest owns the VF doing FLR,
> > > > > hence
> > > > hypervisor cannot mediate any registers of the VF.
> > > >
> > > > It's not about mediation at all, it's about how the device can
> > > > implement what you want here correctly.
> > > >
> > > > See my above question.
> > > >
> > > Ok. it is clear that live migration commands cannot stay on the member device
> > because the member device can undergo device reset and FLR flows owned by
> > the guest.
> > 
> > I disagree, hypervisors can emulate FLR and never send FLR to real devices.
> > 
> That would be some other trap alternative that needs to dissect the device and build infrastructure for such dissection is not desired in the listed use case.
> Here we are addressing the requirement of passthrough the device.
> 
> So your disagreement is fine for non-passthrough devices.
> 
> > > (and hypervisor is not involved in these two flows, hence the admin command
> > interface is designed such that it can fullfil above requirements).
> > >
> > > Theory of operation brings out this clarity. Please notice that it is in
> > introductory section with an example.
> > > Not normative line.
> > >
> > > > >
> > > > > > > Such flows must comply to the PCI standard and also
> > > > > > > +virtio specification;
> > > > > >
> > > > > > This seems unnecessary and obvious as it applies to all other
> > > > > > PCI and virtio functionality.
> > > > > >
> > > > > Great. But your comment is contradicts.
> > > > >
> > > > > > What's more, for the things that need to be synchronized, I
> > > > > > don't see any descriptions in this patch. And if it doesn't need, why?
> > > > > With which operation should it be synchronized and why?
> > > > > Can you please be specific?
> > > >
> > > > See my above question regarding FLR. And it may have others which I
> > > > haven't had time to audit.
> > > >
> > > Ok. when you get chance to audit, lets discuss that time.
> > 
> > Well, I'm not the author of this series, it should be your job otherwise it would
> > be too late.
> > 
> As author, what we think, I will cover. If you have specific points to add value, please share, I will look into it.
> 
> > For example, how is the power management interaction with the freeze/stop?
> >
> Power management is owned by the guest, like any other virtio interface.
> So freeze/stop do not interfere with it.

I am not sure what exactly all this means though.
Should be clarified, in some way.

> > >
> > > > >
> > > > > It is not written in this series, because we believe it must not
> > > > > be synchronized
> > > > as it is fully controlled by the guest.
> > > > >
> > > > > >
> > > > > > > at the same time such flows must not obstruct
> > > > > > > +the device migration flow. In such a scenario, a group owner
> > > > > > > +device can provide the administration command interface to
> > > > > > > +facilitate the device migration related operations.
> > > > > > > +
> > > > > > > +When a virtual machine migrates from one hypervisor to
> > > > > > > +another hypervisor, these hypervisors are named as source and
> > > > > > > +destination
> > > > > > hypervisor respectively.
> > > > > > > +In such a scenario, a source hypervisor administers the
> > > > > > > +member device to suspend the device and preserves the device
> > context.
> > > > > > > +Subsequently, a destination hypervisor administers the member
> > > > > > > +device to setup a device context and resumes the member device.
> > > > > > > +The source hypervisor reads the member device context and the
> > > > > > > +destination hypervisor writes the member device context. The
> > > > > > > +method to transfer the member device context from the source
> > > > > > > +to the destination hypervisor is
> > > > > > outside the scope of this specification.
> > > > > > > +
> > > > > > > +The member device can be in any of the three migration modes.
> > > > > > > +The owner driver sets the member device in one of the
> > > > > > > +following modes during
> > > > > > device migration flow.
> > > > > > > +
> > > > > > > +\begin{tabularx}{\textwidth}{ |l||l|X| } \hline Value & Name
> > > > > > > +& Description \\ \hline \hline
> > > > > > > +0x0   & Active &
> > > > > > > +  It is the default mode after instantiation of the member
> > > > > > > +device. \\
> > > > > >
> > > > > > I don't think we ever define "instantiation" anywhere.
> > > > > >
> > > > > Well a transport has implicit definition of the instantiation already.
> > > > > May be a text can be added, but donât see a value in duplicating
> > > > > PCI spec
> > > > here.
> > > >
> > > > Ok, maybe something like "transport specific instantiation"
> > > >
> > > Ok. thatâs a good text. I will change to it.
> > >
> > > > >
> > > > > > > +\hline
> > > > > > > +0x1   & Stop &
> > > > > > > + In this mode, the member device does not send any
> > > > > > > +notifications, and it does not access any driver memory.
> > > > > >
> > > > > > What's the meaning of "driver memory"?
> > > > > >
> > > > > May be guest memory? Or do you suggest a better naming for the
> > > > > memory
> > > > allocated by the guest driver?
> > > >
> > > > Virtqueue?
> > > >
> > > Virtqueue and any memory referred by the virtqueue.
> > >
> > > This is good text, I will change to it.
> > >
> > > > >
> > > > > > And stop seems to be a source of inflight buffers.
> > > > > >
> > > > > I didnât follow it.
> > > > > If you mean without stop there are no inflight buffer, then I donât agree.
> > > > > We donât want to violate the spec by having descriptors with zero
> > > > > size
> > > > returned.
> > > > > Stop is not the source of inflight descriptors.
> > > >
> > > > I think not since you forbid access to the used ring here. So even
> > > > if the buffer were processed by the device it can't be added back to
> > > > the used ring thus became inflight ones.
> > > >
> > > > >
> > > > > There are inflight descriptors with the device that are not yet
> > > > > returned to the
> > > > driver, and device wont return them as zero size wrong completions.
> > > > >
> > > > > > > + The member device may receive driver notifications in this
> > > > > > > + mode,
> > > > > >
> > > > > > What's the meaning of "receive"? For example if the device can
> > > > > > still process buffers, "stop" is not accurate.
> > > > > >
> > > > > Receive means, driver can send the notification as PCIe TLP that
> > > > > device may
> > > > receive as incoming PCIe TLP.
> > > >
> > > > Ok, so this is the transport level. But the device can keep processing the
> > queue?
> > > >
> > > Device cannot process the queue because it does not initiate any read/write
> > towards the virtqueue.
> > 
> > Read/Write only results in a driver noticeable behaviour, it doesn't mean the
> > device can't process the buffers.  For example, devices can keep processing
> > available buffers and make them as inflight ones.
> > 
> The idea is to stop the device and prepare for the migration, so the command to do so.
> Otherwise just the keep the device in active mode and avoid the complications.
> 
> > >
> > > > >
> > > > > In "stop" mode, the device wont process descriptors.
> > > >
> > > > If the device won't process descriptors, why still allow it to receive
> > notifications?
> > > Because notification may still arrive and if the device may update any
> > > counters as part of
> > 
> > Which counters did you mean here?
> >
> The counter that Xuan is adding and any other state that device may have to update as result of driver notification.
> For example caching the posted avail index in the notification.
>  
> > > it which needs to be migrated or store the received notification.
> > >
> > > > Or does it really matter if the device can receive or not here?
> > > >
> > > From device point of view, the device is given the chance to update its device
> > context as part of notifications or access to it.
> > 
> > This is in conflict with what you said above " Device cannot process the queue
> > ..."
> > 
> No, it does not.
> Device context is updated within the device without accessing the queue memory of the guest.
> 
> > Maybe you can give a concrete example.
> > 
> The above one.
> 
> > >
> > > > >
> > > > > > > + the member device context
> > > > > >
> > > > > > I don't think we define "device context" anywhere.
> > > > > >
> > > > > It is defined further in the description.
> > > >
> > > > Like this?
> > > >
> > > > """
> > > >  +The member device has a device context which the owner driver can
> > > > +either read or write. The member device context consist of any
> > > > device  +specific data which is needed by the device to resume its
> > > > operation  +when the device mode """
> > > >
> > > Yes.
> > > Further patch-3 adds the device context and also add the link to it in the
> > theory of operation section so reader can read more detail about it.
> > >
> > > > "Any" is probably too hard for vendors to implement. And in patch 3
> > > > I only see virtio device context. Does this mean we don't need
> > > > transport
> > > > (PCI) context at all? If yes, how can it work?
> > > >
> > > Right. PCI member device is present at source and destination with its layout,
> > only the virtio device context is transferred.
> > > Which part cannot work?
> > 
> > It is explained in another thread where you are saying the PCI requires
> > mediation. I think any author should not ignore such important assumptions in
> > both the change log and the patch.
> > 
> > And again, the more I review the more I see how narrow this series can be used:
> >
> I explained this before and also covered in the cover letter.
>  
> > 1) Only works for SR-IOV member device like VF
> It can be extended to SIOV member device in future.
> Today these are the only type of member device virtio has.
> 
> > 2) Mediate PCI but not virtio which is tricky
> > 3) Can only work for a specific BAR/capability register layout
> > 
> > Only 1) is described in the change log.
> > 
> > The other important assumptions like 2) and 3) are not documented anywhere.
> > And this patch never explains why 2) and 3) is needed or why it can be used for
> > subsystems other than VFIO/Linux.
> >
> Since I am not mentioning vfio now, I will refrain from mentioning others as well. :)
>  
> > >
> > > > >
> > > > > > >and device configuration space may change. \\
> > > > > > > +\hline
> > > > > >
> > > > > > I still don't get why we need a "stop" state in the middle.
> > > > > >
> > > > > All pci devices which belong to a single guest VM are not stopped
> > atomically.
> > > > > Hence, one device which is in freeze mode, may still receive
> > > > > driver notifications from other pci device,
> > > >
> > > > Device may choose to ignore those notifications, no?
> > > >
> > > > > or it may experience a read from the shared memory and get garbage
> > data.
> > > >
> > > > Could you give me an example for this?
> > > >
> > > Section 2.10 Shared Memory Regions.
> > 
> > How can it experience a read in this case?
> >
> MMIO read/write can be initiated by the peer device while the device is in stopped state.

worth mentioning

> > Btw, shared regions are tricky for hardware.
> > 
> > >
> > > > > And things can break.
> > > > > Hence the stop mode, ensures that all the devices get enough
> > > > > chance to stop
> > > > themselves, and later when freezed, to not change anything internally.
> > > > >
> > > > > > > +0x2   & Freeze &
> > > > > > > + In this mode, the member device does not accept any driver
> > > > > > > +notifications,
> > > > > >
> > > > > > This is too vague. Is the device allowed to be freezed in the
> > > > > > middle of any virtio or PCI operations?
> > > > > >
> > > > > > For example, in the middle of feature negotiation etc. It may
> > > > > > cause implementation specific sub-states which can't be migrated easily.
> > > > > >
> > > > > Yes. it is allowed in middle of feature negotiation, for sure.
> > > > > It is passthrough device, hence hypervisor layer do not get to see sub-
> > state.
> > > > >
> > > > > Not sure why you comment, why it cannot be migrated easily.
> > > > > The device context already covers this sub-state.
> > > >
> > > > 1) driver writes driver_features
> > > > 2) driver sets FEAUTRES_OK
> > > >
> > > > 3) device receive driver_features
> > > > 4) device validating driver_features
> > > > 5) device clears FEATURES_OK
> > > >
> > > > 6) driver read stats and realize FEATURES_OK is being cleared
> > > >
> > > > Is it valid to be frozen of the above?
> > > No. device mode is frozen when hypervisor is sure that no more access by the
> > guest will be done.
> > 
> > How, you don't trap so 1) and 2) are posted, how can hypervisor know if there's
> > inflight transactions to any registers?
> > 
> Because hypervisor has stopped the vcpus which are issuing them.
> 
> > > What can happen between #2 and #3, is device mode may change to stop.
> > 
> > Why can't be freezed in this case? It's really hard to deduce why it can't just
> > from your above descriptions.
> >
> On the source hypervisor, the mode changes are active->stop->freeze.
> Hence when freeze is done, the hypervisor knows that all inflight has been stopped by now.
>  
> > Even if it had, is it even possible to list all the places where freezing is
> > prohibited? We don't want to end up with a spec that is hard to implement or
> > leave the vendor to figure out those tricky parts.
> >
> The general idea is not prohibiting the freeze/stop mode.
> If the device needs more time, let device take time to do it.
> 
>  
> > > And in stop mode, device context would capture #5 or #4, depending where is
> > device at that point.
> > >
> > > > >
> > > > > > And what's more, the above state machine seems to be virtio
> > > > > > specific, but you don't explain the interaction with the device
> > > > > > status state
> > > > machine.
> > > > > First, above is not a state machine.
> > > >
> > > > So how do readers know if a state can go to another state and when?
> > > >
> > > Not sure what you mean by reader. Can you please explain.
> > 
> > The people who read virtio spec.
> > 
> So question is "how reader knows if a state can go to another state and when"?
> It is described and listed in the table, when a mode can change.
> 
> > > > So only the driver notification is allowed by not config write?
> > > > What's the consideration for allowing driver notification?
> > > >
> > > Because for most practical purposes, peer device wants to queue blk, net
> > other requests and not do device configuration.
> > 
> > You forbid the device to process the queue but only allow the notification. How
> > can the device queue those requests? The device can just do the available
> > buffer check after resume, then it's all fine.
> >
> Device can always decide to not queue the request and do the available buffer check later.
> The peer device may read also from MMIO space.
> 
> So the intermediate step covers this aspect where device_type specific plumbing is not done.
> Its generic. A device may choose to omit such doorbells as well as long as it knows it can resume.

all this is kind of vague ... should be in the spec.

> > >
> > > Do you know any device configuration space which is RW?
> > > For net and blk I recall it as RO?
> > 
> > For example, WCE. What's more important, the spec allows config space to be
> > RW, so even if there's no examples before, it doesn't mean we won't have a RW
> > in the future.
> > 
> Ok.
> 
> > >
> > > > Let me ask differently, similar to FLR, what happens if the driver
> > > > wants a virtio reset but the hypervisor wants to stop or freeze?
> > > >
> > > The device would respond to stop/freeze request when it has internally
> > started the reset, as device is the single synchronization point which knows how
> > to handle both in parallel.
> > 
> > Let's define the synchronization point first. And it demonstrates at least devices
> > need to synchronize between the free/stop and virtio device status machine
> > which is not as easy as what is done in this patch.
> >
> Synchronization point = device.

Then we need to spec device behaviour.

> > >
> > > > > We would enrich the device context for this, but no need to
> > > > > connects the
> > > > admin mode controlled by the owner device with operational state
> > > > (device_status) owned by the member device.
> > > > >
> > > > > > > + it ignores any device configuration space writes,
> > > > > >
> > > > > > How about read and the device configuration changes?
> > > > > >
> > > > > As listed, device do not have any changes.
> > > > > So device configuration change cannot occur.
> > > >
> > > > It's not necessarily caused by config write, it could be things like
> > > > link status or geometry changes that are initiated from the device.
> > > >
> > > I understand it. Link status was one example, you listed other examples too.
> > > The point is, when in freeze mode, the member device is frozen, hence,
> > device won't initiate those changes.
> > >
> > > > >
> > > > > The device requirements cover this content more explicitly:
> > > > >
> > > > > For the SR-IOV group type, regardless of the member device mode,
> > > > > all the PCI transport level registers MUST be always accessible
> > > > > and the member device MUST function the same way for all the PCI
> > > > > transport level
> > > > registers regardless of the member device mode.
> > > > >
> > > > > > > + the device do not have any changes in the device context.
> > > > > > > + The member device is not accessed in the system through the
> > > > > > > + virtio
> > > > interface.
> > > > > > > + \\
> > > > > >
> > > > > > But accessible via PCI interface?
> > > > > >
> > > > > Yes, as usual.
> > > > >
> > > > > > For example, what happens if we want to freeze during FLR? Does
> > > > > > the hypervisor need to wait for the FLR to be completed?
> > > > > >
> > > > > Hypervisor do not need wait for the FLR to be completed.
> > > >
> > > > So does FLR change device context?
> > > Yes.
> > 
> > So this implies the freeze needs to wait for FLR otherwise device context may
> > change.
> > 
> Device context can change anytime and reflect what is latest.
> I will update the patches to reflect that device is the single synchronization point serving flr, mode changes.
> 
> > >
> > > >
> > > > >
> > > > > > > +\hline
> > > > > > > +\hline
> > > > > > > +0x03-0xFF   & -    & reserved for future use \\
> > > > > > > +\hline
> > > > > > > +\end{tabularx}
> > > > > > > +
> > > > > > > +When the owner driver wants to stop the operation of the
> > > > > > > +device, the owner driver sets the device mode to
> > > > > > > +\field{Stop}. Once the device is in the \field{Stop} mode,
> > > > > > > +the device does not initiate any notifications or does not
> > > > > > > +access any driver memory. Since the member driver may be
> > > > > > > +still active which may send further driver notifications to the device,
> > the device context may be updated.
> > > > > > > +When the member driver has stopped accessing the device, the
> > > > > > > +owner driver sets the device to \field{Freeze} mode
> > > > > > > +indicating to the device that no more driver access occurs.
> > > > > > > +In the \field{Freeze} mode, no more changes occur in the device
> > context.
> > > > > > > +At this point, the device ensures that
> > > > > > there will not be any update to the device context.
> > > > > >
> > > > > > What is missed here are:
> > > > > >
> > > > > > 1) it is a virtio specific states or not
> > > > > It is not.
> > > > >
> > > > > > 2) if it is a virtio specific state, if or how to synchronize
> > > > > > with transport specific interfaces and why
> > > > > > 3) can active go directly to freeze and why
> > > > > >
> > > > > Yes. donât see a reason to not allow it.
> > > > > Active to freeze mode can change is useful on the destination
> > > > > side, where
> > > > destination hypervisor knows for sure that there is no other entity
> > > > accessing the device.
> > > > > And it needs to setup the device context, it received from the source side.
> > > > > So setting freeze mode can be done directly.
> > > > >
> > > > > > > +
> > > > > > > +The member device has a device context which the owner driver
> > > > > > > +can either read or write. The member device context consist
> > > > > > > +of any device specific data which is needed by the device to
> > > > > > > +resume its operation when the device mode
> > > > > >
> > > > > > This is too vague. There're states that are not suitable for
> > > > > > cmd/queue for
> > > > sure.
> > > > > > I'd split it into
> > > > > >
> > > > > > 1) common states: virtqueue, dirty pages
> > > > > > 2) device specific states: defined be each device
> > > > > >
> > > > > This is theory of operation section. So it capturing such details.
> > > > > Actual device context definition is outside of theory, and precise
> > > > > states of
> > > > virtqueue, device specific, etc are in it.
> > > >
> > > > See my comment above regarding to the device context.
> > > >
> > > I replied above, device context link is added in the patch-3 in the theory of
> > operation.
> > > So reader gets the complete view.
> > >
> > > > >
> > > > > > > +is changed from \field{Stop} to \field{Active} or from
> > > > > > > +\field{Freeze} to \field{Active}.
> > > > > > > +
> > > > > > > +Once the device context is read, it is cleared from the device.
> > > > > >
> > > > > > This is horrible, it means we can't easily
> > > > > >
> > > > > > 1) re-try the migration
> > > > > > 2) recover from migration failure
> > > > > >
> > > > > Can you please explain the flow?
> > > >
> > > > When migration fails, management can choose to resume the device(VM)
> > > > on the source.
> > > >
> > > ok. This should be possible as the management which has the device
> > > context, it can restore it on the source and move the device mode to active.
> > >
> > > > If the state were cleared, it means there's not simple way to resume
> > > > the device but restoring the whole context.
> > > >
> > > Yes, as you say, by restoring the whole context will suffice this corner/rare
> > case scenario.
> > >
> > > > What's the consideration for such clearing?
> > > >
> > > There are two considerations.
> > > 1.  If one does not clear, till how long should it be kept on the device?
> > 
> > Until virtio reset, this is how virtio works now. I've pointed out that it may cause
> > extra troubles when trying to resume, but you don't tell me what's wrong to
> > keep that?
> > 
> If kept, hypervisor may not be able to decide when to change the mode from active->stop.
> We can opt for a mode where full device context is read in each mode without clearing it.
> But than it can be very specific to a version of qemu, which we are avoiding it here.
> 
> > > 2. device context returns incremental value from the previous read. So, it
> > needs to clear it.
> > 
> > I don't understand here. This is not the case for most of the devices.
> >
> Not sure which devices you mean here with "most of the devices".
> Device context functions like a write record pages (aka dirty pages).
> Whatever is already returned is/should not be repeated in subsequent reads, though device can choose to do so.
>  
> > >
> > > > > And which software stack may find this useful?
> > > > > Is there any existing software that can utilize it?
> > > >
> > > > Libvirt.
> > > >
> > > Does libvirt restore on migration failure?
> > 
> > Yes.
> > 
> Ok. the device will be able to resume when it is marked active.
> The device context returned  is the incremental delta as explained above.
> 
> > >
> > > > > Why that device context present with the software vanished, in
> > > > > your
> > > > assumption, if it is?
> > > > >
> > > > > > > Typically, on
> > > > > > > +the source hypervisor, the owner driver reads the device
> > > > > > > +context once when the device is in \field{Active} or
> > > > > > > +\field{Stop} mode and later once the member device is in
> > \field{Freeze} mode.
> > > > > >
> > > > > > Why need the read while device context could be changed? Or is
> > > > > > the dirty page part of the device context?
> > > > > >
> > > > > It is not part of the dirty page.
> > > > > It needs to read in the active/stop mode, so that it can be shared
> > > > > with
> > > > destination hypervisor, which will pre-setup the complex context of
> > > > the device, while it is still running on the source side.
> > > >
> > > > Is such a method used by any hypervisor?
> > > Yes. qemu which uses vfio interface uses it.
> > 
> > Ok, such software technology could be used for all types of devices, I don't see
> > any advantages to mention it here unless it's unique to virtio.
> > 
> It is theory of operation that brings the clarity and rationale.
> So I will keep it.
> 
> > >
> > > >
> > > > >
> > > > > > > +
> > > > > > > +Typically, the device context is read and written one time on
> > > > > > > +the source and the destination hypervisor respectively once
> > > > > > > +the device is in \field{Freeze} mode. On the destination
> > > > > > > +hypervisor, after writing the device context, when the device
> > > > > > > +mode set to \field{Active}, the device uses the most recently
> > > > > > > +set device context and resumes the device
> > > > > > operation.
> > > > > >
> > > > > > There's no context sequence, so this is obvious. It's the
> > > > > > semantic of all other existing interfaces.
> > > > > >
> > > > > Can you please what which existing interfaces do you mean here?
> > > >
> > > > For any common cfg member. E.g queue_addr.
> > > >
> > > > The driver wrote 100 different values to queue_addr and the device
> > > > used the value written last time.
> > > >
> > > o.k. I donât see any problem in stating what is done, which is less
> > > vague. ð
> > >
> > > > >
> > > > > > > +
> > > > > > > +In an alternative flow, on the source hypervisor the owner
> > > > > > > +driver may choose to read the device context first time while
> > > > > > > +the device is in \field{Active} mode and second time once the
> > > > > > > +device is in \field{Freeze}
> > > > > > mode.
> > > > > >
> > > > > > Who is going to synchronize the device context with possible
> > > > > > configuration from the driver?
> > > > > >
> > > > > Not sure I understand the question.
> > > > > If I understand you right, do you mean that, When configuration
> > > > > change is done by the guest driver, how does device context change?
> > > > >
> > > >
> > > > Yes.
> > > >
> > > > > If so, device context reading will reflect the new configuration.
> > > >
> > > > How do you do that? For example:
> > > >
> > > > static inline void vp_iowrite64_twopart(u64 val,
> > > >                                         __le32 __iomem *lo,
> > > >                                         __le32 __iomem *hi) {
> > > >         vp_iowrite32((u32)val, lo);
> > > >         vp_iowrite32(val >> 32, hi); }
> > > >
> > > > Is it ok to be freezed in the middle of two vp_iowrite()?
> > > >
> > > Yes. the device context VIRTIO_DEV_CTX_PCI_COMMON_RUNTIME_CFG
> > section captures the partial value.
> > 
> > There's no way for the device to know whether or not it's a partial value or not.
> > No?
> > 
> Device does not need to know, because when the guest vm and the device is resumed on the destination, it the guest vm will continue with writing the 2nd part.
> 
> > >
> > > > >
> > > > > > > Similarly, on the
> > > > > > > +destination hypervisor writes the device context first time
> > > > > > > +while the device is still running in \field{Active} mode on
> > > > > > > +the source hypervisor and writes the device context second
> > > > > > > +time while the device is in
> > > > > > \field{Freeze} mode.
> > > > > > > +This flow may result in very short setup time as the device
> > > > > > > +context likely have minimal changes from the previously
> > > > > > > +written device
> > > > context.
> > > > > >
> > > > > > Is the hypervisor who is in charge of doing the comparison and
> > > > > > writing only the delta?
> > > > > >
> > > > > The spec commands allow to do so. So possibility exists from spec wise.
> > > >
> > > > There are various optimizations for migration for sure, I don't
> > > > think mentioning any specific one is good.
> > > >
> > > The text is informative text similar to,
> > >
> > > " However, some devices benefit from the ability to find out the
> > > amount of available data in the queue without accessing the virtqueue in
> > memory"
> > >
> > > " To help with these optimizations, when VIRTIO_F_NOTIFICATION_DATA has
> > been negotiated".
> > >
> > > Is this the only optimization in virtio? No, but we still mention the rationale of
> > why it exists.
> > 
> > The above is a good example as it explain VIRTIO_F_NOTIFICATION_DATA is the
> > only way without accessing the virtqueue. But this is not the case of migration.
> > You said it's just a possibility but not a must which is not the case for
> > VIRTIO_F_NOTIFICATION_DATA.
> > 
> It is one of the optimization apart. The comparison is of one_of_example or not.
>
Follow-Ups:
- RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  - From: Parav Pandit <parav@nvidia.com>
References:
- [PATCH v1 0/8] Introduce device migration support commands
  - From: Parav Pandit <parav@nvidia.com>
- [PATCH v1 1/8] admin: Add theory of operation for device migration
  - From: Parav Pandit <parav@nvidia.com>
- Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  - From: Jason Wang <jasowang@redhat.com>
- RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  - From: Parav Pandit <parav@nvidia.com>
- Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  - From: Jason Wang <jasowang@redhat.com>
- RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  - From: Parav Pandit <parav@nvidia.com>
- Re: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  - From: Jason Wang <jasowang@redhat.com>
- RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration
  - From: Parav Pandit <parav@nvidia.com>