
Subject: RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration


> From: Jason Wang <jasowang@redhat.com>
> Sent: Friday, October 13, 2023 6:46 AM

[..]
> > > > > It's still not clear to me how this is done.
> > > > >
> > > > > 1) guest starts FLR
> > > > > 2) adminq freeze the VF
> > > > > 3) FLR is done
> > > > >
> > > > > If the freezing doesn't wait for the FLR, does it mean we need
> > > > > to migrate to a state like FLR is pending? If yes, do we need to
> > > > > migrate the other sub states like this? If not, why?
> > > > >
> > > > In most practical cases #2 followed by #1 should not happen, as on
> > > > the source side the expected transition is a mode change from active to stop.
> > >
> > > How does the hypervisor know if a guest is doing what without trapping?
> > >
> > The hypervisor does not know. The device knows, being the single recipient of both #1 and #2.
> 
> We are discussing the possibility in software/driver side isn't it?
> 
> 1) is initiated from the guest
> 2) is initiated from the hypervisor
> 
> Both are software, and you're saying 2) should not happen after 1) since the
> device knows what is being done by guests? How can devices control software
> behaviour?
> 
The device does not control software behavior.
That is, either the hypervisor can initiate a device mode change to stop (not freeze), or the guest can initiate FLR.
The device knows which was initiated first, being the single recipient of both, and responds accordingly.
For example, in the sequence you described, the device will delay the mode change command response until the FLR is completed.
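
To sketch what I mean (purely illustrative device-internal logic; the struct and helper names are mine, not from the spec):

#include <stdbool.h>
#include <stddef.h>

struct admin_cmd { unsigned char new_mode; };
struct vf_state {
        bool flr_in_progress;
        unsigned char mode;
        struct admin_cmd *pending_mode_cmd;
};

void complete_admin_cmd(struct admin_cmd *cmd);  /* hypothetical helper */

/* The device is the single recipient of both FLR and the admin
 * mode-change command, so it can serialize them internally. */
void handle_mode_change(struct vf_state *vf, struct admin_cmd *cmd)
{
        if (vf->flr_in_progress) {
                vf->pending_mode_cmd = cmd;     /* respond after FLR */
                return;
        }
        vf->mode = cmd->new_mode;
        complete_admin_cmd(cmd);
}

void on_flr_done(struct vf_state *vf)
{
        vf->flr_in_progress = false;
        if (vf->pending_mode_cmd) {             /* deferred mode change */
                vf->mode = vf->pending_mode_cmd->new_mode;
                complete_admin_cmd(vf->pending_mode_cmd);
                vf->pending_mode_cmd = NULL;
        }
}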


> The only possible thing is to make sure 3) is done before 2). That is what I'm
> asking, but you are saying freeze doesn't need to wait for FLR...
> 
I think I responded further down in the previous email that the synchronization point is the firmware.
I meant to say that software does not need to wait before initiating the freeze mode command.
The command will simply complete at the right time.

This is in any case a very rare corner case.
On the source hypervisor, as written in the theory of operation, the sequence is active->stop->freeze.
When the mode change to stop is done, the vCPUs are already suspended.

I agree an FLR may have been initiated and the driver is now waiting for 100 msec.

So yes, the device, as the single entity, synchronizes it.
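
For clarity, the source-side flow I have in mind looks roughly like this (a hypervisor-side sketch; all helper names and mode values are illustrative, not from the spec):

struct vm;
struct member_dev;
enum { VIRTIO_MODE_STOP = 1, VIRTIO_MODE_FREEZE = 2 };  /* illustrative values */

void suspend_vcpus(struct vm *vm);                      /* hypothetical VMM helper */
void set_device_mode(struct member_dev *dev, int mode); /* admin command wrapper */
void read_device_context(struct member_dev *dev);

void migrate_source(struct vm *vm, struct member_dev *dev)
{
        suspend_vcpus(vm);          /* no new guest MMIO (or FLR) after this */
        set_device_mode(dev, VIRTIO_MODE_STOP);   /* waits out any in-flight FLR */
        set_device_mode(dev, VIRTIO_MODE_FREEZE); /* device context is now stable */
        read_device_context(dev);   /* transfer it to the destination */
}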

> >
> > > > But ok, since the active to freeze mode change is allowed, let's discuss
> above.
> > > >
> > > > A device is the single synchronization point for any device reset,
> > > > FLR or admin
> > > command operation.
> > >
> > > So you agree we need synchronization? And I'm not sure I get the
> > > meaning of synchronization point, do you mean the synchronization
> > > between freeze/stop and virtio facilities?
> > >
> > Synchronization means handling two events that arrive in parallel, such as FLR and another operation.
> 
> Great. So we have a perfect race:
> 
> 1) guest initiates FLR
> 2) device start FLR
> 3) hypervisor stop and freeze the device
> 4) device is frozen
> 5) hypervisor read device context A
> 6) migrate device context A
> 8) migration is done
> 9) FLR is done
> 10) hypervisor read device context B
> 
> So we end up with an inconsistent device context, no? Dest wants B or A+B, but you
> give A.
> 
Since #1 and #2 are done before #3, the device knows to finish the FLR first; hence #9 is completed before #4.

Alternatively, in the above sequence, when the destination sees #10 it can immediately finish the FLR, treating it as a no-op since the destination device is not actually under FLR.

Both ways of handling it are fine (and the race is rare in practice, but yes, it is possible).

I will write both options in the device requirements.
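
A sketch of the second option on the destination side (the flr_pending field and the function names are hypothetical):

#include <stdbool.h>

struct member_dev;
struct dev_ctx { bool flr_pending; /* ... other context sections ... */ };

void write_device_context(struct member_dev *dev, const struct dev_ctx *ctx);

/* The migrated context may record an FLR that was still pending on the
 * source. The destination device is not actually under FLR, so the FLR
 * can be finished immediately as a no-op before restoring the context. */
void restore_context(struct member_dev *dev, struct dev_ctx *ctx)
{
        if (ctx->flr_pending)
                ctx->flr_pending = false;   /* finish the FLR as a no-op */
        write_device_context(dev, ctx);
}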

> >
> > > > So, the migration driver do not need to wait for FLR to complete.
> > >
> > > I'm confused, you said below that device context could be changed by FLR.
> > >
> > Yes.
> > > If FLR needs to clear device context, we can have a race where
> > > device context is cleared when we are trying to read it?
> > >
> > I didn't say clear the context.
> > FLR updates the device context.
> 
> In what sense?
> 
It indicates a new device context and discards the old one.
I am glad you asked this. I wanted to get the basic part captured before adding this optimization.
It is probably good to add it now in v2, as we have crossed this stage.

> > The device is serving the device context read/write commands, serving FLR,
> > and answering the mode change command, so the device knows best how to
> avoid any race.
> 
> You want to leave those details for the vendor to figure out? If devices know
> everything, why do we need device normative?
> 
The device knows its implementation.
Implementation guidelines belong in the normative sections.
I will add them to the normative text.

> I see issues at least for FLR; I'm pretty sure there are others. If a design requires
> us to audit all the possible conflicts between virtio facilities and the transport,
> it's a strong hint of a layer violation, and when that happens it may for sure hit
> a lot of problems that are very hard to find or debug; thus we should drop such a design.
> I suggest using the RFC tag since the next version (if there is one) as I see it is
> immature in many ways.
> 
The technical committee audits the required touch points, like the rest of the industry committees I have participated in.
I disagree with your above point.
If you do not want to review, that is fine.
We are reviewing with other members, who are also contributing.

> What's more, solving races is much easier if the device functionality is self
> contained. For example, for a self contained device with the transport as the
> single interface, we can leverage from transport
> (PCI) for dealing with races, arbitration, ordering, QOS etc which is probably
> required in the internal channel between the owner and the member. But all of
> these were missed in your series and even if you can I'm not sure it's
> worthwhile to reinvent all of them.
> 
In the end there is one physical device serving the owner and member devices.
So a claim that because things are on the VF you magically get a 200% QoS guarantee is a myth.

Quoting "all of these" is also incorrect.

Things are added gradually: first functionality with reasonable performance, followed by the notion of and extensions for QoS.
By definition of the PCI transport for SR-IOV, there is an internal channel.

It is a reasonably solid proposal in its current form.
The few race conditions that you highlight are extremely rare in nature.
Suggestions to improve it are welcome.
There were a couple from Michael too; I am addressing them in v2.

> For example, for the architecture like owner/member, if the virtio or transport
> facility could be controlled via device internal channels besides the transport,
> such a channel may complicate the synchronization a lot. 
Two vendors who actually make hardware SR-IOV devices are authoring this, and others are also reviewing it.
So I am fairly confident that it is solid enough.
Also, a similar design has been in use with another device for more than a year, integrated as GPL code with QEMU and with the upstream kernel.

> The device needs to
> be able to handle or synchronize requests from both PCI and owner in parallel.
> There are just too many possible races, and most of my questions so far come
> from this viewpoint. I wouldn't go further for other stuff since I believe I've
> spotted sufficient issues and that's why I must stop at this patch before looking
> at the rest.
It is your call whether to stop or continue.
I find your reviews useful for improving this proposal, so I will fix these issues.

> 
> Admin commands are fine if they do real administrative jobs such as provisioning
> since such work is beyond the core virtio functionality.
> 
> Again, the goal of virtio spec is to have a device with sufficient guidelines that is
> easy to implement but not leave the vendors to waste their engineering
> resources in figuring or fuzzing the corner cases.
I have not seen an industry standard spec or a piece of software that does not have corner cases.
The spec proposal is from more than one device vendor.

I will focus on more practical aspects to progress and improve this spec.
> 
> >
> > > > When admin cmd freeze the VF it can expect FLR_completed VF.
> > >
> > > We need to explain why and how about the resume? For example, is
> > > resuming required to wait for the completion of FLR, if not, why?
> 
> This question is ignored.
> 
I probably missed it. Sorry about that.
No, the driver does not need to wait for FLR to finish before issuing the resume command, as this is typically done on the destination member device, which should not be under FLR.
I will write up the requirements further.

> > > In another thread you are saying that the PCI composition is done by
> > > hypervisor, so passthrough is really confusing at least for me.
> > >
> > I explained there what vPCI composition is done.
> > The PCI config space and MSI-X side of the composition is done.
> > The whole virtio interface is not composed.
> 
> You need to describe this somewhere, no? That's what I'm saying.
> 
Mostly not. What is not done is not written.

> And passthrough is misleading here.
> 
Passthrough is mentioned in the theory of operation.
It is not present in the requirements section.
So it is fine.

> >
> > > > Ok. I assume "reset flow" is clear to you now that it points to section 2.4.
> > > > This section is not a normative section, so using an extra word like
> > > > "flow" does not confuse anyone.
> > > > I will link to the section anyway.
> > >
> > > Probably, but you mention FLR flow as well.
> > As I said, I am not repeating the PCIe spec here. The reader knows what FLR of
> the PCIe transport is.
> 
> Ok, I'm not a native speaker, but I really don't know the difference between
> "FLR" and "FLR flow".
> 
Let's keep it simple. I will write it as FLR, as the PCI transport calls it FLR.

> >
> > >
> > > >
> > > > > >
> > > > > > > > and may also undergo PCI function level
> > > > > > > > +reset(FLR) flow.
> > > > > > >
> > > > > > > Why is only FLR special here? I've asked FRS but you ignore the
> question.
> > > > > > >
> > > > > > FLR is special to bring clarity that guest owns the VF doing
> > > > > > FLR, hence
> > > > > hypervisor cannot mediate any registers of the VF.
> > > > >
> > > > > It's not about mediation at all, it's about how the device can
> > > > > implement what you want here correctly.
> > > > >
> > > > > See my above question.
> > > > >
> > > > Ok. it is clear that live migration commands cannot stay on the
> > > > member device
> > > because the member device can undergo device reset and FLR flows
> > > owned by the guest.
> > >
> > > I disagree, hypervisors can emulate FLR and never send FLR to real devices.
> > >
> > That would be some other trap-based alternative that needs to dissect the device,
> and building infrastructure for such dissection is not desired in the listed use case.
> 
> Do you need to trap FLR or not? You're saying the hypervisor is in charge of
> vPCI, how is this differ to what you proposed? If not, how can vPCI be
> composed?
> 
The live migration driver does not need to trap FLR.

> I believe you need to document how vpci is supposed to be done, since I believe
> your proposal can only work with such specific types of PCI composition. This is
> one of the important things that is missed in this series.
> 
I don't see a need to describe vPCI composition, as there may be more than one way to do it.
What I think is worth describing is that the whole PCI device is not stored in the device context.
I will try to add a short description around that.

> >
> > So your disagreement is fine for non-passthrough devices.
> >
> > > > (and hypervisor is not involved in these two flows, hence the
> > > > admin command
> > > interface is designed such that it can fulfil the above requirements).
> > > >
> > > > Theory of operation brings out this clarity. Please notice that it
> > > > is in
> > > introductory section with an example.
> > > > Not normative line.
> > > >
> > > > > >
> > > > > > > > Such flows must comply to the PCI standard and also
> > > > > > > > +virtio specification;
> > > > > > >
> > > > > > > This seems unnecessary and obvious as it applies to all
> > > > > > > other PCI and virtio functionality.
> > > > > > >
> > > > > > Great. But your comment contradicts that.
> > > > > >
> > > > > > > What's more, for the things that need to be synchronized, I
> > > > > > > don't see any descriptions in this patch. And if it doesn't need, why?
> > > > > > With which operation should it be synchronized and why?
> > > > > > Can you please be specific?
> > > > >
> > > > > See my above question regarding FLR. And it may have others
> > > > > which I haven't had time to audit.
> > > > >
> > > > Ok. When you get a chance to audit, let's discuss then.
> > >
> > > Well, I'm not the author of this series, it should be your job
> > > otherwise it would be too late.
> > >
> > As the author, I will cover what we think is needed. If you have specific points
> that add value, please share them and I will look into them.
> 
> I've pointed out sufficient issues. I have a lot of others but I don't want to have a
> giant thread once again.
> 
I see the following things to improve in the requirements, which I will do in v2.

1. Document the race between FLR and admin commands for the really rare corner case.
2. Some text about not migrating the PCI device registers.
3. Interaction with PM commands.

> >
> > > For example, how is the power management interaction with the
> freeze/stop?
> > >
> > Power management is owned by the guest, like any other virtio interface.
> > So freeze/stop does not interfere with it.
> 
> I don't think this is a good answer. I'm asking how the PM interacts with
> freeze/stop, you answer it works well.

> 
> I'm not obliged to design hardware for you but figuring out the bad design for
> virtio. I'm not convinced with a proposal that misses a lot of obvious critical
> cases and for sure it's not my job to solve them.
> 
I am not asking you to solve them.

> I've demonstrated the possible races with FLR. So did the PM. For example, if VF
> is in D3cold state, can we still read its device context?
I think yes, but I will double check.
> If yes, is it a violation of the PCIe spec? If not, why?
No, because the device context is owned by the owner device and not the VF. The SR-PCIM interface is defined to be outside the scope of the PCIe spec.

> How about other states? Can the device be frozen
> in the middle of PM state transitions? If yes, how can it work without migrating
> PCI states?
I will double check, but it is unlikely; it should be similar to the FLR case, so as to avoid the device treating it differently.

> Well, I meant we need a more precise definition of each state otherwise it
> could be ambiguous (as I pointed above).
Ok. So I will add a few things about reads and the other messages.

> 
> >
> > > >
> > > > > >
> > > > > > In "stop" mode, the device wont process descriptors.
> > > > >
> > > > > If the device won't process descriptors, why still allow it to
> > > > > receive
> > > notifications?
> > > > Because a notification may still arrive, and the device may update
> > > > any counters as part of
> > >
> > > Which counters did you mean here?
> > >
> > The counter that Xuan is adding, and any other state that the device may have
> to update as a result of a driver notification.
> > For example, caching the avail index posted in the notification.
> 
> A link to those proposals? 
[1] https://lists.oasis-open.org/archives/virtio-comment/202310/msg00048.html

> If the device must depend on those cached features to
> work it's really fragile. If not, we don't need to care about them.
It is not dependent on them.
This is the infrastructure to enable them.
The same applies to other shared memory region accesses.

> 
> >
> > > > handling it, which then need to be migrated, or store the received notification.
> > > >
> > > > > Or does it really matter if the device can receive or not here?
> > > > >
> > > > From device point of view, the device is given the chance to
> > > > update its device
> > > context as part of notifications or access to it.
> > >
> > > This is in conflict with what you said above " Device cannot process
> > > the queue ..."
> > >
> > No, it does not.
> > The device context is updated within the device without accessing the guest's
> queue memory.
> 
> This is not documented or explained anywhere?
> 
Why should it be explained?
That the device does not access guest memory is already mentioned for the stop mode.
Hence, there is no need to write the above.

> >
> > > Maybe you can give a concrete example.
> > >
> > The above one.
> >
> > > >
> > > > > >
> > > > > > > > + the member device context
> > > > > > >
> > > > > > > I don't think we define "device context" anywhere.
> > > > > > >
> > > > > > It is defined further in the description.
> > > > >
> > > > > Like this?
> > > > >
> > > > > """
> > > > >  +The member device has a device context which the owner driver
> > > > > can
> > > > > +either read or write. The member device context consist of any
> > > > > device  +specific data which is needed by the device to resume
> > > > > its operation  +when the device mode """
> > > > >
> > > > Yes.
> > > > Further, patch 3 adds the device context and also adds a link to
> > > > it in the
> > > theory of operation section so the reader can read more detail about it.
> > > >
> > > > > "Any" is probably too hard for vendors to implement. And in
> > > > > patch 3 I only see virtio device context. Does this mean we
> > > > > don't need transport
> > > > > (PCI) context at all? If yes, how can it work?
> > > > >
> > > > Right. The PCI member device is present at the source and destination with
> > > > its layout; only the virtio device context is transferred.
> > > > Which part cannot work?
> > >
> > > It is explained in another thread where you are saying the PCI
> > > requires mediation. I think any author should not ignore such
> > > important assumptions in both the change log and the patch.
> > >
> > > And again, the more I review the more I see how narrow this series can be
> used:
> > >
> > I explained this before and also covered in the cover letter.
> >
> > > 1) Only works for SR-IOV member device like VF
> > It can be extended to SIOV member devices in the future.
> > Today these are the only type of member device virtio has.
> 
> That is exactly what I want to say, it can only work for the owner/member
> model. It can't work when the virtio device is not structured like that. And you
> missed that most of the existing virtio devices are not implemented in this
> model. It means they can't be migrated with a pure virtio specific extension. For
> you, SR-IOV is all but this is not true for virtio. PCI is not the only transport and
> SR-IOV is not the only architecture in PCI.
> 
Each transport will have its own way to handle it.
When an MMIO owner-member relationship arises, one will be able to do the same there as well.
In fact, other transports will likely lag behind, as they have not established such a relationship.

> And I'm pretty sure the owner/member is not the only requirement, there are a
> lot of other assumptions which are missed in this series.
> 
One proposal does not do everything.
That is just impractical.

> >
> > > 2) Mediate PCI but not virtio which is tricky
> > > 3) Can only work for a specific BAR/capability register layout
> > >
> > > Only 1) is described in the change log.
> > >
> > > The other important assumptions like 2) and 3) are not documented
> anywhere.
> > > And this patch never explains why 2) and 3) is needed or why it can
> > > be used for subsystems other than VFIO/Linux.
> > >
> > Since I am not mentioning vfio now, I will refrain from mentioning
> > others as well. :)
> 
> It's not about VFIO at all. It's about to let people know under which case this
> proposal could work. Otherwise if a vendor develops a BAR/cap which is not at
> page boundary. How could you make it work with your proposal here?
> 
The vendor is a cloud operator which is building the device, so it will always work: it has matching capabilities on the source and the destination.

> >
> > > >
> > > > > >
> > > > > > > >and device configuration space may change. \\
> > > > > > > > +\hline
> > > > > > >
> > > > > > > I still don't get why we need a "stop" state in the middle.
> > > > > > >
> > > > > > All pci devices which belong to a single guest VM are not
> > > > > > stopped
> > > atomically.
> > > > > > Hence, one device which is in freeze mode, may still receive
> > > > > > driver notifications from other pci device,
> > > > >
> > > > > Device may choose to ignore those notifications, no?
> > > > >
> > > > > > or it may experience a read from the shared memory and get
> > > > > > garbage
> > > data.
> > > > >
> > > > > Could you give me an example for this?
> > > > >
> > > > Section 2.10 Shared Memory Regions.
> > >
> > > How can it experience a read in this case?
> > >
> > MMIO read/write can be initiated by the peer device while the device is in
> stopped state.
> 
> Ok, but what I want to say is how it can get the garbage data here?
> 
If the device mode is changed to freeze while it is being read by a peer device, the peer can get garbage data or stale data, which may not be what is expected.
So first all the initiator devices are stopped, ensuring that they do not make any new requests.

Requests that are already in flight get a proper answer.
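
In pseudocode (illustrative only, reusing the hypothetical set_device_mode() wrapper from the earlier sketch):

#include <stddef.h>

struct member_dev;
enum { VIRTIO_MODE_STOP = 1, VIRTIO_MODE_FREEZE = 2 };  /* illustrative values */
void set_device_mode(struct member_dev *dev, int mode);

/* Stop every member device of the VM first, so no peer device keeps
 * initiating reads of a shared memory region; only then freeze them.
 * A stopped device still answers in-flight peer requests properly. */
void stop_then_freeze(struct member_dev *devs[], size_t n)
{
        for (size_t i = 0; i < n; i++)
                set_device_mode(devs[i], VIRTIO_MODE_STOP);
        for (size_t i = 0; i < n; i++)
                set_device_mode(devs[i], VIRTIO_MODE_FREEZE);
}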

> >
> > > Btw, shared regions are tricky for hardware.
> > >
> > > >
> > > > > > And things can break.
> > > > > > Hence the stop mode ensures that all the devices get enough
> > > > > > chance to stop
> > > > > themselves, and later, when frozen, not to change anything internally.
> > > > > >
> > > > > > > > +0x2   & Freeze &
> > > > > > > > + In this mode, the member device does not accept any
> > > > > > > > +driver notifications,
> > > > > > >
> > > > > > > This is too vague. Is the device allowed to be freezed in
> > > > > > > the middle of any virtio or PCI operations?
> > > > > > >
> > > > > > > For example, in the middle of feature negotiation etc. It
> > > > > > > may cause implementation specific sub-states which can't be
> migrated easily.
> > > > > > >
> > > > > > Yes, it is allowed in the middle of feature negotiation, for sure.
> > > > > > It is a passthrough device, hence the hypervisor layer does not get to
> > > see sub-states.
> > > > > >
> > > > > > I am not sure why you comment that it cannot be migrated easily.
> > > > > > The device context already covers this sub-state.
> > > > >
> > > > > 1) driver writes driver_features
> > > > > 2) driver sets FEAUTRES_OK
> > > > >
> > > > > 3) device receive driver_features
> > > > > 4) device validating driver_features
> > > > > 5) device clears FEATURES_OK
> > > > >
> > > > > 6) driver read stats and realize FEATURES_OK is being cleared
> > > > >
> > > > > Is it valid to be frozen of the above?
> > > > No. The device mode is frozen when the hypervisor is sure that no more
> > > > accesses by the
> > > guest will be done.
> > >
> > > How, you don't trap so 1) and 2) are posted, how can hypervisor know
> > > if there's inflight transactions to any registers?
> > >
> > Because the hypervisor has stopped the vCPUs which are issuing them.
> 
> MMIO are posted. vCPU is stopped but the transactions are inflight.
> How could the hypervisor/device know if there's any inflight PCIE transactions
> here? So I can imagine what happens in fact is the TLP for freezing is ordered
> with the TLP for posted MMIO. This is probably guaranteed for typical PCIE
> setup but how about the relaxed ordering?

vCPUs do not generate relaxed-ordering MMIOs.
In the PCI spec: "If this bit is Set, the Function is permitted to set the Relaxed Ordering bit in
the Attributes field of transactions it initiates".

The Function initiates RO requests, not the vCPU.
Hence, it is fine.

> >
> > > > What can happen between #2 and #3, is device mode may change to stop.
> > >
> > > Why can't be freezed in this case? It's really hard to deduce why it
> > > can't just from your above descriptions.
> > >
> > On the source hypervisor, the mode changes are active->stop->freeze.
> > Hence when the freeze is done, the hypervisor knows that everything inflight
> has stopped by then.
> 
> Ok, but how about freezing between 3) and 4). If we allow it, do we need to
> migrate to this state? If yes, how can it work with your device context? If not,
> shouldn't we document this?
> 
Maybe; some of these are implementation details, and I am not sure they belong in the spec.
Like an RSS update while packets are being received, such implementation details are not part of the spec.

> >
> > > Even if it had, is it even possible to list all the places where
> > > freezing is prohibited? We don't want to end up with a spec that is
> > > hard to implement or leave the vendor to figure out those tricky parts.
> > >
> > The general idea is not to prohibit the freeze/stop mode.
> > If the device needs more time, let the device take the time to do it.
> 
> Ok, it means:
> 
> 1) there're conditions from stop to freeze, then what are they?
No, there isn't a condition.
Maybe I didn't follow the question.
> 2) how much time at most? E.g FLR takes at most 100ms.
From the driver side it is 100 msec; from the device side it can be less too.
As soon as the FLR is done, or enough of it is recorded, the stop can continue.

> 3) If it needs more time, can this time satisfy the downtime requirement?
> 
For all practical purposes, a guest VM is not busy doing FLR; it is a corner case, yet we have to cover it.
And yes, it satisfies the downtime requirements, because the VM is already not interested in the packets; it is busy doing the FLR.

> >
> >
> > > > And in stop mode, device context would capture #5 or #4, depending
> > > > where is
> > > device at that point.
> > > >
> > > > > >
> > > > > > > And what's more, the above state machine seems to be virtio
> > > > > > > specific, but you don't explain the interaction with the
> > > > > > > device status state
> > > > > machine.
> > > > > > First, above is not a state machine.
> > > > >
> > > > > So how do readers know if a state can go to another state and when?
> > > > >
> > > > Not sure what you mean by reader. Can you please explain.
> > >
> > > The people who read virtio spec.
> > >
> > So the question is "how does the reader know if a state can go to another state
> and when"?
> > When a mode can change is described and listed in the table.
> 
> It's not only "if" but also "when". Your table partially answers the "if '' but not
> "when". I think you should know now the state transition is conditional. So let's
> try our best to ease the life of the vendor.
What do you mean by "when"?
I do not understand "mode change is conditional"; it is not based on a condition.
[..]

> > > Let's define the synchronization point first. And it demonstrates at
> > > least devices need to synchronize between the free/stop and virtio
> > > device status machine which is not as easy as what is done in this patch.
> > >
> > Synchronization point = device.
> 
> This is obvious as we can't rule stuff outside virtio, and we are talking about
> devices not drivers here. But the spec needs sufficient guidance/normative for
> the vendor to implement. It's more than just saying "device is synchronization
> point".
> 
The requirements already cover what the device needs to do.
Some interaction points are missing; as I acknowledged above, I will add them.

[..]
> > > Until virtio reset, this is how virtio works now. I've pointed out
> > > that it may cause extra troubles when trying to resume, but you
> > > don't tell me what's wrong to keep that?
> > >
> > If kept, hypervisor may not be able to decide when to change the mode from
> active->stop.
> 
> Why? It is simply done when mgmt requires a migration?
> 
Mgmt is a somewhat higher-level entity. Underneath it, the software layers may wait until the time is right to migrate.
The fundamental point is that the device context is expected to return incremental values, that is, the content changed since the last read.
So once all changed content has been read, it is empty.
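
Illustratively (hypothetical admin-command wrappers, not spec text):

#include <stddef.h>
#include <stdint.h>

struct member_dev;
size_t read_device_context(struct member_dev *dev, uint8_t *buf, size_t cap);
void send_to_destination(const uint8_t *buf, size_t len);

/* Read-to-clear semantics: each read returns only the delta since the
 * previous read; a zero-length result means nothing has changed. */
void transfer_context(struct member_dev *dev)
{
        uint8_t buf[4096];
        size_t len;

        while ((len = read_device_context(dev, buf, sizeof(buf))) > 0)
                send_to_destination(buf, len);
}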

> What's more important, PCI allows multiple common_cfgs. So the hypervisor
> can choose to reserve one common_cfg for live migration. In this case we don't
> have to read to clear semantics.
common_cfg does not serve a large device context, nor does it serve DMA.

> 
> Or, are you saying the value read from common_cfg is not device context? 
The value of the common config is the part of the device context that represents the current common config.

> Isn't this conflict with your vague definition of device context?
>
You mentioned that you stop at this patch, so you likely did not read the device context patch; hence you call it vague.
So I don't know what you mean by vague.
Please let me know what additional things you want to see in the device context once you reach that patch.

 
> > We can opt for a mode where the full device context is read each time
> without clearing it.
> > But then it can be very specific to a version of QEMU, which we are avoiding
> here.
> >
> > > > 2. device context returns incremental value from the previous
> > > > read. So, it
> > > needs to clear it.
> > >
> > > I don't understand here. This is not the case for most of the devices.
> > >
> > I am not sure which devices you mean here with "most of the devices".
> > The device context functions like write records of pages (aka dirty pages).
> 
> It's definitely different. We want to migrate dirty pages lively which can
> consume a lot of bandwidth. So reporting delta makes a lot of sense here since
> it would have a lot of rounds of syncing and it doesn't result in blockers
> resuming.
> 
Write records are reported as a delta from the previous read.

> For device context, how many rounds of syncing did you expect, and if we have
> N rounds, we need to restore N rounds in order to resume? Do you want to live
> migrating device states? If it's only 1 or 2 rounds, why bother?
> 
The device context is migrated lively. Typically, in current software using it, it is 2 rounds.
The interface is generic so that, if needed, more rounds are possible.

Even the device, for most practical purposes, will implement 2 rounds.
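
A two-round flow would look roughly like this (reusing the illustrative helpers declared in the earlier sketches; an assumption about a typical flow, not normative text):

void two_round_transfer(struct member_dev *dev)
{
        uint8_t buf[4096];
        size_t len;

        /* Round 1: pre-copy the bulk of the context while Active. */
        while ((len = read_device_context(dev, buf, sizeof(buf))) > 0)
                send_to_destination(buf, len);

        set_device_mode(dev, VIRTIO_MODE_STOP);
        set_device_mode(dev, VIRTIO_MODE_FREEZE);

        /* Round 2: the final, typically small, delta. */
        while ((len = read_device_context(dev, buf, sizeof(buf))) > 0)
                send_to_destination(buf, len);
}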

> And for the delta, how do you know you can easily define deltas for every type
> of device, especially the ones with complicated internal states? Defining states
> has already been demonstrated as a complicated task for some devices like
> virtio-FS and you want to complicate it furtherly?
> 
What is your question? If you say virtio-fs has complicated state, maybe it should not have existed in the virtio spec in the first place.
But I think differently.
The virtio-fs guest-side state won't be changed as part of this.
Virtio-fs is the first device which has considered and listed how to migrate the device state.
So it should be possible.

> What is proposed in this series is an ad-hoc optimization for a specific device
> type within a specific subsystem (e.g. VFIO) in a specific operating system, which
> is not general.
> 
Oh, now you mention VFIO. Not me. :)

I am not going to comment on this. It is not ad hoc.
It uses a dirty-page-tracking-like technique present in CPU hardware and other devices.

> As demonstrated many times, starting from something simple and stupid is the
> easiest way.
> 

> > Whatever is already returned is not (or should not be) repeated in subsequent
> reads, though the device can choose to do so.
> >
> > > >
> > > > > > And which software stack may find this useful?
> > > > > > Is there any existing software that can utilize it?
> > > > >
> > > > > Libvirt.
> > > > >
> > > > Does libvirt restore on migration failure?
> > >
> > > Yes.
> > >
> > Ok. The device will be able to resume when it is marked active.
> > The device context returned is the incremental delta, as explained above.
> 
> I disagree, see my above reply.
I replied above.

> 
> >
> > > >
> > > > > > Why that device context present with the software vanished, in
> > > > > > your
> > > > > assumption, if it is?
> > > > > >
> > > > > > > > Typically, on
> > > > > > > > +the source hypervisor, the owner driver reads the device
> > > > > > > > +context once when the device is in \field{Active} or
> > > > > > > > +\field{Stop} mode and later once the member device is in
> > > \field{Freeze} mode.
> > > > > > >
> > > > > > > Why need the read while device context could be changed? Or
> > > > > > > is the dirty page part of the device context?
> > > > > > >
> > > > > > It is not part of the dirty page.
> > > > > > It needs to read in the active/stop mode, so that it can be
> > > > > > shared with
> > > > > destination hypervisor, which will pre-setup the complex context
> > > > > of the device, while it is still running on the source side.
> > > > >
> > > > > Is such a method used by any hypervisor?
> > > > Yes. QEMU, which uses the VFIO interface, uses it.
> > >
> > > Ok, such software technology could be used for all types of devices,
> > > I don't see any advantages to mention it here unless it's unique to virtio.
> > >
> > It is the theory of operation that brings the clarity and rationale.
> 
> I think it's not. Since it's not something that is unique to virtio.
> 
> > So I will keep it.
> >
> > > >
> > > > >
> > > > > >
> > > > > > > > +
> > > > > > > > +Typically, the device context is read and written one
> > > > > > > > +time on the source and the destination hypervisor
> > > > > > > > +respectively once the device is in \field{Freeze} mode.
> > > > > > > > +On the destination hypervisor, after writing the device
> > > > > > > > +context, when the device mode set to \field{Active}, the
> > > > > > > > +device uses the most recently set device context and
> > > > > > > > +resumes the device
> > > > > > > operation.
> > > > > > >
> > > > > > > There's no context sequence, so this is obvious. It's the
> > > > > > > semantic of all other existing interfaces.
> > > > > > >
> > > > > > Can you please what which existing interfaces do you mean here?
> > > > >
> > > > > For any common cfg member. E.g queue_addr.
> > > > >
> > > > > The driver wrote 100 different values to queue_addr and the
> > > > > device used the value written last time.
> > > > >
> > > > O.k. I don't see any problem in stating what is done, which is
> > > > less vague. :)
> > > >
> > > > > >
> > > > > > > > +
> > > > > > > > +In an alternative flow, on the source hypervisor the
> > > > > > > > +owner driver may choose to read the device context first
> > > > > > > > +time while the device is in \field{Active} mode and
> > > > > > > > +second time once the device is in \field{Freeze}
> > > > > > > mode.
> > > > > > >
> > > > > > > Who is going to synchronize the device context with possible
> > > > > > > configuration from the driver?
> > > > > > >
> > > > > > I am not sure I understand the question.
> > > > > > If I understand you right, do you mean: when a configuration change
> > > > > > is done by the guest driver, how does the device
> context change?
> > > > > >
> > > > >
> > > > > Yes.
> > > > >
> > > > > > If so, device context reading will reflect the new configuration.
> > > > >
> > > > > How do you do that? For example:
> > > > >
> > > > > static inline void vp_iowrite64_twopart(u64 val,
> > > > >                                         __le32 __iomem *lo,
> > > > >                                         __le32 __iomem *hi) {
> > > > >         vp_iowrite32((u32)val, lo);
> > > > >         vp_iowrite32(val >> 32, hi); }
> > > > >
> > > > > Is it ok to be freezed in the middle of two vp_iowrite()?
> > > > >
> > > > Yes. the device context VIRTIO_DEV_CTX_PCI_COMMON_RUNTIME_CFG
> > > section captures the partial value.
> > >
> > > There's no way for the device to know whether or not it's a partial value or
> not.
> > > No?
> > >
> > The device does not need to know, because when the guest VM and the device
> are resumed on the destination, the guest VM will continue with writing the 2nd
> part.
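
To illustrate with the function quoted above (the freeze point shown is hypothetical):

/* Guest on the source, midway through vp_iowrite64_twopart(): */
vp_iowrite32((u32)val, lo);     /* low half reaches the device */
/* <-- device frozen here; the runtime configuration section of the
 *     device context simply records the partially updated queue_addr */
/* Guest resumed on the destination continues where it left off: */
vp_iowrite32(val >> 32, hi);    /* high half completes the value */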
> >
> > > >
> > > > > >
> > > > > > > > Similarly, on the
> > > > > > > > +destination hypervisor writes the device context first
> > > > > > > > +time while the device is still running in \field{Active}
> > > > > > > > +mode on the source hypervisor and writes the device
> > > > > > > > +context second time while the device is in
> > > > > > > \field{Freeze} mode.
> > > > > > > > +This flow may result in very short setup time as the
> > > > > > > > +device context likely have minimal changes from the
> > > > > > > > +previously written device
> > > > > context.
> > > > > > >
> > > > > > > Is the hypervisor who is in charge of doing the comparison
> > > > > > > and writing only the delta?
> > > > > > >
> > > > > > The spec commands allow to do so. So possibility exists from spec
> wise.
> > > > >
> > > > > There are various optimizations for migration for sure, I don't
> > > > > think mentioning any specific one is good.
> > > > >
> > > > The text is informative text similar to,
> > > >
> > > > " However, some devices benefit from the ability to find out the
> > > > amount of available data in the queue without accessing the
> > > > virtqueue in
> > > memory"
> > > >
> > > > " To help with these optimizations, when
> > > > VIRTIO_F_NOTIFICATION_DATA has
> > > been negotiated".
> > > >
> > > > Is this the only optimization in virtio? No, but we still mention
> > > > the rationale of
> > > why it exists.
> > >
> > > The above is a good example as it explain VIRTIO_F_NOTIFICATION_DATA
> > > is the only way without accessing the virtqueue. But this is not the case of
> migration.
> > > You said it's just a possibility but not a must which is not the
> > > case for VIRTIO_F_NOTIFICATION_DATA.
> > >
> > It is one optimization among others. The comparison is about whether it is
> one example of many or not.
> 
> I don't get this.
The theory of operation describes a flow of how things are done and how the constructs help to achieve it.
It is not the end of the list.
That does not mean one should not write these things.

