OASIS Mailing List Archives



virtio-comment message


Subject: RE: [virtio-comment] [PATCH v1 1/8] admin: Add theory of operation for device migration

> From: Jason Wang <jasowang@redhat.com>
> Sent: Monday, October 9, 2023 2:19 PM
> Adding LingShan.
Thanks for adding him.

> Parav, if you want any specific people to comment, please do cc them.
Sure, will cc them in v2 as now I see there is interest in the review.

> On Sun, Oct 8, 2023 at 7:26 PM Parav Pandit <parav@nvidia.com> wrote:
> >
> > One or more passthrough PCI VF devices are ubiquitous for virtual
> > machines usage using generic kernel framework such as vfio [1].
> Mentioning a specific subsystem in a specific OS may mislead the user to think
> it can only work in that setup. Let's not do that, virtio is not only used for Linux
> and VFIO.
Not really. It is an example in the cover letter.
It is not the only use case.
A use case gives crisp clarity about what UAPI it needs to fulfil.
So I will keep it. It is anyway written as one use case.

> >
> > A passthrough PCI VF device is fully owned by the virtual machine
> > device driver.
> Is this true? Even VFIO needs to mediate PCI stuff. Or how do you define
> "passthrough" here?
Yes, other than the PCI config registers and, due to some legacy, MSI-X.
The "device interface" side is not mediated.
The definition of passthrough here is: not mediating the device-type-specific and virtio-specific interfaces for modern and future devices.

> > This passthrough device controls its own device reset flow, basic
> > functionality as PCI VF function level reset
> How about other PCI stuff? Or Why is FLR special?
FLR is called out so that readers are clear that FLR is also done by the guest driver; hence, the device migration commands do not interact with or depend on the FLR flow.

> > and rest of the virtio device functionality such as control vq,
> What do you mean by "rest of"?
As given in the example cvq.

> Which part is not controlled and why?
Not controlled because, as stated, it is a passthrough device.

> > config space access, data path descriptors handling.
> >
> > Additionally, VM live migration using a precopy method is also widely used.
> Why is this mentioned here?
Huh. You should be positive about bringing clarity to the readers on understanding the use case.
And you seem to be the opposite, but OK.

As stated, it is for the reader to understand the use case and see how the proposed commands address the use case.

> >
> > To support a VM live migration for such passthrough virtio devices,
> > the owner PCI PF device administers the device migration flow.
> Well, if this is specific only to PCI SR-IOV, I'd move it to the PCI transport part.
> But I guess not.
We took the decision to not do so, for other group commands as well.
After Michael's suggestion we moved it to group commands.
So I will not debate this further.

> >
> > This patch introduces the basic theory of operation which describes
> > the flow and supporting administration commands.
> >
> > [1]
> > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/
> > include/uapi/linux/vfio.h?h=v6.1.47
> >
> > Fixes: https://github.com/oasis-tcs/virtio-spec/issues/176
> > Signed-off-by: Parav Pandit <parav@nvidia.com>
> > ---
> >  admin-cmds-device-migration.tex | 94
> +++++++++++++++++++++++++++++++++
> >  admin.tex                       |  1 +
> >  2 files changed, 95 insertions(+)
> >  create mode 100644 admin-cmds-device-migration.tex
> >
> > diff --git a/admin-cmds-device-migration.tex
> > b/admin-cmds-device-migration.tex new file mode 100644 index
> > 0000000..f839af4
> > --- /dev/null
> > +++ b/admin-cmds-device-migration.tex
> > @@ -0,0 +1,94 @@
> > +\subsubsection{Device Migration}\label{sec:Basic Facilities of a
> > +Virtio Device / Device groups / Group administration commands /
> > +Device Migration}
> > +
> > +In some systems, there is a need to migrate a running virtual machine
> > +from one to another system. A running virtual machine has one or more
> > +passthrough virtio member devices attached to it. A passthrough
> > +device is entirely operated by the guest virtual machine. For
> > +example, with the SR-IOV group type, group member (VF) may undergo
> > +virtio device initialization and reset flow
> What do you mean by "reset flow"? It looks not like a terminology defined in the
> PCI spec. And Google gives me nothing about this.
"reset flow" = virtio specification section 2.4 Device Reset flow.

> > and may also undergo PCI function level
> > +reset(FLR) flow.
> Why is only FLR special here? I've asked FRS but you ignore the question.
FLR is called out to make it clear that the guest owns the VF and performs the FLR; hence the hypervisor cannot mediate any registers of the VF.

> > Such flows must comply to the PCI standard and also
> > +virtio specification;
> This seems unnecessary and obvious as it applies to all other PCI and virtio
> functionality.
Great. But your comment is contradictory.

> What's more, for the things that need to be synchronized, I don't see any
> descriptions in this patch. And if it doesn't need, why?
With which operation should it be synchronized and why?
Can you please be specific?

It is not written in this series because we believe it must not be synchronized, as it is fully controlled by the guest.

> > at the same time such flows must not obstruct
> > +the device migration flow. In such a scenario, a group owner device
> > +can provide the administration command interface to facilitate the
> > +device migration related operations.
> > +
> > +When a virtual machine migrates from one hypervisor to another
> > +hypervisor, these hypervisors are named as source and destination
> hypervisor respectively.
> > +In such a scenario, a source hypervisor administers the member device
> > +to suspend the device and preserves the device context.
> > +Subsequently, a destination hypervisor administers the member device
> > +to setup a device context and resumes the member device. The source
> > +hypervisor reads the member device context and the destination
> > +hypervisor writes the member device context. The method to transfer
> > +the member device context from the source to the destination hypervisor is
> outside the scope of this specification.
> > +
> > +The member device can be in any of the three migration modes. The
> > +owner driver sets the member device in one of the following modes during
> device migration flow.
> > +
> > +\begin{tabularx}{\textwidth}{ |l||l|X| } \hline Value & Name &
> > +Description \\ \hline \hline
> > +0x0   & Active &
> > +  It is the default mode after instantiation of the member device. \\
> I don't think we ever define "instantiation" anywhere.
Well, a transport already has an implicit definition of instantiation.
Maybe some text can be added, but I don't see value in duplicating the PCI spec here.

> > +\hline
> > +0x1   & Stop &
> > + In this mode, the member device does not send any notifications,
> > +and it does not access any driver memory.
> What's the meaning of "driver memory"?
Maybe guest memory? Or do you suggest a better name for the memory allocated by the guest driver?

> And stop seems to be a source of inflight buffers.
I didn't follow it.
If you mean that without stop there are no inflight buffers, then I don't agree.
We don't want to violate the spec by returning descriptors with zero size.
Stop is not the source of inflight descriptors.

There are inflight descriptors with the device that are not yet returned to the driver, and the device won't return them as wrong, zero-size completions.

> > + The member device may receive driver notifications in this mode,
> What's the meaning of "receive"? For example if the device can still process
> buffers, "stop" is not accurate.
Receive means the driver can send the notification as a PCIe TLP, which the device may receive as an incoming PCIe TLP.

In "stop" mode, the device won't process descriptors.

> > + the member device context
> I don't think we define "device context" anywhere.
It is defined further in the description.

> >and device configuration space may change. \\
> > +\hline
> I still don't get why we need a "stop" state in the middle.
All the PCI devices that belong to a single guest VM are not stopped atomically.
Hence, a device that is already in freeze mode may still receive driver notifications from another PCI device, or it may experience a read from the shared memory and get garbage data.
And things can break.
Hence the stop mode ensures that all the devices get enough chance to stop themselves and, later, when frozen, do not change anything internally.
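
To illustrate the two-phase argument above, here is a minimal toy sketch in Python. It is not spec text: the mode values are taken from the table in the patch, while `MemberDevice`, `peer_notify` and `quiesce` are hypothetical names, and the "dropped" outcome is only a stand-in for the "things can break" case.

```python
from enum import IntEnum


class Mode(IntEnum):
    """Member device modes, values as in the table of the patch."""
    ACTIVE = 0x0
    STOP = 0x1
    FREEZE = 0x2


class MemberDevice:
    """Hypothetical toy model of a member device."""

    def __init__(self, name):
        self.name = name
        self.mode = Mode.ACTIVE
        self.context_changes = 0  # toy stand-in for the device context


def peer_notify(src, dst):
    """Toy model of one device notifying a peer device (P2P).

    Returns "suppressed" if src does not initiate anything, "dropped" if
    dst is frozen and does not accept it, otherwise "accepted".
    """
    if src.mode is not Mode.ACTIVE:
        return "suppressed"  # Stop/Freeze: the device initiates nothing
    if dst.mode is Mode.FREEZE:
        return "dropped"  # the case the Stop mode is meant to avoid
    dst.context_changes += 1  # Active/Stop: received, context may change
    return "accepted"


def quiesce(devices):
    """Two-phase quiesce: stop every device first, then freeze them all."""
    for dev in devices:
        dev.mode = Mode.STOP
    for dev in devices:
        dev.mode = Mode.FREEZE


if __name__ == "__main__":
    a, b = MemberDevice("a"), MemberDevice("b")
    b.mode = Mode.FREEZE       # freeze b while a is still Active
    print(peer_notify(a, b))   # dropped

    c, d = MemberDevice("c"), MemberDevice("d")
    c.mode = Mode.STOP         # phase 1 has reached only c so far
    print(peer_notify(d, c))   # accepted: c in Stop still receives
    quiesce([c, d])
    print(peer_notify(d, c))   # suppressed: nobody initiates any more
```

In this model a device is frozen only after every device has stopped initiating traffic, so the "dropped" case cannot occur once `quiesce` is used.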

> > +0x2   & Freeze &
> > + In this mode, the member device does not accept any driver
> > +notifications,
> This is too vague. Is the device allowed to be frozen in the middle of any virtio
> or PCI operations?
> For example, in the middle of feature negotiation etc. It may cause
> implementation specific sub-states which can't be migrated easily.
Yes. It is allowed in the middle of feature negotiation, for sure.
It is a passthrough device, hence the hypervisor layer does not get to see the sub-state.

Not sure why you say that it cannot be migrated easily.
The device context already covers this sub-state.

> And what's more, the above state machine seems to be virtio specific, but you
> don't explain the interaction with the device status state machine. 
First, the above is not a state machine.
Second, it is not virtio specific. It is present in a leading OS that has a fundamental requirement to support P2P devices.
Third, it is not interacting with the _actual_ device status.

In the "SUSPEND" patch-5, you already asked this question. I assume you asked again so that this series is complete.

> For example,
> what happens if the driver wants to reset but the device is in stop mode? You
> told me it is addressed in your series but looks not. Once you try to describe
> that, you're actually try to connect states between the two state machines.
As listed in the definition of the stop mode, the device does not act on the incoming writes; it only keeps track of its internal device context changes as part of this.
We would enrich the device context for this, but there is no need to connect the admin mode controlled by the owner device with the operational state (device_status) owned by the member device.

> > + it ignores any device configuration space writes,
> How about read and the device configuration changes?
As listed, the device does not have any changes.
So a device configuration change cannot occur.

The device requirements cover this content more explicitly:

For the SR-IOV group type, regardless of the member device mode, all the PCI transport level registers
MUST be always accessible and the member device MUST function the same way for all the PCI transport
level registers regardless of the member device mode.

> > + the device do not have any changes in the device context. The member
> > + device is not accessed in the system through the virtio interface.
> > + \\
> But accessible via PCI interface?
Yes, as usual.

> For example, what happens if we want to freeze during FLR? Does the
> hypervisor need to wait for the FLR to be completed?
The hypervisor does not need to wait for the FLR to be completed.

> > +\hline
> > +\hline
> > +0x03-0xFF   & -    & reserved for future use \\
> > +\hline
> > +\end{tabularx}
> > +
> > +When the owner driver wants to stop the operation of the device, the
> > +owner driver sets the device mode to \field{Stop}. Once the device is
> > +in the \field{Stop} mode, the device does not initiate any
> > +notifications or does not access any driver memory. Since the member
> > +driver may be still active which may send further driver
> > +notifications to the device, the device context may be updated. When
> > +the member driver has stopped accessing the device, the owner driver
> > +sets the device to \field{Freeze} mode indicating to the device that
> > +no more driver access occurs. In the \field{Freeze} mode, no more
> > +changes occur in the device context. At this point, the device ensures that
> there will not be any update to the device context.
> What is missed here are:
> 1) it is a virtio specific states or not
It is not.

> 2) if it is a virtio specific state, if or how to synchronize with transport specific
> interfaces and why
> 3) can active go directly to freeze and why
Yes. I don't see a reason to not allow it.
Changing the mode from active directly to freeze is useful on the destination side, where the destination hypervisor knows for sure that no other entity is accessing the device.
And it needs to set up the device context that it received from the source side.
So setting freeze mode can be done directly.
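
The destination-side sequence described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the proposed command interface: the mode values come from the table in the patch, while `MemberDevice`, `set_mode`, `write_context`, `restore_on_destination` and the context field name are hypothetical.

```python
from enum import IntEnum


class Mode(IntEnum):
    """Member device modes; 0x03-0xFF are reserved in the patch."""
    ACTIVE = 0x0
    STOP = 0x1
    FREEZE = 0x2


class MemberDevice:
    """Hypothetical destination-side member device (toy model)."""

    def __init__(self):
        self.mode = Mode.ACTIVE       # default mode after instantiation
        self.pending_context = None   # last context written by the owner
        self.running_context = None   # context the device operates with

    def set_mode(self, value):
        try:
            new_mode = Mode(value)
        except ValueError:
            raise ValueError(f"mode {value:#x} is reserved") from None
        if new_mode is Mode.ACTIVE and self.pending_context is not None:
            # Resume with the most recently written device context.
            self.running_context = self.pending_context
        self.mode = new_mode

    def write_context(self, context):
        self.pending_context = dict(context)


def restore_on_destination(dev, context):
    """Active -> Freeze directly, write the context, then resume."""
    dev.set_mode(Mode.FREEZE)   # no other entity accesses the device
    dev.write_context(context)  # context received from the source side
    dev.set_mode(Mode.ACTIVE)


if __name__ == "__main__":
    dev = MemberDevice()
    restore_on_destination(dev, {"vq0_used_idx": 12})
    print(dev.mode.name, dev.running_context)
```

The sketch skips the Stop mode on purpose: on the destination there is no running member driver or peer device to wait for, which is the reasoning given above for allowing the direct change.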

> > +
> > +The member device has a device context which the owner driver can
> > +either read or write. The member device context consist of any device
> > +specific data which is needed by the device to resume its operation
> > +when the device mode
> This is too vague. There're states that are not suitable for cmd/queue for sure.
> I'd split it into
> 1) common states: virtqueue, dirty pages
> 2) device specific states: defined by each device
This is the theory of operation section, so it is not capturing such details.
The actual device context definition is outside of the theory section, and the precise states of the virtqueue, the device-specific parts, etc. are in it.

> > +is changed from \field{Stop} to \field{Active} or from \field{Freeze}
> > +to \field{Active}.
> > +
> > +Once the device context is read, it is cleared from the device.
> This is horrible, it means we can't easily
> 1) re-try the migration
> 2) recover from migration failure
Can you please explain the flow?
And which software stack may find this useful?
Is there any existing software that can utilize it?
Why, in your assumption, would the device context already held by the software have vanished?

> > Typically, on
> > +the source hypervisor, the owner driver reads the device context once
> > +when the device is in \field{Active} or \field{Stop} mode and later
> > +once the member device is in \field{Freeze} mode.
> Why need the read while device context could be changed? Or is the dirty page
> part of the device context?
It is not part of the dirty page.
The device context needs to be read in the active/stop mode so that it can be shared with the destination hypervisor, which will pre-set up the complex context of the device while it is still running on the source side.

> > +
> > +Typically, the device context is read and written one time on the
> > +source and the destination hypervisor respectively once the device is
> > +in \field{Freeze} mode. On the destination hypervisor, after writing
> > +the device context, when the device mode set to \field{Active}, the
> > +device uses the most recently set device context and resumes the device
> operation.
> There's no context sequence, so this is obvious. It's the semantic of all other
> existing interfaces.
Can you please clarify which existing interfaces you mean here?

> > +
> > +In an alternative flow, on the source hypervisor the owner driver may
> > +choose to read the device context first time while the device is in
> > +\field{Active} mode and second time once the device is in \field{Freeze}
> mode.
> Who is going to synchronize the device context with possible configuration from
> the driver?
Not sure I understand the question.
If I understand you right, do you mean:
when a configuration change is done by the guest driver, how does the device context change?

If so, reading the device context will reflect the new configuration.

> > Similarly, on the
> > +destination hypervisor writes the device context first time while the
> > +device is still running in \field{Active} mode on the source
> > +hypervisor and writes the device context second time while the device is in
> \field{Freeze} mode.
> > +This flow may result in very short setup time as the device context
> > +likely have minimal changes from the previously written device context.
> Is the hypervisor who is in charge of doing the comparison and writing only the
> delta?
The spec commands allow doing so, so the possibility exists spec-wise.
In the current proposal, there isn't a need for the hypervisor to do so at all.

The destination-side device gets to see the new device context and applies the delta.
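
As a rough illustration of the two-pass flow, the following Python sketch models a destination device that receives the full context twice and works out the delta itself. Everything here is an assumption made for illustration: `DestinationDevice`, `migrate`, the dictionary representation and the field names are hypothetical and do not reflect the actual device context format.

```python
_MISSING = object()  # sentinel: field absent from the previous context


class DestinationDevice:
    """Hypothetical destination member device that applies only the delta."""

    def __init__(self):
        self.context = {}
        self.fields_applied = []  # number of fields touched by each write

    def write_context(self, new_context):
        changed = {
            key for key, value in new_context.items()
            if self.context.get(key, _MISSING) != value
        }
        removed = set(self.context) - set(new_context)
        self.context = dict(new_context)
        self.fields_applied.append(len(changed) + len(removed))


def migrate(first_read, final_read, dst):
    """The hypervisor forwards both reads unmodified; no comparison here."""
    dst.write_context(first_read)   # source device still in Active mode
    dst.write_context(final_read)   # source device now in Freeze mode
    return dst.context


if __name__ == "__main__":
    # Hypothetical context snapshots read on the source side.
    first = {"features": 0x3, "vq0_avail_idx": 10, "vq0_used_idx": 8}
    final = {"features": 0x3, "vq0_avail_idx": 12, "vq0_used_idx": 12}
    dst = DestinationDevice()
    migrate(first, final, dst)
    print(dst.fields_applied)  # [3, 2]
```

The second write touches fewer fields than the first, which is the intuition behind the short setup time mentioned in the patch; the hypervisor in this sketch never compares the two snapshots.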
