virtio-comment message



Subject: RE: [PATCH v6 1/1] content: Add new feature VIRTIO_F_PRESERVE_RESOURCES


Hi Jiqian,

> From: Chen, Jiqian <Jiqian.Chen@amd.com>
> Sent: Friday, January 12, 2024 1:11 PM

> 
> Hi all,
> Sorry to reply late.
> I don't know if you still remember this problem, so let me briefly describe it.
> I am working on implementing the virtgpu S3 function on Xen.
> Currently on Xen, if we start a guest through qemu with virtgpu enabled and
> then suspend and resume the guest, we find that the guest kernel comes back
> but the display does not. It just shows a black screen.
> That is because during suspend the guest calls into qemu, and qemu destroys
> all display resources and resets the renderer. This makes the display stay
> gone after the guest resumes.
> So, I added a new command for virtio-gpu: when the guest is suspending, it
> notifies qemu and sets a parameter (preserver_resource) to 1, and then qemu
> preserves the resources; when resuming, the guest notifies qemu to set the
> parameter back to 0, and qemu then resumes its normal behavior. That helps
> the guest's display come back.
> When I upstreamed the above implementation, Parav and MST suggested that I
> use the PM capability to fix this problem instead of adding a new command or
> state bit.
> Now I have tried the PM capability of virtio-gpu, and it cannot be used to
> solve this problem.
> The reason is:
> during guest suspend, the guest writes the D3 state through the PM cap, so I
> can preserve the virtio-gpu resources on the Qemu side (set preserver_resource
> to 1). But during resume, the PM cap state is cleared by qemu's reset path
> (qemu_system_wakeup -> qemu_devices_reset -> virtio_vga_base_reset ->
> virtio_gpu_gl_reset), so when the guest reads the state from the PM cap, it
> finds the virtio-gpu is already in D0 state,
This behavior needs to be fixed. As the spec states, "This 2-bit field is used
both to determine the current power state of a Function".

So the device needs to return D3 in the PowerState field. This is a must.

In addition, a busy_poll register that indicates the device is ready to use is
helpful.
This is because 10 msec is the timeout set by the PCI spec.
That can be hard for devices to meet if a large GPU state is being read back
from files or slow media, or for other hw devices with large amounts of state.

This limit comes from the following PCI spec normative text:

After transitioning a VF from D3Hot to D0, at least one of the following is true:
- At least 10 ms has passed since the request to enter D0 was issued.

So the Readiness Time Reporting capability is not very useful here.

Hence, after the D3->D0 PM state transition succeeds, a virtio-level PCI
register is useful to ensure that the device has resumed before the driver
drives the rest of the registers.
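
Roughly, the driver-side resume flow I have in mind looks like the sketch
below. It is only a sketch in plain C: the register name, values and helper
functions (read_device_ready_reg() and friends) are made up for illustration
and are not a concrete proposal for the field layout.

    /* Sketch only: after writing D0 to the PM cap's PowerState field, poll a
     * hypothetical virtio-level "device_ready" register until the device has
     * finished restoring its (possibly large) state. */
    #include <stdbool.h>
    #include <stdint.h>

    #define DEV_READY          1     /* hypothetical: restore complete */
    #define RESUME_TIMEOUT_MS  2000  /* device specific; may exceed PCI's 10 ms */

    extern void     pm_write_power_state_d0(void); /* write D0 via the PM cap */
    extern uint16_t read_device_ready_reg(void);   /* hypothetical virtio register */
    extern void     sleep_ms(unsigned int ms);

    static bool wait_for_device_resumed(void)
    {
            unsigned int waited_ms = 0;

            pm_write_power_state_d0();             /* D3hot -> D0 */

            /* PCI only guarantees config access 10 ms after D0; restoring a
             * large GPU state from slow media can take much longer. */
            while (read_device_ready_reg() != DEV_READY) {
                    if (waited_ms >= RESUME_TIMEOUT_MS)
                            return false;          /* device failed to resume */
                    sleep_ms(1);
                    waited_ms++;
            }
            return true;   /* safe to drive the rest of the virtio registers */
    }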

> so
> the guest will not write D0 through the PM cap, and I cannot know when to
> restore the resources (set preserver_resource to 0).
> Do you have any other suggestions?
> Or can I just fall back to the version that adds a new command
> (VIRTIO_GPU_CMD_PRESERVE_RESOURCE) to virtio-gpu? I think that way is more
> reasonable and feasible for virtio-gpu to protect display resources during
> S3. As for other devices, if necessary, they can also refer to the virtio-gpu
> implementation and add new commands to prevent resource loss during S3.
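
For reference, my reading of the guest-side flow you propose with the new
command is roughly the sketch below. The helper name
send_preserve_resource_cmd() is a placeholder for illustration only, not the
actual virtio-gpu driver API.

    /* Sketch only: freeze/restore callbacks send the proposed
     * VIRTIO_GPU_CMD_PRESERVE_RESOURCE command with 1 on suspend and 0 on
     * resume, so the device knows when to keep and when to release state. */
    extern int send_preserve_resource_cmd(unsigned int preserve); /* hypothetical */

    static int gpu_freeze(void)   /* called on guest suspend */
    {
            return send_preserve_resource_cmd(1); /* device must keep resources */
    }

    static int gpu_restore(void)  /* called on guest resume */
    {
            return send_preserve_resource_cmd(0); /* back to normal behavior */
    }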
> 
> On 2023/10/27 11:03, Chen, Jiqian wrote:
> > Hi Michael S. Tsirkin and Parav Pandit, thank you for your detailed
> > explanation. I will try to use the PM cap to fix this issue.
> >
> > On 2023/10/26 18:30, Michael S. Tsirkin wrote:
> >> On Thu, Oct 26, 2023 at 10:24:26AM +0000, Chen, Jiqian wrote:
> >>>
> >>> On 2023/10/25 11:51, Parav Pandit wrote:
> >>>>
> >>>>
> >>>>> From: Chen, Jiqian <Jiqian.Chen@amd.com>
> >>>>> Sent: Tuesday, October 24, 2023 5:44 PM
> >>>>>
> >>>>> On 2023/10/24 18:51, Parav Pandit wrote:
> >>>>>>
> >>>>>>> From: Chen, Jiqian <Jiqian.Chen@amd.com>
> >>>>>>> Sent: Tuesday, October 24, 2023 4:06 PM
> >>>>>>>
> >>>>>>> On 2023/10/23 21:35, Parav Pandit wrote:
> >>>>>>>>
> >>>>>>>>> From: Chen, Jiqian <Jiqian.Chen@amd.com>
> >>>>>>>>> Sent: Monday, October 23, 2023 4:09 PM
> >>>>>>>>
> >>>>>>>>>> I think this can be done without introducing the new register.
> >>>>>>>>>> Can you please check if the PM register itself can serve the
> >>>>>>>>>> purpose instead of a new virtio level register?
> >>>>>>>>> Do you mean the system PM register?
> >>>>>>>> No, the device's PM register at transport level.
> >>>>>>> I tried to find this register (pci level, virtio pci level, or virtio
> >>>>>>> driver level), but I didn't find it in the Linux kernel or Qemu code.
> >>>>>>> May I know which register you are referring to specifically? Or
> >>>>>>> which PM state bit did you mention below?
> >>>>>>>
> >>>>>> PCI spec's PCI Power Management Capability Structure in section 7.5.2.
> >>>>> Yes, what you point to is the PM capability for PCIe devices.
> >>>>> But the problem is still that the Qemu code checks the condition
> >>>>> (pci_bus_is_express or pci_is_express) for all virtio-pci devices in
> >>>>> the function virtio_pci_realize(); if the virtio devices aren't PCIe
> >>>>> devices, it will not add the PM capability for them.
> >>>> PCI PM capability is a must for PCIe devices, so maybe the QEMU code
> >>>> has put it only under the is_pcie check.
> >>>> But it can be done outside of that check as well, because this
> >>>> capability has existed on PCI too for a long time and is backward
> >>>> compatible.
> >>> Do you suggest that I implement the PM capability for virtio devices in
> >>> Qemu first, and then try whether the PM capability can work for this
> >>> scenario?
> >>
> >> virtio devices in qemu already have a PM capability.
> >>
> >>
> >>> If so, we will complicate a simple problem. Since there is no other need
> >>> to add a PM capability to virtio devices for now, adding it just for
> >>> preserving resources seems unnecessary and unreasonable. And we are not
> >>> sure whether there are other scenarios, outside of PM state changes, that
> >>> also need to preserve resources; if there are, the PM register can't
> >>> cover them, but a preserve_resources register can.
> >>
> >> One of the selling points of virtio is precisely reusing existing
> >> platform capabilities as opposed to coming up with our own.
> >> See abstract.tex
> >>
> >>
> >>> Can I add a note like "If the PM capability is implemented for virtio
> >>> devices, it may cover this scenario, and if there are no other scenarios
> >>> that PM can't cover, then we can remove this register" to the commit
> >>> message or spec description, and then continue adding the
> >>> preserve_resources register?
> >>
> >> We can't remove registers.
> >>
> >>>>
> >>>>> And another problem: what about MMIO transport devices? Preserving
> >>>>> resources is needed for all transport types.
> >>>>>
> >>>> MMIO lacks such rich PM definitions. If in the future MMIO wants to
> >>>> support this, it will be extended to match other transports like PCI.
> >>>>
> >>>>>>>>
> >>>>>>>>> I think it is unreasonable to let the virtio device listen to the
> >>>>>>>>> PM state of the guest system.
> >>>>>>>> The guest driver performs any work on the guest system's PM
> >>>>>>>> callback events in the virtio driver.
> >>>>>>> I didn't find any PM state callback in the virtio driver.
> >>>>>>>
> >>>>>> There are virtio_suspend and virtio_resume in the case of Linux.
> >>>>> I think what you refer to as virtio_suspend/resume are the
> >>>>> freeze/restore callbacks from "struct virtio_driver" or the
> >>>>> suspend/resume callbacks from "static const struct dev_pm_ops
> >>>>> virtio_pci_pm_ops".
> >>>>> And yes, I agree: if virtio devices have the PM capability, maybe we
> >>>>> can set the PM state in those callback functions.
> >>>>>
> >>>>>>
> >>>>>>>>
> >>>>>>>>> It's more suitable that each device gets notifications from the
> >>>>>>>>> driver and then performs the preserve-resources operation.
> >>>>>>>> I agree that each device gets the notification from the driver.
> >>>>>>>> The question is whether it should be the virtio driver, or whether
> >>>>>>>> the existing pci driver which transitions the state from d0->d3 and
> >>>>>>>> d3->d0 is enough.
> >>>>>>> It seems there isn't an existing pci driver that transitions the d0
> >>>>>>> or d3 state. Could you please tell me which one it is specifically? I
> >>>>>>> am very willing to give it a try.
> >>>>>>>
> >>>>>> The virtio-pci modern driver of Linux should be able to.
> >>>>> Yes, I know, it is the VIRTIO_PCI_FLAG_INIT_PM_BIT. But the two
> >>>>> problems I mentioned above still remain.
> >>>>>
> >>>> Both can be resolved without switching to pcie.
> >>>>
> >>>>>>
> >>>>>>>> Can you please check that?
> >>>>>>>>
> >>>>>>>>>>> --- a/transport-pci.tex
> >>>>>>>>>>> +++ b/transport-pci.tex
> >>>>>>>>>>> @@ -325,6 +325,7 @@ \subsubsection{Common configuration structure
> >>>>>>>>>>> layout}\label{sec:Virtio Transport
> >>>>>>>>>>>          /* About the administration virtqueue. */
> >>>>>>>>>>>          le16 admin_queue_index;         /* read-only for driver */
> >>>>>>>>>>>          le16 admin_queue_num;         /* read-only for driver */
> >>>>>>>>>>> +        le16 preserve_resources;        /* read-write */
> >>>>>>>>>> Preserving these resources in the device implementation takes a
> >>>>>>>>>> finite amount of time.
> >>>>>>>>>> Possibly more than 40 nsec (the time of a PCIe write TLP).
> >>>>>>>>>> Hence this register must be a polling register to indicate
> >>>>>>>>>> preservation_done.
> >>>>>>>>>> This will tell the guest when the preservation is done, and when
> >>>>>>>>>> the restoration is done, so that it can resume the upper layers.
> >>>>>>>>>>
> >>>>>>>>>> Please refer to the queue_reset definition to learn more about
> >>>>>>>>>> such a register definition.
> >>>>>>>>> Thanks, I will refer to "queue_reset". So, I need three values:
> >>>>>>>>> the driver writes 1 to let the device preserve resources, the
> >>>>>>>>> driver writes 2 to let the device restore resources, and the
> >>>>>>>>> device writes 0 to tell the driver that preserving or restoring is
> >>>>>>>>> done. Am I right?
> >>>>>>>>>
> >>>>>>>> Right.
> >>>>>>>>
> >>>>>>>> And if the existing pcie pm state bits can do it, we can leverage
> >>>>>>>> that.
> >>>>>>>> If they cannot be used, let's add that reasoning to the commit log
> >>>>>>>> to describe this register.
> >>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Let's please make sure that the PCIe PM level registers are
> >>>>>>>>>> sufficient or not sufficient before deciding on the addition of
> >>>>>>>>>> this register.
> >>>>>>>>> But if the device is not a PCIe device, it doesn't have the PM
> >>>>>>>>> capability, so this will not work. Actually, in my local
> >>>>>>>>> environment, pci_is_express() returns false in Qemu; they are not
> >>>>>>>>> PCIe devices.
> >>>>>>>> It is reasonable to ask to plug in as a PCIe device in 2023 to get
> >>>>>>>> new functionality, and moreover you mentioned a gpu device, which
> >>>>>>>> does not have a very long history of backward compatibility.
> >>>>>>> Do you suggest that I add the PM capability for virtio-gpu or change
> >>>>>>> virtio-gpu to a PCIe device?
> >>>>>>>
> >>>>>> The PCI Power Management Capability Structure does not seem to be
> >>>>>> limited to PCIe.
> >>>>> I am not sure, but in the current Qemu code I can see the check
> >>>>> "pci_is_express" for all virtio-pci devices. If we want to add the PM
> >>>>> capability for virtio-pci devices, we need to change them to PCIe
> >>>>> devices, I think.
> >>>>>
> >>>> That is one option.
> >>>> A second option is to extend the PCI PM cap to the non-PCIe device,
> >>>> because it is supported there as well.
> >>>>
> >>>>>>
> >>>>>
> >>>>> --
> >>>>> Best regards,
> >>>>> Jiqian Chen.
> >>>
> >>> --
> >>> Best regards,
> >>> Jiqian Chen.
> >>
> >
> 
> --
> Best regards,
> Jiqian Chen.

