virtio-comment message



Subject: Re: [PATCH v6 1/1] content: Add new feature VIRTIO_F_PRESERVE_RESOURCES


On 2024/1/12 16:02, Parav Pandit wrote:
> Hi Jiqian,
> 
>> From: Chen, Jiqian <Jiqian.Chen@amd.com>
>> Sent: Friday, January 12, 2024 1:11 PM
> 
>>
>> Hi all,
>> Sorry to reply late.
>> I don't know if you still remember this problem, so let me briefly describe it.
>> I am working on implementing the virtgpu S3 function on Xen.
>> Currently on Xen, if we start a guest through qemu with virtgpu enabled, and
>> then suspend and resume the guest, we find that the guest kernel comes
>> back but the display does not. It just shows a black screen.
>> That is because during suspend the guest calls into qemu, and qemu
>> destroys all display resources and resets the renderer. This makes the
>> display gone after the guest resumes.
>> So I added a new command for virtio-gpu: when the guest is suspending, it
>> notifies qemu to set a parameter (preserver_resource) to 1, and qemu then
>> preserves resources; when resuming, the guest notifies qemu to set the
>> parameter back to 0, and qemu resumes its normal behavior. That helps the
>> guest's display come back.
>> When I tried to upstream the above implementation, Parav and MST suggested
>> using the PM capability to fix this problem instead of adding a new command or
>> state bit.
>> Now I have tried the PM capability of virtio-gpu, and it cannot be used to
>> solve this problem.
>> The reason is:
>> during guest suspend, the guest writes the D3 state through the PM cap, and at
>> that point I can save the resources of virtio-gpu on the Qemu side (set
>> preserver_resource to 1). But during resume, the state of the PM cap is
>> cleared by qemu's reset path (qemu_system_wakeup -> qemu_devices_reset ->
>> virtio_vga_base_reset -> virtio_gpu_gl_reset), so when the guest reads the
>> state from the PM cap, it finds the virtio-gpu is already in the D0 state,
> This behavior needs to be fixed. As the spec spells out, "This 2-bit field is used both to determine the current power state of a Function"
Do you mean it is wrong for the current qemu code to reset the PM cap when virtio-gpu resets? Why?
Shouldn't all device state, including the PM registers, be reset during a virtio-gpu reset?
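
For clarity, here is a minimal sketch of the flow I observe, from the guest driver's point of view (a Linux-style fragment, error handling omitted; where things break is my reading of the situation):

/* Sketch only: guest-side view of the failure described above. */
static void show_pm_cap_problem(struct pci_dev *pdev)
{
	u16 pmcsr;

	/* Suspend: driver puts the function into D3hot through the PM cap. */
	pci_read_config_word(pdev, pdev->pm_cap + PCI_PM_CTRL, &pmcsr);
	pmcsr = (pmcsr & ~PCI_PM_CTRL_STATE_MASK) | PCI_D3hot;
	pci_write_config_word(pdev, pdev->pm_cap + PCI_PM_CTRL, pmcsr);

	/* ... guest suspends; qemu's wakeup path resets the device ... */

	/* Resume: PowerState already reads back as D0, so the driver never
	 * issues a D3->D0 write, and qemu can't tell when to restore
	 * (preserver_resource is never set back to 0). */
	pci_read_config_word(pdev, pdev->pm_cap + PCI_PM_CTRL, &pmcsr);
	if ((pmcsr & PCI_PM_CTRL_STATE_MASK) == PCI_D0)
		return; /* looks as if the device never left D0 */
}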

> 
> So the device needs to return D3 in the PowerState field. This is a must.
But the current behavior is a kind of soft reset, I think (!PCI_PM_CTRL_NO_SOFT_RESET), so it seems reasonable for qemu to reset the PM cap.
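
In other words, a sketch under my assumption of how that bit is meant to be read:

/* Sketch: No_Soft_Reset (PCI_PM_CTRL_NO_SOFT_RESET, bit 3 of PMCSR)
 * tells us whether device state survives the D3hot->D0 transition. */
static bool pm_state_survives_d3_to_d0(struct pci_dev *pdev)
{
	u16 pmcsr;

	pci_read_config_word(pdev, pdev->pm_cap + PCI_PM_CTRL, &pmcsr);
	/* Clear bit => internal reset on D3hot->D0, state is lost. */
	return pmcsr & PCI_PM_CTRL_NO_SOFT_RESET;
}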

> 
> In addition, a busy_poll register that indicates the device is ready to use would be helpful.
> This is because 10 msec is the timeout set by the PCI spec.
> This can be hard for devices to implement if a large GPU state is being read from files or slow media, or for other hw devices with large amounts of state.
> 
> This limit comes from the following PCI spec normative text:
> 
> After transitioning a VF from D3Hot to D0, at least one of the following is true:
> - At least 10 ms has passed since the request to enter D0 was issued.
> 
> So the Readiness Time Reporting capability is not so useful.
> 
> Hence, after the D3->D0 PM state transition succeeds, a virtio-level PCI register is useful to ensure the device has resumed before the driver touches the rest of the registers.
> 
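If I understand the busy_poll idea correctly, the guest side would look roughly like this (a sketch only; the register name, its semantics and the timeout are illustrative, nothing here is defined in the spec yet):

/* Sketch: poll a hypothetical "device_ready" register after D3hot->D0. */
static int wait_for_device_ready(void __iomem *ready_reg)
{
	unsigned long deadline = jiffies + msecs_to_jiffies(5000);

	while (!readw(ready_reg)) {	/* device sets this once resumed */
		if (time_after(jiffies, deadline))
			return -ETIMEDOUT;
		usleep_range(1000, 2000); /* may run well past the 10 ms PCI window */
	}
	return 0;
}
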
>> and so the guest never writes D0 through the PM cap, which means I can't
>> know when to restore the resources (set preserver_resource to 0).
>> Do you have any other suggestions?
>> Or can I just fall back to the version that adds a new
>> command (VIRTIO_GPU_CMD_PRESERVE_RESOURCE) to virtio-gpu? I think that
>> approach is more reasonable and feasible for virtio-gpu to protect display
>> resources during S3. As for other devices, if necessary, they can also
>> follow the virtio-gpu implementation and add new commands to prevent
>> resource loss during S3.
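>>
>> For reference, the fallback I have in mind would look roughly like this (a
>> sketch: only the command name comes from my earlier patch; the numeric
>> value and field layout are illustrative):
>>
>> /* Sketch of the proposed virtio-gpu command (illustrative values). */
>> #define VIRTIO_GPU_CMD_PRESERVE_RESOURCE 0x0210 /* hypothetical number */
>>
>> struct virtio_gpu_preserve_resource {
>>         struct virtio_gpu_ctrl_hdr hdr; /* standard virtio-gpu ctrl header */
>>         __le32 preserve;  /* 1 = preserve on suspend, 0 = release on resume */
>>         __le32 padding;
>> };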
>>
>> On 2023/10/27 11:03, Chen, Jiqian wrote:
>>> Hi Michael S. Tsirkin and Parav Pandit, thank you for your detailed
>>> explanation. I will try to use the PM cap to fix this issue.
>>>
>>> On 2023/10/26 18:30, Michael S. Tsirkin wrote:
>>>> On Thu, Oct 26, 2023 at 10:24:26AM +0000, Chen, Jiqian wrote:
>>>>>
>>>>> On 2023/10/25 11:51, Parav Pandit wrote:
>>>>>>
>>>>>>
>>>>>>> From: Chen, Jiqian <Jiqian.Chen@amd.com>
>>>>>>> Sent: Tuesday, October 24, 2023 5:44 PM
>>>>>>>
>>>>>>> On 2023/10/24 18:51, Parav Pandit wrote:
>>>>>>>>
>>>>>>>>> From: Chen, Jiqian <Jiqian.Chen@amd.com>
>>>>>>>>> Sent: Tuesday, October 24, 2023 4:06 PM
>>>>>>>>>
>>>>>>>>> On 2023/10/23 21:35, Parav Pandit wrote:
>>>>>>>>>>
>>>>>>>>>>> From: Chen, Jiqian <Jiqian.Chen@amd.com>
>>>>>>>>>>> Sent: Monday, October 23, 2023 4:09 PM
>>>>>>>>>>
>>>>>>>>>>>> I think this can be done without introducing a new register.
>>>>>>>>>>>> Can you please check if the PM register itself can serve the
>>>>>>>>>>>> purpose instead of a new virtio-level register?
>>>>>>>>>>> Do you mean the system PM register?
>>>>>>>>>> No, the device's PM register at transport level.
>>>>>>>>> I tried to find this register (PCI level, virtio-pci level, or
>>>>>>>>> virtio driver level), but I didn't find it in the Linux kernel or
>>>>>>>>> Qemu code.
>>>>>>>>> May I know which register you are referring to specifically? Or
>>>>>>>>> which PM state bit you mentioned below?
>>>>>>>>>
>>>>>>>> PCI spec's PCI Power Management Capability Structure in section 7.5.2.
>>>>>>> Yes, what you point to is the PM capability for PCIe devices.
>>>>>>> But the problem is still that the Qemu code checks the
>>>>>>> condition (pci_bus_is_express or pci_is_express) for all virtio-pci
>>>>>>> devices in virtio_pci_realize(); if a virtio device is not a PCIe
>>>>>>> device, Qemu will not add the PM capability for it.
>>>>>> The PCI PM capability is a must for PCIe devices, so maybe the QEMU
>>>>>> code has put it only under the is_pcie check.
>>>>>> But it can be done outside of that check as well, because this
>>>>>> capability has existed on PCI for a long time and is backward
>>>>>> compatible.
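If so, something along these lines might be enough on the Qemu side (a sketch modeled loosely on the existing VIRTIO_PCI_FLAG_INIT_PM path; I have not verified the exact API details):

/* Sketch: add the PCI PM capability without the pci_is_express() gate. */
int pos = pci_add_capability(pci_dev, PCI_CAP_ID_PM, 0, PCI_PM_SIZEOF);
assert(pos > 0);
pci_dev->exp.pm_cap = pos;
pci_set_word(pci_dev->config + pos + PCI_PM_PMC, 0x3); /* PM spec v1.2 */
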
>>>>> Are you suggesting that I first implement the PM capability for
>>>>> virtio devices in Qemu, and then try whether the PM capability can
>>>>> work for this scenario?
>>>>
>>>> virtio devices in qemu already have a PM capability.
>>>>
>>>>
>>>>> If so, we would be complicating a simple problem. There is currently no
>>>>> other need to add the PM capability for virtio devices, so adding it
>>>>> just for preserving resources seems unnecessary and unreasonable. And we
>>>>> are not sure whether there are other scenarios, outside of PM state
>>>>> changes, that also need to preserve resources; if there are, the PM
>>>>> register can't cover them, but a preserve_resources register can.
>>>>
>>>> One of the selling points of virtio is precisely reusing existing
>>>> platform capabilities as opposed to coming up with our own.
>>>> See abstract.tex
>>>>
>>>>
>>>>> Can I add a note like "If the PM capability is implemented for virtio
>>>>> devices, it may cover this scenario, and if there are no other scenarios
>>>>> that PM can't cover, then we can remove this register" to the commit
>>>>> message or spec description, and continue to add the preserve_resources
>>>>> register?
>>>>
>>>> We can't remove registers.
>>>>
>>>>>>
>>>>>>> And another problem: what about MMIO transport devices? Preserving
>>>>>>> resources is needed for all transport types.
>>>>>>>
>>>>>> MMIO lacks such rich PM definitions. If in the future MMIO wants to
>>>>>> support this, it will be extended to match the other transports like PCI.
>>>>>>
>>>>>>>>>>
>>>>>>>>>>> I think it is unreasonable to have the virtio device listen to
>>>>>>>>>>> the PM state of the guest system.
>>>>>>>>>> The guest driver performs any such work from the guest system's PM
>>>>>>>>>> callback events in the virtio driver.
>>>>>>>>> I didn't find any PM state callback in the virtio driver.
>>>>>>>>>
>>>>>>>> In the case of Linux, there are virtio_suspend and virtio_resume.
>>>>>>> I think what you call virtio_suspend/resume are the freeze/restore
>>>>>>> callbacks from "struct virtio_driver", or the suspend/resume callbacks
>>>>>>> from "static const struct dev_pm_ops virtio_pci_pm_ops".
>>>>>>> And yes, I agree: if virtio devices have the PM capability, maybe we
>>>>>>> can set the PM state in those callback functions.
>>>>>>>
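For reference, those hooks would be wired up roughly like this (a sketch; the virtgpu_freeze/virtgpu_restore names and bodies are illustrative, not existing code):

/* Sketch: device-level PM hooks in a Linux virtio driver. */
static int virtgpu_freeze(struct virtio_device *vdev)
{
	/* e.g. ask the device to preserve resources before D3 */
	return 0;
}

static int virtgpu_restore(struct virtio_device *vdev)
{
	/* e.g. ask the device to restore resources after D0 */
	return 0;
}

static struct virtio_driver virtio_gpu_driver = {
	/* id_table, probe, remove, ... omitted in this sketch */
#ifdef CONFIG_PM_SLEEP
	.freeze  = virtgpu_freeze,
	.restore = virtgpu_restore,
#endif
};
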
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> It's more suitable for each device to get notifications from its
>>>>>>>>>>> driver and then do the resource-preserving operation.
>>>>>>>>>> I agree that each device gets the notification from the driver.
>>>>>>>>>> The question is whether it should be the virtio driver, or whether
>>>>>>>>>> the existing pci driver that transitions the state d0->d3 and
>>>>>>>>>> d3->d0 is enough.
>>>>>>>>> It seems there isn't an existing pci driver that transitions the
>>>>>>>>> d0 or d3 state. Could you please tell me which one it is
>>>>>>>>> specifically? I am very willing to give it a try.
>>>>>>>>>
>>>>>>>> The Linux modern virtio-pci driver should be able to.
>>>>>>> Yes, I know, it is VIRTIO_PCI_FLAG_INIT_PM_BIT. But the two problems
>>>>>>> I mentioned above still remain.
>>>>>>>
>>>>>> Both can be resolved without switching to pcie.
>>>>>>
>>>>>>>>
>>>>>>>>>> Can you please check that?
>>>>>>>>>>
>>>>>>>>>>>>> --- a/transport-pci.tex
>>>>>>>>>>>>> +++ b/transport-pci.tex
>>>>>>>>>>>>> @@ -325,6 +325,7 @@ \subsubsection{Common configuration
>>>>>>> structure
>>>>>>>>>>>>> layout}\label{sec:Virtio Transport
>>>>>>>>>>>>>          /* About the administration virtqueue. */
>>>>>>>>>>>>>          le16 admin_queue_index;         /* read-only for driver */
>>>>>>>>>>>>>          le16 admin_queue_num;         /* read-only for driver */
>>>>>>>>>>>>> +        le16 preserve_resources;        /* read-write */
>>>>>>>>>>>> Preserving these resources in the device implementation takes a
>>>>>>>>>>>> finite amount of time, possibly more than 40 nsec (the time of a
>>>>>>>>>>>> PCIe write TLP).
>>>>>>>>>>>> Hence this register must be a polling register that indicates
>>>>>>>>>>>> preservation_done.
>>>>>>>>>>>> This will tell the guest when preservation is done, and when
>>>>>>>>>>>> restoration is done, so that it can resume the upper layers.
>>>>>>>>>>>>
>>>>>>>>>>>> Please refer to the queue_reset definition to learn more about
>>>>>>>>>>>> such a register definition.
>>>>>>>>>>> Thanks, I will refer to "queue_reset". So I need three values:
>>>>>>>>>>> the driver writes 1 to make the device preserve resources, the
>>>>>>>>>>> driver writes 2 to make the device restore resources, and the
>>>>>>>>>>> device writes 0 to tell the driver that preserving or restoring
>>>>>>>>>>> is done. Am I right?
>>>>>>>>>>>
>>>>>>>>>> Right.
>>>>>>>>>>
>>>>>>>>>> And if the existing pcie pm state bits can do it, we can leverage
>>>>>>>>>> that.
>>>>>>>>>> If they cannot be used, let's add that reasoning to the commit log
>>>>>>>>>> to describe this register.
>>>>>>>>>>
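For the record, the three-value handshake agreed on above would look something like this on the driver side (a sketch; the names, polling interval, and timeout are illustrative):

/* Sketch: driver side of the preserve/restore handshake. */
#define PRESERVE_REQ	1	/* driver asks device to preserve resources */
#define RESTORE_REQ	2	/* driver asks device to restore resources */
#define OP_DONE		0	/* device writes this back when finished */

static int preserve_resources_sync(void __iomem *reg, u16 op)
{
	int timeout = 100;		/* ~100-200 ms budget, illustrative */

	writew(op, reg);		/* PRESERVE_REQ or RESTORE_REQ */
	while (readw(reg) != OP_DONE) {	/* poll until the device is done */
		if (--timeout < 0)
			return -ETIMEDOUT;
		usleep_range(1000, 2000);
	}
	return 0;
}
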
>>>>>>>>>>>>
>>>>>>>>>>>> Let's please make sure the PCIe PM-level registers are
>>>>>>>>>>>> sufficient or not sufficient before deciding on the addition of
>>>>>>>>>>>> this register.
>>>>>>>>>>> But if the device is not a PCIe device, it doesn't have the PM
>>>>>>>>>>> capability, and then this will not work. Actually, in my local
>>>>>>>>>>> environment pci_is_express() returns false in Qemu; they are not
>>>>>>>>>>> PCIe devices.
>>>>>>>>>> It is reasonable to ask to plug in as a PCIe device in 2023 to
>>>>>>>>>> get new functionality, especially for a gpu device :) which does
>>>>>>>>>> not have a very long history of backward compatibility.
>>>>>>>>> Are you suggesting I add the PM capability to virtio-gpu, or
>>>>>>>>> change virtio-gpu into a PCIe device?
>>>>>>>>>
>>>>>>>> The PCI Power Management Capability Structure does not seem to be
>>>>>>>> limited to PCIe.
>>>>>>> I am not sure, but in the current Qemu code I can see the
>>>>>>> "pci_is_express" check for all virtio-pci devices. If we want to add
>>>>>>> the PM capability for virtio-pci devices, I think we need to change
>>>>>>> them into PCIe devices.
>>>>>>>
>>>>>> That is one option.
>>>>>> The second option is to extend the PCI PM cap for non-PCIe devices,
>>>>>> because it is supported there too.
>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Best regards,
>>>>>>> Jiqian Chen.
>>>>>
>>>>> --
>>>>> Best regards,
>>>>> Jiqian Chen.
>>>>
>>>
>>
>> --
>> Best regards,
>> Jiqian Chen.

-- 
Best regards,
Jiqian Chen.

