virtio-comment message


Subject: Re: [PATCH v6 1/1] content: Add new feature VIRTIO_F_PRESERVE_RESOURCES


On 2024/1/12 16:47, Parav Pandit wrote:
> 
> 
>> From: Chen, Jiqian <Jiqian.Chen@amd.com>
>> Sent: Friday, January 12, 2024 1:55 PM
>>
>>
>> On 2024/1/12 16:02, Parav Pandit wrote:
>>> Hi Jiqian,
>>>
>>>> From: Chen, Jiqian <Jiqian.Chen@amd.com>
>>>> Sent: Friday, January 12, 2024 1:11 PM
>>>
>>>>
>>>> Hi all,
>>>> Sorry to reply late.
>>>> I don't know if you still remember this problem, let me briefly describe it.
>>>> I am working to implement virtgpu S3 function on Xen.
>>>> Currently on Xen, if we start a guest through qemu with virtgpu
>>>> enabled, and then suspend and resume the guest, we find that the
>>>> guest kernel comes back, but the display doesn't. It just shows a
>>>> black screen.
>>>> That is because during suspend, the guest called into qemu, and qemu
>>>> destroyed all display resources and reset the renderer. This made the
>>>> display disappear after the guest resumed.
>>>> So, I added a new command for virtio-gpu: when the guest is suspending,
>>>> it notifies qemu, which sets a parameter (preserver_resource) to 1 and
>>>> then preserves the resources; when resuming, the guest notifies qemu to
>>>> set the parameter back to 0, and qemu then resumes its normal
>>>> behavior. That helps the guest's display come back.
>>>> When I upstreamed the above implementation, Parav and MST suggested
>>>> that I use the PM capability to fix this problem instead of adding a
>>>> new command or state bit.
>>>> Now I have tried the PM capability of virtio-gpu; it can't be used
>>>> to solve this problem.
>>>> The reason is:
>>>> during guest suspend, the guest writes the D3 state through the PM cap,
>>>> so I can save the resources of virtio-gpu on the Qemu side (set
>>>> preserver_resource to 1). But during resume, the state of the PM cap
>>>> is cleared by qemu's reset path (qemu_system_wakeup ->
>>>> qemu_devices_reset -> virtio_vga_base_reset -> virtio_gpu_gl_reset),
>>>> so when the guest reads the state from the PM cap, it finds the
>>>> virtio-gpu is already in the D0 state,
>>> This behavior needs to be fixed. As the spec spells out, "This 2-bit field is
>>> used both to determine the current power state of a Function"
>> Do you mean it is wrong to reset the PM cap when virtio-gpu is reset in
>> the current qemu code? Why?
> Because PM implements support for the D3->D0 transition, and if the device is in D3, the device needs to report that it is in D3 to match the PCI spec.
> 
>> Shouldn't all device states, including PM registers, be reset during the
>> process of virtio-gpu reset?
> If No_Soft_Reset == 0, no device context is preserved.
> If No_Soft_Reset == 1, the device preserves minimal registers and the other things listed in the PCI spec.
> 
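To make the No_Soft_Reset semantics above concrete, here is a minimal sketch
(assuming the standard Linux PCI helpers; this is not part of the patch) of
how a driver could read the PowerState and No_Soft_Reset bits from the PM
capability's control/status register:

#include <linux/pci.h>

/* Sketch: read PMCSR and report whether the function advertises
 * No_Soft_Reset, i.e. whether device context is preserved across a
 * D3hot -> D0 transition. Assumes pdev->pm_cap was set up by the PCI core. */
static bool virtio_pci_no_soft_reset(struct pci_dev *pdev)
{
        u16 pmcsr;

        if (!pdev->pm_cap)
                return false;

        pci_read_config_word(pdev, pdev->pm_cap + PCI_PM_CTRL, &pmcsr);

        /* Bits 1:0 of PMCSR are the PowerState field (D0..D3hot). */
        dev_dbg(&pdev->dev, "PowerState = %u\n",
                (unsigned int)(pmcsr & PCI_PM_CTRL_STATE_MASK));

        return pmcsr & PCI_PM_CTRL_NO_SOFT_RESET;
}
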
>>
>>>
>>> So the device needs to return D3 in the PowerState field. This is a must.
>> But the current behavior is a kind of soft reset, I think
>> (!PCI_PM_CTRL_NO_SOFT_RESET). Resetting the PM cap in qemu is
>> reasonable.
> What you described means you need No_Soft_Reset = 1, so please set the cap accordingly to achieve the restore.
> 
> Also, in your case, QEMU knows that it will have to resume the device soon.
> Hence, it can well prepare the gpu context before resuming the vcpus.
> In such a case, the extra register wouldn't be necessary.
> Having the PM control register is anyway needed to drive the device properly.
Even if, as you said, the PM reset behavior in the current QEMU code is incorrect and gets fixed, that still cannot solve this S3 problem.
The reason is that the device is reset twice during resume.
The first reset happens when we send the resume command to qemu: qemu resets the device (qemu_system_wakeup -> qemu_devices_reset -> virtio_vga_base_reset -> virtio_gpu_gl_reset).
The second reset happens in the guest kernel, via virtio_device_restore -> virtio_reset_device. The PM state change (D3 to D0) happens between those two, so the display resources are still destroyed by the second device reset.
The most feasible and useful way to fix this problem is simply to add a new virtio-gpu command, instead of an extra register or using the PM cap.
Thanks.
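For reference, here is a minimal sketch of what such a command could look like
on the guest side, following the existing virtio_gpu_ctrl_hdr convention; the
command value and field names below are illustrative only and are not taken
from the posted patch:

#include <linux/types.h>
#include <linux/virtio_gpu.h>

/* Hypothetical command number; a real value would be assigned in the spec. */
#define VIRTIO_GPU_CMD_PRESERVE_RESOURCE 0x0120

/* Hypothetical request layout: the guest sends preserve = 1 before suspending
 * and preserve = 0 after resuming, so the device side (qemu/virglrenderer)
 * knows whether to keep the display resources across the device resets. */
struct virtio_gpu_preserve_resource {
        struct virtio_gpu_ctrl_hdr hdr; /* hdr.type = VIRTIO_GPU_CMD_PRESERVE_RESOURCE */
        __le32 preserve;                /* 1 = preserve, 0 = back to normal */
        __le32 padding;
};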

> 
> Having the extra busy_poll register is flexible. So please evaluate if you need it or not.
> 
>>
>>>
>>> In addition, an extra busy_poll register that indicates the device is
>>> ready to use is helpful.
>>> This is because 10 msec is the timeout set by the PCI spec.
>>> This can be hard for devices to implement if a large GPU state is being
>>> read from files or slow media, or for other hw devices in large quantities.
>>>
>>> This limit comes from the following normative text in the PCI spec:
>>>
>>> After transitioning a VF from D3Hot to D0, at least one of the following is
>>> true:
>>> - At least 10 ms has passed since the request to enter D0 was issued.
>>>
>>> So the Readiness Time Reporting capability is not so useful.
>>>
>>> Hence, after the D3->D0 PM state transition succeeds, a virtio-level PCI
>>> register is useful to ensure the device has resumed before driving the
>>> rest of the registers.
>>>
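To make the polling idea above concrete, here is a rough sketch of how a
driver could wait for the device after the D3hot->D0 transition. The
read_device_ready() helper stands in for the proposed busy_poll/readiness
register, which does not exist today; only pci_set_power_state() and the
timing constants are real:

#include <linux/delay.h>
#include <linux/errno.h>
#include <linux/pci.h>

/* Hypothetical accessor for the proposed readiness register; a real
 * implementation would read it through the virtio common configuration. */
bool read_device_ready(struct pci_dev *pdev);

/* Sketch: bring the device back to D0 via the PM capability, then poll the
 * hypothetical readiness register until the device has finished restoring
 * its (possibly large) state, instead of relying on the 10 ms PCI limit. */
static int virtio_wait_device_ready(struct pci_dev *pdev)
{
        int i;

        pci_set_power_state(pdev, PCI_D0);      /* D3hot -> D0 */

        for (i = 0; i < 100; i++) {             /* up to ~1 s for the sketch */
                if (read_device_ready(pdev))
                        return 0;
                msleep(10);
        }
        return -ETIMEDOUT;
}
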
>>>> so the
>>>> guest will not write D0 through the PM cap, and I can't know when to
>>>> restore the resources (set preserver_resource to 0).
>>>> Do you have any other suggestions?
>>>> Or can I just fall back to the version that adds a new
>>>> command (VIRTIO_GPU_CMD_PRESERVE_RESOURCE) to virtio-gpu? I think that
>>>> way is more reasonable and feasible for virtio-gpu to protect display
>>>> resources during S3. As for other devices, if necessary, they can
>>>> also refer to the virtio-gpu implementation and add new
>>>> commands to prevent resource loss during S3.
>>>>
>>>> On 2023/10/27 11:03, Chen, Jiqian wrote:
>>>>> Hi Michael S. Tsirkin and Parav Pandit, Thank you for your detailed
>>>>> explanation. I will try to use PM cap to fix this issue.
>>>>>
>>>>> On 2023/10/26 18:30, Michael S. Tsirkin wrote:
>>>>>> On Thu, Oct 26, 2023 at 10:24:26AM +0000, Chen, Jiqian wrote:
>>>>>>>
>>>>>>> On 2023/10/25 11:51, Parav Pandit wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>> From: Chen, Jiqian <Jiqian.Chen@amd.com>
>>>>>>>>> Sent: Tuesday, October 24, 2023 5:44 PM
>>>>>>>>>
>>>>>>>>> On 2023/10/24 18:51, Parav Pandit wrote:
>>>>>>>>>>
>>>>>>>>>>> From: Chen, Jiqian <Jiqian.Chen@amd.com>
>>>>>>>>>>> Sent: Tuesday, October 24, 2023 4:06 PM
>>>>>>>>>>>
>>>>>>>>>>> On 2023/10/23 21:35, Parav Pandit wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> From: Chen, Jiqian <Jiqian.Chen@amd.com>
>>>>>>>>>>>>> Sent: Monday, October 23, 2023 4:09 PM
>>>>>>>>>>>>
>>>>>>>>>>>>>> I think this can be done without introducing the new register.
>>>>>>>>>>>>>> Can you please check if the PM register itself can serve the
>>>>>>>>>>>>>> purpose instead of a new virtio level register?
>>>>>>>>>>>>> Do you mean the system PM register?
>>>>>>>>>>>> No, the device's PM register at transport level.
>>>>>>>>>>> I tried to find this register (pci level or virtio pci level or
>>>>>>>>>>> virtio driver level), but I didn't find it in the Linux kernel or
>>>>>>>>>>> Qemu code.
>>>>>>>>>>> May I know which register you are referring to specifically?
>>>>>>>>>>> Or which PM state bit you mentioned below?
>>>>>>>>>>>
>>>>>>>>>> PCI spec's PCI Power Management Capability Structure in section
>>>>>>>>>> 7.5.2.
>>>>>>>>> Yes, what you point to is the PM capability for PCIe devices.
>>>>>>>>> But the problem is still that in the Qemu code, the
>>>>>>>>> condition (pci_bus_is_express or pci_is_express) is checked for all
>>>>>>>>> virtio-pci devices in the function virtio_pci_realize(); if a
>>>>>>>>> virtio device isn't a PCIe device, the PM capability will not be
>>>>>>>>> added for it.
>>>>>>>> PCI PM capability is a must for PCIe devices. So maybe QEMU code
>>>>>>>> has put it only under the is_pcie check.
>>>>>>>> But it can be done outside of that check as well, because this
>>>>>>>> capability has existed on PCI too for a long time and it is
>>>>>>>> backward compatible.
>>>>>>> Do you suggest that I implement the PM capability for virtio devices
>>>>>>> in Qemu first, and then try whether the PM capability can work for
>>>>>>> this scenario?
>>>>>>
>>>>>> virtio devices in qemu already have a PM capability.
>>>>>>
>>>>>>
>>>>>>> If so, we will complicate a simple problem. Because there is no other
>>>>>>> need to add the PM capability for virtio devices for now, adding it
>>>>>>> just for preserving resources seems unnecessary and unreasonable.
>>>>>>> And we are not sure whether there are other scenarios, outside of PM
>>>>>>> state changes, that also need to preserve resources; if there are,
>>>>>>> the PM register can't cover them, but a preserve_resources register
>>>>>>> can.
>>>>>>
>>>>>> One of the selling points of virtio is precisely reusing existing
>>>>>> platform capabilities as opposed to coming up with our own.
>>>>>> See abstract.tex
>>>>>>
>>>>>>
>>>>>>> Can I add a note like "If the PM capability is implemented for virtio
>>>>>>> devices, it may cover this scenario, and if there are no other
>>>>>>> scenarios that PM can't cover, then we can remove this register" in
>>>>>>> the commit message or spec description, and let us continue to add
>>>>>>> the preserve_resources register?
>>>>>>
>>>>>> We can't remove registers.
>>>>>>
>>>>>>>>
>>>>>>>>> And another problem: what about the MMIO transport devices?
>>>>>>>>> Preserving resources is needed for all transport types.
>>>>>>>>>
>>>>>>>> MMIO lacks such rich PM definitions. If in the future MMIO wants to
>>>>>>>> support this, it will be extended to match other transports like PCI.
>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> I think it is unreasonable to let the virtio device listen to
>>>>>>>>>>>>> the PM state of the guest system.
>>>>>>>>>>>> The guest driver performs any such work on the guest system's PM
>>>>>>>>>>>> callback events in the virtio driver.
>>>>>>>>>>> I didn't find any PM state callback in the virtio driver.
>>>>>>>>>>>
>>>>>>>>>> There are virtio_suspend and virtio_resume in the case of Linux.
>>>>>>>>> I think what you call virtio_suspend/resume is the freeze/restore
>>>>>>>>> callback from "struct virtio_driver" or the suspend/resume callback
>>>>>>>>> from "static const struct dev_pm_ops virtio_pci_pm_ops".
>>>>>>>>> And yes, I agree, if virtio devices have the PM capability, maybe we
>>>>>>>>> can set the PM state in those callback functions.
>>>>>>>>>
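For context, a minimal sketch of the hooks being referred to, i.e. wiring
freeze/restore callbacks into struct virtio_driver on Linux; the function
bodies are placeholders, not the actual virtio-gpu implementation:

#include <linux/virtio.h>

#ifdef CONFIG_PM_SLEEP
/* Called on suspend, before the transport puts the device into D3. A
 * virtio-gpu driver could tell the device to preserve resources here. */
static int virtgpu_freeze(struct virtio_device *vdev)
{
        return 0;
}

/* Called on resume, after the transport brings the device back to D0. A
 * virtio-gpu driver could tell the device to restore resources here. */
static int virtgpu_restore(struct virtio_device *vdev)
{
        return 0;
}
#endif

static struct virtio_driver virtio_gpu_sketch_driver = {
        .driver.name = "virtio_gpu_sketch",
#ifdef CONFIG_PM_SLEEP
        .freeze = virtgpu_freeze,
        .restore = virtgpu_restore,
#endif
};
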
>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> It's more suitable that each device gets a notification from
>>>>>>>>>>>>> the driver, and then does the resource-preserving operation.
>>>>>>>>>>>> I agree that each device gets the notification from the driver.
>>>>>>>>>>>> The question is, should it be the virtio driver, or is the
>>>>>>>>>>>> existing pci driver, which transitions the state from d0->d3 and
>>>>>>>>>>>> d3->d0, enough.
>>>>>>>>>>> It seems there isn't an existing pci driver that transitions to
>>>>>>>>>>> the d0 or d3 state. Could you please tell me which one it is
>>>>>>>>>>> specifically? I am very willing to give it a try.
>>>>>>>>>>>
>>>>>>>>>> The virtio-pci modern driver of Linux should be able to.
>>>>>>>>> Yes, I know, it is VIRTIO_PCI_FLAG_INIT_PM_BIT. But there are still
>>>>>>>>> the two problems I mentioned above.
>>>>>>>>>
>>>>>>>> Both can be resolved without switching to pcie.
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>> Can you please check that?
>>>>>>>>>>>>
>>>>>>>>>>>>>>> --- a/transport-pci.tex
>>>>>>>>>>>>>>> +++ b/transport-pci.tex
>>>>>>>>>>>>>>> @@ -325,6 +325,7 @@ \subsubsection{Common configuration structure
>>>>>>>>>>>>>>> layout}\label{sec:Virtio Transport
>>>>>>>>>>>>>>>          /* About the administration virtqueue. */
>>>>>>>>>>>>>>>          le16 admin_queue_index;         /* read-only for driver */
>>>>>>>>>>>>>>>          le16 admin_queue_num;         /* read-only for driver */
>>>>>>>>>>>>>>> +        le16 preserve_resources;        /* read-write */
>>>>>>>>>>>>>> Preserving these resources in the device implementation takes
>>>>>>>>>>>>>> a finite amount of time.
>>>>>>>>>>>>>> Possibly more than 40 nsec (the time of a PCIe write TLP).
>>>>>>>>>>>>>> Hence this register must be a polling register to indicate
>>>>>>>>>>>>>> preservation_done.
>>>>>>>>>>>>>> This will tell the guest when the preservation is done, and
>>>>>>>>>>>>>> when restoration is done, so that it can resume upper layers.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Please refer to the queue_reset definition to learn more about
>>>>>>>>>>>>>> such a register definition.
>>>>>>>>>>>>> Thanks, I will refer to "queue_reset". So, I need three
>>>>>>>>>>>>> values: the driver writes 1 to let the device preserve
>>>>>>>>>>>>> resources, the driver writes 2 to let the device restore
>>>>>>>>>>>>> resources, and the device writes 0 to tell the driver that
>>>>>>>>>>>>> preserving or restoring is done, am I right?
>>>>>>>>>>>>>
>>>>>>>>>>>> Right.
>>>>>>>>>>>>
>>>>>>>>>>>> And if the existing pcie pm state bits can do it, we can
>>>>>>>>>>>> leverage that.
>>>>>>>>>>>> If they cannot be used, let's add that reasoning to the commit
>>>>>>>>>>>> log to describe this register.
>>>>>>>>>>>>
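As a rough sketch of the register protocol agreed above (driver writes 1 to
preserve, 2 to restore, device clears the field to 0 when done), modelled on
the queue_reset pattern and assuming the vp_iowrite16()/vp_ioread16() helpers
from Linux's modern virtio-pci transport; the preserve_resources field itself
is the proposal under discussion and does not exist in the spec today:

#include <linux/delay.h>
#include <linux/errno.h>
#include <linux/virtio_pci_modern.h>

#define PRESERVE_START  1       /* driver asks the device to preserve resources */
#define RESTORE_START   2       /* driver asks the device to restore resources  */
#define PRESERVE_DONE   0       /* device reports the operation has completed   */

/* Sketch: write the requested operation and poll until the device writes 0
 * back. 'reg' would point at the proposed preserve_resources field in the
 * common configuration structure. */
static int preserve_resources_op(__le16 __iomem *reg, u16 op)
{
        int i;

        vp_iowrite16(op, reg);

        for (i = 0; i < 100; i++) {     /* arbitrary timeout for the sketch */
                if (vp_ioread16(reg) == PRESERVE_DONE)
                        return 0;
                msleep(10);
        }
        return -ETIMEDOUT;
}
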
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Let's please make sure whether the PCIe PM level registers are
>>>>>>>>>>>>>> sufficient or not before deciding on the addition of this
>>>>>>>>>>>>>> register.
>>>>>>>>>>>>> But if the device is not a PCIe device, it doesn't have the PM
>>>>>>>>>>>>> capability, so this will not work. Actually, in my local
>>>>>>>>>>>>> environment, pci_is_express() returns false in Qemu; they are
>>>>>>>>>>>>> not PCIe devices.
>>>>>>>>>>>> It is reasonable to ask to plug it in as a PCIe device in 2023
>>>>>>>>>>>> to get new functionality, especially for a gpu device :-), which
>>>>>>>>>>>> does not have a very long history of any backward compatibility.
>>>>>>>>>>> Do you suggest me to add PM capability for virtio-gpu or
>>>>>>>>>>> change virtio-gpu to a PCIe device?
>>>>>>>>>>>
>>>>>>>>>> PCI Power Management Capability Structure does not seem to be
>>>>>>>>>> limited to PCIe.
>>>>>>>>> I am not sure, but in the current Qemu code I can see the check
>>>>>>>>> "pci_is_express" for all virtio-pci devices. If we want to add the
>>>>>>>>> PM capability for virtio-pci devices, we need to change them to
>>>>>>>>> PCIe devices, I think.
>>>>>>>>>
>>>>>>>> That is one option.
>>>>>>>> The second option is to extend the PCI PM cap to non-PCIe devices,
>>>>>>>> because it is supported there as well.
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Best regards,
>>>>>>>>> Jiqian Chen.
>>>>>>>
>>>>>>> --
>>>>>>> Best regards,
>>>>>>> Jiqian Chen.
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Best regards,
>>>> Jiqian Chen.
>>
>> --
>> Best regards,
>> Jiqian Chen.

-- 
Best regards,
Jiqian Chen.

