Re: [PATCH v1] virtio-gpu: Document new fence-passing feature

On Fri, Feb 2, 2024 at 11:40âAM Rob Clark <robdclark@gmail.com> wrote:

On Fri, Feb 2, 2024 at 10:16âAM Gurchetan Singh
<gurchetansingh@chromium.org> wrote:
>
>
>
> On Thu, Feb 1, 2024 at 11:19âAM Rob Clark <robdclark@gmail.com> wrote:
>>
>> On Thu, Feb 1, 2024 at 10:08âAM Gurchetan Singh
>> <gurchetansingh@chromium.org> wrote:
>> >
>> >
>> >
>> > On Thu, Feb 1, 2024 at 9:14âAM Rob Clark <robdclark@gmail.com> wrote:
>> >>
>> >> On Thu, Feb 1, 2024 at 9:00âAM Gurchetan Singh
>> >> <gurchetansingh@chromium.org> wrote:
>> >> >
>> >> >
>> >> >
>> >> >
>> >> > On Mon, Jan 29, 2024 at 9:21âPM Dmitry Osipenko <dmitry.osipenko@collabora.com> wrote:
>> >> >>
>> >> >> On 1/30/24 04:34, Gurchetan Singh wrote:
>> >> >> > On Sun, Jan 28, 2024 at 11:12âAM Dmitry Osipenko <
>> >> >> > dmitry.osipenko@collabora.com> wrote:
>> >> >> >
>> >> >> >> Document new virtio-gpu protocol fence-passing feature that extends 3d
>> >> >> >> submit command with support of in-fences which are passed from guest to
>> >> >> >> host for waiting. This feature allows to offload waiting for in-fences
>> >> >> >> from guest to host, where it can be done more efficiently.
>> >> >> >>
>> >> >> >> Signed-off-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
>> >> >> >> ---
>> >> >> >>Â device-types/gpu/description.tex | 22 +++++++++++++++++++++-
>> >> >> >>Â 1 file changed, 21 insertions(+), 1 deletion(-)
>> >> >> >>
>> >> >> >> diff --git a/device-types/gpu/description.tex
>> >> >> >> b/device-types/gpu/description.tex
>> >> >> >> index 443524851a05..e5e23b5c9072 100644
>> >> >> >> --- a/device-types/gpu/description.tex
>> >> >> >> +++ b/device-types/gpu/description.tex
>> >> >> >> @@ -37,6 +37,8 @@ \subsection{Feature bits}\label{sec:Device Types / GPU
>> >> >> >> Device / Feature bits}
>> >> >> >>Â Â resources is supported.
>> >> >> >>Â \item[VIRTIO_GPU_F_CONTEXT_INIT (4)] multiple context types and
>> >> >> >>Â Â synchronization timelines supported.Â Requires VIRTIO_GPU_F_VIRGL.
>> >> >> >> +\item[VIRTIO_GPU_F_FENCE_PASSING (5)] passing fence IDs from guest to host
>> >> >> >> +Â for waiting supported.Â Requires VIRTIO_GPU_F_VIRGL.
>> >> >> >>Â \end{description}
>> >> >> >>
>> >> >> >>Â \subsection{Device configuration layout}\label{sec:Device Types / GPU
>> >> >> >> Device / Device configuration layout}
>> >> >> >> @@ -746,7 +748,25 @@ \subsubsection{Device Operation: controlq
>> >> >> >> (3d)}\label{sec:Device Types / GPU Dev
>> >> >> >>
>> >> >> >>Â \item[VIRTIO_GPU_CMD_SUBMIT_3D]
>> >> >> >>Â Â Submit an opaque command stream.Â The type of the command stream is
>> >> >> >> -Â determined when creating a context.
>> >> >> >> +Â determined when creating a context.Â Request data is
>> >> >> >> +Â \field{struct virtio_gpu_cmd_submit}.Â The \field{size} field describes
>> >> >> >> +Â the size of \field{cmd_data} array in bytes.Â The \field{num_in_fences}
>> >> >> >> +Â field is active only if VIRTIO_GPU_F_FENCE_PASSING is enabled, otherwise
>> >> >> >> +Â \field{num_in_fences} is treated as zero.Â The array of
>> >> >> >> \field{in_fences}
>> >> >> >> +Â IDs is placed in between the \field{num_in_fences} field and the
>> >> >> >> +Â \field{cmd_data} array.Â The \field{in_fences} array contains virtio-gpu
>> >> >> >> +Â \field{fence_id}'s corresponding to the out-fence IDs of a previous
>> >> >> >> +Â 3d submits.
>> >> >> >> +
>> >> >> >> +\begin{lstlisting}
>> >> >> >> +struct virtio_gpu_cmd_submit {
>> >> >> >> +Â Â Â Â struct virtio_gpu_ctrl_hdr hdr;
>> >> >> >> +Â Â Â Â le32 size;
>> >> >> >> +Â Â Â Â le32 num_in_fences;
>> >> >> >> +Â Â Â Â le64 in_fences[num_in_fences];
>> >> >> >> +Â Â Â Â u32 cmd_data[size/4];
>> >> >> >> +};
>> >> >> >> +\end{lstlisting}
>> >> >> >>
>> >> >> >
>> >> >> > Don't you need VIRTIO_GPU_FLAG_FENCE_SHAREABLE somewhere too?
>> >> >>
>> >> >> We haven't made use of this flag in the experimental/prototype version
>> >> >> that was made for the native contexts. The flag is unused by both Qemu
>> >> >> and CrosVM.
>> >> >>
>> >> >> We added Linux kernel UAPI flag VIRTGPU_EXECBUF_SHARED_FENCE and it's
>> >> >> deemed to be enough. If the UAPI flag isn't set for a job's out-fence,
>> >> >> then guest uses guest-side synchronization for this fence when it's
>> >> >> passed as in-fence for the further jobs.
>> >> >>
>> >> >> The FENCE_SHAREABLE flag isn't needed by nctx/virgl/venus protocols. All
>> >> >> fences are shareable by default. If VMM fails to export fence, it is
>> >> >> treated like fence was already signaled. In case of Venus, it's up to
>> >> >> Venus to decide internally which fences can be made shareable, i.e. both
>> >> >> guest and virglrenderer Venus implementations are responsible for making
>> >> >> decision whether fence should be made exportable. Venus won't make
>> >> >> out-fence exportable for internal intermediate jobs and will encode this
>> >> >> export-info within the Venus protocol.
>> >> >>
>> >> >> Whether FENCE_SHAREABLE virtio-gpu flag will be needed by gfxstream is
>> >> >> questionable. Gfxstream should be similar to Venus, i.e. you should be
>> >> >> able to add flag to the gfxstream. We'll need to be certain that the new
>> >> >> common flag is definitely needed by making use of it before specifying
>> >> >> it in the stable virtio-gpu protocol.
>> >> >
>> >> >
>> >> > I don't think you've gotten fence passing working with display integration.Â Although passing external fences between two GPU instances does happen, the much more common and important case is passing a fence from the virtual GPU to virtual display.
>> >> >
>> >> > Let me describe the use case for you:
>> >> >
>> >> > GPU creates a fence
>> >> > rutabaga exports fence on creation via stream_renderer_export_fence / or virglrenderer_export_fence --> creates [fence_id --> descriptor mapping] when FLAG_SHAREABLE is set
>> >> > SUBMIT_CMD + CROSS_DOMAIN_SEND --> takes the fence_ids and sends a fence to the compositor
>> >> >
>> >> > You also refer to https://gitlab.freedesktop.org/virgl/virglrenderer/-/merge_requests/1138#note_1996999.
>> >> >
>> >> > Once you have a working prototype that shares an out_fence with the display, then we can talk about whether FLAG_SHAREABLE is needed or not.Â So I suggest you focus on that before upstreaming uapis + virtio protocols.
>> >> >
>> >>
>> >> Why would sharing with display (via a different virtgpu ctx type) be
>> >> any different than sharing with a different GPU context?Â I don't
>> >> really see a problem here, either the fence fd still exists on the
>> >> host side, or you know that it is already signaled and you don't have
>> >> to wait on anything.
>> >
>> >
>> > I'm not a huge fan of the fact that the current implementation uses EGL to create the pre-signalled fence.Â Vulkan ... it could work, but I think you'll have to generate some sort of dummy command at startup?Â Shouldn't that be implemented too?
>> >
>> > Something should use the new "virgl_renderer_export_signalled_fence" API to verify everything works.
>>
>> ok, so I gather you are talking about the awkwardness with the wayland
>> protocol that we can't just pass -1 for a fence?Â Rather than say
>> something using kms.
>
>
> Is there a plan to do a virtio-gpu KMS implementation?Â Something like RESOURCE_FLUSH (in_fence)?Â There are still many virtio-gpu KMS users out there.
>
>>
>>
>> I think this is just an implementation detail.Â At any rate, in the
>> later version of the virglrenderer implementation, which got merged,
>> there is no dependency on EGL for this.Â Only that there was _some_
>> prior work which created a fence.Â Virglrenderer hangs on to a
>> reference to the last signalled fence until it is replaced with a
>> newer one.Â If a guest fence is created, by definition the host will
>> have _some_ signaled fence to hand out.
>
>
> The implementation does use a static global table (virgl_fence_table) to coordinate between the virgl sync thread and the main thread?Â This is common in C, but discouraged in Rust (and I do have a use case in mind).
>

This is fallout of how the virglrenderer API was designed (ie. not
having a toplevel vgpu instance object).Â But this seems pretty off
topic from the protocol definition.Â Maybe we can revisit if there
ever is a libvirglrenderer2 where we can break API compatibility.

Well, at the very least we should pass through the proper information.Â FLAG_SHAREABLE is completely costless (DESTROY_FENCE would in theory add VM-exits).Â I think we can pass some sort of shared data structure

> Spec-wise, having FLAG_SHAREABLE + another DESTROY_FENCE api would probably be the cleanest since it gives explicit lifetimes.Â If there were no performance concerns we would totally do it.
>
> The concern is that DESTROY_FENCE causes excessive vm-exits as fences are attached to the surface every frame.Â So that's how we end up with the last signalled fence idea, and somewhat undefined lifetime/ownership semantics.
>

The semantics are in fact well defined, a fence is just an integer in
the context of a timeline, there is _no_ lifetime/ownership.

Interesting, your concerns are not just performance related, but more philosophical.Â Namely:

- What does a "fence_id" refer to?

- What is a fence on the host side?

I think the spec should address these questions.

You can't pass an integer between two GPU HW contexts, or share an integer with the display.Â You can do those things with kernel file descriptors like sync_file (backed by dma_fence, which is refcounted).

In the past, each fence_id was associated with a glSync object.Â It was deleted after the sync thread was done waiting for it (see free_fence_locked in vrend_renderer.c).Â This meant the host lifetime was tied to fence signalling.Â The guest dma_fence would out-live the host side glSync object (backed by dma_fence under the hood).Â The fence_id wasn't used by the host after signaling.

Now, we're in the situation where the fence_id on the host side can be used after it's been signaled, and will be valid so long as the guest dma_fence is around.Â "fence_id" must refer to a shareable host kernel object when VIRTIO_GPU_F_FENCE_PASSING is set.Â Given the new requirements, that's where the idea behind explicit lifetimes come from.Â

The last signaled fence is a way to refer to *some* kernel object, after the original one has already been signaled and destroyed.Â Using last signaled fence rather the original one may create oddities -- for example, if another process (say Wayland compositor) calls SYNC_IOC_FILE_INFO from fences received from virtio-gpu, there might be data anomalies (timestamps, name) compared to if we just passed the original kernel object.Â

We can probably survive these oddities, but we can probably avoid them too, so that's why it would be nice for the guest to provide the information it has.Â

> Though, the guest kernel already tracks fence lifetimes through dma_fence.Â What if we add:
>
> - DESTROY_FENCES(fence_ids, num_fences)
> - a virtio-config ("fence_destroy_threshold"), which controls num_fences in DESTROY_FENCES
>
> When fence_destroy_threshold == 0, this would be the current proposed solution (fence lifetimes implementation defined).Â However, a user can make "fence_destroy_threshold" 100 or something like that, to cause a VM-exit everytime 100 fences have been destroyed.
>
> This attempts a tunable compromise between API purity and performance concerns.Â WDYT?

Tbh, I think the current approach is cleaner from the PoV of what a
fence is.. it is just a seqno, it should not have a lifetime.

BR,
-R

>
>>
>>
>> But this is all implementation detail
>>
>> BR,
>> -R
>>
>> >
>> > Essentially, let's do our due diligence and verify the most important use case (gpu --> display) actually works.
>> >
>> >>
>> >>
>> >> BR,
>> >> -R
>> >>
>> >> >> --
>> >> >> Best regards,
>> >> >> Dmitry
>> >> >>

virtio-comment message