virtio-comment message



Subject: Re: [PATCH v1] virtio-gpu: Document new fence-passing feature




On Tue, Feb 6, 2024 at 8:55 PM Rob Clark <robdclark@gmail.com> wrote:
On Tue, Feb 6, 2024 at 4:09 PM Gurchetan Singh
<gurchetansingh@chromium.org> wrote:
>
>
>
> On Fri, Feb 2, 2024 at 11:40 AM Rob Clark <robdclark@gmail.com> wrote:
>>
>> On Fri, Feb 2, 2024 at 10:16 AM Gurchetan Singh
>> <gurchetansingh@chromium.org> wrote:
>> >
>> >
>> >
>> > On Thu, Feb 1, 2024 at 11:19 AM Rob Clark <robdclark@gmail.com> wrote:
>> >>
>> >> On Thu, Feb 1, 2024 at 10:08 AM Gurchetan Singh
>> >> <gurchetansingh@chromium.org> wrote:
>> >> >
>> >> >
>> >> >
>> >> > On Thu, Feb 1, 2024 at 9:14 AM Rob Clark <robdclark@gmail.com> wrote:
>> >> >>
>> >> >> On Thu, Feb 1, 2024 at 9:00 AM Gurchetan Singh
>> >> >> <gurchetansingh@chromium.org> wrote:
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> >
>> >> >> > On Mon, Jan 29, 2024 at 9:21 PM Dmitry Osipenko <dmitry.osipenko@collabora.com> wrote:
>> >> >> >>
>> >> >> >> On 1/30/24 04:34, Gurchetan Singh wrote:
>> >> >> >> > On Sun, Jan 28, 2024 at 11:12 AM Dmitry Osipenko <
>> >> >> >> > dmitry.osipenko@collabora.com> wrote:
>> >> >> >> >
>> >> >> >> >> Document new virtio-gpu protocol fence-passing feature that extends 3d
>> >> >> >> >> submit command with support of in-fences which are passed from guest to
>> >> >> >> >> host for waiting. This feature allows to offload waiting for in-fences
>> >> >> >> >> from guest to host, where it can be done more efficiently.
>> >> >> >> >>
>> >> >> >> >> Signed-off-by: Dmitry Osipenko <dmitry.osipenko@collabora.com>
>> >> >> >> >> ---
>> >> >> >> >> device-types/gpu/description.tex | 22 +++++++++++++++++++++-
>> >> >> >> >> 1 file changed, 21 insertions(+), 1 deletion(-)
>> >> >> >> >>
>> >> >> >> >> diff --git a/device-types/gpu/description.tex
>> >> >> >> >> b/device-types/gpu/description.tex
>> >> >> >> >> index 443524851a05..e5e23b5c9072 100644
>> >> >> >> >> --- a/device-types/gpu/description.tex
>> >> >> >> >> +++ b/device-types/gpu/description.tex
>> >> >> >> >> @@ -37,6 +37,8 @@ \subsection{Feature bits}\label{sec:Device Types / GPU
>> >> >> >> >> Device / Feature bits}
>> >> >> >> >>   resources is supported.
>> >> >> >> >> \item[VIRTIO_GPU_F_CONTEXT_INIT (4)] multiple context types and
>> >> >> >> >>   synchronization timelines supported. Requires VIRTIO_GPU_F_VIRGL.
>> >> >> >> >> +\item[VIRTIO_GPU_F_FENCE_PASSING (5)] passing fence IDs from guest to host
>> >> >> >> >> + for waiting supported. Requires VIRTIO_GPU_F_VIRGL.
>> >> >> >> >> \end{description}
>> >> >> >> >>
>> >> >> >> >> \subsection{Device configuration layout}\label{sec:Device Types / GPU
>> >> >> >> >> Device / Device configuration layout}
>> >> >> >> >> @@ -746,7 +748,25 @@ \subsubsection{Device Operation: controlq
>> >> >> >> >> (3d)}\label{sec:Device Types / GPU Dev
>> >> >> >> >>
>> >> >> >> >> \item[VIRTIO_GPU_CMD_SUBMIT_3D]
>> >> >> >> >>   Submit an opaque command stream. The type of the command stream is
>> >> >> >> >> - determined when creating a context.
>> >> >> >> >> + determined when creating a context. Request data is
>> >> >> >> >> + \field{struct virtio_gpu_cmd_submit}. The \field{size} field describes
>> >> >> >> >> + the size of \field{cmd_data} array in bytes. The \field{num_in_fences}
>> >> >> >> >> + field is active only if VIRTIO_GPU_F_FENCE_PASSING is enabled, otherwise
>> >> >> >> >> + \field{num_in_fences} is treated as zero. The array of \field{in_fences}
>> >> >> >> >> + IDs is placed in between the \field{num_in_fences} field and the
>> >> >> >> >> + \field{cmd_data} array. The \field{in_fences} array contains virtio-gpu
>> >> >> >> >> + \field{fence_id}'s corresponding to the out-fence IDs of a previous
>> >> >> >> >> + 3d submits.
>> >> >> >> >> +
>> >> >> >> >> +\begin{lstlisting}
>> >> >> >> >> +struct virtio_gpu_cmd_submit {
>> >> >> >> >> +        struct virtio_gpu_ctrl_hdr hdr;
>> >> >> >> >> +        le32 size;
>> >> >> >> >> +        le32 num_in_fences;
>> >> >> >> >> +        le64 in_fences[num_in_fences];
>> >> >> >> >> +        u32 cmd_data[size/4];
>> >> >> >> >> +};
>> >> >> >> >> +\end{lstlisting}
>> >> >> >> >>
>> >> >> >> >
>> >> >> >> > Don't you need VIRTIO_GPU_FLAG_FENCE_SHAREABLE somewhere too?
>> >> >> >>
>> >> >> >> We haven't made use of this flag in the experimental/prototype version
>> >> >> >> that was made for the native contexts. The flag is unused by both Qemu
>> >> >> >> and CrosVM.
>> >> >> >>
>> >> >> >> We added Linux kernel UAPI flag VIRTGPU_EXECBUF_SHARED_FENCE and it's
>> >> >> >> deemed to be enough. If the UAPI flag isn't set for a job's out-fence,
>> >> >> >> then guest uses guest-side synchronization for this fence when it's
>> >> >> >> passed as in-fence for the further jobs.
>> >> >> >>
>> >> >> >> The FENCE_SHAREABLE flag isn't needed by nctx/virgl/venus protocols. All
>> >> >> >> fences are shareable by default. If VMM fails to export fence, it is
>> >> >> >> treated like fence was already signaled. In case of Venus, it's up to
>> >> >> >> Venus to decide internally which fences can be made shareable, i.e. both
>> >> >> >> guest and virglrenderer Venus implementations are responsible for making
>> >> >> >> decision whether fence should be made exportable. Venus won't make
>> >> >> >> out-fence exportable for internal intermediate jobs and will encode this
>> >> >> >> export-info within the Venus protocol.
>> >> >> >>
>> >> >> >> Whether FENCE_SHAREABLE virtio-gpu flag will be needed by gfxstream is
>> >> >> >> questionable. Gfxstream should be similar to Venus, i.e. you should be
>> >> >> >> able to add flag to the gfxstream. We'll need to be certain that the new
>> >> >> >> common flag is definitely needed by making use of it before specifying
>> >> >> >> it in the stable virtio-gpu protocol.
>> >> >> >
>> >> >> >
>> >> >> > I don't think you've gotten fence passing working with display integration. Although passing external fences between two GPU instances does happen, the much more common and important case is passing a fence from the virtual GPU to virtual display.
>> >> >> >
>> >> >> > Let me describe the use case for you:
>> >> >> >
>> >> >> > GPU creates a fence
>> >> >> > rutabaga exports fence on creation via stream_renderer_export_fence / or virglrenderer_export_fence --> creates [fence_id --> descriptor mapping] when FLAG_SHAREABLE is set
>> >> >> > SUBMIT_CMD + CROSS_DOMAIN_SEND --> takes the fence_ids and sends a fence to the compositor
>> >> >> >
>> >> >> > You also refer to https://gitlab.freedesktop.org/virgl/virglrenderer/-/merge_requests/1138#note_1996999.
>> >> >> >
>> >> >> > Once you have a working prototype that shares an out_fence with the display, then we can talk about whether FLAG_SHAREABLE is needed or not. So I suggest you focus on that before upstreaming uapis + virtio protocols.
>> >> >> >
>> >> >>
>> >> >> Why would sharing with display (via a different virtgpu ctx type) be
>> >> >> any different than sharing with a different GPU context? I don't
>> >> >> really see a problem here, either the fence fd still exists on the
>> >> >> host side, or you know that it is already signaled and you don't have
>> >> >> to wait on anything.
>> >> >
>> >> >
>> >> > I'm not a huge fan of the fact that the current implementation uses EGL to create the pre-signalled fence. Vulkan ... it could work, but I think you'll have to generate some sort of dummy command at startup? Shouldn't that be implemented too?
>> >> >
>> >> > Something should use the new "virgl_renderer_export_signalled_fence" API to verify everything works.
>> >>
>> >> ok, so I gather you are talking about the awkwardness with the wayland
>> >> protocol that we can't just pass -1 for a fence? Rather than say
>> >> something using kms.
>> >
>> >
>> > Is there a plan to do a virtio-gpu KMS implementation? Something like RESOURCE_FLUSH (in_fence)? There are still many virtio-gpu KMS users out there.
>> >
>> >>
>> >>
>> >> I think this is just an implementation detail. At any rate, in the
>> >> later version of the virglrenderer implementation, which got merged,
>> >> there is no dependency on EGL for this. Only that there was _some_
>> >> prior work which created a fence. Virglrenderer hangs on to a
>> >> reference to the last signalled fence until it is replaced with a
>> >> newer one. If a guest fence is created, by definition the host will
>> >> have _some_ signaled fence to hand out.
>> >
>> >
>> > The implementation does use a static global table (virgl_fence_table) to coordinate between the virgl sync thread and the main thread? This is common in C, but discouraged in Rust (and I do have a use case in mind).
>> >
>>
>> This is fallout of how the virglrenderer API was designed (ie. not
>> having a toplevel vgpu instance object). But this seems pretty off
>> topic from the protocol definition. Maybe we can revisit if there
>> ever is a libvirglrenderer2 where we can break API compatibility.
>
>
> Well, at the very least we should pass through the proper information. FLAG_SHAREABLE is completely costless (DESTROY_FENCE would in theory add VM-exits). I think we can pass some sort of shared data structure
>

costless or not, it is _useless_... I don't really see a good way to
not just set FLAG_SHAREABLE on everything, and it kinda works against
the guest kernel view of things... ie if you have a guest dma_fence,
of course it is shareable. That is why it exists in the first place.
But it is designed to be cheap.

The definition of FLAG_SHAREABLE would be something like:

"This fence is shareable across host 3D contexts and subsystems (including other virtio-devices)"

- For fences that will be shared with the host display, the gfxstream guest will set the flag VIRTGPU_EXECBUF_SHARED_FENCE
- For CPU fences, or when the display is not capable of fence sharing, VIRTGPU_EXECBUF_SHARED_FENCE will not be set
- That would turn into FLAG_SHAREABLE on the host: rutabaga_gfx will call stream_renderer_export_fence to take ownership of the fence
- When KMS issues RESOURCE_FLUSH(fence_id), rutabaga_export_fence will transfer ownership to the display
- After it's sent to Wayland, the fence may be closed

This flow has no mention of the last signaled fence, since gfxstream/rutabaga_gfx can avoid it through clearer ownership semantics.

There does exist a theoretical case where the same fence is shared between two different contexts, but I believe that may actually not happen in practice yet. The last signalled fence trick would fail on that theoretical case too.

So with FLAG_SHAREABLE, I think we can:

- Avoid the last signalled fence trick (though virglrenderer may still choose to use it) and associated oddities
- Make the fence transient, as you desire

Since VIRTGPU_EXECBUF_SHARED_FENCE already exists, adding FLAG_SHAREABLE on the virtio side would be very trivial.
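
To make that concrete, here is a rough guest-side sketch of the translation. It assumes a hypothetical VIRTIO_GPU_FLAG_FENCE_SHAREABLE bit; only VIRTIO_GPU_FLAG_FENCE and VIRTIO_GPU_FLAG_INFO_RING_IDX exist in the current spec, and both new flag values below are illustrative rather than taken from any header:

#include <stdint.h>

/* Existing request-header flags from the virtio-gpu spec. */
#define VIRTIO_GPU_FLAG_FENCE            (1u << 0)
#define VIRTIO_GPU_FLAG_INFO_RING_IDX    (1u << 1)
/* Proposed FLAG_SHAREABLE bit; the value here is purely illustrative. */
#define VIRTIO_GPU_FLAG_FENCE_SHAREABLE  (1u << 2)
/* Proposed execbuffer UAPI bit discussed above; value also illustrative. */
#define VIRTGPU_EXECBUF_SHARED_FENCE     (1u << 3)

/* Hypothetical guest-kernel helper: derive the SUBMIT_3D header flags from
 * the userspace execbuffer flags, requesting an out-fence and marking it
 * shareable only when userspace asked for that. */
static inline uint32_t submit_hdr_flags(uint32_t execbuf_flags)
{
        uint32_t flags = VIRTIO_GPU_FLAG_FENCE | VIRTIO_GPU_FLAG_INFO_RING_IDX;

        if (execbuf_flags & VIRTGPU_EXECBUF_SHARED_FENCE)
                flags |= VIRTIO_GPU_FLAG_FENCE_SHAREABLE;

        return flags;
}

On the host side, a fence whose header lacks the bit would then never need to be exported at all.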


>>
>>
>> > Spec-wise, having FLAG_SHAREABLE + another DESTROY_FENCE api would probably be the cleanest since it gives explicit lifetimes. If there were no performance concerns we would totally do it.
>> >
>> > The concern is that DESTROY_FENCE causes excessive vm-exits as fences are attached to the surface every frame. So that's how we end up with the last signalled fence idea, and somewhat undefined lifetime/ownership semantics.
>> >
>>
>> The semantics are in fact well defined, a fence is just an integer in
>> the context of a timeline, there is _no_ lifetime/ownership.
>
>
> Interesting, your concerns are not just performance related, but more philosophical. Namely:
>
> - What does a "fence_id" refer to?
> - What is a fence on the host side?
>
> I think the spec should address these questions.
>
> You can't pass an integer between two GPU HW contexts, or share an integer with the display. You can do those things with kernel file descriptors like sync_file (backed by dma_fence, which is refcounted).
>

I may have over-simplified things slightly in earlier replies. It is
not a single integer but a pair of integers, ie. the fence seqno and
the timeline id.. but that doesn't really change anything from a
lifetime standpoint.
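
(For reference, that pair is exactly what the control header already carries in the published spec: fence_id is the seqno and ring_idx names the timeline when VIRTIO_GPU_F_CONTEXT_INIT is negotiated.)

struct virtio_gpu_ctrl_hdr {
        le32 type;
        le32 flags;
        le64 fence_id;   /* the seqno */
        le32 ctx_id;
        u8 ring_idx;     /* the timeline, with VIRTIO_GPU_F_CONTEXT_INIT */
        u8 padding[3];
};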

> In the past, each fence_id was associated with a glSync object. It was deleted after the sync thread was done waiting for it (see free_fence_locked in vrend_renderer.c). This meant the host lifetime was tied to fence signalling. The guest dma_fence would out-live the host side glSync object (backed by dma_fence under the hood). The fence_id wasn't used by the host after signaling.
>
> Now, we're in the situation where the fence_id on the host side can be used after it's been signaled, and will be valid so long as the guest dma_fence is around. "fence_id" must refer to a shareable host kernel object when VIRTIO_GPU_F_FENCE_PASSING is set. Given the new requirements, that's where the idea behind explicit lifetimes comes from.
>
> The last signaled fence is a way to refer to *some* kernel object, after the original one has already been signaled and destroyed. Using the last signaled fence rather than the original one may create oddities -- for example, if another process (say a Wayland compositor) calls SYNC_IOC_FILE_INFO on fences received from virtio-gpu, there might be data anomalies (timestamps, name) compared to just passing the original kernel object.
>

I mean, we are bending over backwards for one specific case that we
should maybe solve more directly (by amending the wayland protocol, or
changing how guest/host wayland proxy works)

A fence is meant to be lightweight and transient. It is a thing
frequently created, and short lived. We don't want to have guest
interaction in order to destroy a host dma_fence, in fact ideally we
want to minimize host depending on guest releasing transient
resources, in order to avoid deadlock.. ie. only the host kernel
cannot be preempted by page faults because all of the host kernel
memory is pinned. If the host kernel relies on the guest kernel to
release transient resources, that is a priority inversion which can
lead to deadlock in shrinker/reclaim.

The host kernel shouldn't depend on any userspace for memory management/priority decisions. The virtio-gpu process in QEMU/crosvm is a user-space process, and can be compromised irrespective of DESTROY_FENCES. I don't think DESTROY_FENCES would add to any attack surface.

Though, I hear you on your desire for a transient/lightweight fence. I just think you can accomplish this by being more explicit, at the spec level, about how the fence moves.

For example, say one day we want to share the GPU fence with virtio-video, and wait on the host side before passing to host V4L2. We'll need something like virtio-dmabuf [1] and an acknowledgement that "fence_id" is not just a {seqno, timeline}, but a unique identifier for a host-side sync file. Then we can say -- at the spec level -- that virtio-video takes ownership of the sync file and has responsibility for closing it.

This is both transient and more correct than just giving out the last signaled fence.

However, this would require new V4L2 ioctls that accept an in-fence for the virtio-video case, and hence I propose DESTROY_FENCES as a tunable fallback if we can't make that happen everywhere. But in the ideal world, yes, the fence would be transient.
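
To illustrate the observability point from earlier (the SYNC_IOC_FILE_INFO anomalies), here is a minimal sketch using the standard Linux sync_file UAPI of what a consumer such as a compositor could print for a fence fd it received. If the host handed out a recycled last-signalled fence, the name, status and timestamps reported here describe that other fence rather than the original:

#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/sync_file.h>

/* Print the file-level info for a received fence fd. Passing num_fences = 0
 * asks the kernel for the summary only (no per-fence array is copied). */
static int dump_fence_info(int fence_fd)
{
        struct sync_file_info info = {0};

        if (ioctl(fence_fd, SYNC_IOC_FILE_INFO, &info) < 0)
                return -1;

        printf("name=%.32s status=%d num_fences=%u\n",
               info.name, info.status, info.num_fences);
        return 0;
}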

So, to summarize my review comments:
  - Can we get a FLAG_SHAREABLE in v2?
  - Can we get a RESOURCE_FLUSH(in_fence) for KMS integration in v2? (a rough sketch follows the reference link below)
  - Regarding transiency:
     - Good idea, but can we add it at the spec level rather than leaving it implementation defined?
     - For it to be transient everywhere, we'll need even virtio-video to take an in-fence, at least?
     - I recommend more research on the transient subject; it *could* work

[1] https://lists.nongnu.org/archive/html/qemu-devel/2023-05/msg00595.html
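
For the RESOURCE_FLUSH(in_fence) item in the summary above, one possible shape is shown purely for illustration next to the existing request from the current spec; the fenced variant is hypothetical and not part of any spec or header:

/* Existing request, as in the current spec. */
struct virtio_gpu_resource_flush {
        struct virtio_gpu_ctrl_hdr hdr;
        struct virtio_gpu_rect r;
        le32 resource_id;
        le32 padding;
};

/* Hypothetical fenced variant: the host display would wait for in_fence_id
 * (the out-fence of an earlier 3d submit) before presenting the resource. */
struct virtio_gpu_resource_flush_fenced {
        struct virtio_gpu_ctrl_hdr hdr;
        struct virtio_gpu_rect r;
        le32 resource_id;
        le32 padding;
        le64 in_fence_id;
};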


The last_signaled_fence may be a workaround that leads to weird edge
cases.. but over-complicating transient resource lifetime in the
protocol is not the answer.

BR,
-R

> We can probably survive these oddities, but we can probably avoid them too, so that's why it would be nice for the guest to provide the information it has.
>
>>
>> > Though, the guest kernel already tracks fence lifetimes through dma_fence. What if we add:
>> >
>> > - DESTROY_FENCES(fence_ids, num_fences)
>> > - a virtio-config ("fence_destroy_threshold"), which controls num_fences in DESTROY_FENCES
>> >
>> > When fence_destroy_threshold == 0, this would be the current proposed solution (fence lifetimes implementation defined). However, a user can make "fence_destroy_threshold" 100 or something like that, to cause a VM-exit every time 100 fences have been destroyed.
>> >
>> > This attempts a tunable compromise between API purity and performance concerns. WDYT?
>>
>> Tbh, I think the current approach is cleaner from the PoV of what a
>> fence is.. it is just a seqno, it should not have a lifetime.
>>
>> BR,
>> -R
>>
>> >
>> >>
>> >>
>> >> But this is all implementation detail
>> >>
>> >> BR,
>> >> -R
>> >>
>> >> >
>> >> > Essentially, let's do our due diligence and verify the most important use case (gpu --> display) actually works.
>> >> >
>> >> >>
>> >> >>
>> >> >> BR,
>> >> >> -R
>> >> >>
>> >> >> >> --
>> >> >> >> Best regards,
>> >> >> >> Dmitry
>> >> >> >>

