Subject: Re: [virtio-dev] Re: [RFC PATCH v6] virtio-video: Add virtio video device specification


Hi Alexandre,

On 12.01.23 07:39, Alexandre Courbot wrote:
On Thu, Jan 12, 2023 at 3:42 AM Alexander Gordeev
<alexander.gordeev@opensynergy.com> wrote:
Hi Alexandre,

On 27.12.22 08:31, Alexandre Courbot wrote:
Hi Alexander,


Cornelia provided links to the previous versions (thanks!). Through
these revisions we tried different approaches, and the more we
progress the closer we are getting to the V4L2 stateful
decoder/encoder interface.

This is actually the point where I would particularly be interested in
having your feedback, since you probably have noticed the similarity.
What would you think about just using virtio as a transport for V4L2
ioctls (virtio-fs does something similar with FUSE), and having the
host emulate a V4L2 decoder or encoder device in place of this (long)
specification? I am personally starting to think this could be a
better and faster way to get us to a point where both spec and guest
drivers are merged. Moreover this would also open the way to support
other kinds of V4L2 devices like simple cameras - we would just need
to allocate new device IDs for these and would be good to go.

This probably means a bit more work on the device side, since this
spec is tailored for the specific video codec use-case and V4L2 is
more generic, but also less spec to maintain and more confidence that
things will work as we want in the real world. On the other hand, the
device would also become simpler by the fact that responses to
commands could not come out-of-order as they currently do. So at the
end of the day I'm not even sure this would result in a more complex
device.
Sorry for the delay. I tried to gather data about how the spec has
evolved in the old emails.
It has been a bit all over the place as we tried different approaches,
sorry about that. >_<

No worries, this is totally understandable.


Well, on the one hand mimicking v4l2 looks like an easy solution from a
virtio-video spec writing perspective. (But the implementers will have
to read the V4L2 API instead AFAIU, which is probably longer...)
It should not necessarily be much longer as the parts we are
interested in have their own dedicated pages:

https://docs.kernel.org/userspace-api/media/v4l/dev-decoder.html
https://docs.kernel.org/userspace-api/media/v4l/dev-encoder.html

Besides, the decoding and encoding processes are described with more
precision, not that we couldn't do that here but it would make the
spec grow longer than I am comfortable with...

I read the references carefully, thanks. I am somewhat familiar with the
stateful decoder API, but the stateless one still needs exploring.

There is one serious issue with these references IMO: they represent
guest user-space <-> v4l2 subsystem API, not v4l2 subsystem <->
virtio-video driver API. Just to make sure we're on the same page:

guest user-space <-> v4l2 kernel subsystem <-> virtio-video driver <-
virtio-video protocol -> virtio-video device.

I believe this is how it is supposed to work, right? So I thought that
your intention was to simplify the virtio-video driver and virtio-video
protocol by reusing the v4l2 subsystem <-> v4l2 driver API. But given
these references I can assume that you want to use the user-space <->
v4l2 subsystem API, right? Well, I think this cannot happen, and
therefore these references cannot be used directly, unless:

1. You suggest that the virtio-video driver should not use the v4l2
subsystem, but should mimic its user-space API in every detail. Probably
not a good idea.

2. There is already a way to bypass the subsystem completely. I'm not
aware of one.

3. The user-space <-> v4l2 subsystem API is already the same as, or
very close to, the v4l2 subsystem <-> v4l2 driver API. I believe this is
not the case even with the stateful decoder/encoder, and even less so
with stateless decoders, because I can see that the v4l2 subsystem
actually stores some state in this case as well. Which is quite
reasonable, I think.

So I think what we need to reference here is the v4l2 subsystem <->
v4l2 driver API. Do you have such a reference? Well, I know there is
some documentation, but still I have my doubts. AFAIR kernel-internal
APIs are never fixed. Right?

Besides that, I can see that this version of the spec indeed comes
closer to the v4l2 user-space API, but it doesn't borrow all the legacy.
I think this is fine. It is definitely more lightweight. I like that.

On the other hand v4l2 has a lot of history. It started as a camera API
and gained the codec support later, right? So it definitely has just too
much stuff irrelevant for codecs. Here we have an option to design from
scratch, taking the best ideas from v4l2.
That's also what we were thinking initially, but as we try to
implement our new and optimized designs, we end up hitting a wall and
redoing things like V4L2 did. There are small exceptions, like how
STOP/RESET is implemented here, which is slightly simpler than the V4L2
equivalent, but I don't think these justify reinventing the remaining
95% quasi-identically.

I like this simplicity. If the user-space API can be reused, and this
makes things easier somehow, then maybe you're right. So I am looking
forward to your response to my comment above.


V4L2 supports much more than video codecs, but if you want to
implement a decoder you don't need to support anything more than what
the decoder spec says you should. And that subset happens to map very
well to the decoder use-case - it's not like the V4L2 folks tried to
shoehorn codecs into something that is inadequate for them.

Actually I think that having only a single timestamp might be the result
of the evolution from an API for cameras, where you don't have this
tricky ordering stuff.


Also I have concerns about the virtio-video spec development. This seems
like a big change. It seems to me that after so many discussions and
versions of the spec, the process should be converging on something by
now. But this is still a moving target...
I agree and apologize for the slow progress of this project, but let's
not fall for the sunk cost fallacy if it turns out the
V4L2-over-virtio solution fits the bill better and for less effort.

Ok. I just think that at this point this has to be very well justified.


There were arguments against adding camera support for security and
complexity reasons during discussions about virtio-video spec v1. Were
these concerns addressed somehow? Maybe I missed a followup discussion?
The conclusion was that cameras should be their own specification as
the virtio-video spec is too specialized for the codec use-case. There
is actually an ongoing project for this:

https://gitlab.collabora.com/collabora/virtio-camera

... which states in its README: "For now it is almost directly based
on V4L2 Linux driver UAPI."

That makes me think: if virtio-video is going to resemble V4L2
closely, and virtio-camera ends up heading in the same direction, why
don't we just embrace the underlying reality that we are reinventing
V4L2?

As I wrote earlier, I only welcome taking the best ideas from V4L2. And
as I wrote above, AFAIU we can't reinvent, and we're not reinventing,
the V4L2 UAPI.


+
+Finally, for \field{struct virtio_video_resource_sg_list}:
+
+\begin{description}
+\item[\field{num_entries}]
+is the number of \field{struct virtio_video_resource_sg_entry} instances
+that follow.
+\end{description}
+
+\field{struct virtio_video_resource_object} is defined as follows:
+
+\begin{lstlisting}
+struct virtio_video_resource_object {
+        u8 uuid[16];
+};
+\end{lstlisting}
+
+\begin{description}
+\item[\field{uuid}]
+is a version 4 UUID specified by \hyperref[intro:rfc4122]{[RFC4122]}.
+\end{description}
+
+The device responds with
+\field{struct virtio_video_resource_attach_backing_resp}:
+
+\begin{lstlisting}
+struct virtio_video_resource_attach_backing_resp {
+        le32 result; /* VIRTIO_VIDEO_RESULT_* */
+};
+\end{lstlisting}
+
+\begin{description}
+\item[\field{result}]
+is
+
+\begin{description}
+\item[VIRTIO\_VIDEO\_RESULT\_OK]
+if the operation succeeded,
+\item[VIRTIO\_VIDEO\_RESULT\_ERR\_INVALID\_STREAM\_ID]
+if the mentioned stream does not exist,
+\item[VIRTIO\_VIDEO\_RESULT\_ERR\_INVALID\_ARGUMENT]
+if \field{queue_type}, \field{resource_id}, or \field{resources} have an
+invalid value,
+\item[VIRTIO\_VIDEO\_RESULT\_ERR\_INVALID\_OPERATION]
+if the operation is performed at a time when it is not valid.
+\end{description}
+\end{description}
+
+VIRTIO\_VIDEO\_CMD\_RESOURCE\_ATTACH\_BACKING can only be called during
+the following times:
+
+\begin{itemize}
+\item
+  AFTER a VIRTIO\_VIDEO\_CMD\_STREAM\_CREATE and BEFORE invoking
+  VIRTIO\_VIDEO\_CMD\_RESOURCE\_QUEUE for the first time on the
+  resource,
+\item
+  AFTER successfully changing the \field{virtio_video_params_resources}
+  parameter corresponding to the queue and BEFORE
+  VIRTIO\_VIDEO\_CMD\_RESOURCE\_QUEUE is called again on the resource.
+\end{itemize}
+
+This is to ensure that the device can rely on the fact that a given
+resource will always point to the same memory for as long as it may be
+used by the video device. For instance, a decoder may use returned
+decoded frames as reference for future frames and won't overwrite the
+backing resource of a frame that is being referenced. It is only before
+a stream is started and after a Dynamic Resolution Change event has
+occurred that we can be sure that all resources won't be used in that
+way.
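
If I read this rule correctly, it allows a sequence like the following
(a minimal sketch in pseudocode; the helper names are mine, only the
VIRTIO_VIDEO_* names come from the spec):

/* Hypothetical driver-side flow; helper names are illustrative only. */
create_stream(s);                      /* VIRTIO_VIDEO_CMD_STREAM_CREATE */
attach_backing(s, res, sgl);           /* OK: before the first QUEUE */
queue_resource(s, res);                /* VIRTIO_VIDEO_CMD_RESOURCE_QUEUE */
attach_backing(s, res, sgl);           /* VIRTIO_VIDEO_RESULT_ERR_INVALID_OPERATION */
/* ... Dynamic Resolution Change event, then: */
set_params(s, &new_resources_params);  /* virtio_video_params_resources */
attach_backing(s, res, new_sgl);       /* OK again, until the next QUEUE */
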
The mentioned scenario about the referenced frames looks somewhat
reasonable, but I wonder how exactly that would work in practice.
Basically the guest needs to make sure the backing memory remains
available and unwritten until the conditions mentioned above are met.
Or is there anything unclear in this description?
Ok, I read the discussions about whether to allow the device to have
read access after a response to QUEUE or not. Since this comes from
v4l2, this should not be a problem, I think. I didn't know that v4l2
expects user-space to never write to CAPTURE buffers after they are
dequeued. I wonder if it is enforced in drivers.

Actually I have more on this. I'd prefer it if there were a requirement
for the driver or the device to handle the case when a still-referenced
frame is queued again, so that it is less likely to be forgotten.
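
For example, on the device side (a minimal sketch; all types and names
are mine, nothing here comes from the spec):

/* Hypothetical device-side check: refuse to hand an output resource
 * back to the decoder for writing while it is still used as a
 * reference frame. All types and names are illustrative only. */
#include <stdbool.h>
#include <stdint.h>

struct output_resource {
        uint32_t id;
        uint32_t refcount;  /* > 0 while used as a reference frame */
};

static bool can_queue_output(const struct output_resource *res)
{
        /* Queuing gives the decoder write access to the buffer, which
         * must not happen while the frame is still referenced. */
        return res->refcount == 0;
}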


+        le32 stream_id;
+        le32 queue_type; /* VIRTIO_VIDEO_QUEUE_TYPE_* */
+        le32 resource_id;
+        le32 flags; /* Bitmask of VIRTIO_VIDEO_ENQUEUE_FLAG_* */
+        u8 padding[4];
+        le64 timestamp;
+        le32 data_sizes[VIRTIO_VIDEO_MAX_PLANES];
+};
+\end{lstlisting}
+
+\begin{description}
+\item[\field{stream_id}]
+is the ID of a valid stream.
+\item[\field{queue_type}]
+is the direction of the queue.
+\item[\field{resource_id}]
+is the ID of the resource to be queued.
+\item[\field{flags}]
+is a bitmask of VIRTIO\_VIDEO\_ENQUEUE\_FLAG\_* values.
+
+\begin{description}
+\item[\field{VIRTIO_VIDEO_ENQUEUE_FLAG_FORCE_KEY_FRAME}]
+The submitted frame is to be encoded as a key frame. Only valid for the
+encoder's INPUT queue.
+\end{description}
+\item[\field{timestamp}]
+is an abstract sequence counter that can be used on the INPUT queue for
+synchronization. Resources produced on the output queue will carry the
+\field{timestamp} of the input resource they have been produced from.
I think this is quite misleading. Implementers may conclude that it is
ok to assume a 1-to-1 mapping between input and output buffers and no
reordering, right? But this is usually not the case:

1. At the end of the spec, H.264 and HEVC are defined to always have a
single NAL unit per resource. Well, there are many types of NAL units
that do not represent any video data, like SEI NAL units or delimiters.

2. We may assume that the SEI and delimiter units are filtered out
before queuing, but there is still other codec-specific data that can't
be filtered, like SPS and PPS NAL units. There has to be some special
handling.

3. All of this means more codec-specific code in the driver or client
applications.

4. This spec says that the device may skip to the next key frame after
a seek. So the driver has to account for this too.

5. For example, in H.264 a single key frame may be coded by several NAL
units. In fact, all VCL NAL units are called slices because of this. What
happens when the decoder sees several NAL units with different
timestamps coding the same output frame? Which timestamp will it choose?
I'm not sure it is defined anywhere. Probably it will just take the
first timestamp. The driver/client applications have to be ready for
this too.

6. I saw almost the same scenario with CSD units too. Imagine an SPS
with timestamp 0, then a PPS with 1, and then an IDR with 2. These three
might be combined into a single input buffer by the vendor-provided
decoding software. Then the timestamp of the resulting frame is
naturally 0. But the driver/client application doesn't expect to get any
response with timestamps 0 and 1, because they are known to belong to
CSD, and it expects an output buffer with ts 2. So there will be a
problem. (This is a real-world example actually.)

7. Then there is H.264 High profile, for example. It has different
decoding and presentation order because frames may depend on future
frames. I think all the modern codecs have a mode like this. The input
frames are usually provided in decoding order. Should the output
frames' timestamps just be copied from the input frames they have been
produced from, as the paragraph above says? Then this resembles decoding
order. Well, this can work if the container has correct DTS and PTS, and
the client software creates a mapping between these timestamps and the
virtio-video timestamp. But this is not always the case. For example, a
simple H.264 bitstream doesn't have any timestamps, and still it can be
easily played by ffmpeg/gstreamer/VLC/etc. There is no way to make this
work with a decoder following this spec, I think.

My suggestion is to not think about the timestamp as an abstract
counter, but to give some freedom to the device by providing the
available information from the container, be it DTS, PTS or only FPS
(through PARAMS); see the sketch below. Also, the input and output
queues should indeed be completely separated. There should be no
assumption of a 1-to-1 mapping of buffers.
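
For example, something like this in the spec's struct style (entirely
hypothetical; the field names are invented and nothing like this exists
in the spec today):

/* Hypothetical richer timestamp information passed with each input
 * resource; which fields are valid would be signaled via flags. */
struct virtio_video_timestamp {
        le64 pts;   /* presentation timestamp from the container, if any */
        le64 dts;   /* decoding timestamp from the container, if any */
        le32 flags; /* bitmask telling which fields above are valid */
        u8 padding[4];
};
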
The beginning of the "Device Operation" section tries to make it clear
that the input and output queues are operating independently and that
no mapping or ordering should be expected by the driver, but maybe
this is worth repeating here.

Regarding the use of timestamp, a sensible use would indeed be for the
driver to set it to some meaningful information retrieved from the
container (which the driver would itself obtain from user-space),
probably the PTS if that is available. In the case of H.264, non-VCL
NAL units would not produce any output, so their timestamp would
effectively be ignored. For frames that are made of several slices,
the first timestamp should be the one propagated to the output frame.
(and this here is why I prefer VP8/VP9 ^_^;)
Did they manage to avoid the same thing with VP9 SVC? :)

The phrase "Resources produced on the output queue will carry the
\field{timestamp} of the input resource they have been produced from."
still sounds misleading to me. It doesn't cover all these cases where
there is no 1-to-1 mapping. Also, what if there are timestamps for some
of the frames, but not for all?
This shouldn't matter - a timestamp of 0 is still a timestamp and will
be carried over to the corresponding frames.

I really like the table of all possible cases from the reference that
you provided above:
https://docs.kernel.org/userspace-api/media/v4l/dev-decoder.html#decoding

I'd prefer to have it here as well. Maybe it could be shorter.


In fact most users probably won't care about this field. In the worst
case, even if no timestamp is available, operation can still be done
reliably since decoded frames are made available in presentation
order. This fact was not obvious in the spec, so I have added a
sentence in the "Device Operation" section to clarify.

I hope this answers your concerns, but please let me know if I didn't
address something in particular.
Indeed the order of output frames was not obvious from the spec. I
think there might be use-cases when you want the decoded frames as early
as possible, like when you have to transmit the frames over some (slow)
medium. If the decoder outputs in presentation order, the frames might
come out in batches. This is not good for latency then. WDYT?
Who would be in charge of reordering then? If that burden falls to the
guest user-space, then it probably wants to use a stateless API.
That's not something covered by this spec (and covering it would
require adding many more per-codec structures for SPS/PPS, VP8
headers, etc.), but can be supported with V4L2 FWIW. Supporting this
API however would add dozens more pages just to document the
codec-specific structures necessary to decode a frame. See for
instance what would be needed for H.264:
https://www.kernel.org/doc/html/v5.5/media/uapi/v4l/ext-ctrls-codec.html#c.v4l2_ctrl_h264_sps.

Or a client that *really* wants decoding order for latency reasons
could hack the stream a bit to change the presentation order and
perform QUEUE/DRAIN sequences for each frame. That would not be more
complex than supporting a stateless API anyway.

I agree to some extent. It is just that I see this as a limitation of
the stateful V4L2 UAPI. If we consider the virtio-video V4L2 driver as
the only virtio-video driver implementation out there, then yes, we can
safely
bake this limitation in. If V4L2 APIs change, the spec can be updated
too. If other implementations are expected and they need the decoding
order, then this might be a problem. Reordering could be done in the
virtio-video V4L2 driver. Anyway, for me this is a low-priority issue
for now.


+\item[\field{YU12}]
+one Y plane followed by one Cb plane, followed by one Cr plane, in a
+single buffer. 4:2:0 subsampling.
+\item[\field{YM12}]
+same as \field{YU12} but using three separate buffers for the Y, U and V
+planes.
+\end{description}
These look like V4L2 formats. Maybe add a V4L2 reference? At least the
V4L2 documentation has a nice description of exact plane layouts.
Otherwise it would be nice to have these layouts in the spec IMO.
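
For YU12, the layout I mean is the following (a quick sketch, ignoring
any driver-specific padding; YM12 has the same three planes, just in
three separate buffers):

/* Plane layout of a YU12 (single-buffer 4:2:0) frame of width x height.
 * Offsets are relative to the start of the buffer. */
#include <stddef.h>
#include <stdint.h>

struct yu12_layout {
        size_t y_offset, y_size;   /* full-resolution luma plane */
        size_t cb_offset, cb_size; /* quarter-resolution Cb plane */
        size_t cr_offset, cr_size; /* quarter-resolution Cr plane */
};

static struct yu12_layout yu12_plane_layout(uint32_t width, uint32_t height)
{
        struct yu12_layout l;
        l.y_offset  = 0;
        l.y_size    = (size_t)width * height;
        l.cb_offset = l.y_offset + l.y_size;
        l.cb_size   = (size_t)(width / 2) * (height / 2);
        l.cr_offset = l.cb_offset + l.cb_size;
        l.cr_size   = l.cb_size;
        return l;
}
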
I've linked to the relevant V4L2 pages; indeed they describe the
formats and layouts much better.

Thanks for all the feedback. We can continue on this basis, or I can
try to build a small prototype of that V4L2-over-virtio idea if you
agree this looks like a good idea. The guest driver would mostly be
forwarding the V4L2 ioctls as-is to the host, it would be interesting
to see how small we can make it with this design.
Let's discuss the idea.
Let me try to summarize the case for using V4L2 over Virtio (I'll call
it virtio-v4l2 to differentiate it from the current spec).

There is the argument that virtio-video turns out to be a recreation
of the stateful V4L2 decoder API, which itself works similarly to
other high-level decoder APIs. So it's not like we could or should come
up with something very different. In parallel, virtio-camera is also
currently using V4L2 as its model. While this is subject to change, I
am starting to see a pattern here. :)

Transporting V4L2 over virtio would considerably shorten this spec, as
we would just need to care about the transport aspect
and minor amendments to the meaning of some V4L2 structure members,
and leave the rest to V4L2 which is properly documented and for which
there is a large collection of working examples.
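
To make the transport aspect concrete, a forwarded ioctl could look
roughly like this on the virtqueue, in the spec's struct style
(everything below is hypothetical, nothing is settled yet):

/* Hypothetical virtio-v4l2 request/response layout, loosely modeled on
 * how virtio-fs forwards FUSE requests. All names and fields are
 * illustrative only. */
struct virtio_v4l2_ioctl_req {
        le32 code;        /* the VIDIOC_* ioctl code being forwarded */
        le32 payload_len; /* size of the ioctl argument that follows */
        /* followed by payload_len bytes: the ioctl argument struct,
         * with guest pointers replaced by references to shared
         * resources */
};

struct virtio_v4l2_ioctl_resp {
        le32 result;      /* 0 on success or a negative errno value */
        le32 payload_len;
        /* followed by the (possibly updated) ioctl argument */
};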

This would work very well for codec devices, but as a side-effect
would also enable other kinds of devices that may be useful to
virtualize, like image processors, DVB cards, and cameras. This
doesn't mean virtio-v4l2 should be the *only* way to support cameras
over virtio. It is a nice bonus of encapsulating V4L2: it may be
sufficient for simple (most?) use-cases, but it also doesn't forbid more
specialized virtual devices for complex camera pipelines to be added
later. virtio-v4l2 would just be the generic virtual video device that
happens to be sufficient for our accelerated video needs - and if your
host camera is a USB UVC one, well, feel free to use that too.

In other words, I see an opportunity to enable a whole class of
devices instead of a single type for the same effort and think we
should seriously consider this.

I have started to put down what a virtio-v4l2 transport might look
like, and am also planning on putting together a small
proof-of-concept. If I can get folks here to warm up to the idea, I
believe we should be able to share a spec and prototype in a month or
so.

Thanks for the detailed explanation. Please check my comments above. I'd
like to resolve the mentioned issue first.


--
Alexander Gordeev
Senior Software Engineer

OpenSynergy GmbH
Rotherstr. 20, 10245 Berlin

Phone: +49 30 60 98 54 0 - 88
Fax: +49 (30) 60 98 54 0 - 99
EMail: alexander.gordeev@opensynergy.com

www.opensynergy.com

Handelsregister/Commercial Registry: Amtsgericht Charlottenburg, HRB 108616B
Geschäftsführer/Managing Director: Régis Adjamah



