

Subject: Re: [RFC PATCH v6] virtio-video: Add virtio video device specification


Hi Alexandre,

On 27.12.22 08:31, Alexandre Courbot wrote:
Hi Alexander,


On Tue, Dec 20, 2022 at 1:59 AM Alexander Gordeev
<alexander.gordeev@opensynergy.com> wrote:
Hello Alexandre,

Thanks for the update. Please check my comments below.
I'm new to the virtio video spec development, so I may lack some
historical perspective. I would appreciate pointers to older emails
explaining decisions that I might not understand; I hope to read through
all of them later. Overall I have a lot of experience in the video
domain and in virtio video device development in Opsy, so I hope that my
comments are relevant and useful.
Cornelia provided links to the previous versions (thanks!). Through
these revisions we tried different approaches, and the more we
progress the closer we are getting to the V4L2 stateful
decoder/encoder interface.

This is actually the point where I would particularly be interested in
having your feedback, since you probably have noticed the similarity.
What would you think about just using virtio as a transport for V4L2
ioctls (virtio-fs does something similar with FUSE), and having the
host emulate a V4L2 decoder or encoder device in place of this (long)
specification? I am personally starting to think this could be a
better and faster way to get us to a point where both spec and guest
drivers are merged. Moreover this would also open the way to support
other kinds of V4L2 devices like simple cameras - we would just need
to allocate new device IDs for these and would be good to go.

This probably means a bit more work on the device side, since this
spec is tailored for the specific video codec use-case and V4L2 is
more generic, but also less spec to maintain and more confidence that
things will work as we want in the real world. On the other hand, the
device would also become simpler by the fact that responses to
commands could not come out-of-order as they currently do. So at the
end of the day I'm not even sure this would result in a more complex
device.

Sorry for the delay. I tried to gather data about how the spec has
evolved in the old emails.

Well, on the one hand, mimicking v4l2 looks like an easy solution from
the virtio-video spec writing perspective. (But the implementers will
have to read the V4L2 API instead, AFAIU, which is probably longer...)

On the other hand, v4l2 has a lot of history. It started as a camera API
and gained codec support later, right? So it definitely has too much
stuff that is irrelevant for codecs. Here we have the option to design
from scratch, taking the best ideas from v4l2.

Also, I have concerns about the virtio-video spec development. This
seems like a big change. It seems to me that after so many discussions
and versions of the spec, the process should be converging on something
by now, but this is still a moving target...

There were arguments against adding camera support for security and
complexity reasons during discussions about virtio-video spec v1. Were
these concerns addressed somehow? Maybe I missed a follow-up discussion?


+\begin{lstlisting}
+/* Device */
+#define VIRTIO_VIDEO_CMD_DEVICE_QUERY_CAPS       0x100
+
+/* Stream */
+#define VIRTIO_VIDEO_CMD_STREAM_CREATE           0x200
+#define VIRTIO_VIDEO_CMD_STREAM_DESTROY          0x201
Is this gap in numbers intentional? It would be great to remove it to
simplify boundary checks.
This is to allow commands of the same family to stay close to one
another. I'm not opposed to removing the gap, it just means that
commands may end up being a bit all over the place if we extend the
protocol.

Actually there is a gap between 0x201 and 0x203. Sorry for not being
clear here.
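
What I mean by simplifying boundary checks is roughly this (just a
hypothetical sketch; VIRTIO_VIDEO_CMD_STREAM_MAX is a made-up marker
that is not in the patch):

#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_VIDEO_CMD_STREAM_CREATE  0x200 /* from the patch */
#define VIRTIO_VIDEO_CMD_STREAM_MAX     0x204 /* hypothetical: one past the last stream command */

/* With contiguous codes, a single range check validates any stream command;
 * with a hole, the device has to special-case the missing value or use a
 * switch over every code. */
static bool is_stream_cmd(uint32_t cmd_type)
{
        return cmd_type >= VIRTIO_VIDEO_CMD_STREAM_CREATE &&
               cmd_type < VIRTIO_VIDEO_CMD_STREAM_MAX;
}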


+
+\devicenormative{\subparagraph}{VIRTIO_VIDEO_CMD_STREAM_DRAIN}{Device Types / Video Device / Device Operation / Device Operation: Stream commands / VIRTIO_VIDEO_CMD_STREAM_DRAIN}
+
+Before the device sends the response, it MUST process and respond to all
+the VIRTIO\_VIDEO\_CMD\_RESOURCE\_QUEUE commands on the INPUT queue that
+were sent before the drain command, and make all the corresponding
+output resources available to the driver by responding to their
+VIRTIO\_VIDEO\_CMD\_RESOURCE\_QUEUE command.
Unfortunately I don't see many details about the OUTPUT queue. What if
the driver keeps queuing new output buffers, as it must, fast enough? It
looks like a valid implementation of the DRAIN command might never send
a response in this case, because the only thing it does is reply to
VIRTIO_VIDEO_CMD_RESOURCE_QUEUE commands on the OUTPUT queue. I think it
is better to specify what happens: the device should respond to a
certain number of OUTPUT queue commands until there is an end-of-stream
condition, and then respond to the DRAIN command. What happens to the
remaining queued output buffers is an open question to me: should they
be cancelled or not?
If I understand correctly this should not be a problem. Replies to
commands can come out-of-order, so the reply to DRAIN can come as soon
as the command is completed, regardless of how many output buffers we
have queued at that moment. The queued output buffers can also remain
queued in anticipation of the next sequence, if any: if it has the same
resolution as the previous one, the queued output buffers can be used;
if it doesn't, a resolution change event will be produced and the driver
will process it.

Ok, thanks, this makes sense to me.
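
On the device side I imagine the drain handling could then look roughly
like this (a pseudo-C sketch; all the types and helpers are made up for
illustration, nothing here is from the spec):

#include <stddef.h>

struct stream;
struct vcmd;

/* Assumed helpers of a hypothetical device implementation. */
struct vcmd *input_queue_pop(struct stream *s);          /* next pending INPUT RESOURCE_QUEUE command */
void decode_resource(struct stream *s, struct vcmd *c);  /* decodes; completes OUTPUT queue commands */
void complete_ok(struct vcmd *c);                        /* reply with VIRTIO_VIDEO_RESULT_OK */

static void handle_stream_drain(struct stream *s, struct vcmd *drain)
{
        struct vcmd *c;

        /* Finish every RESOURCE_QUEUE command received on the INPUT queue
         * before the drain and return the corresponding output resources. */
        while ((c = input_queue_pop(s)) != NULL) {
                decode_resource(s, c);
                complete_ok(c);
        }

        /* Replies may come out of order, so the DRAIN can complete here even
         * if output resources are still queued for the next sequence. */
        complete_ok(drain);
}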


+
+While the device is processing the command, it MUST return
+VIRTIO\_VIDEO\_RESULT\_ERR\_INVALID\_OPERATION to the
+VIRTIO\_VIDEO\_CMD\_STREAM\_DRAIN command.
Should the device stop accepting input too?
There should be no problem with the device accepting (and even
processing) input for the next sequence, as long as it doesn't make
its result available before the response to the DRAIN command.

Hmm, maybe it is worth adding this requirement to the spec. WDYT?


+
+If the command is interrupted due to a VIRTIO\_VIDEO\_CMD\_STREAM\_STOP
+or VIRTIO\_VIDEO\_CMD\_STREAM\_DESTROY operation, the device MUST
+respond with VIRTIO\_VIDEO\_RESULT\_ERR\_CANCELED.
+
+\paragraph{VIRTIO_VIDEO_CMD_STREAM_STOP}\label{sec:Device Types / Video Device / Device Operation / Device Operation: Stream commands / VIRTIO_VIDEO_CMD_STREAM_STOP}
+
I don't like that this command is called "stop". When I see a "stop"
command, I expect to see a "start" command as well. My personal
preference would be "flush" or "reset".
Fair enough, let me rename this to RESET (which was the name used in a
previous revision for a somehow-similar command).

Great.


+};
+\end{lstlisting}
+
+\begin{description}
+\item[\field{result}]
+is
+
+\begin{description}
+\item[VIRTIO\_VIDEO\_RESULT\_OK]
+if the operation succeeded,
+\item[VIRTIO\_VIDEO\_RESULT\_ERR\_INVALID\_STREAM\_ID]
+if the requested stream does not exist,
+\item[VIRTIO\_VIDEO\_RESULT\_ERR\_INVALID\_ARGUMENT]
+if the \field{param_type} argument is invalid for the device,
+\end{description}
+\item[\field{param}]
+is the value of the requested parameter, if \field{result} is
+VIRTIO\_VIDEO\_RESULT\_OK.
+\end{description}
+
+\drivernormative{\subparagraph}{VIRTIO_VIDEO_CMD_STREAM_GET_PARAM}{Device Types / Video Device / Device Operation / Device Operation: Stream commands / VIRTIO_VIDEO_CMD_STREAM_GET_PARAM}
+
+\field{cmd_type} MUST be set to VIRTIO\_VIDEO\_CMD\_STREAM\_GET\_PARAM
+by the driver.
+
+\field{stream_id} MUST be set to a valid stream ID previously returned
+by VIRTIO\_VIDEO\_CMD\_STREAM\_CREATE.
+
+\field{param_type} MUST be set to a parameter type that is valid for the
+device.
The device requirements are missing for GET_PARAMS.
There aren't any beyond returning the requested parameter or an error code.

Ok.


+};
+\end{lstlisting}
+
+Within \field{struct virtio_video_resource_sg_entry}:
+
+\begin{description}
+\item[\field{addr}]
+is a guest physical address to the start of the SG entry.
+\item[\field{length}]
+is the length of the SG entry.
+\end{description}
I think having explicit page alignment requirements here would be great.
This may be host-dependent, maybe we should have a capability field so
it can provide this information?

I mean, there is already a VIRTIO_VIDEO_F_RESOURCE_GUEST_PAGES feature
bit. This suggests that these addresses always point to pages, right?
If not, there is some inconsistency here IMO.

In our setup I think it is simply always the case that they are page
aligned. Non-page-aligned addresses would probably require copying on
the CPU on all our platforms. So I think, yes, there should be a way to
indicate (if not require) this.
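
If the spec required (or a capability advertised) such alignment, the
device-side check could be as simple as this (hypothetical sketch; the
struct below just mirrors the addr/length description from the patch,
and 4096 is only an example page size):

#include <stdbool.h>
#include <stdint.h>

#define GUEST_PAGE_SIZE 4096 /* example value; the real page size is platform-dependent */

/* Simplified stand-in for struct virtio_video_resource_sg_entry. */
struct sg_entry {
        uint64_t addr;   /* guest physical address of the SG entry */
        uint32_t length; /* length of the SG entry in bytes */
};

/* One possible rule when VIRTIO_VIDEO_F_RESOURCE_GUEST_PAGES is negotiated:
 * every entry starts on a page boundary and covers whole pages. */
static bool sg_entry_is_page_aligned(const struct sg_entry *e)
{
        return (e->addr % GUEST_PAGE_SIZE) == 0 &&
               (e->length % GUEST_PAGE_SIZE) == 0;
}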


+
+Finally, for \field{struct virtio_video_resource_sg_list}:
+
+\begin{description}
+\item[\field{num_entries}]
+is the number of \field{struct virtio_video_resource_sg_entry} instances
+that follow.
+\end{description}
+
+\field{struct virtio_video_resource_object} is defined as follows:
+
+\begin{lstlisting}
+struct virtio_video_resource_object {
+        u8 uuid[16];
+};
+\end{lstlisting}
+
+\begin{description}
+\item[uuid]
+is a version 4 UUID specified by \hyperref[intro:rfc4122]{[RFC4122]}.
+\end{description}
+
+The device responds with
+\field{struct virtio_video_resource_attach_backing_resp}:
+
+\begin{lstlisting}
+struct virtio_video_resource_attach_backing_resp {
+        le32 result; /* VIRTIO_VIDEO_RESULT_* */
+};
+\end{lstlisting}
+
+\begin{description}
+\item[\field{result}]
+is
+
+\begin{description}
+\item[VIRTIO\_VIDEO\_RESULT\_OK]
+if the operation succeeded,
+\item[VIRTIO\_VIDEO\_RESULT\_ERR\_INVALID\_STREAM\_ID]
+if the mentioned stream does not exist,
+\item[VIRTIO\_VIDEO\_RESULT\_ERR\_INVALID\_ARGUMENT]
+if \field{queue_type}, \field{resource_id}, or \field{resources} have an
+invalid value,
+\item[VIRTIO\_VIDEO\_RESULT\_ERR\_INVALID\_OPERATION]
+if the operation is performed at a time when it is non-valid.
+\end{description}
+\end{description}
+
+VIRTIO\_VIDEO\_CMD\_RESOURCE\_ATTACH\_BACKING can only be called during
+the following times:
+
+\begin{itemize}
+\item
+  AFTER a VIRTIO\_VIDEO\_CMD\_STREAM\_CREATE and BEFORE invoking
+  VIRTIO\_VIDEO\_CMD\_RESOURCE\_QUEUE for the first time on the
+  resource,
+\item
+  AFTER successfully changing the \field{virtio_video_params_resources}
+  parameter corresponding to the queue and BEFORE
+  VIRTIO\_VIDEO\_CMD\_RESOURCE\_QUEUE is called again on the resource.
+\end{itemize}
+
+This is to ensure that the device can rely on the fact that a given
+resource will always point to the same memory for as long as it may be
+used by the video device. For instance, a decoder may use returned
+decoded frames as reference for future frames and won't overwrite the
+backing resource of a frame that is being referenced. It is only before
+a stream is started and after a Dynamic Resolution Change event has
+occurred that we can be sure that all resources won't be used in that
+way.
The mentioned scenario about the referenced frames looks somewhat
reasonable, but I wonder how exactly that would work in practice.
Basically the guest needs to make sure the backing memory remains
available and unwritten until the conditions mentioned above are met.
Or is there anything unclear in this description?

Ok, I read the discussions about whether or not to allow the device to
have read access after responding to QUEUE. Since this comes from v4l2,
this should not be a problem, I think. I didn't know that v4l2 expects
user-space to never write to CAPTURE buffers after they are dequeued. I
wonder if this is enforced in drivers.


+        le32 stream_id;
+        le32 queue_type; /* VIRTIO_VIDEO_QUEUE_TYPE_* */
+        le32 resource_id;
+        le32 flags; /* Bitmask of VIRTIO_VIDEO_ENQUEUE_FLAG_* */
+        u8 padding[4];
+        le64 timestamp;
+        le32 data_sizes[VIRTIO_VIDEO_MAX_PLANES];
+};
+\end{lstlisting}
+
+\begin{description}
+\item[\field{stream_id}]
+is the ID of a valid stream.
+\item[\field{queue_type}]
+is the direction of the queue.
+\item[\field{resource_id}]
+is the ID of the resource to be queued.
+\item[\field{flags}]
+is a bitmask of VIRTIO\_VIDEO\_ENQUEUE\_FLAG\_* values.
+
+\begin{description}
+\item[\field{VIRTIO_VIDEO_ENQUEUE_FLAG_FORCE_KEY_FRAME}]
+The submitted frame is to be encoded as a key frame. Only valid for the
+encoder's INPUT queue.
+\end{description}
+\item[\field{timestamp}]
+is an abstract sequence counter that can be used on the INPUT queue for
+synchronization. Resources produced on the output queue will carry the
+\field{timestamp} of the input resource they have been produced from.
I think this is quite misleading. Implementers may assume that it is OK
to rely on a 1-to-1 mapping between input and output buffers with no
reordering, right? But this is usually not the case:

1. At the end of the spec, H.264 and HEVC are defined to always have a
single NAL unit per resource. But there are many types of NAL units that
do not represent any video data, like SEI NAL units or delimiters.

2. We may assume that the SEI and delimiter units are filtered out
before queuing, but there is still other codec-specific data that can't
be filtered, like SPS and PPS NAL units. There has to be some special
handling.

3. All of this means more codec-specific code in the driver or client
applications.

4. This spec says that the device may skip to a next key frame after a
seek. So the driver has to account for this too.

5. For example, in H.264 a single key frame may be coded as several NAL
units. In fact, all VCL NAL units are called slices because of this.
What happens when the decoder sees several NAL units with different
timestamps coding the same output frame? Which timestamp will it choose?
I'm not sure it is defined anywhere. Probably it will just take the
first timestamp. The driver/client applications have to be ready for
this too.

6. I saw almost the same scenario with CSD units too. Imagine an SPS
with timestamp 0, then a PPS with 1, and then an IDR with 2. These three
might be combined into a single input buffer by the vendor-provided
decoding software. The timestamp of the resulting frame is then
naturally 0. But the driver/client application does not expect any
response with timestamps 0 and 1, because they are known to belong to
CSD, and it expects an output buffer with timestamp 2. So there will be
a problem. (This is actually a real-world example.)

7. Then there is H.264 High profile, for example. It has different
decoding and presentation orders because frames may depend on future
frames. I think all modern codecs have a mode like this. The input
frames are usually provided in decoding order. Should the output frames'
timestamps just be copied from the input frames they have been produced
from, as the paragraph above says? That resembles decoding order then.
Well, this can work if the container has correct DTS and PTS and the
client software creates a mapping between these timestamps and the
virtio video timestamp. But this is not always the case: for example, a
plain H.264 bitstream doesn't have any timestamps, and it can still
easily be played by ffmpeg/gstreamer/VLC/etc. There is no way to make
this work with a decoder following this spec, I think.

My suggestion is not to think of the timestamp as an abstract counter,
but to give some freedom to the device by providing whatever information
is available from the container, be it DTS, PTS or only FPS (through
PARAMS). Also, the input and output queues should indeed be completely
separate. There should be no assumption of a 1-to-1 mapping of buffers.
The beginning of the "Device Operation" section tries to make it clear
that the input and output queues are operating independently and that
no mapping or ordering should be expected by the driver, but maybe
this is worth repeating here.

Regarding the use of timestamp, a sensible use would indeed be for the
driver to set it to some meaningful information retrieved from the
container (which the driver would itself obtain from user-space),
probably the PTS if that is available. In the case of H.264 non-VCL
NAL units would not produce any output, so their timestamp would
effectively be ignored. For frames that are made of several slices,
the first timestamp should be the one propagated to the output frame.
(and this here is why I prefer VP8/VP9 ^_^;)

Did they manage to avoid the same thing with VP9 SVC? :)

The phrase "Resources produced on the output queue will carry the
\field{timestamp} of the input resource they have been produced from."
still sounds misleading to me. It doesn't cover for all these cases of
no 1 to 1 mapping. Also what if there are timestamps for some of the
frames, but not for all?


In fact most users probably won't care about this field. In the worst
case, even if no timestamp is available, operation can still be done
reliably since decoded frames are made available in presentation
order. This fact was not obvious in the spec, so I have added a
sentence in the "Device Operation" section to clarify.

I hope this answers your concerns, but please let me know if I didn't
address something in particular.

Indeed the order of output frames was not obvious from the spec. I think
there might be use cases where you want the decoded frames as early as
possible, like when you have to transmit the frames over some (slow)
medium. If the decoder outputs in presentation order, the frames might
come out in batches, which is not good for latency. WDYT?
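
And coming back to the timestamp question: just to make the mapping idea
discussed above more concrete, here is a rough client-side sketch
(entirely hypothetical, nothing from the spec). When the container has
no usable timestamps, the client can feed a monotonically increasing
counter as the virtio timestamp and keep a small table that translates
returned output timestamps back to whatever presentation information it
has:

#include <stdint.h>

#define TS_TABLE_SIZE 64

struct ts_entry {
        uint64_t virtio_ts;     /* value written into the INPUT resource */
        int64_t  container_pts; /* -1 if the container provided none */
};

struct ts_table {
        struct ts_entry entries[TS_TABLE_SIZE];
        uint64_t next_ts;
        unsigned int count;
};

/* Called when queuing an input resource; returns the timestamp to use. */
static uint64_t ts_table_push(struct ts_table *t, int64_t container_pts)
{
        struct ts_entry *e = &t->entries[t->count % TS_TABLE_SIZE];

        e->virtio_ts = ++t->next_ts;    /* never 0, so 0 can mean "unused" */
        e->container_pts = container_pts;
        t->count++;
        return e->virtio_ts;
}

/* Called when an output resource is dequeued with a given timestamp. */
static int64_t ts_table_lookup(const struct ts_table *t, uint64_t virtio_ts)
{
        for (unsigned int i = 0; i < TS_TABLE_SIZE; i++)
                if (t->entries[i].virtio_ts == virtio_ts)
                        return t->entries[i].container_pts;
        return -1; /* e.g. CSD-only input that never produced a frame */
}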


+\item[\field{planes}]
+is the format description of each individual plane making this format.
+The number of planes is dependent on the \field{fourcc} and detailed in
+\ref{sec:Device Types / Video Device / Supported formats / Image formats}.
+
+\begin{description}
+\item[\field{buffer_size}]
+is the minimum size of the buffers that will back resources to be
+queued.
+\item[\field{stride}]
+is the distance in bytes between two lines of data.
+\item[\field{offset}]
+is the starting offset for the data in the buffer.
It is not quite clear to me how the offset is to be used during
SET_PARAMS. I think it is much more reasonable to have per-plane offsets
in struct virtio_video_resource_queue and struct
virtio_video_resource_queue_resp.
This is supposed to describe where in a given buffer the host can find
the beginning of a given plane (mostly useful for multi-planar/single
buffer formats). This typically does not change between frames, so
having it as a parameter seems appropriate to me?

The plane sizes don't change either, right? I think it is just the usual
way to put the plane offsets and sizes together. I saw this pattern in
gstreamer, and I think in DRM and V4L2 as well. For me it is quite
reasonable.
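
Just to illustrate what I have in mind, something like this (written in
the lstlisting notation of the patch; only the data_offsets field is
new, and its name is made up):

struct virtio_video_resource_queue {
        le32 stream_id;
        le32 queue_type;   /* VIRTIO_VIDEO_QUEUE_TYPE_* */
        le32 resource_id;
        le32 flags;        /* Bitmask of VIRTIO_VIDEO_ENQUEUE_FLAG_* */
        u8 padding[4];
        le64 timestamp;
        le32 data_sizes[VIRTIO_VIDEO_MAX_PLANES];
        le32 data_offsets[VIRTIO_VIDEO_MAX_PLANES]; /* new: per-plane offsets into the buffer */
};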


+encode at the requested format and resolution.
It is not defined when changing these parameters is allowed. Also, there
is an issue: changing width, height, format, or buffer_size should
probably detach all the currently attached buffers, but changing the
crop shouldn't affect the output buffers in any way, right? So maybe it
is better to split them?
If the currently attached buffers are large enough to support the new
format, there should not be any need to detach them (if they are not,
the SET_PARAM command should fail). So even if we only change the
crop, the device can perform the full validation on the format and
keep going with the current buffers if possible.

Indeed the timing for setting this parameter should be better defined.
In particular the input format for a decoder (or output format for an
encoder) will probably remain static through the session.

Ok.


+\item[\field{YU12}]
+one Y plane followed by one Cb plane, followed by one Cr plane, in a
+single buffer. 4:2:0 subsampling.
+\item[\field{YM12}]
+same as \field{YU12} but using three separate buffers for the Y, U and V
+planes.
+\end{description}
These look like V4L2 formats. Maybe add a V4L2 reference? At least the
V4L2 documentation has a nice description of the exact plane layouts.
Otherwise it would be nice to have these layouts in the spec IMO.
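
For reference, the tightly packed layout I would expect for YU12 is
something like this (a sketch assuming stride == width and no padding
between planes, which real devices may not allow; that is exactly what
the stride/offset plane parameters are for):

#include <stdint.h>

/* YU12 (I420): a full-resolution Y plane followed by quarter-resolution Cb
 * and Cr planes in a single buffer, 4:2:0 subsampling. */
struct yu12_layout {
        uint32_t y_offset, cb_offset, cr_offset;
        uint32_t total_size;
};

static struct yu12_layout yu12_tight_layout(uint32_t width, uint32_t height)
{
        uint32_t y_size = width * height;
        uint32_t c_size = (width / 2) * (height / 2);

        struct yu12_layout l = {
                .y_offset   = 0,
                .cb_offset  = y_size,
                .cr_offset  = y_size + c_size,
                .total_size = y_size + 2 * c_size,
        };
        return l;
}

YM12 would use the same three planes, just in three separate buffers.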
I've linked to the relevant V4L2 pages, indeed they describe the
formats and layouts much better.

Thanks for all the feedback. We can continue on this basis, or I can
try to build a small prototype of that V4L2-over-virtio idea if you
agree this looks like a good idea. The guest driver would mostly be
forwarding the V4L2 ioctls as-is to the host, it would be interesting
to see how small we can make it with this design.

Let's discuss the idea.


--
Alexander Gordeev
Senior Software Engineer

OpenSynergy GmbH
Rotherstr. 20, 10245 Berlin

Phone: +49 30 60 98 54 0 - 88
Fax: +49 (30) 60 98 54 0 - 99
EMail: alexander.gordeev@opensynergy.com

www.opensynergy.com

Handelsregister/Commercial Registry: Amtsgericht Charlottenburg, HRB 108616B
Geschäftsführer/Managing Director: Régis Adjamah



