OASIS Mailing List ArchivesView the OASIS mailing list archive below
or browse/search using MarkMail.

 


Help: OASIS Mailing Lists Help | MarkMail Help

virtio-dev message

[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]


Subject: Re: [virtio-dev] [PATCH v4] vsock: add vsock device


Hi Stefan,

On Wed, 2016-03-16 at 10:53 +0000, Stefan Hajnoczi wrote:
> The virtio vsock device is a zero-configuration socket communications
> device.  It is designed as a guest<->host management channel suitable
> for communicating with guest agents.

I'm interested in using this for a project I'm working on so I've been
prototyping a userspace backend (for a non-Linux host end) which pumps
the data over a Unix domain socket. I've got some feedback and
questions on the spec (some of which might be down to my relative
naivety regarding virtio).

For the frontend I've been using your Linux tree from githib vsock
branch (at commit 563d2a770dfa). I'm mostly using those patches
backported to 4.1.19, but I've also run your exact tree and your vsock-
nfs branch with the same behaviours I mention below.

The Linux .ko and various things under sysfs end up called "transport-
virtio" (or s/-/_/g) rather than e.g. "vsock-transport-virtio", which
would be less generic sounding.

[...]

> diff --git a/trunk/content.tex b/trunk/content.tex
> index d989d98..4e9f165 100644
> --- a/trunk/content.tex
> +++ b/trunk/content.tex
> @@ -5641,6 +5641,223 @@ descriptor for the \field{sense_len}, \field{residual},
>  \field{status_qualifier}, \field{status}, \field{response} and
>  \field{sense} fields.
> 
> +\section{Socket Device}\label{sec:Device Types / Socket Device}
> +
> +The virtio socket device is a zero-configuration socket communications device.
> +It facilitates data transfer between the guest and device without using the
> +Ethernet or IP protocols.
> +
> +\subsection{Device ID}\label{sec:Device Types / Socket Device / Device ID}
> +  13

The SVN version of the virtio spec seems to include many more device id
allocations, in particular in the table at the start of Section 5
("Device Types") it assigns 13 as "memory balloon" (not the traditional
one) and 5.5 ("Traditional Memory Balloon Device") says:
    This is the traditional balloon device. The device number 13 is
    reserved for a new memory balloon interface, with different
    semantics, which is expected in a future version of the standard.

[...]
> +The ctrl virtqueue is reserved for future use and is currently unused.

Is the normal way to reflect this in a driver backend implementation by
exposing a small (perhaps even zero sized) queue?

> +
> +\subsection{Feature bits}\label{sec:Device Types / Socket Device /
> Feature bits}
> +
> +\begin{description}
> +There are currently no feature bits defined for this device.
> +\end{description}
> +
> +\subsection{Device configuration layout}\label{sec:Device Types /
> Socket Device / Device configuration layout}
> +
> +\begin{lstlisting}
> +struct virtio_vsock_config {
> + le32 guest_cid;
> +};
> +\end{lstlisting}
> +
> +The \field{guest_cid} field contains the guest's context ID, which uniquely
> +identifies the device for its lifetime.

Are there any CID values which are reserved or invalid, perhaps 0 or
~0U? Likewise for port numbers. I think there is no broadcast nor
routing so perhaps no need for any reserved space?

> +
> +\subsection{Device Initialization}\label{sec:Device Types / Socket
> Device / Device Initialization}
> +
> +\begin{enumerate}
> +\item The guest's cid is read from \field{guest_cid}.
> +
> +\item Buffers are added to the event virtqueue to receive events from
> the device.
> +
> +\item Buffers are added to the rx virtqueue to start receiving packets.
> +\end{enumerate}
> +
> +\subsection{Device Operation}\label{sec:Device Types / Socket Device
> / Device Operation}
> +
> +Packets transmitted or received contain a header before the payload:
> +
> +\begin{lstlisting}
> +struct virtio_vsock_hdr {
> + le32 src_cid;
> + le32 src_port;
> + le32 dst_cid;
> + le32 dst_port;
> + le32 len;
> + le16 type;
> + le16 op;
> + le32 flags;
> + le32 buf_alloc;
> + le32 fwd_cnt;
> +};
> +\end{lstlisting}

I'm interested in which fields are valid for which ops. For example I
think len can only be validly non-zero for VIRTIO_VSOCK_OP_RW, is that
true? Is it an error to use a non-zero len with any other op?

I think the semantics of flags is op specific rather than global? And
it seems to only be defined for SHUTDOWN so far. I assume it MUST be
zero for all other ops?

Does len include the header itself? Looking at the Linux implementation
I think it does not, likewise for fwd_cnt doesn't seem to include the
headers.

Either way all of this would be worth writing down IMHO (unless it is
somehow implied by virtio conventions?)

> +Most packets simply transfer data but control packets are also used for
> +connection and buffer space management.  \field{op} is one of the following
> +operation constants:
> +
> +\begin{lstlisting}
> +enum {
> + VIRTIO_VSOCK_OP_INVALID = 0,
> +
> + /* Connect operations */
> + VIRTIO_VSOCK_OP_REQUEST = 1,
> + VIRTIO_VSOCK_OP_RESPONSE = 2,
> + VIRTIO_VSOCK_OP_RST = 3,
> + VIRTIO_VSOCK_OP_SHUTDOWN = 4,
> +
> + /* To send payload */
> + VIRTIO_VSOCK_OP_RW = 5,
> +
> + /* Tell the peer our credit info */
> + VIRTIO_VSOCK_OP_CREDIT_UPDATE = 6,
> + /* Request the peer to send the credit info to us */
> + VIRTIO_VSOCK_OP_CREDIT_REQUEST = 7,
> +};
> +\end{lstlisting}
> +
> +\subsubsection{Addressing}\label{sec:Device Types / Socket Device /
> Device Operation / Addressing}
> +
> +Flows are identified by a (source, destination) address tuple.  An address
> +consists of a (cid, port number) tuple. The header fields used for this are
> +\field{src_cid}, \field{src_port}, \field{dst_cid}, and \field{dst_port}.
> +
> +Currently only stream sockets are supported. \field{type} is 1 for stream
> +socket types.
> +
> +Stream sockets provide in-order, guaranteed, connection-oriented delivery
> +without message boundaries.
> +
> +\subsubsection{Buffer Space Management}\label{sec:Device Types /
> Socket Device / Device Operation / Buffer Space Management}
> +\field{buf_alloc} and \field{fwd_cnt} are used for buffer space management of
> +stream sockets. The guest and the device publish how much buffer space is
> +available per socket. This facilitates flow control so packets are never
> +dropped.
> +
> +\field{buf_alloc} is the total receive buffer space, in bytes, for this socket.
> +This includes both free and in-use buffers. \field{fwd_cnt} is the free-running
> +bytes received counter. The sender calculates the amount of free receive buffer
> +space as follows:
> +
> +\begin{lstlisting}
> +/* tx_cnt is the sender's free-running bytes transmitted counter */
> +u32 peer_free = peer_buf_alloc - (tx_cnt - peer_fwd_cnt);
> +\end{lstlisting}
> +
> +If there is insufficient buffer space, the sender waits until virtqueue buffers
> +are returned and checks \field{buf_alloc} and \field{fwd_cnt} again.

I think I've stumbled across a potential pit fall with this approach,
which is that is relies on the application level data flow being bi-
directional.

My first test was to run on the guest side "seq 1 10000 | nc-vsock 1 1"
with a nc -U pulling the data off the AF_UNIX socket on the host side
i.e. a unidirectional flow so the driver sent buf_alloc bytes and then
stopped because there was no return traffic on to which the device
could piggy back any fwd_cnt updates.

I hacked around this with an unsolicited CREDIT_UPDATE op for each
buffer chain processed, but I think this is something which needs
looking into at at the protocol level. I couldn't see anything in the
Linux implementation which looked like it would address this, but
perhaps I've missed something.

I did consider whether I was supposed to update 'buf_alloc' and
'fwd_cnt' in the tx request before releasing it, but the request buffer
is read-only for the device end so I think not.

It's also unclear to me how to map 'buf_alloc' and 'fwd_cnt' onto a
situation where the backend is not pumping data directly into Linux's
the internal socket interface (since the spec maps pretty directly onto
that interface it seems).

For example should 'buf_alloc' be the buffers within the virtio backend
itself (which would be zero in my current simple synchronous prototype,
causing no traffic to flow and my prototype to therefore lie and say
4096 instead) or should it include the buffers present in the next
link, which may well be an estimate depending on the visibility one has
into the next link (i.e. I think for my AF_UNIX backend I could get
something with getsockopt, but that might not be an option for some
other backend paths) or should it be an estimate of the buffers from
the backend all the way to the eventual end point? I think from the
Linux implementation that it includes the next hop (i.e. the Linux
socket buffer).

Related to which buffers is at which point should 'fwd_cnt' be
incremented, AIUI it mustn't be before the virtio buffer chain has been
released (whether or not there is buffering in the backend). Right now
I increment it after I do the write onto the AF_UNIX socket, but if
'buf_alloc' is supposed to include the buffers in that socket then
really I ought to be doing it as the data drains out the other end --
which I think is not something I have reasonable visibility onto.

I think these issues around flow controls are the biggest stumbling
block right now wrt defining a generic platform agnostic virtio device.

>  Sending
> +the VIRTIO_VSOCK_OP_CREDIT_REQUEST packet queries how much buffer space is
> +available. The reply to this query is a VIRTIO_VSOCK_OP_CREDIT_UPDATE packet.

It's not clear when a peer is expected to do this, is it implied that
if the sends has waited for buffers to be returned but tx_cnt hasn't
increased that it should start polling using this mechanism? If so then
how frequently?

The Linux implementation responds to CREDIT_REQUEST but doesn't appear
to have code to generate them.

> +
> +\drivernormative{\paragraph}{Device Operation: Buffer Space
> Management}{Device Types / Socket Device / Device Operation / Buffer
> Space Management}
> +VIRTIO_VSOCK_OP_RW data packets MUST only be transmitted when the peer has
> +sufficient free buffer space for the payload.

This implies that the implicit buffering space within the virtio ring
itself is being ignored, is that correct?

That rules out a simple zero buffer approach which simply pumps data
from the ring into an AF_UNIX socket whenever the socket is available
for writing, returning the virtio chains as they are fully consumed.

> +
> +All packets associated with a stream flow MUST contain valid information in
> +\field{buf_alloc} and \field{fwd_cnt} fields.
> +
> +\devicenormative{\paragraph}{Device Operation: Buffer Space
> Management}{Device Types / Socket Device / Device Operation / Buffer
> Space Management}
> +VIRTIO_VSOCK_OP_RW data packets MUST only be transmitted when the peer has
> +sufficient free buffer space for the payload.
> +
> +All packets associated with a stream flow MUST contain valid information in
> +\field{buf_alloc} and \field{fwd_cnt} fields.
> +
> +\subsubsection{Receive and Transmit}\label{sec:Device Types / Socket
> Device / Device Operation / Receive and Transmit}
> +The driver queues outgoing packets on the tx virtqueue and incoming packet
> +receive buffers on the rx virtqueue. Packets are of the following form:
> +
> +\begin{lstlisting}
> +struct virtio_vsock_packet {
> +    struct virtio_vsock_hdr hdr;
> +    u8 data[];
> +};
> +\end{lstlisting}
> +
> +Virtqueue buffers for outgoing packets are read-only. Virtqueue buffers for
> +incoming packets are write-only.
> +
> +\drivernormative{\paragraph}{Device Operation: Receive and
> Transmit}{Device Types / Socket Device / Device Operation / Receive
> and Transmit}
> +
> +The \field{guest_cid} configuration field MUST be used as the source CID when
> +sending outgoing packets.
> +
> +A VIRTIO_VSOCK_OP_RST reply MUST be sent if a packet is received with an
> +unknown \field{type} value.
> +
> +\devicenormative{\paragraph}{Device Operation: Receive and
> Transmit}{Device Types / Socket Device / Device Operation / Receive
> and Transmit}
> +A VIRTIO_VSOCK_OP_RST reply MUST be sent if a packet is received with an
> +unknown \field{type} value.
> +
> +\subsubsection{Stream Sockets}\label{sec:Device Types / Socket Device
> / Device Operation / Stream Sockets}
> +
> +Connections are established by sending a VIRTIO_VSOCK_OP_REQUEST packet. If a
> +listening socket exists on the destination a VIRTIO_VSOCK_OP_RESPONSE reply is
> +sent and the connection is established.  A VIRTIO_VSOCK_OP_RST reply is sent if
> +a listening socket does not exist on the destination or the destination has
> +insufficient resources to establish the connection.

What are the requirements if a socket with the same src and dst
cid+port is already established? Should the new connection attempt be
RST and/or the exsiting connection torn down? I think sending a RST (a
good option IMHO) would have the effect of tearing down the existing
connection anyway.

The above says various replies MUST be sent, but what if the other end
has not provided any buffers to allow a reply to be sent, or they have
all been consumed with RX traffic?

> +
> +When a connected socket receives VIRTIO_VSOCK_OP_SHUTDOWN the header
> +\field{flags} field bit 0 indicates that the peer will not receive any more
> +data and bit 1 indicates that the peer will not send any more data. If these
> +bits are set and there are no more virtqueue buffers pending the socket is
> +disconnected.

It would be useful (and less prone to ambiguity, although it is pretty
obvious here) to define these bits as enums/constants (like the ops)
rather than simply describing them in prose.

The text describes what to do if both bits are set but not what to do
if one or both of them are not set or if there are virtqueue buffers
pending.

Under what circumstances would a peer send a SHUTDOWN without both bits
set? i.e what is the intended use of a "shutdown, but I'm happy to
recieve more data or even send you some more myself" operation?

Thanks,
Ian.



[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]