virtio-dev message


Subject: Re: [virtio-dev] [PATCH v4] vsock: add vsock device

Hi Stefan,

On Wed, 2016-03-16 at 10:53 +0000, Stefan Hajnoczi wrote:
> The virtio vsock device is a zero-configuration socket communications
> device.  It is designed as a guest<->host management channel suitable
> for communicating with guest agents.

I'm interested in using this for a project I'm working on so I've been
prototyping a userspace backend (for a non-Linux host end) which pumps
the data over a Unix domain socket. I've got some feedback and
questions on the spec (some of which might be down to my relative
naivety regarding virtio).

For the frontend I've been using your Linux tree from github (vsock
branch, at commit 563d2a770dfa). I'm mostly using those patches
backported to 4.1.19, but I've also run your exact tree and your
vsock-nfs branch with the same behaviours I mention below.

The Linux .ko and various things under sysfs end up called "transport-
virtio" (or s/-/_/g) rather than e.g. "vsock-transport-virtio", which
would be less generic sounding.


> diff --git a/trunk/content.tex b/trunk/content.tex
> index d989d98..4e9f165 100644
> --- a/trunk/content.tex
> +++ b/trunk/content.tex
> @@ -5641,6 +5641,223 @@ descriptor for the \field{sense_len}, \field{residual},
>  \field{status_qualifier}, \field{status}, \field{response} and
>  \field{sense} fields.
> +\section{Socket Device}\label{sec:Device Types / Socket Device}
> +
> +The virtio socket device is a zero-configuration socket communications device.
> +It facilitates data transfer between the guest and device without using the
> +Ethernet or IP protocols.
> +
> +\subsection{Device ID}\label{sec:Device Types / Socket Device / Device ID}
> +  13

The SVN version of the virtio spec seems to include many more device id
allocations, in particular in the table at the start of Section 5
("Device Types") it assigns 13 as "memory balloon" (not the traditional
one) and 5.5 ("Traditional Memory Balloon Device") says:
    This is the traditional balloon device. The device number 13 is
    reserved for a new memory balloon interface, with different
    semantics, which is expected in a future version of the standard.

> +The ctrl virtqueue is reserved for future use and is currently unused.

Is the normal way to reflect this in a driver backend implementation by
exposing a small (perhaps even zero sized) queue?

> +
> +\subsection{Feature bits}\label{sec:Device Types / Socket Device /
> Feature bits}
> +
> +\begin{description}
> +There are currently no feature bits defined for this device.
> +\end{description}
> +
> +\subsection{Device configuration layout}\label{sec:Device Types /
> Socket Device / Device configuration layout}
> +
> +\begin{lstlisting}
> +struct virtio_vsock_config {
> + le32 guest_cid;
> +};
> +\end{lstlisting}
> +
> +The \field{guest_cid} field contains the guest's context ID, which uniquely
> +identifies the device for its lifetime.

Are there any CID values which are reserved or invalid, perhaps 0 or
~0U? Likewise for port numbers. I think there is no broadcast nor
routing so perhaps no need for any reserved space?

> +
> +\subsection{Device Initialization}\label{sec:Device Types / Socket
> Device / Device Initialization}
> +
> +\begin{enumerate}
> +\item The guest's cid is read from \field{guest_cid}.
> +
> +\item Buffers are added to the event virtqueue to receive events from
> the device.
> +
> +\item Buffers are added to the rx virtqueue to start receiving packets.
> +\end{enumerate}
> +
> +\subsection{Device Operation}\label{sec:Device Types / Socket Device
> / Device Operation}
> +
> +Packets transmitted or received contain a header before the payload:
> +
> +\begin{lstlisting}
> +struct virtio_vsock_hdr {
> + le32 src_cid;
> + le32 src_port;
> + le32 dst_cid;
> + le32 dst_port;
> + le32 len;
> + le16 type;
> + le16 op;
> + le32 flags;
> + le32 buf_alloc;
> + le32 fwd_cnt;
> +};
> +\end{lstlisting}

I'm interested in which fields are valid for which ops. For example I
think len can only be validly non-zero for VIRTIO_VSOCK_OP_RW, is that
true? Is it an error to use a non-zero len with any other op?

I think the semantics of flags is op specific rather than global? And
it seems to only be defined for SHUTDOWN so far. I assume it MUST be
zero for all other ops?

Does len include the header itself? Looking at the Linux implementation
I think it does not; likewise fwd_cnt doesn't seem to include the
header bytes either.

Either way, all of this would be worth writing down IMHO (unless it is
somehow implied by virtio conventions?)
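
To make the intent concrete, here is roughly the sanity check my
prototype currently applies. This is only my reading of the spec, and
the constant values are taken from the Linux implementation rather than
the spec text itself:

```c
#include <stdint.h>

/* Op values as used by the Linux implementation; the spec does not
 * (yet) spell out per-op field validity. */
enum {
    VIRTIO_VSOCK_OP_SHUTDOWN = 4,
    VIRTIO_VSOCK_OP_RW       = 5,
};

/* Returns 0 if the header fields look consistent for the given op,
 * -1 otherwise.  Both rules below are my assumptions, not spec text. */
static int vsock_hdr_sane(uint16_t op, uint32_t len, uint32_t flags)
{
    /* len: I assume only RW carries a payload, so it must be 0
     * for every other op. */
    if (op != VIRTIO_VSOCK_OP_RW && len != 0)
        return -1;

    /* flags: only defined for SHUTDOWN so far, so I assume it
     * must be 0 elsewhere. */
    if (op != VIRTIO_VSOCK_OP_SHUTDOWN && flags != 0)
        return -1;

    return 0;
}
```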

> +Most packets simply transfer data but control packets are also used for
> +connection and buffer space management.  \field{op} is one of the following
> +operation constants:
> +
> +\begin{lstlisting}
> +enum {
> + VIRTIO_VSOCK_OP_INVALID = 0,
> +
> + /* Connect operations */
> + VIRTIO_VSOCK_OP_REQUEST = 1,
> + VIRTIO_VSOCK_OP_RESPONSE = 2,
> + VIRTIO_VSOCK_OP_RST = 3,
> + VIRTIO_VSOCK_OP_SHUTDOWN = 4,
> +
> + /* To send payload */
> + VIRTIO_VSOCK_OP_RW = 5,
> +
> + /* Tell the peer our credit info */
> + VIRTIO_VSOCK_OP_CREDIT_UPDATE = 6,
> + /* Request the peer to send the credit info to us */
> + VIRTIO_VSOCK_OP_CREDIT_REQUEST = 7,
> +};
> +\end{lstlisting}
> +
> +\subsubsection{Addressing}\label{sec:Device Types / Socket Device /
> Device Operation / Addressing}
> +
> +Flows are identified by a (source, destination) address tuple.  An address
> +consists of a (cid, port number) tuple. The header fields used for this are
> +\field{src_cid}, \field{src_port}, \field{dst_cid}, and \field{dst_port}.
> +
> +Currently only stream sockets are supported. \field{type} is 1 for stream
> +socket types.
> +
> +Stream sockets provide in-order, guaranteed, connection-oriented delivery
> +without message boundaries.
> +
> +\subsubsection{Buffer Space Management}\label{sec:Device Types /
> Socket Device / Device Operation / Buffer Space Management}
> +\field{buf_alloc} and \field{fwd_cnt} are used for buffer space management of
> +stream sockets. The guest and the device publish how much buffer space is
> +available per socket. This facilitates flow control so packets are never
> +dropped.
> +
> +\field{buf_alloc} is the total receive buffer space, in bytes, for this socket.
> +This includes both free and in-use buffers. \field{fwd_cnt} is the free-running
> +bytes received counter. The sender calculates the amount of free receive buffer
> +space as follows:
> +
> +\begin{lstlisting}
> +/* tx_cnt is the sender's free-running bytes transmitted counter */
> +u32 peer_free = peer_buf_alloc - (tx_cnt - peer_fwd_cnt);
> +\end{lstlisting}
> +
> +If there is insufficient buffer space, the sender waits until virtqueue buffers
> +are returned and checks \field{buf_alloc} and \field{fwd_cnt} again.
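
For what it's worth, here is roughly how my prototype implements that
sender-side calculation (a sketch only; the struct and function names
are my own, not from the spec). The unsigned subtraction means the
free-running counters can wrap harmlessly:

```c
#include <stdint.h>

/* Sender-side view of the peer's receive buffer, per socket.
 * tx_cnt is our free-running bytes-transmitted counter;
 * peer_buf_alloc and peer_fwd_cnt come from the most recent
 * packet received from the peer. */
struct vsock_credit {
    uint32_t tx_cnt;
    uint32_t peer_buf_alloc;
    uint32_t peer_fwd_cnt;
};

/* Free space in the peer's receive buffer.  Done in u32 so the
 * free-running counters may wrap without breaking the arithmetic. */
static inline uint32_t vsock_peer_free(const struct vsock_credit *c)
{
    return c->peer_buf_alloc - (c->tx_cnt - c->peer_fwd_cnt);
}

/* A VIRTIO_VSOCK_OP_RW payload of 'len' bytes may only be sent
 * when the peer has room for all of it. */
static inline int vsock_can_send(const struct vsock_credit *c,
                                 uint32_t len)
{
    return vsock_peer_free(c) >= len;
}
```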

I think I've stumbled across a potential pitfall with this approach,
which is that it relies on the application-level data flow being
bidirectional.

My first test was to run on the guest side "seq 1 10000 | nc-vsock 1 1"
with a nc -U pulling the data off the AF_UNIX socket on the host side
i.e. a unidirectional flow so the driver sent buf_alloc bytes and then
stopped because there was no return traffic on to which the device
could piggy back any fwd_cnt updates.

I hacked around this with an unsolicited CREDIT_UPDATE op for each
buffer chain processed, but I think this is something which needs
looking into at the protocol level. I couldn't see anything in the
Linux implementation which looked like it would address this, but
perhaps I've missed something.
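
Concretely, my workaround builds something like the following reply
after processing each tx buffer chain. This is purely illustrative of
my prototype (the op value is taken from the Linux implementation, and
byte-order conversion to the le32/le16 wire fields is omitted):

```c
#include <stdint.h>
#include <string.h>

/* Host-endian mirror of the wire header; conversion omitted. */
struct virtio_vsock_hdr {
    uint32_t src_cid, src_port, dst_cid, dst_port;
    uint32_t len;
    uint16_t type, op;
    uint32_t flags, buf_alloc, fwd_cnt;
};

/* Value from the Linux implementation, not spelled out in the spec. */
#define VIRTIO_VSOCK_OP_CREDIT_UPDATE 6

/* Build an unsolicited CREDIT_UPDATE in reply to packet 'in',
 * advertising our current buf_alloc/fwd_cnt.  Addressing is the
 * incoming tuple reversed; control packets carry no payload. */
static void vsock_credit_update(struct virtio_vsock_hdr *out,
                                const struct virtio_vsock_hdr *in,
                                uint32_t buf_alloc, uint32_t fwd_cnt)
{
    memset(out, 0, sizeof(*out));
    out->src_cid   = in->dst_cid;
    out->src_port  = in->dst_port;
    out->dst_cid   = in->src_cid;
    out->dst_port  = in->src_port;
    out->type      = in->type;
    out->op        = VIRTIO_VSOCK_OP_CREDIT_UPDATE;
    out->len       = 0;
    out->buf_alloc = buf_alloc;
    out->fwd_cnt   = fwd_cnt;
}
```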

I did consider whether I was supposed to update 'buf_alloc' and
'fwd_cnt' in the tx request before releasing it, but the request buffer
is read-only for the device end so I think not.

It's also unclear to me how to map 'buf_alloc' and 'fwd_cnt' onto a
situation where the backend is not pumping data directly into Linux's
internal socket interface (since the spec seems to map pretty directly
onto that interface).

For example, should 'buf_alloc' be the buffers within the virtio
backend itself? That would be zero in my current simple synchronous
prototype, causing no traffic to flow, so my prototype lies and says
4096 instead. Should it instead include the buffers present in the
next link, which may well be an estimate depending on the visibility
one has into that link (for my AF_UNIX backend I could get something
with getsockopt, but that might not be an option for some other
backend paths)? Or should it be an estimate of the buffers from the
backend all the way to the eventual end point? From the Linux
implementation I think it includes the next hop (i.e. the Linux
socket buffer).

Related to which buffers count is the question of at which point
'fwd_cnt' should be incremented. AIUI it mustn't be before the virtio
buffer chain has been released (whether or not there is buffering in
the backend). Right now I increment it after I do the write onto the
AF_UNIX socket, but if 'buf_alloc' is supposed to include the buffers
in that socket then really I ought to be doing it as the data drains
out the other end, which I think is not something I have reasonable
visibility onto.

I think these issues around flow controls are the biggest stumbling
block right now wrt defining a generic platform agnostic virtio device.

>  Sending
> +the VIRTIO_VSOCK_OP_CREDIT_REQUEST packet queries how much buffer space is
> +available. The reply to this query is a VIRTIO_VSOCK_OP_CREDIT_UPDATE packet.

It's not clear when a peer is expected to do this. Is it implied that
if the sender has waited for buffers to be returned but tx_cnt hasn't
increased then it should start polling using this mechanism? If so,
how frequently?

The Linux implementation responds to CREDIT_REQUEST but doesn't appear
to have code to generate them.

> +
> +\drivernormative{\paragraph}{Device Operation: Buffer Space
> Management}{Device Types / Socket Device / Device Operation / Buffer
> Space Management}
> +VIRTIO_VSOCK_OP_RW data packets MUST only be transmitted when the peer has
> +sufficient free buffer space for the payload.

This implies that the implicit buffering space within the virtio ring
itself is being ignored, is that correct?

That rules out a simple zero buffer approach which simply pumps data
from the ring into an AF_UNIX socket whenever the socket is available
for writing, returning the virtio chains as they are fully consumed.

> +
> +All packets associated with a stream flow MUST contain valid information in
> +\field{buf_alloc} and \field{fwd_cnt} fields.
> +
> +\devicenormative{\paragraph}{Device Operation: Buffer Space
> Management}{Device Types / Socket Device / Device Operation / Buffer
> Space Management}
> +VIRTIO_VSOCK_OP_RW data packets MUST only be transmitted when the peer has
> +sufficient free buffer space for the payload.
> +
> +All packets associated with a stream flow MUST contain valid information in
> +\field{buf_alloc} and \field{fwd_cnt} fields.
> +
> +\subsubsection{Receive and Transmit}\label{sec:Device Types / Socket
> Device / Device Operation / Receive and Transmit}
> +The driver queues outgoing packets on the tx virtqueue and incoming packet
> +receive buffers on the rx virtqueue. Packets are of the following form:
> +
> +\begin{lstlisting}
> +struct virtio_vsock_packet {
> +    struct virtio_vsock_hdr hdr;
> +    u8 data[];
> +};
> +\end{lstlisting}
> +
> +Virtqueue buffers for outgoing packets are read-only. Virtqueue buffers for
> +incoming packets are write-only.
> +
> +\drivernormative{\paragraph}{Device Operation: Receive and
> Transmit}{Device Types / Socket Device / Device Operation / Receive
> and Transmit}
> +
> +The \field{guest_cid} configuration field MUST be used as the source CID when
> +sending outgoing packets.
> +
> +A VIRTIO_VSOCK_OP_RST reply MUST be sent if a packet is received with an
> +unknown \field{type} value.
> +
> +\devicenormative{\paragraph}{Device Operation: Receive and
> Transmit}{Device Types / Socket Device / Device Operation / Receive
> and Transmit}
> +A VIRTIO_VSOCK_OP_RST reply MUST be sent if a packet is received with an
> +unknown \field{type} value.
> +
> +\subsubsection{Stream Sockets}\label{sec:Device Types / Socket Device
> / Device Operation / Stream Sockets}
> +
> +Connections are established by sending a VIRTIO_VSOCK_OP_REQUEST packet. If a
> +listening socket exists on the destination a VIRTIO_VSOCK_OP_RESPONSE reply is
> +sent and the connection is established.  A VIRTIO_VSOCK_OP_RST reply is sent if
> +a listening socket does not exist on the destination or the destination has
> +insufficient resources to establish the connection.

What are the requirements if a socket with the same src and dst
cid+port is already established? Should the new connection attempt be
RST'd and/or the existing connection torn down? I think sending a RST
(a good option IMHO) would have the effect of tearing down the
existing connection anyway.

The above says various replies MUST be sent, but what if the other end
has not provided any buffers to allow a reply to be sent, or they have
all been consumed with RX traffic?

> +
> +When a connected socket receives VIRTIO_VSOCK_OP_SHUTDOWN the header
> +\field{flags} field bit 0 indicates that the peer will not receive any more
> +data and bit 1 indicates that the peer will not send any more data. If these
> +bits are set and there are no more virtqueue buffers pending the socket is
> +disconnected.

It would be useful (and less prone to ambiguity, although it is pretty
obvious here) to define these bits as enums/constants (like the ops)
rather than simply describing them in prose.
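
Something like the following is what I have in mind. The names are
the ones the Linux implementation uses, and the disconnect helper is
just my reading of the quoted paragraph:

```c
#include <stdint.h>

/* SHUTDOWN flag bits as named by the Linux implementation;
 * the spec currently only describes them in prose. */
#define VIRTIO_VSOCK_SHUTDOWN_RCV  (1u << 0) /* peer won't receive more data */
#define VIRTIO_VSOCK_SHUTDOWN_SEND (1u << 1) /* peer won't send more data */

/* My reading of the spec text: disconnect only once both bits are
 * set and no virtqueue buffers remain pending. */
static int vsock_should_disconnect(uint32_t flags, int buffers_pending)
{
    uint32_t both = VIRTIO_VSOCK_SHUTDOWN_RCV | VIRTIO_VSOCK_SHUTDOWN_SEND;
    return (flags & both) == both && !buffers_pending;
}
```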

The text describes what to do if both bits are set but not what to do
if one or both of them are not set, or if there are virtqueue buffers
still pending.

Under what circumstances would a peer send a SHUTDOWN without both
bits set? I.e. what is the intended use of a "shutdown, but I'm happy
to receive more data or even send you some more myself" operation?

