OASIS Mailing List ArchivesView the OASIS mailing list archive below
or browse/search using MarkMail.

virtio-dev message

[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]

Subject: Re: [virtio-dev] [PATCH v4] vsock: add vsock device

• From: Stefan Hajnoczi <stefanha@redhat.com>
• To: Ian Campbell <ian.campbell@docker.com>
• Date: Wed, 23 Mar 2016 16:47:27 +0000

On Mon, Mar 21, 2016 at 12:53:23PM +0000, Ian Campbell wrote:
> > +The \field{guest_cid} field contains the guest's context ID, which uniquely
> > +identifies the device for its lifetime.
>
> Are there any CID values which are reserved or invalid, perhaps 0 or
> ~0U? Likewise for port numbers. I think there is no broadcast nor
> routing so perhaps no need for any reserved space?

The following reserved CIDs are defined in <linux/vm_sockets.h>:

/* The vSocket equivalent of INADDR_ANY.  This works for the svm_cid field of
* sockaddr_vm and indicates the context ID of the current endpoint.
*/

/* Use this as the destination CID in an address when referring to the
* hypervisor.  VMCI relies on it being 0, but this would be useful for other
* transports too.
*/

/* This CID is specific to VMCI and can be considered reserved (even VMCI
* doesn't use it anymore, it's a legacy value from an older release).
*/

/* Use this as the destination CID in an address when referring to the host
* (any process other than the hypervisor).  VMCI relies on it being 2, but
* this would be useful for other transports too.
*/

The reserved CIDs are enforced by vhost_vsock_set_cid():

/* Refuse reserved CIDs */
return -EINVAL;

virtio-vsock uses VMADDR_CID_HOST for the host.  The common AF_VSOCK
code uses VMADDR_CID_ANY.  The other two reserved CIDs are not used with
the virtio transport but are still reserved.

I will document reserved CIDs in the next spec revision.

> > +
> > +\subsection{Device Initialization}\label{sec:Device Types / Socket
> > Device / Device Initialization}
> > +
> > +\begin{enumerate}
> > +\item The guest's cid is read from \field{guest_cid}.
> > +
> > +\item Buffers are added to the event virtqueue to receive events from
> > the device.
> > +
> > +\item Buffers are added to the rx virtqueue to start receiving packets.
> > +\end{enumerate}
> > +
> > +\subsection{Device Operation}\label{sec:Device Types / Socket Device
> > / Device Operation}
> > +
> > +
> > +\begin{lstlisting}
> > +struct virtio_vsock_hdr {
> > + le32 src_cid;
> > + le32 src_port;
> > + le32 dst_cid;
> > + le32 dst_port;
> > + le32 len;
> > + le16 type;
> > + le16 op;
> > + le32 flags;
> > + le32 buf_alloc;
> > + le32 fwd_cnt;
> > +};
> > +\end{lstlisting}
>
> I'm interested in which fields are valid for which ops. For example I
> think len can only be validly non-zero for VIRTIO_VSOCK_OP_RW, is that
> true? Is it an error to use a non-zero len with any other op?

len is only used for VIRTIO_VSOCK_OP_RW packets.  It is ignored for
other packet types.  I will document that other packet types should set
it to zero.

> I think the semantics of flags is op specific rather than global? And
> it seems to only be defined for SHUTDOWN so far. I assume it MUST be
> zero for all other ops?

Yes, flags is currently only used for SHUTDOWN.  I will document that it
MUST be zero for other ops and that unused bits must be zero.

> Does len include the header itself? Looking at the Linux implementation
> I think it does not, likewise for fwd_cnt doesn't seem to include the
>
> Either way all of this would be worth writing down IMHO (unless it is
> somehow implied by virtio conventions?)

Thanks for pointing this out.  Only data bytes are counted.  Hence
non-data packets can still be sent on a connection that has run out of
buffer space as long as the virtqueue itself has space.

Will document in the next revision.

> > +\drivernormative{\paragraph}{Device Operation: Buffer Space
> > Management}{Device Types / Socket Device / Device Operation / Buffer
> > Space Management}
> > +VIRTIO_VSOCK_OP_RW data packets MUST only be transmitted when the peer has
> > +sufficient free buffer space for the payload.
>
> This implies that the implicit buffering space within the virtio ring
> itself is being ignored, is that correct?
>
> That rules out a simple zero buffer approach which simply pumps data
> from the ring into an AF_UNIX socket whenever the socket is available
> for writing, returning the virtio chains as they are fully consumed.

You are right, the zero-copy implementation you are describing is
missing the buffer space management layer and only relying on the rx
virtqueue.  That doesn't work if you need to support multiple
simultaneous connections.

virtio-vsock is by design not zero-copy.  The reason for this is that
multiple connections are multiplexed over a single pair of rx/tx
virtqueues.  A malicious or buggy connection could hog the vring and
prevent other connections from making progress.

Guaranteed delivery is possible because the rx virtqueue is emptied as
soon as possible by copying data into the socket receive buffer (socket
option SO_RCVBUF).  Since each connection has a receive buffer and the
other side is aware of available space we can achieve guaranteed
delivery.

> > +Connections are established by sending a VIRTIO_VSOCK_OP_REQUEST packet. If a
> > +listening socket exists on the destination a VIRTIO_VSOCK_OP_RESPONSE reply is
> > +sent and the connection is established.  A VIRTIO_VSOCK_OP_RST reply is sent if
> > +a listening socket does not exist on the destination or the destination has
> > +insufficient resources to establish the connection.
>
> What are the requirements if a socket with the same src and dst
> cid+port is already established? Should the new connection attempt be
> RST and/or the exsiting connection torn down? I think sending a RST (a
> good option IMHO) would have the effect of tearing down the existing
> connection anyway.

I agree.  RST makes sense because something is wrong if we ever reach
this state.  Best to go back to a clean state by dropping the
connection.  Both the old connection and new connection are killed by a
single RST.  This will allow future connection attempts to succeed
again.

Will add to the next revision.

> The above says various replies MUST be sent, but what if the other end
> has not provided any buffers to allow a reply to be sent, or they have
> all been consumed with RX traffic?

The Linux implementation will wait until vring buffers become available.
I think this is the correct approach to take advantage of virtqueue's
guaranteed delivery - otherwise we'd have to add TCPesque retries.

Currently task state UNINTERRUPTIBLE is used in Linux without a timeout
so this could tie up the thread indefinitely.  But it's possible to
improve the code so the timeout is honored and the thread is
interruptible.

> > +
> > +\field{flags} field bit 0 indicates that the peer will not receive any more
> > +data and bit 1 indicates that the peer will not send any more data. If these
> > +bits are set and there are no more virtqueue buffers pending the socket is
> > +disconnected.
>
> It would be useful (and less prone to ambiguity, although it is pretty
> obvious here) to define these bits as enums/constants (like the ops)
> rather than simply describing them in prose.
>
> The text describes what to do if both bits are set but not what to do
> if one or both of them are not set or if there are virtqueue buffers
> pending.

I agree on both points and will clarify in the next revision.

> Under what circumstances would a peer send a SHUTDOWN without both bits
> set? i.e what is the intended use of a "shutdown, but I'm happy to
> recieve more data or even send you some more myself" operation?

Yes, the half-duplex semantics are inherited from TCP and are supported
by the Sockets API.  For example, you can shutdown write but continue
receiving.

Stefan


Attachment: signature.asc
Description: PGP signature

[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]