virtio-dev message

Subject: Re: [PATCH v7] vsock: add vsock device


On Thu, Aug 11, 2016 at 11:50:16AM +0100, Ian Campbell wrote:
> On 10 August 2016 at 16:24, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > On Mon, Aug 08, 2016 at 12:58:24PM +0100, Ian Campbell wrote:
> >> On 4 August 2016 at 16:24, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> >> Thanks, would it be possible/easy for you to update
> >> https://stefanha.github.io/virtio/#x1-2800007 and
> >> https://github.com/stefanha/virtio? (I think they both still have v6)
> >
> > I have updated it:
> > https://stefanha.github.io/virtio/#x1-2830007
> 
> Thanks.
> 
> >> Is there a more-or-less semi-official git mirror of the master svn
> >> repo, or does everyone just roll their own with `git-svn`?
> >
> > Everyone rolls their own with git-svn.
> 
> OK, I shall do the same ;-)
> 
> >> I think the following quote is the entirety of the change/newly added text:
> >>
> >> > +\subsubsection{Virtqueue Flow Control}\label{sec:Device Types / Socket Device / Device Operation / Virtqueue Flow Control}
> >> > +
> >> > +The tx virtqueue carries packets initiated by applications and replies to
> >> > +received packets.  The rx virtqueue carries packets initiated by the device and
> >> > +replies to previously transmitted packets.
> >> > +
> >> > +If both rx and tx virtqueues are filled by the driver and device at the same
> >> > +time then it appears that a deadlock is reached.  The driver has no free tx
> >> > +descriptors to send replies.  The device has no free rx descriptors to send
> >> > +replies either.  Therefore neither device nor driver can process virtqueues
> >> > +since that may involve sending new replies.
> >> > +
> >> > +This is solved using additional resources outside the virtqueue to hold
> >> > +packets.  With additional resources, it becomes possible to process incoming
> >> > +packets even when outgoing packets cannot be sent.
> >> > +
> >> > +Eventually even the additional resources will be exhausted and further
> >> > +processing is not possible until the other side processes the virtqueue that
> >> > +it has neglected.  This stop to processing prevents one side from causing
> >> > +unbounded resource consumption in the other side.
> >>
> >> Rather than "This stop to..." did you mean "This is to stop..." or
> >> something else? I can't parse it as is, I'm afraid.
> >
> > Wow, I thought I'd crafted some pretty fine prose.  Sorry about that.
> > I'll change it to:
> >
> > This stops one side from causing unbounded resource consumption.
> 
> This is now parsable (thanks) but it's not clear to me how it stops
> this from happening.
> 
> I think what I am missing is the bit which says explicitly "and under
> these circumstances you should stop processing the ring entirely" (as
> opposed to dumping traffic on the floor).

This is actually a general spec issue: we don't specify
what devices should do when they encounter an invalid request,
and that is not good: in particular, some devices just
cause the guest to exit.

I think we should work on a new feature for recovering from errors,
e.g. by device reset.


> >> > +\drivernormative{\paragraph}{Device Operation: Virtqueue Flow Control}{Device Types / Socket Device / Device Operation / Virtqueue Flow Control}
> >> > +
> >> > +The rx virtqueue MUST be processed even when the tx virtqueue is full so long as there are additional resources available to hold packets outside the tx virtqueue.
> >> > +
> >> > +\devicenormative{\paragraph}{Device Operation: Virtqueue Flow Control}{Device Types / Socket Device / Device Operation / Virtqueue Flow Control}
> >> > +
> >> > +The tx virtqueue MUST be processed even when the rx virtqueue is full so long as there are additional resources available to hold packets outside the rx virtqueue.
> 
> would s/packets/replies/ in both of these clarify your intent some more?
> 
> In the case of the drivernormative one is it worth being explicit that
> processing the rx virtqueue includes replenishing the supply of empty
> descriptors for the other end (not just consuming the filled ones)?
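As a minimal sketch of that point (all types and helpers here are
hypothetical, not taken from the spec or any real driver), driver-side rx
processing that replenishes empty descriptors as it consumes filled ones
could look like:

/* Toy model of an rx ring: counters instead of real descriptors. */
#include <stdio.h>

#define RX_RING_SIZE 8

struct rx_ring {
    unsigned filled;   /* buffers the device has filled, awaiting the driver */
    unsigned empty;    /* empty buffers currently available to the device    */
};

static void driver_process_rx(struct rx_ring *rx)
{
    while (rx->filled > 0) {
        rx->filled--;          /* consume one filled buffer ...              */
        /* ... handle the packet here ...                                    */
        rx->empty++;           /* ... and immediately hand back an empty one */
    }
}

int main(void)
{
    struct rx_ring rx = { .filled = 5, .empty = RX_RING_SIZE - 5 };

    driver_process_rx(&rx);
    printf("empty descriptors available to the device: %u\n", rx.empty);
    return 0;
}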
> 
> >> These describe handling the case where additional resources are
> >> available (by saying "keep going if you have resources to do so",
> >> which IMHO doesn't really need stating) but they don't describe what
> >> the acceptable behaviour is when those resources are no longer
> >> available, so really this is just kicking the deadlock can a bit
> >> further down the road without addressing the actual problem (i.e. the
> >> "Eventually even the additional resources..." paragraph from above).
> >
> > No, there is an important point in there which I didn't state clearly:
> >
> > Additional resources are *required*, otherwise virtqueue deadlock is
> > possible under normal operation.
> >
> > Regarding kicking the can down the road:
> >
> > It will always be possible to ignore incoming packets to the point where
> > the peer will have to stop responding, but this isn't a problem.  Under
> > correct operation this state is never reached.
> 
> I'm afraid I'm still missing the piece of the puzzle which causes
> this to be guaranteed to be true.
> 
> I think maybe there is something to do with the fact that a reply can
> never itself require a reply?
> 
> Is there also a requirement that the driver/device handle any
> outstanding ARs before putting any actual data onto the outgoing ring?
> I think there has to be, otherwise you are adding more future replies
> which will eventually clog the ARs.
> 
> Such a requirement could be a bit subtle on the implementation side if
> you have separate Rx and Tx processing threads (as I do), since one
> thread may be injecting data just as the other is producing ARs. (i.e.
> on the device side the RX thread fills the RX ring with data while the
> TX thread is filling the AR queue with replies).
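For what it's worth, a minimal sketch of how that accounting could be
shared safely between the two threads (names and sizing hypothetical, not
from any real implementation): the count of pending replies is only read
or updated under one lock, so the RX thread's decision to inject more data
and the TX thread's production of ARs can't race past the limit.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

#define AR_LIMIT 128     /* assumed sizing, e.g. the size of the rx ring */

static pthread_mutex_t ar_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned ar_pending;      /* replies waiting for tx ring space */

/* TX-side thread: try to reserve an AR slot for a new reply. */
static bool ar_reserve(void)
{
    pthread_mutex_lock(&ar_lock);
    bool ok = ar_pending < AR_LIMIT;
    if (ok)
        ar_pending++;
    pthread_mutex_unlock(&ar_lock);
    return ok;
}

/* RX-side thread: before injecting more data, check how many replies
 * are still outstanding, so the ring is never filled past what the ARs
 * can absorb. */
static bool rx_may_inject_data(void)
{
    pthread_mutex_lock(&ar_lock);
    bool ok = ar_pending < AR_LIMIT;
    pthread_mutex_unlock(&ar_lock);
    return ok;
}

int main(void)
{
    /* Single-threaded demo of the accounting; real code would call these
     * from separate RX and TX threads. */
    while (ar_reserve())
        ;
    printf("ARs exhausted; rx may inject data: %d\n", rx_may_inject_data());
    return 0;
}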
> 
> >> IIRC from previous discussions your intention was to say that a
> >> driver/backend should stop processing TX requests if the RX ring
> >> becomes full (or I suppose if the "additional resources" are
> >> exhausted), I guess with the implications that it should never stop
> >> processing RX requests (which I think is always possible since from
> >> the driver end there are no "additional resources" needed for RX
> >> processing). But the above essentially explicitly states the OK case
> >> (i.e. the non-deadlock case, resources available) and leaves what to
> >> do on resource exhaustion implicit (while that is the most important
> >> bit of information).
> >>
> >> It's unspecified how much "additional resource" an implementation is
> >> expected/required to provide. Should it be enough for one ring slot?
> >> Or N*slots-per-ring? Is 0 an acceptable number?
> >
> > The way I've implemented the resource limit is to only allow queuing N
> > reply packets when the outgoing virtqueue is full, where N is the size
> > of the incoming virtqueue.
> >
> > If this limit is reached then incoming virtqueue processing is paused
> > until there is space to transmit pending reply packets again.
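Concretely, a minimal sketch of that limit (all names here are
hypothetical; this is just the counting, not a real device):

#include <stdbool.h>
#include <stdio.h>

#define RX_VQ_SIZE 128                 /* N = size of the incoming virtqueue */

struct flow_ctl {
    unsigned tx_free;                  /* free slots in the outgoing virtqueue */
    unsigned queued_replies;           /* replies held outside the virtqueue   */
};

/* May we keep processing the incoming virtqueue? */
static bool can_process_rx(const struct flow_ctl *fc)
{
    /* Ordinary outgoing packets are not counted here; only replies are. */
    return fc->tx_free > 0 || fc->queued_replies < RX_VQ_SIZE;
}

static void queue_reply(struct flow_ctl *fc)
{
    if (fc->tx_free > 0)
        fc->tx_free--;                 /* send directly on the tx virtqueue */
    else
        fc->queued_replies++;          /* park it in the additional resources */
}

int main(void)
{
    struct flow_ctl fc = { .tx_free = 2, .queued_replies = 0 };

    for (int i = 0; i < RX_VQ_SIZE + 2 && can_process_rx(&fc); i++)
        queue_reply(&fc);

    printf("paused with %u replies queued\n", fc.queued_replies);
    return 0;
}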
> 
> Is >=N a requirement, or just a convenient number? I have a feeling
> this has a relationship to the "separate RX and Tx processing threads"
> issue I mention above, specifically the ARs need to be larger than the
> number of data slots which will be consumed before checking for ARs
> (i.e. on the device side how often the RX thread stop processing real
> data in order to check for pending ARs), which is another way of
> saying that you need to stop and check for pending ARs more often than
> every N data slots, what do you think?
> 
> I keep coming back to this "how many ARs are enough" question because
> it seems like if N-1 ARs are not sufficient to avoid deadlock while N
> ARs are, then the "inductive step" from N-1 to N ought to be essentially
> the "proof" which guarantees that deadlocks are not possible.
> 
> Sorry to be so pedantic about this; the guarantee here seems very
> subtle to me, or perhaps there are simply generic virtio
> assumptions/idioms which I'm not aware of that allow these things to
> be explained more easily.
> 
> >> BTW our backend implementation is now available in
> >> https://github.com/docker/hyperkit/blob/af-vsock/src/pci_virtio_sock.c.
> >> It corresponds to ~v5 of the spec IIRC (so it still has workarounds
> >> relating to the OP_SHUTDOWN issue which you resolved in v6 and it
> >> hasn't yet switched to 64 bit addressing). WRT the above issue the
> >> "additional resources" are a pending-reply ring
> >> (https://github.com/docker/hyperkit/blob/af-vsock/src/pci_virtio_sock.c#L281)
> >> which I sized to be 2x the ring size (since with 1x I still saw
> >> deadlocks); however, people are currently reporting deadlocks (actually
> >> an assert failure in this implementation) with that, so I am
> >> considering making it 4x, but that is just pushing the problem
> >> further away again without solving it.
> >
> > One detail about the limit I've implemented is that it only applies to
> > *reply* packets.  It does not count ordinary outgoing packets.  Perhaps
> > you are counting all outgoing packets.  I'm trusting rlimits, socket
> > buffer sizes, etc. to limit ordinary outgoing packets so they don't
> > matter here.
> 
> I'm also only counting reply packets since ordinary outgoing packets
> don't need to go over the pending reply ring (my manifestation of
> ARs); they come directly out of an fd in the relevant thread.
> 
> I also communicate the need for a CREDIT_UPDATE reply as a single
> per-socket bit, so those also do not consume ARs (and often can
> piggyback on other traffic too).
> 
> Ian.
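A minimal sketch of that last point, the per-socket CREDIT_UPDATE bit
(field and function names hypothetical): the need for a credit update is
recorded as one flag instead of a queued reply, and is cleared whenever
the credit information piggybacks on an outgoing packet.

#include <stdbool.h>
#include <stdio.h>

struct sock_state {
    unsigned fwd_cnt;              /* stand-in for the credit counter carried
                                      in outgoing packet headers              */
    bool credit_update_needed;     /* one bit instead of a queued AR packet   */
};

/* Called when local consumption makes a credit update desirable. */
static void note_credit_update(struct sock_state *s, unsigned consumed)
{
    s->fwd_cnt += consumed;
    s->credit_update_needed = true;    /* no AR slot is consumed here */
}

/* Called when any packet goes out on this socket: the pending credit
 * information piggybacks on it and the bit is cleared.  Only if no other
 * traffic appears does a dedicated CREDIT_UPDATE need to be sent. */
static void fill_outgoing_header(struct sock_state *s, unsigned *hdr_fwd_cnt)
{
    *hdr_fwd_cnt = s->fwd_cnt;
    s->credit_update_needed = false;
}

int main(void)
{
    struct sock_state s = { 0, false };
    unsigned hdr_fwd_cnt;

    note_credit_update(&s, 4096);
    fill_outgoing_header(&s, &hdr_fwd_cnt);
    printf("fwd_cnt sent: %u, update still pending: %d\n",
           hdr_fwd_cnt, s.credit_update_needed);
    return 0;
}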

