virtio-dev message

Subject: Re: [PATCH v7] vsock: add vsock device
From: Stefan Hajnoczi <stefanha@redhat.com>
To: Ian Campbell <ian.campbell@docker.com>
Date: Wed, 17 Aug 2016 16:50:07 +0100
On Thu, Aug 11, 2016 at 11:50:16AM +0100, Ian Campbell wrote:
> On 10 August 2016 at 16:24, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > On Mon, Aug 08, 2016 at 12:58:24PM +0100, Ian Campbell wrote:
> >> On 4 August 2016 at 16:24, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> 
> >> Thanks, would it be possible/easy for you to update
> >> https://stefanha.github.io/virtio/#x1-2800007 and
> >> https://github.com/stefanha/virtio? (I think they both still have v6)
> >
> > I have updated it:
> > https://stefanha.github.io/virtio/#x1-2830007
> 
> Thanks.
> 
> >> Is there a more-or-less semi-official git mirror of the master svn
> >> repo, or does everyone just roll their own with `git-svn`?
> >
> > Everyone rolls their own with git-svn.
> 
> OK, I shall do the same ;-)
> 
> >> I think the following quote is the entirety of the change/newly added text:
> >>
> >> > +\subsubsection{Virtqueue Flow Control}\label{sec:Device Types / Socket Device / Device Operation / Virtqueue Flow Control}
> >> > +
> >> > +The tx virtqueue carries packets initiated by applications and replies to
> >> > +received packets.  The rx virtqueue carries packets initiated by the device and
> >> > +replies to previously transmitted packets.
> >> > +
> >> > +If both rx and tx virtqueues are filled by the driver and device at the same
> >> > +time then it appears that a deadlock is reached.  The driver has no free tx
> >> > +descriptors to send replies.  The device has no free rx descriptors to send
> >> > +replies either.  Therefore neither device nor driver can process virtqueues
> >> > +since that may involve sending new replies.
> >> > +
> >> > +This is solved using additional resources outside the virtqueue to hold
> >> > +packets.  With additional resources, it becomes possible to process incoming
> >> > +packets even when outgoing packets cannot be sent.
> >> > +
> >> > +Eventually even the additional resources will be exhausted and further
> >> > +processing is not possible until the other side processes the virtqueue that
> >> > +it has neglected.  This stop to processing prevents one side from causing
> >> > +unbounded resource consumption in the other side.
> >>
> >> Rather than "This stop to..." did you mean "This is to stop..." or
> >> something else? I can't parse it as is I'm afraid.
> >
> > Wow, I thought I'd crafted some pretty fine prose.  Sorry about that.
> > I'll change it to:
> >
> > This stops one side from causing unbounded resource consumption.
> 
> This is now parsable (thanks) but it's not clear to me how it stops
> this from happening.
> 
> I think what I am missing is the bit which says explicitly "and under
> these circumstances you should stop processing the ring entirely" (as
> opposed to dumping traffic on the floor)
> 
> >> > +\drivernormative{\paragraph}{Device Operation: Virtqueue Flow Control}{Device Types / Socket Device / Device Operation / Virtqueue Flow Control}
> >> > +
> >> > +The rx virtqueue MUST be processed even when the tx virtqueue is full so long as there are additional resources available to hold packets outside the tx virtqueue.
> >> > +
> >> > +\devicenormative{\paragraph}{Device Operation: Virtqueue Flow Control}{Device Types / Socket Device / Device Operation / Virtqueue Flow Control}
> >> > +
> >> > +The tx virtqueue MUST be processed even when the rx virtqueue is full so long as there are additional resources available to hold packets outside the rx virtqueue.
> 
> would s/packets/replies/ in both of these clarify your intent some more?
> 
> In the case of the drivernormative one is it worth being explicit that
> processing the rx virt queue includes replenishing the supply of empty
> descriptors for the other end (not just consuming the filled ones)?
> 
> >> These describe handling the case where additional resources are
> >> available (by saying "keep going if you have resources to do so",
> >> which IMHO doesn't really need stating) but they don't describe what
> >> the acceptable behaviour is when those resources are no longer
> >> available, so really this is just kicking the deadlock can a bit
> >> further down the road without addressing the actual problem (i.e. the
> >> "Eventually even the additional resources..." paragraph from above).
> >
> > No, there is an important point in there which I didn't state clearly:
> >
> > Additional resources are *required*, otherwise virtqueue deadlock is
> > possible under normal operation.
> >
> > Regarding kicking the can down the road:
> >
> > It will always be possible to ignore incoming packets to the point where
> > the peer will have to stop responding, but this isn't a problem.  Under
> > correct operation this state is never reached.
> 
> I'm afraid I'm still missing the piece of the puzzles which causes
> this to be guaranteed to be true.
> 
> I think maybe there is something to do with the fact that a reply can
> never itself require a reply?
> 
> Is there also a requirement that the driver/device handle any
> outstanding ARs before putting any actual data onto the outgoing ring?
> I think there has to be, otherwise you are adding more future replies
> which will eventually clog the ARs.
> 
> Such a requirement could be a bit subtle on the implementation side if
> you have separate Rx and Tx processing threads (as I do), since one
> thread may be injecting data just as the other is producing ARs. (i.e.
> on the device side the RX thread fills the RX ring with data while the
> TX thread is filling the AR queue with replies).
> 
> >> IIRC from previous discussions your intention was to say that a
> >> driver/backend should stop processessing TX requests if the RX ring
> >> becomes full (or I suppose if the "additional resources" are
> >> exhausted), I guess with the implications that it should never stop
> >> processing RX requests (which I think is always possible since from
> >> the driver end there are no "additional resources" needed for RX
> >> processing). But the above essentially explicitly states the OK case
> >> (i.e. the non-deadlock case, resources available) and leaves what to
> >> do on resource exhaustion implicit (while that is the most important
> >> bit of information).
> >>
> >> It's unspecified how much "additional resource" an implementation is
> >> expected/required to provide. Should it be enough for one ring slot?
> >> Or N*slots-per-ring? Is 0 an acceptable number?
> >
> > The way I've implemented the resource limit is to only allow queuing N
> > reply packets when the outgoing virtqueue is full, where N is the size
> > of the incoming virtqueue.
> >
> > If this limit is reached then incoming virtqueue processing is paused
> > until there is space to transmit pending reply packets again.
> 
> Is >=N a requirement, or just a convenient number? I have a feeling
> this has a relationship to the "separate RX and Tx processing threads"
> issue I mention above, specifically the ARs need to be larger than the
> number of data slots which will be consumed before checking for ARs
> (i.e. on the device side how often the RX thread stop processing real
> data in order to check for pending ARs), which is another way of
> saying that you need to stop and check for pending ARs more often than
> every N data slots, what do you think?
> 
> I keep coming back to this "how many ARs are enough" question because
> it seems like if N-1 ARs are not sufficient to avoid deadlock while N
> ARs is then the "inductive step" from N-1 to N ought to be essentially
> the "proof" which guarantees that deadlocks are not possible.
> 
> Sorry to be so pedantic about this, the guarantee here seems very
> subtle to me or perhaps there are simply generic virtio
> assumptions/idioms which I'm not aware of which allow these things to
> be explained more easily.
> 
> >> BTW our backend implementation is now available in
> >> https://github.com/docker/hyperkit/blob/af-vsock/src/pci_virtio_sock.c.
> >> It corresponds to ~v5 of the spec IIIRC (so it still has workarounds
> >> relating to the OP_SHUTDOWN issue which you resolved in v6 and it
> >> hasn't yet switched to 64 bit addressing). WRT the above issue the
> >> "additional resources" are a pending-reply ring
> >> (https://github.com/docker/hyperkit/blob/af-vsock/src/pci_virtio_sock.c#L281)
> >> which I sized to be 2x the ring size (since with 1x I still saw
> >> deadlocks) however people are currently reporting deadlocks (actually
> >> an assert failure in this implementation) with that, so I am
> >> considering making it 4x, but that is just pushing the problem a
> >> further away again without solving.
> >
> > One detail about the limit I've implemented is that it only applies to
> > *reply* packets.  It does not count ordinary outgoing packets.  Perhaps
> > you are counting all outgoing packets.  I'm trusting rlimits, socket
> > buffer sizes, etc to limit ordinary outgoing packets so they don't
> > matter here.
> 
> I'm also only counting reply packets since ordinary outgoing packets
> don't need to go over the pending reply ring (my manifestation of
> ARs), they come directly out of an fd in the relevant thread.
> 
> I also communicate the need for a CREDIT_UPDATE reply as a single
> per-socket bit, so those also do not consume ARs (and often can piggy
> back on other traffic too).

I'm working on a write-up that covers the scheme and shows why it is
correct.  My goal is to prove either that it can deadlock or that it
cannot deadlock.  Hope to finish it by Friday.

By the way, I'll be in Toronto for LinuxCon/KVM Forum.  It would be nice
to chat if you are there too.

Stefan
Attachment: signature.asc
Description: PGP signature
Follow-Ups:
- Re: [PATCH v7] vsock: add vsock device
  - From: Ian Campbell <ian.campbell@docker.com>
References:
- [PATCH v7] vsock: add vsock device
  - From: Stefan Hajnoczi <stefanha@redhat.com>
- Re: [PATCH v7] vsock: add vsock device
  - From: Ian Campbell <ian.campbell@docker.com>
- Re: [PATCH v7] vsock: add vsock device
  - From: Stefan Hajnoczi <stefanha@redhat.com>
- Re: [PATCH v7] vsock: add vsock device
  - From: Ian Campbell <ian.campbell@docker.com>