Subject: Re: [PATCH v6] vsock: add vsock device
On Fri, Apr 22, 2016 at 01:56:24PM +0100, Ian Campbell wrote:
> On 21 April 2016 at 15:36, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > The virtio vsock device is a zero-configuration socket communications
> > device.  It is designed as a guest<->host management channel suitable
> > for communicating with guest agents.
> >
> > vsock is designed with the sockets API in mind and the driver is
> > typically implemented as an address family (at the same level as
> > AF_INET).  Applications written for the sockets API can be ported with
> > minimal changes (a similar amount of effort as adding IPv6 support to
> > an IPv4 application).
> >
> > Unlike the existing console device, which is also used for guest<->host
> > communication, multiple clients can connect to a server at the same
> > time over vsock.  The console lacks this ability, so console-based
> > users must arbitrate access through a single client.  With vsock they
> > can connect directly and do not have to synchronize with each other.
> >
> > Unlike network devices, no configuration is necessary because the
> > device comes with its address in the configuration space.
> >
> > The vsock device was prototyped by Gerd Hoffmann and Asias He.  I
> > picked the code and design up from them.
>
> First: Do you happen to have this in a git tree, i.e. where I can
> easily run make and get a more readable document?  (Bonus points if
> you are publishing the formatted docs somewhere already ;-))

Your wish is my command!

Pre-built specification:
https://stefanha.github.io/virtio/#x1-2800007

Git repo:
https://github.com/stefanha/virtio

> Secondly, and the main reason for mailing: I've tripped over an
> interesting issue recently.  I'm using your v5 Linux frontend patches
> plus my own backend.
>
> One property of my backend is that when a TX request requires a
> response (e.g. RW => CREDIT_UPDATE, REQUEST => RESPONSE, * => RST) it
> waits synchronously for an RX slot in order to send the reply.
>
> So what happens is that we have a sender in the guest (see the first
> backtrace below) which is holding sk_lock-AF_VSOCK and is waiting in
> virtio_transport_send_one_pkt(), calling virtqueue_add_sgs() in a
> loop, i.e. waiting for a TX slot to become available.
>
> We also have the receive work queue in the guest (see the second
> backtrace below).  It is holding several locks (one for the
> virtio_vsock workqueue, one for vsock->rx_work and one for
> vsock->rx_lock), but it needs sk_lock-AF_VSOCK to do its thing and so
> it is spinning waiting for that.
>
> Now we have the backend: it has a tx thread which wants to send a
> reply, but there are no RX descriptors available right now, so it is
> spinning waiting for one.  While it does this it is not processing
> any more tx requests, so the sender in the guest cannot make progress
> and holds sk_lock-AF_VSOCK forever, preventing the RX workqueue from
> making any progress, and therefore preventing any new RX descriptors
> from being enqueued.  So we have a three-way
> frontend:rx / frontend:tx / backend:tx deadlock.  Since an external
> entity is involved, lockdep is unable to spot this (and in any case I
> doubt it considers the spin+schedule loop in
> virtio_transport_send_one_pkt() to be a "lock").  I had a look at
> Claudio's various fixes but none of them looked to cover this case
> since they were mainly about lockdep issues.
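To make the three-way cycle above concrete, here is a deliberately
simplified userspace model of the dependency.  It is a sketch only:
the thread functions and the two slot counters are invented for
illustration and are not the actual driver or backend code.

/*
 * Illustration only: a userspace model of the three-way dependency.
 *
 *   guest_tx()   holds "sk_lock" while waiting for a free TX slot
 *   rx_worker()  needs "sk_lock" before it can hand back RX slots
 *   backend_tx() waits for an RX slot before it frees a TX slot
 *
 * If guest_tx() takes sk_lock first (as in the traces below), none of
 * the three wait conditions can ever become true.
 */
#include <pthread.h>

static pthread_mutex_t sk_lock   = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t ring_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  ring_cond = PTHREAD_COND_INITIALIZER;
static int tx_slots;   /* free slots in the guest's TX ring (0 = full)     */
static int rx_slots;   /* RX buffers the backend can use for replies (0)   */

static void *guest_tx(void *arg)      /* cf. virtio_transport_send_one_pkt() */
{
    pthread_mutex_lock(&sk_lock);     /* sk_lock-AF_VSOCK is held...         */
    pthread_mutex_lock(&ring_lock);
    while (tx_slots == 0)             /* ...while we wait for a TX slot      */
        pthread_cond_wait(&ring_cond, &ring_lock);
    tx_slots--;
    pthread_mutex_unlock(&ring_lock);
    pthread_mutex_unlock(&sk_lock);
    return NULL;
}

static void *rx_worker(void *arg)     /* cf. virtio_transport_recv_pkt_work() */
{
    pthread_mutex_lock(&sk_lock);     /* blocks forever: guest_tx owns it    */
    pthread_mutex_lock(&ring_lock);
    rx_slots++;                       /* would hand fresh RX buffers back    */
    pthread_cond_broadcast(&ring_cond);
    pthread_mutex_unlock(&ring_lock);
    pthread_mutex_unlock(&sk_lock);
    return NULL;
}

static void *backend_tx(void *arg)    /* host thread handling guest TX       */
{
    pthread_mutex_lock(&ring_lock);
    while (rx_slots == 0)             /* waits for an RX slot for the reply  */
        pthread_cond_wait(&ring_cond, &ring_lock);
    rx_slots--;
    tx_slots++;                       /* only now is a TX slot handed back   */
    pthread_cond_broadcast(&ring_cond);
    pthread_mutex_unlock(&ring_lock);
    return NULL;
}

int main(void)
{
    pthread_t t[3];
    void *(*fn[3])(void *) = { guest_tx, rx_worker, backend_tx };

    for (int i = 0; i < 3; i++)
        pthread_create(&t[i], NULL, fn[i], NULL);
    for (int i = 0; i < 3; i++)
        pthread_join(t[i], NULL);     /* never returns: the cycle never breaks */
    return 0;
}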
> I solved this by removing the synchronous reply from my backend's tx
> thread and instead queuing replies onto a backend-internal ring to be
> processed by the rx thread.  This means the tx thread can continue
> and process more entries, freeing them up for the guest's sender
> thread to use and breaking the deadlock.
>
> However, the issue with this (and the reason I avoided it in favour
> of the synchronous scheme to start with) is that the number of
> replies which can be pending in this manner is effectively unbounded,
> since the guest can continue to produce as many TX requests as it
> likes and may not process the replies quickly enough.  I started with
> a ring the same size as the virtio descriptor ring, but I managed to
> observe that filling up (which would then expose the same deadlock,
> just with an extra link in the chain).  I ended up doubling the size
> and was no longer able to provoke it, but that seems unsatisfactory
> and I assume I am just lucky that the workloads I've got aren't
> hitting it.
>
> I'm unsure what the right answer here is; lots of the obvious answers
> (e.g. throwing stuff on the floor, eliding some CREDIT_UPDATE
> responses) would work against the guaranteed-delivery goal of
> virtio-vsock.
>
> Clearly the deadlock is bad, but equally I don't think we can mandate
> that the backend must be prepared to queue an arbitrary number of
> replies, and if we set some upper bound on that then we re-introduce
> the deadlock (or introduce an occasional "backend shoots the guest
> for using too many slots" failure mode).
>
> About the only thing I can think of would be to mandate that the TX
> and RX processing in the frontend must not block (or spin) while
> holding a common resource (i.e. disallow either from holding
> sk_lock-AF_VSOCK while waiting for things).  I'm not sure how
> practical that is in terms of actual implementation.  It looked like
> the TX side had taken it in common code, but I may have misread.
>
> A separate "replies" ring, distinct from the "(data) rx" ring, might
> work.  Or perhaps a response descriptor could be added to the tx
> chains.  It would only really need the op field and
> buf_alloc + fwd_cnt, not the full set of header fields.  I'm not sure
> whether either of those is idiomatic virtio -- but my gut is telling
> me no.
>
> I'm hoping that I've just done something stupid or am missing some
> blindingly obvious solution.

I need to look seriously into the locking so we can fix it.  The fact
that rx code calls directly into tx code to send control packets is
causing trouble.

Stefan
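For reference, here is a minimal sketch of the backend-internal reply
queue described above.  Everything in it (REPLY_RING_SIZE, struct
reply, the two functions) is hypothetical rather than taken from any
real backend: the tx thread enqueues the control reply instead of
waiting for an RX descriptor, and the rx thread drains the queue
whenever RX descriptors become available.  Picking REPLY_RING_SIZE is
exactly the unresolved bounding question.

/*
 * Hypothetical backend-internal reply queue, illustration only.
 * queue_reply() is called from the tx thread instead of waiting for an
 * RX descriptor; flush_replies() is called from the rx thread whenever
 * the guest posts RX descriptors.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

#define REPLY_RING_SIZE 256             /* e.g. 2x the virtqueue size */

struct reply {                          /* only what a control reply needs */
    uint16_t op;                        /* e.g. RESPONSE, RST, CREDIT_UPDATE */
    uint32_t buf_alloc;
    uint32_t fwd_cnt;
};

static struct reply ring[REPLY_RING_SIZE];
static unsigned int head, tail;         /* head == tail means empty */
static pthread_mutex_t reply_lock = PTHREAD_MUTEX_INITIALIZER;

/* tx thread: enqueue instead of blocking.  Returning false means the
 * ring is full, which is where the deadlock would reappear if we
 * simply waited here. */
static bool queue_reply(const struct reply *r)
{
    bool ok = false;

    pthread_mutex_lock(&reply_lock);
    if (head - tail < REPLY_RING_SIZE) {
        ring[head++ % REPLY_RING_SIZE] = *r;
        ok = true;
    }
    pthread_mutex_unlock(&reply_lock);
    return ok;
}

/* rx thread: drain pending replies into RX descriptors as they appear.
 * post_rx() returns false once no more descriptors are available. */
static void flush_replies(bool (*post_rx)(const struct reply *))
{
    pthread_mutex_lock(&reply_lock);
    while (tail != head && post_rx(&ring[tail % REPLY_RING_SIZE]))
        tail++;
    pthread_mutex_unlock(&reply_lock);
}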
> Sender's trace:
>
> [ 121.218465] INFO: task transfused:1067 blocked for more than 30 seconds.
> [ 121.218908] Not tainted 4.4.6+ #29
> [ 121.219122] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 121.219621] transfused D ffff880072eefb88 0 1067 1 0x00000000
> [ 121.220114] ffff880072eefb88 00ff880076206928 ffff88007d2585c0 ffff880072f2c6c0
> [ 121.220640] ffff880072ef0000 ffff880076206928 ffff880076206800 ffff880076247c00
> [ 121.221176] 0000000000000002 ffff880072eefba0 ffffffff815c0f9b ffff88006bd67f40
> [ 121.221685] Call Trace:
> [ 121.221799] [<ffffffff815c0f9b>] schedule+0x73/0x81
> [ 121.222078] [<ffffffff815b56ad>] virtio_transport_send_one_pkt+0xf4/0x138
> [ 121.222514] [<ffffffff810a1829>] ? wake_up_atomic_t+0x25/0x25
> [ 121.222865] [<ffffffff815b58d1>] virtio_transport_send_pkt+0x1b7/0x1c6
> [ 121.223254] [<ffffffff810a1829>] ? wake_up_atomic_t+0x25/0x25
> [ 121.223616] [<ffffffff815b66fd>] virtio_transport_send_pkt+0x23/0x25
> [ 121.224065] [<ffffffff815b67a0>] virtio_transport_stream_enqueue+0x37/0x3b
> [ 121.224521] [<ffffffff815b4465>] vsock_stream_sendmsg+0x297/0x2f2
> [ 121.224907] [<ffffffff810a1829>] ? wake_up_atomic_t+0x25/0x25
> [ 121.225268] [<ffffffff813f6d24>] sock_sendmsg+0x2e/0x3f
> [ 121.225589] [<ffffffff813f6daa>] sock_write_iter+0x75/0x8d
> [ 121.225931] [<ffffffff8115e44f>] new_sync_write+0x5b/0x7d
> [ 121.226266] [<ffffffff8115e499>] __vfs_write+0x28/0x2a
> [ 121.226579] [<ffffffff8115e896>] vfs_write+0x95/0xc0
> [ 121.226880] [<ffffffff8115f001>] SyS_write+0x4b/0x79
> [ 121.227180] [<ffffffff815c4c72>] entry_SYSCALL_64_fastpath+0x12/0x72
> [ 121.227586] 1 lock held by transfused/1067:
> [ 121.227819] #0: (sk_lock-AF_VSOCK){+.+.+.}, at: [<ffffffff815b3307>] lock_sock+0xb/0xd
>
> RX queue's trace:
>
> [ 121.204369] INFO: task kworker/0:1:22 blocked for more than 30 seconds.
> [ 121.204926] Not tainted 4.4.6+ #29
> [ 121.205198] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 121.205877] kworker/0:1 D ffff88007d25fc50 0 22 2 0x00000000
> [ 121.206508] Workqueue: virtio_vsock virtio_transport_recv_pkt_work
> [ 121.207021] ffff88007d25fc50 00000000fffffe01 ffffffff818b1540 ffff88007d2585c0
> [ 121.207698] ffff88007d260000 ffff880072fd0c08 ffff880072fd0bc8 0000000000855abf
> [ 121.208382] ffff88001a0f1598 ffff88007d25fc68 ffffffff815c0f9b ffff880072fd0b40
> [ 121.209083] Call Trace:
> [ 121.209213] [<ffffffff815c0f9b>] schedule+0x73/0x81
> [ 121.209602] [<ffffffff813f90a2>] __lock_sock+0x6e/0x96
> [ 121.210003] [<ffffffff810a1829>] ? wake_up_atomic_t+0x25/0x25
> [ 121.210478] [<ffffffff813fa7b2>] lock_sock_nested+0x34/0x82
> [ 121.210936] [<ffffffff815b6f3d>] virtio_transport_recv_pkt+0x171/0x471
> [ 121.211485] [<ffffffff815c1dcb>] ? mutex_lock_nested+0x2f8/0x361
> [ 121.211983] [<ffffffff81320f41>] ? virtqueue_get_buf+0xd7/0x110
> [ 121.212476] [<ffffffff815b5d1c>] virtio_transport_recv_pkt_work+0xca/0x144
> [ 121.212994] [<ffffffff81085f28>] process_one_work+0x207/0x397
> [ 121.213364] [<ffffffff8108654e>] worker_thread+0x1e7/0x292
> [ 121.213706] [<ffffffff81086367>] ? rescuer_thread+0x285/0x285
> [ 121.214058] [<ffffffff81086367>] ? rescuer_thread+0x285/0x285
> [ 121.214509] [<ffffffff8108ab0b>] kthread+0xb5/0xbd
> [ 121.214796] [<ffffffff810a7882>] ? lock_acquire+0xd9/0x134
> [ 121.215154] [<ffffffff8108aa56>] ? kthread_parkme+0x1f/0x1f
> [ 121.215502] [<ffffffff815c4fdf>] ret_from_fork+0x3f/0x70
> [ 121.215828] [<ffffffff8108aa56>] ? kthread_parkme+0x1f/0x1f
> [ 121.216160] 3 locks held by kworker/0:1/22:
> [ 121.216408] #0: ("virtio_vsock"){.+.+.+}, at: [<ffffffff81085e64>] process_one_work+0x143/0x397
> [ 121.217058] #1: ((&vsock->rx_work)){+.+.+.}, at: [<ffffffff81085e64>] process_one_work+0x143/0x397
> [ 121.217727] #2: (&vsock->rx_lock){+.+.+.}, at: [<ffffffff815b5cae>] virtio_transport_recv_pkt_work+0x5c/0x144
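As a closing aside on the "ported with minimal changes" claim in the
patch description quoted at the top: a guest-side vsock client is just
the ordinary sockets API with a different address structure.  The
sketch below assumes the existing Linux AF_VSOCK interface (struct
sockaddr_vm from <linux/vm_sockets.h>; the virtio transport is what
these patches add) and an arbitrary host-side listener on port 1234.

/*
 * Minimal guest-side AF_VSOCK client, for comparison with AF_INET.
 * The port number is an example only; it assumes some host-side
 * server is listening there.
 */
#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/vm_sockets.h>

int main(void)
{
    struct sockaddr_vm addr = {
        .svm_family = AF_VSOCK,
        .svm_cid    = VMADDR_CID_HOST,  /* well-known CID of the host */
        .svm_port   = 1234,             /* example port, not from the spec */
    };
    int fd = socket(AF_VSOCK, SOCK_STREAM, 0);

    if (fd < 0 || connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("vsock");
        return 1;
    }

    /* From here on it is just the normal sockets API. */
    write(fd, "hello\n", 6);
    close(fd);
    return 0;
}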