virtio-dev message

Subject: Re: [PATCH v6] vsock: add vsock device
From: Ian Campbell <ian.campbell@docker.com>
To: Stefan Hajnoczi <stefanha@redhat.com>
Date: Fri, 22 Apr 2016 13:56:24 +0100
On 21 April 2016 at 15:36, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> The virtio vsock device is a zero-configuration socket communications
> device.  It is designed as a guest<->host management channel suitable
> for communicating with guest agents.
>
> vsock is designed with the sockets API in mind and the driver is
> typically implemented as an address family (at the same level as
> AF_INET).  Applications written for the sockets API can be ported with
> minimal changes (similar amount of effort as adding IPv6 support to an
> IPv4 application).
>
> Unlike the existing console device, which is also used for guest<->host
> communication, multiple clients can connect to a server at the same time
> over vsock.  This limitation requires console-based users to arbitrate
> access through a single client.  In vsock they can connect directly and
> do not have to synchronize with each other.
>
> Unlike network devices, no configuration is necessary because the device
> comes with its address in the configuration space.
>
> The vsock device was prototyped by Gerd Hoffmann and Asias He.  I picked
> the code and design up from them.

First: Do you happen to have this in a git tree, i.e. where I can
easily run make and get a more readable document? (Bonus points if you
are publishing the formatted docs somewhere already ;-))

Secondly and the main reason for mailing: I've tripped over an
interesting issue recently. I'm using your v5 Linux frontend patches +
my own backend.

One property of my backend is that when a TX request requires a
response (e.g. a RW=>CREDIT_UPDATE, REQUEST=>RESPONSE, *=>RST) it
waits synchronously for an RX slot in order to send the reply.

So what happens is we have a sender in the guest, see first backtrace
below, which is holding sk_lock-AF_VSOCK and is waiting in
virtio_transport_send_one_pkt(), looping calling virtqueue_add_sgs(),
i.e. waiting for a TX slot to become available.

We also have the receive work queue in the guest, see second backtrace
below, it is holding several locks (one for virtio_vsock_workqueue,
one for sock->rx_work and sock->rx_lock) but it needs sk_lock-AF_VSOCK
to do its thing and so it is spinning waiting for it.

Now we have the backend, it has a tx thread which wants to send a
reply, however there are no RX descriptors available right now and so
it is spinning waiting for one. However while it does this it is not
processing any more tx requests, so now the sender in the guest cannot
make progress, so it holds sk_lock-AF_VSOCK forever preventing the RX
workqueue from making any progress, and therefore preventing any new
RX descriptors from being enqueued. So we have a threeway
frontend:rx-frontend:tx-backend:tx deadlock. Since there is an
external entity lockdep is unable to spot this (and in any case I
doubt it considers the spin+schedule loop in
virtio_transport_send_one_pkt() to be a "lock"), I had a look at
Claudio's various fixes but none of them looked to cover this case
since they were mainly about lockdep issues.

I solved this by removing the synchronous reply from my backends tx
thread and instead queuing replies onto a backend-internal ring to be
processed by the rx thread, meaning that the tx thread can continue
and process more entries, freeing them up for the guest's sender
thread to use and breaking the deadlock.

However the issue with this (and the reason I avoided it in favour of
the synchronous scheme to start with) was that the number of replies
which can be pending in this manner is effectively unbounded, since
the guest can continue to produce as many TX requests as it likes and
may not process the replies quickly enough. I started with a ring
which was the same size as the virtio descriptor ring, but I managed
to observe that filling up (which would then expose the same deadlock
but with an extra chain). I ended up doubling the size and was no
longer able to provoke that, but that seems unsatisfactory and I
assume I am just lucky that the workloads I've got aren't hitting it.

I'm unsure what the right answer here is, lots of the obvious answers
(i.e. throwing stuff on the floor, eliding some CREDIT_UPDATE
responses) would work against the guaranteed delivery goal of
virtio-vsock.

Clearly the deadlock is bad, but equally I don't think we can mandate
that the backend must be prepared to queue an arbitrary number of
replies, and if we set some upper bound on that then we re-introduce
the deadlock (or introduce an occasional "backend shoots the guest for
using too many slots" failure mode).

About the only think I can think of would be to mandate that the TX
and RX processing in the frontend must not block (or spin) while
holding a common resource (i.e. disallow either from holding
sk_lock-AF_VSOCK while waiting for things). I'm not sure how practical
that is in terms of actual implementation. It looked like the TX side
had taken it in common code, but I may have misread.

A separate "replies" ring from the "(data) rx" ring might work. Or
perhaps a response-descriptor could be added to the tx chains. It
would only really need the op field and buff_alloc+fwd_cnt, not the
full set. I'm not sure if either of those are idiomatic virtio -- but
my gut is telling me no.

I'm hoping that I've just done something stupid or am missing some
blindingly obvious solution.

Ian.

Sender's trace:

[  121.218465] INFO: task transfused:1067 blocked for more than 30 seconds.
[  121.218908]       Not tainted 4.4.6+ #29
[  121.219122] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[  121.219621] transfused      D ffff880072eefb88     0  1067      1 0x00000000
[  121.220114]  ffff880072eefb88 00ff880076206928 ffff88007d2585c0
ffff880072f2c6c0
[  121.220640]  ffff880072ef0000 ffff880076206928 ffff880076206800
ffff880076247c00
[  121.221176]  0000000000000002 ffff880072eefba0 ffffffff815c0f9b
ffff88006bd67f40
[  121.221685] Call Trace:
[  121.221799]  [<ffffffff815c0f9b>] schedule+0x73/0x81
[  121.222078]  [<ffffffff815b56ad>] virtio_transport_send_one_pkt+0xf4/0x138
[  121.222514]  [<ffffffff810a1829>] ? wake_up_atomic_t+0x25/0x25
[  121.222865]  [<ffffffff815b58d1>] virtio_transport_send_pkt+0x1b7/0x1c6
[  121.223254]  [<ffffffff810a1829>] ? wake_up_atomic_t+0x25/0x25
[  121.223616]  [<ffffffff815b66fd>] virtio_transport_send_pkt+0x23/0x25
[  121.224065]  [<ffffffff815b67a0>] virtio_transport_stream_enqueue+0x37/0x3b
[  121.224521]  [<ffffffff815b4465>] vsock_stream_sendmsg+0x297/0x2f2
[  121.224907]  [<ffffffff810a1829>] ? wake_up_atomic_t+0x25/0x25
[  121.225268]  [<ffffffff813f6d24>] sock_sendmsg+0x2e/0x3f
[  121.225589]  [<ffffffff813f6daa>] sock_write_iter+0x75/0x8d
[  121.225931]  [<ffffffff8115e44f>] new_sync_write+0x5b/0x7d
[  121.226266]  [<ffffffff8115e499>] __vfs_write+0x28/0x2a
[  121.226579]  [<ffffffff8115e896>] vfs_write+0x95/0xc0
[  121.226880]  [<ffffffff8115f001>] SyS_write+0x4b/0x79
[  121.227180]  [<ffffffff815c4c72>] entry_SYSCALL_64_fastpath+0x12/0x72
[  121.227586] 1 lock held by transfused/1067:
[  121.227819]  #0:  (sk_lock-AF_VSOCK){+.+.+.}, at:
[<ffffffff815b3307>] lock_sock+0xb/0xd

RX queue's trace:

[  121.204369] INFO: task kworker/0:1:22 blocked for more than 30 seconds.
[  121.204926]       Not tainted 4.4.6+ #29
[  121.205198] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[  121.205877] kworker/0:1     D ffff88007d25fc50     0    22      2 0x00000000
[  121.206508] Workqueue: virtio_vsock virtio_transport_recv_pkt_work
[  121.207021]  ffff88007d25fc50 00000000fffffe01 ffffffff818b1540
ffff88007d2585c0
[  121.207698]  ffff88007d260000 ffff880072fd0c08 ffff880072fd0bc8
0000000000855abf
[  121.208382]  ffff88001a0f1598 ffff88007d25fc68 ffffffff815c0f9b
ffff880072fd0b40
[  121.209083] Call Trace:
[  121.209213]  [<ffffffff815c0f9b>] schedule+0x73/0x81
[  121.209602]  [<ffffffff813f90a2>] __lock_sock+0x6e/0x96
[  121.210003]  [<ffffffff810a1829>] ? wake_up_atomic_t+0x25/0x25
[  121.210478]  [<ffffffff813fa7b2>] lock_sock_nested+0x34/0x82
[  121.210936]  [<ffffffff815b6f3d>] virtio_transport_recv_pkt+0x171/0x471
[  121.211485]  [<ffffffff815c1dcb>] ? mutex_lock_nested+0x2f8/0x361
[  121.211983]  [<ffffffff81320f41>] ? virtqueue_get_buf+0xd7/0x110
[  121.212476]  [<ffffffff815b5d1c>] virtio_transport_recv_pkt_work+0xca/0x144
[  121.212994]  [<ffffffff81085f28>] process_one_work+0x207/0x397
[  121.213364]  [<ffffffff8108654e>] worker_thread+0x1e7/0x292
[  121.213706]  [<ffffffff81086367>] ? rescuer_thread+0x285/0x285
[  121.214058]  [<ffffffff81086367>] ? rescuer_thread+0x285/0x285
[  121.214509]  [<ffffffff8108ab0b>] kthread+0xb5/0xbd
[  121.214796]  [<ffffffff810a7882>] ? lock_acquire+0xd9/0x134
[  121.215154]  [<ffffffff8108aa56>] ? kthread_parkme+0x1f/0x1f
[  121.215502]  [<ffffffff815c4fdf>] ret_from_fork+0x3f/0x70
[  121.215828]  [<ffffffff8108aa56>] ? kthread_parkme+0x1f/0x1f
[  121.216160] 3 locks held by kworker/0:1/22:
[  121.216408]  #0:  ("virtio_vsock"){.+.+.+}, at:
[<ffffffff81085e64>] process_one_work+0x143/0x397
[  121.217058]  #1:  ((&vsock->rx_work)){+.+.+.}, at:
[<ffffffff81085e64>] process_one_work+0x143/0x397
[  121.217727]  #2:  (&vsock->rx_lock){+.+.+.}, at:
[<ffffffff815b5cae>] virtio_transport_recv_pkt_work+0x5c/0x144
Follow-Ups:
- Re: [virtio-dev] Re: [PATCH v6] vsock: add vsock device
  - From: Stefan Hajnoczi <stefanha@redhat.com>
- Re: [PATCH v6] vsock: add vsock device
  - From: Stefan Hajnoczi <stefanha@redhat.com>
References:
- [PATCH v6] vsock: add vsock device
  - From: Stefan Hajnoczi <stefanha@redhat.com>