Subject: Re: [PATCH v6] vsock: add vsock device
On Fri, Apr 22, 2016 at 01:56:24PM +0100, Ian Campbell wrote:
> On 21 April 2016 at 15:36, Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > The virtio vsock device is a zero-configuration socket communications
> > device.  It is designed as a guest<->host management channel suitable
> > for communicating with guest agents.
> >
> > vsock is designed with the sockets API in mind and the driver is
> > typically implemented as an address family (at the same level as
> > AF_INET).  Applications written for the sockets API can be ported with
> > minimal changes (a similar amount of effort as adding IPv6 support to
> > an IPv4 application).
> >
> > Unlike the existing console device, which is also used for guest<->host
> > communication, multiple clients can connect to a server at the same
> > time over vsock.  The console lacks this ability, so console-based
> > users must arbitrate access through a single client.  With vsock they
> > can connect directly and do not have to synchronize with each other.
> >
> > Unlike network devices, no configuration is necessary because the
> > device comes with its address in the configuration space.
> >
> > The vsock device was prototyped by Gerd Hoffmann and Asias He.  I
> > picked the code and design up from them.
>
> First: Do you happen to have this in a git tree, i.e. where I can
> easily run make and get a more readable document?  (Bonus points if
> you are publishing the formatted docs somewhere already ;-))

Your wish is my command!

Pre-built specification:
https://stefanha.github.io/virtio/#x1-2800007

Git repo:
https://github.com/stefanha/virtio

> Secondly, and the main reason for mailing: I've tripped over an
> interesting issue recently.  I'm using your v5 Linux frontend patches
> plus my own backend.
>
> One property of my backend is that when a TX request requires a
> response (e.g. RW => CREDIT_UPDATE, REQUEST => RESPONSE, * => RST) it
> waits synchronously for an RX slot in order to send the reply.
>
> So what happens is that we have a sender in the guest (see the first
> backtrace below) which is holding sk_lock-AF_VSOCK and is waiting in
> virtio_transport_send_one_pkt(), calling virtqueue_add_sgs() in a
> loop, i.e. waiting for a TX slot to become available.
>
> We also have the receive work queue in the guest (see the second
> backtrace below).  It is holding several locks (one for the
> virtio_vsock workqueue, one for vsock->rx_work and one for
> vsock->rx_lock), but it needs sk_lock-AF_VSOCK to do its thing and so
> it is spinning waiting for that.
>
> Now we have the backend: it has a tx thread which wants to send a
> reply, but there are no RX descriptors available right now, so it is
> spinning waiting for one.  While it does this it is not processing
> any more tx requests, so the sender in the guest cannot make progress
> and holds sk_lock-AF_VSOCK forever, preventing the RX workqueue from
> making any progress, and therefore preventing any new RX descriptors
> from being enqueued.  So we have a three-way
> frontend:rx / frontend:tx / backend:tx deadlock.  Since an external
> entity is involved, lockdep is unable to spot this (and in any case I
> doubt it considers the spin+schedule loop in
> virtio_transport_send_one_pkt() to be a "lock").  I had a look at
> Claudio's various fixes but none of them looked to cover this case
> since they were mainly about lockdep issues.
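To make the three-way cycle above concrete, here is a deliberately
simplified userspace model of the dependency.  It is a sketch only:
the thread functions and the two slot counters are invented for
illustration and are not the actual driver or backend code.

/*
 * Illustration only: a userspace model of the three-way dependency.
 *
 *   guest_tx()   holds "sk_lock" while waiting for a free TX slot
 *   rx_worker()  needs "sk_lock" before it can hand back RX slots
 *   backend_tx() waits for an RX slot before it frees a TX slot
 *
 * If guest_tx() takes sk_lock first (as in the traces below), none of
 * the three wait conditions can ever become true.
 */
#include <pthread.h>

static pthread_mutex_t sk_lock   = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t ring_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  ring_cond = PTHREAD_COND_INITIALIZER;
static int tx_slots;   /* free slots in the guest's TX ring (0 = full)     */
static int rx_slots;   /* RX buffers the backend can use for replies (0)   */

static void *guest_tx(void *arg)      /* cf. virtio_transport_send_one_pkt() */
{
    pthread_mutex_lock(&sk_lock);     /* sk_lock-AF_VSOCK is held...         */
    pthread_mutex_lock(&ring_lock);
    while (tx_slots == 0)             /* ...while we wait for a TX slot      */
        pthread_cond_wait(&ring_cond, &ring_lock);
    tx_slots--;
    pthread_mutex_unlock(&ring_lock);
    pthread_mutex_unlock(&sk_lock);
    return NULL;
}

static void *rx_worker(void *arg)     /* cf. virtio_transport_recv_pkt_work() */
{
    pthread_mutex_lock(&sk_lock);     /* blocks forever: guest_tx owns it    */
    pthread_mutex_lock(&ring_lock);
    rx_slots++;                       /* would hand fresh RX buffers back    */
    pthread_cond_broadcast(&ring_cond);
    pthread_mutex_unlock(&ring_lock);
    pthread_mutex_unlock(&sk_lock);
    return NULL;
}

static void *backend_tx(void *arg)    /* host thread handling guest TX       */
{
    pthread_mutex_lock(&ring_lock);
    while (rx_slots == 0)             /* waits for an RX slot for the reply  */
        pthread_cond_wait(&ring_cond, &ring_lock);
    rx_slots--;
    tx_slots++;                       /* only now is a TX slot handed back   */
    pthread_cond_broadcast(&ring_cond);
    pthread_mutex_unlock(&ring_lock);
    return NULL;
}

int main(void)
{
    pthread_t t[3];
    void *(*fn[3])(void *) = { guest_tx, rx_worker, backend_tx };

    for (int i = 0; i < 3; i++)
        pthread_create(&t[i], NULL, fn[i], NULL);
    for (int i = 0; i < 3; i++)
        pthread_join(t[i], NULL);     /* never returns: the cycle never breaks */
    return 0;
}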
> I solved this by removing the synchronous reply from my backend's tx
> thread and instead queuing replies onto a backend-internal ring to be
> processed by the rx thread.  This means the tx thread can continue
> and process more entries, freeing them up for the guest's sender
> thread to use and breaking the deadlock.
>
> However, the issue with this (and the reason I avoided it in favour
> of the synchronous scheme to start with) is that the number of
> replies which can be pending in this manner is effectively unbounded,
> since the guest can continue to produce as many TX requests as it
> likes and may not process the replies quickly enough.  I started with
> a ring the same size as the virtio descriptor ring, but I managed to
> observe that filling up (which would then expose the same deadlock,
> just with an extra link in the chain).  I ended up doubling the size
> and was no longer able to provoke it, but that seems unsatisfactory
> and I assume I am just lucky that the workloads I've got aren't
> hitting it.
>
> I'm unsure what the right answer here is; lots of the obvious answers
> (e.g. throwing stuff on the floor, eliding some CREDIT_UPDATE
> responses) would work against the guaranteed-delivery goal of
> virtio-vsock.
>
> Clearly the deadlock is bad, but equally I don't think we can mandate
> that the backend must be prepared to queue an arbitrary number of
> replies, and if we set some upper bound on that then we re-introduce
> the deadlock (or introduce an occasional "backend shoots the guest
> for using too many slots" failure mode).
>
> About the only thing I can think of would be to mandate that the TX
> and RX processing in the frontend must not block (or spin) while
> holding a common resource (i.e. disallow either from holding
> sk_lock-AF_VSOCK while waiting for things).  I'm not sure how
> practical that is in terms of actual implementation.  It looked like
> the TX side had taken it in common code, but I may have misread.
>
> A separate "replies" ring, distinct from the "(data) rx" ring, might
> work.  Or perhaps a response descriptor could be added to the tx
> chains.  It would only really need the op field and
> buf_alloc + fwd_cnt, not the full set of header fields.  I'm not sure
> whether either of those is idiomatic virtio -- but my gut is telling
> me no.
>
> I'm hoping that I've just done something stupid or am missing some
> blindingly obvious solution.

I need to look seriously into the locking so we can fix it.  The fact
that rx code calls directly into tx code to send control packets is
causing trouble.

Stefan
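For reference, here is a minimal sketch of the backend-internal reply
queue described above.  Everything in it (REPLY_RING_SIZE, struct
reply, the two functions) is hypothetical rather than taken from any
real backend: the tx thread enqueues the control reply instead of
waiting for an RX descriptor, and the rx thread drains the queue
whenever RX descriptors become available.  Picking REPLY_RING_SIZE is
exactly the unresolved bounding question.

/*
 * Hypothetical backend-internal reply queue, illustration only.
 * queue_reply() is called from the tx thread instead of waiting for an
 * RX descriptor; flush_replies() is called from the rx thread whenever
 * the guest posts RX descriptors.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

#define REPLY_RING_SIZE 256             /* e.g. 2x the virtqueue size */

struct reply {                          /* only what a control reply needs */
    uint16_t op;                        /* e.g. RESPONSE, RST, CREDIT_UPDATE */
    uint32_t buf_alloc;
    uint32_t fwd_cnt;
};

static struct reply ring[REPLY_RING_SIZE];
static unsigned int head, tail;         /* head == tail means empty */
static pthread_mutex_t reply_lock = PTHREAD_MUTEX_INITIALIZER;

/* tx thread: enqueue instead of blocking.  Returning false means the
 * ring is full, which is where the deadlock would reappear if we
 * simply waited here. */
static bool queue_reply(const struct reply *r)
{
    bool ok = false;

    pthread_mutex_lock(&reply_lock);
    if (head - tail < REPLY_RING_SIZE) {
        ring[head++ % REPLY_RING_SIZE] = *r;
        ok = true;
    }
    pthread_mutex_unlock(&reply_lock);
    return ok;
}

/* rx thread: drain pending replies into RX descriptors as they appear.
 * post_rx() returns false once no more descriptors are available. */
static void flush_replies(bool (*post_rx)(const struct reply *))
{
    pthread_mutex_lock(&reply_lock);
    while (tail != head && post_rx(&ring[tail % REPLY_RING_SIZE]))
        tail++;
    pthread_mutex_unlock(&reply_lock);
}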
> Sender's trace:
>
> [ 121.218465] INFO: task transfused:1067 blocked for more than 30 seconds.
> [ 121.218908] Not tainted 4.4.6+ #29
> [ 121.219122] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 121.219621] transfused D ffff880072eefb88 0 1067 1 0x00000000
> [ 121.220114] ffff880072eefb88 00ff880076206928 ffff88007d2585c0 ffff880072f2c6c0
> [ 121.220640] ffff880072ef0000 ffff880076206928 ffff880076206800 ffff880076247c00
> [ 121.221176] 0000000000000002 ffff880072eefba0 ffffffff815c0f9b ffff88006bd67f40
> [ 121.221685] Call Trace:
> [ 121.221799] [<ffffffff815c0f9b>] schedule+0x73/0x81
> [ 121.222078] [<ffffffff815b56ad>] virtio_transport_send_one_pkt+0xf4/0x138
> [ 121.222514] [<ffffffff810a1829>] ? wake_up_atomic_t+0x25/0x25
> [ 121.222865] [<ffffffff815b58d1>] virtio_transport_send_pkt+0x1b7/0x1c6
> [ 121.223254] [<ffffffff810a1829>] ? wake_up_atomic_t+0x25/0x25
> [ 121.223616] [<ffffffff815b66fd>] virtio_transport_send_pkt+0x23/0x25
> [ 121.224065] [<ffffffff815b67a0>] virtio_transport_stream_enqueue+0x37/0x3b
> [ 121.224521] [<ffffffff815b4465>] vsock_stream_sendmsg+0x297/0x2f2
> [ 121.224907] [<ffffffff810a1829>] ? wake_up_atomic_t+0x25/0x25
> [ 121.225268] [<ffffffff813f6d24>] sock_sendmsg+0x2e/0x3f
> [ 121.225589] [<ffffffff813f6daa>] sock_write_iter+0x75/0x8d
> [ 121.225931] [<ffffffff8115e44f>] new_sync_write+0x5b/0x7d
> [ 121.226266] [<ffffffff8115e499>] __vfs_write+0x28/0x2a
> [ 121.226579] [<ffffffff8115e896>] vfs_write+0x95/0xc0
> [ 121.226880] [<ffffffff8115f001>] SyS_write+0x4b/0x79
> [ 121.227180] [<ffffffff815c4c72>] entry_SYSCALL_64_fastpath+0x12/0x72
> [ 121.227586] 1 lock held by transfused/1067:
> [ 121.227819] #0: (sk_lock-AF_VSOCK){+.+.+.}, at: [<ffffffff815b3307>] lock_sock+0xb/0xd
>
> RX queue's trace:
>
> [ 121.204369] INFO: task kworker/0:1:22 blocked for more than 30 seconds.
> [ 121.204926] Not tainted 4.4.6+ #29
> [ 121.205198] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 121.205877] kworker/0:1 D ffff88007d25fc50 0 22 2 0x00000000
> [ 121.206508] Workqueue: virtio_vsock virtio_transport_recv_pkt_work
> [ 121.207021] ffff88007d25fc50 00000000fffffe01 ffffffff818b1540 ffff88007d2585c0
> [ 121.207698] ffff88007d260000 ffff880072fd0c08 ffff880072fd0bc8 0000000000855abf
> [ 121.208382] ffff88001a0f1598 ffff88007d25fc68 ffffffff815c0f9b ffff880072fd0b40
> [ 121.209083] Call Trace:
> [ 121.209213] [<ffffffff815c0f9b>] schedule+0x73/0x81
> [ 121.209602] [<ffffffff813f90a2>] __lock_sock+0x6e/0x96
> [ 121.210003] [<ffffffff810a1829>] ? wake_up_atomic_t+0x25/0x25
> [ 121.210478] [<ffffffff813fa7b2>] lock_sock_nested+0x34/0x82
> [ 121.210936] [<ffffffff815b6f3d>] virtio_transport_recv_pkt+0x171/0x471
> [ 121.211485] [<ffffffff815c1dcb>] ? mutex_lock_nested+0x2f8/0x361
> [ 121.211983] [<ffffffff81320f41>] ? virtqueue_get_buf+0xd7/0x110
> [ 121.212476] [<ffffffff815b5d1c>] virtio_transport_recv_pkt_work+0xca/0x144
> [ 121.212994] [<ffffffff81085f28>] process_one_work+0x207/0x397
> [ 121.213364] [<ffffffff8108654e>] worker_thread+0x1e7/0x292
> [ 121.213706] [<ffffffff81086367>] ? rescuer_thread+0x285/0x285
> [ 121.214058] [<ffffffff81086367>] ? rescuer_thread+0x285/0x285
> [ 121.214509] [<ffffffff8108ab0b>] kthread+0xb5/0xbd
> [ 121.214796] [<ffffffff810a7882>] ? lock_acquire+0xd9/0x134
> [ 121.215154] [<ffffffff8108aa56>] ? kthread_parkme+0x1f/0x1f
> [ 121.215502] [<ffffffff815c4fdf>] ret_from_fork+0x3f/0x70
> [ 121.215828] [<ffffffff8108aa56>] ? kthread_parkme+0x1f/0x1f
> [ 121.216160] 3 locks held by kworker/0:1/22:
> [ 121.216408] #0: ("virtio_vsock"){.+.+.+}, at: [<ffffffff81085e64>] process_one_work+0x143/0x397
> [ 121.217058] #1: ((&vsock->rx_work)){+.+.+.}, at: [<ffffffff81085e64>] process_one_work+0x143/0x397
> [ 121.217727] #2: (&vsock->rx_lock){+.+.+.}, at: [<ffffffff815b5cae>] virtio_transport_recv_pkt_work+0x5c/0x144
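As a closing aside on the "ported with minimal changes" claim in the
patch description quoted at the top: a guest-side vsock client is just
the ordinary sockets API with a different address structure.  The
sketch below assumes the existing Linux AF_VSOCK interface (struct
sockaddr_vm from <linux/vm_sockets.h>; the virtio transport is what
these patches add) and an arbitrary host-side listener on port 1234.

/*
 * Minimal guest-side AF_VSOCK client, for comparison with AF_INET.
 * The port number is an example only; it assumes some host-side
 * server is listening there.
 */
#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/vm_sockets.h>

int main(void)
{
    struct sockaddr_vm addr = {
        .svm_family = AF_VSOCK,
        .svm_cid    = VMADDR_CID_HOST,  /* well-known CID of the host */
        .svm_port   = 1234,             /* example port, not from the spec */
    };
    int fd = socket(AF_VSOCK, SOCK_STREAM, 0);

    if (fd < 0 || connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("vsock");
        return 1;
    }

    /* From here on it is just the normal sockets API. */
    write(fd, "hello\n", 6);
    close(fd);
    return 0;
}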