Subject: Re: [virtio-comment] Seeking guidance for custom virtIO device
On 2020-04-14 13:50, Stefan Hajnoczi wrote:
> On Fri, Apr 10, 2020 at 12:09:22PM +0200, Stefano Garzarella wrote:
>> Hi,
>>
>> On Fri, Apr 10, 2020 at 09:36:58AM +0000, Eftime, Petre wrote:
>>> Hi all,
>>>
>>> I am looking for guidance on how to proceed with regards to either reserving a virtio device ID for a specific device for a particular usecase or for formalizing a device type that could be potentially used by others.
>>>
>>> We have developed a virtio device that acts as a transport for API calls between a guest userspace library and a backend server in the host system. Our requirements are:
>>> * multiple clients in the guest (multiple servers is not required)
>>> * provide an in-order, reliable datagram transport mechanism
>>> * datagram size should be either negotiable or large (16k-64k?)
>>> * performance is not a big concern for our usecase
>>
>> It looks really close to vsock.
>>
>>> The reason why we used a special device and not something else is the following:
>>> * vsock spec does not contain a datagram specification (eg. SOCK_DGRAM, SOCK_SEQPACKET) and the effort of updating the Linux driver and other implementations for this particular purpose seemed relatively high. The path to approach this problem wasn't clear. Vsock today only works in SOCK_STREAM mode and this is not ideal: the receiver must implement additional state and buffer incoming data, adding complexity and host resource usage.
>>
>> AF_VSOCK itself supports SOCK_DGRAM, but virtio-vsock doesn't provide this feature. (vmci provides SOCK_DGRAM support)
>>
>> The changes should not be too intrusive in the virtio-vsock specs and implementation, we already have the "type" field in the packet header to address this new feature.
>>
>> We also have the credit-mechanism to provide in-order and reliable packets delivery.
>>
>> Maybe the hardest part could be changing something in the core to handle multiple transports that provide SOCK_DGRAM, for nested VMs. We already did for stream sockets, but we didn't handle the datagram socket for now.
>>
>> I am not sure how convenient it is to have two very similar devices...
>>
>> If you decide to give virtio-vsock a chance to get SOCK_DGRAM, I can try to give you a more complete list of changes to make. :-)
>
> I also think this sounds exactly like adding SOCK_DGRAM support to virtio-vsock.
>
> The reason why the SOCK_DGRAM code was dropped from early virtio-vsock patches is that the protocol design didn't ensure reliable delivery semantics. At that time there were no real users for SOCK_DGRAM so it was left as a feature to be added later.
>
> The challenge with reusing the SOCK_STREAM credit mechanism for SOCK_DGRAM is that datagrams are connectionless. The credit mechanism consists of per-connection state. Maybe it can be extended to cover SOCK_DGRAM too.
>
> I would urge you to add SOCK_DGRAM to virtio-vsock instead of trying to create another device that does basically what is within the scope of virtio-vsock. It took quite a bit of time and effort to get AF_VSOCK support into various software components, and doing that again for another device is more effort than one would think.
>
> If you don't want to modify the Linux guest driver, then let's just discuss the device spec and protocol. Someone else could make the Linux driver changes.
>
> Stefan
I think it would be great if we could get the virtio-vsock driver to support SOCK_DGRAM/SOCK_SEQPACKET as it would make a lot of sense.
But one of the reasons I don't really like virtio-vsock for my use case at the moment is that it doesn't seem well suited to non-cooperating, live-migratable VMs. One problem is that avoiding guest-visible disconnections from any service during a live migration might have a performance impact on anything else that uses vsock.
I'll try to illustrate what I mean with this setup:
    * workload 1 sends data constantly via an AF_VSOCK SOCK_STREAM socket
    * workload 2 sends commands / gets replies once in a while via an AF_VSOCK SOCK_SEQPACKET socket

Assume the VM needs to be migrated:
    1) If workload 2 is currently not processing anything, even if there are some commands for it queued up, everything is fine: the VMM can pause the guest and serialize.
    2) If there's an outstanding command, the VMM needs to wait for it to finish and for the receive queue of the request to have enough capacity for the reply. Since this capacity is guest driven, this second part can take a while or forever. This is definitely not ideal.
To fix this issue, the VMM can keep any non-finishable command in the queue until it can properly finish it. That is, it won't remove the command from the queue until it can push the reply back as well. If it needs to migrate, it can restart servicing the commands from the queue on the other side. But this has a big impact on workload 1, which can be blocked by workload 2 from making meaningful progress.
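To make the blocking effect concrete, here is a toy model of it. This is only a sketch under my own assumptions: the single shared queue, the `Packet` type and the `drain` function are hypothetical names for illustration and do not correspond to virtio-vsock or any VMM code.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Packet:
    workload: int              # 1 = stream data, 2 = command expecting a reply
    needs_reply: bool = False


def drain(queue, reply_space_available):
    """Process packets in order from one shared queue (hypothetical model).

    A command that needs a reply stays at the head of the queue until the
    guest has made room for that reply, so it can be replayed after a
    migration. Returns the packets that were actually processed.
    """
    processed = []
    while queue:
        head = queue[0]
        if head.needs_reply and not reply_space_available:
            break  # head-of-line blocking: everything behind it waits too
        processed.append(queue.popleft())
    return processed


shared = deque([
    Packet(workload=1),
    Packet(workload=2, needs_reply=True),
    Packet(workload=1),
    Packet(workload=1),
])
done = drain(shared, reply_space_available=False)
print(len(done), len(shared))  # prints "1 3"
```

Only the first stream packet gets through. The two stream packets queued behind the command are stuck until the guest provides reply space, even though they have nothing to do with workload 2.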
I could potentially solve some of the issues above by playing around with the credit system, but it seems like it could be challenging to do well: if the device advertises 32k of buffer space, the guest could place two 16k commands before the device can tell it that there is actually no space left. This also doesn't scale all that well. Credits are per stream, so once the device gets one command from one stream, it needs to tell all the other streams to back off, and when it can handle another command, it needs to advertise back to all of them that everything is fine, consuming available capacity.
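A small sketch of the race with the 32k example. The free-space formula follows the credit accounting in the virtio socket device specification (`peer_free = peer_buf_alloc - (tx_cnt - peer_fwd_cnt)`); the `GuestTxCredit` class and its method names are made up for illustration and are not the Linux driver's API.

```python
KIB = 1024


class GuestTxCredit:
    """Guest-side view of the device's receive buffer for one stream
    (hypothetical helper, for illustration only)."""

    def __init__(self, peer_buf_alloc):
        self.peer_buf_alloc = peer_buf_alloc  # last buf_alloc seen from the device
        self.peer_fwd_cnt = 0                 # last fwd_cnt seen from the device
        self.tx_cnt = 0                       # total bytes sent so far

    def peer_free(self):
        # Free space as the guest believes it to be; may be stale.
        return self.peer_buf_alloc - (self.tx_cnt - self.peer_fwd_cnt)

    def try_send(self, size):
        if size > self.peer_free():
            return False
        self.tx_cnt += size
        return True

    def credit_update(self, buf_alloc, fwd_cnt):
        self.peer_buf_alloc = buf_alloc
        self.peer_fwd_cnt = fwd_cnt


guest = GuestTxCredit(peer_buf_alloc=32 * KIB)
first = guest.try_send(16 * KIB)    # accepted
second = guest.try_send(16 * KIB)   # also accepted: the guest's view is stale
# Only now does the device's "no space left" update reach the guest.
guest.credit_update(buf_alloc=0, fwd_cnt=0)
third = guest.try_send(16 * KIB)    # refused, but two commands are already in flight
print(first, second, third)
```

The device wanted to accept a single command, yet by the time its update arrives the guest has already queued two.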
In short, I think workload 2 needs to be in control of its own queues for this to work reasonably well, and I don't know if sharing ownership of queues can work. The device we defined doesn't have this problem: first of all, it's on a separate queue, so workload 1 never competes in any way with workload 2, and workload 2 always has somewhere to place replies, since each request has an attached reply buffer by design.
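Roughly, the idea of an attached reply buffer looks like the sketch below. The `Request` layout, `handle` and the `backend` callable are assumptions made up for this example; they are not the actual device's format.

```python
from dataclasses import dataclass


@dataclass
class Request:
    command: bytes          # device-readable part, filled in by the guest
    reply_buf: bytearray    # device-writable part, supplied with the request
    reply_len: int = 0


def handle(req, backend):
    """Complete a request in place (hypothetical device-side handler).

    The reply goes into the buffer that arrived with the request, so
    completion never depends on space in a separate receive queue.
    """
    reply = backend(req.command)
    if len(reply) > len(req.reply_buf):
        raise ValueError("reply does not fit in the attached buffer")
    req.reply_buf[:len(reply)] = reply
    req.reply_len = len(reply)
    return req


req = handle(Request(b"ping", bytearray(16)), lambda cmd: cmd.upper())
print(bytes(req.reply_buf[:req.reply_len]))  # prints b'PING'
```

Because the guest provides the reply space up front, the device can always finish a command it has started, which is what makes pausing for migration straightforward.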
Perhaps a good compromise would be to have a multi-queue virtio-vsock, or to allow AF_VSOCK to be backed by more than one device with some minimal routing on a per-destination basis?
Best,
Petre Eftime

Amazon Development Center (Romania) S.R.L. registered office: 27A Sf. Lazar Street, UBC5, floor 2, Iasi, Iasi County, 700045, Romania. Registered in Romania. Registration number J22/2621/2005.