virtio-comment message

Subject: Re: [virtio-comment] Re: [PATCH v1] docs/vhost-user: extend the vhost-user protocol to support the vhost-pci based inter-vm communication

From: Wei Wang <wei.w.wang@intel.com>
To: Marc-André Lureau <marcandre.lureau@gmail.com>
Date: Fri, 11 Nov 2016 16:24:44 +0800

Thanks for your comments, Marc-André.

On 11/10/2016 08:26 PM, Marc-André Lureau wrote:

Hi Wei

On Wed, Nov 9, 2016 at 12:32 PM Wei Wang <wei.w.wang@intel.com<mailto:wei.w.wang@intel.com>> wrote:


    On 11/08/2016 08:17 PM, Marc-André Lureau wrote:
    > >
    >
    >     >      Message Specification
    >     >      ---------------------
    >     >
    >     >      Note that all numbers are in the machine native byte
    order. A
    >     >     vhost-user message
    >     >     -consists of 3 header fields and a payload:
    >     >     +consists of 4 header fields and a payload:
    >     >
    >     >     -------------------------------------
    >     >     -| request | flags | size | payload |
    >     >     -------------------------------------
    >     >  +----------------------------------------------
    >     >     +| request | flags | conn_id | size | payload |
    >     >  +----------------------------------------------
    >     >
    >     >       * Request: 32-bit type of the request
    >     >       * Flags: 32-bit bit field:
    >     >         - Lower 2 bits are the version (currently 0x01)
    >     >     -   - Bit 2 is the reply flag - needs to be sent on
    each reply
    >     >     from the slave
    >     >     +   - Bit 2 is the reply flag - needs to be sent on
    each reply
    >     >         - Bit 3 is the need_reply flag - see
    >     >     VHOST_USER_PROTOCOL_F_REPLY_ACK for
    >     >           details.
    >     >     + * Conn_id: 64-bit connection id to indentify a
    client socket
    >     >     connection. It is
    >     >     +            introduced in version 0x02 to support the
    >     >     "1-server-N-client" model
    >     >     +            and an asynchronous client read
    implementation. The
    >     >     connection id,
    >     >     +            0xFFFFFFFFFFFFFFFF, is used by an
    anonymous client
    >     >     (e.g. a client who
    >     >     +            has not got its connection id from the server
    >     in the
    >     >     initial talk)
    >     >
    >     >
    >     > I don't understand why you need a connection id, on each
    message.
    >     > What's the purpose? Since the communication is unicast, a
    single
    >     > message should be enough.
    >
    >     Sure, please let me explain more:
    >     The QEMU socket is going to be upgraded to support 1 server
    socket
    >     being
    >     connected by multiple client sockets (I've made patches to
    achieve
    >     this). In other words, here, multiple masters will connect
    to one
    >     slave,
    >     and the slave creates a vhost-pci device for each master after
    >     receiving

Before handling hotplugging and multiple vhost-pci, I think it wouldbe simpler to support 1-1.

Yes, we can start from 1-1 (server-client model), but I think we stillneed to use the hotplugging method for the following reasons:

1) the vhost-pci-net device (slave) is created based on theconfiguration info from the master VM - for example, the vhost-pci-netBAR size is set according to the memory size of the master VM.Typically, the master VM gets booted later than the slave VM, since theqemu client socket (master) needs to connect to the server socket(slave). So, I think we should let the socket server manages the device(creation/hotplug/destruction/). Instead of creating multiplevhost-pci-net devices, each server socket manages only one vhost-pcidevice for now. We can boot a VM like this:

-chardev socket,id=vp-server1,server,path=/opt/vp-server1
-chardev socket,id=vp-server2,server,path=/opt/vp-server2
-vhost-pci-server socket,chrdev=vp-server1,chrdev=vp-server2

2) The socket manages the device instance, rather than the device ownsthe socket. This is flexible for the connection setup - we can also addqmp commands to support the setup of a socket connection manually. Forexample, we have VM1 booted with the above commands, and VM2 gets bootedwithout connecting to VM1. We still have a chance to let the admin touse qmp commands to set up the connection, and a vhost-pci-net device isthen created and hotplugged to the VM.

3) Considering that we will have the "1-server-N-client" version QEMUsocket for the next step, I think this will be easy to upgrade.


What do you think?

    >     the necessary message info. The slave needs to know which
    master it is
    >     talking to when receiving a message, as it maintains multiple
    >     connections at the same time.
    >
    >
    > You should be able to identify each connection in the slave (as a
    > socket server), without a need for connection id: connected sockets
    > are independent from each others.


    Yes, that's doable. But why couldn't we do it from the protocol
    layer? I
    think it will be easier.

Since it can be done by implementation, I would rather not change theprotocol with unnecessary data.



    Please check below my thoughts about the implementation if we do it in
    the slave:

    The interface for receiving a msg is - tcp_chr_read(QIOChannel *chan,
    GIOCondition cond, void *opaque)

    QIOChannel is the one that we can use to identify the master
    connection
    who sends this msg (the socket server now has an array of QIOChannel,
    ioc[MAX_CLIENTS]). Everytime a msg is received, the tcp_chr_read()
    needs
    to compare *chan and the ioc[] array, to find out the id (indexed into
    the ioc[]), and passes the id to qemu_chr_be_write(), and all the way
    down to the final slave handler where the msg is parsed and handled.
    This needs modifications to the existing APIs, for example, the
    mentioned qemu_chr_be_write() will need one more parameter, "id". This
    will not be compatible with the existing implementation, because all
    other implementations which invoke qemu_chr_be_write() will need to be
    patched to use the new qemu_chr_be_write(,"id",).

We could have the "id" set with the CharBackend. But again, it's animplementation detail, and I would rather treat it as a seperateenhacement.

For this step, we don't need the connection id, so I think we candiscuss more about it when we get to the next step :)




    >     >       * Size - 32-bit size of the payload
    >     >
    >     >
    >     >     @@ -97,6 +106,13 @@ Depending on the request type, payload
    >     can be:
    >     >         log offset: offset from start of supplied file
    descriptor
    >     >             where logging starts (i.e. where guest address 0
    >     would be
    >     >     logged)
    >     >
    >     >     +* Device info
    >     >     +   --------------------
    >     >     +   | virito id | uuid |
    >     >     +   --------------------
    >     >     +   Virtio id: 16-bit virtio id of the device
    >     >     +   UUID: 128-bit UUID to identify the QEMU instance that
    >     creates
    >     >     the device
    >     >     +
    >     >
    >     >
    >     > I wonder if UUID should be a different message.
    >     >
    >     We can make uuid another message if it has other usages.
    >     Do you see any other usages of uuid?
    >
    >
    > Allows to associate data/configuration with a particular VM, in a
    > multi-master/single-slave scenario. But tbh, I don't see how this is
    > necessary, I can imagine solving this differently (having different
    > connection address per vm for ex).


    Using connection addresses, how could you know if the two connections
    are from the same VM?

Each VM could have a private connection address/server, that themanagement layer sets up. Note that the same "Slave" process couldhave multiple servers. Iow, you could have -chardevsocket,id=c1,server,.. -chardev socket,id=c2,server,... -devicevhost-pci,chardev=c1 device vhost-pci,chardev=c2 (this is somewhatsimilar to usbredir): with this setup, qemu could implicitely starts asingleton vhost-user "Slave" server with two listening sockets c1&c1,or it could start a server for each chardev.

I think UUID is not useful for the first version of vhost-pci-net, sowe can probably drop it for now.


    > I would like to understand your use case.


    Here is an example of the use case:
    VM1 has two master connections (connection X and Y) and VM2 has 1
    master
    connection (Z).
    X,Y,Z - each has a connection id. But X and Y send the same uuid,
    uuid1,
    to the slave, and Z sends uuid2 to the slave. In this way, the slave
    know X and Y are the two connections from the same VM, and Z is a
    connection from a different VM.

    For connection Y, the vhost-pci device will be created in a way which
    does not need the driver to map the memory, since it has already been
    mapped by device X from the same VM.

For now, I would ignore the fact that the same VM could be the Masterfor both c1 & c2 (in the ex above). Is passing uuid necessary andenough to avoid duplicated mmap? is this really an issue after all?it's only virtual memory.

I will think more about it. We may discuss it in the next step as well.Thanks.

    >
    >     >      [ Also see the section on REPLY_ACK protocol extension. ]
    >     >
    >     >     +Currently, the communication also supports the Slave
    (server)
    >     >     sending messages
    >     >     +to the Master (client). Here is a list of them:
    >     >     + * VHOST_USER_SET_FEATURES
    >     >
    >     >     + * VHOST_USER_SET_PEER_CONNECTION (the serve may actively
    >     request
    >     >     to disconnect
    >     >     +   with the client)
    >     >
    >     >
    >     > Oh, you are making the communication bidirectional? This is a
    >     > fundamental change in the protocol. This may be difficult to
    >     implement
    >     > in qemu, since the communication in synchronous, a request
    >     expects an
    >     > immediate reply, if it gets back a request (from the
    slave) in the
    >     > middle, it will fail.
    >     >
    >
    >     Not really.
    >     Adding the above two doesn't affect the existing synchronous
    read()
    >     messages (basically, those VHOST_USER_GET_xx messages). Like
    >     VHOST_USER_SET_FEATURES, the _SET_ messages don't need a reply.
    >     Here, we
    >     just make the slave capable of actively sending messages to the
    >     master.
    >
    >
    > Yes, that's the trouble. At any time the Master may send a
    request and
    > expects an immediate reply. There is a race of getting a request
    from
    > the Slave in the middle with your proposed change. I'd rather avoid
    > making the request bidirectionnal if possible. (I proposed a second
    > channel for Slave->Master request in the past:
    > https://lists.gnu.org/archive/html/qemu-devel/2016-04/msg00095.html)


      If the message that the slave got has a different "request" field
    value, it simply drops it and re-read again. The implementation is not
    complex also, please see the change example to
    vhost_user_get_u64() below:

        if (vhost_user_write(dev, &msg_w, NULL, 0) < 0) {
            return -1;
         }
    retry:
         if (vhost_user_read(dev, &msg_r) < 0) {
             return -1;
        }
         if (msg_r.request != msg_w.request)
             goto retry;


    On the other side, the slave's request to the master is dropped due to
    the race. This race can be solved in the protocol layer - let the
    _SET_
    request ask for an ACK, if no ACK is received, re-sent it. Also, this
    kind of race should be very rare in real usage.

Ok, that's an option, not very elegant imho, but it could work. Iwould be afraid of deadlocks if both side apply the same retry logic..

It should be a traditional problem. If necessary, I think we can inserta delay using binary exponential backoff before "goto retry".


    >
    >     >
    >     > +This request should be sent only when
    >     VHOST_USER_PROTOCOL_F_VHOST_PCI
    >     > has...
    >     >
    >     >     +* VHOST_USER_SET_DEV_INFO
    >     >     +
    >     >     +      Id: 21
    >     >     +      Equivalent ioctl: N/A
    >     >     +      Master payload: dev info
    >     >     +
    >     >     +      The client sends the producer device info to
    the server.
    >     >
    >     >
    >     > "Master sends producer device info to the Slave" works, no?
    >
    >     Yes, it works. The current dev info only contains a "virtio
    id" field
    >     (assume we'll take uuid out as a separate message), which
    tells the
    >     slave if it is a net, scsi, console or else. do you see any
    issue?
    >     >
    >     > Could we guarantee this message is sent before SET_VRING*?
    >
    >     Why do we need to guarantee this?
    >
    >
    > It would simplify the protocol to have expectations on when messages
    > come. In particular, an early message with devinfo would allow to
    > check/pre-configure the Slave for a particular device. Also
    > VHOST_USER_SET_DEV_INFO should probably be unique (don't allow a
    > device to be reconfigured)


    Yes, it is sent in an early age of the vhost-user protocol
    interaction.
    It's implemented to be sent right after sending the
    VHOST_USER_SET_PROTOCOL_FEATURES msg. On the slave side, when it
    receives SET_DEV_INFO, it pre-configures the device in a table
    entry (as
    mentioned before, a device will be created from the table entry at a
    later stage of the protocol interaction).

    I think it should be the implementation logic, like
    VHOST_USER_SET_PROTOCOL_FEATURES. why do we need to add a guarantee in
    the protocol to specify the order?

It would help to simplify the possible protocol states, helpingimplementation, testing and documentation. I think the protocol is tooliberal already, it would help to specify the various states of Masterand Slave and what messages are permitted in each state. This is outof scope here, but a worthy goal for a v2 of the protocol.

OK. btw, have you got a plan what states to add to slave and master?Some messages (e.g. SET_VRING_CALL) are sent both in the initial andlater stage of the protocol interaction.

    >
    >     >
    >     >     + This request should be sent only when
    >     >     VHOST_USER_PROTOCOL_F_VHOST_PCI has
    >     >     +      been negotiated.
    >     >     +
    >     >
    >     >
    >     > I think this message could be useful for other purposes than
    >     > vhost-pci, thus I would give it its own flag.
    >
    >     Could you please give an example of other usage? Thanks.
    >
    >
    > You could have a Slave that implements various devices, and pick the
    > corresponding one dynamically (we already have implementations for
    > net/input/gpu/scsi...)


    If I understand the example correctly, the various devices still
    belongs
    to the vhost-pci series - in the future we would have vhost-pci-net,
    vhost-pci-scsi, vhost-pci-gpu etc. If that's the case, we may
    still use
    the VHOST_PCI flag.
My point is that those messages are not specific to vhost-pci, theycan be useful for vhost-user Slave today.
That brings me to a related question: do you need to have differentvhost-pci devices, or can we make it generic enough that it couldmirror various device kinds dynamically. As a first step, I think itwould be simpler to focus on a specific device, like vhost-pci-net,like you are doing in your qemu series.

Ok, let's start from the simpler case. Please check another email forthe 2-step plan. Thanks.


Best,
Wei

References:
- Re: [PATCH v1] docs/vhost-user: extend the vhost-user protocol to support the vhost-pci based inter-vm communication
  - From: Marc-André Lureau <marcandre.lureau@gmail.com>
- Re: [virtio-comment] Re: [PATCH v1] docs/vhost-user: extend the vhost-user protocol to support the vhost-pci based inter-vm communication
  - From: Wei Wang <wei.w.wang@intel.com>
- Re: [virtio-comment] Re: [PATCH v1] docs/vhost-user: extend the vhost-user protocol to support the vhost-pci based inter-vm communication
  - From: Marc-André Lureau <marcandre.lureau@gmail.com>
- Re: [virtio-comment] Re: [PATCH v1] docs/vhost-user: extend the vhost-user protocol to support the vhost-pci based inter-vm communication
  - From: Wei Wang <wei.w.wang@intel.com>
- Re: [virtio-comment] Re: [PATCH v1] docs/vhost-user: extend the vhost-user protocol to support the vhost-pci based inter-vm communication
  - From: Marc-André Lureau <marcandre.lureau@gmail.com>