virtio-comment message



Subject: RE: [PATCH 4/6 Resend] Vhost-pci RFC: Detailed Description in the Virtio Specification Format


On Thu 6/2/2016 11:52 AM,  Xiao Guangrong wrote:
> On 06/02/2016 11:15 AM, Wang, Wei W wrote:
> > On Wed 6/1/2016 4:15 PM, Xiao Guangrong wrote:
> >> On 05/29/2016 04:11 PM, Wei Wang wrote:
> >>> Signed-off-by: Wei Wang <wei.w.wang@intel.com>
> >>> ---
> >>>    Details | 324 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>>    1 file changed, 324 insertions(+)
> >>>    create mode 100644 Details
> >>>
> >>> diff --git a/Details b/Details
> >>> new file mode 100644
> >>> index 0000000..4ea2252
> >>> --- /dev/null
> >>> +++ b/Details
> >>> @@ -0,0 +1,324 @@
> >>> +1 Device ID
> >>> +TBD
> >>> +
> >>> +2 Virtqueues
> >>> +0 controlq
> >>> +
> >>> +3 Feature Bits
> >>> +3.1 Local Feature Bits
> >>> +Currently no local feature bits are defined, so the standard virtio
> >>> +feature bits negotiation will always be successful and complete.
> >>> +
> >>> +3.2 Remote Feature Bits
> >>> +The remote feature bits are obtained from the frontend virtio
> >>> +device and negotiated with the vhost-pci driver via the controlq.
> >>> +The negotiation steps are described in 4.5 Device Initialization.
> >>> +
> >>> +4 Device Configuration Layout
> >>> +struct vhost_pci_config {
> >>> +	#define VHOST_PCI_CONTROLQ_MEMORY_INFO_ACK 0
> >>> +	#define VHOST_PCI_CONTROLQ_DEVICE_INFO_ACK 1
> >>> +	#define VHOST_PCI_CONTROLQ_FEATURE_BITS_ACK 2
> >>> +	u32 ack_type;
> >>> +	u32 ack_device_type;
> >>> +	u64 ack_device_id;
> >>> +	union {
> >>> +		#define VHOST_PCI_CONTROLQ_ACK_ADD_DONE 0
> >>> +		#define VHOST_PCI_CONTROLQ_ACK_ADD_FAIL 1
> >>> +		#define VHOST_PCI_CONTROLQ_ACK_DEL_DONE 2
> >>> +		#define VHOST_PCI_CONTROLQ_ACK_DEL_FAIL 3
> >>> +		u64 ack_memory_info;
> >>> +		u64 ack_device_info;
> >>> +		u64 ack_feature_bits;
> >>> +	};
> >>> +};
> >>
> >> Do you need to write all these 4 fields to ack the operation? It seems
> >> inefficient, and it is not flexible if the driver needs to
> >> offer more data to the device in the future. Can we dedicate a vq
> >> for this purpose?
> >
> > Yes, all the 4 fields are required to be written. The vhost-pci server usually
> > connects to multiple clients, and the "ack_device_type" and "ack_device_id"
> > fields are used to identify them.
> >
> > Agree, another controlq for the guest->host direction looks better, and the
> > above fields can be converted into the controlq message header.
> >
> 
> Thanks.
> 
> >>
> >> BTW, the current approach cannot handle the case where there are multiple
> >> requests of the same kind in the control queue, e.g., if there are two
> >> memory-add requests in the control queue.
> >
> > A vhost-pci device corresponds to a driver VM. The two memory-add requests
> > on the controlq are all for the same driver VM. Memory-add requests for
> > different driver VMs couldn’t be present on the same controlq. I haven’t seen
> > the issue yet. Can you please explain more? Thanks.
> 
> The issue is caused by "The two memory-add requests on the controlq are all for
> the same driver VM": the driver needs to ACK these requests respectively; however,
> these two requests have the same ack_type, device_type, device_id, and
> ack_memory_info, so QEMU is not able to figure out which request has been
> acked.

Normally, pieces of memory info should be combined into one message (the structure includes multiple memory regions) and sent by the client. A rare case would be something like this: the driver VM hot-adds 1GB of memory, followed by hot-adding another 1GB. The first piece of memory info is passed via the socket and the controlq to the vhost-pci driver, then the second. Normally they won't get an opportunity to be put on the controlq at the same time.
Even if the implementation batches the controlq messages, there will still be a sequence difference between the two messages on the controlq, right?
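
To make that concrete, here is a rough sketch (not from the patch; the
addresses, sizes and offsets are made-up example values) of how the two
hot-added 1GB regions could be combined into a single
VHOST_PCI_CONTROLQ_MEMORY_INFO(ADD) message, using the
vhost_pci_controlq_memory_info structure quoted further down:

struct vhost_pci_controlq_memory_info msg = {
	.ops     = VHOST_PCI_ADD_MEMORY,
	.nregion = 2,
	.region  = {
		/* first hot-added 1GB region (example values) */
		{ .region_base_xgpa   = 0x100000000ULL,
		  .size               = 0x40000000ULL,
		  .offset_in_bar_area = 0x0ULL },
		/* second hot-added 1GB region (example values) */
		{ .region_base_xgpa   = 0x140000000ULL,
		  .size               = 0x40000000ULL,
		  .offset_in_bar_area = 0x40000000ULL },
	},
};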

From QEMU's (the vhost-pci server's) perspective, it just sends back an ACK to the client whenever it receives an ACK from the vhost-pci driver.
From the client's perspective, it will receive two ACK messages in this example.
Since the two have a sequence difference, the client should be able to distinguish the two (first sent, first acked), right?

Do you see a case where handling the first message is delayed and the second message is handled and ACK-ed first? 

> >
> >
> >>> +
> >>> +The configuration fields are currently used for the vhost-pci
> >>> +driver to acknowledge to the vhost-pci device after it receives controlq
> >>> +messages.
> >>> +
> >>> +4.5 Device Initialization
> >>> +When a device VM boots, it creates a vhost-pci server socket.
> >>> +
> >>> +When a virtio device on the driver VM is created and specified to
> >>> +use a vhost-pci device as its backend, a client socket is
> >>> +created and connected to the corresponding vhost-pci server for message
> >>> +exchanges.
> >>> +
> >>> +The messages passed to the vhost-pci server are preceded by the
> >>> +following
> >>> +header:
> >>> +struct vhost_pci_socket_hdr {
> >>> +	#define VHOST_PCI_SOCKET_MEMORY_INFO 0
> >>> +	#define VHOST_PCI_SOCKET_MEMORY_INFO_ACK 1
> >>> +	#define VHOST_PCI_SOCKET_DEVICE_INFO 2
> >>> +	#define VHOST_PCI_SOCKET_DEVICE_INFO_ACK 3
> >>> +	#define VHOST_PCI_SOCKET_FEATURE_BITS 4
> >>> +	#define VHOST_PCI_SOCKET_FEATURE_BITS_ACK 5
> >>> +	u16 msg_type;
> >>> +	u16 msg_version;
> >>> +	u32 msg_len;
> >>> +	u64 qemu_pid;
> >>> +};
> >>> +
> >>> +The payload of the above message types can be constructed using the
> >>> +structures
> >>> +below:
> >>> +/* VHOST_PCI_SOCKET_MEMORY_INFO message */
> >>> +struct vhost_pci_socket_memory_info {
> >>> +	#define VHOST_PCI_ADD_MEMORY 0
> >>> +	#define VHOST_PCI_DEL_MEMORY 1
> >>> +	u16 ops;
> >>> +	u32 nregions;
> >>> +	struct vhost_pci_memory_region {
> >>> +		int fd;
> >>> +		u64 guest_phys_addr;
> >>> +		u64 memory_size;
> >>> +		u64 mmap_offset;
> >>> +	} regions[VHOST_PCI_MAX_NREGIONS];
> >>> +};
> >>> +
> >>> +/* VHOST_PCI_SOCKET_DEVICE_INFO message */
> >>> +struct vhost_pci_device_info {
> >>> +	#define VHOST_PCI_ADD_FRONTEND_DEVICE 0
> >>> +	#define VHOST_PCI_DEL_FRONTEND_DEVICE 1
> >>> +	u16    ops;
> >>> +	u32    nvirtq;
> >>> +	#define VHOST_PCI_FRONTEND_DEVICE_NET 1
> >>> +	#define VHOST_PCI_FRONTEND_DEVICE_BLK 2
> >>> +	#define VHOST_PCI_FRONTEND_DEVICE_CONSOLE 3
> >>> +	#define VHOST_PCI_FRONTEND_DEVICE_ENTROPY 4
> >>> +	#define VHOST_PCI_FRONTEND_DEVICE_BALLOON 5
> >>> +	#define VHOST_PCI_FRONTEND_DEVICE_SCSI 8
> >>> +	u32    device_type;
> >>> +	u64    device_id;
> >>> +	struct virtq exotic_virtq[VHOST_PCI_MAX_NVIRTQ];
> >>> +};
> >>> +The device_id field identifies the device. For example, it can be
> >>> +used to store a MAC address if the device_type is
> >>> +VHOST_PCI_FRONTEND_DEVICE_NET.
> >>> +
> >>> +/* VHOST_PCI_SOCKET_FEATURE_BITS message */
> >>> +struct vhost_pci_feature_bits {
> >>> +	u64 feature_bits;
> >>> +};
> >>
> >> We not only have 'socket feature bits' but also the feature bits for
> >> each virtio device plugged in on the vhost-pci device side.
> >
> > Yes. It is mentioned in "3 Feature Bits". The socket feature bits here are
> > actually the remote feature bits (obtained from a socket message).
> 
> Hmm, there are two questions:
> 1) The device-related info (e.g., device-type, device-id, etc.) has not been included
>     there, so the vhost-pci driver cannot write proper values to the fields of
>     vhost_pci_config.

Right. For the controlq feature bits messages, the data structure lacks the "device_type" and "device_id" fields. I will add them.
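
Roughly something like this (just a sketch of what I have in mind; the
struct name and field layout are tentative and not in the current patch):

/* VHOST_PCI_CONTROLQ_FEATURE_BITS message (revised sketch) */
struct vhost_pci_controlq_feature_bits {
	u32 device_type;
	u64 device_id;
	u64 feature_bits;
};

With this, the vhost-pci driver can copy device_type and device_id into
the ack_device_type and ack_device_id fields of vhost_pci_config when it
acknowledges the message.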

I think "device_type" and "device_id" don’t need to be included in the socket feature bits message structure:
each frontend virtio device has its own connection to the vhost-pci server, that is, the connection itself has actually already been associated with a "device_type+device_id" (the server should maintain a table for such a relationship after a device info socket message is received). 

> 2) Different virtio devices may have different feature bits.
>     VHOST_PCI_SOCKET_FEATURE_BITS and VHOST_PCI_CONTROLQ_FEATURE_BITS
>     should be able to negotiate the feature bits for different virtio devices.

Then, I think this shouldn’t be a problem after "device_type" and "device_id" are added to the controlq feature bits message structure.

> >
> >>
> >> E.g.: if there are two virtio devices (e.g., a NIC and a BLK device), both of
> >> them need to directly communicate with another VM. The feature bits
> >> of these two devices need to be negotiated with that VM respectively.
> >> And you cannot put these feature bits in the vhost_pci_device_info struct, as its
> >> vq is not created at that time.
> >
> > Right. If you check the initialization steps below, there is a statement "When
> > the device status is updated with DRIVER_OK".
> >
> >>> +
> >>> +/* VHOST_PCI_SOCKET_xx_ACK messages */
> >>> +struct vhost_pci_socket_ack {
> >>> +	#define VHOST_PCI_SOCKET_ACK_ADD_DONE 0
> >>> +	#define VHOST_PCI_SOCKET_ACK_ADD_FAIL 1
> >>> +	#define VHOST_PCI_SOCKET_ACK_DEL_DONE 2
> >>> +	#define VHOST_PCI_SOCKET_ACK_DEL_FAIL 3
> >>> +	u64 ack;
> >>> +};
> >>> +
> >>> +The driver update message passed via the controlq is preceded by
> >>> +the following
> >>> +header:
> >>> +struct vhost_pci_controlq_hdr {
> >>> +	#define VHOST_PCI_CONTROLQ_MEMORY_INFO 0
> >>> +	#define VHOST_PCI_CONTROLQ_DEVICE_INFO 1
> >>> +	#define VHOST_PCI_CONTROLQ_FEATURE_BITS 2
> >>> +	#define VHOST_PCI_CONTROLQ_UPDATE_DONE 3
> >>> +	u16 msg_type;
> >>> +	u16 msg_version;
> >>> +	u32 msg_len;
> >>> +};
> >>> +
> >>> +The payload of a VHOST_PCI_CONTROLQ_MEMORY_INFO message can be
> >>> +constructed using the following structure:
> >>> +/* VHOST_PCI_CONTROLQ_MEMORY_INFO message */
> >>> +struct vhost_pci_controlq_memory_info {
> >>> +	#define VHOST_PCI_ADD_MEMORY 0
> >>> +	#define VHOST_PCI_DEL_MEMORY 1
> >>> +	u16  ops;
> >>> +	u32 nregion;
> >>> +	struct exotic_memory_region {
> >>> +		u64   region_base_xgpa;
> >>> +		u64   size;
> >>> +		u64   offset_in_bar_area;
> >>> +	} region[VHOST_PCI_MAX_NREGIONS];
> >>> +};
> >>> +
> >>> +The payload of VHOST_PCI_CONTROLQ_DEVICE_INFO and
> >>> +VHOST_PCI_CONTROLQ_FEATURE_BITS messages can be constructed using the
> >>> +vhost_pci_device_info structure and the vhost_pci_feature_bits
> >>> +structure respectively.
> >>> +
> >>> +The payload of a VHOST_PCI_CONTROLQ_UPDATE_DONE message can be
> >>> +constructed using the structure below:
> >>> +struct vhost_pci_controlq_update_done {
> >>> +	u32    device_type;
> >>> +	u64    device_id;
> >>> +};
> >>> +
> >>> +Fig. 1 shows the initialization steps.
> >>> +
> >>> +When the vhost-pci server receives a
> >>> +VHOST_PCI_SOCKET_MEMORY_INFO(ADD) message, it checks if a vhost-pci
> >>> +device has been created for the requesting VM whose QEMU process id
> >>> +is qemu_pid. If yes, it will simply forward the subsequently received
> >>> +messages to the vhost-pci driver via the controlq. Otherwise, the
> >>> +server creates a new vhost-pci device, and continues the following
> >>> +initialization steps.
> >>
> >>
> >> qemu-pid is not stable, as the existing VM can be killed silently and
> >> a new vhost-pci driver reusing the same qemu-pid can ask to join
> >> before the vhost-pci device gets to know the previous one has gone.
> >
> > Would it be a normal and legal operation to silently kill a QEMU? I guess only
> > the system admin can do that, right?
> 
> Yup, it is a valid operation, as a problematic VM will be killed and, as you said, the
> admin can kill it silently. Anyway, the design should have a way to handle this
> case properly.
> 
> >
> > If that's true, I think we can add a new field, "u64 tsc_of_birth", to the
> > vhost_pci_socket_hdr structure. It records the TSC value at the time the QEMU process is created.
> 
> You only need to identify a VM; can we use a UUID instead?

Yes. The two methods are actually similar.
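
For example, the socket header could carry a UUID in place of qemu_pid;
a rough sketch (vm_uuid is my tentative naming, e.g. filled with the
value passed to QEMU via -uuid):

struct vhost_pci_socket_hdr {
	u16 msg_type;
	u16 msg_version;
	u32 msg_len;
	u8  vm_uuid[16];	/* identifies the driver VM; replaces qemu_pid */
};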

Best,
Wei

> 
> > If that's true, another problem would be the removal of the vhost-pci device
> > for a silently killed driver VM.
> > The vhost-pci server may need to periodically send a checking message to
> > check if the driver VM has been silently killed. If that really happens, it should
> > remove the related vhost-pci device.
> 
> Yes.
> 
> >
> >>> +
> >>> +The vhost-pci server adds up all the memory region sizes, and uses a
> >>> +64-bit device BAR for the mapping of all the memory regions
> >>> +obtained from the socket message. To better support memory
> >>> +hot-plugging of the driver VM, the BAR is configured to double the
> >>> +size of the driver VM's memory. The server maps the received memory
> >>> +info via the QEMU MemoryRegion mechanism, and then the newly created
> >>> +vhost-pci device is
> >>> +hot-plugged to the VM.
> >>> +
> >>> +When the device status is updated with DRIVER_OK, a
> >>> +VHOST_PCI_CONTROLQ_MEMORY_INFO(ADD) message, which is derived from
> >>> +the memory info socket message, is put on the controlq and a controlq
> >>> +interrupt is injected to the VM.
> >>> +
> >>> +When the vhost-pci server receives a
> >>> +VHOST_PCI_CONTROLQ_MEMORY_INFO_ACK(ADD_DONE) acknowledgement from
> >>> +the driver, it sends a VHOST_PCI_SOCKET_MEMORY_INFO_ACK(ADD_DONE)
> >>> +message to the client that is identified by the ack_device_type and
> >>> +ack_device_id fields.
> >>> +
> >>> +When the vhost-pci server receives a
> >>> +VHOST_PCI_SOCKET_FEATURE_BITS(feature bits) message, a
> >>> +VHOST_PCI_CONTROLQ_FEATURE_BITS(feature bits) message is put on the
> >>> +controlq and a controlq interrupt is injected to the VM.
> >>> +
> >>> +If the vhost-pci server notices that the driver fully accepted the
> >>> +offered feature bits, it sends a
> >>> +VHOST_PCI_SOCKET_FEATURE_BITS_ACK(ADD_DONE) message to the client. If
> >>> +the vhost-pci server notices that the vhost-pci driver only
> >>> +accepted a subset of the offered feature bits, it sends a
> >>> +VHOST_PCI_SOCKET_FEATURE_BITS(accepted feature bits) message back
> >>> +to the client. The client side virtio device re-negotiates the new
> >>> +feature bits with its driver, and sends back a
> >>> +VHOST_PCI_SOCKET_FEATURE_BITS_ACK(ADD_DONE)
> >>> +message to the server.
> >>> +
> >>> +Either when the vhost-pci driver fully accepted the offered feature
> >>> +bits or a
> >>> +VHOST_PCI_SOCKET_FEATURE_BITS_ACK(ADD_DONE) message is received from
> >>> +the client, the vhost-pci server puts a
> >>> +VHOST_PCI_CONTROLQ_UPDATE_DONE message on the controlq, and a
> >>> +controlq interrupt is injected to the VM.
> >>
> >> Why is VHOST_PCI_CONTROLQ_UPDATE_DONE needed?
> >
> > OK, this one looks redundant. We can set up the related support for that
> > frontend device when the device info is received via the controlq.
> >
> 
> Great.

