[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]
Subject: Re: [virtio-dev] [PATCH] message framing: make ANY_LAYOUT implicit
"Michael S. Tsirkin" <mst@redhat.com> writes: > This resolves issue VIRTIO-10 > > This also creates a new section for legacy feature > bits which will be handy for VIRTIO-13 if that is > accepted. > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com> This is very similar to the patch I came up with: I'm too slow! Pasted below, in particular the 2.4.2.5.1 header needs to move up a few lines, too. Cheers, Rusty. > --- > virtio-v1.0-wd01-part1-specification.txt | 38 ++++++++++++++++++++++---------- > 1 file changed, 26 insertions(+), 12 deletions(-) > > diff --git a/virtio-v1.0-wd01-part1-specification.txt b/virtio-v1.0-wd01-part1-specification.txt > index 2679b5b..bb75e35 100644 > --- a/virtio-v1.0-wd01-part1-specification.txt > +++ b/virtio-v1.0-wd01-part1-specification.txt > @@ -312,8 +312,8 @@ It is assumed that the host is already aware of the guest endian. > > 2.1.4.2. Message Framing > ----------------------- > -The original intent of the specification was that message framing (the > -particular layout of descriptors) be independent of the contents of > +Generally, the intent of the specification is for message framing (the > +particular layout of descriptors) to be independent of the contents of > the buffers. For example, a network transmit buffer consists of a 12 > byte header followed by the network packet. This could be most simply > placed in the descriptor table as a 12 byte output descriptor followed > @@ -322,16 +322,21 @@ single 1526 byte output descriptor in the case where the header and > packet are adjacent, or even three or more descriptors (possibly with > loss of efficiency in that case). > > -Regrettably, initial driver implementations used simple layouts, and > -devices came to rely on it, despite this specification wording[10]. It > -is thus recommended that drivers be conservative in their assumptions, > -unless the VIRTIO_F_ANY_LAYOUT feature is accepted. In addition, some > +In addition, some > implementations may have large-but-reasonable restrictions on total > descriptor size (such as based on IOV_MAX in the host OS). This has > not been a problem in practice: little sympathy will be given to > drivers which create unreasonably-sized descriptors such as by > dividing a network packet into 1500 single-byte descriptors! > > +2.1.4.2.1. Legacy Interfaces: A Note on Message Framing > +----------------------- > +Regrettably, initial driver implementations used simple layouts, and > +devices came to rely on it, despite this specification wording[10]. It > +is thus recommended that when using legacy interfaces, > +drivers should be conservative in their assumptions, > +unless the VIRTIO_F_ANY_LAYOUT feature is accepted. > + > 2.1.4.3. The Virtqueue Descriptor Table > -------------------------------------- > > @@ -2980,9 +2985,6 @@ Currently there are five device-independent feature bits defined: > using a timer if the device interrupts it when all the packets > are transmitted. > > - VIRTIO_F_ANY_LAYOUT (27) This feature indicates that the device accepts arbitrary > - descriptor layouts, as described in Section "2.1.4.2. Message Framing". > - > VIRTIO_F_RING_INDIRECT_DESC (28) Negotiating this feature indicates > that the driver can use descriptors with the VRING_DESC_F_INDIRECT > flag set, as described in "2.1.4.3.1. Indirect Descriptors". > @@ -3002,9 +3004,21 @@ Currently there are five device-independent feature bits defined: > compliant with this specification, and acknowledged by all device > drivers. > > -In addition, bit 30 is used by qemu's implementation to check for experimental > -early versions of virtio which did not perform correct feature negotiation, > -and should not be used. > +2.5.1 Legacy Interface: A Note on Reserved Feature Bits > +------------------------- > + > +When used through the legacy interface, transitional > +devices should advertise and recognize the following > +feature bits: > + > +VIRTIO_F_ANY_LAYOUT (27) This feature indicates that the device accepts arbitrary > +descriptor layouts, as described in Section > +"2.1.4.2.1. Legacy Interfaces: A Note on Message Framing". > + > +VIRTIO_F_BAD_FEATURE (30) This bit is used by devices > +to check for experimental early versions of virtio > +drivers which did not perform correct feature negotiation, > +and should not be supported or activated by drivers. > > 2.6. virtio_ring.h > ================= > -- > MST commit 75f47c3fcc52192b7b6a7b6a77692cdb971dcb7e Author: Rusty Russell <rusty@au1.ibm.com> Date: Mon Sep 9 17:40:32 2013 +0930 Any layout is allowed (deprecate VIRTIO_F_ANY_LAYOUT) As per issue virtio-10. Signed-off-by: Rusty Russell <rusty@au1.ibm.com> diff --git a/virtio-v1.0-wd01-part1-specification.txt b/virtio-v1.0-wd01-part1-specification.txt index 1e424d0..ceac80a 100644 --- a/virtio-v1.0-wd01-part1-specification.txt +++ b/virtio-v1.0-wd01-part1-specification.txt @@ -268,25 +268,33 @@ is not a significant issue. 2.1.4.2. Message Framing ----------------------- -The original intent of the specification was that message framing (the -particular layout of descriptors) be independent of the contents of -the buffers. For example, a network transmit buffer consists of a 12 -byte header followed by the network packet. This could be most simply -placed in the descriptor table as a 12 byte output descriptor followed -by a 1514 byte output descriptor, but it could also consist of a -single 1526 byte output descriptor in the case where the header and -packet are adjacent, or even three or more descriptors (possibly with -loss of efficiency in that case). +The message framing (the particular layout of descriptors) is +independent of the contents of the buffers. For example, a network +transmit buffer consists of a 12 byte header followed by the network +packet. This could be most simply placed in the descriptor table as a +12 byte output descriptor followed by a 1514 byte output descriptor, +but it could also consist of a single 1526 byte output descriptor in +the case where the header and packet are adjacent, or even three or +more descriptors (possibly with loss of efficiency in that case). + +Note that, some implementations may have large-but-reasonable +restrictions on total descriptor size (such as based on IOV_MAX in the +host OS). This has not been a problem in practice: little sympathy +will be given to drivers which create unreasonably-sized descriptors +such as by dividing a network packet into 1500 single-byte +descriptors! + +2.1.4.2.1. Legacy Interface: Message Framing +----------------------- Regrettably, initial driver implementations used simple layouts, and -devices came to rely on it, despite this specification wording[10]. It -is thus recommended that drivers be conservative in their assumptions, -unless the VIRTIO_F_ANY_LAYOUT feature is accepted. In addition, some -implementations may have large-but-reasonable restrictions on total -descriptor size (such as based on IOV_MAX in the host OS). This has -not been a problem in practice: little sympathy will be given to -drivers which create unreasonably-sized descriptors such as by -dividing a network packet into 1500 single-byte descriptors! +devices came to rely on it, despite this specification wording. In +addition, the specification for virtio_blk SCSI commands required +intuiting field lengths from frame boundaries (see "2.4.2.5.1 Legacy +Interface: Device Operation") + +It is thus recommended that drivers be conservative in their +assumptions, unless the VIRTIO_F_ANY_LAYOUT feature is accepted. 2.1.4.3. The Virtqueue Descriptor Table -------------------------------------- @@ -1193,7 +1201,7 @@ features. VIRTIO_NET_F_MAC (5) Device has given MAC address. VIRTIO_NET_F_GSO (6) (Deprecated) device handles packets with - any GSO type.[12] + any GSO type.[11] VIRTIO_NET_F_GUEST_TSO4 (7) Guest can receive TSOv4. @@ -1267,16 +1275,16 @@ features. packets by negotating the VIRTIO_NET_F_CSUM feature. This “ checksum offload” is a common feature on modern network cards. -7. If that feature is negotiated[13], a driver can use TCP or UDP +7. If that feature is negotiated[12], a driver can use TCP or UDP segmentation offload by negotiating the VIRTIO_NET_F_HOST_TSO4 (IPv4 TCP), VIRTIO_NET_F_HOST_TSO6 (IPv6 TCP) and VIRTIO_NET_F_HOST_UFO (UDP fragmentation) features. It should not send TCP packets requiring segmentation offload which have the Explicit Congestion Notification bit set, unless the VIRTIO_NET_F_HOST_ECN feature is - negotiated.[14] + negotiated.[13] 8. The converse features are also available: a driver can save - the virtual device some work by negotiating these features.[15] + the virtual device some work by negotiating these features.[14] The VIRTIO_NET_F_GUEST_CSUM feature indicates that partially checksummed packets can be received, and if it can do that then the VIRTIO_NET_F_GUEST_TSO4, VIRTIO_NET_F_GUEST_TSO6, @@ -1328,7 +1336,7 @@ the different features the driver negotiated. and • csum_offset indicates how many bytes after the csum_start the - new (16 bit ones' complement) checksum should be placed.[16] + new (16 bit ones' complement) checksum should be placed.[15] 2. If the driver negotiated VIRTIO_NET_F_HOST_TSO4, TSO6 or UFO, and the packet requires @@ -1341,21 +1349,21 @@ the different features the driver negotiated. • hdr_len is a hint to the device as to how much of the header needs to be kept to copy into each packet, usually set to the - length of the headers, including the transport header.[17] + length of the headers, including the transport header.[16] • gso_size is the maximum size of each packet beyond that header (ie. MSS). • If the driver negotiated the VIRTIO_NET_F_HOST_ECN feature, the VIRTIO_NET_HDR_GSO_ECN bit may be set in “gso_type” as - well, indicating that the TCP packet has the ECN bit set.[18] + well, indicating that the TCP packet has the ECN bit set.[17] 3. If the driver negotiated the VIRTIO_NET_F_MRG_RXBUF feature, the num_buffers field is set to zero. 4. The header and packet are added as one output buffer to the transmitq, and the device is notified of the new entry (see "2.4.1.4. - Notifying The Device").[19] + Notifying The Device").[18] 2.4.1.5.1.1. Packet Transmission Interrupt ----------------------------------------- @@ -1380,7 +1388,7 @@ VIRTIO_NET_F_GUEST_UFO features are used, the Guest will need to accept packets of up to 65550 bytes long (the maximum size of a TCP or UDP packet, plus the 14 byte ethernet header), otherwise 1514. bytes. So unless VIRTIO_NET_F_MRG_RXBUF is negotiated, every -buffer in the receive queue needs to be at least this length [20] +buffer in the receive queue needs to be at least this length [19] If VIRTIO_NET_F_MRG_RXBUF is negotiated, each buffer must be at least the size of the struct virtio_net_hdr. @@ -1480,7 +1488,7 @@ off. The command-specific-data is one byte containing 0 (off) or #define VIRTIO_NET_CTRL_MAC_TABLE_SET 0 The device can filter incoming packets by any number of destination -MAC addresses.[21] This table is set using the class +MAC addresses.[20] This table is set using the class VIRTIO_NET_CTRL_MAC and the command VIRTIO_NET_CTRL_MAC_TABLE_SET. The command-specific-data is two variable length tables of 6-byte MAC addresses. The first table contains unicast addresses, and the second @@ -1652,7 +1660,7 @@ the device (not necessarily in order). Each request is of form: The type of the request is either a read (VIRTIO_BLK_T_IN), a write (VIRTIO_BLK_T_OUT), or a flush (VIRTIO_BLK_T_FLUSH or -VIRTIO_BLK_T_FLUSH_OUT[22]). If the device has VIRTIO_BLK_F_BARRIER +VIRTIO_BLK_T_FLUSH_OUT[21]). If the device has VIRTIO_BLK_F_BARRIER feature the high bit (VIRTIO_BLK_T_BARRIER) indicates that this request acts as a barrier and that all preceeding requests must be complete before this one, and all following requests must not be @@ -1683,12 +1691,12 @@ error or VIRTIO_BLK_S_UNSUPP for a request unsupported by host: #define VIRTIO_BLK_S_IOERR 1 #define VIRTIO_BLK_S_UNSUPP 2 +2.4.2.5.1 Legacy Interface: Device Operation +------------------------ Historically, devices assumed that the fields type, ioprio and sector reside in a single, separate read-only buffer and the status field is a separate read-only buffer of size 1 byte, by itself. -2.4.2.5.1 Legacy Interface: Device Operation ------------------------- If the device has VIRTIO_BLK_F_SCSI feature, it can also support scsi packet command requests, each of these requests is of form: @@ -1826,7 +1834,7 @@ data and outgoing characters are placed in the transmit queue. ------------------------ 1. For output, a buffer containing the characters is placed in - the port's transmitq.[23] + the port's transmitq.[22] 2. When a buffer is used in the receiveq (signalled by an interrupt), the contents is the input to the port associated @@ -1970,7 +1978,7 @@ configuration change interrupt. 2. To supply memory to the balloon (aka. inflate): (a) The driver constructs an array of addresses of unused memory - pages. These addresses are divided by 4096[24] and the descriptor + pages. These addresses are divided by 4096[23] and the descriptor describing the resulting 32-bit array is added to the inflateq. 3. To remove memory from the balloon (aka. deflate): @@ -1985,11 +1993,11 @@ configuration change interrupt. (c) Otherwise, the guest may begin to re-use pages previously given to the balloon before the device has acknowledged their - withdrawl. [25] + withdrawl. [24] 4. In either case, once the device has completed the inflation or deflation, the “actual” field of the configuration should be - updated to reflect the new number of pages in the balloon.[26] + updated to reflect the new number of pages in the balloon.[25] 2.4.5.6.1. Memory Statistics --------------------------- @@ -2577,10 +2585,7 @@ contents of the event field. The following events are defined: 2.5. Reserved Feature Bits ========================= -Currently there are four device-independent feature bits defined: - - VIRTIO_F_ANY_LAYOUT (27) This feature indicates that the device accepts arbitrary - descriptor layouts, as described in Section "2.1.4.2. Message Framing". +Currently there are three device-independent feature bits defined: VIRTIO_F_RING_INDIRECT_DESC (28) Negotiating this feature indicates that the driver can use descriptors with the VRING_DESC_F_INDIRECT @@ -2621,6 +2626,10 @@ VIRTIO_F_NOTIFY_ON_EMPTY (24) Negotiating this feature using a timer if the device interrupts it when all the packets are transmitted. +VIRTIO_F_ANY_LAYOUT (27) This feature indicates that the device + accepts arbitrary descriptor layouts, as described in Section + "2.1.4.2.1. Legacy Interface: Message Framing". + 2.6. virtio_ring.h ================= @@ -2851,7 +2860,7 @@ altogether. Any change to configuration space, or new virtqueues, or behavioural changes, should be indicated by negotiation of a new -feature bit. This establishes clarity[11] and avoids future expansion problems. +feature bit. This establishes clarity[10] and avoids future expansion problems. Clusters of functionality which are always implemented together can use a single bit, but if one feature makes sense without the @@ -2895,34 +2904,28 @@ of this expected condition is necessary. [9] https://lists.linux-foundation.org/mailman/listinfo/virtualization -[10] It was previously asserted that framing should be independent of message -contents, yet invariably drivers layed out messages in reliable ways and -devices assumed it. -In addition, the specifications for virtio_blk and virtio_scsi require -intuiting field lengths from frame boundaries. - -[11] Even if it does mean documenting design or implementation +[10] Even if it does mean documenting design or implementation mistakes! -[12] It was supposed to indicate segmentation offload support, but +[11] It was supposed to indicate segmentation offload support, but upon further investigation it became clear that multiple bits were required. -[13] ie. VIRTIO_NET_F_HOST_TSO* and VIRTIO_NET_F_HOST_UFO are +[12] ie. VIRTIO_NET_F_HOST_TSO* and VIRTIO_NET_F_HOST_UFO are dependent on VIRTIO_NET_F_CSUM; a dvice which offers the offload features must offer the checksum feature, and a driver which accepts the offload features must accept the checksum feature. Similar logic applies to the VIRTIO_NET_F_GUEST_TSO4 features depending on VIRTIO_NET_F_GUEST_CSUM. -[14] This is a common restriction in real, older network cards. +[13] This is a common restriction in real, older network cards. -[15] For example, a network packet transported between two guests on +[14] For example, a network packet transported between two guests on the same system may not require checksumming at all, nor segmentation, if both guests are amenable. -[16] For example, consider a partially checksummed TCP (IPv4) packet. +[15] For example, consider a partially checksummed TCP (IPv4) packet. It will have a 14 byte ethernet header and 20 byte IP header followed by the TCP header (with the TCP checksum field 16 bytes into that header). csum_start will be 14+20 = 34 (the TCP @@ -2932,26 +2935,26 @@ of the TCP pseudo header, so that replacing it by the ones' complement checksum of the TCP header and body will give the correct result. -[17] Due to various bugs in implementations, this field is not useful +[16] Due to various bugs in implementations, this field is not useful as a guarantee of the transport header size. -[18] This case is not handled by some older hardware, so is called out +[17] This case is not handled by some older hardware, so is called out specifically in the protocol. -[19] Note that the header will be two bytes longer for the +[18] Note that the header will be two bytes longer for the VIRTIO_NET_F_MRG_RXBUF case. -[20] Obviously each one can be split across multiple descriptor +[19] Obviously each one can be split across multiple descriptor elements. -[21] Since there are no guarentees, it can use a hash filter or +[20] Since there are no guarentees, it can use a hash filter or silently switch to allmulti or promiscuous mode if it is given too many addresses. -[22] The FLUSH and FLUSH_OUT types are equivalent, the device does not +[21] The FLUSH and FLUSH_OUT types are equivalent, the device does not distinguish between them -[23] Because this is high importance and low bandwidth, the current +[22] Because this is high importance and low bandwidth, the current Linux implementation polls for the buffer to be used, rather than waiting for an interrupt, simplifying the implementation significantly. However, for generic serial ports with the @@ -2959,9 +2962,9 @@ O_NONBLOCK flag set, the polling limitation is relaxed and the consumed buffers are freed upon the next write or poll call or when a port is closed or hot-unplugged. -[24] This is historical, and independent of the guest page size +[23] This is historical, and independent of the guest page size -[25] In this case, deflation advice is merely a courtesy +[24] In this case, deflation advice is merely a courtesy -[26] As updates to configuration space are not atomic, this field +[25] As updates to configuration space are not atomic, this field isn't particularly reliable, but can be used to diagnose buggy guests.
[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]