OASIS Mailing List ArchivesView the OASIS mailing list archive below
or browse/search using MarkMail.

 


Help: OASIS Mailing Lists Help | MarkMail Help

virtio message

[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]


Subject: Re: [virtio] [OASIS Issue Tracker] Created: (VIRTIO-3) Merge appropriate improvements from pre-OASIS draft: Rusty


OASIS Issues Tracker <workgroup_mailer@lists.oasis-open.org> writes:
> Merge appropriate improvements from pre-OASIS draft: Rusty
> ----------------------------------------------------------
>
>                  Key: VIRTIO-3
>                  URL: http://tools.oasis-open.org/issues/browse/VIRTIO-3
>              Project: OASIS Virtual I/O Device (VIRTIO) TC
>           Issue Type: Improvement
>             Reporter: Rusty Russell
>
>
> There have been various improvements to the draft since the version 0.9.5 used by the TC.

And here's the patch which does the merge of my changes since 0.9.5.

diff --git a/virtio-spec.txt b/virtio-spec.txt
index 6a3860e..d762461 100644
--- a/virtio-spec.txt
+++ b/virtio-spec.txt
@@ -178,12 +178,13 @@ required for the virtqueue according to the following formula:
 	#define ALIGN(x) (((x) + PAGE_SIZE) & ~PAGE_SIZE)
 	static inline unsigned vring_size(unsigned int qsz)
 	{
-	     return ALIGN(sizeof(struct vring_desc)*qsz + sizeof(u16)*(2 + qsz))
-	          + ALIGN(sizeof(struct vring_used_elem)*qsz);
+	     return ALIGN(sizeof(struct vring_desc)*qsz + sizeof(u16)*(3 + qsz))
+	          + ALIGN(sizeof(u16)*3 + sizeof(struct vring_used_elem)*qsz);
 	}
 
-This currently wastes some space with padding, but also allows 
-future extensions.  The virtqueue layout structure looks like this:
+This currently wastes some space with padding, but also allows future
+extensions such as the VIRTIO_RING_F_EVENT_IDX extension.  The
+virtqueue layout structure looks like this:
 
 	struct vring {
 		// The actual descriptors (16 bytes each)
@@ -216,17 +217,25 @@ is not a significant issue.
 
 2.1.4.2 Message Framing
 -----------------------
-
-The descriptors used for a buffer should not effect the semantics 
-of the message, except for the total length of the buffer. For 
-example, a network buffer consists of a 10 byte header followed 
-by the network packet. Whether this is presented in the ring 
-descriptor chain as (say) a 10 byte buffer and a 1514 byte 
-buffer, or a single 1524 byte buffer, or even three buffers, 
-should have no effect.
-
-In particular, no implementation should use the descriptor 
-boundaries to determine the size of any header in a request.[10]
+The original intent of the specification was that message framing (the
+particular layout of descriptors) be independent of the contents of
+the buffers. For example, a network transmit buffer consists of a 12
+byte header followed by the network packet. This could be most simply
+placed in the descriptor table as a 12 byte output descriptor followed
+by a 1514 byte output descriptor, but it could also consist of a
+single 1526 byte output descriptor in the case where the header and
+packet are adjacent, or even three or more descriptors (possibly with
+loss of efficiency in that case).
+
+Regrettably, initial driver implementations used simple layouts, and
+devices came to rely on it, despite this specification wording[10]. It
+is thus recommended that drivers be conservative in their assumptions,
+unless the VIRTIO_F_ANY_LAYOUT feature is accepted. In addition, some
+implementations may have large-but-reasonable restrictions on total
+descriptor size (such as based on IOV_MAX in the host OS). This has
+not been a problem in practice: little sympathy will be given to
+drivers which create unreasonably-sized descriptors such as by
+dividing a network packet into 1500 single-byte descriptors!
 
 2.1.4.3 The Virtqueue Descriptor Table
 --------------------------------------
@@ -294,10 +303,10 @@ is ignored.
 2.1.4.4 The Virtqueue Available Ring
 ------------------------------------
 
-The available ring refers to what descriptors we are offering the
-device: it refers to the head of a descriptor chain. The “flags” field
+The available ring refers to what descriptor chains we are offering the
+device: each entry refers to the head of a descriptor chain. The “flags” field
 is currently 0 or 1: 1 indicating that we do not need an interrupt
-when the device consumes a descriptor from the available
+when the device consumes a descriptor chain from the available
 ring. Alternatively, the guest can ask the device to delay interrupts
 until an entry with an index specified by the “used_event” field is
 written in the used ring (equivalently, until the idx field in the
@@ -314,7 +323,7 @@ entry (modulo the queue size). This starts at 0, and increases.
 		u16 flags;
 		u16 idx;
 		u16 ring[ /* Queue Size */ ];
-		u16 used_event;
+		u16 used_event;	/* Only if VIRTIO_RING_F_EVENT_IDX */
 	}; 
 
 2.1.4.5 The Virtqueue Used Ring
@@ -353,7 +362,7 @@ the buffer to ensure no data leakage occurs.
 		u16 flags;
 		u16 idx;
 		struct vring_used_elem ring[ /* Queue Size */];
-		u16 avail_event;
+		u16 avail_event; /* Only if VIRTIO_RING_F_EVENT_IDX */
 	};
 
 2.1.4.6 Helpers for Operating Virtqueues
@@ -492,7 +501,7 @@ above. A naive implementation would do the following:
 
 	avail->ring[avail->idx % qsz] = head;
 
-However, in general we can add many descriptors before we update 
+However, in general we can add many descriptor chains before we update 
 the “idx” field (at which point they become visible to the 
 device), so we keep a counter of how many we've added:
 
@@ -502,7 +511,7 @@ device), so we keep a counter of how many we've added:
 ----------------------------------
 
 Once the index field of the virtqueue is updated, the device will 
-be able to access the descriptor entries we've created and the 
+be able to access the descriptor chains we've created and the 
 memory they refer to. This is why a memory barrier is generally 
 used before the index update, to ensure it sees the most up-to-date 
 copy.
@@ -915,7 +924,7 @@ configuration space. The following list presents their layout:
 
 • 0x038 | W | QueueNum 
  Virtual queue size.
- Queue size is a number of elements in the queue, therefore size 
+ Queue size is the number of elements in the queue, therefore size 
   of the descriptor table and both available and used rings.
  Writing to this register notifies the Host what size of the 
   queue the Guest will use. This applies to the queue selected by 
@@ -980,7 +989,7 @@ Reading from this register returns a bit mask of interrupts
   and is accessed with byte alignment. Its meaning and size 
   depends on the device and the driver. 
 
-Virtual queue size is a number of elements in the queue, 
+Virtual queue size is the number of elements in the queue, 
 therefore size of the descriptor table and both available and 
 used rings.
 
@@ -1087,6 +1096,10 @@ Discovering what devices are available and their type is bus-dependent.
 +------------+--------------------+
 | 10         |   mac80211 wlan    |
 +------------+--------------------+
+| 11         |   rproc serial     |
++------------+--------------------+
+| 12         |   virtio CAIF      |
++------------+--------------------+
 
 2.5.1 Network Device
 ====================
@@ -1234,7 +1247,7 @@ case, the packet itself is preceeded by a header:
 		u16 csum_start;
 		u16 csum_offset;
 	/* Only if VIRTIO_NET_F_MRG_RXBUF: */
-		u16 num_buffers
+		u16 num_buffers;
 	};
 
 The controlq is used to control device features such as 
@@ -1867,7 +1880,7 @@ configuration change interrupt.
     it has previously given to the balloon, as described above. 
     This descriptor is added to the deflateq.
 
-  (b) If the VIRTIO_BALLOON_F_MUST_TELL_HOST feature is set, the 
+  (b) If the VIRTIO_BALLOON_F_MUST_TELL_HOST feature is negotiated, the 
     guest may not use these requested pages until that descriptor 
     in the deflateq has been used by the device.
 
@@ -2457,7 +2470,7 @@ contents of the event field. The following events are defined:
 2.6 Reserved Feature Bits
 =========================
 
-Currently there are five device-independent feature bits defined:
+Currently there are four device-independent feature bits defined:
 
   VIRTIO_F_NOTIFY_ON_EMPTY (24) Negotiating this feature 
   indicates that the driver wants an interrupt if the device runs 
@@ -2470,6 +2483,9 @@ Currently there are five device-independent feature bits defined:
   using a timer if the device interrupts it when all the packets 
   are transmitted.
 
+  VIRTIO_F_ANY_LAYOUT (27) This feature indicates that the device accepts arbitrary
+  descriptor layouts, as described in Section FIXME.
+
   VIRTIO_F_RING_INDIRECT_DESC (28) Negotiating this feature indicates
   that the driver can use descriptors with the VRING_DESC_F_INDIRECT
   flag set, as described in 2.3.3 Indirect Descriptors.
@@ -2486,6 +2502,10 @@ Currently there are five device-independent feature bits defined:
   ignore the avail_event field; the flags field is used
 
 
+In addition, bit 30 is used by qemu's implementation to check for experimental
+early versions of virtio which did not perform correct feature negotiation,
+and should not be used.
+
 2.7 virtio_ring.h
 =================
 
@@ -2523,11 +2543,14 @@ Currently there are five device-independent feature bits defined:
  * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  * SUCH DAMAGE.
  */
+#include <stdint.h>
 
 /* This marks a buffer as continuing via the next field. */
 #define VRING_DESC_F_NEXT       1
 /* This marks a buffer as write-only (otherwise read-only). */
 #define VRING_DESC_F_WRITE      2
+/* This means the buffer contains a list of buffer descriptors. */
+#define VRING_DESC_F_INDIRECT   4
 
 /* The Host uses this in used->flags to advise the Guest: don't kick me
  * when you add a buffer.  It's unreliable, so it's simply an
@@ -2538,6 +2561,15 @@ Currently there are five device-independent feature bits defined:
  * simply an optimization.  */
 #define VRING_AVAIL_F_NO_INTERRUPT      1
 
+/* Support for indirect descriptors */
+#define VIRTIO_RING_F_INDIRECT_DESC    28
+
+/* Support for avail_idx and used_idx fields */
+#define VIRTIO_RING_F_EVENT_IDX        29
+
+/* Arbitrary descriptor layouts. */
+#define VIRTIO_F_ANY_LAYOUT            27
+
 /* Virtio ring descriptors: 16 bytes.
  * These can chain together via "next". */
 struct vring_desc {
@@ -2555,7 +2587,7 @@ struct vring_avail {
         uint16_t flags;
         uint16_t idx;
         uint16_t ring[];
-        uint16_t used_event;
+        /* Only if VIRTIO_RING_F_EVENT_IDX: uint16_t used_event; */
 };
 
 /* u32 is used here for ids for padding reasons. */
@@ -2570,7 +2602,7 @@ struct vring_used {
         uint16_t flags;
         uint16_t idx;
         struct vring_used_elem ring[];
-        uint16_t avail_event;
+        /* Only if VIRTIO_RING_F_EVENT_IDX: uint16_t avail_event; */
 };
 
 struct vring {
@@ -2592,14 +2624,16 @@ struct vring {
  *      __u16 avail_flags;
  *      __u16 avail_idx;
  *      __u16 available[num];
+ *      __u16 used_event_idx; // Only if VIRTIO_RING_F_EVENT_IDX
  *
  *      // Padding to the next align boundary.
  *      char pad[];
  *
  *      // A ring of used descriptor heads with free-running index.
  *      __u16 used_flags;
- *      __u16 EVENT_IDX;
+ *      __u16 used_idx;
  *      struct vring_used_elem used[num];
+ *      __u16 avail_event_idx; // Only if VIRTIO_RING_F_EVENT_IDX
  * };
  * Note: for virtio PCI, align is 4096.
  */
@@ -2609,14 +2643,14 @@ static inline void vring_init(struct vring *vr, unsigned int num, void *p,
         vr->num = num;
         vr->desc = p;
         vr->avail = p + num*sizeof(struct vring_desc);
-        vr->used = (void *)(((unsigned long)&vr->avail->ring[num]
+        vr->used = (void *)(((unsigned long)&vr->avail->ring[num] + sizeof(uint16_t)
                               + align-1)
                             & ~(align - 1));
 }
 
 static inline unsigned vring_size(unsigned int num, unsigned long align)
 {
-        return ((sizeof(struct vring_desc)*num + sizeof(uint16_t)*(2+num)
+        return ((sizeof(struct vring_desc)*num + sizeof(uint16_t)*(3+num)
                  + align - 1) & ~(align - 1))
                 + sizeof(uint16_t)*3 + sizeof(struct vring_used_elem)*num;
 }
@@ -2625,6 +2659,19 @@ static inline int vring_need_event(uint16_t event_idx, uint16_t new_idx, uint16_
 {
          return (uint16_t)(new_idx - event_idx - 1) < (uint16_t)(new_idx - old_idx); 
 }
+
+/* Get location of event indices (only with VIRTIO_RING_F_EVENT_IDX) */
+static inline uint16_t *vring_used_event(struct vring *vr)
+{
+        /* For backwards compat, used event index is at *end* of avail ring. */
+        return &vr->avail->ring[vr->num];
+}
+
+static inline uint16_t *vring_avail_event(struct vring *vr)
+{
+        /* For backwards compat, avail event index is at *end* of used ring. */
+        return (uint16_t *)&vr->used->ring[vr->num];
+}
 #endif /* VIRTIO_RING_H */
 
 
@@ -2733,9 +2780,11 @@ of this expected condition is necessary.
 
 [9] https://lists.linux-foundation.org/mailman/listinfo/virtualization
 
-[10] The current qemu device implementations mistakenly insist that 
-the first descriptor cover the header in these cases exactly, so 
-a cautious driver should arrange it so.
+[10] It was previously asserted that framing should be independent of message
+contents, yet invariably drivers layed out messages in reliable ways and
+devices assumed it.
+In addition, the specifications for virtio_blk and virtio_scsi require
+intuiting field lengths from frame boundaries.
 
 [11] Even if it does mean documenting design or implementation 
 mistakes!



[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]