Subject: Re: [virtio] Groups - Action Item "Create text version of virtio 0.9.5 document" added
On Wed, Jul 31, 2013 at 01:33:39PM +0930, Rusty Russell wrote: > Rusty Russell <rusty@au1.ibm.com> writes: > > ----------------- > > Action Item Subject: Create text version of virtio 0.9.5 document > > OK, I've attached this below. This is exactly the spec as 0.9.5 > reformatted into text so we can work it. I copied virtio-comment which is an open-list. Hope that's OK with everyone. People aren't auto-subscribed there but if everyone is willing to subscribe to virtio-comment, we could reserve the virtio@ members-only list to administrative issues only? > > I'm still waiting for virtio to be added to the OASIS issue tracking > system, where I will be prompting you all to open an issue for every > change we want to consider. > > The obvious issues I want to open are: > > o Major rework to make PCI an appendix, and core bus-independent. > o Update spec with changes/fixes since 0.9.5 > - This is easy where contributors are already members, > trickier for others. > > In addition here's a brain dump: > > o Endian for config space > - LE everywhere? > o Endian for ring > - LE as well? > o Allow arbitrary descriptor layouts / message framing. > o Method to stop activity on a queue? > o Size descriptor table independent of ringsize? > o Remove VIRTIO_F_NOTIFY_ON_EMPTY? > o Remove limit on # indirect descriptors > - Some other limit? > o Simplify indirect desc > - No return to top level on end of desc array > o Allow chained indirect desc > - indirect bit use to chain? > > Net: > o Remove VIRTIO_NET_F_GSO? > > Block: > o Remove VIRTIO_BLK_F_SCSI? > o Revisit flush/barrier semantics? > > PCI: > o New capability layout > o Allowing non-zero revision numbers? > o Remove 'align' and use explicit addresses for used/avail. > > Balloon: > o Fix endianness > o Remove VIRTIO_BALLOON_F_MUST_TELL_HOST? > o Remove outgoing page queue? > > Cheers, > Rusty. > > ==== > This document describes the specifications of the “virtio” family > of PCI devices. 
> These devices are found in virtual environments,
> yet by design they are not all that different from physical PCI
> devices, and this document treats them as such. This allows the
> guest to use standard PCI drivers and discovery mechanisms.
>
> The purpose of virtio and this specification is that virtual
> environments and guests should have a straightforward, efficient,
> standard and extensible mechanism for virtual devices, rather
> than boutique per-environment or per-OS mechanisms.
>
> Straightforward: Virtio PCI devices use normal PCI mechanisms
> of interrupts and DMA which should be familiar to any device
> driver author. There is no exotic page-flipping or COW
> mechanism: it's just a PCI device.[1]
>
> Efficient: Virtio PCI devices consist of rings of descriptors
> for input and output, which are neatly separated to avoid cache
> effects from both guest and device writing to the same cache
> lines.
>
> Standard: Virtio PCI makes no assumptions about the environment
> in which it operates, beyond supporting PCI. In fact the virtio
> devices specified in the appendices do not require PCI at all:
> they have been implemented on non-PCI buses.[2]
>
> Extensible: Virtio PCI devices contain feature bits which are
> acknowledged by the guest operating system during device setup.
> This allows forwards and backwards compatibility: the device
> offers all the features it knows about, and the driver
> acknowledges those it understands and wishes to use.
>
> 1.1 Virtqueues
>
> The mechanism for bulk data transport on virtio PCI devices is
> pretentiously called a virtqueue. Each device can have zero or
> more virtqueues: for example, the network device has one for
> transmit and one for receive.
> > Each virtqueue occupies two or more physically-contiguous pages > (defined, for the purposes of this specification, as 4096 bytes), > and consists of three parts: > > > +-------------------+-----------------------------------+-----------+ > | Descriptor Table | Available Ring (padding) | Used Ring | > +-------------------+-----------------------------------+-----------+ > > > When the driver wants to send a buffer to the device, it fills in > a slot in the descriptor table (or chains several together), and > writes the descriptor index into the available ring. It then > notifies the device. When the device has finished a buffer, it > writes the descriptor into the used ring, and sends an interrupt. > > Specification > > 2.1 PCI Discovery > > Any PCI device with Vendor ID 0x1AF4, and Device ID 0x1000 through > 0x103F inclusive is a virtio device[3]. The device must also have a > Revision ID of 0 to match this specification. > > The Subsystem Device ID indicates which virtio device is > supported by the device. The Subsystem Vendor ID should reflect > the PCI Vendor ID of the environment (it's currently only used > for informational purposes by the guest). 
>
>
> +----------------------+--------------------+---------------+
> | Subsystem Device ID  | Virtio Device      | Specification |
> +----------------------+--------------------+---------------+
> | 1                    | network card       | Appendix C    |
> +----------------------+--------------------+---------------+
> | 2                    | block device       | Appendix D    |
> +----------------------+--------------------+---------------+
> | 3                    | console            | Appendix E    |
> +----------------------+--------------------+---------------+
> | 4                    | entropy source     | Appendix F    |
> +----------------------+--------------------+---------------+
> | 5                    | memory ballooning  | Appendix G    |
> +----------------------+--------------------+---------------+
> | 6                    | ioMemory           | -             |
> +----------------------+--------------------+---------------+
> | 7                    | rpmsg              | -             |
> +----------------------+--------------------+---------------+
> | 8                    | SCSI host          | Appendix I    |
> +----------------------+--------------------+---------------+
> | 9                    | 9P transport       | -             |
> +----------------------+--------------------+---------------+
> | 10                   | mac80211 wlan      | -             |
> +----------------------+--------------------+---------------+
>
>
> 2.2 Device Configuration
>
> To configure the device, we use the first I/O region of the PCI
> device. This contains a virtio header followed by a
> device-specific region.
>
> There may be different widths of accesses to the I/O region; the
> “natural” access method for each field in the virtio header must be
> used (i.e. 32-bit accesses for 32-bit fields, etc), but the
> device-specific region can be accessed using any width accesses, and
> should obtain the same results.
>
> Note that this is possible because while the virtio header is PCI
> (i.e. little) endian, the device-specific region is encoded in
> the native endian of the guest (where such distinction is
> applicable).
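For anyone skimming, the discovery rule from section 2.1 and the Subsystem Device ID table can be sketched roughly as below. This is only an illustration: struct pci_ids, the macro names and both function names are made up for this example and are not part of the spec or of any PCI library. A real driver would read these IDs through its platform's PCI configuration accessors.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* ID values from section 2.1; the macro names are illustrative. */
#define VIRTIO_PCI_VENDOR_ID     0x1AF4
#define VIRTIO_PCI_DEVICE_ID_MIN 0x1000
#define VIRTIO_PCI_DEVICE_ID_MAX 0x103F

/* Hypothetical snapshot of a PCI function's ID registers. */
struct pci_ids {
    uint16_t vendor_id;
    uint16_t device_id;
    uint8_t  revision_id;
    uint16_t subsystem_device_id;
};

/* True if the IDs match a virtio device covered by this spec
 * (Vendor ID, Device ID range, and Revision ID 0). */
static bool is_virtio_pci_device(const struct pci_ids *ids)
{
    assert(ids != NULL);
    return ids->vendor_id == VIRTIO_PCI_VENDOR_ID &&
           ids->device_id >= VIRTIO_PCI_DEVICE_ID_MIN &&
           ids->device_id <= VIRTIO_PCI_DEVICE_ID_MAX &&
           ids->revision_id == 0;
}

/* Map a Subsystem Device ID to the device type named in the table,
 * or NULL for IDs the table does not list. */
static const char *virtio_device_type(uint16_t subsystem_device_id)
{
    static const char *const names[] = {
        NULL, "network card", "block device", "console",
        "entropy source", "memory ballooning", "ioMemory", "rpmsg",
        "SCSI host", "9P transport", "mac80211 wlan",
    };

    if (subsystem_device_id >= sizeof(names) / sizeof(names[0]))
        return NULL;
    return names[subsystem_device_id];
}
```

A caller would first check is_virtio_pci_device() and only then use the Subsystem Device ID to pick a driver.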
>
> 2.2.1 Device Initialization Sequence
>
> We start with an overview of device initialization, then expand
> on the details of the device and how each step is performed.
>
> 1. Reset the device. This is not required on initial start up.
>
> 2. The ACKNOWLEDGE status bit is set: we have noticed the device.
>
> 3. The DRIVER status bit is set: we know how to drive the device.
>
> 4. Device-specific setup, including reading the Device Feature
>    Bits, discovery of virtqueues for the device, optional MSI-X
>    setup, and reading and possibly writing the virtio
>    configuration space.
>
> 5. The subset of Device Feature Bits understood by the driver is
>    written to the device.
>
> 6. The DRIVER_OK status bit is set.
>
> 7. The device can now be used (i.e. buffers added to the
>    virtqueues)[4]
>
> If any of these steps go irrecoverably wrong, the guest should
> set the FAILED status bit to indicate that it has given up on the
> device (it can reset the device later to restart if desired).
>
> We now cover the fields required for general setup in detail.
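To make the ordering concrete, here is a minimal sketch of steps 1 to 6 against an in-memory stand-in for the header. The status bit values are the ones given in section 2.2.2.1; everything else (struct virtio_hdr_model, the STATUS_* macro names, virtio_init_sketch() and its required_features parameter) is invented for illustration. Real code would use port I/O on the first I/O region, and step 4 would also cover queue and MSI-X setup, which is left out here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Status bit values from section 2.2.2.1; macro names are illustrative. */
#define STATUS_ACKNOWLEDGE 1u
#define STATUS_DRIVER      2u
#define STATUS_DRIVER_OK   4u
#define STATUS_FAILED      128u

/* Hypothetical in-memory model of three virtio header fields. */
struct virtio_hdr_model {
    uint32_t device_features;  /* offered by the device, read-only */
    uint32_t guest_features;   /* written by the driver */
    uint8_t  device_status;
};

/* Runs steps 1-6 of the initialization sequence on the model.
 * Returns false and sets FAILED if a feature the driver cannot work
 * without is not offered. */
static bool virtio_init_sketch(struct virtio_hdr_model *hdr,
                               uint32_t driver_features,
                               uint32_t required_features)
{
    assert(hdr != NULL);

    hdr->device_status = 0;                    /* 1. reset (write 0) */
    hdr->guest_features = 0;                   /* the model forgets old state */
    hdr->device_status |= STATUS_ACKNOWLEDGE;  /* 2. device noticed */
    hdr->device_status |= STATUS_DRIVER;       /* 3. we can drive it */

    /* 4-5. read the offered bits and keep only those we understand */
    uint32_t negotiated = hdr->device_features & driver_features;
    if ((negotiated & required_features) != required_features) {
        hdr->device_status |= STATUS_FAILED;   /* give up on the device */
        return false;
    }
    hdr->guest_features = negotiated;

    hdr->device_status |= STATUS_DRIVER_OK;    /* 6. ready to use */
    return true;
}
```

Masking the offered bits with the driver's own set is what gives the forwards and backwards compatibility: neither side ever sees a bit it did not itself name.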
>
> 2.2.2 Virtio Header
>
> The virtio header looks as follows:
>
>
> +------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+
> | Bits       || 32                  | 32                  | 32       | 16     | 16      | 16      | 8       | 8      |
> +------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+
> | Read/Write || R                   | R+W                 | R+W      | R      | R+W     | R+W     | R+W     | R      |
> +------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+
> | Purpose    || Device              | Guest               | Queue    | Queue  | Queue   | Queue   | Device  | ISR    |
> |            || Features bits 0:31  | Features bits 0:31  | Address  | Size   | Select  | Notify  | Status  | Status |
> +------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+
>
>
> If MSI-X is enabled for the device, two additional fields
> immediately follow this header:[5]
>
>
> +------------++----------------+--------+
> | Bits       || 16             | 16     |
> +------------++----------------+--------+
> | Read/Write || R+W            | R+W    |
> +------------++----------------+--------+
> | Purpose    || Configuration  | Queue  |
> | (MSI-X)    || Vector         | Vector |
> +------------++----------------+--------+
>
>
> Immediately following these general headers, there may be
> device-specific headers:
>
>
> +------------++--------------------+
> | Bits       || Device Specific    |
> +------------++--------------------+
> | Read/Write || Device Specific    |
> +------------++--------------------+
> | Purpose    || Device Specific... |
> |            ||                    |
> +------------++--------------------+
>
>
> 2.2.2.1 Device Status
>
> The Device Status field is updated by the guest to indicate its
> progress. This provides a simple low-level diagnostic: it's most
> useful to imagine them hooked up to traffic lights on the console
> indicating the status of each device.
>
> The device can be reset by writing a 0 to this field, otherwise
> at least one bit should be set:
>
> ACKNOWLEDGE (1) Indicates that the guest OS has found the
>   device and recognized it as a valid virtio device.
>
> DRIVER (2) Indicates that the guest OS knows how to drive the
>   device. Under Linux, drivers can be loadable modules so there
>   may be a significant (or infinite) delay before setting this
>   bit.
>
> DRIVER_OK (4) Indicates that the driver is set up and ready to
>   drive the device.
>
> FAILED (128) Indicates that something went wrong in the guest,
>   and it has given up on the device. This could be an internal
>   error, or the driver didn't like the device for some reason, or
>   even a fatal error during device operation. The device must be
>   reset before attempting to re-initialize.
>
> 2.2.2.2 Feature Bits
>
> The first configuration field indicates the features that the
> device supports. The bits are allocated as follows:
>
> 0 to 23 Feature bits for the specific device type
>
> 24 to 31 Feature bits reserved for extensions to the queue and
>   feature negotiation mechanisms
>
> For example, feature bit 0 for a network device (i.e. Subsystem
> Device ID 1) indicates that the device supports checksumming of
> packets.
>
> The feature bits are negotiated: the device lists all the
> features it understands in the Device Features field, and the
> guest writes the subset that it understands into the Guest
> Features field. The only way to renegotiate is to reset the
> device.
>
> In particular, new fields in the device configuration header are
> indicated by offering a feature bit, so the guest can check
> before accessing that part of the configuration space.
>
> This allows for forwards and backwards compatibility: if the
> device is enhanced with a new feature bit, older guests will not
> write that feature bit back to the Guest Features field and it
> can go into backwards compatibility mode.
> Similarly, if a guest
> is enhanced with a feature that the device doesn't support, it
> will not see that feature bit in the Device Features field and
> can go into backwards compatibility mode (or, for poor
> implementations, set the FAILED Device Status bit).
>
> 2.2.2.3 Configuration/Queue Vectors
>
> When MSI-X capability is present and enabled in the device
> (through standard PCI configuration space) 4 bytes at byte offset
> 20 are used to map configuration change and queue interrupts to
> MSI-X vectors. In this case, the ISR Status field is unused, and
> device specific configuration starts at byte offset 24 in virtio
> header structure. When MSI-X capability is not enabled, device
> specific configuration starts at byte offset 20 in virtio header.
>
> Writing a valid MSI-X Table entry number, 0 to 0x7FF, to one of
> Configuration/Queue Vector registers, maps interrupts triggered
> by the configuration change/selected queue events respectively to
> the corresponding MSI-X vector. To disable interrupts for a
> specific event type, unmap it by writing a special NO_VECTOR
> value:
>
>     /* Vector value used to disable MSI for queue */
>     #define VIRTIO_MSI_NO_VECTOR 0xffff
>
> Reading these registers returns the vector mapped to a given event,
> or NO_VECTOR if unmapped. All queue and configuration change
> events are unmapped by default.
>
> Note that mapping an event to a vector might require allocating
> internal device resources, and might fail. Devices report such
> failures by returning the NO_VECTOR value when the relevant
> Vector field is read. After mapping an event to a vector, the
> driver must verify success by reading the Vector field value: on
> success, the previously written value is returned, and on
> failure, NO_VECTOR is returned. If a mapping failure is detected,
> the driver can retry mapping with fewer vectors, or disable MSI-X.
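The write-then-read-back rule for the Vector registers can be sketched as follows. VIRTIO_MSI_NO_VECTOR and the 0 to 0x7FF range come from the text above; struct vector_reg_model, vector_write(), vector_read() and map_event() are hypothetical stand-ins for real register accesses. The fallback to a shared entry is just one possible reading of "retry with fewer vectors", not something the spec prescribes.

```c
#include <assert.h>
#include <stdint.h>

#define VIRTIO_MSI_NO_VECTOR 0xffff
#define MSIX_MAX_ENTRY       0x7FF  /* highest valid MSI-X Table entry number */

/* Hypothetical model of one Configuration/Queue Vector register. */
struct vector_reg_model {
    uint16_t value;           /* what a read returns */
    uint16_t usable_entries;  /* entries the model device can back */
};

/* Model of a register write: mapping fails for entries the device
 * cannot back, and a later read then reports NO_VECTOR. */
static void vector_write(struct vector_reg_model *reg, uint16_t v)
{
    if (v != VIRTIO_MSI_NO_VECTOR && v >= reg->usable_entries)
        reg->value = VIRTIO_MSI_NO_VECTOR;
    else
        reg->value = v;
}

static uint16_t vector_read(const struct vector_reg_model *reg)
{
    return reg->value;
}

/* Try a dedicated entry first, then a shared one, verifying each
 * write by reading it back. Returns the entry in use, or
 * VIRTIO_MSI_NO_VECTOR if the event had to be left unmapped. */
static uint16_t map_event(struct vector_reg_model *reg,
                          uint16_t dedicated, uint16_t shared)
{
    assert(dedicated <= MSIX_MAX_ENTRY && shared <= MSIX_MAX_ENTRY);

    vector_write(reg, dedicated);
    if (vector_read(reg) == dedicated)
        return dedicated;

    vector_write(reg, shared);
    if (vector_read(reg) == shared)
        return shared;

    vector_write(reg, VIRTIO_MSI_NO_VECTOR);
    return VIRTIO_MSI_NO_VECTOR;
}
```

If map_event() ends up returning VIRTIO_MSI_NO_VECTOR for an event the driver needs, the remaining option from the text is to disable MSI-X altogether.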
> > 2.3 Virtqueue Configuration > > As a device can have zero or more virtqueues for bulk data > transport (for example, the network driver has two), the driver > needs to configure them as part of the device-specific > configuration. > > This is done as follows, for each virtqueue a device has: > > 1. Write the virtqueue index (first queue is 0) to the Queue > Select field. > > 2. Read the virtqueue size from the Queue Size field, which is > always a power of 2. This controls how big the virtqueue is > (see below). If this field is 0, the virtqueue does not exist. > > 3. Allocate and zero virtqueue in contiguous physical memory, on > a 4096 byte alignment. Write the physical address, divided by > 4096 to the Queue Address field.[6] > > 4. Optionally, if MSI-X capability is present and enabled on the > device, select a vector to use to request interrupts triggered > by virtqueue events. Write the MSI-X Table entry number > corresponding to this vector in Queue Vector field. Read the > Queue Vector field: on success, previously written value is > returned; on failure, NO_VECTOR value is returned. > > The Queue Size field controls the total number of bytes required > for the virtqueue according to the following formula: > > #define ALIGN(x) (((x) + 4095) & ~4095) > > static inline unsigned vring_size(unsigned int qsz) > { > return ALIGN(sizeof(struct vring_desc)*qsz + sizeof(u16)*(2 + qsz)) > + ALIGN(sizeof(struct vring_used_elem)*qsz); > } > > This currently wastes some space with padding, but also allows > future extensions. The virtqueue layout structure looks like this > (qsz is the Queue Size field, which is a variable, so this code > won't compile): > > struct vring { > /* The actual descriptors (16 bytes each) */ > struct vring_desc desc[qsz]; > > /* A ring of available descriptor heads with free-running index. */ > struct vring_avail avail; > > // Padding to the next 4096 boundary. > char pad[]; > > // A ring of used descriptor heads with free-running index. 
> struct vring_used used; > }; > > 2.3.1 A Note on Virtqueue Endianness > > Note that the endian of these fields and everything else in the > virtqueue is the native endian of the guest, not little-endian as > PCI normally is. This makes for simpler guest code, and it is > assumed that the host already has to be deeply aware of the guest > endian so such an “endian-aware” device is not a significant > issue. > > 2.3.2 Descriptor Table > > The descriptor table refers to the buffers the guest is using for > the device. The addresses are physical addresses, and the buffers > can be chained via the next field. Each descriptor describes a > buffer which is read-only or write-only, but a chain of > descriptors can contain both read-only and write-only buffers. > > No descriptor chain may be more than 2^32 bytes long in total. > > struct vring_desc { > /* Address (guest-physical). */ > u64 addr; > /* Length. */ > u32 len; > > /* This marks a buffer as continuing via the next field. */ > #define VRING_DESC_F_NEXT 1 > /* This marks a buffer as write-only (otherwise read-only). */ > #define VRING_DESC_F_WRITE 2 > /* This means the buffer contains a list of buffer descriptors. */ > #define VRING_DESC_F_INDIRECT 4 > /* The flags as indicated above. */ > u16 flags; > /* Next field if flags & NEXT */ > u16 next; > }; > > The number of descriptors in the table is specified by the Queue > Size field for this virtqueue. > > 2.3.3 Indirect Descriptors > > Some devices benefit by concurrently dispatching a large number > of large requests. The VIRTIO_RING_F_INDIRECT_DESC feature can be > used to allow this (see Appendix B: Reserved Feature Bits). To increase > ring capacity it is possible to store a table of indirect > descriptors anywhere in memory, and insert a descriptor in main > virtqueue (with flags&INDIRECT on) that refers to memory buffer > containing this indirect descriptor table; fields addr and len > refer to the indirect table address and length in bytes, > respectively. 
The indirect table layout structure looks like this > (len is the length of the descriptor that refers to this table, > which is a variable, so this code won't compile): > > struct indirect_descriptor_table { > /* The actual descriptors (16 bytes each) */ > struct vring_desc desc[len / 16]; > }; > > The first indirect descriptor is located at start of the indirect > descriptor table (index 0), additional indirect descriptors are > chained by next field. An indirect descriptor without next field > (with flags&NEXT off) signals the end of the indirect descriptor > table, and transfers control back to the main virtqueue. An > indirect descriptor can not refer to another indirect descriptor > table (flags&INDIRECT must be off). A single indirect descriptor > table can include both read-only and write-only descriptors; > write-only flag (flags&WRITE) in the descriptor that refers to it > is ignored. > > 2.3.4 Available Ring > > The available ring refers to what descriptors we are offering the > device: it refers to the head of a descriptor chain. The “flags” field > is currently 0 or 1: 1 indicating that we do not need an interrupt > when the device consumes a descriptor from the available > ring. Alternatively, the guest can ask the device to delay interrupts > until an entry with an index specified by the “ used_event” field is > written in the used ring (equivalently, until the idx field in the > used ring will reach the value used_event + 1). The method employed by > the device is controlled by the VIRTIO_RING_F_EVENT_IDX feature bit > (see Appendix B: Reserved Feature Bits). This interrupt suppression is > merely an optimization; it may not suppress interrupts entirely. > > The “idx” field indicates where we would put the next descriptor > entry (modulo the ring size). This starts at 0, and increases. 
> > struct vring_avail { > #define VRING_AVAIL_F_NO_INTERRUPT 1 > u16 flags; > u16 idx; > u16 ring[qsz]; /* qsz is the Queue Size field read from device */ > u16 used_event; > }; > > 2.3.5 Used Ring > > The used ring is where the device returns buffers once it is done > with them. The flags field can be used by the device to hint that > no notification is necessary when the guest adds to the available > ring. Alternatively, the “avail_event” field can be used by the > device to hint that no notification is necessary until an entry > with an index specified by the “avail_event” is written in the > available ring (equivalently, until the idx field in the > available ring will reach the value avail_event + 1). The method > employed by the device is controlled by the guest through the > VIRTIO_RING_F_EVENT_IDX feature bit (see Appendix B: Reserved > Feature Bits).[7] > > Each entry in the ring is a pair: the head entry of the > descriptor chain describing the buffer (this matches an entry > placed in the available ring by the guest earlier), and the total > of bytes written into the buffer. The latter is extremely useful > for guests using untrusted buffers: if you do not know exactly > how much has been written by the device, you usually have to zero > the buffer to ensure no data leakage occurs. > > /* u32 is used here for ids for padding reasons. */ > struct vring_used_elem { > /* Index of start of used descriptor chain. */ > u32 id; > /* Total length of the descriptor chain which was used (written to) */ > u32 len; > }; > > struct vring_used { > #define VRING_USED_F_NO_NOTIFY 1 > u16 flags; > u16 idx; > struct vring_used_elem ring[qsz]; > u16 avail_event; > }; > > 2.3.6 Helpers for Managing Virtqueues > > The Linux Kernel Source code contains the definitions above and > helper routines in a more usable form, in > include/linux/virtio_ring.h. 
This was explicitly licensed by IBM > and Red Hat under the (3-clause) BSD license so that it can be > freely used by all other projects, and is reproduced (with slight > variation to remove Linux assumptions) in Appendix A. > > 2.4 Device Operation > > There are two parts to device operation: supplying new buffers to > the device, and processing used buffers from the device. As an > example, the virtio network device has two virtqueues: the > transmit virtqueue and the receive virtqueue. The driver adds > outgoing (read-only) packets to the transmit virtqueue, and then > frees them after they are used. Similarly, incoming (write-only) > buffers are added to the receive virtqueue, and processed after > they are used. > > 2.4.1 Supplying Buffers to The Device > > Actual transfer of buffers from the guest OS to the device > operates as follows: > > 1. Place the buffer(s) into free descriptor(s). > > (a) If there are no free descriptors, the guest may choose to > notify the device even if notifications are suppressed (to > reduce latency).[8] > > 2. Place the id of the buffer in the next ring entry of the > available ring. > > 3. The steps (1) and (2) may be performed repeatedly if batching > is possible. > > 4. A memory barrier should be executed to ensure the device sees > the updated descriptor table and available ring before the next > step. > > 5. The available “idx” field should be increased by the number of > entries added to the available ring. > > 6. A memory barrier should be executed to ensure that we update > the idx field before checking for notification suppression. > > 7. If notifications are not suppressed, the device should be > notified of the new buffers. > > Note that the above code does not take precautions against the > available ring buffer wrapping around: this is not possible since > the ring buffer is the same size as the descriptor table, so step > (1) will prevent such a condition. 
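Step 7 above ("if notifications are not suppressed") hides two different checks, depending on whether VIRTIO_RING_F_EVENT_IDX was negotiated. A rough sketch of that decision follows, using the wrap-safe comparison the spec gives in section 2.4.1.4. The should_notify() wrapper and its parameter names are made up for illustration, and the caller is assumed to have executed the memory barrier from step 6 before reading used_flags or avail_event.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VRING_USED_F_NO_NOTIFY 1

/* Wrap-safe test from section 2.4.1.4: has the index moved past
 * event_idx while going from old_idx to new_idx? All arithmetic is
 * modulo 65536. */
static bool vring_need_event(uint16_t event_idx, uint16_t new_idx,
                             uint16_t old_idx)
{
    return (uint16_t)(new_idx - event_idx - 1) < (uint16_t)(new_idx - old_idx);
}

/* Decide whether the driver should write to Queue Notify.
 * old_idx is avail->idx at the last notification, new_idx its value
 * now; used_flags and avail_event are read from the used ring. */
static bool should_notify(bool event_idx_negotiated, uint16_t used_flags,
                          uint16_t avail_event,
                          uint16_t old_idx, uint16_t new_idx)
{
    assert(new_idx != old_idx);  /* only called after adding buffers */

    if (!event_idx_negotiated)
        return (used_flags & VRING_USED_F_NO_NOTIFY) == 0;
    return vring_need_event(avail_event, new_idx, old_idx);
}
```

Notifying when it was not needed is harmless, so a driver that is unsure can always fall back to notifying.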
> > In addition, the maximum queue size is 32768 (it must be a power > of 2 which fits in 16 bits), so the 16-bit “idx” value can always > distinguish between a full and empty buffer. > > Here is a description of each stage in more detail. > > 2.4.1.1 Placing Buffers Into The Descriptor Table > > A buffer consists of zero or more read-only physically-contiguous > elements followed by zero or more physically-contiguous > write-only elements (it must have at least one element). This > algorithm maps it into the descriptor table: > > 1. for each buffer element, b: > > (a) Get the next free descriptor table entry, d > > (b) Set d.addr to the physical address of the start of b > > (c) Set d.len to the length of b. > > (d) If b is write-only, set d.flags to VRING_DESC_F_WRITE, > otherwise 0. > > (e) If there is a buffer element after this: > > i. Set d.next to the index of the next free descriptor > element. > > ii. Set the VRING_DESC_F_NEXT bit in d.flags. > > In practice, the d.next fields are usually used to chain free > descriptors, and a separate count kept to check there are enough > free descriptors before beginning the mappings. > > 2.4.1.2 Updating The Available Ring > > The head of the buffer we mapped is the first d in the algorithm > above. A naive implementation would do the following: > > avail->ring[avail->idx % qsz] = head; > > However, in general we can add many descriptors before we update > the “idx” field (at which point they become visible to the > device), so we keep a counter of how many we've added: > > avail->ring[(avail->idx + added++) % qsz] = head; > > 2.4.1.3 Updating The Index Field > > Once the idx field of the virtqueue is updated, the device will > be able to access the descriptor entries we've created and the > memory they refer to. This is why a memory barrier is generally > used before the idx update, to ensure it sees the most up-to-date > copy. 
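Putting 2.4.1.1 to 2.4.1.3 together, a minimal sketch of the driver side might look like this. struct vring_desc and the two flag values are the ones from section 2.3.2; struct buf_elem, struct vq_model and the vq_* functions are invented bookkeeping for this example. The C11 atomic_thread_fence() is only a stand-in for whatever barrier the platform needs to make the writes visible to the device.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define VRING_DESC_F_NEXT  1
#define VRING_DESC_F_WRITE 2

struct vring_desc {
    uint64_t addr;
    uint32_t len;
    uint16_t flags;
    uint16_t next;
};

/* One physically-contiguous buffer element (hypothetical helper type). */
struct buf_elem {
    uint64_t addr;
    uint32_t len;
    bool     device_writes;  /* true for write-only elements */
};

/* Hypothetical driver-side view of one virtqueue. */
struct vq_model {
    uint16_t qsz;
    struct vring_desc *desc;
    uint16_t *avail_ring;
    uint16_t avail_idx;  /* stands in for avail->idx */
    uint16_t free_head;  /* first descriptor on the free list */
    uint16_t num_free;
    uint16_t added;      /* heads queued but not yet published */
};

static void vq_init(struct vq_model *vq, uint16_t qsz,
                    struct vring_desc *desc, uint16_t *avail_ring)
{
    assert(qsz != 0 && (qsz & (qsz - 1)) == 0);  /* power of 2 */
    vq->qsz = qsz;
    vq->desc = desc;
    vq->avail_ring = avail_ring;
    vq->avail_idx = 0;
    vq->free_head = 0;
    vq->num_free = qsz;
    vq->added = 0;
    for (uint16_t i = 0; i < qsz; i++)  /* chain free descriptors via next */
        desc[i].next = (uint16_t)((i + 1) % qsz);
}

/* Map n elements into a descriptor chain and queue its head.
 * Returns the head index, or -1 if there are too few free descriptors.
 * The caller must list read-only elements before write-only ones. */
static int vq_add_chain(struct vq_model *vq, const struct buf_elem *elems,
                        uint16_t n)
{
    if (n == 0 || n > vq->num_free)
        return -1;

    uint16_t head = vq->free_head;
    for (uint16_t i = 0; i < n; i++) {
        struct vring_desc *d = &vq->desc[vq->free_head];
        vq->free_head = d->next;      /* pop from the free list */
        d->addr = elems[i].addr;
        d->len = elems[i].len;
        d->flags = elems[i].device_writes ? VRING_DESC_F_WRITE : 0;
        if (i + 1 < n) {
            d->next = vq->free_head;  /* index of the next free descriptor */
            d->flags |= VRING_DESC_F_NEXT;
        }
    }
    vq->num_free -= n;

    vq->avail_ring[(vq->avail_idx + vq->added++) % vq->qsz] = head;
    return head;
}

/* Make everything queued so far visible by bumping the index. */
static void vq_publish(struct vq_model *vq)
{
    atomic_thread_fence(memory_order_seq_cst);  /* descriptors before idx */
    vq->avail_idx += vq->added;                 /* wraps naturally at 65536 */
    vq->added = 0;
}
```

Several vq_add_chain() calls can be batched before one vq_publish(), which is the batching that step 3 of section 2.4.1 allows.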
>
> The idx field always increments, and we let it wrap naturally at
> 65536:
>
>     avail->idx += added;
>
> 2.4.1.4 Notifying The Device
>
> Device notification occurs by writing the 16-bit virtqueue index
> of this virtqueue to the Queue Notify field of the virtio header
> in the first I/O region of the PCI device. This can be expensive,
> however, so the device can suppress such notifications if it
> doesn't need them. We have to be careful to expose the new idx
> value before checking the suppression flag: it's OK to notify
> gratuitously, but not to omit a required notification. So again,
> we use a memory barrier here before reading the flags or the
> avail_event field.
>
> If the VIRTIO_RING_F_EVENT_IDX feature is not negotiated, and if
> the VRING_USED_F_NO_NOTIFY flag is not set, we go ahead and write
> to the Queue Notify field.
>
> If the VIRTIO_RING_F_EVENT_IDX feature is negotiated, we read the
> avail_event field in the used ring structure. If the
> available index crossed the avail_event field value since the
> last notification, we go ahead and write to the Queue Notify
> field. The avail_event field wraps naturally at 65536 as well:
>
>     (u16)(new_idx - avail_event - 1) < (u16)(new_idx - old_idx)
>
> 2.4.2 Receiving Used Buffers From The Device
>
> Once the device has used a buffer (read from or written to it, or
> parts of both, depending on the nature of the virtqueue and the
> device), it sends an interrupt, following an algorithm very
> similar to the algorithm used for the driver to send the device a
> buffer:
>
> 1. Write the head descriptor number to the next field in the used
>    ring.
>
> 2. Update the used ring idx.
>
> 3. Determine whether an interrupt is necessary:
>
>    (a) If the VIRTIO_RING_F_EVENT_IDX feature is not negotiated:
>        check if the VRING_AVAIL_F_NO_INTERRUPT flag is not set in
>        avail->flags
>
>    (b) If the VIRTIO_RING_F_EVENT_IDX feature is negotiated: check
>        whether the used index crossed the used_event field value
>        since the last update. The used_event field wraps naturally
>        at 65536 as well:
>        (u16)(new_idx - used_event - 1) < (u16)(new_idx - old_idx)
>
> 4. If an interrupt is necessary:
>
>    (a) If MSI-X capability is disabled:
>
>        i. Set the lower bit of the ISR Status field for the device.
>
>        ii. Send the appropriate PCI interrupt for the device.
>
>    (b) If MSI-X capability is enabled:
>
>        i. Request the appropriate MSI-X interrupt message for the
>           device, Queue Vector field sets the MSI-X Table entry
>           number.
>
>        ii. If Queue Vector field value is NO_VECTOR, no interrupt
>            message is requested for this event.
>
> The guest interrupt handler should:
>
> 1. If MSI-X capability is disabled: read the ISR Status field,
>    which will reset it to zero. If the lower bit is zero, the
>    interrupt was not for this device. Otherwise, the guest driver
>    should look through the used rings of each virtqueue for the
>    device, to see if any progress has been made by the device
>    which requires servicing.
>
> 2. If MSI-X capability is enabled: look through the used rings of
>    each virtqueue mapped to the specific MSI-X vector for the
>    device, to see if any progress has been made by the device
>    which requires servicing.
>
> For each ring, the guest should then disable interrupts by writing
> the VRING_AVAIL_F_NO_INTERRUPT flag in the avail structure, if
> required. It can then process used ring entries, finally enabling
> interrupts by clearing the VRING_AVAIL_F_NO_INTERRUPT flag or
> updating the used_event field in the available structure. The
> guest should then execute a memory barrier, and then recheck the
> ring empty condition. This is necessary to handle the case where,
> after the last check and before enabling interrupts, an interrupt
> has been suppressed by the device:
>
>     vring_disable_interrupts(vq);
>
>     for (;;) {
>         /* Nothing left to process: re-enable and check again. */
>         if (vq->last_seen_used == vring->used.idx) {
>             vring_enable_interrupts(vq);
>             mb();
>
>             if (vq->last_seen_used == vring->used.idx)
>                 break;
>         }
>
>         struct vring_used_elem *e = &vring->used.ring[vq->last_seen_used % qsz];
>         process_buffer(e);
>         vq->last_seen_used++;
>     }
>
> 2.4.3 Dealing With Configuration Changes
>
> Some virtio PCI devices can change the device configuration
> state, as reflected in the virtio header in the PCI configuration
> space. In this case:
>
> 1. If MSI-X capability is disabled: an interrupt is delivered and
>    the second lowest bit is set in the ISR Status field to
>    indicate that the driver should re-examine the configuration
>    space. Note that a single interrupt can indicate both that one
>    or more virtqueues have been used and that the configuration
>    space has changed: even if the config bit is set, virtqueues
>    must be scanned.
>
> 2. If MSI-X capability is enabled: an interrupt message is
>    requested. The Configuration Vector field sets the MSI-X Table
>    entry number to use. If Configuration Vector field value is
>    NO_VECTOR, no interrupt message is requested for this event.
>
>
> Creating New Device Types
>
> Various considerations are necessary when creating a new device
> type:
>
> How Many Virtqueues?
>
> It is possible that a very simple device will operate entirely
> through its configuration space, but most will need at least one
> virtqueue in which it will place requests. A device with both
> input and output (e.g. console and network devices described here)
> needs two queues: one which the driver fills with buffers to
> receive input, and one in which the driver places buffers to
> transmit output.
>
> What Configuration Space Layout?
>
> Configuration space is generally used for rarely-changing or
> initialization-time parameters.
> But it is a limited resource, so
> it might be better to use a virtqueue to update configuration
> information (the network device does this for filtering,
> otherwise the table in the config space could potentially be very
> large).
>
> Note that this space is generally the guest's native endian,
> rather than PCI's little-endian.
>
> What Device Number?
>
> Currently device numbers are assigned quite freely: a simple
> request mail to the author of this document or the Linux
> virtualization mailing list[9] will be sufficient to secure a unique one.
>
> Meanwhile for experimental drivers, use 65535 and work backwards.
>
> How many MSI-X vectors?
>
> Using the optional MSI-X capability devices can speed up
> interrupt processing by removing the need to read ISR Status
> register by guest driver (which might be an expensive operation),
> reducing interrupt sharing between devices and queues within the
> device, and handling interrupts from multiple CPUs. However, some
> systems impose a limit (which might be as low as 256) on the
> total number of MSI-X vectors that can be allocated to all
> devices. Devices and/or device drivers should take this into
> account, limiting the number of vectors used unless the device is
> expected to cause a high volume of interrupts. Devices can
> control the number of vectors used by limiting the MSI-X Table
> Size or not presenting MSI-X capability in PCI configuration
> space. Drivers can control this by mapping events to as small
> a number of vectors as possible, or disabling MSI-X capability
> altogether.
>
> Message Framing
>
> The descriptors used for a buffer should not affect the semantics
> of the message, except for the total length of the buffer. For
> example, a network buffer consists of a 10 byte header followed
> by the network packet.
Whether this is presented in the ring > descriptor chain as (say) a 10 byte buffer and a 1514 byte > buffer, or a single 1524 byte buffer, or even three buffers, > should have no effect. > > In particular, no implementation should use the descriptor > boundaries to determine the size of any header in a request.[10] > > Device Improvements > > Any change to configuration space, or new virtqueues, or > behavioural changes, should be indicated by negotiation of a new > feature bit. This establishes clarity[11] and avoids future expansion problems. > > Clusters of functionality which are always implemented together > can use a single bit, but if one feature makes sense without the > others they should not be gratuitously grouped together to > conserve feature bits. We can always extend the spec when the > first person needs more than 24 feature bits for their device. > > > > > Appendix A: virtio_ring.h > > #ifndef VIRTIO_RING_H > #define VIRTIO_RING_H > /* An interface for efficient virtio implementation. > * > * This header is BSD licensed so anyone can use the definitions > * to implement compatible drivers/servers. > * > * Copyright 2007, 2009, IBM Corporation > * Copyright 2011, Red Hat, Inc > * All rights reserved. > * > * Redistribution and use in source and binary forms, with or without > * modification, are permitted provided that the following conditions > * are met: > * 1. Redistributions of source code must retain the above copyright > * notice, this list of conditions and the following disclaimer. > * 2. Redistributions in binary form must reproduce the above copyright > * notice, this list of conditions and the following disclaimer in the > * documentation and/or other materials provided with the distribution. > * 3. Neither the name of IBM nor the names of its contributors > * may be used to endorse or promote products derived from this software > * without specific prior written permission. 
> * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS ``AS IS'' AND > * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE > * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE > * ARE DISCLAIMED. IN NO EVENT SHALL IBM OR CONTRIBUTORS BE LIABLE > * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL > * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS > * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) > * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT > * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY > * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF > * SUCH DAMAGE. > */ > > /* This marks a buffer as continuing via the next field. */ > #define VRING_DESC_F_NEXT 1 > /* This marks a buffer as write-only (otherwise read-only). */ > #define VRING_DESC_F_WRITE 2 > > /* The Host uses this in used->flags to advise the Guest: don't kick me > * when you add a buffer. It's unreliable, so it's simply an > * optimization. Guest will still kick if it's out of buffers. */ > #define VRING_USED_F_NO_NOTIFY 1 > /* The Guest uses this in avail->flags to advise the Host: don't > * interrupt me when you consume a buffer. It's unreliable, so it's > * simply an optimization. */ > #define VRING_AVAIL_F_NO_INTERRUPT 1 > > /* Virtio ring descriptors: 16 bytes. > * These can chain together via "next". */ > struct vring_desc { > /* Address (guest-physical). */ > uint64_t addr; > /* Length. */ > uint32_t len; > /* The flags as indicated above. */ > uint16_t flags; > /* We chain unused descriptors via this, too */ > uint16_t next; > }; > > struct vring_avail { > uint16_t flags; > uint16_t idx; > uint16_t ring[]; > uint16_t used_event; > }; > > /* u32 is used here for ids for padding reasons. */ > struct vring_used_elem { > /* Index of start of used descriptor chain. 
*/ > uint32_t id; > /* Total length of the descriptor chain which was written to. */ > uint32_t len; > }; > > struct vring_used { > uint16_t flags; > uint16_t idx; > struct vring_used_elem ring[]; > uint16_t avail_event; > }; > > struct vring { > unsigned int num; > > struct vring_desc *desc; > struct vring_avail *avail; > struct vring_used *used; > }; > > /* The standard layout for the ring is a continuous chunk of memory which > * looks like this. We assume num is a power of 2. > * > * struct vring { > * // The actual descriptors (16 bytes each) > * struct vring_desc desc[num]; > * > * // A ring of available descriptor heads with free-running index. > * __u16 avail_flags; > * __u16 avail_idx; > * __u16 available[num]; > * > * // Padding to the next align boundary. > * char pad[]; > * > * // A ring of used descriptor heads with free-running index. > * __u16 used_flags; > * __u16 EVENT_IDX; > * struct vring_used_elem used[num]; > * }; > * Note: for virtio PCI, align is 4096. > */ > static inline void vring_init(struct vring *vr, unsigned int num, void *p, > unsigned long align) > { > vr->num = num; > vr->desc = p; > vr->avail = p + num*sizeof(struct vring_desc); > vr->used = (void *)(((unsigned long)&vr->avail->ring[num] > + align-1) > & ~(align - 1)); > } > > static inline unsigned vring_size(unsigned int num, unsigned long align) > { > return ((sizeof(struct vring_desc)*num + sizeof(uint16_t)*(2+num) > + align - 1) & ~(align - 1)) > + sizeof(uint16_t)*3 + sizeof(struct vring_used_elem)*num; > } > > static inline int vring_need_event(uint16_t event_idx, uint16_t new_idx, uint16_t old_idx) > { > return (uint16_t)(new_idx - event_idx - 1) < (uint16_t)(new_idx - old_idx); > } > #endif /* VIRTIO_RING_H */ > > > Appendix B: Reserved Feature Bits > > Currently there are five device-independent feature bits defined: > > VIRTIO_F_NOTIFY_ON_EMPTY (24) Negotiating this feature > indicates that the driver wants an interrupt if the device runs > out of available descriptors 
on a virtqueue, even though > interrupts are suppressed using the VRING_AVAIL_F_NO_INTERRUPT > flag or the used_event field. An example of this is the > networking driver: it doesn't need to know every time a packet > is transmitted, but it does need to free the transmitted > packets a finite time after they are transmitted. It can avoid > using a timer if the device interrupts it when all the packets > are transmitted. > > VIRTIO_F_RING_INDIRECT_DESC (28) Negotiating this feature indicates > that the driver can use descriptors with the VRING_DESC_F_INDIRECT > flag set, as described in 2.3.3 Indirect Descriptors. > > VIRTIO_F_RING_EVENT_IDX(29) This feature enables the used_event > and the avail_event fields. If set, it indicates that the > device should ignore the flags field in the available ring > structure. Instead, the used_event field in this structure is > used by guest to suppress device interrupts. Further, the > driver should ignore the flags field in the used ring > structure. Instead, the avail_event field in this structure is > used by the device to suppress notifications. If unset, the > driver should ignore the used_event field; the device should > ignore the avail_event field; the flags field is used > > Appendix C: Network Device > > The virtio network device is a virtual ethernet card, and is the > most complex of the devices supported so far by virtio. It has > enhanced rapidly and demonstrates clearly how support for new > features should be added to an existing device. Empty buffers are > placed in one virtqueue for receiving packets, and outgoing > packets are enqueued into another for transmission in that order. > A third command queue is used to control advanced filtering > features. > > Configuration > > Subsystem Device ID 1 > > Virtqueues 0:receiveq. 1:transmitq. 
2:controlq[12] > > Feature bits > > VIRTIO_NET_F_CSUM (0) Device handles packets with partial checksum > > VIRTIO_NET_F_GUEST_CSUM (1) Guest handles packets with partial checksum > > VIRTIO_NET_F_MAC (5) Device has given MAC address. > > VIRTIO_NET_F_GSO (6) (Deprecated) device handles packets with > any GSO type.[13] > > VIRTIO_NET_F_GUEST_TSO4 (7) Guest can receive TSOv4. > > VIRTIO_NET_F_GUEST_TSO6 (8) Guest can receive TSOv6. > > VIRTIO_NET_F_GUEST_ECN (9) Guest can receive TSO with ECN. > > VIRTIO_NET_F_GUEST_UFO (10) Guest can receive UFO. > > VIRTIO_NET_F_HOST_TSO4 (11) Device can receive TSOv4. > > VIRTIO_NET_F_HOST_TSO6 (12) Device can receive TSOv6. > > VIRTIO_NET_F_HOST_ECN (13) Device can receive TSO with ECN. > > VIRTIO_NET_F_HOST_UFO (14) Device can receive UFO. > > VIRTIO_NET_F_MRG_RXBUF (15) Guest can merge receive buffers. > > VIRTIO_NET_F_STATUS (16) Configuration status field is > available. > > VIRTIO_NET_F_CTRL_VQ (17) Control channel is available. > > VIRTIO_NET_F_CTRL_RX (18) Control channel RX mode support. > > VIRTIO_NET_F_CTRL_VLAN (19) Control channel VLAN filtering. > > VIRTIO_NET_F_GUEST_ANNOUNCE (21) Guest can send gratuitous > packets. > > Device configuration layout Two configuration fields are > currently defined. The MAC address field always exists (though > it is only valid if VIRTIO_NET_F_MAC is set), and the status field > only exists if VIRTIO_NET_F_STATUS is set. Two read-only bits > are currently defined for the status field: > VIRTIO_NET_S_LINK_UP and VIRTIO_NET_S_ANNOUNCE. > > #define VIRTIO_NET_S_LINK_UP 1 > #define VIRTIO_NET_S_ANNOUNCE 2 > > struct virtio_net_config { > u8 mac[6]; > u16 status; > }; > > Device Initialization > > 1. The initialization routine should identify the receive and > transmission virtqueues. > > 2. If the VIRTIO_NET_F_MAC feature bit is set, the configuration > space “mac” entry indicates the “physical” address of the > network card, otherwise a private MAC address should be > assigned. 
All guests are expected to negotiate this feature if > it is set. > > 3. If the VIRTIO_NET_F_CTRL_VQ feature bit is negotiated, > identify the control virtqueue. > > 4. If the VIRTIO_NET_F_STATUS feature bit is negotiated, the link > status can be read from the bottom bit of the “status” config > field. Otherwise, the link should be assumed active. > > 5. The receive virtqueue should be filled with receive buffers. > This is described in detail below in “Setting Up Receive > Buffers”. > > 6. A driver can indicate that it will generate checksumless > packets by negotiating the VIRTIO_NET_F_CSUM feature. This > “checksum offload” is a common feature on modern network cards. > > 7. If that feature is negotiated[14], a driver can use TCP or UDP > segmentation offload by negotiating the VIRTIO_NET_F_HOST_TSO4 (IPv4 > TCP), VIRTIO_NET_F_HOST_TSO6 (IPv6 TCP) and VIRTIO_NET_F_HOST_UFO > (UDP fragmentation) features. It should not send TCP packets > requiring segmentation offload which have the Explicit Congestion > Notification bit set, unless the VIRTIO_NET_F_HOST_ECN feature is > negotiated.[15] > > 8. The converse features are also available: a driver can save > the virtual device some work by negotiating these features.[16] > The VIRTIO_NET_F_GUEST_CSUM feature indicates that partially > checksummed packets can be received, and if it can do that then > the VIRTIO_NET_F_GUEST_TSO4, VIRTIO_NET_F_GUEST_TSO6, > VIRTIO_NET_F_GUEST_UFO and VIRTIO_NET_F_GUEST_ECN are the input > equivalents of the features described above. See “Receiving > Packets” below. > > Device Operation > > Packets are transmitted by placing them in the transmitq, and > buffers for incoming packets are placed in the receiveq. 
In each > case, the packet itself is preceded by a header: > > struct virtio_net_hdr { > #define VIRTIO_NET_HDR_F_NEEDS_CSUM 1 > u8 flags; > #define VIRTIO_NET_HDR_GSO_NONE 0 > #define VIRTIO_NET_HDR_GSO_TCPV4 1 > #define VIRTIO_NET_HDR_GSO_UDP 3 > #define VIRTIO_NET_HDR_GSO_TCPV6 4 > #define VIRTIO_NET_HDR_GSO_ECN 0x80 > u8 gso_type; > u16 hdr_len; > u16 gso_size; > u16 csum_start; > u16 csum_offset; > /* Only if VIRTIO_NET_F_MRG_RXBUF: */ > u16 num_buffers; > }; > > The controlq is used to control device features such as > filtering. > > Packet Transmission > > Transmitting a single packet is simple, but varies depending on > the different features the driver negotiated. > > 1. If the driver negotiated VIRTIO_NET_F_CSUM, and the packet has > not been fully checksummed, then the virtio_net_hdr's fields > are set as follows. Otherwise, the packet must be fully > checksummed, and flags is zero. > > • flags has the VIRTIO_NET_HDR_F_NEEDS_CSUM set, > > • csum_start is set to the offset within the packet to begin checksumming, > and > > • csum_offset indicates how many bytes after the csum_start the > new (16 bit ones' complement) checksum should be placed.[17] > > 2. If the driver negotiated > VIRTIO_NET_F_HOST_TSO4, TSO6 or UFO, and the packet requires > TCP segmentation or UDP fragmentation, then the “gso_type” > field is set to VIRTIO_NET_HDR_GSO_TCPV4, TCPV6 or UDP. > (Otherwise, it is set to VIRTIO_NET_HDR_GSO_NONE). In this > case, packets larger than 1514 bytes can be transmitted: the > metadata indicates how to replicate the packet header to cut it > into smaller packets. The other gso fields are set: > > • hdr_len is a hint to the device as to how much of the header > needs to be kept to copy into each packet, usually set to the > length of the headers, including the transport header.[18] > > • gso_size is the maximum size of each packet beyond that > header (ie. MSS). 
> > • If the driver negotiated the VIRTIO_NET_F_HOST_ECN feature, > the VIRTIO_NET_HDR_GSO_ECN bit may be set in “gso_type” as > well, indicating that the TCP packet has the ECN bit set.[19] > > 3. If the driver negotiated the VIRTIO_NET_F_MRG_RXBUF feature, > the num_buffers field is set to zero. > > 4. The header and packet are added as one output buffer to the > transmitq, and the device is notified of the new entry (see 2.4.1.4 > Notifying The Device).[20] > > Packet Transmission Interrupt > > Often a driver will suppress transmission interrupts using the > VRING_AVAIL_F_NO_INTERRUPT flag (see 2.4.2 Receiving Used Buffers From > The Device) and check for used packets in the transmit path of following > packets. However, it will still receive interrupts if the > VIRTIO_F_NOTIFY_ON_EMPTY feature is negotiated, indicating that > the transmission queue is completely emptied. > > The normal behavior in this interrupt handler is to retrieve any > new descriptors from the used ring and free the corresponding > headers and packets. > > Setting Up Receive Buffers > > It is generally a good idea to keep the receive virtqueue as > fully populated as possible: if it runs out, network performance > will suffer. > > If the VIRTIO_NET_F_GUEST_TSO4, VIRTIO_NET_F_GUEST_TSO6 or > VIRTIO_NET_F_GUEST_UFO features are used, the Guest will need to > accept packets of up to 65550 bytes (the maximum size of a > TCP or UDP packet, plus the 14 byte ethernet header), otherwise > 1514 bytes. So unless VIRTIO_NET_F_MRG_RXBUF is negotiated, every > buffer in the receive queue needs to be at least this length.[20a] > > If VIRTIO_NET_F_MRG_RXBUF is negotiated, each buffer must be at > least the size of the struct virtio_net_hdr. > > Packet Receive Interrupt > > When a packet is copied into a buffer in the receiveq, the > optimal path is to disable further interrupts for the receiveq > (see 2.4.2 Receiving Used Buffers From The Device) and process packets until no > more are found, then re-enable them. 
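The receive-buffer sizing rule under "Setting Up Receive Buffers" above can be sketched in C as follows. This is only an illustration: the helper names net_has_feature and net_rx_min_buffer_len and the constant NET_HDR_MRG_LEN are hypothetical, the feature word is assumed to hold the negotiated bits with bit n at position n, and the 12-byte figure assumes the struct virtio_net_hdr layout shown earlier including num_buffers. Whether room for the header must be added on top of the 1514 or 65550 byte figure is not decided here; the function returns the figures exactly as the text states them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Feature bit numbers as listed under "Feature bits" in Appendix C. */
#define VIRTIO_NET_F_GUEST_TSO4  7
#define VIRTIO_NET_F_GUEST_TSO6  8
#define VIRTIO_NET_F_GUEST_UFO  10
#define VIRTIO_NET_F_MRG_RXBUF  15

/* Assumed size of struct virtio_net_hdr when num_buffers is present. */
#define NET_HDR_MRG_LEN 12u

/* Hypothetical helper: test one bit of the negotiated feature word. */
static int net_has_feature(uint32_t features, unsigned bit)
{
    assert(bit < 32);
    return (int)((features >> bit) & 1u);
}

/* Hypothetical helper: smallest length a single receive buffer may have. */
static size_t net_rx_min_buffer_len(uint32_t features)
{
    /* With mergeable buffers, each buffer only has to hold the header. */
    if (net_has_feature(features, VIRTIO_NET_F_MRG_RXBUF))
        return NET_HDR_MRG_LEN;

    /* Any receive offload means packets of up to 65550 bytes may arrive. */
    if (net_has_feature(features, VIRTIO_NET_F_GUEST_TSO4) ||
        net_has_feature(features, VIRTIO_NET_F_GUEST_TSO6) ||
        net_has_feature(features, VIRTIO_NET_F_GUEST_UFO))
        return 65550u;

    /* Otherwise a plain ethernet frame. */
    return 1514u;
}
```

A driver would typically allocate buffers larger than this minimum when VIRTIO_NET_F_MRG_RXBUF is negotiated, since a header-sized buffer leaves no room for packet data.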
> > Processing a packet involves: > > 1. If the driver negotiated the VIRTIO_NET_F_MRG_RXBUF feature, > then the “num_buffers” field indicates how many descriptors > this packet is spread over (including this one). This allows > receipt of large packets without having to allocate large > buffers. In this case, there will be at least “num_buffers” buffers in > the used ring, and they should be chained together to form a > single packet. The other buffers will not begin with a struct > virtio_net_hdr. > > 2. If the VIRTIO_NET_F_MRG_RXBUF feature was not negotiated, or > the “num_buffers” field is one, then the entire packet will be > contained within this buffer, immediately following the struct > virtio_net_hdr. > > 3. If the VIRTIO_NET_F_GUEST_CSUM feature was negotiated, the > VIRTIO_NET_HDR_F_NEEDS_CSUM bit in the “flags” field may be > set: if so, the checksum on the packet is incomplete and the > “csum_start” and “csum_offset” fields indicate how to calculate > it (see Packet Transmission point 1). > > 4. If the VIRTIO_NET_F_GUEST_TSO4, TSO6 or UFO options were > negotiated, then the “gso_type” may be something other than > VIRTIO_NET_HDR_GSO_NONE, and the “gso_size” field indicates the > desired MSS (see Packet Transmission point 2). > > Control Virtqueue > > The driver uses the control virtqueue (if VIRTIO_NET_F_CTRL_VQ is > negotiated) to send commands to manipulate various features of > the device which would not easily map into the configuration > space. > > All commands are of the following form: > > struct virtio_net_ctrl { > u8 class; > u8 command; > u8 command-specific-data[]; > u8 ack; > }; > > /* ack values */ > #define VIRTIO_NET_OK 0 > #define VIRTIO_NET_ERR 1 > > The class, command and command-specific-data are set by the > driver, and the device sets the ack byte. There is little the driver can > do except issue a diagnostic if the ack byte is not > VIRTIO_NET_OK. 
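As a rough illustration of the command layout just described, the driver-written part of a control command (class, command, command-specific-data) can be packed into a byte buffer, with the ack byte kept separately because the device writes it. The helper names net_ctrl_pack and net_ctrl_succeeded are hypothetical, and keeping the ack byte in its own device-writable buffer is an assumption of this sketch rather than something the text requires.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* ack values, as defined above. */
#define VIRTIO_NET_OK  0
#define VIRTIO_NET_ERR 1

/* Hypothetical helper: write class, command and payload into out.
 * Returns the number of bytes used, or 0 if the buffer is too small. */
static size_t net_ctrl_pack(uint8_t *out, size_t cap,
                            uint8_t class_id, uint8_t command,
                            const void *data, size_t data_len)
{
    assert(out != NULL);
    assert(data != NULL || data_len == 0);

    /* Two fixed bytes (class, command) precede the payload. */
    if (cap < 2 || data_len > cap - 2)
        return 0;

    out[0] = class_id;
    out[1] = command;
    if (data_len != 0)
        memcpy(out + 2, data, data_len);
    return 2 + data_len;
}

/* Hypothetical helper: interpret the ack byte written by the device. */
static int net_ctrl_succeeded(uint8_t ack)
{
    return ack == VIRTIO_NET_OK;
}
```

For example, a one-byte payload with class 0 and command 0 produces a three-byte driver-written region; on a non-OK ack the driver can only log a diagnostic.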
> > Packet Receive Filtering > > If the VIRTIO_NET_F_CTRL_RX feature is negotiated, the driver can > send control commands for promiscuous mode, multicast receiving, > and filtering of MAC addresses. > > Note that in general, these commands are best-effort: unwanted > packets may still arrive. > > Setting Promiscuous Mode > > #define VIRTIO_NET_CTRL_RX 0 > #define VIRTIO_NET_CTRL_RX_PROMISC 0 > #define VIRTIO_NET_CTRL_RX_ALLMULTI 1 > > The class VIRTIO_NET_CTRL_RX has two commands: > VIRTIO_NET_CTRL_RX_PROMISC turns promiscuous mode on and off, and > VIRTIO_NET_CTRL_RX_ALLMULTI turns all-multicast receive on and > off. The command-specific-data is one byte containing 0 (off) or > 1 (on). > > Setting MAC Address Filtering > > struct virtio_net_ctrl_mac { > u32 entries; > u8 macs[entries][ETH_ALEN]; > }; > > #define VIRTIO_NET_CTRL_MAC 1 > #define VIRTIO_NET_CTRL_MAC_TABLE_SET 0 > > The device can filter incoming packets by any number of destination > MAC addresses.[21] This table is set using the class > VIRTIO_NET_CTRL_MAC and the command VIRTIO_NET_CTRL_MAC_TABLE_SET. The > command-specific-data is two variable length tables of 6-byte MAC > addresses. The first table contains unicast addresses, and the second > contains multicast addresses. > > VLAN Filtering > > If the driver negotiates the VIRTIO_NET_F_CTRL_VLAN feature, it > can control a VLAN filter table in the device. > > #define VIRTIO_NET_CTRL_VLAN 2 > #define VIRTIO_NET_CTRL_VLAN_ADD 0 > #define VIRTIO_NET_CTRL_VLAN_DEL 1 > > Both the VIRTIO_NET_CTRL_VLAN_ADD and VIRTIO_NET_CTRL_VLAN_DEL > commands take a 16-bit VLAN id as the command-specific-data. > > Gratuitous Packet Sending > > If the driver negotiates the VIRTIO_NET_F_GUEST_ANNOUNCE feature (which depends > on VIRTIO_NET_F_CTRL_VQ), the device can ask the guest to send gratuitous > packets; this is usually done after the guest has been physically > migrated, and needs to announce its presence on the new network > links. 
(As the hypervisor does not have knowledge of the guest > network configuration (eg. tagged vlan) it is simplest to prod > the guest in this way). > > #define VIRTIO_NET_CTRL_ANNOUNCE 3 > #define VIRTIO_NET_CTRL_ANNOUNCE_ACK 0 > > The Guest needs to check the VIRTIO_NET_S_ANNOUNCE bit in the status > field when it notices a change of device configuration. The > command VIRTIO_NET_CTRL_ANNOUNCE_ACK is used to indicate that the > driver has received the notification; the device clears the > VIRTIO_NET_S_ANNOUNCE bit in the status field after it receives > this command. > > Processing this notification involves: > > 1. Sending the gratuitous packets, or marking that there are pending > gratuitous packets to be sent and letting a deferred routine > send them. > > 2. Sending the VIRTIO_NET_CTRL_ANNOUNCE_ACK command through the control > vq. > > Appendix D: Block Device > > The virtio block device is a simple virtual block device (ie. > disk). Read and write requests (and other exotic requests) are > placed in the queue, and serviced (probably out of order) by the > device except where noted. > > Configuration > > Subsystem Device ID 2 > > Virtqueues 0:requestq. > > Feature bits > > VIRTIO_BLK_F_BARRIER (0) Host supports request barriers. > > VIRTIO_BLK_F_SIZE_MAX (1) Maximum size of any single segment is > in “size_max”. > > VIRTIO_BLK_F_SEG_MAX (2) Maximum number of segments in a > request is in “seg_max”. > > VIRTIO_BLK_F_GEOMETRY (4) Disk-style geometry specified in > “geometry”. > > VIRTIO_BLK_F_RO (5) Device is read-only. > > VIRTIO_BLK_F_BLK_SIZE (6) Block size of disk is in “blk_size”. > > VIRTIO_BLK_F_SCSI (7) Device supports scsi packet commands. > > VIRTIO_BLK_F_FLUSH (9) Cache flush command support. > > Device configuration layout The capacity of the device > (expressed in 512-byte sectors) is always present. The > availability of the others depends on various feature bits > as indicated above. 
> > struct virtio_blk_config { > u64 capacity; > u32 size_max; > u32 seg_max; > struct virtio_blk_geometry { > u16 cylinders; > u8 heads; > u8 sectors; > } geometry; > u32 blk_size; > }; > > Device Initialization > > 1. The device size should be read from the “capacity” > configuration field. No requests should be submitted which go > beyond this limit. > > 2. If the VIRTIO_BLK_F_BLK_SIZE feature is negotiated, the > blk_size field can be read to determine the optimal sector size > for the driver to use. This does not affect the units used in > the protocol (always 512 bytes), but awareness of the correct > value can affect performance. > > 3. If the VIRTIO_BLK_F_RO feature is set by the device, any write > requests will fail. > > Device Operation > > The driver queues requests to the virtqueue, and they are used by > the device (not necessarily in order). Each request is of form: > > struct virtio_blk_req { > u32 type; > u32 ioprio; > u64 sector; > char data[][512]; > u8 status; > }; > > If the device has the VIRTIO_BLK_F_SCSI feature, it can also support > scsi packet command requests. Each of these requests is of form: > > struct virtio_scsi_pc_req { > u32 type; > u32 ioprio; > u64 sector; > char cmd[]; > char data[][512]; > #define SCSI_SENSE_BUFFERSIZE 96 > u8 sense[SCSI_SENSE_BUFFERSIZE]; > u32 errors; > u32 data_len; > u32 sense_len; > u32 residual; > u8 status; > }; > > The type of the request is either a read (VIRTIO_BLK_T_IN), a write > (VIRTIO_BLK_T_OUT), a scsi packet command (VIRTIO_BLK_T_SCSI_CMD or > VIRTIO_BLK_T_SCSI_CMD_OUT[22]) or a flush (VIRTIO_BLK_T_FLUSH or > VIRTIO_BLK_T_FLUSH_OUT[23]). If the device has the VIRTIO_BLK_F_BARRIER > feature the high bit (VIRTIO_BLK_T_BARRIER) indicates that this > request acts as a barrier and that all preceding requests must be > complete before this one, and all following requests must not be > started until this is complete. 
Note that a barrier does not flush > caches in the underlying backend device in host, and thus does not > serve as data consistency guarantee. Driver must use FLUSH request to > flush the host cache. > > #define VIRTIO_BLK_T_IN 0 > #define VIRTIO_BLK_T_OUT 1 > #define VIRTIO_BLK_T_SCSI_CMD 2 > #define VIRTIO_BLK_T_SCSI_CMD_OUT 3 > #define VIRTIO_BLK_T_FLUSH 4 > #define VIRTIO_BLK_T_FLUSH_OUT 5 > #define VIRTIO_BLK_T_BARRIER 0x80000000 > > The ioprio field is a hint about the relative priorities of > requests to the device: higher numbers indicate more important > requests. > > The sector number indicates the offset (multiplied by 512) where > the read or write is to occur. This field is unused and set to 0 > for scsi packet commands and for flush commands. > > The cmd field is only present for scsi packet command requests, > and indicates the command to perform. This field must reside in a > single, separate read-only buffer; command length can be derived > from the length of this buffer. > > Note that these first three (four for scsi packet commands) > fields are always read-only: the data field is either read-only > or write-only, depending on the request. The size of the read or > write can be derived from the total size of the request buffers. > > The sense field is only present for scsi packet command requests, > and indicates the buffer for scsi sense data. > > The data_len field is only present for scsi packet command > requests, this field is deprecated, and should be ignored by the > driver. Historically, devices copied data length there. > > The sense_len field is only present for scsi packet command > requests and indicates the number of bytes actually written to > the sense buffer. > > The residual field is only present for scsi packet command > requests and indicates the residual size, calculated as data > length - number of bytes actually transferred. 
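The field descriptions above (barrier bit in type, sector units, residual) can be illustrated with a few small C helpers. This is a sketch only: the helper names and BLK_SECTOR_SIZE are hypothetical, and the fields are assumed to have already been read into host-order integers.

```c
#include <assert.h>
#include <stdint.h>

/* Request types and barrier flag, as defined above. */
#define VIRTIO_BLK_T_IN      0u
#define VIRTIO_BLK_T_OUT     1u
#define VIRTIO_BLK_T_BARRIER 0x80000000u

/* Protocol unit for the sector field (hypothetical name). */
#define BLK_SECTOR_SIZE 512u

/* True if the high bit marks this request as a barrier. */
static int blk_req_is_barrier(uint32_t type)
{
    return (type & VIRTIO_BLK_T_BARRIER) != 0;
}

/* Request type with the barrier bit masked off. */
static uint32_t blk_req_base_type(uint32_t type)
{
    return type & ~VIRTIO_BLK_T_BARRIER;
}

/* Byte offset on the device for a given sector number. */
static uint64_t blk_sector_to_byte_offset(uint64_t sector)
{
    assert(sector <= UINT64_MAX / BLK_SECTOR_SIZE);
    return sector * BLK_SECTOR_SIZE;
}

/* Residual for a scsi packet command: data length minus bytes transferred. */
static uint32_t blk_scsi_residual(uint32_t data_length, uint32_t transferred)
{
    assert(transferred <= data_length);
    return data_length - transferred;
}
```

Masking the barrier bit before comparing against VIRTIO_BLK_T_IN or VIRTIO_BLK_T_OUT keeps the type check independent of whether VIRTIO_BLK_F_BARRIER is in use.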
> > The final status byte is written by the device: either > VIRTIO_BLK_S_OK for success, VIRTIO_BLK_S_IOERR for host or guest > error or VIRTIO_BLK_S_UNSUPP for a request unsupported by host: > > #define VIRTIO_BLK_S_OK 0 > #define VIRTIO_BLK_S_IOERR 1 > #define VIRTIO_BLK_S_UNSUPP 2 > > Historically, devices assumed that the fields type, ioprio and > sector reside in a single, separate read-only buffer; the sense field in a separate > write-only buffer of size 96 bytes, by itself; the fields errors, > data_len, sense_len and residual in a single, separate write-only buffer; > and the status field in a separate write-only buffer of size 1 > byte, by itself. > > Appendix E: Console Device > > The virtio console device is a simple device for data input and > output. A device may have one or more ports. Each port has a pair > of input and output virtqueues. Moreover, a device has a pair of > control IO virtqueues. The control virtqueues are used to > communicate information between the device and the driver about > ports being opened and closed on either side of the connection, > indication from the host about whether a particular port is a > console port, adding new ports, port hot-plug/unplug, etc., and > indication from the guest about whether a port or a device was > successfully added, port open/close, etc. For data IO, one or > more empty buffers are placed in the receive queue for incoming > data and outgoing characters are placed in the transmit queue. > > Configuration > > Subsystem Device ID 3 > > Virtqueues 0:receiveq(port0). 1:transmitq(port0), 2:control > receiveq[24], 3:control transmitq, 4:receiveq(port1), 5:transmitq(port1), > ... > > Feature bits > > VIRTIO_CONSOLE_F_SIZE (0) Configuration cols and rows fields > are valid. 
> > VIRTIO_CONSOLE_F_MULTIPORT(1) Device has support for multiple > ports; configuration fields nr_ports and max_nr_ports are > valid and control virtqueues will be used. > > Device configuration layout The size of the console is supplied > in the configuration space if the VIRTIO_CONSOLE_F_SIZE feature > is set. Furthermore, if the VIRTIO_CONSOLE_F_MULTIPORT feature > is set, the maximum number of ports supported by the device can > be fetched. > > struct virtio_console_config { > u16 cols; > u16 rows; > u32 max_nr_ports; > }; > > Device Initialization > > 1. If the VIRTIO_CONSOLE_F_SIZE feature is negotiated, the driver > can read the console dimensions from the configuration fields. > > 2. If the VIRTIO_CONSOLE_F_MULTIPORT feature is negotiated, the > driver can spawn multiple ports, not all of which may be > attached to a console. Some could be generic ports. In this > case, the control virtqueues are enabled and according to the > max_nr_ports configuration-space value, the appropriate number > of virtqueues are created. A control message indicating the > driver is ready is sent to the host. The host can then send > control messages for adding new ports to the device. After > creating and initializing each port, a > VIRTIO_CONSOLE_PORT_READY control message is sent to the host > for that port so the host can let us know of any additional > configuration options set for that port. > > 3. The receiveq for each port is populated with one or more > receive buffers. > > Device Operation > > 1. For output, a buffer containing the characters is placed in > the port's transmitq.[25] > > 2. When a buffer is used in the receiveq (signalled by an > interrupt), the contents is the input to the port associated > with the virtqueue for which the notification was received. > > 3. If the driver negotiated the VIRTIO_CONSOLE_F_SIZE feature, a > configuration change interrupt may occur. The updated size can > be read from the configuration fields. > > 4. 
If the driver negotiated the VIRTIO_CONSOLE_F_MULTIPORT > feature, active ports are announced by the host using the > VIRTIO_CONSOLE_PORT_ADD control message. The same message is > used for port hot-plug as well. > > 5. If the host specified a port `name', a sysfs attribute is > created with the name filled in, so that udev rules can be > written that can create a symlink from the port's name to the > char device for port discovery by applications in the guest. > > 6. Changes to ports' state are effected by control messages. > Appropriate action is taken on the port indicated in the > control message. The layout of the structure of the control > buffer and the events associated are: > > struct virtio_console_control { > uint32_t id; /* Port number */ > uint16_t event; /* The kind of control event */ > uint16_t value; /* Extra information for the event */ > }; > > /* Some events for the internal messages (control packets) */ > #define VIRTIO_CONSOLE_DEVICE_READY 0 > #define VIRTIO_CONSOLE_PORT_ADD 1 > #define VIRTIO_CONSOLE_PORT_REMOVE 2 > #define VIRTIO_CONSOLE_PORT_READY 3 > #define VIRTIO_CONSOLE_CONSOLE_PORT 4 > #define VIRTIO_CONSOLE_RESIZE 5 > #define VIRTIO_CONSOLE_PORT_OPEN 6 > #define VIRTIO_CONSOLE_PORT_NAME 7 > > Appendix F: Entropy Device > > The virtio entropy device supplies high-quality randomness for > guest use. > > Configuration > > Subsystem Device ID 4 > > Virtqueues 0:requestq. > > Feature bits None currently defined > > Device configuration layout None currently defined. > > Device Initialization > > 1. The virtqueue is initialized > > Device Operation > > When the driver requires random bytes, it places the descriptor > of one or more buffers in the queue. It will be completely filled > by random data by the device. 
> > Appendix G: Memory Balloon Device > > The virtio memory balloon device is a primitive device for > managing guest memory: the device asks for a certain amount of > memory, and the guest supplies it (or withdraws it, if the device > has more than it asks for). This allows the guest to adapt to > changes in allowance of underlying physical memory. If the > feature is negotiated, the device can also be used to communicate > guest memory statistics to the host. > > Configuration > > Subsystem Device ID 5 > > Virtqueues 0:inflateq. 1:deflateq. 2:statsq.[26] > > Feature bits > > VIRTIO_BALLOON_F_MUST_TELL_HOST (0) Host must be told before > pages from the balloon are used. > > VIRTIO_BALLOON_F_STATS_VQ (1) A virtqueue for reporting guest > memory statistics is present. > > Device configuration layout Both fields of this configuration > are always available. Note that they are little endian, despite > convention that device fields are guest endian: > > struct virtio_balloon_config { > u32 num_pages; > u32 actual; > }; > > Device Initialization > > 1. The inflate and deflate virtqueues are identified. > > 2. If the VIRTIO_BALLOON_F_STATS_VQ feature bit is negotiated: > > (a) Identify the stats virtqueue. > > (b) Add one empty buffer to the stats virtqueue and notify the > host. > > Device operation begins immediately. > > Device Operation > > Memory Ballooning The device is driven by the receipt of a > configuration change interrupt. > > 1. The “num_pages” configuration field is examined. If this is > greater than the “actual” number of pages, memory must be given > to the balloon. If it is less than the “actual” number of > pages, memory may be taken back from the balloon for general > use. > > 2. To supply memory to the balloon (aka. inflate): > > (a) The driver constructs an array of addresses of unused memory > pages. These addresses are divided by 4096[27] and the descriptor > describing the resulting 32-bit array is added to the inflateq. > > 3. 
To remove memory from the balloon (aka. deflate): > > (a) The driver constructs an array of addresses of memory pages > it has previously given to the balloon, as described above. > The descriptor describing this array is added to the deflateq. > > (b) If the VIRTIO_BALLOON_F_MUST_TELL_HOST feature is set, the > guest may not use these requested pages until that descriptor > in the deflateq has been used by the device. > > (c) Otherwise, the guest may begin to re-use pages previously > given to the balloon before the device has acknowledged their > withdrawal.[28] > > 4. In either case, once the device has completed the inflation or > deflation, the “actual” field of the configuration should be > updated to reflect the new number of pages in the balloon.[29] > > Memory Statistics > > The stats virtqueue is atypical because communication is driven > by the device (not the driver). The channel becomes active at > driver initialization time when the driver adds an empty buffer > and notifies the device. A request for memory statistics proceeds > as follows: > > 1. The device pushes the buffer onto the used ring and sends an > interrupt. > > 2. The driver pops the used buffer and discards it. > > 3. The driver collects memory statistics and writes them into a > new buffer. > > 4. The driver adds the buffer to the virtqueue and notifies the > device. > > 5. The device pops the buffer (retaining it to initiate a > subsequent request) and consumes the statistics. > > Memory Statistics Format Each statistic consists of a 16 bit > tag and a 64 bit value. Both quantities are represented in the > native endianness of the guest. All statistics are optional and the > driver may choose which ones to supply. To guarantee backwards > compatibility, unsupported statistics should be omitted.
> > struct virtio_balloon_stat { > #define VIRTIO_BALLOON_S_SWAP_IN 0 > #define VIRTIO_BALLOON_S_SWAP_OUT 1 > #define VIRTIO_BALLOON_S_MAJFLT 2 > #define VIRTIO_BALLOON_S_MINFLT 3 > #define VIRTIO_BALLOON_S_MEMFREE 4 > #define VIRTIO_BALLOON_S_MEMTOT 5 > u16 tag; > u64 val; > } __attribute__((packed)); > > Tags > > VIRTIO_BALLOON_S_SWAP_IN The amount of memory that has been > swapped in (in bytes). > > VIRTIO_BALLOON_S_SWAP_OUT The amount of memory that has been > swapped out to disk (in bytes). > > VIRTIO_BALLOON_S_MAJFLT The number of major page faults that > have occurred. > > VIRTIO_BALLOON_S_MINFLT The number of minor page faults that > have occurred. > > VIRTIO_BALLOON_S_MEMFREE The amount of memory not being used > for any purpose (in bytes). > > VIRTIO_BALLOON_S_MEMTOT The total amount of memory available > (in bytes). > > Appendix I: SCSI Host Device > > The virtio SCSI host device groups together one or more virtual > logical units (such as disks), and allows communicating to them > using the SCSI protocol. An instance of the device represents a > SCSI host to which many targets and LUNs are attached. > > The virtio SCSI device services two kinds of requests: > > • command requests for a logical unit; > > • task management functions related to a logical unit, target or > command. > > The device is also able to send out notifications about added and > removed logical units. Together, these capabilities provide a > SCSI transport protocol that uses virtqueues as the transfer > medium. In the transport protocol, the virtio driver acts as the > initiator, while the virtio SCSI host provides one or more > targets that receive and process the requests. > > Configuration > > Subsystem Device ID 8 > > Virtqueues 0:controlq; 1:eventq; 2..n:request queues. > > Feature bits > > VIRTIO_SCSI_F_INOUT (0) A single request can include both > read-only and write-only data buffers. 
> > VIRTIO_SCSI_F_HOTPLUG (1) The host should enable > hot-plug/hot-unplug of new LUNs and targets on the SCSI bus. > > Device configuration layout All fields of this configuration > are always available. sense_size and cdb_size are writable by > the guest. > > struct virtio_scsi_config { > u32 num_queues; > u32 seg_max; > u32 max_sectors; > u32 cmd_per_lun; > u32 event_info_size; > u32 sense_size; > u32 cdb_size; > u16 max_channel; > u16 max_target; > u32 max_lun; > }; > > num_queues is the total number of request virtqueues exposed by > the device. The driver is free to use only one request queue, > or it can use more to achieve better performance. > > seg_max is the maximum number of segments that can be in a > command. A bidirectional command can include seg_max input > segments and seg_max output segments. > > max_sectors is a hint to the guest about the maximum transfer > size it should use. > > cmd_per_lun is a hint to the guest about the maximum number of > linked commands it should send to one LUN. The actual value > to be used is the minimum of cmd_per_lun and the virtqueue > size. > > event_info_size is the maximum size that the device will fill > for buffers that the driver places in the eventq. The driver > should always supply buffers of at least this size. It is > written by the device depending on the set of negotiated > features. > > sense_size is the maximum size of the sense data that the > device will write. The default value is written by the device > and will always be 96, but the driver can modify it. It is > restored to the default when the device is reset. > > cdb_size is the maximum size of the CDB that the driver will > write. The default value is written by the device and will > always be 32, but the driver can likewise modify it. It is > restored to the default when the device is reset.
> > max_channel, max_target and max_lun can be used by the driver > as hints to constrain scanning the logical units on the > host. > > Device Initialization > > The initialization routine should first of all discover the > device's virtqueues. > > If the driver uses the eventq, it should then place at least one > buffer in the eventq. > > The driver can immediately issue requests (for example, INQUIRY > or REPORT LUNS) or task management functions (for example, I_T > RESET). > > Device Operation: request queues > > The driver queues requests to an arbitrary request queue, and > they are used by the device on that same queue. It is the > responsibility of the driver to ensure strict request ordering > for commands placed on different queues, because they will be > consumed with no order constraints. > > Requests have the following format: > > struct virtio_scsi_req_cmd { > // Read-only part > u8 lun[8]; > u64 id; > u8 task_attr; > u8 prio; > u8 crn; > char cdb[cdb_size]; > char dataout[]; > // Write-only part > u32 sense_len; > u32 residual; > u16 status_qualifier; > u8 status; > u8 response; > u8 sense[sense_size]; > char datain[]; > }; > > > /* command-specific response values */ > #define VIRTIO_SCSI_S_OK 0 > #define VIRTIO_SCSI_S_OVERRUN 1 > #define VIRTIO_SCSI_S_ABORTED 2 > #define VIRTIO_SCSI_S_BAD_TARGET 3 > #define VIRTIO_SCSI_S_RESET 4 > #define VIRTIO_SCSI_S_BUSY 5 > #define VIRTIO_SCSI_S_TRANSPORT_FAILURE 6 > #define VIRTIO_SCSI_S_TARGET_FAILURE 7 > #define VIRTIO_SCSI_S_NEXUS_FAILURE 8 > #define VIRTIO_SCSI_S_FAILURE 9 > > /* task_attr */ > #define VIRTIO_SCSI_S_SIMPLE 0 > #define VIRTIO_SCSI_S_ORDERED 1 > #define VIRTIO_SCSI_S_HEAD 2 > #define VIRTIO_SCSI_S_ACA 3 > > The lun field addresses a target and logical unit in the > virtio-scsi device's SCSI domain. The only supported format for > the LUN field is: first byte set to 1, second byte set to target, > third and fourth bytes representing a single level LUN structure, > followed by four zero bytes.
With this representation, a > virtio-scsi device can serve up to 256 targets and 16384 LUNs per > target. > > The id field is the command identifier (“tag”). > > task_attr, prio and crn should be left to zero. task_attr defines > the task attribute as in the table above, but all task attributes > may be mapped to SIMPLE by the device; crn may also be provided > by clients, but is generally expected to be 0. The maximum CRN > value defined by the protocol is 255, since CRN is stored in an > 8-bit integer. > > All of these fields are defined in SAM. They are always > read-only, as are the cdb and dataout field. The cdb_size is > taken from the configuration space. > > sense and subsequent fields are always write-only. The sense_len > field indicates the number of bytes actually written to the sense > buffer. The residual field indicates the residual size, > calculated as “data_length - number_of_transferred_bytes”, for > read or write operations. For bidirectional commands, the > number_of_transferred_bytes includes both read and written bytes. > A residual field that is less than the size of datain means that > the dataout field was processed entirely. A residual field that > exceeds the size of datain means that the dataout field was > processed partially and the datain field was not processed at > all. > > The status byte is written by the device to be the status code as > defined in SAM. > > The response byte is written by the device to be one of the > following: > > VIRTIO_SCSI_S_OK when the request was completed and the status > byte is filled with a SCSI status code (not necessarily > "GOOD"). > > VIRTIO_SCSI_S_OVERRUN if the content of the CDB requires > transferring more data than is available in the data buffers. > > VIRTIO_SCSI_S_ABORTED if the request was cancelled due to an > ABORT TASK or ABORT TASK SET task management function. 
> > VIRTIO_SCSI_S_BAD_TARGET if the request was never processed > because the target indicated by the lun field does not exist. > > VIRTIO_SCSI_S_RESET if the request was cancelled due to a bus > or device reset (including a task management function). > > VIRTIO_SCSI_S_TRANSPORT_FAILURE if the request failed due to a > problem in the connection between the host and the target > (severed link). > > VIRTIO_SCSI_S_TARGET_FAILURE if the target is suffering a > failure and the guest should not retry on other paths. > > VIRTIO_SCSI_S_NEXUS_FAILURE if the nexus is suffering a failure > but retrying on other paths might yield a different result. > > VIRTIO_SCSI_S_BUSY if the request failed but retrying on the > same path should work. > > VIRTIO_SCSI_S_FAILURE for other host or guest error. In > particular, if neither dataout nor datain is empty, and the > VIRTIO_SCSI_F_INOUT feature has not been negotiated, the > request will be immediately returned with a response equal to > VIRTIO_SCSI_S_FAILURE. > > Device Operation: controlq > > The controlq is used for other SCSI transport operations. > Requests have the following format: > > struct virtio_scsi_ctrl { > u32 type; > ... > u8 response; > }; > > /* response values valid for all commands */ > #define VIRTIO_SCSI_S_OK 0 > #define VIRTIO_SCSI_S_BAD_TARGET 3 > #define VIRTIO_SCSI_S_BUSY 5 > #define VIRTIO_SCSI_S_TRANSPORT_FAILURE 6 > #define VIRTIO_SCSI_S_TARGET_FAILURE 7 > #define VIRTIO_SCSI_S_NEXUS_FAILURE 8 > #define VIRTIO_SCSI_S_FAILURE 9 > #define VIRTIO_SCSI_S_INCORRECT_LUN 12 > > The type identifies the remaining fields. 
> > The following commands are defined: > > Task management function > #define VIRTIO_SCSI_T_TMF 0 > > #define VIRTIO_SCSI_T_TMF_ABORT_TASK 0 > #define VIRTIO_SCSI_T_TMF_ABORT_TASK_SET 1 > #define VIRTIO_SCSI_T_TMF_CLEAR_ACA 2 > #define VIRTIO_SCSI_T_TMF_CLEAR_TASK_SET 3 > #define VIRTIO_SCSI_T_TMF_I_T_NEXUS_RESET 4 > #define VIRTIO_SCSI_T_TMF_LOGICAL_UNIT_RESET 5 > #define VIRTIO_SCSI_T_TMF_QUERY_TASK 6 > #define VIRTIO_SCSI_T_TMF_QUERY_TASK_SET 7 > > struct virtio_scsi_ctrl_tmf > { > // Read-only part > u32 type; > u32 subtype; > u8 lun[8]; > u64 id; > // Write-only part > u8 response; > }; > > /* command-specific response values */ > #define VIRTIO_SCSI_S_FUNCTION_COMPLETE 0 > #define VIRTIO_SCSI_S_FUNCTION_SUCCEEDED 10 > #define VIRTIO_SCSI_S_FUNCTION_REJECTED 11 > > The type is VIRTIO_SCSI_T_TMF. All > fields except response are filled by the driver. The subtype > field must always be specified and identifies the requested > task management function. > > Other fields may be irrelevant for the requested TMF; if so, > they are ignored but they should still be present. The lun > field is in the same format specified for request queues; the > single level LUN is ignored when the task management function > addresses a whole I_T nexus. When relevant, the value of the id > field is matched against the id values passed on the requestq. > > The outcome of the task management function is written by the > device in the response field. The command-specific response > values map 1-to-1 with those defined in SAM.
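As a rough illustration of the task management request described above, the following C sketch fills in a LOGICAL UNIT RESET request. It is not taken from any driver: the helper names (virtio_scsi_encode_lun, virtio_scsi_fill_lun_reset) are hypothetical, the stdint types stand in for u8/u32/u64, and encoding the single level LUN with the SAM flat-space addressing method (0x40 in the top bits of the third byte) is an assumption that the text above does not mandate. Placing the structure on the controlq is left out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define VIRTIO_SCSI_T_TMF                     0
#define VIRTIO_SCSI_T_TMF_LOGICAL_UNIT_RESET  5

/* Mirrors struct virtio_scsi_ctrl_tmf from the text. With natural
 * alignment there is no padding before 'response'; any trailing
 * padding is not part of the request. */
struct virtio_scsi_ctrl_tmf {
    /* Read-only part (filled by the driver) */
    uint32_t type;
    uint32_t subtype;
    uint8_t  lun[8];
    uint64_t id;
    /* Write-only part (filled by the device) */
    uint8_t  response;
};

/* Hypothetical helper: byte 0 is 1, byte 1 is the target, bytes 2-3
 * hold the single level LUN, bytes 4-7 are zero. */
static void virtio_scsi_encode_lun(uint8_t out[8], uint8_t target,
                                   uint16_t lun)
{
    assert(lun < 16384);                   /* 14 bits of LUN per target */
    memset(out, 0, 8);
    out[0] = 1;
    out[1] = target;
    out[2] = (uint8_t)(0x40 | (lun >> 8)); /* assumption: flat-space method */
    out[3] = (uint8_t)(lun & 0xff);
}

/* Hypothetical helper: prepare a LOGICAL UNIT RESET for (target, lun). */
static void virtio_scsi_fill_lun_reset(struct virtio_scsi_ctrl_tmf *req,
                                       uint8_t target, uint16_t lun)
{
    memset(req, 0, sizeof(*req));
    req->type = VIRTIO_SCSI_T_TMF;
    req->subtype = VIRTIO_SCSI_T_TMF_LOGICAL_UNIT_RESET;
    virtio_scsi_encode_lun(req->lun, target, lun);
    req->id = 0; /* irrelevant for this TMF, but still present */
}
```

After the device returns the buffer, a driver would compare the response byte against VIRTIO_SCSI_S_FUNCTION_COMPLETE and the other values listed above.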
> > Asynchronous notification query > > #define VIRTIO_SCSI_T_AN_QUERY 1 > > struct virtio_scsi_ctrl_an { > // Read-only part > u32 type; > u8 lun[8]; > u32 event_requested; > // Write-only part > u32 event_actual; > u8 response; > } > > #define VIRTIO_SCSI_EVT_ASYNC_OPERATIONAL_CHANGE 2 > #define VIRTIO_SCSI_EVT_ASYNC_POWER_MGMT 4 > #define VIRTIO_SCSI_EVT_ASYNC_EXTERNAL_REQUEST 8 > #define VIRTIO_SCSI_EVT_ASYNC_MEDIA_CHANGE 16 > #define VIRTIO_SCSI_EVT_ASYNC_MULTI_HOST 32 > #define VIRTIO_SCSI_EVT_ASYNC_DEVICE_BUSY 64 > > By sending this command, the driver asks the device which > events the given LUN can report, as described in paragraphs 6.6 > and A.6 of the SCSI MMC specification. The driver writes the > events it is interested in into the event_requested; the device > responds by writing the events that it supports into > event_actual. > > The type is VIRTIO_SCSI_T_AN_QUERY. The lun and event_requested > fields are written by the driver. The event_actual and response > fields are written by the device. > > No command-specific values are defined for the response byte. > > Asynchronous notification subscription > #define VIRTIO_SCSI_T_AN_SUBSCRIBE 2 > > struct virtio_scsi_ctrl_an { > // Read-only part > u32 type; > u8 lun[8]; > u32 event_requested; > // Write-only part > u32 event_actual; > u8 response; > } > > By sending this command, the driver asks the specified LUN to > report events for its physical interface, again as described in > the SCSI MMC specification. The driver writes the events it is > interested in into the event_requested; the device responds by > writing the events that it supports into event_actual. > > Event types are the same as for the asynchronous notification > query message. > > The type is VIRTIO_SCSI_T_AN_SUBSCRIBE. The lun and > event_requested fields are written by the driver. The > event_actual and response fields are written by the device. > > No command-specific values are defined for the response byte. 
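The query and subscription commands share one layout and differ only in the type field. A minimal C sketch of how a driver might fill either request and then compare event_actual with what it asked for is shown here. The helper names (virtio_scsi_fill_an, virtio_scsi_an_missing) are hypothetical and the stdint types stand in for u8/u32; how the request reaches the controlq is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define VIRTIO_SCSI_T_AN_QUERY      1
#define VIRTIO_SCSI_T_AN_SUBSCRIBE  2

#define VIRTIO_SCSI_EVT_ASYNC_OPERATIONAL_CHANGE  2
#define VIRTIO_SCSI_EVT_ASYNC_POWER_MGMT          4
#define VIRTIO_SCSI_EVT_ASYNC_EXTERNAL_REQUEST    8
#define VIRTIO_SCSI_EVT_ASYNC_MEDIA_CHANGE        16
#define VIRTIO_SCSI_EVT_ASYNC_MULTI_HOST          32
#define VIRTIO_SCSI_EVT_ASYNC_DEVICE_BUSY         64

/* Mirrors struct virtio_scsi_ctrl_an from the text. */
struct virtio_scsi_ctrl_an {
    /* Read-only part (filled by the driver) */
    uint32_t type;
    uint8_t  lun[8];
    uint32_t event_requested;
    /* Write-only part (filled by the device) */
    uint32_t event_actual;
    uint8_t  response;
};

/* Hypothetical helper: fill a query or subscription request.
 * 'type' is VIRTIO_SCSI_T_AN_QUERY or VIRTIO_SCSI_T_AN_SUBSCRIBE. */
static void virtio_scsi_fill_an(struct virtio_scsi_ctrl_an *req,
                                uint32_t type, const uint8_t lun[8],
                                uint32_t events)
{
    assert(req != NULL);
    memset(req, 0, sizeof(*req));
    req->type = type;
    memcpy(req->lun, lun, 8);
    req->event_requested = events;
}

/* Hypothetical helper: events the driver asked for that the device
 * did not report as supported in event_actual. */
static uint32_t virtio_scsi_an_missing(const struct virtio_scsi_ctrl_an *req)
{
    return req->event_requested & ~req->event_actual;
}
```

A non-zero result from virtio_scsi_an_missing would tell the driver which of its requested events it has to poll for by other means.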
> > Device Operation: eventq > > The eventq is used by the device to report information on logical > units that are attached to it. The driver should always leave a > few buffers ready in the eventq. In general, the device will not > queue events to cope with an empty eventq, and will end up > dropping events if it finds no buffer ready. However, when > reporting events for many LUNs (e.g. when a whole target > disappears), the device can throttle events to avoid dropping > them. For this reason, placing 10-15 buffers on the event queue > should be enough. > > Buffers are placed in the eventq and filled by the device when > interesting events occur. The buffers should be strictly > write-only (device-filled) and the size of the buffers should be > at least the value given in the device's configuration > information. > > Buffers returned by the device on the eventq will be referred to > as "events" in the rest of this section. Events have the > following format: > > #define VIRTIO_SCSI_T_EVENTS_MISSED 0x80000000 > > struct virtio_scsi_event { > // Write-only part > u32 event; > ... > } > > If bit 31 is set in the event field, the device failed to report > an event due to missing buffers. In this case, the driver should > poll the logical units for unit attention conditions, and/or do > whatever form of bus scan is appropriate for the guest operating > system. > > Other data that the device writes to the buffer depends on the > contents of the event field. The following events are defined: > > No event > #define VIRTIO_SCSI_T_NO_EVENT 0 > > This event is fired in the following cases: > > • When the device detects in the eventq a buffer that is > shorter than what is indicated in the configuration field, it > might use it immediately and put this dummy value in the > event field. A well-written driver will never observe this > situation. 
> > • When events are dropped, the device may signal this event as > soon as the driver makes a buffer available, in order to > request action from the driver. In this case, of course, this > event will be reported with the VIRTIO_SCSI_T_EVENTS_MISSED > flag. > > Transport reset > #define VIRTIO_SCSI_T_TRANSPORT_RESET 1 > > struct virtio_scsi_event_reset { > // Write-only part > u32 event; > u8 lun[8]; > u32 reason; > }; > > #define VIRTIO_SCSI_EVT_RESET_HARD 0 > #define VIRTIO_SCSI_EVT_RESET_RESCAN 1 > #define VIRTIO_SCSI_EVT_RESET_REMOVED 2 > > By sending this event, the device signals that a logical unit > on a target has been reset, including the case of a new device > appearing or disappearing on the bus. The device fills in all > fields. The event field is set to > VIRTIO_SCSI_T_TRANSPORT_RESET. The lun field addresses a > logical unit in the SCSI host. > > The reason value is one of the three #define values appearing > above: > > • VIRTIO_SCSI_EVT_RESET_REMOVED (“LUN/target removed”) is used > if the target or logical unit is no longer able to receive > commands. > > • VIRTIO_SCSI_EVT_RESET_HARD (“LUN hard reset”) is used if the > logical unit has been reset, but is still present. > > • VIRTIO_SCSI_EVT_RESET_RESCAN (“rescan LUN/target”) is used if > a target or logical unit has just appeared on the device. > > The “removed” and “rescan” events, when sent for LUN 0, may > apply to the entire target. After receiving them the driver > should ask the initiator to rescan the target, in order to > detect the case when an entire target has appeared or > disappeared. These two events will never be reported unless the > VIRTIO_SCSI_F_HOTPLUG feature was negotiated between the host > and the guest.
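To make the handling rules above concrete, here is a hedged C sketch of how a driver might classify a buffer returned on the eventq. The enum, its values and the function name virtio_scsi_classify_reset are hypothetical; the LUN number is extracted by masking off the top two bits of the third byte, which assumes the addressing method lives there. What each action means in practice depends on the guest operating system.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VIRTIO_SCSI_T_EVENTS_MISSED    0x80000000u
#define VIRTIO_SCSI_T_NO_EVENT         0
#define VIRTIO_SCSI_T_TRANSPORT_RESET  1

#define VIRTIO_SCSI_EVT_RESET_HARD     0
#define VIRTIO_SCSI_EVT_RESET_RESCAN   1
#define VIRTIO_SCSI_EVT_RESET_REMOVED  2

/* Mirrors struct virtio_scsi_event_reset from the text. */
struct virtio_scsi_event_reset {
    uint32_t event;
    uint8_t  lun[8];
    uint32_t reason;
};

/* Hypothetical set of driver reactions. */
enum reset_action {
    ACT_NONE,           /* nothing to do for this buffer */
    ACT_FULL_RESCAN,    /* events were missed: poll or scan the bus */
    ACT_LUN_WAS_RESET,  /* LUN hard reset, still present */
    ACT_RESCAN_LUN,     /* a LUN has just appeared */
    ACT_REMOVE_LUN,     /* a LUN can no longer receive commands */
    ACT_RESCAN_TARGET   /* LUN 0: may apply to the entire target */
};

static enum reset_action
virtio_scsi_classify_reset(const struct virtio_scsi_event_reset *ev)
{
    assert(ev != NULL);

    /* Bit 31 set: the device dropped events for lack of buffers. */
    if (ev->event & VIRTIO_SCSI_T_EVENTS_MISSED)
        return ACT_FULL_RESCAN;
    if (ev->event != VIRTIO_SCSI_T_TRANSPORT_RESET)
        return ACT_NONE;

    /* Assumption: low 14 bits of bytes 2-3 carry the LUN number. */
    unsigned lun = ((unsigned)(ev->lun[2] & 0x3f) << 8) | ev->lun[3];

    switch (ev->reason) {
    case VIRTIO_SCSI_EVT_RESET_HARD:
        return ACT_LUN_WAS_RESET;
    case VIRTIO_SCSI_EVT_RESET_RESCAN:
        return lun == 0 ? ACT_RESCAN_TARGET : ACT_RESCAN_LUN;
    case VIRTIO_SCSI_EVT_RESET_REMOVED:
        return lun == 0 ? ACT_RESCAN_TARGET : ACT_REMOVE_LUN;
    default:
        return ACT_NONE; /* unknown reason: ignore in this sketch */
    }
}
```

Treating every event with the missed flag as a request for a full rescan is the simplest reading of the text; a real driver could still process the rest of such an event before rescanning.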
> > Events will also be reported via sense codes (this obviously > does not apply to newly appeared buses or targets, since the > application has never discovered them): > > • “LUN/target removed” maps to sense key ILLEGAL REQUEST, asc > 0x25, ascq 0x00 (LOGICAL UNIT NOT SUPPORTED) > > • “LUN hard reset” maps to sense key UNIT ATTENTION, asc 0x29 > (POWER ON, RESET OR BUS DEVICE RESET OCCURRED) > > • “rescan LUN/target” maps to sense key UNIT ATTENTION, asc > 0x3f, ascq 0x0e (REPORTED LUNS DATA HAS CHANGED) > > The preferred way to detect transport reset is always to use > events, because sense codes are only seen by the driver when it > sends a SCSI command to the logical unit or target. However, in > case events are dropped, the initiator will still be able to > synchronize with the actual state of the controller if the > driver asks the initiator to rescan the SCSI bus. During the > rescan, the initiator will be able to observe the above sense > codes, and it will process them as if the driver had > received the equivalent event. > > Asynchronous notification > #define VIRTIO_SCSI_T_ASYNC_NOTIFY 2 > > struct virtio_scsi_event_an { > // Write-only part > u32 event; > u8 lun[8]; > u32 reason; > }; > > By sending this event, the device signals that an asynchronous > event was fired from a physical interface. > > All fields are written by the device. The event field is set to > VIRTIO_SCSI_T_ASYNC_NOTIFY. The lun field addresses a logical > unit in the SCSI host. The reason field is a subset of the > events that the driver has subscribed to via the "Asynchronous > notification subscription" command. > > When dropped events are reported, the driver should poll for > asynchronous events manually using SCSI commands. > > Appendix X: virtio-mmio > > Virtual environments without PCI support (a common situation in > embedded device models) might use a simple memory mapped device > (“virtio-mmio”) instead of the PCI device.
> > The memory mapped virtio device behaviour is based on the PCI > device specification. Therefore most of operations like device > initialization, queues configuration and buffer transfers are > nearly identical. Existing differences are described in the > following sections. > > Device Initialization > > Instead of using the PCI IO space for virtio header, the “ > virtio-mmio” device provides a set of memory mapped control > registers, all 32 bits wide, followed by device-specific > configuration space. The following list presents their layout: > > • Offset from the device base address | Direction | Name > Description > > • 0x000 | R | MagicValue > “virt” string. > > • 0x004 | R | Version > Device version number. Currently must be 1. > > • 0x008 | R | DeviceID > Virtio Subsystem Device ID (ie. 1 for network card). > > • 0x00c | R | VendorID > Virtio Subsystem Vendor ID. > > • 0x010 | R | HostFeatures > Flags representing features the device supports. > Reading from this register returns 32 consecutive flag bits, > first bit depending on the last value written to > HostFeaturesSel register. Access to this register returns bits HostFeaturesSel*32 > > to (HostFeaturesSel*32)+31, eg. feature bits 0 to 31 if > HostFeaturesSel is set to 0 and features bits 32 to 63 if > HostFeaturesSel is set to 1. Also see [sub:Feature-Bits] > > • 0x014 | W | HostFeaturesSel > Device (Host) features word selection. > Writing to this register selects a set of 32 device feature bits > accessible by reading from HostFeatures register. Device driver > must write a value to the HostFeaturesSel register before > reading from the HostFeatures register. > > • 0x020 | W | GuestFeatures > Flags representing device features understood and activated by > the driver. > Writing to this register sets 32 consecutive flag bits, first > bit depending on the last value written to GuestFeaturesSel > register. Access to this register sets bits GuestFeaturesSel*32 > to (GuestFeaturesSel*32)+31, eg. 
feature bits 0 to 31 if > GuestFeaturesSel is set to 0 and features bits 32 to 63 if > GuestFeaturesSel is set to 1. Also see [sub:Feature-Bits] > > • 0x024 | W | GuestFeaturesSel > Activated (Guest) features word selection. > Writing to this register selects a set of 32 activated feature > bits accessible by writing to the GuestFeatures register. > Device driver must write a value to the GuestFeaturesSel > register before writing to the GuestFeatures register. > > • 0x028 | W | GuestPageSize > Guest page size. > Device driver must write the guest page size in bytes to the > register during initialization, before any queues are used. > This value must be a power of 2 and is used by the Host to > calculate Guest address of the first queue page (see QueuePFN). > > • 0x030 | W | QueueSel > Virtual queue index (first queue is 0). > Writing to this register selects the virtual queue that the > following operations on QueueNum, QueueAlign and QueuePFN apply > to. > > • 0x034 | R | QueueNumMax > Maximum virtual queue size. > Reading from the register returns the maximum size of the queue > the Host is ready to process or zero (0x0) if the queue is not > available. This applies to the queue selected by writing to > QueueSel and is allowed only when QueuePFN is set to zero > (0x0), so when the queue is not actively used. > > • 0x038 | W | QueueNum > Virtual queue size. > Queue size is a number of elements in the queue, therefore size > of the descriptor table and both available and used rings. > Writing to this register notifies the Host what size of the > queue the Guest will use. This applies to the queue selected by > writing to QueueSel. > > • 0x03c | W | QueueAlign > Used Ring alignment in the virtual queue. > Writing to this register notifies the Host about alignment > boundary of the Used Ring in bytes. This value must be a power > of 2 and applies to the queue selected by writing to QueueSel. 
> > • 0x040 | RW | QueuePFN > Guest physical page number of the virtual queue. > Writing to this register notifies the host about location of the > virtual queue in the Guest's physical address space. This value > is the index number of a page starting with the queue > Descriptor Table. Value zero (0x0) means physical address zero > (0x00000000) and is illegal. When the Guest stops using the > queue it must write zero (0x0) to this register. > Reading from this register returns the currently used page > number of the queue, therefore a value other than zero (0x0) > means that the queue is in use. > Both read and write accesses apply to the queue selected by > writing to QueueSel. > > • 0x050 | W | QueueNotify > Queue notifier. > Writing a queue index to this register notifies the Host that > there are new buffers to process in the queue. > > • 0x60 | R | InterruptStatus > Interrupt status. > Reading from this register returns a bit mask of interrupts > asserted by the device. An interrupt is asserted if the > corresponding bit is set, ie. equals one (1). > > – Bit 0 | Used Ring Update > This interrupt is asserted when the Host has updated the Used > Ring in at least one of the active virtual queues. > > – Bit 1 | Configuration change > This interrupt is asserted when configuration of the device has > changed. > > • 0x064 | W | InterruptACK > Interrupt acknowledge. > Writing to this register notifies the Host that the Guest > finished handling interrupts. Set bits in the value clear the > corresponding bits of the InterruptStatus register. > > • 0x070 | RW | Status > Device status. > Reading from this register returns the current device status > flags. > Writing non-zero values to this register sets the status flags, > indicating the Guest progress. Writing zero (0x0) to this > register triggers a device reset. 
> Also see [sub:Device-Initialization-Sequence] > > • 0x100+ | RW | Config > Device-specific configuration space starts at an offset 0x100 > and is accessed with byte alignment. Its meaning and size > depends on the device and the driver. > > Virtual queue size is a number of elements in the queue, > therefore size of the descriptor table and both available and > used rings. > > The endianness of the registers follows the native endianness of > the Guest. Writing to registers described as “R” and reading from > registers described as “W” is not permitted and can cause > undefined behavior. > > The device initialization is performed as described in 2.2.1 Device > Initialization Sequence with one exception: the Guest must notify the > Host about its page size, writing the size in bytes to GuestPageSize > register before the initialization is finished. > > The memory mapped virtio devices generate single interrupt only, > therefore no special configuration is required. > > Virtqueue Configuration > > The virtual queue configuration is performed in a similar way to > the one described in 2.3 Virtqueue Configuration with a few > additional operations: > > 1. Select the queue writing its index (first queue is 0) to the > QueueSel register. > > 2. Check if the queue is not already in use: read QueuePFN > register, returned value should be zero (0x0). > > 3. Read maximum queue size (number of elements) from the > QueueNumMax register. If the returned value is zero (0x0) the > queue is not available. > > 4. Allocate and zero the queue pages in contiguous virtual > memory, aligning the Used Ring to an optimal boundary (usually > page size). Size of the allocated queue may be smaller than or > equal to the maximum size returned by the Host. > > 5. Notify the Host about the queue size by writing the size to > QueueNum register. > > 6. Notify the Host about the used alignment by writing its value > in bytes to QueueAlign register. > > 7. 
Write the physical page number of the first page of the queue to > the QueuePFN register. > > The queue and the device are ready to begin normal operations > now. > > Device Operation > > The memory mapped virtio device behaves in the same way as > described in 2.4 Device Operation, with the following > exceptions: > > 1. The device is notified about new buffers available in a queue > by writing the queue index to register QueueNotify instead of the > virtio header in PCI I/O space (2.4.1.4 Notifying The Device). > > 2. The memory mapped virtio device uses a single, dedicated > interrupt signal, which is raised when at least one of the > interrupts described in the InterruptStatus register > description is asserted. After receiving an interrupt, the > driver must read the InterruptStatus register to check what > caused the interrupt (see the register description). After the > interrupt is handled, the driver must acknowledge it by writing > a bit mask corresponding to the serviced interrupt to the > InterruptACK register. > > > FOOTNOTES: > [1] This lack of page-sharing implies that the implementation of the > device (e.g. the hypervisor or host) needs full access to the > guest memory. Communication with untrusted parties (i.e. > inter-guest communication) requires copying. > > [2] The Linux implementation further separates the PCI virtio code > from the specific virtio drivers: these drivers are shared with > the non-PCI implementations (currently lguest and S/390). > > [3] The actual value within this range is ignored. > > [4] Historically, drivers have used the device before steps 5 and 6. > This is only allowed if the driver does not use any features > which would alter this early use of the device. > > [5] ie. once you enable MSI-X on the device, the other fields move. > If you turn it off again, they move back!
> > [6] The 4096 is based on the x86 page size, but it's also large > enough to ensure that the separate parts of the virtqueue are on > separate cache lines. > > [7] These fields are kept here because this is the only part of the > virtqueue written by the device. > > [8] The Linux drivers do this only for read-only buffers: for > write-only buffers, it is assumed that the driver is merely > trying to keep the receive buffer ring full, and no notification > of this expected condition is necessary. > > [9] https://lists.linux-foundation.org/mailman/listinfo/virtualization > > [10] The current qemu device implementations mistakenly insist that > the first descriptor cover the header in these cases exactly, so > a cautious driver should arrange it so. > > [11] Even if it does mean documenting design or implementation > mistakes! > > [12] Only if VIRTIO_NET_F_CTRL_VQ set. > > [13] It was supposed to indicate segmentation offload support, but > upon further investigation it became clear that multiple bits > were required. > > [14] ie. VIRTIO_NET_F_HOST_TSO* and VIRTIO_NET_F_HOST_UFO are > dependent on VIRTIO_NET_F_CSUM; a device which offers the offload > features must offer the checksum feature, and a driver which > accepts the offload features must accept the checksum feature. > Similar logic applies to the VIRTIO_NET_F_GUEST_TSO4 features > depending on VIRTIO_NET_F_GUEST_CSUM. > > [15] This is a common restriction in real, older network cards. > > [16] For example, a network packet transported between two guests on > the same system may not require checksumming at all, nor segmentation, > if both guests are amenable. > > [17] For example, consider a partially checksummed TCP (IPv4) packet. > It will have a 14 byte ethernet header and 20 byte IP header > followed by the TCP header (with the TCP checksum field 16 bytes > into that header). csum_start will be 14+20 = 34 (the TCP > checksum includes the header), and csum_offset will be 16.
The > value in the TCP checksum field should be initialized to the sum > of the TCP pseudo header, so that replacing it by the ones' > complement checksum of the TCP header and body will give the > correct result. > > [18] Due to various bugs in implementations, this field is not useful > as a guarantee of the transport header size. > > [19] This case is not handled by some older hardware, so is called out > specifically in the protocol. > > [20] Note that the header will be two bytes longer for the > VIRTIO_NET_F_MRG_RXBUF case. > > [20a] Obviously each one can be split across multiple descriptor > elements. > > [21] Since there are no guarantees, it can use a hash filter or > silently switch to allmulti or promiscuous mode if it is given too > many addresses. > > [22] The SCSI_CMD and SCSI_CMD_OUT types are equivalent; the device > does not distinguish between them. > > [23] The FLUSH and FLUSH_OUT types are equivalent; the device does not > distinguish between them. > > [24] Ports 2 onwards only if VIRTIO_CONSOLE_F_MULTIPORT is set. > > [25] Because this is high importance and low bandwidth, the current > Linux implementation polls for the buffer to be used, rather than > waiting for an interrupt, simplifying the implementation > significantly. However, for generic serial ports with the > O_NONBLOCK flag set, the polling limitation is relaxed and the > consumed buffers are freed upon the next write or poll call or > when a port is closed or hot-unplugged. > > [26] Only if VIRTIO_BALLOON_F_STATS_VQ set. > > [27] This is historical, and independent of the guest page size. > > [28] In this case, deflation advice is merely a courtesy. > > [29] As updates to configuration space are not atomic, this field > isn't particularly reliable, but can be used to diagnose buggy guests.