Subject: Re: [virtio] Re: [virtio-comment] [PATCH RFC] virtio-pci: new configuration layout
On Tue, 27 Aug 2013 18:36:29 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Tue, Aug 27, 2013 at 05:09:53PM +0200, Cornelia Huck wrote:
> > Some remarks from my side...
> >
> > On Tue, 27 Aug 2013 10:38:59 +0300
> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >
> > > On Tue, Aug 27, 2013 at 04:24:47PM +0930, Rusty Russell wrote:
> > > > "Michael S. Tsirkin" <mst@redhat.com> writes:
> > > > > This is the new configuration layout.
> > > > >
> > > > > Notes:
> > > > > - Everything is LE
> > > > > - There's a feature bit that means spec 1.0 compliant.
> > > > > - Both devices and drivers can either require the 1.0 interface
> > > > >   or try to include compatibility support. The spec isn't forcing
> > > > >   this decision.
> > > >
> > > > Hmm, this kind includes other changes already proposed, like the LE
> > > > change and the framing change. I think this conceptually splits nicely:
> > > >
> > > > 1) Feature bit 32 proposal.
> > > > 2) Endian change.
> > > > 3) Framing change.
> > > > 4) PCI layout change.
> > >
> > > Right - they are mostly in different parts of the document.
> > > I put it all together so it's easy to see how we intend to
> > > handle the transition.
> > > So is everyone OK with keeping this in a single patch?
> >
> > The new feature bit is supposed to cover all of this, right? Then this
> > should be one patch.
> >
> > > > > - I kept documentation of the legacy interface around, and added notes
> > > > >   on transition inline. They are in separate sections each clearly marked
> > > > >   "Legacy Interface" so we'll be able to separate them out
> > > > >   from the final document as necessary - for now I think it's easier
> > > > >   to keep it all together.
> > > >
> > > > Good thinking: most of us know the current spec so it's definitely
> > > > clearer. And makes sure we're thinking about the transition.
> > > >
> > > > > Only virtio PCI has been converted.
> > > > > Let's discuss this on the meeting tonight, once we figure out PCI
> > > > > we can do something similar for MMIO and CCW.
> > > > >
> > > > > @@ -137,6 +139,11 @@ Feature bits are allocated as follows:
> > > > >    24 to 31: Feature bits reserved for extensions to the queue and
> > > > >    feature negotiation mechanisms
> > > > >
> > > > > + 32: Feature bit must be set for any device compliant with this
> > > > > + revision of the specification, and acknowledged by all device drivers.
> >
> > Would it make sense to have a bit 33 "rings big endian" whose validity
> > depends on bit 32 set? This would make it possible for ccw to keep its
> > current endianness.
>
> I didn't go over ccw or MMIO yet - only PCI.
> I think ccw registers will just
> be explicitly BE, with no need for a feature bit.
> Does this sound right?

Sure, that would be even better.

> > > > > +
> > > > > + 33 to 63: Feature bits reserved for future extensions
> > > > > +
> > > > >  For example, feature bit 0 for a network device (i.e. Subsystem
> > > > >  Device ID 1) indicates that the device supports checksumming of
> > > > >  packets.
> > > >
> > > > Why stop at 63? If we go to a more decentralized feature-assignment
> > > > model, we'll run through those very fast.
> > >
> > > Then we'll just document more, but driver needs to know where to stop
> > > looking for features.
> > >
> > > > > @@ -145,13 +152,63 @@ In particular, new fields in the device configuration space are
> > > > >  indicated by offering a feature bit, so the guest can check
> > > > >  before accessing that part of the configuration space.
> > > > >
> > > > > +2.1.2.1 Legacy Interface: A Note on transitions from earlier drafts
> > > > > +--------------------------------------
> > > > > +
> > > > > +Earlier drafts of this specification (up to 0.9.X) defined a similar, but
> > > > > +different interface between the hypervisor and the guest.
> > > > > +Since these are widely deployed in the field, this specification
> > > > > +accommodates optional features to simplify transition
> > > > > +from these earlier draft interfaces. Specifically:
> > > > > +
> > > > > +Legacy Interface
> > > > > +        is an interface specified by an earlier draft of this specification
> > > > > +        (up to 0.9.X)
> > > > > +Legacy Device
> > > > > +        is a device implemented before this specification was released,
> > > > > +        and implementing a legacy interface on the host side
> > > > > +Legacy Driver
> > > > > +        is a driver implemented before this specification was released,
> > > > > +        and implementing a legacy interface on the guest side
> > > > > +
> > > > > +to simplify transition from these earlier draft interfaces,
> > > > > +it is possible to implement
> > > > > +
> > > > > +Transitional Device
> > > > > +        a device supporting both drivers conforming to this
> > > > > +        specification, and legacy drivers
> > > > > +
> > > > > +Transitional Driver
> > > > > +        a driver supporting both devices conforming to this
> > > > > +        specification, and legacy devices
> >
> > What happens to legacy devices in the future? Current implementers
> > will obviously expose legacy devices, which means future drivers need
> > to be transitional or they won't work with what is currently out there.
>
> You are right. It's a bug in what I wrote: non transitional drivers
> should work with transitional devices.
> This way a transitional device can change to non-transitional
> after drivers are updated.
>
> > Will legacy stay around (for the foreseeable future)?
>
> That's up to implementers I think as long as they
> implement the new standard we should not prevent them from
> bundling in the old virtio, coffee making capabilities etc.
>
> > Will legacy
> > devices still be considered standard compliant (as in "compliant to the
> > legacy standard")?
>
> I don't think they are compliant. We'll split the legacy sections
> from spec out to a separate transition guide before we release
> the spec.

What I'm worried about is the transitional nature of this. We have a
framework now, so there will be users, and on some platforms they will
not expect to need an upgrade, especially where traditional I/O has
been backwards compatible for decades...

> > > > > +
> > > > > +Device and driver that require support for revision 1.0 or newer of
> > > > > +the specification to function, are called non-transitional device and driver,
> > > > > +respectively.
> > > > > +
> > > > > +Transitional Drivers can detect Legacy Devices by detecting that
> > > > > +Feature bit 32 is not offered.
> > > > > +Transitional devices can detect Legacy drivers by detecting that
> > > > > +Feature bit 32 has not been acknowledged by driver.
> >
> > Will we use new feature bits for new, incompatible revisions? Or will
> > we try to stay backwards compatible?
>
> So an incompatible change needs to increment revision ID
> to prevent drivers from loading.
> MMIO and PCI both have revision IDs.
> CCW will need to add something like a revision ID,
> we discussed this already.

Command rejects? I think it is a good idea to try to stay as compatible
as possible; this should really be a last measure.

> > > > > +
> > > > > +To make them easier to locate, specification sections documenting these
> > > > > +transitional features are all explicitly marked with
> > > > > +'Legacy Interface' in the section title.
> > > > > +
> > > > > +
> > > > >  2.1.3 Configuration Space
> > > > >  -------------------------
> > > > >
> > > > >  Configuration space is generally used for rarely-changing or
> > > > >  initialization-time parameters.
> > > > >
> > > > > -Note that this space is generally the guest's native endian,
> > > > > +Note that configuration space generally uses the little-endian format
> > > > > +for multi-byte fields.
> > > > > +
> > > > > +2.1.4.1 Legacy Interface: A Note on Configuration Space endian-ness
> > > > > +--------------------------------------
> > > > > +
> > > > > +Note that for legacy interfaces, configuration space is generally the guest's native endian,
> > > > >  rather than PCI's little-endian.
> > > > >
> > > > >  2.1.4 Virtqueues
> > > > > @@ -164,6 +221,45 @@ transmit and one for receive. Each queue has a 16-bit queue size
> > > > >  parameter, which sets the number of entries and implies the total size
> > > > >  of the queue.
> > > > >
> > > > > +Each virtqueue consists of three parts:
> > > > > +
> > > > > +        Descriptor Table
> > > > > +        Available Ring
> > > > > +        Used Ring
> > > > > +
> > > > > +where each part is physically-contiguous in guest memory,
> > > > > +and has different alignment requirements.
> > > > > +
> > > > > +The Queue Size field controls the total number of bytes
> > > > > +required for each part of the virtqueue.
> > > > > +
> > > > > +The memory alignment and size requirements, in bytes, of each part of the
> > > > > +virtqueue are summarized in the following table (qsz is the Queue Size field):
> > > > > +
> > > > > ++------------------+-----------+-------------+
> > > > > +| Virtqueue Part   | Alignment | Size        |
> > > > > ++------------------+-----------+-------------+
> > > > > ++------------------+-----------+-------------+
> > > > > +| Descriptor Table | 16        | 16 * qsz    |
> > > > > ++------------------+-----------+-------------+
> > > > > +| Available Ring   | 2         | 6 + 2 * qsz |
> > > > > ++------------------+-----------+-------------+
> > > > > +| Used Ring        | 4         | 6 + 4 * qsz |
> > > > > ++------------------+-----------+-------------+
> > > > > +
> > > > > +When the driver wants to send a buffer to the device, it fills in
> > > > > +a slot in the descriptor table (or chains several together), and
> > > > > +writes the descriptor index into the available ring. It then
> > > > > +notifies the device. When the device has finished a buffer, it
> > > > > +writes the descriptor into the used ring, and sends an interrupt.
> > > > > +
> > > > > +
> > > > > +2.1.4.1 Legacy Interfaces: A Note on Virtqueue Layout
> > > > > +--------------------------------------
> > > > > +
> > > > > +For Legacy Interfaces, several additional
> > > > > +restrictions are placed on the virtqueue layout:
> > > > > +
> > > > >  Each virtqueue occupies two or more physically-contiguous pages
> > > > >  (usually defined as 4096 bytes, but depending on the transport)
> > > > >  and consists of three parts:
> > > > > @@ -182,9 +278,8 @@ required for the virtqueue according to the following formula:
> > > > >         + ALIGN(sizeof(u16)*3 + sizeof(struct vring_used_elem)*qsz);
> > > > >  }
> > > > >
> > > > > -This currently wastes some space with padding, but also allows future
> > > > > -extensions such as the VIRTIO_RING_F_EVENT_IDX extension. The
> > > > > -virtqueue layout structure looks like this:
> > > > > +This wastes some space with padding.
> > > > > +The legacy virtqueue layout structure therefore looks like this:
> > > > >
> > > > >  struct vring {
> > > > >          // The actual descriptors (16 bytes each)
> > > > > @@ -200,25 +295,17 @@ virtqueue layout structure looks like this:
> > > > >          struct vring_used used;
> > > > >  };
> > > > >
> > > > > -When the driver wants to send a buffer to the device, it fills in
> > > > > -a slot in the descriptor table (or chains several together), and
> > > > > -writes the descriptor index into the available ring. It then
> > > > > -notifies the device. When the device has finished a buffer, it
> > > > > -writes the descriptor into the used ring, and sends an interrupt.
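
For reference, a minimal sketch of the legacy size formula quoted in the
hunk above. Only the second ALIGN() term is visible in the patch context;
the first term, the struct layouts, the helper names and the 4096-byte
alignment are assumptions taken from the 0.9.x legacy layout, not from
this patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: legacy PCI virtqueues are aligned to 4096-byte pages. */
#define VRING_LEGACY_ALIGN 4096u

/* Assumed legacy layouts: 16-byte descriptors, 8-byte used elements. */
struct vring_desc {
    uint64_t addr;
    uint32_t len;
    uint16_t flags;
    uint16_t next;
};

struct vring_used_elem {
    uint32_t id;
    uint32_t len;
};

/* Round x up to the next multiple of VRING_LEGACY_ALIGN. */
static size_t align_up(size_t x)
{
    return (x + VRING_LEGACY_ALIGN - 1) & ~(size_t)(VRING_LEGACY_ALIGN - 1);
}

/*
 * Total bytes for a legacy virtqueue with qsz entries.
 * First term (descriptor table + available ring) is assumed;
 * second term (used ring) is the one shown in the quoted hunk.
 */
static size_t legacy_vring_size(unsigned int qsz)
{
    return align_up(sizeof(struct vring_desc) * qsz
                    + sizeof(uint16_t) * (3 + qsz))
         + align_up(sizeof(uint16_t) * 3
                    + sizeof(struct vring_used_elem) * qsz);
}
```

The padding the patch text mentions shows up directly: even a very small
queue occupies two full pages under these assumptions.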
> > > > > -
> > > > > -2.1.4.1 A Note on Virtqueue Endianness
> > > > > +2.1.4.1 Legacy Interfaces: A Note on Virtqueue Endianness
> > > > >  --------------------------------------
> > > > >
> > > > >  Note that the endian of fields and in the virtqueue is the native
> > > > > -endian of the guest, not little-endian as PCI normally is. This makes
> > > > > -for simpler guest code, and it is assumed that the host already has to
> > > > > -be deeply aware of the guest endian so such an “endian-aware” device
> > > > > -is not a significant issue.
> > > > > +endian of the guest, not little-endian as PCI normally is.
> > > > > +It is assumed that the host is already aware of the guest endian.
> > > > >
> > > > >  2.1.4.2 Message Framing
> > > > >  -----------------------
> > > > > -The original intent of the specification was that message framing (the
> > > > > -particular layout of descriptors) be independent of the contents of
> > > > > +Generally, the intent of the specification is for message framing (the
> > > > > +particular layout of descriptors) to be independent of the contents of
> > > > >  the buffers. For example, a network transmit buffer consists of a 12
> > > > >  byte header followed by the network packet. This could be most simply
> > > > >  placed in the descriptor table as a 12 byte output descriptor followed
> > > > > @@ -227,16 +314,21 @@ single 1526 byte output descriptor in the case where the header and
> > > > >  packet are adjacent, or even three or more descriptors (possibly with
> > > > >  loss of efficiency in that case).
> > > > >
> > > > > -Regrettably, initial driver implementations used simple layouts, and
> > > > > -devices came to rely on it, despite this specification wording[10]. It
> > > > > -is thus recommended that drivers be conservative in their assumptions,
> > > > > -unless the VIRTIO_F_ANY_LAYOUT feature is accepted. In addition, some
> > > > > +In addition, some
> > > > >  implementations may have large-but-reasonable restrictions on total
> > > > >  descriptor size (such as based on IOV_MAX in the host OS). This has
> > > > >  not been a problem in practice: little sympathy will be given to
> > > > >  drivers which create unreasonably-sized descriptors such as by
> > > > >  dividing a network packet into 1500 single-byte descriptors!
> > > > >
> > > > > +2.1.4.2.1 Legacy Interfaces: A Note on Message Framing
> > > > > +-----------------------
> > > > > +Regrettably, initial driver implementations used simple layouts, and
> > > > > +devices came to rely on it, despite this specification wording[10]. It
> > > > > +is thus recommended that when using legacy interfaces,
> > > > > +drivers should be conservative in their assumptions,
> > > > > +unless the VIRTIO_F_ANY_LAYOUT feature is accepted.
> >
> > So ANY_LAYOUT and feature bit 32 are mutually exclusive?
>
> Hmm. I wonder what gives this impression.
> What I tried to say is bit 32 should imply ANY_LAYOUT.

Better to spell it out, then.

> > > > > +
> > > > >  2.1.4.3 The Virtqueue Descriptor Table
> > > > >  --------------------------------------
> > > > >
> > > > > @@ -386,23 +478,27 @@ how to communicate with the specific device.
> > > > >  2.2.1 Device Initialization
> > > > >  ---------------------------
> > > > >
> > > > > -1. Reset the device. This is not required on initial start up.
> > > > > +1. Device discovery. This is only required for some transports.
> > > > > +
> > > > > +2. Reset the device. This is not required on initial start up.
> > > > >
> > > > > -2. The ACKNOWLEDGE status bit is set: we have noticed the device.
> > > > > +3. Device layout detection. This is only required for some transports.
> > > > >
> > > > > -3. The DRIVER status bit is set: we know how to drive the device.
> > > > > +4. The ACKNOWLEDGE status bit is set: we have noticed the device.
> > > > >
> > > > > -4. Device-specific setup, including reading the device feature
> > > > > +5. The DRIVER status bit is set: we know how to drive the device.
> > > > > +
> > > > > +6. Device-specific setup, including reading the device feature
> > > > >  bits, discovery of virtqueues for the device, optional per-bus
> > > > >  setup, and reading and possibly writing the device's virtio
> > > > >  configuration space.
> > > > >
> > > > > -5. The subset of device feature bits understood by the driver is
> > > > > +7. The subset of device feature bits understood by the driver is
> > > > >  written to the device.
> > > > >
> > > > > -6. The DRIVER_OK status bit is set.
> > > > > +8. The DRIVER_OK status bit is set.
> > > > >
> > > > > -7. The device can now be used (ie. buffers added to the
> > > > > +9. The device can now be used (ie. buffers added to the
> > > > >  virtqueues)[4]
> > > > >
> > > > >  If any of these steps go irrecoverably wrong, the guest should
> > > > > @@ -622,35 +718,183 @@ Virtio devices are commonly implemented as PCI devices.
> > > > >
> > > > >  Any PCI device with Vendor ID 0x1AF4, and Device ID 0x1000 through
> > > > >  0x103F inclusive is a virtio device[3]. The device must also have a
> > > > > -Revision ID of 0 to match this specification.
> > > > > +Revision ID of 0 or Revision ID of 1 to match this specification.
> > > > >
> > > > >  The Subsystem Device ID indicates which virtio device is
> > > > >  supported by the device. The Subsystem Vendor ID should reflect
> > > > >  the PCI Vendor ID of the environment (it's currently only used
> > > > >  for informational purposes by the guest).
> > > > >
> > > > > +Drivers must not match devices where Revision ID does not match 0 or 1.
> > > > > +
> > > > > +2.4.1.1.1 Legacy Interface: A Note on PCI Device Discovery
> > > > > +----------------------------
> > > > > +Transitional devices must have a Revision ID of 0.
> > > > > +
> > > > > +Non-transitional devices must have a Revision ID of 1.
> > > > > +
> > > > > +Transitional drivers must match a Revision ID of 0 or 1.
> > > > > +
> > > > > +Non-transitional drivers must only match a Revision ID of 1.
> > > > > +
> > > >
> > > > I think we should stop abusing Revision IDs, and start using them
> > > > to reflect device version changes as intended.
> > > >
> > > > We could reserve revision id 0 for legacy devices, however, which should
> > > > work nicely.
> > >
> > > Hmm I am not sure I agree - what does it buy us that feature bits don't already supply?
> > >
> > > More concerns:
> > >
> > > We are using revision ID now exactly as was intended to disable old
> > > drivers - it served us well for 0.X-1.X and would be as useful if we
> > > ever have 1.X->2.0 transition.
> > >
> > > Another worry with using revision numbering for features is that
> > > it does not play well with downstreams.
> > > E.g. RHEL might want to cherry-pick a feature without implementing
> > > other features that happened to land in the same revision.
> > >
> > > Also Revision ID is only 8 bit - it's designed for hardware where
> > > making a new revision is expensive. In software we'll run out of that
> > > eventually.
> >
> > So Revision ID is a PCI-specific thing, right? Not all transports will
> > necessarily have something equivalent, so they would need to depend on
> > the feature bit.
>
> They can't do this reliably - for example you might want to move feature
> bits around.

That sounds like setting yourself up for problems. If you want to
deprecate bits, it would be better to define them as "reserved" and use
a new bit for your new feature. The s390 architecture is full of
"reserved" bits like that.

> For 0.9.X drivers and non-transitional devices,
> I'd like to find some hack to make probe fail.
>
> Any idea?

Not really, sorry.

>
> But let's plan ahead and add a way to do this
> in the future if we make an incompatible change again.

I'd rather have an architecture that allows us to be backwards
compatible for a long time and introduce a new device id/cu type for a
new kind of device if we want to do things differently and ditch old
baggage.

> > > > >
> > > > >  2.4.1.2 PCI Device Layout
> > > > >  -------------------------
> > > > >
> > > > > -To configure the device, we use the first I/O region of the PCI
> > > > > -device. This contains a virtio header followed by a
> > > > > -device-specific region.
> > > > > +To configure the device,
> > > > > +use I/O and/or memory regions and/or PCI configuration space of the PCI device.
> > > > > +These contain the virtio header registers, the notification register, the
> > > > > +ISR status register and device specific registers, as specified by Virtio
> > > > > ++ Structure PCI Capabilities
> > > > > +
> > > > > +There may be different widths of accesses to the I/O region; the
> > > > > +“natural” access method for each field must be
> > > > > +used (i.e. 32-bit accesses for 32-bit fields, etc).
> > > > > +
> > > > > +PCI Device Configuration Layout includes the common configuration,
> > > > > +ISR, notification and device specific configuration
> > > > > +structures.
> > > > > +
> > > > > +Unless explicitly specified otherwise, all multi-byte fields are little-endian.
> > > > > +
> > > > > +
> > > > > +2.4.1.2.1 Common configuration structure layout
> > > > > +-------------------------
> > > > > +Common configuration structure layout is documented below:
> > > > > +
> > > > > +struct virtio_pci_common_cfg {
> > > > > +        /* About the whole device. */
> > > > > +        __le32 device_feature_select;   /* read-write */
> > > > > +        __le32 device_feature;          /* read-only */
> > > > > +        __le32 guest_feature_select;    /* read-write */
> > > > > +        __le32 guest_feature;           /* read-write */
> > > > > +        __le16 msix_config;             /* read-write */
> > > > > +        __le16 num_queues;              /* read-only */
> > > > > +        __u8 device_status;             /* read-write */
> > > > > +        __u8 unused1;
> > > > > +
> > > > > +        /* About a specific virtqueue. */
> > > > > +        __le16 queue_select;            /* read-write */
> > > > > +        __le16 queue_size;              /* read-write, power of 2, or 0. */
> > > > > +        __le16 queue_msix_vector;       /* read-write */
> > > > > +        __le16 queue_enable;            /* read-write */
> > > > > +        __le16 queue_notify_off;        /* read-only */
> > > > > +        __le64 queue_desc;              /* read-write */
> > > > > +        __le64 queue_avail;             /* read-write */
> > > > > +        __le64 queue_used;              /* read-write */
> > > > > +};
> > > > > +
> > > > > +device_feature_select
> > > > > +
> > > > > +        Selects which Feature Bits does device_feature field refer to.
> > > > > +        Value 0x0 selects Feature Bits 0 to 31
> > > > > +        Value 0x1 selects Feature Bits 32 to 63
> > > > > +        All other values cause reads from device_feature to return 0.
> > > > > +
> > > > > +device_feature
> > > > > +
> > > > > +        Used by Device to report Feature Bits to Driver.
> > > > > +        Device Feature Bits selected by device_feature_select.
> > > > > +
> > > > > +guest_feature_select
> > > > > +
> > > > > +        Selects which Feature Bits does guest_feature field refer to.
> > > > > +        Value 0x0 selects Feature Bits 0 to 31
> > > > > +        Value 0x1 selects Feature Bits 32 to 63
> > > > > +        All other values cause writes to guest_feature to be ignored,
> > > > > +        and reads to return 0.
> > > > > +
> > > > > +guest_feature
> > > > > +
> > > > > +        Used by Driver to acknowledge Feature Bits to Device.
> > > > > +        Guest Feature Bits selected by guest_feature_select.
> > > > > +
> > > > > +msix_config
> > > > > +
> > > > > +        Configuration Vector for MSI-X.
> > > > > +
> > > > > +num_queues
> > > > > +
> > > > > +        Specifies the maximum number of virtqueues supported by device.
> > > > > +
> > > > > +device_status
> > > > > +
> > > > > +        Device Status field.
> > > > > +
> > > > > +queue_select
> > > > > +
> > > > > +        Queue Select. Selects which virtqueue do other fields refer to.
> > > > > +
> > > > > +queue_size
> > > > > +
> > > > > +        Queue Size. On reset, specifies the maximum queue size supported by
> > > > > +        the hypervisor. This can be modified by driver to reduce memory requirements.
> > > > > +        Set to 0 if this virtqueue is unused.
> > > > > +
> > > > > +queue_msix_vector
> > > > > +
> > > > > +        Queue Vector for MSI-X.
> > > > > +
> > > > > +queue_enable
> > > > > +
> > > > > +        Used to selectively prevent host from executing requests from this virtqueue.
> > > > > +        1 - enabled; 0 - disabled
> > > > > +
> > > > > +queue_notify_off
> > > > > +
> > > > > +        Used to calculate the offset from start of Notification structure at
> > > > > +        which this virtqueue is located.
> > > > > +        Note: this is *not* an offset in bytes. See notify_off_multiplier below.
> > > > > +
> > > > > +queue_desc
> > > > > +
> > > > > +        Physical address of Descriptor Table.
> > > > > +
> > > > > +queue_avail
> > > > > +
> > > > > +        Physical address of Available Ring.
> > > > > +
> > > > > +queue_used
> > > > > +
> > > > > +        Physical address of Used Ring.
> > > > > +
> > > > > +
> > > > > +2.4.1.2.2 ISR status structure layout
> > > > > +-------------------------
> > > > > +ISR status structure includes a single 8-bite ISR status field
> > > >
> > > > 8-bit
> > >
> > > Right :)
> > >
> > > > > +
> > > > > +2.4.1.2.3 Notification structure layout
> > > > > +-------------------------
> > > > > +Notification structure is always a multiple of 2 bytes in size.
> > > > > +It includes 2-byte Queue Notify fields for each virtqueue of
> > > > > +the device. Note that multiple virtqueues can use the same
> > > > > +Queue Notify field, if necessary.
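
As an illustration of the device_feature_select / device_feature window
described in the quoted common configuration structure, here is a rough
sketch of how a driver might assemble the 64 feature bits and apply the
"bit 32 not offered means legacy" rule from the patch. The accessor
callbacks and the mock device are hypothetical stand-ins for real
register accesses; none of these names come from the patch.

```c
#include <stdint.h>

/* Hypothetical accessors for the two registers of the feature window. */
struct feature_window_ops {
    void (*write_select)(void *dev, uint32_t sel);
    uint32_t (*read_feature)(void *dev);
};

/* Read Feature Bits 0..63: select 0x0 for bits 0-31, 0x1 for bits 32-63. */
static uint64_t read_device_features(const struct feature_window_ops *ops,
                                     void *dev)
{
    uint64_t features;

    ops->write_select(dev, 0x0);
    features = ops->read_feature(dev);
    ops->write_select(dev, 0x1);
    features |= (uint64_t)ops->read_feature(dev) << 32;
    return features;
}

/* Per the patch: a device that does not offer bit 32 is a legacy device. */
static int device_is_legacy(uint64_t features)
{
    return (features & (1ULL << 32)) == 0;
}

/* Mock device, for illustration only. */
struct mock_dev {
    uint32_t select;
    uint64_t offered;
};

static void mock_write_select(void *dev, uint32_t sel)
{
    ((struct mock_dev *)dev)->select = sel;
}

static uint32_t mock_read_feature(void *dev)
{
    struct mock_dev *d = dev;

    if (d->select == 0x0)
        return (uint32_t)d->offered;
    if (d->select == 0x1)
        return (uint32_t)(d->offered >> 32);
    return 0; /* all other select values read back as 0 */
}

static const struct feature_window_ops mock_ops = {
    mock_write_select,
    mock_read_feature,
};
```

Acknowledging features through guest_feature_select / guest_feature
would follow the same two-step pattern with writes instead of reads.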
> > > >
> > > > Hmm, maybe move this down, so you can have a section which starts with
> > > > "If cfg_type is VIRTIO_PCI_CAP_NOTIFY_CFG" below? That would put it all
> > > > together.
> > >
> > > so Move PCI Device Layout to within
> > > PCI-specific Initialization And Device Operation?
> > >
> > > > > +
> > > > > +2.4.1.2.4 Device specific structure
> > > > > +-------------------------
> > > > > +
> > > > > +Device specific structure is optional.
> > > > > +
> > > > > +2.4.1.2.5 Legacy Interfaces: A Note on PCI Device Layout
> > > > > +-------------------------
> > > > > +
> > > > > +Transitional devices should present part of configuration
> > > > > +registers in a legacy configuration structure in BAR0 in the first I/O
> > > > > +region of the PCI device, as documented below.
> > > > >
> > > > >  There may be different widths of accesses to the I/O region; the
> > > > >  “natural” access method for each field in the virtio header must be
> > > > > -used (i.e. 32-bit accesses for 32-bit fields, etc), but the
> > > > > +used (i.e. 32-bit accesses for 32-bit fields, etc), but
> > > > > +When accessed through the legacy interface the
> > > > >  device-specific region can be accessed using any width accesses, and
> > > > >  should obtain the same results.
> > > > >
> > > > >  Note that this is possible because while the virtio header is PCI
> > > > > -(i.e. little) endian, the device-specific region is encoded in
> > > > > -the native endian of the guest (where such distinction is
> > > > > +(i.e. little) endian, when using the legacy interface the device-specific
> > > > > +region is encoded in the native endian of the guest (where such distinction is
> > > > >  applicable).
> > > > >
> > > > > -2.4.1.2.1 PCI Device Virtio Header
> > > > > -----------------------------------
> > > > >
> > > > > -The virtio header looks as follows:
> > > > > +When used through the legacy interface, the virtio header looks as follows:
> > > > >
> > > > >  +------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+
> > > > >  | Bits       || 32                  | 32                  | 32       | 16     | 16      | 16      | 8       | 8      |
> > > > > @@ -661,7 +905,6 @@ The virtio header looks as follows:
> > > > >  |            || Features bits 0:31  | Features bits 0:31  | Address  | Size   | Select  | Notify  | Status  | Status |
> > > > >  +------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+
> > > > >
> > > > > -
> > > > >  If MSI-X is enabled for the device, two additional fields
> > > > >  immediately follow this header:[5]
> > > > >
> > > > > @@ -689,25 +932,154 @@ device-specific headers:
> > > > >  |            ||                    |
> > > > >  +------------++--------------------+
> > > > >
> > > > > +Note that only Feature Bits 0 to 31 are accessible through the
> > > > > +Legacy Interface. When used through the Legacy Interface,
> > > > > +Transitional Devices must assume that Feature Bits 32 to 63
> > > > > +are not acknowledged by Driver.
> > > > > +
> > > > > +
> > > > >  2.4.1.3 PCI-specific Initialization And Device Operation
> > > > >  --------------------------------------------------------
> > > > >
> > > > > -The page size for a virtqueue on a PCI virtio device is defined as
> > > > > -4096 bytes.
> > > > > -
> > > > >  2.4.1.3.1 Device Initialization
> > > > >  -------------------------------
> > > > >
> > > > > -2.4.1.3.1.1 Queue Vector Configuration
> > > > > +This documents PCI-specific steps executed during Device Initialization.
> > > > > +As the first step, driver must detect device configuration layout
> > > > > +to locate configuration fields in memory, I/O or configuration space of the
> > > > > +device.
> > > > > +
> > > > > +2.4.1.3.1.1 Virtio Device Configuration Layout Detection
> > > > > +-------------------------------
> > > > > +
> > > > > +As a prerequisite to device initialization, driver executes a
> > > > > +PCI capability list scan, detecting virtio configuration layout using Virtio
> > > > > +Structure PCI capabilities.
> > > > > +
> > > > > +Virtio Device Configuration Layout includes virtio configuration header, Notification
> > > > > +and ISR Status and device configuration structures.
> > > > > +Each structure can be mapped by a Base Address register (BAR) belonging to
> > > > > +the function, located beginning at 10h in Configuration Space,
> > > > > +or accessed through PCI configuration space.
> > > > > +
> > > > > +Actual location of each structure is specified using vendor-specific PCI capability located
> > > > > +on capability list in PCI configuration space of the device.
> > > > > +This virtio structure capability uses little-endian format; all bits are
> > > > > +read-only:
> > > > > +
> > > > > +struct virtio_pci_cap {
> > > > > +        __u8 cap_vndr;    /* Generic PCI field: PCI_CAP_ID_VNDR */
> > > > > +        __u8 cap_next;    /* Generic PCI field: next ptr. */
> > > > > +        __u8 cap_len;     /* Generic PCI field: capability length */
> > > > > +        __u8 cfg_type;    /* Identifies the structure. */
> > > > > +        __u8 bar;         /* Where to find it. */
> > > > > +        __u8 padding[3];  /* Pad to full dword. */
> > > > > +        __le32 offset;    /* Offset within bar. */
> > > > > +        __le32 length;    /* Length of the structure, in bytes. */
> > > > > +};
> > > > > +
> > > > > +This structure can optionally be followed by extra data, depending on
> > > > > +other fields, as documented below.
> > > > > +
> > > > > +The fields are interpreted as follows:
> > > > > +
> > > > > +cap_vndr
> > > > > +        0x09; Identifies a vendor-specific capability.
> > > > > +
> > > > > +cap_next
> > > > > +        Link to next capability in the capability list in the configuration space.
> > > > > +
> > > > > +cap_len
> > > > > +        Length of the capability structure, including the whole of
> > > > > +        struct virtio_pci_cap, and extra data if any.
> > > > > +        This length might include padding, or fields unused by the driver.
> > > > > +
> > > > > +cfg_type
> > > > > +        identifies the structure, according to the following table.
> > > > > +
> > > > > +        /* Common configuration */
> > > > > +        #define VIRTIO_PCI_CAP_COMMON_CFG 1
> > > > > +        /* Notifications */
> > > > > +        #define VIRTIO_PCI_CAP_NOTIFY_CFG 2
> > > > > +        /* ISR Status */
> > > > > +        #define VIRTIO_PCI_CAP_ISR_CFG 3
> > > > > +        /* Device specific configuration */
> > > > > +        #define VIRTIO_PCI_CAP_DEVICE_CFG 4
> > > > > +
> > > > > +        More than one capability can identify the same structure - this makes it
> > > > > +        possible for the device to expose multiple interfaces to drivers. The order of
> > > > > +        the capabilities in the capability list specifies the order of preference
> > > > > +        suggested by the device; drivers should use the first interface that they can
> > > > > +        support. For example, on some hypervisors, notifications using IO accesses are
> > > > > +        faster than memory accesses. In this case, hypervisor can expose two
> > > > > +        capabilities with cfg_type set to VIRTIO_PCI_CAP_NOTIFY_CFG:
> > > > > +        the first one addressing an I/O BAR, the second one addressing a memory BAR.
> > > > > +        Driver will use the I/O BAR if I/O resources are available, and fall back on
> > > > > +        memory BAR when I/O resources are unavailable.
> > > > > +
> > > > > +bar
> > > > > +
> > > > > +        values 0x0 to 0x5 specify a Base Address register (BAR) belonging to
> > > > > +        the function located beginning at 10h in Configuration Space
> > > > > +        and used to map the structure into Memory or I/O Space.
> > > > > +        The BAR is permitted to be either 32-bit or 64-bit, it can map Memory Space
> > > > > +        or I/O Space.
> > > > > +
> > > > > +        The value 0xF specifies that the structure is in PCI configuration space
> > > > > +        inline with this capability structure, following (not necessarily immediately)
> > > > > +        the length field.
> > > >
> > > > Why not immediately?
> > > > Or how would the driver know where it is?
> > >
> > > It's at the offset.
> > >
> > > E.g. for notification we stick multiplier after length.
> > > Further, we might extend virtio_pci_cap in the future,
> > > and we don't want to move stuff around like we
> > > had to with MSI-X.
> > >
> > > > > +
> > > > > +offset
> > > > > +        indicates where the structure begins relative to the base address associated
> > > > > +        with the BAR. If bar specifies configuration space, offset is relative
> > > > > +        to start of virtio_pci_cap structure.
> > > > > +
> > > > > +length
> > > > > +        indicates the length of the structure.
> > > > > +        This size might include padding, or fields unused by the driver.
> > > > > +        Drivers are also recommended to only map part of configuration structure
> > > > > +        large enough for device operation.
> > > > > +        For example, a future device might present a large structure size of several
> > > > > +        MBytes.
> > > > > +        As current devices never utilize structures larger than 4KBytes in size,
> > > > > +        driver can limit the mapped structure size to e.g.
> > > > > +        4KBytes to allow forward compatibility with such devices without loss of
> > > > > +        functionality and without wasting resources.
> > > > > +
> > > > > +
> > > > > +If cfg_type is VIRTIO_PCI_CAP_NOTIFY_CFG this structure is immediately followed
> > > > > +by additional fields:
> > > > > +
> > > > > +struct virtio_pci_notify_cap {
> > > > > +        struct virtio_pci_cap cap;
> > > > > +        __le32 notify_off_multiplier;   /* Multiplier for queue_notify_off. */
> > > > > +};
> > > > > +
> > > > > +notify_off_multiplier
> > > > > +
> > > > > +        Virtqueue offset multiplier, in bytes. Must be even and either a power of two, or 0.
> > > > > +        Value 0x1 is reserved.
> > > > > +        For a given virtqueue, the address to use for notifications is calculated as follows:
> > > > > +
> > > > > +        queue_notify_off * notify_off_multiplier + offset
> > > > > +
> > > > > +        If notify_off_multiplier is 0, all virtqueues use the same address in
> > > > > +        the Notifications structure!
> > > > > +
> > > > > +
> > > > > +2.4.1.3.1.1 Legacy Interface: A Note on Device Layout Detection
> > > > > +-------------------------------
> > > > > +
> > > > > +Legacy drivers skipped Device Layout Detection step, assuming legacy
> > > > > +configuration space in BAR0 in I/O space unconditionally.
> > > > > +
> > > > > +2.4.1.3.1.3 Queue Vector Configuration
> > > > > --------------------------------------
> > > > >
> > > > > When MSI-X capability is present and enabled in the device
> > > > > -(through standard PCI configuration space) 4 bytes at byte offset
> > > > > -20 are used to map configuration change and queue interrupts to
> > > > > -MSI-X vectors. In this case, the ISR Status field is unused, and
> > > > > -device specific configuration starts at byte offset 24 in virtio
> > > > > -header structure. When MSI-X capability is not enabled, device
> > > > > -specific configuration starts at byte offset 20 in virtio header.
> > > > > +(through standard PCI configuration space) Configuration/Queue
> > > > > +MSI-X Vector registers are used to map configuration change and queue
> > > > > +interrupts to MSI-X vectors. In this case, the ISR Status is unused.
> > > > >
> > > > > Writing a valid MSI-X Table entry number, 0 to 0x7FF, to one of
> > > > > Configuration/Queue Vector registers, maps interrupts triggered
> > > > > @@ -732,7 +1104,7 @@ success, the previously written value is returned, and on
> > > > > failure, NO_VECTOR is returned. If a mapping failure is detected,
> > > > > the driver can retry mapping with fewer vectors, or disable MSI-X.
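
To make the layout detection quoted above easier to follow, here is a minimal C sketch of the capability scan and the notification offset calculation. It is not taken from the patch. Assumptions: configuration space is available as a 256-byte snapshot, the capability list pointer lives at 0x34 with its low two bits masked (standard PCI), and the names virtio_cap_info, read_le32, virtio_find_cap and virtio_notify_offset are hypothetical. It simply takes the first capability of the requested type; a real driver would also skip entries whose BAR it cannot use, as the preference rule describes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PCI_CAPABILITY_LIST       0x34 /* standard PCI: pointer to first capability */
#define PCI_CAP_ID_VNDR           0x09
#define VIRTIO_PCI_CAP_NOTIFY_CFG 2

/* Wire layout from the patch, spelled with fixed-width types. */
struct virtio_pci_cap {
    uint8_t  cap_vndr;
    uint8_t  cap_next;
    uint8_t  cap_len;
    uint8_t  cfg_type;
    uint8_t  bar;
    uint8_t  padding[3];
    uint32_t offset;  /* little-endian on the wire */
    uint32_t length;  /* little-endian on the wire */
};
static_assert(sizeof(struct virtio_pci_cap) == 16, "capability is 4 dwords");

/* Host-endian view of one capability (hypothetical helper type). */
struct virtio_cap_info {
    uint8_t  bar;
    uint32_t offset;
    uint32_t length;
    uint32_t notify_off_multiplier; /* meaningful only for NOTIFY_CFG */
};

/* Decode a little-endian 32-bit value regardless of host byte order. */
static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/*
 * Walk the capability list and return the first vendor-specific capability
 * with the wanted cfg_type. Returns 0 on success, -1 if none is found.
 */
static int virtio_find_cap(const uint8_t cfg[256], uint8_t cfg_type,
                           struct virtio_cap_info *out)
{
    size_t pos = cfg[PCI_CAPABILITY_LIST] & 0xFC; /* low 2 bits are reserved */
    int budget = 48;                              /* guard against looped lists */

    while (pos >= 0x40 && budget-- > 0) {
        const uint8_t *c = cfg + pos;
        size_t need = sizeof(struct virtio_pci_cap);

        if (cfg_type == VIRTIO_PCI_CAP_NOTIFY_CFG)
            need += 4; /* notify_off_multiplier follows the base structure */

        if (c[offsetof(struct virtio_pci_cap, cap_vndr)] == PCI_CAP_ID_VNDR &&
            c[offsetof(struct virtio_pci_cap, cfg_type)] == cfg_type &&
            c[offsetof(struct virtio_pci_cap, cap_len)] >= need &&
            pos + need <= 256) {
            out->bar = c[offsetof(struct virtio_pci_cap, bar)];
            out->offset = read_le32(c + offsetof(struct virtio_pci_cap, offset));
            out->length = read_le32(c + offsetof(struct virtio_pci_cap, length));
            out->notify_off_multiplier =
                need > sizeof(struct virtio_pci_cap)
                    ? read_le32(c + sizeof(struct virtio_pci_cap)) : 0;
            return 0;
        }
        pos = c[offsetof(struct virtio_pci_cap, cap_next)] & 0xFC;
    }
    return -1;
}

/* queue_notify_off * notify_off_multiplier + offset, relative to the BAR. */
static uint64_t virtio_notify_offset(const struct virtio_cap_info *n,
                                     uint16_t queue_notify_off)
{
    return (uint64_t)queue_notify_off * n->notify_off_multiplier + n->offset;
}
```

With a multiplier of 0 the last function returns the same offset for every queue, which matches the "all virtqueues use the same address" case in the quoted text.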
> > > > >
> > > > > -2.4.1.3.1.2 Virtqueue Configuration
> > > > > +2.4.1.3.1.4 Virtqueue Configuration
> > > > > -----------------------------------
> > > > >
> > > > > As a device can have zero or more virtqueues for bulk data
> > > > > @@ -749,9 +1121,11 @@ This is done as follows, for each virtqueue a device has:
> > > > >    always a power of 2. This controls how big the virtqueue is
> > > > >    (see 2.1.4 Virtqueues). If this field is 0, the virtqueue does not exist.
> > > > >
> > > > > -3. Allocate and zero virtqueue in contiguous physical memory, on
> > > > > -   a 4096 byte alignment. Write the physical address, divided by
> > > > > -   4096 to the Queue Address field.[6]
> > > > > +3. Optionally, select a smaller virtqueue size and write it in the Queue Size
> > > > > +   field.
> > > > > +
> > > > > +3. Allocate and zero Descriptor Table, Available and Used rings for the
> > > > > +   virtqueue in contiguous physical memory.
> > > > >
> > > > > 4. Optionally, if MSI-X capability is present and enabled on the
> > > > >    device, select a vector to use to request interrupts triggered
> > > > > @@ -760,14 +1134,21 @@ This is done as follows, for each virtqueue a device has:
> > > > >    Queue Vector field: on success, previously written value is
> > > > >    returned; on failure, NO_VECTOR value is returned.
> > > > >
> > > > > +
> > > > > +2.4.1.3.1.4.1 Legacy Interface: A Note on Virtqueue Configuration
> > > > > +-----------------------------------
> > > > > +When using the legacy interface, the page size for a virtqueue on a PCI virtio
> > > > > +device is defined as 4096 bytes. Driver writes the physical address, divided
> > > > > +by 4096 to the Queue Address field [6].
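
For the legacy note just quoted, the value written to Queue Address can be sketched in a few lines of C. This is only an illustration under stated assumptions: the field is treated as 32 bits wide, as in the legacy header, the 4096-byte alignment comes from the old step 3 that the patch removes, and the names VIRTIO_LEGACY_QUEUE_PAGE and virtio_legacy_queue_pfn are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VIRTIO_LEGACY_QUEUE_PAGE 4096u /* hypothetical name for the legacy page size */

/*
 * Compute the value for the legacy Queue Address field.
 * Returns 0 and stores the result in *pfn, or -1 if phys cannot be encoded.
 */
static int virtio_legacy_queue_pfn(uint64_t phys, uint32_t *pfn)
{
    assert(pfn != NULL);

    if (phys % VIRTIO_LEGACY_QUEUE_PAGE != 0)
        return -1; /* ring must start on a 4096-byte boundary */
    if (phys / VIRTIO_LEGACY_QUEUE_PAGE > UINT32_MAX)
        return -1; /* does not fit the (assumed) 32-bit field */

    *pfn = (uint32_t)(phys / VIRTIO_LEGACY_QUEUE_PAGE);
    return 0;
}
```

The range check shows why the legacy encoding limits where a ring can live: only addresses below 4096 * 2^32 bytes can be expressed this way.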
> > > > > +
> > > > > 2.4.1.3.2 Notifying The Device
> > > > > ------------------------------
> > > > >
> > > > > Device notification occurs by writing the 16-bit virtqueue index
> > > > > -of this virtqueue to the Queue Notify field of the virtio header
> > > > > -in the first I/O region of the PCI device.
> > > > > +of this virtqueue to the Queue Notify field.
> > > > >
> > > > > 2.4.1.3.3 Receiving Used Buffers From The Device
> > > > > +------------------------------
> > > > >
> > > > > If an interrupt is necessary:
> > > > >
> > > > > @@ -2798,7 +3179,10 @@ the non-PCI implementations (currently lguest and S/390).
> > > > > This is only allowed if the driver does not use any features
> > > > > which would alter this early use of the device.
> > > > >
> > > > > -[5] ie. once you enable MSI-X on the device, the other fields move.
> > > > > +[5] When MSI-X capability is enabled, device specific configuration starts at
> > > > > +byte offset 24 in virtio header structure. When MSI-X capability is not
> > > > > +enabled, device specific configuration starts at byte offset 20 in virtio
> > > > > +header. ie. once you enable MSI-X on the device, the other fields move.
> > > > > If you turn it off again, they move back!
> > > >
> > > > Thanks,
> > > > Rusty.
> >
> > Cornelia
> >