Subject: Re: [virtio] Re: [virtio-comment] [PATCH RFC] virtio-pci: new configuration layout
From: Cornelia Huck <cornelia.huck@de.ibm.com>
To: "Michael S. Tsirkin" <mst@redhat.com>
Date: Tue, 27 Aug 2013 19:01:23 +0200

On Tue, 27 Aug 2013 18:36:29 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Tue, Aug 27, 2013 at 05:09:53PM +0200, Cornelia Huck wrote:
> > Some remarks from my side...
> > 
> > On Tue, 27 Aug 2013 10:38:59 +0300
> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > 
> > > On Tue, Aug 27, 2013 at 04:24:47PM +0930, Rusty Russell wrote:
> > > > "Michael S. Tsirkin" <mst@redhat.com> writes:
> > > > > This is the new configuration layout.
> > > > >
> > > > > Notes:
> > > > > - Everything is LE
> > > > > - There's a feature bit that means spec 1.0 compliant.
> > > > > - Both devices and drivers can either require the 1.0 interface
> > > > >   or try to include compatibility support. The spec isn't forcing
> > > > >   this decision.
> > > > 
> > > > Hmm, this kind of includes other changes already proposed, like the LE
> > > > change and the framing change.  I think this conceptually splits nicely:
> > > > 
> > > > 1) Feature bit 32 proposal.
> > > > 2) Endian change.
> > > > 3) Framing change.
> > > > 4) PCI layout change.
> > > 
> > > Right - they are mostly in different parts of the document.
> > > I put it all together so it's easy to see how we intend to
> > > handle the transition.
> > > So is everyone OK with keeping this in a single patch?
> > 
> > The new feature bit is supposed to cover all of this, right? Then this
> > should be one patch.
> > 
> > > 
> > > > > - I kept documentation of the legacy interface around, and added notes
> > > > >   on transition inline. They are in separate sections each clearly marked
> > > > >   "Legacy Interface" so we'll be able to separate them out
> > > > >   from the final document as necessary - for now I think it's easier
> > > > >   to keep it all together.
> > > > 
> > > > Good thinking: most of us know the current spec so it's definitely
> > > > clearer.  And makes sure we're thinking about the transition.
> > > > 
> > > > > Only virtio PCI has been converted.
> > > > > Let's discuss this on the meeting tonight, once we figure out PCI
> > > > > we can do something similar for MMIO and CCW.
> > > > 
> > > > > @@ -137,6 +139,11 @@ Feature bits are allocated as follows:
> > > > >    24 to 31: Feature bits reserved for extensions to the queue and 
> > > > >    feature negotiation mechanisms
> > > > >  
> > > > > +  32: Feature bit must be set for any device compliant with this
> > > > > +  revision of the specification, and acknowledged by all device drivers.
> > 
> > Would it make sense to have a bit 33 "rings big endian" whose validity
> > depends on bit 32 set? This would make it possible for ccw to keep its
> > current endianness.
> 
> I didn't go over ccw or MMIO yet - only PCI.
> I think ccw registers will just
> be explicitly BE, with no need for a feature bit.
> Does this sound right?

Sure, that would be even better.

> 
> > > > > +
> > > > > +  33 to 63: Feature bits reserved for future extensions
> > > > > +
> > > > >  For example, feature bit 0 for a network device (i.e. Subsystem 
> > > > >  Device ID 1) indicates that the device supports checksumming of 
> > > > >  packets.
> > > > 
> > > > Why stop at 63?  If we go to a more decentralized feature-assignment
> > > > model, we'll run through those very fast.
> > > 
> > > Then we'll just document more, but driver needs to know where to stop
> > > looking for features.
> > > 
> > > > 
> > > > > @@ -145,13 +152,63 @@ In particular, new fields in the device configuration space are
> > > > >  indicated by offering a feature bit, so the guest can check 
> > > > >  before accessing that part of the configuration space.
> > > > >  
> > > > > +2.1.2.1 Legacy Interface: A Note on transitions from earlier drafts
> > > > > +--------------------------------------
> > > > > +
> > > > > +Earlier drafts of this specification (up to 0.9.X) defined a similar, but
> > > > > +different interface between the hypervisor and the guest.
> > > > > +Since these are widely deployed in the field, this specification
> > > > > +accommodates optional features to simplify transition
> > > > > +from these earlier draft interfaces. Specifically:
> > > > > +
> > > > > +Legacy Interface
> > > > > +	is an interface specified by an earlier draft of this specification
> > > > > +        (up to 0.9.X)
> > > > > +Legacy Device
> > > > > +	is a device implemented before this specification was released,
> > > > > +        and implementing a legacy interface on the host side
> > > > > +Legacy Driver
> > > > > +	is a driver implemented before this specification was released,
> > > > > +        and implementing a legacy interface on the guest side
> > > > > +
> > > > > +to simplify transition from these earlier draft interfaces,
> > > > > +it is possible to implement
> > > > > +
> > > > > +Transitional Device
> > > > > +	a device supporting both drivers conforming to this
> > > > > +        specification, and legacy drivers
> > > > > +
> > > > > +Transitional Driver
> > > > > +	a driver supporting both devices conforming to this
> > > > > +	specification, and legacy devices
> > 
> > What happens to legacy devices in the future? Current implementers
> > will obviously expose legacy devices, which means future drivers need
> > to be transitional or they won't work with what is currently out there.
> 
> You are right. It's a bug in what I wrote: non transitional drivers
> should work with transitional devices.
> This way a transitional device can change to non-transitional
> after drivers are updated.
> 
> > Will legacy stay around (for the foreseeable future)?
> 
> That's up to implementers, I think: as long as they
> implement the new standard, we should not prevent them from
> bundling in the old virtio, coffee making capabilities etc.
> 
> 
> > Will legacy
> > devices still be considered standard compliant (as in "compliant to the
> > legacy standard")?
> 
> I don't think they are compliant. We'll split the legacy sections
> from spec out to a separate transition guide before we release
> the spec.

What I'm worried about is probably the transitional nature of this.
We have a framework now, so there will be users, and on some platforms
they will not expect to need an upgrade, especially where traditional
I/O has been backwards compatible for decades...

> 
> > > > > +
> > > > > +Device and driver that require support for revision 1.0 or newer of
> > > > > +the specification to function, are called non-transitional device and driver,
> > > > > +respectively.
> > > > > +
> > > > > +Transitional Drivers can detect Legacy Devices by detecting that
> > > > > +Feature bit 32 is not offered.
> > > > > +Transitional devices can detect Legacy drivers by detecting that
> > > > > +Feature bit 32 has not been acknowledged by driver.
> > 
> > Will we use new feature bits for new, incompatible revisions? Or will
> > we try to stay backwards compatible?
> 
> So an incompatible change needs to increment revision ID
> to prevent drivers from loading.
> MMIO and PCI both have revision IDs.
> CCW will need to add something like a revision ID,
> we discussed this already.

Command rejects?

I think it is a good idea to try to stay as compatible as possible;
an incompatible change should really be a last resort.
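
For reference, the detection rule in the hunk quoted above (bit 32 not offered means a legacy device, bit 32 not acknowledged means a legacy driver) could be sketched roughly as follows. This is only an illustration: FEATURE_BIT_SPEC_1_0 and the helper names are made up here, since the patch only speaks of "feature bit 32", and the feature words are assumed to be already assembled into 64-bit host-endian values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical name; the patch only calls this "feature bit 32". */
#define FEATURE_BIT_SPEC_1_0 32

/* True if the given bit is set in a 64-bit feature word. */
static bool feature_set(uint64_t features, unsigned int bit)
{
	assert(bit < 64);
	return ((features >> bit) & 1) != 0;
}

/* Transitional driver side: a device that does not offer bit 32 is legacy. */
static bool device_is_legacy(uint64_t offered)
{
	return !feature_set(offered, FEATURE_BIT_SPEC_1_0);
}

/* Transitional device side: a driver that did not acknowledge bit 32 is legacy. */
static bool driver_is_legacy(uint64_t acked)
{
	return !feature_set(acked, FEATURE_BIT_SPEC_1_0);
}
```

Note that this check says nothing about later incompatible revisions; that would still need a revision ID or an equivalent, as discussed above.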

> 
> > > > > +
> > > > > +To make them easier to locate, specification sections documenting these
> > > > > +transitional features are all explicitly marked with
> > > > > +'Legacy Interface' in the section title.
> > > > > +
> > > > > +
> > > > >  2.1.3 Configuration Space
> > > > >  -------------------------
> > > > >  
> > > > >  Configuration space is generally used for rarely-changing or
> > > > >  initialization-time parameters.
> > > > >  
> > > > > -Note that this space is generally the guest's native endian, 
> > > > > +Note that configuration space generally uses the little-endian format
> > > > > +for multi-byte fields.
> > > > > +
> > > > > +2.1.4.1 Legacy Interface: A Note on Configuration Space endian-ness
> > > > > +--------------------------------------
> > > > > +
> > > > > +Note that for legacy interfaces, configuration space is generally the guest's native endian, 
> > > > >  rather than PCI's little-endian.
> > > > >  
> > > > >  2.1.4 Virtqueues
> > > > > @@ -164,6 +221,45 @@ transmit and one for receive.  Each queue has a 16-bit queue size
> > > > >  parameter, which sets the number of entries and implies the total size
> > > > >  of the queue.
> > > > >  
> > > > > +Each virtqueue consists of three parts:
> > > > > +
> > > > > +	Descriptor Table
> > > > > +	Available Ring
> > > > > +	Used Ring
> > > > > +
> > > > > +where each part is physically-contiguous in guest memory,
> > > > > +and has different alignment requirements.
> > > > > +
> > > > > +The Queue Size field controls the total number of bytes
> > > > > +required for each part of the virtqueue.
> > > > > +
> > > > > +The memory alignment and size requirements, in bytes, of each part of the
> > > > > +virtqueue are summarized in the following table (qsz is the Queue Size field):
> > > > > +
> > > > > ++-------------------+-----------+--------------+
> > > > > +| Virtqueue Part    | Alignment | Size         |
> > > > > ++-------------------+-----------+--------------+
> > > > > ++-------------------+-----------+--------------+
> > > > > +| Descriptor Table  | 16        | 16 * qsz     |
> > > > > ++-------------------+-----------+--------------+
> > > > > +| Available Ring    | 2         | 6 + 2 * qsz  |
> > > > > ++-------------------+-----------+--------------+
> > > > > +| Used Ring         | 4         | 6 + 4 * qsz  |
> > > > > ++-------------------+-----------+--------------+
> > > > > +
> > > > > +When the driver wants to send a buffer to the device, it fills in 
> > > > > +a slot in the descriptor table (or chains several together), and 
> > > > > +writes the descriptor index into the available ring.  It then 
> > > > > +notifies the device. When the device has finished a buffer, it 
> > > > > +writes the descriptor into the used ring, and sends an interrupt.
> > > > > +
> > > > > +
> > > > > +2.1.4.1 Legacy Interfaces: A Note on Virtqueue Layout
> > > > > +--------------------------------------
> > > > > +
> > > > > +For Legacy Interfaces, several additional
> > > > > +restrictions are placed on the virtqueue layout:
> > > > > +
> > > > >  Each virtqueue occupies two or more physically-contiguous pages 
> > > > >  (usually defined as 4096 bytes, but depending on the transport)
> > > > >  and consists of three parts:
> > > > > @@ -182,9 +278,8 @@ required for the virtqueue according to the following formula:
> > > > >  	          + ALIGN(sizeof(u16)*3 + sizeof(struct vring_used_elem)*qsz);
> > > > >  	}
> > > > >  
> > > > > -This currently wastes some space with padding, but also allows future
> > > > > -extensions such as the VIRTIO_RING_F_EVENT_IDX extension.  The
> > > > > -virtqueue layout structure looks like this:
> > > > > +This wastes some space with padding.
> > > > > +The legacy virtqueue layout structure therefore looks like this:
> > > > >  
> > > > >  	struct vring {
> > > > >  		// The actual descriptors (16 bytes each)
> > > > > @@ -200,25 +295,17 @@ virtqueue layout structure looks like this:
> > > > >  		struct vring_used used;
> > > > >  	};
> > > > >  
> > > > > -When the driver wants to send a buffer to the device, it fills in 
> > > > > -a slot in the descriptor table (or chains several together), and 
> > > > > -writes the descriptor index into the available ring.  It then 
> > > > > -notifies the device. When the device has finished a buffer, it 
> > > > > -writes the descriptor into the used ring, and sends an interrupt.
> > > > > -
> > > > > -2.1.4.1 A Note on Virtqueue Endianness
> > > > > +2.1.4.1 Legacy Interfaces: A Note on Virtqueue Endianness
> > > > >  --------------------------------------
> > > > >  
> > > > >  Note that the endian of fields and in the virtqueue is the native
> > > > > -endian of the guest, not little-endian as PCI normally is. This makes
> > > > > -for simpler guest code, and it is assumed that the host already has to
> > > > > -be deeply aware of the guest endian so such an “endian-aware” device
> > > > > -is not a significant issue.
> > > > > +endian of the guest, not little-endian as PCI normally is.
> > > > > +It is assumed that the host is already aware of the guest endian.
> > > > >  
> > > > >  2.1.4.2 Message Framing
> > > > >  -----------------------
> > > > > -The original intent of the specification was that message framing (the
> > > > > -particular layout of descriptors) be independent of the contents of
> > > > > +Generally, the intent of the specification is for message framing (the
> > > > > +particular layout of descriptors) to be independent of the contents of
> > > > >  the buffers. For example, a network transmit buffer consists of a 12
> > > > >  byte header followed by the network packet. This could be most simply
> > > > >  placed in the descriptor table as a 12 byte output descriptor followed
> > > > > @@ -227,16 +314,21 @@ single 1526 byte output descriptor in the case where the header and
> > > > >  packet are adjacent, or even three or more descriptors (possibly with
> > > > >  loss of efficiency in that case).
> > > > >  
> > > > > -Regrettably, initial driver implementations used simple layouts, and
> > > > > -devices came to rely on it, despite this specification wording[10]. It
> > > > > -is thus recommended that drivers be conservative in their assumptions,
> > > > > -unless the VIRTIO_F_ANY_LAYOUT feature is accepted. In addition, some
> > > > > +In addition, some
> > > > >  implementations may have large-but-reasonable restrictions on total
> > > > >  descriptor size (such as based on IOV_MAX in the host OS). This has
> > > > >  not been a problem in practice: little sympathy will be given to
> > > > >  drivers which create unreasonably-sized descriptors such as by
> > > > >  dividing a network packet into 1500 single-byte descriptors!
> > > > >  
> > > > > +2.1.4.2.1 Legacy Interfaces: A Note on Message Framing
> > > > > +-----------------------
> > > > > +Regrettably, initial driver implementations used simple layouts, and
> > > > > +devices came to rely on it, despite this specification wording[10]. It
> > > > > +is thus recommended that when using legacy interfaces,
> > > > > +drivers should be conservative in their assumptions,
> > > > > +unless the VIRTIO_F_ANY_LAYOUT feature is accepted.
> > 
> > So ANY_LAYOUT and feature bit 32 are mutually exclusive?
> 
> Hmm. I wonder what gives this impression.
> What I tried to say is bit 32 should imply ANY_LAYOUT.

Better to spell it out, then.
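
Spelled out as code, "bit 32 implies ANY_LAYOUT" might look like the sketch below. The bit number 27 for VIRTIO_F_ANY_LAYOUT is an assumption taken from the Linux headers, not from this patch, and the FEATURE_BIT_* names and helpers are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: VIRTIO_F_ANY_LAYOUT is bit 27, as in Linux; not stated in the patch. */
#define FEATURE_BIT_ANY_LAYOUT 27
/* Hypothetical name for "feature bit 32" from the patch. */
#define FEATURE_BIT_SPEC_1_0 32

/* Build a single-bit mask for a feature bit number. */
static uint64_t feature_mask(unsigned int bit)
{
	assert(bit < 64);
	return 1ULL << bit;
}

/*
 * True if the driver may use any descriptor layout: either ANY_LAYOUT was
 * negotiated explicitly, or bit 32 was negotiated and implies it.
 */
static bool framing_is_unrestricted(uint64_t negotiated)
{
	uint64_t mask = feature_mask(FEATURE_BIT_ANY_LAYOUT) |
			feature_mask(FEATURE_BIT_SPEC_1_0);

	return (negotiated & mask) != 0;
}
```

With wording to that effect in the spec, a legacy driver stays conservative unless ANY_LAYOUT was accepted, while a 1.0 driver never needs to look at bit 27.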

> 
> 
> > > > > +
> > > > >  2.1.4.3 The Virtqueue Descriptor Table
> > > > >  --------------------------------------
> > > > >  
> > > > > @@ -386,23 +478,27 @@ how to communicate with the specific device.
> > > > >  2.2.1 Device Initialization
> > > > >  ---------------------------
> > > > >  
> > > > > -1. Reset the device. This is not required on initial start up.
> > > > > +1. Device discovery. This is only required for some transports.
> > > > > +
> > > > > +2. Reset the device. This is not required on initial start up.
> > > > >  
> > > > > -2. The ACKNOWLEDGE status bit is set: we have noticed the device.
> > > > > +3. Device layout detection. This is only required for some transports.
> > > > >  
> > > > > -3. The DRIVER status bit is set: we know how to drive the device.
> > > > > +4. The ACKNOWLEDGE status bit is set: we have noticed the device.
> > > > >  
> > > > > -4. Device-specific setup, including reading the device feature 
> > > > > +5. The DRIVER status bit is set: we know how to drive the device.
> > > > > +
> > > > > +6. Device-specific setup, including reading the device feature 
> > > > >    bits, discovery of virtqueues for the device, optional per-bus
> > > > >    setup, and reading and possibly writing the device's virtio 
> > > > >    configuration space.
> > > > >  
> > > > > -5. The subset of device feature bits understood by the driver is 
> > > > > +7. The subset of device feature bits understood by the driver is 
> > > > >     written to the device.
> > > > >  
> > > > > -6. The DRIVER_OK status bit is set.
> > > > > +8. The DRIVER_OK status bit is set.
> > > > >  
> > > > > -7. The device can now be used (ie. buffers added to the 
> > > > > +9. The device can now be used (ie. buffers added to the 
> > > > >     virtqueues)[4]
> > > > >  
> > > > >  If any of these steps go irrecoverably wrong, the guest should 
> > > > > @@ -622,35 +718,183 @@ Virtio devices are commonly implemented as PCI devices.
> > > > >  
> > > > >  Any PCI device with Vendor ID 0x1AF4, and Device ID 0x1000 through
> > > > >  0x103F inclusive is a virtio device[3]. The device must also have a
> > > > > -Revision ID of 0 to match this specification.
> > > > > +Revision ID of 0 or Revision ID of 1 to match this specification.
> > > > >  
> > > > >  The Subsystem Device ID indicates which virtio device is 
> > > > >  supported by the device. The Subsystem Vendor ID should reflect 
> > > > >  the PCI Vendor ID of the environment (it's currently only used 
> > > > >  for informational purposes by the guest).
> > > > >  
> > > > > +Drivers must not match devices where Revision ID does not match 0 or 1.
> > > > > +
> > > > > +2.4.1.1.1 Legacy Interface: A Note on PCI Device Discovery
> > > > > +----------------------------
> > > > > +Transitional devices must have a Revision ID of 0.
> > > > > +
> > > > > +Non-transitional devices must have a Revision ID of 1.
> > > > > +
> > > > > +Transitional drivers must match a Revision ID of 0 or 1.
> > > > > +
> > > > > +Non-transitional drivers must only match a Revision ID of 1.
> > > > > +
> > > > 
> > > > I think we should stop abusing Revision IDs, and start using them
> > > > to reflect device version changes as intended.
> > > >
> > > > We could reserve revision id 0 for legacy devices, however, which should
> > > > work nicely.
> > > 
> > > Hmm I am not sure I agree - what does it buy us that feature bits don't already supply?
> > > 
> > > More concerns:
> > > 
> > > We are using revision ID now exactly as was intended to disable old
> > > drivers - it served us well for 0.X-1.X and would be as useful if we
> > > ever have 1.X->2.0 transition.
> > > 
> > > Another worry with using revision numbering for features is that
> > > it does not play well with downstreams.
> > > E.g. RHEL might want to cherry-pick a feature without implementing
> > > other features that happened to land in the same revision.
> > > 
> > > Also Revision ID is only 8 bit - it's designed for hardware where
> > > making a new revision is expensive. In software we'll run out of that
> > > eventually.
> > 
> > So Revision ID is a PCI-specific thing, right? Not all transports will
> > necessarily have something equivalent, so they would need to depend on
> > the feature bit.
> 
> They can't do this reliably - for example you might want to move feature
> bits around.

That sounds like setting yourself up for problems. If you want to
deprecate bits, it would be better to define them as "reserved" and use
a new bit for your new feature. The s390 architecture is full of
"reserved" bits like that.

> For 0.9.X drivers and non-transitional devices,
> I'd like to find some hack to make probe fail.
> 
> Any idea?

Not really, sorry.

> 
> But let's plan ahead and add a way to do this
> in the future if we make an incompatible change again.

I'd rather have an architecture that allows us to be backwards
compatible for a long time and introduce a new device id/cu type for
a new kind of device if we want to do things differently and ditch old
baggage.
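
To make the discussion concrete, the Revision ID matching rules from the quoted hunk (transitional devices use 0, non-transitional devices use 1, transitional drivers match 0 or 1, non-transitional drivers match only 1) amount to roughly the following. This is a sketch of the proposal as quoted, not of what I am arguing for; struct pci_ident and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the PCI IDs a driver matches on. */
struct pci_ident {
	uint16_t vendor;
	uint16_t device;
	uint8_t revision;
};

/* Vendor ID 0x1AF4 with Device ID 0x1000 through 0x103F is a virtio device. */
static bool is_virtio_pci(const struct pci_ident *id)
{
	assert(id != NULL);
	return id->vendor == 0x1AF4 &&
	       id->device >= 0x1000 && id->device <= 0x103F;
}

/* Apply the Revision ID rules proposed in the patch. */
static bool driver_matches(const struct pci_ident *id, bool transitional_driver)
{
	if (!is_virtio_pci(id))
		return false;
	if (id->revision == 1)
		return true;	/* matched by both kinds of driver */
	/* Revision 0 (legacy or transitional device): transitional drivers only. */
	return transitional_driver && id->revision == 0;
}
```

Any other Revision ID is rejected by both kinds of driver, which is the "disable old drivers" effect mentioned earlier in the thread.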

> 
> > > 
> > > 
> > > > 
> > > > >  2.4.1.2 PCI Device Layout
> > > > >  -------------------------
> > > > >  
> > > > > -To configure the device, we use the first I/O region of the PCI 
> > > > > -device. This contains a virtio header followed by a 
> > > > > -device-specific region.
> > > > > +To configure the device,
> > > > > +use I/O and/or memory regions and/or PCI configuration space of the PCI device.
> > > > > +These contain the virtio header registers, the notification register, the
> > > > > +ISR status register and device specific registers, as specified by Virtio
> > > > > +Structure PCI Capabilities.
> > > > > +
> > > > > +There may be different widths of accesses to the I/O region; the
> > > > > +“natural” access method for each field must be
> > > > > +used (i.e. 32-bit accesses for 32-bit fields, etc).
> > > > > +
> > > > > +PCI Device Configuration Layout includes the common configuration,
> > > > > +ISR, notification and device specific configuration
> > > > > +structures.
> > > > > +
> > > > > +Unless explicitly specified otherwise, all multi-byte fields are little-endian.
> > > > > +
> > > > > +
> > > > > +2.4.1.2.1 Common configuration structure layout
> > > > > +-------------------------
> > > > > +Common configuration structure layout is documented below:
> > > > > +
> > > > > +struct virtio_pci_common_cfg {
> > > > > +	/* About the whole device. */
> > > > > +	__le32 device_feature_select;	/* read-write */
> > > > > +	__le32 device_feature;		/* read-only */
> > > > > +	__le32 guest_feature_select;	/* read-write */
> > > > > +	__le32 guest_feature;		/* read-write */
> > > > > +	__le16 msix_config;		/* read-write */
> > > > > +	__le16 num_queues;		/* read-only */
> > > > > +	__u8 device_status;		/* read-write */
> > > > > +	__u8 unused1;
> > > > > +
> > > > > +	/* About a specific virtqueue. */
> > > > > +	__le16 queue_select;		/* read-write */
> > > > > +	__le16 queue_size;		/* read-write, power of 2, or 0. */
> > > > > +	__le16 queue_msix_vector;	/* read-write */
> > > > > +	__le16 queue_enable;		/* read-write */
> > > > > +	__le16 queue_notify_off;	/* read-only */
> > > > > +	__le64 queue_desc;		/* read-write */
> > > > > +	__le64 queue_avail;		/* read-write */
> > > > > +	__le64 queue_used;		/* read-write */
> > > > > +};
> > > > > +
> > > > > +device_feature_select
> > > > > +
> > > > > +	Selects which Feature Bits the device_feature field refers to.
> > > > > +	Value 0x0 selects Feature Bits 0 to 31
> > > > > +	Value 0x1 selects Feature Bits 32 to 63
> > > > > +	All other values cause reads from device_feature to return 0.
> > > > > +
> > > > > +device_feature
> > > > > +
> > > > > +	Used by Device to report Feature Bits to Driver.
> > > > > +	Device Feature Bits selected by device_feature_select.
> > > > > +
> > > > > +guest_feature_select
> > > > > +
> > > > > +	Selects which Feature Bits the guest_feature field refers to.
> > > > > +	Value 0x0 selects Feature Bits 0 to 31
> > > > > +	Value 0x1 selects Feature Bits 32 to 63
> > > > > +	All other values cause writes to guest_feature to be ignored,
> > > > > +	and reads to return 0.
> > > > > +
> > > > > +guest_feature
> > > > > +
> > > > > +	Used by Driver to acknowledge Feature Bits to Device.
> > > > > +	Guest Feature Bits selected by guest_feature_select.
> > > > > +
> > > > > +msix_config
> > > > > +
> > > > > +	Configuration Vector for MSI-X.
> > > > > +
> > > > > +num_queues
> > > > > +
> > > > > +	Specifies the maximum number of virtqueues supported by device.
> > > > > +
> > > > > +device_status
> > > > > +
> > > > > +	Device Status field.
> > > > > +
> > > > > +queue_select
> > > > > +
> > > > > +	Queue Select. Selects which virtqueue the other fields refer to.
> > > > > +
> > > > > +queue_size
> > > > > +
> > > > > +	Queue Size.  On reset, specifies the maximum queue size supported by
> > > > > +	the hypervisor. This can be modified by driver to reduce memory requirements.
> > > > > +	Set to 0 if this virtqueue is unused.
> > > > > +
> > > > > +queue_msix_vector
> > > > > +
> > > > > +	Queue Vector for MSI-X.
> > > > > +
> > > > > +queue_enable
> > > > > +
> > > > > +	Used to selectively prevent host from executing requests from this virtqueue.
> > > > > +	1 - enabled; 0 - disabled
> > > > > +
> > > > > +queue_notify_off
> > > > > +
> > > > > +	Used to calculate the offset from start of Notification structure at
> > > > > +	which this virtqueue is located.
> > > > > +	Note: this is *not* an offset in bytes. See notify_off_multiplier below.
> > > > > +	
> > > > > +queue_desc
> > > > > +
> > > > > +	Physical address of Descriptor Table.
> > > > > +
> > > > > +queue_avail
> > > > > +
> > > > > +	Physical address of Available Ring.
> > > > > +
> > > > > +queue_used
> > > > > +
> > > > > +	Physical address of Used Ring.
> > > > > +
> > > > > +
> > > > > +2.4.1.2.2 ISR status structure layout
> > > > > +-------------------------
> > > > > +ISR status structure includes a single 8-bite ISR status field
> > > > 
> > > > 8-bit
> > > 
> > > Right :)
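
Going back to device_feature_select in the common configuration structure: the windowed access it describes (0 selects bits 0 to 31, 1 selects bits 32 to 63, anything else reads as 0) could be modelled as below. This is a sketch against an in-memory stand-in, not real MMIO; struct fake_common_cfg and both functions are hypothetical, and a real driver would use the proper little-endian register accessors.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory model of the device side of the feature window. */
struct fake_common_cfg {
	uint32_t device_feature_select;	/* written by the driver */
	uint64_t offered;		/* all 64 feature bits the device offers */
};

/* What a read of device_feature returns for the current select value. */
static uint32_t read_device_feature(const struct fake_common_cfg *cfg)
{
	assert(cfg != NULL);
	switch (cfg->device_feature_select) {
	case 0:
		return (uint32_t)cfg->offered;		/* bits 0 to 31 */
	case 1:
		return (uint32_t)(cfg->offered >> 32);	/* bits 32 to 63 */
	default:
		return 0;				/* all other values */
	}
}

/* Driver side: assemble the full 64-bit feature word from both windows. */
static uint64_t read_all_device_features(struct fake_common_cfg *cfg)
{
	uint64_t lo, hi;

	cfg->device_feature_select = 0;
	lo = read_device_feature(cfg);
	cfg->device_feature_select = 1;
	hi = read_device_feature(cfg);
	return lo | (hi << 32);
}
```

The "read as 0" rule for other select values is what lets a driver stop looking for features without knowing in advance how many windows a device implements.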
> > > 
> > > > > +
> > > > > +2.4.1.2.3 Notification structure layout
> > > > > +-------------------------
> > > > > +Notification structure is always a multiple of 2 bytes in size.
> > > > > +It includes 2-byte Queue Notify fields for each virtqueue of
> > > > > +the device. Note that multiple virtqueues can use the same
> > > > > +Queue Notify field, if necessary.
> > > > 
> > > > Hmm, maybe move this down, so you can have a section which starts with
> > > > "If cfg_type is VIRTIO_PCI_CAP_NOTIFY_CFG" below?  That would put it all
> > > > together.
> > > 
> > > so Move PCI Device Layout to within
> > > PCI-specific Initialization And Device Operation?
> > > 
> > > > > +
> > > > > +2.4.1.2.4 Device specific structure
> > > > > +-------------------------
> > > > > +
> > > > > +Device specific structure is optional.
> > > > > +
> > > > > +2.4.1.2.5 Legacy Interfaces: A Note on PCI Device Layout
> > > > > +-------------------------
> > > > > +
> > > > > +Transitional devices should present part of configuration
> > > > > +registers in a legacy configuration structure in BAR0 in the first I/O
> > > > > +region of the PCI device, as documented below.
> > > > >  
> > > > >  There may be different widths of accesses to the I/O region; the
> > > > >  “natural” access method for each field in the virtio header must be
> > > > > -used (i.e. 32-bit accesses for 32-bit fields, etc), but the
> > > > > +used (i.e. 32-bit accesses for 32-bit fields, etc), but
> > > > > +when accessed through the legacy interface the
> > > > >  device-specific region can be accessed using any width accesses, and
> > > > >  should obtain the same results.
> > > > >  
> > > > >  Note that this is possible because while the virtio header is PCI 
> > > > > -(i.e. little) endian, the device-specific region is encoded in 
> > > > > -the native endian of the guest (where such distinction is 
> > > > > +(i.e. little) endian, when using the legacy interface the device-specific
> > > > > +region is encoded in the native endian of the guest (where such distinction is
> > > > >  applicable).
> > > > >  
> > > > > -2.4.1.2.1 PCI Device Virtio Header
> > > > > -----------------------------------
> > > > >  
> > > > > -The virtio header looks as follows:
> > > > > +When used through the legacy interface, the virtio header looks as follows:
> > > > >  
> > > > >  +------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+
> > > > >  | Bits       || 32                  | 32                  | 32       | 16     | 16      | 16      | 8       | 8      |
> > > > > @@ -661,7 +905,6 @@ The virtio header looks as follows:
> > > > >  |            || Features bits 0:31  | Features bits 0:31  | Address  | Size   | Select  | Notify  | Status  | Status |
> > > > >  +------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+
> > > > >  
> > > > > -
> > > > >  If MSI-X is enabled for the device, two additional fields 
> > > > >  immediately follow this header:[5]
> > > > >  
> > > > > @@ -689,25 +932,154 @@ device-specific headers:
> > > > >  |            ||                    |
> > > > >  +------------++--------------------+
> > > > >  
> > > > > +Note that only Feature Bits 0 to 31 are accessible through the
> > > > > +Legacy Interface. When used through the Legacy Interface,
> > > > > +Transitional Devices must assume that Feature Bits 32 to 63
> > > > > +are not acknowledged by Driver.
> > > > > +
> > > > > +
> > > > >  2.4.1.3 PCI-specific Initialization And Device Operation
> > > > >  --------------------------------------------------------
> > > > >  
> > > > > -The page size for a virtqueue on a PCI virtio device is defined as
> > > > > -4096 bytes.
> > > > > -
> > > > >  2.4.1.3.1 Device Initialization
> > > > >  -------------------------------
> > > > >  
> > > > > -2.4.1.3.1.1 Queue Vector Configuration
> > > > > +This documents PCI-specific steps executed during Device Initialization.
> > > > > +As the first step, driver must detect device configuration layout
> > > > > +to locate configuration fields in memory, I/O or configuration space of the
> > > > > +device.
> > > > > +
> > > > > +2.4.1.3.1.1 Virtio Device Configuration Layout Detection
> > > > > +-------------------------------
> > > > > +
> > > > > +As a prerequisite to device initialization, driver executes a
> > > > > +PCI capability list scan, detecting virtio configuration layout using Virtio
> > > > > +Structure PCI capabilities.
> > > > > +
> > > > > +Virtio Device Configuration Layout includes virtio configuration header, Notification
> > > > > +and ISR Status and device configuration structures.
> > > > > +Each structure can be mapped by a Base Address register (BAR) belonging to
> > > > > +the function, located beginning at 10h in Configuration Space,
> > > > > +or accessed through PCI configuration space.
> > > > > +
> > > > > +Actual location of each structure is specified using vendor-specific PCI capability located
> > > > > +on capability list in PCI configuration space of the device.
> > > > > +This virtio structure capability uses little-endian format; all bits are
> > > > > +read-only:
> > > > > +
> > > > > +struct virtio_pci_cap {
> > > > > +	__u8 cap_vndr;	/* Generic PCI field: PCI_CAP_ID_VNDR */
> > > > > +	__u8 cap_next;	/* Generic PCI field: next ptr. */
> > > > > +	__u8 cap_len;	/* Generic PCI field: capability length */
> > > > > +	__u8 cfg_type;	/* Identifies the structure. */
> > > > > +	__u8 bar;	/* Where to find it. */
> > > > > +	__u8 padding[3];/* Pad to full dword. */
> > > > > +	__le32 offset;	/* Offset within bar. */
> > > > > +	__le32 length;	/* Length of the structure, in bytes. */
> > > > > +};
> > > > > +
> > > > > +This structure can optionally be followed by extra data, depending on
> > > > > +other fields, as documented below.
> > > > > +
> > > > > +The fields are interpreted as follows:
> > > > > +
> > > > > +cap_vndr
> > > > > +	0x09; Identifies a vendor-specific capability.
> > > > > +
> > > > > +cap_next
> > > > > +	Link to next capability in the capability list in the configuration space.
> > > > > +
> > > > > +cap_len
> > > > > +	Length of the capability structure, including the whole of
> > > > > +	struct virtio_pci_cap, and extra data if any.
> > > > > +	This length might include padding, or fields unused by the driver.
> > > > > +
> > > > > +cfg_type
> > > > > +	identifies the structure, according to the following table.
> > > > > +
> > > > > +	/* Common configuration */
> > > > > +	#define VIRTIO_PCI_CAP_COMMON_CFG	1
> > > > > +	/* Notifications */
> > > > > +	#define VIRTIO_PCI_CAP_NOTIFY_CFG	2
> > > > > +	/* ISR Status */
> > > > > +	#define VIRTIO_PCI_CAP_ISR_CFG		3
> > > > > +	/* Device specific configuration */
> > > > > +	#define VIRTIO_PCI_CAP_DEVICE_CFG	4
> > > > > +
> > > > > +	More than one capability can identify the same structure - this makes it
> > > > > +	possible for the device to expose multiple interfaces to drivers.  The order of
> > > > > +	the capabilities in the capability list specifies the order of preference
> > > > > +	suggested by the device; drivers should use the first interface that they can
> > > > > +	support.  For example, on some hypervisors, notifications using IO accesses are
> > > > > +	faster than memory accesses. In this case, the hypervisor can expose two
> > > > > +	capabilities with cfg_type set to VIRTIO_PCI_CAP_NOTIFY_CFG:
> > > > > +	the first one addressing an I/O BAR, the second one addressing a memory BAR.
> > > > > +	The driver will use the I/O BAR if I/O resources are available, and fall back on
> > > > > +	the memory BAR when I/O resources are unavailable.
> > > > > +
> > > > > +bar
> > > > > +
> > > > > +	Values 0x0 to 0x5 specify a Base Address Register (BAR) belonging to
> > > > > +	the function, located beginning at 10h in Configuration Space,
> > > > > +	and used to map the structure into Memory or I/O Space.
> > > > > +	The BAR is permitted to be either 32-bit or 64-bit; it can map Memory Space
> > > > > +	or I/O Space.
> > > > > +
> > > > > +	The value 0xF specifies that the structure is in PCI configuration space
> > > > > +	inline with this capability structure, following (not necessarily immediately)
> > > > > +	the length field.
> > > > 
> > > > Why not immediately?
> > > > Or how would the driver know where it is?
> > > 
> > > It's at the offset.
> > > 
> > > E.g. for notification we stick multiplier after length.
> > > Further, we might extend virtio_pci_cap in the future,
> > > and we don't want to move stuff around like we
> > > had to with MSI-X.
> > > 
> > > > > +
> > > > > +offset
> > > > > +	indicates where the structure begins relative to the base address associated
> > > > > +	with the BAR. If bar specifies configuration space, offset is relative
> > > > > +	to start of virtio_pci_cap structure.
> > > > > +
> > > > > +length
> > > > > +	indicates the length of the structure.
> > > > > +	This size might include padding, or fields unused by the driver.
> > > > > +	Drivers are also advised to map only the part of the configuration structure
> > > > > +	that is large enough for device operation.
> > > > > +	For example, a future device might present a large structure size of several
> > > > > +	MBytes.
> > > > > +	As current devices never utilize structures larger than 4KBytes in size,
> > > > > +	a driver can limit the mapped structure size to e.g.
> > > > > +	4KBytes to allow forward compatibility with such devices without loss of
> > > > > +	functionality and without wasting resources.
> > > > > +
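
To make the cfg_type preference order and the length recommendation concrete, here is a rough sketch of how a driver might apply them. This is not from the patch: struct cap_info, pick_cap(), map_length() and the is_io flag are hypothetical names, and the sketch assumes the capability list has already been walked and converted to host byte order.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define VIRTIO_PCI_CAP_NOTIFY_CFG 2

/* Hypothetical host-order summary of one struct virtio_pci_cap entry. */
struct cap_info {
    uint8_t  cfg_type; /* which structure this capability describes */
    uint8_t  bar;      /* BAR index, as in the bar field */
    uint32_t offset;   /* offset within the BAR */
    uint32_t length;   /* structure length in bytes */
    bool     is_io;    /* assumption: caller already knows the BAR type */
};

/*
 * Return the first capability of the requested type that the driver can
 * use. caps[] is assumed to be in capability-list order, which is the
 * device's order of preference.
 */
static const struct cap_info *pick_cap(const struct cap_info *caps, size_t n,
                                       uint8_t cfg_type, bool io_available)
{
    for (size_t i = 0; i < n; i++) {
        if (caps[i].cfg_type != cfg_type)
            continue;
        if (caps[i].is_io && !io_available)
            continue; /* cannot use an I/O BAR, try the next interface */
        return &caps[i];
    }
    return NULL;
}

/* Map no more than max_map bytes, even if the device advertises more. */
static uint32_t map_length(const struct cap_info *cap, uint32_t max_map)
{
    return cap->length < max_map ? cap->length : max_map;
}
```

With an I/O BAR capability listed ahead of a memory BAR capability, pick_cap() returns the I/O one when I/O resources are available and the memory one otherwise; map_length(cap, 4096) gives the 4KByte cap used as the example in the text.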
> > > > > +
> > > > > +If cfg_type is VIRTIO_PCI_CAP_NOTIFY_CFG this structure is immediately followed
> > > > > +by additional fields:
> > > > > +
> > > > > +struct virtio_pci_notify_cap {
> > > > > +	struct virtio_pci_cap cap;
> > > > > +	__le32 notify_off_multiplier;	/* Multiplier for queue_notify_off. */
> > > > > +};
> > > > > +
> > > > > +notify_off_multiplier
> > > > > +
> > > > > +	Virtqueue offset multiplier, in bytes. Must be even and either a power of two, or 0.
> > > > > +	Value 0x1 is reserved.
> > > > > +	For a given virtqueue, the address to use for notifications is calculated as follows:
> > > > > +
> > > > > +	queue_notify_off * notify_off_multiplier + offset
> > > > > +
> > > > > +	If notify_off_multiplier is 0, all virtqueues use the same address in
> > > > > +	the Notifications structure!
> > > > > +
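
As a worked illustration of the notification address formula, a minimal sketch follows. The helper names are made up for this example, and the 16-bit width of queue_notify_off is an assumption since the field is not defined in this hunk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * The multiplier must be 0 or an even power of two; 0x1 is reserved.
 * Any power of two other than 1 is even, so only 1 is excluded explicitly.
 */
static bool notify_multiplier_valid(uint32_t m)
{
    if (m == 0)
        return true;
    return m != 1 && (m & (m - 1)) == 0;
}

/*
 * Offset of a virtqueue's notification address, relative to the base of
 * the BAR named by the capability:
 *   queue_notify_off * notify_off_multiplier + offset
 * Computed in 64 bits so the product cannot wrap.
 */
static uint64_t notify_offset(uint32_t cap_offset,
                              uint16_t queue_notify_off, /* width assumed */
                              uint32_t notify_off_multiplier)
{
    assert(notify_multiplier_valid(notify_off_multiplier));
    return (uint64_t)queue_notify_off * notify_off_multiplier + cap_offset;
}
```

For example, with offset 0x3000, queue_notify_off 2 and a multiplier of 4, the result is 0x3008. With a multiplier of 0 every queue resolves to 0x3000, which is the shared-address case described above.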
> > > > > +
> > > > > +2.4.1.3.1.1 Legacy Interface: A Note on Device Layout Detection
> > > > > +-------------------------------
> > > > > +
> > > > > +Legacy drivers skipped the Device Layout Detection step, assuming legacy
> > > > > +configuration space in BAR0 in I/O space unconditionally.
> > > > > +
> > > > > +2.4.1.3.1.3 Queue Vector Configuration
> > > > >  --------------------------------------
> > > > >  
> > > > >  When MSI-X capability is present and enabled in the device 
> > > > > -(through standard PCI configuration space) 4 bytes at byte offset 
> > > > > -20 are used to map configuration change and queue interrupts to 
> > > > > -MSI-X vectors. In this case, the ISR Status field is unused, and 
> > > > > -device specific configuration starts at byte offset 24 in virtio 
> > > > > -header structure. When MSI-X capability is not enabled, device 
> > > > > -specific configuration starts at byte offset 20 in virtio header.
> > > > > +(through standard PCI configuration space) Configuration/Queue
> > > > > +MSI-X Vector registers are used to map configuration change and queue
> > > > > +interrupts to MSI-X vectors. In this case, the ISR Status is unused.
> > > > >  
> > > > >  Writing a valid MSI-X Table entry number, 0 to 0x7FF, to one of 
> > > > >  Configuration/Queue Vector registers, maps interrupts triggered 
> > > > > @@ -732,7 +1104,7 @@ success, the previously written value is returned, and on
> > > > >  failure, NO_VECTOR is returned. If a mapping failure is detected, 
> > > > > >  the driver can retry mapping with fewer vectors, or disable MSI-X.
> > > > >  
> > > > > -2.4.1.3.1.2 Virtqueue Configuration
> > > > > +2.4.1.3.1.4 Virtqueue Configuration
> > > > >  -----------------------------------
> > > > >  
> > > > >  As a device can have zero or more virtqueues for bulk data 
> > > > > @@ -749,9 +1121,11 @@ This is done as follows, for each virtqueue a device has:
> > > > >    always a power of 2. This controls how big the virtqueue is 
> > > > >    (see 2.1.4 Virtqueues). If this field is 0, the virtqueue does not exist. 
> > > > >  
> > > > > -3. Allocate and zero virtqueue in contiguous physical memory, on 
> > > > > -  a 4096 byte alignment. Write the physical address, divided by 
> > > > > -  4096 to the Queue Address field.[6]
> > > > > +3. Optionally, select a smaller virtqueue size and write it in the Queue Size
> > > > > +   field.
> > > > > +
> > > > > +3. Allocate and zero Descriptor Table, Available and Used rings for the
> > > > > +   virtqueue in contiguous physical memory.
> > > > >  
> > > > >  4. Optionally, if MSI-X capability is present and enabled on the 
> > > > >    device, select a vector to use to request interrupts triggered 
> > > > > @@ -760,14 +1134,21 @@ This is done as follows, for each virtqueue a device has:
> > > > >    Queue Vector field: on success, previously written value is 
> > > > >    returned; on failure, NO_VECTOR value is returned.
> > > > >  
> > > > > +
> > > > > +2.4.1.3.1.4.1 Legacy Interface: A Note on Virtqueue Configuration
> > > > > +-----------------------------------
> > > > > +When using the legacy interface, the page size for a virtqueue on a PCI virtio
> > > > > +device is defined as 4096 bytes.  The driver writes the physical address, divided
> > > > > +by 4096, to the Queue Address field [6].
> > > > > +
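
The legacy Queue Address computation can be sketched as below. The function name is hypothetical; the alignment check follows from the 4096-byte page size stated above, and the range check relies on the legacy Queue Address field being 32 bits wide, which is not restated in this hunk.

```c
#include <stdbool.h>
#include <stdint.h>

#define LEGACY_QUEUE_PAGE_SHIFT 12 /* 4096-byte pages */

/*
 * Convert a virtqueue's physical address into the value written to the
 * legacy Queue Address field. Returns false if the address is not
 * 4096-byte aligned or the page number does not fit in 32 bits.
 */
static bool legacy_queue_pfn(uint64_t phys, uint32_t *pfn)
{
    if (phys & ((UINT64_C(1) << LEGACY_QUEUE_PAGE_SHIFT) - 1))
        return false;
    if ((phys >> LEGACY_QUEUE_PAGE_SHIFT) > UINT32_MAX)
        return false;
    *pfn = (uint32_t)(phys >> LEGACY_QUEUE_PAGE_SHIFT);
    return true;
}
```

A queue at physical address 0x12345000 would be programmed as 0x12345; an address at or above 2^44 cannot be expressed in the legacy field at all.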
> > > > >  2.4.1.3.2 Notifying The Device
> > > > >  ------------------------------
> > > > >  
> > > > >  Device notification occurs by writing the 16-bit virtqueue index 
> > > > > -of this virtqueue to the Queue Notify field of the virtio header 
> > > > > -in the first I/O region of the PCI device.
> > > > > +of this virtqueue to the Queue Notify field.
> > > > >  
> > > > >  2.4.1.3.3 Receiving Used Buffers From The Device
> > > > > +------------------------------
> > > > >  
> > > > >  If an interrupt is necessary:
> > > > >  
> > > > > @@ -2798,7 +3179,10 @@ the non-PCI implementations (currently lguest and S/390).
> > > > >  This is only allowed if the driver does not use any features 
> > > > >  which would alter this early use of the device.
> > > > >  
> > > > > -[5] ie. once you enable MSI-X on the device, the other fields move. 
> > > > > +[5] When MSI-X capability is enabled, device specific configuration starts at
> > > > > +byte offset 24 in virtio header structure. When MSI-X capability is not
> > > > > +enabled, device specific configuration starts at byte offset 20 in virtio
> > > > > +header.  I.e., once you enable MSI-X on the device, the other fields move.
> > > > >  If you turn it off again, they move back!
> > > > 
> > > > Thanks,
> > > > Rusty.
> > 
> > Cornelia
> 
>