virtio message

Subject: Re: [virtio-comment] [PATCH RFC] virtio-pci: new configuration layout
From: Rusty Russell <rusty@au1.ibm.com>
To: "Michael S. Tsirkin" <mst@redhat.com>
Date: Mon, 02 Sep 2013 10:45:11 +0930
"Michael S. Tsirkin" <mst@redhat.com> writes:
> On Tue, Aug 27, 2013 at 04:24:47PM +0930, Rusty Russell wrote:
>> "Michael S. Tsirkin" <mst@redhat.com> writes:
>> > This is the new configuration layout.
>> >
>> > Notes:
>> > - Everything is LE
>> > - There's a feature bit that means spec 1.0 compliant.
>> > - Both devices and drivers can either require the 1.0 interface
>> >   or try to include compatibility support. The spec isn't forcing
>> >   this decision.
>> 
>> Hmm, this kind includes other changes already proposed, like the LE
>> change and the framing change.  I think this conceptually splits nicely:
>> 
>> 1) Feature bit 32 proposal.
>> 2) Endian change.
>> 3) Framing change.
>> 4) PCI layout change.
>
> Right - they are mostly in different parts of the document.
> I put it all together so it's easy to see how we intend to
> handle the transition.
> So is everyone OK with keeping this in a single patch?
>
>> > - I kept documentation of the legacy interface around, and added notes
>> >   on transition inline. They are in separate sections each clearly marked
>> >   "Legacy Interface" so we'll be able to separate them out
>> >   from the final document as necessary - for now I think it's easier
>> >   to keep it all together.
>> 
>> Good thinking: most of us know the current spec so it's definitely
>> clearer.  And makes sure we're thinking about the transition.
>> 
>> > Only virtio PCI has been converted.
>> > Let's discuss this on the meeting tonight, once we figure out PCI
>> > we can do something similar for MMIO and CCW.
>> 
>> > @@ -137,6 +139,11 @@ Feature bits are allocated as follows:
>> >    24 to 31: Feature bits reserved for extensions to the queue and 
>> >    feature negotiation mechanisms
>> >  
>> > +  32: Feature bit must be set for any device compliant with this
>> > +  revision of the specification, and acknowledged by all device drivers.
>> > +
>> > +  33 to 63: Feature bits reserved for future extensions
>> > +
>> >  For example, feature bit 0 for a network device (i.e. Subsystem 
>> >  Device ID 1) indicates that the device supports checksumming of 
>> >  packets.
>> 
>> Why stop at 63?  If we go to a more decentralized feature-assignment
>> model, we'll run through those very fast.
>
> Then we'll just document more, but driver needs to know where to stop
> looking for features.

Shouldn't it stop looking at the highest feature it understands?  Yes,
this makes a problem for diagnostics, but they can use a simple
heuristic (stop at first word which is all zeroes).  Or we need a
register with an explicit limit.

If the spec says there won't be features above 63, we'd need a revision
to change that.

>> I think we should stop abusing Revision IDs, and start using them
>> to reflect device version changes as intended.
>>
>> We could reserve revision id 0 for legacy devices, however, which should
>> work nicely.
>
> Hmm I am not sure I agree - what does it buy us that feature bits don't already supply?

AFAICT it's a PCI question more than a virtio question.  The use of
revision IDs is left open in the PCI spec, but Microsoft's requirements
here seem reasonable:

http://msdn.microsoft.com/en-us/library/windows/hardware/gg463287.aspx

> More concerns:
>
> We are using revision ID now exactly as was intended to disable old
> drivers - it served us well for 0.X-1.X and would be as useful if we
> ever have 1.X->2.0 transition.

I think we'll use feature bit 64 for such a transition, now we don't
have a feature bit limit.

> Another worry with using revision numbering for features is that
> it does not play well with downstreams.
> E.g. RHEL might want to cherry-pick a feature without implementing
> other features that happened to land in the same revision.
>
> Also Revision ID is only 8 bit - it's designed for hardware where
> making a new revision is expensive. In software we'll run out of that
> eventually.

This is why I think we leave it open in the standard.  If people want to
use it as another way for them to implement driver workarounds, they
can (probably a combination of device ID, rev id and features offered).

>> >  2.4.1.2 PCI Device Layout
>> >  -------------------------
>> >  
>> > -To configure the device, we use the first I/O region of the PCI 
>> > -device. This contains a virtio header followed by a 
>> > -device-specific region.
>> > +To configure the device,
>> > +use I/O and/or memory regions and/or PCI configuration space of the PCI device.
>> > +These contain the virtio header registers, the notification register, the
>> > +ISR status register and device specific registers, as specified by Virtio
>> > ++ Structure PCI Capabilities
>> > +
>> > +There may be different widths of accesses to the I/O region; the
>> > +“natural” access method for each field must be
>> > +used (i.e. 32-bit accesses for 32-bit fields, etc).
>> > +
>> > +PCI Device Configuration Layout includes the common configuration,
>> > +ISR, notification and device specific configuration
>> > +structures.
>> > +
>> > +Unless explicitly specified otherwise, all multi-byte fields are little-endian.
>> > +
>> > +
>> > +2.4.1.2.1 Common configuration structure layout
>> > +-------------------------
>> > +Common configuration structure layout is documented below:
>> > +
>> > +struct virtio_pci_common_cfg {
>> > +	/* About the whole device. */
>> > +	__le32 device_feature_select;	/* read-write */
>> > +	__le32 device_feature;		/* read-only */
>> > +	__le32 guest_feature_select;	/* read-write */
>> > +	__le32 guest_feature;		/* read-write */
>> > +	__le16 msix_config;		/* read-write */
>> > +	__le16 num_queues;		/* read-only */
>> > +	__u8 device_status;		/* read-write */
>> > +	__u8 unused1;
>> > +
>> > +	/* About a specific virtqueue. */
>> > +	__le16 queue_select;		/* read-write */
>> > +	__le16 queue_size;		/* read-write, power of 2, or 0. */
>> > +	__le16 queue_msix_vector;	/* read-write */
>> > +	__le16 queue_enable;		/* read-write */
>> > +	__le16 queue_notify_off;	/* read-only */
>> > +	__le64 queue_desc;		/* read-write */
>> > +	__le64 queue_avail;		/* read-write */
>> > +	__le64 queue_used;		/* read-write */
>> > +};
>> > +
>> > +device_feature_select
>> > +
>> > +	Selects which Feature Bits does device_feature field refer to.
>> > +	Value 0x0 selects Feature Bits 0 to 31
>> > +	Value 0x1 selects Feature Bits 32 to 63
>> > +	All other values cause reads from device_feature to return 0.
>> > +
>> > +device_feature
>> > +
>> > +	Used by Device to report Feature Bits to Driver.
>> > +	Device Feature Bits selected by device_feature_select.
>> > +
>> > +guest_feature_select
>> > +
>> > +	Selects which Feature Bits does guest_feature field refer to.
>> > +	Value 0x0 selects Feature Bits 0 to 31
>> > +	Value 0x1 selects Feature Bits 32 to 63
>> > +	All other values cause writes to guest_feature to be ignored,
>> > +	and reads to return 0.
>> > +
>> > +guest_feature
>> > +
>> > +	Used by Driver to acknowledge Feature Bits to Device.
>> > +	Guest Feature Bits selected by guest_feature_select.
>> > +
>> > +msix_config
>> > +
>> > +	Configuration Vector for MSI-X.
>> > +
>> > +num_queues
>> > +
>> > +	Specifies the maximum number of virtqueues supported by device.
>> > +
>> > +device_status
>> > +
>> > +	Device Status field.
>> > +
>> > +queue_select
>> > +
>> > +	Queue Select. Selects which virtqueue do other fields refer to.
>> > +
>> > +queue_size
>> > +
>> > +	Queue Size.  On reset, specifies the maximum queue size supported by
>> > +	the hypervisor. This can be modified by driver to reduce memory requirements.
>> > +	Set to 0 if this virtqueue is unused.
>> > +
>> > +queue_msix_vector
>> > +
>> > +	Queue Vector for MSI-X.
>> > +
>> > +queue_enable
>> > +
>> > +	Used to selectively prevent host from executing requests from this virtqueue.
>> > +	1 - enabled; 0 - disabled
>> > +
>> > +queue_notify_off
>> > +
>> > +	Used to calculate the offset from start of Notification structure at
>> > +	which this virtqueue is located.
>> > +	Note: this is *not* an offset in bytes. See notify_off_multiplier below.
>> > +	
>> > +queue_desc
>> > +
>> > +	Physical address of Descriptor Table.
>> > +
>> > +queue_avail
>> > +
>> > +	Physical address of Available Ring.
>> > +
>> > +queue_used
>> > +
>> > +	Physical address of Used Ring.
>> > +
>> > +
>> > +2.4.1.2.2 ISR status structure layout
>> > +-------------------------
>> > +ISR status structure includes a single 8-bite ISR status field
>> 
>> 8-bit
>
> Right :)
>
>> > +
>> > +2.4.1.2.3 Notification structure layout
>> > +-------------------------
>> > +Notification structure is always a multiple of 2 bytes in size.
>> > +It includes 2-byte Queue Notify fields for each virtqueue of
>> > +the device. Note that multiple virtqueues can use the same
>> > +Queue Notify field, if necessary.
>> 
>> Hmm, maybe move this down, so you can have a section which starts with
>> "If cfg_type is VIRTIO_PCI_CAP_NOTIFY_CFG" below?  That would put it all
>> together.
>
> so Move PCI Device Layout to within
> PCI-specific Initialization And Device Operation?

OK, maybe not.  This is simple, we can always reshuffle after we've made
all the changes we want.

>> > +
>> > +2.4.1.2.4 Device specific structure
>> > +-------------------------
>> > +
>> > +Device specific structure is optional.
>> > +
>> > +2.4.1.2.5 Legacy Interfaces: A Note on PCI Device Layout
>> > +-------------------------
>> > +
>> > +Transitional devices should present part of configuration
>> > +registers in a legacy configuration structure in BAR0 in the first I/O
>> > +region of the PCI device, as documented below.
>> >  
>> >  There may be different widths of accesses to the I/O region; the
>> >  “natural” access method for each field in the virtio header must be
>> > -used (i.e. 32-bit accesses for 32-bit fields, etc), but the
>> > +used (i.e. 32-bit accesses for 32-bit fields, etc), but
>> > +When accessed through the legacy interface the
>> >  device-specific region can be accessed using any width accesses, and
>> >  should obtain the same results.
>> >  
>> >  Note that this is possible because while the virtio header is PCI 
>> > -(i.e. little) endian, the device-specific region is encoded in 
>> > -the native endian of the guest (where such distinction is 
>> > +(i.e. little) endian, when using the legacy interface the device-specific
>> > +region is encoded in the native endian of the guest (where such distinction is
>> >  applicable).
>> >  
>> > -2.4.1.2.1 PCI Device Virtio Header
>> > -----------------------------------
>> >  
>> > -The virtio header looks as follows:
>> > +When used through the legacy interface, the virtio header looks as follows:
>> >  
>> >  +------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+
>> >  | Bits       || 32                  | 32                  | 32       | 16     | 16      | 16      | 8       | 8      |
>> > @@ -661,7 +905,6 @@ The virtio header looks as follows:
>> >  |            || Features bits 0:31  | Features bits 0:31  | Address  | Size   | Select  | Notify  | Status  | Status |
>> >  +------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+
>> >  
>> > -
>> >  If MSI-X is enabled for the device, two additional fields 
>> >  immediately follow this header:[5]
>> >  
>> > @@ -689,25 +932,154 @@ device-specific headers:
>> >  |            ||                    |
>> >  +------------++--------------------+
>> >  
>> > +Note that only Feature Bits 0 to 31 are accessible through the
>> > +Legacy Interface. When used through the Legacy Interface,
>> > +Transitional Devices must assume that Feature Bits 32 to 63
>> > +are not acknowledged by Driver.
>> > +
>> > +
>> >  2.4.1.3 PCI-specific Initialization And Device Operation
>> >  --------------------------------------------------------
>> >  
>> > -The page size for a virtqueue on a PCI virtio device is defined as
>> > -4096 bytes.
>> > -
>> >  2.4.1.3.1 Device Initialization
>> >  -------------------------------
>> >  
>> > -2.4.1.3.1.1 Queue Vector Configuration
>> > +This documents PCI-specific steps executed during Device Initialization.
>> > +As the first step, driver must detect device configuration layout
>> > +to locate configuration fields in memory,I/O or configuration space of the
>> > +device.
>> > +
>> > +2.4.1.3.1.1 Virtio Device Configuration Layout Detection
>> > +-------------------------------
>> > +
>> > +As a prerequisite to device initialization, driver executes a
>> > +PCI capability list scan, detecting virtio configuration layout using Virtio
>> > +Structure PCI capabilities.
>> > +
>> > +Virtio Device Configuration Layout includes virtio configuration header, Notification
>> > +and ISR Status and device configuration structures.
>> > +Each structure can be mapped by a Base Address register (BAR) belonging to
>> > +the function, located beginning at 10h in Configuration Space,
>> > +or accessed though PCI configuration space.
>> > +
>> > +Actual location of each structure is specified using vendor-specific PCI capability located
>> > +on capability list in PCI configuration space of the device.
>> > +This virtio structure capability uses little-endian format; all bits are
>> > +read-only:
>> > +
>> > +struct virtio_pci_cap {
>> > +	__u8 cap_vndr;	/* Generic PCI field: PCI_CAP_ID_VNDR */
>> > +	__u8 cap_next;	/* Generic PCI field: next ptr. */
>> > +	__u8 cap_len;	/* Generic PCI field: capability length */
>> > +	__u8 cfg_type;	/* Identifies the structure. */
>> > +	__u8 bar;	/* Where to find it. */
>> > +	__u8 padding[3];/* Pad to full dword. */
>> > +	__le32 offset;	/* Offset within bar. */
>> > +	__le32 length;	/* Length of the structure, in bytes. */
>> > +};
>> > +
>> > +This structure can optionally followed by extra data, depending on
>> > +other fields, as documented below.
>> > +
>> > +The fields are interpreted as follows:
>> > +
>> > +cap_vndr
>> > +	0x09; Identifies a vendor-specific capability.
>> > +
>> > +cap_next
>> > +	Link to next capability in the capability list in the configuration space.
>> > +
>> > +cap_len
>> > +	Length of the capability structure, including the whole of
>> > +	struct virtio_pci_cap, and extra data if any.
>> > +	This length might include padding, or fields unused by the driver.
>> > +
>> > +cfg_type
>> > +	identifies the structure, according to the following table.
>> > +
>> > +	/* Common configuration */
>> > +	#define VIRTIO_PCI_CAP_COMMON_CFG	1
>> > +	/* Notifications */
>> > +	#define VIRTIO_PCI_CAP_NOTIFY_CFG	2
>> > +	/* ISR Status */
>> > +	#define VIRTIO_PCI_CAP_ISR_CFG		3
>> > +	/* Device specific configuration */
>> > +	#define VIRTIO_PCI_CAP_DEVICE_CFG	4
>> > +
>> > +	More than one capability can identify the same structure - this makes it
>> > +	possible for the device to expose multiple interfaces to drivers.  The order of
>> > +	the capabilities in the capability list specifies the order of preference
>> > +	suggested by the device; drivers should use the first interface that they can
>> > +	support.  For example, on some hypervisors, notifications using IO accesses are
>> > +	faster than memory accesses. In this case, hypervisor can expose two
>> > +	capabilities with cfg_type set to VIRTIO_PCI_CAP_NOTIFY_CFG:
>> > +	the first one addressing an I/O BAR, the second one addressing a memory BAR.
>> > +	Driver will use the I/O BAR if I/O resources are available, and fall back on
>> > +	memory BAR when I/O resources are unavailable.
>> > +
>> > +bar
>> > +
>> > +	values 0x0 to 0x5 specify a Base Address register (BAR) belonging to
>> > +	the function located beginning at 10h in Configuration Space
>> > +	and used to map the structure into Memory or I/O Space.
>> > +	The BAR is permitted to be either 32-bit or 64-bit, it can map Memory Space
>> > +	or I/O Space.
>> > +
>> > +	The value 0xF specifies that the structure is in PCI configuration space
>> > +	inline with this capability structure, following (not necessarily immediately)
>> > +	the length field.
>> 
>> Why not immediately?
>>  Or how would the driver know where it is?
>
> It's at the offset.
>
> E.g. for notification we stick multiplier after length.
> Further, we might extend virtio_pci_cap in the future,
> and we don't want to move stuff around like we
> had to with MSI-X.

So, offset is described as "/* Offset within bar. */", but if bar is
0xF, it's offset within the config space?

This adds complexity: do we really want this?

Cheers,
Rusty.
PS. Sorry for delay, took some time off.
Follow-Ups:
- Re: [virtio] Re: [virtio-comment] [PATCH RFC] virtio-pci: new configuration layout
  - From: "Michael S. Tsirkin" <mst@redhat.com>
- Re: [virtio-comment] [PATCH RFC] virtio-pci: new configuration layout
  - From: "Michael S. Tsirkin" <mst@redhat.com>