virtio message

Subject: Re: [virtio-comment] [PATCH RFC] virtio-pci: new configuration layout
From: "Michael S. Tsirkin" <mst@redhat.com>
To: Rusty Russell <rusty@au1.ibm.com>
Date: Mon, 2 Sep 2013 07:09:08 +0300
On Mon, Sep 02, 2013 at 10:45:11AM +0930, Rusty Russell wrote:
> "Michael S. Tsirkin" <mst@redhat.com> writes:
> > On Tue, Aug 27, 2013 at 04:24:47PM +0930, Rusty Russell wrote:
> >> "Michael S. Tsirkin" <mst@redhat.com> writes:
> >> > This is the new configuration layout.
> >> >
> >> > Notes:
> >> > - Everything is LE
> >> > - There's a feature bit that means spec 1.0 compliant.
> >> > - Both devices and drivers can either require the 1.0 interface
> >> >   or try to include compatibility support. The spec isn't forcing
> >> >   this decision.
> >> 
> >> Hmm, this kind includes other changes already proposed, like the LE
> >> change and the framing change.  I think this conceptually splits nicely:
> >> 
> >> 1) Feature bit 32 proposal.
> >> 2) Endian change.
> >> 3) Framing change.
> >> 4) PCI layout change.
> >
> > Right - they are mostly in different parts of the document.
> > I put it all together so it's easy to see how we intend to
> > handle the transition.
> > So is everyone OK with keeping this in a single patch?
> >
> >> > - I kept documentation of the legacy interface around, and added notes
> >> >   on transition inline. They are in separate sections each clearly marked
> >> >   "Legacy Interface" so we'll be able to separate them out
> >> >   from the final document as necessary - for now I think it's easier
> >> >   to keep it all together.
> >> 
> >> Good thinking: most of us know the current spec so it's definitely
> >> clearer.  And makes sure we're thinking about the transition.
> >> 
> >> > Only virtio PCI has been converted.
> >> > Let's discuss this on the meeting tonight, once we figure out PCI
> >> > we can do something similar for MMIO and CCW.
> >> 
> >> > @@ -137,6 +139,11 @@ Feature bits are allocated as follows:
> >> >    24 to 31: Feature bits reserved for extensions to the queue and 
> >> >    feature negotiation mechanisms
> >> >  
> >> > +  32: Feature bit must be set for any device compliant with this
> >> > +  revision of the specification, and acknowledged by all device drivers.
> >> > +
> >> > +  33 to 63: Feature bits reserved for future extensions
> >> > +
> >> >  For example, feature bit 0 for a network device (i.e. Subsystem 
> >> >  Device ID 1) indicates that the device supports checksumming of 
> >> >  packets.
> >> 
> >> Why stop at 63?  If we go to a more decentralized feature-assignment
> >> model, we'll run through those very fast.
> >
> > Then we'll just document more, but driver needs to know where to stop
> > looking for features.
> 
> Shouldn't it stop looking at the highest feature it understands?  Yes,
> this makes a problem for diagnostics, but they can use a simple
> heuristic (stop at first word which is all zeroes).  Or we need a
> register with an explicit limit.
> 
> If the spec says there won't be features above 63, we'd need a revision
> to change that.
> 
> >> I think we should stop abusing Revision IDs, and start using them
> >> to reflect device version changes as intended.
> >>
> >> We could reserve revision id 0 for legacy devices, however, which should
> >> work nicely.
> >
> > Hmm I am not sure I agree - what does it buy us that feature bits don't already supply?
> 
> AFAICT it's a PCI question more than a virtio question.  The use of
> revision IDs is left open in the PCI spec, but Microsoft's requirements
> here seem reasonable:
> 
> http://msdn.microsoft.com/en-us/library/windows/hardware/gg463287.aspx

They are only reasonable if you consider a requirement that users
load new drivers for new revision of devices reasonable.


> > More concerns:
> >
> > We are using revision ID now exactly as was intended to disable old
> > drivers - it served us well for 0.X-1.X and would be as useful if we
> > ever have 1.X->2.0 transition.
> 
> I think we'll use feature bit 64 for such a transition, now we don't
> have a feature bit limit.

Windows drivers only match on device,vendor id, subsystem ids and revision.

Once you match on that you can fail after checking feature bits
but it's too late to have another driver loaded.


> > Another worry with using revision numbering for features is that
> > it does not play well with downstreams.
> > E.g. RHEL might want to cherry-pick a feature without implementing
> > other features that happened to land in the same revision.
> >
> > Also Revision ID is only 8 bit - it's designed for hardware where
> > making a new revision is expensive. In software we'll run out of that
> > eventually.
> 
> This is why I think we leave it open in the standard.  If people want to
> use it as another way for them to implement driver workarounds, they
> can (probably a combination of device ID, rev id and features offered).
>
> >> >  2.4.1.2 PCI Device Layout
> >> >  -------------------------
> >> >  
> >> > -To configure the device, we use the first I/O region of the PCI 
> >> > -device. This contains a virtio header followed by a 
> >> > -device-specific region.
> >> > +To configure the device,
> >> > +use I/O and/or memory regions and/or PCI configuration space of the PCI device.
> >> > +These contain the virtio header registers, the notification register, the
> >> > +ISR status register and device specific registers, as specified by Virtio
> >> > ++ Structure PCI Capabilities
> >> > +
> >> > +There may be different widths of accesses to the I/O region; the
> >> > +“natural” access method for each field must be
> >> > +used (i.e. 32-bit accesses for 32-bit fields, etc).
> >> > +
> >> > +PCI Device Configuration Layout includes the common configuration,
> >> > +ISR, notification and device specific configuration
> >> > +structures.
> >> > +
> >> > +Unless explicitly specified otherwise, all multi-byte fields are little-endian.
> >> > +
> >> > +
> >> > +2.4.1.2.1 Common configuration structure layout
> >> > +-------------------------
> >> > +Common configuration structure layout is documented below:
> >> > +
> >> > +struct virtio_pci_common_cfg {
> >> > +	/* About the whole device. */
> >> > +	__le32 device_feature_select;	/* read-write */
> >> > +	__le32 device_feature;		/* read-only */
> >> > +	__le32 guest_feature_select;	/* read-write */
> >> > +	__le32 guest_feature;		/* read-write */
> >> > +	__le16 msix_config;		/* read-write */
> >> > +	__le16 num_queues;		/* read-only */
> >> > +	__u8 device_status;		/* read-write */
> >> > +	__u8 unused1;
> >> > +
> >> > +	/* About a specific virtqueue. */
> >> > +	__le16 queue_select;		/* read-write */
> >> > +	__le16 queue_size;		/* read-write, power of 2, or 0. */
> >> > +	__le16 queue_msix_vector;	/* read-write */
> >> > +	__le16 queue_enable;		/* read-write */
> >> > +	__le16 queue_notify_off;	/* read-only */
> >> > +	__le64 queue_desc;		/* read-write */
> >> > +	__le64 queue_avail;		/* read-write */
> >> > +	__le64 queue_used;		/* read-write */
> >> > +};
> >> > +
> >> > +device_feature_select
> >> > +
> >> > +	Selects which Feature Bits does device_feature field refer to.
> >> > +	Value 0x0 selects Feature Bits 0 to 31
> >> > +	Value 0x1 selects Feature Bits 32 to 63
> >> > +	All other values cause reads from device_feature to return 0.
> >> > +
> >> > +device_feature
> >> > +
> >> > +	Used by Device to report Feature Bits to Driver.
> >> > +	Device Feature Bits selected by device_feature_select.
> >> > +
> >> > +guest_feature_select
> >> > +
> >> > +	Selects which Feature Bits does guest_feature field refer to.
> >> > +	Value 0x0 selects Feature Bits 0 to 31
> >> > +	Value 0x1 selects Feature Bits 32 to 63
> >> > +	All other values cause writes to guest_feature to be ignored,
> >> > +	and reads to return 0.
> >> > +
> >> > +guest_feature
> >> > +
> >> > +	Used by Driver to acknowledge Feature Bits to Device.
> >> > +	Guest Feature Bits selected by guest_feature_select.
> >> > +
> >> > +msix_config
> >> > +
> >> > +	Configuration Vector for MSI-X.
> >> > +
> >> > +num_queues
> >> > +
> >> > +	Specifies the maximum number of virtqueues supported by device.
> >> > +
> >> > +device_status
> >> > +
> >> > +	Device Status field.
> >> > +
> >> > +queue_select
> >> > +
> >> > +	Queue Select. Selects which virtqueue do other fields refer to.
> >> > +
> >> > +queue_size
> >> > +
> >> > +	Queue Size.  On reset, specifies the maximum queue size supported by
> >> > +	the hypervisor. This can be modified by driver to reduce memory requirements.
> >> > +	Set to 0 if this virtqueue is unused.
> >> > +
> >> > +queue_msix_vector
> >> > +
> >> > +	Queue Vector for MSI-X.
> >> > +
> >> > +queue_enable
> >> > +
> >> > +	Used to selectively prevent host from executing requests from this virtqueue.
> >> > +	1 - enabled; 0 - disabled
> >> > +
> >> > +queue_notify_off
> >> > +
> >> > +	Used to calculate the offset from start of Notification structure at
> >> > +	which this virtqueue is located.
> >> > +	Note: this is *not* an offset in bytes. See notify_off_multiplier below.
> >> > +	
> >> > +queue_desc
> >> > +
> >> > +	Physical address of Descriptor Table.
> >> > +
> >> > +queue_avail
> >> > +
> >> > +	Physical address of Available Ring.
> >> > +
> >> > +queue_used
> >> > +
> >> > +	Physical address of Used Ring.
> >> > +
> >> > +
> >> > +2.4.1.2.2 ISR status structure layout
> >> > +-------------------------
> >> > +ISR status structure includes a single 8-bite ISR status field
> >> 
> >> 8-bit
> >
> > Right :)
> >
> >> > +
> >> > +2.4.1.2.3 Notification structure layout
> >> > +-------------------------
> >> > +Notification structure is always a multiple of 2 bytes in size.
> >> > +It includes 2-byte Queue Notify fields for each virtqueue of
> >> > +the device. Note that multiple virtqueues can use the same
> >> > +Queue Notify field, if necessary.
> >> 
> >> Hmm, maybe move this down, so you can have a section which starts with
> >> "If cfg_type is VIRTIO_PCI_CAP_NOTIFY_CFG" below?  That would put it all
> >> together.
> >
> > so Move PCI Device Layout to within
> > PCI-specific Initialization And Device Operation?
> 
> OK, maybe not.  This is simple, we can always reshuffle after we've made
> all the changes we want.
> 
> >> > +
> >> > +2.4.1.2.4 Device specific structure
> >> > +-------------------------
> >> > +
> >> > +Device specific structure is optional.
> >> > +
> >> > +2.4.1.2.5 Legacy Interfaces: A Note on PCI Device Layout
> >> > +-------------------------
> >> > +
> >> > +Transitional devices should present part of configuration
> >> > +registers in a legacy configuration structure in BAR0 in the first I/O
> >> > +region of the PCI device, as documented below.
> >> >  
> >> >  There may be different widths of accesses to the I/O region; the
> >> >  “natural” access method for each field in the virtio header must be
> >> > -used (i.e. 32-bit accesses for 32-bit fields, etc), but the
> >> > +used (i.e. 32-bit accesses for 32-bit fields, etc), but
> >> > +When accessed through the legacy interface the
> >> >  device-specific region can be accessed using any width accesses, and
> >> >  should obtain the same results.
> >> >  
> >> >  Note that this is possible because while the virtio header is PCI 
> >> > -(i.e. little) endian, the device-specific region is encoded in 
> >> > -the native endian of the guest (where such distinction is 
> >> > +(i.e. little) endian, when using the legacy interface the device-specific
> >> > +region is encoded in the native endian of the guest (where such distinction is
> >> >  applicable).
> >> >  
> >> > -2.4.1.2.1 PCI Device Virtio Header
> >> > -----------------------------------
> >> >  
> >> > -The virtio header looks as follows:
> >> > +When used through the legacy interface, the virtio header looks as follows:
> >> >  
> >> >  +------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+
> >> >  | Bits       || 32                  | 32                  | 32       | 16     | 16      | 16      | 8       | 8      |
> >> > @@ -661,7 +905,6 @@ The virtio header looks as follows:
> >> >  |            || Features bits 0:31  | Features bits 0:31  | Address  | Size   | Select  | Notify  | Status  | Status |
> >> >  +------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+
> >> >  
> >> > -
> >> >  If MSI-X is enabled for the device, two additional fields 
> >> >  immediately follow this header:[5]
> >> >  
> >> > @@ -689,25 +932,154 @@ device-specific headers:
> >> >  |            ||                    |
> >> >  +------------++--------------------+
> >> >  
> >> > +Note that only Feature Bits 0 to 31 are accessible through the
> >> > +Legacy Interface. When used through the Legacy Interface,
> >> > +Transitional Devices must assume that Feature Bits 32 to 63
> >> > +are not acknowledged by Driver.
> >> > +
> >> > +
> >> >  2.4.1.3 PCI-specific Initialization And Device Operation
> >> >  --------------------------------------------------------
> >> >  
> >> > -The page size for a virtqueue on a PCI virtio device is defined as
> >> > -4096 bytes.
> >> > -
> >> >  2.4.1.3.1 Device Initialization
> >> >  -------------------------------
> >> >  
> >> > -2.4.1.3.1.1 Queue Vector Configuration
> >> > +This documents PCI-specific steps executed during Device Initialization.
> >> > +As the first step, driver must detect device configuration layout
> >> > +to locate configuration fields in memory,I/O or configuration space of the
> >> > +device.
> >> > +
> >> > +2.4.1.3.1.1 Virtio Device Configuration Layout Detection
> >> > +-------------------------------
> >> > +
> >> > +As a prerequisite to device initialization, driver executes a
> >> > +PCI capability list scan, detecting virtio configuration layout using Virtio
> >> > +Structure PCI capabilities.
> >> > +
> >> > +Virtio Device Configuration Layout includes virtio configuration header, Notification
> >> > +and ISR Status and device configuration structures.
> >> > +Each structure can be mapped by a Base Address register (BAR) belonging to
> >> > +the function, located beginning at 10h in Configuration Space,
> >> > +or accessed though PCI configuration space.
> >> > +
> >> > +Actual location of each structure is specified using vendor-specific PCI capability located
> >> > +on capability list in PCI configuration space of the device.
> >> > +This virtio structure capability uses little-endian format; all bits are
> >> > +read-only:
> >> > +
> >> > +struct virtio_pci_cap {
> >> > +	__u8 cap_vndr;	/* Generic PCI field: PCI_CAP_ID_VNDR */
> >> > +	__u8 cap_next;	/* Generic PCI field: next ptr. */
> >> > +	__u8 cap_len;	/* Generic PCI field: capability length */
> >> > +	__u8 cfg_type;	/* Identifies the structure. */
> >> > +	__u8 bar;	/* Where to find it. */
> >> > +	__u8 padding[3];/* Pad to full dword. */
> >> > +	__le32 offset;	/* Offset within bar. */
> >> > +	__le32 length;	/* Length of the structure, in bytes. */
> >> > +};
> >> > +
> >> > +This structure can optionally followed by extra data, depending on
> >> > +other fields, as documented below.
> >> > +
> >> > +The fields are interpreted as follows:
> >> > +
> >> > +cap_vndr
> >> > +	0x09; Identifies a vendor-specific capability.
> >> > +
> >> > +cap_next
> >> > +	Link to next capability in the capability list in the configuration space.
> >> > +
> >> > +cap_len
> >> > +	Length of the capability structure, including the whole of
> >> > +	struct virtio_pci_cap, and extra data if any.
> >> > +	This length might include padding, or fields unused by the driver.
> >> > +
> >> > +cfg_type
> >> > +	identifies the structure, according to the following table.
> >> > +
> >> > +	/* Common configuration */
> >> > +	#define VIRTIO_PCI_CAP_COMMON_CFG	1
> >> > +	/* Notifications */
> >> > +	#define VIRTIO_PCI_CAP_NOTIFY_CFG	2
> >> > +	/* ISR Status */
> >> > +	#define VIRTIO_PCI_CAP_ISR_CFG		3
> >> > +	/* Device specific configuration */
> >> > +	#define VIRTIO_PCI_CAP_DEVICE_CFG	4
> >> > +
> >> > +	More than one capability can identify the same structure - this makes it
> >> > +	possible for the device to expose multiple interfaces to drivers.  The order of
> >> > +	the capabilities in the capability list specifies the order of preference
> >> > +	suggested by the device; drivers should use the first interface that they can
> >> > +	support.  For example, on some hypervisors, notifications using IO accesses are
> >> > +	faster than memory accesses. In this case, hypervisor can expose two
> >> > +	capabilities with cfg_type set to VIRTIO_PCI_CAP_NOTIFY_CFG:
> >> > +	the first one addressing an I/O BAR, the second one addressing a memory BAR.
> >> > +	Driver will use the I/O BAR if I/O resources are available, and fall back on
> >> > +	memory BAR when I/O resources are unavailable.
> >> > +
> >> > +bar
> >> > +
> >> > +	values 0x0 to 0x5 specify a Base Address register (BAR) belonging to
> >> > +	the function located beginning at 10h in Configuration Space
> >> > +	and used to map the structure into Memory or I/O Space.
> >> > +	The BAR is permitted to be either 32-bit or 64-bit, it can map Memory Space
> >> > +	or I/O Space.
> >> > +
> >> > +	The value 0xF specifies that the structure is in PCI configuration space
> >> > +	inline with this capability structure, following (not necessarily immediately)
> >> > +	the length field.
> >> 
> >> Why not immediately?
> >>  Or how would the driver know where it is?
> >
> > It's at the offset.
> >
> > E.g. for notification we stick multiplier after length.
> > Further, we might extend virtio_pci_cap in the future,
> > and we don't want to move stuff around like we
> > had to with MSI-X.
> 
> So, offset is described as "/* Offset within bar. */", but if bar is
> 0xF, it's offset within the config space?

Yes. So I'll fix the wording?

> This adds complexity: do we really want this?

I think making stuff part of a vendor specific cap is important,
and we might add stuff to capabilities, so what are
some other options?

> Cheers,
> Rusty.
> PS. Sorry for delay, took some time off.
Follow-Ups:
- Re: [virtio-comment] [PATCH RFC] virtio-pci: new configuration layout
  - From: Rusty Russell <rusty@au1.ibm.com>
References:
- Re: [virtio-comment] [PATCH RFC] virtio-pci: new configuration layout
  - From: Rusty Russell <rusty@au1.ibm.com>