OASIS Mailing List Archives

 



Subject: Re: [virtio] Re: [virtio-comment] [PATCH RFC] virtio-pci: new configuration layout


On Tue, 27 Aug 2013 20:18:21 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Tue, Aug 27, 2013 at 07:01:23PM +0200, Cornelia Huck wrote:
> > On Tue, 27 Aug 2013 18:36:29 +0300
> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > 
> > > On Tue, Aug 27, 2013 at 05:09:53PM +0200, Cornelia Huck wrote:
> > > > Some remarks from my side...
> > > > 
> > > > On Tue, 27 Aug 2013 10:38:59 +0300
> > > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > > 
> > > > > On Tue, Aug 27, 2013 at 04:24:47PM +0930, Rusty Russell wrote:
> > > > > > "Michael S. Tsirkin" <mst@redhat.com> writes:
> > > > > > > This is the new configuration layout.
> > > > > > >
> > > > > > > Notes:
> > > > > > > - Everything is LE
> > > > > > > - There's a feature bit that means spec 1.0 compliant.
> > > > > > > - Both devices and drivers can either require the 1.0 interface
> > > > > > >   or try to include compatibility support. The spec isn't forcing
> > > > > > >   this decision.
> > > > > > 
> > > > > > Hmm, this kind of includes other changes already proposed, like the LE
> > > > > > change and the framing change.  I think this conceptually splits nicely:
> > > > > > 
> > > > > > 1) Feature bit 32 proposal.
> > > > > > 2) Endian change.
> > > > > > 3) Framing change.
> > > > > > 4) PCI layout change.
> > > > > 
> > > > > Right - they are mostly in different parts of the document.
> > > > > I put it all together so it's easy to see how we intend to
> > > > > handle the transition.
> > > > > So is everyone OK with keeping this in a single patch?
> > > > 
> > > > The new feature bit is supposed to cover all of this, right? Then this
> > > > should be one patch.
> > > > 
> > > > > 
> > > > > > > - I kept documentation of the legacy interface around, and added notes
> > > > > > >   on transition inline. They are in separate sections each clearly marked
> > > > > > >   "Legacy Interface" so we'll be able to separate them out
> > > > > > >   from the final document as necessary - for now I think it's easier
> > > > > > >   to keep it all together.
> > > > > > 
> > > > > > Good thinking: most of us know the current spec so it's definitely
> > > > > > clearer.  And makes sure we're thinking about the transition.
> > > > > > 
> > > > > > > Only virtio PCI has been converted.
> > > > > > > Let's discuss this on the meeting tonight, once we figure out PCI
> > > > > > > we can do something similar for MMIO and CCW.
> > > > > > 
> > > > > > > @@ -137,6 +139,11 @@ Feature bits are allocated as follows:
> > > > > > >    24 to 31: Feature bits reserved for extensions to the queue and 
> > > > > > >    feature negotiation mechanisms
> > > > > > >  
> > > > > > > +  32: Feature bit must be set for any device compliant with this
> > > > > > > +  revision of the specification, and acknowledged by all device drivers.
> > > > 
> > > > Would it make sense to have a bit 33 "rings big endian" whose validity
> > > > depends on bit 32 set? This would make it possible for ccw to keep its
> > > > current endianness.
> > > 
> > > I didn't go over ccw or MMIO yet - only PCI.
> > > I think ccw registers will just
> > > be explicitly BE, with no need for a feature bit.
> > > Does this sound right?
> > 
> > Sure, that would be even better.
> > 
> > > 
> > > > > > > +
> > > > > > > +  33 to 63: Feature bits reserved for future extensions
> > > > > > > +
> > > > > > >  For example, feature bit 0 for a network device (i.e. Subsystem 
> > > > > > >  Device ID 1) indicates that the device supports checksumming of 
> > > > > > >  packets.
> > > > > > 
> > > > > > Why stop at 63?  If we go to a more decentralized feature-assignment
> > > > > > model, we'll run through those very fast.
> > > > > 
> > > > > Then we'll just document more, but driver needs to know where to stop
> > > > > looking for features.
> > > > > 
> > > > > > 
> > > > > > > @@ -145,13 +152,63 @@ In particular, new fields in the device configuration space are
> > > > > > >  indicated by offering a feature bit, so the guest can check 
> > > > > > >  before accessing that part of the configuration space.
> > > > > > >  
> > > > > > > +2.1.2.1 Legacy Interface: A Note on transitions from earlier drafts
> > > > > > > +--------------------------------------
> > > > > > > +
> > > > > > > +Earlier drafts of this specification (up to 0.9.X) defined a similar, but
> > > > > > > +different interface between the hypervisor and the guest.
> > > > > > > +Since these are widely deployed in the field, this specification
> > > > > > > > +accommodates optional features to simplify transition
> > > > > > > +from these earlier draft interfaces. Specifically:
> > > > > > > +
> > > > > > > +Legacy Interface
> > > > > > > +	is an interface specified by an earlier draft of this specification
> > > > > > > +        (up to 0.9.X)
> > > > > > > +Legacy Device
> > > > > > > +	is a device implemented before this specification was released,
> > > > > > > +        and implementing a legacy interface on the host side
> > > > > > > +Legacy Driver
> > > > > > > +	is a driver implemented before this specification was released,
> > > > > > > +        and implementing a legacy interface on the guest side
> > > > > > > +
> > > > > > > +to simplify transition from these earlier draft interfaces,
> > > > > > > +it is possible to implement
> > > > > > > +
> > > > > > > +Transitional Device
> > > > > > > +	a device supporting both drivers conforming to this
> > > > > > > +        specification, and legacy drivers
> > > > > > > +
> > > > > > > +Transitional Driver
> > > > > > > +	a driver supporting both devices conforming to this
> > > > > > > +	specification, and legacy devices
> > > > 
> > > > What happens to legacy devices in the future? Current implementers
> > > > will obviously expose legacy devices, which means future drivers need
> > > > to be transitional or they won't work with what is currently out there.
> > > 
> > > You are right. It's a bug in what I wrote: non transitional drivers
> > > should work with transitional devices.
> > > This way a transitional device can change to non-transitional
> > > after drivers are updated.
> > > 
> > > > Will legacy stay around (for the foreseeable future)?
> > > 
> > > That's up to implementers I think as long as they
> > > implement the new standard we should not prevent them from
> > > bundling in the old virtio, coffee making capabilities etc.
> > > 
> > > 
> > > > Will legacy
> > > > devices still be considered standard compliant (as in "compliant to the
> > > > legacy standard")?
> > > 
> > > I don't think they are compliant. We'll split the legacy sections
> > > from spec out to a separate transition guide before we release
> > > the spec.
> > 
> > What I'm worried about is probably the transitional nature of this.
> > There is a framework we have now, so there will be users - and not on
> > all platforms do they expect to have to upgrade, especially if traditional
> > I/O has always been backwards compatible for decades...
> 
> I'm not sure I understand the suggestion.
> You want us to push devices harder to implement legacy interfaces?
> You want us to push drivers harder to switch to new interfaces?
> 
> The proposal is basically trying hard to supply a mechanism,
> not force a policy.

So we may be in violent agreement there :) If the legacy mechanism
can stay, I'm fine.

> 
> > > 
> > > > > > > +
> > > > > > > +Device and driver that require support for revision 1.0 or newer of
> > > > > > > +the specification to function, are called non-transitional device and driver,
> > > > > > > +respectively.
> > > > > > > +
> > > > > > > +Transitional Drivers can detect Legacy Devices by detecting that
> > > > > > > +Feature bit 32 is not offered.
> > > > > > > +Transitional devices can detect Legacy drivers by detecting that
> > > > > > > +Feature bit 32 has not been acknowledged by driver.
> > > > 
> > > > Will we use new feature bits for new, incompatible revisions? Or will
> > > > we try to stay backwards compatible?
> > > 
> > > So an incompatible change needs to increment revision ID
> > > to prevent drivers from loading.
> > > MMIO and PCI both have revision IDs.
> > > CCW will need to add something like a revision ID,
> > > we discussed this already.
> > 
> > Command rejects?
> 
> Which command would you reject?

Whatever was incompatible or unknown.

> 
> > I think it is a good idea to try to stay as compatible as possible;
> > this should really be a last measure.
> 
> Again, I think that at some point, e.g. 10-15 years in
> the future, devices will want to say "I require new drivers
> and that's it".

Then they should probably present themselves as different devices, no?
(We're not talking about minor changes here, I guess.)

> 
> I think it's useful to have a mechanism for this, so
> old drivers fail gracefully.

See below for some more thoughts I had on this.

> 
> > > 
> > > > > > > +
> > > > > > > +To make them easier to locate, specification sections documenting these
> > > > > > > > +transitional features are all explicitly marked with
> > > > > > > +'Legacy Interface' in the section title.
> > > > > > > +
> > > > > > > +
> > > > > > >  2.1.3 Configuration Space
> > > > > > >  -------------------------
> > > > > > >  
> > > > > > >  Configuration space is generally used for rarely-changing or
> > > > > > >  initialization-time parameters.
> > > > > > >  
> > > > > > > -Note that this space is generally the guest's native endian, 
> > > > > > > +Note that configuration space generally uses the little-endian format
> > > > > > > +for multi-byte fields.
> > > > > > > +
> > > > > > > +2.1.4.1 Legacy Interface: A Note on Configuration Space endian-ness
> > > > > > > +--------------------------------------
> > > > > > > +
> > > > > > > +Note that for legacy interfaces, configuration space is generally the guest's native endian, 
> > > > > > >  rather than PCI's little-endian.
> > > > > > >  
> > > > > > >  2.1.4 Virtqueues
> > > > > > > @@ -164,6 +221,45 @@ transmit and one for receive.  Each queue has a 16-bit queue size
> > > > > > >  parameter, which sets the number of entries and implies the total size
> > > > > > >  of the queue.
> > > > > > >  
> > > > > > > +Each virtqueue consists of three parts:
> > > > > > > +
> > > > > > > +	Descriptor Table
> > > > > > > +	Available Ring
> > > > > > > +	Used Ring
> > > > > > > +
> > > > > > > +where each part is physically-contiguous in guest memory,
> > > > > > > +and has different alignment requirements.
> > > > > > > +
> > > > > > > +The Queue Size field controls the total number of bytes
> > > > > > > +required for each part of the virtqueue.
> > > > > > > +
> > > > > > > > +The memory alignment and size requirements, in bytes, of each part of the
> > > > > > > +virtqueue are summarized in the following table (qsz is the Queue Size field):
> > > > > > > +
> > > > > > > ++------------+---------------------------------+
> > > > > > > +| Virtqueue Part    | Alignment | Size         |
> > > > > > > ++------------+---------------------------------+
> > > > > > > ++------------+---------------------------------+
> > > > > > > +| Descriptor Table  | 16        | 16 * qsz     |
> > > > > > > ++------------+---------------------------------+
> > > > > > > +| Available Ring    | 2         | 6 + 2 * qsz  |
> > > > > > > ++------------+---------------------------------+
> > > > > > > +| Used Ring         | 4         | 6 + 4 * qsz  |
> > > > > > > ++------------+---------------------------------+
> > > > > > > +
> > > > > > > +When the driver wants to send a buffer to the device, it fills in 
> > > > > > > +a slot in the descriptor table (or chains several together), and 
> > > > > > > +writes the descriptor index into the available ring.  It then 
> > > > > > > +notifies the device. When the device has finished a buffer, it 
> > > > > > > +writes the descriptor into the used ring, and sends an interrupt.
> > > > > > > +
> > > > > > > +
> > > > > > > +2.1.4.1 Legacy Interfaces: A Note on Virtqueue Layout
> > > > > > > +--------------------------------------
> > > > > > > +
> > > > > > > +For Legacy Interfaces, several additional
> > > > > > > +restrictions are placed on the virtqueue layout:
> > > > > > > +
> > > > > > >  Each virtqueue occupies two or more physically-contiguous pages 
> > > > > > >  (usually defined as 4096 bytes, but depending on the transport)
> > > > > > >  and consists of three parts:
> > > > > > > @@ -182,9 +278,8 @@ required for the virtqueue according to the following formula:
> > > > > > >  	          + ALIGN(sizeof(u16)*3 + sizeof(struct vring_used_elem)*qsz);
> > > > > > >  	}
> > > > > > >  
> > > > > > > -This currently wastes some space with padding, but also allows future
> > > > > > > -extensions such as the VIRTIO_RING_F_EVENT_IDX extension.  The
> > > > > > > -virtqueue layout structure looks like this:
> > > > > > > +This wastes some space with padding.
> > > > > > > +The legacy virtqueue layout structure therefore looks like this:
> > > > > > >  
> > > > > > >  	struct vring {
> > > > > > >  		// The actual descriptors (16 bytes each)
> > > > > > > @@ -200,25 +295,17 @@ virtqueue layout structure looks like this:
> > > > > > >  		struct vring_used used;
> > > > > > >  	};
> > > > > > >  
> > > > > > > -When the driver wants to send a buffer to the device, it fills in 
> > > > > > > -a slot in the descriptor table (or chains several together), and 
> > > > > > > -writes the descriptor index into the available ring.  It then 
> > > > > > > -notifies the device. When the device has finished a buffer, it 
> > > > > > > -writes the descriptor into the used ring, and sends an interrupt.
> > > > > > > -
> > > > > > > -2.1.4.1 A Note on Virtqueue Endianness
> > > > > > > +2.1.4.1 Legacy Interfaces: A Note on Virtqueue Endianness
> > > > > > >  --------------------------------------
> > > > > > >  
> > > > > > >  Note that the endian of fields and in the virtqueue is the native
> > > > > > > -endian of the guest, not little-endian as PCI normally is. This makes
> > > > > > > -for simpler guest code, and it is assumed that the host already has to
> > > > > > > -be deeply aware of the guest endian so such an “endian-aware” device
> > > > > > > -is not a significant issue.
> > > > > > > +endian of the guest, not little-endian as PCI normally is.
> > > > > > > +It is assumed that the host is already aware of the guest endian.
> > > > > > >  
> > > > > > >  2.1.4.2 Message Framing
> > > > > > >  -----------------------
> > > > > > > -The original intent of the specification was that message framing (the
> > > > > > > -particular layout of descriptors) be independent of the contents of
> > > > > > > +Generally, the intent of the specification is for message framing (the
> > > > > > > +particular layout of descriptors) to be independent of the contents of
> > > > > > >  the buffers. For example, a network transmit buffer consists of a 12
> > > > > > >  byte header followed by the network packet. This could be most simply
> > > > > > >  placed in the descriptor table as a 12 byte output descriptor followed
> > > > > > > @@ -227,16 +314,21 @@ single 1526 byte output descriptor in the case where the header and
> > > > > > >  packet are adjacent, or even three or more descriptors (possibly with
> > > > > > >  loss of efficiency in that case).
> > > > > > >  
> > > > > > > -Regrettably, initial driver implementations used simple layouts, and
> > > > > > > -devices came to rely on it, despite this specification wording[10]. It
> > > > > > > -is thus recommended that drivers be conservative in their assumptions,
> > > > > > > -unless the VIRTIO_F_ANY_LAYOUT feature is accepted. In addition, some
> > > > > > > +In addition, some
> > > > > > >  implementations may have large-but-reasonable restrictions on total
> > > > > > >  descriptor size (such as based on IOV_MAX in the host OS). This has
> > > > > > >  not been a problem in practice: little sympathy will be given to
> > > > > > >  drivers which create unreasonably-sized descriptors such as by
> > > > > > >  dividing a network packet into 1500 single-byte descriptors!
> > > > > > >  
> > > > > > > +2.1.4.2.1 Legacy Interfaces: A Note on Message Framing
> > > > > > > +-----------------------
> > > > > > > +Regrettably, initial driver implementations used simple layouts, and
> > > > > > > +devices came to rely on it, despite this specification wording[10]. It
> > > > > > > +is thus recommended that when using legacy interfaces,
> > > > > > > +drivers should be conservative in their assumptions,
> > > > > > > +unless the VIRTIO_F_ANY_LAYOUT feature is accepted.
> > > > 
> > > > So ANY_LAYOUT and feature bit 32 are mutually exclusive?
> > > 
> > > Hmm. I wonder what gives this impression.
> > > What I tried to say is bit 32 should imply ANY_LAYOUT.
> > 
> > Better to spell it out, then.
> 
> Well it says (in unchanged text)
> 	Generally, the intent of the specification is for message framing (the
> 	particular layout of descriptors) to be independent of the contents of
> 	the buffers.
> 
> how would you make it clearer?

"Note that bit 32 implies ANY_LAYOUT"?

> 
> > > 
> > > 
> > > > > > > +
> > > > > > >  2.1.4.3 The Virtqueue Descriptor Table
> > > > > > >  --------------------------------------
> > > > > > >  
> > > > > > > @@ -386,23 +478,27 @@ how to communicate with the specific device.
> > > > > > >  2.2.1 Device Initialization
> > > > > > >  ---------------------------
> > > > > > >  
> > > > > > > -1. Reset the device. This is not required on initial start up.
> > > > > > > +1. Device discovery. This is only required for some transports.
> > > > > > > +
> > > > > > > +2. Reset the device. This is not required on initial start up.
> > > > > > >  
> > > > > > > -2. The ACKNOWLEDGE status bit is set: we have noticed the device.
> > > > > > > +3. Device layout detection. This is only required for some transports.
> > > > > > >  
> > > > > > > -3. The DRIVER status bit is set: we know how to drive the device.
> > > > > > > +4. The ACKNOWLEDGE status bit is set: we have noticed the device.
> > > > > > >  
> > > > > > > -4. Device-specific setup, including reading the device feature 
> > > > > > > +5. The DRIVER status bit is set: we know how to drive the device.
> > > > > > > +
> > > > > > > +6. Device-specific setup, including reading the device feature 
> > > > > > >    bits, discovery of virtqueues for the device, optional per-bus
> > > > > > >    setup, and reading and possibly writing the device's virtio 
> > > > > > >    configuration space.
> > > > > > >  
> > > > > > > -5. The subset of device feature bits understood by the driver is 
> > > > > > > +7. The subset of device feature bits understood by the driver is 
> > > > > > >     written to the device.
> > > > > > >  
> > > > > > > -6. The DRIVER_OK status bit is set.
> > > > > > > +8. The DRIVER_OK status bit is set.
> > > > > > >  
> > > > > > > -7. The device can now be used (ie. buffers added to the 
> > > > > > > +9. The device can now be used (ie. buffers added to the 
> > > > > > >     virtqueues)[4]
> > > > > > >  
> > > > > > >  If any of these steps go irrecoverably wrong, the guest should 
> > > > > > > @@ -622,35 +718,183 @@ Virtio devices are commonly implemented as PCI devices.
> > > > > > >  
> > > > > > >  Any PCI device with Vendor ID 0x1AF4, and Device ID 0x1000 through
> > > > > > >  0x103F inclusive is a virtio device[3]. The device must also have a
> > > > > > > -Revision ID of 0 to match this specification.
> > > > > > > +Revision ID of 0 or Revision ID of 1 to match this specification.
> > > > > > >  
> > > > > > >  The Subsystem Device ID indicates which virtio device is 
> > > > > > >  supported by the device. The Subsystem Vendor ID should reflect 
> > > > > > >  the PCI Vendor ID of the environment (it's currently only used 
> > > > > > >  for informational purposes by the guest).
> > > > > > >  
> > > > > > > +Drivers must not match devices where Revision ID does not match 0 or 1.
> > > > > > > +
> > > > > > > +2.4.1.1.1 Legacy Interface: A Note on PCI Device Discovery
> > > > > > > +----------------------------
> > > > > > > +Transitional devices must have a Revision ID of 0.
> > > > > > > +
> > > > > > > +Non-transitional devices must have a Revision ID of 1.
> > > > > > > +
> > > > > > > +Transitional drivers must match a Revision ID of 0 or 1.
> > > > > > > +
> > > > > > > +Non-transitional drivers must only match a Revision ID of 1.
> > > > > > > +
> > > > > > 
> > > > > > I think we should stop abusing Revision IDs, and start using them
> > > > > > to reflect device version changes as intended.
> > > > > >
> > > > > > We could reserve revision id 0 for legacy devices, however, which should
> > > > > > work nicely.
> > > > > 
> > > > > Hmm I am not sure I agree - what does it buy us that feature bits don't already supply?
> > > > > 
> > > > > More concerns:
> > > > > 
> > > > > We are using revision ID now exactly as was intended to disable old
> > > > > drivers - it served us well for 0.X-1.X and would be as useful if we
> > > > > ever have 1.X->2.0 transition.
> > > > > 
> > > > > Another worry with using revision numbering for features is that
> > > > > it does not play well with downstreams.
> > > > > E.g. RHEL might want to cherry-pick a feature without implementing
> > > > > other features that happened to land in the same revision.
> > > > > 
> > > > > Also Revision ID is only 8 bit - it's designed for hardware where
> > > > > making a new revision is expensive. In software we'll run out of that
> > > > > eventually.
> > > > 
> > > > So Revision ID is a PCI-specific thing, right? Not all transports will
> > > > necessarily have something equivalent, so they would need to depend on
> > > > the feature bit.
> > > 
> > > They can't do this reliably - for example you might want to move feature
> > > bits around.
> > 
> > That sounds like setting yourself up for problems.
> > If you want to
> > deprecate bits, it would be better to define them as "reserved" and use
> > a new bit for your new feature. The s390 architecture is full of
> > "reserved" bits like that.
> 
> That's exactly what PCI does here though, and it does this
> without problems exactly because we have a way to
> make old drivers fail if we want to.

You fail if a reserved bit is to be negotiated?

> 
> So IMO it would be good to add a revision field to ccw so we
> can do this there in the future.

An idea I just had:

- Add a new channel command "set virtio configuration".
  This can set:
  - a revision id; 0 for legacy, 1 for the proposal, possibly more later
  - a format field indicating the format of the following data area
  - a data area (unused for now, but can be used for all kinds of
    configuration parameters)
- A transitional or modern driver will issue this command when
  starting to probe.
  - A legacy device will reject the command, prompting the transitional
    driver to use the legacy interface and the modern driver to fail.
  - Transitional/modern devices will either accept the configuration or
    reject it if they don't support that particular configuration. The
    driver may then either fail or re-try with a different
    configuration.
- A legacy driver will not issue this command (obviously).
  - A legacy device will work as before.
  - A transitional device will notice that a virtio-ccw command is
    issued without any configuration set. It will therefore operate in
    legacy mode.
  - A modern device will reject any virtio-ccw command without any
    configuration set, causing the legacy driver to fail.
- A set configuration command is rejected after the first virtio-ccw
  command has been issued.
  - Obviously true for legacy devices.
  - Allows transitional/modern devices to fence off misbehaving drivers.
  - No dynamic change of the configuration; you'll always have to tear
    down and re-init for that.
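
To spell out the cases above, here is a rough sketch in C of the probe
outcome for each driver/device pairing. All names (side_kind,
probe_result, probe_outcome) are made up for illustration and are not
part of virtio-ccw. It also assumes that a transitional or modern
device accepts revision 1, and it does not model the retry with a
different configuration or the "reject after the first virtio-ccw
command" rule.

```c
#include <stdbool.h>

/* Hypothetical names; nothing here is defined by virtio-ccw. */
enum side_kind { KIND_LEGACY, KIND_TRANSITIONAL, KIND_MODERN };
enum probe_result { PROBE_FAIL, PROBE_LEGACY_MODE, PROBE_REVISION_1 };

/* Only transitional and modern drivers issue "set virtio configuration". */
static bool driver_issues_set_config(enum side_kind driver)
{
    return driver != KIND_LEGACY;
}

/* Outcome of probing for one driver/device pairing. */
static enum probe_result probe_outcome(enum side_kind driver,
                                       enum side_kind device)
{
    if (driver_issues_set_config(driver)) {
        if (device == KIND_LEGACY) {
            /* A legacy device rejects the unknown command: the
             * transitional driver falls back, the modern driver fails. */
            return driver == KIND_TRANSITIONAL ? PROBE_LEGACY_MODE
                                               : PROBE_FAIL;
        }
        /* Assumption: the device supports the requested revision 1. */
        return PROBE_REVISION_1;
    }

    /* Legacy driver: no configuration is ever set. */
    if (device == KIND_MODERN)
        return PROBE_FAIL;        /* unconfigured commands are rejected */
    return PROBE_LEGACY_MODE;     /* legacy/transitional devices work as before */
}
```

The only pairings that fail are modern driver with legacy device and
legacy driver with modern device, which is the fencing we want.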

I think this should allow us to accommodate future changes without
having to change the control unit type, unless we'd really do something
radically different.

> 
> > > For 0.9.X drivers and non-transitional devices,
> > > I'd like to find some hack to make probe fail.
> > > 
> > > Any idea?
> > 
> > Not really, sorry.
> > 
> > > 
> > > But let's plan ahead and add a way to do this
> > > in the future if we make an incompatible change again.
> > 
> > I'd rather have an architecture that allows us to be backwards
> > compatible for a long time and introduce a new device id/cu type for
> > a new kind of device if we want to do things differently and ditch old
> > baggage.
> 
> device ids are transport independent so we can't do this.



> What's a cu type? Hard to add?
> If no let's do that, and add a revision to future-proof it.

Ah, terminology fail.

The device id (net, block, ...) is reflected for virtio-ccw devices in
the control unit model (8 bit value). The control unit type (16 bit
value) is what identifies the control unit as a virtio-ccw control
unit, accepting virtio-ccw channel commands. So you get

3832/01 -> virtio-net via virtio-ccw
3832/02 -> virtio-blk via virtio-ccw

(all of which is discovered via a common channel-io mechanism)

For something radically incompatible, this could become 3833/01,
3833/02, ...

What I *meant* with device id above was the pci id.

A new cu type (3833) would be easy to code, but I'd have to get it
reserved with the folks handling known ids. So if my idea from above
worked, that would be way better.
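
The cu type/model split described above can be sketched roughly like
this. The struct and function names are hypothetical, and I'm assuming
the 3832 in the examples is a hexadecimal value; the rest follows the
3832/01 and 3832/02 examples.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: "3832" in the examples above is hexadecimal. */
#define VIRTIO_CCW_CU_TYPE 0x3832u

/* Hypothetical container for what channel-io discovery reports. */
struct cu_id {
    uint16_t cu_type;   /* identifies the control unit as virtio-ccw */
    uint8_t  cu_model;  /* carries the virtio device id (net, block, ...) */
};

/* True if the control unit accepts virtio-ccw channel commands. */
static bool is_virtio_ccw(struct cu_id id)
{
    return id.cu_type == VIRTIO_CCW_CU_TYPE;
}

/* Returns the virtio device id, or -1 if this is not virtio-ccw. */
static int virtio_device_id(struct cu_id id)
{
    return is_virtio_ccw(id) ? (int)id.cu_model : -1;
}
```

A driver that only knows 3832 would simply not match a 3833 control
unit, which is what makes a new cu type work as a hard break.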

> 
> > > 
> > > > > 
> > > > > 
> > > > > > 
> > > > > > >  2.4.1.2 PCI Device Layout
> > > > > > >  -------------------------
> > > > > > >  
> > > > > > > -To configure the device, we use the first I/O region of the PCI 
> > > > > > > -device. This contains a virtio header followed by a 
> > > > > > > -device-specific region.
> > > > > > > +To configure the device,
> > > > > > > +use I/O and/or memory regions and/or PCI configuration space of the PCI device.
> > > > > > > +These contain the virtio header registers, the notification register, the
> > > > > > > +ISR status register and device specific registers, as specified by Virtio
> > > > > > > ++ Structure PCI Capabilities
> > > > > > > +
> > > > > > > +There may be different widths of accesses to the I/O region; the
> > > > > > > +“natural” access method for each field must be
> > > > > > > +used (i.e. 32-bit accesses for 32-bit fields, etc).
> > > > > > > +
> > > > > > > +PCI Device Configuration Layout includes the common configuration,
> > > > > > > +ISR, notification and device specific configuration
> > > > > > > +structures.
> > > > > > > +
> > > > > > > +Unless explicitly specified otherwise, all multi-byte fields are little-endian.
> > > > > > > +
> > > > > > > +
> > > > > > > +2.4.1.2.1 Common configuration structure layout
> > > > > > > +-------------------------
> > > > > > > +Common configuration structure layout is documented below:
> > > > > > > +
> > > > > > > +struct virtio_pci_common_cfg {
> > > > > > > +	/* About the whole device. */
> > > > > > > +	__le32 device_feature_select;	/* read-write */
> > > > > > > +	__le32 device_feature;		/* read-only */
> > > > > > > +	__le32 guest_feature_select;	/* read-write */
> > > > > > > +	__le32 guest_feature;		/* read-write */
> > > > > > > +	__le16 msix_config;		/* read-write */
> > > > > > > +	__le16 num_queues;		/* read-only */
> > > > > > > +	__u8 device_status;		/* read-write */
> > > > > > > +	__u8 unused1;
> > > > > > > +
> > > > > > > +	/* About a specific virtqueue. */
> > > > > > > +	__le16 queue_select;		/* read-write */
> > > > > > > +	__le16 queue_size;		/* read-write, power of 2, or 0. */
> > > > > > > +	__le16 queue_msix_vector;	/* read-write */
> > > > > > > +	__le16 queue_enable;		/* read-write */
> > > > > > > +	__le16 queue_notify_off;	/* read-only */
> > > > > > > +	__le64 queue_desc;		/* read-write */
> > > > > > > +	__le64 queue_avail;		/* read-write */
> > > > > > > +	__le64 queue_used;		/* read-write */
> > > > > > > +};
> > > > > > > +
> > > > > > > +device_feature_select
> > > > > > > +
> > > > > > > +	Selects which Feature Bits does device_feature field refer to.
> > > > > > > +	Value 0x0 selects Feature Bits 0 to 31
> > > > > > > +	Value 0x1 selects Feature Bits 32 to 63
> > > > > > > +	All other values cause reads from device_feature to return 0.
> > > > > > > +
> > > > > > > +device_feature
> > > > > > > +
> > > > > > > +	Used by Device to report Feature Bits to Driver.
> > > > > > > +	Device Feature Bits selected by device_feature_select.
> > > > > > > +
> > > > > > > +guest_feature_select
> > > > > > > +
> > > > > > > +	Selects which Feature Bits does guest_feature field refer to.
> > > > > > > +	Value 0x0 selects Feature Bits 0 to 31
> > > > > > > +	Value 0x1 selects Feature Bits 32 to 63
> > > > > > > +	All other values cause writes to guest_feature to be ignored,
> > > > > > > +	and reads to return 0.
> > > > > > > +
> > > > > > > +guest_feature
> > > > > > > +
> > > > > > > +	Used by Driver to acknowledge Feature Bits to Device.
> > > > > > > +	Guest Feature Bits selected by guest_feature_select.
> > > > > > > +
> > > > > > > +msix_config
> > > > > > > +
> > > > > > > +	Configuration Vector for MSI-X.
> > > > > > > +
> > > > > > > +num_queues
> > > > > > > +
> > > > > > > +	Specifies the maximum number of virtqueues supported by device.
> > > > > > > +
> > > > > > > +device_status
> > > > > > > +
> > > > > > > +	Device Status field.
> > > > > > > +
> > > > > > > +queue_select
> > > > > > > +
> > > > > > > +	Queue Select. Selects which virtqueue do other fields refer to.
> > > > > > > +
> > > > > > > +queue_size
> > > > > > > +
> > > > > > > +	Queue Size.  On reset, specifies the maximum queue size supported by
> > > > > > > +	the hypervisor. This can be modified by driver to reduce memory requirements.
> > > > > > > +	Set to 0 if this virtqueue is unused.
> > > > > > > +
> > > > > > > +queue_msix_vector
> > > > > > > +
> > > > > > > +	Queue Vector for MSI-X.
> > > > > > > +
> > > > > > > +queue_enable
> > > > > > > +
> > > > > > > +	Used to selectively prevent host from executing requests from this virtqueue.
> > > > > > > +	1 - enabled; 0 - disabled
> > > > > > > +
> > > > > > > +queue_notify_off
> > > > > > > +
> > > > > > > +	Used to calculate the offset from start of Notification structure at
> > > > > > > +	which this virtqueue is located.
> > > > > > > +	Note: this is *not* an offset in bytes. See notify_off_multiplier below.
> > > > > > > +	
> > > > > > > +queue_desc
> > > > > > > +
> > > > > > > +	Physical address of Descriptor Table.
> > > > > > > +
> > > > > > > +queue_avail
> > > > > > > +
> > > > > > > +	Physical address of Available Ring.
> > > > > > > +
> > > > > > > +queue_used
> > > > > > > +
> > > > > > > +	Physical address of Used Ring.
> > > > > > > +
> > > > > > > +
> > > > > > > +2.4.1.2.2 ISR status structure layout
> > > > > > > +-------------------------
> > > > > > > +ISR status structure includes a single 8-bite ISR status field
> > > > > > 
> > > > > > 8-bit
> > > > > 
> > > > > Right :)
> > > > > 
> > > > > > > +
> > > > > > > +2.4.1.2.3 Notification structure layout
> > > > > > > +-------------------------
> > > > > > > +Notification structure is always a multiple of 2 bytes in size.
> > > > > > > +It includes 2-byte Queue Notify fields for each virtqueue of
> > > > > > > +the device. Note that multiple virtqueues can use the same
> > > > > > > +Queue Notify field, if necessary.
> > > > > > 
> > > > > > Hmm, maybe move this down, so you can have a section which starts with
> > > > > > "If cfg_type is VIRTIO_PCI_CAP_NOTIFY_CFG" below?  That would put it all
> > > > > > together.
> > > > > 
> > > > > > So move PCI Device Layout to within
> > > > > > PCI-specific Initialization And Device Operation?
> > > > > 
> > > > > > > +
> > > > > > > +2.4.1.2.4 Device specific structure
> > > > > > > +-------------------------
> > > > > > > +
> > > > > > > +Device specific structure is optional.
> > > > > > > +
> > > > > > > +2.4.1.2.5 Legacy Interfaces: A Note on PCI Device Layout
> > > > > > > +-------------------------
> > > > > > > +
> > > > > > > +Transitional devices should present part of configuration
> > > > > > > +registers in a legacy configuration structure in BAR0 in the first I/O
> > > > > > > +region of the PCI device, as documented below.
> > > > > > >  
> > > > > > >  There may be different widths of accesses to the I/O region; the
> > > > > > >  “natural” access method for each field in the virtio header must be
> > > > > > > -used (i.e. 32-bit accesses for 32-bit fields, etc), but the
> > > > > > > > +used (i.e. 32-bit accesses for 32-bit fields, etc), but
> > > > > > > > +when accessed through the legacy interface the
> > > > > > >  device-specific region can be accessed using any width accesses, and
> > > > > > >  should obtain the same results.
> > > > > > >  
> > > > > > >  Note that this is possible because while the virtio header is PCI 
> > > > > > > -(i.e. little) endian, the device-specific region is encoded in 
> > > > > > > -the native endian of the guest (where such distinction is 
> > > > > > > +(i.e. little) endian, when using the legacy interface the device-specific
> > > > > > > +region is encoded in the native endian of the guest (where such distinction is
> > > > > > >  applicable).
> > > > > > >  
> > > > > > > -2.4.1.2.1 PCI Device Virtio Header
> > > > > > > -----------------------------------
> > > > > > >  
> > > > > > > -The virtio header looks as follows:
> > > > > > > +When used through the legacy interface, the virtio header looks as follows:
> > > > > > >  
> > > > > > >  +------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+
> > > > > > >  | Bits       || 32                  | 32                  | 32       | 16     | 16      | 16      | 8       | 8      |
> > > > > > > @@ -661,7 +905,6 @@ The virtio header looks as follows:
> > > > > > >  |            || Features bits 0:31  | Features bits 0:31  | Address  | Size   | Select  | Notify  | Status  | Status |
> > > > > > >  +------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+
> > > > > > >  
> > > > > > > -
> > > > > > >  If MSI-X is enabled for the device, two additional fields 
> > > > > > >  immediately follow this header:[5]
> > > > > > >  
> > > > > > > @@ -689,25 +932,154 @@ device-specific headers:
> > > > > > >  |            ||                    |
> > > > > > >  +------------++--------------------+
> > > > > > >  
> > > > > > > +Note that only Feature Bits 0 to 31 are accessible through the
> > > > > > > +Legacy Interface. When used through the Legacy Interface,
> > > > > > > +Transitional Devices must assume that Feature Bits 32 to 63
> > > > > > > +are not acknowledged by Driver.
> > > > > > > +
> > > > > > > +
> > > > > > >  2.4.1.3 PCI-specific Initialization And Device Operation
> > > > > > >  --------------------------------------------------------
> > > > > > >  
> > > > > > > -The page size for a virtqueue on a PCI virtio device is defined as
> > > > > > > -4096 bytes.
> > > > > > > -
> > > > > > >  2.4.1.3.1 Device Initialization
> > > > > > >  -------------------------------
> > > > > > >  
> > > > > > > -2.4.1.3.1.1 Queue Vector Configuration
> > > > > > > +This documents PCI-specific steps executed during Device Initialization.
> > > > > > > > +As the first step, the driver must detect the device configuration layout
> > > > > > > > +to locate configuration fields in memory, I/O or configuration space of the
> > > > > > > > +device.
> > > > > > > +
> > > > > > > +2.4.1.3.1.1 Virtio Device Configuration Layout Detection
> > > > > > > +-------------------------------
> > > > > > > +
> > > > > > > +As a prerequisite to device initialization, driver executes a
> > > > > > > +PCI capability list scan, detecting virtio configuration layout using Virtio
> > > > > > > +Structure PCI capabilities.
> > > > > > > +
> > > > > > > > +Virtio Device Configuration Layout includes the virtio configuration header,
> > > > > > > > +Notification, ISR Status and device configuration structures.
> > > > > > > > +Each structure can be mapped by a Base Address register (BAR) belonging to
> > > > > > > > +the function, located beginning at 10h in Configuration Space,
> > > > > > > > +or accessed through PCI configuration space.
> > > > > > > +
> > > > > > > > +The actual location of each structure is specified using a vendor-specific PCI
> > > > > > > > +capability located on the capability list in PCI configuration space of the device.
> > > > > > > +This virtio structure capability uses little-endian format; all bits are
> > > > > > > +read-only:
> > > > > > > +
> > > > > > > +struct virtio_pci_cap {
> > > > > > > +	__u8 cap_vndr;	/* Generic PCI field: PCI_CAP_ID_VNDR */
> > > > > > > +	__u8 cap_next;	/* Generic PCI field: next ptr. */
> > > > > > > +	__u8 cap_len;	/* Generic PCI field: capability length */
> > > > > > > +	__u8 cfg_type;	/* Identifies the structure. */
> > > > > > > +	__u8 bar;	/* Where to find it. */
> > > > > > > +	__u8 padding[3];/* Pad to full dword. */
> > > > > > > +	__le32 offset;	/* Offset within bar. */
> > > > > > > +	__le32 length;	/* Length of the structure, in bytes. */
> > > > > > > +};
> > > > > > > +
> > > > > > > > +This structure can optionally be followed by extra data, depending on
> > > > > > > +other fields, as documented below.
> > > > > > > +
> > > > > > > +The fields are interpreted as follows:
> > > > > > > +
> > > > > > > +cap_vndr
> > > > > > > +	0x09; Identifies a vendor-specific capability.
> > > > > > > +
> > > > > > > +cap_next
> > > > > > > +	Link to next capability in the capability list in the configuration space.
> > > > > > > +
> > > > > > > +cap_len
> > > > > > > +	Length of the capability structure, including the whole of
> > > > > > > +	struct virtio_pci_cap, and extra data if any.
> > > > > > > +	This length might include padding, or fields unused by the driver.
> > > > > > > +
> > > > > > > +cfg_type
> > > > > > > +	identifies the structure, according to the following table.
> > > > > > > +
> > > > > > > +	/* Common configuration */
> > > > > > > +	#define VIRTIO_PCI_CAP_COMMON_CFG	1
> > > > > > > +	/* Notifications */
> > > > > > > +	#define VIRTIO_PCI_CAP_NOTIFY_CFG	2
> > > > > > > +	/* ISR Status */
> > > > > > > +	#define VIRTIO_PCI_CAP_ISR_CFG		3
> > > > > > > +	/* Device specific configuration */
> > > > > > > +	#define VIRTIO_PCI_CAP_DEVICE_CFG	4
> > > > > > > +
> > > > > > > +	More than one capability can identify the same structure - this makes it
> > > > > > > +	possible for the device to expose multiple interfaces to drivers.  The order of
> > > > > > > +	the capabilities in the capability list specifies the order of preference
> > > > > > > +	suggested by the device; drivers should use the first interface that they can
> > > > > > > +	support.  For example, on some hypervisors, notifications using IO accesses are
> > > > > > > +	faster than memory accesses. In this case, hypervisor can expose two
> > > > > > > +	capabilities with cfg_type set to VIRTIO_PCI_CAP_NOTIFY_CFG:
> > > > > > > +	the first one addressing an I/O BAR, the second one addressing a memory BAR.
> > > > > > > +	Driver will use the I/O BAR if I/O resources are available, and fall back on
> > > > > > > +	memory BAR when I/O resources are unavailable.
> > > > > > > +
> > > > > > > +bar
> > > > > > > +
> > > > > > > +	values 0x0 to 0x5 specify a Base Address register (BAR) belonging to
> > > > > > > +	the function located beginning at 10h in Configuration Space
> > > > > > > +	and used to map the structure into Memory or I/O Space.
> > > > > > > > +	The BAR is permitted to be either 32-bit or 64-bit, and it can map Memory Space
> > > > > > > > +	or I/O Space.
> > > > > > > +
> > > > > > > +	The value 0xF specifies that the structure is in PCI configuration space
> > > > > > > +	inline with this capability structure, following (not necessarily immediately)
> > > > > > > +	the length field.
> > > > > > 
> > > > > > Why not immediately?
> > > > > >  Or how would the driver know where it is?
> > > > > 
> > > > > It's at the offset.
> > > > > 
> > > > > E.g. for notification we stick multiplier after length.
> > > > > Further, we might extend virtio_pci_cap in the future,
> > > > > and we don't want to move stuff around like we
> > > > > had to with MSI-X.
> > > > > 
> > > > > > > +
> > > > > > > +offset
> > > > > > > +	indicates where the structure begins relative to the base address associated
> > > > > > > +	with the BAR. If bar specifies configuration space, offset is relative
> > > > > > > +	to start of virtio_pci_cap structure.
> > > > > > > +
> > > > > > > +length
> > > > > > > +	indicates the length of the structure.
> > > > > > > +	This size might include padding, or fields unused by the driver.
> > > > > > > > +	Drivers are also advised to map only the part of the configuration structure
> > > > > > > > +	that is large enough for device operation.
> > > > > > > +	For example, a future device might present a large structure size of several
> > > > > > > +	MBytes.
> > > > > > > +	As current devices never utilize structures larger than 4KBytes in size,
> > > > > > > +	driver can limit the mapped structure size to e.g.
> > > > > > > +	4KBytes to allow forward compatibility with such devices without loss of
> > > > > > > +	functionality and without wasting resources.
> > > > > > > +
> > > > > > > +
> > > > > > > +If cfg_type is VIRTIO_PCI_CAP_NOTIFY_CFG this structure is immediately followed
> > > > > > > +by additional fields:
> > > > > > > +
> > > > > > > +struct virtio_pci_notify_cap {
> > > > > > > +	struct virtio_pci_cap cap;
> > > > > > > +	__le32 notify_off_multiplier;	/* Multiplier for queue_notify_off. */
> > > > > > > +};
> > > > > > > +
> > > > > > > +notify_off_multiplier
> > > > > > > +
> > > > > > > +	Virtqueue offset multiplier, in bytes. Must be even and either a power of two, or 0.
> > > > > > > +	Value 0x1 is reserved.
> > > > > > > +	For a given virtqueue, the address to use for notifications is calculated as follows:
> > > > > > > +
> > > > > > > +	queue_notify_off * notify_off_multiplier + offset
> > > > > > > +
> > > > > > > +	If notify_off_multiplier is 0, all virtqueues use the same address in
> > > > > > > +	the Notifications structure!
> > > > > > > +
> > > > > > > +
> > > > > > > > +2.4.1.3.1.2 Legacy Interface: A Note on Device Layout Detection
> > > > > > > +-------------------------------
> > > > > > > +
> > > > > > > > +Legacy drivers skipped the Device Layout Detection step, assuming legacy
> > > > > > > > +configuration space in BAR0 in I/O space unconditionally.
> > > > > > > +
> > > > > > > +2.4.1.3.1.3 Queue Vector Configuration
> > > > > > >  --------------------------------------
> > > > > > >  
> > > > > > >  When MSI-X capability is present and enabled in the device 
> > > > > > > -(through standard PCI configuration space) 4 bytes at byte offset 
> > > > > > > -20 are used to map configuration change and queue interrupts to 
> > > > > > > -MSI-X vectors. In this case, the ISR Status field is unused, and 
> > > > > > > -device specific configuration starts at byte offset 24 in virtio 
> > > > > > > -header structure. When MSI-X capability is not enabled, device 
> > > > > > > -specific configuration starts at byte offset 20 in virtio header.
> > > > > > > +(through standard PCI configuration space) Configuration/Queue
> > > > > > > +MSI-X Vector registers are used to map configuration change and queue
> > > > > > > +interrupts to MSI-X vectors. In this case, the ISR Status is unused.
> > > > > > >  
> > > > > > >  Writing a valid MSI-X Table entry number, 0 to 0x7FF, to one of 
> > > > > > >  Configuration/Queue Vector registers, maps interrupts triggered 
> > > > > > > @@ -732,7 +1104,7 @@ success, the previously written value is returned, and on
> > > > > > >  failure, NO_VECTOR is returned. If a mapping failure is detected, 
> > > > > > >  the driver can retry mapping with fewervectors, or disable MSI-X.
> > > > > > >  
> > > > > > > -2.4.1.3.1.2 Virtqueue Configuration
> > > > > > > +2.4.1.3.1.4 Virtqueue Configuration
> > > > > > >  -----------------------------------
> > > > > > >  
> > > > > > >  As a device can have zero or more virtqueues for bulk data 
> > > > > > > @@ -749,9 +1121,11 @@ This is done as follows, for each virtqueue a device has:
> > > > > > >    always a power of 2. This controls how big the virtqueue is 
> > > > > > >    (see 2.1.4 Virtqueues). If this field is 0, the virtqueue does not exist. 
> > > > > > >  
> > > > > > > -3. Allocate and zero virtqueue in contiguous physical memory, on 
> > > > > > > -  a 4096 byte alignment. Write the physical address, divided by 
> > > > > > > -  4096 to the Queue Address field.[6]
> > > > > > > +3. Optionally, select a smaller virtqueue size and write it in the Queue Size
> > > > > > > +   field.
> > > > > > > +
> > > > > > > +3. Allocate and zero Descriptor Table, Available and Used rings for the
> > > > > > > +   virtqueue in contiguous physical memory.
> > > > > > >  
> > > > > > >  4. Optionally, if MSI-X capability is present and enabled on the 
> > > > > > >    device, select a vector to use to request interrupts triggered 
> > > > > > > @@ -760,14 +1134,21 @@ This is done as follows, for each virtqueue a device has:
> > > > > > >    Queue Vector field: on success, previously written value is 
> > > > > > >    returned; on failure, NO_VECTOR value is returned.
> > > > > > >  
> > > > > > > +
> > > > > > > +2.4.1.3.1.4.1 Legacy Interface: A Note on Virtqueue Configuration
> > > > > > > +-----------------------------------
> > > > > > > +When using the legacy interface, the page size for a virtqueue on a PCI virtio
> > > > > > > +device is defined as 4096 bytes.  Driver writes the physical address, divided
> > > > > > > +by 4096 to the Queue Address field [6].
> > > > > > > +
> > > > > > >  2.4.1.3.2 Notifying The Device
> > > > > > >  ------------------------------
> > > > > > >  
> > > > > > >  Device notification occurs by writing the 16-bit virtqueue index 
> > > > > > > -of this virtqueue to the Queue Notify field of the virtio header 
> > > > > > > -in the first I/O region of the PCI device.
> > > > > > > +of this virtqueue to the Queue Notify field.
> > > > > > >  
> > > > > > >  2.4.1.3.3 Receiving Used Buffers From The Device
> > > > > > > +------------------------------
> > > > > > >  
> > > > > > >  If an interrupt is necessary:
> > > > > > >  
> > > > > > > @@ -2798,7 +3179,10 @@ the non-PCI implementations (currently lguest and S/390).
> > > > > > >  This is only allowed if the driver does not use any features 
> > > > > > >  which would alter this early use of the device.
> > > > > > >  
> > > > > > > -[5] ie. once you enable MSI-X on the device, the other fields move. 
> > > > > > > +[5] When MSI-X capability is enabled, device specific configuration starts at
> > > > > > > +byte offset 24 in virtio header structure. When MSI-X capability is not
> > > > > > > +enabled, device specific configuration starts at byte offset 20 in virtio
> > > > > > > +header.  ie. once you enable MSI-X on the device, the other fields move. 
> > > > > > >  If you turn it off again, they move back!
> > > > > > 
> > > > > > Thanks,
> > > > > > Rusty.
> > > > 
> > > > Cornelia
> > > 
> > > 
> 
> 



