Subject: RE: [virtio-comment] Re: [PATCH v1 3/8] device-context: Define the device context fields for device migration


> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Wednesday, November 1, 2023 2:36 PM
> 
> On Wed, Nov 01, 2023 at 08:49:40AM +0000, Parav Pandit wrote:
> > > So this:
> > >
> > > +struct virtio_dev_ctx_pci_vq_cfg {
> > > +        le16 vq_index;
> > > +        le16 queue_size;
> > > +        le16 queue_msix_vector;
> > > +        le64 queue_desc;
> > > +        le64 queue_driver;
> > > +        le64 queue_device;
> > > +};
> > >
> > >
> > > duplicates a bunch of fields from this:
> > >
> > Not really. Above is current VQ's configuration not visible in the config space
> directly.
> > Below is already captured as part of VIRTIO_DEV_CTX_PCI_COMMON_CFG.
> 
> I really wanted to help you understand what I mean when I say that the spec
> effort is duplicated.
> 
Yes, I will try my best to grasp your ideas and improve it in v4.
> 
> > > struct virtio_pci_common_cfg {
> > >         /* About the whole device. */
> > >         __le32 device_feature_select;   /* read-write */
> > >         __le32 device_feature;          /* read-only */
> > >         __le32 guest_feature_select;    /* read-write */
> > >         __le32 guest_feature;           /* read-write */
> > >         __le16 msix_config;             /* read-write */
> > >         __le16 num_queues;              /* read-only */
> > >         __u8 device_status;             /* read-write */
> > >         __u8 config_generation;         /* read-only */
> > >
> > >         /* About a specific virtqueue. */
> > >         __le16 queue_select;            /* read-write */
> > >         __le16 queue_size;              /* read-write, power of 2. */
> > >         __le16 queue_msix_vector;       /* read-write */
> > >         __le16 queue_enable;            /* read-write */
> > >         __le16 queue_notify_off;        /* read-only */
> > >         __le32 queue_desc_lo;           /* read-write */
> > >         __le32 queue_desc_hi;           /* read-write */
> > >         __le32 queue_avail_lo;          /* read-write */
> > >         __le32 queue_avail_hi;          /* read-write */
> > >         __le32 queue_used_lo;           /* read-write */
> > >         __le32 queue_used_hi;           /* read-write */
> > > };
> > >
> > >
> > > Except it's incomplete and I suspect that's actually a bug.
> > >
> > It is not a bug.
> 
> For example. queue_enable is not there.
> 
The queue_enable you rightly mentioned is present in struct virtio_dev_ctx_vq_split_runtime.
But thinking about it more, I agree it should move to struct virtio_dev_ctx_pci_vq_cfg.
This way packed virtqueues can also utilize the same struct.
Will move it to virtio_dev_ctx_pci_vq_cfg in v4.
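
As a rough sketch (only to show the stated direction, not the final v4 layout), the combined struct could look like this:

struct virtio_dev_ctx_pci_vq_cfg {
        le16 vq_index;
        le16 queue_size;
        le16 queue_msix_vector;
        le16 queue_enable;      /* moved here from the split runtime struct */
        le64 queue_desc;
        le64 queue_driver;
        le64 queue_device;
};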

> > There is some information duplicated. Above struct virtio_pci_common_cfg is
> the snapshot of registers being updated.
> > While struct virtio_dev_ctx_pci_vq_cfg is capturing what is not visible in
> above config snapshot.
> > i.e. queues which are already configured.
> > For at that rare instance there is some duplication of some fields. but this is
> the exception part not be too much worried about.
> 
> Seems more of a rule than an exception - I don't see any fields that are not also
> in config space.
Just to avoid any confusion, I would like to sync on terminology.

We have:
a. the virtio PCI config space of 4K
b. the virtio common config area in a BAR
c. the virtio device config area in a BAR

When you say config space above, you mean #b and #c given the above discussion, right?
Same question for the bright line below where you wrote config space.

> 
> > >
> > > Here's an idea: have a record per field. Use transport offsets as tags.
> > >
> > There are just too many of them.
> > They are logically clubbed in their native structures which are already defined.
> 
> Too many for what? If you don't like what I suggested find some way to avoid
> duplicating everything please.
Too many definitions for each individual field when the holding structure is already defined.
So why split them?

The majority of fields are not duplicated, and the design goal is not to duplicate but to reuse the structs.
All the defined structures are migrated as-is without redefinition.
In the current v3, the virtio common config and the virtio device config are examples of this.

In subsequent work, vq stats, the mac table, the vlan table, the rss config and flow filter entries will be migrated as their default structures.

Between the management plane and the guest driver there will be some duplication, as not all structs may be drafted in the migration context format.
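
To illustrate the reuse point, a device context entry simply carries an existing struct, unmodified, as its value. A rough sketch (struct and field names here are only illustrative, not the exact definitions of the series):

struct virtio_dev_ctx_field {
        le16 type;       /* e.g. VIRTIO_DEV_CTX_PCI_COMMON_CFG */
        le16 reserved;
        le32 length;     /* length of value[] in bytes */
        u8 value[];      /* existing struct, e.g. struct virtio_pci_common_cfg, as-is */
};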

> > >
> > > > > > > We need a stronger compatiblity story here I think.
> > > > > > >
> > > > > > > One way to show how it's designed to work would be to split
> > > > > > > the patches. For example, add queue notify data and queue reset
> separately.
> > > > > > I didn't follow the suggestion. Can you explain splitting
> > > > > > patches and its relation
> > > > > to the structure?
> > > > Did I miss your response? Or it is in below msix?
> > >
> > > exactly.
> > >
> > > > > >
> > > > > > >
> > > > > > > Another is to add MSIX table migration option for when MSIX
> > > > > > > table is passed through to guest.
> > > > > > Yes, this will be added in future when there is actual hypervisor for it.
> > > > >
> > > > > You are tying the architecture to an extremely implementation
> > > > > specific
> > > detail.
> > > > >
> > > > Which part?
> > > >
> > > > Not really. can you please which software will use MSI-X table migration?
> > >
> > > Like, all and any? All hypervisors migrate the msi-x table.
> > >
> > > > > Hypervisors *already* have migrate the MSIX table. Just in a
> > > > > hypervisor specific way.
> > > > MSI-X table is in the PCI BAR memory.
> > >
> > > Exactly. And that means hypervisor should not read it from the
> > > device directly - e.g. with an encrypted device it won't be able to.
> > >
> > Yep.
> 
> And it follows device needs to include
> 
> > > > > queue vector is an index into this table. So the index is
> > > > > migrated through the device but the table itself has to be
> > > > > trapped and emulated by
> > > hypervisor?
> > > > Do you have a hypervisor and a platform that has not done MSI-X
> > > > table
> > > emulation for which you are asking to add?
> > > > I don't know any.
> > > > I am asking to add it when there is a _real_ user of it. Why it
> > > > cannot be added
> > > when the _real_ user arrive?
> > >
> > > Real meaning actual hardware and software implementing it?  By this
> > > definition there's no real user for migration in the spec at all -
> > > all of it can be done by device specific means.
> > This is also an option if vendor specific commands are allowed.
> 
> Once you do that you can just throw all this spec effort out the window, and do
> vdpa. I thought the point was to allow control plane in standard hardware, so
> vendors can compete on best hardware and software is shared?
> 
I think the device context definition could still be kept vendor specific, so a cloud operator who has the virtio devices can work that way.
The whole infrastructure and commands are still well defined.
It is a reasonable approach.

But I personally like this approach of a well-defined and extendible device context, as many liked it offline too.

> > > What's your point? By now we have a reasonably good idea what
> > > hypervisors need to make migration portable. Let's either put all of
> > > it in the spec or not bother at all.
> > >
> > Maybe I was not clear.
> > I am saying lets put the msix table when a user will find it useful in
> incremental manner.
> > For example, lets say we put in the spec in 1.4 version Nov 23, will device
> implement it? mostly yes.
> > Will existing software use it in 2024-25? mostly no, because most platform
> has complexity in this area as cpus have hard coded certain values.
> >
> > So we are asking device to implement something that is not going to be used
> by any forcible future.
> >
> > Hence, the request is, when the hypervisor/cpu vendor asks for it, it will be
> possible to add into the device.
> 
> If it is there then I expect hypervisors will use it, yes.
> 
It would be really good if they use it.
The last time, a few years back, I encountered a cpu limitation on the encoded information and its close link to the remapping table content.

Do you know of any efforts in this area to support this?
In 2021 when I read about it, it was almost a dead-end discussion for x86_64.
Don't have the link handy anymore.

> > > In other words, we need a bright line. I suggest a simple one:
> > > memory is migrated by device, config space by hypervisor.  If not,
> > > suggest another one - but it needs a reasonable rule based on a
> > > hardware not whatever software found expedient to use.
> > >
> > I really liked this bright line.
> > Msix table like rest of the memory area of common config and device config
> will become part of this memory area.
> >
> > I wish we also draw a good bright line between for near term vs long term to
> be practical.
> >
> > Adding MSI-X table in the spec is not hard now or in future, frankly.
> 
> Show me the patch.
Frankly, I don't think it is a reasonable ask at this point to write such a patch.

By parallel reasoning, one could ask for a hypervisor patch for at least one cpu/platform which can do msix bypass and remapping programming.
Would anyone supply that?

My question was just a thought exercise to stay practical and to have a good bright line for what we ask.
Not really asking you to write one.

I agree with your suggestion that it is good for the platform/cpu vendor to think about it.

The MSIX table can be handled with a patch like the one below.
The same can be done for the PBA structure.

diff --git a/device-context.tex b/device-context.tex
index ab19fc9..72f626e 100644
--- a/device-context.tex
+++ b/device-context.tex
@@ -52,7 +52,9 @@ \section{Device Context}\label{sec:Basic Facilities of a Virtio Device / Device
 \hline
 0x105 & VIRTIO_DEV_CTX_VQ_SPLIT_DEV_OWN_DESC & Provides list of virtqueue descriptors owned by device  \\
 \hline
-0x106 - 0xFFF & - & Generic device agnostic range reserved for future \\
+0x106 & VIRTIO_DEV_CTX_PCI_MSIX_TABLE & Provides MSI-X table \\
+\hline
+0x107 - 0xFFF & - & Generic device agnostic range reserved for future \\
 \hline
 \hline
 0x1000 & VIRTIO_DEV_CTX_DEV_CFG & Provides device specific configuration \\
@@ -197,6 +199,12 @@ \subsubsection{Virtqueue Split Mode Device owned Descriptors Context}
 One or multiple entries of \field{struct virtio_dev_ctx_vq_split_dev_descs} may exist, each such
 entry corresponds to a virtqueue identified by the \field{vq_index}.

+\subsubsection{PCI Device MSIX Table Context}
+
+For the field VIRTIO_DEV_CTX_PCI_MSIX_TABLE, \field{type} is set to 0x106.
+The \field{value} is in the format of the MSI-X table structure as defined in the PCIe specification.
+The \field{length} is the length of the whole MSI-X table structure in bytes.
+

