virtio-dev message

Subject: Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

From: "Samudrala, Sridhar" <sridhar.samudrala@intel.com>
To: Siwei Liu <loseweigh@gmail.com>, Alexander Duyck <alexander.duyck@gmail.com>
Date: Wed, 21 Feb 2018 19:28:30 -0800

On 2/21/2018 6:35 PM, Samudrala, Sridhar wrote:

On 2/21/2018 5:59 PM, Siwei Liu wrote:

On Wed, Feb 21, 2018 at 4:17 PM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:

On Wed, Feb 21, 2018 at 3:50 PM, Siwei Liu <loseweigh@gmail.com> wrote:

I haven't checked emails for days and did not realize the new revision
had already came out. And thank you for the effort, this revision
really looks to be a step forward towards our use case and is close to
what we wanted to do. A few questions in line.

On Sat, Feb 17, 2018 at 9:12 AM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
On Fri, Feb 16, 2018 at 6:38 PM, Jakub Kicinski <kubakici@wp.pl>wrote:
On Fri, 16 Feb 2018 10:11:19 -0800, Sridhar Samudrala wrote:
Ppatch 2 is in response to the community request for a 3 netdev
solution. However, it creates some issues we'll get into in amoment.
It extends virtio_net to use alternate datapath when available and
registered. When BACKUP feature is enabled, virtio_net drivercreatesan additional 'bypass' netdev that acts as a master device andcontrols
2 slave devices.  The original virtio_net netdev is registered as
'backup' netdev and a passthru/vf device with the same MAC gets
registered as 'active' netdev. Both 'bypass' and 'backup'netdevs areassociated with the same 'pci' device. The user accesses thenetworkinterface via 'bypass' netdev. The 'bypass' netdev chooses'active' netdevas default for transmits when it is available with link up andrunning.
Thank you do doing this.
We noticed a couple of issues with this approach during testing.
- As both 'bypass' and 'backup' netdevs are associated with thesame virtio pci device, udev tries to rename both of them with thesame name and the 2nd rename will fail. This would be OK as long as thefirst netdev to be renamed is the 'bypass' netdev, but the order in whichudev gets
   to rename the 2 netdevs is not reliable.
Out of curiosity - why do you link the master netdev to the virtio
struct device?
The basic idea of all this is that we wanted this to work with an
existing VM image that was using virtio. As such we were trying to
make it so that the bypass interface takes the place of the original
virtio and get udev to rename the bypass to what the original
virtio_net was.
Could it made it also possible to take over the config from VF instead
of virtio on an existing VM image? And get udev rename the bypass
netdev to what the original VF was. I don't say tightly binding the
bypass master to only virtio or VF, but I think we should provide both
options to support different upgrade paths. Possibly we could tweak
the device tree layout to reuse the same PCI slot for the master
bypass netdev, such that udev would not get confused when renaming the
device. The VF needs to use a different function slot afterwards.
Perhaps we might need to a special multiseat like QEMU device for that
purpose?

Our case we'll upgrade the config from VF to virtio-bypass directly.

So if I am understanding what you are saying you are wanting to flip
the backup interface from the virtio to a VF. The problem is that
becomes a bit of a vendor lock-in solution since it would rely on a
specific VF driver. I would agree with Jiri that we don't want to go
down that path. We don't want every VF out there firing up its own
separate bond. Ideally you want the hypervisor to be able to manage
all of this which is why it makes sense to have virtio manage this and
why this is associated with the virtio_net interface.

No, that's not what I was talking about of course. I thought you
mentioned the upgrade scenario this patch would like to address is to
use the bypass interface "to take the place of the original virtio,
and get udev to rename the bypass to what the original virtio_net
was". That is one of the possible upgrade paths for sure. However the
upgrade path I was seeking is to use the bypass interface to take the
place of original VF interface while retaining the name and network
configs, which generally can be done simply with kernel upgrade. It
would become limiting as this patch makes the bypass interface share
the same virtio pci device with virito backup. Can this bypass
interface be made general to take place of any pci device other than
virtio-net? This will be more helpful as the cloud users who has
existing setup on VF interface don't have to recreate it on virtio-net
and VF separately again.

Yes. This sounds interesting. Looks like you want an existing VM imagewith

VF only configuration to get transparent live migration support by adding

virtio_net with BACKUP feature. We may need another feature bit toswitch

between these 2 options.


After thinking some more, this may be more involved than adding a new

feature bit. This requires a netdev created by virtio to take over thename of

a VF netdev associated with a PCI device that may not be plugged in when
the virtio driver is coming up. This definitely requires some new messages
exchanged across the virtio control queue to pass the PCI device info of the
VF netdev.

The other bits get into more complexity then we are ready to handle
for now. I think I might have talked about something similar that I
was referring to as a "virtio-bond" where you would have a PCI/PCIe
tree topology that makes this easier to sort out, and the "virtio-bond
would be used to handle coordination/configuration of a much more
complex interface.

That was one way to solve this problem but I'd like to see simple ways
to sort it out.

FWIW two solutions that immediately come to mind is to export"backup"as phys_port_name of the backup virtio link and/or assign a nameto themaster like you are doing already. I think team uses team%d andbonduses bond%d, soft naming of master devices seems quite natural inthis
case.
I figured I had overlooked something like that.. Thanks for pointing
this out. Okay so I think the phys_port_name approach might resolve
the original issue. If I am reading things correctly what we end up
with is the master showing up as "ens1" for example and the backup
showing up as "ens1nbackup". Am I understanding that right?

The problem with the team/bond%d approach is that it creates a new
netdevice and so it would require guest configuration changes.
IMHO phys_port_name == "backup" if BACKUP bit is set on slave virtio
link is quite neat.
I agree. For non-"backup" virio_net devices would it be okay forus to
just return -EOPNOTSUPP? I assume it would be and that way the legacy
behavior could be maintained although the function still exists.
- When the 'active' netdev is unplugged OR not present on adestination system after live migration, the user will see 2 virtio_netnetdevs.
That's necessary and expected, all configuration applies to themaster
so master must exist.
With the naming issue resolved this is the only item leftoutstanding.
This becomes a matter of form vs function.
The main complaint about the "3 netdev" solution is a bitconfusing to
have the 2 netdevs present if the VF isn't there. The idea is that
having the extra "master" netdev there if there isn't really abond is
a bit ugly.
Is it this uglier in terms of user experience rather than
functionality? I don't want it dynamically changed between 2-netdev
and 3-netdev depending on the presence of VF. That gets back to my
original question and suggestion earlier: why not just hide the lower
netdevs from udev renaming and such? Which important observability
benefits users may get if exposing the lower netdevs?

Thanks,
-Siwei

The only real advantage to a 2 netdev solution is that it looks like
the netvsc solution, however it doesn't behave like it since there are
some features like XDP that may not function correctly if they are
left enabled in the virtio_net interface.

As far as functionality the advantage of not hiding the lower devices
is that they are free to be managed. The problem with pushing all of
the configuration into the upper device is that you are limited to the
intersection of the features of the lower devices. This can be
limiting for some setups as some VFs support things like more queues,
or better interrupt moderation options than others so trying to make
everything work with one config would be ugly.

It depends on how you build it and the way you expect it to work. IMHO
the lower devices don't need to be directly managed at all, otherwise
it ends up with loss of configuration across migration, and it really
does not bring much value than having a general team or bond device.
Users still have to reconfigure those queue settings and interrupt
moderation options after all. The new upper device could take the
assumption that the VF/PT lower device always has superior feature set
than virtio-net in order to apply advanced configuration. The upper
device should remember all configurations previously done and apply
supporting ones to active device automatically when switching the
datapath.

It should be possible to extend this patchset to support migration ofadditional

settings  by enabling additional ndo_ops and ethtool_ops on the upper dev

and propagating them down the lower devices and replaying the settingsafter

the VF is replugged after migration.

Thanks
Sridhar

References:
- [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
  - From: Sridhar Samudrala <sridhar.samudrala@intel.com>
- Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
  - From: Alexander Duyck <alexander.duyck@gmail.com>
- Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
  - From: Siwei Liu <loseweigh@gmail.com>
- Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
  - From: Alexander Duyck <alexander.duyck@gmail.com>
- Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
  - From: Siwei Liu <loseweigh@gmail.com>
- Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device
  - From: "Samudrala, Sridhar" <sridhar.samudrala@intel.com>