

Subject: Re: [virtio-dev] [RFC PATCH net-next v2 1/2] virtio_net: Introduce VIRTIO_NET_F_BACKUP feature bit


On Wed, Jan 17, 2018 at 01:49:58PM -0800, Alexander Duyck wrote:
> On Wed, Jan 17, 2018 at 11:57 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Wed, Jan 17, 2018 at 11:25:41AM -0800, Samudrala, Sridhar wrote:
> >>
> >>
> >> On 1/17/2018 11:02 AM, Michael S. Tsirkin wrote:
> >> > On Wed, Jan 17, 2018 at 10:15:52AM -0800, Alexander Duyck wrote:
> >> > > On Thu, Jan 11, 2018 at 9:58 PM, Sridhar Samudrala
> >> > > <sridhar.samudrala@intel.com> wrote:
> >> > > > This feature bit can be used by the hypervisor to indicate that the
> >> > > > virtio_net device should act as a backup for another device with the
> >> > > > same MAC address.
> >> > > >
> >> > > > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> >> > > > ---
> >> > > >   drivers/net/virtio_net.c        | 2 +-
> >> > > >   include/uapi/linux/virtio_net.h | 3 +++
> >> > > >   2 files changed, 4 insertions(+), 1 deletion(-)
> >> > > >
> >> > > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> >> > > > index 12dfc5fee58e..f149a160a8c5 100644
> >> > > > --- a/drivers/net/virtio_net.c
> >> > > > +++ b/drivers/net/virtio_net.c
> >> > > > @@ -2829,7 +2829,7 @@ static struct virtio_device_id id_table[] = {
> >> > > >          VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ, \
> >> > > >          VIRTIO_NET_F_CTRL_MAC_ADDR, \
> >> > > >          VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, \
> >> > > > -       VIRTIO_NET_F_SPEED_DUPLEX
> >> > > > +       VIRTIO_NET_F_SPEED_DUPLEX, VIRTIO_NET_F_BACKUP
> >> > > >
> >> > > >   static unsigned int features[] = {
> >> > > >          VIRTNET_FEATURES,
> >> > > > diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
> >> > > > index 5de6ed37695b..c7c35fd1a5ed 100644
> >> > > > --- a/include/uapi/linux/virtio_net.h
> >> > > > +++ b/include/uapi/linux/virtio_net.h
> >> > > > @@ -57,6 +57,9 @@
> >> > > >                                           * Steering */
> >> > > >   #define VIRTIO_NET_F_CTRL_MAC_ADDR 23  /* Set MAC address */
> >> > > >
> >> > > > +#define VIRTIO_NET_F_BACKUP      62    /* Act as backup for another device
> >> > > > +                                        * with the same MAC.
> >> > > > +                                        */
> >> > > >   #define VIRTIO_NET_F_SPEED_DUPLEX 63   /* Device set linkspeed and duplex */
> >> > > >
> >> > > >   #ifndef VIRTIO_NET_NO_LEGACY
> >> > > I'm not a huge fan of the name "backup", since it implies that the
> >> > > virtio interface is only used if the VF is not present. There are
> >> > > multiple cases, such as east/west or broadcast/multicast traffic,
> >> > > where it may be desirable to use the para-virtual interface rather
> >> > > than deal with the PCI overhead/bottleneck to send the packet.
> >> > Right now hypervisors mostly expect that yes, only one is used at a
> >> > time.  E.g. if you send multicast packets on both the VF and virtio,
> >> > you will end up with two copies of each packet.
> >>
> >> I think we want to use only one interface to send out any packet. In the
> >> case of broadcasts/multicasts it would be an optimization to send them
> >> via virtio, and this patch series adds that optimization.
> >
> > Right that's what I think we should rather avoid for now.
> >
> > It's *not* an optimization if there's a single VM on this host,
> > or if a specific multicast group does not have any VMs on the
> > same host.
> 
> Agreed. In my mind this is something that is controlled by the
> pass-thru interface once it is enslaved.

It would be pretty tricky to control this through the PT
interface, since a PT interface pretends to be a physical
device, which has no concept of VMs.

> > I'd rather we just sent everything out on the PT if that's
> > there. The reason we have virtio in the picture is just so
> > we can migrate without downtime.
> 
> I wasn't saying we do that in all cases. That would be something that
> would have to be decided by the pass-thru interface. Ideally the
> virtio would provide just enough information to get itself into the
> bond, and I see this as being the mechanism for it to do so. From
> there, the complexity mostly lies in the pass-thru interface, which
> has to configure the correct transmit modes if, for example, you have
> multiple pass-thru interfaces or a more complex traffic setup due to
> things like SwitchDev.
> 
> In my mind we go the bonding route, and there are a few use cases for
> all of this. First is the backup case that is being addressed here.
> That becomes your basic "copy netvsc" approach, which would be the
> default. It is how we would handle basic pass-thru backup paths. If
> the host decides to send multicast/broadcast traffic up through the
> virtio interface, that is a host-side decision. I am okay with our
> default transmit behavior from the guest being to send everything
> through the pass-thru interface. All the virtio would be doing is
> advertising that it is a side channel for some sort of bond with
> another interface. Calling it a side channel implies that it isn't
> the direct channel for the interface, but rather an alternative
> communications channel for things like backup and other future uses.
> 
> There end up being a few different "phase 2" concepts I have floating
> around which I want to avoid limiting. Solving the east/west problem
> is an example. In my mind that just becomes a bonding Tx mode that is
> controlled via the pass-thru interface. The VF could blacklist the
> local addresses so that they instead fall through to the virtio
> interface. In addition, I seem to recall watching a presentation from
> Mellanox where they were talking about a VF per NUMA node in a system,
> which would imply multiple VFs with the same MAC address. I'm not sure
> how the current patch handles that. Obviously that would require a
> more complex load-balancing setup. The bonding solution ends up being
> a way to resolve that, so that it could take care of picking the
> right Tx queue based on NUMA affinity and falling back to
> virtio/netvsc when those fail.

The way I see it, we'd need to pass a bunch of extra information
from host to guest, and we'd have to use a PV interface for it.
When we do this, we'll need another feature bit, and we can call
it SIDE_CHANNEL or whatever.
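Purely for illustration, such a bit would sit next to the existing
ones in include/uapi/linux/virtio_net.h and be gated on at probe
time.  The name and bit number below are made up:

#define VIRTIO_NET_F_SIDE_CHANNEL 61	/* hypothetical: PV side channel
					 * carrying extra host-to-guest
					 * information
					 */

	/* somewhere in virtnet_probe(): */
	if (virtio_has_feature(vdev, VIRTIO_NET_F_SIDE_CHANNEL))
		/* ...fetch and parse the extra host-provided data... */;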


> >> In the receive path, broadcasts should only go to the PF and reach the
> >> VM via virtio, so that the VM doesn't see duplicate broadcasts.
> >>
> >>
> >> >
> >> > To me the east/west scenario looks like you want something
> >> > more similar to a bridge on top of the virtio/PT pair.
> >> >
> >> > So I suspect that use-case will need a separate configuration bit,
> >> > and possibly that's when you will want something more powerful
> >> > such as a full bridge.
> >>
> >> East-west optimization of unicasts would be a harder problem to solve,
> >> as the hypervisor somehow needs to inform the VM about the local
> >> destination MACs.
> >
> > Using a bridge with a dedicated device for east/west would let the
> > bridge use standard learning techniques for that, perhaps?
> 
> I'm not sure about having the guest have to learn.

It's certainly a way to do this, but not the only one.

> In my mind the VF
> should know who is local and who isn't.

Right. But note that these things change.

> In the case of SwitchDev it
> should be possible for the port representors and the switch to provide
> data on which interfaces are bonded on the host side and which aren't.
> With that data it would be pretty easy to just put together a list of
> addresses that would prefer to go the para-virtual route instead of
> being transmitted through physical hardware.
> 
> In addition, a bridge implies much more overhead, since normally a
> bridge can receive a packet on one interface and transmit it on
> another. We don't really need that. This is more of a VEPA type setup
> and doesn't need to be anything all that complex. You could probably
> even handle the Tx queue selection via a simple eBPF program and map,
> since the inputs used to select the Tx path should be pretty simple
> (destination MAC, source NUMA node, etc.) and the data set shouldn't
> be too large.
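For illustration, a map-driven classifier along those lines might look
roughly like the sketch below.  This is purely an assumption-laden
sketch: the map name, the tc attach point, and the mark-based verdict
are all made up, and the real policy would live wherever the bond makes
its Tx decision.

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>

struct mac_key {
	__u8 addr[ETH_ALEN];
};

/* Destination MACs known to be local, filled in from the host side. */
struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 256);
	__type(key, struct mac_key);
	__type(value, __u32);		/* 1 = prefer the para-virtual path */
} local_macs SEC(".maps");

SEC("tc")
int pick_tx_path(struct __sk_buff *skb)
{
	struct mac_key key = {};
	__u32 *prefer_pv;

	/* Destination MAC is the first six bytes of the Ethernet header. */
	if (bpf_skb_load_bytes(skb, 0, key.addr, ETH_ALEN) < 0)
		return TC_ACT_OK;

	prefer_pv = bpf_map_lookup_elem(&local_macs, &key);
	if (prefer_pv && *prefer_pv)
		skb->mark = 1;	/* stand-in verdict: steer to the PV path */

	return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";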

That sounds interesting. A separate device might make this kind of setup
a bit easier.  Sridhar, did you look into creating a separate device for
the virtual bond at all?  It does not have to be in a separate module;
that kind of refactoring can come later. But once we commit to using the
same single device as virtio, we can't change that.
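For concreteness, a minimal sketch of what registering such a separate
upper device in the guest could look like.  Every name here is
hypothetical, and the xmit stub stands in for the real
pick-the-VF-or-virtio decision:

#include <linux/etherdevice.h>
#include <linux/netdevice.h>

static netdev_tx_t virt_bond_start_xmit(struct sk_buff *skb,
					struct net_device *dev)
{
	/* Real code would forward to the VF slave when present,
	 * otherwise to the virtio slave; this stub just drops.
	 */
	dev_kfree_skb_any(skb);
	return NETDEV_TX_OK;
}

static const struct net_device_ops virt_bond_netdev_ops = {
	.ndo_start_xmit	= virt_bond_start_xmit,
};

static struct net_device *virt_bond_create(void)
{
	struct net_device *dev = alloc_etherdev(0);

	if (!dev)
		return NULL;
	dev->netdev_ops = &virt_bond_netdev_ops;
	if (register_netdev(dev)) {
		free_netdev(dev);
		return NULL;
	}
	return dev;
}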


> >> > > What if instead of BACKUP we used the name SIDE_CHANNEL? Basically it
> >> > > is a bit of a double entendre: we are using the physical MAC address
> >> > > to provide configuration information, and in addition this interface
> >> > > acts as a secondary channel for passing frames to and from the guest
> >> > > rather than just using the VF.
> >> > >
> >> > > Just a thought.
> >> > >
> >> > > Thanks.
> >> > >
> >> > > - Alex
> >> > I just feel that's a very generic name, not conveying enough information
> >> > about how the hypervisor expects the pair of devices to be used.
> 
> I would argue that BACKUP is too specific. I can see many uses other
> than just being a backup interface,

True, but the ones you listed above all seem to require virtio
changes anyway; we'll be able to add a new feature or
rename this one as we make them.

> and I don't want us to box
> ourselves in. I agree that it makes sense for active/backup to be the
> base mode, but I really don't think it conveys all of the
> possibilities, since I see this as possibly being able to solve
> multiple issues as this evolves. I'm also open to other suggestions.
> We could go with PASSTHRU_SIDE_CHANNEL, for instance; I just didn't
> suggest that since it seemed kind of wordy.

There's only so much info a single bit can confer :)
We can always rename later; the point is to draw the line
and say "this is the functionality current hosts expect".
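Concretely, all the guest can do with the single bit today is gate on
it.  A sketch, where virtnet_enable_backup() is a hypothetical helper
and only virtio_has_feature() is existing API:

static void virtnet_check_backup(struct virtnet_info *vi)
{
	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP))
		virtnet_enable_backup(vi);	/* hypothetical */
}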

-- 
MST

