virtio-dev message

Subject: RE: [PATCH v2] Add device reset timeout field

From: Parav Pandit <parav@nvidia.com>
To: "Michael S. Tsirkin" <mst@redhat.com>
Date: Fri, 8 Oct 2021 10:51:02 +0000


> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Friday, October 8, 2021 3:42 PM

> > When a device migrates to destination, it starts from where the device
> > left off on the source side.  So yes, destination side, device must be
> > usable (out of reset), and after that its current state will be
> > overwritten by the migrating device.
> 
> I get what you are trying to say here but it's a hack. 
What is a hack? I didn't follow.

> Nothing prevents a reset
> for driver's internal reasons at any point, and in particular reset is used e.g. for
> driver removal.
Sure, nothing prevents a reset on the destination. If the device is reset while device state restoration is in progress, the restoration will simply not go through.

> >  If you ask, does migration
> > overwrite the reset timeout register value? I would say no, because
> > how long device would take to reset is decided by the destination side
> > implementation.
> 
> Problem is, driver can cache the value on source. Then it's migrated and used
> on destination when driver wants to reset the device.
> This can lead to a timeout if the destination does not finish within the source
> timeout value.
>
Usually a driver is interested in caching fields that it uses in the data path. The device reset timeout is not one of them.
But fair enough, a driver can cache whichever value it chooses.
As with other fields of a migratable device, the destination backend needs to have the same or a lower device reset timeout.
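
The "same or lower" rule can be sketched as a small check. This is only an illustration: the helper name is hypothetical, and the millisecond unit is an assumption rather than something taken from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of any spec or driver.
 * Assumption: both timeouts are expressed in the same unit (here, ms).
 *
 * A driver may have cached the source's reset timeout before migration.
 * That cached value stays safe only if the destination promises to
 * finish a reset within the same time or less.
 */
static bool reset_timeout_compatible(uint32_t src_timeout_ms,
                                     uint32_t dst_timeout_ms)
{
    return dst_timeout_ms <= src_timeout_ms;
}
```

With this check, a destination advertising 500 ms is acceptable for a source that advertised 1000 ms, while the reverse is not.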
 
> That's why I ask: why do we bother? What's wrong with just waiting forever or
> until user gets tired of this and cancels with CTRL-C?

Today, removal of a device that didn't finish its reset gets stuck, because it is waiting forever.

To my knowledge, modprobe cannot be interrupted with Ctrl-C. In another scenario, a hot-plugged device is probed by the hotplug driver in workqueue context.

It doesn't sound right to pass the burden to the user to invent some kind of Ctrl-C cancel operation in hotplug drivers.
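
For illustration, a bounded wait in place of waiting forever could look roughly like the sketch below. It assumes the driver polls the device status until it reads 0 after requesting a reset. Every name here is hypothetical, and the fake device and fake clock exist only so the sketch can be run on its own. A real driver would read the actual status register and use a real time source.

```c
#include <stdint.h>

/* Hypothetical callbacks so the loop does not depend on real hardware. */
typedef uint8_t (*read_status_fn)(void *dev);
typedef uint64_t (*now_ms_fn)(void *clk);

/*
 * Poll the device status until it reads 0 (reset done) or until
 * timeout_ms has elapsed. Returns 0 on success, -1 on timeout.
 */
static int wait_for_reset(read_status_fn read_status, void *dev,
                          now_ms_fn now_ms, void *clk,
                          uint64_t timeout_ms)
{
    uint64_t start = now_ms(clk);

    for (;;) {
        if (read_status(dev) == 0)
            return 0;
        if (now_ms(clk) - start >= timeout_ms)
            return -1;
    }
}

/* Fake device: reports a non-zero status for polls_left more reads. */
struct fake_dev {
    unsigned long polls_left;
};

static uint8_t fake_read_status(void *dev)
{
    struct fake_dev *d = dev;

    if (d->polls_left > 0) {
        d->polls_left--;
        return 0x0f; /* arbitrary non-zero status */
    }
    return 0;
}

/* Fake clock: advances by 1 ms each time it is read. */
struct fake_clock {
    uint64_t now;
};

static uint64_t fake_now_ms(void *clk)
{
    struct fake_clock *c = clk;

    return c->now++;
}
```

With a bounded wait like this, a device that never completes its reset makes the caller return an error after the timeout, so removal or probing of that device can fail without blocking the rest of the system.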

> Is there a use-case where that's not good enough?
A guest has one good device and another device that encountered a fault.
The faulty device, which is unable to reset, blocks other operations in the system.

> 
> 
> > And this is probably yet another good reason to define migratable bits
> > of a virtio device in the live migration spec extension.
> 
> "migratable bits" being what? non-guest visible device state? Sure, would be
> great to have.  Don't think it will help in this instance.
> 
> --
> MST
