virtio-dev message



Subject: Re: [PATCH v2] Add device reset timeout field


On Fri, Oct 08, 2021 at 01:23:52PM +0000, Parav Pandit wrote:
> 
> 
> > From: Michael S. Tsirkin <mst@redhat.com>
> > Sent: Friday, October 8, 2021 6:27 PM
> 
> > > 2. An SR-IOV VF virtio device in our case takes far less than this, but may take anywhere between 10 msec and 250 msec.
> > > This can happen on firmware where the user enabled 500 SR-IOV VFs.
> > > The PCI spec indicates that all VFs are to initialize within 100 msec. This translates to 0.2 msec for one VF.
> > > In some scenarios it can be hard to initialize a VF in 0.2 msec, depending on what else the firmware is doing at that time.
> > 
> > That's separate from virtio reset though. virtio reset is much lighter weight
> > than a VF reset, all it needs to do is return config space to original values and
> > stop DMA.
> Again, you took the valid example of stopping the DMA of an already initialized device, while the above case is for the first init. :-)
> The virtio device is going through its first reset during initialization. It should be able to tell how long to wait.
> Device firmware may take more than 0.2 msec to finish the initialization needed to serve a virtio device.
> Today's infinite wait works here.

Looks like it's as Cornelia said - nothing to do with reset. E.g. it's
likely the device cannot even serve PCI config before the init is complete.

> The question was about a wild guess by the driver: 100 msec vs 10 msec vs 0.2 msec.
> Is that enough?

So some guidance in the spec on how long it should take will address
this I think.

> > 
> > > 3. A system has one or more virtio boot devices.
> > > One of them happens to be faulty after a firmware upgrade.
> > > The pre-boot environment waits indefinitely. Michael suggests disabling such a PCI slot by means of an abstract Ctrl+C.
> > > If the PCI slot is disabled, that device must be physically taken out for recovery.
> > > Alternatively, if the device advertised a finite timeout and did not boot, the system would give up after that timeout, and the server would pick the second boot option and boot.
> > > Now a system admin can repair the faulty device without physically taking it out.
> > > Will an infinite timeout help here? Or is a device that advertises a finite timeout, and so lets the system recover, more useful?
> > >
> > > 4. A device was hotplugged into the system and, before it is fully probed, a hot unplug is triggered.
> > 
> > 
> > I don't get this one. Are you talking about surprise removal here?
> Yes.
> > The way to handle that is surely not a timeout, we should be able to test for
> > device presence.
> Yes, it should be possible to update the presence state of a device under probe when it is surprise-removed.
> I will look into it more.
> However, this is not the only place a timeout is used.

As in this example, I'd be worried people will rely on timeout instead
of addressing things properly.
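
To illustrate the point about testing for device presence rather than leaning on a timeout, here is a minimal sketch in C. It assumes a PCI transport, where reads from a function that is no longer there come back as all ones. `struct fake_pci_dev`, `read_vendor_id()`, `read_device_status()` and `reset_and_wait()` are hypothetical names that simulate the device in memory; they are not taken from the spec or from any driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define VENDOR_ABSENT 0xFFFFu /* all ones: nothing answered the read */
#define VENDOR_VIRTIO 0x1AF4u /* PCI vendor ID used by virtio devices */

/* Hypothetical in-memory stand-in for a device under probe. */
struct fake_pci_dev {
    bool     present;        /* false once the device is unplugged */
    unsigned busy_polls;     /* status reads during which reset is pending */
    unsigned unplug_at_poll; /* 0 = never; else removed on that status read */
    unsigned polls;          /* status reads performed so far */
};

/* Hypothetical accessor: vendor ID, or all ones if the device is gone. */
uint16_t read_vendor_id(const struct fake_pci_dev *dev)
{
    return dev->present ? VENDOR_VIRTIO : VENDOR_ABSENT;
}

/* Hypothetical accessor: device status, 0 once the reset has completed. */
uint8_t read_device_status(struct fake_pci_dev *dev)
{
    dev->polls++;
    if (dev->unplug_at_poll != 0 && dev->polls >= dev->unplug_at_poll)
        dev->present = false;
    if (!dev->present)
        return 0xFF;             /* reads from a removed device: all ones */
    if (dev->polls <= dev->busy_polls)
        return 0x0F;             /* reset still in progress */
    return 0;
}

enum probe_result { PROBE_RESET_DONE, PROBE_DEVICE_GONE };

/* Wait for reset to complete, but stop as soon as the device disappears. */
enum probe_result reset_and_wait(struct fake_pci_dev *dev)
{
    assert(dev != NULL);
    while (read_device_status(dev) != 0) {
        if (read_vendor_id(dev) == VENDOR_ABSENT)
            return PROBE_DEVICE_GONE;
        /* A real driver would sleep here between polls. */
    }
    return PROBE_RESET_DONE;
}
```

The loop has no timeout at all: it ends either when the status reads back as zero or when the presence check fails. On Linux a driver would more likely call an existing helper such as `pci_device_is_present()` than open-code the vendor ID read.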

> > 
> > > The device cannot respond to reset, because it's hot unplugged.
> > > OS waits infinitely for reset to complete.
> > > And system component is stuck just because of one device.
> > > Would a finite timeout help to abort this operation? Yes.
> > 
> > Except if it takes minutes it is not agile enough for many workloads.
> > 
> > >
> > > So is a wild guess of 10 msec for all devices, or an infinite wait, the most efficient way to handle the above scenarios?
> > 
> > Dunno, but as I hope you begin to see, as we start digging into actual
> > requirements, neither does a huge reset promise by the device.
> 
> A finite reset timeout helps make virtio devices more predictable to use.
> 
> > How about some "keepalive" signal then? E.g. a register where each read
> > needs to respond with a different value, if it's the same then device is stuck ...
> 
> A device must be out of reset, with the keepalive feature negotiated, to respond to keepalive requests from the host driver.
> Keepalive is useful after the reset + init stage.
> (Keepalive is also used by NVMe devices, which in addition have a device-ready TIMEOUT with a granularity of 500 msec, similar to the virtio device reset timeout.)

Just to clarify, what I call keepalive here is a counter
providing a different value on each read.
This can conceivably work even before feature negotiation.
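
A rough sketch of how a driver might use such a counter while waiting for reset, in C. Everything here is an assumption for illustration: no such register exists in the spec today, and `struct fake_dev`, `read_keepalive()`, `read_status()` and `wait_for_reset()` are hypothetical names that simulate the device in memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory stand-in for a device being reset. */
struct fake_dev {
    uint32_t counter;    /* value behind the keepalive register */
    bool     stuck;      /* true: device makes no progress at all */
    unsigned busy_polls; /* status reads left before reset completes */
    uint8_t  status;     /* device status; 0 means reset is done */
};

/* Hypothetical register: a live device returns a new value on each read. */
uint32_t read_keepalive(struct fake_dev *dev)
{
    if (!dev->stuck)
        dev->counter++;
    return dev->counter;
}

/* Hypothetical status read: drops to 0 once the reset has completed. */
uint8_t read_status(struct fake_dev *dev)
{
    if (dev->stuck)
        return dev->status;
    if (dev->busy_polls > 0) {
        dev->busy_polls--;
        return dev->status;
    }
    dev->status = 0;
    return 0;
}

enum reset_result { RESET_DONE, RESET_DEVICE_STUCK };

/*
 * Wait for as long as the device keeps proving it is alive.
 * Only equality is compared, so counter wraparound is harmless.
 */
enum reset_result wait_for_reset(struct fake_dev *dev)
{
    assert(dev != NULL);
    uint32_t last = read_keepalive(dev);

    for (;;) {
        if (read_status(dev) == 0)
            return RESET_DONE;

        uint32_t now = read_keepalive(dev);
        if (now == last)
            return RESET_DEVICE_STUCK; /* same value twice: no progress */
        last = now;
        /* A real driver would sleep here between polls. */
    }
}
```

With this shape the driver never needs to know how long a reset may take: a slow device keeps the wait going by advancing the counter, and a dead one is detected on the first repeated value.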

-- 
MST


