

Subject: RE: [virtio-dev] RE: [PATCH v2] Add device reset timeout field



> From: Jason Wang <jasowang@redhat.com>
> Sent: Monday, October 25, 2021 11:12 AM
[..] 
> So what I meant is: if PCI gives us an upper limit for the reset (e.g. 100ms as
> you said), can a virtio device's reset take longer than that? If not, can we simply
> use 100ms for virtio-pci, or can we gain from having a smaller per-device timeout?
> If yes, it looks like a violation of PCI.
No, it is not a violation of the PCI spec at all.
The PCI spec is the transport layer.
Virtio is a layer above the transport.

A PCI device can complete its PCI-level reset, but that does not necessarily mean it is ready to service the upper-layer protocols, which may need more time to finish their own reset.

There are several examples of this in the industry today:
nvme, mlx5, efa, Intel ice, and more, which may need more than 100 msec to reset the upper-layer device.

The second point, which I already explained, is that there is nothing wrong at the PCI level that _requires_ a further reset.
The PCI device is just fine; it is under reset by virtio. Depending on the device implementation, that simply takes more time.
In a few responses in this thread, I already described those different types of devices in replies to Cornelia and Michael.
Please go through the thread; let's not repeat the same discussion again.

> 
> It looks to me like there are several choices:
> 
> 1) driver reset timeout via module parameter or sysfs
> 2) transport driver reset timeout (e.g. via virtio-pci)
> 3) device reset timeout
> 
> If we can solve the issue with 1) and 2), there's no need for 3), considering
> 3) requires changes in various places and may bring extra trouble.
> 
What are those extra troubles?
The device reset timeout addresses #1, #2, and #3 without any of the extra user config knobs. This delivers less painful systems to the user, and it is a vendor-agnostic method.

Actually, sysfs is more troublesome: the user needs to reboot the system and write scripts to configure these values even before kernel built-in modules are loaded.
Some pre-boot environments don't have such a sysfs at all.
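
For illustration, with option 1) the user would have to bake something like the following into the system before the built-in module loads; the parameter name here is hypothetical, not an existing virtio_pci parameter:

  # /etc/modprobe.d/virtio.conf (hypothetical module parameter)
  options virtio_pci reset_timeout_ms=200

And a pre-boot environment (e.g. a firmware virtio driver) has no modprobe.d or sysfs to configure in the first place.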

> 
> >
> >> It looks to me that it will guarantee the success of the reset?
> > You are describing the timeout expiration handling.
> 
> 
> This looks more important than just having a device hint for the timeout.
> We don't want to block things like module unload or device removal, etc.
> 
We already discussed with Michael in this thread that for a device which has already been reset before, a second reset should be fast to service anyway.
Please read the discussion in this thread; let's not repeat the same discussion again.
> >>> Still not any different than the rest of the RW capabilities/registers.
> >> So the timeout doesn't work if we can enforce that.
> > Didn't follow your comment. Enforce what?
> 
> 
> I meant the timeout value doesn't work here since the guest can't see it.
Your example was VM migration from sw_without_device_reset to hw_with_device_reset.
And here, you wanted to hide the device reset timeout.
So the guest _should_not_ be able to see it, right?

> 
> 
> >
> >> Seems suboptimal compared to doing a transport reset, if possible.
> > Transport level reset is not helpful for two main reasons.
> > 1. A device under reset has nothing wrong at the PCI level.
> > The device is just still booting/resetting before it can service virtio-level
> > commands. So doing a PCI-level reset actually affects it further adversely,
> > because a PCI-level reset will trigger more layers of unwanted reset.
> 
> 
> Well, looking at the driver code, there are various places where the
> reset() is not expected to fail. My understanding is that your
> proposal only solves the issue where the reset can fail, but not the others.
> That's why we need to seek help from transport layer support.
> 
The driver can issue a transport-level reset when the reset timeout expires.
First we need the reset timeout to be defined, and to let it expire. :)
Once it expires, sure, transport-level resets can be applied.

I think you missed this part: we do not want to wait infinitely in all places, as is done today in the Linux driver.
Wait finitely, and when the finite time expires, sure, the driver can issue a transport-level reset.
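
To make that concrete, a rough sketch of the flow (the helper names below are made up for illustration; this is not the actual Linux virtio API):

  /* Illustrative sketch only; the virtio_read_reset_timeout(),
   * virtio_reset_done() and virtio_transport_reset() helpers are
   * hypothetical. */
  static int virtio_reset_with_timeout(struct virtio_device *vdev)
  {
          /* Device-advertised reset timeout in msec, or a driver
           * default when the device does not advertise one. */
          unsigned int timeout_ms = virtio_read_reset_timeout(vdev);

          vdev->config->reset(vdev);

          /* Wait finitely instead of polling forever. */
          while (timeout_ms--) {
                  if (virtio_reset_done(vdev))    /* hypothetical poll */
                          return 0;
                  msleep(1);
          }

          /* Finite time expired: fall back to a transport-level
           * reset, e.g. PCI FLR for virtio-pci. */
          return virtio_transport_reset(vdev);
  }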

> 
> >
> > 2. PCI level reset must complete in 100msec.
> 
> 
> Did you mean 100ms is too long? If yes, please explain why.
This was discussed, and Cornelia also gave an example in this thread:
a fw-based hw virtio device may not have finished its initialization within 100 msec,
so the virtio layer may need to wait longer.

> 
> 
> >
> > I still don't understand mixing transport level reset with virtio level reset
> > in the context of LM.
> >
> > If I read your original concern correctly, a sw device was exposed on HV-1
> > without a device reset timeout.
> > And now the user wants to migrate this VM to HV-2, which has a hw device with
> > a device reset timeout.
> > And now this device cannot be passed through on HV-2, because the older sw
> > device was attached to the VM.
> > I am frankly not sure if this is really a case to consider.
> 
> 
> Actually, it's a choice for the cloud vendors. But considering there is a
> very large number of instances based on software virtio, it's a
> valid use case. But what I want to say is that if we touch the spec, it
> will require changes in the management layer as well, e.g. if we cannot
> migrate from hardware that supports reset timeout to software that
> doesn't support reset timeout. It would be better if we can do that only
> in the driver.
The explanation above now chooses a different example.
In this example, you mention hw-to-sw migration.
Previously, your example was from inferior sw to newer hw.
 
> 
> Similar, but it is one more burden for the management layer, if a simple 100ms
> timeout can work for virtio-pci.
A resilient spec and a resilient sw/hw stack are not a burden to the management stack.
It is just another hw/sw attribute of the virtio device to take care of.

> 
> > In the other MSI-X example you provided, an emulated MSI-X alternative was
> > proposed. This is again not mapping the full hw device.
> >
> > So it is still a sw device, in other words; the user might as well just
> > continue in sw mode.
> >
> > Let's assume for a moment that this (rare) case should be considered.
> >
> > One way to overcome this is below.
> > (a) Do not place the new device_reset_timeout field in struct
> > virtio_pci_common_cfg.
> > This is because it may be in a memory-mapped BAR directly accessible to the
> > guest, and a future version of the spec may have more fields after the device
> > reset timeout; such fields are somehow implemented by the sw device on HV-1.
> >
> > (b) Instead, define a new cfg_type VIRTIO_PCI_CAP_DEV_TIMEOUT_CFG.
> > This points to the struct containing the device reset timeout.
> > Exposing this capability can be skipped when mapping the hw virtio device to
> > the VM on HV-2, to match the sw virtio device of HV-1.
> 
> 
> This will ease the hypervisor's job, but anyhow the hypervisor can choose not to
> expose the capability directly to the guest (which will soon become a burden
> for supporting cross-vendor live migration).
Again, it is not a burden.
If live migration sw needs the ability to migrate between vendors, the right approach is to have attributes in the spec,
so that the sw has the ability to understand where a device can migrate.

As I explained, there are many RW features in the virtio spec that live migration sw needs to take care of in order to migrate across vendors.
A new RO field that improves the sw stack cannot be called a burden on the live migration sw.
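
For concreteness, option (b) above could look roughly like the sketch below. The exact layout and units are not defined anywhere yet, so this is only a guess modeled on the existing virtio-pci capability structures:

  /* Illustrative sketch only; field layout and units are assumptions. */
  struct virtio_pci_dev_timeout_cfg {
          struct virtio_pci_cap cap; /* cap.cfg_type = VIRTIO_PCI_CAP_DEV_TIMEOUT_CFG */
          le16 device_reset_timeout; /* RO: device reset timeout hint */
          le16 padding;
  };

A hypervisor that wants to match an older sw device then simply does not expose this capability in the emulated PCI capability list.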

