virtio-dev message

Subject: Re: [virtio-dev] Timing out virtio-pci config space access

From: Srivatsa Vaddagiri <quic_svaddagi@quicinc.com>
To: "Michael S. Tsirkin" <mst@redhat.com>
Date: Fri, 5 Nov 2021 19:42:09 +0530

* Michael S. Tsirkin <mst@redhat.com> [2021-11-05 09:13:27]:

> On Fri, Nov 05, 2021 at 05:59:43PM +0530, Srivatsa Vaddagiri wrote:
> > * Michael S. Tsirkin <mst@redhat.com> [2021-11-05 03:38:39]:
> > 
> > > On Thu, Nov 04, 2021 at 10:37:40PM +0530, Srivatsa Vaddagiri wrote:
> > > > We are working on a virtio-pci implementation on a Type-1 hypervisor where
> > > > backend drivers are hosted in another VM and are considered untrusted. PCI is
> > > > the virtio transport used in this case.
> > > > 
> > > > One issue that crops up is that a read/write of config space can potentially
> > > > block forever, as the backend is untrusted and could be causing a
> > > > denial of service of sorts. This stalls the vCPU forever. I was wondering if we
> > > > can time out in such a case and have the hypervisor break the stall by letting
> > > > the read return "error" (-1) along with setting DEVICE_NEEDS_RESET in the status
> > > > register. Will that allow the Linux guest driver to fail its probe gracefully?
> > > > I don't see where Linux currently handles DEVICE_NEEDS_RESET, and I am also not
> > > > sure whether returning -1 will lead to graceful failure of the driver alone (we
> > > > don't want the VM to come down or panic because of a misbehaving device).
> > > 
> > > DEVICE_NEEDS_RESET isn't handled ATM. The point of it in any case
> > > is to signal a recoverable error; with a malicious backend this is
> > > not the case.
> > > 
> > > 
> > > One thing you can do that will work a bit better is implementing
> > > surprise removal in this case.
> > 
> > My layman's understanding of surprise removal is that it requires the PCI
> > controller to interrupt the OS and convey which device was removed, so that the
> > PCI subsystem can mark it "removed". Is that right, and is it possible for the
> > generic controller ("pci-host-ecam-generic") that virtio-pci devices use?
> 
> I think so, yes.

Will that also work during device probe?
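
For concreteness, this is roughly the hypervisor-side behaviour I have in mind for the timeout case. It is a minimal sketch in plain C, not code from our hypervisor: struct vdev_state, vdev_cfg_read() and the poll-style backend_poll hook are hypothetical names, and the busy-wait on CLOCK_MONOTONIC is only for illustration (a real hypervisor would deschedule the vCPU rather than spin). Once a config read misses its deadline, the device is latched as surprise-removed: that read and every later one return all-ones, and DEVICE_NEEDS_RESET is set in the emulated status byte. Writes are not modelled.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* virtio 1.x device status bit DEVICE_NEEDS_RESET (64). */
#define VIRTIO_STATUS_DEVICE_NEEDS_RESET 0x40u

/* Hypothetical backend hook: returns true and fills *val once the backend
 * has answered the config access, false while it is still pending. */
typedef bool (*backend_poll_fn)(uint32_t offset, uint32_t *val);

/* Hypothetical hypervisor-side state for one emulated virtio-pci device. */
struct vdev_state {
    bool removed;                 /* latched once the backend has timed out */
    uint8_t device_status;        /* emulated virtio device status byte */
    backend_poll_fn backend_poll;
};

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Emulate a guest config-space read with a deadline. On timeout, treat the
 * device as surprise-removed: return all-ones now and on every later read. */
static uint32_t vdev_cfg_read(struct vdev_state *d, uint32_t offset,
                              uint64_t timeout_ns)
{
    assert(d != NULL && d->backend_poll != NULL);
    if (d->removed)
        return 0xFFFFFFFFu;

    uint64_t deadline = now_ns() + timeout_ns;
    uint32_t val;
    for (;;) {
        if (d->backend_poll(offset, &val))
            return val;           /* backend answered in time */
        if (now_ns() >= deadline)
            break;                /* give up: treat the device as gone */
    }

    d->removed = true;
    d->device_status |= VIRTIO_STATUS_DEVICE_NEEDS_RESET;
    return 0xFFFFFFFFu;
}

/* Example backends for exercising the sketch. */
static bool backend_hung(uint32_t offset, uint32_t *val)
{
    (void)offset;
    (void)val;
    return false;                 /* never answers: models the DoS case */
}

static bool backend_ok(uint32_t offset, uint32_t *val)
{
    *val = 0x1000u + offset;      /* arbitrary deterministic contents */
    return true;
}
```

One caveat with the sketch: once reads return all-ones, a guest read of the status byte also yields 0xFF, which already has the DEVICE_NEEDS_RESET bit (0x40) set, so the guest can only tell the two signals apart if the status register is served from the emulated copy.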

> > > So the hypervisor detects a timeout
> > > (presumably it knows what to expect of the device) and then pretends to
> > > the guest that the device is gone, unmapping it completely from the guest.
> > 
> > Can you elaborate on what unmapping means? I think reads should return -1
> > and writes should be dropped in such a case; beyond that, what would
> > unmapping entail?
> > 
> > Thanks
> > vatsa
> 
> Removing guest access to the device so that access attempts end up in QEMU.

Would QEMU end up terminating the guest, or would it inject some bus error that
the guest VM can handle gracefully? We would prefer not to bring down the VM
because of this; rather, we want the driver probe to fail gracefully.
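
On the guest side, the check I would expect a probe path to need looks roughly like this. Again a hypothetical sketch in plain userspace C, not Linux driver code (per the reply above, Linux does not handle DEVICE_NEEDS_RESET today): the accessor callbacks, check_status(), probe_read_field() and the in-memory fake_dev are all made up for illustration. It relies only on reads from an absent or removed PCI function returning all-ones, and on DEVICE_NEEDS_RESET being bit 0x40 of the virtio device status.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* virtio 1.x device status bit DEVICE_NEEDS_RESET (64). */
#define VIRTIO_STATUS_DEVICE_NEEDS_RESET 0x40u

/* Hypothetical accessors standing in for the transport's status and
 * device-config reads. */
typedef uint8_t (*read_status_fn)(void *dev);
typedef uint32_t (*read_cfg32_fn)(void *dev, unsigned int offset);

/* Classify the device status byte. All-ones is checked first because 0xFF
 * also has the DEVICE_NEEDS_RESET bit set. */
static int check_status(uint8_t status)
{
    if (status == 0xFFu)
        return -ENODEV;           /* read hit a removed/unresponsive function */
    if (status & VIRTIO_STATUS_DEVICE_NEEDS_RESET)
        return -EIO;              /* device asks for a reset */
    return 0;
}

/* Hypothetical probe step: read one 32-bit config field and fail with an
 * error code instead of trusting a garbage value. All-ones can be a legal
 * field value, so it only triggers a re-check of the status byte. */
static int probe_read_field(void *dev, read_status_fn read_status,
                            read_cfg32_fn read_cfg32, unsigned int offset,
                            uint32_t *out)
{
    assert(out != NULL);
    int err = check_status(read_status(dev));
    if (err)
        return err;

    uint32_t v = read_cfg32(dev, offset);
    if (v == 0xFFFFFFFFu) {
        err = check_status(read_status(dev));
        if (err)
            return err;
    }
    *out = v;
    return 0;
}

/* Example in-memory device for exercising the sketch. */
struct fake_dev {
    uint8_t status;
    uint32_t cfg;
};

static uint8_t fake_status(void *dev)
{
    return ((struct fake_dev *)dev)->status;
}

static uint32_t fake_cfg32(void *dev, unsigned int offset)
{
    (void)offset;
    return ((struct fake_dev *)dev)->cfg;
}
```

The reason for re-reading status is that an all-ones config value alone is ambiguous; only a status byte of 0xFF is treated as "device gone", and the probe then fails with -ENODEV instead of taking the VM down.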

Thanks
vatsa
