OASIS Mailing List Archives

virtio-dev message


Subject: Re: [virtio-dev] [PATCH RESEND v4 1/1] Add virtio-iommu device specification


On 25.11.19 13:35, Jean-Philippe Brucker wrote:
Hi Jan,

On Mon, Nov 25, 2019 at 08:30:29AM +0100, Jan Kiszka wrote:
What's the impact of a fault on the device(s) under the IOMMU regime? Can
they recover?

Are you asking about what happens to the endpoints when the virtio-iommu
encounters an internal error? Or what happens to the endpoints if their
DMA transactions fail translation? I think they are both equivalent to
"what happens when the endpoint's memory transaction aborts?". The answer
to that depends on the bus and endpoint, and is out of scope. The
virtio-iommu spec could state that in those cases, we abort the memory
transaction, but it's too vague since we don't know the specifics of the
bus, and it isn't necessarily true (see VT-d and SMMU below).

I'm interested in both cases in how to recover the endpoints - of course also the virtual IOMMU, but that seems clear - from such errors. Can we describe a procedure that will always lead to a working setup again, even if there might be simpler ways on certain setups? I mean, the answer shouldn't be "reset the host"...


Or will they get DEVICE_NEEDS_RESET as well?

If the endpoint is virtio, then the behavior upon DMA fault should be
specified by the virtio transport, because it could happen without an
IOMMU (e.g. trying to access a physical address that isn't mapped to RAM
or MMIO), or with a VT-d emulation for example.

But it's not necessarily virtio. It can be a hardware passed-through
endpoint, in which case the abort behavior depends on the physical IOMMU,
which virtio-iommu doesn't know anything about, in addition to the
physical bus and endpoint.

I also wouldn't state that the whole device (or function, though we're not
necessarily PCI) needs reset. It might be possible for some devices to
only stop the faulting queue and leave the others running, to avoid
disturbing the rest of the system.

As I said above: There should be at least one known-to-work path, and a way to signal better options when they are available in a concrete setup.


With PCI devices
behind real IOMMUs, it's normal that they need a reset after having caused a
fault. I'm not sure if this is described in the related specs for them, but
it should be clarified for the virtual IOMMU. But this can be done on top,
IMHO.

The device behaviour is generally not specified. However, the IOMMU specs
can say something about the bus:

* For Intel VT-d see 7.2 and 7.2.1 (Non-Recoverable Address Translation
   Faults), where the spec provides various implementation examples.

   "Requests that encounter non-recoverable address translation faults are
   aborted by the remapping hardware, and typically require a reset of the
   device (such as through a function-level-reset) to recover and
   re-initialize the device to put it back into service."

   So requests could be aborted, but as stated later in 7.2.1, they can
   also be redirected to a catch-all memory location.

* For Arm SMMU, the host driver can specify for each context whether
   the SMMU should return an abort (Slave Error on the AMBA bus) or not
   (read-zero, write-ignore).

   The spec also says "The behavior of the client device after termination
   is specific to the device." (3.12.1 Terminate model)

* For AMD IOMMU, "when the IOMMU detects an I/O page fault, it target
   aborts the faulting request." and "the IOMMU sets the legacy PCI
   Signaled Target Abort bit, if applicable" (2.1.3.2 I/O Page Faults).
   I believe the equivalent for the PCIe bus is a Completer Abort response.

They can specify the behaviour with some precision, because they also
specify how the IOMMU is integrated with the system. We don't have this
luxury, because if the virtio-iommu is just a proxy for a physical IOMMU,
we don't know how aborts are configured, and the bus may be a variant of
PCI, AMBA or something else.
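
To make the differences listed above concrete, here is a minimal sketch of the three outcomes for a faulting access: abort, redirection to a catch-all location, and read-zero/write-ignore. All names (FaultPolicy, TransactionAborted, handle_faulting_access) are hypothetical and come from no spec or driver; the sketch only models what the endpoint would observe, not how any IOMMU implements it.

```python
from enum import Enum


class FaultPolicy(Enum):
    # Hypothetical labels for the behaviours described in the thread.
    ABORT = "abort"        # transaction is terminated with a bus error
    REDIRECT = "redirect"  # access lands in a catch-all memory location
    RAZ_WI = "raz-wi"      # read-zero, write-ignore


class TransactionAborted(Exception):
    """Models an aborted memory transaction as seen by the endpoint."""


def handle_faulting_access(policy, is_write, data=b"", length=0, catch_all=None):
    """Return the bytes the endpoint observes for an access that failed translation.

    catch_all is a bytearray standing in for the catch-all location; it is
    only used with FaultPolicy.REDIRECT.
    """
    if policy is FaultPolicy.ABORT:
        raise TransactionAborted("translation fault")
    if policy is FaultPolicy.REDIRECT:
        if is_write:
            # The write succeeds from the endpoint's point of view,
            # but the data ends up in the catch-all buffer.
            catch_all[:len(data)] = data
            return b""
        return bytes(catch_all[:length])
    # RAZ_WI: writes are silently dropped, reads return zeroes.
    if is_write:
        return b""
    return bytes(length)
```

Only the abort case gives the endpoint a signal that something went wrong; with the other two policies the endpoint continues with stale or zeroed data, which is part of why a single recovery rule is hard to state.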

Again: I'm on a virtual machine, my device just ran into a fault - how can I find out what I'm supposed to do to get it working again, in its driver and possibly also the virtio-iommu driver?

Jan

--
Siemens AG, Corporate Technology, CT RDA IOT SES-DE
Corporate Competence Center Embedded Linux

