virtio-comment message



Subject: Re: [virtio-comment] Re: [virtio] Re: [PATCH v10 04/10] admin: introduce virtio admin virtqueues




On 08/03/2023 16:13, Stefan Hajnoczi wrote:
On Wed, Mar 08, 2023 at 01:17:33PM +0200, Max Gurtovoy wrote:


On 06/03/2023 18:25, Stefan Hajnoczi wrote:
On Mon, Mar 06, 2023 at 05:28:03PM +0200, Max Gurtovoy wrote:


On 06/03/2023 13:20, Stefan Hajnoczi wrote:
On Mon, Mar 06, 2023 at 04:00:50PM +0800, Jason Wang wrote:

On 2023/3/6 08:03, Stefan Hajnoczi wrote:
On Sun, Mar 05, 2023 at 04:38:59AM -0500, Michael S. Tsirkin wrote:
On Fri, Mar 03, 2023 at 03:21:33PM -0500, Stefan Hajnoczi wrote:
What happens if a command takes 1 second to complete, is the device
allowed to process the next command from the virtqueue during this time,
possibly completing it before the first command?

This requires additional clarification in the spec because "they are
processed by the device in the order in which they are queued" does not
explain whether commands block the virtqueue (in order completion) or
not (out of order completion).
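
To make the two readings concrete, here is a minimal sketch. It is only a toy model, not anything from the spec: the function name, the millisecond durations, and the assumption that an out-of-order device starts every queued command at time 0 with unlimited parallelism are all made up for illustration.

```python
def completion_times(durations_ms, in_order):
    """Return (command index, completion time in ms) in completion order.

    durations_ms[i] is a hypothetical processing time for command i.
    All commands are assumed to be queued at time 0.
    """
    if in_order:
        # In-order completion: each command blocks the virtqueue until done.
        finished, clock = [], 0
        for index, duration in enumerate(durations_ms):
            clock += duration
            finished.append((index, clock))
        return finished
    # Out-of-order completion: assume the device processes every command
    # concurrently, so shorter commands complete first.
    return sorted(enumerate(durations_ms), key=lambda pair: (pair[1], pair[0]))


# A 1 second command followed by a 1 ms command.
print(completion_times([1000, 1], in_order=True))   # [(0, 1000), (1, 1001)]
print(completion_times([1000, 1], in_order=False))  # [(1, 1), (0, 1000)]
```

In the in-order reading the second command waits behind the slow one; in the out-of-order reading it completes first.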
Oh I begin to see. Hmm how does e.g. virtio scsi handle this?
virtio-scsi, virtio-blk, and NVMe requests may complete out of order.
Several may be processed by the device at the same time.

They rely on multi-queue for abort operations:

In virtio-scsi the abort requests (VIRTIO_SCSI_T_TMF_ABORT_TASK) are
sent on the control virtqueue. The request identifier namespace is
shared across all virtqueues so it's possible to abort a request that
was submitted to any command virtqueue.

NVMe also follows the same design where abort commands are sent on the
Admin Submission Queue instead of an I/O Submission Queue. It's possible
to identify NVMe requests by <Submission Queue ID, Command Identifier>.

virtio-blk doesn't support aborting requests.

I think the logic behind this design is that if a queue gets stuck
processing long-running requests, then the device should not be forced
to perform lookahead in the queue to find abort commands. A separate
control/admin queue is used for the abort requests.
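
A rough sketch of that design, assuming an NVMe-style key of <queue ID, command ID>. The ToyDevice class and its method names are hypothetical and do not correspond to any real driver or device API; the point is only that the abort arrives out of band and is matched against a table of in-flight requests.

```python
class ToyDevice:
    """Toy model of a device that tracks in-flight I/O requests."""

    def __init__(self):
        # (queue_id, command_id) -> request description
        self.in_flight = {}

    def submit_io(self, queue_id, command_id, request):
        """Record a request submitted on an I/O (command) queue."""
        self.in_flight[(queue_id, command_id)] = request

    def admin_abort(self, queue_id, command_id):
        """Handle an abort that arrived on the separate admin/control queue.

        Returns True if a matching in-flight request was found and dropped.
        No lookahead into the (possibly stuck) I/O queue is needed.
        """
        return self.in_flight.pop((queue_id, command_id), None) is not None


dev = ToyDevice()
dev.submit_io(queue_id=1, command_id=7, request="long-running write")
print(dev.admin_abort(queue_id=1, command_id=7))  # True
print(dev.admin_abort(queue_id=1, command_id=7))  # False, already aborted
```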


Or the device needs to mandate some kind of QoS here, e.g. a request must
complete within some time. Otherwise we don't have sufficient reliability for
using it for management tasks?

Yes, if all commands can be executed in bounded time then a guarantee is
possible.

Here is an example where that's hard: imagine a virtio-blk device backed
by network storage. When an admin queue command is used to delete a
group member, any of the group member's in-flight I/O requests need to
be aborted. If the network hangs while the group member is being
deleted, then the device can't complete an orderly shutdown of I/O
requests in a reasonable time.

That example shows a basic group admin command that I think Michael is
about to propose. We can't avoid this problem by not making it a group
admin command - it needs to be a group admin command. So I think it's
likely that there will be admin commands that take an unbounded amount
of time to complete. One way to achieve what you mentioned is timeouts.

I think that you're getting into device specific implementation details and
I'm not sure it's necessary.

I don't think we need to abort admin commands. Admin commands can be
flushed/aborted during the device reset phase.
Only IO commands should have the possibility of being aborted, as you
mentioned for NVMe and SCSI (and potentially virtio-blk).

It's a general design issue that should be clarified now rather than
being left unspecified.

I'm not saying that it must be possible to abort admin commands. There
are other options, like requiring the device itself to fail a command
after a timeout.

Do you have an example of a timeout today for the control vq?

Do you mean the virtio-net control virtqueue? I don't think it has any
commands with an unbounded execution time.

Correct. So why introduce it now?



Or we could say that admin commands must complete within bounded time,
but I'm not sure that is implementable for some device types like
virtio-blk, virtio-scsi, and virtiofs.

No we can't.
Some commands, for example FW upgrade, can take 10 minutes and it's perfectly
fine. Other commands, like setting a feature bit, will take 1 millisecond.
Each device implements commands with different internal logic, so we can't
expect them to complete within X time.

When I say bounded time, I mean that it finishes in a finite amount of
time. I'm not saying there is a specific time X that all device
implementations must satisfy. Unbounded means it might never finish.

There is always a chance that a command for any virtio device type will never finish. Nothing new here in the adminq.

What the driver can do is set a timeout for itself and, if this timeout expires, check the device status. If it is needs_reset, do a reset. If the status is ok, wait some more time.
After X retries, unmap the buffers or reset the adminq.
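
A minimal sketch of that driver-side policy. Everything here is an assumption for illustration: the four callbacks (wait_completion, needs_reset, reset_device, give_up) are hypothetical hooks a driver would supply, and the timeout and retry count are whatever the driver picks.

```python
from enum import Enum


class Outcome(Enum):
    COMPLETED = "completed"
    DEVICE_RESET = "device_reset"
    GAVE_UP = "gave_up"


def wait_for_admin_command(wait_completion, needs_reset, reset_device,
                           give_up, timeout_s, max_retries):
    """Wait for one admin command using a self-imposed timeout.

    wait_completion(timeout_s) -> bool: True if the command completed in time.
    needs_reset() -> bool: True if the device status reports needs_reset.
    reset_device(): perform a device reset.
    give_up(): unmap the buffers or reset the adminq (driver's choice).
    """
    for _ in range(max_retries):
        if wait_completion(timeout_s):
            return Outcome.COMPLETED
        if needs_reset():
            reset_device()
            return Outcome.DEVICE_RESET
        # Status is ok: the command may just be slow, so wait some more.
    give_up()
    return Outcome.GAVE_UP
```

A command that is merely slow (a FW upgrade, say) is handled by choosing a generous timeout_s and max_retries for it; the sketch deliberately leaves those numbers to the caller.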


The device can go to some FATAL state in case a command is stuck and is
causing internal errors in it.


For your example, stopping a member is possible even if there are some
errors in the network. You can, for example, destroy all the connections to
the remote target and complete all the bios with some error.

Forgetting about in-flight requests doesn't necessarily make them go
away. It creates a race between forgotten requests and reconnection. In
the worst case a forgotten write request takes effect after
reconnection, causing data corruption.

To make it work without data corruption we need cooperation from the
target side for sure. But this is fine since the target in that case is part
of the "virtio-blk backend".
One solution is that the target can decide it will flush all the requests to
the storage device before accepting new connections.

This solution shifts the unbounded time from disconnection to
connection. The Group Member Delete command will complete quickly but a
subsequent Group Member Create command for the same underlying storage
device would need to wait until the requests are done.

Therefore I think the admin queue must be designed under the assumption
that some commands take a very long time.

For sure an admin command may take a long time. FW upgrade can take 10 minutes, for example.
But each device is free to implement its internal logic as it chooses.

Same for live migration: when we stop/quiesce a device we must make sure it doesn't master any DMA operations. Thus, in some implementations we need to wait for all in-flight requests to end fast. In others, we can invalidate the access to host/guest memory and wait for completions until the freeze state.

Bottom line, this is a device-implementation-specific consideration.


Stefan

