Subject: Re: [virtio-dev] Re: [virtio-comment] Re: [virtio-dev] Re: [PATCH 0/5] virtio: introduce SUSPEND bit and vq state

On 10/12/2023 7:12 PM, Michael S. Tsirkin wrote:
On Thu, Oct 12, 2023 at 06:49:51PM +0800, Zhu, Lingshan wrote:

On 10/12/2023 5:59 PM, Michael S. Tsirkin wrote:
On Wed, Oct 11, 2023 at 06:38:32PM +0800, Zhu, Lingshan wrote:
On 10/11/2023 6:20 PM, Michael S. Tsirkin wrote:
On Mon, Oct 09, 2023 at 06:01:42PM +0800, Zhu, Lingshan wrote:
On 9/27/2023 11:40 PM, Michael S. Tsirkin wrote:
On Wed, Sep 27, 2023 at 04:20:01PM +0800, Zhu, Lingshan wrote:
On 9/26/2023 6:48 PM, Michael S. Tsirkin wrote:
On Tue, Sep 26, 2023 at 05:25:42PM +0800, Zhu, Lingshan wrote:
We don't want to repeat the discussions; it looks like an endless circle with
no direction.
OK let me try to direct this discussion.
You guys were speaking past each other, no dialog is happening.
And as long as it goes on no progress will be made and you
will keep going in circles.

Parav here made an effort and attempted to summarize
use-cases addressed by your proposal but not his.
He couldn't resist adding "a yes but" in there oh well.
But now I hope you know he knows about your use-cases?

So please do the same. Do you see any advantages to Parav's
proposal as compared to yours? Try to list them and
if possible try not to accompany the list with "yes but"
(put it in a separate mail if you must ;) ).
If you won't be able to see any, let me know and I'll try to help.

Once each of you and Parav have finally heard the other and
the other also knows he's been heard, that's when we can
try to make progress by looking for something that addresses
all use-cases as opposed to endlessly repeating same arguments.
Sure Michael, I will not say "yes but" here.

As I understand Parav's proposal, he intends to migrate a member device through
its owner device via the admin vq,
so his series introduces the necessary admin vq commands.

I see his proposal can:
1) meet some customers' requirements that do not involve nested or bare-metal
2) align with Nvidia production
3) be easier to emulate with an onboard SoC
Is that all you can see?

Hint: there's more.
please help provide more.
Just a small subset off the top of my head:
Error handling.
handle failed live migration? how?
For example you can try restarting VM on source.
Or at least report an error to hypervisor.
I am not sure resetting a VM due to failed live migration is
a good idea, should we resume the VM instead?
Yes - when I said restarting I meant resuming not resetting.
OK, we have implemented the interface to resume the device, to clear suspend.
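
To make the suspend/resume point concrete, here is a minimal toy model of the device status byte. The first six values are the status bits defined in the virtio specification; the SUSPEND value of 16 and the ToyDevice class are placeholders chosen for illustration only, not taken from the series.

```python
# Status bits defined by the virtio specification.
ACKNOWLEDGE = 1
DRIVER = 2
DRIVER_OK = 4
FEATURES_OK = 8
DEVICE_NEEDS_RESET = 64
FAILED = 128
# Bit proposed in this series; the value 16 is a hypothetical placeholder.
SUSPEND = 16


class ToyDevice:
    """Illustrative stand-in for a device; not a real driver interface."""

    def __init__(self):
        self.status = 0

    def suspend(self):
        # In this sketch only a running device (DRIVER_OK set) may be suspended.
        if not self.status & DRIVER_OK:
            raise RuntimeError("device is not running")
        self.status |= SUSPEND

    def resume(self):
        # Clearing SUSPEND is the recovery path after a failed migration.
        self.status &= ~SUSPEND & 0xFF

    def is_suspended(self):
        return bool(self.status & SUSPEND)


dev = ToyDevice()
dev.status = ACKNOWLEDGE | DRIVER | FEATURES_OK | DRIVER_OK
dev.suspend()
suspended_status = dev.status
dev.resume()
```

After resume() the status byte is back to its pre-suspend value, which is the "resume rather than reset" behaviour discussed above.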

Then try other
convergence algorithm?
Talking about device failures here nothing to do with convergence.
But yes, can e.g. try a different destination.

And I think the current live migration solution already implements error
detection, such as noticing a timeout?
it is extremely hard to predict how
long it will take a random piece of hardware from a random
vendor to respond. even if you do, timeouts break nested,
don't they ;) and finally, they provide no indication
of what went wrong whatsoever.
the hypervisor would not complete the live migration
process before device migration is done.

I think the hypervisor or the orchestration layer
know the LM status anyway.

and for other errors, we have mature error handling solutions
in virtio for years, like re-read, NEEDS_RESET.

Are you aware of the fact that Linux still doesn't support
it since it turned out to be an extremely awkward interface
to use?
I think we have implemented this in virtio driver,
like re-read to check FEATURES.
grep for NEEDS_RESET in drivers/virtio and weep.
that is interesting, the virtio driver has lived so many years
without handling NEEDS_RESET, thanks to good device quality and
layers of error handlers.

What prevents implementing NEEDS_RESET? Is it the question of how to reinitialize?
It looks like we should do that.

For now, re-read is working well at least.
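
As a rough sketch of the "re-read and check NEEDS_RESET" idea: DEVICE_NEEDS_RESET (64) and DRIVER_OK (4) are the values from the virtio specification, while check_device and its two callbacks are hypothetical names standing in for transport access and driver re-initialization. The hard part raised above, how to reinitialize safely, is hidden inside the callback.

```python
DEVICE_NEEDS_RESET = 64  # status bit defined by the virtio specification
DRIVER_OK = 4


def check_device(read_status, reset_and_reinit):
    """Re-read the status and recover if the device asks for a reset.

    Both arguments are hypothetical callbacks used only for illustration.
    """
    status = read_status()
    if status & DEVICE_NEEDS_RESET:
        reset_and_reinit()
        return "reset"
    return "ok"


calls = []
result_bad = check_device(lambda: DRIVER_OK | DEVICE_NEEDS_RESET,
                          lambda: calls.append("reinit"))
result_good = check_device(lambda: DRIVER_OK,
                           lambda: calls.append("reinit"))
```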

If that is not good enough, then the corollary is:
admin vq is better than config space,
You keep confusing admin vq with admin commands.
OK, so are admin commands better than registers?
They have more functionality for sure.
yes they are more powerful than registers.

However, to suspend, resume, and configure the dirty page facility,
registers are low-hanging fruit.

then the further corollary could be:
we should refactor virtio-pci interfaces to admin vq commands,
like how we handle features

Is that true?
Extendable to other group types such as SIOV.
For SIOV, the admin vq is a transport, but for SR-IOV
the admin vq is a control channel, that is different,
and admin vq can be a side channel.

For example, for SIOV, we config and migrate MSIX through
admin vq. For SRIOV, they are in config space.
And that's a mess. FYI we already got feedback from Linux devs
who are wondering why we can't come up with a consistent
interface that does everything.
I believe config space is a consistent interface for PCI.
For SIOV, we need a new transport layer anyway.

Batching of commands
fewer PCI transactions
so this can still be a QOS issue.
With batching, won't others starve?
And if you block CPU since you are not accepting
a posted write this is better?
I don't get it, block guest CPU?
host cpu in fact. if you flood PCI Express with transactions
this is exactly what happens.
I am not sure a hypervisor will implement this just to adapt to admin vq live migration.

Support for keeping some data off-device
I don't get it, what is off-device?
The live migration facilities need to fetch data from the device anyway
Heh this is what was driving nvidia to use DMA so heavily all this time.
no - if data is not in registers, device can fetch the data from
across pci express link, presumably with a local cache.
For PCI based configuration, like MSI, we need to fetch from config space
For others like dirty page, we can store the bitmap in host memory, and use
PASID for isolation.
Oh really?  What do we get by not using same mechanism for
device state then? This begins to look exactly like admin vq.
We implement a register to configure a logging address in host memory, isolated by PASID.
There are also a few other registers to control the facility, such as enable/disable.
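
A minimal sketch of such a logging facility, assuming 4 KiB pages and one bit per page in a bitmap held in host memory. DirtyLog, its enabled flag and device_write are hypothetical names for illustration; PASID isolation is not modelled here.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages


class DirtyLog:
    """Toy model: an enable control plus a page bitmap in host memory."""

    def __init__(self, num_pages):
        self.enabled = False
        self.bitmap = bytearray((num_pages + 7) // 8)

    def device_write(self, addr, length):
        # Called for each device DMA write; marks every page it touches.
        if not self.enabled or length <= 0:
            return
        first = addr >> PAGE_SHIFT
        last = (addr + length - 1) >> PAGE_SHIFT
        for page in range(first, last + 1):
            self.bitmap[page // 8] |= 1 << (page % 8)

    def dirty_pages(self):
        # Decode the bitmap back into a sorted list of page numbers.
        return [p for p in range(len(self.bitmap) * 8)
                if self.bitmap[p // 8] & (1 << (p % 8))]


log = DirtyLog(num_pages=16)
log.device_write(0x1000, 8)      # ignored: logging is not enabled yet
log.enabled = True
log.device_write(0x1FF0, 0x20)   # crosses from page 1 into page 2
log.device_write(0x5000, 1)      # touches page 5
```

Only writes made after the facility is enabled are recorded, so dirty_pages() reports pages 1, 2 and 5.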

which does not mean it's better unconditionally.
are above points clear?
The thing is, what blocks the config space solution?
Why is admin vq a must for live migration?
What's wrong with the config space solution?
When you say what's wrong do you mean you still see no
advantages to doing DMA at all? config space is just better
with no drawbacks?
still, if admin vq or admin commands are better than config space,
we should refactor the whole virtio-pci interfaces to admin vq.
mixing admin vq and command up again apparently.
We want to support virtio over admin commands for SIOV, yes.
And once that's supported nothing should prevent using that
for SRIOV too.
admin commands work for SRIOV, but overkill for live migration.

For example, to suspend a device, what is the benefit of using an
admin command rather than just a register?

And if we want a BAR to process admin commands, do we need
to implement fields like data_length, total_length,
etc.? That is much more complex than a register.

And Jason once proposed to build admin vq LM on our basic
facilities, but I see this has been rejected.
Please do not conclude that you just need to resubmit.

Shall we refactor everything in virtio-pci to use admin vq?
as long as you guys keep not hearing each other we will keep
seeing these flame wars. if you expect everyone on virtio-comment
to follow a 300 message thread you are imo very much mistaken.
I am sure I have not ignored any questions.
I am saying admin vq is problematic for live migration,
at least it doesn't work for nested, so why admin vq is a must for live
My suggestion for you was to add admin command support to
VF memory, as an alternative to admin vq. It looks like that
will address the nested virt usecase.
If you mean carrying some big bulk of data like dirty page information,
we implemented a facility in host memory which is isolated by PASID.

I should send a new series soon, so we can work on the patch.
I hope that one does not just restart the same flame war.
As it will if people keep talking past each other and
not listening.
V2 will include dirty page tracking, so we can review the design.

Yes I hope no flame wars.

Thanks for your suggestions and efforts anyway.

The general purpose of his proposal and mine are aligned: migrate virtio

Jason has previously proposed to collaborate; please allow me to quote his proposal:

Let me repeat once again here for the possible steps to collaboration:

1) define virtqueue state, inflight descriptors in the section of
basic facility but not under the admin commands
2) define the dirty page tracking, device context/states in the
section of basic facility but not under the admin commands
3) define transport specific interfaces or admin commands to access them
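
The layering in these three steps can be sketched as follows: the state is defined once as a basic facility, and two different access paths reach the same state. All class, field and opcode names below are assumptions made for illustration; none of them are spec text.

```python
from dataclasses import dataclass


@dataclass
class VqState:
    # Field names are illustrative assumptions.
    last_avail_idx: int
    used_idx: int


class Device:
    """Basic facility: per-queue state, independent of any transport."""

    def __init__(self):
        self.vq_state = {0: VqState(0, 0)}


class RegisterAccess:
    """Hypothetical register-style accessor."""

    def __init__(self, dev):
        self.dev = dev

    def read_vq_state(self, qid):
        return self.dev.vq_state[qid]

    def write_vq_state(self, qid, state):
        self.dev.vq_state[qid] = state


class AdminCommandAccess:
    """Hypothetical command-style accessor for the same state."""

    def __init__(self, dev):
        self.dev = dev

    def execute(self, opcode, qid, state=None):
        if opcode == "VQ_STATE_GET":
            return self.dev.vq_state[qid]
        if opcode == "VQ_STATE_SET":
            self.dev.vq_state[qid] = state
            return None
        raise ValueError("unknown opcode: " + opcode)


dev = Device()
RegisterAccess(dev).write_vq_state(0, VqState(last_avail_idx=5, used_idx=3))
seen_by_admin = AdminCommandAccess(dev).execute("VQ_STATE_GET", 0)
```

Both accessors observe the same VqState, which is the point of defining the state separately from the interface used to reach it.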

I totally agree with his proposal.

Does this work for you Michael?

Zhu Lingshan
I just doubt very much this will work.  What will "define" mean then -
not an interface, just a description in english? I think you
underestimate the difficulty of creating such definitions that
are robust and precise.
I think we can review the patch to correct the words.
Instead I suggest you define a way to submit admin commands that works
for nested and bare-metal (i.e. not admin vq, and not with sriov group
type). And work with Parav to make live migration admin commands work
reasonably well through this interface and with this type.
Why are admin commands better than registers?

