

Subject: Re: [virtio-comment] [PATCH 5/5] virtio-pci: implement VIRTIO_F_QUEUE_STATE


On Sun, Sep 17, 2023 at 1:22 PM Parav Pandit <parav@nvidia.com> wrote:
>
>
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Thursday, September 14, 2023 8:39 AM
> >
> > On Wed, Sep 13, 2023 at 2:39 PM Parav Pandit <parav@nvidia.com> wrote:
> > >
> > >
> > > > From: Jason Wang <jasowang@redhat.com>
> > > > Sent: Wednesday, September 13, 2023 10:15 AM
> > >
> > > [..]
> > > > > > > [1]
> > > > > > > https://lists.oasis-open.org/archives/virtio-comment/202309/ms
> > > > > > > g000
> > > > > > > 61.h
> > > > > > > tml
> > > > > >
> > > > > > The series works for stateless devices. Before we introduce
> > > > > > device states in the spec, we can't migrate stateful devices. So
> > > > > > the device context doesn't make much sense right now.
> > > > > The series works for stateful devices too. The device context covers it.
> > > >
> > > > How? Can it be used for migrating any existing stateful devices?
> > > > Don't we need to define what context means for a specific stateful
> > > > device before you can introduce things like device context? Please
> > > > go through the archives for the relevant discussions (e.g
> > > > virtio-FS), it's not as simple as introducing a device context API.
> > > >
> > > A device will have its own context, for example an RSS definition, or flow filters
> > tomorrow.
> >
> > If you know there are things that are missing when posting the patches, please
> > use the RFC tag.
> >
> It is not missing. They are optional, which is why it is not needed in this series.
>
> > > The device context will be extended post the first series.
> > >
> > > > And what's more, how can it handle the migration compatibility?
> > > It will be taken care of in a follow-on, as we all know this needs to be checked.
> >
> > You don't even mention it anywhere in your series.
> >
> Migration compatibility is a topic in itself, regardless of the device migration series.

Why? Without compatibility support, migration can't work in a
production environment.

> It is part of the feature provisioning phase needed regardless.

Definitely not, it is something that must be considered even without
any feature. It's about the robustness of the migration protocol:
sometimes you need it because some states were lost in previous
versions of the protocols or formats.

> Like how you and Lingshan wanted to keep the suspend bit series small and logical, the device migration series is also logically split by functionality.
> I don't see a need to mention the long-known missing functionality that is common to both approaches.

Again, your proposal needs to describe at least the plan for dealing
with migration compatibility since you want a passthrough based
solution. That's the point.

>
> > > I will include notes on future follow-up work items in v1, which will be
> > taken care of post this series.
> > >
> > > > > > Dirty page tracking in virtio is not a must for live migration
> > > > > > to work. It can be done via platform facilities or even
> > > > > > software. And to make it more efficient, it needs to utilize
> > > > > > transport facilities instead of a
> > > > general one.
> > > > > >
> > > > > It is also optional in the spec proposal.
> > > > > Most platforms claimed are not able to do it efficiently either,
> > > >
> > > > Most platforms are working towards an efficient way. But we are
> > > > talking about different things, hardware based dirty page logging is
> > > > not a must, that is what I'm saying. For example, KVM doesn't use hardware
> > to log dirty pages.
> > > >
> > > I also said the same, that hw based dirty page logging is not a must. :) One
> > > day the hw MMU will be able to track everything efficiently. I have not seen it
> > happening yet.
> >
> > How do you define efficiency? KVM uses page faults, and most modern IOMMUs
> > support PRI now.
> >
> One cannot define PRI as a mandatory feature.

There's no way to mandate PRI, it's a PCI specific facility.

> In our research and experiments we see that PRI is significantly slower to handle page faults.
> Yet different topic...

PRI's performance is definitely another topic; it's just an example
showing that tracking dirty pages by the device is optional and that
the transport (PCI) can evolve for sure. More importantly, it
demonstrates the basic design of virtio, which tries to leverage the
transport instead of mandating a reinvention of everything.

>
> Efficiency is defined by the downtime of the multiple devices in a VM.

Ok, but you tend to ignore my question regarding the downtime.

> And leading OSes allowed device advancements by allowing the device to report dirty pages in a cpu- and platform-agnostic way...
>

There are many things here that I don't see a good answer for. For
example, the QOS issue raised by Ling Shan.

> One can use the post-copy approach as well; the current device migration is built around the established pre-copy approach.

Another drawback of your proposal. With transport specific assistance
like PRI, you can do both pre- and post-copy. But the point is we need
to make sure the pre-copy downtime can satisfy the requirement instead
of switching to another approach.

>
> > >
> > > > > hence the vfio subsystem added the support for it.
> > > >
> > > > As an open standard, if it is designed for a specific software
> > > > subsystem on a specific OS, it's a failure.
> > > >
> > > It is not.
> > > One needs to accept that, in certain areas, virtio is following the trails of
> > advancement already done in the sw stack.
> > > So that virtio spec advancements fit into supplying such use cases.
> > > And blocking such advancement of the virtio spec to promote an only_mediation
> > approach is not good either.
> > >
> > > BTW: One can say the mediation approach is also designed for a specific
> > software subsystem and is hence a failure.
> > > I will stay away from quoting it, as I don't see it this way.
> >
> > The proposal is based on well known technology since the birth of virtualization.
> Sure, but that does not change the fact that this series is also targeted at a specific software subsystem.

How? This series reuses the existing capability by introducing just
two more registers in the existing common cfg structure, and you think
it targets a specific software subsystem? If that were true, you would
actually be challenging the design of the whole modern PCI transport.
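For readers following along, the shape of the change under discussion can be sketched against the virtio 1.0/1.1 common configuration layout. The two appended fields below (`queue_avail_state`, `queue_used_state`) and their names/placement are assumptions for illustration only, not the actual patch's layout, and later spec additions such as `queue_notify_data`/`queue_reset` are omitted:

```python
import ctypes

class VirtioPciCommonCfg(ctypes.LittleEndianStructure):
    """Abridged sketch of the virtio 1.x common configuration structure,
    with two hypothetical VIRTIO_F_QUEUE_STATE registers appended."""
    _pack_ = 1
    _fields_ = [
        # About the whole device
        ("device_feature_select", ctypes.c_uint32),
        ("device_feature",        ctypes.c_uint32),
        ("driver_feature_select", ctypes.c_uint32),
        ("driver_feature",        ctypes.c_uint32),
        ("msix_config",           ctypes.c_uint16),
        ("num_queues",            ctypes.c_uint16),
        ("device_status",         ctypes.c_uint8),
        ("config_generation",     ctypes.c_uint8),
        # About the currently selected queue
        ("queue_select",          ctypes.c_uint16),
        ("queue_size",            ctypes.c_uint16),
        ("queue_msix_vector",     ctypes.c_uint16),
        ("queue_enable",          ctypes.c_uint16),
        ("queue_notify_off",      ctypes.c_uint16),
        ("queue_desc",            ctypes.c_uint64),
        ("queue_driver",          ctypes.c_uint64),
        ("queue_device",          ctypes.c_uint64),
        # Hypothetical VIRTIO_F_QUEUE_STATE additions (assumed names)
        ("queue_avail_state",     ctypes.c_uint16),  # next avail idx to fetch
        ("queue_used_state",      ctypes.c_uint16),  # next used idx to write
    ]

# The existing layout is untouched; the new registers only append to it.
print(VirtioPciCommonCfg.queue_avail_state.offset)  # 56
print(ctypes.sizeof(VirtioPciCommonCfg))            # 60
```

The point of the sketch is only that such an extension appends to, rather than rearranges, the existing structure, so existing drivers keep working.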

> And hence failure.

Failure in what sense?

>
> I didn't say that; I said the opposite: yes, since virtio is in catch-up mode, it is defining the interface so that it can fit into these OS platforms.
> Mostly multiple of them, who all support passthrough devices.

We are talking about different things again.

>
> > I never knew a mainstream hypervisor that doesn't do trap and emulate, did
> > you?
> >
> It does trap and emulation for the PCI config space, not for virtio interfaces like queues, device config space, and more, for passthrough devices.

Well, we are in the context of live migration, no? We all know
passthrough just works fine with the existing virtio spec...

>
> > >
> > > > >
> > > > > > The FLR, P2P demonstrates the fragility of a simple passthrough
> > > > > > method and how it conflicts with live migration and complicates
> > > > > > the device
> > > > implementation.
> > > > > Huh, it shows the opposite.
> > > > > It shows that both will seamlessly work.
> > > >
> > > > Have you even tried your proposal with a prototype device?
> > > Of course, it was delivered to users 1.5 years ago, before bringing it to the
> > spec, with virtio-net and virtio-blk devices.
> >
> > I hope this is your serious answer, but it looks like it is not. Your proposal misses
> > a lot of states, as I pointed out in another thread, so how can it work in fact?
> >
> Which states?

Let me repeat it for the third time. You don't even cover all the
functionality of common cfg, how can guests see a consistent common
cfg state?

> What is posted in series [1] is the minimal, base set of required items;

You need to prove it is minimal, instead of ignoring my questions. For
example, dirty page tracking is definitely optional.

> optional ones are omitted, as they can be done incrementally.
> Lingshan had a hard time digesting the basics of the P2P and dirty page tracking work in this short series.

You never explain why this series needs to deal with P2P and dirty
page tracking.

> So there is no point in pushing a large part of the device context and making the series blurry.

I don't see a good definition of "device context" and most of the
device context has been covered by the existing PCI capabilities.

> It will be done incrementally subsequently.
>
> > > > >
> > > > > > And it means you need to audit all PCI features and do
> > > > > > workaround if there're any possible issues (or using a whitelist).
> > > > > No need for any of this.
> > > >
> > > > You need to prove this otherwise it's fragile. It's the duty of the
> > > > author to justify not the reviewer.
> > > >
> > > One cannot post, nor review, a giant series in one go.
> > > Hence the work is split on a logical boundary.
> > > Feature provisioning, PCI layout, etc. are secondary tasks to take care of.
> >
> > Again, if you know something is missing, you need to explain it in the series
> > instead of waiting for some reviewers to point it out and say it's well-known
> > afterwards.
> >
> The patch set cannot be a laundry list of items missing in virtio spec.
> It is short and focused on the device migration.

You need to mention it in the cover letter, at least for the big
picture. What's wrong with that? It helps to save time for everyone;
otherwise people will keep asking similar questions. Is this too hard
to understand?

>
> > >
> > > > For example, FLR is required to complete in 100ms. How could you
> > > > achieve this during live migration? How does it affect the downtime and
> > FRS?
> > > >
> > > Good technical question to discuss instead of passthrough vs
> > > mediation. :)
> > >
> > > Device administration work is separate from the device operational part.
> > > The device context records the current device state; when an FLR
> > occurs, the device stops all operations.
> > > And on the next read of the device context, the FLRed context is returned.
> >
> > Firstly, you didn't explain how it affects the live migration, for example, what
> > happens if we try to migrate while FLR is ongoing.
> > Secondly, you ignore the other two questions.
> >
> > Let's save the time of both.
> >
> There is nothing to explain about device reset and live migration, because there are absolutely no touch points.

Do you think this is a valid answer to my above question? Let's not
exhaust the patience of the reviewers.

> device_status is just another register like the rest of them.

I don't see device status itself as anything related to FLR.

> One does not need to poke around registers when doing passthrough.
>
> > >
> > > > >
> > > > > > This is tricky and we are migrating virtio not virtio-pci. If we
> > > > > > don't use simple passthrough we don't need to care about this.
> > > > > >
> > > > > Exactly, we are migrating virtio device for the PCI transport.
> > > >
> > > > No, the migration facility is a general requirement for all transport.
> > > It is for all transports. One can extend it when doing it for MMIO.
> >
> > By using admin commands? They cannot perform well via registers.
> >
> Yes, admin commands using AQ on MMIO based owner device will also be just fine.

Can admin commands be implemented efficiently via registers? I would
like to see how that can work.

MMIO doesn't have the concept of a group owner etc. at all, or do you
know how to build one?

>
> > >
> > > > Starting from a PCI specific solution (actually your proposal does not even
> > > > cover everything even for PCI) may easily end up with issues in other
> > transports.
> > > >
> > > Like?
> >
> > The admin command/virtqueue itself may not work well for other transport.
> > That's the drawback of your proposal while this proposal doesn't do any
> > coupling.
> >
> There is no coupling of admin commands with the virtqueue in the spec, as Michael has consistently insisted.
> And in my proposal also there is no such coupling.

I hope so, but I don't think that's the case. We need to at least make
this explicit by moving all the state definitions to the "basic
facilities" part.

>
> > >
> > > > Even if you want to migrate virtio for PCI,  please at least read
> > > > Qemu migration codes for virtio and PCI, then you will soon realize
> > > > that a lot of things are missing in your proposal.
> > > >
> > > Device context is something that will be extended.
> > > VIRTIO_PCI_CAP_PCI_CFG will also be added as optional item for PCI
> > transport.
> >
> > This is just one mini stuff, how about PCI config space and others?
> >
> No need to migrate the PCI config space, because the migration is of the virtio device, not the underlying transport.

Let me ask you a simple question, if you don't migrate the PCI config
space, how can you guarantee that guests see the same config space
state after migration? What happens if a specific capability exists
only in the src but not the destination? Or do you want to provision
PCI capabilities?
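To make the compatibility question concrete, here is a toy sketch of the kind of check being asked about: diffing the PCI capability sets of the source and destination before migration. The capability names below are illustrative placeholders, not taken from any real device:

```python
def capability_gaps(src_caps, dst_caps):
    """Return capabilities the guest may have seen on the source device
    that the destination device cannot offer. A non-empty result means
    the guest could observe an inconsistent PCI config space after
    migration."""
    return sorted(set(src_caps) - set(dst_caps))

# Illustrative capability sets (assumed, for the sake of the example)
src = {"common_cfg", "notify_cfg", "isr_cfg", "device_cfg", "shared_memory_cfg"}
dst = {"common_cfg", "notify_cfg", "isr_cfg", "device_cfg"}
print(capability_gaps(src, dst))  # ['shared_memory_cfg']
```

Whether such a check belongs in provisioning, in the migration protocol, or in the hypervisor is exactly the open question in this thread.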

> Therefore, one can migrate from a virtio member device to a fully software based device as well, and vice versa.

Please answer my question above.

>
> > Again, please read Qemu codes, a lot of things are missing in your proposal
> > now. If everything is fine to do passthrough based live migration, I'm pretty sure
> > you need more than what Qemu has since it can only do a small fraction of the
> > whole PCI.
> >
> I will read.
> Many of the pieces may be implemented by the device over time following the charter.
>
> > >
> > > > > As usual, if you have to keep arguing about not doing
> > > > > passthrough, we are
> > > > surely past that point.
> > > >
> > > > Who is "we"?
> > > >
> > > We = You and me.
> > > Since 2021, you have kept objecting that passthrough must not be done.
> >
> > This is a big misunderstanding, you need to justify it or at least address the
> > concerns from any reviewer.
> >
> They are getting addressed; if you have comments, please post those comments in the actual series.
> I wouldn't diverge to discuss a different series here.

Well, Lingshan's series was posted before yours, and it's you who
keeps referring to your proposal here. What's more, I've asked some
questions, but most of them don't have a good answer. So I need to
stop before I can ask more.

>
> > > And blocking the work done by other technical committee members to
> > improve the virtio spec to make that happen is simply wrong.
> >
> > It's unrealistic to think that one will be 100% correct. Justify your proposal, or
> > explain why I was wrong, instead of ignoring my questions and complaining. That is
> > why we need a community. If it doesn't work, virtio provides another process for
> > convergence.
> >
> I am not expecting you to be correct at all. I totally agree that you may miss something, and I may miss something.
> And this is why I repeatedly, humbly ask to converge and jointly address the passthrough mode without the trap+emulation method.
> The way I understood your comment is that passthrough for a hw based device must not be done, and multiple hw vendors disagree with that.

Again, this is a big misunderstanding. Passthrough can work doesn't
mean your proposal can work. I'm asking questions and want to figure
out if/how it can work correctly. But you keep ignoring them or
raising other unrelated issues.

>
> > >
> > > > Is something like what you said here passed the vote and written to
> > > > the spec?
> > > Not only me.
> > > The virtio technical committee has agreed to _both_ nested and hardware-based
> > implementations.
> > >
> > > " hardware-based implementations" is part of the virtio specification charter
> > with ballot of [1].
> > >
> > > [1] https://lists.oasis-open.org/archives/virtio/202104/msg00038.html
> >
> > Let's not shift concepts; I was asking about passthrough but you give me
> > the hardware implementation.
> >
> Passthrough devices implemented by hw which do dirty tracking and follow the spec.

Why is passthrough coupled with dirty tracking?

>
> > >
> > > And passthrough hardware-based device is in the charter that we strive to
> > support.
> > >
> > > > We all know the current virtio spec is not built upon passthrough.
> > >
> > > This effort improves the passthrough hw based implementation and should
> > not be blocked.
> >
> > Your proposal was posted only for several days and you think I would block that
> > just because I asked several questions and some of them are not answered?
> >
> If I misunderstood, then I am sorry.
> Let's progress and improve the passthrough use case without trap+emulation.

Unless any reviewer says no, the comments or concerns are a good
opportunity for you to justify your method. That's what I'm doing
right now and how the community works.

> Trap+emulation=mediation is also a valid solution for the nested case.

Again, not only for the nested case. This method is used by cloud
vendors now.

> And I frankly see a need for both as both are solving a different problem.

Then let's not couple state, suspending, and dirty page tracking with
admin commands.

> Trap+emulation cannot achieve passthrough mode; hence my request was not to step on each other.

It's easy not to step on others, but it would end up with duplication for sure.

>
> When both can use common infra, it is good to do that; when they cannot, due to the technical challenges of the underlying transport, they should evolve differently.
>
> > >
> > > > > Virtio does not need to stay under the weird umbrella of always mediating, etc.
> > > >
> > > > It's not the mediation, we're not doing vDPA, the device model we
> > > > had in hardware and we present to guests are all virtio devices.
> > > > It's the trap and emulation which is fundamental in the world of
> > > > virtualization for the past decades. It's the model we used to
> > > > virtualize standard devices. If you want to debate this methodology, virtio
> > community is clearly the wrong forum.
> > > >
> > > I am not debating it at all. You keep bringing up the point of mediation.
> > >
> > > The proposal of [1] is clear that it wants to do hardware based passthrough
> > devices with the least amount of virtio level mediation.
> > >
> > > So somewhere a mode of virtualization has been used, that's fine, it can
> > > continue with full virtualization, mediation,
> > >
> > > And also hardware based passthrough device.
> > >
> > > > >
> > > > > Series [1] will be enhanced further to support virtio passthrough
> > > > > device for
> > > > device context and more.
> > > > > Even further we like to extend the support.
> > > > >
> > > > > > Since the functionality proposed in this series focus on the
> > > > > > minimal set of the functionality for migration, it is virtio
> > > > > > specific and self contained so nothing special is required to work in the
> > nest.
> > > > >
> > > > > Maybe it is.
> > > > >
> > > > > Again, I repeat that I would like to converge the admin commands between
> > > > passthrough and non-passthrough cases.
> > > >
> > > > You need to prove at least that your proposal can work for the
> > > > passthrough before we can try to converge.
> > > >
> > > What do you mean by "prove"? virtio specification development is not a proof
> > based method.
> >
> > For example, several of my questions were ignored.
> >
> I didn't ignore them, but if I missed any, I will answer.
>
> > >
> > > If you want to participate, please review the patches and help community to
> > improve.
> >
> > See above.
> >
> > >
> > > > > If we can converge it is good.
> > > > > If not both modes can expand.
> > > > > It is not either or as use cases are different.
> > > >
> > > > Admin commands are not the cure for all, I've stated drawbacks in
> > > > other threads. Not repeating it again here.
> > > He he, sure, I am not attempting to cure all.
> > > One solution does not fit all cases.
> >
> > Then why do you want to couple migration with admin commands?
> >
> Because of the following:
> 1. Device migration needs bulk data transfer, which is something that cannot be done with tiny registers.
> It cannot be done through registers, because:
> a. registers are slow for bidirectional communication
> b. they do not scale well with the number of VFs

That's pretty fine, but let's not limit it to a virtqueue. Virtqueue
may not work for all the cases:

I must repeat some of Ling Shan's questions since I don't see a good
answer for them now.

1) If you want to use virtqueue to do the migration with a downtime
requirement. Is the driver required to do some sort of software QOS?
For example what happens if one wants to migrate but the admin
virtqueue is out of space? And do we need a timeout for a specific
command and if yes what happens after the timeout?
2) Assuming one round of the migration requires several commands. Are
they allowed to be submitted in a batch? If yes, how is the ordering
guaranteed or we don't need it at all? If not, why do we even need a
queue?

If you're using an existing transport specific mechanism, you don't
need to care about the above. I'm not saying admin virtqueue can't
work but it definitely has more things to be considered.
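To put rough numbers on both sides of this register-vs-queue argument, a back-of-envelope sketch; every constant below is a made-up assumption for illustration, not a measurement of any real device:

```python
def register_read_time_us(n_bytes, reg_width=4, mmio_read_us=1.0):
    """Pulling state through config registers: one MMIO round trip per
    register-width chunk (assumed 1 us per read)."""
    regs = -(-n_bytes // reg_width)  # ceil division
    return regs * mmio_read_us

def queued_dma_time_us(n_bytes, cmd_overhead_us=30.0, gbytes_per_s=4.0):
    """Pulling the same state as a single queued command that DMAs the
    whole blob (assumed fixed command overhead plus link throughput)."""
    return cmd_overhead_us + n_bytes / (gbytes_per_s * 1e3)  # bytes per us

state = 64 * 1024  # hypothetical 64 KiB of device context
print(register_read_time_us(state))  # 16384.0 us
print(queued_dma_time_us(state))     # ~46.4 us
```

Under these assumptions bulk transfer clearly favors a queued DMA interface, which supports point 1 above; the open questions in this thread (queue-full behavior, command timeouts, batching and ordering) are about what happens when many such commands compete for one admin queue during the downtime window.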

>
> > > Admin commands are used to solve the specific problems the AQ is
> > designed for.
> > >
> > > One can make an argument saying take the PCI fabric to a 10 km distance, don't
> > bring a new virtio TCP transport...
> > >
> > > Drawing boundaries around the virtio spec in certain ways only makes it further
> > inferior. So please do not block the advancements brought in [1].
> >
> > As a reviewer, I ask questions but some of them are ignored; do you expect the
> > reviewer to figure things out by themselves?
> Sure, please review.
>
> Many of them were not questions, but assertions and conclusions that it does not fit the nested case, is sub-optimal, etc.

I think we all agree that your proposal does not fit the nested case,
no? It demonstrates that work needs to be done in the basic facilities
first.

What's more, the conclusion is about coupling live migration with
admin commands. This point has been clarified several times before.

>
> >
> > > We really would like to make it more robust with your rich experience and
> > inputs, if you care to participate.
> >
> > We can collaborate for sure: as I pointed out in other threads, from what I
> > can see from both proposals in their current versions:
> >
> > I see a good opportunity to build your admin commands proposal on top of this
> > proposal. In other words, we can focus on what needs to be migrated first:
> >
> > 1) queue state
> This is just one small part of the device context.
> So once the device context is read/written, it covers the queue.

That's a layer violation. Virtqueue is the basic facility, states need
to be defined there.

>
> > 2) inflight descriptors
> Same as queue state, they are part of the device context.

Admin commands are not the only way to access device context. For
example, do you agree the virtqueue address is part of the device
context? If yes, it is available in the common configuration now.

>
> > 3) dirty pages (optional)
> > 4) device state(context) (optional)
> >
> They are the same as #1 and #2.
> Splitting them from #1 and #2 is not needed.
>
> We can extend the device context to be selectively queried for the nested case.
>
> > I'd leave 3 or 4 since they are very complicated features. Then we can invent an
> > interface to access those facilities? This is how this series is structured.
> >
> > And what's more: admin commands or transport specific interfaces? And when
> > we invent admin commands, you may realize you are inventing a new transport,
> > which is the idea of the transport via admin commands.
>
> Not really, it is not a new transport at all.
> I explained to you before that to call it a transport, it must carry the driver notifications as well.
> Otherwise it is just a set of commands.

I've explained that you need admin commands to save and load all
existing virtio PCI capabilities. This means a driver can just use
those commands to work. If not, please explain why I was wrong.

Thanks





>
> The new commands of [1] are self contained anyway.
>
> [1] https://lore.kernel.org/virtio-comment/20230909142911.524407-1-parav@nvidia.com/T/#t


