
Subject: Re: [virtio] [OASIS Issue Tracker] Created: (VIRTIO-28) Implement new balloon device (ID 13)

On Mon, Oct 28, 2013 at 01:12:58PM +1030, Rusty Russell wrote:
> Daniel Kiper <daniel.kiper@oracle.com> writes:
> > On Fri, Oct 25, 2013 at 02:25:58PM +1030, Rusty Russell wrote:
> >> Daniel Kiper <daniel.kiper@oracle.com> writes:
> >> > Maybe guest should also be able to set balloon target. However, in this
> >> > situation host should establish limits which could not be exceeded and
> >> > device should enforce them. This way balloon could be controlled from host
> >> > and/or guest if needed. So balloon device will be responsible just for
> >> > passing requests to/from guest/host and limits enforcement. This way even
> >> > memory hotplug could be easily implemented. However, in this situation
> >> > device should not be called balloon. Memory manager or memory allocator?
> >> > Any better ideas?
> >>
> >> If it's purely guest-driven device, you don't need a target at all.  You
> >> just have a driver which hands pages to the device.
> >
> > OK. Could host send requests too? I think that sometimes it could
> > be useful if host could ask guest to limit its memory usage (guest
> > should return to host only free pages). So if we assume that it is
> > "purely guest-driven device" we could not do that.
> Dropping the target *is* how the host asks, surely.

As I understand it, we assume here that both the host and the guest can put
requests into the input queue. If so, I think that should be clearly stated
in the spec.

> >> You could operate the device in that way, of course, treating the target
> >> as a ceiling.  Is it useful to have a way of telling the device you're
> >> operating in such a "self-regulating" mode?  Or should you just do it?
> >>
> >> ie. should this be a feature bit?
> >
> > I think that target and ceiling are separate things. Target should
> > describe "current" or "desired" allocation. Ceiling should be a limit
> > enforced by device. Guest could not allocate more memory than ceiling
> > but should be able to set target above ceiling. In that case host
> > should allow guest allocation until ceiling but no more.
> Hard and soft limits are an interesting idea, but I'm not sure that it
> withstands scrutiny.  There is already a (platform-specific) ceiling
> mechanism for a guest, defined by its memory layout.
> Varying that hard limit is an interesting problem.  To reduce it, you
> need to decide how long to give it to reach that limit, and what happens
> if they don't do it.  With a soft limit, the implication is that pages
> will be swapped if/while it is above that limit.  I think that's
> slightly simpler than having both mechanisms.
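The target/ceiling distinction above can be sketched roughly as follows. This is purely illustrative; the struct and function names (`balloon_limits`, `alloc_allowed`) are hypothetical and do not appear in any spec draft:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical limits as discussed in the thread: a soft "target" the
 * guest should aim for, and a hard "ceiling" the device enforces. */
struct balloon_limits {
    uint64_t target;   /* desired guest allocation (soft, advisory) */
    uint64_t ceiling;  /* hard cap enforced by the device */
};

/* Device-side check: a guest allocation succeeds only while total usage
 * stays at or below the ceiling; the target is not enforced here, so the
 * guest may set a target above the ceiling and still be capped at it. */
static int alloc_allowed(const struct balloon_limits *lim,
                         uint64_t current, uint64_t request)
{
    return current + request <= lim->ceiling;
}
```

With only a soft limit (Rusty's proposal), the `ceiling` field and the check above disappear, and exceeding the target simply means pages may be swapped.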

Hmmm... I think there is a misunderstanding here. I looked at this
issue from the host's point of view. Anyway, my main concern is that the spec
should not block development of memory hotplug on the basis of this device.
In the case of the Linux kernel its implementation is quite easy, and I did it
for the Xen balloon driver, so I think a similar thing could be done for
this device. However, as I understand the balloon driver definition, it deals only
with memory allocated when a given VM was created. That is why I asked:
should we still call it a balloon device if we are going to support memory hotplug
in the driver? If so, the definition should be changed, or we should use another
name. Additionally, I think the host should have a mechanism to limit guest
memory usage, to prevent host memory exhaustion. Of course the
latter issue is out of scope of this spec, but I think we should at least
mention the problem.

> >> This doesn't make sense.  It's possible that the host has some memory in
> >> hugepages and some in smaller pages.  But if so, it needs to be able to
> >> say "give me 5 small pages and 1 huge page please".
> >>
> >> >> +1. A VIRTIO_BALLOON_REQ_RESIZE command indicates the balloon target
> >> >> +   size (in bytes) in the value field.  If the current balloon size is
> >> >> +   smaller than the target, the guest should add pages to the balloon
> >> >> +   as soon as possible.  If the current balloon is larger than the
> >> >> +   target, the guest may withdraw pages.
> >>
> >> So let's drop this, and have two commands:
> >>
> >> /* Give me more pages! */
> >>         u32 type;
> >>         u32 page_bits; // eg 12 == 4096.
> >>         u64 num_pages;
> >>
> >> And:
> >>
> >> /* You can take some back. */
> >>         u32 type;
> >>         u32 page_bits; // eg 12 == 4096.
> >>         u64 num_pages;
> >
> > General idea is OK but this way we back to relative requests.
> > IIRC we would like to avoid them.
> They do kind of suck, but if we want to deal with more than one page
> size at once we either need to publish several values, or use this
> mechanism.
> In practice there'll probably only be a couple of page sizes, but this
> seems the sweet-spot for simplicity.

My first idea was that a request should contain a list of desired page sizes
rather than a hard statement like "give me X pages of size Y". Here, at least,
we are not able to predict which pages are unused in the guest, so I think the
soft statement is more useful. But if we choose your proposal, I think
there should be a way to explicitly refuse such a hard request, or to fail
partially (e.g. the host requests 5 superpages and the guest is able to free only 2).
> >> Yet this requirement that pages be re-requested blocked one
> >> implementation attempt in Linux.  The old spec said you had to, and yet
> >> QEMU didn't actually care.  Nor any existing implementation.
> >
> > From which point of view? Guest or host? My earlier description
> > is from guest point of view. IIRC, old VIRTIO balloon driver
> > works in the same way (from guest point of view).
> From the guest.  Damn, I can't find the email thread, but my vague
> memory was that there was some advantage in having Linux simply use the
> pages without asking permission (as it has to now).
> I'd ping Paolo Bonzini <pbonzini@redhat.com>, "Sasha Levin"
> <sasha.levin@oracle.com> and Rafael Aquini <aquini@redhat.com>.

Please check linux/mm/balloon_compaction.c:balloon_page_enqueue() and
drivers/virtio/virtio_balloon.c:release_pages_by_pfn().
They do this in a slightly different way than the Xen balloon driver,
but the idea is the same.

> >> > Additionally, some hypervisors may require additional steps to add/remove
> >> > page to/from the pool (e.g. Xen PV guests must add/remove frames to P2M and M2P
> >> > lists/trees too). So implementation should be able to call hypervisor
> >> > specific stuff in such situations.
> >>
> >> The underlying assumption is that the hypervisor controls the mapping,
> >> so it can remove the page and fault one back in appropriately.  This
> >> isn't true for PV Xen of course.  Yet we can't put "do some hypervisor
> >> specific stuff here" in the spec.
> >>
> >> Let's step back a moment to look at the goals.  It's nice for PV Xen to
> >> have portable drivers, but by definition you can't run a generic guest
> >> in PV Xen.  So it's more about reducing the differences than trying to
> >> get a completely standardized guest.
> >>
> >> So it doesn't bother me a huge amount that a generic balloon driver
> >> won't work in a Xen PV guest.  In practice, the Linux driver might have
> >> hooks to support Xen PV, but it might be better to keep the Xen-specific
> >> balloon driver until Xen PV finally dies.
> >
> > So as I can see from this and others conversation we assume that PV guest
> > should die sooner or later and we are not going to support them. As I can
> > see the same goal is for Linux Kernel and Xen developers. However, just
> > in case I will confirm this with Xen guys.
> To be clear, as lguest author I mourn the loss of PV as much as anyone!
> But the page table tricks which made Xen-PV so brilliant make
> abstraction of a balloon device impossible, AFAICT :(

After some discussion with Konrad Wilk yesterday, I suppose
we will be forced to have some mechanism to cooperate
with different types of hypervisors. However, let's wait
for the first version of the draft and we will see what is needed.

