Subject: Re: [virtio] [OASIS Issue Tracker] Created: (VIRTIO-28) Implement new balloon device (ID 13)


Daniel Kiper <daniel.kiper@oracle.com> writes:
> Hi,
>
> Sorry for the late reply, but I am very busy now.

That's OK, thanks for making the time.

> On Tue, Oct 01, 2013 at 05:40:12PM +0930, Rusty Russell wrote:
>> Ok, so here's what I ended up with.
>>
>> Any feedback welcome...
>> Rusty.
>
> [...]
>
>> +100.2.4.5. Memory Balloon Device
>> +===========================
>> +
>> +The virtio memory balloon device is a primitive device for managing
>> +guest memory: the device asks for a certain amount of memory, and the
>> +guest supplies it.  This allows the guest to adapt to changes in
>> +allowance of underlying physical memory.  The device can also be used
>> +to communicate guest memory statistics to the host.
>
> Maybe the guest should also be able to set the balloon target. However, in
> that case the host should establish limits which cannot be exceeded, and the
> device should enforce them. That way the balloon could be controlled from the
> host and/or the guest as needed, and the balloon device would be responsible
> just for passing requests between guest and host and enforcing the limits.
> This way even memory hotplug could be implemented easily. However, in that
> case the device should not be called a balloon. Memory manager or memory
> allocator? Any better ideas?

If it's a purely guest-driven device, you don't need a target at all.  You
just have a driver which hands pages to the device.

You could operate the device in that way, of course, treating the target
as a ceiling.  Is it useful to have a way of telling the device you're
operating in such a "self-regulating" mode?  Or should you just do it?

ie. should this be a feature bit?
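
For concreteness, such a bit might look something like this (the name
and bit number are purely illustrative, not part of the draft):

	/* Hypothetical: the driver manages the balloon itself; any target
	 * from the device is only a ceiling, never a demand. */
	#define VIRTIO_BALLOON_F_SELF_REGULATING	0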

>> +The read-only configuration field indicates the granularity of memory
>> +which can be added to the balloon.  This typically reflects the
>> +page size of the host (eg. 12 for 4096-byte pages).
>> +
>> +	struct virtio_balloon_config {
>> +		u32 page_bits;
>> +	}
>
> Why must the balloon device be forced to use only one page size? I think the
> configuration area should list which page sizes may be requested by the
> device. The device should be able to request any allowed/defined size, but
> the driver could reject a request or satisfy it only partially. Additionally,
> maybe the device should advertise the allowed page sizes explicitly as sizes
> instead of as a number of bits, so maybe it is worth storing page sizes as
> u64. That way we could also store page sizes which are not a full power of 2
> (which could be useful for some strange superpage sizes in the future if
> something crazy happens). However, if we store page sizes as a number of bits
> we can represent larger sizes. Hmmm...
>
> We probably won't implement the above-mentioned feature at first, but
> it gives us a chance to do so later.

I don't see non-power-of-two pages happening.

But it makes sense to put the page size in each request.  A bit more
painful to implement, since the driver can't know in advance that it
doesn't support a request.

>> +100.2.4.5.5. Device Initialization
>> +-----------------------------
>> +
>> +1. At least one struct virtio_balloon_request buffer should be placed
>> +   in the inputq.
>> +
>> +2. The balloon starts empty (size 0).
>> +
>> +100.2.4.5.6. Device Operation
>> +------------------------
>> +
>> +The device is driven by receipt of a command in the input queue:
>> +
>> +	struct virtio_balloon_req {
>> +#define VIRTIO_BALLOON_REQ_RESIZE	0
>> +#define VIRTIO_BALLOON_REQ_STATS	1
>> +		u32 type;
>> +		u32 reserved;
>> +		u64 value;
>> +	}
>
> struct virtio_balloon_pages {
> #define VIRTIO_BALLOON_REQ_RESIZE    0
> #define VIRTIO_BALLOON_REQ_STATS     1
>   u32 type;
>   u32 reserved;
>   u64 guest_memory_size;
>   u64 page_sizes[];
> }; ???

This doesn't make sense.  It's possible that the host has some memory in
hugepages and some in smaller pages.  But if so, it needs to be able to
say "give me 5 small pages and 1 huge page please".

>> +1. A VIRTIO_BALLOON_REQ_RESIZE command indicates the balloon target
>> +   size (in bytes) in the value field.  If the current balloon size is
>> +   smaller than the target, the guest should add pages to the balloon
>> +   as soon as possible.  If the current balloon is larger than the
>> +   target, the guest may withdraw pages.

So let's drop this, and have two commands:

/* Give me more pages! */ 
VIRTIO_BALLOON_REQ_FILL:
        u32 type;
        u32 page_bits; // eg 12 == 4096.
        u64 num_pages;

And:

/* You can take some back. */ 
VIRTIO_BALLOON_REQ_RELEASE:
        u32 type;
        u32 page_bits; // eg 12 == 4096.
        u64 num_pages;
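
Written out in the same style as the earlier struct virtio_balloon_req
(just a sketch of the above, the field layout is not final):

	struct virtio_balloon_req {
	#define VIRTIO_BALLOON_REQ_FILL		0
	#define VIRTIO_BALLOON_REQ_RELEASE	1
		u32 type;
		u32 page_bits;	/* eg. 12 == 4096-byte pages */
		u64 num_pages;
	};

So "give me 5 small pages and 1 huge page" becomes two FILL requests:
one with page_bits 12, num_pages 5, and (assuming 2MB hugepages) one
with page_bits 21, num_pages 1.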

>> +2. To add pages to the balloon, the physical addresses of the pages
>
> frames ???

Define frame?  It's an array of page physical addresses.  Was that unclear?

>> +   are sent using the output queue.  The number of pages is implied in
>> +   the message length, and each page value must be a multiple of the
>> +   page size indicated in struct virtio_balloon_config.
>> +
>> +	struct virtio_balloon_pages {
>> +#define VIRTIO_BALLOON_RESP_PAGES	0
>> +		u32 type; // VIRTIO_BALLOON_RESP_PAGES
>> +		u64 page[];
>> +	};
>
> struct virtio_balloon_pages {
>   u32 type; // VIRTIO_BALLOON_RESP_PAGES
>   u64 page_size;
>   u64 frames[];
> }; ???
>
>> +3. To withdraw a page from the balloon, it can simply be accessed.
>
> IIRC, ballooned pages are first reserved and later their frames are returned
> to the host. So if you would like to use the pages again you must do the
> above steps in reverse. Hence, "it can simply be accessed" is a bit
> misleading. Maybe it should be phrased in the following way: the page should
> have a frame number reassigned and later be returned to the pool of free pages.

Yet this requirement that pages be re-requested blocked one
implementation attempt in Linux.  The old spec said you had to, and yet
QEMU didn't actually care.  Nor did any existing implementation.
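
Under the "simply accessed" semantics the deflate path needs no device
round trip at all.  Roughly (a sketch only; the page list, lock and
field names are driver-private and purely illustrative):

	/* Withdraw one page: just stop tracking it.  The host faults the
	 * page back in on first access, so no outputq message is needed. */
	static void balloon_withdraw_page(struct balloon *b)
	{
		struct page *page;

		spin_lock(&b->lock);
		page = list_first_entry_or_null(&b->pages, struct page, lru);
		if (page) {
			list_del(&page->lru);
			b->num_pages--;
		}
		spin_unlock(&b->lock);

		if (page)
			__free_page(page);	/* usable again once touched */
	}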

> Additionally, some hypervisors may require additional steps to add/remove a
> page to/from the pool (e.g. Xen PV guests must add/remove frames to/from the
> P2M and M2P lists/trees too). So the implementation should be able to call
> hypervisor-specific stuff in such situations.

The underlying assumption is that the hypervisor controls the mapping,
so it can remove the page and fault one back in appropriately.  This
isn't true for PV Xen of course.  Yet we can't put "do some hypervisor
specific stuff here" in the spec.

Let's step back a moment to look at the goals.  It's nice for PV Xen to
have portable drivers, but by definition you can't run a generic guest
in PV Xen.  So it's more about reducing the differences than trying to
get a completely standardized guest.

So it doesn't bother me a huge amount that a generic balloon driver
won't work in a Xen PV guest.  In practice, the Linux driver might have
hooks to support Xen PV, but it might be better to keep the Xen-specific
balloon driver until Xen PV finally dies.

Cheers,
Rusty.


