virtio-comment message



Subject: RE: [virtio-comment] Re: [PATCH v1 7/8] admin: Add write recording commands


> From: Michael S. Tsirkin <mst@redhat.com>
> Sent: Monday, October 9, 2023 9:46 PM
> 
> On Mon, Oct 09, 2023 at 11:48:46AM +0000, Parav Pandit wrote:
> >
> > > From: virtio-comment@lists.oasis-open.org
> > > <virtio-comment@lists.oasis-open.org> On Behalf Of Michael S.
> > > Tsirkin
> > > Sent: Monday, October 9, 2023 4:28 PM
> >
> > > > May be we can use a different verb than record. May be tracking?
> > > > Record means, device accumulates these written pages addresses,
> > > > and when
> > > driver queries, it returns these addresses and clears it internally
> > > within the device.
> > >
> > > For example, what about two writes into same address. Do you get one
> > > record or two? What if the length is different?
> > One record.
> > Since the writes are tracked/recorded at page granularity, even if length is
> different, it still one entry.
> 
> what if writes cover different pages but overlap?
From the device's point of view, if a write spans two pages (due to page_size), it produces two write records.
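A minimal sketch of this accounting, assuming page-granular recording (the helper name and signature are illustrative, not part of the spec): one write yields one record per page it touches, and repeated writes into the same page still map to that page's single record.

```c
#include <stdint.h>

/* Sketch (not from the spec): number of page-granularity write
 * records produced by a single write of `len` bytes at `addr`.
 * A write spanning two pages yields two records; two writes into
 * the same page yield one record regardless of length. */
static inline uint64_t write_record_count(uint64_t addr, uint64_t len,
                                          uint64_t page_size)
{
    uint64_t first_page = addr / page_size;
    uint64_t last_page  = (addr + len - 1) / page_size;
    return last_page - first_page + 1;
}
```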

> Don't bother answering here, I am implying this needs to be documented.
> 
> We could give devices a bit of a freeway here too, e.g. explain that device can
> combine two writes into one record but does not have to - might lead to better
> data structures internally.
> 
The device could do so if the write record entry also covered the written length.
Since known hypervisors work at a constant page size per guest, supplying the length did not seem useful enough.

> 
> > >
> > >
> > > > > > Such an
> > > > > > +address is named as IO virtual address (IOVA).
> > > > >
> > > > > I don't know what does this have to do with IOVA. For that
> > > > > matter everything would have to be "IOVA". Spec calls these
> > > > > physical address and let's stick to that.
> > > > >
> > > > Make sense.
> > > > At device level it does not have knowledge of IOVA.
> > > > I will rename it.
> > > >
> > > > >
> > > > > > The owner driver enables write
> > > > > > +recording for one or more IOVA ranges per device during
> > > > > > +device migration flow. The owner driver periodically queries
> > > > > > +these written IOVA records from the device.
> > > > >
> > > > > periodical reads without any indication are the only option then?
> > > > >
> > > > At least for now, this is starting point. Software stack such as
> > > > QEMU does it
> > > periodically.
> > >
> > > So for CPU kvm switched to PML and this seems to work better - it
> > > guarantees there's convergence.
> > >
> > My guess that PML is better because write protection related faults are
> removed now.
> > The approach here is similar, but there is no PML kind of queue.
> > It is relatively manageable to slow down cpu on VMEXIT or other ways,
> compare to external network.
> >
> > And secondly, it is driven by the hypervisor cpu availability for capacity
> planning etc.
> > So its periodic in nature that share similar scheme like PML.
> 
> With PML the entries are recorded into two places: dirty bit in PTE+log.
> If log fills up there's an exit. I don't exactly get where do you expect the device
> to record this apparently unbounded log.
>
It is abstract enough from the spec's point of view, just like where the RSS config, the MAC table, and the VLAN tables are stored.
It is similar to that.

> >
> > > > When new use case arise, may be it can be extended.
> > > >
> > > > > > As the driver reads the written IOVA records,
> > > > > > +the device clears those records from the device. Once the
> > > > > > +device reports zero or small number of written IOVA records,
> > > > > > +the device is set to \field{Stop} or \field{Freeze} mode.
> > > > > > +Once the device is set to \field{Stop}
> > > > > >  or \field{Freeze} mode, and once all the IOVA records are
> > > > > > read, the driver stops  the write recording in the device.
> > > > >
> > > > >
> > > > > it is not great that you are rewriting text you just wrote in
> > > > > patch
> > > > > 1 here. pls find a way not to make reviewers read everything twice.
> > > > >
> > > > There is small duplication of one line explaining mode change,
> > > > rest is
> > > contextual to the write recording.
> > > > Merging with text of patch_1 was slightly complicated to read, so
> > > > one
> > > sentence is duplicated.
> > > >
> > > > I will again check if patch_1 text extension is easier to read.
> > >
> > > this is latex, space is ignored. if you are only changing one word
> > > just don't move the rest and diff will look sane
> > >
> > Yeah, right. I will fix this.
> >
> > > > > > @@ -118,6 +119,10 @@ \subsubsection{Device
> > > > > > Migration}\label{sec:Basic Facilities of a Virtio Device /
> > > > > > \item Device Context Read Command \item Device Context Write
> > > > > > Command \item Device Context Discard Command
> > > > > > +\item Device Write Record Capabilities Query Command \item
> > > > > > +Device Write Records Start Command \item Device Write Records
> > > > > > +Stop Command \item Device Write Records Read Command
> > > > > >  \end{enumerate}
> > > > > >
> > > > > >  These commands are currently only defined for the SR-IOV group
> type.
> > > > > > @@ -307,6 +312,129 @@ \subsubsection{Device
> > > > > > Migration}\label{sec:Basic Facilities of a Virtio Device /
> > > > > > discarded, subsequent VIRTIO_ADMIN_CMD_DEV_CTX_WRITE
> command
> > > > > > writes a new device
> > > > > context.
> > > > > >
> > > > > > +\paragraph{Device Write Record Capabilities Query Command}
> > > > > > +\label{par:Basic Facilities of a Virtio Device / Device
> > > > > > +groups / Group administration commands / Device Migration /
> > > > > > +Device Write Record Capabilities Query Command}
> > > > > > +
> > > > > > +This command reads the device write record capabilities.
> > > > > > +For the command
> > > VIRTIO_ADMIN_CMD_DEV_WRITE_RECORD_CAP_QUERY,
> > > > > > +\field{opcode} is set to 0xd.
> > > > > > +The \field{group_member_id} refers to the member device to be
> > > accessed.
> > > > > > +
> > > > > > +\begin{lstlisting}
> > > > > > +struct virtio_admin_cmd_dev_write_record_cap_result {
> > > > > > +        le32 supported_iova_page_size_bitmap;
> > > > > > +        le32 supported_iova_ranges; }; \end{lstlisting}
> > > > > > +
> > > > > > +When the command completes successfully,
> > > > > > +\field{command_specific_result} is in the format
> > > > > > +\field{struct virtio_admin_cmd_dev_write_record_cap_result}
> > > > > > +returned by the device. The
> > > > > > +\field{supported_iova_page_size_bitmap}
> > > > > > +indicates the granularity at which the device can record IOVA ranges.
> > > > > > +the minimum granularity can be 4KB. Bit 0 corresponds to 4KB,
> > > > > > +bit
> > > > > > +1 corresponds to 8KB, bit 31 corresponds to 4TB. The device
> > > > > > +supports at least
> > > > > one page granularity.
> > > > > > +The device support one or more IOVA page granularity; for
> > > > > > +each IOVA page granularity, the device sets corresponding bit
> > > > > > +in the \field{supported_iova_page_size_bitmap}. The
> > > > > > +\field{supported_iova_ranges} indicates how many unique (non
> > > > > > +overlapping) IOVA ranges can be recorded by the device.
> > > > >
> > > > > what role does this granularity play? i see no mention of it down the
> road.
> > > > >
> > > > The page_size in struct virtio_admin_cmd_write_record_start_data
> > > > must
> > > match to the granularity supplied above.
> > > > I missed it. Will add in v2.
> > > > This is very useful comment.
> > >
> > > Not that it's very clear what does page_size do.
> > >
> > Page_size is the granularity on which to record the writes.
> > For example, when page_size = 2MB, any writes are aligned to 2MB page
> boundary.
> > If 8KB data is written, only single write record entry reported.
> >
> > If the page_size = 4K, two write record entries reported.
> > ( I assumed 4K aligned address to keep the example simple).
> 
> 
> But you also said internally device maintains a bitmap.
Hmm, no. Maybe my response was confusing.
I was referring to the bitmap of supported page sizes, not a bitmap of the written pages.
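To make the distinction concrete, here is a sketch of decoding \field{supported_iova_page_size_bitmap}, assuming bit i corresponds to a granularity of 4KB << i (as bit 0 = 4KB and bit 1 = 8KB in the patch text suggest); the helper name is hypothetical:

```c
#include <stdint.h>

/* Sketch: test whether a page_size is supported, assuming bit i of
 * supported_iova_page_size_bitmap means a granularity of 4KB << i
 * (bit 0 = 4KB, bit 1 = 8KB, ...). */
static inline int iova_page_size_supported(uint32_t bitmap,
                                           uint64_t page_size)
{
    for (unsigned i = 0; i < 32; i++)
        if ((bitmap & (1u << i)) && page_size == (4096ULL << i))
            return 1;
    return 0;
}
```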

> So it will have to work hard to find a set bit in the map then?
> Do we want to maybe give device an option to just return a bitmap and have
> driver worry about it?
> 
This would be an entirely different interface.
For a 64GB VM at 2MB page size, there are 32K bits to process.
And at 4KB page size, 16M bits to process.
And this cost is regardless of the amount of dirty pages actually tracked.

Bigger VMs find it even more difficult.
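The numbers above follow from bitmap size being a function of VM memory size alone, not of how much memory was written; a one-line sketch:

```c
#include <stdint.h>

/* Sketch: bits needed for a full dirty bitmap of a VM. Depends only
 * on VM memory size and page size, not on how many pages are dirty. */
static inline uint64_t dirty_bitmap_bits(uint64_t vm_bytes,
                                         uint64_t page_size)
{
    return vm_bytes / page_size;
}
```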

> To me this looks like an optimization for when when devices keep writing to the
> same page all the time?  Do you have data to show that's commonly the case?
> Instrumenting a driver would be one way to find out.
>
Often, for devices that do not do zero copy, or that use the page cache of a block device, the writes occur to the same pages repeatedly.

[..]
> > > Do we want to return it in this format then?
> > I think yes, because converting to the bit is easy.
> > Reporting bit requires bitmap being function of VM memory and not based
> on amount of written pages.
> 
> So, you are optimizing for when small # of bits are set.
The device can keep track of writes to a large amount of memory too.

> Do you have data to show it's common?
> 
> I would maybe structure it like this:
> 
> - bitmap used at all times
Bitmaps are fine for small VMs, as the number of bits is small.
A log is better when the VM has large memory, since the log is maintained only for the written pages instead of scaling with the full VM size.
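The trade-off can be sketched as follows, assuming (hypothetically) one 8-byte address per logged dirty page; the bitmap cost is fixed per VM size while the log cost scales with the number of written pages:

```c
#include <stdint.h>

/* Sketch of the bitmap-vs-log trade-off. The bitmap costs one bit
 * per VM page no matter what; the log (assumed 8 bytes per record)
 * costs memory only for pages actually written. */
static inline uint64_t bitmap_bytes(uint64_t vm_bytes, uint64_t page_size)
{
    return (vm_bytes / page_size) / 8;
}

static inline uint64_t log_bytes(uint64_t dirty_pages)
{
    return dirty_pages * 8;   /* assumed 8-byte record per dirty page */
}
```

Under these assumptions the log stays smaller as long as fewer than one in 64 VM pages is dirty, which matches the argument that a log wins for large VMs with modest write rates.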

> - log is maintained as long as it's not full
> - when log fills up just bitmap is used.
> 
> and at this point, we can maybe start with just a bitmap and add the log
> optimization separate, and optional?

