OASIS Mailing List ArchivesView the OASIS mailing list archive below
or browse/search using MarkMail.

 


Help: OASIS Mailing Lists Help | MarkMail Help

virtio-comment message

[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]


Subject: RE: [virtio-comment] Re: [PATCH v1 7/8] admin: Add write recording commands


> From: virtio-comment@lists.oasis-open.org <virtio-comment@lists.oasis-
> open.org> On Behalf Of Michael S. Tsirkin
> Sent: Monday, October 9, 2023 4:28 PM

> > May be we can use a different verb than record. May be tracking?
> > Record means, device accumulates these written pages addresses, and when
> driver queries, it returns these addresses and clears it internally within the
> device.
> 
> For example, what about two writes into same address. Do you get one record
> or two? What if the length is different?
One record.
Since the writes are tracked/recorded at page granularity, even if length is different, it still one entry.

> 
> 
> > > > Such an
> > > > +address is named as IO virtual address (IOVA).
> > >
> > > I don't know what does this have to do with IOVA. For that matter
> > > everything would have to be "IOVA". Spec calls these physical
> > > address and let's stick to that.
> > >
> > Make sense.
> > At device level it does not have knowledge of IOVA.
> > I will rename it.
> >
> > >
> > > > The owner driver enables write
> > > > +recording for one or more IOVA ranges per device during device
> > > > +migration flow. The owner driver periodically queries these
> > > > +written IOVA records from the device.
> > >
> > > periodical reads without any indication are the only option then?
> > >
> > At least for now, this is starting point. Software stack such as QEMU does it
> periodically.
> 
> So for CPU kvm switched to PML and this seems to work better - it guarantees
> there's convergence.
> 
My guess that PML is better because write protection related faults are removed now.
The approach here is similar, but there is no PML kind of queue.
It is relatively manageable to slow down cpu on VMEXIT or other ways, compare to external network.

And secondly, it is driven by the hypervisor cpu availability for capacity planning etc.
So its periodic in nature that share similar scheme like PML.


> > When new use case arise, may be it can be extended.
> >
> > > > As the driver reads the written IOVA records,
> > > > +the device clears those records from the device. Once the device
> > > > +reports zero or small number of written IOVA records, the device
> > > > +is set to \field{Stop} or \field{Freeze} mode. Once the device is
> > > > +set to \field{Stop}
> > > >  or \field{Freeze} mode, and once all the IOVA records are read,
> > > > the driver stops  the write recording in the device.
> > >
> > >
> > > it is not great that you are rewriting text you just wrote in patch
> > > 1 here. pls find a way not to make reviewers read everything twice.
> > >
> > There is small duplication of one line explaining mode change, rest is
> contextual to the write recording.
> > Merging with text of patch_1 was slightly complicated to read, so one
> sentence is duplicated.
> >
> > I will again check if patch_1 text extension is easier to read.
> 
> this is latex, space is ignored. if you are only changing one word just don't move
> the rest and diff will look sane
> 
Yeah, right. I will fix this.

> > > > @@ -118,6 +119,10 @@ \subsubsection{Device
> > > > Migration}\label{sec:Basic Facilities of a Virtio Device /  \item
> > > > Device Context Read Command \item Device Context Write Command
> > > > \item Device Context Discard Command
> > > > +\item Device Write Record Capabilities Query Command \item Device
> > > > +Write Records Start Command \item Device Write Records Stop
> > > > +Command \item Device Write Records Read Command
> > > >  \end{enumerate}
> > > >
> > > >  These commands are currently only defined for the SR-IOV group type.
> > > > @@ -307,6 +312,129 @@ \subsubsection{Device
> > > > Migration}\label{sec:Basic Facilities of a Virtio Device /
> > > > discarded, subsequent VIRTIO_ADMIN_CMD_DEV_CTX_WRITE command
> > > > writes a new device
> > > context.
> > > >
> > > > +\paragraph{Device Write Record Capabilities Query Command}
> > > > +\label{par:Basic Facilities of a Virtio Device / Device groups /
> > > > +Group administration commands / Device Migration / Device Write
> > > > +Record Capabilities Query Command}
> > > > +
> > > > +This command reads the device write record capabilities.
> > > > +For the command
> VIRTIO_ADMIN_CMD_DEV_WRITE_RECORD_CAP_QUERY,
> > > > +\field{opcode} is set to 0xd.
> > > > +The \field{group_member_id} refers to the member device to be
> accessed.
> > > > +
> > > > +\begin{lstlisting}
> > > > +struct virtio_admin_cmd_dev_write_record_cap_result {
> > > > +        le32 supported_iova_page_size_bitmap;
> > > > +        le32 supported_iova_ranges; }; \end{lstlisting}
> > > > +
> > > > +When the command completes successfully,
> > > > +\field{command_specific_result} is in the format \field{struct
> > > > +virtio_admin_cmd_dev_write_record_cap_result}
> > > > +returned by the device. The
> > > > +\field{supported_iova_page_size_bitmap}
> > > > +indicates the granularity at which the device can record IOVA ranges.
> > > > +the minimum granularity can be 4KB. Bit 0 corresponds to 4KB, bit
> > > > +1 corresponds to 8KB, bit 31 corresponds to 4TB. The device
> > > > +supports at least
> > > one page granularity.
> > > > +The device support one or more IOVA page granularity; for each
> > > > +IOVA page granularity, the device sets corresponding bit in the
> > > > +\field{supported_iova_page_size_bitmap}. The
> > > > +\field{supported_iova_ranges} indicates how many unique (non
> > > > +overlapping) IOVA ranges can be recorded by the device.
> > >
> > > what role does this granularity play? i see no mention of it down the road.
> > >
> > The page_size in struct virtio_admin_cmd_write_record_start_data must
> match to the granularity supplied above.
> > I missed it. Will add in v2.
> > This is very useful comment.
> 
> Not that it's very clear what does page_size do.
> 
Page_size is the granularity on which to record the writes.
For example, when page_size = 2MB, any writes are aligned to 2MB page boundary.
If 8KB data is written, only single write record entry reported.

If the page_size = 4K, two write record entries reported.
( I assumed 4K aligned address to keep the example simple).

> > >
> > > > +
> > > > +\paragraph{Device Write Records Start Command} \label{par:Basic
> > > > +Facilities of a Virtio Device / Device groups / Group
> > > > +administration commands / Device Migration / Device Write Records
> > > > +Start Command}
> > > > +
> > > > +This command starts the write recording in the device for the
> > > > +specified IOVA ranges.
> > > > +
> > > > +For the command VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_START,
> > > > +\field{opcode} is set to 0xe.
> > > > +The \field{group_member_id} refers to the member device to be
> accessed.
> > > > +
> > > > +The \field{command_specific_data} is in the format \field{struct
> > > > +virtio_admin_cmd_write_record_start_data}.
> > > > +
> > > > +\begin{lstlisting}
> > > > +struct virtio_admin_cmd_write_record_start_entry {
> > > > +        le64 iova;
> > > > +        le64 page_count;
> > > > +};
> > > > +
> > > > +struct virtio_admin_cmd_write_record_start_data {
> > > > +        le64 page_size;
> > > > +        le32 count;
> > > > +        u8 reserved[4];
> > > > +        struct virtio_admin_cmd_write_record_start_entry
> > > > +entries[]; };
> > > > +
> > > > +\end{lstlisting}
> > > > +
> > > > +The \field{count} is set to indicate number of valid \field{entries}.
> > > > +The \field{iova} indicates the start IOVA address. The
> > > > +\field{page_count} indicates number of pages of size
> > > > +\field{page_size} starting from \field{iova} to record for write
> > > > +reporting. VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_START
> > > > +command contains unique i.e. non overlapping IOVA range entries.
> > > > +Whenever a memory write occurs by the device in the supplied IOVA
> > > > +range, the device records the actual IOVA and number of bytes
> > > > +written to the
> > > IOVA.
> > > > +These write records can be read by the the driver using
> > > > +VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_READ command.
> > > > +
> > > > +This command has no command specific result.
> > > > +
> > > > +\paragraph{Device Write Record Stop Command} \label{par:Basic
> > > > +Facilities of a Virtio Device / Device groups / Group
> > > > +administration commands / Device Migration / Device Write Record
> > > > +Stop Command}
> > > > +
> > > > +This command stops the write recording in the device for IOVA
> > > > +ranges which were previously started using
> > > > +VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_START
> > > > +command.
> > > > +
> > > > +For the command VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_STOP,
> > > > +\field{opcode} is set to 0xf.
> > > > +The \field{group_member_id} refers to the member device to be
> accessed.
> > > > +
> > > > +This command does not have any command specific data.
> > > > +This command has no command specific result.
> > > > +
> > > > +\paragraph{Device Write Records Read Command} \label{par:Basic
> > > > +Facilities of a Virtio Device / Device groups / Group
> > > > +administration commands / Device Migration / Device Write Records
> > > > +Read Command}
> > > > +
> > > > +This command reads the device write records for which the write
> > > > +recording is previously started using
> > > VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_START command.
> > > > +
> > > > +For the command VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_READ,
> > > > +\field{opcode} is set to 0x10.
> > > > +The \field{group_member_id} refers to the member device to be
> accessed.
> > > > +
> > > > +\begin{lstlisting}
> > > > +struct virtio_admin_cmd_write_records_read_data {
> > > > +        le64 iova;
> > > > +        le64 length;
> > > > +};
> > > > +
> > > > +struct virtio_admin_cmd_dev_write_records_cnt {
> > > > +        le32 count;
> > > > +};
> > > > +
> > > > +struct virtio_admin_cmd_dev_write_records_result {
> > > > +        le64 iova_entries[];
> > > > +};
> > > > +\end{lstlisting}
> > > > +
> > > > +The \field{command_specific_data} is in the format \field{struct
> > > > +virtio_admin_cmd_write_records_read_data}. The driver sets the
> > > > +\field {iova} indicating the start IOVA address for up to the
> > > > +\field{length} number of bytes. The supplied IOVA range same or
> > > > +smaller than the range supplied when write recording is started
> > > > +by the driver in VIRTIO_ADMIN_CMD_DEV_WRITE_RECORDS_START
> command.
> > >
> > > Seems pretty sparse. Lots of hypervisors chose to implement a bit
> > > per page strategy.
> > This command result addresses will feed into such bit.
> 
> Do we want to return it in this format then?
I think yes, because converting to the bit is easy.
Reporting bit requires bitmap being function of VM memory and not based on amount of written pages.

> 
> > >
> > > > +
> > > > +When the command completes successfully,
> > > > +\field{command_specific_result} is in the format \field{struct
> > > > +virtio_admin_cmd_dev_write_records_result}
> > > > +and \field{command_specific_result} is in format of \field{struct
> > > > +virtio_admin_cmd_dev_write_records_cnt} containing number of
> > > > +write records returned by the device.
> > >
> > > what are these records though?
> > >
> > It is struct virtio_admin_cmd_dev_write_records_result.
> > I will rephrase it to link to struct virtio_admin_cmd_dev_write_records_result.
> >
> > >
> > > > When the command completes
> > > > +successfully, the write records which are returned in the result
> > > > +are cleared from the device and same records cannot be read
> > > > +again. When new writes occur at same IOVA range or at different
> > > > +once, those records can be read as new write records.
> > >
> > >
> > > this last sentence just confuses.
> > >
> > How about just keeping below text rewrite?
> > When the command completes successfully, the write records returned in the
> result are cleared from the device.
> 
> I think we need to explain in more detail what exactly is expected to be
> recorded and when.
> 
Ok. I will give second look now to improve this description.


[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index] | [List Home]