Subject: [PATCH v4 0/8] Introduce device migration support commands
This series introduces administration commands for member device
migration for the PCI transport; when needed it can be extended to other
transports too. It takes inspiration from the similar idea presented at
KVM Forum at [1].

Use case requirements:
======================
1. A hypervisor system needs to provide a PCI VF as a directly mapped
   device to the guest virtual machine and also support live migration
   of this virtual machine. A direct mapped device typically has only
   the PCI configuration space and MSI-X table emulated by the
   hypervisor. No virtio native interface offered by the virtio member
   device is trapped and/or emulated. This includes using the member
   device's native virtio common and device config regions, the device
   specific cvq, and the data vqs without any VMEXIT from the guest
   virtual machine and without any device type specific code in the
   hypervisor; this is because such code is already present natively in
   the owner and member device as a unified interface for guest virtual
   machines, containers and possibly more use cases.
2. A virtual machine may have one or more such passthrough virtio
   devices.
3. A virtual machine may have other directly mapped PCI devices which
   may also interact with the virtio device.
4. A hypervisor runs a generic, device type agnostic driver with an
   extension to support device migration.
5. A PCI VF direct mapped device needs to support transparent device
   reset and PCI FLR while the device migration is ongoing.
6. The owner driver is not involved in mediating device operations for
   the direct mapped device at the virtio interface level.
7. The mechanism is generic enough to apply to a large family of virtio
   devices and does not involve trapping any virtio device interfaces
   for the direct mapped devices.

Overview:
=========
The above use case requirements are met by the PCI PF group owner driver
facilitating the member device migration functionality using
administration commands.

There are three major functionalities.

1. Suspend and resume the device operation
2. Read and write the device context containing all the information
   that can be transferred from source to destination to migrate to a
   member device
3. Track pages written by the device while device migration is ongoing

This comprehensive series introduces 4 infrastructure pieces covering
PCI transport, peer to peer PCI devices, page write tracking (aka dirty
page tracking) and generic virtio device context.

1. Device mode get, set (active, stop, freeze)
2. Device context read and write
3. Defines device context and compatibility command
4. Write reporting to track page addresses

This series enables virtio PCI SR-IOV member device to member device
migration. It can also be used to migrate between a PCI SR-IOV member
device and a software composed PCI device, if and when needed, where the
software can parse the device context and compose a software based PCI
virtio device.

This can also be useful for accessing member devices using some variant
of data path only acceleration instead of direct mapped functionality.

In future, a nested environment may be able to utilize the same
infrastructure with a VF capable of supporting nested VFs with SR-IOV
capability. Such a novel approach is also being worked on in the
industry at [2].

Page write recording functionality is optional for the device. If the
platform supports it, it can be used from the platform; otherwise, if
the device supports it, it can be used from the device.

Example basic flow:
===================
Source hypervisor:
1. Instructs the device to start tracking pages it is writing
2. Periodically queries the addresses of the written pages
3. Suspends the device operation
4. Reads the device context and transfers it to the destination
   hypervisor

Destination hypervisor:
5. Writes the device context received from the source
6. Resumes the device that has the newly written device context

Example advanced flow with small downtime:
==========================================
Source hypervisor:
1. Instructs the device to start tracking pages it is writing
2. Reads the current device context and transfers it to the destination
   hypervisor
   a. Destination hypervisor writes the context and sets up the device
3. Periodically queries the addresses of the written pages
4. Suspends the device operation
5. Reads the final device context and transfers it to the destination
   hypervisor

Destination hypervisor:
6. Writes the incremental device context received from the source
7. Resumes the device that has the newly written device context
   (likely a very small context that contains only virtqueue descriptor
   indices and in-flight descriptor indices)

Patch summary:
==============
patch-1: Adds theory of operation for device migration commands
patch-2: Redefine reserved2 to command output field
patch-3: Defines short device context for split virtqueues
patch-4: Adds device migration commands
patch-5: Adds requirements for device migration commands
patch-6: Adds theory of operation for write reporting commands
patch-7: Adds write reporting commands
patch-8: Adds requirements for write reporting commands

Please review.

Changelog:
==========
v3->v4:
- updated commit message to be more precise about mapping a PCI VF
  member device to a guest virtual machine
- skipped adding msix table and pba table as they are mainly programmed
  by the hypervisor, which has the knowledge of vcpus, and they contain
  system specific encoded information which may be different on the
  destination hypervisor
- moved vq enabled field to vq config struct so that it can be used for
  packed vq and split vq uniformly
- reduced length in the field tlv from 64-bit to 32-bit to be realistic
- added missing num_entries in the feature bits tlv
- fixed copy paste error on structure for
  VIRTIO_DEV_CTX_VQ_SPLIT_DEV_OWN_DESC field
- corrected discard field type location and enum values
- enhanced queue context
- made device context read command more flexible for device
  implementations
- improved read command to finish quickly during freeze mode to have
  the lowest possible downtime
- improved description for query supported fields command and device
  context to report the length in it
- moved read context command dependency on output field
- added remaining context size bytes in read response so that after the
  device mode is changed to freeze, the driver can get the most accurate
  information quickly to decide when to stop reading the context from
  the device
- relaxed device context formation and processing to simplify the
  device implementation

v2->v3:
- updated cover letter to reflect the use for passthrough and for only
  data path acceleration
- updated cover letter to utilize same infra for nested
- Addressed comments from Michael
- updated read context command to not depend on returned data length
  for closing the read context stream, instead depend on explicit read
  response with zero length
- fixed copy paste errors in write context command for fields
  description
- added device and driver normatives for {_START,_END}_MARKER fields
- wrote member VF device instead of VF device

v1->v2:
- Addressed comments from Michael and Jason
- replaced iova with page/physical address range in write recording
  commands
- added several device specific requirements to clarify the interaction
  of device reset, FLR, PCI PM and admin commands
- added device context fields query command to learn compatibility
- split device context field type range into generic and device
  specific
- added device context extension section to maintain backward and
  future compatibility
- several rewordings in theory of operation
- added requirements to cover config space read/write interaction with
  device context commands
- added assumption about pci config space and msix table not being
  present in device context, which can be added when the hypervisor
  needs them

v0->v1:
- enriched device context to cover device configuration layout, feature
  bits
- fixed alignment of device context fields
- added missing Sign-off for the joint work done with Satananda
- added link to the github issue

[1] https://static.sched.com/hosted_files/kvmforum2022/3a/KVM22-Migratable-Vhost-vDPA.pdf
[2] https://netdevconf.info/0x17/sessions/talk/unleashing-sr-iov-offload-on-virtual-machines.html

Parav Pandit (8):
  admin: Add theory of operation for device migration
  admin: Redefine reserved2 as command specific output
  device-context: Define the device context fields for device migration
  admin: Add device migration admin commands
  admin: Add requirements of device migration commands
  admin: Add theory of operation for write recording commands
  admin: Add write recording commands
  admin: Add requirements of write reporting commands

 admin-cmds-device-migration.tex | 690 ++++++++++++++++++++++++++++++++
 admin.tex                       |  40 +-
 content.tex                     |   1 +
 device-context.tex              | 258 ++++++++++++
 4 files changed, 982 insertions(+), 7 deletions(-)
 create mode 100644 admin-cmds-device-migration.tex
 create mode 100644 device-context.tex

-- 
2.34.1
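
For readers who want to see the example basic flow from this cover letter
as code, here is a minimal sketch. It is a toy in-memory simulation, not
the admin command interface defined by the patches: the class
FakeMemberDevice, the function migrate_basic, the context key
"vq0_avail_idx" and the choice of FREEZE as the "suspended" mode are all
assumptions made for illustration. Only the mode names (active, stop,
freeze) and the order of the steps come from the text above.

```python
from dataclasses import dataclass, field
from enum import Enum


class Mode(Enum):
    # Mode names are from the cover letter; their semantics here are assumed.
    ACTIVE = "active"
    STOP = "stop"
    FREEZE = "freeze"


@dataclass
class FakeMemberDevice:
    """Hypothetical stand-in for a member device; not a real driver API."""
    mode: Mode = Mode.ACTIVE
    context: dict = field(default_factory=dict)
    tracking: bool = False
    written: set = field(default_factory=set)

    def dma_write(self, page_addr: int) -> None:
        # Simulates the device writing a guest page while it is active.
        if self.mode is not Mode.ACTIVE:
            raise RuntimeError("device is suspended")
        if self.tracking:
            self.written.add(page_addr)

    def start_write_tracking(self) -> None:
        self.tracking = True

    def query_written_pages(self) -> set:
        # Report and clear the pages written since the previous query.
        pages, self.written = self.written, set()
        return pages

    def read_context(self) -> dict:
        return dict(self.context)

    def write_context(self, ctx: dict) -> None:
        self.context.update(ctx)


def migrate_basic(src: FakeMemberDevice, dst: FakeMemberDevice, activity) -> set:
    """Walk the six steps of the example basic flow; return all dirty pages."""
    dirty = set()
    src.start_write_tracking()              # step 1: start tracking writes
    for page in activity:                   # device keeps running meanwhile
        src.dma_write(page)
        dirty |= src.query_written_pages()  # step 2: periodic query
    src.mode = Mode.FREEZE                  # step 3: suspend (assumed mapping)
    dirty |= src.query_written_pages()      # collect anything still pending
    ctx = src.read_context()                # step 4: read and "transfer"
    dst.write_context(ctx)                  # step 5: write on destination
    dst.mode = Mode.ACTIVE                  # step 6: resume destination
    return dirty


src = FakeMemberDevice(context={"vq0_avail_idx": 7})
dst = FakeMemberDevice(mode=Mode.FREEZE)
dirty = migrate_basic(src, dst, [0x1000, 0x2000, 0x1000])
```

The sketch keeps the source frozen after the transfer and only activates
the destination once its context has been written, mirroring the ordering
of steps 3 to 6. The advanced flow would differ by reading the context
once before suspending and sending only the remainder afterwards.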