libvirt: an “embedded” QEMU driver mode for isolated usage

Posted: February 5th, 2020 | Filed under: Coding Tips, Fedora, libvirt, Virt Tools

Since the project’s creation about 14 years ago, libvirt has grown enormously. In that time there has been a lot of code refactoring, but these were always fairly evolutionary changes; there has been little revolutionary change to the overall system architecture or to the core technical decisions made early on. This blog post is one of a series examining recent technical decisions that can be considered more revolutionary to libvirt. This was the topic of a talk given at KVM Forum 2019 in Lyon.

Historical driver architecture

Historically the local stateful drivers in libvirt have supported one or two modes of access:

  • “system mode” – privileged libvirtd running as root, global per host
  • “session mode” – unprivileged libvirtd, isolated to individual non-root users

Within the context of each daemon, VM name uniqueness is enforced. Operating via the daemon means that all applications connected to that same libvirtd get the same world view. This single world view is exactly what you want when dealing with server / cloud / desktop virtualization, because it means tools like ‘virt-top’, ‘virt-viewer’ and ‘virsh’ can see the same VMs as virt-manager / oVirt / OpenStack / GNOME Boxes / etc.

There are other use cases for virtualization, however, where this single world view across applications may be much less desirable. Instead of spawning VMs for the purpose of running a full guest operating system, the VM is used as a building block for an application-specific use case. I describe these use cases as “embedded virtualization”, with the libguestfs project being a well-known, long-standing example. This uses a VM as a way to confine execution of its appliance, allowing safe manipulation of disk images. The libvirt-sandbox project is another example, which provides a way to take binaries installed on the host OS and directly execute them inside a virtual machine, using 9p filesystem passthrough. More recently the Kata project aims to provide a Docker-compatible container runtime built using KVM.

In many, but not necessarily all, of these applications, it is unhelpful for the KVM instances that are launched to become visible to other applications like virt-manager / OpenStack. For example, if Nova sees a libguestfs VM running in libvirt it won’t be able to correlate this VM with its own world view. There have been cases where a management application would try to destroy these externally launched VMs in order to reconcile its world view.

There are other practicalities to consider when using a shared daemon like libvirtd. Each application has to ensure it creates a sensible unique name for each virtual machine, that won’t clash with names picked by other applications. Then there is the question of cleaning up resources such as log files left over from short lived VMs.

When spawning KVM via a separate daemon, the QEMU process is daemonized, such that it is disassociated from both libvirtd and the application which spawned it. It will only be cleaned up by an explicit API call to destroy it, or by the guest application shutting it down. For embedded use cases, it would be helpful if the VM would automatically die when the application which launched it dies. Libvirt introduces a notion of “auto destroy” to associate the lifetime of a VM with the client socket connection. It would be simpler if the VM process were simply in the same process group as the application, allowing normal OS level process tree pruning. The disassociated process context means that the QEMU process also loses the cgroup & namespace placement of the application using it.
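To make the process group point concrete, here is a minimal sketch in Python. It is not libvirt code: a sleeping Python child stands in for QEMU, and the helper names are hypothetical. It only shows that a child started without daemonizing stays in the launching application's process group on a POSIX system.

```python
import os
import subprocess
import sys


def spawn_in_same_group(argv):
    """Start a child without daemonizing it or creating a new session.

    subprocess.Popen leaves the child in the caller's process group by
    default, which is the relationship the embedded use case wants.
    """
    return subprocess.Popen(argv)


def shares_process_group(child):
    """Return True if the child is in the same process group as this process."""
    return os.getpgid(child.pid) == os.getpgrp()


# Stand-in for a QEMU process (hypothetical workload): a child that just sleeps.
child = spawn_in_same_group([sys.executable, "-c", "import time; time.sleep(30)"])
try:
    same_group = shares_process_group(child)
finally:
    # Clean up the stand-in child explicitly so the sketch leaves nothing behind.
    child.terminate()
    child.wait()
```

A daemonized process, by contrast, deliberately leaves the launcher's session and process group, which is why a signal aimed at the application's group never reaches it.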

An initial embedded libvirt driver

A possible answer to all these problems is to introduce the notion of an “embedded mode” for libvirt drivers. When using a libvirt driver in this mode, there is no libvirtd daemon involved; instead the libvirt driver code is loaded into the application process itself. In embedded mode the libvirt driver operates against a custom directory prefix for reading and writing config / state files. The directory is private to each application which has an instance of the embedded driver open. Since the libvirt driver is directly loaded into the application, there is no RPC service exposed and thus there is no way to use virsh and other tools to access the driver. This is important to remember because it means there is no way to debug problems with embedded VMs using normal libvirt tools. For some applications this is acceptable as the VMs are short-lived & throwaway, but for others this restriction might be unacceptable.
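As a rough illustration of what opening an embedded connection could look like from Python, here is a sketch under some assumptions. It assumes the connection URI takes the form qemu:///embed?root=DIR as described in the libvirt QEMU driver documentation, and that the libvirt-python bindings are installed. The helper names are hypothetical, and the libvirt import is deferred so the URI helper can be used on its own.

```python
import os
from urllib.parse import quote


def embedded_qemu_uri(root_dir):
    """Build a connection URI for the embedded QEMU driver.

    Assumption: the URI has the form qemu:///embed?root=<absolute dir>,
    where the directory is the application's private config/state prefix.
    """
    root = os.path.abspath(root_dir)
    # Percent-encode anything unusual in the path but keep the slashes.
    return "qemu:///embed?root=" + quote(root, safe="/")


def open_embedded(root_dir):
    """Open an embedded connection (requires the libvirt-python bindings)."""
    import libvirt  # imported lazily; not needed for building the URI

    # Assumption: an event loop implementation must be registered before
    # the connection is opened. The application is also expected to keep
    # the loop running, e.g. by calling libvirt.virEventRunDefaultImpl()
    # repeatedly from a dedicated thread.
    libvirt.virEventRegisterDefaultImpl()
    return libvirt.open(embedded_qemu_uri(root_dir))
```

Each application would pass its own directory, which is what keeps its VMs invisible to other applications.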

At the time of writing this post, support for embedded QEMU driver connections has merged to git master, and will be released in 6.1.0. In order to enable use of encrypted disks, there is also support for an embedded secret driver. The embedded driver feature is considered experimental initially, and so contrary to normal libvirt practice we’re not providing a strong upgrade compatibility guarantee. The API and XML formats won’t change, but the behavior of the embedded driver may still change.

Along with the embedded driver mode comes a new command line tool called virt-qemu-run. This is a simple tool using the embedded QEMU driver to run a single QEMU virtual machine, automatically exiting when QEMU exits, or tearing down QEMU if the tool exits abnormally. This can be used directly by users for self-contained virtual machines, but it also serves as an example of how to use the embedded driver and has been important for measuring startup performance. This tool is also considered experimental and so its CLI syntax is subject to change in future.
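For applications that would rather shell out to the tool than link against libvirt, a small wrapper could look like the sketch below. The option names (--root=DIR and --verbose, followed by the guest XML file) are an assumption based on the tool's manual page at the time of writing; as noted above, the CLI is experimental and may change. The function names are hypothetical.

```python
import shutil
import subprocess


def virt_qemu_run_argv(guest_xml, root_dir=None, verbose=False):
    """Build an argument vector for virt-qemu-run.

    Assumption: the tool accepts --root=DIR and --verbose, and takes the
    guest XML file as its positional argument.
    """
    argv = ["virt-qemu-run"]
    if root_dir is not None:
        argv.append("--root=" + root_dir)
    if verbose:
        argv.append("--verbose")
    argv.append(guest_xml)
    return argv


def run_guest(guest_xml, root_dir=None):
    """Run the guest and return the tool's exit status once it finishes."""
    if shutil.which("virt-qemu-run") is None:
        raise FileNotFoundError("virt-qemu-run is not installed")
    return subprocess.run(virt_qemu_run_argv(guest_xml, root_dir)).returncode
```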

In general the embedded mode drivers should offer the same range of functionality as the main system or session modes in libvirtd. To learn more about their usage and configuration, consult the three pages linked in the above paragraphs.

Further development work

During development of the embedded driver, one of the problems that quickly became apparent was the time required to launch a virtual machine. When libvirtd starts up, one of the things it does is to probe all installed QEMU binaries to determine what features they support. This can take 300-500 milliseconds per binary, which doesn’t sound like much, but if you have all 30 QEMU binaries installed this adds up to roughly 10-15 seconds. The results of this probing are cached, avoiding repeated performance hits until something changes which would invalidate the information. The caching doesn’t help the embedded driver case though, because it is using a private directory tree for state and thus doesn’t see the cache from the system / session mode drivers. To deal with this problem the QEMU driver startup process was significantly refactored such that probing of QEMU binaries is delayed until the data is actually needed. This massively helps both the new embedded mode and existing system/session modes.
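The difference between eager and delayed probing can be sketched in a few lines of Python. This is an illustration of the idea only, not libvirt's implementation: the class, the function names and the probe callback are all hypothetical.

```python
def eager_startup_cost_ms(num_binaries, probe_ms):
    """Total time spent if every installed binary is probed at startup."""
    return num_binaries * probe_ms


class LazyCapsCache:
    """Probe a QEMU binary only the first time its capabilities are needed."""

    def __init__(self, probe):
        self._probe = probe      # callable: binary path -> capabilities
        self._cache = {}
        self.probe_count = 0     # how many slow probes have actually run

    def get(self, binary):
        if binary not in self._cache:
            self._cache[binary] = self._probe(binary)
            self.probe_count += 1
        return self._cache[binary]
```

With the figures quoted above, eager_startup_cost_ms(30, 300) and eager_startup_cost_ms(30, 500) give 9000 and 15000 milliseconds, whereas an application that only ever starts x86_64 guests pays for a single probe with the lazy approach.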

Unfortunately it is fairly common for applications to query the libvirt host capabilities, and the returned data is required to report on all QEMU binaries, thus triggering the slow probing operation. There is a new API which allows probing of a single QEMU binary, which applications are increasingly using, but there are still valid use cases for the general host capabilities information. To address the inherent design limitations of the current API, one or more replacements are required to allow more targeted information reporting and so avoid the mass QEMU probe.
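The post does not name the single-binary API. The sketch below assumes it is the domain capabilities call, exposed in the Python bindings as virConnect.getDomainCapabilities(emulatorbin, arch, machine, virttype, flags). The helper name is hypothetical, and it accepts any connection-like object so that it is easy to exercise without a running hypervisor.

```python
import xml.etree.ElementTree as ET


def probe_single_binary(conn, emulator, arch, virttype="kvm", machine=None):
    """Ask for the capabilities of one emulator binary only.

    Assumption: conn is a libvirt connection (or a stand-in) providing
    getDomainCapabilities(); contrast with conn.getCapabilities(), which
    has to report on every installed QEMU binary.
    """
    xml = conn.getDomainCapabilities(emulator, arch, machine, virttype, 0)
    root = ET.fromstring(xml)
    # Pull out the identifying fields from the <domainCapabilities> document.
    return {
        "path": root.findtext("path"),
        "domain": root.findtext("domain"),
        "machine": root.findtext("machine"),
        "arch": root.findtext("arch"),
    }
```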

Attention will then need to switch to optimizing the startup procedure for spawning QEMU. There is one key point where libvirt uses QMP to ask the just-launched QEMU what CPU features it has exposed to the guest OS. This results in a huge number of QMP calls, one for each CPU feature. This needs to be optimized, ideally down to a single QMP call, which might require QEMU enhancements to enable libvirt to get the required information more efficiently.
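To give a feel for why this is costly, the sketch below builds the kind of per-feature request stream described above. It assumes each feature is read as a QOM property with the QMP qom-get command; the object path and the feature names in the usage note are placeholders, not what libvirt actually sends.

```python
import json


def per_feature_queries(cpu_path, features):
    """Build one QMP qom-get request per CPU feature (illustrative only).

    Each request costs a full round trip to QEMU, so the number of
    messages grows linearly with the number of features queried.
    """
    return [
        json.dumps({
            "execute": "qom-get",
            "arguments": {"path": cpu_path, "property": feature},
        })
        for feature in features
    ]
```

Calling per_feature_queries("/machine/unattached/device[0]", ["avx2", "sse4.2", "vmx"]) yields three separate messages. A real CPU model has far more features than that, which is where the startup time goes.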

One of the goals of the embedded driver is to have the QEMU process inherit the application’s process context (cgroups, namespaces, CPU affinity, etc) by default and keep QEMU as a child of the application process. This does not currently happen as the embedded driver is re-using the existing startup code which moves QEMU into dedicated cgroups and explicitly resets CPU affinity, as well as daemonizing QEMU. The need to address these problems is one of the reasons the embedded mode is marked experimental with behaviour subject to change.
