What benefits does libvirt offer to developers targeting QEMU+KVM?
A common question from developers and users new to libvirt is what benefit it offers over scripting or developing directly against QEMU/KVM. This blog posting attempts to answer that by outlining features exposed in the libvirt QEMU/KVM driver that would not be automatically available to users of the lower level QEMU command line/monitor.
“All right, but apart from the sanitation, medicine, education, wine, public order, irrigation, roads, the fresh water system and public health, what have the Romans ever done for us?”
Insurance against QEMU/KVM being replaced by new $SHINY virt thing
Linux virtualization technology has been through many iterations. In the beginning there was User Mode Linux, which was widely used by many ISPs offering Linux virtual hosting. Then along came Xen, which was shipped by many enterprise distributions and replaced much usage of UML. QEMU has been around for a long time, but when KVM was added to the Linux kernel and integrated with QEMU, it became the new standard for Linux host virtualization, often replacing the usage of Xen. Application developers may think
“QEMU/KVM is the Linux virtualization standard for the future, so why do we need a portable API?”
To think this way is to ignore the lessons of history: every virtualization technology that has existed in Linux thus far has been replaced, rewritten or obsoleted. The current QEMU/KVM userspace implementation may turn out to be the exception to the rule, but do you want to bet on that? There is already a new experimental KVM userspace application being developed by a group of LKML developers which may one day replace the current QEMU based KVM userspace. libvirt exists to provide application developers an insurance policy should this come to pass, by isolating application code from the specific implementation of the low level userspace infrastructure.
Long term stable API and XML format
The libvirt API is an append-only API, so once added, an existing API will never be removed or changed. This ensures that applications can expect to operate unchanged across multiple major RHEL releases, even if the underlying virtualization technology changes the way it operates. Likewise the XML format is append-only, so even if the underlying virt technology changes its configuration format, apps will continue to work unchanged.
Automatic compliance with changes in KVM command line syntax “best practice”
As new releases of KVM come out, the “best practice” recommendations for invoking KVM processes evolve. The libvirt QEMU driver attempts to always follow the latest “best practice” for invoking the KVM version it finds.
Example: Disk specification changes
- v1: Original QEMU syntax:
-hda filename
- v2: RHEL5 era KVM syntax
-drive file=filename,if=ide,index=0
- v3: RHEL6 era KVM syntax
-drive file=filename,id=drive-ide0-0-0,if=none \
  -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide-0-0-0
- v4: (possible) RHEL-7 era KVM syntax:
-blockdev ...someargs... \
  -device ide-drive,bus=ide.0,unit=0,blockdev=bdev-ide0-0-0,id=ide-0-0-0
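The version-dependent choice above can be sketched as a small lookup. This is only an illustration in Python, not libvirt's implementation: the `disk_args` helper and the `generation` labels are hypothetical names, and the argument strings are copied from the examples above. The possible v4 `-blockdev` form is left out because its arguments are not specified.

```python
def disk_args(generation, filename):
    """Return the QEMU/KVM argv fragment for the first IDE disk.

    `generation` is a hypothetical label ("v1", "v2", "v3") matching the
    syntax eras listed above; a real driver would probe the binary instead.
    """
    if generation == "v1":
        return ["-hda", filename]
    if generation == "v2":
        return ["-drive", f"file={filename},if=ide,index=0"]
    if generation == "v3":
        return [
            "-drive",
            f"file={filename},id=drive-ide0-0-0,if=none",
            "-device",
            "ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide-0-0-0",
        ]
    raise ValueError(f"unknown syntax generation: {generation!r}")
```

An application that builds these strings itself has to carry a table like this for every option it uses; an application using libvirt describes the disk once in XML and leaves the choice to the driver.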
Isolation against breakage in KVM command line syntax
Although QEMU attempts to maintain compatibility of command line flags across releases, there have been cases where options were removed or changed. This has typically occurred as a result of changes which were in the KVM branch, and then done differently when the code was merged back into the main QEMU GIT repository. libvirt detects what is required for the QEMU binary and uses the appropriate supported syntax:
Example: Boot order changes for virtio.
- v1: Original QEMU syntax (didn’t support virtio)
-boot c
- v2: RHEL5 era KVM syntax
-drive file=filename,if=virtio,index=0,boot=on
- v3: RHEL-6.0 KVM syntax
-drive file=filename,id=drive-virtio0-0-0,if=none,boot=on \
  -device virtio-blk-pci,bus=pci0.0,addr=1,drive=drive-virtio0-0-0,id=virtio-0-0-0
- v4: RHEL-6.1 KVM syntax (boot=on was removed in the update, changing syntax wrt RHEL-6.0)
-drive file=filename,id=drive-virtio0-0-0,if=none,bootindex=1 \
  -device virtio-blk-pci,bus=pci0.0,addr=1,drive=drive-virtio0-0-0,id=virtio-0-0-0
Transparently take advantage of new features in KVM
Some new features introduced in KVM can be automatically enabled and used by the libvirt QEMU driver without requiring changes to applications using libvirt. Applications thus benefit from these new features without expending any further development effort.
Example: Stable machine ABI in RHEL-6.0
The application provided XML requests a generic ‘pc’ machine type. libvirt automatically queries KVM to determine the supported machine types, canonicalizes ‘pc’ to the stable versioned machine type ‘pc-0.12’ and preserves this in the XML configuration. Thus even if the host is updated to KVM 0.13, existing guests will continue to run with the ‘pc-0.12’ ABI, and thus avoid potential driver breakage or guest OS re-activation. It also makes it possible to migrate the guest between hosts running different versions of KVM.
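The canonicalization step can be sketched in a few lines of Python. This is a simplified illustration rather than libvirt's code: the `canonicalize_machine` helper is a hypothetical name, and the alias table is passed in by hand where libvirt would query the KVM binary.

```python
import xml.etree.ElementTree as ET


def canonicalize_machine(domain_xml, aliases):
    """Rewrite a generic machine alias in domain XML to its versioned name.

    `aliases` maps alias -> versioned machine type, e.g. {"pc": "pc-0.12"}.
    It is supplied by the caller here; it is not queried from KVM.
    """
    root = ET.fromstring(domain_xml)
    os_type = root.find("./os/type")
    if os_type is not None:
        machine = os_type.get("machine")
        if machine in aliases:
            os_type.set("machine", aliases[machine])
    return ET.tostring(root, encoding="unicode")


xml_in = (
    "<domain type='kvm'><name>demo</name>"
    "<os><type arch='x86_64' machine='pc'>hvm</type></os></domain>"
)
xml_out = canonicalize_machine(xml_in, {"pc": "pc-0.12"})
```

Because the versioned name is written back into the stored configuration, a later host upgrade that changes what ‘pc’ means does not change what this guest sees.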
Example: Stable PCI device addressing in RHEL-6.0
The application provided XML requests 4 PCI devices (balloon, 2 disks & a NIC). libvirt automatically assigns PCI addresses to these devices and ensures that every time the guest is launched the exact same PCI addresses are used for each device. It will also manage PCI addresses when doing PCI device hotplug and unplug, so the sequence boot; unplug nic; migrate will result in stable PCI addresses on the target host.
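The idea of assigning an address once and then keeping it can be sketched as below. This is a minimal Python illustration under stated assumptions: the `PciSlotAllocator` class, the device names and the slot numbering (first free slot 3, last slot 31) are all hypothetical and are not taken from libvirt.

```python
class PciSlotAllocator:
    """Assign each named device a PCI slot once and keep it thereafter."""

    def __init__(self, first_slot=3, last_slot=31):
        # Assumption: slots below first_slot are reserved for built-in devices.
        self.first_slot = first_slot
        self.last_slot = last_slot
        # device name -> slot number; this mapping is what gets persisted
        # alongside the guest configuration.
        self.assigned = {}

    def assign(self, device):
        # A device that already has a slot always gets the same one back.
        if device in self.assigned:
            return self.assigned[device]
        used = set(self.assigned.values())
        for slot in range(self.first_slot, self.last_slot + 1):
            if slot not in used:
                self.assigned[device] = slot
                return slot
        raise RuntimeError("no free PCI slot")

    def release(self, device):
        # Hot-unplug frees only this device's slot; the others keep theirs.
        self.assigned.pop(device, None)


alloc = PciSlotAllocator()
for dev in ["balloon", "disk0", "disk1", "nic0"]:
    alloc.assign(dev)
alloc.release("nic0")  # boot; unplug nic; the remaining mapping is migrated
```

After the unplug, the mapping that travels with the guest still holds the original slots for the balloon and both disks, so the target host recreates exactly the same layout.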
Automatic compliance with changes in KVM monitor syntax best practice
In the same way that KVM command line argument “best practice” changes between releases, so does the monitor “best practice”. The libvirt QEMU driver aims to always be in compliance with the “best practice” recommendations for the version of KVM being used.
Example: monitor protocol changes
- v1: Human parsable monitor in RHEL5 era or earlier KVM
- v2: JSON monitor in RHEL-6.0
- v3: JSON monitor, with HMP passthrough in RHEL-6.1
Isolation against breakage in KVM monitor commands
Upstream QEMU releases have generally avoided changing monitor command syntax, but the KVM fork of QEMU has not been so lucky when merging changes back into mainline QEMU. Likewise, OS distros sometimes introduce custom commands in their branches.
Example: Method for setting VNC/SPICE passwords
- v1: RHEL-5 using the ‘change’ command only for VNC, or ‘set_ticket’ for SPICE
'change vnc 123456'
'set_ticket 123456'
- v2: RHEL-6.0 using the (non-upstream) ‘__com.redhat__set_password’ for SPICE or VNC
{ "execute": "__com.redhat__set_password", "arguments": { "protocol": "spice", "password": "123456", "expiration": "60" } }
- v3: RHEL-6.1 using the upstream ‘set_password’ and ‘expire_password’ commands
{ "execute": "set_password", "arguments": { "protocol": "spice", "password": "123456" } }
{ "execute": "expire_password", "arguments": { "protocol": "spice", "time": "+60" } }
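To show what an application talking to the monitor directly would have to carry, here is a small Python sketch that builds the two JSON variants listed above. The `spice_password_commands` helper and its `dialect` labels are hypothetical; the command names and argument shapes are copied from the examples, and the RHEL-5 human monitor form is not covered.

```python
import json


def spice_password_commands(dialect, password, lifetime_secs):
    """Build the JSON monitor messages that set a SPICE password with expiry.

    `dialect` is a hypothetical label: "rhel-6.0" or "rhel-6.1".
    Returns a list of JSON strings, one per monitor command.
    """
    if dialect == "rhel-6.0":
        # Single non-upstream command carrying password and expiration together.
        msgs = [{
            "execute": "__com.redhat__set_password",
            "arguments": {
                "protocol": "spice",
                "password": password,
                "expiration": str(lifetime_secs),
            },
        }]
    elif dialect == "rhel-6.1":
        # Upstream split into two commands: set, then expire.
        msgs = [
            {"execute": "set_password",
             "arguments": {"protocol": "spice", "password": password}},
            {"execute": "expire_password",
             "arguments": {"protocol": "spice", "time": f"+{lifetime_secs}"}},
        ]
    else:
        raise ValueError(f"unknown monitor dialect: {dialect!r}")
    return [json.dumps(m) for m in msgs]
```

With libvirt the application makes the same API call in both cases and the driver picks the command set that the running KVM understands.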
Security isolation with SELinux / AppArmor
When all guests run under the same user ID, any single KVM process which is exploited by a malicious guest will result in compromise of all guests running on the host. If the management software runs as the same user ID as the KVM processes, the management software can also be exploited.
sVirt integration in libvirt ensures that every guest is assigned a dedicated security context, which strictly confines what resources it can access. Thus even if a guest can read/write files based on matching UID/GID, the sVirt layer will block access unless the file / resource was explicitly listed in the guest configuration file.
Security isolation using POSIX DAC
An additional security driver uses traditional POSIX DAC capabilities to isolate guests from the host, by running as an unprivileged UID and GID pair. Future enhancements will run each individual guest under a dedicated UID.
Security isolation using Linux container namespaces
Future work will take advantage of recent advances in Linux container namespace capabilities. Every QEMU process will be placed into a dedicated PID namespace, preventing QEMU from seeing any other processes on the system should it be exploited. A dedicated network namespace will block access to all host network devices and only allow access to TAP device FDs, preventing an exploited guest from making arbitrary network connections to the outside world. When QEMU gains support for a ‘fd:’ disk protocol, a dedicated filesystem namespace will provide a very secure chrooted environment where it can only use file descriptors passed in from libvirt. UID/GID namespaces will allow separation of QEMU UID/GID for each process, even though from the host POV all processes will be under the same UID/GID.
Avoidance of shell code for most operations
Using the shell to invoke management commands is open to exploitation when user-supplied data gets interpreted as shell metacharacters. It is very hard to get escaping correctly applied, so libvirt has an advanced set of APIs for invoking commands which avoid all use of the shell. They are also able to guarantee that no file descriptors from the management layer leak down into the QEMU process where they could be exploited.
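libvirt's command APIs are internal C code, but the principle can be illustrated with Python's standard `subprocess` module. In this sketch the `run_without_shell` helper and the hostile filename are made up for illustration; the child process is simply the Python interpreter echoing its first argument back.

```python
import subprocess
import sys


def run_without_shell(argv):
    """Run a command from an argument vector; no shell ever parses the data."""
    return subprocess.run(
        argv,
        shell=False,               # arguments reach the child verbatim
        close_fds=True,            # only stdin/stdout/stderr are passed down
        stdin=subprocess.DEVNULL,
        capture_output=True,
        text=True,
        check=True,
    )


# A filename full of shell metacharacters is just data when no shell is involved.
hostile = "disk.img; rm -rf /tmp/x $(id) `id`"
child = [sys.executable, "-c", "import sys; print(sys.argv[1])", hostile]
echoed = run_without_shell(child).stdout.strip()
```

The child receives the string exactly as given, so there is nothing to escape and no way for the data to turn into a second command.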
Secure remote access to management APIs
The libvirt local library API can be exposed over TCP using TLS, x509 certificates and/or Kerberos. This provides secure remote access to all KVM management APIs, with parity of functionality vs local API usage. The secure remote access is a validated part of the Common Criteria certifications.
Operation audit trails
All operations where there is an association between a virtual machine and a host resource will result in one or more audit records being generated via the Linux auditing subsystem. This provides a clear record of key changes in the virtualization state. This functionality is typically a mandatory requirement for deployment into government / defence / financial organizations.
Security certification for Common Criteria
libvirt’s QEMU/KVM driver has gone through Common Criteria certification. This provides assurance for the host management virtualization stack, which again is typically a mandatory requirement for deployment into government / defence related organizations. The core components in this certification are the sVirt security model and the audit subsystem integration.
Integration with cgroups for resource control
All guests are automatically placed into cgroups. This allows control of CPU scheduler priority between guests, block I/O bandwidth tunables and memory utilization tunables. The devices cgroups controller provides a further line of defence, blocking guest access to char & block devices on the host in the unlikely event that another layer of protection fails. The cpu accounting group will also enable querying of the per-physical CPU utilization for guests, which is not available via any traditional /proc file.
Integration with libnuma for memory/cpu affinity
QEMU does not provide any native support for controlling NUMA affinity via its command line. libvirt integrates with libnuma and/or sched_setaffinity for memory and CPU pinning respectively.
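The CPU half of this can be tried from Python, whose `os.sched_setaffinity` wraps the same Linux system call. This is a Linux-only sketch, not libvirt's code: the `pin_process` helper is a hypothetical name, and memory pinning via libnuma is not shown.

```python
import os


def pin_process(pid, cpus):
    """Restrict `pid` (0 means the calling process) to the given CPU numbers.

    Linux only. Returns the affinity mask the kernel reports afterwards.
    """
    os.sched_setaffinity(pid, set(cpus))
    return os.sched_getaffinity(pid)


original = os.sched_getaffinity(0)           # CPUs we are currently allowed on
pinned = pin_process(0, {min(original)})     # pin to the lowest allowed CPU
os.sched_setaffinity(0, original)            # restore the original mask
```

A management tool doing this by hand would need to find each QEMU vCPU thread and pin it separately; libvirt takes the desired pinning from the guest configuration instead.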
CPU compatibility guarantees upon migration
libvirt directly models the host and guest CPU feature sets. A variety of policies are available to control what features become visible to the guest, and strictly validate compatibility of host CPUs across migration. Further APIs are provided to allow apps to query CPU compatibility ahead of time, and to enable computation of common feature sets between CPUs. No comparable API exists elsewhere.
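At its simplest, the compatibility check and the common feature set computation reduce to set operations. The Python sketch below is a deliberate simplification with hypothetical helper names and made-up host data; libvirt's real CPU model also covers CPU model names, vendors and per-feature policies.

```python
def baseline(*host_feature_sets):
    """Features common to every host; a guest limited to these runs anywhere."""
    sets = [set(features) for features in host_feature_sets]
    return set.intersection(*sets) if sets else set()


def can_migrate(guest_features, target_host_features):
    """A guest is safe to move only if the target offers every feature it sees."""
    return set(guest_features) <= set(target_host_features)


# Illustrative feature lists for two hosts in a cluster (not real inventory).
host_a = {"sse2", "ssse3", "sse4.1", "aes"}
host_b = {"sse2", "ssse3"}
common = baseline(host_a, host_b)
```

A guest started with only the `common` features can move in either direction, while one that was given everything on `host_a` cannot safely move to `host_b`.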
Host network filtering of guest traffic
The libvirt network filter APIs allow definition of a flexible set of policies to control network traffic seen by guests. The filter configurations are automatically translated into a set of rules in ebtables, iptables and ip6tables. The rules are automatically created when the guest starts and torn down when it stops. The rules are able to automatically learn the guest IP address through ARP traffic, or DHCP responses from a (trusted) DHCP server. The data model is closely aligned with the DMTF industry spec for virtual machine network filtering.
Libvirt ships with a default set of filters which can be turned on to provide ‘clean’ traffic from the guest, i.e. it blocks ARP spoofing and IP spoofing by the guest which would otherwise DoS other guests on the network.
Secure migration data transport
KVM does not provide any secure migration capability. The libvirt KVM driver adds support for tunnelling migration data over a secure libvirtd managed data channel.
Management of encrypted virtual disks for guests
An API set for managing secret keys enables a secure model for deploying guests to untrusted shared storage. By separating management of the secret keys from the guest configuration, it is possible for the management app to ensure a guest can only be started by an authorized user on an authorized host, and prevent restarts once it shuts down. This leverages the Qcow2 data encryption capabilities and is the only method to securely deploy guests if the storage admins / network are not trusted.
Seamless migration of SPICE clients
The libvirt migration protocol automatically invokes the KVM graphics client migration commands at the right point in the migration process to relocate connected clients.
Integration with arbitrary lock managers
The lock manager API allows strict guarantees that a virtual machine’s disks can only be in use by one process in the cluster. Pluggable implementations enable integration with different deployment scenarios. Implementations available can use Sanlock, POSIX locks (fcntl), and in the future DLM (clustersuite). Leases can be explicitly provided by the management application, or zero-conf based on the host disks assigned to the guests.
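The POSIX lock variant rests on a simple primitive, which the following Python sketch demonstrates on a single host. It is not libvirt's lock manager plugin interface: `try_lock`, `demo` and the lease file name are hypothetical, and Sanlock or DLM would be needed for guarantees across hosts without shared POSIX lock semantics. Unix only.

```python
import fcntl
import os
import subprocess
import sys
import tempfile


def try_lock(path):
    """Try to take an exclusive POSIX lock on `path`.

    Returns the open file descriptor on success (keep it open to hold the
    lock), or None if another process already holds the lock.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        os.close(fd)
        return None
    return fd


# Code for a second process that attempts to lock the same lease file.
_CHILD = (
    "import fcntl, os, sys\n"
    "fd = os.open(sys.argv[1], os.O_RDWR)\n"
    "try:\n"
    "    fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)\n"
    "    print('acquired')\n"
    "except OSError:\n"
    "    print('busy')\n"
)


def demo():
    """Hold a lease in this process, then let a second process try for it."""
    with tempfile.TemporaryDirectory() as tmpdir:
        lease = os.path.join(tmpdir, "disk.lease")
        fd = try_lock(lease)
        try:
            out = subprocess.run(
                [sys.executable, "-c", _CHILD, lease],
                capture_output=True, text=True, check=True,
            )
            return fd is not None, out.stdout.strip()
        finally:
            if fd is not None:
                os.close(fd)  # closing the descriptor releases the lock
```

Calling `demo()` shows the first process obtaining the lease and the second being refused, which is the property that stops two QEMU processes from writing to the same disk image.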
Host device passthrough
Integration between host device management APIs and KVM device assignment allows for safe passthrough of PCI devices from the host. libvirt automatically ensures that the device is detached from the host OS before guest startup, and re-attached after shutdown (or equivalent for hotplug/unplug). PCI topology validation ensures that the devices can be safely reset, with FLR, Power management reset, or secondary bus reset.
Host device API enumeration
These APIs allow querying what devices are present on the host. This enables applications wishing to perform device assignment to determine which PCI devices are available on the host and how they are used by the host OS, e.g. to detect the case where a user asked to assign a device that is currently used as the host eth0 interface.
NPIV SAN management
The host device APIs allow creation and deletion of NPIV SCSI interfaces. This in turn provides access to sets of LUNs, which can be associated with individual guests. This allows the virtual SCSI HBA to “follow” the guest wherever it may be running.
Core dump integration with “crash”
The libvirt core dump APIs allow a guest memory dump to be obtained and saved into a format which can then be analysed by the Linux crash tool.
Access to guest dmesg logs even for crashed guests
The libvirt API for peeking into guest memory regions allows the ‘virt-dmesg’ tool to extract recent kernel log messages from a guest. This is possible even when the guest kernel has crashed and thus no guest agent is otherwise available.
Cross-language accessibility of APIs.
libvirt provides a stable API which allows KVM to be managed from C, Perl, Python, Java, C#, OCaml and PHP. Components of the management stack are not all tied into using the same language.
Zero-conf host network access
Out of the box libvirt provides a NAT based network which allows guests to have outbound network access on a host, whether running wired, wireless or VPN, without needing any administrator setup ahead of time. While QEMU provides the ‘SLIRP’ networking mode, this has very low performance and does not integrate with the network filtering capabilities libvirt also provides, nor does it allow guests to communicate with each other.
Storage APIs for managing common host storage configurations
Few APIs exist for managing storage on Linux hosts. libvirt provides a general purpose storage API which allows use of NFS shares; formatting & use of local filesystems; partitioning of block devices; creation of LVM volume groups and creation of logical volumes within them; enumeration of LUNs on SCSI HBAs; login/out of iSCSI servers & access to their LUNs; automatic discovery of NFS exports, LVM volume groups and iSCSI targets; creation and cloning of non-raw disk formats, optionally including encryption; enumeration of multipath devices.
Integration with common monitoring tools
Plugins are available for munin, nagios and collectd to enable performance monitoring of all guests running on a host / in a network. Additional 3rd party monitoring tools can easily integrate by using a readonly libvirt connection to ensure they don’t interfere with the functional operation of the system.
Expose information about virtual machines via SNMP
The libvirt SNMP agent allows information about guests on a host to be exposed via SNMP. This allows industry standard tools to report on what guests are running on each host. Optionally SNMP can also be used to control guests and make changes.
Expose information & control of host via CIM
The libvirt CIM agent exposes information about, and allows control of, virtual machines, virtual networks and storage pools via the industry standard DMTF CIM schema for virtualization.
Expose information & control of host via AMQP/QMF
The libvirt AMQP agent (libvirt-qpid) exposes information about, and allows control of, virtual machines, virtual networks and storage pools via the QMF object modelling system, running on top of AMQP. This allows data center wide view & control of the state of virtualization hosts.