ANNOUNCE: libosinfo 0.3.1 released

Posted: July 1st, 2016 | Filed under: Fedora, libvirt, Virt Tools | Tags: libosinfo, osinfo | No Comments »

I am happy to announce a new release of libosinfo, version 0.3.1 is now available, signed with key DAF3 A6FD B26B 6291 2D0E 8E3F BE86 EBB4 1510 4FDF (4096R). All historical releases are available from the project download page.

Changes in this release include:

Require glib2 >= 2.36
Replace GSimpleAsyncResult usage with GTask
Fix VPATH based builds
Don’t include autogenerated enum files in dist
Fix build with older GCC versions
Add/improve/fix data for
- Debian
- SLES/SLED
- OpenSUSE
- FreeBSD
- Windows
- RHEL
- Ubuntu
Update README content
Fix string comparison for bootable media detection
Fix linker flags for OS-X & solaris
Fix darwin detection code
Fix multiple memory leaks

Thanks to everyone who contributed towards this release.

A special note to downstream vendors/distributors.

The next major release of libosinfo will include a major change in the way libosinfo is released and distributed. The current single release will be replaced with three indepedently released artefacts:

libosinfo – this will continue to provide the libosinfo shared library and most associated command line tools
osinfo-db – this will contain only the database XML files and RNG schema, no code at all.
osinfo-db-tools – a set of command line tools for managing deployment of osinfo-db archives for vendors & users.

The libosinfo and osinfo-db-tools releases will be fairly infrequently as they are today. The osinfo-db releases will be done very frequently, with automated releases made available no more than 1 day after updated DB content is submitted to the project.

ANNOUNCE: virt-viewer 4.0 release

Posted: June 30th, 2016 | Filed under: Fedora, libvirt, Virt Tools | Tags: remote-viewer, virt-viewer | No Comments »

I am happy to announce a new bugfix release of virt-viewer 4.0 (gpg), including experimental Windows installers for Win x86 MSI (gpg) and Win x64 MSI (gpg). Signatures are created with key DAF3 A6FD B26B 6291 2D0E 8E3F BE86 EBB4 1510 4FDF (4096R)

All historical releases are available from:

http://virt-manager.org/download/

Changes in this release include:

Drop support for gtk2 builds
Require spice-gtk >= 0.31
Require glib2 >= 2.38
Require gtk3 >= 3.10
Require libvirt-glib >= 0.1.8
Increase minimum window size fo 320×200 instead of 50×50
Remove use of GSLice
Don’t show usbredir button if not connected yet
Only compute monitor mapping in full screen
Don’t ignore usb-filter in spiec vv-file
Port to use GtkApplication API
Don’t leave window open after connection failure
Validate symbols from max glib/gdk versions
Don’t use GtkStock
Don’t use gtk_widget-modify_{fg,bg} APIs
Drop use of built-in eventloop in favour of libvirt-glib
Don’t open X display while parsing command line
Fix window title
Use GResource for building ui files into binary
Fix crash with invalid spice monitor mapping
Add dialog to show file transfer progress and allow cancelling
Remove unused nsis installer support
Include adwaita icon theme in msi builds
Add more menu mnemonics
Fix support for windows consoles to allow I/O redirection
Add support for ovirt sso-token in vv-file
Fix crash with zooming window while not connected
Remove custom auto drawer widget with GtkRevealer
Add appdata file for gnome software
Misc other bug fixes
Refresh translations

Thanks to everyone who contributed towards this release.

ANNOUNCE: libvirt switch to time based rules for updating version numbers

Posted: June 14th, 2016 | Filed under: Fedora, libvirt, OpenStack | Tags: releases, version | No Comments »

Until today, libvirt has used a 3 digit version number for monthly releases off the git master branch, and a 4 digit version number for maintenance releases off stable branches. Henceforth all releases will use 3 digits, and the next release will be 2.0.0, followed by 2.1.0, 2.2.0, etc, with stable releases incrementing the last digit (2.0.1, 2.0.2, etc) instead of appending yet another digit.

For the longer explanation read on…

We have the following rules about when we increment each digit in the version number

major: no one has any clue about when we should bump this
minor: bump this when some “significant”[*] features appear
micro: bump this on each new master branch release
extra: bump this for stable branch releases

[*] for a definition of “significant” that either no one knows, or that we invent post-update to justify why we changed the digit.

Now consider the actual requirements libvirt has

A number that increments on each release from master branches
A number that can be further incremented for stable branch releases without clashing with future master branch releases

The micro + extra digits alone deal with our two actual requirements, so one may ask what is the point of the major + minor digits in the version number ?

In 11 years of libvirt development we’ve only bumped the major digit once, and we didn’t have any real reason why we chose to the bump the major digit, instead of continuing to bump the minor digit. It just felt like we ought to have a 1.0 release after 7+ years. Our decisions about when to bump the minor digit have not been that much less arbitrary. We just look at what features are around and randomly decide if any feel “big enough” to justify a minor digit bump.

Way back in the early days of libvirt, we had exactly this kind of mess when deciding when to actually make releases. Sometimes we’d release after a month, sometimes after 3 months, completely arbitrarily based on whether the accumulated changes felt “big enough” to justify a release. Feature based release schedules are insanity as no one can predict when the next one might happen. Fortunately we wised up pretty quickly and adopted a time base release schedule where we release monthly approximately on the 1st. The only exception is over xmas/new year period, where we avoid Jan 1st and Feb 1st releases and instead have a Jan 15th release, giving a 6 week gap. There is no stated semantic difference between any of our releases off git master branch – they just include whatever happens to be ready at the time.

Considering version numbers again, it is clear that the reason why a feature based release timeline are a bad idea, is just as applicable to feature based version numbering rules. So we have decided to switch to a time based rule for incrementing the version number. Note, that this is *not* to be confused with switching to a time based version number. We want individual digits in the version number to be completely devoid of any semantics. Just as we don’t want version number changes to imply a particular level of feature changes, we also don’t want version numbers to correspond to dates of releases. IOW, we are *not* using the year and month to form the version number, rather that we are using the change in year and change in month as a trigger to update the version number. So our new version number rules are

major: bumped for the first release of each year
minor: bumped for every major release
micro: bumped for every stable branch release

Rather than wait until January 2017 to put this new rule into effect, we are pretending that July is January, so the next libvirt release will bump the major version number to 2.0.0. There after the releases will be 2.1.0, 2.2.0, etc until January 2017, when we’ll go to 3.0.0. The maintenance releases based off 2.0.0 will be 2.0.1, 2.0.2, 2.0.3, etc, and live on a v2.0-maint branch in git.

So henceforth you should not interpret the libvirt version numbers as having any semantic meaning. They are merely indicating the progression of releases.

As a reminder, libvirt promises API and ABI stability forever, and the ELF library soname version number is thus fixed forever at libvirt.so.0, regardless of what version number a release has.

Analysis of techniques for ensuring migration completion with KVM

Posted: May 12th, 2016 | Filed under: Coding Tips, Fedora, libvirt, OpenStack, Virt Tools | 1 Comment »

Live migration is a long standing feature in QEMU/KVM (and other competing virtualization platforms), however, by default it does not cope very well with guests whose workload are very memory write intensive. It is very easy to create a guest workload that will ensure a migration will never complete in its default configuration. For example, a guest which continually writes to each byte in a 1 GB region of RAM will never successfully migrate over a 1Gb/sec NIC. Even with a 10Gb/s NIC, a slightly larger guest can dirty memory fast enough to prevent completion without an unacceptably large downtime at switchover. Thus over the years, a number of optional features have been developed for QEMU with the aim to helping migration to complete.

If you don’t want to read the background information on migration features and the testing harness, skip right to the end where there are a set of data tables showing charts of the results, followed by analysis of what this all means.

The techniques available

Downtime tuning. Unless the guest is completely idle, it never possible to get to a point where 100% of memory has been transferred to the target host. So at some point there needs to be a decision made about whether enough memory has been transferred to allow the switch over to the target host with acceptable blackout period. The downtime tunable controls how long a blackout period is permitted during the switchover. QEMU measures the network transfer rate it is achieving and compares it to the amount of outstanding RAM to determine if it can be transferred within the configured downtime window. When migrating it is not desirable to set QEMU to use the maximum accepted downtime straightaway, as that guarantees that the guest will always suffer from the maximum downtime blackout. Instead, it is better to start off with a fairly small downtime value and increase the permitted downtime as time passes. The idea is to maximise the likelihood that migration can complete with a small downtime.
Bandwidth tuning. If the migration is taking place over a NIC that is used for other non-migration related actions, it may be desirable to prevent the migration stream from consuming all bandwidth. As noted earlier though, even a relatively small guest is capable of dirtying RAM fast enough that even a 10Gbs NIC will not be able to complete migration. Thus if the goal is to maximise the chances of getting a successful migration though, the aim should be to maximise the network bandwidth available to the migration operation. Following on from this, it is wise not to try to run multiple migration operations in parallel unless their transfer rates show that they are not maxing out the available bandwidth, as running parallel migrations may well mean neither will ever finish.
Pausing CPUs. The simplest and crudest mechanism for ensuring guest migration complete is to simply pause the guest CPUs. This prevent the guest from continuing to dirty memory and thus even on the slowest network, it will ensure migration completes in a finite amount of time. The cost is that the guest workload will be completely stopped for a prolonged period of time. Think of pausing the guest as being equivalent to setting an arbitrarily long maximum permitted downtime. For example, assuming a guest with 8 GB of RAM and an idle 10Gbs NIC, in the worst case pausing would lead to to approx 6 second period of downtime. If higher speed NICs are available, the impact of pausing will decrease until it converges with a typical max downtime setting.
Auto-convergence. The rate at which a guest can dirty memory is related to the amount of time the guest CPUs are permitted to run for. Thus by throttling the CPU execution time it is possible to prevent the guest from dirtying memory so quickly and thus allow migration data transfer to keep ahead of RAM dirtying. If this feature is enabled, by default QEMU starts by cutting 20% of the guest vCPU execution time. At the startof each iteration over RAM, it will check progress during the previous two iterations. If insufficient forward progress is being made, it will repeatedly cut 10% off the running time allowed to vCPUs. QEMU will throttle CPUs all the way to 99%. This should guarantee that migration can complete on all by the most sluggish networks, but has a pretty high cost to guest CPU performance. It is also indiscriminate in that all guest vCPUs are throttled by the same factor, even if only one guest process is responsible for the memory dirtying.
Post-copy. Normally migration will only switch over to running on the target host once all RAM has been transferred. With post-copy, the goal is to transfer “enough” or “most” RAM across and then switch over to running on the target. When the target QEMU gets a fault for a memory page that has not yet been transferred, it’ll make an explicit out of band request for that page from the source QEMU. Since it is possible to switch to post-copy mode at any time, it avoids the entire problem of having to complete migration in a fixed downtime window. The cost is that while running in post-copy mode, guest page faults can be quite expensive, since there is a need to wait for the source host to transfer the memory page over to the target, which impacts performance of the guest during post-copy phase. If there is a network interruption while in post-copy mode it will also be impossible to recover. Since neither the source or target host has a complete view of the guest RAM it will be necessary to reboot the guest.
Compression. The migration pages are usually transferred to the target host as-is. For many guest workloads, memory page contents will be fairly easily compressible. So if there are available CPU cycles on the source host and the network bandwidth is a limiting factor, it may be worth while burning source CPUs in order to compress data transferred over the network. Depending on the level of compression achieved it may allow migration to complete. If the memory is not compression friendly though, it would be burning CPU cycles for no benefit. QEMU supports two compression methods, XBZRLE and multi-thread, either of which can be enabled. With XBZRLE a cache of previously sent memory pages is maintained that is sized to be some percentage of guest RAM. When a page is dirtied by the guest, QEMU compares the new page contents to that in the cache and then only sends a delta of the changes rather than the entire page. For this to be effective the cache size must generally be quite large – 50% of guest RAM would not be unreasonable. The alternative compression approach uses multiple threads which simply use zlib to directly compress the full RAM pages. This avoids the need to maintain a large cache of previous RAM pages, but is much more CPU intensive unless hardware acceleration is available for the zlib compression algorithm.

Measuring impact of the techniques

Understanding what the various techniques do in order to maximise chances of a successful migration is useful, but it is hard to predict how well they will perform in the real world when faced with varying workloads. In particular, are they actually capable of ensuring completion under worst case workloads and what level of performance impact do they actually have on the guest workload. This is a problem that the OpenStack Nova project is currently struggling to get a clear answer on, with a view to improving Nova’s management of libvirt migration. In order to try and provide some guidance in this area, I’ve spent a couple of weeks working on a framework for benchmarking QEMU guest performance when subjected to the various different migration techniques outlined above.

In OpenStack the goal is for migration to be a totally “hands off” operation for cloud administrators. They should be able to request a migration and then forget about it until it completes, without having to baby sit it to apply tuning parameters. The other goal is that the Nova API should not have to expose any hypervisor specific concepts such as post-copy, auto-converge, compression, etc. Essentially Nova itself has to decide which QEMU migration features to use and just “do the right thing” to ensure completion. Whatever approach is chosen needs to be able to cope with any type of guest workload, since the cloud admins will not have any visibility into what applications are actually running inside the guest. With this in mind, when it came to performance testing the QEMU migration features, it was decided to look at their behaviour when faced with the worst case scenario. Thus a stress program was written which would allocate many GB of RAM, and then spawn a thread on each vCPU that would loop forever xor’ing every byte of RAM against an array of bytes taken from /dev/random. This ensures that the guest is both heavy on reads and writes to memory, as well as creating RAM pages which are very hostile towards compression. This stress program was statically linked and built into a ramdisk as the /init program, so that Linux would boot and immediately run this stress workload in a fraction of a second. In order to measure performance of the guest, each time 1 GB of RAM has been touched, the program will print out details of how long it took to update this GB and an absolute timestamp. These records are captured over the serial console from the guest, to be later correlated with what is taking place on the host side wrt migration.

Next up it was time to create a tool to control QEMU from the host and manage the migration process, activating the desired features. A test scenario was defined which encodes details of what migration features are under test and their settings (number of iterations before activating post-copy, bandwidth limits, max downtime values, number of compression threads, etc). A hardware configuration was also defined which expressed the hardware characteristics of the virtual machine running the test (number of vCPUs, size of RAM, host NUMA memory & CPU binding, usage of huge pages, memory locking, etc). The tests/migration/guestperf.py tool provides the mechanism to invoke the test in any of the possible configurations.For example, to test post-copy migration, switching to post-copy after 3 iterations, allowing 1Gbs bandwidth on a guest with 4 vCPUs and 8 GB of RAM one might run

$ tests/migration/guestperf.py --cpus 4 --mem 8 --post-copy --post-copy-iters 3 --bandwidth 125 --dst-host myotherhost --transport tcp --output postcopy.json

The myotherhost.json file contains the full report of the test results. This includes all details of the test scenario and hardware configuration, migration status recorded at start of each iteration over RAM, the host CPU usage recorded once a second, and the guest stress test output. The accompanying tests/migration/guestperf-plot.py tool can consume this data file and produce interactive HTML charts illustrating the results.

$ tests/migration/guestperf-plot.py --split-guest-cpu --qemu-cpu --vcpu-cpu --migration-iters --output postcopy.html postcopy.json

To assist in making comparisons between runs, however, a set of standardized test scenarios also defined which can be run via a tests/migration/guestperf-batch.py tool, in which case it is merely required to provide the desired hardware configuration

$ tests/migration/guestperf-batch.py --cpus 4 --mem 8 --dst-host myotherhost --transport tcp --output myotherhost-4cpu-8gb

This will run all the standard defined test scenarios and save many data files in the myotherhost-4cpu-8gb directory. The same guestperf-plot.py tool can be used to create charts combining multiple data sets at once to allow easy comparison.

Performance results for QEMU 2.6

With the tools written, I went about running some tests against QEMU GIT master codebase, which was effectively the same as the QEMU 2.6 code just released. The pair of hosts used were Dell PowerEdge R420 servers with 8 CPUs and 24 GB of RAM, spread across 2 NUMA nodes. The primary NICs were Broadcom Gigabit, but it has been augmented with Mellanox 10-Gig-E RDMA capable NICs, which is what were picked for transfer of the migration traffic. For the tests I decided to collect data for two distinct hardware configurations, a small uniprocessor guest (1 vCPU and 1 GB of RAM) and a moderately sized multi-processor guest (4 vCPUs and 8 GB of RAM). Memory and CPU binding was specified such that the guests were confined to a single NUMA node to avoid performance measurements being skewed by cross-NUMA node memory accesses. The hosts and guests were all running the RHEL-7 3.10.0-0369.el7.x86_64 kernel.

To understand the impact of different network transports & their latency characteristics, the two hardware configurations were combinatorially expanded against 4 different network configurations – a local UNIX transport, a localhost TCP transport, a remote 10Gbs TCP transport and a remote 10Gbs RMDA transport.

The full set of results are linked from the tables that follow. The first link in each row gives a guest CPU performance comparison for each scenario in that row. The other cells in the row give the full host & guest performance details for that particular scenario

UNIX socket, 1 vCPU, 1 GB RAM

Using UNIX socket migration to local host, guest configured with 1 vCPU and 1 GB of RAM

Scenario	Tunable
Pause unlimited BW	0 iters	1 iters	5 iters	20 iters
Pause 5 iters	100 mbs	300 mbs	1 gbs	10 gbs	unlimited
Post-copy unlimited BW	0 iters	1 iters	5 iters	20 iters
Post-copy 5 iters	100 mbs	300 mbs	1 gbs	10 gbs	unlimited
Auto-converge unlimited BW	5% CPU step	10% CPU step	20% CPU step
Auto-converge 10% CPU step	100 mbs	300 mbs	1 gbs	10 gbs	unlimited
MT compression unlimited BW	1 thread	2 threads	4 threads
XBZRLE compression unlimited BW	5% cache	10% cache	20% cache	50% cache

UNIX socket, 4 vCPU, 8 GB RAM

Using UNIX socket migration to local host, guest configured with 4 vCPU and 8 GB of RAM

Scenario	Tunable
Pause unlimited BW	0 iters	1 iters	5 iters	20 iters
Pause 5 iters	100 mbs	300 mbs	1 gbs	10 gbs	unlimited
Post-copy unlimited BW	0 iters	1 iters	5 iters	20 iters
Post-copy 5 iters	100 mbs	300 mbs	1 gbs	10 gbs	unlimited
Auto-converge unlimited BW	5% CPU step	10% CPU step	20% CPU step
Auto-converge 10% CPU step	100 mbs	300 mbs	1 gbs	10 gbs	unlimited
MT compression unlimited BW	1 thread	2 threads	4 threads
XBZRLE compression unlimited BW	5% cache	10% cache	20% cache	50% cache

TCP socket local, 1 vCPU, 1 GB RAM

Using TCP socket migration to local host, guest configured with 1 vCPU and 1 GB of RAM

Scenario	Tunable
Pause unlimited BW	0 iters	1 iters	5 iters	20 iters
Pause 5 iters	100 mbs	300 mbs	1 gbs	10 gbs	unlimited
Post-copy unlimited BW	0 iters	1 iters	5 iters	20 iters
Post-copy 5 iters	100 mbs	300 mbs	1 gbs	10 gbs	unlimited
Auto-converge unlimited BW	5% CPU step	10% CPU step	20% CPU step
Auto-converge 10% CPU step	100 mbs	300 mbs	1 gbs	10 gbs	unlimited
MT compression unlimited BW	1 thread	2 threads	4 threads
XBZRLE compression unlimited BW	5% cache	10% cache	20% cache	50% cache

TCP socket local, 4 vCPU, 8 GB RAM

Using TCP socket migration to local host, guest configured with 4 vCPU and 8 GB of RAM

Scenario	Tunable
Pause unlimited BW	0 iters	1 iters	5 iters	20 iters
Pause 5 iters	100 mbs	300 mbs	1 gbs	10 gbs	unlimited
Post-copy unlimited BW	0 iters	1 iters	5 iters	20 iters
Post-copy 5 iters	100 mbs	300 mbs	1 gbs	10 gbs	unlimited
Auto-converge unlimited BW	5% CPU step	10% CPU step	20% CPU step
Auto-converge 10% CPU step	100 mbs	300 mbs	1 gbs	10 gbs	unlimited
MT compression unlimited BW	1 thread	2 threads	4 threads
XBZRLE compression unlimited BW	5% cache	10% cache	20% cache	50% cache

TCP socket remote, 1 vCPU, 1 GB RAM

Using TCP socket migration to remote host, guest configured with 1 vCPU and 1 GB of RAM

Scenario	Tunable
Pause unlimited BW	0 iters	1 iters	5 iters	20 iters
Pause 5 iters	100 mbs	300 mbs	1 gbs	10 gbs	unlimited
Post-copy unlimited BW	0 iters	1 iters	5 iters	20 iters
Post-copy 5 iters	100 mbs	300 mbs	1 gbs	10 gbs	unlimited
Auto-converge unlimited BW	5% CPU step	10% CPU step	20% CPU step
Auto-converge 10% CPU step	100 mbs	300 mbs	1 gbs	10 gbs	unlimited
MT compression unlimited BW	1 thread	2 threads	4 threads
XBZRLE compression unlimited BW	5% cache	10% cache	20% cache	50% cache

TCP socket remote, 4 vCPU, 8 GB RAM

Using TCP socket migration to remote host, guest configured with 4 vCPU and 8 GB of RAM

Scenario	Tunable
Pause unlimited BW	0 iters	1 iters	5 iters	20 iters
Pause 5 iters	100 mbs	300 mbs	1 gbs	10 gbs	unlimited
Post-copy unlimited BW	0 iters	1 iters	5 iters	20 iters
Post-copy 5 iters	100 mbs	300 mbs	1 gbs	10 gbs	unlimited
Auto-converge unlimited BW	5% CPU step	10% CPU step	20% CPU step
Auto-converge 10% CPU step	100 mbs	300 mbs	1 gbs	10 gbs	unlimited
MT compression unlimited BW	1 thread	2 threads	4 threads
XBZRLE compression unlimited BW	5% cache	10% cache	20% cache	50% cache

RDMA socket, 1 vCPU, 1 GB RAM

Using RDMA socket migration to remote host, guest configured with 1 vCPU and 1 GB of RAM

Scenario	Tunable
Pause unlimited BW	0 iters	1 iters	5 iters	20 iters
Pause 5 iters	100 mbs	300 mbs	1 gbs	10 gbs	unlimited
Post-copy unlimited BW	0 iters	1 iters	5 iters	20 iters
Post-copy 5 iters	100 mbs	300 mbs	1 gbs	10 gbs	unlimited
Auto-converge unlimited BW	5% CPU step	10% CPU step	20% CPU step
Auto-converge 10% CPU step	100 mbs	300 mbs	1 gbs	10 gbs	unlimited
MT compression unlimited BW	1 thread	2 threads	4 threads
XBZRLE compression unlimited BW	5% cache	10% cache	20% cache	50% cache

RDMA socket, 4 vCPU, 8 GB RAM

Using RDMA socket migration to remote host, guest configured with 4 vCPU and 8 GB of RAM

Scenario	Tunable
Pause unlimited BW	0 iters	1 iters	5 iters	20 iters
Pause 5 iters	100 mbs	300 mbs	1 gbs	10 gbs	unlimited
Post-copy unlimited BW	0 iters	1 iters	5 iters	20 iters
Post-copy 5 iters	100 mbs	300 mbs	1 gbs	10 gbs	unlimited
Auto-converge unlimited BW	5% CPU step	10% CPU step	20% CPU step
Auto-converge 10% CPU step	100 mbs	300 mbs	1 gbs	10 gbs	unlimited
MT compression unlimited BW	1 thread	2 threads	4 threads
XBZRLE compression unlimited BW	5% cache	10% cache	20% cache	50% cache

Analysis of results

The charts above provide the full set of raw results, from which you are welcome to draw your own conclusions. The test harness is also posted on the qemu-devel mailing list and will hopefully be merged into GIT at some point, so anyone can repeat the tests or run tests to compare other scenarios. What follows now is my interpretation of the results and interesting points they show

There is a clear periodic pattern in guest performance that coincides with the start of each migration iteration. Specifically at the start of each iteration there is a notable and consistent momentary drop in guest CPU performance. Picking an example where this effect is clearly visible – the 1 vCPU, 1GB RAM config with the “Pause 5 iters, 300 mbs” test – we can see the guest CPU performance drop from 200ms/GB of data modified, to 450ms/GB. QEMU maintains a bitmap associated with guest RAM to track which pages are dirtied by the guest while migration is running. At the start of each iteration over RAM, this bitmap has to be read and reset and this action is what is responsible for this momentary drop in performance.
With the larger guest sizes, there is a second roughly periodic but slightly more chaotic pattern in guest performance that is continual throughout migration. The magnitude of these spikes is about 1/2 that of those occurring at the start of each iteration. An example where this effect is clearly visible is the 4 vCPU, 8GB RAM config with the “Pause unlimited BW, 20 iters” test – we can see the guest CPU performance is dropping from 500ms/GB to between 700ms/GB and 800ms/GB. The host NUMA node that the guest is confined to has 4 CPUs and the guest itself has 4 CPUs. When migration is running, QEMU has a dedicated thread performing the migration data I/O and this is sharing time on the 4 host CPUs with the guest CPUs. So with QEMU emulator threads sharing the same pCPUs as the vCPU threads, we have 5 workloads competing for 4 CPUs. IOW the frequently slightly chaotic spikes in guest performance throughout the migration iteration are a result of overcommiting the host pCPUs. The magnitude of the spikes is directly proportional to the total transfer bandwidth permitted for the migration. This is not an inherent problem with migration – it would be possible to place QEMU emulator threads on a separate pCPU from vCPU threads if strong isolation is desired between the guest workload and migration processing.
The baseline guest CPU performance differs between the 1 vCPU, 1 GB RAM and 4 vCPU 8 GB RAM guests. Comparing the UNIX socket “Pause unlimited BW, 20 iters” test results for these 1 vCPU and 4 vCPU configs we see the former has a baseline performance of 200ms/GB of data modified while the latter has 400ms/GB of data modified. This is clearly nothing to do with migration at all. Naively one might think that going from 1 vCPU to 4 vCPUs would result in 4 times the performance, since we have 4 times more threads available to do work. What we’re seeing here is likely the result of hitting the memory bandwidth limit, so each vCPU is competing for memory bandwidth and thus the overall performance of each vCPU has decreased. So instead of getting x4 the performance going from 1 to 4 vCPUs only doubled the performance.
When post-copy is operating in its pre-copy phase, it has no measurable impact on the gust performance compared to when post-copy is not enabled at all. This can be seen by comparing the TCP socket “Paused 5 iters, 1 Gbs” test results with the “Post-copy 5 iters, 1 Gbs” test results. Both show the same baseline guest CPU performance and the same magnitude of spikes at the start of each iteration. This shows that it is viable to unconditionally enable the post-copy feature for all migration operations, even if the migration is likely to complete without needing to switch from pre-copy to post-copy phases. It provides the admin/app the flexibility to dynamically decide on the fly whether to switch to post-copy mode or stay in pre-copy mode until completion.
When post-copy migration switches from its pre-copy phase to the post-copy phase, there is a major but short-lived spike in guest CPU performance. What is happening here is that the guest has perhaps 80% of its RAM transferred to the target host when post-copy phase starts but the guest workload is touching some pages which are still on the source, so the page fault is having to wait for the page to be transferred across the network. The magnitude of the spike and duration of the post-copy phase is related to the total guest RAM size and bandwidth available. Taking the remote TCP case with 1 vCPU, 1 GB RAM hardware config for clarity, and comparing the “Post-copy 5 iters, 1Gbs” scenario with the “Post-copy 5 iters, 10Gbs” scenario, we can see the magnitude of the spike in guest performance is the same order of magnitude in both cases. The overall time for each iteration of pre-copy phase is clearly shorter in the 10Gbs case. If we further compare with the local UNIX domain socket, we can see the spike in performance is much lower at the post-copy phase. What this is telling us is that the magnitude of the spike in the post-copy phase is largely driven by the latency in the time to transfer an out of band requested page from the source to the target, rather than the overall bandwidth available. There are plans in QEMU to allow migration to use multiple TCP connections which should significantly reduce the post-copy latency spike as the out-of-band requested pages will not get stalled behind a long TCP transmit queue for the background bulk-copy.
Auto-converge will often struggle to ensure convergence for larger guest sizes or when the bandwidth is limited. Considering the 4 vCPU, 8 GB RAM remote TCP test comparing effects of different bandwidth limits we can see that with a 10Gbs bandwidth cap, auto-converge had to throttle to 80% to allow completion, while other tests show as much as 95% or even 99% in some cases. With a lower bandwidth limit of 1Gbs, the test case timed out after 5 minutes of running, having only attempted throttled down by 20%, showing auto-converge is not nearly aggressive enough when faced with low bandwidth links. The worst case guest performance seen when running auto-converge with CPUs throttled to 80% was on a par with that seen with post-copy immediately after switching to post-copy phase. The difference is that auto-converge shows that worst-case hit for a very long time during pre-copy, potentially many minutes, where as post-copy only showed it for a few seconds.
Multi-thread compression was actively harmful to chances of a successful migration. Considering the 4 vCPU, 8 GB RAM remote TCP test comparing thread counts, we can see that increasing the number of threads actually made performance worse, with less iterations over RAM being completed before the 5 minute timeout was hit. The longer each iteration takes the more time the guest has to dirty RAM, so the less likely migration is to complete. There are two factors believe to be at work here to make MT compression results so bad. First, as noted earlier QEMU is confined to 4 pCPUs, so with 4 vCPUs running, the compression threads have to compete for time with the vCPU threads slowing down speed of compression. The stress test workload run in the guest is writing completely random bytes which are a pathological input dataset for compression, allowing almost no compression. Given the fact the compression was CPU limited though, even if there had been a good compression ratio, it would be unlikely to have a significant benefit since the increased time to iterate over RAM would allow the guest to dirty more data eliminating the advantage of compressing it. If the QEMU emulator threads were given dedicated host pCPUs to run on it may have increased the performance somewhat, but then that assumes the host has CPUs free that are not running other guests.
XBZRLE compression fared a little better than MT compression. Again considering the 4 vCPU, 8 GB RAM remote TCP test comparing RAM cache sizing, we can see that the time required for each iteration over RAM did not noticeably increase. This shows that while XBZRLE compression did have a notable impact on guest CPU performance, is not seeing a major bottleneck on processing of each page as compared to MT compression. Again though, it did not help to achieve migration completion, with all tests timing out after 5 minutes or 30 iterations over RAM. This is due to the fact that the guest stress workload is again delivering input data that hits the pathological worst case in the algorithm. Faced with such a workload, no matter how much CPU time or RAM cache is available, XBZRLE can never have any positive impact on migration.
The RDMA data transport showed up a few of its quirks. First, by looking at the RDMA results comparing pause bandwidth, we can clearly identify a bug in QEMU’s RDMA implementation – it is not honouring the requested bandwidth limits – it always transfers at maximum link speed. Second, all the post-copy results show failure, confirming that post-copy is currently not compatible with RDMA migration. When comparing 10Gbs RDMA against 10Gbs TCP transports, there is no obvious benefit to using RDMA – it was not any more likely to complete migration in any of the test scenarios.

Considering all the different features tested, post-copy is the clear winner. It was able to guarantee completion of migration every single time, regardless of guest RAM size with minimal long lasting impact on guest performance. While it did have a notable spike impacting guest performance at time of switch from pre to post copy phases, this impact was short lived, only a few seconds. The next best result was seen with auto-converge which again managed to complete migration in the majority of cases. By comparison with post-copy, the worst case impact seen to the guest CPU performance was the same order of magnitude, but it lasted for a very very long time, many minutes long. In addition in more bandwidth limited scenarios, auto-converge was unable to throttle guest CPUs quickly enough to avoid hitting the overall 5 minute timeout, where as post-copy would always succeed except in the most limited bandwidth scenarios (100Mbs – where no strategy can ever work). The other benefit of post-copy is that only the guest OS thread responsible for the page fault is delayed – other threads in the guest OS will continue running at normal speed if their RAM is already on the host. With auto-converge, all guest CPUs and threads are throttled regardless of whether they are responsible for dirtying memory. IOW post-copy has a targetted performance hit, where as auto-converge is indiscriminate. Finally, as noted earlier, post-copy does have a failure scenario which can result in loosing the VM in post-copy mode if the network to the source host is lost for long enough to timeout the TCP connection. This risk can be mitigated with redundancy at the network layer and it is only at risk for the short period of time the guest is running in post-copy mode, which is mere seconds with 10Gbs link

It was expected that the compression features would fare badly given the guest workload, but the impact was far worse than expected, particularly for MT compression. Given the major requirement compression has in terms of host CPU time (MT compression) or host RAM (XBZRLE compression), they do no appear to be viable as a general purpose features. They should only be used if the workloads are known to be compression friendly, the host has the CPU and/or RAM resources to spare and neither post-copy or auto-converge are possible to use. To make these features more practical to use in an automated general purpose manner, QEMU would have to be enhanced to allow the mgmt application to have directly control over turning them on and off during migration. This would allow the app to try using compression, monitor its effectiveness and then turn compression off if it is being harmful, rather than having to abort the migration entirely and restart it.

There is scope for further testing with RDMA, since the hardware used for testing was limited to 10Gbs. Newer RDMA hardware is supposed to be capable of reaching higher speeds, 40Gbs, even 100 Gbs which would have a correspondingly positive impact on ability to migrate. At least for any speeds of 10Gbs or less though, it does not appear worthwhile to use RDMA, apps would be better off using TCP in combintaion with post-copy.

In terms of network I/O, no matter what guest workload, QEMU is generally capable of saturating whatever link is used for migration for as long as it takes to complete. It is very easy to create workloads that will never complete, and decreasing the bandwidth available just increases the chances of migration. It might be tempting to think that if you have 2 guests, it would take the same total time whether you migrate them one after the other, or migrate them in parallel. This is not necessarily the case though, as with a parallel migration the bandwidth will be shared between them, which increases the chances that neither guest will ever be able to complete. So as a general rule it appears wise to serialize all migration operations on a given host, unless there are multiple NICs available.

In summary, use post-copy if it is available, otherwise use auto-converge. Don’t bother with compression unless the workload is known to be very compression friendly. Don’t bother with RDMA unless it supports more than 10 Gbs, otherwise stick with plain TCP.

Improving QEMU security part 5: TLS support for NBD server & client

Posted: April 5th, 2016 | Filed under: Coding Tips, Fedora, libvirt, OpenStack, Security, Virt Tools | Tags: migration, nbd, qemu, tls | No Comments »

This blog is part 5 of a series I am writing about work I’ve completed over the past few releases to improve QEMU security related features.

For many years now QEMU has had code to support the NBD protocol, either as a client or as a server. The qemu-nbd command line tool can be used to export a disk image over NBD to a remote machine, or connect it directly to the local kernel’s NBD block device driver. The QEMU system emulators also have a block driver that acts as an NBD client, allowing VMs to be run from NBD volumes. More recently the QEMU system emulators gained the ability to export the disks from a running VM as named NBD volumes. The latter is particularly interesting because it is the foundation of live migration with block device replication, allowing VMs to be migrated even if you don’t have shared storage between the two hosts. In common with most network block device protocols, NBD has never offered any kind of data security capability. Administrators are recommended to run NBD over a private LAN/vLAN, use network layer security like IPSec, or tunnel it over some other kind of secure channel. While all these options are capable of working, none are very convenient to use because they require extra setup steps outside of the basic operation of the NBD server/clients. Libvirt has long had the ability to tunnel the QEMU migration channel over its own secure connection to the target host, but this has not been extended to cover the NBD channel(s) opened when doing block migration. While it could theoretically be extended to cover NBD, it would not be ideal from a performance POV because the libvirtd architecture means that the TLS encryption/decryption for multiple separate network connections would be handled by a single thread. For fast networks (10-GigE), libvirt will quickly become the bottleneck on performance even if the CPU has native support for AES.

Thus it was decided that the QEMU NBD client & server would need to be extended to support TLS encryption of the data channel natively. Initially the thought was to just add a flag to the client/server code to indicate that TLS was desired and run the TLS handshake before even starting the NBD protocol. After some discussion with the NBD maintainers though, it was decided to explicitly define a way to support TLS in the NBD protocol negotiation phase. The primary benefit of doing this is to allow clearer error reporting to the user if the client connects to a server requiring use of TLS and the client itself does not support TLS, or vica-verca – ie instead of just seeing what appears to be a mangled NBD handshake and not knowing what it means, the client can clearly report “This NBD server requires use of TLS encryption”.

The extension to the NBD protocol was fairly straightforward. After the initial NBD greeting (where the client & server agree the NBD protocol variant to be used) the client is able to request a number of protocol options. A new option was defined to allow the client to request TLS support. If the server agrees to use TLS, then they perform a standard TLS handshake and the rest of the NBD protocol carries on as normal. To prevent downgrade attacks, if the NBD server requires TLS and the client does not request the TLS option, then it will respond with an error and drop the client. In addition if the server requires TLS, then TLS must be the first option that the client requests – other options are only permitted once the TLS session is active & the server will again drop the client if it tries to request non-TLS options first.

The QEMU NBD implementation was originally using plain POSIX sockets APIs for all its I/O. So the first step in enabling TLS was to update the NBD code so that it used the new general purpose QEMU I/O channel APIs instead. With that done it was simply a matter of instantiating a new QIOChannelTLS object at the correct part of the protocol handshake and adding various command line options to the QEMU system emulator and qemu-nbd program to allow the user to turn on TLS and configure x509 certificates.

Running a NBD server using TLS can be done as follows:

$ qemu-nbd --object tls-creds-x509,id=tls0,endpoint=server,dir=/home/berrange/qemutls \
           --tls-creds tls0 /path/to/disk/image.qcow2

On the client host, a QEMU guest can then be launched, connecting to this NBD server:

$ qemu-system-x86_64 -object tls-creds-x509,id=tls0,endpoint=client,dir=/home/berrange/qemutls \
                     -drive driver=nbd,host=theotherhost,port=10809,tls-creds=tls0 \
                     ...other QEMU options...

Finally to enable support for live migration with block device replication, the QEMU system monitor APIs gained support for a new parameter when starting the internal NBD server. All of this code was merged in time for the forthcoming QEMU 2.6 release. Work has not yet started to enable TLS with NBD in libvirt, as there is little point securing the NBD protocol streams, until the primary live migration stream is using TLS. More on live migration in a future blog post, as that’s going to be QEMU 2.7 material now.

In this blog series:

Part 1: crypto code consolidation
Part 2: generic TLS support
Part 3: securely passing in credentials
Part 4: generic I/O channel framework to simplify TLS
Part 5: TLS support for NBD server & client
Part 6: TLS support for character devices
Part 7: TLS support for migration

Daniel P. Berrangé

Writing about open source software, virtualization & more