This blog is part 6 of a series I am writing about work I’ve completed over the past few releases to improve QEMU security related features.
A number of QEMU device models and objects use a character devices for providing connectivity with the outside world, including the QEMU monitor, serial ports, parallel ports, virtio serial channels, RNG EGD object, CCID smartcard passthrough, IPMI device, USB device redirection and vhost-user. While some of these will only ever need a character device configured with local connectivity, some will certainly need to make use of TCP connections to remote hosts. Historically these connections have always been entirely in clear text, which is unacceptable in the modern hostile network environment where even internal networks cannot be trusted. Clearly the QEMU character device code requires the ability to use TLS for encrypting sensitive data and providing some level of authentication on connections.
The QEMU character device code was mostly using GLib’s GIOChannel framework for doing I/O but this has a number of unsatisfactory limitations. It can not do vectored I/O, is not easily extensible and does not concern itself at all with initial connection establishment. These are all reasons why the QIOChannel framework was added to QEMU. So the first step in supporting TLS on character devices was to convert the code over to use QIOChannel instead of GIOChannel. With that done, adding in support for TLS was quite straightforward, merely requiring addition of a new configuration property (“tls-creds
“) to set the desired TLS credentials.
For example to run a QEMU VM with a serial port listening on IP 10.0.01, port 9000, acting as a TLS server:
$ qemu-system-x86_64 \
-object tls-creds-x509,id=tls0,endpoint=server,dir=/home/berrange/qemutls \
-chardev socket,id=s0,host=10.0.0.1,port=9000,tls-creds=tls0,server \
-device isa-serial,chardev=s0
...other QEMU options...
It is possible test connectivity to this TLS server using the gnutls-cli tool
$ gnutls-cli --priority=NORMAL -p 9000 \
--x509cafile=/home/berrange/security/qemutls/ca-cert.pem \
127.0.0.1
In the above example, QEMU was running as a TCP server, and acting as the TLS server endpoint, but this matching is not required. It is valid to configure it to run as a TLS client if desired, though this would be somewhat uncommon.
Of course you can connect 2 QEMU VMs together, both using TLS. Assuming the above QEMU is still running, we can launch a second QEMU connecting to it with
$ qemu-system-x86_64 \
-object tls-creds-x509,id=tls0,endpoint=client,dir=/home/berrange/qemutls \
-chardev socket,id=s0,host=10.0.0.1,port=9000,tls-creds=tls0 \
-device isa-serial,chardev=s0
...other QEMU options...
Notice, we’ve changed the “endpoint” and removed the “server” option, so this second QEMU runs as a TCP client and acts as the TLS client endpoint.
This feature is available since the QEMU 2.6.0 release a few months ago.
In this blog series:
Live migration is a long standing feature in QEMU/KVM (and other competing virtualization platforms), however, by default it does not cope very well with guests whose workload are very memory write intensive. It is very easy to create a guest workload that will ensure a migration will never complete in its default configuration. For example, a guest which continually writes to each byte in a 1 GB region of RAM will never successfully migrate over a 1Gb/sec NIC. Even with a 10Gb/s NIC, a slightly larger guest can dirty memory fast enough to prevent completion without an unacceptably large downtime at switchover. Thus over the years, a number of optional features have been developed for QEMU with the aim to helping migration to complete.
If you don’t want to read the background information on migration features and the testing harness, skip right to the end where there are a set of data tables showing charts of the results, followed by analysis of what this all means.
The techniques available
- Downtime tuning. Unless the guest is completely idle, it never possible to get to a point where 100% of memory has been transferred to the target host. So at some point there needs to be a decision made about whether enough memory has been transferred to allow the switch over to the target host with acceptable blackout period. The downtime tunable controls how long a blackout period is permitted during the switchover. QEMU measures the network transfer rate it is achieving and compares it to the amount of outstanding RAM to determine if it can be transferred within the configured downtime window. When migrating it is not desirable to set QEMU to use the maximum accepted downtime straightaway, as that guarantees that the guest will always suffer from the maximum downtime blackout. Instead, it is better to start off with a fairly small downtime value and increase the permitted downtime as time passes. The idea is to maximise the likelihood that migration can complete with a small downtime.
- Bandwidth tuning. If the migration is taking place over a NIC that is used for other non-migration related actions, it may be desirable to prevent the migration stream from consuming all bandwidth. As noted earlier though, even a relatively small guest is capable of dirtying RAM fast enough that even a 10Gbs NIC will not be able to complete migration. Thus if the goal is to maximise the chances of getting a successful migration though, the aim should be to maximise the network bandwidth available to the migration operation. Following on from this, it is wise not to try to run multiple migration operations in parallel unless their transfer rates show that they are not maxing out the available bandwidth, as running parallel migrations may well mean neither will ever finish.
- Pausing CPUs. The simplest and crudest mechanism for ensuring guest migration complete is to simply pause the guest CPUs. This prevent the guest from continuing to dirty memory and thus even on the slowest network, it will ensure migration completes in a finite amount of time. The cost is that the guest workload will be completely stopped for a prolonged period of time. Think of pausing the guest as being equivalent to setting an arbitrarily long maximum permitted downtime. For example, assuming a guest with 8 GB of RAM and an idle 10Gbs NIC, in the worst case pausing would lead to to approx 6 second period of downtime. If higher speed NICs are available, the impact of pausing will decrease until it converges with a typical max downtime setting.
- Auto-convergence. The rate at which a guest can dirty memory is related to the amount of time the guest CPUs are permitted to run for. Thus by throttling the CPU execution time it is possible to prevent the guest from dirtying memory so quickly and thus allow migration data transfer to keep ahead of RAM dirtying. If this feature is enabled, by default QEMU starts by cutting 20% of the guest vCPU execution time. At the startof each iteration over RAM, it will check progress during the previous two iterations. If insufficient forward progress is being made, it will repeatedly cut 10% off the running time allowed to vCPUs. QEMU will throttle CPUs all the way to 99%. This should guarantee that migration can complete on all by the most sluggish networks, but has a pretty high cost to guest CPU performance. It is also indiscriminate in that all guest vCPUs are throttled by the same factor, even if only one guest process is responsible for the memory dirtying.
- Post-copy. Normally migration will only switch over to running on the target host once all RAM has been transferred. With post-copy, the goal is to transfer “enough” or “most” RAM across and then switch over to running on the target. When the target QEMU gets a fault for a memory page that has not yet been transferred, it’ll make an explicit out of band request for that page from the source QEMU. Since it is possible to switch to post-copy mode at any time, it avoids the entire problem of having to complete migration in a fixed downtime window. The cost is that while running in post-copy mode, guest page faults can be quite expensive, since there is a need to wait for the source host to transfer the memory page over to the target, which impacts performance of the guest during post-copy phase. If there is a network interruption while in post-copy mode it will also be impossible to recover. Since neither the source or target host has a complete view of the guest RAM it will be necessary to reboot the guest.
- Compression. The migration pages are usually transferred to the target host as-is. For many guest workloads, memory page contents will be fairly easily compressible. So if there are available CPU cycles on the source host and the network bandwidth is a limiting factor, it may be worth while burning source CPUs in order to compress data transferred over the network. Depending on the level of compression achieved it may allow migration to complete. If the memory is not compression friendly though, it would be burning CPU cycles for no benefit. QEMU supports two compression methods, XBZRLE and multi-thread, either of which can be enabled. With XBZRLE a cache of previously sent memory pages is maintained that is sized to be some percentage of guest RAM. When a page is dirtied by the guest, QEMU compares the new page contents to that in the cache and then only sends a delta of the changes rather than the entire page. For this to be effective the cache size must generally be quite large – 50% of guest RAM would not be unreasonable. The alternative compression approach uses multiple threads which simply use zlib to directly compress the full RAM pages. This avoids the need to maintain a large cache of previous RAM pages, but is much more CPU intensive unless hardware acceleration is available for the zlib compression algorithm.
Measuring impact of the techniques
Understanding what the various techniques do in order to maximise chances of a successful migration is useful, but it is hard to predict how well they will perform in the real world when faced with varying workloads. In particular, are they actually capable of ensuring completion under worst case workloads and what level of performance impact do they actually have on the guest workload. This is a problem that the OpenStack Nova project is currently struggling to get a clear answer on, with a view to improving Nova’s management of libvirt migration. In order to try and provide some guidance in this area, I’ve spent a couple of weeks working on a framework for benchmarking QEMU guest performance when subjected to the various different migration techniques outlined above.
In OpenStack the goal is for migration to be a totally “hands off” operation for cloud administrators. They should be able to request a migration and then forget about it until it completes, without having to baby sit it to apply tuning parameters. The other goal is that the Nova API should not have to expose any hypervisor specific concepts such as post-copy, auto-converge, compression, etc. Essentially Nova itself has to decide which QEMU migration features to use and just “do the right thing” to ensure completion. Whatever approach is chosen needs to be able to cope with any type of guest workload, since the cloud admins will not have any visibility into what applications are actually running inside the guest. With this in mind, when it came to performance testing the QEMU migration features, it was decided to look at their behaviour when faced with the worst case scenario. Thus a stress program was written which would allocate many GB of RAM, and then spawn a thread on each vCPU that would loop forever xor’ing every byte of RAM against an array of bytes taken from /dev/random. This ensures that the guest is both heavy on reads and writes to memory, as well as creating RAM pages which are very hostile towards compression. This stress program was statically linked and built into a ramdisk as the /init program, so that Linux would boot and immediately run this stress workload in a fraction of a second. In order to measure performance of the guest, each time 1 GB of RAM has been touched, the program will print out details of how long it took to update this GB and an absolute timestamp. These records are captured over the serial console from the guest, to be later correlated with what is taking place on the host side wrt migration.
Next up it was time to create a tool to control QEMU from the host and manage the migration process, activating the desired features. A test scenario was defined which encodes details of what migration features are under test and their settings (number of iterations before activating post-copy, bandwidth limits, max downtime values, number of compression threads, etc). A hardware configuration was also defined which expressed the hardware characteristics of the virtual machine running the test (number of vCPUs, size of RAM, host NUMA memory & CPU binding, usage of huge pages, memory locking, etc). The tests/migration/guestperf.py tool provides the mechanism to invoke the test in any of the possible configurations.For example, to test post-copy migration, switching to post-copy after 3 iterations, allowing 1Gbs bandwidth on a guest with 4 vCPUs and 8 GB of RAM one might run
$ tests/migration/guestperf.py --cpus 4 --mem 8 --post-copy --post-copy-iters 3 --bandwidth 125 --dst-host myotherhost --transport tcp --output postcopy.json
The myotherhost.json file contains the full report of the test results. This includes all details of the test scenario and hardware configuration, migration status recorded at start of each iteration over RAM, the host CPU usage recorded once a second, and the guest stress test output. The accompanying tests/migration/guestperf-plot.py tool can consume this data file and produce interactive HTML charts illustrating the results.
$ tests/migration/guestperf-plot.py --split-guest-cpu --qemu-cpu --vcpu-cpu --migration-iters --output postcopy.html postcopy.json
To assist in making comparisons between runs, however, a set of standardized test scenarios also defined which can be run via a tests/migration/guestperf-batch.py tool, in which case it is merely required to provide the desired hardware configuration
$ tests/migration/guestperf-batch.py --cpus 4 --mem 8 --dst-host myotherhost --transport tcp --output myotherhost-4cpu-8gb
This will run all the standard defined test scenarios and save many data files in the myotherhost-4cpu-8gb directory. The same guestperf-plot.py tool can be used to create charts combining multiple data sets at once to allow easy comparison.
Performance results for QEMU 2.6
With the tools written, I went about running some tests against QEMU GIT master codebase, which was effectively the same as the QEMU 2.6 code just released. The pair of hosts used were Dell PowerEdge R420 servers with 8 CPUs and 24 GB of RAM, spread across 2 NUMA nodes. The primary NICs were Broadcom Gigabit, but it has been augmented with Mellanox 10-Gig-E RDMA capable NICs, which is what were picked for transfer of the migration traffic. For the tests I decided to collect data for two distinct hardware configurations, a small uniprocessor guest (1 vCPU and 1 GB of RAM) and a moderately sized multi-processor guest (4 vCPUs and 8 GB of RAM). Memory and CPU binding was specified such that the guests were confined to a single NUMA node to avoid performance measurements being skewed by cross-NUMA node memory accesses. The hosts and guests were all running the RHEL-7 3.10.0-0369.el7.x86_64 kernel.
To understand the impact of different network transports & their latency characteristics, the two hardware configurations were combinatorially expanded against 4 different network configurations – a local UNIX transport, a localhost TCP transport, a remote 10Gbs TCP transport and a remote 10Gbs RMDA transport.
The full set of results are linked from the tables that follow. The first link in each row gives a guest CPU performance comparison for each scenario in that row. The other cells in the row give the full host & guest performance details for that particular scenario
UNIX socket, 1 vCPU, 1 GB RAM
Using UNIX socket migration to local host, guest configured with 1 vCPU and 1 GB of RAM
UNIX socket, 4 vCPU, 8 GB RAM
Using UNIX socket migration to local host, guest configured with 4 vCPU and 8 GB of RAM
TCP socket local, 1 vCPU, 1 GB RAM
Using TCP socket migration to local host, guest configured with 1 vCPU and 1 GB of RAM
TCP socket local, 4 vCPU, 8 GB RAM
Using TCP socket migration to local host, guest configured with 4 vCPU and 8 GB of RAM
TCP socket remote, 1 vCPU, 1 GB RAM
Using TCP socket migration to remote host, guest configured with 1 vCPU and 1 GB of RAM
TCP socket remote, 4 vCPU, 8 GB RAM
Using TCP socket migration to remote host, guest configured with 4 vCPU and 8 GB of RAM
RDMA socket, 1 vCPU, 1 GB RAM
Using RDMA socket migration to remote host, guest configured with 1 vCPU and 1 GB of RAM
RDMA socket, 4 vCPU, 8 GB RAM
Using RDMA socket migration to remote host, guest configured with 4 vCPU and 8 GB of RAM
Analysis of results
The charts above provide the full set of raw results, from which you are welcome to draw your own conclusions. The test harness is also posted on the qemu-devel mailing list and will hopefully be merged into GIT at some point, so anyone can repeat the tests or run tests to compare other scenarios. What follows now is my interpretation of the results and interesting points they show
- There is a clear periodic pattern in guest performance that coincides with the start of each migration iteration. Specifically at the start of each iteration there is a notable and consistent momentary drop in guest CPU performance. Picking an example where this effect is clearly visible – the 1 vCPU, 1GB RAM config with the “Pause 5 iters, 300 mbs” test – we can see the guest CPU performance drop from 200ms/GB of data modified, to 450ms/GB. QEMU maintains a bitmap associated with guest RAM to track which pages are dirtied by the guest while migration is running. At the start of each iteration over RAM, this bitmap has to be read and reset and this action is what is responsible for this momentary drop in performance.
- With the larger guest sizes, there is a second roughly periodic but slightly more chaotic pattern in guest performance that is continual throughout migration. The magnitude of these spikes is about 1/2 that of those occurring at the start of each iteration. An example where this effect is clearly visible is the 4 vCPU, 8GB RAM config with the “Pause unlimited BW, 20 iters” test – we can see the guest CPU performance is dropping from 500ms/GB to between 700ms/GB and 800ms/GB. The host NUMA node that the guest is confined to has 4 CPUs and the guest itself has 4 CPUs. When migration is running, QEMU has a dedicated thread performing the migration data I/O and this is sharing time on the 4 host CPUs with the guest CPUs. So with QEMU emulator threads sharing the same pCPUs as the vCPU threads, we have 5 workloads competing for 4 CPUs. IOW the frequently slightly chaotic spikes in guest performance throughout the migration iteration are a result of overcommiting the host pCPUs. The magnitude of the spikes is directly proportional to the total transfer bandwidth permitted for the migration. This is not an inherent problem with migration – it would be possible to place QEMU emulator threads on a separate pCPU from vCPU threads if strong isolation is desired between the guest workload and migration processing.
- The baseline guest CPU performance differs between the 1 vCPU, 1 GB RAM and 4 vCPU 8 GB RAM guests. Comparing the UNIX socket “Pause unlimited BW, 20 iters” test results for these 1 vCPU and 4 vCPU configs we see the former has a baseline performance of 200ms/GB of data modified while the latter has 400ms/GB of data modified. This is clearly nothing to do with migration at all. Naively one might think that going from 1 vCPU to 4 vCPUs would result in 4 times the performance, since we have 4 times more threads available to do work. What we’re seeing here is likely the result of hitting the memory bandwidth limit, so each vCPU is competing for memory bandwidth and thus the overall performance of each vCPU has decreased. So instead of getting x4 the performance going from 1 to 4 vCPUs only doubled the performance.
- When post-copy is operating in its pre-copy phase, it has no measurable impact on the gust performance compared to when post-copy is not enabled at all. This can be seen by comparing the TCP socket “Paused 5 iters, 1 Gbs” test results with the “Post-copy 5 iters, 1 Gbs” test results. Both show the same baseline guest CPU performance and the same magnitude of spikes at the start of each iteration. This shows that it is viable to unconditionally enable the post-copy feature for all migration operations, even if the migration is likely to complete without needing to switch from pre-copy to post-copy phases. It provides the admin/app the flexibility to dynamically decide on the fly whether to switch to post-copy mode or stay in pre-copy mode until completion.
- When post-copy migration switches from its pre-copy phase to the post-copy phase, there is a major but short-lived spike in guest CPU performance. What is happening here is that the guest has perhaps 80% of its RAM transferred to the target host when post-copy phase starts but the guest workload is touching some pages which are still on the source, so the page fault is having to wait for the page to be transferred across the network. The magnitude of the spike and duration of the post-copy phase is related to the total guest RAM size and bandwidth available. Taking the remote TCP case with 1 vCPU, 1 GB RAM hardware config for clarity, and comparing the “Post-copy 5 iters, 1Gbs” scenario with the “Post-copy 5 iters, 10Gbs” scenario, we can see the magnitude of the spike in guest performance is the same order of magnitude in both cases. The overall time for each iteration of pre-copy phase is clearly shorter in the 10Gbs case. If we further compare with the local UNIX domain socket, we can see the spike in performance is much lower at the post-copy phase. What this is telling us is that the magnitude of the spike in the post-copy phase is largely driven by the latency in the time to transfer an out of band requested page from the source to the target, rather than the overall bandwidth available. There are plans in QEMU to allow migration to use multiple TCP connections which should significantly reduce the post-copy latency spike as the out-of-band requested pages will not get stalled behind a long TCP transmit queue for the background bulk-copy.
- Auto-converge will often struggle to ensure convergence for larger guest sizes or when the bandwidth is limited. Considering the 4 vCPU, 8 GB RAM remote TCP test comparing effects of different bandwidth limits we can see that with a 10Gbs bandwidth cap, auto-converge had to throttle to 80% to allow completion, while other tests show as much as 95% or even 99% in some cases. With a lower bandwidth limit of 1Gbs, the test case timed out after 5 minutes of running, having only attempted throttled down by 20%, showing auto-converge is not nearly aggressive enough when faced with low bandwidth links. The worst case guest performance seen when running auto-converge with CPUs throttled to 80% was on a par with that seen with post-copy immediately after switching to post-copy phase. The difference is that auto-converge shows that worst-case hit for a very long time during pre-copy, potentially many minutes, where as post-copy only showed it for a few seconds.
- Multi-thread compression was actively harmful to chances of a successful migration. Considering the 4 vCPU, 8 GB RAM remote TCP test comparing thread counts, we can see that increasing the number of threads actually made performance worse, with less iterations over RAM being completed before the 5 minute timeout was hit. The longer each iteration takes the more time the guest has to dirty RAM, so the less likely migration is to complete. There are two factors believe to be at work here to make MT compression results so bad. First, as noted earlier QEMU is confined to 4 pCPUs, so with 4 vCPUs running, the compression threads have to compete for time with the vCPU threads slowing down speed of compression. The stress test workload run in the guest is writing completely random bytes which are a pathological input dataset for compression, allowing almost no compression. Given the fact the compression was CPU limited though, even if there had been a good compression ratio, it would be unlikely to have a significant benefit since the increased time to iterate over RAM would allow the guest to dirty more data eliminating the advantage of compressing it. If the QEMU emulator threads were given dedicated host pCPUs to run on it may have increased the performance somewhat, but then that assumes the host has CPUs free that are not running other guests.
- XBZRLE compression fared a little better than MT compression. Again considering the 4 vCPU, 8 GB RAM remote TCP test comparing RAM cache sizing, we can see that the time required for each iteration over RAM did not noticeably increase. This shows that while XBZRLE compression did have a notable impact on guest CPU performance, is not seeing a major bottleneck on processing of each page as compared to MT compression. Again though, it did not help to achieve migration completion, with all tests timing out after 5 minutes or 30 iterations over RAM. This is due to the fact that the guest stress workload is again delivering input data that hits the pathological worst case in the algorithm. Faced with such a workload, no matter how much CPU time or RAM cache is available, XBZRLE can never have any positive impact on migration.
- The RDMA data transport showed up a few of its quirks. First, by looking at the RDMA results comparing pause bandwidth, we can clearly identify a bug in QEMU’s RDMA implementation – it is not honouring the requested bandwidth limits – it always transfers at maximum link speed. Second, all the post-copy results show failure, confirming that post-copy is currently not compatible with RDMA migration. When comparing 10Gbs RDMA against 10Gbs TCP transports, there is no obvious benefit to using RDMA – it was not any more likely to complete migration in any of the test scenarios.
Considering all the different features tested, post-copy is the clear winner. It was able to guarantee completion of migration every single time, regardless of guest RAM size with minimal long lasting impact on guest performance. While it did have a notable spike impacting guest performance at time of switch from pre to post copy phases, this impact was short lived, only a few seconds. The next best result was seen with auto-converge which again managed to complete migration in the majority of cases. By comparison with post-copy, the worst case impact seen to the guest CPU performance was the same order of magnitude, but it lasted for a very very long time, many minutes long. In addition in more bandwidth limited scenarios, auto-converge was unable to throttle guest CPUs quickly enough to avoid hitting the overall 5 minute timeout, where as post-copy would always succeed except in the most limited bandwidth scenarios (100Mbs – where no strategy can ever work). The other benefit of post-copy is that only the guest OS thread responsible for the page fault is delayed – other threads in the guest OS will continue running at normal speed if their RAM is already on the host. With auto-converge, all guest CPUs and threads are throttled regardless of whether they are responsible for dirtying memory. IOW post-copy has a targetted performance hit, where as auto-converge is indiscriminate. Finally, as noted earlier, post-copy does have a failure scenario which can result in loosing the VM in post-copy mode if the network to the source host is lost for long enough to timeout the TCP connection. This risk can be mitigated with redundancy at the network layer and it is only at risk for the short period of time the guest is running in post-copy mode, which is mere seconds with 10Gbs link
It was expected that the compression features would fare badly given the guest workload, but the impact was far worse than expected, particularly for MT compression. Given the major requirement compression has in terms of host CPU time (MT compression) or host RAM (XBZRLE compression), they do no appear to be viable as a general purpose features. They should only be used if the workloads are known to be compression friendly, the host has the CPU and/or RAM resources to spare and neither post-copy or auto-converge are possible to use. To make these features more practical to use in an automated general purpose manner, QEMU would have to be enhanced to allow the mgmt application to have directly control over turning them on and off during migration. This would allow the app to try using compression, monitor its effectiveness and then turn compression off if it is being harmful, rather than having to abort the migration entirely and restart it.
There is scope for further testing with RDMA, since the hardware used for testing was limited to 10Gbs. Newer RDMA hardware is supposed to be capable of reaching higher speeds, 40Gbs, even 100 Gbs which would have a correspondingly positive impact on ability to migrate. At least for any speeds of 10Gbs or less though, it does not appear worthwhile to use RDMA, apps would be better off using TCP in combintaion with post-copy.
In terms of network I/O, no matter what guest workload, QEMU is generally capable of saturating whatever link is used for migration for as long as it takes to complete. It is very easy to create workloads that will never complete, and decreasing the bandwidth available just increases the chances of migration. It might be tempting to think that if you have 2 guests, it would take the same total time whether you migrate them one after the other, or migrate them in parallel. This is not necessarily the case though, as with a parallel migration the bandwidth will be shared between them, which increases the chances that neither guest will ever be able to complete. So as a general rule it appears wise to serialize all migration operations on a given host, unless there are multiple NICs available.
In summary, use post-copy if it is available, otherwise use auto-converge. Don’t bother with compression unless the workload is known to be very compression friendly. Don’t bother with RDMA unless it supports more than 10 Gbs, otherwise stick with plain TCP.
This blog is part 5 of a series I am writing about work I’ve completed over the past few releases to improve QEMU security related features.
For many years now QEMU has had code to support the NBD protocol, either as a client or as a server. The qemu-nbd command line tool can be used to export a disk image over NBD to a remote machine, or connect it directly to the local kernel’s NBD block device driver. The QEMU system emulators also have a block driver that acts as an NBD client, allowing VMs to be run from NBD volumes. More recently the QEMU system emulators gained the ability to export the disks from a running VM as named NBD volumes. The latter is particularly interesting because it is the foundation of live migration with block device replication, allowing VMs to be migrated even if you don’t have shared storage between the two hosts. In common with most network block device protocols, NBD has never offered any kind of data security capability. Administrators are recommended to run NBD over a private LAN/vLAN, use network layer security like IPSec, or tunnel it over some other kind of secure channel. While all these options are capable of working, none are very convenient to use because they require extra setup steps outside of the basic operation of the NBD server/clients. Libvirt has long had the ability to tunnel the QEMU migration channel over its own secure connection to the target host, but this has not been extended to cover the NBD channel(s) opened when doing block migration. While it could theoretically be extended to cover NBD, it would not be ideal from a performance POV because the libvirtd architecture means that the TLS encryption/decryption for multiple separate network connections would be handled by a single thread. For fast networks (10-GigE), libvirt will quickly become the bottleneck on performance even if the CPU has native support for AES.
Thus it was decided that the QEMU NBD client & server would need to be extended to support TLS encryption of the data channel natively. Initially the thought was to just add a flag to the client/server code to indicate that TLS was desired and run the TLS handshake before even starting the NBD protocol. After some discussion with the NBD maintainers though, it was decided to explicitly define a way to support TLS in the NBD protocol negotiation phase. The primary benefit of doing this is to allow clearer error reporting to the user if the client connects to a server requiring use of TLS and the client itself does not support TLS, or vica-verca – ie instead of just seeing what appears to be a mangled NBD handshake and not knowing what it means, the client can clearly report “This NBD server requires use of TLS encryption”.
The extension to the NBD protocol was fairly straightforward. After the initial NBD greeting (where the client & server agree the NBD protocol variant to be used) the client is able to request a number of protocol options. A new option was defined to allow the client to request TLS support. If the server agrees to use TLS, then they perform a standard TLS handshake and the rest of the NBD protocol carries on as normal. To prevent downgrade attacks, if the NBD server requires TLS and the client does not request the TLS option, then it will respond with an error and drop the client. In addition if the server requires TLS, then TLS must be the first option that the client requests – other options are only permitted once the TLS session is active & the server will again drop the client if it tries to request non-TLS options first.
The QEMU NBD implementation was originally using plain POSIX sockets APIs for all its I/O. So the first step in enabling TLS was to update the NBD code so that it used the new general purpose QEMU I/O channel APIs instead. With that done it was simply a matter of instantiating a new QIOChannelTLS object at the correct part of the protocol handshake and adding various command line options to the QEMU system emulator and qemu-nbd program to allow the user to turn on TLS and configure x509 certificates.
Running a NBD server using TLS can be done as follows:
$ qemu-nbd --object tls-creds-x509,id=tls0,endpoint=server,dir=/home/berrange/qemutls \
--tls-creds tls0 /path/to/disk/image.qcow2
On the client host, a QEMU guest can then be launched, connecting to this NBD server:
$ qemu-system-x86_64 -object tls-creds-x509,id=tls0,endpoint=client,dir=/home/berrange/qemutls \
-drive driver=nbd,host=theotherhost,port=10809,tls-creds=tls0 \
...other QEMU options...
Finally to enable support for live migration with block device replication, the QEMU system monitor APIs gained support for a new parameter when starting the internal NBD server. All of this code was merged in time for the forthcoming QEMU 2.6 release. Work has not yet started to enable TLS with NBD in libvirt, as there is little point securing the NBD protocol streams, until the primary live migration stream is using TLS. More on live migration in a future blog post, as that’s going to be QEMU 2.7 material now.
In this blog series:
This blog is part 4 of a series I am writing about work I’ve completed over the past few releases to improve QEMU security related features.
Part 2 of this series described the creation of a general purpose API for simplifying TLS session handling inside QEMU, particularly with a view to hiding the complexity of the handshake and x509 certificate validation. The VNC server was converted to use this API, which was a big benefit, but there was still a need to add extra code to support TLS in the I/O paths. Specifically, anywhere that the VNC server would read/write on the network socket, had to be made TLS aware so that it would use plain POSIX send/recv functions vs the TLS wrapped send/recv functions as appropriate. For the VNC server it is actually even more complex, because it also supports websockets, so each I/O point had to choose between plain, TLS, websockets and websockets plus TLS. As TLS support extends to other areas of QEMU this pattern would continue to complicate I/O paths in each backend.
Clearly there was a need for some form of I/O channel abstraction that would allow TLS to be enabled in each QEMU network backend without having to add conditional logic at every I/O send/recv call. Looking around at the QEMU subsystems that would ultimately need TLS support, showed a variety of approaches currently in use
- Character devices use combination of POSIX sockets APIs to establish connections and GIOChannel for performing I/O on them
- Migration has a QEMUFile abstraction which provides read/write facilities for a number of underlying transports, TCP sockets, UNIX sockets, STDIO, external command, in memory buffer and RDMA. The various QEMUFile impls all uses the plain POSIX sockets APIs and for TCP/UNIX sockets the sendmsg/recvmsg functions for I/O
- NBD client & server use plain POSIX sockets APIs and sendmsg/recvmsg for I/O
- VNC server uses plain POSIX sockets APIs and sendmsg/recvmsg for I/O
The GIOChannel APIs used by the character device backend theoretically provide an extensible framework for I/O and there is even a TLS implementation of the GIOChannel API. The two limitations of GIOChannel for QEMU though are that it does not support scatter / gather / vectored I/O APIs and that it does not support file descriptor passing over UNIX sockets. The latter is not a show stopper, since you can still access the socket handle directly to send/recv file descriptors. The lack of vectored I/O though would be a significant issue for migration and NBD servers where performance is very important. While we could potentially extend GIOChannel to add support for new callbacks to do vectored I/O, by the time you’ve done that most of the original GIOChannel code isn’t going to be used, limiting the benefit of starting from GIOChannel as a base. It is also clear that GIOChannel is really not something that is going to get any further development from the GLib maintainers, since their focus is on the new and much better GIO library. This supports file descriptor passing and TLS encryption, but again lacks support for vectored I/O. The bigger show stopper though is that to get access to the TLS support requires depending on a version on GLib that is much newer than what QEMU is willing to use. The existing QEMUFile APIs could form the basis of a general purpose I/O channel system if they were untangled & extracted from migration codebase. One limitation is that QEMUFile only concerns itself with I/O, not the initial channel establishment which is left to the migration core code to deal with, so did not actually provide very much of a foundation on which to build.
After looking through the various approaches in use in QEMU, and potentially available from GLib, it was decided that QEMU would be best served by creating a new general purpose I/O channel API. Thus a new QEMU subsystem was added in the io/ and include/io/ directories to provide a set of classes for I/O over a variety of different data channels. The core design aims were to use the QEMU object model (QOM) framework to provide a standard pattern for extending / subclassing, use the QEMU Error object for all error reporting, file descriptor passing, main loop watch integration and coroutine integration. Overall the new design took many elements of its design from GIOChannel and the GIO library, and blended them with QEMU’s own codebase design. The initial goal was to provide enough functionality to convert the VNC server as a proof of concept. To this end the following classes were created
- QIOChannel – the abstract base defining the overall interface for the I/O framework
- QIOChannelSocket – implementation targeting TCP, UDP and UNIX sockets
- QIOChannelTLS – layer that can provide a TLS session over any other channel
- QIOChannelWebsock – layer that can run the websockets protocol over any other channel
To avoid making this blog posting even larger, I won’t go into details of these (the code is available in QEMU git for anyone who’s really interesting), but instead illustrate it with a comparison of the VNC code before & after. First consider the original code in the VNC server for dealing with writing a buffer of data over a plain socket or websocket either with TLS enabled. The following functions existed in the VNC server code to handle all the combinations:
ssize_t vnc_tls_push(const char *buf, size_t len, void *opaque)
{
VncState *vs = opaque;
ssize_t ret;
retry:
ret = send(vs->csock, buf, len, 0);
if (ret < 0) {
if (errno == EINTR) {
goto retry;
}
return -1;
}
return ret;
}
ssize_t vnc_client_write_buf(VncState *vs, const uint8_t *data, size_t datalen)
{
ssize_t ret;
int err = 0;
if (vs->tls) {
ret = qcrypto_tls_session_write(vs->tls, (const char *)data, datalen);
if (ret < 0) {
err = errno;
}
} else {
ret = send(vs->csock, (const void *)data, datalen, 0);
if (ret < 0) {
err = socket_error();
}
}
return vnc_client_io_error(vs, ret, err);
}
long vnc_client_write_ws(VncState *vs)
{
long ret;
vncws_encode_frame(&vs->ws_output, vs->output.buffer, vs->output.offset);
buffer_reset(&vs->output);
return vnc_client_write_buf(vs, vs->ws_output.buffer, vs->ws_output.offset);
}
static void vnc_client_write_locked(void *opaque)
{
VncState *vs = opaque;
if (vs->encode_ws) {
vnc_client_write_ws(vs);
} else {
vnc_client_write_plain(vs);
}
}
After conversion to use the new QIOChannel classes for sockets, websockets and TLS, all of the VNC server code above turned into
ssize_t vnc_client_write_buf(VncState *vs, const uint8_t *data, size_t datalen)
{
Error *err = NULL;
ssize_t ret;
ret = qio_channel_write(vs->ioc, (const char *)data, datalen, &err);
return vnc_client_io_error(vs, ret, &err);
}
It is clearly a major win for maintainability of the VNC server code to have all the TLS and websockets I/O support handled by the QIOChannel APIs. There is no impact to supporting TLS and websockets anywhere in the VNC server I/O paths now. The only place where there is new code is the point where the TLS or websockets session is initiated and this now only requires instantiation of a suitable QIOChannel subclass and registering a callback to be run when the session handshake completes (or fails).
tls = qio_channel_tls_new_server(vs->ioc, vs->vd->tlscreds, vs->vd->tlsaclname, &err);
if (!tls) {
vnc_client_error(vs);
return 0;
}
object_unref(OBJECT(vs->ioc));
vs->ioc = QIO_CHANNEL(tls);
qio_channel_tls_handshake(tls, vnc_tls_handshake_done, vs, NULL);
Notice that the code is simply replacing the current QIOChannel handle ‘vs->ioc’ with an instance of the QIOChannelTLS class. The vnc_tls_handshake_done method is invoked when the TLS handshake is complete or failed and lets the VNC server continue with the next part of its authentication protocol, or drop the client connection as appropriate. So adding TLS session support to the VNC server comes in at about 10 lines of code now.
In this blog series:
This blog is part 3 of a series I am writing about work I’ve completed over the past few releases to improve QEMU security related features.
When configuring a virtual machine, there are a number of places where QEMU requires some form of security sensitive credentials, typically passwords or encryption keys. Historically QEMU has had no standard approach for getting these credentials from the user, so things have grown in an adhoc manner with predictably awful results. If the VNC server is configured to use basic VNC authentication, then it requires a password to be set. When I first wrote patches to add password auth to QEMU’s VNC server it was clearly not desirable to expose the password on the command line, so when configuring VNC you just request password authentication be enabled using -vnc 0.0.0.0:0,password
and then have to use the monitor interface to set the actual password value “change vnc password
“. Until a password has been set via the monitor, the VNC server should reject all clients, except that we’ve accidentally broken this in the past, allowing clients when no server password is set :-( The qcow & qcow2 disk image formats support use of AES for encryption (remember this is horribly broken) and so there needs to be a way to provide the decryption password for this. Originally you had to wait for QEMU to prompt for the disk password on the interactive console. This clearly doesn’t work very nicely when QEMU is being managed by libvirt, so we added another monitor command which allows apps to provide the disk password upfront, avoiding the need to prompt. Fast forward a few years and QEMU’s block device layer gained support for various network protocols including iSCSI, RBD, FTP and HTTP(s). All of these potentially require authentication and thus a password needs to be provided to QEMU. The CURL driver for ftp, http(s) simply skipped support for authentication since there was no easy way to provide the passwords securely. Sadly, the iSCSI and RBD drivers simply decided to allow the password to be provided in the command line. Hence the passwords for RBD and iSCSI are visible in plain text in the process listing and in libvirt’s QEMU log files, which often get attached to bug reports, which has resulted in a CVE being filed against libvirt. I had an intention to add support for the LUKS format in the QEMU block layer which will also require passwords to be provided securely to QEMU, and it would be desirable if the x509 keys provided to QEMU could be encrypted too.
Looking at this mess and the likely future requirements, it was clear that QEMU was in desperate need of a standard mechanism for securely receiving credentials from the user / management app (libvirt). There are a variety of channels via which credentials can be theoretically passed to QEMU:
- Command line argument
- Environment variable
- Plain file
- Anonymous pipe
- Monitor command
As mentioned previously, using command line arguments or environment variables is not secure if the credential is passed in plain text, because they are visible in the processing list and log files. It would be possible to create a plain file on disk and write each password to it and use file permissions to ensure only QEMU can read it. Using files is not too bad as long as your host filesystem is on encrypted storage. It has a minor complexity of having to dynamically create files on the fly each time you want to hotplug a new device using a password. Most of these problems can be avoided by using an anonymous pipe, but this is more complicated for end users because for hotplugging devices it would require passing file descriptors over a UNIX socket. Finally the monitor provides a decent secure channel which users / mgmt apps will typically already have open via a UNIX socket. There is a chicken & egg problem with it though, because the credentials are often required at initial QEMU startup when parsing the command line arguments, and the monitor is not available that early.
After considering all the options, it was decided that using plain files and/or anonymous pipes to pass credentials would be the most desirable approach. The qemu_open() method has a convenient feature whereby there is a special path prefix that allows mgmt apps to pass a file descriptor across instead of a regular filename. To enable reuse of existing -object
command line argument and object_add
monitor commands for definin credentials, the QEMU object model framework (QOM) was used to define a ‘secret’ object class. The ‘secret
‘ class has a ‘path
‘ property which is the filename containing the credential. For example it could be used
# echo "letmein" > mydisk.pw
# $QEMU -object secret,id=sec0,file=mydisk.pw
Having written this, I realized that it would be possible to allow passwords to be provided directly via the command line if we allowed secret data to be encrypted with a master key. The idea would be that when a QEMU process is first started, it gets given a new unique AES key via a file. The credentials for individual disks / servers would be encrypted with the master key and then passed directly on the command line. The benefit of this is that the mgmt app only needs to deal with a single file on disk with a well defined lifetime.
First a master key is generated and saved to a file in base64 format
# openssl rand -base64 32 > master-key.b64
Lets say we have two passwords we need to give to QEMU. We will thus need two initialization vectors
# openssl rand -base64 16 > sec0-iv.b64
# openssl rand -base64 16 > sec1-iv.b64
Each password is now encrypted using the master key and its respective initialization vector
# SEC0=$(printf "letmein" |
openssl enc -aes-256-cbc -a \
-K $(base64 -d master-key.b64 | hexdump -v -e '/1 "%02X"') \
-iv $(base64 -d sec0-iv.b64 | hexdump -v -e '/1 "%02X"'))
# SEC1=$(printf "1234567" |
openssl enc -aes-256-cbc -a \
-K $(base64 -d master-key.b64 | hexdump -v -e '/1 "%02X"') \
-iv $(base64 -d sec1-iv.b64 | hexdump -v -e '/1 "%02X"'))
Finally when QEMU is launched, three secrets are defined, the first gives the master key via a file, and the others provide the two encrypted user passwords
# $QEMU \
-object secret,id=secmaster,format=base64,file=key.b64 \
-object secret,id=sec0,keyid=secmaster,format=base64,\
data=$SECRET,iv=$(<sec0-iv.b64) \
-object secret,id=sec1,keyid=secmaster,format=base64,\
data=$SECRET,iv=$(<sec1-iv.b64) \
...other args using the secrets...
Now we have a way to securely get credentials into QEMU, there just remains the task of associating the secrets with the things in QEMU that need to use them. The TLS credentials object previously added originally required the x509 server key to be provided in an unencrypted PEM file. The tls-creds-x509
object can now gain a new property “passwordid
” which provides the ID of a secret object that defines the password to use for decrypting the x509 key.
# $QEMU \
-object secret,id=secmaster,format=base64,file=key.b64 \
-object secret,id=sec0,keyid=secmaster,format=base64,\
data=$SECRET,iv=$(<sec0-iv.b64) \
-object tls-creds-x509,id=tls0,dir=/home/berrange/qemutls,endpoint=server,passwordid=sec0 \
-vnc 0.0.0.0:0,tls-creds=tls0
Aside from adding support for encrypted x509 certificates, the RBD, iSCSI and CURL block drivers in QEMU have all been updated to allow authentication passwords to be provided using the ‘secret
‘ object type. Libvirt will shortly be gaining support to use this facility which will address the long standing problem of RBD/ISCSI passwords being visible in clear text in the QEMU process command line arguments. All the enhancements described in this posting have been merged for the forthcoming QEMU 2.6.0 release so will soon be available to users. The corresponding enhancements to libvirt to make use of these features are under active development.
In this blog series: