It has been a long time coming, but the Linux desktop is finally getting to the point where colour management is widely available in applications. At a low level ArgyllCMS is providing support for many colour calibration devices and lcms provides a nice library for applying colour profile transformations to images. At a high level, the graphics/photos tools DigiKam, GIMP, UFRaw, InkScape, Phatch and XSane are all able to do colour management. Most are even following the X colour management spec to automatically obtain the current monitor profile. In the last few weeks Richard Hughes has filled in another missing piece, writing gnome-color-manager to provide a UI for driving ArgyllCMS and setting up monitor profiles upon login.
It is great to be able to do photo/graphics work on a fully colour managed Linux desktop… and then you upload the photos to Flickr and they go back to looking awful. After a little googling though, it turns out all is not lost. Firefox does in fact contain some colour management support, hidden away in its truly awful about:config page. If you go to that page and filter on ‘gfx’, you’ll find a few settings with ‘color_management’ in their name:
gfx.color_management.display_profile
gfx.color_management.mode
gfx.color_management.rendering_intent
The first, display_profile, takes the full path to an ICC profile for your monitor, while mode controls where colour management is applied. A value of ‘2’ will make firefox only apply profiles to images explicitly tagged with a profile. A value of ‘1’ will make firefox apply profiles to CSS and images, assuming an sRGB profile if the image is not tagged. rendering_intent takes values 0, 1, 2, 3 corresponding to ‘perceptual’, ‘relative colourimetric’, ‘saturation’ and ‘absolute colourimetric’ respectively. I configured my firefox for mode=1, set a profile and restarted. Browsing to Flickr showed an immediate improvement, with my images actually appearing in the correct colours, matching those I see during editing in GIMP/UFRaw/etc. There’s a little more info about these settings at the mozilla developer notes on ICC.
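If you prefer not to fiddle with about:config by hand, the same settings can also be dropped into a user.js file in your Firefox profile directory. A minimal sketch, with an obviously made-up path to the monitor profile:

user_pref("gfx.color_management.mode", 1);
user_pref("gfx.color_management.display_profile", "/home/dan/.color/icc/monitor.icc");
user_pref("gfx.color_management.rendering_intent", 0);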
While it is nice to have colour management in firefox, its implementation is rather sub-optimal since it requires the user to manually configure the display ICC profile path. Each display profile is only valid with the monitor against which it was created, so the moment I switch my laptop from its built-in LCD to an external LCD all the colours in firefox will go to hell. If firefox followed the X ICC profile spec it would be able to automatically apply the correct profile for each monitor. Hopefully someone will be motivated to fix this soon, since the spec is rather easy to comply with, needing only a quick look at a particular named property on the root window.
In my previous post I discussed the new disk encryption capabilities available with libvirt in Fedora 12. That was not the only new feature we quietly introduced without telling anyone. The second was integration between libvirt and the kernel’s CGroups functionality. In fact we have had this around for a while, but prior to this point it has only been used with our LXC driver. It is now also available for use with the QEMU driver.
What are CGroups?
You might be wondering at this point what CGroups actually are? At a high level, it is a generic mechanism the kernel provides for grouping processes and applying controls to those groups. The grouping is done via a virtual filesystem called “cgroup”. Within this filesystem, each directory defines a new group. Thus groups can be arranged to form an arbitrarily nested hierarchy simply by creating new sub-directories.
Tunables within a cgroup are provided by what the kernel calls ‘controllers’, with each controller able to expose one or more tunables or controls. When mounting the cgroups filesystem it is possible to indicate which controllers are to be activated. This makes it possible to mount the filesystem several times, with each mount point having a different set of (non-overlapping) controllers. Why might separate mount points be useful? The key idea is that this allows the administrator to construct differing group hierarchies for different sets of controllers/tunables.
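To make that concrete, here is roughly what mounting two separate hierarchies by hand might look like (the mount point names are arbitrary; the cgconfig service described below does the equivalent from a config file):

# mkdir -p /dev/cgroups/cpu /dev/cgroups/memory
# mount -t cgroup -o cpu,cpuacct cgroup_cpu /dev/cgroups/cpu
# mount -t cgroup -o memory cgroup_memory /dev/cgroups/memory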
The controllers provided by current kernels include:
- memory (Memory controller): Allows for setting limits on RAM and swap usage and querying cumulative usage of all processes in the group
- cpuset (CPU set controller): Binding of processes within a group to a set of CPUs and controlling migration between CPUs
- cpuacct (CPU accounting controller): Information about CPU usage for a group of processes
- cpu (CPU scheduler controller): Controlling the prioritization of processes in the group. Think of it as a more advanced nice level
- devices (Devices controller): Access control lists on character and block devices
- freezer (Freezer controller): Pause and resume execution of processes in the group. Think of it as SIGSTOP for the whole group
- net_cls (Network class controller): Control network utilization by associating processes with a ‘tc’ network class
This isn’t the blog post to go into fine details about each of these controllers & their capabilities, the high level overview will do. Suffice to say that at this time, the libvirt LXC driver (container based virtualization) will use all of these controllers except for net_cls and cpuset, while the libvirt QEMU driver will only use the cpu and devices controllers.
Activating CGroups on a Fedora 12 system
CGroups are a system-wide resource and libvirt doesn’t presume that it can dictate how CGroup controllers are mounted, nor in what hierarchy they are arranged. It will leave mount point & directory setup entirely to the administrator’s discretion. Unfortunately though, it is not just a matter of adding some mount points to /etc/fstab. It is necessary to set up the directory hierarchy and decide how processes get placed within it. Fortunately the libcg project provides an init service and a set of tools to assist in host configuration. On Fedora 12 this is packaged in the libcgroup RPM. If you don’t have that installed, install it now!
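On Fedora 12 that should just be a matter of:

# yum install libcgroup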
The primary configuration file for CGroups is /etc/cgconfig.conf. There are two interesting things to configure here. Step 1 is declaring what controllers are mounted where. Should you wish to keep things very simple, it is possible to mount many controllers in one single location with a snippet that looks like:
mount {
cpu = /dev/cgroups;
cpuacct = /dev/cgroups;
memory = /dev/cgroups;
devices = /dev/cgroups;
}
This will allow a hierarchy of process cgroups rooted at /dev/cgroups. If you get more advanced though, you might wish to have one hierarchy just for CPU scheduler controls, another for device ACLs, and a third for memory management. That could be accomplished using a configuration that looks like this:
mount {
cpu = /dev/cgroups/cpu;
cpuacct = /dev/cgroups/cpu;
memory = /dev/cgroups/memory;
devices = /dev/cgroups/devices;
}
Going with this second example, save the /etc/cgconfig.conf file with these mount rules, and then activate the configuration by running:
# service cgconfig start
Looking in /dev/cgroups, there should now be a number of top level directories:
# ls /dev/cgroups/
cpu devices memory
# ls /dev/cgroups/cpu
cpuacct.stat cpu.rt_period_us notify_on_release tasks
cpuacct.usage cpu.rt_runtime_us release_agent
cpuacct.usage_percpu cpu.shares sysdefault
# ls /dev/cgroups/memory/
memory.failcnt memory.stat
memory.force_empty memory.swappiness
memory.limit_in_bytes memory.usage_in_bytes
memory.max_usage_in_bytes memory.use_hierarchy
memory.memsw.failcnt notify_on_release
memory.memsw.limit_in_bytes release_agent
memory.memsw.max_usage_in_bytes sysdefault
memory.memsw.usage_in_bytes tasks
# ls /dev/cgroups/devices/
devices.allow devices.list release_agent tasks
devices.deny notify_on_release sysdefault
Now that the basic cgroups controllers are mounted in the locations we want, there is a choice of how to proceed. If starting libvirtd at this point, it will end up in the sysdefault group seen here. This may be satisfactory for some people, in which case they can skip right ahead to the later notes on how KVM virtual machines use cgroups. Other people may want to move the libvirtd daemon (and thus everything it runs) into a separate cgroup first.
Placing the libvirtd daemon in a dedicated group
Let’s say we wish to place an overall limit on the amount of memory that can be used by the libvirtd daemon and all guests that it launches. For this it will be necessary to define a new group, and specify a limit using the memory controller. Back in the /etc/cgconfig.conf configuration file, this can be achieved using the ‘group’ statement:
group virt {
memory {
memory.limit_in_bytes = 7500M;
}
}
This says that no matter what all the processes in this group do, their combined usage will never be allowed above 7.5 GB. Any usage above this limit will cause stuff to be pushed out to swap. Newer kernels can even let you control how much swap can be used before the OOM killer comes out of hiding. This particular example is chosen to show how cgroups can be used to protect the virtualization host against accidental overcommit. For example, on this server with 8 GB of RAM, no matter how crazy & out of control the virtual machines get, I have reasonable reassurance that I’ll always be able to get to an SSH / console login prompt, because we’ve left a guaranteed 500 MB for other cgroups (ie, the rest of the system) to use.
Now that the desired custom group has been defined, it is necessary to actually tell the system that the libvirtd daemon (or any other interesting daemon) needs to be placed in this virt group. If the daemon in question has a well designed initscript, it will be using the shared functions from /etc/init.d/functions, in particular the ‘daemon’ function. libvirtd is of course well designed :-P Placing libvirtd into a cgroup requires adding one line to its configuration file /etc/sysconfig/libvirtd:
CGROUP_DAEMON="memory:/virt"
If we wanted to place it in several cgroups, those would be listed in that same parameter, using a space to separate each. At this point a (re-)start of the cgconfig and libvirtd services will complete the host setup. There is a magic /proc file which can show you at a glance what cgroup any process is living in:
# service cgconfig restart
# service libvirtd restart
# PID=`pgrep libvirtd`
# cat /proc/$PID/cgroup
32:devices:/sysdefault
16:memory:/virt
12:cpuacct,cpu:/sysdefault
Our config didn’t say anything about the devices or cpuacct controllers, even though we’d asked for them to be mounted. Thus libvirtd got placed in the sysdefault group for those controllers.
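If you want to double-check that the memory limit really took effect, it can be read straight back out of the cgroups filesystem (the path assumes the mount layout used above; 7500M works out to 7864320000 bytes):

# cat /dev/cgroups/memory/virt/memory.limit_in_bytes
7864320000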
Controlling KVM guests
The libvirtd daemon has drivers for many virtualization technologies, and at the time of writing its LXC and QEMU drivers integrate with CGroups. For maximum flexibility of administration, libvirt will create its own mini hierarchy of groups into which guests will be placed. This mini hierarchy will be rooted at whatever location the libvirtd daemon starts in.
$ROOT
 |
 +- libvirt (all virtual machines/containers run by libvirtd)
    |
    +- lxc (all LXC containers run by libvirtd)
    |  |
    |  +- guest1 (LXC container called 'guest1')
    |  +- guest2 (LXC container called 'guest2')
    |  +- guest3 (LXC container called 'guest3')
    |  +- ...    (LXC container called ...)
    |
    +- qemu (all QEMU/KVM guests run by libvirtd)
       |
       +- guest1 (QEMU machine called 'guest1')
       +- guest2 (QEMU machine called 'guest2')
       +- guest3 (QEMU machine called 'guest3')
       +- ...    (QEMU machine called ...)
Remember that we actually configured 3 mount points: /dev/cgroups/cpu, /dev/cgroups/devices and /dev/cgroups/memory. libvirt will detect whatever you mounted, and replicate its mini hierarchy at the appropriate place. So in the above example $ROOT will expand to 3 locations:
/dev/cgroups/cpu/sysdefault
/dev/cgroups/memory/virt
/dev/cgroups/devices/sysdefault
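As a rough illustration, once a QEMU guest named ‘demo’ is running you should be able to find its group under each of those roots, along with the per-group tunables (the exact file listing will vary by kernel version):

# ls -d /dev/cgroups/cpu/sysdefault/libvirt/qemu/demo
/dev/cgroups/cpu/sysdefault/libvirt/qemu/demo
# cat /dev/cgroups/cpu/sysdefault/libvirt/qemu/demo/cpu.shares
1024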
As an administrator you should not need to concern yourself with directly accessing the tunables within libvirt’s mini cgroups hierarchy. libvirt will add/remove the child groups for each guest when the guest is booted/destroyed respectively. The libvirt API and/or virsh command line tool provide mechanisms to set tunables on guests. For the QEMU and LXC drivers in libvirt, the virsh schedinfo command provides access to CPU scheduler prioritization for a guest:
# virsh schedinfo demo
Scheduler : posix
cpu_shares : 1024
The “cpu_shares” value is a relative prioritization. All guests start out with a cpu_shares of 1024. If you halve a guest’s “cpu_shares” value, it will get 1/2 the CPU time compared to other guests. This is applied to the guest as a whole, regardless of how many virtual CPUs it has. This last point is an important benefit over simple ‘nice’ levels, which operate per-thread. With the latter it is very hard to set relative prioritization between guests unless they all have exactly the same number of virtual CPUs. The cpu_shares tunable can be set with the same virsh command:
# virsh schedinfo --set cpu_shares=500 demo
Scheduler : posix
cpu_shares : 500
For LXC containers, the configured guest memory limit is implemented via the ‘memory’ controller, and container CPU time accounting is done with the ‘cpuacct’ controller.
If the “devices” controller is mounted, libvirt will attempt to use that to control what a guest has access to. It will set up the ACLs so that all guests can access things like /dev/null, /dev/random, etc, but deny all access to block devices except for those explicitly configured in each guest’s XML. This is done for both QEMU and LXC guests.
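Under the hood the devices controller is driven by writing simple ‘type major:minor access’ rules into devices.deny and devices.allow. A hand-run sketch of the sort of thing libvirt does for a guest (the cgroup path and the device numbers here are purely illustrative):

# cd /dev/cgroups/devices/sysdefault/libvirt/qemu/demo
# echo 'a' > devices.deny            # start by revoking access to everything
# echo 'c 1:3 rwm' > devices.allow   # allow /dev/null
# echo 'c 1:8 rwm' > devices.allow   # allow /dev/random
# cat devices.list
c 1:3 rwm
c 1:8 rwm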
Future work
There is quite a lot more that libvirt could do to take advantage of CGroups. We already have three feature requests that can be satisfied with the use of CGroups:
- Enforcement of guest memory usage: QEMU guests are supplied with the VirtIO balloon device. This allows the host OS to request that the guest OS release memory it was previously allocated back to the host, letting administrators dynamically adjust memory allocation of guests on the fly. The obvious downside is that it relies on co-operation of the guest OS; if the guest ignores the request to release memory there is nothing the host can do. Almost nothing. The cgroups memory controller allows a limit on both RAM and swap usage to be set. Since each guest is placed in its own cgroup, we can use this control to enforce the balloon request, by updating the cgroup memory controller limit whenever changing the balloon limit. If the guest ignores the balloon request, then the cgroups controller will force part of the guest’s RAM allocation out to swap. This gives us both a carrot and a stick.
- Network I/O policy on guests: The cgroups net_cls controller allows a cgroup to be associated with a ‘tc’ traffic control class (see the tc(8) manpage). One obvious example usage would involve setting a hard bandwidth cap, but there are plenty more use cases beyond that.
- Disk I/O policy on guests: A number of kernel developers have been working on a new disk I/O bandwidth control mechanism for a while now, targeting virtualization as a likely consumer. The problem is not quite as simple as it sounds though. While you might put caps on the bandwidth of guests, that ignores the impact of seek time. If all the guest I/O patterns are sequential, high throughput, things might be great, but a single guest doing excessive random access I/O, causing too many seeks, can easily destroy the throughput of all the others. Nonetheless, there appears to be a strong desire for disk I/O bandwidth controls, and this is almost certainly going to end up as another cgroup controller that libvirt can then take advantage of.
There are probably more things that can be done with cgroups, but this is plenty to keep us busy for a while.
Without much fanfare we slipped a significant new feature into libvirt in Fedora 12, namely the ability to encrypt a virtual machine’s disks. Ordinarily this would have been widely publicised as a Fedora 12 release feature, but the code arrived into libvirt long after the Fedora feature writeup deadline. Before continuing, special thanks are due to Miloslav Trmač who wrote nearly all of this encryption/secrets management code for libvirt!
Why might you want to encrypt a guest’s disk from the host, rather than using the guest OS’s own block encryption capabilities (eg the block encryption support in anaconda)? There are a few reasons actually…
- The host is using a network filesystem (like NFS/GFS) for storing guest disks and a guarantee is required that no one can snoop on any guest data, regardless of guest OS configuration.
- The guest OS can boot without needing any password prompts, since libvirt can supply the decryption key directly to QEMU on the host side when launching the guest.
- It allows integration with a key management server. Libvirt provides APIs for setting the keys associated with a guest’s disks. A key management service can use these APIs to set/clear the keys for each host to match the list of guests it is intended to run.
There are probably more advantages to managing encryption on the virtualization host, but I’m not going to try to think of them now. Instead, the rest of this posting will give a short overview of how to use the new encryption capabilities.
Secret management
There are many objects managed by libvirt which can conceivably use/require encryption secrets. It is also desirable that libvirt be able to integrate with external key management services, rather than always having to store secrets itself. For these two reasons, rather than directly set encryption secrets against virtual machines, or virtual disks, libvirt introduces a simple set of “secrets” management APIs. The first step in using disk encryption is thus to define a new secret in libvirt. In keeping with all other libvirt objects, a secret is defined by a short XML document
# cat demo-secret.xml
<secret ephemeral='no' private='no'>
<uuid>0a81f5b2-8403-7b23-c8d6-21ccc2f80d6f</uuid>
<usage type='volume'>
<volume>/home/berrange/VirtualMachines/demo.qcow2</volume>
</usage>
</secret>
The “ephemeral” attribute controls whether libvirt will store a persistent copy of the secret on disk. If you had an external key management server talking to libvirt you would typically set this to ‘yes’, so that keys were never written to disk on individual virtualization hosts. Most people though will want to set this to ‘no’, so that libvirt stores the secret, otherwise you’ll lose all your keys when you reboot, which probably isn’t what you want! When running against a privileged libvirtd instance (eg with the qemu:///system URI), secrets are stored in /etc/libvirt/secrets, while when running unprivileged (qemu:///session), secrets are stored in $HOME/.libvirt/secrets.
The “private” attribute controls whether you can ask libvirt to give you the value associated with a secret. When it is ‘yes’, secrets are “write only”, once you’ve set the value, libvirt will refuse to tell you what it is. Again this is useful if you are using a key management server, because it allows it to load a secret into libvirt in order to start a guest, without allowing anyone else who is connected to libvirt to actually see what its value is.
The “uuid” is simply a unique identifier for the secret, when defining a secret this can be left out and it will be auto-generated.
Finally the “usage” element indicates what object the secret will be used with. This is not technically required, but when you have many hundreds of secrets defined, it is useful to know what objects they’re associated with, so you can easily purge secrets which are no longer used/required.
Having created the XML snippet for a secret as above, the next step is to load the secret definition into libvirt. If you are familiar with libvirt API/command naming conventions, you won’t be surprised to find out that this is done using the ‘virsh secret-define’ command:
# virsh secret-define demo-secret.xml
Secret 0a81f5b2-8403-7b23-c8d6-21ccc2f80d6f created
Notice how we have not actually set the value of the secret anywhere, merely defined metadata. We explicitly choose not to include the secret’s value in the XML, since that would increase the risk of it being accidentally exposed in log files / bug reports / etc. Thus once the secret has been defined, it is necessary to set its value. There is a special virsh command for doing this called ‘virsh secret-set-value’, which takes two parameters: the UUID of the secret and then the value in base64. If you’re one of those people who can’t compute base64 in your head, then there’s of course the useful ‘base64’ command line tool:
# MYSECRET=`echo "open seseme" | base64`
# virsh secret-set-value 0a81f5b2-8403-7b23-c8d6-21ccc2f80d6f $MYSECRET
Secret value set
There are a few other virsh commands available for managing secrets, but those two are the key ones you need to know in order to provision a new guest using encrypted disks. See the ‘virsh help’ output for the other commands.
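For the curious, the other secret related commands you are most likely to want are along these lines (secret-get-value will of course be refused for secrets marked private):

# virsh secret-list
# virsh secret-dumpxml 0a81f5b2-8403-7b23-c8d6-21ccc2f80d6f
# virsh secret-get-value 0a81f5b2-8403-7b23-c8d6-21ccc2f80d6f
# virsh secret-undefine 0a81f5b2-8403-7b23-c8d6-21ccc2f80d6f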
Virtual disks
Being able to define secrets isn’t much fun if you can’t then put those secrets to use. The first interesting task is probably to create an encrypted disk. At this point in time, libvirt’s storage APIs only support encryption of the qcow1 or qcow2 formats. It would be very desirable to support dm-crypt too, but that’s an outstanding feature request for someone else to implement.
I’ve got a directory based storage pool configured on my libvirt host which points to $HOME/VirtualMachines, defined with the following XML:
# virsh pool-dumpxml VirtualMachines
<pool type='dir'>
<name>VirtualMachines</name>
<source>
</source>
<target>
<path>/home/berrange/VirtualMachines</path>
</target>
</pool>
To create an encrypted volume within this pool it is necessary to provide a short XML document describing the volume: what format it should be, how large it should be, and what secret it should use for encryption.
# cat demo-disk.xml
<volume>
<name>demo.qcow2</name>
<capacity>5368709120</capacity>
<target>
<format type='qcow2'/>
<encryption format='qcow'>
<secret type='passphrase' uuid='0a81f5b2-8403-7b23-c8d6-21ccc2f80d6f'/>
</encryption>
</target>
</volume>
Notice that we set the volume format to ‘qcow2’ since that is the type of disk we want to create. The XML then has the newly introduced “encryption” element which says that the volume should be encrypted using the ‘qcow’ encryption method (this is the same method for both qcow1 and qcow2 format disks). Finally it indicates that the ‘qcow’ encryption “passphrase” is provided by the secret with UUID 0a81f5b2-8403-7b23-c8d6-21ccc2f80d6f. The disk can now be created using the “virsh vol-create” command, for example:
# virsh vol-create VirtualMachines demo-disk.xml
Vol demo.qcow2 created from demo-disk.xml
An oddity of the qcow2 disk format is that it doesn’t actually need to have the encryption passphrase at the time it creates the volume, since it only encrypts its data, not metadata. libvirt still requires you set a secret in the XML at time of creation though, because you never know when qcow may change its requirements, or when we might use a format that does require the passphrase at time of creation. Oh and if you are cloning an existing volume, you would actually need the passphrase straight away to copy the data.
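If you want reassurance that the volume really was created with encryption enabled, qemu-img will report it (the exact output format and the size figures will vary with your QEMU version):

# qemu-img info /home/berrange/VirtualMachines/demo.qcow2
image: /home/berrange/VirtualMachines/demo.qcow2
file format: qcow2
virtual size: 5.0G (5368709120 bytes)
disk size: 140K
encrypted: yes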
Virtual Machines
Now that we have created an encrypted QCow2 disk, it would be nice to use that disk with a virtual machine. In this example I’ve downloaded the PXE boot initrd.img and vmlinuz files for Fedora 12 and intend to use them to create a brand new virtual machine. So my guest XML configuration will be set up to use a kernel+initrd to boot, and have a single disk, the encrypted qcow file we just created. The key new feature in the XML format introduced here is again the “encryption” element within the “disk” element description. This is used to indicate what encryption method is used for this disk (again it is the ‘qcow’ method), and then associates the qcow decryption ‘passphrase’ with the secret we defined earlier:
# cat demo-guest.xml
<domain type='qemu'>
<name>demo</name>
<memory>500200</memory>
<vcpu>4</vcpu>
<os>
<type arch='i686' machine='pc'>hvm</type>
<kernel>/home/berrange/vmlinuz-PAE</kernel>
<initrd>/home/berrange/initrd-PAE.img</initrd>
<boot dev='hd'/>
</os>
<devices>
<emulator>/usr/bin/qemu-kvm</emulator>
<disk type='file' device='disk'>
<driver name='qemu' type='qcow2'/>
<source file='/home/berrange/VirtualMachines/demo.qcow2'/>
<target dev='hda' bus='ide'/>
<encryption format='qcow'>
<secret type='passphrase' uuid='0a81f5b2-8403-7b23-c8d6-21ccc2f80d6f'/>
</encryption>
</disk>
<input type='tablet' bus='usb'/>
<input type='mouse' bus='ps2'/>
<graphics type='vnc' port='-1' autoport='yes'/>
</devices>
</domain>
With that XML config written, it is a simple matter to define a new guest, and then start it
# virsh define demo-guest.xml
Domain demo defined from demo-guest.xml
# virsh start demo
Domain demo started
# virt-viewer demo
If everything has gone to plan up to this point, the guest will boot off the kernel/initrd, hopefully taking you into anaconda. Everything written to the guest disk will now be encrypted using the secrets defined.
Future work
You’ll have noticed that all these examples are using the low level virsh command. Great if you are the king of shell scripting, not so great if you want something friendly to use. So of course this new encryption functionality needs to be integrated into virt-install and virt-manager. They should both allow you to say that you want an encrypted guest, prompt you for a passphrase and then set up all the secrets automatically from there.
The libvirtd daemon has the ability to store secrets and their values persistently on disk, but this is not really secure, since the secrets are stored in unencrypted base64 format! Clearly the next step here is for libvirtd to at the very least have the option of using gpg to encrypt the base64 files. The problem is that this then introduces a bootstrapping problem – what key does libvirt use for gpg? This is a familiar problem to anyone who’s ever had to set up apache with SSL and wondered how to give apache the key to decrypt its SSL server key upon host startup.
As mentioned earlier on, the libvirt secrets management public API was designed to be easy to integrate with external key management services. For a desktop virtualization application like virt-manager there is an opportunity to integrate with gnome-keyring (or equivalent). When defining secrets in libvirt, virt-manager would mark them all as ephemeral so that libvirt never stored them itself. At the time of starting a guest, virt-manager would query gnome-keyring for the disk decryption keys, and pass them on to libvirt. This would ensure no one could ever run your guests unless they are able to log in & authenticate to your gnome-keyring service. A server virtualization application like oVirt could do much the same, perhaps storing keys in FreeIPA (if it had such a capability).
Being restricted to qcow2 disk formats isn’t all that nice because qcow2 isn’t the fastest virtual disk format to start off with, and adding encryption doesn’t improve matters. Many people, particularly in server virtualization environments, prefer to use LVM or raw block devices (SCSI/iSCSI). There are hacks which let you tell QEMU to write to the block device in qcow2 format, but they make me feel rather dirty. The kernel already comes with a generic block device encryption capability in the form of ‘dm-crypt’. libvirt really ought to support creation of encrypted block devices using dm-crypt.
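For reference, this is the sort of thing dm-crypt already lets you do by hand today, and that libvirt would ideally automate end-to-end (the device names here are purely hypothetical):

# cryptsetup luksFormat /dev/VolGroup00/demo
# cryptsetup luksOpen /dev/VolGroup00/demo demo-clear

and then point the guest’s disk <source dev='/dev/mapper/demo-clear'/> at the decrypted mapping.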
For a long time the libvirtd daemon was single threaded, using a poll() loop to handle and dispatch all I/O requests in a single thread. This was reasonable when each operation was guaranteed to be fast, but as libvirt has grown in scope, so has the complexity and execution time of many requests. Ultimately we had no choice but to make all the internal drivers thread-safe, and the libvirtd daemon itself multi-threaded. Each libvirt virConnectPtr connection can now be safely used from multiple threads at once. The libvirtd daemon will also process RPC calls from multiple clients concurrently, and even parallelize requests from a single client. Achieving this obviously required that we make significant use of pthread mutexes to lock data structures that can be accessed concurrently.
To try and keep the code within the realm of human understanding, we came up with a fixed 2-level locking model. At the first level, each libvirt driver has a lock protecting its global shared state. At the second level, individual managed objects each have a lock (eg, each instance of virDomainObj, virNetworkObj, virStoragePoolObj has a lock). The rules for locking can be stated pretty concisely:
- The driver lock must be held when accessing any driver shared state
- The managed object lock must be held when accessing any managed object state
- The driver lock must be held before obtaining a lock on a managed object
- The driver and managed object locks may be released in any order
- The locks may not be recursively obtained
With these rules there are three normal locking sequences for each public API entry point implemented by a driver. The first, simplest, pattern is for an API that only accesses driver state:
lock(driver)
....
....do something with 'driver'.....
....
unlock(driver)
The next pattern is for an API which has to work with a managed object. These can usually release the driver lock once the managed object has been locked:
lock(driver)
lock(object)
unlock(driver)
....
.... do something with 'object' ....
....
unlock(object)
The final pattern is for an API which has to work with both the driver and managed object state. In this case both locks must be held for the duration of the function:
lock(driver)
lock(object)
....
.... do something with 'object' and 'driver'....
....
unlock(object)
unlock(driver)
For the 0.6.0 release I updated several hundred methods in libvirt to follow these 3 locking patterns. Inevitably there would be mistakes along the way, and those reviewing the patches did indeed find some. End users also found some more when we released this. And there is a continual risk of causing accidental regressions. One approach to validating the locking rules is via functional testing, but libvirt is a very complicated and expansive codebase, making it incredibly hard to exercise all possible codepaths in a functional test.
About a year ago, Richard Jones (of OCaml Tutorial fame) did a proof of concept using CIL to analyse libvirt. His examples loaded all the libvirt code, generated control flow graphs and reported on exported functions. CIL is an enormously powerful toolset able to parse even very complicated C programs and perform complex static analysis checks. It sounded like just the tool for checking libvirt locking rules, as well as being a good excuse to learn OCaml.
Functional languages really force you to think about the concepts you wish to express & check up front, before hacking code. This is possibly what discourages a lot of people from using functional languages ;-) So after trying and failing to hack something up quickly, I stopped and considered what I needed to represent. The libvirt internal code has a pretty consistent pattern across different drivers and managed objects, so we can generalize some concepts:
- Driver data types (eg, struct qemu_driver, xenUnifiedPrivatePtr)
- Managed object data types (eg, virDomainObjPtr, virNetworkObjPtr)
- A set of methods which lock drivers (eg, qemuDriverLock, xenUnifiedLock)
- A set of methods which unlock drivers (eg qemuDriverUnlock, xenUnifiedUnlock)
- A set of methods which obtain a locked managed object (eg, virDomainObjFindByName, virDomainObjAssignDef, virNetworkObjectFindByUUID)
- A set of methods which unlock managed objects (virDomainObjUnlock)
- A set of public API entry points for each driver (the function pointers in each virDriver struct instance)
I won’t go into fine details about CIL, because it is quite a complex beast. Suffice to say, it loads C code, parses it, and builds up an in-memory data structure representing the entire codebase. To do this it needs the intermediate files from after pre-processing, eg, the ‘.i’ files generated by GCC when the ‘-save-temps’ flag is given. Once it has loaded the code into memory you can start to extract and process the interesting information.
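For libvirt, one simple way to get those ‘.i’ files is to rebuild with -save-temps added to the compiler flags, something along these lines (the pre-processed files then end up alongside the object files):

$ ./configure CFLAGS="-g -O2 -save-temps"
$ make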
The first step in libvirt analysis is to find all the public API entry points for each driver. To do this, we search over all the global variables looking for instances of the driver structs virDriver, virNetworkDriver, etc, and then process their initializers to get all the function pointers. At this point we have a list containing all the functions we wish to check.
The next step is to iterate over each function, and run a data flow analysis. CIL provides two useful modules for data flow analysis. One of them runs analysis from the entry point, forwards, to each return point. The other runs analysis from each return point, backwards to the entry point. I can’t really say which module is “best”, but for libvirt locking analysis it felt like a forward data flow analysis would work pretty well. To perform analysis, you provide an implementation of the Cil.DataFlow.ForwardsTransfer interface which has a couple of methods, of which the interesting ones are doStmt and doInstr. The first is invoked once for each C code block, the second is invoked once for each C statement. When invoked they are passed some initial state. The interface implementation looks at the code block or statement in question, decides what state has changed, and returns the new state. For lock analysis, I decided to store the following state about variables:
- The set of driver variables that are currently locked
- The set of driver variables that are currently unlocked
- The set of object variables that are currently locked
- The set of object variables that are currently unlocked
Thus, the first part of the implementation of the doInstr method was simply looking for function calls which lock or unlock a managed object or driver.
From this core state, some additional state can be derived at each C level statement, recording locking mistakes. The mistakes currently identified are
- Statements using an unlocked driver variable
- Statements using an unlocked object variable
- Statements locking an object variable while not holding a locked driver variable
- Statements locking a driver variable while holding a locked object variable
- Statements causing deadlock by fetching & locking an object while another object is already locked
The lock checker runs a forwards code flow analysis, constructing this state for each interesting function we identified earlier. When complete, it simply looks at the final state accumulated for each function. For each of the 5 types of mistake, it prints out the function name, the line number and the C code causing the mistake. In addition to those 5 mistakes, it also checks the final list of locked variables, and reports on any functions which forget to unlock a managed object, or driver, in any of their exit points.
In summary, we have been able to automatically detect and report on 7 critical locking mistakes across the entire driver codebase without needing to ever run the code. We also have a data flow analysis framework that can be further extended to detect other interesting locking problems. For example, it would be desirable to check whether one public API function calls into another, since this would cause a deadlock with non-recursive mutexes.
The final reports generated from the lock checking tool are designed to make it easy for the person reading them to figure out which locking rule has been violated. An example report from the current libvirt CVS codebase looks like this:
================================================================
Function: umlDomainStart
----------------------------------------------------------------
- Total exit points with locked vars: 0
- Total blocks with lock ordering mistakes: 1
- Driver used while unlocked on #line 1564 "uml_driver.c"
ret = umlStartVMDaemon(dom->conn, driver___0, vm);
================================================================
================================================================
Function: umlDomainGetAutostart
----------------------------------------------------------------
- Total exit points with locked vars: 1
- At exit on #line 1675
return (ret);
^^^^^^^^^
variables still locked are
| struct uml_driver * driver___0
- Total blocks with lock ordering mistakes: 0
================================================================
================================================================
Function: umlDomainSetAutostart
----------------------------------------------------------------
- Total exit points with locked vars: 0
- Total blocks with lock ordering mistakes: 4
- Driver used while unlocked on #line 1704
configFile = virDomainConfigFile(dom->conn,
(char const *)driver___0->configDir,
(char const *)(vm->def)->name);
- Driver used while unlocked on #line 1706
autostartLink = virDomainConfigFile(dom->conn,
(char const *)driver___0->autostartDir,
(char const *)(vm->def)->name);
- Driver used while unlocked on #line 1712
err = virFileMakePath((char const *)driver___0->autostartDir);
- Driver used while unlocked on #line 1713
virReportSystemErrorFull(dom->conn, 21, err, "uml_driver.c",
"umlDomainSetAutostart", 1715U,
(char const *)tmp___1, driver___0->autostartDir);
================================================================
================================================================
Function: testOpen
----------------------------------------------------------------
- Total exit points with locked vars: 0
- Total blocks with lock ordering mistakes: 3
- Driver used while unlocked on #line 663 "test.c"
tmp___20 = virAlloc((void *)(& privconn->domainEventCallbacks),
sizeof(*(privconn->domainEventCallbacks)));
- Driver used while unlocked on #line 663
privconn->domainEventQueue = virDomainEventQueueNew();
- Driver used while unlocked on #line 670
privconn->domainEventTimer = virEventAddTimeout(-1, & testDomainEventFlush,
(void *)privconn,
(void (*)(void *opaque ))((void *)0));
================================================================
================================================================
Function: qemudClose
----------------------------------------------------------------
- Total exit points with locked vars: 0
- Total blocks with lock ordering mistakes: 1
- Driver used while unlocked on #line 1815 "qemu_driver.c"
virDomainEventCallbackListRemoveConn(conn, driver___0->domainEventCallbacks);
================================================================
================================================================
Function: qemudDomainSuspend
----------------------------------------------------------------
- Total exit points with locked vars: 0
- Total blocks with lock ordering mistakes: 1
- Driver used while unlocked on #line 2259
tmp___4 = qemudSaveDomainStatus(dom->conn, driver___0, vm);
================================================================
================================================================
Function: qemudDomainResume
----------------------------------------------------------------
- Total exit points with locked vars: 0
- Total blocks with lock ordering mistakes: 1
- Driver used while unlocked on #line 2312
tmp___4 = qemudSaveDomainStatus(dom->conn, driver___0, vm);
================================================================
================================================================
Function: qemudDomainGetSecurityLabel
----------------------------------------------------------------
- Total exit points with locked vars: 0
- Total blocks with lock ordering mistakes: 1
- Driver used while unlocked on #line 3189
tmp___2 = (*((driver___0->securityDriver)->domainGetSecurityLabel))(dom->conn,
vm, seclabel);
================================================================
================================================================
Function: qemudDomainStart
----------------------------------------------------------------
- Total exit points with locked vars: 0
- Total blocks with lock ordering mistakes: 1
- Driver used while unlocked on #line 3630
ret = qemudStartVMDaemon(dom->conn, driver___0, vm, (char const *)((void *)0),
-1);
================================================================
================================================================
Function: qemudDomainAttachDevice
----------------------------------------------------------------
- Total exit points with locked vars: 0
- Total blocks with lock ordering mistakes: 1
- Driver used while unlocked on #line 4192
tmp___8 = qemudSaveDomainStatus(dom->conn, driver___0, vm);
================================================================
================================================================
Function: qemudDomainDetachDevice
----------------------------------------------------------------
- Total exit points with locked vars: 0
- Total blocks with lock ordering mistakes: 1
- Driver used while unlocked on #line 4316
tmp___3 = qemudSaveDomainStatus(dom->conn, driver___0, vm);
================================================================
================================================================
Function: qemudDomainSetAutostart
----------------------------------------------------------------
- Total exit points with locked vars: 0
- Total blocks with lock ordering mistakes: 4
- Driver used while unlocked on #line 4381
configFile = virDomainConfigFile(dom->conn,
(char const *)driver___0->configDir,
(char const *)(vm->def)->name);
- Driver used while unlocked on #line 4383
autostartLink = virDomainConfigFile(dom->conn,
(char const *)driver___0->autostartDir,
(char const *)(vm->def)->name);
- Driver used while unlocked on #line 4389
err = virFileMakePath((char const *)driver___0->autostartDir);
- Driver used while unlocked on #line 4390
virReportSystemErrorFull(dom->conn, 10, err, "qemu_driver.c",
"qemudDomainSetAutostart", 4392U,
(char const *)tmp___1, driver___0->autostartDir);
================================================================
================================================================
Function: qemudDomainMigratePrepare2
----------------------------------------------------------------
- Total exit points with locked vars: 0
- Total blocks with lock ordering mistakes: 1
- Object fetched while locked objects exist #line 4990
vm = virDomainAssignDef(dconn, & driver___0->domains, def);
================================================================
================================================================
Function: storagePoolRefresh
----------------------------------------------------------------
- Total exit points with locked vars: 0
- Total blocks with lock ordering mistakes: 1
- Driver used while unlocked on #line 827 "storage_driver.c"
virStoragePoolObjRemove(& driver___0->pools, pool);
================================================================
================================================================
Function: storagePoolSetAutostart
----------------------------------------------------------------
- Total exit points with locked vars: 0
- Total blocks with lock ordering mistakes: 2
- Driver used while unlocked on #line 962
err = virFileMakePath((char const *)driver___0->autostartDir);
- Driver used while unlocked on #line 963
virReportSystemErrorFull(obj->conn, 18, err, "storage_driver.c",
"storagePoolSetAutostart", 965U,
(char const *)tmp___1, driver___0->autostartDir);
================================================================
================================================================
Function: storageVolumeCreateXML
----------------------------------------------------------------
- Total exit points with locked vars: 0
- Total blocks with lock ordering mistakes: 1
- Object locked while driver is unlocked on #line 1277
virStoragePoolObjLock(pool);
================================================================
================================================================
Function: networkStart
----------------------------------------------------------------
- Total exit points with locked vars: 0
- Total blocks with lock ordering mistakes: 1
- Driver used while unlocked on #line 1228 "network_driver.c"
ret = networkStartNetworkDaemon(net->conn, driver___0, network);
================================================================
================================================================
Function: networkDestroy
----------------------------------------------------------------
- Total exit points with locked vars: 0
- Total blocks with lock ordering mistakes: 1
- Driver used while unlocked on #line 1251
ret = networkShutdownNetworkDaemon(net->conn, driver___0, network);
================================================================
================================================================
Function: networkSetAutostart
----------------------------------------------------------------
- Total exit points with locked vars: 0
- Total blocks with lock ordering mistakes: 4
- Driver used while unlocked on #line 1363
configFile = virNetworkConfigFile(net->conn,
(char const *)driver___0->networkConfigDir,
(char const *)(network->def)->name);
- Driver used while unlocked on #line 1365
autostartLink = virNetworkConfigFile(net->conn,
(char const *)driver___0->networkAutostartDir,
(char const *)(network->def)->name);
- Driver used while unlocked on #line 1371
err = virFileMakePath((char const *)driver___0->networkAutostartDir);
- Driver used while unlocked on #line 1372
virReportSystemErrorFull(net->conn, 19, *tmp___1, "network_driver.c",
"networkSetAutostart", 1374U, (char const *)tmp___0,
driver___0->networkAutostartDir);
================================================================
In summary, CIL is an incredibly powerful tool, and it is well worth learning OCaml in order to use the CIL framework for doing static analysis of C code. The number of bugs it has allowed us to identify and fix more than compensates for the effort of developing this test harness.
It has been a while since I reported on libvirt development news, but that doesn’t mean we’ve been idle. The big news is the introduction of another new hypervisor driver in libvirt, this time for User Mode Linux. While Xen / KVM get all the press these days, UML has been quietly providing virtualization for Linux users for many years – until very recently nearly all Linux virtual server providers were deploying User Mode Linux guests. libvirt aims to be the universal management API for all virtualization technologies, and UML has no formal API of its own, so it is only natural that we provide a UML driver in libvirt. It is still at a fairly basic level of functionality, only supporting disks & paravirt consoles, but it is enough to get a guest booted & to interact with it locally. The next step is adding networking support, at which point it’ll be genuinely useful. To recap, libvirt now has drivers for Xen, QEMU, KVM, OpenVZ, LXC (LinuX native Containers) and UML, as well as a test driver & RPC support.
In other news, a couple of developers at VirtualIron have recently contributed some major new features to libvirt. The first set of APIs provides the ability to register for lifecycle events against domains, allowing an application to be notified whenever a domain stops, starts, migrates, etc, rather than having to continually poll for status changes. This is implemented for KVM and Xen so far. The second huge set of APIs provides a way to query a host for details of all the hardware devices it has. This is a key building block to allow remote management tools to assign PCI/USB devices directly to guest VMs, and to more intelligently configure networking and storage. Think of it as a remotely accessible version of HAL. In fact, we use HAL as one of the backend implementations of the API, or as an alternative, the new DeviceKit service.