Using CGroups with libvirt and LXC/KVM guests in Fedora 12
In my previous post I discussed the new disk encryption capabilities available with libvirt in Fedora 12. That was not the only new feature we quietly introduced without telling anyone. The second was integration between libvirt and the kernel's CGroups functionality. In fact we have had this capability for a while, but prior to this point it was only used by our LXC driver. It is now also available for use with the QEMU driver.
What are CGroups?
You might be wondering at this point what CGroups actually are? At a high level, they are a generic mechanism the kernel provides for grouping processes and applying controls to those groups. The grouping is done via a virtual filesystem called “cgroup”. Within this filesystem, each directory defines a new group. Thus groups can be arranged into an arbitrarily nested hierarchy simply by creating new sub-directories.
Tunables within a cgroup are provided by what the kernel calls ‘controllers’, with each controller exposing one or more tunables or controls. When mounting the cgroups filesystem it is possible to indicate which controllers are to be activated. This makes it possible to mount the filesystem several times, with each mount point having a different set of (non-overlapping) controllers. Why might separate mount points be useful? The key idea is that this allows the administrator to construct differing group hierarchies for different sets of controllers/tunables.
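To make that concrete, here is a minimal sketch of doing it by hand (the mount point and group name are arbitrary examples, not anything libvirt requires): mount the ‘cpu’ and ‘cpuacct’ controllers together on one directory, create a group and move the current shell into it.

# mkdir -p /dev/cgroups/cpu
# mount -t cgroup -o cpu,cpuacct cgroup /dev/cgroups/cpu
# mkdir /dev/cgroups/cpu/example
# echo $$ > /dev/cgroups/cpu/example/tasks

Every directory automatically gains the controller's tunable files, and writing a PID into a group's ‘tasks’ file moves that process into the group.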
The controllers currently provided by the kernel include:

memory: Memory controller - allows setting limits on RAM and swap usage, and querying cumulative usage of all processes in the group
cpuset: CPU set controller - binds processes within a group to a set of CPUs and controls migration between CPUs
cpuacct: CPU accounting controller - provides information about CPU usage for a group of processes
cpu: CPU scheduler controller - controls the prioritization of processes in the group. Think of it as a more advanced nice level
devices: Devices controller - access control lists on character and block devices
freezer: Freezer controller - pause and resume execution of processes in the group. Think of it as SIGSTOP for the whole group
net_cls: Network class controller - controls network utilization by associating processes with a ‘tc’ network class
This isn’t the blog post to go into fine details about each of these controllers and their capabilities; the high level overview will do. Suffice to say that at this time, the libvirt LXC driver (container based virtualization) will use all of these controllers except for net_cls and cpuset, while the libvirt QEMU driver will only use the cpu and devices controllers.
Activating CGroups on a Fedora 12 system
CGroups are a system-wide resource and libvirt doesn’t presume that it can dictate how CGroup controllers are mounted, nor in what hierarchy they are arranged. It leaves mount point and directory setup entirely to the administrator's discretion. Unfortunately though, it is not just a matter of adding some mount points to /etc/fstab. It is also necessary to set up the directory hierarchy and decide how processes get placed within it. Fortunately the libcg project provides an init service and a set of tools to assist in host configuration. On Fedora 12 this is packaged in the libcgroup RPM. If you don’t have that installed, install it now!
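On Fedora 12 that is just:

# yum install libcgroup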
The primary configuration file for CGroups is /etc/cgconfig.conf. There are two interesting things to configure here. Step 1 is declaring what controllers are mounted where. Should you wish to keep things very simple, it is possible to mount many controllers in one single location with a snippet that looks like
mount {
    cpu = /dev/cgroups;
    cpuacct = /dev/cgroups;
    memory = /dev/cgroups;
    devices = /dev/cgroups;
}
This will allow a hierarchy of process cgroups rooted at /dev/cgroups. If you get more advanced though, you might wish to have one hierarchy just for CPU scheduler controls, another for device ACLs, and a third for memory management. That could be accomplished using a configuration that looks like this
mount {
    cpu = /dev/cgroups/cpu;
    cpuacct = /dev/cgroups/cpu;
    memory = /dev/cgroups/memory;
    devices = /dev/cgroups/devices;
}
Going with this second example, save the /etc/cgconfig.conf file with these mount rules, and then activate the configuration by running
# service cgconfig start
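You will likely also want the mounts recreated on every boot:

# chkconfig cgconfig on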
Looking in /dev/cgroups, there should now be a number of top level directories.
# ls /dev/cgroups/
cpu  devices  memory
# ls /dev/cgroups/cpu
cpuacct.stat          cpu.rt_period_us   notify_on_release  tasks
cpuacct.usage         cpu.rt_runtime_us  release_agent
cpuacct.usage_percpu  cpu.shares         sysdefault
# ls /dev/cgroups/memory/
memory.failcnt                   memory.stat
memory.force_empty               memory.swappiness
memory.limit_in_bytes            memory.usage_in_bytes
memory.max_usage_in_bytes        memory.use_hierarchy
memory.memsw.failcnt             notify_on_release
memory.memsw.limit_in_bytes      release_agent
memory.memsw.max_usage_in_bytes  sysdefault
memory.memsw.usage_in_bytes      tasks
# ls /dev/cgroups/devices/
devices.allow  devices.list       release_agent  tasks
devices.deny   notify_on_release  sysdefault
Now that the basic cgroups controllers are mounted in the locations we want, there is a choice of how to proceed. If libvirtd is started at this point, it will end up in the sysdefault group seen here. This may be satisfactory for some people, in which case they can skip right ahead to the later notes on how KVM virtual machines use cgroups. Other people may want to move the libvirtd daemon (and thus everything it runs) into a separate cgroup first.
Placing the libvirtd daemon in a dedicated group
Let's say we wish to place an overall limit on the amount of memory that can be used by the libvirtd daemon and all the guests it launches. For this it will be necessary to define a new group, and specify a limit using the memory controller. Back in the /etc/cgconfig.conf configuration file, this can be achieved using the ‘group’ statement:
group virt {
    memory {
        memory.limit_in_bytes = 7500M;
    }
}
This says that no matter what all the processes in this group do, their combined RAM usage will never be allowed above 7.5 GB. Any usage above this limit will cause data to be pushed out to swap. Newer kernels even let you control how much swap can be used before the OOM killer comes out of hiding. This particular example is chosen to show how cgroups can be used to protect the virtualization host against accidental overcommit. For example, on this server with 8 GB of RAM, no matter how crazy and out of control the virtual machines get, I have reasonable reassurance that I'll always be able to get to an SSH / console login prompt, because we've left a guaranteed 500 MB for other cgroups (i.e. the rest of the system) to use.
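With the example mount layout above, once the cgconfig service has been restarted (shown a little later) the new group and its limit become visible straight from the filesystem. Assuming the ‘M’ suffix is interpreted as mebibytes, the limit should read back as something like:

# cat /dev/cgroups/memory/virt/memory.limit_in_bytes
7864320000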
Now that the desired custom group has been defined, it is necessary to actually tell the system that the libvirtd daemon (or any other interesting daemon) should be placed in this virt group. If the daemon in question has a well designed initscript, it will be using the shared functions from /etc/init.d/functions, in particular the ‘daemon’ function. libvirtd is of course well designed :-P Placing libvirtd into a cgroup requires adding one line to its configuration file /etc/sysconfig/libvirtd.
CGROUP_DAEMON="memory:/virt"
If we wanted to place it in several cgroups, those would be listed in that same parameter, using a space to separate each. At this point a (re-)start of the cgconfig and libvirtd services will complete the host setup. There is a magic /proc file which can show you at a glance what cgroup any process is living in:
# service cgconfig restart
# service libvirtd restart
# PID=`pgrep libvirtd`
# cat /proc/$PID/cgroup
32:devices:/sysdefault
16:memory:/virt
12:cpuacct,cpu:/sysdefault
Our configuration didn't say anything about the devices or cpuacct controllers, even though we'd asked for them to be mounted. Thus libvirtd got placed in the sysdefault group for those controllers.
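As an aside, for a daemon whose initscript does not use the shared ‘daemon’ function, the libcgroup tools can move an already-running process into a group by PID. A rough sketch, reusing libvirtd purely as the example process:

# cgclassify -g memory:/virt `pgrep libvirtd`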
Controlling KVM guests
The libvirtd daemon has drivers for many virtualization technologies and, at the time of writing, its LXC and QEMU drivers integrate with CGroups. For maximum flexibility of administration, libvirt will create its own mini hierarchy of groups into which guests will be placed. This mini hierarchy will be rooted at whatever location the libvirtd daemon starts in.
$ROOT
 |
 +- libvirt    (all virtual machines/containers run by libvirtd)
     |
     +- lxc    (all LXC containers run by libvirtd)
     |   |
     |   +- guest1    (LXC container called 'guest1')
     |   +- guest2    (LXC container called 'guest2')
     |   +- guest3    (LXC container called 'guest3')
     |   +- ...       (LXC container called ...)
     |
     +- qemu   (all QEMU/KVM machines run by libvirtd)
         |
         +- guest1    (QEMU machine called 'guest1')
         +- guest2    (QEMU machine called 'guest2')
         +- guest3    (QEMU machine called 'guest3')
         +- ...       (QEMU machine called ...)
Remember that we actually configured 3 mount points: /dev/cgroups/cpu, /dev/cgroups/devices and /dev/cgroups/memory. libvirt will detect whatever you mounted, and replicate its mini hierarchy at the appropriate place in each. So in the above example $ROOT will expand to 3 locations:
/dev/cgroups/cpu/sysdefault
/dev/cgroups/memory/virt
/dev/cgroups/devices/sysdefault
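So once a guest is up and running you can expect to find its private group a couple of levels below each of those roots. For a hypothetical KVM guest named ‘demo’, that would be:

/dev/cgroups/cpu/sysdefault/libvirt/qemu/demo
/dev/cgroups/memory/virt/libvirt/qemu/demo
/dev/cgroups/devices/sysdefault/libvirt/qemu/demo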
As an administrator you should not need to concern yourself with directly accessing the tunables within libvirt's mini cgroups hierarchy. libvirt will add/remove the child groups for each guest when the guest is booted/destroyed respectively. The libvirt API, and/or the virsh command line tool, provide mechanisms to set tunables on guests. For the QEMU and LXC drivers in libvirt, the virsh schedinfo command provides access to CPU scheduler prioritization for a guest:
# virsh schedinfo demo
Scheduler      : posix
cpu_shares     : 1024
The “cpu_shares” value is a relative prioritization. All guests start out with a cpu_shares of 1024. If you halve a guest's “cpu_shares” value, it will get half the CPU time compared to other guests. This is applied to the guest as a whole, regardless of how many virtual CPUs it has. This last point is an important benefit over simple ‘nice’ levels, which operate per-thread: with those it is very hard to set relative prioritization between guests unless they all have exactly the same number of virtual CPUs. The cpu_shares tunable can be set with the same virsh command
# virsh schedinfo --set cpu_shares=500 demo
Scheduler      : posix
cpu_shares     : 500
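Out of curiosity, the same value can be read back directly from the guest's cgroup directory, assuming the mount layout used earlier and our hypothetical ‘demo’ guest:

# cat /dev/cgroups/cpu/sysdefault/libvirt/qemu/demo/cpu.shares
500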
For LXC containers, the configured guest memory limit is implemented via the ‘memory’ controller, and container CPU time accounting is done with the ‘cpuacct’ controller.
If the “devices” controller is mounted, libvirt will attempt to use it to control which devices a guest has access to. It will set up the ACLs so that all guests can access things like /dev/null, /dev/random, etc, but deny all access to block devices except for those explicitly configured in their XML. This is done for both QEMU and LXC guests.
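If you want to see the resulting ACL, each guest's group has a devices.list file whose entries take the form ‘<type> <major>:<minor> <perms>’ (c 1:3, for instance, is /dev/null). Continuing with the hypothetical ‘demo’ guest from earlier:

# cat /dev/cgroups/devices/sysdefault/libvirt/qemu/demo/devices.list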
Future work
There is quite a lot more that libvirt could do to take advantage of CGroups. We already have three feature requests that can be satisfied with the use of CGroups:
- Enforcement of guest memory usage: QEMU guests are supplied with the VirtIO balloon device. This allows the host OS to request that the guest OS release memory it was previously allocated back to the host, letting administrators dynamically adjust the memory allocation of guests on the fly. The obvious downside is that it relies on co-operation of the guest OS; if the guest ignores the request to release memory there is nothing the host can do. Well, almost nothing. The cgroups memory controller allows a limit on both RAM and swap usage to be set. Since each guest is placed in its own cgroup, we can use this control to enforce the balloon request, by updating the cgroup memory controller limit whenever changing the balloon limit. If the guest ignores the balloon request, then the cgroups controller will force part of the guest's RAM allocation out to swap. This gives us both a carrot and a stick.
- Network I/O policy on guests: The cgroups net_cls controller allows a cgroup to be associated with a ‘tc’ traffic control class (see the tc(8) manpage; a rough sketch of how the pairing works follows this list). One obvious example usage would involve setting a hard bandwidth cap, but there are plenty more use cases beyond that.
- Disk I/O policy on guests: A number of kernel developers have been working on a new disk I/O bandwidth control mechanism for a while now, targeting virtualization as a likely consumer. The problem is not quite as simple as it sounds though. While you might put caps on the bandwidth of guests, this ignores the impact of seek time. If all the guest I/O patterns are sequential, high-throughput ones it might work great, but a single guest doing excessive random access I/O, causing too many seeks, can easily destroy the throughput of all the others. Nonetheless, there appears to be a strong desire for disk I/O bandwidth controls, and this is almost certainly going to end up as another cgroup controller that libvirt can then take advantage of.
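For the net_cls idea, here is a very rough sketch of how the cgroup/tc pairing works in general (the interface name, mount point and class IDs are all illustrative, and this is not something libvirt does today). The classid written into the cgroup has the form 0xAAAABBBB, which tc interprets as class AAAA:BBBB:

# mkdir -p /dev/cgroups/net
# mount -t cgroup -o net_cls cgroup /dev/cgroups/net
# mkdir /dev/cgroups/net/guest1
# echo 0x00100001 > /dev/cgroups/net/guest1/net_cls.classid
# tc qdisc add dev eth0 root handle 10: htb
# tc class add dev eth0 parent 10: classid 10:1 htb rate 10mbit
# tc filter add dev eth0 parent 10: protocol ip prio 10 handle 1: cgroup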
There are probably more things that can be done with cgroups, but this is plenty to keep us busy for a while.