Controlling guest CPU & NUMA affinity in libvirt with QEMU, KVM & Xen
When provisioning new guests with libvirt, the standard policy for affinity between the guest and host CPUs / NUMA nodes is to have no policy at all. In other words the guest will follow whatever the hypervisor's own default policy is, which is usually to run the guest on whatever host CPU happens to be available. There are times when an explicit policy is better; in particular, to make the most of a NUMA architecture it is usually desirable to lock a guest to a particular NUMA node so that its memory allocations are always local to the node it is running on, avoiding cross-node memory transfers, which have lower bandwidth. As of this writing, libvirt supports this capability for QEMU, KVM and Xen guests. Even on a non-NUMA system some form of explicit placement across the host's sockets, cores & hyperthreads may be desired.
Querying host CPU / NUMA topology
The first step in deciding what policy to apply is to figure out what the host's topology actually is. The virsh nodeinfo command provides information about how many sockets, cores & hyperthreads there are on a host.
# virsh nodeinfo
CPU model:           x86_64
CPU(s):              8
CPU frequency:       1000 MHz
CPU socket(s):       2
Core(s) per socket:  4
Thread(s) per core:  1
NUMA cell(s):        1
Memory size:         8179176 kB
There are a total of 8 CPUs, in 2 sockets, each with 4 cores.
More interesting though is the NUMA topology. This can be significantly more complex, so the data is provided in a structured XML document, as part of the virsh capabilities output:
# virsh capabilities
<capabilities>
  <host>
    <cpu>
      <arch>x86_64</arch>
    </cpu>
    <migration_features>
      <live/>
      <uri_transports>
        <uri_transport>tcp</uri_transport>
      </uri_transports>
    </migration_features>
    <topology>
      <cells num='2'>
        <cell id='0'>
          <cpus num='4'>
            <cpu id='0'/>
            <cpu id='1'/>
            <cpu id='2'/>
            <cpu id='3'/>
          </cpus>
        </cell>
        <cell id='1'>
          <cpus num='4'>
            <cpu id='4'/>
            <cpu id='5'/>
            <cpu id='6'/>
            <cpu id='7'/>
          </cpus>
        </cell>
      </cells>
    </topology>
    <secmodel>
      <model>selinux</model>
      <doi>0</doi>
    </secmodel>
  </host>
  ...removed remaining XML...
</capabilities>
This tells us that there are two NUMA nodes (aka cells), each containing 4 logical CPUs. Since we know there are two sockets, we can infer that each socket is in a separate node, not that this really matters for what we need later. If we intend to run a guest with 4 virtual CPUs, we can see that it will be desirable to lock the guest to physical CPUs 0-3, or 4-7, to avoid non-local memory accesses. If our guest workload requires 8 virtual CPUs, then since each NUMA node only has 4 physical CPUs, better utilization may be obtained by running a pair of 4 vCPU guests and splitting the work between them, rather than using a single 8 vCPU guest.
Deciding which NUMA node to run the guest on
Locking a guest to a particular NUMA node is rather pointless if that node does not have sufficient free memory to satisfy the guest's local memory allocations. Indeed, it would be very detrimental to utilization. The next step is therefore to ask libvirt how much free memory each node has, using the virsh freecell command:
# virsh freecell 0
0: 2203620 kB

# virsh freecell 1
1: 3354784 kB
If our guest needs to have 3 GB of RAM allocated, then clearly it needs to be run on NUMA node (cell) 1, rather than node 0, since the latter only has 2.2 GB available.
Locking the guest to a NUMA node or physical CPU set
We have now decided to run the guest on NUMA node 1, and referring back to the capabilities data about NUMA topology, we see this node has physical CPUs 4-7. When creating the guest XML we can now specify this as the CPU mask for the guest. Where the guest virtual CPU count is specified
<vcpus>4</vcpus>
we can now add the mask
<vcpus cpuset='4-7'>4</vcpus>
As mentioned earlier, this works for QEMU, KVM and Xen guests. In the QEMU/KVM case, libvirt will use the sched_setaffinity call at guest startup, while in the Xen case libvirt will instruct XenD to make an equivalent hypercall.
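To give a feel for the underlying mechanism, here is a minimal standalone C sketch of the kind of sched_setaffinity call involved, restricting the calling process to host CPUs 4-7. This is just an illustration, not libvirt's actual code:

/* Illustration only: pin the calling process to host CPUs 4-7,
 * similar in spirit to what libvirt does for a QEMU/KVM guest
 * process at startup. Not libvirt's actual code. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    cpu_set_t mask;

    CPU_ZERO(&mask);
    for (int cpu = 4; cpu <= 7; cpu++)
        CPU_SET(cpu, &mask);

    /* pid 0 means "the calling process" */
    if (sched_setaffinity(0, sizeof(mask), &mask) < 0) {
        perror("sched_setaffinity");
        exit(EXIT_FAILURE);
    }

    printf("Now restricted to host CPUs 4-7\n");
    return 0;
}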
Automatic placement using virt-install
This walkthrough illustrated the concepts in terms of virsh commands. If writing a management application using libvirt, you would of course use the equivalent APIs for looking up this data: virNodeGetInfo, virConnectGetCapabilities and virNodeGetCellsFreeMemory.
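As a rough sketch (not drawn from any real management application, with error handling mostly omitted), the C equivalents of the earlier virsh queries might look something like this; build with gcc and -lvirt:

/* Hypothetical sketch of the libvirt C API calls behind the virsh
 * queries used above: virNodeGetInfo for the CPU topology,
 * virConnectGetCapabilities for the NUMA cell / CPU mapping XML,
 * and virNodeGetCellsFreeMemory for per-cell free memory. */
#include <stdio.h>
#include <stdlib.h>
#include <libvirt/libvirt.h>

int main(void)
{
    virConnectPtr conn;
    virNodeInfo info;
    char *caps;
    unsigned long long freemem[8];
    int ncells;

    conn = virConnectOpenReadOnly(NULL);
    if (conn == NULL)
        return EXIT_FAILURE;

    /* Equivalent of "virsh nodeinfo" */
    if (virNodeGetInfo(conn, &info) == 0)
        printf("CPUs: %u, sockets: %u, cores/socket: %u, threads/core: %u, NUMA cells: %u\n",
               info.cpus, info.sockets, info.cores, info.threads, info.nodes);

    /* Equivalent of "virsh capabilities"; the <topology> element of
     * this XML holds the cell -> CPU id mapping shown earlier */
    caps = virConnectGetCapabilities(conn);
    if (caps) {
        printf("%s\n", caps);
        free(caps);
    }

    /* Equivalent of "virsh freecell"; ask for up to 8 cells, the call
     * returns how many were actually filled in. The API reports bytes,
     * virsh displays kB */
    ncells = virNodeGetCellsFreeMemory(conn, freemem, 0, 8);
    for (int i = 0; i < ncells; i++)
        printf("Cell %d: %llu kB free\n", i, freemem[i] / 1024);

    virConnectClose(conn);
    return EXIT_SUCCESS;
}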
The virt-install provisioning tool has done exactly this and provides a simple way to automatically apply a 'best fit' NUMA policy when installing guests. Quoting its manual page:
--cpuset=CPUSET
    Set which physical cpus the guest can use. "CPUSET" is a comma
    separated list of numbers, which can also be specified in ranges.
    Example:
        0,2,3,5     : Use processors 0,2,3 and 5
        1-3,5,6-8   : Use processors 1,2,3,5,6,7 and 8
    If the value 'auto' is passed, virt-install attempts to
    automatically determine an optimal cpu pinning using NUMA data,
    if available.
So if you have a NUMA machine and use virt-install, simply always add --cpuset=auto whenever provisioning a new guest.
Fine tuning CPU affinity at runtime
The scheme outlined above is focused on the initial guest placement at boot time. There may be times when it becomes necessary to fine-tune the CPU affinity at runtime. libvirt/virsh can cope with this need too, via the vcpuinfo and vcpupin commands. First, the virsh vcpuinfo command gives you the latest data about where each virtual CPU is running. In this example, rhel5xen is a guest on a Fedora KVM host which I used for RHEL5 Xen package maintenance work. It has 4 virtual CPUs and is currently allowed to run on any host CPU:
# virsh vcpuinfo rhel5xen
VCPU:           0
CPU:            3
State:          running
CPU time:       0.5s
CPU Affinity:   yyyyyyyy

VCPU:           1
CPU:            1
State:          running
CPU Affinity:   yyyyyyyy

VCPU:           2
CPU:            1
State:          running
CPU Affinity:   yyyyyyyy

VCPU:           3
CPU:            2
State:          running
CPU Affinity:   yyyyyyyy
Now let's say I want to lock each of these virtual CPUs to a separate host CPU in the second NUMA node.
# virsh vcpupin rhel5xen 0 4
# virsh vcpupin rhel5xen 1 5
# virsh vcpupin rhel5xen 2 6
# virsh vcpupin rhel5xen 3 7
The vcpuinfo command can be used again to confirm the placement:
# virsh vcpuinfo rhel5xen
VCPU:           0
CPU:            4
State:          running
CPU time:       32.2s
CPU Affinity:   ----y---

VCPU:           1
CPU:            5
State:          running
CPU time:       16.9s
CPU Affinity:   -----y--

VCPU:           2
CPU:            6
State:          running
CPU time:       11.9s
CPU Affinity:   ------y-

VCPU:           3
CPU:            7
State:          running
CPU time:       14.6s
CPU Affinity:   -------y
And just to prove I'm not faking it all, here's the KVM process running on the host and its /proc status:
# grep pid /var/run/libvirt/qemu/rhel5xen.xml
<domstatus state='running' pid='4907'>

# grep Cpus_allowed_list /proc/4907/task/*/status
/proc/4907/task/4916/status:Cpus_allowed_list:  4
/proc/4907/task/4917/status:Cpus_allowed_list:  5
/proc/4907/task/4918/status:Cpus_allowed_list:  6
/proc/4907/task/4919/status:Cpus_allowed_list:  7
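The same runtime pinning is available to management applications through the virDomainPinVcpu API (with virDomainGetVcpus to read back the current placement). As a minimal, hypothetical sketch, the equivalent of the virsh vcpupin commands above for the rhel5xen guest might be:

/* Rough sketch of runtime vCPU pinning with the libvirt C API,
 * mirroring the "virsh vcpupin rhel5xen ..." commands above. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <libvirt/libvirt.h>

int main(void)
{
    virConnectPtr conn = virConnectOpen(NULL);
    virDomainPtr dom;
    virNodeInfo nodeinfo;
    unsigned char *cpumap;
    int maplen;

    if (conn == NULL)
        return EXIT_FAILURE;

    dom = virDomainLookupByName(conn, "rhel5xen");
    if (dom == NULL || virNodeGetInfo(conn, &nodeinfo) < 0)
        return EXIT_FAILURE;

    /* one bit per host CPU, rounded up to whole bytes */
    maplen = VIR_CPU_MAPLEN(VIR_NODEINFO_MAXCPUS(nodeinfo));
    cpumap = calloc(maplen, 1);

    /* pin vCPU 0 -> pCPU 4, vCPU 1 -> pCPU 5, and so on */
    for (unsigned int vcpu = 0; vcpu < 4; vcpu++) {
        memset(cpumap, 0, maplen);
        VIR_USE_CPU(cpumap, vcpu + 4);
        if (virDomainPinVcpu(dom, vcpu, cpumap, maplen) < 0)
            fprintf(stderr, "failed to pin vCPU %u\n", vcpu);
    }

    free(cpumap);
    virDomainFree(dom);
    virConnectClose(conn);
    return EXIT_SUCCESS;
}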
Future work
The approach outlined above relies on the fact that the kernel will always try to allocate memory from the NUMA node matching the one the guest's CPUs are executing on. While this is sufficient in the simple case, there are some pitfalls along the way. Between the time the guest is started and its memory is allocated, RAM from the NUMA node in question may have been used up, causing the OS to fall back to allocating from another node. For this reason, if placing guests on NUMA nodes, it is crucial that all guests running on the host have fixed placement, with none allowed to float free. In some weird and wonderful NUMA topologies (hello Itanium!) there can be NUMA nodes which have only CPUs, and/or only RAM. To cope with these it will be necessary to extend libvirt to allow an explicit memory allocation node to be listed in the guest configuration.