Provisioning KVM virtual machines on iSCSI with QNAP + virt-manager (Part 1 of 2)

Posted: May 4th, 2010 | Filed under: libvirt, Virt Tools | 10 Comments »

I recently got a rather nice QNAP NAS to replace my crufty, cobbled together home fileserver. Unpack the NAS, insert 4 SATA disks and turn it on. The first time it starts up, it prompts to set up RAID5 across the disks by default (overridable to any other RAID config or no RAID) and after a few minutes formatting the disks, it is up and running ready to do work. It exposes storage shares via all commonly used protocols such as SAMBA, NFS, FTP, SSH/SCP, WebDAV, Web UI, and most interestingly for this post, iSCSI. The QNAP marketing material is strongly pushing iSCSI as a solution for use in combination with VMWare, but here in Fedora & the open source virt world it is KVM we’re interested in. libvirt has a fairly generic set of APIs for managing storage which allows implementations for many different types of storage, including iSCSI. The virt-manager app provides a UI for nearly all of the libvirt storage capabilities, again including iSCSI. This blog posting will graphically illustrate how to deploy a new guest on Fedora 12 using virt-manager and the QNAP iSCSI service.

iSCSI Service Enablement

In the QNAP administration UI, navigate to the “Disk Management -> iSCSI” panel. Ensure the iSCSI Target Service is enabled and running on the default TCP port 3260. libvirt currently has no need for iSNS, so that can be left disabled unless you have other reasons for requiring its use.

iSCSI Service Enablement

iSCSI Targets

An iSCSI target is simply a means to group a set of LUNs (storage volumes). There are many ways to organize your targets + LUNs, but to keep things simple I’m going to keep all my KVM guest storage volumes in a single target and enable this target on all my virtualization hosts. Some people like the other extreme of one target per guest, and only enabling the guest’s target on the host currently running the guest. In this screengrab I’ve already got one iSCSI target for other experimentation and intend to use the ‘Quick Configuration Wizard’ to add a new one for my KVM guests.

iSCSI target list

iSCSI Target + LUN creation

Since this is my first guest, I need to create both the iSCSI target and a LUN. For future guests I can create the iSCSI LUN only and associate it with the previously created target.

iSCSI target/LUN creation wizard

Setting the iSCSI target name

Every iSCSI target needs a unique identifier, known as the IQN. For reasons I don’t want to understand, the typical IQN format is a rather unpleasant looking string, but fortunately the QNAP admin UI only expects you to enter a short name, which it then uses to form the full IQN. In this example I’m giving my new iSCSI target the name “kvmguests” which results in the adorable IQN “iqn.2004-04.com.qnap:ts-439proii:iscsi.kvmguests.bf6d84”. Remember this IQN, you’ll need it later in virt-manager.
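If you’re curious about the structure, an IQN is essentially a date plus the reversed domain name of the naming authority, followed by a colon-separated local identifier. Here’s a hypothetical Python sketch of roughly what the QNAP UI does when it expands a short name; the vendor prefix and the random-looking suffix are illustrative guesses, since the NAS generates its own values:

```python
import re

def build_iqn(short_name,
              vendor_prefix="iqn.2004-04.com.qnap:ts-439proii:iscsi",
              suffix="bf6d84"):
    """Expand a short target name into a full IQN, QNAP-UI style.

    The vendor prefix and the suffix here are illustrative stand-ins;
    the real NAS generates its own values.
    """
    if not re.fullmatch(r"[a-z0-9.-]+", short_name):
        raise ValueError("use lower-case letters, digits, dots and dashes")
    return f"{vendor_prefix}.{short_name}.{suffix}"

print(build_iqn("kvmguests"))
# iqn.2004-04.com.qnap:ts-439proii:iscsi.kvmguests.bf6d84
```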

iSCSI target naming

iSCSI target authentication

iSCSI includes support for an authentication scheme known as CHAP. Unfortunately libvirt does not yet support configuration of iSCSI storage using CHAP, so this has to be left disabled. This sucks if you’re using a shared/untrusted local network between your virt hosts & iSCSI server, but if you’ve got a separate network or VLAN dedicated to storage traffic this isn’t so much of a problem. It is also not a concern for my home usage.

iSCSI target authentication

iSCSI LUN allocation

Now that the iSCSI target is configured, it is time to allocate storage for the LUN. This is what will provide the guest’s virtual disk. Each LUN configured via the QNAP admin UI ends up being backed by a plain file on the NAS’ (ext4) filesystem. If you choose the “Thin Provisioning” option at this stage, the LUN’s backing file will be a sparse file with no storage allocated upfront. It will grow on demand as data is written to it. This allows you to over-commit allocation of storage on the NAS, with the expectation that should you actually reach the limit on the NAS some time later, you can simply swap in larger disks to the RAID array or attach some extra external SATA devices. “Instant allocation” meanwhile, fully allocates the LUN at time of creation so there is never a problem of running out of disk space at runtime. The LUN name is just an aide-mémoire for the admin. I name my LUNs to match the KVM guest name, so this one is called “rhel6x86_64”. The size of 10 GB will be more than enough for the basic RHEL6 install I plan.
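The thin vs. instant allocation distinction maps directly onto how the backing file is created on the NAS filesystem. This Python sketch (my guess at the mechanics, not QNAP’s actual code) shows a thin-provisioned LUN as a sparse file, whose apparent size is far larger than the blocks actually allocated:

```python
import os
import tempfile

def make_lun(path, size, thin=True):
    """Create a LUN backing file: sparse for thin provisioning,
    fully allocated otherwise. A sketch of what the NAS likely does."""
    with open(path, "wb") as f:
        if thin:
            f.truncate(size)  # sets apparent size only; no blocks allocated
        else:
            # reserve blocks up front (Unix-only; fails on some filesystems)
            os.posix_fallocate(f.fileno(), 0, size)

size = 1 << 20  # 1 MiB for the demo; a real LUN would be 10 GB
with tempfile.TemporaryDirectory() as d:
    thin_path = os.path.join(d, "thin.img")
    make_lun(thin_path, size, thin=True)
    st = os.stat(thin_path)
    # apparent size vs bytes actually on disk, e.g. 1048576 vs 0
    print(st.st_size, st.st_blocks * 512)
```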

iSCSI LUN allocation

iSCSI setup confirmation

Before actually creating the iSCSI target and LUN, the QNAP wizard allows a chance to review the configuration choices just made.

iSCSI target + LUN creation summary

iSCSI target list (updated)

After completing the wizard, the browser returns to the list of iSCSI targets. The newly created target is visible, with one LUN beneath it. The status of ‘Processing’ just means that storage for the LUN is still being allocated & goes away pretty much immediately for LUNs using ‘Thin provisioning’ since there’s no data to allocate upfront. In theory full allocation should be pretty much instantaneous on ext4 too, if QNAP is using posix_fallocate, but I’ve not had time to check that yet.

iSCSI target list (updated)

With the iSCSI target and LUN created, it is now time to provision a new KVM guest using this storage. This is the topic of Part II.

Guest CPU model configuration in libvirt with QEMU/KVM

Posted: February 15th, 2010 | Filed under: libvirt, Virt Tools | 8 Comments »

Many of the management problems in virtualization are caused by the annoyingly popular & desirable host migration feature! I previously talked about PCI device addressing problems, but this time the topic to consider is that of CPU models. Every hypervisor has its own policies for what a guest will see for its CPUs by default: Xen just passes through the host CPU, while with QEMU/KVM the guest sees a generic model called “qemu32” or “qemu64”. VMWare does something more advanced, classifying all physical CPUs into a handful of groups and has one baseline CPU model for each group that’s exposed to the guest. VMWare’s behaviour lets guests safely migrate between hosts provided they all have physical CPUs that classify into the same group. libvirt does not like to enforce policy itself, preferring just to provide the mechanism on which the higher layers define their own desired policy. CPU models are a complex subject, so it has taken longer than desirable to support their configuration in libvirt. In the 0.7.5 release that will be in Fedora 13, there is finally a comprehensive mechanism for controlling guest CPUs.

Learning about the host CPU model

If you have been following earlier articles (or otherwise know a bit about libvirt) you’ll know that the “virsh capabilities” command displays an XML document describing the capabilities of the hypervisor connection & host. It should thus come as no surprise that this XML schema has been extended to provide information about the host CPU model. One of the big challenges in describing CPU models is that every architecture has a different approach to exposing its capabilities. On x86, a modern CPU’s capabilities are exposed via the CPUID instruction. Essentially this comes down to a set of 32-bit integers with each bit given a specific meaning. Fortunately AMD & Intel agree on common semantics for these bits. VMWare and Xen both expose the notion of CPUID masks directly in their guest configuration format. Unfortunately (or fortunately depending on your POV) QEMU/KVM supports far more than just the x86 architecture, so CPUID is clearly not suitable as the canonical configuration format. QEMU ended up using a scheme which combines a CPU model name string with a set of named flags. On x86 the CPU model maps to a baseline CPUID mask, and the flags can be used to then toggle bits in the mask on or off. libvirt decided to follow this lead and use a combination of a model name and flags. Without further ado, here is an example of what libvirt reports as the capabilities of my laptop’s CPU

# virsh capabilities
<capabilities>

  <host>
    <cpu>
      <arch>i686</arch>
      <model>pentium3</model>
      <topology sockets='1' cores='2' threads='1'/>
      <feature name='lahf_lm'/>
      <feature name='lm'/>
      <feature name='xtpr'/>
      <feature name='cx16'/>
      <feature name='ssse3'/>
      <feature name='tm2'/>
      <feature name='est'/>
      <feature name='vmx'/>
      <feature name='ds_cpl'/>
      <feature name='monitor'/>
      <feature name='pni'/>
      <feature name='pbe'/>
      <feature name='tm'/>
      <feature name='ht'/>
      <feature name='ss'/>
      <feature name='sse2'/>
      <feature name='acpi'/>
      <feature name='ds'/>
      <feature name='clflush'/>
      <feature name='apic'/>
    </cpu>
    ...snip...

It is not practical to have a database listing all known CPU models, so libvirt has a small list of baseline CPU model names. It picks the one that shares the greatest number of CPUID bits with the actual host CPU and then lists the remaining bits as named features. Notice that libvirt does not tell you what features the baseline CPU contains. This might seem like a flaw at first, but as will be shown next, it is not actually necessary to know this information.
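To make the selection rule concrete, here’s a toy Python sketch of the “pick the baseline sharing the most bits” logic. The feature sets are made-up stand-ins for CPUID masks; real libvirt works on the raw CPUID data, and also disables any baseline bits the host lacks:

```python
def pick_baseline(host_features, baselines):
    """Pick the baseline model sharing the most feature bits with the host,
    then report the host's remaining bits as named features.
    Plain Python sets stand in for CPUID masks here."""
    name, model = max(baselines.items(),
                      key=lambda kv: len(kv[1] & host_features))
    extra = sorted(host_features - model)
    return name, extra

# Toy data: these feature lists are illustrative, not real CPUID contents.
baselines = {
    "pentium3": {"fpu", "mmx", "sse"},
    "qemu64":   {"fpu", "mmx", "sse", "sse2", "syscall"},
}
host = {"fpu", "mmx", "sse", "sse2", "ssse3", "vmx"}
print(pick_baseline(host, baselines))
# ('qemu64', ['ssse3', 'vmx'])
```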

Determining a compatible CPU model to suit a pool of hosts

Now that it is possible to find out what CPU capabilities a single host has, the next problem is to determine what CPU capabilities are best to expose to the guest. If it is known that the guest will never need to be migrated to another host,  the host CPU model can be passed straight through unmodified. Some lucky people might have a virtualized data center where they can guarantee all servers will have 100% identical CPUs. Again the host CPU model can be passed straight through unmodified. The interesting case though, is where there is variation in CPUs between hosts. In this case the lowest common denominator CPU must be determined. This is not entirely straightforward, so libvirt provides an API for exactly this task. Provide libvirt with a list of XML documents, each describing a host’s CPU model, and it will internally convert these to CPUID masks, calculate their intersection, finally converting the CPUID mask result back into an XML CPU description. Taking the CPU description from a random server

<capabilities>
  <host>
    <cpu>
      <arch>x86_64</arch>
      <model>phenom</model>
      <topology sockets='2' cores='4' threads='1'/>
      <feature name='osvw'/>
      <feature name='3dnowprefetch'/>
      <feature name='misalignsse'/>
      <feature name='sse4a'/>
      <feature name='abm'/>
      <feature name='cr8legacy'/>
      <feature name='extapic'/>
      <feature name='cmp_legacy'/>
      <feature name='lahf_lm'/>
      <feature name='rdtscp'/>
      <feature name='pdpe1gb'/>
      <feature name='popcnt'/>
      <feature name='cx16'/>
      <feature name='ht'/>
      <feature name='vme'/>
    </cpu>
    ...snip...

As a quick check, it is possible to ask libvirt whether this CPU description is compatible with the previous laptop CPU description, using the “virsh cpu-compare” command

$ ./tools/virsh cpu-compare cpu-server.xml
CPU described in cpu-server.xml is incompatible with host CPU

libvirt is correctly reporting the CPUs are incompatible, because there are several features in the laptop CPU that are missing in the server CPU. To be able to migrate between the laptop and the server, it will be necessary to mask out some features, but which ones? Again libvirt provides an API for this, also exposed via the “virsh cpu-baseline” command

# virsh cpu-baseline both-cpus.xml
<cpu match='exact'>
  <model>pentium3</model>
  <feature policy='require' name='lahf_lm'/>
  <feature policy='require' name='lm'/>
  <feature policy='require' name='cx16'/>
  <feature policy='require' name='monitor'/>
  <feature policy='require' name='pni'/>
  <feature policy='require' name='ht'/>
  <feature policy='require' name='sse2'/>
  <feature policy='require' name='clflush'/>
  <feature policy='require' name='apic'/>
</cpu>

libvirt has determined that in order to safely migrate a guest between the laptop and the server, it is necessary to mask out 11 features from the laptop’s XML description.
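Conceptually, the baseline computation is a set intersection over the hosts’ feature sets. This Python sketch uses trimmed, illustrative feature lists; the real “virsh cpu-baseline” output above lists more features because each baseline model name also implies a set of CPUID bits that get factored into the intersection:

```python
def cpu_baseline(feature_sets):
    """Lowest-common-denominator features across a pool of hosts.
    A simplified stand-in for virsh cpu-baseline, which works on
    full CPUID masks rather than named features alone."""
    return sorted(set.intersection(*feature_sets))

# Trimmed subsets of the two hosts' feature lists shown earlier.
laptop = {"lahf_lm", "lm", "cx16", "monitor", "pni",
          "ht", "sse2", "clflush", "apic", "vmx"}
server = {"lahf_lm", "rdtscp", "popcnt", "cx16", "ht", "vme"}
print(cpu_baseline([laptop, server]))
# ['cx16', 'ht', 'lahf_lm']
```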

Configuring the guest CPU model

To simplify life, the guest CPU configuration accepts the same basic XML representation as the host capabilities XML exposes. In other words, the XML from that “cpu-baseline” virsh command, can now be copied directly into the guest XML at the top level under the <domain> element. As the observant reader will have noticed from that last XML snippet, there are a few extra attributes available when describing a CPU in the guest XML. These can mostly be ignored, but for the curious here’s a quick description of what they do. The top level <cpu> element gets an attribute called “match” with possible values

  • match=”minimum” – the host CPU must have at least the CPU features described in the guest XML. If the host has additional features beyond the guest configuration, these will also be exposed to the guest
  • match=”exact” – the host CPU must have at least the CPU features described in the guest XML. If the host has additional features beyond the guest configuration, these will be masked out from the guest
  • match=”strict” – the host CPU must have exactly the same CPU features described in the guest XML. No more, no less.

The next enhancement is that the <feature> elements can each have an extra “policy” attribute with possible values

  • policy=”force” – expose the feature to the guest even if the host does not have it. This is kind of crazy, except in the case of software emulation.
  • policy=”require” – expose the feature to the guest and fail if the host does not have it. This is the sensible default.
  • policy=”optional” – expose the feature to the guest if the host happens to support it.
  • policy=”disable” – if the host has this feature, then hide it from the guest
  • policy=”forbid” – if the host has this feature, then fail and refuse to start the guest

The “forbid” policy is for a niche scenario where a badly behaved application will try to use a feature even if it is not in the CPUID mask, and you wish to prevent accidentally running the guest on a host with that feature. The “optional” policy has special behaviour with respect to migration. When the guest is initially started the feature is optional, but when the guest is live migrated, this policy turns into “require”, since you can’t have features disappearing across migration.
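For clarity, here’s a Python sketch of how I read the five policy values; this is my interpretation of the semantics described above, not libvirt’s actual code:

```python
def apply_policies(host_features, guest_features):
    """Resolve <feature policy=...> entries against a host feature set.
    Returns the features the guest will see, or raises if the guest
    cannot start on this host."""
    exposed = set()
    for name, policy in guest_features.items():
        has = name in host_features
        if policy == "force":
            exposed.add(name)          # exposed even without host support
        elif policy == "require":
            if not has:
                raise RuntimeError(f"host lacks required feature {name}")
            exposed.add(name)
        elif policy == "optional":
            if has:
                exposed.add(name)      # exposed only if the host has it
        elif policy == "disable":
            pass                       # hidden from the guest regardless
        elif policy == "forbid":
            if has:
                raise RuntimeError(f"host has forbidden feature {name}")
    return exposed

host = {"sse2", "vmx", "ht"}
print(sorted(apply_policies(host, {"sse2": "require",
                                   "vmx": "disable",
                                   "3dnow": "optional"})))
# ['sse2']
```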

All the stuff described in this posting is currently implemented for libvirt’s QEMU/KVM driver, with the basic code in the 0.7.5/6 releases and the final ‘cpu-baseline’ support arriving in 0.7.7. Needless to say this will all be available in Fedora 13 and future RHEL. This obviously also needs to be ported over to the Xen and VMWare ESX drivers in libvirt, which isn’t as hard as it sounds, because libvirt has a very good internal API for processing CPUID masks now. Kudos to Jiri Denemark for doing all the really hard work on this CPU modelling system!

Stable guest machine ABI, PCI addressing and disk controllers in libvirt

Posted: February 15th, 2010 | Filed under: libvirt, Virt Tools | 2 Comments »

If you are using a physical machine, the chances are that the only things you hotplug are USB devices, or more occasionally SCSI disk drives, never PCI devices themselves. Although you might flash upgrade the BIOS, doing so is not all that likely to change the machine ABI seen by the operating system. The world of virtual machines though, is not quite so static. It is common for administrators to want to hotplug CPUs, memory, USB devices, PCI devices & disk drives. Upgrades of the underlying virtualization software will also bring in new features & capabilities to existing configured devices in the virtual machine, potentially changing the machine ABI seen by the guest OS. Some operating systems (Linux!) cope with these kinds of changes without batting an eyelid. Other operating systems (Windows!) get very upset and decide you have to re-activate the license keys for your operating system, or re-configure your devices.

Stable machine ABIs

The first problem we addressed in libvirt was that of providing a stable machine ABI, so that when you upgrade QEMU/KVM, the guest doesn’t unexpectedly break. QEMU has always had the concept of a “machine type”. In the x86 emulator this let you switch between a plain old boring ISA only PC, and a new ISA+PCI enabled PC. The libvirt capabilities XML format exposed the allowed machine types for each guest architecture

<guest>
  <os_type>hvm</os_type>
  <arch name='i686'>
    <wordsize>32</wordsize>
    <emulator>/usr/bin/qemu</emulator>
    <machine>pc</machine>
    <machine>isapc</machine>
    <domain type='qemu'>
    </domain>
  </arch>
</guest>

This tells you that the i686 emulator with path /usr/bin/qemu supports two machine types “pc” and “isapc”. This information then appears in the guest XML

<domain type='kvm'>
  <name>plain</name>
  <uuid>c7a1edbd-edaf-9455-926a-d65c16db1809</uuid>
  <os>
    <type arch='i686' machine='pc'>hvm</type>
    ...snip...
  </os>
  ...snip...
</domain>

To support a stable machine ABI, the QEMU developers introduced the idea of versioned machine types. Instead of just “pc”, there is now “pc-0.10”, “pc-0.11”, etc new versions being added for each QEMU release that changes something in the machine type. The original “pc” machine type is declared to be an alias pointing to the latest version. libvirt captures this information in the capabilities XML, by listing all machine types and using the “canonical” attribute to link up the alias

<guest>
 <os_type>hvm</os_type>
 <arch name='i686'>
   <wordsize>32</wordsize>
   <emulator>/usr/bin/qemu</emulator>
   <machine>pc-0.11</machine>
   <machine canonical='pc-0.11'>pc</machine>
   <machine>pc-0.10</machine>
   <machine>isapc</machine>
   <domain type='qemu'>
   </domain>
</guest>

Comparing to the earlier XML snippet, you can see 2 new machine types “pc-0.10” and “pc-0.11”, and can determine that “pc” is an alias for “pc-0.11”. The next clever part is that when you define / create a new guest in libvirt, if you specify a machine type of “pc”, then libvirt will automatically canonicalize this to “pc-0.11” for you. This means application developers rarely need to worry about machine versions; they automatically get a stable versioned machine ABI for all their guests.
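The canonicalization step is easy to mimic from the capabilities XML. This Python sketch (using a trimmed copy of the capabilities snippet above) resolves an aliased machine type via the “canonical” attribute, roughly as libvirt does when defining a guest:

```python
import xml.etree.ElementTree as ET

# Trimmed version of the capabilities XML shown above.
CAPS = """
<guest>
 <os_type>hvm</os_type>
 <arch name='i686'>
   <machine>pc-0.11</machine>
   <machine canonical='pc-0.11'>pc</machine>
   <machine>pc-0.10</machine>
   <machine>isapc</machine>
 </arch>
</guest>
"""

def canonical_machine(caps_xml, requested):
    """Resolve a machine type alias using the 'canonical' attribute;
    non-aliased machine types resolve to themselves."""
    for m in ET.fromstring(caps_xml).iter("machine"):
        if m.text == requested:
            return m.get("canonical", requested)
    raise ValueError(f"unknown machine type {requested}")

print(canonical_machine(CAPS, "pc"))     # pc-0.11
print(canonical_machine(CAPS, "isapc"))  # isapc
```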

PCI device addressing

The second problem faced is that of ensuring that a device’s PCI address does not randomly change across reboots, or even across host migrations. This problem was primarily caused by the fact that you can hotplug/unplug arbitrary PCI devices at runtime. As an example, a guest might boot with 2 virtio disks and 1 NIC, a total of 3 PCI devices. These get assigned PCI slots 4, 5 and 6 respectively. The admin then unplugs one of the virtio disks. When the guest then reboots, or migrates to another host, QEMU will assign PCI slots 4 & 5. In other words the NIC has unexpectedly moved from slot 6 to slot 5. Migration is unlikely to be successful when this happens! The first roadblock in attempting to solve this problem was that QEMU did not provide any way for a management application to specify a PCI address for devices. As of the QEMU 0.12 release though, this limitation is finally removed with the introduction of a new generic ‘-device’ argument for configuring virtual hardware devices. QEMU only supports a single PCI bus and no bridges, so we can only configure the PCI slot number at this time, but this is sufficient. As part of a giant patch series I switched libvirt over to use this new syntax.

In implementing this, it was necessary to add a way to record the PCI addresses in the libvirt guest XML format. After a little head scratching we settled on adding a generic “address” element to every single device in the libvirt XML. This example shows what it looks like in the context of a NIC definition

    <interface type='network'>
      <mac address='52:54:00:f7:c5:0e'/>
      <source network='default'/>
      <target dev='vnet0'/>
      <model type='e1000'/>
      <alias name='e1000-nic2'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
    </interface>

Requiring every management application to assign PCI addresses to devices was not at all desirable though, they should not need to think about this kind of thing. Thus whenever a new guest is defined in libvirt with the QEMU/KVM driver, the first thing libvirt does is to issue unique PCI addresses to every single device in the XML. This neatly side-steps the minor problem of having to tell apps that PCI slot 1 is reserved for the PIIX3, slot 2 for the VGA controller and slot 3 for the balloon device. So an application defines a guest without addresses and then can query libvirt for the updated XML to discover what addresses were assigned. For older QEMU < 0.12, we can’t support static PCI addressing due to lack of the “-device” argument, but libvirt will still report on what addresses were actually assigned by QEMU at runtime.
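A minimal sketch of the slot assignment idea, assuming slots 1–3 are reserved as described (PIIX3, VGA, balloon) and QEMU’s single bus tops out at 31 slots; libvirt’s real allocator is considerably more involved:

```python
def assign_pci_slots(devices, reserved=(1, 2, 3), max_slot=31):
    """Hand out PCI slot numbers to devices lacking an address,
    skipping reserved slots and any slots already claimed."""
    used = set(reserved) | {a for a in devices.values() if a is not None}
    free = (s for s in range(1, max_slot + 1) if s not in used)
    return {dev: (addr if addr is not None else next(free))
            for dev, addr in devices.items()}

# Three devices defined without addresses, as an application would do.
devices = {"disk0": None, "disk1": None, "nic0": None}
print(assign_pci_slots(devices))
# {'disk0': 4, 'disk1': 5, 'nic0': 6}
```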

At the time of hotplugging a device, the same principles apply. The application passes in the device XML without any address info and libvirt assigns a PCI slot. The only minor caveat is that if an application then invokes the “define” operation to make the new device persistent, it should take care to copy across the auto-assigned address. This limitation will be addressed in the near future with an improved libvirt hotplug API.

Disk controllers & drive addressing

At around the same time that I was looking at static PCI addressing, Wolfgang Mauerer was proposing a way to fix the SCSI disk hotplug support in libvirt. The problem here was that at boot time, if you listed 4 SCSI disks, you would get 1 SCSI controller with 4 SCSI disks attached. Meanwhile if you tried hotplugging 4 SCSI disks, you’d get 4 SCSI controllers each with 1 disk attached. In other words we were faking SCSI disk hotplug, by attaching entirely new SCSI controllers. It doesn’t take a genius to work out that this is going to crash & burn at next reboot, or migration, when those hotplugged SCSI controllers all disappear and the disks end up back on a single controller. The solution here was to explicitly model the idea of a disk controller in the libvirt guest XML, separate from the disk itself. For this task, we invented a fairly simple new XML syntax that looks like this

    <controller type='scsi' index='0'>
      <alias name='scsi1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x0c' function='0x0'/>
    </controller>

Each additional SCSI controller defined gets a new “index” value, and of course has a unique auto-assigned PCI address too. Notice how the “address” element has a type attribute. This bit of forward planning allows libvirt to introduce new address types later. This capability is now used to link disks to controllers, by defining a “drive” address type consisting of a controller ID, bus ID and unit ID. In this example, the SCSI disk “sdz” is linked to controller 3, attached as unit 4.

    <disk type='file' device='disk'>
      <source file='/home/berrange/output.img'/>
      <target dev='sdz' bus='scsi'/>
      <alias name='scsi3-0-4'/>
      <address type='drive' controller='3' bus='0' unit='4'/>
    </disk>

Since no existing libvirt application knows about the “controller” or “address” elements, libvirt will automatically populate these as required. So in this example, if the SCSI disk were specified without any “address” element, libvirt will add one, figuring out controller & unit properties based on the device name “sdz”. Each SCSI controller can have 7 units attached, and “sdz” corresponds to drive index 25 (counting from 0), hence ending up on controller 3 (25 / 7) and unit 4 (25 % 7). Once we added controllers and drive addresses for SCSI disks, naturally the same was also done for IDE and floppy disks. Virtio disks are a little special in that there is no separate VirtIO disk controller; every VirtIO disk is a separate PCI device. This is somewhat of a scalability problem, since QEMU only has 31 PCI slots and 3 of those are already taken up with the PIIX3, VGA adapter & balloon driver. Needless to say, we hope that QEMU will implement a PCI bridge soon, or introduce a real VirtIO disk controller.
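The “sdz” arithmetic can be sketched in a few lines. This assumes a 0-based drive index derived from the trailing letter and 7 usable units per controller, which reproduces the controller 3 / unit 4 result above (multi-letter names like “sdaa” are ignored for simplicity):

```python
def drive_address(dev, units_per_controller=7):
    """Map a disk target like 'sdz' to a (controller, unit) drive address.
    Assumes single-letter suffixes and 7 usable units per SCSI controller."""
    idx = ord(dev[-1]) - ord("a")  # 'sda' -> 0 ... 'sdz' -> 25
    return idx // units_per_controller, idx % units_per_controller

print(drive_address("sdz"))  # (3, 4)
print(drive_address("sdh"))  # (1, 0) -- first unit on the second controller
```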

Controlling guest CPU & NUMA affinity in libvirt with QEMU, KVM & Xen

Posted: February 12th, 2010 | Filed under: libvirt, Virt Tools | 16 Comments »

When provisioning new guests with libvirt, the standard policy for affinity between the guest and host CPUs / NUMA nodes, is to have no policy at all. In other words the guest will follow whatever the hypervisor’s own default policy is, which is usually to run the guest on whatever host CPU is available. There are times when an explicit policy may be better, in particular to make the most of a NUMA architecture it is usually desirable to lock a guest to a particular NUMA node so that its memory allocations are always local to the node it is running on, avoiding the cross-node memory transports which have less bandwidth. As of writing, libvirt supports this capability for QEMU, KVM and Xen guests. Even on a non-NUMA system some form of explicit placement across the hosts’ sockets, cores & hyperthreads may be desired.

Querying host CPU / NUMA topology

The first step in deciding what policy to apply is to figure out what the host’s topology is. The virsh nodeinfo command provides information about how many sockets, cores & hyperthreads there are on a host.

# virsh nodeinfo
CPU model:           x86_64
CPU(s):              8
CPU frequency:       1000 MHz
CPU socket(s):       2
Core(s) per socket:  4
Thread(s) per core:  1
NUMA cell(s):        1
Memory size:         8179176 kB

There are a total of 8 CPUs, in 2 sockets, each with 4 cores.

More interesting though is the NUMA topology. This can be significantly more complex, so the data is provided in a structured XML document, as part of the virsh capabilities output

# virsh capabilities
<capabilities>

  <host>
    <cpu>
      <arch>x86_64</arch>
    </cpu>
    <migration_features>
      <live/>
      <uri_transports>
        <uri_transport>tcp</uri_transport>
      </uri_transports>
    </migration_features>
    <topology>
      <cells num='2'>
        <cell id='0'>
          <cpus num='4'>
            <cpu id='0'/>
            <cpu id='1'/>
            <cpu id='2'/>
            <cpu id='3'/>
          </cpus>
        </cell>
        <cell id='1'>
          <cpus num='4'>
            <cpu id='4'/>
            <cpu id='5'/>
            <cpu id='6'/>
            <cpu id='7'/>
          </cpus>
        </cell>
      </cells>
    </topology>
    <secmodel>
      <model>selinux</model>
      <doi>0</doi>
    </secmodel>
  </host>

 ...removed remaining XML...

</capabilities>

This tells us that there are two NUMA nodes (aka cells), each containing 4 logical CPUs. Since we know there are two sockets, we can obviously infer from this that each socket is in a separate node, not that this really matters for what we need later. If we’re intending to run a guest with 4 virtual CPUs, we can see that it will be desirable to lock the guest to physical CPUs 0-3, or 4-7, to avoid non-local memory accesses. If our guest workload required 8 virtual CPUs, since each NUMA node only has 4 physical CPUs, better utilization may be obtained by running a pair of 4 CPU guests & splitting the work between them, rather than using a single 8 CPU guest.
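Pulling the cell-to-CPU mapping out of the capabilities XML is straightforward; here is a Python sketch using the <topology> fragment shown above (a real tool would parse the full virsh capabilities document):

```python
import xml.etree.ElementTree as ET

# The <topology> fragment from the capabilities XML above.
TOPOLOGY = """
<topology>
  <cells num='2'>
    <cell id='0'>
      <cpus num='4'><cpu id='0'/><cpu id='1'/><cpu id='2'/><cpu id='3'/></cpus>
    </cell>
    <cell id='1'>
      <cpus num='4'><cpu id='4'/><cpu id='5'/><cpu id='6'/><cpu id='7'/></cpus>
    </cell>
  </cells>
</topology>
"""

def cells_to_cpus(topology_xml):
    """Parse a <topology> fragment into NUMA cell id -> logical CPU ids."""
    root = ET.fromstring(topology_xml)
    return {int(cell.get("id")): [int(c.get("id")) for c in cell.iter("cpu")]
            for cell in root.iter("cell")}

print(cells_to_cpus(TOPOLOGY))
# {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}
```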

Deciding which NUMA node to run the guest on

Locking a guest to a particular NUMA node is rather pointless if that node does not have sufficient free memory for the guest’s memory allocations. Indeed, it would be very detrimental to utilization. The next step is to ask libvirt what the free memory is on each node, using the virsh freecell command

# virsh freecell 0
0: 2203620 kB

# virsh freecell 1
1: 3354784 kB

If our guest needs to have 3 GB of RAM allocated, then clearly it needs to be run on NUMA node (cell) 1, rather than node 0, since the latter only has 2.2 GB available.
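The decision rule can be written down trivially; a sketch using the freecell numbers above (a real management application would pull these from virNodeGetCellsFreeMemory rather than hardcoding them):

```python
def pick_cell(free_kb, needed_kb):
    """Choose a NUMA cell with enough free memory for the guest,
    preferring the cell with the most headroom."""
    candidates = [cell for cell, kb in free_kb.items() if kb >= needed_kb]
    if not candidates:
        raise RuntimeError("no NUMA cell has enough free memory")
    return max(candidates, key=free_kb.get)

free = {0: 2203620, 1: 3354784}          # virsh freecell output, in kB
print(pick_cell(free, 3 * 1024 * 1024))  # a 3 GB guest -> cell 1
```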

Locking the guest to a NUMA node or physical CPU set

We have now decided to run the guest on NUMA node 1, and referring back to the capabilities data about NUMA topology, we see this node has physical CPUs 4-7. When creating the guest XML we can now specify this as the CPU mask for the guest. Where the guest virtual CPU count is specified

<vcpus>4</vcpus>

we can now add the mask

<vcpus cpuset='4-7'>4</vcpus>

As mentioned earlier, this works for QEMU, KVM and Xen guests. In the QEMU/KVM case, libvirt will use the sched_setaffinity call at guest startup, while in the Xen case libvirt will instruct XenD to make an equivalent hypercall.

Automatic placement using virt-install

This walkthrough illustrated the concepts in terms of virsh commands. If writing a management application using libvirt, you would of course use the equivalent APIs for looking up this data, virNodeGetInfo, virConnectGetCapabilities and virNodeGetCellsFreeMemory. The virt-install provisioning tool has done exactly this and provides a simple way to automatically apply a ‘best fit’ NUMA policy when installing guests. Quoting its manual page

   --cpuset=CPUSET

   Set which physical cpus the guest can use. "CPUSET" is a comma separated
   list of numbers, which can also be specified in ranges. Example:

     0,2,3,5     : Use processors 0,2,3 and 5
     1-3,5,6-8   : Use processors 1,2,3,5,6,7 and 8

   If the value ’auto’ is passed, virt-install attempts to automatically
   determine an optimal cpu pinning using NUMA data, if available.

So if you have a NUMA machine and use virt-install, simply always add --cpuset=auto whenever provisioning a new guest.
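The CPUSET syntax from the man page is easy to parse if you’re writing your own tooling; a small sketch (the ‘auto’ special case is left out):

```python
def parse_cpuset(spec):
    """Expand a cpuset string like '1-3,5,6-8' into a sorted CPU list,
    following the syntax in the virt-install man page quoted above."""
    cpus = set()
    for part in spec.split(","):
        if "-" in part:
            lo, hi = map(int, part.split("-"))
            cpus.update(range(lo, hi + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)

print(parse_cpuset("0,2,3,5"))    # [0, 2, 3, 5]
print(parse_cpuset("1-3,5,6-8"))  # [1, 2, 3, 5, 6, 7, 8]
```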

Fine tuning CPU affinity at runtime

The scheme outlined above is focused on the initial guest placement at boot time. There may be times where it becomes necessary to fine-tune the CPU affinity at runtime. libvirt/virsh can cope with this need too, via the vcpuinfo and vcpupin commands. First, the virsh vcpuinfo command gives you the latest data about where each virtual CPU is running. In this example, rhel5xen is a guest on a Fedora KVM host which I used for RHEL5 Xen package maintenance work. It has 4 virtual CPUs and is being allowed to run on any host CPU

# virsh vcpuinfo rhel5xen
VCPU:           0
CPU:            3
State:          running
CPU time:       0.5s
CPU Affinity:   yyyyyyyy

VCPU:           1
CPU:            1
State:          running
CPU Affinity:   yyyyyyyy

VCPU:           2
CPU:            1
State:          running
CPU Affinity:   yyyyyyyy

VCPU:           3
CPU:            2
State:          running
CPU Affinity:   yyyyyyyy

Now let’s say I want to lock each of these virtual CPUs to a separate host CPU in the 2nd NUMA node.

# virsh vcpupin rhel5xen 0 4

# virsh vcpupin rhel5xen 1 5

# virsh vcpupin rhel5xen 2 6

# virsh vcpupin rhel5xen 3 7

The vcpuinfo command can be used again to confirm the placement

# virsh vcpuinfo rhel5xen
VCPU:           0
CPU:            4
State:          running
CPU time:       32.2s
CPU Affinity:   ----y---

VCPU:           1
CPU:            5
State:          running
CPU time:       16.9s
CPU Affinity:   -----y--

VCPU:           2
CPU:            6
State:          running
CPU time:       11.9s
CPU Affinity:   ------y-

VCPU:           3
CPU:            7
State:          running
CPU time:       14.6s
CPU Affinity:   -------y

And just to prove I’m not faking it all, here’s the KVM process running on the host and its /proc status

# grep pid /var/run/libvirt/qemu/rhel5xen.xml
<domstatus state='running' pid='4907'>

# grep Cpus_allowed_list /proc/4907/task/*/status
/proc/4907/task/4916/status:Cpus_allowed_list: 4
/proc/4907/task/4917/status:Cpus_allowed_list: 5
/proc/4907/task/4918/status:Cpus_allowed_list: 6
/proc/4907/task/4919/status:Cpus_allowed_list: 7

Future work

The approach outlined above relies on the fact that the kernel will always try to allocate memory from the NUMA node matching the one the guest's CPUs are executing on. While this is sufficient in the simple case, there are some pitfalls along the way. Between the time the guest is started & its memory is allocated, RAM on the NUMA node in question may have been used up, causing the OS to fall back to allocating from another node. For this reason, if placing guests on NUMA nodes, it is crucial that all guests running on the host have fixed placement, with none allowed to float free. In some weird and wonderful NUMA topologies (hello Itanium !) there can be NUMA nodes which have only CPUs, or only RAM. To cope with these it will be necessary to extend libvirt to allow an explicit memory allocation node to be listed in the guest configuration.

Visualizing libvirt development history using gource and code swarm

Posted: January 16th, 2010 | Filed under: libvirt, Virt Tools | Tags: | 3 Comments »

Michael DeHaan yesterday posted an example using gource to visualize Cobbler development history. Development on Cobbler started in April 2006, making it a similar vintage to libvirt whose development started in November 2005. So I thought it would be interesting to produce a visualization of libvirt development as a comparison.

Head over to the YouTube page for this video if it doesn’t show the option to watch in highdef in this embedded viewer. HD makes it much easier to make out the names.

Until July last year, libvirt was using CVS for source control. One of CVS's great many disadvantages is that it does not track author attribution at all, so the first 3 & 1/2 years show an inaccurately small contributor base. Watching the video it is clear when the switch to GIT happened, as the number of authors explodes. Even with the inaccuracies from the CVS history, it is clear from the video just how much development of libvirt has expanded over the past 4 years, particularly with the expansion to cover VirtualBox and VMWare ESX server as hypervisor targets. This video was generated on Fedora 12 using

 # gource -s 0.07 --auto-skip-seconds 0.1 \
          --file-idle-time 500 --disable-progress \
          --output-framerate 25 --highlight-all-users \
          -1280x720 --stop-at-end --output-ppm-stream - \
 | ffmpeg -y -b 15000K -r 17 -f image2pipe -vcodec ppm \
          -i - -vcodec mpeg4 libvirt-2010-01-15-alt.mp4

gource isn’t the only source code visualization application around. Last year a project called code swarm came along too. It has a rather different & simpler physics model from gource, not showing the directory structure explicitly. As a comparison I produced a visualization of libvirt using code_swarm too:

Head over to the YouTube page for this video if it doesn’t show the option to watch in highdef in this embedded viewer. HD makes it much easier to make out the names.

In this video the libvirt files are coloured into four groups: source code, test cases, documentation and i18n data (ie translated .po files). Each coloured dot represents a file, and each developer exerts a gravitational pull on files they have modified. For the years in which libvirt used CVS there were just a handful of developers who committed changes. This results in a visualization where developers have largely overlapping spheres of influence on files. In the last 6 months with GIT, changes have correct author attribution, so the visualization spreads out, more accurately reflecting who is changing what. In the end, I think I rather prefer gource’s results because it has a less abstract view of the source tree and better illustrates the rate of change over time.

Finally, can anyone recommend a reliable online video hosting service that’s using HTML5 + Ogg Theora yet ? I can easily encode these videos in Ogg Theora, but don’t want to host the 200 MB files on my own webserver since it doesn’t have the bandwidth to cope.