Deprecating libvirt / KVM hypervisor versions in OpenStack Nova

Posted: May 18th, 2015 | Filed under: Fedora, libvirt, OpenStack, Virt Tools | Tags: hypervisor, support, versions | 1 Comment »

If you read nothing else, just take note that in the Liberty release cycle Nova has deprecated usage of libvirt versions < 0.10.2, and in the Mxxxxx release cycle support for running with libvirt < 0.10.2 will be explicitly dropped.

OpenStack has a fairly aggressive policy of updating the minimum required versions of python modules it depends on. Many python modules are updated pretty frequently and (bitter) experience has shown that updates will often not be API compatible, even across seemingly minor version number changes. Maintaining working OpenStack code across different incompatible versions of the same module is tricky to get right and will inevitably be fragile without good testing coverage. While OpenStack has a huge level of testing, it cannot reasonably be expected to track the matrix of different incompatible python module versions. So (reluctantly) I accept that OpenStack has chosen the only practical approach, which is the increase the min required version of a library anytime it is found to be incompatible with an older version, or whenever there is a new feature required that is only present in a newer version. Now this does create pain in that the versions of python modules shipped in most distros are going to be too old to satisfy OpenStack’s needs. Thus when deploying OpenStack the distro provided versions must be updated to something newer. Fortunately, most OpenStack deployment tools mitigate the pain for users by taking ownership of installation and management of the full python stack, whether a 3rd party module, or an OpenStack provided module and this works pretty well in general.

It is important to contrast this with the situation found for dependencies on non-python modules, and in particular for Nova, the hypervisor platform that is targeted. While OpenStack does get some testing coverage of the hypervisor control plane, it is inconsequential when placed in the context of testing done by the hypervisor vendors themselves. The vendors will of course have tested the control plane themselves, both directly and often in the context of higher level apps such as oVirt and OpenStack. Beyond that though, the vendors will test a whole suite of guest operating systems to ensure they deploy and operate in a functionally correct manner. For Windows guests, there will be certifications of accelerated guest drivers via WHQL and the OS as a whole with Microsoft’s SVVP. The vendor will benchmark and validate scalability and performance the hypervisor on a multitude of compute workloads, and against various different storage and network technologies. For government related deployments, the platform will go through Common Criteria Certifications and security audits. Finally of course, the vendor will have a team of people maintaining the version they ship, most critically of course, to deal with security errata. I should note that I’m thinking about Open Source hypervisors primarily here and the difference between upstream releases and productized downstream releases. For closed source hypervisors you only ever get access to the productized release.

This is all a long winded way of saying that it is a very hard sell for OpenStack to require users to update their hypervisor versions to something OpenStack has tested, in preference to the version that the vendor ordinarily ships & supports. The benefit of OpenStack’s testing of the hypervisor control plane does not come anywhere close to offsetting the costs of loosing the testing, certification & support work that the vendor has put onto the hypervisor platform as a whole. There are also costs suffered directly by the user wrt platform upgrades, as distinct from application upgrades. It is fairly common for organizations to go through their own internal build and certification process when deploying a new operating system and/or hypervisor platform. This will include jobs such as integrating with their network services, particularly authentication & authorization engines, service monitoring frameworks, auditing systems and backups services. In addition the OS/hypervisor is also likely to undergo testing against any hardware platforms/models that the organization may have standardized on. It may take as long as 6 months, or even 12, before some organizations are ready to deploy a new hypervisor platform released by a vendor. Once an organization has deployed a platform, they will naturally wish to maximise its useful lifetime before upgrading to newer versions. This is in stark contrast to applications that an organization runs on the platforms which may be upgraded very frequently in matter of weeks or even days. It is sad that there can be such time lags for platform but not applications, but unfortunately this is just the way many organizations IT support works.

For these reasons, OpenStack needs to take a different approach to hypervisor platforms, and be pretty conservative about updating the minimum required version. The costs on users will be quite large and not something that can be mitigated by deployment tools that OpenStack can provide, unless the organization is one of the minority that is nimble enough to cope with a continuous deployment model and has enough in house expertise to take on a degree of hypervisor maintenance. In cases where Nova does wish to update the minimum required version there needs to be a fairly compelling set of benefits to Nova that outweigh the costs that will be imposed on the downstream users. Mere prettiness / cleanliness of the code is exceedingly unlikely to count as a compelling benefit.

Looking specifically at the Libvirt + KVM platform dependency in Nova, back in November 2013 we increased the minimum required libvirt from 0.9.6 to 0.9.11. This had the cost of dropping the ability to run Nova on the (then current) Ubuntu LTS platform. This cost was largely mitigated by the fact that Canonical provide the Cloud Archive add-on repository which ships newer libvirt and KVM versions specifically for use with OpenStack, so users had an easy way out in that case. The compelling benefit to Nova though, was that it enabled OpenStack to depend on the new libvirt-python module that had been split off from the main libvirt package and made available on PyPi. This made it possible for OpenStack testing to setup virtualenvs with specific libvirt python versions in common with its approach for any other python modules. More importantly this new libvirt-python has support for the Python 3 platform, so unblocking that porting item for Nova. As a result, the upgrade from 0.9.6 to 0.9.11 was a clear net win on balance.

The benefit of increasing the min required libvirt to values beyond 0.9.11 is harder to articulate. It would enable removal of a few workarounds in OpenStack but nothing that is imposing an undue burden on Nova Libvirt driver maintenance at this time. Mostly the problem with older versions is that they simply lack a lot of functionality compared to current versions, so there will be an increasingly large set of OpenStack features which will not work at all on such old versions. They also get comparatively less testing by developers, vendors and users alike, so as time goes by we’re less likely to spot incompatibilities with old versions which will ultimately affect the experience users have when deploying OpenStack. It is less clear cut when to the draw the line though, in these cases. To help guide our decision making, a list of currently shipping libvirt, kvm and libguestfs versions across distros is maintained. For the community focused distros with short lifetimes (short == less than 2 years from release to end-of-life), it is quite simple to just drop them as supported targets when they go end of life. So from the POV of Fedora, at time of writing, we’ll only care about Nova supporting Libvirt >= 1.1.3. For the enterprise focus distros with long lifetimes (long == more than 2 years, often 5-10 years), it is hard to decide when to drop them as a supported target. As mentioned earlier, enterprise organizations will typically have quite a time lag between a new release coming out and it being something that is widely deployed. Despite RHEL-7 having been available since June 2014, it is not uncommon for organizations to still be using RHEL-6 for new platform deployments. Officially, RHEL-6 is a supported platform by Red Hat until at least 2020, but clearly Nova will not wish to continue targeting it for that length of time. So there is a question of when it is reasonable for Nova to end support for the RHEL-6 platform. Nova already dropped support for Python 2.6, so RHEL-6 users will need to use the Software Collections Layer to get Python 2.7 access, and Red Hat’s OpenStack product is now RHEL-7 based only, so clearly Nova on RHEL-6 is entering its twilight years.

Looking at the current distro support matrix for libvirt versions it was decided that support for Debian Wheezy and OpenSuse 12.2 was reasonable to drop, but at this time Nova will continue to support RHEL-6 vintage libvirt. To provide users with greater advance notice it was agreed that dropping of libvirt/kvm versions should require issuance of a deprecation warning for one release cycle.. So in the Liberty release, Nova will now print out a warning if run on libvirt < 0.10.2, and in the Mxxxx release cycle this will turn into a fatal error. So anyone currently deployed on libvirt 0.9.11 -> 0.10.1 has advance warning to plan for an upgrade of their hypervisor platform. I suspect that RHEL-6 may well get the chop 1 cycle later, eg we’d issue a warning in Mxxx and drop it in Nxxxx release, as RHEL-7 would have been available for 2 years by that point and should be taking the overwhealming majority of KVM hypervisor deployments.

One of the things to come out of the discussion around incrementing the libvirt minimum version was that we haven’t really articulated what our policy is in this area. As one of the lead maintainers of the Nova libvirt driver, this blog post is an attempt to set out my views of the matter. As you can see there is no simple answer, but the intent is to be as conservative as practical to minimize the number of users who are likely to be impacted by decisions to increase the minimum version. Is also became clear that we need to do a better job of articulating our approach to required platform versions to users in documentation. Previously there had been an attempt to categorize Nova hypervisor platforms/drivers into three groups, primarily according to the level of testing they have in the OpenStack or 3rd party CI systems. The intention behind this is fine, but the usefulness to users is somewhat limited because OpenStack CI obviously only tests a handful of very specific hypervisor platforms. So this classification gives you confidence that a Nova driver has been tested, but not confidence that it has been tested with your particular versions. So functionality that OpenStack claims is tested & operational may not be available on your platform due to version differences. To address this, OpenStack needs to provide more detailed information to users, in particular it must distinguish between what versions of a hypervisor Nova is technically capable of running against, vs the versions of a hypervisor that have been validated by CI. Armed with this knowledge, where those versions differ, it is reasonable for the user to look to their hypervisor vendor for confirmation that their own testing can provide an equivalent level of assurance to the OpenStack CI testing. The user also has the option of running the OpenStack CI tests themselves against their own specific deployment platform. On the theme of providing users with more information about hypervisor capabilities, the Nova feature support matrix which was previously held in a wiki has been turned into a piece of formal documentation maintained in Nova itself. The intent is to continue to expand this to provide more fine grained information about features and eventually annotate them with any caveats about minimum required versions of the hypervisor in the associated notes for each feature item.

QEMU QCow2 built-in encryption: just say no. Deprecated now, to be deleted soon

Posted: March 17th, 2015 | Filed under: Coding Tips, Fedora, libvirt, OpenStack, Security | Tags: deprecated, luks, qcow2 | 6 Comments »

A little over 5 years ago now, I wrote about a how libvirt introduced support for QCow2 built-in encryption. The use cases for built-in qcow2 encryption were compelling back then, and remain so today. In particular while LUKS is fine if your disk backend is already a kernel visible block device, it is not a generically usable alternative for QEMU since it requires privileged operation to set it up, would require yet another I/O layer via a loopback or qemu-nbd device, and finally is entirely Linux specific. The direction QEMU has taken over the past few years has in fact been to take the kernel out of the equation for more & more functionality. For example, QEMU can now natively connect to RBD, Gluster, iSCSI and NFS servers with no kernel assistance – the client code is implemented entirely within QEMU block driver layer, which precludes the use of LUKS there.

At the time I wrote that blog post, no one had seriously looked at the QCow2 encryption design to see if it was in any way sane from a security POV. At least if they had, AFAIK, they didn’t make their analysis public. Over time though, various QEMU maintainers did eventually look at the QCow2 encryption code and their conclusions were not positive. The consensus opinion amongst QEMU maintainers today is that QCow2 encryption is terminally flawed in a number of serious ways, including but not limited to:

The AES-CBC cipher is used with predictable initialization vectors based on the sector number. This makes it vulnerable to chosen plaintext attacks which can reveal the existence of encrypted data.
The user passphrase is directly used as the encryption key.
- A poorly chosen or short passphrase will compromise the security of the encryption.
- In the event of the passphrase being compromised there is no way to change the passphrase to protect data in any qcow images.
- It is difficult to make the data permanently inaccessible upon file deletion – at best you can try to destroy data with shred, though even this is ineffective with many modern storage technologies.

By comparison the LUKS encryption format does not suffer from any of these problems. With LUKS the initialization vectors typically use ESSIV to ensure unpredictability; the passphrase is only indirectly used to unlock the master encryption key material, so can be changed at will; the passphrase is put through a PBKDF2 function to mitigate the effects of short sequences of letters; the size of the master key material is artificially inflated with an anti-forensic algorithm to increase the difficulty of recovering the key from deleted volumes.

The QCow2 encryption scheme is a prime example of why merely using a well known standard algorithm (AES) is not sufficient to guarantee a secure implementation. In January 2014, I submitted an update for the QEMU docs to explicitly warn users about the security limitations of QCow2 encryption, which made it into the 1.5.0 release of QEMU. This week Markus has gone one step further and explicitly deprecated use of QCow2 encryption for the forthcoming 2.3.0 release of QEMU. Any attempt to use an encrypted QCow2 file with the QEMU system emulator will now result in a warning being printed to stderr, which in turn ends up in the libvirt logfile for that guest. As well as the security issues, Markus’ other motivation for deprecating this is that the way it is integrated into QEMU block driver layer causes a number of technical & usability problems. So even if we want encrypted block devices in QEMU, the internals for encryption need a complete rewrite from scratch.

In the 2.4.0 release, the intention is to go one step further and actually delete support for QCow2 encryption from the QEMU system emulator entirely, as well as all the infrastructure for block device encryption. We will keep support for decrypting images in the qemu-img program only, to provide a way for users to get their previously encrypted data out into a supported format.

In the immediate future, the recommendation is that users who need encryption for virtual disks should use LUKS on the host, despite the limitations that I noted earlier. At some point in the next 6 months my intention is to start working on a QEMU block driver implementation of the LUKS format, which will enable QEMU to add encryption to any of its virtual disk backends, not merely QCow2. This will require designing new infrastructure for handling decryption keys inside QEMU too, to replace the unsatisfactory approach used today. By using the LUKS format directly though, QEMU will benefit from the security knowledge of those who designed and analysed this format over many years to identify its strengths & weaknesses. It will also provide good interoperability. eg an encrypted qcow2-luks file will be able to be converted to/from a block device for access by the kernel’s LUKS driver with no need to re-encrypt the data, which is very desirable as it lets users decide whether to use in-QEMU or in-kernel block device backends at the flick of a switch.

So just to sum up. Do not ever use QCow2 built-in encryption as it exists today. It is unfixably broken by design. It is deprecated in QEMU 2.3.0 and is likely to be deleted in QEMU 2.4.0.

Nova metadata recorded in libvirt guest instance XML

Posted: February 20th, 2015 | Filed under: Coding Tips, Fedora, libvirt, OpenStack, Virt Tools | Tags: debugging, metadata, nova, troubleshooting | No Comments »

One of the issues encountered when debugging libvirt guest problems with Nova, is that it isn’t always entirely obvious why the guest XML is configured the way it is. For a while now, libvirt has had the ability to record arbitrary application specific metadata in the guest XML. Each application simply declares the XML namespace it wishes to use and can then record whatever it wants. Libvirt will treat this metadata as a black box, never attempting to interpret or modify it. In the Juno release I worked on a blueprint to make use of this feature to record some interesting information about Nova.

The initial set of information recorded is as follows:

Version – the Nova version number, and any vendor specific package suffiix (eg RPM release number). This is useful as the user reporting a bug is often not entirely clear what particular RPM version was installed when the guest was first booted.
Name – the Nova instance display name. While you can correlated Nova instances to libvirt guests using the UUID, users reporting bugs often only tell you the display name. So recording this in the XML is handy to correlate which XML config corresponds to which Nova guest they’re talking about.
Creation time – the time at which Nova booted the guest. Sometimes useful when trying to understand the sequence in which things happened.
Flavour – the Nova flavour name, memory, disk, swap, ephemeral and vcpus settings. Flavours can be changed by the admin after a guest is booted, so having the original values recorded against the guest XML is again handy.
Owner – the tenant user ID and name, as well as their project
Root image – the glance image ID, if the guest was booted from an image

The Nova version number information in particular has already proved very useful in a couple of support tickets, showing that the VM instance was not booted under the software version that was initially claimed. There is still scope for augmenting this information further though. When working on another support issues it would have been handy to know the image properties and flavour extra specs that were set, as the user’s bug report also gave misleading / incorrect information in this area. Information about cinder block devices would also be useful to have access to, for cases where the guest isn’t booting from an image.

While all this info is technically available from the Nova database, it is far easier (and less dangerous) to ask the user to provide the libvirt XML configuration than to have them run random SQL queries. Standard OS trouble shooting tools such as sosreport from RHEL/Fedora already collect the libvirt guest XML when run. As a result, the bug report is more likely to contain this useful data in the initial filing, avoiding the need to ask the user to collect further data after the fact.

To give an example of what the data looks like, a Nova guest booted with

$ nova boot --image cirros-0.3.0-x86_64-uec --flavor m1.tiny vm1

Gets the following data recorded

$ virsh -c qemu:///system dumpxml instance-00000001
<domain type='kvm' id='2'>
  <name>instance-00000001</name>
  <uuid>d0e51bbd-cbbd-4abc-8f8c-dee2f23ded12</uuid>
  <metadata>
    <nova:instance xmlns:nova="http://openstack.org/xmlns/libvirt/nova/1.0">
      <nova:package version="2015.1"/>
      <nova:name>vm1</nova:name>
      <nova:creationTime>2015-02-19 18:23:44</nova:creationTime>
      <nova:flavor name="m1.tiny">
        <nova:memory>512</nova:memory>
        <nova:disk>1</nova:disk>
        <nova:swap>0</nova:swap>
        <nova:ephemeral>0</nova:ephemeral>
        <nova:vcpus>1</nova:vcpus>
      </nova:flavor>
      <nova:owner>
        <nova:user uuid="ef53a6031fc643f2af7add439ece7e9d">admin</nova:user>
        <nova:project uuid="60a60883d7de429aa45f8f9d689c1fd6">demo</nova:project>
      </nova:owner>
      <nova:root type="image" uuid="2344a0fc-a34b-4e2d-888e-01db795fc89a"/>
    </nova:instance>
  </metadata>
 ...snip...
</domain>

The intention is that as long as the XML namespace URI (http://openstack.org/xmlns/libvirt/nova/1.0) isn’t changed, the data reported here will not change in a backwards incompatible manner. IOW, we will add further elements or attributes to the Nova metadata, but not change or remove existing elements or attributes. So if OpenStack related troubleshooting / debugging tools want to extract this data from the libvirt XML they can be reasonably well assured of compatibility across future Nova releases.

In the Kilo development cycle there have been patches submitted to record similar kind of data for VMWare guests, though obviously it uses a config data format relevant to VMWare rather than XML. Sadly this useful debugging improvement for VMWare had its feature freeze exception request rejected, pushing it out to Liberty, which is rather a shame :-(

Usage of the libvirt virCommand APIs for process spawning

Posted: October 2nd, 2014 | Filed under: Coding Tips, Fedora, libvirt, OpenStack, Security, Virt Tools | Tags: bash, commands, exec, fork, popen, shellshock, spawn, system, virCommand | 4 Comments »

The previous blog post looked at the history of libvirt APIs for spawning processes, up to the current day where there is a single virCommand object + APIs for spawning processes in a very flexible manner. This blog post will now look at the key features of this API and how it is used in practice.

Example usage

Before going into the dry details, lets consider a couple of real world examples where libvirt uses these APIs.

“system” replacement

As a first example, the “virNodeSuspendSetNodeWakeup” method is a place where libvirt might have traditionally used the ‘system’ call.

The goal is to suspend the host, setting a pre-defined wakeup alarm, for which libvirt needs to run the ‘rtcwake’ command. The wakeup time is provided to the API in terms of a unsigned long long which needs to be converted to a string.

If attempting this with the ‘system’ call the code might look like this:

static int virNodeSuspendSetNodeWakeup(unsigned long long alarmTime)
{
    char *setAlarmCmd;
    int ret = -1;

    if (asprintf(&setAlarmCmd, "rtcwake -m no -s %lld", alarmTime) < 0)
        goto cleanup;

    if (system(setAlarmCmd) < 0)
        goto cleanup;

    ret = 0;

 cleanup:
    free(setAlarmCmd);
    return ret;
}

Now consider what this would look like when using the virCommand APIs:

static int virNodeSuspendSetNodeWakeup(unsigned long long alarmTime)
{
    virCommandPtr setAlarmCmd;
    int ret = -1;

    setAlarmCmd = virCommandNewArgList("rtcwake", "-m", "no", "-s", NULL);
    virCommandAddArgFormat(setAlarmCmd, "%lld", alarmTime);

    if (virCommandRun(setAlarmCmd, NULL) < 0)
        goto cleanup;

    ret = 0;

 cleanup:
    virCommandFree(setAlarmCmd);
    return ret;
}

The difference in code complexity is negligible, but the difference in the quality of the implementation is significant.

“popen” replacement

As a second example, the “virStorageBackendIQNFound” method is a place where libvirt might have traditionally used the ‘popen’ call.

The goal this time is to run the iscsiadm command with a number of arguments and parse its stdout to look for a particular iSCSI target.

First consider what this might look like when using ‘popen’

static int virStorageBackendIQNFound(const char *initiatoriqn,
                                          char **ifacename)
{
    int ret = -1;
    FILE *fp = NULL;
    int fd = -1;
    char line[4096]

    if ((fp = popen("iscsiadm --mode iface")) == NULL)
        goto cleanup;

    while (fgets(line, 4096, fp) != NULL) {
       ...analyse line for a match... 
    }

    ret = 0;

cleanup:
    pclose(fp);
    close(fd);
    virCommandFree(cmd);

    return ret;
}

Now consider the re-write to use virCommand APIs

static int virStorageBackendIQNFound(const char *initiatoriqn,
                                          char **ifacename)
{
    int ret = -1;
    FILE *fp = NULL;
    int fd = -1;
    char line[4096]
    virCommandPtr cmd = virCommandNewArgList("iscsiadm", "--mode", "iface", NULL);

    virCommandSetOutputFD(cmd, &fd);
    if (virCommandRunAsync(cmd, NULL) < 0)
        goto cleanup;

    if ((fp = fdopen(fd, "r")) == NULL)
        goto cleanup;

    while (fgets(line, 4096, fp) != NULL) {
       ...analyse line for a match... 
    }
 
    if (virCommandWait(cmd, NULL) < 0)
        goto cleanup;
 
    ret = 0;

cleanup:
    fclose(fp);
    close(fd);
    virCommandFree(cmd);

    return ret;
}

There is a little more work todo for virCommand in terms of initial setup in this example. Technically the call to ‘virCommandWait’ could have been omitted here, since we don’t care about the exit status, but it is good practice to included it. If we had extra dynamic arguments to be provided to ‘iscsiadm’ that needed string formatting the two examples would have been nearer parity in terms of complexity. Even with the slightly longer code for virCommand, the result is a clear win from the quality POV avoiding the many flaws of popen’s implementation.

Detailed API examination

Now that the two examples have given a taste of what the virCommand APIs can do to replace popen/system, lets consider the full set of features exposed. After this it should be clear that the flexibility of the virCommand means there is never any need to delve into fork+exec anymore, let alone popen/system.

Constructing the command arguments

Probably the first task when spawning a command is actually construct the array of arguments and environment variables. In simple cases this can be done immediately when allocating the new virCommad object instance, for example using var-args

virCommandPtr cmd = virCommandNewArgList("touch", "/tmp/foo", NULL);

there is no need to check the return value of virCommandNew* for NULL. The later virCommandRun() API will look for a NULL pointer and report the OOM error at that point instead. The same is true for all error reporting in these APIs – virCommandRun is generally the only place where an error check is needed. This simplification in error handling is a major contributor to making this APIs hard to mis-use in calling code and thus minimizing errors.

Sometimes the list of arguments is not so simple that it can be initialized in one go via var-args. To deal with this it is possible to add arguments to an existing constructed command in a variety of ways

virCommandAddArgFormat(cmd, "--size=%d", 1025);
virCommandAddArgPair(cmd, "--user", "fred");
virCommandAddArgList(cmd, "some", "extra", "args", NULL);

This is handy because for complex command lines (eg those used with QEMU) it allows construction of the virCommand to be split up across multiple functions, each adding their own piece of the command line.

Setting up the environment

By default a process spawned will inherit the full environment of the parent process (almost always libvirtd in libvirt code). With things like QEMU though libvirt wants to be in complete control of the environment it runs under, so it will filter the environment to a subset of names. There are a couple of env variables that are always desired to pass down LD_PRELOAD, LD_LIBRARY_PATH, PATH, USER, HOME, LOGNAME & TMPDIR. If this set is desired there is a convenient method to request passthrough of this set

virCommandAddEnvPassCommand(cmd);

Additional environment variables can be set for passthrough from libvirtd. When passing through environment variables libvirt requires an explicit decision on whether the env variable is safe to pass when running setuid. If an env variable is considered unsafe for a setuid application, there is the option of passing a default value to substitute. The “PATH” variable is unsafe to pass when setuid, and should be set to a known safe value when running setuid:

virCommandAddEnvPassBlockSUID(cmd, "PATH", "/bin:/usr/bin");

The “LD_LIBRARY_PATH” variable is also unsafe when running setuid and should simply be dropped from the environment entirely:

virCommandAddEnvPassBlockSUID(cmd, "LD_LIBRARY_PATH", NULL);

Finally the “LOGNAME” is fine to allow even when setuid so can be left unchanged

virCommandAddEnvPassAllowSUID(cmd, "LOGNAME");

It is not always sufficient to just passthrough existing environment variables, so there are of course APIs to set them directly

virCommandAddEnvPair(cmd, "LOGNAME", "fred");
virCommandAddEnvFormat(cmd, "LOGNAME=%s", fred);
virCommandAddEnvString(cmd, "LOGNAME=fred");

Setting security attributes

Under UNIX a program will inherit process limits, umask and working directory from the parent. It is thus desirable to be able to override this, for example:

virCommandSetMaxFiles(cmd, 65536);

Along a similar vein the child’s umask can also be set

virCommandSetUmask(cmd, 0007);

If the current process’ working directory is unknown, it is a good idea to force an explicit working directory:

virCommandSetWorkingDirectory(cmd, "/");

It may sometimes be desirable to control what capabilities bits a child process has, to override the default behaviour. In such cases the command would to be initialized with the empty set and then the desired bits whitelisted.

virCommandClearCaps(cmd);
virCommandAllowCap(cmd, CAP_NET_RAW);

Finally when interacting with mandatory access control systems like SELinux or AppArmour it is possible to configure an explicit label for the child

virCommandSetSELinuxLabel(cmd, "system_u:system_r:svirt_t:s0:c135,c275");
virCommandSetAppArmorProfile(cmd, "2cb0e828-e6f6-40d1-b0f5-c50cdf34f5c9");

Interacting with stdio

A common requirement when spawning processes is to be able to interact with the child’s stdio in some manner. By default with the virCommand APIs, a process will get its stdin, stdout & stderr connected to /dev/null. For stdin there is a choice of feeding it a fixed length string, or connecting it up to the read end of an existing file descriptor, typically a pipe

virCommandSetInputBuffer(cmd, "Feed me brains");
virCommandSetInputFD(cmd, pipefd);

There are a similar pair of choices for receiving the stdout/stderr from the child. It is possible to supply a pointer to a ‘char *’ which will be filled with the child’s output upon exit. Alternatively a pointer to an ‘int’ can be provided, which can either specify an existing file descriptor, or if ‘-1’ a new anonymous pipe will be created.

char *child_out, *child_err;
virCommandSetOutputBuffer(cmd, &child_out);
virCommandSetErrorBuffer(cmd, &child_err);

int child_out -1, child_err = -1;
virCommandSetOutputFD(cmd, &child_out);
virCommandSetOutputFD(cmd, &child_err);

Passing file descriptors

Aside from any associated with stdio, all file descriptors will be closed when the child process is launched. This is generally a good thing since it prevents any leakage of file descriptors to the child. Such leakage can be a security flaw, and unless using glibc extensions to POSIX, it is not possible to avoid the race condition with setting O_CLOEXEC, so an explicit mass close is a very desirable approach. There can be times when it is is necessary to pass other arbitrary file descriptors to a child process. When passing a file descriptor it may also be desirable to close it in the parent process.

virCommandPassFD(cmd, 8, 0);
virCommandPassFD(cmd, 8, VIR_COMMAND_PASS_FD_CLOSE_PARENT);

Pre-exec callback hooks

Despite its wide range of features, there are times when the virCommand API is not sufficient for the job. In these cases there is the ability to request that a callback function be invoked immediately before exec’ing the child binary. This allows the caller to do arbitrary extra work, though of course bearing in mind POSIX’s rules about which functions are safe to use between fork+exec in a threaded application.

int my_hook(void *data)
{
    if (sometask(data) < 0)
        return -1;
    return 0;
}

virCommandSetPreExecHook(cmd, my_hook, "thefilename");

Executing the command

Everything until this point has been about setting up the command args and the constraints under which it will execute. None of the APIs shown so far require any kind of error checking of return values. Only now that it is time to execute the command will errors be reported and checked by the caller. The simplest way to execute is to use ‘virCommandRun’ which will block until the command finishes running and report the exit status

int status;
if (virCommandRun(cmd, &status) < 0) {
    virCommandFree(cmd);
    return -1;
}

virCommandFree(cmd);

It is possible to leave the ‘status’ arg as NULL in which case the API will turn any non-zero exit status into a fatal error. If the intention is to interact with the command via one or more file descriptors connected to stdio, then a slightly more flexible ‘virCommandRunAsync’ API is required. This call will only block until the process is actually running. The parent can then interact with it and when ready call ‘virCommandWait’.

int status;
pid_t child;
if (virCommandRunAsync(cmd, &child) < 0) {
    virCommandFree(cmd);
    return -1;
}

...interact with child via stdio or other means...
if (virCommandWait(cmd, &status) < 0) {
    virCommandFree(cmd);
    return -1;
}

virCommandFree(cmd);

If something goes wrong during interaction, it is possible to terminate the process with prejudice by using ‘virCommandAbort’ instead of ‘virCommandWait’.

Synchronizing with the child

Normally, once a child has fork’d off, the child & parent will continue execution in parallel, with the parent having no idea at which point the final exec() will have been performed. There can be cases where the parent needs to do some work in between the fork and exec. A pre-exec hook can often be used for this, but the work needs to take place in the parent process another solution is required. To deal with this it is possible to request a handshake take place with the child process. Before the child exec’s its binary, it will notify the parent process that is wants to handshake and wait for a reply. The parent process meanwhile will wait to be notify, then do its work and finally reply to the child again

virCommandRequireHandshake(cmd);

if (virCommandRunAsync(cmd) < 0)
    return -1;

virCommandHandshakeWait(cmd);
...do setup work...
virCommandHandshakeNotify(cmd);

Simplified command execution

The examples above are very powerful, but in the simplest use cases it is possible to combine the virCommandNew + virCommandRun + virCommandFree calls into a single API call

const char *args[] = { "/bin/program", "arg1", "arg2", NULL };
if (virRun(args, NULL) < 0)
  return -1;

This is pretty much equivalent to ‘system’ in terms of complexity, but much safer as it avoids the shell and many other problems mentioned previously.

Integration with unit tests

One of the particularly interesting features of the virCommand APIs is the ability to do unit testing of code that otherwise spawns external commands. The test suite can define a callback that will be invoked any time an attempt is made to run a command. This callback can analyse stdin string, fill in stdout/stderr strings and set the exit status. This is sufficient to avoid the need to run the real command in the context of unit tests in most cases.

static void testCommandCallback(const char *const*args,
                                const char *const*env,
                                const char *input,
                                char **output,
                                char **error,
                                int *status,
                                void *opaque)
{
    ....fake the exit status and fill in **output or **error...
}

virCommandSetDryRun(NULL, testCommandCallback, NULL);

The End

That completes our whirlwind tour of libvirts APIs for spawning child processes. It should be clear that a lot of thought & effort has gone into designing a set of APIs that maximize safety without compromising on ease of use. There can really be no excuse for using either the popen or system calls for spawning programs & thus leaving yourself vulnerable to flaws like shellshock. The libvirt code described in this post is all available under the terms of the LGPLv2+ should anyone wish to pull out & adapt the virCommand APIs for their own programs. I look forward to the day when it is possible to use a Linux system with no reliance on shell by any program. Shell should be exclusively for use by interactive login sessions and administrator local scripting work, not a part of applications where it only ever leads to misery & insecurity.

History of APIs for spawning processes in libvirt without involving the shell

Libvirt has an interesting history when it comes to the spawning of child processes, for a long time eschewing all use of the standard C functions system and popen, instead preferring to use fork+exec via some higher level wrappers of our own design. There were a number of reasons for this decision, some obvious, some not so obvious:

Eliminate all use of shell. Command lines we pass to programs often contain user input. Correctly validating user input to block malicious shell meta characters is non-trivial to get right. Removing use of shell avoids this class of exploits entirely and was the top priority.
Thread safe file descriptor handling. It is generally a bug to allow child processes to inherit file descriptors. In a masterstroke of mis-design UNIX specifies that file descriptors are inheritable across exec by default, requiring an fcntl() call to set the O_CLOEXEC flag to prevent inheritance. Unless using non-standard glibc extensions, setting O_CLOEXEC is a race condition you will often loose in threaded programs using system/popen. The only portable way to 100% guarantee no leakage of file descriptors to child processes is to do a “mass close” of all file descriptors between fork+exec.
Safe signal handling. The system/popen commands do not reset signal handlers after fork(). Thus signal handlers registered by the program are at risk of executing in between fork/exec when spawning external commands. Depending on what the signal handler does this might be a significant problem.
Provide a better API. The system/popen commands are simple to use in the very simple scenarios, but do nothing to help with the more complex scenarios. For example, building up the list of argv to be execute often requires a lot of string & array manipulation.

An aside on bash “shellshock”

In light of the shellshock bug in bash, we’re rather happy that libvirt has taken the approach of avoiding system/popen & shell in general. There are currently only two places where I recall libvirt relying on the shell.

First when using the “SSH” transport for remote API access mechanism, the libvirt client will login to a remote host to setup a tunnel to the libvirtd daemon, and this involves the shell. This can only be used to exploit shellshock if the admin had attempted to limit the user’s access to the remote host using SSH’s ForceCommand config. This fairly uncommon in the context of libvirt, since granting access to the libvirtd daemon implies privileges equivalent to root already, and thus there’s nothing to be gained by login restrictions in SSH.

Second, the libvirt-guests service, which is run by init to suspend guests on shutdown, is written in shell and thus has the potential to be susceptible to shellshock. The risk would come if an unprivileged user had influence over the name of a guest (eg a guest called “() { :;}; echo vulnerable”. Fortunately testing so far indicates that no exploit was possible in the context of libvirt-guests, even with maliciously crafted guest names, since most of the work is done based on UUIDs.

Until a few months ago the network filtering code in libvirt relied in automatically generated huge shell scripts to run iptables commands. We’ve not analysed this obsolete code to see if it was vulnerable to shellshock attacks, but it is believed to be unlikely since the only user controlled strings were iptables chain names and libvirt had strict validation on their content. In any case current libvirt versions have removed all this usage of shell, instead talking to firewalld via the DBus APIs.

Of course it is possible that other external programs that libvirt talks to, in turn use shell & could be vulnerable, but at least the areas that libvirt is responsible appear to be safe from shellshock attacks.

A timeline of process spawning API features

With system/popen ruled out, the alternatives for spawning external programs are fork+exec or possibly posix_spawn. The posix_spawn API is actually fairly decent in terms of the flexibility it offers, but still has a fairly high cost of usage in terms of populating the arguments it needs to receive. Thus, over the years libvirt has evolved a higher level API around the fork+exec system calls that is now in universal use across the codebase. The history below gives a little background on how the APIs evolved to their current form and featureset.

Feb 2007 – qemudStartVMDaemon(). When the QEMU driver was first added to libvirt the qemudStartVMDaemon() function was introduced to simplify spawning of the QEMU binary via fork+exec. The notable things this wrapper did were to connect the child processes stdio to pipes and then explicitly close every other file descriptor. This avoided the risky usage of shell and prevents leak of file descriptors. The impl was not reusable at this point since it also contained the code for turning the QEMU configuration into an array of argv
Feb 2007 – qemudExec(). When support for running dnsmasq was added to the QEMU driver, qemudStartVMDaemon() was refactored to create the qemudExec() funtion. This was the first re-usable wrapper around fork+exec in libvirt. Given an array of argv parameters it would spawn the child process connecting its stdin to /dev/null and returning a pair of pipes for reading stdout & stderr. This is on a par with popen() in terms of ease of use, but far safer to use due tot he avoidance of shell and safe file descriptor handling. It would gain many more features in the changes that follow
Jul, 2007 – _virExec(). With the introduction of the OpenVZ driver, the qemudExec() function was pulled out of the QEMU driver files into a shared utility module. This was the first tentative step towards sharing large amounts of code between different virtualization drivers in libvirt. The functionality remains the same as with qemudExec()
Aug, 2008 – _virExec(). The signal handling race mentioned earlier was discovered. A signal handler registered in the parent was set to write to a pipe when it triggered, but _virExec had closed the pipe file descriptor and the FD number happened to have been reused when setting up the pipe to use for the child programs stdio. The fix was to block all signals before running fork() and unblock them after fork(). Before unblocking them though, the child process reset all signal handlers to the defaults.
Jan, 2008 – virRun(). The _virExec() API was designed towards long lived child programs, so it would return the child PID and expect the caller to run waitpid later. To simplify spawning of short-lived programs the virRun() AP I was introduced that simply runs _virExec() and then does an immediate waitpid, feeding back the exit status to the caller. This is on a par with system() in terms of ease of use, but far safer to use due to avoidance of shell and safe signal & file descriptor handling.
Aug, 2008 – _virExec(). The _virExec() API initially created a pair of pipes to allow the childs stdout/stderr to be fed back to the parent. It was found to be convenient to pass a pre-opened file descriptor instead of creating a new pipe. Thus the _virExec() API was enhanced to allow such usage.
Aug, 2008 – _virExec(). Initially all programs spawned would inherit all environment variables set in the libvirtd daemon. The _virExec() API was enhanced to allow a custom set of environments to be passed in, to replace the existing environment. If a custom environment was requested, execve() would be used, otherwise it would continue to use execvp(). The same change also introduced a flag to request that the child program be daemonized. When this is requested, the child will fork() again and the original child would exit. The child process would also have its working directory changed to “/” and become a session leader.
Aug, 2008 – _virExec(). As mentioned previously, all file descriptors in the child would be closed and new set of FDs attached to stdin/out/err. To have greater control the _virExec() API was enhanced to allow the caller to specify a set of file descriptors to keep open.
Feb, 2009 – _virExec(). When introducing support for sVirt / SELinux to the QEMU driver there was a need to perform some special actions in between the fork() and exec() calls. Rather than hard code this setup for every caller of _virExec(), the ability to provide a pre-exec callback was introduced. This callback would be run immediately before exec() and would be used to set the SELinux process label for QEMU.
May, 2009 – _virExec(). Many child programs are able to write out a pidfile, which is particularly useful when daemonizing them, as there is no easy way to identify the second level child PID otherwise. This is a little racy for the parent process though, because there is no guarantee that the pidfile exists by the time _virExec() returns control to the parent. Thus _virExec() was enhanced to allow it to directly create pidfiles when daemonizing a command. The parent is thus guaranteed that the pidfile exists when _virExec() returns.
Jun, 2009 – _virExec(). When a privileged process is spawned it will generally inherit all capabilities of the parent process. If it is known that a program won’t require any capabilities even when running as root, then there can be a benefit in removing them. The virExec() API was thus extended to use libcap-ng to optionally clear capabilities of child procceses.
Feb, 2010 – _virExec(). Synchronize with internal logging mutex. If a thread was in the middle of a logging call while another thread used virExec() the logging mutex would still be held in the child process. Any attempt to log in the child process would thus deadlock. To deal with this, the logging mutex is aquired before forking the new process and released afterwards. This kind of dance must be done anywhere there is a global mutex that needs to be safely accessed in between a fork+exec pair.
Feb, 2010 – virFork(). There were a couple of cases where libvirt needs the ability to fork without exec’ing a new binary. The code for handling fork() and resetting of signal handlers was split out of virExec() to allow to be used independently.
May, 2010 – virCommand. The list of parameters to the virExec() API had grown larger than desired, so a new object, virCommand, is introduced. The idea is that an object is populated with all the information related to the command to be run and then an API is invoked to execute it. With virExec the caller was still responsible for allocating the char **argv, but with virCommand there are now helpers to greatly simplify the argv creation and make it less error prone.
Nov, 2010 – virCommand. Ordinarily, once a child process has forked it will run asynchronously from the parent process. There are some scenarios in which it is necessary to have a lock-step synchronization between the parent and child process. For example, libvirt’s disk locking needs to acquire leases on storage before the QEMU binary is exec’d but it needs to know the child PID too & the lock acquisition code cannot run in the child PID. The virCommand API is thus extended to introduce a handshake capability. Before exec’ing the new binary the child will send a notification to the parent process and then wait for a response before continuing.
Jan, 2012 – virCommand. The virCommand API will either allow the child to inherit all capabilities or block all capabilities. There are some cases where finer control is required, such as spawning LXC containers with limited privileges. The virCommand API was thus extended to allow specific capbilities to be whitelisted in the child.
Jan, 2013 – virCommand. It is not uncommon to want to change the UID/GID of a process that is spawned, particularly if it cannot be trusted to change on its own accord, or if it will be unable to change due to filtering of the capabilities to remove the CAP_SETUID bit. The virCommand API is extended to allow an alternate UID+GID to be specified for the command to launch. This change will be done inbetween the fork+exec at the same time as modifying the process capabilities.
May, 2013 – virCommand.When spawning QEMU it is necessary to adjust some of the process limits, for example, to raise the maximum file handle count so it is independent of any limit applied to libvirtd. Once again the virCommand APIs are extended so that a number of process limits can be changed in between the fork+exec pair.
Mar, 2014 – virCommand. When unit testing code it is desirable to avoid interacting with broader system state, so it is hard to test code which spawns external commands to do work. With the virCommand APIs though it was easy to introduce a dry run mode where the test suite can supply a callback to be invoked instead of the actual command. The callback can send back fake data for stdout/stderr as required to test the calling code.
Sep, 2014 – virCommand. Under UNIX, a child process will inherit its parent’s umask by default which is not always desirable. For example although libvirtd may have a umask for 0077, it is desirable for QEMU to get a umask for 0007, so that group shared resources can be set up. The virCommand APIs grew a new option for specifying a umask to set in between the fork+exec stage.

The timeline above shows that libvirt’s own APIs for spawning child processes have grown a huge number of features over the last 7 years of development. They very quickly surpassed system() and popen() in terms of safety from various serious problems while maintaining their ease of use. They have achieved this without having to expose the codebase as a whole to the low level complexities of fork()+exec(). The low level details are isolated in one central place, with the rest of the code using higher level APIs. The next blog post will illustrate just how libvirt’s virCommand APIs are used in practice.

Daniel P. Berrangé

Writing about open source software, virtualization & more