History of APIs for spawning processes in libvirt without involving the shell
Libvirt has an interesting history when it comes to the spawning of child processes, for a long time eschewing all use of the standard C functions system and popen, instead preferring to use fork+exec via some higher level wrappers of our own design. There were a number of reasons for this decision, some obvious, some not so obvious:
- Eliminate all use of shell. Command lines we pass to programs often contain user input. Correctly validating user input to block malicious shell meta characters is non-trivial to get right. Removing use of shell avoids this class of exploits entirely and was the top priority.
- Thread safe file descriptor handling. It is generally a bug to allow child processes to inherit file descriptors. In a masterstroke of mis-design UNIX specifies that file descriptors are inheritable across exec by default, requiring an fcntl() call to set the FD_CLOEXEC flag to prevent inheritance. Unless using non-standard glibc extensions that set the flag atomically when the descriptor is created (for example passing O_CLOEXEC to open()), setting it after the fact is a race condition you will often lose in threaded programs using system/popen. The only portable way to 100% guarantee no leakage of file descriptors to child processes is to do a “mass close” of all file descriptors between fork+exec.
- Safe signal handling. The system/popen commands do not reset signal handlers after fork(). Thus signal handlers registered by the program are at risk of executing in between fork/exec when spawning external commands. Depending on what the signal handler does this might be a significant problem.
- Provide a better API. The system/popen commands are simple to use in the very simple scenarios, but do nothing to help with the more complex scenarios. For example, building up the list of argv to be executed often requires a lot of string & array manipulation.
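To make the first three points concrete, here is a minimal sketch of what a shell-free replacement for system() might look like. It is not libvirt's code: the run_no_shell() name, the MAX_SIGNAL constant and the MAX_FD_CAP limit are all assumptions made for illustration. The arguments are passed as an argv array so no shell ever parses them, and the child resets signal handlers and does the “mass close” before calling exec.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Assumption: 64 covers the signal numbers in use on this platform. */
#define MAX_SIGNAL 64
/* Sketch-only cap so the close loop stays fast when the fd limit is huge;
 * real code would have to cover every descriptor that might be open. */
#define MAX_FD_CAP 65536

/* Hypothetical helper, not a libvirt API: run argv[0] with the given
 * arguments, without a shell. Returns the exit status, or -1 on error. */
static int run_no_shell(char *const argv[])
{
    long maxfd = sysconf(_SC_OPEN_MAX);   /* queried before fork() */
    struct sigaction dfl;
    int status;
    pid_t pid;

    if (maxfd < 0 || maxfd > MAX_FD_CAP)
        maxfd = MAX_FD_CAP;

    dfl.sa_handler = SIG_DFL;
    dfl.sa_flags = 0;
    sigemptyset(&dfl.sa_mask);

    pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: put every handler back to its default. Failures for
         * signals that cannot be changed (e.g. SIGKILL) are ignored. */
        for (int sig = 1; sig <= MAX_SIGNAL; sig++)
            sigaction(sig, &dfl, NULL);

        /* "Mass close": keep only stdin, stdout and stderr. */
        for (int fd = 3; fd < maxfd; fd++)
            close(fd);

        execvp(argv[0], argv);
        _exit(127);                       /* only reached if exec failed */
    }

    /* Parent: wait for the child, retrying if interrupted by a signal. */
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Because each argv element reaches the program verbatim, a string such as "a;b" is just data; there is nothing for a shell metacharacter to be interpreted by.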
An aside on bash “shellshock”
In light of the shellshock bug in bash, we’re rather happy that libvirt has taken the approach of avoiding system/popen & shell in general. There are currently only two places where I recall libvirt relying on the shell.
First, when using the “SSH” transport for remote API access, the libvirt client will log in to a remote host to set up a tunnel to the libvirtd daemon, and this involves the shell. This can only be used to exploit shellshock if the admin had attempted to limit the user’s access to the remote host using SSH’s ForceCommand config. This is fairly uncommon in the context of libvirt, since granting access to the libvirtd daemon implies privileges equivalent to root already, and thus there’s nothing to be gained by login restrictions in SSH.
Second, the libvirt-guests service, which is run by init to suspend guests on shutdown, is written in shell and thus has the potential to be susceptible to shellshock. The risk would come if an unprivileged user had influence over the name of a guest (e.g. a guest called “() { :;}; echo vulnerable”). Fortunately testing so far indicates that no exploit was possible in the context of libvirt-guests, even with maliciously crafted guest names, since most of the work is done based on UUIDs.
Until a few months ago the network filtering code in libvirt relied on huge automatically generated shell scripts to run iptables commands. We’ve not analysed this obsolete code to see if it was vulnerable to shellshock attacks, but it is believed to be unlikely since the only user controlled strings were iptables chain names and libvirt had strict validation on their content. In any case current libvirt versions have removed all this usage of shell, instead talking to firewalld via the DBus APIs.
Of course it is possible that other external programs that libvirt talks to in turn use shell & could be vulnerable, but at least the areas that libvirt is responsible for appear to be safe from shellshock attacks.
A timeline of process spawning API features
With system/popen ruled out, the alternatives for spawning external programs are fork+exec or possibly posix_spawn. The posix_spawn API is actually fairly decent in terms of the flexibility it offers, but still has a fairly high cost of usage in terms of populating the arguments it needs to receive. Thus, over the years libvirt has evolved a higher level API around the fork+exec system calls that is now in universal use across the codebase. The history below gives a little background on how the APIs evolved to their current form and featureset.
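To give a feel for that cost of usage, here is a rough sketch of running a short-lived command with posix_spawnp(). The spawn_with_attrs() name and the particular choice of signals are assumptions for illustration only, not anything taken from libvirt. Even this small example needs two separate objects to be initialized, populated and destroyed around the actual spawn call.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/* Hypothetical helper: spawn argv[0] with stdin attached to /dev/null
 * and a few signals forced back to their default disposition.
 * Returns the child's exit status, or -1 on error. */
static int spawn_with_attrs(char *const argv[])
{
    posix_spawn_file_actions_t actions;
    posix_spawnattr_t attr;
    sigset_t defaults;
    pid_t pid;
    int status, rc;

    if (posix_spawn_file_actions_init(&actions) != 0)
        return -1;
    if (posix_spawnattr_init(&attr) != 0) {
        posix_spawn_file_actions_destroy(&actions);
        return -1;
    }

    /* File action: open /dev/null as the child's stdin. */
    rc = posix_spawn_file_actions_addopen(&actions, STDIN_FILENO,
                                          "/dev/null", O_RDONLY, 0);

    /* Attribute: these signals get SIG_DFL in the child, even if the
     * parent was ignoring them (the list is an arbitrary example). */
    sigemptyset(&defaults);
    sigaddset(&defaults, SIGINT);
    sigaddset(&defaults, SIGTERM);
    sigaddset(&defaults, SIGPIPE);
    if (rc == 0)
        rc = posix_spawnattr_setsigdefault(&attr, &defaults);
    if (rc == 0)
        rc = posix_spawnattr_setflags(&attr, POSIX_SPAWN_SETSIGDEF);

    /* Like execvp(), posix_spawnp() searches PATH for argv[0]. */
    if (rc == 0)
        rc = posix_spawnp(&pid, argv[0], &actions, &attr, argv, environ);

    posix_spawn_file_actions_destroy(&actions);
    posix_spawnattr_destroy(&attr);
    if (rc != 0)
        return -1;

    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Note that this sketch does nothing about closing other inherited file descriptors, and it has no place to run arbitrary code in the child before exec, which is something the wrappers described in the history came to rely on.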
- Feb 2007 – qemudStartVMDaemon(). When the QEMU driver was first added to libvirt the qemudStartVMDaemon() function was introduced to simplify spawning of the QEMU binary via fork+exec. The notable things this wrapper did were to connect the child process’s stdio to pipes and then explicitly close every other file descriptor. This avoided the risky usage of shell and prevented leaks of file descriptors. The impl was not reusable at this point since it also contained the code for turning the QEMU configuration into an array of argv.
- Feb 2007 – qemudExec(). When support for running dnsmasq was added to the QEMU driver, qemudStartVMDaemon() was refactored to create the qemudExec() function. This was the first re-usable wrapper around fork+exec in libvirt. Given an array of argv parameters it would spawn the child process, connecting its stdin to /dev/null and returning a pair of pipes for reading stdout & stderr. This is on a par with popen() in terms of ease of use, but far safer to use due to the avoidance of shell and safe file descriptor handling. It would gain many more features in the changes that follow.
- Jul, 2007 – _virExec(). With the introduction of the OpenVZ driver, the qemudExec() function was pulled out of the QEMU driver files into a shared utility module. This was the first tentative step towards sharing large amounts of code between different virtualization drivers in libvirt. The functionality remained the same as with qemudExec().
- Aug, 2008 – _virExec(). The signal handling race mentioned earlier was discovered. A signal handler registered in the parent was set to write to a pipe when it triggered, but _virExec had closed the pipe file descriptor and the FD number happened to have been reused when setting up the pipe to use for the child program’s stdio. The fix was to block all signals before running fork() and unblock them after fork(). Before unblocking them, though, the child process resets all signal handlers to the defaults.
- Jan, 2008 – virRun(). The _virExec() API was designed towards long lived child programs, so it would return the child PID and expect the caller to run waitpid later. To simplify spawning of short-lived programs the virRun() API was introduced that simply runs _virExec() and then does an immediate waitpid, feeding back the exit status to the caller. This is on a par with system() in terms of ease of use, but far safer to use due to avoidance of shell and safe signal & file descriptor handling.
- Aug, 2008 – _virExec(). The _virExec() API initially created a pair of pipes to allow the child’s stdout/stderr to be fed back to the parent. It was found to be convenient to pass a pre-opened file descriptor instead of creating a new pipe. Thus the _virExec() API was enhanced to allow such usage.
- Aug, 2008 – _virExec(). Initially all programs spawned would inherit all environment variables set in the libvirtd daemon. The _virExec() API was enhanced to allow a custom set of environment variables to be passed in, to replace the existing environment. If a custom environment was requested, execve() would be used, otherwise it would continue to use execvp(). The same change also introduced a flag to request that the child program be daemonized. When this is requested, the child will fork() again and the original child will exit. The second level child process also has its working directory changed to “/” and becomes a session leader.
- Aug, 2008 – _virExec(). As mentioned previously, all file descriptors in the child would be closed and a new set of FDs attached to stdin/out/err. To have greater control the _virExec() API was enhanced to allow the caller to specify a set of file descriptors to keep open.
- Feb, 2009 – _virExec(). When introducing support for sVirt / SELinux to the QEMU driver there was a need to perform some special actions in between the fork() and exec() calls. Rather than hard code this setup for every caller of _virExec(), the ability to provide a pre-exec callback was introduced. This callback would be run immediately before exec() and would be used to set the SELinux process label for QEMU.
- May, 2009 – _virExec(). Many child programs are able to write out a pidfile, which is particularly useful when daemonizing them, as there is no easy way to identify the second level child PID otherwise. This is a little racy for the parent process though, because there is no guarantee that the pidfile exists by the time _virExec() returns control to the parent. Thus _virExec() was enhanced to allow it to directly create pidfiles when daemonizing a command. The parent is thus guaranteed that the pidfile exists when _virExec() returns.
- Jun, 2009 – _virExec(). When a privileged process is spawned it will generally inherit all capabilities of the parent process. If it is known that a program won’t require any capabilities even when running as root, then there can be a benefit in removing them. The virExec() API was thus extended to use libcap-ng to optionally clear capabilities of child processes.
- Feb, 2010 – _virExec(). Synchronize with internal logging mutex. If a thread was in the middle of a logging call while another thread used virExec() the logging mutex would still be held in the child process. Any attempt to log in the child process would thus deadlock. To deal with this, the logging mutex is acquired before forking the new process and released afterwards. This kind of dance must be done anywhere there is a global mutex that needs to be safely accessed in between a fork+exec pair.
- Feb, 2010 – virFork(). There were a couple of cases where libvirt needs the ability to fork without exec’ing a new binary. The code for handling fork() and resetting of signal handlers was split out of virExec() to allow it to be used independently.
- May, 2010 – virCommand. The list of parameters to the virExec() API had grown larger than desired, so a new object, virCommand, is introduced. The idea is that an object is populated with all the information related to the command to be run and then an API is invoked to execute it. With virExec the caller was still responsible for allocating the char **argv, but with virCommand there are now helpers to greatly simplify the argv creation and make it less error prone.
- Nov, 2010 – virCommand. Ordinarily, once a child process has forked it will run asynchronously from the parent process. There are some scenarios in which it is necessary to have a lock-step synchronization between the parent and child process. For example, libvirt’s disk locking needs to acquire leases on storage before the QEMU binary is exec’d but it needs to know the child PID too & the lock acquisition code cannot run in the child PID. The virCommand API is thus extended to introduce a handshake capability. Before exec’ing the new binary the child will send a notification to the parent process and then wait for a response before continuing.
- Jan, 2012 – virCommand. The virCommand API will either allow the child to inherit all capabilities or block all capabilities. There are some cases where finer control is required, such as spawning LXC containers with limited privileges. The virCommand API was thus extended to allow specific capabilities to be whitelisted in the child.
- Jan, 2013 – virCommand. It is not uncommon to want to change the UID/GID of a process that is spawned, particularly if it cannot be trusted to change of its own accord, or if it will be unable to change due to filtering of the capabilities to remove the CAP_SETUID bit. The virCommand API is extended to allow an alternate UID+GID to be specified for the command to launch. This change will be done in between the fork+exec at the same time as modifying the process capabilities.
- May, 2013 – virCommand. When spawning QEMU it is necessary to adjust some of the process limits, for example, to raise the maximum file handle count so it is independent of any limit applied to libvirtd. Once again the virCommand APIs are extended so that a number of process limits can be changed in between the fork+exec pair.
- Mar, 2014 – virCommand. When unit testing code it is desirable to avoid interacting with broader system state, so it is hard to test code which spawns external commands to do work. With the virCommand APIs though it was easy to introduce a dry run mode where the test suite can supply a callback to be invoked instead of the actual command. The callback can send back fake data for stdout/stderr as required to test the calling code.
- Sep, 2014 – virCommand. Under UNIX, a child process will inherit its parent’s umask by default, which is not always desirable. For example, although libvirtd may have a umask of 0077, it is desirable for QEMU to get a umask of 0007, so that group shared resources can be set up. The virCommand APIs grew a new option for specifying a umask to set in between the fork+exec stage.
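Several of the steps in the timeline happen in the window between fork() and exec(). The sketch below shows how a handful of them might fit together: blocking signals around fork(), resetting handlers in the child, setting a umask, running a pre-exec callback and choosing between execve() and execvp(). This is emphatically not libvirt's implementation; struct spawn_opts, spawn_child(), wait_child() and MAX_SIGNAL are hypothetical names invented for this illustration, and a threaded program would want pthread_sigmask() where the sketch uses sigprocmask().

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_SIGNAL 64   /* assumption: covers the signals in use */

/* Hypothetical option block, loosely in the spirit of the features above. */
struct spawn_opts {
    char *const *argv;          /* NULL-terminated argument vector */
    char *const *envp;          /* replacement environment, NULL to inherit */
    int set_umask;              /* non-zero: apply 'mask' in the child */
    mode_t mask;
    int (*hook)(void *opaque);  /* optional, runs right before exec */
    void *opaque;
};

/* Fork and exec according to 'o'. Returns the child PID, or -1. */
static pid_t spawn_child(const struct spawn_opts *o)
{
    sigset_t all, old;
    struct sigaction dfl;
    pid_t pid;

    /* Block all signals so no parent handler can fire in the child. */
    sigfillset(&all);
    if (sigprocmask(SIG_SETMASK, &all, &old) < 0)
        return -1;

    pid = fork();
    if (pid != 0) {
        /* Parent, or fork() failure: restore the original mask. */
        int saved = errno;
        sigprocmask(SIG_SETMASK, &old, NULL);
        errno = saved;
        return pid;
    }

    /* Child: reset handlers to defaults first, and only then unblock. */
    dfl.sa_handler = SIG_DFL;
    dfl.sa_flags = 0;
    sigemptyset(&dfl.sa_mask);
    for (int sig = 1; sig <= MAX_SIGNAL; sig++)
        sigaction(sig, &dfl, NULL);
    sigprocmask(SIG_SETMASK, &old, NULL);

    if (o->set_umask)
        umask(o->mask);
    if (o->hook && o->hook(o->opaque) < 0)
        _exit(126);

    /* A custom environment means execve(), otherwise execvp(). */
    if (o->envp)
        execve(o->argv[0], o->argv, o->envp);
    else
        execvp(o->argv[0], o->argv);
    _exit(127);
}

/* Wait for a child and return its exit status, or -1. */
static int wait_child(pid_t pid)
{
    int status;

    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Collecting the options in one structure rather than a long parameter list echoes the motivation given above for introducing virCommand, although the real object offers far more than this sketch does.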
The timeline above shows that libvirt’s own APIs for spawning child processes have grown a huge number of features over the last 7 years of development. They very quickly surpassed system() and popen() in terms of safety from various serious problems while maintaining their ease of use. They have achieved this without having to expose the codebase as a whole to the low level complexities of fork()+exec(). The low level details are isolated in one central place, with the rest of the code using higher level APIs. The next blog post will illustrate just how libvirt’s virCommand APIs are used in practice.