An idea for a race-free way to kill processes with safety checks
When a program has to kill a process, it may well want to perform some safety checks before invoking kill()
, to ensure that the process really is the one that it is expecting. For example, it might check that the /proc/$PID/exe symlink
matches the executable it originally used to start it. Or it might want to check the uid/gid the process is running as from /proc/$PID/status
. Or any number of other attributes about a process that are exposed in /proc/$PID
files.
This is really a similar pattern to the checks that a process might do when operating on a file. For example, it may stat() the file to check it has the required uid/gid ownership. It has long been known that doing stat()
followed by open()
is subject to a race condition which can have serious security implications in many circumstances. For this reason, well written apps will actually do open()
and then fstat()
on the file descriptor, so they’re guaranteed the checks are performed against the file they’re actually operating on.
If the kernel ever wraps around on PID numbers, it should be obvious that apps that do checks on /proc/$PID
before killing a process are also subject to a race condition with potential security implications. This leads to the questions of whether it is possible to design a race free way of checking & killing a process, similar to the open()
+ fstat()
pattern used for files. AFAICT there is not, but there are a couple of ways that one could be constructed without too much extra support from the kernel.
To attack this, we first need an identifier for the process which is unique and not reusable if a process dies & new one takes it place. As mentioned above, if the kernel allows PID numbers to wrap, the PID is unsuitable for this purpose. A file handle for the /proc/$PID
directory though, it potentially suitable for this purpose. We still can’t simply open files like /proc/$PID/exe
directly, since that’s relying on the potentially reusable PID number. Here the openat()
/readlinkat()
system calls come to rescuse. They allow you to pass a file descriptor for a directory and a relative filename. This guarantees that the file opened is in the directory expected, even if the original directory path changes / is replaced. All that is now missing is a way to kill a process, given a FD for /proc/$PID
. There are two ideas for this, first a new system call fkill($fd, $signum)
, or a new proc file /proc/$PID/kill
where you can write signal numbers.
With these pieces, a race free way to validate a process’ executable before killing it would look like
int pidfd = open("/proc/1234", O_RDONLY); char buf[PATH_MAX]; char *exe = readlinkat(pidfd, "exe", buf, sizeof(buf)); if (strcmp(exe, "/usr/bin/mybinary") != 0) return -1; fkill(pidfd, SIGTERM);
Alternatively the last line could look like
char signum[10]; snprintf(signum, sizeof(signum), "%d", SIGTERM); int sigfd = openat(pidfd, "kill", O_WRONLY); write(sigfd, signum, strlen(signum)) close(sigfd)
Though I prefer the fkill()
variant for reasons of brevity.
After reading this, you might wonder how something like systemd safely kills processes it manages without this kind of functionality. Well, it uses cgroups to confine processes associated with each service. Therefore it does not need to care about safety checking arbitrary PIDs, instead it can merely iterate over every PID in the service’s cgroup and kill them. The downside of cgroups is that it is Linux specific, but that doesn’t matter for systemd’s purposes. I believe that a fkill()
syscall, however, is something that could be more easily used in a variety of existing apps without introducing too much portability code. Apps could simply use fkill()
if available, but fallback to regular kill()
elsewhere.
NB I’ve no intention of actually trying to implement this, since I like to stay out of kernel code. It is just an idea I wanted to air in case it interests anyone else.
The proposed API only pretends to be portable. The reason is that it relies on the fact that you can open a /proc directory representing a process – i.e. you take the fact that /proc is mounted for granted. On BSDs, however, it is a common practice not to mount /proc at all, so there is no fd for your API to use.
How is iterating over the PIDs in a cgroup safer than anything else one can do with a PID? They could easily be recycled between read() and kill(), no?