A new (configurable) cgroups layout for libvirt with QEMU, KVM & LXC

Posted: May 13th, 2013 | Filed under: Fedora, libvirt, OpenStack, Virt Tools

Several years ago I wrote a bit about libvirt and cgroups in Fedora 12. Since that time much has changed, and we’ve learnt a lot about the use of cgroups, not all of it good.

Perhaps the biggest change has been the arrival of systemd, which has brought cgroups to the attention of a much wider audience. One of the biggest positive impacts of systemd on cgroups has been a formalization of how to integrate with cgroups as an application developer. Libvirt of course follows these cgroups guidelines, has had input into their definition, and continues to work with the systemd community to improve them.

One of the things we’ve learnt the hard way is that the kernel implementation of control groups is not without cost, and the way applications use cgroups can have a direct impact on the performance of the system. The kernel developers have done a great deal of work to improve the performance and scalability of cgroups but there will always be a cost to their usage which application developers need to be aware of. In broad terms, the performance impact is related to the number of cgroups directories created and particularly to their depth.
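
If you want a rough sense of how many cgroup directories are in play on a host, a trivial way to count them is the following (a sketch, assuming the conventional /sys/fs/cgroup mount point used on systemd hosts):

# count every cgroup directory across all mounted controllers
$ find /sys/fs/cgroup/ -type d | wc -l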

To cut a long story short, it became clear that the directory hierarchy layout libvirt used with cgroups was seriously sub-optimal, or even outright harmful. Thus in libvirt 1.0.5, we introduced some radical changes to the layout created.

Historically libvirt would create a cgroup directory for each virtual machine or container, at a path $LOCATION-OF-LIBVIRTD/libvirt/$DRIVER-NAME/$VMNAME. For example, if libvirtd was placed in /system/libvirtd.service, then a QEMU guest named “web1” would live at /system/libvirtd.service/libvirt/qemu/web1. That’s 5 levels deep already, which is not good.

As of libvirt 1.0.5, libvirt will create a cgroup directory for each virtual machine or container at the path /machine/$VMNAME.libvirt-$DRIVER-NAME. First, notice how this is now completely disassociated from the location of libvirtd itself. This allows the administrator greater flexibility in controlling resources for virtual machines independently of system services. Second, notice that the directory hierarchy is only 2 levels deep by default, so a QEMU guest named “web1” would live at /machine/web1.libvirt-qemu.

The final important change is that the cgroup location of a virtual machine / container can now be configured on a per-guest basis in the XML configuration, to override the default of /machine. So if the guest config says

  <resource>
    <partition>/virtualmachines/production</partition>
  </resource>

then libvirt will create the guest cgroup directory /virtualmachines.partition/production.partition/web1.libvirt-qemu. Notice that there will always be a .partition suffix on these user-defined directories. Only the default top level directories /machine, /system and /user are without a suffix. The suffix ensures that user-defined directories can never clash with anything the kernel will create. The systemd PaxControlGroups document will be updated with this & a few escaping rules soon.
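
For example, with the cpu controller mounted in the conventional location, the resulting hierarchy for the “web1” guest can be inspected directly (a sketch – the exact mount point and set of controllers vary by distro):

$ ls -d /sys/fs/cgroup/cpu/virtualmachines.partition/production.partition/*
/sys/fs/cgroup/cpu/virtualmachines.partition/production.partition/web1.libvirt-qemu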

There is still more we intend to do with cgroups in libvirt, in particular adding APIs for creating & managing these partitions for grouping VMs, so you don’t need to go to a tool outside libvirt to create the directories.

One final thing: libvirt now has a bit of documentation about its cgroups usage, which will serve as the base for future documentation in this area.

Installing a 4 node Fedora 18 OpenStack Folsom cluster with PackStack

Posted: March 1st, 2013 | Filed under: Fedora, libvirt, OpenStack, Virt Tools

For a few months now Derek has been working on a tool called PackStack, which aims to facilitate & automate the deployment of OpenStack services. Most of the time I’ve used DevStack for deploying OpenStack, but that is not at all suitable for doing production quality deployments. I’ve also done production deployments from scratch following the great Fedora instructions. The latter work, but require the admin to do far too much tedious legwork and to know too much about OpenStack in general. This is where PackStack comes in. It starts from the assumption that the admin knows more or less nothing about how the OpenStack tools work. All they need do is decide which services they wish to deploy on each machine. With that answered, PackStack goes off and does the work to make it all happen. Under the hood PackStack does its work by connecting to each machine over SSH, and using Puppet to deploy/configure the services on each one. By leveraging Puppet, PackStack itself is mostly isolated from the differences between various Linux distros. Thus although PackStack has been developed on RHEL and Fedora, it should be well placed for porting to other distros like Debian, & I hope we’ll see that happen in the near future. It will be better for the OpenStack community to have a standard tool that is portable across all target distros, than the current situation where pretty much every distributor of OpenStack has reinvented the wheel, building their own private tooling for deployment. This is why PackStack is being developed as an upstream project, hosted on StackForge, rather than as a Red Hat-only private project.

Preparing the virtual machines

Anyway, back to the point of this blog post. Having followed PackStack’s progress for a while, I decided it was time to actually try it out for real. While I have a great many development machines, I don’t have enough free to turn into an OpenStack cluster, so straight away I decided to do my test deployment inside a set of Fedora 18 virtual machines, running on a Fedora 18 KVM host.

The current PackStack network support requires that you have 2 network interfaces. For an all-in-one box deployment you only actually need one physical NIC for the public interface – you can use ‘lo’ for the private interface on which VMs communicate with each other. I’m doing a multi-node deployment though, so my first step was to decide how to provide networking to my VMs. A standard libvirt install will provide a default NAT based network, using the virbr0 bridge device. This will serve just fine as the public interface over which we can communicate with the OpenStack services & their REST / Web APIs. For VM traffic, I decided to create a second libvirt network on the host machine.

# cat > openstackvms.xml <<EOF
<network>
  <name>openstackvms</name>
  <bridge name='virbr1' stp='off' delay='0' />
</network>
EOF
# virsh net-define openstackvms.xml
Network openstackvms defined from openstackvms.xml

# virsh net-start openstackvms
Network openstackvms started


Next up, I installed a single Fedora 18 guest machine, giving it two network interfaces, the first attached to the ‘default’ libvirt network, and the second attached to the ‘openstackvms’ virtual network.

# virt-install  --name f18x86_64a --ram 1000 --file /var/lib/libvirt/images/f18x86_64a.img \
    --location http://www.mirrorservice.org/sites/dl.fedoraproject.org/pub/fedora/linux/releases/18/Fedora/x86_64/os/ \
    --noautoconsole --vnc --file-size 10 --os-variant fedora18 \
    --network network:default --network network:openstackvms

In the installer, I used the defaults for everything with two exceptions. I selected the “Minimal install” instead of “GNOME Desktop”, and I reduced the size of the swap partition from 2 GB to 200 MB – if the VM ever needs more than a few hundred MB of swap, it is pretty much game over for the responsiveness of that VM. A minimal install is very quick, taking only 5 minutes or so to completely install the RPMs – assuming good download speeds from the install mirror chosen. Now I need to turn that one VM into 4 VMs. For this I looked to the virt-clone tool. This is a fairly crude tool which merely does a copy of each disk image, and then updates the libvirt XML for the guest to give it a new UUID and MAC address. It doesn’t attempt to change anything inside the guest, but for an F18 minimal install this is not a significant issue.

# virt-clone  -o f18x86_64a -n f18x86_64b -f /var/lib/libvirt/images/f18x86_64b.img 
Allocating 'f18x86_64b.img'                                                                                    |  10 GB  00:01:20     

Clone 'f18x86_64b' created successfully.
# virt-clone  -o f18x86_64a -n f18x86_64c -f /var/lib/libvirt/images/f18x86_64c.img 
Allocating 'f18x86_64c.img'                                                                                    |  10 GB  00:01:07     

Clone 'f18x86_64c' created successfully.
# virt-clone  -o f18x86_64a -n f18x86_64d -f /var/lib/libvirt/images/f18x86_64d.img 
Allocating 'f18x86_64d.img'                                                                                    |  10 GB  00:00:59     

Clone 'f18x86_64d' created successfully.

I don’t fancy having to remember the IP address of each of the virtual machines I installed, so I decided to set up some fixed IP address mappings in the libvirt default network, and add aliases to /etc/hosts

# virsh net-destroy default
# virsh net-edit default
...changing the following...
  <ip address='192.168.122.1' netmask='255.255.255.0'>
    <dhcp>
      <range start='192.168.122.2' end='192.168.122.254' />
    </dhcp>
  </ip>

...to this...

  <ip address='192.168.122.1' netmask='255.255.255.0'>
    <dhcp>
      <range start='192.168.122.2' end='192.168.122.99' />
      <host mac='52:54:00:fd:e7:03' name='f18x86_64a' ip='192.168.122.100' />
      <host mac='52:54:00:c4:b7:f6' name='f18x86_64b' ip='192.168.122.101' />
      <host mac='52:54:00:81:84:d6' name='f18x86_64c' ip='192.168.122.102' />
      <host mac='52:54:00:6a:9b:1a' name='f18x86_64d' ip='192.168.122.103' />
    </dhcp>
  </ip>
# cat >> /etc/hosts <<EOF
192.168.122.100 f18x86_64a
192.168.122.101 f18x86_64b
192.168.122.102 f18x86_64c
192.168.122.103 f18x86_64d
EOF
# virsh net-start default

Now we’re ready to actually start the virtual machines

# virsh start f18x86_64a
Domain f18x86_64a started

# virsh start f18x86_64b
Domain f18x86_64b started

# virsh start f18x86_64c
Domain f18x86_64c started

# virsh start f18x86_64d
Domain f18x86_64d started

# virsh list
 Id    Name                           State
----------------------------------------------------
 25    f18x86_64a                     running
 26    f18x86_64b                     running
 27    f18x86_64c                     running
 28    f18x86_64d                     running

Deploying OpenStack with PackStack

All of the above really has nothing to do with OpenStack or PackStack – it is just about me getting some virtual machines ready to act as the pretend “physical servers”. The interesting stuff starts now. PackStack doesn’t need to be installed on the machines that will receive the OpenStack install, but rather on any client machine which has SSH access to the target machines. In my case I decided to run packstack from the physical host running the VMs I just provisioned.

# yum -y install openstack-packstack

While PackStack is happy to prompt you with questions, it is far simpler to just use an answer file straight away. It lets you see upfront everything that is required, and makes it easy for you to repeat the exercise later.

$ packstack --gen-answer-file openstack.txt

The answer file tries to fill in sensible defaults, but there’s not much it can do for IP addresses. So it just fills in the IP address of the host on which it was generated. This is suitable if you’re doing an all-in-one install on the current machine, but not for doing a multi-node install. So the next step is to edit the answer file and customize at least the IP addresses. I have decided that f18x86_64a will be the Horizon frontend and host the user facing APIs from glance/keystone/nova/etc, f18x86_64b will provide QPid, MySQL and the Nova scheduler, and f18x86_64c and f18x86_64d will be compute nodes and swift storage nodes (though I haven’t actually enabled swift in the config).

$ emacs openstack.txt
...make IP address changes...

So you can see what I changed, here is the unified diff

--- openstack.txt	2013-03-01 12:41:31.226476407 +0000
+++ openstack-custom.txt	2013-03-01 12:51:53.877073871 +0000
@@ -4,7 +4,7 @@
 # been installed on the remote servers the user will be prompted for a
 # password and this key will be installed so the password will not be
 # required again
-CONFIG_SSH_KEY=
+CONFIG_SSH_KEY=/home/berrange/.ssh/id_rsa.pub

 # Set to 'y' if you would like Packstack to install Glance
 CONFIG_GLANCE_INSTALL=y
@@ -34,7 +34,7 @@
 CONFIG_NAGIOS_INSTALL=n

 # The IP address of the server on which to install MySQL
-CONFIG_MYSQL_HOST=10.33.8.113
+CONFIG_MYSQL_HOST=192.168.122.101

 # Username for the MySQL admin user
 CONFIG_MYSQL_USER=root
@@ -43,10 +43,10 @@
 CONFIG_MYSQL_PW=5612a75877464b70

 # The IP address of the server on which to install the QPID service
-CONFIG_QPID_HOST=10.33.8.113
+CONFIG_QPID_HOST=192.168.122.101

 # The IP address of the server on which to install Keystone
-CONFIG_KEYSTONE_HOST=10.33.8.113
+CONFIG_KEYSTONE_HOST=192.168.122.100

 # The password to use for the Keystone to access DB
 CONFIG_KEYSTONE_DB_PW=297088140caf407e
@@ -58,7 +58,7 @@
 CONFIG_KEYSTONE_ADMIN_PW=342cc8d9150b4662

 # The IP address of the server on which to install Glance
-CONFIG_GLANCE_HOST=10.33.8.113
+CONFIG_GLANCE_HOST=192.168.122.100

 # The password to use for the Glance to access DB
 CONFIG_GLANCE_DB_PW=a1d8435d61fd4ed2
@@ -83,25 +83,25 @@

 # The IP address of the server on which to install the Nova API
 # service
-CONFIG_NOVA_API_HOST=10.33.8.113
+CONFIG_NOVA_API_HOST=192.168.122.100

 # The IP address of the server on which to install the Nova Cert
 # service
-CONFIG_NOVA_CERT_HOST=10.33.8.113
+CONFIG_NOVA_CERT_HOST=192.168.122.100

 # The IP address of the server on which to install the Nova VNC proxy
-CONFIG_NOVA_VNCPROXY_HOST=10.33.8.113
+CONFIG_NOVA_VNCPROXY_HOST=192.168.122.100

 # A comma separated list of IP addresses on which to install the Nova
 # Compute services
-CONFIG_NOVA_COMPUTE_HOSTS=10.33.8.113
+CONFIG_NOVA_COMPUTE_HOSTS=192.168.122.102,192.168.122.103

 # Private interface for Flat DHCP on the Nova compute servers
 CONFIG_NOVA_COMPUTE_PRIVIF=eth1

 # The IP address of the server on which to install the Nova Network
 # service
-CONFIG_NOVA_NETWORK_HOST=10.33.8.113
+CONFIG_NOVA_NETWORK_HOST=192.168.122.101

 # The password to use for the Nova to access DB
 CONFIG_NOVA_DB_PW=f67f9f822a934509
@@ -116,14 +116,14 @@
 CONFIG_NOVA_NETWORK_PRIVIF=eth1

 # IP Range for Flat DHCP
-CONFIG_NOVA_NETWORK_FIXEDRANGE=192.168.32.0/22
+CONFIG_NOVA_NETWORK_FIXEDRANGE=192.168.123.0/24

 # IP Range for Floating IP's
-CONFIG_NOVA_NETWORK_FLOATRANGE=10.3.4.0/22
+CONFIG_NOVA_NETWORK_FLOATRANGE=192.168.124.0/24

 # The IP address of the server on which to install the Nova Scheduler
 # service
-CONFIG_NOVA_SCHED_HOST=10.33.8.113
+CONFIG_NOVA_SCHED_HOST=192.168.122.101

 # The overcommitment ratio for virtual to physical CPUs. Set to 1.0
 # to disable CPU overcommitment
@@ -131,20 +131,20 @@

 # The overcommitment ratio for virtual to physical RAM. Set to 1.0 to
 # disable RAM overcommitment
-CONFIG_NOVA_SCHED_RAM_ALLOC_RATIO=1.5
+CONFIG_NOVA_SCHED_RAM_ALLOC_RATIO=10

 # The IP address of the server on which to install the OpenStack
 # client packages. An admin "rc" file will also be installed
-CONFIG_OSCLIENT_HOST=10.33.8.113
+CONFIG_OSCLIENT_HOST=192.168.122.100

 # The IP address of the server on which to install Horizon
-CONFIG_HORIZON_HOST=10.33.8.113
+CONFIG_HORIZON_HOST=192.168.122.100

 # To set up Horizon communication over https set this to "y"
 CONFIG_HORIZON_SSL=n

 # The IP address on which to install the Swift proxy service
-CONFIG_SWIFT_PROXY_HOSTS=10.33.8.113
+CONFIG_SWIFT_PROXY_HOSTS=192.168.122.100

 # The password to use for the Swift to authenticate with Keystone
 CONFIG_SWIFT_KS_PW=aec1c74ec67543e7
@@ -155,7 +155,7 @@
 # on 127.0.0.1 as a swift storage device(packstack does not create the
 # filesystem, you must do this first), if /dev is omitted Packstack
 # will create a loopback device for a test setup
-CONFIG_SWIFT_STORAGE_HOSTS=10.33.8.113
+CONFIG_SWIFT_STORAGE_HOSTS=192.168.122.102,192.168.122.103

 # Number of swift storage zones, this number MUST be no bigger than
 # the number of storage devices configured
@@ -223,7 +223,7 @@
 CONFIG_SATELLITE_PROXY_PW=

 # The IP address of the server on which to install the Nagios server
-CONFIG_NAGIOS_HOST=10.33.8.113
+CONFIG_NAGIOS_HOST=192.168.122.100

 # The password of the nagiosadmin user on the Nagios server
 CONFIG_NAGIOS_PW=7e787e71ff18462c

The current version of PackStack in Fedora mistakenly assumes that ‘net-tools’ is installed by default in Fedora. This used to be the case, but as of Fedora 18 it is no longer installed. Upstream PackStack git has switched from using ifconfig to ip to avoid this. So for F18 we temporarily need to make sure the ‘net-tools’ RPM is installed on each host. In addition, the SELinux policy has not been finished for all OpenStack components, so we need to set it to permissive mode.

$ ssh root@f18x86_64a setenforce 0
$ ssh root@f18x86_64b setenforce 0
$ ssh root@f18x86_64c setenforce 0
$ ssh root@f18x86_64d setenforce 0
$ ssh root@f18x86_64a yum -y install net-tools
$ ssh root@f18x86_64b yum -y install net-tools
$ ssh root@f18x86_64c yum -y install net-tools
$ ssh root@f18x86_64d yum -y install net-tools
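
For convenience, the same can be expressed as a short shell loop over the host names:

$ for h in f18x86_64a f18x86_64b f18x86_64c f18x86_64d; do
      ssh root@$h setenforce 0
      ssh root@$h yum -y install net-tools
  done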

Assuming that’s done, we can now just run packstack

# packstack --answer-file openstack-custom.txt
Welcome to Installer setup utility

Installing:
Clean Up...                                              [ DONE ]
Setting up ssh keys...                                   [ DONE ]
Adding pre install manifest entries...                   [ DONE ]
Adding MySQL manifest entries...                         [ DONE ]
Adding QPID manifest entries...                          [ DONE ]
Adding Keystone manifest entries...                      [ DONE ]
Adding Glance Keystone manifest entries...               [ DONE ]
Adding Glance manifest entries...                        [ DONE ]
Adding Cinder Keystone manifest entries...               [ DONE ]
Checking if the Cinder server has a cinder-volumes vg... [ DONE ]
Adding Cinder manifest entries...                        [ DONE ]
Adding Nova API manifest entries...                      [ DONE ]
Adding Nova Keystone manifest entries...                 [ DONE ]
Adding Nova Cert manifest entries...                     [ DONE ]
Adding Nova Compute manifest entries...                  [ DONE ]
Adding Nova Network manifest entries...                  [ DONE ]
Adding Nova Scheduler manifest entries...                [ DONE ]
Adding Nova VNC Proxy manifest entries...                [ DONE ]
Adding Nova Common manifest entries...                   [ DONE ]
Adding OpenStack Client manifest entries...              [ DONE ]
Adding Horizon manifest entries...                       [ DONE ]
Preparing servers...                                     [ DONE ]
Adding post install manifest entries...                  [ DONE ]
Installing Dependencies...                               [ DONE ]
Copying Puppet modules and manifests...                  [ DONE ]
Applying Puppet manifests...
Applying 192.168.122.100_prescript.pp
Applying 192.168.122.101_prescript.pp
Applying 192.168.122.102_prescript.pp
Applying 192.168.122.103_prescript.pp
192.168.122.101_prescript.pp :                                       [ DONE ]
192.168.122.103_prescript.pp :                                       [ DONE ]
192.168.122.100_prescript.pp :                                       [ DONE ]
192.168.122.102_prescript.pp :                                       [ DONE ]
Applying 192.168.122.101_mysql.pp
Applying 192.168.122.101_qpid.pp
192.168.122.101_mysql.pp :                                           [ DONE ]
192.168.122.101_qpid.pp :                                            [ DONE ]
Applying 192.168.122.100_keystone.pp
Applying 192.168.122.100_glance.pp
Applying 192.168.122.101_cinder.pp
192.168.122.100_keystone.pp :                                        [ DONE ]
192.168.122.100_glance.pp :                                          [ DONE ]
192.168.122.101_cinder.pp :                                          [ DONE ]
Applying 192.168.122.100_api_nova.pp
192.168.122.100_api_nova.pp :                                        [ DONE ]
Applying 192.168.122.100_nova.pp
Applying 192.168.122.102_nova.pp
Applying 192.168.122.103_nova.pp
Applying 192.168.122.101_nova.pp
Applying 192.168.122.100_osclient.pp
Applying 192.168.122.100_horizon.pp
192.168.122.101_nova.pp :                                            [ DONE ]
192.168.122.100_nova.pp :                                            [ DONE ]
192.168.122.100_osclient.pp :                                        [ DONE ]
192.168.122.100_horizon.pp :                                         [ DONE ]
192.168.122.103_nova.pp :                                            [ DONE ]
192.168.122.102_nova.pp :                                            [ DONE ]
Applying 192.168.122.100_postscript.pp
Applying 192.168.122.101_postscript.pp
Applying 192.168.122.103_postscript.pp
Applying 192.168.122.102_postscript.pp
192.168.122.100_postscript.pp :                                      [ DONE ]
192.168.122.103_postscript.pp :                                      [ DONE ]
192.168.122.101_postscript.pp :                                      [ DONE ]
192.168.122.102_postscript.pp :                                      [ DONE ]
[ DONE ]

**** Installation completed successfully ******

(Please allow Installer a few moments to start up.....)

Additional information:
* Time synchronization installation was skipped. Please note that unsynchronized time on server instances might be problem for some OpenStack components.
* Did not create a cinder volume group, one already existed
* To use the command line tools you need to source the file /root/keystonerc_admin created on 192.168.122.100
* To use the console, browse to http://192.168.122.100/dashboard
* The installation log file is available at: /var/tmp/packstack/20130301-135443-qbNvvH/openstack-setup.log

That really is it – you didn’t need to touch any config files for OpenStack, QPid, MySQL or any other service involved. PackStack just worked its magic and there is now a 4 node OpenStack cluster up and running. One of the nice things about PackStack using Puppet for all its work is that, if something goes wrong halfway through, you don’t need to throw it all away – just fix the issue and re-run packstack, and it’ll do whatever work was left over from before.
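
As a quick sanity check (a sketch – these client commands are what I’d expect from the Folsom-era tools), you can source the generated credentials file on the frontend node and query the cluster:

$ ssh root@f18x86_64a
# . /root/keystonerc_admin
# keystone user-list       # confirm Keystone is answering
# nova list                # confirm the Nova API is answering (an empty list is fine)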

The results

Let’s see what’s running on each node. First, the frontend user-facing node

$ ssh root@f18x86_64a ps -ax
PID TTY      STAT   TIME COMMAND
1 ?        Ss     0:03 /usr/lib/systemd/systemd --switched-root --system --deserialize 14
283 ?        Ss     0:00 /usr/lib/systemd/systemd-udevd
284 ?        Ss     0:07 /usr/lib/systemd/systemd-journald
348 ?        S      0:00 /usr/lib/systemd/systemd-udevd
391 ?        Ss     0:06 /usr/bin/python -Es /usr/sbin/firewalld --nofork
392 ?        S<sl   0:00 /sbin/auditd -n
394 ?        Ss     0:00 /usr/lib/systemd/systemd-logind
395 ?        Ssl    0:00 /sbin/rsyslogd -n -c 5
397 ?        Ssl    0:01 /bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
403 ?        Ss     0:00 login -- root
411 ?        Ss     0:00 /usr/sbin/crond -n
417 ?        S      0:00 /usr/lib/systemd/systemd-udevd
418 ?        Ssl    0:01 /usr/sbin/NetworkManager --no-daemon
452 ?        Ssl    0:00 /usr/lib/polkit-1/polkitd --no-debug
701 ?        S      0:00 /sbin/dhclient -d -4 -sf /usr/libexec/nm-dhcp-client.action -pf /var/run/dhclient-eth0.pid -lf /var/lib/dhclient/dhclient-4d0e96db-64cd-41d3-a9c3-c584da37dd84-eth0.lease -cf /var/run/nm-dhclient-eth0.conf eth0
769 ?        Ss     0:00 /usr/sbin/sshd -D
772 ?        Ss     0:00 sendmail: accepting connections
792 ?        Ss     0:00 sendmail: Queue runner@01:00:00 for /var/spool/clientmqueue
800 tty1     Ss+    0:00 -bash
8702 ?        Ss     0:00 /usr/bin/python /usr/bin/glance-registry --config-file /etc/glance/glance-registry.conf
8745 ?        S      0:00 /usr/bin/python /usr/bin/glance-registry --config-file /etc/glance/glance-registry.conf
8764 ?        Ss     0:00 /usr/bin/python /usr/bin/glance-api --config-file /etc/glance/glance-api.conf
10030 ?        Ss     0:01 /usr/bin/python /usr/bin/keystone-all --config-file /etc/keystone/keystone.conf
10201 ?        S      0:00 /usr/bin/python /usr/bin/glance-api --config-file /etc/glance/glance-api.conf
13096 ?        Ss     0:01 /usr/bin/python /usr/bin/nova-api --config-file /etc/nova/nova.conf --logfile /var/log/nova/api.log
13103 ?        S      0:00 /usr/bin/python /usr/bin/nova-api --config-file /etc/nova/nova.conf --logfile /var/log/nova/api.log
13111 ?        S      0:00 /usr/bin/python /usr/bin/nova-api --config-file /etc/nova/nova.conf --logfile /var/log/nova/api.log
13120 ?        S      0:00 /usr/bin/python /usr/bin/nova-api --config-file /etc/nova/nova.conf --logfile /var/log/nova/api.log
13484 ?        Ss     0:05 /usr/bin/python /usr/bin/nova-consoleauth --config-file /etc/nova/nova.conf --logfile /var/log/nova/consoleauth.log
20354 ?        Ss     0:00 python /usr/bin/nova-novncproxy --web /usr/share/novnc/
20429 ?        Ss     0:03 /usr/bin/python /usr/bin/nova-cert --config-file /etc/nova/nova.conf --logfile /var/log/nova/cert.log
21035 ?        Ssl    0:00 /usr/bin/memcached -u memcached -p 11211 -m 922 -c 8192 -l 0.0.0.0 -U 11211 -t 1
21311 ?        Ss     0:00 /usr/sbin/httpd -DFOREGROUND
21312 ?        Sl     0:00 /usr/sbin/httpd -DFOREGROUND
21313 ?        S      0:00 /usr/sbin/httpd -DFOREGROUND
21314 ?        S      0:00 /usr/sbin/httpd -DFOREGROUND
21315 ?        S      0:00 /usr/sbin/httpd -DFOREGROUND
21316 ?        S      0:00 /usr/sbin/httpd -DFOREGROUND
21317 ?        S      0:00 /usr/sbin/httpd -DFOREGROUND
21632 ?        S      0:00 /usr/sbin/httpd -DFOREGROUND

Now the infrastructure node

$ ssh root@f18x86_64b ps -ax
PID TTY      STAT   TIME COMMAND
1 ?        Ss     0:02 /usr/lib/systemd/systemd --switched-root --system --deserialize 21
289 ?        Ss     0:00 /usr/lib/systemd/systemd-udevd
290 ?        Ss     0:05 /usr/lib/systemd/systemd-journald
367 ?        S      0:00 /usr/lib/systemd/systemd-udevd
368 ?        S      0:00 /usr/lib/systemd/systemd-udevd
408 ?        Ss     0:04 /usr/bin/python -Es /usr/sbin/firewalld --nofork
409 ?        S<sl   0:00 /sbin/auditd -n
411 ?        Ss     0:00 /usr/lib/systemd/systemd-logind
412 ?        Ssl    0:00 /sbin/rsyslogd -n -c 5
414 ?        Ssl    0:00 /bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
419 tty1     Ss+    0:00 /sbin/agetty --noclear tty1 38400 linux
429 ?        Ss     0:00 /usr/sbin/crond -n
434 ?        Ssl    0:01 /usr/sbin/NetworkManager --no-daemon
484 ?        Ssl    0:00 /usr/lib/polkit-1/polkitd --no-debug
717 ?        S      0:00 /sbin/dhclient -d -4 -sf /usr/libexec/nm-dhcp-client.action -pf /var/run/dhclient-eth0.pid -lf /var/lib/dhclient/dhclient-2c0f596e-002a-49b0-b3f6-5e228601e7ba-eth0.lease -cf /var/run/nm-dhclient-eth0.conf eth0
766 ?        Ss     0:00 sendmail: accepting connections
792 ?        Ss     0:00 sendmail: Queue runner@01:00:00 for /var/spool/clientmqueue
805 ?        Ss     0:00 /usr/sbin/sshd -D
8531 ?        Ss     0:00 /bin/sh /usr/bin/mysqld_safe --basedir=/usr
8884 ?        Sl     0:15 /usr/libexec/mysqld --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib64/mysql/plugin --log-error=/var/log/mysqld.log --pid-file=/var/run/mysqld/mysqld.pid --socket=/var/lib/mysql/mysql.sock --port=3306
9778 ?        Ssl    0:01 /usr/sbin/qpidd --config /etc/qpidd.conf
10004 ?        S<     0:00 [loop2]
13381 ?        Ss     0:02 /usr/sbin/tgtd -f
14831 ?        Ss     0:00 /usr/bin/python /usr/bin/cinder-api --config-file /etc/cinder/cinder.conf --logfile /var/log/cinder/api.log
14907 ?        Ss     0:04 /usr/bin/python /usr/bin/cinder-scheduler --config-file /etc/cinder/cinder.conf --logfile /var/log/cinder/scheduler.log
14956 ?        Ss     0:02 /usr/bin/python /usr/bin/cinder-volume --config-file /etc/cinder/cinder.conf --logfile /var/log/cinder/volume.log
15516 ?        Ss     0:06 /usr/bin/python /usr/bin/nova-scheduler --config-file /etc/nova/nova.conf --logfile /var/log/nova/scheduler.log
15609 ?        Ss     0:08 /usr/bin/python /usr/bin/nova-network --config-file /etc/nova/nova.conf --logfile /var/log/nova/network.log

And finally one of the 2 compute nodes

$ ssh root@f18x86_64c ps -ax
PID TTY      STAT   TIME COMMAND
  1 ?        Ss     0:02 /usr/lib/systemd/systemd --switched-root --system --deserialize 21
315 ?        Ss     0:00 /usr/lib/systemd/systemd-udevd
317 ?        Ss     0:04 /usr/lib/systemd/systemd-journald
436 ?        Ss     0:05 /usr/bin/python -Es /usr/sbin/firewalld --nofork
437 ?        S<sl   0:00 /sbin/auditd -n
439 ?        Ss     0:00 /usr/lib/systemd/systemd-logind
440 ?        Ssl    0:00 /sbin/rsyslogd -n -c 5
442 ?        Ssl    0:00 /bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
454 ?        Ss     0:00 /usr/sbin/crond -n
455 tty1     Ss+    0:00 /sbin/agetty --noclear tty1 38400 linux
465 ?        S      0:00 /usr/lib/systemd/systemd-udevd
466 ?        S      0:00 /usr/lib/systemd/systemd-udevd
470 ?        Ssl    0:01 /usr/sbin/NetworkManager --no-daemon
499 ?        Ssl    0:00 /usr/lib/polkit-1/polkitd --no-debug
753 ?        S      0:00 /sbin/dhclient -d -4 -sf /usr/libexec/nm-dhcp-client.action -pf /var/run/dhclient-eth0.pid -lf /var/lib/dhclient/dhclient-ada59d24-375c-481e-bd57-ce0803ac5574-eth0.lease -cf /var/run/nm-dhclient-eth0.conf eth0
820 ?        Ss     0:00 sendmail: accepting connections
834 ?        Ss     0:00 sendmail: Queue runner@01:00:00 for /var/spool/clientmqueue
846 ?        Ss     0:00 /usr/sbin/sshd -D
9749 ?        Ssl    0:13 /usr/sbin/libvirtd
16060 ?        Sl     0:01 /usr/bin/python -Es /usr/sbin/tuned -d
16163 ?        Ssl    0:03 /usr/bin/python /usr/bin/nova-compute --config-file /etc/nova/nova.conf --logfile /var/log/nova/compute.log

All-in-all PackStack exceeded my expectations for such a young tool – it did a great job with a minimum of fuss, and was nice & reliable at what it did too. The only problem I hit was forgetting to set SELinux permissive first, which was not its fault – that is a bug in the Fedora policy which we will be addressing – and it recovered from that just fine when I re-ran it after setting permissive mode.

A reminder why you should never mount guest disk images on the host OS

Posted: February 20th, 2013 | Filed under: Coding Tips, Fedora, libvirt, OpenStack, Security, Virt Tools

The OpenStack Nova project has the ability to inject files into a guest filesystem immediately prior to booting the virtual machine instance. Historically the way it did this was to set up either a loop or NBD device on the host OS and then mount the guest filesystem directly on the host OS. One of the high priority tasks for Red Hat engineers when we became involved in OpenStack was to integrate libguestfs FUSE into Nova to replace the use of loopback + NBD devices, and then subsequently refactor Nova to introduce a VFS layer which enables use of the native libguestfs API to avoid any interaction with the host filesystem at all.

There has already been a vulnerability in the Nova code which allowed a malicious user to inject files to arbitrary locations in the host filesystem. This has of course been fixed, but even so, mounting guest disk images on the host OS should still be considered very bad practice. The libguestfs manual describes the remaining risk quite well:

When you mount a filesystem, mistakes in the kernel filesystem (VFS) can be escalated into exploits by attackers creating a malicious filesystem. These exploits are very severe for two reasons. Firstly there are very many filesystem drivers in the kernel, and many of them are infrequently used and not much developer attention has been paid to the code. Linux userspace helps potential crackers by detecting the filesystem type and automatically choosing the right VFS driver, even if that filesystem type is unexpected. Secondly, a kernel-level exploit is like a local root exploit (worse in some ways), giving immediate and total access to the system right down to the hardware level

Libguestfs provides protection against this risk by creating a virtual machine inside which all guest filesystem manipulations are performed. Thus even if the guest kernel gets compromised by a VFS flaw, the attacker still has to break out of the KVM virtual machine and its sVirt confinement to stand a chance of compromising the host OS. Some people have doubted the severity of this kernel VFS driver risk in the past, but an article posted on LWN today should serve to reinforce the fact that libguestfs is right to be paranoid. The article highlights two kernel filesystem vulnerabilities (one in ext4, which is enabled in pretty much all Linux hosts) which left hosts vulnerable for as long as 3 years in some cases:

  • CVE-2009-4307: a divide-by-zero crash in the ext4 filesystem code. Causing this oops requires convincing the user to mount a specially-crafted ext4 filesystem image
  • CVE-2009-4020: a buffer overflow in the HFS+ filesystem exploitable, once again, by convincing a user to mount a specially-crafted filesystem image on the target system.

If the user has access to an OpenStack deployment which is not using libguestfs for file injection, then “convincing a user to mount a specially crafted filesystem” merely requires them to upload their evil filesystem image to glance and then request Nova to boot it.

Anyone deploying OpenStack with file injection enabled is strongly advised to make sure libguestfs is installed, to avoid any direct exposure of the host OS kernel to untrusted guest images.
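
On Fedora this amounts to making sure the relevant packages are present on the compute nodes (a sketch – package names as shipped in Fedora at the time):

# yum -y install libguestfs python-libguestfs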

While I picked on OpenStack as a convenient way to illustrate the problem here, it is not unique to OpenStack. Far too frequently I find documentation relating to virtualization that suggests people mount untrusted disk images directly on their OS. Based on their documented features I’m confident that a number of public virtual machine hosting companies will be mounting untrusted user disk images on their virtualization hosts, likely without using libguestfs for protection.

Installing Fedora 17 ARM on a Samsung Google Chromebook

Posted: November 30th, 2012 | Filed under: Fedora, libvirt, Virt Tools

Today I purchased one of the lovely new ARM based Samsung Google Chromebooks. While ChromeOS is very slick, I don’t have much interest in running it; I got this machine to serve as a dev machine for doing ARM work on libvirt and related projects. Not coincidentally, this machine has the ARM Cortex-A15 CPU, which is the architecture targeted by the new KVM for ARM work.

Obviously the first task at hand is to actually get Fedora onto the machine. Fortunately Christopher Hewitt has been here already and wrote up some notes for Fedora 17 on Google+. The notes look good, but the first step in installing Fedora is to use an Ubuntu host to build the SD card. So what follows are my slightly tweaked instructions for doing the same, starting from a Fedora 17 host.

Building the bootable SD card

We don’t want to blow away ChromeOS entirely, so this guide is focused on booting from an SD card. I have a bunch of 8 GB SD cards for my SLR, so one of those is re-purposed for this task.

Creating partitions & filesystems

The first part of the process is to create the partitions and root filesystem. All the instructions below are to be run as root on the Fedora 17 host.

  1. Insert the SD card in the host machine, let GNOME automount it, and then look to see what device it is
    # df
    Filesystem                              1K-blocks       Used Available Use% Mounted on
    rootfs                                  151607468  127703344  16202828  89% /
    devtmpfs                                  1958948          0   1958948   0% /dev
    tmpfs                                     1970892        284   1970608   1% /dev/shm
    tmpfs                                     1970892       1656   1969236   1% /run
    /dev/mapper/vg_t500wlan-lv_root         151607468  127703344  16202828  89% /
    tmpfs                                     1970892          0   1970892   0% /sys/fs/cgroup
    tmpfs                                     1970892          0   1970892   0% /media
    /dev/sda1                                  198337     126585     61512  68% /boot
    /dev/mmcblk0p1                            7753728    7668960     84768  99% /run/media/berrange/NIKON D90

    So in this case, I see the device is /dev/mmcblk0 (ignoring the p1 suffix, which is the partition number)

  2. Take a look at the SD card and double-check there are no important files still on it, then unmount it. I found a few photographs that I hadn’t transferred, so I was happy I checked
  3. Being a nice modern platform, the Chromebook doesn’t want no stinkin’ MS-DOS partition table. This is a GPT world we live in. Using parted, we re-format the SD card with a GPT table (well, ok, if you must know, this creates an MS-DOS partition table too for back compat)
     # parted /dev/mmcblk0
    GNU Parted 3.0
    Using /dev/mmcblk0
    Welcome to GNU Parted! Type 'help' to view a list of commands.
    (parted) mktable gpt                                                      
    Warning: The existing disk label on /dev/mmcblk0 will be destroyed and all data on this disk will be lost. Do you want to continue?
    Yes/No? yes                                                               
    (parted) quit                                                             
    Information: You may need to update /etc/fstab.
  4. Now it’s time to create 3 partitions: 2 for storing kernels, and 1 for storing the filesystem. You could just create 1 kernel partition if you don’t care about having a backup sane kernel for emergencies
    # gdisk /dev/mmcblk0
    GPT fdisk (gdisk) version 0.8.4
    
    Partition table scan:
      MBR: protective
      BSD: not present
      APM: not present
      GPT: present
    
    Found valid GPT with protective MBR; using GPT.
    
    Command (? for help): x
    
    Expert command (? for help): l
    Enter the sector alignment value (1-65536, default = 2048): 8192
    
    Expert command (? for help): m
    
    Command (? for help): n
    Partition number (1-128, default 1): 1
    First sector (34-15523806, default = 8192) or {+-}size{KMGTP}: 
    Last sector (8192-15523806, default = 15523806) or {+-}size{KMGTP}: +16M
    Current type is 'Linux filesystem'
    Hex code or GUID (L to show codes, Enter = 8300): 7f00
    Changed type of partition to 'ChromeOS kernel'
    
    Command (? for help): n
    Partition number (2-128, default 2): 2
    First sector (34-15523806, default = 40960) or {+-}size{KMGTP}: 
    Last sector (40960-15523806, default = 15523806) or {+-}size{KMGTP}: +16M
    Current type is 'Linux filesystem'
    Hex code or GUID (L to show codes, Enter = 8300): 7f00
    Changed type of partition to 'ChromeOS kernel'
    
    Command (? for help): n
    Partition number (3-128, default 3): 3
    First sector (34-15523806, default = 73728) or {+-}size{KMGTP}: 
    Last sector (73728-15523806, default = 15523806) or {+-}size{KMGTP}: 
    Current type is 'Linux filesystem'
    Hex code or GUID (L to show codes, Enter = 8300): 
    Changed type of partition to 'Linux filesystem'
    
    Command (? for help): w
    
    Final checks complete. About to write GPT data. THIS WILL OVERWRITE EXISTING
    PARTITIONS!!
    
    Do you want to proceed? (Y/N): y
    OK; writing new GUID partition table (GPT) to /dev/mmcblk0.
    The operation has completed successfully.
  5. With that out of the way, it is time to format the filesystem on the 3rd partition, and ext4 is the choice we’re going for here
    # mkfs.ext4 /dev/mmcblk0p3
    mke2fs 1.42.3 (14-May-2012)
    Discarding device blocks: done                            
    Filesystem label=
    OS type: Linux
    Block size=4096 (log=2)
    Fragment size=4096 (log=2)
    Stride=0 blocks, Stripe width=0 blocks
    483328 inodes, 1931259 blocks
    96562 blocks (5.00%) reserved for the super user
    First data block=0
    Maximum filesystem blocks=1979711488
    59 block groups
    32768 blocks per group, 32768 fragments per group
    8192 inodes per group
    Superblock backups stored on blocks: 
    	32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632
    
    Allocating group tables: done                            
    Writing inode tables: done                            
    Creating journal (32768 blocks): done
    Writing superblocks and filesystem accounting information: done
  6. A filesystem isn’t too interesting unless it has some data in it. Let’s download a suitable Fedora 17 ARM filesystem archive and copy its data across
    # wget http://download.fedoraproject.org/pub/fedora-secondary/releases/17/Images/armhfp/Fedora-17-armhfp-xfce.tar.xz
    # mount /dev/mmcblk0p3 /mnt
    # cd /mnt
    # tar Jxvf /root/Fedora-17-armhfp-xfce.tar.xz

    When unpacked, the filesystem will consume about 2.4 GB of space, so an 8 GB SD card isn’t strictly needed – a 4 GB one would suffice.

  7. The core F17 image is now installed on the SD card, so unmount the card. The next steps take place on the ChromeBook
    # umount /dev/mmcblk0p3

Getting the Chromebook into Developer Mode

Before moving onto the next stage, the ChromeBook must be in Developer Mode.

  1. Booting into Recovery Mode is the first thing to do. This requires holding down the Esc and Refresh keys while pressing the Power key.
  2. It should now display a nice scary message that the installation is broken. This is of course a lie, so ignore it.
  3. Pressing Ctrl-D now will start the switch into Developer Mode.
  4. It should now display a less scary message asking if you want to turn off verification. We do indeed want to do this, so press Enter.
  5. The ChromeBook will reboot and tell you that verification is disabled.
  6. Wait a short while, or press Ctrl-D, and it’ll display another message that it is transitioning to Developer Mode.
  7. Erasing local data can take a while, so be patient. Strangely, there’s a crude ASCII art progress bar along the top of the screen, while everything else remains pretty & graphical.

Copying the ChromeBook ARM kernels

Fedora 17 doesn’t come with a suitable kernel for the Cortex-A15 CPU, so this step grabs the one already present in ChromeOS.

  1. Assuming the ChromeBook is now booted in Developer Mode, insert the SD card.
  2. Bring up a crosh prompt by typing Ctrl-Alt-T
  3. From here, run ‘shell’ which gives you a bash session.
    crosh> shell
    chronos@localhost / $
  4. Become the root user
    $ sudo su -
    localhost ~ #
  5. The kernel image has the boot arguments embedded in it, and these need to be changed for Fedora. This involves a fun little command which bundles up the vmlinuz file, config args and boot keys into one magic image
    # cd /tmp
    # echo "console=tty1 debug verbose root=/dev/mmcblk1p3 rootwait rw" > /tmp/config
    # vbutil_kernel --pack /tmp/newkern --keyblock /usr/share/vboot/devkeys/kernel.keyblock --version 1 --signprivate /usr/share/vboot/devkeys/kernel_data_key.vbprivk --config=/tmp/config --vmlinuz /boot/vmlinuz-3.4.0 --arch arm
  6. Assuming that worked, the new kernel image can be copied across to the new SD card partitions created earlier
    # dd if=/tmp/newkern of=/dev/mmcblk1p1
    # dd if=/tmp/newkern of=/dev/mmcblk1p2
    # rm /tmp/newkern /tmp/config
  7. Having the kernels in a partition isn’t much use, unless you mark those partitions as bootable
    # cgpt add -i 1 -S 1 -T 5 -P 10 -l KERN-A /dev/mmcblk1
    # cgpt add -i 2 -S 1 -T 5 -P 5 -l KERN-B /dev/mmcblk1

    This says to try booting from the first partition 10 times, then fall back to trying the second partition 5 times. This fallback behaviour is handy if you want to test out custom kernel builds, and have an escape route for when it all goes belly up.

  8. Much of the ChromeOS kernel is built as modules, so those also need to be copied across to the filesystem image
    # cp -rf /lib/modules/* /media/removable/External\ Drive\ 1/lib/modules/
    # cp -rf /lib/firmware/* /media/removable/External\ Drive\ 1/lib/firmware/
  9. Even in Developer Mode, the Chromebook will ignore SD/USB devices when booting, unless explicitly told to look at them
    # crossystem dev_boot_usb=1

    It might print out a couple of warnings, but those can be ignored – it will have done what is required.

  10. Xorg will work fine out of the box, but the TouchPad sensitivity could do with improving
    # cat > /media/removable/External\ Drive\ 1/etc/X11/xorg.conf.d/50-touchpad.conf <<EOF
    Section "InputClass"
      Identifier "touchpad"
      MatchIsTouchpad "on"
      Option "FingerHigh" "5"
      Option "FingerLow" "5"
    EndSection
    EOF

    Some other guides suggest copying across the Xorg drivers armsoc_drv.so and cmt_drv.so, but I wanted to keep my system as vanilla & open source as possible.

  11. For sound to work, copy the alsa profiles across too
    # cp -a /usr/share/alsa/ucm/ /media/removable/External\ Drive\ 1/usr/share/alsa/ucm

    WARNING: Under no circumstances use ‘alsamixer’ to adjust settings on your ChromeBook. Multiple people have reported (with pictures) melting/burning out the speakers (no, seriously, you can kill the hardware)

  12. Set things up so that the alsa settings are loaded on startup
    # cat > /media/removable/External\ Drive\ 1/etc/rc.d/rc.local <<EOF
    #!/bin/sh
    alsaucm -C DAISY-I2S set _verb HiFi
    EOF
    # chmod +x /media/removable/External\ Drive\ 1/etc/rc.d/rc.local
    

    Note that some instructions omit the ‘set _verb HiFi’ part of this command line. My experience is that audio did not work unless that was set.

  13. The final thing to do is configure the root password and a user account
    # mount -o remount,suid,exec /media/removable/External\ Drive\ 1/
    # chroot /media/removable/External\ Drive\ 1/
    # passwd
    ...enter new root password...
    # mkdir /home/guest
    # rsync -a /etc/skel/ /home/guest
    # chown -R guest:users /home/guest
    # usermod -d /home/guest -g users guest
    # exit
    # umount /media/removable/External\ Drive\ 1/

That’s it, you’re now ready to rock-n-roll. Power-cycle the Chromebook, and when the initial boot screen appears, press Ctrl-U to boot from the SD card.

First impressions of Fedora 17 on the Chromebook

Writing these instructions was interrupted part way through, when I began to realize that the TrackPad on my Chromebook was faulty. It didn’t work at all on the first boot of ChromeOS, but the second time around it was working fine, so I thought nothing of it. It just got worse & worse from then on though. Looking at the kernel dmesg logs showed the hardware did not even appear visible on 60% of boot attempts. It is probably just a loose cable inside, but I didn’t fancy opening up the case, so I took it back to PC World for an exchange. Thankfully the second unit worked flawlessly.

The first annoying thing to discover is that it is only possible to boot custom distros on the ChromeBook while in Developer Mode. This means that at every power-on, the boot screen pauses for 30 seconds with a scary warning. AFAICT, there is no ability to sign the kernel images such that you can boot in Release Mode without the warning. So before distros have even done any work, it appears guaranteed that their user experience will be worse than ChromeOS on this hardware. Thank you, “verified boot”.

Booting up Fedora 17 is significantly slower than booting ChromeOS. While some of this might be Fedora’s fault, I’m not prepared to lay any blame here, since I’m comparing booting from the internal drive for ChromeOS vs booting from an SD card for Fedora. It just isn’t a fair comparison unless I can get Fedora onto the internal drive. It also isn’t as pretty, though again that is mostly my fault for not using graphical boot with Fedora.

If the procedure above is followed, you will get a pretty functional Fedora 17 XFCE desktop environment. Since I skipped the part where others recommend copying over the binary Xorg driver for the graphics card, X performance isn’t going to win any benchmarks though. Playing a YouTube video in Firefox with HTML5, at the smallest video size, consumes 95% CPU time. I expect proper graphics drivers would help that significantly.

In any case, I’m not doing any of this for Xorg / graphics performance. This is primarily going to be a dev machine which I access remotely for doing libvirt work.

The hope is that when Fedora 18 is released for ARM, there will be a remix done specifically for the new ChromeBook which will simplify the setup process, use a normal Fedora kernel, and just generally “do the right thing”. I may well try to re-do this image using an F18 alpha filesystem as the source instead.

An idea for a race-free way to kill processes with safety checks

Posted: November 29th, 2012 | Filed under: Coding Tips, Fedora, libvirt, Security

When a program has to kill a process, it may well want to perform some safety checks before invoking kill(), to ensure that the process really is the one that it is expecting. For example, it might check that the /proc/$PID/exe symlink matches the executable it originally used to start it. Or it might want to check the uid/gid the process is running as from /proc/$PID/status. Or any number of other attributes about a process that are exposed in /proc/$PID files.
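
In shell terms, the racy pattern being described is essentially the following (a sketch; PID 1234 and the binary path are purely hypothetical):

$ readlink /proc/1234/exe
/usr/bin/mybinary
$ kill -TERM 1234    # racy: PID 1234 may have been recycled in between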

This is really a similar pattern to the checks that a process might do when operating on a file. For example, it may stat() the file to check it has the required uid/gid ownership. It has long been known that doing stat() followed by open() is subject to a race condition which can have serious security implications in many circumstances. For this reason, well-written apps will actually do open() and then fstat() on the file descriptor, so they’re guaranteed the checks are performed against the file they’re actually operating on.

If the kernel ever wraps around on PID numbers, it should be obvious that apps that do checks on /proc/$PID before killing a process are also subject to a race condition with potential security implications. This leads to the question of whether it is possible to design a race-free way of checking & killing a process, similar to the open() + fstat() pattern used for files. AFAICT there is not, but there are a couple of ways one could be constructed without too much extra support from the kernel.

To attack this, we first need an identifier for the process which is unique, and not reusable if a process dies & a new one takes its place. As mentioned above, if the kernel allows PID numbers to wrap, the PID is unsuitable for this purpose. A file handle for the /proc/$PID directory, though, is potentially suitable for this purpose. We still can’t simply open files like /proc/$PID/exe directly, since that’s relying on the potentially reusable PID number. Here the openat()/readlinkat() system calls come to the rescue. They allow you to pass a file descriptor for a directory and a relative filename. This guarantees that the file opened is in the directory expected, even if the original directory path changes / is replaced. All that is now missing is a way to kill a process given an FD for /proc/$PID. There are two ideas for this: first, a new system call fkill($fd, $signum); or second, a new proc file /proc/$PID/kill to which you can write signal numbers.

With these pieces, a race-free way to validate a process’s executable before killing it would look like

int pidfd = open("/proc/1234", O_RDONLY);
char buf[PATH_MAX];
ssize_t len = readlinkat(pidfd, "exe", buf, sizeof(buf) - 1);
if (len < 0)
    return -1;
buf[len] = '\0'; /* readlinkat() does not NUL-terminate */
if (strcmp(buf, "/usr/bin/mybinary") != 0)
    return -1;
fkill(pidfd, SIGTERM);

Alternatively the last line could look like

char signum[10];
snprintf(signum, sizeof(signum), "%d", SIGTERM);
int sigfd = openat(pidfd, "kill", O_WRONLY);
write(sigfd, signum, strlen(signum));
close(sigfd);

Though I prefer the fkill() variant for reasons of brevity.

After reading this, you might wonder how something like systemd safely kills processes it manages without this kind of functionality. Well, it uses cgroups to confine the processes associated with each service. Therefore it does not need to care about safety checking arbitrary PIDs; instead it can merely iterate over every PID in the service’s cgroup and kill them. The downside of cgroups is that they are Linux specific, but that doesn’t matter for systemd’s purposes. I believe that an fkill() syscall, however, is something that could be more easily used in a variety of existing apps without introducing too much portability code. Apps could simply use fkill() if available, but fall back to regular kill() elsewhere.
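
To illustrate the cgroup approach (a sketch – the exact hierarchy path varies across systemd versions, and the service name here is hypothetical), killing every process in a service’s cgroup boils down to something like:

# signal every PID currently confined to the service's cgroup
for pid in $(cat /sys/fs/cgroup/systemd/system/httpd.service/tasks); do
    kill -TERM "$pid"
done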

NB I’ve no intention of actually trying to implement this, since I like to stay out of kernel code. It is just an idea I wanted to air in case it interests anyone else.