I was recently asked to outline some of the risks of virtualization with respect to networking, in particular, how guests running on the same network could attack each other’s network traffic. The examples in this blog post will consider a scenario with three guests running on the same host, connected to the libvirt default virtual network (backed by the virbr0 bridge device). As is traditional, the two guests trying to communicate shall be called alice and bob, while the attacker/eavesdropper shall be eve. Provision three guests with those names, and make sure their network configuration looks like this:
<interface type='network'> (for the VM 'alice')
<mac address='52:54:00:00:00:11'/>
<source network='default'/>
<target dev='vnic-alice'/>
<model type='virtio'/>
</interface>
<interface type='network'> (for the VM 'bob')
<mac address='52:54:00:00:00:22'/>
<source network='default'/>
<target dev='vnic-bob'/>
<model type='virtio'/>
</interface>
<interface type='network'> (for the VM 'eve')
<mac address='52:54:00:00:00:33'/>
<source network='default'/>
<target dev='vnic-eve'/>
<model type='virtio'/>
</interface>
If the guest interfaces are to be configured using DHCP, it is desirable to have predictable IP addresses for alice, bob & eve. This can be achieved by altering the default network configuration:
# virsh net-destroy default
# virsh net-edit default
In the editor change the IP configuration to look like
<ip address='192.168.122.1' netmask='255.255.255.0'>
<dhcp>
<range start='192.168.122.2' end='192.168.122.254' />
<host mac='52:54:00:00:00:11' name='alice' ip='192.168.122.11' />
<host mac='52:54:00:00:00:22' name='bob' ip='192.168.122.22' />
<host mac='52:54:00:00:00:33' name='eve' ip='192.168.122.33' />
</dhcp>
</ip>
With all these changes made, start the network and the guests
# virsh net-start default
# virsh start alice
# virsh start bob
# virsh start eve
After starting these three guests, the host sees the following bridge configuration
# brctl show
bridge name bridge id STP enabled interfaces
virbr0 8000.fe5200000033 yes vnic-alice
vnic-bob
vnic-eve
For the sake of testing, the “very important” communication between alice and bob will be a repeating ICMP ping. So log in to ‘alice’ (via the console, not the network) and leave the following command running forever:
# ping bob
PING bob.test.berrange.com (192.168.122.22) 56(84) bytes of data.
64 bytes from bob.test.berrange.com (192.168.122.22): icmp_req=1 ttl=64 time=0.790 ms
64 bytes from bob.test.berrange.com (192.168.122.22): icmp_req=2 ttl=64 time=0.933 ms
64 bytes from bob.test.berrange.com (192.168.122.22): icmp_req=3 ttl=64 time=0.854 ms
...
Attacking VMs on a hub
The first thought might be for eve to just run ‘tcpdump’ (again via the console shell, not a network shell):
# tcpdump icmp
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
...nothing captured...
Fortunately Linux bridge devices act as switches by default, so eve won’t see any traffic flowing between alice and bob. For the sake of completeness though, I should point out that it is possible to make a Linux bridge act as a hub instead of a switch. This can be done as follows:
# brctl setfd virbr0 0
# brctl setageing virbr0 0
Switching back to the tcpdump session in eve should now show the traffic between alice and bob being captured:
10:38:15.644181 IP alice.test.berrange.com > bob.test.berrange.com: ICMP echo request, id 8053, seq 29, length 64
10:38:15.644620 IP bob.test.berrange.com > alice.test.berrange.com: ICMP echo reply, id 8053, seq 29, length 64
10:38:16.645523 IP alice.test.berrange.com > bob.test.berrange.com: ICMP echo request, id 8053, seq 30, length 64
10:38:16.645886 IP bob.test.berrange.com > alice.test.berrange.com: ICMP echo reply, id 8053, seq 30, length 64
Attacking VMs on a switch using MAC spoofing
Putting the bridge into ‘hub mode’ was cheating though, so reverse that setting on the host
# brctl setageing virbr0 300
Since the switch is clever enough to only send traffic out of the port where it has seen the corresponding MAC address, perhaps eve can impersonate bob by spoofing his MAC address. MAC spoofing is quite straightforward; in the console for eve run:
# ifdown eth0
# ifconfig eth0 hw ether 52:54:00:00:00:22
# ifconfig eth0 up
# ifconfig eth0 192.168.122.33/24
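On newer distributions where the legacy ifconfig command is absent, roughly the same spoofing can be done with the iproute2 tools; a sketch, assuming the same interface name:
# ip link set dev eth0 down
# ip link set dev eth0 address 52:54:00:00:00:22
# ip link set dev eth0 up
# ip addr add 192.168.122.33/24 dev eth0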
Now that the interface is up with eve’s IP address, but bob’s MAC address, the final step is to poison the host switch’s MAC address/port mapping. A couple of ping packets sent to an invented IP address (so alice/bob don’t see any direct traffic from eve) suffice to do this:
# ping -c 5 192.168.122.44
To see whether eve is now receiving bob’s traffic, launch tcpdump again in eve’s console:
# tcpdump icmp
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
11:02:41.981567 IP alice.test.berrange.com > bob.test.berrange.com: ICMP echo request, id 8053, seq 1493, length 64
11:02:42.981624 IP alice.test.berrange.com > bob.test.berrange.com: ICMP echo request, id 8053, seq 1494, length 64
11:02:43.981785 IP alice.test.berrange.com > bob.test.berrange.com: ICMP echo request, id 8053, seq 1495, length 64
...
The original ‘ping’ session, back in alice’s console, should have stopped receiving any replies from bob since all his traffic is being redirected to eve. Occasionally bob’s OS might send out some packet of its own accord which re-populates the host bridge’s MAC address/port mapping, causing the ping to start again. eve can trivially re-poison the mapping at any time by sending out further packets of her own.
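If eve wanted to keep the mapping poisoned continuously rather than re-poisoning by hand, a trivial loop in her console would do; a sketch, reusing the same invented destination address as before:
# while true; do ping -c 1 -W 1 192.168.122.44 > /dev/null 2>&1; sleep 5; done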
Attacking VMs on a switch using MAC and IP spoofing
The problem with only using MAC spoofing is that traffic from alice to bob goes into a black hole – the ping packet loss quickly shows alice that something is wrong. To try and address this, eve could also try spoofing bob’s IP address, by running:
# ifconfig eth0 192.168.122.22/24
The tcpdump session in eve should now show replies being sent back out, in response to alice’s ping requests:
# tcpdump icmp
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
11:10:55.797471 IP alice.test.berrange.com > bob.test.berrange.com: ICMP echo request, id 8053, seq 1986, length 64
11:10:55.797521 IP bob.test.berrange.com > alice.test.berrange.com: ICMP echo reply, id 8053, seq 1986, length 64
11:10:56.798914 IP alice.test.berrange.com > bob.test.berrange.com: ICMP echo request, id 8053, seq 1987, length 64
11:10:56.799031 IP bob.test.berrange.com > alice.test.berrange.com: ICMP echo reply, id 8053, seq 1987, length 64
alice’s ping session will now be receiving replies just as she expects, except that unbeknown to her, the replies are actually being sent by eve, not bob.
Protecting VMs against MAC/IP spoofing
So eve can impersonate a ping response from bob – big deal? What about real application level protocols like SSH or HTTPS which have security built in? These are no doubt harder to attack, but by no means impossible, particularly if you are willing to bet on human/organizational weakness. For MITM attacks like this, the SSH host key fingerprint is critical. How many people actually go to the trouble of checking that the SSH host key matches what it is supposed to be when first connecting to a new host? I’d wager very few. Rather more users will question the alert from SSH when a previously known host key changes, but I’d still put money on a non-trivial number ignoring the warning. For HTTPS, the key to avoiding MITM attacks is the x509 certificate authority system. Everyone knows that this is absolutely flawless, without any compromised/rogue CAs ;-P
What can we do about these risks for virtual machines running on the same host? libvirt provides a reasonably advanced firewall capability in both its KVM and LXC drivers. This capability is built upon the standard Linux ebtables, iptables and ip6tables infrastructure and enables rules to be set per guest TAP device. The example firewall filters that are present out of the box provide a so called “clean traffic” ruleset. Amongst other things, these filters prevent MAC and IP address spoofing by virtual machines. Enabling this requires a very simple change to the guest domain network interface configuration.
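Before making that change, the filters shipped out of the box can be listed and inspected with virsh (the exact set and content will vary with the libvirt version):
# virsh nwfilter-list
# virsh nwfilter-dumpxml clean-traffic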
Shut down alice, bob and eve and then alter their XML configuration (using virsh edit) so that each one now contains the following:
<interface type='network'> (for the VM 'alice')
<mac address='52:54:00:00:00:11'/>
<source network='default'/>
<target dev='vnic-alice'/>
<model type='virtio'/>
<filterref filter='clean-traffic'/>
</interface>
<interface type='network'> (for the VM 'bob')
<mac address='52:54:00:00:00:22'/>
<source network='default'/>
<target dev='vnic-bob'/>
<model type='virtio'/>
<filterref filter='clean-traffic'/>
</interface>
<interface type='network'> (for the VM 'eve')
<mac address='52:54:00:00:00:33'/>
<source network='default'/>
<target dev='vnic-eve'/>
<model type='virtio'/>
<filterref filter='clean-traffic'/>
</interface>
Start the guests again and try to repeat the previous MAC and IP spoofing attacks from eve. If all is working as intended, it should be impossible for eve to capture any traffic between alice and bob, or to disrupt it in any way.
The clean-traffic filter rules are written to require two configuration parameters, the whitelisted MAC address and the whitelisted IP address. The MAC address is inserted by libvirt automatically, based on the declared MAC in the XML configuration. For the IP address, libvirt will sniff the DHCPOFFER responses from the DHCP server running on the host to learn the assigned IP address. There is a fairly obvious attack with this, whereby someone just runs a rogue DHCP server. It is possible to alter the design of the filter rules so that any rogue DHCP servers are blocked; however, there is one additional problem. Upon migration of guests, the new host needs to learn the IP address, but guests don’t re-run DHCP upon migration because it is supposed to be totally seamless. Thus in most cases, when using filters, the host admin will want to explicitly specify the guest’s IP address in the XML:
<filterref filter='clean-traffic'>
<parameter name='IP' value='192.168.122.33'/>
</filterref>
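To confirm that the filter is actually in effect, the rules libvirt generates can be inspected on the host with the standard tools; a sketch (the chain names libvirt creates are version dependent, so treat the output as illustrative):
# ebtables -t nat -L
# iptables -L -n -v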
There is quite a lot more that can be done using libvirt’s guest network filtering capabilities. One idea would be to block outbound SMTP traffic to prevent compromised guests being turned into spambots. In fact almost anything that an administrator might wish to do inside the guest using iptables could be done in the host using libvirt’s network filtering, to provide additional protection against guest OS compromise.
This will be left as an exercise for the reader…
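As a starting point, a custom filter to drop outbound SMTP might look roughly like the following sketch (untested; the filter name no-smtp is invented here, and the guest would reference it with an extra <filterref filter='no-smtp'/> alongside clean-traffic):
# cat > no-smtp.xml <<EOF
<filter name='no-smtp' chain='ipv4'>
  <!-- drop any TCP traffic the guest sends to port 25 -->
  <rule action='drop' direction='out' priority='500'>
    <tcp dstportstart='25' dstportend='25'/>
  </rule>
</filter>
EOF
# virsh nwfilter-define no-smtp.xml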
sVirt has been available in the libvirt KVM driver for a few years now, both for SELinux and more recently for AppArmor. When using it with SELinux there has been a choice of two different configurations:
- Dynamic configuration: libvirt takes the default base label (“system_u:system_r:svirt_t:s0”), generates a unique MCS label for the guest (“c123,c465”) and combines them to form the complete security label for the virtual machine process. libvirt takes the same MCS label and combines it with the default image base label (“system_u:object_r:svirt_image_t:s0”) to form the image label. libvirt will then automatically apply the image label to all host OS files that the VM is required to access. These can be disk images, disk devices, PCI devices (we label the corresponding sysfs files), USB devices (we label the /dev/bus/usb files), kernel/initrd files, and a few more things. When the VM shuts down again, we reverse the labelling. This mode was originally intended for general usage where the management application is not aware of the existence of sVirt.
- Static configuration: The guest XML provides the full security label, including the MCS part. libvirt simply assigns this security label to the virtual machine process without trying to alter/interpret it any further. libvirt does not change the labels of any files on disk. The administrator/application using libvirt is expected to have done all the resource file labelling ahead of time. This mode was originally intended for locked down MLS environments, where even libvirtd itself is not trusted to perform relabelling.
These two configurations have worked well enough for the two use cases they were designed to satisfy. As sVirt has become an accepted part of the libvirt/KVM ecosystem, application developers have started wanting to do more advanced things which are currently harder than they should be. In particular some applications want to have full control over the security label generation (eg to ensure cluster-wide unique labels, instead of per-host uniqueness), but still want libvirt to take care of resource relabelling. This is sort of a hybrid between our static & dynamic configuration. Other applications would like to be able to choose a different base label (“system_u:system_r:svirt_custom_t:s0”) but still have libvirt assign the MCS suffix and perform relabelling. This is another variant on dynamic labelling. To satisfy these use cases we have extended the syntax for sVirt labelling in recent libvirt releases. The “seclabel” element gained a ‘relabel’ attribute to control whether resource relabelling is attempted. A new “baselabel” element was introduced to override the default base security label in dynamic mode. So there are now 4 possible styles of configuration:
- Dynamic configuration (the default out of the box usage)
<seclabel type='dynamic' model='selinux' relabel='yes'>
<label>system_u:system_r:svirt_t:s0:c192,c392</label> (output only element)
<imagelabel>system_u:object_r:svirt_image_t:s0:c192,c392</imagelabel> (output only element)
</seclabel>
- Dynamic configuration, with base label
<seclabel type='dynamic' model='selinux' relabel='yes'>
<baselabel>system_u:system_r:svirt_custom_t:s0</baselabel>
<label>system_u:system_r:svirt_custom_t:s0:c192,c392</label> (output only element)
<imagelabel>system_u:object_r:svirt_image_t:s0:c192,c392</imagelabel> (output only element)
</seclabel>
- Static configuration, no resource labelling (primarily for MLS/strictly controlled environments)
<seclabel type='static' model='selinux' relabel='no'>
<label>system_u:system_r:svirt_custom_t:s0:c192,c392</label>
</seclabel>
- Static configuration, with dynamic resource labelling
<seclabel type='static' model='selinux' relabel='yes'>
<label>system_u:system_r:svirt_custom_t:s0:c192,c392</label>
</seclabel>
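Whichever style is chosen, the resulting labels are easy to verify on the host once a guest is running; for example (paths are illustrative):
# ps -eZ | grep qemu
# ls -Z /var/lib/libvirt/images/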
The NetCF library provides a simple API for managing network interface configuration files. Libvirt has used NetCF for several releases now to provide the internal implementation of the virInterface APIs. This allows libvirt based applications to configure plain ethernet devices, bridging, bonding and VLANs, which covers all the network configurations required for a typical virtualization host. The problem is that nearly every single OS distro has a different format for its configuration files, and NetCF has only had an implementation that works with Fedora and RHEL. This has led to a perception that NetCF is a project that is “Red Hat specific”, which is not at all the intention when NetCF was created. To try to address this problem, I have spent the last couple of weeks hacking on a driver for NetCF that knows how to deal with the Debian /etc/network/interfaces file format. As of last night, I pushed the code into the upstream NetCF git repository, so Debian & Ubuntu users have something nice to try out in the next release. Indeed, it would be good if any interested persons were to try out the latest NetCF GIT code before the next release to make sure it works for someone other than myself :-)
In the course of porting to Debian, we also became aware that there was a port of NetCF to Suse distributions, found as a patch in the netcf RPM for OpenSuse 11. We have not had a chance to test it ourselves yet, but on the assumption that it must have been at least partially functional when added to OpenSuse 11 RPMs, we have merged that patch into the latest NetCF GIT. If anyone is using Suse and wants to try it out and report what works & what doesn’t, we’d appreciate the feedback. If someone wants to actually do further development work for the Suse driver to finish it off, that would be even better!
Finally, a few months ago, there was work on creating a Windows driver for NetCF. This was posted a few times to the NetCF mailing lists, but was never completed because the original submitter ran out of time to work on it. In the hope that it will be a useful starting point for other interested developers, we have also merged the most recent Windows patch into the NetCF GIT repository. This is by no means useful yet, only able to list interfaces and bring them up/down – it can’t report their config, or create new interfaces.
Supported Debian driver configurations
Back to the Debian driver now. The Debian /etc/network/interfaces configuration file is quite nicely structured and reasonably well documented, but one of the problems faced is that there is often more than one way to get to the same end result. To make the development of a Debian driver a tractable problem, I decided to pick one specific configuration approach for each desired network interface arrangement. So while NetCF should be able to read back any configuration that it wrote itself, it may not be able to correctly read arbitrary configurations that a user created manually. I expect that over time the driver will iteratively improve its configuration parsing to cope with other styles.
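As a quick way to exercise the driver, the virsh iface-* commands (which sit on top of the virInterface APIs and hence NetCF) can be used to list interfaces and read their configuration back; a sketch, with an illustrative device name:
# virsh iface-list --all
# virsh iface-dumpxml eth0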
AFAICT, the Debian best practice for setting up VLANs, bonding & bridging is to use the extra configuration syntax offered by certain add-on packages, rather than custom post scripts. So for the NetCF Debian driver to work, it is necessary to have the following DPKGs installed (an install command is shown after this list):
- ifenslave (required for any bonding config)
- bridge-utils (required for any bridging config)
- vlan (required for any vlan config)
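Assuming a reasonably standard Debian or Ubuntu install, all three can be pulled in with a single command:
# apt-get install ifenslave bridge-utils vlan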
Ethernet Devices
Loopback:
auto lo
iface lo inet loopback
DHCP:
auto eth0
iface eth0 inet dhcp
Static config:
auto eth0
iface eth0 inet static
address 196.168.1.1
netmask 255.255.255.0
gateway 196.168.1.255
No IP config
auto eth0
iface eth0 inet manual
Config with MTU / MAC address
auto eth0
iface eth0 inet dhcp
hwaddress ether 00:11:22:33:44:55
mtu 1500
Bonding
With miimon
iface bond0 inet dhcp
bond_slaves eth1 eth2
bond_primary eth1
bond_mode active-backup
bond_miimon 100
bond_updelay 10
bond_use_carrier 0
With arpmon
iface bond2 inet dhcp
bond_slaves eth6
bond_primary eth6
bond_mode active-backup
bond_arp_interval 100
bond_arp_ip_target 198.168.1.1
VLANs
auto eth0.42
iface eth0.42 inet static
address 196.168.1.1
netmask 255.255.255.0
vlan_raw_device eth0
Bridging
With single interface and IP addr
auto br0
iface br0 inet static
address 192.68.2.3
netmask 255.255.255.0
mtu 1500
bridge_ports eth3
bridge_stp off
bridge_fd 0.01
With no IP addr
auto br0
iface br0 inet manual
bridge_ports eth3
bridge_stp off
bridge_fd 0.01
With multiple interfaces
auto br0
iface br0 inet static
address 192.68.2.3
netmask 255.255.255.0
mtu 1500
bridge_ports eth3 eth4
bridge_stp off
bridge_fd 0.01
With no interface or addr
auto br0
iface br0 inet manual
mtu 1500
bridge_ports none
bridge_stp off
bridge_fd 0.01
Complex Bridging
Bridging a bond:
auto br1
iface br1 inet manual
mtu 1500
bridge_ports bond1
bridge_stp off
pre-up ifup bond1
post-down ifdown bond1
iface bond1 inet manual
bond_slaves eth4
bond_primary eth4
bond_mode active-backup
bond_miimon 100
bond_updelay 10
bond_use_carrier 0
Bridging a VLAN:
auto br2
iface br2 inet manual
mtu 1500
bridge_ports eth0.42
bridge_stp off
iface eth0.42 inet manual
vlan_raw_device eth0
IPv6
Static addressing, with multiple addresses:
auto eth5
iface eth5 inet6 static
address 3ffe:ffff:0:5::1
netmask 128
pre-up echo 0 > /proc/sys/net/ipv6/conf/eth5/autoconf
post-down echo 1 > /proc/sys/net/ipv6/conf/eth5/autoconf
up /sbin/ifconfig eth5 inet6 add 3ffe:ffff:0:5::5/128
down /sbin/ifconfig eth5 inet6 del 3ffe:ffff:0:5::5/128
Stateless autoconf
auto eth5
iface eth5 inet6 manual
DHCPv6 with autoconf
auto eth5
iface eth5 inet6 dhcp
DHCPv6 without autoconf
auto eth5
iface eth5 inet6 dhcp
pre-up echo 0 > /proc/sys/net/ipv6/conf/eth5/autoconf
post-down echo 1 > /proc/sys/net/ipv6/conf/eth5/autoconf
The most recent set of example configurations can be found in the documentation in GIT.
For quite a while now, libvirt has had an LXC driver that uses Linux’s namespace + cgroups features to provide container based virtualization. Before continuing I should point out that the libvirt LXC driver does not have any direct need for the userspace tools from the LXC sf.net project, since it directly leverages APIs the Linux kernel exposes to userspace. There are in fact many other potential users of the kernel’s namespace APIs which have their own userspace, such as OpenVZ, Linux-VServer and Parallels. This blog post will concern itself solely with the native libvirt LXC support.
Connecting to the LXC driver
At this point in time, there is only one URI available for connecting to the libvirt LXC driver, lxc:///, which gets you a privileged connection. There is not yet any support for unprivileged libvirtd instances using containers, due to restrictions of the kernel’s DAC security models. I’m hoping this may be refined in the future.
If you’re familiar with using libvirt in combination with KVM, then it is likely you are just relying on libvirt picking the right URI by default. Each host can only have one default URI for libvirt though, and KVM will usually take precedence over LXC. You can discover what libvirt has decided the default URI is:
# virsh uri
qemu:///system
So when using tools like virsh you’ll need to specify the LXC URI somehow. The first way is to use the ‘-c URI’ or ‘--connect URI’ arguments that most libvirt based applications have:
# virsh -c lxc:/// uri
lxc:///
The second option is to explicitly override the default libvirt URI for your session using the LIBVIRT_DEFAULT_URI environment variable.
# export LIBVIRT_DEFAULT_URI=lxc:///
# virsh uri
lxc:///
For the sake of brevity, all the examples that follow will presume that export LIBVIRT_DEFAULT_URI=lxc:/// has been set.
A simple “Hello World” LXC container
The Hello World equivalent for LXC is probably a container which just runs /bin/sh, with the main root filesystem / network interfaces all still being visible. What you’re gaining here is not security, but rather a way to manage resource utilization of everything spawned from that initial process. The libvirt LXC driver currently does most of its resource controls using cgroups, but will also leverage the network traffic shaper directly for the network controls, which need to be applied per virtual network interface, not per cgroup.
Anyone familiar with libvirt will know that creating a new guest requires an XML document specifying its configuration. Machine based virtualization requires either a kernel/initrd or a virtual BIOS to boot, and can create a fully virtualized (hvm) or paravirtualized (xen) machine. Container virtualization, by contrast, just wants to know the path to the binary to spawn as the container’s “init” (aka the process with PID 1). The virtualization type for containers is thus referred to in libvirt as “exe”. Aside from the virtualization type & path of the initial process, the only other required XML parameters are the guest name, initial memory limit and a text console device. Putting this together, creating the “Hello World” container will require an XML configuration that looks like this:
# cat > helloworld.xml <<EOF
<domain type='lxc'>
<name>helloworld</name>
<memory>102400</memory>
<os>
<type>exe</type>
<init>/bin/sh</init>
</os>
<devices>
<console type='pty'/>
</devices>
</domain>
EOF
This configuration can be imported into libvirt in the normal manner
# virsh define helloworld.xml
Domain helloworld defined from helloworld.xml
then started
# virsh start helloworld
Domain helloworld started
# virsh list
Id Name State
----------------------------------
31417 helloworld running
The ID values assigned by the libvirt LXC driver are the process IDs of the libvirt_lxc helper processes that libvirt launches. This helper is what actually creates the container, spawning the initial process, after which it just sits around handling console I/O. Speaking of the console, this can now be accessed with virsh:
# virsh console helloworld
Connected to domain helloworld
Escape character is ^]
sh-4.2#
That ‘sh’ prompt is the shell process inside the container. All the container processes are visible outside the container as regular processes:
# ps -axuwf
...
root 31417 0.0 0.0 42868 1252 ? Ss 16:17 0:00 /usr/libexec/libvirt_lxc --name helloworld --console 27 --handshake 30 --background
root 31418 0.0 0.0 13716 1692 pts/39 Ss+ 16:17 0:00 \_ /bin/sh
...
Inside the container, PID numbers are distinct, starting again from ‘1’.
# virsh console helloworld
Connected to domain helloworld
Escape character is ^]
sh-4.2# ps -axuwf
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 13716 1692 pts/39 Ss 16:17 0:00 /bin/sh
The container will shut down when the ‘init’ process exits, so in this example when ‘exit’ is run in the container’s shell. Alternatively issue the usual ‘virsh destroy’ to kill it off.
# virsh destroy helloworld
Domain helloworld destroyed
Finally remove its configuration
# virsh undefine helloworld
Domain helloworld has been undefined
Adding custom mounts to the “Hello World” container
The “Hello World” container shared the same root filesystem as the primary (host) OS. What if the application inside the container requires custom data in certain locations? For example, using containers to sandbox apache servers might require a custom /etc/httpd and /var/www. This can easily be achieved by specifying one or more filesystem devices in the initial XML configuration. Let’s create some custom locations to pass to the “Hello World” container.
# mkdir /export/helloworld/config
# touch /export/helloworld/config/hello.txt
# mkdir /export/helloworld/data
# touch /export/helloworld/data/world.txt
Now edit the helloworld.xml file and add in
<filesystem type='mount'>
<source dir='/export/helloworld/config'/>
<target dir='/etc/httpd'/>
</filesystem>
<filesystem type='mount'>
<source dir='/export/helloworld/data'/>
<target dir='/var/www'/>
</filesystem>
Now after defining and starting the container again, it should see the custom mounts
# virsh define helloworld.xml
Domain helloworld defined from helloworld.xml
# virsh start helloworld
Domain helloworld started
# virsh console helloworld
Connected to domain helloworld
Escape character is ^]
sh-4.2# ls /etc/httpd/
hello.txt
sh-4.2# ls /var/www/
world.txt
sh-4.2# exit
# virsh undefine helloworld
Domain helloworld has been undefined
A private root filesystem with busybox
So far the container has shared the root filesystem with the host OS. This may be OK if the application running in the container is going to run as an unprivileged user ID and you are careful not to mess up your host OS. If you want to do things like running DHCP inside the container, or have things running as root, then you almost certainly want a private root filesystem in the container. In this example, we’ll use the busybox tools to set up the simplest possible private root for “Hello World”. First create a new directory and copy the busybox binary into position:
mkdir /export/helloworld/root
cd /export/helloworld/root
mkdir -p bin var/www etc/httpd
cd bin
cp /sbin/busybox busybox
cd /root
The next step is to set up symlinks for all the busybox commands you intend to use. For example:
for i in ls cat rm find ps echo date kill sleep \
true false test pwd sh which grep head wget
do
ln -s busybox /export/helloworld/root/bin/$i
done
Now all that is required is to add another filesystem device to the XML configuration:
<filesystem type='mount'>
<source dir='/export/helloworld/root'/>
<target dir='/'/>
</filesystem>
With that added to the XML, follow the same steps to define and start the guest again
# virsh define helloworld.xml
Domain helloworld defined from helloworld.xml
# virsh start helloworld
Domain helloworld started
Now when accessing the guest console a completely new filesystem should be visible
# virsh console helloworld
Connected to domain helloworld
Escape character is ^]
# ls
bin dev etc proc selinux sys var
# ls bin/
busybox echo grep ls rm test which
cat false head ps sh true
date find kill pwd sleep wget
# cat /proc/mounts
rootfs / rootfs rw 0 0
devpts /dev/pts devpts rw,seclabel,relatime,gid=5,mode=620,ptmxmode=666 0 0
/dev/mapper/vg_t500wlan-lv_root / ext4 rw,seclabel,relatime,user_xattr,barrier=1,data=ordered 0 0
devpts /dev/pts devpts rw,seclabel,relatime,gid=5,mode=620,ptmxmode=666 0 0
devfs /dev tmpfs rw,seclabel,nosuid,relatime,mode=755 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
proc /proc/sys proc ro,relatime 0 0
/sys /sys sysfs ro,seclabel,relatime 0 0
selinuxfs /selinux selinuxfs ro,relatime 0 0
devpts /dev/ptmx devpts rw,seclabel,relatime,gid=5,mode=620,ptmxmode=666 0 0
/dev/mapper/vg_t500wlan-lv_root /etc/httpd ext4 rw,seclabel,relatime,user_xattr,barrier=1,data=ordered 0 0
/dev/mapper/vg_t500wlan-lv_root /var/www ext4 rw,seclabel,relatime,user_xattr,barrier=1,data=ordered 0 0
Custom networking in the container
The examples thus far have all just inherited access to the host network interfaces. This may or may not be desirable. It is of course possible to configure private networking for the container. Conceptually this works in much the same way as with KVM. Currently it is possible to choose between libvirt’s bridge, network or direct networking modes, giving ethernet bridging, NAT/routing, or VEPA respectively. When configuring private networking, the host OS will get a ‘vethNNN’ device for each container NIC, and the container will see its own ‘ethNNN’ and ‘lo’ devices. The XML configuration additions are just the same as what’s required for KVM, for example:
<interface type='network'>
<mac address='52:54:00:4d:2b:cd'/>
<source network='default'/>
</interface>
Define and start the container as before, then compare the network interfaces in the container to what is in the host
# virsh console helloworld
Connected to domain helloworld
Escape character is ^]
# ifconfig
eth0 Link encap:Ethernet HWaddr 52:54:00:16:61:DA
inet6 addr: fe80::5054:ff:fe16:61da/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:93 errors:0 dropped:0 overruns:0 frame:0
TX packets:6 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:5076 (4.9 KiB) TX bytes:468 (468.0 B)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
# route -n
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
We have a choice of configuring the guest eth0 manually, or just launching a DHCP client. To do manual configuration try
# virsh console helloworld
Connected to domain helloworld
Escape character is ^]
# ifconfig eth0 192.168.122.50
# route add 0.0.0.0 gw 192.168.122.1 eth0
# route
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
default 192.168.122.1 255.255.255.255 UGH 0 0 0 eth0
192.168.122.0 * 255.255.255.0 U 0 0 0 eth0
# ping 192.168.122.1
PING 192.168.122.1 (192.168.122.1): 56 data bytes
64 bytes from 192.168.122.1: seq=0 ttl=64 time=0.786 ms
64 bytes from 192.168.122.1: seq=1 ttl=64 time=0.157 ms
^C
--- 192.168.122.1 ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 0.157/0.471/0.786 ms
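The DHCP client route is also possible, but with the minimal busybox root used here it needs a little extra setup; a sketch, assuming the busybox build includes the udhcpc applet (note that without a lease-handling script udhcpc will only print the offered parameters rather than apply them):
# ln -s busybox /export/helloworld/root/bin/udhcpc    (on the host, before starting the guest)
# udhcpc -i eth0                                      (inside the container console)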
Am I running in an LXC container?
Some programs may wish to know if they have been launched inside a libvirt container. To assist them, the initial process is given two environment variables, LIBVIRT_LXC_NAME and LIBVIRT_LXC_UUID
# echo $LIBVIRT_LXC_NAME
helloworld
# echo $LIBVIRT_LXC_UUID
a099376e-a803-ca94-f99c-d9a8f9a30088
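A minimal sketch of such a check in shell (these variables are only handed to the container’s initial process, so anything wanting to consult them later must take care to preserve them):
if [ -n "$LIBVIRT_LXC_NAME" ]; then
    echo "Running as init of libvirt LXC container $LIBVIRT_LXC_NAME ($LIBVIRT_LXC_UUID)"
else
    echo "Not started as a libvirt LXC container init process"
fi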
An aside about CGroups and LXC
Every libvirt LXC container gets placed inside a dedicated cgroup, $CGROUPROOT/libvirt/lxc/$CONTAINER-NAME. Libvirt expects the memory, devices, freezer, cpu and cpuacct cgroup controllers to be mounted on the host OS. Work on leveraging cgroups inside LXC with libvirt is still ongoing, but there are already APIs to set/get memory and CPU limits, with networking to follow soon. This could be a topic for a blog post of its own, so won’t be discussed further here.
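Those existing limit APIs are exposed through virsh; for example, the container’s memory and CPU tunables can be queried and adjusted roughly like this (exact parameter support for LXC depends on the libvirt version, so treat this as a sketch):
# virsh -c lxc:/// memtune helloworld
# virsh -c lxc:/// memtune helloworld --hard-limit 131072
# virsh -c lxc:/// schedinfo helloworld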
An aside about LXC security, or lack thereof
You might think that since we can create a private root filesystem, it’d be cool to run an entire Fedora/RHEL OS in the container. I strongly caution against doing this. The DAC (discretionary access control) system on which LXC currently relies for all security is known to be incomplete, so it is entirely possible to accidentally/intentionally break out of the container and/or mount a DoS attack on the host OS. Repeat after me: “LXC is not yet secure. If I want real security I will use KVM”. There is a plan to make LXC DAC more secure, but that is nowhere near finished. We also plan to integrate sVirt with LXC so that MAC will mitigate holes in the DAC security model.
An aside about Fedora >= 15, SystemD and autofs
If you are attempting to try any of this on Fedora 16 or later, there is currently an unresolved problem with autofs that breaks much use of containers. The root problem is that we are unable to unmount autofs mount points after switching into the private filesystem namespace. Unfortunately SystemD uses autofs in its default configuration for several types of mounts. So if you find containers fail to start, then as a temporary hack you can try disabling all SystemD’s autofs mount points:
for i in `systemctl --full | grep automount | awk '{print $1}'`
do
systemctl stop $i
done
We hope to resolve this in a more satisfactory way in the near future.
The complete final example XML configuration
# cat helloworld.xml
<domain type='lxc'>
<name>helloworld</name>
<memory>102400</memory>
<os>
<type>exe</type>
<init>/bin/sh</init>
</os>
<devices>
<console type='pty'/>
<filesystem type='mount'>
<source dir='/export/helloworld/root'/>
<target dir='/'/>
</filesystem>
<filesystem type='mount'>
<source dir='/export/helloworld/config'/>
<target dir='/etc/httpd'/>
</filesystem>
<filesystem type='mount'>
<source dir='/export/helloworld/data'/>
<target dir='/var/www'/>
</filesystem>
<interface type='network'>
<source network='default'/>
</interface>
</devices>
</domain>
Yesterday on the #virt@irc.oftc.net IRC channel there was a question about whether sVirt+SELinux would prevent two virtual machines running under the same user ID from ptrace()ing each other. If no SELinux is involved, there is no DAC restriction on ptrace() between two PIDs with the same UID. So this is clearly the kind of thing you would expect/want sVirt to block, and indeed it does. But how can you easily prove the policy blocks ptrace? Enter the ‘runcon’ command, which lets you impersonate VMs.
NB, when trying out the following, you want SELinux to be in “permissive” mode, not “enforcing”, since the way we do some parts of the tests will trigger other AVCs which get in the way.
Under sVirt each QEMU process is given a dedicated security label, formed by combining the base label “system_u:system_r:svirt_t:s0” with a unique MCS level. So to test our belief about ptrace(), we need to have two security labels. Let’s use these two:
system_u:system_r:svirt_t:s0:c12,c34
system_u:system_r:svirt_t:s0:c56,c78
Now we want a process to act as the target VM to be ptrace()d. With the first SELinux label above, and “runcon”, we can launch a confined QEMU process in the same way libvirtd would have done:
$ runcon system_u:system_r:svirt_t:s0:c12,c34 /usr/bin/qemu -vnc :1
‘ps’ can be used to verify that ‘qemu’ really is under the confined domain
$ ps -axuwZ | grep qemu
system_u:system_r:svirt_t:s0:c12,c34 berrange 29542 0.0 0.0 106680 460 pts/12 S+ 14:32 0:00 qemu
Now we have the victim running, we can try launching an attacker. Since we’re looking to see if ptrace() is blocked, ‘strace’ is a natural command to try out. For testing other attack vectors you might want to create a tiny dedicated program. Using the second security label, and ‘runcon’ again, we can do:
$ runcon system_u:system_r:svirt_t:s0:c56,c78 strace -p $PID-OF-VICTIM
Finally, we can look at the audit log for any AVC messages about the ‘ptrace’ access vector:
# grep AVC /var/log/audit/audit.log | grep ptrace
type=AVC msg=audit(1317130603.887:33048): avc: denied { ptrace } for pid=29644 comm="strace" scontext=system_u:system_r:svirt_t:s0:c56,c78 tcontext=system_u:system_r:svirt_t:s0:c12,c34 tclass=process
What this AVC is saying is that a process under the label “system_u:system_r:svirt_t:s0:c56,c78” tried to execute ptrace() against a process under the label “system_u:system_r:svirt_t:s0:c12,c34” and it was blocked.
This is exactly what we wanted to see happen. We have now proved that two VMs running as the same user ID cannot ptrace() each other.
When I outlined this on IRC, there was a follow up question: can the attacking VM just use ‘runcon’ to change its security label? The answer is again no. Transitions between security labels must be explicitly allowed by the policy, and the sVirt policy does not allow any such transitions for the svirt_t type. Again this can be demonstrated by using ‘runcon’, chaining together two invocations:
runcon system_u:system_r:svirt_t:s0:c56,c78 runcon system_u:system_r:svirt_t:s0:c12,c34 strace -p $PID-OF-VICTIM
And then looking for any AVC log message about the ‘transition’ access vector
# grep AVC /var/log/audit/audit.log | grep transition
type=AVC msg=audit(1317131267.839:33153): avc: denied { transition } for pid=29811 comm="runcon" path="/usr/bin/strace" dev=dm-1 ino=662333 scontext=system_u:system_r:svirt_t:s0:c56,c78 tcontext=system_u:system_r:svirt_t:s0:c12,c34 tclass=process