Setting up a Ceph cluster and exporting an RBD volume to a KVM guest

Posted: October 12th, 2011 | Filed under: Fedora, libvirt, Virt Tools

Yesterday I talked about setting up Sheepdog with KVM, so today it is time to discuss the use of Ceph and RBD with KVM.

Host Cluster Setup, the easy way

Fedora has included Ceph for a couple of releases, but since my hosts are on Fedora 14/15, I grabbed the latest ceph 0.31 SRPMs from Fedora 16 and rebuilt those to get something reasonably up2date. In the end I have the following packages installed, though to be honest I don’t really need anything except the base ‘ceph’ RPM:

# rpm -qa | grep ceph | sort
ceph-0.31-4.fc17.x86_64
ceph-debuginfo-0.31-4.fc17.x86_64
ceph-devel-0.31-4.fc17.x86_64
ceph-fuse-0.31-4.fc17.x86_64
ceph-gcephtool-0.31-4.fc17.x86_64
ceph-obsync-0.31-4.fc17.x86_64
ceph-radosgw-0.31-4.fc17.x86_64
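
For reference, the rebuild itself only needs a couple of commands. A sketch (the exact SRPM filename will vary with whatever is current in the Fedora 16 repos):

# yum-builddep ceph-0.31-4.fc16.src.rpm
# rpmbuild --rebuild ceph-0.31-4.fc16.src.rpm
# rpm -Uvh ~/rpmbuild/RPMS/x86_64/ceph-0.31-4*.x86_64.rpm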

Installing the software is the easy bit, configuring the cluster is where the fun begins. I had three hosts available for testing, all of which are virtualization hosts. Ceph has at least 3 daemons it needs to run (mon, osd and mds), which should all be replicated across several hosts for redundancy. There’s no requirement to use the same hosts for each daemon, but for simplicity I decided to run every Ceph daemon on every virtualization host.

My hosts are called lettuce, avocado and mustard. Following the Ceph wiki instructions, I settled on a configuration file that looks like this:

[global]
    auth supported = cephx
    keyring = /etc/ceph/keyring.admin

[mds]
    keyring = /etc/ceph/keyring.$name
[mds.lettuce]
    host = lettuce
[mds.avocado]
    host = avocado
[mds.mustard]
    host = mustard

[osd]
    osd data = /srv/ceph/osd$id
    osd journal = /srv/ceph/osd$id/journal
    osd journal size = 512
    osd class dir = /usr/lib64/rados-classes
    keyring = /etc/ceph/keyring.$name
[osd.0]
    host = lettuce
[osd.1]
    host = avocado
[osd.2]
    host = mustard

[mon]
    mon data = /srv/ceph/mon$id
[mon.0]
    host = lettuce
    mon addr = 192.168.1.1:6789
[mon.1]
    host = avocado
    mon addr = 192.168.1.2:6789
[mon.2]
    host = mustard
    mon addr = 192.168.1.3:6789

The osd class dir bit should not actually be required, but the OSD code looks in the wrong place (/usr/lib instead of /usr/lib64) on x86_64 arches.

With the configuration file written, it is time to actually initialize the cluster filesystem / object store. This is the really fun bit. The Ceph wiki has a very basic page which talks about the mkcephfs tool, along with a scary warning about how it’ll ‘rm -rf’ all the data on the filesystem it is initializing. It turns out that it doesn’t mean your entire host filesystem; AFAICT, it only blows away the contents of the directories configured for ‘osd data‘ and ‘mon data‘, in my case both under /srv/ceph.

The recommended way is to let mkcephfs ssh into each of your hosts and run all the configuration tasks automatically. Having tried the non-recommended way and failed several times before finally getting it right, I can recommend following the recommended way :-P There are some caveats not mentioned in the wiki page though:

  • The configuration file above must be copied to /etc/ceph/ceph.conf on every node before attempting to run mkcephfs.
  • The configuration file on the host where you run mkcephfs must be in /etc/ceph/ceph.conf, or it will get rather confused about where it is on the other nodes.
  • The mkcephfs command must be run as root, since it doesn’t specify ‘-l root’ to ssh, leading to an inability to set up the nodes.
  • The directories /srv/ceph/osd$id must be pre-created, since it is unable to do that itself, despite being able to create the /srv/ceph/mon$id directories.
  • The Fedora RPMs have also forgotten to create /etc/ceph.

With that in mind, I ran the following commands from my laptop, as root

 # n=0
 # for host in lettuce avocado mustard ; \
   do \
       ssh root@$host mkdir -p /etc/ceph /srv/ceph/osd$n; \
       n=$(expr $n + 1); \
       scp /etc/ceph/ceph.conf root@$host:/etc/ceph/ceph.conf
   done
 # mkcephfs -a -c /etc/ceph/ceph.conf -k /etc/ceph/keyring.bin

On the host where you ran mkcephfs there should now be a file /etc/ceph/keyring.admin. This will be needed for mounting filesystems. I copied it across to all my virtualization hosts

 # for host in lettuce avocado mustard ; \
   do \
       scp /etc/ceph/keyring.admin root@$host:/etc/ceph/keyring.admin; \
   done

Host Cluster Usage

Assuming the setup phase all went to plan, the cluster can now be started. A word of warning though: Ceph really wants your clocks VERY well synchronized. If your NTP server is a long way away, the synchronization might not be good enough to stop Ceph complaining. You really want an NTP server on your local LAN for hosts to sync against. Sort this out before trying to start the cluster.
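
A quick way to sanity check the clocks is to compare the NTP peer status on every host; the offset column (in milliseconds) should be small and stable (a sketch, assuming ntpd is already running everywhere):

 # for host in lettuce avocado mustard ; \
   do \
       ssh root@$host ntpq -p; \
   done

Once the clocks agree, start the cluster on every node: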

 # for host in lettuce avocado mustard ; \
   do \
       ssh root@$host service ceph start; \
   done

The ceph tool can show the status of everything. The ‘mon’, ‘osd’ and ‘mds’ lines in the status ought to show all 3 hosts present & correct

# ceph -s
2011-10-12 14:49:39.085764    pg v235: 594 pgs: 594 active+clean; 24 KB data, 94212 MB used, 92036 MB / 191 GB avail
2011-10-12 14:49:39.086585   mds e6: 1/1/1 up {0=lettuce=up:active}, 2 up:standby
2011-10-12 14:49:39.086622   osd e5: 3 osds: 3 up, 3 in
2011-10-12 14:49:39.086908   log 2011-10-12 14:38:50.263058 osd1 192.168.1.1:6801/8637 197 : [INF] 2.1p1 scrub ok
2011-10-12 14:49:39.086977   mon e1: 3 mons at {0=192.168.1.1:6789/0,1=192.168.1.2:6789/0,2=192.168.1.3:6789/0}

The cluster configuration I chose has authentication enabled, so actually mounting the ceph filesystem requires a secret key. This key is stored in the /etc/ceph/keyring.admin file that was created earlier. To view the keyring contents, the cauthtool program must be used

# cauthtool -l /etc/ceph/keyring.admin 
[client.admin]
	key = AQDLk5VOeHkHLxAAfGjcaUsOXOhJr7hZCNjXSQ==
	auid = 18446744073709551615

The base64 key there will be passed to the mount command, repeated on every host needing the filesystem present:

# mount -t ceph 192.168.1.1:6789:/ /mnt/ -o name=admin,secret=AQDLk5VOeHkHLxAAfGjcaUsOXOhJr7hZCNjXSQ==
error adding secret to kernel, key name client.admin: No such device

For some reason, that error message is always printed on my Fedora hosts, and despite that, the mount has actually succeeded

# grep /mnt /proc/mounts 
192.168.1.1:6789:/ /mnt ceph rw,relatime,name=admin,secret= 0 0
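
As an aside, since the secret appears on the mount command line (and hence in shell history), it may be preferable to pass it via a file using the secretfile mount option instead. A sketch, assuming this build’s mount.ceph supports secretfile; the file name is my own choice:

# echo "AQDLk5VOeHkHLxAAfGjcaUsOXOhJr7hZCNjXSQ==" > /etc/ceph/admin.secret
# chmod 600 /etc/ceph/admin.secret
# mount -t ceph 192.168.1.1:6789:/ /mnt/ -o name=admin,secretfile=/etc/ceph/admin.secret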

Congratulations, /mnt is now a distributed filesystem. If you create a file on one host, it should appear on the other hosts & vice-versa.

RBD Volume setup

A shared filesystem is very nice, and can be used to hold regular virtual disk images in a variety of formats (raw, qcow2, etc). What I really wanted to try was the RBD virtual block device functionality in QEMU. Ceph includes a tool called rbd for manipulating them. The syntax of this tool is pretty self-explanatory:

# rbd create --size 100 demo
# rbd ls
demo
# rbd info demo
rbd image 'demo':
	size 102400 KB in 25 objects
	order 22 (4096 KB objects)
	block_name_prefix: rb.0.0
	parent:  (pool -1)
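
Other routine operations follow the same pattern, for example growing the volume and checking the result (a sketch; don’t resize a volume while a guest is actively using it):

# rbd resize --size 200 demo
# rbd info demo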

Alternatively RBD volume creation can be done using qemu-img …. at least once the Fedora QEMU package is fixed to enable RBD support.

# qemu-img create -f rbd rbd:rbd/demo  100M
Formatting 'rbd:rbd/demo', fmt=rbd size=104857600 cluster_size=0 
# qemu-img info rbd:rbd/demo
image: rbd:rbd/demo
file format: raw
virtual size: 100M (104857600 bytes)
disk size: unavailable

KVM guest setup

The syntax for configuring an RBD block device in libvirt is very similar to that used for Sheepdog. With Sheepdog, every single virtualization node is also a storage node, so there is no hostname required. Not so for RBD: here it is necessary to specify one or more host names for the RBD servers.

<disk type='network' device='disk'>
  <driver name='qemu' type='raw'/>
  <source protocol='rbd' name='rbd/demo'>
    <host name='lettuce.example.org' port='6789'/>
    <host name='mustard.example.org' port='6789'/>
    <host name='avocado.example.org' port='6789'/>
  </source>
  <target dev='vdb' bus='virtio'/>
</disk>
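
Rather than editing the full guest XML, the same <disk> element can also be saved to a standalone file and hot-plugged into a running guest. A sketch, where rbd-disk.xml holds the XML above and ‘myguest’ stands in for a real guest name:

# virsh attach-device myguest rbd-disk.xml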

More observant people might be wondering how QEMU gets permission to connect to the RBD server, given that the configuration earlier enabled authentication. This is thanks to the magic of the /etc/ceph/keyring.admin file which must exist on any virtualization server. Patches are currently being discussed which will allow authentication credentials to be set via libvirt, avoiding the need to store the credentials on the virtualization hosts permanently.

A blog planet for following virtualization tools development & tutorials, etc

Posted: October 11th, 2011 | Filed under: Fedora, libvirt, Virt Tools

The virt-tools.org website, launched last year, provides tutorials, videos, documentation, online help and roadmaps relevant to libvirt, libguestfs, gtk-vnc, spice, other related libraries, and tools or applications like virt-manager & virt-install. The site goal is to inform & assist end users, system administrators & application developers who wish to learn about the capabilities of the virt tools stack. The focus of most content is the state-of-the-art, Linux-native KVM hypervisor, but writing about using other hypervisors with virt tools is also welcome.

Back in June I finally got around to setting up a blog planet to aggregate the RSS feeds of various people working in libvirt, libguestfs, etc. While I announced this to various mailing lists, it appears I forgot to blog about it. Whoops. So this post is just a quick alert that if you’re interested in libvirt, libguestfs, virt-manager, etc and don’t want to follow a high traffic site like the Fedora planet, then this is the blog feed aggregator for you:

Setting up a Sheepdog cluster and exporting a volume to a KVM guest

Posted: October 11th, 2011 | Filed under: Fedora, libvirt, Virt Tools

There were recently patches posted to libvir-list to improve the Ceph support in the KVM driver. While trying to review them it quickly became clear I did not have enough knowledge of Ceph to approve the code. So I decided it was time to set up some clustered storage devices to test libvirt with. I decided to try out Ceph, GlusterFS and Sheepdog, and by virtue of Sheepdog compiling the fastest, it is the first one I tried and thus the subject of this blog post.

Host setup

If you have Fedora 16, sheepdog can be directly installed using yum:

# yum install sheepdog

Sheepdog relies on corosync to maintain cluster membership, so the first step is to configure that. Corosync ships with an example configuration file, but since I’ve not used it before, I chose to just use the example configuration recommended by the Sheepdog website. So on the 2 hosts I wanted to participate in the cluster I created:

# cat > /etc/corosync/corosync.conf <<EOF
compatibility: whitetank
totem {
  version: 2
  secauth: off
  threads: 0
  interface {
    ringnumber: 0
    bindnetaddr: -YOUR IP HERE-
    mcastaddr: 226.94.1.1
    mcastport: 5405
  }
}
logging {
  fileline: off
  to_stderr: no
  to_logfile: yes
  to_syslog: yes
  logfile: /var/log/cluster/corosync.log
  debug: off
  timestamp: on
  logger_subsys {
    subsys: AMF
    debug: off
  }
}
amf {
  mode: disabled
}
EOF

Obviously remember to change the ‘bindnetaddr‘ parameter. One thing to be aware of is that this configuration allows any host on the same subnet to join the cluster; no authentication or encryption is required. I believe corosync has some support for encryption keys, but I have not explored this. If you don’t trust the network, this should definitely be examined. Then it is simply a matter of starting corosync and sheepdog on each node:

# service corosync start
# service sheepdog start
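
As an aside, for the encryption support mentioned above, corosync can use a pre-shared key. I’ve not tested it here, but roughly it would be a matter of generating a key, copying it to every node and setting ‘secauth: on’ in the config (a sketch):

# corosync-keygen
# scp /etc/corosync/authkey root@otherhost:/etc/corosync/authkey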

If all went to plan, it should be possible to see all hosts in the sheepdog cluster, from any node:

# collie node list
   Idx - Host:Port              Number of vnodes
------------------------------------------------
     0 - 192.168.1.2:7000    	64
*    1 - 192.168.1.3:7000    	64

The final step in initializing the nodes is to create a storage cluster across the nodes. This command only needs to be run on one of the nodes

# collie cluster format --copies=2
# collie cluster info
running

Ctime                Epoch Nodes
2011-10-11 10:50:01      1 [192.168.1.2:7000, 192.168.1.3:7000]

Volume setup

libvirt has a storage management API for creating/managing volumes, but there is not currently a driver for sheepdog. So for the time being, volumes need to be created manually using the qemu-img command. All that is required is a volume name and a size. So on any of the nodes:

$ qemu-img create sheepdog:demo 1G
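
The new volume should then be visible from any node in the cluster (a quick check):

$ collie vdi list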

The more observant people might notice that the qemu-img command can be run by any user on the host, no authentication required. Even if the host is locked down to not allow unprivileged user logins, this still means that any compromised QEMU instance can access all the sheepdog storage. Not cool. Some form of authentication is clearly needed before this can be used for production.

With the default Fedora configuration of sheepdog, all the disk volumes end up being stored under /var/lib/sheepdog, so make sure that directory has plenty of free space.
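
A quick way to keep an eye on that (output will obviously vary):

$ df -h /var/lib/sheepdog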

Guest setup

Once a volume has been created, setting up a guest to use it is just a matter of using a special XML configuration block for the guest disk.

<disk type='network' device='disk'>
  <driver name='qemu' type='raw'/>
  <source protocol='sheepdog' name='demo'/>
  <target dev='vdb' bus='virtio'/>
</disk>

Notice how although this is a network block device, there is no need to provide a hostname for the storage server. Every virtualization host is a member of the storage cluster, and vice-versa, so the storage is “local” as far as QEMU is concerned. Inside the guest there is nothing special to worry about: a regular virtio block device appears, in this case /dev/vdb. As data is written to the block device in the guest, the data should end up in /var/lib/sheepdog on all nodes in the cluster.

One final caveat to mention is that live migration of guests between hosts is not currently supported with Sheepdog.

Edit: Live migration *is* supported with sheepdog 0.2.0 and later.

Troubleshooting libvirt with the KVM and LXC drivers

Posted: October 3rd, 2011 | Filed under: Fedora, libvirt, Virt Tools

In “fantasy island” the libvirt and KVM/LXC code is absolutely perfect and always does exactly what you want it to do. Back in the real world, however, there may be annoying bugs in libvirt, KVM/LXC, the kernel and countless other parts of the OS that conspire to cause you great pain and suffering. This blog post contains a very quick introduction to debugging/troubleshooting libvirt problems, particularly focusing on the KVM and LXC drivers.

libvirt logging capabilities

The libvirt code is full of logging statements which can be instrumental in understanding where a problem might lie.

Configuring libvirtd logging

Current releases of libvirt will log problems occurring in libvirtd at level WARNING/ERROR to a dedicated log file /var/log/libvirt/libvirtd.log, while older releases would send them to syslog, typically ending up in /var/log/messages. The libvirtd configuration file has two parameters that can be used to increase the amount of logging information printed.

log_filters="...filter string..."
log_outputs="...destination config..."

The logging documentation describes these in some detail. If you just want to quickly get started though, it suffices to understand that filter strings are simply doing substring matches against libvirt source filenames. So to enable all debug information from ‘src/util/event.c’ (the libvirt event loop) you would set

log_filters="1:event"
log_outputs="1:file:/var/log/libvirt/libvirtd.log"

If you wanted to enable logging for everything in ‘src/util’, except for ‘src/util/event.c’ you would set

log_filters="3:event 1:util"
log_outputs="1:file:/var/log/libvirt/libvirtd.log"

Configuring libvirt client logging

On the client side of libvirt there is no configuration file to put log settings in, so instead, there are a couple of environment variables. These take exactly the same type of strings as the libvirtd configuration file

LIBVIRT_LOG_FILTERS="...filter string..."
LIBVIRT_LOG_OUTPUTS="...destination config..."
export LIBVIRT_LOG_FILTERS LIBVIRT_LOG_OUTPUTS

One thing to be aware of is that with the KVM and LXC drivers in libvirt, very little code is ever run on the libvirt client. The only interesting pieces are the RPC code, event loop and main API entrypoints. To enable debugging of the RPC code you might use

LIBVIRT_LOG_FILTERS="1:rpc" LIBVIRT_LOG_OUTPUTS="1:stderr" virsh list

Useful log filter settings for KVM and LXC

The following are some useful values for logging wrt the KVM and LXC drivers:

  • All libvirt public APIs invoked: 1:libvirt
  • All external commands run by libvirt: 1:command
  • Cgroups management: 1:cgroup
  • All QEMU driver code: 1:qemu
  • QEMU text monitor commands: 1:qemu_monitor_text
  • QEMU JSON/QMP monitor commands: 1:qemu_monitor_json
  • All LXC driver code: 1:lxc
  • All lock management code: 1:locking
  • All security manager code: 1:security
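
For example, to debug a guest startup problem in the QEMU driver, a combination like the following in the libvirtd configuration file is a reasonable starting point (a sketch):

log_filters="1:libvirt 1:qemu 1:qemu_monitor_json"
log_outputs="1:file:/var/log/libvirt/libvirtd.log"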

QEMU driver logfiles

Every QEMU process run by libvirt has a dedicated log file /var/log/libvirt/qemu/$VMNAME.log which captures any data that QEMU writes to stderr/stdout. It also contains timestamps written by libvirtd whenever the QEMU process starts and exits. Finally, prior to starting a guest, libvirt will write out the full set of environment variables and command line arguments it intends to launch QEMU with.

If you are running libvirtd with elevated log settings, there is also the possibility that some of the logging output will end up in the per-VM logfile, instead of the location set by the log_outputs configuration parameter. This is because a little bit of libvirt code will run in the child process between the time it is forked and QEMU is exec()d.

LXC driver logfiles

Every LXC container run by libvirt has a dedicated log file /var/log/libvirt/lxc/$VMNAME.log which captures any data the container setup process writes to stderr/stdout. As with QEMU it will also contain the command line args libvirt uses, though these are much less interesting in the LXC case. The LXC logfile is mostly useful for debugging the initial container bootstrap process.

Troubleshooting SELinux / sVirt

On a RHEL or Fedora host, the out of the box configuration will run all guests under confined SELinux contexts. One common problem that may affect developers running libvirtd straight from the source tree is that libvirtd itself will run under the wrong context, which in turn prevents guests from running correctly. This can be addressed in two ways, first by manually labelling the libvirtd binary after each rebuild

chcon system_u:object_r:virtd_exec_t:s0 $SRCTREE/daemon/.libs/lt-libvirtd

Or by specifying a label when executing libvirtd

runcon system_u:object_r:virtd_exec_t:s0 $SRCTREE/daemon/libvirtd

Another problem might be with libvirt not correctly labelling some device needed by the QEMU process. The best way to see what’s going on here is to enable libvirtd logging with a filter of “1:security_selinux”, which will print out a message for every single file path that libvirtd labels. Then look at the log to see that everything expected is present:

14:36:57.223: 14351: debug : SELinuxGenSecurityLabel:284 : model=selinux label=system_u:system_r:svirt_t:s0:c669,c903 imagelabel=system_u:object_r:svirt_image_t:s0:c669,c903 baselabel=(null)
14:36:57.350: 14351: info : SELinuxSetFilecon:402 : Setting SELinux context on '/var/lib/libvirt/images/f16x86_64.img' to 'system_u:object_r:svirt_image_t:s0:c669,c903'
14:36:57.350: 14351: info : SELinuxSetFilecon:402 : Setting SELinux context on '/home/berrange/boot.iso' to 'system_u:object_r:virt_content_t:s0'
14:36:57.551: 14351: debug : SELinuxSetSecurityDaemonSocketLabel:1129 : Setting VM f16x86_64 socket context unconfined_u:unconfined_r:unconfined_t:s0:c669,c903

If a guest is failing to start, then there are two ways to double check if it really is SELinux related. SELinux can be put into permissive mode on the virtualization host

setenforce 0

Or the sVirt driver can be disabled in libvirt entirely

# vi /etc/libvirt/qemu.conf
...set 'security_driver="none"' ...
# service libvirtd restart

Troubleshooting cgroups

When libvirt runs guests on modern Linux systems, cgroups will be used to control aspects of the guests’ execution. If any cgroups are mounted on the host when libvirtd starts up, it will create a basic hierarchy

$MOUNT_POINT
 |
 +- libvirt
     |
     +- qemu
     +- lxc

When starting a KVM or LXC guest, further directories will be created, one per guest, so that after a while the tree will look like

$MOUNT_POINT
 |
 +- libvirt
     |
     +- qemu
     |    |
     |    +- VMNAME1
     |    +- VMNAME2
     |    +- VMNAME3
     |    +- ...
     |    ...
     +- lxc
          |
          +- VMNAME1
          +- VMNAME2
          +- VMNAME3
          +- ...

Assuming the host administrator has not changed the policy in the top level cgroups, there should be no functional change to the operation of the guests with this default setup. There are possible exceptions though if you are trying something unusual. For example, the ‘devices’ cgroups controller will be used to set up a whitelist of block / character devices that QEMU is allowed to access. So if you have modified QEMU to access a funky new device, libvirt will likely block this via the cgroups device ACL. Due to various kernel bugs, some of the cgroups controllers have also had a detrimental performance impact on both QEMU guests and the host OS as a whole.
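
To see the device whitelist actually applied to a running guest, the devices controller’s ACL can be read straight out of the cgroup filesystem. A sketch, assuming the devices controller is mounted at /cgroup/devices (mount points vary by distro) and a guest named VMNAME1:

# cat /cgroup/devices/libvirt/qemu/VMNAME1/devices.list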

libvirt will never try to mount any cgroups itself, so the quickest way to stop libvirt using cgroups is to stop the host OS from mounting them. This is not always desirable though, so there is a configuration parameter in /etc/libvirt/qemu.conf which can be used to restrict what cgroups libvirt will use.
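
The relevant parameter is cgroup_controllers; for example, to have libvirt only ever use the cpu and devices controllers (a sketch):

# vi /etc/libvirt/qemu.conf
...set 'cgroup_controllers = [ "cpu", "devices" ]'...
# service libvirtd restart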

Running from the GIT source tree

Sometimes when troubleshooting a particularly hard problem it might be desirable to build libvirt from the latest GIT source and run that. When doing this, it is a good idea not to overwrite your distro provided installation with a GIT build, but instead run libvirt directly from the source tree. The first thing to be careful of is that the custom build uses the right installation prefix (i.e. /etc, /usr, /var and not /usr/local). To simplify this, libvirt provides an ‘autogen.sh’ script to run all the right libtool commands and set the correct prefixes. So to build libvirt from GIT, in a way that is compatible with a typical distro build, use:

./autogen.sh --system --enable-compile-warnings=error
make

Hint: use make -j 4 (or larger) to significantly speed up the build on multi-core systems

To run libvirtd from the source tree, as root, stop the existing daemon and invoke the libtool wrapper script

# service libvirtd stop
# ./daemon/libvirtd

Or to run with SELinux contexts

# service libvirtd stop
# runcon system_u:system_r:virtd_t:s0-s0:c0.c1023 ./daemon/libvirtd

virsh can easily be run from the source tree in the same way

# ./tools/virsh ....normal args...

Running python programs against a non-installed libvirt gets a little harder, but that can be overcome too

$ export PYTHONPATH=$SOURCETREE/python:$SOURCETREE/python/.libs
$ export LD_LIBRARY_PATH=$SOURCETREE/src/.libs
$ python virt-manager --no-fork

When running the LXC driver, it is necessary to make a change to the guest XML to point it to a different emulator. Running ‘virsh edit $GUEST’, change

/usr/libexec/libvirt_lxc

to

$SOURCETREE/src/libvirt_lxc

(expand $SOURCETREE to be the actual path of the GIT checkout – libvirt won’t interpret env vars in the XML)

Guest MAC spoofing denial of service and preventing it with libvirt and KVM

I was recently asked to outline some of the risks of virtualization wrt networking, in particular, how guests running on the same network could attack each other’s network traffic. The examples in this blog post will consider a scenario with three guests running on the same host, connected to the libvirt default virtual network (backed by the virbr0 bridge device). As is traditional, the two guests trying to communicate shall be called alice and bob, while the attacker/eavesdropper shall be eve. Provision three guests with those names, and make sure their network configurations look like this:

  <interface type='network'>                 (for the VM 'alice')
    <mac address='52:54:00:00:00:11'/>
    <source network='default'/>
    <target dev='vnic-alice'/>
    <model type='virtio'/>
  </interface>

  <interface type='network'>                 (for the VM 'bob')
    <mac address='52:54:00:00:00:22'/>
    <source network='default'/>
    <target dev='vnic-bob'/>
    <model type='virtio'/>
  </interface>
  <interface type='network'>                 (for the VM 'eve')
    <mac address='52:54:00:00:00:33'/>
    <source network='default'/>
    <target dev='vnic-eve'/>
    <model type='virtio'/>
  </interface>

If the guest interfaces are to be configured using DHCP, it is desirable to have predictable IP addresses for alice, bob & eve. This can be achieved by altering the default network configuration:

# virsh net-destroy default
# virsh net-edit default

In the editor change the IP configuration to look like

<ip address='192.168.122.1' netmask='255.255.255.0'>
  <dhcp>
    <range start='192.168.122.2' end='192.168.122.254' />
    <host mac='52:54:00:00:00:11' name='alice' ip='192.168.122.11' />
    <host mac='52:54:00:00:00:22' name='bob' ip='192.168.122.22' />
    <host mac='52:54:00:00:00:33' name='eve' ip='192.168.122.33' />
  </dhcp>
</ip>

With all these changes made, start the network and the guests

# virsh net-start default
# virsh start alice
# virsh start bob
# virsh start eve

After starting these three guests, the host sees the following bridge configuration

# brctl show
bridge name	bridge id		STP enabled	interfaces
virbr0		8000.fe5200000033	yes		vnic-alice
							vnic-bob
							vnic-eve

For the sake of testing, the “very important” communication between alice and bob will be a repeating ICMP ping. So log in to ‘alice’ (via the console, not the network) and leave the following command running forever:

# ping bob
PING bob.test.berrange.com (192.168.122.22) 56(84) bytes of data.
64 bytes from bob.test.berrange.com (192.168.122.22): icmp_req=1 ttl=64 time=0.790 ms
64 bytes from bob.test.berrange.com (192.168.122.22): icmp_req=2 ttl=64 time=0.933 ms
64 bytes from bob.test.berrange.com (192.168.122.22): icmp_req=3 ttl=64 time=0.854 ms
...

Attacking VMs on a hub

The first thought might be for eve to just run ‘tcpdump‘ (again via the console shell, not a network shell):

# tcpdump icmp
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
...nothing captured...

Fortunately Linux bridge devices act as switches by default, so eve won’t see any traffic flowing between alice and bob. For the sake of completeness though, I should point out that it is possible to make a Linux bridge act as a hub instead of a switch. This can be done as follows:

# brctl setfd virbr0 0
# brctl setageing virbr0 0

Switching back to the tcpdump session in eve should now show traffic between alice and bob being captured:

10:38:15.644181 IP alice.test.berrange.com > bob.test.berrange.com: ICMP echo request, id 8053, seq 29, length 64
10:38:15.644620 IP bob.test.berrange.com > alice.test.berrange.com: ICMP echo reply, id 8053, seq 29, length 64
10:38:16.645523 IP alice.test.berrange.com > bob.test.berrange.com: ICMP echo request, id 8053, seq 30, length 64
10:38:16.645886 IP bob.test.berrange.com > alice.test.berrange.com: ICMP echo reply, id 8053, seq 30, length 64

Attacking VMs on a switch using MAC spoofing

Putting the bridge into ‘hub mode’ was cheating though, so reverse that setting on the host

# brctl setageing virbr0 300

Since the switch is clever enough to only send traffic out of the port where it has seen the corresponding MAC address, perhaps eve can impersonate bob by spoofing his MAC address. MAC spoofing is quite straightforward; in the console for eve run

# ifdown eth0
# ifconfig eth0 hw ether 52:54:00:00:00:22
# ifconfig eth0 up
# ifconfig eth0 192.168.122.33/24

Now that the interface is up with eve‘s IP address, but bob‘s MAC address, the final step is to poison the host switch’s MAC address/port mapping. A couple of ping packets sent to an invented IP address (so alice/bob don’t see any direct traffic from eve) suffice to do this:

# ping -c 5 192.168.122.44
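
The effect of the poisoning can be verified on the host by dumping the bridge’s learned MAC table; the entry for 52:54:00:00:00:22 should now appear against eve‘s port rather than bob‘s:

# brctl showmacs virbr0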

To see whether eve is now receiving bob‘s traffic, launch tcpdump again in eve‘s console:

# tcpdump icmp
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
11:02:41.981567 IP alice.test.berrange.com > bob.test.berrange.com: ICMP echo request, id 8053, seq 1493, length 64
11:02:42.981624 IP alice.test.berrange.com > bob.test.berrange.com: ICMP echo request, id 8053, seq 1494, length 64
11:02:43.981785 IP alice.test.berrange.com > bob.test.berrange.com: ICMP echo request, id 8053, seq 1495, length 64
...

The original ‘ping’ session, back in alice‘s console, should have stopped receiving any replies from bob since all his traffic is being redirected to eve. Occasionally bob‘s OS might send out some packet of its own accord which re-populates the host bridge’s MAC address/port mapping, causing the ping to start again. eve can trivially re-poison the mapping at any time by sending out further packets of her own.

Attacking VMs on a switch using MAC and IP spoofing

The problem with only using MAC spoofing is that traffic from alice to bob goes into a black hole – the ping packet loss quickly shows alice that something is wrong. To try and address this, eve could also try spoofing bob‘s IP address by running:

# ifconfig eth0 192.168.122.22/24

The tcpdump session in eve should now show replies being sent back out in response to alice‘s ping requests:

# tcpdump icmp
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
11:10:55.797471 IP alice.test.berrange.com > bob.test.berrange.com: ICMP echo request, id 8053, seq 1986, length 64
11:10:55.797521 IP bob.test.berrange.com > alice.test.berrange.com: ICMP echo reply, id 8053, seq 1986, length 64
11:10:56.798914 IP alice.test.berrange.com > bob.test.berrange.com: ICMP echo request, id 8053, seq 1987, length 64
11:10:56.799031 IP bob.test.berrange.com > alice.test.berrange.com: ICMP echo reply, id 8053, seq 1987, length 64

alice‘s ping session will now be receiving replies just as she expects, except that unbeknown to her, the replies are actually being sent by eve, not bob.

Protecting VMs against MAC/IP spoofing

So eve can impersonate a ping response from bob, big deal? What about some real application level protocols like SSH or HTTPS which have security built in? These are no doubt harder to attack, but by no means impossible, particularly if you are willing to bet/rely on human/organizational weakness. For MITM attacks like this, the SSH host key fingerprint is critical. How many people actually go to the trouble of checking that the SSH host key matches what it is supposed to be, when first connecting to a new host? I’d wager very few. Rather more users will question the alert from SSH when a previously known host key changes, but I’d still put money on a non-trivial number ignoring the warning. For HTTPS, the key to avoiding MITM attacks is the x509 certificate authority system. Everyone knows that this is absolutely flawless without any compromised/rogue CA’s ;-P

What can we do about these risks for virtual machines running on the same host? libvirt provides a reasonably advanced firewall capability in both its KVM and LXC drivers. This capability is built upon the standard Linux ebtables, iptables and ip6tables infrastructure and enables rules to be set per guest TAP device. The example firewall filters that are present out of the box provide a so-called “clean traffic” ruleset. Amongst other things, these filters prevent MAC and IP address spoofing by virtual machines. Enabling this requires a very simple change to the guest domain network interface configuration.

Shut down alice, bob and eve, then alter their XML configurations (using virsh edit) so that each one now contains the following:

  <interface type='network'>                 (for the VM 'alice')
    <mac address='52:54:00:00:00:11'/>
    <source network='default'/>
    <target dev='vnic-alice'/>
    <model type='virtio'/>
    <filterref filter='clean-traffic'/>
  </interface>

  <interface type='network'>                 (for the VM 'bob')
    <mac address='52:54:00:00:00:22'/>
    <source network='default'/>
    <target dev='vnic-bob'/>
    <model type='virtio'/>
    <filterref filter='clean-traffic'/>
  </interface>

  <interface type='network'>                 (for the VM 'eve')
    <mac address='52:54:00:00:00:33'/>
    <source network='default'/>
    <target dev='vnic-eve'/>
    <model type='virtio'/>
    <filterref filter='clean-traffic'/>
  </interface>

Start the guests again and now try to repeat the previous MAC and IP spoofing attacks from eve. If all is working as intended, it should be impossible for eve to capture any traffic between alice and bob, or disrupt it in any way.
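
To confirm the filters really are active, the ebtables rules libvirt instantiated for each TAP device can be listed on the host (a sketch; the generated chain names include the TAP device names):

# ebtables -t nat -L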

The clean-traffic filter rules are written to require two configuration parameters, the whitelisted MAC address and the whitelisted IP address. The MAC address is inserted by libvirt automatically based on the declared MAC in the XML configuration. For the IP address, libvirt will sniff the DHCPOFFER responses from the DHCP server running on the host to learn the assigned IP address. There is a fairly obvious attack with this, whereby someone just runs a rogue DHCP server. It is possible to alter the design of the filter rules so that any rogue DHCP servers are blocked; however, there is one additional problem. Upon migration of guests, the new host needs to learn the IP address, but guests don’t re-run DHCP upon migration because it is supposed to be totally seamless. Thus in most cases, when using filters, the host admin will want to explicitly specify the guest’s IP address in the XML:

    <filterref filter='clean-traffic'>
      <parameter name='IP' value='192.168.122.33'/>
    </filterref>

There is quite a lot more that can be done using libvirt’s guest network filtering capabilities. One idea would be to block outbound SMTP traffic to prevent compromised guests being turned into spambots. In fact, almost anything that an administrator might wish to do inside the guest using iptables could be done in the host using libvirt’s network filtering, to provide additional protection against guest OS compromise.
This will be left as an exercise for the reader…