Building application sandboxes with libvirt, LXC & KVM
I have mentioned in passing every now & then over the past few months, that I have been working on a tool for creating application sandboxes using libvirt, LXC and KVM. Last Thursday, I finally got around to creating a first public release of a package that is now called libvirt-sandbox. Before continuing it is probably worth defining what I consider the term “application sandbox” to mean. My working definition is that an “application sandbox” is simply a way to confine the execution environment of an application, limiting the access it has to OS resources. To me one notable point is that there is no need for a separate / special installation of the application to be confined. An application sandbox ought to be able to run any existing application installed in the OS.
Background motivation & prototype
For a few Fedora releases, users have had the SELinux sandbox command which will execute a command with a strictly confined SELinux context applied. It is also able to make limited use of the kernel filesystem namespace feature, to allow changes to the mount table inside the sandbox. For example, the common case is to put in place a different $HOME. The SELinux sandbox has been quite effective, but there is a limit to what can be done with SELinux policy alone, as evidenced by the need to create a setuid helper to enable use of the kernel namespace feature. Architecturally this gets even more problematic as new feature requests need to be dealt with.
As most readers are no doubt aware, libvirt provides a virtualization management API, with support for a wide variety of virtualization technologies. The KVM driver is easily the most advanced and actively developed driver for libvirt with a very wide array of features for machine based virtualization. In terms of container based virtualization, the LXC driver is the most advanced driver in libvirt, often getting new features “for free” since it shares a lot of code with the KVM driver, in particular anything cgroup based. The LXC driver has always had the ability to pass arbitrary host filesystems through to the container, and the KVM driver gained similar capabilities last year with the inclusion of support for virtio 9p filesystems. One of the well known security features in libvirt is sVirt, which leverages MAC technology like SELinux to strictly confine the execution environment of QEMU. This has also now been adapted to work for the LXC driver.
Looking at the architecture of the SELinux sandbox command last year, it occurred to me that the core concepts mapped very well to the host filesystem passthrough & sVirt features in libvirt’s KVM & LXC drivers. In other words, it ought to be possible to create application sandboxes using the libvirt API and suitably advanced drivers like KVM or LXC. A few weeks of hacking resulted in a proof of concept tool, virt-sandbox, which can run simple commands in sandboxes built on LXC or KVM.
The libvirt-sandbox API
A command line tool for running applications inside a sandbox is great, but even more useful would be an API for creating application sandboxes that programmers can use directly. While libvirt provides an API that is portable across different virtualization technologies, it cannot magically hide the differences in feature set or architecture between the technologies. Thus the decision was taken to create a new library called libvirt-sandbox that provides a higher level API for managing application sandboxes, built on top of libvirt. The virt-sandbox command from the proof of concept would then be re-implemented using this library API.
The libvirt-sandbox library is built using GObject to enable it to be accessible to any programming language via GObject Introspection. The basic idea is that the programmer simply defines the desired characteristics of the sandbox, such as the command to be executed, any arguments, filesystems to be exposed from the host, any bind mounts, private networking configuration, etc. From this configuration description, libvirt-sandbox will decide upon & construct a libvirt guest XML configuration that can actually provide the requested characteristics. In other words, the libvirt-sandbox API is providing a layer of policy above libvirt, to isolate the application developer from the implementation details of the underlying hypervisor.
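To make the "configuration in, guest XML out" idea a little more concrete, here is a rough Python sketch. It is not the libvirt-sandbox API: the SandboxConfig and HostMount types and the to_lxc_domain_xml function are hypothetical names invented for illustration, the memory size is an arbitrary assumption, and the XML is a heavily simplified LXC-style libvirt domain description rather than what the library really generates.

```python
from dataclasses import dataclass, field
from typing import List
import xml.etree.ElementTree as ET


@dataclass
class HostMount:
    """A host directory exposed inside the sandbox (hypothetical type)."""
    source: str
    target: str
    readonly: bool = True


@dataclass
class SandboxConfig:
    """Desired sandbox characteristics (hypothetical, not the real API)."""
    name: str
    command: List[str]
    mounts: List[HostMount] = field(default_factory=list)


def to_lxc_domain_xml(cfg: SandboxConfig) -> str:
    """Translate the abstract description into simplified libvirt domain XML."""
    domain = ET.Element("domain", type="lxc")
    ET.SubElement(domain, "name").text = cfg.name
    # Assumption: a fixed 512 MiB memory limit, chosen only for the example.
    ET.SubElement(domain, "memory", unit="KiB").text = "524288"

    # For container guests the "OS" is just the binary to run as init.
    os_el = ET.SubElement(domain, "os")
    ET.SubElement(os_el, "type").text = "exe"
    ET.SubElement(os_el, "init").text = cfg.command[0]
    for arg in cfg.command[1:]:
        ET.SubElement(os_el, "initarg").text = arg

    # Each host mount becomes a <filesystem> passthrough device.
    devices = ET.SubElement(domain, "devices")
    for mount in cfg.mounts:
        fs = ET.SubElement(devices, "filesystem", type="mount")
        ET.SubElement(fs, "source", dir=mount.source)
        ET.SubElement(fs, "target", dir=mount.target)
        if mount.readonly:
            ET.SubElement(fs, "readonly")
    ET.SubElement(devices, "console", type="pty")
    return ET.tostring(domain, encoding="unicode")


if __name__ == "__main__":
    cfg = SandboxConfig(
        name="demo",
        command=["/bin/date", "-u"],
        mounts=[HostMount("/", "/")],
    )
    print(to_lxc_domain_xml(cfg))
```

The point of the sketch is only the division of labour: the caller states what it wants, and a policy layer decides how that maps onto a particular driver. A KVM backend would take the same SandboxConfig and emit quite different XML.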
Building sandboxes using LXC is quite straightforward, since application confinement is a core competency of LXC. Thus I will move straight to the KVM implementation, which is where the real fun is. Booting up an entire virtual machine probably sounds like quite a slow process, but it really need not be, particularly if you have a well constrained hardware definition which avoids any need for probing. People also generally assume that running a KVM guest means having a guest operating system install. This is absolutely not acceptable for application sandboxing, and indeed not actually necessary.
In a nutshell, libvirt-sandbox creates a new initrd image containing a custom init binary. This init binary simply loads the virtio-9p kernel module and then mounts the host OS’ root filesystem as the guest’s root filesystem, readonly of course. It then hands off to a second bootstrap process which runs the desired application binary and forwards I/O back to the host OS, until the sandboxed application exits. Finally the init process powers off the virtual machine.
To get an idea of the overhead, the /bin/false binary can be executed inside a KVM sandbox with an overall execution time of 3 seconds. That is the total time for libvirt to start QEMU, QEMU to run its BIOS, the BIOS to load the kernel + initrd, the kernel to boot up, /bin/false to run, and the kernel to shutdown & QEMU to exit. I think 3 seconds is pretty impressive to do all that. This is a constant overhead, so for a long running command like an MP3 encoder, it disappears into the background noise. With sufficient optimization, I’m fairly sure we could get the overhead down to approx 2 seconds.
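For anyone wanting to reproduce this kind of measurement, a minimal Python sketch follows. The helper names time_command and overhead_fraction are made up for this example, and the demo times a trivial Python child process so that it runs anywhere; to time a real sandbox you would substitute an argument list such as virt-sandbox -c qemu:///session /bin/false. The 300 second workload in the demo is an illustrative assumption, not a measurement.

```python
import subprocess
import sys
import time


def time_command(argv):
    """Run argv to completion and return (elapsed_seconds, exit_status)."""
    start = time.perf_counter()
    completed = subprocess.run(
        argv,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    elapsed = time.perf_counter() - start
    return elapsed, completed.returncode


def overhead_fraction(startup_seconds, workload_seconds):
    """Share of total wall clock time spent on a constant startup cost."""
    return startup_seconds / (startup_seconds + workload_seconds)


if __name__ == "__main__":
    # Stand-in command so the sketch is runnable without libvirt installed.
    # Assumption: replace with the virt-sandbox invocation you want to time.
    argv = [sys.executable, "-c", "pass"]
    elapsed, status = time_command(argv)
    print(f"elapsed: {elapsed:.3f}s, exit status: {status}")

    # A constant startup cost matters less the longer the workload runs.
    # The 300 second workload is purely an illustrative figure.
    share = overhead_fraction(elapsed, 300.0)
    print(f"share of a 300s job spent on startup: {share:.2%}")
```

Wall clock timing like this includes everything from process launch to exit, which is the same "overall execution time" notion used above.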
Using the virt-sandbox command
The Fedora review of the libvirt-sandbox package was nice & straightforward, so the package is already available in rawhide, ready for testing the VirtSandbox F17 feature. The virt-sandbox command is provided by the libvirt-sandbox RPM package:
# yum install libvirt-sandbox
Assuming libvirt is already installed & able to run either LXC or KVM guests, everything is ready to use immediately.
A first example is to run the ‘/bin/date’ command inside a KVM sandbox:
$ virt-sandbox -c qemu:///session /bin/date
Thu Jan 12 22:30:03 GMT 2012
You want proof that this really is running an entire KVM guest ? How about looking at the /proc/cpuinfo contents:
$ virt-sandbox -c qemu:///session /bin/cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 2
model name      : QEMU Virtual CPU version 1.0
stepping        : 3
cpu MHz         : 2793.084
cache size      : 4096 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 4
wp              : yes
flags           : fpu de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pse36 clflush mmx fxsr sse sse2 syscall nx lm up rep_good nopl pni cx16 hypervisor lahf_lm
bogomips        : 5586.16
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:
How about using LXC instead of KVM, and providing an interactive console instead of just a one-shot command ? Yes, we can do that too:
$ virt-sandbox -c lxc:/// /bin/sh
sh-4.2$ ps -axuwf
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0 165436  3756 pts/0    Ss+  22:31   0:00 libvirt-sandbox-init-lxc
berrange    24  0.0  0.1 167680  4688 pts/0    S+   22:31   0:00 libvirt-sandbox-init-common
berrange    47  0.0  0.0  13852  1608 pts/1    Ss   22:31   0:00  \_ /bin/sh
berrange    48  0.0  0.0  13124   996 pts/1    R+   22:31   0:00      \_ ps -axuwf
Notice how we only see the processes from our sandbox, none from the host OS. There are many more examples I’d like to illustrate, but this post is already far too long.
Future development
This blog post might give the impression that everything is complete & operational, but that is far from the truth. This is only the bare minimum functionality to enable some real world usage. Things that are yet to be dealt with include:
- Write suitable SELinux policy extensions to allow KVM to access host OS filesystems in readonly mode. Currently you need to run in permissive mode, which is obviously something that needs solving before F17
- Turn the virt-viewer command code for SPICE/VNC into a formal API and use that to provide a graphical sandbox running Xorg.
- Integrate a tool that is able to automatically create sandbox instances for system services like apache to facilitate confined vhosting deployments
- Correctly propagate exit status from the sandboxed command to the host OS
- Disentangle stderr and stdout from the sandboxed command
- Figure out how to make dhclient work nicely when / is readonly and resolv.conf must be updated in-place
- Expose all the libvirt performance tuning controls to allow disk / net I/O controls, CPU scheduling, NUMA affinity, etc
- Wire up libvirt’s firewall capability to allow detailed filtering of network traffic to/from sandboxes
- Much more…
For those attending FOSDEM this year, I will be giving a presentation about libvirt-sandbox in the virt/cloud track.
Oh and as well as the released tar.gz mentioned in the first paragraph, or the Fedora RPM, the code is all available in GIT
What can I say? Cool
Dan,
Apart from file-sharing b/n guest and host, can you shed some more light on where 9p filesystems are used in more real-life scenarios?
This looks really cool.
Why should resolv.conf be updated?
Going even further, would it be possible to have some kind of “networking pass through” too, where the guest would not have any socket of its own but request them from the host instead?
@Marc: I think that some people may want to run some network apps and isolate them in a specific network topo that’s why resolv.conf may have to be updated. And there is probably the same problem with /etc/mtab if you plan to play with mounting/umounting filesystems.
In the embedded world, a typical fix for this kind of problem is to mount a tmpfs filesystem (/var/tmp), with files like /etc/resolv.conf being symlinks to /var/tmp/resolv.conf, which is rw. In this case, the problem is how to create such symlinks :/
Dan, what about a sandboxing criterion akin to java applets, where
not both of local files & network connections may be used (to prevent
data leakage)? Or at least subsetting the host filesystem that is
readable to the sandboxed app?
By default, a sandbox created with these tools will have zero network access. You have to explicitly enable networking if the application needs it. Eventually we will also have the ability to specify filters for network traffic, so you can whitelist what sites the sandbox can connect to. On the filesystem side, the sandbox gets a readonly view of the root filesystem, although the SELinux policy will not actually allow the sandbox to read some areas. In addition, for any areas of the root filesystem, you can specify custom mounts to completely hide/replace the content from the host with alternate content.
I presume it is far more secure than using a full blown VM.
If it runs its own kernel, will it inherit or lose the security features that a Grsecurity patched host kernel provides?
The host kernel’s security features apply to execution of the QEMU binary running on the host. They don’t have a direct interaction with the guest kernel
Would this sandbox prevent applications on an X server from snooping keypresses from other programs (like su/gksu)?
How usable is the project today? Do you plan to continue development on it?
@Benotic : virt-sandbox is being actively developed. Fedora 17 (released today) ships with virt-sandbox, and this is one of the more interesting features in F17. See:
* http://fedoraproject.org/wiki/Features/VirtSandbox
* http://docs.fedoraproject.org/en-US/Fedora/17/html/Release_Notes/sect-Release_Notes-Changes_for_Sysadmin.html#id512652
Not sure why I am getting the following error in Fedora 17…
# virt-sandbox -c lxc:/// /bin/sh
Unable to start sandbox: Failed to create domain: internal error guest failed to start: 2012-07-14 08:38:32.954+0000: 1229: info : libvirt version: 0.9.11.4, package: 3.fc17 (Fedora Project, 2012-06-28-13:49:23, x86-03.phx2.fedoraproject.org)
2012-07-14 08:38:32.954+0000: 1229: error : lxcControllerRun:1504 : Failed to mount devpts on //dev/pts: Invalid argument