Previously I talked about improving the performance of logging in libvirt. This article considers the second, even bigger, performance problem I’ve tackled over the past month or so: the management of firewall rules.
There are three areas of libvirt which need to interact with the host firewall:
- Virtual networks – add rules to iptables/ip6tables to control IPv4/IPv6 forwarding and masquerading of traffic from virtual networks (eg the virbr0 device)
- MAC filtering – adds rules to ebtables to reject spoofed MAC addresses from guest NICs (pretty much obsoleted by general filtering)
- General network filtering – adds almost arbitrary rules to ebtables/iptables/ip6tables to filter guest NIC traffic
When first written, all of these areas of libvirt code would directly invoke the iptables/ip6tables/ebtables command line tools to add/remove the rules needed. Then along came firewalld. While we could pretend firewalld didn’t exist and continue invoking the CLI tools directly, this would end in tears when firewalld flushes chains / tables. When we first realized that libvirt needed to add firewalld support, time was short wrt the forthcoming Fedora release schedule, which would include firewalld by default. Thus an expedient decision was made, simply replacing any calls to iptables/ip6tables/ebtables with a call to ‘firewall-cmd --passthrough’ instead. This allowed libvirt to integrate with firewalld, so its rules would not get lost, with the minimum possible number of code changes to libvirt.
Unfortunately, it quickly became apparent that using ‘firewall-cmd’ imposes quite a severe performance penalty. Testing with libvirt’s virtual network feature showed that starting 20 virtual networks took 3 seconds with direct iptables usage and 42 seconds with “firewall-cmd --passthrough”. IOW using firewall-cmd is over 10x slower than direct invocation. The network filtering code is similarly badly affected: running our integration test suite went from 28 seconds to 479 seconds, almost 18x slower with firewall-cmd.
We’ve lived with this performance degradation in libvirt for a little while now, but it was clear that this was never going to be acceptable in the long term. This kind of performance bottleneck in firewall manipulation really hurts applications using libvirt, slowing down guest creation in OpenStack noticeably, and even making libvirtd much slower to start up. With a little slack time in my schedule I did some microbenchmarks comparing repeated invocation of firewall-cmd to a script that instead directly invoked the underlying firewalld DBus API repeatedly. While I don’t have the results anymore, it suffices to say that the tests strongly showed the overhead lay in the firewall-cmd tool, as opposed to the DBus API.
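For illustration, here is a rough sketch of the kind of comparison involved (not the original benchmark, which I no longer have). It times a harmless read-only passthrough query done repeatedly via firewall-cmd and then via what I believe is firewalld’s “direct” DBus interface; the bus, object and method names below are assumptions to verify against your firewalld version.

# Rough micro-benchmark sketch; the org.fedoraproject.FirewallD1 names are
# assumptions about firewalld's "direct" DBus interface.
import os
import subprocess
import time

import dbus

QUERY = ["-L", "INPUT", "-n"]   # read-only query, so the benchmark leaves no rules behind
N = 100

def via_firewall_cmd():
    with open(os.devnull, "w") as devnull:
        for _ in range(N):
            subprocess.check_call(["firewall-cmd", "--passthrough", "ipv4"] + QUERY,
                                  stdout=devnull)

def via_dbus():
    bus = dbus.SystemBus()
    obj = bus.get_object("org.fedoraproject.FirewallD1",
                         "/org/fedoraproject/FirewallD1")
    direct = dbus.Interface(obj, "org.fedoraproject.FirewallD1.direct")
    for _ in range(N):
        direct.passthrough("ipv4", QUERY)

for name, func in [("firewall-cmd", via_firewall_cmd), ("DBus API", via_dbus)]:
    start = time.time()
    func()
    print("%-12s %.1f seconds for %d calls" % (name, time.time() - start, N))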
Based on the test results it was clear that libvirt needed to switch to talking to firewalld directly via DBus, instead of spawning an external program to do this indirectly. Even without the performance overhead of firewall-cmd this would be just good engineering practice, particularly wrt error handling, since parsing errors from stderr is truly horrible. The virtual network code had a lot of complexity in it to cope with the fact that applying any single firewall rule might fail and thus need a set of steps to rollback to the previous state of the firewall. This was nothing compared to the complexity of the network filtering code, which would dynamically generate hairy/scary shell scripts with conditionals to either report or ignore errors that occurred and to dynamically query existing rules to decide what to create next.
What libvirt needed was an internal API for interacting with the firewall which would magically talk either to the firewalld DBus API, or to the iptables/ip6tables/ebtables tools directly, as appropriate for the host being managed. To further reduce the burden on users of the API, it would also need to have the concept of “transactions”. That is, a user of the API can define a set of rules that it wants applied, and if applying any of those rules fails, then a set of “rollback” steps would be applied to clean up. Finally, it would need the ability to register hooks which query the rule state and dynamically change later rules.
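As a purely conceptual sketch (in Python, not libvirt’s actual internal C API), the transaction idea boils down to recording a rollback step alongside each rule, and replaying those rollback steps in reverse if anything fails. The run() callback here is a hypothetical helper that would invoke iptables directly or talk to firewalld, whichever suits the host:

# Conceptual sketch of the transaction/rollback idea only; libvirt's real
# internal API is C code, and run() is a hypothetical rule-applying helper.
class FirewallTransaction(object):
    def __init__(self):
        self.rules = []                 # list of (rule_args, rollback_args)

    def add_rule(self, args, rollback=None):
        self.rules.append((args, rollback))

    def apply(self, run):
        applied = []
        for args, rollback in self.rules:
            try:
                run(args)
                applied.append(rollback)
            except Exception:
                # One rule failed: undo whatever was applied, newest first
                for undo in reversed(applied):
                    if undo is not None:
                        try:
                            run(undo)
                        except Exception:
                            pass        # rollback is best effort
                raise

# Example usage with hypothetical rule arguments:
#   tx = FirewallTransaction()
#   tx.add_rule(["-A", "FORWARD", "-i", "virbr0", "-j", "ACCEPT"],
#               rollback=["-D", "FORWARD", "-i", "virbr0", "-j", "ACCEPT"])
#   tx.apply(run)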
It took a while to figure out the right design for this libvirt internal firewall API, but once that was done, the task of converting all the libvirt code remained ahead. The virtual network and MAC filtering code was fairly easy to convert over the space of a day or so. The same could not be said of the network filtering code. The task of changing it from dynamically writing shell scripts to using the new firewall API lasted days, which turned into weeks, eventually turning into the best part of a month’s work. What saved me from insanity was that the original creator of this firewall code had the good sense to create a comprehensive set of integration tests for it. This made it much easier to identify places where I’d screwed up and has dramatically reduced the number of regressions I am likely to have created. Hopefully there are no regressions at all.
At the end of this month’s work, as well as the original integration tests, libvirt now also has a large number of new unit tests which can validate the operation of our firewall code. These will demonstrate that future changes in this area do not alter the commands used to build the firewall, complementing the integration tests which validate the final firewall ruleset. Now back to the performance. Creating 20 virtual networks originally took 29 seconds with firewall-cmd, and now takes only 3 seconds with the firewalld DBus APIs. So there is no measurable difference from direct invocation of iptables there. Running the network filter integration test suite, which took 479 seconds with firewall-cmd, now takes only 37 seconds with the firewalld DBus APIs. This is slightly slower than direct ebtables invocation, which took 29 seconds, but I think that delta is now at an acceptably low level.
This code is all merged into current libvirt GIT and will be included in the 1.2.4 release due out next week, heading to Fedora rawhide immediately thereafter. So come Fedora 21, anything firewall related in libvirt should be noticeably faster. If any application or script you work on happens to make use of firewall-cmd, then I’d strongly recommend changing it to use the DBus API directly. I don’t know why firewall-cmd is so terribly slow, but as it stands today, I can’t recommend that anyone use it if they have a non-trivial number of rules to define.
Over the past month I’ve been working on a few different areas of libvirt to address some performance bottlenecks the code has. This article is about improvements to the logging subsystem in libvirt, primarily to reduce the CPU overhead that the libvirtd daemon imposes on virtualization hosts. This might not seem like a big deal, but when you have many hundreds or thousands of containers packed into a machine, libvirtd CPU usage can start to become quite noticeable at times.
Over the years the libvirt codebase has gained more and more debugging statements, which have been worth their weight in gold when it comes to troubleshooting problems in deployments. Through the configuration files or environment variables it is possible to direct logging output to stderr, plain files, syslog or, most recently, the systemd journal, as well as to set filters that restrict which source files will emit log messages. One less well known feature, though, is that libvirt has an internal logging ring buffer which records every single log message, no matter what logging output / filter settings are configured. It was possible to send libvirt a signal to dump this ring buffer to stderr, and it would also be dumped automatically upon crash.
For some reason, we had never previously thought to measure the impact of our logging framework on the performance of libvirt. Investigating an unrelated performance problem one day, I noticed that oprofile was showing unexpectedly high counts against asprintf/memcpy/malloc functions in an area of code that I wasn’t expecting to be utilizing those functions. I recently had time to investigate this in more detail and discovered that the cause was the internal logging ring buffer. We should have realized long ago that the task of asprintf-formatting the log messages and copying them into the ring buffer would have a non-negligible CPU overhead as the number of log statements in our codebase increased.
To enable the overhead to be more precisely quantified, I wrote a short benchmark program for libvirt which simulated the behaviour of the libvirtd main thread running the event loop with 500 virtual machines / containers all doing simultaneous virtual console I/O. With the existing code, running ~50,000 iterations of this simulated event loop took 1 minute 40 seconds. After a quick hack to turn our logging code into a complete no-op, the demo program completed in only 3 seconds. IOW, the logging code was responsible for 97% of the CPU time consumed by the benchmark. This is somewhat embarrassing, to say the least :-)
Obviously we cannot leave the logging code as a complete no-op permanently, since it is incredibly valuable to have. What we need is for it to be as close to a no-op as possible, unless a specific logging output / filter is actually enabled. With this goal, unfortunately, it became clear that the idea of an always-on internal ring buffer for collecting all logs just had to go. There is simply no way to reduce the overhead of this ring buffer, since it inherently has to asprintf the log messages, which is where the majority of the performance hit comes from. At first I just tweaked the way the ring buffer operated so that it would only collect messages that were already generated for an explicitly configured log output. This reduced the runtime of the benchmark program to 4.6 seconds, which is pretty decent. After doing this though, we came to the conclusion that the ring buffer as a concept was now pretty much worthless, since the messages it contained were already being output to another log target. So in the end I just deleted the ring buffer code entirely.
The remaining delta of 1.6 seconds over the baseline 3 second execution time was potentially a result of two things. First, for each log message generated, there would be multiple string comparisons against the filename of the code emitting it, in order to determine if log messages for that file were requested for output to a log target. Second, the logging code is inside a mutex protected block to ensure serialization of outputs. The latter did not affect the benchmark program since it was single threaded, and uncontended mutex locks on Linux have negligible overhead. It would, however, affect real world usage of libvirtd since the daemon is heavily threaded. To address both of these potential performance problems, we introduced the idea of a log “category” object which has to be statically declared in each source file that intends to emit log messages. This category object has internal state which determines whether log messages from that source file will be sent on to a log output or just dropped. Without going into too much detail, the key benefit this brought is that we can determine whether any single log message needs to be output with only two lockless integer comparisons in the common case. IOW this removed the repeated string comparisons (they only need to be done once now) and removed the mutex locking. With this improvement the logging benchmark takes only 3 seconds. IOW there is no appreciable overhead from logging, except for those files where it is explicitly requested.
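The following is a Python paraphrase of the category concept, not libvirt’s real C implementation; all names here are hypothetical. The point is that the expensive filename matching happens once per configuration change, after which each log call is just a pair of integer comparisons:

# Python paraphrase of the per-file log "category" idea; names are hypothetical,
# libvirt's real implementation lives in C.
VIR_LOG_DEBUG, VIR_LOG_INFO, VIR_LOG_WARN, VIR_LOG_ERROR = 1, 2, 3, 4

filters_serial = 1                                  # bumped when outputs/filters change
filters = {"qemu/qemu_driver.c": VIR_LOG_DEBUG}     # filename -> lowest priority wanted

class LogCategory(object):
    """One instance statically declared per source file."""
    def __init__(self, filename):
        self.filename = filename
        self.serial = 0                             # config generation last seen
        self.min_priority = VIR_LOG_ERROR + 1       # default: drop everything

    def wants(self, priority):
        # Common case: two lockless integer comparisons, no strings, no mutex
        if self.serial == filters_serial:
            return priority >= self.min_priority
        # Rare case: the filter config changed, redo the string matching once
        self.min_priority = filters.get(self.filename, VIR_LOG_ERROR + 1)
        self.serial = filters_serial
        return priority >= self.min_priority

log = LogCategory("qemu/qemu_driver.c")
if log.wants(VIR_LOG_DEBUG):
    print("debug logging enabled for this file")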
This performance improvement was released last month in the 1.2.3 release of libvirt, and thus in Fedora rawhide repositories.
EDIT: BTW, if you have an older libvirt you can edit /etc/libvirt/libvirtd.conf, set “log_buffer_size=0” and then restart libvirtd, which will disable the internal ring buffer and thus avoid the worst of the CPU wastage.
I am pleased to announce that a new public release of libvirt-sandbox, version 0.5.1, is now available from:
http://sandbox.libvirt.org/download/
The packages are GPG signed with
Key fingerprint: DAF3 A6FD B26B 6291 2D0E 8E3F BE86 EBB4 1510 4FDF (4096R)
The libvirt-sandbox package provides an API layer on top of libvirt-gobject which facilitates the creation of application sandboxes using virtualization technology. An application sandbox is a virtual machine or container that runs a single application binary, directly from the host OS filesystem. In other words there is no separate guest operating system install to build or manage.
At this point in time libvirt-sandbox can create sandboxes using either LXC or KVM, and should in theory be extendable to any libvirt driver.
This release focused exclusively on bug fixing.
Changed in this release:
- Fix path to systemd binary (prefers dir /lib/systemd not /bin)
- Remove obsolete commands from virt-sandbox-service man page
- Fix delete of running service container
- Allow use of custom root dirs with ‘virt-sandbox --root DIR’
- Fix ‘upgrade’ command for virt-sandbox-service generic services
- Fix logrotate script to use virsh for listing sandboxed services
- Add ‘inherit’ option for virt-sandbox ‘-s’ security context option, to auto-copy calling process’ context
- Remove non-existent ‘-S’ option from virt-sandbox-service man page
- Fix line break formatting of man page
- Mention LIBVIRT_DEFAULT_URI in virt-sandbox-service man page
- Check some return values in libvirt-sandbox-init-qemu
- Remove unused variables
- Fix crash with partially specified mount option string
- Add man page docs for ‘ram’ mount type
- Avoid close of un-opened file descriptor
- Fix leak of file handles in init helpers
- Log a message if sandbox cleanup fails
- Cope with domain being missing when deleting container
- Improve stack trace diagnostics in virt-sandbox-service
- Fix virt-sandbox-service content copying code when faced with non-regular files.
- Improve error reporting if kernel does not exist
- Allow kernel version/path/kmod to be set with virt-sandbox
- Don’t overmount ‘/root’ in QEMU sandboxes by default
- Fix nosuid / nodev mount options for tmpfs
- Force 9p2000.u protocol version to avoid QEMU bugs
- Fix cleanup when failing to start interactive sandbox
- Create copy of kernel from /boot to allow relabelling
- Bulk re-indent of code
- Avoid crash when gateway is missing in network options
- Fix symlink target created in multi-user.target.wants
- Add ‘-p PATH’ option for virt-sandbox-service clone/delete to match ‘create’ command option.
- Only allow ‘lxc:///’ URIs with virt-sandbox-service until further notice
- Rollback state if cloning a service sandbox fails
- Add more kernel modules instead of assuming they are all builtins
- Don’t complain if some kmods are missing, as they may be builtins
- Allow --mount to be repeated with virt-sandbox-service
Thanks to everyone who contributed to this release
Historically, running a Linux OS inside an LXC guest has required execution of a set of hacky scripts which apply a bunch of customizations to the default OS install to make it work in the constrained container environment. One of the many benefits to Fedora of the switch over to systemd has been that a default Fedora install behaves much more sensibly when run inside containers. For example, systemd will skip running udev inside a container, since containers are not given permission to mknod – /dev is pre-populated with the whitelist of devices the container is allowed to use. As such, running Fedora inside a container is really not much more complicated than invoking yum to install desired packages into a chroot, then invoking virt-install to configure the LXC guest.
As a proof of concept, on Fedora 19 I only needed to do the following to set up a Fedora 19 environment suitable for execution inside LXC:
# yum -y --releasever=19 --nogpg --installroot=/var/lib/libvirt/filesystems/mycontainer \
--disablerepo='*' --enablerepo=fedora install \
systemd passwd yum fedora-release vim-minimal openssh-server procps-ng
# echo "pts/0" >> /var/lib/libvirt/filesystems/mycontainer/etc/securetty
# chroot /var/lib/libvirt/filesystems/mycontainer /bin/passwd root
It would be desirable to avoid the manual editing of /etc/securetty. LXC guests get their default virtual consoles backed by a /dev/pts/0 device, which isn’t listed in the securetty file by default. Perhaps it is as simple as just adding that device node unconditionally. We just have to think about whether there’s a reason not to do that which would impact bare metal. With the virtual root environment ready, virt-install can now be used to configure the container with libvirt:
# virt-install --connect lxc:/// --name mycontainer --ram 800 \
--filesystem /var/lib/libvirt/filesystems/mycontainer,/
virt-install will create the XML config libvirt wants, and boot the guest, opening a connection to the primary text console. This should display boot up messages from the instance of systemd running as the container’s init process, and present a normal text login prompt.
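If you want to sanity check the result from a script, the libvirt Python bindings can confirm the container defined above is running; the URI and name here simply match the virt-install example:

# Confirm the container from the example above exists and is running,
# using the libvirt Python bindings.
import libvirt

conn = libvirt.open("lxc:///")
dom = conn.lookupByName("mycontainer")
print("%s is %s" % (dom.name(), "running" if dom.isActive() else "shut off"))
conn.close()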
If attempting this with the systemd-nspawn command, login would fail because the PAM modules’ audit code will reject all login attempts. This is really unhelpful behaviour by the PAM modules which can’t be disabled by any config, except for booting the entire host with audit=0, which is not very desirable. Fortunately, however, virt-install will configure a separate network namespace for the container by default, which will prevent the PAM module from talking to the kernel audit service entirely, giving it an ECONNREFUSED error. By a stroke of good luck, the PAM modules treat ECONNREFUSED as being equivalent to booting with audit=0, so everything “just works”. This is a nice case of two bugs cancelling out to leave no bug :-)
While the above commands are fairly straightforward, it is a goal of ours to simplify life even further, into a single command. We would like to provide a command that looks something like this:
# virt-bootstrap --connect lxc:/// --name mycontainer --ram 800 \
--root /var/lib/libvirt/filesystems/mycontainer \
--osid fedora19
The idea is that the ‘--osid’ value will be looked up in the libosinfo database. This will have details of the software repository for that OS, and whether it uses yum/apt/ebuild/something else. virt-bootstrap will then invoke the appropriate packaging tool to populate the root filesystem, and then boot the container, all in one single step.
One final point is that LXC in Fedora still can’t really be considered secure without the use of SELinux. The commands I describe above don’t do anything to enable SELinux protection of the container at this time. This is obviously something that ought to be fixed. Separate from this, upstream libvirt now has support for the kernel user namespace feature. This enables the plain old DAC framework to provide a secure container environment. Unfortunately this kernel feature is still not available in Fedora kernel builds. It is blocked on upstream completion of patches for XFS. Fortunately this work seems to be moving forward again, so if we’re lucky it might just be possible to enable user namespaces in Fedora 20, finally making LXC reasonably secure by default even without SELinux.
Historically access control to libvirt has been very coarse, with only three privilege levels “anonymous” (only authentication APIs are allowed), “read-only” (only querying information is allowed) and “read-write” (anything is allowed). Over the past few months I have been working on infrastructure inside libvirt to support fine grained access control policies. The initial code drop arrived in libvirt 1.1.0, and the wiring up of authorization checks in drivers was essentially completed in libvirt 1.1.1 (with the exception of a handful of APIs in the legacy Xen driver code). We did not wish to tie libvirt to any single access control system, so the framework inside libvirt is modular, to allow for multiple plugins to be developed. The only plugin provided at this time makes use of polkit for its access control checks. There was a second proof of concept plugin that used SELinux to provide MAC, but there are a number of design issues still to be resolved with that, so it is not merged at this time.
The basic framework integration
The libvirt library exposes a number of objects (virConnectPtr, virDomainPtr, virNetworkPtr, virNWFilterPtr, virNodeDevicePtr, virInterfacePtr, virSecretPtr, virStoragePoolPtr, virStorageVolPtr), with a wide variety of operations defined in the public API. Right away it was clear that we did not wish to describe access controls based on the names of the APIs themselves. For each object there are a great many APIs which all imply the same level of privilege, so it made sense to collapse those APIs onto single permission bits. At the same time, some individual APIs could have multiple levels of privilege depending on the flags set in parameters, so would expand to multiple permission bits. Thus the first task was to come up with a list of permission bits able to cover all APIs. This was encoded in the internal viraccessperm.h header file. With the permissions defined, the next big task was to define a mapping between permissions and APIs. This mapping was encoded as magic comments in the RPC protocol definition file, which in turn allows the code for performing access control checks to be automatically generated, thus minimizing the scope for coding errors, such as forgetting to perform checks in a method, or performing the wrong checks.
The final coding step was for the automatically generated ACL check methods to be inserted into each of the libvirt driver APIs. Most of the ACL checks validate the input parameters to ensure the caller is authorized to operate on the object in question. In a number of methods, the ACL checks are instead used to restrict / filter the data returned. For example, if asking for a list of domains, the returned list must be filtered to only those the client is authorized to see. While the code for checking permissions was auto-generated, it is not practical to automatically insert the checks into each libvirt driver. It was, however, possible to write scripts to perform static analysis on the code to validate that each driver had the full set of access control checks present. Of course it helps to tell developers / administrators which permissions apply to each API, so the code which generates the API reference documentation was also enhanced to list the permissions required in each circumstance.
The polkit access control driver
Libvirt has long made use of polkit for authenticating connections over its UNIX domain sockets. It was thus natural to expand on this work to make use of polkit as a driver for the access control framework. Historically this would not have been practical, because the polkit access control rule format did not provide a way for the admin to configure access control checks on individual object instances – only object classes. In polkit 0.106, however, a new engine was added which allowed admins to use javascript to write access control policies. The libvirt polkit driver takes object class names and permission names to form polkit action names. For example, the “getattr” permission on the virDomainPtr class maps to the polkit org.libvirt.api.domain.getattr permission. When performing an access control check, libvirt then populates the polkit authorization “details” map with one or more attributes which uniquely identify the object instance. For example, the virDomainPtr object gets “connect_driver” (libvirt driver name), “domain_uuid” (globally unique UUID), and “domain_name” (host local unique name) details set. These details can be referenced in the javascript policy to scope rules to individual object instances.
Consider a local user berrange who has been granted permission to connect to libvirt in full read-write mode. The goal is to only allow them to use the QEMU driver and not the Xen or LXC drivers which are also available in libvirtd. To achieve this we need to write a rule which checks whether the connect_driver attribute is QEMU, and match on an action name of org.libvirt.api.connect.getattr. Using the javascript rules format, this ends up written as:
polkit.addRule(function(action, subject) {
    if (action.id == "org.libvirt.api.connect.getattr" &&
        subject.user == "berrange") {
        if (action.lookup("connect_driver") == 'QEMU') {
            return polkit.Result.YES;
        } else {
            return polkit.Result.NO;
        }
    }
});
As another example, consider a local user berrange who has been granted permission to connect to libvirt in full read-write mode. The goal is to only allow them to see the domain called demo on the LXC driver. To achieve this we need to write a rule which checks whether the connect_driver attribute is LXC and the domain_name attribute is demo, and match on an action name of org.libvirt.api.domain.getattr. Using the javascript rules format, this ends up written as:
polkit.addRule(function(action, subject) {
    if (action.id == "org.libvirt.api.domain.getattr" &&
        subject.user == "berrange") {
        if (action.lookup("connect_driver") == 'LXC' &&
            action.lookup("domain_name") == 'demo') {
            return polkit.Result.YES;
        } else {
            return polkit.Result.NO;
        }
    }
});
Further work
While the access control support in libvirt 1.1.1 provides a useful level of functionality, there is still more that can be done in the future. First of all, the polkit driver needs to have some performance optimization work done. It currently relies on invoking the ‘pkcheck’ binary to validate permissions. While this is fine for hosts with small numbers of objects, it will quickly become too costly. The solution here is to directly use the DBus API from inside libvirt.
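To give an idea of what the direct approach could look like, here is a sketch in Python using dbus-python to call polkit’s CheckAuthorization method for the domain example shown earlier. It uses the current process as the subject purely for illustration (libvirtd would identify its client instead), and the DBus details should be treated as assumptions to check against the polkit documentation:

# Sketch of checking a libvirt permission via polkit's DBus API rather than
# spawning pkcheck. The current process is used as the subject only for
# illustration; libvirtd would identify its client process instead.
import os
import dbus

def proc_start_time(pid):
    # starttime is field 22 of /proc/<pid>/stat, counted after the "(comm)" field
    with open("/proc/%d/stat" % pid) as f:
        return int(f.read().rsplit(") ", 1)[1].split()[19])

bus = dbus.SystemBus()
authority = dbus.Interface(
    bus.get_object("org.freedesktop.PolicyKit1",
                   "/org/freedesktop/PolicyKit1/Authority"),
    "org.freedesktop.PolicyKit1.Authority")

pid = os.getpid()
subject = ("unix-process", {"pid": dbus.UInt32(pid),
                            "start-time": dbus.UInt64(proc_start_time(pid))})
details = {"connect_driver": "LXC", "domain_name": "demo"}

(is_authorized, is_challenge, _) = authority.CheckAuthorization(
    subject,
    "org.libvirt.api.domain.getattr",
    details,
    0,    # flags: 0 means no interactive authentication
    "")   # cancellation id, unused
print("authorized" if is_authorized else "denied")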
The latest polkit framework is fairly flexible in terms of letting us identify object instances via the details map it associates with every access control check. It is far less flexible in terms of identifying the client user. It is fairly locked into the idea of identifying users via remote PID or DBus service name, and then exposing the username/groupnames to the javascript rules files. While this works fine for local libvirt connections over UNIX sockets, it is pretty much useless for connections arriving on libvirt’s TCP sockets. In the latter case the libvirt user is identified by a SASL username (typically a Kerberos principal name), or via an x509 certificate distinguished name (when using client certs with TLS). There’s no official way to feed the SASL username or x509 distinguished name down to the polkit javascript authorization rules files. Requests upstream to allow extra identifying attributes to be provided for the authorization subject have not been productive, so I’m considering (ab-)using the “details” map to provide identifying info for the user, alongside the identifying info for the object.
As mentioned earlier, there was a proof of concept SELinux driver written, that is yet to be finished. The work there is around figuring out / defining what the SELinux context is for each object to be checked and doing some work on SELinux policy. I think of this work as providing a capability similar to that done in PostgreSQL to enable SELinux MAC checks. It would be very nice to have a system which provides end-to-end MAC. I refer to this as sVirt 2.0 – the first (current) version of sVirt protected the host from guests – the second (future) version would also protect the host from management clients.
The legacy XenD based Xen driver has a couple of methods which lack access control, due to the inability to get access to the identifying attributes for the objects being operated upon. While we encourage people to use the new libxl based Xen driver, it is desirable to have the legacy Xen driver fully controlled for those people using legacy virtualization hosts. Some code refactoring will be required to fix the legacy Xen driver, likely at the cost of making some methods less efficient.
If there is user demand, work may be done to write an access control driver which is implemented natively, entirely within libvirt. While the polkit javascript engine is fairly flexible, I’m not much of a fan of having administrators write code to define their access control policy. It would be preferable to have a way to describe the policy that was entirely declarative. With a libvirt native access control driver, it would be possible to create a simple declarative policy file format tailored to our precise needs. This would let us solve the problem of providing identifying info about the subject being checked. It would also have the potential to be more scalable by avoiding the need to interact with any remote authorization daemons over DBus. The latter could be a big deal when an individual API call needs to check thousands of permissions at once. The flipside, of course, is that a libvirt specific access control driver is not good for interoperability across the broader system – the standardized use of polkit is good in that respect. There’s no technical reason why we can’t support multiple access control drivers to give the administrator choice / flexibility.
Finally, this work is all scheduled to arrive in Fedora 20, so anyone interested in testing it should look at current rawhide, or keep an eye out for the Fedora 20 virtualization test day.
EDITED: Aug 15th: Change example use of ‘action._detail_connect_driver’ to ‘action.lookup(“connect_driver”)’