Historically, running a Linux OS inside an LXC guest has required the execution of a set of hacky scripts that apply a bunch of customizations to the default OS install to make it work in the constrained container environment. One of the many benefits to Fedora of the switch to systemd has been that a default Fedora install behaves much more sensibly when run inside containers. For example, systemd will skip running udev inside a container, since containers are not given permission to mknod – /dev is pre-populated with the whitelist of devices the container is allowed to use. As such, running Fedora inside a container is really not much more complicated than invoking yum to install the desired packages into a chroot, then invoking virt-install to configure the LXC guest.
As a proof of concept, on Fedora 19 I only needed to do the following to set up a Fedora 19 environment suitable for execution inside LXC:
# yum -y --releasever=19 --nogpg --installroot=/var/lib/libvirt/filesystems/mycontainer \
--disablerepo='*' --enablerepo=fedora install \
systemd passwd yum fedora-release vim-minimal openssh-server procps-ng
# echo "pts/0" >> /var/lib/libvirt/filesystems/mycontainer/etc/securetty
# chroot /var/lib/libvirt/filesystems/mycontainer /bin/passwd root
It would be desirable to avoid the manual editing of /etc/securetty. LXC guests get their default virtual consoles backed by a /dev/pts/0 device, which isn’t listed in the securetty file by default. Perhaps it is as simple as just adding that device node unconditionally; I just have to think about whether there’s a reason not to do that which would impact bare metal. With the virtual root environment ready, virt-install can now be used to configure the container with libvirt:
# virt-install --connect lxc:/// --name mycontainer --ram 800 \
--filesystem /var/lib/libvirt/filesystems/mycontainer,/
virt-install will create the XML config libvirt wants, and boot the guest, opening a connection to the primary text console. This should display boot up messages from the instance of systemd running as the container’s init process, and present a normal text login prompt.
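For reference, the libvirt domain XML that virt-install generates for this kind of container is quite short. The sketch below is illustrative rather than a verbatim copy of virt-install’s output; the values correspond to the example above:
<domain type='lxc'>
  <name>mycontainer</name>
  <memory>819200</memory>                     <!-- 800 MB, expressed in KiB -->
  <os>
    <type>exe</type>
    <init>/sbin/init</init>                   <!-- systemd runs as the container's init -->
  </os>
  <devices>
    <filesystem type='mount'>
      <source dir='/var/lib/libvirt/filesystems/mycontainer'/>
      <target dir='/'/>
    </filesystem>
    <console type='pty'/>
  </devices>
</domain>
The actual XML created can be inspected afterwards with “virsh -c lxc:/// dumpxml mycontainer”.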
If attempting this with the systemd-nspawn command, login would fail because the PAM modules’ audit code will reject all login attempts. This is really unhelpful behaviour by the PAM modules which can’t be disabled by any config, except for booting the entire host with audit=0, which is not very desirable. Fortunately, however, virt-install will configure a separate network namespace for the container by default, which prevents the PAM module from talking to the kernel audit service entirely, giving it an ECONNREFUSED error. By a stroke of good luck, the PAM modules treat ECONNREFUSED as being equivalent to booting with audit=0, so everything “just works”. This is a nice case of two bugs cancelling out to leave no bug :-)
While the above commands are fairly straightforward, it is a goal of ours to simplify life even further, into a single command. We would like to provide a command that looks something like this:
# virt-bootstrap --connect lxc:/// --name mycontainer --ram 800 \
--root /var/lib/libvirt/filesystems/mycontainer \
--osid fedora19
The idea is that the ‘--osid’ value will be looked up in the libosinfo database. This will have details of the software repository for that OS, and whether it uses yum/apt/ebuild/something else. virt-bootstrap will then invoke the appropriate packaging tool to populate the root filesystem, and then boot the container, all in a single step.
One final point is that LXC in Fedora still can’t really be considered secure without the use of SELinux. The commands I describe above don’t do anything to enable SELinux protection of the container at this time. This is obviously something that ought to be fixed. Separate from this, upstream libvirt now has support for the kernel user namespace feature. This enables the plain old DAC framework to provide a secure container environment. Unfortunately this kernel feature is still not available in Fedora kernel builds. It is blocked on upstream completion of patches for XFS. Fortunately this work seems to be moving forward again, so if we’re lucky it might just be possible to enable user namespaces in Fedora 20, finally making LXC reasonably secure by default even without SELinux.
Historically access control to libvirt has been very coarse, with only three privilege levels “anonymous” (only authentication APIs are allowed), “read-only” (only querying information is allowed) and “read-write” (anything is allowed). Over the past few months I have been working on infrastructure inside libvirt to support fine grained access control policies. The initial code drop arrived in libvirt 1.1.0, and the wiring up of authorization checks in drivers was essentially completed in libvirt 1.1.1 (with the exception of a handful of APIs in the legacy Xen driver code). We did not wish to tie libvirt to any single access control system, so the framework inside libvirt is modular, to allow for multiple plugins to be developed. The only plugin provided at this time makes use of polkit for its access control checks. There was a second proof of concept plugin that used SELinux to provide MAC, but there are a number of design issues still to be resolved with that, so it is not merged at this time.
The basic framework integration
The libvirt library exposes a number of objects (virConnectPtr, virDomainPtr, virNetworkPtr, virNWFilterPtr, virNodeDevicePtr, virInterfacePtr, virSecretPtr, virStoragePoolPtr, virStorageVolPtr), with a wide variety of operations defined in the public API. Right away it was clear that we did not wish to describe access controls based on the names of the APIs themselves. For each object there are a great many APIs which all imply the same level of privilege, so it made sense to collapse those APIs onto single permission bits. At the same time, some individual APIs could have multiple levels of privilege depending on the flags set in parameters, so would expand to multiple permission bits. Thus the first task was to come up with a list of permission bits able to cover all APIs. This was encoded in the internal viraccessperm.h header file. With the permissions defined, the next big task was to define a mapping between permissions and APIs. This mapping was encoded as magic comments in the RPC protocol definition file. This in turn allows the code for performing access control checks to be automatically generated, thus minimizing the scope for coding errors, such as forgetting to perform checks in a method, or performing the wrong checks.
The final coding step was for the automatically generated ACL check methods to be inserted into each of the libvirt driver APIs. Most of the ACL checks validate the input parameters to ensure the caller is authorized to operate on the object in question. In a number of methods, the ACL checks are used to restrict / filter the data returned. For example, if asking for a list of domains, the returned list must be filtered to only those the client is authorized to see. While the code for checking permissions was auto-generated, it is not practical to automatically insert the checks into each libvirt driver. It was, however, possible to write scripts to perform static analysis on the code to validate that each driver had the full set of access control checks present. Of course it helps to tell developers / administrators which permissions apply to each API, so the code which generates the API reference documentation was also enhanced so that the API reference lists the permissions required in each circumstance.
The polkit access control driver
Libvirt has long made use of polkit for authenticating connections over its UNIX domain sockets. It was thus natural to expand on this work to make use of polkit as a driver for the access control framework. Historically this would not have been practical, because the polkit access control rule format did not provide a way for the admin to configure access control checks on individual object instances – only object classes. In polkit 0.106, however, a new engine was added which allowed admins to use javascript to write access control policies. The libvirt polkit driver combines object class names and permission names to form polkit action names. For example, the “getattr” permission on the virDomainPtr class maps to the polkit org.libvirt.api.domain.getattr permission. When performing an access control check, libvirt then populates the polkit authorization “details” map with one or more attributes which uniquely identify the object instance. For example, the virDomainPtr object gets “connect_driver” (libvirt driver name), “domain_uuid” (globally unique UUID), and “domain_name” (host local unique name) details set. These details can be referenced in the javascript policy to scope rules to individual object instances.
Consider a local user berrange who has been granted permission to connect to libvirt in full read-write mode. The goal is to only allow them to use the QEMU driver and not the Xen or LXC drivers which are also available in libvirtd. To achieve this we need to write a rule which checks whether the connect_driver attribute is QEMU, and matches on an action name of org.libvirt.api.connect.getattr. Using the javascript rules format, this ends up written as
polkit.addRule(function(action, subject) {
if (action.id == "org.libvirt.api.connect.getattr" &&
subject.user == "berrange") {
if (action.lookup("connect_driver") == 'QEMU') {
return polkit.Result.YES;
} else {
return polkit.Result.NO;
}
}
});
As another example, consider a local user berrange who has been granted permission to connect to libvirt in full read-write mode. The goal is to only allow them to see the domain called demo on the LXC driver. To achieve this we need to write a rule which checks whether the connect_driver attribute is LXC and the domain_name attribute is demo, and matches on an action name of org.libvirt.api.domain.getattr. Using the javascript rules format, this ends up written as
polkit.addRule(function(action, subject) {
if (action.id == "org.libvirt.api.domain.getattr" &&
subject.user == "berrange") {
if (action.lookup("connect_driver") == 'LXC' &&
action.lookup("domain_name") == 'demo') {
return polkit.Result.YES;
} else {
return polkit.Result.NO;
}
}
});
Further work
While the access control support in libvirt 1.1.1 provides a useful level of functionality, there is still more that can be done in the future. First of all, the polkit driver needs to have some performance optimization work done. It currently relies on invoking the ‘pkcheck’ binary to validate permissions. While this is fine for hosts with small numbers of objects, it will quickly become too costly. The solution here is to directly use the DBus API from inside libvirt.
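To illustrate the cost, each permission check today is roughly equivalent to spawning a command along these lines (the shell variables are placeholders, and the exact invocation libvirt constructs internally may differ):
$ pkcheck --action-id org.libvirt.api.domain.getattr \
          --process ${CLIENT_PID},${CLIENT_START_TIME} \
          --detail connect_driver QEMU \
          --detail domain_uuid ${DOMAIN_UUID} \
          --detail domain_name demo
Forking one such process per check is what becomes prohibitive once an API call has to filter a long list of objects; talking to polkitd directly over DBus avoids the per-check process spawn.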
The latest polkit framework is fairly flexible in terms of letting us identify object instances via the details map it associates with every access control check. It is far less flexible in terms of identifying the client user. It is fairly locked into the idea of identifying users via remote PID or DBus service name, and then exposing the username/groupnames to the javascript rules files. While this works fine for local libvirt connections over UNIX sockets, it is pretty much useless for connections arriving on libvirt’s TCP sockets. In the latter case the libvirt user is identified by a SASL username (typically a Kerberos principal name), or by an x509 certificate distinguished name (when using client certs with TLS). There’s no official way to feed the SASL username or x509 dname down to the polkit javascript authorization rules files. Requests upstream to allow extra identifying attributes to be provided for the authorization subject have not been productive, so I’m considering (ab-)using the “details” map to provide identifying info for the user, alongside the identifying info for the object.
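As a purely hypothetical sketch of that idea: if libvirt were to add, say, a subject_sasl_user key to the details map (no such key exists in any current release), an admin could then scope rules to remote Kerberos users in the same style as the earlier examples:
polkit.addRule(function(action, subject) {
    // "subject_sasl_user" is a hypothetical detail key, not part of libvirt today
    if (action.id == "org.libvirt.api.domain.getattr" &&
        action.lookup("subject_sasl_user") == "fred@EXAMPLE.COM") {
        return polkit.Result.YES;
    }
});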
As mentioned earlier, there was a proof of concept SELinux driver written, that is yet to be finished. The work there is around figuring out / defining what the SELinux context is for each object to be checked and doing some work on SELinux policy. I think of this work as providing a capability similar to that done in PostgreSQL to enable SELinux MAC checks. It would be very nice to have a system which provides end-to-end MAC. I refer to this as sVirt 2.0 – the first (current) version of sVirt protected the host from guests – the second (future) version would also protect the host from management clients.
The legacy XenD based Xen driver has a couple of methods which lack access control, due to the inability to get access to the identifying attributes for the objects being operated upon. While we encourage people to use the new libxl based Xen driver, it is desirable to have the legacy Xen driver fully controlled for those people using legacy virtualization hosts. Some code refactoring will be required to fix the legacy Xen driver, likely at the cost of making some methods less efficient.
If there is user demand, work may be done to write an access control driver implemented natively inside libvirt. While the polkit javascript engine is fairly flexible, I’m not much of a fan of having administrators write code to define their access control policy. It would be preferable to have a way to describe the policy that is entirely declarative. With a libvirt native access control driver, it would be possible to create a simple declarative policy file format tailored to our precise needs. This would let us solve the problem of providing identifying info about the subject being checked. It would also have the potential to be more scalable by avoiding the need to interact with any remote authorization daemons over DBus. The latter could be a big deal when an individual API call needs to check thousands of permissions at once. The flipside, of course, is that a libvirt specific access control driver is not good for interoperability across the broader system – the standardized use of polkit is good in that respect. There’s no technical reason why we can’t support multiple access control drivers to give the administrator choice / flexibility.
Finally, this work is all scheduled to arrive in Fedora 20, so anyone interested in testing it should look at current rawhide, or keep an eye out for the Fedora 20 virtualization test day.
EDITED: Aug 15th: Change example use of ‘action._detail_connect_driver’ to ‘action.lookup(“connect_driver”)’
Several years ago I wrote a bit about libvirt and cgroups in Fedora 12. Since that time, much has changed, and we’ve learnt a lot about the use of cgroups, not all of it good.
Perhaps the biggest change has been the arrival of systemd, which has brought cgroups to the attention of a much wider audience. One of the biggest positive impacts of systemd on cgroups has been a formalization of how to integrate with cgroups as an application developer. Libvirt of course follows these cgroups guidelines, has had input into their definition, and continues to work with the systemd community to improve them.
One of the things we’ve learnt the hard way is that the kernel implementation of control groups is not without cost, and the way applications use cgroups can have a direct impact on the performance of the system. The kernel developers have done a great deal of work to improve the performance and scalability of cgroups but there will always be a cost to their usage which application developers need to be aware of. In broad terms, the performance impact is related to the number of cgroups directories created and particularly to their depth.
To cut a long story short, it became clear that the directory hierarchy layout libvirt used with cgroups was seriously sub-optimal, or even outright harmful. Thus in libvirt 1.0.5, we introduced some radical changes to the layout created.
Historically libvirt would create a cgroup directory for each virtual machine or container, at a path $LOCATION-OF-LIBVIRTD/libvirt/$DRIVER-NAME/$VMNAME. For example, if libvirtd was placed in /system/libvirtd.service, then a QEMU guest named “web1” would live at /system/libvirtd.service/libvirt/qemu/web1. That’s 5 levels deep already, which is not good.
As of libvirt 1.0.5, libvirt will create a cgroup directory for each virtual machine or container, at a path /machine/$VMNAME.libvirt-$DRIVER-NAME. First notice how this is now completely disassociated from the location of libvirtd itself. This allows the administrator greater flexibility in controlling resources for virtual machines independently of system services. Second notice that the directory hierarchy is only 2 levels deep by default, so a QEMU guest named “web1” would live at /machine/web1.libvirt-qemu.
The final important change is that the location of a virtual machine / container can now be configured on a per-guest basis in the XML configuration, to override the default of /machine. So if the guest config says
<resource>
<partition>/virtualmachines/production</partition>
</resource>
then libvirt will create the guest cgroup directory /virtualmachines.partition/production.partition/web1.libvirt-qemu. Notice that there will always be a .partition suffix on these user defined directories. Only the default top level directories /machine, /system and /user will be without a suffix. The suffix ensures that user defined directories can never clash with anything the kernel will create. The systemd PaxControlGroups document will be updated with this & a few escaping rules soon.
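Assuming the cpu controller is mounted in the usual place under /sys/fs/cgroup, the default layout and the customized layout from the example above would place the “web1” guest at paths like these (illustrative only):
$ ls -d /sys/fs/cgroup/cpu/machine/web1.libvirt-qemu                 # default placement
$ ls -d /sys/fs/cgroup/cpu/virtualmachines.partition/production.partition/web1.libvirt-qemu   # with the partition override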
There is still more we intend to do with cgroups in libvirt, in particular adding APIs for creating & managing these partitions for grouping VMs, so you don’t need to go to a tool outside libvirt to create the directories.
One final thing, libvirt now has a bit of documentation about its cgroups usage which will serve as the base for future documentation in this area.
My web servers host a number of domains for both personal sites and open source projects, which of course means managing a number of DNS zone files. I use Gandi as my registrar, and they throw in free DNS hosting when you purchase a domain. When you have more than 2-3 domains to manage and want to keep the DNS records consistent across all of them, dealing with the pointy-clicky web form interfaces is really incredibly tedious. Thus I have traditionally hosted my own DNS servers, creating the Bind DNS zone files in emacs. Anyone who has ever used Bind, though, will know that its DNS zone file syntax is one of the most horrific formats you can imagine. It is really easy to make silly typos which will screw up your zone in all sorts of fun ways. Keeping the DNS records in sync across domains is also still somewhat tedious.
What I wanted is a simpler, safer configuration file format for defining DNS zones, which can minimise the duplication of data across different domains. There may be tools which do this already, but I fancied writing something myself tailored to my precise use case, so didn’t search for any existing solutions. The result of a couple of evenings’ hacking effort is a tool I’m calling NoZone, which now has its first public release, version 1.0. The upstream source is available in a GIT repository.
The /etc/nozone.cfg configuration file
The best way to illustrate what NoZone can do is to simply show a sample configuration file. For reasons of space, I’m cutting out all the comments – the copy that is distributed contains copious comments. In this example, 3 (hypothetical) domain names are being configured: nozone.com and nozone.org, which are the public facing domains, and an internal domain for testing purposes, qa.nozone.org. All three domains are intended to be configured with the same DNS records; the only difference is that the internal zone (qa.nozone.org) needs to have different IP addresses for its records. For each domain, there will be three physical machines involved: gold, platinum and silver.
The first step is to define a zone with all the common parameters specified. Note that this zone isn’t specifying any machine IP addresses or domain names. It is just referring to the machine names to define an abstract base for the child zones:
zones = {
common = {
hostmaster = dan-hostmaster
lifetimes = {
refresh = 1H
retry = 15M
expire = 1W
negative = 1H
ttl = 1H
}
default = platinum
mail = {
mx0 = {
priority = 10
machine = gold
}
mx1 = {
priority = 20
machine = silver
}
}
dns = {
ns0 = gold
ns1 = silver
}
names = {
www = platinum
}
aliases = {
db = gold
backup = silver
}
wildcard = platinum
}
With the common parameters defined, a second zone called “production” is defined, which lists the domain names nozone.org and nozone.com and the IP details for the physical machines hosting the domains.
production = {
inherits = common
domains = (
nozone.org
nozone.com
)
machines = {
platinum = {
ipv4 = 12.32.56.1
ipv6 = 2001:1234:6789::1
}
gold = {
ipv4 = 12.32.56.2
ipv6 = 2001:1234:6789::2
}
silver = {
ipv4 = 12.32.56.3
ipv6 = 2001:1234:6789::3
}
}
}
The third zone is used to define the internal qa.nozone.org domain.
testing = {
inherits = common
domains = (
qa.nozone.org
)
machines = {
platinum = {
ipv4 = 192.168.1.1
ipv6 = fc00::1:1
}
gold = {
ipv4 = 192.168.1.2
ipv6 = fc00::1:2
}
silver = {
ipv4 = 192.168.1.3
ipv6 = fc00::1:3
}
}
}
}
Generating the Bind DNS zone files
With the /etc/nozone.cfg configuration file created, the Bind9 DNS zone files can now be generated by invoking the nozone command.
$ nozone
This generates a number of files
# ls /etc/named
nozone.com.conf nozone.conf nozone.org.conf qa.nozone.org.conf
$ ls /var/named/data/
named.run named.run-20130317 nozone.org.data
named.run-20130315 nozone.com.data qa.nozone.org.data
The final step is to add one line to /etc/named.conf and then restart bind.
$ echo 'include "/etc/named/nozone.conf";' >> /etc/named.conf
$ systemctl restart named.service
The generated files
The /etc/named/nozone.conf file is always generated and contains references to the conf files for each configured domain:
include "/etc/named/nozone.com.conf";
include "/etc/named/nozone.org.conf";
include "/etc/named/qa.nozone.org.conf";
Each of these files defines a domain name and links to the zone file definition. For example, nozone.com.conf contains
zone "nozone.com" in {
type master;
file "/var/named/data/nozone.com.data";
};
Finally, the interesting data is in the actual zone files, in this case /var/named/data/nozone.com.data
$ORIGIN nozone.com.
$TTL 1H ; queries are cached for this long
@ IN SOA ns1 hostmaster (
1363531990 ; Date 2013/03/17 14:53:10
1H ; slave queries for refresh this often
15M ; slave retries refresh this often after failure
1W ; slave expires after this long if not refreshed
1H ; errors are cached for this long
)
; Primary name records for unqualfied domain
@ IN A 12.32.56.1 ; Machine platinum
@ IN AAAA 2001:1234:6789::1 ; Machine platinum
; DNS server records
@ IN NS ns0
@ IN NS ns1
ns0 IN A 12.32.56.2 ; Machine gold
ns0 IN AAAA 2001:1234:6789::2 ; Machine gold
ns1 IN A 12.32.56.3 ; Machine silver
ns1 IN AAAA 2001:1234:6789::3 ; Machine silver
; E-Mail server records
@ IN MX 10 mx0
@ IN MX 20 mx1
mx0 IN A 12.32.56.2 ; Machine gold
mx0 IN AAAA 2001:1234:6789::2 ; Machine gold
mx1 IN A 12.32.56.3 ; Machine silver
mx1 IN AAAA 2001:1234:6789::3 ; Machine silver
; Primary names
gold IN A 12.32.56.2
gold IN AAAA 2001:1234:6789::2
platinum IN A 12.32.56.1
platinum IN AAAA 2001:1234:6789::1
silver IN A 12.32.56.3
silver IN AAAA 2001:1234:6789::3
; Extra names
www IN A 12.32.56.1 ; Machine platinum
www IN AAAA 2001:1234:6789::1 ; Machine platinum
; Aliased names
backup IN CNAME silver
db IN CNAME gold
; Wildcard
* IN A 12.32.56.1 ; Machine platinum
* IN AAAA 2001:1234:6789::1 ; Machine platinum
As of 2 days ago, I’m using nozone to manage the DNS zones for all the domains I own. If it is useful to anyone else, it can be downloaded from CPAN. I’ll likely be submitting it for a Fedora review at some point too.
For a few months now Derek has been working on a tool called PackStack, which aims to facilitate & automate the deployment of OpenStack services. Most of the time I’ve used DevStack for deploying OpenStack, but this is not at all suitable for doing production quality deployments. I’ve also done production deployments from scratch following the great Fedora instructions. The latter work, but require the admin to do far too much tedious legwork and know too much about OpenStack in general. This is where PackStack comes in. It starts from the assumption that the admin knows more or less nothing about how the OpenStack tools work. All they need do is decide which services they wish to deploy on each machine. With that answered, PackStack goes off and does the work to make it all happen. Under the hood PackStack does its work by connecting to each machine over SSH, and using Puppet to deploy/configure the services on each one. By leveraging Puppet, PackStack itself is mostly isolated from the differences between various Linux distros. Thus although PackStack has been developed on RHEL and Fedora, it should be well placed for porting to other distros like Debian & I hope we’ll see that happen in the near future. It will be better for the OpenStack community to have a standard tool that is portable across all target distros, than the current situation where pretty much every distributor of OpenStack has reinvented the wheel building their own private tooling for deployment. This is why PackStack is being developed as an upstream project, hosted on StackForge, rather than as a Red Hat only private project.
Preparing the virtual machines
Anyway back to the point of this blog post. Having followed PackStack progress for a while I decided it was time to actually try it out for real. While I have a great many development machines, I don’t have enough free to turn into an OpenStack cluster, so straight away I decided to do my test deployment inside a set of Fedora 18 virtual machines, running on a Fedora 18 KVM host.
The current PackStack network support requires that you have 2 network interfaces. For an all-in-one box deployment you only actually need one physical NIC for the public interface – you can use ‘lo’ for the private interface on which VMs communicate with each other. I’m doing a multi-node deployment though, so my first step was to decide how to provide networking to my VMs. A standard libvirt install will provide a default NAT based network, using the virbr0 bridge device. This will serve just fine as the public interface over which we can communicate with the OpenStack services & their REST / Web APIs. For VM traffic, I decided to create a second libvirt network on the host machine.
# cat > openstackvms.xml <<EOF
<network>
<name>openstackvms</name>
<bridge name='virbr1' stp='off' delay='0' />
</network>
EOF
# virsh net-define openstackvms.xml
Network openstackvms defined from openstackvms.xml
# virsh net-start openstackvms
Network openstackvms started
Next up, I installed a single Fedora 18 guest machine, giving it two network interfaces, the first attached to the ‘default’ libvirt network, and the second attached to the ‘openstackvms’ virtual network.
# virt-install --name f18x86_64a --ram 1000 --file /var/lib/libvirt/images/f18x86_64a.img \
--location http://www.mirrorservice.org/sites/dl.fedoraproject.org/pub/fedora/linux/releases/18/Fedora/x86_64/os/ \
--noautoconsole --vnc --file-size 10 --os-variant fedora18 \
--network network:default --network network:openstackvms
In the installer, I used the defaults for everything with two exceptions. I selected the “Minimal install” instead of “GNOME Desktop”, and I reduced the size of the swap partition from 2 GB to 200 MB – if the VM ever needed more than a few 100 MB of swap, then it is pretty much game over for responsiveness of that VM. A minimal install is very quick, taking only 5 minutes or so to completely install the RPMs – assuming good download speeds from the install mirror chosen. Now I need to turn that one VM into 4 VMs. For this I looked to the virt-clone tool. This is a fairly crude tool which merely does a copy of each disk image, and then updates the libvirt XML for the guest to give it a new UUID and MAC address. It doesn’t attempt to change anything inside the guest, but for a F18 minimal install this is not a significant issue.
# virt-clone -o f18x86_64a -n f18x86_64b -f /var/lib/libvirt/images/f18x86_64b.img
Allocating 'f18x86_64b.img' | 10 GB 00:01:20
Clone 'f18x86_64b' created successfully.
# virt-clone -o f18x86_64a -n f18x86_64c -f /var/lib/libvirt/images/f18x86_64c.img
Allocating 'f18x86_64c.img' | 10 GB 00:01:07
Clone 'f18x86_64c' created successfully.
# virt-clone -o f18x86_64a -n f18x86_64d -f /var/lib/libvirt/images/f18x86_64d.img
Allocating 'f18x86_64d.img' | 10 GB 00:00:59
Clone 'f18x86_64d' created successfully.
I don’t fancy having to remember the IP address of each of the virtual machines I installed, so I decided to set up some fixed IP address mappings in the libvirt default network, and add aliases to /etc/hosts
# virsh net-destroy default
# virsh net-edit default
...changing the following...
<ip address='192.168.122.1' netmask='255.255.255.0'>
<dhcp>
<range start='192.168.122.2' end='192.168.122.254' />
</dhcp>
</ip>
...to this...
<ip address='192.168.122.1' netmask='255.255.255.0'>
<dhcp>
<range start='192.168.122.2' end='192.168.122.99' />
<host mac='52:54:00:fd:e7:03' name='f18x86_64a' ip='192.168.122.100' />
<host mac='52:54:00:c4:b7:f6' name='f18x86_64b' ip='192.168.122.101' />
<host mac='52:54:00:81:84:d6' name='f18x86_64c' ip='192.168.122.102' />
<host mac='52:54:00:6a:9b:1a' name='f18x86_64d' ip='192.168.122.103' />
</dhcp>
</ip>
# cat >> /etc/hosts <<EOF
192.168.122.100 f18x86_64a
192.168.122.101 f18x86_64b
192.168.122.102 f18x86_64c
192.168.122.103 f18x86_64d
EOF
# virsh net-start default
Now we’re ready to actually start the virtual machines
# virsh start f18x86_64a
Domain f18x86_64a started
# virsh start f18x86_64b
Domain f18x86_64b started
# virsh start f18x86_64c
Domain f18x86_64c started
# virsh start f18x86_64d
Domain f18x86_64d started
# virsh list
Id Name State
----------------------------------------------------
25 f18x86_64a running
26 f18x86_64b running
27 f18x86_64c running
28 f18x86_64d running
Deploying OpenStack with PackStack
All of the above is really nothing to do with OpenStack or PackStack – it is just about me getting some virtual machines ready to act as the pretend “physical servers”. The interesting stuff starts now. PackStack doesn’t need to be installed on the machines that will receive the OpenStack install, but rather on any client machine which has SSH access to the target machines. In my case I decided to run packstack from the physical host running the VMs I just provisioned.
# yum -y install openstack-packstack
While PackStack is happy to prompt you with questions, it is far simpler to just use an answer file straight away. It lets you see upfront everything that is required and will make it easy for you to repeat the exercise later.
$ packstack --gen-answer-file openstack.txt
The answer file tries to fill in sensible defaults, but there’s not much it can do for IP addresses. So it just fills in the IP address of the host on which it was generated. This is suitable if you’re doing an all-in-one install on the current machine, but not for doing a multi-node install. So the next step is to edit the answer file and customize at least the IP addresses. I have decided that f18x86_64a will be the Horizon frontend and host the user facing APIs from glance/keystone/nova/etc, f18x86_64b will provide QPid, MySQL and the Nova scheduler, and f18x86_64c and f18x86_64d will be compute nodes and swift storage nodes (though I haven’t actually enabled swift in the config).
$ emacs openstack.txt
...make IP address changes...
So you can see what I changed, here is the unified diff
--- openstack.txt 2013-03-01 12:41:31.226476407 +0000
+++ openstack-custom.txt 2013-03-01 12:51:53.877073871 +0000
@@ -4,7 +4,7 @@
# been installed on the remote servers the user will be prompted for a
# password and this key will be installed so the password will not be
# required again
-CONFIG_SSH_KEY=
+CONFIG_SSH_KEY=/home/berrange/.ssh/id_rsa.pub
# Set to 'y' if you would like Packstack to install Glance
CONFIG_GLANCE_INSTALL=y
@@ -34,7 +34,7 @@
CONFIG_NAGIOS_INSTALL=n
# The IP address of the server on which to install MySQL
-CONFIG_MYSQL_HOST=10.33.8.113
+CONFIG_MYSQL_HOST=192.168.122.101
# Username for the MySQL admin user
CONFIG_MYSQL_USER=root
@@ -43,10 +43,10 @@
CONFIG_MYSQL_PW=5612a75877464b70
# The IP address of the server on which to install the QPID service
-CONFIG_QPID_HOST=10.33.8.113
+CONFIG_QPID_HOST=192.168.122.101
# The IP address of the server on which to install Keystone
-CONFIG_KEYSTONE_HOST=10.33.8.113
+CONFIG_KEYSTONE_HOST=192.168.122.100
# The password to use for the Keystone to access DB
CONFIG_KEYSTONE_DB_PW=297088140caf407e
@@ -58,7 +58,7 @@
CONFIG_KEYSTONE_ADMIN_PW=342cc8d9150b4662
# The IP address of the server on which to install Glance
-CONFIG_GLANCE_HOST=10.33.8.113
+CONFIG_GLANCE_HOST=192.168.122.100
# The password to use for the Glance to access DB
CONFIG_GLANCE_DB_PW=a1d8435d61fd4ed2
@@ -83,25 +83,25 @@
# The IP address of the server on which to install the Nova API
# service
-CONFIG_NOVA_API_HOST=10.33.8.113
+CONFIG_NOVA_API_HOST=192.168.122.100
# The IP address of the server on which to install the Nova Cert
# service
-CONFIG_NOVA_CERT_HOST=10.33.8.113
+CONFIG_NOVA_CERT_HOST=192.168.122.100
# The IP address of the server on which to install the Nova VNC proxy
-CONFIG_NOVA_VNCPROXY_HOST=10.33.8.113
+CONFIG_NOVA_VNCPROXY_HOST=192.168.122.100
# A comma separated list of IP addresses on which to install the Nova
# Compute services
-CONFIG_NOVA_COMPUTE_HOSTS=10.33.8.113
+CONFIG_NOVA_COMPUTE_HOSTS=192.168.122.102,192.168.122.103
# Private interface for Flat DHCP on the Nova compute servers
CONFIG_NOVA_COMPUTE_PRIVIF=eth1
# The IP address of the server on which to install the Nova Network
# service
-CONFIG_NOVA_NETWORK_HOST=10.33.8.113
+CONFIG_NOVA_NETWORK_HOST=192.168.122.101
# The password to use for the Nova to access DB
CONFIG_NOVA_DB_PW=f67f9f822a934509
@@ -116,14 +116,14 @@
CONFIG_NOVA_NETWORK_PRIVIF=eth1
# IP Range for Flat DHCP
-CONFIG_NOVA_NETWORK_FIXEDRANGE=192.168.32.0/22
+CONFIG_NOVA_NETWORK_FIXEDRANGE=192.168.123.0/24
# IP Range for Floating IP's
-CONFIG_NOVA_NETWORK_FLOATRANGE=10.3.4.0/22
+CONFIG_NOVA_NETWORK_FLOATRANGE=192.168.124.0/24
# The IP address of the server on which to install the Nova Scheduler
# service
-CONFIG_NOVA_SCHED_HOST=10.33.8.113
+CONFIG_NOVA_SCHED_HOST=192.168.122.101
# The overcommitment ratio for virtual to physical CPUs. Set to 1.0
# to disable CPU overcommitment
@@ -131,20 +131,20 @@
# The overcommitment ratio for virtual to physical RAM. Set to 1.0 to
# disable RAM overcommitment
-CONFIG_NOVA_SCHED_RAM_ALLOC_RATIO=1.5
+CONFIG_NOVA_SCHED_RAM_ALLOC_RATIO=10
# The IP address of the server on which to install the OpenStack
# client packages. An admin "rc" file will also be installed
-CONFIG_OSCLIENT_HOST=10.33.8.113
+CONFIG_OSCLIENT_HOST=192.168.122.100
# The IP address of the server on which to install Horizon
-CONFIG_HORIZON_HOST=10.33.8.113
+CONFIG_HORIZON_HOST=192.168.122.100
# To set up Horizon communication over https set this to "y"
CONFIG_HORIZON_SSL=n
# The IP address on which to install the Swift proxy service
-CONFIG_SWIFT_PROXY_HOSTS=10.33.8.113
+CONFIG_SWIFT_PROXY_HOSTS=192.168.122.100
# The password to use for the Swift to authenticate with Keystone
CONFIG_SWIFT_KS_PW=aec1c74ec67543e7
@@ -155,7 +155,7 @@
# on 127.0.0.1 as a swift storage device(packstack does not create the
# filesystem, you must do this first), if /dev is omitted Packstack
# will create a loopback device for a test setup
-CONFIG_SWIFT_STORAGE_HOSTS=10.33.8.113
+CONFIG_SWIFT_STORAGE_HOSTS=192.168.122.102,192.168.122.103
# Number of swift storage zones, this number MUST be no bigger than
# the number of storage devices configured
@@ -223,7 +223,7 @@
CONFIG_SATELLITE_PROXY_PW=
# The IP address of the server on which to install the Nagios server
-CONFIG_NAGIOS_HOST=10.33.8.113
+CONFIG_NAGIOS_HOST=192.168.122.100
# The password of the nagiosadmin user on the Nagios server
CONFIG_NAGIOS_PW=7e787e71ff18462c
The current version of PackStack in Fedora mistakenly assumes that ‘net-tools’ is installed by default in Fedora. This used to be the case, but as of Fedora 18 it is no longer installed. Upstream PackStack git has switched from using ifconfig to ip, to avoid this. So for F18 we temporarily need to make sure the ‘net-tools’ RPM is installed on each host. In addition the SELinux policy has not been finished for all OpenStack components, so we need to set it to permissive mode.
$ ssh root@f18x86_64a setenforce 0
$ ssh root@f18x86_64b setenforce 0
$ ssh root@f18x86_64c setenforce 0
$ ssh root@f18x86_64d setenforce 0
$ ssh root@f18x86_64a yum -y install net-tools
$ ssh root@f18x86_64b yum -y install net-tools
$ ssh root@f18x86_64c yum -y install net-tools
$ ssh root@f18x86_64d yum -y install net-tools
Assuming that’s done, we can now just run packstack
# packstack --answer-file openstack-custom.txt
Welcome to Installer setup utility
Installing:
Clean Up... [ DONE ]
Setting up ssh keys... [ DONE ]
Adding pre install manifest entries... [ DONE ]
Adding MySQL manifest entries... [ DONE ]
Adding QPID manifest entries... [ DONE ]
Adding Keystone manifest entries... [ DONE ]
Adding Glance Keystone manifest entries... [ DONE ]
Adding Glance manifest entries... [ DONE ]
Adding Cinder Keystone manifest entries... [ DONE ]
Checking if the Cinder server has a cinder-volumes vg... [ DONE ]
Adding Cinder manifest entries... [ DONE ]
Adding Nova API manifest entries... [ DONE ]
Adding Nova Keystone manifest entries... [ DONE ]
Adding Nova Cert manifest entries... [ DONE ]
Adding Nova Compute manifest entries... [ DONE ]
Adding Nova Network manifest entries... [ DONE ]
Adding Nova Scheduler manifest entries... [ DONE ]
Adding Nova VNC Proxy manifest entries... [ DONE ]
Adding Nova Common manifest entries... [ DONE ]
Adding OpenStack Client manifest entries... [ DONE ]
Adding Horizon manifest entries... [ DONE ]
Preparing servers... [ DONE ]
Adding post install manifest entries... [ DONE ]
Installing Dependencies... [ DONE ]
Copying Puppet modules and manifests... [ DONE ]
Applying Puppet manifests...
Applying 192.168.122.100_prescript.pp
Applying 192.168.122.101_prescript.pp
Applying 192.168.122.102_prescript.pp
Applying 192.168.122.103_prescript.pp
192.168.122.101_prescript.pp : [ DONE ]
192.168.122.103_prescript.pp : [ DONE ]
192.168.122.100_prescript.pp : [ DONE ]
192.168.122.102_prescript.pp : [ DONE ]
Applying 192.168.122.101_mysql.pp
Applying 192.168.122.101_qpid.pp
192.168.122.101_mysql.pp : [ DONE ]
192.168.122.101_qpid.pp : [ DONE ]
Applying 192.168.122.100_keystone.pp
Applying 192.168.122.100_glance.pp
Applying 192.168.122.101_cinder.pp
192.168.122.100_keystone.pp : [ DONE ]
192.168.122.100_glance.pp : [ DONE ]
192.168.122.101_cinder.pp : [ DONE ]
Applying 192.168.122.100_api_nova.pp
192.168.122.100_api_nova.pp : [ DONE ]
Applying 192.168.122.100_nova.pp
Applying 192.168.122.102_nova.pp
Applying 192.168.122.103_nova.pp
Applying 192.168.122.101_nova.pp
Applying 192.168.122.100_osclient.pp
Applying 192.168.122.100_horizon.pp
192.168.122.101_nova.pp : [ DONE ]
192.168.122.100_nova.pp : [ DONE ]
192.168.122.100_osclient.pp : [ DONE ]
192.168.122.100_horizon.pp : [ DONE ]
192.168.122.103_nova.pp : [ DONE ]
192.168.122.102_nova.pp : [ DONE ]
Applying 192.168.122.100_postscript.pp
Applying 192.168.122.101_postscript.pp
Applying 192.168.122.103_postscript.pp
Applying 192.168.122.102_postscript.pp
192.168.122.100_postscript.pp : [ DONE ]
192.168.122.103_postscript.pp : [ DONE ]
192.168.122.101_postscript.pp : [ DONE ]
192.168.122.102_postscript.pp : [ DONE ]
[ DONE ]
**** Installation completed successfully ******
(Please allow Installer a few moments to start up.....)
Additional information:
* Time synchronization installation was skipped. Please note that unsynchronized time on server instances might be problem for some OpenStack components.
* Did not create a cinder volume group, one already existed
* To use the command line tools you need to source the file /root/keystonerc_admin created on 192.168.122.100
* To use the console, browse to http://192.168.122.100/dashboard
* The installation log file is available at: /var/tmp/packstack/20130301-135443-qbNvvH/openstack-setup.log
That really is it – you didn’t need to touch any config files for OpenStack, QPid, MySQL or any other service involved. PackStack just worked its magic and there is now a 4 node OpenStack cluster up and running. One of the nice things about PackStack using Puppet for all its work is that if something goes wrong half way through, you don’t need to throw it all away – just fix the issue and re-run packstack and it’ll do whatever work was left over from before.
The results
Let’s see what’s running on each node. First the frontend user facing node
$ ssh root@f18x86_64a ps -ax
PID TTY STAT TIME COMMAND
1 ? Ss 0:03 /usr/lib/systemd/systemd --switched-root --system --deserialize 14
283 ? Ss 0:00 /usr/lib/systemd/systemd-udevd
284 ? Ss 0:07 /usr/lib/systemd/systemd-journald
348 ? S 0:00 /usr/lib/systemd/systemd-udevd
391 ? Ss 0:06 /usr/bin/python -Es /usr/sbin/firewalld --nofork
392 ? S<sl 0:00 /sbin/auditd -n
394 ? Ss 0:00 /usr/lib/systemd/systemd-logind
395 ? Ssl 0:00 /sbin/rsyslogd -n -c 5
397 ? Ssl 0:01 /bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
403 ? Ss 0:00 login -- root
411 ? Ss 0:00 /usr/sbin/crond -n
417 ? S 0:00 /usr/lib/systemd/systemd-udevd
418 ? Ssl 0:01 /usr/sbin/NetworkManager --no-daemon
452 ? Ssl 0:00 /usr/lib/polkit-1/polkitd --no-debug
701 ? S 0:00 /sbin/dhclient -d -4 -sf /usr/libexec/nm-dhcp-client.action -pf /var/run/dhclient-eth0.pid -lf /var/lib/dhclient/dhclient-4d0e96db-64cd-41d3-a9c3-c584da37dd84-eth0.lease -cf /var/run/nm-dhclient-eth0.conf eth0
769 ? Ss 0:00 /usr/sbin/sshd -D
772 ? Ss 0:00 sendmail: accepting connections
792 ? Ss 0:00 sendmail: Queue runner@01:00:00 for /var/spool/clientmqueue
800 tty1 Ss+ 0:00 -bash
8702 ? Ss 0:00 /usr/bin/python /usr/bin/glance-registry --config-file /etc/glance/glance-registry.conf
8745 ? S 0:00 /usr/bin/python /usr/bin/glance-registry --config-file /etc/glance/glance-registry.conf
8764 ? Ss 0:00 /usr/bin/python /usr/bin/glance-api --config-file /etc/glance/glance-api.conf
10030 ? Ss 0:01 /usr/bin/python /usr/bin/keystone-all --config-file /etc/keystone/keystone.conf
10201 ? S 0:00 /usr/bin/python /usr/bin/glance-api --config-file /etc/glance/glance-api.conf
13096 ? Ss 0:01 /usr/bin/python /usr/bin/nova-api --config-file /etc/nova/nova.conf --logfile /var/log/nova/api.log
13103 ? S 0:00 /usr/bin/python /usr/bin/nova-api --config-file /etc/nova/nova.conf --logfile /var/log/nova/api.log
13111 ? S 0:00 /usr/bin/python /usr/bin/nova-api --config-file /etc/nova/nova.conf --logfile /var/log/nova/api.log
13120 ? S 0:00 /usr/bin/python /usr/bin/nova-api --config-file /etc/nova/nova.conf --logfile /var/log/nova/api.log
13484 ? Ss 0:05 /usr/bin/python /usr/bin/nova-consoleauth --config-file /etc/nova/nova.conf --logfile /var/log/nova/consoleauth.log
20354 ? Ss 0:00 python /usr/bin/nova-novncproxy --web /usr/share/novnc/
20429 ? Ss 0:03 /usr/bin/python /usr/bin/nova-cert --config-file /etc/nova/nova.conf --logfile /var/log/nova/cert.log
21035 ? Ssl 0:00 /usr/bin/memcached -u memcached -p 11211 -m 922 -c 8192 -l 0.0.0.0 -U 11211 -t 1
21311 ? Ss 0:00 /usr/sbin/httpd -DFOREGROUND
21312 ? Sl 0:00 /usr/sbin/httpd -DFOREGROUND
21313 ? S 0:00 /usr/sbin/httpd -DFOREGROUND
21314 ? S 0:00 /usr/sbin/httpd -DFOREGROUND
21315 ? S 0:00 /usr/sbin/httpd -DFOREGROUND
21316 ? S 0:00 /usr/sbin/httpd -DFOREGROUND
21317 ? S 0:00 /usr/sbin/httpd -DFOREGROUND
21632 ? S 0:00 /usr/sbin/httpd -DFOREGROUND
Now the infrastructure node
$ ssh root@f18x86_64b ps -ax
PID TTY STAT TIME COMMAND
1 ? Ss 0:02 /usr/lib/systemd/systemd --switched-root --system --deserialize 21
289 ? Ss 0:00 /usr/lib/systemd/systemd-udevd
290 ? Ss 0:05 /usr/lib/systemd/systemd-journald
367 ? S 0:00 /usr/lib/systemd/systemd-udevd
368 ? S 0:00 /usr/lib/systemd/systemd-udevd
408 ? Ss 0:04 /usr/bin/python -Es /usr/sbin/firewalld --nofork
409 ? S<sl 0:00 /sbin/auditd -n
411 ? Ss 0:00 /usr/lib/systemd/systemd-logind
412 ? Ssl 0:00 /sbin/rsyslogd -n -c 5
414 ? Ssl 0:00 /bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
419 tty1 Ss+ 0:00 /sbin/agetty --noclear tty1 38400 linux
429 ? Ss 0:00 /usr/sbin/crond -n
434 ? Ssl 0:01 /usr/sbin/NetworkManager --no-daemon
484 ? Ssl 0:00 /usr/lib/polkit-1/polkitd --no-debug
717 ? S 0:00 /sbin/dhclient -d -4 -sf /usr/libexec/nm-dhcp-client.action -pf /var/run/dhclient-eth0.pid -lf /var/lib/dhclient/dhclient-2c0f596e-002a-49b0-b3f6-5e228601e7ba-eth0.lease -cf /var/run/nm-dhclient-eth0.conf eth0
766 ? Ss 0:00 sendmail: accepting connections
792 ? Ss 0:00 sendmail: Queue runner@01:00:00 for /var/spool/clientmqueue
805 ? Ss 0:00 /usr/sbin/sshd -D
8531 ? Ss 0:00 /bin/sh /usr/bin/mysqld_safe --basedir=/usr
8884 ? Sl 0:15 /usr/libexec/mysqld --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib64/mysql/plugin --log-error=/var/log/mysqld.log --pid-file=/var/run/mysqld/mysqld.pid --socket=/var/lib/mysql/mysql.sock --port=3306
9778 ? Ssl 0:01 /usr/sbin/qpidd --config /etc/qpidd.conf
10004 ? S< 0:00 [loop2]
13381 ? Ss 0:02 /usr/sbin/tgtd -f
14831 ? Ss 0:00 /usr/bin/python /usr/bin/cinder-api --config-file /etc/cinder/cinder.conf --logfile /var/log/cinder/api.log
14907 ? Ss 0:04 /usr/bin/python /usr/bin/cinder-scheduler --config-file /etc/cinder/cinder.conf --logfile /var/log/cinder/scheduler.log
14956 ? Ss 0:02 /usr/bin/python /usr/bin/cinder-volume --config-file /etc/cinder/cinder.conf --logfile /var/log/cinder/volume.log
15516 ? Ss 0:06 /usr/bin/python /usr/bin/nova-scheduler --config-file /etc/nova/nova.conf --logfile /var/log/nova/scheduler.log
15609 ? Ss 0:08 /usr/bin/python /usr/bin/nova-network --config-file /etc/nova/nova.conf --logfile /var/log/nova/network.log
And finally one of the 2 compute nodes
$ ssh root@f18x86_64c ps -ax
PID TTY STAT TIME COMMAND
1 ? Ss 0:02 /usr/lib/systemd/systemd --switched-root --system --deserialize 21
315 ? Ss 0:00 /usr/lib/systemd/systemd-udevd
317 ? Ss 0:04 /usr/lib/systemd/systemd-journald
436 ? Ss 0:05 /usr/bin/python -Es /usr/sbin/firewalld --nofork
437 ? S<sl 0:00 /sbin/auditd -n
439 ? Ss 0:00 /usr/lib/systemd/systemd-logind
440 ? Ssl 0:00 /sbin/rsyslogd -n -c 5
442 ? Ssl 0:00 /bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation
454 ? Ss 0:00 /usr/sbin/crond -n
455 tty1 Ss+ 0:00 /sbin/agetty --noclear tty1 38400 linux
465 ? S 0:00 /usr/lib/systemd/systemd-udevd
466 ? S 0:00 /usr/lib/systemd/systemd-udevd
470 ? Ssl 0:01 /usr/sbin/NetworkManager --no-daemon
499 ? Ssl 0:00 /usr/lib/polkit-1/polkitd --no-debug
753 ? S 0:00 /sbin/dhclient -d -4 -sf /usr/libexec/nm-dhcp-client.action -pf /var/run/dhclient-eth0.pid -lf /var/lib/dhclient/dhclient-ada59d24-375c-481e-bd57-ce0803ac5574-eth0.lease -cf /var/run/nm-dhclient-eth0.conf eth0
820 ? Ss 0:00 sendmail: accepting connections
834 ? Ss 0:00 sendmail: Queue runner@01:00:00 for /var/spool/clientmqueue
846 ? Ss 0:00 /usr/sbin/sshd -D
9749 ? Ssl 0:13 /usr/sbin/libvirtd
16060 ? Sl 0:01 /usr/bin/python -Es /usr/sbin/tuned -d
16163 ? Ssl 0:03 /usr/bin/python /usr/bin/nova-compute --config-file /etc/nova/nova.conf --logfile /var/log/nova/compute.log
All-in-all PackStack exceeded my expectations for such a young tool – it did a great job with a minimum of fuss and was nice & reliable at what it did too. The only problem I hit was forgetting to set SELinux permissive first, which was not its fault – this is a bug in Fedora policy we will be addressing – and it recovered from that just fine when I re-ran it after setting permissive mode.