A little while back Cole documented a minimal ceph deployment on Fedora. Unfortunately, since then the ‘mkcephfs’ command has been dropped in favour of the ‘ceph-deploy’ tool. There are various other blog posts talking about ceph-deploy, but none of them had quite the right set of commands to get a working single node deployment – the status would always end up in “HEALTH_WARN”, which is pretty much an error state for ceph. After much trial & error I finally figured out the steps that work on Fedora 23.
Even though we’re doing a single node deployment, the ‘ceph-deploy’ tool expects to be able to ssh into the local host as root, without password prompts. So before starting, make sure to install ssh keys and edit /etc/ssh/sshd_config to set PermitRootLogin to yes. Everything that follows should also be run as root.
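As a rough sketch of that prep work (assuming no root key exists yet, and using the stock systemd service name), it would look something like this:
# ssh-keygen
# ssh-copy-id root@`hostname -f`
# systemctl restart sshd
# ssh root@`hostname -f` true
The last command should complete without any password prompt.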
First, we need the ‘ceph-deploy’ tool installed
# dnf install ceph-deploy
ceph-deploy will create some config files in the local directory, so it is best to create a directory to hold them and run it from there
# mkdir ceph-deploy
# cd ceph-deploy
Make sure that the hostname for the local machine is resolvable, both with domain name and unqualified. If it is not, then add entries to /etc/hosts to make it resolve. The first step simply creates the basic config file for ceph-deploy
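For example (hypothetical domain, with the name and address matching the status output shown later), the /etc/hosts entry would look like:
192.168.1.66 t530wlan.example.org t530wlan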
# export CEPH_HOST=`hostname -f`
# ceph-deploy new $CEPH_HOST
Since this will be a single node deployment there are 2 critical additions that must be made to the ceph.conf that was just created in the current directory
# echo "osd crush chooseleaf type = 0" >> ceph.conf
# echo "osd pool default size = 1" >> ceph.conf
Without these two settings, the storage will never achieve a healthy status.
Now tell ceph-deploy to actually install the main ceph software. By default it will try to activate YUM repos hosted on ceph.com, but Fedora has everything needed, so the ‘--no-adjust-repos’ argument tells it not to add custom repos
# ceph-deploy install --no-adjust-repos $CEPH_HOST
With the software installed, the monitor service can be created and started
# ceph-deploy mon create-initial
Ceph can use storage on a block device, but for single node test deployments it is far easier to just point it to a local directory
# mkdir -p /srv/ceph/osd
# ceph-deploy osd prepare $CEPH_HOST:/srv/ceph/osd
# ceph-deploy osd activate $CEPH_HOST:/srv/ceph/osd
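As an optional sanity check before looking at the overall health, the ceph osd tree command should now list a single osd.0 in the ‘up’ state:
# ceph osd tree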
Assuming the prepare & activate steps completed without error, check that the cluster status shows HEALTH_OK
# ceph status
    cluster 7e7be62d-4c83-4b59-8c11-6b57301e8cb4
     health HEALTH_OK
     monmap e1: 1 mons at {t530wlan=192.168.1.66:6789/0}
            election epoch 2, quorum 0 t530wlan
     osdmap e5: 1 osds: 1 up, 1 in
      pgmap v15: 64 pgs, 1 pools, 0 bytes data, 0 objects
            246 GB used, 181 GB / 450 GB avail
                  64 active+clean
If it displays “HEALTH_WARN”, don’t make the mistake of thinking that is merely a warning – chances are it is a fatal error that will prevent anything from working. If you did get errors, purge all traces of ceph before trying again
# ceph-deploy purgedata $CEPH_HOST
# ceph-deploy purge $CEPH_HOST
# ceph-deploy forgetkeys
# rm -rf /srv/ceph/osd
Once everything is working, it should be possible to use the ‘rbd’ command on the local node to set up volumes suitable for use with QEMU/KVM.
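For example, a quick sketch of creating and inspecting a 1 GB volume (the name ‘vm01’ is arbitrary):
# rbd create --size 1024 vm01
# rbd info vm01
The resulting rbd:rbd/vm01 volume can then be referenced from a guest disk definition in the usual way.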
Yesterday I talked about setting up Sheepdog with KVM, so today it is time to discuss use of Ceph and RBD with KVM.
Host Cluster Setup, the easy way
Fedora has included Ceph for a couple of releases, but since my hosts are on Fedora 14/15, I grabbed the latest ceph 0.31 SRPMs from Fedora 16 and rebuilt those to get something reasonably up2date. In the end I have the following packages installed, though to be honest I don’t really need anything except the base ‘ceph’ RPM:
# rpm -qa | grep ceph | sort
ceph-0.31-4.fc17.x86_64
ceph-debuginfo-0.31-4.fc17.x86_64
ceph-devel-0.31-4.fc17.x86_64
ceph-fuse-0.31-4.fc17.x86_64
ceph-gcephtool-0.31-4.fc17.x86_64
ceph-obsync-0.31-4.fc17.x86_64
ceph-radosgw-0.31-4.fc17.x86_64
Installing the software is the easy bit; configuring the cluster is where the fun begins. I had three hosts available for testing, all of which are virtualization hosts. Ceph has at least 3 daemons it needs to run, which should all be replicated across several hosts for redundancy. There’s no requirement to use the same hosts for each daemon, but for simplicity I decided to run every Ceph daemon on every virtualization host.
My hosts are called lettuce, avocado and mustard. Following the Ceph wiki instructions, I settled on a configuration file that looks like this:
[global]
        auth supported = cephx
        keyring = /etc/ceph/keyring.admin

[mds]
        keyring = /etc/ceph/keyring.$name
[mds.lettuce]
        host = lettuce
[mds.avocado]
        host = avocado
[mds.mustard]
        host = mustard

[osd]
        osd data = /srv/ceph/osd$id
        osd journal = /srv/ceph/osd$id/journal
        osd journal size = 512
        osd class dir = /usr/lib64/rados-classes
        keyring = /etc/ceph/keyring.$name
[osd.0]
        host = lettuce
[osd.1]
        host = avocado
[osd.2]
        host = mustard

[mon]
        mon data = /srv/ceph/mon$id
[mon.0]
        host = lettuce
        mon addr = 192.168.1.1:6789
[mon.1]
        host = avocado
        mon addr = 192.168.1.2:6789
[mon.2]
        host = mustard
        mon addr = 192.168.1.3:6789
The osd class dir bit should not actually be required, but the OSD code looks in the wrong place (/usr/lib instead of /usr/lib64) on x86_64 arches.
With the configuration file written, it is time to actually initialize the cluster filesystem / object store. This is the really fun bit. The Ceph wiki has a very basic page which talks about the mkcephfs tool, along with a scary warning about how it’ll ‘rm -rf’ all the data on the filesystem it is initializing. It turns out that it didn’t mean your entire host filesystem; AFAICT it only blows away the contents of the directories configured for ‘osd data’ and ‘mon data’, in my case both under /srv/ceph.
The recommended way is to let mkcephfs ssh into each of your hosts and run all the configuration tasks automatically. Having tried the non-recommended way and failed several times before finally getting it right, I can recommend following the recommended way :-P There are some caveats not mentioned in the wiki page though:
- The configuration file above must be copied to /etc/ceph/ceph.conf on every node before attempting to run mkcephfs.
- The configuration file on the host where you run mkcephfs must be in /etc/ceph/ceph.conf, or it will get rather confused about where it is on the other nodes.
- The mkcephfs command must be run as root, since it doesn’t specify ‘-l root’ to ssh, leading to an inability to set up the nodes.
- The directories /srv/ceph/osd$i must be pre-created, since it is unable to do that itself, despite being able to create the /srv/ceph/mon$i directories.
- The Fedora RPMs have also forgotten to create /etc/ceph.
With that in mind, I ran the following commands from my laptop, as root:
# n=0
# for host in lettuce avocado mustard ; \
do \
ssh root@$host mkdir -p /etc/ceph /srv/ceph/mon$n; \
n=$(expr $n + 1); \
scp /etc/ceph/ceph.conf root@$host:/etc/ceph/ceph.conf
done
# mkcephfs -a -c /etc/ceph/ceph.conf -k /etc/ceph/keyring.bin
On the host where you ran mkcephfs there should now be a file /etc/ceph/keyring.admin. This will be needed for mounting filesystems. I copied it across to all my virtualization hosts:
# for host in lettuce avocado mustard ; \
do \
scp /etc/ceph/keyring.admin root@$host:/etc/ceph/keyring.admin; \
done
Host Cluster Usage
Assuming the setup phase all went to plan, the cluster can now be started. A word of warning though, Ceph really wants your clocks VERY well synchronized. If your NTP server is a long way away, the synchronization might not be good enough to stop Ceph complaining. You really want a NTP server on your local LAN for hosts to sync against. Sort this out before trying to start the cluster.
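A quick way to eyeball the clock offsets (assuming ntpd is what is keeping time on the hosts) is to run ntpq on each one; the offset column should be in the low milliseconds:
# for host in lettuce avocado mustard ; \
do \
ssh root@$host ntpq -p; \
done
With the clocks in sync, start the cluster: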
# for host in lettuce avocado mustard ; \
do \
ssh root@$host service ceph start; \
done
The ceph tool can show the status of everything. The ‘mon’, ‘osd’ and ‘mds’ lines in the status ought to show all 3 hosts present & correct
# ceph -s
2011-10-12 14:49:39.085764 pg v235: 594 pgs: 594 active+clean; 24 KB data, 94212 MB used, 92036 MB / 191 GB avail
2011-10-12 14:49:39.086585 mds e6: 1/1/1 up {0=lettuce=up:active}, 2 up:standby
2011-10-12 14:49:39.086622 osd e5: 3 osds: 3 up, 3 in
2011-10-12 14:49:39.086908 log 2011-10-12 14:38:50.263058 osd1 192.168.1.1:6801/8637 197 : [INF] 2.1p1 scrub ok
2011-10-12 14:49:39.086977 mon e1: 3 mons at {0=192.168.1.1:6789/0,1=192.168.1.2:6789/0,2=192.168.1.3:6789/0}
The cluster configuration I chose has authentication enabled, so actually mounting the ceph filesystem requires a secret key. This key is stored in the /etc/ceph/keyring.admin file that was created earlier. To view the keyring contents, the cauthtool program must be used
# cauthtool -l /etc/ceph/keyring.admin
[client.admin]
        key = AQDLk5VOeHkHLxAAfGjcaUsOXOhJr7hZCNjXSQ==
        auid = 18446744073709551615
The base64 key there will be passed to the mount command, repeated on every host needing the filesystem present:
# mount -t ceph 192.168.1.1:6789:/ /mnt/ -o name=admin,secret=AQDLk5VOeHkHLxAAfGjcaUsOXOhJr7hZCNjXSQ==
error adding secret to kernel, key name client.admin: No such device
For some reason, that error message is always printed on my Fedora hosts, and despite that, the mount has actually succeeded
# grep /mnt /proc/mounts
192.168.1.1:6789:/ /mnt ceph rw,relatime,name=admin,secret= 0 0
Congratulations, /mnt is now a distributed filesystem. If you create a file on one host, it should appear on the other hosts & vice-versa.
RBD Volume setup
A shared filesystem is very nice, and can be used to hold regular virtual disk images in a variety of formats (raw, qcow2, etc). What I really wanted to try was the RBD virtual block device functionality in QEMU. Ceph includes a tool called rbd for manipulating those. The syntax of this tool is pretty self-explanatory
# rbd create --size 100 demo
# rbd ls
demo
# rbd info demo
rbd image 'demo':
        size 102400 KB in 25 objects
        order 22 (4096 KB objects)
        block_name_prefix: rb.0.0
        parent: (pool -1)
Alternatively, RBD volume creation can be done using qemu-img … at least once the Fedora QEMU package is fixed to enable RBD support.
# qemu-img create -f rbd rbd:rbd/demo 100M
Formatting 'rbd:rbd/foo', fmt=rbd size=104857600 cluster_size=0
# qemu-img info rbd:rbd/demo
image: rbd:rbd/foo
file format: raw
virtual size: 100M (104857600 bytes)
disk size: unavailable
KVM guest setup
The syntax for configuring an RBD block device in libvirt is very similar to that used for Sheepdog. In Sheepdog, every single virtualization node is also a storage node, so there is no hostname required. Not so for RBD. Here it is necessary to specify one or more host names for the RBD servers.
<disk type='network' device='disk'>
  <driver name='qemu' type='raw'/>
  <source protocol='rbd' name='demo/wibble'>
    <host name='lettuce.example.org' port='6789'/>
    <host name='mustard.example.org' port='6789'/>
    <host name='avocado.example.org' port='6789'/>
  </source>
  <target dev='vdb' bus='virtio'/>
</disk>
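Assuming that XML is saved in a file (the names ‘rbd-disk.xml’ and ‘myguest’ here are just illustrative), it can either be pasted into the guest definition, or hot-plugged into a running guest with virsh:
# virsh attach-device myguest rbd-disk.xml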
More observant people might be wondering how QEMU gets permission to connect to the RBD server, given that the configuration earlier enabled authentication. This is thanks to the magic of the /etc/ceph/keyring.admin file, which must exist on any virtualization server. Patches are currently being discussed which will allow authentication credentials to be set via libvirt, avoiding the need to store the credentials on the virtualization hosts permanently.
There were recently patches posted to libvir-list to improve the Ceph support in the KVM driver. While trying to review them it quickly became clear I did not have enough knowledge of Ceph to approve the code. So I decided it was time to set up some clustered storage devices to test libvirt with. I decided to try out Ceph, GlusterFS and Sheepdog, and by virtue of Sheepdog compiling the fastest, that is the first one I have tried and thus responsible for this blog post.
Host setup
If you have Fedora 16, sheepdog can be directly installed using yum
# yum install sheepdog
Sheepdog relies on corosync to maintain cluster membership, so the first step is to configure that. Corosync ships with an example configuration file, but since I’ve not used it before, I chose to just use the example configuration recommended by the Sheepdog website. So on the 2 hosts I wanted to participate in the cluster I created:
# cat > /etc/corosync/corosync.conf <<EOF
compatibility: whitetank
totem {
    version: 2
    secauth: off
    threads: 0
    interface {
        ringnumber: 0
        bindnetaddr: -YOUR IP HERE-
        mcastaddr: 226.94.1.1
        mcastport: 5405
    }
}
logging {
    fileline: off
    to_stderr: no
    to_logfile: yes
    to_syslog: yes
    logfile: /var/log/cluster/corosync.log
    debug: off
    timestamp: on
    logger_subsys {
        subsys: AMF
        debug: off
    }
}
amf {
    mode: disabled
}
EOF
Obviously, remember to change the ‘bindnetaddr’ parameter. One thing to be aware of is that this configuration allows any host in the same subnet to join the cluster; no authentication or encryption is required. I believe corosync has some support for encryption keys, but I have not explored this. If you don’t trust the network, this should definitely be examined.
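As a rough, untested sketch, enabling that would mean generating a shared key with corosync-keygen (which writes /etc/corosync/authkey), copying it to every other node (‘othernode’ below is a placeholder), and setting ‘secauth: on’ in the configuration file:
# corosync-keygen
# scp /etc/corosync/authkey root@othernode:/etc/corosync/authkey
Then it is simply a matter of starting corosync and sheepdog on each node: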
# service corosync start
# service sheepdog start
If all went to plan, it should be possible to see all hosts in the sheepdog cluster, from any node:
# collie node list
  Idx - Host:Port          Number of vnodes
  ------------------------------------------------
    0 - 192.168.1.2:7000   64
  * 1 - 192.168.1.3:7000   64
The final step in initializing the nodes is to create a storage cluster across the nodes. This command only needs to be run on one of the nodes
# collie cluster format --copies=2
# collie cluster info
running
Ctime Epoch Nodes
2011-10-11 10:50:01 1 [192.168.1.2:7000, 192.168.1.3:7000]
Volume setup
libvirt has a storage management API for creating/managing volumes, but there is not currently a driver for sheepdog. So for the time being, volumes need to be created manually using the qemu-img command. All that is required is a volume name and a size. So on any of the nodes:
$ qemu-img create sheepdog:demo 1G
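The new volume should then be visible from any node in collie’s volume listing:
$ collie vdi list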
The more observant people might notice that this command can be run by any user on the host, no authentication required. Even if the host is locked down to not allow unprivileged user logins, this still means that any compromised QEMU instance can access all the sheepdog storage. Not cool. Some form of authentication is clearly needed before this can be used for production.
With the default Fedora configuration of sheepdog, all the disk volumes end up being stored under /var/lib/sheepdog, so make sure that directory has plenty of free space.
Guest setup
Once a volume has been created, setting up a guest to use it is just a matter of using a special XML configuration block for the guest disk.
<disk type='network' device='disk'>
  <driver name='qemu' type='raw'/>
  <source protocol='sheepdog' name='demo'/>
  <target dev='vdb' bus='virtio'/>
</disk>
Notice how although this is a network block device, there is no need to provide a hostname of the storage server. Every virtualization host is a member of the storage cluster, and vice-versa, so the storage is “local” as far as QEMU is concerned. Inside the guest there is nothing special to worry about: a regular virtio block device appears, in this case /dev/vdb. As data is written to the block device in the guest, the data should end up in /var/lib/sheepdog on all nodes in the cluster.
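For example, inside the guest (the mount point is arbitrary):
# mkfs.ext4 /dev/vdb
# mkdir /data
# mount /dev/vdb /data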
One final caveat to mention is that live migration of guests between hosts is not currently supported with Sheepdog.
Edit: Live migration *is* supported with sheepdog 0.2.0 and later.