Setting up a Ceph cluster and exporting an RBD volume to a KVM guest
Yesterday I talked about setting up Sheepdog with KVM, so today it is time to discuss use of Ceph and RBD with KVM.
Host Cluster Setup, the easy way
Fedora has included Ceph for a couple of releases, but since my hosts are on Fedora 14/15, I grabbed the latest ceph 0.31 SRPMs from Fedora 16 and rebuilt those to get something reasonably up2date. In the end I have the following packages installed, though to be honest I don't really need anything except the base 'ceph' RPM:
# rpm -qa | grep ceph | sort
ceph-0.31-4.fc17.x86_64
ceph-debuginfo-0.31-4.fc17.x86_64
ceph-devel-0.31-4.fc17.x86_64
ceph-fuse-0.31-4.fc17.x86_64
ceph-gcephtool-0.31-4.fc17.x86_64
ceph-obsync-0.31-4.fc17.x86_64
ceph-radosgw-0.31-4.fc17.x86_64
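For reference, the rebuild itself was nothing exotic. A minimal sketch, assuming the Fedora 16 ceph source RPM has already been downloaded to the current directory (the exact file names will differ):

# yum-builddep ceph-0.31-*.src.rpm                         # pull in the build dependencies
# rpmbuild --rebuild ceph-0.31-*.src.rpm                   # binary RPMs land under ~/rpmbuild/RPMS
# yum localinstall ~/rpmbuild/RPMS/x86_64/ceph-*.x86_64.rpm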
Installing the software is the easy bit, configuring the cluster is where the fun begins. I had three hosts available for testing all of which are virtualization hosts. Ceph has at least 3 daemons it needs to run, which should all be replicated across several hosts for redundancy. There’s no requirement to use the same hosts for each daemon, but for simplicity I decided to run every Ceph daemon on every virtualization host.
My hosts are called lettuce, avocado and mustard. Following the Ceph wiki instructions, I settled on a configuration file that looks like this:
[global]
    auth supported = cephx
    keyring = /etc/ceph/keyring.admin

[mds]
    keyring = /etc/ceph/keyring.$name
[mds.lettuce]
    host = lettuce
[mds.avocado]
    host = avocado
[mds.mustard]
    host = mustard

[osd]
    osd data = /srv/ceph/osd$id
    osd journal = /srv/ceph/osd$id/journal
    osd journal size = 512
    osd class dir = /usr/lib64/rados-classes
    keyring = /etc/ceph/keyring.$name
[osd.0]
    host = lettuce
[osd.1]
    host = avocado
[osd.2]
    host = mustard

[mon]
    mon data = /srv/ceph/mon$id
[mon.0]
    host = lettuce
    mon addr = 192.168.1.1:6789
[mon.1]
    host = avocado
    mon addr = 192.168.1.2:6789
[mon.2]
    host = mustard
    mon addr = 192.168.1.3:6789
The osd class dir bit should not actually be required, but the OSD code looks in the wrong place (/usr/lib instead of /usr/lib64) on x86_64 arches.
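A presumably equivalent workaround, which I have not tested, would be to symlink the directory into the place the OSD actually looks:

# ln -s /usr/lib64/rados-classes /usr/lib/rados-classes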
With the configuration file written, it is time to actually initialize the cluster filesystem / object store. This is the really fun bit. The Ceph wiki has a very basic page which talks about the mkcephfs tool, along with a scary warning about how it'll 'rm -rf' all the data on the filesystem it is initializing. It turns out that it didn't mean your entire host filesystem; AFAICT, it only blows away the contents of the directories configured for 'osd data' and 'mon data', in my case both under /srv/ceph.
The recommended way is to let mkcephfs ssh into each of your hosts and run all the configuration tasks automatically. Having tried the non-recommended way and failed several times before finally getting it right, I can recommend following the recommended way :-P There are some caveats not mentioned in the wiki page though:
- The configuration file above must be copied to /etc/ceph/ceph.conf on every node before attempting to run mkcephfs.
- The configuration file on the host where you run mkcephfs must be in /etc/ceph/ceph.conf, or it will get rather confused about where it is on the other nodes.
- The mkcephfs command must be run as root, since it doesn't specify '-l root' to ssh, leading to an inability to set up the nodes.
- The directories /srv/ceph/osd$i must be pre-created, since it is unable to do that itself, despite being able to create the /srv/ceph/mon$i directories.
- The Fedora RPMs have also forgotten to create /etc/ceph.

With that in mind, I ran the following commands from my laptop, as root:
# n=0
# for host in lettuce avocado mustard ; \
  do \
      ssh root@$host mkdir -p /etc/ceph /srv/ceph/osd$n /srv/ceph/mon$n; \
      n=$(expr $n + 1); \
      scp /etc/ceph/ceph.conf root@$host:/etc/ceph/ceph.conf; \
  done
# mkcephfs -a -c /etc/ceph/ceph.conf -k /etc/ceph/keyring.bin
On the host where you ran mkcephfs there should now be a file /etc/ceph/keyring.admin. This will be needed for mounting filesystems. I copied it across to all my virtualization hosts:
# for host in lettuce avocado mustard ; \
  do \
      scp /etc/ceph/keyring.admin root@$host:/etc/ceph/keyring.admin; \
  done
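Since that keyring grants full admin rights to the cluster, it is worth a quick sanity check that it arrived everywhere and is not world readable (my own habit, not a step from the wiki):

# for host in lettuce avocado mustard ; \
  do \
      ssh root@$host "chmod 600 /etc/ceph/keyring.admin && ls -l /etc/ceph/keyring.admin"; \
  done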
Host Cluster Usage
Assuming the setup phase all went to plan, the cluster can now be started. A word of warning though: Ceph really wants your clocks VERY well synchronized. If your NTP server is a long way away, the synchronization might not be good enough to stop Ceph complaining. You really want an NTP server on your local LAN for hosts to sync against. Sort this out before trying to start the cluster.
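A crude way to eyeball the skew first (my own quick check, not something from the Ceph docs) is simply to compare timestamps across the hosts; the differences include ssh latency, so treat it as a rough indicator only:

# for host in lettuce avocado mustard ; \
  do \
      ssh root@$host date +%s.%N; \
  done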
# for host in lettuce avocado mustard ; \
  do \
      ssh root@$host service ceph start; \
  done
The ceph tool can show the status of everything. The 'mon', 'osd' and 'mds' lines in the status ought to show all 3 hosts present & correct:
# ceph -s
2011-10-12 14:49:39.085764    pg v235: 594 pgs: 594 active+clean; 24 KB data, 94212 MB used, 92036 MB / 191 GB avail
2011-10-12 14:49:39.086585   mds e6: 1/1/1 up {0=lettuce=up:active}, 2 up:standby
2011-10-12 14:49:39.086622   osd e5: 3 osds: 3 up, 3 in
2011-10-12 14:49:39.086908   log 2011-10-12 14:38:50.263058 osd1 192.168.1.1:6801/8637 197 : [INF] 2.1p1 scrub ok
2011-10-12 14:49:39.086977   mon e1: 3 mons at {0=192.168.1.1:6789/0,1=192.168.1.2:6789/0,2=192.168.1.3:6789/0}
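If you just want a one line summary, or want to watch state changes as they happen, 'ceph health' and 'ceph -w' are handy too (the exact output format varies between Ceph versions):

# ceph health
HEALTH_OK
# ceph -w      # streams cluster events until interrupted with Ctrl-C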
The cluster configuration I chose has authentication enabled, so actually mounting the Ceph filesystem requires a secret key. This key is stored in the /etc/ceph/keyring.admin file that was created earlier. To view the keyring contents, the cauthtool program must be used:
# cauthtool -l /etc/ceph/keyring.admin
[client.admin]
        key = AQDLk5VOeHkHLxAAfGjcaUsOXOhJr7hZCNjXSQ==
        auid = 18446744073709551615
The base64 key there will be passed to the mount command, repeating on every host needing a filesystem present:
# mount -t ceph 192.168.1.1:6789:/ /mnt/ -o name=admin,secret=AQDLk5VOeHkHLxAAfGjcaUsOXOhJr7hZCNjXSQ==
error adding secret to kernel, key name client.admin: No such device
For some reason, that error message is always printed on my Fedora hosts, and despite that, the mount has actually succeeded
# grep /mnt /proc/mounts
192.168.1.1:6789:/ /mnt ceph rw,relatime,name=admin,secret= 0 0
Congratulations, /mnt is now a distributed filesystem. If you create a file on one host, it should appear on the other hosts & vice-versa.
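As an aside, putting the secret on the mount command line leaves it visible in the process listing and shell history. The kernel client also understands a secretfile= option pointing at a file containing just the base64 key; a sketch, assuming your cauthtool supports the -p/--print-key flag and the mount.ceph helper in your packages handles secretfile (otherwise copy the key out of the -l output by hand):

# cauthtool -p -n client.admin /etc/ceph/keyring.admin > /etc/ceph/admin.secret
# chmod 600 /etc/ceph/admin.secret
# mount -t ceph 192.168.1.1:6789:/ /mnt/ -o name=admin,secretfile=/etc/ceph/admin.secret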
RBD Volume setup
A shared filesystem is very nice, and can be used to hold regular virtual disk images in a variety of formats (raw, qcow2, etc). What I really wanted to try was the RBD virtual block device functionality in QEMU. Ceph includes a tool called rbd for manipulating those. The syntax of this tool is pretty self-explanatory:
# rbd create --size 100 demo
# rbd ls
demo
# rbd info demo
rbd image 'demo':
        size 102400 KB in 25 objects
        order 22 (4096 KB objects)
        block_name_prefix: rb.0.0
        parent:  (pool -1)
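The rest of the tool follows the same pattern, for example growing a volume or deleting it again (an illustration from memory, so double-check against 'rbd --help' on your version):

# rbd resize --size 200 demo      # grow the image to 200 MB
# rbd rm demo                     # would delete it again (not actually run here)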
Alternatively, RBD volume creation can be done using qemu-img… at least once the Fedora QEMU package is fixed to enable RBD support.
# qemu-img create -f rbd rbd:rbd/demo 100M
Formatting 'rbd:rbd/foo', fmt=rbd size=104857600 cluster_size=0
# qemu-img info rbd:rbd/demo
image: rbd:rbd/foo
file format: raw
virtual size: 100M (104857600 bytes)
disk size: unavailable
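The same rbd: syntax also works for importing an existing disk image into the cluster, e.g. converting a local qcow2 file; a sketch assuming the RBD-enabled QEMU build, with 'guest.qcow2' as a hypothetical source file:

# qemu-img convert -f qcow2 -O rbd guest.qcow2 rbd:rbd/demo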
KVM guest setup
The syntax for configuring an RBD block device in libvirt is very similar to that used for Sheepdog. In Sheepdog, every single virtualization node is also a storage node, so there is no hostname required. Not so for RBD. Here it is necessary to specify one or more host names for the RBD servers.
<disk type='network' device='disk'>
  <driver name='qemu' type='raw'/>
  <source protocol='rbd' name='demo/wibble'>
    <host name='lettuce.example.org' port='6798'/>
    <host name='mustard.example.org' port='6798'/>
    <host name='avocado.example.org' port='6798'/>
  </source>
  <target dev='vdb' bus='virtio'/>
</disk>
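For comparison, the same volume attached directly on the QEMU command line uses the rbd: protocol prefix (roughly; when the monitor addresses are not given explicitly, librados falls back to reading /etc/ceph/ceph.conf):

# qemu-kvm -drive file=rbd:demo/wibble,if=virtio ...rest of the guest configuration...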
More observant people might be wondering how QEMU gets permission to connect to the RBD server, given that the configuration earlier enabled authentication. This is thanks to the magic of the /etc/ceph/keyring.admin file which must exist on any virtualization server. Patches are currently being discussed which will allow authentication credentials to be set via libvirt, avoiding the need to store the credentials on the virtualization hosts permanently.
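For anyone reading this later: the approach being discussed pairs the disk with a libvirt secret holding the Cephx key, so the guest XML would end up looking roughly like the sketch below. This is hedged heavily, since the syntax was not finalised at the time of writing; the uuid is a placeholder for a secret created separately with virsh secret-define / secret-set-value:

<disk type='network' device='disk'>
  <driver name='qemu' type='raw'/>
  <auth username='admin'>
    <secret type='ceph' uuid='00000000-0000-0000-0000-000000000000'/>
  </auth>
  <source protocol='rbd' name='demo/wibble'>
    <host name='lettuce.example.org' port='6789'/>
  </source>
  <target dev='vdb' bus='virtio'/>
</disk>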
Comments

Hello. What does mkcephfs do? Will it initialize the cluster filesystem / object store? Or how does one initialize the object store?
Hi, a little late, but I stumbled upon this just now. CephFS is a complete distributed file system; it is different from RBD.
The port in the KVM guest setup should be 6789, not 6798.
I found this article very useful, in fact, more useful than the Ceph documentation.
For your example though, should you change
mkcephfs -a -c /etc/ceph/ceph.conf -k /etc/ceph/keyring.bin
to
mkcephfs -a -c /etc/ceph/ceph.conf -k /etc/ceph/keyring.admin
Hi, I'm trying to follow your steps, but it doesn't work for me. I'm working on openSUSE 12.3 (3 VMs with openSUSE 12.3 on a virtual network using VirtualBox) and I have to get it working. If I execute everything as you posted, when I get to the following step (see at the end), the command just hangs at that point. I couldn't find a tutorial where I could make it work simply by following the steps. There's no documentation at all (at least none understandable for monkeys like me), and the Ceph website howto uses ceph-deploy (which doesn't work on openSUSE 12.3).

I would appreciate it if anybody could point me to a working, fool-proof tutorial.

Anyway, thanks for your article, it's the best info I have found on the web.

Sorry for my English, it may not be the best, but I hope I can make myself understood.

Thanks a lot.
mkcephfs -a -c /etc/ceph/ceph.conf -k /etc/ceph/keyring.bin
temp dir is /tmp/mkcephfs.qpPhBHhmFf
preparing monmap in /tmp/mkcephfs.qpPhBHhmFf/monmap
/usr/bin/monmaptool --create --clobber --add 0 15.15.15.21:6789 --add 1 15.15.15.22:6789 --add 2 15.15.15.23:6789 --print /tmp/mkcephfs.qpPhBHhmFf/monmap
/usr/bin/monmaptool: monmap file /tmp/mkcephfs.qpPhBHhmFf/monmap
/usr/bin/monmaptool: generated fsid 745e86c8-07fd-4e4a-b0bf-7e69823f7314
epoch 0
fsid 745e86c8-07fd-4e4a-b0bf-7e69823f7314
last_changed 2013-09-16 13:19:49.149785
created 2013-09-16 13:19:49.149785
0: 15.15.15.21:6789/0 mon.0
1: 15.15.15.22:6789/0 mon.1
2: 15.15.15.23:6789/0 mon.2
/usr/bin/monmaptool: writing epoch 0 to /tmp/mkcephfs.qpPhBHhmFf/monmap (3 monitors)
=== osd.0 ===
pushing conf and monmap to suse1:/tmp/mkfs.ceph.4874