Ceph
We are running a three-node Ceph cluster on riboflavin, ginkgo and biloba for the purpose of cloud storage. Most Ceph services are running on riboflavin or ginkgo; biloba is just providing a tiny bit of extra storage space.
Official documentation: https://docs.ceph.com/en/latest/
At the time this page was written, the latest version of Ceph was 'Pacific'; check the website above to see what the latest version is.
Bootstrap
The instructions below were adapted from https://docs.ceph.com/en/pacific/cephadm/install/.
riboflavin was used as the bootstrap host, since it has the most storage.
Add the following to /etc/apt/sources.list.d/ceph.list:
deb http://mirror.csclub.uwaterloo.ca/ceph/debian-pacific/ bullseye main
Download the Ceph release key for the Debian packages:
wget -O /etc/apt/trusted.gpg.d/ceph.release.gpg https://download.ceph.com/keys/release.gpg
Now run:
apt update
apt install cephadm podman
cephadm bootstrap --mon-ip 172.19.168.25
For the rest of the instructions below, the ceph command can be run inside a Podman container via cephadm shell. Alternatively, you can install the ceph-common package to run ceph directly on the host.
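For example, either of the following should work once the cluster is bootstrapped:
# One-off command without entering the container
cephadm shell -- ceph status
# Or, with ceph-common installed on the host
ceph status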
Add the disks for riboflavin:
ceph orch daemon add osd riboflavin:/dev/sdb
ceph orch daemon add osd riboflavin:/dev/sdc
Note: Unfortunately Ceph didn't like it when I used one of the /dev/disk/by-id paths, so I had to use the /dev/sdX paths instead. I'm not sure what'll happen if the device names change at boot. Let's just cross our fingers and pray.
Add more hosts:
ceph orch host add ginkgo 172.19.168.22 --labels _admin
ceph orch host add biloba 172.19.168.23
Add each available disk on each of the additional hosts.
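For example (the device names here are illustrative; check lsblk on each host first):
# Example only -- substitute the actual device paths for each host
ceph orch daemon add osd ginkgo:/dev/sdb
ceph orch daemon add osd biloba:/dev/sdb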
Disable unnecessary services:
ceph orch rm alertmanager
ceph orch rm grafana
ceph orch rm node-exporter
Set the autoscale profile to scale-up instead of scale-down:
ceph osd pool set autoscale-profile scale-up
Set the default pool replication factor to 2 instead of 3:
ceph config set global osd_pool_default_size 2
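To confirm that both of these settings took effect:
ceph osd pool autoscale-status
ceph config dump | grep osd_pool_default_size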
Deploy the Managers and Monitors on riboflavin and ginkgo only:
ceph orch apply mon --placement '2 riboflavin ginkgo'
ceph orch apply mgr --placement '2 riboflavin ginkgo'
CloudStack Primary Storage
We are using RBD (RADOS Block Device) for CloudStack primary storage. The instructions below were adapted from https://docs.ceph.com/en/pacific/rbd/rbd-cloudstack/.
Create and initialize a pool:
ceph osd pool create cloudstack
rbd pool init cloudstack
Create a user for CloudStack:
ceph auth get-or-create client.cloudstack mon 'profile rbd' osd 'profile rbd pool=cloudstack'
Make a backup of this key. There is currently a copy in /etc/ceph/ceph.client.cloudstack.keyring on biloba. If you want to use the ceph command with this set of credentials, use the -n flag, e.g.
ceph -n client.cloudstack status
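If the keyring file ever needs to be regenerated from the cluster, something like this should work:
ceph auth get client.cloudstack -o /etc/ceph/ceph.client.cloudstack.keyring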
RBD commands
Here are some RBD commands which might be useful:
- List images (i.e. block devices) in the cloudstack pool:
rbd ls -p cloudstack
- View snapshots for an image:
rbd snap ls cloudstack/265dc008-4db5-11ec-b585-32ee6075b19b
- Unprotect a snapshot:
rbd snap unprotect cloudstack/265dc008-4db5-11ec-b585-32ee6075b19b@cloudstack-base-snap
- Purge all snapshots for an image (after unprotecting them):
rbd snap purge cloudstack/265dc008-4db5-11ec-b585-32ee6075b19b
- Delete an image:
rbd rm cloudstack/265dc008-4db5-11ec-b585-32ee6075b19b
- A quick 'n dirty script to delete all images in the pool:
rbd ls -p cloudstack | while read image; do rbd snap unprotect cloudstack/$image@cloudstack-base-snap; done
rbd ls -p cloudstack | while read image; do rbd snap purge cloudstack/$image; done
rbd ls -p cloudstack | while read image; do rbd rm cloudstack/$image; done
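The same cleanup can also be written as a single pass (a sketch; it assumes each image's protected snapshot is named cloudstack-base-snap as above, and the unprotect step will simply fail and move on for images without one):
rbd ls -p cloudstack | while read image; do
    rbd snap unprotect "cloudstack/$image@cloudstack-base-snap"
    rbd snap purge "cloudstack/$image"
    rbd rm "cloudstack/$image"
done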
CloudStack Secondary Storage
We are using NFS (v4) for CloudStack secondary storage. The steps below were adapted from:
- https://docs.ceph.com/en/pacific/cephfs/
- https://docs.ceph.com/en/pacific/cephadm/nfs/
- https://docs.ceph.com/en/pacific/mgr/nfs/#mgr-nfs
Create a new CephFS filesystem:
ceph fs volume create cloudstack-secondary
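Verify that the filesystem and its backing pools were created:
ceph fs ls
ceph fs status cloudstack-secondary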
Enable the NFS module:
ceph mgr module enable nfs
Create a cluster placed on two hosts:
ceph nfs cluster create cloudstack-nfs --placement '2 riboflavin ginkgo'
View cluster info:
ceph nfs cluster ls
ceph nfs cluster info cloudstack-nfs
Now create a CephFS export:
ceph nfs export create cephfs cloudstack-secondary cloudstack-nfs /cloudstack-secondary /
View export info:
ceph nfs export ls cloudstack-nfs
ceph nfs export get cloudstack-nfs /cloudstack-secondary
Now on the clients, we can just mount the NFS export normally:
mkdir /mnt/cloudstack-secondary
mount -t nfs4 -o port=2049 ceph-nfs.cloud.csclub.uwaterloo.ca:/cloudstack-secondary /mnt/cloudstack-secondary
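To make the mount persistent across reboots, an /etc/fstab entry along these lines should work (_netdev delays the mount until the network is up):
ceph-nfs.cloud.csclub.uwaterloo.ca:/cloudstack-secondary /mnt/cloudstack-secondary nfs4 port=2049,_netdev 0 0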
Security
The NFS module in Ceph is just NFS-Ganesha, which theoretically supports ACLs, but I wasn't able to get them to work; I kept getting a weird Python error. So we're going to use our iptables-fu instead (on riboflavin and ginkgo; make sure iptables-persistent is installed):
iptables -N CEPH-NFS
iptables -A INPUT -j CEPH-NFS
iptables -A CEPH-NFS -s 172.19.168.0/27 -j RETURN
iptables -A CEPH-NFS -p tcp --dport 2049 -j REJECT
iptables -A CEPH-NFS -p udp --dport 2049 -j REJECT
iptables-save > /etc/iptables/rules.v4

ip6tables -N CEPH-NFS
ip6tables -A INPUT -j CEPH-NFS
ip6tables -A CEPH-NFS -s fd74:6b6a:8eca:4902::/64 -j RETURN
ip6tables -A CEPH-NFS -p tcp --dport 2049 -j REJECT
ip6tables -A CEPH-NFS -p udp --dport 2049 -j REJECT
ip6tables-save > /etc/iptables/rules.v6
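To confirm the rules are in place:
iptables -L CEPH-NFS -n -v
ip6tables -L CEPH-NFS -n -v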
Dashboard
There is a web dashboard for Ceph running on riboflavin which is useful to get a holistic view of the system. You will need to do a port-forward over SSH:
ssh -L 8443:172.19.168.25:8443 riboflavin
Now if you visit https://localhost:8443 (ignore the HTTPS warning), you can log in to the dashboard. Credentials are stored in the usual place.
Recovering from a disk failure
Check which placement group(s) failed:
# Run this from `cephadm shell`
ceph health detail
The output will look something like this:
HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
[ERR] OSD_SCRUB_ERRORS: 1 scrub errors
[ERR] PG_DAMAGED: Possible data damage: 1 pg inconsistent
    pg 2.5 is active+clean+inconsistent, acting [3,0]
This means that placement group 2.5 failed and is in OSDs 3 and 0. Since our cluster has a replication factor of 2, one of those OSDs will be on the machine with the failed disk, and the other OSD will be on a healthy machine. Run this to see which machines have which OSDs:
ceph osd tree
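To find which host a particular OSD lives on (e.g. osd.3 from the example above):
ceph osd find 3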
Repairing the placement group
If the disk failure might have been intermittent, first try to repair the PG; see https://docs.ceph.com/en/pacific/rados/operations/pg-repair/ for details.
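For the example above, the repair command would be (read the page linked above first so you understand what it does):
ceph pg repair 2.5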
Removing or replacing a disk
First, find the OSD corresponding to the failed disk:
# Run this from `cephadm shell`
ceph-volume lvm list
Next, follow these instructions: https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html/operations_guide/handling-a-disk-failure#replacing-a-failed-osd-disk-ops
If you want to keep the same OSD ID: https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html/operations_guide/handling-a-disk-failure#replacing-an-osd-drive-while-retaining-the-osd-id-ops
More useful info: https://docs.ceph.com/en/pacific/rados/operations/add-or-rm-osds/
Forcefully removing an OSD
Update: I no longer recommend the instructions below; they are kept for historical purposes only. Follow the official Red Hat and Ceph docs instead.
In the examples below, osd.3 is the OSD with the bad disk.
ceph osd down osd.3
ceph osd out osd.3
ceph orch daemon rm osd.3 --force
ceph osd destroy osd.3 --yes-i-really-mean-it
ceph osd crush remove osd.3
ceph osd rm 3
Now on the host with the disk, run:
# view which LVM volumes are on which disks
lsblk
# get the device path of the bad Ceph LVM volume
lvdisplay
lvremove /dev/ceph-4318d615-2cde-4ea1-a25a-9cba09821fc3/osd-block-514bcfb1-07f2-4824-ba3c-c9031cc7d3e3
# get the VG containing the bad LV
vgdisplay
vgremove ceph-4318d615-2cde-4ea1-a25a-9cba09821fc3
# zero the beginning of the disk (if you plan on re-using it)
dd if=/dev/zero of=/dev/sdc bs=1M count=10 conv=fsync
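Alternatively, ceph-volume can tear down the LVM volumes and wipe the disk in one step (double-check the device path before running this; --destroy removes the VG and LV as well):
# Run this from `cephadm shell`
ceph-volume lvm zap /dev/sdc --destroy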
Miscellaneous commands
Here are some commands which may be useful. See the man page for a full reference.
- Show devices:
ceph orch device ls
Note: this doesn't actually show all of the individual disks. I think it might have to do with the hardware RAID controllers.
- Show OSDs (Object Storage Daemons) on the current host (this needs to be run from cephadm shell):
ceph-volume lvm list
- Show services:
ceph orch ls
- Show daemons of those services:
ceph orch ps
- Show non-default config settings:
ceph config dump
- Show pools:
ceph osd pool ls detail
- List users:
ceph auth ls