Ceph

From CSCWiki

We are running a three-node Ceph cluster on riboflavin, ginkgo and biloba for the purpose of cloud storage. Most Ceph services are running on riboflavin or ginkgo; biloba is just providing a tiny bit of extra storage space.

Official documentation: https://docs.ceph.com/en/latest/

At the time this page was written, the latest version of Ceph was 'Pacific'; check the website above to see what the latest version is.

Bootstrap

The instructions below were adapted from https://docs.ceph.com/en/pacific/cephadm/install/.

riboflavin was used as the bootstrap host, since it has the most storage.

Add the following to /etc/apt/sources.list.d/ceph.list:

deb http://mirror.csclub.uwaterloo.ca/ceph/debian-pacific/ bullseye main

Download the Ceph release key for the Debian packages:

wget -O /etc/apt/trusted.gpg.d/ceph.release.gpg https://download.ceph.com/keys/release.gpg

Now run:

apt update
apt install cephadm podman
cephadm bootstrap --mon-ip 172.19.168.25

For the rest of the instructions below, the ceph command can be run inside a Podman container by running cephadm shell. Alternatively, you can install the ceph-common package to run ceph directly on the host.
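If you find yourself typing cephadm shell a lot, a shell function along these lines forwards any ceph invocation into the container (a convenience sketch, not something currently deployed; `cephadm shell -- <cmd>` runs a single command in the container and exits):

```shell
# Hypothetical convenience wrapper: forward any `ceph ...` invocation
# into the cephadm container. Put this in root's shell profile on a
# host with cephadm installed.
ceph() {
    cephadm shell -- ceph "$@"
}
```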

Add the disks for riboflavin:

ceph orch daemon add osd riboflavin:/dev/sdb
ceph orch daemon add osd riboflavin:/dev/sdc

Note: Unfortunately Ceph didn't like it when I used one of the /dev/disk/by-id paths, so I had to use the /dev/sdX paths instead. I'm not sure what'll happen if the device names change at boot. Let's just cross our fingers and pray.
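One way to hedge against a device-name shuffle is to record the mapping from the persistent by-id names to the current /dev/sdX names, so the disks can be re-identified later. A small sketch (the helper name is ours; save its output somewhere safe):

```shell
# Print "<persistent name> -> <kernel device>" for every symlink in a
# by-id style directory (defaults to /dev/disk/by-id).
list_disk_ids() {
    dir="${1:-/dev/disk/by-id}"
    for link in "$dir"/*; do
        [ -L "$link" ] || continue
        printf '%s -> %s\n' "${link##*/}" "$(readlink -f "$link")"
    done
}
```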

Add more hosts:

ceph orch host add ginkgo 172.19.168.22 --labels _admin
ceph orch host add biloba 172.19.168.23

Add each available disk on each of the additional hosts.
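For example, a little helper like this (the function name and the host/device names are ours, purely illustrative) prints one orchestrator command per host:device pair, so the list can be reviewed before piping it to sh:

```shell
# Generate (rather than immediately run) an `osd add` command for each
# host:/dev/... pair given as an argument.
gen_osd_add_cmds() {
    for target in "$@"; do
        echo "ceph orch daemon add osd $target"
    done
}

# Example with made-up device names -- review the output, then append `| sh`:
gen_osd_add_cmds ginkgo:/dev/sdb ginkgo:/dev/sdc biloba:/dev/sdb
```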

Disable unnecessary services:

ceph orch rm alertmanager
ceph orch rm grafana
ceph orch rm node-exporter

Set the autoscale profile to scale-up instead of scale-down:

ceph osd pool set autoscale-profile scale-up

Set the default pool replication factor to 2 instead of 3:

ceph config set global osd_pool_default_size 2

Deploy the Managers and Monitors on riboflavin and ginkgo only:

ceph orch apply mon --placement '2 riboflavin ginkgo'
ceph orch apply mgr --placement '2 riboflavin ginkgo'

CloudStack Primary Storage

We are using RBD (RADOS Block Device) for CloudStack primary storage. The instructions below were adapted from https://docs.ceph.com/en/pacific/rbd/rbd-cloudstack/.

Create and initialize a pool:

ceph osd pool create cloudstack
rbd pool init cloudstack

Create a user for CloudStack:

ceph auth get-or-create client.cloudstack mon 'profile rbd' osd 'profile rbd pool=cloudstack'

Make a backup of this key. There is currently a copy in /etc/ceph/ceph.client.cloudstack.keyring on biloba. If you want to use the ceph command with this set of credentials, use the -n flag, e.g.

ceph -n client.cloudstack status

RBD commands

Here are some RBD commands which might be useful:

  • List images (i.e. block devices) in the cloudstack pool:
    rbd ls -p cloudstack
    
  • View snapshots for an image:
    rbd snap ls cloudstack/265dc008-4db5-11ec-b585-32ee6075b19b
    
  • Unprotect a snapshot:
    rbd snap unprotect cloudstack/265dc008-4db5-11ec-b585-32ee6075b19b@cloudstack-base-snap
    
  • Purge all snapshots for an image (after unprotecting them):
    rbd snap purge cloudstack/265dc008-4db5-11ec-b585-32ee6075b19b
    
  • Delete an image:
    rbd rm cloudstack/265dc008-4db5-11ec-b585-32ee6075b19b
    
  • A quick 'n dirty script to delete all images in the pool (note the quoting, in case an image name ever contains whitespace):
    rbd ls -p cloudstack | while read -r image; do rbd snap unprotect "cloudstack/$image@cloudstack-base-snap"; done
    rbd ls -p cloudstack | while read -r image; do rbd snap purge "cloudstack/$image"; done
    rbd ls -p cloudstack | while read -r image; do rbd rm "cloudstack/$image"; done
    
    

CloudStack Secondary Storage

We are using NFS (v4) for CloudStack secondary storage. The steps below were adapted from the official Ceph documentation on CephFS and the NFS manager module.

Create a new CephFS filesystem:

ceph fs volume create cloudstack-secondary

Enable the NFS module:

ceph mgr module enable nfs

Create a cluster placed on two hosts:

ceph nfs cluster create cloudstack-nfs --placement '2 riboflavin ginkgo'

View cluster info:

ceph nfs cluster ls
ceph nfs cluster info cloudstack-nfs

Now create a CephFS export:

ceph nfs export create cephfs cloudstack-secondary cloudstack-nfs /cloudstack-secondary /

View export info:

ceph nfs export ls cloudstack-nfs
ceph nfs export get cloudstack-nfs /cloudstack-secondary

Now on the clients, we can just mount the NFS export normally:

mkdir /mnt/cloudstack-secondary
mount -t nfs4 -o port=2049 ceph-nfs.cloud.csclub.uwaterloo.ca:/cloudstack-secondary /mnt/cloudstack-secondary
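To have the mount come back after a reboot, an /etc/fstab entry along these lines should work (untested sketch; `_netdev` delays the mount until networking is up):

```
ceph-nfs.cloud.csclub.uwaterloo.ca:/cloudstack-secondary /mnt/cloudstack-secondary nfs4 port=2049,_netdev 0 0
```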

Security

The NFS module in Ceph is just NFS-Ganesha, which theoretically supports ACLs, but I wasn't able to get that to work; I kept getting a weird Python error. So we're using our iptables-fu instead (on riboflavin and ginkgo; make sure iptables-persistent is installed):

iptables -N CEPH-NFS
iptables -A INPUT -j CEPH-NFS
iptables -A CEPH-NFS -s 172.19.168.0/27 -j RETURN
iptables -A CEPH-NFS -p tcp --dport 2049 -j REJECT
iptables -A CEPH-NFS -p udp --dport 2049 -j REJECT
iptables-save > /etc/iptables/rules.v4

ip6tables -N CEPH-NFS
ip6tables -A INPUT -j CEPH-NFS
ip6tables -A CEPH-NFS -s fd74:6b6a:8eca:4902::/64 -j RETURN
ip6tables -A CEPH-NFS -p tcp --dport 2049 -j REJECT
ip6tables -A CEPH-NFS -p udp --dport 2049 -j REJECT
ip6tables-save > /etc/iptables/rules.v6

Dashboard

There is a web dashboard for Ceph running on riboflavin which is useful to get a holistic view of the system. You will need to do a port-forward over SSH:

ssh -L 8443:172.19.168.25:8443 riboflavin

Now if you visit https://localhost:8443 (ignore the HTTPS warning), you can log in to the dashboard. Credentials are stored in the usual place.

Recovering from a disk failure

Check which placement group(s) failed:

# Run this from `cephadm shell`
ceph health detail

The output will look something like this:

HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
[ERR] OSD_SCRUB_ERRORS: 1 scrub errors
[ERR] PG_DAMAGED: Possible data damage: 1 pg inconsistent
    pg 2.5 is active+clean+inconsistent, acting [3,0]

This means that placement group 2.5 failed and is in OSDs 3 and 0. Since our cluster has a replication factor of 2, one of those OSDs will be on the machine with the failed disk, and the other OSD will be on a healthy machine. Run this to see which machines have which OSDs:
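If there are several damaged PGs, the relevant lines can be filtered out of the ceph health detail output; for example (the helper name is ours):

```shell
# Summarize damaged placement groups from `ceph health detail` output,
# passed on stdin, e.g.:  ceph health detail | damaged_pgs
damaged_pgs() {
    awk '/inconsistent, acting/ { print "pg " $2 " on OSDs " $NF }'
}
```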

ceph osd tree

Repairing the placement group

If the disk failure might have been intermittent, try repairing the PG first. See https://docs.ceph.com/en/pacific/rados/operations/pg-repair/ for details.

Removing or replacing a disk

First, find the OSD corresponding to the failed disk:

# Run this from `cephadm shell`
ceph-volume lvm list

Next, follow these instructions: https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html/operations_guide/handling-a-disk-failure#replacing-a-failed-osd-disk-ops

If you want to keep the same OSD ID: https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html/operations_guide/handling-a-disk-failure#replacing-an-osd-drive-while-retaining-the-osd-id-ops

More useful info: https://docs.ceph.com/en/pacific/rados/operations/add-or-rm-osds/

Forcefully removing an OSD

Update: I no longer recommend the instructions below; they are kept for historical purposes only. Follow the official Red Hat and Ceph docs instead.

In the examples below, osd.3 is the OSD with the bad disk.

ceph osd down osd.3
ceph osd out osd.3
ceph orch daemon rm osd.3 --force
ceph osd destroy osd.3 --yes-i-really-mean-it
ceph osd crush remove osd.3
ceph osd rm 3

Now on the host with the disk, run:

# view which LVM volumes are in which disks
lsblk
# get the device path of the bad Ceph LVM volume
lvdisplay
lvremove /dev/ceph-4318d615-2cde-4ea1-a25a-9cba09821fc3/osd-block-514bcfb1-07f2-4824-ba3c-c9031cc7d3e3
# get the VG with the bad LV
vgdisplay
vgremove ceph-4318d615-2cde-4ea1-a25a-9cba09821fc3
# zero the beginning of the disk (if you plan on re-using it)
dd if=/dev/zero of=/dev/sdc bs=1M count=10 conv=fsync

Miscellaneous commands

Here are some commands which may be useful. See the man page for a full reference.

  • Show devices:
    ceph orch device ls
    

    Note: this doesn't actually show all of the individual disks. I think it might have to do with the hardware RAID controllers.

  • Show OSDs (Object Storage Daemons) on the current host (this needs to be run from cephadm shell):
    ceph-volume lvm list
  • Show services:
    ceph orch ls
  • Show daemons of those services:
    ceph orch ps
  • Show non-default config settings:
    ceph config dump
  • Show pools:
    ceph osd pool ls detail
  • List users:
    ceph auth ls