Ceph

We are running a three-node Ceph cluster on riboflavin, ginkgo and biloba for the purpose of cloud storage. Most Ceph services are running on riboflavin or ginkgo; biloba is just providing a tiny bit of extra storage space.

Official documentation: https://docs.ceph.com/en/latest/

At the time this page was written, the latest version of Ceph was 'Pacific'; check the website above to see what the latest version is.

Bootstrap

The instructions below were adapted from https://docs.ceph.com/en/pacific/cephadm/install/.

riboflavin was used as the bootstrap host, since it has the most storage.

Add the following to /etc/apt/sources.list.d/ceph.list:

deb http://mirror.csclub.uwaterloo.ca/ceph/debian-pacific/ bullseye main

Download the Ceph release key for the Debian packages:

wget -O /etc/apt/trusted.gpg.d/ceph.release.gpg https://download.ceph.com/keys/release.gpg

Now run:

apt update
apt install cephadm podman
cephadm bootstrap --mon-ip 172.19.168.25

For the rest of the instructions below, the ceph command can be run inside a Podman container by running cephadm shell. Alternatively, you can install the ceph-common package to run ceph directly on the host.
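
For example, either style works for a one-off command (a quick sketch, not part of the original setup steps):

# run a single ceph command inside the container
cephadm shell -- ceph status
# or install the CLI directly on the host and use it natively
apt install ceph-common
ceph status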

Add the disks for riboflavin:

ceph orch daemon add osd riboflavin:/dev/sdb
ceph orch daemon add osd riboflavin:/dev/sdc

Note: Unfortunately Ceph didn't like it when I used one of the /dev/disk/by-id paths, so I had to use the /dev/sdX paths instead. I'm not sure what'll happen if the device names change at boot. Let's just cross our fingers and pray.

Add more hosts:

ceph orch host add ginkgo 172.19.168.22 --labels _admin
ceph orch host add biloba 172.19.168.23

Add each available disk on each of the additional hosts.
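
For example (the device names here are placeholders; check what each host actually has before running this):

ceph orch daemon add osd ginkgo:/dev/sdb
ceph orch daemon add osd biloba:/dev/sdb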

Disable unnecessary services:

ceph orch rm alertmanager
ceph orch rm grafana
ceph orch rm node-exporter

Set the autoscale profile to scale-up instead of scale-down:

ceph osd pool set autoscale-profile scale-up

Set the default pool replication factor to 2 instead of 3:

ceph config set global osd_pool_default_size 2
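
A quick way to confirm the override is registered:

ceph config dump | grep osd_pool_default_size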

Deploy the Managers and Monitors on riboflavin and ginkgo only:

ceph orch apply mon --placement '2 riboflavin ginkgo'
ceph orch apply mgr --placement '2 riboflavin ginkgo'
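
To double-check where the daemons actually landed:

ceph orch ps | grep -E 'mon|mgr'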

CloudStack Primary Storage

We are using RBD (RADOS Block Device) for CloudStack primary storage. The instructions below were adapted from https://docs.ceph.com/en/pacific/rbd/rbd-cloudstack/.

Create and initialize a pool:

ceph osd pool create cloudstack
rbd pool init cloudstack

Create a user for CloudStack:

ceph auth get-or-create client.cloudstack mon 'profile rbd' osd 'profile rbd pool=cloudstack'

Make a backup of this key. There is currently a copy in /etc/ceph/ceph.client.cloudstack.keyring on biloba. If you want to use the ceph command with this set of credentials, use the -n flag, e.g.

ceph -n client.cloudstack status
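
One way to dump the keyring for backup (a sketch; adjust the output path to wherever the backup should live):

ceph auth get client.cloudstack -o /etc/ceph/ceph.client.cloudstack.keyring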

RBD commands

Here are some RBD commands which might be useful:

  • List images (i.e. block devices) in the cloudstack pool:
    rbd ls -p cloudstack
    
  • View snapshots for an image:
    rbd snap ls cloudstack/265dc008-4db5-11ec-b585-32ee6075b19b
    
  • Unprotect a snapshot:
    rbd snap unprotect cloudstack/265dc008-4db5-11ec-b585-32ee6075b19b@cloudstack-base-snap
    
  • Purge all snapshots for an image (after unprotecting them):
    rbd snap purge cloudstack/265dc008-4db5-11ec-b585-32ee6075b19b
    
  • Delete an image:
    rbd rm cloudstack/265dc008-4db5-11ec-b585-32ee6075b19b
    
  • A quick 'n dirty script to delete all images in the pool:
    rbd ls -p cloudstack | while read image; do rbd snap unprotect cloudstack/$image@cloudstack-base-snap; done
    rbd ls -p cloudstack | while read image; do rbd snap purge cloudstack/$image; done
    rbd ls -p cloudstack | while read image; do rbd rm cloudstack/$image; done
    
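A couple of read-only commands that are also handy when poking around (using the same example image as above):

rbd info cloudstack/265dc008-4db5-11ec-b585-32ee6075b19b
rbd du -p cloudstack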

CloudStack Secondary Storage

We are using NFS (v4) for CloudStack secondary storage. The steps below were adapted from:

  • https://docs.ceph.com/en/pacific/cephfs/
  • https://docs.ceph.com/en/pacific/cephadm/nfs/
  • https://docs.ceph.com/en/pacific/mgr/nfs/#mgr-nfs

Create a new CephFS filesystem:

ceph fs volume create cloudstack-secondary

Enable the NFS module:

ceph mgr module enable nfs

Create a cluster placed on two hosts:

ceph nfs cluster create cloudstack-nfs --placement '2 riboflavin ginkgo'

View cluster info:

ceph nfs cluster ls
ceph nfs cluster info cloudstack-nfs

Now create a CephFS export:

ceph nfs export create cephfs cloudstack-secondary cloudstack-nfs /cloudstack-secondary /

View export info:

ceph nfs export ls cloudstack-nfs
ceph nfs export get cloudstack-nfs /cloudstack-secondary

Now on the clients, we can just mount the NFS export normally:

mkdir /mnt/cloudstack-secondary
mount -t nfs4 -o port=2049 ceph-nfs.cloud.csclub.uwaterloo.ca:/cloudstack-secondary /mnt/cloudstack-secondary
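
To make the mount persistent across reboots, an /etc/fstab entry along these lines should work (a sketch; tune the options to taste):

ceph-nfs.cloud.csclub.uwaterloo.ca:/cloudstack-secondary /mnt/cloudstack-secondary nfs4 port=2049,_netdev 0 0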

Security

The NFS module in Ceph is just NFS-Ganesha (https://github.com/nfs-ganesha/nfs-ganesha), which does theoretically support ACLs, but I wasn't able to get it to work. I kept on getting some weird Python error. So we're going to use our iptables-fu instead (on riboflavin and ginkgo; make sure iptables-persistent is installed):

iptables -N CEPH-NFS
iptables -A INPUT -j CEPH-NFS
iptables -A CEPH-NFS -s 172.19.168.0/27 -j RETURN
iptables -A CEPH-NFS -p tcp --dport 2049 -j REJECT
iptables -A CEPH-NFS -p udp --dport 2049 -j REJECT
iptables-save > /etc/iptables/rules.v4

ip6tables -N CEPH-NFS
ip6tables -A INPUT -j CEPH-NFS
ip6tables -A CEPH-NFS -s fd74:6b6a:8eca:4902::/64 -j RETURN
ip6tables -A CEPH-NFS -p tcp --dport 2049 -j REJECT
ip6tables -A CEPH-NFS -p udp --dport 2049 -j REJECT
ip6tables-save > /etc/iptables/rules.v6
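
To verify the rules are in place and actually matching traffic:

iptables -L CEPH-NFS -n -v
ip6tables -L CEPH-NFS -n -v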

Dashboard

There is a web dashboard for Ceph running on riboflavin which is useful to get a holistic view of the system. You will need to do a port-forward over SSH:

ssh -L 8443:172.19.168.25:8443 riboflavin

Now if you visit https://localhost:8443 (ignore the HTTPS warning), you can log in to the dashboard. Credentials are stored in the usual place.

Adding a new disk

Let's say we added a new disk /dev/sdg to ginkgo. Log in to one of the Ceph management servers (riboflavin or ginkgo), then run

# clear any metadata at the start of the disk
dd if=/dev/zero of=/dev/sdg bs=1M count=10 conv=fsync
# Run this from inside `cephadm shell`
ceph orch daemon add osd ginkgo:/dev/sdg

And that's it! You can run ceph status to see the progress of the PGs getting rebalanced.

Recovering from a disk failure

Check which placement group(s) failed:

# Run this from `cephadm shell`
ceph health detail

The output will look something like this:

HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
[ERR] OSD_SCRUB_ERRORS: 1 scrub errors
[ERR] PG_DAMAGED: Possible data damage: 1 pg inconsistent
    pg 2.5 is active+clean+inconsistent, acting [3,0]

This means that placement group 2.5 failed and is in OSDs 3 and 0. Since our cluster has a replication factor of 2, one of those OSDs will be on the machine with the failed disk, and the other OSD will be on a healthy machine. Run this to see which machines have which OSDs:

ceph osd tree

Repairing the placement group

If the disk failure might have been intermittent, try and see if we can repair the PG first. See https://docs.ceph.com/en/pacific/rados/operations/pg-repair/ for details.
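
The short version, using the inconsistent PG from the example output above (run this from cephadm shell):

# ask Ceph to repair the inconsistent placement group
ceph pg repair 2.5
# then watch for it to go back to active+clean
ceph health detail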

Removing or replacing a disk

First, find the OSD corresponding to the failed disk:

# Run this from `cephadm shell`
ceph-volume lvm list

Read these pages:

  • https://docs.ceph.com/en/pacific/rados/operations/add-or-rm-osds
  • https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html/operations_guide/handling-a-disk-failure

Here's the TLDR (assuming OSD 3 has the disk which failed):

First, take the OSD out of the cluster:

ceph osd out osd.3

Wait until the data is backfilled to the other OSDs (this could take a long time):

ceph status
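
As an extra sanity check before purging (not in the original notes, but harmless), you can ask Ceph whether the OSD can be destroyed without risking data:

ceph osd safe-to-destroy osd.3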

Remove the OSD daemon, then purge the OSD completely:

ceph orch daemon rm osd.3 --force
ceph osd purge osd.3 --yes-i-really-mean-it

Destroy the LVM logical volume and volume group:

ceph-volume lvm zap --destroy --osd-id 3

At this point, the hard drive can be removed.

After the drive has been replaced, zap it and add it to the cluster normally:

dd if=/dev/zero of=/dev/sde bs=1M count=10 conv=fsync
ceph orch daemon add osd ginkgo:/dev/sde

Reducing log verbosity

By default, debug messages are enabled and written to stderr (so they end up in the journald logs, since the Ceph daemons run in Podman). Unfortunately there is a bug in Ceph (https://tracker.ceph.com/issues/49161) which seems to always cause debug messages to be enabled on stderr specifically. So we will log to syslog instead (which is just systemd-journald on Debian).

Run cephadm shell on riboflavin, then run the following:

ceph config set global mon_cluster_log_file_level info
ceph config set global log_to_stderr false
ceph config set global mon_cluster_log_to_stderr false
ceph config set global mon_cluster_log_to_syslog true

These settings should take effect on all of the Ceph hosts immediately. See https://docs.ceph.com/en/pacific/rados/configuration/ceph-conf/ for reference.
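
To confirm the cluster log is now landing in journald (the unit glob below assumes the standard cephadm naming of ceph-<fsid>@<daemon>):

journalctl -u 'ceph*' -n 50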

Miscellaneous commands

Here are some commands which may be useful. See the man page (https://docs.ceph.com/en/latest/man/8/ceph/) for a full reference.

  • Show devices:
    ceph orch device ls
    

    Note: this doesn't actually show all of the individual disks. I think it might have to do with the hardware RAID controllers.

  • Show OSDs (Object Storage Daemons) on the current host (this needs to be run from cephadm shell):
    ceph-volume lvm list
  • Show services:
    ceph orch ls
  • Show daemons of those services:
    ceph orch ps
  • Show non-default config settings:
    ceph config dump
  • Show pools:
    ceph osd pool ls detail
  • List users:
    ceph auth ls