Ceph

We are running a three-node Ceph cluster on riboflavin, ginkgo and biloba for the purpose of cloud storage. Most Ceph services are running on riboflavin or ginkgo; biloba is just providing a tiny bit of extra storage space.

Official documentation: https://docs.ceph.com/en/latest/

At the time this page was written, the latest version of Ceph was 'Pacific'; check the website above to see what the latest version is.

Bootstrap

The instructions below were adapted from https://docs.ceph.com/en/pacific/cephadm/install/.

riboflavin was used as the bootstrap host, since it has the most storage.

Add the following to /etc/apt/sources.list.d/ceph.list:

deb http://mirror.csclub.uwaterloo.ca/ceph/debian-pacific/ bullseye main

Download the Ceph release key for the Debian packages:

wget -O /etc/apt/trusted.gpg.d/ceph.release.gpg https://download.ceph.com/keys/release.gpg

Now run:

apt update
apt install cephadm podman
cephadm bootstrap --mon-ip 172.19.168.25

For the rest of the instructions below, the ceph command can be run inside a Podman container by running cephadm shell. Alternatively, you can install the ceph-common package to run ceph directly on the host.
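
For example, both of the following get you a working ceph command for checking cluster health (the second assumes ceph-common has been installed):

# Option 1: run ceph from a temporary Podman container
cephadm shell
ceph status

# Option 2: run ceph directly on the host (needs the ceph-common package)
apt install ceph-common
ceph status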

Add the disks for riboflavin:

ceph orch daemon add osd riboflavin:/dev/sdb
ceph orch daemon add osd riboflavin:/dev/sdc

Note: Unfortunately Ceph didn't like it when I used one of the /dev/disk/by-id paths, so I had to use the /dev/sdX paths instead. I'm not sure what'll happen if the device names change at boot. Let's just cross our fingers and pray.
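
If you need to double-check which /dev/sdX name currently corresponds to which physical drive before adding it, something like the following helps (the lsblk column selection is just a suggestion):

# Show disks as the Ceph orchestrator sees them
ceph orch device ls
# Cross-reference device names against sizes and serial numbers on the host
lsblk -o NAME,SIZE,SERIAL,MODEL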

Add more hosts:

ceph orch host add ginkgo 172.19.168.22 --labels _admin
ceph orch host add biloba 172.19.168.23

Add each available disk on each of the additional hosts.
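
For example (the device paths below are placeholders; check ceph orch device ls for what is actually available on each host):

ceph orch daemon add osd ginkgo:/dev/sdb
ceph orch daemon add osd biloba:/dev/sdb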

Disable unnecessary services:

ceph orch rm alertmanager
ceph orch rm grafana
ceph orch rm node-exporter

Set the autoscale profile to scale-up instead of scale-down:

ceph osd pool set autoscale-profile scale-up
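
To see what the autoscaler currently intends to do with each pool, this read-only command is handy:

ceph osd pool autoscale-status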

Set the default pool replication factor to 2 instead of 3:

ceph config set global osd_pool_default_size 2

Deploy the Managers and Monitors on riboflavin and ginkgo only:

ceph orch apply mon --placement '2 riboflavin ginkgo'
ceph orch apply mgr --placement '2 riboflavin ginkgo'
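
As a quick sanity check that the daemons ended up where expected:

# List all orchestrator-managed daemons and the hosts they run on
ceph orch ps
ceph status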

CloudStack Primary Storage

We are using RBD (RADOS Block Device) for CloudStack primary storage. The instructions below were adapted from https://docs.ceph.com/en/pacific/rbd/rbd-cloudstack/.

Create and initialize a pool:

ceph osd pool create cloudstack
rbd pool init cloudstack

Create a user for CloudStack:

ceph auth get-or-create client.cloudstack mon 'profile rbd' osd 'profile rbd pool=cloudstack'

Make a backup of this key. There is currently a copy in /etc/ceph/ceph.client.cloudstack.keyring on biloba. If you want to use the ceph command with this set of credentials, use the -n flag, e.g.

ceph -n client.cloudstack status

RBD commands

Here are some RBD commands which might be useful:

  • List images (i.e. block devices) in the cloudstack pool:
    rbd ls -p cloudstack
    
  • View snapshots for an image:
    rbd snap ls cloudstack/265dc008-4db5-11ec-b585-32ee6075b19b
    
  • Unprotect a snapshot:
    rbd snap unprotect cloudstack/265dc008-4db5-11ec-b585-32ee6075b19b@cloudstack-base-snap
    
  • Purge all snapshots for an image (after unprotecting them):
    rbd snap purge cloudstack/265dc008-4db5-11ec-b585-32ee6075b19b
    
  • Delete an image:
    rbd rm cloudstack/265dc008-4db5-11ec-b585-32ee6075b19b
    
  • A quick 'n dirty script to delete all images in the pool:
    rbd ls -p cloudstack | while read image; do rbd snap unprotect cloudstack/$image@cloudstack-base-snap; done
    rbd ls -p cloudstack | while read image; do rbd snap purge cloudstack/$image; done
    rbd ls -p cloudstack | while read image; do rbd rm cloudstack/$image; done
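
The quick 'n dirty one-liners above assume every image has a protected cloudstack-base-snap snapshot; here is a slightly more defensive sketch of the same thing (same rbd commands, same assumption about the snapshot name) that simply ignores images where the unprotect step fails:

# Delete every image in the pool; errors from the unprotect step
# (e.g. snapshot already unprotected or missing) are ignored
rbd ls -p cloudstack | while read image; do
    rbd snap unprotect "cloudstack/$image@cloudstack-base-snap" 2>/dev/null
    rbd snap purge "cloudstack/$image"
    rbd rm "cloudstack/$image"
done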
    

CloudStack Secondary Storage

We are using NFS (v4) for CloudStack secondary storage. The steps below were adapted from the upstream Ceph documentation for CephFS and the nfs manager module.

Create a new CephFS filesystem:

ceph fs volume create cloudstack-secondary

Enable the NFS module:

ceph mgr module enable nfs

Create a cluster placed on two hosts:

ceph nfs cluster create cloudstack-nfs --placement '2 riboflavin ginkgo'

View cluster info:

ceph nfs cluster ls
ceph nfs cluster info cloudstack-nfs

Now create a CephFS export:

ceph nfs export create cephfs cloudstack-secondary cloudstack-nfs /cloudstack-secondary /

View export info:

ceph nfs export ls cloudstack-nfs
ceph nfs export get cloudstack-nfs /cloudstack-secondary

Now on the clients, we can just mount the NFS export normally:

mkdir /mnt/cloudstack-secondary
mount -t nfs4 -o port=2049 ceph-nfs.cloud.csclub.uwaterloo.ca:/cloudstack-secondary /mnt/cloudstack-secondary
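
If the mount should persist across reboots, an /etc/fstab entry along these lines should work (the _netdev option and the mount point are assumptions; adjust as needed):

ceph-nfs.cloud.csclub.uwaterloo.ca:/cloudstack-secondary  /mnt/cloudstack-secondary  nfs4  port=2049,_netdev  0  0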

Security

The NFS module in Ceph is just NFS-Ganesha, which does theoretically support ACLs, but I wasn't able to get it to work. I kept on getting some weird Python error. So we're going to use our iptables-fu instead (on riboflavin and ginkgo; make sure iptables-persistent is installed):

iptables -N CEPH-NFS
iptables -A INPUT -j CEPH-NFS
iptables -A CEPH-NFS -s 172.19.168.0/27 -j RETURN
iptables -A CEPH-NFS -p tcp --dport 2049 -j REJECT
iptables -A CEPH-NFS -p udp --dport 2049 -j REJECT
iptables-save > /etc/iptables/rules.v4

ip6tables -N CEPH-NFS
ip6tables -A INPUT -j CEPH-NFS
ip6tables -A CEPH-NFS -s fd74:6b6a:8eca:4902::/64 -j RETURN
ip6tables -A CEPH-NFS -p tcp --dport 2049 -j REJECT
ip6tables -A CEPH-NFS -p udp --dport 2049 -j REJECT
ip6tables-save > /etc/iptables/rules.v6
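
To confirm the rules are active afterwards:

# Show the CEPH-NFS chain with packet counters
iptables -L CEPH-NFS -n -v
ip6tables -L CEPH-NFS -n -v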

Dashboard

There is a web dashboard for Ceph running on riboflavin which is useful to get a holistic view of the system. You will need to do a port-forward over SSH:

ssh -L 8443:172.19.168.25:8443 riboflavin

Now if you visit https://localhost:8443 (ignore the HTTPS warning), you can login to the dashboard. Credentials are stored in the usual place.

Adding a new disk

Let's say we added a new disk /dev/sdg to ginkgo. Log in to ginkgo (the dd below has to run on the host that actually has the new disk; the ceph orch command can be run from either management server, riboflavin or ginkgo), then run

# clear any metadata at the start of the disk
dd if=/dev/zero of=/dev/sdg bs=1M count=10 conv=fsync
# Run this from inside `cephadm shell`
ceph orch daemon add osd ginkgo:/dev/sdg

And that's it! You can run ceph status to see the progress of the PGs getting rebalanced.
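
To keep an eye on the rebalance as it happens, these are also useful (run from cephadm shell):

# Stream cluster status and health events as they occur
ceph -w
# Show per-OSD utilization, weights and PG counts
ceph osd df tree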

Recovering from a disk failure

Check which placement group(s) failed:

# Run this from `cephadm shell`
ceph health detail

The output will look something like this:

HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
[ERR] OSD_SCRUB_ERRORS: 1 scrub errors
[ERR] PG_DAMAGED: Possible data damage: 1 pg inconsistent
    pg 2.5 is active+clean+inconsistent, acting [3,0]

This means that placement group 2.5 failed and is in OSDs 3 and 0. Since our cluster has a replication factor of 2, one of those OSDs will be on the machine with the failed disk, and the other OSD will be on a healthy machine. Run this to see which machines have which OSDs:

ceph osd tree

Repairing the placement group

If the disk failure might have been intermittent, try and see if we can repair the PG first. See https://docs.ceph.com/en/pacific/rados/operations/pg-repair/ for details.
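
For the example above (pg 2.5), a minimal sketch of that process looks like this (run from cephadm shell):

# See which objects in the placement group are inconsistent
rados list-inconsistent-obj 2.5 --format=json-pretty
# Ask Ceph to attempt a repair of the placement group
ceph pg repair 2.5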

Removing or replacing a disk

First, find the OSD corresponding to the failed disk:

# Run this from `cephadm shell`
ceph-volume lvm list

Read these pages:

  • https://docs.ceph.com/en/pacific/rados/operations/add-or-rm-osds
  • https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html/operations_guide/handling-a-disk-failure

Here's the TLDR (assuming OSD 3 has the disk which failed):

First, take the OSD out of the cluster:

ceph osd out osd.3

Wait until the data is backfilled to the other OSDs (this could take a long time):

ceph status
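
Optionally, you can also ask Ceph directly whether it considers the OSD safe to remove before proceeding:

# Exits non-zero until osd.3 can be removed without risking data
ceph osd safe-to-destroy osd.3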

Remove the OSD daemon, then purge the OSD completely:

ceph orch daemon rm osd.3 --force
ceph osd purge osd.3 --yes-i-really-mean-it

Destroy the LVM logical volume and volume group:

ceph-volume lvm zap --destroy --osd-id 3

At this point, the hard drive can be removed.

After the drive has been replaced, zap it and add it to the cluster normally:

dd if=/dev/zero of=/dev/sde bs=1M count=10 conv=fsync
ceph orch daemon add osd ginkgo:/dev/sde

Reducing log verbosity

By default, debug messages are enabled and written to stderr (so they end up in the journald logs, since the daemons run in Podman). Unfortunately there is a bug in Ceph (https://tracker.ceph.com/issues/49161) which seems to always cause debug messages to be enabled on stderr specifically. So we will log to syslog instead (which is just systemd-journald on Debian).

Run cephadm shell on riboflavin, then run the following:

ceph config set global mon_cluster_log_file_level info
ceph config set global log_to_stderr false
ceph config set global mon_cluster_log_to_stderr false
ceph config set global mon_cluster_log_to_syslog true

These settings should take effect on all of the Ceph hosts immediately. See https://docs.ceph.com/en/pacific/rados/configuration/ceph-conf/ for reference.
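
You can verify that the new values were picked up with ceph config get, e.g.:

ceph config get global log_to_stderr
ceph config get global mon_cluster_log_to_syslog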

Miscellaneous commands

Here are some commands which may be useful. See the ceph(8) man page for a full reference.

  • Show devices:
    ceph orch device ls
    

    Note: this doesn't actually show all of the individual disks. I think it might have to do with the hardware RAID controllers.

  • Show OSDs (Object Storage Daemons) on the current host (this needs to be run from cephadm shell):
    ceph-volume lvm list
  • Show services:
    ceph orch ls
  • Show daemons of those services:
    ceph orch ps
  • Show non-default config settings:
    ceph config dump
  • Show pools:
    ceph osd pool ls detail
  • List users:
    ceph auth ls