We are running a three-node Ceph cluster on riboflavin, ginkgo and biloba for the purpose of cloud storage. Most Ceph services are running on riboflavin or ginkgo; biloba is just providing a tiny bit of extra storage space.
Official documentation: https://docs.ceph.com/en/latest/
At the time this page was written, the latest version of Ceph was 'Pacific'; check the website above to see what the latest version is.
== Bootstrap ==
The instructions below were adapted from https://docs.ceph.com/en/pacific/cephadm/install/.
riboflavin was used as the bootstrap host, since it has the most storage.
Add the following to /etc/apt/sources.list.d/ceph.list:
<pre>
deb http://mirror.csclub.uwaterloo.ca/ceph/debian-pacific/ bullseye main
</pre>
Download the Ceph release key for the Debian packages:
<pre>
wget -O /etc/apt/trusted.gpg.d/ceph.release.gpg https://download.ceph.com/keys/release.gpg
</pre>
Now run:
<pre>
apt update
apt install cephadm podman
cephadm bootstrap --mon-ip 172.19.168.25
</pre>
For the rest of the instructions below, the <code>ceph</code> command can be run inside a Podman container by running <code>cephadm shell</code>. Alternatively, you can install the <code>ceph-common</code> package to run <code>ceph</code> directly on the host.
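For one-off commands, <code>cephadm shell</code> can also run a single command and exit instead of dropping you into an interactive shell, e.g.:

<pre>
# run a single ceph command in a throwaway container
cephadm shell -- ceph status
</pre>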
Add the disks for riboflavin:
<pre>
ceph orch daemon add osd riboflavin:/dev/sdb
ceph orch daemon add osd riboflavin:/dev/sdc
</pre>
Note: Unfortunately Ceph didn't like it when I used one of the /dev/disk/by-id paths, so I had to use the /dev/sdX paths instead. I'm not sure what'll happen if the device names change at boot. Let's just cross our fingers and pray.
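If a future reshuffle does happen, it helps to have a record of which physical disk is behind each /dev/sdX name today; something like this captures the serial numbers and by-id symlinks:

<pre>
# record which serial/WWN is behind each sdX name, for future reference
lsblk -o NAME,SIZE,SERIAL,WWN
ls -l /dev/disk/by-id/
</pre>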
Add more hosts:
<pre>
ceph orch host add ginkgo 172.19.168.22 --labels _admin
ceph orch host add biloba 172.19.168.23
</pre>
Add each available disk on each of the additional hosts.
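For example (the device names here are hypothetical; check <code>ceph orch device ls</code> for the real ones):

<pre>
# hypothetical device names; verify with `ceph orch device ls` first
ceph orch daemon add osd ginkgo:/dev/sdb
ceph orch daemon add osd biloba:/dev/sdb
</pre>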
Disable unnecessary services:
<pre>
ceph orch rm alertmanager
ceph orch rm grafana
ceph orch rm node-exporter
</pre>
Set the autoscale profile to scale-up instead of scale-down:
<pre>
ceph osd pool set autoscale-profile scale-up
</pre>
Set the default pool replication factor to 2 instead of 3:
<pre>
ceph config set global osd_pool_default_size 2
</pre>
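To confirm a setting took effect, <code>ceph config dump</code> lists everything that differs from the defaults:

<pre>
ceph config dump | grep osd_pool_default_size
</pre>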
Deploy the Managers and Monitors on riboflavin and ginkgo only:
<pre>
ceph orch apply mon --placement '2 riboflavin ginkgo'
ceph orch apply mgr --placement '2 riboflavin ginkgo'
</pre>
== CloudStack Primary Storage ==
We are using RBD (RADOS Block Device) for CloudStack primary storage. The instructions below were adapted from https://docs.ceph.com/en/pacific/rbd/rbd-cloudstack/.
Create and initialize a pool:
<pre>
ceph osd pool create cloudstack
rbd pool init cloudstack
</pre>
Create a user for CloudStack:
<pre>
ceph auth get-or-create client.cloudstack mon 'profile rbd' osd 'profile rbd pool=cloudstack'
</pre>
Make a backup of this key. There is currently a copy in /etc/ceph/ceph.client.cloudstack.keyring on biloba. If you want to use the <code>ceph</code> command with this set of credentials, use the <code>-n</code> flag, e.g.

<pre>
ceph -n client.cloudstack status
</pre>
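If the keyring file ever needs to be regenerated, <code>ceph auth get</code> can write it back out:

<pre>
ceph auth get client.cloudstack -o /etc/ceph/ceph.client.cloudstack.keyring
</pre>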
=== RBD commands ===
Here are some RBD commands which might be useful:
<ul>
<li>
List images (i.e. block devices) in the cloudstack pool:
<pre>rbd ls -p cloudstack</pre>
</li>
<li>
View snapshots for an image:
<pre>rbd snap ls cloudstack/265dc008-4db5-11ec-b585-32ee6075b19b</pre>
</li>
<li>
Unprotect a snapshot:
<pre>rbd snap unprotect cloudstack/265dc008-4db5-11ec-b585-32ee6075b19b@cloudstack-base-snap</pre>
</li>
<li>
Purge all snapshots for an image (after unprotecting them):
<pre>rbd snap purge cloudstack/265dc008-4db5-11ec-b585-32ee6075b19b</pre>
</li>
<li>
Delete an image:
<pre>rbd rm cloudstack/265dc008-4db5-11ec-b585-32ee6075b19b</pre>
</li>
<li>
A quick 'n dirty script to delete all images in the pool:
<pre>
rbd ls -p cloudstack | while read image; do rbd snap unprotect cloudstack/$image@cloudstack-base-snap; done
rbd ls -p cloudstack | while read image; do rbd snap purge cloudstack/$image; done
rbd ls -p cloudstack | while read image; do rbd rm cloudstack/$image; done
</pre>
</li>
</ul>
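Before running the bulk-delete script, it may be worth confirming that no image is still mapped anywhere; <code>rbd status</code> lists active watchers (an image with watchers is still open somewhere):

<pre>
rbd ls -p cloudstack | while read image; do echo "== $image =="; rbd status cloudstack/$image; done
</pre>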
== CloudStack Secondary Storage ==
We are using NFS (v4) for CloudStack secondary storage. The steps below were adapted from:
* https://docs.ceph.com/en/pacific/cephfs/
* https://docs.ceph.com/en/pacific/cephadm/nfs/
* https://docs.ceph.com/en/pacific/mgr/nfs/#mgr-nfs
Create a new CephFS filesystem:
<pre>
ceph fs volume create cloudstack-secondary
</pre>
Enable the NFS module:
<pre>
ceph mgr module enable nfs
</pre>
Create a cluster placed on two hosts:
<pre>
ceph nfs cluster create cloudstack-nfs --placement '2 riboflavin ginkgo'
</pre>
View cluster info:
<pre>
ceph nfs cluster ls
ceph nfs cluster info cloudstack-nfs
</pre>
Now create a CephFS export:
<pre>
ceph nfs export create cephfs cloudstack-secondary cloudstack-nfs /cloudstack-secondary /
</pre>
View export info:
<pre>
ceph nfs export ls cloudstack-nfs
ceph nfs export get cloudstack-nfs /cloudstack-secondary
</pre>
Now on the clients, we can just mount the NFS export normally:
<pre>
mkdir /mnt/cloudstack-secondary
mount -t nfs4 -o port=2049 ceph-nfs.cloud.csclub.uwaterloo.ca:/cloudstack-secondary /mnt/cloudstack-secondary
</pre>
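To make the mount persistent across reboots, an /etc/fstab entry along these lines should work (a sketch, not tested here; <code>_netdev</code> delays mounting until the network is up):

<pre>
ceph-nfs.cloud.csclub.uwaterloo.ca:/cloudstack-secondary /mnt/cloudstack-secondary nfs4 port=2049,_netdev 0 0
</pre>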
=== Security ===
The NFS module in Ceph is just [https://github.com/nfs-ganesha/nfs-ganesha NFS-Ganesha], which does theoretically support ACLs, but I wasn't able to get it to work. I kept on getting some weird Python error. So we're going to use our iptables-fu instead (on riboflavin and ginkgo; make sure iptables-persistent is installed):
<pre>
iptables -N CEPH-NFS
iptables -A INPUT -j CEPH-NFS
iptables -A CEPH-NFS -s 172.19.168.0/27 -j RETURN
iptables -A CEPH-NFS -p tcp --dport 2049 -j REJECT
iptables -A CEPH-NFS -p udp --dport 2049 -j REJECT
iptables-save > /etc/iptables/rules.v4
ip6tables -N CEPH-NFS
ip6tables -A INPUT -j CEPH-NFS
ip6tables -A CEPH-NFS -s fd74:6b6a:8eca:4902::/64 -j RETURN
ip6tables -A CEPH-NFS -p tcp --dport 2049 -j REJECT
ip6tables -A CEPH-NFS -p udp --dport 2049 -j REJECT
ip6tables-save > /etc/iptables/rules.v6
</pre>
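To verify the chain is installed and actually matching traffic, check the per-rule packet counters:

<pre>
# -v shows packet/byte counters for each rule
iptables -L CEPH-NFS -n -v
ip6tables -L CEPH-NFS -n -v
</pre>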
== Dashboard ==
There is a web dashboard for Ceph running on riboflavin which is useful to get a holistic view of the system. You will need to do a port-forward over SSH:
<pre>
ssh -L 8443:172.19.168.25:8443 riboflavin
</pre>
Now if you visit https://localhost:8443 (ignore the HTTPS warning), you can log in to the dashboard. Credentials are stored in the usual place.
== Adding a new disk ==
Let's say we added a new disk /dev/sdg to ginkgo. Log in to one of the Ceph management servers (riboflavin or ginkgo), then run
<pre>
# clear any metadata at the start of the disk
dd if=/dev/zero of=/dev/sdg bs=1M count=10 conv=fsync

# Run this from inside `cephadm shell`
ceph orch daemon add osd ginkgo:/dev/sdg
</pre>
And that's it! You can run <code>ceph status</code> to see the progress of the PGs getting rebalanced.
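If you'd rather watch it live than re-run <code>ceph status</code>, <code>ceph -w</code> streams cluster status changes and events until interrupted:

<pre>
ceph -w
</pre>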
== Recovering from a disk failure ==
Check which placement group(s) failed:
<pre>
# Run this from `cephadm shell`
ceph health detail
</pre>
The output will look something like this:
<pre>
HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
[ERR] OSD_SCRUB_ERRORS: 1 scrub errors
[ERR] PG_DAMAGED: Possible data damage: 1 pg inconsistent
    pg 2.5 is active+clean+inconsistent, acting [3,0]
</pre>
This means that placement group 2.5 failed and is in OSDs 3 and 0. Since our cluster has a replication factor of 2, one of those OSDs will be on the machine with the failed disk, and the other OSD will be on a healthy machine. Run this to see which machines have which OSDs:
<pre>
ceph osd tree
</pre>
=== Repairing the placement group ===
If the disk failure might have been intermittent, first try to repair the PG. See https://docs.ceph.com/en/pacific/rados/operations/pg-repair/ for details.
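Per that page, the repair itself is a single command, using the PG ID reported by <code>ceph health detail</code> (2.5 in the example above):

<pre>
ceph pg repair 2.5
</pre>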
=== Removing or replacing a disk ===
First, find the OSD corresponding to the failed disk:
<pre>
# Run this from `cephadm shell`
ceph-volume lvm list
</pre>
Read these pages:
<ul>
<li>https://docs.ceph.com/en/pacific/rados/operations/add-or-rm-osds</li>
<li>https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html/operations_guide/handling-a-disk-failure</li>
</ul>
Here's the TLDR (assuming OSD 3 has the disk which failed):
First, take the OSD out of the cluster:
<pre>
ceph osd out osd.3
</pre>
Wait until the data is backfilled to the other OSDs (this could take a long time):
<pre>
ceph status
</pre>
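You can also ask Ceph directly whether the OSD can be removed without risking data; <code>ceph osd safe-to-destroy</code> reports exactly that:

<pre>
ceph osd safe-to-destroy osd.3
</pre>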
Remove the OSD daemon, then purge the OSD completely:
<pre>
ceph orch daemon rm osd.3 --force
ceph osd purge osd.3 --yes-i-really-mean-it
</pre>
Destroy the LVM logical volume and volume group:
<pre>
ceph-volume lvm zap --destroy --osd-id 3
</pre>
At this point, the hard drive can be removed.
After the drive has been replaced, zap it and add it to the cluster normally:
<pre>
dd if=/dev/zero of=/dev/sde bs=1M count=10 conv=fsync
ceph orch daemon add osd ginkgo:/dev/sde
</pre>
== Reducing log verbosity ==
By default, debug messages are enabled and written to stderr (so they end up in the journald logs, since the daemons run in Podman). Unfortunately there is a [https://tracker.ceph.com/issues/49161 bug in Ceph] which seems to always cause debug messages to be enabled on stderr specifically. So we will log to syslog instead (which is just systemd-journald on Debian).
Run <code>cephadm shell</code> on riboflavin, then run the following:
<pre>
ceph config set global mon_cluster_log_file_level info
ceph config set global log_to_stderr false
ceph config set global mon_cluster_log_to_stderr false
ceph config set global mon_cluster_log_to_syslog true
</pre>
These settings should take effect on all of the Ceph hosts immediately. See https://docs.ceph.com/en/pacific/rados/configuration/ceph-conf/ for reference.
== Miscellaneous commands ==
Here are some commands which may be useful. See the [https://docs.ceph.com/en/latest/man/8/ceph/ man page] for a full reference.
<ul>
<li>
Show devices:
<pre>ceph orch device ls</pre>
Note: this doesn't actually show all of the individual disks. I think it might have to do with the hardware RAID controllers.
</li>
<li>
Show OSDs (Object Storage Daemons) on the current host (this needs to be run from <code>cephadm shell</code>):
<pre>ceph-volume lvm list</pre>
</li>
<li>
Show services:
<pre>ceph orch ls</pre>
</li>
<li>
Show daemons of those services:
<pre>ceph orch ps</pre>
</li>
<li>
Show non-default config settings:
<pre>ceph config dump</pre>
</li>
<li>
Show pools:
<pre>ceph osd pool ls detail</pre>
</li>
<li>
List users:
<pre>ceph auth ls</pre>
</li>
</ul>