Observability
There are three pillars of observability: metrics, logging and tracing. We are only interested in the first two.
Metrics
All of our machines are, or at least should be, running the Prometheus node exporter. This exposes machine metrics (e.g. RAM used, disk space), which are scraped by the Prometheus server running at https://prometheus.csclub.uwaterloo.ca (currently a VM on phosphoric-acid). There are a few specialized exporters running on several other machines: a Postfix exporter is running on mail, an Apache exporter is running on caffeine, and an NGINX exporter is running on potassium-benzoate. There is also a custom exporter, written by syscom, running on potassium-benzoate for mirror stats.
Most of the exporters use mutual TLS authentication with the Prometheus server. I set the expiration date for the TLS certs to 10 years. If you are reading this and it is 2031 or later, then go update the certs.
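To avoid being surprised by an expired cert, the remaining lifetime can be computed from the notAfter field that `openssl x509 -enddate -noout` prints. The following is a minimal sketch, not an existing script: the function name and the example date are hypothetical placeholders, and the real expiry dates must be read from the certs themselves.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> int:
    """Return the whole days between `now` and a certificate's notAfter time.

    `not_after` is the text after "notAfter=" in the output of
    `openssl x509 -enddate -noout`, e.g. "Jan  1 00:00:00 2031 GMT".
    A negative result means the certificate has already expired.
    """
    # ssl.cert_time_to_seconds parses the certificate time format into epoch seconds.
    expiry_epoch = ssl.cert_time_to_seconds(not_after)
    expiry = datetime.fromtimestamp(expiry_epoch, tz=timezone.utc)
    return (expiry - now).days


if __name__ == "__main__":
    # Hypothetical expiry date, used only for illustration.
    example = "Jan  1 00:00:00 2031 GMT"
    print(days_until_expiry(example, datetime.now(timezone.utc)))
```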
I highly suggest becoming familiar with PromQL, the query language for Prometheus. You can run and visualize some queries at https://prometheus.csclub.uwaterloo.ca/prometheus. For example, here is a query to determine which machines are up or down:
up{job="node_exporter"}
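The same query can also be run from a script through the Prometheus HTTP API. The sketch below assumes the API is served under the same /prometheus prefix as the web UI and that no extra authentication is required, neither of which has been verified here. The helper names and the instance names in the sample response are made up.

```python
import json
from typing import List
from urllib.parse import urlencode

# Assumption: the API lives under the same prefix as the web UI.
BASE_URL = "https://prometheus.csclub.uwaterloo.ca/prometheus"


def instant_query_url(expr: str, base_url: str = BASE_URL) -> str:
    """Build the URL for an instant query against /api/v1/query."""
    return f"{base_url}/api/v1/query?{urlencode({'query': expr})}"


def down_instances(response_body: str) -> List[str]:
    """Return the instances whose sample value is 0 in an instant-query response."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error')}")
    # Each sample looks like {"metric": {...labels...}, "value": [timestamp, "string value"]}.
    return sorted(
        sample["metric"]["instance"]
        for sample in payload["data"]["result"]
        if float(sample["value"][1]) == 0
    )


if __name__ == "__main__":
    print(instant_query_url('up{job="node_exporter"}'))
    # Hand-written sample response with hypothetical instance names.
    sample_body = json.dumps({
        "status": "success",
        "data": {"resultType": "vector", "result": [
            {"metric": {"instance": "host-a", "job": "node_exporter"}, "value": [0, "1"]},
            {"metric": {"instance": "host-b", "job": "node_exporter"}, "value": [0, "0"]},
        ]},
    })
    print(down_instances(sample_body))
```

The response body itself could be fetched with `urllib.request.urlopen(instant_query_url(...))`, provided the server is reachable from where the script runs.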
Here's how we determine if a machine has NFS mounted. This will return 1 for machines which have NFS mounted, but will not return any records for machines which do not have NFS mounted. (We ignore the actual value of node_filesystem_device_error because it returns 1 for machines using Kerberized NFS.)
count by (instance) (node_filesystem_device_error{mountpoint="/users", fstype="nfs"})
The next expression is more complicated. It returns one of three values for each machine:
- 0: the machine is down
- 1: the machine is up, but NFS is not mounted
- 2: the machine is up and NFS is mounted
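The or operator in PromQL is key here. It keeps every series from its left-hand side and adds every series from its right-hand side whose label set has no match on the left. The count by (instance) result carries only the instance label, while the up series also carry the job label, so nothing matches and both sides are kept. The outer sum by (instance) then adds the two values for each machine.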
sum by (instance) ( (count by (instance) (node_filesystem_device_error{mountpoint="/users", fstype="nfs"})) or up{job="node_exporter"} )
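To make the arithmetic concrete, here is a small Python model of what the expression computes. It is only an illustration of the logic, not how Prometheus evaluates the query, and the instance names are hypothetical.

```python
from collections import defaultdict

# Meaning of each value returned by the expression above.
STATUS = {0: "down", 1: "up, NFS not mounted", 2: "up, NFS mounted"}


def combined_status(nfs_count: dict, up: dict) -> dict:
    """Model `sum by (instance) (nfs_count or up)`.

    `nfs_count` maps instance -> 1 only for machines with NFS mounted.
    `up` maps every instance -> 1 (up) or 0 (down).
    Because the two sides have different label sets, `or` keeps the series
    from both sides, so the sum adds the two values for each instance.
    """
    totals = defaultdict(int)
    for series in (nfs_count, up):
        for instance, value in series.items():
            totals[instance] += value
    return dict(totals)


if __name__ == "__main__":
    nfs_count = {"host-a": 1}                      # only host-a has NFS mounted
    up = {"host-a": 1, "host-b": 1, "host-c": 0}   # host-c is down
    for instance, value in sorted(combined_status(nfs_count, up).items()):
        print(instance, value, STATUS[value])
```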
We also use Alertmanager to send email alerts based on Prometheus metrics. We should figure out how to also send messages to IRC or similar.
We also use the Blackbox exporter to probe some of our web-based services and check whether they are up.
We make some pretty charts on Grafana (https://prometheus.csclub.uwaterloo.ca) from PromQL queries. Grafana also has an 'Explore' page where you can test out some queries before making chart panels from them.