Observability
Revision as of 18:18, 17 October 2021

There are three pillars of observability: metrics, logging and tracing. We are only interested in the first two.

Metrics

All of our machines are, or at least should be, running the Prometheus node exporter. This collects and sends machine metrics (e.g. RAM used, disk space) to the Prometheus server running at https://prometheus.csclub.uwaterloo.ca (currently a VM on phosphoric-acid). There are a few specialized exporters running on several other machines: a Postfix exporter is running on mail, an Apache exporter is running on caffeine, and an NGINX exporter is running on potassium-benzoate. There is also a custom exporter written by syscom running on potassium-benzoate for mirror stats.

Most of the exporters use mutual TLS authentication with the Prometheus server. I set the expiration date for the TLS certs to 10 years. If you are reading this and it is 2031 or later, then go update the certs.

I highly suggest becoming familiar with PromQL, the query language for Prometheus. You can run and visualize some queries at https://prometheus.csclub.uwaterloo.ca/prometheus. For example, here is a query to determine which machines are up or down:

up{job="node_exporter"}
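If you only want to see the machines that are currently down, you can filter the query above with a PromQL comparison operator, which drops any series whose value does not match:

up{job="node_exporter"} == 0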


Here's how we determine if a machine has NFS mounted. This will return 1 for machines which have NFS mounted, but will not return any records for machines which do not have NFS mounted. (We ignore the actual value of node_filesystem_device_error because it returns 1 for machines using Kerberized NFS.)

count by (instance) (node_filesystem_device_error{mountpoint="/users", fstype="nfs"})


Now this is a rather complicated expression which can return one of three values:

  • 0: the machine is down
  • 1: the machine is up, but NFS is not mounted
  • 2: the machine is up and NFS is mounted

The or operator in PromQL is key here.

sum by (instance) (
  (count by (instance) (node_filesystem_device_error{mountpoint="/users", fstype="nfs"}))
  or up{job="node_exporter"}
)


We also use AlertManager to send email alerts from Prometheus metrics. We should figure out how to also send messages to IRC or similar.
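For reference, a minimal Prometheus alerting rule based on the query above might look something like this (the group name, alert name, and threshold here are made up for illustration; our actual rules file may differ):

groups:
  - name: machine-alerts
    rules:
      - alert: InstanceDown
        expr: up{job="node_exporter"} == 0
        for: 5m
        annotations:
          summary: "{{ $labels.instance }} has been down for 5 minutes"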

We also use the Blackbox prober exporter to check if some of our web-based services are up.

We make some pretty charts on Grafana (https://prometheus.csclub.uwaterloo.ca) from PromQL queries. Grafana also has an 'Explore' page where you can test out some queries before making chart panels from them.
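As a starting point for a chart, here is a query for the fraction of disk space still available on each machine's root filesystem, using standard node exporter metrics (adjust the mountpoint filter as needed):

node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}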

Logging

We use a combination of Elastic Beats, Logstash and Loki for collecting, storing and querying our logs; for visualization, we use Grafana. Logstash and Loki are currently both running in the prometheus VM.

I chose Loki over Elasticsearch because Loki is very space-efficient with regard to storage. It also consumes far less RAM and CPU. This means that we can collect an enormous volume of logs without worrying too much about resource usage.

We have Journalbeat and/or Filebeat running on some of our machines to collect logs from sshd, Apache and NGINX. The Beats send these logs to Logstash, which does some pre-processing. The most useful contribution by Logstash is its GeoIP plugin, which allows us to enrich the logs with some geographical information from IP addresses (e.g. add city and country). Logstash sends these logs to Loki, and we can then view these from Grafana.

The language for querying logs in Loki is LogQL, which, syntactically, is very similar to PromQL. If you have already learned PromQL, then you should be able to pick up LogQL very easily. You can try out some LogQL queries from the 'Explore' page on Grafana; make sure you toggle the data source to 'Loki' in the top left corner. For the 'topk' queries, you will also want to toggle 'Query type' to 'Instant' rather than 'Range'.
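Before reaching for aggregations, you can also just grep the logs with a LogQL line filter. For example, assuming our sshd logs contain the usual 'Failed password' lines (the exact message depends on the sshd log format), this shows only matching log lines:

{job="logstash-sshd"} |= "Failed password"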

LogQL examples

Here are the number of failed SSH login attempts for each host for a given time range:

sum by (hostname) (
  count_over_time(
    {job="logstash-sshd"} [$__range]
  )
)

Note that $__range is a special global variable in Grafana which is equal to the time range in the top right corner of a chart.

Here are the top 10 IP addresses from which failed SSH login attempts arrived, for a given host and time range:

topk(10,
  sum by (ip_address) (
    count_over_time(
      {job="logstash-sshd",hostname="$hostname"} | json | __error__ = ""
      [$__range]
    )
  )
)

$hostname is a chart variable, which can be configured from a chart's settings.

I configured Logstash to send logs to Loki as JSON, but it's a rather hacky solution, so occasionally invalid JSON is sent.

Here are the number of HTTP requests for the top 15 distros on our mirror from the last hour:

topk(15,
  sum by (distro) (
    count_over_time(
      {job="logstash-nginx"} | json | __error__ = "" | distro != "server-status"
      [1h]
    )
  )
)



Here are the number of total bytes sent over HTTP for the top 15 distros from the last hour. Note the use of the unwrap operator.

topk(15,
  sum by (distro) (
    sum_over_time(
      {job="logstash-nginx"} | json | __error__ = "" | distro != "server-status" | unwrap bytes
      [1h]
    )
  )
)

You can see more examples on the Mirror Requests dashboard on Grafana.

Some more LogQL examples (webcom)

Here are queries which the Web Committee may find interesting. Try these out from the 'Explore' page in Grafana (after setting the data source to 'Loki'). You may optionally create a new dashboard if you think you've written some good queries.

Here's a query to just view the raw logs, parsed as JSON (explore the extracted labels for each log):

{job="logstash-apache"} | json
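If the parsed logs include an HTTP status code field (the label names depend on our Logstash configuration, so treat 'status' here as an assumption), you can add a label filter after the json stage, e.g. to see only server errors:

{job="logstash-apache"} | json | status >= 500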



Here's the number of requests by User-Agent for the top 15 requesters:

topk(15,
  sum by (agent) (
    count_over_time(
      {job="logstash-apache"} | json
      [$__range]
    )
  )
)
You can replace 'agent' with 'request', 'ip_address', etc.