Sysadmin Guide
Latest revision as of 19:58, 4 June 2022
The system administrator chairs the Systems Committee, and is responsible for keeping all of our computers in working order. The CSC computing environment is good, but not nearly perfect, and the sysadmin should look for ways to improve it. We don't have a strict "if it works, don't touch it" policy, and encourage people to try new things to see if they work better. Because of this, we don't have "5 nines" uptime or anything close, but do have a modern computing environment that is constantly improving. Our systems should be, and often are, better at the end of term than the beginning.
Early in the term, the sysadmin should consider what hardware upgrades we would like to have, and send proposals to the treasurer to add to the budget. A bit later, this happens again with MEF proposals.
The sysadmin should also make sure requests by our users (to systems-committee@csclub) are answered, and make recommendations to the Executive Council to add new systems committee members or reevaluate old ones.
Power Outages
Occasionally MC will undergo planned power outages. These usually last from the morning until the evening. a2brenna or someone from IST will hopefully give us advance notice. When this happens, you should:
Pre-Outage
- Send an email to csc-general announcing the outage (example: https://mailman.csclub.uwaterloo.ca/private/csc-general/2016-November/000694.html)
- Create an announcement on our main website announcing the outage
- Announce the outage in the #csc IRC channel and update the channel topic to show outage information
- Schedule the shutdown the night before the outage using the shutdown command on all of our MC machines, e.g. sudo shutdown 06:00 "CSC systems will be unavailable for a power outage 7am -> 5pm. This machine will shutdown at 6:00AM EDT."
- If the real machines hosting the web server (phosphoric-acid) and mirror (potassium-benzoate) cannot be kept up during the outage, set up a backup web server in an LXC container on a machine which is not located in MC (currently there is a container named dr-website on biloba). After the MC machines have shut down, assign the IP addresses of csclub.uwaterloo.ca and mirror.csclub.uwaterloo.ca to the backup container.
  TODO: Consider using keepalived to automate this process.
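The shutdown scheduling step can be scripted rather than typed on each machine. The following is a minimal sketch, not an existing CSC script: the host names in MC_HOSTS are placeholders, and it assumes you can ssh to each machine and run sudo there without a prompt. It defaults to a dry run that only prints what would be executed.

```shell
#!/bin/sh
# Placeholder host list (an assumption); replace with the real MC machines.
MC_HOSTS="${MC_HOSTS:-machine1 machine2}"

# schedule_shutdowns WHEN MESSAGE
# Prints the shutdown command for each host when DRY_RUN=1 (the default),
# or runs it over ssh when DRY_RUN=0.
# MESSAGE must not contain single quotes, since it is single-quoted for
# the remote shell.
schedule_shutdowns() {
    when="$1"
    message="$2"
    remote_cmd="sudo shutdown $when '$message'"
    for host in $MC_HOSTS; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            printf '%s: %s\n' "$host" "$remote_cmd"
        else
            ssh "$host" "$remote_cmd"
        fi
    done
}
```

For example, `schedule_shutdowns 06:00 "CSC systems will be unavailable for a power outage 7am -> 5pm."` prints one line per host; set DRY_RUN=0 once the output looks right.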
Post-Outage
- Log back into each MC machine and make sure that /users was mounted correctly. If not, check /etc/network/interfaces to get the name of the VLAN device, and use ip addr to see if the interface is up. If it is not up, try to use ifup; if that doesn't work, manually bring up the device and assign it the appropriate IP addresses using iproute2:
    # check /etc/network/interfaces for the interface name and IP
    ip link add name ens3.530 link ens3 type vlan id 530
    ip addr add dev ens3.530 172.19.168.49/27
    ip addr add dev ens3.530 fd74:6b6a:8eca:4903:c5c::49/64
    ip link set dev ens3.530 up
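The mount check itself is easy to script. This is a rough sketch, assuming a Linux machine where /proc/mounts lists active mounts; the function names are made up for illustration, and the mounts file is a parameter only so the check can be tried against a sample file.

```shell
#!/bin/sh
# is_mounted PATH [MOUNTS_FILE]
# Succeeds if PATH appears as a mount point (second field) in the mounts
# table. MOUNTS_FILE defaults to /proc/mounts.
is_mounted() {
    path="$1"
    mounts_file="${2:-/proc/mounts}"
    awk -v p="$path" '$2 == p { found = 1 } END { exit !found }' "$mounts_file"
}

# check_users_mount [MOUNTS_FILE]
# Reports whether /users is mounted and returns non-zero if it is not.
check_users_mount() {
    if is_mounted /users "${1:-/proc/mounts}"; then
        echo "/users is mounted"
    else
        echo "/users is NOT mounted; check the VLAN interface with: ip addr"
        return 1
    fi
}
```

Running `check_users_mount` on each machine after it boots gives a quick yes/no before you start looking at the network configuration.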
Let's Encrypt certificates
Make sure to read SSL first.
We handle LE certs for members and clubs who host their websites on our servers. The certs should be renewed automatically; if they are not, then something is very wrong. There are plans underway to migrate from certbot to dehydrated since the apt version of certbot appears to be broken.
If you get an email from LE warning you that a cert is about to expire, log in to caffeine and check /var/log/letsencrypt/letsencrypt.log. There should usually be some clue as to what went wrong. Often, a club or member decides that they no longer want to host their website on our servers, in which case the cert can safely be removed via certbot delete --cert-name CERT_NAME. Make sure to also delete the corresponding Apache config files. Sometimes a subset of the domains for a member has become invalid, in which case they must be removed from the cert. One way to do this is via certbot certonly --webroot -d domain1.com -d domain2.com ... Only list the domains which are still valid with the -d flags; omitted domains will be removed. Make sure to update the corresponding Apache config files.
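When a cert covers many domains, it is easy to mistype the -d list. The sketch below only builds and prints the certbot certonly command from the guide, leaving out the domains you name as invalid. The function name is hypothetical, and any further options the real setup needs (the guide does not show them) are deliberately left out.

```shell
#!/bin/sh
# certbot_command_without CURRENT INVALID
# CURRENT: space-separated domains currently on the cert.
# INVALID: space-separated domains that should be dropped.
# Prints the certbot command listing only the remaining domains;
# it does not run certbot.
certbot_command_without() {
    current="$1"
    invalid="$2"
    cmd="certbot certonly --webroot"
    for domain in $current; do
        case " $invalid " in
            *" $domain "*) ;;                  # skip domains marked invalid
            *) cmd="$cmd -d $domain" ;;
        esac
    done
    printf '%s\n' "$cmd"
}
```

The current domain list for each cert can be read from the output of `certbot certificates`. Review the printed command before running it.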
uwaterloo.ca subdomains
Make sure to read Web Hosting first.
If a club requests a uwaterloo.ca subdomain, first make sure that their website is being hosted on our servers. Then, forward the email to hostmaster (at) uwaterloo.ca, and ask them to make the domain a CNAME for caffeine.csclub.uwaterloo.ca. You will also need to add a VirtualHost entry in /etc/apache2 on caffeine, redirecting requests to /users/club_name/www.
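As a starting point for the VirtualHost entry, the sketch below prints a bare-bones stanza for a club domain. It is an illustration only: the function name is made up, the port and the choice of directives are assumptions, and the real configs on caffeine will likely contain more (TLS settings, for instance), so copy an existing club's entry where possible.

```shell
#!/bin/sh
# print_vhost DOMAIN CLUB_NAME
# Prints a minimal Apache VirtualHost stanza that serves DOMAIN from the
# club's www directory under /users.
print_vhost() {
    domain="$1"
    club="$2"
    cat <<EOF
<VirtualHost *:80>
    ServerName $domain
    DocumentRoot /users/$club/www
</VirtualHost>
EOF
}
```

For example, `print_vhost someclub.uwaterloo.ca someclub` prints a stanza with ServerName someclub.uwaterloo.ca and DocumentRoot /users/someclub/www.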
Make sure to create a new Let's Encrypt certificate for the domain.
Members do not get their own top-level uwaterloo.ca subdomains; they can instead request a <WatIAM>.members.csclub.uwaterloo.ca subdomain.
Mailing list subscriptions
At the very least, you need to be subscribed to the syscom and exec mailing lists. You may also wish to subscribe to the following:
- git: Get alerts on some git commits (mainly Apache configs)
- packages: Get alerts of when packages are added to our Debian repository
- ceo: Get alerts on new user accounts
Membership lifecycle
We should (but often don't) make sure that we clean up resources created by members who have expired. Otherwise, these just take up space. For cloud resources, we have a Python class called CloudResourceManager in the pyceo repo (https://git.csclub.uwaterloo.ca/public/pyceo/) which deletes resources for expired members and sends a warning email one week in advance (the implementation is pretty sketchy and could be improved). If you decide to host some service where members can create their own resources, please update the pyceo code as necessary so that those resources are deleted when the owners' memberships expire.
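The rule described above (warn one week ahead, delete once expired) boils down to a comparison of timestamps. The sketch below is not the pyceo implementation and does not touch any real resources; it only classifies a membership from two Unix timestamps, and the function and variable names are made up for illustration.

```shell
#!/bin/sh
# One week in seconds: the warning window before expiry.
WARNING_WINDOW=$((7 * 24 * 60 * 60))

# membership_status EXPIRY NOW
# EXPIRY and NOW are Unix epoch seconds.
# Prints "expired" if the membership has ended, "warn" if it ends within
# the warning window, and "active" otherwise.
membership_status() {
    expiry="$1"
    now="$2"
    if [ "$expiry" -le "$now" ]; then
        echo expired
    elif [ $((expiry - now)) -le "$WARNING_WINDOW" ]; then
        echo warn
    else
        echo active
    fi
}
```

A cleanup job for a new service could call something like this for each resource owner, sending the warning email on "warn" and deleting on "expired".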