August 2018 Power Outage Plan: Difference between revisions
Latest revision as of 14:50, 18 August 2018
There is a planned power outage in MC from Tuesday, August 21 to Thursday, August 30.
There are also two outages in DC, which will complicate keeping services up during the entire outage.
Impact
All services in MC are affected for Aug. 21-30, and services in DC are affected for two days during that window.
Services in PHY are not affected. PHY hosts only redundant DNS and authentication services; there are no other services (general-use or otherwise) there. It is also on a different network than MC and DC.
Timeline
Before Sunday, August 19
- Complete plan for outage
- Send notifications (and reminders) to csc-general
- Take backups of LDAP and Kerberos, and download offsite
- Take backups of system passwords, and download offsite
- Take backups of important containers/machines (whole things or just config): auth1, mail, caffeine
Sunday, August 19
- Copy the CSC website to caffeine-dr
- Shut down general-use computing services
- Shut down csclub.cloud components (they won't really work since not everything is redundant yet)
- Transfer computing services to redundant / temporary systems
- Revoke access to home directories on aspartame from all machines
Sometime during the outage window
- Shut down DC systems before the building outage
After the outage
- Begin restoring normal services
Networking
Our network is announced from both MC and DC. No impact to networking is expected when MC goes offline.
DHCP is hosted in MC (on caffeine). This is not strictly required as our servers use static IPs, but we can move it to DC so it's available.
Note that the University's core network and external links will be operating with reduced redundancy.
Systems
Mirror
CSCF will provide some generator power for mirror in MC.
CSCF is also setting up a second node in DC.
Website
(Note: Aug 18, 2018 - we might be able to power the netapp with generator power. If that's the case, then websites will be up during the outage.)
A copy of the CSC website will be hosted on caffeine-dr. All pages not found on the local machine (including member and club sites) will return a 503 Service Unavailable error page.
Sample status page: https://www-dr.csclub.uwaterloo.ca/test
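The fallback rule above (serve what is held locally, answer 503 for everything else) can be sketched as follows. This is only an illustration, not the configuration used on caffeine-dr: the LOCAL_PAGES mapping, the respond helper, the outage text and the port are all hypothetical, and a real deployment would more likely express the same rule in the web server's own configuration.

```python
from http import HTTPStatus
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical stand-in for the static copy of the CSC website.
LOCAL_PAGES = {"/": b"<h1>CSC</h1>", "/about": b"<h1>About</h1>"}

OUTAGE_BODY = b"<h1>Service unavailable during the power outage</h1>"


def respond(path):
    """Return (status, body): the local page if present, otherwise a 503."""
    page = LOCAL_PAGES.get(path)
    if page is not None:
        return HTTPStatus.OK, page
    # Anything not held locally (member and club sites included) gets a 503.
    return HTTPStatus.SERVICE_UNAVAILABLE, OUTAGE_BODY


class OutageHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = respond(self.path)
        self.send_response(status)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(port=8080):
    """Build a server bound to localhost; the port is an arbitrary test value."""
    return HTTPServer(("127.0.0.1", port), OutageHandler)
```

Calling make_server().serve_forever() starts it for local testing. Returning 503 rather than 404 signals to clients and crawlers that the condition is temporary.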
The following IP addresses should be added to caffeine-dr during the outage to serve the error page for other CSC services:
- caffeine: 129.97.134.17 / 2620:101:f000:4901:c5c::caff:e12e
- git: 129.97.134.49 / 2620:101:f000:4901:c5c:3eb::49
- wiki: 129.97.134.44 / 2620:101:f000:4901:c5c:3eb::44
- munin: 129.97.134.51 / 2620:101:f000:4901:c5c::51
- prometheus: 129.97.134.15 / 2620:101:f000:4901:c5c::15
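A small helper can sanity-check the addresses in the list above and print the commands for adding them. This is a sketch under assumptions: the interface name eth0 and the /32 and /128 prefix lengths are placeholders that are not taken from this plan, and should be replaced with whatever caffeine-dr actually uses.

```python
import ipaddress

# Addresses copied from the list above.
SERVICE_ADDRESSES = {
    "caffeine": ("129.97.134.17", "2620:101:f000:4901:c5c::caff:e12e"),
    "git": ("129.97.134.49", "2620:101:f000:4901:c5c:3eb::49"),
    "wiki": ("129.97.134.44", "2620:101:f000:4901:c5c:3eb::44"),
    "munin": ("129.97.134.51", "2620:101:f000:4901:c5c::51"),
    "prometheus": ("129.97.134.15", "2620:101:f000:4901:c5c::15"),
}


def ip_add_commands(addresses, interface="eth0", v4_prefix=32, v6_prefix=128):
    """Return one `ip address add` command per address.

    The interface and prefix lengths are assumptions, not values from the plan.
    """
    commands = []
    for name, pair in addresses.items():
        for text in pair:
            # Raises ValueError if an address was mistyped.
            addr = ipaddress.ip_address(text)
            prefix = v4_prefix if addr.version == 4 else v6_prefix
            commands.append(
                f"ip address add {text}/{prefix} dev {interface}  # {name}"
            )
    return commands


for command in ip_add_commands(SERVICE_ADDRESSES):
    print(command)
```

The matching `ip address del` commands would be needed after the outage, before the original hosts come back with the same addresses.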
Mail
Since the outage is for a week, we need to maintain email services during the outage. An initial plan by ztseguin and jxpryde:
- rsync users' .forward, .procmailrc and .maildir to a local directory on mail, allowing mail to continue as expected
However, this requires:
- Users must not reference any scripts, programs, etc. in their procmailrc file that depend on anything in their home directory
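One way to find affected users ahead of time is to scan each copied procmailrc for lines that seem to point into the home directory. The sketch below is a rough heuristic, not part of the plan: the function name, the markers it looks for and the example home path are all assumptions, and every hit still needs a manual look (a reference to the copied .maildir, for example, may be harmless).

```python
def home_references(procmailrc_text, home_dir):
    """Return (line_number, line) pairs that appear to reference the home directory.

    Heuristic only: looks for $HOME, ${HOME}, ~/ and the literal home path,
    and skips comment lines. It may over-report.
    """
    markers = ("$HOME", "${HOME}", "~/", home_dir.rstrip("/"))
    hits = []
    for number, line in enumerate(procmailrc_text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        if any(marker in stripped for marker in markers):
            hits.append((number, stripped))
    return hits


# Hypothetical procmailrc contents and home path, for illustration only.
sample = (
    "MAILDIR=/var/mail/alice\n"
    "# $HOME in a comment is ignored\n"
    ":0\n"
    "| $HOME/bin/filter.sh\n"
)
for number, line in home_references(sample, "/users/alice"):
    print(f"line {number}: {line}")
```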
Authentication
Authentication is located in both MC and PHY.
While the MC node is down, the PHY node can continue to answer authentication requests. However, updating membership and changing passwords will not be possible.
DNS
CSC's DNS service is located in both MC and PHY.
NOTE: The MC node is the master node, so we will need to ensure that the SOA record contains a long enough expiry time so that the PHY node doesn't stop serving zones.
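The expiry check in the note above is simple arithmetic: the SOA expire value, counted from the secondary's last successful refresh, has to exceed the length of the outage. A minimal sketch, assuming the SOA data is available as the usual seven whitespace-separated fields with plain integer seconds; the example records, the ten-day outage length and the one-day margin are hypothetical values, not taken from the real zones.

```python
from datetime import timedelta


def soa_expire_seconds(soa_rdata):
    """Extract expire from 'mname rname serial refresh retry expire minimum'.

    Only plain integer seconds are handled, not shorthand such as '2w'.
    """
    fields = soa_rdata.split()
    if len(fields) != 7:
        raise ValueError(f"expected 7 SOA fields, got {len(fields)}")
    return int(fields[5])


def expire_covers_outage(soa_rdata, outage, margin=timedelta(days=1)):
    """True if the expire time is at least the outage length plus a margin."""
    expire = timedelta(seconds=soa_expire_seconds(soa_rdata))
    return expire >= outage + margin


# Hypothetical records: expire of 7 days and of 14 days.
one_week = "ns1.example.org. hostmaster.example.org. 2018081801 3600 900 604800 3600"
two_weeks = "ns1.example.org. hostmaster.example.org. 2018081801 3600 900 1209600 3600"

# Assumed outage length covering Aug. 21 through the end of Aug. 30.
outage = timedelta(days=10)
print(expire_covers_outage(one_week, outage))
print(expire_covers_outage(two_weeks, outage))
```

The margin is there because the secondary's timer starts at its last successful refresh, which may be some time before the master actually goes offline.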