Mirror: Difference between revisions

From CSCWiki
(Update information about storage)
 
(23 intermediate revisions by 4 users not shown)
Line 17: Line 17:
==== Storage ====
==== Storage ====


All of our projects are stored on one of two zfs zpools. There are 8 drives per array, configured as raidz2, and there is an additional drive that can be swapped in (in the event of a disk failure).
All of our projects are stored on an 8x18TB disk raidz2 array (cscmirror0). There is an additional drive acting as a hot-spare.


* <code>/mirror/root/.cscmirror1</code>
* <code>/mirror/root/.cscmirror0</code>
* <code>/mirror/root/.cscmirror2</code>


Each project is given a filesystem under one of the two pools. Symlinks are created <code>/mirror/root</code> to point to the correct pool and file system.
Each project is given a filesystem in the pool. Symlinks are created in <code>/mirror/root</code> to point to the correct pool and filesystem.


==== Merlin ====
==== Merlin ====
Project synchronization is done by "merlin" which is a Go rewrite of the Python script "merlin" originally written by a2brenna.


The synchronization process is run by a Python script called &quot;merlin&quot;, written by a2brenna. The script is stored in <code>~mirror/merlin</code>.
The program is stored in <code>~mirror/merlin</code> and is managed by the systemd unit <code>merlin-go.service</code>.


The config file <code>merlin-config.ini</code> contains the list of repositories along with their configurations.
The list of repositories and their configuration (synch frequency, location, etc.) is configured in <code>merlin.py</code>.


To view the sync status, execute <code>~mirror/merlin/arthur.py status</code>. To force the sync of a project, execute <code>~mirror/merlin/arthur.py sync:PROJECT_NAME</code>.
To view the sync status, execute <code>~mirror/merlin/cmd/arthur/arthur status</code>. To force the sync of a project, execute <code>~mirror/merlin/cmd/arthur/arthur sync:PROJECT_NAME</code>.

'''Remark''': For syncing Debian repositories we were [https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1020998 requested] to use ftpsync which has configs in <code>~mirror/ftpsync</code>.

===== Push Sync =====

Some projects support push syncing via SSH.

We are running a special SSHD instance on mirror.csclub.uwaterloo.ca:22. This instance has been locked down, with the following settings:

* Only SSH key authentication
* Only users of the <code>push</code> group (except <code>mirror</code>) are allowed to connect
* X11 Forwarding, TCP Forwarding, Agent Forwarding, User RC and TTY are disabled
* Users are chrooted to <code>/mirror/merlin</code>

Most projects will connect using the <code>push</code> user. The SSH authorized keys file is located at <code>/home/push/.ssh/authorized_keys</code>. An example entry is:

<pre>
restrict,no-port-forwarding,no-X11-forwarding,no-agent-forwarding,no-pty,command="arthur sync:ubuntu >/dev/null 2>/dev/null </dev/null &",from="XXX.XXX.XXX.XXX" ssh-rsa ...
</pre>


==== Sync Scripts ====
==== Sync Scripts ====
Line 55: Line 74:
An index of the archives we mirror is available at [http://mirror.csclub.uwaterloo.ca mirror.csclub.uwaterloo.ca].
An index of the archives we mirror is available at [http://mirror.csclub.uwaterloo.ca mirror.csclub.uwaterloo.ca].


As of Winter 2010, it is now generated by a Python script in <code>~mirror/mirror-index</code>.
As of Spring 2023, it is now generated by Hugo.


<code>~mirror/mirror-index/make-index</code> is scheduled in <code>/etc/cron.d/csc-mirror</code> to be run at 5:40am on the 14th and 28th of each month. The script can be run manually when needed (for example, when the archive list is updated) by running:
<code>~mirror/mirror-index/deploy.sh</code> is scheduled in <code>/etc/cron.d/csc-mirror</code> to be run every minute.


The script first runs <code>synctask2project</code>, which pulls project synchronization status from Merlin (using merlin's socket), combines sub-projects (for example, <code>racket</code> is a combination of two merlin tasks, <code>plt-bundles</code> and <code>racket-installers</code>) and reads the size of each project using <code>zfs list -Hp</code>. This Python script then writes a JSON file to <code>data/sync.json</code>. Hugo then reads the JSON file and generates the HTML table from it. The table is also generated separately into <code>public/project_table/index.html</code>, which htmx (the JS library used on the index page) reads to live-reload the sync status. Finally, the Hugo output is copied to the mirror root to be served by nginx.
<code>sudo -u mirror /home/mirror/mirror-index/make-index.py</code>


Project information is located at <code>synctask2project/config.toml</code> ('''NOT''' the config.toml in the root folder! That's the config for Hugo). Its format is as follows:
This causes an instance of <code>du</code> which computes the size of each directory. This list is then sorted alphabetically by directory name and returned to the Python script. If any errors occur during this process, the script conservatively chooses to exit rather than risk generating an index file that is incorrect.
<pre class="toml">
merlin_sock = "/path/to/merlin/socket"
zfs_pools = ["mirror_zfs_pool1", "mirror_zfs_pool2"]


[project_name]
<code>make-index.py</code> is configured by means of a [https://yaml.org YAML] file, <code>config.yaml</code>, in the same directory. Its format is as follows:
# This is supposed to be the short version shown on the website
# Mandatory field
site = "project.site"
# The full URL
# Mandatory field
url = "https://full.project.site"
# We are the upstream or archived project. Don't show sync error or last sync time
# Optional. Default: no
upstream = yes
# If this project contains multiple merlin sync tasks, list them here
# Optional. Default: project_name
merlin-tasks = ["task1", "task2"]


<pre class="yaml">docroot: /mirror/root
duflags: --human-readable --max-depth=1
output: /mirror/root/index.html


# define more projects below...
exclude:
</pre>
- include
- lost+found
- pub
# (...)


The mirror-index also supports news. When adding new projects or making modifications, create a markdown file in <code>mirror-index/content/news/</code> to tell the user what was changed. It should be picked up by Hugo automatically on next generation.
directories:
apache:
site: apache.org
url: http://www.apache.org/


On first setup, run <code>setup.sh</code>. When doing development (such as changing the Sass or static files), run <code>build.sh</code> to build assets.
archlinux:
site: archlinux.org
url: http://www.archlinux.org/


=== FTP ===
# (...)</pre>
The docroot is the directory which is to be scanned; this will probably always be the mirror root from which Apache serves. <code>duflags</code> specifies the flags to be passed to <code>du</code>. This is here so that it's easy to find and alter. For instance, we could change <code>--human-readable</code> to <code>--si</code> if we ever decided that, like hard disk manufacturers, we want sizes to appear larger than they are. <code>output</code> defines the file to which the generated index will be written.


<b>UPDATE</b>: We now use vsftpd instead. See /etc/vsftpd.conf for details. Official documentation can be found [https://manpages.debian.org/stable/vsftpd/vsftpd.conf.5.en.html here].
<code>exclude</code> specifies the list of directories which will not be included in the generated index page (since, by default, all folders are included in the generated index page).

Finally, <code>directories</code> specifies the list of directories to be listed. The format is fairly straightforward: simply name the directory and provide a site (the display name in the &quot;Project Site&quot; column) and URL. One caveat here is that YAML does not allow tabs for whitespace. Indent with two spaces to remain consistent with the existing file format, please. Also note that the directory name is case-sensitive, as is always the case on Unix.

Finally, the HTML index file is generated from <code>index.mako</code>, a Mako template (which is mostly HTML anyhow). If you really can't figure out how it works, look up the Mako documentation.

=== FTP ===


We use [http://www.proftpd.org/ proftpd] (standalone daemon) as our FTP server.
We use [http://www.proftpd.org/ proftpd] (standalone daemon) as our FTP server.
Line 120: Line 136:
refuse options = c delete</pre>
refuse options = c delete</pre>
The contents of <code>/mirror/root/include/motd.msg</code> are displayed when a user connects.
The contents of <code>/mirror/root/include/motd.msg</code> are displayed when a user connects.

== Mirror Administration ==

=== Making changes ===
Everything in <code>~mirror</code> is managed by git (a monorepo containing all sub-projects, such as Merlin and mirror-index). To make changes, switch to the mirror user and commit with <code>--author "FirstName LastName <email@csc>"</code> to show who made the change. Then run <code>git push</code> to push the changes. The remote uses the HTTPS URL, so just enter your CSC credentials.

=== Adding a new project ===

# Find the instructions for mirroring the project. Ideally, try to sync directly from the project’s source repository.
#* Note that some projects provide sync scripts, however we generally won’t use them. We will instead use our custom ones.
# Create a zfs filesystem to store the project in:
#*<code>zfs create cscmirror0/$PROJECT_NAME</code>
# Change the folder ownership
#*<code>chown mirror:mirror /mirror/root/.cscmirror0/$PROJECT_NAME</code>
# Create the symlink in <code>/mirror/root</code>
#*<code>ln -s .cscmirror0/$PROJECT_NAME $PROJECT_NAME</code> ('''NOTE''': The symlink must be relative to the <code>/mirror/root</code> directory. If it isn’t, the symlinks will not work when chrooted)
# Repeat the above steps on mirror-phys. <code>sudo ssh mirror-dc</code> on potassium-benzoate ['''NOTE: This machine is currently unavailable]'''
# Configure the project in merlin (<code>~mirror/merlin/merlin-config.ini</code>)
#* Select the appropriate sync script (typically <code>csc-sync-standard</code>) and supply the appropriate parameters
# Restart merlin: <code>systemctl restart merlin-go</code>
#* This will kick off the initial sync
#* Check <code>~mirror/merlin/log/$PROJECT_NAME</code> for errors, <code>~mirror/merlin/log-$PROTOCOL/$PROJECT_NAME-*.log</code> for transfer progress
# Configure the project in zfssync.yml (<code>~mirror/merlin/zfssync.yml</code>) ['''NOTE: The backup machine is currently unavailable, so this step is not currently needed]'''
# Update the mirror index configuration (<code>~mirror/mirror-index-ng/synctask2project/config.toml</code>)
# Add the project to rsync (<code>/etc/rsyncd.conf</code>)
#* Restart rsync with <code>systemctl restart rsync</code>

If push mirroring is available/required, see [[#Push_Sync|Push Sync]].

=== Rename project ===

# Change project name (title) and local_dir in <code>merlin-config.ini</code>
# Change zfs dataset name
#* <code>zfs rename cscmirror0/OLD_NAME cscmirror0/NEW_NAME</code>
# Reload merlin config
#* <code>systemctl reload merlin-go.service</code>
# Remove old symlink and create new symlink in mirror root
#* <code>rm OLD_DIR</code>
#* <code>ln -s .cscmirror0/NEW_DIR NEW_DIR</code>
# Add a symlink for the old name (in <code>/mirror/root</code>) so that existing users won't be broken by the change
#* <code>ln -s NEW_DIR OLD_DIR</code>
# Update the rsync daemon
#* Edit <code>/etc/rsyncd.conf</code>, adding a new entry for the new name (keep the old name too). Restart with <code>systemctl restart rsync</code>
# Modify index page generator config
#* At <code>~mirror/mirror-index-ng/synctask2project/config.toml</code>
# Update any mirror registrations with the project to ensure the new URLs are used

=== Secondary Mirror ===

The School of Computer Science's CSCF has provided us with a secondary mirror machine located in DC. This will limit the downtime of mirror.csclub in the event of an outage affecting the MC machine room.

As of June 2023, the CSCF mirror is down. CSCF is planning to bring it back with new hardware, but there is no ETA.

==== Keepalived ====

Mirror's IP addresses (129.97.134.71 and 2620:101:f000:4901:c5c::f:1055) have been configured as VRRP addresses on both machines. Keepalived monitors the nodes and selects the active one.

Potassium-benzoate has the higher priority and will typically be the active node. A node's priority is reduced when nginx, proftpd or rsync is not running. Potassium-benzoate starts with a priority of 100 and mirror-dc with a priority of 90 (the higher priority wins).

When nginx is unavailable (checked with curl), the priority is reduced by 20. When proftpd is unavailable (checked with curl), the priority is reduced by 5. When rsync is unavailable (checked with rsync), the priority is reduced by 15.

The Systems Committee should receive an email when the nodes swap position.

==== Project synchronization ====

Only potassium-benzoate is configured with merlin. mirror-dc has the software components, but they are probably not up to date or configured to run correctly.

When a project sync is complete, merlin will kick off a custom script to sync the zfs dataset to the other node. These scripts live in /usr/local/bin and in ~mirror/merlin.

Latest revision as of 18:39, 1 July 2023

The Computer Science Club runs a public mirror (mirror.csclub.uwaterloo.ca) on potassium-benzoate.

We are listed on the ResNet "don't count" list, so downloading from our mirror will not count against one's ResNet quota.

Software Mirrored

A list of current archives (and their respective disk usage) is listed on our mirror's homepage at mirror.csclub.uwaterloo.ca.

Mirroring Requests

Requests to mirror a particular distribution or archive should be made to syscom@csclub.uwaterloo.ca.

Implementation Details

Syncing

Storage

All of our projects are stored on an 8x18TB disk raidz2 array (cscmirror0). There is an additional drive acting as a hot-spare.

  • /mirror/root/.cscmirror0

Each project is given a filesystem in the pool. Symlinks are created in /mirror/root to point to the correct pool and filesystem.

Merlin

Project synchronization is done by "merlin" which is a Go rewrite of the Python script "merlin" originally written by a2brenna.

The program is stored in ~mirror/merlin and is managed by the systemd unit merlin-go.service.

The config file merlin-config.ini contains the list of repositories along with their configurations.

To view the sync status, execute ~mirror/merlin/cmd/arthur/arthur status. To force the sync of a project, execute ~mirror/merlin/cmd/arthur/arthur sync:PROJECT_NAME.

Remark: For syncing Debian repositories we were requested to use ftpsync which has configs in ~mirror/ftpsync.

Push Sync

Some projects support push syncing via SSH.

We are running a special SSHD instance on mirror.csclub.uwaterloo.ca:22. This instance has been locked down, with the following settings:

  • Only SSH key authentication
  • Only users of the push group (except mirror) are allowed to connect
  • X11 Forwarding, TCP Forwarding, Agent Forwarding, User RC and TTY are disabled
  • Users are chrooted to /mirror/merlin

Most projects will connect using the push user. The SSH authorized keys file is located at /home/push/.ssh/authorized_keys. An example entry is:

restrict,no-port-forwarding,no-X11-forwarding,no-agent-forwarding,no-pty,command="arthur sync:ubuntu >/dev/null 2>/dev/null </dev/null &",from="XXX.XXX.XXX.XXX" ssh-rsa ...
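
As a rough illustration of how such an entry is put together, the sketch below assembles one authorized_keys line from a project name, a source address and a public key. The function name push_key_entry, the project-name check, the address 192.0.2.10 and the key text are assumptions made for the example; only the option list and the forced command come from the entry above.

```python
import re


def push_key_entry(project: str, source_ip: str, public_key: str) -> str:
    """Build one authorized_keys line that triggers a sync for `project`."""
    # Hypothetical safety check: keep shell metacharacters out of the forced command.
    if not re.fullmatch(r"[A-Za-z0-9._-]+", project):
        raise ValueError(f"unexpected project name: {project!r}")
    # Option list copied from the example entry above.
    restrictions = [
        "restrict",
        "no-port-forwarding",
        "no-X11-forwarding",
        "no-agent-forwarding",
        "no-pty",
    ]
    # The forced command backgrounds arthur and detaches its standard streams.
    command = f"arthur sync:{project} >/dev/null 2>/dev/null </dev/null &"
    options = restrictions + [f'command="{command}"', f'from="{source_ip}"']
    return ",".join(options) + " " + public_key


# Placeholder address and key, for illustration only.
entry = push_key_entry("ubuntu", "192.0.2.10", "ssh-rsa AAAA... upstream@example")
print(entry)
```

Because the command is forced by the key options, the connecting side cannot choose what runs; it can only trigger the sync for the one project named in its entry.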

Sync Scripts

Our collection of synchronization scripts is located in ~mirror/bin. They currently include:

  • csc-sync-apache
  • csc-sync-debian
  • csc-sync-debian-cd
  • csc-sync-gentoo
  • csc-sync-ssh
  • csc-sync-standard

Most of these scripts take the following parameters:

local_dir rsync_host rsync_dir

HTTP(s)

We use nginx as our webserver.

Index

An index of the archives we mirror is available at mirror.csclub.uwaterloo.ca.

As of Spring 2023, it is now generated by Hugo.

~mirror/mirror-index/deploy.sh is scheduled in /etc/cron.d/csc-mirror to be run every minute.

The script first runs synctask2project, which pulls project synchronization status from Merlin (using merlin's socket), combines sub-projects (for example, racket is a combination of two merlin tasks, plt-bundles and racket-installers) and reads the size of each project using zfs list -Hp. This Python script then writes a JSON file to data/sync.json. Hugo then reads the JSON file and generates the HTML table from it. The table is also generated separately into public/project_table/index.html, which htmx (the JS library used on the index page) reads to live-reload the sync status. Finally, the Hugo output is copied to the mirror root to be served by nginx.
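
The combining step can be pictured with a small sketch. This is not the real synctask2project code: the record fields (last_sync, ok), the merge policy (oldest sync time, healthy only if every task is healthy) and the zfs column selection (-o name,used) are all assumptions chosen for illustration, and the sample data is made up.

```python
import json


def parse_zfs_sizes(zfs_output: str) -> dict:
    """Map dataset basename -> used bytes from `zfs list -Hp -o name,used` text."""
    sizes = {}
    for line in zfs_output.splitlines():
        if not line.strip():
            continue
        # -H gives tab-separated fields without a header; -p gives exact byte counts.
        name, used = line.split("\t")[:2]
        sizes[name.rsplit("/", 1)[-1]] = int(used)
    return sizes


def combine_tasks(task_names, task_status):
    """Merge several merlin tasks into one project record (assumed policy)."""
    statuses = [task_status[task] for task in task_names]
    return {
        "last_sync": min(s["last_sync"] for s in statuses),
        "ok": all(s["ok"] for s in statuses),
    }


# Hypothetical sample inputs standing in for zfs output and merlin's status.
zfs_text = "cscmirror0\t1000\ncscmirror0/racket\t123456\n"
status = {
    "plt-bundles": {"last_sync": 1688200000, "ok": True},
    "racket-installers": {"last_sync": 1688100000, "ok": True},
}
projects = {"racket": ["plt-bundles", "racket-installers"]}

sizes = parse_zfs_sizes(zfs_text)
report = {}
for project, tasks in projects.items():
    record = combine_tasks(tasks, status)
    record["size"] = sizes.get(project, 0)
    report[project] = record
print(json.dumps(report, indent=2))
```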

Project information is located at synctask2project/config.toml (NOT the config.toml in the root folder! That's the config for Hugo). Its format is as follows:

merlin_sock = "/path/to/merlin/socket"
zfs_pools = ["mirror_zfs_pool1", "mirror_zfs_pool2"]

[project_name]
# This is supposed to be the short version shown on the website
# Mandatory field
site = "project.site"
# The full URL
# Mandatory field
url = "https://full.project.site"
# We are the upstream or archived project. Don't show sync error or last sync time
# Optional. Default: no
upstream = yes 
# If this project contains multiple merlin sync tasks, list them here
# Optional. Default: project_name
merlin-tasks = ["task1", "task2"]


# define more projects below...
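
To make the mandatory and optional fields concrete, here is a minimal sketch of how the documented defaults could be applied to one project entry. It works on an already-parsed mapping, so reading the file itself is out of scope; the function name normalize_project and the sample entry are hypothetical.

```python
def normalize_project(name: str, entry: dict) -> dict:
    """Apply the documented defaults to one project entry."""
    # site and url are the two mandatory fields.
    for field in ("site", "url"):
        if field not in entry:
            raise ValueError(f"{name}: missing mandatory field {field!r}")
    return {
        "site": entry["site"],
        "url": entry["url"],
        # upstream defaults to "no".
        "upstream": bool(entry.get("upstream", False)),
        # merlin-tasks defaults to a single task named after the project.
        "merlin-tasks": list(entry.get("merlin-tasks", [name])),
    }


# Hypothetical entry with only the mandatory fields set.
example = normalize_project(
    "debian", {"site": "debian.org", "url": "https://www.debian.org"}
)
print(example)
```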

The mirror-index also supports news. When adding new projects or making modifications, create a markdown file in mirror-index/content/news/ to tell the user what was changed. It should be picked up by Hugo automatically on next generation.

On first setup, run setup.sh. When doing development (such as changing the Sass or static files), run build.sh to build assets.

FTP

UPDATE: We now use vsftpd instead. See /etc/vsftpd.conf for details. Official documentation can be found here.

We use proftpd (standalone daemon) as our FTP server.

To increase performance, we disable DNS lookups in proftpd.conf:

UseReverseDNS           off
IdentLookups            off

We also limit the amount of CPU/memory resources used (e.g. to minimize Globbing resources):

RLimitCPU               session 10
RLimitMemory            session 4096K

We allow a maximum of 500 concurrent FTP sessions:

MaxInstances            500
MaxClients              500

The contents of /mirror/root/include/motd.msg are displayed when a user connects.

rsync

We use rsyncd (standalone daemon).

We disable compression and checksumming in rsyncd.conf:

dont compress = *
refuse options = c delete

The contents of /mirror/root/include/motd.msg are displayed when a user connects.

Mirror Administration

Making changes

Everything in ~mirror is managed by git (a monorepo containing all sub-projects, such as Merlin and mirror-index). To make changes, switch to the mirror user and commit with --author "FirstName LastName <email@csc>" to show who made the change. Then run git push to push the changes. The remote uses the HTTPS URL, so just enter your CSC credentials.

Adding a new project

  1. Find the instructions for mirroring the project. Ideally, try to sync directly from the project’s source repository.
    • Note that some projects provide sync scripts, however we generally won’t use them. We will instead use our custom ones.
  2. Create a zfs filesystem to store the project in:
    • zfs create cscmirror0/$PROJECT_NAME
  3. Change the folder ownership
    • chown mirror:mirror /mirror/root/.cscmirror0/$PROJECT_NAME
  4. Create the symlink in /mirror/root
    • ln -s .cscmirror0/$PROJECT_NAME $PROJECT_NAME (NOTE: The symlink must be relative to the /mirror/root directory. If it isn’t, the symlinks will not work when chrooted)
  5. Repeat the above steps on mirror-phys. sudo ssh mirror-dc on potassium-benzoate [NOTE: This machine is currently unavailable]
  6. Configure the project in merlin (~mirror/merlin/merlin-config.ini)
    • Select the appropriate sync script (typically csc-sync-standard) and supply the appropriate parameters
  7. Restart merlin: systemctl restart merlin-go
    • This will kick off the initial sync
    • Check ~mirror/merlin/log/$PROJECT_NAME for errors, ~mirror/merlin/log-$PROTOCOL/$PROJECT_NAME-*.log for transfer progress
  8. Configure the project in zfssync.yml (~mirror/merlin/zfssync.yml) [NOTE: The backup machine is currently unavailable, so this step is not currently needed]
  9. Update the mirror index configuration (~mirror/mirror-index-ng/synctask2project/config.toml)
  10. Add the project to rsync (/etc/rsyncd.conf)
    • Restart rsync with systemctl restart rsync
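
The shell commands in the steps above can be collected into a dry-run helper that only prints them for review. This is a sketch, not an existing tool: the function name new_project_commands, the name check and the project name "example" are assumptions, and the configuration edits (merlin-config.ini, config.toml, rsyncd.conf) still have to be made by hand before the two restarts.

```python
import re


def new_project_commands(project: str, pool: str = "cscmirror0") -> list:
    """Return the shell commands from the checklist above, in order."""
    # Hypothetical guard so the name is safe to paste into a shell.
    if not re.fullmatch(r"[A-Za-z0-9._-]+", project):
        raise ValueError(f"unexpected project name: {project!r}")
    return [
        f"zfs create {pool}/{project}",
        f"chown mirror:mirror /mirror/root/.{pool}/{project}",
        # The link target is relative so it still resolves inside a chroot.
        f"cd /mirror/root && ln -s .{pool}/{project} {project}",
        # Run only after editing merlin-config.ini.
        "systemctl restart merlin-go",
        # Run only after editing /etc/rsyncd.conf.
        "systemctl restart rsync",
    ]


commands = new_project_commands("example")
for command in commands:
    print(command)
```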

If push mirroring is available/required, see Push Sync.

Rename project

  1. Change project name (title) and local_dir in merlin-config.ini
  2. Change zfs dataset name
    • zfs rename cscmirror0/OLD_NAME cscmirror0/NEW_NAME
  3. Reload merlin config
    • systemctl reload merlin-go.service
  4. Remove old symlink and create new symlink in mirror root
    • rm OLD_DIR
    • ln -s .cscmirror0/NEW_DIR NEW_DIR
  5. Add a symlink for the old name (in /mirror/root) so that existing users won't be broken by the change
    • ln -s NEW_DIR OLD_DIR
  6. Update the rsync daemon
    • Edit /etc/rsyncd.conf, adding a new entry for the new name (keep the old name too). Restart with systemctl restart rsync
  7. Modify index page generator config
    • At ~mirror/mirror-index-ng/synctask2project/config.toml
  8. Update any mirror registrations with the project to ensure the new URLs are used
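
The symlink layout left behind by a rename can be tried out safely in a temporary directory. The sketch below uses a throwaway directory as a stand-in for /mirror/root and checks that the old name still reaches the renamed data through two relative links; the README file and its contents are made up for the demonstration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Stand-in for /mirror/root; this directory plays the renamed filesystem.
    os.makedirs(os.path.join(root, ".cscmirror0", "NEW_DIR"))
    with open(os.path.join(root, ".cscmirror0", "NEW_DIR", "README"), "w") as handle:
        handle.write("hello\n")

    # Both targets are relative, so they resolve from the directory holding the link.
    os.symlink(os.path.join(".cscmirror0", "NEW_DIR"), os.path.join(root, "NEW_DIR"))
    os.symlink("NEW_DIR", os.path.join(root, "OLD_DIR"))

    # Reading through the old name follows OLD_DIR -> NEW_DIR -> .cscmirror0/NEW_DIR.
    with open(os.path.join(root, "OLD_DIR", "README")) as handle:
        old_view = handle.read()
    link_target = os.readlink(os.path.join(root, "OLD_DIR"))
    resolved = os.path.realpath(os.path.join(root, "OLD_DIR"))
    expected = os.path.realpath(os.path.join(root, ".cscmirror0", "NEW_DIR"))

print(link_target, old_view.strip(), resolved == expected)
```

Since neither link stores an absolute path, the same chain keeps working when the tree is viewed from inside a chroot.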

Secondary Mirror

The School of Computer Science's CSCF has provided us with a secondary mirror machine located in DC. This will limit the downtime of mirror.csclub in the event of an outage affecting the MC machine room.

As of June 2023, the CSCF mirror is down. CSCF is planning to bring it back with new hardware, but there is no ETA.

Keepalived

Mirror's IP addresses (129.97.134.71 and 2620:101:f000:4901:c5c::f:1055) have been configured as VRRP addresses on both machines. Keepalived monitors the nodes and selects the active one.

Potassium-benzoate has the higher priority and will typically be the active node. A node's priority is reduced when nginx, proftpd or rsync is not running. Potassium-benzoate starts with a priority of 100 and mirror-dc with a priority of 90 (the higher priority wins).

When nginx is unavailable (checked with curl), the priority is reduced by 20. When proftpd is unavailable (checked with curl), the priority is reduced by 5. When rsync is unavailable (checked with rsync), the priority is reduced by 15.
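
The arithmetic behind a failover can be worked through with a short sketch. It models only the numbers given above, not keepalived itself; the function names are hypothetical and tie-breaking between equal priorities is not modelled.

```python
# Starting priorities and per-service penalties as described above.
BASE_PRIORITY = {"potassium-benzoate": 100, "mirror-dc": 90}
PENALTY = {"nginx": 20, "proftpd": 5, "rsync": 15}


def effective_priority(node: str, down=()) -> int:
    """Starting priority minus the penalty for each unavailable service."""
    return BASE_PRIORITY[node] - sum(PENALTY[service] for service in down)


def active_node(down_by_node: dict) -> str:
    """Pick the node with the highest effective priority."""
    return max(
        BASE_PRIORITY,
        key=lambda node: effective_priority(node, down_by_node.get(node, ())),
    )


# Everything healthy: 100 vs 90.
print(active_node({}))
# proftpd down on the primary: 95 vs 90, so no failover.
print(active_node({"potassium-benzoate": ["proftpd"]}))
# rsync down on the primary: 85 vs 90, so the secondary takes over.
print(active_node({"potassium-benzoate": ["rsync"]}))
```

With these numbers, a proftpd outage alone never moves the addresses, while losing nginx or rsync on potassium-benzoate is enough to hand them to a healthy mirror-dc.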

The Systems Committee should receive an email when the nodes swap position.

Project synchronization

Only potassium-benzoate is configured with merlin. mirror-dc has the software components, but they are probably not up to date or configured to run correctly.

When a project sync is complete, merlin will kick off a custom script to sync the zfs dataset to the other node. These scripts live in /usr/local/bin and in ~mirror/merlin.