Mirror: Difference between revisions

No edit summary
(Updated mirror information to reflect current state)
Line 1: Line 1:
We currently run a public mirror ([http://mirror.csclub.uwaterloo.ca/ mirror.csclub.uwaterloo.ca]) on [[Machine_List#potassium-benzoate|potassium-benzoate]]. We are listed on the ResNet [http://noc.uwaterloo.ca/cn/Stats/resReport "don't count" list] so downloading from our mirror will not count against one's ResNet quota. Requests to mirror a particular distribution or archive should be made to syscom@csclub.uwaterloo.ca. We also have [http://munin.csclub.uwaterloo.ca/mirror/potassium-benzoate.csclub/index.html resource graphs] you can look at.
The [https://csclub.uwaterloo.ca Computer Science Club] runs a public mirror ([http://mirror.csclub.uwaterloo.ca mirror.csclub.uwaterloo.ca]) on [[Machine_List#potassium-benzoate|potassium-benzoate]].


''We are listed on the ResNet "don't count" list, so downloading from our mirror will not count against one's ResNet quota.''
== Archives Mirrored ==
For a list of what is currently mirrored and their respective disk usage see http://mirror.csclub.uwaterloo.ca/


== Proposed Archives to Mirror ==
== Software Mirrored ==


The archives we currently mirror (and their respective disk usage) are listed on our mirror's homepage at [http://mirror.csclub.uwaterloo.ca mirror.csclub.uwaterloo.ca].
* Mandriva

* PCLinuxOS
=== Mirroring Requests ===
* OpenSSL

* RubyForge
Requests to mirror a particular distribution or archive should be made to [mailto:syscom@csclub.uwaterloo.ca syscom@csclub.uwaterloo.ca].


== Implementation Details ==
== Implementation Details ==


=== Syncing ===
The mirroring is done by one of three scripts. The latter two are based on [http://www.debian.org/mirror/anonftpsync anonftpsync]. [[#merlin|merlin]] is used to call one of these scripts. Most of the scripts and such used to maintain the mirror are available in the public [http://git.csclub.uwaterloo.ca/?p=public/mirror.git mirror.git] repository.


=== ftpsync ===
==== Storage ====


All of our projects are stored on one of two software RAID 6 arrays.
[http://ftp-master.debian.org/ftpsync.tar.gz ftpsync] is the official Debian mirror synchronization tool, and is used to rsync the Debian repository. It's located in ~mirror/debian. Its invocation takes a few steps (this is more or less how [[#merlin|merlin]] invokes it):


* <code>/mirror/root/.mirror-importantfs</code>
export BASEDIR=/home/mirror/debian
** 8 2TB drives plus 2 additional hot spare drives
cd $BASEDIR
* <code>/mirror/root/.mirror-largefs</code>
./bin/ftpsync sync:stage1
** 7 4TB drives plus 2 additional hot spare drives
./bin/ftpsync sync:stage2


Each project listed under <code>/mirror/root</code> is a symlink into one of the two arrays.
=== csc-sync-debian ===


==== Merlin ====
This is used to sync debian-style repositories. Its usage is:
csc-sync-debian local_dir rsync_host rsync_dir [trace_host [trace_dir]]


The synchronization process is run by a Python script called &quot;merlin&quot;, written by a2brenna. The script is stored in <code>~mirror/merlin</code>.
If trace_host is specified, then $rsync_dir/project/trace/$trace_host is checked to see if it has changed. If it has, a normal debian-style (two-pass) rsync is done.


The repositories and their settings (sync frequency, location, etc.) are configured in <code>merlin.py</code>.
=== csc-sync-standard ===


To view the sync status, execute <code>~mirror/merlin/arthur.py status</code>. To force the sync of a project, execute <code>~mirror/merlin/arthur.py sync:PROJECT_NAME</code>.
This is used to sync a tree in a general way. Like anonftpsync, it supports locking and logging. Its usage is:


==== Sync Scripts ====
csc-sync-standard local_dir rsync_host rsync_dir


Our synchronization scripts are located in <code>~mirror/bin</code>. They currently include:
=== merlin ===


* <code>csc-sync-apache</code>
The synchronization process is run by a Python script called "merlin", written by a2brenna, stored in ~mirror/merlin. The repository list, sync time, etc. is maintained in merlin.py.
* <code>csc-sync-debian</code>
* <code>csc-sync-debian-cd</code>
* <code>csc-sync-gentoo</code>
* <code>csc-sync-ssh</code>
* <code>csc-sync-standard</code>


Most of these scripts take the following parameters:
=== HTTP ===


<code>local_dir rsync_host rsync_dir</code>
We use [[Apache]] as our web server. Here's a snippet of the worker configuration:


=== HTTP(s) ===
<IfModule mpm_worker_module>
ServerLimit 64
ThreadLimit 64
StartServers 2
MaxClients 4096
MinSpareThreads 16
MaxSpareThreads 48
ThreadsPerChild 64
MaxRequestsPerChild 0
</IfModule>


We use [https://nginx.org nginx] as our webserver.
We use the bwbar application to display current bandwidth in the footer of mirror pages.


==== Index ====
==== Index ====


An index of the archives we mirror is available at http://mirror.csclub.uwaterloo.ca/.
An index of the archives we mirror is available at [http://mirror.csclub.uwaterloo.ca mirror.csclub.uwaterloo.ca].
As of Winter 2010, it is now generated by a Python script in <tt>~mirror/mirror-index</tt>.


Since Winter 2010, it has been generated by a Python script in <code>~mirror/mirror-index</code>.
<tt>~mirror/mirror-index/make-index.py</tt> is scheduled in <tt>mirror</tt>'s crontab to be
run at 5:40 AM on the 14th and 28th of each month. The script can be run manually when needed
(for example, when an archive is removed) as follows:


<code>~mirror/mirror-index/make-index.py</code> is scheduled in <code>/etc/cron.d/csc-mirror</code> to run at 5:40 AM on the 14th and 28th of each month. The script can be run manually when needed (for example, when the archive list is updated) by running:
sudo -u mirror /home/mirror/mirror-index/make-index.py


<code>sudo -u mirror /home/mirror/mirror-index/make-index.py</code>
This causes an instance of <tt>du</tt> to be run which computes the size of each directory. This
list is then sorted alphabetically by directory name and returned to the Python script.
If any errors occur during this process, the script conservatively chooses to exit rather
than risk generating an index file that is incorrect.


This runs an instance of <code>du</code>, which computes the size of each directory. The resulting list is then sorted alphabetically by directory name and returned to the Python script. If any errors occur during this process, the script conservatively chooses to exit rather than risk generating an index file that is incorrect.
<tt>make-index.py</tt> is configured by means of a [http://www.yaml.org/ YAML] file,
<tt>config.yaml</tt>, in the same directory. Its format is as follows:


<code>make-index.py</code> is configured by means of a [https://yaml.org YAML] file, <code>config.yaml</code>, in the same directory. Its format is as follows:
docroot: /mirror/root
duflags: --human-readable --max-depth=1
output: /mirror/root/index.html
directories:
apache:
site: apache.org
url: <nowiki>http://www.apache.org/</nowiki>
archlinux:
site: archlinux.org
url: <nowiki>http://www.archlinux.org/</nowiki>
# (...)


<pre class="yaml">docroot: /mirror/root
The <tt>docroot</tt> is the directory which is to be scanned; this will probably
duflags: --human-readable --max-depth=1
always be the mirror root from which Apache serves. <tt>duflags</tt> specifies
output: /mirror/root/index.html
the flags to be passed to <tt>du</tt>. This is here so that it's easy to find
and alter. For instance, we could change <tt>--human-readable</tt> to <tt>--si</tt>
if we ever decided that, like hard disk manufacturers, we want sizes to appear larger
than they are. <tt>output</tt> defines the file to which the generated index will be
written.


exclude:
Finally, <tt>directories</tt> specifies the list of directories to be listed.
- include
No directories not listed here will be shown. If you add a new archive and it doesn't
- lost+found
appear, that's why. The format is fairly straightforward: simply name the directory
- pub
and provide a site (the display name in the "Project Site" column) and URL.
# (...)


directories:
One caveat here is that YAML does not allow tabs for whitespace. Indent with
apache:
two spaces to remain consistent with the existing file format, please. Also note
site: apache.org
that the directory name is case-sensitive, as is always the case on Unix.
url: http://www.apache.org/


archlinux:
Finally, the HTML index file is generated from <tt>index.mako</tt>, a
site: archlinux.org
[http://www.makotemplates.org/ Mako] template (which is mostly HTML anyhow).
url: http://www.archlinux.org/
If you really can't figure out how it works, look up the Mako documentation.


# (...)</pre>
=== FTP ===
<code>docroot</code> is the directory to be scanned; this will probably always be the mirror root served by the web server. <code>duflags</code> specifies the flags to be passed to <code>du</code>. It lives in the configuration so that it is easy to find and alter. For instance, we could change <code>--human-readable</code> to <code>--si</code> if we ever decided that, like hard disk manufacturers, we want sizes to appear larger than they are. <code>output</code> defines the file to which the generated index will be written.


<code>exclude</code> lists the directories to omit from the generated index page; by default, all directories are included.
We use proftpd (standalone daemon) as our ftp server. To increase performance we disable DNS lookups in proftpd.conf:


<code>directories</code> specifies the list of directories to be listed. The format is straightforward: name the directory and provide a site (the display name in the &quot;Project Site&quot; column) and a URL. Note that YAML does not allow tabs for indentation; please indent with two spaces to remain consistent with the existing file. Directory names are case-sensitive, as is always the case on Unix.
UseReverseDNS off
IdentLookups off


Finally, the HTML index file is generated from <code>index.mako</code>, a Mako template (which is mostly HTML anyhow). If you really can't figure out how it works, look up the Mako documentation.
We also limit the amount of CPU/memory resources used (e.g. to minimize [http://en.wikipedia.org/wiki/Globbing Globbing] resources):


=== FTP ===
RLimitCPU session 10
RLimitMemory session 4096K


We use [http://www.proftpd.org/ proftpd] (standalone daemon) as our FTP server.
We allow a maximum of 200 concurrent ftp sessions:


To increase performance, we disable DNS lookups in <code>proftpd.conf</code>:
MaxInstances 500
MaxClients 500


<pre>UseReverseDNS off
=== rsync ===
IdentLookups off</pre>
We also limit the CPU and memory each session may use (e.g. to limit the resources consumed by [https://en.wikipedia.org/wiki/Globbing globbing]):


<pre>RLimitCPU session 10
We use rsyncd (standalone daemon). We disable compression and checksumming in rsyncd.conf:
RLimitMemory session 4096K</pre>
We allow a maximum of 500 concurrent FTP sessions:

<pre>MaxInstances 500
MaxClients 500</pre>
The contents of <code>/mirror/root/include/motd.msg</code> are displayed when a user connects.

=== rsync ===


We use <code>rsyncd</code> (standalone daemon).
dont compress = *
refuse options = c delete


We disable compression and refuse the checksum and delete options in <code>rsyncd.conf</code>:
For ftp and rsync, the contents of /mirror/root/include/motd.msg are displayed when users connect.


<pre>dont compress = *
[[Category:Services]]
refuse options = c delete</pre>
[[Category:Software]]
The contents of <code>/mirror/root/include/motd.msg</code> are displayed when a user connects.

Revision as of 00:14, 6 March 2016

The Computer Science Club runs a public mirror (mirror.csclub.uwaterloo.ca) on potassium-benzoate.

We are listed on the ResNet "don't count" list, so downloading from our mirror will not count against one's ResNet quota.

Software Mirrored

The archives we currently mirror (and their respective disk usage) are listed on our mirror's homepage at mirror.csclub.uwaterloo.ca.

Mirroring Requests

Requests to mirror a particular distribution or archive should be made to syscom@csclub.uwaterloo.ca.

Implementation Details

Syncing

Storage

All of our projects are stored on one of two software RAID 6 arrays.

  • /mirror/root/.mirror-importantfs
    • 8 2TB drives plus 2 additional hot spare drives
  • /mirror/root/.mirror-largefs
    • 7 4TB drives plus 2 additional hot spare drives
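
As a rough guide to what these arrays can hold, the arithmetic can be sketched in Python. This is only an illustration: it assumes the 8 and 7 drives listed are all active array members, that hot spares contribute no capacity, and it ignores filesystem overhead. The function name `raid6_usable_tb` is made up for this example.

```python
def raid6_usable_tb(active_drives, drive_tb):
    """Raw usable capacity of a RAID 6 array, ignoring filesystem overhead."""
    if active_drives < 4:
        raise ValueError("RAID 6 needs at least four active drives")
    # RAID 6 spends two drives' worth of space on parity.
    # Hot spares sit idle until a member fails, so they are not counted.
    return (active_drives - 2) * drive_tb


if __name__ == "__main__":
    print(raid6_usable_tb(8, 2))  # .mirror-importantfs
    print(raid6_usable_tb(7, 4))  # .mirror-largefs
```

Under these assumptions the two arrays come to roughly 12 TB and 20 TB of raw space respectively.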

Each project listed under /mirror/root is a symlink into one of the two arrays.
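
To see which array a given project lives on, the symlinks can be resolved with a few lines of Python. This is a minimal sketch, not one of the mirror's own tools: the function name `project_arrays` is hypothetical, and it assumes each symlink points at a directory directly inside one of the array mount points.

```python
import os


def project_arrays(root="/mirror/root"):
    """Map each symlinked project under root to the directory its link resolves into."""
    result = {}
    for name in sorted(os.listdir(root)):
        path = os.path.join(root, name)
        # Only symlinks are projects; the array mount points themselves are skipped.
        if os.path.islink(path):
            target = os.path.realpath(path)
            # Assumption: the parent of the link target is the array directory.
            result[name] = os.path.dirname(target)
    return result


if __name__ == "__main__":
    for project, array in project_arrays().items():
        print(f"{project}: {array}")
```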

Merlin

The synchronization process is run by a Python script called "merlin", written by a2brenna. The script is stored in ~mirror/merlin.

The repositories and their settings (sync frequency, location, etc.) are configured in merlin.py.

To view the sync status, execute ~mirror/merlin/arthur.py status. To force the sync of a project, execute ~mirror/merlin/arthur.py sync:PROJECT_NAME.

Sync Scripts

Our synchronization scripts are located in ~mirror/bin. They currently include:

  • csc-sync-apache
  • csc-sync-debian
  • csc-sync-debian-cd
  • csc-sync-gentoo
  • csc-sync-ssh
  • csc-sync-standard

Most of these scripts take the following parameters:

local_dir rsync_host rsync_dir
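
To illustrate how three such parameters might map onto an rsync invocation, here is a hypothetical Python sketch. It is not the code in ~mirror/bin: the function name, the choice of the `-a` and `--delete` flags, the daemon-style `host::module` source syntax, and the assumption that `local_dir` is relative to /mirror/root are all guesses made for this example.

```python
def build_rsync_command(local_dir, rsync_host, rsync_dir, root="/mirror/root"):
    """Build an rsync argument list from the three common script parameters."""
    # Assumption: the upstream is an rsync daemon, addressed as host::module/path.
    source = f"{rsync_host}::{rsync_dir.strip('/')}/"
    # Assumption: local_dir names a directory under the mirror root.
    destination = f"{root}/{local_dir.strip('/')}/"
    # -a preserves permissions, times and symlinks; --delete removes files
    # that have disappeared upstream. Both flags are illustrative choices.
    return ["rsync", "-a", "--delete", source, destination]


if __name__ == "__main__":
    # rsync.example.org is a placeholder host, not a real upstream.
    print(" ".join(build_rsync_command("debian", "rsync.example.org", "debian")))
```

The function only builds the command; passing the list to `subprocess.run` would execute it.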

HTTP(s)

We use nginx as our webserver.

Index

An index of the archives we mirror is available at mirror.csclub.uwaterloo.ca.

Since Winter 2010, it has been generated by a Python script in ~mirror/mirror-index.

~mirror/mirror-index/make-index.py is scheduled in /etc/cron.d/csc-mirror to run at 5:40 AM on the 14th and 28th of each month. The script can be run manually when needed (for example, when the archive list is updated) by running:

sudo -u mirror /home/mirror/mirror-index/make-index.py

This runs an instance of du, which computes the size of each directory. The resulting list is then sorted alphabetically by directory name and returned to the Python script. If any errors occur during this process, the script conservatively chooses to exit rather than risk generating an index file that is incorrect.
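
The behaviour described above (run du, sort by name, bail out on any error) could be sketched roughly as follows. This is not the actual make-index.py source; the function names are invented for illustration, and the sketch assumes GNU du's tab-separated "SIZE, PATH" output and that the total line for the scanned root should be dropped.

```python
import subprocess
import sys


def parse_du_output(text, docroot):
    """Turn du output lines ('SIZE<TAB>PATH') into (name, size) pairs sorted by name."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        size, path = line.split("\t", 1)
        # Assumption: the total line for docroot itself is not an archive.
        if path.rstrip("/") == docroot.rstrip("/"):
            continue
        entries.append((path.rstrip("/").rsplit("/", 1)[-1], size))
    return sorted(entries, key=lambda entry: entry[0])


def directory_sizes(docroot, duflags):
    """Run du over docroot; exit instead of returning partial data on failure."""
    try:
        result = subprocess.run(
            ["du", *duflags.split(), docroot],
            check=True, capture_output=True, text=True,
        )
    except (OSError, subprocess.CalledProcessError) as exc:
        # Mirror the conservative choice: no index is better than a wrong one.
        sys.exit(f"du failed, not writing an index: {exc}")
    return parse_du_output(result.stdout, docroot)
```

Calling `directory_sizes("/mirror/root", "--human-readable --max-depth=1")` would use the flags shown in the configuration example.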

make-index.py is configured by means of a YAML file, config.yaml, in the same directory. Its format is as follows:

docroot: /mirror/root
duflags: --human-readable --max-depth=1
output: /mirror/root/index.html

exclude:
  - include
  - lost+found
  - pub
# (...)

directories:
  apache:
    site: apache.org
    url: http://www.apache.org/

  archlinux:
    site: archlinux.org
    url: http://www.archlinux.org/

# (...)

docroot is the directory to be scanned; this will probably always be the mirror root served by the web server. duflags specifies the flags to be passed to du. It lives in the configuration so that it is easy to find and alter. For instance, we could change --human-readable to --si if we ever decided that, like hard disk manufacturers, we want sizes to appear larger than they are. output defines the file to which the generated index will be written.
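
The remark about sizes appearing larger comes down to the divisor: --human-readable uses powers of 1024, while --si uses powers of 1000. A small Python illustration of the arithmetic follows (the helper names are made up, and the output formatting is not meant to imitate du's own rounding):

```python
def in_binary_units(size_bytes, power):
    """Scale a byte count by powers of 1024, as --human-readable does."""
    return size_bytes / 1024 ** power


def in_si_units(size_bytes, power):
    """Scale a byte count by powers of 1000, as --si does."""
    return size_bytes / 1000 ** power


if __name__ == "__main__":
    size = 2 * 10**12  # a drive sold as "2 TB"
    print(f"{in_binary_units(size, 4):.2f}")  # prints 1.82
    print(f"{in_si_units(size, 4):.2f}")      # prints 2.00
```

The same number of bytes therefore reads about 10% larger at the terabyte scale when reported in SI units.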

exclude lists the directories to omit from the generated index page; by default, all directories are included.

directories specifies the list of directories to be listed. The format is straightforward: name the directory and provide a site (the display name in the "Project Site" column) and a URL. Note that YAML does not allow tabs for indentation; please indent with two spaces to remain consistent with the existing file. Directory names are case-sensitive, as is always the case on Unix.
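
How exclude and directories might combine can be sketched in Python. This is a guess at the behaviour rather than the real make-index.py logic: the dictionary stands in for config.yaml after parsing (for example with PyYAML's `yaml.safe_load`), the function name `index_rows` is hypothetical, and the sketch assumes that a directory with no entry under directories is still listed, just without site details.

```python
# Stand-in for the parsed config.yaml shown above.
config = {
    "exclude": ["include", "lost+found", "pub"],
    "directories": {
        "apache": {"site": "apache.org", "url": "http://www.apache.org/"},
        "archlinux": {"site": "archlinux.org", "url": "http://www.archlinux.org/"},
    },
}


def index_rows(found_dirs, config):
    """Build (directory, site, url) rows for the index, skipping excluded names."""
    excluded = set(config.get("exclude", []))
    known = config.get("directories", {})
    rows = []
    for name in sorted(found_dirs):
        if name in excluded:
            continue
        # Lookups are case-sensitive: "Apache" does not match "apache".
        info = known.get(name, {})
        rows.append((name, info.get("site", ""), info.get("url", "")))
    return rows


if __name__ == "__main__":
    for row in index_rows(["pub", "apache", "archlinux", "lost+found"], config):
        print(row)
```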

Finally, the HTML index file is generated from index.mako, a Mako template (which is mostly HTML anyhow). If you really can't figure out how it works, look up the Mako documentation.

FTP

We use proftpd (standalone daemon) as our FTP server.

To increase performance, we disable DNS lookups in proftpd.conf:

UseReverseDNS           off
IdentLookups            off

We also limit the CPU and memory each session may use (e.g. to limit the resources consumed by globbing):

RLimitCPU               session 10
RLimitMemory            session 4096K

We allow a maximum of 500 concurrent FTP sessions:

MaxInstances            500
MaxClients              500

The contents of /mirror/root/include/motd.msg are displayed when a user connects.

rsync

We use rsyncd (standalone daemon).

We disable compression and refuse the checksum and delete options in rsyncd.conf:

dont compress = *
refuse options = c delete

The contents of /mirror/root/include/motd.msg are displayed when a user connects.