This page describes the configuration of a cluster of computers dedicated to scientific computing, with a focus on parallel computation. The computers are hosted by Faculdade de Ciências da Universidade de Lisboa; they have been acquired by Centro de Matemática, Aplicações Fundamentais e Investigação Operacional (FCT project UID/MAT/04561/2020) and by Grupo de Física Matemática.
This document describes the steps I followed for configuring the cluster.
A large part of it is only relevant for system administrators.
Parts which are relevant for users of the cluster are highlighted by a special mark.
One of the six machines (hereby designated by the alpha_node)
is reachable from outside;
the other five are connected to a local network and communicate with the outside world through
the alpha_node (which acts as a gateway).
All nodes run archlinux.
There was a previous installation using Ubuntu.

installing archlinux

For installing archlinux it was enough to follow the instructions.
For booting archlinux I chose, in the BIOS, efi boot.
I formatted the hard disk with a gpt label
and installed GRUB on a partition of 500Mib
(looks too big at first sight, but actually GRUB occupies 220Mib).
After this, booting was smooth.
The only original part was that I decided to keep the pacman cache
on the USB stick (where the live system boots from).
To do this, when burning the archlinux image (using rufus)
I chose a high value for
persistent partition size;
this left a lot of free space on the stick;
on that free space I created (using fdisk) a second (ext4) partition.
In the live system, I mounted this second partition on /mnt/cache.
Then I specified in /etc/pacman.conf the cache directory
/mnt/cache/pacman/pkg/ and added the -c option to
the pacstrap command in order to use
the local cache rather than the cache on the installed system.
This way, when installing archlinux on the first machine the cache gets populated
and when installing on subsequent machines the package files are already in the cache.
This saves time and bandwidth.
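For reference, the sequence might look like this (the device name /dev/sdb3 for the second partition on the stick is an assumption, check yours with lsblk; the CacheDir option goes into the /etc/pacman.conf of the live system) :
mkfs.ext4 /dev/sdb3                  # the second partition created with fdisk
mkdir -p /mnt/cache
mount /dev/sdb3 /mnt/cache
mkdir -p /mnt/cache/pacman/pkg
nano /etc/pacman.conf                # set :  CacheDir = /mnt/cache/pacman/pkg/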
As arguments to pacstrap, include all packages needed for a minimally
functional system, like nano, grub, efibootmgr,
intel-ucode, sudo, openssh.
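A possible invocation (base, linux and linux-firmware are my assumption of the rest of the package list; the -c option makes pacstrap use the host's cache, as explained above) :
pacstrap -c /mnt base linux linux-firmware nano grub efibootmgr intel-ucode sudo openssh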

configuring the network

Configuring the network was not trivial.
I chose to use systemd-networkd and systemd-resolved as network managers;
they are included in the systemd package so they are installed by default.
Each machine has several ethernet ports;
on the alpha_node I used two of them;
eno1 links to a plug in the wall and to the outside world,
eno2 links to a switch and provides connectivity to the local network
of the other nodes.
I started by installing archlinux on the alpha_node.
On the live archlinux system, we need access to the internet :
ip address add <public IP> dev eno1
and put the line
nameserver 1.1.1.1 # or some other server
in /etc/resolv.conf.
Then, in the installed system (after arch-chroot), create
/etc/systemd/network/20-wired-outside.network :
[Match]
Name=eno1
[Link]
RequiredForOnline=routable
[Network]
DHCP=yes
IPv4Forwarding=yes
[DHCPv4]
ClientIdentifier=mac
If the DHCP service on your network does not exist or does not work properly,
you can use a fixed IP number instead:
/etc/systemd/network/20-wired-outside.network :
[Match]
Name=eno1
[Link]
RequiredForOnline=routable
[Network]
Address=<fixed public IP>
For eno2, I created
/etc/systemd/network/20-wired-local.network :
[Match]
Name=eno2
[Network]
Address=192.168.1.1/24
IPv4Forwarding=yes
If you want root to be able to login through ssh, add to
/etc/ssh/sshd_config the line
PermitRootLogin yes
Note the name of the file : sshd_config, not ssh_config !
Beware, this opens the door to cyberattacks; remove this line as soon as you
implement a different way to login remotely as administrator;
until then, choose a strong password for root.
That different way to login remotely could be creating a regular user
and adding it to the wheel group,
then using sudo to gain super-user privileges.
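For instance (the username admin is only an example) :
useradd -m -G wheel admin
passwd admin
EDITOR=nano visudo          # uncomment the line :  %wheel ALL=(ALL:ALL) ALL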
In order for the alpha_node to work properly as a gateway,
we must enable NAT forwarding.
Create
/etc/sysctl.d/30-ipforward.conf :
net.ipv4.ip_forward=1
and /etc/systemd/system/nat.service :
[Unit]
Description=NAT configuration for gateway
After=network.target
[Service]
Type=oneshot
ExecStart=/usr/bin/iptables -t nat -A POSTROUTING -o eno1 -j MASQUERADE
ExecStart=/usr/bin/iptables -A FORWARD -i eno2 -o eno1 -j ACCEPT
ExecStart=/usr/bin/iptables -A FORWARD -m state --state RELATED,ESTABLISHED -j ACCEPT
RemainAfterExit=yes
[Install]
WantedBy=multi-user.target
Then, enable the services (they will be effectively started after reboot) :
systemctl enable systemd-networkd
systemctl enable systemd-resolved
systemctl enable nat
systemctl enable sshd
Replace enable by enable --now if you want the services started immediately
(this does not work inside the chroot).
Commands like systemctl status and journalctl -xe are useful for debugging.
On each of the other nodes, the configuration is far simpler.
I only used eno2.
On the live archlinux system, we need access to the internet
(using the alpha_node as gateway) :
ip address add 192.168.1.X/24 dev eno2
and put the line
nameserver 1.1.1.1 # or some other server
in /etc/resolv.conf.
Then, in the installed system (after arch-chroot), create
/etc/systemd/network/20-wired-local.network :
[Match]
Name=eno2
[Network]
Address=192.168.1.X/24
(choose a different X for each node).
If you want root to be able to login through ssh,
follow the steps described above for the alpha_node.
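For the fixed address to actually give internet access through the gateway, the unit probably also needs Gateway= and DNS= entries; a complete file for one node could look like this (the last octet 2 is arbitrary, 1.1.1.1 is just one possible DNS server). Likewise, on the live system a default route may be needed : ip route add default via 192.168.1.1.
/etc/systemd/network/20-wired-local.network :
[Match]
Name=eno2
[Network]
Address=192.168.1.2/24
Gateway=192.168.1.1
DNS=1.1.1.1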
Then, enable the services (they will be effectively started after reboot) :
systemctl enable systemd-networkd
systemctl enable systemd-resolved
systemctl enable sshd
Replace enable by enable --now if you want the services started immediately
(this does not work inside the chroot).

NTP (network time protocol)

Some applications (make for example) rely on the timestamp of
files in order to work correctly.
Thus, it is important that the machines composing the cluster have the clocks
synchronized (otherwise, sharing files through NFS could stir up trouble).
NTP is a good solution for that.
Actually, NTP does more than we need.
It synchronizes the clock of your machine with a universal time provided by timeservers
scattered across the world, and it does it with a very high precision.
What matters for us is synchronization between our machines only.
Anyway, NTP does the job.
Although the live archlinux comes with systemd-timesyncd enabled,
on the installed archlinux we must choose and install a service providing
NTP.
On the alpha_node I installed ntp;
it provides a client which gets the time from an exterior server
and also a server, used by the other nodes :
pacman -Syu ntp
systemctl enable --now ntpd
We must also ensure the ntpd service waits for the systemd-networkd
service at boot. Create
/etc/systemd/system/ntpd.service.d/wait-for-network.conf :
[Unit]
After=network-online.target
Wants=network-online.target
The command systemctl status ntpd may show
kernel reports TIME_ERROR: 0x41: Clock Unsynchronized;
looks like this message can be safely ignored.
Check with ntpq -p.
On the other nodes I installed systemd-timesyncd
which only provides a client.
It gets the time from the alpha_node :
/etc/systemd/timesyncd.conf :
NTP=192.168.1.1
systemctl enable --now systemd-timesyncd
systemd-timesyncd handles network failures gracefully, so the next step
is optional : create
/etc/systemd/system/systemd-timesyncd.service.d/wait-for-network.conf :
[Unit]
After=network-online.target
Wants=network-online.target
Somewhat misleadingly, the command timedatectl status
answers NTP service: active if we use systemd-timesyncd but
answers NTP service: inactive if we use ntp.
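To check the synchronization from one of the other nodes (timedatectl timesync-status needs a reasonably recent systemd) :
timedatectl status              # should report "NTP service: active"
timedatectl timesync-status     # should show 192.168.1.1 as the time server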

disk partitions

Each machine has a 895Gib disk.
I have reserved 200Gib for the operating system and 15Gib for a swap partition.
We intend to use /nfs-home on one machine as home directory;
it will be available to other machines through NFS.
I have chosen the alpha_node to keep /nfs-home.
So, on the alpha_node
I mounted /nfs-home on a 300Gib partition;
/sci-data is mounted on a 380Gib partition.
On each of the other nodes, the /nfs-home directory from the alpha_node
is visible through NFS and
/sci-data is mounted on a local 680Gib partition.
See section "intended disk usage" below.
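In summary, with the sizes listed above :
                     alpha_node     other nodes
  operating system   200Gib         200Gib
  swap               15Gib          15Gib
  /nfs-home          300Gib         (seen through NFS)
  /sci-data          380Gib         680Gib
  total disk         895Gib         895Gib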

NFS (network file system)

Installing and configuring NFS was not that difficult.
On the server side (on the alpha_node), pacman -Syu nfs-utils.
Then edit /etc/exports.
I listed all the other nodes in one line :
/etc/exports :
/nfs-home <list of client machines and options>
Then systemctl enable --now nfsv4-server.
Make sure the home directories of new users are created under /nfs-home,
e.g. by invoking useradd with -b or -d
(or edit /etc/default/useradd).
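A hypothetical /etc/exports line (the author lists the client machines explicitly; the subnet form and the options below are only an illustration) :
/nfs-home    192.168.1.0/24(rw,sync,no_subtree_check)
After editing the file, exportfs -ra makes the NFS server re-read it.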
On the client side, pacman -Syu nfs-utils, mkdir /nfs-home,
then mount the alpha_node's /nfs-home on the local /nfs-home.
Whatever was in the local /nfs-home becomes invisible until umount
(as happens with any mount operation).
To mount /nfs-home through NFS automatically at boot time,
you should edit /etc/fstab.
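A possible /etc/fstab entry (192.168.1.1 being the alpha_node's address on the local network) :
192.168.1.1:/nfs-home   /nfs-home   nfs   defaults,_netdev   0   0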
I did not implement kerberos authentication.
One has to be careful about the user IDs.
If you have two users on different nodes with the same username but different IDs,
they will be unable to access their home directory through NFS.
See paragraph "user accounts".

intended disk usage

Since the other nodes will see /nfs-home
through NFS, disk access will be rather slow on this directory.
Thus, users are encouraged to keep large files on local storage, under /sci-data.
A folder /sci-data/<username>
exists for that purpose on all machines.
Configuration and preferences files should be kept in
/nfs-home/<username> of course;
this is useful for defining your preferences throughout the cluster.

pacman's cache

In order to save bandwidth and installation time, I want to share pacman's cache
among the computers composing the cluster.
During the initial installation of archlinux I kept the cache on the USB
stick, as explained in section "installing archlinux" above.
For normal updates, I share pacman's cache through NFS.
However, this is not a trivial process because the package files should be owned by
root and this is not compatible with NFS' philosophy
(I decided against specifying the no_root_squash option).
I use a "temporary" cache directory /nfs-home/cache, owned by a regular user.
Each update operation is initiated on the alpha_node
and uses the usual cache directory /var/cache/pacman/pkg.
Before calling pacman -Syu we list all files in /var/cache/pacman/pkg;
after the update has finished on the alpha_node
we list again all files there
and copy the new ones (not previously present) to /nfs-home/cache.
Old versions of package files in /var/cache/pacman/pkg are deleted
using the command paccache -rk1.
On each of the other nodes, we copy all files
from /nfs-home/cache to /var/cache/pacman/pkg,
ensuring that the copied files are owned by root.
We then perform the update; in theory, there is no need to download any package file.
After updating each of the other nodes we delete all files
in /var/cache/pacman/pkg (which thus stays empty most of the time).
After all the other nodes have been updated,
the folder /nfs-home/cache is also emptied.
Of course all the above is not performed by hand; rather, it is done through a python script.
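As an illustration, the alpha_node side of the procedure could be scripted roughly as follows (a simplified sketch, not the author's actual script; it must run as root) :
# simplified sketch of the update procedure on the alpha_node
import subprocess, shutil
from pathlib import Path

PKG_CACHE = Path("/var/cache/pacman/pkg")
NFS_CACHE = Path("/nfs-home/cache")

before = {p.name for p in PKG_CACHE.iterdir()}     # package files already present
subprocess.run(["pacman", "-Syu"], check=True)     # update the alpha_node
after = {p.name for p in PKG_CACHE.iterdir()}

for name in sorted(after - before):                # copy only the newly downloaded files
    shutil.copy2(PKG_CACHE / name, NFS_CACHE / name)

subprocess.run(["paccache", "-rk1"], check=True)   # keep only the latest version of each package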

moving and copying files with different owner

While working on the script for moving around package files
(see section "pacman's cache" above), I have noticed something peculiar
about the commands mv and cp.
Suppose there is a file file-1.txt belonging to user-A.
Suppose another user, user-B, goes to that directory.
Suppose user-B has read and write permissions on the directory and on
file-1.txt.
If user-B issues the command mv file-1.txt file-2.txt,
the resulting file-2.txt will belong to
user-A. This is true independently of whether a file file-2.txt
exists (previously to the mv operation) or not.
If user-B issues the command cp file-1.txt file-2.txt,
the outcome depends on whether file-2.txt existed before the cp operation.
If a file file-2.txt existed previously and belonged to user-A,
then it will belong to the same user-A after the cp operation,
although with a new content.
If no file named file-2.txt existed previously, it will be created and will
belong to user-B.
I don't know if this is intended behaviour of command cp or is some sort of bug.
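This can be observed directly (file-3.txt below is only for the illustration); the behaviour is consistent with mv being a rename of the same inode, while cp either creates a new file or rewrites an existing one in place :
# as user-B, in a directory where file-1.txt belongs to user-A
ls -l file-1.txt            # owned by user-A
mv file-1.txt file-2.txt
ls -l file-2.txt            # still owned by user-A (same inode, just renamed)
cp file-2.txt file-3.txt
ls -l file-3.txt            # owned by user-B (a new file was created)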

Below is a graphic representation of the execution time for several identical (single-threaded) processes
launched simultaneously on one machine only.
We see a linear growth for more than 40 processes (serial behaviour).
We also see a drop in performance at 20 processes.
Recall that each machine has 20 cores, 40 threads
(the running processes can be watched with the command top).
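Such a measurement can be reproduced with a simple loop (prog stands for the single-threaded program being timed, N for the number of copies) :
N=40
time ( for i in $(seq $N); do ./prog & done; wait )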

user accounts

User and group IDs must coincide across the nodes
(recall that /nfs-home is shared through NFS).
So I wrote a script
for propagating user accounts and passwords across the nodes of the cluster.
It is written in python3 and uses fabric.
To change your password on all nodes at once, use the command cluster password.
Use the passwd command if you want different passwords on
different nodes, but why would you want that ?
Edit /etc/default/useradd if you want a default home directory different from
/home (in our case, /nfs-home).
The same script can be used by the system administrator to add new users :
cluster add user
cluster delete user
Here, user is a literal word, not the name of the user we want to add or delete;
the script is interactive and will ask for information at the prompt.
The script creates the directory /nfs-home/<username>
only on the alpha_node;
on the other nodes
the user will see their home directory through NFS.
In contrast, the folder /sci-data/<username>
is created on all machines.
If you are careful not to add/delete user accounts through other means than the above commands,
the user (and group) IDs will be the same across nodes (this is important since
/nfs-home is seen through NFS).
There are no quotas.
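As a rough idea of how such a script can work with fabric (the addresses and the exact useradd options are assumptions, not the author's implementation) :
# minimal sketch : create the same account, with the same UID, on every node
from fabric import Connection

OTHER_NODES = ["192.168.1.2", "192.168.1.3"]      # hypothetical addresses

def add_user(username, uid):
    for host in OTHER_NODES:
        c = Connection(host, user="root")
        # -M : do not create the home directory; it lives on the alpha_node and is seen through NFS
        c.run(f"useradd -M -u {uid} -d /nfs-home/{username} {username}")
        c.run(f"mkdir -p /sci-data/{username} && chown {username}: /sci-data/{username}")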

The linux kernel is updated frequently, thus requiring frequent reboots,
which is rather annoying.
Cristian Barbarosie, 2025.05