SLURM is one of the most popular schedulers for clusters and High-Performance Computing (HPC). It takes care of two tasks. First, it prevents users from starting so many processes on the same machine that none of them can run successfully (for lack of RAM, disk, or CPU time). Second, it lets you submit a set of programs to multiple computers automatically.
Typically, users interact with SLURM from a single, weaker computer (called the login node). They submit jobs (a single program that can be executed many times, in parallel), and these jobs are scheduled on more powerful machines, to which users have no direct access (for consistency's sake).
These instructions are for the case where you want SLURM to control a single computer (node). This is useful when you do not have a cluster, but a single powerful machine. Many of the instructions are taken from "How to quickly set up Slurm on Ubuntu 20.04 for single node workload scheduling".
Install SLURM
sudo apt update -y
sudo apt install slurmd slurmctld -y
sudo mkdir -p /etc/slurm-llnl/
sudo chmod 777 /etc/slurm-llnl
sudo mkdir -p /var/lib/slurm-llnl/
sudo mkdir -p /var/log/slurm-llnl/
sudo chmod 777 /var/lib/slurm-llnl/
sudo chmod 777 /var/log/slurm-llnl/
Mode 777 makes these directories writable by every user, so update the permissions to your liking.
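Before tightening anything, it can help to see which directories are still world-writable. The following is a minimal sketch, not part of the original instructions; the function name `world_writable_dirs` is made up for illustration.

```shell
# Hypothetical helper: print those of the given directories that are
# still world-writable (for example, after a chmod 777).
world_writable_dirs() {
    for dir in "$@"; do
        # -maxdepth 0 tests only the directory itself, not its contents;
        # -perm -0002 matches when the "other" write bit is set.
        find "$dir" -maxdepth 0 -type d -perm -0002
    done
}
```

For the directories created above, a call could look like `world_writable_dirs /etc/slurm-llnl /var/lib/slurm-llnl /var/log/slurm-llnl`. Whatever you tighten, the SLURM daemons still need write access to their state and log directories.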
Then we need to create two files: /etc/slurm-llnl/slurm.conf and /etc/slurm/slurm.conf. They should be identical, but they live in two different locations because of the multi-node support (not in use in our scenario). As such, I end up creating a soft link between the two:
sudo ln -s /etc/slurm-llnl/slurm.conf /etc/slurm/slurm.conf
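If /etc/slurm/ does not exist yet, or a link is already in place, the plain `ln -s` above will fail. A slightly more defensive variant might look like the sketch below; the function name `link_slurm_conf` is hypothetical, and it takes the two paths as arguments so it can be tried on a scratch directory first.

```shell
# Hypothetical helper: make DST a symbolic link to SRC.
# Usage: link_slurm_conf SRC DST
link_slurm_conf() {
    src="$1"
    dst="$2"
    # Create the directory that will hold the link, if it is missing.
    mkdir -p "$(dirname "$dst")" || return 1
    # -s: symbolic link, -f: replace an existing DST,
    # -n: treat DST as a plain file even if it is a link to a directory.
    ln -sfn "$src" "$dst" || return 1
    # Print the link target for a quick visual check.
    readlink "$dst"
}
```

On the real paths this would need root, for instance by running the same `mkdir -p` and `ln -sfn` commands with `sudo`.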
Now we edit the contents of /etc/slurm/slurm.conf and of /etc/slurm/gres.conf to the following:
To fill in the last line of slurm.conf, you can run: slurmd -C
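`slurmd -C` prints the detected hardware as a line starting with `NodeName=`, followed by an `UpTime=` line that does not belong in slurm.conf. A small filter, sketched here under the hypothetical name `node_definition`, keeps only the first of the two:

```shell
# Hypothetical helper: read `slurmd -C` output on stdin and keep only
# the node definition line (the one starting with NodeName=).
node_definition() {
    grep '^NodeName='
}
```

It would be used as `slurmd -C | node_definition`. The printed line still needs any extra settings added by hand before it goes into slurm.conf.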
Note that this configuration sets up two Nvidia A30 GPUs. If you have no Nvidia GPUs, you can delete gres.conf and remove Gres=gpu:2,mps:200 from slurm.conf.
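Removing the `Gres=` setting can also be scripted. The sketch below assumes the setting appears as a single whitespace-separated `Gres=...` token on a line; the function name `strip_gres` is made up, and it works as a stdin-to-stdout filter so the result can be reviewed before it replaces anything.

```shell
# Hypothetical helper: drop the Gres=... token (and the whitespace in
# front of it) from each line read on stdin.
# Lines such as GresTypes=... are left alone, because the pattern
# requires "=" directly after "Gres".
strip_gres() {
    sed 's/[[:space:]]*Gres=[^[:space:]]*//'
}
```

For example, `strip_gres < /etc/slurm/slurm.conf > /tmp/slurm.conf.nogpu` writes a GPU-free copy to a scratch file. If your slurm.conf contains other GPU-related settings, such as a `GresTypes` line, those need separate attention.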
Now you can start the SLURM processes (one to manage the execution, the other to manage the queues):
sudo service slurmctld restart && sudo service slurmd restart
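Once both services are up, a quick way to confirm that the scheduler accepts work is to list the node and run a trivial job. This is a minimal sketch under the assumption that the client tools `sinfo` and `srun` are installed (on Ubuntu they ship in the slurm-client package); the function name `slurm_smoke_test` is hypothetical.

```shell
# Hypothetical smoke test: succeeds only if the SLURM client tools are
# on PATH and a one-task job runs through the scheduler.
slurm_smoke_test() {
    # Bail out early when the client commands are missing.
    command -v sinfo >/dev/null 2>&1 || { echo "sinfo not found" >&2; return 127; }
    command -v srun >/dev/null 2>&1 || { echo "srun not found" >&2; return 127; }
    # Show partitions and node states.
    sinfo || return 1
    # Run `hostname` as a job; it should print this machine's name.
    srun hostname
}
```

If `srun hostname` hangs or fails, the node is probably not in a usable state, and the log files are the next place to look.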
To troubleshoot, check the following files: /var/log/slurm-llnl/slurmd.log and /var/log/slurm-llnl/slurmctld.log.
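These logs can get long, so filtering for recent problems is often quicker than reading them top to bottom. The sketch below assumes that problems are logged with the word "error" somewhere in the line; the function name `recent_errors` is made up for illustration.

```shell
# Hypothetical helper: print the last COUNT lines (default 20) of a log
# file that mention "error", ignoring case.
# Usage: recent_errors LOGFILE [COUNT]
recent_errors() {
    log="$1"
    count="${2:-20}"
    grep -i 'error' "$log" | tail -n "$count"
}
```

A typical call would be `recent_errors /var/log/slurm-llnl/slurmctld.log`, possibly with `sudo` depending on the permissions you chose for the log directory.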