Batch Computing (SLURM)
Overview
In order to maintain a fair shared usage of the computational resources at sciCORE, we use a queueing system (also known as a Workload Manager) named SLURM. Users who want to run calculations in the cluster must interact with SLURM to reserve the resources required for the calculation. To that end, the user must write and submit a script that, besides the command that executes their calculation, also contains certain directives understood by SLURM, where the amount of CPUs/GPUs, RAM, and runtime (among other details) are specified.
This guide will show you how to write such a script and the meaning of those SLURM directives.
Note
SLURM is just one of many queueing systems, but it is very common and powerful. This means that a substantial part of what you will use to run your calculations in sciCORE can also be easily ported to other infrastructures that use SLURM, such as the Swiss National Supercomputer at CSCS.
Understanding a SLURM script
A generic SLURM script
In order to explain the generic SLURM script, we will use an example: a user has a Python script (called myscript.py here purely as an illustration) that analyzes some data and is usually executed as:
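python myscript.py # script name used only for illustration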
To run it in the cluster, the user must create a SLURM script similar to this:
#!/bin/bash
#The previous line is mandatory
#SBATCH --job-name=myrun #Name of your job
#SBATCH --cpus-per-task=1 #Number of cores to reserve
#SBATCH --mem-per-cpu=1G #Amount of RAM/core to reserve
#SBATCH --time=06:00:00 #Maximum allocated time
#SBATCH --qos=6hours #Selected queue to allocate your job
#SBATCH --output=myrun.o%j #Path and name to the file for the STDOUT
#SBATCH --error=myrun.e%j #Path and name to the file for the STDERR
ml <software.version> #Load required module(s)
<execute command> #Execute your command(s)
The line #!/bin/bash is mandatory. The rest of the lines starting with #SBATCH are formally comments (because they start with #). However, since they have the special string SBATCH attached to them, they become directives that SLURM can understand. These directives are optional; if they are not provided, a default value will be used (which might not be adequate for you).
After the SLURM directives we might need to load the necessary libraries and software (see [Module System at sciCORE](software.nd#module-system) for more information) needed to execute the last line, which is our command, i.e. what we want to be run on the compute nodes.
This script should be saved in a file (e.g. launch.sh) and can then be submitted to the queue with:
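sbatch launch.sh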
Once it is submitted, and if everything goes fine, a job ID number will be assigned to the job and it will be placed in the queue, where it will wait until the required resources become available for you. At that moment, your job will be executed automatically.
Let’s explore now the meaning of each one of the different SLURM directives from the script above.
--job-name
If you don't provide one, a default name based on the assigned job ID will be used.
It is advisable to give here a descriptive name that helps you to identify your jobs easily.
Example:
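#SBATCH --job-name=myrun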
--cpus-per-task
The default value is 1.
This will reserve that specific number of cores. With few exceptions, most of the computing nodes in sciCORE have 128 cores, so a value higher than that is not possible. In order to actually use those cores, a parallelized code must be executed. SLURM will only ensure that the requested cores are available for you; the proper usage of those cores is the responsibility of the user. Also be aware that not all parallelized codes scale to a high number of cores. It is very possible that increasing the number of cores beyond a certain point brings very little benefit in terms of time-to-solution. It is the user's responsibility to know the scaling properties of their codes.
Example:
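#SBATCH --cpus-per-task=4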
--ntasks
The default value is 1.
This directive is not in the example above, but it is relevant for parallel calculations. The main difference with --cpus-per-task is that --ntasks specifies how many instances of your command will be executed, while --cpus-per-task determines how many cores will be used by each one of those instances. One single task cannot be distributed among computing nodes, but you can have several tasks in one computing node. Be aware that one task is not always equal to one job: one job can have several tasks, and each of them can be on a different computing node. This is a subtle but extremely important difference to understand if you are performing classical parallel calculations. As a general guideline, if you use only OpenMP (i.e. threads) to parallelize your code, you only need --cpus-per-task. If you only use MPI (i.e. processes), you only need --ntasks. If you use both (i.e. a hybrid calculation), you will need both. Examples of all this can be found in the examples section.
Example:
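#SBATCH --ntasks=4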
--mem-per-cpu
The default value is 1 GB.
This value will be multiplied by the number given in the directive --cpus-per-task to calculate the total requested RAM.
Most of our computing nodes have 256 GB of RAM, so if your total requested RAM is higher than that, your script will fail. If you need more RAM for a special case, please contact us.
Alternatively, you can use the directive --mem, which specifies the full overall amount of memory to be used by all cores as they see fit.
Tip
You must always specify the unit (G for Gigabytes, M for Megabytes, etc…)
Example:
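#SBATCH --mem-per-cpu=2G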
Warning
Be aware that you should not ask for an amount of RAM too close to the maximum RAM capacity of a node. This is because the operating system already consumes part of that memory for management. As a rule of thumb, consider that all nodes have about 20% less memory than their maximum. Otherwise, your jobs might be waiting in the queue forever.
--time
The default value is 6h.
This specifies the maximum amount of real-time (wallclock time) that your job will need. If that limit is reached and your job is still running, SLURM will kill it.
It is important that you don't overshoot this parameter. If you know the amount of time that your calculation needs, use a value close to that. In this way, your job will spend less time waiting in the queue (see backfilling).
Tip
The format is hh:mm:ss. If days are needed you can use: days-hh:mm:ss
Example:
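#SBATCH --time=1-12:00:00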
(this will reserve 1 day and 12 hours)
--qos
The default value is 6hours.
This determines the queue to which your job will be assigned. You must select a queue compatible with the amount of time that you requested. Namely, the value in the directive --time should be smaller than or equal to the time limit of the selected --qos value.
The selected queue will also impose limits on the number of cores and simultaneous jobs that a user or group can have. See QoS in sciCORE to know more.
Example:
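#SBATCH --qos=1day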
--output
If this directive is not given, SLURM will generate a generic file with the assigned job ID in your home directory.
With this directive, you provide a path and a name for the file that will contain all the output that otherwise would be printed on the screen. If you only provide a name, the file will be created in the current working directory (usually the same one from which the script was launched).
If you provide a path, that path must exist. If it doesn't, SLURM won't create it. Instead, your submission will silently fail without giving any warning!
If you use the string %j, the job ID will be used in the name of the file. This can be useful to identify different output files from different runs.
Example:
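#SBATCH --output=myrun.o%j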
--error
If this directive is not given, SLURM will generate a generic file with the assigned job ID in your home directory.
With this directive, you provide a path and a name for the file that will contain all the errors and system output that otherwise would be printed on the screen. If you only provide a name, the file will be created in the current working directory (usually the same one from which the script was launched).
If you provide a path, that path must exist. If it doesn't, SLURM won't create it. Instead, your submission will silently fail without giving any warning!
If you use the string %j, the job ID will be used in the name of the file. This can be useful to identify different error files from different runs.
Example:
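#SBATCH --error=myrun.e%j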
Common errors in a SLURM script
Info
To help you avoid many common errors when creating your SLURM script, we have developed an online tool that generates SLURM scripts that work in sciCORE.
Typo in the Shebang
The shebang is the sequence of characters at the beginning of a script that defines the interpreter that should read the script.
SLURM uses bash so it must be:
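#!/bin/bash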
No other information should be included in this line (not even comments!) unless they are arguments for the interpreter.
Typo in the #SBATCH directive
If you forget the ‘#’, the script will fail immediately.
If you have a typo in the word ‘SBATCH’ as:
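#SBACTH --job-name=myrun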
the directive will be ignored, because every line that starts with # is considered a comment. This error is more mischievous, because the script might run but with a requirement missing, leading to likely unwanted and maybe unexpected results.
Spaces around the = sign in the directives
SLURM directives that need a value use the = sign. No spaces should be present around it. In fact, as a general rule, no spaces should be present anywhere in the directives unless they are backslashed (e.g. for the name of a directory), but we don't recommend this.
For example, this will fail:
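#SBATCH --cpus-per-task = 4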
Error in the path for --output and/or --error
If the path to the output and/or error files cannot be found (because it does not exist or contains a typo), the script will fail silently, without warning or error message. If you see this behavior, these paths are the usual suspects.
Submitting and managing jobs
To submit a SLURM script to the workload manager you use the command sbatch:
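sbatch launch.sh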
If there is no error, a message will display the assigned job ID and the job will be queued. You can see all your queued jobs with:
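squeue -u <username>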
This will list all your jobs in the queue and their status.
If you want to cancel one of your jobs, find out its job ID with squeue and then:
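scancel <jobID>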
Or if you want to cancel all your queued jobs independently of their status (pending, running, etc…) then:
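scancel -u <username>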
Queues and partitions
Quality of Service (QoS)
In SLURM, there are no formal queues, but QoS (Quality of Service).
Depending on your job's runtime, you must choose one of the following QoS. Choose the right QoS according to your runtime by using the sbatch directive --qos=qos-name.
QOS | Maximum runtime | GrpTRES | MaxTRESPA | MaxTRESPU |
---|---|---|---|---|
30min | 00:30:00 | CPU=12,000 MEM=68 TB | CPU=10,000 MEM=50 TB | CPU=10,000 MEM=50 TB |
6hours | 06:00:00 | CPU=11,500 MEM=64 TB | CPU=7,500 MEM=40 TB | CPU=7,500 MEM=40 TB |
1day | 1-00:00:00 | CPU=9,000 MEM=60 TB | CPU=4,500 MEM=30 TB | CPU=4,500 MEM=30 TB |
1week | 7-00:00:00 | CPU=3,800 MEM=30 TB | CPU=2,000 MEM=15 TB | CPU=2,000 MEM=15 TB |
2weeks | 14-00:00:00 | CPU=1,300 MEM=10 TB | CPU=128 MEM=2 TB | CPU=128 MEM=2 TB |
gpu30min | 00:30:00 | CPU=3,300 GPU=170 MEM=26 TB | CPU=2,400 GPU=136 MEM=22 TB | CPU=2,600 GPU=136 MEM=22 TB |
gpu6hours | 06:00:00 | CPU=3,000 GPU=150 MEM=24 TB | CPU=2,000 GPU=100 MEM=16 TB | CPU=2,000 GPU=100 MEM=16 TB |
gpu1day | 1-00:00:00 | CPU=2,500 GPU=120 MEM=20 TB | CPU=1,250 GPU=60 MEM=10 TB | CPU=1,250 GPU=60 MEM=10 TB |
gpu1week | 7-00:00:00 | CPU=1,500 GPU=48 MEM=12 TB | CPU=750 GPU=24 MEM=6 TB | CPU=750 GPU=24 MEM=6 TB |
TRES means Trackable RESources. These are resources that can be tracked to enforce limits. In sciCORE, these resources are the number of cores, the amount of RAM, and the number of GPUs.
GrpTRES is the maximum amount of trackable resources assigned to the QoS.
MaxTRESPA is the maximum amount of trackable resources per account (i.e., a research group).
MaxTRESPU is the maximum amount of trackable resources per user.
All three limits are enforced on all users in sciCORE. It is important to note that, even if there are available resources in the cluster, your jobs won't run if running them would put your allocated resources above any of those limits. This can be seen when doing squeue -u <username>, under the column REASON.
Warning
The limits on all QoS will change over time, due to temporary situations/needs or due to upgrades of the cluster. You can always check the current limits with the command usage.
Partitions
The compute nodes are logically grouped into partitions, which can overlap. The aim is to make available a specific type of infrastructure grouped by functionality (e.g. dedicated nodes for a project) or characteristic (e.g. nodes with GPUs or nodes with the same number of cores).
Partition | Assigned nodes | Cores per node | RAM per node* | GPUs per node | Allowed QoS |
---|---|---|---|---|---|
scicore | sca[05–52], scb[01–28] | 128 | 512 GB / 1 TB | – | 30min, 6hours, 1day, 1week, 2weeks |
bigmem | scb[29–40], scc[01–02] | 128 | 1 TB / 2 TB | – | 30min, 6hours, 1day, 1week, 2weeks |
titan | sgi[01–04] | 28 | 512 GB | 7 × Titan-Pascal | gpu30min, gpu6hours, gpu1day, gpu1week |
rtx4090 | sgd[01–03] | 128 | 1 TB | 8 × RTX 4090 with NVLink | gpu30min, gpu6hours, gpu1day, gpu1week |
a100 | sga[01–06], sgc[01–06] | 128 | 1 TB | 4 × A100 (40 GB) with NVLink | gpu30min, gpu6hours, gpu1day, gpu1week |
a100-80g | sgb01, sgj[01–02] | 128 | 1 TB | 4 × A100 (80 GB) with NVLink | gpu30min, gpu6hours, gpu1day, gpu1week |
Note
Take into account that the operating system already consumes around 20% of this maximum value for management purposes. So the actual available RAM/node is about 80% of the presented value.
Info
scicore is the default partition; if you don't specify a partition, your job will run in this one.
Warning
The use of the bigmem partition is only for high-memory jobs (over 256 GB of RAM). Any misuse of this partition is prohibited and the jobs will be killed. Please contact scicore-admin@unibas.ch for advice.
Array jobs
Sometimes you want to execute the exact same code while varying the input data or some parameters. In those cases, where there is no communication between instances, perfect parallelism can be achieved by launching several copies of the same instance at the same time. Instead of creating a different SLURM script for each instance, you should use an array of jobs. This is a SLURM option that allows you to execute many instances with one single script. This is the recommended procedure, as it is much easier to manage and puts much less stress on the Workload Manager than launching hundreds of individual scripts.
Imagine the following situation: you have an R script, named analyze.R, that takes two arguments, the name of the input data file and the name of the file where it will write the results. You would normally execute it like this:
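Rscript analyze.R input.dat output.txt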
Now, it turns out that input.dat is too big and it will take 1 week to analyze it. Fortunately, input.dat is basically a set of independent data, so you can split it into 200 smaller chunks. If you launch these 200 instances you get a 200x speedup, reducing the time-to-solution from 1 week to 50 min.
To do so, you first create a file that will contain all the instances that you want to launch. Let's call it commands.cmd (names and extensions don't really matter; we use those just to make the files easy to identify):
Rscript analyze.R input1.dat output1.txt
Rscript analyze.R input2.dat output2.txt
Rscript analyze.R input3.dat output3.txt
...
Rscript analyze.R input199.dat output199.txt
Rscript analyze.R input200.dat output200.txt
Now we will create a SLURM script that will launch 200 tasks, where each task will execute only one line of commands.cmd:
#!/bin/bash
#SBATCH --job-name=array_analyze
#SBATCH --time=01:00:00
#SBATCH --qos=6hours
#SBATCH --output=myoutput%A_%a.out
#SBATCH --error=myerror%A_%a.err
#SBATCH --mem=1G
#SBATCH --array=1-200
module load R/4.4.2-foss-2024a
$(head -$SLURM_ARRAY_TASK_ID commands.cmd | tail -1)
The directive --array will, in this case, launch 200 tasks, numbered from 1 to 200. It acts as an implicit loop with an index from 1 to 200. As a result, this script will execute the module load and the head ... tail lines, with the same requirements of cores, time, and memory, 200 times. Each one of these 200 executions is called a task, and each task is numbered. The number corresponding to each task is stored in an environment variable named $SLURM_ARRAY_TASK_ID.
The head ... | tail combination selects one single line of the commands.cmd file, and the construct $( ... ) executes that line.
By using $SLURM_ARRAY_TASK_ID as the variable that selects the line to be executed in the head ... | tail combination, we ensure that task 1 will execute line 1 of the commands.cmd file, task 2 will execute line 2, and so on.
Note that we selected 1h of runtime, because our estimation is that each calculation will take 50 min.
Note also that a format option has been used in the output and error file names. This will create a different file for every task, resulting in this example in 400 small log files. Having many small files is the number one enemy of a filesystem, so we strongly recommend deleting these files once they are no longer needed.
Due to the limits applied by the QoS, you might want to limit the number of simultaneous tasks so that other members of your group can also run. To do so you can use the following:
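#SBATCH --array=1-200%20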
This will limit the number of simultaneous tasks to 20.
An alternative syntax for calling the jobs one at a time for the array:
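One possible form, using the throttle value %1 so that only one task runs at a time:
#SBATCH --array=1-200%1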
Backfilling
The backfill scheduling plugin is loaded by default. Without backfill scheduling, each partition is scheduled strictly in priority order, which typically results in significantly lower system utilization and responsiveness than otherwise possible. Backfill scheduling will start lower priority jobs if doing so does not delay the expected start time of any higher priority jobs. Since the expected start time of pending jobs depends upon the expected completion time of running jobs, reasonably accurate time limits are essential for backfill scheduling to work well.
This means that the more accurate the running time and resource requirements you provide in your SLURM script, the higher the chances you have to benefit from jumping up in the queue because of backfilling. This also means that shorter, less demanding calculations usually benefit from this. So, if you can re-structure your pipeline/analysis to work in many smaller chunks, that is almost always better than using fewer, bigger chunks.
As a rule of thumb, always try to fit your calculations in the smallest possible queue. Under the assumption of normal occupancy of the cluster, if your run can fit in the 30min QoS, it will move in the queue faster than if you launch it in the 6hours QoS.
Requesting GPUs
To request a GPU you must specify in your SLURM script the type, the number of GPUs, and the partition. For example:
#!/bin/bash
#SBATCH --job-name=GPU_JOB
#SBATCH --time=01:00:00
#SBATCH --qos=gpu6hours
#SBATCH --mem-per-cpu=1G
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --partition=a100
#SBATCH --gres=gpu:1 # --gres=gpu:2 for two GPUs, etc
module load CUDA
...
Warning
Note that GPUs have exclusive QoS, named gpu30min, gpu6hours, gpu1day, and gpu1week. If GPUs are requested in your SLURM script, one of these QoS must be used.
Using local SCRATCH
The sciCORE cluster has a centralized parallel file system that is exported to the login and computing nodes. This is where your home directory is located. Nevertheless, there are additional hard disks in the computing nodes. These are referred to as local SCRATCH.
See Storage Services & Policies in sciCORE to have a better understanding of this structure.
For every job submitted to the cluster, a temporary directory is automatically created in the local scratch folder of the computing node. This temporary directory is independent for each SLURM job and is automatically deleted when the job finishes. You can use the environment variable $TMPDIR in your submit script to use this temporary directory. For example:
#!/bin/bash
#SBATCH --job-name=test_JOB
#SBATCH --time=01:00:00
#SBATCH --mem=1G
module load CoolApp/1.1
cp $HOME/input-data/* $TMPDIR/
CoolApp --input $TMPDIR/inputfile.txt --output $TMPDIR/output.txt
cp $TMPDIR/output.txt $HOME/output-data/
This SLURM script loads a module and then copies data from a directory in the user's home (located in the centralized file system) to the local hard disk of whatever computing node is assigned ($TMPDIR). Then it executes the application that will generate an output file. Finally, because the $TMPDIR directory will be automatically deleted after the script finishes, the user copies the output from $TMPDIR back to their home directory.
Many applications need frequent access to the hard disk. This imposes a serious limitation in terms of performance when such an I/O bottleneck exists. Because the local SCRATCH disks are closer to the cores that perform the calculations, it is worth exploring whether copying the input data to the local SCRATCH, generating the output locally there, and copying the results back to your home provides a gain in the time-to-solution. In many cases, the overhead of moving data back and forth is completely compensated by the gain of not needing to constantly write to the parallel file system from the computing node over the network.
Requesting a minimum of available space in the local SCRATCH
When submitting your SLURM job you can request that the compute node where the job is going to be executed has at least a minimum of free available space in the local SCRATCH folder ($TMPDIR). To do this you can use the --tmp directive. For example:
#!/bin/bash
#SBATCH --job-name=test_JOB
#SBATCH --time=01:00:00
#SBATCH --qos=6hours
#SBATCH --mem=1G
#SBATCH --tmp=10G # the compute node should have at least 10G
# of free space in local scratch folder ($TMPDIR)
module load CoolApp/1.1
cp $HOME/input-data/* $TMPDIR/
CoolApp --input $TMPDIR/inputfile.txt --output $TMPDIR/output.txt
cp $TMPDIR/output.txt $HOME/output-data/
Monitoring
In this section, we compile a series of methods and tools that will help you to monitor your jobs in SLURM when using the sciCORE cluster.
When the job has finished
The most reliable information about your job is only attainable after it has finished. The following tools and methods will provide this information.
time
Using time in front of your command will provide you with the total Wallclock, CPU, and User times that your application used. These are written to STDERR.
This is a straightforward and computationally cheap way of timing your applications. Additionally, dividing the CPU time by the Wallclock time should give a number close to the number of CPUs used in your calculation: the closer, the better your application is using the parallel resources. Strictly speaking, this is not monitoring but profiling your code. Nevertheless, it is such a ubiquitous and easy methodology that it is important to mention it here.
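For example, prefixing the command in your SLURM script (here with the illustrative myscript.py from the beginning of this guide):
time python myscript.py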
sacct
sacct is the main tool for accounting. It can provide a great deal of information about the characteristics of a job. Nevertheless, if all the information is shown, it is not very easy to read:
$ sacct -lj 1927461
JobID JobIDRaw JobName Partition MaxVMSize MaxVMSizeNode MaxVMSizeTask AveVMSize MaxR
SS MaxRSSNode MaxRSSTask AveRSS MaxPages MaxPagesNode MaxPagesTask AvePages MinCPU MinCPUNode Mi
nCPUTask AveCPU NTasks AllocCPUS Elapsed State ExitCode AveCPUFreq ReqCPUFreqMin ReqCPUFreqMa
x ReqCPUFreqGov ReqMem ConsumedEnergy MaxDiskRead MaxDiskReadNode MaxDiskReadTask AveDiskRead MaxDis
kWrite MaxDiskWriteNode MaxDiskWriteTask AveDiskWrite AllocGRES ReqGRES ReqTRES AllocTRES
------------ ------------ ---------- ---------- ---------- -------------- -------------- ---------- --------
-- ---------- ---------- ---------- -------- ------------ -------------- ---------- ---------- ---------- --
-------- ---------- -------- ---------- ---------- ---------- -------- ---------- ------------- ------------
- ------------- ---------- -------------- ------------ --------------- --------------- -------------- ------
------ ---------------- ---------------- -------------- ------------ ------------ ---------- ----------
1927460_1 1927461 test scicore
2
00:00:07 COMPLETED 0:0 Unknown Unknown Unknown 1000Mc
cpu=2,mem+ cpu=2,mem+
1927460_1.b+ 1927461.bat+ batch 150064K shi38 0 150064K 105
6K shi38 0 1056K 0 shi38 0 0 00:00:00 shi38
0 00:00:00 1 2 00:00:07 COMPLETED 0:0 1.20G 0
0 0 1000Mc 0 0 shi38 65534 0
0 shi38 65534 0 cpu=2,mem+
Where the -l option means 'long format' and -j is for specifying the JOB_ID. Instead of this, we can use the default version of sacct, which is easier to read:
$ sacct -j 1927461
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
1927460_1 test scicore scicore 2 COMPLETED 0:0
1927460_1.b+ batch scicore 2 COMPLETED 0:0
Nevertheless, this might not provide the values we are interested in, like for example, requested memory, maximum memory, elapsed time, etc… Fortunately, we can change the output format of sacct very easily with the option -o:
sacct -o JobID,JobName,AllocCPUS,ReqMem,MaxRSS,User,Timelimit,State -j 1927461
JobID JobName AllocCPUS ReqMem MaxRSS User Timelimit State
------------ ---------- ---------- ---------- ---------- --------- ---------- ----------
1927460_1 test 2 1000Mc cabezon 00:01:00 COMPLETED
1927460_1.b+ batch 2 1000Mc 1056K COMPLETED
With the -o option we can select a series of accounting fields, separated by commas. The complete list of available fields can be retrieved with the --helpformat option:
$ sacct --helpformat
Account AdminComment AllocCPUS AllocGRES
AllocNodes AllocTRES AssocID AveCPU
AveCPUFreq AveDiskRead AveDiskWrite AvePages
AveRSS AveVMSize BlockID Cluster
Comment ConsumedEnergy ConsumedEnergyRaw CPUTime
CPUTimeRAW DerivedExitCode Elapsed ElapsedRaw
Eligible End ExitCode GID
Group JobID JobIDRaw JobName
Layout MaxDiskRead MaxDiskReadNode MaxDiskReadTask
MaxDiskWrite MaxDiskWriteNode MaxDiskWriteTask MaxPages
MaxPagesNode MaxPagesTask MaxRSS MaxRSSNode
MaxRSSTask MaxVMSize MaxVMSizeNode MaxVMSizeTask
MinCPU MinCPUNode MinCPUTask NCPUS
NNodes NodeList NTasks Priority
Partition QOS QOSRAW ReqCPUFreq
ReqCPUFreqMin ReqCPUFreqMax ReqCPUFreqGov ReqCPUS
ReqGRES ReqMem ReqNodes ReqTRES
Reservation ReservationId Reserved ResvCPU
ResvCPURAW Start State Submit
Suspended SystemCPU Timelimit TotalCPU
UID User UserCPU WCKey
WCKeyID
To know what each one of these formats shows, you can always refer to the sacct manual: man sacct.
Info
Interestingly, sacct offers an output format that can be parsed very easily through the option --parsable. In this case, the output will be delimited with pipes (|).
seff
seff is a sciCORE custom tool that provides a summary of the most relevant characteristics of a finished job. This tool is automatically executed at the end of a job, and its outcome is attached to the email that you receive when you have enabled the email notification options (--mail-type and --mail-user) in your script.
$ seff 1927461
Job ID: 1927461
Array Job ID: 1927460_1
Cluster: scicore
User/Group: cabezon/scicore
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 2
CPU Utilized: 00:00:00
CPU Efficiency: 0.00% of 00:00:14 core-walltime
Memory Utilized: 1.03 MB
Memory Efficiency: 0.05% of 1.95 GB
As can be seen in the example above, seff provides some information, which is parsed from sacct. Especially relevant are the State (COMPLETED / FAILED) and the CPU and Memory Efficiencies. If the latter are very low, it means that the reserved resources were mostly idle, and therefore it is advisable to reduce the CPU and memory requirements. This will benefit the other users, and your jobs will get through the queue faster.
jobstats
It is important to submit efficient jobs to the cluster: by doing this you will benefit from shorter wait times, more resources for you and your group, and, finally, a higher utilization of the sciCORE cluster.
jobstats is a utility to easily view the resource usage and the efficiency of jobs submitted to SLURM using the sacct utility. jobstats provides the same information as sacct, but it displays the most relevant information about the efficiency of the job.
$ jobstats
JobID JobName ReqMem MaxRSS ReqCPUS UserCPU Timelimit Elapsed State JobEff
===========================================================================================================================================
9351465 beast_L2.2.1_2 36.0G 1.44G 1 5-18:45:24 7-00:00:00 5-20:42:30 COMPLETED 43.87
9579068 slurm_gdb9_005992_5_casscf_s 10.0G 2.34G 1 06:24:38 2-00:00:00 1-23:58:01 COMPLETED 61.66
9582327 slurm_gdb9_006412_3_casscf_s 10.0G 2.36G 1 19:36:00 2-00:00:00 1-23:58:02 COMPLETED 61.74
9584625 slurm_gdb9_006542_1_casscf_s 10.0G 2.44G 1 21:53:33 7-00:00:00 1-21:07:16 COMPLETED 25.63
9585121 slurm_gdb9_006555_3_casscf_s 10.0G 2.43G 1 1-01:30:27 7-00:00:00 1-16:45:45 COMPLETED 24.26
9594820 slurm_gdb9_007824_2_casscf_s 10.0G 2.47G 1 05:55:58 7-00:00:00 09:46:05 COMPLETED 15.28
9715952 slurm_gdb9_008671_0_casscf_s 10.0G 666M 1 00:44.771 7-00:00:00 00:01:32 COMPLETED 3.26
9737217 slurm_gdb9_052778_1_casscf_pm6 10.0G 1.85G 1 00:45.062 06:00:00 00:01:27 COMPLETED 9.45
The jobstats output shows the most important parameters involved in the efficiency of a job: the job time, the CPU usage, and the memory usage.
In this output, we have a few inefficient jobs:
- Job 9351465 requested 36 GB (ReqMem) but used only 1.44 G (MaxRSS) while blocking 34 GB of unused memory for over 5 days.
- Job 9715952 requested 10 GB but needed only 666 MB.
- Job 9594820 requested 7-00:00:00 (Timelimit) but ran only 09:46:05 (Elapsed). This job could run in the 1day QoS and start earlier.
- Job 9737217 requested 6 hours and ran only 00:01:27. This job should be submitted to the 30min QoS.
While the job is still running
The information obtained while the job is running is not always accurate, as it is a snapshot of the status information that SLURM gathers with a certain frequency. Therefore, if your job has sporadic big changes in its resource usage, these commands might not catch them. Nevertheless, they are useful to get an impression of the overall evolution of your jobs.
Info
A user is allowed to ssh to a node while they have a running job. This is a way to do some real-time checking. Of course, the available resources are still limited to those requested by the running job.
squeue
This is the simplest, but very convenient and most-used, tool to check the status of your jobs.
squeue provides the full list of jobs (PENDING and RUNNING) that are present in the queue at a certain moment. It provides the full list by default, but this can be changed using the -u option:
$ squeue -u cabezon
JOBID PARTITION NAME USER STATE TIME TIME_LIMIT QOS NODELIST(REASON)
2025256 scicore test cabezon PENDING 0:00 1:00 30min (None)
$ squeue -u cabezon
JOBID PARTITION NAME USER STATE TIME TIME_LIMIT QOS NODELIST(REASON)
2025257 scicore test cabezon RUNNING 0:05 1:00 30min shi28
2025258 scicore test cabezon RUNNING 0:05 1:00 30min shi36
2025259 scicore test cabezon RUNNING 0:05 1:00 30min shi27
2025256 scicore test cabezon RUNNING 0:05 1:00 30min shi27
The first example shows the outcome when the job is still waiting in the queue (PENDING), while the latter shows a running job (which in fact is an array job with 4 tasks).
There are many options to modify the format of the output, which are accessible via man squeue.
Tip
Most of the time you want squeue to show only your jobs. To make this the default, simply add an alias to your .bashrc file (which is in your home directory):
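alias squeue='squeue -u <username>'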
Here <username> is your username. Then log out and log in again, or source the .bashrc file with: source ~/.bashrc
You can always retrieve the full list with squeue -a
scontrol
This command can do many things to control your currently pending/running jobs, but we will focus here on its informative options.
$ scontrol show jobid -dd 2033257
JobId=2033257 JobName=test
UserId=cabezon(38334) GroupId=scicore(3731) MCS_label=N/A
Priority=1772 Nice=0 Account=scicore QOS=30min
JobState=RUNNING Reason=None Dependency=(null)
Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
DerivedExitCode=0:0
RunTime=00:00:06 TimeLimit=00:01:00 TimeMin=N/A
SubmitTime=2017-07-12T11:05:10 EligibleTime=2017-07-12T11:05:10
StartTime=2017-07-12T11:05:20 EndTime=2017-07-12T11:06:20 Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=scicore AllocNode:Sid=login01:19949
ReqNodeList=(null) ExcNodeList=(null)
NodeList=shi38
BatchHost=shi38
NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=2 ReqB:S:C:T=0:0:*:*
TRES=cpu=2,mem=2000M,node=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
Nodes=shi38 CPU_IDs=8-9 Mem=2000 GRES_IDX=
MinCPUsNode=2 MinMemoryCPU=1000M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
Gres=(null) Reservation=(null)
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/scicore/home/scicore/cabezon/testing/launch.sh
WorkDir=/scicore/home/scicore/cabezon/testing
StdErr=/scicore/home/scicore/cabezon/testing/../out2033257
StdIn=/dev/null
StdOut=/scicore/home/scicore/cabezon/testing/../out2033257
Power=
BatchScript=
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --cpus-per-task=2
#SBATCH --time=00:00:10
#SBATCH --output=../out%j
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=ruben.cabezon@unibas.ch
##SBATCH --array=1-4
#SBATCH --qos=30min
time sleep 5
scontrol can be slow and other commands are recommended, like sstat or squeue, but sometimes it is useful to have access to this information. Note that the -dd option provides detailed information, including the actual script launched.
Info
The information that scontrol shows will be deleted some minutes after the job has finished. Therefore it is meant to be used while the job is still in the queue or running.
sstat
sstat provides information similar to sacct -l, but only while the job is still running. It can be applied to the batch step:
$sstat 2025177.batch
JobID MaxVMSize MaxVMSizeNode MaxVMSizeTask AveVMSize MaxRSS MaxRSSNode MaxRSSTask AveRSS MaxPag
es MaxPagesNode MaxPagesTask AvePages MinCPU MinCPUNode MinCPUTask AveCPU NTasks AveCPUFreq ReqCPUFre
qMin ReqCPUFreqMax ReqCPUFreqGov ConsumedEnergy MaxDiskRead MaxDiskReadNode MaxDiskReadTask AveDiskRead MaxDiskWr
ite MaxDiskWriteNode MaxDiskWriteTask AveDiskWrite
------------ ---------- -------------- -------------- ---------- ---------- ---------- ---------- ---------- ------
-- ------------ -------------- ---------- ---------- ---------- ---------- ---------- -------- ---------- ---------
---- ------------- ------------- -------------- ------------ --------------- --------------- ------------ ---------
--- ---------------- ---------------- ------------
2025177.bat+ 221048K shi26 0 221048K 2064K shi26 0 2064K
0 shi26 0 0 01:00.000 shi26 0 01:00.000 1 2.24G Unk
nown Unknown Unknown 0 0.08M shi26 0 0.08M 6719.
22M shi26 0 6719.22M
As can be seen, it also provides a lot of information that is not easy to read. Fortunately, sstat supports options similar to those of sacct. Hence, the -o option can be used to select specific fields.
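For example, to display only a few selected fields for the batch step of the job shown above:
sstat -j 2025177.batch -o JobID,MaxRSS,AveCPU,MaxDiskWrite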