Batch Computing (SLURM)

Overview

To maintain fair shared usage of the computational resources at sciCORE, we use a queueing system (also known as a workload manager) named SLURM. Users who want to run calculations on the cluster must interact with SLURM to reserve the resources required for the calculation. To that end, the user writes and submits a script that, besides the command that executes the calculation, contains directives understood by SLURM specifying the amount of CPU/GPU, RAM, and runtime (among other details).

This guide will show you how to write such a script and the meaning of those SLURM directives.

Note

SLURM is just one of many queueing systems, but it is very common and powerful. This means that a substantial part of what you use to run your calculations at sciCORE can also easily be ported to other infrastructures that use SLURM, such as the Swiss national supercomputer at CSCS.

Understanding a SLURM script

A generic SLURM script

To explain the generic SLURM script, we will use as an example a user who has a Python script that analyzes some data and is usually executed as:

python my_script.py inputdata.txt

To run it in the cluster, the user must create a SLURM script similar to this:

#!/bin/bash                  
#The previous line is mandatory

#SBATCH --job-name=myrun     #Name of your job
#SBATCH --cpus-per-task=1    #Number of cores to reserve
#SBATCH --mem-per-cpu=1G     #Amount of RAM/core to reserve
#SBATCH --time=06:00:00      #Maximum allocated time
#SBATCH --qos=6hours         #Selected queue to allocate your job
#SBATCH --output=myrun.o%j   #Path and name to the file for the STDOUT
#SBATCH --error=myrun.e%j    #Path and name to the file for the STDERR

ml <software.version>        #Load required module(s)

<execute command>            #Execute your command(s)

The line #!/bin/bash is mandatory. The rest of the lines starting with #SBATCH are formally comments (because they start with #). However, since they contain the special string SBATCH, they become directives that SLURM can understand. These directives are optional; if they are not provided, a default value is used (which might not be adequate for you).

After the SLURM directives, we might need to load the libraries and software (see [Module System at sciCORE](software.nd#module-system) for more information) required to execute the last line, which is the command that we want to run on the compute nodes.

This script should be saved in a file (e.g. launch.sh) and then can be submitted to the queue with:

sbatch <name_of_your_script_file>

Once it is submitted, and if everything goes well, a job ID number will be assigned to the job and it will be placed in the queue, where it will wait until the required resources become available. At that moment, your job will be executed automatically.

Let’s now explore the meaning of each of the SLURM directives from the script above.


--job-name

If you don’t provide one, a default name based on the assigned job ID will be used.

It is advisable to give here a descriptive name that helps you to identify your jobs easily.

Example:

--job-name=supernova_s15


--cpus-per-task
The default value is 1.

This will reserve that specific number of cores. With few exceptions, most of the computing nodes in sciCORE have 128 cores; therefore, a value higher than that is not possible. In order to actually use those cores, a parallelized code must be executed. SLURM only ensures that the requested cores are available for you; their proper usage is the responsibility of the user. Also be aware that not all parallelized codes scale to a high number of cores. It is very possible that increasing the number of cores beyond a certain point brings very little benefit in terms of time-to-solution. It is the user’s responsibility to know the scaling properties of their codes.

Example:

--cpus-per-task=4


--ntasks
The default value is 1.

This directive is not in the example above, but it is relevant for parallel calculations. The main difference with --cpus-per-task is that --ntasks specifies how many instances of your command will be executed, while --cpus-per-task determines how many cores will be used by each of those instances. A single task cannot be distributed among computing nodes, but you can have several tasks on one computing node. Be aware that one task is not always equal to one job: one job can have several tasks, and each of them can be on a different computing node. This is a subtle but extremely important difference to understand if you are performing classical parallel calculations. As a general guideline, if you only use OpenMP (i.e. threads) to parallelize your code, you only need --cpus-per-task. If you only use MPI (i.e. processes), you only need --ntasks. If you use both (i.e. a hybrid calculation), you will need both directives (see the sketch after the example below). Further examples can be found in the examples section.

Example:

 --ntasks=16
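As a minimal sketch of a hybrid job, assuming a hypothetical MPI/OpenMP executable named my_hybrid_app (and that the corresponding MPI module has already been loaded), the two directives could be combined like this:

#!/bin/bash

#SBATCH --job-name=hybrid_run
#SBATCH --ntasks=4                 #4 MPI processes
#SBATCH --cpus-per-task=8          #8 OpenMP threads per MPI process (32 cores in total)
#SBATCH --mem-per-cpu=1G
#SBATCH --time=01:00:00
#SBATCH --qos=6hours

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK                 #one OpenMP thread per reserved core

srun --cpus-per-task=$SLURM_CPUS_PER_TASK ./my_hybrid_app   #srun starts one instance per task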


--mem-per-cpu

The default value is 1 GB.

This value will be multiplied by the number in the directive --cpus-per-task to calculate the total requested RAM memory.

Most of our computing nodes have 256 GB of RAM, so if your total requested RAM is higher than that, your job will fail. If you need more RAM for a special case, please contact us.

Alternatively, you can use the directive --mem which provides the full overall amount of memory to be used by all cores as they see fit.
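For example, the following two ways of requesting memory result in the same total of 8 GB for the job:

--cpus-per-task=4 --mem-per-cpu=2G    #4 cores x 2 GB per core = 8 GB in total
--cpus-per-task=4 --mem=8G            #8 GB in total, shared among the 4 cores as needed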

Tip

You must always specify the unit (G for Gigabytes, M for Megabytes, etc…)

Example:

--mem-per-cpu=8G

Warning

Be aware that you should not request an amount of RAM too close to the maximum RAM capacity of a node, because the operating system already consumes part of that memory for management. As a rule of thumb, consider that all nodes have about 20% less memory than their nominal maximum. Otherwise, your jobs might wait in the queue forever.




--time

The default value is 6h.

This specifies the maximum amount of real-time (wallclock time) that your job will need. If that limit is reached and your job is still running, SLURM will kill it.

It is important that you don’t overshoot this parameter. If you know the amount of time that your calculation needs, use a value close to that. In this way, your job will spend a lower amount of time waiting in the queue (see backfilling).

Tip

The format is hh:mm:ss. If days are needed, you can use days-hh:mm:ss.

Example:

--time=1-12:00:00
(this will reserve 1 day and 12 hours)


--qos

The default value is 6hours.

This determines the queue to which your job will be assigned. You must select a queue compatible with the amount of time that you requested. Namely, the value in the directive --time should be smaller than or equal to the time limit of the selected --qos value.

The selected queue will also impose limits on the number of cores and simultaneous jobs that a user or group can have. See QoS in sciCORE to know more.

Example:

--qos=1day


--output

If this directive is not given, SLURM will generate a generic file with the assigned job ID in your home directory.

With this directive, you provide a path and a name to the file that will contain all the output that otherwise would be printed on the screen. If you only provide a name, the file will be created in the current working directory (usually the same from which the script was launched).

If you provide a path, that path must exist. If it doesn’t, SLURM won’t create it. Instead, your submission will silently fail without giving any warning!

If you use the string %j the job ID will be used in the name of the file. This can be useful to identify different output files from different runs.

Example:

--output=myoutput.%j


--error

If this directive is not given, SLURM will generate a generic file with the assigned job ID in your home directory.

With this directive, you provide a path and a name to the file that will contain all the errors and system output that otherwise would be printed on the screen. If you only provide a name, the file will be created in the current working directory (usually the same from which the script was launched).

If you provide a path, that path must exist. If it doesn’t, SLURM won’t create it. Instead, your submission will silently fail without giving any warning!

If you use the string %j the job ID will be used in the name of the file. This can be useful to identify different output files from different runs.

Example:

--error=myerror.%j

Common errors in a SLURM script

Info

To help you avoid many common errors when creating your SLURM script, we have developed an online tool that generates SLURM scripts that work in sciCORE.

Typo in the Shebang

The shebang is the sequence of characters at the beginning of a script that defines the interpreter that should read the script.

SLURM uses bash so it must be:

#!/bin/bash

No other information should be included in this line (not even comments!) unless they are arguments for the interpreter.

Typo in the #SBATCH directive

If you forget the ‘#’, the script will fail immediately.

If you have a typo in the word ‘SBATCH’ as:

#SBACTH
the directive will be ignored, because every line that starts with # is considered a comment. This error is more insidious: the script might run with a requirement missing, leading to unwanted and possibly unexpected results.

Spaces around the = sign in the directives

SLURM directives that need a value use the = sign, and no spaces should be present around it. In fact, as a general rule, no spaces should be present anywhere in the directives unless they are escaped with a backslash (e.g. in the name of a directory), and even then we don’t recommend it.

For example, this will fail:

#SBATCH --cpus-per-task= 1

Error in the path for --output and/or --error

If the path to the output and/or error files cannot be found (because it does not exist or contains a typo), the submission will fail silently, without any warning or error message. If you see this behavior, these paths are the usual suspects.
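For instance, if you want to keep your log files in a subdirectory (the name logs is just an example), create it yourself before submitting:

mkdir -p logs                 #SLURM will not create this directory for you
sbatch launch.sh              #launch.sh contains e.g.: #SBATCH --output=logs/myrun.o%j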

Submitting and managing jobs

To submit a SLURM script to the workload manager you use the command sbatch:

sbatch <slurm_script>

If there is no error, a message will print the assigned job ID and the job will be queued. You can see all your queued jobs with:

squeue -u <username>

This will list all your jobs in the queue and their status.

If you want to cancel one of your jobs, find out its job ID with squeue and then:

scancel <jobID>

Or if you want to cancel all your queued jobs independently of their status (pending, running, etc…) then:

scancel -u <username>
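scancel also accepts filters. For example, to cancel only your pending jobs and leave the running ones untouched:

scancel -u <username> --state=PENDING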

Queues and partitions

Quality of Service (QoS)

In SLURM, there are no formal queues, but QoS (Quality of Service).

Depending on your job’s runtime, you must choose one of the following QoS, selecting it with the sbatch directive --qos=qos-name.

| QOS | Maximum runtime | GrpTRES | MaxTRESPA | MaxTRESPU |
|-----------|-------------|-------------------------------|-------------------------------|-------------------------------|
| 30min | 00:30:00 | CPU=12,000, MEM=68 TB | CPU=10,000, MEM=50 TB | CPU=10,000, MEM=50 TB |
| 6hours | 06:00:00 | CPU=11,500, MEM=64 TB | CPU=7,500, MEM=40 TB | CPU=7,500, MEM=40 TB |
| 1day | 1-00:00:00 | CPU=9,000, MEM=60 TB | CPU=4,500, MEM=30 TB | CPU=4,500, MEM=30 TB |
| 1week | 7-00:00:00 | CPU=3,800, MEM=30 TB | CPU=2,000, MEM=15 TB | CPU=2,000, MEM=15 TB |
| 2weeks | 14-00:00:00 | CPU=1,300, MEM=10 TB | CPU=128, MEM=2 TB | CPU=128, MEM=2 TB |
| gpu30min | 00:30:00 | CPU=3,300, GPU=170, MEM=26 TB | CPU=2,400, GPU=136, MEM=22 TB | CPU=2,600, GPU=136, MEM=22 TB |
| gpu6hours | 06:00:00 | CPU=3,000, GPU=150, MEM=24 TB | CPU=2,000, GPU=100, MEM=16 TB | CPU=2,000, GPU=100, MEM=16 TB |
| gpu1day | 1-00:00:00 | CPU=2,500, GPU=120, MEM=20 TB | CPU=1,250, GPU=60, MEM=10 TB | CPU=1,250, GPU=60, MEM=10 TB |
| gpu1week | 7-00:00:00 | CPU=1,500, GPU=48, MEM=12 TB | CPU=750, GPU=24, MEM=6 TB | CPU=750, GPU=24, MEM=6 TB |

TRES means Trackable RESources. These are resources that can be tracked to enforce limits. In sciCORE, these resources are the number of cores, the amount of RAM, and the number of GPUs.

GrpTRES is the maximum number of trackable resources assigned to the QoS.

MaxTRESPA is the maximum number of trackable resources per account (i.e., a research group).

MaxTRESPU is the maximum number of trackable resources per user.

All three limits are enforced for all users in sciCORE. It is important to note that, even if there are resources available in the cluster, your jobs won’t run if running them would mean allocating more resources than any of those limits allow. This can be seen when doing squeue -u <username>, under the REASON column.

Warning

The limits on all QoS will change over time, due to temporary situations/needs or due to upgrades of the cluster. You can always check the current limits with the command usage.
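If you prefer to query SLURM directly, a quick way to list the configured QoS and some of their limits (the columns below are only a selection) is:

sacctmgr show qos format=Name,MaxWall,GrpTRES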

Partitions

The compute nodes are logically grouped into partitions, which can overlap. The aim is to make available a specific type of infrastructure grouped by functionality (e.g. dedicated nodes for a project) or characteristic (e.g. nodes with GPUs or nodes with the same number of cores). 

| Partition | Assigned nodes | Cores per node | RAM per node* | GPUs per node | Allowed QoS |
|-----------|----------------|----------------|---------------|---------------|-------------|
| scicore | sca[05–52], scb[01–28] | 128 | 512 GB / 1 TB | – | 30min, 6hours, 1day, 1week, 2weeks |
| bigmem | scb[29–40], scc[01–02] | 128 | 1 TB / 2 TB | – | 30min, 6hours, 1day, 1week, 2weeks |
| titan | sgi[01–04] | 28 | 512 GB | 7 × Titan-Pascal | gpu30min, gpu6hours, gpu1day, gpu1week |
| rtx4090 | sgd[01–03] | 128 | 1 TB | 8 × RTX 4090 with NVLink | gpu30min, gpu6hours, gpu1day, gpu1week |
| a100 | sga[01–06], sgc[01–06] | 128 | 1 TB | 4 × A100 (40 GB) with NVLink | gpu30min, gpu6hours, gpu1day, gpu1week |
| a100-80g | sgb01, sgj[01–02] | 128 | 1 TB | 4 × A100 (80 GB) with NVLink | gpu30min, gpu6hours, gpu1day, gpu1week |

Note

Take into account that the operating system already consumes around 20% of this maximum value for management purposes. So the actual available RAM/node is about 80% of the presented value.

Info

scicore is the default partition; if you don’t specify a partition, your job will run on this one.

Warning

The bigmem partition is only for high-memory jobs (over 256 GB of RAM). Any misuse of this partition is prohibited and the offending jobs will be killed. Please contact scicore-admin@unibas.ch for advice.
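As a minimal sketch, a legitimate high-memory job would combine the partition with a large memory request, for example (the value 500G is only illustrative):

#SBATCH --partition=bigmem
#SBATCH --mem=500G
#SBATCH --qos=1day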

Array jobs

Sometimes you want to execute the exact same code but with varying input data or parameters. In such cases, where there is no communication between instances, perfect parallelism can be achieved by launching several copies of the same calculation at the same time. Instead of creating a different SLURM script for each instance, you should use an array of jobs. This is a SLURM option that allows you to execute many instances with one single script. This is the recommended procedure, as it is much easier to manage and puts much less stress on the workload manager than launching hundreds of individual scripts.

Imagine the following situation: you have an R script, named analyze.R, that takes two arguments, the name of the input data file and the name of the file where it will write the results. You would normally execute it like this:

Rscript analyze.R input.dat output.txt

Now, it turns out that input.dat is too big and it will take 1 week to analyze it. Fortunately, input.dat is basically a set of independent data, so you can split it into 200 smaller chunks. If you launch these 200 instances at the same time, you get a 200× speedup, reducing the time-to-solution from 1 week to about 50 minutes.

To do so, you first create a file that will contain all the instances that you want to launch. Let’s call it commands.cmd (names and extensions don’t really matter, we use those just to make the files easy to identify):

Rscript analyze.R input1.dat output1.txt
Rscript analyze.R input2.dat output2.txt
Rscript analyze.R input3.dat output3.txt
...
Rscript analyze.R input199.dat output199.txt
Rscript analyze.R input200.dat output200.txt

Now we will create a SLURM script that launches 200 tasks, where each task executes only one line of commands.cmd:

#!/bin/bash

#SBATCH --job-name=array_analyze
#SBATCH --time=01:00:00
#SBATCH --qos=6hours
#SBATCH --output=myoutput%A_%a.out
#SBATCH --error=myerror%A_%a.err
#SBATCH --mem=1G 
#SBATCH --array=1-200

module load R/4.4.2-foss-2024a

$(head -$SLURM_ARRAY_TASK_ID commands.cmd | tail -1) 

In this case, the --array directive will launch 200 tasks, numbered from 1 to 200. It acts as an implicit loop with an index running from 1 to 200. As a result, this script will execute the module load and the head ... tail lines 200 times, each time with the same requirements of cores, time, and memory. Each of these 200 executions is called a task, and each task is numbered. The number corresponding to each task is stored in an environment variable named $SLURM_ARRAY_TASK_ID.

The head ... | tail combination selects one single line of the commands.cmd file and the characters $( ... ) will execute that line.

By using $SLURM_ARRAY_TASK_ID to select the line to be executed in the head ... | tail combination, we ensure that task 1 will execute line 1 of the commands.cmd file, task 2 will execute line 2, and so on.
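Since the input and output files in this example are simply numbered, an equivalent approach that avoids the commands.cmd file altogether is to use the task ID directly in the file names:

Rscript analyze.R input${SLURM_ARRAY_TASK_ID}.dat output${SLURM_ARRAY_TASK_ID}.txt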

Note that we selected 1h of runtime, because our estimation is that each calculation will take 50 min.

Note also that a format option has been used in the output and error file names. This will create a different file for every task, resulting in 400 small log files in this example. Having many small files is the number-one enemy of a filesystem, so we strongly recommend deleting these files once they are no longer needed.

Due to the limits applied by the QoS, you might want to limit the number of simultaneous tasks so that other members of your group can also run jobs. To do so, you can use the following:

#SBATCH --array=1-200%20

This will limit the number of simultaneous tasks to 20.

An alternative syntax for selecting the line to execute for each array task:

SEEDFILE=commands.cmd
SEED=$(sed -n ${SLURM_ARRAY_TASK_ID}p $SEEDFILE)
eval $SEED

Backfilling

The backfill scheduling plugin is loaded by default. Without backfill scheduling, each partition is scheduled strictly in priority order, which typically results in significantly lower system utilization and responsiveness than otherwise possible. Backfill scheduling will start lower priority jobs if doing so does not delay the expected start time of any higher priority jobs. Since the expected start time of pending jobs depends upon the expected completion time of running jobs, reasonably accurate time limits are essential for backfill scheduling to work well.

This means that the more accurate the running time and resource requirements you provide in your SLURM script, the higher the chances that you will benefit from jumping ahead in the queue thanks to backfilling. It also means that shorter, less demanding calculations usually benefit the most. So, if you can restructure your pipeline/analysis to work in many smaller chunks, that is almost always better than fewer bigger chunks.

As a rule of thumb, always try to fit your calculations in the smallest possible queue. Under the assumption of normal occupancy of the cluster, if your run can fit in the 30min QoS, it will move in the queue faster than if you launch it in the 6hours QoS.

Requesting GPUs

To request a GPU, you must specify in your SLURM script the number of GPUs and the partition (which determines the GPU type). For example:

#!/bin/bash

#SBATCH --job-name=GPU_JOB
#SBATCH --time=01:00:00
#SBATCH --qos=gpu6hours
#SBATCH --mem-per-cpu=1G
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --partition=a100
#SBATCH --gres=gpu:1        # --gres=gpu:2 for two GPUs, etc.

module load CUDA
...

Warning

Note that GPUs have exclusive QoS, named gpu30min, gpu6hours, gpu1day, and gpu1week. If GPUs are requested in your SLURM script, one of these QoS must be used.
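Once the job is running, a quick (optional) way to verify which GPUs were assigned to it is to add the following to your script:

nvidia-smi                      #lists the GPU(s) visible to the job
echo $CUDA_VISIBLE_DEVICES      #GPU indices assigned by SLURM (depending on the configuration)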

Using local SCRATCH

The sciCORE cluster has a centralized parallel file system that is exported to the login and computing nodes; this is where your home directory is located. In addition, the computing nodes have local hard disks, referred to as local SCRATCH.

See Storage Services & Policies in sciCORE to have a better understanding of this structure.

For every job submitted to the cluster, a temporary directory is automatically created in the local scratch folder of the computing node. This temporary directory is independent for each SLURM job and is automatically deleted when the job finishes. You can use the environment variable $TMPDIR in your submit script to use this temporary directory. For example:

#!/bin/bash
#SBATCH --job-name=test_JOB
#SBATCH --time=01:00:00
#SBATCH --mem=1G


module load CoolApp/1.1

cp $HOME/input-data/* $TMPDIR/

CoolApp --input $TMPDIR/inputfile.txt --output $TMPDIR/output.txt

cp $TMPDIR/output.txt $HOME/output-data/

This SLURM script loads a module and then copies data from a directory in the user’s home (located in the centralized file system) to the local hard disk of whatever computing node is assigned ($TMPDIR). Then it executes the application that will generate an output file. Finally, because the $TMPDIR directory will be automatically deleted after the script finishes, the user copies the output from $TMPDIR to their home directory.

Many applications need frequent access to the disk, and such I/O-bound behavior imposes a serious limitation on performance. Because the local SCRATCH disks are closer to the cores that perform the calculations, it is worth exploring whether copying the input data to the local SCRATCH, generating the output locally there, and copying the results back to your home directory provides a gain in time-to-solution. In many cases, the overhead of moving data back and forth is more than compensated by not having to constantly write to the parallel file system over the network.
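If the results in $TMPDIR are valuable, a small safeguard (sketched here on the CoolApp example above) is to register the copy-back command with a trap, so that it also runs if the application exits with an error:

trap 'cp $TMPDIR/output.txt $HOME/output-data/' EXIT   #runs when the script ends, even after a failure

CoolApp --input $TMPDIR/inputfile.txt --output $TMPDIR/output.txt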

Requesting minimum available space in the local SCRATCH

When submitting your SLURM job, you can request that the compute node where the job is going to be executed has at least a certain amount of free space in the local SCRATCH folder ($TMPDIR). To do this, use the --tmp directive. For example:

#!/bin/bash

#SBATCH --job-name=test_JOB
#SBATCH --time=01:00:00
#SBATCH --qos=6hours
#SBATCH --mem=1G
#SBATCH --tmp=10G            # the compute node should have at least 10G
                             # of free space in local scratch folder ($TMPDIR)

module load CoolApp/1.1

cp $HOME/input-data/* $TMPDIR/

CoolApp --input $TMPDIR/inputfile.txt --output $TMPDIR/output.txt

cp $TMPDIR/output.txt $HOME/output-data/

Monitoring

In this section, we compile a series of methods and tools that will help you to monitor your jobs in SLURM when using the sciCORE cluster.

When the job has finished

The most reliable information about your job is only attainable after it has finished. The following tools and methods will provide this information.


time

Prepending time to your command will provide the total wallclock (real), user, and system CPU times that your application used. These are written to STDERR.

$ time <very_complicated_command>

This is a straightforward and computationally cheap way of timing your applications. Additionally, dividing the CPU time (user + system) by the wallclock time should give a number close to the number of CPUs used in your calculation: the closer it is, the better your application is using the parallel resources. Strictly speaking, this is not monitoring but profiling your code. Nevertheless, it is such a ubiquitous and easy methodology that it is important to mention it here.
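For instance, with illustrative numbers for a run that reserved 4 cores:

real    2m0.000s
user    7m30.000s
sys     0m30.000s

Here (7.5 min + 0.5 min) / 2 min = 4, so the four reserved cores were kept busy almost the whole time.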


sacct

sacct is the main tool for accounting. It can provide a great deal of information about the characteristics of a job. Nevertheless, if all the information is shown, the output is not easy to read:

$ sacct -lj 1927461

       JobID     JobIDRaw    JobName  Partition  MaxVMSize  MaxVMSizeNode  MaxVMSizeTask  AveVMSize     MaxR
SS MaxRSSNode MaxRSSTask     AveRSS MaxPages MaxPagesNode   MaxPagesTask   AvePages     MinCPU MinCPUNode Mi
nCPUTask     AveCPU   NTasks  AllocCPUS    Elapsed      State ExitCode AveCPUFreq ReqCPUFreqMin ReqCPUFreqMa
x ReqCPUFreqGov     ReqMem ConsumedEnergy  MaxDiskRead MaxDiskReadNode MaxDiskReadTask    AveDiskRead MaxDis
kWrite MaxDiskWriteNode MaxDiskWriteTask   AveDiskWrite    AllocGRES      ReqGRES    ReqTRES  AllocTRES 
------------ ------------ ---------- ---------- ---------- -------------- -------------- ---------- --------
-- ---------- ---------- ---------- -------- ------------ -------------- ---------- ---------- ---------- --
-------- ---------- -------- ---------- ---------- ---------- -------- ---------- ------------- ------------
- ------------- ---------- -------------- ------------ --------------- --------------- -------------- ------
------ ---------------- ---------------- -------------- ------------ ------------ ---------- ---------- 
1927460_1    1927461            test    scicore                                                                                                     
                                                                                                          2 
  00:00:07  COMPLETED      0:0                  Unknown       Unknown    Unknown 1000Mc                                                                                                      
                                                          cpu=2,mem+ cpu=2,mem+ 
1927460_1.b+ 1927461.bat+      batch               150064K          shi38              0    150064K      105
6K      shi38          0      1056K        0        shi38              0          0   00:00:00      shi38   
       0   00:00:00        1          2   00:00:07  COMPLETED      0:0      1.20G             0             
0             0     1000Mc              0            0           shi38           65534              0       
     0            shi38            65534              0                                      cpu=2,mem+ 

Here the -l option means ‘long format’ and -j specifies the job ID. Instead, we can use the default output of sacct, which is easier to read:

$ sacct -j 1927461

       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
1927460_1          test    scicore    scicore          2  COMPLETED      0:0 
1927460_1.b+      batch               scicore          2  COMPLETED      0:0 

Nevertheless, this might not provide the values we are interested in, like for example, requested memory, maximum memory, elapsed time, etc… Fortunately, we can change the output format of sacct very easily with the option -o:

sacct -o JobID,JobName,AllocCPUS,ReqMem,MaxRSS,User,Timelimit,State -j 1927461

       JobID    JobName  AllocCPUS     ReqMem     MaxRSS      User  Timelimit      State 
------------ ---------- ---------- ---------- ---------- --------- ---------- ---------- 
1927460_1          test          2     1000Mc              cabezon   00:01:00  COMPLETED 
1927460_1.b+      batch          2     1000Mc      1056K                       COMPLETED 

With the -o option we can select a series of accounting fields, separated by commas. The complete list of available fields can be retrieved with the --helpformat option:

$ sacct --helpformat

Account           AdminComment      AllocCPUS         AllocGRES        
AllocNodes        AllocTRES         AssocID           AveCPU           
AveCPUFreq        AveDiskRead       AveDiskWrite      AvePages         
AveRSS            AveVMSize         BlockID           Cluster          
Comment           ConsumedEnergy    ConsumedEnergyRaw CPUTime          
CPUTimeRAW        DerivedExitCode   Elapsed           ElapsedRaw       
Eligible          End               ExitCode          GID              
Group             JobID             JobIDRaw          JobName          
Layout            MaxDiskRead       MaxDiskReadNode   MaxDiskReadTask  
MaxDiskWrite      MaxDiskWriteNode  MaxDiskWriteTask  MaxPages         
MaxPagesNode      MaxPagesTask      MaxRSS            MaxRSSNode       
MaxRSSTask        MaxVMSize         MaxVMSizeNode     MaxVMSizeTask    
MinCPU            MinCPUNode        MinCPUTask        NCPUS            
NNodes            NodeList          NTasks            Priority         
Partition         QOS               QOSRAW            ReqCPUFreq       
ReqCPUFreqMin     ReqCPUFreqMax     ReqCPUFreqGov     ReqCPUS          
ReqGRES           ReqMem            ReqNodes          ReqTRES          
Reservation       ReservationId     Reserved          ResvCPU          
ResvCPURAW        Start             State             Submit           
Suspended         SystemCPU         Timelimit         TotalCPU         
UID               User              UserCPU           WCKey            
WCKeyID

To know what each of these fields shows, you can always refer to the sacct manual: man sacct.

Info

Interestingly, sacct offers an output format that can be parsed very easily through the option --parsable. In this case, the output will be delimited with pipes |.
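For example, to extract a few fields in a script-friendly way:

sacct -j 1927461 -o JobID,JobName,Elapsed,MaxRSS,State --parsable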


seff

seff is a sciCORE custom tool that provides a summary of the most relevant characteristics of a finished job. This tool is automatically executed at the end of a job, and its output is included in the email that you receive if you have enabled email notifications (--mail-type) in your script.

$ seff 1927461

Job ID: 1927461
Array Job ID: 1927460_1
Cluster: scicore
User/Group: cabezon/scicore
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 2
CPU Utilized: 00:00:00
CPU Efficiency: 0.00% of 00:00:14 core-walltime
Memory Utilized: 1.03 MB
Memory Efficiency: 0.05% of 1.95 GB

As can be seen in the example above, seff provides some information, which is parsed from sacct. Especially relevant are the State (COMPLETED / FAILED) and CPU and Memory Efficiencies. If the latter are very low, it means that the reserved resources are mostly idle, and therefore it is advisable to reduce the CPU and memory requirements. This will benefit the other users, and your jobs will get through the queue faster.


jobstats

It is important to submit efficient jobs to the cluster. By doing this you will benefit from shorter wait times and more resources for you and your group, and the overall utilization of the sciCORE cluster will be higher.

jobstats is a utility to easily view the resource usage and efficiency of jobs submitted to SLURM. It is built on top of the sacct utility, but it displays only the information most relevant to the efficiency of the job.

$ jobstats
JobID       JobName                          ReqMem    MaxRSS     ReqCPUS   UserCPU      Timelimit    Elapsed      State           JobEff  
===========================================================================================================================================
9351465     beast_L2.2.1_2                   36.0G     1.44G      1         5-18:45:24   7-00:00:00   5-20:42:30   COMPLETED       43.87
9579068     slurm_gdb9_005992_5_casscf_s     10.0G     2.34G      1         06:24:38     2-00:00:00   1-23:58:01   COMPLETED       61.66       
9582327     slurm_gdb9_006412_3_casscf_s     10.0G     2.36G      1         19:36:00     2-00:00:00   1-23:58:02   COMPLETED       61.74       
9584625     slurm_gdb9_006542_1_casscf_s     10.0G     2.44G      1         21:53:33     7-00:00:00   1-21:07:16   COMPLETED       25.63       
9585121     slurm_gdb9_006555_3_casscf_s     10.0G     2.43G      1         1-01:30:27   7-00:00:00   1-16:45:45   COMPLETED       24.26       
9594820     slurm_gdb9_007824_2_casscf_s     10.0G     2.47G      1         05:55:58     7-00:00:00   09:46:05     COMPLETED       15.28       
9715952     slurm_gdb9_008671_0_casscf_s     10.0G     666M       1         00:44.771    7-00:00:00   00:01:32     COMPLETED       3.26       
9737217     slurm_gdb9_052778_1_casscf_pm6   10.0G     1.85G      1         00:45.062    06:00:00     00:01:27     COMPLETED       9.45 

The jobstats output shows the most important parameters involved in the efficiency of a job: the job time, the CPU usage, and the memory usage.

In this output, we have a few inefficient jobs:

  • Job 9351465 requested 36 GB (ReqMem) but used only 1.44 G (MaxRSS) while blocking 34 GB of unused memory for over 5 days.
  • Job 9715952 requested 10 GB but needed only 666 MB.
  • Job 9594820 requested 7-00:00:00 (Timelimit) but ran only 09:46:05 (Elapsed). This job could run in the 1day QoS and start earlier.
  • Job 9737217 requested 6 hours and ran only 00:01:27. This job should have been submitted to the 30min QoS.

While the job is still running

The information obtained while the job is running is not always accurate, as it is a snapshot of the status information that SLURM gathers at a certain frequency. Therefore, if your job has sporadic big changes in its resource usage, these commands might not catch them. Nevertheless, they are useful to get an impression of the overall evolution of your jobs.

Info

A user is allowed to ssh into a compute node while they have a job running on it. This is a way to do some real-time checking. Of course, the available resources are still limited to those requested by the running job.


squeue

This is the simplest, but also the most convenient and most-used, tool to check the status of your jobs.

squeue provides the full list of jobs (PENDING and RUNNING) that are present in the queue at a given moment. It shows all jobs by default, but the output can be restricted to a given user with the -u option:

$ squeue -u cabezon

JOBID     PARTITION           NAME                USER      STATE     TIME        TIME_LIMIT  QOS       NODELIST(REASON)                        
2025256   scicore             test                cabezon   PENDING   0:00        1:00        30min     (None) 
$ squeue -u cabezon

JOBID     PARTITION           NAME                USER      STATE     TIME        TIME_LIMIT  QOS       NODELIST(REASON)                        
2025257   scicore             test                cabezon   RUNNING   0:05        1:00        30min     shi28                                   
2025258   scicore             test                cabezon   RUNNING   0:05        1:00        30min     shi36                                   
2025259   scicore             test                cabezon   RUNNING   0:05        1:00        30min     shi27                                   
2025256   scicore             test                cabezon   RUNNING   0:05        1:00        30min     shi27      

The first example shows the output when the job is still waiting in the queue (PENDING), while the second shows running jobs (in fact, an array job with 4 tasks).

There are many options to modify the format of the output; they are described in man squeue.

Tip

Most of the time you want squeue to show only your jobs. To do so as default, simply add an alias to your .bashrc file (which is in your home directory):

alias squeue='squeue -u <username>'
where <username> is your username. Then logout and login again or source the .bashrc file with: source ~/.bashrc

You can always retrieve the full list with squeue -a


scontrol

This command can do many things to control your currently pending/running jobs, but we will focus here on its informative options.

$ scontrol show jobid -dd 2033257
JobId=2033257 JobName=test
   UserId=cabezon(38334) GroupId=scicore(3731) MCS_label=N/A
   Priority=1772 Nice=0 Account=scicore QOS=30min
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:00:06 TimeLimit=00:01:00 TimeMin=N/A
   SubmitTime=2017-07-12T11:05:10 EligibleTime=2017-07-12T11:05:10
   StartTime=2017-07-12T11:05:20 EndTime=2017-07-12T11:06:20 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=scicore AllocNode:Sid=login01:19949
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=shi38
   BatchHost=shi38
   NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=2 ReqB:S:C:T=0:0:*:*
   TRES=cpu=2,mem=2000M,node=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
     Nodes=shi38 CPU_IDs=8-9 Mem=2000 GRES_IDX=
   MinCPUsNode=2 MinMemoryCPU=1000M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/scicore/home/scicore/cabezon/testing/launch.sh
   WorkDir=/scicore/home/scicore/cabezon/testing
   StdErr=/scicore/home/scicore/cabezon/testing/../out2033257
   StdIn=/dev/null
   StdOut=/scicore/home/scicore/cabezon/testing/../out2033257
   Power=
   BatchScript=
#!/bin/bash

#SBATCH --job-name=test
#SBATCH --cpus-per-task=2
#SBATCH --time=00:00:10
#SBATCH --output=../out%j
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=ruben.cabezon@unibas.ch
##SBATCH --array=1-4
#SBATCH --qos=30min

time sleep 5

scontrol can be slow and other commands, like sstat or squeue, are recommended, but sometimes it is useful to have access to this information. Note that the -dd option provides detailed information, including the actual script that was launched.

Info

The information that scontrol shows will be deleted some minutes after the job has finished. Therefore it is meant to be used while the job is still in the queue or running.


sstat

sstat provides information similar to sacct -l, but only while the job is still running. It can be applied to the batch step:

$ sstat 2025177.batch

       JobID  MaxVMSize  MaxVMSizeNode  MaxVMSizeTask  AveVMSize     MaxRSS MaxRSSNode MaxRSSTask     AveRSS MaxPag
es MaxPagesNode   MaxPagesTask   AvePages     MinCPU MinCPUNode MinCPUTask     AveCPU   NTasks AveCPUFreq ReqCPUFre
qMin ReqCPUFreqMax ReqCPUFreqGov ConsumedEnergy  MaxDiskRead MaxDiskReadNode MaxDiskReadTask  AveDiskRead MaxDiskWr
ite MaxDiskWriteNode MaxDiskWriteTask AveDiskWrite 
------------ ---------- -------------- -------------- ---------- ---------- ---------- ---------- ---------- ------
-- ------------ -------------- ---------- ---------- ---------- ---------- ---------- -------- ---------- ---------
---- ------------- ------------- -------------- ------------ --------------- --------------- ------------ ---------
--- ---------------- ---------------- ------------ 
2025177.bat+    221048K          shi26              0    221048K      2064K      shi26          0      2064K       
 0        shi26              0          0  01:00.000      shi26          0  01:00.000        1      2.24G       Unk
nown       Unknown       Unknown              0        0.08M           shi26               0        0.08M     6719.
22M            shi26                0     6719.22M 

As can be seen, it also provides a lot of information that is not easy to read. Fortunately, sstat accepts options similar to those of sacct; hence, the -o option can be used to select specific fields.
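For example, for the job used above (while it is still running):

sstat -o JobID,AveCPU,MaxRSS,MaxDiskWrite -j 2025177.batch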