Batch Computing (SLURM)
Overview
In order to maintain a fair shared usage of the computational resources at sciCORE, we use a queueing system (also known as a Workload Manager) named SLURM. Users who want to run calculations in the cluster must interact with SLURM to reserve the resources required for the calculation. To that end, the user must write and submit a script that, besides the command that executes their calculation, also contains certain directives understood by SLURM, where the amount of CPUs/GPUs, RAM, and runtime (among other details) are specified.
This guide will show you how to write such a script and the meaning of those SLURM directives.
Note
SLURM is just one of many queueing systems, but it is very common and powerful. This means that a substantial part of what you will use to run your calculations in sciCORE can also be easily ported to other infrastructures that use SLURM, such as the Swiss National Supercomputer at CSCS.
Understanding a SLURM script
A generic SLURM script
In order to explain the generic SLURM script, we will use an example: a user has a Python script (called myscript.py here purely as an illustration) that analyzes some data and is usually executed as:
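python myscript.py # script name used only for illustration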
To run it in the cluster, the user must create a SLURM script similar to this:
#!/bin/bash
#The previous line is mandatory
#SBATCH --job-name=myrun #Name of your job
#SBATCH --cpus-per-task=1 #Number of cores to reserve
#SBATCH --mem-per-cpu=1G #Amount of RAM/core to reserve
#SBATCH --time=06:00:00 #Maximum allocated time
#SBATCH --qos=6hours #Selected queue to allocate your job
#SBATCH --output=myrun.o%j #Path and name to the file for the STDOUT
#SBATCH --error=myrun.e%j #Path and name to the file for the STDERR
ml <software.version> #Load required module(s)
<execute command> #Execute your command(s)
The line #!/bin/bash is mandatory. The rest of the lines starting with #SBATCH are formally comments (because they start with #). However, since they have the special string SBATCH attached to them, they become directives that SLURM can understand. These directives are optional; if they are not provided, a default value will be used (which might not be adequate for you).
After the SLURM directives we might need to load the necessary libraries and software (see [Module System at sciCORE](software.nd#module-system) for more information) needed to execute the last line, which is our command, i.e. what we want to be run on the compute nodes.
This script should be saved in a file (e.g. launch.sh) and can then be submitted to the queue with:
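sbatch launch.sh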
Once it is submitted, and if everything goes fine, a job ID number will be assigned to the job and it will be placed in the queue, where it will wait until the required resources become available for you. At that moment, your job will be executed automatically.
Let’s explore now the meaning of each one of the different SLURM directives from the script above.
--job-name
If you don't provide one, a default name based on the assigned job ID will be used.
It is advisable to give here a descriptive name that helps you to identify your jobs easily.
Example:
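#SBATCH --job-name=myrun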
--cpus-per-task
The default value is 1.
This will reserve that specific number of cores. With few exceptions, most of the computing nodes in sciCORE have 128 cores, so a value higher than that is not possible. In order to actually use those cores, a parallelized code must be executed. SLURM will only ensure that the requested cores are available for you; the proper usage of those cores is the responsibility of the user. Also be aware that not all parallelized codes scale to a high number of cores. It is very possible that increasing the number of cores beyond a certain point brings very little benefit in terms of time-to-solution. It is the user's responsibility to know the scaling properties of their codes.
Example:
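#SBATCH --cpus-per-task=4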
--ntasks
The default value is 1.
This directive is not in the example above, but it is relevant for parallel calculations. The main difference with --cpus-per-task is that --ntasks specifies how many instances of your command will be executed, while --cpus-per-task determines how many cores will be used by each one of those instances. One single task cannot be distributed among computing nodes, but you can have several tasks in one computing node. Be aware that one task is not always equal to one job: one job can have several tasks, and each of them can be on a different computing node. This is a subtle but extremely important difference to understand if you are performing classical parallel calculations. As a general guideline, if you use only OpenMP (i.e. threads) to parallelize your code, you only need --cpus-per-task. If you only use MPI (i.e. processes), you only need --ntasks. If you use both (i.e. a hybrid calculation), you will need both. Examples of all this can be found in the examples section.
Example:
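#SBATCH --ntasks=4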
--mem-per-cpu
The default value is 1 GB.
This value will be multiplied by the number given in the directive --cpus-per-task to calculate the total requested RAM.
Most of our computing nodes have 256 GB of RAM, so if your total requested RAM is higher than that, your script will fail. If you need more RAM for a special case, please contact us.
Alternatively, you can use the directive --mem, which specifies the full overall amount of memory to be used by all cores as they see fit.
Tip
You must always specify the unit (G for Gigabytes, M for Megabytes, etc…)
Example:
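#SBATCH --mem-per-cpu=2G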
Warning
Be aware that you should not ask for an amount of RAM too close to the maximum RAM capacity of a node. This is because the operating system already consumes part of that memory for management. As a rule of thumb, consider that all nodes have about 20% less memory than their maximum. Otherwise, your jobs might be waiting in the queue forever.
--time
The default value is 6h.
This specifies the maximum amount of real-time (wallclock time) that your job will need. If that limit is reached and your job is still running, SLURM will kill it.
It is important that you don't overshoot this parameter. If you know the amount of time that your calculation needs, use a value close to that. In this way, your job will spend less time waiting in the queue (see backfilling).
Tip
The format is hh:mm:ss. If days are needed you can use: days-hh:mm:ss
Example:
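#SBATCH --time=1-12:00:00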
(this will reserve 1 day and 12 hours)
--qos
The default value is 6hours.
This determines the queue to which your job will be assigned. You must select a queue compatible with the amount of time that you requested. Namely, the value in the directive --time should be smaller than or equal to the time limit of the selected --qos value.
The selected queue will also impose limits on the number of cores and simultaneous jobs that a user or group can have. See QoS in sciCORE to know more.
Example:
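#SBATCH --qos=1day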
--output
If this directive is not given, SLURM will generate a generic file with the assigned job ID in your home directory.
With this directive, you provide a path and a name for the file that will contain all the output that otherwise would be printed on the screen. If you only provide a name, the file will be created in the current working directory (usually the same one from which the script was launched).
If you provide a path, that path must exist. If it doesn't, SLURM won't create it. Instead, your submission will silently fail without giving any warning!
If you use the string %j, the job ID will be used in the name of the file. This can be useful to identify different output files from different runs.
Example:
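#SBATCH --output=myrun.o%j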
--error
If this directive is not given, SLURM will generate a generic file with the assigned job ID in your home directory.
With this directive, you provide a path and a name for the file that will contain all the errors and system output that otherwise would be printed on the screen. If you only provide a name, the file will be created in the current working directory (usually the same one from which the script was launched).
If you provide a path, that path must exist. If it doesn't, SLURM won't create it. Instead, your submission will silently fail without giving any warning!
If you use the string %j, the job ID will be used in the name of the file. This can be useful to identify different error files from different runs.
Example:
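#SBATCH --error=myrun.e%j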
Common errors in a SLURM script
Info
To help you avoid many common errors when creating your SLURM script, we have developed an online tool that generates SLURM scripts that work in sciCORE.
Typo in the Shebang
The shebang is the sequence of characters at the beginning of a script that defines the interpreter that should read the script.
SLURM uses bash so it must be:
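#!/bin/bash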
No other information should be included in this line (not even comments!) unless they are arguments for the interpreter.
Typo in the #SBATCH directive
If you forget the ‘#’, the script will fail immediately.
If you have a typo in the word ‘SBATCH’ as:
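#SBACTH --job-name=myrun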
the directive will be ignored, because every line that starts with # is considered a comment. This error is more mischievous, because the script might run but with a requirement missing, leading to likely unwanted and maybe unexpected results.
Spaces around the = sign in the directives
SLURM directives that need a value use the = sign. No spaces should be present around it. In fact, as a general rule, no spaces should be present anywhere in the directives unless they are backslashed (e.g. for the name of a directory), but we don't recommend this.
For example, this will fail:
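#SBATCH --cpus-per-task = 4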
Error in the path for --output and/or --error
If the path to the output and/or error files cannot be found (because it does not exist or contains a typo), the script will fail silently, without warning or error message. If you see this behavior, these paths are the usual suspects.
Submitting and managing jobs
To submit a SLURM script to the workload manager you use the command sbatch:
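sbatch launch.sh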
If there is no error, a message will display the assigned job ID and the job will be queued. You can see all your queued jobs with:
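squeue -u <username>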
This will list all your jobs in the queue and their status.
If you want to cancel one of your jobs, find out its job ID with squeue and then:
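scancel <jobID>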
Or if you want to cancel all your queued jobs independently of their status (pending, running, etc…) then:
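scancel -u <username>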
Queues and partitions
Quality of Service (QoS)
In SLURM, there are no formal queues, but QoS (Quality of Service).
Depending on your job's runtime, you must choose one of the following QoS. Choose the right QoS according to your runtime by using the sbatch directive --qos=qos-name.
QOS | Maximum runtime | GrpTRES | MaxTRESPA | MaxTRESPU |
---|---|---|---|---|
30min | 00:30:00 | CPU=12,000 MEM=68 TB | CPU=10,000 MEM=50 TB | CPU=10,000 MEM=50 TB |
6hours | 06:00:00 | CPU=11,500 MEM=64 TB | CPU=7,500 MEM=40 TB | CPU=7,500 MEM=40 TB |
1day | 1-00:00:00 | CPU=9,000 MEM=60 TB | CPU=4,500 MEM=30 TB | CPU=4,500 MEM=30 TB |
1week | 7-00:00:00 | CPU=3,800 MEM=30 TB | CPU=2,000 MEM=15 TB | CPU=2,000 MEM=15 TB |
2weeks | 14-00:00:00 | CPU=1,300 MEM=10 TB | CPU=128 MEM=2 TB | CPU=128 MEM=2 TB |
gpu30min | 00:30:00 | CPU=3,300 GPU=170 MEM=26 TB | CPU=2,400 GPU=136 MEM=22 TB | CPU=2,600 GPU=136 MEM=22 TB |
gpu6hours | 06:00:00 | CPU=3,000 GPU=150 MEM=24 TB | CPU=2,000 GPU=100 MEM=16 TB | CPU=2,000 GPU=100 MEM=16 TB |
gpu1day | 1-00:00:00 | CPU=2,500 GPU=120 MEM=20 TB | CPU=1,250 GPU=60 MEM=10 TB | CPU=1,250 GPU=60 MEM=10 TB |
gpu1week | 7-00:00:00 | CPU=1,500 GPU=48 MEM=12 TB | CPU=750 GPU=24 MEM=6 TB | CPU=750 GPU=24 MEM=6 TB |
TRES means Trackable RESources. These are resources that can be tracked to enforce limits. In sciCORE, these resources are the number of cores, the amount of RAM, and the number of GPUs.
GrpTRES is the maximum amount of trackable resources assigned to the QoS.
MaxTRESPA is the maximum amount of trackable resources per account (i.e., a research group).
MaxTRESPU is the maximum amount of trackable resources per user.
All three limits are enforced on all users in sciCORE. It is important to note that, even if there are available resources in the cluster, your jobs won't run if running them would put your allocated resources above any of those limits. This can be seen when doing squeue -u <username>, under the column REASON.
Warning
The limits on all QoS will change over time, due to temporary situations/needs or due to upgrades of the cluster. You can always check the current limits with the command usage.
Partitions
The compute nodes are logically grouped into partitions, which can overlap. The aim is to make available a specific type of infrastructure grouped by functionality (e.g. dedicated nodes for a project) or characteristic (e.g. nodes with GPUs or nodes with the same number of cores).
Partition | Assigned nodes | Cores per node | RAM per node* | GPUs per node | Allowed QoS |
---|---|---|---|---|---|
scicore | sca[05–52], scb[01–28] | 128 | 512 GB / 1 TB | – | 30min, 6hours, 1day, 1week, 2weeks |
bigmem | scb[29–40], scc[01–02] | 128 | 1 TB / 2 TB | – | 30min, 6hours, 1day, 1week, 2weeks |
titan | sgi[01–04] | 28 | 512 GB | 7 × Titan-Pascal | gpu30min, gpu6hours, gpu1day, gpu1week |
rtx4090 | sgd[01–03] | 128 | 1 TB | 8 × RTX 4090 with NVLink | gpu30min, gpu6hours, gpu1day, gpu1week |
a100 | sga[01–06], sgc[01–06] | 128 | 1 TB | 4 × A100 (40 GB) with NVLink | gpu30min, gpu6hours, gpu1day, gpu1week |
a100-80g | sgb01, sgj[01–02] | 128 | 1 TB | 4 × A100 (80 GB) with NVLink | gpu30min, gpu6hours, gpu1day, gpu1week |
Note
Take into account that the operating system already consumes around 20% of this maximum value for management purposes. So the actual available RAM/node is about 80% of the presented value.
Info
scicore is the default partition; if you don't specify a partition, your job will run in this one.
Warning
The use of the bigmem partition is only for high-memory jobs (over 256 GB of RAM). Any misuse of this partition is prohibited and the jobs will be killed. Please contact scicore-admin@unibas.ch for advice.
Array jobs
Sometimes you want to execute the exact same code while varying the input data or some parameters. In those cases, where there is no communication between instances, perfect parallelism can be achieved by launching several copies of the same instance at the same time. Instead of creating a different SLURM script for each instance, you should use an array of jobs. This is a SLURM option that allows you to execute many instances with one single script. This is the recommended procedure, as it is much easier to manage and puts much less stress on the Workload Manager than launching hundreds of individual scripts.
Imagine the following situation: you have an R script, named analyze.R, that takes two arguments, the name of the input data file and the name of the file where it will write the results. You would normally execute it like this:
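Rscript analyze.R input.dat output.txt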
Now, it turns out that input.dat is too big and it will take 1 week to analyze it. Fortunately, input.dat is basically a set of independent data, so you can split it into 200 smaller chunks. If you launch these 200 instances you get a 200x speedup, reducing the time-to-solution from 1 week to 50 min.
To do so, you first create a file that will contain all the instances that you want to launch. Let's call it commands.cmd (names and extensions don't really matter; we use those just to make the files easy to identify):
Rscript analyze.R input1.dat output1.txt
Rscript analyze.R input2.dat output2.txt
Rscript analyze.R input3.dat output3.txt
...
Rscript analyze.R input199.dat output199.txt
Rscript analyze.R input200.dat output200.txt
Now we will create a SLURM script that will launch 200 tasks, where each task will execute only one line of commands.cmd:
#!/bin/bash
#SBATCH --job-name=array_analyze
#SBATCH --time=01:00:00
#SBATCH --qos=6hours
#SBATCH --output=myoutput%A_%a.out
#SBATCH --error=myerror%A_%a.err
#SBATCH --mem=1G
#SBATCH --array=1-200
module load R/4.4.2-foss-2024a
$(head -$SLURM_ARRAY_TASK_ID commands.cmd | tail -1)
The directive --array will, in this case, launch 200 tasks, numbered from 1 to 200. It acts as an implicit loop with an index from 1 to 200. As a result, this script will execute the module load and the head ... tail lines, with the same requirements of cores, time, and memory, 200 times. Each one of these 200 executions is called a task, and each task is numbered. The number corresponding to each task is stored in an environment variable named $SLURM_ARRAY_TASK_ID.
The head ... | tail combination selects one single line of the commands.cmd file, and the construct $( ... ) executes that line.
By using $SLURM_ARRAY_TASK_ID as the variable that selects the line to be executed in the head ... | tail combination, we ensure that task 1 will execute line 1 of the commands.cmd file, task 2 will execute line 2, and so on.
Note that we selected 1h of runtime, because our estimation is that each calculation will take 50 min.
Note also that a format option has been used in the output and error file names. This will create a different file for every task, resulting in this example in 400 small log files. Having many small files is the number one enemy of a filesystem, so we strongly recommend deleting these files once they are no longer needed.
Due to the limits applied by the QoS, you might want to limit the number of simultaneous tasks so that other members of your group can also run. To do so you can use the following:
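#SBATCH --array=1-200%20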
This will limit the number of simultaneous tasks to 20.
An alternative syntax for calling the jobs one at a time for the array:
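One possible form, using the throttle value %1 so that only one task runs at a time:
#SBATCH --array=1-200%1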
Backfilling
The backfill scheduling plugin is loaded by default. Without backfill scheduling, each partition is scheduled strictly in priority order, which typically results in significantly lower system utilization and responsiveness than otherwise possible. Backfill scheduling will start lower priority jobs if doing so does not delay the expected start time of any higher priority jobs. Since the expected start time of pending jobs depends upon the expected completion time of running jobs, reasonably accurate time limits are essential for backfill scheduling to work well.
This means that the more accurate the running time and resource requirements you provide in your SLURM script, the higher the chances you have to benefit from jumping up in the queue because of backfilling. This also means that shorter, less demanding calculations usually benefit from this. So, if you can re-structure your pipeline/analysis to work in many smaller chunks, that is almost always better than using fewer, bigger chunks.
As a rule of thumb, always try to fit your calculations in the smallest possible queue. Under the assumption of normal occupancy of the cluster, if your run can fit in the 30min QoS, it will move in the queue faster than if you launch it in the 6hours QoS.
Requesting GPUs
To request a GPU you must specify in your SLURM script the type, the number of GPUs, and the partition. For example:
#!/bin/bash
#SBATCH --job-name=GPU_JOB
#SBATCH --time=01:00:00
#SBATCH --qos=gpu6hours
#SBATCH --mem-per-cpu=1G
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --partition=a100
#SBATCH --gres=gpu:1 # --gres=gpu:2 for two GPUs, etc
module load CUDA
...
Warning
Note that GPUs have exclusive QoS, named gpu30min, gpu6hours, gpu1day, and gpu1week. If GPUs are requested in your SLURM script, one of these QoS must be used.
Using local SCRATCH
The sciCORE cluster has a centralized parallel file system that is exported to the login and computing nodes. This is where your home directory is located. Nevertheless, there are additional hard disks in the computing nodes. These are referred to as local SCRATCH.
See Storage Services & Policies in sciCORE to have a better understanding of this structure.
For every job submitted to the cluster, a temporary directory is automatically created in the local scratch folder of the computing node. This temporary directory is independent for each SLURM job and is automatically deleted when the job finishes. You can use the environment variable $TMPDIR in your submit script to use this temporary directory. For example:
#!/bin/bash
#SBATCH --job-name=test_JOB
#SBATCH --time=01:00:00
#SBATCH --mem=1G
module load CoolApp/1.1
cp $HOME/input-data/* $TMPDIR/
CoolApp --input $TMPDIR/inputfile.txt --output $TMPDIR/output.txt
cp $TMPDIR/output.txt $HOME/output-data/
This SLURM script loads a module and then copies data from a directory in the user's home (located in the centralized file system) to the local hard disk of whatever computing node is assigned ($TMPDIR). Then it executes the application that will generate an output file. Finally, because the $TMPDIR directory will be automatically deleted after the script finishes, the user copies the output from $TMPDIR back to their home directory.
Many applications need frequent access to the hard disk. This imposes a serious limitation in terms of performance when such an I/O bottleneck exists. Because the local SCRATCH disks are closer to the cores that perform the calculations, it is worth exploring whether copying the input data to the local SCRATCH, generating the output locally there, and copying the results back to your home provides a gain in the time-to-solution. In many cases, the overhead of moving data back and forth is completely compensated by the gain of not needing to constantly write to the parallel file system from the computing node over the network.
Requesting a minimum of available space in the local SCRATCH
When submitting your SLURM job you can request that the compute node where the job is going to be executed has at least a minimum of free available space in the local SCRATCH folder ($TMPDIR). To do this you can use the --tmp directive. For example:
#!/bin/bash
#SBATCH --job-name=test_JOB
#SBATCH --time=01:00:00
#SBATCH --qos=6hours
#SBATCH --mem=1G
#SBATCH --tmp=10G # the compute node should have at least 10G
# of free space in local scratch folder ($TMPDIR)
module load CoolApp/1.1
cp $HOME/input-data/* $TMPDIR/
CoolApp --input $TMPDIR/inputfile.txt --output $TMPDIR/output.txt
cp $TMPDIR/output.txt $HOME/output-data/
Monitoring
In this section, we compile a series of methods and tools that will help you to monitor your jobs in SLURM when using the sciCORE cluster.
When the job has finished
The most reliable information about your job is only attainable after it has finished. The following tools and methods will provide this information.
time
Using time in front of your command will provide you with the total Wallclock, CPU, and User times that your application used. These are written to STDERR.
This is a straightforward and computationally cheap way of timing your applications. Additionally, dividing the CPU time by the Wallclock time should give a number close to the number of CPUs used in your calculation: the closer, the better your application is using the parallel resources. Strictly speaking, this is not monitoring but profiling your code. Nevertheless, it is such a ubiquitous and easy methodology that it is important to mention it here.
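For example, prefixing the command in your SLURM script (here with the illustrative myscript.py from the beginning of this guide):
time python myscript.py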
sacct
sacct is the main tool for accounting. It can provide a great deal of information about the characteristics of a job. Nevertheless, if all the information is shown, it is not very easy to read:
$ sacct -lj 1927461
JobID JobIDRaw JobName Partition MaxVMSize MaxVMSizeNode MaxVMSizeTask AveVMSize MaxR
SS MaxRSSNode MaxRSSTask AveRSS MaxPages MaxPagesNode MaxPagesTask AvePages MinCPU MinCPUNode Mi
nCPUTask AveCPU NTasks AllocCPUS Elapsed State ExitCode AveCPUFreq ReqCPUFreqMin ReqCPUFreqMa
x ReqCPUFreqGov ReqMem ConsumedEnergy MaxDiskRead MaxDiskReadNode MaxDiskReadTask AveDiskRead MaxDis
kWrite MaxDiskWriteNode MaxDiskWriteTask AveDiskWrite AllocGRES ReqGRES ReqTRES AllocTRES
------------ ------------ ---------- ---------- ---------- -------------- -------------- ---------- --------
-- ---------- ---------- ---------- -------- ------------ -------------- ---------- ---------- ---------- --
-------- ---------- -------- ---------- ---------- ---------- -------- ---------- ------------- ------------
- ------------- ---------- -------------- ------------ --------------- --------------- -------------- ------
------ ---------------- ---------------- -------------- ------------ ------------ ---------- ----------
1927460_1 1927461 test scicore
2
00:00:07 COMPLETED 0:0 Unknown Unknown Unknown 1000Mc
cpu=2,mem+ cpu=2,mem+
1927460_1.b+ 1927461.bat+ batch 150064K shi38 0 150064K 105
6K shi38 0 1056K 0 shi38 0 0 00:00:00 shi38
0 00:00:00 1 2 00:00:07 COMPLETED 0:0 1.20G 0
0 0 1000Mc 0 0 shi38 65534 0
0 shi38 65534 0 cpu=2,mem+
Where the -l option means 'long format' and -j is for specifying the JOB_ID. Instead of this, we can use the default version of sacct, which is easier to read:
$ sacct -j 1927461
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
1927460_1 test scicore scicore 2 COMPLETED 0:0
1927460_1.b+ batch scicore 2 COMPLETED 0:0
Nevertheless, this might not provide the values we are interested in, like for example, requested memory, maximum memory, elapsed time, etc… Fortunately, we can change the output format of sacct very easily with the option -o:
sacct -o JobID,JobName,AllocCPUS,ReqMem,MaxRSS,User,Timelimit,State -j 1927461
JobID JobName AllocCPUS ReqMem MaxRSS User Timelimit State
------------ ---------- ---------- ---------- ---------- --------- ---------- ----------
1927460_1 test 2 1000Mc cabezon 00:01:00 COMPLETED
1927460_1.b+ batch 2 1000Mc 1056K COMPLETED
With the -o option we can select a series of accounting fields, separated by commas. The complete list of available fields can be retrieved with the --helpformat option:
$ sacct --helpformat
Account AdminComment AllocCPUS AllocGRES
AllocNodes AllocTRES AssocID AveCPU
AveCPUFreq AveDiskRead AveDiskWrite AvePages
AveRSS AveVMSize BlockID Cluster
Comment ConsumedEnergy ConsumedEnergyRaw CPUTime
CPUTimeRAW DerivedExitCode Elapsed ElapsedRaw
Eligible End ExitCode GID
Group JobID JobIDRaw JobName
Layout MaxDiskRead MaxDiskReadNode MaxDiskReadTask
MaxDiskWrite MaxDiskWriteNode MaxDiskWriteTask MaxPages
MaxPagesNode MaxPagesTask MaxRSS MaxRSSNode
MaxRSSTask MaxVMSize MaxVMSizeNode MaxVMSizeTask
MinCPU MinCPUNode MinCPUTask NCPUS
NNodes NodeList NTasks Priority
Partition QOS QOSRAW ReqCPUFreq
ReqCPUFreqMin ReqCPUFreqMax ReqCPUFreqGov ReqCPUS
ReqGRES ReqMem ReqNodes ReqTRES
Reservation ReservationId Reserved ResvCPU
ResvCPURAW Start State Submit
Suspended SystemCPU Timelimit TotalCPU
UID User UserCPU WCKey
WCKeyID
To know what each one of these formats shows, you can always refer to the sacct manual: man sacct.
Info
Interestingly, sacct offers an output format that can be parsed very easily through the option --parsable. In this case, the output will be delimited with pipes (|).
seff
seff is a sciCORE custom tool that provides a summary of the most relevant characteristics of a finished job. This tool is automatically executed at the end of a job, and its outcome is attached to the email that you receive when you have enabled the email notification options (--mail-type and --mail-user) in your script.
$ seff 1927461
Job ID: 1927461
Array Job ID: 1927460_1
Cluster: scicore
User/Group: cabezon/scicore
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 2
CPU Utilized: 00:00:00
CPU Efficiency: 0.00% of 00:00:14 core-walltime
Memory Utilized: 1.03 MB
Memory Efficiency: 0.05% of 1.95 GB
As can be seen in the example above, seff provides some information, which is parsed from sacct. Especially relevant are the State (COMPLETED / FAILED) and the CPU and Memory Efficiencies. If the latter are very low, it means that the reserved resources were mostly idle, and therefore it is advisable to reduce the CPU and memory requirements. This will benefit the other users, and your jobs will get through the queue faster.
jobstats
It is important to submit efficient jobs to the cluster: by doing this you will benefit from shorter wait times, more resources for you and your group, and, finally, a higher utilization of the sciCORE cluster.
jobstats is a utility to easily view the resource usage and the efficiency of jobs submitted to SLURM using the sacct utility. jobstats provides the same information as sacct, but it displays the most relevant information about the efficiency of the job.
$ jobstats
JobID JobName ReqMem MaxRSS ReqCPUS UserCPU Timelimit Elapsed State JobEff
===========================================================================================================================================
9351465 beast_L2.2.1_2 36.0G 1.44G 1 5-18:45:24 7-00:00:00 5-20:42:30 COMPLETED 43.87
9579068 slurm_gdb9_005992_5_casscf_s 10.0G 2.34G 1 06:24:38 2-00:00:00 1-23:58:01 COMPLETED 61.66
9582327 slurm_gdb9_006412_3_casscf_s 10.0G 2.36G 1 19:36:00 2-00:00:00 1-23:58:02 COMPLETED 61.74
9584625 slurm_gdb9_006542_1_casscf_s 10.0G 2.44G 1 21:53:33 7-00:00:00 1-21:07:16 COMPLETED 25.63
9585121 slurm_gdb9_006555_3_casscf_s 10.0G 2.43G 1 1-01:30:27 7-00:00:00 1-16:45:45 COMPLETED 24.26
9594820 slurm_gdb9_007824_2_casscf_s 10.0G 2.47G 1 05:55:58 7-00:00:00 09:46:05 COMPLETED 15.28
9715952 slurm_gdb9_008671_0_casscf_s 10.0G 666M 1 00:44.771 7-00:00:00 00:01:32 COMPLETED 3.26
9737217 slurm_gdb9_052778_1_casscf_pm6 10.0G 1.85G 1 00:45.062 06:00:00 00:01:27 COMPLETED 9.45
The jobstats output shows the most important parameters involved in the efficiency of a job: the job time, the CPU usage, and the memory usage.
In this output, we have a few inefficient jobs:
- Job 9351465 requested 36 GB (ReqMem) but used only 1.44 G (MaxRSS) while blocking 34 GB of unused memory for over 5 days.
- Job 9715952 requested 10 GB but needed only 666 MB.
- Job 9594820 requested 7-00:00:00 (Timelimit) but ran only 09:46:05 (Elapsed). This job could run in the 1day QoS and start earlier.
- Job 9737217 requested 6 hours and ran only 00:01:27. This job should be submitted to the 30min QoS.
While the job is still running
The information obtained while the job is running is not always accurate, as it is a snapshot of the status information that SLURM gathers with a certain frequency. Therefore, if your job has sporadic big changes in its resource usage, these commands might not catch them. Nevertheless, they are useful to get an impression of the overall evolution of your jobs.
Info
A user is allowed to ssh to a node while they have a running job. This is a way to do some real-time checking. Of course, the available resources are still limited to those requested by the running job.
squeue
This is the simplest, but very convenient and most-used, tool to check the status of your jobs.
squeue provides the full list of jobs (PENDING and RUNNING) that are present in the queue at a certain moment. It provides the full list by default, but this can be changed using the -u option:
$ squeue -u cabezon
JOBID PARTITION NAME USER STATE TIME TIME_LIMIT QOS NODELIST(REASON)
2025256 scicore test cabezon PENDING 0:00 1:00 30min (None)
$ squeue -u cabezon
JOBID PARTITION NAME USER STATE TIME TIME_LIMIT QOS NODELIST(REASON)
2025257 scicore test cabezon RUNNING 0:05 1:00 30min shi28
2025258 scicore test cabezon RUNNING 0:05 1:00 30min shi36
2025259 scicore test cabezon RUNNING 0:05 1:00 30min shi27
2025256 scicore test cabezon RUNNING 0:05 1:00 30min shi27
The first example shows the outcome when the job is still waiting in the queue (PENDING), while the latter shows a running job (which in fact is an array job with 4 tasks).
There are many options to modify the format of the output, which are accessible via man squeue.
Tip
Most of the time you want squeue to show only your jobs. To make this the default, simply add an alias to your .bashrc file (which is in your home directory):
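alias squeue='squeue -u <username>'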
Here <username> is your username. Then log out and log in again, or source the .bashrc file with: source ~/.bashrc
You can always retrieve the full list with squeue -a
scontrol
This command can do many things to control your currently pending/running jobs, but we will focus here on its informative options.
$ scontrol show jobid -dd 2033257
JobId=2033257 JobName=test
UserId=cabezon(38334) GroupId=scicore(3731) MCS_label=N/A
Priority=1772 Nice=0 Account=scicore QOS=30min
JobState=RUNNING Reason=None Dependency=(null)
Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
DerivedExitCode=0:0
RunTime=00:00:06 TimeLimit=00:01:00 TimeMin=N/A
SubmitTime=2017-07-12T11:05:10 EligibleTime=2017-07-12T11:05:10
StartTime=2017-07-12T11:05:20 EndTime=2017-07-12T11:06:20 Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
Partition=scicore AllocNode:Sid=login01:19949
ReqNodeList=(null) ExcNodeList=(null)
NodeList=shi38
BatchHost=shi38
NumNodes=1 NumCPUs=2 NumTasks=1 CPUs/Task=2 ReqB:S:C:T=0:0:*:*
TRES=cpu=2,mem=2000M,node=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
Nodes=shi38 CPU_IDs=8-9 Mem=2000 GRES_IDX=
MinCPUsNode=2 MinMemoryCPU=1000M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
Gres=(null) Reservation=(null)
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/scicore/home/scicore/cabezon/testing/launch.sh
WorkDir=/scicore/home/scicore/cabezon/testing
StdErr=/scicore/home/scicore/cabezon/testing/../out2033257
StdIn=/dev/null
StdOut=/scicore/home/scicore/cabezon/testing/../out2033257
Power=
BatchScript=
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --cpus-per-task=2
#SBATCH --time=00:00:10
#SBATCH --output=../out%j
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=ruben.cabezon@unibas.ch
##SBATCH --array=1-4
#SBATCH --qos=30min
time sleep 5
scontrol can be slow and other commands are recommended, like sstat or squeue, but sometimes it is useful to have access to this information. Note that the -dd option provides detailed information, including the actual script launched.
Info
The information that scontrol shows will be deleted some minutes after the job has finished. Therefore it is meant to be used while the job is still in the queue or running.
sstat
sstat provides information similar to sacct -l, but only while the job is still running. It can be applied to the batch step:
$sstat 2025177.batch
JobID MaxVMSize MaxVMSizeNode MaxVMSizeTask AveVMSize MaxRSS MaxRSSNode MaxRSSTask AveRSS MaxPag
es MaxPagesNode MaxPagesTask AvePages MinCPU MinCPUNode MinCPUTask AveCPU NTasks AveCPUFreq ReqCPUFre
qMin ReqCPUFreqMax ReqCPUFreqGov ConsumedEnergy MaxDiskRead MaxDiskReadNode MaxDiskReadTask AveDiskRead MaxDiskWr
ite MaxDiskWriteNode MaxDiskWriteTask AveDiskWrite
------------ ---------- -------------- -------------- ---------- ---------- ---------- ---------- ---------- ------
-- ------------ -------------- ---------- ---------- ---------- ---------- ---------- -------- ---------- ---------
---- ------------- ------------- -------------- ------------ --------------- --------------- ------------ ---------
--- ---------------- ---------------- ------------
2025177.bat+ 221048K shi26 0 221048K 2064K shi26 0 2064K
0 shi26 0 0 01:00.000 shi26 0 01:00.000 1 2.24G Unk
nown Unknown Unknown 0 0.08M shi26 0 0.08M 6719.
22M shi26 0 6719.22M
As can be seen, it also provides a lot of information that is not easy to read. Fortunately, sstat supports options similar to those of sacct. Hence, the -o option can be used to select specific fields.
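For example, to display only a few selected fields for the batch step of the job shown above:
sstat -j 2025177.batch -o JobID,MaxRSS,AveCPU,MaxDiskWrite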