Software Management
In general, users are encouraged to install and manage their own software and environments in their user space on the sciCORE cluster. For some use cases, however, pre-installed software is available through a module system.
Module System
Pre-installed software is made available through a framework called environment modules. By default, no module is loaded when you log in, but the module commands can easily be added to your .bashrc
or .bash_profile
to automatically load frequently used programs after login.
We have 3 major software stacks available in the sciCORE HPC Cluster:
- sciCORE EasyBuild
  - This software stack is loaded by default.
  - Activation Command: enable-software-stack-scicore
- Compute Canada 2023
  - Compute Canada provides a software stack tailored for high-performance computing (HPC) applications. Here is the link to its detailed documentation.
  - Activation Command: enable-software-stack-compute-canada
- EESSI 2023.06
  - EESSI stands for European Environment for Scientific Software Installations. The EESSI software stack is designed to facilitate collaboration and software deployment across European research institutions. Here is the official link to the documentation.
  - Activation Command: enable-software-stack-eessi
Using the Module System
To view the list of available software modules, use:
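ml av # or: module avail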
Note
All modules are named in the format <softwareName>/<version>-<toolchain>, where <toolchain> optionally refers to the toolchain used for compiling the software. See “Compiling Software” for more information on toolchains.
Tip
You can replace module with its alias ml for quicker typing.
To see available versions of a specific software package, use
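ml spider <softwareName>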
Finally, to load a specific software module, run:
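ml <softwareName>/<version>-<toolchain>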
For example, to load the R
module version 4.4.2
built with the foss-2024a
toolchain, you would run:
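ml R/4.4.2-foss-2024a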
Strictly speaking, running ml R will also load some version of the R software. However, we recommend always including the version and toolchain (when applicable) in the module load command for clarity and reproducibility.
Note
To activate a specific software stack, use its respective activation command. Once activated, the modules from the selected software stack will be available for use within the environment. Users can switch between software stacks based on their requirements using the provided activation commands.
Example:
enable-software-stack-compute-canada # Activate Compute Canada stack
ml av # List available modules in the Compute Canada stack
enable-software-stack-scicore # Switch back to sciCORE EasyBuild stack
ml av # List available modules in the sciCORE EasyBuild stack
Warning
If you load modules automatically via .bashrc, be aware that those same modules will also be loaded when your jobs start on the compute nodes. This can lead to conflicts with other modules needed to run specific jobs. To avoid this, you can write ml purge at the top of your SLURM scripts and then load all needed modules explicitly.
Module command reference
The key module commands are:
- module avail (or ml av): list available modules
- module load <softwareName> (or ml load <softwareName>): load the module <softwareName>
- module list (or ml list): list currently loaded modules
- module spider <keyword> (or ml spider <keyword>): search for modules with the keyword in their name or description
- module help (or ml help): list other module commands
- module unload <softwareName> (or ml -<softwareName>): unload the module <softwareName>
- module purge (or ml purge): unload all loaded modules
Python Environment
When working with Python software, we recommend creating virtual environments either on a per-project basis or for specific package collections that get reused in different contexts. This avoids conflicts between different Python packages and versions and improves the reproducibility of your work.
In the sciCORE cluster you can manage your virtual environments much like you would on your local machine. You can choose your favorite package manager, be it conda
, mamba
, uv
, pixi
, etc.
An environment for a specific research project
Let us say you are working on a research project called my_project
and you want to write some Python code for it. Let us create a new directory for it and include a single Python script called my_script.py
:
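mkdir my_project
cd my_project
touch my_script.py # an empty script for now; add your Python code here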
If you have conda
or mamba
installed, you can proceed as follows:
conda create -n my_project python=3.10 # or some other version
conda activate my_project
pip install <package_name> # install packages you need
python my_script.py # run your scripts within the environment
Note
If you are using mamba
, replace conda
with mamba
in the commands above.
If you have uv
installed, you can create a virtual environment like this:
uv init
uv add <package_name> # install packages you need
uv run python my_script.py # run your scripts within the environment
Info
Different package managers will manage environments in different ways. For
example, conda
will collect all your environments in a single location,
typically ~/miniconda3/envs/
, while uv
will create a .venv
directory
for the environment in the current working directory. When in doubt, check
the documentation of your package manager for details.
Environments for specific package collections
Some environment managers, such as conda or pixi, allow you to create environments that are “globally visible”. These are intended to contain tool sets that you could use for general purposes, such as quick data exploration or analysis, or for specific tasks that you perform frequently across different projects.
Any conda environment you create behaves like this by default, because you can activate it from anywhere on the machine. For example, you can create an environment called data_analysis like this:
conda create -n data_analysis python=3.10 # or some other version
conda activate data_analysis
pip install pandas numpy matplotlib ipython # install packages you need
Then whenever you activate the data_analysis
environment, you have access to pandas
, numpy
, and matplotlib
in your Python scripts, no matter where you are in the file system.
For pixi
, the syntax to reach a similar result is:
pixi global install --environment data_analysis --expose jupyter --expose ipython jupyter numpy pandas matplotlib ipython
Make sure to check if your package manager supports a feature like this.
Particulars of Jupyter notebooks on Open OnDemand
Because Jupyter on Open OnDemand (OOD) is launched from a central process, it does not automatically see the kernels from the Python environments you create. So you need to make sure to follow these steps:
1. Install the ipykernel
package in your Python environment
With conda
or mamba
, you can do this by running:
conda activate my_project # or the name of your environment
conda install ipykernel # or pip install ipykernel
With uv
or pixi
:
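uv add ipykernel   # when using uv
pixi add ipykernel # when using pixi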
2. (Not necessary for conda
or mamba
environments) Register the kernel with Jupyter
If your environment manager installs the ipykernel
package in a .venv
folder within the project’s directory, you need to manually register this kernel with Jupyter.
For example, if you are using uv
, navigate to your project directory and run:
uv run python -m ipykernel install --user --name=<project_name> --display-name "Python [uv env: <project_name>]"
where <project_name>
is the name of your project.
Info
If you use some other package manager, adapt the command accordingly. The important part is the one beginning with python -m ipykernel install.
If you open a new Jupyter notebook on OOD you should now be able to see
"Python [uv env: <project_name>]"
as an option in the kernel selection menu in the top right corner of the notebook interface.
Note
You can technically use any --name
and --display-name
you want, but it is recommended to include the name of your project for consistency and clarity. The --name is used internally by Jupyter to identify the kernel, while the --display-name is what you will see in the kernel selection menu.
Info
This step is not needed for conda
or mamba
environments because there is a plugin on the OOD Jupyter server that automatically detects and registers all conda
environments with the ipykernel
package installed. So you can skip this step if you are using conda
or mamba
.
R Environment
R is available on the cluster and can be loaded via the module system. You can explore all available R versions with
ml spider R
To load a specific version of R, use the following pattern:
ml R/<version>
For example, to load R/4.3.2-foss-2023a
, you would use:
ml R/4.3.2-foss-2023a
After loading the module, you can check that the R executable is available by running:
which R
R scripts in SLURM jobs
To run R scripts from inside a SLURM job, the best method is to use the
Rscript
binary. For example:
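ml R/4.3.2-foss-2023a # load an R module first (example version)
Rscript my_script.R   # my_script.R is a placeholder for your own script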
You should specify within your R script whether you want to save any files or the workspace. Print statements are directed to STDOUT.
Parallelism in R
Some R libraries implement functions that can make use of parallelism to accelerate calculations. The default behavior of these functions is library-specific, with some libraries assuming by default that all the resources on the machine (i.e. all CPUs) can be used by the library. This is a poor design choice which is unfortunately fairly common in the R ecosystem.
When an R script makes use of parallelism in functions, it is the responsibility of the user to verify that the number of cores used by R corresponds to the number of cores reserved with the SLURM script submission. Some users have crashed compute nodes on the cluster because they didn’t understand the behavior of the program they were using. R functions will often have an option that allows specifying the number of cores to use; this can be matched with the SLURM variable $SLURM_CPUS_PER_TASK.
There are several approaches to parallelism in R. We recommend the use of the packages parallel
and foreach
. One can also submit R jobs to the cluster using rslurm
.
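As a minimal sketch (the module version, script name, and core count below are placeholders), a SLURM script can expose the reserved core count to R through $SLURM_CPUS_PER_TASK:
#!/bin/bash
#SBATCH --job-name=r-parallel
#SBATCH --cpus-per-task=4

ml R/4.3.2-foss-2023a # placeholder version

# Inside the R script, read the reserved core count with
# as.integer(Sys.getenv("SLURM_CPUS_PER_TASK")) and pass it to,
# e.g., parallel::makeCluster() or mclapply(mc.cores = ...).
Rscript my_parallel_script.R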
RStudio Server
Users can run R code interactively on RStudio Server, which is available as an app on Open OnDemand. See “Interactive Computing” for more information on interactive computing on sciCORE.
Installing R Packages
All sciCORE users have the ability to download and install their own R packages. This works in the same way whether using R from the RStudio Server or from a shell session.
You can determine the path for installed packages using the .libPaths()
function in R. Common packages maintained by the system administrators are normally installed along with the software build (e.g. /scicore/soft/apps/R/3.6.0-foss-2018b/lib64/R/library
) whereas user-installed packages end up in the home folder (e.g. /scicore/home/<groupid>/<userid>/R/x86_64-pc-linux-gnu-library/3.6
).
There are various methods for installing R packages, which depend on the code itself and the repository where it lives. Normally, CRAN is the main source to install packages, using the
install.packages()
function. Bioconductor packages are installed using the
BiocManager::install()
function. If you have any questions about installing R packages or run into problems during compilation, please contact us via our Help Center.
Shiny Apps on RStudio
Shiny is a web application framework for R that enables you to turn your analyses into interactive web applications without requiring HTML, CSS, or JavaScript knowledge. Shiny apps can be built using RStudio, an integrated development environment (IDE) for R. Within sciCORE you need to use Open OnDemand (OOD) to connect to the RStudio Server and run Shiny apps. By default, users place their Shiny apps in the ShinyApps/ folder.
These are the steps to use Shiny Apps:
1. Connect to Open OnDemand (OOD)
2. Start the RStudio Server
3. Load the Shiny library
4. Run the available Shiny Apps from their folders
Info
The default directory for Shiny Apps is $HOME/ShinyApps/. To run an app located there, such as 04_mpg, use the runApp() function and specify the path to the app folder. For example:
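runApp("~/ShinyApps/04_mpg") # run from the R console in RStudio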
MATLAB
Interactively
To run MATLAB interactively, we recommend opening a sciCORE Desktop session on Open OnDemand.
Tip
When your sciCORE Desktop session is ready, you can adjust the “Compression” and “Image quality” sliders to get a better visual experience. This tip applies whenever you are using sciCORE Desktop for visuals-heavy workflows.
From within the sciCORE Desktop session, MATLAB can be loaded via the module system. Open a terminal, then explore all available MATLAB versions with:
ml spider MATLAB
To load a specific version of MATLAB, use the following pattern:
ml MATLAB/<version>
For example, to load MATLAB 2023b, you would use:
ml MATLAB/2023b
To run MATLAB with a graphical user interface (GUI) run the following from the command line:
matlab
To have a command-line interface (CLI) with Java Virtual Machine (JVM):
matlab -nodesktop
To run MATLAB without a JVM, you can use:
matlab -nojvm
Info
The options -nodesktop and -nojvm differ in that the first one still starts the JVM, so graphics functionality will still work despite not initializing the MATLAB desktop. The second won’t work with graphics functions, as it cannot access the Java API.
In SLURM batch jobs
You can submit MATLAB scripts to the computing nodes of the cluster as a regular SLURM job. You only need to load the corresponding MATLAB module and make sure your script doesn’t need a GUI (i.e. it can be run directly from the command line).
To run a MATLAB script from the command line you can use:
matlab -nosplash -nodesktop -r "myscript"
Note
Note that the extension .m (typical for MATLAB scripts) is omitted and that the name of the script goes between quotation marks.
This approach will still open MATLAB’s command window and print everything to STDOUT until the script finishes.
An alternative is to use the -batch
flag, which will start MATLAB without splash screens, without a desktop, and won’t open MATLAB’s command window. It will also exit MATLAB automatically after completion. Namely, it will run the script non-interactively:
matlab -batch "myscript"
Warning
Be aware, though, that the -batch
flag disables certain functionalities of
MATLAB.
A final option is to compile your script using the MATLAB compiler and runtime libraries. This is the preferred option for compute-intensive jobs:
mcc -m -R -nodisplay myscript.m
This will produce a myscript
binary executable, as well as a wrapper script run_myscript.sh
. To run the executable you just need to run the wrapper script, providing the path to the root MATLAB installation.
Example:
./run_myscript.sh <path_to_matlab_root>
Tip
The path to the root MATLAB installation is different for each MATLAB version. To find this path you can run
which matlab
from the command line and extract the part before the last /bin/
in the output.
For example, if the output is:
/scicore/soft/easybuild/apps/MATLAB/2022b/bin/matlab
then <path_to_matlab_root>
would be:
/scicore/soft/easybuild/apps/MATLAB/2022b
Available licenses
The university maintains a pool of MATLAB licenses. When a MATLAB instance is launched, it connects to the license server (FlexLM), which reserves a license (or several if you are using toolboxes) from the pool. The licenses are released (made available to other users) once MATLAB terminates.
License pool exhaustion
When a MATLAB script is run as a cluster job, there must be free licenses available for it to execute. On some occasions, a cluster job may be killed because the license pool at the university is temporarily exhausted.
Conda issues
In SLURM scripts
For a lot of environment managers, you can call your scripts from within a SLURM script as you would normally do from the command line. However, SLURM jobs begin as a sub-shell on a compute node, which does not initialize using the full set of config files in your home folder as a login shell would. Thus there is a small caveat for conda/mamba environments: you need to add the snippet eval "$(conda shell.bash hook)" to the beginning of your SLURM script so that the conda command becomes available on the compute node. For example:
#!/bin/bash
#SBATCH --job-name=my_job
# ... other SLURM options ...
eval "$(conda shell.bash hook)" # make conda command available
conda activate my_project # activate the environment
python my_script.py # run your script
Licensing issues
TL;DR
The “defaults” and “main” conda channels are covered by a license agreement which charges user fees depending on the use case and size of your organization. We recommend removing the default channel and using conda-forge instead.
Conda is an open-source project that maintains the conda package manager, which facilitates the management of Python environments and the installation of many software packages into these environments. The software packages are made available through several different “channels”, essentially code repositories. There are many ways to install the conda package manager, each of which has somewhat different configurations, including which channels are allowed and their relative priority. For example, if you install using miniconda or the entire “anaconda” package, you might end up with the “defaults” channel at the highest priority. This particular channel/repository is maintained by a company called Anaconda Inc (formerly Continuum Analytics), which charges fees for its use. The terminology and relationships are indeed complicated; they are addressed to some degree in this blog post from Anaconda Inc.
You can check which channels are configured in your conda setup in the following file:
cat $HOME/.condarc
Alternatively, you can see your channels with the following commands:
conda config --show-sources # shows configuration source, normally $HOME/.condarc
conda config --show channels
We recommend removing the defaults channel, and we may need to block access to this channel from sciCORE systems. Please be aware, however, that this may have an impact on your existing conda environments. Specifically, update or (re)installation processes may need to be performed and in some edge cases may break your environments. Please don’t hesitate to get in touch with the sciCORE team should you need assistance.
You can set conda-forge as the top priority channel with the following command:
conda config --add channels conda-forge
You can remove the defaults channel with the following command:
conda config --remove channels defaults
Tip
An alternative way to get around this issue is to install conda with the
miniforge installer, or use the
pixi
environment manager. Both install packages
from the conda-forge repository by default.
Compiling Software
If you need to compile your own programs, we recommend using toolchains, for instance:
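ml foss/2024a # an example toolchain module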
Info
The term foss
stands for Free and Open-Source Software. Run ml spider foss
to explore available versions.
A toolchain is a collection of tools used for building software. They typically include:
- a compiler collection providing basic language support (C/Fortran/C++)
- an MPI implementation for multi-node communication
- a set of linear algebra libraries (FFTW/BLAS/LAPACK) for accelerated math
Info
Many modules on our cluster include the toolchain that was used to build them in
their version tag (e.g. Biopython/1.84-foss-2023b
was built with the
foss/2023b
toolchain).
Mixing components from different toolchains almost always leads to problems. For example, if you mix Intel MPI modules with OpenMPI modules you can guarantee your program will not run (even if you manage to get it to compile). We recommend you always use modules from the same (sub)toolchains.
To see all available toolchains, run:
ml av toolchain
Containers
You can use Apptainer (formerly Singularity) to download and run docker images from public docker registries.
For example, to download the docker/whalesay
image from the Docker Hub, you can run:
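apptainer pull docker://docker/whalesay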
This will create a file called whalesay_latest.sif
in the current directory.
To run the image, you can use:
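apptainer run whalesay_latest.sif cowsay "Hi from sciCORE"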
This will output:
<Some warnings>
_________________
< Hi from sciCORE >
-----------------
\
\
\
## .
## ## ## ==
## ## ## ## ===
/""""""""""""""""___/ ===
~~~ {~~ ~~~~ ~~~ ~~~~ ~~ ~ / ===- ~~~
\______ o __/
\ \ __/
\____\______/
You can also pull images from other container registries that abide by the Open Container Initiative (OCI) standards. See the apptainer docs on this for more information. For example, to pull uv
from the ghcr.io
registry, you can run:
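apptainer pull docker://ghcr.io/astral-sh/uv:latest # image path assumed; adjust to the image you need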
To learn about other ways of interacting with apptainer, run:
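apptainer help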
Workflow management
Job dependencies
Establishing dependencies between different SLURM jobs is a good way to split up parts of a pipeline that have very different resource requirements. In this way, the resources are used more efficiently and the scheduler can allocate the jobs more easily, which in the end can shorten the amount of time that you have to wait in the queue.
The way to work with dependent SLURM jobs is to launch them with the --dependency directive and specify the condition that has to be met so that the dependent job can start.
Dependent jobs will be allocated in the queue but they will not start until the specified condition is met and the resources are available.
| Condition | Explanation |
|---|---|
| after:jobid[:jobid] | job begins after the specified jobs have started |
| afterany:jobid[:jobid] | job begins after the specified jobs have terminated |
| afternotok:jobid[:jobid] | job begins after the specified jobs have failed |
| afterok:jobid[:jobid] | job begins after the specified jobs have finished successfully |
| singleton | job begins after all previously launched jobs with the same name and user have ended |
Note
Job arrays can also be submitted with dependencies, and a job can depend on an array job. In the latter case, the job will start executing once all tasks in the job array have met the dependency criterion (e.g., have terminated for afterany).
Practical examples
Assume you have a series of job scripts, job1.sh, job2.sh, …, job9.sh, that depend on each other in some way.
The first job to be launched has no dependencies. It needs a standard sbatch
command and we store the job ID in a variable that will be used for the jobs that depend on job1
:
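jid1=$(sbatch --parsable job1.sh)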
Info
To make it easier to grab the job ID of a submitted job, we add the --parsable flag to SLURM jobs that other jobs depend on. The --parsable option makes sbatch return the job ID only.
Multiple jobs can depend on a single job. If job2
and job3
depend on job1
to finish, no matter the status, we can launch them with the following commands:
jid2=$(sbatch --parsable --dependency=afterany:$jid1 job2.sh)
jid3=$(sbatch --parsable --dependency=afterany:$jid1 job3.sh)
Similarly, a single job can depend on multiple jobs. If job4
depends directly on job2
and job3
(thus indirectly on job1
) to finish, we can launch it with:
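jid4=$(sbatch --parsable --dependency=afterany:$jid2:$jid3 job4.sh)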
Job arrays can also be submitted with dependencies. If job5
is a job array that depends on job4
, we can launch it like this:
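jid5=$(sbatch --parsable --dependency=afterany:$jid4 --array=1-10 job5.sh) # array range 1-10 is just an example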
A single job can depend on an array job. Here, job6
will start when all array jobs from job5
have finished successfully:
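jid6=$(sbatch --parsable --dependency=afterok:$jid5 job6.sh)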
A single job can depend on all jobs by the same user with the same name. Here, job7
and job8
depend on job6
to finish successfully, and both are launched with the same name (“dtest”). We make job9
depend on job7
and job8
by making it depend on any job with the name “dtest”.
jid7=$(sbatch --parsable --dependency=afterok:$jid6 --job-name=dtest job7.sh)
jid8=$(sbatch --parsable --dependency=afterok:$jid6 --job-name=dtest job8.sh)
sbatch --dependency=singleton --job-name=dtest job9.sh
Finally, you can show the dependencies of your queued jobs like so:
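squeue -u $USER -o "%.10i %.20j %.10T %E" # %E shows a job's remaining dependencies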
Tip
It is possible to make a job depend on more than one dependency type.
For example, in the following, job4 launches when job2 has finished successfully and job3 has failed:
jid4=$(sbatch --parsable --dependency=afterok:$jid2,afternotok:$jid3 job4.sh)
Separating the dependency types by ‘,’ means that all dependencies must be met. Separating them by ‘?’ means that either one suffices.
Do it yourself
The following proposes a simple set of scripts to test the concepts showcased above. Each script is set up to run for 15 seconds.
job1.sh:
#! /bin/sh
sleep 15
ls . > job1_output.txt
job2.sh:
#! /bin/sh
## takes input from job1
sleep 15
wc -l job1_output.txt > job2_output.txt
job3.sh:
#! /bin/sh
## takes input from job1
sleep 15
wc -c job1_output.txt > job3_output.txt
job4.sh:
#! /bin/sh
## takes input from job2 and job3
sleep 15
cat job2_output.txt job3_output.txt > job4_output.txt
sbatch job1.sh
outputs Submitted batch job 53481198
sbatch --dependency=afterok:53481198 job2.sh
outputs Submitted batch job 53481217
squeue -u $USER
reveals:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
53481239 scicore job2.sh duchem00 PD 0:00 1 (Dependency)
53481237 scicore job1.sh duchem00 R 0:13 1 sca29
In practice, it is often more convenient to output just the job ID using the --parsable flag:
jid1=$(sbatch --parsable job1.sh)
jid2=$(sbatch --parsable --dependency=afterok:$jid1 job2.sh)
You can have more than one job dependent on a single job:
jid2=$(sbatch --parsable --dependency=afterok:$jid1 job2.sh)
jid3=$(sbatch --parsable --dependency=afterok:$jid1 job3.sh)
And you can have a job depend on more than one job:
jid4=$(sbatch --parsable --dependency=afterok:$jid2:$jid3 job4.sh)
Snakemake
Important note
Running a pipeline on the cluster often involves a master process on the node where you logged in for the duration of the pipeline.
In order to reduce the load on the login node, and to prevent interruption of your pipeline should your session be interrupted, we recommend running the pipelines on the vscode node (vscode.scicore.unibas.ch
) with a tmux persistent session.
Alternatively, you can put your snakemake master process in a SLURM script.
Snakemake is a workflow management system tool to create reproducible and scalable data analyses.
In order to run snakemake workflows on sciCORE, you need to set up a config.yaml profile where you specify:
- executor: slurm: sets up rule execution in SLURM
- use-envmodules: true: sets up the module system
- resources for each rule: e.g. set-resources: myrule:mem=500MB
You also need to specify the modules to load for each rule in the Snakefile using the envmodules option.
Follow the relevant documentation from slurm-executor-plugin and snakemake to see exactly which options to set for each use case.
Note
We recommend running snakemake -n
to create a dry-run of the workflow and identify which rules are going to be invoked.
Here is an example config.yaml file for a workflow with 2 rules (fastqc and multiqc), each needing their own module:
executor: slurm # sets up rules execution in SLURM
jobs: 10 # maximum number of concurrent jobs
use-envmodules: true # sets up module system
latency-wait: 30 # time, in seconds, before checking for result
                 # files after a job finishes successfully.
                 # This is useful when there is a bit of
                 # filesystem latency.
set-resources:
  multiqc:
    mem: 500MB # reserved memory
    runtime: 10 # reserved runtime in minutes
    threads: 1 # reserved cpus
    slurm_extra: "--qos=30min" # any extra slurm options,
                               # here used to set the queue of service
  fastqc:
    mem: 1GB
    runtime: 10
    threads: 1
    slurm_extra: "--qos=30min"
And the Snakefile
specifies the modules to use for each rule:
...
rule fastqc:
    ...
    envmodules:
        "FastQC/0.12.1-Java-21" # module to load at rule start-up
    ...

rule multiqc:
    ...
    envmodules:
        "MultiQC/1.22.3-foss-2024a" # module to load at rule start-up
    ...
...
Presuming the Snakefile
, input data and config.yaml
are in the same folder you launch the workflow with: snakemake --workflow-profile .
Warning
It may happen that the pipeline fails because of filesystem latency issues,
in which case you should typically see some line like this in your error message:
Job 5 completed successfully, but some output files are missing.
In that case consider increasing the latency-wait
option to a higher number.
Warning
Beware that snakemake exports your environment when submitting jobs. This means that any environment variable you have defined in your session will get passed down to the various steps of the workflow.
Rules running locally
Some rules are unsuited to run as a job on the cluster, for instance when a task requires internet access.
For this, snakemake provides the localrule keyword in the Snakefile.
It can either be defined at the rule level:
rule foo:
    ...
    localrule: True # rule foo will be executed locally rather than in a SLURM job
    ...
Or several rules can be specified at once:
localrules: foo, bar # rules foo and bar will be executed locally rather than in a SLURM job

rule foo:
    ...

rule bar:
    ...
You can read more on the subject in the snakemake documentation
Do it yourself:
The configuration we show above corresponds to a simple bioinformatics pipeline where we generate some html Quality Control reports from a set of sequencing result files.
Here are the steps to execute if you want to test it for yourself:
First, create a new folder on sciCORE and move into it:
mkdir snakemake_test
cd snakemake_test
Then download some data (NB: these are small files):
wget https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/illumina/amplicon/sample1_R1.fastq.gz
wget https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/illumina/amplicon/sample1_R2.fastq.gz
wget https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/illumina/amplicon/sample2_R1.fastq.gz
wget https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/illumina/amplicon/sample2_R2.fastq.gz
Next, create a file called Snakefile
with the following content:
SAMPLES = ["sample1_R1",
"sample1_R2",
"sample2_R1",
"sample2_R2"]
rule all:
input:
"results/multiqc/multiqc_report.html"
rule fastqc:
input:
"{sample}.fastq.gz"
output:
"results/fastqc/{sample}_fastqc.html"
envmodules:
"FastQC/0.12.1-Java-21" # module to load at rule start-up
shell:
"fastqc {input} -o results/fastqc/"
rule multiqc:
input:
fqc=expand("results/fastqc/{sample}_fastqc.html", sample=SAMPLES)
output:
"results/multiqc/multiqc_report.html"
envmodules:
"MultiQC/1.22.3-foss-2024a" # module to load at rule start-up
shell:
"multiqc results/fastqc/ -o results/multiqc/"
Also create a file named config.yaml
, with the content:
executor: slurm # sets up rules execution in SLURM
jobs: 10 # maximum number of concurrent jobs
use-envmodules: true # sets up module system
latency-wait: 30 # time, in seconds, before checking for result
                 # files after a job finishes successfully.
                 # This is useful when there is a bit of
                 # filesystem latency.
set-resources:
  multiqc:
    mem: 500MB # reserved memory
    runtime: 10 # reserved runtime in minutes
    threads: 1 # reserved cpus
    slurm_extra: "--qos=30min" # any extra slurm options,
                               # here used to set the queue of service
  fastqc:
    mem: 1GB
    runtime: 10
    threads: 1
    slurm_extra: "--qos=30min"
Load the snakemake module:
ml snakemake/9.3.5-foss-2025a
Finally, run the pipeline with:
snakemake --workflow-profile .
Where the option --workflow-profile .
specifies the folder where you have the config.yaml
file (here, the working directory .
).
Nextflow
Important note
Running a pipeline on the cluster often involves a master process on the node where you logged in for the duration of the pipeline.
In order to reduce the load on the login node, and to prevent interruption of your pipeline should your session be interrupted, we recommend running the pipelines on the vscode node (vscode.scicore.unibas.ch
) with a tmux persistent session.
Alternatively, you can put your nextflow master process in a SLURM script.
Nextflow is a workflow system for creating scalable, portable, and reproducible workflows.
In order to run nextflow workflows on sciCORE, you need a nextflow.config file to specify the SLURM configuration for each process, such as which resources to reserve or which modules to load.
Here is an example nextflow.config
file for a workflow with 2 processes (FASTQC
and MULTIQC
), each needing to load their own module:
process.executor = 'slurm'
process {
    withName: FASTQC {
        module = 'FastQC/0.11.8-Java-1.8'
        cpus = 1
        memory = 1.GB
        clusterOptions = '--qos 30min'
    }
}

process {
    withName: MULTIQC {
        module = 'MultiQC/1.14-foss-2022a'
        cpus = 1
        memory = 1.GB
        clusterOptions = '--qos 30min'
    }
}
Where:
- cpus corresponds to SLURM's cpus-per-task
- queue corresponds to SLURM's partition
- clusterOptions: a generic way of adding options to the SLURM submission; here we use it to specify the qos.
Tip
You can specify multiple options by separating them with spaces. e.g.: clusterOptions = '--qos 30min --mail-type=END,FAIL --mail-user=<my.name>@unibas.ch'
.
Nextflow also lets you set a queue option, which corresponds to the SLURM partition.
More info:
- SLURM executor
- specify process specific config (eg, resources)
- loading modules
- nextflow configuration
Tip
To attach multiple modules to the same process, you can use something like: module = ['FastQC','MultiQC']
Do it yourself:
Get the data
wget https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/illumina/amplicon/sample1_R1.fastq.gz
wget https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/illumina/amplicon/sample1_R2.fastq.gz
wget https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/illumina/amplicon/sample2_R1.fastq.gz
wget https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/illumina/amplicon/sample2_R2.fastq.gz
simpleQC.nf
params.report_id = "multiqc_report"
process FASTQC {
publishDir "results/fastqc"
input:
path reads
output:
path "${reads.simpleName}_fastqc.zip", emit: zip
path "${reads.simpleName}_fastqc.html", emit: html
script:
"""
fastqc $reads
"""
}
process MULTIQC {
publishDir "results/multiqc"
input:
path '*'
val output_name
output:
path "${output_name}.html", emit: report
path "${output_name}_data", emit: data
script:
"""
multiqc . -n ${output_name}.html
"""
}
// Workflow block
workflow {
ch_fastq = channel.fromPath(params.fq) // Create a channel using parameter input
FASTQC(ch_fastq) // fastqc
MULTIQC(
FASTQC.out.zip.mix(FASTQC.out.html).collect(),
params.report_id
)
}
Workflow preview and configuration
ml Nextflow
nextflow run simpleQC.nf --fq "sample*.fastq.gz" -preview
output:
Nextflow 25.04.6 is available - Please consider updating your version to it
N E X T F L O W ~ version 24.10.4
Launching `simpleQC.nf` [big_mercator] DSL2 - revision: 0c7da237e5
[- ] FASTQC -
[- ] MULTIQC -
We are going to create a config file to specify all cluster-relevant information to nextflow.
If that file is named nextflow.config and is in your current directory when you launch the workflow, it will automatically be applied to the run.
Otherwise you can pass it to nextflow run with the -c option.
nextflow.config
process.executor = 'slurm'
process {
    withName: FASTQC {
        module = 'FastQC/0.11.8-Java-1.8'
        cpus = 1
        memory = 1.GB
        clusterOptions = '--qos 30min'
    }
}

process {
    withName: MULTIQC {
        module = 'MultiQC/1.14-foss-2022a'
        cpus = 1
        memory = 1.GB
        clusterOptions = '--qos 30min'
    }
}
where:
- cpus: SLURM cpus-per-task
- queue: SLURM partition
- clusterOptions: generic way of adding options to the SLURM submission, here we use it to specify the qos.
  - you can specify multiple options by separating them with spaces, e.g.: clusterOptions = '--qos 30min --mail-type=END,FAIL --mail-user=<my.name>@unibas.ch'
More info
- SLURM executor
- specify process-specific config (eg, resources)
- loading modules
  - to attach multiple modules, you can use something like: module = ['FastQC','MultiQC']
- nextflow configuration
Actually running the pipeline
nextflow run simpleQC.nf --fq "sample*.fastq.gz" -with-timeline -with-trace -with-report -with-dag
The options -with-timeline -with-trace -with-report -with-dag produce text and HTML reports which are all useful, in particular:
- report gives you resource usage details
- trace gives you the job IDs (column native_id, useful for debugging) and resource usage
Alternatively, the pipeline could be run from a SLURM script which you would submit with sbatch
:
#!/bin/bash
#SBATCH --job-name=nextflow-test
#SBATCH --cpus-per-task=1 #Number of cores to reserve
#SBATCH --mem-per-cpu=1G #Amount of RAM/core to reserve
#SBATCH --time=06:00:00 #Maximum allocated time
#SBATCH --qos=6hours #Selected queue to allocate your job
#SBATCH --output=nextflow_test.o
ml Nextflow
nextflow run simpleQC.nf --fq "sample*.fastq.gz" -with-timeline -with-trace -with-report -with-dag
nf-core
This subsection draws inspiration from the genotoul cluster nextflow-course.
Important note
This method requires running a master process on the node where you logged in for the duration of the pipeline.
In order to reduce the load on the login node, and to prevent interruption of your pipeline should your session be interrupted, we recommend running the pipelines on the vscode node (vscode.scicore.unibas.ch
) with a tmux persistent session.
TL;DR
- specify process.executor = 'slurm' in your nextflow.config
- use -profile=apptainer to handle the containers
- read about managing workflow resources; also use the recommendations listed above
nf-core is a global community effort to collect a curated set of open-source analysis pipelines built using Nextflow.
We will demonstrate how to use nf-core on sciCORE with the nf-core/demo pipeline, a small example workflow that runs FastQC, seqtk trim, and MultiQC on a test dataset.
On the login node, we can start by inspecting the pipeline:
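ml Nextflow
nextflow run nf-core/demo --help # downloads the pipeline and prints its parameters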
This will download the workflow files and display usage options.
nf-core pipelines generally have a test profile which specifies some simple input data.
This is useful for workflow configuration:
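nextflow inspect nf-core/demo -profile test --outdir results # lists the processes and their containers; the --outdir value is a placeholder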
output:
{
"processes": [
{
"name": "NFCORE_DEMO:DEMO:MULTIQC",
"container": "quay.io/biocontainers/multiqc:1.29--pyhdfd78af_0"
},
{
"name": "NFCORE_DEMO:DEMO:SEQTK_TRIM",
"container": "quay.io/biocontainers/seqtk:1.4--he4a0461_1"
},
{
"name": "NFCORE_DEMO:DEMO:FASTQC",
"container": "quay.io/biocontainers/fastqc:0.12.1--hdfd78af_0"
}
]
}
So there are 3 processes, and each is linked to a container on quay.io.
This means that we will not need to load specific modules for fastqc, multiqc, and seqtk, but rather just make sure that a compatible container runtime is available.
In the case of sciCORE, apptainer is directly available without having to load additional modules. We will just need to specify this to nextflow as another profile option.
But first, you have to set up a file named nextflow.config containing:
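process.executor = 'slurm'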
And then you can run:
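nextflow run nf-core/demo -profile test,apptainer --outdir results # test profile = built-in example data; apptainer profile = run processes in containers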
You can then look up the execution trace, dag, report, and timeline in the folder pipeline_info/
.
In particular, you can inspect the workflow's resource usage and extrapolate how much you may need for your own data.
The workflows are configured to require relatively sensible resources (generally in their conf/base.config file), but we highly recommend you have a look at their configuration.
See more on that subject in the nf-core documentation and in the section above.