Software Management

In general, users are encouraged to install and manage their own software and environments in their user space on the sciCORE cluster. For some use cases, however, pre-installed software is available through a module system.

Module System

Pre-installed software is made available through a framework called environment modules. By default, no module is loaded when you log in, but the module commands can easily be added to your .bashrc or .bash_profile to automatically load frequently used programs after login.
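
For example, a minimal snippet you could append to your ~/.bashrc (the module shown is only an illustration; pick the modules you actually use regularly):

# Automatically load frequently used modules at login
module load R/4.4.2-foss-2024a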

We have 3 major software stacks available in the sciCORE HPC Cluster:

  1. sciCORE EasyBuild

    • This software stack is loaded by default.
    • Activation Command: enable-software-stack-scicore
  2. Compute Canada 2023

    • Compute Canada provides a software stack tailored for high-performance computing (HPC) applications. See its detailed documentation for more information.
    • Activation Command: enable-software-stack-compute-canada
  3. EESSI 2023.06

    • EESSI stands for European Environment for Scientific Software Installations. The EESSI software stack is designed to facilitate collaboration and software deployment across European research institutions. See the official EESSI documentation for more information.
    • Activation Command: enable-software-stack-eessi

Using the Module System

To view the list of available software modules, use:

module av

Note

All modules are named in the format <softwareName>/<version>-<toolchain>, where <toolchain> optionally refers to the toolchain used for compiling the software. See “Compiling Software” for more information on toolchains.

Tip

You can replace module with its alias ml for quicker typing.

To see available versions of a specific software package, use:

ml spider <softwareName>

Finally, to load a specific software module, run:

ml <softwareName/version-toolchain>

For example, to load the R module version 4.4.2 built with the foss-2024a toolchain, you would run:

ml R/4.4.2-foss-2024a

Strictly speaking, running ml R will also load some version of R. However, we recommend always including the version and toolchain (when applicable) in the module load command for clarity and reproducibility.

Note

To activate a specific software stack, use its respective activation command. Once activated, the modules from the selected software stack will be available for use within the environment. Users can switch between software stacks based on their requirements using the provided activation commands.

Example:

enable-software-stack-compute-canada  # Activate Compute Canada stack
ml av  # List available modules in the Compute Canada stack
enable-software-stack-scicore  # Switch back to sciCORE EasyBuild stack
ml av  # List available modules in the sciCORE EasyBuild stack

Warning

If you load modules automatically via .bashrc, be aware that those same modules will also be loaded when your jobs start on the compute nodes. This can lead to conflicts with other modules needed to run specific jobs. To avoid this, you can add ml purge at the top of your SLURM scripts and then load all needed modules explicitly.
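
A minimal sketch of this pattern (module and script names are taken from the examples on this page):

#!/bin/bash
#SBATCH --job-name=my_job
# ... other SLURM options ...

ml purge                   # drop anything inherited from your login environment
ml R/4.4.2-foss-2024a      # load only what this job needs

Rscript myscript.R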

Module command reference

The key module commands are:

  • module avail (or ml av): list available modules

  • module load <softwareName> (or ml load <softwareName>): load the module <softwareName>

  • module list (or ml list): list currently loaded modules

  • module spider <keyword> (or ml spider <keyword>): search for modules with keyword in their name or description

  • module help (or ml help): list other module commands

  • module unload <softwareName> (or ml -<softwareName>): unload the module <softwareName>

  • module purge (or ml purge): unload all loaded modules
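
A typical interactive session combining these commands might look like this (module name as in the example above):

ml av                     # browse available modules
ml R/4.4.2-foss-2024a     # load a specific module
ml list                   # check what is currently loaded
ml -R                     # unload the R module again
ml purge                  # unload everything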

Python Environment

When working with Python software, we recommend creating virtual environments either on a per-project basis or for specific package collections that get reused in different contexts. This helps you avoid conflicts between different Python packages and versions and improves the reproducibility of your work.

In the sciCORE cluster you can manage your virtual environments much like you would on your local machine. You can choose your favorite package manager, be it conda, mamba, uv, pixi, etc.

An environment for a specific research project

Let us say you are working on a research project called my_project and you want to write some Python code for it. Let us create a new directory for it and include a single Python script called my_script.py:

mkdir my_project
cd my_project
touch my_script.py  # create the script you will run inside the environment

If you have conda or mamba installed, you can proceed as follows:

conda create -n my_project python=3.10  # or some other version
conda activate my_project
pip install <package_name>  # install packages you need
python my_script.py  # run your scripts within the environment

Note

If you are using mamba, replace conda with mamba in the commands above.

If you have uv installed, you can create a virtual environment like this:

uv init
uv add <package_name>  # install packages you need
uv run python my_script.py  # run your scripts within the environment

Info

Different package managers will manage environments in different ways. For example, conda will collect all your environments in a single location, typically ~/miniconda3/envs/, while uv will create a .venv directory for the environment in the current working directory. When in doubt, check the documentation of your package manager for details.
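
If you want to check where your environments actually live, a couple of quick commands help (assuming the my_project directory from the example above sits in your home folder):

conda env list            # conda/mamba: lists all environments and their locations
ls -a ~/my_project        # uv: the environment lives in a .venv folder inside the project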

Environments for specific package collections

Some environment managers, such as conda or pixi, allow you to create environments that are “globally visible”. These are intended to contain tool sets that you use for general purposes, such as quick data exploration or analysis, or for specific tasks that you perform frequently across different projects.

Any conda environment you create behaves like this by default, because you can activate it from anywhere on the machine. For example, you can create an environment called data_analysis like this:

conda create -n data_analysis python=3.10  # or some other version
conda activate data_analysis
pip install pandas numpy matplotlib ipython  # install packages you need

Then whenever you activate the data_analysis environment, you have access to pandas, numpy, and matplotlib in your Python scripts, no matter where you are in the file system.

For pixi, the syntax to reach a similar result is:

pixi global install --environment data_analysis --expose jupyter --expose ipython jupyter numpy pandas matplotlib ipython

Make sure to check if your package manager supports a feature like this.

Particulars of Jupyter notebooks on Open OnDemand

Because Jupyter on Open OnDemand (OOD) is launched from a central process, it does not automatically see the kernels from the Python environments you create. So you need to make sure to follow these steps:

1. Install the ipykernel package in your Python environment

With conda or mamba, you can do this by running:

conda activate my_project  # or the name of your environment
conda install ipykernel  # or pip install ipykernel

With uv or pixi:

uv add ipykernel  # or pixi add ipykernel

2. (Not necessary for conda or mamba environments) Register the kernel with Jupyter

If your environment manager installs the ipykernel package in a .venv folder within the project’s directory, you need to manually register this kernel with Jupyter.

For example, if you are using uv, navigate to your project directory and run:

uv run python -m ipykernel install --user --name=<project_name> --display-name "Python [uv env: <project_name>]"

where <project_name> is the name of your project.

Info

If you use some other package manager, adapt the command accordingly. The important part is the one beginning with python -m ipykernel install.

If you open a new Jupyter notebook on OOD you should now be able to see

"Python [uv env: <project_name>]"

as an option in the kernel selection menu in the top right corner of the notebook interface.

Note

You can technically use any --name and --display-name you want, but it is recommended to include the name of your project for consistency and clarity. The --name is used internally by Jupyter to identify the kernel, while the --display-name is what you will see in the kernel selection menu.

Info

This step is not needed for conda or mamba environments because there is a plugin on the OOD Jupyter server that automatically detects and registers all conda environments with the ipykernel package installed. So you can skip this step if you are using conda or mamba.

R Environment

R is available on the cluster and can be loaded via the module system. You can explore all available R versions with

ml spider R

To load a specific version of R, use the following pattern:

ml R/<version>

For example, to load R/4.3.2-foss-2023a, you would use:

ml R/4.3.2-foss-2023a

After loading the module, you can check that the R executable is available by running:

which R

R scripts in SLURM jobs

To run R scripts from inside a SLURM job, the best method is to use the Rscript binary. For example:

#!/bin/bash
#SBATCH ... slurm parameters

module load R/4.3.2-foss-2023a

Rscript myscript.R

You should specify within your R script whether you want to save any files or the workspace. Print statements are directed to STDOUT.

Parallelism in R

Some R libraries implement functions that can make use of parallelism to accelerate calculations. The default behavior of these functions is library-specific, with some libraries assuming by default that all the resources on the machine (i.e. all CPUs) can be used by the library. This is a poor design choice which is unfortunately fairly common in the R ecosystem.

When an R script makes use of parallelism, it is the responsibility of the user to verify that the number of cores used by R corresponds to the number of cores reserved in the SLURM job submission. Some users have crashed compute nodes on the cluster because they didn’t understand the behavior of the program they were using. R functions will often have an option that allows specifying the number of cores to use. This can be matched with the variable $SLURM_CPUS_PER_TASK.

There are several approaches to parallelism in R. We recommend the use of the packages parallel and foreach. One can also submit R jobs to the cluster using rslurm.
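
As a sketch of how to match the two, the following SLURM script passes the reserved core count to the parallel package (the inline R code is illustrative, not a sciCORE-provided template):

#!/bin/bash
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=2G
#SBATCH --time=00:30:00

module load R/4.3.2-foss-2023a

# Read the number of reserved cores from SLURM and use exactly that many workers
Rscript -e '
  library(parallel)
  n_cores <- as.integer(Sys.getenv("SLURM_CPUS_PER_TASK", unset = "1"))
  cl <- makeCluster(n_cores)
  res <- parLapply(cl, 1:100, function(i) i^2)
  stopCluster(cl)
  print(length(res))
'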

RStudio Server

Users can run R code interactively on RStudio Server, which is available as an app on Open OnDemand. See “Interactive Computing” for more information on interactive computing on sciCORE.

Installing R Packages

All sciCORE users have the ability to download and install their own R packages. This works in the same way whether using R from the RStudio Server or from a shell session.

You can determine the path for installed packages using the .libPaths() function in R. Common packages maintained by the system administrators are normally installed along with the software build (e.g. /scicore/soft/apps/R/3.6.0-foss-2018b/lib64/R/library) whereas user-installed packages end up in the home folder (e.g. /scicore/home/<groupid>/<userid>/R/x86_64-pc-linux-gnu-library/3.6).

There are various methods for installing R packages, which depend on the code itself and the repository where it lives. Normally, CRAN is the main source to install packages, using the

install.packages()

function. Bioconductor packages are installed using the

BiocManager::install()

function. If you have any questions about installing R packages or run into problems during compilation, please contact us via our Help Center.
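
Packages can also be installed non-interactively from a shell session; a sketch (the package names are only illustrative):

ml R/4.3.2-foss-2023a
# CRAN package
Rscript -e 'install.packages("data.table", repos = "https://cloud.r-project.org")'
# Bioconductor package (installs BiocManager first if it is missing)
Rscript -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager", repos = "https://cloud.r-project.org"); BiocManager::install("Biostrings", update = FALSE)'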

Shiny Apps on RStudio

Shiny is a web application framework for R that enables you to turn your analyses into interactive web applications without requiring HTML, CSS, or JavaScript knowledge. Shiny Apps can be built using RStudio, an integrated development environment (IDE) for R. Within sciCORE, you need to use Open OnDemand (OOD) to connect to the RStudio Server and run Shiny Apps. By default, users place their Shiny Apps in the ShinyApps/ folder.

These are the steps to use Shiny Apps:

1. Connect to Open OnDemand (OOD)

2. Start the RStudio Server

3. Load the Shiny library

4. Run the available Shiny Apps from their folders

Info

The default directory for Shiny Apps is $HOME/ShinyApps/, so you can use

runExample("04_mpg/")

to run the app 04_mpg. You can also use the runApp() function and specify the path to the app folder. For example:

runApp("~/ShinyApps/04_mpg/")


MATLAB

Interactively

To run MATLAB interactively, we recommend opening a sciCORE Desktop session on Open OnDemand.

Tip

When your sciCORE Desktop session is ready, you can adjust the “Compression” and “Image quality” sliders to get a better visual experience. This tip applies whenever you use sciCORE Desktop for visuals-heavy workflows.

From within the sciCORE Desktop session, MATLAB can be loaded via the module system. Open a terminal and explore all available MATLAB versions with:

ml spider MATLAB

To load a specific version of MATLAB, use the following pattern:

ml MATLAB/<version>

For example, to load MATLAB 2023b, you would use:

ml MATLAB/2023b

To run MATLAB with a graphical user interface (GUI) run the following from the command line:

matlab

To get a command-line interface (CLI) with the Java Virtual Machine (JVM):

matlab -nodesktop

To run MATLAB without the JVM, you can use:

matlab -nojvm

Info

The options -nodesktop and -nojvm differ in that the first one still starts the JVM, so graphics functionality will still work even though the MATLAB desktop is not initialized. The second won’t work with graphics functions, as it cannot access the Java API.

In SLURM batch jobs

You can submit MATLAB scripts to the compute nodes of the cluster as a regular SLURM job. You only need to make sure that you load the corresponding MATLAB module and that your script doesn’t need a GUI (i.e. it can be run directly from the command line).

To run a MATLAB script from the command line you can use:

matlab -nosplash -nodesktop -r "myscript"

Note

The extension .m (typical for MATLAB scripts) is omitted, and the name of the script goes between quotation marks.

This implementation will still open MATLAB’s command window and output everything to STDOUT until the script finishes.

An alternative is to use the -batch flag, which starts MATLAB without splash screens, without a desktop, and without opening MATLAB’s command window. It also exits MATLAB automatically after completion. In other words, it runs the script non-interactively:

matlab -batch "myscript"

Warning

Be aware, though, that the -batch flag disables certain functionalities of MATLAB.

A final option is to compile your script using the MATLAB compiler and runtime libraries. This is the preferred option for compute-intensive jobs:

mcc -m -R -nodisplay myscript.m

This will produce a myscript binary executable, as well as a wrapper script run_myscript.sh. To run the executable you just need to run the wrapper script, providing the path to the root MATLAB installation.

Example:

./run_myscript.sh <path_to_matlab_root>

Tip

The path to the root MATLAB installation is different for each MATLAB version. To find this path you can run

which matlab

from the command line and extract the part before the last /bin/ in the output.

For example, if the output is:

/scicore/soft/easybuild/apps/MATLAB/2022b/bin/matlab

then <path_to_matlab_root> would be:

/scicore/soft/easybuild/apps/MATLAB/2022b

Available licenses

The university maintains a pool of MATLAB licenses. When a MATLAB instance is launched, it connects to the license server (FlexLM), which reserves a license (or several, if you are using toolboxes) from the pool. The licenses are released (made available to other users) once MATLAB terminates.

License pool exhaustion

When a MATLAB script is run as a cluster job, there must be free licenses available for it to execute. On some occasions, a cluster job may be killed because the license pool at the university is temporarily exhausted.

Conda issues

In SLURM scripts

For many environment managers, you can call your scripts from within a SLURM script just as you would from the command line. However, SLURM jobs begin as a sub-shell on a compute node, which is not initialized with the full set of config files in your home folder the way a login shell is. There is thus a small caveat for conda/mamba environments: you need to add the snippet eval "$(conda shell.bash hook)" to the beginning of your SLURM script so that the conda command becomes available on the compute node. For example:

#!/bin/bash
#SBATCH --job-name=my_job
# ... other SLURM options ...

eval "$(conda shell.bash hook)"  # make conda command available

conda activate my_project  # activate the environment

python my_script.py  # run your script

Licensing issues

TL;DR

The “defaults” and “main” conda channels are covered by a license agreement which charges user fees depending on the use case and size of your organization. We recommend removing the default channel and using conda-forge instead.

Conda is an open-source package manager which facilitates the management of Python environments and the installation of many software packages into these environments. The software packages are made available through several different “channels” - essentially code repositories. There are many ways to install the conda package manager, each of which comes with a somewhat different configuration, including which channels are allowed and their relative priority. For example, if you install using miniconda or the entire “anaconda” package, you might end up with the “defaults” channel at the highest priority. This particular channel/repository is maintained by a company called Anaconda Inc (formerly Continuum Analytics), which charges fees for its use. The terminology and relationships are indeed complicated; they are addressed to some degree in a blog post from Anaconda Inc.

You can check which channels are configured in your conda setup in the following file:

cat $HOME/.condarc

Alternatively, you can see your channels with the following commands:

conda config --show-sources # shows configuration source, normally $HOME/.condarc
conda config --show channels

We recommend removing the defaults channel, and we may need to block access to this channel from sciCORE systems. Please be aware, however, that this may have an impact on your existing conda environments. Specifically, update or (re)installation processes may need to be performed and in some edge cases may break your environments. Please don’t hesitate to get in touch with the sciCORE team should you need assistance.

You can set conda-forge as the top priority channel with the following command:

conda config --add channels conda-forge

You can remove the defaults channel with the following command:

conda config --remove channels defaults

Tip

An alternative way to get around this issue is to install conda with the miniforge installer, or use the pixi environment manager. Both install packages from the conda-forge repository by default.

Compiling Software

If you need to compile your own programs, we recommend using toolchains, for instance:

module load foss

Info

The term foss stands for Free and Open-Source Software. Run ml spider foss to explore available versions.

A toolchain is a collection of tools used for building software. They typically include:

  • a compiler collection providing basic language support (C/Fortran/C++)
  • an MPI implementation for multi-node communication
  • a set of linear algebra libraries (FFTW/BLAS/LAPACK) for accelerated math

Info

Many modules on our cluster include the toolchain that was used to build them in their version tag (e.g. Biopython/1.84-foss-2023b was built with the foss/2023b toolchain).

Mixing components from different toolchains almost always leads to problems. For example, if you mix Intel MPI modules with OpenMPI modules you can guarantee your program will not run (even if you manage to get it to compile). We recommend you always use modules from the same (sub)toolchains.

To see all available toolchains, run:

ml av toolchain
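
As a minimal sketch of compiling with a toolchain (hello_mpi.c is a hypothetical source file; pick a concrete version from the list above):

ml foss/2023b                          # GCC, OpenMPI, and math libraries in one module
mpicc -O2 hello_mpi.c -o hello_mpi     # compile an MPI program with the toolchain's compiler wrapper
srun ./hello_mpi                       # run it from inside a SLURM job or allocation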

Containers

You can use Apptainer (formerly Singularity) to download and run Docker images from public Docker registries.

For example, to download the docker/whalesay image from the Docker Hub, you can run:

apptainer pull docker://docker/whalesay

This will create a file called whalesay_latest.sif in the current directory.

To run the image, you can use:

apptainer run whalesay_latest.sif cowsay "Hi from sciCORE!"

This will output:

<Some warnings>
 _________________
< Hi from sciCORE >
 -----------------
    \
     \
      \
                    ##        .
              ## ## ##       ==
           ## ## ## ##      ===
       /""""""""""""""""___/ ===
  ~~~ {~~ ~~~~ ~~~ ~~~~ ~~ ~ /  ===- ~~~
       \______ o          __/
        \    \        __/
          \____\______/

You can also pull images from other container registries that abide by the Open Container Initiative (OCI) standards. See the apptainer docs on this for more information. For example, to pull uv from the ghcr.io registry, you can run:

apptainer pull docker://ghcr.io/astral-sh/uv:latest

Use the command

apptainer help

to get help about other ways of interacting with apptainer.
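
Containers can also be used inside SLURM jobs. A minimal sketch, reusing the image pulled above (the resource values are illustrative):

#!/bin/bash
#SBATCH --job-name=container-test
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1G
#SBATCH --qos=30min

# Run a command inside the container image located in the current directory
apptainer exec whalesay_latest.sif cowsay "Hi from a compute node!"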

Workflow management

Job dependencies

Establishing dependencies between different SLURM jobs is a good way to split up parts of a pipeline that have very different resource requirements. This way, the resources are used more efficiently and the scheduler can allocate the jobs more easily, which in the end can shorten the time you have to wait in the queue.

The way to work with dependent SLURM jobs is to launch them with the --dependency directive and specify the condition that has to be met before the dependent job can start.

Dependent jobs will be placed in the queue, but they will not start until the specified condition is met and the resources are available.

Condition                   Explanation
after:jobid[:jobid]         job begins after the specified jobs have started
afterany:jobid[:jobid]      job begins after the specified jobs have terminated
afternotok:jobid[:jobid]    job begins after the specified jobs have failed
afterok:jobid[:jobid]       job begins after the specified jobs have finished successfully
singleton                   job begins after all previously launched jobs with the same name and user have ended

Note

Job arrays can also be submitted with dependencies, and a job can depend on an array job. In the latter case, the job will start executing when all tasks in the job array have met the dependency criterion (e.g. terminating, for afterany).

Practical examples

Assume you have a series of job scripts, job1.sh, job2.sh, …, job9.sh, that depend on each other in some way.

The first job to be launched has no dependencies. It is submitted with a standard sbatch command, and we store its job ID in a variable that will be used for the jobs that depend on job1:

jid1=$(sbatch --parsable job1.sh)

Info

To make it easier to grab the job ID of a job at submission time, we add the --parsable flag to SLURM jobs that other jobs depend on. The --parsable option makes sbatch return the job ID only.

Multiple jobs can depend on a single job. If job2 and job3 depend on job1 to finish, no matter the status, we can launch them with the following commands:

jid2=$(sbatch --parsable --dependency=afterany:$jid1 job2.sh)
jid3=$(sbatch --parsable --dependency=afterany:$jid1 job3.sh)

Similarly, a single job can depend on multiple jobs. If job4 depends directly on job2 and job3 (thus indirectly on job1) to finish, we can launch it with:

jid4=$(sbatch --parsable --dependency=afterany:$jid2:$jid3 job4.sh)

Job arrays can also be submitted with dependencies. If job5 is a job array that depends on job4, we can launch it like this:

jid5=$(sbatch --parsable --dependency=afterany:$jid4 job5.sh)

A single job can depend on an array job. Here, job6 will start when all array jobs from job5 have finished successfully:

jid6=$(sbatch --parsable --dependency=afterok:$jid5 job6.sh)

A single job can depend on all jobs by the same user with the same name. Here, job7 and job8 depend on job6 to finish successfully, and both are launched with the same name (“dtest”). We make job9 depend on job7 and job8 by making it depend on any job with the name “dtest”.

jid7=$(sbatch --parsable --dependency=afterok:$jid6 --job-name=dtest job7.sh)
jid8=$(sbatch --parsable --dependency=afterok:$jid6 --job-name=dtest job8.sh)
sbatch --dependency=singleton --job-name=dtest job9.sh

Finally, you can show the dependencies of your queued jobs like so:

squeue -u $USER -o "%.8A %.4C %.10m %.20E"

Tip

It is possible to make a job depend on more than one dependency type. For example, in the following, job4 launches once job2 has finished successfully and job3 has failed:

jid4=$(sbatch --parsable --dependency=afterok:$jid2,afternotok:$jid3 job4.sh)

Separating the dependency types by ‘,’ means that all dependencies must be met. Separating them by ‘?’ means that either one suffices.
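
For example, with ‘?’ job4 would start as soon as either of the two conditions is met:

jid4=$(sbatch --parsable --dependency=afterok:$jid2?afternotok:$jid3 job4.sh)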

Do it yourself

The following proposes a simple set of scripts to test the concepts showcased above. Each script is set up to run for 15 seconds.

job1.sh:

#! /bin/sh
sleep 15
ls . > job1_output.txt

job2.sh:

#! /bin/sh
## takes input from job1
sleep 15
wc -l job1_output.txt > job2_output.txt

job3.sh:

#! /bin/sh
## takes input from job1
sleep 15
wc -c job1_output.txt > job3_output.txt

job4.sh:

#! /bin/sh
## takes input from job2 and job3
sleep 15
cat job2_output.txt job3_output.txt > job4_output.txt

sbatch job1.sh outputs Submitted batch job 53481198

sbatch --dependency=afterok:53481198 job2.sh outputs Submitted batch job 53481217

squeue -u $USER reveals:

   JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
53481239   scicore  job2.sh duchem00 PD       0:00      1 (Dependency)
53481237   scicore  job1.sh duchem00  R       0:13      1 sca29

In practice it is often more convenient to output just the job ID using the --parsable flag:

jid1=$(sbatch --parsable job1.sh)
jid2=$(sbatch --parsable --dependency=afterok:$jid1 job2.sh)

You can have more than one job dependent on a single job:

jid2=$(sbatch --parsable --dependency=afterok:$jid1 job2.sh)
jid3=$(sbatch --parsable --dependency=afterok:$jid1 job3.sh)

And you can have a job depend on more than one job:

jid4=$(sbatch --parsable --dependency=afterok:$jid2:$jid3 job4.sh)

Snakemake

Important note

Running a pipeline on the cluster often involves keeping a master process running on the login node for the duration of the pipeline.

In order to reduce the load on the login node, and to prevent interruption of your pipeline should your session be interrupted, we recommend running the pipelines on the vscode node (vscode.scicore.unibas.ch) with a tmux persistent session.

Alternatively, you can put your snakemake master process in a SLURM script.

Snakemake is a workflow management system for creating reproducible and scalable data analyses.

In order to run snakemake workflows on sciCORE, you need to set up a config.yaml profile where you specify:

  • executor: slurm : runs the rules as SLURM jobs
  • use-envmodules: true : enables the module system
  • the resources for each rule: e.g. set-resources: myrule:mem=500MB

You also need to specify the modules to load for each rule in the Snakefile using the envmodules option.

Follow the relevant documentation from the slurm-executor-plugin and snakemake to see exactly which options to set for each use case.

Note

We recommend running snakemake -n to do a dry run of the workflow and identify which rules will be invoked.
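
For example:

snakemake -n   # dry run: shows which rules would be executed, without running anything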

Here is an example config.yaml file for a workflow with 2 rules (fastqc and multiqc), each needing their own module:

executor: slurm        # sets up rules execution in SLURM
jobs: 10               # maximum number of concurrent jobs
use-envmodules: true   # sets up module system

latency-wait: 30       # time, in seconds, before checking for result
                       # files after a job finishes successfully.
                       # This is useful when there is a bit of
                       # filesystem latency.


set-resources:
    multiqc:
        mem: 500MB                 # reserved memory
        runtime: 10                # reserved runtime in minutes
        threads: 1                 # reserved cpus
        slurm_extra: "--qos=30min" # any extra slurm options,
                                   # here used to set the queue of service
    fastqc:
        mem: 1GB
        runtime: 10
        threads: 1
        slurm_extra: "--qos=30min"

And the Snakefile specifies the modules to use for each rule:

...

rule fastqc:
    ...
    envmodules: 
        "FastQC/0.12.1-Java-21"      # module to load at rule start-up
    ...

rule multiqc:
    ...
    envmodules:
        "MultiQC/1.22.3-foss-2024a"  # module to load at rule start-up
    ...
...

Presuming the Snakefile, input data, and config.yaml are in the same folder, you launch the workflow with: snakemake --workflow-profile .

Warning

It may happen that the pipeline fails because of filesystem latency issues, in which case you should typically see some line like this in your error message: Job 5 completed successfully, but some output files are missing.

In that case consider increasing the latency-wait option to a higher number.

Warning

Beware that snakemake exports your environment when submitting jobs. This means that any environment variable you have defined in your session will get passed down to the various steps of the workflow.

Rules running locally

Some rules are unsuited to run as a job on the cluster, for instance when a task requires internet access. For these, snakemake provides the localrule keyword in the Snakefile. It can either be defined at the rule level:

rule foo:
    ...
    localrule: True  # rule foo will be executed locally rather than in a SLURM job
    ...

Or several rules can be specified at once:

localrules: foo, bar # rules foo and bar will be executed locally rather than in a SLURM job

rule foo:
    ...

rule bar:
    ...

You can read more on the subject in the snakemake documentation.

Do it yourself:

The configuration shown above corresponds to a simple bioinformatics pipeline in which we generate HTML quality-control reports from a set of sequencing result files.

Here are the steps to execute if you want to test it for yourself:

First, create a new folder on sciCORE and move there:

mkdir snakemake_test
cd snakemake_test

Then download some data (NB: these are small files):

wget https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/illumina/amplicon/sample1_R1.fastq.gz
wget https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/illumina/amplicon/sample1_R2.fastq.gz
wget https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/illumina/amplicon/sample2_R1.fastq.gz
wget https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/illumina/amplicon/sample2_R2.fastq.gz

Next, create a file called Snakefile with the following content:

SAMPLES = ["sample1_R1",
           "sample1_R2",
           "sample2_R1",
           "sample2_R2"]

rule all:
    input:
        "results/multiqc/multiqc_report.html"

rule fastqc:
    input:
        "{sample}.fastq.gz"
    output:
        "results/fastqc/{sample}_fastqc.html"
    envmodules: 
        "FastQC/0.12.1-Java-21"      # module to load at rule start-up
    shell:
        "fastqc {input} -o results/fastqc/"

rule multiqc:
    input:
        fqc=expand("results/fastqc/{sample}_fastqc.html", sample=SAMPLES)
    output:
        "results/multiqc/multiqc_report.html"
    envmodules:
        "MultiQC/1.22.3-foss-2024a"  # module to load at rule start-up
    shell:
        "multiqc results/fastqc/ -o results/multiqc/"

Also create a file named config.yaml, with the content:

executor: slurm        # sets up rules execution in SLURM
jobs: 10               # maximum number of concurrent jobs
use-envmodules: true   # sets up module system

latency-wait: 30       # time, in seconds, before checking for result
                       # files after a job finishes successfully.
                       # This is useful when there is a bit of
                       # filesystem latency.

set-resources:
    multiqc:
        mem: 500MB                 # reserved memory
        runtime: 10                # reserved runtime in minutes
        threads: 1                 # reserved cpus
        slurm_extra: "--qos=30min" # any extra slurm options,
                                   # here used to set the queue of service
    fastqc:
        mem: 1GB
        runtime: 10
        threads: 1
        slurm_extra: "--qos=30min"

Load the snakemake module:

ml snakemake/9.3.5-foss-2025a

Finally, run the pipeline with:

snakemake --workflow-profile .

Where the option --workflow-profile . specifies the folder where you have the config.yaml file (here, the working directory .).
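
As mentioned in the note at the top of this section, the snakemake master process can also run inside a SLURM job instead of on the vscode node. A minimal sketch, mirroring the Nextflow example further down (the resources and QOS are illustrative):

#!/bin/bash
#SBATCH --job-name=snakemake-master
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=2G
#SBATCH --time=06:00:00
#SBATCH --qos=6hours
#SBATCH --output=snakemake_master.o

ml snakemake/9.3.5-foss-2025a
snakemake --workflow-profile .   # the individual rules are still submitted as their own SLURM jobs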

Nextflow

Important note

Running a pipeline on the cluster often involves keeping a master process running on the login node for the duration of the pipeline.

In order to reduce the load on the login node, and to prevent interruption of your pipeline should your session be interrupted, we recommend running the pipelines on the vscode node (vscode.scicore.unibas.ch) with a tmux persistent session.

Alternatively, you can put your nextflow master process in a SLURM script.

Nextflow is a workflow system for creating scalable, portable, and reproducible workflows.

In order to run nextflow workflows on sciCORE, you need a nextflow.config file to specify the SLURM configuration for each process, such as which resources to reserve or which modules to load.

Here is an example nextflow.config file for a workflow with 2 processes (FASTQC and MULTIQC), each needing to load its own module:

process.executor = 'slurm'


process {
    withName: FASTQC {
        module = 'FastQC/0.11.8-Java-1.8'
        cpus = 1
        memory = 1.GB
        clusterOptions = '--qos 30min'
    }
}

process {
    withName: MULTIQC {
        module = 'MultiQC/1.14-foss-2022a'
        cpus = 1
        memory = 1.GB
        clusterOptions = '--qos 30min'
    }
}

Where:

  • cpus corresponds to SLURM’s cpus-per-task
  • queue corresponds to SLURM’s partition
  • clusterOptions : generic way of adding options to the SLURM submission, here we use it to specify the qos.

Tip

You can specify multiple options by separating them with spaces. e.g.: clusterOptions = '--qos 30min --mail-type=END,FAIL --mail-user=<my.name>@unibas.ch'.

Nextflow also lets you set a queue option, which corresponds to the SLURM partition.


Tip

To attach multiple modules to the same process, you can use something like: module = ['FastQC','MultiQC']

Do it yourself:

Get the data

wget https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/illumina/amplicon/sample1_R1.fastq.gz
wget https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/illumina/amplicon/sample1_R2.fastq.gz
wget https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/illumina/amplicon/sample2_R1.fastq.gz
wget https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/illumina/amplicon/sample2_R2.fastq.gz

simpleQC.nf

params.report_id = "multiqc_report"

process FASTQC {

    publishDir "results/fastqc"

    input:
    path reads

    output:
    path "${reads.simpleName}_fastqc.zip", emit: zip
    path "${reads.simpleName}_fastqc.html", emit: html

    script:
    """
    fastqc $reads
    """
}

process MULTIQC {

    publishDir "results/multiqc"

    input:
    path '*'
    val output_name

    output:
    path "${output_name}.html", emit: report
    path "${output_name}_data", emit: data

    script:
    """
    multiqc . -n ${output_name}.html
    """
}

// Workflow block
workflow {
    ch_fastq = channel.fromPath(params.fq)   // Create a channel using parameter input
    FASTQC(ch_fastq)       // fastqc
    MULTIQC(
        FASTQC.out.zip.mix(FASTQC.out.html).collect(),
        params.report_id
        )

}

Workflow preview and configuration

ml Nextflow
nextflow run simpleQC.nf --fq "sample*.fastq.gz" -preview

output:

Nextflow 25.04.6 is available - Please consider updating your version to it

 N E X T F L O W   ~  version 24.10.4

Launching `simpleQC.nf` [big_mercator] DSL2 - revision: 0c7da237e5

[-        ] FASTQC  -
[-        ] MULTIQC -

We are going to create a config file to specify all cluster relevant information to nextflow.

If that file is named nextflow.config and is in your current directory when you launch the workflow, then it will automatically be applied to the run. Otherwise you can specify it to nextflow run with option -c.

nextflow.config

process.executor = 'slurm'

process {
    withName: FASTQC {
        module = 'FastQC/0.11.8-Java-1.8'
        cpus = 1
        memory = 1.GB
        clusterOptions = '--qos 30min'
    }
}

process {
    withName: MULTIQC {
        module = 'MultiQC/1.14-foss-2022a'
        cpus = 1
        memory = 1.GB
        clusterOptions = '--qos 30min'
    }
}

where:

  • cpus : SLURM cpus-per-task
  • queue : SLURM partition
  • clusterOptions : generic way of adding options to the SLURM submission, here we use it to specify the qos.
    • you can specify multiple options by separating them with spaces. e.g.: clusterOptions = '--qos 30min --mail-type=END,FAIL --mail-user=<my.name>@unibas.ch'


Actually running the pipeline

nextflow run simpleQC.nf --fq "sample*.fastq.gz" -with-timeline -with-trace -with-report -with-dag

The options -with-timeline -with-trace -with-report -with-dag produce text and HTML reports, which are all useful. In particular:

  • report for usage details
  • trace gives you the job IDs (column native_id, useful for debugging) and the resource usage

Alternatively, the pipeline could be run from a SLURM script which you would submit with sbatch:

#!/bin/bash
#SBATCH --job-name=nextflow-test
#SBATCH --cpus-per-task=1    #Number of cores to reserve
#SBATCH --mem-per-cpu=1G     #Amount of RAM/core to reserve
#SBATCH --time=06:00:00      #Maximum allocated time
#SBATCH --qos=6hours         #Selected queue to allocate your job
#SBATCH --output=nextflow_test.o

ml Nextflow
nextflow run simpleQC.nf --fq "sample*.fastq.gz" -with-timeline -with-trace -with-report -with-dag

nf-core

This subsection draws inspiration from the genotoul cluster nextflow-course.

Important note

This method requires keeping a master process running on the login node for the duration of the pipeline.

In order to reduce the load on the login node, and to prevent interruption of your pipeline should your session be interrupted, we recommend running the pipelines on the vscode node (vscode.scicore.unibas.ch) with a tmux persistent session.

TL;DR

  • specify process.executor = 'slurm' in your nextflow.config
  • use -profile apptainer to handle the containers
  • read about managing workflow resources; also use the recommendations listed above

nf-core is a global community effort to collect a curated set of open‑source analysis pipelines built using Nextflow.

We will demonstrate how to use nf-core on sciCORE with the nf-core/demo pipeline, which is a small example pipeline well suited for a first test.

On the login node, we can start by inspecting the pipeline:

ml Nextflow

nextflow run nf-core/demo --help

This will download the workflow files and display usage options.

nf-core pipelines generally have a test profile which specifies some simple input data.

This is useful for workflow configuration:

nextflow inspect nf-core/demo -profile test  --outdir results

output:

{
    "processes": [
        {
            "name": "NFCORE_DEMO:DEMO:MULTIQC",
            "container": "quay.io/biocontainers/multiqc:1.29--pyhdfd78af_0"
        },
        {
            "name": "NFCORE_DEMO:DEMO:SEQTK_TRIM",
            "container": "quay.io/biocontainers/seqtk:1.4--he4a0461_1"
        },
        {
            "name": "NFCORE_DEMO:DEMO:FASTQC",
            "container": "quay.io/biocontainers/fastqc:0.12.1--hdfd78af_0"
        }
    ]
}

So there are 3 processes, and each is linked to a container on quay.io.

This means that we will not need to load specific modules for fastqc, multiqc, and seqtk, but rather just need to make sure that a compatible container runtime is available.

In the case of sciCORE, apptainer is directly available without having to load additional modules. We just need to tell nextflow about it via another profile option.

But first, you have to set up a file named nextflow.config containing:

process.executor = 'slurm'

And then you can run:

nextflow run nf-core/demo --outdir . -profile test,apptainer

You can then look up the execution trace, dag, report, and timeline in the folder pipeline_info/.

In particular, you can inspect the workflow's resource usage and extrapolate how much you may need for your own data.

The workflows are configured with relatively sensible resource requests (generally in their conf/base.config file), but we highly recommend having a look at their configuration.

See more on that subject in the nf-core documentation and in the section above.