Database for computation

sciCORE manages a versioned collection of datasets, which is accessible at the following path:

/scicore/data/managed/

The directory contains sub-directories, one for each dataset. E.g.:

/scicore/data/managed/1000genomes/
/scicore/data/managed/BLAST_FASTA/
/scicore/data/managed/igenomes/
/scicore/data/managed/PDB/
/scicore/data/managed/PDB_EBI/
/scicore/data/managed/UniProt/

Each dataset directory contains symlinks to the different versions of the dataset:

  • latest: most current version of the dataset
  • latest-prev: previous version to the current (latest) one
  • monthly: version which is updated once per month to point to the latest version at the time of update
  • frozen_YYMMDDTHHMMSS: version which will not be updated. The name of this version contains the date when this snapshot was created (YYMMDDTHHMMSS)

Many of these datasets are updated periodically. On each update, the latest symlink is repointed to the new version, and latest-prev is repointed to the version that latest referenced before. The previous version (latest-prev) is therefore kept in the repository only until the next update. The monthly symlink is updated once per month to point to the most current version at that time. The frozen_ versions are never updated: they are snapshots that are preserved in the repository.

Example: BLAST_FASTA dataset:

/scicore/data/managed/BLAST_FASTA/frozen_200126T191902 -> ../.store/BLAST_FASTA_200126T191902/
/scicore/data/managed/BLAST_FASTA/frozen_200202T191934 -> ../.store/BLAST_FASTA_200202T191934/
/scicore/data/managed/BLAST_FASTA/frozen_220926T035926 -> ../.store/BLAST_FASTA_220926T035926/
/scicore/data/managed/BLAST_FASTA/latest -> ../.store/BLAST_FASTA_260216T114155/
/scicore/data/managed/BLAST_FASTA/latest-prev -> ../.store/BLAST_FASTA_260209T054234/
/scicore/data/managed/BLAST_FASTA/monthly -> ../.store/BLAST_FASTA_260126T143109/

The directories that the symlinks point to are named after the date and time at which the dataset was updated, for example BLAST_FASTA_260216T114155. Use the paths to the symlinks (e.g. /scicore/data/managed/BLAST_FASTA/latest) when accessing the datasets.
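
To note which version a symlink currently points to (for a lab notebook or a job log, say), the link can be resolved with readlink. The following is a minimal sketch, assuming GNU coreutils is available; the function name dataset_version is hypothetical and not part of any sciCORE tooling.

```shell
# Hypothetical helper (not a sciCORE tool): print the name of the version
# directory that a dataset symlink currently resolves to.
# Usage: dataset_version /scicore/data/managed/BLAST_FASTA/latest
dataset_version() {
    link=$1
    # readlink -f follows the symlink and prints the absolute target path
    target=$(readlink -f "$link") || return 1
    # Keep only the last path component, e.g. BLAST_FASTA_260216T114155
    basename "$target"
}
```

With the BLAST_FASTA listing shown above, calling it on the latest symlink would print BLAST_FASTA_260216T114155.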

Usage

If you need to ensure reproducibility or if you have long-running analyses, we recommend that you:

  1. Create a snapshot of the data you need in your group directory, with the snapshot date in its name, as in this example:
    $ mkdir UniProt_taxonomic_divisions_YYYYMMDD
    $ cp -r /scicore/data/managed/UniProt/latest/knowledgebase/taxonomic_divisions UniProt_taxonomic_divisions_YYYYMMDD
    
  2. Delete your snapshot once it is no longer needed
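
The snapshot step above can be wrapped in a small function that fills in the date automatically. This is only a sketch under a few assumptions: the function name snapshot_dataset and the SOURCE_VERSION.txt file are hypothetical additions for illustration, and GNU coreutils (readlink -f) is assumed to be available.

```shell
# Hypothetical helper (not a sciCORE tool): copy part of a managed dataset
# into a directory named <prefix>_YYYYMMDD.
# Usage: snapshot_dataset SOURCE_DIR NAME_PREFIX [DEST_PARENT]
snapshot_dataset() {
    src=$1
    prefix=$2
    dest_parent=${3:-.}            # defaults to the current directory
    stamp=$(date +%Y%m%d)          # today's date, e.g. 20260216
    dest="$dest_parent/${prefix}_${stamp}"
    mkdir -p "$dest" || return 1
    # Same copy command as in the manual example above
    cp -r "$src" "$dest" || return 1
    # Assumption: also record the resolved source path, which contains the
    # version directory name, so the snapshot can be cited later
    readlink -f "$src" > "$dest/SOURCE_VERSION.txt"
    # Print the snapshot location so callers can capture it
    echo "$dest"
}
```

For example, run from your group directory, `snapshot_dataset /scicore/data/managed/UniProt/latest/knowledgebase/taxonomic_divisions UniProt_taxonomic_divisions` would create a directory following the naming pattern of the manual example.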

Reproducibility

There are two positive side effects of copying the data you need:

  • facilitates reproducibility
  • explicit versions and snapshot dates can be referenced in your publications

Long and consistent analysis

It is possible that dataset updates will occur during long-running analyses. Although the dataset path (e.g. /scicore/data/managed/BLAST_FASTA/latest) will always exist, after an update it will point to a different dataset version directory containing different file versions. In addition, the older dataset version might be deleted from the system.
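
If copying is not practical, a job can at least detect that an update happened while it was running. The sketch below compares the resolved target of a symlink against a value recorded at the start of the job. The function name dataset_unchanged is hypothetical, and GNU coreutils is assumed. Note that this only detects a change: it does not protect against the old version being deleted, so copying the data remains the safer option.

```shell
# Hypothetical helper (not a sciCORE tool): succeed only if a dataset symlink
# still resolves to the target recorded earlier.
# Usage:
#   start=$(readlink -f /scicore/data/managed/BLAST_FASTA/latest)
#   ... long-running analysis ...
#   dataset_unchanged /scicore/data/managed/BLAST_FASTA/latest "$start" \
#       || echo "WARNING: dataset was updated during the run" >&2
dataset_unchanged() {
    link=$1
    expected=$2
    # Resolve the symlink again and compare with the recorded target
    current=$(readlink -f "$link") || return 1
    [ "$current" = "$expected" ]
}
```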

Tip

By copying the necessary data to your group directory, you can make sure that the analysis uses files from a consistent version and that the files it reads will continue to exist. Once the analysis is completed, delete the copied version or replace it with a copy of the current dataset version.

Tip

Remember to delete snapshots which are no longer needed from your group directory.