Database for computation
sciCORE manages a versioned collection of datasets, which is accessible at the following path:
/scicore/data/managed/
The directory contains sub-directories, one for each dataset. E.g.:
/scicore/data/managed/1000genomes/
/scicore/data/managed/BLAST_FASTA/
/scicore/data/managed/igenomes/
/scicore/data/managed/PDB/
/scicore/data/managed/PDB_EBI/
/scicore/data/managed/UniProt/
Each dataset directory contains symlinks to the different versions of the dataset:
- latest: most current version of the dataset
- latest-prev: the version that preceded the current (latest) one
- monthly: version which is updated once per month to point to the latest version at the time of update
- frozen_YYMMDDTHHMMSS: version which will not be updated. The name of this version contains the date when this snapshot was created (YYMMDDTHHMMSS)
Many of these datasets are updated periodically, at which point a newer version replaces the current one (latest). The previous version (latest-prev) is kept in the repository until the next update, when the version that was latest becomes the new latest-prev.
The monthly datasets will be updated once per month to point to the most current version.
The frozen_ datasets will not be updated. They are snapshots which will be preserved in the repository.
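The naming rules above can be checked programmatically, for instance before a script decides whether a given path is safe to cite as a fixed version. The following is a minimal sketch; the function name classify_version and the "rolling"/"frozen" labels are illustrative assumptions, not part of any sciCORE tooling.

```python
from datetime import datetime

# Names of the symlinks that are re-pointed over time, per the list above.
ROLLING_LABELS = {"latest", "latest-prev", "monthly"}
FROZEN_PREFIX = "frozen_"


def classify_version(label):
    """Return ("rolling", None) or ("frozen", snapshot_datetime) for a version label."""
    if label in ROLLING_LABELS:
        return ("rolling", None)
    if label.startswith(FROZEN_PREFIX):
        # The suffix follows the YYMMDDTHHMMSS pattern, e.g. 200126T191902.
        stamp = datetime.strptime(label[len(FROZEN_PREFIX):], "%y%m%dT%H%M%S")
        return ("frozen", stamp)
    raise ValueError(f"unrecognised version label: {label!r}")


print(classify_version("latest"))
print(classify_version("frozen_200126T191902"))
```

Only labels classified as frozen carry a date in their own name; for the rolling labels the date has to be read from the directory the symlink points to.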
Example: BLAST_FASTA dataset:
/scicore/data/managed/BLAST_FASTA/frozen_200126T191902 -> ../.store/BLAST_FASTA_200126T191902/
/scicore/data/managed/BLAST_FASTA/frozen_200202T191934 -> ../.store/BLAST_FASTA_200202T191934/
/scicore/data/managed/BLAST_FASTA/frozen_220926T035926 -> ../.store/BLAST_FASTA_220926T035926/
/scicore/data/managed/BLAST_FASTA/latest -> ../.store/BLAST_FASTA_260216T114155/
/scicore/data/managed/BLAST_FASTA/latest-prev -> ../.store/BLAST_FASTA_260209T054234/
/scicore/data/managed/BLAST_FASTA/monthly -> ../.store/BLAST_FASTA_260126T143109/
The directories that the symlinks point to are named after the date and time at which the dataset was updated, for example: BLAST_FASTA_260216T114155
The paths to these symlinks (e.g. /scicore/data/managed/BLAST_FASTA/latest) are to be used when accessing the datasets.
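To record which dated directory each symlink currently points to (for a log file or a methods section, say), the links can be resolved from a script. This is a sketch only: list_versions is a hypothetical helper, and it is meant for reporting, while data access itself should still go through the symlink paths as described above.

```python
from pathlib import Path


def list_versions(dataset_dir):
    """Map each version symlink in dataset_dir to the directory it currently resolves to."""
    versions = {}
    for entry in sorted(Path(dataset_dir).iterdir()):
        if entry.is_symlink():
            # resolve() follows the link, giving the dated directory behind it.
            versions[entry.name] = entry.resolve()
    return versions


if __name__ == "__main__":
    dataset = Path("/scicore/data/managed/BLAST_FASTA")
    if dataset.is_dir():  # only meaningful on a system where this path is mounted
        for label, target in list_versions(dataset).items():
            print(f"{label:>24} -> {target.name}")
```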
Usage
If you need to ensure reproducibility or if you have long-running analyses, we recommend that you:
- Create a snapshot of the dataset data you need, in your group directory, with a specific date attached
- Delete your snapshot once it is no longer needed
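One way to create such a dated snapshot is sketched below. The helper snapshot_dataset, its parameters, and the "_copied_YYYYMMDD" naming scheme are assumptions made for illustration; the group directory is whatever location your group has been assigned, and the optional subpath must name a sub-directory of the dataset.

```python
import shutil
from datetime import date
from pathlib import Path


def snapshot_dataset(version_link, group_dir, subpath=".", today=None):
    """Copy (part of) a dataset version into group_dir under a dated name.

    Returns the path of the new snapshot directory.
    """
    # Resolve the symlink first so the copy comes from one fixed version,
    # even if the link is re-pointed while the copy is running.
    source_root = Path(version_link).resolve(strict=True)
    source = source_root / subpath
    stamp = (today or date.today()).strftime("%Y%m%d")
    # Example name: BLAST_FASTA_260216T114155_copied_20260217 (hypothetical scheme)
    destination = Path(group_dir) / f"{source_root.name}_copied_{stamp}"
    # copytree raises FileExistsError instead of overwriting an existing snapshot.
    shutil.copytree(source, destination)
    return destination


# Hypothetical usage; replace the second argument with your own group directory:
# snapshot_dataset("/scicore/data/managed/BLAST_FASTA/latest", "/path/to/group/dir")
```

Because the snapshot name keeps the dated directory name of the source, the exact dataset version remains visible after the copy. Removing the snapshot later is a single shutil.rmtree call on the returned path.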
Reproducibility
There are two positive side effects of copying the data you need:
- facilitates reproducibility
- explicit versions and snapshot dates can be referenced in your publications
Long and consistent analysis
It is possible that dataset updates will occur during long-running analyses. Although the dataset path (e.g. /scicore/data/managed/BLAST_FASTA/latest) will always exist, after an update it will point to a different dataset version directory containing different file versions. In addition, the older dataset version might be deleted from the system.
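A long-running job can detect both situations by remembering where the symlink pointed when the job started and comparing later. The sketch below assumes two hypothetical helpers, pin_version and check_pinned; it only reports the problem and does not protect the data from deletion.

```python
from pathlib import Path


def pin_version(version_link):
    """Resolve a version symlink once and return the dated directory behind it."""
    return Path(version_link).resolve(strict=True)


def check_pinned(version_link, pinned):
    """Report the state of a pinned version: 'unchanged', 'repointed' or 'deleted'."""
    if not pinned.exists():
        return "deleted"      # the old version directory was removed
    if Path(version_link).resolve() != pinned:
        return "repointed"    # the symlink now leads to a different version
    return "unchanged"
```

Calling check_pinned between analysis stages gives an early warning that later stages would read files from a different version than earlier ones.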
Tip
By copying the necessary data to your group directory, you can make sure that the analysis uses files from a single, consistent version and that the paths it has already resolved continue to exist. Once the analysis is completed, delete the copied version or replace it with a copy of the current dataset version.
Tip
Remember to delete snapshots which are no longer needed from your group directory.