Data¶
The data for this project is managed by the DVC tool and all related files are located in the `data` directory. DVC has already been installed together with the “Atlas Interpolation” package. Every time you need to run a DVC command (`dvc ...`), make sure to change to the `data` directory first (`cd data`).
Remote Storage Access¶
We have already prepared all the data, but it is located on a remote storage that is only accessible to people within the Blue Brain Project who have access permissions to project `proj101`. If you’re unsure, you can test your permissions with the following command:
ssh bbpv1.bbp.epfl.ch \
"ls /gpfs/bbp.cscs.ch/data/project/proj101/dvc_remotes"
Possible outcomes:
# Access OK
atlas_annotation
atlas_interpolation
# Access denied
ls: cannot open directory [...]: Permission denied
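The check above can be wrapped in a small shell function that prints one of the two outcomes. This is a hypothetical convenience helper, not part of the package; by default it runs the ssh check shown above, and you can pass any other command to try it locally.

```shell
# Hypothetical helper: report proj101 access based on whether a listing
# command succeeds. Defaults to the ssh check from above.
have_proj101_access() {
    cmd=${1:-'ssh bbpv1.bbp.epfl.ch "ls /gpfs/bbp.cscs.ch/data/project/proj101/dvc_remotes"'}
    if eval "$cmd" >/dev/null 2>&1; then
        echo "Access OK"
    else
        echo "Access denied"
    fi
}

# Example: have_proj101_access        # runs the ssh check
#          have_proj101_access true   # prints "Access OK"
```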
Depending on whether you have access to the remote storage, in the following sections you will either pull the data from the remote (`dvc pull`) or download the input data manually and re-run the data processing pipelines to reproduce the output data (`dvc repro`).
If you work on BB5 and have access to the remote storage, run the following commands to short-circuit the remote access (the remote is located on BB5 itself):
cd data
dvc remote add --local gpfs_proj101 \
/gpfs/bbp.cscs.ch/data/project/proj101/dvc_remotes/atlas_interpolation
cd ..
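Because of the `--local` flag, this remote is stored in the untracked local config rather than in the shared `.dvc/config`. Assuming DVC’s usual config layout, `data/.dvc/config.local` should then contain an entry along these lines (a sketch, not verified against your checkout):

```ini
# data/.dvc/config.local (sketch -- written by `dvc remote add --local`)
['remote "gpfs_proj101"']
    url = /gpfs/bbp.cscs.ch/data/project/proj101/dvc_remotes/atlas_interpolation
```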
Model Checkpoints¶
Much of the functionality of “Atlas Interpolation” relies on pre-trained deep learning models. The model checkpoints that need to be loaded are part of the data.
If you have access to the remote storage (see above), you can pull all model checkpoints from the remote:
cd data
dvc pull checkpoints/rife.dvc
dvc pull checkpoints/cain.dvc
dvc pull checkpoints/maskflownet.params.dvc
dvc pull checkpoints/RAFT.dvc
cd ..
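The four pulls above can also be written as a loop. As a sketch, echoing the commands first is a cheap dry run (pipe the output to `sh`, or drop the `echo`, to actually execute them):

```shell
# Print the four dvc pull commands from above (dry run).
for ckpt in rife cain maskflownet.params RAFT; do
    echo "dvc pull checkpoints/${ckpt}.dvc"
done
```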
If you don’t have access to the remote, you need to download the checkpoint files by hand and put them into the `data/checkpoints` folder. Depending on the examples you want to run, you may not need all of the checkpoints. Here are the instructions for the four models we use (RIFE, CAIN, MaskFlowNet, and RAFT):
- RIFE: download the checkpoint from a shared Google Drive folder by following this link. Unzip the contents of the downloaded file into `data/checkpoints/rife`. [ref]
- CAIN: download the checkpoint from a shared Dropbox folder by following this link. Move the downloaded file to `data/checkpoints/cain`. [ref]
- MaskFlowNet: download the checkpoint directly from GitHub by following this link. Rename the file to `maskflownet.params` and move it to `data/checkpoints`. [ref]
- RAFT: download the checkpoint files from a shared Dropbox folder by following this link. Move all downloaded `.pth` files to the `data/checkpoints/RAFT/models` folder. [ref]
If you downloaded all checkpoints or pulled them from the remote, you should have the following files:
data
└── checkpoints
    ├── RAFT
    │   └── models
    │       ├── raft-chairs.pth
    │       ├── raft-kitti.pth
    │       ├── raft-sintel.pth
    │       ├── raft-small.pth
    │       └── raft-things.pth
    ├── cain
    │   └── pretrained_cain.pth
    ├── maskflownet.params
    └── rife
        ├── contextnet.pkl
        ├── flownet.pkl
        └── unet.pkl
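To verify the layout above, a small (hypothetical, not part of the package) helper can print any expected checkpoint file that is missing under a given directory:

```shell
# Print each expected checkpoint file missing under "$1/checkpoints".
# File list taken from the tree above.
check_checkpoints() {
    for f in maskflownet.params \
             cain/pretrained_cain.pth \
             rife/contextnet.pkl rife/flownet.pkl rife/unet.pkl \
             RAFT/models/raft-chairs.pth RAFT/models/raft-kitti.pth \
             RAFT/models/raft-sintel.pth RAFT/models/raft-small.pth \
             RAFT/models/raft-things.pth; do
        [ -f "$1/checkpoints/$f" ] || echo "missing: $f"
    done
}

# Example: check_checkpoints data
```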
Section Images and Datasets¶
The purpose of the “Atlas Interpolation” package is to interpolate missing section images within section image datasets. This section explains how to obtain these data.
Remember that if you don’t have access to the remote storage (see above), you’ll need to use the `dvc repro` commands that download and process the data live. If you do have access, use `dvc pull` instead, which is faster.
Normally it’s not necessary to get all the data: due to its size, it may take a lot of disk space as well as time to download and pre-process. If you still decide to do so, run `dvc repro` or `dvc pull` without any parameters.
Specific examples only require specific data. You can list all data pipeline stages with DVC to find out which stage produces the data you’re interested in:
cd data
dvc stage list
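To narrow the listing down to one gene, the output can be filtered with standard tools. In the sketch below the here-doc stands in for real `dvc stage list` output (the stage name is assumed to be in the first column, and the sample lines are made up); against the real repository you would pipe `dvc stage list` directly.

```shell
# Keep only stage names (first column) that mention Gad1. In practice:
#   dvc stage list | awk '{print $1}' | grep Gad1
awk '{print $1}' <<'EOF' | grep Gad1
download_dataset@Gad1   Download the Gad1 ISH dataset
align@Gad1              Align the Gad1 section images
download_dataset@Vip    Download the Vip ISH dataset
EOF
```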
If, for example, you need the data located in `data/aligned/coronal/Gad1`, then according to the output of the command above the relevant stage is named `align@Gad1`. Therefore, you only need to run this stage to get the necessary data (replace `repro` by `pull` if you have access to the remote storage):
dvc repro align@Gad1
New ISH datasets (advanced, optional)¶
If you’re familiar with the AIBS data that we’re using and would like to add new ISH gene expressions that are not yet available as one of our pipeline stages (check the output of `dvc stage list`), then follow these instructions:

1. Edit the file `data/dvc.yaml` and add the new gene name to the lists in the `stages:download_dataset:foreach` and `stages:align:foreach` sections.
2. Run the data downloading and processing pipelines (replace `NEW_GENE` by the real gene name that you used in `data/dvc.yaml`):

dvc repro download_dataset@NEW_GENE
dvc repro align@NEW_GENE
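The `data/dvc.yaml` edit described above might look like this (a sketch only: the actual gene lists and stage commands in the file will differ, and the `cmd` entries are elided):

```yaml
stages:
  download_dataset:
    foreach:
      - Gad1
      - NEW_GENE    # new gene added here
    do:
      cmd: ...      # unchanged
  align:
    foreach:
      - Gad1
      - NEW_GENE    # and here
    do:
      cmd: ...      # unchanged
```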