Datasets, Software, and Resources

Dataset: EMPIAR-11830

For this tutorial, we compiled a subset of 33 tomograms from the Chlamy dataset (EMPIAR-11830), originally published as R. Kelley, et al, Towards community-driven visual proteomics with large-scale cryo-electron tomography of Chlamydomonas reinhardtii bioRxiv Preprint (2024).

General info about the dataset:

Detector: Falcon4i with SelectrisX energy filter, using Tomo5 on a generation 4 Titan Krios(es)
Pixel size: 1.91 Å (microscope defined 1.96, calibrated by STA to be 1.91)
Voltage: 300
Spherical aberration: 2.7
Tilt axis: -95, be aware that the tilt axis angles indicated in the mdocs are usually wrong
Defocus handedness: -1 in RELION if starting from scratch (+1 if using the TOMOMAN preprocessed project)
Dose per tilt: 3.5 e-/Å²

Although reconstructed tomograms from this dataset are already available (processed using TOMOMAN), this tutorial is designed to guide you through the full tomogram reconstruction workflow from scratch, leading into subtomogram averaging (STA).

This subset may also serve as a useful benchmark for testing and comparing different software tools.

The 33 tilt series are divided into two groups:

6 tilt series using GainRef1
27 tilt series using GainRef2

If you’re just getting started and want to learn the fundamentals of tomogram reconstruction and STA, we recommend beginning with the 6 GainRef1 tilt series for faster processing.

If you’re interested in pushing resolution, trying classification, or running advanced workflows, you can process the full set. Dataset GainRef1 alone is about 33 Gb while GainRef1 and 2 is about 170Gb.

You can download the datasets directly from EMPIAR. We compiled some scripts to help you in that process.

From a Linux terminal, in your desired directory, run the following command to download the 6 tilt series associated with GainRef1:

bash_download_gain1.sh


#!/bin/bash

# Base URL
BASE_URL="ftp://ftp.ebi.ac.uk/empiar/world_availability/11830/data/chlamy_visual_proteomics"

# List of gain1 target entries
ENTRIES=(
  "06042022_BrnoKrios_Arctis_grid5_Position_12"
  "02122021_BrnoKrios_Arctis_lam2_pos2"
  "02122021_BrnoKrios_Arctis_lam1_pos8"
  "02122021_BrnoKrios_Arctis_lam1_pos6"
  "01122021_BrnoKrios_arctis_lam3_pos31"
  "01122021_BrnoKrios_arctis_lam3_pos29"
  "01122021/gainref"
)

for ENTRY in "${ENTRIES[@]}"; do
  wget -r -N -np -nH --cut-dirs=4 \
       --accept "*.eer,*.mdoc,*.gain" \
       "$BASE_URL/$ENTRY/"
done

Make the script executable:

chmod +x bash_download_gain1.sh

and run it like :

./bash_download_gain1.sh

To download the 27 tilt series associated with GainRef2:

bash_download_gain2.sh


#!/bin/bash

# Base URL
BASE_URL="ftp://ftp.ebi.ac.uk/empiar/world_availability/11830/data/chlamy_visual_proteomics"

# List of gain2 target entries
ENTRIES=(
    "06042022_BrnoKrios_Arctis_grid7_Position_19"
    "06042022_BrnoKrios_Arctis_grid7_Position_21"
    "06042022_BrnoKrios_Arctis_grid7_Position_24"
    "15042022_BrnoKrios_Arctis_grid9_Position_25"
    "15042022_BrnoKrios_Arctis_grid9_Position_32"
    "15042022_BrnoKrios_Arctis_grid9_Position_35"
    "15042022_BrnoKrios_Arctis_grid9_Position_65"
    "27042022_BrnoKrios_Arctis_grid9_hGIS_Position_12"
    "27042022_BrnoKrios_Arctis_grid9_hGIS_Position_15"
    "27042022_BrnoKrios_Arctis_grid9_hGIS_Position_25"
    "27042022_BrnoKrios_Arctis_grid9_hGIS_Position_29"
    "27042022_BrnoKrios_Arctis_grid9_hGIS_Position_42"
    "27042022_BrnoKrios_Arctis_grid9_hGIS_Position_44"
    "27042022_BrnoKrios_Arctis_grid9_hGIS_Position_51"
    "02052022_BrnoKrios_Arctis_grid_hGIS_Position_15"
    "02052022_BrnoKrios_Arctis_grid_hGIS_Position_19"
    "02052022_BrnoKrios_Arctis_grid_hGIS_Position_47"
    "02052022_BrnoKrios_Arctis_grid_hGIS_Position_49"
    "02052022_BrnoKrios_Arctis_grid_hGIS_Position_59"
    "02052022_BrnoKrios_Arctis_grid_hGIS_Position_63"
    "02052022_BrnoKrios_Arctis_grid_hGIS_Position_64"
    "02052022_BrnoKrios_Arctis_grid_hGIS_Position_77"
    "02052022_BrnoKrios_Arctis_grid_hGIS_Position_78"
    "02052022_BrnoKrios_Arctis_grid_hGIS_Position_80"
    "02052022_BrnoKrios_Arctis_grid_hGIS_Position_83"
    "12052022_BrnoKrios_Arctis_grid_newGISc_Position_13"
    "12052022_BrnoKrios_Arctis_grid_newGISc_Position_29"
    "06042022/gainref"
)

# Loop over each entry and download only .eer and .mdoc files
for ENTRY in "${ENTRIES[@]}"; do
    wget -r -N -np -nH --cut-dirs=4 \
         --accept "*.eer,*.mdoc,*.gain" \
         "$BASE_URL/$ENTRY/"
done

From there, because the original files have annoying names, run this prepare_clean_rename.sh script. It will clean, rename and organize the files. (Run the script the same way you did for the download part):

prepare_clean_rename.sh


#!/bin/bash

BASE_DIR="chlamy_visual_proteomics"

GAIN1=(
    "06042022_BrnoKrios_Arctis_grid5_Position_12"
    "02122021_BrnoKrios_Arctis_lam2_pos2"
    "02122021_BrnoKrios_Arctis_lam1_pos8"
    "02122021_BrnoKrios_Arctis_lam1_pos6"
    "01122021_BrnoKrios_arctis_lam3_pos31"
    "01122021_BrnoKrios_arctis_lam3_pos29"
)

GAIN2=(
    "06042022_BrnoKrios_Arctis_grid7_Position_19"
    "06042022_BrnoKrios_Arctis_grid7_Position_21"
    "06042022_BrnoKrios_Arctis_grid7_Position_24"
    "15042022_BrnoKrios_Arctis_grid9_Position_25"
    "15042022_BrnoKrios_Arctis_grid9_Position_32"
    "15042022_BrnoKrios_Arctis_grid9_Position_35"
    "15042022_BrnoKrios_Arctis_grid9_Position_65"
    "27042022_BrnoKrios_Arctis_grid9_hGIS_Position_12"
    "27042022_BrnoKrios_Arctis_grid9_hGIS_Position_15"
    "27042022_BrnoKrios_Arctis_grid9_hGIS_Position_25"
    "27042022_BrnoKrios_Arctis_grid9_hGIS_Position_29"
    "27042022_BrnoKrios_Arctis_grid9_hGIS_Position_42"
    "27042022_BrnoKrios_Arctis_grid9_hGIS_Position_44"
    "27042022_BrnoKrios_Arctis_grid9_hGIS_Position_51"
    "02052022_BrnoKrios_Arctis_grid_hGIS_Position_15"
    "02052022_BrnoKrios_Arctis_grid_hGIS_Position_19"
    "02052022_BrnoKrios_Arctis_grid_hGIS_Position_47"
    "02052022_BrnoKrios_Arctis_grid_hGIS_Position_49"
    "02052022_BrnoKrios_Arctis_grid_hGIS_Position_59"
    "02052022_BrnoKrios_Arctis_grid_hGIS_Position_63"
    "02052022_BrnoKrios_Arctis_grid_hGIS_Position_64"
    "02052022_BrnoKrios_Arctis_grid_hGIS_Position_77"
    "02052022_BrnoKrios_Arctis_grid_hGIS_Position_78"
    "02052022_BrnoKrios_Arctis_grid_hGIS_Position_80"
    "02052022_BrnoKrios_Arctis_grid_hGIS_Position_83"
    "12052022_BrnoKrios_Arctis_grid_newGISc_Position_13"
    "12052022_BrnoKrios_Arctis_grid_newGISc_Position_29"
)

declare -A RENAME_MAP=(
    ["01122021_BrnoKrios_arctis_lam3_pos29"]="tomo24"
    ["01122021_BrnoKrios_arctis_lam3_pos31"]="tomo25"
    ["02122021_BrnoKrios_Arctis_lam1_pos6"]="tomo34"
    ["02122021_BrnoKrios_Arctis_lam1_pos8"]="tomo35"
    ["02122021_BrnoKrios_Arctis_lam2_pos2"]="tomo37"
    ["06042022_BrnoKrios_Arctis_grid5_Position_12"]="tomo50"
    ["06042022_BrnoKrios_Arctis_grid7_Position_19"]="tomo69"
    ["06042022_BrnoKrios_Arctis_grid7_Position_21"]="tomo71"
    ["06042022_BrnoKrios_Arctis_grid7_Position_24"]="tomo74"
    ["15042022_BrnoKrios_Arctis_grid9_Position_25"]="tomo216"
    ["15042022_BrnoKrios_Arctis_grid9_Position_32"]="tomo224"
    ["15042022_BrnoKrios_Arctis_grid9_Position_35"]="tomo227"
    ["15042022_BrnoKrios_Arctis_grid9_Position_65"]="tomo260"
    ["27042022_BrnoKrios_Arctis_grid9_hGIS_Position_12"]="tomo297"
    ["27042022_BrnoKrios_Arctis_grid9_hGIS_Position_15"]="tomo300"
    ["27042022_BrnoKrios_Arctis_grid9_hGIS_Position_25"]="tomo310"
    ["27042022_BrnoKrios_Arctis_grid9_hGIS_Position_29"]="tomo314"
    ["27042022_BrnoKrios_Arctis_grid9_hGIS_Position_42"]="tomo329"
    ["27042022_BrnoKrios_Arctis_grid9_hGIS_Position_44"]="tomo331"
    ["27042022_BrnoKrios_Arctis_grid9_hGIS_Position_51"]="tomo339"
    ["02052022_BrnoKrios_Arctis_grid_hGIS_Position_15"]="tomo355"
    ["02052022_BrnoKrios_Arctis_grid_hGIS_Position_19"]="tomo359"
    ["02052022_BrnoKrios_Arctis_grid_hGIS_Position_47"]="tomo378"
    ["02052022_BrnoKrios_Arctis_grid_hGIS_Position_49"]="tomo380"
    ["02052022_BrnoKrios_Arctis_grid_hGIS_Position_59"]="tomo391"
    ["02052022_BrnoKrios_Arctis_grid_hGIS_Position_63"]="tomo396"
    ["02052022_BrnoKrios_Arctis_grid_hGIS_Position_64"]="tomo397"
    ["02052022_BrnoKrios_Arctis_grid_hGIS_Position_77"]="tomo411"
    ["02052022_BrnoKrios_Arctis_grid_hGIS_Position_78"]="tomo412"
    ["02052022_BrnoKrios_Arctis_grid_hGIS_Position_80"]="tomo415"
    ["02052022_BrnoKrios_Arctis_grid_hGIS_Position_83"]="tomo418"
    ["12052022_BrnoKrios_Arctis_grid_newGISc_Position_13"]="tomo423"
    ["12052022_BrnoKrios_Arctis_grid_newGISc_Position_29"]="tomo440"
)

clean_directory() {
    local entry_path="$1"
    echo "Cleaning $entry_path"
    rm -rf "$entry_path"/{AreTomo,ctffind4,metadata,tiltctf}
    if [ -d "$entry_path/frames" ]; then
        mv "$entry_path"/frames/*.eer "$entry_path/" 2>/dev/null
        rmdir "$entry_path/frames" 2>/dev/null
    fi
}

# Step 1: Clean all folders
for entry_path in "$BASE_DIR"/*/; do
    clean_directory "$entry_path"
done

# Step 2: Make gain directories
mkdir -p "$BASE_DIR/gain1" "$BASE_DIR/gain2"

# Step 3: Move gain1 folders
for folder in "${GAIN1[@]}"; do
    if [ -d "$BASE_DIR/$folder" ]; then
        mv "$BASE_DIR/$folder" "$BASE_DIR/gain1/"
    fi
done

# Step 4: Copy overlapping gain1 folders into gain2, move the rest
for folder in "${GAIN2[@]}"; do
    if [ -d "$BASE_DIR/gain1/$folder" ]; then
        cp -r "$BASE_DIR/gain1/$folder" "$BASE_DIR/gain2/"
    elif [ -d "$BASE_DIR/$folder" ]; then
        mv "$BASE_DIR/$folder" "$BASE_DIR/gain2/"
    fi
done

# Step 5: Copy and rename gain references
GAINREF1_SRC="$BASE_DIR/01122021/gainref"
GAINREF2_SRC="$BASE_DIR/06042022/gainref"

GAINREF1_FILE=$(find "$GAINREF1_SRC" -type f -name "*.gain" | head -n 1)
GAINREF2_FILE=$(find "$GAINREF2_SRC" -type f -name "*.gain" | head -n 1)

if [ -f "$GAINREF1_FILE" ]; then
    cp "$GAINREF1_FILE" "$BASE_DIR/gain1/gainref1.gain"
    echo "→ gainref1.gain copied to gain1/"
    rm -rf "$BASE_DIR/01122021"
else
    echo "No .gain file found in $GAINREF1_SRC"
fi

if [ -f "$GAINREF2_FILE" ]; then
    cp "$GAINREF2_FILE" "$BASE_DIR/gain2/gainref2.gain"
    echo "→ gainref2.gain copied to gain2/"
    rm -rf "$BASE_DIR/06042022"
else
    echo "No .gain file found in $GAINREF2_SRC"
fi

# Step 6: Rename folders and files
for GAIN_DIR in "$BASE_DIR/gain1" "$BASE_DIR/gain2"; do
    for original_name in "${!RENAME_MAP[@]}"; do
        new_name="${RENAME_MAP[$original_name]}"
        src_folder="$GAIN_DIR/$original_name"
        dest_folder="$GAIN_DIR/$new_name"

        if [ -d "$src_folder" ]; then
            echo "Renaming folder: $original_name → $new_name"
            mv "$src_folder" "$dest_folder"

            # Rename .mdoc
            old_mdoc=$(find "$dest_folder" -maxdepth 1 -name "*.mdoc" | head -n 1)
            if [ -f "$old_mdoc" ]; then
                mv "$old_mdoc" "$dest_folder/${new_name}.mdoc"
                echo "Renamed .mdoc to ${new_name}.mdoc"
            fi

            # Rename all .eer files
            for eer_file in "$dest_folder"/*.eer; do
                if [[ -f "$eer_file" ]]; then
                    angle=$(echo "$eer_file" | sed -E 's/.*\[(.*)\]_EER\.eer/\1/')
                    mv "$eer_file" "$dest_folder/${new_name}[${angle}]_EER.eer"
                    echo "Renamed .eer to ${new_name}[${angle}]_EER.eer"
                fi
            done
        fi
    done
done

echo "Everything is cleaned, organized, and renamed."

Finally run the python_organise.py script below. It will modify the .mdoc and create gain1_links and gain2_links folders that contains soft links to .eer and .mdoc.

python_organise.py


import os
import re
import sys

def parse_mdoc(mdoc_path):
    subframe_pattern = re.compile(r"SubFramePath\s*=\s*(.+)")
    bracket_pattern = re.compile(r"\[(.+?)\]")

    mdoc_data = []
    subframe_paths = []

    with open(mdoc_path, 'r') as mdoc_file:
        for line in mdoc_file:
            mdoc_data.append(line)
            match = subframe_pattern.match(line.strip())
            if match:
                old_path = match.group(1).strip()
                bracket_value = bracket_pattern.search(old_path)
                if bracket_value:
                    subframe_paths.append((old_path, bracket_value.group(1)))
    return mdoc_data, subframe_paths

def find_matching_files(folder_path):
    bracket_pattern = re.compile(r"\[(.+?)\]")
    eer_files = []

    for filename in os.listdir(folder_path):
        if filename.endswith(".eer"):
            match = bracket_pattern.search(filename)
            if match:
                bracket_value = match.group(1)
                full_path = os.path.join(folder_path, filename)
                eer_files.append((filename, bracket_value, full_path))
    return eer_files

def update_mdoc_in_place(mdoc_path, mdoc_data, subframe_paths, eer_files):
    updated_data = []
    for line in mdoc_data:
        updated_line = line
        for old_path, bracket_value in subframe_paths:
            if old_path in line:
                matching_files = [file for file in eer_files if file[1] == bracket_value]
                if matching_files:
                    new_filename = matching_files[0][2]
                    updated_line = line.replace(old_path, new_filename)
                break
        updated_data.append(updated_line)

    with open(mdoc_path, 'w') as mdoc_file:
        mdoc_file.writelines(updated_data)

def process_folder(folder_path):
    eer_files = find_matching_files(folder_path)

    for filename in os.listdir(folder_path):
        if filename.endswith(".mdoc"):
            mdoc_path = os.path.join(folder_path, filename)
            mdoc_data, subframe_paths = parse_mdoc(mdoc_path)
            update_mdoc_in_place(mdoc_path, mdoc_data, subframe_paths, eer_files)
            print(f"Updated: {mdoc_path}")

def create_symlinks(base_dir, gain_name):
    gain_path = os.path.join(base_dir, gain_name)
    links_dir = os.path.join(os.path.dirname(base_dir), f"{gain_name}_links")

    os.makedirs(links_dir, exist_ok=True)

    for folder in os.listdir(gain_path):
        folder_path = os.path.join(gain_path, folder)
        if not os.path.isdir(folder_path):
            continue
        for file in os.listdir(folder_path):
            if file.endswith(".eer") or file.endswith(".mdoc"):
                target_file = os.path.join(folder_path, file)
                link_path = os.path.join(links_dir, file)

                # Create relative path for symlink
                rel_target = os.path.relpath(target_file, links_dir)

                try:
                    if os.path.exists(link_path) or os.path.islink(link_path):
                        os.remove(link_path)
                    os.symlink(rel_target, link_path)
                    print(f"Linked: {file} → {rel_target}")
                except Exception as e:
                    print(f"Failed to link {file}: {e}")

def main(base_dir):
    for gain_dir in ['gain1', 'gain2']:
        full_gain_path = os.path.join(base_dir, gain_dir)
        if not os.path.isdir(full_gain_path):
            continue
        for entry in os.listdir(full_gain_path):
            entry_path = os.path.join(full_gain_path, entry)
            if os.path.isdir(entry_path):
                process_folder(entry_path)
        create_symlinks(base_dir, gain_dir)

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: python update_subframe_paths.py chlamy_visual_proteomics")
    else:
        main(sys.argv[1])

Make sure you have python loaded and run:

python python_organise.py chlamy_visual_proteomics

At the end you should have a folder named chlamy_visual_proteomics containing four folders. gain1 and gain2 folders contains the raw data and the gain references, gain1_links and gain2_links contain links to the .eer and the .mdoc in a single folder.

Additionally you can directly download these files here:

Two text files with thickness measurements for automated AreTomo TS alignment: https://github.com/TomoGuide/TomoGuide.github.io/tree/main/docs/data/Z_height/
Templates and masks for Template Matching: https://github.com/TomoGuide/TomoGuide.github.io/tree/main/docs/data/TM/

Software

You need to have access to a GPU-powered machine running on Linux. It can be a local machine or a computing cluster. In our case, we work on a computing cluster with a SLURM system. You will also need to have appropriate CUDA drivers (this means you need to have NVIDIA GPUs) and a Python installation.

Click on the buttons below to get more information about the main software used in that tutorial and download it:

Scipion
IMOD
RELION 5
AreTomo3
ChimeraX
ArtiaX
pytom-match-pick