ASSETS Workflow

This workflow processes basecalled Nanopore data and produces internal and external reports. For the current program versions and parameters used in production, contact Luke McCarthy.

This repository contains custom scripts and code used in the analyses presented in 10.3390/antibiotics14111098. The code is made publicly available to ensure reproducibility but is not designed or maintained for general use as a cohesive workflow.

Workflow diagram

(workflow diagram image)

Instructions for DRAC (Cedar)

Note: You may need to uncomment the Python environment activation commands in shell rules (e.g., source path/to/some/virtual/env/bin/activate). You will also need to run Snakemake with the --use-envmodules option.

Cloning the repository and installing Snakemake

  1. Login and clone the repository to your scratch directory on DRAC.

    git clone https://github.com/stothard-group/ASSETS_2.git
    cd ASSETS_2
  2. Load the appropriate modules, updating versions as needed (see the DRAC docs):

    module load StdEnv/2020 python/3.7
  3. Create a Python virtual environment in the resources/software subdirectory using virtualenv (see the DRAC docs):

    cd resources/software
    virtualenv assets-snakemake-env
    cd ../..
  4. Activate the virtual environment:

    source resources/software/assets-snakemake-env/bin/activate
  5. Upgrade pip:

    pip install --upgrade pip
  6. Install Snakemake and the other dependencies defined in the workflow/envs/assets-snakemake_requirements.txt file (this may take several minutes):

    pip install -r workflow/envs/assets-snakemake_requirements.txt
  7. When you are done working on the project, deactivate the virtual environment:

    deactivate
  8. When you are ready to work on the project again, activate the virtual environment:

    cd path/to/ASSETS_2
    module load StdEnv/2020 python/3.7
    source resources/software/assets-snakemake-env/bin/activate

Preparing input data and configuring the workflow

  1. Define the location of the input data and the output directories in the config/config.yaml file.

    • First, if the config/config.yaml file does not exist, copy the example file:

      cp config/example_config.yaml config/config.yaml
    • Customize the value of the INPUT_DIR variable in the config/config.yaml file so that the workflow can find the input files. By default, this is a subdirectory in the resources directory of the project directory. The input directory should contain subdirectories named according to sample barcodes (e.g., barcode01, barcode02, etc.). Each barcode subdirectory should contain the basecalled fastq files for that sample. The workflow will rename the barcode subdirectories to include the sample ID, which is defined in the metadata file (see below).

    • Customize the value of the OUTPUT_DIR variable in the config/config.yaml file so that the workflow will write output files to the correct location. By default, this is the results subdirectory of the project directory.

  2. Customize the value of the SAMPLE_METADATA variable in the config/config.yaml file so that the workflow can find the JSON metadata file. By default, this is a file in the resources directory of the project directory. The metadata file should contain a JSON object with keys that are the sample IDs and values that are the ASSETS IDs. For example:

    {
        "sample_barcode01_assets_id": "2045Bi2-067h0",
        "sample_barcode02_assets_id": "2045Bi2-070h0",
        "sample_barcode03_assets_id": "2045Ai2-012h0",
        "sample_barcode04_assets_id": "2046bi2-013h0",
        "sample_barcode05_assets_id": "2045Bix2-015h0",
        "sample_barcode06_assets_id": "2048Ai2-036h0",
        "sample_barcode07_assets_id": "2048Ai2-083h0",
        "sample_barcode08_assets_id": "2046Ai2-095h0"
    }

    The ASSETS ID and barcode information are used to organize the results accordingly.

  3. Check the organisms listed in the config/pathogen_list.txt file.

  4. The config/cluster_config.yaml file defines resource requirements for each step (rule) in the workflow, and may need to be customized depending on the cluster you are using. The default configuration works on the cedar.computecanada.ca cluster with the Stothard group's allocation.
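
For reference, the relevant entries of the two configuration files might look like the following sketches. All values here are illustrative (the directory names, metadata filename, rule name, and resource keys are assumptions); consult the example files in config/ for the real defaults.

```yaml
# config/config.yaml (illustrative values only; adjust for your system)
INPUT_DIR: resources/input_data                  # hypothetical path
OUTPUT_DIR: results
SAMPLE_METADATA: resources/sample_metadata.json  # hypothetical filename
```

```yaml
# config/cluster_config.yaml (illustrative; the real rule names and
# resource keys are defined in the file itself)
__default__:
  time: "01:00:00"
  mem: 4G
  cpus: 1
some_demanding_rule:   # hypothetical rule name
  time: "12:00:00"
  mem: 32G
  cpus: 8
```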

Running the workflow

  1. Activate the virtual environment:

    cd path/to/ASSETS_2
    module load StdEnv/2020 python/3.7
    source resources/software/assets-snakemake-env/bin/activate
  2. Copy and customize the example_run_workflow.sh script. This contains the snakemake command, which will need to be adapted to your system and to whether you are using conda environments, environment modules, etc.

    cp example_run_workflow.sh run_workflow.sh
  3. Start a screen session so that the workflow continues to run after you log out of the server or are disconnected. (tmux is a popular alternative to screen that may also work.) You can detach from the session with Ctrl-a d and reattach later with screen -r. Then activate the appropriate virtual environment in which Snakemake is installed.

    screen
    module load StdEnv/2020 python/3.7
    source resources/software/assets-snakemake-env/bin/activate
  4. In the screen session, run the run_workflow.sh script, which contains the snakemake command.

    sh run_workflow.sh
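
For reference, the snakemake command inside run_workflow.sh might look roughly like the sketch below. Every flag here is an assumption to be adapted to your scheduler, allocation, and the options discussed above (e.g., --use-envmodules on DRAC, --use-conda on a local server), and the {cluster.*} keys must match your config/cluster_config.yaml.

```shell
#!/bin/sh
# Sketch only; adapt all flags to your system and allocation.
snakemake \
    --use-envmodules \
    --jobs 10 \
    --cluster-config config/cluster_config.yaml \
    --cluster "sbatch --time={cluster.time} --mem={cluster.mem} --cpus-per-task={cluster.cpus}"
```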

Instructions for running on a local server (Helix)

Similar to the above, but use mamba to create an environment for running Snakemake instead of the pip virtual environment.

mamba env create --file workflow/envs/assets-snakemake.yaml

Then, when you want to run the workflow, first activate the environment with this command:

conda activate assets-snakemake

To set up pre-commit hooks for checking code formatting, run:

pre-commit install
pre-commit autoupdate

Also, use the --use-conda option when running Snakemake.

Input

Metadata file

A JSON file containing metadata for all the samples in the sequencing run is required. The structure of the file is as follows:

{
    "sample_{barcode}_assets_id": "{ASSETS ID}"
}

For example:

{
    "sample_barcode09_assets_id": "2045Bi2-067h0",
    "sample_barcode10_assets_id": "2045Bi2-070h0",
    "sample_barcode11_assets_id": "2045Ai2-012h0",
    "sample_barcode12_assets_id": "2046bi2-013h0",
    "sample_barcode13_assets_id": "2045Bix2-015h0",
    "sample_barcode14_assets_id": "2048Ai2-036h0",
    "sample_barcode15_assets_id": "2048Ai2-083h0",
    "sample_barcode16_assets_id": "2046Ai2-095h0"
}
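
Because a malformed metadata file will cause the workflow to fail, it can be worth validating the JSON before launching. A minimal check, assuming python3 is on your PATH (the filename here is hypothetical and should match your SAMPLE_METADATA setting):

```shell
# Create a small metadata file (for illustration) and validate it;
# python3 -m json.tool exits non-zero on invalid JSON.
cat > sample_metadata.json <<'EOF'
{
    "sample_barcode09_assets_id": "2045Bi2-067h0",
    "sample_barcode10_assets_id": "2045Bi2-070h0"
}
EOF
python3 -m json.tool sample_metadata.json > /dev/null && echo "valid JSON"
```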

Fastq files

Fastq files are in directories labelled by barcode. The ASSETS ID and barcode information from the metadata file are used to rename these directories to include the ASSETS ID. This is done by the initial rule of the workflow, rename. The new directory names are of the format {ASSETS ID}_{barcode}, and this is termed the {sample} in both the workflow and this README.
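
The renaming performed by the rename rule can be illustrated with a small shell sketch. The barcode and ASSETS ID below are examples, and the actual rule reads the mapping from the JSON metadata file rather than hard-coding it:

```shell
# Simulate one barcode directory and rename it to {ASSETS ID}_{barcode},
# matching the {sample} naming scheme described above.
mkdir -p input/barcode09
barcode="barcode09"
assets_id="2045Bi2-067h0"
mv "input/${barcode}" "input/${assets_id}_${barcode}"
ls input   # prints 2045Bi2-067h0_barcode09
```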

Nonstandard requirements

The following are required to run one or more of the custom python3 scripts:

Author contributions

Emily K. Herman (Stothard Group): Initial development of the workflow, including design of the analyses, selection of key software dependencies, and writing the initial versions of most Snakemake rules and Python scripts.

Lael D. Barlow (Stothard Group): Continued development and maintenance of the workflow (July 2023 to present) including addition of rules to automate construction of custom databases and addition of new requested features.
