This repository serves to document and make available to the community the code of the publication 'To join or not to join: handling biological replicates in long-read RNA sequencing data'.
The pre-print is available on bioRxiv at: https://doi.org/10.64898/2025.12.09.693269
The paper investigates strategies for combining long-read RNA-seq data from multiple biological replicates for transcriptome reconstruction. We compare two strategies: "Join & Call" (J&C), where reads from all replicates are combined before performing transcriptome reconstruction, and "Call & Join" (C&J), where transcriptome reconstruction is performed on each replicate individually before combining the resulting annotations. We compare IsoQuant, FLAIR, Bambu, and TALON on both PacBio and ONT data, as well as Mandalorion and IsoSeq + SQANTI3 Filter on PacBio data only, using a data set of mouse brain and kidney tissue with 5 biological replicates per tissue.
All data generated in this study have been submitted to the European Nucleotide Archive (ENA; https://www.ebi.ac.uk/ena/browser/home). Mouse brain and kidney data generated using PacBio, ONT and Illumina sequencing are accessible under accession numbers PRJEB85167 and PRJEB94912.
For convenience, a small test set as well as reference data are provided at https://drive.google.com/drive/folders/1c3ZDXIxwh_Icx5-KVcjch-QnGkfk_j_B?usp=sharing
The code is organized as a nextflow pipeline that runs a specified transcriptome reconstruction tool (one of those listed above) with both strategies, on both brain and kidney tissues, for a specified data type (ONT or PacBio, where compatible).
The scripts used by the pipeline are specifically designed to be run on a SLURM cluster and will not be compatible with other environments out-of-the-box.
There are further options (e.g. not using supporting short-read data for FLAIR, or running partial joins with 2, 3, or 4 samples) which are not used for the analyses in the paper.
Under /src/util/conda_envs, .yaml files to configure the needed conda environments can be found.
Under /src/util/tool_setup, instructions for cloning the repositories of tools which need to have a local copy can be found.
- Clone the repository
- Run the script src/util/tool_setup/clone_tool_git_repos.sh in order to clone the repositories of SQANTI3 and TAMA at specific commits. If intending to use TALON (TranscriptClean) and/or Mandalorion, uncomment the corresponding sections of the script.
- Run the script src/util/conda_envs/install_conda_envs.sh in order to install the conda environments. Uncomment any of the isoform identification tools depending on what you intend to run (by default only the IsoQuant environment is installed).
Typical install time depends on whether all or only some environments are installed, as well as on the speed of the conda solver, but installation is generally expected to complete within 30 minutes.
Examples of how to use the SLURM-wrapper nextflow_wrapper.sbatch script to launch the main_workflow.nf (once pre-processing has been conducted and metadata files have been created):
- Run FLAIR with supporting short reads on ONT data:
sbatch nextflow_wrapper.sbatch --data ont --algorithm flair --stringent true --use_sr true --sr_config star --result_name ont/flair_ar_sr/run1
- Run IsoQuant on PacBio data:
sbatch nextflow_wrapper.sbatch --data isoseq --algorithm isoquant --result_name isoseq/isoquant/run1
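To queue the same tool for several data types without retyping the command, a small helper can assemble the call. The following is a minimal sketch and not part of the repository: the function run_tool and the DRY_RUN switch are hypothetical, and it only assumes the flags --data, --algorithm and --result_name and the values shown in the examples above. With DRY_RUN=1 (the default here) the command is printed instead of submitted.

```shell
# Hypothetical helper: build one nextflow_wrapper.sbatch call.
# $1 = data type, $2 = algorithm, $3 = run label
run_tool() {
    # Replace the positional parameters with the full command line.
    set -- sbatch nextflow_wrapper.sbatch \
        --data "$1" --algorithm "$2" --result_name "$1/$2/$3"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Dry run: only show what would be submitted.
        printf '%s\n' "$*"
    else
        # Real run: hand the command to SLURM.
        "$@"
    fi
}

# Print the IsoQuant calls for both data types used in the examples above.
for data_type in ont isoseq; do
    run_tool "$data_type" isoquant run1
done
```

Setting DRY_RUN=0 before the loop would submit the jobs; tools that need extra options (such as the FLAIR example above) would require the helper to be extended.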
As mentioned above, a test subset as well as the full reference data are provided at https://drive.google.com/drive/folders/1c3ZDXIxwh_Icx5-KVcjch-QnGkfk_j_B?usp=sharing
Follow the instructions in the README.txt provided with the subset to set up the required paths for the test run.
Then run with
sbatch nextflow_wrapper.sbatch --data ont_subset --algorithm isoquant --result_name isoseq/isoquant/testrun1
Execution time depends on cluster availability but is expected to be around 20 minutes.
Expected output:
- Symlinks to all created intermediate results (raw results of IsoQuant, TAMA Merge, SQANTI3, etc.) in data
- Main result files needed for plotting in reports
From there, plots can be generated with the scripts in src/plotting/, primarily all_plots.Rmd. The script's chunk "init-dirs" needs to be updated with absolute paths to the created "report" directories (other plotting scripts will need similar adjustments). Note that running the full scripts, which generate the assembled figures for multiple isoform identification tools and for both ONT and PacBio data, requires ALL 10 reports to be present. Individual plots, however, should be producible from a single report directory.
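Before knitting the full plotting script, it can help to confirm that every report directory is actually in place. The sketch below is an illustration only: check_reports is a hypothetical helper, and the directory names in the demo are made up rather than the names the pipeline produces. It takes the list of report paths you intend to enter in "init-dirs" and reports any that are missing.

```shell
# Hypothetical helper: verify that each given report directory exists.
# Returns success only if none are missing.
check_reports() {
    missing=0
    for dir in "$@"; do
        if [ ! -d "$dir" ]; then
            echo "missing report directory: $dir" >&2
            missing=$((missing + 1))
        fi
    done
    echo "$(( $# - missing )) of $# report directories found"
    [ "$missing" -eq 0 ]
}

# Demo with two made-up directories in a temporary location.
demo="$(mktemp -d)"
mkdir -p "$demo/ont_isoquant" "$demo/isoseq_isoquant"
check_reports "$demo/ont_isoquant" "$demo/isoseq_isoquant"
```

For the full figures, the same call would be given all 10 report paths.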
/src/util contains utility scripts, including the aforementioned environment and tool setup.
/src/data_preparation_scripts contains scripts to set up the data, including creating the concatenated fastq files needed for the J&C strategy.
/src/nextflow contains the nextflow pipeline, separated into the following subdirectories:
- /src/nextflow/modules contains the .nf files defining the modules of different steps.
- /src/nextflow/scripts contains the actual .sh and .sbatch scripts used by the modules for execution on the SLURM cluster.
- /src/nextflow/workflows contains a variety of workflows; the primary one is main_workflow.nf. On a SLURM cluster, this workflow is run through the nextflow_wrapper.sbatch script.
/src/plotting contains scripts to create the plots used in the paper from the results generated by the workflows.
/src/resouce_inspection contains the scripts used to obtain runtime and memory usage information from the SLURM jobs.
/reports/empty_report contains the basic directory structure for the reports created by the nextflow workflow along with some helper scripts.
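The concatenated fastq files mentioned for /src/data_preparation_scripts are the "Join" step of the J&C strategy. As a rough illustration of the idea, and not the repository's actual script, the sketch below concatenates per-replicate files into one input file. The function join_replicates, the file names and the tiny reads are all made up for the demo.

```shell
# Hypothetical helper: concatenate replicate FASTQ files into one file.
# $1 = output file, remaining arguments = per-replicate input files
join_replicates() {
    out="$1"
    shift
    cat "$@" > "$out"
}

# Demo with two made-up single-read replicates.
tmp="$(mktemp -d)"
printf '@r1\nACGT\n+\nIIII\n' > "$tmp/rep1.fastq"
printf '@r2\nTTGA\n+\nIIII\n' > "$tmp/rep2.fastq"
join_replicates "$tmp/joined.fastq" "$tmp/rep1.fastq" "$tmp/rep2.fastq"

# A FASTQ record spans 4 lines, so the read count is lines / 4.
echo "reads in joined file: $(( $(wc -l < "$tmp/joined.fastq") / 4 ))"
```

Plain concatenation also works for gzip-compressed files, since concatenated gzip streams form a valid gzip file. For C&J, by contrast, each replicate file would be passed to the reconstruction tool on its own and the resulting annotations merged afterwards.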
For questions about this code and its reuse or adaptation, please use the GitHub issues or contact me through [email protected].
- SLURM 23.11.10
- conda 23.5.2
- Python 3.11.3 (may vary by environment, check corresponding .yaml files)
- R 4.5.1 (may vary by environment, check corresponding .yaml files)
- Lima v2.9.0
- minimap2 v2.17
- Dorado v7.2.13
- fastqc v0.12.0
- Pychopper v2.5.0
- bcl2fastq v2.20
- STAR 2.7.10b
- SQANTI3 v5.3.6
- TAMA Merge commit 2fa3c30
- IsoQuant 3.6.3
- FLAIR v2.1.2
- Bambu 3.8.3
- TALON 6.0.1
- transcriptclean 2.1
- Mandalorion v4.5.0
- IsoSeq v4.0.0
- TUSCO 0.99