Analysis

This repository documents and shares the code for the publication 'To join or not to join: handling biological replicates in long-read RNA sequencing data'.

The pre-print is available on bioRxiv at: https://doi.org/10.64898/2025.12.09.693269

Analysis

The paper investigates how to combine long-read RNA-seq data from multiple biological replicates for transcriptome reconstruction. We compare two strategies: "Join & Call" (J&C), where reads from all replicates are combined before performing transcriptome reconstruction, and "Call & Join" (C&J), where transcriptome reconstruction is performed on each replicate individually before combining the resulting annotations. We evaluate IsoQuant, FLAIR, Bambu, and TALON on both PacBio and ONT data, as well as Mandalorion and IsoSeq + SQANTI3 Filter on PacBio data only, using a dataset of mouse brain and kidney tissue with five biological replicates per tissue.
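The input side of the J&C strategy can be illustrated with a minimal sketch: the per-replicate read files are concatenated into one file before a reconstruction tool is run. The function name and file names below are hypothetical and are not taken from the repository's data preparation scripts.

```shell
#!/usr/bin/env bash
# Sketch: build the joined input for the J&C strategy by concatenating
# per-replicate FASTQ files. Function and file names are hypothetical.

join_replicates() {
    local out="$1"
    shift
    # Require at least one input file after the output path.
    [ "$#" -ge 1 ] || { echo "usage: join_replicates OUT IN..." >&2; return 1; }
    # FASTQ records are self-contained, so plain concatenation is sufficient.
    cat "$@" > "$out"
}
```

With five replicates per tissue, a call could look like `join_replicates brain_joined.fastq brain_rep1.fastq brain_rep2.fastq brain_rep3.fastq brain_rep4.fastq brain_rep5.fastq`. For C&J, the tool would instead be run once per replicate file and the resulting annotations merged afterwards.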

Data availability

All data generated in this study have been submitted to the European Nucleotide Archive (ENA; https://www.ebi.ac.uk/ena/browser/home). Mouse brain and kidney data generated using PacBio, ONT and Illumina sequencing are accessible under accession numbers PRJEB85167 and PRJEB94912.
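As a convenience, the run tables for both accessions can be requested from the ENA Portal API. The sketch below only builds the request URLs; the choice of fields is an assumption and may need adjusting, for example if a data type is stored as submitted files rather than FASTQ.

```shell
#!/usr/bin/env bash
# Sketch: build an ENA Portal API "filereport" URL for a study accession.
# The accessions come from the text above; the field selection is an assumption.

ena_filereport_url() {
    local accession="$1"
    local fields="run_accession,fastq_ftp,submitted_ftp"
    printf 'https://www.ebi.ac.uk/ena/portal/api/filereport?accession=%s&result=read_run&fields=%s&format=tsv\n' \
        "$accession" "$fields"
}

# Print the URLs for both projects.
for acc in PRJEB85167 PRJEB94912; do
    ena_filereport_url "$acc"
done
```

A table can then be fetched with, for example, `curl -sS "$(ena_filereport_url PRJEB85167)" > PRJEB85167_runs.tsv`.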

For convenience, a small test set as well as reference data are provided at https://drive.google.com/drive/folders/1c3ZDXIxwh_Icx5-KVcjch-QnGkfk_j_B?usp=sharing

Code & Reusability

The code is organized as a Nextflow pipeline that runs a specified transcriptome reconstruction tool (one of those listed above) with both strategies, on both brain and kidney tissue, for a specified data type (ONT or PacBio, where compatible).

The scripts used by the pipeline are specifically designed to be run on a SLURM cluster and will not be compatible with other environments out-of-the-box.

Further options exist (e.g. not using supporting short-read data for FLAIR, or running partial joins with 2, 3, or 4 samples), but they are not used for the analyses in the paper.

/src/util/conda_envs contains the .yaml files used to configure the required conda environments.

/src/util/tool_setup contains instructions for cloning the repositories of tools that need a local copy.

Installation instructions

Clone the repository.
Run the script src/util/tool_setup/clone_tool_git_repos.sh to clone the repositories of SQANTI3 and TAMA at specific commits. If you intend to use TALON (TranscriptClean) and/or Mandalorion, uncomment the corresponding sections of the script.
Run the script src/util/conda_envs/install_conda_envs.sh to install the conda environments. Uncomment the isoform identification tools you intend to run (by default, only the IsoQuant environment is installed).

Install time depends on whether all or only some environments are installed, as well as on the speed of the conda solver, but installation should generally complete within 30 minutes.
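The installation steps above can be sketched as a short script. The repository URL and the use of `bash` to run the setup scripts from the repository root are assumptions. The `run` helper and its `DRY_RUN` switch are hypothetical additions that let the commands be printed without being executed.

```shell
#!/usr/bin/env bash
# Sketch of the installation steps. Assumptions: the repository URL, and that
# the setup scripts are run with bash from the repository root.

REPO_URL="https://github.com/ConesaLab/join_and_call_paper.git"  # assumed URL

# Run a command, or only print it when DRY_RUN=1 (hypothetical helper).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

install_pipeline() {
    run git clone "$REPO_URL" &&
    run cd join_and_call_paper &&
    run bash src/util/tool_setup/clone_tool_git_repos.sh &&
    run bash src/util/conda_envs/install_conda_envs.sh
}
```

Calling `install_pipeline` with `DRY_RUN=1` set prints the four commands; without it they are executed in order and the sequence stops at the first failure. Any optional tool sections still need to be uncommented in the two setup scripts beforehand.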

Example usage

Examples of how to use the SLURM wrapper script nextflow_wrapper.sbatch to launch main_workflow.nf (once pre-processing has been completed and the metadata files have been created):

Run FLAIR with supporting short reads on ONT data:

sbatch nextflow_wrapper.sbatch --data ont --algorithm flair --stringent true --use_sr true --sr_config star --result_name ont/flair_ar_sr/run1

Run IsoQuant on PacBio data:

sbatch nextflow_wrapper.sbatch --data isoseq --algorithm isoquant --result_name isoseq/isoquant/run1
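When several runs are launched, the command line can be assembled by a small helper. The sketch below uses only the `--data`, `--algorithm` and `--result_name` flags shown in the examples; the helper name and the `<data>/<algorithm>/<label>` naming scheme for results are assumptions modelled on the IsoQuant example.

```shell
#!/usr/bin/env bash
# Sketch: build the sbatch command for one run. Only --data, --algorithm and
# --result_name are taken from the examples above; the helper is hypothetical.

build_submit_cmd() {
    local data="$1" algorithm="$2" run_label="${3:-run1}"
    printf 'sbatch nextflow_wrapper.sbatch --data %s --algorithm %s --result_name %s/%s/%s\n' \
        "$data" "$algorithm" "$data" "$algorithm" "$run_label"
}

# Print (not submit) the command for the IsoQuant PacBio example.
build_submit_cmd isoseq isoquant
```

The helper prints the command rather than submitting it, so the output can be inspected first. Tools that take extra options, such as the FLAIR example with short-read support, would need those flags appended.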

Test run

As mentioned above, a test subset as well as the full reference data are provided at https://drive.google.com/drive/folders/1c3ZDXIxwh_Icx5-KVcjch-QnGkfk_j_B?usp=sharing

Follow the instructions in the README.txt provided with the subset to set up the required paths for the test run.

Then run:

sbatch nextflow_wrapper.sbatch --data ont_subset --algorithm isoquant --result_name isoseq/isoquant/testrun1

Execution time depends on cluster availability but is expected to be around 20 minutes.

Expected output:

Symlinks to all created intermediate results (raw results of IsoQuant, TAMA Merge, SQANTI3, etc.) in data
Main result files needed for plotting in reports
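A quick sanity check of these outputs could look like the sketch below. It assumes, purely for illustration, that the `data` and `reports` directories sit under one common result directory passed as an argument; the actual locations depend on the paths configured for the run.

```shell
#!/usr/bin/env bash
# Sketch: sanity-check a finished run. The layout (a "data" and a "reports"
# directory under one result directory) is an assumption for illustration.

check_run_outputs() {
    local result_dir="$1"
    local n_links n_reports
    # Count symlinks under data/ and regular files under reports/.
    n_links=$(find "$result_dir/data" -type l 2>/dev/null | wc -l | tr -d ' ')
    n_reports=$(find "$result_dir/reports" -type f 2>/dev/null | wc -l | tr -d ' ')
    echo "symlinks in data: $n_links; files in reports: $n_reports"
    # Succeed only if both kinds of output are present.
    [ "$n_links" -gt 0 ] && [ "$n_reports" -gt 0 ]
}
```

The function returns a non-zero status when either directory is missing or empty, so it can be used in a conditional before starting the plotting step.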

From there, plots can be generated with the scripts in src/plotting/, primarily all_plots.Rmd. The script's "init-dirs" chunk needs to be updated with absolute paths to the created "report" directories (other plotting scripts need similar adjustments). Note that running the full scripts, which generate the assembled figures for multiple isoform identification tools and for both ONT and PacBio data, requires all 10 reports to be present. It should, however, be possible to generate individual plots from a single report directory.
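The ten reports correspond to the tool and data combinations described in the Analysis section: four tools on both ONT and PacBio data, plus two tools on PacBio data only. The sketch below enumerates them and lists which report directories are missing. The `<data>/<tool>` directory naming is hypothetical and should be replaced with the result names actually used.

```shell
#!/usr/bin/env bash
# Sketch: enumerate the ten tool/data combinations and report which report
# directories are missing. The <data>/<tool> naming is hypothetical.

expected_reports() {
    local tool
    # Four tools were run on both ONT and PacBio data.
    for tool in isoquant flair bambu talon; do
        echo "ont/$tool"
        echo "isoseq/$tool"
    done
    # Two tools were run on PacBio data only.
    echo "isoseq/mandalorion"
    echo "isoseq/isoseq_sqanti3"
}

missing_reports() {
    local base="$1" rel
    # Print every expected directory that does not exist under the base path.
    expected_reports | while read -r rel; do
        [ -d "$base/$rel" ] || echo "$rel"
    done
}
```

An empty result from `missing_reports /path/to/reports` would indicate that the full plotting scripts have all the inputs they expect.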

Repository file tree

/src/util contains utility scripts, including the aforementioned environment and tool setup.

/src/data_preparation_scripts contains scripts to set up the data, including creating the concatenated fastq files needed for the J&C strategy.

/src/nextflow contains the Nextflow pipeline, separated into the following subdirectories:

/src/nextflow/modules contains the .nf files defining the modules for the different steps.
/src/nextflow/scripts contains the .sh and .sbatch scripts that the modules execute on the SLURM cluster.
/src/nextflow/workflows contains a variety of workflows; the primary one is main_workflow.nf. On a SLURM cluster, this workflow is run through the nextflow_wrapper.sbatch script.

/src/plotting contains scripts to create the plots used in the paper from the results generated by the workflows.

/src/resouce_inspection contains the scripts used to obtain runtime and memory usage information from the SLURM jobs.

/reports/empty_report contains the basic directory structure for the reports created by the nextflow workflow along with some helper scripts.
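For the runtime information mentioned under the resource inspection scripts, SLURM's `sacct` reports elapsed time as a formatted string. The sketch below is not the repository's own script. It is a hypothetical helper that converts such a string to seconds, assuming the `[D-]HH:MM:SS` form.

```shell
#!/usr/bin/env bash
# Sketch: convert a SLURM elapsed time of the assumed form [D-]HH:MM:SS to
# seconds. Hypothetical helper, not part of the repository.

elapsed_to_seconds() {
    local t="$1" days=0 h m s rest
    # Split off an optional leading day count, e.g. "1-02:03:04".
    case "$t" in
        *-*) days="${t%%-*}"; t="${t#*-}" ;;
    esac
    h="${t%%:*}"; rest="${t#*:}"
    m="${rest%%:*}"; s="${rest#*:}"
    # Drop a leading zero so that "08" or "09" is not parsed as octal.
    h="${h#0}"; m="${m#0}"; s="${s#0}"
    echo $(( $days * 86400 + ${h:-0} * 3600 + ${m:-0} * 60 + ${s:-0} ))
}
```

The input could come from a call such as `sacct -j <jobid> --format=JobID,Elapsed,MaxRSS --parsable2 --noheader`, with the second `|`-separated column passed to the helper.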

Questions

For questions about this code and its reuse or adaptation, please open a GitHub issue or contact me at [email protected].

Software versions tested on

SLURM 23.11.10
conda 23.5.2
Python 3.11.3 (may vary by environment, check corresponding .yaml files)
R 4.5.1 (may vary by environment, check corresponding .yaml files)
Lima v2.9.0
minimap2 v2.17
Dorado v7.2.13
fastqc v0.12.0
Pychopper v2.5.0
bcl2fastq v2.20
STAR 2.7.10b
SQANTI3 v5.3.6
TAMA Merge commit 2fa3c30
IsoQuant 3.6.3
FLAIR v2.1.2
Bambu 3.8.3
TALON 6.0.1
transcriptclean 2.1
Mandalorion v4.5.0
IsoSeq v4.0.0
TUSCO 0.99
