User Guide

Installation

Installation of TranD can be done in 2 steps.

Step 1:

Create a new conda environment with Python (>=3.7) and activate it. There are two ways to create the environment, depending on where you would like to save it.

## To place environment in your default conda environment location
conda create -n TranD_env python=3.8
conda activate TranD_env

## To place environment in a specific location
conda create -p /path/to/conda_envs/TranD_env python=3.8
conda activate /path/to/conda_envs/TranD_env

Step 2:

Install TranD and its dependencies. This can be done in three ways: A) with pip for the most recent PyPI release, B) with conda for the most recent conda release (should be the same release as on PyPI), or C) by cloning this GitHub repository to get the most recent TranD code (may include updates that have not been released yet).

A) For the most recent PyPI release, activate the new environment and install with pip:

pip install trand

OR

B) For the most recent conda release, activate the new environment and install with conda:

conda install -c bioconda trand

OR

C) For the most recent code in this repository, set up a development environment by cloning the repository:

cd /path/to/github_clones
git clone https://github.com/McIntyre-Lab/TranD.git

Then, change directories (cd) into the source directory of the TranD clone and install TranD using pip:

conda activate TranD_env
cd /path/to/github_clones/TranD/source
pip install -e .
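
After any of the three install routes, a quick sanity check can confirm that the trand executable is on the PATH of the active environment. The sketch below is only an illustration: check_trand is a hypothetical helper name (not part of TranD), and it relies solely on the --version flag listed in the help message.

```shell
# Hypothetical helper (not part of TranD): report whether trand is callable.
check_trand() {
    if command -v trand >/dev/null 2>&1; then
        # --version is listed in the TranD help message
        trand --version
    else
        echo "trand not found on PATH; is the conda environment active?" >&2
        return 1
    fi
}
```

Calling check_trand right after activating the environment should print the installed version, or a hint on stderr if the install did not land in the active environment.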

TranD Overview

TranD has 2 overarching modes: 1 GTF or 2 GTF inputs.

In 1-GTF mode, structural differences within a single transcriptome are described and quantified via various output files and plots. In 2-GTF mode, two transcriptomes are compared via various output files and plots.

Below is the TranD help message, which provides an overview of general usage:

usage: trand [-h] [-o OUTDIR] [-x PREFIX] [-l LOG_FILE] [-c]
             [-e {pairwise,gene}] [-k] [-p {all,both,first,second}] [-1 NAME1]
             [-2 NAME2] [-n CPU_CORES] [-f] [-s] [-i] [-v] [--version]
             input_file [input_file ...]

Perform transcript distance, complexity and transcriptome comparison analyses.

positional arguments:
  input_file            One or two input GTF file(s).

optional arguments:
  -h, --help            show this help message and exit
  -o OUTDIR, --outdir OUTDIR
                        Output directory, created if missing. Default: current
                        directory.
  -x PREFIX, --prefix PREFIX
                        Output prefix of various output files. Default: no
                        prefix for 1GTF, 'name1_vs_name2' for 2GTF.
  -l LOG_FILE, --logfile LOG_FILE
                        Log file path and name for logging processing events to file.
  -c, --complexityOnly  Used with 1 or 2 GTF input file(s). Output only
                        transcriptome complexity measures. Default: Perform
                        all analyses and comparisons including complexity
                        calculations
  -e {pairwise,gene}, --ea {pairwise,gene}
                        Specify type of within gene transcript comparison (event analysis):
                        pairwise - Used with 1 or 2 GTF input files. Compare
                        pairs of transcripts within a gene. gene - Used with 1
                        GTF input file. Compare all transcripts within a gene
                        Default: pairwise
  -k, --keepir          Keep transcripts with Intron Retention(s) when
                        generating transcript events. Only used with 1 GTF
                        input file. Default: remove
  -p {all,both,first,second}, --pairs {all,both,first,second}
                        Used with 2 GTF input files. The TranD metrics can be
                        for all transcript pairs in both GTF files or for a
                        subset of transcript pairs using the following
                        options: both - Trand metrics for the minimum pairs in
                        both GTF files, first - TranD metrics for the minimum
                        pairs in the first GTF file, second - TranD metrics
                        for the minimum pairs in the second GTF file all -
                        TranD metrics for all transcript pairs in both GTF
                        files Default: both
  -1 NAME1, --name1 NAME1
                        Used with 2 GTF input files. User-specified name to be
                        used for labeling output files related to the first
                        GTF file. Name must be alphanumeric, can only include
                        "_" special character and not contain any spaces.
                        Default: d1
  -2 NAME2, --name2 NAME2
                        Used with 2 GTF input files. User-specified name to be
                        used for labeling output files related to the second
                        GTF file. Name must be alphanumeric, can only include
                        "_" special character and not contain any spaces.
                        Default: d2
  -n CPU_CORES, --cpus CPU_CORES
                        Number of CPU cores to use for parallelization.
                        Default: 1
  --subset_file         
                        Enter a CSV file of transcript pairs to exclusively process in pairwise mode.
                        Format: no header, two columns (transcript_1 and transcript_2)
  -f, --force           Force overwrite existing output directory and files
                        within.
  -s, --skip-plots      Skip generation of all plots.
  -i, --skip-intermediate
                        Skip intermediate file output (junction and exon
                        region/fragment files).
  -v, --verbose         Verbose output
  --version             Display version

Important Benchmarking Information:

It is very important to note that TranD requires significant time and/or memory to run, especially when comparing two transcriptomes (2GTF mode, and even more so when using the -p both parameter, more information on that below). Below is a list of annotations and some benchmarking information to use as an estimate when running TranD on other PCs.

In the table below, "2 GTF Minimum" indicates the parameter -p both and "2 GTF All" indicates -p all.

Annotation(s)	Num Transcripts	TranD Mode	Num CPUs	Max Memory Usage (MB)	Time (hh:mm:ss)
Refseq Human Annotation Mapped to Ensembl Coordinates	165898	1 GTF Gene	64	4800.968	20:59:38
FlyBase D. mel. Reference Annotation (r6.17)	35255	1 GTF Gene	64	821.56	00:35:29
FlyBase D. mel. Reference Annotation (r6.17)	35255	1 GTF Pairwise	64	1836.524	01:46:42
Ensembl vs. Refseq, Chromosome 1	20492 vs 16075	2 GTF All	64	5123.724	14:50:06
Ensembl vs. Refseq, Chromosome 2	17239 vs 12345	2 GTF All	64	4674.92	10:05:06
Dros. mel. vs sim. (mapped to mel coordinates)	33794 vs 41352	2 GTF Minimum	64	6037.52	23:55:51

Note about human data: Human annotations tend to be very large, so it is highly recommended to run TranD by chromosome on separate machines (or on a computer cluster via an array job). No time is saved by running by chromosome on a single machine. More information on how to do this here.
- This is why the above benchmarking information covers only the 2 longest chromosomes.
Be sure to allocate more time and memory when running TranD 2 GTF Minimum and 1 GTF Pairwise (for larger annotations).
For a more complete list of runtimes (including all chromosomes for the Human Annotation Comparison), please see our Precomputed Files.
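
Running by chromosome first requires one GTF per chromosome. The following is a minimal sketch of that split using awk, not the procedure documented on the wiki: split_gtf_by_chrom is a hypothetical helper name, and it assumes a tab-delimited GTF whose first column is the seqid.

```shell
# Hypothetical helper: split_gtf_by_chrom <input.gtf> <outdir>
# Writes one GTF per seqid (column 1) into <outdir>, skipping "#" comment lines.
# Assumes <outdir> holds no earlier split files, because lines are appended.
split_gtf_by_chrom() {
    mkdir -p "$2"
    awk -F'\t' -v out="$2" '
        /^#/ { next }
        $1 != prev {                 # seqid changed: switch output file
            if (prev != "") close(f)
            f = out "/" $1 ".gtf"
            prev = $1
        }
        { print >> f }
    ' "$1"
}
```

Each per-chromosome file can then be passed to its own trand run, for example one array-job task per file. For a 2-GTF comparison, both annotations would be split the same way so that matching chromosome files can be paired.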

Single Transcriptome Evaluation

With the input of a single transcriptome (GTF file), TranD can generate a 1-GTF analysis in the form of gene-level complexity measures. Gene-level complexity measures are calculated by comparing pairs of transcripts within a gene (‘pairwise’ mode) or by comparing all transcripts within a gene (‘gene’ mode). If ‘gene’ mode is selected, CSV files containing all possible exon regions and exon fragments are created along with a summary plot describing the transcriptome.

Pairwise Mode

Pairwise distances can be calculated for all pairs within a gene, and the patterns of splicing (5’/3’ variation, alternate donor/acceptor, alternate exons, intron retentions) as well as the number of nucleotide positions that differ for each metric can be summarized.

Important Disclaimers about 1GTF Pairwise Mode:

If a gene only has one transcript, it will not be included in the analysis files output by TranD (that gene is completely removed during analysis).
If two transcripts on opposite strands have the same gene identifier, TranD will output a warning and skip comparison of those two transcripts.
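
Since single-transcript genes are dropped from the 1-GTF pairwise output, it can be useful to list them before a run so their absence is not a surprise. The sketch below is an illustration only: single_transcript_genes is a hypothetical helper, and it assumes the usual GTF attribute layout (gene_id "..."; transcript_id "...";) in column 9, with gene_id being the first attribute whose name ends in gene_id.

```shell
# Hypothetical helper: single_transcript_genes <input.gtf>
# Prints the gene_id of every gene that has exactly one transcript among its exon features.
single_transcript_genes() {
    awk -F'\t' '
        $3 == "exon" {
            # "gene_id \"" is 9 characters, "transcript_id \"" is 15
            if (!match($9, /gene_id "[^"]+"/)) next
            g = substr($9, RSTART + 9, RLENGTH - 10)
            if (!match($9, /transcript_id "[^"]+"/)) next
            t = substr($9, RSTART + 15, RLENGTH - 16)
            if (!((g, t) in seen)) { seen[g, t] = 1; n[g]++ }
        }
        END { for (g in n) if (n[g] == 1) print g }
    ' "$1" | sort
}
```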

Typical command line usage:

# Positional argument: transcriptome GTF
# -o: TranD output directory
# -f: force overwrite of an existing output directory when running again
# -n: number of CPUs (default: 1)
trand \
    /path/to/transcriptome.gtf \
    -o /path/to/outdir \
    -f \
    -n 8

Input:

One GTF file corresponding to the transcriptome of interest. File must contain “exon” features with transcript_id and gene_id values indicated in attributes.
Full path to the output directory. If it does not exist, it is created automatically. The directory can already exist, and any output files present in it are replaced if the -f (--force) option is set.
Number of CPUs to use for parallelization (default: 1, no parallelization). TranD distributes genes across CPUs, so increasing this value helps most when there are many genes with few transcripts each (genes with many transcripts may not benefit as much).
Optional: -i (--skip-intermediate) skips output of intermediate files, so that only the minimum distance output file and all plots are output.
Optional: With either 1 GTF or 2 GTF input, -c (--complexityOnly) can be used to generate only the complexity measures.
Optional: With either 1 GTF or 2 GTF input, -s (--skip-plots) can be used to skip the generation of any plots/legends.
Optional: In pairwise mode only, you can restrict processing to a specific list of transcript pairs; all other pairs are skipped. To do this, pass a CSV file with no header and two columns (transcript_1,transcript_2) to the --subset_file argument.
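
The input requirements above can be checked before launching a long run. The following is a rough pre-flight sketch, not something TranD provides: check_gtf_exons and check_subset_csv are hypothetical helper names, and the checks only cover what this section states (exon features carrying gene_id and transcript_id, and a header-less two-column CSV of transcript pairs).

```shell
# Hypothetical helper: check_gtf_exons <input.gtf>
# Succeeds only if there is at least one exon feature and every exon has both IDs.
check_gtf_exons() {
    awk -F'\t' '
        $3 == "exon" {
            exons++
            if ($9 !~ /gene_id "[^"]+"/ || $9 !~ /transcript_id "[^"]+"/) bad++
        }
        END {
            printf "exon features: %d, missing gene_id/transcript_id: %d\n", exons, bad
            exit (exons == 0 || bad > 0)
        }
    ' "$1"
}

# Hypothetical helper: check_subset_csv <pairs.csv>
# Succeeds only if every line has exactly two non-empty comma-separated fields.
check_subset_csv() {
    awk -F',' '
        NF != 2 || $1 == "" || $2 == "" {
            bad++
            printf "line %d is not a transcript pair: %s\n", NR, $0
        }
        END { exit (bad > 0) }
    ' "$1"
}
```

Note that a header row with two columns would still pass check_subset_csv, so the no-header requirement has to be checked by eye.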

Output:

Output CSV file (called pairwise_transcript_distance.csv). The distance columns are described here.
Box plots of complexity metrics (complexity_plots.png) and automatically generated figure legend (complexity_plots.rtf) for each transcriptome.
Output CSV file that displays a list of complexity measures (transcriptome_complexity_counts.csv).
Various plots described in TranD manuscript and automatically generated figure legends for each.
If intermediate files are output (no -i): a CSV file of pairwise events generated for each pair is output (event_analysis.csv) and a CSV file of all junctions in the transcriptome with columns for gene_id, transcript_id, and coords (chrom_:_start-10_:_end+9_:_strand, where start coordinate is 0-based) (junction_catalog.csv) is output.

Sample Usage:

For the rest of the user guide, sample code will be available for clarification on the best usage of TranD. Additionally, sample output will be available for clarification on what each output file truly entails. The sample input for 1 GTF analysis is the full transcriptome (GTF) for Drosophila melanogaster (r6.17) which can be found in Precomputed Files.

1-GTF pairwise sample code:

trand \
    ./input/dmel-all-r6.17.gtf \
    -o ./output/TranD_1GTF_Pairwise \
    -f \
    -n 8

The sample output can be found here.

Gene Mode

Gene mode compares all transcripts within each gene and outputs a transcriptome summary plot like the example below (D. melanogaster r6.17) and generates exon region and exon fragment event files representing the exonic space of each gene.

Typical command line usage:

# Positional argument: transcriptome GTF
# -o: TranD output directory
# --ea gene: switch TranD to gene mode
# --keepir: keep transcripts with IR events
# -f: force overwrite of an existing output directory when running again
# -n: number of CPUs (default: 1)
trand \
    /path/to/transcriptome.gtf \
    -o /path/to/outdir \
    --ea gene \
    --keepir \
    -f \
    -n 8

Generates event file by comparing all transcripts for a gene and outputs a transcriptome summary plot.

Input:

Inputs are the same as above
NOTE: If you want to keep intron retention events, add the argument -k (--keepir) (If not added, transcripts with intron retention will be removed.)

Output:

Complexity output files described above
A CSV file of exon regions (event_analysis_er.csv)
A CSV file of exon fragments (event_analysis_ef.csv)
A CSV file of transcript_id values (ir_transcripts.csv) for transcripts with intron retention events when compared to all other transcripts of the gene (NOTE: this file is empty if --keepir is not used)
A CSV file of all junctions in the transcriptome with columns for gene_id, transcript_id, and coords (chrom_:_start-10_:_end+9_:_strand, where start coordinate is 0-based) (junction_catalog.csv)
A CSV file listing all genes and number of unique exons per gene (uniq_exons_per_gene.csv).
Plot (transcriptome_summary_plot.png) and auto-generated figure legend (transcriptome_summary_plot.rtf) for a summary of the transcriptome structure
Plot (all_gene_prop_nt_variability.png) and auto-generated figure legend (all_gene_prop_nt_variability.rtf) for the distribution of the proportion of nucleotide variability (calculated as the number of nucleotides that are contained in at least one but not all exons of the gene over the total number of nucleotides included in at least one exon of the gene).
Plot (multi_xcrpt_gene_prop_nt_variability.png) and auto-generated figure legend (multi_xcrpt_gene_prop_nt_variability.rtf) for the distribution of variable nucleotides across multi-transcript genes.
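
As a worked illustration of the proportion of nucleotide variability defined above, the sketch below reads the definition literally for a toy gene. It is not TranD's implementation: prop_nt_variability is a hypothetical helper, and it assumes each input line gives one exon as "start end" with 1-based inclusive coordinates on a single chromosome.

```shell
# Hypothetical helper: prop_nt_variability  (reads "start end" exon lines on stdin)
# variable = positions covered by at least one exon but not by all exons
# total    = positions covered by at least one exon
prop_nt_variability() {
    awk '
        { n++; for (p = $1; p <= $2; p++) cov[p]++ }
        END {
            for (p in cov) { total++; if (cov[p] < n) variable++ }
            if (total > 0) printf "%.4f\n", variable / total
        }
    '
}

# Two exons, 1-10 and 5-14: 14 positions in total, 6 shared (5-10), 8 variable.
printf '1 10\n5 14\n' | prop_nt_variability   # prints 0.5714
```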

Sample Usage:

1-GTF gene sample code:

trand \
    ./input/dmel-all-r6.17.gtf \
    -o ./output/TranD_1GTF_Gene \
    -e gene \
    -k \
    -f \
    -n 8

The sample output can be found here.

Comparison of Two Transcriptomes

Compare two transcriptomes that are from the same genomic coordinates (both sets of transcript coordinates map to the same genome). This mode of TranD is useful for comparing read methods or comparing species (if the species' annotations have been properly mapped to the same genome).

Important Disclaimers/Common Errors:

Transcriptomes must have shared gene_id values for transcripts to be compared within each shared gene. This is especially important when comparing two species or different references. This can be done by mapping the two annotations onto the same genome and using GFFCompare. This process is well documented on this wiki for Drosophila and human annotations. Possible errors if gene_ids are not the same:
- This can cause a "No objects to concatenate" error.
- This can also be checked by looking at the (gtf1)vs(gtf2)_gtf1_only.gtf and (gtf1)vs(gtf2)_gtf2_only.gtf files in the output. If these files perfectly match gtf1 and gtf2, it indicates that no transcript pairing was done.
Transcriptomes must have the exact same seqids (chromosome numbers) within the annotation. This is also important when comparing two references that use different chromosome number formats (ex: Refseq vs Ensembl). The process for using cthreepo to convert chromosome IDs between human formats is well documented in the human example here on the wiki. Possible errors if seqids are different in each annotation:
- TranD will still run, but there will be several lines (as in, one per transcript pair) that repeat "Multiple contig/chromn locations found for {transcript names}, skipping." in either the log file or terminal output.
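
Both requirements can be checked before running TranD by counting how many gene_id values and seqids the two GTF files share. The sketch below is an illustration under simple assumptions (tab-delimited GTFs, gene_id "..." in column 9); gtf_seqids, gtf_gene_ids and count_shared are hypothetical helper names, not TranD commands.

```shell
# Hypothetical helpers: print the unique seqids / gene_id values of a GTF.
gtf_seqids() {
    awk -F'\t' '!/^#/ { print $1 }' "$1" | sort -u
}
gtf_gene_ids() {
    awk -F'\t' '!/^#/ && match($9, /gene_id "[^"]+"/) {
        print substr($9, RSTART + 9, RLENGTH - 10)
    }' "$1" | sort -u
}

# count_shared <extractor> <gtf1> <gtf2>: number of values present in both files.
count_shared() {
    _a=$(mktemp); _b=$(mktemp)
    "$1" "$2" > "$_a"
    "$1" "$3" > "$_b"
    comm -12 "$_a" "$_b" | wc -l | tr -d ' '
    rm -f "$_a" "$_b"
}
```

For example, count_shared gtf_gene_ids first.gtf second.gtf printing 0 would suggest that no transcripts can be paired, and count_shared gtf_seqids first.gtf second.gtf printing 0 would point to mismatched chromosome naming.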

Typical command line usage:

# Positional arguments: GTF of the first transcriptome, then GTF of the second
# -o: TranD output directory
# -1 / -2: names for the first / second transcriptome (defaults: “d1” / “d2”)
# -p: output option for the table of transcript pairs
# -i: skip intermediate output files
# -f: force overwrite of an existing output directory when running again
# -n: number of CPUs (default: 1)
trand \
    /path/to/reads.gtf \
    /path/to/ref.gtf \
    -o /path/to/outdir \
    -1 name1 \
    -2 name2 \
    -p both \
    -i \
    -f \
    -n 8

Input:

Two GTF files corresponding to the transcriptomes of interest. Files must contain “exon” features with transcript_id and gene_id values indicated in attributes.
Names associated with each transcriptome to be used in output tables and plots (default: “d1” and “d2”). Names must be alphanumeric with no spaces or special characters other than “_”.
Full path to the output directory. If it does not exist, it is created automatically. The directory can already exist, and any output files present in it are replaced if the -f (--force) option is set.
-p: Output category option. It selects which transcript pairs appear in the output table: only minimum pairs from the “first” transcriptome, only minimum pairs from the “second” transcriptome, minimum pairs for “both” transcriptomes (minimum pair in either), or “all” transcript pairs regardless of minimum pair classification. (Default: “both”)
Number of CPUs to use for parallelization (default: 1, no parallelization). TranD distributes genes across CPUs, so increasing this value helps most when there are many genes with few transcripts each (genes with many transcripts may not benefit as much).
Optional: -i (--skip-intermediate) skips output of intermediate files, so that only the minimum distance output file and all plots are output.
Optional: With either 1 GTF or 2 GTF input, -c (--complexityOnly) can be used to generate only the complexity measures.
Optional: With either 1 GTF or 2 GTF input, -s (--skip-plots) can be used to skip the generation of any plots/legends.
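
The naming rule for -1 and -2 (alphanumeric characters and “_” only, no spaces) is easy to check in a wrapper script before calling trand. A minimal sketch follows; valid_name is a hypothetical helper, and the character ranges assume an ASCII-style locale.

```shell
# Hypothetical helper: valid_name <name>
# Succeeds only for non-empty names made of letters, digits and "_".
valid_name() {
    case "$1" in
        ''|*[!A-Za-z0-9_]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Example: check both transcriptome names before building the trand command.
for n in mel2mel sim2mel; do
    valid_name "$n" || echo "invalid transcriptome name: $n" >&2
done
```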

Output:

Output CSV file
- pairwise_transcript_distance.csv if -p is “all” OR
- minimum_pairwise_transcript_distance.csv if -p is "both", "first", or "second"

If -p all: The output CSV file includes the transcript distance output for all transcript pairs (“all”).
If -p both, first, or second: The output CSV file includes the minimum distance output for all minimum pairs of the first transcriptome to the second transcriptome (“first”), all minimum pairs of the second transcriptome to the first transcriptome (“second”), or the minimum pairs for either the first transcriptome and/or the second transcriptome (“both”).

The pairwise distance columns are described here. The additional columns added to the pairwise distance file when using minimum mode are described here.

Box plots of complexity metrics ([name]_complexity_plots.png) and automatically generated figure legend ([name]_complexity_plots.rtf) for each transcriptome, where [name] represents the names given to each transcriptome (or “d1” and “d2” as default).
Various plots described in the TranD manuscript and automatically generated figure legends for each.
Output CSV file that displays a list of complexity measures for each transcriptome ([name]_complexity_counts.csv).
If intermediate files are output (no -i), a CSV file of pairwise events generated for each pair is output (event_analysis.csv).

Sample Usage:

The sample input for 2 GTF analysis are the transcriptomes of D. melanogaster (GTF) and D. simulans (GTF), mapped onto the genome of D. melanogaster, and subset to only the coding genes. The GTFs are referred to as mel2mel and sim2mel, respectively (based on the "mel" and "sim" species mapped to the "mel" genome). The process of creating these mapped GTFs is similar to the process found in the Drosophila Species Comparison.

2-GTF sample code:

trand \
    ./input/mel2mel_coding_associated_gene.gtf \
    ./input/sim2mel_coding_associated_gene.gtf \
    -o ./output/TranD_2GTF_Pairwise \
    -1 mel2mel \
    -2 sim2mel \
    -p both \
    -i \
    -f \
    -n 8

The sample output can be found here.

Documentation and Script

Documentation and Script used for generating all of the sample output are in the links provided.
