Commit df86d60

Merge branch 'dev' into fix/formatting
2 parents 3ac6da4 + 91b0f8c commit df86d60

File tree

111 files changed

+5777
-431
lines changed

README.md

Lines changed: 39 additions & 27 deletions
@@ -20,52 +20,53 @@
2020

2121
## Introduction
2222

23-
**nf-core/drop** is a bioinformatics pipeline that ...
23+
**nf-core/drop** (Detection of RNA Outliers Pipeline) is a bioinformatics pipeline that detects aberrant expression, aberrant splicing, and mono-allelic expression from RNA sequencing data.
2424

25-
<!-- TODO nf-core:
26-
Complete this sentence with a 2-3 sentence summary of what types of data the pipeline ingests, a brief overview of the
27-
major pipeline sections and the types of output it produces. You're giving an overview to someone new
28-
to nf-core here, in 15-20 seconds. For an example, see https://github.com/nf-core/rnaseq/blob/master/README.md#introduction
29-
-->
25+
![A high-level diagram of the DROP workflow in a metro map style](docs/images/drop_metromap.png)
3026

31-
<!-- TODO nf-core: Include a figure that guides the user through the major workflow steps. Many nf-core
32-
workflows use the "tube map" design for that. See https://nf-co.re/docs/guidelines/graphic_design/workflow_diagrams#examples for examples. -->
33-
<!-- TODO nf-core: Fill in short bullet-pointed list of the default steps in the pipeline -->1. Read QC ([`FastQC`](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/))2. Present QC for raw reads ([`MultiQC`](http://multiqc.info/))
27+
- aberrant expression
28+
1. Compute read count matrix ([`GenomicAlignments`](https://github.com/Bioconductor/GenomicAlignments))
29+
2. Detect expression outliers ([`OUTRIDER`](https://github.com/gagneurlab/OUTRIDER/))
30+
- aberrant splicing
31+
1. Count split reads and non-split reads ([`GenomicAlignments`](https://github.com/Bioconductor/GenomicAlignments)) and ([`Subread`](https://bioconductor.org/packages/devel/bioc/html/Rsubread.html))
32+
2. Detect aberrant splicing events ([`FRASER`](https://github.com/gagneurlab/FRASER/))
33+
- mono-allelic expression
34+
1. Compute allelic counts (GATK ASEReadCounter)
35+
2. Detect aberrant mono-allelically expressed genes ([`DESeq2`](https://bioconductor.org/packages/release/bioc/html/DESeq2.html))
36+
- Present QC Reports ([`MultiQC`](http://multiqc.info/))
3437

3538
## Usage
3639

3740
> [!NOTE]
3841
> If you are new to Nextflow and nf-core, please refer to [this page](https://nf-co.re/docs/usage/installation) on how to set-up Nextflow. Make sure to [test your setup](https://nf-co.re/docs/usage/introduction#how-to-run-a-pipeline) with `-profile test` before running the workflow on actual data.
3942
40-
<!-- TODO nf-core: Describe the minimum required steps to execute the pipeline, e.g. how to prepare samplesheets.
41-
Explain what rows and columns represent. For instance (please edit as appropriate):
42-
4343
First, prepare a samplesheet with your input data that looks as follows:
4444

45-
`samplesheet.csv`:
45+
`samplesheet.tsv`:
4646

47-
```csv
48-
sample,fastq_1,fastq_2
49-
CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz
50-
```
47+
| RNA_ID | RNA_BAM_FILE | RNA_BAI_FILE | DROP_GROUP | STRAND | DNA_ID | DNA_VCF_FILE | DNA_TBI_FILE | GENOME |
48+
| ------- | ------------------- | ----------------------- | ------------- | ------ | ------- | ------------------------- | ----------------------------- | ------ |
49+
| HG00103 | path/to/HG00103.bam | path/to/HG00103.bam.bai | group1,group2 | no | HG00103 | path/to/demo_chr21.vcf.gz | path/to/demo_chr21.vcf.gz.tbi | ucsc |
50+
| HG00106 | path/to/HG00106.bam | path/to/HG00106.bam.bai | group1,group2 | no | HG00106 | path/to/demo_chr21.vcf.gz | path/to/demo_chr21.vcf.gz.tbi | ucsc |
5151

52-
Each row represents a fastq file (single-end) or a pair of fastq files (paired end).
52+
Each row requires a unique RNA_ID, a BAM file, a DROP_GROUP and a STRAND. For MAE, DNA_ID, DNA_VCF_FILE and GENOME are additionally required.
5353

54-
-->
54+
Here is an example of a [samplesheet](assets/samplesheet.tsv). Of note, to detect outliers confidently, a sufficiently large sample size is needed (>30 samples).
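The row requirements above can be sketched as a small pre-flight check. This is an illustrative sketch only and not part of the pipeline: the function name `check_samplesheet` and the `mae` flag are hypothetical, and it assumes "a BAM file" refers to the `RNA_BAM_FILE` column shown in the table.

```python
import csv
import io

# Columns the README states every row needs; RNA_BAM_FILE stands in for "a BAM file" (assumption).
REQUIRED = ["RNA_ID", "RNA_BAM_FILE", "DROP_GROUP", "STRAND"]
# Extra columns the README lists for mono-allelic expression (MAE).
MAE_REQUIRED = ["DNA_ID", "DNA_VCF_FILE", "GENOME"]


def check_samplesheet(tsv_text, mae=False):
    """Return a list of human-readable problems found in a samplesheet TSV (hypothetical helper)."""
    problems = []
    seen_ids = set()
    needed = REQUIRED + (MAE_REQUIRED if mae else [])
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row_number, row in enumerate(reader, start=1):
        # Report every required column that is absent or empty in this row.
        for column in needed:
            if not (row.get(column) or "").strip():
                problems.append(f"row {row_number}: missing {column}")
        # RNA_ID has to be unique across rows.
        rna_id = (row.get("RNA_ID") or "").strip()
        if rna_id and rna_id in seen_ids:
            problems.append(f"row {row_number}: duplicate RNA_ID {rna_id}")
        seen_ids.add(rna_id)
    return problems


# Second row reuses an RNA_ID and leaves DNA_ID empty on purpose.
example = (
    "RNA_ID\tRNA_BAM_FILE\tDROP_GROUP\tSTRAND\tDNA_ID\tDNA_VCF_FILE\tGENOME\n"
    "HG00103\tpath/to/HG00103.bam\tgroup1,group2\tno\tHG00103\tpath/to/demo_chr21.vcf.gz\tucsc\n"
    "HG00103\tpath/to/HG00106.bam\tgroup1,group2\tno\t\tpath/to/demo_chr21.vcf.gz\tucsc\n"
)
print(check_samplesheet(example, mae=True))
```

The pipeline's own validation is driven by `assets/schema_input.json`; a check like this is only a convenience for catching obvious gaps before launching a run.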
5555

5656
Now, you can run the pipeline using:
5757

58-
<!-- TODO nf-core: update the following command to include all required parameters for a minimal example -->
59-
6058
```bash
6159
nextflow run nf-core/drop \
62-
-profile <docker/singularity/.../institute> \
63-
--input samplesheet.csv \
64-
--outdir <OUTDIR>
60+
-profile <docker/singularity/conda/...> \
61+
--input samplesheet.tsv \
62+
--outdir <OUTDIR> \
63+
--genome hg19 \
64+
--gene_annotation <path/to/gene/annotation/yaml> \
65+
--ucsc_fasta <path/to/fasta>
6566
```
6667

6768
> [!WARNING]
68-
> Please provide pipeline parameters via the CLI or Nextflow `-params-file` option. Custom config files including those provided by the `-c` Nextflow option can be used to provide any configuration _**except for parameters**_; see [docs](https://nf-co.re/docs/usage/getting_started/configuration#custom-configuration-files).
69+
> Please provide pipeline parameters via the CLI or Nextflow `-params-file` option. Custom config files including those provided by the `-c` Nextflow option can be used to provide any configuration _**except for parameters**_; see [docs](https://nf-co.re/docs/usage/getting_started/configuration#custom-configuration-files). Here is an example of a [custom config](conf/test.config).
6970
7071
For more details and further functionality, please refer to the [usage documentation](https://nf-co.re/drop/usage) and the [parameter documentation](https://nf-co.re/drop/parameters).
7172

@@ -77,11 +78,22 @@ For more details about the output files and reports, please refer to the
7778

7879
## Credits
7980

80-
nf-core/drop was originally written by Michaela Mueller, Vicente Yepez, Christian Mertes, Daniela Andrade, Cristian Sandu, Andrew Behrens, Julien Gagneur.
81+
nf-core/drop was originally written by Vicente Yepez, Christian Mertes, Michaela Mueller, Daniela Andrade, Leonhard Wachutka from the Gagneur lab at the Department of Informatics and School of Medicine of the Technical University of Munich (TUM) and The German Human Genome-Phenome Archive (GHGA).
82+
83+
The Nextflow DSL2 conversion of the pipeline was led by Nicolas Vannieuwkerke and Yun Wang.
84+
85+
Main developers:
86+
87+
- [Nicolas Vannieuwkerke](https://github.com/nvnieuwk)
88+
- [Yun Wang](https://github.com/fulaibaowang)
8189

8290
We thank the following people for their extensive assistance in the development of this pipeline:
8391

84-
<!-- TODO nf-core: If applicable, make list of people who have also contributed -->
92+
- [Ata Jadid Ahari](https://github.com/AtaJadidAhari)
93+
- [Drew Behrens](https://github.com/drewjbeh)
94+
95+
<!-- TODO Acknowledgements -->
96+
<!-- GHGA -->
8597

8698
## Contributions and Support
8799

assets/multiqc_config.yml

Lines changed: 1 addition & 0 deletions
@@ -2,6 +2,7 @@ report_comment: >
22
This report has been generated by the <a href="https://github.com/nf-core/drop/tree/dev" target="_blank">nf-core/drop</a>
33
analysis pipeline. For information about how to interpret these results, please see the
44
<a href="https://nf-co.re/drop/dev/docs/output" target="_blank">documentation</a>.
5+
56
report_section_order:
67
"nf-core-drop-methods-description":
78
order: -1000
Lines changed: 70 additions & 0 deletions
@@ -0,0 +1,70 @@
1+
custom_content:
2+
order:
3+
- fraser_overview
4+
- q_estimation_psi5
5+
- q_estimation_psi3
6+
- q_estimation_theta
7+
- q_estimation_jaccard
8+
- aberrantly_spliced_genes
9+
- batch_correlation_psi5_FALSE
10+
- batch_correlation_psi5_TRUE
11+
- batch_correlation_psi3_FALSE
12+
- batch_correlation_psi3_TRUE
13+
- batch_correlation_theta_FALSE
14+
- batch_correlation_theta_TRUE
15+
- batch_correlation_jaccard_FALSE
16+
- batch_correlation_jaccard_TRUE
17+
- total_outliers
18+
- results
19+
20+
custom_data:
21+
fraser_overview:
22+
section_name: "Fraser overview"
23+
format: "tsv"
24+
plot_type: "table"
25+
26+
q_estimation_psi5:
27+
section_name: "Hyperparameter optimization - Q_estimation_psi5"
28+
29+
q_estimation_psi3:
30+
section_name: "Hyperparameter optimization - Q_estimation_psi3"
31+
32+
q_estimation_theta:
33+
section_name: "Hyperparameter optimization - Q_estimation_theta"
34+
35+
q_estimation_jaccard:
36+
section_name: "Hyperparameter optimization"
37+
description: "Q_estimation_jaccard"
38+
39+
aberrantly_spliced_genes:
40+
section_name: "Aberrantly spliced genes"
41+
42+
batch_correlation_psi5_FALSE:
43+
section_name: "Batch Correlation psi5 raw"
44+
batch_correlation_psi5_TRUE:
45+
section_name: "Batch correlation psi5 normalized"
46+
47+
batch_correlation_psi3_FALSE:
48+
section_name: "Batch Correlation psi3 raw"
49+
batch_correlation_psi3_TRUE:
50+
section_name: "Batch correlation psi3 normalized"
51+
52+
batch_correlation_theta_FALSE:
53+
section_name: "Batch Correlation theta raw"
54+
batch_correlation_theta_TRUE:
55+
section_name: "Batch correlation theta normalized"
56+
57+
batch_correlation_jaccard_FALSE:
58+
section_name: "Batch Correlation jaccard raw"
59+
batch_correlation_jaccard_TRUE:
60+
section_name: "Batch correlation jaccard normalized"
61+
62+
total_outliers:
63+
section_name: "Results"
64+
format: "tsv"
65+
plot_type: "table"
66+
description: "Total splicing outliers"
67+
68+
results:
69+
section_name: "Results table"
70+
description: "FRASER results (up to 500 rows shown)"
Lines changed: 36 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,36 @@
1+
custom_content:
2+
order:
3+
- mae_overview
4+
- cascade_plot
5+
- variant_frequency
6+
- median_of_each_category
7+
- results
8+
9+
custom_data:
10+
mae_overview:
11+
section_name: "MAE overview"
12+
format: "tsv"
13+
plot_type: "table"
14+
15+
cascade_plot:
16+
section_name: "Cascade plot"
17+
description: |
18+
a cascade plot that shows a progression of added filters
19+
20+
- `>10 counts`: only variants supported by more than 10 counts
21+
- `+MAE`: and shows mono allelic expression
22+
- `+MAE for REF`: the monoallelic expression favors the reference allele
23+
- `+MAE for ALT`: the monoallelic expression favors the alternative allele
24+
- `rare`:
25+
- if add_AF is set to true in config file must meet minimum AF set by the config value max_AF
26+
- must meet the inner-cohort frequency maxVarFreqCohort cutoff
27+
28+
variant_frequency:
29+
section_name: "Variant Frequency within Cohort"
30+
31+
median_of_each_category:
32+
section_name: "Median of each category"
33+
34+
results:
35+
section_name: "MAE Results table"
36+
description: "MAE results (up to 500 rows shown)"
Lines changed: 42 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,42 @@
1+
custom_content:
2+
order:
3+
- matching_values_distribution
4+
- heatmap_matching_variants
5+
- identify_matching_samples
6+
- false_matches
7+
- false_mismatches
8+
9+
custom_data:
10+
matching_values_distribution:
11+
section_name: "DNA - RNA matching values distribution"
12+
13+
heatmap_matching_variants:
14+
section_name: "Heatmap of matching variants"
15+
description: |
16+
Shows the proportion of matching DNA (rows) - RNA (cols) variants. Possible values are:
17+
18+
- match: the DNA sample matches the annotated RNA sample
19+
- no match: the DNA sample does not match the annotated RNA and no match was found
20+
- matches other: the DNA sample does not match the annotated RNA, but another match was found
21+
- matches more: the DNA sample matches the annotated RNA, but also other RNAs not annotated to match
22+
- matches less: the DNA sample is annotated with more than 1 RNA. Not all annotated RNAs are correct.
23+
24+
Similar for the RNAs.
25+
26+
identify_matching_samples:
27+
section_name: "Identify matching samples"
28+
format: "tsv"
29+
plot_type: "table"
30+
description: |
31+
Considerations: In our experience, the median proportion of matching variants is around 0.95 in matching samples and around 0.58 in non-matching samples. Sometimes we do see values between 0.7 and 0.85. That could mean that the DNA-RNA combination is not from the same person, but from a relative. It could also be due to a technical error. For those cases, check the following:
32+
33+
- RNA sequencing depth (low sequencing depth can lead to variants not being found in the RNA)
34+
- Number of variants (too many variants called due to sequencing errors)
35+
- Ratio of heterozygous/homozygous variants (usually too many called variants means too many heterozygous ones)
36+
- Is the sample a relative of the other?
37+
38+
false_matches:
39+
section_name: "Samples that were annotated to match but do not"
40+
41+
false_mismatches:
42+
section_name: "Samples that were not annotated to match but actually do"
Lines changed: 45 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,45 @@
1+
custom_content:
2+
order:
3+
- number_of_samples
4+
- number_of_introns
5+
- number_of_splice_sites
6+
- comparison_local_and_external
7+
- expression_filtering
8+
9+
custom_data:
10+
number_of_samples:
11+
section_name: "Number of samples"
12+
format: "tsv"
13+
plot_type: "table"
14+
15+
number_of_introns:
16+
section_name: "Number of introns"
17+
format: "tsv"
18+
plot_type: "table"
19+
20+
number_of_splice_sites:
21+
section_name: "Number of splice sites"
22+
format: "tsv"
23+
plot_type: "table"
24+
25+
comparison_local_and_external:
26+
section_name: "Comparison of local and external counts"
27+
description: |
28+
Using external counts
29+
30+
External counts introduce some complexity into the problem of counting junctions, because it is unknown whether a junction is missing because it has no reads or because it was filtered out for legal/personal sharing reasons. As a result, after merging the local counts (counted from BAM files) and the external counts, only the junctions that are present in both remain, so the number of junctions is likely to decrease after merging.
31+
32+
expression_filtering:
33+
section_name: "Expression filtering"
34+
description: |
35+
The expression filtering step removes introns that are lowly expressed. The requirements for an intron to pass this filter are:
36+
37+
- at least 1 sample has 20 counts (K) for the intron
38+
- at least 5% of the samples need to have a total of at least 10 reads for the splice metric denominator (N) of the intron
39+
40+
variability_filtering:
41+
section_name: "Variability filtering"
42+
description: |
43+
The variability filtering step removes introns that have no or little variability in the splice metric values across samples. The requirement for an intron to pass this filter is:
44+
45+
- at least 1 sample has a difference of at least 0.05 in the splice metric compared to the mean splice metric of the intron
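The two filter descriptions above can be read as simple per-intron predicates. The sketch below is a plain-Python illustration of that reading, not FRASER's implementation. It assumes the splice metric for a sample is `K / N`, that "has 20 counts" means `K >= 20`, and that the mean is the arithmetic mean over samples with `N > 0`. The function names are hypothetical.

```python
def passes_expression_filter(k_counts, n_counts, min_k=20, min_n=10, min_fraction=0.05):
    """Expression filter as described in the report text (thresholds taken from that text)."""
    # At least one sample has min_k or more counts (K) for the intron.
    has_supported_sample = any(k >= min_k for k in k_counts)
    # At least min_fraction of the samples have a denominator (N) of min_n or more.
    samples_with_enough_n = sum(1 for n in n_counts if n >= min_n)
    enough_coverage = samples_with_enough_n >= min_fraction * len(n_counts)
    return has_supported_sample and enough_coverage


def passes_variability_filter(k_counts, n_counts, min_delta=0.05):
    """Variability filter: some sample deviates from the intron's mean metric by min_delta or more."""
    # Assumed splice metric: K / N per sample, skipping samples without coverage.
    metrics = [k / n for k, n in zip(k_counts, n_counts) if n > 0]
    if not metrics:
        return False
    mean_metric = sum(metrics) / len(metrics)
    return any(abs(metric - mean_metric) >= min_delta for metric in metrics)


# One intron with a clear outlier sample, one that is flat across samples.
variable_k, variable_n = [0, 2, 25, 1], [30, 28, 40, 5]
flat_k, flat_n = [10, 10, 10, 10], [20, 20, 20, 20]

print(passes_expression_filter(variable_k, variable_n), passes_variability_filter(variable_k, variable_n))
print(passes_expression_filter(flat_k, flat_n), passes_variability_filter(flat_k, flat_n))
```

The flat intron fails both checks here: no sample reaches 20 counts, and every sample has the same metric, so there is no deviation from the mean.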

assets/schema_input.json

Lines changed: 3 additions & 3 deletions
@@ -35,7 +35,7 @@
3535
"pattern": "^\\S+$",
3636
"errorMessage": "DNA ID must be provided and cannot contain spaces",
3737
"meta": ["dna_id"],
38-
"description": "Unique identifier for the DNA sample. Must not contain spaces."
38+
"description": "Identifier for the DNA sample. Must not contain spaces."
3939
},
4040
"DNA_VCF_FILE": {
4141
"type": "string",
@@ -96,8 +96,8 @@
9696
"type": "string",
9797
"format": "file-path",
9898
"exists": true,
99-
"pattern": "^\\S+\\.tsv\\.gz$",
100-
"errorMessage": "The gene counts file has to exist, cannot contain spaces and must have extension '.tsv.gz'",
99+
"pattern": "^\\S+\\.tsv(\\.gz)?$",
100+
"errorMessage": "The gene counts file has to exist, cannot contain spaces and must have extension '.tsv' or '.tsv.gz'",
101101
"description": "Path to the gene counts file. Must exist and cannot contain spaces."
102102
},
103103
"GENE_ANNOTATION": {
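To make the effect of the gene counts pattern change concrete, the following sketch compares the old and new regular expressions on a few file names. It uses Python's `re` module as a stand-in for the schema validator, which is an assumption; the two patterns are the JSON strings from the hunk above with the JSON escaping removed, and the file names are made up for illustration.

```python
import re

# Patterns from the schema diff, with JSON's doubled backslashes unescaped.
OLD_PATTERN = re.compile(r"^\S+\.tsv\.gz$")
NEW_PATTERN = re.compile(r"^\S+\.tsv(\.gz)?$")

# Hypothetical file names used only for illustration.
candidates = ["counts.tsv.gz", "counts.tsv", "my counts.tsv", "counts.csv"]

for path in candidates:
    old_ok = bool(OLD_PATTERN.match(path))
    new_ok = bool(NEW_PATTERN.match(path))
    print(f"{path!r}: old={old_ok} new={new_ok}")
```

Only the uncompressed `counts.tsv` changes status: it was rejected before and is accepted now. Names containing spaces and other extensions are still rejected by both patterns.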
