nf-core
diff --git a/README.md b/README.md
Lines changed: 39 additions & 27 deletions
diff --git a/assets/multiqc_config.yml b/assets/multiqc_config.yml
Lines changed: 1 addition & 0 deletions
diff --git a/assets/multiqc_configs/multiqc_fraser_config.yml b/assets/multiqc_configs/multiqc_fraser_config.yml
Lines changed: 70 additions & 0 deletions
diff --git a/assets/multiqc_countexpression_config.yml renamed to assets/multiqc_configs/multiqc_genecounts_config.yml
diff --git a/assets/multiqc_configs/multiqc_mae_config.yml b/assets/multiqc_configs/multiqc_mae_config.yml
Lines changed: 36 additions & 0 deletions
diff --git a/assets/multiqc_configs/multiqc_maeqc_config.yml b/assets/multiqc_configs/multiqc_maeqc_config.yml
Lines changed: 42 additions & 0 deletions
diff --git a/assets/multiqc_outrider_config.yml renamed to assets/multiqc_configs/multiqc_outrider_config.yml
diff --git a/assets/multiqc_configs/multiqc_splicecounts_config.yml b/assets/multiqc_configs/multiqc_splicecounts_config.yml
Lines changed: 45 additions & 0 deletions
diff --git a/assets/schema_input.json b/assets/schema_input.json
Lines changed: 3 additions & 3 deletions
@@ -20,52 +20,53 @@
 
 ## Introduction
 
-**nf-core/drop** is a bioinformatics pipeline that ...
+**nf-core/drop** (Detection of RNA Outliers Pipeline) is a bioinformatics pipeline that detects aberrant expression, aberrant splicing, and mono-allelic expression from RNA sequencing data.
 
-<!-- TODO nf-core:
-   Complete this sentence with a 2-3 sentence summary of what types of data the pipeline ingests, a brief overview of the
-   major pipeline sections and the types of output it produces. You're giving an overview to someone new
-   to nf-core here, in 15-20 seconds. For an example, see https://github.com/nf-core/rnaseq/blob/master/README.md#introduction
--->
+![A high-level diagram of the DROP workflow in a metro map style](docs/images/drop_metromap.png)
 
-<!-- TODO nf-core: Include a figure that guides the user through the major workflow steps. Many nf-core
-     workflows use the "tube map" design for that. See https://nf-co.re/docs/guidelines/graphic_design/workflow_diagrams#examples for examples.   -->
-<!-- TODO nf-core: Fill in short bullet-pointed list of the default steps in the pipeline -->
-1. Read QC ([`FastQC`](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/))
-2. Present QC for raw reads ([`MultiQC`](http://multiqc.info/))
+- aberrant expression
+  1. Compute read count matrix ([`GenomicAlignments`](https://github.com/Bioconductor/GenomicAlignments))
+  2. Detect expression outliers ([`OUTRIDER`](https://github.com/gagneurlab/OUTRIDER/))
+- aberrant splicing
+  1. Count split reads and non-split reads ([`GenomicAlignments`](https://github.com/Bioconductor/GenomicAlignments) and [`Subread`](https://bioconductor.org/packages/devel/bioc/html/Rsubread.html))
+  2. Detect aberrant splicing events ([`FRASER`](https://github.com/gagneurlab/FRASER/))
+- mono-allelic expression
+  1. Compute allelic counts (GATK ASEReadCounter)
+  2. Detect aberrant mono-allelically expressed genes ([`DESeq2`](https://bioconductor.org/packages/release/bioc/html/DESeq2.html))
+- Present QC Reports ([`MultiQC`](http://multiqc.info/))
 
 ## Usage
 
 > [!NOTE]
 > If you are new to Nextflow and nf-core, please refer to [this page](https://nf-co.re/docs/usage/installation) on how to set-up Nextflow. Make sure to [test your setup](https://nf-co.re/docs/usage/introduction#how-to-run-a-pipeline) with `-profile test` before running the workflow on actual data.
 
-<!-- TODO nf-core: Describe the minimum required steps to execute the pipeline, e.g. how to prepare samplesheets.
-     Explain what rows and columns represent. For instance (please edit as appropriate):
-
 First, prepare a samplesheet with your input data that looks as follows:
 
-`samplesheet.csv`:
+`samplesheet.tsv`:
 
-```csv
-sample,fastq_1,fastq_2
-CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz
-```
+| RNA_ID  | RNA_BAM_FILE        | RNA_BAI_FILE            | DROP_GROUP    | STRAND | DNA_ID  | DNA_VCF_FILE              | DNA_TBI_FILE                  | GENOME |
+| ------- | ------------------- | ----------------------- | ------------- | ------ | ------- | ------------------------- | ----------------------------- | ------ |
+| HG00103 | path/to/HG00103.bam | path/to/HG00103.bam.bai | group1,group2 | no     | HG00103 | path/to/demo_chr21.vcf.gz | path/to/demo_chr21.vcf.gz.tbi | ucsc   |
+| HG00106 | path/to/HG00106.bam | path/to/HG00106.bam.bai | group1,group2 | no     | HG00106 | path/to/demo_chr21.vcf.gz | path/to/demo_chr21.vcf.gz.tbi | ucsc   |
 
-Each row represents a fastq file (single-end) or a pair of fastq files (paired end).
+Each row requires a unique `RNA_ID`, a BAM file, `DROP_GROUP`, and `STRAND`. The mono-allelic expression (MAE) module additionally requires `DNA_ID`, `DNA_VCF_FILE`, and `GENOME`.
 
--->
+Here is an example of a [samplesheet](assets/samplesheet.tsv). Note that confident outlier detection requires a sufficiently large sample size (more than 30 samples).
 
 Now, you can run the pipeline using:
 
-<!-- TODO nf-core: update the following command to include all required parameters for a minimal example -->
-
 ```bash
 nextflow run nf-core/drop \
-   -profile <docker/singularity/.../institute> \
-   --input samplesheet.csv \
-   --outdir <OUTDIR>
+   -profile <docker/singularity/conda/...> \
+   --input samplesheet.tsv \
+   --outdir <OUTDIR> \
+   --genome hg19 \
+   --gene_annotation <path/to/gene/annotation/yaml> \
+   --ucsc_fasta <path/to/fasta>
 ```
 
 > [!WARNING]
-> Please provide pipeline parameters via the CLI or Nextflow `-params-file` option. Custom config files including those provided by the `-c` Nextflow option can be used to provide any configuration _**except for parameters**_; see [docs](https://nf-co.re/docs/usage/getting_started/configuration#custom-configuration-files).
+> Please provide pipeline parameters via the CLI or Nextflow `-params-file` option. Custom config files including those provided by the `-c` Nextflow option can be used to provide any configuration _**except for parameters**_; see [docs](https://nf-co.re/docs/usage/getting_started/configuration#custom-configuration-files). Here is an example of a [custom config](conf/test.config).
 
 For more details and further functionality, please refer to the [usage documentation](https://nf-co.re/drop/usage) and the [parameter documentation](https://nf-co.re/drop/parameters).
 
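Reviewer note: the samplesheet rules stated in this README hunk (unique `RNA_ID`, a BAM file, `DROP_GROUP`, and `STRAND` per row, plus `DNA_ID`, `DNA_VCF_FILE`, and `GENOME` for MAE) can be sketched as a small pre-flight check. This is only an illustration, not the pipeline's own validation, which is driven by `assets/schema_input.json`. The helper name `check_samplesheet` is hypothetical, and mapping "a BAM file" to the `RNA_BAM_FILE` column is an assumption based on the example table.

```python
import csv
import io

# Column names are taken from the example table in the README hunk.
# Treating RNA_BAM_FILE as "the BAM file" is an assumption.
REQUIRED = ["RNA_ID", "RNA_BAM_FILE", "DROP_GROUP", "STRAND"]
MAE_REQUIRED = ["DNA_ID", "DNA_VCF_FILE", "GENOME"]


def check_samplesheet(tsv_text, mae=False):
    """Hypothetical helper: return a list of problems found in a samplesheet."""
    problems = []
    seen_rna_ids = set()
    needed = REQUIRED + (MAE_REQUIRED if mae else [])
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row_number, row in enumerate(reader, start=1):
        # Every required column must be present and non-empty.
        for column in needed:
            if not (row.get(column) or "").strip():
                problems.append(f"row {row_number}: missing {column}")
        # RNA_ID must be unique across rows.
        rna_id = (row.get("RNA_ID") or "").strip()
        if rna_id and rna_id in seen_rna_ids:
            problems.append(f"row {row_number}: duplicate RNA_ID {rna_id}")
        seen_rna_ids.add(rna_id)
    return problems
```

Calling `check_samplesheet(open("samplesheet.tsv").read(), mae=True)` before launching the run would surface missing MAE columns early; an empty list means no problems were found by this sketch.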
@@ -77,11 +78,22 @@ For more details about the output files and reports, please refer to the
 
 ## Credits
 
-nf-core/drop was originally written by Michaela Mueller, Vicente Yepez, Christian Mertes, Daniela Andrade, Cristian Sandu, Andrew Behrens, Julien Gagneur.
+nf-core/drop was originally written by Vicente Yepez, Christian Mertes, Michaela Mueller, Daniela Andrade, and Leonhard Wachutka from the Gagneur lab at the Department of Informatics and School of Medicine of the Technical University of Munich (TUM) and The German Human Genome-Phenome Archive (GHGA).
+
+The Nextflow DSL2 conversion of the pipeline was led by Nicolas Vannieuwkerke and Yun Wang.
+
+Main developers:
+
+- [Nicolas Vannieuwkerke](https://github.com/nvnieuwk)
+- [Yun Wang](https://github.com/fulaibaowang)
 
 We thank the following people for their extensive assistance in the development of this pipeline:
 
-<!-- TODO nf-core: If applicable, make list of people who have also contributed -->
+- [Ata Jadid Ahari](https://github.com/AtaJadidAhari)
+- [Drew Behrens](https://github.com/drewjbeh)
+
+<!-- TODO Acknowledgements -->
+<!-- GHGA -->
 
 ## Contributions and Support
 
 
@@ -2,6 +2,7 @@ report_comment: >
   This report has been generated by the <a href="https://github.com/nf-core/drop/tree/dev" target="_blank">nf-core/drop</a>
   analysis pipeline. For information about how to interpret these results, please see the
   <a href="https://nf-co.re/drop/dev/docs/output" target="_blank">documentation</a>.
+
 report_section_order:
   "nf-core-drop-methods-description":
     order: -1000
 
@@ -0,0 +1,70 @@
+custom_content:
+  order:
+    - fraser_overview
+    - q_estimation_psi5
+    - q_estimation_psi3
+    - q_estimation_theta
+    - q_estimation_jaccard
+    - aberrantly_spliced_genes
+    - batch_correlation_psi5_FALSE
+    - batch_correlation_psi5_TRUE
+    - batch_correlation_psi3_FALSE
+    - batch_correlation_psi3_TRUE
+    - batch_correlation_theta_FALSE
+    - batch_correlation_theta_TRUE
+    - batch_correlation_jaccard_FALSE
+    - batch_correlation_jaccard_TRUE
+    - total_outliers
+    - results
+
+custom_data:
+  fraser_overview:
+    section_name: "Fraser overview"
+    format: "tsv"
+    plot_type: "table"
+
+  q_estimation_psi5:
+    section_name: "Hyperparameter optimization - Q_estimation_psi5"
+
+  q_estimation_psi3:
+    section_name: "Hyperparameter optimization - Q_estimation_psi3"
+
+  q_estimation_theta:
+    section_name: "Hyperparameter optimization - Q_estimation_theta"
+
+  q_estimation_jaccard:
+    section_name: "Hyperparameter optimization"
+    description: "Q_estimation_jaccard"
+
+  aberrantly_spliced_genes:
+    section_name: "Aberrantly spliced genes"
+
+  batch_correlation_psi5_FALSE:
+    section_name: "Batch correlation psi5 raw"
+  batch_correlation_psi5_TRUE:
+    section_name: "Batch correlation psi5 normalized"
+
+  batch_correlation_psi3_FALSE:
+    section_name: "Batch correlation psi3 raw"
+  batch_correlation_psi3_TRUE:
+    section_name: "Batch correlation psi3 normalized"
+
+  batch_correlation_theta_FALSE:
+    section_name: "Batch correlation theta raw"
+  batch_correlation_theta_TRUE:
+    section_name: "Batch correlation theta normalized"
+
+  batch_correlation_jaccard_FALSE:
+    section_name: "Batch correlation jaccard raw"
+  batch_correlation_jaccard_TRUE:
+    section_name: "Batch correlation jaccard normalized"
+
+  total_outliers:
+    section_name: "Results"
+    format: "tsv"
+    plot_type: "table"
+    description: "Total splicing outliers"
+
+  results:
+    section_name: "Results table"
+    description: "FRASER results (up to 500 rows shown)"
@@ -0,0 +1,36 @@
+custom_content:
+  order:
+    - mae_overview
+    - cascade_plot
+    - variant_frequency
+    - median_of_each_category
+    - results
+
+custom_data:
+  mae_overview:
+    section_name: "MAE overview"
+    format: "tsv"
+    plot_type: "table"
+
+  cascade_plot:
+    section_name: "Cascade plot"
+    description: |
+      A cascade plot that shows the effect of progressively added filters:
+
+      - `>10 counts`: only variants supported by more than 10 counts
+      - `+MAE`: the variant additionally shows mono-allelic expression
+      - `+MAE for REF`: the mono-allelic expression favors the reference allele
+      - `+MAE for ALT`: the mono-allelic expression favors the alternative allele
+      - `rare`:
+        - if `add_AF` is set to true in the config file, the variant must meet the allele frequency (AF) cutoff set by the config value `max_AF`
+        - the variant must meet the inner-cohort frequency cutoff `maxVarFreqCohort`
+
+  variant_frequency:
+    section_name: "Variant Frequency within Cohort"
+
+  median_of_each_category:
+    section_name: "Median of each category"
+
+  results:
+    section_name: "MAE Results table"
+    description: "MAE results (up to 500 rows shown)"
@@ -0,0 +1,42 @@
+custom_content:
+  order:
+    - matching_values_distribution
+    - heatmap_matching_variants
+    - identify_matching_samples
+    - false_matches
+    - false_mismatches
+
+custom_data:
+  matching_values_distribution:
+    section_name: "DNA - RNA matching values distribution"
+
+  heatmap_matching_variants:
+    section_name: "Heatmap of matching variants"
+    description: |
+      Shows the proportion of matching DNA (rows) - RNA (cols) variants. Possible values are:
+
+      - match: the DNA sample matches the annotated RNA sample
+      - no match: the DNA sample does not match the annotated RNA and no match was found
+      - matches other: the DNA sample does not match the annotated RNA, but another match was found
+      - matches more: the DNA sample matches the annotated RNA, but also other RNAs not annotated to match
+      - matches less: the DNA sample is annotated with more than 1 RNA. Not all annotated RNAs are correct.
+
+      Similar for the RNAs.
+
+  identify_matching_samples:
+    section_name: "Identify matching samples"
+    format: "tsv"
+    plot_type: "table"
+    description: |
+      Considerations: In our experience, the median proportion of matching variants is around 0.95 in matching samples and around 0.58 in non-matching samples. Values between 0.7 and 0.85 occasionally occur. These could mean that the DNA-RNA combination is not from the same person but from a relative, or they could be due to a technical error. In those cases, check the following:
+
+      - RNA sequencing depth (low sequencing depth can cause variants to be missed in the RNA)
+      - Number of variants (too many variants called due to sequencing errors)
+      - Ratio of heterozygous/homozygous variants (usually too many called variants means too many heterozygous ones)
+      - Is the sample a relative of the other?
+
+  false_matches:
+    section_name: "Samples that were annotated to match but do not"
+
+  false_mismatches:
+    section_name: "Samples that were not annotated to match but actually do"
@@ -0,0 +1,45 @@
+custom_content:
+  order:
+    - number_of_samples
+    - number_of_introns
+    - number_of_splice_sites
+    - comparison_local_and_external
+    - expression_filtering
+
+custom_data:
+  number_of_samples:
+    section_name: "Number of samples"
+    format: "tsv"
+    plot_type: "table"
+
+  number_of_introns:
+    section_name: "Number of introns"
+    format: "tsv"
+    plot_type: "table"
+
+  number_of_splice_sites:
+    section_name: "Number of splice sites"
+    format: "tsv"
+    plot_type: "table"
+
+  comparison_local_and_external:
+    section_name: "Comparison of local and external counts"
+    description: |
+      Using external counts
+
+      External counts complicate junction counting because it is unknown whether a missing junction was not counted (because there are no reads) or was filtered out and withheld for legal or personal data-sharing reasons. Therefore, after merging the local counts (counted from BAM files) with the external counts, only the junctions present in both remain, so the number of junctions will likely decrease after merging.
+
+  expression_filtering:
+    section_name: "Expression filtering"
+    description: |
+      The expression filtering step removes introns that are lowly expressed. The requirements for an intron to pass this filter are:
+
+      - at least 1 sample has 20 counts (K) for the intron
+      - at least 5% of the samples need to have a total of at least 10 reads for the splice metric denominator (N) of the intron
+
+  variability_filtering:
+    section_name: "Variability filtering"
+    description: |
+      The variability filtering step removes introns that have no or little variability in the splice metric values across samples. The requirement for an intron to pass this filter is:
+
+      - at least 1 sample has a difference of at least 0.05 in the splice metric compared to the mean splice metric of the intron
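Reviewer note: the two filter descriptions in this config can be read as the following per-intron checks. This is a minimal sketch of the rules as worded in the descriptions, not FRASER's implementation. The function names and parameter names are hypothetical, and reading "has 20 counts" as "has at least 20 counts" is an assumption.

```python
def passes_expression_filter(k_counts, n_counts, min_k=20, min_n=10, min_fraction=0.05):
    """Sketch of the expression filter for one intron.

    k_counts: per-sample counts K for the intron.
    n_counts: per-sample splice metric denominators N for the intron.
    Thresholds (20, 10, 5%) come from the section description;
    the >= comparisons are an assumption.
    """
    if not k_counts or len(k_counts) != len(n_counts):
        raise ValueError("k_counts and n_counts must be non-empty and equally long")
    # At least one sample must reach min_k counts for the intron.
    has_expressed_sample = any(k >= min_k for k in k_counts)
    # At least min_fraction of the samples must reach min_n for the denominator.
    n_supported = sum(1 for n in n_counts if n >= min_n)
    return has_expressed_sample and n_supported >= min_fraction * len(n_counts)


def passes_variability_filter(metric_values, min_delta=0.05):
    """Sketch of the variability filter for one intron.

    metric_values: per-sample splice metric values for the intron.
    Passes if at least one sample deviates from the mean by min_delta or more.
    """
    if not metric_values:
        raise ValueError("metric_values must be non-empty")
    mean_value = sum(metric_values) / len(metric_values)
    return any(abs(value - mean_value) >= min_delta for value in metric_values)
```

For example, an intron with K values `[0, 25, 3]` and N values `[12, 30, 4]` passes the expression filter in this sketch, while an intron whose splice metric is `0.5` in every sample fails the variability filter.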
@@ -35,7 +35,7 @@
                 "pattern": "^\\S+$",
                 "errorMessage": "DNA ID must be provided and cannot contain spaces",
                 "meta": ["dna_id"],
-                "description": "Unique identifier for the DNA sample. Must not contain spaces."
+                "description": "Identifier for the DNA sample. Must not contain spaces."
             },
             "DNA_VCF_FILE": {
                 "type": "string",
@@ -96,8 +96,8 @@
                 "type": "string",
                 "format": "file-path",
                 "exists": true,
-                "pattern": "^\\S+\\.tsv\\.gz$",
-                "errorMessage": "The gene counts file has to exist, cannot contain spaces and must have extension '.tsv.gz'",
+                "pattern": "^\\S+\\.tsv(\\.gz)?$",
+                "errorMessage": "The gene counts file has to exist, cannot contain spaces and must have extension '.tsv' or '.tsv.gz'",
                 "description": "Path to the gene counts file. Must exist and cannot contain spaces."
             },
             "GENE_ANNOTATION": {
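Reviewer note: the effect of relaxing the gene counts pattern can be checked with a few sample paths. The sketch below uses Python's `re` module as a stand-in for the JSON Schema regex engine, which is an assumption; for this simple pattern the two should behave the same. The helper `accepted` and the sample file names are hypothetical.

```python
import re

# Patterns copied from the hunk, with JSON string escaping removed.
OLD_PATTERN = re.compile(r"^\S+\.tsv\.gz$")
NEW_PATTERN = re.compile(r"^\S+\.tsv(\.gz)?$")


def accepted(path, pattern):
    """Hypothetical helper: True if the path satisfies the given pattern."""
    return pattern.match(path) is not None


# Illustrative paths only; none of these files are part of the PR.
for candidate in ["counts.tsv", "counts.tsv.gz", "my counts.tsv", "counts.csv"]:
    print(candidate, accepted(candidate, OLD_PATTERN), accepted(candidate, NEW_PATTERN))
```

Under this sketch, the uncompressed `counts.tsv` is rejected by the old pattern and accepted by the new one, while paths containing spaces or a different extension are rejected by both.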