**Dimensionality Reduction**
**Principal Component Analysis (PCA)**
We used Principal Component Analysis (PCA) [ref] from scikit-learn (ver) [ref] as the linear approach. We visualized [n_components] principal components and kept [X/all] components for downstream analyses. For diagnostic purposes we visualized the variance explained by all and by the top 10% of principal components (PCs) using elbow and cumulative-variance plots, sequential pair-wise PCs for up to 10 PCs using scatter and density plots (colored by [metadata_of_interest]), and finally loadings plots showing the magnitude and direction of the 10 most influential features for each PC as lollipop plots and biplots for sequential combinations of PCs. The R packages ggally (ver) [ref] and ggrepel (ver) [ref] were used to improve the diagnostic visualizations.
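As an illustration, the PCA step and its variance diagnostics can be sketched with scikit-learn; the random input matrix and all values below are placeholders for the bracketed configuration fields, not workflow defaults.

```python
# Minimal sketch of the described PCA step, assuming scikit-learn;
# the random input matrix is a stand-in for a real dataset.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
data = rng.normal(size=(100, 20))  # samples x features

pca = PCA()                        # keep all components
pcs = pca.fit_transform(data)      # principal components per sample

# variance explained per PC and cumulatively, as used for the
# elbow- and cumulative-variance diagnostics
var_explained = pca.explained_variance_ratio_
cum_var = np.cumsum(var_explained)
```

With all components kept, the cumulative explained variance reaches 1, which is what the cumulative-variance plot visualizes.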
**Uniform Manifold Approximation and Projection (UMAP)**
Uniform Manifold Approximation and Projection (UMAP) from umap-learn (ver) [ref] was used as the non-linear approach. The metric [metric] and number of neighbors [n_neighbors] were used for the generation of a shared k-nearest-neighbor graph. The graph was embedded in [n_components] dimensions with minimum distance parameter [min_dist].
The workflow performs the following analyses on each dataset provided in the annotation file. A result folder "unsupervised_analysis" is generated, containing a subfolder for each dataset.
## Dimensionality Reduction
> _"High-dimensional spaces are where intuition goes to die and dimensionality reduction becomes the antidote to the curse of dimensionality."_ (Anonymous)
- Principal Component Analysis (PCA) keeping all components (.pickle and .CSV)
- diagnostics (.PNG):
- variance: scree-plot and cumulative explained variance-plot of all and top 10% principal components
- pairs: sequential pair-wise PCs for up to 10 PCs using scatter- and density-plots colored by [metadata_of_interest]
- loadings: showing the magnitude and direction of the 10 most influential features for each Principal Component combination (Biplot, but without the data)
- loadings lollipop plot: showing the magnitude of the 10 most influential features for each Principal Component
- k-nearest-neighbor graph (.pickle): generated using the [n_neighbors] parameter together with the provided [metrics].
- fix any pickle load issue by pinning the Python version to 3.9 (in case you want to use the graph downstream)
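Reusing a pickled output downstream might look like the following sketch; the helper `load_result` and the example path are hypothetical, not part of the workflow's API, and the script should run under Python 3.9 if version-related unpickling errors occur.

```python
# Hedged sketch for reusing pickled outputs (e.g., the PCA object or the
# k-nearest-neighbor graph); load_result and the example path below are
# hypothetical illustrations.
import pickle

def load_result(path):
    """Load a pickled workflow result from the given path."""
    with open(path, "rb") as fh:
        return pickle.load(fh)

# e.g., graph = load_result("unsupervised_analysis/<dataset>/...")  # hypothetical path
```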
Here are some tips for using this workflow:
- Start with minimal parameter combinations and without UMAP diagnostics and connectivity plots (they are computationally expensive and slow).
- Heatmaps require **a lot** of memory, hence memory allocation is handled dynamically based on retries. If an out-of-memory exception occurs, the flag `--retries X` can be used to trigger automatic resubmission up to X times upon failure, each time with X times the memory.
- Clustification performance scales with available cores, i.e., more cores mean faster internal parallelization of Random Forest training & testing.
- Cluster indices are extremely compute-intensive and scale linearly with every additional clustering result and specified metadata (can be skipped).
# Configuration
Detailed specifications can be found in [./config/README.md](./config/README.md).
# Examples
We provide a minimal example of the analysis of the [UCI ML hand-written digits datasets](https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits) imported from [sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html) in the [test folder](./test/):
- metadata (consisting of the ground truth label "target"): digits_labels.csv
- results will be generated in the configured subfolder `./test/results/`
- performance: on an HPC it took less than 5 minutes to complete a full run (with up to 32GB of memory per task)
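The example inputs can be reproduced along these lines; the label column `target` and the metadata file name `digits_labels.csv` come from this README, while the data file name is an assumption.

```python
# Sketch of preparing the minimal example inputs; the label column "target"
# and metadata file name follow the README, the data file name is assumed.
import pandas as pd
from sklearn.datasets import load_digits

digits = load_digits()
data = pd.DataFrame(digits.data)  # 1797 samples x 64 pixel features
data.index.name = "sample"

# metadata: one row per sample, ground-truth label "target"
metadata = pd.DataFrame({"target": digits.target}, index=data.index)

data.to_csv("digits_data.csv")        # assumed file name
metadata.to_csv("digits_labels.csv")  # metadata file from the README
```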
# single-cell RNA sequencing (scRNA-seq) data analysis
Unsupervised analyses, dimensionality reduction and cluster analysis, are cornerstones of scRNA-seq data analysis.
A full run on a [published](https://www.nature.com/articles/s41588-020-0636-z) scRNA-seq [cancer dataset](https://www.weizmann.ac.il/sites/3CA/colorectal) with 21,657 cells and 18,245 genes took 2.5 hours to complete (without heatmaps, with 32GB memory and 8 cores for clustification).
Below are configurations of the two most commonly used frameworks, [scanpy](https://scanpy.readthedocs.io/en/stable/index.html) (Python) and [Seurat](https://satijalab.org/seurat/) (R), alongside the original packages' defaults for comparison and to facilitate reproducibility:
UMAP for dimensionality reduction
Leiden algorithm for clustering
- [ATAC-seq Processing](https://github.com/epigen/atacseq_pipeline) to quantify chromatin accessibility.
- [scRNA-seq Data Processing & Visualization](https://github.com/epigen/scrnaseq_processing_seurat) for processing and preparing single cell data as input.
- [Split, Filter, Normalize and Integrate Sequencing Data](https://github.com/epigen/spilterlize_integrate) to process and prepare sequencing data as input.
- [Perturbation Analysis using Mixscape from Seurat](https://github.com/epigen/mixscape_seurat) to identify perturbed cells from pooled (multimodal) CRISPR screens with sc/snRNA-seq read-out (scCRISPR-seq) as input.
- [Reichl, S. (2018). Mathematical methods in single cell RNA sequencing analysis with an emphasis on the validation of clustering results [Diploma Thesis, Technische Universität Wien]](https://doi.org/10.34726/hss.2018.49662)
You need one configuration file to configure the analyses and one annotation file describing the data to run the complete workflow. If in doubt, read the comments in the config and/or try the default values. We provide a full example including data, configuration, results, and report in `test/` as a starting point.
- project configuration (`config/config.yaml`): Different for every project and configures the analyses to be performed.
- sample annotation (annotation): CSV file consisting of four mandatory columns.
    - name: A unique name of the dataset (tip: keep it short but descriptive).
    - data: Path to the tabular data as comma separated table (CSV).
    - metadata: Path to the metadata as comma separated table (CSV) with the first column being the index/identifier of each sample/observation and every other column metadata for the respective sample (either numeric or categorical, not mixed). **No NaN or empty values allowed.**
    - samples_by_features: Boolean indicator if the data matrix is samples (rows) x features (columns) -> (0==no, 1==yes).
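A hypothetical annotation file with the four mandatory columns could be built like this; the dataset name and file paths are placeholders, not values from this repository.

```python
# Hypothetical sample annotation with the four mandatory columns;
# the dataset name and file paths are placeholders.
import pandas as pd

annotation = pd.DataFrame([{
    "name": "digits",                           # unique, short, descriptive
    "data": "test/data/digits_data.csv",        # tabular data CSV
    "metadata": "test/data/digits_labels.csv",  # index + metadata columns
    "samples_by_features": 1,                   # 1 == samples are rows
}])
annotation.to_csv("annotation.csv", index=False)
```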