Commit 483f23f

update documentation
1 parent 8105bf1 commit 483f23f

4 files changed: +21 -21 lines changed

README.md

Lines changed: 12 additions & 9 deletions
Original file line number | Diff line number | Diff line change
@@ -62,7 +62,7 @@ The outlined analyses were performed using the programming languages R (ver) [re
6262
**Dimensionality Reduction**
6363

6464
**Principal Component Analysis (PCA)**
65-
We used Principal Component Analysis (PCA) [ref] from scikit-learn (ver) [ref] as the linear approach. We visualized [n_components] principal components and kept [X/all] components for downstream analyses. For diagnostic purposes we visualized the variance explained of all and the top 10% of principal components (PCs) using elbow- and cumulative-variance-plots, sequential pair-wise PCs for up to 10 PCs using scatter-, and density-plots (colored by [metadata_of_interest]), and finally loadings plots showing the magnitude and direction of the 10 most influential features for each PC combination. The R packages ggally (ver) [ref] and ggrepel (ver) [ref] were used to improve the diagnostic visualizations.
65+
We used Principal Component Analysis (PCA) [ref] from scikit-learn (ver) [ref] as the linear approach. We visualized [n_components] principal components and kept [X/all] components for downstream analyses. For diagnostic purposes we visualized the variance explained of all and the top 10% of principal components (PCs) using elbow- and cumulative-variance-plots, sequential pair-wise PCs for up to 10 PCs using scatter- and density-plots (colored by [metadata_of_interest]), and finally loadings plots showing the magnitude and direction of the 10 most influential features for each PC as a lollipop plot and as a biplot for sequential combinations of PCs. The R packages ggally (ver) [ref] and ggrepel (ver) [ref] were used to improve the diagnostic visualizations.
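The selection of the "10 most influential features for each PC" can be illustrated with a minimal sketch. It assumes the loadings are available as a components-by-features matrix (the layout of scikit-learn's `PCA.components_`) and ranks features by absolute loading; the function name `top_loading_features` and the toy values are hypothetical and not taken from the workflow's scripts.

```python
def top_loading_features(loadings, feature_names, k=10):
    """Return, for each component, the k (feature, loading) pairs
    with the largest absolute loading, strongest first."""
    top = []
    for component in loadings:
        ranked = sorted(
            zip(feature_names, component),
            key=lambda pair: abs(pair[1]),  # magnitude decides the rank
            reverse=True,
        )
        top.append(ranked[:k])  # the sign (direction) is kept in the pair
    return top


# Toy loadings: 2 components x 4 features (hypothetical values).
features = ["f1", "f2", "f3", "f4"]
loadings = [
    [0.10, -0.80, 0.50, 0.05],
    [0.60, 0.20, -0.10, -0.70],
]
top2 = top_loading_features(loadings, features, k=2)
```

Keeping the signed loading next to the feature name preserves both quantities the plots display: magnitude for the lollipop plot, magnitude and direction for the biplot.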
6666

6767
**Uniform Manifold Approximation and Projection (UMAP)**
6868
Uniform Manifold Approximation and Projection (UMAP) from umap-learn (ver) [ref] was used as the non-linear approach. The metric [metric] and number of neighbors [n_neighbors] were used for the generation of a shared k-nearest-neighbor graph. The graph was embedded in [n_components] dimensions with minimum distance parameter [min_dist].
@@ -99,11 +99,13 @@ We performed internal cluster validation using six complementary indices: Silhou
9999
The workflow performs the following analyses on each dataset provided in the annotation file. A result folder "unsupervised_analysis" is generated containing a folder for each dataset.
100100

101101
## Dimensionality Reduction
102+
> _"High-dimensional spaces are where intuition goes to die and dimensionality reduction becomes the antidote to the curse of dimensionality."_ from Anonymous
102103
- Principal Component Analysis (PCA) keeping all components (.pickle and .CSV)
103104
- diagnostics (.PNG):
104105
- variance: scree-plot and cumulative explained variance-plot of all and top 10% principal components
105106
- pairs: sequential pair-wise PCs for up to 10 PCs using scatter- and density-plots colored by [metadata_of_interest]
106-
- loadings: showing the magnitude and direction of the 10 most influential features for each PC combination
107+
- loadings: showing the magnitude and direction of the 10 most influential features for each Principal Component combination (Biplot, but without the data)
108+
- loadings lolliplot: showing the magnitude of the 10 most influential features for each Principal Component
107109
- Uniform Manifold Approximation & Projection (UMAP)
108110
- k-nearest-neighbor graph (.pickle): generated using the [n_neighbors] parameter together with the provided [metrics].
109111
- fix any pickle load issue by specifying Python version to 3.9 (in case you want to use the graph downstream)
@@ -172,27 +174,26 @@ The workflow performs the following analyses on each dataset provided in the ann
172174
Here are some tips for the usage of this workflow:
173175
- Start with minimal parameter combinations and without UMAP diagnostics and connectivity plots (they are computationally expensive and slow).
174176
- Heatmaps require **a lot** of memory, hence the memory allocation is solved dynamically based on retries. If an out-of-memory exception occurs, the flag `--retries X` can be used to trigger automatic resubmission X times upon failure with X times the memory.
175-
- Clustification performance scales with available cores, i.e., more cores faster internal parallelization of RF training & testing.
176-
- Cluster indices are extremely compute intense and scale linearly with every additional clustering result and specified metadata.
177-
177+
- Clustification performance scales with available cores, i.e., more cores faster internal parallelization of Random Forest training & testing.
178+
- Cluster indices are extremely compute intense and scale linearly with every additional clustering result and specified metadata (can be skipped).
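The retry-based memory allocation mentioned in the tips can be sketched as follows. This is an illustration under the assumption that the requested memory grows linearly with the attempt number; Snakemake supports this pattern by passing `attempt` to callable resources (e.g. `mem_mb=lambda wildcards, attempt: attempt * base`). The function name and the base value are hypothetical and do not come from the workflow's rules.

```python
def memory_for_attempt(base_mem_mb, attempt):
    """Memory request for a given attempt, assuming linear scaling:
    attempt 1 gets the base amount, attempt 2 twice that, and so on."""
    if attempt < 1:
        raise ValueError("attempt numbers start at 1")
    return base_mem_mb * attempt


# With `--retries 2` a failing job is tried up to three times in total.
requests = [memory_for_attempt(32000, attempt) for attempt in range(1, 4)]
```

Under this assumption, a job that fails at 32 GB would be resubmitted with 64 GB and then 96 GB.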
178179

179180
# Configuration
180181
Detailed specifications can be found here [./config/README.md](./config/README.md)
181182

182183
# Examples
183-
We provide a minimal example of the analysis of the [UCI ML hand-written digits datasets](https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits) imported from [sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html) in the [test folder](.test/):
184+
We provide a minimal example of the analysis of the [UCI ML hand-written digits datasets](https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits) imported from [sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html) in the [test folder](./test/):
184185
- config
185186
- configuration: config/config.yaml
186187
- sample annotation: digits_unsupervised_analysis_annotation.csv
187188
- data
188189
- dataset (1797 observations, 64 features): digits_data.csv
189190
- metadata (consisting of the ground truth label "target"): digits_labels.csv
190-
- results will be generated in a subfolder .test/results/
191+
- results will be generated in the configured subfolder `./test/results/`
191192
- performance: on an HPC it took less than 5 minutes to complete a full run (with up to 32GB of memory per task)
192193

193194
# single-cell RNA sequencing (scRNA-seq) data analysis
194195
Unsupervised analyses, dimensionality reduction and cluster analysis, are cornerstones of scRNA-seq data analyses.
195-
A full run on a [published](https://www.nature.com/articles/s41588-020-0636-z) scRNA-seq [cancer dataset](https://www.weizmann.ac.il/sites/3CA/colorectal) with 21,657 cells and 18,245 genes took 2.5 hours to complete (without heatmaps, with 32GB memory, with 8 cores for clustification, ).
196+
A full run on a [published](https://www.nature.com/articles/s41588-020-0636-z) scRNA-seq [cancer dataset](https://www.weizmann.ac.il/sites/3CA/colorectal) with 21,657 cells and 18,245 genes took 2.5 hours to complete (without heatmaps, with 32GB memory and 8 cores for clustification).
196197
Below are configurations of the two most commonly used frameworks, [scanpy](https://scanpy.readthedocs.io/en/stable/index.html) (Python) and [Seurat](https://satijalab.org/seurat/) (R), and the original package's defaults as comparison and to facilitate reproducibility:
197198

198199
UMAP for dimensionality reduction
@@ -236,7 +237,9 @@ Leiden algorithm for clustering
236237
- Recommended compatible [MR.PARETO](https://github.com/epigen/mr.pareto) modules
237238
- for upstream processing:
238239
- [ATAC-seq Processing](https://github.com/epigen/atacseq_pipeline) to quantify chromatin accessibility.
239-
- [Split, Filter, Normalize and Integrate Sequencing Data](https://github.com/epigen/spilterlize_integrate) process sequencing data.
240+
- [scRNA-seq Data Processing & Visualization](https://github.com/epigen/scrnaseq_processing_seurat) for processing and preparing single cell data as input.
241+
- [Split, Filter, Normalize and Integrate Sequencing Data](https://github.com/epigen/spilterlize_integrate) to process and prepare sequencing data as input.
242+
- [Perturbation Analysis using Mixscape from Seurat](https://github.com/epigen/mixscape_seurat) to identify perturbed cells from pooled (multimodal) CRISPR screens with sc/snRNA-seq read-out (scCRISPR-seq) as input.
240243
- [Reichl, S. (2018). Mathematical methods in single cell RNA sequencing analysis with an emphasis on the validation of clustering results [Diploma Thesis, Technische Universität Wien]](https://doi.org/10.34726/hss.2018.49662)
241244

242245
# Publications

config/README.md

Lines changed: 7 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -1,10 +1,10 @@
11
# Configuration
22

3-
You need one configuration file to configure the analyses and one annotation file for the data to run the complete workflow. If in doubt read the comments in the config and/or try the default values. We provide a full example including data, configuration, results and report in test/ as a starting point.
3+
You need one configuration file to configure the analyses and one annotation file describing the data to run the complete workflow. If in doubt read the comments in the config and/or try the default values. We provide a full example including data, configuration, results and report in `test/` as a starting point.
44

5-
- project configuration (config/config.yaml): different for every project and configures the analyses to be performed.
6-
- sample annotation (sample_annotation): CSV file consisting of three mandatory columns.
7-
- name: a unique name of the dataset (tip: keep it short but descriptive).
8-
- data: path to the tabular data as comma separated table (CSV).
9-
- metadata: path to the metadata as comma separated table (CSV) with the first column being the index/identifier of each sample/observation and every other column metadata for the respective sample (either numeric or categorical, not mixed). **No NaN or empty values allowed.**
10-
- samples_by_features: 0 or 1 as boolean indicator if data matrix is samples (rows) x features (columns) -> (0==no, 1==yes).
5+
- project configuration (`config/config.yaml`): Different for every project and configures the analyses to be performed.
6+
- sample annotation (annotation): CSV file consisting of four mandatory columns.
7+
- name: A unique name of the dataset (tip: keep it short but descriptive).
8+
- data: Path to the tabular data as comma separated table (CSV).
9+
- metadata: Path to the metadata as comma separated table (CSV) with the first column being the index/identifier of each sample/observation and every other column metadata for the respective sample (either numeric or categorical, not mixed). **No NaN or empty values allowed.**
10+
- samples_by_features: Boolean indicator if the data matrix is samples (rows) x features (columns) -> (0==no, 1==yes).
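A quick sanity check of the annotation file before starting the workflow could look like the sketch below. It only verifies the four mandatory columns listed above, the uniqueness of `name`, and that `samples_by_features` is 0 or 1. The function name `validate_annotation` and the example paths are hypothetical; the workflow itself may validate its inputs differently.

```python
import csv
import io

REQUIRED_COLUMNS = ["name", "data", "metadata", "samples_by_features"]


def validate_annotation(csv_text):
    """Parse annotation CSV text and check the four mandatory columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        raise ValueError(f"missing mandatory columns: {missing}")
    rows = list(reader)
    names = [row["name"] for row in rows]
    if len(names) != len(set(names)):
        raise ValueError("dataset names must be unique")
    for row in rows:
        if row["samples_by_features"] not in ("0", "1"):
            raise ValueError(
                f"samples_by_features must be 0 or 1 for dataset {row['name']!r}"
            )
    return rows


# Example annotation with one dataset (paths are placeholders).
example = (
    "name,data,metadata,samples_by_features\n"
    "digits,data/digits_data.csv,data/digits_labels.csv,1\n"
)
rows = validate_annotation(example)
```

To check a file on disk, read it first and pass its text, for example `validate_annotation(open(path).read())`.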

config/config.yaml

Lines changed: 2 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,3 @@
1-
# always use absolute paths
21
# provide at least one parameter per option (no empty fields allowed)
32

43
##### RESOURCES #####
@@ -72,7 +71,7 @@ clustree:
7271
# Cluster validation using internal cluster indices is computationally very expensive.
7372
# To reduce complexity and increase performance a proportion of samples can be used for the internal cluster evaluation.
7473
# Internal cluster validation can be skipped with 0.
75-
sample_proportion: 1 # float (0-1], >500 samples should be included.
74+
sample_proportion: 1 # float [0-1], >500 samples should be included.
7675

7776
##### categorical metadata column used in the following analyses:
7877
# - PCA pairs plot (first entry only)
@@ -93,7 +92,7 @@ scatterplot2d:
9392
alpha: 1
9493

9594
# specify features of interest. these features from the data, will be highlighted in the 2D/3D plots
96-
# motivated by bioinformatics highlighting expression levels of marker genes (eg: ['PTPRC'])
95+
# motivated by bioinformatics highlighting expression levels of marker genes (eg: ['PTPRC','STAT1','IRF8'])
9796
# use keyword ['ALL'] to plot all features. WARNING: Only useful for relatively low dimensional data, a plot is generated for each feature and method.
9897
# if not used leave empty []
9998
features_to_plot: ['ALL'] #['pixel_0_0','pixel_0_1','pixel_0_2','pixel_0_3']
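The effect of `sample_proportion` described in the comments above (a value in [0, 1], with 0 skipping internal cluster validation) can be sketched as a simple random subsample of observation indices. The function name, the fixed seed, and the rounding rule are assumptions made for illustration, not the workflow's actual implementation.

```python
import random


def subsample_indices(n_samples, sample_proportion, seed=42):
    """Pick a random subset of sample indices for cluster validation.

    Returns None when sample_proportion is 0, meaning validation is skipped.
    """
    if not 0 <= sample_proportion <= 1:
        raise ValueError("sample_proportion must be within [0, 1]")
    if sample_proportion == 0:
        return None  # internal cluster validation is skipped
    n_keep = max(1, int(n_samples * sample_proportion))
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    return sorted(rng.sample(range(n_samples), n_keep))


# Half of the 1797 digits observations, well above the suggested 500.
picked = subsample_indices(1797, 0.5)
```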

workflow/scripts/plot_heatmap.R

Lines changed: 0 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -30,8 +30,6 @@ if (!dir.exists(result_dir)){
3030
}
3131

3232
### load data
33-
# data <- read.csv(file=file.path(data_path), row.names=1, header=TRUE)
34-
# metadata <- read.csv(file=file.path(metadata_path), row.names=1, header=TRUE)
3533
data <- data.frame(fread(file.path(data_path), header=TRUE), row.names=1)
3634
metadata <- data.frame(fread(file.path(metadata_path), header=TRUE), row.names=1)
3735
