README.md: 12 additions & 6 deletions
@@ -3,7 +3,7 @@
 # Unsupervised Analysis Workflow
 A general purpose [Snakemake](https://snakemake.readthedocs.io/en/stable/) workflow to perform unsupervised analyses (dimensionality reduction and cluster analysis) and visualizations of high-dimensional data.
 
-This workflow adheres to the module specifications of [MR.PARETO](https://github.com/epigen/mr.pareto), an effort to augment research by modularizing (biomedical) data science. For more details and modules check out the project's repository. Please consider **starring** and sharing modules that are interesting or useful to you, this helps others to find and benefit from the effort and me to prioritize my efforts!
+This workflow adheres to the module specifications of [MR.PARETO](https://github.com/epigen/mr.pareto), an effort to augment research by modularizing (biomedical) data science. For more details and modules check out the project's repository. Please consider **starring** and sharing modules that are interesting or useful to you, this helps others find and benefit from the effort and me to prioritize my efforts!
 
 **If you use this workflow in a publication, please don't forget to give credit to the authors by citing it using this DOI [10.5281/zenodo.8405360](https://doi.org/10.5281/zenodo.8405360).**
@@ -38,6 +38,7 @@ This project wouldn't be possible without the following software and their depen
@@ -69,7 +70,7 @@ Uniform Manifold Approximation projection (UMAP) from umap-learn (ver) [ref] was
 (Optional) We used the density preserving regularization option, densMAP [ref], during the embedding step, with default parameters to account for varying local density of the data within its original high dimensional space.
 
 **Hierarchically Clustered Heatmap**
-Hierarchically clustered heatmaps of scaled data (z-score) were generated using the R package ComplexHeatmap (ver) [ref]. The distance metric [metric] and clustering method [clustering_method] were used to determine the hierarchical clustering of observations (rows) and features (columns), respectively. The heatmap was annotated with metadata [metadata_of_interest]. The values were colored by the top percentiles (0.01/0.99) of the data to avoid shifts in the coloring scheme caused by outliers.
+Hierarchically clustered heatmaps of scaled data (z-score) were generated using the Python package scipy's (ver) [ref] function pdist for distance matrix calculation (for observations and features), fastcluster's R implementation (ver) [ref] for hierarchical clustering, and the R package ComplexHeatmap (ver) [ref] for visualization. (Optional) To reduce computational cost, the observations were downsampled to [heatmap:n_observations] and the top [n_features] features were selected by high variance. The distance metric [metric] and clustering method [clustering_method] were used to determine the hierarchical clustering of observations (rows) and features (columns), respectively. The heatmap was annotated with metadata [metadata_of_interest]. The values were colored by the top percentiles (0.01/0.99) of the data to avoid shifts in the coloring scheme caused by outliers.
 
 **Visualization**
 The R-packages ggplot2 (ver) [ref] and patchwork (ver) [ref] were used to generate all 2D visualizations colored by metadata [metadata], feature(s) [features_to_plot], and/or clustering results.
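The percentile-based coloring mentioned in the heatmap paragraph above can be illustrated with a short sketch. It is not the workflow's code (the heatmap itself is drawn with ComplexHeatmap in R); it only shows one plausible reading, in which the color scale is limited to the 0.01 and 0.99 quantiles of the scaled values. The function name `percentile_color_limits` and the synthetic data are assumptions made for illustration.

```python
import numpy as np


def percentile_color_limits(values, lower=0.01, upper=0.99):
    """Return (vmin, vmax) at the given quantiles of all values, ignoring NaNs.

    Hypothetical helper: the quantile levels mirror the 0.01/0.99 stated in the text.
    """
    flat = np.asarray(values, dtype=float).ravel()
    vmin, vmax = np.nanquantile(flat, [lower, upper])
    return float(vmin), float(vmax)


# Synthetic data with one extreme outlier (illustration only).
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 50))
data[0, 0] = 1e6

# z-score each feature (column), as the heatmap uses scaled data.
zscores = (data - data.mean(axis=0)) / data.std(axis=0)

# Limit the color range so the outlier does not stretch the scale.
vmin, vmax = percentile_color_limits(zscores)
clipped = np.clip(zscores, vmin, vmax)

print(f"raw max z-score: {zscores.max():.2f}, color scale upper limit: {vmax:.2f}")
```

With limits like these, a single outlier no longer determines the ends of the color scale, which is the effect the paragraph describes.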
@@ -115,7 +116,12 @@ The workflow performs the following analyses on each dataset provided in the ann
 - diagnostics (.PNG): 2D embedding colored by PCA coordinates, vector quantization coordinates, approximated local dimension, neighborhood Jaccard index
 - connectivity (.PNG): graph/network-connectivity plot with edge-bundling (hammer algorithm variant)
 - Hierarchically Clustered Heatmap (.PNG)
-- hierarchically clustered heatmaps of scaled data (z-score) with configured distance [metrics] and clustering methods ([hclust_methods]). All combinations are computed, and annotated with [metadata_of_interest].
+- Hierarchically clustered heatmaps of scaled data (z-score) with configured distances ([metrics]) and clustering methods ([hclust_methods]).
+- Distance matrices of observations and features are precomputed using scipy's pdist function.
+- Hierarchical clustering is performed by the R implementation of fastcluster.
+- The observations can be randomly downsampled by proportion or absolute number ([n_observations]) to reduce computational cost.
+- The number of features can be reduced to a proportion or an absolute number of the top variable features ([n_features]) to reduce computational cost.
+- All combinations are computed, and annotated with [metadata_of_interest].
 - Visualization
 - 2D metadata and feature plots (.PNG) of the first 2 principal components and all 2D embeddings, respectively.
 - interactive 2D and 3D visualizations as self-contained HTML files of all projections/embeddings.
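The heatmap preprocessing steps listed above (downsampling observations, selecting the most variable features, precomputing distance matrices, hierarchical clustering) can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the workflow's implementation: the helper names `resolve_count` and `prepare_heatmap_input` are hypothetical, the order of feature selection and scaling is assumed, and `scipy.cluster.hierarchy.linkage` stands in for the R implementation of fastcluster that the workflow uses.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist


def resolve_count(value, total):
    """Treat a float in (0, 1] as a proportion of `total`, anything else as an absolute count."""
    if isinstance(value, float) and 0 < value <= 1:
        return max(1, int(round(value * total)))
    return min(int(value), total)


def prepare_heatmap_input(data, n_observations, n_features, seed=42):
    """Downsample rows at random, keep the most variable columns, then z-score each column.

    Assumption: variance is computed on the unscaled data before z-scoring.
    """
    rng = np.random.default_rng(seed)
    n_obs = resolve_count(n_observations, data.shape[0])
    n_feat = resolve_count(n_features, data.shape[1])

    rows = np.sort(rng.choice(data.shape[0], size=n_obs, replace=False))
    top_features = np.sort(np.argsort(data.var(axis=0))[::-1][:n_feat])
    subset = data[np.ix_(rows, top_features)]

    std = subset.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant features
    return (subset - subset.mean(axis=0)) / std


# Synthetic observations x features matrix (illustration only).
rng = np.random.default_rng(0)
data = rng.normal(size=(300, 40)) * rng.uniform(0.5, 3.0, size=40)

scaled = prepare_heatmap_input(data, n_observations=100, n_features=0.5)

# Condensed distance vectors for observations (rows) and features (columns).
row_dist = pdist(scaled, metric="correlation")
col_dist = pdist(scaled.T, metric="cosine")

# Hierarchical clustering on the precomputed distances (SciPy stand-in for fastcluster).
row_linkage = linkage(row_dist, method="complete")
col_linkage = linkage(col_dist, method="complete")

print(scaled.shape, row_linkage.shape, col_linkage.shape)
```

Because `pdist` returns a condensed vector of n(n-1)/2 distances, memory grows quadratically with the number of observations, which is why downsampling reduces the cost noticeably.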
@@ -174,7 +180,7 @@ The workflow performs the following analyses on each dataset provided in the ann
 # Usage
 Here are some tips for the usage of this workflow:
 - Start with minimal parameter combinations and without UMAP diagnostics and connectivity plots (they are computationally expensive and slow).
-- Heatmaps require **a lot** of memory, hence the memory allocation is solved dynamically based on retries. If a out-of-memory exception occurs the flag `--retries X` can be used to trigger automatic resubmission X time upon failure with X times the memory.
+- Heatmaps require **a lot** of memory, hence options to reduce computational cost are provided and the memory allocation is solved dynamically based on retries. If an out-of-memory exception occurs the flag `--retries X` can be used to trigger automatic resubmission X times upon failure with X times the memory.
 - Clustification performance scales with available cores, i.e., more cores enable faster internal parallelization of Random Forest training & testing.
 - Cluster indices are extremely compute-intensive and scale linearly with every additional clustering result and specified metadata (can be skipped).
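The retry-based memory allocation mentioned in the tips above follows a common Snakemake pattern in which a resource is given as a callable that receives the current `attempt` number. The sketch below shows that pattern in plain Python; the base value `BASE_MEM_MB`, the function name `heatmap_mem_mb`, and the rule name in the comment are assumptions, not values taken from this workflow.

```python
# Hypothetical base memory per attempt in megabytes (not the workflow's actual value).
BASE_MEM_MB = 16000


def heatmap_mem_mb(wildcards, attempt):
    """Scale requested memory linearly with the attempt number (1 on the first try)."""
    return attempt * BASE_MEM_MB


# In a Snakefile such a callable would typically be attached to a rule like this
# (rule name is illustrative):
#
# rule plot_heatmap:
#     resources:
#         mem_mb=heatmap_mem_mb
#
# Running with `--retries 2` would then allow up to two resubmissions,
# each requesting more memory than the previous attempt.
for attempt in (1, 2, 3):
    print(f"attempt {attempt}: {heatmap_mem_mb(None, attempt)} MB")
```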
@@ -185,12 +191,12 @@ Detailed specifications can be found here [./config/README.md](./config/README.m
 We provide a minimal example of the analysis of the [UCI ML hand-written digits datasets](https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits) imported from [sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html) in the [test folder](./test/):
-You need one configuration file to configure the analyses and one annotation file describing the data to run the complete workflow. If in doubt read the comments in the config and/or try the default values. We provide a full example including data, configuration, results an report in `test/` as a starting point.
+You need one configuration file to configure the analyses and one annotation file describing the data to run the complete workflow. If in doubt read the comments in the config and/or try the default values. We provide a full example including data and configuration in `test/` as a starting point.
 
 - project configuration (`config/config.yaml`): Different for every project and configures the analyses to be performed.
 - sample annotation (annotation): CSV file consisting of four mandatory columns.
 - name: A unique name of the dataset (tip: keep it short but descriptive).
 - data: Path to the tabular data as comma separated table (CSV).
-- metadata: Path to the metadata as comma separated table (CSV) with the first column being the index/identifier of each sample/observation and every other column metadata for the respective sample (either numeric or categorical, not mixed). **No NaN or empty values allowed.**
-- samples_by_features: Boolean indicator if the data matrix is samples (rows) x features (columns) -> (0==no, 1==yes).
+- metadata: Path to the metadata as comma separated table (CSV) with the first column being the index/identifier of each observation/sample and every other column metadata for the respective observation (either numeric or categorical, not mixed). **No NaN or empty values allowed.**
+- samples_by_features: Boolean indicator if the data matrix is observations/samples (rows) x features (columns): 0==no, 1==yes.
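As an illustration of the annotation file described above, the following sketch writes a one-row annotation CSV with the four mandatory columns and runs a few basic checks on it. The dataset name, the file paths, and the helper `validate_annotation` are hypothetical; only the column names come from the text.

```python
import csv
import io

# The four mandatory columns named in the description above.
ANNOTATION_COLUMNS = ["name", "data", "metadata", "samples_by_features"]

# Hypothetical dataset entry; the paths are placeholders, not files shipped with the workflow.
rows = [
    {
        "name": "digits",
        "data": "data/digits_data.csv",
        "metadata": "data/digits_labels.csv",
        "samples_by_features": 1,
    }
]

# Write the annotation to an in-memory buffer (use open(path, "w", newline="") for a real file).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=ANNOTATION_COLUMNS)
writer.writeheader()
writer.writerows(rows)
annotation_text = buffer.getvalue()


def validate_annotation(text):
    """Check mandatory columns, unique names, and the 0/1 samples_by_features flag."""
    reader = csv.DictReader(io.StringIO(text))
    missing = [c for c in ANNOTATION_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing mandatory columns: {missing}")
    records = list(reader)
    names = [r["name"] for r in records]
    if len(set(names)) != len(names):
        raise ValueError("dataset names must be unique")
    for r in records:
        if r["samples_by_features"] not in {"0", "1"}:
            raise ValueError(f"samples_by_features must be 0 or 1 for dataset {r['name']}")
    return records


records = validate_annotation(annotation_text)
print(annotation_text)
```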
 # information on the ComplexHeatmap parameters: https://jokergoo.github.io/ComplexHeatmap-reference/book/index.html
 # distance metrics: for rows and columns. all metrics that are supported by scipy.spatial.distance.pdist (https://docs.scipy.org/doc/scipy-1.14.0/reference/generated/scipy.spatial.distance.pdist.html)
-# clustering methods: methods for hierarchical clustering that are supported by stats::hclust() (https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/hclust)
+# clustering methods: methods for hierarchical clustering that are supported by fastcluster's R implementation (https://danifold.net/fastcluster.html)
 # it is the most resource (memory) intensive method, leave empty [] if not required
 heatmap:
     metrics: ['correlation','cosine']
     hclust_methods: ['complete']
-    n_observations: 1000 # random sampled proportion float [0-1] or absolute number as integer
-    n_features: 0.5 # highly variable features percentate float [0-1] or absolute number as integer
+    n_observations: 1000 # random sampled proportion float (0-1] or absolute number as integer
+    n_features: 0.5 # highly variable features proportion float (0-1] or absolute number as integer
 
 ##### LEIDEN #####
 # Leiden clustering applied on UMAP KNN graphs specified by the respective parameters (metric, n_neighbors).