
Commit 1f3993a

update docs with Heatmap performance improvements #4
1 parent 12f669f commit 1f3993a

File tree

9 files changed: +20 −52 lines changed

README.md

Lines changed: 12 additions & 6 deletions
@@ -3,7 +3,7 @@
 # Unsupervised Analysis Workflow
 A general purpose [Snakemake](https://snakemake.readthedocs.io/en/stable/) workflow to perform unsupervised analyses (dimensionality reduction and cluster analysis) and visualizations of high-dimensional data.

-This workflow adheres to the module specifications of [MR.PARETO](https://github.com/epigen/mr.pareto), an effort to augment research by modularizing (biomedical) data science. For more details and modules check out the project's repository. Please consider **starring** and sharing modules that are interesting or useful to you, this helps others to find and benefit from the effort and me to prioritize my efforts!
+This workflow adheres to the module specifications of [MR.PARETO](https://github.com/epigen/mr.pareto), an effort to augment research by modularizing (biomedical) data science. For more details and modules check out the project's repository. Please consider **starring** and sharing modules that are interesting or useful to you, this helps others find and benefit from the effort and me to prioritize my efforts!

 **If you use this workflow in a publication, please don't forget to give credit to the authors by citing it using this DOI [10.5281/zenodo.8405360](https://doi.org/10.5281/zenodo.8405360).**

@@ -38,6 +38,7 @@ This project wouldn't be possible without the following software and their depen
 | clustree | https://doi.org/10.1093/gigascience/giy083 |
 | ComplexHeatmap | https://doi.org/10.1093/bioinformatics/btw313 |
 | densMAP | https://doi.org/10.1038/s41587-020-00801-7 |
+| fastcluster | https://doi.org/10.18637/jss.v053.i09 |
 | ggally | https://CRAN.R-project.org/package=GGally |
 | ggplot2 | https://ggplot2.tidyverse.org/ |
 | ggrepel | https://CRAN.R-project.org/package=ggrepel |
@@ -69,7 +70,7 @@ Uniform Manifold Approximation projection (UMAP) from umap-learn (ver) [ref] was
 (Optional) We used the density preserving regularization option, densMAP [ref], during the embedding step, with default parameters to account for varying local density of the data within its original high dimensional space.

 **Hierarchically Clustered Heatmap**
-Hierarchically clustered heatmaps of scaled data (z-score) were generated using the R package ComplexHeatmap (ver) [ref]. The distance metric [metric] and clustering method [clustering_method] were used to determine the hierarchical clustering of observations (rows) and features (columns), respectively. The heatmap was annotated with metadata [metadata_of_interest]. The values were colored by the top percentiles (0.01/0.99) of the data to avoid shifts in the coloring scheme caused by outliers.
+Hierarchically clustered heatmaps of scaled data (z-score) were generated using the Python package scipy's (ver) [ref] function pdist for distance matrix calculation (for observations and features), fastcluster's R implementation (ver) [ref] for hierarchical clustering, and the R package ComplexHeatmap (ver) [ref] for visualization. (Optional) To reduce computational cost, the observations were downsampled to [heatmap:n_observations] and the top [n_features] features were selected by high variance. The distance metric [metric] and clustering method [clustering_method] were used to determine the hierarchical clustering of observations (rows) and features (columns), respectively. The heatmap was annotated with metadata [metadata_of_interest]. The values were colored by the top percentiles (0.01/0.99) of the data to avoid shifts in the coloring scheme caused by outliers.

 **Visualization**
 The R-packages ggplot2 (ver) [ref] and patchwork (ver) [ref] were used to generate all 2D visualizations colored by metadata [metadata], feature(s) [features_to_plot], and/or clustering results.
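The heatmap preprocessing described above (downsampling, variance-based feature selection, z-scoring, and precomputed distance matrices via scipy's pdist) can be sketched roughly as follows; the array sizes, seed, and variable names are illustrative, not the workflow's actual script:

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(42)
data = rng.normal(size=(5000, 200))  # observations x features

# downsample observations to an absolute number (illustrative of [heatmap:n_observations])
n_observations = 1000
idx = rng.choice(data.shape[0], size=n_observations, replace=False)
data = data[idx]

# keep the top fraction of features ranked by variance (illustrative of [n_features] = 0.5)
n_features = int(0.5 * data.shape[1])
top = np.argsort(data.var(axis=0))[::-1][:n_features]
data = data[:, top]

# z-score scaling per feature, as for the heatmap values
data = (data - data.mean(axis=0)) / data.std(axis=0)

# precompute condensed distance matrices for observations (rows) and features (columns)
dist_obs = pdist(data, metric="correlation")
dist_feat = pdist(data.T, metric="correlation")
print(dist_obs.shape, dist_feat.shape)
```

Precomputing both condensed distance matrices once is what allows the subsequent hierarchical clustering (here done in R via fastcluster) to skip the memory-heavy pairwise-distance step.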
@@ -115,7 +116,12 @@ The workflow performs the following analyses on each dataset provided in the ann
 - diagnostics (.PNG): 2D embedding colored by PCA coordinates, vector quantization coordinates, approximated local dimension, neighborhood Jaccard index
 - connectivity (.PNG): graph/network-connectivity plot with edge-bundling (hammer algorithm variant)
 - Hierarchically Clustered Heatmap (.PNG)
-  - hierarchically clustered heatmaps of scaled data (z-score) with configured distance [metrics] and clustering methods ([hclust_methods]). All combinations are computed, and annotated with [metadata_of_interest].
+  - Hierarchically clustered heatmaps of scaled data (z-score) with configured distances ([metrics]) and clustering methods ([hclust_methods]).
+  - Distance matrices of observations and features are precomputed using scipy's pdist function.
+  - Hierarchical clustering is performed by the R implementation of fastcluster.
+  - The observations can be randomly downsampled by proportion or absolute number ([n_observations]) to reduce computational cost.
+  - The number of features can be reduced to a proportion or an absolute number of the top variable features ([n_features]) to reduce computational cost.
+  - All combinations are computed and annotated with [metadata_of_interest].
 - Visualization
   - 2D metadata and feature plots (.PNG) of the first 2 principal components and all 2D embeddings, respectively.
   - interactive 2D and 3D visualizations as self-contained HTML files of all projections/embeddings.
@@ -174,7 +180,7 @@ The workflow performs the following analyses on each dataset provided in the ann
 # Usage
 Here are some tips for the usage of this workflow:
 - Start with minimal parameter combinations and without UMAP diagnostics and connectivity plots (they are computationally expensive and slow).
-- Heatmaps require **a lot** of memory, hence the memory allocation is solved dynamically based on retries. If a out-of-memory exception occurs the flag `--retries X` can be used to trigger automatic resubmission X time upon failure with X times the memory.
+- Heatmaps require **a lot** of memory, hence options to reduce computational cost are provided and the memory allocation is solved dynamically based on retries. If an out-of-memory exception occurs, the flag `--retries X` can be used to trigger automatic resubmission X times upon failure with X times the memory.
 - Clustification performance scales with available cores, i.e., more cores enable faster internal parallelization of Random Forest training & testing.
 - Cluster indices are extremely compute-intensive and scale linearly with every additional clustering result and specified metadata (can be skipped).

@@ -185,12 +191,12 @@ Detailed specifications can be found here [./config/README.md](./config/README.m
 We provide a minimal example of the analysis of the [UCI ML hand-written digits datasets](https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits) imported from [sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html) in the [test folder](./test/):
 - config
   - configuration: config/config.yaml
-  - sample annotation: digits_unsupervised_analysis_annotation.csv
+  - sample annotation: test/config/digits_unsupervised_analysis_annotation.csv
 - data
   - dataset (1797 observations, 64 features): digits_data.csv
   - metadata (consisting of the ground truth label "target"): digits_labels.csv
 - results will be generated in the configured subfolder `./test/results/`
-- performance: on an HPC it took less than 5 minutes to complete a full run (with up to 32GB of memory per task)
+- performance: on an HPC it took less than 7 minutes to complete a full run (with up to 32GB of memory per task)

 # single-cell RNA sequencing (scRNA-seq) data analysis
 Unsupervised analyses, dimensionality reduction and cluster analysis, are cornerstones of scRNA-seq data analyses.

config/README.md

Lines changed: 3 additions & 3 deletions
@@ -1,10 +1,10 @@
 # Configuration

-You need one configuration file to configure the analyses and one annotation file describing the data to run the complete workflow. If in doubt read the comments in the config and/or try the default values. We provide a full example including data, configuration, results an report in `test/` as a starting point.
+You need one configuration file to configure the analyses and one annotation file describing the data to run the complete workflow. If in doubt read the comments in the config and/or try the default values. We provide a full example including data and configuration in `test/` as a starting point.

 - project configuration (`config/config.yaml`): Different for every project and configures the analyses to be performed.
 - sample annotation (annotation): CSV file consisting of four mandatory columns.
   - name: A unique name of the dataset (tip: keep it short but descriptive).
   - data: Path to the tabular data as comma separated table (CSV).
-  - metadata: Path to the metadata as comma separated table (CSV) with the first column being the index/identifier of each sample/observation and every other column metadata for the respective sample (either numeric or categorical, not mixed). **No NaN or empty values allowed.**
-  - samples_by_features: Boolean indicator if the data matrix is samples (rows) x features (columns) -> (0==no, 1==yes).
+  - metadata: Path to the metadata as comma separated table (CSV) with the first column being the index/identifier of each observation/sample and every other column metadata for the respective observation (either numeric or categorical, not mixed). **No NaN or empty values allowed.**
+  - samples_by_features: Boolean indicator if the data matrix is observations/samples (rows) x features (columns): 0==no, 1==yes.
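For illustration, a sample annotation CSV with the four mandatory columns could look like this for the digits example; the file paths shown here are hypothetical placeholders, not taken verbatim from the repository:

```csv
name,data,metadata,samples_by_features
digits,test/data/digits_data.csv,test/data/digits_labels.csv,1
```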

config/config.yaml

Lines changed: 4 additions & 3 deletions
@@ -13,6 +13,7 @@ project_name: digits

 ##### PCA #####
 # https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
+# especially relevant for large data
 pca:
   n_components: 0.9 # variance as float (0-1], number of components as int e.g., 50, or 'mle'
   svd_solver: 'auto' # options: 'auto', 'full', 'covariance_eigh', 'arpack', 'randomized'
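As a sketch of how scikit-learn interprets the `n_components` setting above (a float in (0-1] keeps enough components to reach that explained-variance fraction, an int keeps exactly that many), using the digits example data:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)  # 1797 observations x 64 features

# n_components=0.9 keeps the smallest number of components whose cumulative
# explained variance reaches 90%; svd_solver='auto' picks a solver heuristically
pca = PCA(n_components=0.9, svd_solver="auto")
X_reduced = pca.fit_transform(X)

print(X_reduced.shape, pca.explained_variance_ratio_.sum())
```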
@@ -36,13 +37,13 @@ umap:
 ##### HEATMAP #####
 # information on the ComplexHeatmap parameters: https://jokergoo.github.io/ComplexHeatmap-reference/book/index.html
 # distance metrics: for rows and columns. all metrics that are supported by scipy.spatial.distance.pdist (https://docs.scipy.org/doc/scipy-1.14.0/reference/generated/scipy.spatial.distance.pdist.html)
-# clustering methods: methods for hierarchical clustering that are supported by stats::hclust() (https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/hclust)
+# clustering methods: methods for hierarchical clustering that are supported by fastcluster's R implementation (https://danifold.net/fastcluster.html)
 # it is the most resource (memory) intensive method, leave empty [] if not required
 heatmap:
   metrics: ['correlation','cosine']
   hclust_methods: ['complete']
-  n_observations: 1000 # random sampled proportion float [0-1] or absolute number as integer
-  n_features: 0.5 # highly variable features percentate float [0-1] or absolute number as integer
+  n_observations: 1000 # randomly sampled proportion as float (0-1] or absolute number as integer
+  n_features: 0.5 # highly variable features proportion as float (0-1] or absolute number as integer

 ##### LEIDEN #####
 # Leiden clustering applied on UMAP KNN graphs specified by the respective parameters (metric, n_neighbors).
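A minimal sketch of how a proportion-or-absolute parameter such as `n_observations`/`n_features` could be resolved; `resolve_count` is a hypothetical helper for illustration, not the workflow's actual code:

```python
def resolve_count(value, total):
    """Interpret a config value as a proportion (float in (0-1]) or an absolute count (int)."""
    if isinstance(value, float) and 0 < value <= 1:
        # proportion: round to a count, keep at least one item
        return max(1, int(round(value * total)))
    # absolute count: cap at the available total
    return min(int(value), total)

# with the example config: n_observations=1000 (absolute), n_features=0.5 (proportion)
print(resolve_count(1000, 1797))  # -> 1000
print(resolve_count(0.5, 64))     # -> 32
```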

workflow/Snakefile

Lines changed: 0 additions & 5 deletions
@@ -142,11 +142,6 @@ rule all:
             n_components=[dims for dims in config["umap"]["n_components"] if dims in [2,3]]
         ) if 2 in config["umap"]["n_components"] or 3 in config["umap"]["n_components"] else [],
         # Heatmap
-        # distance_matrices = expand(os.path.join(result_path,'{sample}','Heatmap','DistanceMatrix_{metric}_{type}.csv'),
-        #                     sample=list(annot.index),
-        #                     metric=config["heatmap"]["metrics"],
-        #                     type=["observations","features"],
-        #                     ),
         heatmap_plots = expand(os.path.join(result_path,'{sample}','Heatmap','plots','Heatmap_{metric}_{method}.png'),
                             sample=list(annot.index),
                             method=config["heatmap"]["hclust_methods"],
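For readers unfamiliar with Snakemake, the `expand()` call above enumerates all wildcard combinations into concrete target paths. The following standalone sketch mimics that behavior; the `expand` function here is a simplified stand-in, not Snakemake's implementation:

```python
from itertools import product

def expand(pattern, **wildcards):
    # substitute every combination of wildcard values into the pattern
    keys = list(wildcards)
    return [pattern.format(**dict(zip(keys, combo)))
            for combo in product(*(wildcards[k] for k in keys))]

paths = expand("results/{sample}/Heatmap/plots/Heatmap_{metric}_{method}.png",
               sample=["digits"],
               metric=["correlation", "cosine"],
               method=["complete"])
print(paths)
```

With the example config (two metrics, one clustering method, one dataset), this yields two heatmap targets, one per metric/method combination.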

workflow/envs/fastdist_UNUSED.yaml

Lines changed: 0 additions & 12 deletions
This file was deleted.

workflow/envs/sklearn_UNUSED.yaml

Lines changed: 0 additions & 7 deletions
This file was deleted.

workflow/envs/umap_UNUSED.yaml

Lines changed: 0 additions & 13 deletions
This file was deleted.

workflow/rules/clustering.smk

Lines changed: 0 additions & 1 deletion
@@ -83,7 +83,6 @@ rule aggregate_all_clustering_results:
         # read each clustering result and add to data dict
         for filename in input:
             agg_clust.append(pd.read_csv(filename, header=0, index_col=0))
-

         # convert the dictionary to a DataFrame
         agg_clust_df = pd.concat(agg_clust, axis=1)

workflow/rules/common.smk

Lines changed: 1 addition & 2 deletions
@@ -134,8 +134,7 @@ def get_external_validation_paths(wildcards):

 # get paths to determine internal cluster indices
 def get_internal_validation_paths(wildcards):
-    return {#'data': annot.loc[wildcards.sample,'data'],
-            'metadata': annot.loc[wildcards.sample,"metadata"],
+    return {'metadata': annot.loc[wildcards.sample,"metadata"],
             'clusterings': os.path.join(result_path,wildcards.sample, "metadata_clusterings.csv"),
             'pca': os.path.join(result_path,wildcards.sample,'PCA','PCA_{}_{}_data.csv'.format(config["pca"]["svd_solver"],config["pca"]["n_components"])),
             'pca_var': os.path.join(result_path,wildcards.sample,'PCA','PCA_{}_{}_var.csv'.format(config["pca"]["svd_solver"],config["pca"]["n_components"]))
