@pinin4fjords (Member) commented Oct 6, 2025

Overview

This PR ports the DIA-NN workflow from bigbio/quantms as an nf-core subworkflow (dia_proteomics_analysis), refactored to align with nf-core conventions and best practices.

Motivation

The original quantms implementation used multiple separate DIA-NN modules with different hardcoded configurations and extensive conditional logic. This PR consolidates and modernizes the workflow architecture to be more maintainable, flexible, and idiomatic to nf-core standards.

Key Changes

Architecture improvements

  • Single unified DIA-NN module: Replaced multiple configuration-specific modules with one flexible module. The differences between quantms modules were purely configuration-based, not functional.

  • Separation of concerns: Moved custom pipeline logic from module-level scripts into subworkflow Groovy, keeping modules focused on tool execution.

  • Parameter handling: The subworkflow no longer references params directly. Parameters required for workflow logic are explicitly passed as subworkflow inputs, while tool-specific parameters can be supplied via standard Nextflow config files (see the config sketch after this list).
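
A config sketch of how that split might look from the pipeline side; the process selector and flags below are placeholders, not the subworkflow's actual configuration:

```nextflow
// conf/modules.config (sketch): tool-specific options travel via ext.args,
// so the subworkflow itself never needs to read params.
process {
    withName: 'DIANN_FINAL_QUANT' {               // placeholder process alias
        ext.args = '--qvalue 0.01 --matrices'     // placeholder DIA-NN flags
    }
}
```

Parameters that drive workflow logic (for example, whether a pre-generated library is provided) are instead passed explicitly when the subworkflow is called.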

Data flow improvements

  • Enhanced metadata propagation: Comprehensive use of meta maps throughout the workflow enables better provenance tracking and allows more complex experimental designs.

  • Channel-driven conditional execution: Replaced explicit conditionals with Nextflow channel operators (.ifEmpty(null), .filter(), .combine()), making the data flow more declarative and the workflow logic more transparent (a sketch of the pattern follows this list).

  • Support for iteration: The workflow is designed to handle multiple inputs (e.g., multiple FASTA databases, experimental designs) rather than assuming single global inputs, enabling more flexible experimental setups.
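
A minimal sketch of that pattern, assuming hypothetical channel, process and output names (the subworkflow's real wiring differs in detail):

```nextflow
// If no pre-generated spectral library is supplied, the channel is empty;
// ifEmpty(null) injects a single null marker that the filters can test.
ch_speclib_opt = ch_speclib.ifEmpty(null)

// In-silico library prediction runs only when no library was provided
DIANN_LIBGEN(
    ch_fasta
        .combine(ch_speclib_opt.filter { it == null })
        .map { meta, fasta, marker -> [ meta, fasta ] }
)

// Downstream steps consume whichever library exists
ch_library = ch_speclib_opt
    .filter { it != null }
    .mix(DIANN_LIBGEN.out.library)
```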

Specific technical improvements

  • Pre-generated inputs support: Users can optionally supply pre-generated spectral libraries or empirical libraries to skip computationally expensive steps.

  • Mass accuracy extraction: Automatic extraction of empirical mass accuracy from DIA-NN preliminary logs to inform final quantification parameters (illustrated after this list).

  • Comprehensive outputs: The workflow emits multiple output formats (TSV, Parquet, mzTab, MSstats-compatible, Triqler-compatible) to support diverse downstream analysis needs.
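
The extraction step could, in rough terms, look like the Groovy helper below; the exact wording of the DIA-NN log lines is an assumption here and should be checked against the module's real parsing code:

```nextflow
// Sketch: pull recommended mass accuracies out of a DIA-NN preliminary log.
def extractMassAccuracy(logFile) {
    def acc = [:]
    logFile.readLines().each { line ->
        def ms1 = (line =~ /Recommended MS1 mass accuracy setting: ([\d.]+) ppm/)
        if (ms1) { acc.ms1_accuracy = ms1[0][1] }
        def ms2 = (line =~ /Optimised mass accuracy: ([\d.]+) ppm/)
        if (ms2) { acc.mass_accuracy = ms2[0][1] }
    }
    return acc
}

// In the subworkflow, the values would then be merged into the meta map, e.g.
// DIANN_PRELIMINARY.out.log.map { meta, log -> [ meta + extractMassAccuracy(log), log ] }
```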

Pipeline stages

The subworkflow implements the complete DIA-NN analysis pipeline (a minimal invocation sketch follows the stage list):

  1. Configuration generation (quantmsutils/dianncfg): Generate DIA-NN config files based on experimental design
  2. In-silico library generation (diann): Predict spectral library from FASTA (skippable if pre-generated library provided)
  3. Preliminary analysis (diann): Initial pass to generate empirical library (skippable if pre-generated empirical library provided)
  4. Mass accuracy extraction: Parse DIA-NN logs to extract optimal mass accuracy settings
  5. Individual sample analysis (diann): Per-sample quantification using empirical library
  6. Final quantification (diann): Unified analysis across all samples
  7. Format conversion and statistics (quantmsutils/diann2mztab, quantmsutils/mzmlstatistics): Generate standard output formats and QC metrics
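
For orientation, calling the subworkflow could look something like the sketch below; the input channel names, ordering and cardinality are assumptions, and the subworkflow's main.nf and meta.yml are authoritative:

```nextflow
include { DIA_PROTEOMICS_ANALYSIS } from './subworkflows/nf-core/dia_proteomics_analysis/main'

workflow {
    // Placeholder channel shapes: [ meta, file ] tuples as used throughout nf-core
    ch_ms_files  = Channel.fromPath(params.input).map     { f -> [ [ id: f.baseName ], f ] }
    ch_fasta     = Channel.fromPath(params.fasta).map     { f -> [ [ id: f.baseName ], f ] }
    ch_expdesign = Channel.fromPath(params.expdesign).map { f -> [ [ id: f.baseName ], f ] }

    DIA_PROTEOMICS_ANALYSIS ( ch_ms_files, ch_fasta, ch_expdesign )
}
```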

Modules included

  • modules/nf-core/diann/: Universal DIA-NN wrapper
  • modules/nf-core/quantmsutils/dianncfg/: DIA-NN configuration file generation
  • modules/nf-core/quantmsutils/diann2mztab/: Convert DIA-NN output to mzTab format
  • modules/nf-core/quantmsutils/mzmlstatistics/: Extract mzML file statistics
  • subworkflows/nf-core/dia_proteomics_analysis/: Main DIA analysis orchestration

Testing

  • All modules and the subworkflow include nf-test suites with appropriate test data
  • Tests use snapshot comparisons for reproducible outputs (versions, CSV files) and existence checks for non-reproducible outputs (TSV, mzTab, Parquet files); a sketch of the pattern follows this list
  • Test data PR: Add more diann test data test-datasets#1756 (now merged)
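
The two assertion styles mentioned above, roughly, in nf-test syntax (channel names here are placeholders):

```groovy
then {
    assertAll(
        { assert workflow.success },
        // reproducible outputs: snapshot their content
        { assert snapshot(workflow.out.versions).match() },
        // non-reproducible outputs (timestamps, float noise): existence check only
        { assert file(workflow.out.mztab.get(0).get(1)).exists() }
    )
}
```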

Known limitations

The MSstats LFQ analysis step from quantms dia.nf is not yet ported, as it requires custom R scripts that need refactoring into a proper nf-core module. This will be addressed in a follow-up PR.

Reviewer notes

  • The conditional execution pattern using .ifEmpty(null).filter() is intentional and documented inline; it allows downstream processes to execute only when pre-generated inputs are not provided
  • The large multiMap block (lines ~155-161 in main.nf) generates all required entity combinations upfront and creates multiple channel views at different metadata granularities; this is documented with detailed comments (a simplified sketch follows this list)
  • Module meta.yml files follow nf-core schema for both modules and subworkflows, with proper tuple structures and output definitions
  • Tests skip Conda for quantmsutils modules (added to .github/skip_nf_test.json) as these tools are not available via Conda
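
A much-simplified sketch of the multiMap idea (not the actual block from main.nf; field and channel names are invented):

```nextflow
ch_combinations
    .multiMap { meta, ms_file, fasta, library ->
        per_sample:     [ meta, ms_file ]                           // per-sample quantification
        per_experiment: [ meta.subMap(['experiment_id']), library ] // experiment-level final pass
        fasta:          [ meta.subMap(['experiment_id']), fasta ]
    }
    .set { ch_views }

// ch_views.per_sample, ch_views.per_experiment and ch_views.fasta then feed
// different processes without re-deriving the combinations downstream.
```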

PR checklist

Closes #XXX

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the module conventions in the contribution docs
  • If necessary, include test data in your PR.
  • Remove all TODO statements.
  • Emit the versions.yml file.
  • Follow the naming conventions.
  • Follow the parameters requirements.
  • Follow the input/output options guidelines.
  • Add a resource label
  • Use BioConda and BioContainers if possible to fulfil software requirements.
  • Ensure that the test works with either Docker / Singularity. Conda CI tests can be quite flaky:
    • For modules:
      • nf-core modules test <MODULE> --profile docker
      • nf-core modules test <MODULE> --profile singularity
      • nf-core modules test <MODULE> --profile conda
    • For subworkflows:
      • nf-core subworkflows test <SUBWORKFLOW> --profile docker
      • nf-core subworkflows test <SUBWORKFLOW> --profile singularity
      • nf-core subworkflows test <SUBWORKFLOW> --profile conda

@jonasscheid (Contributor) commented:

FYI I'm going to review. Hope to be finished tomorrow

@jonasscheid (Contributor) left a comment

Only small comments. Overall very nicely done and commented!

@jonasscheid (Contributor) commented:

@pinin4fjords are you aware that DIA-NN moved from open-source to license-restricted access beyond version 1.8.2? It's still free for academic use, but container distribution is not allowed in newer versions (latest version is 2.2.0 with quite some improvements..). So my general question would be what the version update strategy of this nf-core subworkflow is? Don't get me wrong, it's nice that this subworkflow is part of nf-core, but we need to think about updating to future versions eventually.

@pinin4fjords (Member, Author) commented:

> @pinin4fjords are you aware that DIA-NN moved from open-source to license-restricted access beyond version 1.8.2? It's still free for academic use, but container distribution is not allowed in newer versions (latest version is 2.2.0 with quite some improvements..). So my general question would be what the version update strategy of this nf-core subworkflow is? Don't get me wrong, it's nice that this subworkflow is part of nf-core, but we need to think about updating to future versions eventually.

Hi @jonasscheid, thanks for the comments, I will address them today. For complete clarity, this is contract work done at the request of @FelixKrueger.

In terms of updates, yes, we're aware of the licensing issues. It will be relatively trivial for folks to use configuration to override container directives and ext.args to suit newer versions, which covers part of the answer. But where a newer version would require larger-scale changes to the module/subworkflow structure, I don't have a good answer.
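
As a rough illustration of the override route (the selector pattern and image name are placeholders for whatever a user builds and hosts themselves):

```nextflow
// custom.config applied with -c: point the DIA-NN processes at a self-built image
process {
    withName: '.*:DIA_PROTEOMICS_ANALYSIS:DIANN.*' {
        container = 'registry.example.org/yourlab/diann:2.2.0'
        // ext.args can carry flags that only exist in the newer release
    }
}
```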

pinin4fjords and others added 8 commits October 17, 2025 13:05
Per DIA-NN author recommendation (bigbio/quantms#481), --rt-profiling
should be used during in-silico library generation to enable retention
time profiling, improving downstream empirical library quality.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

Improved documentation to explain the two input modes:
- ms_files: Actual file paths when DIA-NN needs raw MS data
- ms_file_names: Just basenames used with --use-quant to match against
  preprocessed .quant files, avoiding unnecessary file staging

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

Updated meta.yml to note that Thermo RAW files are only supported
on Linux with DIA-NN 2.0+. Older versions require conversion to
mzML or .d format first.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

Updated all prefix-based output file keys to use ${prefix} notation
instead of wildcards to match the module's actual output patterns.
Pattern fields still use wildcards for flexibility.

Fixes linter error: correct_meta_outputs

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

Added standard when condition that was removed during conda support
changes. Fixes linter error: when_exist

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

Adds automatic conversion of Thermo RAW files to mzML format before analysis:
- Branch input files to detect .raw extensions
- Convert RAW files using ThermoRawFileParser module
- Reconstruct input channel with converted mzML files
- Configure THERMORAWFILEPARSER to output mzML format (--format=1)
- Update meta.yml to document RAW file support

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
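
The RAW-to-mzML handling described in that last commit would, in outline, branch inputs by extension and mix the converted files back in (a sketch only; channel and emission names are assumptions):

```nextflow
ch_ms_files
    .branch { meta, msfile ->
        raw:  msfile.name.toLowerCase().endsWith('.raw')
        mzml: true
    }
    .set { ch_branched }

// THERMORAWFILEPARSER configured elsewhere with --format=1 to emit mzML
THERMORAWFILEPARSER ( ch_branched.raw )

ch_ready = ch_branched.mzml.mix( THERMORAWFILEPARSER.out.spectra )
```
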
@pinin4fjords pinin4fjords force-pushed the add_diann_subworkflow_modules branch from 3033adc to 9492801 Compare October 17, 2025 12:50
@pinin4fjords pinin4fjords requested a review from DongzeHE October 20, 2025 16:18
@DongzeHE (Member) left a comment

LGTM

@jonasscheid (Contributor) commented Oct 21, 2025

> LGTM

I hope DIA-NN implements some sort of authentication mechanism (tokens) in the future that could be passed via an environment variable. Then we could use the most recent versions 🙏🏼

@vdemichev are there any plans towards that?

@vdemichev commented:

Hi All,

A general comment: since 2.0 we keep all inputs/outputs stable & backward compatible, i.e. even .quant files are backward compatible. So in the future it's very likely that newer versions will work without changing any code in the pipeline. That said, if one wishes to implement InfinDIA, this will probably result in very minor changes (it creates '.pre.quant' files instead of '.quant' files; it also does not require a spectral library as input), but this is something separate that should exist in addition to the main workflow, just for super large search spaces.

@jonasscheid, you mean like AI models? :)

Best,
Vadim

@jonasscheid (Contributor) commented:

Thanks for the quick response @vdemichev, and good to know that input/output structures will remain quite stable.

> @jonasscheid, you mean like AI models? :)

No, by token I meant some sort of license key that can be purchased for DIA-NN (or a free license key for academic users). In the grand scheme of things, that would enable the distribution of DIA-NN containers (and eventually Conda packages) that could be used in workflows like this one.

@vdemichev commented:

Well, this is exactly how the Enterprise version works :) But redistribution is an issue regardless, so we don't enable it. I would think it's OK to just always have the pipeline download DIA-NN & create a container on the fly? This apparently is not possible in Galaxy, but here it should work just fine?

@pinin4fjords (Member, Author) commented:

> Well, this is exactly how the Enterprise version works :) But redistribution is an issue regardless, so we don't enable it. I would think it's OK to just always have the pipeline download DIA-NN & create a container on the fly? This apparently is not possible in Galaxy, but here it should work just fine?

That's not how things work right now, unfortunately. For the moment, upgrades will have to happen via folks building their containers externally and then overriding the container directives.

We have other commercial software here that works via license keys. The software won't work without them, but you can still download the containers etc. Sentieon is an example: https://github.com/nf-core/modules/blob/master/modules/nf-core/sentieon/bwamem/main.nf. Maybe a model worth considering. I think Sentieon provides a license key to allow the CI testing.

@pinin4fjords pinin4fjords added this pull request to the merge queue Oct 21, 2025
Merged via the queue into master with commit 310549b Oct 21, 2025
70 checks passed
@pinin4fjords pinin4fjords deleted the add_diann_subworkflow_modules branch October 21, 2025 08:45
@vdemichev commented:

> Not how things work right now, unfortunately.

But that's how it works for quantms, and all seems to be OK with it?

> Maybe a model worth considering

Will not work in our case unfortunately.

@pinin4fjords (Member, Author) commented:

>> Not how things work right now, unfortunately.

> But that's how it works for quantms, and all seems to be OK with it?

This nf-core repo has more structured requirements; we have a standardised approach to container provisioning etc., and I don't think this will fit, though I'll ask around.

vagkaratzas pushed a commit that referenced this pull request Oct 24, 2025
* Add initial subworkflow working copy

* Actually, we only need a single diann module

* msstats module will be a bit more work- defer until later

* Fix mzmlstatistics test

* Fix dianncfg test

* Fix diann2mztab test

* Revert testing change

* Tidy up

* Try to address linting issues

* more linting fixes

* Hopefully last subworkflow meta fix

* None of that Conda thanks

* Misc fixes

* Remove another cpus directive

* Comment clarification

* Remove unsnapshottable things

* restore test data path

* Replace placeholder string checks with empty list checks in diann module

Changed checks from `.name != 'NO_...'` pattern to `!= []` for:
- fasta input (NO_FASTA_FILE)
- library input (NO_LIB_FILE)
- quant directory (NO_QUANT_DIR)

This makes the code cleaner and more idiomatic for Nextflow.

Addresses PR review comment from @jonasscheid

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

* Correct outputs, prefixing

* Subworkflow-level fixes

* Fix meta ymls

* clarify in meta.yml

* misc module fixes

* Fix DIANN module to automatically add --use-quant with --temp

- Module now automatically adds --use-quant when quant files are provided
- --temp and --use-quant must be used together per DIA-NN requirements
- Removed --use-quant from subworkflow configs (handled by module)
- Combined logic into single quant_args variable for clarity

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

* Update DIANN meta.yml: clarify quant parameter is optional

The quant parameter enables --use-quant mode for performance optimization
but is not strictly required. DIA-NN can still process files via --f without
quant files.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

* Add conda support to quantmsutils modules

- Add environment.yml files for all three quantmsutils modules
- Add conda directive to module main.nf files
- Remove conda error checks from script and stub sections
- Remove quantmsutils modules from conda skip list in .github/skip_nf_test.json
- Modules now support conda/mamba profiles

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

* Clarify dianncfg only supports Unimod modifications

Updated meta.yml to specify that the module only handles Unimod
modifications. Custom modifications must be passed directly to DIANN.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

* Add --rt-profiling to in-silico library generation

Per DIA-NN author recommendation (bigbio/quantms#481), --rt-profiling
should be used during in-silico library generation to enable retention
time profiling, improving downstream empirical library quality.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

* Clarify ms_files vs ms_file_names usage in DIANN module

Improved documentation to explain the two input modes:
- ms_files: Actual file paths when DIA-NN needs raw MS data
- ms_file_names: Just basenames used with --use-quant to match against
  preprocessed .quant files, avoiding unnecessary file staging

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

* Clarify RAW file support is version-specific in DIANN

Updated meta.yml to note that Thermo RAW files are only supported
on Linux with DIA-NN 2.0+. Older versions require conversion to
mzML or .d format first.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

* Fix DIANN meta.yml output patterns to use ${prefix}

Updated all prefix-based output file keys to use ${prefix} notation
instead of wildcards to match the module's actual output patterns.
Pattern fields still use wildcards for flexibility.

Fixes linter error: correct_meta_outputs

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

* Add missing when condition to quantmsutils/mzmlstatistics

Added standard when condition that was removed during conda support
changes. Fixes linter error: when_exist

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

* Add ThermoRawFileParser support to DIA proteomics subworkflow

Adds automatic conversion of Thermo RAW files to mzML format before analysis:
- Branch input files to detect .raw extensions
- Convert RAW files using ThermoRawFileParser module
- Reconstruct input channel with converted mzML files
- Configure THERMORAWFILEPARSER to output mzML format (--format=1)
- Update meta.yml to document RAW file support

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

* Revert "Add ThermoRawFileParser support to DIA proteomics subworkflow"

This reverts commit 9492801.

* Revert parquet in test

---------

Co-authored-by: Claude <[email protected]>