@pinin4fjords (Member) commented Oct 6, 2025

Overview

This PR ports the DIA-NN workflow from bigbio/quantms as an nf-core subworkflow (dia_proteomics_analysis), refactored to align with nf-core conventions and best practices.

Motivation

The original quantms implementation used multiple separate DIA-NN modules with different hardcoded configurations and extensive conditional logic. This PR consolidates and modernizes the workflow architecture to be more maintainable, flexible, and idiomatic to nf-core standards.

Key Changes

Architecture improvements

  • Single unified DIA-NN module: Replaced multiple configuration-specific modules with one flexible module. The differences between quantms modules were purely configuration-based, not functional.

  • Separation of concerns: Moved custom pipeline logic from module-level scripts into subworkflow Groovy, keeping modules focused on tool execution.

  • Parameter handling: The subworkflow no longer references params directly. Parameters required for workflow logic are explicitly passed as subworkflow inputs, while tool-specific parameters can be supplied via standard Nextflow config files (see the config sketch after this list).
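
A config sketch of how that split might look from the pipeline side; the process selector and flags below are placeholders, not the subworkflow's actual configuration:

```nextflow
// conf/modules.config (sketch): tool-specific options travel via ext.args,
// so the subworkflow itself never needs to read params.
process {
    withName: 'DIANN_FINAL_QUANT' {               // placeholder process alias
        ext.args = '--qvalue 0.01 --matrices'     // placeholder DIA-NN flags
    }
}
```

Parameters that drive workflow logic (for example, whether a pre-generated library is provided) are instead passed explicitly when the subworkflow is called.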

Data flow improvements

  • Enhanced metadata propagation: Comprehensive use of meta maps throughout the workflow enables better provenance tracking and allows more complex experimental designs.

  • Channel-driven conditional execution: Replaced explicit conditionals with Nextflow channel operators (.ifEmpty(null), .filter(), .combine()), making the data flow more declarative and the workflow logic more transparent (a sketch of the pattern follows this list).

  • Support for iteration: The workflow is designed to handle multiple inputs (e.g., multiple FASTA databases, experimental designs) rather than assuming single global inputs, enabling more flexible experimental setups.
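
A minimal sketch of that pattern, assuming hypothetical channel, process and output names (the subworkflow's real wiring differs in detail):

```nextflow
// If no pre-generated spectral library is supplied, the channel is empty;
// ifEmpty(null) injects a single null marker that the filters can test.
ch_speclib_opt = ch_speclib.ifEmpty(null)

// In-silico library prediction runs only when no library was provided
DIANN_LIBGEN(
    ch_fasta
        .combine(ch_speclib_opt.filter { it == null })
        .map { meta, fasta, marker -> [ meta, fasta ] }
)

// Downstream steps consume whichever library exists
ch_library = ch_speclib_opt
    .filter { it != null }
    .mix(DIANN_LIBGEN.out.library)
```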

Specific technical improvements

  • Pre-generated inputs support: Users can optionally supply pre-generated spectral libraries or empirical libraries to skip computationally expensive steps.

  • Mass accuracy extraction: Automatic extraction of empirical mass accuracy from DIA-NN preliminary logs to inform final quantification parameters (illustrated after this list).

  • Comprehensive outputs: The workflow emits multiple output formats (TSV, Parquet, mzTab, MSstats-compatible, Triqler-compatible) to support diverse downstream analysis needs.
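
The extraction step could, in rough terms, look like the Groovy helper below; the exact wording of the DIA-NN log lines is an assumption here and should be checked against the module's real parsing code:

```nextflow
// Sketch: pull recommended mass accuracies out of a DIA-NN preliminary log.
def extractMassAccuracy(logFile) {
    def acc = [:]
    logFile.readLines().each { line ->
        def ms1 = (line =~ /Recommended MS1 mass accuracy setting: ([\d.]+) ppm/)
        if (ms1) { acc.ms1_accuracy = ms1[0][1] }
        def ms2 = (line =~ /Optimised mass accuracy: ([\d.]+) ppm/)
        if (ms2) { acc.mass_accuracy = ms2[0][1] }
    }
    return acc
}

// In the subworkflow, the values would then be merged into the meta map, e.g.
// DIANN_PRELIMINARY.out.log.map { meta, log -> [ meta + extractMassAccuracy(log), log ] }
```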

Pipeline stages

The subworkflow implements the complete DIA-NN analysis pipeline (a minimal invocation sketch follows the stage list):

  1. Configuration generation (quantmsutils/dianncfg): Generate DIA-NN config files based on experimental design
  2. In-silico library generation (diann): Predict spectral library from FASTA (skippable if pre-generated library provided)
  3. Preliminary analysis (diann): Initial pass to generate empirical library (skippable if pre-generated empirical library provided)
  4. Mass accuracy extraction: Parse DIA-NN logs to extract optimal mass accuracy settings
  5. Individual sample analysis (diann): Per-sample quantification using empirical library
  6. Final quantification (diann): Unified analysis across all samples
  7. Format conversion and statistics (quantmsutils/diann2mztab, quantmsutils/mzmlstatistics): Generate standard output formats and QC metrics
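
For orientation, calling the subworkflow could look something like the sketch below; the input channel names, ordering and cardinality are assumptions, and the subworkflow's main.nf and meta.yml are authoritative:

```nextflow
include { DIA_PROTEOMICS_ANALYSIS } from './subworkflows/nf-core/dia_proteomics_analysis/main'

workflow {
    // Placeholder channel shapes: [ meta, file ] tuples as used throughout nf-core
    ch_ms_files  = Channel.fromPath(params.input).map     { f -> [ [ id: f.baseName ], f ] }
    ch_fasta     = Channel.fromPath(params.fasta).map     { f -> [ [ id: f.baseName ], f ] }
    ch_expdesign = Channel.fromPath(params.expdesign).map { f -> [ [ id: f.baseName ], f ] }

    DIA_PROTEOMICS_ANALYSIS ( ch_ms_files, ch_fasta, ch_expdesign )
}
```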

Modules included

  • modules/nf-core/diann/: Universal DIA-NN wrapper
  • modules/nf-core/quantmsutils/dianncfg/: DIA-NN configuration file generation
  • modules/nf-core/quantmsutils/diann2mztab/: Convert DIA-NN output to mzTab format
  • modules/nf-core/quantmsutils/mzmlstatistics/: Extract mzML file statistics
  • subworkflows/nf-core/dia_proteomics_analysis/: Main DIA analysis orchestration

Testing

  • All modules and the subworkflow include nf-test suites with appropriate test data
  • Tests use snapshot comparisons for reproducible outputs (versions, CSV files) and existence checks for non-reproducible outputs (TSV, mzTab, Parquet files); a sketch of the pattern follows this list
  • Test data PR: Add more diann test data test-datasets#1756 (now merged)
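
The two assertion styles mentioned above, roughly, in nf-test syntax (channel names here are placeholders):

```groovy
then {
    assertAll(
        { assert workflow.success },
        // reproducible outputs: snapshot their content
        { assert snapshot(workflow.out.versions).match() },
        // non-reproducible outputs (timestamps, float noise): existence check only
        { assert file(workflow.out.mztab.get(0).get(1)).exists() }
    )
}
```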

Known limitations

The MSstats LFQ analysis step from quantms dia.nf is not yet ported, as it requires custom R scripts that need refactoring into a proper nf-core module. This will be addressed in a follow-up PR.

Reviewer notes

  • The conditional execution pattern using .ifEmpty(null).filter() is intentional and documented inline; it allows downstream processes to execute only when pre-generated inputs are not provided
  • The large multiMap block (lines ~155-161 in main.nf) generates all required entity combinations upfront and creates multiple channel views at different metadata granularities; this is documented with detailed comments (a simplified sketch follows this list)
  • Module meta.yml files follow nf-core schema for both modules and subworkflows, with proper tuple structures and output definitions
  • Tests skip Conda for quantmsutils modules (added to .github/skip_nf_test.json) as these tools are not available via Conda
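
A much-simplified sketch of the multiMap idea (not the actual block from main.nf; field and channel names are invented):

```nextflow
ch_combinations
    .multiMap { meta, ms_file, fasta, library ->
        per_sample:     [ meta, ms_file ]                           // per-sample quantification
        per_experiment: [ meta.subMap(['experiment_id']), library ] // experiment-level final pass
        fasta:          [ meta.subMap(['experiment_id']), fasta ]
    }
    .set { ch_views }

// ch_views.per_sample, ch_views.per_experiment and ch_views.fasta then feed
// different processes without re-deriving the combinations downstream.
```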

PR checklist

Closes #XXX

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the module conventions in the contribution docs
  • If necessary, include test data in your PR.
  • Remove all TODO statements.
  • Emit the versions.yml file.
  • Follow the naming conventions.
  • Follow the parameters requirements.
  • Follow the input/output options guidelines.
  • Add a resource label
  • Use BioConda and BioContainers if possible to fulfil software requirements.
  • Ensure that the test works with either Docker / Singularity. Conda CI tests can be quite flaky:
    • For modules:
      • nf-core modules test <MODULE> --profile docker
      • nf-core modules test <MODULE> --profile singularity
      • nf-core modules test <MODULE> --profile conda
    • For subworkflows:
      • nf-core subworkflows test <SUBWORKFLOW> --profile docker
      • nf-core subworkflows test <SUBWORKFLOW> --profile singularity
      • nf-core subworkflows test <SUBWORKFLOW> --profile conda

@jonasscheid (Contributor) commented:

FYI I'm going to review. Hope to be finished tomorrow

@jonasscheid (Contributor) left a comment

Only small comments. Overall very nicely done and commented!

@jonasscheid (Contributor) commented:

@pinin4fjords are you aware that DIA-NN moved from open-source to license-restricted access beyond version 1.8.2? It's still free for academic use, but container distribution is not allowed in newer versions (latest version is 2.2.0 with quite some improvements..). So my general question would be what the version update strategy of this nf-core subworkflow is? Don't get me wrong, it's nice that this subworkflow is part of nf-core, but we need to think about updating to future versions eventually.

@pinin4fjords (Member, Author) commented:

> @pinin4fjords are you aware that DIA-NN moved from open-source to license-restricted access beyond version 1.8.2? It's still free for academic use, but container distribution is not allowed in newer versions (latest version is 2.2.0 with quite some improvements..). So my general question would be what the version update strategy of this nf-core subworkflow is? Don't get me wrong, it's nice that this subworkflow is part of nf-core, but we need to think about updating to future versions eventually.

Hi @jonasscheid, thanks for the comments, I will address them today. For complete clarity, this is contract work done at the request of @FelixKrueger.

In terms of updates, yes, we're aware of the licensing issues. It will be relatively trivial for folks to use configuration to override container directives and ext.args to suit newer versions, which covers part of the answer. But where a newer version would require larger-scale changes to the module/subworkflow structure, I don't have a good answer.
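
As a rough illustration of the override route (the selector pattern and image name are placeholders for whatever a user builds and hosts themselves):

```nextflow
// custom.config applied with -c: point the DIA-NN processes at a self-built image
process {
    withName: '.*:DIA_PROTEOMICS_ANALYSIS:DIANN.*' {
        container = 'registry.example.org/yourlab/diann:2.2.0'
        // ext.args can carry flags that only exist in the newer release
    }
}
```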

pinin4fjords and others added 8 commits October 17, 2025 13:05
Per DIA-NN author recommendation (bigbio/quantms#481), --rt-profiling
should be used during in-silico library generation to enable retention
time profiling, improving downstream empirical library quality.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

Improved documentation to explain the two input modes:
- ms_files: Actual file paths when DIA-NN needs raw MS data
- ms_file_names: Just basenames used with --use-quant to match against
  preprocessed .quant files, avoiding unnecessary file staging

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

Updated meta.yml to note that Thermo RAW files are only supported
on Linux with DIA-NN 2.0+. Older versions require conversion to
mzML or .d format first.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

Updated all prefix-based output file keys to use ${prefix} notation
instead of wildcards to match the module's actual output patterns.
Pattern fields still use wildcards for flexibility.

Fixes linter error: correct_meta_outputs

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

Added standard when condition that was removed during conda support
changes. Fixes linter error: when_exist

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

Adds automatic conversion of Thermo RAW files to mzML format before analysis:
- Branch input files to detect .raw extensions
- Convert RAW files using ThermoRawFileParser module
- Reconstruct input channel with converted mzML files
- Configure THERMORAWFILEPARSER to output mzML format (--format=1)
- Update meta.yml to document RAW file support

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
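
The RAW-to-mzML handling described in that last commit would, in outline, branch inputs by extension and mix the converted files back in (a sketch only; channel and emission names are assumptions):

```nextflow
ch_ms_files
    .branch { meta, msfile ->
        raw:  msfile.name.toLowerCase().endsWith('.raw')
        mzml: true
    }
    .set { ch_branched }

// THERMORAWFILEPARSER configured elsewhere with --format=1 to emit mzML
THERMORAWFILEPARSER ( ch_branched.raw )

ch_ready = ch_branched.mzml.mix( THERMORAWFILEPARSER.out.spectra )
```
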
@pinin4fjords pinin4fjords force-pushed the add_diann_subworkflow_modules branch from 3033adc to 9492801 Compare October 17, 2025 12:50
@pinin4fjords pinin4fjords requested a review from DongzeHE October 20, 2025 16:18
@DongzeHE (Member) left a comment

LGTM

@jonasscheid (Contributor) commented Oct 21, 2025

> LGTM

I hope DIA-NN implements some sort of authentication mechanism (tokens) in the future that could be passed via an environment variable. Then we could use the most recent versions 🙏🏼

@vdemichev are there any plans towards that?

@vdemichev commented:

Hi All,

A general comment: since 2.0 we keep all inputs/outputs stable & backward compatible, i.e. even .quant files are backward compatible. So in the future it's very likely that newer versions will work without changing any code in the pipeline. That said, if one wishes to implement InfinDIA, this will probably result in very minor changes (it creates '.pre.quant' files instead of '.quant' files; it also does not require a spectral library as input), but this is something separate that should exist in addition to the main workflow, just for super large search spaces.

@jonasscheid, you mean like AI models? :)

Best,
Vadim

@jonasscheid (Contributor) commented:

Thanks for the quick response @vdemichev, and good to know that input/output structures will remain quite stable.

> @jonasscheid, you mean like AI models? :)

No, by token I meant some sort of license key that can be purchased for DIA-NN (or a free license key for academic users). In the grand scheme of things, that would enable the distribution of DIA-NN containers (and eventually Conda packages) that could be used in workflows like this one.

@vdemichev commented:

Well, this is exactly how the Enterprise version works :) But redistribution is an issue regardless, so we don't enable it. I would think it's OK to just always have the pipeline download DIA-NN & create a container on the fly? This apparently is not possible in Galaxy, but here it should work just fine?

@pinin4fjords (Member, Author) commented:

> Well, this is exactly how the Enterprise version works :) But redistribution is an issue regardless, so we don't enable it. I would think it's OK to just always have the pipeline download DIA-NN & create a container on the fly? This apparently is not possible in Galaxy, but here it should work just fine?

That's not how things work right now, unfortunately. For the moment, upgrades will have to happen via folks building their containers externally and then overriding the container directives.

We have other commercial software here that works via license keys. The software won't work without them, but you can still download the containers etc. Sentieon is an example: https://github.com/nf-core/modules/blob/master/modules/nf-core/sentieon/bwamem/main.nf. Maybe a model worth considering. I think Sentieon provides a license key to allow the CI testing.

@pinin4fjords pinin4fjords added this pull request to the merge queue Oct 21, 2025
Merged via the queue into master with commit 310549b Oct 21, 2025
70 checks passed
@pinin4fjords pinin4fjords deleted the add_diann_subworkflow_modules branch October 21, 2025 08:45
@vdemichev commented:

> Not how things work right now, unfortunately.

But that's how it works for quantms, and all seems to be OK with it?

> Maybe a model worth considering

Will not work in our case unfortunately.

@pinin4fjords (Member, Author) commented:

>> Not how things work right now, unfortunately.

> But that's how it works for quantms, and all seems to be OK with it?

This nf-core repo has more structured requirements; we have a standardised approach to container provisioning etc., and I don't think this will fit, though I'll ask around.

vagkaratzas pushed a commit that referenced this pull request Oct 24, 2025
* Add initial subworkflow working copy

* Actually, we only need a single diann module

* msstats module will be a bit more work- defer until later

* Fix mzmlstatistics test

* Fix dianncfg test

* Fix diann2mztab test

* Revert testing change

* Tidy up

* Try to address linting issues

* more linting fixes

* Hopefully last subworkflow meta fix

* None of that Conda thanks

* Misc fixes

* Remove another cpus directive

* Comment clarification

* Remove unsnapshottable things

* restore test data path

* Replace placeholder string checks with empty list checks in diann module

Changed checks from `.name != 'NO_...'` pattern to `!= []` for:
- fasta input (NO_FASTA_FILE)
- library input (NO_LIB_FILE)
- quant directory (NO_QUANT_DIR)

This makes the code cleaner and more idiomatic for Nextflow.

Addresses PR review comment from @jonasscheid

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

* Correct outputs, prefixing

* Subworkflow-level fixes

* Fix meta ymls

* clarify in meta.yml

* misc module fixes

* Fix DIANN module to automatically add --use-quant with --temp

- Module now automatically adds --use-quant when quant files are provided
- --temp and --use-quant must be used together per DIA-NN requirements
- Removed --use-quant from subworkflow configs (handled by module)
- Combined logic into single quant_args variable for clarity

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

* Update DIANN meta.yml: clarify quant parameter is optional

The quant parameter enables --use-quant mode for performance optimization
but is not strictly required. DIA-NN can still process files via --f without
quant files.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

* Add conda support to quantmsutils modules

- Add environment.yml files for all three quantmsutils modules
- Add conda directive to module main.nf files
- Remove conda error checks from script and stub sections
- Remove quantmsutils modules from conda skip list in .github/skip_nf_test.json
- Modules now support conda/mamba profiles

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

* Clarify dianncfg only supports Unimod modifications

Updated meta.yml to specify that the module only handles Unimod
modifications. Custom modifications must be passed directly to DIANN.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

* Add --rt-profiling to in-silico library generation

Per DIA-NN author recommendation (bigbio/quantms#481), --rt-profiling
should be used during in-silico library generation to enable retention
time profiling, improving downstream empirical library quality.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

* Clarify ms_files vs ms_file_names usage in DIANN module

Improved documentation to explain the two input modes:
- ms_files: Actual file paths when DIA-NN needs raw MS data
- ms_file_names: Just basenames used with --use-quant to match against
  preprocessed .quant files, avoiding unnecessary file staging

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

* Clarify RAW file support is version-specific in DIANN

Updated meta.yml to note that Thermo RAW files are only supported
on Linux with DIA-NN 2.0+. Older versions require conversion to
mzML or .d format first.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

* Fix DIANN meta.yml output patterns to use ${prefix}

Updated all prefix-based output file keys to use ${prefix} notation
instead of wildcards to match the module's actual output patterns.
Pattern fields still use wildcards for flexibility.

Fixes linter error: correct_meta_outputs

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

* Add missing when condition to quantmsutils/mzmlstatistics

Added standard when condition that was removed during conda support
changes. Fixes linter error: when_exist

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

* Add ThermoRawFileParser support to DIA proteomics subworkflow

Adds automatic conversion of Thermo RAW files to mzML format before analysis:
- Branch input files to detect .raw extensions
- Convert RAW files using ThermoRawFileParser module
- Reconstruct input channel with converted mzML files
- Configure THERMORAWFILEPARSER to output mzML format (--format=1)
- Update meta.yml to document RAW file support

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

* Revert "Add ThermoRawFileParser support to DIA proteomics subworkflow"

This reverts commit 9492801.

* Revert parquet in test

---------

Co-authored-by: Claude <[email protected]>