Add diann subworkflow and module #9173
…dules into add_diann_subworkflow_modules
FYI, I'm going to review. Hope to be finished tomorrow.
Only small comments. Overall very nicely done and commented!
@pinin4fjords are you aware that DIA-NN moved from open-source to license-restricted access beyond version 1.8.2? It's still free for academic use, but container distribution is not allowed in newer versions (the latest version is 2.2.0, with quite a few improvements). So my general question would be: what is the version update strategy of this nf-core subworkflow? Don't get me wrong, it's nice that this subworkflow is part of nf-core, but we need to think about updating to future versions eventually.
Hi @jonasscheid, thanks for the comments, will address them today. For complete clarity, this is contract work done at the request of @FelixKrueger. In terms of updates, yes, we're aware of the licensing issues. It will be relatively trivial for folks to use configuration to override
Per DIA-NN author recommendation (bigbio/quantms#481), --rt-profiling should be used during in-silico library generation to enable retention time profiling, improving downstream empirical library quality. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
Improved documentation to explain the two input modes: - ms_files: Actual file paths when DIA-NN needs raw MS data - ms_file_names: Just basenames used with --use-quant to match against preprocessed .quant files, avoiding unnecessary file staging 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
Updated meta.yml to note that Thermo RAW files are only supported on Linux with DIA-NN 2.0+. Older versions require conversion to mzML or .d format first. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
Updated all prefix-based output file keys to use ${prefix} notation instead of wildcards to match the module's actual output patterns. Pattern fields still use wildcards for flexibility. Fixes linter error: correct_meta_outputs 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
…dules into add_diann_subworkflow_modules
Added standard when condition that was removed during conda support changes. Fixes linter error: when_exist 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
Adds automatic conversion of Thermo RAW files to mzML format before analysis: - Branch input files to detect .raw extensions - Convert RAW files using ThermoRawFileParser module - Reconstruct input channel with converted mzML files - Configure THERMORAWFILEPARSER to output mzML format (--format=1) - Update meta.yml to document RAW file support 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
3033adc
to
9492801
Compare
LGTM
I hope DIA-NN implements some sort of authentication mechanism (tokens) in the future that could be passed by env variable. Then we could use the most recent versions 🙏🏼 @vdemichev are there any plans towards that?
Hi All, A general comment: since 2.0 we keep all inputs/outputs stable & backward compatible, i.e. even .quant files are backward compatible. So in the future it's very likely that newer versions will work without changing any code in the pipeline. That said, if one wishes to implement InfinDIA, this will probably result in very minor changes (it creates '.pre.quant' files instead of '.quant' files; it also does not require a spectral library as input), but this is something separate that should exist in addition to the main workflow, just for super large search spaces. @jonasscheid, you mean like AI models? :) Best,
Thanks for the quick response @vdemichev and good to know that input output structures will remain quite stable.
No, by token I meant some sort of license key that can be purchased for DIA-NN (or a free license key for academic users). In the grand scheme of things, that would enable the distribution of DIA-NN containers (and eventually conda) that could be used in workflows like this one.
Well, this is exactly how the Enterprise version works :) But redistribution is an issue regardless, so we don't enable it. I would think it's OK to just always have the pipeline download DIA-NN & create the container on the fly? This apparently is not possible in Galaxy, but here it should work just fine?
Not how things work right now, unfortunately. Upgrades right now will have to work via folks building their containers externally, and then they'd override the
We have other commercial software here, which works via license keys. The software won't work without them, but you can still download the containers etc. Sentieon is an example: https://github.com/nf-core/modules/blob/master/modules/nf-core/sentieon/bwamem/main.nf. Maybe a model worth considering. I think Sentieon provide a license key to allow the CI testing.
But that's how it works for quantms, seems all is OK with it?
Will not work in our case unfortunately.
This nf-core repo has more structured requirements: we have a standardised approach to container provisioning etc., and I don't think this will fit, though I'll ask around.
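The override discussed in this thread could be sketched as a small Nextflow config, assuming it refers to the process `container` directive. The process selector `DIANN` and the image reference below are hypothetical placeholders, not values taken from this PR.

```groovy
// custom.config: a sketch, not the PR's actual configuration.
// Assumes the DIA-NN process is named DIANN and that you have built
// your own image from a DIA-NN release you are licensed to use.
process {
    withName: 'DIANN' {
        // Hypothetical image reference; replace with your own build.
        container = 'registry.example.org/my-lab/diann:2.2.0'
    }
}
```

A file like this would be supplied at launch with Nextflow's `-c custom.config` option, leaving the module code itself unchanged.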
* Add initial subworkflow working copy * Actually, we only need a single diann module * msstats module will be a bit more work- defer until later * Fix mzmlstatistics test * Fix dianncfg test * Fix diann2mztab test * Revert testing change * Tidy up * Try to address linting issues * more linting fixes * Hopefully last subworkflow meta fix * None of that Conda thanks * Misc fixes * Remove another cpus directive * Comment clarification * Remove unsnapshottable things * restore test data path * Replace placeholder string checks with empty list checks in diann module Changed checks from `.name != 'NO_...'` pattern to `!= []` for: - fasta input (NO_FASTA_FILE) - library input (NO_LIB_FILE) - quant directory (NO_QUANT_DIR) This makes the code cleaner and more idiomatic for Nextflow. Addresses PR review comment from @jonasscheid 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]> * Correct outputs, prefixing * Subworkflow-level fixes * Fix meta ymls * clarify in meta.yml * misc module fixes * Fix DIANN module to automatically add --use-quant with --temp - Module now automatically adds --use-quant when quant files are provided - --temp and --use-quant must be used together per DIA-NN requirements - Removed --use-quant from subworkflow configs (handled by module) - Combined logic into single quant_args variable for clarity 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]> * Update DIANN meta.yml: clarify quant parameter is optional The quant parameter enables --use-quant mode for performance optimization but is not strictly required. DIA-NN can still process files via --f without quant files. 
🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]> * Add conda support to quantmsutils modules - Add environment.yml files for all three quantmsutils modules - Add conda directive to module main.nf files - Remove conda error checks from script and stub sections - Remove quantmsutils modules from conda skip list in .github/skip_nf_test.json - Modules now support conda/mamba profiles 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]> * Clarify dianncfg only supports Unimod modifications Updated meta.yml to specify that the module only handles Unimod modifications. Custom modifications must be passed directly to DIANN. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]> * Add --rt-profiling to in-silico library generation Per DIA-NN author recommendation (bigbio/quantms#481), --rt-profiling should be used during in-silico library generation to enable retention time profiling, improving downstream empirical library quality. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]> * Clarify ms_files vs ms_file_names usage in DIANN module Improved documentation to explain the two input modes: - ms_files: Actual file paths when DIA-NN needs raw MS data - ms_file_names: Just basenames used with --use-quant to match against preprocessed .quant files, avoiding unnecessary file staging 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]> * Clarify RAW file support is version-specific in DIANN Updated meta.yml to note that Thermo RAW files are only supported on Linux with DIA-NN 2.0+. Older versions require conversion to mzML or .d format first. 
🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]> * Fix DIANN meta.yml output patterns to use ${prefix} Updated all prefix-based output file keys to use ${prefix} notation instead of wildcards to match the module's actual output patterns. Pattern fields still use wildcards for flexibility. Fixes linter error: correct_meta_outputs 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]> * Add missing when condition to quantmsutils/mzmlstatistics Added standard when condition that was removed during conda support changes. Fixes linter error: when_exist 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]> * Add ThermoRawFileParser support to DIA proteomics subworkflow Adds automatic conversion of Thermo RAW files to mzML format before analysis: - Branch input files to detect .raw extensions - Convert RAW files using ThermoRawFileParser module - Reconstruct input channel with converted mzML files - Configure THERMORAWFILEPARSER to output mzML format (--format=1) - Update meta.yml to document RAW file support 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]> * Revert "Add ThermoRawFileParser support to DIA proteomics subworkflow" This reverts commit 9492801. * Revert parquet in test --------- Co-authored-by: Claude <[email protected]>
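The two module-level changes described in the commit messages above (empty-list checks instead of placeholder file names, and always pairing `--temp` with `--use-quant`) can be illustrated with a plain Groovy sketch. The closure name `buildArgs` and the file names are hypothetical, and treating the quant input as the `--temp` directory is an assumption; the real module script may assemble its command differently.

```groovy
// Hypothetical helper mirroring the checks described in the commit messages.
// Optional inputs are passed as [] when they are not provided.
def buildArgs = { fasta, lib, quant ->
    def args = []
    if (fasta != []) args << "--fasta ${fasta}"
    if (lib != [])   args << "--lib ${lib}"
    // --temp and --use-quant are only ever emitted together.
    if (quant != []) args << "--temp ${quant} --use-quant"
    args.join(' ')
}

// Nothing supplied: no optional flags at all.
assert buildArgs([], [], []) == ''
// FASTA plus a quant directory: both quant flags appear as a pair.
assert buildArgs('db.fasta', [], 'quant') == '--fasta db.fasta --temp quant --use-quant'
```

Keeping the pairing in one place means a caller cannot request `--use-quant` without also pointing DIA-NN at the directory holding the `.quant` files.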
Overview
This PR ports the DIA-NN workflow from bigbio/quantms as an nf-core subworkflow (`dia_proteomics_analysis`), refactored to align with nf-core conventions and best practices.
Motivation
The original quantms implementation used multiple separate DIA-NN modules with different hardcoded configurations and extensive conditional logic. This PR consolidates and modernizes the workflow architecture to be more maintainable, flexible, and idiomatic to nf-core standards.
Key Changes
Architecture improvements
- Single unified DIA-NN module: Replaced multiple configuration-specific modules with one flexible module. The differences between quantms modules were purely configuration-based, not functional.
- Separation of concerns: Moved custom pipeline logic from module-level scripts into subworkflow Groovy, keeping modules focused on tool execution.
- Parameter handling: The subworkflow no longer directly references `params`. Parameters required for workflow logic are explicitly passed as subworkflow inputs, while tool-specific parameters can be supplied via standard Nextflow config files.
Data flow improvements
- Enhanced metadata propagation: Comprehensive use of meta maps throughout the workflow enables better provenance tracking and allows more complex experimental designs.
- Channel-driven conditional execution: Replaced explicit conditionals with Nextflow channel operators (`.ifEmpty(null)`, `.filter()`, `.combine()`), making the data flow more declarative and the workflow logic more transparent.
- Support for iteration: The workflow is designed to handle multiple inputs (e.g., multiple FASTA databases, experimental designs) rather than assuming single global inputs, enabling more flexible experimental setups.
Specific technical improvements
- Pre-generated inputs support: Users can optionally supply pre-generated spectral libraries or empirical libraries to skip computationally expensive steps.
- Mass accuracy extraction: Automatic extraction of empirical mass accuracy from DIA-NN preliminary logs to inform final quantification parameters.
- Comprehensive outputs: The workflow emits multiple output formats (TSV, Parquet, mzTab, MSstats-compatible, Triqler-compatible) to support diverse downstream analysis needs.
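As a sketch of the parameter handling described under the architecture improvements, tool-specific flags could be supplied through `ext.args` in a Nextflow config. This assumes the module reads `task.ext.args` as nf-core modules conventionally do; the selector below is a hypothetical guess at how the process is named inside the subworkflow, and `--rt-profiling` is the flag mentioned in this PR's commit history.

```groovy
// modules.config: a minimal sketch under the assumptions stated above.
process {
    // Hypothetical selector; adjust it to the actual process name or alias
    // used for the in-silico library generation step.
    withName: '.*:DIA_PROTEOMICS_ANALYSIS:DIANN' {
        // Extra command-line flags passed through to the DIA-NN call.
        ext.args = '--rt-profiling'
    }
}
```

Note that a selector this broad would apply the flag to every matching DIA-NN invocation, so a real configuration would likely target only the library-generation step.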
Pipeline stages
The subworkflow implements the complete DIA-NN analysis pipeline:
- `quantmsutils/dianncfg`: Generate DIA-NN config files based on experimental design
- `diann`: Predict spectral library from FASTA (skippable if pre-generated library provided)
- `diann`: Initial pass to generate empirical library (skippable if pre-generated empirical library provided)
- `diann`: Per-sample quantification using empirical library
- `diann`: Unified analysis across all samples
- `quantmsutils/diann2mztab`, `quantmsutils/mzmlstatistics`: Generate standard output formats and QC metrics
Modules included
- `modules/nf-core/diann/`: Universal DIA-NN wrapper
- `modules/nf-core/quantmsutils/dianncfg/`: DIA-NN configuration file generation
- `modules/nf-core/quantmsutils/diann2mztab/`: Convert DIA-NN output to mzTab format
- `modules/nf-core/quantmsutils/mzmlstatistics/`: Extract mzML file statistics
- `subworkflows/nf-core/dia_proteomics_analysis/`: Main DIA analysis orchestration
Testing
Known limitations
The MSstats LFQ analysis step from quantms dia.nf is not yet ported, as it requires custom R scripts that need refactoring into a proper nf-core module. This will be addressed in a follow-up PR.
Reviewer notes
- `.ifEmpty(null).filter()` is intentional and documented inline - this allows downstream processes to execute only when pre-generated inputs are not provided
- `multiMap` block (lines ~155-161 in `main.nf`) generates all required entity combinations upfront and creates multiple channel views at different metadata granularities - this is documented with detailed comments
- (`.github/skip_nf_test.json`) as these tools are not available via Conda
PR checklist
- Closes #XXX
- `versions.yml` file.
- `label`
- `nf-core modules test <MODULE> --profile docker`
- `nf-core modules test <MODULE> --profile singularity`
- `nf-core modules test <MODULE> --profile conda`
- `nf-core subworkflows test <SUBWORKFLOW> --profile docker`
- `nf-core subworkflows test <SUBWORKFLOW> --profile singularity`
- `nf-core subworkflows test <SUBWORKFLOW> --profile conda`