Skip to content

Conversation

marrip
Copy link
Contributor

@marrip marrip commented Jul 29, 2025

Hey guys!

After almost one year I finally could fix the issues we had with Dragen and try out the module. I reduced it a lot to only feature inputs and outputs relevant for germline analysis. I haven't added tests yet but wanted to discuss the current state of the module first. Let me know what you think.

Old PR: #6383

PR checklist

Closes #XXX

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the module conventions in the contribution docs
  • If necessary, include test data in your PR.
  • Remove all TODO statements.
  • Emit the versions.yml file.
  • Follow the naming conventions.
  • Follow the parameters requirements.
  • Follow the input/output options guidelines.
  • Add a resource label
  • Use BioConda and BioContainers if possible to fulfil software requirements.
  • Ensure that the test works with either Docker / Singularity. Conda CI tests can be quite flaky:
    • For modules:
      • nf-core modules test <MODULE> --profile docker
      • nf-core modules test <MODULE> --profile singularity
      • nf-core modules test <MODULE> --profile conda
    • For subworkflows:
      • nf-core subworkflows test <SUBWORKFLOW> --profile docker
      • nf-core subworkflows test <SUBWORKFLOW> --profile singularity
      • nf-core subworkflows test <SUBWORKFLOW> --profile conda

@marrip marrip requested review from SPPearce and luisas July 29, 2025 06:47
@marrip marrip self-assigned this Jul 29, 2025
@marrip marrip marked this pull request as draft July 29, 2025 06:48
@marrip marrip mentioned this pull request Jul 29, 2025
17 tasks
@famosab
Copy link
Contributor

famosab commented Jul 31, 2025

Thank you for that huge contribution 👀 I was thinking whether there would be a way to simplify this a little so that the module does not have as many input and outputs. But at the same time I think DRAGEN is aimed to be a variant-calling-pipeline itself so that might be hard.

I thought we could maybe split this into more submodules and then collect those into a DRAGEN subworkflow. But I am unsure whether that is something worth the work. We also have to check if all inputs are required by DRAGEN or if they could also be supplied with ext.args

tag "$meta.id"
label 'process_long'

// ATTENTION: No conda env or container image as Dragen requires specialized hardware to run
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would we make this compatible with nf-core pipelines then? 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like this: https://github.com/seqeralabs/nf-dragen?tab=readme-ov-file#pipeline-implementation
But not every module in nf-core/modules is necessarily used in nf-core pipelines.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so we tag it with the label dragen instead of process_long?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed it to process_dragen now

Copy link
Contributor

@SPPearce SPPearce left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are you not putting the meta.id on the later half of the outputs?

Comment on lines +29 to +36
- bam:
type: file
description: Input BAM file to send to card
pattern: "*.bam"
- cram:
type: file
description: Input CRAM file to send to card
pattern: "*.cram"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we make this one input and figure out based on extension which?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you want all sequence data containing inputs combined?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know, what else comes under that (fastq, bam, cram, anything else?)
Bam and cram are treated pretty much the same in many tools (subject to compliance).
Basically I don't really like modules with lots of input channels as they are quite hard to use as we don't have named inputs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I attempted to handle all in one in this commit: 1a26f3c

@SPPearce
Copy link
Contributor

SPPearce commented Aug 1, 2025

Thank you for that huge contribution 👀 I was thinking whether there would be a way to simplify this a little so that the module does not have as many input and outputs. But at the same time I think DRAGEN is aimed to be a variant-calling-pipeline itself so that might be hard.

I thought we could maybe split this into more submodules and then collect those into a DRAGEN subworkflow. But I am unsure whether that is something worth the work. We also have to check if all inputs are required by DRAGEN or if they could also be supplied with ext.args

I think this should just be one tool, as you say it is a variant calling pipeline in itself.

@marrip
Copy link
Contributor Author

marrip commented Aug 11, 2025

Thank you for that huge contribution 👀 I was thinking whether there would be a way to simplify this a little so that the module does not have as many input and outputs. But at the same time I think DRAGEN is aimed to be a variant-calling-pipeline itself so that might be hard.
I thought we could maybe split this into more submodules and then collect those into a DRAGEN subworkflow. But I am unsure whether that is something worth the work. We also have to check if all inputs are required by DRAGEN or if they could also be supplied with ext.args

I think this should just be one tool, as you say it is a variant calling pipeline in itself.

I agree, we already had a bigger version of this and agreed on reducing it to one use case. ☺️

Sorry for the late reply, I was on holiday.

Comment on lines +213 to +215
if (input.size() != 2) {
error "Error: a maximum of 2 input files is supported."
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Congrats on a huge job implementing this!

Just fyi, Dragen will accept multiple FASTQ files if you supply the first one in the matching list, e.g. these FASTQs:

input_S1_L001_R1_001.fastq.gz
input_S1_L001_R1_002.fastq.gz
input_S1_L001_R1_003.fastq.gz

Can be specified with the input arg -1 input_S1_L001_R1_001.fastq.gz

You can see it in my implementation here (which only supports FASTQ files): https://github.com/seqeralabs/nf-dragen/blob/master/modules/local/dragen.nf

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where do you get input_S1_L001_R1_002.fastq.gz from, I have never seen anything that isn't 001 at the end? Manual renaming?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, @adamrtalbot 🙏

About the fastq files. Just to clarify, the fastq files basically belong to the same sample and are from different lanes, for example? I also saw that dragen can take a list of fastq files as input but I am not sure if we want to open up for all these different options. That makes things even more complex. I am not against it, I just think the concatenation can be handled outside of the module as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where do you get input_S1_L001_R1_002.fastq.gz from, I have never seen anything that isn't 001 at the end? Manual renaming?

I don't know, it's in the Dragen docs: https://support-docs.illumina.com/SW/DRAGEN_v39/Content/SW/DRAGEN/Inputfiles_fDG.htm

Maybe there's a demultiplex setting for breaking up FASTQ files?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eurgh, link doesn't work. Pasted here for reference:

If using, bcl2fastq or the DRAGEN BCL command use the following common file naming convention:

S<#><segment#>.fastq.gz

Older versions of bcl2fastq and DRAGEN could segment FASTQ samples into multiple files to limit file size or to decrease the time to generate them.

For Example:

RDRS182520_S1_L001_R1_001.fastq.gz
RDRS182520_S1_L001_R1_002.fastq.gz
...
RDRS182520_S1_L001_R1_008.fastq.gz

These files do not need to be concatenated to be processed together by DRAGEN. To map/align any sample, provide the first file in the series (-1 _001.fastq). DRAGEN reads all segment files in the sample consecutively for both of the FASTQ file sequences specified using the -1 and -2 options for paired-end input and for compressed fastq.gz files. To turn the behavior off, set ‑‑enable-auto-multifile to false on the command line.

DRAGEN can also optionally read multiple files by the sample name given in the file name, which can be used to combine samples that have been distributed across multiple BCL lanes or flow cells. To enable this feature, set the --combine-samples-by-name option to true.

If the FASTQ files specified on the command-line use the Casava 1.8 file naming convention shown above and additional files in the same directory share that sample name, those files and all their segments are processed automatically. Note that sample name, read number, and file extension must match. Index barcode and lane number can differ.

To avoid impacting system performance, input files must be located on a fast file system.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can try to implement it if you want just want to be sure we are in agreement that it is something people will use.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nah, that would be premature optimization. When someone asks for it make a note of this for future reference!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the hard work!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you 🙂

@marrip
Copy link
Contributor Author

marrip commented Sep 11, 2025

I am a bit uncertain what the tests should look like, should it be just stubbing tests? Which use-cases should be covered? All sequencing input files? For most other input, there not much changing if it is supplied or not. Let me know what your thoughts are.

@adamrtalbot
Copy link
Contributor

I am a bit uncertain what the tests should look like, should it be just stubbing tests? Which use-cases should be covered? All sequencing input files? For most other input, there not much changing if it is supplied or not. Let me know what your thoughts are.

@mashehu can you help us here? What should a Dragen test look like since we don't have access to the hardware

@mashehu
Copy link
Contributor

mashehu commented Oct 17, 2025

yes, just stub tests then.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants