-
Notifications
You must be signed in to change notification settings - Fork 911
Add dragen germline module #8823
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Thank you for that huge contribution 👀 I was thinking whether there would be a way to simplify this a little so that the module does not have as many input and outputs. But at the same time I think DRAGEN is aimed to be a variant-calling-pipeline itself so that might be hard. I thought we could maybe split this into more submodules and then collect those into a DRAGEN subworkflow. But I am unsure whether that is something worth the work. We also have to check if all inputs are required by DRAGEN or if they could also be supplied with ext.args |
tag "$meta.id" | ||
label 'process_long' | ||
|
||
// ATTENTION: No conda env or container image as Dragen requires specialized hardware to run |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How would we make this compatible with nf-core pipelines then? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something like this: https://github.com/seqeralabs/nf-dragen?tab=readme-ov-file#pipeline-implementation
But not every module in nf-core/modules is necessarily used in nf-core pipelines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so we tag it with the label dragen
instead of process_long
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed it to process_dragen
now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are you not putting the meta.id on the later half of the outputs?
- bam: | ||
type: file | ||
description: Input BAM file to send to card | ||
pattern: "*.bam" | ||
- cram: | ||
type: file | ||
description: Input CRAM file to send to card | ||
pattern: "*.cram" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we make this one input and figure out based on extension which?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you want all sequence data containing inputs combined?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know, what else comes under that (fastq, bam, cram, anything else?)
Bam and cram are treated pretty much the same in many tools (subject to compliance).
Basically I don't really like modules with lots of input channels as they are quite hard to use as we don't have named inputs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I attempted to handle all in one in this commit: 1a26f3c
I think this should just be one tool, as you say it is a variant calling pipeline in itself. |
I agree, we already had a bigger version of this and agreed on reducing it to one use case. Sorry for the late reply, I was on holiday. |
if (input.size() != 2) { | ||
error "Error: a maximum of 2 input files is supported." | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Congrats on a huge job implementing this!
Just fyi, Dragen will accept multiple FASTQ files if you supply the first one in the matching list, e.g. these FASTQs:
input_S1_L001_R1_001.fastq.gz
input_S1_L001_R1_002.fastq.gz
input_S1_L001_R1_003.fastq.gz
Can be specified with the input arg -1 input_S1_L001_R1_001.fastq.gz
You can see it in my implementation here (which only supports FASTQ files): https://github.com/seqeralabs/nf-dragen/blob/master/modules/local/dragen.nf
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where do you get input_S1_L001_R1_002.fastq.gz
from, I have never seen anything that isn't 001
at the end? Manual renaming?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, @adamrtalbot 🙏
About the fastq files. Just to clarify, the fastq files basically belong to the same sample and are from different lanes, for example? I also saw that dragen can take a list of fastq files as input but I am not sure if we want to open up for all these different options. That makes things even more complex. I am not against it, I just think the concatenation can be handled outside of the module as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where do you get
input_S1_L001_R1_002.fastq.gz
from, I have never seen anything that isn't001
at the end? Manual renaming?
I don't know, it's in the Dragen docs: https://support-docs.illumina.com/SW/DRAGEN_v39/Content/SW/DRAGEN/Inputfiles_fDG.htm
Maybe there's a demultiplex setting for breaking up FASTQ files?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eurgh, link doesn't work. Pasted here for reference:
If using, bcl2fastq or the DRAGEN BCL command use the following common file naming convention:
S<#><segment#>.fastq.gz
Older versions of bcl2fastq and DRAGEN could segment FASTQ samples into multiple files to limit file size or to decrease the time to generate them.
For Example:
RDRS182520_S1_L001_R1_001.fastq.gz RDRS182520_S1_L001_R1_002.fastq.gz ... RDRS182520_S1_L001_R1_008.fastq.gz
These files do not need to be concatenated to be processed together by DRAGEN. To map/align any sample, provide the first file in the series (-1 _001.fastq). DRAGEN reads all segment files in the sample consecutively for both of the FASTQ file sequences specified using the -1 and -2 options for paired-end input and for compressed fastq.gz files. To turn the behavior off, set ‑‑enable-auto-multifile to false on the command line.
DRAGEN can also optionally read multiple files by the sample name given in the file name, which can be used to combine samples that have been distributed across multiple BCL lanes or flow cells. To enable this feature, set the --combine-samples-by-name option to true.
If the FASTQ files specified on the command-line use the Casava 1.8 file naming convention shown above and additional files in the same directory share that sample name, those files and all their segments are processed automatically. Note that sample name, read number, and file extension must match. Index barcode and lane number can differ.
To avoid impacting system performance, input files must be located on a fast file system.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can try to implement it if you want just want to be sure we are in agreement that it is something people will use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nah, that would be premature optimization. When someone asks for it make a note of this for future reference!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for the hard work!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thank you 🙂
I am a bit uncertain what the tests should look like, should it be just stubbing tests? Which use-cases should be covered? All sequencing input files? For most other input, there not much changing if it is supplied or not. Let me know what your thoughts are. |
@mashehu can you help us here? What should a Dragen test look like since we don't have access to the hardware |
yes, just stub tests then. |
Hey guys!
After almost one year I finally could fix the issues we had with Dragen and try out the module. I reduced it a lot to only feature inputs and outputs relevant for germline analysis. I haven't added tests yet but wanted to discuss the current state of the module first. Let me know what you think.
Old PR: #6383
PR checklist
Closes #XXX
versions.yml
file.label
nf-core modules test <MODULE> --profile docker
nf-core modules test <MODULE> --profile singularity
nf-core modules test <MODULE> --profile conda
nf-core subworkflows test <SUBWORKFLOW> --profile docker
nf-core subworkflows test <SUBWORKFLOW> --profile singularity
nf-core subworkflows test <SUBWORKFLOW> --profile conda