
Conversation

@MatthiasZepper
Member

I would like to suggest adding two new pre-commit hooks to the pipeline template:

  1. Check for large files (>5MB)
  2. Prevent adding files with unresolved merge conflict markers.

If that is a change that requires an RFC in nf-core/proposals, I am fine with adding an issue there first.

The rationale is that people have inadvertently committed large files to pipeline repositories. Because those files were deleted in subsequent commits before the pull request was reviewed, it escaped the reviewers that the commit history contained these large, unnecessary files. The associated blobs nevertheless remain in the history and will be downloaded by everyone forking or cloning the repository in the future.

The limit of 5MB is arbitrary, but it seemed a reasonable cut-off based on the few pipeline repositories that I have tested. Before merging this PR, we can ask in #pipeline-maintainers that everybody determine the size of their largest files outside of docs, to see whether we need additional exceptions or a generally higher limit.
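
For reference, a minimal sketch of the corresponding entries in the pre-commit configuration is shown below. The rev is a placeholder, and only the two exclusion lines and the closing parenthesis are taken from the diff of this PR; the opening of the verbose regex is assumed:

- repo: https://github.com/pre-commit/pre-commit-hooks
  rev: v5.0.0  # placeholder, pin to whatever release the template uses
  hooks:
    - id: check-added-large-files
      args: ["--maxkb=5000"]  # roughly the 5MB limit discussed above
      exclude: |
        (?x)^(
          lib/nfcore_external_java_deps.jar$|
          docs/.*\.(svg|pdf)$
        )$
    - id: check-merge-conflict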

Running the following command on any Git repository will show the largest files by blob size:

git rev-list --objects --all |                                          # list every object reachable from any ref
  git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |    # print type, size in bytes, and path
  grep '^blob' |                                                        # keep only file contents (blobs)
  sort -k2 -n |                                                         # sort by size, ascending
  tail -25 |                                                            # keep the 25 largest
  numfmt --field=2 --to=iec                                             # convert the sizes to human-readable units

When run on Sarek, for example, the command above shows that the largest files are simply docs, which is why I have already included a corresponding exception in the hook configuration:

blob    1.5M docs/images/sarek_subway.png
blob    1.6M docs/images/sarek_subway.png
blob    1.6M docs/images/sarek_subway.png
blob    1.6M docs/images/sarek_subway.png
blob    1.6M docs/images/sarek_subway.png
blob    1.6M docs/images/sarek_subway.png
blob    1.6M docs/images/sarek_subway.png
blob    1.6M docs/images/sarek_subway.png
blob    1.6M docs/sarek_subway.png
blob    1.6M docs/images/sarek_subway.png
blob    1.7M docs/images/sarek_subway.png
blob    1.7M docs/images/sarek_subway.png
blob    1.7M docs/images/sarek_subway.png
blob    1.7M docs/images/sarek_subway.png
blob    1.7M docs/images/sarek_subway.png
blob    1.9M docs/posters/NextflowSummit_2022_FHanssen.pdf
blob    2.1M docs/posters/EMBO_2022_FHanssen.pdf
blob    2.2M lib/nfcore_external_java_deps.jar
blob    2.3M docs/posters/poster_tubit_2021_FHanssen.svg
blob    2.3M docs/images/sarek_subway.svg

But when run on another nf-core pipeline, the command reveals a history of inadvertently committed large files that seemingly originate from local test runs of the modified pipeline inside the development directory:

blob    6.4M result_test/pipeline_info/execution_report_2025-10-14_15-10-59.html
blob    6.4M result_test/pipeline_info/execution_report_2025-10-14_14-33-41.html
blob    6.4M result/pipeline_info/execution_report_2025-10-14_13-01-32.html
blob    6.4M result_test/pipeline_info/execution_report_2025-10-14_14-00-51.html
blob    6.4M result_test/pipeline_info/execution_report_2025-10-14_14-17-52.html
blob    6.4M result/pipeline_info/execution_report_2025-10-14_11-41-22.html
blob    6.4M result_test/pipeline_info/execution_report_2025-10-14_13-29-34.html
blob    6.4M result/pipeline_info/execution_report_2025-10-14_13-13-04.html
blob    6.4M result/pipeline_info/execution_report_2025-10-14_13-10-51.html
blob    6.4M result/pipeline_info/execution_report_2025-10-14_13-02-11.html
blob    6.4M result_test/pipeline_info/execution_report_2025-10-14_14-49-34.html
blob    6.4M result_test/pipeline_info/execution_report_2025-10-14_14-37-33.html
blob    6.4M result_test/pipeline_info/execution_report_2025-10-14_15-08-55.html
blob    6.4M result_test/pipeline_info/execution_report_2025-10-14_15-11-55.html
blob    6.4M result_test/pipeline_info/execution_report_2025-10-14_13-35-34.html
blob    6.5M result_test/pipeline_info/execution_report_2025-10-14_15-22-16.html
blob    6.5M result_test/pipeline_info/execution_report_2025-10-14_15-30-15.html
blob    6.5M result/pipeline_info/execution_report_2025-10-14_13-15-29.html
blob      24M result/bwamem2_index/bwamem2/genome.fa.0123
blob      38M result/bwamem2_index/bwamem2/genome.fa.bwt.2bit.64

In the future, such execution reports would, for example, be caught by the new pre-commit hook.

PR checklist

  • This comment contains a description of changes (with reason)
  • CHANGELOG.md is updated
  • If you've fixed a bug or added code that should be tested, add tests!
  • Documentation in docs is updated

Review thread on the following lines of the hook configuration (the exclusion pattern and the new hook):

lib/nfcore_external_java_deps.jar$|
docs/.*\.(svg|pdf)$
)$
- id: check-merge-conflict

@MatthiasZepper
Member Author


The main motivation for this PR was evidently the check-added-large-files hook. Phil just recommended adding the other one as well, because he has it on the MultiQC repo.

If you think that I shouldn't bundle them in one PR, so that they can be decided on separately, I am fine with removing it as well.

@ewels
Member

ewels commented Nov 26, 2025

I think it's worth pushing up the file size limit; I suspect that there will be quite a lot of valid reasons to have files that are a few MB. It's more the hundreds of MB / GB files that we're after. Then we can remove the hardcoded exceptions and hopefully it'll get in the way less.

The result_test thing is a bit annoying - we already have testing/ and testing* in .gitignore. I guess we could make that even more liberal. Or add result* to .gitignore as that's quite common maybe.
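
For illustration, the .gitignore patterns discussed here would amount to something like the following; whether to actually add the result* pattern is still an open question in this thread:

# already present in the pipeline template
testing/
testing*
# possible addition discussed above
result*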

@MatthiasZepper
Member Author

I think it's worth pushing up the file size limit; I suspect that there will be quite a lot of valid reasons to have files that are a few MB.

This is why I suggested asking in #pipeline-maintainers that everyone run the command above on their repos, to get an idea of what the largest legitimate files outside of docs are.

It's more the hundreds of MB / GB files that we're after. Then we can remove the hardcoded exceptions and hopefully it'll get in the way less.

But I guess truly large files are rare, whereas many files of a couple of MB each add up quickly. In the SciLifeLab/umi-transfer repo, we have lots of (binary) files of 5-20 MB each that collectively add up to 6 GB.

The result_test thing is a bit annoying - we already have testing/ and testing* in .gitignore. I guess we could make that even more liberal. Or add result* to .gitignore as that's quite common maybe.

In that case, instead of guessing what somebody might call their output directory: Nextflow will always add a pipeline_info subfolder to it, right? We could come up with a custom pre-commit hook that checks for staged files inside a directory that also contains a pipeline_info subfolder.

@MatthiasZepper
Member Author

I have now added a custom hook that detects pipeline output folders based on the presence of a 'pipeline_info' folder within the directory paths of the files added to the staging area.
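
Roughly, the idea is the following (a simplified sketch, not necessarily the exact script committed in this PR; pre-commit passes the staged file names to the hook as arguments):

#!/usr/bin/env bash
# Simplified sketch: refuse to commit files that live inside a directory tree
# which also contains a pipeline_info/ folder, i.e. a local Nextflow output directory.
set -euo pipefail

status=0
for file in "$@"; do
  dir=$(dirname "$file")
  # walk up the directory tree of each staged file
  while [ "$dir" != "." ] && [ "$dir" != "/" ]; do
    if [ -d "$dir/pipeline_info" ]; then
      echo "Refusing to commit '$file': '$dir' looks like a pipeline output directory." >&2
      status=1
      break
    fi
    dir=$(dirname "$dir")
  done
done
exit $status

Registered as a local hook in the pre-commit configuration, this runs against every staged file and blocks the commit as soon as one of them sits next to a pipeline_info folder.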

@MatthiasZepper
Member Author

MatthiasZepper commented Nov 27, 2025

FYI: I have just used git-filter-repo on SciLifeLab/umi-transfer, and it nicely cleaned up the old commits while retaining the authors, but the SHA sums of all subsequent commits changed as well (which makes sense on second thought).

Branches based on any of those commits would therefore be left dangling. It seems reasonable to assume that a history rewrite on a pipeline repo is only feasible if all open PRs have been merged and no collaborator is still working on a feature.
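
For anyone who needs to do a similar cleanup, the invocations are roughly the following; the size threshold is illustrative and not the exact value used on umi-transfer:

# on a fresh clone of the repository: inspect the history first
# (writes a report to .git/filter-repo/analysis/)
git filter-repo --analyze

# then drop every blob above a given size from the entire history
git filter-repo --strip-blobs-bigger-than 5M

# git-filter-repo removes the 'origin' remote as a safety measure, so it has to be
# re-added before force-pushing the rewritten history
git remote add origin git@github.com:SciLifeLab/umi-transfer.git
git push --force --all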

@MatthiasZepper MatthiasZepper marked this pull request as ready for review November 28, 2025 14:12