
Conversation

@MatthiasZepper
Member

I would like to suggest adding two new pre-commit hooks to the pipeline template:

  1. Check for large files (>5MB)
  2. Prevent adding files with unresolved merge conflict markers.

If that is a change that requires an RFC in nf-core/proposals, I am fine with adding an issue there first.

The rationale is that people have inadvertently committed large files to pipeline repositories. Because those files were deleted in subsequent commits before the pull request was reviewed, it escaped the reviewers that the commit history contained these large, unnecessary files. The associated blobs nevertheless remain in the history and will be downloaded by everyone forking or cloning the repository in the future.

The limit of 5MB is arbitrary, but it seemed a reasonable cut-off based on the few pipeline repositories that I have tested. Before merging this PR, we can ask in #pipeline-maintainers that everybody determine the size of their largest files outside of docs, to see whether we need additional exceptions or a generally higher limit.
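
For reference, a minimal sketch of the corresponding entries in the pre-commit configuration is shown below. The rev is a placeholder, and only the two exclusion lines and the closing parenthesis are taken from the diff of this PR; the opening of the verbose regex is assumed:

- repo: https://github.com/pre-commit/pre-commit-hooks
  rev: v5.0.0  # placeholder, pin to whatever release the template uses
  hooks:
    - id: check-added-large-files
      args: ["--maxkb=5000"]  # roughly the 5MB limit discussed above
      exclude: |
        (?x)^(
          lib/nfcore_external_java_deps.jar$|
          docs/.*\.(svg|pdf)$
        )$
    - id: check-merge-conflict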

Running the following command on any Git repository will show the largest files by blob size:

git rev-list --objects --all |                                          # list every object reachable from any ref
  git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |    # print type, size in bytes, and path
  grep '^blob' |                                                        # keep only file contents (blobs)
  sort -k2 -n |                                                         # sort by size, ascending
  tail -25 |                                                            # keep the 25 largest
  numfmt --field=2 --to=iec                                             # convert the sizes to human-readable units

When run on Sarek, for example, the command above shows that the largest files are simply docs, which is why I have already included a corresponding exception in the hook configuration:

blob    1.5M docs/images/sarek_subway.png
blob    1.6M docs/images/sarek_subway.png
blob    1.6M docs/images/sarek_subway.png
blob    1.6M docs/images/sarek_subway.png
blob    1.6M docs/images/sarek_subway.png
blob    1.6M docs/images/sarek_subway.png
blob    1.6M docs/images/sarek_subway.png
blob    1.6M docs/images/sarek_subway.png
blob    1.6M docs/sarek_subway.png
blob    1.6M docs/images/sarek_subway.png
blob    1.7M docs/images/sarek_subway.png
blob    1.7M docs/images/sarek_subway.png
blob    1.7M docs/images/sarek_subway.png
blob    1.7M docs/images/sarek_subway.png
blob    1.7M docs/images/sarek_subway.png
blob    1.9M docs/posters/NextflowSummit_2022_FHanssen.pdf
blob    2.1M docs/posters/EMBO_2022_FHanssen.pdf
blob    2.2M lib/nfcore_external_java_deps.jar
blob    2.3M docs/posters/poster_tubit_2021_FHanssen.svg
blob    2.3M docs/images/sarek_subway.svg

But when run on another nf-core pipeline, the command reveals a history of inadvertently committed large files that seemingly originate from local test runs of the modified pipeline inside the development directory:

blob    6.4M result_test/pipeline_info/execution_report_2025-10-14_15-10-59.html
blob    6.4M result_test/pipeline_info/execution_report_2025-10-14_14-33-41.html
blob    6.4M result/pipeline_info/execution_report_2025-10-14_13-01-32.html
blob    6.4M result_test/pipeline_info/execution_report_2025-10-14_14-00-51.html
blob    6.4M result_test/pipeline_info/execution_report_2025-10-14_14-17-52.html
blob    6.4M result/pipeline_info/execution_report_2025-10-14_11-41-22.html
blob    6.4M result_test/pipeline_info/execution_report_2025-10-14_13-29-34.html
blob    6.4M result/pipeline_info/execution_report_2025-10-14_13-13-04.html
blob    6.4M result/pipeline_info/execution_report_2025-10-14_13-10-51.html
blob    6.4M result/pipeline_info/execution_report_2025-10-14_13-02-11.html
blob    6.4M result_test/pipeline_info/execution_report_2025-10-14_14-49-34.html
blob    6.4M result_test/pipeline_info/execution_report_2025-10-14_14-37-33.html
blob    6.4M result_test/pipeline_info/execution_report_2025-10-14_15-08-55.html
blob    6.4M result_test/pipeline_info/execution_report_2025-10-14_15-11-55.html
blob    6.4M result_test/pipeline_info/execution_report_2025-10-14_13-35-34.html
blob    6.5M result_test/pipeline_info/execution_report_2025-10-14_15-22-16.html
blob    6.5M result_test/pipeline_info/execution_report_2025-10-14_15-30-15.html
blob    6.5M result/pipeline_info/execution_report_2025-10-14_13-15-29.html
blob      24M result/bwamem2_index/bwamem2/genome.fa.0123
blob      38M result/bwamem2_index/bwamem2/genome.fa.bwt.2bit.64

In the future, such execution reports would, for example, be caught by the new pre-commit hook.

PR checklist

  • This comment contains a description of changes (with reason)
  • CHANGELOG.md is updated
  • If you've fixed a bug or added code that should be tested, add tests!
  • Documentation in docs is updated

Review thread on the following lines of the hook configuration (the exclusion pattern and the new hook):

lib/nfcore_external_java_deps.jar$|
docs/.*\.(svg|pdf)$
)$
- id: check-merge-conflict

@MatthiasZepper
Member Author


The main motivation for this PR was evidently the check-added-large-files hook. Phil just recommended adding the other one as well, because he has it on the MultiQC repo.

If you think that I shouldn't bundle them in one PR, so that they can be decided on separately, I am fine with removing it as well.

@ewels
Member

ewels commented Nov 26, 2025

I think it's worth pushing up the file size limit; I suspect that there will be quite a lot of valid reasons to have files that are a few MB. It's more the hundreds of MB / GB files that we're after. Then we can remove the hardcoded exceptions and hopefully it'll get in the way less.

The result_test thing is a bit annoying - we already have testing/ and testing* in .gitignore. I guess we could make that even more liberal. Or add result* to .gitignore as that's quite common maybe.
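
For illustration, the .gitignore patterns discussed here would amount to something like the following; whether to actually add the result* pattern is still an open question in this thread:

# already present in the pipeline template
testing/
testing*
# possible addition discussed above
result*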

@MatthiasZepper
Member Author

I think it's worth pushing up the file size limit; I suspect that there will be quite a lot of valid reasons to have files that are a few MB.

This is why I suggested asking in #pipeline-maintainers that everyone run the command above on their repos, to get an idea of what the largest legitimate files outside of docs are.

It's more the hundreds of MB / GB files that we're after. Then we can remove the hardcoded exceptions and hopefully it'll get in the way less.

But I guess truly large files are rare, whereas many files of a couple of MB each add up quickly. In the SciLifeLab/umi-transfer repo, we have lots of (binary) files of 5-20 MB each that collectively add up to 6 GB.

The result_test thing is a bit annoying - we already have testing/ and testing* in .gitignore. I guess we could make that even more liberal. Or add result* to .gitignore as that's quite common maybe.

In that case, instead of guessing what somebody might call their output directory: Nextflow will always add a pipeline_info subfolder to it, right? We could come up with a custom pre-commit hook that checks for staged files inside a directory that also contains a pipeline_info subfolder.

@MatthiasZepper
Member Author

I have now added a custom hook that detects pipeline output folders based on the presence of a 'pipeline_info' folder within the directory paths of the files added to the staging area.
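
Roughly, the idea is the following (a simplified sketch, not necessarily the exact script committed in this PR; pre-commit passes the staged file names to the hook as arguments):

#!/usr/bin/env bash
# Simplified sketch: refuse to commit files that live inside a directory tree
# which also contains a pipeline_info/ folder, i.e. a local Nextflow output directory.
set -euo pipefail

status=0
for file in "$@"; do
  dir=$(dirname "$file")
  # walk up the directory tree of each staged file
  while [ "$dir" != "." ] && [ "$dir" != "/" ]; do
    if [ -d "$dir/pipeline_info" ]; then
      echo "Refusing to commit '$file': '$dir' looks like a pipeline output directory." >&2
      status=1
      break
    fi
    dir=$(dirname "$dir")
  done
done
exit $status

Registered as a local hook in the pre-commit configuration, this runs against every staged file and blocks the commit as soon as one of them sits next to a pipeline_info folder.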

@MatthiasZepper
Member Author

MatthiasZepper commented Nov 27, 2025

FYI: I have just used git-filter-repo on SciLifeLab/umi-transfer, and it nicely cleaned up the old commits while retaining the authors, but the SHA sums of all subsequent commits changed as well (which makes sense on second thought).

Branches based on any of those commits would therefore be left dangling. It seems reasonable to assume that a history rewrite on a pipeline repo is only feasible if all open PRs have been merged and no collaborator is still working on a feature.
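
For anyone who needs to do a similar cleanup, the invocations are roughly the following; the size threshold is illustrative and not the exact value used on umi-transfer:

# on a fresh clone of the repository: inspect the history first
# (writes a report to .git/filter-repo/analysis/)
git filter-repo --analyze

# then drop every blob above a given size from the entire history
git filter-repo --strip-blobs-bigger-than 5M

# git-filter-repo removes the 'origin' remote as a safety measure, so it has to be
# re-added before force-pushing the rewritten history
git remote add origin git@github.com:SciLifeLab/umi-transfer.git
git push --force --all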

@MatthiasZepper MatthiasZepper marked this pull request as ready for review November 28, 2025 14:12