Pipeline template: New pre-commit hooks (large files, merge conflicts) #3935
Conversation
```yaml
lib/nfcore_external_java_deps.jar$|
docs/.*\.(svg|pdf)$
)$
- id: check-merge-conflict
```
We already have https://nf-co.re/docs/nf-core-tools/api_reference/dev/pipeline_lint_tests/merge_markers, but this is a nicer approach.
The main motivation for this PR was the `check-added-large-files` hook; Phil just recommended using the other one as well, because he has it on the MultiQC repo.
If you think that I shouldn't bundle them in one PR, to allow separate decisions, I am fine with removing it.
I think it's worth pushing up the minimum file size; I suspect there will be quite a lot of valid reasons to have files that are a few MB. It's the hundreds-of-MB / GB files that we're really after. Then we can remove the hardcoded exceptions and hopefully it'll get in the way less.
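For context, the cut-off for `check-added-large-files` is set in kilobytes via its `--maxkb` argument, so raising it is a one-line change (the value below is illustrative):

```yaml
- id: check-added-large-files
  args: ["--maxkb=100000"]  # ~100 MB, rather than the 5 MB proposed in this PR
```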
This is why I suggested asking in #pipeline-maintainers to run the command above on their repos, to get an idea of the largest legitimate files outside of docs.
But I guess truly large files are rare, while many files of a couple of MB each will add up quickly. In the SciLifeLab/umi-transfer repo, we have lots of (binary) files between 5 and 20 MB each that collectively add up to 6 GB.
In that case, instead of guessing how somebody might name their output directory: Nextflow will always add a `pipeline_info` folder inside it, so the hook can check for that instead.
I have now added a custom hook that tests for pipeline output folders, based on the presence of a `pipeline_info` folder inside the directory paths that are added to the staging area.
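A simplified sketch of the idea, using pre-commit's built-in `fail` language (the hook id and message are illustrative; the actual hook in this PR inspects the staged directory paths rather than just matching filenames):

```yaml
- repo: local
  hooks:
    - id: check-pipeline-output  # hypothetical id
      name: Refuse to commit pipeline output
      language: fail
      entry: Staged path looks like pipeline output (it sits in a 'pipeline_info' folder)
      files: '(^|/)pipeline_info/'
```

This variant only fires on files staged from inside a `pipeline_info` directory itself, but since staging a whole results folder also stages the reports within its `pipeline_info` subfolder, it still catches the common accident.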
FYI: I have just used git-filter-repo on SciLifeLab/umi-transfer and it nicely cleaned up the old commits while retaining the authors, but the SHA sums of all subsequent commits changed as well (which makes sense on second thought). Branches based on any of those commits would therefore likely be left dangling. It seems reasonable to assume that a history rewrite on a pipeline repo is only feasible if all open PRs are merged and no collaborator is still working on a feature.
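For reference, purging oversized blobs with git-filter-repo comes down to a single command (a sketch; not necessarily the exact invocation used on umi-transfer):

```bash
# Rewrite the entire history, dropping every blob larger than 5 MB.
# git-filter-repo expects a fresh clone and refuses to run otherwise
# unless --force is given.
git filter-repo --strip-blobs-bigger-than 5M
```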
I would like to suggest adding two new pre-commit hooks to the pipeline template:

- `check-added-large-files`: prevents committing files above a configurable size limit
- `check-merge-conflict`: checks for unresolved merge conflict strings
If that is a change that requires an RFC in nf-core/proposals, I am fine with adding an issue there first.
The rationale is that people have inadvertently committed large files to pipeline repositories. Because those files were deleted in subsequent commits before the pull request was reviewed, it escaped the reviewers that the commit history contained these large, unnecessary files. Obviously, the associated blobs are still there and will be downloaded by everyone forking or cloning the repository in the future.
The limit of 5 MB is arbitrary, but seemed a reasonable cut-off based on a few pipeline repositories that I have tested. Before merging this PR, we can ask in #pipeline-maintainers that everybody determines the size of their largest files outside of `docs`, to see if we need additional exceptions or a generally higher limit.

Running the following command on any Git repository will show the largest files by blob size:
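One common variant of such a one-liner, built on `git rev-list` and `git cat-file` (the command originally posted here may have differed in its exact flags):

```bash
# List the 20 largest blobs anywhere in the history, biggest first:
# rev-list enumerates all objects with their paths, cat-file annotates
# each with its type and size, and awk keeps only the blobs.
git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
  awk '$1 == "blob" {print $3, $4}' |
  sort -rn |
  head -n 20
```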
When run on Sarek, for example, the command above shows that the largest files are simply docs, which is why I have already included this exception in the hook configuration:
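The relevant part of the hook configuration looks roughly like this (a sketch assembled from the diff above; the `repo`/`rev` lines and the exact `--maxkb` value are assumptions):

```yaml
- repo: https://github.com/pre-commit/pre-commit-hooks
  rev: v5.0.0  # illustrative pin
  hooks:
    - id: check-added-large-files
      args: ["--maxkb=5120"]  # 5 MB, matching the limit proposed here
      exclude: |
        (?x)^(
            lib/nfcore_external_java_deps.jar$|
            docs/.*\.(svg|pdf)$
        )$
    - id: check-merge-conflict
```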
But when run on another nf-core pipeline, it shows a history of inadvertently committed large files that seemingly originated from testing the modified pipeline locally inside the development directory:
In the future, such execution reports would, for example, be caught by the new pre-commit hook.
PR checklist

- [ ] `CHANGELOG.md` is updated
- [ ] `docs` is updated