-
Notifications
You must be signed in to change notification settings - Fork 152
[IMP] runbot_merge: recombine remaining batches after identifying a culprit PR #1223
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: 18.0
Are you sure you want to change the base?
[IMP] runbot_merge: recombine remaining batches after identifying a culprit PR #1223
Conversation
|
@Xavier-Do @d-fence Hope you like it. |
|
Hello @MiquelRForgeFlow Thanks for contributing This is more a topic for @xmo-odoo but it looks like it wouldn't work well with random errors. |
3a0b8ec to
885c8d2
Compare
|
@Xavier-Do to be more safe, I removed the cancelling of any pending staging from the same split source and sibling splits for that source. The new combined split should be executed regardless because the BTW, I asked ChatGPT how to handle better the random errors (flakes) and it proposes two ways:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added both my understanding of the original version (a depth-first search of a culprit PR) and this version (keep the breadth-first search but recombine as soon as a "culprit" is found) to a small simulator I worked on during the odoo days. I found the original to be way better than I was assuming, while the simple recombination is not really an improvemend over the baseline version:
| title | total | merged | failed | nondeterministic |
|---|---|---|---|---|
| batched | 236000 | 233380 | 2222 | 398 |
| batched_rejoin | 238608 | 236005 | 2210 | 393 |
| batched_depthfirst | 254902 | 252245 | 2377 | 280 |
(this is with a bunch of assumptions as the simulator is pretty simplistic, and over 5 years of simulation-time).
However,
- the branch for the mergebot is 17.0
- the mergebot has a fair number of tests, and as both this and the previous significantly impact the scheduling of splits I doubt the various tests which involve such splits still pass with this
When a multi-batch staging fails and a culprit PR is later identified in a single-batch staging, cancel pending sibling stagings/splits from the same split root and create a new split with all remaining active batches. This lets the scheduler stage them together again, reducing CI cycles and accelerating merges while preserving safety (bisect still applies if more culprits/interactions remain).
885c8d2 to
6d3e9e7
Compare
|
@xmo-odoo comments attended. |
Background (current behavior):
When a staging with multiple batches fails, runbot_merge splits the set in two halves and stages them in sequence. Only the failing half is split again (bisected); a passing half is staged and merged as a unit. Eventually, if bisecting reaches a single-batch staging that fails, the culprit PR is identified and marked as failed. After that point, the remaining PRs from the original split group continue through the existing split queue (halves and further splits only if those fail). There is no step to recombine all unaffected PRs into a single staging once the culprit is known.
Problem:
Even when there is only one bad PR, the remaining good PRs may still be processed across several stagings (depending on how the queued splits unfold), leading to extra CI cycles and slower throughput.
Proposed change:
As soon as a culprit PR is identified in a single-batch staging:
2. Cancel any pending stagings from the same split source and remove sibling splits for that source.Why this helps:
Implementation notes:
recombine_remaining_after_culprit(pr)onrunbot_merge.stagings. Finds sibling splits bysource_id, filters remaining active batches, cancels pending sibling stagings, unlinks sibling splits for the same source, then creates a single recombined split with the remaining batches._order = 'id desc'onrunbot_merge.splitso the newly created recombined split is picked first bytry_staging().How to test (manual):
B1..B4on the same target. MakeB1fail in isolation; others pass.[B1,B2]and[B3,B4].B1as culprit (single-batch failure).B2,B3,B4.B2,B3,B4together; CI passes → they merge together.Backward compatibility & failure modes:
If you’d like, I can add a short “Changelog entry”.