
Conversation


@MiquelRForgeFlow (Contributor) commented Sep 19, 2025

Background (current behavior):
When a staging with multiple batches fails, runbot_merge splits the set in two halves and stages them in sequence. Only the failing half is split again (bisected); a passing half is staged and merged as a unit. Eventually, if bisecting reaches a single-batch staging that fails, the culprit PR is identified and marked as failed. After that point, the remaining PRs from the original split group continue through the existing split queue (halves and further splits only if those fail). There is no step to recombine all unaffected PRs into a single staging once the culprit is known.

Problem:
Even when there is only one bad PR, the remaining good PRs may still be processed across several stagings (depending on how the queued splits unfold), leading to extra CI cycles and slower throughput.

Proposed change:
As soon as a culprit PR is identified in a single-batch staging:

  1. Mark that PR as failed (existing behavior).
  2. Cancel any pending stagings from the same split source and remove sibling splits for that source.
  3. Create a new split containing all remaining active batches from the original group (excluding the culprit and closed/merged batches).
  4. Let the scheduler stage this recombined split together.

Why this helps:

  • Fewer CI runs: In the common “one culprit” scenario, we avoid cascading stagings of separate halves.
  • Faster merges: Good PRs get merged together promptly.
  • Safe: If another culprit or a problematic interaction exists, the recombined staging will fail and the standard bisect logic will isolate it.

Implementation notes:

  • New helper recombine_remaining_after_culprit(pr) on runbot_merge.stagings: finds sibling splits by source_id, filters the remaining active batches, cancels pending sibling stagings, unlinks sibling splits for the same source, then creates a single recombined split with the remaining batches.
  • Set _order = 'id desc' on runbot_merge.split so the newly created recombined split is picked first by try_staging().
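The helper could be sketched roughly as follows, using plain dicts in place of runbot_merge.split records (the field names `source_id`/`batches`, the `active` set, and the signature are assumptions for illustration, not the actual ORM code):

```python
def recombine_remaining_after_culprit(splits, source_id, culprit, active):
    """Drop all sibling splits sharing `source_id` and append one
    recombined split holding their remaining batches, excluding the
    culprit and anything no longer active (closed/merged). The new
    split is appended last, so a scheduler ordering by id desc picks
    it up first."""
    siblings = [s for s in splits if s["source_id"] == source_id]
    others = [s for s in splits if s["source_id"] != source_id]
    remaining = [b for s in siblings for b in s["batches"]
                 if b != culprit and b in active]
    if remaining:
        others.append({"source_id": source_id, "batches": remaining})
    return others

# Two sibling splits left over after bisecting; B1 is the culprit.
splits = [{"source_id": 1, "batches": ["B1", "B2"]},
          {"source_id": 1, "batches": ["B3", "B4"]}]
result = recombine_remaining_after_culprit(splits, 1, "B1", {"B2", "B3", "B4"})
# → [{"source_id": 1, "batches": ["B2", "B3", "B4"]}]
```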

How to test (manual):

  1. Prepare 4 batches B1..B4 on the same target. Make B1 fail in isolation; others pass.
  2. Trigger a staging with all 4 → it fails → it is split into [B1,B2] and [B3,B4].
  3. Process the left split; bisect isolates B1 as culprit (single-batch failure).
  4. Verify logs show: “recombine remaining N batches …” and that a new split is created with B2,B3,B4.
  5. Next scheduler run should stage B2,B3,B4 together; CI passes → they merge together.
  6. Repeat with a “two culprits” scenario to confirm the process repeats until all bad PRs are isolated.
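The scenario above can also be checked end to end with a hedged simulation (plain Python with a made-up helper; the real flow goes through runbot_merge's split queue and scheduler):

```python
# Bisect failing stagings; once a single-batch culprit is found,
# recombine every batch still queued into one split.

def process_with_recombine(batches, ci_passes):
    merged, failed, stagings = [], [], []
    queue = [list(batches)]
    while queue:
        group = queue.pop(0)
        stagings.append(group)
        if ci_passes(group):
            merged += group
        elif len(group) == 1:
            failed += group
            # culprit isolated: recombine all still-queued batches
            remaining = [b for g in queue for b in g]
            queue = [remaining] if remaining else []
        else:
            mid = len(group) // 2
            queue = [group[:mid], group[mid:]] + queue
    return merged, failed, stagings

merged, failed, stagings = process_with_recombine(
    ["B1", "B2", "B3", "B4"], lambda g: "B1" not in g)
# 4 stagings instead of 5: [B1..B4], [B1,B2], [B1], then [B2,B3,B4] together.
```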

Backward compatibility & failure modes:

  • Worst case (multiple culprits/interactions): behavior naturally falls back to standard bisect.
  • No change to approvals/labels; only scheduling is optimized.
  • Recombined staging creation is idempotent per split source cycle.

If you’d like, I can add a short “Changelog entry”.

@MiquelRForgeFlow (Contributor, Author) commented

@Xavier-Do @d-fence Hope you like it.

@Xavier-Do (Contributor) commented

Hello @MiquelRForgeFlow

Thanks for contributing

This is more a topic for @xmo-odoo, but it looks like it wouldn't work well with random errors.

@MiquelRForgeFlow force-pushed the 18.0-imp-runbot_merge-recombine-remaining branch 2 times, most recently from 3a0b8ec to 885c8d2 on October 1, 2025 08:02
@MiquelRForgeFlow (Contributor, Author) commented

@Xavier-Do To be safer, I removed the cancelling of pending stagings from the same split source and of sibling splits for that source. The new combined split should be staged first regardless, because of the _order set on runbot_merge.split.

BTW, I asked ChatGPT how to better handle random errors (flakes), and it proposed two approaches:

  1. Culprit confirmation with a short retry (“canary retry”): before marking a PR as failed when it fails in a single-batch staging, retry that single-batch staging once.
  2. A single retry of the recombined/single-batch staging when there is no clear culprit.
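The “canary retry” idea could look like this (illustrative sketch only; the helper name and CI callback are made up, not runbot_merge API):

```python
def confirm_culprit(run_ci, batch, retries=1):
    """Re-run a failing single-batch staging up to `retries` extra times
    before declaring the batch a culprit; any passing run is treated as
    evidence the earlier failure was a flake."""
    for _ in range(retries + 1):
        if run_ci([batch]):
            return False  # passed at least once: not a culprit
    return True  # failed every attempt: confirmed culprit

# A deterministic failure is confirmed as a culprit...
assert confirm_culprit(lambda b: False, "B1") is True
# ...while a flaky batch that passes on the retry is not.
attempts = iter([False, True])
assert confirm_culprit(lambda b, it=attempts: next(it), "B2") is False
```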

@xmo-odoo (Collaborator) left a comment

I added both my understanding of the original version (a depth-first search for a culprit PR) and this version (keep the breadth-first search but recombine as soon as a "culprit" is found) to a small simulator I worked on during the odoo days. I found the original to be way better than I was assuming, while the simple recombination is not really an improvement over the baseline version:

title               total   merged  failed  nondeterministic
batched             236000  233380  2222    398
batched_rejoin      238608  236005  2210    393
batched_depthfirst  254902  252245  2377    280

(this is with a bunch of assumptions as the simulator is pretty simplistic, and over 5 years of simulation-time).

However,

  • the branch for the mergebot is 17.0
  • the mergebot has a fair number of tests, and as both this version and the previous one significantly impact the scheduling of splits, I doubt the various tests which involve such splits still pass with this

When a multi-batch staging fails and a culprit PR is later identified
in a single-batch staging, cancel pending sibling stagings/splits from
the same split root and create a new split with all remaining active
batches. This lets the scheduler stage them together again, reducing CI
cycles and accelerating merges while preserving safety (bisect still
applies if more culprits/interactions remain).
@MiquelRForgeFlow force-pushed the 18.0-imp-runbot_merge-recombine-remaining branch from 885c8d2 to 6d3e9e7 on October 15, 2025 13:34
@MiquelRForgeFlow (Contributor, Author) commented

@xmo-odoo Comments addressed.
