
Conversation


@MiquelRForgeFlow (Contributor) commented Sep 19, 2025

Background (current behavior):
When a staging with multiple batches fails, runbot_merge splits the set in two halves and stages them in sequence. Only the failing half is split again (bisected); a passing half is staged and merged as a unit. Eventually, if bisecting reaches a single-batch staging that fails, the culprit PR is identified and marked as failed. After that point, the remaining PRs from the original split group continue through the existing split queue (halves and further splits only if those fail). There is no step to recombine all unaffected PRs into a single staging once the culprit is known.

Problem:
Even when there is only one bad PR, the remaining good PRs may still be processed across several stagings (depending on how the queued splits unfold), leading to extra CI cycles and slower throughput.

Proposed change:
As soon as a culprit PR is identified in a single-batch staging:

  1. Mark that PR as failed (existing behavior).
  2. Cancel any pending stagings from the same split source and remove sibling splits for that source.
  3. Create a new split containing all remaining active batches from the original group (excluding the culprit and closed/merged batches).
  4. Let the scheduler stage this recombined split together.

Why this helps:

  • Fewer CI runs: In the common “one culprit” scenario, we avoid cascading stagings of separate halves.
  • Faster merges: Good PRs get merged together promptly.
  • Safe: If another culprit or a problematic interaction exists, the recombined staging will fail and the standard bisect logic will isolate it.

Implementation notes:

  • New helper recombine_remaining_after_culprit(pr) on runbot_merge.stagings: finds sibling splits by source_id, filters the remaining active batches, cancels pending sibling stagings, unlinks sibling splits for the same source, then creates a single recombined split with the remaining batches.
  • Set _order = 'id desc' on runbot_merge.split so the newly created recombined split is picked first by try_staging().
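The helper could be sketched roughly as follows, using plain dicts in place of runbot_merge.split records (the field names `source_id`/`batches`, the `active` set, and the signature are assumptions for illustration, not the actual ORM code):

```python
def recombine_remaining_after_culprit(splits, source_id, culprit, active):
    """Drop all sibling splits sharing `source_id` and append one
    recombined split holding their remaining batches, excluding the
    culprit and anything no longer active (closed/merged). The new
    split is appended last, so a scheduler ordering by id desc picks
    it up first."""
    siblings = [s for s in splits if s["source_id"] == source_id]
    others = [s for s in splits if s["source_id"] != source_id]
    remaining = [b for s in siblings for b in s["batches"]
                 if b != culprit and b in active]
    if remaining:
        others.append({"source_id": source_id, "batches": remaining})
    return others

# Two sibling splits left over after bisecting; B1 is the culprit.
splits = [{"source_id": 1, "batches": ["B1", "B2"]},
          {"source_id": 1, "batches": ["B3", "B4"]}]
result = recombine_remaining_after_culprit(splits, 1, "B1", {"B2", "B3", "B4"})
# → [{"source_id": 1, "batches": ["B2", "B3", "B4"]}]
```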

How to test (manual):

  1. Prepare 4 batches B1..B4 on the same target. Make B1 fail in isolation; others pass.
  2. Trigger a staging with all 4 → it fails → it is split into [B1,B2] and [B3,B4].
  3. Process the left split; bisect isolates B1 as culprit (single-batch failure).
  4. Verify logs show: “recombine remaining N batches …” and that a new split is created with B2,B3,B4.
  5. Next scheduler run should stage B2,B3,B4 together; CI passes → they merge together.
  6. Repeat with a “two culprits” scenario to confirm the process repeats until all bad PRs are isolated.
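The scenario above can also be checked end to end with a hedged simulation (plain Python with a made-up helper; the real flow goes through runbot_merge's split queue and scheduler):

```python
# Bisect failing stagings; once a single-batch culprit is found,
# recombine every batch still queued into one split.

def process_with_recombine(batches, ci_passes):
    merged, failed, stagings = [], [], []
    queue = [list(batches)]
    while queue:
        group = queue.pop(0)
        stagings.append(group)
        if ci_passes(group):
            merged += group
        elif len(group) == 1:
            failed += group
            # culprit isolated: recombine all still-queued batches
            remaining = [b for g in queue for b in g]
            queue = [remaining] if remaining else []
        else:
            mid = len(group) // 2
            queue = [group[:mid], group[mid:]] + queue
    return merged, failed, stagings

merged, failed, stagings = process_with_recombine(
    ["B1", "B2", "B3", "B4"], lambda g: "B1" not in g)
# 4 stagings instead of 5: [B1..B4], [B1,B2], [B1], then [B2,B3,B4] together.
```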

Backward compatibility & failure modes:

  • Worst case (multiple culprits/interactions): behavior naturally falls back to standard bisect.
  • No change to approvals/labels; only scheduling is optimized.
  • Recombined staging creation is idempotent per split source cycle.

If you’d like, I can add a short “Changelog entry”.

@MiquelRForgeFlow (Contributor, Author) commented

@Xavier-Do @d-fence Hope you like it.

@Xavier-Do (Contributor) commented

Hello @MiquelRForgeFlow

Thanks for contributing

This is more a topic for @xmo-odoo, but it looks like it wouldn't work well with random errors.

@MiquelRForgeFlow force-pushed the 18.0-imp-runbot_merge-recombine-remaining branch 2 times, most recently from 3a0b8ec to 885c8d2 on October 1, 2025 08:02
@MiquelRForgeFlow (Contributor, Author) commented

@Xavier-Do To be safer, I removed the cancelling of pending stagings from the same split source and of sibling splits for that source. The new combined split should be staged first regardless, because of the _order set on runbot_merge.split.

BTW, I asked ChatGPT how to better handle random errors (flakes), and it proposed two approaches:

  1. Culprit confirmation with a short retry (“canary retry”): before marking a PR as failed when it fails in a single-batch staging, retry that single-batch staging once.
  2. A single retry of the recombined/single-batch staging when there is no clear culprit.
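The “canary retry” idea could look like this (illustrative sketch only; the helper name and CI callback are made up, not runbot_merge API):

```python
def confirm_culprit(run_ci, batch, retries=1):
    """Re-run a failing single-batch staging up to `retries` extra times
    before declaring the batch a culprit; any passing run is treated as
    evidence the earlier failure was a flake."""
    for _ in range(retries + 1):
        if run_ci([batch]):
            return False  # passed at least once: not a culprit
    return True  # failed every attempt: confirmed culprit

# A deterministic failure is confirmed as a culprit...
assert confirm_culprit(lambda b: False, "B1") is True
# ...while a flaky batch that passes on the retry is not.
attempts = iter([False, True])
assert confirm_culprit(lambda b, it=attempts: next(it), "B2") is False
```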

@xmo-odoo (Collaborator) left a comment

I added both my understanding of the original version (a depth-first search for a culprit PR) and this version (keep the breadth-first search but recombine as soon as a "culprit" is found) to a small simulator I worked on during the odoo days. I found the original to be way better than I was assuming, while the simple recombination is not really an improvement over the baseline version:

title               total   merged  failed  nondeterministic
batched             236000  233380  2222    398
batched_rejoin      238608  236005  2210    393
batched_depthfirst  254902  252245  2377    280

(this is with a bunch of assumptions as the simulator is pretty simplistic, and over 5 years of simulation-time).

However,

  • the branch for the mergebot is 17.0
  • the mergebot has a fair number of tests, and as both this version and the previous one significantly impact the scheduling of splits, I doubt the various tests which involve such splits still pass with this

When a multi-batch staging fails and a culprit PR is later identified
in a single-batch staging, cancel pending sibling stagings/splits from
the same split root and create a new split with all remaining active
batches. This lets the scheduler stage them together again, reducing CI
cycles and accelerating merges while preserving safety (bisect still
applies if more culprits/interactions remain).
@MiquelRForgeFlow force-pushed the 18.0-imp-runbot_merge-recombine-remaining branch from 885c8d2 to 6d3e9e7 on October 15, 2025 13:34
@MiquelRForgeFlow (Contributor, Author) commented

@xmo-odoo Comments addressed.
