
Conversation

@toelke
Collaborator

@toelke toelke commented Dec 4, 2025

Implements a minimum interval between updates (default: 10s, configurable) to prevent Wave from updating deployments too frequently when secrets or configmaps change rapidly.

This prevents scenarios where a buggy controller rapidly updating secrets causes Wave to rapidly update deployments, which can overwhelm the Kubernetes API server.

Key features:

  • Fixed minimum interval between updates (default: 10s)
  • Configurable via --min-update-interval flag
  • Configurable via Helm chart (minUpdateInterval value)
  • State tracked in-memory within the operator
  • Thread-safe implementation with mutex protection
  • Applies to ALL updates, even when config hashes change
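
In spirit, the in-memory check works roughly like this (a simplified sketch with hypothetical names, not necessarily the exact code in this change):

```go
package wave

import (
	"sync"
	"time"
)

// IntervalLimiter (hypothetical name) remembers when each resource was last
// updated and rejects updates that arrive before minInterval has elapsed.
type IntervalLimiter struct {
	mu          sync.Mutex
	minInterval time.Duration
	lastUpdate  map[string]time.Time // keyed by namespace/name
}

func NewIntervalLimiter(minInterval time.Duration) *IntervalLimiter {
	return &IntervalLimiter{
		minInterval: minInterval,
		lastUpdate:  map[string]time.Time{},
	}
}

// Allow reports whether the resource may be updated now and, if so, records
// the update time. The mutex makes it safe for concurrent reconcilers.
func (l *IntervalLimiter) Allow(key string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	now := time.Now()
	if last, ok := l.lastUpdate[key]; ok && now.Sub(last) < l.minInterval {
		return false
	}
	l.lastUpdate[key] = now
	return true
}
```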

Fixes #182

🤖 Generated with Claude Code

@toelke
Collaborator Author

toelke commented Dec 4, 2025

I did not want to close #183, I just wanted to rename the branch :-(

@toelke
Collaborator Author

toelke commented Dec 4, 2025

I decided that the semantics of an increasing backoff period are not easy: when would the backoff period be decreased again?

So this now uses a static backoff period. I think this is enough for all workloads.

@toelke toelke marked this pull request as ready for review December 4, 2025 12:13
@toelke toelke requested a review from a team as a code owner December 4, 2025 12:13
@jabdoa2
Contributor

jabdoa2 commented Dec 4, 2025

So this now uses a static backoff period. I think this is enough for all workloads.

I guess that should be good enough for most cases. It would have delayed the issue in our case. Not sure if it would have prevented the issue completely though (given enough time passed). In case it happens again I would try to add more backoff.

I decided that the semantics of an increasing backoff period are not easy: when would the backoff period be decreased again?

I guess we would have to check whether the last backoff has passed when an update occurs (maybe with a few seconds of slack). If that is the case we could reset the backoff. That would not ensure that the Deployment got ready, only that updates did not come in too fast.

@toelke toelke force-pushed the feature/backoff-issue-182 branch from 3dc7cc5 to a648036 Compare December 4, 2025 12:26
@toelke
Collaborator Author

toelke commented Dec 4, 2025

I guess we would have to check whether the last backoff has passed when an update occurs (maybe with a few seconds of slack). If that is the case we could reset the backoff. That would not ensure that the Deployment got ready, only that updates did not come in too fast.

Yes, I was thinking along those lines: if the last update was not stopped and is currentBackoff in the past, then reduce the limit for n steps. But ultimately I felt that 1 update per 10 seconds is not going to hurt in any case…
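
For illustration only, such an adaptive backoff might have looked roughly like this (a hypothetical sketch of the idea discussed above; not what this PR implements):

```go
package wave

import "time"

// adaptiveBackoff tightens the interval when updates arrive too quickly and
// relaxes it again after a full backoff period passes without pressure.
type adaptiveBackoff struct {
	min, max, current time.Duration
	lastUpdate        time.Time
}

// delay returns how long the incoming update would have to wait (0 means
// "go ahead") and adjusts the backoff for the next call.
func (b *adaptiveBackoff) delay(now time.Time) time.Duration {
	elapsed := now.Sub(b.lastUpdate)
	b.lastUpdate = now
	if elapsed >= b.current {
		// The previous backoff period passed quietly: relax one step.
		b.current /= 2
		if b.current < b.min {
			b.current = b.min
		}
		return 0
	}
	// Updates are coming in too fast: wait out the rest and tighten one step.
	wait := b.current - elapsed
	b.current *= 2
	if b.current > b.max {
		b.current = b.max
	}
	return wait
}
```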

@jabdoa2
Contributor

jabdoa2 commented Dec 4, 2025

I guess we would have to check whether the last backoff has passed when an update occurs (maybe with a few seconds of slack). If that is the case we could reset the backoff. That would not ensure that the Deployment got ready, only that updates did not come in too fast.

Yes, I was thinking along those lines: if the last update was not stopped and is currentBackoff in the past, then reduce the limit for n steps. But ultimately I felt that 1 update per 10 seconds is not going to hurt in any case…

For Deployments that depends on how fast your pods get ready: Kubernetes will only remove old ReplicaSets once the new pods are ready. If your pods are slow to start (e.g. Prometheus) then this might still create a lot of ReplicaSets.

It should not explode too fast though, as it would be limited to 360 ReplicaSets per hour per Deployment. However, if this happens to a lot of different Deployments it might still hurt. Maybe we need a global max updates per minute?

@toelke
Collaborator Author

toelke commented Dec 4, 2025

A global cool-down could break the base expectation that there is a restart for the "last" change of a configmap.

@jabdoa2
Contributor

jabdoa2 commented Dec 4, 2025

A global cool-down could break the base expectation that there is a restart for the "last" change of a configmap.

In general, yes. However, even the Kubernetes API has a per-client QPS limit (with burst) to prevent DDoS. I imagine something like that.
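
For reference, client-go exposes that per-client limit via the QPS and Burst fields on rest.Config; a minimal sketch (helper name and values are placeholders):

```go
package wave

import "k8s.io/client-go/rest"

// withClientSideThrottle (hypothetical helper) sets client-go's built-in
// per-client limit: requests beyond QPS per second are throttled, with
// Burst allowing short spikes above that rate.
func withClientSideThrottle(cfg *rest.Config) *rest.Config {
	cfg.QPS = 5    // placeholder steady-state requests per second
	cfg.Burst = 10 // placeholder burst size
	return cfg
}
```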

In our larger clusters, when we update the trust chain for all namespaces (using trust manager), this currently causes Wave to update a few thousand pods at once (as almost all of them mount ca-certs from a configmap). Initially trust manager had bugs (which changed the configmap a few times for each cert) and that actually caused our API server to crash as well. We attributed that to trust manager, but now that I think of it Wave might have amplified that as well.

@toelke
Collaborator Author

toelke commented Dec 4, 2025

In that case you require all Deployments to be restarted; you just want to do it slowly, right?

@jabdoa2
Contributor

jabdoa2 commented Dec 4, 2025

In that case you require all Deployments to be restarted; you just want to do it slowly, right?

Exactly, the Kubernetes API is fragile as soon as clusters grow. I guess this would be another issue; I can write something up later.

@toelke
Collaborator Author

toelke commented Dec 4, 2025

That sounds like it wants to be solved by Wave putting the updates into a queue that is slowly drained (a leaky bucket for rate limiting?)

@jabdoa2
Contributor

jabdoa2 commented Dec 4, 2025

That sounds like it wants to be solved by Wave putting the updates into a queue that is slowly drained (a leaky bucket for rate limiting?)

Basically, we already have a queue for the reconciler; as soon as we block the reconciler we get that behavior. Naively we could just add a custom rate limiter during controller setup, but that would not be 100% accurate. It would probably be better to manually call Wait() on a rate.Limiter before an update.
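
Roughly along these lines (a sketch of the idea only; the shared limiter variable, the Reconciler type and the defaults here are placeholders):

```go
package wave

import (
	"context"

	"golang.org/x/time/rate"
	ctrl "sigs.k8s.io/controller-runtime"
)

// One token bucket shared by all reconcilers (placeholder defaults).
var updateLimiter = rate.NewLimiter(rate.Limit(1), 10) // 1 update/s, burst 10

type Reconciler struct{}

func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// ... detect that the config hash changed and the workload needs an update ...

	// Block until the global token bucket allows another update. This stalls
	// the worker goroutine instead of requeueing the request.
	if err := updateLimiter.Wait(ctx); err != nil {
		return ctrl.Result{}, err // context cancelled while waiting
	}

	// ... patch the Deployment/StatefulSet/DaemonSet here ...
	return ctrl.Result{}, nil
}
```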

Implements global rate limiting using golang.org/x/time/rate.Limiter to
prevent Wave from updating deployments too frequently when secrets or
configmaps change rapidly.

This prevents scenarios where a buggy controller rapidly updating secrets
causes Wave to rapidly update deployments, which can overwhelm the
Kubernetes API server.

Key features:
- Token bucket rate limiting (default: 1 update/sec globally, burst of 10)
- Global rate limiting shared across all deployments/statefulsets/daemonsets
- Configurable via --update-rate and --update-burst flags
- Configurable via Helm chart (updateRate and updateBurst values)
- Stalls operator pipeline via Wait() instead of requeueing
- Thread-safe implementation using rate.Limiter

Technical details:
- Uses golang.org/x/time/rate for the token bucket algorithm
- Burst allows initial rapid updates, then enforces steady-state rate
- Infinite rate (math.Inf) disables rate limiting for testing
- All resources share single global rate limiter
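
Illustrative wiring of the flags listed above (a sketch only, not the exact code in this commit):

```go
package wave

import (
	"flag"

	"golang.org/x/time/rate"
)

// Flags as named above; defaults match the commit message.
var (
	updateRate  = flag.Float64("update-rate", 1, "global workload updates per second")
	updateBurst = flag.Int("update-burst", 10, "maximum burst of workload updates")
)

// newUpdateLimiter builds the single shared token bucket. Passing an
// infinite rate (math.Inf) is equivalent to rate.Inf, which disables
// limiting entirely (useful in tests).
func newUpdateLimiter() *rate.Limiter {
	return rate.NewLimiter(rate.Limit(*updateRate), *updateBurst)
}
```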

Fixes #182

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@toelke toelke force-pushed the feature/backoff-issue-182 branch from a648036 to fc2375d Compare December 5, 2025 14:37
@toelke
Collaborator Author

toelke commented Dec 5, 2025

Like this?

@jabdoa2
Contributor

jabdoa2 commented Dec 6, 2025

Like this?

Yeah that is great. Simple code and should be very effective. I guess I would set burst to 100 by default. That way Wave will act instantly until you do a lot of updates at once.
