
Conversation

@toelke
Collaborator

@toelke toelke commented Dec 4, 2025

Implements a minimum interval between updates (default: 10s, configurable) to prevent Wave from updating deployments too frequently when secrets or configmaps change rapidly.

This prevents scenarios where a buggy controller rapidly updating secrets causes Wave to rapidly update deployments, which can overwhelm the Kubernetes API server.

Key features:

  • Fixed minimum interval between updates (default: 10s)
  • Configurable via --min-update-interval flag
  • Configurable via Helm chart (minUpdateInterval value)
  • State tracked in-memory within the operator
  • Thread-safe implementation with mutex protection
  • Applies to ALL updates, even when config hashes change
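
In spirit, the in-memory check works roughly like this (a simplified sketch with hypothetical names, not necessarily the exact code in this change):

```go
package wave

import (
	"sync"
	"time"
)

// IntervalLimiter (hypothetical name) remembers when each resource was last
// updated and rejects updates that arrive before minInterval has elapsed.
type IntervalLimiter struct {
	mu          sync.Mutex
	minInterval time.Duration
	lastUpdate  map[string]time.Time // keyed by namespace/name
}

func NewIntervalLimiter(minInterval time.Duration) *IntervalLimiter {
	return &IntervalLimiter{
		minInterval: minInterval,
		lastUpdate:  map[string]time.Time{},
	}
}

// Allow reports whether the resource may be updated now and, if so, records
// the update time. The mutex makes it safe for concurrent reconcilers.
func (l *IntervalLimiter) Allow(key string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	now := time.Now()
	if last, ok := l.lastUpdate[key]; ok && now.Sub(last) < l.minInterval {
		return false
	}
	l.lastUpdate[key] = now
	return true
}
```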

Fixes #182

🤖 Generated with Claude Code

@toelke
Collaborator Author

toelke commented Dec 4, 2025

I did not want to close #183, I just wanted to rename the branch :-(

@toelke
Collaborator Author

toelke commented Dec 4, 2025

I decided that the semantics of an increasing backoff period are not easy: when would the backoff period be decreased again?

So this now uses a static backoff period. I think this is enough for all workloads.

@toelke toelke marked this pull request as ready for review December 4, 2025 12:13
@toelke toelke requested a review from a team as a code owner December 4, 2025 12:13
@jabdoa2
Contributor

jabdoa2 commented Dec 4, 2025

So this now uses a static backoff period. I think this is enough for all workloads.

I guess that should be good enough for most cases. It would have delayed the issue in our case. Not sure if it would have prevented the issue completely though (given enough time passed). In case it happens again I would try to add more backoff.

I decided that the semantics of an increasing backoff period are not easy: when would the backoff period be decreased again?

I guess we would have to check whether the last backoff has passed when an update occurs (maybe with a few seconds of slack). If that is the case we could reset the backoff. That would not ensure that the Deployment got ready, only that updates did not come in too fast.

@toelke toelke force-pushed the feature/backoff-issue-182 branch from 3dc7cc5 to a648036 Compare December 4, 2025 12:26
@toelke
Collaborator Author

toelke commented Dec 4, 2025

I guess we would have to check whether the last backoff has passed when an update occurs (maybe with a few seconds of slack). If that is the case we could reset the backoff. That would not ensure that the Deployment got ready, only that updates did not come in too fast.

Yes, I was thinking along those lines: if the last update was not stopped and is currentBackoff in the past, then reduce the limit for n steps. But ultimately I felt that 1 update per 10 seconds is not going to hurt in any case…
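
For illustration only, such an adaptive backoff might have looked roughly like this (a hypothetical sketch of the idea discussed above; not what this PR implements):

```go
package wave

import "time"

// adaptiveBackoff tightens the interval when updates arrive too quickly and
// relaxes it again after a full backoff period passes without pressure.
type adaptiveBackoff struct {
	min, max, current time.Duration
	lastUpdate        time.Time
}

// delay returns how long the incoming update would have to wait (0 means
// "go ahead") and adjusts the backoff for the next call.
func (b *adaptiveBackoff) delay(now time.Time) time.Duration {
	elapsed := now.Sub(b.lastUpdate)
	b.lastUpdate = now
	if elapsed >= b.current {
		// The previous backoff period passed quietly: relax one step.
		b.current /= 2
		if b.current < b.min {
			b.current = b.min
		}
		return 0
	}
	// Updates are coming in too fast: wait out the rest and tighten one step.
	wait := b.current - elapsed
	b.current *= 2
	if b.current > b.max {
		b.current = b.max
	}
	return wait
}
```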

@jabdoa2
Contributor

jabdoa2 commented Dec 4, 2025

I guess we would have to check whether the last backoff has passed when an update occurs (maybe with a few seconds of slack). If that is the case we could reset the backoff. That would not ensure that the Deployment got ready, only that updates did not come in too fast.

Yes, I was thinking along those lines: if the last update was not stopped and is currentBackoff in the past, then reduce the limit for n steps. But ultimately I felt that 1 update per 10 seconds is not going to hurt in any case…

For Deployments that depends on how fast your pods get ready: Kubernetes will only remove old ReplicaSets once the new pods are ready. If your pods are slow to start (e.g. Prometheus) then this might still create a lot of ReplicaSets.

It should not explode too fast though, as it would be limited to 360 ReplicaSets per hour per Deployment. However, if this happens to a lot of different Deployments it might still hurt. Maybe we need a global max updates per minute?

@toelke
Collaborator Author

toelke commented Dec 4, 2025

A global cool-down could break the base expectation that there is a restart for the "last" change of a configmap.

@jabdoa2
Contributor

jabdoa2 commented Dec 4, 2025

A global cool-down could break the base expectation that there is a restart for the "last" change of a configmap.

In general, yes. However, even the Kubernetes API has a per-client QPS limit (with burst) to prevent DDoS. I imagine something like that.
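
For reference, client-go exposes that per-client limit via the QPS and Burst fields on rest.Config; a minimal sketch (helper name and values are placeholders):

```go
package wave

import "k8s.io/client-go/rest"

// withClientSideThrottle (hypothetical helper) sets client-go's built-in
// per-client limit: requests beyond QPS per second are throttled, with
// Burst allowing short spikes above that rate.
func withClientSideThrottle(cfg *rest.Config) *rest.Config {
	cfg.QPS = 5    // placeholder steady-state requests per second
	cfg.Burst = 10 // placeholder burst size
	return cfg
}
```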

In our larger clusters, when we update the trust chain for all namespaces (using trust manager), this currently causes Wave to update a few thousand pods at once (as almost all of them mount ca-certs from a configmap). Initially trust manager had bugs (which changed the configmap a few times for each cert) and that actually caused our API server to crash as well. We attributed that to trust manager, but now that I think of it Wave might have amplified that as well.

@toelke
Collaborator Author

toelke commented Dec 4, 2025

In that case you require all Deployments to be restarted; you just want to do it slowly, right?

@jabdoa2
Contributor

jabdoa2 commented Dec 4, 2025

In that case you require all Deployments to be restarted; you just want to do it slowly, right?

Exactly, the Kubernetes API is fragile as soon as clusters grow. I guess this would be another issue; I can write something up later.

@toelke
Collaborator Author

toelke commented Dec 4, 2025

That sounds like it wants to be solved by Wave putting the updates into a queue that is slowly drained (a leaky bucket for rate limiting?)

@jabdoa2
Contributor

jabdoa2 commented Dec 4, 2025

That sounds like it wants to be solved by Wave putting the updates into a queue that is slowly drained (a leaky bucket for rate limiting?)

Basically, we already have a queue for the reconciler; as soon as we block the reconciler we get that behavior. Naively we could just add a custom rate limiter during controller setup, but that would not be 100% accurate. It would probably be better to manually call Wait() on a rate.Limiter before an update.
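
Roughly along these lines (a sketch of the idea only; the shared limiter variable, the Reconciler type and the defaults here are placeholders):

```go
package wave

import (
	"context"

	"golang.org/x/time/rate"
	ctrl "sigs.k8s.io/controller-runtime"
)

// One token bucket shared by all reconcilers (placeholder defaults).
var updateLimiter = rate.NewLimiter(rate.Limit(1), 10) // 1 update/s, burst 10

type Reconciler struct{}

func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// ... detect that the config hash changed and the workload needs an update ...

	// Block until the global token bucket allows another update. This stalls
	// the worker goroutine instead of requeueing the request.
	if err := updateLimiter.Wait(ctx); err != nil {
		return ctrl.Result{}, err // context cancelled while waiting
	}

	// ... patch the Deployment/StatefulSet/DaemonSet here ...
	return ctrl.Result{}, nil
}
```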

Implements global rate limiting using golang.org/x/time/rate.Limiter to
prevent Wave from updating deployments too frequently when secrets or
configmaps change rapidly.

This prevents scenarios where a buggy controller rapidly updating secrets
causes Wave to rapidly update deployments, which can overwhelm the
Kubernetes API server.

Key features:
- Token bucket rate limiting (default: 1 update/sec globally, burst of 10)
- Global rate limiting shared across all deployments/statefulsets/daemonsets
- Configurable via --update-rate and --update-burst flags
- Configurable via Helm chart (updateRate and updateBurst values)
- Stalls operator pipeline via Wait() instead of requeueing
- Thread-safe implementation using rate.Limiter

Technical details:
- Uses golang.org/x/time/rate for the token bucket algorithm
- Burst allows initial rapid updates, then enforces steady-state rate
- Infinite rate (math.Inf) disables rate limiting for testing
- All resources share single global rate limiter
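
Illustrative wiring of the flags listed above (a sketch only, not the exact code in this commit):

```go
package wave

import (
	"flag"

	"golang.org/x/time/rate"
)

// Flags as named above; defaults match the commit message.
var (
	updateRate  = flag.Float64("update-rate", 1, "global workload updates per second")
	updateBurst = flag.Int("update-burst", 10, "maximum burst of workload updates")
)

// newUpdateLimiter builds the single shared token bucket. Passing an
// infinite rate (math.Inf) is equivalent to rate.Inf, which disables
// limiting entirely (useful in tests).
func newUpdateLimiter() *rate.Limiter {
	return rate.NewLimiter(rate.Limit(*updateRate), *updateBurst)
}
```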

Fixes #182

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@toelke toelke force-pushed the feature/backoff-issue-182 branch from a648036 to fc2375d Compare December 5, 2025 14:37
@toelke
Collaborator Author

toelke commented Dec 5, 2025

Like this?

@jabdoa2
Contributor

jabdoa2 commented Dec 6, 2025

Like this?

Yeah that is great. Simple code and should be very effective. I guess I would set burst to 100 by default. That way Wave will act instantly until you do a lot of updates at once.
