Add update throttling to prevent rapid deployment churn #187
base: master
Conversation
I did not want to close #183, I just wanted to rename the branch :-(
I decided that the semantics of an increasing backoff period are not easy: when should the backoff period be decreased again? So this now uses a static backoff period. I think this is enough for all workloads.
I guess that should be good enough for most cases. It would have delayed the issue in our case. Not sure if it would have prevented the issue completely, though (given enough time passed). If it happens again, I would try to add more backoff.
I guess we would have to check whether the last backoff period had passed when an update occurs (maybe with a few seconds of slack). If so, we could reset the backoff. That would not ensure that the Deployment got ready, only that updates did not come in too fast.
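A minimal sketch of such a reset check (hypothetical names, not code from this PR) might look like:

```go
package main

import "time"

// Hypothetical helper, not code from this PR: reset the backoff when the
// previous backoff window (plus a few seconds of slack) elapsed before the
// next update arrived, i.e. updates are not coming in too fast.
func shouldResetBackoff(lastUpdate time.Time, backoff time.Duration, now time.Time) bool {
	const slack = 5 * time.Second
	return now.After(lastUpdate.Add(backoff + slack))
}
```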
Force-pushed 3dc7cc5 to a648036
Yes, I was thinking along those lines: if the last update was not stopped and is …
For Deployments, that depends on how fast your pods get ready; Kubernetes only removes old ReplicaSets once the new pods are ready. If your pods are slow to start (e.g. Prometheus), this might still create a lot of ReplicaSets. It should not explode too fast, though, as it would be limited to 360 ReplicaSets per hour per Deployment (one update per 10-second backoff). However, if this happens to a lot of different Deployments it might still hit the API server hard. Maybe we need a global max updates per minute?
A global cool-down could break the base expectation that there is a restart for the "last" change of a configmap. |
In general, yes. However, even the Kubernetes API has a per-client QPS limit (with burst) to prevent DDoS. I imagine something like that. In our larger clusters, when we update the trust chain for all namespaces (using trust-manager), this currently causes Wave to update a few thousand pods at once (as almost all of them mount CA certs from a ConfigMap). Initially trust-manager had bugs (which changed the ConfigMap a few times for each cert), and that actually caused our API server to crash as well. We attributed that to trust-manager, but now that I think of it, Wave might have amplified that as well.
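For reference, the per-client limit works roughly like this in client-go (a sketch with illustrative values, not Wave's configuration):

```go
package main

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// client-go enforces a per-client token bucket: QPS is the steady-state
// request rate, Burst allows short spikes above it. Values are illustrative.
func newRateLimitedClient(cfg *rest.Config) (*kubernetes.Clientset, error) {
	cfg.QPS = 20
	cfg.Burst = 50
	return kubernetes.NewForConfig(cfg)
}
```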
In that case you do want all Deployments to be restarted; you just want it to happen slowly, right?
Exactly, the Kubernetes API is fragile as soon as clusters grow. I guess this would be another issue; I can write something up later.
That sounds like something that wants to be solved by Wave putting the updates into a queue that is slowly drained (leaky bucket for rate limiting?).
Basically, we already have a queue for the reconciler. As soon as we block the reconciler we get that behavior. Naively, we could just add a custom rate limiter during controller setup. However, that would not be 100% accurate. It would probably be better to manually call …
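The "custom rate limiter during controller setup" option could look roughly like this (a sketch assuming a pre-generics controller-runtime/client-go API; exact types vary between versions, and the names are illustrative):

```go
package main

import (
	"time"

	"golang.org/x/time/rate"
	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/client-go/util/workqueue"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/controller"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// Illustrative sketch, not this PR's code: wire a token-bucket rate limiter
// into the reconciler's workqueue so reconciles are drained slowly.
func setupController(mgr ctrl.Manager, r reconcile.Reconciler) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&appsv1.Deployment{}).
		WithOptions(controller.Options{
			RateLimiter: workqueue.NewMaxOfRateLimiter(
				// per-item exponential backoff for retries
				workqueue.NewItemExponentialFailureRateLimiter(time.Second, 5*time.Minute),
				// overall token bucket: roughly 1 reconcile/sec with a burst of 10
				&workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(1), 10)},
			),
		}).
		Complete(r)
}
```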
Implements global rate limiting using golang.org/x/time/rate.Limiter to prevent Wave from updating deployments too frequently when secrets or configmaps change rapidly. This prevents scenarios where a buggy controller rapidly updating secrets causes Wave to rapidly update deployments, which can overwhelm the Kubernetes API server.

Key features:
- Token bucket rate limiting (default: 1 update/sec globally, burst of 10)
- Global rate limiting shared across all deployments/statefulsets/daemonsets
- Configurable via --update-rate and --update-burst flags
- Configurable via Helm chart (updateRate and updateBurst values)
- Stalls operator pipeline via Wait() instead of requeueing
- Thread-safe implementation using rate.Limiter

Technical details:
- Uses standard library golang.org/x/time/rate for token bucket algorithm
- Burst allows initial rapid updates, then enforces steady-state rate
- Infinite rate (math.Inf) disables rate limiting for testing
- All resources share single global rate limiter

Fixes #182

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
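For illustration, the mechanism this commit message describes could be sketched as follows (hypothetical names; only golang.org/x/time/rate is a real API here):

```go
package main

import (
	"context"

	"golang.org/x/time/rate"
)

// Hypothetical sketch of the approach described above; a single limiter is
// shared by all Deployment/StatefulSet/DaemonSet handlers.
var updateLimiter = rate.NewLimiter(rate.Limit(1), 10) // 1 update/sec, burst of 10

// throttledUpdate blocks (stalling the reconcile pipeline rather than
// requeueing) until the limiter grants a token, then performs the update.
func throttledUpdate(ctx context.Context, update func(context.Context) error) error {
	if err := updateLimiter.Wait(ctx); err != nil {
		return err // context cancelled or deadline exceeded
	}
	return update(ctx)
}

// For tests, an effectively infinite rate disables the throttling.
func disableRateLimiting() {
	updateLimiter.SetLimit(rate.Inf)
}
```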
Force-pushed a648036 to fc2375d
Like this?
Yeah, that is great. Simple code, and it should be very effective. I guess I would set the burst to 100 by default. That way Wave will act instantly until you do a lot of updates at once.
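With the flags mentioned in the commit message above, that suggestion would amount to something like `--update-rate=1 --update-burst=100` instead of the default burst of 10.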
Implements a minimum interval between updates (default: 10s, configurable) to prevent Wave from updating deployments too frequently when secrets or configmaps change rapidly.
This prevents scenarios where a buggy controller rapidly updating secrets causes Wave to rapidly update deployments, which can overwhelm the Kubernetes API server.
Key features:
Fixes #182
🤖 Generated with Claude Code
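For illustration, a minimum-interval check like the one this commit message describes might be sketched as follows (hypothetical names, not necessarily the PR's actual implementation):

```go
package main

import (
	"sync"
	"time"
)

// Hypothetical sketch of a minimum interval between updates: an update is
// only allowed if at least `interval` has passed since the previous one.
type updateThrottle struct {
	mu       sync.Mutex
	interval time.Duration // e.g. 10 * time.Second by default
	last     time.Time
}

// Allow reports whether an update may proceed now and, if so, records it.
// Callers could either requeue or wait out the remaining time when it
// returns false.
func (t *updateThrottle) Allow(now time.Time) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if now.Sub(t.last) < t.interval {
		return false
	}
	t.last = now
	return true
}
```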