-
Notifications
You must be signed in to change notification settings - Fork 108
runtime/events: Fix rate limits error handling in recorder #986
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
runtime/events: Fix rate limits error handling in recorder #986
Conversation
|
@jvcdk could you please amend your commit and rename it to |
1c98b85 to
3aa313d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! This fix LGTM, just one nit
Done. |
|
@jvcdk you need to pull latest main into your fork then rebase your branch and force push |
3aa313d to
72afdbe
Compare
Sorry. Hadn't seen main had moved. Done :) |
72afdbe to
6c7ed16
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! 🚀
6c7ed16 to
587d822
Compare
|
Amended commit with fix to tests. |
The notification controller sends a 429 Too Many Requests message when a message is duplicated. Thus if we receive a 429, we should discard the message. Signed-off-by: Jørn Villesen Christensen <[email protected]>
Signed-off-by: Matheus Pimenta <[email protected]>
587d822 to
8804d2f
Compare
History
This is patch stems from investigating the issue described in PR Bugfix: Perform http post in goroutine to prevent locking up the reconciler loop. See this for background info on the issue.
Brief bug explanation
Code points of interest:
ErrorPropagatedRetryPolicyuses baseRetryPolicy which in turn handles http code 429 as please-retry-again.Brief bugfix explanation
checkRetrycallback function (of the go-retryablehttp client) fromErrorPropagatedRetryPolicyto a custom function:ErrorPropagatedRetryPolicy.Side contemplations
Number of retrys
The default retry count is 4, resulting in 5 post attempts from go-retryablehttp client, and 6 total attempts (including the manual 429 check) for the original code.
With this PR, the manual attempt is removed, and thus the total attempts is 5. I assume this is ok, but if not, it is easy to increment the retry count.
Response code from Notification Controller
I would suggest that the Notification Controller would return StatusAlreadyReported (instead of StatusTooManyRequests) as a result of duplicated messages. This would remove the need for the custom http code handling in the Kustomize Controller.
(However, this stems from the go-limiter package that Notification Controller uses. See middleware.go, L119.)
Detailed failure scenario description
Here I will lay out the failure scenario that led to this PR. It is meant as auxiliary reading if you wish to understand more in depth how this bug plays out.
Details
Here is a screenshot of the event recorder code and go-retryablehttp client. There is an inlay with two log statements from my debugging session.
Here are some notes to explain the steps further:
res == nil), the code does not return.baseRetryPolicyhandles this with a generic try-again policy.Retry-Afterheader of 5 minutes. This stems from the command line option--rate-limit-intervalwhich defaults to 5 minutes.Unreachable Notification Controller
This happens in our cluster during start-up and is due to a reconfiguration of Cilium. This is a separate issue that I am investigating as well.