
@jvcdk
Contributor

@jvcdk jvcdk commented Jul 16, 2025

History

This patch stems from investigating the issue described in the PR "Bugfix: Perform http post in goroutine to prevent locking up the reconciler loop". See that PR for background information on the issue.

Brief bug explanation

Brief bugfix explanation

  • Replaced the CheckRetry callback function (of the go-retryablehttp client), previously ErrorPropagatedRetryPolicy, with a custom function:
    • Do not retry on HTTP status code 429.
    • Otherwise, fall back to ErrorPropagatedRetryPolicy.
  • Removed the manual check for HTTP status code 429 (this is now handled as part of the more general retry policy).

Side contemplations

Number of retries

The default retry count is 4, which results in 5 POST attempts from the go-retryablehttp client and, for the original code, 6 attempts in total (including the manual 429 check).

With this PR, the manual attempt is removed, so the total number of attempts is 5. I assume this is OK, but if not, it is easy to increment the retry count.
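
The arithmetic above can be spelled out in a few lines. This is only an illustration: `totalAttempts` is a hypothetical helper, not part of the recorder code.

```go
package main

import "fmt"

// totalAttempts returns the number of POST attempts for a given retry count.
// A retrying client makes one initial attempt plus retryMax retries; the
// original code made one extra manual POST before using that client.
func totalAttempts(retryMax int, manualPost bool) int {
	attempts := retryMax + 1
	if manualPost {
		attempts++
	}
	return attempts
}

func main() {
	const retryMax = 4 // default retry count cited above

	fmt.Println("original code:", totalAttempts(retryMax, true))  // 6
	fmt.Println("with this PR: ", totalAttempts(retryMax, false)) // 5

	// Raising the retry count by one restores the previous total.
	fmt.Println("retry count 5:", totalAttempts(retryMax+1, false)) // 6
}
```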

Response code from Notification Controller

I would suggest that the Notification Controller return StatusAlreadyReported (instead of StatusTooManyRequests) for duplicated messages. This would remove the need for the custom HTTP status code handling in the Kustomize Controller.

(However, this stems from the go-limiter package that Notification Controller uses. See middleware.go, L119.)

Detailed failure scenario description

Here I will lay out the failure scenario that led to this PR. It is meant as auxiliary reading if you wish to understand more in depth how this bug plays out.

Details

Here is a screenshot of the event recorder code and go-retryablehttp client. There is an inlay with two log statements from my debugging session.

[Screenshot: flux-retry-policy]

Here are some notes to explain the steps further:

  1. The code starts with a manual message post to the Notification Controller.
  2. Unfortunately, the Notification Controller is not reachable and the HTTP call times out. (See "Unreachable Notification Controller" below.)
  3. Since the result is not StatusTooManyRequests (in fact, res == nil), the code does not return.
  4. The code attempts to post the message again, this time with the retryable HTTP client.
  5. The Notification Controller is still unreachable, yielding a timeout error.
  6. The baseRetryPolicy handles this with a generic try-again policy.
  7. The back-off calculation uses a default policy, resulting in a back-off delay of 2 seconds.
  8. On the second attempt, the Notification Controller is reachable. The event is a duplicate, so the Notification Controller returns 429 Too Many Requests.
    • Note: The log output is from edited code (extra log data + PoC bugfix), so "giving up after 2 attempts" is not the behavior of the original code. The original code would succeed on the 3rd attempt; see the next points.
  9. The 429 Too Many Requests results in a try-again policy. (This is the behavior that changes with this PR.)
  10. The Notification Controller supplies a Retry-After header of 5 minutes. This stems from the command line option --rate-limit-interval, which defaults to 5 minutes.
  11. [Not shown in screenshot]: After waiting 5 minutes, the retryable HTTP client posts the message to the Notification Controller again, and this time succeeds (because the back-off time was respected).

Unreachable Notification Controller

This happens in our cluster during start-up and is due to a reconfiguration of Cilium. This is a separate issue that I am investigating as well.

@stefanprodan changed the title from "Runtime, Recorder: Use httpClient.CheckRetry to handle http code 429 Too Many Requests." to "runtime/events/recorder: Fix rate limits error handling" on Jul 16, 2025
@stefanprodan
Member

@jvcdk could you please amend your commit and rename it to runtime/events/recorder: Fix rate limits error handling, also sign-off on it with git commit -s and rebase / force push.

@jvcdk force-pushed the bugfix/retryable-http-should-honor-http-code-429 branch from 1c98b85 to 3aa313d on July 16, 2025 11:14
Member

@matheuscscp matheuscscp left a comment


Thanks! This fix LGTM, just one nit

@jvcdk
Contributor Author

jvcdk commented Jul 16, 2025

@jvcdk could you please amend your commit and rename it to runtime/events/recorder: Fix rate limits error handling, also sign-off on it with git commit -s and rebase / force push.

Done.

@stefanprodan
Member

@jvcdk you need to pull latest main into your fork then rebase your branch and force push

@stefanprodan stefanprodan added the area/runtime Controller runtime related issues and pull requests label Jul 16, 2025
@jvcdk force-pushed the bugfix/retryable-http-should-honor-http-code-429 branch from 3aa313d to 72afdbe on July 16, 2025 11:55
@jvcdk
Contributor Author

jvcdk commented Jul 16, 2025

@jvcdk you need to pull latest main into your fork then rebase your branch and force push

Sorry. Hadn't seen main had moved. Done :)

@jvcdk force-pushed the bugfix/retryable-http-should-honor-http-code-429 branch from 72afdbe to 6c7ed16 on July 16, 2025 12:03
Member

@matheuscscp matheuscscp left a comment


LGTM! 🚀

@jvcdk force-pushed the bugfix/retryable-http-should-honor-http-code-429 branch from 6c7ed16 to 587d822 on July 16, 2025 12:43
@jvcdk
Contributor Author

jvcdk commented Jul 16, 2025

Amended commit with fix to tests.

jvcdk and others added 2 commits July 16, 2025 14:14
The notification controller sends a 429 Too Many Requests response when a message is duplicated. Thus, if we
receive a 429, we should discard the message.

Signed-off-by: Jørn Villesen Christensen <[email protected]>
Signed-off-by: Matheus Pimenta <[email protected]>
@matheuscscp force-pushed the bugfix/retryable-http-should-honor-http-code-429 branch from 587d822 to 8804d2f on July 16, 2025 13:14
@matheuscscp changed the title from "runtime/events/recorder: Fix rate limits error handling" to "runtime/events: Fix rate limits error handling" on Jul 16, 2025
@matheuscscp changed the title from "runtime/events: Fix rate limits error handling" to "runtime/events: Fix rate limits error handling in recorder" on Jul 16, 2025
@matheuscscp matheuscscp merged commit dc9bf74 into fluxcd:main Jul 16, 2025
11 checks passed
@jvcdk jvcdk deleted the bugfix/retryable-http-should-honor-http-code-429 branch July 16, 2025 18:05