Controller, bugfix: Call EventRecorder.AnnotatedEventf in a go-routine to make it non-blocking. #1479
Quick notes:
Failure Scenario summary
Following my issue described in fluxcd/flux2#5403, I did a bit of digging. I enabled the debug log of the Kustomize controller, and found the following scenario happening:
1. `Ka` and `Kb`. `Kb` depends on `Ka`.
2. `Ka` and `Kb` try to reconcile every 2 seconds; they fail with `Dependencies do not meet ready condition, retrying in 2s`. This is expected.
3. `Ka` comes into place, and `Ka` starts to reconcile.
4. `Kb` keeps trying to reconcile every 2 seconds for some time! Then it stops / dies.
5. `Ka` is done reconciling, and `Kb` should be ready to go, but nothing happens!
6. `Kb` starts to reconcile.

See details below.
Apparent cause – and fix
It seems that this scenario is the combination of two bugs:
Bug 1: Communication to Notification-controller
For some reason, Kustomize-controller cannot communicate with Notification-controller in our cluster. It fails with `[...] to record event` and gives up ~5 minutes after the start of Notification-controller.

This may very well be a problem in our setup that I will have to investigate.
Bug 2: Blocking of other communication
It seems that the communication loop used to update the perceived reconciliation status of `Ka` and `Kb` is the same loop used for the Notification controller. Hence: while the thread is blocked trying (and failing) to communicate with Notification controller, Kustomize controller cannot update its perceived status of Kustomizations.

=> This is why `Kb` fails to start reconciling when `Ka` is done! Kustomize controller simply does not get the info that `Ka` is done.

My quick fix (read: this may very well be solved in a better way) is to push the "send the message to Notification controller" step into its own goroutine, effectively unblocking other communication.
The effect? Now, when `Ka` is done reconciling, `Kb` starts shortly thereafter. 🥳
Would you have a look and see if this fix is good, or if something else can be done?
Thank you 🙏
Details
Sequence diagram
Here are 2 sequence diagrams of a) what I expect to happen, and b) what I see is happening.
Note: I am a bit uncertain about the actors... That is why I wrote question marks next to "API-server".
Expected behavior
Observed behavior
Log output
Notes:
- `t_seconds` is seconds since the first log statement from Kustomize Controller.
- `Ka` in the example above is `kyverno-system-controllers` in the log files.
- `Kb` in the example above is `kyverno-system` in the log files.
- `kyverno-system` depends on `kyverno-system-controllers`
- `ReconciliationSucceeded` messages each time... but there is...

Log output, running Kustomize Controller from `main` branch
Notable timestamps:
Log output, running Kustomize Controller from this branch
Notable timestamps: