Skip to content

Conversation

jvcdk
Copy link

@jvcdk jvcdk commented Jul 3, 2025

Quick notes:

  • This is a follow-up of Flux reconciliation dependencies flux2#5403. Please read this for details on the failure scenario.
  • The intention is to put spotlight on a failure scenario; whether this fix is the right or not, I do not know.

Failure Scenario summary

Following my issue described in fluxcd/flux2#5403, I did a bit of digging. I enabled the debug log of the Kustomize controller, and found the following scenario happening:

  • Say we have kustomizations Ka and Kb. Kb depends on Ka.
  • In the beginning: Both Ka and Kb tries to reconcile every 2 seconds; they fail with Dependencies do not meet ready condition, retrying in 2s. This is expected.
  • Then dependencies for Ka comes into place, and Ka starts to reconcile.
  • Kb keeps trying to reconcile every 2 seconds for some time! Then it stops / dies.
  • Ka is done reconciling, and Kb should be ready to go, but nothing happens!
  • After a while, a handful of errors are logged (similar to this):
    message: unable to record event
    error: POST http://notification-controller.flux-system.svc.cluster.local./ giving up after 5 attempt(s)
    
  • Then Kb starts to reconcile.

See details below.

Apparent cause – and fix

It seems that this scenario is the combination of two bugs:

Bug 1: Communication to Notification-controller

For some reason; Kustomize-controller cannot communicate with Notification-controller in our cluster. It fails to [...] to record event and it gives up after ~5 minutes after start of Notification-controller.

This may very well be a problem in our setup that I will have to investigate.

Bug 2: Blocking of other communication

It seems that the communication loop used to update the perceived reconciliation status of Ka and Kb is the same loop used for the Notification controller. Hence: While the thread is blocked trying (and failing) to communicate with Notification Controller, Kustomization controller cannot update its perceived status of Kustomizations.

=> This is why Kb fails to start reconciling when Ka is done! Kustomization controller simply does not get the info that Ka is done.

My quick fix (read: This may very well be solved in a better way) is to push the "send the message Notification Controller" into its own co-routing, effectively unblocking other communication.

The effect? Now, when Ka is done reconciling, Kb starts briefly there-after. 🥳

Would you have a look and see if this fix is good, or if something else can be done?

Thank you 🙏

Details

Sequence diagram

Here are 2 sequence diagrams of a) what I expect to happen, and b) what I see is happening.

Note: I am a bit uncertain about the actors... That is why I wrote question marks next to "API-server".

Expected behavior

image

Observed behavior

image

Log output

Notes:

  • The log output is filtered / modified to ease the reading. However; they do originate from Kustomize Controller.
  • The field t_seconds is seconds since start first log statement from Kustomize Controller.
  • Ka in the example above is kyverno-system-controllers in the log files.
  • Kb in the example above is kyverno-system in the log files.
  • kyverno-system depends on kyverno-system-controllers
  • I am unsure why there are two ReconciliationSucceeded messages each time... but there is...

Log output, running Kustomize Controller from main branch

Notable timestamps:

  • T=89s: kyverno-system stops trying to reconcile.
  • T=101s: kyverno-system-controllers finishes reconciling.
  • T=316s: Log messages about "unable to record event" for multiple systems (only a few shown for brevity).
  • T=319s: kyverno-system starts reconciling again (and is done immediately).
{"t_seconds":66,"name":"kyverno-system-controllers","reason":"DependencyNotReady","message":"Dependencies do not meet ready condition, retrying in 2s","error":null}
{"t_seconds":68,"name":"kyverno-system","reason":"DependencyNotReady","message":"Dependencies do not meet ready condition, retrying in 2s","error":null}
{"t_seconds":68,"name":"kyverno-system-controllers","reason":"DependencyNotReady","message":"Dependencies do not meet ready condition, retrying in 2s","error":null}
{"t_seconds":70,"name":"kyverno-system","reason":"DependencyNotReady","message":"Dependencies do not meet ready condition, retrying in 2s","error":null}
{"t_seconds":72,"name":"kyverno-system","reason":"DependencyNotReady","message":"Dependencies do not meet ready condition, retrying in 2s","error":null}
{"t_seconds":74,"name":"kyverno-system","reason":"DependencyNotReady","message":"Dependencies do not meet ready condition, retrying in 2s","error":null}
{"t_seconds":76,"name":"kyverno-system","reason":"DependencyNotReady","message":"Dependencies do not meet ready condition, retrying in 2s","error":null}
{"t_seconds":78,"name":"kyverno-system","reason":"DependencyNotReady","message":"Dependencies do not meet ready condition, retrying in 2s","error":null}
{"t_seconds":81,"name":"kyverno-system","reason":"DependencyNotReady","message":"Dependencies do not meet ready condition, retrying in 2s","error":null}
{"t_seconds":83,"name":"kyverno-system","reason":"DependencyNotReady","message":"Dependencies do not meet ready condition, retrying in 2s","error":null}
{"t_seconds":85,"name":"kyverno-system","reason":"DependencyNotReady","message":"Dependencies do not meet ready condition, retrying in 2s","error":null}
{"t_seconds":87,"name":"kyverno-system","reason":"DependencyNotReady","message":"Dependencies do not meet ready condition, retrying in 2s","error":null}
{"t_seconds":89,"name":"kyverno-system","reason":"DependencyNotReady","message":"Dependencies do not meet ready condition, retrying in 2s","error":null}
{"t_seconds":101,"name":"kyverno-system-controllers","reason":"Progressing","message":"Health check passed in 30.033458445s","error":null}
{"t_seconds":101,"name":"kyverno-system-controllers","reason":"ReconciliationSucceeded","message":"Reconciliation finished in 30.319573291s, next run in 15m0s","error":null}
{"t_seconds":106,"name":"kyverno-system-controllers","reason":"ReconciliationSucceeded","message":"Reconciliation finished in 5.13217514s, next run in 15m0s","error":null}
{"t_seconds":316,"name":"tenants","reason":null,"message":"unable to record event","error":"POST http://notification-controller.flux-system.svc.cluster.local./ giving up after 5 attempt(s)"}
{"t_seconds":316,"name":"observability-system","reason":null,"message":"unable to record event","error":"POST http://notification-controller.flux-system.svc.cluster.local./ giving up after 5 attempt(s)"}
{"t_seconds":316,"name":"ingress-system","reason":null,"message":"unable to record event","error":"POST http://notification-controller.flux-system.svc.cluster.local./ giving up after 5 attempt(s)"}
{"t_seconds":316,"name":"security-system-controllers","reason":null,"message":"unable to record event","error":"POST http://notification-controller.flux-system.svc.cluster.local./ giving up after 5 attempt(s)"}
{"t_seconds":316,"name":"kyverno-system","reason":null,"message":"unable to record event","error":"POST http://notification-controller.flux-system.svc.cluster.local./ giving up after 5 attempt(s)"}
{"t_seconds":316,"name":"auth-system","reason":null,"message":"unable to record event","error":"POST http://notification-controller.flux-system.svc.cluster.local./ giving up after 5 attempt(s)"}
{"t_seconds":319,"name":"kyverno-system","reason":"Progressing","message":"ConfigMap/kyverno-system/harbor-proxy-configmap created\nClusterPolicy/add-critical-label-to-specific-resources created\nClusterPolicy/add-emptydir-sizelimit created\nClusterPolicy/add-pod-proxies created\nClusterPolicy/add-terminationmessagepolicy-fallbacktologsonerror created\nClusterPolicy/apply-container-security-context created\nClusterPolicy/block-updates-deletes created\nClusterPolicy/deny-oci-helmrepository created\nClusterPolicy/disallow-default-namespace created\nClusterPolicy/disallow-world-or-all created\nClusterPolicy/flux-multi-tenancy created\nClusterPolicy/generate-prometheus-scrape-network-policies created\nClusterPolicy/modify-o11y-collector-temp created\nClusterPolicy/node-add-label-kube-kraken-status created\nClusterPolicy/podsecurity-subrule-restricted created\nClusterPolicy/refresh-env-var-in-pods created\nClusterPolicy/restrict-pods-to-not-run-on-egress-gateways created\nClusterPolicy/restrict-use-of-tolerations created\nClusterPolicy/restrict-wildcard-resources created\nClusterPolicy/restrict-wildcard-verbs created\nClusterPolicy/spread-pods created\nPolicyException/kyverno-system/alloy-host-namespaces-podsec-exception created\nPolicyException/kyverno-system/alloy-host-path-volumes-podsec-exception created\nPolicyException/kyverno-system/alloy-runasnonroot-podsec-exception created\nPolicyException/kyverno-system/alloy-validator-podsec-exception created\nPolicyException/kyverno-system/configmap-reloader-podsec-exception created\nPolicyException/kyverno-system/kube-cert-agent-podsec-exception created\nPolicyException/kyverno-system/prometheus-node-exporter-hostport-exception created\nPolicyException/kyverno-system/traefik-hostport-exception created\nPolicyException/kyverno-system/trivy-validator-podsec-exception created","error":null}
{"t_seconds":319,"name":"kyverno-system","reason":"Progressing","message":"Health check passed in 133.106974ms","error":null}
{"t_seconds":319,"name":"kyverno-system","reason":"ReconciliationSucceeded","message":"Reconciliation finished in 3.95742419s, next run in 15m0s","error":null}
{"t_seconds":322,"name":"kyverno-system","reason":"ReconciliationSucceeded","message":"Reconciliation finished in 2.230914375s, next run in 15m0s","error":null}

Log output, running Kustomize Controller from this branch

Notable timestamps:

  • T=*/2: kyverno-system continues to retry reconciliation every 2 seconds. Never stops!
  • T=93s: kyverno-system-controllers finishes reconciling.
  • T=98s: kyverno-system starts reconciling (and is done immediately).
{"t_seconds":66,"name":"kyverno-system","reason":"DependencyNotReady","message":"Dependencies do not meet ready condition, retrying in 2s","error":null}
{"t_seconds":68,"name":"kyverno-system","reason":"DependencyNotReady","message":"Dependencies do not meet ready condition, retrying in 2s","error":null}
{"t_seconds":70,"name":"kyverno-system","reason":"DependencyNotReady","message":"Dependencies do not meet ready condition, retrying in 2s","error":null}
{"t_seconds":72,"name":"kyverno-system","reason":"DependencyNotReady","message":"Dependencies do not meet ready condition, retrying in 2s","error":null}
{"t_seconds":74,"name":"kyverno-system","reason":"DependencyNotReady","message":"Dependencies do not meet ready condition, retrying in 2s","error":null}
{"t_seconds":76,"name":"kyverno-system","reason":"DependencyNotReady","message":"Dependencies do not meet ready condition, retrying in 2s","error":null}
{"t_seconds":78,"name":"kyverno-system","reason":"DependencyNotReady","message":"Dependencies do not meet ready condition, retrying in 2s","error":null}
{"t_seconds":80,"name":"kyverno-system","reason":"DependencyNotReady","message":"Dependencies do not meet ready condition, retrying in 2s","error":null}
{"t_seconds":82,"name":"kyverno-system","reason":"DependencyNotReady","message":"Dependencies do not meet ready condition, retrying in 2s","error":null}
{"t_seconds":84,"name":"kyverno-system","reason":"DependencyNotReady","message":"Dependencies do not meet ready condition, retrying in 2s","error":null}
{"t_seconds":86,"name":"kyverno-system","reason":"DependencyNotReady","message":"Dependencies do not meet ready condition, retrying in 2s","error":null}
{"t_seconds":88,"name":"kyverno-system","reason":"DependencyNotReady","message":"Dependencies do not meet ready condition, retrying in 2s","error":null}
{"t_seconds":88,"name":"kyverno-system-controllers","reason":"Progressing","message":"Health check passed in 30.02618904s","error":null}
{"t_seconds":88,"name":"kyverno-system-controllers","reason":"ReconciliationSucceeded","message":"Reconciliation finished in 30.368048274s, next run in 15m0s","error":null}
{"t_seconds":90,"name":"kyverno-system","reason":"DependencyNotReady","message":"Dependencies do not meet ready condition, retrying in 2s","error":null}
{"t_seconds":92,"name":"kyverno-system","reason":"DependencyNotReady","message":"Dependencies do not meet ready condition, retrying in 2s","error":null}
{"t_seconds":93,"name":"kyverno-system-controllers","reason":"ReconciliationSucceeded","message":"Reconciliation finished in 5.215704635s, next run in 15m0s","error":null}
{"t_seconds":98,"name":"kyverno-system","reason":"Progressing","message":"ConfigMap/kyverno-system/harbor-proxy-configmap created\nClusterPolicy/add-critical-label-to-specific-resources created\nClusterPolicy/add-emptydir-sizelimit created\nClusterPolicy/add-pod-proxies created\nClusterPolicy/add-terminationmessagepolicy-fallbacktologsonerror created\nClusterPolicy/apply-container-security-context created\nClusterPolicy/block-updates-deletes created\nClusterPolicy/deny-oci-helmrepository created\nClusterPolicy/disallow-default-namespace created\nClusterPolicy/disallow-world-or-all created\nClusterPolicy/flux-multi-tenancy created\nClusterPolicy/generate-prometheus-scrape-network-policies created\nClusterPolicy/modify-o11y-collector-temp created\nClusterPolicy/node-add-label-kube-kraken-status created\nClusterPolicy/podsecurity-subrule-restricted created\nClusterPolicy/refresh-env-var-in-pods created\nClusterPolicy/restrict-pods-to-not-run-on-egress-gateways created\nClusterPolicy/restrict-use-of-tolerations created\nClusterPolicy/restrict-wildcard-resources created\nClusterPolicy/restrict-wildcard-verbs created\nClusterPolicy/spread-pods created\nPolicyException/kyverno-system/alloy-host-namespaces-podsec-exception created\nPolicyException/kyverno-system/alloy-host-path-volumes-podsec-exception created\nPolicyException/kyverno-system/alloy-runasnonroot-podsec-exception created\nPolicyException/kyverno-system/alloy-validator-podsec-exception created\nPolicyException/kyverno-system/configmap-reloader-podsec-exception created\nPolicyException/kyverno-system/kube-cert-agent-podsec-exception created\nPolicyException/kyverno-system/prometheus-node-exporter-hostport-exception created\nPolicyException/kyverno-system/traefik-hostport-exception created\nPolicyException/kyverno-system/trivy-validator-podsec-exception created","error":null}
{"t_seconds":98,"name":"kyverno-system","reason":"Progressing","message":"Health check passed in 69.108735ms","error":null}
{"t_seconds":98,"name":"kyverno-system","reason":"ReconciliationSucceeded","message":"Reconciliation finished in 3.896051137s, next run in 15m0s","error":null}
{"t_seconds":100,"name":"kyverno-system","reason":"ReconciliationSucceeded","message":"Reconciliation finished in 2.281435033s, next run in 15m0s","error":null}

…e it non-blocking.

Signed-off-by: Jørn Villesen Christensen <[email protected]>
@jvcdk jvcdk changed the title Controller: Call EventRecorder.AnnotatedEventf in a go-routine to make it non-blocking. Controller, bugfix: Call EventRecorder.AnnotatedEventf in a go-routine to make it non-blocking. Jul 3, 2025
@matheuscscp
Copy link
Member

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants