
MachineHealthcheck controller fails to get cluster connection from cache #12363

@darkweaver87

Description


What steps did you take and what happened?

I spawned an AWS workload cluster using CAPA v2.8.3 with the following feature-gate flags set:

--feature-gates=EKS=true,EKSEnableIAM=true,EKSAllowAddRoles=false,EKSFargate=false,MachinePool=false,EventBridgeInstanceState=false,AutoControllerIdentityCreator=true,BootstrapFormatIgnition=false,ExternalResourceGC=false

Every 15 minutes or so (it varies; sometimes it can take up to 20 minutes), I get the following error from the MachineHealthCheck controller, followed by related errors from the MachineSet controller:

E0616 07:31:16.533327       1 machinehealthcheck_controller.go:221] "Error creating remote cluster cache" err="error getting client: connection to the workload cluster is down" controller="machinehealthcheck" controllerGroup="cluster.x-k8s.io" controllerKind="MachineHealthCheck" MachineHealthCheck="flux-system/t00-use1-eks-test" namespace="flux-system" name="t00-use1-eks-test" reconcileID="420722f1-5b72-4829-9340-8a9c27537fd4" Cluster="flux-system/t00-use1-eks-test"
E0616 07:31:16.535055       1 machineset_controller.go:1218] "Unable to retrieve Node status" err="error getting client: connection to the workload cluster is down" controller="machineset" controllerGroup="cluster.x-k8s.io" controllerKind="MachineSet" MachineSet="flux-system/t00-use1-eks-test-us-east-1a-md0-f6f27" namespace="flux-system" name="t00-use1-eks-test-us-east-1a-md0-f6f27" reconcileID="4c4135df-6d55-4590-8751-b4af5d8b8982" Cluster="flux-system/t00-use1-eks-test" MachineDeployment="flux-system/t00-use1-eks-test-us-east-1a-md0" Machine="flux-system/t00-use1-eks-test-us-east-1a-md0-f6f27-gpm5b" Node=""
E0616 07:31:16.564003       1 machineset_controller.go:1218] "Unable to retrieve Node status" err="error getting client: connection to the workload cluster is down" controller="machineset" controllerGroup="cluster.x-k8s.io" controllerKind="MachineSet" MachineSet="flux-system/t00-use1-eks-test-us-east-1a-md0-f6f27" namespace="flux-system" name="t00-use1-eks-test-us-east-1a-md0-f6f27" reconcileID="e5a1e000-b16f-4d8a-9c75-6af4c7be8b9f" Cluster="flux-system/t00-use1-eks-test" MachineDeployment="flux-system/t00-use1-eks-test-us-east-1a-md0" Machine="flux-system/t00-use1-eks-test-us-east-1a-md0-f6f27-gpm5b" Node=""
E0616 07:31:16.614978       1 machineset_controller.go:1218] "Unable to retrieve Node status" err="error getting client: connection to the workload cluster is down" controller="machineset" controllerGroup="cluster.x-k8s.io" controllerKind="MachineSet" MachineSet="flux-system/t00-use1-eks-test-us-east-1a-md0-f6f27" namespace="flux-system" name="t00-use1-eks-test-us-east-1a-md0-f6f27" reconcileID="f0952489-d550-40bc-bbe5-fb1bfb72bbd1" Cluster="flux-system/t00-use1-eks-test" MachineDeployment="flux-system/t00-use1-eks-test-us-east-1a-md0" Machine="flux-system/t00-use1-eks-test-us-east-1a-md0-f6f27-gpm5b" Node=""
E0616 07:31:16.628004       1 machineset_controller.go:1218] "Unable to retrieve Node status" err="error getting client: connection to the workload cluster is down" controller="machineset" controllerGroup="cluster.x-k8s.io" controllerKind="MachineSet" MachineSet="flux-system/t00-use1-eks-test-us-east-1a-md0-f6f27" namespace="flux-system" name="t00-use1-eks-test-us-east-1a-md0-f6f27" reconcileID="8c2fd112-b159-438e-9c7e-e2b9ab38191a" Cluster="flux-system/t00-use1-eks-test" MachineDeployment="flux-system/t00-use1-eks-test-us-east-1a-md0" Machine="flux-system/t00-use1-eks-test-us-east-1a-md0-f6f27-gpm5b" Node=""

As far as I can tell, there is no operational impact.
I did not have this error with v1.7.9; it seems to have appeared starting with v1.8.0.

What did you expect to happen?

If this is a real error in the token renewal logic or the updated cluster cache component, then it should be fixed.
If it is not, the message should be suppressed.
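
For context, the error string appears to match clustercache.ErrClusterNotConnected from sigs.k8s.io/cluster-api/controllers/clustercache. Below is a minimal sketch, assuming that package, of what suppression could look like: treating the error as a transient "still reconnecting" condition and requeueing quietly instead of logging at error level. This is an illustration, not the actual controller code; the package name and requeue interval are my own.

package mhcsketch

import (
	"context"
	"errors"
	"time"

	"sigs.k8s.io/cluster-api/controllers/clustercache"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// reconcileRemote is a sketch, not the real controller code: it treats
// clustercache.ErrClusterNotConnected ("connection to the workload cluster
// is down") as transient and requeues instead of logging an error.
func reconcileRemote(ctx context.Context, cc clustercache.ClusterCache, cluster client.ObjectKey) (ctrl.Result, error) {
	remoteClient, err := cc.GetClient(ctx, cluster)
	if errors.Is(err, clustercache.ErrClusterNotConnected) {
		// The accessor is (re)connecting, e.g. after a token rotation;
		// retry later without surfacing an error.
		return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
	}
	if err != nil {
		return ctrl.Result{}, err
	}
	_ = remoteClient // ...inspect workload cluster nodes here...
	return ctrl.Result{}, nil
}

A requeue like this would keep the reconcile behaviour unchanged while removing the log noise.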

Cluster API version

v1.9.8

Kubernetes version

No response

Anything else you would like to add?

I tried tuning --sync-period on both capa-controller-manager and capi-controller-manager, without any luck.
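
For reference, here is a minimal sketch (assuming controller-runtime's usual manager options; the flag-to-option mapping is my assumption) of what --sync-period configures: the periodic resync of the manager's own informer cache. If the workload cluster connection health checking runs on its own interval, that would explain why tuning this flag has no effect.

package mhcsketch

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/cache"
)

// managerOptions sketches what --sync-period maps to: the periodic full
// resync of the management cluster informer cache. As far as I can tell,
// it does not control the workload cluster connection health checks.
func managerOptions(syncPeriod time.Duration) ctrl.Options {
	return ctrl.Options{
		Cache: cache.Options{
			SyncPeriod: &syncPeriod,
		},
	}
}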

Label(s) to be applied

/kind bug

Labels

kind/bug: Categorizes issue or PR as related to a bug.
needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
priority/backlog: Higher priority than priority/awaiting-more-evidence.
