Hello,
I'm seeing an issue in an AWS cluster (not EKS) with Karpenter deployed, where nodes created by Karpenter never get their providerID set by the AWS CCM. The cluster is created with ASGs, and when doing the same thing with ASG-created nodes, they join the cluster without issue and get their providerID set by the CCM. Here's the error from the CCM:
E0602 19:56:35.311131 1 node_controller.go:244] "Unhandled Error" err="error syncing 'ip-<node-ip>.<region>.compute.internal': failed to get instance metadata for node ip-<node-ip>.<region>.compute.internal: instance not found, requeuing" logger="UnhandledError"
E0602 19:56:36.976054 1 node_lifecycle_controller.go:156] error checking if node ip-<node-ip>.<region>.compute.internal exists: instance not found
I0602 19:56:39.224230 1 node_controller.go:271] Update 7 nodes status took 1.627947232s.
E0602 19:56:42.128159 1 node_lifecycle_controller.go:156] error checking if node ip-<node-ip>.<region>.compute.internal exists: instance not found
E0602 19:56:47.302222 1 node_lifecycle_controller.go:156] error checking if node ip-<node-ip>.<region>.compute.internal exists: instance not found
When the providerID is set manually or through another application, the node gets picked up by the CCM and everything works fine.
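For reference, the manual workaround boils down to setting spec.providerID on the Node object in the aws:///<availability-zone>/<instance-id> form the AWS cloud provider expects; a minimal sketch (the values below are placeholders, not our real ones):

apiVersion: v1
kind: Node
metadata:
  name: ip-<node-ip>.<region>.compute.internal
spec:
  # illustrative placeholders; the real values come from the EC2 instance
  providerID: aws:///<az>/<instance-id>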
NodeClass config:
spec:
  amiFamily: Custom
  amiSelectorTerms:
    - tags:
        <custom-name>-karpenter-version: current
  associatePublicIPAddress: false
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        deleteOnTermination: true
        encrypted: true
        kmsKeyID: <kms-key-id>
        volumeSize: 100Gi
        volumeType: gp3
  detailedMonitoring: true
  metadataOptions:
    httpEndpoint: enabled
    httpProtocolIPv6: disabled
    httpPutResponseHopLimit: 3
    httpTokens: optional
  role: <iam-role>
  securityGroupSelectorTerms:
    - id: <sg>
  subnetSelectorTerms:
    - id: <subnet1>
    - id: <subnet2>
    - id: <subnet3>
  tags:
    <some-tag>
    <name>:rancher:managementApi: <api-url>
  userData: |
    #cloud-config
    yum_repos:
      artifactory:
        baseurl: <url>
        enabled: true
        gpgcheck: false
        name: "<name>"
    package_update: true
    package_upgrade: true
    write_files:
      - <file_config>
    runcmd:
      - |
        <Some script and initialization steps with CA and so on>
NodePool config:
spec:
  disruption:
    budgets:
      - nodes: "1"
    consolidateAfter: 5m
    consolidationPolicy: WhenEmpty
  limits:
    cpu: 512
  template:
    metadata:
      labels:
        <some-tag>
    spec:
      expireAfter: Never
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: prometheus
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values:
            - amd64
        - key: kubernetes.io/os
          operator: In
          values:
            - linux
        - key: karpenter.sh/capacity-type
          operator: In
          values:
            - on-demand
      startupTaints:
        - effect: NoExecute
          key: node.cilium.io/agent-not-ready
          value: "true"
      taints:
        - effect: NoExecute
          key: <prefix>/dedicated
          value: prometheus
The cluster is managed with RKE2 and has cloud-provider=external set.
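For context, cloud-provider=external is passed to the kubelet through the RKE2 config; roughly like this (the path is the standard RKE2 location, and the layout is simplified from our actual config):

# /etc/rancher/rke2/config.yaml (simplified sketch)
kubelet-arg:
  - cloud-provider=external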
Is there a particular setting that needs to be configured on either side that I'm missing?
(cross-posting from kubernetes-sigs/karpenter#2281)
Thanks!