projectcalico/calico #11595

Description
The tigera-operator fails to resolve the EKS API server hostname during initial cluster setup due to a DNS circular dependency. The dnsPolicy is hardcoded in the chart template with no option to override it, preventing successful Calico deployment on fresh EKS clusters.
Environment
- Platform: AWS EKS
- Operator Version: v1.38.7
- Calico Version: v3.30.4
- Installation Method: Helm chart
Current Chart Template (Problematic)
```yaml
# templates/tigera-operator/02-tigera-operator.yaml
hostNetwork: true
dnsPolicy: ClusterFirstWithHostNet  # HARDCODED - no override option
{{- if .Values.dnsConfig }}
dnsConfig:
{{- toYaml .Values.dnsConfig | nindent 8 }}
{{- end }}
```
Problem: DNS Circular Dependency
The Deadlock Chain
```
1. tigera-operator starts with ClusterFirstWithHostNet
        ↓
2. Attempts to query kube-dns (ClusterIP: 10.100.0.10)
        ↓
3. The ClusterIP requires kube-proxy iptables rules
        ↓
4. The iptables rules require the CNI network
        ↓
5. The CNI network requires Calico
        ↓
6. Calico requires the operator to deploy it
        ↓
DEADLOCK 🔒
```
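Steps 2-3 can be observed in isolation on a fresh cluster. A minimal sketch, assuming shell access to a node (or any hostNetwork pod) and the kube-dns ClusterIP from the chain above:
```sh
# Query through the kube-dns ClusterIP: hangs or times out, because no CNI
# (and therefore no ready CoreDNS endpoint) exists behind the Service yet.
nslookup kubernetes.default.svc.cluster.local 10.100.0.10

# Query through the node's own resolver (the VPC DNS): succeeds.
nslookup amazonaws.com "$(awk '/^nameserver/ {print $2; exit}' /etc/resolv.conf)"
```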
Why This Happens on EKS
On EKS:
- CoreDNS is deployed as a regular Deployment, not with hostNetwork (see the check below)
- CoreDNS requires the CNI network to function
- During initial cluster setup, no CNI is available yet
- Node DNS (the VPC DNS) is available but unused because of ClusterFirstWithHostNet
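The first two points are easy to confirm on any EKS cluster (the Deployment name coredns in kube-system is the EKS default):
```sh
# Prints nothing (or false): CoreDNS's pod template does not set hostNetwork,
# so it depends on a CNI network that does not exist yet.
kubectl -n kube-system get deployment coredns \
  -o jsonpath='{.spec.template.spec.hostNetwork}{"\n"}'
```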
Error Symptoms
```
# Operator logs
E1227 03:32:05 reflector.go:166] "Unhandled Error"
err="dial tcp: lookup <EKS-API-ENDPOINT>: i/o timeout"
```
```
# Debug container
$ nslookup <EKS-API-ENDPOINT>
;; connection timed out; no servers could be reached
```
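One way to get such a debug container is an ephemeral container attached to the operator pod; since the pod uses hostNetwork, this reproduces DNS exactly as the operator sees it (the pod lookup and image below are illustrative):
```sh
# Attach an ephemeral busybox container to the operator pod and run nslookup
# in its (host) network namespace.
kubectl -n tigera-operator debug -it \
  "$(kubectl -n tigera-operator get pod -o name | head -n 1)" \
  --image=busybox:1.36 -- nslookup <EKS-API-ENDPOINT>
```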
Current Workaround (Not Ideal)
Since dnsPolicy cannot be overridden, users must use dnsConfig to prepend the VPC DNS:
```yaml
# values.yaml
dnsConfig:
  nameservers:
    - 172.22.0.2                        # VPC DNS - environment-specific!
  searches:
    - ap-northeast-1.compute.internal   # region-specific!
    - default.svc.cluster.local
    - svc.cluster.local
    - cluster.local
```
Issues with this approach:
- ❌ Requires hardcoding the VPC DNS IP, which varies per VPC: <VPC_CIDR>.2 (see the lookup below)
- ❌ Requires hardcoding the AWS region in search domains
- ❌ Not portable across different VPCs/regions/accounts
- ❌ Unnecessarily complex configuration
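Finding the right nameserver value per environment underlines the point: it is the VPC CIDR base plus two, which has to be looked up for each VPC. A sketch with a placeholder VPC ID:
```sh
# The VPC resolver sits at the VPC CIDR base + 2,
# e.g. 172.22.0.0/16 -> 172.22.0.2.
aws ec2 describe-vpcs --vpc-ids <VPC_ID> \
  --query 'Vpcs[0].CidrBlock' --output text
```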
Proposed Solution
Make dnsPolicy Configurable
Chart template change:
```yaml
# templates/tigera-operator/02-tigera-operator.yaml
hostNetwork: true
{{- if .Values.dnsPolicy }}
dnsPolicy: {{ .Values.dnsPolicy }}
{{- else }}
dnsPolicy: ClusterFirstWithHostNet  # default for backward compatibility
{{- end }}
{{- if .Values.dnsConfig }}
dnsConfig:
{{- toYaml .Values.dnsConfig | nindent 8 }}
{{- end }}
```
Usage for EKS (simple and portable):
```yaml
# values.yaml
dnsPolicy: Default  # use the node's DNS
```
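With that parameter in place, an EKS install would reduce to a single --set flag. A sketch, assuming the documented projectcalico Helm repo; dnsPolicy here is the proposed parameter, not an existing chart option:
```sh
helm repo add projectcalico https://docs.tigera.io/calico/charts
helm install calico projectcalico/tigera-operator \
  --namespace tigera-operator --create-namespace \
  --set dnsPolicy=Default   # proposed parameter from this issue
```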
Why Default Works for EKS
Behavior with dnsPolicy: Default:
- The operator uses the node's /etc/resolv.conf
- EKS automatically configures node DNS to the VPC DNS (<VPC_CIDR>.2)
- The operator can resolve the EKS API server
- Calico deploys successfully
- After Calico is running, kube-dns becomes available
- Cluster services work normally
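For reference, the node-level resolv.conf that dnsPolicy: Default inherits typically looks like this on EKS (values are illustrative, matching the workaround example above):
```sh
$ cat /etc/resolv.conf
search ap-northeast-1.compute.internal
nameserver 172.22.0.2
```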
Benefits:
- ✅ Simple: One-line configuration
- ✅ Portable: Works across all VPCs, regions, and accounts automatically
- ✅ No hardcoding: Node DNS is auto-configured by EKS
- ✅ Backward compatible: the default remains ClusterFirstWithHostNet for other environments
- ✅ Minimal change: Only affects operator pod DNS behavior
Why Not Modify CoreDNS Instead?
While it is theoretically possible to make CoreDNS use hostNetwork, this approach:
- ❌ Requires modifying EKS-managed addon (loses AWS support)
- ❌ Causes port conflicts with systemd-resolved
- ❌ Limits scalability (one CoreDNS per node max)
- ❌ Breaks load balancing (no ClusterIP)
- ❌ Affects entire cluster DNS (high risk)
- ❌ Introduces unnecessary complexity
The issue is with tigera-operator, not CoreDNS. The operator should be fixed, not the entire cluster DNS architecture.
Reproduction Steps
1. Create a fresh EKS cluster
2. Deploy tigera-operator with the default Helm chart
3. Apply the Installation CR
4. Observe that the operator cannot resolve the EKS API server (see the log check below)
5. Calico never deploys successfully
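For step 4, the failure shows up directly in the operator logs. A minimal check, assuming the chart's default namespace and deployment name:
```sh
# Look for the DNS timeout shown in the Error Symptoms section above.
kubectl -n tigera-operator logs deploy/tigera-operator --tail=100 | grep 'i/o timeout'
```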
Impact
This issue:
- ❌ Blocks Calico deployment on fresh EKS clusters
- ❌ Prevents cluster scaling (new nodes cannot get network)
- ❌ Requires complex, environment-specific workarounds
- ❌ Reduces operator portability
Request
Please make dnsPolicy configurable via values.yaml:
```yaml
# values.yaml - proposed new parameter
dnsPolicy: Default  # or ClusterFirstWithHostNet, or any valid Kubernetes dnsPolicy
```
Benefits:
- ✅ Fixes EKS deployments with simple configuration
- ✅ Maintains backward compatibility (default unchanged)
- ✅ No breaking changes
- ✅ Minimal code change (add one conditional)
- ✅ Solves the root cause properly
Related
- Issue: "calico-node fails DNS lookup on startup when KUBERNETES_SERVICE_HOST is a domain and dataplane is BPF" (projectcalico/calico#10683)
- PR: "Add config options for calico/node DNS policy and config" (#4098; added hostNetwork but didn't address DNS)