
EKS: Hardcoded dnsPolicy: ClusterFirstWithHostNet causes DNS resolution deadlock during initial deployment #4325

@kalavt

Description


The tigera-operator fails to resolve the EKS API server hostname during initial cluster setup due to a DNS circular dependency. The dnsPolicy is hardcoded in the chart template with no option to override it, preventing successful Calico deployment on fresh EKS clusters.

Environment

  • Platform: AWS EKS
  • Operator Version: v1.38.7
  • Calico Version: v3.30.4
  • Installation Method: Helm chart

Current Chart Template (Problematic)

# templates/tigera-operator/02-tigera-operator.yaml
hostNetwork: true
dnsPolicy: ClusterFirstWithHostNet  # HARDCODED - no override option
{{- if .Values.dnsConfig }}
dnsConfig:
  {{- toYaml .Values.dnsConfig | nindent 8 }}
{{- end }}

Problem: DNS Circular Dependency

The Deadlock Chain

1. tigera-operator starts with ClusterFirstWithHostNet
2. It attempts to query kube-dns (ClusterIP: 10.100.0.10); see the resolv.conf sketch below
3. Reaching the ClusterIP requires kube-proxy iptables rules, which point at CoreDNS pod IPs
4. CoreDNS pod IPs require the CNI network
5. The CNI network requires Calico
6. Calico requires the operator to deploy it

   DEADLOCK 🔒
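
To make step 2 concrete, here is roughly what the operator pod's /etc/resolv.conf looks like under ClusterFirstWithHostNet (illustrative output; pod name and search domains vary):

# Illustrative: operator pod resolv.conf with ClusterFirstWithHostNet
$ kubectl exec -n tigera-operator deploy/tigera-operator -- cat /etc/resolv.conf
nameserver 10.100.0.10   # kube-dns ClusterIP - unreachable until kube-proxy + CNI work
search tigera-operator.svc.cluster.local svc.cluster.local cluster.local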

Why This Happens on EKS

On EKS:

  • CoreDNS is deployed as a regular Deployment (not hostNetwork)
  • CoreDNS requires CNI network to function
  • During initial cluster setup, no CNI is available yet (see the check below)
  • Node DNS (VPC DNS) is available but not used due to ClusterFirstWithHostNet
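
This is easy to confirm on a fresh cluster: the CoreDNS pods sit in Pending until a CNI is installed (standard k8s-app=kube-dns label on EKS; output illustrative):

# CoreDNS cannot start before a CNI exists
$ kubectl get pods -n kube-system -l k8s-app=kube-dns
NAME                      READY   STATUS    RESTARTS   AGE
coredns-6d8c4cb4d-xxxxx   0/1     Pending   0          5m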

Error Symptoms

# Operator logs
E1227 03:32:05 reflector.go:166] "Unhandled Error" 
err="dial tcp: lookup <EKS-API-ENDPOINT>: i/o timeout"

# Debug container
$ nslookup <EKS-API-ENDPOINT>
;; connection timed out; no servers could be reached

Current Workaround (Not Ideal)

Since dnsPolicy cannot be overridden, users must use dnsConfig to prepend VPC DNS:

# values.yaml
dnsConfig:
  nameservers:
    - 172.22.0.2  # VPC DNS - environment-specific!
  searches:
    - ap-northeast-1.compute.internal  # region-specific!
    - default.svc.cluster.local
    - svc.cluster.local
    - cluster.local
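
Even finding the right nameserver is per-environment work: the VPC resolver sits at the base of the VPC CIDR plus two, so it must be derived for each VPC (hypothetical VPC ID):

# Deriving the VPC DNS address from the VPC CIDR
$ aws ec2 describe-vpcs --vpc-ids vpc-0123456789abcdef0 \
    --query 'Vpcs[0].CidrBlock' --output text
172.22.0.0/16   # => VPC resolver is 172.22.0.2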

Issues with this approach:

  • ❌ Requires hardcoding VPC DNS IP (varies per VPC: <VPC_CIDR>.2)
  • ❌ Requires hardcoding AWS region in search domains
  • ❌ Not portable across different VPCs/regions/accounts
  • ❌ Unnecessarily complex configuration

Proposed Solution

Make dnsPolicy Configurable

Chart template change:

# templates/tigera-operator/02-tigera-operator.yaml
hostNetwork: true
{{- if .Values.dnsPolicy }}
dnsPolicy: {{ .Values.dnsPolicy }}
{{- else }}
dnsPolicy: ClusterFirstWithHostNet  # default for backward compatibility
{{- end }}
{{- if .Values.dnsConfig }}
dnsConfig:
  {{- toYaml .Values.dnsConfig | nindent 8 }}
{{- end }}
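
For what it's worth, the same conditional can be written more compactly with Helm's built-in default function; a sketch of the equivalent one-liner:

# templates/tigera-operator/02-tigera-operator.yaml (equivalent form)
dnsPolicy: {{ .Values.dnsPolicy | default "ClusterFirstWithHostNet" }}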

Usage for EKS (simple and portable):

# values.yaml
dnsPolicy: Default  # Use node's DNS
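
With the proposed parameter, the same override could also be passed at install time (release and namespace names are assumptions):

$ helm upgrade --install calico projectcalico/tigera-operator \
    --namespace tigera-operator --create-namespace \
    --set dnsPolicy=Default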

Why Default Works for EKS

dnsPolicy: Default

Behavior:

  1. Operator uses node's /etc/resolv.conf (see the sketch after this list)
  2. EKS automatically configures node DNS to VPC DNS (<VPC_CIDR>.2)
  3. Operator can resolve EKS API server
  4. Calico deploys successfully
  5. After Calico is running, kube-dns becomes available
  6. Cluster services work normally
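
An illustrative check of step 1 (addresses and region here match the workaround example above and are environment-specific):

# Illustrative: operator pod resolv.conf with dnsPolicy: Default on EKS
$ kubectl exec -n tigera-operator deploy/tigera-operator -- cat /etc/resolv.conf
nameserver 172.22.0.2   # VPC resolver (<VPC_CIDR>.2), reachable without any CNI
search ap-northeast-1.compute.internal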

Benefits:

  • Simple: One-line configuration
  • Portable: Works across all VPCs, regions, and accounts automatically
  • No hardcoding: Node DNS is auto-configured by EKS
  • Backward compatible: Default remains ClusterFirstWithHostNet for other environments
  • Minimal change: Only affects operator pod DNS behavior

Why Not Modify CoreDNS Instead?

While it is theoretically possible to run CoreDNS with hostNetwork, this approach:

  • ❌ Requires modifying EKS-managed addon (loses AWS support)
  • ❌ Causes port conflicts with systemd-resolved
  • ❌ Limits scalability (one CoreDNS per node max)
  • ❌ Breaks load balancing (no ClusterIP)
  • ❌ Affects entire cluster DNS (high risk)
  • ❌ Introduces unnecessary complexity

The issue is with tigera-operator, not CoreDNS. The operator should be fixed, not the entire cluster DNS architecture.


Reproduction Steps

  1. Create a fresh EKS cluster
  2. Deploy tigera-operator with the default Helm chart (see the sketch below)
  3. Apply Installation CR
  4. Observe operator cannot resolve EKS API server
  5. Calico never deploys successfully
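
A hedged sketch of steps 1-2 (chart repository and release name taken from the project's Helm install docs; EKS cluster creation elided):

# Install the operator chart with default values on a fresh EKS cluster
$ helm repo add projectcalico https://docs.tigera.io/calico/charts
$ helm install calico projectcalico/tigera-operator \
    --namespace tigera-operator --create-namespace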

Impact

This issue:

  • ❌ Blocks Calico deployment on fresh EKS clusters
  • ❌ Prevents cluster scaling (new nodes cannot get network)
  • ❌ Requires complex, environment-specific workarounds
  • ❌ Reduces operator portability

Request

Please make dnsPolicy configurable via values.yaml:

# values.yaml - proposed new parameter
dnsPolicy: Default  # or ClusterFirstWithHostNet, or any valid Kubernetes dnsPolicy

Benefits:

  • ✅ Fixes EKS deployments with simple configuration
  • ✅ Maintains backward compatibility (default unchanged)
  • ✅ No breaking changes
  • ✅ Minimal code change (add one conditional)
  • ✅ Solves the root cause properly
