
Conversation

marosset
Contributor

@marosset marosset commented Aug 27, 2025

What type of PR is this?
/kind cleanup

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #5498

This PR replaces #5688

Special notes for your reviewer:

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests
  • cherry-pick candidate

Release note:

Update Calico to 3.29

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Aug 27, 2025
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 27, 2025
@k8s-ci-robot k8s-ci-robot requested a review from Jont828 August 27, 2025 18:36
@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Aug 27, 2025
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 27, 2025

codecov bot commented Aug 27, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 46.94%. Comparing base (ea0c2f4) to head (7e5d556).
⚠️ Report is 25 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #5844   +/-   ##
=======================================
  Coverage   46.94%   46.94%           
=======================================
  Files         279      279           
  Lines       29687    29687           
=======================================
  Hits        13936    13936           
  Misses      14938    14938           
  Partials      813      813           

☔ View full report in Codecov by Sentry.

@marosset
Contributor Author

I found another image that we'll need to add to the ACR cache:

Aug 27 19:44:32.767285 capz-conf-wsj56g-md-0-mldvh-nnnss kubelet[1614]: E0827 19:44:32.767196 1614 pod_workers.go:1301] "Error syncing pod, skipping" err="failed to "StartContainer" for "flexvol-driver" with ImagePullBackOff: "Back-off pulling image \"capzcicommunity.azurecr.io/calico/pod2daemon-flexvol:v3.29.4\": ErrImagePull: rpc error: code = NotFound desc = failed to pull and unpack image \"capzcicommunity.azurecr.io/calico/pod2daemon-flexvol:v3.29.4\": failed to resolve reference \"capzcicommunity.azurecr.io/calico/pod2daemon-flexvol:v3.29.4\": capzcicommunity.azurecr.io/calico/pod2daemon-flexvol:v3.29.4: not found"" pod="calico-system/calico-node-bssgs" podUID="53cae7ce-964c-4de3-9047-994cb2b71cef"
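
In case it helps whoever populates the cache, a minimal sketch of pre-seeding it with az acr import (assuming the registry is capzcicommunity and that the image is pulled through from Docker Hub; the source registry is an assumption):

# Hypothetical one-off import of the missing image into the ACR cache.
# Source registry is assumed; adjust if the image is mirrored from elsewhere.
az acr import \
  --name capzcicommunity \
  --source docker.io/calico/pod2daemon-flexvol:v3.29.4 \
  --image calico/pod2daemon-flexvol:v3.29.4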

@marosset
Contributor Author

/test pull-cluster-api-provider-azure-conformance

3 similar comments
@marosset
Contributor Author

/test pull-cluster-api-provider-azure-conformance

@marosset
Contributor Author

/test pull-cluster-api-provider-azure-conformance

@marosset
Contributor Author

/test pull-cluster-api-provider-azure-conformance

@marosset
Contributor Author

/retest

1 similar comment
@marosset
Contributor Author

marosset commented Sep 3, 2025

/retest

jsturtevant and others added 2 commits September 4, 2025 09:16
* Upgrade to calico 3.29 and use windows support

Signed-off-by: James Sturtevant <[email protected]>

* Don't use images from MCR

Signed-off-by: James Sturtevant <[email protected]>

---------

Signed-off-by: James Sturtevant <[email protected]>
@marosset
Contributor Author

marosset commented Sep 4, 2025

/retest

@marosset
Contributor Author

marosset commented Sep 4, 2025

/test pull-cluster-api-provider-azure-e2e

@marosset
Contributor Author

marosset commented Sep 4, 2025

/test pull-cluster-api-provider-azure-conformance-with-ci-artifacts-dra

@marosset
Contributor Author

marosset commented Sep 4, 2025

/test pull-cluster-api-provider-azure-conformance

@marosset
Contributor Author

marosset commented Sep 5, 2025

/test pull-cluster-api-provider-azure-conformance-azl3-with-ci-artifacts

@marosset
Contributor Author

/skip pull-cluster-api-provider-azure-conformance-azl3-with-ci-artifacts

@marosset
Contributor Author

/test pull-cluster-api-provider-azure-conformance-with-ci-artifacts-dra

@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: c0fc47f9df3510b51aac80483a943d0187696280

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: nojnhuh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 11, 2025
@marosset
Contributor Author

/retest

@nojnhuh
Contributor

nojnhuh commented Sep 11, 2025

@marosset I'm seeing logs like this for most nodes across all the e2e jobs:

Failed to get logs for Machine capz-06jj60-md-0-xdh7f-vv66g, Cluster default/capz-06jj60: running command "cat /var/log/calico/cni/cni.log": Process exited with status 1

In particular, that seems to be bottlenecking the scale jobs where collecting logs from each node is taking a whole minute:

Failed to get logs for Machine capz-06jj60-md-0-xdh7f-vv66g, Cluster default/capz-06jj60: running command "cat /var/log/calico/cni/cni.log": Process exited with status 1
  Sep 11 21:59:39.640: INFO: Collecting logs for Linux node capz-06jj60-md-0-xdh7f-vxb9f in cluster capz-06jj60 in namespace default
  Sep 11 22:00:40.543: INFO: Collecting boot logs for resource group capz-06jj60, VM capz-06jj60-md-0-xdh7f-vxb9f
Failed to get logs for Machine capz-06jj60-md-0-xdh7f-vxb9f, Cluster default/capz-06jj60: running command "cat /var/log/calico/cni/cni.log": Process exited with status 1
  Sep 11 22:00:41.167: INFO: Collecting logs for Linux node capz-06jj60-md-0-xdh7f-w4pgh in cluster capz-06jj60 in namespace default
  Sep 11 22:01:41.169: INFO: Collecting boot logs for resource group capz-06jj60, VM capz-06jj60-md-0-xdh7f-w4pgh
Failed to get logs for Machine capz-06jj60-md-0-xdh7f-w4pgh, Cluster default/capz-06jj60: running command "cat /var/log/calico/cni/cni.log": Process exited with status 1
  Sep 11 22:01:41.797: INFO: Collecting logs for Linux node capz-06jj60-md-0-xdh7f-wcs98 in cluster capz-06jj60 in namespace default
  Sep 11 22:02:42.731: INFO: Collecting boot logs for resource group capz-06jj60, VM capz-06jj60-md-0-xdh7f-wcs98
Failed to get logs for Machine capz-06jj60-md-0-xdh7f-wcs98, Cluster default/capz-06jj60: running command "cat /var/log/calico/cni/cni.log": Process exited with status 1
  Sep 11 22:02:43.430: INFO: Collecting logs for Linux node capz-06jj60-md-0-xdh7f-x6cm2 in cluster capz-06jj60 in namespace default
  Sep 11 22:03:43.431: INFO: Collecting boot logs for resource group capz-06jj60, VM capz-06jj60-md-0-xdh7f-x6cm2
Failed to get logs for Machine capz-06jj60-md-0-xdh7f-x6cm2, Cluster default/capz-06jj60: running command "cat /var/log/calico/cni/cni.log": Process exited with status 1
  Sep 11 22:03:44.088: INFO: Collecting logs for Linux node capz-06jj60-md-0-xdh7f-xfp8s in cluster capz-06jj60 in namespace default

Could that be related to this change?

@marosset
Contributor Author

@marosset I'm seeing logs like this for most nodes across all the e2e jobs:

Failed to get logs for Machine capz-06jj60-md-0-xdh7f-vv66g, Cluster default/capz-06jj60: running command "cat /var/log/calico/cni/cni.log": Process exited with status 1

In particular, that seems to be bottlenecking the scale jobs where collecting logs from each node is taking a whole minute.

Could that be related to this change?


No idea but I'll try and look into it

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 11, 2025
@nojnhuh
Contributor

nojnhuh commented Sep 12, 2025

Hm, I'm still seeing that same error even with the volume change. Firing up a cluster I see this on one of the nodes:

capi@aso-20320-controlplane-hj2gb:~$ cat /var/log/calico/cni/cni.log
cat: /var/log/calico/cni/cni.log: Permission denied
capi@aso-20320-controlplane-hj2gb:~$ ll /var/log/calico/cni/cni.log 
-rw------- 1 root root 108794 Sep 12 03:58 /var/log/calico/cni/cni.log

Building the exact same cluster with the older Calico version I see this:

capi@aso-23097-controlplane-fmsbp:~$ ll /var/log/calico/cni/cni.log 
-rw-r--r-- 1 root root 85870 Sep 12 04:07 /var/log/calico/cni/cni.log

A sudo on the cat seems to do the trick with the newer Calico.
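
(A quick way to confirm the mode difference above, assuming the same path; the file is 0600 on the new version vs 0644 on the old:)

# Hypothetical check on a 3.29 node: the log is root-only (mode 600),
# so a plain cat as the capi user fails while sudo succeeds.
stat -c '%a %U %G' /var/log/calico/cni/cni.log    # prints: 600 root root
sudo tail -n 1 /var/log/calico/cni/cni.log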

I opened #5868 to make these things easier to spot next time.

@nojnhuh
Contributor

nojnhuh commented Sep 12, 2025

A sudo on the cat seems to do the trick with the newer Calico.

This change alone seems to fix things even without the volume changes. Happy to keep those anyway if there's some other benefit, otherwise I think we can revert them.

@marosset
Contributor Author

A sudo on the cat seems to do the trick with the newer Calico.

This change alone seems to fix things even without the volume changes. Happy to keep those anyway if there's some other benefit, otherwise I think we can revert them.

Ah - I'll drop the volume changes and add a sudo to our logging script
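
Roughly, the change to the collection command looks like this (a sketch; the script name and exact path in the repo may differ):

# before: fails with "Permission denied" against Calico v3.29's root-only log
cat /var/log/calico/cni/cni.log
# after: read the log with elevated privileges
sudo cat /var/log/calico/cni/cni.log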

@marosset
Contributor Author

@nojnhuh - I see cni.log files for the machines now.
Thanks again!

Contributor

@nojnhuh nojnhuh left a comment


/lgtm

I'll leave the hold in case you're planning to squash this into one or a few commits. Feel free to drop that when you're done. As long as you don't rebase while you're cleaning things up, the LGTM should stick.

Thanks!
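
(If it helps: since Prow pins the lgtm label to the git tree hash, squashing in place keeps the label as long as the final tree is unchanged. A sketch, assuming an upstream remote pointing at kubernetes-sigs:)

# Squash the branch into one commit without changing the resulting tree.
git reset --soft $(git merge-base upstream/main HEAD)
git commit -m "Update Calico to 3.29"
git push --force-with-lease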

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 12, 2025
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 317d128ac674e8db02b7f456983d710692318edb

@nojnhuh
Contributor

nojnhuh commented Sep 12, 2025

#5866 should fix the AKS job

/test pull-cluster-api-provider-azure-e2e-aks

@marosset
Contributor Author

/test pull-cluster-api-provider-azure-e2e

@marosset
Contributor Author

/hold cancel
This should be fine to merge once CI passes!

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 12, 2025
@k8s-ci-robot
Contributor

@marosset: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
pull-cluster-api-provider-azure-conformance-azl3-with-ci-artifacts | cef4469 | link | false | /test pull-cluster-api-provider-azure-conformance-azl3-with-ci-artifacts
pull-cluster-api-provider-azure-load-test-custom-builds | 21d54b5 | link | false | /test pull-cluster-api-provider-azure-load-test-custom-builds
pull-cluster-api-provider-azure-conformance-with-ci-artifacts-dra | 7e5d556 | link | false | /test pull-cluster-api-provider-azure-conformance-with-ci-artifacts-dra

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@nojnhuh
Contributor

nojnhuh commented Sep 12, 2025

@k8s-ci-robot k8s-ci-robot merged commit 8f913a9 into kubernetes-sigs:main Sep 12, 2025
29 of 30 checks passed
@github-project-automation github-project-automation bot moved this from Todo to Done in CAPZ Planning Sep 12, 2025
@marosset marosset deleted the upgrade-calico branch September 15, 2025 16:13
@willie-yao willie-yao mentioned this pull request Sep 22, 2025
4 tasks
@nojnhuh nojnhuh added this to the v1.22 milestone Sep 26, 2025