
Conversation

marosset
Contributor

@marosset marosset commented Aug 27, 2025

What type of PR is this?
/kind cleanup

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #5498

This PR replaces #5688

Special notes for your reviewer:

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests
  • cherry-pick candidate

Release note:

Update Calico to 3.29

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Aug 27, 2025
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 27, 2025
@k8s-ci-robot k8s-ci-robot requested a review from Jont828 August 27, 2025 18:36
@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Aug 27, 2025
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 27, 2025

codecov bot commented Aug 27, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 46.94%. Comparing base (ea0c2f4) to head (7e5d556).
⚠️ Report is 25 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #5844   +/-   ##
=======================================
  Coverage   46.94%   46.94%           
=======================================
  Files         279      279           
  Lines       29687    29687           
=======================================
  Hits        13936    13936           
  Misses      14938    14938           
  Partials      813      813           

☔ View full report in Codecov by Sentry.

@marosset
Contributor Author

I found another image that we'll need to add to the ACR cache:

Aug 27 19:44:32.767285 capz-conf-wsj56g-md-0-mldvh-nnnss kubelet[1614]: E0827 19:44:32.767196 1614 pod_workers.go:1301] "Error syncing pod, skipping" err="failed to "StartContainer" for "flexvol-driver" with ImagePullBackOff: "Back-off pulling image \"capzcicommunity.azurecr.io/calico/pod2daemon-flexvol:v3.29.4\": ErrImagePull: rpc error: code = NotFound desc = failed to pull and unpack image \"capzcicommunity.azurecr.io/calico/pod2daemon-flexvol:v3.29.4\": failed to resolve reference \"capzcicommunity.azurecr.io/calico/pod2daemon-flexvol:v3.29.4\": capzcicommunity.azurecr.io/calico/pod2daemon-flexvol:v3.29.4: not found"" pod="calico-system/calico-node-bssgs" podUID="53cae7ce-964c-4de3-9047-994cb2b71cef"
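
In case it helps whoever populates the cache, a minimal sketch of pre-seeding it with az acr import (assuming the registry is capzcicommunity and that the image is pulled through from Docker Hub; the source registry is an assumption):

# Hypothetical one-off import of the missing image into the ACR cache.
# Source registry is assumed; adjust if the image is mirrored from elsewhere.
az acr import \
  --name capzcicommunity \
  --source docker.io/calico/pod2daemon-flexvol:v3.29.4 \
  --image calico/pod2daemon-flexvol:v3.29.4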

@marosset
Contributor Author

/test pull-cluster-api-provider-azure-conformance

3 similar comments
@marosset
Contributor Author

/test pull-cluster-api-provider-azure-conformance

@marosset
Contributor Author

/test pull-cluster-api-provider-azure-conformance

@marosset
Contributor Author

/test pull-cluster-api-provider-azure-conformance

@marosset
Contributor Author

/retest

1 similar comment
@marosset
Contributor Author

marosset commented Sep 3, 2025

/retest

jsturtevant and others added 2 commits September 4, 2025 09:16
* Upgrade to calico 3.29 and use windows support

Signed-off-by: James Sturtevant <[email protected]>

* Don't use images from MCR

Signed-off-by: James Sturtevant <[email protected]>

---------

Signed-off-by: James Sturtevant <[email protected]>
@marosset
Contributor Author

marosset commented Sep 4, 2025

/retest

@marosset
Contributor Author

marosset commented Sep 4, 2025

/test pull-cluster-api-provider-azure-e2e

@marosset
Contributor Author

marosset commented Sep 4, 2025

/test pull-cluster-api-provider-azure-conformance-with-ci-artifacts-dra

@marosset
Contributor Author

marosset commented Sep 4, 2025

/test pull-cluster-api-provider-azure-conformance

@marosset
Contributor Author

marosset commented Sep 5, 2025

/test pull-cluster-api-provider-azure-conformance-azl3-with-ci-artifacts

@marosset
Contributor Author

/skip pull-cluster-api-provider-azure-conformance-azl3-with-ci-artifacts

@marosset
Contributor Author

/test pull-cluster-api-provider-azure-conformance-with-ci-artifacts-dra

@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: c0fc47f9df3510b51aac80483a943d0187696280

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: nojnhuh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 11, 2025
@marosset
Contributor Author

/retest

@nojnhuh
Contributor

nojnhuh commented Sep 11, 2025

@marosset I'm seeing logs like this for most nodes across all the e2e jobs:

Failed to get logs for Machine capz-06jj60-md-0-xdh7f-vv66g, Cluster default/capz-06jj60: running command "cat /var/log/calico/cni/cni.log": Process exited with status 1

In particular, that seems to be bottlenecking the scale jobs where collecting logs from each node is taking a whole minute:

Failed to get logs for Machine capz-06jj60-md-0-xdh7f-vv66g, Cluster default/capz-06jj60: running command "cat /var/log/calico/cni/cni.log": Process exited with status 1
  Sep 11 21:59:39.640: INFO: Collecting logs for Linux node capz-06jj60-md-0-xdh7f-vxb9f in cluster capz-06jj60 in namespace default
  Sep 11 22:00:40.543: INFO: Collecting boot logs for resource group capz-06jj60, VM capz-06jj60-md-0-xdh7f-vxb9f
Failed to get logs for Machine capz-06jj60-md-0-xdh7f-vxb9f, Cluster default/capz-06jj60: running command "cat /var/log/calico/cni/cni.log": Process exited with status 1
  Sep 11 22:00:41.167: INFO: Collecting logs for Linux node capz-06jj60-md-0-xdh7f-w4pgh in cluster capz-06jj60 in namespace default
  Sep 11 22:01:41.169: INFO: Collecting boot logs for resource group capz-06jj60, VM capz-06jj60-md-0-xdh7f-w4pgh
Failed to get logs for Machine capz-06jj60-md-0-xdh7f-w4pgh, Cluster default/capz-06jj60: running command "cat /var/log/calico/cni/cni.log": Process exited with status 1
  Sep 11 22:01:41.797: INFO: Collecting logs for Linux node capz-06jj60-md-0-xdh7f-wcs98 in cluster capz-06jj60 in namespace default
  Sep 11 22:02:42.731: INFO: Collecting boot logs for resource group capz-06jj60, VM capz-06jj60-md-0-xdh7f-wcs98
Failed to get logs for Machine capz-06jj60-md-0-xdh7f-wcs98, Cluster default/capz-06jj60: running command "cat /var/log/calico/cni/cni.log": Process exited with status 1
  Sep 11 22:02:43.430: INFO: Collecting logs for Linux node capz-06jj60-md-0-xdh7f-x6cm2 in cluster capz-06jj60 in namespace default
  Sep 11 22:03:43.431: INFO: Collecting boot logs for resource group capz-06jj60, VM capz-06jj60-md-0-xdh7f-x6cm2
Failed to get logs for Machine capz-06jj60-md-0-xdh7f-x6cm2, Cluster default/capz-06jj60: running command "cat /var/log/calico/cni/cni.log": Process exited with status 1
  Sep 11 22:03:44.088: INFO: Collecting logs for Linux node capz-06jj60-md-0-xdh7f-xfp8s in cluster capz-06jj60 in namespace default

Could that be related to this change?

@marosset
Contributor Author

@marosset I'm seeing logs like this for most nodes across all the e2e jobs:

Failed to get logs for Machine capz-06jj60-md-0-xdh7f-vv66g, Cluster default/capz-06jj60: running command "cat /var/log/calico/cni/cni.log": Process exited with status 1

In particular, that seems to be bottlenecking the scale jobs where collecting logs from each node is taking a whole minute.

Could that be related to this change?


No idea but I'll try and look into it

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 11, 2025
@nojnhuh
Contributor

nojnhuh commented Sep 12, 2025

Hm, I'm still seeing that same error even with the volume change. Firing up a cluster I see this on one of the nodes:

capi@aso-20320-controlplane-hj2gb:~$ cat /var/log/calico/cni/cni.log
cat: /var/log/calico/cni/cni.log: Permission denied
capi@aso-20320-controlplane-hj2gb:~$ ll /var/log/calico/cni/cni.log 
-rw------- 1 root root 108794 Sep 12 03:58 /var/log/calico/cni/cni.log

Building the exact same cluster with the older Calico version I see this:

capi@aso-23097-controlplane-fmsbp:~$ ll /var/log/calico/cni/cni.log 
-rw-r--r-- 1 root root 85870 Sep 12 04:07 /var/log/calico/cni/cni.log

A sudo on the cat seems to do the trick with the newer Calico.
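
(A quick way to confirm the mode difference above, assuming the same path; the file is 0600 on the new version vs 0644 on the old:)

# Hypothetical check on a 3.29 node: the log is root-only (mode 600),
# so a plain cat as the capi user fails while sudo succeeds.
stat -c '%a %U %G' /var/log/calico/cni/cni.log    # prints: 600 root root
sudo tail -n 1 /var/log/calico/cni/cni.log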

I opened #5868 to make these things easier to spot next time.

@nojnhuh
Contributor

nojnhuh commented Sep 12, 2025

A sudo on the cat seems to do the trick with the newer Calico.

This change alone seems to fix things even without the volume changes. Happy to keep those anyway if there's some other benefit, otherwise I think we can revert them.

@marosset
Contributor Author

A sudo on the cat seems to do the trick with the newer Calico.

This change alone seems to fix things even without the volume changes. Happy to keep those anyway if there's some other benefit, otherwise I think we can revert them.

Ah - I'll drop the volume changes and add a sudo to our logging script
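
Roughly, the change to the collection command looks like this (a sketch; the script name and exact path in the repo may differ):

# before: fails with "Permission denied" against Calico v3.29's root-only log
cat /var/log/calico/cni/cni.log
# after: read the log with elevated privileges
sudo cat /var/log/calico/cni/cni.log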

@marosset
Contributor Author

@nojnhuh - I see cni.log files for the machines now.
Thanks again!

Contributor

@nojnhuh nojnhuh left a comment


/lgtm

I'll leave the hold in case you're planning to squash this into one or a few commits. Feel free to drop that when you're done. As long as you don't rebase while you're cleaning things up, the LGTM should stick.

Thanks!
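
(If it helps: since Prow pins the lgtm label to the git tree hash, squashing in place keeps the label as long as the final tree is unchanged. A sketch, assuming an upstream remote pointing at kubernetes-sigs:)

# Squash the branch into one commit without changing the resulting tree.
git reset --soft $(git merge-base upstream/main HEAD)
git commit -m "Update Calico to 3.29"
git push --force-with-lease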

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 12, 2025
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 317d128ac674e8db02b7f456983d710692318edb

@nojnhuh
Contributor

nojnhuh commented Sep 12, 2025

#5866 should fix the AKS job

/test pull-cluster-api-provider-azure-e2e-aks

@marosset
Contributor Author

/test pull-cluster-api-provider-azure-e2e

@marosset
Contributor Author

/hold cancel
This should be fine to merge once CI passes!

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 12, 2025
@k8s-ci-robot
Contributor

@marosset: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
pull-cluster-api-provider-azure-conformance-azl3-with-ci-artifacts | cef4469 | link | false | /test pull-cluster-api-provider-azure-conformance-azl3-with-ci-artifacts
pull-cluster-api-provider-azure-load-test-custom-builds | 21d54b5 | link | false | /test pull-cluster-api-provider-azure-load-test-custom-builds
pull-cluster-api-provider-azure-conformance-with-ci-artifacts-dra | 7e5d556 | link | false | /test pull-cluster-api-provider-azure-conformance-with-ci-artifacts-dra

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@nojnhuh
Contributor

nojnhuh commented Sep 12, 2025

@k8s-ci-robot k8s-ci-robot merged commit 8f913a9 into kubernetes-sigs:main Sep 12, 2025
29 of 30 checks passed
@github-project-automation github-project-automation bot moved this from Todo to Done in CAPZ Planning Sep 12, 2025
@marosset marosset deleted the upgrade-calico branch September 15, 2025 16:13
@willie-yao willie-yao mentioned this pull request Sep 22, 2025
4 tasks
@nojnhuh nojnhuh added this to the v1.22 milestone Sep 26, 2025