Process apiGroup in capi provider #8410
Conversation
Welcome @wjunott!
Hi @wjunott. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
    apiGroup, ok := infraref["apiGroup"]
    if ok {
        if apiversion, err = getAPIGroupPreferredVersion(r.controller.managementDiscoveryClient, apiGroup); err != nil {
If I see correctly this is doing a live call against the apiserver. I'm wondering if 1 live call for every call of readInfrastructureReferenceResource is too much
Should we use a cache with a TTL to cache the apiGroup => version mapping? (ttl: 1m or 10m?)
(we can use client-go/tools/cache.NewTTLStore for that)
Good point. I think this API is currently invoked only during scale up/down. @elmiko any advice on where to put the cache?
If it's okay to always do a live call here because this isn't called too often, absolutely fine for me of course (I just don't know :))
these calls will only happen when the core autoscaler wants to construct a node template. if the autoscaler has a ready node from the node group, then it will use a node as a template instead of asking the provider to generate a new template (where this function is called).
in the worst case scenario, this function will get called once per node group per scan interval from the autoscaler, which defaults to 10 seconds. in a large cluster this could be called several times for the same template depending on how the cluster-api resources are organized.
i think it's worth investigating putting a cache in for the infrastructure templates as they probably won't change that frequently and it could save us some api calls.
Sounds like we don't necessarily need caching. If I see correctly the getInfrastructureResource below is also not cached? So this won't add much on top
getInfrastructureResource uses the informer's cache.
this makes sense to me. i have some suggestions about the error messages, and i tend to agree with @sbueringer about caching.
although, if we feel adding caching to this PR will make it too complex, i'm fine to review it in a followup.
    klog.V(4).Info("Missing apiVersion")
    return nil, errors.New("Missing apiVersion")
i'd like to add a little more information here to help with triage
Suggested change:
    - klog.V(4).Info("Missing apiVersion")
    - return nil, errors.New("Missing apiVersion")
    + errorMsg := fmt.Sprintf("missing apiVersion for infrastructureRef of scalable resource %q", r.unstructured.GetName())
    + klog.V(4).Info(errorMsg)
    + return nil, errors.New(errorMsg)
Added more detailed information.
@elmiko @sbueringer I created a commit to support a cached preferred version of an apiGroup with about 24 lines of change. How about we discuss further whether we still need the cached version, and if so I will create a new PR after this one is merged? Given that only scale from zero accesses the apiserver to get the preferred version of an apiGroup.
i think a cache would be helpful to reduce the number of api calls that the cluster-api provider makes. i'm not sure that it is absolutely required, but it would be interesting to test it out.
under normal operation, the cluster-api provider can generate many log lines reporting client-side throttling. i would think that having a cache would help us to reduce the frequency of calls.
Waited for 174.987663ms due to client-side throttling, not priority and fairness, request: <details of HTTP request>
OK, I will create a new PR with cache enabled after this PR is tested and merged.
Force-pushed 619b909 to 53e8f19
Force-pushed d0de2e3 to 1ca5f44
/ok-to-test
/lgtm
/hold for @elmiko to sign off
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: jackfrancis, wjunott. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.
sorry, i didn't get a chance to review today. i'm adding it to my queue for tomorrow.
i think this is looking good. i'd like to improve the tests, but that can come next. /unhold
@elmiko let's backport? (xref: testing this PR in CAPI here: kubernetes-sigs/cluster-api#12643)
Also verified this fix and the previous one for scale to 0 locally with the debugger. Looks perfect. @elmiko When is the next patch release planned?
/cherry-pick cluster-autoscaler-release-1.30
/cherry-pick cluster-autoscaler-release-1.31
/cherry-pick cluster-autoscaler-release-1.32
/cherry-pick cluster-autoscaler-release-1.33
@jackfrancis: new pull request created: #8452
@jackfrancis: new pull request created: #8453
@jackfrancis: new pull request created: #8454
@jackfrancis: new pull request created: #8455
thanks @jackfrancis !
good question, i can bring this up at the next sig meeting.
looks like something changed in the UT foundations, maybe prior to this PR change, which is missing in release branches prior to 1.33. I haven't had time to look into it yet. tl;dr: we probably need to cherry-pick some other stuff into < 1.32 before this one will slot in cleanly.
What type of PR is this?
/kind bug
What this PR does / why we need it:
With capi v1beta2, a MachineDeployment or MachineSet's infrastructureRef field has changed from ObjectReference to ContractVersionedObjectReference. We need to process the difference between apiVersion and apiGroup.
Which issue(s) this PR fixes:
Fixes #8330
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: