Process apiGroup in capi provider #8410
Conversation
Welcome @wjunott!
Hi @wjunott. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
    apiGroup, ok := infraref["apiGroup"]
    if ok {
        if apiversion, err = getAPIGroupPreferredVersion(r.controller.managementDiscoveryClient, apiGroup); err != nil {
If I see correctly this is doing a live call against the apiserver. I'm wondering if 1 live call for every call of readInfrastructureReferenceResource is too much
Should we use a cache with a TTL to cache the apiGroup => version mapping? (ttl: 1m or 10m?)
(we can use client-go/tools/cache.NewTTLStore for that)
Good point. I think this API is currently invoked only during scale up/down. @elmiko any advice on where to put the cache?
If it's okay to always do a live call here because this isn't called too often, absolutely fine for me of course (I just don't know :))
these calls will only happen when the core autoscaler wants to construct a node template. if the autoscaler has a ready node from the node group, then it will use a node as a template instead of asking the provider to generate a new template (where this function is called).
in the worst case scenario, this function will get called once per node group per scan interval from the autoscaler, which defaults to 10 seconds. in a large cluster this could be called several times for the same template depending on how the cluster-api resources are organized.
i think it's worth investigating putting a cache in for the infrastructure templates as they probably won't change that frequently and it could save us some api calls.
Sounds like we don't necessarily need caching. If I see correctly the getInfrastructureResource below is also not cached? So this won't add much on top
getInfrastructureResource uses the informer's cache.
this makes sense to me. i have some suggestions about the error messages, and i tend to agree with @sbueringer about caching.
although, if we feel adding caching to this PR will make it too complex, i'm fine to review it in a followup.
    klog.V(4).Info("Missing apiVersion")
    return nil, errors.New("Missing apiVersion")
i'd like to add a little more information here to help with triage
Suggested change:
    - klog.V(4).Info("Missing apiVersion")
    - return nil, errors.New("Missing apiVersion")
    + errorMsg := fmt.Sprintf("missing apiVersion for infrastructureRef of scalable resource %q", r.unstructured.GetName())
    + klog.V(4).Info(errorMsg)
    + return nil, errors.New(errorMsg)
Added more detailed information.
@elmiko @sbueringer I created a commit to support a cached preferred version of an apiGroup with about 24 lines of change. How about we discuss further whether we still need the cached version, and if so I will create a new PR after this one is merged? Given that only scale from zero accesses the apiserver to get the preferred version of an apiGroup.
i think a cache would be helpful to reduce the number of api calls that the cluster-api provider makes. i'm not sure that it is absolutely required, but it would be interesting to test it out.
under normal operation, the cluster-api provider can generate many log lines reporting client-side throttling. i would think that having a cache would help us to reduce the frequency of calls.
Waited for 174.987663ms due to client-side throttling, not priority and fairness, request: <details of HTTP request>
OK, I will create a new PR with cache enabled after this PR is tested and merged.
Force-pushed 619b909 to 53e8f19
Force-pushed d0de2e3 to 1ca5f44
/ok-to-test
/lgtm
/hold for @elmiko to sign off
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: jackfrancis, wjunott. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.
sorry, i didn't get a chance to review today. i'm adding it to my queue for tomorrow.
i think this is looking good. i'd like to improve the tests, but that can come next. /unhold
@elmiko let's backport? (xref: testing this PR in CAPI here: kubernetes-sigs/cluster-api#12643)
Also verified this fix and the previous one for scale to 0 locally with the debugger. Looks perfect. @elmiko When is the next patch release planned?
/cherry-pick cluster-autoscaler-release-1.30
/cherry-pick cluster-autoscaler-release-1.31
/cherry-pick cluster-autoscaler-release-1.32
/cherry-pick cluster-autoscaler-release-1.33
@jackfrancis: new pull request created: #8452
@jackfrancis: new pull request created: #8453
@jackfrancis: new pull request created: #8454
@jackfrancis: new pull request created: #8455
thanks @jackfrancis !
good question, i can bring this up at the next sig meeting.
looks like something changed in the UT foundations, maybe prior to this PR change, which is missing in release branches prior to 1.33. I haven't had time to look into it yet. tl;dr: we probably need to cherry-pick some other stuff into < 1.32 before this one will slot in cleanly.
What type of PR is this?
/kind bug
What this PR does / why we need it:
With capi v1beta2, a MachineDeployment or MachineSet's infrastructureRef field has changed from ObjectReference to ContractVersionedObjectReference. We need to process the difference between apiVersion and apiGroup.
Which issue(s) this PR fixes:
Fixes #8330
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: