🐛 fix: ClusterCache doesn't pick up the latest kubeconfig secret proactively #12400
base: main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Welcome @mogliang!
Hi @mogliang. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Force-pushed from 4142631 to 49bbe5d
added unit test
/ok-to-test
@@ -511,6 +512,16 @@ func (cc *clusterCache) Reconcile(ctx context.Context, req reconcile.Request) (r
requeueAfterDurations = append(requeueAfterDurations, accessor.config.HealthProbe.Interval)
You are hitting tooManyConsecutiveFailures in your scenario, right?
Would it also be enough for your use case to make HealthProbe.Timeout/Interval/FailureThreshold configurable?
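For illustration only, a hypothetical sketch of what tunable probe settings could look like; the HealthProbeConfig type and its fields below are assumptions for discussion (only the Timeout/Interval/FailureThreshold names come from this comment), not the clustercache API:

```go
// Hypothetical sketch only, not the clustercache API: a config shape that
// would make the probe thresholds tunable per use case.
package main

import (
	"fmt"
	"time"
)

// HealthProbeConfig is an illustrative stand-in for whatever struct holds the
// accessor's probe settings (the diff above references
// accessor.config.HealthProbe.Interval).
type HealthProbeConfig struct {
	Timeout          time.Duration // per-request timeout for a single probe
	Interval         time.Duration // how often the probe runs (drives requeueAfter)
	FailureThreshold int           // consecutive failures before the cluster is treated as unreachable
}

func main() {
	cfg := HealthProbeConfig{
		Timeout:          5 * time.Second,
		Interval:         10 * time.Second,
		FailureThreshold: 5,
	}
	fmt.Printf("probe settings: %+v\n", cfg)
}
```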
Not really. We have a proxy between the management cluster and the target clusters.
We see that after the kubeconfig is updated (the proxy address changed), the existing connection (used by the clustercache probe) still works, so ClusterCache doesn't refetch the kubeconfig and keeps caching the old one, while new connections (e.g. from the etcd client) fail.
So if the health check opened a new connection, it would detect this, right?
That's correct. Another idea is to disconnect after a given period (e.g. 5m) to force a connection refresh. Would that be better?
I would just try to extend the health check so that it does two requests: one over the existing connection and one over a new connection.
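A minimal sketch of that idea, assuming the accessor keeps both its long-lived clientset and the rest.Config it was built from; healthProbe, existing, and restConfig are illustrative names, not the actual clustercache code:

```go
// Sketch of a health check that probes twice: once over the accessor's
// existing (connection-reusing) client, and once over a client that is forced
// to dial a new connection. Setting a custom Dial function keeps client-go's
// TLS transport cache from handing back a shared transport, so the second
// request cannot reuse an already-open TCP connection.
package healthprobe

import (
	"context"
	"fmt"
	"net"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func healthProbe(ctx context.Context, existing *kubernetes.Clientset, restConfig *rest.Config) error {
	// 1) Probe over the existing client; this typically reuses a pooled connection,
	//    so it can keep succeeding even after the kubeconfig endpoint changed.
	if _, err := existing.Discovery().RESTClient().Get().AbsPath("/").DoRaw(ctx); err != nil {
		return fmt.Errorf("probe over existing connection failed: %w", err)
	}

	// 2) Probe over a fresh client that has to dial a new connection, which
	//    exercises the endpoint currently stored in the cached kubeconfig.
	cfg := rest.CopyConfig(restConfig)
	cfg.Timeout = 10 * time.Second
	cfg.Dial = (&net.Dialer{Timeout: 10 * time.Second}).DialContext // non-nil Dial => transport is not taken from the cache

	fresh, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return fmt.Errorf("building client for fresh-connection probe: %w", err)
	}
	if _, err := fresh.Discovery().RESTClient().Get().AbsPath("/").DoRaw(ctx); err != nil {
		return fmt.Errorf("probe over new connection failed: %w", err)
	}
	return nil
}
```

If the second request fails while the first succeeds, that would be a strong signal that the cached kubeconfig is stale and should be refetched.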
Let me check it.
@sbueringer
https://github.com/kubernetes-sigs/cluster-api/compare/release-1.9...mogliang:cluster-api:dev/qliang/v1.9.5-fixetcd2?expand=1
I copied the implementation from tlscache.get.
I need to mention that establishing a TCP connection takes ~1s in our case, while a normal ping takes ~100ms, so this does add some reconcile time.
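For context on those latency numbers, a rough sketch (not the code from the branch linked above) that times how long a brand-new TLS connection to the API server endpoint takes, using the TLS settings from the cached rest.Config; dialCheck and restConfig are illustrative names:

```go
// Rough sketch: open (and immediately close) a fresh TLS connection to the
// API server endpoint from the cached rest.Config and report how long the
// dial took, e.g. ~1s through the relay proxy vs ~100ms for a request over
// an already-open connection.
package dialcheck

import (
	"crypto/tls"
	"net"
	"net/url"
	"time"

	"k8s.io/client-go/rest"
)

func dialCheck(restConfig *rest.Config) (time.Duration, error) {
	u, err := url.Parse(restConfig.Host)
	if err != nil {
		return 0, err
	}
	addr := u.Host
	if u.Port() == "" {
		addr = net.JoinHostPort(u.Hostname(), "443") // assume the default HTTPS port if none is given
	}

	tlsConfig, err := rest.TLSConfigFor(restConfig)
	if err != nil {
		return 0, err
	}

	start := time.Now()
	conn, err := tls.DialWithDialer(&net.Dialer{Timeout: 10 * time.Second}, "tcp", addr, tlsConfig)
	if err != nil {
		return 0, err
	}
	_ = conn.Close()
	return time.Since(start), nil
}
```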
Oh wow. 1s is really not good :)
And yeah, it's a lot of code to copy.
Do we know what the etcd client is doing in the case that is failing for you? I would have hoped that that doesn't take 1s (at least in our scale tests the entire KCP reconcile was much faster than that)
Maybe we can do something more lightweight
Our environment is a bit complex: the management cluster is on the cloud side and the target clusters are in users' local offices, so a relay connection is needed between them.
When the management cluster tries to establish a new TCP connection to a target cluster, some work has to be done on the relay proxy side, so it takes a while.
From what I've seen, the kube client normally reuses the underlying TCP connection. However, when KCP uses the etcd client, it establishes new underlying TCP connections, which may take a bit more time.
But what blocks us is that the kubeconfig in ClusterCache may already be outdated (it doesn't have the latest relay info), and the etcd client cannot use it to create a connection.
Force-pushed from 2a67216 to 36e1ea5
Force-pushed from 730e1a5 to 754d34a
/hold
What this PR does / why we need it:
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #12399
/area clustercache