-
Notifications
You must be signed in to change notification settings - Fork 832
fix AllUserStats during rolling updates on ingester #7026
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
fix AllUserStats during rolling updates on ingester #7026
Conversation
// in stopping or starting state. Therefore, returning an error would | ||
// cause the API to fail during the update. This is an expected error in | ||
// that scenario, we continue the loop to work API. | ||
continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like that we are returning incomplete stats after this change.
I think I prefer the current behavior.
can you clarify why incomplete stats are ok?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally, an internal batch job in my system calls the /distributor/all_user_stats
API. In my workloads, since ingester deployments take over 6 hours, all jobs are failing during the update.
What do you think about adding a dedicated column for the reflected replication factor?
In deployment, some tenants may have a 2 replication factor, but it would not be incomplete stats.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for explaining the context. 🙏
mmm. I went through the code again and the stats are coming from healthy ingesters. So my understanding is probably that check might be outdated due to the way the ring in memberlist works.
Also my understanding is this API is very flaky in large clusters. There is always an ingester stopping somewhere. We should definitely fix that.
I would be fine with this change if we also add another field to the stats that expresses how many ingesters have been queried, perfect if that number shows up also in
cortex/pkg/ingester/http_admin.go
Line 15 in 6436235
const tpl = ` |
@SungJin1212 wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, it would be good if we add a line explaining how many ingesters have been queried.
If not all ingesters are reflected, the stats results could be unstable. Would that be acceptable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Signed-off-by: SungJin1212 <[email protected]>
Signed-off-by: SungJin1212 <[email protected]>
5bcf694
to
2f2443b
Compare
Signed-off-by: SungJin1212 <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks!
During an ingester rolling update, an updating ingester might be temporarily in
stopping
orstarting
state, and it emits an error inAllUserStats
.Returning an error causes the
/distributor/all_user_stats
API to fail since the API returns an error in the loop.This PR allows the for loop to continue, which keeps the
/distributor/all_user_stats
API working during rolling updates.The e2e test shows the API works with the rolling update scenario.
Which issue(s) this PR fixes:
Fixes #
Checklist
CHANGELOG.md
updated - the order of entries should be[CHANGE]
,[FEATURE]
,[ENHANCEMENT]
,[BUGFIX]