
Conversation

Member

@SungJin1212 SungJin1212 commented Sep 18, 2025

During an ingester rolling update, an ingester may temporarily be in the stopping or starting state, in which case AllUserStats returns an error:

func (i *Ingester) AllUserStats(_ context.Context, _ *client.UserStatsRequest) (*client.UsersStatsResponse, error) {
	if err := i.checkRunning(); err != nil {
		// Fails while the ingester is stopping or starting.
		return nil, err
	}
	...
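
For context, checkRunning rejects requests unless the ingester service is in the Running state. A rough sketch of the shape of that check (a sketch only, not the exact Cortex source):

// Sketch only; the real check lives in the Cortex ingester package.
func (i *Ingester) checkRunning() error {
	if s := i.State(); s != services.Running {
		// Stopping/Starting ingesters land here during a rolling update.
		return status.Error(codes.Unavailable, s.String())
	}
	return nil
}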

Returning that error causes the /distributor/all_user_stats API to fail, because the distributor aborts its loop over ingesters on the first error:

for _, ingester := range replicationSet.Instances {
	client, err := d.ingesterPool.GetClientFor(ingester.Addr)
	if err != nil {
		return nil, err
	}
	resp, err := client.(ingester_client.IngesterClient).AllUserStats(ctx, req)
	if err != nil {
		return nil, err // causes a 500 during an ingester rolling update
	}
	...

This PR changes the loop to continue past such errors, which keeps the /distributor/all_user_stats API working during rolling updates.
An e2e test covers the rolling-update scenario.
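
For illustration, a minimal sketch of the adjusted loop based on the excerpt above (the merged change may differ in detail; the stats-merging step is elided):

for _, ingester := range replicationSet.Instances {
	client, err := d.ingesterPool.GetClientFor(ingester.Addr)
	if err != nil {
		return nil, err
	}
	resp, err := client.(ingester_client.IngesterClient).AllUserStats(ctx, req)
	if err != nil {
		// Expected while an ingester is stopping or starting during a
		// rolling update; skip it instead of failing the whole request.
		continue
	}
	// merge resp into the per-user totals
	...
}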

Which issue(s) this PR fixes:
Fixes #

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

// in the stopping or starting state. Therefore, returning an error would
// cause the API to fail during the update. This is expected in that
// scenario, so we continue the loop to keep the API working.
continue
Member


I don't like that we are returning incomplete stats after this change.
I think I prefer the current behavior.

Can you clarify why incomplete stats are OK?

Member Author

@SungJin1212 SungJin1212 Sep 19, 2025


In my setup, an internal batch job calls the /distributor/all_user_stats API. Since our ingester deployments take over 6 hours, all of those jobs fail during the update.

What do you think about adding a dedicated column for the effective replication factor?
During a deployment, some tenants may be covered by only 2 replicas, but their stats would not actually be incomplete.

Member

The reason will be displayed to describe this comment to others. Learn more.

Thanks for explaining the context. 🙏

Hmm, I went through the code again, and the stats do come from healthy ingesters only. So my understanding is that the health check can be outdated because of how ring state propagates via memberlist.

Also, my understanding is that this API is very flaky in large clusters: there is always an ingester stopping somewhere. We should definitely fix that.

I would be fine with this change if we also add another field to the stats expressing how many ingesters were queried; it would be perfect if that number also shows up in

const tpl = `

@SungJin1212 wdyt?

Member Author

The reason will be displayed to describe this comment to others. Learn more.

Yup, it would be good to add a line showing how many ingesters were queried.
If not all ingesters are reflected, the stats could fluctuate between calls. Would that be acceptable?

Member Author

The reason will be displayed to describe this comment to others. Learn more.

I added a dedicated # Queried Ingesters column and a line showing how many ingesters are reflected in total.
[Screenshot, 2025-09-22 3:43 PM]
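
For illustration, a sketch of how such a counter might flow from the distributor loop into the page (the variable, field, and template names here are hypothetical, not necessarily those used in the PR):

queriedIngesters := 0
for _, ingester := range replicationSet.Instances {
	...
	resp, err := client.(ingester_client.IngesterClient).AllUserStats(ctx, req)
	if err != nil {
		continue // skipped ingesters do not count as queried
	}
	queriedIngesters++
	...
}
// Pass queriedIngesters to the page data so the template can render a line like:
// <p>Queried {{ .QueriedIngesters }} of {{ .TotalIngesters }} ingesters.</p>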

@SungJin1212 SungJin1212 force-pushed the Fix-AllUserStats-when-rolling-update branch from 5bcf694 to 2f2443b on September 22, 2025 at 06:42
@pull-request-size pull-request-size bot added size/L and removed size/M labels Sep 22, 2025
Signed-off-by: SungJin1212 <[email protected]>
Member

@friedrichg friedrichg left a comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks!

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Sep 22, 2025
Labels
component/distributor lgtm This PR has been approved by a maintainer size/L