
Conversation

Member

@SungJin1212 SungJin1212 commented Sep 18, 2025

During an ingester rolling update, an ingester may temporarily be in the stopping or starting state, in which case AllUserStats returns an error:

func (i *Ingester) AllUserStats(_ context.Context, _ *client.UserStatsRequest) (*client.UsersStatsResponse, error) {
	if err := i.checkRunning(); err != nil {
		// Fails while the ingester is stopping or starting.
		return nil, err
	}
	...
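
For context, checkRunning rejects requests unless the ingester service is in the Running state. A rough sketch of the shape of that check (a sketch only, not the exact Cortex source):

// Sketch only; the real check lives in the Cortex ingester package.
func (i *Ingester) checkRunning() error {
	if s := i.State(); s != services.Running {
		// Stopping/Starting ingesters land here during a rolling update.
		return status.Error(codes.Unavailable, s.String())
	}
	return nil
}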

Returning that error causes the /distributor/all_user_stats API to fail, because the distributor aborts its loop over ingesters on the first error:

for _, ingester := range replicationSet.Instances {
	client, err := d.ingesterPool.GetClientFor(ingester.Addr)
	if err != nil {
		return nil, err
	}
	resp, err := client.(ingester_client.IngesterClient).AllUserStats(ctx, req)
	if err != nil {
		return nil, err // causes a 500 during an ingester rolling update
	}
	...

This PR changes the loop to continue past such errors, which keeps the /distributor/all_user_stats API working during rolling updates.
An e2e test covers the rolling-update scenario.
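
For illustration, a minimal sketch of the adjusted loop based on the excerpt above (the merged change may differ in detail; the stats-merging step is elided):

for _, ingester := range replicationSet.Instances {
	client, err := d.ingesterPool.GetClientFor(ingester.Addr)
	if err != nil {
		return nil, err
	}
	resp, err := client.(ingester_client.IngesterClient).AllUserStats(ctx, req)
	if err != nil {
		// Expected while an ingester is stopping or starting during a
		// rolling update; skip it instead of failing the whole request.
		continue
	}
	// merge resp into the per-user totals
	...
}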

Which issue(s) this PR fixes:
Fixes #

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

// in the stopping or starting state. Therefore, returning an error would
// cause the API to fail during the update. This is expected in that
// scenario, so we continue the loop to keep the API working.
continue
Member


I don't like that we are returning incomplete stats after this change.
I think I prefer the current behavior.

Can you clarify why incomplete stats are OK?

Member Author

@SungJin1212 SungJin1212 Sep 19, 2025


In my setup, an internal batch job calls the /distributor/all_user_stats API. Since our ingester deployments take over 6 hours, all of those jobs fail during the update.

What do you think about adding a dedicated column for the effective replication factor?
During a deployment, some tenants may be covered by only 2 replicas, but their stats would not actually be incomplete.

Member

The reason will be displayed to describe this comment to others. Learn more.

Thanks for explaining the context. 🙏

Hmm, I went through the code again, and the stats do come from healthy ingesters only. So my understanding is that the health check can be outdated because of how ring state propagates via memberlist.

Also, my understanding is that this API is very flaky in large clusters: there is always an ingester stopping somewhere. We should definitely fix that.

I would be fine with this change if we also add another field to the stats expressing how many ingesters were queried; it would be perfect if that number also shows up in

const tpl = `

@SungJin1212 wdyt?

Member Author

The reason will be displayed to describe this comment to others. Learn more.

Yup, it would be good to add a line showing how many ingesters were queried.
If not all ingesters are reflected, the stats could fluctuate between calls. Would that be acceptable?

Member Author

The reason will be displayed to describe this comment to others. Learn more.

I added a dedicated # Queried Ingesters column and a line showing how many ingesters are reflected in total.
[Screenshot, 2025-09-22 3:43 PM]
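
For illustration, a sketch of how such a counter might flow from the distributor loop into the page (the variable, field, and template names here are hypothetical, not necessarily those used in the PR):

queriedIngesters := 0
for _, ingester := range replicationSet.Instances {
	...
	resp, err := client.(ingester_client.IngesterClient).AllUserStats(ctx, req)
	if err != nil {
		continue // skipped ingesters do not count as queried
	}
	queriedIngesters++
	...
}
// Pass queriedIngesters to the page data so the template can render a line like:
// <p>Queried {{ .QueriedIngesters }} of {{ .TotalIngesters }} ingesters.</p>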

@SungJin1212 SungJin1212 force-pushed the Fix-AllUserStats-when-rolling-update branch from 5bcf694 to 2f2443b on September 22, 2025 at 06:42
@pull-request-size pull-request-size bot added size/L and removed size/M labels Sep 22, 2025
Signed-off-by: SungJin1212 <[email protected]>
Member

@friedrichg friedrichg left a comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks!

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Sep 22, 2025
Labels
component/distributor lgtm This PR has been approved by a maintainer size/L