Skip to content

Track more snapshot-releated node-level stats #130301

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

nicktindall
Copy link
Contributor

@nicktindall nicktindall commented Jun 30, 2025

Adds additional snapshot metrics and publishes them via APM

Apologies for the size of this change, but most of it is plumbing. The change itself is quite small.

Relates: ES-12055, ES-11927

@elasticsearchmachine elasticsearchmachine added the serverless-linked Added by automation, don't add manually label Jun 30, 2025
@nicktindall nicktindall added >non-issue :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs labels Jun 30, 2025
@nicktindall nicktindall requested a review from ywangd July 10, 2025 03:57
@nicktindall nicktindall requested a review from DaveCTurner July 10, 2025 06:33
Comment on lines 473 to 476
+ ", startTime="
+ startTime
+ startTimeMillis
+ ", totalTime="
+ totalTime
+ totalTimeMillis
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: could rename the labels here too

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved in 9bea854 and also added units to the immutable copy object

this(entries, stateSummaries.v1(), stateSummaries.v2());
}

private static Tuple<Map<State, Integer>, Map<ShardState, Integer>> calculateStateSummaries(List<Entry> entries) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I think this means we do this computation on every node now which seems wasteful. Could we do it in SnapshotsService still, just on the master?

When I suggested doing this in applyClusterState I meant just updating the existing stats according to the new cluster state, not computing everything from scratch. If we have to do it from scratch every time then I guess it'd be better to happen on the stats-collection thread rather than the cluster applier. At least we could cache the results assuming they won't change before the next stats collection?

Copy link
Contributor Author

@nicktindall nicktindall Jul 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh right yes, that is wasteful. My thinking was to avoid changing the serialization (hence deriving the stats in the ByRepo constructor), and I thought that for many cluster state updates the bulk of the structure doesn't change, so we wouldn't create ByRepo instances unless there was a change, but even despite that we still do quite a bit of unnecessary work.

I changed this in 87864ec to calculate the stats on the metric thread, and only re-calculate them when the cluster state changed. I also added some more logic in 74c3427 to only recalculate the metrics when the SnapshotsInProgress instance changes, because I think this will avoid re-calculating the metrics every time something unrelated in the cluster state changes.

I'm assuming the object-equality-on-no-change thing holds for the customs as well. The check against cluster state version may now be redundant, but I guess it might save fetching the SnapshotsInProgress sometimes.

I don't think we can do this more cleverly in the applier because SnapshotsInProgress seems to lack the ability to query what changed like some of the other cluster state implements, but I may be missing something. I think by doing all the caching and recalculation on the metrics thread we can also avoid making the SnapshotStats field volatile, but if we invalidate it from the applier thread it would have to be.

Edit: some related tidying in 346c15f

@nicktindall nicktindall requested a review from DaveCTurner July 16, 2025 06:17
Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM (one comment nit, one other question, but nothing blocking)

@@ -178,6 +169,16 @@ public void awaitIdle() {

}

@Override
public LongWithAttributes getShardSnapshotsInProgress() {
return null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to throw here too?

Suggested change
return null;
throw createUnknownTypeException();

If not, could we have a comment saying why we return null?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thinking was we may end up with an unknown type repository alongside other valid repositories, and throwing an exception here might interfere with metric collection for the valid repos (I'm not sure how long that state could persist). Returning null signifies that the repository doesn't track shard snapshots in progress so it's effectively a no-op.

If that makes sense I'll add a comment to that effect?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added d3b583a

// Can't get shards for clone entry
continue;
}
for (ShardSnapshotStatus shardSnapshotStatus : entry.shards().values()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok AIUI we now only do this big nested loop when updating the stats cache, which happens when collecting stats from APM and therefore can be disabled by turning off APM right?

Copy link
Contributor Author

@nicktindall nicktindall Aug 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it'll only happen

  • when either "shards by state" or "snapshots by state" metrics is requested
  • AND the cluster state UUID is not the same as the one used to last calculate these numbers
  • AND the snapshots in progress custom is not instance equal to the one used to last calculate these numbers

It should be calculated on the metrics thread and disabling APM will stop it occurring. I could add a dynamic setting to disable these specific metrics if we want to be risk averse.

# Conflicts:
#	server/src/main/java/org/elasticsearch/TransportVersions.java
#	server/src/main/java/org/elasticsearch/snapshots/SnapshotsService.java
@nicktindall nicktindall merged commit 9890f98 into elastic:main Aug 11, 2025
33 checks passed
@nicktindall nicktindall deleted the ES-12055_track_snapshot_stats_as_metrics branch August 11, 2025 07:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >non-issue serverless-linked Added by automation, don't add manually Team:Distributed Coordination Meta label for Distributed Coordination team v9.2.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants