Track more snapshot-related node-level stats #130301

nicktindall · 2025-06-30T01:05:05Z

Adds additional snapshot metrics and publishes them via APM

Apologies for the size of this change, but most of it is plumbing. ~~The change itself is quite small.~~

Relates: ES-12055, ES-11927


DaveCTurner · 2025-07-14T07:32:57Z

server/src/main/java/org/elasticsearch/index/snapshots/IndexShardSnapshotStatus.java

            + ", startTime="
-            + startTime
+            + startTimeMillis
            + ", totalTime="
-            + totalTime
+            + totalTimeMillis


nit: could rename the labels here too

Resolved in 9bea854 and also added units to the immutable copy object

DaveCTurner · 2025-07-14T07:47:27Z

server/src/main/java/org/elasticsearch/cluster/SnapshotsInProgress.java

+            this(entries, stateSummaries.v1(), stateSummaries.v2());
+        }
+
+        private static Tuple<Map<State, Integer>, Map<ShardState, Integer>> calculateStateSummaries(List<Entry> entries) {


Hmm I think this means we do this computation on every node now which seems wasteful. Could we do it in SnapshotsService still, just on the master?

When I suggested doing this in applyClusterState I meant just updating the existing stats according to the new cluster state, not computing everything from scratch. If we have to do it from scratch every time then I guess it'd be better for it to happen on the stats-collection thread rather than the cluster applier. At least we could cache the results, assuming they won't change before the next stats collection?

Ahh right yes, that is wasteful. My thinking was to avoid changing the serialization (hence deriving the stats in the ByRepo constructor). I thought that for many cluster state updates the bulk of the structure doesn't change, so we wouldn't create ByRepo instances unless there was a change, but even so we'd still do quite a bit of unnecessary work.

I changed this in 87864ec to calculate the stats on the metric thread, and only re-calculate them when the cluster state changed. I also added some more logic in 74c3427 to only recalculate the metrics when the SnapshotsInProgress instance changes, because I think this will avoid re-calculating the metrics every time something unrelated in the cluster state changes.

I'm assuming the object-equality-on-no-change thing holds for the customs as well. The check against cluster state version may now be redundant, but I guess it might save fetching the SnapshotsInProgress sometimes.

I don't think we can do this more cleverly in the applier because SnapshotsInProgress seems to lack the ability to query what changed, unlike some of the other cluster state implementations, but I may be missing something. I think by doing all the caching and recalculation on the metrics thread we can also avoid making the SnapshotStats field volatile; if we invalidated it from the applier thread it would have to be volatile.

Edit: some related tidying in 346c15f
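
The caching described in this reply can be sketched roughly as follows. This is a minimal illustration, not the code from 87864ec or 74c3427: `CachedSnapshotStats`, `InProgress`, `ShardState`, and `shardsByState` are hypothetical names, and the sketch assumes that only the metrics thread ever calls `shardsByState()`, which is why the cache fields are not volatile.

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical sketch; none of these names come from the PR. */
final class CachedSnapshotStats {
    enum ShardState { INIT, SUCCESS, FAILED }

    /** Stand-in for an immutable cluster-state custom such as SnapshotsInProgress. */
    record InProgress(List<ShardState> shardStates) {}

    private final Supplier<InProgress> currentState;
    // Plain (non-volatile) fields: only the metrics thread is assumed to touch them.
    private InProgress lastSeen;
    private Map<ShardState, Integer> cachedCounts = new EnumMap<>(ShardState.class);
    int recalculations = 0; // exposed only so the example can show the cache working

    CachedSnapshotStats(Supplier<InProgress> currentState) {
        this.currentState = currentState;
    }

    /** Called from the metrics thread each time a gauge is collected. */
    Map<ShardState, Integer> shardsByState() {
        InProgress current = currentState.get();
        if (current != lastSeen) { // reference comparison, deliberately not equals()
            Map<ShardState, Integer> counts = new EnumMap<>(ShardState.class);
            for (ShardState state : current.shardStates()) {
                counts.merge(state, 1, Integer::sum);
            }
            cachedCounts = counts;
            lastSeen = current;
            recalculations++;
        }
        return cachedCounts;
    }
}
```

The reference comparison only helps if an unchanged custom is carried over as the same object between cluster states, which is the assumption stated above about object equality on no change.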


DaveCTurner

LGTM (one comment nit, one other question, but nothing blocking)

DaveCTurner · 2025-08-08T08:15:42Z

server/src/main/java/org/elasticsearch/repositories/UnknownTypeRepository.java

@@ -178,6 +169,16 @@ public void awaitIdle() {

    }

+    @Override
+    public LongWithAttributes getShardSnapshotsInProgress() {
+        return null;


Do we want to throw here too?

Suggested change

-        return null;
+        throw createUnknownTypeException();

If not, could we have a comment saying why we return null?

My thinking was we may end up with an unknown type repository alongside other valid repositories, and throwing an exception here might interfere with metric collection for the valid repos (I'm not sure how long that state could persist). Returning null signifies that the repository doesn't track shard snapshots in progress so it's effectively a no-op.

If that makes sense I'll add a comment to that effect?

Added d3b583a
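
To illustrate the reasoning for returning null, here is a minimal sketch of a collection loop that treats null as "nothing to report". It is not the Elasticsearch metrics code: `RepoGaugeCollection`, `Repo`, `Measurement`, and `collect` are hypothetical names, with `Measurement` standing in for something like `LongWithAttributes`. In a loop of this shape, a repository that threw instead would abort collection for the repositories after it unless every call were wrapped in a try/catch.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; the interface and names are illustrative, not the Elasticsearch API. */
final class RepoGaugeCollection {
    /** Stand-in for a measurement type such as LongWithAttributes. */
    record Measurement(String repository, long value) {}

    interface Repo {
        /** May return null to signal that this repository has nothing to report. */
        Measurement shardSnapshotsInProgress();
    }

    /** Gathers one measurement per repository, skipping those that return null. */
    static List<Measurement> collect(List<Repo> repos) {
        List<Measurement> results = new ArrayList<>();
        for (Repo repo : repos) {
            Measurement m = repo.shardSnapshotsInProgress();
            if (m != null) { // an unknown-type repository is simply left out
                results.add(m);
            }
        }
        return results;
    }
}
```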

DaveCTurner · 2025-08-08T08:23:26Z

server/src/main/java/org/elasticsearch/cluster/SnapshotsInProgress.java

+                    // Can't get shards for clone entry
+                    continue;
+                }
+                for (ShardSnapshotStatus shardSnapshotStatus : entry.shards().values()) {


OK, AIUI we now only do this big nested loop when updating the stats cache, which happens when collecting stats from APM and can therefore be disabled by turning off APM, right?

Yes, it'll only happen when:

- either the "shards by state" or the "snapshots by state" metric is requested, AND
- the cluster state UUID is not the same as the one used to last calculate these numbers, AND
- the snapshots-in-progress custom is not instance-equal to the one used to last calculate these numbers.

It should be calculated on the metrics thread and disabling APM will stop it occurring. I could add a dynamic setting to disable these specific metrics if we want to be risk averse.
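
For a sense of what the nested loop costs, here is a rough sketch of the per-state count. It is an assumption-laden stand-in, not the code under review: `ShardStateSummary`, `Entry`, `ShardState`, and `countShardsByState` are hypothetical, and a clone entry is modelled as a simple boolean flag. The work is proportional to the total number of shards across all in-progress, non-clone snapshot entries.

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of counting shards by state across in-progress snapshot entries. */
final class ShardStateSummary {
    enum ShardState { INIT, SUCCESS, FAILED }

    /** Stand-in for a snapshot entry; shards maps a shard identifier to its state. */
    record Entry(boolean isClone, Map<String, ShardState> shards) {}

    static Map<ShardState, Integer> countShardsByState(List<Entry> entries) {
        Map<ShardState, Integer> counts = new EnumMap<>(ShardState.class);
        for (Entry entry : entries) {
            if (entry.isClone()) {
                // Mirrors the diff above: shards are not read from clone entries.
                continue;
            }
            for (ShardState state : entry.shards().values()) {
                counts.merge(state, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```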


Track snapshot stats as metrics

e7794c1

elasticsearchmachine added the v9.2.0 label Jun 30, 2025

nicktindall added 3 commits June 30, 2025 11:49

Fix double counted snapshot completion

b8be99e

Reduce size of change

9eeee30

Add MeterRegistry param in callers

67eb753

elasticsearchmachine added the serverless-linked label Jun 30, 2025

nicktindall added the >non-issue and :Distributed Coordination/Snapshot/Restore labels Jun 30, 2025

nicktindall and others added 22 commits June 30, 2025 14:29

Make banned implementation final

faf4e7a

Improve javadoc

5a33bb6

Fix naming

b4c926f

Fix naming, record shard duration as histogram

d808b85

Millis -> nanos

fd55b35

Reuse totalTime

ffdb941

Don't use cached time

6e22dc6

First pass on tests

818c259

Merge remote-tracking branch 'origin/main' into ES-12055_track_snapsh…

b929cd1

…ot_stats_as_metrics

Fix SnapshotMetricsIT

9ba0d4a

Naming

45909db

Assert on throttling metrics

f42f9bd

Merge remote-tracking branch 'origin/main' into ES-12055_track_snapsh…

ef19cd5

…ot_stats_as_metrics

Merge remote-tracking branch 'origin/main' into ES-12055_track_snapsh…

adc149a

…ot_stats_as_metrics

Add snapshot APM metrics

345cc59

Add snapshot metrics

b341645

Tidy

3894be6

Tidy

d164c8c

Tidy

b8cc9f9

Reduce surface area of change (?)

7f7427c

Merge remote-tracking branch 'origin/main' into ES-12055_track_snapsh…

3aa7ca6

…ot_stats_as_metrics

URLRepository

e2665d1

nicktindall added 10 commits July 8, 2025 20:06

Use com.carrotsearch.hppc.ObjectIntMap.addTo

ba62ff6

Pre-calculate shard & snapshot state summaries in cluster state

e90ff65

Merge remote-tracking branch 'origin/main' into ES-12055_track_snapsh…

2ef402e

…ot_stats_as_metrics # Conflicts: # server/src/main/java/org/elasticsearch/TransportVersions.java

Merge remote-tracking branch 'origin/main' into ES-12055_track_snapsh…

6c5d200

…ot_stats_as_metrics # Conflicts: # server/src/main/java/org/elasticsearch/TransportVersions.java

Don't try and get shards for clone entry

a43ac7e

Add snapshots by state metric

093ae22

Merge branch 'main' into ES-12055_track_snapshot_stats_as_metrics

81ce452

Remove remnants of limited state tracking

f509643

Remove redundant snapshots in progress metric

3ac413c

Populate and assert on all snapshotStats fields

1dec741

nicktindall requested a review from ywangd July 10, 2025 03:57

nicktindall added 3 commits July 10, 2025 14:09

Fix flakiness in RepositorySnapshotStatsIT, remove dead code

850c116

Merge branch 'main' into ES-12055_track_snapshot_stats_as_metrics

6dd44f1

Fix assertion

ca9d1ab

nicktindall requested a review from DaveCTurner July 10, 2025 06:33

DaveCTurner reviewed Jul 14, 2025


nicktindall added 6 commits July 15, 2025 22:02

Merge remote-tracking branch 'origin/main' into ES-12055_track_snapsh…

4f76e4a

…ot_stats_as_metrics

Calculate snapshot stats on metrics thread and only when stale

87864ec

Merge branch 'main' into ES-12055_track_snapshot_stats_as_metrics

7feb852

Align toString with field names, add units to IndexShardSnapshotStatu…

9bea854

…s.Copy fields/toString

Only recalculate stats if SnapshotsInProgress changed

74c3427

Tidy up, naming

346c15f

nicktindall requested a review from DaveCTurner July 16, 2025 06:17

DaveCTurner approved these changes Aug 8, 2025


nicktindall added 3 commits August 9, 2025 11:50

Merge branch 'main' into ES-12055_track_snapshot_stats_as_metrics

9c1b1dd

# Conflicts: # server/src/main/java/org/elasticsearch/TransportVersions.java # server/src/main/java/org/elasticsearch/snapshots/SnapshotsService.java

Merge remote-tracking branch 'origin/main' into ES-12055_track_snapsh…

48466fd

…ot_stats_as_metrics

Clarify why we return null

d3b583a

nicktindall merged commit 9890f98 into elastic:main Aug 11, 2025
33 checks passed

nicktindall deleted the ES-12055_track_snapshot_stats_as_metrics branch August 11, 2025 07:56

JeremyDahlgren mentioned this pull request Aug 11, 2025

Fix race condition in SnapshotMetricsIT.testSnapshotAPMMetrics #132686

Merged
