[FLINK-38291][REST] Reduce thread lock overhead for REST handlers #26951

yunfengzhou-hub · 2025-08-28T07:52:30Z

What is the purpose of the change

This PR reduces the latency of the Flink REST handlers that the Flink UI uses to render a job's DAG.

In the current implementation, REST handlers such as JobDetailsHandler iterate through all vertices of a job and invoke MetricStore#getSubtaskAttemptMetricStore in each iteration. Because this method is synchronized, each invocation can block until other threads finish their own synchronized calls on the same MetricStore. This blocking overhead accumulates across the loop, resulting in high latency when the Flink UI renders the status of a job through JobDetailsHandler.

To solve this problem, this PR reduces the number of synchronized invocations in REST handlers. Each handler acquires a snapshot of the MetricStore jobs once, paying the synchronization cost a single time, and then reuses that snapshot inside its loops. The snapshot is read-only, so it does not need to be synchronized.
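To illustrate the difference between the two access patterns, here is a minimal sketch. It does not use Flink's classes: `SketchMetricStore`, `SketchHandler`, and the lock counter are hypothetical names introduced only for this example, and the real MetricStore API may differ. The sketch counts how often the store's monitor is taken when a handler looks up every subtask individually versus when it takes one snapshot up front.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a metric store; this is NOT Flink's MetricStore.
class SketchMetricStore {
    private final Map<String, Map<Integer, Long>> vertices = new HashMap<>();
    private int lockAcquisitions = 0; // counts calls that contend for the monitor

    synchronized void put(String vertexId, int subtask, long value) {
        lockAcquisitions++;
        vertices.computeIfAbsent(vertexId, k -> new HashMap<>()).put(subtask, value);
    }

    // Per-lookup access: every call has to take the store's monitor.
    synchronized Long getSubtaskMetric(String vertexId, int subtask) {
        lockAcquisitions++;
        Map<Integer, Long> subtasks = vertices.get(vertexId);
        return subtasks == null ? null : subtasks.get(subtask);
    }

    // Snapshot access: the monitor is taken once and a read-only copy is returned.
    synchronized Map<String, Map<Integer, Long>> snapshot() {
        lockAcquisitions++;
        Map<String, Map<Integer, Long>> copy = new HashMap<>();
        for (Map.Entry<String, Map<Integer, Long>> e : vertices.entrySet()) {
            copy.put(e.getKey(), Collections.unmodifiableMap(new HashMap<>(e.getValue())));
        }
        return Collections.unmodifiableMap(copy);
    }

    synchronized int lockAcquisitions() {
        return lockAcquisitions;
    }

    synchronized void resetCounter() {
        lockAcquisitions = 0;
    }
}

// Hypothetical handler logic showing both access patterns.
class SketchHandler {
    // One synchronized call per (vertex, subtask) pair.
    static long sumPerLookup(SketchMetricStore store, List<String> vertexIds, int parallelism) {
        long sum = 0;
        for (String vertexId : vertexIds) {
            for (int i = 0; i < parallelism; i++) {
                Long value = store.getSubtaskMetric(vertexId, i);
                if (value != null) {
                    sum += value;
                }
            }
        }
        return sum;
    }

    // One synchronized call in total; the loops only read the snapshot.
    static long sumWithSnapshot(SketchMetricStore store, List<String> vertexIds, int parallelism) {
        Map<String, Map<Integer, Long>> snapshot = store.snapshot();
        long sum = 0;
        for (String vertexId : vertexIds) {
            Map<Integer, Long> subtasks = snapshot.getOrDefault(vertexId, Collections.emptyMap());
            for (int i = 0; i < parallelism; i++) {
                Long value = subtasks.get(i);
                if (value != null) {
                    sum += value;
                }
            }
        }
        return sum;
    }
}

class SnapshotSketchDemo {
    public static void main(String[] args) {
        SketchMetricStore store = new SketchMetricStore();
        List<String> vertexIds = Arrays.asList("v1", "v2", "v3");
        for (String vertexId : vertexIds) {
            for (int i = 0; i < 4; i++) {
                store.put(vertexId, i, 1L);
            }
        }
        store.resetCounter();
        SketchHandler.sumPerLookup(store, vertexIds, 4);
        System.out.println("per-lookup lock acquisitions: " + store.lockAcquisitions());
        store.resetCounter();
        SketchHandler.sumWithSnapshot(store, vertexIds, 4);
        System.out.println("snapshot lock acquisitions: " + store.lockAcquisitions());
    }
}
```

In this sketch the snapshot costs one copy of the maps per request. That trade-off is only worthwhile when the loop would otherwise make many contended calls, which is the situation described above.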

As for benchmark results, we manually measured how long the Flink UI takes to display the DAG of a complex Flink job in our company. Before the optimization, the Flink UI needed more than 1 minute to finish rendering. After the optimization, the latency dropped to less than 10 seconds.

Brief change log

Introduce MetricStore.MetricStoreJobs to manage a snapshot of all jobs in the MetricStore. Unlike the original implementation, which operated on the MetricStore jobs directly, the new implementation does not need the synchronized keyword on its methods.
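As a rough illustration of why a snapshot type can drop the synchronized keyword, the sketch below builds a read-only copy whose state never changes after construction. `JobsSnapshotSketch` and `LiveStoreSketch` are hypothetical names, not the classes in this PR. Whether MetricStore.MetricStoreJobs copies the data or wraps it is not stated here, so the copying in the constructor is an assumption of this sketch.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical read-only view of per-job metrics captured at one point in time.
final class JobsSnapshotSketch {
    private final Map<String, Map<String, String>> metricsByJob;

    // Assumption: the snapshot copies the source maps so later writes are not visible.
    JobsSnapshotSketch(Map<String, Map<String, String>> source) {
        Map<String, Map<String, String>> copy = new HashMap<>();
        for (Map.Entry<String, Map<String, String>> e : source.entrySet()) {
            copy.put(e.getKey(), Collections.unmodifiableMap(new HashMap<>(e.getValue())));
        }
        this.metricsByJob = Collections.unmodifiableMap(copy);
    }

    // No synchronized keyword: the state is immutable after construction.
    String getMetric(String jobId, String metricName) {
        Map<String, String> metrics = metricsByJob.get(jobId);
        return metrics == null ? null : metrics.get(metricName);
    }
}

// Hypothetical mutable store; writers and snapshot creation share one monitor.
class LiveStoreSketch {
    private final Map<String, Map<String, String>> jobs = new HashMap<>();

    synchronized void add(String jobId, String metricName, String value) {
        jobs.computeIfAbsent(jobId, k -> new HashMap<>()).put(metricName, value);
    }

    // The only synchronized step a reader needs: capture the snapshot once.
    synchronized JobsSnapshotSketch getJobsSnapshot() {
        return new JobsSnapshotSketch(jobs);
    }
}
```

A handler in this sketch would call `getJobsSnapshot()` once per request and then call `getMetric` as often as it likes. Updates made to the live store afterwards do not show up in a snapshot that was already taken.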

Verifying this change

The correctness of this PR is covered by existing tests, such as JobDetailsHandlerTest and MetricStoreTest.

Does this pull request potentially affect one of the following parts:

Dependencies (does it add or upgrade a dependency): no
The public API, i.e., is any changed class annotated with @Public(Evolving): no
The serializers: no
The runtime per-record code paths (performance sensitive): no
Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
The S3 file system connector: no

Documentation

Does this pull request introduce a new feature? no
If yes, how is the feature documented? not applicable

flinkbot · 2025-08-28T07:56:04Z

CI report:

e1b5790 Azure: SUCCESS

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot run azure re-run the last Azure build

yunfengzhou-hub · 2025-08-29T00:58:55Z

Hi @Sxnan could you please help review this PR?

Sxnan

Thanks for the PR! The change looks good overall. I just left comments about the naming of the data structure that captures the snapshot of the JobMetricStores.

flink-runtime/src/main/java/org/apache/flink/runtime/rest/handler/job/JobDetailsHandler.java

...-runtime/src/main/java/org/apache/flink/runtime/rest/handler/legacy/metrics/MetricStore.java

yunfengzhou-hub · 2025-09-03T10:45:46Z

Thanks for the comments, @Sxnan. I have updated the PR accordingly.

Sxnan

Thanks for the update, LGTM!

Sxnan · 2025-09-04T01:31:15Z

@flinkbot run azure

yunfengzhou-hub · 2025-09-04T03:50:26Z

@flinkbot run azure

yunfengzhou-hub · 2025-09-04T08:36:38Z

@flinkbot run azure

yunfengzhou-hub force-pushed the accelerate-job-details-handler branch from 6288645 to 3d005cf Compare August 28, 2025 08:04

yunfengzhou-hub marked this pull request as ready for review August 28, 2025 08:05

Sxnan reviewed Sep 3, 2025

flink-runtime/src/main/java/org/apache/flink/runtime/rest/handler/job/JobDetailsHandler.java (outdated, resolved)

...-runtime/src/main/java/org/apache/flink/runtime/rest/handler/legacy/metrics/MetricStore.java (outdated, resolved)

Sxnan approved these changes Sep 4, 2025

yunfengzhou-hub force-pushed the accelerate-job-details-handler branch from 5f93c07 to 6a7f307 Compare September 4, 2025 05:53

[FLINK-38291][REST] Reduce thread lock overhead for REST handlers

e1b5790

yunfengzhou-hub force-pushed the accelerate-job-details-handler branch from 6a7f307 to e1b5790 Compare September 4, 2025 11:01

Sxnan merged commit 7a00b0e into apache:master Sep 5, 2025
