-
Notifications
You must be signed in to change notification settings - Fork 13.8k
[FLINK-38291][REST] Reduce thread lock overhead for REST handlers #26951
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FLINK-38291][REST] Reduce thread lock overhead for REST handlers #26951
Conversation
6288645
to
3d005cf
Compare
Hi @Sxnan could you please help review this PR? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! The change looks good overall. I just left comments about the naming of the data structure that captures the snapshot of the JobMetricStores.
flink-runtime/src/main/java/org/apache/flink/runtime/rest/handler/job/JobDetailsHandler.java
Outdated
Show resolved
Hide resolved
...-runtime/src/main/java/org/apache/flink/runtime/rest/handler/legacy/metrics/MetricStore.java
Outdated
Show resolved
Hide resolved
Thanks for the comments @Sxnan . I have updated the PR according to the comments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the update, LGTM!
@flinkbot run azure |
1 similar comment
@flinkbot run azure |
5f93c07
to
6a7f307
Compare
@flinkbot run azure |
6a7f307
to
e1b5790
Compare
What is the purpose of the change
This PR optimizes the latency of Flink REST handlers used to generate the DAG in Flink UI.
In the current implementation, REST handlers like
JobDetailsHandler
would iterate through all vertexes of a job, and invokeMetricStore#getSubtaskAttemptMetricStore
during each iteration. Given that this is a synchronized method, invocations to this method could possibly be blocked until other threads finished invoking other synchronized methods. This blocking overhead is accumulated with the for loop, resulting in high latency when Flink UI tries to render the status of a Flink job throughJobDetailsHandler
.In order to solve this problem, this PR proposes to reduce the number of synchronized invocations in REST handlers. A snapshot of the MetricStore jobs is acquired for each handler (and the synchronization overhead is accumulated only once here), and the snapshot is then reused in the for loops. The snapshot is read only so it needs not be synchronized.
As for benchmark results, we manually measured the latency for the Flink UI to display the DAG of a sophisticated Flink job in our company. Before optimization, the Flink UI needs more than 1 minute to finish the display. After the optimization, the latency decreased to less than 10 seconds.
Brief change log
Verifying this change
The correctness of this PR is covered by existing tests, such as JobDetailsHandlerTest and MetricStoreTest.
Does this pull request potentially affect one of the following parts:
@Public(Evolving)
: noDocumentation