yunfengzhou-hub
Contributor

@yunfengzhou-hub yunfengzhou-hub commented Aug 28, 2025

What is the purpose of the change

This PR reduces the latency of the Flink REST handlers that the Flink UI uses to render the job DAG.

In the current implementation, REST handlers such as JobDetailsHandler iterate over all vertices of a job and invoke MetricStore#getSubtaskAttemptMetricStore in each iteration. Because this method is synchronized, each invocation can block until other threads finish executing other synchronized methods on the same MetricStore. This blocking overhead accumulates across the loop, resulting in high latency when the Flink UI renders the status of a Flink job through JobDetailsHandler.

To solve this problem, this PR reduces the number of synchronized invocations in REST handlers. Each handler acquires a snapshot of the MetricStore jobs once, so the synchronization overhead is paid a single time, and then reuses that snapshot inside its loops. The snapshot is read-only, so it need not be synchronized.
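To make the idea concrete, here is a minimal sketch of the snapshot pattern described above. It is not the code in this PR: `SnapshotSketch`, `Store`, `JobsSnapshot`, and the two `sum...` methods are hypothetical names, the metric layout (job id to vertex id to a `long`) is a simplifying assumption, and the deep copy in `snapshotJobs` is one possible way to build a read-only view, not necessarily how MetricStore does it.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; these types are hypothetical, not Flink's MetricStore API. */
class SnapshotSketch {

    /** Hypothetical store whose state is guarded by its own monitor. */
    static final class Store {
        // jobId -> (vertexId -> metric value)
        private final Map<String, Map<String, Long>> jobs = new HashMap<>();

        synchronized void put(String jobId, String vertexId, long value) {
            jobs.computeIfAbsent(jobId, k -> new HashMap<>()).put(vertexId, value);
        }

        /** Per-lookup access: every call acquires the store's monitor. */
        synchronized Long getVertexMetric(String jobId, String vertexId) {
            Map<String, Long> vertices = jobs.get(jobId);
            return vertices == null ? null : vertices.get(vertexId);
        }

        /** Snapshot access: acquire the monitor once and copy into an immutable view. */
        synchronized JobsSnapshot snapshotJobs() {
            Map<String, Map<String, Long>> copy = new HashMap<>();
            jobs.forEach((jobId, vertices) -> copy.put(jobId, Map.copyOf(vertices)));
            return new JobsSnapshot(Map.copyOf(copy));
        }
    }

    /** Immutable view of the store; its methods need no synchronization. */
    static final class JobsSnapshot {
        private final Map<String, Map<String, Long>> jobs;

        JobsSnapshot(Map<String, Map<String, Long>> jobs) {
            this.jobs = jobs;
        }

        Long getVertexMetric(String jobId, String vertexId) {
            Map<String, Long> vertices = jobs.get(jobId);
            return vertices == null ? null : vertices.get(vertexId);
        }
    }

    /** Handler-style loop that takes the lock once per vertex. */
    static long sumPerLookup(Store store, String jobId, List<String> vertexIds) {
        long total = 0;
        for (String vertexId : vertexIds) {
            Long metric = store.getVertexMetric(jobId, vertexId);
            if (metric != null) {
                total += metric;
            }
        }
        return total;
    }

    /** Handler-style loop that takes the lock once, then reads the snapshot lock-free. */
    static long sumWithSnapshot(Store store, String jobId, List<String> vertexIds) {
        JobsSnapshot snapshot = store.snapshotJobs();
        long total = 0;
        for (String vertexId : vertexIds) {
            Long metric = snapshot.getVertexMetric(jobId, vertexId);
            if (metric != null) {
                total += metric;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        Store store = new Store();
        store.put("job", "v1", 3L);
        store.put("job", "v2", 4L);
        List<String> ids = List.of("v1", "v2");
        System.out.println(sumPerLookup(store, "job", ids));
        System.out.println(sumWithSnapshot(store, "job", ids));
    }
}
```

In this sketch both loops compute the same result; the difference is that `sumWithSnapshot` contends for the monitor once instead of once per vertex. The trade-off is that the snapshot costs a copy up front and does not reflect updates made to the store after it was taken.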

As for benchmark results, we manually measured how long the Flink UI takes to display the DAG of a complex Flink job at our company. Before the optimization, the Flink UI needed more than 1 minute to finish the display. After the optimization, the latency decreased to less than 10 seconds.

Brief change log

  • Introduce MetricStore.MetricStoreJobs to manage a snapshot of all jobs in the MetricStore. Unlike the original implementation, which operates on the MetricStore jobs directly, the new implementation does not need the synchronized keyword on its methods.

Verifying this change

The correctness of this PR is covered by existing tests, such as JobDetailsHandlerTest and MetricStoreTest.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? not applicable

@flinkbot
Collaborator

flinkbot commented Aug 28, 2025

CI report:

Bot commands

The @flinkbot bot supports the following commands:
  • @flinkbot run azure: re-run the last Azure build

@yunfengzhou-hub yunfengzhou-hub force-pushed the accelerate-job-details-handler branch from 6288645 to 3d005cf Compare August 28, 2025 08:04
@yunfengzhou-hub yunfengzhou-hub marked this pull request as ready for review August 28, 2025 08:05
@yunfengzhou-hub
Contributor Author

Hi @Sxnan could you please help review this PR?

Contributor

@Sxnan Sxnan left a comment

Thanks for the PR! The change looks good overall. I just left comments about the naming of the data structure that captures the snapshot of the JobMetricStores.

@yunfengzhou-hub
Contributor Author

Thanks for the comments, @Sxnan. I have updated the PR accordingly.

Contributor

@Sxnan Sxnan left a comment

Thanks for the update, LGTM!

@Sxnan
Contributor

Sxnan commented Sep 4, 2025

@flinkbot run azure

@yunfengzhou-hub
Contributor Author

@flinkbot run azure

@yunfengzhou-hub yunfengzhou-hub force-pushed the accelerate-job-details-handler branch from 5f93c07 to 6a7f307 Compare September 4, 2025 05:53
@yunfengzhou-hub
Contributor Author

@flinkbot run azure

@yunfengzhou-hub yunfengzhou-hub force-pushed the accelerate-job-details-handler branch from 6a7f307 to e1b5790 Compare September 4, 2025 11:01
@Sxnan Sxnan merged commit 7a00b0e into apache:master Sep 5, 2025