@guojialiang92
Contributor

Description

Add a dedicated REPLICATION thread pool.

Related Issues

Resolves #18755

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions
Contributor

❌ Gradle check result for 934f8ba: null

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: guojialiang <[email protected]>
@guojialiang92 guojialiang92 force-pushed the dev/add-dedicated-replication-threadpool branch from 934f8ba to 70888c2 Compare July 16, 2025 15:16
@github-actions
Contributor

✅ Gradle check result for 70888c2: SUCCESS

@codecov

codecov bot commented Jul 16, 2025

Codecov Report

❌ Patch coverage is 83.33333% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 72.74%. Comparing base (fc6b08e) to head (70888c2).
⚠️ Report is 224 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...ices/replication/common/ReplicationCollection.java | 0.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #18769      +/-   ##
============================================
- Coverage     72.86%   72.74%   -0.12%     
+ Complexity    68571    68482      -89     
============================================
  Files          5566     5566              
  Lines        314513   314673     +160     
  Branches      45636    45652      +16     
============================================
- Hits         229167   228912     -255     
- Misses        66789    67183     +394     
- Partials      18557    18578      +21     

@guojialiang92
Contributor Author

Hi, @Bukhtawar @ashking94
I'd like to invite you to help review this PR, thanks.

Member

@ashking94 ashking94 left a comment

Thanks for contributing this. I wanted to understand the motivation for this change. Did you see an issue that led you to do this?

final int halfProcMaxAt5 = halfAllocatedProcessorsMaxFive(allocatedProcessors);
final int halfProcMaxAt10 = halfAllocatedProcessorsMaxTen(allocatedProcessors);
final int genericThreadPoolMax = boundedBy(4 * allocatedProcessors, 128, 512);
final int replicationThreadPoolMax = boundedBy(4 * allocatedProcessors, 128, 512);
Member

How have we arrived at this number?

Contributor Author

@guojialiang92 guojialiang92 Jul 18, 2025

@ashking94

> How have we arrived at this number?

To keep this PR easy to manage, it introduces no changes other than the dedicated replication thread pool.
The maximum size of the new pool is kept the same as that of the generic thread pool.
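
For reference, a minimal sketch of what that sizing works out to for a few processor counts. It assumes `boundedBy` clamps its first argument to the [min, max] range; the helper's body is not shown in this diff, and the class and method names below are illustrative only.

```java
public final class ReplicationPoolSizing {

    // Assumption: boundedBy(value, min, max) clamps value into [min, max].
    public static int boundedBy(int value, int min, int max) {
        return Math.min(max, Math.max(min, value));
    }

    // Same expression the diff uses for genericThreadPoolMax and replicationThreadPoolMax.
    public static int replicationThreadPoolMax(int allocatedProcessors) {
        return boundedBy(4 * allocatedProcessors, 128, 512);
    }

    public static void main(String[] args) {
        // Print the resulting maximum for a range of processor counts.
        for (int procs : new int[] {8, 32, 64, 128, 256}) {
            System.out.println(procs + " processors -> max " + replicationThreadPoolMax(procs));
        }
    }
}
```

Under that assumption, the maximum never drops below 128 threads and never exceeds 512, regardless of the processor count.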

Contributor

I assume recovery will continue to use the generic thread pool?

Contributor Author

@Bukhtawar

> I assume recovery will continue to use the generic thread pool?

Yeah.

@guojialiang92
Contributor Author

guojialiang92 commented Jul 18, 2025

Hi @ashking94, thanks for reviewing the PR.

> Thanks for contributing this. I wanted to understand the motivation for this change. Did you see an issue that led you to do this?

When a node hosts a large number of shards (say 3,000 to 5,000), a relatively large share of the generic thread pool's active threads end up executing data replication tasks, which can delay higher-priority tasks such as node discovery and peer recovery. Mixing these tasks in one pool also makes it harder to monitor the thread usage of data replication on its own.

Segment replication is becoming increasingly important, and many new ideas build on it, such as merged segment warmer, Hybrid block level fetch, and Adaptive Refresh. A dedicated thread pool lets us control data replication tasks more precisely, and I see no obvious drawbacks.
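
As a rough illustration of the isolation argument (not the code in this PR), the sketch below uses plain `java.util.concurrent` executors. Fixed pools of two threads stand in for a bounded pool that has already reached its maximum, and every name here is hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public final class PoolIsolationSketch {

    // Occupies replicationPool with `blockers` long-running tasks, then reports whether
    // a short high-priority task submitted to priorityPool finishes within timeoutMillis.
    public static boolean highPriorityCompletes(
        ExecutorService replicationPool,
        ExecutorService priorityPool,
        int blockers,
        long timeoutMillis
    ) throws Exception {
        CountDownLatch release = new CountDownLatch(1);
        try {
            for (int i = 0; i < blockers; i++) {
                // Stand-in for a replication task that holds its thread for a long time.
                replicationPool.submit(() -> {
                    release.await();
                    return null;
                });
            }
            Future<String> highPriority = priorityPool.submit(() -> "done");
            try {
                highPriority.get(timeoutMillis, TimeUnit.MILLISECONDS);
                return true;
            } catch (TimeoutException e) {
                return false;
            }
        } finally {
            // Let the blocked tasks finish so the pools can shut down cleanly.
            release.countDown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Shared pool: the high-priority task queues behind the replication tasks.
        ExecutorService shared = Executors.newFixedThreadPool(2);
        System.out.println("shared pool: " + highPriorityCompletes(shared, shared, 2, 200));
        shared.shutdown();

        // Dedicated pools: the high-priority task is unaffected by replication load.
        ExecutorService replication = Executors.newFixedThreadPool(2);
        ExecutorService priority = Executors.newFixedThreadPool(2);
        System.out.println("dedicated pools: " + highPriorityCompletes(replication, priority, 2, 5000));
        replication.shutdown();
        priority.shutdown();
    }
}
```

With a shared pool the short task waits until a replication task releases a thread. With separate pools it runs immediately, and each pool's thread usage can be observed on its own.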

@opensearch-trigger-bot
Contributor

This PR is stalled because it has been open for 30 days with no activity.

@opensearch-trigger-bot opensearch-trigger-bot bot added and removed the "stalled" (Issues that have stalled) label Aug 19, 2025
@opensearch-trigger-bot
Contributor

This PR is stalled because it has been open for 30 days with no activity.

@opensearch-trigger-bot opensearch-trigger-bot bot added and removed the "stalled" (Issues that have stalled) label Sep 21, 2025