Skip to content

[merged segment warmer] support cleanup redundant pending merged segments #18720

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

guojialiang92
Copy link
Contributor

@guojialiang92 guojialiang92 commented Jul 10, 2025

Description

This PR is based on [18255]'s follow-up work. It implements the core process of cleaning up redundant pending merged segments. Meanwhile, this cleanup process can be applied to both local and remote storage scenarios.

Example
Here, an example of generating redundant pending merged segments will be provided.

  1. At time1, the primary shard merges _1.si and _2.si into segment _3.si. _3.si is pre-copied to the replica shard, which completed at time4.
  2. At time2, the primary shard generated _4.si, and _1.si, _2.si, _4.si were copied to the replica shard via segment replication, which completed at time5.
  3. At time3, the primary shard merges _3.si and _4.si into segment _5.si. _5.si is pre-copied to the replica shard, which completed at time6.
  4. At time4, the replica completed the pre-copy of _3.si, but _3.si is not searchable.
  5. At time5, the reference counts of _1.si and _2.si in the primary shard decreased to 0 and were deleted.
  6. At time6, _5.si on the primary shard becomes searchable. The reference counts of _3.si and _4.si in the primary shard decreased to 0 and were deleted. _5.si were copied to the replica shard via segment replication, which completed at time7 (This process is fast because it has already performed pre-copy).
  7. At time7, _5.si on the replica shard becomes searchable. The reference counts of _1.si, _2.si and _4.si in the replica shard decreased to 0 and were deleted.
  8. After time7, _3.si on the replica is the redundant pending merged segment.
primary replica
time1 _1.si, _2.si, _3.si
time2 _1.si, _2.si, _3.si, _4.si
time3 _1.si, _2.si, _3.si, _4.si, _5.si
time4 _1.si, _2.si, _3.si, _4.si, _5.si _3.si
time5 _3.si, _4.si, _5.si _1.si, _2.si, _3.si, _4.si
time6 _5.si _1.si, _2.si, _3.si, _4.si, _5.si
time7 _5.si _3.si, _5.si

Cleanup strategy
When the following conditions are met, we will consider the pending merge segment to be redundant and can be deleted.

  • The current primary term of the primary shard is greater than the primary term when the pending merge segment was generated, or the primary terms are identical but the segmentInfosVersion of the primary shard is greater than the segmentInfosVersion when the pending merge segment was generated.
  • The pending merge segment no longer exists in the primary shard.

Implement

  • After the merged segment pre-copy is completed, the replica shard needs to record MergeSegmentCheckpoint in IndexShard#pendingMergeSegmentCheckpoints.
  • After the segment replication is completed, the replica shard needs to remove the relevant MergeSegmentCheckpoint from IndexShard#pendingMergeSegmentCheckpoints.
  • We introduced IndexService.AsyncPublishReferencedSegmentsTask to periodically publish the segments referenced by the primary shard to replicas.
  • The replica shard cleans up the redundant pending merge segments according to the above cleanup strategy.

Related Issues

Resolves #[17528], #[18625]

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Copy link
Contributor

❌ Gradle check result for bc68b76: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Copy link
Contributor

❌ Gradle check result for ef17803: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: guojialiang <[email protected]>
@guojialiang92 guojialiang92 force-pushed the dev/support-cleanup-redundant-pending-merge-segment branch from ef17803 to 136f448 Compare July 10, 2025 13:27
Copy link
Contributor

✅ Gradle check result for 136f448: SUCCESS

Copy link

codecov bot commented Jul 10, 2025

Codecov Report

❌ Patch coverage is 81.29032% with 29 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.78%. Comparing base (42a1dd2) to head (718fa30).
⚠️ Report is 23 commits behind head on main.

Files with missing lines Patch % Lines
...c/main/java/org/opensearch/index/IndexService.java 75.00% 6 Missing and 1 partial ⚠️
...ation/checkpoint/ReferencedSegmentsCheckpoint.java 74.07% 1 Missing and 6 partials ⚠️
...in/java/org/opensearch/index/shard/IndexShard.java 87.50% 4 Missing and 2 partials ⚠️
...on/checkpoint/PublishReferencedSegmentsAction.java 76.92% 2 Missing and 1 partial ⚠️
...n/checkpoint/PublishReferencedSegmentsRequest.java 80.00% 1 Missing and 2 partials ⚠️
...cation/checkpoint/ReferencedSegmentsPublisher.java 75.00% 2 Missing ⚠️
...es/replication/MergedSegmentReplicationTarget.java 50.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #18720      +/-   ##
============================================
- Coverage     72.80%   72.78%   -0.03%     
- Complexity    68564    68592      +28     
============================================
  Files          5567     5571       +4     
  Lines        314844   314987     +143     
  Branches      45675    45694      +19     
============================================
+ Hits         229227   229261      +34     
- Misses        67028    67185     +157     
+ Partials      18589    18541      -48     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Signed-off-by: guojialiang <[email protected]>
Signed-off-by: guojialiang <[email protected]>
Copy link
Contributor

❌ Gradle check result for 806fff2: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: guojialiang <[email protected]>
Signed-off-by: guojialiang <[email protected]>
Signed-off-by: guojialiang <[email protected]>
Signed-off-by: guojialiang <[email protected]>
Signed-off-by: guojialiang <[email protected]>
Copy link
Contributor

❌ Gradle check result for 7ef8968: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: guojialiang <[email protected]>
Copy link
Contributor

✅ Gradle check result for 718fa30: SUCCESS

@guojialiang92
Copy link
Contributor Author

guojialiang92 commented Jul 23, 2025

Hi, @ashking94

@guojialiang92 Let me know once you have addressed all the comments. Will do the final review then.

If you have time, please help review the code again. Thank you. :)

Copy link
Member

@ashking94 ashking94 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the revision. I have some minor comments only now, let me know once you have addressed it.

@ashking94
Copy link
Member

Do look at this - #18720 (comment) as well.

@guojialiang92
Copy link
Contributor Author

Hi, @ashking94

Thanks for the revision. I have some minor comments only now, let me know once you have addressed it.

Could you please help review it again? Including the discussion in #18720 (comment).
Thank you.

@ashking94
Copy link
Member

Hi, @ashking94

Thanks for the revision. I have some minor comments only now, let me know once you have addressed it.

Could you please help review it again? Including the discussion in #18720 (comment). Thank you.

Hi @guojialiang92, except the only unresolved comment, I am good with everything.

@guojialiang92
Copy link
Contributor Author

Hi, @ashking94

Thanks for the revision. I have some minor comments only now, let me know once you have addressed it.

Could you please help review it again? Including the discussion in #18720 (comment). Thank you.

Hi @guojialiang92, except the only unresolved comment, I am good with everything.

Thanks, @ashking94, I have already replied to comment.
If you think the content involved in this discussion is meaningful, I will submit a separate issue to record it later.

@ashking94
Copy link
Member

@guojialiang92 Sure, lets create an issue and tackle this in a follow up. Overall, I am good.

@ashking94 ashking94 merged commit c985636 into opensearch-project:main Jul 28, 2025
34 of 35 checks passed
@ashking94
Copy link
Member

@guojialiang92 I have merged this and i am hoping that none of the added tests are flaky on longer iterations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants