
Conversation

sapkota-aayush
Contributor

This PR fixes a race condition in the sync server where, when multiple threads crashed, only the first error was reported and the others were lost.
Now only the first error triggers shutdown and error reporting; errors raised after shutdown has started are ignored.
Removed temporary test files used for debugging.
Closes #198 (Graceful shutdown of gRPC servers when there are exceptions in the User Code).
I’m still exploring gRPC, so I may be wrong—open to any feedback or suggestions!
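
As a rough illustration of the "first error wins" behaviour described above, here is a minimal sketch; the class and method names are hypothetical and do not mirror the actual `_sync_servicer.py` code:

```python
import threading


# Hypothetical sketch of the "first error wins" pattern; names are illustrative
# and not taken from pynumaflow/mapper/_servicer/_sync_servicer.py.
class FirstErrorReporter:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._shutdown_started = threading.Event()

    def report(self, exc: BaseException) -> None:
        """Trigger shutdown for the first error only; ignore later ones."""
        with self._lock:
            if self._shutdown_started.is_set():
                return  # shutdown already in progress; drop this error
            self._shutdown_started.set()
        self._trigger_shutdown(exc)

    def _trigger_shutdown(self, exc: BaseException) -> None:
        # Placeholder: the real servicer would set an error status on the gRPC
        # context and signal the server to shut down.
        print(f"shutting down due to: {exc!r}")
```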


codecov bot commented Jul 18, 2025

Codecov Report

Attention: Patch coverage is 73.68421% with 5 lines in your changes missing coverage. Please review.

Project coverage is 94.09%. Comparing base (42f9fbd) to head (4ed0cf5).
Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| pynumaflow/mapper/_servicer/_sync_servicer.py | 73.68% | 3 Missing and 2 partials ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #236      +/-   ##
==========================================
- Coverage   94.26%   94.09%   -0.17%     
==========================================
  Files          60       60              
  Lines        2441     2457      +16     
  Branches      124      128       +4     
==========================================
+ Hits         2301     2312      +11     
- Misses        101      104       +3     
- Partials       39       41       +2     

☔ View full report in Codecov by Sentry.

@vigith vigith requested a review from kohlisid July 18, 2025 02:27
…Servicer; remove temporary race condition test files

Signed-off-by: sapkota-aayush <[email protected]>
@sapkota-aayush sapkota-aayush force-pushed the sync-server-graceful-shutdown-fix branch from 131b637 to 4ed0cf5 on July 18, 2025 02:28
@sapkota-aayush
Contributor Author

@kohlisid @vigith
After applying the changes, I ran the race condition tests. The results show that only the first error triggers shutdown and error reporting, and all other errors are ignored after shutdown starts—just as intended. I may be mistaken, though.
I’m waiting for your feedback!
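
For context, a self-contained sketch of the kind of race-condition check described above: ten threads "crash" at once and exactly one of them should win. The guard below mirrors the illustrative "first error wins" pattern from the PR description, not the real servicer code.

```python
import threading


def test_only_first_error_triggers_shutdown() -> None:
    # Guard mirroring the illustrative "first error wins" pattern.
    shutdown_started = threading.Event()
    lock = threading.Lock()
    reported = []  # stands in for "shutdown triggered and error reported"

    def report(exc: BaseException) -> None:
        with lock:
            if shutdown_started.is_set():
                return  # shutdown already started; ignore this error
            shutdown_started.set()
        reported.append(exc)

    threads = [
        threading.Thread(target=report, args=(RuntimeError(f"boom {i}"),))
        for i in range(10)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Exactly one error should have triggered shutdown/reporting.
    assert len(reported) == 1


if __name__ == "__main__":
    test_only_first_error_triggers_shutdown()
    print("ok")
```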

Member

@vigith vigith left a comment


LGTM, I will let @kohlisid, who is closer to the code, do a thorough review.

@kohlisid
Contributor

@sapkota-aayush Would you want to test a few scenarios with FMEA?

  1. Scale-down events where pods are killed
  2. Panic in the user code (random)
  3. Panic in the user code (consistent)

We also want to note the behaviour of events after the shutdown/restart:

  1. Are the pods coming back up seamlessly, or are there issues during server startup?
  2. Are events that were left midway through processing getting reprocessed?

The ideal end goal for a clean shutdown: when we get a shutdown signal, we would like to close the server to any new incoming events, let the in-flight events process/drain out, and then shut down the orchestrator and the server.
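
For reference, grpcio's public `Server.stop(grace)` already gives a basic drain: it immediately rejects new RPCs and returns an event that is set once in-flight RPCs finish or the grace period expires. A minimal sketch of wiring that to a shutdown signal (port, worker count, and grace period below are arbitrary assumptions, not pynumaflow's actual setup):

```python
import signal
from concurrent import futures

import grpc


def serve() -> None:
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=4))
    # add_YourServicer_to_server(YourServicer(), server)  # register servicers here
    server.add_insecure_port("[::]:50051")
    server.start()

    def handle_shutdown(signum, frame) -> None:
        # stop(grace) rejects new RPCs right away and returns a threading.Event
        # that is set once in-flight RPCs drain or the grace period elapses.
        done = server.stop(grace=30)
        done.wait(timeout=35)

    signal.signal(signal.SIGTERM, handle_shutdown)
    signal.signal(signal.SIGINT, handle_shutdown)
    server.wait_for_termination()


if __name__ == "__main__":
    serve()
```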

@vigith
Member

vigith commented Jul 18, 2025

@kohlisid does python gRPC support drain/shutdown mode?

@sapkota-aayush
Contributor Author

@sapkota-aayush Would you want to test a few scenarios with FMEA?

  1. Scale-down events where pods are killed
  2. Panic in the user code (random)
  3. Panic in the user code (consistent)

We also want to note the behaviour of events after the shutdown/restart:

  1. Are the pods coming back up seamlessly, or are there issues during server startup?
  2. Are events that were left midway through processing getting reprocessed?

The ideal end goal for a clean shutdown: when we get a shutdown signal, we would like to close the server to any new incoming events, let the in-flight events process/drain out, and then shut down the orchestrator and the server.

Hi @kohlisid,

Sorry for getting back to this late!

Thanks for the detailed testing scenarios.
I haven’t written tests for scaledown/pod-kill scenarios before.

Do you want me to:

  1. Perform these tests manually and share the results, or
  2. Write automated test cases for them as part of this PR?

This will help me approach it the right way.

Development

Successfully merging this pull request may close these issues.

Graceful shutdown of gRPC servers when there are exceptions in the User Code
3 participants