
Conversation

sapkota-aayush
Contributor

This PR fixes a race condition in the sync server where, when multiple threads crashed, only the first error was reported and the others were lost.
Now only the first error triggers shutdown and error reporting; errors raised after shutdown has started are ignored.
Removed temporary test files used for debugging.
Closes #198 (Graceful shutdown of gRPC servers when there are exceptions in the User Code).
I’m still exploring gRPC, so I may be wrong—open to any feedback or suggestions!
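
As a rough illustration of the "first error wins" behaviour described above, here is a minimal sketch; the class and method names are hypothetical and do not mirror the actual `_sync_servicer.py` code:

```python
import threading


# Hypothetical sketch of the "first error wins" pattern; names are illustrative
# and not taken from pynumaflow/mapper/_servicer/_sync_servicer.py.
class FirstErrorReporter:
    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._shutdown_started = threading.Event()

    def report(self, exc: BaseException) -> None:
        """Trigger shutdown for the first error only; ignore later ones."""
        with self._lock:
            if self._shutdown_started.is_set():
                return  # shutdown already in progress; drop this error
            self._shutdown_started.set()
        self._trigger_shutdown(exc)

    def _trigger_shutdown(self, exc: BaseException) -> None:
        # Placeholder: the real servicer would set an error status on the gRPC
        # context and signal the server to shut down.
        print(f"shutting down due to: {exc!r}")
```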


codecov bot commented Jul 18, 2025

Codecov Report

Attention: Patch coverage is 73.68421% with 5 lines in your changes missing coverage. Please review.

Project coverage is 94.09%. Comparing base (42f9fbd) to head (4ed0cf5).
Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| pynumaflow/mapper/_servicer/_sync_servicer.py | 73.68% | 3 Missing and 2 partials ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #236      +/-   ##
==========================================
- Coverage   94.26%   94.09%   -0.17%     
==========================================
  Files          60       60              
  Lines        2441     2457      +16     
  Branches      124      128       +4     
==========================================
+ Hits         2301     2312      +11     
- Misses        101      104       +3     
- Partials       39       41       +2     

☔ View full report in Codecov by Sentry.

@vigith vigith requested a review from kohlisid July 18, 2025 02:27
…Servicer; remove temporary race condition test files

Signed-off-by: sapkota-aayush <[email protected]>
@sapkota-aayush sapkota-aayush force-pushed the sync-server-graceful-shutdown-fix branch from 131b637 to 4ed0cf5 on July 18, 2025 02:28
@sapkota-aayush
Contributor Author

@kohlisid @vigith
After applying the changes, I ran the race condition tests. The results show that only the first error triggers shutdown and error reporting, and all other errors are ignored after shutdown starts—just as intended. I may be mistaken, though.
I’m waiting for your feedback!
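
For context, a self-contained sketch of the kind of race-condition check described above: ten threads "crash" at once and exactly one of them should win. The guard below mirrors the illustrative "first error wins" pattern from the PR description, not the real servicer code.

```python
import threading


def test_only_first_error_triggers_shutdown() -> None:
    # Guard mirroring the illustrative "first error wins" pattern.
    shutdown_started = threading.Event()
    lock = threading.Lock()
    reported = []  # stands in for "shutdown triggered and error reported"

    def report(exc: BaseException) -> None:
        with lock:
            if shutdown_started.is_set():
                return  # shutdown already started; ignore this error
            shutdown_started.set()
        reported.append(exc)

    threads = [
        threading.Thread(target=report, args=(RuntimeError(f"boom {i}"),))
        for i in range(10)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Exactly one error should have triggered shutdown/reporting.
    assert len(reported) == 1


if __name__ == "__main__":
    test_only_first_error_triggers_shutdown()
    print("ok")
```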

Member

@vigith vigith left a comment


LGTM, I will let @kohlisid, who is closer to the code, do a thorough review.

@kohlisid
Contributor

@sapkota-aayush Would you want to test a few scenarios with FMEA?

  1. Scale-down events where pods are killed
  2. Panic in the user code (random)
  3. Panic in the user code (consistent)

We also want to note the behaviour of events after the shutdown/restart:

  1. Are the pods coming back up seamlessly, or are there issues during server startup?
  2. Are events that were left midway through processing getting reprocessed?

The ideal end goal for a clean shutdown: when we get a shutdown signal, we would like to close the server to any new incoming events, let the in-flight events process/drain out, and then shut down the orchestrator and the server.
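
For reference, grpcio's public `Server.stop(grace)` already gives a basic drain: it immediately rejects new RPCs and returns an event that is set once in-flight RPCs finish or the grace period expires. A minimal sketch of wiring that to a shutdown signal (port, worker count, and grace period below are arbitrary assumptions, not pynumaflow's actual setup):

```python
import signal
from concurrent import futures

import grpc


def serve() -> None:
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=4))
    # add_YourServicer_to_server(YourServicer(), server)  # register servicers here
    server.add_insecure_port("[::]:50051")
    server.start()

    def handle_shutdown(signum, frame) -> None:
        # stop(grace) rejects new RPCs right away and returns a threading.Event
        # that is set once in-flight RPCs drain or the grace period elapses.
        done = server.stop(grace=30)
        done.wait(timeout=35)

    signal.signal(signal.SIGTERM, handle_shutdown)
    signal.signal(signal.SIGINT, handle_shutdown)
    server.wait_for_termination()


if __name__ == "__main__":
    serve()
```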

@vigith
Member

vigith commented Jul 18, 2025

@kohlisid does python gRPC support drain/shutdown mode?

@sapkota-aayush
Contributor Author

@sapkota-aayush Would you want to test a few scenarios with FMEA?

  1. Scale-down events where pods are killed
  2. Panic in the user code (random)
  3. Panic in the user code (consistent)

We also want to note the behaviour of events after the shutdown/restart:

  1. Are the pods coming back up seamlessly, or are there issues during server startup?
  2. Are events that were left midway through processing getting reprocessed?

The ideal end goal for a clean shutdown: when we get a shutdown signal, we would like to close the server to any new incoming events, let the in-flight events process/drain out, and then shut down the orchestrator and the server.

Hi @kohlisid,

Sorry for getting back to this late!

Thanks for the detailed testing scenarios.
I haven’t written tests for scaledown/pod-kill scenarios before.

Do you want me to:

  1. Perform these tests manually and share the results, or
  2. Write automated test cases for them as part of this PR?

This will help me approach it the right way.

Development

Successfully merging this pull request may close these issues.

Graceful shutdown of gRPC servers when there are exceptions in the User Code
3 participants