-
Notifications
You must be signed in to change notification settings - Fork 124
Factory: Improved Shutdown of Local Workers #4196
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Factory: Improved Shutdown of Local Workers #4196
Conversation
…riendly kill modes. Conflicts: batch_job/src/batch_queue_internal.h
And, for local workers, send SIGKILL after 30s until gone.
8733862
into
cooperative-computing-lab:master
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR improves the shutdown process for local workers by implementing a more graceful termination approach. Instead of immediately killing workers, it now sends SIGQUIT first to allow cleanup, waits for completion, and then escalates to SIGKILL if needed.
Key changes include:
- Added a new
batch_queue_remove_mode_t
enum to distinguish between friendly (SIGQUIT) and forceful (SIGKILL) removal modes - Updated all batch queue remove functions to accept the new mode parameter
- Implemented enhanced worker shutdown logic that waits for graceful termination before escalating
Reviewed Changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 2 comments.
Show a summary per file
File | Description |
---|---|
makeflow/src/makeflow.c | Updated batch_queue_remove calls to use FRIENDLY mode |
batch_job/src/work_queue_factory.c | Added new worker shutdown functions with timeout-based escalation |
batch_job/src/vine_factory.c | Added similar shutdown logic with different timeout behavior |
batch_job/src/batch_queue_local.c | Implemented friendly vs kill signal handling (SIGQUIT vs SIGKILL) |
batch_job/src/batch_queue.h | Added BATCH_QUEUE_REMOVE_MODE enum definition |
batch_job/src/batch_queue.c | Updated remove function signature to accept mode parameter |
Various batch queue modules | Updated remove function signatures to match new interface |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
while(itable_size(job_table)) { | ||
wait_all_workers(queue,job_table,time(0)+worker_kill_timeout); | ||
remove_all_workers(queue,job_table,BATCH_QUEUE_REMOVE_MODE_KILL); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic appears to have an infinite loop potential. If remove_all_workers
with KILL mode fails to remove workers and itable_size(job_table)
remains non-zero, this could loop indefinitely. Consider adding a maximum retry count or break condition.
while(itable_size(job_table)) { | |
wait_all_workers(queue,job_table,time(0)+worker_kill_timeout); | |
remove_all_workers(queue,job_table,BATCH_QUEUE_REMOVE_MODE_KILL); | |
} | |
int max_retries = 10; | |
int retry_count = 0; | |
while(itable_size(job_table) && retry_count < max_retries) { | |
wait_all_workers(queue,job_table,time(0)+worker_kill_timeout); | |
remove_all_workers(queue,job_table,BATCH_QUEUE_REMOVE_MODE_KILL); | |
retry_count++; | |
} | |
if(itable_size(job_table)) { | |
debug(D_WQ, "Failed to remove all workers after %d retries.", max_retries); | |
} |
Copilot uses AI. Check for mistakes.
For non-local batch queues, we skip this be because our ability to | ||
wait on removed jobs is unreliable. For local jobs, we follow up | ||
the friendly remove after 30s with an unfriendly kill every 5s, | ||
and we wait for children to exit to ensure cleanup is complete. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment mentions killing 'every 5s' but the implementation shows wait_all_workers(queue,job_table,time(0)+5)
which waits for 5 seconds, then removes all workers, then waits again. This could be clearer about the actual behavior.
For non-local batch queues, we skip this be because our ability to | |
wait on removed jobs is unreliable. For local jobs, we follow up | |
the friendly remove after 30s with an unfriendly kill every 5s, | |
and we wait for children to exit to ensure cleanup is complete. | |
For non-local batch queues, we skip this because our ability to | |
wait on removed jobs is unreliable. For local jobs, after the friendly remove and 30s wait, | |
we immediately attempt an unfriendly kill, then wait 5s, and repeat this process until all workers exit, | |
ensuring cleanup is complete. |
Copilot uses AI. Check for mistakes.
* Add batch_queue_remove_mode_t parameter to allow for friendly and unfriendly kill modes. Conflicts: batch_job/src/batch_queue_internal.h * format * Modify factory to shutdown workers in parallel. And, for local workers, send SIGKILL after 30s until gone. * Adjust makeflow to use batch_queue_remove_mode_t. * Extend worker kill from vine_factory to work_queue_factory. * Remove mistaken prune * Remove unused int count * Remove unused int count * After waiting 30s, send the kill signal every 5s until worker exited.
Proposed Changes
Local workers are now shutdown by sending concurrent SIGQUIT,
then waiting for completion, then sending SIGKILL every 30s.
Fixes #3311
Merge Checklist
The following items must be completed before PRs can be merged.
Check these off to verify you have completed all steps.
make test
Run local tests prior to pushing.make format
Format source code to comply with lint policies. Note that some lint errors can only be resolved manually (e.g., Python)make lint
Run lint on source code prior to pushing.