Skip to content

Conversation

dthain
Copy link
Member

@dthain dthain commented Jun 27, 2025

Proposed Changes

Local workers are now shutdown by sending concurrent SIGQUIT,
then waiting for completion, then sending SIGKILL every 30s.
Fixes #3311

Merge Checklist

The following items must be completed before PRs can be merged.
Check these off to verify you have completed all steps.

  • make test Run local tests prior to pushing.
  • make format Format source code to comply with lint policies. Note that some lint errors can only be resolved manually (e.g., Python)
  • make lint Run lint on source code prior to pushing.
  • Manual Update: Update the manual to reflect user-visible changes.
  • Type Labels: Select a github label for the type: bugfix, enhancement, etc.
  • Product Labels: Select a github label for the product: TaskVine, Makeflow, etc.
  • PR RTM: Mark your PR as ready to merge.

@btovar btovar modified the milestones: 7.15.6, 7.15.7 Jul 3, 2025
@btovar btovar merged commit 8733862 into cooperative-computing-lab:master Sep 10, 2025
21 of 22 checks passed
@tphung3 tphung3 requested a review from Copilot September 22, 2025 18:27
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR improves the shutdown process for local workers by implementing a more graceful termination approach. Instead of immediately killing workers, it now sends SIGQUIT first to allow cleanup, waits for completion, and then escalates to SIGKILL if needed.

Key changes include:

  • Added a new batch_queue_remove_mode_t enum to distinguish between friendly (SIGQUIT) and forceful (SIGKILL) removal modes
  • Updated all batch queue remove functions to accept the new mode parameter
  • Implemented enhanced worker shutdown logic that waits for graceful termination before escalating

Reviewed Changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
makeflow/src/makeflow.c Updated batch_queue_remove calls to use FRIENDLY mode
batch_job/src/work_queue_factory.c Added new worker shutdown functions with timeout-based escalation
batch_job/src/vine_factory.c Added similar shutdown logic with different timeout behavior
batch_job/src/batch_queue_local.c Implemented friendly vs kill signal handling (SIGQUIT vs SIGKILL)
batch_job/src/batch_queue.h Added BATCH_QUEUE_REMOVE_MODE enum definition
batch_job/src/batch_queue.c Updated remove function signature to accept mode parameter
Various batch queue modules Updated remove function signatures to match new interface

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Comment on lines +636 to +639
while(itable_size(job_table)) {
wait_all_workers(queue,job_table,time(0)+worker_kill_timeout);
remove_all_workers(queue,job_table,BATCH_QUEUE_REMOVE_MODE_KILL);
}
Copy link
Preview

Copilot AI Sep 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic appears to have an infinite loop potential. If remove_all_workers with KILL mode fails to remove workers and itable_size(job_table) remains non-zero, this could loop indefinitely. Consider adding a maximum retry count or break condition.

Suggested change
while(itable_size(job_table)) {
wait_all_workers(queue,job_table,time(0)+worker_kill_timeout);
remove_all_workers(queue,job_table,BATCH_QUEUE_REMOVE_MODE_KILL);
}
int max_retries = 10;
int retry_count = 0;
while(itable_size(job_table) && retry_count < max_retries) {
wait_all_workers(queue,job_table,time(0)+worker_kill_timeout);
remove_all_workers(queue,job_table,BATCH_QUEUE_REMOVE_MODE_KILL);
retry_count++;
}
if(itable_size(job_table)) {
debug(D_WQ, "Failed to remove all workers after %d retries.", max_retries);
}

Copilot uses AI. Check for mistakes.

Comment on lines +641 to +644
For non-local batch queues, we skip this be because our ability to
wait on removed jobs is unreliable. For local jobs, we follow up
the friendly remove after 30s with an unfriendly kill every 5s,
and we wait for children to exit to ensure cleanup is complete.
Copy link
Preview

Copilot AI Sep 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment mentions killing 'every 5s' but the implementation shows wait_all_workers(queue,job_table,time(0)+5) which waits for 5 seconds, then removes all workers, then waits again. This could be clearer about the actual behavior.

Suggested change
For non-local batch queues, we skip this be because our ability to
wait on removed jobs is unreliable. For local jobs, we follow up
the friendly remove after 30s with an unfriendly kill every 5s,
and we wait for children to exit to ensure cleanup is complete.
For non-local batch queues, we skip this because our ability to
wait on removed jobs is unreliable. For local jobs, after the friendly remove and 30s wait,
we immediately attempt an unfriendly kill, then wait 5s, and repeat this process until all workers exit,
ensuring cleanup is complete.

Copilot uses AI. Check for mistakes.

btovar pushed a commit that referenced this pull request Sep 25, 2025
* Add batch_queue_remove_mode_t parameter to allow for friendly and unfriendly kill modes.

Conflicts:
	batch_job/src/batch_queue_internal.h

* format

* Modify factory to shutdown workers in parallel.
And, for local workers, send SIGKILL after 30s until gone.

* Adjust makeflow to use batch_queue_remove_mode_t.

* Extend worker kill from vine_factory to work_queue_factory.

* Remove mistaken prune

* Remove unused int count

* Remove unused int count

* After waiting 30s, send the kill signal every 5s until worker exited.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

vine: factory is slow to shutdown workers
2 participants