Skip to content

Fix that all slurm jobs were canceled if a single job failed #1347

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

daniel-wer
Copy link
Member

@daniel-wer daniel-wer commented Aug 2, 2025

Description:

  • It's a bit unexpected why this was an issue, but here's the gist:
    • In a PR a couple of weeks ago I fixed that if an exception occurs inside a with executor block, shutdown doesn't wait forever for all jobs to finish which caused deadlocks (because the exception might occur before the jobs were scheduled and they won't ever finish).
    • What I didn't realize is that voxelytics relies on the fact that if a job is killed, due to a time or memory limit exception or simply fails due to another error, the cluster tools actually wait for all other jobs to finish before leaving the with block (shutdown with wait=True). This is somewhat obscure and we should think about making that more obvious and deliberate, but I restored that behavior for now.
  • I added a regression test and made sure that it failed before and succeeds with this PR

Issues:

  • fixes that all slurm jobs were canceled if a single job failed

Todos:

Make sure to delete unnecessary points or to check all before merging:

  • Updated Changelog
  • Added / Updated Tests

@daniel-wer daniel-wer self-assigned this Aug 2, 2025
@daniel-wer daniel-wer added the bug label Aug 2, 2025
Copy link

github-actions bot commented Aug 2, 2025

☂️ Python Coverage

current status: ✅

Overall Coverage

Lines Covered Coverage Threshold Status
9291 7841 84% 80% 🟢

New Files

No new covered files...

Modified Files

No covered modified files...

updated for commit: d1732fa by action🐍

Copy link
Member

@valentin-pinkau valentin-pinkau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants