fix(worker): retry heartbeat if JobDocument is not yet available #3205

Open
wants to merge 1 commit into
base: main

ArjunJagdale
Fixes #2532

Adds a retry loop in WorkerExecutor.heartbeat() to handle the case where the heartbeat is sent before the JobDocument has been persisted.

  • Retries up to 6 times
  • 0.5s delay between attempts
  • Only retries if the known error string "JobDocument matching query does not exist" is detected
  • Other exceptions still raise immediately

Fixes huggingface#2532

When a worker sends a heartbeat immediately after job assignment, it may race with job persistence, resulting in:
`JobDocument matching query does not exist`.

This patch adds retry logic to the `WorkerExecutor.heartbeat()` method in `executor.py`: if this specific error occurs, the heartbeat is retried a few times with a short delay between attempts. This avoids spurious heartbeat failures and unnecessary worker shutdowns when the job document is only briefly unavailable.

Retries: 6 attempts  
Delay: 0.5s between retries

Only the specific `JobDocument matching query does not exist` error is retried; other exceptions still raise immediately.
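
For readers who want to see the shape of the behaviour described above, here is a minimal, standalone sketch. It is not the code from this PR: `heartbeat_with_retry`, `send_heartbeat`, and the `sleep` parameter are hypothetical names introduced for illustration. It also assumes "6 attempts" means six calls in total, and that the error is detected by matching the message string.

```python
import time

# Values taken from the PR description; the names are hypothetical.
MAX_ATTEMPTS = 6
RETRY_DELAY_SECONDS = 0.5
JOB_NOT_FOUND_MESSAGE = "JobDocument matching query does not exist"


def heartbeat_with_retry(
    send_heartbeat,
    max_attempts=MAX_ATTEMPTS,
    delay=RETRY_DELAY_SECONDS,
    sleep=time.sleep,
):
    """Call send_heartbeat(), retrying only on the known transient error.

    send_heartbeat is a stand-in for whatever the real heartbeat does;
    sleep is injectable so the loop can be tested without waiting.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return send_heartbeat()
        except Exception as err:
            # Retry only when the message matches the known race condition.
            is_transient = JOB_NOT_FOUND_MESSAGE in str(err)
            # Any other error, or the last failed attempt, propagates as-is.
            if not is_transient or attempt == max_attempts:
                raise
            sleep(delay)
```

With these assumed defaults, a heartbeat that keeps failing waits 0.5s after each of the first five failures, so the worker gives up after roughly 2.5 seconds. Matching on the message string is brittle if the wording ever changes; catching the specific exception type would be an alternative, though the description does not say which approach the patch uses beyond the string check.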
Successfully merging this pull request may close these issues.

Fix worker crash when first heartbeat conflicts with job start