
Conversation

@daiping8 (Contributor) commented Aug 23, 2025

Why are these changes needed?

To close issue #55858

Problem Analysis

Let's analyze this from the code details, starting with the loop in job_manager.py that handles pending-job timeouts.

Normal process

When a job times out, its JobStatus is set to FAILED and the loop breaks. At the end of the function, ray.kill(job_supervisor, no_restart=True) is executed.

Faulty process

While a job is in the pending state, the node that submitted it can die for some reason and a new node is created. The new job manager, created after the raylet on the new node (the new raylet) starts, continuously monitors the job status through _monitor_job_internal.

As in the scenario described in #55858 (a new raylet is still starting up and the job happens to reach its timeout during that window), the loop that checks for pending-job timeouts terminates on its first iteration, so the code responsible for obtaining the handle never runs and the job_supervisor handle is never obtained.

As a result, ray.kill(job_supervisor, no_restart=True) is never executed and the supervisor actor leaks.

Solution

  • Fix the potential job actor leak by ensuring the actor handle is retrieved before termination (a minimal sketch follows this list).
  • Add a test to validate job timeout behavior during new raylet creation.
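
The gist of the fix, as a minimal sketch (the helper name and standalone form are assumptions for illustration; the actual change lives inside _monitor_job_internal in job_manager.py):

```python
import ray


def ensure_supervisor_killed(job_manager, job_id: str, job_supervisor) -> None:
    # If the pending-timeout check broke out of the monitoring loop before a
    # supervisor handle was ever obtained, look it up by job_id first.
    if job_supervisor is None:
        # _get_actor_for_job is the lookup named in this PR's summary; it is
        # assumed to return None when no supervisor actor exists for the job.
        job_supervisor = job_manager._get_actor_for_job(job_id)
    # Only kill when a handle actually exists, so the actor is never leaked
    # and a missing actor does not raise.
    if job_supervisor is not None:
        ray.kill(job_supervisor, no_restart=True)
```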

Related issue number

Closes #55858

Checks

  • I've signed off every commit (using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Note

Ensures _monitor_job_internal fetches the job supervisor actor before killing it to avoid leaks, and adds a test covering pending-job timeout during new raylet creation.

  • Jobs Manager (python/ray/dashboard/modules/job/job_manager.py):
    • Ensure cleanup: before killing, fetch job_supervisor via _get_actor_for_job(job_id) if missing to avoid actor leaks on early loop exit.
  • Tests (python/ray/dashboard/modules/job/tests/test_job_manager.py):
    • Add test_pending_job_timeout_during_raylet_creation to simulate new raylet creation, verify pending job times out, and confirm supervisor actor is removed.

Written by Cursor Bugbot for commit 5274288.
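
For the leak check in the new test, one way to confirm the supervisor actor was actually removed (a hypothetical helper; the merged test may verify this differently) is to scan Ray's named actors, since each job supervisor is a named actor whose name contains the job ID:

```python
from ray.util import list_named_actors


def supervisor_actor_removed(job_id: str) -> bool:
    # With all_namespaces=True this returns {"name", "namespace"} dicts for
    # every named actor in the cluster.
    actors = list_named_actors(all_namespaces=True)
    return not any(job_id in actor["name"] for actor in actors)
```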

@daiping8 daiping8 requested a review from a team as a code owner August 23, 2025 05:34
@gemini-code-assist (bot) left a comment


Code Review

This pull request addresses a potential job actor leak by ensuring the actor handle is retrieved before termination. The fix in job_manager.py is correct and directly solves the described issue. The accompanying test in test_job_manager.py effectively validates the fix by simulating a raylet restart and checking for proper job timeout and actor cleanup. I've provided a few suggestions to improve code conciseness in the fix and to enhance clarity in the new test.

@ray-gardener ray-gardener bot added core Issues that should be addressed in Ray Core community-contribution Contributed by the community labels Aug 23, 2025
@edoakes (Collaborator) commented Aug 25, 2025

@israbbani PTAL

@daiping8 daiping8 changed the title [Dashboard][Fix] Fix potential job actor leak by ensuring retrieval of actor instance before termination. [Dashboard] Fix potential job actor leak by ensuring retrieval of actor instance before termination. Sep 4, 2025
@jjyao (Collaborator) commented Sep 8, 2025

@daiping8 Raylet will not restart. If the raylet fails, the entire node is gone and a new node will be created. Could you update the PR description to be more accurate? I think you mean GCS FT?

@daiping8 (Contributor, Author) commented Sep 9, 2025

@jjyao Thank you for your attention. I have clarified the relevant content in the descriptions of the issue and PR. The core problem is that the job_supervisor handle can be None in specific scenarios, thereby preventing resource release.

This scenario is an extreme edge case, which makes it difficult to describe and understand. Thank you again for your patience.

@jjyao (Collaborator) commented Sep 9, 2025

@daiping8 Could you rebase? I just merged my fix.

…ive kill logic.

- Add test to verify timeout handling for pending jobs during new raylet creation.

Signed-off-by: daiping8 <[email protected]>
@github-actions

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Sep 24, 2025
@edoakes edoakes added the go add ONLY when ready to merge, run all tests label Sep 25, 2025
@github-actions github-actions bot added unstale A PR that has been marked unstale. It will not get marked stale again if this label is on it. and removed stale The issue is stale. It will be closed within 7 days unless there are further conversation labels Sep 26, 2025
@edoakes (Collaborator) commented Oct 7, 2025

@daiping8 @jjyao there is really no reason for us to be running the JobManager on every node. This is an artifact of a half-finished feature to load balance job submission across nodes, which is not really necessary because we can still schedule drivers across the cluster while only having the job manager on the head node.

So my preferred fix for this would actually be to simplify things by running the JobManager only on the head node and removing the _recover_running_jobs logic entirely. What do you think?


@jjyao (Collaborator) commented Oct 8, 2025

So my preferred fix for this would actually be to simplify things by running the JobManager only on the head node and removing the _recover_running_jobs logic entirely. What do you think?

@edoakes Yeah, strictly speaking we don't officially support GCS FT for jobs, so we don't need _recover_running_jobs. But in reality, many OSS users still run GCS FT for Ray jobs, which is why _recover_running_jobs was introduced in the first place.

… cleanup

Change-Id: I2786781cc39bc6af76e94318ce47160341f8b656
Signed-off-by: daiping8 <[email protected]>
…ead creation

- Introduced `MockTimer` class for simulating time manipulations in tests.
- Added test to validate timeout behavior for pending jobs during new head node creation.

Change-Id: Id13ea5bb41139bf0764ac3643b2e19a403a03a0b
Signed-off-by: daiping8 <[email protected]>

@daiping8 (Contributor, Author) commented:

@jjyao Sorry for my late reply.

  1. Following your suggestions, I've rewritten the test cases. Please review.

  2. Should we uniformly change new raylet in the PR description to new head node for better clarity? As mentioned above, the JobManager might only run on the head node in the future. Let me know if this change is needed.

Change-Id: Ie3962b7f8644367e7f9bf940baef828dbb02fb56
Signed-off-by: daiping8 <[email protected]>
- Introduced `Timer` class to replace direct `time.time()` calls, enhancing test maintainability.
- Refactored `__init__` to include a `timer` parameter for dependency injection.
- Updated `test_job_manager.py` tests to use `FakeTimer` instead of `MockTimer`.
- Renamed `MockTimer` to `FakeTimer` for better clarity of purpose.

Change-Id: Ia5351a1745cda42b3057e0ff70ccb02bd50cf538
Signed-off-by: daiping8 <[email protected]>

- Moved `Timer` and `TimerBase` from `serve._private.utils` to `_common.utils` for better reusability.
- Updated relevant imports across modules.
- Refactored `JobManager` to use `timeout_check_timer` for enhanced clarity.
- Adjusted related tests to align with new `Timer` implementation.

Change-Id: I28717c4ed0fe9322c29d03f799a7a200c8ee0aa3
Signed-off-by: daiping8 <[email protected]>
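
The commits above replace direct time.time() calls with an injectable clock. A generic sketch of that pattern (class shapes are assumptions, not necessarily the exact Timer/TimerBase now in _common.utils): the production Timer defers to the real clock, while a test constructs the JobManager with a FakeTimer and calls advance() to push a pending job past its timeout without real waiting.

```python
import time


class TimerBase:
    """Clock interface injected into timeout checks."""

    def time(self) -> float:
        raise NotImplementedError


class Timer(TimerBase):
    """Production clock: defers to time.time()."""

    def time(self) -> float:
        return time.time()


class FakeTimer(TimerBase):
    """Test clock: time only moves when advance() is called."""

    def __init__(self, start: float = 0.0):
        self._now = start

    def time(self) -> float:
        return self._now

    def advance(self, seconds: float) -> None:
        self._now += seconds
```
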
@daiping8 daiping8 requested a review from a team as a code owner October 16, 2025 02:57
- Replaced `start_timer.advance` with `FakeTimer` instance in the test.
- Aligned job manager initialization to use the updated timer instance for consistency.

Change-Id: I4a7adac95e9e8575d51b171946fec8bf5cb8b466
Signed-off-by: daiping8 <[email protected]>
Change-Id: I7f304b4d1ed2e8f1bffd83be38a9a457569fbad6
@daiping8 (Contributor, Author) commented:

@jjyao As per your guidance, all revisions have been completed. Please review again.

@jjyao (Collaborator) left a comment


Very sorry for the late review

jjyao and others added 3 commits December 3, 2025 20:45
Change-Id: Ice13a8aaf1db5a07b05c8f54a2eccbb55b91b3a2
Change-Id: I1cdaae76d9db40441a7b7816f74a8764059dc54a
Signed-off-by: daiping8 <[email protected]>
@daiping8 (Contributor, Author) commented Dec 4, 2025

@jjyao A doc test unrelated to this PR has failed. I'm attempting to fix this issue in PR #59169, so please merge PR #59169 first.

jjyao and others added 2 commits December 4, 2025 11:42
…extra spaces

Change-Id: Ie1dddf8a96fd6b7cd8e4059f1597fd0f35765cdb
Signed-off-by: daiping8 <[email protected]>
@daiping8 daiping8 requested a review from a team as a code owner December 5, 2025 01:51
@daiping8 (Contributor, Author) commented Dec 5, 2025

@jjyao All CI tests have passed! Thanks.

@jjyao jjyao merged commit 8b488eb into ray-project:master Dec 5, 2025
6 checks passed

Labels

  • community-contribution: Contributed by the community
  • core: Issues that should be addressed in Ray Core
  • go: add ONLY when ready to merge, run all tests
  • unstale: A PR that has been marked unstale. It will not get marked stale again if this label is on it.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Dashboard] Job supervisor actor leak when job times out during new raylet startup

5 participants