
Conversation

@daiping8 (Contributor) commented Aug 23, 2025

Why are these changes needed?

To close issue #55858

Problem Analysis

Let's analyze this from the code details, starting with the loop in job_manager.py that handles pending-job timeouts.

Normal process

When a job times out, its JobStatus is set to FAILED and the loop breaks. At the end of the function, ray.kill(job_supervisor, no_restart=True) is executed.

Faulty process

While a job is in the pending state, the node that submitted it can die for some reason and a new node is created. The new job manager, created after the raylet on the new node (the new raylet) starts, continuously monitors the job status through _monitor_job_internal.

As in the scenario described in #55858 (a new raylet is still starting up and the job happens to reach its timeout during that window), the loop that checks for pending-job timeouts terminates on its first iteration, so the code responsible for obtaining the handle never runs and the job_supervisor handle is never obtained.

As a result, ray.kill(job_supervisor, no_restart=True) is never executed and the supervisor actor leaks.

Solution

  • Fix the potential job actor leak by ensuring the actor handle is retrieved before termination (a minimal sketch follows this list).
  • Add a test to validate job timeout behavior during new raylet creation.
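
The gist of the fix, as a minimal sketch (the helper name and standalone form are assumptions for illustration; the actual change lives inside _monitor_job_internal in job_manager.py):

```python
import ray


def ensure_supervisor_killed(job_manager, job_id: str, job_supervisor) -> None:
    # If the pending-timeout check broke out of the monitoring loop before a
    # supervisor handle was ever obtained, look it up by job_id first.
    if job_supervisor is None:
        # _get_actor_for_job is the lookup named in this PR's summary; it is
        # assumed to return None when no supervisor actor exists for the job.
        job_supervisor = job_manager._get_actor_for_job(job_id)
    # Only kill when a handle actually exists, so the actor is never leaked
    # and a missing actor does not raise.
    if job_supervisor is not None:
        ray.kill(job_supervisor, no_restart=True)
```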

Related issue number

Closes #55858

Checks

  • I've signed off every commit (using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Note

Ensures _monitor_job_internal fetches the job supervisor actor before killing it to avoid leaks, and adds a test covering pending-job timeout during new raylet creation.

  • Jobs Manager (python/ray/dashboard/modules/job/job_manager.py):
    • Ensure cleanup: before killing, fetch job_supervisor via _get_actor_for_job(job_id) if missing to avoid actor leaks on early loop exit.
  • Tests (python/ray/dashboard/modules/job/tests/test_job_manager.py):
    • Add test_pending_job_timeout_during_raylet_creation to simulate new raylet creation, verify pending job times out, and confirm supervisor actor is removed.

Written by Cursor Bugbot for commit 5274288.
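
For the leak check in the new test, one way to confirm the supervisor actor was actually removed (a hypothetical helper; the merged test may verify this differently) is to scan Ray's named actors, since each job supervisor is a named actor whose name contains the job ID:

```python
from ray.util import list_named_actors


def supervisor_actor_removed(job_id: str) -> bool:
    # With all_namespaces=True this returns {"name", "namespace"} dicts for
    # every named actor in the cluster.
    actors = list_named_actors(all_namespaces=True)
    return not any(job_id in actor["name"] for actor in actors)
```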

@daiping8 daiping8 requested a review from a team as a code owner August 23, 2025 05:34
@gemini-code-assist (bot) left a comment


Code Review

This pull request addresses a potential job actor leak by ensuring the actor handle is retrieved before termination. The fix in job_manager.py is correct and directly solves the described issue. The accompanying test in test_job_manager.py effectively validates the fix by simulating a raylet restart and checking for proper job timeout and actor cleanup. I've provided a few suggestions to improve code conciseness in the fix and to enhance clarity in the new test.

@ray-gardener ray-gardener bot added core Issues that should be addressed in Ray Core community-contribution Contributed by the community labels Aug 23, 2025
@edoakes (Collaborator) commented Aug 25, 2025

@israbbani PTAL

@daiping8 daiping8 changed the title [Dashboard][Fix] Fix potential job actor leak by ensuring retrieval of actor instance before termination. [Dashboard] Fix potential job actor leak by ensuring retrieval of actor instance before termination. Sep 4, 2025
@jjyao (Collaborator) commented Sep 8, 2025

@daiping8 Raylet will not restart. If the raylet fails, the entire node is gone and a new node will be created. Could you update the PR description to be more accurate? I think you mean GCS FT?

@daiping8 (Contributor, Author) commented Sep 9, 2025

@jjyao Thank you for your attention. I have clarified the relevant content in the descriptions of the issue and PR. The core problem is that the job_supervisor handle can be None in specific scenarios, thereby preventing resource release.

This scenario is an extreme edge case, which makes it difficult to describe and understand. Thank you again for your patience.

@jjyao (Collaborator) commented Sep 9, 2025

@daiping8 Could you rebase? I just merged my fix.

…ive kill logic.

- Add test to verify timeout handling for pending jobs during new raylet creation.

Signed-off-by: daiping8 <[email protected]>
@github-actions

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Sep 24, 2025
@edoakes edoakes added the go add ONLY when ready to merge, run all tests label Sep 25, 2025
@github-actions github-actions bot added unstale A PR that has been marked unstale. It will not get marked stale again if this label is on it. and removed stale The issue is stale. It will be closed within 7 days unless there are further conversation labels Sep 26, 2025
@edoakes (Collaborator) commented Oct 7, 2025

@daiping8 @jjyao there is really no reason for us to be running the JobManager on every node. This is an artifact of a half-finished feature to load balance job submission across nodes, which is not really necessary because we can still schedule drivers across the cluster while only having the job manager on the head node.

So my preferred fix for this would actually be to simplify things by running the JobManager only on the head node and removing the _recover_running_jobs logic entirely. What do you think?


@jjyao (Collaborator) commented Oct 8, 2025

So my preferred fix for this would actually be to simplify things by running the JobManager only on the head node and removing the _recover_running_jobs logic entirely. What do you think?

@edoakes Yeah, strictly speaking we don't officially support GCS FT for jobs, so we don't need _recover_running_jobs. But in reality, many OSS users still run GCS FT for Ray jobs, which is why _recover_running_jobs was introduced in the first place.

… cleanup

Change-Id: I2786781cc39bc6af76e94318ce47160341f8b656
Signed-off-by: daiping8 <[email protected]>
…ead creation

- Introduced `MockTimer` class for simulating time manipulations in tests.
- Added test to validate timeout behavior for pending jobs during new head node creation.

Change-Id: Id13ea5bb41139bf0764ac3643b2e19a403a03a0b
Signed-off-by: daiping8 <[email protected]>

@daiping8 (Contributor, Author) commented:

@jjyao Sorry for my late reply.

  1. Following your suggestions, I've rewritten the test cases. Please review.

  2. Should we uniformly change new raylet in the PR description to new head node for better clarity? As mentioned above, the JobManager might only run on the head node in the future. Let me know if this change is needed.

Change-Id: Ie3962b7f8644367e7f9bf940baef828dbb02fb56
Signed-off-by: daiping8 <[email protected]>
- Introduced `Timer` class to replace direct `time.time()` calls, enhancing test maintainability.
- Refactored `__init__` to include a `timer` parameter for dependency injection.
- Updated `test_job_manager.py` tests to use `FakeTimer` instead of `MockTimer`.
- Renamed `MockTimer` to `FakeTimer` for better clarity of purpose.

Change-Id: Ia5351a1745cda42b3057e0ff70ccb02bd50cf538
Signed-off-by: daiping8 <[email protected]>

- Moved `Timer` and `TimerBase` from `serve._private.utils` to `_common.utils` for better reusability.
- Updated relevant imports across modules.
- Refactored `JobManager` to use `timeout_check_timer` for enhanced clarity.
- Adjusted related tests to align with new `Timer` implementation.

Change-Id: I28717c4ed0fe9322c29d03f799a7a200c8ee0aa3
Signed-off-by: daiping8 <[email protected]>
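
The commits above replace direct time.time() calls with an injectable clock. A generic sketch of that pattern (class shapes are assumptions, not necessarily the exact Timer/TimerBase now in _common.utils): the production Timer defers to the real clock, while a test constructs the JobManager with a FakeTimer and calls advance() to push a pending job past its timeout without real waiting.

```python
import time


class TimerBase:
    """Clock interface injected into timeout checks."""

    def time(self) -> float:
        raise NotImplementedError


class Timer(TimerBase):
    """Production clock: defers to time.time()."""

    def time(self) -> float:
        return time.time()


class FakeTimer(TimerBase):
    """Test clock: time only moves when advance() is called."""

    def __init__(self, start: float = 0.0):
        self._now = start

    def time(self) -> float:
        return self._now

    def advance(self, seconds: float) -> None:
        self._now += seconds
```
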
@daiping8 daiping8 requested a review from a team as a code owner October 16, 2025 02:57
- Replaced `start_timer.advance` with `FakeTimer` instance in the test.
- Aligned job manager initialization to use the updated timer instance for consistency.

Change-Id: I4a7adac95e9e8575d51b171946fec8bf5cb8b466
Signed-off-by: daiping8 <[email protected]>
Change-Id: I7f304b4d1ed2e8f1bffd83be38a9a457569fbad6
@daiping8 (Contributor, Author) commented:

@jjyao As per your guidance, all revisions have been completed. Please review again.

@jjyao (Collaborator) left a comment


Very sorry for the late review

jjyao and others added 3 commits December 3, 2025 20:45
Change-Id: Ice13a8aaf1db5a07b05c8f54a2eccbb55b91b3a2
Change-Id: I1cdaae76d9db40441a7b7816f74a8764059dc54a
Signed-off-by: daiping8 <[email protected]>
@daiping8 (Contributor, Author) commented Dec 4, 2025

@jjyao A doc test unrelated to this PR has failed. I'm attempting to fix this issue in PR #59169, so please merge PR #59169 first.

jjyao and others added 2 commits December 4, 2025 11:42
…extra spaces

Change-Id: Ie1dddf8a96fd6b7cd8e4059f1597fd0f35765cdb
Signed-off-by: daiping8 <[email protected]>
@daiping8 daiping8 requested a review from a team as a code owner December 5, 2025 01:51
@daiping8 (Contributor, Author) commented Dec 5, 2025

@jjyao All CI tests have passed! Thanks.

@jjyao jjyao merged commit 8b488eb into ray-project:master Dec 5, 2025
6 checks passed

Labels

  • community-contribution: Contributed by the community
  • core: Issues that should be addressed in Ray Core
  • go: add ONLY when ready to merge, run all tests
  • unstale: A PR that has been marked unstale. It will not get marked stale again if this label is on it.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Dashboard] Job supervisor actor leak when job times out during new raylet startup

5 participants