[Dashboard] Fix potential job actor leak by ensuring retrieval of actor instance before termination. #55860
Conversation
Code Review
This pull request addresses a potential job actor leak by ensuring the actor handle is retrieved before termination. The fix in job_manager.py is correct and directly solves the described issue. The accompanying test in test_job_manager.py effectively validates the fix by simulating a raylet restart and checking for proper job timeout and actor cleanup. I've provided a few suggestions to improve code conciseness in the fix and to enhance clarity in the new test.
@israbbani PTAL
@daiping8 Raylet will not restart. If the raylet fails, the entire node is gone and a new node will be created. Could you update the PR description to be more accurate? I think you mean GCS FT?
@jjyao Thank you for your attention. I have clarified the relevant content in the descriptions of the issue and PR. This scenario is an extreme edge case, which makes it difficult to describe and understand. Thank you again for your patience.
@daiping8 could you rebase? I just merged my fix.
…ive kill logic. - Add test to verify timeout handling for pending jobs during new raylet creation. Signed-off-by: daiping8 <[email protected]>
Signed-off-by: daiping8 <[email protected]>
This pull request has been automatically marked as stale because it has not had recent activity. You can always ask for help on our discussion forum or Ray's public Slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed.
@daiping8 @jjyao there is really no reason for us to be running the So my preferred fix for this would actually be to simplify things by running the
Co-authored-by: Jiajun Yao <[email protected]> Signed-off-by: Ping Dai <[email protected]>
@edoakes yea, strictly speaking we don't officially support GCS FT for jobs, so we don't need
… cleanup Change-Id: I2786781cc39bc6af76e94318ce47160341f8b656 Signed-off-by: daiping8 <[email protected]>
…ead creation - Introduced `MockTimer` class for simulating time manipulations in tests. - Added test to validate timeout behavior for pending jobs during new head node creation. Change-Id: Id13ea5bb41139bf0764ac3643b2e19a403a03a0b Signed-off-by: daiping8 <[email protected]>
@jjyao Sorry for my late reply.
Change-Id: Ie3962b7f8644367e7f9bf940baef828dbb02fb56 Signed-off-by: daiping8 <[email protected]>
- Introduced `Timer` class to replace direct `time.time()` calls, enhancing test maintainability. - Refactored `__init__` to include a `timer` parameter for dependency injection. - Updated `test_job_manager.py` tests to use `FakeTimer` instead of `MockTimer`. - Renamed `MockTimer` to `FakeTimer` for better clarity of purpose. Change-Id: Ia5351a1745cda42b3057e0ff70ccb02bd50cf538 Signed-off-by: daiping8 <[email protected]>
- Moved `Timer` and `TimerBase` from `serve._private.utils` to `_common.utils` for better reusability. - Updated relevant imports across modules. - Refactored `JobManager` to use `timeout_check_timer` for enhanced clarity. - Adjusted related tests to align with new `Timer` implementation. Change-Id: I28717c4ed0fe9322c29d03f799a7a200c8ee0aa3 Signed-off-by: daiping8 <[email protected]>
- Replaced `start_timer.advance` with `FakeTimer` instance in the test. - Aligned job manager initialization to use the updated timer instance for consistency. Change-Id: I4a7adac95e9e8575d51b171946fec8bf5cb8b466 Signed-off-by: daiping8 <[email protected]>
Change-Id: I7f304b4d1ed2e8f1bffd83be38a9a457569fbad6
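The commits above describe a timer dependency-injection refactor so the tests can control time instead of calling `time.time()` directly. Below is a minimal, self-contained sketch of that pattern; all class and method names (`TimerBase`, `Timer`, `FakeTimer`, `PendingTimeoutChecker`) are illustrative assumptions, not the exact Ray API.

```python
import time
from typing import Optional


class TimerBase:
    def time(self) -> float:
        raise NotImplementedError


class Timer(TimerBase):
    """Production timer: simply delegates to time.time()."""

    def time(self) -> float:
        return time.time()


class FakeTimer(TimerBase):
    """Test timer: time only moves when advance() is called."""

    def __init__(self, start: float = 0.0):
        self._now = start

    def time(self) -> float:
        return self._now

    def advance(self, seconds: float) -> None:
        self._now += seconds


class PendingTimeoutChecker:
    """Takes the timer as a constructor argument (dependency injection), so a
    test can inject FakeTimer and hit the pending-timeout branch without sleeping."""

    def __init__(self, timeout_s: float, timer: Optional[TimerBase] = None):
        self._timer = timer or Timer()
        self._timeout_s = timeout_s
        self._start = self._timer.time()

    def timed_out(self) -> bool:
        return self._timer.time() - self._start > self._timeout_s


fake = FakeTimer()
checker = PendingTimeoutChecker(timeout_s=60, timer=fake)
assert not checker.timed_out()
fake.advance(61)  # jump past the pending timeout instantly, no real sleeping
assert checker.timed_out()
```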
@jjyao As per your guidance, all revisions have been completed. Please review again.
jjyao left a comment
Very sorry for the late review
Signed-off-by: Jiajun Yao <[email protected]>
Signed-off-by: Jiajun Yao <[email protected]>
Signed-off-by: Jiajun Yao <[email protected]>
Change-Id: Ice13a8aaf1db5a07b05c8f54a2eccbb55b91b3a2
Change-Id: I1cdaae76d9db40441a7b7816f74a8764059dc54a Signed-off-by: daiping8 <[email protected]>
…extra spaces Change-Id: Ie1dddf8a96fd6b7cd8e4059f1597fd0f35765cdb Signed-off-by: daiping8 <[email protected]>
@jjyao All CI tests have passed! Thanks.
Why are these changes needed?
To close issue #55858
Problem Analysis
Let's analyze this from the code details, starting with the code in `job_manager.py` that handles job pending timeouts inside the monitoring loop.

Normal process
When a job times out, its JobStatus is set to FAILED and the loop breaks. At the end of the function, `ray.kill(job_supervisor, no_restart=True)` is executed.

Faulty process
When a job is in the pending state and, for some reason, the old node that submitted the job dies, a new node is created. The new job manager, created after the raylet on the new node (the "new raylet") starts, continuously monitors the job status through `_monitor_job_internal`. In the scenario described in #55858 (the new raylet is still starting up and the job happens to reach its timeout during this window), the loop that checks for job pending timeouts terminates during its first iteration, so the code responsible for obtaining the `job_supervisor` handle is never executed. Because the handle was never obtained, `ray.kill(job_supervisor, no_restart=True)` is never called and the supervisor actor leaks.

Solution
Retrieve the `job_supervisor` actor handle before termination, so the kill is executed even when the loop exits on its first iteration.
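The following is a minimal, self-contained sketch of that pattern, not the actual Ray dashboard code: `JobSupervisorSketch`, `SUPERVISOR_NAME`, and `get_actor_for_job_sketch` are illustrative stand-ins for the real `JobSupervisor` actor, its internal name, and `_get_actor_for_job`.

```python
import ray

ray.init()


@ray.remote
class JobSupervisorSketch:
    """Stand-in for the real JobSupervisor actor."""

    def ping(self):
        return "ok"


SUPERVISOR_NAME = "job_supervisor_demo"  # illustrative name, not Ray's real naming scheme

# The supervisor is a named, detached actor, like the real one.
handle = JobSupervisorSketch.options(name=SUPERVISOR_NAME, lifetime="detached").remote()
ray.get(handle.ping.remote())  # make sure the actor is up before the demo


def get_actor_for_job_sketch(name: str):
    """Analogue of _get_actor_for_job: look the actor up by name, or return None."""
    try:
        return ray.get_actor(name)
    except ValueError:
        return None


# Simulate the faulty path: the monitoring loop exited on its first iteration
# (the pending timeout had already elapsed), so no handle was ever captured.
job_supervisor = None

# The fix: fetch the handle if it is missing before the final kill, so the
# supervisor actor cannot be leaked by an early loop exit.
if job_supervisor is None:
    job_supervisor = get_actor_for_job_sketch(SUPERVISOR_NAME)
if job_supervisor is not None:
    ray.kill(job_supervisor, no_restart=True)
```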
Related issue number
Closes #55858
Checks
- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I'm adding a new method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.

Note
Ensures `_monitor_job_internal` fetches the job supervisor actor before killing it to avoid leaks, and adds a test covering pending-job timeout during new raylet creation.
- `python/ray/dashboard/modules/job/job_manager.py`: fetch `job_supervisor` via `_get_actor_for_job(job_id)` if missing, to avoid actor leaks on early loop exit.
- `python/ray/dashboard/modules/job/tests/test_job_manager.py`: add `test_pending_job_timeout_during_raylet_creation` to simulate new raylet creation, verify the pending job times out, and confirm the supervisor actor is removed.

Written by Cursor Bugbot for commit 5274288. This will update automatically on new commits.
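For the cleanup check mentioned above, here is a small hedged sketch of how a test could confirm the supervisor actor is gone; the helper name and the actor-name argument are illustrative, not the exact assertions in `test_job_manager.py`.

```python
import pytest
import ray


def assert_supervisor_actor_removed(actor_name: str) -> None:
    """ray.get_actor raises ValueError once the named actor no longer exists,
    which is a simple way for a test to confirm the supervisor was not leaked."""
    with pytest.raises(ValueError):
        ray.get_actor(actor_name)
```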