Ray Actor Timeout breaks cluster in that workers can no longer be ssh'd

### What happened + What you expected to happen

When requesting resources for a Ray cluster, an actor timeout (which could be caused by an error in the minimal code example) seems to leave the Ray workers in a state where they no longer respond to SSH.



1) `ray up ray.yml`
2) `ray dashboard ray.yml`
3) `seq 6 | parallel -n0 ray job submit --entrypoint-num-gpus 1 --entrypoint-num-cpus 24 --working-dir . -- nvcc --version` (this successfully brings up 6 GPU workers)
4) `time ray job submit --runtime-env ./ray_runtime_env.yml --address http://localhost:8265 -- python test.py` (this is the problematic command: the worker times out and control returns to the command line)
5) `seq 6 | parallel -n0 ray job submit --entrypoint-num-gpus 1 --entrypoint-num-cpus 24 --working-dir . -- nvcc --version` (the same command that ran fine in step 3; this time the Ray head node gets stuck, unable to communicate with the workers)
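For convenience, the steps above can be collected into one script. This is only a sketch: the helper names (`run`, `submit_gpu_jobs`, `repro`) and the `DRY_RUN` variable are my own and not part of Ray, the six submissions are backgrounded with `&` instead of GNU `parallel`, the `time` wrapper from step 4 is dropped, and `ray dashboard` is assumed to be fine running in the background. By default it only prints the commands.

```shell
#!/usr/bin/env sh
# Sketch of the reproduction steps. Helper names and DRY_RUN are hypothetical.
# DRY_RUN defaults to 1, so the commands are printed instead of executed.
set -u
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Steps 3 and 5: submit six 1-GPU / 24-CPU jobs concurrently.
submit_gpu_jobs() {
  pids=""
  for _ in 1 2 3 4 5 6; do
    run ray job submit --entrypoint-num-gpus 1 --entrypoint-num-cpus 24 \
      --working-dir . -- nvcc --version &
    pids="$pids $!"
  done
  # Wait only for the submissions, not for the background dashboard.
  wait $pids
}

repro() {
  run ray up ray.yml                       # step 1
  run ray dashboard ray.yml &              # step 2 (assumed OK in the background)
  dashboard_pid=$!
  submit_gpu_jobs                          # step 3: works
  run ray job submit --runtime-env ./ray_runtime_env.yml \
    --address http://localhost:8265 -- python test.py   # step 4: times out
  submit_gpu_jobs                          # step 5: head node gets stuck here
  kill "$dashboard_pid" 2>/dev/null || true
}

repro
```

Running it with `DRY_RUN=0` executes the commands for real against the cluster defined in `ray.yml`.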


When looking at the monitor.out file, I see that SSH to the workers is not working. However, when I attach to the head node and SSH from there, I can reach the workers. (The workers also show as running fine in the Azure console.)



### Versions / Dependencies

ray 3.0.0.dev


### Reproduction script


[ray.yml.txt](https://github.com/user-attachments/files/17300561/ray.yml.txt)
[ray-env2.yml.txt](https://github.com/user-attachments/files/17300551/ray-env2.yml.txt)
[test2.py.txt](https://github.com/user-attachments/files/17300552/test2.py.txt)
[ray_runtime_env2.yml.txt](https://github.com/user-attachments/files/17300553/ray_runtime_env2.yml.txt)


### Issue Severity

High: It blocks me from completing my task.
