Open
Labels
- P2: Important issue, but not time-critical
- azure
- bug: Something that is supposed to be working; but isn't
- core: Issues that should be addressed in Ray Core
- core-clusters: For launching and managing Ray clusters/jobs/kubernetes
Description
What happened + What you expected to happen
When requesting resources for a Ray cluster, if the actor times out (this could be due to an error in the minimal code example), the Ray workers appear to be left in a state where they no longer respond to SSH.
1. ray up ray.yml
2. ray dashboard ray.yml
3. seq 6 | parallel -n0 ray job submit --entrypoint-num-gpus 1 --entrypoint-num-cpus 24 --working-dir . -- nvcc --version // this successfully loads 6 GPU workers
4. time ray job submit --runtime-env ./ray_runtime_env.yml --address http://localhost:8265 -- python test.py // this is the problematic command: the worker times out and control returns to the command line
5. seq 6 | parallel -n0 ray job submit --entrypoint-num-gpus 1 --entrypoint-num-cpus 24 --working-dir . -- nvcc --version // the same command that ran fine at step 3; this time the Ray head node gets stuck, unable to communicate with the workers
When looking at the monitor.out file, I see that SSH to the workers is not working. However, when I attach to the head node and SSH from there manually, I can get into the workers. (The workers are also shown as running fine in the Azure console.)
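To help separate a network-level problem from a problem in the autoscaler's own SSH handling, a small check like the following could be run on the head node after step 4. This is only a sketch: the helper names `ssh_port_open` and `check_workers` are made up for illustration, the worker IPs have to be supplied by hand (for example from the Azure console), and it only tests whether TCP port 22 accepts a connection, not whether authentication succeeds.

```python
import socket
import sys


def ssh_port_open(host: str, port: int = 22, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper: it only checks that something accepts connections
    on the port, it does not perform an SSH handshake or login.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_workers(ips, port: int = 22, timeout: float = 3.0) -> dict:
    """Map each worker IP to whether its SSH port is reachable."""
    return {ip: ssh_port_open(ip, port, timeout) for ip in ips}


if __name__ == "__main__":
    # Usage (on the head node): python check_ssh.py <worker-ip> [<worker-ip> ...]
    for ip, ok in check_workers(sys.argv[1:]).items():
        print(f"{ip}: {'reachable' if ok else 'UNREACHABLE'}")
```

If every worker reports as reachable while monitor.out still shows SSH failures, that would point at the autoscaler's SSH path rather than at the workers or the network.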
Versions / Dependencies
ray 3.0.0.dev
Reproduction script
ray.yml.txt
ray-env2.yml.txt
test2.py.txt
ray_runtime_env2.yml.txt
Issue Severity
High: It blocks me from completing my task.