Description
Search before asking
- I searched the issues and found no similar issues.
KubeRay Component
ray-operator
What happened + What you expected to happen
When a RayJob is labeled with the volcano scheduler and a queue, the ray operator crashes. The RayCluster object is created with no status, and none of the relevant cluster pods are created. The RayJob used is the sample job provided in the examples. This happens in both v0.6.0 and v1.0.0.
I tried the same setup with a RayCluster, and everything works as expected.
The log from the ray operator is the following:
INFO controllers.RayCluster reconcileHeadService {"1 head service found": "rayjob-sample-raycluster-klhw9-head-svc"}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x13ad868]
goroutine 786 [running]:
github.com/ray-project/kuberay/ray-operator/controllers/ray/batchscheduler/volcano.(*VolcanoBatchScheduler).DoBatchSchedulingOnSubmission(0xc001758368?, 0xc008176500)
/workspace/controllers/ray/batchscheduler/volcano/volcano_scheduler.go:55 +0xe8
github.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayClusterReconciler).reconcilePods(0xc0003054f0, {0x194d558, 0xc008536ed0}, 0xc008176500)
/workspace/controllers/ray/raycluster_controller.go:550 +0x23f
github.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayClusterReconciler).rayClusterReconcile(0xc0003054f0, {0x194d558, 0xc008536ed0}, {{{0xc00812ee68, 0x8}, {0xc00814ab00, 0x1e}}}, 0xc008176500)
/workspace/controllers/ray/raycluster_controller.go:340 +0xea8
github.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayClusterReconciler).Reconcile(0xc0003054f0, {0x194d558, 0xc008536ed0}, {{{0xc00812ee68, 0x8}, {0xc00814ab00, 0x1e}}})
/workspace/controllers/ray/raycluster_controller.go:158 +0x21e
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0xc00025edc0, {0x194d558, 0xc008536e10}, {{{0xc00812ee68?, 0x163d900?}, {0xc00814ab00?, 0x4045d4?}}})
/opt/app-root/src/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:114 +0x28b
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc00025edc0, {0x194d4b0, 0xc000d2f040}, {0x158c6a0?, 0xc006dccf00?})
/opt/app-root/src/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:311 +0x352
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc00025edc0, {0x194d4b0, 0xc000d2f040})
/opt/app-root/src/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266 +0x1d9
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
/opt/app-root/src/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227 +0x85
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
/opt/app-root/src/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:223 +0x31c
Reproduced steadily. Volcano works fine with RayCluster; it is only problematic with RayJob (both v0.6.0 and v1.0.0).
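For context: the trace dies inside DoBatchSchedulingOnSubmission at volcano_scheduler.go:55 with addr=0x0, the classic signature of dereferencing an unset optional field. A minimal, self-contained sketch of that failure mode (the type and field names below are hypothetical, not the actual KubeRay source):
// Minimal sketch of the failure mode, assuming the panic comes from an
// unchecked optional pointer field; all names here are illustrative.
package main

import "fmt"

// Optional fields on Kubernetes specs are typically pointers, so they can
// be nil when the owning object is built programmatically (e.g. a
// RayCluster derived from a RayJob's rayClusterSpec) instead of coming
// from a user manifest that went through defaulting.
type workerGroupSpec struct {
	Replicas *int32
}

func main() {
	var group workerGroupSpec // Replicas was never set, so it is nil

	// Dereferencing without a guard panics with the same signature as the
	// trace above: "invalid memory address or nil pointer dereference".
	fmt.Println(*group.Replicas)
}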
Reproduction script
apiVersion: ray.io/v1alpha1
kind: RayJob
metadata:
  name: rayjob-sample
  labels:
    ray.io/scheduler-name: volcano
    volcano.sh/queue-name: kuberay-test-queue
spec:
  entrypoint: python /home/ray/samples/sample_code.py
  rayClusterSpec:
    rayVersion: '2.7.0' # should match the Ray version in the image of the containers
    # Ray head pod template
    headGroupSpec:
      ...
Anything else
The immediate fix is pretty trivial; I'm not sure whether a further refactor is desired.
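For illustration, a guard of roughly this shape would avoid the panic; the helper name and the fallback value are hypothetical, not the actual patch:
// Hypothetical defensive accessor, not the actual KubeRay fix: guard the
// optional pointer before dereferencing so a RayCluster created from a
// RayJob's rayClusterSpec (where the field may be unset) cannot crash
// the operator.
func replicasOrDefault(replicas *int32) int32 {
	if replicas == nil {
		return 1 // fall back to a sane default instead of panicking
	}
	return *replicas
}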
Are you willing to submit a PR?
- Yes I am willing to submit a PR!