[Bug] RayJob Volcano integration #1580

@Pikabooboo

Search before asking

  • I searched the issues and found no similar issues.

KubeRay Component

ray-operator

What happened + What you expected to happen

When a RayJob is labeled with the Volcano scheduler and queue, the ray-operator crashes. The RayCluster object is created with no status, and none of the cluster's pods are created. The RayJob is the sample job provided in the examples. This happens in both v0.6.0 and v1.0.0.
With the same setup applied to a RayCluster, everything works as expected.

The log from ray operator is the following:

INFO    controllers.RayCluster  reconcileHeadService    {"1 head service found": "rayjob-sample-raycluster-klhw9-head-svc"}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x13ad868]

goroutine 786 [running]:
github.com/ray-project/kuberay/ray-operator/controllers/ray/batchscheduler/volcano.(*VolcanoBatchScheduler).DoBatchSchedulingOnSubmission(0xc001758368?, 0xc008176500)
        /workspace/controllers/ray/batchscheduler/volcano/volcano_scheduler.go:55 +0xe8
github.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayClusterReconciler).reconcilePods(0xc0003054f0, {0x194d558, 0xc008536ed0}, 0xc008176500)
        /workspace/controllers/ray/raycluster_controller.go:550 +0x23f
github.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayClusterReconciler).rayClusterReconcile(0xc0003054f0, {0x194d558, 0xc008536ed0}, {{{0xc00812ee68, 0x8}, {0xc00814ab00, 0x1e}}}, 0xc008176500)
        /workspace/controllers/ray/raycluster_controller.go:340 +0xea8
github.com/ray-project/kuberay/ray-operator/controllers/ray.(*RayClusterReconciler).Reconcile(0xc0003054f0, {0x194d558, 0xc008536ed0}, {{{0xc00812ee68, 0x8}, {0xc00814ab00, 0x1e}}})
        /workspace/controllers/ray/raycluster_controller.go:158 +0x21e
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0xc00025edc0, {0x194d558, 0xc008536e10}, {{{0xc00812ee68?, 0x163d900?}, {0xc00814ab00?, 0x4045d4?}}})
        /opt/app-root/src/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:114 +0x28b
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc00025edc0, {0x194d4b0, 0xc000d2f040}, {0x158c6a0?, 0xc006dccf00?})
        /opt/app-root/src/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:311 +0x352
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc00025edc0, {0x194d4b0, 0xc000d2f040})
        /opt/app-root/src/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266 +0x1d9
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
        /opt/app-root/src/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227 +0x85
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2
        /opt/app-root/src/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:223 +0x31c
This reproduces consistently. Volcano works fine with RayCluster; the problem occurs only with RayJob (both v0.6.0 and v1.0.0).

Reproduction script

apiVersion: ray.io/v1alpha1
kind: RayJob
metadata:
  name: rayjob-sample
  labels:
    ray.io/scheduler-name: volcano
    volcano.sh/queue-name: kuberay-test-queue
spec:
  entrypoint: python /home/ray/samples/sample_code.py
  rayClusterSpec:
    rayVersion: '2.7.0' # should match the Ray version in the image of the containers
    # Ray head pod template
    headGroupSpec:
      ...

Anything else

The immediate fix is fairly trivial. I'm not sure whether further refactoring is desired.

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Metadata

Labels

  • 1.5.0
  • P1: Issue that should be fixed within a few weeks
  • bug: Something isn't working
  • rayjob
