10 changes: 5 additions & 5 deletions go.mod
@@ -96,11 +96,11 @@ require (
 	go.uber.org/multierr v1.11.0 // indirect
 	go.uber.org/zap v1.27.0 // indirect
 	golang.org/x/net v0.38.0 // indirect
-	golang.org/x/oauth2 v0.27.0 // indirect
-	golang.org/x/sync v0.12.0 // indirect
-	golang.org/x/sys v0.32.0 // indirect
-	golang.org/x/term v0.30.0 // indirect
-	golang.org/x/text v0.23.0 // indirect
+	golang.org/x/oauth2 v0.28.0 // indirect
+	golang.org/x/sync v0.14.0 // indirect
+	golang.org/x/sys v0.33.0 // indirect
+	golang.org/x/term v0.32.0 // indirect
+	golang.org/x/text v0.25.0 // indirect
 	golang.org/x/time v0.10.0 // indirect
 	golang.org/x/tools v0.31.0 // indirect
 	gomodules.xyz/jsonpatch/v2 v2.4.0 // indirect
20 changes: 10 additions & 10 deletions go.sum

go.sum is a generated file; its diff is not rendered here.

21 changes: 21 additions & 0 deletions ray-operator/config/rbac/role.yaml
@@ -174,3 +174,24 @@ rules:
  - patch
  - update
  - watch
- apiGroups:
  - cert-manager.io
  resources:
  - issuers
  - certificates
  verbs:
  - create
  - delete
  - get
  - list
  - patch
  - update
  - watch
- apiGroups:
  - cert-manager.io
  resources:
  - certificates/status
  verbs:
  - get
  - patch
  - update
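
For context, rules like these are normally generated from kubebuilder RBAC markers on the controller rather than edited by hand. A sketch of markers that could produce the additions above (their exact placement in the controller source is an assumption, not shown in this diff):

// +kubebuilder:rbac:groups=cert-manager.io,resources=issuers;certificates,verbs=create;delete;get;list;patch;update;watch
// +kubebuilder:rbac:groups=cert-manager.io,resources=certificates/status,verbs=get;patch;update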
110 changes: 110 additions & 0 deletions ray-operator/controllers/ray/raycluster_controller.go
@@ -978,6 +978,13 @@ func (r *RayClusterReconciler) buildHeadPod(ctx context.Context, instance rayv1.
	if len(r.options.HeadSidecarContainers) > 0 {
		podConf.Spec.Containers = append(podConf.Spec.Containers, r.options.HeadSidecarContainers...)
	}

	// Configure mTLS if enabled
	if features.Enabled(features.MTLS) {
		logger.Info("mTLS is enabled, configuring mTLS for head pod")
		r.configureMTLSForPod(&podConf, instance)

Member:
The feature gate shouldn't toggle whether the RayCluster should enable mTLS. The feature gate is for allowing / disallowing use of the feature. There should probably be a separate API (field or annotation) to enable mTLS if the feature gate is enabled. Default behavior is still to disable mTLS.

Contributor Author:
Ah, so KubeRay feature gates aren't for actually enabling a specific feature, just allowing the use of a feature? From chatting with others, we want to avoid API changes to any of the CRDs, so that won't be an option; maybe an annotation is the right way to go.

Reply:
@andrewsykim would it make more sense to include it here: https://github.com/ray-project/kuberay/blob/master/ray-operator/apis/config/v1alpha1/configuration_types.go, as an optional mechanism for all RayClusters under the reconciliation of a KubeRay installation, given the similar suggestion in #4098?

Collaborator (@rueian, Sep 25, 2025):
I think Andrew is suggesting that there should be a choice for a RayCluster to opt into mTLS or not. Is that okay for you, @laurafitzgerald? Or do you want to enforce mTLS on all RayClusters when the feature is enabled?

Reply:
We see this as an environment-specific configuration, i.e. an admin wants to enforce it for all RayClusters in the environment they are administering. Is there a recommended way or existing architecture where this pattern is used inside KubeRay, the configuration types for example?

Member:
There needs to be a field in RayCluster to enable mTLS. If a platform admin wants to force mTLS on all RayClusters, they can use a mutating admission policy or webhook. The feature gate also needs to exist; we use feature gates to introduce new features in "alpha" status.

(A sketch of such opt-in gating follows at the end of this hunk.)

	}

	logger.Info("head pod labels", "labels", podConf.Labels)
	creatorCRDType := getCreatorCRDType(instance)
	pod := common.BuildPod(ctx, podConf, rayv1.HeadNode, instance.Spec.HeadGroupSpec.RayStartParams, headPort, autoscalingEnabled, creatorCRDType, fqdnRayIP)
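
Following up on the review discussion above: a minimal sketch of what opt-in gating could look like if a per-cluster annotation were adopted alongside the feature gate. The annotation key and helper function are hypothetical (not part of this PR or the KubeRay API), and the snippet assumes the controller file's existing imports.

// enableMTLSAnnotation is a hypothetical annotation key; not part of the current KubeRay API.
const enableMTLSAnnotation = "ray.io/enable-mtls"

// mTLSEnabledFor reports whether mTLS should be configured for this cluster:
// the alpha feature gate must be on and the RayCluster must explicitly opt in.
func mTLSEnabledFor(instance rayv1.RayCluster) bool {
	if !features.Enabled(features.MTLS) {
		return false
	}
	return instance.Annotations[enableMTLSAnnotation] == "true"
}

buildHeadPod and buildWorkerPod would then call mTLSEnabledFor(instance) instead of checking the feature gate alone, keeping mTLS disabled by default even when the gate is enabled.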
@@ -1006,6 +1013,13 @@ func (r *RayClusterReconciler) buildWorkerPod(ctx context.Context, instance rayv
	if len(r.options.WorkerSidecarContainers) > 0 {
		podTemplateSpec.Spec.Containers = append(podTemplateSpec.Spec.Containers, r.options.WorkerSidecarContainers...)
	}

	// Configure mTLS if enabled
	if features.Enabled(features.MTLS) {
		logger.Info("mTLS is enabled, configuring mTLS for head pod")

Contributor Author:
Suggested change:
-		logger.Info("mTLS is enabled, configuring mTLS for head pod")
+		logger.Info(fmt.Sprintf("mTLS is enabled, configuring mTLS for worker pod %s", podName))

		r.configureMTLSForPod(&podTemplateSpec, instance)
	}

	creatorCRDType := getCreatorCRDType(instance)
	pod := common.BuildPod(ctx, podTemplateSpec, rayv1.WorkerNode, worker.RayStartParams, headPort, autoscalingEnabled, creatorCRDType, fqdnRayIP)
	// Set raycluster instance as the owner and controller
@@ -1016,6 +1030,102 @@ func (r *RayClusterReconciler) buildWorkerPod(ctx context.Context, instance rayv
	return pod
}

// configureMTLSForPod configures mTLS settings for a pod template if mTLS is enabled
func (r *RayClusterReconciler) configureMTLSForPod(podTemplate *corev1.PodTemplateSpec, instance rayv1.RayCluster) {
	// Determine the appropriate secret name based on node type,
	// read from the pod's Ray node-type label.
	var secretName string
	var isWorker bool
	if podTemplate.Labels != nil && podTemplate.Labels[utils.RayNodeTypeLabelKey] == string(rayv1.HeadNode) {
		isWorker = false
		secretName = fmt.Sprintf("ray-head-secret-%s", instance.Name)
	} else {
		isWorker = true
		secretName = fmt.Sprintf("ray-worker-secret-%s", instance.Name)
	}

	for i := range podTemplate.Spec.Containers {
		// Add TLS environment variables
		r.addTLSEnvironmentVariables(&podTemplate.Spec.Containers[i], isWorker)
		// Add certificate volume mounts
		r.addCertVolumeMounts(&podTemplate.Spec.Containers[i])
	}

	// Add mTLS configuration to init containers as well
	for i := range podTemplate.Spec.InitContainers {
		// Add TLS environment variables
		r.addTLSEnvironmentVariables(&podTemplate.Spec.InitContainers[i], isWorker)
		// Add certificate volume mounts
		r.addCertVolumeMounts(&podTemplate.Spec.InitContainers[i])
	}

	// Add CA volumes with proper secret references
	r.addCAVolumes(&podTemplate.Spec, secretName)
}

// addTLSEnvironmentVariables adds Ray TLS environment variables to a container
func (r *RayClusterReconciler) addTLSEnvironmentVariables(container *corev1.Container, isWorker bool) {
	// The caller passes the node type via isWorker.
	if isWorker {
		// Worker pods only need basic TLS environment variables
		tlsEnvVars := []corev1.EnvVar{
			{Name: "RAY_USE_TLS", Value: "1"},
			{Name: "RAY_TLS_SERVER_CERT", Value: "/home/ray/workspace/tls/tls.crt"},
			{Name: "RAY_TLS_SERVER_KEY", Value: "/home/ray/workspace/tls/tls.key"},
			{Name: "RAY_TLS_CA_CERT", Value: "/home/ray/workspace/tls/ca.crt"},
		}
		container.Env = append(container.Env, tlsEnvVars...)
		return
	}

	// Head pods need all TLS environment variables, including MY_POD_IP
	tlsEnvVars := []corev1.EnvVar{
		{
			Name: "MY_POD_IP",
			ValueFrom: &corev1.EnvVarSource{
				FieldRef: &corev1.ObjectFieldSelector{
					FieldPath: "status.podIP",
				},
			},
		},
		{Name: "RAY_USE_TLS", Value: "1"},
		{Name: "RAY_TLS_SERVER_CERT", Value: "/home/ray/workspace/tls/tls.crt"},
		{Name: "RAY_TLS_SERVER_KEY", Value: "/home/ray/workspace/tls/tls.key"},
		{Name: "RAY_TLS_CA_CERT", Value: "/home/ray/workspace/tls/ca.crt"},
	}
	container.Env = append(container.Env, tlsEnvVars...)
}

// addCAVolumes adds CA and certificate volumes to a pod spec
func (r *RayClusterReconciler) addCAVolumes(podSpec *corev1.PodSpec, secretName string) {
	caVolumes := []corev1.Volume{
		{
			Name: "ca-vol",
			VolumeSource: corev1.VolumeSource{
				Secret: &corev1.SecretVolumeSource{
					SecretName: secretName,
				},
			},
		},
	}

	podSpec.Volumes = append(podSpec.Volumes, caVolumes...)
}

// addCertVolumeMounts adds certificate volume mounts to a container
func (r *RayClusterReconciler) addCertVolumeMounts(container *corev1.Container) {
	volumeMounts := []corev1.VolumeMount{
		{
			Name:      "ca-vol",
			MountPath: "/home/ray/workspace/tls",
			ReadOnly:  true,
		},
	}

	container.VolumeMounts = append(container.VolumeMounts, volumeMounts...)
}
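
Not part of this diff: a minimal unit-test sketch showing how configureMTLSForPod could be exercised, assuming the controller test package's existing imports plus "testing"; the test and container names are illustrative.

func TestConfigureMTLSForHeadPod(t *testing.T) {
	r := &RayClusterReconciler{}
	instance := rayv1.RayCluster{ObjectMeta: metav1.ObjectMeta{Name: "demo"}}
	podTemplate := corev1.PodTemplateSpec{
		ObjectMeta: metav1.ObjectMeta{
			Labels: map[string]string{utils.RayNodeTypeLabelKey: string(rayv1.HeadNode)},
		},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{Name: "ray-head"}},
		},
	}

	r.configureMTLSForPod(&podTemplate, instance)

	// The head container should now carry RAY_USE_TLS=1 and the cert volume mount,
	// and the pod spec should reference the head secret volume.
	container := podTemplate.Spec.Containers[0]
	foundTLS := false
	for _, env := range container.Env {
		if env.Name == "RAY_USE_TLS" && env.Value == "1" {
			foundTLS = true
		}
	}
	if !foundTLS {
		t.Errorf("expected RAY_USE_TLS=1 on head container, got %v", container.Env)
	}
	if len(container.VolumeMounts) == 0 || container.VolumeMounts[0].MountPath != "/home/ray/workspace/tls" {
		t.Errorf("expected TLS volume mount on head container, got %v", container.VolumeMounts)
	}
	if len(podTemplate.Spec.Volumes) == 0 || podTemplate.Spec.Volumes[0].Name != "ca-vol" {
		t.Errorf("expected ca-vol volume on pod spec, got %v", podTemplate.Spec.Volumes)
	}
}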

func (r *RayClusterReconciler) buildRedisCleanupJob(ctx context.Context, instance rayv1.RayCluster) batchv1.Job {
	logger := ctrl.LoggerFrom(ctx)

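
Not shown in this diff: the secrets mounted above (ray-head-secret-<cluster>, ray-worker-secret-<cluster>) are presumably issued by cert-manager, which is what the new RBAC rules for Certificates and Issuers suggest. A purely illustrative sketch of what a matching head Certificate could look like via the cert-manager Go API; the object name, DNS name, and issuer reference are assumptions, not taken from this PR.

import (
	"fmt"

	certmanagerv1 "github.com/cert-manager/cert-manager/pkg/apis/certmanager/v1"
	cmmeta "github.com/cert-manager/cert-manager/pkg/apis/meta/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	rayv1 "github.com/ray-project/kuberay/ray-operator/apis/ray/v1"
)

// headCertificateFor builds a cert-manager Certificate whose issued secret name
// matches the secret mounted by addCAVolumes for the head pod. Illustrative only.
func headCertificateFor(instance rayv1.RayCluster) *certmanagerv1.Certificate {
	return &certmanagerv1.Certificate{
		ObjectMeta: metav1.ObjectMeta{
			Name:      fmt.Sprintf("ray-head-cert-%s", instance.Name), // hypothetical name
			Namespace: instance.Namespace,
		},
		Spec: certmanagerv1.CertificateSpec{
			SecretName: fmt.Sprintf("ray-head-secret-%s", instance.Name),
			DNSNames: []string{
				fmt.Sprintf("%s-head-svc.%s.svc", instance.Name, instance.Namespace), // assumed head service DNS name
			},
			IssuerRef: cmmeta.ObjectReference{
				Name: "ray-ca-issuer", // hypothetical Issuer
				Kind: "Issuer",
			},
		},
	}
}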