kryanbeane
Contributor

Why are these changes needed?

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@laurafitzgerald force-pushed the mtls-poc branch 2 times, most recently from e1c074e to 9f73348 (September 2, 2025 14:43)
@laurafitzgerald

Verification Steps

Setup

Run make cert-manager

In ray-operator/config/manager/manager.yaml:

  • append ,MTLS=true to the --feature-gates flag value
  • set image: quay.io/laurafitzgerald/kuberay:mtls

In ray-operator/config/default/kustomization.yaml, set:

- name: kuberay/operator
  newName: quay.io/laurafitzgerald/kuberay
  newTag: mtls

Run make deploy
Run oc apply -f config/samples/ray-cluster.sample.yaml

Verify Configuration

Run oc get certificate

NAME                                 READY   SECRET                                 AGE
ray-head-cert-raycluster-kuberay     True    ray-head-secret-raycluster-kuberay     34m
ray-worker-cert-raycluster-kuberay   True    ray-worker-secret-raycluster-kuberay   34m

Run oc get secret

NAME                                   TYPE                DATA   AGE
ray-head-secret-raycluster-kuberay     kubernetes.io/tls   3      100m
ray-worker-secret-raycluster-kuberay   kubernetes.io/tls   3      100m

Check the env vars, volumes, and volume mounts on the head and worker pods.
Expected:
Env vars

- name: MY_POD_IP
  valueFrom:
    fieldRef:
      apiVersion: v1
      fieldPath: status.podIP
- name: RAY_USE_TLS
  value: "1"
- name: RAY_TLS_SERVER_CERT
  value: /home/ray/workspace/tls/server.crt
- name: RAY_TLS_SERVER_KEY
  value: /home/ray/workspace/tls/server.key
- name: RAY_TLS_CA_CERT
  value: /home/ray/workspace/tls/ca.crt

Volumes

volumes:
    - name: ca-vol
      secret:
        defaultMode: 420
        secretName: ray-head-secret-raycluster-kuberay

Volume Mounts

volumeMounts:
      - mountPath: /home/ray/workspace/tls
        name: ca-vol

Changes include:
- a feature flag to switch mTLS on; it is off by default
- a new mTLS reconciler that reconciles the cert-manager resources when mTLS is on
- the required env vars, volumes, and volume mounts added to each pod in the cluster, behind the feature flag
- the additional RBAC rules required

Co-authored-by: laurafitzgerald <[email protected]>
Contributor Author

@kryanbeane kryanbeane left a comment

Some nits and questions; other than that, looks great.

} else if errors.IsNotFound(err) {
	// CA certificate does not exist, create it
	logger.Info("Creating CA certificate for RayCluster", "rayCluster", instance.Name)
	return r.createCACertificate(ctx, instance)
Contributor Author

Should we requeue here too? If the root CA isn't found and creating it then fails, should we requeue?

utilruntime.Must(routev1.Install(scheme))
utilruntime.Must(batchv1.AddToScheme(scheme))
utilruntime.Must(configapi.AddToScheme(scheme))
utilruntime.Must(certmanagerv1.AddToScheme(scheme))
Contributor Author

Should cert-manager always be added to the scheme, or should this be feature-gated too?

)

const (
caSecretName = "ray-ca-secret"
Contributor Author

This is one secret per RayCluster, right? I will test this, but if two RayClusters are created in the same namespace, I think we'll see a conflict because a secret with this name will already exist. We should append the RayCluster name to the secret name so it's unique.
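
A minimal sketch of the rename suggested above. It assumes the secret name is built by simple concatenation; `caSecretNameFor` is a hypothetical helper for illustration, not code from this PR.

```go
package main

import "fmt"

// caSecretName mirrors the constant under review.
const caSecretName = "ray-ca-secret"

// caSecretNameFor is a hypothetical helper (not part of the PR). It scopes the
// CA secret name to one RayCluster so that two clusters in the same namespace
// do not compete for the same secret.
func caSecretNameFor(clusterName string) string {
	return fmt.Sprintf("%s-%s", caSecretName, clusterName)
}

func main() {
	// Two clusters in one namespace now get distinct secret names.
	fmt.Println(caSecretNameFor("raycluster-kuberay"))
	fmt.Println(caSecretNameFor("raycluster-other"))
}
```

This follows the same `<prefix>-<cluster name>` pattern already visible in the `oc get secret` output above (for example `ray-head-secret-raycluster-kuberay`).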


// Configure mTLS if enabled
if features.Enabled(features.MTLS) {
	logger.Info("mTLS is enabled, configuring mTLS for head pod")
Contributor Author

Suggested change
- logger.Info("mTLS is enabled, configuring mTLS for head pod")
+ logger.Info(fmt.Sprintf("mTLS is enabled, configuring mTLS for worker pod %s", podName))

Comment on lines +563 to +588
dnsNames := []string{
	workerSvcName,
	"localhost",
	fmt.Sprintf("%s.%s.svc", workerSvcName, instance.Namespace),
	fmt.Sprintf("%s.%s.svc.cluster.local", workerSvcName, instance.Namespace),
}

// Add DNS names for each worker group
for _, workerGroup := range instance.Spec.WorkerGroupSpecs {
	groupDNSNames := []string{
		"localhost",
		fmt.Sprintf("%s-%s", instance.Name, workerGroup.GroupName),
		fmt.Sprintf("%s-%s.%s.svc", instance.Name, workerGroup.GroupName, instance.Namespace),
		fmt.Sprintf("%s-%s.%s.svc.cluster.local", instance.Name, workerGroup.GroupName, instance.Namespace),
	}
	dnsNames = append(dnsNames, groupDNSNames...)
}

// Add wildcard patterns for dynamic worker services
dnsNames = append(dnsNames,
	"localhost",
	fmt.Sprintf("*.%s.%s.svc", workerSvcName, instance.Namespace),
	fmt.Sprintf("*.%s.%s.svc.cluster.local", workerSvcName, instance.Namespace),
	fmt.Sprintf("*-worker-*.%s.svc", instance.Namespace),
	fmt.Sprintf("*-worker-*.%s.svc.cluster.local", instance.Namespace),
)
Contributor Author

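nit: "localhost" is added to dnsNames in three places here, I believe (once up front, once per worker group inside the loop, and once with the wildcards). It may not create duplicate entries in the certificate, but it's a code smell.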
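
One way to address the nit, sketched under the assumption that the name list is assembled as in the snippet above: add "localhost" once and run the final slice through a small order-preserving dedupe. `uniqueStrings` is a hypothetical helper, not something in this PR, and the sample names are made up for illustration.

```go
package main

import "fmt"

// uniqueStrings returns the input with duplicates removed, keeping the order
// in which each value was first seen. Hypothetical helper for illustration.
func uniqueStrings(in []string) []string {
	seen := make(map[string]struct{}, len(in))
	out := make([]string, 0, len(in))
	for _, s := range in {
		if _, ok := seen[s]; ok {
			continue // already emitted
		}
		seen[s] = struct{}{}
		out = append(out, s)
	}
	return out
}

func main() {
	// Illustrative input only: "localhost" appears three times, as in the review.
	dnsNames := []string{
		"localhost",
		"worker-svc",
		"localhost",
		"worker-svc.ns.svc",
		"localhost",
	}
	fmt.Println(uniqueStrings(dnsNames))
}
```

Deduplicating at the end keeps the three append sites independent; simply dropping the two extra "localhost" literals would also work and is the smaller change.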

Comment on lines +760 to +765
IssuerRef: cmmeta.ObjectReference{
	// Bootstrap with the SelfSigned issuer
	Name:  fmt.Sprintf("%s-%s", raySelfSignedIssuerName, instance.Name),
	Kind:  "Issuer",
	Group: "cert-manager.io",
},
Contributor Author

Is this self-signed issuer created anywhere? I can only see the cleanup of it. I would think that if it isn't created manually, the certificates can't be signed.

Member

@andrewsykim andrewsykim left a comment

FYI this will need an API review cc @rueian @Future-Outlier

// Configure mTLS if enabled
if features.Enabled(features.MTLS) {
	logger.Info("mTLS is enabled, configuring mTLS for head pod")
	r.configureMTLSForPod(&podConf, instance)
Member

The feature gate shouldn't toggle whether the RayCluster should enable mTLS. The feature gate is for allowing / disallowing use of the feature. There should probably be a separate API (field or annotation) to enable mTLS if the feature gate is enabled. Default behavior is still to disable mTLS

Contributor Author

Ah, so KubeRay feature gates aren't for actually enabling a specific feature, just for allowing its use? From chatting to others, we want to avoid API changes to any of the CRDs, so a new field won't be an option. Maybe an annotation is the right way to go.

@andrewsykim would it make more sense to include it here? https://github.com/ray-project/kuberay/blob/master/ray-operator/apis/config/v1alpha1/configuration_types.go It would be an optional mechanism applying to all RayClusters reconciled by a given KubeRay installation, given the similar suggestion in #4098.

Collaborator

@rueian rueian Sep 25, 2025

I think Andrew is suggesting that there should be a choice for a RayCluster to opt into mTLS or not. Is that okay for you, @laurafitzgerald? Or do you want to enforce mTLS on all RayClusters when the feature is enabled?
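
To make the two-level idea in this thread concrete, here is a rough sketch of a per-cluster opt-in that only takes effect when the gate allows it. Everything here is an assumption for illustration: the annotation key `ray.io/enable-mtls` is made up, `gateEnabled` stands in for `features.Enabled(features.MTLS)`, and the annotations map stands in for the RayCluster's metadata.

```go
package main

import "fmt"

// mtlsAnnotation is a hypothetical annotation key, not defined by KubeRay.
const mtlsAnnotation = "ray.io/enable-mtls"

// mtlsEnabledFor reports whether mTLS should be configured for one cluster.
// gateEnabled stands in for features.Enabled(features.MTLS): the gate only
// allows the feature, and the annotation is the per-cluster opt-in.
// The default, with no annotation, is mTLS disabled.
func mtlsEnabledFor(gateEnabled bool, annotations map[string]string) bool {
	if !gateEnabled {
		return false
	}
	// Reading from a nil map is safe in Go and yields the empty string.
	return annotations[mtlsAnnotation] == "true"
}

func main() {
	optedIn := map[string]string{mtlsAnnotation: "true"}

	fmt.Println(mtlsEnabledFor(true, optedIn))  // gate on, cluster opted in
	fmt.Println(mtlsEnabledFor(true, nil))      // gate on, no opt-in
	fmt.Println(mtlsEnabledFor(false, optedIn)) // gate off overrides the opt-in
}
```

An operator-wide setting in the configuration type linked above could replace the annotation check if enforcing mTLS on every RayCluster is preferred; the gate check would stay the same.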

5 participants