-
Notifications
You must be signed in to change notification settings - Fork 620
Non-Istio mTLS POC #3952
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Non-Istio mTLS POC #3952
Conversation
e1c074e
to
9f73348
Compare
Verification StepsSetupRun To ray-operator/config/manager/manager.yaml
To ray-operator/config/default/kustomization.yaml
Run Verify ConfigurationRun
Run
Check the ENVS and volumemounts and volumes for the head and worker node
Volumes
Volume Mounts
|
9f73348
to
fc3e1f2
Compare
Changes include - feature flag to switch mTLS on, by default it's off - new mtls reconciler which reconciles the cert manager resources when mtls is on - required ENV VARs, volumens and volumen mounts to each pod in the cluster behind the feature flag - Additional RBACs required Co-authored-by: laurafitzgerald <[email protected]>
fc3e1f2
to
a298d6a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some nits and questions other than that looks great
} else if errors.IsNotFound(err) { | ||
// CA certificate does not exist, create it | ||
logger.Info("Creating CA certificate for RayCluster", "rayCluster", instance.Name) | ||
return r.createCACertificate(ctx, instance) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we requeue here too? If we can't find the Root CA, the creation fails, requeue?
utilruntime.Must(routev1.Install(scheme)) | ||
utilruntime.Must(batchv1.AddToScheme(scheme)) | ||
utilruntime.Must(configapi.AddToScheme(scheme)) | ||
utilruntime.Must(certmanagerv1.AddToScheme(scheme)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should cert manager always be added to the scheme or should this be feature gated too?
) | ||
|
||
const ( | ||
caSecretName = "ray-ca-secret" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is one secret per raycluster right? I will test this but if 2 rayclusters are created in the same namespace, I think we'll see a conflict as the secret with this name will already exist. We should just append the raycluster name to the secret name so it's unique
|
||
// Configure mTLS if enabled | ||
if features.Enabled(features.MTLS) { | ||
logger.Info("mTLS is enabled, configuring mTLS for head pod") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logger.Info("mTLS is enabled, configuring mTLS for head pod") | |
logger.Info(fmt.Sprintf("mTLS is enabled, configuring mTLS for worker pod %s", podName)) |
dnsNames := []string{ | ||
workerSvcName, | ||
"localhost", | ||
fmt.Sprintf("%s.%s.svc", workerSvcName, instance.Namespace), | ||
fmt.Sprintf("%s.%s.svc.cluster.local", workerSvcName, instance.Namespace), | ||
} | ||
|
||
// Add DNS names for each worker group | ||
for _, workerGroup := range instance.Spec.WorkerGroupSpecs { | ||
groupDNSNames := []string{ | ||
"localhost", | ||
fmt.Sprintf("%s-%s", instance.Name, workerGroup.GroupName), | ||
fmt.Sprintf("%s-%s.%s.svc", instance.Name, workerGroup.GroupName, instance.Namespace), | ||
fmt.Sprintf("%s-%s.%s.svc.cluster.local", instance.Name, workerGroup.GroupName, instance.Namespace), | ||
} | ||
dnsNames = append(dnsNames, groupDNSNames...) | ||
} | ||
|
||
// Add wildcard patterns for dynamic worker services | ||
dnsNames = append(dnsNames, | ||
"localhost", | ||
fmt.Sprintf("*.%s.%s.svc", workerSvcName, instance.Namespace), | ||
fmt.Sprintf("*.%s.%s.svc.cluster.local", workerSvcName, instance.Namespace), | ||
fmt.Sprintf("*-worker-*.%s.svc", instance.Namespace), | ||
fmt.Sprintf("*-worker-*.%s.svc.cluster.local", instance.Namespace), | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: "localhost" is being added to dnsNames
3 times here I believe. It may not create duplicate entries but yano, codesmell
IssuerRef: cmmeta.ObjectReference{ | ||
// Bootstrap with the SelfSigned issuer | ||
Name: fmt.Sprintf("%s-%s", raySelfSignedIssuerName, instance.Name), | ||
Kind: "Issuer", | ||
Group: "cert-manager.io", | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this self signed issuer created anywhere? I can just see the cleanup of it. I would think if it isn't created manually, the certifications won't be able to be signed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI this will need an API review cc @rueian @Future-Outlier
// Configure mTLS if enabled | ||
if features.Enabled(features.MTLS) { | ||
logger.Info("mTLS is enabled, configuring mTLS for head pod") | ||
r.configureMTLSForPod(&podConf, instance) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The feature gate shouldn't toggle whether the RayCluster should enable mTLS. The feature gate is for allowing / disallowing use of the feature. There should probably be a separate API (field or annotation) to enable mTLS if the feature gate is enabled. Default behavior is still to disable mTLS
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, so Kuberay feature gates aren't for actually enabling a specific feature? Just enabling the use of a feature? From chatting to others, we want to avoid API changes to any of the CRDs so that won't be an option so maybe an annotation is the right way to go
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andrewsykim would it make more sense to include it here? https://github.com/ray-project/kuberay/blob/master/ray-operator/apis/config/v1alpha1/configuration_types.go as an optional mechanism for all rayclusters under the reconciliation of a kuberay installation given the similar suggestion in #4098
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think Andrew is suggesting that there should be a choice for a RayCluster to opt into mTLS or not. Is that okay for you, @laurafitzgerald? Or do you want to enforce mTLS on all RayClusters when the feature is enabled?
Why are these changes needed?
Related issue number
Checks