Skip to content

gpu instances: make /var/run/cdi world-writable #7603

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

mikebonnet
Copy link
Contributor

Both /etc/cdi and /var/run/cdi are valid locations for CDI config, with the latter overriding the former. Make both world-writable so config can be modified by users as necessary.

No need to su - ec2-user, Konflux workloads run as a newly-created random user. Making the directory world-writable is sufficient.

Both /etc/cdi and /var/run/cdi are valid locations for CDI config, with
the latter overriding the former. Make both world-writable so config can
be modified by users as necessary.

No need to "su - ec2-user", Konflux workloads run as a newly-created
random user. Making the directory world-writable is sufficient.

Signed-off-by: Mike Bonnet <[email protected]>
Copy link

openshift-ci bot commented Aug 12, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mikebonnet
Once this PR has been reviewed and has the lgtm label, please assign ifireball for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Contributor

Code Review by Gemini

Here's a review of the provided code changes:

components/kubearchive/staging/base/external-secret.yaml

  • Issue: The removal of the argocd.argoproj.io/sync-options: Prune=false annotation changes the secret's lifecycle management by ArgoCD. This means ArgoCD will now prune (delete) the kubearchive-logging secret if it's no longer defined in Git. This change appears unrelated to the commit message about GPU instances and CDI.
  • Suggestion:
    --- a/components/kubearchive/staging/base/external-secret.yaml
    +++ b/components/kubearchive/staging/base/external-secret.yaml
    @@ -12,14 +12,10 @@ spec:
         - extract:
             key: staging/kubearchive/logging
       refreshInterval: 1h
       secretStoreRef:
         kind: ClusterSecretStore
         name: appsre-stonesoup-vault
       target:
         creationPolicy: Owner
         deletionPolicy: Delete
         name: kubearchive-logging
    -    template:
    -      metadata:
    -        annotations:
    -          argocd.argoproj.io/sync-options: Prune=false
    Confirm if this change in pruning behavior is intentional for the kubearchive-logging secret. If it is, consider adding a brief explanation to the commit message. If it's unrelated to the primary purpose of this commit, it should be reverted or moved to a separate commit.

components/multi-platform-controller/production/kflux-prd-rh03/host-config.yaml

  • Issue: Making /etc/cdi world-writable (chmod a+rwx, equivalent to 777) is a security concern. /etc is typically reserved for system-wide configuration files with more restrictive permissions. While /var/run/cdi might require world-writable permissions for runtime data as stated in the commit message, /etc/cdi should ideally have more restricted permissions (e.g., 755 or 775 if a specific group needs write access). The nvidia-ctk cdi generate command typically runs as root and does not require world-writable permissions for its output.
  • Suggestion:
    --- a/components/multi-platform-controller/production/kflux-prd-rh03/host-config.yaml
    +++ b/components/multi-platform-controller/production/kflux-prd-rh03/host-config.yaml
    @@ -789,8 +789,8 @@ data:
     chmod 700 /home/ec2-user/.ssh
     restorecon -r /home/ec2-user
     
    -    mkdir -p /etc/cdi
    -    chmod a+rwx /etc/cdi
    +    mkdir -p /etc/cdi /var/run/cdi
    +    chmod 755 /etc/cdi
    +    chmod a+rwx /var/run/cdi
     
     setsebool container_use_devices 1
    -    
    -    su - ec2-user
    +
     nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
     --//--
    Apply chmod a+rwx only to /var/run/cdi. For /etc/cdi, consider chmod 755 or 775 to maintain better security practices, unless there's a specific, unavoidable requirement for non-root users to write to /etc/cdi.

components/sandbox/toolchain-member-operator/staging/base/kustomization.yaml

  • Issue: The resources section now includes ../../base/ and new patches to delete specific ClusterRoles and RoleBindings. This is a significant refactoring of the kustomization structure and RBAC configuration.
  • Suggestion: No specific line changes are needed here, but ensure that all environments that previously consumed this base are updated correctly to reflect these structural and RBAC changes.

components/sandbox/toolchain-member-operator/staging/kflux-stg-es01/kustomization.yaml

  • Issue: The base kustomization now includes patch_resources.yaml which modifies the dev-sandbox-member Subscription. This overlay then applies a patch to $patch: delete the same dev-sandbox-member Subscription. This means the subscription will not be deployed in the kflux-stg-es01 environment, effectively overriding the modification from the base.
  • Suggestion:
    --- a/components/sandbox/toolchain-member-operator/staging/kflux-stg-es01/kustomization.yaml
    +++ b/components/sandbox/toolchain-member-operator/staging/kflux-stg-es01/kustomization.yaml
    @@ -1,4 +1,12 @@
     apiVersion: kustomize.config.k8s.io/v1beta1
     kind: Kustomization
     resources:
    -  - ../base
    +  - ../base
    +patches:
    +  - patch: |
    +      $patch: delete
    +      apiVersion: operators.coreos.com/v1alpha1
    +      kind: Subscription
    +      metadata:
    +        name: dev-sandbox-member
    +        namespace: toolchain-member-operator
    \ No newline at end of file
    Confirm if the intention is to not deploy the dev-sandbox-member Subscription in kflux-stg-es01. If so, this change is correct but represents a significant divergence from the base configuration. If the intention was to apply the resource requests from base/patch_resources.yaml and then further modify it, this patch is incorrect.

components/sandbox/toolchain-member-operator/staging/stone-stage-p01/kustomization.yaml

  • Issue: This kustomization file has been changed to remove ../base from its resources and now only includes ns.yaml. This means that the toolchain-member-operator itself, along with any other resources or patches defined in the base kustomization, will not be deployed in this environment. Only the toolchain-member-operator namespace will be created. This is a major change in deployment strategy and likely a bug if the operator is expected to run here.
  • Suggestion:
    --- a/components/sandbox/toolchain-member-operator/staging/stone-stage-p01/kustomization.yaml
    +++ b/components/sandbox/toolchain-member-operator/staging/stone-stage-p01/kustomization.yaml
    @@ -1,5 +1,4 @@
     apiVersion: kustomize.config.k8s.io/v1beta1
     kind: Kustomization
     resources:
    -  - ../base
    -
    +- ns.yaml
    If the toolchain-member-operator is intended to be deployed in stone-stage-p01, then resources: - ../base should be retained. If the operator is not intended to be deployed here, then this change is correct, but it's a significant functional change that should be explicitly documented.

components/sandbox/toolchain-member-operator/staging/stone-stg-rh01/kustomization.yaml

  • Issue: Same as stone-stage-p01/kustomization.yaml. This kustomization file has been changed to remove ../base from its resources and now only includes ns.yaml. This means that the toolchain-member-operator itself, along with any other resources or patches defined in the base kustomization, will not be deployed in this environment. Only the toolchain-member-operator namespace will be created. This is a major change in deployment strategy and likely a bug if the operator is expected to run here.
  • Suggestion:
    --- a/components/sandbox/toolchain-member-operator/staging/stone-stg-rh01/kustomization.yaml
    +++ b/components/sandbox/toolchain-member-operator/staging/stone-stg-rh01/kustomization.yaml
    @@ -1,5 +1,4 @@
     apiVersion: kustomize.config.k8s.io/v1beta1
     kind: Kustomization
     resources:
    -  - ../base
    -
    +- ns.yaml
    If the toolchain-member-operator is intended to be deployed in stone-stg-rh01, then resources: - ../base should be retained. If the operator is not intended to be deployed here, then this change is correct, but it's a significant functional change that should be explicitly documented.

@manish-jangra
Copy link
Contributor

/lgtm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants