postStart hook commands timeout #1440

Open: akurinnoy wants to merge 22 commits into main from postStartHookTimeout

Conversation

akurinnoy
Collaborator

What does this PR do?

This PR addresses postStart hook failures in DevWorkspaces: when hook commands do not exit within the timeout period, the workspace pod gets stuck in the Terminating state and is never deleted.

This PR resolves the issue by:

  • Introducing a timeout for the postStart hook. User-provided commands are now wrapped with the timeout utility, so postStart hook commands are terminated if they run longer than a configurable duration. The duration is set in the DevWorkspaceOperatorConfig (a value of 0 means no timeout):
    # DevWorkspaceOperatorConfig
    # ...
    config:
      workspace:
        postStartTimeout: 30 # Timeout in seconds
  • Adding parsing logic that interprets the various Kubelet messages to extract the exact reason or exit code of a lifecycle hook failure.
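The wrapping described in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `wrapWithTimeout` and `shellQuote` are hypothetical names, and the sketch assumes the container image ships a `timeout` utility that accepts a value such as `30s` (GNU coreutils does).

```go
package main

import (
	"fmt"
	"strings"
)

// shellQuote wraps s in single quotes so that sh treats it as one literal word.
// An embedded single quote is written as '\'' (close quote, escaped quote, reopen).
func shellQuote(s string) string {
	return "'" + strings.ReplaceAll(s, "'", `'\''`) + "'"
}

// wrapWithTimeout is a hypothetical helper, not the operator's actual code.
// It returns a command line that runs script under the `timeout` utility.
// A value of 0 (or less) returns the script unchanged, mirroring the
// "0 means no timeout" rule from the PR description.
func wrapWithTimeout(script string, timeoutSeconds int32) string {
	if timeoutSeconds <= 0 {
		return script
	}
	return fmt.Sprintf("timeout %ds /bin/sh -c %s", timeoutSeconds, shellQuote(script))
}

func main() {
	fmt.Println(wrapWithTimeout("echo 'starting'; sleep infinity", 30))
	fmt.Println(wrapWithTimeout("echo hi", 0))
}
```

Quoting the whole script as a single argument keeps `;`-separated user commands under one `timeout` invocation instead of limiting only the first command.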

What issues does this PR fix or reference?

https://issues.redhat.com/browse/CRW-8329

Is it tested? How?

  1. Install DWO from this PR:
oc apply -f - <<EOF
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: devworkspace-operator-catalog
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/okurinny/devworkspace-operator-index:postStartHookTimeout
  publisher: Red Hat
  displayName: DevWorkspace Operator Catalog
  updateStrategy:
    registryPoll:
      interval: 5m
EOF
oc apply -f - <<EOF
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: devworkspace-operator
  namespace: openshift-operators
spec:
  channel: next
  installPlanApproval: Automatic
  name: devworkspace-operator
  source: devworkspace-operator-catalog
  sourceNamespace: openshift-marketplace
EOF
  1. Create DevWorkspaceOperatorConfig with the postStart hook timeout duration (in seconds):
oc apply -f - <<EOF
apiVersion: controller.devfile.io/v1alpha1
kind: DevWorkspaceOperatorConfig
metadata:
  name: devworkspace-operator-config
  namespace: openshift-operators
config:
  workspace:
    postStartTimeout: 30
EOF
  1. Create a problematic DevWorkspace designed to have its postStart hook time out:
oc apply -f - <<EOF
apiVersion: workspace.devfile.io/v1alpha2
kind: DevWorkspace
metadata:
  name: problematic-workspace
spec:
  started: true
  template:
    components:
      - name: tools
        container:
          image: quay.io/devfile/universal-developer-image:ubi9-latest
          memoryLimit: "1Gi"
          memoryRequest: "512Mi"
          cpuRequest: "250m"
          cpuLimit: "1000m"
    commands:
      - id: sleep-infinity-cmd
        exec:
          component: tools
          commandLine: "echo 'PostStart: Starting infinite sleep...'; sleep infinity; echo 'PostStart: Sleep finished (should not be reached)'"
    events:
      postStart:
        - sleep-infinity-cmd
EOF
  1. Watch the DevWorkspace:
oc get dw problematic-workspace -w
  1. The DevWorkspace should eventually enter a Failed phase.
  2. The status.message of the DevWorkspace should provide a reason for the failure, indicating a timeout. For example: Error creating DevWorkspace deployment: Container tools has state [postStart hook] Commands terminated by SIGTERM (likely timed out after 30s). Exit code 143.
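The exit code 143 in the example message follows the common shell convention of 128 + signal number (SIGTERM is 15). A rough sketch of how a reported message could be turned into such a reason is shown below. It is an illustration only: the function names are hypothetical, the sample message is abbreviated, and it assumes the Kubelet text contains the phrase `exited with <N>`, which may vary between Kubernetes versions.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// exitCodePattern assumes the Kubelet message contains "exited with <N>".
var exitCodePattern = regexp.MustCompile(`exited with (\d+)`)

// extractExitCode returns the exit code found in message, or false if none is present.
func extractExitCode(message string) (int, bool) {
	m := exitCodePattern.FindStringSubmatch(message)
	if m == nil {
		return 0, false
	}
	code, err := strconv.Atoi(m[1])
	if err != nil {
		return 0, false
	}
	return code, true
}

// describeExitCode applies two widely used conventions: shells report 128+N
// for a process killed by signal N, and GNU timeout exits with 124 when the
// limit is hit (unless --preserve-status is used).
func describeExitCode(code int) string {
	switch {
	case code == 0:
		return "success"
	case code == 124:
		return "timed out (exit status of GNU timeout)"
	case code == 128+15:
		return "terminated by SIGTERM"
	case code == 128+9:
		return "killed by SIGKILL"
	case code > 128:
		return fmt.Sprintf("terminated by signal %d", code-128)
	default:
		return fmt.Sprintf("failed with exit code %d", code)
	}
}

func main() {
	// Abbreviated, illustrative message; real Kubelet output is longer.
	msg := `Exec lifecycle hook failed - error: command exited with 143: , message: ""`
	if code, ok := extractExitCode(msg); ok {
		fmt.Printf("exit code %d: %s\n", code, describeExitCode(code))
	} else {
		fmt.Println("failed with an unknown error")
	}
}
```

When no exit code can be extracted, a generic message is the only safe fallback.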

PR Checklist

  • E2E tests pass (when PR is ready, comment /test v8-devworkspace-operator-e2e, v8-che-happy-path to trigger)
    • v8-devworkspace-operator-e2e: DevWorkspace e2e test
    • v8-che-happy-path: Happy path for verification integration with Che

@akurinnoy akurinnoy self-assigned this May 29, 2025
@akurinnoy akurinnoy requested review from dkwon17 and ibuziuk as code owners May 29, 2025 13:06
@akurinnoy akurinnoy requested a review from rohanKanojia May 29, 2025 13:07
@akurinnoy akurinnoy force-pushed the postStartHookTimeout branch from e90b773 to c342798 Compare May 29, 2025 13:56
@rohanKanojia
Member

I tried the steps above and saw the problematic workspace fail with the [postStart hook] message:

oc get pods -w
NAME                                               READY   STATUS              RESTARTS   AGE
devworkspace-controller-manager-6c948bbf56-k6262   2/2     Running             0          32m
devworkspace-webhook-server-8597b84fc4-kglmf       2/2     Running             0          32m
devworkspace-webhook-server-8597b84fc4-m9rc5       2/2     Running             0          32m
workspace35712747d3d64d73-5c6dcd54dc-gvrd5         0/1     ContainerCreating   0          6s
workspace35712747d3d64d73-5c6dcd54dc-gvrd5         0/1     ContainerCreating   0          9s
workspace35712747d3d64d73-5c6dcd54dc-gvrd5         0/1     PostStartHookError   0          15s
workspace35712747d3d64d73-5c6dcd54dc-gvrd5         0/1     Terminating          0          15s
workspace35712747d3d64d73-5c6dcd54dc-gvrd5         0/1     Terminating          1 (14s ago)   28s
workspace35712747d3d64d73-5c6dcd54dc-gvrd5         0/1     Error                1             28s
workspace35712747d3d64d73-5c6dcd54dc-gvrd5         0/1     Error                1             29s
workspace35712747d3d64d73-5c6dcd54dc-gvrd5         0/1     Error                1             29s

oc get dw
NAME                    DEVWORKSPACE ID             PHASE    INFO
problematic-workspace   workspace35712747d3d64d73   Failed   Error creating DevWorkspace deployment: Container tools has state [postStart hook] Commands failed (Kubelet reported exit code 1)

openshift-ci bot commented Jun 12, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: akurinnoy, rohanKanojia
Once this PR has been reviewed and has the lgtm label, please assign dkwon17 for approval. For more information see the Code Review Process.

akurinnoy added 13 commits July 10, 2025 15:05
Signed-off-by: Oleksii Kurinnyi <[email protected]>
@akurinnoy akurinnoy force-pushed the postStartHookTimeout branch from 61d8918 to 85046e5 Compare July 14, 2025 12:54
@openshift-ci openshift-ci bot removed the lgtm label Jul 14, 2025
openshift-ci bot commented Jul 14, 2025

New changes are detected. LGTM label has been removed.

@akurinnoy akurinnoy force-pushed the postStartHookTimeout branch from 85046e5 to 3b0e379 Compare July 14, 2025 13:13
@akurinnoy akurinnoy marked this pull request as draft July 15, 2025 11:25
@akurinnoy akurinnoy marked this pull request as ready for review July 16, 2025 12:26
@openshift-ci openshift-ci bot requested a review from dkwon17 July 16, 2025 12:26
@akurinnoy akurinnoy force-pushed the postStartHookTimeout branch from cf174a0 to 2813408 Compare July 16, 2025 12:26
codecov bot commented Jul 16, 2025

Codecov Report

Attention: Patch coverage is 74.85380% with 43 lines in your changes missing coverage. Please review.

Project coverage is 40.00%. Comparing base (6e8009c) to head (443872c).
Report is 12 commits behind head on main.

Files with missing lines              Patch %   Lines
pkg/library/lifecycle/poststart.go    77.89%    21 Missing ⚠️
pkg/library/status/check.go           65.00%    21 Missing ⚠️
pkg/library/container/container.go    50.00%    0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1440      +/-   ##
==========================================
+ Coverage   39.57%   40.00%   +0.42%     
==========================================
  Files         160      160              
  Lines       13186    13333     +147     
==========================================
+ Hits         5219     5334     +115     
- Misses       7590     7622      +32     
  Partials      377      377              

@dkwon17
Collaborator

dkwon17 commented Jul 21, 2025

@akurinnoy thank you for the update, but I get the fallback status message when starting problematic-workspace:

'Error creating DevWorkspace deployment: Detected unrecoverable event FailedPostStartHook: [postStart hook] failed with an unknown error (see pod events or container logs for more details)'

instead of the expected message:

Error creating DevWorkspace deployment: Container tools has state [postStart hook] Commands terminated by SIGTERM (likely timed out after 30s). Exit code 143.

Events: (screenshot of the pod events)

Is it expected?

}
return handler, nil
}

// processCommandsForPostStart builds a lifecycle handler that runs the provided command(s)
Collaborator

Suggested change
// processCommandsForPostStart builds a lifecycle handler that runs the provided command(s)
// processCommandsWithoutTimeoutFallback builds a lifecycle handler that runs the provided command(s)

Collaborator Author

Done


// processCommandsForPostStart processes a list of DevWorkspace commands
// and generates a corev1.LifecycleHandler for the PostStart lifecycle hook.
func processCommandsForPostStart(commands []dw.Command, postStartTimeout *int32) (*corev1.LifecycleHandler, error) {
Collaborator

In general, could we define the caller functions above the helper functions? i.e.,

processCommandsForPostStart(...)
processCommandsWithoutTimeoutFallback(...)
buildUserScript(...)
generateScriptWithTimeout(...)

Collaborator Author

Done

@akurinnoy
Collaborator Author

@dkwon17 Hi,

Is it expected?

I couldn't find any proof that this is expected behavior in the Kubernetes docs, but it seems to be the case. I also encountered this behavior while I was testing this PR. I ran the problematic-workspace, and for the first few runs, I got the message with the exact exit code, but for subsequent runs, it was "failed with an unknown error."

@akurinnoy
Collaborator Author

/retest

@dkwon17
Collaborator

dkwon17 commented Jul 28, 2025

/retest

return "", fmt.Errorf("exec command is nil for command ID %s", command.Id)
}
if len(execCmd.Env) > 0 {
return "", fmt.Errorf("env vars in postStart command %s are unsupported", command.Id)
Collaborator

Maybe I'm missing something obvious, but why are env vars in the postStart command not supported?

Collaborator Author

You're right, there's nothing here that prevents us from passing env variables to the postStart command. I fixed this issue.
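One way env variables could be passed to a postStart command is to prepend `export` statements to the command line. The sketch below only illustrates that idea and is not the fix applied in this PR; `EnvVar` is a stand-in for the devfile env entry, and `withEnv` and `shellQuote` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// EnvVar is a stand-in for a devfile env entry (name/value pair).
type EnvVar struct {
	Name  string
	Value string
}

// shellQuote wraps s in single quotes so that sh treats it as one literal word.
func shellQuote(s string) string {
	return "'" + strings.ReplaceAll(s, "'", `'\''`) + "'"
}

// withEnv is a hypothetical helper that prepends `export NAME='value';`
// statements to commandLine. Values are quoted; names are used as-is, so a
// real implementation would also need to validate them.
func withEnv(env []EnvVar, commandLine string) string {
	var b strings.Builder
	for _, e := range env {
		fmt.Fprintf(&b, "export %s=%s; ", e.Name, shellQuote(e.Value))
	}
	b.WriteString(commandLine)
	return b.String()
}

func main() {
	env := []EnvVar{{Name: "GREETING", Value: "hello world"}}
	fmt.Println(withEnv(env, "echo $GREETING"))
}
```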

// +kubebuilder:validation:Optional
// +kubebuilder:validation:Type=integer
// +kubebuilder:validation:Format=int32
PostStartTimeout *int32 `json:"postStartTimeout,omitempty"`
Collaborator

On second thought, could we change the type from *int32 to string so that it is consistent with the other timeouts?

// IdleTimeout determines how long a workspace should sit idle before being
// automatically scaled down. Proper functionality of this configuration property
// requires support in the workspace being started. If not specified, the default
// value of "15m" is used.
IdleTimeout string `json:"idleTimeout,omitempty"`
// ProgressTimeout determines the maximum duration a DevWorkspace can be in
// a "Starting" or "Failing" phase without progressing before it is automatically failed.
// Duration should be specified in a format parseable by Go's time package, e.g.
// "15m", "20s", "1h30m", etc. If not specified, the default value of "5m" is used.
ProgressTimeout string `json:"progressTimeout,omitempty"`

Collaborator Author

This makes sense; I have changed the timeout type.
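With a string-typed field, the value has to be parsed before it can be handed to the `timeout` utility. A minimal sketch using Go's `time.ParseDuration` follows; `timeoutSeconds` is a hypothetical name, and treating an empty string as "no timeout" is an assumption rather than something stated in this thread.

```go
package main

import (
	"fmt"
	"time"
)

// timeoutSeconds is a hypothetical helper. It parses a Go duration string
// such as "30s" or "1h30m" and returns whole seconds, rounding up so that a
// sub-second value does not collapse to 0 (which would mean "no timeout").
// Assumption: an empty string also means "no timeout".
func timeoutSeconds(s string) (int64, error) {
	if s == "" {
		return 0, nil
	}
	d, err := time.ParseDuration(s)
	if err != nil {
		return 0, fmt.Errorf("invalid postStartTimeout %q: %w", s, err)
	}
	if d < 0 {
		return 0, fmt.Errorf("postStartTimeout %q must not be negative", s)
	}
	return int64((d + time.Second - 1) / time.Second), nil
}

func main() {
	for _, v := range []string{"30s", "1h30m", "500ms", "30"} {
		secs, err := timeoutSeconds(v)
		fmt.Println(v, secs, err)
	}
}
```

Note that a bare integer such as `30`, which was valid for the int32 field, is rejected by `time.ParseDuration` because it has no unit.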

@akurinnoy akurinnoy force-pushed the postStartHookTimeout branch from 434a115 to 2d2ad37 Compare August 4, 2025 09:14
@akurinnoy
Collaborator Author

/retest

openshift-ci bot commented Aug 4, 2025

@akurinnoy: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                    Commit    Details   Required   Rerun command
ci/prow/v14-che-happy-path   7272e1d   link      true       /test v14-che-happy-path

@dkwon17
Copy link
Collaborator

dkwon17 commented Aug 6, 2025

After more testing, I noticed that this DW fails when the postStartTimeout is set, but succeeds when it is not set:

apiVersion: workspace.devfile.io/v1alpha2
kind: DevWorkspace
metadata:
  name: problematic-workspace
spec:
  started: false
  template:
    components:
      - name: tools
        container:
          image: quay.io/dkwon17/test:test-dir
          memoryLimit: "1Gi"
          memoryRequest: "512Mi"
          cpuRequest: "250m"
          cpuLimit: "1000m"
    commands:
      - id: test
        exec:
          workingDir: '/projects/test dir'
          component: tools
          commandLine: "mkdir mydir"
    events:
      postStart:
        - test

@akurinnoy are you able to reproduce the problem?
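The thread does not identify the cause of this failure. Since the only unusual part of the reproducer is the space in `workingDir`, one thing worth ruling out is how the directory is quoted when the script is generated. The sketch below shows a quoting approach that keeps such a path as a single shell word; `inWorkingDir` and `shellQuote` are hypothetical names and this is not the operator's code.

```go
package main

import (
	"fmt"
	"strings"
)

// shellQuote wraps s in single quotes so that sh treats it as one literal word.
func shellQuote(s string) string {
	return "'" + strings.ReplaceAll(s, "'", `'\''`) + "'"
}

// inWorkingDir is a hypothetical helper that changes into dir before running
// commandLine. Quoting dir keeps a path containing spaces as one argument to cd.
func inWorkingDir(dir, commandLine string) string {
	if dir == "" {
		return commandLine
	}
	return fmt.Sprintf("cd %s && %s", shellQuote(dir), commandLine)
}

func main() {
	fmt.Println(inWorkingDir("/projects/test dir", "mkdir mydir"))
}
```

If the generated script is later embedded in another quoted string (for example as an argument to `sh -c` under `timeout`), the outer layer needs its own quoting pass as well.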
