@moko-poi commented Sep 29, 2025

Summary

Fixes #658

Sets BackoffLimit to 0 for both starter and stopper jobs to prevent excessive pod creation on failures, aligning their behavior with the initializer and runner jobs.

Problem

The starter and stopper jobs were missing a BackoffLimit configuration, causing them to use Kubernetes' default value of 6. This led to:

  • Excessive pod creation: Up to 6 pods created for the same failure
  • Resource waste: Unnecessary compute resources consumed
  • Inconsistent behavior: Initializer and runner jobs fail immediately with BackoffLimit: 0, but starter and stopper retry 6 times
  • Delayed error detection: Takes longer to identify actual issues
  • Redundant logging: Same failure logged 6 times

Since the starter and stopper jobs' curl commands already have --retry 3 built-in, the job-level retries were redundant and resulted in up to 18 total attempts (6 pods × 3 curl retries) for the same failure.

Changes

  • Added BackoffLimit: &zero32 to the starter job specification in pkg/resources/jobs/starter.go (see the sketch after this list)
  • Added BackoffLimit: &zero32 to the stopper job specification in pkg/resources/jobs/stopper.go
  • Updated corresponding test files to match the new expected behavior
  • This makes the starter and stopper job behavior consistent with initializer and runner jobs, both of which already use BackoffLimit: 0
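
For illustration, the pattern amounts to the following Go sketch. This is not the actual NewStarterJob code from starter.go; the function name, container name, and image are placeholders, and only the BackoffLimit: &zero32 line reflects the actual change:

package jobs

import (
    batchv1 "k8s.io/api/batch/v1"
    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// zero32 gives the pointer-typed BackoffLimit field an addressable zero;
// the initializer and runner jobs already use the same variable.
var zero32 int32 = 0

// newStarterJobSketch is a placeholder illustrating where BackoffLimit
// sits in the Job spec; it is not the real starter job constructor.
func newStarterJobSketch(name, namespace string) *batchv1.Job {
    return &batchv1.Job{
        ObjectMeta: metav1.ObjectMeta{
            Name:      name + "-starter",
            Namespace: namespace,
        },
        Spec: batchv1.JobSpec{
            // Fail the job after the first pod failure instead of letting
            // the API server default this field to 6 retries.
            BackoffLimit: &zero32,
            Template: corev1.PodTemplateSpec{
                Spec: corev1.PodSpec{
                    RestartPolicy: corev1.RestartPolicyNever,
                    Containers: []corev1.Container{{
                        Name:  "k6-starter",
                        Image: "curlimages/curl", // placeholder image
                    }},
                },
            },
        },
    }
}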

Testing

  • Verified that starter and stopper jobs now fail immediately after curl's internal retries complete
  • Confirmed only one starter/stopper pod is created on failure, matching initializer/runner behavior
  • Updated unit tests in starter_test.go and stopper_test.go to include BackoffLimit expectations (a sketch of the assertion follows this list)
  • No behavioral changes for successful test runs
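
To illustrate the test-side change, here is a minimal assertion in the spirit of the updated expectations, reusing the placeholder constructor sketched above. The test name is a placeholder and the actual tests in starter_test.go and stopper_test.go are structured differently:

package jobs

import "testing"

// TestStarterBackoffLimitSketch is a placeholder-named illustration of
// the expectation added to the job tests: BackoffLimit must be an
// explicit 0, not nil (nil would be server-defaulted to 6).
func TestStarterBackoffLimitSketch(t *testing.T) {
    job := newStarterJobSketch("k6-test", "default")

    if job.Spec.BackoffLimit == nil {
        t.Fatal("expected BackoffLimit to be set; nil defaults to 6 in Kubernetes")
    }
    if got := *job.Spec.BackoffLimit; got != 0 {
        t.Errorf("expected BackoffLimit 0, got %d", got)
    }
}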

Impact

  • Breaking Change: No
  • Performance: Improved - eliminates unnecessary pod creation and speeds up failure detection
  • Consistency: All job types (starter, stopper, initializer, runner) now have the same BackoffLimit policy

@moko-poi requested a review from yorugac as a code owner on September 29, 2025, 23:40
@moko-poi changed the title from "fix: set BackoffLimit to zero for starter job to prevent excessive pod creation" to "fix: set BackoffLimit to zero for starter and stopper jobs to prevent excessive pod creation" on Sep 30, 2025
@moko-poi commented:

Test Environment

  • Kind cluster
  • k6-operator image: k6operator:starter-job-backoff
  • Kubernetes version: 1.31.0

Test Scenario

  1. Created a TestRun with a simple k6 script
  2. Deleted the runner pod immediately after creation to simulate starter job failure
  3. Observed starter job behavior

Results

BackoffLimit Successfully Set to 0

$ kubectl describe job k6-test-starter
Name:             k6-test-starter
Namespace:        default
Backoff Limit:    0
Start Time:       Tue, 30 Sep 2025 09:21:22 +0900
Pods Statuses:    0 Active (0 Ready) / 0 Succeeded / 1 Failed

Single Pod Creation (No Excessive Retries)

Before fix: Up to 6 pods could be created due to default BackoffLimit
After fix: Only 1 pod created

$ kubectl get pods -l k6_cr=k6-test
NAME                        READY   STATUS      RESTARTS   AGE
k6-test-initializer-vptgh   0/1     Completed   0          33s
k6-test-starter-4kc59       0/1     Error       0          22s

Immediate Failure Detection

$ kubectl get jobs -l k6_cr=k6-test
NAME                  STATUS     COMPLETIONS   DURATION   AGE
k6-test-starter       Failed     0/1           26s        26s

Job Events:

Events:
  Type     Reason                Age   From            Message
  ----     ------                ----  ----            -------
  Normal   SuccessfulCreate      29s   job-controller  Created pod: k6-test-starter-4kc59
  Warning  BackoffLimitExceeded  22s   job-controller  Job has reached the specified backoff limit

Curl Internal Retries Still Working

Starter pod logs showing curl's --retry 3 in action

$ kubectl logs k6-test-starter-4kc59
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0      0      0 --:--:-- --:--:-- --:--:--     0
curl: (7) Failed to connect to 10.96.146.19 port 6565 after 0 ms: Could not connect to server

curl still attempted the connection using its internal retry mechanism, while the redundant job-level retries were eliminated.

Linked issue: Starter and stopper jobs have inconsistent BackoffLimit setting causing excessive pod creation on failures (#658)