@moko-poi commented Sep 29, 2025

Summary

Fixes #658

Sets BackoffLimit to 0 for both starter and stopper jobs to prevent excessive pod creation on failures, aligning their behavior with the initializer and runner jobs.

Problem

The starter and stopper jobs were missing a BackoffLimit configuration, causing them to use Kubernetes' default value of 6. This led to:

  • Excessive pod creation: Up to 6 pods created for the same failure
  • Resource waste: Unnecessary compute resources consumed
  • Inconsistent behavior: Initializer and runner jobs fail immediately with BackoffLimit: 0, but starter and stopper retry 6 times
  • Delayed error detection: Takes longer to identify actual issues
  • Redundant logging: Same failure logged 6 times

Since the starter and stopper jobs' curl commands already have --retry 3 built-in, the job-level retries were redundant and resulted in up to 18 total attempts (6 pods × 3 curl retries) for the same failure.

Changes

  • Added BackoffLimit: &zero32 to the starter job specification in pkg/resources/jobs/starter.go (see the sketch after this list)
  • Added BackoffLimit: &zero32 to the stopper job specification in pkg/resources/jobs/stopper.go
  • Updated corresponding test files to match the new expected behavior
  • This makes the starter and stopper job behavior consistent with initializer and runner jobs, both of which already use BackoffLimit: 0
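
For illustration, the pattern amounts to the following Go sketch. This is not the actual NewStarterJob code from starter.go; the function name, container name, and image are placeholders, and only the BackoffLimit: &zero32 line reflects the actual change:

package jobs

import (
    batchv1 "k8s.io/api/batch/v1"
    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// zero32 gives the pointer-typed BackoffLimit field an addressable zero;
// the initializer and runner jobs already use the same variable.
var zero32 int32 = 0

// newStarterJobSketch is a placeholder illustrating where BackoffLimit
// sits in the Job spec; it is not the real starter job constructor.
func newStarterJobSketch(name, namespace string) *batchv1.Job {
    return &batchv1.Job{
        ObjectMeta: metav1.ObjectMeta{
            Name:      name + "-starter",
            Namespace: namespace,
        },
        Spec: batchv1.JobSpec{
            // Fail the job after the first pod failure instead of letting
            // the API server default this field to 6 retries.
            BackoffLimit: &zero32,
            Template: corev1.PodTemplateSpec{
                Spec: corev1.PodSpec{
                    RestartPolicy: corev1.RestartPolicyNever,
                    Containers: []corev1.Container{{
                        Name:  "k6-starter",
                        Image: "curlimages/curl", // placeholder image
                    }},
                },
            },
        },
    }
}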

Testing

  • Verified that starter and stopper jobs now fail immediately after curl's internal retries complete
  • Confirmed only one starter/stopper pod is created on failure, matching initializer/runner behavior
  • Updated unit tests in starter_test.go and stopper_test.go to include BackoffLimit expectations (a sketch of the assertion follows this list)
  • No behavioral changes for successful test runs
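
To illustrate the test-side change, here is a minimal assertion in the spirit of the updated expectations, reusing the placeholder constructor sketched above. The test name is a placeholder and the actual tests in starter_test.go and stopper_test.go are structured differently:

package jobs

import "testing"

// TestStarterBackoffLimitSketch is a placeholder-named illustration of
// the expectation added to the job tests: BackoffLimit must be an
// explicit 0, not nil (nil would be server-defaulted to 6).
func TestStarterBackoffLimitSketch(t *testing.T) {
    job := newStarterJobSketch("k6-test", "default")

    if job.Spec.BackoffLimit == nil {
        t.Fatal("expected BackoffLimit to be set; nil defaults to 6 in Kubernetes")
    }
    if got := *job.Spec.BackoffLimit; got != 0 {
        t.Errorf("expected BackoffLimit 0, got %d", got)
    }
}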

Impact

  • Breaking Change: No
  • Performance: Improved - eliminates unnecessary pod creation and speeds up failure detection
  • Consistency: All job types (starter, stopper, initializer, runner) now have the same BackoffLimit policy

@moko-poi requested a review from yorugac as a code owner on September 29, 2025, 23:40
@moko-poi changed the title from "fix: set BackoffLimit to zero for starter job to prevent excessive pod creation" to "fix: set BackoffLimit to zero for starter and stopper jobs to prevent excessive pod creation" on Sep 30, 2025
@moko-poi commented:

Test Environment

  • Kind cluster
  • k6-operator image: k6operator:starter-job-backoff
  • Kubernetes version: 1.31.0

Test Scenario

  1. Created a TestRun with a simple k6 script
  2. Deleted the runner pod immediately after creation to simulate starter job failure
  3. Observed starter job behavior

Results

BackoffLimit Successfully Set to 0

$ kubectl describe job k6-test-starter
Name:             k6-test-starter
Namespace:        default
Backoff Limit:    0
Start Time:       Tue, 30 Sep 2025 09:21:22 +0900
Pods Statuses:    0 Active (0 Ready) / 0 Succeeded / 1 Failed

Single Pod Creation (No Excessive Retries)

Before fix: Up to 6 pods could be created due to default BackoffLimit
After fix: Only 1 pod created

$ kubectl get pods -l k6_cr=k6-test
NAME                        READY   STATUS      RESTARTS   AGE
k6-test-initializer-vptgh   0/1     Completed   0          33s
k6-test-starter-4kc59       0/1     Error       0          22s

Immediate Failure Detection

$ kubectl get jobs -l k6_cr=k6-test
NAME                  STATUS     COMPLETIONS   DURATION   AGE
k6-test-starter       Failed     0/1           26s        26s

Job Events:

Events:
  Type     Reason                Age   From            Message
  ----     ------                ----  ----            -------
  Normal   SuccessfulCreate      29s   job-controller  Created pod: k6-test-starter-4kc59
  Warning  BackoffLimitExceeded  22s   job-controller  Job has reached the specified backoff limit

Curl Internal Retries Still Working

Starter pod logs showing curl's --retry 3 in action

$ kubectl logs k6-test-starter-4kc59
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0      0      0 --:--:-- --:--:-- --:--:--     0
curl: (7) Failed to connect to 10.96.146.19 port 6565 after 0 ms: Could not connect to server

curl still attempted the connection using its internal retry mechanism, while the redundant job-level retries were eliminated.

Linked issue: Starter and stopper jobs have inconsistent BackoffLimit setting causing excessive pod creation on failures (#658)