fix(scheduler/model-gw): failed pipelines never retried #6917

domsolutions · 2025-11-12T19:05:31Z

Motivation

If the scheduler issues create-pipeline commands to dataflow-engine, model-gw and pipeline-gw and creation fails (e.g. due to Kafka connectivity issues), the pipeline remains in a not-Ready state and the scheduler never tries to rectify this. The same was true for terminating pipelines.

Summary of changes

Changed the PipelineFailed state to represent only pipelines which failed to create. This does mean that if any pipelines were previously in this state because they failed to terminate, the scheduler will try to create them.
New state PipelineFailedTerminating for pipelines which failed to terminate.
New env var on the scheduler, RETRY_CREATING_FAILED_PIPELINES_TICK, defaulting to 1 minute. It is used by 3 goroutines which poll for any pipelines that failed to create and re-issue commands to the required services. There is one goroutine per service: dataflow-engine, model-gw, pipeline-gw.
New env var on the scheduler, RETRY_DELETING_FAILED_PIPELINES_TICK, defaulting to 1 minute. It is used by 3 goroutines which poll for any pipelines that failed to terminate and re-issue commands to the required services. There is one goroutine per service: dataflow-engine, model-gw, pipeline-gw.
Fixed a bug in model-gw: if loading a pipeline failed because topics could not be created in Kafka, the second attempt would report success even though it still couldn't connect to Kafka. This was because the model was not removed from the loaded models map.
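To make the polling behaviour above concrete, here is a minimal sketch of one retry goroutine. It is not the code from this PR: `memStore`, `retryOnce`, `retryLoop` and the `reissue` callback are hypothetical names, the status values are illustrative, and the real scheduler store and service clients will look different.

```go
package main

import (
	"context"
	"sync"
	"time"
)

// PipelineStatus models only the two failure states named in this PR.
// The numeric values are illustrative, not the scheduler's.
type PipelineStatus int

const (
	PipelineFailed PipelineStatus = iota
	PipelineFailedTerminating
)

// memStore is a hypothetical in-memory stand-in for the pipeline store.
type memStore struct {
	mu       sync.RWMutex
	byStatus map[PipelineStatus][]string
}

// pipelinesWithStatus returns a copy of the pipeline names in the given state.
func (m *memStore) pipelinesWithStatus(s PipelineStatus) []string {
	m.mu.RLock()
	defer m.mu.RUnlock()
	return append([]string(nil), m.byStatus[s]...)
}

// retryOnce is the body of a single tick: it re-issues a command for every
// pipeline currently in the given failure state and returns how many it saw.
func retryOnce(store *memStore, status PipelineStatus, reissue func(name string)) int {
	names := store.pipelinesWithStatus(status)
	for _, name := range names {
		reissue(name)
	}
	return len(names)
}

// retryLoop calls retryOnce on every tick until ctx is cancelled.
// tick must be positive, otherwise time.NewTicker panics.
func retryLoop(ctx context.Context, tick time.Duration, store *memStore,
	status PipelineStatus, reissue func(name string)) {
	ticker := time.NewTicker(tick)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			retryOnce(store, status, reissue)
		}
	}
}
```

Under these assumptions, the scheduler would start one such loop per service and per failure state, for example `go retryLoop(ctx, time.Minute, store, PipelineFailed, reissueCreate)`, where `reissueCreate` is a placeholder for whatever sends the create command to that service.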

The issues were actually only noticed with model-gw and dataflow-engine; pipeline-gw was added for completeness. pipeline-gw responds with success even if it does not have connectivity to Kafka. This should potentially be changed in the future, as it reports ready when it's not able to send requests to Kafka.
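The model-gw fix listed in the summary of changes (removing the entry from the loaded models map when topic creation fails) can be illustrated with a small sketch. Everything here is hypothetical: `gateway`, `Load`, `IsLoaded` and the `createTopics` callback stand in for the real model-gw types, and the actual code may order these steps differently.

```go
package main

import (
	"errors"
	"sync"
)

// errKafkaDown is a placeholder error for a failed topic-creation call.
var errKafkaDown = errors.New("kafka unavailable")

// gateway is a hypothetical stand-in for model-gw's loaded-model bookkeeping.
type gateway struct {
	mu           sync.Mutex
	loaded       map[string]struct{}
	createTopics func(name string) error // placeholder for the Kafka step
}

func newGateway(createTopics func(name string) error) *gateway {
	return &gateway{loaded: map[string]struct{}{}, createTopics: createTopics}
}

// Load registers name and creates its topics. If topic creation fails, the
// entry is removed again so that a retry does not hit the early return and
// wrongly report success.
func (g *gateway) Load(name string) error {
	g.mu.Lock()
	defer g.mu.Unlock()
	if _, ok := g.loaded[name]; ok {
		return nil // already loaded, nothing to do
	}
	g.loaded[name] = struct{}{}
	if err := g.createTopics(name); err != nil {
		delete(g.loaded, name) // the cleanup that was missing before the fix
		return err
	}
	return nil
}

// IsLoaded reports whether name is currently recorded as loaded.
func (g *gateway) IsLoaded(name string) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	_, ok := g.loaded[name]
	return ok
}
```

Without the `delete` call, a second `Load` for the same name would return `nil` while Kafka is still unreachable, which matches the symptom described in the summary.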

It was also noticed that once pipelines were successfully created, if all Kafka brokers were brought down, the pipeline would still report as ready. In future we could look at adding an additional pipeline state, PipelineRuntimeError, with each service reporting its health.

How to test

In Kind, set replicas: 0 on the kafkanodepool and then delete the StrimziPodSet. This causes the brokers to terminate permanently. Then restart scheduler, dataflow-engine, model-gw and pipeline-gw, and wait for the pipeline to become not ready. Set replicas: 1 on the kafkanodepool and the brokers will come back up; the pipeline should eventually become ready.

Checklist

Added/updated unit tests
Added/updated documentation
Checked for typos in variable names, comments, etc.
Added licences for new files

Testing

…ty issues

lc525

Overall looks good, I've made some initial comments on the PR -- let's discuss and clarify.

scheduler/pkg/kafka/dataflow/server.go

scheduler/pkg/kafka/dataflow/server_test.go

scheduler/pkg/kafka/gateway/infer.go

scheduler/pkg/server/pipeline_status.go

MiguelAAe · 2025-11-25T16:21:12Z

scheduler/pkg/store/pipeline/pipelinestatus_string.go

+	_ = x[PipelineFailedTerminating-9]
+}
+
+const _PipelineStatus_name = "PipelineStatusUnknownPipelineCreatePipelineCreatingPipelineReadyPipelineFailedPipelineTerminatePipelineTerminatingPipelineTerminatedPipelineRebalancingPipelineFailedTerminating"


very cool the way stringer designs the implementation, I didn't know there was a tool to automate this
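For readers unfamiliar with the pattern: stringer stores all names in one concatenated string plus a table of offsets, and `String()` slices between two adjacent offsets. The sketch below rebuilds that idea by hand rather than reproducing the generated file. The order of names is inferred from the string in the diff above, `int` as the underlying type is an assumption, and `statusNames`, `buildTable`, `statusName` and `statusIndex` are illustrative names.

```go
package main

import "strconv"

// PipelineStatus uses int here as an assumption; the real type may differ.
type PipelineStatus int

// statusNames lists the states in the order inferred from the generated
// name string, so index 9 is PipelineFailedTerminating.
var statusNames = []string{
	"PipelineStatusUnknown",
	"PipelineCreate",
	"PipelineCreating",
	"PipelineReady",
	"PipelineFailed",
	"PipelineTerminate",
	"PipelineTerminating",
	"PipelineTerminated",
	"PipelineRebalancing",
	"PipelineFailedTerminating",
}

// buildTable concatenates the names and records where each one starts,
// followed by one final offset marking the end of the last name.
func buildTable(names []string) (string, []int) {
	joined := ""
	index := make([]int, 0, len(names)+1)
	for _, n := range names {
		index = append(index, len(joined))
		joined += n
	}
	index = append(index, len(joined))
	return joined, index
}

var statusName, statusIndex = buildTable(statusNames)

// String slices the name for s out of the concatenated string, falling back
// to a numeric form for values outside the known range.
func (s PipelineStatus) String() string {
	if s < 0 || int(s) >= len(statusIndex)-1 {
		return "PipelineStatus(" + strconv.Itoa(int(s)) + ")"
	}
	return statusName[statusIndex[s]:statusIndex[s+1]]
}
```

The `_ = x[PipelineFailedTerminating-9]` line in the diff is stringer's compile-time guard: it stops compiling if the constant's value changes without the file being regenerated.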

lc525

lgtm; one suggestion regarding the state keeping for "currentRetries" per model, and a minor nit
regarding naming. Thank you for implementing this, it closes a loophole in Core's handling of
failures.
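As a rough illustration of per-key retry state with an upper bound, which is cleared when the pipeline or model is deleted, here is a minimal sketch. It is not the code from this PR: `retryTracker`, `Allow` and `Forget` are hypothetical names, and how the real scheduler stores "currentRetries" is not shown on this page.

```go
package main

import "sync"

// retryTracker is a hypothetical holder for per-pipeline (or per-model)
// retry counts with an upper bound.
type retryTracker struct {
	mu      sync.Mutex
	max     int
	retries map[string]int
}

func newRetryTracker(max int) *retryTracker {
	return &retryTracker{max: max, retries: map[string]int{}}
}

// Allow reports whether another retry may be attempted for key and, if so,
// counts it against the limit.
func (t *retryTracker) Allow(key string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.retries[key] >= t.max {
		return false
	}
	t.retries[key]++
	return true
}

// Forget drops the count for key, for example when the pipeline or model is
// deleted, so the map does not grow without bound and a later resource with
// the same name starts from zero.
func (t *retryTracker) Forget(key string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.retries, key)
}
```

Keeping the counter in a small dedicated type like this is one way to keep the increment, the limit check and the cleanup in one place; other layouts are equally possible.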

scheduler/pkg/kafka/conflict-resolution/conflict_resolution.go

scheduler/pkg/server/server_status.go

domsolutions added 4 commits November 10, 2025 17:07

retry failed pipeline on dataflow-engine

681b548

fix: failed pipelines never creating/deleting due to kafka connectivi…

f81a586

…ty issues

copyright

4eefc98

tests

9fb8543

domsolutions requested a review from lc525 as a code owner November 12, 2025 19:05

domsolutions changed the title ~~Infra 1652/forever failed pipelines~~ fix(scheduler/model-gw): failed pipelines never retried Nov 12, 2025

domsolutions added 2 commits November 12, 2025 19:16

copyright

086b9d4

do not try to create pipeline if not latest version

c25a064

lc525 reviewed Nov 20, 2025

PR comments and max retry feature

f90b817

MiguelAAe reviewed Nov 25, 2025

lc525 approved these changes Nov 26, 2025

domsolutions added 2 commits November 27, 2025 10:41

PR comments

1a03472

remove retry count when pipeline/model deleted

94c7422

domsolutions merged commit 21cbc5a into v2 Nov 27, 2025
7 of 8 checks passed

domsolutions deleted the INFRA-1652/forever-failed-pipelines branch November 27, 2025 14:05

4 participants
