@domsolutions domsolutions commented Nov 12, 2025

Motivation

If the scheduler issues create-pipeline commands to dataflow-engine, model-gw and pipeline-gw and they fail (e.g. due to Kafka connectivity issues), the pipeline remains in a not-Ready state and the scheduler never tries to rectify this. The same applies to terminating pipelines.

Summary of changes

  • changed the PipelineFailed state to only represent pipelines which failed to create. This does mean that if any pipelines were previously in this state because they failed to terminate, the scheduler will try to create them
  • new state PipelineFailedTerminating for pipelines which failed to terminate
  • new env var on scheduler RETRY_CREATING_FAILED_PIPELINES_TICK, defaulting to 1 minute, used by 3 goroutines which poll for any pipelines which failed to create and re-issue cmds to the required services. There's a goroutine for each service: dataflow-engine, model-gw, pipeline-gw
  • new env var on scheduler RETRY_DELETING_FAILED_PIPELINES_TICK, defaulting to 1 minute, used by 3 goroutines which poll for any pipelines which failed to terminate and re-issue cmds to the required services. There's a goroutine for each service: dataflow-engine, model-gw, pipeline-gw
  • fixed bug in model-gw where, if loading a pipeline failed because topics could not be created in Kafka, the second attempt would report success even though it still couldn't connect to Kafka. This was due to the model not being removed from the loaded models map
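As a rough illustration of the polling retry described in the list above, here is a minimal sketch. It is not the scheduler's actual code: the `store` type, `retryOnce`, `retryLoop` and `tickFromEnv` are hypothetical names, and it assumes the env var holds a Go duration string such as `30s`, which the description does not confirm.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"sync"
	"time"
)

// PipelineState models only the two failure states described in this PR.
type PipelineState int

const (
	PipelineFailed            PipelineState = iota // failed to create
	PipelineFailedTerminating                      // failed to terminate
)

// defaultTick matches the 1 minute default stated in the PR description.
const defaultTick = time.Minute

// store is a hypothetical in-memory stand-in for the scheduler's pipeline store.
type store struct {
	mu     sync.Mutex
	states map[string]PipelineState
}

// withState returns the names of all pipelines currently in the given state.
func (s *store) withState(want PipelineState) []string {
	s.mu.Lock()
	defer s.mu.Unlock()
	var names []string
	for name, st := range s.states {
		if st == want {
			names = append(names, name)
		}
	}
	return names
}

// tickFromEnv assumes the variable holds a Go duration string (e.g. "30s")
// and falls back to defaultTick when it is unset or invalid.
func tickFromEnv(key string) time.Duration {
	if d, err := time.ParseDuration(os.Getenv(key)); err == nil && d > 0 {
		return d
	}
	return defaultTick
}

// retryOnce re-issues a command for every pipeline in state want and
// returns how many commands were issued.
func retryOnce(s *store, want PipelineState, reissue func(name string)) int {
	names := s.withState(want)
	for _, name := range names {
		reissue(name)
	}
	return len(names)
}

// retryLoop calls retryOnce on every tick until ctx is cancelled.
func retryLoop(ctx context.Context, tick time.Duration, s *store, want PipelineState, reissue func(name string)) {
	t := time.NewTicker(tick)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-t.C:
			retryOnce(s, want, reissue)
		}
	}
}

func main() {
	s := &store{states: map[string]PipelineState{
		"p1": PipelineFailed,
		"p2": PipelineFailedTerminating,
	}}
	fmt.Println("create tick:", tickFromEnv("RETRY_CREATING_FAILED_PIPELINES_TICK"))

	// Short timeout and tick so the sketch terminates quickly.
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()

	// One goroutine per service, as in the description.
	var wg sync.WaitGroup
	for _, svc := range []string{"dataflow-engine", "model-gw", "pipeline-gw"} {
		wg.Add(1)
		go func(svc string) {
			defer wg.Done()
			retryLoop(ctx, 20*time.Millisecond, s, PipelineFailed, func(name string) {
				fmt.Printf("%s: re-issuing create for %s\n", svc, name)
			})
		}(svc)
	}
	wg.Wait()
}
```

In this sketch a retried pipeline stays in its failed state, so it is picked up again on every tick; a real scheduler would move it out of that state once the service reports success. A second set of loops watching `PipelineFailedTerminating` would cover the terminate path in the same way.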

The issues were actually only noticed with model-gw and dataflow-engine; pipeline-gw was added for completeness. pipeline-gw responds with success even if it does not have connectivity to Kafka. This should potentially be changed in the future, as it reports ready when it's not able to send requests to Kafka.
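The model-gw bug mentioned in the summary can be pictured with a small sketch. This is only a guess at the general shape of the problem, not the real model-gw code: the `gateway` type, its `loaded` map, the `createTopics` hook and `errKafkaUnavailable` are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// errKafkaUnavailable is a hypothetical error standing in for a failed topic creation.
var errKafkaUnavailable = errors.New("kafka unavailable")

// gateway is a hypothetical stand-in for a service that tracks loaded models.
type gateway struct {
	loaded       map[string]bool
	createTopics func(name string) error
}

// load registers the model and creates its topics. If topic creation fails,
// the entry is removed again so that a later retry does real work.
func (g *gateway) load(name string) error {
	if g.loaded[name] {
		return nil // already loaded, nothing to do
	}
	g.loaded[name] = true
	if err := g.createTopics(name); err != nil {
		// Without this delete, the next call would hit the early return
		// above and report success despite Kafka still being unreachable.
		delete(g.loaded, name)
		return fmt.Errorf("load %q: %w", name, err)
	}
	return nil
}

func main() {
	kafkaUp := false
	g := &gateway{
		loaded: map[string]bool{},
		createTopics: func(string) error {
			if !kafkaUp {
				return errKafkaUnavailable
			}
			return nil
		},
	}
	fmt.Println("first attempt:", g.load("m1"))
	fmt.Println("second attempt:", g.load("m1"))
	kafkaUp = true
	fmt.Println("after Kafka recovers:", g.load("m1"))
}
```

With the cleanup in place, both attempts made while Kafka is down return an error, and only the attempt after recovery succeeds.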

It was also noticed that once pipelines were successfully created, if all Kafka brokers were brought down, the pipeline would still report as ready. In future we could look at adding an additional pipeline state, PipelineRuntimeError, with each service reporting its health.

How to test

In Kind, set replicas: 0 on kafkanodepool and then delete the StrimziPodSet. This will cause the brokers to terminate permanently. Then restart scheduler, dataflow-engine, model-gw and pipeline-gw, and wait for the pipeline to become not ready. Set replicas: 1 on kafkanodepool and the brokers will come back up; the pipeline should eventually become ready.

Checklist

  • Added/updated unit tests
  • Added/updated documentation
  • Checked for typos in variable names, comments, etc.
  • Added licences for new files

Testing

@domsolutions domsolutions requested a review from lc525 as a code owner November 12, 2025 19:05
@domsolutions domsolutions changed the title Infra 1652/forever failed pipelines fix(scheduler/model-gw): failed pipelines never retried Nov 12, 2025
@lc525 lc525 left a comment

Overall looks good, I've made some initial comments on the PR -- let's discuss and clarify.

_ = x[PipelineFailedTerminating-9]
}

const _PipelineStatus_name = "PipelineStatusUnknownPipelineCreatePipelineCreatingPipelineReadyPipelineFailedPipelineTerminatePipelineTerminatingPipelineTerminatedPipelineRebalancingPipelineFailedTerminating"

Contributor

very cool the way stringer designs the implementation, I didn't know there was a tool to automate this
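For readers unfamiliar with the pattern in the generated snippet above, here is a minimal sketch of how stringer-style code maps a constant to its name: all names are concatenated into one string and an index array records where each name starts and ends. The constant order is inferred from the `_PipelineStatus_name` string in the diff, and the offsets below were worked out by hand from it, so treat them as illustrative rather than copied from the generated file.

```go
package main

import (
	"fmt"
	"strconv"
)

type PipelineStatus int

const (
	PipelineStatusUnknown PipelineStatus = iota
	PipelineCreate
	PipelineCreating
	PipelineReady
	PipelineFailed
	PipelineTerminate
	PipelineTerminating
	PipelineTerminated
	PipelineRebalancing
	PipelineFailedTerminating
)

// Compile-time guard in the style of the generated code: if the constant's
// value ever changes from 9, the array index is out of range and the build fails.
func _() {
	var x [1]struct{}
	_ = x[PipelineFailedTerminating-9]
}

const _PipelineStatus_name = "PipelineStatusUnknownPipelineCreatePipelineCreatingPipelineReadyPipelineFailedPipelineTerminatePipelineTerminatingPipelineTerminatedPipelineRebalancingPipelineFailedTerminating"

// Offsets into _PipelineStatus_name; entry i is the start of name i and
// entry i+1 is its end. Hand-computed for this sketch.
var _PipelineStatus_index = [...]uint8{0, 21, 35, 51, 64, 78, 95, 114, 132, 151, 176}

// String slices the shared name string instead of storing one string per value.
func (i PipelineStatus) String() string {
	if i < 0 || i >= PipelineStatus(len(_PipelineStatus_index)-1) {
		return "PipelineStatus(" + strconv.FormatInt(int64(i), 10) + ")"
	}
	return _PipelineStatus_name[_PipelineStatus_index[i]:_PipelineStatus_index[i+1]]
}

func main() {
	fmt.Println(PipelineFailed)
	fmt.Println(PipelineFailedTerminating)
	fmt.Println(PipelineStatus(42))
}
```

Storing a single string plus small offsets keeps the generated code compact, and the guard function turns a renumbered constant into a build error rather than a silently wrong name.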

@lc525 lc525 left a comment

lgtm; one suggestion regarding the state keeping for "currentRetries" per model, and a minor nit
regarding naming. Thank you for implementing this, it closes a loophole in Core's handling of
failures.

@domsolutions domsolutions merged commit 21cbc5a into v2 Nov 27, 2025
7 of 8 checks passed
@domsolutions domsolutions deleted the INFRA-1652/forever-failed-pipelines branch November 27, 2025 14:05