fix(scheduler/model-gw): failed pipelines never retried #6917
Conversation
lc525 left a comment
Overall looks good, I've made some initial comments on the PR -- let's discuss and clarify.
```go
_ = x[PipelineFailedTerminating-9]
}

const _PipelineStatus_name = "PipelineStatusUnknownPipelineCreatePipelineCreatingPipelineReadyPipelineFailedPipelineTerminatePipelineTerminatingPipelineTerminatedPipelineRebalancingPipelineFailedTerminating"
```
very cool the way stringer designs the implementation, I didn't know there was a tool to automate this
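For reference, here is a minimal sketch of how such an enum is usually wired up with stringer (golang.org/x/tools/cmd/stringer). The package and file names are assumptions; the value order is read off the `_PipelineStatus_name` constant in the generated snippet above:

```go
// pipelinestatus.go (hypothetical file and package names)
package pipeline

// Running `go generate` with stringer installed produces
// pipelinestatus_string.go, which contains the _PipelineStatus_name
// constant, the index table, and the String() method seen in the diff.
//
//go:generate stringer -type=PipelineStatus
type PipelineStatus int

const (
	PipelineStatusUnknown PipelineStatus = iota
	PipelineCreate
	PipelineCreating
	PipelineReady
	PipelineFailed
	PipelineTerminate
	PipelineTerminating
	PipelineTerminated
	PipelineRebalancing
	PipelineFailedTerminating // index 9, matching the x[PipelineFailedTerminating-9] check
)
```

The `_ = x[...]` lines are stringer's compile-time guard: if an enum value is reordered or removed without re-running `go generate`, the generated file stops compiling.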
lc525 left a comment
lgtm; one suggestion regarding the state-keeping for `currentRetries` per model, and a minor nit regarding naming. Thank you for implementing this, it closes a loophole in Core's handling of failures.
Motivation
There is an issue where, if the scheduler issues create-pipeline commands to `dataflow-engine`, `model-gw`, and `pipeline-gw` and they fail to create the pipeline (e.g. due to Kafka connectivity issues), the pipeline will remain in a not-Ready state and the scheduler will not try to rectify this. The same was true for terminating pipelines.

Summary of changes
- Changed the `PipelineFailed` state to only represent pipelines which failed to create. This does mean that if any pipelines were previously in this state due to failing to terminate, the scheduler will now try to create them.
- Added `PipelineFailedTerminating` for pipelines which failed to terminate.
- Added `RETRY_CREATING_FAILED_PIPELINES_TICK`, defaulting to `1 minute`, used by 3 goroutines which poll for pipelines that failed to create and re-issue commands to the required services. There is a goroutine for each service: `dataflow-engine`, `model-gw`, `pipeline-gw` (a sketch of the polling loop follows this list).
- Added `RETRY_DELETING_FAILED_PIPELINES_TICK`, defaulting to `1 minute`, used by 3 goroutines which poll for pipelines that failed to terminate and re-issue commands to the required services. Again there is a goroutine for each service: `dataflow-engine`, `model-gw`, `pipeline-gw`.
- Fixed a bug in `model-gw` where, if loading a pipeline failed because topics could not be created in Kafka, the second attempt would report success even though it still couldn't connect to Kafka. This was due to the model not being removed from the loaded-models map.
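Sketched below is roughly what one of these retry loops could look like. The type and function names (`pipelineStore`, `commandSender`, `retryCreatingFailedPipelines`) and the string status value are illustrative assumptions, not the scheduler's actual code:

```go
// Hypothetical sketch of a ticker-driven retry goroutine; the scheduler
// starts one of these per downstream service (dataflow-engine, model-gw,
// pipeline-gw) for creates, with a mirror loop for failed terminations.
package scheduler

import (
	"context"
	"log"
	"time"
)

// pipelineStore is an assumed view over the scheduler's pipeline state.
type pipelineStore interface {
	// PipelinesWithStatus returns the names of pipelines currently in the
	// given state, e.g. "PipelineFailed" or "PipelineFailedTerminating".
	PipelinesWithStatus(status string) []string
}

// commandSender is an assumed handle to one downstream service.
type commandSender interface {
	SendCreatePipeline(ctx context.Context, pipeline string) error
}

// retryCreatingFailedPipelines re-issues create commands for pipelines stuck
// in the failed state, on every tick of RETRY_CREATING_FAILED_PIPELINES_TICK
// (default 1 minute). A failed retry leaves the pipeline in the failed state,
// so it is simply picked up again on the next tick.
func retryCreatingFailedPipelines(ctx context.Context, tick time.Duration, store pipelineStore, sender commandSender) {
	ticker := time.NewTicker(tick)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			for _, p := range store.PipelinesWithStatus("PipelineFailed") {
				if err := sender.SendCreatePipeline(ctx, p); err != nil {
					log.Printf("retrying create for pipeline %s failed: %v", p, err)
				}
			}
		}
	}
}
```

Running one loop per service means a service that stays unreachable does not hold up retries to the others.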
The issues were actually only noticed with `model-gw` and `dataflow-engine`; `pipeline-gw` was added for completeness. `pipeline-gw` responds with success even if it does not have connectivity to Kafka. This should potentially be changed in the future, as it reports ready when it is not able to send requests to Kafka.

It was also noticed that, once pipelines were successfully created, if the Kafka brokers were all brought down the pipeline would still report as ready. In future we could look at adding an additional pipeline state of `PipelineRuntimeError`, with each service reporting its health.

How to test
In Kind, change `replicas: 0` on the `kafkanodepool` and then delete the `StrimziPodSet`. This will cause the brokers to permanently terminate. Then restart `scheduler`, `dataflow-engine`, `model-gw`, and `pipeline-gw` and wait for the pipeline to become not ready. Set `replicas: 1` on the `kafkanodepool` and the brokers will come back up; the pipeline should eventually become ready.

Checklist
Testing