-
Notifications
You must be signed in to change notification settings - Fork 1
Add checkpointer to cog-runtime #229
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
internal/runner/manager.go
Outdated
// Derive the runtime context from the manager's context | ||
runtimeContext, runtimeCancel := context.WithCancel(ctx) | ||
|
||
cmd, callback, err := cp.Restore(runtimeContext) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Call this postRunnerStart or something
internal/runner/manager.go
Outdated
|
||
m.runners[0] = runner | ||
m.monitoringWG.Go(func() { | ||
m.monitorRunnerSubprocess(m.ctx, DefaultRunnerName, runner) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move this after the goto
internal/service/service.go
Outdated
if s.cfg.SignalMode { | ||
// This runs an infinite loop for handling signals, so we explicitly | ||
// do not want to put it in a wait group of any kind | ||
go s.handler.HandleSignals() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a context for cancelling
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We went through this synchronously 🎉🌮🎉
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
overall LGTM, OOC have you been testing with any specific cogs?
Summary
We want to add checkpointing to cog-runtime, so we can checkpoint and restore models after they completed setup. As such, this PR introduces the
Checkpointer
object, that exposes the ability to checkpoint and restore the model.To enable this, we also want to restore some of the ability for
coglet
to use signals to communicate with the parent process over signals rather than over webhooks, as switched to in this PR.