Feature/cron scheduling rayjob 2426 #3836
Conversation
Signed-off-by: Kenny Han <[email protected]>
Have some initial comments and questions. Please add unit tests as well.
A few more comments, PTAL.
LGTM
Cleaner logic for creating the cluster when scheduled, and cleaned up the tests.
How does it work with features like Kueue?
I plan to revert this PR. You can refer to #3908 (comment) for the reasons.
This reverts commit f6b4f17.
Why are these changes needed?
These changes add cron-based scheduling for RayJob resources. Currently, RayJobs run immediately or on demand. Implementing cron scheduling enables users to define recurring RayJobs, which is crucial for automated tasks, periodic data processing, and other time-sensitive workflows. This feature significantly enhances RayJob's utility by providing a built-in scheduling capability.
To use
In your RayJob manifest, add the `schedule` field under the spec with a cron-style string, for example `schedule: "30 6-16/4 * * 1-5"`.
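A minimal manifest sketch is below. Only the `schedule` field is new in this PR; the surrounding fields are standard RayJob spec fields, and the name, entrypoint, image, and cluster spec are illustrative placeholders.

```yaml
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: scheduled-rayjob
spec:
  # New in this PR: run at 06:30, 10:30, and 14:30, Monday through Friday.
  schedule: "30 6-16/4 * * 1-5"
  # Existing field: delete the cluster between runs (true) or reuse it for
  # the next scheduled run (false).
  shutdownAfterJobFinishes: true
  entrypoint: python /home/ray/samples/my_job.py
  rayClusterSpec:
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-head
              image: rayproject/ray:2.9.0
```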
Related issue number
Resolves #2426 [Feature] Support cron scheduling for RayJob
Checks
Explanation
This solution integrates cron scheduling directly into the RayJob controller, avoiding the complexity of a new CronRayJob CRD, major changes to the existing controller, or reliance on Kubernetes' native CronJob resource.
Key aspects of this approach include:
New Job Deployment Statuses:
- `JobDeploymentStatusScheduling`: When a RayJob finishes and has a `Spec.Schedule` defined, it transitions to this state. If `shutdownAfterJobFinishes` is set to `true`, this triggers cleanup of associated resources (such as clusters and jobs) in preparation for the next scheduled run; if not, the same cluster is reused for the next scheduled run.
- `JobDeploymentStatusScheduled`: After resource cleanup, the job moves to this state. The controller then waits for the appropriate time based on the cron schedule.
- `JobStatusScheduled`: (Used for clarity) Indicates the job is in a scheduled waiting period.

Scheduling Logic:
- Upon job completion, if a `Spec.Schedule` exists, the job's deployment status becomes `JobDeploymentStatusScheduling`.
- This state initiates the same resource teardown logic used for suspending or retrying jobs, ensuring resources are freed.
- The job then transitions to `JobDeploymentStatusScheduled`.
- In `JobDeploymentStatusScheduled`, the controller checks whether the current time is within a `ScheduleBuffer` (a small buffer) of a cron tick.
- If it is time, the job's status reverts to `JobStatusNew` and `JobDeploymentStatusNew`, triggering a new execution.
- If not, reconciliation is re-queued for `NextScheduleTimeDuration`, effectively waiting until the next scheduled run (a rough sketch of this check appears after the diagram below).

`ScheduleBuffer`: This buffer accounts for potential reconciliation delays or drift, ensuring robust scheduling.

Initial Job State: RayJobs with a defined schedule now go directly to `JobDeploymentStatusScheduled` upon creation, rather than running immediately. This aligns with the expectation of a scheduled job.

Resource Management: The scheduling mechanism correctly interacts with the `shutdownAfterJobFinishes` spec, ensuring proper cluster deletion or reuse after each scheduled job run.

The diagram below illustrates the proposed state transitions for a scheduled RayJob:
