@DW-Han DW-Han commented Jun 26, 2025

Why are these changes needed?

These changes add cron-based scheduling for RayJob resources. Currently, RayJobs run immediately or on demand. Implementing cron scheduling enables users to define recurring RayJobs, which is crucial for automated tasks, periodic data processing, and other time-sensitive workflows. This feature significantly enhances RayJob's utility by providing a built-in scheduling capability.

To use

In your RayJob manifest, add a schedule field under spec with a cron-style string, for example: schedule: 30 6-16/4 * * 1-5

Related issue number

Resolves #2426 [Feature] Support cron scheduling for RayJob

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

Explanation

This solution integrates cron scheduling directly into the RayJob controller. It avoids the complexity of introducing a new CronRayJob CRD, and it avoids major changes to the existing controller to make it work with Kubernetes' native CronJob resource.

Key aspects of this approach include:

New Job Deployment Statuses:

  • JobDeploymentStatusScheduling: When a RayJob finishes and has a Spec.Schedule defined, it transitions to this state. If shutdownAfterJobFinishes is set to true, this triggers cleanup of associated resources (like clusters and jobs) in preparation for the next scheduled run; otherwise, the same cluster should be reused for the next scheduled run.
  • JobDeploymentStatusScheduled: After resource cleanup, the job moves to this state. The controller then waits for the appropriate time based on the cron schedule.
  • JobStatusScheduled: (Used for clarity) Indicates the job is in a scheduled waiting period.
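
The statuses above can be read as a small state machine. The following is a minimal sketch of that reading, not the controller code from this PR: the type name, the string values of the constants, and the helper functions initialStatus and nextStatus are all hypothetical and exist only for illustration. It models only a job that finishes as Complete.

```go
package main

import "fmt"

// JobDeploymentStatus stands in for the RayJob deployment status type.
// The string values below are assumptions made for this sketch.
type JobDeploymentStatus string

const (
	StatusNew        JobDeploymentStatus = "New"
	StatusComplete   JobDeploymentStatus = "Complete"
	StatusScheduling JobDeploymentStatus = "Scheduling"
	StatusScheduled  JobDeploymentStatus = "Scheduled"
)

// initialStatus (hypothetical) returns the status a newly created job starts in:
// a job with a schedule waits in Scheduled instead of running immediately.
func initialStatus(schedule string) JobDeploymentStatus {
	if schedule != "" {
		return StatusScheduled
	}
	return StatusNew
}

// nextStatus (hypothetical) returns the status that follows current.
// tickDue reports whether a cron tick has been reached.
func nextStatus(current JobDeploymentStatus, schedule string, tickDue bool) JobDeploymentStatus {
	switch current {
	case StatusComplete:
		// A finished job with a schedule prepares for its next run.
		if schedule != "" {
			return StatusScheduling
		}
	case StatusScheduling:
		// Cleanup is assumed to be done; wait for the next tick.
		return StatusScheduled
	case StatusScheduled:
		// Start a fresh execution once the tick arrives.
		if tickDue {
			return StatusNew
		}
	}
	return current
}

func main() {
	schedule := "30 6-16/4 * * 1-5"
	s := initialStatus(schedule)
	fmt.Println("created:", s)
	s = nextStatus(s, schedule, true)
	fmt.Println("tick reached:", s)
	s = nextStatus(StatusComplete, schedule, false)
	fmt.Println("after completion:", s)
	s = nextStatus(s, schedule, false)
	fmt.Println("after cleanup:", s)
}
```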

Scheduling Logic:

  • Upon job completion, if a Spec.Schedule exists, the job's deployment status becomes JobDeploymentStatusScheduling.

  • This state initiates the same resource teardown logic used for suspending or retrying jobs, ensuring resources are freed.

  • The job then transitions to JobDeploymentStatusScheduled.

  • In JobDeploymentStatusScheduled, the controller checks if the current time is within a ScheduleBuffer (a small buffer) of a cron tick.

  • If it's time, the job's status reverts to JobStatusNew and JobDeploymentStatusNew, triggering a new execution.

  • If not, reconciliation is re-queued for NextScheduleTimeDuration, effectively waiting until the next scheduled run.

  • ScheduleBuffer: This buffer accounts for potential reconciliation delays or drifts, ensuring robust scheduling.

  • Initial Job State: RayJobs with a defined schedule now go directly to JobDeploymentStatusScheduled upon creation, rather than running immediately. This aligns with the expectation of a scheduled job.

Resource Management: The scheduling mechanism correctly interacts with the shutdownAfterJobFinishes spec, ensuring proper cluster deletion or reuse after each scheduled job run.
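
As a rough illustration of that interaction, the sketch below maps the shutdownAfterJobFinishes setting to what happens to the cluster between scheduled runs. The ClusterAction type and the clusterActionAfterRun function are hypothetical names for this example only and do not come from the PR.

```go
package main

import "fmt"

// ClusterAction is a hypothetical type describing what happens to the
// Ray cluster between two scheduled runs.
type ClusterAction string

const (
	DeleteCluster ClusterAction = "delete"
	ReuseCluster  ClusterAction = "reuse"
)

// clusterActionAfterRun (hypothetical) mirrors the behavior described above:
// delete the cluster when shutdownAfterJobFinishes is true, otherwise keep it
// for the next scheduled run.
func clusterActionAfterRun(shutdownAfterJobFinishes bool) ClusterAction {
	if shutdownAfterJobFinishes {
		return DeleteCluster
	}
	return ReuseCluster
}

func main() {
	for _, shutdown := range []bool{true, false} {
		fmt.Printf("shutdownAfterJobFinishes=%t -> %s cluster\n",
			shutdown, clusterActionAfterRun(shutdown))
	}
}
```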

The diagram below illustrates the proposed state transitions for a scheduled RayJob:
[Image: Untitled Diagram drawio (1)]

@DW-Han DW-Han marked this pull request as ready for review June 26, 2025 09:08
@DW-Han DW-Han marked this pull request as draft June 26, 2025 09:10
@DW-Han DW-Han marked this pull request as ready for review June 26, 2025 20:15
@DW-Han DW-Han marked this pull request as draft June 26, 2025 20:19
@andrewsykim andrewsykim self-requested a review June 26, 2025 20:42
@andrewsykim andrewsykim self-assigned this Jun 26, 2025
@ryanaoleary ryanaoleary self-requested a review June 26, 2025 22:23
@chiayi chiayi left a comment

I have some initial comments and questions. Please add unit tests as well.

@chiayi chiayi left a comment

Few more comments, PTAL

@chiayi chiayi left a comment

LGTM

@DW-Han DW-Han marked this pull request as ready for review July 14, 2025 20:55
DW-Han commented Jul 24, 2025

From manual testing, I saw that the jobs run at exactly the expected time with an every-minute schedule (a new run is not scheduled while the previous job is still running, so in that case runs do not happen every minute). I also tested with schedules of every 5 and every 10 minutes. I also observed the correct behavior for deleting or keeping the Ray cluster with ShutdownAfterJobFinishes.

I used this sample config

kubectl apply -f config/samples/ray-job.schedule.yaml

I observed the running clusters and rayjobs with

watch kubectl get pods

And I tracked the transition between states with

kubectl get rayjobs rayjob-schedule -n default -w

If you look at the start time column (the third column from the right), you can see that it is always on the minute mark, as expected.

image

@DW-Han DW-Han left a comment

cleaner logic for creating cluster when scheduled and cleaning tests

@andrewsykim andrewsykim merged commit f6b4f17 into ray-project:master Jul 30, 2025
25 checks passed
@kevin85421
Member

How does it work with features like Kueue?

kevin85421 added a commit to kevin85421/kuberay that referenced this pull request Jul 30, 2025
@kevin85421
Member

I plan to revert this PR. You can refer to #3908 (comment) for the reasons.

DW-Han added a commit to DW-Han/kuberay that referenced this pull request Jul 30, 2025
DW-Han added a commit to DW-Han/kuberay that referenced this pull request Jul 30, 2025
DW-Han added a commit to DW-Han/kuberay that referenced this pull request Jul 30, 2025
kevin85421 pushed a commit that referenced this pull request Jul 31, 2025