This repository contains everything needed to launch a multi-node Ray cluster on an HPC system using SLURM and Apptainer (formerly Singularity).
It runs:
- A Ray head node (CPU-only)
- One or more Ray worker nodes (with GPU support)
- A Python job script using Ray for distributed processing
- All via a single SLURM batch script
| File | Description |
|---|---|
| `ray_cluster_all_in_one.sh` | SLURM batch script that launches the head, the workers, and the job |
| `my_ray_script.py` | Example Ray script with remote GPU tasks |
| `ray_container.def` | Apptainer definition file used to build the container |
- Ray head is started on the first SLURM node (no GPU requested).
- Remaining nodes launch Ray workers with GPU support.
- A Python script runs on the head node using Ray to distribute work.
- All processes are launched with `srun` and managed by SLURM (see the sketch below).
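The launch sequence inside `ray_cluster_all_in_one.sh` follows the pattern below. This is a simplified sketch rather than the exact script: variable names, the port, and the flags are illustrative and may differ from the repository version.

```bash
# Simplified sketch of the launch pattern (see ray_cluster_all_in_one.sh for the real commands)
nodes=($(scontrol show hostnames "$SLURM_JOB_NODELIST"))
head_node=${nodes[0]}

# 1. Ray head on the first node, CPU-only
srun --nodes=1 --ntasks=1 -w "$head_node" \
    apptainer exec "$CONTAINER_PATH" \
    ray start --head --port=6379 --num-gpus=0 --block &

# 2. GPU workers on every remaining node (--nv exposes the host GPUs to the container)
for node in "${nodes[@]:1}"; do
    srun --nodes=1 --ntasks=1 -w "$node" \
        apptainer exec --nv "$CONTAINER_PATH" \
        ray start --address="${head_node}:6379" --block &
done
```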
- A SLURM-managed HPC environment
- Apptainer installed
- Apptainer container image built from the provided definition file
```bash
apptainer build ray_container.sif ray_container.def
```

Update the path in `ray_cluster_all_in_one.sh`:

```bash
CONTAINER_PATH=/full/path/to/ray_container.sif
```

Submit the full cluster workload with:

```bash
sbatch ray_cluster_all_in_one.sh
```

It will:
- Start Ray head on node 1
- Start workers on remaining nodes
- Run `my_ray_script.py` using Ray
- Shut down all Ray processes
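Once submitted, the job can be monitored with standard SLURM commands. The output file name below assumes SLURM's default `slurm-<jobid>.out` naming; adjust it if the batch script redirects output elsewhere.

```bash
squeue -u "$USER"          # confirm the job is running and see which nodes it received
tail -f slurm-<jobid>.out  # follow the combined head, worker, and job output
```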
To access the Ray dashboard:
```bash
ssh -L 8265:localhost:8265 user@cluster
```

Then open: http://localhost:8265
This script connects to the running Ray cluster and distributes 20 square operations across nodes:
```python
@ray.remote(num_gpus=0.25)
def square(x):
    return x * x
```

You can replace this with any workload using Ray's APIs.
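Inside the batch script, the job step that runs this script on the head node typically looks something like the sketch below; the exact invocation in `ray_cluster_all_in_one.sh` may differ.

```bash
# Illustrative only: run the driver script on the head node, inside the container
srun --nodes=1 --ntasks=1 -w "$head_node" \
    apptainer exec "$CONTAINER_PATH" \
    python my_ray_script.py
```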
| Want to change... | Do this |
|---|---|
| Number of nodes | Edit `#SBATCH --nodes=X` |
| GPUs per worker | Edit `--gres=gpu:X` and `--num-gpus=X` |
| CPUs per task | Edit `--cpus-per-task=X` and `--num-cpus=X` |
| Job duration | Adjust `#SBATCH --time=...` |
| Container content | Edit `ray_container.def` |
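For reference, these knobs map to directives near the top of the batch script. The values below are illustrative, not the repository defaults:

```bash
#SBATCH --nodes=3              # 1 CPU-only head node + 2 GPU worker nodes
#SBATCH --gres=gpu:1           # GPUs requested per node (used by the workers)
#SBATCH --cpus-per-task=8      # CPUs each Ray node may use (match ray start --num-cpus)
#SBATCH --time=01:00:00        # wall-clock limit for the whole cluster
```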
This setup assumes:
- A shared `/home` or `/scratch` filesystem across nodes
- No passwordless SSH (SLURM handles all `srun` launches)
- GPU-enabled worker nodes (e.g., V100, A100)
MIT or your institution's default open-source license.
File an issue or contact your HPC support team if you need help with SLURM or Apptainer permissions.