# slurmq

GPU quota management for Slurm clusters.

```console
$ slurmq check
╭──────────────────── GPU Quota Report ────────────────────╮
│                                                          │
│ User: dedalus                                            │
│ QoS: medium                                              │
│ Cluster: Stella HPC                                      │
│                                                          │
│ ████████████████████░░░░░░░░░░ 68.5%                     │
│                                                          │
│ Used: 342.5 GPU-hours                                    │
│ Remaining: 157.5 GPU-hours                               │
│ Quota: 500 GPU-hours (rolling 30 days)                   │
│                                                          │
╰──────────────────────────────────────────────────────────╯
```

## Installation

```console
uv tool install slurmq
```

## Configuration

```console
slurmq config init        # interactive wizard
slurmq config show        # verify settings
slurmq config validate    # check syntax before deploy
```

Config resolution order:
1. `SLURMQ_CONFIG` env var
2. `~/.config/slurmq/config.toml` (user)
3. `/etc/slurmq/config.toml` (system-wide)
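A minimal sketch of that lookup, assuming first match wins (the function name is illustrative, not slurmq's actual API):

```python
import os
from pathlib import Path

def resolve_config_path() -> Path | None:
    """First existing config file wins; illustrative sketch only."""
    candidates = [
        os.environ.get("SLURMQ_CONFIG"),             # 1. explicit override
        Path.home() / ".config/slurmq/config.toml",  # 2. per-user config
        Path("/etc/slurmq/config.toml"),             # 3. system-wide config
    ]
    for candidate in candidates:
        if candidate and Path(candidate).is_file():
            return Path(candidate)
    return None
```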
Example `config.toml`:

```toml
default_cluster = "stella"

[clusters.stella]
name = "Stella HPC"
account = "research"
qos = ["low", "medium"]
quota_limit = 500 # GPU-hours
rolling_window_days = 30
```

## Usage

### Check quota

```console
slurmq check                    # current user
slurmq check --user alice       # specific user
slurmq check --cluster other    # different cluster
slurmq check --forecast         # usage projection
slurmq --json check             # machine-readable
slurmq --quiet check            # silent on success (for scripts)
```
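GPU-hours are GPUs allocated × elapsed hours, summed over jobs inside the rolling window. A sketch of that arithmetic on top of `sacct` (an illustration of the math, not slurmq's actual implementation; it assumes the untyped `gres/gpu=` entry in `AllocTRES`):

```python
import subprocess
from datetime import datetime, timedelta

def gpu_hours(user: str, window_days: int = 30) -> float:
    """Sum GPU-hours for one user over a rolling window (sketch only)."""
    since = (datetime.now() - timedelta(days=window_days)).strftime("%Y-%m-%d")
    out = subprocess.run(
        ["sacct", "-u", user, "-S", since, "-X", "-n", "-P",
         "--format=AllocTRES,ElapsedRaw"],
        capture_output=True, text=True, check=True,
    ).stdout
    total = 0.0
    for line in out.splitlines():
        tres, _, elapsed = line.partition("|")
        # AllocTRES looks like: billing=8,cpu=16,gres/gpu=4,mem=64G,node=1
        for item in tres.split(","):
            if item.startswith("gres/gpu="):
                total += int(item.split("=", 1)[1]) * int(elapsed or 0) / 3600
    return total
```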
### Job efficiency

Analyze job resource efficiency (like `seff`):

```console
slurmq efficiency 12345
```

Flags low efficiency: CPU < 30%, memory < 20%.
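The flagging rule is a plain threshold check on those two percentages; a sketch (thresholds from above, names illustrative):

```python
CPU_EFF_MIN = 30.0   # percent; threshold documented above
MEM_EFF_MIN = 20.0   # percent

def low_efficiency_flags(cpu_pct: float, mem_pct: float) -> list[str]:
    """Return warnings for a job's CPU/memory efficiency (sketch only)."""
    flags = []
    if cpu_pct < CPU_EFF_MIN:
        flags.append(f"low CPU efficiency: {cpu_pct:.1f}% < {CPU_EFF_MIN}%")
    if mem_pct < MEM_EFF_MIN:
        flags.append(f"low memory efficiency: {mem_pct:.1f}% < {MEM_EFF_MIN}%")
    return flags
```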
### Reports

Generate usage reports (admin):

```console
slurmq report                   # table view
slurmq report --format csv -o out.csv
```

### Monitoring

Real-time monitoring with optional enforcement (admin):
```console
slurmq monitor                  # live dashboard, 30s refresh
slurmq monitor --interval 10
slurmq monitor --once           # single check, for cron
slurmq monitor --enforce        # cancel jobs over quota
```

### Stats

Cluster-wide analytics with month-over-month comparison:
```console
slurmq stats                         # GPU utilization + wait times
slurmq stats --days 14               # custom period
slurmq stats --no-compare            # skip MoM comparison
slurmq stats -p gpu -p gpu-large     # specific partitions
slurmq stats --small-threshold 25    # custom job size threshold
slurmq --json stats                  # machine-readable
```

Shows:
- GPU utilization by partition/QoS
- Wait time analysis (median, % jobs waiting > 6h; see the sketch below)
- Small vs large job breakdown
- Month-over-month trends
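The wait-time numbers reduce to per-job queue waits (start minus submit); a sketch of the two reported statistics (illustrative, not slurmq's code):

```python
from statistics import median

def wait_time_summary(waits_h: list[float], threshold_h: float = 6.0) -> dict:
    """Median queue wait and share of jobs over the threshold (sketch).
    waits_h: per-job (Start - Submit) in hours, e.g. taken from sacct."""
    if not waits_h:
        return {}
    over = sum(w > threshold_h for w in waits_h)
    return {
        "median_wait_h": median(waits_h),
        "pct_over_threshold": 100 * over / len(waits_h),
    }
```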
## Enforcement

Cancel jobs automatically when users exceed quota:

```toml
[enforcement]
enabled = true
dry_run = true # preview mode
grace_period_hours = 24 # warn before cancel
exempt_users = ["admin"]
exempt_job_prefixes = ["checkpoint_"]
```

Run with `slurmq monitor --enforce`. Disable `dry_run` when ready.
Grace period: users exceeding quota get a warning window before jobs are cancelled.
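A sketch of the decision these settings imply for a single running job (key names match the TOML above; the function itself is illustrative, not slurmq's internal API):

```python
from datetime import datetime, timedelta

def should_cancel(job_name: str, user: str,
                  over_quota_since: datetime | None, cfg: dict) -> bool:
    """Apply [enforcement] settings to one job (illustrative sketch)."""
    if not cfg["enabled"] or user in cfg["exempt_users"]:
        return False
    if any(job_name.startswith(p) for p in cfg["exempt_job_prefixes"]):
        return False
    if over_quota_since is None:       # user is within quota
        return False
    grace = timedelta(hours=cfg["grace_period_hours"])
    if datetime.now() - over_quota_since < grace:
        return False                   # still in the warning window
    return not cfg["dry_run"]          # dry_run previews, never cancels
```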
## Job states

Problematic states are highlighted:
| State | Meaning       |
|-------|---------------|
| `OOM` | Out of Memory |
| `TO`  | Timeout       |
| `NF`  | Node Failure  |
| `F`   | Failed        |
| `PR`  | Preempted     |
## Scripting

```bash
# check quota status
if slurmq --json check | jq -e '.status == "exceeded"' > /dev/null; then
    echo "Quota exceeded"
fi
```
```cron
# cron: enforce every 5 minutes (quiet mode)
*/5 * * * * slurmq --quiet monitor --once --enforce >> /var/log/slurmq.log 2>&1
```

## Documentation

Online: [dedalus-labs.github.io/slurmq](https://dedalus-labs.github.io/slurmq)
For LLMs: llms.txt | llms-full.txt
Locally:

```bash
uv sync --extra docs
uv run mkdocs serve
```

## Development

```bash
git clone https://github.com/dedalus-labs/slurmq.git && cd slurmq
uv sync --all-extras
uv run pytest
uv run ruff check
uv run ty check
```

## License

MIT