feat: add support for vllm cache initialization in Dynamo Planner #3115
Conversation
Signed-off-by: Michael Shin <[email protected]>
3dfd2eb to 0d0b895 (Compare)
Walkthrough
Adds a vLLM cache-initialization mode to the planner with new CLI flags and control flow in the planner runtime. Introduces a deployment manifest for cache-init operation and a PVC manifest for shared vLLM cache storage. The planner orchestrates initial minimal replicas, waits for readiness, then scales to target replicas and updates metrics.
Sequence Diagram(s)
sequenceDiagram
autonumber
participant User as Operator
participant Planner as Planner
participant Kube as Kubernetes API
participant Decode as VllmDecodeWorker
participant Prefill as VllmPrefillWorker
participant Prom as Prometheus
Note over Planner: Start with --vllm-cache-initialization-mode
Planner->>Kube: Scale Prefill=0, Decode=1 (initial)
Kube-->>Decode: Ensure 1 replica running
loop Poll readiness
Planner->>Kube: is_deployment_ready(decode)
Kube-->>Planner: ready? (true/false)
end
alt Ready
Planner->>Kube: Scale to post-init replicas<br/>(Prefill=Np, Decode=Nd)
Kube-->>Prefill: Reconcile to Np
Kube-->>Decode: Reconcile to Nd
Planner->>Prom: Update metrics (cache_initialized=true)
Note over Planner: vllm_cache_initialized = True
else Not ready
Note over Planner: Skip normal adjustments until ready
end
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
@coderabbitai help
@coderabbitai full review
✅ Actions performed: Full review triggered.
Actionable comments posted: 4
🧹 Nitpick comments (6)
deploy/utils/manifests/vllm-cache-pvc.yaml (1)
9-14
: Optional: add volumeMode and label for clarity/ops hygiene.
Not required, but adding volumeMode and labels eases debugging and storage class behavior introspection.
 spec:
   accessModes:
   - ReadWriteMany
   storageClassName: ${STORAGE_CLASS_NAME}
+  volumeMode: Filesystem
+  selector: {}
+---
+metadata:
+  labels:
+    app: vllm-cache
components/planner/src/dynamo/planner/utils/planner_argparse.py (1)
127-144
: Clarify help text and scope flags to vLLM/Kubernetes.
- Help says “start with 1 replica” but the logic uses 0 prefill, 1 decode.
- These flags only make sense for backend=vllm on Kubernetes; surface that in help.
- help="Enable vLLM cache initialization mode - start with 1 replica to initialize vLLM cache, then scale up", + help="Enable vLLM cache initialization: start with 0 prefill and 1 decode to warm vLLM cache, then scale up (backend=vllm on Kubernetes)",Optionally validate at parse time (soft warning) so users don’t pass these in other modes:
def create_sla_planner_parser() -> argparse.ArgumentParser: @@ parser.add_argument( "--post-vllm-cache-decode-replicas", type=int, default=1, help="Target number of decode worker replicas after vLLM cache initialization", ) + # Post-parse warning hook (caller can invoke) + parser.set_defaults(_validate_vllm_cache_flags=_validate_vllm_cache_flags) return parser + +def _validate_vllm_cache_flags(args: argparse.Namespace) -> None: + if args.vllm_cache_initialization_mode and (args.backend != "vllm"): + logging.warning("--vllm-cache-initialization-mode is set but backend is not vllm; flag will be ignored.") + if args.vllm_cache_initialization_mode and getattr(args, "environment", None) != "kubernetes": + logging.warning("--vllm-cache-initialization-mode is intended for Kubernetes; behavior may be undefined elsewhere.")components/planner/src/dynamo/planner/utils/planner_core.py (3)
133-149
: Initialization flags wiring looks good; gate by env to avoid surprises.
Consider gating the mode to Kubernetes explicitly to prevent Virtual/no-op runs from invoking K8s-only paths.
 self.vllm_cache_initialization_mode = getattr(
     args, "vllm_cache_initialization_mode", False
 )
+if self.vllm_cache_initialization_mode and args.environment != "kubernetes":
+    logger.warning("vLLM cache init mode is intended for Kubernetes; disabling for environment=%s", args.environment)
+    self.vllm_cache_initialization_mode = False
395-420
: Initial scale set is fine; avoid redundant writes when already at desired replicas.
Not blocking, but you could skip the write if current replicas already match 0/1 to reduce noisy updates.
428-436
: Enforce min_endpoint on post-init targets to stay consistent with global constraints.
If post-init targets are below min_endpoint, later adjustments will bump them anyway; enforce here for coherence.
-        target_replicas = {
+        target_replicas = {
             WORKER_COMPONENT_NAMES[
                 self.args.backend
-            ].prefill_worker_k8s_name: self.post_vllm_cache_prefill_replicas,
+            ].prefill_worker_k8s_name: max(self.args.min_endpoint, self.post_vllm_cache_prefill_replicas),
             WORKER_COMPONENT_NAMES[
                 self.args.backend
-            ].decode_worker_k8s_name: self.post_vllm_cache_decode_replicas,
+            ].decode_worker_k8s_name: max(self.args.min_endpoint, self.post_vllm_cache_decode_replicas),
         }
components/backends/vllm/deploy/disagg_planner_cache_init.yaml (1)
35-53
: Health probes always succeed; consider real checks.
Using exec: exit 0 makes readiness/liveness meaningless and can mask failures, especially during cache warm-up.
- Replace with an HTTP GET to a lightweight health endpoint or a file gate written post-initialization.
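As an illustration only, a minimal container-level probe sketch; the /health path, port 9090, and the timings are assumptions and are not taken from this PR — adjust them to whatever the worker actually serves:

```yaml
readinessProbe:
  httpGet:
    path: /health          # hypothetical health endpoint exposed by the worker
    port: 9090             # hypothetical port; set to the actual health server port
  initialDelaySeconds: 60  # allow time for model load / cache warm-up
  periodSeconds: 10
livenessProbe:
  httpGet:
    path: /health
    port: 9090
  initialDelaySeconds: 120
  periodSeconds: 30
```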
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- components/backends/vllm/deploy/disagg_planner_cache_init.yaml (1 hunks)
- components/planner/src/dynamo/planner/utils/planner_argparse.py (1 hunks)
- components/planner/src/dynamo/planner/utils/planner_core.py (3 hunks)
- deploy/utils/manifests/vllm-cache-pvc.yaml (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
components/planner/src/dynamo/planner/utils/planner_core.py (3)
- components/planner/src/dynamo/planner/kube.py (1): is_deployment_ready (102-115)
- components/planner/src/dynamo/planner/virtual_connector.py (1): set_component_replicas (289-316)
- components/planner/src/dynamo/planner/kubernetes_connector.py (1): set_component_replicas (71-100)
🪛 Ruff (0.12.2)
components/planner/src/dynamo/planner/utils/planner_core.py
387-387: Do not catch blind exception: Exception
(BLE001)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (2)
components/backends/vllm/deploy/disagg_planner_cache_init.yaml (2)
60-79
: Expose Prometheus metrics on port 8000 and add a Service
containerPort is 9085 while the planner is started with --prometheus-port=8000; set containerPort to 8000 (or change the arg to 9085) and create a Service selecting the planner pods that exposes port 8000 (port: 8000, targetPort: 8000) so Prometheus can scrape it.
File: components/backends/vllm/deploy/disagg_planner_cache_init.yaml (lines 60–79)
Verification couldn't be completed here (kubectl not available). Run locally to confirm: kubectl -n vllm-disagg-planner-cache-init get pods -l component=planner -o jsonpath='{.items[].spec.containers[].ports}' && kubectl -n vllm-disagg-planner-cache-init get svc
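A rough sketch of the suggested Service; the Service name and the label selector are assumptions, so verify them against the actual planner pod labels before applying:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: planner-metrics                      # hypothetical name
  namespace: vllm-disagg-planner-cache-init
spec:
  selector:
    component: planner                       # assumed planner pod label; check with kubectl get pods --show-labels
  ports:
    - name: metrics
      port: 8000
      targetPort: 8000                       # must match the planner's --prometheus-port
```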
10-21
: Verify Prometheus target service name
Unable to confirm the DNS name for the frontend service in the vllm-disagg-planner-cache-init namespace (cluster tools aren’t available here). Please run locally:
kubectl -n vllm-disagg-planner-cache-init get svc
and ensure the static target "vllm-disagg-planner-frontend-cache-init:8000" in DYNAMO_SERVICE_CONFIG matches the actual <service-name>.<namespace>.svc.cluster.local:8000. Update the value if it differs.
    replicas: 1
    resources:
      limits:
        gpu: "1"
    pvc:
      create: false
      name: vllm-cache-pvc # Must be created before deployment
      mountPoint: /root/.cache/vllm
    extraPodSpec:
🧩 Analysis chain
PVC name must match the PVC manifest.
Here you reference "vllm-cache-pvc". Make sure the PVC manifest creates the same name (see comment on vllm-cache-pvc.yaml). If you prefer the other name, change it here instead.
Also applies to: 153-157
🏁 Script executed:
#!/bin/bash
# Confirm the PVC exists in the target namespace after applying manifests
# (Replace $NAMESPACE appropriately)
kubectl -n vllm-disagg-planner-cache-init get pvc vllm-cache-pvc -o wide || true
Length of output: 45
🏁 Script executed:
#!/bin/bash
# Locate PVC manifest files named vllm-cache-pvc.yaml
if files=$(fd vllm-cache-pvc.yaml); then
:
else
files=$(find . -type f -name 'vllm-cache-pvc.yaml')
fi
echo "Found PVC manifest files:"
echo "$files"
echo
for f in $files; do
echo "----- $f -----"
sed -n '1,50p' "$f"
echo
done
Length of output: 498
Fix PVC name mismatch
Deployment references PVC "vllm-cache-pvc" but the PVC manifest deploy/utils/manifests/vllm-cache-pvc.yaml sets metadata.name: "dynamo-vllm-cache". Make the names identical: either rename the manifest to metadata.name: vllm-cache-pvc or update the deployment to use dynamo-vllm-cache (components/backends/vllm/deploy/disagg_planner_cache_init.yaml, refs at ~lines 120–128 and 153–157). Re-apply manifests and verify with: kubectl -n <namespace> get pvc <name>.
🤖 Prompt for AI Agents
In components/backends/vllm/deploy/disagg_planner_cache_init.yaml around lines
120–128 (and similarly refs near 153–157), the PVC name "vllm-cache-pvc" does
not match the PVC manifest which is named "dynamo-vllm-cache"; make the names
identical by either renaming the PVC manifest metadata.name to "vllm-cache-pvc"
or updating the deployment's pvc.name to "dynamo-vllm-cache", then re-apply the
corrected manifests and verify the PVC exists with kubectl -n <namespace> get
pvc <name>.
    async def check_vllm_cache_initialization_complete(self) -> bool:
        """Check if vLLM cache has been initialized"""
        if not self.vllm_cache_initialization_mode:
            return True
        try:
            # Assume if a decode worker is ready, the cache has been initialized
            if self.connector:
                is_initial_deployment_ready = (
                    await self.connector.kube_api.is_deployment_ready(self.namespace)
                )
                if is_initial_deployment_ready:
                    logger.info(
                        "Initial deployment is ready, vLLM cache initialization complete"
                    )
                    return True
        except Exception as e:
            logger.warning(f"Failed to check vLLM cache initialization status: {e}")
        return False
🧩 Analysis chain
Fix readiness check: using namespace instead of graph name; breaks on non-matching names and Virtual/no-op modes.
- kube_api.is_deployment_ready expects a GraphDeployment name, not a namespace.
- Will AttributeError on VirtualConnector or when no_operation=true (no connector).
- Blind exception catch violates BLE001 and can hide real failures.
Apply this fix to correctly resolve the graph name, handle envs, and narrow exceptions:
async def check_vllm_cache_initialization_complete(self) -> bool:
"""Check if vLLM cache has been initialized"""
if not self.vllm_cache_initialization_mode:
return True
- try:
- # Assume if a decode worker is ready, the cache has been initialized
- if self.connector:
- is_initial_deployment_ready = (
- await self.connector.kube_api.is_deployment_ready(self.namespace)
- )
- if is_initial_deployment_ready:
- logger.info(
- "Initial deployment is ready, vLLM cache initialization complete"
- )
- return True
- except Exception as e:
- logger.warning(f"Failed to check vLLM cache initialization status: {e}")
+ # Only applicable on Kubernetes with a KubernetesConnector
+ if not hasattr(self, "connector") or not hasattr(self.connector, "kube_api"):
+ return True # treat as complete in virtual/no-op modes
+ try:
+ deployment = await self.connector.kube_api.get_graph_deployment()
+ if deployment is None:
+ logger.info("GraphDeployment not found yet; cache init not complete")
+ return False
+ graph_name = (deployment.get("metadata") or {}).get("name")
+ if not graph_name:
+ logger.warning("GraphDeployment metadata.name missing; cache init not complete")
+ return False
+ is_ready = await self.connector.kube_api.is_deployment_ready(graph_name)
+ if is_ready:
+ logger.info("GraphDeployment %s is Ready; treating vLLM cache init as complete", graph_name)
+ return True
+ except (AttributeError, KeyError, TypeError, ValueError) as e:
+ logger.warning(f"Failed to check vLLM cache initialization status: {e}")
return False
🏁 Script executed:
#!/bin/bash
# Verify expected signature and typical usage of is_deployment_ready
rg -n -C3 -P 'def\s+is_deployment_ready\(' components/planner/src
rg -n -C3 -P 'is_deployment_ready\(' components/planner/src
Length of output: 2707
Fix readiness check: pass GraphDeployment name (not namespace) and guard virtual/no-op connectors
- kube_api.is_deployment_ready expects a GraphDeployment name (components/planner/src/dynamo/planner/kube.py:102); planner_core currently passes self.namespace and can produce wrong results.
- If connector exists but lacks kube_api (VirtualConnector) or connector is None (no-op), current code will raise AttributeError; broad except Exception hides real failures (BLE001).
- Change: resolve the graph deployment name (reuse the helper pattern used in kubernetes_connector.py before calling is_deployment_ready), guard for missing/virtual connectors (skip or treat as complete), and replace the blanket except with narrowly scoped exceptions around the kube calls.
- Locations: components/planner/src/dynamo/planner/utils/planner_core.py (~372–390); see kube.py:102 and kubernetes_connector.py:82–84 for reference usage.
🧰 Tools
🪛 Ruff (0.12.2)
387-387: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
In components/planner/src/dynamo/planner/utils/planner_core.py around lines 372
to 390, the readiness check currently passes self.namespace to
kube_api.is_deployment_ready and assumes connector.kube_api always exists, which
can produce wrong results or AttributeError; change it to resolve and pass the
GraphDeployment name (reuse the same helper/pattern used in
kubernetes_connector.py to compute the deployment name), add guards so if
connector is None or connector has no kube_api (virtual/no-op connector) the
function treats cache as initialized (return True or skip readiness), and narrow
the try/except to only wrap the actual kube_api call (catch specific exceptions
from the kube client) instead of a blanket except so real errors are not
swallowed.
# Handle vLLM cache initialization completely separately from scaling adjustments
if self.args.backend == "vllm":
    await self.handle_vllm_cache_initialization()

# If cache initialization is in progress, skip all other operations
if (
    self.vllm_cache_initialization_mode
    and not self.vllm_cache_initialized
):
    continue
Busy-wait during cache init; add a small sleep to avoid pegging the event loop.
The early continue skips the general sleep below; this will spin at 100% CPU until ready.
- if self.args.backend == "vllm":
- await self.handle_vllm_cache_initialization()
-
- # If cache initialization is in progress, skip all other operations
- if (
- self.vllm_cache_initialization_mode
- and not self.vllm_cache_initialized
- ):
- continue
+ if self.args.backend == "vllm":
+ await self.handle_vllm_cache_initialization()
+ # If cache initialization is in progress, skip all other operations but do not busy-wait
+ if self.vllm_cache_initialization_mode and not self.vllm_cache_initialized:
+ await asyncio.sleep(max(2, self.args.adjustment_interval / 10))
+ continue
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
# Handle vLLM cache initialization completely separately from scaling adjustments
if self.args.backend == "vllm":
    await self.handle_vllm_cache_initialization()
# If cache initialization is in progress, skip all other operations but do not busy-wait
if self.vllm_cache_initialization_mode and not self.vllm_cache_initialized:
    await asyncio.sleep(max(2, self.args.adjustment_interval / 10))
    continue
🤖 Prompt for AI Agents
In components/planner/src/dynamo/planner/utils/planner_core.py around lines 519
to 529, the loop currently busy-waits when vLLM cache initialization is in
progress because the early continue skips the general sleep; modify the branch
that detects "self.vllm_cache_initialization_mode and not
self.vllm_cache_initialized" to await a short non-blocking sleep (for example
await asyncio.sleep(0.1)) before continuing so the event loop is yielded and CPU
isn't pegged while waiting for initialization to complete.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: dynamo-vllm-cache
🧩 Analysis chain
PVC name mismatch with deployment manifest; pods will fail to mount.
Your worker manifests reference a PVC named "vllm-cache-pvc" while this file creates "dynamo-vllm-cache". Align the names to avoid Pending pods due to missing PVC.
Apply either fix; I recommend renaming the PVC here to match the deployment:
metadata:
- name: dynamo-vllm-cache
+ name: vllm-cache-pvc
Run to confirm all references are consistent:
🏁 Script executed:
#!/bin/bash
# Check PVC name references
rg -nP 'vllm-cache-pvc|dynamo-vllm-cache' -g '!**/node_modules/**' -S
Length of output: 148
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "PWD: $(pwd)"
echo "--- Top-level files ---"
ls -la | sed -n '1,200p' || true
echo
echo "--- rg search (no ignore, include hidden) for PVC names ---"
if command -v rg >/dev/null 2>&1; then
rg -n --hidden --no-ignore -S 'vllm-cache-pvc|dynamo-vllm-cache' || true
else
echo "rg not found; using grep fallback"
grep -RIn --exclude-dir=node_modules -e 'vllm-cache-pvc' -e 'dynamo-vllm-cache' . || true
fi
echo
echo "--- rg search for claimName lines ---"
if command -v rg >/dev/null 2>&1; then
rg -n --hidden --no-ignore -S 'claimName:\s*(vllm-cache-pvc|dynamo-vllm-cache)' || true
else
grep -RIn --exclude-dir=node_modules -E 'claimName:\s*(vllm-cache-pvc|dynamo-vllm-cache)' . || true
fi
echo
echo "--- show file if it exists: deploy/utils/manifests/vllm-cache-pvc.yaml ---"
if [ -f deploy/utils/manifests/vllm-cache-pvc.yaml ]; then
sed -n '1,240p' deploy/utils/manifests/vllm-cache-pvc.yaml || true
else
echo "deploy/utils/manifests/vllm-cache-pvc.yaml not found"
fi
echo
echo "--- list YAML files under deploy/utils/manifests (if dir exists) ---"
if [ -d deploy/utils/manifests ]; then
find deploy/utils/manifests -type f -name '*.y*ml' -print -exec sed -n '1,200p' {} \; || true
else
echo "deploy/utils/manifests directory not found"
fi
echo
echo "Done."
Length of output: 7607
PVC name mismatch — pods will fail to mount; make PVC name match deployment references
The PVC manifest creates metadata.name: dynamo-vllm-cache while init/deployment manifests reference vllm-cache-pvc. Change the name in deploy/utils/manifests/vllm-cache-pvc.yaml to match.
metadata:
- name: dynamo-vllm-cache
+ name: vllm-cache-pvc
References observed: components/backends/vllm/deploy/disagg_planner_cache_init.yaml (lines ~126, ~155) reference vllm-cache-pvc.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
  name: vllm-cache-pvc
🤖 Prompt for AI Agents
In deploy/utils/manifests/vllm-cache-pvc.yaml around line 6, the PVC is named
"dynamo-vllm-cache" but init/deployment manifests (e.g.,
components/backends/vllm/deploy/disagg_planner_cache_init.yaml lines ~126 and
~155) reference "vllm-cache-pvc"; update metadata.name in this manifest to
"vllm-cache-pvc" so the PVC name matches the deployment references and pods can
mount it.
Stupid question: if the user starts the planner with more than 1 replica for prefill/decode, will they first be scheduled, then killed, then scheduled again?
The number of replicas will only be increased, so the initial worker won't be removed; we'll scale up to the desired number.
Overview:
This PR implements vLLM cache initialization mode for the SLA Planner, addressing performance bottlenecks during initial deployment startup. The feature enables controlled cache warming to reduce cold start latencies and improve initial request handling in disaggregated vLLM deployments.
I used nvidia/Llama-3.1-8B-Instruct-FP8 to test the performance difference. Without caching, torch.compile takes ~45 seconds; with caching, loading the cached results takes ~8 seconds. Note that we can extend this functionality to Dynamo in general and not just the Planner, but that can be part of a future PR.
Details:
Overall sequence diagram:

vLLM Cache Initialization Mode: Added a new planner mode that orchestrates cache warming by starting with a minimal deployment (0 prefill, 1 decode worker), waiting for that initial worker to become ready (at which point the shared vLLM cache has been populated), and then scaling to the configured post-initialization replica counts.
Additional Planner CLI Arguments (planner_argparse.py):
- --vllm-cache-initialization-mode: Enable the cache initialization strategy
- --post-vllm-cache-prefill-replicas: Target prefill worker count after initialization
- --post-vllm-cache-decode-replicas: Target decode worker count after initialization
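For context, a sketch of how these flags might be passed to the planner service in a deployment manifest; the surrounding structure (extraPodSpec/mainContainer) and the other args are illustrative assumptions, only the three new flags come from this PR:

```yaml
# Hypothetical planner service excerpt; the field layout is assumed, not copied from the PR.
Planner:
  extraPodSpec:
    mainContainer:
      args:
        - --environment=kubernetes
        - --backend=vllm
        - --vllm-cache-initialization-mode        # start minimal (0 prefill / 1 decode) to warm the cache
        - --post-vllm-cache-prefill-replicas=2    # prefill replicas once the cache is initialized
        - --post-vllm-cache-decode-replicas=2     # decode replicas once the cache is initialized
```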
Planner Core Logic (planner_core.py):
- handle_vllm_cache_initialization(): Manages the cache initialization workflow
- check_vllm_cache_initialization_complete(): Monitors deployment readiness
Deployment Configurations:
- vllm-cache-pvc.yaml: Dedicated PVC for vLLM cache storage (400Gi, ReadWriteMany). This is required to share the cache between different workers; a consolidated sketch of the manifest follows this list.
- disagg_planner_cache_init.yaml: Example deployment with cache initialization enabled.
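Pulling together the fields visible in this PR, the PVC looks roughly like the sketch below; the 400Gi request comes from the description above, accessModes and storageClassName from the review diff, and the name is set to match what the deployment references (see the naming discussion in the review):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vllm-cache-pvc              # must match pvc.name in the worker deployment
spec:
  accessModes:
    - ReadWriteMany                 # cache is shared between prefill and decode workers
  storageClassName: ${STORAGE_CLASS_NAME}
  resources:
    requests:
      storage: 400Gi                # sized for the shared vLLM cache
```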
Where should the reviewer start?
- components/planner/src/dynamo/planner/utils/planner_core.py - Core cache initialization logic (lines 133-425)
- components/planner/src/dynamo/planner/utils/planner_argparse.py - New CLI arguments (lines 127-144)
- deploy/utils/manifests/vllm-cache-pvc.yaml - PVC configuration for cache storage
- components/backends/vllm/deploy/disagg_planner_cache_init.yaml - Example deployment configuration. I can move this to a different location since this is an example.