feat: add support for vllm cache initialization in Dynamo Planner #3115
Conversation
Signed-off-by: Michael Shin <[email protected]>
3dfd2eb to 0d0b895 (Compare)
Walkthrough
Adds a vLLM cache-initialization mode to the planner with new CLI flags and control flow in the planner runtime. Introduces a deployment manifest for cache-init operation and a PVC manifest for shared vLLM cache storage. The planner orchestrates initial minimal replicas, waits for readiness, then scales to target replicas and updates metrics.
Sequence Diagram(s)
sequenceDiagram
autonumber
participant User as Operator
participant Planner as Planner
participant Kube as Kubernetes API
participant Decode as VllmDecodeWorker
participant Prefill as VllmPrefillWorker
participant Prom as Prometheus
Note over Planner: Start with --vllm-cache-initialization-mode
Planner->>Kube: Scale Prefill=0, Decode=1 (initial)
Kube-->>Decode: Ensure 1 replica running
loop Poll readiness
Planner->>Kube: is_deployment_ready(decode)
Kube-->>Planner: ready? (true/false)
end
alt Ready
Planner->>Kube: Scale to post-init replicas<br/>(Prefill=Np, Decode=Nd)
Kube-->>Prefill: Reconcile to Np
Kube-->>Decode: Reconcile to Nd
Planner->>Prom: Update metrics (cache_initialized=true)
Note over Planner: vllm_cache_initialized = True
else Not ready
Note over Planner: Skip normal adjustments until ready
end
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
@coderabbitai help
@coderabbitai full review
✅ Actions performed: Full review triggered.
Actionable comments posted: 4
🧹 Nitpick comments (6)
deploy/utils/manifests/vllm-cache-pvc.yaml (1)
9-14
: Optional: add volumeMode and label for clarity/ops hygiene.
Not required, but adding volumeMode and labels eases debugging and storage class behavior introspection.
 spec:
   accessModes:
   - ReadWriteMany
   storageClassName: ${STORAGE_CLASS_NAME}
+  volumeMode: Filesystem
+  selector: {}
+---
+metadata:
+  labels:
+    app: vllm-cache
components/planner/src/dynamo/planner/utils/planner_argparse.py (1)
127-144
: Clarify help text and scope flags to vLLM/Kubernetes.
- Help says “start with 1 replica” but the logic uses 0 prefill, 1 decode.
- These flags only make sense for backend=vllm on Kubernetes; surface that in help.
- help="Enable vLLM cache initialization mode - start with 1 replica to initialize vLLM cache, then scale up", + help="Enable vLLM cache initialization: start with 0 prefill and 1 decode to warm vLLM cache, then scale up (backend=vllm on Kubernetes)",Optionally validate at parse time (soft warning) so users don’t pass these in other modes:
def create_sla_planner_parser() -> argparse.ArgumentParser: @@ parser.add_argument( "--post-vllm-cache-decode-replicas", type=int, default=1, help="Target number of decode worker replicas after vLLM cache initialization", ) + # Post-parse warning hook (caller can invoke) + parser.set_defaults(_validate_vllm_cache_flags=_validate_vllm_cache_flags) return parser + +def _validate_vllm_cache_flags(args: argparse.Namespace) -> None: + if args.vllm_cache_initialization_mode and (args.backend != "vllm"): + logging.warning("--vllm-cache-initialization-mode is set but backend is not vllm; flag will be ignored.") + if args.vllm_cache_initialization_mode and getattr(args, "environment", None) != "kubernetes": + logging.warning("--vllm-cache-initialization-mode is intended for Kubernetes; behavior may be undefined elsewhere.")components/planner/src/dynamo/planner/utils/planner_core.py (3)
133-149
: Initialization flags wiring looks good; gate by env to avoid surprises.
Consider gating the mode to Kubernetes explicitly to prevent Virtual/no-op runs from invoking K8s-only paths.
 self.vllm_cache_initialization_mode = getattr(
     args, "vllm_cache_initialization_mode", False
 )
+if self.vllm_cache_initialization_mode and args.environment != "kubernetes":
+    logger.warning("vLLM cache init mode is intended for Kubernetes; disabling for environment=%s", args.environment)
+    self.vllm_cache_initialization_mode = False
395-420
: Initial scale set is fine; avoid redundant writes when already at desired replicas.
Not blocking, but you could skip the write if current replicas already match 0/1 to reduce noisy updates.
428-436
: Enforce min_endpoint on post-init targets to stay consistent with global constraints.
If post-init targets are below min_endpoint, later adjustments will bump them anyway; enforce here for coherence.
-        target_replicas = {
+        target_replicas = {
             WORKER_COMPONENT_NAMES[
                 self.args.backend
-            ].prefill_worker_k8s_name: self.post_vllm_cache_prefill_replicas,
+            ].prefill_worker_k8s_name: max(self.args.min_endpoint, self.post_vllm_cache_prefill_replicas),
             WORKER_COMPONENT_NAMES[
                 self.args.backend
-            ].decode_worker_k8s_name: self.post_vllm_cache_decode_replicas,
+            ].decode_worker_k8s_name: max(self.args.min_endpoint, self.post_vllm_cache_decode_replicas),
         }
components/backends/vllm/deploy/disagg_planner_cache_init.yaml (1)
35-53
: Health probes always succeed; consider real checks.
Using exec: exit 0 makes readiness/liveness meaningless and can mask failures, especially during cache warm-up.
- Replace with an HTTP GET to a lightweight health endpoint or a file gate written post-initialization.
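As an illustration only, a minimal container-level probe sketch; the /health path, port 9090, and the timings are assumptions and are not taken from this PR — adjust them to whatever the worker actually serves:

```yaml
readinessProbe:
  httpGet:
    path: /health          # hypothetical health endpoint exposed by the worker
    port: 9090             # hypothetical port; set to the actual health server port
  initialDelaySeconds: 60  # allow time for model load / cache warm-up
  periodSeconds: 10
livenessProbe:
  httpGet:
    path: /health
    port: 9090
  initialDelaySeconds: 120
  periodSeconds: 30
```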
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- components/backends/vllm/deploy/disagg_planner_cache_init.yaml (1 hunks)
- components/planner/src/dynamo/planner/utils/planner_argparse.py (1 hunks)
- components/planner/src/dynamo/planner/utils/planner_core.py (3 hunks)
- deploy/utils/manifests/vllm-cache-pvc.yaml (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
components/planner/src/dynamo/planner/utils/planner_core.py (3)
- components/planner/src/dynamo/planner/kube.py (1): is_deployment_ready (102-115)
- components/planner/src/dynamo/planner/virtual_connector.py (1): set_component_replicas (289-316)
- components/planner/src/dynamo/planner/kubernetes_connector.py (1): set_component_replicas (71-100)
🪛 Ruff (0.12.2)
components/planner/src/dynamo/planner/utils/planner_core.py
387-387: Do not catch blind exception: Exception
(BLE001)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (2)
components/backends/vllm/deploy/disagg_planner_cache_init.yaml (2)
60-79
: Expose Prometheus metrics on port 8000 and add a Service
containerPort is 9085 while the planner is started with --prometheus-port=8000; set containerPort to 8000 (or change the arg to 9085) and create a Service selecting the planner pods that exposes port 8000 (port: 8000, targetPort: 8000) so Prometheus can scrape it.
File: components/backends/vllm/deploy/disagg_planner_cache_init.yaml (lines 60–79)
Verification couldn't be completed here (kubectl not available). Run locally to confirm: kubectl -n vllm-disagg-planner-cache-init get pods -l component=planner -o jsonpath='{.items[].spec.containers[].ports}' && kubectl -n vllm-disagg-planner-cache-init get svc
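A rough sketch of the suggested Service; the Service name and the label selector are assumptions, so verify them against the actual planner pod labels before applying:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: planner-metrics                      # hypothetical name
  namespace: vllm-disagg-planner-cache-init
spec:
  selector:
    component: planner                       # assumed planner pod label; check with kubectl get pods --show-labels
  ports:
    - name: metrics
      port: 8000
      targetPort: 8000                       # must match the planner's --prometheus-port
```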
10-21
: Verify Prometheus target service name
Unable to confirm the DNS name for the frontend service in the vllm-disagg-planner-cache-init namespace (cluster tools aren’t available here). Please run locally:
kubectl -n vllm-disagg-planner-cache-init get svc
and ensure the static target "vllm-disagg-planner-frontend-cache-init:8000" in DYNAMO_SERVICE_CONFIG matches the actual <service-name>.<namespace>.svc.cluster.local:8000. Update the value if it differs.
    replicas: 1
    resources:
      limits:
        gpu: "1"
    pvc:
      create: false
      name: vllm-cache-pvc # Must be created before deployment
      mountPoint: /root/.cache/vllm
    extraPodSpec:
🧩 Analysis chain
PVC name must match the PVC manifest.
Here you reference "vllm-cache-pvc". Make sure the PVC manifest creates the same name (see comment on vllm-cache-pvc.yaml). If you prefer the other name, change it here instead.
Also applies to: 153-157
🏁 Script executed:
#!/bin/bash
# Confirm the PVC exists in the target namespace after applying manifests
# (Replace $NAMESPACE appropriately)
kubectl -n vllm-disagg-planner-cache-init get pvc vllm-cache-pvc -o wide || true
Length of output: 45
🏁 Script executed:
#!/bin/bash
# Locate PVC manifest files named vllm-cache-pvc.yaml
if files=$(fd vllm-cache-pvc.yaml); then
:
else
files=$(find . -type f -name 'vllm-cache-pvc.yaml')
fi
echo "Found PVC manifest files:"
echo "$files"
echo
for f in $files; do
echo "----- $f -----"
sed -n '1,50p' "$f"
echo
done
Length of output: 498
Fix PVC name mismatch
Deployment references PVC "vllm-cache-pvc" but the PVC manifest deploy/utils/manifests/vllm-cache-pvc.yaml sets metadata.name: "dynamo-vllm-cache". Make the names identical: either rename the manifest to metadata.name: vllm-cache-pvc or update the deployment to use dynamo-vllm-cache (components/backends/vllm/deploy/disagg_planner_cache_init.yaml, refs at ~lines 120–128 and 153–157). Re-apply manifests and verify with: kubectl -n <namespace> get pvc <name>.
🤖 Prompt for AI Agents
In components/backends/vllm/deploy/disagg_planner_cache_init.yaml around lines
120–128 (and similarly refs near 153–157), the PVC name "vllm-cache-pvc" does
not match the PVC manifest which is named "dynamo-vllm-cache"; make the names
identical by either renaming the PVC manifest metadata.name to "vllm-cache-pvc"
or updating the deployment's pvc.name to "dynamo-vllm-cache", then re-apply the
corrected manifests and verify the PVC exists with kubectl -n <namespace> get
pvc <name>.
    async def check_vllm_cache_initialization_complete(self) -> bool:
        """Check if vLLM cache has been initialized"""
        if not self.vllm_cache_initialization_mode:
            return True
        try:
            # Assume if a decode worker is ready, the cache has been initialized
            if self.connector:
                is_initial_deployment_ready = (
                    await self.connector.kube_api.is_deployment_ready(self.namespace)
                )
                if is_initial_deployment_ready:
                    logger.info(
                        "Initial deployment is ready, vLLM cache initialization complete"
                    )
                    return True
        except Exception as e:
            logger.warning(f"Failed to check vLLM cache initialization status: {e}")
        return False
🧩 Analysis chain
Fix readiness check: using namespace instead of graph name; breaks on non-matching names and Virtual/no-op modes.
- kube_api.is_deployment_ready expects a GraphDeployment name, not a namespace.
- Will AttributeError on VirtualConnector or when no_operation=true (no connector).
- Blind exception catch violates BLE001 and can hide real failures.
Apply this fix to correctly resolve the graph name, handle envs, and narrow exceptions:
async def check_vllm_cache_initialization_complete(self) -> bool:
"""Check if vLLM cache has been initialized"""
if not self.vllm_cache_initialization_mode:
return True
- try:
- # Assume if a decode worker is ready, the cache has been initialized
- if self.connector:
- is_initial_deployment_ready = (
- await self.connector.kube_api.is_deployment_ready(self.namespace)
- )
- if is_initial_deployment_ready:
- logger.info(
- "Initial deployment is ready, vLLM cache initialization complete"
- )
- return True
- except Exception as e:
- logger.warning(f"Failed to check vLLM cache initialization status: {e}")
+ # Only applicable on Kubernetes with a KubernetesConnector
+ if not hasattr(self, "connector") or not hasattr(self.connector, "kube_api"):
+ return True # treat as complete in virtual/no-op modes
+ try:
+ deployment = await self.connector.kube_api.get_graph_deployment()
+ if deployment is None:
+ logger.info("GraphDeployment not found yet; cache init not complete")
+ return False
+ graph_name = (deployment.get("metadata") or {}).get("name")
+ if not graph_name:
+ logger.warning("GraphDeployment metadata.name missing; cache init not complete")
+ return False
+ is_ready = await self.connector.kube_api.is_deployment_ready(graph_name)
+ if is_ready:
+ logger.info("GraphDeployment %s is Ready; treating vLLM cache init as complete", graph_name)
+ return True
+ except (AttributeError, KeyError, TypeError, ValueError) as e:
+ logger.warning(f"Failed to check vLLM cache initialization status: {e}")
return False
🏁 Script executed:
#!/bin/bash
# Verify expected signature and typical usage of is_deployment_ready
rg -n -C3 -P 'def\s+is_deployment_ready\(' components/planner/src
rg -n -C3 -P 'is_deployment_ready\(' components/planner/src
Length of output: 2707
Fix readiness check: pass GraphDeployment name (not namespace) and guard virtual/no-op connectors
- kube_api.is_deployment_ready expects a GraphDeployment name (components/planner/src/dynamo/planner/kube.py:102); planner_core currently passes self.namespace and can produce wrong results.
- If connector exists but lacks kube_api (VirtualConnector) or connector is None (no-op), current code will raise AttributeError; broad except Exception hides real failures (BLE001).
- Change: resolve the graph deployment name (reuse the helper pattern used in kubernetes_connector.py before calling is_deployment_ready), guard for missing/virtual connectors (skip or treat as complete), and replace the blanket except with narrowly scoped exceptions around the kube calls.
- Locations: components/planner/src/dynamo/planner/utils/planner_core.py (~372–390); see kube.py:102 and kubernetes_connector.py:82–84 for reference usage.
🧰 Tools
🪛 Ruff (0.12.2)
387-387: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
In components/planner/src/dynamo/planner/utils/planner_core.py around lines 372
to 390, the readiness check currently passes self.namespace to
kube_api.is_deployment_ready and assumes connector.kube_api always exists, which
can produce wrong results or AttributeError; change it to resolve and pass the
GraphDeployment name (reuse the same helper/pattern used in
kubernetes_connector.py to compute the deployment name), add guards so if
connector is None or connector has no kube_api (virtual/no-op connector) the
function treats cache as initialized (return True or skip readiness), and narrow
the try/except to only wrap the actual kube_api call (catch specific exceptions
from the kube client) instead of a blanket except so real errors are not
swallowed.
# Handle vLLM cache initialization completely separately from scaling adjustments
if self.args.backend == "vllm":
    await self.handle_vllm_cache_initialization()

# If cache initialization is in progress, skip all other operations
if (
    self.vllm_cache_initialization_mode
    and not self.vllm_cache_initialized
):
    continue
Busy-wait during cache init; add a small sleep to avoid pegging the event loop.
The early continue skips the general sleep below; this will spin at 100% CPU until ready.
- if self.args.backend == "vllm":
- await self.handle_vllm_cache_initialization()
-
- # If cache initialization is in progress, skip all other operations
- if (
- self.vllm_cache_initialization_mode
- and not self.vllm_cache_initialized
- ):
- continue
+ if self.args.backend == "vllm":
+ await self.handle_vllm_cache_initialization()
+ # If cache initialization is in progress, skip all other operations but do not busy-wait
+ if self.vllm_cache_initialization_mode and not self.vllm_cache_initialized:
+ await asyncio.sleep(max(2, self.args.adjustment_interval / 10))
+ continue
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
# Handle vLLM cache initialization completely separately from scaling adjustments
if self.args.backend == "vllm":
    await self.handle_vllm_cache_initialization()
# If cache initialization is in progress, skip all other operations but do not busy-wait
if self.vllm_cache_initialization_mode and not self.vllm_cache_initialized:
    await asyncio.sleep(max(2, self.args.adjustment_interval / 10))
    continue
🤖 Prompt for AI Agents
In components/planner/src/dynamo/planner/utils/planner_core.py around lines 519
to 529, the loop currently busy-waits when vLLM cache initialization is in
progress because the early continue skips the general sleep; modify the branch
that detects "self.vllm_cache_initialization_mode and not
self.vllm_cache_initialized" to await a short non-blocking sleep (for example
await asyncio.sleep(0.1)) before continuing so the event loop is yielded and CPU
isn't pegged while waiting for initialization to complete.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: dynamo-vllm-cache
🧩 Analysis chain
PVC name mismatch with deployment manifest; pods will fail to mount.
Your worker manifests reference a PVC named "vllm-cache-pvc" while this file creates "dynamo-vllm-cache". Align the names to avoid Pending pods due to missing PVC.
Apply either fix; I recommend renaming the PVC here to match the deployment:
metadata:
- name: dynamo-vllm-cache
+ name: vllm-cache-pvc
Run to confirm all references are consistent:
🏁 Script executed:
#!/bin/bash
# Check PVC name references
rg -nP 'vllm-cache-pvc|dynamo-vllm-cache' -g '!**/node_modules/**' -S
Length of output: 148
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "PWD: $(pwd)"
echo "--- Top-level files ---"
ls -la | sed -n '1,200p' || true
echo
echo "--- rg search (no ignore, include hidden) for PVC names ---"
if command -v rg >/dev/null 2>&1; then
rg -n --hidden --no-ignore -S 'vllm-cache-pvc|dynamo-vllm-cache' || true
else
echo "rg not found; using grep fallback"
grep -RIn --exclude-dir=node_modules -e 'vllm-cache-pvc' -e 'dynamo-vllm-cache' . || true
fi
echo
echo "--- rg search for claimName lines ---"
if command -v rg >/dev/null 2>&1; then
rg -n --hidden --no-ignore -S 'claimName:\s*(vllm-cache-pvc|dynamo-vllm-cache)' || true
else
grep -RIn --exclude-dir=node_modules -E 'claimName:\s*(vllm-cache-pvc|dynamo-vllm-cache)' . || true
fi
echo
echo "--- show file if it exists: deploy/utils/manifests/vllm-cache-pvc.yaml ---"
if [ -f deploy/utils/manifests/vllm-cache-pvc.yaml ]; then
sed -n '1,240p' deploy/utils/manifests/vllm-cache-pvc.yaml || true
else
echo "deploy/utils/manifests/vllm-cache-pvc.yaml not found"
fi
echo
echo "--- list YAML files under deploy/utils/manifests (if dir exists) ---"
if [ -d deploy/utils/manifests ]; then
find deploy/utils/manifests -type f -name '*.y*ml' -print -exec sed -n '1,200p' {} \; || true
else
echo "deploy/utils/manifests directory not found"
fi
echo
echo "Done."
Length of output: 7607
PVC name mismatch — pods will fail to mount; make PVC name match deployment references
The PVC manifest creates metadata.name: dynamo-vllm-cache while init/deployment manifests reference vllm-cache-pvc. Change the name in deploy/utils/manifests/vllm-cache-pvc.yaml to match.
metadata:
- name: dynamo-vllm-cache
+ name: vllm-cache-pvc
References observed: components/backends/vllm/deploy/disagg_planner_cache_init.yaml (lines ~126, ~155) reference vllm-cache-pvc.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
  name: vllm-cache-pvc
🤖 Prompt for AI Agents
In deploy/utils/manifests/vllm-cache-pvc.yaml around line 6, the PVC is named
"dynamo-vllm-cache" but init/deployment manifests (e.g.,
components/backends/vllm/deploy/disagg_planner_cache_init.yaml lines ~126 and
~155) reference "vllm-cache-pvc"; update metadata.name in this manifest to
"vllm-cache-pvc" so the PVC name matches the deployment references and pods can
mount it.
Stupid question: if the user starts the planner with more than 1 replica for prefill/decode, will they first be scheduled, then killed, then scheduled again?
The number of replicas will only be increased, so the initial worker won't be removed; we'll scale up to the desired number.
Overview:
This PR implements vLLM cache initialization mode for the SLA Planner, addressing performance bottlenecks during initial deployment startup. The feature enables controlled cache warming to reduce cold start latencies and improve initial request handling in disaggregated vLLM deployments.
I used nvidia/Llama-3.1-8B-Instruct-FP8 to test the performance difference. Without caching, torch.compile takes ~45 seconds; with caching, loading the cached results takes ~8 seconds. Note that we can extend this functionality to Dynamo in general and not just the Planner, but that can be part of a future PR.
Details:
Overall sequence diagram:

vLLM Cache Initialization Mode: Added a new planner mode that orchestrates cache warming by starting with a minimal deployment (0 prefill, 1 decode worker), waiting for that initial worker to become ready (at which point the shared vLLM cache has been populated), and then scaling to the configured post-initialization replica counts.
Additional Planner CLI Arguments (planner_argparse.py):
- --vllm-cache-initialization-mode: Enable the cache initialization strategy
- --post-vllm-cache-prefill-replicas: Target prefill worker count after initialization
- --post-vllm-cache-decode-replicas: Target decode worker count after initialization
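For context, a sketch of how these flags might be passed to the planner service in a deployment manifest; the surrounding structure (extraPodSpec/mainContainer) and the other args are illustrative assumptions, only the three new flags come from this PR:

```yaml
# Hypothetical planner service excerpt; the field layout is assumed, not copied from the PR.
Planner:
  extraPodSpec:
    mainContainer:
      args:
        - --environment=kubernetes
        - --backend=vllm
        - --vllm-cache-initialization-mode        # start minimal (0 prefill / 1 decode) to warm the cache
        - --post-vllm-cache-prefill-replicas=2    # prefill replicas once the cache is initialized
        - --post-vllm-cache-decode-replicas=2     # decode replicas once the cache is initialized
```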
Planner Core Logic (planner_core.py):
- handle_vllm_cache_initialization(): Manages the cache initialization workflow
- check_vllm_cache_initialization_complete(): Monitors deployment readiness
Deployment Configurations:
- vllm-cache-pvc.yaml: Dedicated PVC for vLLM cache storage (400Gi, ReadWriteMany). This is required to share the cache between different workers; a consolidated sketch of the manifest follows this list.
- disagg_planner_cache_init.yaml: Example deployment with cache initialization enabled.
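Pulling together the fields visible in this PR, the PVC looks roughly like the sketch below; the 400Gi request comes from the description above, accessModes and storageClassName from the review diff, and the name is set to match what the deployment references (see the naming discussion in the review):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vllm-cache-pvc              # must match pvc.name in the worker deployment
spec:
  accessModes:
    - ReadWriteMany                 # cache is shared between prefill and decode workers
  storageClassName: ${STORAGE_CLASS_NAME}
  resources:
    requests:
      storage: 400Gi                # sized for the shared vLLM cache
```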
Where should the reviewer start?
- components/planner/src/dynamo/planner/utils/planner_core.py - Core cache initialization logic (lines 133-425)
- components/planner/src/dynamo/planner/utils/planner_argparse.py - New CLI arguments (lines 127-144)
- deploy/utils/manifests/vllm-cache-pvc.yaml - PVC configuration for cache storage
- components/backends/vllm/deploy/disagg_planner_cache_init.yaml - Example deployment configuration. I can move this to a different location since this is an example.