
Conversation

Contributor

@michaelshin michaelshin commented Sep 18, 2025

Overview:

This PR implements a vLLM cache initialization mode for the SLA Planner, addressing performance bottlenecks during initial deployment startup. The feature enables controlled cache warming to reduce cold-start latency and improve initial request handling in disaggregated vLLM deployments.

I used nvidia/Llama-3.1-8B-Instruct-FP8 to test the performance difference. Without caching, torch.compile takes ~45 seconds. With caching, loading the cached results takes ~8 seconds.

Note that we can extend this functionality to Dynamo in general and not just the Planner, but that can be part of a future PR.

Details:

Overall sequence diagram:
[Mermaid sequence diagram image, exported 2025-09-18]

  1. vLLM Cache Initialization Mode: Added a new planner mode that orchestrates cache warming by:

    • Starting with minimal replicas (0 prefill, 1 decode worker) to initialize the vLLM cache
    • Monitoring deployment readiness to detect cache initialization completion
    • Automatically scaling to target replica counts once cache is ready
  2. Additional Planner CLI Arguments (planner_argparse.py):

    • --vllm-cache-initialization-mode: Enable the cache initialization strategy
    • --post-vllm-cache-prefill-replicas: Target prefill worker count after initialization
    • --post-vllm-cache-decode-replicas: Target decode worker count after initialization
  3. Planner Core Logic (planner_core.py):

    • handle_vllm_cache_initialization(): Manages the cache initialization workflow
    • check_vllm_cache_initialization_complete(): Monitors deployment readiness
    • Integration with existing scaling logic to prevent conflicts during initialization (a minimal sketch of this flow follows the list below)
  4. Deployment Configurations:

    • vllm-cache-pvc.yaml - Dedicated PVC for vLLM cache storage (400Gi, ReadWriteMany). This is required to share the cache between different workers.
    • disagg_planner_cache_init.yaml - Example deployment with cache initialization enabled
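
To make the flow in items 1–3 concrete, here is the minimal sketch referenced above. It is illustrative only: the flag and method names come from this PR, but the connector calls, attribute names, and control flow are simplified assumptions rather than the actual implementation in planner_core.py.

import asyncio
import logging

logger = logging.getLogger(__name__)


class PlannerCacheInitSketch:
    """Illustrative sketch of the cache-initialization flow (not the real planner)."""

    def __init__(self, args, connector):
        # Flags added in planner_argparse.py; defaults here are assumptions.
        self.cache_init_mode = getattr(args, "vllm_cache_initialization_mode", False)
        self.post_prefill = getattr(args, "post_vllm_cache_prefill_replicas", 1)
        self.post_decode = getattr(args, "post_vllm_cache_decode_replicas", 1)
        self.connector = connector  # hypothetical Kubernetes connector
        self.cache_initialized = False

    async def handle_vllm_cache_initialization(self):
        """Start minimal (0 prefill / 1 decode), wait for readiness, then scale up."""
        if not self.cache_init_mode or self.cache_initialized:
            return
        # Step 1: minimal footprint so a single decode worker warms the shared cache.
        await self.connector.set_component_replicas({"prefill": 0, "decode": 1})
        # Step 2: poll readiness; the real planner folds this into its main loop.
        while not await self.connector.is_deployment_ready():
            await asyncio.sleep(5)
        # Step 3: scale out to the configured post-initialization targets.
        await self.connector.set_component_replicas(
            {"prefill": self.post_prefill, "decode": self.post_decode}
        )
        self.cache_initialized = True
        logger.info("vLLM cache initialized; scaled to post-init replica targets")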

Where should the reviewer start?

  1. components/planner/src/dynamo/planner/utils/planner_core.py - Core cache initialization logic (lines 133-425)
  2. components/planner/src/dynamo/planner/utils/planner_argparse.py - New CLI arguments (lines 127-144)
  3. deploy/utils/manifests/vllm-cache-pvc.yaml - PVC configuration for cache storage
  4. components/backends/vllm/deploy/disagg_planner_cache_init.yaml - Example deployment configuration. I can move this to a different location since this is an example.

Summary by CodeRabbit

  • New Features

    • Added cache initialization mode for the vLLM-based planner to warm up the model cache before normal operation.
    • Planner now auto-scales prefill and decode workers after cache warm-up completes.
    • New CLI options to enable cache init and configure post-init prefill/decode replica targets.
  • Chores

    • Introduced a deployment manifest to run the cache initialization workflow with frontend, planner, workers, and monitoring.
    • Added a PVC manifest for shared vLLM cache storage to support initialization and subsequent runs.


copy-pr-bot bot commented Sep 18, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@michaelshin michaelshin force-pushed the michaelshin/add-support-for-vllm-cache-in-planner branch from 3dfd2eb to 0d0b895 on September 18, 2025 15:07
Contributor

coderabbitai bot commented Sep 18, 2025

Walkthrough

Adds a vLLM cache-initialization mode to the planner with new CLI flags and control-flow in the planner runtime. Introduces a deployment manifest for cache-init operation and a PVC manifest for shared vLLM cache storage. The planner orchestrates initial minimal replicas, waits for readiness, then scales to target replicas and updates metrics.

Changes

  • Planner core: vLLM cache init flow
    File: components/planner/src/dynamo/planner/utils/planner_core.py
    Adds cache-init mode state and logic. Introduces check_vllm_cache_initialization_complete() and handle_vllm_cache_initialization() async methods. Integrates the init flow into the main loop to set initial replicas (prefill=0, decode=1), poll readiness, then scale to the configured post-init replicas and update metrics.
  • Planner CLI options
    File: components/planner/src/dynamo/planner/utils/planner_argparse.py
    Adds three args: --vllm-cache-initialization-mode (bool), --post-vllm-cache-prefill-replicas (int), --post-vllm-cache-decode-replicas (int). No signature changes.
  • Kubernetes manifests: deployment & PVC
    Files: components/backends/vllm/deploy/disagg_planner_cache_init.yaml, deploy/utils/manifests/vllm-cache-pvc.yaml
    New CRD-based deployment for the disaggregated planner in cache-init mode with components (Frontend, Planner, Prometheus, VllmDecodeWorker, VllmPrefillWorker). Adds PVC manifest dynamo-vllm-cache (RWX, 400Gi, templated namespace/storage class).

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant User as Operator
    participant Planner as Planner
    participant Kube as Kubernetes API
    participant Decode as VllmDecodeWorker
    participant Prefill as VllmPrefillWorker
    participant Prom as Prometheus

    Note over Planner: Start with --vllm-cache-initialization-mode
    Planner->>Kube: Scale Prefill=0, Decode=1 (initial)
    Kube-->>Decode: Ensure 1 replica running
    loop Poll readiness
        Planner->>Kube: is_deployment_ready(decode)
        Kube-->>Planner: ready? (true/false)
    end
    alt Ready
        Planner->>Kube: Scale to post-init replicas<br/>(Prefill=Np, Decode=Nd)
        Kube-->>Prefill: Reconcile to Np
        Kube-->>Decode: Reconcile to Nd
        Planner->>Prom: Update metrics (cache_initialized=true)
        Note over Planner: vllm_cache_initialized = True
    else Not ready
        Note over Planner: Skip normal adjustments until ready
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • feat: support sglang in sla planner #2421 — Updates WORKER_COMPONENT_NAMES with SGLang entries; directly related to planner logic that references worker component mappings used in the new cache-init scaling.

Poem

In caches warm, my whiskers twitch,
One decode up, prefill on switch.
I nibble logs, await “all green,”
Then scale the fields to steady sheen.
PVC burrow, warm and wide—
Hop, deploy, and metrics glide. 🐇⚙️

Pre-merge checks

❌ Failed checks (1 warning)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 62.50%, which is insufficient; the required threshold is 80.00%. Resolution: you can run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (2 passed)
  • Title Check (✅ Passed): The title "feat: add support for vllm cache initialization in Dynamo Planner" is concise and accurately summarizes the PR's primary change—adding vLLM cache initialization support to the Planner; it uses conventional commit style and is clear for teammates scanning history.
  • Description Check (✅ Passed): The PR description is largely complete: it provides a clear Overview, detailed Changes, and a "Where should the reviewer start?" section with file-level pointers and test observations, which give reviewers sufficient context to evaluate the change. It omits the "Related Issues" section from the repository template and does not include explicit step-by-step verification instructions or explicit preconditions (for example, a note to pre-create the required PVC), which would aid reproducibility. Overall the description is sufficient for review but would benefit from those small additions.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

Signed-off-by: Michael Shin <[email protected]>
@michaelshin
Contributor Author

@coderabbitai help

Contributor

coderabbitai bot commented Sep 18, 2025

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

CodeRabbit commands

These commands are invoked using PR/Issue comments.

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai evaluate custom pre-merge check --instructions <custom-checks-instructions> --name <custom-checks-title> [--mode <error|warning>] to validate the custom pre-merge checks instructions. Defaults to error when --mode is omitted.
    • @coderabbitai ignore pre-merge checks to override pre-merge checks and get an approval on PR.
    • @coderabbitai run pre-merge checks to run pre-merge checks on the pull request.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai resolve to resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore or @coderabbit ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

Status, support, documentation and community

  • Visit our status page to check the current availability of CodeRabbit.
  • Create a ticket on our support page for assistance with any issues or questions.
  • Visit our documentation site for detailed information on how to use CodeRabbit.
  • Join our Discord community to connect with other users and get help from the community.
  • Follow us on X/Twitter for updates and announcements.

@michaelshin
Contributor Author

@coderabbitai full review

Contributor

coderabbitai bot commented Sep 18, 2025

✅ Actions performed

Full review triggered.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (6)
deploy/utils/manifests/vllm-cache-pvc.yaml (1)

9-14: Optional: add volumeMode and label for clarity/ops hygiene.

Not required, but adding volumeMode and labels eases debugging and storage class behavior introspection.

 spec:
   accessModes:
     - ReadWriteMany
   storageClassName: ${STORAGE_CLASS_NAME}
+  volumeMode: Filesystem

and, under metadata:

 metadata:
   name: dynamo-vllm-cache
+  labels:
+    app: vllm-cache
components/planner/src/dynamo/planner/utils/planner_argparse.py (1)

127-144: Clarify help text and scope flags to vLLM/Kubernetes.

  • Help says “start with 1 replica” but the logic uses 0 prefill, 1 decode.
  • These flags only make sense for backend=vllm on Kubernetes; surface that in help.
-        help="Enable vLLM cache initialization mode - start with 1 replica to initialize vLLM cache, then scale up",
+        help="Enable vLLM cache initialization: start with 0 prefill and 1 decode to warm vLLM cache, then scale up (backend=vllm on Kubernetes)",

Optionally validate at parse time (soft warning) so users don’t pass these in other modes:

 def create_sla_planner_parser() -> argparse.ArgumentParser:
@@
     parser.add_argument(
         "--post-vllm-cache-decode-replicas",
         type=int,
         default=1,
         help="Target number of decode worker replicas after vLLM cache initialization",
     )
+    # Post-parse warning hook (caller can invoke)
+    parser.set_defaults(_validate_vllm_cache_flags=_validate_vllm_cache_flags)
     return parser
+
+def _validate_vllm_cache_flags(args: argparse.Namespace) -> None:
+    if args.vllm_cache_initialization_mode and (args.backend != "vllm"):
+        logging.warning("--vllm-cache-initialization-mode is set but backend is not vllm; flag will be ignored.")
+    if args.vllm_cache_initialization_mode and getattr(args, "environment", None) != "kubernetes":
+        logging.warning("--vllm-cache-initialization-mode is intended for Kubernetes; behavior may be undefined elsewhere.")
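
If the optional set_defaults hook suggested above were adopted, caller-side wiring could look like the following sketch. This is hypothetical: the planner's actual entrypoint is not shown in this PR, and the import path is inferred from the file layout.

import logging

from dynamo.planner.utils.planner_argparse import create_sla_planner_parser

logging.basicConfig(level=logging.INFO)

parser = create_sla_planner_parser()
args = parser.parse_args()

# The hook rides along on the parsed namespace via set_defaults (see the
# suggestion above); guard with getattr in case it is not wired in.
validate = getattr(args, "_validate_vllm_cache_flags", None)
if callable(validate):
    validate(args)  # warns if the flags are used outside backend=vllm on Kubernetes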
components/planner/src/dynamo/planner/utils/planner_core.py (3)

133-149: Initialization flags wiring looks good; gate by env to avoid surprises.

Consider gating the mode to Kubernetes explicitly to prevent Virtual/no-op runs from invoking K8s-only paths.

 self.vllm_cache_initialization_mode = getattr(
     args, "vllm_cache_initialization_mode", False
 )
+if self.vllm_cache_initialization_mode and args.environment != "kubernetes":
+    logger.warning("vLLM cache init mode is intended for Kubernetes; disabling for environment=%s", args.environment)
+    self.vllm_cache_initialization_mode = False

395-420: Initial scale set is fine; avoid redundant writes when already at desired replicas.

Not blocking, but you could skip the write if current replicas already match 0/1 to reduce noisy updates.


428-436: Enforce min_endpoint on post-init targets to stay consistent with global constraints.

If post-init targets are below min_endpoint, later adjustments will bump them anyway; enforce here for coherence.

-            target_replicas = {
+            target_replicas = {
                 WORKER_COMPONENT_NAMES[
                     self.args.backend
-                ].prefill_worker_k8s_name: self.post_vllm_cache_prefill_replicas,
+                ].prefill_worker_k8s_name: max(self.args.min_endpoint, self.post_vllm_cache_prefill_replicas),
                 WORKER_COMPONENT_NAMES[
                     self.args.backend
-                ].decode_worker_k8s_name: self.post_vllm_cache_decode_replicas,
+                ].decode_worker_k8s_name: max(self.args.min_endpoint, self.post_vllm_cache_decode_replicas),
             }
components/backends/vllm/deploy/disagg_planner_cache_init.yaml (1)

35-53: Health probes always succeed; consider real checks.

Using exec: exit 0 makes readiness/liveness meaningless and can mask failures, especially during cache warm-up.

  • Replace with HTTP GET to a lightweight health endpoint or a file-gate written post-initialization.
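
For the file-gate idea, a minimal sketch (purely illustrative; the marker path, port, and endpoint names are assumptions, not anything this PR defines): a tiny HTTP server that reports ready only after a warm-up marker file exists, which the readinessProbe could then hit via httpGet instead of exec: exit 0.

import http.server
import os

READY_MARKER = "/root/.cache/vllm/.warmup_complete"  # hypothetical marker path


class HealthHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            # Liveness: the process is up.
            self.send_response(200)
        elif self.path == "/ready":
            # Readiness: only once cache warm-up has written the marker file.
            self.send_response(200 if os.path.exists(READY_MARKER) else 503)
        else:
            self.send_response(404)
        self.end_headers()


if __name__ == "__main__":
    http.server.HTTPServer(("0.0.0.0", 9090), HealthHandler).serve_forever()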
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6dd3326 and e3f9c0e.

📒 Files selected for processing (4)
  • components/backends/vllm/deploy/disagg_planner_cache_init.yaml (1 hunks)
  • components/planner/src/dynamo/planner/utils/planner_argparse.py (1 hunks)
  • components/planner/src/dynamo/planner/utils/planner_core.py (3 hunks)
  • deploy/utils/manifests/vllm-cache-pvc.yaml (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
components/planner/src/dynamo/planner/utils/planner_core.py (3)
components/planner/src/dynamo/planner/kube.py (1)
  • is_deployment_ready (102-115)
components/planner/src/dynamo/planner/virtual_connector.py (1)
  • set_component_replicas (289-316)
components/planner/src/dynamo/planner/kubernetes_connector.py (1)
  • set_component_replicas (71-100)
🪛 Ruff (0.12.2)
components/planner/src/dynamo/planner/utils/planner_core.py

387-387: Do not catch blind exception: Exception

(BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (2)
components/backends/vllm/deploy/disagg_planner_cache_init.yaml (2)

60-79: Expose Prometheus metrics on port 8000 and add a Service

containerPort is 9085 while the planner is started with --prometheus-port=8000 — set containerPort to 8000 (or change the arg to 9085) and create a Service selecting the planner pods that exposes port 8000 (port: 8000, targetPort: 8000) so Prometheus can scrape it.

File: components/backends/vllm/deploy/disagg_planner_cache_init.yaml (lines 60–79)

Verification couldn't be completed here (kubectl not available). Run locally to confirm: kubectl -n vllm-disagg-planner-cache-init get pods -l component=planner -o jsonpath='{.items[].spec.containers[].ports}' && kubectl -n vllm-disagg-planner-cache-init get svc
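
A minimal sketch of why these numbers must line up, assuming a prometheus_client-style exporter (the planner's actual metrics code is not shown in this excerpt, and the metric name below is hypothetical): whatever port the exporter binds is the port the containerPort and the Service targetPort must expose.

import time

from prometheus_client import Gauge, start_http_server

PROMETHEUS_PORT = 8000  # must match --prometheus-port, containerPort, and targetPort

vllm_cache_initialized = Gauge(
    "planner_vllm_cache_initialized",
    "Set to 1 once vLLM cache warm-up has completed",
)

if __name__ == "__main__":
    start_http_server(PROMETHEUS_PORT)  # serves /metrics on :8000
    vllm_cache_initialized.set(1)
    time.sleep(3600)  # keep the process alive so Prometheus can scrape it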


10-21: Verify Prometheus target service name
Unable to confirm the DNS name for the frontend service in the vllm-disagg-planner-cache-init namespace (cluster tools aren’t available here). Please run locally:

kubectl -n vllm-disagg-planner-cache-init get svc

and ensure the static target "vllm-disagg-planner-frontend-cache-init:8000" in DYNAMO_SERVICE_CONFIG matches the actual <service-name>.<namespace>.svc.cluster.local:8000. Update the value if it differs.

Comment on lines +120 to +128
      replicas: 1
      resources:
        limits:
          gpu: "1"
      pvc:
        create: false
        name: vllm-cache-pvc # Must be created before deployment
        mountPoint: /root/.cache/vllm
      extraPodSpec:
Contributor


⚠️ Potential issue

🧩 Analysis chain

PVC name must match the PVC manifest.

Here you reference "vllm-cache-pvc". Make sure the PVC manifest creates the same name (see comment on vllm-cache-pvc.yaml). If you prefer the other name, change it here instead.

Also applies to: 153-157


🏁 Script executed:

#!/bin/bash
# Confirm the PVC exists in the target namespace after applying manifests
# (Replace $NAMESPACE appropriately)
kubectl -n vllm-disagg-planner-cache-init get pvc vllm-cache-pvc -o wide || true

Length of output: 45


🏁 Script executed:

#!/bin/bash
# Locate PVC manifest files named vllm-cache-pvc.yaml
if files=$(fd vllm-cache-pvc.yaml); then
  :
else
  files=$(find . -type f -name 'vllm-cache-pvc.yaml')
fi

echo "Found PVC manifest files:"
echo "$files"
echo

for f in $files; do
  echo "----- $f -----"
  sed -n '1,50p' "$f"
  echo
done

Length of output: 498


Fix PVC name mismatch

Deployment references PVC "vllm-cache-pvc" but the PVC manifest deploy/utils/manifests/vllm-cache-pvc.yaml sets metadata.name: "dynamo-vllm-cache". Make the names identical — either rename the manifest to metadata.name: vllm-cache-pvc or update the deployment to use dynamo-vllm-cache (components/backends/vllm/deploy/disagg_planner_cache_init.yaml — refs at ~lines 120–128 and 153–157). Re-apply manifests and verify with: kubectl -n <namespace> get pvc <name>.

🤖 Prompt for AI Agents
In components/backends/vllm/deploy/disagg_planner_cache_init.yaml around lines
120–128 (and similarly refs near 153–157), the PVC name "vllm-cache-pvc" does
not match the PVC manifest which is named "dynamo-vllm-cache"; make the names
identical by either renaming the PVC manifest metadata.name to "vllm-cache-pvc"
or updating the deployment's pvc.name to "dynamo-vllm-cache", then re-apply the
corrected manifests and verify the PVC exists with kubectl -n <namespace> get
pvc <name>.

Comment on lines +372 to +390
async def check_vllm_cache_initialization_complete(self) -> bool:
    """Check if vLLM cache has been initialized"""
    if not self.vllm_cache_initialization_mode:
        return True
    try:
        # Assume if a decode worker is ready, the cache has been initialized
        if self.connector:
            is_initial_deployment_ready = (
                await self.connector.kube_api.is_deployment_ready(self.namespace)
            )
            if is_initial_deployment_ready:
                logger.info(
                    "Initial deployment is ready, vLLM cache initialization complete"
                )
                return True
    except Exception as e:
        logger.warning(f"Failed to check vLLM cache initialization status: {e}")
    return False

Contributor


⚠️ Potential issue

🧩 Analysis chain

Fix readiness check: using namespace instead of graph name; breaks on non-matching names and Virtual/no-op modes.

  • kube_api.is_deployment_ready expects a GraphDeployment name, not a namespace.
  • Will AttributeError on VirtualConnector or when no_operation=true (no connector).
  • Blind exception catch violates BLE001 and can hide real failures.

Apply this fix to correctly resolve the graph name, handle envs, and narrow exceptions:

 async def check_vllm_cache_initialization_complete(self) -> bool:
     """Check if vLLM cache has been initialized"""
     if not self.vllm_cache_initialization_mode:
         return True
-    try:
-        # Assume if a decode worker is ready, the cache has been initialized
-        if self.connector:
-            is_initial_deployment_ready = (
-                await self.connector.kube_api.is_deployment_ready(self.namespace)
-            )
-            if is_initial_deployment_ready:
-                logger.info(
-                    "Initial deployment is ready, vLLM cache initialization complete"
-                )
-                return True
-    except Exception as e:
-        logger.warning(f"Failed to check vLLM cache initialization status: {e}")
+    # Only applicable on Kubernetes with a KubernetesConnector
+    if not hasattr(self, "connector") or not hasattr(self.connector, "kube_api"):
+        return True  # treat as complete in virtual/no-op modes
+    try:
+        deployment = await self.connector.kube_api.get_graph_deployment()
+        if deployment is None:
+            logger.info("GraphDeployment not found yet; cache init not complete")
+            return False
+        graph_name = (deployment.get("metadata") or {}).get("name")
+        if not graph_name:
+            logger.warning("GraphDeployment metadata.name missing; cache init not complete")
+            return False
+        is_ready = await self.connector.kube_api.is_deployment_ready(graph_name)
+        if is_ready:
+            logger.info("GraphDeployment %s is Ready; treating vLLM cache init as complete", graph_name)
+            return True
+    except (AttributeError, KeyError, TypeError, ValueError) as e:
+        logger.warning(f"Failed to check vLLM cache initialization status: {e}")
     return False

🏁 Script executed:

#!/bin/bash
# Verify expected signature and typical usage of is_deployment_ready
rg -n -C3 -P 'def\s+is_deployment_ready\(' components/planner/src
rg -n -C3 -P 'is_deployment_ready\(' components/planner/src

Length of output: 2707


Fix readiness check: pass GraphDeployment name (not namespace) and guard virtual/no-op connectors

  • kube_api.is_deployment_ready expects a GraphDeployment name (components/planner/src/dynamo/planner/kube.py:102); planner_core currently passes self.namespace and can produce wrong results.
  • If connector exists but lacks kube_api (VirtualConnector) or connector is None (no-op), current code will raise AttributeError; broad except Exception hides real failures (BLE001).
  • Change: resolve the graph deployment name (reuse the helper pattern used in kubernetes_connector.py before calling is_deployment_ready), guard for missing/virtual connectors (skip or treat as complete), and replace the blanket except with narrowly scoped exceptions around the kube calls.
  • Locations: components/planner/src/dynamo/planner/utils/planner_core.py (~372–390); see kube.py:102 and kubernetes_connector.py:82–84 for reference usage.
🧰 Tools
🪛 Ruff (0.12.2)

387-387: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
In components/planner/src/dynamo/planner/utils/planner_core.py around lines 372
to 390, the readiness check currently passes self.namespace to
kube_api.is_deployment_ready and assumes connector.kube_api always exists, which
can produce wrong results or AttributeError; change it to resolve and pass the
GraphDeployment name (reuse the same helper/pattern used in
kubernetes_connector.py to compute the deployment name), add guards so if
connector is None or connector has no kube_api (virtual/no-op connector) the
function treats cache as initialized (return True or skip readiness), and narrow
the try/except to only wrap the actual kube_api call (catch specific exceptions
from the kube client) instead of a blanket except so real errors are not
swallowed.

Comment on lines +519 to +529
            # Handle vLLM cache initialization completely separately from scaling adjustments
            if self.args.backend == "vllm":
                await self.handle_vllm_cache_initialization()

                # If cache initialization is in progress, skip all other operations
                if (
                    self.vllm_cache_initialization_mode
                    and not self.vllm_cache_initialized
                ):
                    continue

Contributor


⚠️ Potential issue

Busy-wait during cache init; add a small sleep to avoid pegging the event loop.

The early continue skips the general sleep below; this will spin at 100% CPU until ready.

-            if self.args.backend == "vllm":
-                await self.handle_vllm_cache_initialization()
-
-                # If cache initialization is in progress, skip all other operations
-                if (
-                    self.vllm_cache_initialization_mode
-                    and not self.vllm_cache_initialized
-                ):
-                    continue
+            if self.args.backend == "vllm":
+                await self.handle_vllm_cache_initialization()
+                # If cache initialization is in progress, skip all other operations but do not busy-wait
+                if self.vllm_cache_initialization_mode and not self.vllm_cache_initialized:
+                    await asyncio.sleep(max(2, self.args.adjustment_interval / 10))
+                    continue
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-            # Handle vLLM cache initialization completely separately from scaling adjustments
-            if self.args.backend == "vllm":
-                await self.handle_vllm_cache_initialization()
-
-                # If cache initialization is in progress, skip all other operations
-                if (
-                    self.vllm_cache_initialization_mode
-                    and not self.vllm_cache_initialized
-                ):
-                    continue
+            # Handle vLLM cache initialization completely separately from scaling adjustments
+            if self.args.backend == "vllm":
+                await self.handle_vllm_cache_initialization()
+                # If cache initialization is in progress, skip all other operations but do not busy-wait
+                if self.vllm_cache_initialization_mode and not self.vllm_cache_initialized:
+                    await asyncio.sleep(max(2, self.args.adjustment_interval / 10))
+                    continue
🤖 Prompt for AI Agents
In components/planner/src/dynamo/planner/utils/planner_core.py around lines 519
to 529, the loop currently busy-waits when vLLM cache initialization is in
progress because the early continue skips the general sleep; modify the branch
that detects "self.vllm_cache_initialization_mode and not
self.vllm_cache_initialized" to await a short non-blocking sleep (for example
await asyncio.sleep(0.1)) before continuing so the event loop is yielded and CPU
isn't pegged while waiting for initialization to complete.
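
The same yield-while-waiting idea in isolation: a generic helper, not part of this PR, that polls an async predicate without pegging the event loop, which is what the suggested asyncio.sleep accomplishes inside the planner's main loop.

import asyncio
from typing import Awaitable, Callable, Optional


async def wait_until(
    predicate: Callable[[], Awaitable[bool]],
    poll_interval: float = 2.0,
    timeout: Optional[float] = None,
) -> bool:
    """Poll an async predicate, yielding to the event loop between attempts."""
    loop = asyncio.get_running_loop()
    deadline = None if timeout is None else loop.time() + timeout
    while not await predicate():
        if deadline is not None and loop.time() >= deadline:
            return False
        await asyncio.sleep(poll_interval)  # yields instead of spinning
    return True

# Hypothetical usage with the planner's readiness check:
#   ready = await wait_until(self.check_vllm_cache_initialization_complete,
#                            poll_interval=self.args.adjustment_interval / 10)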

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: dynamo-vllm-cache
Contributor


⚠️ Potential issue

🧩 Analysis chain

PVC name mismatch with deployment manifest; pods will fail to mount.

Your worker manifests reference a PVC named "vllm-cache-pvc" while this file creates "dynamo-vllm-cache". Align the names to avoid Pending pods due to missing PVC.

Apply either fix; I recommend renaming the PVC here to match the deployment:

 metadata:
-  name: dynamo-vllm-cache
+  name: vllm-cache-pvc

Run to confirm all references are consistent:


🏁 Script executed:

#!/bin/bash
# Check PVC name references
rg -nP 'vllm-cache-pvc|dynamo-vllm-cache' -g '!**/node_modules/**' -S

Length of output: 148


🏁 Script executed:

#!/bin/bash
set -euo pipefail
echo "PWD: $(pwd)"
echo "--- Top-level files ---"
ls -la | sed -n '1,200p' || true

echo
echo "--- rg search (no ignore, include hidden) for PVC names ---"
if command -v rg >/dev/null 2>&1; then
  rg -n --hidden --no-ignore -S 'vllm-cache-pvc|dynamo-vllm-cache' || true
else
  echo "rg not found; using grep fallback"
  grep -RIn --exclude-dir=node_modules -e 'vllm-cache-pvc' -e 'dynamo-vllm-cache' . || true
fi

echo
echo "--- rg search for claimName lines ---"
if command -v rg >/dev/null 2>&1; then
  rg -n --hidden --no-ignore -S 'claimName:\s*(vllm-cache-pvc|dynamo-vllm-cache)' || true
else
  grep -RIn --exclude-dir=node_modules -E 'claimName:\s*(vllm-cache-pvc|dynamo-vllm-cache)' . || true
fi

echo
echo "--- show file if it exists: deploy/utils/manifests/vllm-cache-pvc.yaml ---"
if [ -f deploy/utils/manifests/vllm-cache-pvc.yaml ]; then
  sed -n '1,240p' deploy/utils/manifests/vllm-cache-pvc.yaml || true
else
  echo "deploy/utils/manifests/vllm-cache-pvc.yaml not found"
fi

echo
echo "--- list YAML files under deploy/utils/manifests (if dir exists) ---"
if [ -d deploy/utils/manifests ]; then
  find deploy/utils/manifests -type f -name '*.y*ml' -print -exec sed -n '1,200p' {} \; || true
else
  echo "deploy/utils/manifests directory not found"
fi

echo
echo "Done."

Length of output: 7607


PVC name mismatch — pods will fail to mount; make PVC name match deployment references

The PVC manifest creates metadata.name: dynamo-vllm-cache while init/deployment manifests reference vllm-cache-pvc. Change the name in deploy/utils/manifests/vllm-cache-pvc.yaml to match.

metadata:
-  name: dynamo-vllm-cache
+  name: vllm-cache-pvc

References observed: components/backends/vllm/deploy/disagg_planner_cache_init.yaml (lines ~126, ~155) reference vllm-cache-pvc.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-  name: dynamo-vllm-cache
+  name: vllm-cache-pvc
🤖 Prompt for AI Agents
In deploy/utils/manifests/vllm-cache-pvc.yaml around line 6, the PVC is named
"dynamo-vllm-cache" but init/deployment manifests (e.g.,
components/backends/vllm/deploy/disagg_planner_cache_init.yaml lines ~126 and
~155) reference "vllm-cache-pvc"; update metadata.name in this manifest to
"vllm-cache-pvc" so the PVC name matches the deployment references and pods can
mount it.
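
To catch this kind of mismatch before applying anything, a small cross-check script can compare the two manifests. This is a sketch: it assumes the DynamoGraphDeployment layout of spec.services.<component>.pvc seen in this PR's manifests and requires PyYAML.

import yaml

# Paths are the two files touched in this PR; run from the repository root.
with open("deploy/utils/manifests/vllm-cache-pvc.yaml") as f:
    pvc_name = yaml.safe_load(f)["metadata"]["name"]

with open("components/backends/vllm/deploy/disagg_planner_cache_init.yaml") as f:
    deployment = yaml.safe_load(f)

referenced = {
    svc["pvc"]["name"]
    for svc in deployment.get("spec", {}).get("services", {}).values()
    if isinstance(svc, dict) and isinstance(svc.get("pvc"), dict) and svc["pvc"].get("name")
}

print(f"PVC manifest creates: {pvc_name!r}")
print(f"Deployment references: {sorted(referenced)}")
if referenced and referenced != {pvc_name}:
    raise SystemExit("PVC name mismatch between manifests")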

@tedzhouhk
Contributor

Stupid question: if the user starts the planner with more than 1 replica for prefill/decode, will they first be scheduled, then killed, then scheduled again?

@michaelshin
Contributor Author

Stupid question: if the user starts the planner with more than 1 replica for prefill/decode, will they first be scheduled, then killed, then scheduled again?

The number of replicas will only be increased, so the initial worker won't be removed; we'll scale up to the desired number.
