Description
Feature request
Hi Dynamo developers!
We wanted to provide visibility into the near-term roadmap for the Dynamo v0.4.1 and Dynamo v0.5.0 releases.
We are making progress on five major focus areas:
- Performance
- Fault tolerance
- K8s deployment
- KV cache management and transfer
- Scheduling with smart router and planner
📅 Timeline
The target dates for the releases are below:
| Release | Target date |
|---|---|
| v0.4.1 | 8/27 |
| v0.5.0 | 9/17 |
Dynamo v0.4.1 Features
1. Performance
Develop a reproducible benchmark script for DS-R1.
2. Fault Tolerance & Observability
[High Availability] Support multiple KV routers (+frontends)
Request migration docs and E2E tests (vLLM)
bug: metrics collection timed out
- fix: replace metrics callback with background scraping to prevent tim… #2480
- fix: use tokio spawn / interval.tick(), make nats metric names clearer, fix tests sharing environment variables (temp_env) #2506
vLLM: Add a "model" label to Component metrics
k8s: Add Guide on Deploying Prometheus and Grafana in Dynamo Cloud (K8s) Deployment
fix: Frontend metrics to be renamed from nv_llm_* to dynamo_*
Create a new guide on using the MetricsRegistry APIs
fix: component metric names to be called dynamo_component_*, and labels to not collide with Kubernetes
Base metrics: NATS, drt & component grouped metrics
fix: endpoint.rs does not propagate Runtime errors back on the stack
Parameterize /health and /live endpoints
Refactor: System Server (http_server.rs) is often confused with the Dynamo frontend
- refactor: Rename HTTP server to metrics server in worker process #2318
- refactor: rename to system status server for consistency #2354
3. K8s Deployment
Grove
Grove integration: multi-node support
Implement workaround to scale components when using Grove
Inference Gateway
Dynamo integration with API Gateway - EPP customization
The processor/router unit should not queue requests in NATS and should instead return them to the Envoy shim/proxy
Update Documentation for Dynamo Inference Gateway
- fix: add instruction to deploy model with inference gateway #2257
- docs: add instruction to deploy model with inference gateway #2257 #2260
Metrics
Create a reference guide to collecting and viewing dynamo metrics in kubernetes
4. KV Cache Management & Transfer
KV Block Manager
Note: G1 = HBM, G2 = Host memory, G3 = Local disk, G4 = Remote storage
vLLM: G2-G3 offloading, onboarding, and unit tests; functionality (<2% of LMCache perf test suites)
- G2 unit tests
- Connector API implementation
- G3 onboarding
- G3 offloading
- G3 unit tests
5. Planning & Routing
Planner
SLA Planner integration support for SGLang (Dense models)
Dynamo v0.5.0 Features
1. Performance
Develop a reproducible benchmark script for DS-R1.
Ensure the top 4 most popular models are benchmarked on all three backends (SGLang, TRT-LLM, vLLM)
- Target models: Qwen 32B, Llama 70B, GPT-OSS 120B, and DS-R1
- Reproducible benchmarking guides using K8s
2. Fault Tolerance & Observability
Metrics
- Runtime, Frontend, and Engine Metrics
Logging & tracing
- E2E Request Level Tracing
Request handling
- E2E Request Cancellation
Component availability & recovery
- All components can fail and be recovered individually without restarting others
Worker availability and recovery
- Engine GPU Health Monitoring
3. K8s Deployment
Production release: Grove for Dynamo K8s deployments
- Helm charts with Grove + Dynamo operators (Dynamo Cloud platform)
Multi-tenancy
- The frontend needs to support Dynamo namespace (dynamo-NS) scoping for backend models so that deployments stay isolated
Detecting failure scenarios with DCGM controller
OME Integration
- Dynamo as a Cluster/ServingRuntime for OME
Support Dynamo namespace isolation
- feat: dynamo namespace isolation #2394
- feat: dynamo namespace isolation for backend component #2475
- feat: inject DGD id in planner env variables #2460
4. KV Cache Management & Transfer
KV Block Manager
Merge LMCache multi-connector path
Modularization: separate repo and artifacts
G4 (remote storage) support
LMCache
- KV events and KV Routing path verification
- e2e NIXL southbound integration verification
- e2e NIXL southbound performance verification
TRT-LLM: G2-G3 offloading, onboarding, and unit tests
Dynamo KVBM integration
5. Planning & Routing
Router
Separate the frontend and router so that each can be scaled independently
- Make API server, processor and router composable
- Create Frontend container (e.g. dynamo-frontend)
Allow multiple routers to serve at the same time, and support warm restarts if a router goes down
Planner
Test scaling Dynamo Planner
SLA Planner integration with TRT-LLM (Dense models)
6. Other
Multi-LoRA enablement
Guided Decoding
If there are any additional features that need to be considered or prioritized, please let us know in the comments. Thank you so much for your ongoing feedback; we will do our best to incorporate it into the Dynamo GA release in December 🙏.