# Dynamo integration with Inference Gateway

**Status**: Draft

**Authors**: [Biswa Panda](https://github.com/biswapanda)

**Category**: Architecture

**Sponsor**: Itay, Maksim, Neelay

**Required Reviewers**:

**Review Date**: [Date for review]

**Pull Request**: [Link to Pull Request of the Proposal itself]

**Implementation PR / Tracking Issue**: [Link to Pull Request or Tracking Issue for Implementation]

# Summary

This proposal outlines the integration of [Dynamo](https://docs.nvidia.com/dynamo/latest/architecture/architecture.html#high-level-architecture-and-key-benefits) with the [Gateway API Inference Extension](https://gateway-api-inference-extension.sigs.k8s.io). By mapping Inference Gateway concepts to Dynamo components, the integration enables seamless model routing, request scheduling, and centralized control of inference workloads.

## Dynamo Introduction

### Components and process boundaries
Dynamo is a modular inference system with distinct logical components:

The **Frontend** process runs the following three components and does not require accelerator resources:
1. **API Service**: The entry point for OpenAI-compatible requests

2. **Processor**: Handles tokenization and preprocessing

3. **Router**: Makes scheduling decisions based on KV metrics

The **Backend** process requires accelerator resources and hosts the `Worker` component.

4. **Workers**: Backend components responsible for managing LLM engines running on accelerators.
Workers execute inference tasks using underlying engines such as vLLM, TRT-LLM, or SGLang.


Each Dynamo graph deployment creates Kubernetes Deployments that manage each component's pods.
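To make the process boundary concrete, here is a loose sketch of how a graph might declare its components; the kind and every field name below are hypothetical, not the actual Dynamo CRD schema:

```yaml
# Hypothetical sketch only -- not the actual Dynamo CRD schema.
# Frontend (API service + Processor + Router) is CPU-only; Workers are GPU-bound.
kind: DynamoGraph            # hypothetical kind
services:
  Frontend:
    replicas: 2
    resources:
      cpu: "4"               # no accelerator required
  VllmWorker:
    replicas: 4
    resources:
      gpu: "1"               # hosts the LLM engine on an accelerator
```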

* Dynamo Graph (logical view)
![Dynamo Graph (logical)](./dynamo_graph_logical_2.png)

* Dynamo Graph (deployment view)
![Dynamo Graph (deployment)](./dynamo_graph_deployment.png)

| Module | Dynamo | IGW |
| :---- | :---- | :---- |
| **Service/Data Plane** | Custom NATS/TCP, JSON-based protocol supporting asynchronous, message-based data flow | gRPC/HTTP-based request/reply protocol |
| **Event Plane** | KV/capacity-related metric events are published over NATS (push-based) | Scrapers populate the Datastore with per-pod metrics (pull-based) |
| **Control Plane** | The Planner is responsible for scaling decisions; orchestration happens via the operator | todo: need information |


### Disaggregated Serving
In Dynamo's disaggregated serving, the initial request is handled by the decode stage, and if prefill is required, it is performed remotely by the Prefill worker.

Dynamo does not currently support co-scheduling of Prefill/Decode workers or a Prefill -> Decode request flow.


### Communication
Dynamo uses a custom protocol for inter-component communication, based on NATS and two-part JSON messages.
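As an illustration, a two-part message can be thought of as a control part plus a data part; the field names below are hypothetical and do not describe Dynamo's actual wire format:

```yaml
# Hypothetical two-part message, published on a NATS subject.
# Part 1: control/routing metadata
control:
  request_id: "r-123"
  worker_id: "worker-0"
# Part 2: payload
data:
  token_ids: [101, 2023, 2003]
  sampling: {max_tokens: 128, temperature: 0.7}
```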


## IGW (Inference Gateway) Introduction

1. **Model Aware Routing**
IGW enables traffic management across multiple base and LoRA models.

2. **Request scheduling**
IGW schedules requests at the `endpoint picker` extension (EPP) based on configurable policies.
The EPP routes each request to the best LLM worker instance using runtime statistics and request inputs (e.g., SLOs).

3. **Centralized control**
Enables centralized management of auth, RBAC, rate limiting, usage tracking, etc. in the Gateway layer.

## Goals

* Map Inference Gateway concepts to Dynamo
* Maintain backward compatibility with existing EPP functionality
* Extend IGW to use Dynamo router
* Minimize network hops

### Non Goals

* Replace existing EPP internal scheduling
* Modify core Gateway API specifications
* Change existing Dynamo worker interfaces significantly
* LoRA support in Dynamo

## Guiding Principles

1. **Composability**: EPP should externalize scheduling decisions to Dynamo router
2. **DRY**: Aim to reduce duplications in preprocessing steps (tokenization, prompt template application)
3. **Compatibility**: Maintain full compatibility with Inference Gateway API
4. **Reduce network hops** to minimize tail latency
5. **EPP extends Dynamo's Router**: The EPP delegates scheduling decisions to the Dynamo router, preserving Dynamo's scheduling logic

## Constraints
- Dynamo components (Processor, Router) use Dynamo native transport (two-part JSON messages over NATS)
- Dynamo does not support co-scheduling in disaggregated mode; currently the request flows from decode to prefill.
- An EPP is associated with a [single InferencePool](https://gateway-api-inference-extension.sigs.k8s.io/api-types/inferencepool)


## Problems
1. EPP scheduling is currently tightly coupled with in-process preprocessing, which is hard to maintain across different models.

2. Double tokenization occurs across the scheduling and service paths.

3. Dynamo uses a custom protocol for communication between the frontend and workers.


## Requirements

### REQ 1 External Processing Integration

Dynamo EPP (Endpoint Picker) **MUST** support scheduling requests in Dynamo while maintaining the existing ext-proc interface.

### REQ 2 Unified Dynamo deployment

Dynamo EPP and components (Frontend, Workers) **MUST** be deployable within Kubernetes through a unified Helm chart to maintain version compatibility.
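A unified chart could pin all images to one release train; the values below are a hypothetical sketch, not an actual chart schema:

```yaml
# Hypothetical values.yaml -- keys and image names are illustrative.
epp:
  image: example.com/dynamo/epp:v0.3.0
  frontendSidecar:
    enabled: true
    image: example.com/dynamo/frontend:v0.3.0  # same release train as the EPP
worker:
  image: example.com/dynamo/vllm-worker:v0.3.0
  replicas: 4
nats:
  enabled: true   # request plane between Frontend, Processor, Router, Workers
```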

### REQ 3 Maintain compatibility with Inference Gateway protocols

Dynamo EPP **MUST** be compatible with the Inference Gateway API and its concepts (InferencePool, InferenceModel).

# Proposal

## Architecture Overview
> **Reviewer:** Alt 2 is more aligned with IGW's design principles and composes well with the Kubernetes Gateway API.
>
> Different base models are assumed to be deployed independently as separate pools. The EPP is thought of as a request scheduler on a pool: it assumes that all endpoints are capable of serving the request, and its role is to pick the best one, so routing across different base models is assumed to be done at an earlier stage.

> **Author:** Is routing across different base models done by enabling the body-based router, or are we planning to integrate this as a core feature of IGW?

> **Reviewer:** Some context before answering your question:
>
> Context 1: The body-based router (BBR) is an extension that only reads the model name from the body and writes it to a known header; it doesn't do actual routing per se. Once the model name is in a header, a user can use HTTPRoute rules to express the intent to route requests with that header to different pools. This is basic k8s Gateway functionality. The reason to write the model name to a header is that HTTPRoute doesn't support rules with matchers against the request body; it can only match against headers.
>
> Context 2: We define IGW as a combination of two things: a proxy + ext-proc extensions. As part of the IGW project, we developed two extensions: the BBR described above and the EPP, the endpoint picker extension (we also call it the inference request scheduler). BBR is invoked early in the Envoy filter chain, before selecting the pool to route the request to. The EPP is invoked later in the chain, after the pool is selected but before picking the endpoint. The EPP is the component that does the "smart" routing.
>
> So to answer "are we planning to integrate this [routing across different base models] as a core feature of IGW": yes, if you look at IGW as the composition of the features above.
>
> If the question is "can a single pool host multiple base models, and can the EPP route requests across different base models within the pool?", then this is still an open question; we lack the motivating use case to consider it (vs. having the layer above, HTTPRoute, do it). More concretely: is there any smart routing we may want to apply when selecting the base model, or is base-model selection predetermined (and hence expressible as header matching)? If there is a strong case for the former, the EPP can certainly expand to support base-model routing.
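For concreteness, the BBR pattern described above reduces to a plain HTTPRoute header match; the header name and resource names below are illustrative (BBR's actual header may differ):

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: route-by-base-model
spec:
  parentRefs:
    - name: inference-gateway
  rules:
    - matches:
        - headers:
            - name: X-Gateway-Model-Name   # header injected by BBR (illustrative)
              value: gemma
      backendRefs:
        - group: inference.networking.x-k8s.io
          kind: InferencePool
          name: gemma-pool
```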


## Alt 1: Entire Dynamo Graph Deployment as a blackbox

The EPP routes requests to a Frontend pod/sidecar. Two variants:

**Shared Frontend**
- Multiple models in the same k8s namespace
- A single shared Frontend

![Shared Frontend](./blackbox/alt_1_dyn_a.png)

**Frontend per graph deployment**
- Multiple models in the same or different k8s namespaces
- Each model has its own dedicated Frontend
- Uses BBR to inject the model-name header

![Frontend per graph deployment](./blackbox/alt_1_dyn_b.png)

## Alt 2: Dynamo EPP integration with Router

![Architecture Diagram](./dyn_alt_2.png)

### Data flow

1. The client sends an HTTP inference request to the Gateway.
2. The Gateway receives the request and extracts the model name and relevant metadata.
It consults the InferenceModel configuration to determine the Inference Pool (Dynamo graph) to route the request to.
3. The Gateway calls the EPP over gRPC for worker scheduling, using the Envoy ext_proc protocol.
4. The EPP forwards the request to the Frontend sidecar:
```yaml
Request:
- req header: set x-routing-request: true
- req body: original request body (for example, a chat completion request)

Response:
worker_id: Dynamo-specific worker id
token_ids: (Optional) tokens generated by the processor step
```

> **Reviewer:** Is the expectation that the EPP will buffer the request and forward it in one go to the frontend?
5. The Dynamo Frontend accepts the OpenAI-compatible request and forwards it through the Dynamo request plane (NATS).
6. The Dynamo Processor performs the necessary pre-processing and generates tokens. It calls the Router to decide the worker_id.
7. The Dynamo Router takes the tokens as input and decides the worker_id based on scheduling policies and KV metrics.
8. The EPP sets headers (`x-gateway-destination-endpoint` and `x-gateway-worker-id`).
Optional optimization: the tokens can be injected into the request body to avoid recomputing them on the service path (an illustrative body is sketched after this list).
Note: a `tokens` key in the request body is not OpenAI-compatible.
```
Set Req Header:
- `x-gateway-destination-endpoint`: worker address of the Dynamo frontend pod
- `x-gateway-worker-id`: Dynamo worker id of Backend LLM worker instance

Add to Req Body (Optional):
- `tokens`
```
9. IGW forwards the request to the appropriate Dynamo frontend based on the `x-gateway-destination-endpoint` request header.
Note: ideally this could be routed to the Frontend Service, because the Frontend/Processor deployment is decoupled from the LLM workers.

10. The Processor skips work already done during scheduling:
- `tokens` present in the request body: skip the pre-processing step
- `x-gateway-worker-id` present in the request: skip the call to the Router

11. The request is sent to the LLM Backend and the response is streamed back through:
- Processor: post-processing steps
- Frontend: reshapes the response from Dynamo-native to an OpenAI-compatible response
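As referenced in step 8, here is a sketch of the optionally enriched request body; the token values are made up, and the `tokens` key is, as noted above, not OpenAI-compatible:

```yaml
# Illustrative chat completion body after the EPP's optional enrichment.
model: gemma
messages:
  - role: user
    content: "What is disaggregated serving?"
tokens: [2, 1841, 603, 1104, 9437, 235336]   # injected by the EPP; values illustrative
```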
> **Reviewer:** Can this logic be encapsulated into a sidecar deployed with the model server instead of having these deployed as central components in the service path?

> **Reviewer:** Note that as part of the ext-proc protocol, the EPP will get the response back, and so it can also forward it to a frontend instance deployed as a sidecar to the EPP.

> **Author (@biswapanda, Jul 17, 2025):** This is a good point. My initial thought was that this would couple CPU and accelerator resources tightly and we wouldn't be able to scale them independently. Both AIBrix and llm-d take a similar approach of colocating the HTTP server with the model server (Backend).

> **Author (@biswapanda, Jul 18, 2025):** @ahg-g, this makes sense. The EPP can forward the request to a frontend [HTTP frontend + processor + router] sidecar, which can live in one of these pods:
>
> 1. Frontend sidecar in the EPP: this option is similar in spirit to Alt 1 (entire Dynamo graph deployment as a blackbox).
> 2. Frontend sidecar in the model service pod: frontend CPU/memory resources are coupled with the model service. Each model service's frontend sidecar will have a predictable load, so it might be a better option for horizontal scaling. (Side note: we can further optimize by using local Unix sockets/ZMQ instead of NATS between the frontend and the model service.)
>
> Also, is the EPP considered stateful?

> **Reviewer:** If the frontend logic doesn't require global state, then moving it to run beside the model service pod is the more scalable option.
>
> On whether the EPP is stateful: currently it maintains an "approximate" in-memory prefix cache populated from previous requests to perform prefix-aware routing. Performance would suffer if that state were lost, but requests would continue to be routed and served.


**Notes:**
- All inter-component communication within Dynamo (Processor, Router, Workers) uses NATS with two-part JSON messages.
- Deployment is unified via a single Helm chart for version compatibility.

### Mapping Inference Pool/Model with Dynamo
1. There would be a 1:1 mapping between an Inference Pool, a Dynamo graph deployment, and an EPP deployment.
Reasoning: a Dynamo graph represents a cohesive deployment unit with compute resources, so each Dynamo graph deployment should correspond to one Inference Pool.

This is the view from the IGW perspective:
```
┌─────────────────┐   ┌──────────────────┐   ┌─────────────────┐
│ Inference Model │   │ Inference Model  │   │ Inference Model │
│     (lora1)     │   │     (lora2)      │   │     (gemma)     │
└─────────┬───────┘   └─────────┬────────┘   └─────────┬───────┘
          └─────────────────────┼──────────────────────┘
                                │ N:1
                   ┌────────────▼────────────┐
                   │     Inference Pool      │
                   └────────────┬────────────┘
                                │ 1:1
                   ┌────────────▼────────────┐
                   │       Dynamo EPP        │
                   └────────────┬────────────┘
                                │ 1:1
                   ┌────────────▼────────────┐
                   │ Dynamo Graph Deployment │
                   └─────────────────────────┘
```

2. The EPP has a 1:1 relationship with the Inference Pool and is responsible for scheduling decisions within a Dynamo graph.

3. An Inference Model maps user-facing model names to backend implementations. Multiple Inference Models can refer to the same Inference Pool (Dynamo graph).
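A sketch of the corresponding IGW objects, assuming the v1alpha2 API types; the names, labels, and port below are illustrative:

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: dynamo-graph-pool        # 1:1 with one Dynamo graph deployment
spec:
  targetPortNumber: 8000
  selector:
    app: dynamo-worker
  extensionRef:
    name: dynamo-epp             # the EPP service, 1:1 with this pool
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: gemma
spec:
  modelName: gemma               # user-facing model name from the request body
  poolRef:
    name: dynamo-graph-pool      # N:1 -- many InferenceModels, one pool
```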


```mermaid
erDiagram
%% IGW Entities
Gateway
HTTPRoute
InferenceModel
InferencePool
EPP

%% Dynamo Entities
DynamoGraph
Frontend
Worker

%% IGW Relationships
Gateway ||--o{ HTTPRoute : "routes traffic via"
HTTPRoute ||--o{ InferenceModel : "matches model name to"
InferenceModel ||--|| InferencePool : "references backend"
InferencePool ||--|| EPP : "delegates scheduling to"
EPP ||--o{ Worker : "selects worker instances"

%% Dynamo Relationships
DynamoGraph ||--|| Frontend : "has single entrypoint"
DynamoGraph ||--o{ Worker : "contains multiple workers"

%% Cross-System Mapping
InferencePool ||--|| DynamoGraph : "maps to"
EPP ||--|| Frontend : "deployed as sidecar"

%% Internal Dynamo Communication
Frontend ||--o{ Worker : "routes to selected worker"
```

### Decision Points

#### 1. EPP integration with Dynamo: plugin vs sidecar vs external callout service

![EPP integration with Dynamo](./sidecar_vs_ext_svc.png)
##### Sidecar container (Preferred)
Requires support in the EPP for deploying a sidecar container and specifying the port to send requests to (a deployment sketch follows the cons below).

**Pros**
- Reduced network hops: Direct communication between EPP and Dynamo components within the same pod
- Lower latency: No network overhead for inter-component communication
- Simpler deployment and management: Deployed as a single unit, easier to manage lifecycle

**Cons**
- Tightly coupled scaling: Scaling decisions for EPP and Frontend are coupled
- Deployment of the EPP is coupled with the Dynamo sidecar image; version upgrades must be done in sync.
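A sketch of the preferred sidecar option under the assumptions above; the image names, ports, and flag are hypothetical:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dynamo-epp
spec:
  replicas: 1
  selector:
    matchLabels: {app: dynamo-epp}
  template:
    metadata:
      labels: {app: dynamo-epp}
    spec:
      containers:
        - name: epp
          image: example.com/dynamo/epp:v0.3.0       # ext-proc gRPC server
          ports: [{containerPort: 9002}]
          args: ["--frontend-addr=localhost:8000"]   # hypothetical flag
        - name: dynamo-frontend                      # Frontend sidecar
          image: example.com/dynamo/frontend:v0.3.0  # API svc + Processor + Router
          ports: [{containerPort: 8000}]
```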

##### External callout service
**Pros**
- Completely isolated deployments
- Each component can be deployed and scaled independently

**Cons**
- Additional network hops: More latency due to network communication between services
- Service discovery complexity: Need to manage service endpoints and load balancing
- Additional network failure points in the request path

##### Plugin
**Pros**
- Minimum number of network hops
- Simpler architecture without additional layer
- Lower latency for request processing

**Cons**
- The Dynamo runtime/components don't have native Golang integration
- Hard to scale across models
- Tight coupling with Golang-based implementation


## Current state of IGW and Dynamo

### Inference Gateway Request Flow
```
 HTTP Request
      │
      ▼
┌─────────────┐   Extract model name    ┌──────────────────┐
│   Gateway   │ ───────────────────────►│  InferenceModel  │
│ (HTTPRoute) │                         │  (Model Config)  │
└─────────────┘                         └──────────────────┘
      │                                          │
      │ Route to backend                         │ References
      ▼                                          ▼
┌─────────────┐   Smart routing via     ┌──────────────────┐
│InferencePool│ ◄─────────────────────  │ Endpoint Picker  │
│  (Compute)  │      EPP extension      │  Extension (EPP) │
└─────────────┘                         └──────────────────┘
      │
      ▼
┌─────────────┐
│ Model Server│
│    Pods     │
└─────────────┘
```

## Deferred to Implementation
- Fallback mechanisms for failures
- Metrics and observability integration


## Follow up questions

1. Is the EPP considered stateless? How do we achieve EPP HA/fault tolerance?
The Dynamo Router is stateful (in-memory prefix tree).
Coupling the Dynamo frontend as a sidecar with the EPP might bring scaling challenges.

2. Multiple models: InferenceModel and body-based routing.
Can we enable body-based routing as a stage in the EPP, or compose it at the IGW level?

# Related Proposals
* [Gateway API Inference Extension Documentation](https://gateway-api-inference-extension.sigs.k8s.io/)
* [Envoy External Processing Filter](https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/ext_proc_filter)
* [Gateway API Specification](https://gateway-api.sigs.k8s.io/)