|
| 1 | +# Dynamo integration with Inference Gateway |
| 2 | + |
| 3 | +**Status**: Draft |
| 4 | + |
| 5 | +**Authors**: [Biswa Panda](https://github.com/biswapanda) |
| 6 | + |
| 7 | +**Category**: Architecture |
| 8 | + |
| 9 | +**Replaces**: [Link of previous proposal if applicable] |
| 10 | + |
| 11 | +**Replaced By**: [Link of previous proposal if applicable] |
| 12 | + |
| 13 | +**Sponsor**: [Name of code owner or maintainer to shepherd the process] |
| 14 | + |
| 15 | +**Required Reviewers**: [Names of technical leads that are required for acceptance] |
| 16 | + |
| 17 | +**Review Date**: [Date for review] |
| 18 | + |
| 19 | +**Pull Request**: [Link to Pull Request of the Proposal itself] |
| 20 | + |
| 21 | +**Implementation PR / Tracking Issue**: [Link to Pull Request or Tracking Issue for Implementation] |
| 22 | + |
| 23 | +# Summary |
| 24 | + |
| 25 | +This proposal outlines the integration of Dynamo components with the Gateway API Inference Extension. |
| 26 | + |
| 27 | +The current Inference Gateway is tightly coupled with the model's tokenizer. However, several use cases require: |
| 28 | +1. **External Tokenization**: Preprocessing requests outside the gateway for specialized tokenization logic |
| 29 | +2. **KV-Aware Routing**: Intelligent routing based on prefix cache status and token analysis |
| 30 | +3. **Flexible side channel to offload tokens**: Support for both external cache and direct token passing strategies. This helps when transferring large blobs of tokens for VLMs (image/audio/video tokens) |
| 31 | +4. **Unified Dynamo Architecture**: Consolidated deployment model for all processing components |
| 32 | + |
| 33 | +## Terminology & Definitions |
| 34 | + |
| 35 | +| Term | Definition | |
| 36 | +| :---- | :---- | |
| 37 | +| **Dynamo EPP** | Enhanced Endpoint Picker Protocol service with Dynamo integration | |
| 38 | +| **Dynamo Processor** | Dynamo component responsible for request tokenization and preprocessing | |
| 39 | +| **Dynamo Router** | Dynamo component responsible for the KV-aware routing strategy |
| 40 | +| **Token Cache / Side Channel** | External storage system for tokenized requests |
| 41 | + |
| 42 | +## Acronyms & Abbreviations |
| 43 | + |
| 44 | +**EPP:** Endpoint Picker Protocol |
| 45 | +**IGW:** Inference Gateway |
| 46 | + |
| 47 | +## Goals |
| 48 | + |
| 49 | +* Integrate Dynamo Processor for request preprocessing and tokenization |
| 50 | +* Enable KV-aware routing through Dynamo Router Service |
| 51 | +* Support flexible token management (cache keys vs direct values) |
| 52 | +* Provide unified deployment architecture for all Dynamo components |
| 53 | +* Maintain backward compatibility with existing EPP functionality |
| 54 | + |
| 55 | +### Non Goals |
| 56 | + |
| 57 | +* Replace existing EPP internal scheduling completely |
| 58 | +* Modify core Gateway API specifications |
| 59 | +* Change existing worker pod interfaces significantly |
| 60 | + |
| 61 | +## Requirements |
| 62 | + |
| 63 | +### REQ 1 External Processing Integration |
| 64 | + |
| 65 | +Dynamo EPP (Endpoint picker) **MUST** support calling LLM processors for request preprocessing and tokenization while maintaining the existing ext-proc interface. |
| 66 | + |
| 67 | +### REQ 2 Flexible Routing Strategies |
| 68 | + |
| 69 | +The system **SHOULD** support both external routing (via Dynamo Router) and internal EPP scheduling based on request configuration. |
| 70 | + |
| 71 | +### REQ 3 Token offloading capability |
| 72 | + |
| 73 | +The system **SHOULD** support both external cache-based token storage and direct token value passing to worker pods. |
| 74 | + |
| 75 | +### REQ 4 Unified Dynamo Architecture |
| 76 | + |
| 77 | +Dynamo EPP and components (Processor, Router, Workers) **MUST** be deployable as a unified dynamo graph within Kubernetes. |
| 78 | + |
| 79 | +### REQ 5 Maintain compatibility with Inference Gateway protocols |
| 80 | + |
| 81 | +Dynamo EPP **MUST** remain compatible with the Inference Gateway and its existing ext-proc interface. |
| 82 | + |
| 83 | +# Proposal |
| 84 | + |
| 85 | +## Design Principles |
| 86 | + |
| 87 | +## Architecture Overview |
| 88 | + |
| 89 | +The updated architecture unifies Inference Gateway with Dynamo Graph deployment. See architecture diagram below for detailed component interactions. |
| 90 | + |
| 91 | + |
| 92 | + |
| 93 | +## Sequence Diagram |
| 94 | + |
| 95 | +```mermaid |
| 96 | +sequenceDiagram |
| 97 | + participant Client |
| 98 | + participant IGW as Inference Gateway<br/>(Envoy/kGateway) |
| 99 | + participant EPP as EPP Service<br/>(ext-proc/Endpoint Picker) |
| 100 | + participant ExtProcessor as (Dynamo) External LLM<br/>Processor |
| 101 | + participant Router as (Dynamo) Router |
| 102 | + participant TokenCache as External Token<br/>Cache/Side-channel |
| 103 | + participant Worker as (Dynamo) Worker<br/>Pod |
| 104 | +
|
| 105 | + Note over Client,Worker: Token Handling & Routing Strategies |
| 106 | +
|
| 107 | + %% Client Request |
| 108 | + Client->>IGW: POST /v1/chat/completions<br/>{"model": "llama-instruct",<br/> "messages": [...]<br/> } |
| 109 | +
|
| 110 | + IGW->>EPP: ext-proc: RequestHeaders |
| 111 | + EPP->>EPP: Parse model name from request<br/>Set X-Gateway-Model-Name header |
| 112 | + IGW->>EPP: ext-proc: RequestBody |
| 113 | +
|
| 114 | + %% Scenario 1: route=true (External routing via Router Service) |
| 115 | + alt route=true |
| 116 | + EPP->>ExtProcessor: POST /process<br/>{"request_body": {<br/> "model": "llama-instruct",<br/> "messages": [...],<br/> "route": true<br/> },<br/> "headers": {"x-request-id": "req-123"}} |
| 117 | + |
| 118 | + ExtProcessor->>ExtProcessor: Tokenize prompt (always)<br/>Generate token_ids: [1, 15043, 29892, ...] |
| 119 | + |
| 120 | + ExtProcessor->>Router: POST /route<br/>{"token_ids": [1, 15043, 29892, ...]} |
| 121 | + |
| 122 | + Router->>Router: Apply KV aware routing:<br/>- Check prefix cache<br/>- Apply custom routing strategy<br/>- Select optimal worker |
| 123 | + |
| 124 | + Router-->>ExtProcessor: Worker selection:<br/>{"worker_address": "worker-3:8080"} |
| 125 | + |
| 126 | + %% Token storage decision |
| 127 | + alt Using External Cache |
| 128 | + ExtProcessor->>TokenCache: Store tokens<br/>Key: "cache_key_abc123"<br/>Value: [1, 15043, 29892, ...] |
| 129 | + |
| 130 | + ExtProcessor-->>EPP: Response with token_key:<br/>{"worker_address": "worker-3:8080",<br/> "token_key": "cache_key_abc123"} |
| 131 | + |
| 132 | + EPP->>EPP: Set x-gateway-destination-endpoint: "worker-3:8080"<br/>Set routing metadata |
| 133 | + EPP->>EPP: Prepare headers:<br/>x-req-tokens-key: "cache_key_abc123" |
| 134 | + |
| 135 | + else Direct Token Values |
| 136 | + ExtProcessor-->>EPP: Response with token_value:<br/>{"worker_address": "worker-2:8080",<br/> "token_value": "[1,15043,29892,...]"} |
| 137 | + |
| 138 | + EPP->>EPP: Set x-gateway-destination-endpoint: "worker-2:8080"<br/>Set routing metadata |
| 139 | + EPP->>EPP: Modify request body:<br/>Add "token_ids": [1,15043,29892,...] |
| 140 | + end |
| 141 | + |
| 142 | + %% Scenario 2: route not specified (Internal routing) |
| 143 | + else route not specified |
| 144 | + Note over EPP: EPP schedules worker pods<br/>using internal logic |
| 145 | + |
| 146 | + EPP->>ExtProcessor: POST /process<br/>{"request_body": {<br/> "model": "llama-instruct",<br/> "messages": [...]<br/> },<br/> "headers": {"x-request-id": "req-123"}} |
| 147 | + |
| 148 | + ExtProcessor->>ExtProcessor: Tokenize prompt (always) |
| 149 | + |
| 150 | + %% Allow external cache for internal routing too |
| 151 | + alt Store in External Cache |
| 152 | + ExtProcessor->>TokenCache: Store tokens<br/>Key: "cache_key_xyz789"<br/>Value: [1, 15043, 29892, ...] |
| 153 | + |
| 154 | + ExtProcessor-->>EPP: Response with token_key:<br/>{"token_key": "cache_key_xyz789"} |
| 155 | + |
| 156 | + EPP->>EPP: Schedule worker pods:<br/>- Apply internal scheduling<br/>- Check pod availability<br/>- Select: worker-pool-1:8080 |
| 157 | + |
| 158 | + EPP->>EPP: Set x-gateway-destination-endpoint: "worker-pool-1:8080"<br/>Set routing metadata |
| 159 | + EPP->>EPP: Prepare headers:<br/>x-req-tokens-key: "cache_key_xyz789" |
| 160 | + |
| 161 | + else Direct Token Values |
| 162 | + ExtProcessor-->>EPP: Response with tokens:<br/>{"token_value": "[1,15043,29892,...]"} |
| 163 | + |
| 164 | + EPP->>EPP: Schedule worker pods:<br/>- Apply internal scheduling<br/>- Select: worker-pool-1:8080 |
| 165 | + |
| 166 | + EPP->>EPP: Set x-gateway-destination-endpoint: "worker-pool-1:8080"<br/>Set routing metadata |
| 167 | + EPP->>EPP: Modify request body:<br/>Add "token_ids": [1,15043,29892,...] |
| 168 | + end |
| 169 | + end |
| 170 | +
|
| 171 | + EPP-->>IGW: ext-proc Response<br/>Header: x-gateway-destination-endpoint<br/>Header: x-req-tokens-key (if using cache)<br/>Modified request body (if direct tokens) |
| 172 | +
|
| 173 | + %% Request forwarding |
| 174 | + IGW->>Worker: HTTP Request to selected worker<br/>Header: x-gateway-destination-endpoint<br/>Header: x-req-tokens-key (if applicable)<br/>Body: includes token_ids (if direct) |
| 175 | +
|
| 176 | + alt Worker receives token_key in header |
| 177 | + Worker->>TokenCache: Fetch tokens<br/>Key from header: x-req-tokens-key |
| 178 | + TokenCache-->>Worker: Token array: [1,15043,29892,...] |
| 179 | + else Worker receives token_ids in body |
| 180 | + Worker->>Worker: Use token_ids directly<br/>from request body |
| 181 | + end |
| 182 | +
|
| 183 | + Worker->>Worker: LLM Inference with tokens |
| 184 | + Worker-->>IGW: Response<br/>{"choices": [...], "usage": {...}} |
| 185 | +
|
| 186 | + IGW-->>Client: Final Response |
| 187 | +``` |
| 188 | + |
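The EPP steps in the sequence diagram above can be sketched as a single function that turns the Processor's response into header and body changes. This is a minimal sketch, not the actual EPP implementation: the function name `apply_processor_response` and the `fallback_endpoint` parameter (standing in for the EPP's internal scheduling result) are assumptions for illustration. The header names and JSON fields are the ones shown in the diagram.

```python
import json
from typing import Any

DEST_HEADER = "x-gateway-destination-endpoint"
TOKENS_KEY_HEADER = "x-req-tokens-key"


def apply_processor_response(
    request_body: dict[str, Any],
    processor_response: dict[str, Any],
    fallback_endpoint: str,
) -> tuple[dict[str, str], dict[str, Any]]:
    """Return (headers_to_set, request_body_to_forward).

    Hypothetical helper: fallback_endpoint stands in for whatever the
    EPP's internal scheduler selected when no external route is given.
    """
    # With route=true the Processor response carries worker_address;
    # otherwise the internally scheduled endpoint is used.
    endpoint = processor_response.get("worker_address", fallback_endpoint)
    headers = {DEST_HEADER: endpoint}
    body = dict(request_body)  # shallow copy, the caller's dict is untouched

    if "token_key" in processor_response:
        # External cache: only the key travels, as a request header.
        headers[TOKENS_KEY_HEADER] = processor_response["token_key"]
    elif "token_value" in processor_response:
        # Direct values: the diagram shows token_value as a JSON-encoded
        # string, so it is decoded and added to the body as token_ids.
        body["token_ids"] = json.loads(processor_response["token_value"])
    return headers, body
```

For example, a response of `{"worker_address": "worker-3:8080", "token_key": "cache_key_abc123"}` produces two headers and leaves the body unchanged, while a response containing only `token_value` leaves routing to the fallback endpoint and adds `token_ids` to the body.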
| 189 | +# Implementation Details |
| 190 | + |
| 191 | +## Key Components |
| 192 | + |
| 193 | +### Dynamo EPP (ext-proc) |
| 194 | +- Integrates with Gateway via ext-proc protocol |
| 195 | +- Parses model names and sets `X-Gateway-Model-Name` header |
| 196 | +- Calls External LLM Processor for tokenization |
| 197 | +- Handles both external and internal routing strategies |
| 198 | +- Manages token key/value header and body modifications |
| 199 | + |
| 200 | +### Dynamo Processor |
| 201 | +- Performs request tokenization |
| 202 | +- Supports both routing modes (external via Router, internal via EPP) |
| 203 | +- Manages token transfer strategies (cache vs direct) |
| 204 | +- Returns the worker selection and a request that is agnostic to the Dynamo backend framework (vLLM, TensorRT-LLM, SGLang) |
| 205 | + |
| 206 | +### Dynamo Router Service |
| 207 | +- Implements KV-aware routing algorithms |
| 208 | +- Analyzes token_ids for optimal worker selection based on prefix cache |
| 209 | +- Called only when `route=true` is specified |
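To illustrate what "optimal worker selection based on prefix cache" could mean, here is a toy sketch that picks the worker whose cached token sequences share the longest prefix with the incoming `token_ids`. It is an assumption-laden simplification: the `cached_prefixes` mapping and both function names are hypothetical, and a real router would likely also weigh load and other signals that this proposal does not specify.

```python
from typing import Optional, Sequence


def shared_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of leading tokens that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def select_worker(
    token_ids: Sequence[int],
    cached_prefixes: dict[str, list[list[int]]],
) -> Optional[str]:
    """Pick the worker with the longest cached prefix match.

    cached_prefixes (hypothetical) maps a worker address to the token
    sequences assumed to be in that worker's prefix cache.
    Returns None when no workers are known.
    """
    best_worker, best_len = None, -1
    # Sorted iteration makes ties deterministic: the first address wins.
    for worker in sorted(cached_prefixes):
        longest = max(
            (shared_prefix_len(token_ids, p) for p in cached_prefixes[worker]),
            default=0,
        )
        if longest > best_len:
            best_worker, best_len = worker, longest
    return best_worker
```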
| 210 | + |
| 211 | +### Dynamo Worker Pods |
| 212 | +- Perform LLM inference with preprocessed tokens |
| 213 | +- Support both token retrieval methods (cache keys, direct values) |
| 214 | +- Maintain compatibility with existing worker interfaces |
| 215 | +- Expose an HTTP endpoint for direct integration with the Inference Gateway |
| 216 | + |
| 217 | +### Token Cache / Side channel |
| 218 | +- External storage system that provides a key/value store interface to transfer token_ids from the Processor to the worker |
| 219 | +- Stores tokenized data with generated keys |
| 220 | +- Enables efficient token sharing between components |
| 221 | +- Optional component (direct token passing also supported) |
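The key/value interface described above could look roughly like the following. This is only a sketch: the `TokenCache` protocol, its `put`/`get` method names, and the in-memory implementation are assumptions for illustration, since the concrete backend is deferred to implementation. The `cache_key_` prefix mirrors the example keys in the sequence diagram.

```python
import uuid
from typing import Optional, Protocol


class TokenCache(Protocol):
    """Hypothetical interface between the Processor (put) and the worker (get)."""

    def put(self, token_ids: list[int]) -> str: ...

    def get(self, key: str) -> Optional[list[int]]: ...


class InMemoryTokenCache:
    """Dict-backed stand-in for an external store, for tests and local runs."""

    def __init__(self) -> None:
        self._store: dict[str, list[int]] = {}

    def put(self, token_ids: list[int]) -> str:
        # Generate a fresh key per request; the worker receives it in a header.
        key = f"cache_key_{uuid.uuid4().hex}"
        self._store[key] = list(token_ids)
        return key

    def get(self, key: str) -> Optional[list[int]]:
        tokens = self._store.get(key)
        # Return a copy so callers cannot mutate the stored entry.
        return None if tokens is None else list(tokens)
```

A production store would also need an eviction or expiry policy so that keys for completed requests do not accumulate; that choice is outside the scope of this sketch.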
| 222 | + |
| 223 | +## Configuration |
| 224 | + |
| 225 | +### Environment Variables |
| 226 | +- `EXTERNAL_LLM_PROCESSOR_ENDPOINT`: Dynamo External LLM Processor URL |
| 227 | +- `USE_EXTERNAL_LLM_PROCESSOR`: Enable/disable external pre-processing (apply prompt templates/tokenization) |
| 228 | +- `USE_EXTERNAL_LLM_ROUTER`: Enable/disable external routing (in this case it's Dynamo Router) |
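A minimal sketch of how these variables might be read at startup follows. The variable names come from the list above; everything else is an assumption: the `EppConfig` type, the set of strings accepted as "true", and the rule that enabling the external Processor requires an endpoint are illustrative choices, not specified behavior.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Assumed truthy spellings; the proposal does not define the accepted values.
_TRUE = {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class EppConfig:
    processor_endpoint: Optional[str]
    use_external_processor: bool
    use_external_router: bool


def load_config(env: Mapping[str, str] = os.environ) -> EppConfig:
    """Build an EppConfig from environment variables (hypothetical helper)."""

    def flag(name: str) -> bool:
        return env.get(name, "").strip().lower() in _TRUE

    cfg = EppConfig(
        processor_endpoint=env.get("EXTERNAL_LLM_PROCESSOR_ENDPOINT") or None,
        use_external_processor=flag("USE_EXTERNAL_LLM_PROCESSOR"),
        use_external_router=flag("USE_EXTERNAL_LLM_ROUTER"),
    )
    # Assumed validation: fail fast instead of discovering the gap per request.
    if cfg.use_external_processor and not cfg.processor_endpoint:
        raise ValueError(
            "USE_EXTERNAL_LLM_PROCESSOR is set but "
            "EXTERNAL_LLM_PROCESSOR_ENDPOINT is empty"
        )
    return cfg
```

Passing a plain dict instead of `os.environ` keeps the function easy to test.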
| 229 | + |
| 230 | +### Headers |
| 231 | +- `X-Gateway-Model-Name`: Set by EPP from parsed model name in user request's body |
| 232 | +- `x-req-tokens-key`: Token cache key (when using external cache) |
| 233 | +- `x-req-tokens-value`: Direct token values (alternative to cache) |
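On the worker side, the two token-passing paths from the sequence diagram (cache key in the `x-req-tokens-key` header, or `token_ids` in the request body) could be resolved as sketched below. The function name `resolve_token_ids` and the `fetch_from_cache` callable are hypothetical, and the error handling is one possible choice rather than specified behavior.

```python
from typing import Any, Callable, Mapping, Optional


def resolve_token_ids(
    headers: Mapping[str, str],
    body: Mapping[str, Any],
    fetch_from_cache: Callable[[str], Optional[list[int]]],
) -> list[int]:
    """Return the token ids for a request, whichever way they were passed.

    fetch_from_cache is a hypothetical callable that looks a key up in the
    token cache and returns None when the key is unknown.
    """
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    key = lowered.get("x-req-tokens-key")
    if key is not None:
        tokens = fetch_from_cache(key)
        if tokens is None:
            raise LookupError(f"no tokens stored under cache key {key!r}")
        return list(tokens)
    if "token_ids" in body:
        return list(body["token_ids"])
    raise ValueError("request carries neither x-req-tokens-key nor token_ids")
```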
| 234 | + |
| 235 | +## Deferred to Implementation |
| 236 | + |
| 237 | +- Specific token cache implementation details (Redis vs alternatives) |
| 238 | +- Fallback mechanisms for external service failures |
| 239 | +- Metrics and observability integration |
| 240 | + |
| 241 | +# Implementation Phases |
| 242 | + |
| 243 | +## Phase 1 Core Integration |
| 244 | +**Supported API / Behavior:** |
| 245 | +- External tokenization via Dynamo Processor |
| 246 | +- External scheduling/routing using Dynamo Router |
| 247 | +- Direct token value passing to workers |
| 248 | + |
| 249 | +**Not Supported:** |
| 250 | +- External cache-based token passing |
| 251 | + |
| 252 | +## Phase 2 Token transfer through side channel/cache |
| 253 | +**Supported API / Behavior:** |
| 254 | +- External cache-based token passing |
| 255 | + |
| 256 | +# Related Proposals |
| 257 | +* Gateway API Inference Extension Architecture |
| 258 | +* EPP Architecture Proposal |
| 259 | +* Model Server Protocol |
| 260 | + |
| 261 | +# Alternate Solutions |
| 262 | + |
| 263 | +## Alt 1 Direct Tokenizer Integration in EPP (current EPP architecture) |
| 264 | + |
| 265 | +**Pros:** |
| 266 | +- Simpler architecture without additional layer |
| 267 | +- Lower latency for request processing |
| 268 | +- Fewer network hops |
| 269 | + |
| 270 | +**Cons:** |
| 271 | +- Less flexible for different models |
| 272 | +- Harder to maintain separation of concerns |
| 273 | + |
| 274 | +**Reason Rejected:** |
| 275 | +- Violates Gateway API integration principles |
| 276 | +- Reduces portability across models |
| 277 | +- Increases complexity/TCO by requiring a Go-based tokenizer |
| 278 | + |
| 279 | +## Alt 2 Sidecar Pattern |
| 280 | +- TODO |
| 281 | + |
| 282 | +## References |
| 283 | + |
| 284 | +* [Gateway API Inference Extension Documentation](https://gateway-api-inference-extension.sigs.k8s.io/) |
| 285 | +* [Envoy External Processing Filter](https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/ext_proc_filter) |
| 286 | +* [Gateway API Specification](https://gateway-api.sigs.k8s.io/) |