Commit d384dad

committed
draft
1 parent 41d7085 commit d384dad

2 files changed

+286
-0
lines changed

# Dynamo integration with Inference Gateway

**Status**: Draft

**Authors**: [Biswa Panda](https://github.com/biswapanda)

**Category**: Architecture

**Replaces**: [Link of previous proposal if applicable]

**Replaced By**: [Link of previous proposal if applicable]

**Sponsor**: [Name of code owner or maintainer to shepherd process]

**Required Reviewers**: [Names of technical leads that are required for acceptance]

**Review Date**: [Date for review]

**Pull Request**: [Link to Pull Request of the Proposal itself]

**Implementation PR / Tracking Issue**: [Link to Pull Request or Tracking Issue for Implementation]

# Summary

This proposal outlines the integration of Dynamo components with the Gateway API Inference Extension.

The current Inference Gateway is tightly coupled to the model's tokenizer. However, the use cases targeted here require:
1. **External Tokenization**: Preprocessing requests outside the gateway for specialized tokenization logic
2. **KV-Aware Routing**: Intelligent routing based on prefix cache status and token analysis
3. **Flexible side channel to offload tokens**: Support for both external cache and direct token passing strategies. This helps when transferring large blobs of tokens for VLMs (image/audio/video tokens)
4. **Unified Dynamo Architecture**: Consolidated deployment model for all processing components

## Terminology & Definitions

| Term | Definition |
| :---- | :---- |
| **Dynamo EPP** | Enhanced Endpoint Picker Protocol service with Dynamo integration |
| **Dynamo Processor** | Dynamo component responsible for request tokenization and preprocessing |
| **Dynamo Router** | Dynamo component responsible for the KV-aware routing strategy |
| **Token Cache / Side Channel** | External storage system for tokenized requests |

## Acronyms & Abbreviations

* **EPP:** Endpoint Picker Protocol
* **IGW:** Inference Gateway

## Goals

* Integrate Dynamo Processor for request preprocessing and tokenization
* Enable KV-aware routing through Dynamo Router Service
* Support flexible token management (cache keys vs direct values)
* Provide unified deployment architecture for all Dynamo components
* Maintain backward compatibility with existing EPP functionality

### Non Goals

* Replace existing EPP internal scheduling completely
* Modify core Gateway API specifications
* Change existing worker pod interfaces significantly

## Requirements

### REQ 1 External Processing Integration

Dynamo EPP (Endpoint Picker) **MUST** support calling LLM processors for request preprocessing and tokenization while maintaining the existing ext-proc interface.

### REQ 2 Flexible Routing Strategies

The system **SHOULD** support both external routing (via Dynamo Router) and internal EPP scheduling based on request configuration.

### REQ 3 Token offloading capability

The system **SHOULD** support both external cache-based token storage and direct token value passing to worker pods.

### REQ 4 Unified Dynamo Architecture

Dynamo EPP and components (Processor, Router, Workers) **MUST** be deployable as a unified Dynamo graph within Kubernetes.

### REQ 5 Maintain compatibility with Inference Gateway protocols

Dynamo EPP **MUST** be compatible with the Inference Gateway protocols.

# Proposal

## Design Principles

## Architecture Overview

The updated architecture unifies the Inference Gateway with the Dynamo Graph deployment. See the architecture diagram below for detailed component interactions.

![Architecture Diagram](./arch1.png)

## Sequence Diagram

```mermaid
sequenceDiagram
    participant Client
    participant IGW as Inference Gateway<br/>(Envoy/kGateway)
    participant EPP as EPP Service<br/>(ext-proc/Endpoint Picker)
    participant ExtProcessor as (Dynamo) External LLM<br/>Processor
    participant Router as (Dynamo) Router
    participant TokenCache as External Token<br/>Cache/Side-channel
    participant Worker as (Dynamo) Worker<br/>Pod

    Note over Client,Worker: Token Handling & Routing Strategies

    %% Client Request
    Client->>IGW: POST /v1/chat/completions<br/>{"model": "llama-instruct",<br/> "messages": [...]<br/> }

    IGW->>EPP: ext-proc: RequestHeaders
    EPP->>EPP: Parse model name from request<br/>Set X-Gateway-Model-Name header
    IGW->>EPP: ext-proc: RequestBody

    %% Scenario 1: route=true (External routing via Router Service)
    alt route=true
        EPP->>ExtProcessor: POST /process<br/>{"request_body": {<br/> "model": "llama-instruct",<br/> "messages": [...],<br/> "route": true<br/> },<br/> "headers": {"x-request-id": "req-123"}}

        ExtProcessor->>ExtProcessor: Tokenize prompt (always)<br/>Generate token_ids: [1, 15043, 29892, ...]

        ExtProcessor->>Router: POST /route<br/>{"token_ids": [1, 15043, 29892, ...]}

        Router->>Router: Apply KV aware routing:<br/>- Check prefix cache<br/>- Apply custom routing strategy<br/>- Select optimal worker

        Router-->>ExtProcessor: Worker selection:<br/>{"worker_address": "worker-3:8080"}

        %% Token storage decision
        alt Using External Cache
            ExtProcessor->>TokenCache: Store tokens<br/>Key: "cache_key_abc123"<br/>Value: [1, 15043, 29892, ...]

            ExtProcessor-->>EPP: Response with token_key:<br/>{"worker_address": "worker-3:8080",<br/> "token_key": "cache_key_abc123"}

            EPP->>EPP: Set x-gateway-destination-endpoint: "worker-3:8080"<br/>Set routing metadata
            EPP->>EPP: Prepare headers:<br/>x-req-tokens-key: "cache_key_abc123"

        else Direct Token Values
            ExtProcessor-->>EPP: Response with token_value:<br/>{"worker_address": "worker-2:8080",<br/> "token_value": "[1,15043,29892,...]"}

            EPP->>EPP: Set x-gateway-destination-endpoint: "worker-2:8080"<br/>Set routing metadata
            EPP->>EPP: Modify request body:<br/>Add "token_ids": [1,15043,29892,...]
        end

    %% Scenario 2: route not specified (Internal routing)
    else route not specified
        Note over EPP: EPP schedules worker pods<br/>using internal logic

        EPP->>ExtProcessor: POST /process<br/>{"request_body": {<br/> "model": "llama-instruct",<br/> "messages": [...]<br/> },<br/> "headers": {"x-request-id": "req-123"}}

        ExtProcessor->>ExtProcessor: Tokenize prompt (always)

        %% Allow external cache for internal routing too
        alt Store in External Cache
            ExtProcessor->>TokenCache: Store tokens<br/>Key: "cache_key_xyz789"<br/>Value: [1, 15043, 29892, ...]

            ExtProcessor-->>EPP: Response with token_key:<br/>{"token_key": "cache_key_xyz789"}

            EPP->>EPP: Schedule worker pods:<br/>- Apply internal scheduling<br/>- Check pod availability<br/>- Select: worker-pool-1:8080

            EPP->>EPP: Set x-gateway-destination-endpoint: "worker-pool-1:8080"<br/>Set routing metadata
            EPP->>EPP: Prepare headers:<br/>x-req-tokens-key: "cache_key_xyz789"

        else Direct Token Values
            ExtProcessor-->>EPP: Response with tokens:<br/>{"token_value": "[1,15043,29892,...]"}

            EPP->>EPP: Schedule worker pods:<br/>- Apply internal scheduling<br/>- Select: worker-pool-1:8080

            EPP->>EPP: Set x-gateway-destination-endpoint: "worker-pool-1:8080"<br/>Set routing metadata
            EPP->>EPP: Modify request body:<br/>Add "token_ids": [1,15043,29892,...]
        end
    end

    EPP-->>IGW: ext-proc Response<br/>Header: x-gateway-destination-endpoint<br/>Header: x-req-tokens-key (if using cache)<br/>Modified request body (if direct tokens)

    %% Request forwarding
    IGW->>Worker: HTTP Request to selected worker<br/>Header: x-gateway-destination-endpoint<br/>Header: x-req-tokens-key (if applicable)<br/>Body: includes token_ids (if direct)

    alt Worker receives token_key in header
        Worker->>TokenCache: Fetch tokens<br/>Key from header: x-req-tokens-key
        TokenCache-->>Worker: Token array: [1,15043,29892,...]
    else Worker receives token_ids in body
        Worker->>Worker: Use token_ids directly<br/>from request body
    end

    Worker->>Worker: LLM Inference with tokens
    Worker-->>IGW: Response<br/>{"choices": [...], "usage": {...}}

    IGW-->>Client: Final Response
```

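The EPP step at the end of both branches in the diagram (turning the processor response into header and body mutations) can be sketched as follows. This is a minimal illustration, not the proposed implementation: `Mutation`, `apply_processor_response`, and `pick_worker_internally` are hypothetical names, and the sketch assumes `token_value` is a JSON-encoded array, as the diagram's string form suggests.

```python
import json
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Mutation:
    """Hypothetical container for what the EPP hands back to the gateway."""
    headers: dict = field(default_factory=dict)
    body: Optional[dict] = None  # None means "leave the request body unchanged"


def apply_processor_response(
    request_body: dict,
    processor_response: dict,
    pick_worker_internally: Callable[[], str],
) -> Mutation:
    mutation = Mutation()

    # External routing returns worker_address; otherwise fall back to the
    # EPP's internal scheduler (passed in here as a callable placeholder).
    worker = processor_response.get("worker_address") or pick_worker_internally()
    mutation.headers["x-gateway-destination-endpoint"] = worker

    if "token_key" in processor_response:
        # Cache strategy: only the key travels with the request, as a header.
        mutation.headers["x-req-tokens-key"] = processor_response["token_key"]
    elif "token_value" in processor_response:
        # Direct strategy: token ids are added to the request body.
        # Assumption: token_value is a JSON-encoded list of integers.
        token_ids = json.loads(processor_response["token_value"])
        mutation.body = {**request_body, "token_ids": token_ids}
    return mutation
```

Keeping the worker choice and the token strategy as two independent decisions mirrors the diagram, where either token strategy can appear under either routing mode.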
# Implementation Details

## Key Components

### Dynamo EPP (ext-proc)
- Integrates with the Gateway via the ext-proc protocol
- Parses model names and sets the `X-Gateway-Model-Name` header
- Calls the External LLM Processor for tokenization
- Handles both external and internal routing strategies
- Manages token key/value header and body modifications

### Dynamo Processor
- Performs request tokenization
- Supports both routing modes (external via Router, internal via EPP)
- Manages token transfer strategies (cache vs direct)
- Returns the worker selection and a request that is agnostic to the Dynamo backend framework (vLLM/TRT-LLM/SGLang)

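A rough sketch of the Processor behavior listed above: tokenize always, call the router only when `route` is set, then pick a token transfer strategy. Everything here is an assumption made for illustration: `toy_tokenize` stands in for the real model tokenizer, `route_fn` stands in for the call to the Dynamo Router, a plain `dict` stands in for the token cache, and the key format is invented.

```python
import hashlib
import json


def toy_tokenize(text):
    """Placeholder tokenizer (NOT a real one): one pseudo-id per word."""
    return [sum(word.encode()) % 32000 for word in text.split()]


def process(request_body, route_fn=None, cache=None):
    """Build a /process response from an OpenAI-style chat request body."""
    prompt = " ".join(m["content"] for m in request_body["messages"])
    token_ids = toy_tokenize(prompt)  # tokenization happens in both modes

    response = {}
    # External routing only when the request asks for it.
    if request_body.get("route") and route_fn is not None:
        response["worker_address"] = route_fn(token_ids)

    if cache is not None:
        # Cache strategy: store the tokens and return only the key.
        digest = hashlib.sha256(json.dumps(token_ids).encode()).hexdigest()
        key = "cache_key_" + digest[:12]
        cache[key] = token_ids
        response["token_key"] = key
    else:
        # Direct strategy: return the tokens as a JSON-encoded string.
        response["token_value"] = json.dumps(token_ids, separators=(",", ":"))
    return response
```

For example, `process(body, route_fn=lambda ids: "worker-3:8080")` returns both `worker_address` and `token_value`, while `process(body, cache={})` returns only a `token_key`.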
### Dynamo Router Service
- Implements KV-aware routing algorithms
- Analyzes token_ids for optimal worker selection based on prefix cache
- Called only when `route=true` is specified

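To make "worker selection based on prefix cache" concrete, the sketch below scores each worker by the longest token prefix it shares with the incoming request. This is only an illustration of the prefix-overlap idea, not the Dynamo Router's algorithm; a real router would likely also weigh load and other signals. The `cached_prefixes` map and both function names are hypothetical.

```python
def shared_prefix_len(a, b):
    """Number of leading tokens that two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def select_worker(token_ids, cached_prefixes):
    """Pick the worker whose cached sequences overlap most with token_ids.

    cached_prefixes maps a worker address to the token sequences that the
    router believes are in that worker's prefix cache (an assumed input).
    """
    def score(worker):
        return max(
            (shared_prefix_len(token_ids, p) for p in cached_prefixes[worker]),
            default=0,
        )

    # Iterating in sorted order makes ties resolve deterministically:
    # max() keeps the first of several equally scored workers.
    return max(sorted(cached_prefixes), key=score)
```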
### Dynamo Worker Pods
- Perform LLM inference with preprocessed tokens
- Support both token retrieval methods (cache keys, direct values)
- Maintain compatibility with existing worker interfaces
- Expose an HTTP endpoint for direct integration with the Inference Gateway

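On the worker side, supporting both retrieval methods amounts to checking the `x-req-tokens-key` header first and falling back to `token_ids` in the body. The sketch below assumes a dict-like `token_cache` and a hypothetical `resolve_token_ids` helper; the error handling is a placeholder, since fallback behavior is deferred to implementation.

```python
def resolve_token_ids(headers, body, token_cache):
    """Return the token ids for a request, from the cache or from the body."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}

    key = normalized.get("x-req-tokens-key")
    if key is not None:
        if key not in token_cache:
            raise KeyError(f"token cache has no entry for {key!r}")
        return token_cache[key]

    if "token_ids" in body:
        return body["token_ids"]

    raise ValueError("request carries neither x-req-tokens-key nor token_ids")
```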
### Token Cache / Side channel
- External storage system that provides a key/value store interface to transfer token_ids from processor to worker
- Stores tokenized data with generated keys
- Enables efficient token sharing between components
- Optional component (direct token passing also supported)

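The key/value interface described above could look roughly like this in-memory stand-in. The class name, the `put`/`get` methods, the random key format, and the expiry policy are all assumptions for illustration; the proposal leaves the concrete store (Redis vs alternatives) to implementation.

```python
import time
import uuid


class InMemoryTokenCache:
    """Illustrative stand-in for the external token cache."""

    def __init__(self, ttl_seconds=60.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested
        self._entries = {}   # key -> (expires_at, token_ids)

    def put(self, token_ids):
        """Store token ids under a freshly generated key and return the key."""
        key = "cache_key_" + uuid.uuid4().hex
        self._entries[key] = (self._clock() + self._ttl, list(token_ids))
        return key

    def get(self, key):
        """Return the stored token ids, or None if missing or expired."""
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, token_ids = entry
        if self._clock() >= expires_at:
            del self._entries[key]
            return None
        return token_ids
```

An expiry is included here because entries are written once by the processor and read once by a worker, so an unbounded store would grow with every request; whether the real cache uses a TTL is an open design choice.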
## Configuration

### Environment Variables
- `EXTERNAL_LLM_PROCESSOR_ENDPOINT`: Dynamo External LLM Processor URL
- `USE_EXTERNAL_LLM_PROCESSOR`: Enable/disable external pre-processing (apply prompt templates/tokenization)
- `USE_EXTERNAL_LLM_ROUTER`: Enable/disable external routing (in this case it's Dynamo Router)

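As a sketch of how these three variables might be read at startup, the snippet below parses them into a small config object. The variable names come from the list above; the accepted boolean spellings, the defaults (everything disabled), and the validation rule are assumptions, as are the `DynamoEppConfig` and `load_config` names.

```python
import os
from dataclasses import dataclass

_TRUE_VALUES = {"1", "true", "yes", "on"}  # assumed spellings for "enabled"


@dataclass(frozen=True)
class DynamoEppConfig:
    processor_endpoint: str
    use_external_processor: bool
    use_external_router: bool


def load_config(env=None):
    """Read the EPP settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env

    def flag(name):
        return env.get(name, "false").strip().lower() in _TRUE_VALUES

    config = DynamoEppConfig(
        processor_endpoint=env.get("EXTERNAL_LLM_PROCESSOR_ENDPOINT", ""),
        use_external_processor=flag("USE_EXTERNAL_LLM_PROCESSOR"),
        use_external_router=flag("USE_EXTERNAL_LLM_ROUTER"),
    )
    # Assumed rule: enabling the processor without a URL is a config error.
    if config.use_external_processor and not config.processor_endpoint:
        raise ValueError(
            "USE_EXTERNAL_LLM_PROCESSOR is set but "
            "EXTERNAL_LLM_PROCESSOR_ENDPOINT is empty"
        )
    return config
```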
### Headers
- `X-Gateway-Model-Name`: Set by EPP from the model name parsed from the user request body
- `x-req-tokens-key`: Token cache key (when using external cache)
- `x-req-tokens-value`: Direct token values (alternative to cache)

## Deferred to Implementation

- Specific token cache implementation details (Redis vs alternatives)
- Fallback mechanisms for external service failures
- Metrics and observability integration

# Implementation Phases

## Phase 1 Core Integration
**Supported API / Behavior:**
- External tokenization via Dynamo Processor
- External scheduling/routing using Dynamo Router
- Direct token value passing to workers

**Not Supported:**
- External cache-based token passing

## Phase 2 Token transfer through side channel/cache
**Supported API / Behavior:**
- External cache-based token passing

# Related Proposals
* Gateway API Inference Extension Architecture
* EPP Architecture Proposal
* Model Server Protocol

# Alternate Solutions

## Alt 1 Direct Tokenizer Integration in EPP (current EPP architecture)

**Pros:**
- Simpler architecture without an additional layer
- Lower latency for request processing
- Fewer network hops

**Cons:**
- Less flexible for different models
- Harder to maintain separation of concerns

**Reason Rejected:**
- Violates Gateway API integration principles
- Reduces portability across models
- Increases complexity/TCO by using a Go-based tokenizer

## Alt 2 Sidecar Pattern
- TODO

## References

* [Gateway API Inference Extension Documentation](https://gateway-api-inference-extension.sigs.k8s.io/)
* [Envoy External Processing Filter](https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/ext_proc_filter)
* [Gateway API Specification](https://gateway-api.sigs.k8s.io/)