# Dynamo integration with Inference Gateway

**Status**: Draft

**Authors**: [Biswa Panda](https://github.com/biswapanda)

**Category**: Architecture

**Sponsor**: Itay, Maksim, Neelay

**Required Reviewers**:

**Review Date**: [Date for review]

**Pull Request**: [Link to Pull Request of the Proposal itself]

**Implementation PR / Tracking Issue**: [Link to Pull Request or Tracking Issue for Implementation]

# Summary

This proposal outlines the integration of [Dynamo](https://docs.nvidia.com/dynamo/latest/architecture/architecture.html#high-level-architecture-and-key-benefits) with the [Gateway API Inference Extension](https://gateway-api-inference-extension.sigs.k8s.io) to enable seamless model routing, request scheduling, and centralized control of inference workloads by mapping Inference Gateway concepts to Dynamo components.

## Dynamo Introduction

### Components and process boundaries
Dynamo is a modular inference system with distinct logical components.

The **Frontend** runs the following three components in a single process and does not require accelerator resources:
1. **API Service**: the entry point for OpenAI-compatible requests
2. **Processor**: handles tokenization and preprocessing
3. **Router**: makes scheduling decisions based on KV metrics

The **Backend** process requires accelerator resources and hosts the Worker component:

4. **Workers**: backend components responsible for managing LLM engines running on accelerators. Workers execute inference tasks using underlying engines such as vLLM, TRT-LLM, or SGLang.

Each Dynamo graph deployment creates a Kubernetes Deployment that manages the component's pods.

* Dynamo Graph (logical view)
 

* Dynamo Graph (deployment view)
 

| Module | Dynamo | IGW |
| :---- | :---- | :---- |
| **Service/Data Plane** | Custom NATS/TCP, JSON-based protocol supporting async message-based data flow | gRPC/HTTP-based request/reply protocol |
| **Event Plane** | KV/capacity-related metric events published over NATS | Scrapers populate the Datastore with per-pod metrics (pull-based) |
| **Control Plane** | Planner is responsible for scaling decisions; orchestration happens via the operator | TODO: need information |

### Disaggregated Serving
In Dynamo's disaggregated serving, the initial request is handled by the decode stage; if prefill is required, it is performed remotely by a Prefill worker.

Dynamo does not currently support co-scheduling Prefill/Decode or a Prefill -> Decode flow.

### Communication
Dynamo uses a custom protocol for inter-component communication, based on NATS and two-part JSON messages.

## IGW (Inference Gateway) Introduction

1. **Model-aware routing**
   IGW enables traffic management across multiple base and LoRA models.

2. **Request scheduling**
   IGW schedules requests according to various policies at the `endpoint picker` extension, routing each request to the best LLM worker instance based on various data sources (runtime stats) and inputs (SLOs).

3. **Centralized control**
   Enables centralized management of auth, RBAC, rate limiting, usage tracking, etc. in the Gateway layer. The sketch below shows how these concepts surface as Kubernetes resources.
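As a concrete illustration of these concepts, here is a minimal InferencePool/InferenceModel pair. This is a sketch: field names follow the v1alpha2 API surface of the Gateway API Inference Extension as best understood, and should be verified against the release actually deployed; all names are placeholders.

```yaml
# Minimal sketch of IGW resources; verify apiVersion and fields against the
# Gateway API Inference Extension release in use. Names are placeholders.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: dynamo-pool
spec:
  targetPortNumber: 8000        # port the model-serving pods listen on
  selector:
    app: dynamo-frontend        # pods backing this pool
  extensionRef:
    name: dynamo-epp            # the endpoint picker (EPP) service
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: gemma
spec:
  modelName: gemma              # user-facing model name
  criticality: Standard
  poolRef:
    name: dynamo-pool           # backend pool serving this model
```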

## Goals

* Map Inference Gateway concepts to Dynamo
* Maintain backward compatibility with existing EPP functionality
* Extend IGW to use the Dynamo router
* Minimize network hops

### Non-Goals

* Replace existing EPP internal scheduling
* Modify core Gateway API specifications
* Significantly change existing Dynamo worker interfaces
* LoRA support in Dynamo

## Guiding Principles

1. **Composability**: EPP should externalize scheduling decisions to the Dynamo router
2. **DRY**: aim to reduce duplication in preprocessing steps (tokenization, prompt template application)
3. **Compatibility**: maintain full compatibility with the Inference Gateway API
4. **Reduce network hops**: minimize tail latency
5. **EPP extends Dynamo's Router**: EPP delegates scheduling decisions to the Dynamo router, preserving Dynamo's scheduling logic

## Constraints
- Dynamo components (Processor, Router) use the Dynamo-native transport (two-part JSON messages over NATS)
- Dynamo does not support co-scheduling in disaggregated mode; currently the request flow goes from decode to prefill
- An EPP is associated with a [single InferencePool](https://gateway-api-inference-extension.sigs.k8s.io/api-types/inferencepool)

## Problems
1. EPP scheduling is currently tightly coupled with in-process preprocessing, which is hard to maintain across different models.

2. Double tokenization across the scheduling and serving paths.

3. Dynamo uses a custom protocol for communication between the frontend and workers.

## Requirements

### REQ 1: External Processing Integration

Dynamo EPP (endpoint picker) **MUST** support scheduling requests in Dynamo while maintaining the existing ext-proc interface.

### REQ 2: Unified Dynamo Deployment

Dynamo EPP and the Dynamo components (Frontend, Workers) **MUST** be deployable within Kubernetes through a unified Helm chart to maintain version compatibility.
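A sketch of what such a chart's values might pin together follows; the chart does not exist yet, so every image reference and value key below is a hypothetical placeholder.

```yaml
# values.yaml sketch for a unified chart; all images and keys are
# hypothetical placeholders, not a published chart schema.
epp:
  image: registry.example.com/dynamo-epp:v0.1.0
frontend:
  image: registry.example.com/dynamo-frontend:v0.1.0   # sidecar or standalone
worker:
  engine: vllm                                         # vllm | trtllm | sglang
  image: registry.example.com/dynamo-worker-vllm:v0.1.0
  replicas: 2
  resources:
    limits:
      nvidia.com/gpu: "1"
```

Pinning all three images in one release keeps the EPP, Frontend, and Worker versions moving in lockstep, which is the point of REQ 2.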

### REQ 3: Compatibility with Inference Gateway Protocols

Dynamo EPP **MUST** be compatible with the Inference Gateway API and its concepts (InferencePool, InferenceModel).

# Proposal

## Architecture Overview

> **Reviewer comment:** Alt 2 is more aligned with IGW's design principles and composes well with the Kubernetes Gateway API. Different base models are assumed to be deployed independently as separate pools. EPP is thought of as a request scheduler on a pool: it assumes that all endpoints are capable of serving the request, and its role is to pick the best one, so routing across different base models is assumed to be done at a stage before that.
>
> **Reviewer comment:** Giving some context before answering your question.
>
> Context 1: body-based-router (BBR) is an extension whose only job is to read the model name from the request body and write it to a known header; it doesn't do actual routing per se. Once the model name is in a header, a user can use HTTPRoute rules to express the intent to route requests based on that header to different pools. This is basic Kubernetes Gateway functionality. The reason to write the model name to a header is that HTTPRoute doesn't support creating rules with matchers against the request body; it can only match against headers.
>
> Context 2: We define IGW as a combination of two things: a proxy plus ext-proc extensions. As part of the IGW project, we developed two extensions: the BBR described above and the EPP, the endpoint picker extension (we are calling it the inference request scheduler as well). BBR is invoked early in the Envoy filter chain, before selecting the pool to route the request to. EPP is invoked later in the chain, after the pool is selected but before picking the endpoint. EPP is the component that does the "smart" routing.
>
> So if I were to answer the question "are we planning to integrate routing across different base models as a core feature of IGW": my answer is yes, if you look at IGW as the composition of the features above. If the question is "can a single pool host multiple base models, with the EPP routing requests across different base models within the pool?", then this is still an open question; we lack the motivating use case to consider that (versus having the layer above, HTTPRoute, doing it). More concretely: is there any smart routing we may want to apply when selecting the base model, or is the base model predetermined (and hence expressible as header matching)? If there is a strong case for the former, then EPP can certainly expand to support base-model routing.

## Alt 1: Entire Dynamo Graph Deployment as a Black Box

EPP routes requests to the Frontend pod/sidecar.

Variant 1: multiple models in the same Kubernetes namespace, with a shared Frontend.
 

Variant 2: multiple models (in the same or different Kubernetes namespaces), each with its own dedicated Frontend; BBR injects the model-name header (see the HTTPRoute sketch below).
 

## Alt 2: Dynamo EPP integration with Router

 
### Data flow | ||
|
||
1. The client sends an HTTP inference request to the Gateway. | ||
2. Gateway receives the request and extracts the model name and relevant metadata. | ||
Gateway consults the InferenceModel configuration to determine the inference pool (Dynamo graph) to route the request. | ||
3. Gateway calls EPP over gRPC for worker scheduling based on Envoy ext_proc protocol. | ||
4. EPP forwards the request to Frontend sidecar | ||
```yaml | ||
Request: | ||
- req header: set x-routing-request: true | ||
- req body: original request body (For example, Chat completion request) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. is the expectation that the epp will buffer the request and forward it in one go to the frontend? |
||
|
||
Response: | ||
worker_id: this is Dynamo specific worker_id | ||
token_ids: (Optional) tokens generated from processor step | ||
``` | ||
5. The Dynamo Frontend accepts the OpenAI request and forwards it through the Dynamo request plane (NATS).
6. The Dynamo Processor performs the necessary pre-processing and generates tokens. It calls the Router to decide the worker_id.
7. The Dynamo Router takes the tokens as input and decides the worker_id based on scheduling policies and KV metrics.
8. EPP sets headers (`x-gateway-destination-endpoint` and `x-gateway-worker-id`); a sketch of the resulting request appears after this flow.
   Optional optimization: we can inject the tokens into the request body to avoid recomputing them in the serving path.
   Note: a `tokens` key in the request body is not OpenAI-compatible.
   ```
   Set req headers:
   - `x-gateway-destination-endpoint`: address of the Dynamo Frontend pod
   - `x-gateway-worker-id`: Dynamo worker id of the backend LLM worker instance

   Add to req body (optional):
   - `tokens`
   ```
9. IGW forwards the request to the appropriate Dynamo Frontend based on the `x-gateway-destination-endpoint` request header.
   Note: this could ideally be routed to the Frontend Service, because the Frontend/Processor deployment is decoupled from the LLM workers.
10. The Processor skips steps based on what the request carries:
    - `tokens` in the request body: skip the pre-processing step
    - `x-gateway-worker-id` in the request: skip the call to the Router
11. The request is sent to the LLM backend, and the response is streamed back through:
    - Processor: post-processing steps
    - Frontend: reshapes the response from Dynamo-native to an OpenAI-compatible response
> **Reviewer comment:** Can this logic be encapsulated into a sidecar deployed with the model server, instead of having these deployed as central components in the service path? Note that as part of the ext-proc protocol, the EPP will get the response back, so it can also forward it to the Frontend instance deployed as a sidecar to the EPP.
>
> **Author comment:** This is a good point. My initial thought was that this would couple CPU and accelerator resources tightly and we wouldn't be able to scale them independently. @ahg-g, this makes sense. Also, is EPP considered stateful?
>
> **Reviewer comment:** If the Frontend logic doesn't require global state, then moving it to run beside the model server pod is the more scalable option. Currently it maintains an "approximate" in-memory prefix cache populated from previous requests to perform prefix-aware routing, so performance would suffer if that state is lost, but requests would continue to be routed and served.
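To make step 8 concrete, here is a hedged sketch of the request as it might leave the gateway after EPP's mutations. The endpoint address, worker id, and token values are invented placeholders, and the shape is illustrative rather than a defined wire format.

```yaml
# Illustrative only: request after EPP header/body mutation (step 8).
# All values are placeholders.
headers:
  x-gateway-destination-endpoint: "10.0.12.34:8000"  # Dynamo Frontend pod address
  x-gateway-worker-id: "worker-7f3a"                 # backend LLM worker instance
body:
  model: gemma
  messages:
    - role: user
      content: "Hello"
  tokens: [2, 9259, 1134]   # optional; not OpenAI-compatible
```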

**Notes:**
- All inter-component communication within Dynamo (Processor, Router, Workers) uses NATS with two-part JSON messages.
- Deployment is unified via a single Helm chart for version compatibility.

### Mapping Inference Pool/Model to Dynamo
1. There is a 1:1 mapping between an InferencePool, a Dynamo graph deployment, and an EPP deployment.
   Reasoning: a Dynamo graph represents a cohesive deployment unit with compute resources, so each Dynamo graph deployment should correspond to one InferencePool.

This is the view from the IGW perspective:
```
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│ Inference Model │     │ Inference Model  │     │ Inference Model │
│     (lora1)     │     │     (lora2)      │     │     (gemma)     │
└────────┬────────┘     └────────┬─────────┘     └────────┬────────┘
         └───────────────────────┼────────────────────────┘
                                 │ N:1
                    ┌────────────▼────────────┐
                    │     Inference Pool      │
                    └────────────┬────────────┘
                                 │ 1:1
                    ┌────────────▼────────────┐
                    │       Dynamo EPP        │
                    └────────────┬────────────┘
                                 │ 1:1
                    ┌────────────▼────────────┐
                    │ Dynamo Graph Deployment │
                    └─────────────────────────┘
```

2. EPP has a 1:1 relationship with an InferencePool and is responsible for scheduling decisions within a Dynamo graph.

3. InferenceModel maps user-facing model names to backend implementations. Multiple InferenceModels can refer to the same InferencePool (Dynamo graph), as sketched below.
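A hedged sketch of this N:1 mapping, reusing the hypothetical `dynamo-pool` from earlier; names and fields are illustrative and should be checked against the IGW API in use.

```yaml
# Two user-facing models backed by the same pool (names illustrative).
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: lora1
spec:
  modelName: lora1
  poolRef:
    name: dynamo-pool
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: lora2
spec:
  modelName: lora2
  poolRef:
    name: dynamo-pool
```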

```mermaid
erDiagram
    %% IGW entities
    Gateway
    HTTPRoute
    InferenceModel
    InferencePool
    EPP

    %% Dynamo entities
    DynamoGraph
    Frontend
    Worker

    %% IGW relationships
    Gateway ||--o{ HTTPRoute : "routes traffic via"
    HTTPRoute ||--o{ InferenceModel : "matches model name to"
    InferenceModel ||--|| InferencePool : "references backend"
    InferencePool ||--|| EPP : "delegates scheduling to"
    EPP ||--o{ Worker : "selects worker instances"

    %% Dynamo relationships
    DynamoGraph ||--|| Frontend : "has single entrypoint"
    DynamoGraph ||--o{ Worker : "contains multiple workers"

    %% Cross-system mapping
    InferencePool ||--|| DynamoGraph : "maps to"
    EPP ||--|| Frontend : "deployed as sidecar"

    %% Internal Dynamo communication
    Frontend ||--o{ Worker : "routes to selected worker"
```

### Decision Points

#### 1. EPP integration with Dynamo: plugin vs. sidecar vs. external callout service

 

##### Sidecar container (preferred)
Requires support in EPP for deploying a sidecar container and specifying the port to send requests to. A deployment sketch follows the pros and cons below.

**Pros**
- Reduced network hops: direct communication between EPP and Dynamo components within the same pod
- Lower latency: no network overhead for inter-component communication
- Simpler deployment and management: deployed as a single unit, with an easier-to-manage lifecycle

**Cons**
- Tightly coupled scaling: scaling decisions for EPP and Frontend are coupled
- EPP deployment is coupled with the Dynamo sidecar image; version upgrades must be done in sync
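A minimal sketch of the preferred option: one pod carrying both the EPP ext-proc server and the Dynamo Frontend. Image names and port numbers are hypothetical placeholders.

```yaml
# EPP pod with the Dynamo Frontend as a sidecar; images and ports are
# hypothetical placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dynamo-epp
spec:
  replicas: 1
  selector:
    matchLabels:
      app: dynamo-epp
  template:
    metadata:
      labels:
        app: dynamo-epp
    spec:
      containers:
        - name: epp
          image: registry.example.com/dynamo-epp:v0.1.0      # ext-proc gRPC server
          ports:
            - containerPort: 9002                            # called by the gateway
        - name: dynamo-frontend
          image: registry.example.com/dynamo-frontend:v0.1.0 # same-pod hop from EPP
          ports:
            - containerPort: 8000
```

Keeping both containers in one pod turns the EPP-to-Frontend hop into localhost traffic, which is exactly the latency argument made above, at the cost of the coupled scaling noted in the cons.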

##### External callout service
**Pros**
- Completely isolated deployments
- Each component can be deployed and scaled independently

**Cons**
- Additional network hops: more latency due to network communication between services
- Service discovery complexity: service endpoints and load balancing must be managed
- Additional network failure points in the request path

##### Plugin
**Pros**
- Minimum number of network hops
- Simpler architecture without an additional layer
- Lower latency for request processing

**Cons**
- The Dynamo runtime/components have no native Golang integration
- Hard to scale across models
- Tight coupling with a Golang-based implementation

## Current state of IGW and Dynamo

### Inference Gateway request flow
```
  HTTP Request
       │
       ▼
┌─────────────┐    Extract model name    ┌──────────────────┐
│   Gateway   │ ───────────────────────► │  InferenceModel  │
│ (HTTPRoute) │                          │  (Model Config)  │
└─────────────┘                          └──────────────────┘
       │                                          │
       │ Route to backend                         │ References
       ▼                                          ▼
┌─────────────┐    Smart routing via     ┌──────────────────┐
│InferencePool│ ◄─────────────────────── │ Endpoint Picker  │
│  (Compute)  │      EPP extension       │ Extension (EPP)  │
└─────────────┘                          └──────────────────┘
       │
       ▼
┌─────────────┐
│ Model Server│
│    Pods     │
└─────────────┘
```

## Deferred to Implementation
- Fallback mechanisms for failures
- Metrics and observability integration

## Follow-up questions

1. Is EPP considered stateless? How do we achieve EPP HA and fault tolerance?
   The Dynamo Router is stateful (in-memory prefix tree).
   Coupling the Dynamo Frontend as a sidecar with EPP might bring scaling challenges.

2. Multiple models: InferenceModel and body-based routing.
   Can we enable body-based routing as a stage in EPP, or compose it at the IGW level?

# Related Proposals
* [Gateway API Inference Extension Documentation](https://gateway-api-inference-extension.sigs.k8s.io/)
* [Envoy External Processing Filter](https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/ext_proc_filter)
* [Gateway API Specification](https://gateway-api.sigs.k8s.io/)