inference gateway integration #13
**EPP:** Endpoint Picker Protocol
**IGW:** Inference Gateway

## Why
As of now, IGW is mostly concerned with the second point, request scheduling. The other two are assumed to be addressed by other components that compose with IGW, but that could change depending on the detailed requirements.
### REQ 2 Unified Dynamo deployment

Dynamo EPP and components (Frontend, Processor, Router, Workers) **MUST** be deployable within Kubernetes through a unified helm chart to maintain version compatibility.
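For illustration, a hypothetical `values.yaml` sketch of what a single chart pinning all components to one release could look like (every key here is an assumption, not taken from an actual Dynamo chart):

```yaml
# Hypothetical values for a unified Dynamo chart; one image tag is
# shared by every component so versions cannot drift apart.
image:
  tag: v0.3.0
frontend:
  replicas: 2
processor:
  replicas: 2
router:
  replicas: 1
epp:
  enabled: true        # deploy the EPP together with the graph
workers:
  replicas: 4
  resources:
    limits:
      nvidia.com/gpu: 1
```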
Since they are in the request path, can we define the scope of "Frontend", "Processor" and "Router" or link to a document explaining that?
Added the definitions in the doc's Dynamo introduction section.
Thanks. One more thing to clarify, and relevant to IGW integration, is worker discovery: how does the smart router learn about the workers and their roles?
You also touched on metrics and that Dynamo follows a push model; it would be great to clarify the exact metrics currently being pushed and the frequency.
# Proposal

## Architecture Overview
Alt 2 is more aligned with IGW's design principles and composes well with the Kubernetes Gateway API.
Different base models are assumed to be deployed independently as separate pools. EPP is thought of as a request scheduler on a pool: it assumes that all endpoints are capable of serving the request, and its role is to pick the best one; routing across different base models is therefore assumed to happen at a stage before that.
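To make the pool-per-base-model assumption concrete, here is a sketch of one pool, based on the Gateway API inference extension's `InferencePool` resource (resource names are hypothetical, and the API version/fields may differ from what ends up being used):

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: llama-pool            # one pool per base model
spec:
  selector:
    app: llama-worker         # labels selecting this model's serving pods
  targetPortNumber: 8000      # port the model servers listen on
  extensionRef:
    name: llama-epp           # the EPP that schedules requests on this pool
```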
> routing across different base models

Is this done by enabling the body-based router, or are we planning to integrate this as a core feature of IGW?
Giving some context before answering your question:
Context 1: body-based-router is an extension whose only job is to read the model name from the request body and write it to a known header; it doesn't do actual routing per se. Once the model name is in a header, a user can use HTTPRoute rules to express the intent to route requests with that header to different pools. This is basic k8s Gateway functionality. The reason for writing the model name to a header is that HTTPRoute doesn't support creating rules with matchers against the request body; it can only match against headers.
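To make Context 1 concrete, a sketch of an HTTPRoute that matches on the model-name header written by bbr and forwards to a per-model pool (the header name `X-Gateway-Model-Name` and all resource names are assumptions):

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: model-routing
spec:
  parentRefs:
  - name: inference-gateway        # the IGW proxy
  rules:
  - matches:
    - headers:
      - name: X-Gateway-Model-Name # header populated by bbr (assumed name)
        value: llama-3-8b
    backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool          # route to the pool serving this model
      name: llama-pool
```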
Context 2: We define IGW as a combination of two things: a proxy + ext-proc extensions. As part of the IGW project, we developed two extensions: the bbr described above, and the epp, the endpoint picker extension (we are calling it the inference request scheduler as well). bbr is invoked early in the envoy filter chain, before selecting the pool to route the request to. EPP is invoked later in the chain, after the pool is selected but before picking the endpoint. EPP is the component that does the "smart" routing.
So if I were to answer the question "are we planning to integrate this [routing across different base models] as a core feature of IGW": my answer is yes, if you look at IGW as the composition of the features above.
If the question is "can a single pool host multiple base models, and can the EPP route requests across different base models within the pool?", then this is still an open question; we lack the motivating use case to consider that (vs. having the layer above, HTTPRoute, do it). More concretely, is there any smart routing that we may want to apply when selecting the base model to route the request to? Or is selecting the base model predetermined (and hence expressible as header matching)? If there is a strong case for the former, then EPP can certainly expand to support base model routing.
1. The client sends an HTTP inference request to the Gateway.
2. Gateway receives the request and extracts the model name and relevant metadata.
3. Gateway consults the InferenceModel configuration to determine the inference pool (dynamo graph) to route the request to.
The InferenceModel API will be deprecated: initial thoughts at https://docs.google.com/document/d/1x6aI9pbTF5oOsaEQYc9n4pBBY3_AuEY2X51VKxmBSnU/edit?tab=t.0#heading=h.towq7jyczzgo; a concrete proposal will be sent out to the IGW repo in the next couple of days.
But in all cases, the gateway doesn't actually look at the InferenceModel. The way this works is as follows: the body-based-router extension extracts the model name into a header, and then an HTTPRoute rule can be used to route the request to different pools based on that header.
Yes, the standalone InferenceModel's purpose was confusing in the absence of the body-based-router extension, and the new API proposal makes sense.
10. Request is sent to the LLM Backend and the response is streamed back through:
- Processor: postprocessing steps
- Frontend: changes the response shape from Dynamo-native to an OpenAI-compatible response
Can this logic be encapsulated into a sidecar deployed with the model server instead of having them deployed as central components in the service path?
Note that as part of the ext-proc protocol, the epp will get the response back, and so it can also forward it to the frontend instance deployed as a sidecar to the epp.
This is a good point. My initial thought was that this would couple CPU and accelerator resources tightly and we wouldn't be able to scale them independently. Both AIBrix and llm-d are taking a similar approach in colocating the HTTP server with the model server (Backend).
> Can this logic be encapsulated into a sidecar deployed with the model server instead of having them deployed as central components in the service path?

> Note that as part of the ext-proc protocol, the epp will get the response back, and so it can also forward it to the frontend instance deployed as a sidecar to the epp.

@ahg-g, this makes sense. EPP can forward the request to a frontend sidecar ([http frontend + processor + router]), which can live in one of these pods:

1. Frontend sidecar in the EPP pod: this option is similar in spirit to alt-1-entire-dynamo-graph-deployment-as-a-blackbox.
2. Frontend sidecar in the model service pod: frontend CPU/memory resources are coupled with the model service, and each model service's frontend sidecar will have a predictable load, so it might be a better option for horizontal scaling; see the sketch after this list. (Side note: we can further optimize by using local unix sockets/zmq instead of NATS between the frontend and the model service.)
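A minimal sketch of option 2, assuming the frontend talks to the worker over localhost (container names, images, and ports are all hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: model-service
  labels:
    app: llama-worker
spec:
  containers:
  - name: worker                    # model server, owns the accelerator
    image: dynamo/worker:latest     # hypothetical image
    ports:
    - containerPort: 8000
    resources:
      limits:
        nvidia.com/gpu: 1
  - name: frontend                  # frontend + processor sidecar, CPU-only
    image: dynamo/frontend:latest   # hypothetical image
    ports:
    - containerPort: 8080           # OpenAI-compatible HTTP endpoint
    # reaches the worker over localhost:8000 (or a unix socket / zmq)
```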
Also, is the EPP considered stateful?
If the frontend logic doesn't require global state, then moving it to run beside the model service pod is the more scalable option.

> also, is epp considered stateful?

Currently it maintains an "approximate" in-memory prefix cache populated from previous requests to perform prefix-aware routing. So performance would suffer if that state is lost, but requests would continue to be routed and served.
#### 1. EPP integration with Dynamo: plugin vs sidecar vs external callout service



##### Sidecar container (Preferred)
In this mode, what protocol do you envision between the epp and the frontend? What functionality do you expect the epp to continue to offer?
```yaml
Request:
- req header: set x-routing-request: true
- req body: original request body (for example, a Chat Completion request)
```
Is the expectation that the epp will buffer the request and forward it in one go to the frontend?
## Problems
1. Currently EPP scheduling is tightly coupled with in-process preprocessing, which makes it hard to scale/maintain across different models.
I would like to understand the concern here better. An EPP service can be scaled horizontally. Things get tricky when there is state (like the in-memory prefix-cache index), but if you don't rely on that, then you should be able to scale horizontally based on load.
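For reference, a sketch of the EPP as a horizontally scaled Deployment (names, image, and port are hypothetical; note that with the approximate prefix-cache each replica would keep its own local copy):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-epp
spec:
  replicas: 3                      # scale the scheduler on load
  selector:
    matchLabels:
      app: llama-epp
  template:
    metadata:
      labels:
        app: llama-epp
    spec:
      containers:
      - name: epp
        image: dynamo/epp:latest   # hypothetical image
        ports:
        - containerPort: 9002      # ext-proc gRPC port (assumed)
```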
@biswapanda @atchernych are we going with the approach where EPP is a sidecar?