Conversation

**biswapanda:** No description provided.

@biswapanda self-assigned this Jun 18, 2025
**EPP:** Endpoint Picker Protocol
**IGW:** Inference Gateway

## Why
**Reviewer:**

As of now, IGW is mostly concerned with the second point, request scheduling. The other two are assumed to be addressed by other components that compose with IGW. But that could change depending on the detailed requirements.


### REQ 2 Unified Dynamo deployment

Dynamo EPP and components (Frontend, Processor, Router, Workers) **MUST** be deployable within Kubernetes through a unified helm chart to maintain version compatibility.
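To make the requirement concrete, here is a minimal sketch of what a unified umbrella chart's values could look like; the keys, component names, and image tag are hypothetical, not the actual Dynamo chart schema:

```yaml
# Hypothetical values.yaml for a single umbrella chart that deploys EPP and
# all Dynamo components together; keys, names, and the tag are illustrative.
global:
  imageTag: v0.3.1          # one tag pins every component to a compatible version
frontend:
  replicas: 2
processor:
  replicas: 2
router:
  replicas: 1
epp:
  replicas: 1
workers:
  replicas: 4
  resources:
    limits:
      nvidia.com/gpu: 1
```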
**Reviewer:**

Since they are in the request path, can we define the scope of "Frontend", "Processor" and "Router" or link to a document explaining that?

**biswapanda (Author):**

Added the definitions in the doc's Dynamo introduction section.

**Reviewer:**

Thanks. One more thing to clarify that's relevant to IGW integration is worker discovery: basically, how does the smart router learn about the workers and their roles?

You also touched on metrics and that Dynamo follows a push model; it would be great to clarify exactly which metrics are currently being pushed and at what frequency.


# Proposal

## Architecture Overview
**Reviewer:**

Alt 2 is more aligned with IGW's design principles and composes well with Kubernetes Gateway API.

Different base models are assumed to be deployed independently as separate pools. EPP is thought of as a request scheduler on a pool: it assumes that all endpoints are capable of serving the request, and its role is to pick the best one, so routing across different base models is assumed to happen at a stage before that.

**biswapanda (Author):**

Is routing across different base models done by enabling the body-based router, or are we planning to integrate this as a core feature of IGW?

**Reviewer:**

Giving some context before answering your question:

Context 1: body-based-router is an extension whose only job is to read the model name from the body and write it to a known header; it doesn't do any actual routing per se. Once the model name is in a header, a user can use HTTPRoute rules to express the intent to route requests based on that header to different pools. This is basic k8s Gateway functionality. The reason to write the model name to a header is that HTTPRoute doesn't support creating rules with matchers against the request body; it can only match against headers.
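For illustration, here is a sketch of that HTTPRoute pattern, assuming bbr writes the model name to an X-Gateway-Model-Name header (the header name, pool names, and gateway name are assumptions to check against the actual bbr deployment):

```yaml
# Sketch: route to different InferencePools based on the header bbr populates.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: route-by-base-model
spec:
  parentRefs:
    - name: inference-gateway       # illustrative gateway name
  rules:
    - matches:
        - headers:
            - name: X-Gateway-Model-Name
              value: llama-3-8b-instruct
      backendRefs:
        - group: inference.networking.x-k8s.io
          kind: InferencePool
          name: llama-3-8b-pool
    - matches:
        - headers:
            - name: X-Gateway-Model-Name
              value: mixtral-8x7b
      backendRefs:
        - group: inference.networking.x-k8s.io
          kind: InferencePool
          name: mixtral-pool
```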

Context 2: We define IGW as a combination of two things: a proxy + ext-proc extensions; as part of the IGW project, we developed two extensions, the bbr described above and the epp, the endpoint picker extension (we are calling it inference request scheduler as well). bbr is invoked early in the envoy filter chain before selecting the pool to route the request to. EPP is invoked later in the chain, after the pool is selected, but before picking the endpoint. EPP is the component that does the “smart” routing.

So if I were to answer the question "are we planning to integrate this [routing across different base models] as a core feature of IGW?": my answer is yes, if you look at IGW as the composition of the features above.

If the question is "can a single pool host multiple base models, and can the EPP route requests across different base models within the pool?", then this is still an open question; we lack the motivating use case to consider that (vs. having the layer above, HTTPRoute, do it). More concretely, is there any smart routing that we may want to apply when selecting the base model to route the request to? Or is selecting the base model predetermined (and hence expressible as header matching)? If there is a strong case for the former, then EPP can certainly expand to support base model routing.


1. The client sends an HTTP inference request to the Gateway.
2. Gateway receives the request and extracts the model name and relevant metadata.
3. Gateway consults the InferenceModel configuration to determine the inference pool (dynamo graph) to route the request to.
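For context, a pool in this flow is represented by an InferencePool resource; here is a minimal sketch following the gateway-api-inference-extension alpha API (the apiVersion, labels, and names are illustrative and may differ by release):

```yaml
# Sketch of an InferencePool fronting one Dynamo graph.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: llama-3-8b-pool
spec:
  selector:
    app: llama-3-8b-worker      # pods that can serve requests for this pool
  targetPortNumber: 8000
  extensionRef:
    name: llama-3-8b-epp        # the EPP service that picks the endpoint
```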
**Reviewer:**

The InferenceModel API will be deprecated: initial thoughts at https://docs.google.com/document/d/1x6aI9pbTF5oOsaEQYc9n4pBBY3_AuEY2X51VKxmBSnU/edit?tab=t.0#heading=h.towq7jyczzgo; a concrete proposal will be sent out to the IGW repo in the next couple of days.

But in all cases, the gateway doesn't actually look at the InferenceModel. The way this works is as follows: the body-based-router extension extracts the model name into a header, and then an HTTPRoute rule can be used to route the request to different pools based on that header.

**biswapanda (Author):**

Yes, the standalone InferenceModel's purpose was confusing in the absence of the body-based-router extension, and the new API proposal makes sense.


10. Request is sent to the LLM Backend and the response is streamed back through:
- Processor: postprocessing steps
- Frontend: converts the response from the Dynamo-native shape to an OpenAI-compatible response
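For reference, the OpenAI-compatible shape the Frontend converts to looks roughly like the following (a non-streaming chat completion, sketched in YAML; all values are made up):

```yaml
# Illustrative OpenAI-compatible chat completion response.
id: chatcmpl-abc123
object: chat.completion
created: 1718000000
model: llama-3-8b-instruct
choices:
  - index: 0
    message:
      role: assistant
      content: "..."
    finish_reason: stop
usage:
  prompt_tokens: 12
  completion_tokens: 34
  total_tokens: 46
```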
**Reviewer:**

Can this logic be encapsulated into a sidecar deployed with the model server instead of having them deployed as central components in the service path?

**Reviewer:**

Note that as part of the ext-proc protocol, the epp will get the response back, and so it can also forward it to the frontend instance deployed as a sidecar to the epp.

**biswapanda (Author)** commented Jul 17, 2025:

This is a good point. My initial thought was that this would couple CPU and accelerator resources tightly and we wouldn't be able to scale them independently. Both AIBrix and llm-d are taking a similar approach in colocating the HTTP server with the model server (Backend).

> Can this logic be encapsulated into a sidecar deployed with the model server instead of having them deployed as central components in the service path?

**biswapanda (Author)** commented Jul 18, 2025:

> Note that as part of the ext-proc protocol, the epp will get the response back, and so it can also forward it to the frontend instance deployed as a sidecar to the epp.

@ahg-g, this makes sense.
EPP can forward the request to a frontend [HTTP frontend + processor + router] sidecar, which can live in one of these pods:

  1. Frontend sidecar in the EPP pod.
     This option is similar in spirit to alt-1-entire-dynamo-graph-deployment-as-a-blackbox.

  2. Frontend sidecar in the model service pod (sketched below).
     Frontend CPU/memory resources are coupled with the model service.
     Each model service's frontend sidecar will have a predictable load, so it might be a better option for horizontal scaling.
     (Side note: we can further optimize by using local Unix sockets/ZMQ instead of NATS between the frontend and the model service.)
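A minimal sketch of option 2, with hypothetical image names, ports, and socket path:

```yaml
# Frontend [HTTP frontend + processor + router] colocated with the model server.
apiVersion: v1
kind: Pod
metadata:
  name: llama-3-8b-worker
  labels:
    app: llama-3-8b-worker
spec:
  containers:
    - name: frontend
      image: example.com/dynamo-frontend:v0.3.1   # hypothetical image
      ports:
        - containerPort: 8000                     # receives requests from the EPP
      volumeMounts:
        - name: ipc
          mountPath: /var/run/dynamo
    - name: model-server
      image: example.com/dynamo-worker:v0.3.1     # hypothetical image
      resources:
        limits:
          nvidia.com/gpu: 1
      volumeMounts:
        - name: ipc
          mountPath: /var/run/dynamo              # local socket instead of NATS
  volumes:
    - name: ipc
      emptyDir: {}
```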

Also, is EPP considered stateful?

**Reviewer:**

If the frontend logic doesn't require global state, then moving it to run beside the model service pod is the more scalable option.

> Also, is EPP considered stateful?

Currently it maintains an "approximate" in-memory prefix-cache populated from previous requests to perform prefix-aware routing. So performance would suffer if the state is lost, but requests will continue to be routed and served.

#### 1. EPP integration with Dynamo: plugin vs sidecar vs external callout service

![EPP integration with Dynamo](./alt-epp-dyn.png)
##### Sidecar container (Preferred)
**Reviewer:**

In this mode, what protocol do you envision between the epp and the frontend? What functionality do you expect the epp to continue to offer?

```yaml
Request:
- req header: set x-routing-request: true
- req body: original request body (for example, a chat completion request)
```
**Reviewer:**

Is the expectation that the epp will buffer the request and forward it in one go to the frontend?
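For reference, buffering behavior is configurable in Envoy's ext-proc filter; a sketch of the relevant processing_mode settings (the cluster name is illustrative, and actual IGW deployments generate this config):

```yaml
# Fragment of an Envoy HTTP filter chain for an ext-proc extension like the EPP.
http_filters:
  - name: envoy.filters.http.ext_proc
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.filters.http.ext_proc.v3.ExternalProcessor
      grpc_service:
        envoy_grpc:
          cluster_name: epp_ext_proc      # illustrative cluster pointing at the EPP service
      processing_mode:
        request_header_mode: SEND
        request_body_mode: BUFFERED       # EPP receives the full request body in one message
        response_header_mode: SEND
        response_body_mode: STREAMED      # response chunks flow back through the EPP
```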


## Problems
1. Currently, EPP scheduling is tightly coupled with in-process preprocessing.
It's hard to scale/maintain it across different models.
**Reviewer:**

I would like to understand better the concern here. An EPP service can be scaled horizontally. Things get tricky when there is state (like in-memory prefix-cache index), but if you don't rely on that, then you should be able to scale horizontally based on load.
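As an illustration of that point, a stateless EPP Deployment can be scaled with a standard HorizontalPodAutoscaler (deployment name and thresholds are illustrative):

```yaml
# Sketch: horizontally scaling a stateless EPP deployment on CPU load.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: epp
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: epp
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```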

**athreesh** commented Aug 1, 2025:

@biswapanda @atchernych are we going with the approach where EPP is a sidecar?
