inference gateway integration #13
**EPP:** Endpoint Picker Protocol
**IGW:** Inference Gateway

## Why
As of now, IGW is mostly concerned with the second point, request scheduling. The other two are assumed to be addressed by other components that compose with IGW, but that could change depending on the detailed requirements.
### REQ 2 Unified Dynamo deployment

Dynamo EPP and components (Frontend, Processor, Router, Workers) **MUST** be deployable within Kubernetes through a unified helm chart to maintain version compatibility.
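For illustration, a hypothetical `values.yaml` sketch of what a single chart pinning all components to one release could look like (every key here is an assumption, not taken from an actual Dynamo chart):

```yaml
# Hypothetical values for a unified Dynamo chart; one image tag is
# shared by every component so versions cannot drift apart.
image:
  tag: v0.3.0
frontend:
  replicas: 2
processor:
  replicas: 2
router:
  replicas: 1
epp:
  enabled: true        # deploy the EPP together with the graph
workers:
  replicas: 4
  resources:
    limits:
      nvidia.com/gpu: 1
```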
Since they are in the request path, can we define the scope of "Frontend", "Processor" and "Router" or link to a document explaining that?
Added the definitions in the doc's Dynamo introduction section.
Thanks. One more thing to clarify, and relevant to IGW integration, is worker discovery: how does the smart router learn about the workers and their roles?
You also touched on metrics and that Dynamo follows a push model; it would be great to clarify the exact metrics currently being pushed and the frequency.
# Proposal

## Architecture Overview
Alt 2 is more aligned with IGW's design principles and composes well with the Kubernetes Gateway API.
Different base models are assumed to be deployed independently as separate pools. EPP is thought of as a request scheduler on a pool: it assumes that all endpoints are capable of serving the request, and its role is to pick the best one; routing across different base models is therefore assumed to happen at a stage before that.
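To make the pool-per-base-model assumption concrete, here is a sketch of one pool, based on the Gateway API inference extension's `InferencePool` resource (resource names are hypothetical, and the API version/fields may differ from what ends up being used):

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: llama-pool            # one pool per base model
spec:
  selector:
    app: llama-worker         # labels selecting this model's serving pods
  targetPortNumber: 8000      # port the model servers listen on
  extensionRef:
    name: llama-epp           # the EPP that schedules requests on this pool
```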
> routing across different base models

Is this done by enabling the body-based router, or are we planning to integrate this as a core feature of IGW?
Giving some context before answering your question:
Context 1: body-based-router is an extension whose only job is to read the model name from the request body and write it to a known header; it doesn't do actual routing per se. Once the model name is in a header, a user can use HTTPRoute rules to express the intent to route requests with that header to different pools. This is basic k8s Gateway functionality. The reason for writing the model name to a header is that HTTPRoute doesn't support creating rules with matchers against the request body; it can only match against headers.
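To make Context 1 concrete, a sketch of an HTTPRoute that matches on the model-name header written by bbr and forwards to a per-model pool (the header name `X-Gateway-Model-Name` and all resource names are assumptions):

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: model-routing
spec:
  parentRefs:
  - name: inference-gateway        # the IGW proxy
  rules:
  - matches:
    - headers:
      - name: X-Gateway-Model-Name # header populated by bbr (assumed name)
        value: llama-3-8b
    backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool          # route to the pool serving this model
      name: llama-pool
```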
Context 2: We define IGW as a combination of two things: a proxy + ext-proc extensions. As part of the IGW project, we developed two extensions: the bbr described above, and the epp, the endpoint picker extension (we are calling it the inference request scheduler as well). bbr is invoked early in the envoy filter chain, before selecting the pool to route the request to. EPP is invoked later in the chain, after the pool is selected but before picking the endpoint. EPP is the component that does the "smart" routing.
So if I were to answer the question "are we planning to integrate this [routing across different base models] as a core feature of IGW": my answer is yes, if you look at IGW as the composition of the features above.
If the question is "can a single pool host multiple base models, and can the EPP route requests across different base models within the pool?", then this is still an open question; we lack the motivating use case to consider that (vs. having the layer above, HTTPRoute, do it). More concretely, is there any smart routing that we may want to apply when selecting the base model to route the request to? Or is selecting the base model predetermined (and hence expressible as header matching)? If there is a strong case for the former, then EPP can certainly expand to support base model routing.
1. The client sends an HTTP inference request to the Gateway.
2. Gateway receives the request and extracts the model name and relevant metadata.
3. Gateway consults the InferenceModel configuration to determine the inference pool (dynamo graph) to route the request to.
The InferenceModel API will be deprecated: initial thoughts at https://docs.google.com/document/d/1x6aI9pbTF5oOsaEQYc9n4pBBY3_AuEY2X51VKxmBSnU/edit?tab=t.0#heading=h.towq7jyczzgo; a concrete proposal will be sent out to the IGW repo in the next couple of days.
But in all cases, the gateway doesn't actually look at the InferenceModel. The way this works is as follows: the body-based-router extension extracts the model name into a header, and then an HTTPRoute rule can be used to route the request to different pools based on that header.
Yes, the standalone InferenceModel's purpose was confusing in the absence of the body-based-router extension, and the new API proposal makes sense.
10. Request is sent to the LLM Backend and the response is streamed back through:
- Processor: postprocessing steps
- Frontend: changes the response shape from Dynamo-native to an OpenAI-compatible response
Can this logic be encapsulated into a sidecar deployed with the model server instead of having them deployed as central components in the service path?
Note that as part of the ext-proc protocol, the epp will get the response back, and so it can also forward it to the frontend instance deployed as a sidecar to the epp.
This is a good point. My initial thought was that this would couple CPU and accelerator resources tightly and we wouldn't be able to scale them independently. Both AIBrix and llm-d are taking a similar approach in colocating the HTTP server with the model server (Backend).
> Can this logic be encapsulated into a sidecar deployed with the model server instead of having them deployed as central components in the service path?

> Note that as part of the ext-proc protocol, the epp will get the response back, and so it can also forward it to the frontend instance deployed as a sidecar to the epp.

@ahg-g, this makes sense. EPP can forward the request to a frontend sidecar ([http frontend + processor + router]), which can live in one of these pods:

1. Frontend sidecar in the EPP pod: this option is similar in spirit to alt-1-entire-dynamo-graph-deployment-as-a-blackbox.
2. Frontend sidecar in the model service pod: frontend CPU/memory resources are coupled with the model service, and each model service's frontend sidecar will have a predictable load, so it might be a better option for horizontal scaling; see the sketch after this list. (Side note: we can further optimize by using local unix sockets/zmq instead of NATS between the frontend and the model service.)
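A minimal sketch of option 2, assuming the frontend talks to the worker over localhost (container names, images, and ports are all hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: model-service
  labels:
    app: llama-worker
spec:
  containers:
  - name: worker                    # model server, owns the accelerator
    image: dynamo/worker:latest     # hypothetical image
    ports:
    - containerPort: 8000
    resources:
      limits:
        nvidia.com/gpu: 1
  - name: frontend                  # frontend + processor sidecar, CPU-only
    image: dynamo/frontend:latest   # hypothetical image
    ports:
    - containerPort: 8080           # OpenAI-compatible HTTP endpoint
    # reaches the worker over localhost:8000 (or a unix socket / zmq)
```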
Also, is the EPP considered stateful?
If the frontend logic doesn't require global state, then moving it to run beside the model service pod is the more scalable option.

> also, is epp considered stateful?

Currently it maintains an "approximate" in-memory prefix cache populated from previous requests to perform prefix-aware routing. So performance would suffer if that state is lost, but requests would continue to be routed and served.
#### 1. EPP integration with Dynamo: plugin vs sidecar vs external callout service



##### Sidecar container (Preferred)
In this mode, what protocol do you envision between the epp and the frontend? What functionality do you expect the epp to continue to offer?
```yaml
Request:
- req header: set x-routing-request: true
- req body: original request body (for example, a Chat Completion request)
```
Is the expectation that the epp will buffer the request and forward it in one go to the frontend?
## Problems
1. Currently EPP scheduling is tightly coupled with in-process preprocessing, which makes it hard to scale/maintain across different models.
I would like to understand the concern here better. An EPP service can be scaled horizontally. Things get tricky when there is state (like the in-memory prefix-cache index), but if you don't rely on that, then you should be able to scale horizontally based on load.
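For reference, a sketch of the EPP as a horizontally scaled Deployment (names, image, and port are hypothetical; note that with the approximate prefix-cache each replica would keep its own local copy):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-epp
spec:
  replicas: 3                      # scale the scheduler on load
  selector:
    matchLabels:
      app: llama-epp
  template:
    metadata:
      labels:
        app: llama-epp
    spec:
      containers:
      - name: epp
        image: dynamo/epp:latest   # hypothetical image
        ports:
        - containerPort: 9002      # ext-proc gRPC port (assumed)
```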
@biswapanda @atchernych are we going with the approach where EPP is a sidecar?