The idea addresses a missing feature in the current inference pool: the ability to perform model redirects and split traffic for requests. The essence of this idea is to enable request body mutation, allowing us to change the model name in a request before it's processed by the model server.
Currently, while we can split traffic between multiple inference pools (example), we cannot mutate the request body. This is a limitation when the model servers are configured differently and expect specific model names. This proposal aims to solve this by allowing for in-flight modification of the request.
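For concreteness, here is a minimal sketch of the kind of in-flight body mutation being proposed, assuming an OpenAI-compatible chat completions request body. The `redirects` mapping and the `rewrite_model` helper are hypothetical and only illustrate the idea, not a final API:

```python
import json

# Minimal sketch of the proposed in-flight body mutation, assuming an
# OpenAI-compatible /v1/chat/completions request body. The `redirects`
# mapping is hypothetical and stands in for whatever rule API is chosen.
redirects = {"modelA-v1": "modelA-v2"}


def rewrite_model(raw_body: bytes) -> bytes:
    """Rewrite the `model` field before the request reaches the model server."""
    body = json.loads(raw_body)
    requested = body.get("model")
    if requested in redirects:
        body["model"] = redirects[requested]
    return json.dumps(body).encode()


original = b'{"model": "modelA-v1", "messages": [{"role": "user", "content": "hi"}]}'
print(rewrite_model(original))  # the model name is rewritten to "modelA-v2"
```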
Use cases
Here are several potential scenarios where this functionality would be beneficial:
Seamless Model Upgrades: An ML engineer/model server owner wants to roll out a new model version (e.g., modelA-v2) to replace an existing one (modelA-v1). With model redirection, they could gradually shift traffic to the new version without any changes to the client application, which would still be requesting modelA. This is particularly crucial for LoRA adapter updates that must happen within the same inference pool.
A/B Testing: An ML engineer wants to compare the performance of two different models. They could send a small percentage of user traffic to a new model and compare its performance against the current one before a full rollout.
Large-Scale LoRA Management: For scenarios with hundreds of LoRA adapters, a robust model mapping system is needed to manage the different adapters effectively.
Semantic Routing: To optimize for cost and performance, it would be possible to route requests to different models based on the prompt's content within an InferencePool.
High-Level Idea
There is a proposal from @ahg-g that leverages httpHeaderModifier and BBR; however, it runs into limitations with the Envoy filter ordering and the HTTPRouteRule maximum of 16 rules.
The core idea is to introduce a mechanism that allows users to define model redirect and traffic splitting rules. These rules would then be used to mutate the model field within the request body before it reaches the model server.
For example, a rule could specify that for any request asking for "production-chatbot", 10% of the traffic is rewritten to be served by "stable-chatbot-v6" and 90% by "stable-chatbot-v5".
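A minimal sketch of such a weighted rewrite is below. The rule table and `resolve_model` helper are hypothetical; they illustrate the intent of the split, not the configuration surface, which is still open for discussion:

```python
import random

# Minimal sketch of the weighted rewrite described above. The rule table is
# hypothetical; it only illustrates the intent of the split, not the final API.
split_rules = {
    "production-chatbot": [("stable-chatbot-v6", 10), ("stable-chatbot-v5", 90)],
}


def resolve_model(requested: str) -> str:
    """Pick the target model for a request according to the configured weights."""
    targets = split_rules.get(requested)
    if not targets:
        # No rule for this model: pass the requested name through unchanged.
        return requested
    names, weights = zip(*targets)
    return random.choices(names, weights=weights, k=1)[0]


print(resolve_model("production-chatbot"))  # ~10% "stable-chatbot-v6", ~90% "stable-chatbot-v5"
```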
We are considering different approaches to implement this, but we first want to gather feedback on the use cases and the general direction.
What are your thoughts on these use cases? Are there other scenarios we should be considering for in-pool model redirection?