A company wants to deploy multiple large language models (LLMs) to a cluster to serve different workloads.
For example, they might want to deploy a Gemma3 model for a chatbot interface and a DeepSeek model for a recommendation application.
The company needs to ensure optimal serving performance for these LLMs.
By using an Inference Gateway, you can deploy these LLMs on your cluster with your chosen accelerator configuration in an `InferencePool`.
You can then route requests based on the model name (such as `chatbot` and `recommender`) and the `Criticality` property.

## How

The following diagram illustrates how an Inference Gateway routes requests to different models based on the model name.
The model name is extracted by [Body-Based routing](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md) (BBR)
from the request body into a request header. The header is then matched to dispatch
requests to different `InferencePool` instances (and their EPPs).
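Conceptually, BBR's job is small: read the `model` field from the JSON request body and surface it as a routable header. The sketch below imitates that data flow with plain shell tools; it is an illustration of the concept, not the actual ExtProc implementation.

```bash
# Illustration only: BBR itself runs as an Envoy ExtProc server.
# Extract the "model" field from an OpenAI-style request body and
# expose it as the X-Gateway-Model-Name header value.
BODY='{"model": "chatbot", "prompt": "What is an Inference Gateway?"}'
MODEL=$(printf '%s' "$BODY" | sed -n 's/.*"model"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p')
echo "X-Gateway-Model-Name: $MODEL"
```

With the body above, the script prints `X-Gateway-Model-Name: chatbot`, which is exactly the header the `HTTPRoute` rules later match on.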

### Deploy Body-Based Routing

To enable body-based routing, deploy the Body-Based Routing ExtProc server using Helm, with the values appropriate to your Gateway provider (see the [BBR README](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md) for the provider-specific install commands).

The following is a conceptual example of how to use the `HTTPRoute` object to route requests with a model name such as `chatbot` or `recommender` to an `InferencePool`.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
# ...
spec:
  # ...
  - matches:
    - headers:
      - type: Exact
        name: X-Gateway-Model-Name # (1)!
        value: chatbot
      path:
        type: PathPrefix
        value: /
    # ...
  - matches:
    - headers:
      - type: Exact
        name: X-Gateway-Model-Name # (2)!
        value: recommender
      path:
        type: PathPrefix
        value: /
    backendRefs:
    - name: deepseek-r1
      kind: InferencePool
```

1. [BBR](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md) is used to copy the model name from the request body to the header with key `X-Gateway-Model-Name`. The header can then be used in the `HTTPRoute` to route requests to different `InferencePool` instances.
2. [BBR](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md) is used to copy the model name from the request body to the header with key `X-Gateway-Model-Name`. The header can then be used in the `HTTPRoute` to route requests to different `InferencePool` instances.

## Try it out

1. Get the gateway IP:

    ```bash
    IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}'); PORT=80
    ```

2. Send a few requests to model "chatbot" as follows:
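For example, a request targeting the `chatbot` model could look like the sketch below. The `/v1/completions` path and payload shape are assumptions for an OpenAI-compatible serving stack behind the gateway; adjust them to match your deployment.

```bash
# Assumes IP and PORT from the previous step and an OpenAI-compatible API
# behind the gateway; BBR routes on the "model" field in the body.
curl -i "http://${IP}:${PORT}/v1/completions" \
  -H 'Content-Type: application/json' \
  -d '{"model": "chatbot", "prompt": "Hello!", "max_tokens": 100}'
```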