Commit ac85f25

docs: update BBR guide (#1517)
Signed-off-by: Ernest Wong <[email protected]>
1 parent a2feb2e, commit ac85f25

File tree: 2 files changed, +101 −30 lines
mkdocs.yml

Lines changed: 3 additions & 2 deletions
@@ -12,6 +12,7 @@ theme:
   logo: images/logo/logo-text-large-horizontal-white.png
   favicon: images/favicon-64.png
   features:
+    - content.code.annotate
     - search.highlight
     - navigation.tabs
     - navigation.top
@@ -55,7 +56,7 @@ nav:
     - Design Principles: concepts/design-principles.md
     - Conformance: concepts/conformance.md
     - Roles and Personas: concepts/roles-and-personas.md
-  - Implementations:
+  - Implementations:
     - Gateways: implementations/gateways.md
     - Model Servers: implementations/model-servers.md
   - FAQ: faq.md
@@ -70,7 +71,7 @@ nav:
     - InferencePool Rollout: guides/inferencepool-rollout.md
     - Metrics and Observability: guides/metrics-and-observability.md
     - Configuration Guide:
-      - Configuring the plugins via configuration files or text: guides/epp-configuration/config-text.md
+      - Configuring the plugins via configuration files or text: guides/epp-configuration/config-text.md
     - Prefix Cache Aware Plugin: guides/epp-configuration/prefix-aware.md
     - Troubleshooting Guide: guides/troubleshooting.md
   - Implementer Guides:
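For context, the `content.code.annotate` feature added above enables Material for MkDocs code annotations, which the guide below relies on via `# (1)!` markers inside fenced code blocks. A minimal sketch of the syntax (hypothetical snippet, not part of this commit):

````markdown
```yaml
name: X-Gateway-Model-Name # (1)!
```

1. Annotation text shown when the `(1)` marker in the code block is clicked.
````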
Lines changed: 98 additions & 28 deletions
@@ -1,18 +1,53 @@
 # Serve multiple generative AI models
-A company wants to deploy multiple large language models (LLMs) to serve different workloads.
-For example, they might want to deploy a Gemma3 model for a chatbot interface and a Deepseek model for a recommendation application.
+
+A company wants to deploy multiple large language models (LLMs) to a cluster to serve different workloads.
+For example, they might want to deploy a Gemma3 model for a chatbot interface and a DeepSeek model for a recommendation application.
 The company needs to ensure optimal serving performance for these LLMs.
-By using an Inference Gateway, you can deploy these LLMs on your cluster with your chosen accelerator configuration in an `InferencePool`.
-You can then route requests based on the model name (such as "chatbot" and "recommender") and the `Criticality` property.
+By using an Inference Gateway, you can deploy these LLMs on your cluster with your chosen accelerator configuration in an `InferencePool`.
+You can then route requests based on the model name (such as `chatbot` and `recommender`) and the `Criticality` property.
 
 ## How
+
 The following diagram illustrates how an Inference Gateway routes requests to different models based on the model name.
-The model name is extracted by [Body-Based routing](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md)
+The model name is extracted by [Body-Based routing](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md) (BBR)
 from the request body to the header. The header is then matched to dispatch
 requests to different `InferencePool` (and their EPPs) instances.
 ![Serving multiple generative AI models](../images/serve-mul-gen-AI-models.png)
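Conceptually, BBR reads the JSON request body and copies its `model` field into the `X-Gateway-Model-Name` header before the route match runs. The following is a rough standalone sketch of that extraction step (illustration only, not the actual ExtProc implementation, which parses JSON properly):

```shell
# Illustration only: mimic BBR copying the "model" field from a request
# body into the X-Gateway-Model-Name header value.
body='{"model": "chatbot", "prompt": "What is the color of the sky"}'

# Pull out the value of the "model" key with sed; a real implementation
# uses a JSON parser instead of pattern matching.
model=$(printf '%s' "$body" | sed -n 's/.*"model"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p')

echo "X-Gateway-Model-Name: $model"
```

With the header populated this way, ordinary Gateway API header matching (shown below) is enough to dispatch requests per model.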
 
+### Deploy Body-Based Routing
+
+To enable body-based routing, you need to deploy the Body-Based Routing (BBR) ExtProc server using Helm. Depending on your Gateway provider, use one of the following commands:
+
+=== "GKE"
+
+    ```bash
+    helm install body-based-router \
+      --set provider.name=gke \
+      --version v0.5.1 \
+      oci://registry.k8s.io/gateway-api-inference-extension/charts/body-based-routing
+    ```
+
+=== "Istio"
+
+    ```bash
+    helm install body-based-router \
+      --set provider.name=istio \
+      --version v0.5.1 \
+      oci://registry.k8s.io/gateway-api-inference-extension/charts/body-based-routing
+    ```
+
+=== "Other"
+
+    ```bash
+    helm install body-based-router \
+      --version v0.5.1 \
+      oci://registry.k8s.io/gateway-api-inference-extension/charts/body-based-routing
+    ```
+
+### Configure HTTPRoute
+
 This conceptual example shows how to use the `HTTPRoute` object to route requests to an `InferencePool` based on a model name such as `chatbot` or `recommender`.
+
 ```yaml
 apiVersion: gateway.networking.k8s.io/v1
 kind: HTTPRoute
@@ -25,8 +60,7 @@ spec:
   - matches:
     - headers:
       - type: Exact
-        # Body-Based routing (https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md) is being used to copy the model name from the request body to the header.
-        name: X-Gateway-Model-Name
+        name: X-Gateway-Model-Name # (1)!
         value: chatbot
       path:
         type: PathPrefix
@@ -37,38 +71,74 @@ spec:
   - matches:
     - headers:
       - type: Exact
-        # Body-Based routing (https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md) is being used to copy the model name from the request body to the header.
-        name: X-Gateway-Model-Name
+        name: X-Gateway-Model-Name # (2)!
         value: recommender
       path:
         type: PathPrefix
         value: /
     backendRefs:
     - name: deepseek-r1
-      kind: InferencePool
+      kind: InferencePool
 ```
 
+1. [BBR](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md) copies the model name from the request body into the `X-Gateway-Model-Name` header. The `HTTPRoute` then matches this header to route requests to different `InferencePool` instances.
+2. [BBR](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/bbr/README.md) copies the model name from the request body into the `X-Gateway-Model-Name` header. The `HTTPRoute` then matches this header to route requests to different `InferencePool` instances.
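The `backendRefs` above point at `InferencePool` resources (such as `deepseek-r1`) that must already exist in the cluster. As a rough sketch only, such a pool might look like the following; the API version follows the v1alpha2 `InferencePool` API, while the labels, port, and EPP name are assumptions, not part of this commit:

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: deepseek-r1
spec:
  # Selects the model server Pods in this pool (label is an assumed example).
  selector:
    app: deepseek-r1
  # Port the model server listens on (assumed).
  targetPortNumber: 8000
  # Endpoint Picker (EPP) service for this pool (name assumed).
  extensionRef:
    name: deepseek-r1-epp
```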
+
 ## Try it out
 
 1. Get the gateway IP:
    ```bash
    IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}'); PORT=80
    ```
-2. Send a few requests to model "chatbot" as follows:
-   ```bash
-   curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{
-     "model": "chatbot",
-     "prompt": "What is the color of the sky",
-     "max_tokens": 100,
-     "temperature": 0
-   }'
-   ```
-3. Send a few requests to model "recommender" as follows:
-   ```bash
-   curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{
-     "model": "recommender",
-     "prompt": "Give me restaurant recommendations in Paris",
-     "max_tokens": 100,
-     "temperature": 0
-   }'
-   ```
+
+=== "Chat Completions API"
+
+    1. Send a few requests to model `chatbot` as follows:
+       ```bash
+       curl -X POST -i ${IP}:${PORT}/v1/chat/completions \
+         -H "Content-Type: application/json" \
+         -d '{
+           "model": "chatbot",
+           "messages": [{"role": "user", "content": "What is the color of the sky?"}],
+           "max_tokens": 100,
+           "temperature": 0
+         }'
+       ```
+
+    2. Send a few requests to model `recommender` as follows:
+       ```bash
+       curl -X POST -i ${IP}:${PORT}/v1/chat/completions \
+         -H "Content-Type: application/json" \
+         -d '{
+           "model": "recommender",
+           "messages": [{"role": "user", "content": "Give me restaurant recommendations in Paris"}],
+           "max_tokens": 100,
+           "temperature": 0
+         }'
+       ```
+
+=== "Completions API"
+
+    1. Send a few requests to model `chatbot` as follows:
+       ```bash
+       curl -X POST -i ${IP}:${PORT}/v1/completions \
+         -H 'Content-Type: application/json' \
+         -d '{
+           "model": "chatbot",
+           "prompt": "What is the color of the sky",
+           "max_tokens": 100,
+           "temperature": 0
+         }'
+       ```
+
+    2. Send a few requests to model `recommender` as follows:
+       ```bash
+       curl -X POST -i ${IP}:${PORT}/v1/completions \
+         -H 'Content-Type: application/json' \
+         -d '{
+           "model": "recommender",
+           "prompt": "Give me restaurant recommendations in Paris",
+           "max_tokens": 100,
+           "temperature": 0
+         }'
+       ```

0 commit comments
