
Commit 9266b5e

Updating the doc site (#1500)
* Updating the guides in the doc site
* adding priority and capacity section
1 parent 35b14a1 commit 9266b5e

17 files changed (+63, -486 lines)

mkdocs.yml

Lines changed: 4 additions & 3 deletions
@@ -56,6 +56,7 @@ nav:
 - Design Principles: concepts/design-principles.md
 - Conformance: concepts/conformance.md
 - Roles and Personas: concepts/roles-and-personas.md
+- Priority and Capacity: concepts/priority-and-capacity.md
 - Implementations:
 - Gateways: implementations/gateways.md
 - Model Servers: implementations/model-servers.md
@@ -65,13 +66,12 @@ nav:
 - Getting started: guides/index.md
 - Use Cases:
 - Serve Multiple GenAI models: guides/serve-multiple-genai-models.md
-- Serve Multiple LoRA adapters: guides/serve-multiple-lora-adapters.md
 - Rollout:
 - Adapter Rollout: guides/adapter-rollout.md
 - InferencePool Rollout: guides/inferencepool-rollout.md
 - Metrics and Observability: guides/metrics-and-observability.md
 - Configuration Guide:
-- Configuring the plugins via configuration files or text: guides/epp-configuration/config-text.md
+- Configuring the plugins via configuration YAML file: guides/epp-configuration/config-text.md
 - Prefix Cache Aware Plugin: guides/epp-configuration/prefix-aware.md
 - Troubleshooting Guide: guides/troubleshooting.md
 - Implementer Guides:
@@ -82,9 +82,10 @@ nav:
 - Regression Testing: performance/regression-testing/index.md
 - Reference:
 - API Reference: reference/spec.md
+- Alpha API Reference: reference/x-spec.md
 - API Types:
 - InferencePool: api-types/inferencepool.md
-- InferenceModel: api-types/inferencemodel.md
+- InferenceObjective: api-types/inferenceobjective.md
 - Enhancements:
 - Overview: gieps/overview.md
 - Contributing:

site-src/api-types/inferencemodel.md

Lines changed: 0 additions & 19 deletions
This file was deleted.

site-src/api-types/inferenceobjective.md

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
+# Inference Objective
+
+??? example "Alpha since v1.0.0"
+
+    The `InferenceObjective` resource is alpha and may have breaking changes in
+    future releases of the API.
+
+## Background
+
+The **InferenceObjective** API defines the serving objectives of the specific request it is associated with. This CRD currently houses only `Priority`, but will be expanded to include fields such as SLO attainment.
+
+## Spec
+
+The full spec of the InferenceObjective is defined [here](/reference/x-spec/#inferenceobjective).
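For concreteness, a minimal `InferenceObjective` manifest might look like the sketch below. This is illustrative only: the `v1alpha2` group/version, the `priority` and `poolRef` fields, and the resource names are assumptions based on the alpha API described above rather than content of this commit.

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2   # assumed alpha group/version
kind: InferenceObjective
metadata:
  name: food-review
spec:
  # Higher numbers mean higher priority; if unset, priority defaults to zero.
  priority: 10
  poolRef:
    name: vllm-llama3-8b-instruct
```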

site-src/api-types/inferencepool.md

Lines changed: 2 additions & 3 deletions
@@ -1,9 +1,8 @@
 # Inference Pool
 
-??? example "Alpha since v0.1.0"
+??? success example "GA since v1.0.0"
 
-    The `InferencePool` resource is alpha and may have breaking changes in
-    future releases of the API.
+    The `InferencePool` resource has graduated to v1 and is considered stable.
 
 ## Background
 
site-src/concepts/api-overview.md

Lines changed: 2 additions & 2 deletions
@@ -23,6 +23,6 @@ each aligning with a specific user persona in the Generative AI serving workflow
 
 InferencePool represents a set of Inference-focused Pods and an extension that will be used to route to them. Within the broader Gateway API resource model, this resource is considered a "backend". In practice, that means that you'd replace a Kubernetes Service with an InferencePool. This resource has some similarities to Service (a way to select Pods and specify a port), but has some unique capabilities. With InferencePool, you can configure a routing extension as well as inference-specific routing optimizations. For more information on this resource, refer to our [InferencePool documentation](/api-types/inferencepool) or go directly to the [InferencePool spec](/reference/spec/#inferencepool).
 
-### InferenceModel
+### InferenceObjective
 
-An InferenceModel represents a model or adapter, and configuration associated with that model. This resource enables you to configure the relative criticality of a model, and allows you to seamlessly translate the requested model name to one or more backend model names. Multiple InferenceModels can be attached to an InferencePool. For more information on this resource, refer to our [InferenceModel documentation](/api-types/inferencemodel) or go directly to the [InferenceModel spec](/reference/spec/#inferencemodel).
+An InferenceObjective represents the objectives of a specific request. A single InferenceObjective is associated with a request, and multiple requests with different InferenceObjectives can be attached to an InferencePool. For more information on this resource, refer to our [InferenceObjective documentation](/api-types/inferenceobjective) or go directly to the [InferenceObjective spec](/reference/spec/#inferenceobjective).
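To illustrate the "backend" role described above, the hedged sketch below shows an HTTPRoute that sends traffic to an InferencePool instead of a Service. The route and gateway names, and the `inference.networking.k8s.io` group used for the v1 backend reference, are assumptions rather than content of this commit.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
spec:
  parentRefs:
  - name: inference-gateway
  rules:
  - backendRefs:
    # The InferencePool takes the place of a Service as the route's backend.
    - group: inference.networking.k8s.io   # assumed v1 API group
      kind: InferencePool
      name: vllm-llama3-8b-instruct
```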

site-src/concepts/priority-and-capacity.md

Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
+# Priority and Capacity
+
+The InferenceObjective defines `Priority`, which describes how requests interact with one another. Priority interacts naturally with total pool capacity, and understanding and configuring these behaviors properly is important for allowing a pool to handle requests of different priorities.
+
+## Priority (in flow control)
+
+Note that priority is currently only used in [Capacity](#capacity); the description below is how Priority will be consumed in the `Flow Control` model.
+
+Priority is a simple stack rank; the higher the number, the higher the priority. If no priority is specified for a request, the default value is zero. Requests of higher priority are _always_ selected first when requests are queued. Requests of equal priority currently operate on a FCFS basis.
+
+## Capacity
+
+The current capacity model uses configurable [thresholds](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/35b14a10a9830d1a9e3850913539066ebc8fb317/pkg/epp/saturationdetector/saturationdetector.go#L49) to determine whether the entire pool is saturated. The calculation simply iterates through each endpoint in the pool; if all endpoints are above all thresholds, the pool is considered `saturated`. When the pool is saturated, all requests with a negative priority are rejected, while other requests are scheduled and queued on the model servers.
+
+## Future work
+
+The Flow Control system is nearing completion and will add more nuance to the Priority and Capacity model: proper priority enforcement, more fine-grained capacity tracking, queuing at the Inference Gateway level, and so on. This documentation will be updated when the Flow Control implementation is complete.
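As a rough illustration of the behavior described above, two objectives could be defined so that low-priority batch traffic is shed first when the pool saturates. The group/version, field names, and object names below are assumptions based on the alpha API, not content of this commit.

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2   # assumed alpha group/version
kind: InferenceObjective
metadata:
  name: interactive-chat
spec:
  priority: 10    # selected ahead of lower-priority requests when queued
  poolRef:
    name: vllm-llama3-8b-instruct
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceObjective
metadata:
  name: batch-summarization
spec:
  priority: -1    # negative priority: rejected while the pool is saturated
  poolRef:
    name: vllm-llama3-8b-instruct
```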

site-src/concepts/roles-and-personas.md

Lines changed: 1 addition & 1 deletion
@@ -17,7 +17,7 @@ The Inference Platform Admin creates and manages the infrastructure necessary to
 
 An Inference Workload Owner persona owns and manages one or many Generative AI Workloads (LLM focused *currently*). This includes:
 
-- Defining criticality
+- Defining priority
 - Managing fine-tunes
 - LoRA Adapters
 - System Prompts

site-src/guides/adapter-rollout.md

Lines changed: 3 additions & 50 deletions
@@ -3,7 +3,6 @@
 The goal of this guide is to show you how to perform incremental roll out operations,
 which gradually deploy new versions of your inference infrastructure.
 You can update LoRA adapters and Inference Pool with minimal service disruption.
-This page also provides guidance on traffic splitting and rollbacks to help ensure reliable deployments for LoRA adapters rollout.
 
 LoRA adapter rollouts let you deploy new versions of LoRA adapters in phases,
 without altering the underlying base model or infrastructure.
@@ -49,36 +48,7 @@ data:
 
 The new adapter version is applied to the model servers live, without requiring a restart.
 
-
-### Direct traffic to the new adapter version
-
-Modify the InferenceModel to configure a canary rollout with traffic splitting. In this example, 10% of traffic for food-review model will be sent to the new ***food-review-2*** adapter.
-
-
-```bash
-kubectl edit inferencemodel food-review
-```
-
-Change the targetModels list in InferenceModel to match the following:
-
-
-```yaml
-apiVersion: inference.networking.x-k8s.io/v1alpha2
-kind: InferenceModel
-metadata:
-  name: food-review
-spec:
-  criticality: 1
-  poolRef:
-    name: vllm-llama3-8b-instruct
-  targetModels:
-  - name: food-review-1
-    weight: 90
-  - name: food-review-2
-    weight: 10
-```
-
-The above configuration means one in every ten requests should be sent to the new version. Try it out:
+Try it out:
 
 1. Get the gateway IP:
 ```bash
@@ -88,7 +58,7 @@ IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].va
 2. Send a few requests as follows:
 ```bash
 curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{
-  "model": "food-review",
+  "model": "food-review-2",
   "prompt": "Write as if you were a critic: San Francisco",
   "max_tokens": 100,
   "temperature": 0
@@ -97,23 +67,6 @@ curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{
 
 ### Finish the rollout
 
-
-Modify the InferenceModel to direct 100% of the traffic to the latest version of the adapter.
-
-```yaml
-apiVersion: inference.networking.x-k8s.io/v1alpha2
-kind: InferenceModel
-metadata:
-  name: food-review
-spec:
-  criticality: 1
-  poolRef:
-    name: vllm-llama3-8b-instruct
-  targetModels:
-  - name: food-review-2
-    weight: 100
-```
-
 Unload the older versions from the servers by updating the LoRA syncer ConfigMap to list the older version under the `ensureNotExist` list:
 
 ```yaml
@@ -137,5 +90,5 @@ data:
         source: Kawon/llama3.1-food-finetune_v14_r8
 ```
 
-With this, all requests should be served by the new adapter version.
+With this, the new adapter version should be available for all incoming requests.
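For context on the `ensureNotExist` step above, the LoRA syncer ConfigMap (largely elided in this diff) declares which adapters should and should not be loaded. The sketch below is an assumption about its overall shape; only the adapter names and the `source` value appear in the guide itself, and keys such as `vLLMLoRAConfig` and `port` are hypothetical here.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: vllm-llama3-8b-instruct-adapters
data:
  configmap.yaml: |
    vLLMLoRAConfig:            # assumed top-level key
      name: vllm-llama3-8b-instruct-adapters
      port: 8000               # assumed model server port
      ensureExist:
        models:
        - id: food-review-2
          source: Kawon/llama3.1-food-finetune_v14_r8
      ensureNotExist:
        models:
        - id: food-review-1
          source: Kawon/llama3.1-food-finetune_v14_r8
```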

site-src/guides/epp-configuration/config-text.md

Lines changed: 9 additions & 15 deletions
@@ -1,17 +1,14 @@
-# Configuring Plugins via text
+# Configuring Plugins via YAML
 
 The set of lifecycle hooks (plugins) that are used by the Inference Gateway (IGW) is determined by how
-it is configured. The IGW can be configured in several ways, either by code or via text.
+it is configured. The IGW is primarily configured via a configuration file.
 
-If configured by code either a set of predetermined environment variables must be used or one must
-fork the IGW and change code.
-
-A simpler way to congigure the IGW is to use a text based configuration. This text is in YAML format
-and can either be in a file or specified in-line as a parameter. The configuration defines the set of
+The YAML file can either be specified as a path to a file or in-line as a parameter. The configuration defines the set of
 plugins to be instantiated along with their parameters. Each plugin can also be given a name, enabling
-the same plugin type to be instantiated multiple times, if needed.
+the same plugin type to be instantiated multiple times, if needed (such as when configuring multiple scheduling profiles).
 
-Also defined is a set of SchedulingProfiles, which determine the set of plugins to be used when scheduling a request. If one is not defailed, a default one names `default` will be added and will reference all of the
+Also defined is a set of SchedulingProfiles, which determine the set of plugins to be used when scheduling a request.
+If no scheduling profile is specified, a default profile named `default` will be added and will reference all of the
 instantiated plugins.
 
 The set of plugins instantiated can include a Profile Handler, which determines which SchedulingProfiles
@@ -22,12 +19,9 @@ In addition, the set of instantiated plugins can also include a picker, which ch
 the request is scheduled after filtering and scoring. If one is not referenced in a SchedulingProfile, an
 instance of `MaxScorePicker` will be added to the SchedulingProfile in question.
 
-It should be noted that while the configuration text looks like a Kubernetes Custom Resource, it is
-**NOT** a Kubernetes Custom Resource. Kubernetes infrastructure is used to load the configuration
-text and in the future will also help in versioning the text.
-
-It should also be noted that even when the configuration text is loaded from a file, it is loaded at
-the Endpoint-Picker's (EPP) startup and changes to the file at runtime are ignored.
+***NOTE***: While the configuration text looks like a Kubernetes CRD, it is
+**NOT** a Kubernetes CRD. Specifically, the config is not reconciled upon, and is only read on startup.
+This behavior is intentional, as augmenting the scheduling config without redeploying the EPP is not supported.
 
 The configuration text has the following form:
 ```yaml
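The body of the configuration is truncated in this diff view. As a hedged sketch of what such a file can look like, assuming an `EndpointPickerConfig` kind and illustrative plugin type names that are not taken from this commit:

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha1   # assumed group/version
kind: EndpointPickerConfig                           # assumed kind
plugins:
- type: single-profile-handler    # illustrative plugin type names
- type: prefix-cache-scorer
  parameters:
    hashBlockSize: 64
- type: max-score-picker
schedulingProfiles:
- name: default
  plugins:
  - pluginRef: prefix-cache-scorer
    weight: 1
  - pluginRef: max-score-picker
```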

site-src/guides/implementers.md

Lines changed: 2 additions & 2 deletions
@@ -157,8 +157,8 @@ An example of a similar approach is Kuadrant’s [WASM Shim](https://github.com/
 Here are some tips for testing your controller end-to-end:
 
 - **Focus on Key Scenarios**: Add common scenarios like creating, updating, and deleting InferencePool resources, as well as different routing rules that target InferencePool backends.
-- **Verify Routing Behaviors**: Design more complex routing scenarios and verify that requests are correctly routed to the appropriate model server pods within the InferencePool based on the InferenceModel configuration.
-- **Test Error Handling**: Verify that the controller correctly handles scenarios like unsupported model names or resource constraints (if criticality-based shedding is implemented). Test with state transitions (such as constant requests while Pods behind EPP are being replaced and Pods behind InferencePool are being replaced) to ensure that the system is resilient to failures and can automatically recover by redirecting traffic to healthy Pods.
+- **Verify Routing Behaviors**: Design more complex routing scenarios and verify that requests are correctly routed to the appropriate model server pods within the InferencePool.
+- **Test Error Handling**: Verify that the controller correctly handles scenarios like unsupported model names or resource constraints (if priority-based shedding is implemented). Test with state transitions (such as constant requests while Pods behind EPP are being replaced and Pods behind InferencePool are being replaced) to ensure that the system is resilient to failures and can automatically recover by redirecting traffic to healthy Pods.
 - **Using Reference EPP Implementation + Echoserver**: You can use the [reference EPP implementation](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/pkg/epp) for testing your controller end-to-end. Instead of a full-fledged model server, a simple mock server (like the [echoserver](https://github.com/kubernetes-sigs/ingress-controller-conformance/tree/master/images/echoserver)) can be very useful for verifying routing to ensure the correct pod received the request.
 - **Performance Test**: Run end-to-end [benchmarks](https://gateway-api-inference-extension.sigs.k8s.io/performance/benchmark/) to make sure that your inference gateway can achieve the latency target that is desired.
