From c4256065ea6732427d740f5971907e093923da38 Mon Sep 17 00:00:00 2001 From: Kellen Swain Date: Thu, 28 Aug 2025 21:16:32 +0000 Subject: [PATCH 1/2] Updating the guides in the doc site --- mkdocs.yml | 7 +- site-src/api-types/inferencemodel.md | 19 -- site-src/api-types/inferenceobjective.md | 14 + site-src/api-types/inferencepool.md | 5 +- site-src/concepts/api-overview.md | 4 +- site-src/concepts/priority-and-capacity.md | 11 + site-src/concepts/roles-and-personas.md | 2 +- site-src/guides/adapter-rollout.md | 53 +--- .../guides/epp-configuration/config-text.md | 24 +- site-src/guides/implementers.md | 4 +- site-src/guides/index.md | 2 +- site-src/guides/inferencepool-rollout.md | 282 +----------------- .../guides/serve-multiple-genai-models.md | 2 +- .../guides/serve-multiple-lora-adapters.md | 98 ------ site-src/guides/troubleshooting.md | 2 +- site-src/index.md | 8 +- site-src/reference/x-spec.md | 6 - 17 files changed, 57 insertions(+), 486 deletions(-) delete mode 100644 site-src/api-types/inferencemodel.md create mode 100644 site-src/api-types/inferenceobjective.md create mode 100644 site-src/concepts/priority-and-capacity.md delete mode 100644 site-src/guides/serve-multiple-lora-adapters.md diff --git a/mkdocs.yml b/mkdocs.yml index 78f1cb81c..8262f6fb0 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -56,6 +56,7 @@ nav: Design Principles: concepts/design-principles.md Conformance: concepts/conformance.md Roles and Personas: concepts/roles-and-personas.md + Priority and Capacity: concepts/priority-and-capacity.md - Implementations: - Gateways: implementations/gateways.md - Model Servers: implementations/model-servers.md @@ -65,13 +66,12 @@ nav: - Getting started: guides/index.md - Use Cases: - Serve Multiple GenAI models: guides/serve-multiple-genai-models.md - - Serve Multiple LoRA adapters: guides/serve-multiple-lora-adapters.md - Rollout: - Adapter Rollout: guides/adapter-rollout.md - InferencePool Rollout: guides/inferencepool-rollout.md - Metrics and Observability: guides/metrics-and-observability.md - Configuration Guide: - - Configuring the plugins via configuration files or text: guides/epp-configuration/config-text.md + - Configuring the plugins via configuration YAML file: guides/epp-configuration/config-text.md - Prefix Cache Aware Plugin: guides/epp-configuration/prefix-aware.md - Troubleshooting Guide: guides/troubleshooting.md - Implementer Guides: @@ -82,9 +82,10 @@ nav: - Regression Testing: performance/regression-testing/index.md - Reference: - API Reference: reference/spec.md + - Alpha API Reference: reference/x-spec.md - API Types: - InferencePool: api-types/inferencepool.md - - InferenceModel: api-types/inferencemodel.md + - InferenceObjective: api-types/inferenceobjective.md - Enhancements: - Overview: gieps/overview.md - Contributing: diff --git a/site-src/api-types/inferencemodel.md b/site-src/api-types/inferencemodel.md deleted file mode 100644 index fbf27ec43..000000000 --- a/site-src/api-types/inferencemodel.md +++ /dev/null @@ -1,19 +0,0 @@ -# Inference Model - -??? example "Alpha since v0.1.0" - - The `InferenceModel` resource is alpha and may have breaking changes in - future releases of the API. - -## Background - -An InferenceModel allows the Inference Workload Owner to define: - -- Which Model/LoRA adapter(s) to consume. - - Mapping from a client facing model name to the target model name in the InferencePool. 
-  - InferenceModel allows for traffic splitting between adapters _in the same InferencePool_ to allow for new LoRA adapter versions to be easily rolled out.
-- Criticality of the requests to the InferenceModel.
-
-## Spec
-
-The full spec of the InferenceModel is defined [here](/reference/x-spec/#inferencemodel).
\ No newline at end of file
diff --git a/site-src/api-types/inferenceobjective.md b/site-src/api-types/inferenceobjective.md
new file mode 100644
index 000000000..9b7fe744c
--- /dev/null
+++ b/site-src/api-types/inferenceobjective.md
@@ -0,0 +1,14 @@
+# Inference Objective
+
+??? example "Alpha since v1.0.0"
+
+    The `InferenceObjective` resource is alpha and may have breaking changes in
+    future releases of the API.
+
+## Background
+
+The **InferenceObjective** API defines a set of serving objectives for the specific request it is associated with. This CRD currently houses only `Priority` but will be expanded to include fields such as SLO attainment.
+
+## Spec
+
+The full spec of the InferenceObjective is defined [here](/reference/x-spec/#inferenceobjective).
\ No newline at end of file
diff --git a/site-src/api-types/inferencepool.md b/site-src/api-types/inferencepool.md
index c4481b1ad..8922d0d11 100644
--- a/site-src/api-types/inferencepool.md
+++ b/site-src/api-types/inferencepool.md
@@ -1,9 +1,8 @@
 # Inference Pool
 
-??? example "Alpha since v0.1.0"
+??? success example "GA since v1.0.0"
 
-    The `InferencePool` resource is alpha and may have breaking changes in
-    future releases of the API.
+    The `InferencePool` resource has been graduated to v1 and is considered stable.
 
 ## Background
 
diff --git a/site-src/concepts/api-overview.md b/site-src/concepts/api-overview.md
index ab07a1d2d..01ee25431 100644
--- a/site-src/concepts/api-overview.md
+++ b/site-src/concepts/api-overview.md
@@ -23,6 +23,6 @@ each aligning with a specific user persona in the Generative AI serving workflow
 
 InferencePool represents a set of Inference-focused Pods and an extension that will be used to route to them. Within the broader Gateway API resource model, this resource is considered a "backend". In practice, that means that you'd replace a Kubernetes Service with an InferencePool. This resource has some similarities to Service (a way to select Pods and specify a port), but has some unique capabilities. With InferencePool, you can configure a routing extension as well as inference-specific routing optimizations. For more information on this resource, refer to our [InferencePool documentation](/api-types/inferencepool) or go directly to the [InferencePool spec](/reference/spec/#inferencepool).
 
-### InferenceModel
+### InferenceObjective
 
-An InferenceModel represents a model or adapter, and configuration associated with that model. This resource enables you to configure the relative criticality of a model, and allows you to seamlessly translate the requested model name to one or more backend model names. Multiple InferenceModels can be attached to an InferencePool. For more information on this resource, refer to our [InferenceModel documentation](/api-types/inferencemodel) or go directly to the [InferenceModel spec](/reference/spec/#inferencemodel).
+An InferenceObjective represents the objectives of a specific request. A single InferenceObjective is associated with a request, and multiple requests with different InferenceObjectives can be attached to an InferencePool.
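+For illustration only, a minimal InferenceObjective might look like the following sketch (the API version and field names here are a best-effort rendering of the alpha API, and the pool name reuses the `vllm-llama3-8b-instruct` example used elsewhere in these docs; consult the spec linked below for the authoritative schema):
+
+```yaml
+apiVersion: inference.networking.x-k8s.io/v1alpha2
+kind: InferenceObjective
+metadata:
+  name: chat-objective
+spec:
+  # Higher numbers mean higher priority; unspecified priority defaults to 0.
+  priority: 10
+  # The InferencePool whose requests this objective applies to.
+  poolRef:
+    name: vllm-llama3-8b-instruct
+```
+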
+For more information on this resource, refer to our [InferenceObjective documentation](/api-types/inferenceobjective) or go directly to the [InferenceObjective spec](/reference/x-spec/#inferenceobjective).
diff --git a/site-src/concepts/priority-and-capacity.md b/site-src/concepts/priority-and-capacity.md
new file mode 100644
index 000000000..b6b171516
--- /dev/null
+++ b/site-src/concepts/priority-and-capacity.md
@@ -0,0 +1,11 @@
+# Priority and Capacity
+
+The InferenceObjective creates the definition of `Priority`, which describes how requests interact with each other. Priority naturally interacts with total pool capacity, and properly understanding and configuring these behaviors is important in allowing a pool to handle requests of different priorities.
+
+# Priority
+
+Priority is a simple stack rank; the higher the number, the higher the priority. Should no priority for a request be specified, the default value is zero. Requests of higher priority are _always_ selected first when requests are queued. Requests of equal priority currently operate on a FCFS basis.
+
+# Capacity
+
+The current capacity model uses configurable [thresholds](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/35b14a10a9830d1a9e3850913539066ebc8fb317/pkg/epp/saturationdetector/saturationdetector.go#L49) to determine if the entire pool is saturated. The calculation is to simply iterate through each endpoint in the pool, and if all are above all thresholds, the pool is considered `saturated`. In the event of saturation, all requests with a negative priority will be rejected, and other requests will be scheduled and queued on the model server.
\ No newline at end of file
diff --git a/site-src/concepts/roles-and-personas.md b/site-src/concepts/roles-and-personas.md
index 0746adbfb..f1d17a59d 100644
--- a/site-src/concepts/roles-and-personas.md
+++ b/site-src/concepts/roles-and-personas.md
@@ -17,7 +17,7 @@ The Inference Platform Admin creates and manages the infrastructure necessary to
 
 An Inference Workload Owner persona owns and manages one or many Generative AI Workloads (LLM focused *currently*). This includes:
 
-- Defining criticality
+- Defining priority
 - Managing fine-tunes
   - LoRA Adapters
   - System Prompts
diff --git a/site-src/guides/adapter-rollout.md b/site-src/guides/adapter-rollout.md
index 0936d2913..7d6611c92 100644
--- a/site-src/guides/adapter-rollout.md
+++ b/site-src/guides/adapter-rollout.md
@@ -3,7 +3,6 @@
 
 The goal of this guide is to show you how to perform incremental roll out operations, which gradually deploy new versions of your inference infrastructure. You can update LoRA adapters and Inference Pool with minimal service disruption.
 
-This page also provides guidance on traffic splitting and rollbacks to help ensure reliable deployments for LoRA adapters rollout.
 
 LoRA adapter rollouts let you deploy new versions of LoRA adapters in phases, without altering the underlying base model or infrastructure.
 
@@ -49,36 +48,7 @@ data:
 
 The new adapter version is applied to the model servers live, without requiring a restart.
 
-
-### Direct traffic to the new adapter version
-
-Modify the InferenceModel to configure a canary rollout with traffic splitting. In this example, 10% of traffic for food-review model will be sent to the new ***food-review-2*** adapter. 
- - -```bash -kubectl edit inferencemodel food-review -``` - -Change the targetModels list in InferenceModel to match the following: - - -```yaml -apiVersion: inference.networking.x-k8s.io/v1alpha2 -kind: InferenceModel -metadata: - name: food-review -spec: - criticality: 1 - poolRef: - name: vllm-llama3-8b-instruct - targetModels: - - name: food-review-1 - weight: 90 - - name: food-review-2 - weight: 10 -``` - -The above configuration means one in every ten requests should be sent to the new version. Try it out: +Try it out: 1. Get the gateway IP: ```bash @@ -88,7 +58,7 @@ IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].va 2. Send a few requests as follows: ```bash curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{ -"model": "food-review", +"model": "food-review-2", "prompt": "Write as if you were a critic: San Francisco", "max_tokens": 100, "temperature": 0 @@ -97,23 +67,6 @@ curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{ ### Finish the rollout - -Modify the InferenceModel to direct 100% of the traffic to the latest version of the adapter. - -```yaml -apiVersion: inference.networking.x-k8s.io/v1alpha2 -kind: InferenceModel -metadata: - name: food-review -spec: - criticality: 1 - poolRef: - name: vllm-llama3-8b-instruct - targetModels: - - name: food-review-2 - weight: 100 -``` - Unload the older versions from the servers by updating the LoRA syncer ConfigMap to list the older version under the `ensureNotExist` list: ```yaml @@ -137,5 +90,5 @@ data: source: Kawon/llama3.1-food-finetune_v14_r8 ``` -With this, all requests should be served by the new adapter version. +With this, the new adapter version should be available for all incoming requests. diff --git a/site-src/guides/epp-configuration/config-text.md b/site-src/guides/epp-configuration/config-text.md index 6df19db80..cdb3714cb 100644 --- a/site-src/guides/epp-configuration/config-text.md +++ b/site-src/guides/epp-configuration/config-text.md @@ -1,17 +1,14 @@ -# Configuring Plugins via text +# Configuring Plugins via YAML The set of lifecycle hooks (plugins) that are used by the Inference Gateway (IGW) is determined by how -it is configured. The IGW can be configured in several ways, either by code or via text. +it is configured. The IGW is primarily configured via a configuration file. -If configured by code either a set of predetermined environment variables must be used or one must -fork the IGW and change code. - -A simpler way to congigure the IGW is to use a text based configuration. This text is in YAML format -and can either be in a file or specified in-line as a parameter. The configuration defines the set of +The YAML file can either be specified as a path to a file or in-line as a parameter. The configuration defines the set of plugins to be instantiated along with their parameters. Each plugin can also be given a name, enabling -the same plugin type to be instantiated multiple times, if needed. +the same plugin type to be instantiated multiple times, if needed (such as when configuring multiple scheduling profiles). -Also defined is a set of SchedulingProfiles, which determine the set of plugins to be used when scheduling a request. If one is not defailed, a default one names `default` will be added and will reference all of the +Also defined is a set of SchedulingProfiles, which determine the set of plugins to be used when scheduling a request. 
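+As a rough sketch only (the `kind`/`apiVersion`, plugin names, and weights below are illustrative placeholders rather than a definitive list; the authoritative form is shown later on this page), a configuration declares plugins and the profiles that reference them:
+
+```yaml
+apiVersion: inference.networking.x-k8s.io/v1alpha1
+kind: EndpointPickerConfig
+plugins:
+  # Each entry instantiates a plugin; an optional name lets the same type
+  # be instantiated more than once under different names.
+- type: queue-scorer
+- type: prefix-cache-scorer
+  name: prefix-cache
+schedulingProfiles:
+  # A profile lists, by reference, the plugins used to schedule a request.
+- name: default
+  plugins:
+  - pluginRef: queue-scorer
+  - pluginRef: prefix-cache
+    weight: 2
+```
+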
+If no scheduling profile is specified, a default profile named `default` will be added and will reference all of the
 instantiated plugins.
 
 The set of plugins instantiated can include a Profile Handler, which determines which SchedulingProfiles
@@ -22,12 +19,9 @@ In addition, the set of instantiated plugins can also include a picker, which ch
 the request is scheduled after filtering and scoring. If one is not referenced in a SchedulingProfile, an
 instance of `MaxScorePicker` will be added to the SchedulingProfile in question.
 
-It should be noted that while the configuration text looks like a Kubernetes Custom Resource, it is
-**NOT** a Kubernetes Custom Resource. Kubernetes infrastructure is used to load the configuration
-text and in the future will also help in versioning the text.
-
-It should also be noted that even when the configuration text is loaded from a file, it is loaded at
-the Endpoint-Picker's (EPP) startup and changes to the file at runtime are ignored.
+***NOTE***: While the configuration text looks like a Kubernetes CRD, it is
+**NOT** a Kubernetes CRD. Specifically, the config is not reconciled upon, and is only read on startup.
+This behavior is intentional: changing the scheduling configuration at runtime, without redeploying the EPP, is not supported.
 
 The configuration text has the following form:
 ```yaml
diff --git a/site-src/guides/implementers.md b/site-src/guides/implementers.md
index 6fce01657..d42fa8d91 100644
--- a/site-src/guides/implementers.md
+++ b/site-src/guides/implementers.md
@@ -157,8 +157,8 @@ An example of a similar approach is Kuadrant’s [WASM Shim](https://github.com/
 Here are some tips for testing your controller end-to-end:
 
 - **Focus on Key Scenarios**: Add common scenarios like creating, updating, and deleting InferencePool resources, as well as different routing rules that target InferencePool backends.
-- **Verify Routing Behaviors**: Design more complex routing scenarios and verify that requests are correctly routed to the appropriate model server pods within the InferencePool based on the InferenceModel configuration.
-- **Test Error Handling**: Verify that the controller correctly handles scenarios like unsupported model names or resource constraints (if criticality-based shedding is implemented). Test with state transitions (such as constant requests while Pods behind EPP are being replaced and Pods behind InferencePool are being replaced) to ensure that the system is resilient to failures and can automatically recover by redirecting traffic to healthy Pods.
+- **Verify Routing Behaviors**: Design more complex routing scenarios and verify that requests are correctly routed to the appropriate model server pods within the InferencePool.
+- **Test Error Handling**: Verify that the controller correctly handles scenarios like unsupported model names or resource constraints (if priority-based shedding is implemented). Test with state transitions (such as constant requests while Pods behind EPP are being replaced and Pods behind InferencePool are being replaced) to ensure that the system is resilient to failures and can automatically recover by redirecting traffic to healthy Pods.
 - **Using Reference EPP Implementation + Echoserver**: You can use the [reference EPP implementation](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/pkg/epp) for testing your controller end-to-end.
Instead of a full-fledged model server, a simple mock server (like the [echoserver](https://github.com/kubernetes-sigs/ingress-controller-conformance/tree/master/images/echoserver)) can be very useful for verifying routing to ensure the correct pod received the request. - **Performance Test**: Run end-to-end [benchmarks](https://gateway-api-inference-extension.sigs.k8s.io/performance/benchmark/) to make sure that your inference gateway can achieve the latency target that is desired. diff --git a/site-src/guides/index.md b/site-src/guides/index.md index da8063298..3d47ff33a 100644 --- a/site-src/guides/index.md +++ b/site-src/guides/index.md @@ -349,7 +349,7 @@ Tooling: The following instructions assume you would like to cleanup ALL resources that were created in this quickstart guide. Please be careful not to delete resources you'd like to keep. - 1. Uninstall the InferencePool, InferenceModel, and model server resources + 1. Uninstall the InferencePool, InferenceObjective and model server resources ```bash helm uninstall vllm-llama3-8b-instruct diff --git a/site-src/guides/inferencepool-rollout.md b/site-src/guides/inferencepool-rollout.md index f3d929466..b274a7262 100644 --- a/site-src/guides/inferencepool-rollout.md +++ b/site-src/guides/inferencepool-rollout.md @@ -34,7 +34,6 @@ teams can ensure stability and performance, quickly identifying and reverting an 1. **Deploy new infrastructure**: Create a new InferencePool configured with the new node(compute/accelerator) / model server / base model that you chose. 1. **Configure traffic splitting**: Use an HTTPRoute to split traffic between the existing InferencePool and the new InferencePool. The `backendRefs.weight` field controls the traffic percentage allocated to each pool. -1. **Maintain InferenceModel integrity**: Retain the existing InferenceModel configuration to ensure uniform model behavior across both node configurations or base model versions or model server versions. 1. **Preserve rollback capability**: Retain the original nodes and InferencePool during the roll out to facilitate a rollback if necessary. ## Example @@ -45,276 +44,10 @@ Follow the steps in the [main guide](index.md) ### Deploy new infrastructure You start with an existing InferencePool named vllm-llama3-8b-instruct. -To replace the original InferencePool, you create a new InferencePool named vllm-llama3-8b-instruct-new along with -InferenceModels and Endpoint Picker Extension configured with the updated node specifications of `nvidia-h100-80gb` accelerator type, +To replace the original InferencePool, you create a new InferencePool, configured to select the pods with the `nvidia-h100-80gb` accelerator type. -```yaml -kubectl apply -f - < ``` With this, all requests should be served by the new Inference Pool. diff --git a/site-src/guides/serve-multiple-genai-models.md b/site-src/guides/serve-multiple-genai-models.md index 92bfedaca..a2e4e51d5 100644 --- a/site-src/guides/serve-multiple-genai-models.md +++ b/site-src/guides/serve-multiple-genai-models.md @@ -4,7 +4,7 @@ A company wants to deploy multiple large language models (LLMs) to a cluster to For example, they might want to deploy a Gemma3 model for a chatbot interface and a DeepSeek model for a recommendation application. The company needs to ensure optimal serving performance for these LLMs. By using an Inference Gateway, you can deploy these LLMs on your cluster with your chosen accelerator configuration in an `InferencePool`. 
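+As a quick sketch of what this looks like from the client side (the model name `chatbot` is just the illustrative example used in this guide, and the `IP`/`PORT` variables follow the getting-started guide), a request simply names the model it wants:
+
+```bash
+curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{
+"model": "chatbot",
+"prompt": "Help me plan a three day trip",
+"max_tokens": 100,
+"temperature": 0
+}'
+```
+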
-You can then route requests based on the model name (such as `chatbot` and `recommender`) and the `Criticality` property. +You can then route requests based on the model name (such as `chatbot` and `recommender`). ## How diff --git a/site-src/guides/serve-multiple-lora-adapters.md b/site-src/guides/serve-multiple-lora-adapters.md deleted file mode 100644 index 59cfe7208..000000000 --- a/site-src/guides/serve-multiple-lora-adapters.md +++ /dev/null @@ -1,98 +0,0 @@ -# Serve LoRA adapters on a shared pool -A company wants to serve LLMs for document analysis and focuses on audiences in multiple languages, such as English and Spanish. -They have a fine-tuned LoRA adapter for each language, but need to efficiently use their GPU and TPU capacity. -You can use an Inference Gateway to deploy dynamic LoRA fine-tuned adapters for each language (for example, `english-bot` and `spanish-bot`) on a common base model and accelerator. -This lets you reduce the number of required accelerators by densely packing multiple models in a shared pool. - -## How -The following diagram illustrates how Inference Gateway serves multiple LoRA adapters on a shared pool. -![Serving LoRA adapters on a shared pool](../images/serve-LoRA-adapters.png) -This example illustrates how you can densely serve multiple LoRA adapters with distinct workload performance objectives on a common InferencePool. -```yaml -apiVersion: gateway.networking.x-k8s.io/v1alpha1 -kind: InferencePool -metadata: - name: gemma3 -spec: - selector: - pool: gemma3 -``` -Let us say we have a couple of LoRA adapters named “english-bot” and “spanish-bot” for the Gemma3 base model. -You can create an `InferenceModel` resource and associate these LoRA adapters to the relevant InferencePool resource. -In this case, we associate these LoRA adapters to the gemma3 InferencePool resource created above. - -```yaml -apiVersion: inference.networking.x-k8s.io/v1alpha2 -kind: InferenceModel -metadata: - name: english-bot -spec: - criticality: 1 - poolRef: - name: gemma3 - ---- -apiVersion: inference.networking.x-k8s.io/v1alpha2 -kind: InferenceModel -metadata: - name: spanish-bot -spec: - criticality: 2 - poolRef: - name: gemma3 - -``` -Now, you can route your requests from the gateway using the `HTTPRoute` object. -```yaml -apiVersion: gateway.networking.k8s.io/v1 -kind: Gateway -metadata: - name: inference-gateway -spec: - listeners: - - protocol: HTTP - port: 80 - name: http - ---- -kind: HTTPRoute -apiVersion: gateway.networking.k8s.io/v1 -metadata: - name: routes-to-llms -spec: - parentRefs: - - name: inference-gateway - rules: - - matches: - path: - type: PathPrefix - value: / - backendRefs: - - name: gemma3 - kind: InferencePool -``` - -## Try it out - -1. Get the gateway IP: -```bash -IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}'); PORT=80 -``` -2. Send a few requests to model "english-bot" as follows: -```bash -curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{ -"model": "english-bot", -"prompt": "What is the color of the sky", -"max_tokens": 100, -"temperature": 0 -}' -``` -3. 
Send a few requests to model "spanish-bot" as follows: -```bash -curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{ -"model": "spanish-bot", -"prompt": "¿De qué color es...?", -"max_tokens": 100, -"temperature": 0 -}' -``` \ No newline at end of file diff --git a/site-src/guides/troubleshooting.md b/site-src/guides/troubleshooting.md index c741d59bf..1f08619f7 100644 --- a/site-src/guides/troubleshooting.md +++ b/site-src/guides/troubleshooting.md @@ -16,7 +16,7 @@ This is a default gateway error, meaning the request never reached a backend ser ## 429 Too Many Requests ### `system saturated, sheddable request dropped` -This error indicates that the entire inference pool has exceeded its saturation thresholds. This means the system is under heavy load and is shedding non-critical requests. To address this, check the following: +This error indicates that the entire inference pool has exceeded its saturation thresholds. This means the system is under heavy load and is shedding low priority requests. To address this, check the following: * gateway-api-inference-extension version: * **v0.5.1 and earlier**: Verify you're using an `InferenceModel` and that its `criticality` is set to `Critical`. This ensures requests are queued on the model servers instead of being dropped. diff --git a/site-src/index.md b/site-src/index.md index a2892c5b8..cf1ddb32a 100644 --- a/site-src/index.md +++ b/site-src/index.md @@ -1,11 +1,5 @@ # Introduction -???+ warning - - - Some portions of this site may be out of date with the v1.0.0 release candidate. - Updates under active development! - Gateway API Inference Extension is an official Kubernetes project that optimizes self-hosting Generative Models on Kubernetes. The overall resource model focuses on 2 new inference-focused @@ -49,7 +43,7 @@ in a higher level **AI Gateways** like [LiteLLM](https://www.litellm.ai/), [Gloo - **Model-aware routing**: Instead of simply routing based on the path of the request, an **[inference gateway]** allows you to route to models based on the model names. This is enabled by support for GenAI Inference API specifications (such as OpenAI API) in the gateway implementations such as in Envoy Proxy. This model-aware routing also extends to Low-Rank Adaptation (LoRA) fine-tuned models. -- **Serving priority**: an **[inference gateway]** allows you to specify the serving priority of your models. For example, you can specify that your models for online inference of chat tasks (which is more latency sensitive) have a higher [*Criticality*](/reference/spec/#criticality) than a model for latency tolerant tasks such as a summarization. +- **Serving priority**: an **[inference gateway]** allows you to specify the serving priority of your models. For example, you can specify that your models for online inference of chat tasks (which is more latency sensitive) have a higher [*Priority*](/reference/spec/#priority) than a model for latency tolerant tasks such as a summarization. - **Model rollouts**: an **[inference gateway]** allows you to incrementally roll out new model versions by traffic splitting definitions based on the model names. diff --git a/site-src/reference/x-spec.md b/site-src/reference/x-spec.md index 9151dcda6..c1a57ce3f 100644 --- a/site-src/reference/x-spec.md +++ b/site-src/reference/x-spec.md @@ -123,12 +123,6 @@ performance and latency goals for the model. 
 These workloads are expected to operate within an InferencePool sharing compute capacity with other
 InferenceObjectives, defined by the Inference Platform Admin.
 
-InferenceObjective's modelName (not the ObjectMeta name) is unique for a given InferencePool,
-if the name is reused, an error will be shown on the status of a
-InferenceObjective that attempted to reuse. The oldest InferenceObjective, based on
-creation timestamp, will be selected to remain valid. In the event of a race
-condition, one will be selected at random.
-
 
 
 _Appears in:_
 

From 112bea109128b51dd96ace03b3f8733e32b6738f Mon Sep 17 00:00:00 2001
From: Kellen Swain
Date: Fri, 12 Sep 2025 19:30:47 +0000
Subject: [PATCH 2/2] adding priority and capacity section

---
 site-src/concepts/priority-and-capacity.md | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/site-src/concepts/priority-and-capacity.md b/site-src/concepts/priority-and-capacity.md
index b6b171516..367aebfdf 100644
--- a/site-src/concepts/priority-and-capacity.md
+++ b/site-src/concepts/priority-and-capacity.md
@@ -2,10 +2,16 @@
 
 The InferenceObjective creates the definition of `Priority`, which describes how requests interact with each other. Priority naturally interacts with total pool capacity, and properly understanding and configuring these behaviors is important in allowing a pool to handle requests of different priorities.
 
-# Priority
+## Priority (in flow control)
+
+Note that priority is currently only used for [Capacity](#capacity); the description below covers how Priority will be consumed in the `Flow Control` model.
 
 Priority is a simple stack rank; the higher the number, the higher the priority. Should no priority for a request be specified, the default value is zero. Requests of higher priority are _always_ selected first when requests are queued. Requests of equal priority currently operate on a FCFS basis.
 
-# Capacity
+## Capacity
+
+The current capacity model uses configurable [thresholds](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/35b14a10a9830d1a9e3850913539066ebc8fb317/pkg/epp/saturationdetector/saturationdetector.go#L49) to determine if the entire pool is saturated. The calculation simply iterates through each endpoint in the pool; if every endpoint is above all thresholds, the pool is considered `saturated`. In the event of saturation, all requests with a negative priority will be rejected, and other requests will be scheduled and queued on the model servers.
+
+## Future work
 
-The current capacity model uses configurable [thresholds](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/35b14a10a9830d1a9e3850913539066ebc8fb317/pkg/epp/saturationdetector/saturationdetector.go#L49) to determine if the entire pool is saturated. The calculation is to simply iterate through each endpoint in the pool, and if all are above all thresholds, the pool is considered `saturated`. In the event of saturation, all requests with a negative priority will be rejected, and other requests will be scheduled and queued on the model server.
\ No newline at end of file
+The Flow Control system is nearing completion and will add more nuance to the Priority and Capacity model: proper priority enforcement, more fine-grained capacity tracking, queuing at the Inference Gateway level, etc. This documentation will be updated when the Flow Control implementation is complete.
\ No newline at end of file