
Commit 9536fa2

hpa recipe for ai inference using gpu custom metrics

1 parent 5f7960f commit 9536fa2
12 files changed: +920 -0 lines changed

ai/vllm-deployment/hpa/.gitignore

Lines changed: 1 addition & 0 deletions

GEMINI.md

ai/vllm-deployment/hpa/README.md

Lines changed: 62 additions & 0 deletions

# Horizontal Pod Autoscaling AI Inference Server

This exercise shows how to set up the infrastructure to automatically
scale an AI inference server using custom metrics (either server
or GPU metrics). This exercise requires a running Prometheus instance,
preferably managed by the Prometheus Operator. We assume
you already have the vLLM AI inference server running from this
[exercise](../README.md) in the parent directory.

## Architecture

The autoscaling solution works as follows:

1. The **vLLM Server** or the **NVIDIA DCGM Exporter** exposes raw metrics on a `/metrics` endpoint.
2. A **ServiceMonitor** resource declaratively specifies how Prometheus should discover and scrape these metrics.
3. The **Prometheus Operator** detects the `ServiceMonitor` and configures its managed **Prometheus Server** instance to begin scraping the metrics.
4. For GPU metrics, a **PrometheusRule** is used to relabel the raw DCGM metrics, creating a new, HPA-compatible metric.
5. The **Prometheus Adapter** queries the Prometheus Server for the processed metrics and exposes them through the Kubernetes custom metrics API.
6. The **Horizontal Pod Autoscaler (HPA)** controller queries the custom metrics API for the metrics and compares them to the target values defined in the `HorizontalPodAutoscaler` resource (see the excerpt after the diagram).
7. If the metrics exceed the target, the HPA scales up the `vllm-gemma-deployment`.

```
┌──────────────┐   ┌────────────────┐   ┌──────────────────┐
│ User Request │──>│ vLLM Server    │──>│  ServiceMonitor  │
└──────────────┘   │ (or DCGM Exp.) │   └──────────────────┘
                   └────────────────┘              │

┌────────────────┐   ┌──────────────────┐   ┌──────────────────┐
│ HPA Controller │<──│ Prometheus Adpt. │<──│ Prometheus Srv.  │
└────────────────┘   └──────────────────┘   └──────────────────┘
                                                   │ (GPU Path Only)

                                           ┌────────────────┐
                                           │ PrometheusRule │
                                           └────────────────┘
```
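
For the GPU path, step 6 corresponds to a `Pods`-type metric entry in the HPA manifest. The excerpt below mirrors the `gpu-horizontal-pod-autoscaler.yaml` included in this commit (only the metric stanza is shown):

```yaml
# Excerpt: the HPA reads the custom metric exposed by the Prometheus Adapter
# and compares the per-pod average against the target value.
metrics:
  - type: Pods
    pods:
      metric:
        name: gpu_utilization_percent   # renamed metric served by the Prometheus Adapter
      target:
        type: AverageValue
        averageValue: 20                # target average GPU utilization (%)
```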

## Prerequisites

This guide assumes you have a running Kubernetes cluster and `kubectl` installed. The vLLM server will be deployed in the `default` namespace, and the Prometheus and HPA resources will be in the `monitoring` namespace.

### Prometheus Operator Installation

The following commands will install the Prometheus Operator. It is recommended to install it in its own `monitoring` namespace.

```bash
# Add the Prometheus community Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts/
helm repo update

# Install the Prometheus Operator into the "monitoring" namespace
helm install prometheus prometheus-community/kube-prometheus-stack --namespace monitoring --create-namespace
```

**Note:** The default configuration of the Prometheus Operator only watches for `ServiceMonitor` resources within its own namespace. The `vllm-service-monitor.yaml` is configured to be in the `monitoring` namespace and watch for services in the `default` namespace, so no extra configuration is needed.
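
For reference, a minimal sketch of such a `ServiceMonitor` is shown below. The name and selector labels are illustrative assumptions; the actual manifest shipped with this recipe is `vllm-service-monitor.yaml`, and the GPU-path equivalent (`gpu-service-monitor.yaml`) appears later in this commit.

```yaml
# Sketch of a ServiceMonitor that lives in "monitoring" but scrapes a Service
# in "default". The selector labels below are assumptions; match them to the
# labels on your vLLM Service.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vllm-servicemonitor          # hypothetical name
  namespace: monitoring              # where the Prometheus Operator watches
  labels:
    release: prometheus              # must match the Prometheus serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: vllm-service              # assumed label on the vLLM Service
  namespaceSelector:
    matchNames:
      - default                      # scrape targets in the default namespace
  endpoints:
    - port: metrics
      interval: 15s
```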

## I. HPA for vLLM AI Inference Server using vLLM metrics

[vLLM AI Inference Server HPA](./vllm-hpa.md)

## II. HPA for vLLM AI Inference Server using NVIDIA GPU metrics

[vLLM AI Inference Server HPA with GPU metrics](./gpu-hpa.md)

ai/vllm-deployment/hpa/gpu-dcgm-exporter-service.yaml

Lines changed: 26 additions & 0 deletions
# This Service provides a stable network endpoint for the NVIDIA DCGM Exporter
# pods. The Prometheus Operator's ServiceMonitor will target this Service
# to discover and scrape the GPU metrics. This is especially important
# because the exporter pods are part of a DaemonSet, and their IPs can change.

apiVersion: v1
kind: Service
metadata:
  name: gke-managed-dcgm-exporter
  namespace: gke-managed-system
  labels:
    # This label is critical. The ServiceMonitor uses this label to find this
    # specific Service. If the labels don't match, Prometheus will not be
    # able to discover the metrics endpoint.
    app.kubernetes.io/name: gke-managed-dcgm-exporter
spec:
  selector:
    # This selector tells the Service which pods to route traffic to.
    # It must match the labels on the DCGM exporter pods.
    app.kubernetes.io/name: gke-managed-dcgm-exporter
  ports:
    # The 'name' of this port is important. The ServiceMonitor will specifically
    # look for a port with this name to scrape metrics from.
    - name: metrics
      port: 9400
      targetPort: 9400

ai/vllm-deployment/hpa/gpu-horizontal-pod-autoscaler.yaml

Lines changed: 50 additions & 0 deletions
# This HorizontalPodAutoscaler (HPA) targets the vLLM deployment and scales
# it based on the average GPU utilization across all pods. It uses the
# custom metric 'gpu_utilization_percent', which is provided by the
# Prometheus Adapter.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: gemma-server-gpu-hpa
  namespace: default
spec:
  # scaleTargetRef points the HPA to the deployment it needs to scale.
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-gemma-deployment
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Pods
      pods:
        metric:
          # This is the custom metric that the HPA will query.
          # IMPORTANT: This name ('gpu_utilization_percent') is not the raw metric
          # from the DCGM exporter. It is the clean, renamed metric that is
          # exposed by the Prometheus Adapter. The names must match exactly.
          name: gpu_utilization_percent
        target:
          type: AverageValue
          # This is the target value for the metric. The HPA will add or remove
          # pods to keep the average GPU utilization across all pods at 20%.
          averageValue: 20
  # The behavior section tunes how aggressively the HPA reacts.
  behavior:
    # Scale up immediately (no stabilization window), adding at most 4 pods or
    # 100% of the current replica count every 15 seconds, whichever is greater.
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Pods
          value: 4
          periodSeconds: 15
        - type: Percent
          value: 100
          periodSeconds: 15
      selectPolicy: Max
    # Scale down after a short 30-second stabilization window, removing up to
    # 100% of the surplus replicas every 15 seconds.
    scaleDown:
      stabilizationWindowSeconds: 30
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
      selectPolicy: Max

ai/vllm-deployment/hpa/gpu-hpa.md

Lines changed: 206 additions & 0 deletions
# Autoscaling an AI Inference Server with HPA using NVIDIA GPU Metrics

This guide provides a detailed walkthrough for configuring a Kubernetes Horizontal Pod Autoscaler (HPA) to dynamically scale a vLLM AI inference server based on NVIDIA GPU utilization. The autoscaling logic is driven by the `DCGM_FI_DEV_GPU_UTIL` metric, which is exposed by the NVIDIA Data Center GPU Manager (DCGM) Exporter. This approach allows the system to scale based on the actual hardware utilization of the GPU, providing a reliable indicator of workload intensity.

This guide assumes you have already deployed the vLLM inference server from the [parent directory's exercise](../README.md) into the `default` namespace.

---

## 1. Verify GPU Metric Collection

The first step is to ensure that GPU metrics are being collected and exposed within the cluster. This is handled by the NVIDIA DCGM Exporter, which runs as a DaemonSet on GPU-enabled nodes and scrapes metrics directly from the GPU hardware. The method for deploying this exporter varies across cloud providers.

### 1.1. Cloud Provider DCGM Exporter Setup

Below are the common setups for GKE, AKS, and EKS.

#### Google Kubernetes Engine (GKE)

On GKE, the DCGM exporter is a managed add-on that is automatically deployed and managed by the system. It runs in the `gke-managed-system` namespace.

**Verification:**
You can verify that the exporter pods are running with the following command:
```bash
kubectl get pods --namespace gke-managed-system | grep dcgm-exporter
```
You should see one or more `dcgm-exporter` pods in a `Running` state.

#### Amazon Elastic Kubernetes Service (EKS) & Microsoft Azure Kubernetes Service (AKS)

On both EKS and AKS, the DCGM exporter is not a managed service and must be installed manually. The standard method is to use the official NVIDIA DCGM Exporter Helm chart, which deploys the exporter as a DaemonSet.

**Installation (for both EKS and AKS):**
If you don't already have the exporter installed, you can do so with the following Helm commands:
```bash
# The DCGM exporter chart is published from NVIDIA's dcgm-exporter repository
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter --namespace monitoring --create-namespace
```
*Note: We are installing it into the `monitoring` namespace to keep all monitoring-related components together. If you take this route, the ServiceMonitor selectors must point at that namespace; see the sketch at the end of this subsection.*

**Verification:**
You can verify that the exporter pods are running in the `monitoring` namespace:
```bash
kubectl get pods --namespace monitoring | grep dcgm-exporter
```
You should see one or more `dcgm-exporter` pods in a `Running` state.
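
If you installed the exporter via Helm rather than using the GKE-managed add-on, the Service and ServiceMonitor in this recipe (which target the `gke-managed-system` namespace) will not match your deployment. A minimal sketch of the `ServiceMonitor` selector changes is shown below; the namespace comes from the Helm install above, and the label value is an assumption you should confirm with `kubectl get svc -n monitoring --show-labels`.

```yaml
# Sketch only: adjust the ServiceMonitor's spec so it finds the Helm-installed
# exporter Service in the "monitoring" namespace. The matchLabels value is an
# assumption; use whatever labels your dcgm-exporter Service actually carries.
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: dcgm-exporter   # assumed label from the Helm chart
  namespaceSelector:
    matchNames:
      - monitoring                            # where the Helm release was installed
  endpoints:
    - port: metrics
      interval: 15s
```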

---

## 2. Set Up Prometheus for Metric Collection

With the metric source confirmed, the next step is to configure Prometheus to scrape, process, and store these metrics.

### 2.1. Install the Prometheus Operator

The Prometheus Operator can be easily installed using its official Helm chart. This will deploy a full monitoring stack into the `monitoring` namespace. If you have already installed it in the previous exercise, you can skip this step.

```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts/
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack --namespace monitoring --create-namespace
```

### 2.2. Create a Service for the DCGM Exporter

The `ServiceMonitor` needs a stable network endpoint to reliably scrape metrics from the DCGM exporter pods. A Kubernetes Service provides this stable endpoint.

Apply the service manifest:
```bash
kubectl apply -f ./gpu-dcgm-exporter-service.yaml
```

Verify that the service has been created successfully:
```bash
kubectl get svc -n gke-managed-system | grep gke-managed-dcgm-exporter
```

### 2.3. Configure Metric Scraping with a `ServiceMonitor`

The `ServiceMonitor` tells the Prometheus Operator to scrape the DCGM exporter Service.

```bash
kubectl apply -f ./gpu-service-monitor.yaml
```

### 2.4. Create a Prometheus Rule for Metric Relabeling

This is a critical step. The raw `DCGM_FI_DEV_GPU_UTIL` metric does not have the standard `pod` and `namespace` labels the HPA needs. This `PrometheusRule` creates a *new*, correctly-labelled metric named `gke_dcgm_fi_dev_gpu_util_relabelled` that the Prometheus Adapter can use.

```bash
kubectl apply -f ./prometheus-rule.yaml
```
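
The contents of `prometheus-rule.yaml` are not reproduced in this guide, but conceptually it is a recording rule that copies `DCGM_FI_DEV_GPU_UTIL` to the new name while promoting the pod and namespace identity into standard labels. A minimal sketch is shown below; it assumes the scraped series carries `exported_pod` and `exported_namespace` labels, which you should confirm against your own data before relying on it.

```yaml
# Sketch of a recording rule that produces the relabelled metric. The
# "exported_pod"/"exported_namespace" source labels are assumptions; inspect
# the raw DCGM_FI_DEV_GPU_UTIL series in Prometheus to see which labels carry
# the pod identity in your cluster.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: dcgm-gpu-util-relabel        # hypothetical name
  namespace: monitoring
  labels:
    release: prometheus              # must match the Operator's ruleSelector
spec:
  groups:
    - name: dcgm-gpu-util.rules
      rules:
        - record: gke_dcgm_fi_dev_gpu_util_relabelled
          expr: |
            label_replace(
              label_replace(DCGM_FI_DEV_GPU_UTIL, "pod", "$1", "exported_pod", "(.+)"),
              "namespace", "$1", "exported_namespace", "(.+)"
            )
```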

### 2.5. Verify Metric Collection and Relabeling in Prometheus

To ensure the entire pipeline is working, you must verify that the *new*, relabelled metric exists. First, establish a port-forward to the Prometheus service.

```bash
kubectl port-forward svc/prometheus-kube-prometheus-prometheus 9090:9090 -n monitoring
```

In a separate terminal, use `curl` to query for the new metric.
```bash
# Query Prometheus for the new, relabelled metric
curl -sS "http://localhost:9090/api/v1/query?query=gke_dcgm_fi_dev_gpu_util_relabelled" | jq
```
A successful verification will show the metric in the `result` array, complete with the correct `pod` and `namespace` labels.

---

## 3. Configure the Horizontal Pod Autoscaler

Now that a clean, usable metric is available in Prometheus, you can configure the HPA.

### 3.1. Deploy the Prometheus Adapter

The Prometheus Adapter bridges Prometheus and the Kubernetes custom metrics API. It is configured to read the `gke_dcgm_fi_dev_gpu_util_relabelled` metric and expose it as `gpu_utilization_percent`.

```bash
kubectl apply -f ./prometheus-adapter.yaml
```
Verify that the adapter's pod is running in the `monitoring` namespace.
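
The rename happens in the adapter's rules configuration (part of `prometheus-adapter.yaml`, not shown in this guide). A minimal sketch of such a rule follows, assuming the standard prometheus-adapter rule format; the actual manifest in this recipe may structure it differently.

```yaml
# Sketch of a prometheus-adapter rule (typically embedded in the adapter's
# ConfigMap as config.yaml). It discovers the relabelled series, maps its
# pod/namespace labels to Kubernetes resources, and renames the metric to
# gpu_utilization_percent for the custom metrics API.
rules:
  - seriesQuery: 'gke_dcgm_fi_dev_gpu_util_relabelled{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "gke_dcgm_fi_dev_gpu_util_relabelled"
      as: "gpu_utilization_percent"
    metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```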

### 3.2. Verify the Custom Metrics API

After deploying the adapter, it's vital to verify that it is successfully exposing the transformed metrics to the Kubernetes API. You can do this by querying the custom metrics API directly.

```bash
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq .
```

The output should be a list of available custom metrics. Look for the `pods/gpu_utilization_percent` metric, which confirms that the entire pipeline is working correctly and the metric is ready for the HPA to consume.

```json
{
  "kind": "APIResourceList",
  "apiVersion": "v1",
  "groupVersion": "custom.metrics.k8s.io/v1beta1",
  "resources": [
    {
      "name": "pods/gpu_utilization_percent",
      "singularName": "",
      "namespaced": true,
      "kind": "MetricValueList",
      "verbs": [
        "get"
      ]
    }
  ]
}
```

### 3.3. Deploy the Horizontal Pod Autoscaler (HPA)

The HPA is configured to use the final, clean metric name, `gpu_utilization_percent`, to maintain an average GPU utilization of 20%.

```bash
kubectl apply -f ./gpu-horizontal-pod-autoscaler.yaml
```

Inspect the HPA's configuration to confirm it's targeting the correct metric.
```bash
kubectl describe hpa/gemma-server-gpu-hpa -n default
# Expected output should include:
#   Metrics:      ( current / target )
#     "gpu_utilization_percent" on pods:  <current value> / 20
```

---

## 4. Load Test the Autoscaling Setup

Generate a sustained load on the vLLM server to cause GPU utilization to rise.

### 4.1. Generate Inference Load

First, establish a port-forward to the vLLM service.
```bash
kubectl port-forward service/vllm-service -n default 8081:8081
```

In another terminal, execute the `request-looper.sh` script.
```bash
./request-looper.sh
```

### 4.2. Observe the HPA Scaling the Deployment

While the load script is running, monitor the HPA's behavior.
```bash
# See the HPA's metric values and scaling events
kubectl describe hpa/gemma-server-gpu-hpa -n default

# Watch the number of deployment replicas increase
kubectl get deploy/vllm-gemma-deployment -n default -w
```
As the average GPU utilization exceeds the 20% target, the HPA will scale up the deployment.
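
For reference, the HPA controller follows the standard scaling formula `desiredReplicas = ceil(currentReplicas × currentMetricValue / targetValue)`. For example, if one replica reports an average GPU utilization of 60% against the 20% target, the HPA requests `ceil(1 × 60 / 20) = 3` replicas, subject to `maxReplicas` and the `behavior` policies in the manifest.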

---

## 5. Cleanup

To tear down the resources from this exercise, run the following command:
```bash
kubectl delete -f .
```

ai/vllm-deployment/hpa/gpu-service-monitor.yaml

Lines changed: 28 additions & 0 deletions
# This ServiceMonitor tells the Prometheus Operator how to discover and scrape
# metrics from the NVIDIA DCGM Exporter. It is designed to find the
# 'gke-managed-dcgm-exporter' Service in the 'gke-managed-system' namespace
# and scrape its '/metrics' endpoint.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nvidia-dcgm-exporter-servicemonitor
  namespace: monitoring
  labels:
    # This label is used by the Prometheus Operator to discover this
    # ServiceMonitor. It must match the 'serviceMonitorSelector' configured
    # in the Prometheus custom resource.
    release: prometheus
spec:
  # This selector identifies the specific Service to scrape. It must match
  # the labels on the 'gke-managed-dcgm-exporter' Service.
  selector:
    matchLabels:
      app.kubernetes.io/name: gke-managed-dcgm-exporter
  # This selector specifies which namespace to search for the target Service.
  # For GKE, the DCGM service is in 'gke-managed-system'.
  namespaceSelector:
    matchNames:
      - gke-managed-system
  endpoints:
    - port: metrics
      interval: 15s
