# Autoscaling an AI Inference Server with HPA using NVIDIA GPU Metrics

This guide provides a detailed walkthrough for configuring a Kubernetes Horizontal Pod Autoscaler (HPA) to dynamically scale a vLLM AI inference server based on NVIDIA GPU utilization. The autoscaling logic is driven by the `DCGM_FI_DEV_GPU_UTIL` metric, which is exposed by the NVIDIA Data Center GPU Manager (DCGM) Exporter. This approach allows the system to scale based on the actual hardware utilization of the GPU, providing a reliable indicator of workload intensity.

This guide assumes you have already deployed the vLLM inference server from the [parent directory's exercise](../README.md) into the `default` namespace.

---

## 1. Verify GPU Metric Collection

The first step is to ensure that GPU metrics are being collected and exposed within the cluster. This is handled by the NVIDIA DCGM Exporter, which runs as a DaemonSet on GPU-enabled nodes and collects metrics directly from the GPU hardware. The method for deploying this exporter varies across cloud providers.

### 1.1. Cloud Provider DCGM Exporter Setup

Below are the common setups for GKE, AKS, and EKS.

#### Google Kubernetes Engine (GKE)

On GKE, the DCGM exporter is a managed add-on that is deployed and maintained automatically by the platform. It runs in the `gke-managed-system` namespace.

**Verification:**
You can verify that the exporter pods are running with the following command:
```bash
kubectl get pods --namespace gke-managed-system | grep dcgm-exporter
```
You should see one or more `dcgm-exporter` pods in a `Running` state.

#### Amazon Elastic Kubernetes Service (EKS) & Azure Kubernetes Service (AKS)

On both EKS and AKS, the DCGM exporter is not a managed service and must be installed manually. The standard method is to use the official NVIDIA DCGM Exporter Helm chart, which deploys the exporter as a DaemonSet.

**Installation (for both EKS and AKS):**
If you don't already have the exporter installed, you can do so with the following Helm commands:
```bash
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter --namespace monitoring --create-namespace
```
*Note: We install it into the `monitoring` namespace (created here if it does not already exist) to keep all monitoring-related components together.*

**Verification:**
You can verify that the exporter pods are running in the `monitoring` namespace:
```bash
kubectl get pods --namespace monitoring | grep dcgm-exporter
```
You should see one or more `dcgm-exporter` pods in a `Running` state.
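
Optionally, you can spot-check that the exporter is actually publishing the GPU utilization metric before wiring up Prometheus. The pod name below is a placeholder; substitute one of your `dcgm-exporter` pods and its namespace (`gke-managed-system` on GKE, `monitoring` on EKS/AKS).
```bash
# Port-forward one dcgm-exporter pod (9400 is the exporter's default metrics port)
kubectl port-forward -n monitoring pod/<dcgm-exporter-pod-name> 9400:9400 &

# The DCGM_FI_DEV_GPU_UTIL metric should appear with per-GPU labels
curl -s http://localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL
```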

---

## 2. Set Up Prometheus for Metric Collection

With the metric source confirmed, the next step is to configure Prometheus to scrape, process, and store these metrics.

### 2.1. Install the Prometheus Operator

The Prometheus Operator can be easily installed using its official Helm chart. This will deploy a full monitoring stack into the `monitoring` namespace. If you have already installed it in the previous exercise, you can skip this step.

```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts/
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack --namespace monitoring --create-namespace
```

### 2.2. Create a Service for the DCGM Exporter

The `ServiceMonitor` you will create in the next step needs a stable network endpoint from which to scrape the DCGM exporter pods. A Kubernetes Service provides that endpoint.

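For orientation, here is a rough sketch of what a `gpu-dcgm-exporter-service.yaml` targeting the GKE managed exporter might look like. The selector labels are placeholders; match them to the labels your `dcgm-exporter` pods actually carry, and use the `monitoring` namespace on EKS/AKS. The actual manifest in this directory may differ.
```yaml
# Sketch only -- adjust namespace and selector labels to your exporter pods.
apiVersion: v1
kind: Service
metadata:
  name: gke-managed-dcgm-exporter
  namespace: gke-managed-system   # use "monitoring" on EKS/AKS
  labels:
    app.kubernetes.io/name: gke-managed-dcgm-exporter
spec:
  selector:
    # Placeholder: set this to the labels on your dcgm-exporter pods
    app.kubernetes.io/name: gke-managed-dcgm-exporter
  ports:
  - name: metrics
    port: 9400        # dcgm-exporter's default metrics port
    targetPort: 9400
```
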
Apply the service manifest:
```bash
kubectl apply -f ./gpu-dcgm-exporter-service.yaml
```

Verify that the service has been created successfully (this example checks the `gke-managed-system` namespace used on GKE; on EKS/AKS, check the namespace where your exporter runs):
```bash
kubectl get svc -n gke-managed-system | grep gke-managed-dcgm-exporter
```

### 2.3. Configure Metric Scraping with a `ServiceMonitor`

The `ServiceMonitor` tells the Prometheus Operator to scrape the DCGM exporter Service.

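A minimal sketch of such a `ServiceMonitor`, assuming the kube-prometheus-stack release is named `prometheus` and the Service carries the label shown in the previous sketch, might look like this:
```yaml
# Sketch only -- label selectors and namespaces must match your setup.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: gpu-dcgm-exporter
  namespace: monitoring
  labels:
    release: prometheus            # must match the operator's serviceMonitorSelector
spec:
  namespaceSelector:
    matchNames:
    - gke-managed-system           # or "monitoring" on EKS/AKS
  selector:
    matchLabels:
      app.kubernetes.io/name: gke-managed-dcgm-exporter
  endpoints:
  - port: metrics
    interval: 30s
```

Apply the ServiceMonitor manifest from this directory: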
```bash
kubectl apply -f ./gpu-service-monitor.yaml
```

### 2.4. Create a Prometheus Rule for Metric Relabeling

This is a critical step. The raw `DCGM_FI_DEV_GPU_UTIL` metric does not have the standard `pod` and `namespace` labels the HPA needs. This `PrometheusRule` creates a *new*, correctly labelled metric named `gke_dcgm_fi_dev_gpu_util_relabelled` that the Prometheus Adapter can use.

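One plausible shape for such a rule is a recording rule that copies the exporter's `exported_pod` / `exported_namespace` labels onto the standard `pod` / `namespace` labels. The exact source label names depend on how your exporter attaches pod metadata, so treat this purely as a sketch of the technique:
```yaml
# Sketch only -- adjust the source label names to what your DCGM metric actually carries.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-utilization-relabel
  namespace: monitoring
  labels:
    release: prometheus
spec:
  groups:
  - name: gpu.rules
    rules:
    - record: gke_dcgm_fi_dev_gpu_util_relabelled
      expr: |
        label_replace(
          label_replace(DCGM_FI_DEV_GPU_UTIL, "pod", "$1", "exported_pod", "(.+)"),
          "namespace", "$1", "exported_namespace", "(.+)"
        )
```

Apply the rule manifest from this directory: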
```bash
kubectl apply -f ./prometheus-rule.yaml
```

### 2.5. Verify Metric Collection and Relabeling in Prometheus

To ensure the entire pipeline is working, you must verify that the *new*, relabelled metric exists. First, establish a port-forward to the Prometheus service.

```bash
kubectl port-forward svc/prometheus-kube-prometheus-prometheus 9090:9090 -n monitoring
```

In a separate terminal, use `curl` to query for the new metric.
```bash
# Query Prometheus for the new, relabelled metric
curl -sS "http://localhost:9090/api/v1/query?query=gke_dcgm_fi_dev_gpu_util_relabelled" | jq
```
A successful verification will show the metric in the `result` array, complete with the correct `pod` and `namespace` labels.
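
For reference, a healthy response has this general shape (the pod name, timestamp, and value below are purely illustrative):
```json
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "gke_dcgm_fi_dev_gpu_util_relabelled",
          "namespace": "default",
          "pod": "vllm-gemma-deployment-xxxxx",
          "gpu": "0"
        },
        "value": [1710000000, "37"]
      }
    ]
  }
}
```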

---

## 3. Configure the Horizontal Pod Autoscaler

Now that a clean, usable metric is available in Prometheus, you can configure the HPA.

### 3.1. Deploy the Prometheus Adapter

The Prometheus Adapter bridges Prometheus and the Kubernetes custom metrics API. It is configured to read the `gke_dcgm_fi_dev_gpu_util_relabelled` metric and expose it as `gpu_utilization_percent`.

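The adapter's behaviour is driven by a rules configuration (typically a ConfigMap mounted into the adapter Deployment). A minimal sketch of the rule that performs this mapping, assuming the standard Prometheus Adapter rules format, looks like this:
```yaml
# Sketch of the adapter's rules config only; the prometheus-adapter.yaml in this
# directory may also bundle the adapter Deployment and related objects.
rules:
- seriesQuery: 'gke_dcgm_fi_dev_gpu_util_relabelled{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "gke_dcgm_fi_dev_gpu_util_relabelled"
    as: "gpu_utilization_percent"
  metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```

Deploy the adapter with the manifest in this directory: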
```bash
kubectl apply -f ./prometheus-adapter.yaml
```
Verify that the adapter's pod is running in the `monitoring` namespace.
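For example (the exact pod name prefix depends on how the manifest names the adapter Deployment):
```bash
kubectl get pods -n monitoring | grep prometheus-adapter
```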

### 3.2. Verify the Custom Metrics API

After deploying the adapter, it's vital to verify that it is successfully exposing the transformed metrics to the Kubernetes API. You can do this by querying the custom metrics API directly.

```bash
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq .
```

The output should be a list of available custom metrics. Look for the `pods/gpu_utilization_percent` metric, which confirms that the entire pipeline is working correctly and the metric is ready for the HPA to consume.

```json
{
  "kind": "APIResourceList",
  "apiVersion": "v1",
  "groupVersion": "custom.metrics.k8s.io/v1beta1",
  "resources": [
    {
      "name": "pods/gpu_utilization_percent",
      "singularName": "",
      "namespaced": true,
      "kind": "MetricValueList",
      "verbs": [
        "get"
      ]
    }
  ]
}
```

### 3.3. Deploy the Horizontal Pod Autoscaler (HPA)

The HPA is configured to use the final, clean metric name, `gpu_utilization_percent`, to maintain an average GPU utilization of 20%.

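For orientation, an HPA with that behaviour, using the `autoscaling/v2` Pods metric type, might look roughly like the sketch below; the replica bounds are placeholders and the real `gpu-horizontal-pod-autoscaler.yaml` may differ.
```yaml
# Sketch only -- minReplicas/maxReplicas are illustrative placeholders.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: gemma-server-gpu-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-gemma-deployment
  minReplicas: 1
  maxReplicas: 4
  metrics:
  - type: Pods
    pods:
      metric:
        name: gpu_utilization_percent
      target:
        type: AverageValue
        averageValue: "20"
```

Apply the actual HPA manifest from this directory: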
```bash
kubectl apply -f ./gpu-horizontal-pod-autoscaler.yaml
```

Inspect the HPA's configuration to confirm it's targeting the correct metric.
```bash
kubectl describe hpa/gemma-server-gpu-hpa -n default
# Expected output should include:
# Metrics: ( current / target )
#   "gpu_utilization_percent" on pods: <current value> / 20
```

---

## 4. Load Test the Autoscaling Setup

Generate a sustained load on the vLLM server to cause GPU utilization to rise.

### 4.1. Generate Inference Load

First, establish a port-forward to the vLLM service.
```bash
kubectl port-forward service/vllm-service -n default 8081:8081
```

In another terminal, execute the `request-looper.sh` script.
```bash
./request-looper.sh
```
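
If you need to write your own looper, a minimal sketch that repeatedly hits vLLM's OpenAI-compatible completions endpoint through the port-forward might look like this; the model name and payload are placeholders, so align them with whatever your deployment actually serves.
```bash
#!/usr/bin/env bash
# Sketch of a simple request loop -- adjust the endpoint path, model name,
# and prompt to match your vLLM deployment.
while true; do
  curl -s http://localhost:8081/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "<your-model-name>", "prompt": "Explain Kubernetes autoscaling in detail.", "max_tokens": 256}' \
    > /dev/null
done
```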

### 4.2. Observe the HPA Scaling the Deployment

While the load script is running, monitor the HPA's behavior.
```bash
# See the HPA's metric values and scaling events
kubectl describe hpa/gemma-server-gpu-hpa -n default

# Watch the number of deployment replicas increase
kubectl get deploy/vllm-gemma-deployment -n default -w
```
As the average GPU utilization exceeds the 20% target, the HPA will scale up the deployment.

---

## 5. Cleanup

To tear down the resources from this exercise, run the following command:
```bash
kubectl delete -f .
```
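
If you installed the Helm releases above solely for this exercise, you can remove them as well:
```bash
helm uninstall prometheus -n monitoring
helm uninstall dcgm-exporter -n monitoring   # only if you installed it manually on EKS/AKS
```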