HPA recipe for AI inference server using custom metrics #570
@@ -0,0 +1 @@
GEMINI.md
@@ -0,0 +1,97 @@
# Horizontal Pod Autoscaling for an AI Inference Server

This exercise shows how to set up the infrastructure to automatically
scale an AI inference server using custom metrics (either server metrics
or GPU metrics). It requires a running Prometheus instance, preferably
managed by the Prometheus Operator. We assume you already have the vLLM
AI inference server running from the [exercise](../README.md) in the
parent directory.
## Architecture

The autoscaling solution works as follows:

1. The **vLLM Server** or the **NVIDIA DCGM Exporter** exposes raw metrics on a `/metrics` endpoint.
2. A **ServiceMonitor** resource declaratively specifies how Prometheus should discover and scrape these metrics.
3. The **Prometheus Operator** detects the `ServiceMonitor` and configures its managed **Prometheus Server** instance to begin scraping the metrics.
4. For GPU metrics, a **PrometheusRule** relabels the raw DCGM metrics, creating a new, HPA-compatible metric (a sketch of such a rule follows the diagram below).
5. The **Prometheus Adapter** queries the Prometheus Server for the processed metrics and exposes them through the Kubernetes custom metrics API.
6. The **Horizontal Pod Autoscaler (HPA)** controller queries the custom metrics API and compares the returned values to the targets defined in the `HorizontalPodAutoscaler` resource.
7. If the metrics exceed the target, the HPA scales up the `vllm-gemma-deployment`.
```mermaid
flowchart TD
    D("PrometheusRule (GPU Metric Only)")
    B("Prometheus Server")
    C("ServiceMonitor")
    subgraph subGraph0["Metrics Collection"]
        A["vLLM Server"]
        H["GPU DCGM Exporter"]
    end
    subgraph subGraph1["HPA Scaling Logic"]
        E("Prometheus Adapter")
        F("API Server (Custom Metrics)")
        G("HPA Controller")
    end
    B -- Scrapes Raw Metrics --> A
    B -- Scrapes Raw Metrics --> H
    C -- Configures Scrape <--> B
    B -- Processes Raw Metrics via --> D
    D -- Creates Clean Metric in --> B
    F -- Custom Metrics API <--> E
    E -- Queries Processed Metric <--> B
    G -- Queries Custom Metric --> F
```

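As a concrete illustration of the relabeling in step 4, a `PrometheusRule` along the following lines could record the raw DCGM utilization under an HPA-friendly name. This is a minimal sketch, not the rule shipped with this exercise: the source metric name (`dcgm_fi_dev_gpu_util`), the `exported_pod` label, and the `release: prometheus` label are assumptions that must match your DCGM exporter and Prometheus Operator installation.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: dcgm-gpu-utilization
  namespace: monitoring
  labels:
    # Assumption: the kube-prometheus-stack release is named "prometheus",
    # so its default rule selector picks up rules carrying this label.
    release: prometheus
spec:
  groups:
    - name: gpu-hpa.rules
      rules:
        # Record the raw DCGM utilization under the name the HPA queries,
        # copying the exporter's pod label into the standard "pod" label so
        # the Prometheus Adapter can associate the metric with the vLLM pods.
        - record: gpu_utilization_percent
          expr: label_replace(dcgm_fi_dev_gpu_util, "pod", "$1", "exported_pod", "(.+)")
```
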
## Prerequisites

This guide assumes you have a running Kubernetes cluster and `kubectl` installed. The vLLM server and its HPA are deployed in the `default` namespace, while the Prometheus resources live in the `monitoring` namespace.

> **Note on Cluster Permissions:** This exercise requires permissions to install components that run on the cluster nodes themselves. The Prometheus Operator and the NVIDIA DCGM Exporter both deploy DaemonSets that require privileged access to the nodes to collect metrics. For GKE users, this means a **GKE Standard** cluster is required, as GKE Autopilot's security model restricts this level of node access.
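Since both the vLLM server and the DCGM exporter assume GPU nodes, a quick sanity check that the cluster actually exposes GPU capacity can save debugging time later. The commands below are a sketch; the `nvidia.com/gpu` resource name and the `cloud.google.com/gke-accelerator` label assume NVIDIA GPUs on GKE.

```bash
# List each node's allocatable NVIDIA GPUs (an empty column means no GPU capacity).
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'

# On GKE, the accelerator type is also visible as a node label.
kubectl get nodes -L cloud.google.com/gke-accelerator
```
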
### Prometheus Operator Installation

The following commands will install the Prometheus Operator. It is recommended to install it in its own `monitoring` namespace.

```bash
# Add the Prometheus community Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts/
helm repo update

# Install the Prometheus Operator into the "monitoring" namespace
helm install prometheus prometheus-community/kube-prometheus-stack --namespace monitoring --create-namespace
```

**Note:** The default configuration of the Prometheus Operator only watches for `ServiceMonitor` resources within its own namespace. The `vllm-service-monitor.yaml` is configured to be in the `monitoring` namespace and watch for services in the `default` namespace, so no extra configuration is needed.
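The `vllm-service-monitor.yaml` itself is not reproduced on this page; a rough sketch of the pattern it follows is shown below. The `release: prometheus` label, the `app: gemma-server` selector, and the `metrics` port name are assumptions here and must be replaced with the values from your actual Prometheus installation and vLLM Service.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vllm-service-monitor
  # Lives in the monitoring namespace so the Operator's default selectors find it.
  namespace: monitoring
  labels:
    # Assumption: the kube-prometheus-stack release is named "prometheus".
    release: prometheus
spec:
  # Scrape Services in the default namespace, where the vLLM server runs.
  namespaceSelector:
    matchNames:
      - default
  selector:
    matchLabels:
      # Assumption: replace with the labels on your vLLM Service.
      app: gemma-server
  endpoints:
    - # Assumption: the vLLM Service exposes /metrics on a port named "metrics".
      port: metrics
      path: /metrics
      interval: 15s
```
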
## I. HPA for vLLM AI Inference Server using vLLM metrics

[vLLM AI Inference Server HPA](./vllm-hpa.md)

## II. HPA for vLLM AI Inference Server using NVIDIA GPU metrics

[vLLM AI Inference Server HPA with GPU metrics](./gpu-hpa.md)

> **Review comment:** We could discuss the trade-offs between these two metrics options here, and how to combine multiple metrics for robustness (e.g., scale up if either the number of running requests exceeds a certain threshold or GPU utilization spikes).
>
> **Reply:** I've added significant documentation now addressing the metric trade-offs as well as the combination of multiple metrics. Please let me know what you think.
### Choosing the Right Metric: Trade-offs and Combining Metrics

This project provides two methods for autoscaling: one based on the number of running requests (`vllm:num_requests_running`) and the other on GPU utilization (`dcgm_fi_dev_gpu_util`). Each has its own advantages, and they can be combined for a more robust scaling strategy.

#### Trade-offs

* **Number of Running Requests (Application-Level Metric):**
  * **Pros:** This is a direct measure of the application's current workload. It is highly responsive to sudden changes in traffic, making it ideal for latency-sensitive applications. Scaling decisions are based on the actual number of requests being processed, which can be a more accurate predictor of future load than hardware utilization alone.
  * **Cons:** This metric may not always correlate directly with resource consumption. For example, a few computationally expensive requests could saturate the GPU, while a large number of simple requests might not. If the application has issues reporting this metric, the HPA will not be able to scale the deployment correctly.

* **GPU Utilization (Hardware-Level Metric):**
  * **Pros:** This provides a direct measurement of how busy the underlying hardware is. It is a reliable indicator of resource saturation and is useful for optimizing costs by scaling down when the GPU is underutilized.
  * **Cons:** GPU utilization can be a lagging indicator. By the time utilization is high, the application's latency may have already increased. It also does not distinguish between a single, intensive request and multiple, less demanding ones.

#### Combining Metrics for Robustness

For the most robust autoscaling, you can configure the HPA to use multiple metrics, for example scaling up if *either* the number of running requests exceeds a certain threshold *or* GPU utilization spikes. The HPA will scale the deployment up if any of the metrics cross their defined thresholds, but it will only scale down when *all* metrics are below their target values (respecting the scale-down stabilization window). A sketch of such a multi-metric configuration follows the list below.

This combined approach provides several benefits:

- **Proactive Scaling:** The HPA can scale up quickly in response to an increase in running requests, preventing latency spikes.
- **Resource Protection:** It can also scale up if a small number of requests are consuming a large amount of GPU resources, preventing the server from becoming overloaded.
- **Cost-Effective Scale-Down:** The deployment will only scale down when both the request load and GPU utilization are low, ensuring that resources are not removed prematurely.
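Here is a sketch of what such a combined configuration could look like (it is not a manifest shipped with this exercise). The metric names `vllm_num_requests_running` and `gpu_utilization_percent` and the target values are assumptions and must match whatever names your Prometheus Adapter actually exposes through the custom metrics API.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: gemma-server-combined-hpa
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-gemma-deployment
  minReplicas: 1
  maxReplicas: 5
  metrics:
  # Application-level signal: average number of in-flight requests per pod.
  - type: Pods
    pods:
      metric:
        name: vllm_num_requests_running  # assumption: adapter-exposed name
      target:
        type: AverageValue
        averageValue: "4"
  # Hardware-level signal: average GPU utilization per pod.
  - type: Pods
    pods:
      metric:
        name: gpu_utilization_percent
      target:
        type: AverageValue
        averageValue: "20"
```

With two entries under `metrics`, the HPA computes a desired replica count for each metric independently and acts on the larger of the two, which yields the scale-up-on-either, scale-down-on-both behavior described above.
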
@@ -0,0 +1,37 @@
# This Service provides a stable network endpoint for the NVIDIA DCGM Exporter
# pods. The Prometheus Operator's ServiceMonitor will target this Service
# to discover and scrape the GPU metrics. This is especially important
# because the exporter pods are part of a DaemonSet, and their IPs can change.
#
# NOTE: This configuration is specific to GKE, which automatically deploys the
# DCGM exporter in the 'gke-managed-system' namespace. For other cloud
# providers or on-premise clusters, you would need to deploy your own DCGM
# exporter (e.g., via a Helm chart) and update this Service's 'namespace'
# and 'labels' to match your deployment.

apiVersion: v1
kind: Service
metadata:
  name: gke-managed-dcgm-exporter
  # GKE-SPECIFIC: GKE deploys its managed DCGM exporter in this namespace.
  # On other platforms, this would be the namespace where you deploy the exporter.
  namespace: gke-managed-system
  labels:
    # This label is critical. The ServiceMonitor uses this label to find this
    # specific Service. If the labels don't match, Prometheus will not be
    # able to discover the metrics endpoint.
    # GKE-SPECIFIC: This label is used by GKE's managed service. For a custom
    # deployment, you would use a more generic label like 'nvidia-dcgm-exporter'.
    app.kubernetes.io/name: gke-managed-dcgm-exporter
spec:
  selector:
    # This selector tells the Service which pods to route traffic to.
    # It must match the labels on the DCGM exporter pods.
    # GKE-SPECIFIC: This selector matches the labels on GKE's managed DCGM pods.
    app.kubernetes.io/name: gke-managed-dcgm-exporter
  ports:
  - # The 'name' of this port is important. The ServiceMonitor will specifically
    # look for a port with this name to scrape metrics from.
    name: metrics
    port: 9400
    targetPort: 9400
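The ServiceMonitor that scrapes this Service is not shown on this page; a sketch of what it could look like is below. The `release: prometheus` label and the 30s interval are assumptions that must match your kube-prometheus-stack installation.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter-monitor
  namespace: monitoring
  labels:
    # Assumption: the kube-prometheus-stack release is named "prometheus".
    release: prometheus
spec:
  # Look for Services in the namespace where the DCGM exporter runs (GKE-managed here).
  namespaceSelector:
    matchNames:
      - gke-managed-system
  selector:
    matchLabels:
      # Must match the label on the Service defined above.
      app.kubernetes.io/name: gke-managed-dcgm-exporter
  endpoints:
    - # Scrape the port named "metrics" (port 9400 on the Service above).
      port: metrics
      interval: 30s
```
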
@@ -0,0 +1,66 @@
# This HorizontalPodAutoscaler (HPA) targets the vLLM deployment and scales
# it based on the average GPU utilization across all pods. It uses the
# custom metric 'gpu_utilization_percent', which is provided by the
# Prometheus Adapter.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: gemma-server-gpu-hpa
  namespace: default
spec:
  # scaleTargetRef points the HPA to the deployment it needs to scale.
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-gemma-deployment
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Pods
    pods:
      metric:
        # This is the custom metric that the HPA will query.
        # IMPORTANT: This name ('gpu_utilization_percent') is not the raw metric
        # from the DCGM exporter. It is the clean, renamed metric that is
        # exposed by the Prometheus Adapter. The names must match exactly.
        name: gpu_utilization_percent
      target:
        type: AverageValue
        # This is the target value for the metric. The HPA will add or remove
        # pods to keep the average GPU utilization across all pods at 20%.
        averageValue: 20
  behavior:
    scaleUp:
      # The stabilizationWindowSeconds is set to 0 to allow for immediate
      # scaling up. This is a trade-off:
      # - For highly volatile workloads, immediate scaling is critical to
      #   maintain performance and responsiveness.
      # - However, this also introduces a risk of over-scaling if the workload
      #   spikes are very brief. A non-zero value would make the scaling
      #   less sensitive to short-lived spikes, but could introduce latency
      #   if the load persists.
      stabilizationWindowSeconds: 0
> **Review comment:** We can discuss the trade-offs here, e.g. the risk of over-scaling vs. highly volatile workloads where immediate scaling up is critical to maintain performance and responsiveness.
>
> **Reply:** Added comments in the YAML on the trade-offs for the scale-up and scale-down behavior (also the scale-down behavior for the vLLM HPA). Please let me know what you think.

      policies:
      - type: Pods
        value: 4
        periodSeconds: 15
      - type: Percent
        value: 100
        periodSeconds: 15
      selectPolicy: Max
    scaleDown:
      # The stabilizationWindowSeconds is set to 30 to prevent the HPA from
      # scaling down too aggressively. This means the controller will wait for
      # 30 seconds after a scale-down event before considering another one.
      # This helps to smooth out the scaling behavior and prevent "flapping"
      # (rapidly scaling up and down). A larger value will make the scaling
      # more conservative, which can be useful for workloads with fluctuating
      # metrics, but it may also result in higher costs if the resources are
      # not released quickly after a load decrease.
      stabilizationWindowSeconds: 30
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15
      selectPolicy: Max
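Before applying this HPA, it can be useful to confirm that the Prometheus Adapter is actually serving the custom metric. One way to check, assuming the manifest above is saved as `gpu-hpa.yaml` (a hypothetical filename), is:

```bash
# Check that the custom metrics API serves the GPU metric for pods in "default".
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/gpu_utilization_percent"

# Apply the HPA and watch its reported metric value and replica count.
kubectl apply -f gpu-hpa.yaml
kubectl get hpa gemma-server-gpu-hpa --namespace default --watch
```
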
> **Review comment:** Just noticed that the `default` namespace is used for vLLM. Kubernetes best practice is to avoid deploying applications in the `default` namespace. Using it for actual workloads can lead to significant operational and security challenges as the cluster usage grows.
>
> **Reply:** This PR is already pretty big; should I change the vLLM deployment in a separate PR to use a namespace (then return to this)? Or should I fix it in this PR?