
Commit 9536fa2

hpa recipe for ai inference using gpu custom metrics

1 parent 5f7960f commit 9536fa2
12 files changed: +920 -0 lines changed

ai/vllm-deployment/hpa/.gitignore

Lines changed: 1 addition & 0 deletions

GEMINI.md

ai/vllm-deployment/hpa/README.md

Lines changed: 62 additions & 0 deletions

# Horizontal Pod Autoscaling AI Inference Server

This exercise shows how to set up the infrastructure to automatically
scale an AI inference server using custom metrics (either server
or GPU metrics). This exercise requires a running Prometheus instance,
preferably managed by the Prometheus Operator. We assume
you already have the vLLM AI inference server running from this
[exercise](../README.md) in the parent directory.

## Architecture

The autoscaling solution works as follows:

1. The **vLLM Server** or the **NVIDIA DCGM Exporter** exposes raw metrics on a `/metrics` endpoint.
2. A **ServiceMonitor** resource declaratively specifies how Prometheus should discover and scrape these metrics.
3. The **Prometheus Operator** detects the `ServiceMonitor` and configures its managed **Prometheus Server** instance to begin scraping the metrics.
4. For GPU metrics, a **PrometheusRule** is used to relabel the raw DCGM metrics, creating a new, HPA-compatible metric.
5. The **Prometheus Adapter** queries the Prometheus Server for the processed metrics and exposes them through the Kubernetes custom metrics API.
6. The **Horizontal Pod Autoscaler (HPA)** controller queries the custom metrics API for the metrics and compares them to the target values defined in the `HorizontalPodAutoscaler` resource (see the excerpt after the diagram).
7. If the metrics exceed the target, the HPA scales up the `vllm-gemma-deployment`.

```
┌──────────────┐   ┌────────────────┐   ┌──────────────────┐
│ User Request │──>│ vLLM Server    │──>│  ServiceMonitor  │
└──────────────┘   │ (or DCGM Exp.) │   └──────────────────┘
                   └────────────────┘              │

┌────────────────┐   ┌──────────────────┐   ┌──────────────────┐
│ HPA Controller │<──│ Prometheus Adpt. │<──│ Prometheus Srv.  │
└────────────────┘   └──────────────────┘   └──────────────────┘
                                                   │ (GPU Path Only)

                                           ┌────────────────┐
                                           │ PrometheusRule │
                                           └────────────────┘
```
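
For the GPU path, step 6 corresponds to a `Pods`-type metric entry in the HPA manifest. The excerpt below mirrors the `gpu-horizontal-pod-autoscaler.yaml` included in this commit (only the metric stanza is shown):

```yaml
# Excerpt: the HPA reads the custom metric exposed by the Prometheus Adapter
# and compares the per-pod average against the target value.
metrics:
  - type: Pods
    pods:
      metric:
        name: gpu_utilization_percent   # renamed metric served by the Prometheus Adapter
      target:
        type: AverageValue
        averageValue: 20                # target average GPU utilization (%)
```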

## Prerequisites

This guide assumes you have a running Kubernetes cluster and `kubectl` installed. The vLLM server will be deployed in the `default` namespace, and the Prometheus and HPA resources will be in the `monitoring` namespace.

### Prometheus Operator Installation

The following commands will install the Prometheus Operator. It is recommended to install it in its own `monitoring` namespace.

```bash
# Add the Prometheus community Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts/
helm repo update

# Install the Prometheus Operator into the "monitoring" namespace
helm install prometheus prometheus-community/kube-prometheus-stack --namespace monitoring --create-namespace
```

**Note:** The default configuration of the Prometheus Operator only watches for `ServiceMonitor` resources within its own namespace. The `vllm-service-monitor.yaml` is configured to be in the `monitoring` namespace and watch for services in the `default` namespace, so no extra configuration is needed.
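
For reference, a minimal sketch of such a `ServiceMonitor` is shown below. The name and selector labels are illustrative assumptions; the actual manifest shipped with this recipe is `vllm-service-monitor.yaml`, and the GPU-path equivalent (`gpu-service-monitor.yaml`) appears later in this commit.

```yaml
# Sketch of a ServiceMonitor that lives in "monitoring" but scrapes a Service
# in "default". The selector labels below are assumptions; match them to the
# labels on your vLLM Service.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vllm-servicemonitor          # hypothetical name
  namespace: monitoring              # where the Prometheus Operator watches
  labels:
    release: prometheus              # must match the Prometheus serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: vllm-service              # assumed label on the vLLM Service
  namespaceSelector:
    matchNames:
      - default                      # scrape targets in the default namespace
  endpoints:
    - port: metrics
      interval: 15s
```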

## I. HPA for vLLM AI Inference Server using vLLM metrics

[vLLM AI Inference Server HPA](./vllm-hpa.md)

## II. HPA for vLLM AI Inference Server using NVIDIA GPU metrics

[vLLM AI Inference Server HPA with GPU metrics](./gpu-hpa.md)

ai/vllm-deployment/hpa/gpu-dcgm-exporter-service.yaml

Lines changed: 26 additions & 0 deletions
# This Service provides a stable network endpoint for the NVIDIA DCGM Exporter
# pods. The Prometheus Operator's ServiceMonitor will target this Service
# to discover and scrape the GPU metrics. This is especially important
# because the exporter pods are part of a DaemonSet, and their IPs can change.

apiVersion: v1
kind: Service
metadata:
  name: gke-managed-dcgm-exporter
  namespace: gke-managed-system
  labels:
    # This label is critical. The ServiceMonitor uses this label to find this
    # specific Service. If the labels don't match, Prometheus will not be
    # able to discover the metrics endpoint.
    app.kubernetes.io/name: gke-managed-dcgm-exporter
spec:
  selector:
    # This selector tells the Service which pods to route traffic to.
    # It must match the labels on the DCGM exporter pods.
    app.kubernetes.io/name: gke-managed-dcgm-exporter
  ports:
    # The 'name' of this port is important. The ServiceMonitor will specifically
    # look for a port with this name to scrape metrics from.
    - name: metrics
      port: 9400
      targetPort: 9400

ai/vllm-deployment/hpa/gpu-horizontal-pod-autoscaler.yaml

Lines changed: 50 additions & 0 deletions
# This HorizontalPodAutoscaler (HPA) targets the vLLM deployment and scales
# it based on the average GPU utilization across all pods. It uses the
# custom metric 'gpu_utilization_percent', which is provided by the
# Prometheus Adapter.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: gemma-server-gpu-hpa
  namespace: default
spec:
  # scaleTargetRef points the HPA to the deployment it needs to scale.
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-gemma-deployment
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Pods
      pods:
        metric:
          # This is the custom metric that the HPA will query.
          # IMPORTANT: This name ('gpu_utilization_percent') is not the raw metric
          # from the DCGM exporter. It is the clean, renamed metric that is
          # exposed by the Prometheus Adapter. The names must match exactly.
          name: gpu_utilization_percent
        target:
          type: AverageValue
          # This is the target value for the metric. The HPA will add or remove
          # pods to keep the average GPU utilization across all pods at 20%.
          averageValue: 20
  # The behavior section tunes how aggressively the HPA reacts.
  behavior:
    # Scale up immediately (no stabilization window), adding at most 4 pods or
    # 100% of the current replica count every 15 seconds, whichever is greater.
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Pods
          value: 4
          periodSeconds: 15
        - type: Percent
          value: 100
          periodSeconds: 15
      selectPolicy: Max
    # Scale down after a short 30-second stabilization window, removing up to
    # 100% of the surplus replicas every 15 seconds.
    scaleDown:
      stabilizationWindowSeconds: 30
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
      selectPolicy: Max

ai/vllm-deployment/hpa/gpu-hpa.md

Lines changed: 206 additions & 0 deletions
# Autoscaling an AI Inference Server with HPA using NVIDIA GPU Metrics

This guide provides a detailed walkthrough for configuring a Kubernetes Horizontal Pod Autoscaler (HPA) to dynamically scale a vLLM AI inference server based on NVIDIA GPU utilization. The autoscaling logic is driven by the `DCGM_FI_DEV_GPU_UTIL` metric, which is exposed by the NVIDIA Data Center GPU Manager (DCGM) Exporter. This approach allows the system to scale based on the actual hardware utilization of the GPU, providing a reliable indicator of workload intensity.

This guide assumes you have already deployed the vLLM inference server from the [parent directory's exercise](../README.md) into the `default` namespace.

---

## 1. Verify GPU Metric Collection

The first step is to ensure that GPU metrics are being collected and exposed within the cluster. This is handled by the NVIDIA DCGM Exporter, which runs as a DaemonSet on GPU-enabled nodes and scrapes metrics directly from the GPU hardware. The method for deploying this exporter varies across cloud providers.

### 1.1. Cloud Provider DCGM Exporter Setup

Below are the common setups for GKE, AKS, and EKS.

#### Google Kubernetes Engine (GKE)

On GKE, the DCGM exporter is a managed add-on that is automatically deployed and managed by the system. It runs in the `gke-managed-system` namespace.

**Verification:**
You can verify that the exporter pods are running with the following command:
```bash
kubectl get pods --namespace gke-managed-system | grep dcgm-exporter
```
You should see one or more `dcgm-exporter` pods in a `Running` state.

#### Amazon Elastic Kubernetes Service (EKS) & Microsoft Azure Kubernetes Service (AKS)

On both EKS and AKS, the DCGM exporter is not a managed service and must be installed manually. The standard method is to use the official NVIDIA DCGM Exporter Helm chart, which deploys the exporter as a DaemonSet.

**Installation (for both EKS and AKS):**
If you don't already have the exporter installed, you can do so with the following Helm commands:
```bash
# The DCGM exporter chart is published from NVIDIA's dcgm-exporter repository
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter --namespace monitoring --create-namespace
```
*Note: We are installing it into the `monitoring` namespace to keep all monitoring-related components together. If you take this route, the ServiceMonitor selectors must point at that namespace; see the sketch at the end of this subsection.*

**Verification:**
You can verify that the exporter pods are running in the `monitoring` namespace:
```bash
kubectl get pods --namespace monitoring | grep dcgm-exporter
```
You should see one or more `dcgm-exporter` pods in a `Running` state.
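
If you installed the exporter via Helm rather than using the GKE-managed add-on, the Service and ServiceMonitor in this recipe (which target the `gke-managed-system` namespace) will not match your deployment. A minimal sketch of the `ServiceMonitor` selector changes is shown below; the namespace comes from the Helm install above, and the label value is an assumption you should confirm with `kubectl get svc -n monitoring --show-labels`.

```yaml
# Sketch only: adjust the ServiceMonitor's spec so it finds the Helm-installed
# exporter Service in the "monitoring" namespace. The matchLabels value is an
# assumption; use whatever labels your dcgm-exporter Service actually carries.
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: dcgm-exporter   # assumed label from the Helm chart
  namespaceSelector:
    matchNames:
      - monitoring                            # where the Helm release was installed
  endpoints:
    - port: metrics
      interval: 15s
```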

---

## 2. Set Up Prometheus for Metric Collection

With the metric source confirmed, the next step is to configure Prometheus to scrape, process, and store these metrics.

### 2.1. Install the Prometheus Operator

The Prometheus Operator can be easily installed using its official Helm chart. This will deploy a full monitoring stack into the `monitoring` namespace. If you have already installed it in the previous exercise, you can skip this step.

```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts/
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack --namespace monitoring --create-namespace
```

### 2.2. Create a Service for the DCGM Exporter

The `ServiceMonitor` needs a stable network endpoint to reliably scrape metrics from the DCGM exporter pods. A Kubernetes Service provides this stable endpoint.

Apply the service manifest:
```bash
kubectl apply -f ./gpu-dcgm-exporter-service.yaml
```

Verify that the service has been created successfully:
```bash
kubectl get svc -n gke-managed-system | grep gke-managed-dcgm-exporter
```

### 2.3. Configure Metric Scraping with a `ServiceMonitor`

The `ServiceMonitor` tells the Prometheus Operator to scrape the DCGM exporter Service.

```bash
kubectl apply -f ./gpu-service-monitor.yaml
```

### 2.4. Create a Prometheus Rule for Metric Relabeling

This is a critical step. The raw `DCGM_FI_DEV_GPU_UTIL` metric does not have the standard `pod` and `namespace` labels the HPA needs. This `PrometheusRule` creates a *new*, correctly-labelled metric named `gke_dcgm_fi_dev_gpu_util_relabelled` that the Prometheus Adapter can use.

```bash
kubectl apply -f ./prometheus-rule.yaml
```
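
The contents of `prometheus-rule.yaml` are not reproduced in this guide, but conceptually it is a recording rule that copies `DCGM_FI_DEV_GPU_UTIL` to the new name while promoting the pod and namespace identity into standard labels. A minimal sketch is shown below; it assumes the scraped series carries `exported_pod` and `exported_namespace` labels, which you should confirm against your own data before relying on it.

```yaml
# Sketch of a recording rule that produces the relabelled metric. The
# "exported_pod"/"exported_namespace" source labels are assumptions; inspect
# the raw DCGM_FI_DEV_GPU_UTIL series in Prometheus to see which labels carry
# the pod identity in your cluster.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: dcgm-gpu-util-relabel        # hypothetical name
  namespace: monitoring
  labels:
    release: prometheus              # must match the Operator's ruleSelector
spec:
  groups:
    - name: dcgm-gpu-util.rules
      rules:
        - record: gke_dcgm_fi_dev_gpu_util_relabelled
          expr: |
            label_replace(
              label_replace(DCGM_FI_DEV_GPU_UTIL, "pod", "$1", "exported_pod", "(.+)"),
              "namespace", "$1", "exported_namespace", "(.+)"
            )
```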

### 2.5. Verify Metric Collection and Relabeling in Prometheus

To ensure the entire pipeline is working, you must verify that the *new*, relabelled metric exists. First, establish a port-forward to the Prometheus service.

```bash
kubectl port-forward svc/prometheus-kube-prometheus-prometheus 9090:9090 -n monitoring
```

In a separate terminal, use `curl` to query for the new metric.
```bash
# Query Prometheus for the new, relabelled metric
curl -sS "http://localhost:9090/api/v1/query?query=gke_dcgm_fi_dev_gpu_util_relabelled" | jq
```
A successful verification will show the metric in the `result` array, complete with the correct `pod` and `namespace` labels.

---

## 3. Configure the Horizontal Pod Autoscaler

Now that a clean, usable metric is available in Prometheus, you can configure the HPA.

### 3.1. Deploy the Prometheus Adapter

The Prometheus Adapter bridges Prometheus and the Kubernetes custom metrics API. It is configured to read the `gke_dcgm_fi_dev_gpu_util_relabelled` metric and expose it as `gpu_utilization_percent`.

```bash
kubectl apply -f ./prometheus-adapter.yaml
```
Verify that the adapter's pod is running in the `monitoring` namespace.
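
The rename happens in the adapter's rules configuration (part of `prometheus-adapter.yaml`, not shown in this guide). A minimal sketch of such a rule follows, assuming the standard prometheus-adapter rule format; the actual manifest in this recipe may structure it differently.

```yaml
# Sketch of a prometheus-adapter rule (typically embedded in the adapter's
# ConfigMap as config.yaml). It discovers the relabelled series, maps its
# pod/namespace labels to Kubernetes resources, and renames the metric to
# gpu_utilization_percent for the custom metrics API.
rules:
  - seriesQuery: 'gke_dcgm_fi_dev_gpu_util_relabelled{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "gke_dcgm_fi_dev_gpu_util_relabelled"
      as: "gpu_utilization_percent"
    metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```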

### 3.2. Verify the Custom Metrics API

After deploying the adapter, it's vital to verify that it is successfully exposing the transformed metrics to the Kubernetes API. You can do this by querying the custom metrics API directly.

```bash
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq .
```

The output should be a list of available custom metrics. Look for the `pods/gpu_utilization_percent` metric, which confirms that the entire pipeline is working correctly and the metric is ready for the HPA to consume.

```json
{
  "kind": "APIResourceList",
  "apiVersion": "v1",
  "groupVersion": "custom.metrics.k8s.io/v1beta1",
  "resources": [
    {
      "name": "pods/gpu_utilization_percent",
      "singularName": "",
      "namespaced": true,
      "kind": "MetricValueList",
      "verbs": [
        "get"
      ]
    }
  ]
}
```

### 3.3. Deploy the Horizontal Pod Autoscaler (HPA)

The HPA is configured to use the final, clean metric name, `gpu_utilization_percent`, to maintain an average GPU utilization of 20%.

```bash
kubectl apply -f ./gpu-horizontal-pod-autoscaler.yaml
```

Inspect the HPA's configuration to confirm it's targeting the correct metric.
```bash
kubectl describe hpa/gemma-server-gpu-hpa -n default
# Expected output should include:
#   Metrics:      ( current / target )
#     "gpu_utilization_percent" on pods:  <current value> / 20
```

---

## 4. Load Test the Autoscaling Setup

Generate a sustained load on the vLLM server to cause GPU utilization to rise.

### 4.1. Generate Inference Load

First, establish a port-forward to the vLLM service.
```bash
kubectl port-forward service/vllm-service -n default 8081:8081
```

In another terminal, execute the `request-looper.sh` script.
```bash
./request-looper.sh
```

### 4.2. Observe the HPA Scaling the Deployment

While the load script is running, monitor the HPA's behavior.
```bash
# See the HPA's metric values and scaling events
kubectl describe hpa/gemma-server-gpu-hpa -n default

# Watch the number of deployment replicas increase
kubectl get deploy/vllm-gemma-deployment -n default -w
```
As the average GPU utilization exceeds the 20% target, the HPA will scale up the deployment.
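
For reference, the HPA controller follows the standard scaling formula `desiredReplicas = ceil(currentReplicas × currentMetricValue / targetValue)`. For example, if one replica reports an average GPU utilization of 60% against the 20% target, the HPA requests `ceil(1 × 60 / 20) = 3` replicas, subject to `maxReplicas` and the `behavior` policies in the manifest.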

---

## 5. Cleanup

To tear down the resources from this exercise, run the following command:
```bash
kubectl delete -f .
```

ai/vllm-deployment/hpa/gpu-service-monitor.yaml

Lines changed: 28 additions & 0 deletions
# This ServiceMonitor tells the Prometheus Operator how to discover and scrape
# metrics from the NVIDIA DCGM Exporter. It is designed to find the
# 'gke-managed-dcgm-exporter' Service in the 'gke-managed-system' namespace
# and scrape its '/metrics' endpoint.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nvidia-dcgm-exporter-servicemonitor
  namespace: monitoring
  labels:
    # This label is used by the Prometheus Operator to discover this
    # ServiceMonitor. It must match the 'serviceMonitorSelector' configured
    # in the Prometheus custom resource.
    release: prometheus
spec:
  # This selector identifies the specific Service to scrape. It must match
  # the labels on the 'gke-managed-dcgm-exporter' Service.
  selector:
    matchLabels:
      app.kubernetes.io/name: gke-managed-dcgm-exporter
  # This selector specifies which namespace to search for the target Service.
  # For GKE, the DCGM service is in 'gke-managed-system'.
  namespaceSelector:
    matchNames:
      - gke-managed-system
  endpoints:
    - port: metrics
      interval: 15s
