Conversation

seans3 (Contributor) commented Aug 26, 2025

This PR extends the existing vLLM server example by introducing a complete Horizontal Pod Autoscaling (HPA) solution, contained within a new hpa/ directory. This provides a production-ready pattern for automatically scaling the AI inference server based on real-time demand.

Two distinct autoscaling methods are provided:

  • By vLLM Server Metrics: Scales based on the number of concurrent inference requests.
  • By NVIDIA GPU Utilization: Scales based on hardware-level GPU utilization.

How It Works

The solution uses a standard Prometheus-based monitoring pipeline. The Prometheus Operator scrapes metrics from either the vLLM server or the NVIDIA DCGM exporter. For GPU metrics, a PrometheusRule is used to relabel the raw data, making it compatible with the HPA. The Prometheus Adapter then exposes these metrics to the Kubernetes Custom Metrics API, which the HPA controller consumes to drive scaling decisions.
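For the vLLM path, the adapter step typically amounts to a prometheus-adapter rule that maps a scraped series onto the Custom Metrics API. A minimal sketch, assuming the vLLM server's `vllm:num_requests_running` gauge (the renamed metric and label handling below are illustrative, not taken from this PR):

```yaml
rules:
  - seriesQuery: 'vllm:num_requests_running{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "^vllm:num_requests_running$"
      as: "vllm_num_requests_running"
    metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```

An HPA can then reference `vllm_num_requests_running` as a `Pods`-type metric once the adapter serves it through the Custom Metrics API.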

What's New

  • hpa/ directory: Contains all new manifests and documentation.
  • Two HPA Examples: Includes horizontal-pod-autoscaler.yaml for vLLM metrics and gpu-horizontal-pod-autoscaler.yaml for GPU metrics.
  • Step-by-Step Guides: vllm-hpa.md and gpu-hpa.md provide detailed instructions for each scaling method, including multi-cloud support for the GPU example.
  • Load Testing Script: request-looper.sh is included to easily generate load and test the autoscaling functionality.

How to Test

Detailed instructions and verification steps are available in the new guides:

  • For vLLM metrics: hpa/vllm-hpa.md
  • For GPU metrics: hpa/gpu-hpa.md

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Aug 26, 2025
@k8s-ci-robot k8s-ci-robot requested review from kow3ns and soltysh August 26, 2025 23:28
@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Aug 26, 2025
seans3 (Contributor, PR author) commented Aug 26, 2025

/assign @janetkuo

k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: seans3
Once this PR has been reviewed and has the lgtm label, please ask for approval from janetkuo. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

janetkuo (Member) left a comment:


Thanks for adding this practical example for autoscaling!


## Prerequisites

This guide assumes you have a running Kubernetes cluster and `kubectl` installed. The vLLM server will be deployed in the `default` namespace, and the Prometheus and HPA resources will be in the `monitoring` namespace.
janetkuo (Member) commented:

Just noticed that the default namespace is used for vLLM. Kubernetes best practice is to avoid deploying applications in the default namespace; using it for actual workloads can lead to significant operational and security challenges as cluster usage grows.
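As a sketch of that fix, the example could create a dedicated namespace and deploy into it explicitly (the `vllm-inference` name and `vllm-server` Deployment name below are hypothetical, not from this PR):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: vllm-inference   # hypothetical dedicated namespace
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server      # illustrative name
  namespace: vllm-inference
spec:
  # ... Deployment spec unchanged ...
```

The HPA's `scaleTargetRef` and any ServiceMonitor `namespaceSelector` would then need to reference the same namespace.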

┌────────────────┐
│ PrometheusRule │
└────────────────┘
janetkuo (Member) commented:

Some suggestions for the diagram to make it more clear:

  • Use numbered steps and arrow directions to guide the user through the precise data flow (scrape, evaluate, record, query, scale) from start to finish.
  • The flow hides the crucial transformation step where a raw metric is converted into a processed metric. Recommend clearly labeling the initial scrape with the raw DCGM metric name and the query from the adapter with the new, processed metric name.
  • The PrometheusRule is shown as a final step in the "GPU Path Only". However, the PrometheusRule is not a destination for data; it's a configuration that tells the Prometheus Server how to perform an internal calculation.
  • Include the Kubernetes API Server between the adapter and the HPA.

janetkuo (Member) commented:

Just noticed that GitHub supports mermaid diagrams in markdown: https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/creating-diagrams#creating-mermaid-diagrams
Might be easier to edit than ASCII diagrams.
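As one possible starting point, a mermaid version of the GPU path incorporating the suggestions above might look like this (step numbering and node names are illustrative):

```mermaid
flowchart LR
    EXP[DCGM Exporter<br/>raw DCGM_FI_DEV_GPU_UTIL] -->|1. scrape| PROM[Prometheus Server]
    RULE[PrometheusRule<br/>recording-rule config] -.->|2. configures relabeling| PROM
    PROM -->|3. query processed metric| ADPT[Prometheus Adapter]
    ADPT -->|4. expose via Custom Metrics API| API[Kubernetes API Server]
    API -->|5. serve metric| HPA[HPA Controller]
    HPA -->|6. scale| DEP[vLLM Deployment]
```

The dashed arrow captures the point above: the PrometheusRule configures Prometheus rather than receiving data from it.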


## II. HPA for vLLM AI Inference Server using NVIDIA GPU metrics

[vLLM AI Inference Server HPA with GPU metrics](./gpu-hpa.md)
janetkuo (Member) commented:

We could discuss the trade-offs between these two metric options here, and how to combine multiple metrics for robustness (e.g., scale up if either the number of running requests exceeds a certain threshold or GPU utilization spikes).
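A combined policy could list both metrics in one HPA spec; the controller computes a desired replica count per metric and takes the highest, which gives exactly the "scale if either spikes" behavior. A sketch, with illustrative metric names and thresholds:

```yaml
spec:
  metrics:
    - type: Pods
      pods:
        metric:
          name: vllm_num_requests_running   # illustrative name
        target:
          type: AverageValue
          averageValue: "20"
    - type: Pods
      pods:
        metric:
          name: dcgm_gpu_utilization        # illustrative name
        target:
          type: AverageValue
          averageValue: "80"
```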

        averageValue: 20
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
janetkuo (Member) commented:

We can discuss the trade-offs here, e.g., the risk of over-scaling versus highly volatile workloads where immediate scale-up is critical to maintain performance and responsiveness.
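One middle ground worth mentioning: keep scale-up immediate but rate-limit it, and dampen scale-down separately. A sketch with illustrative values:

```yaml
behavior:
  scaleUp:
    stabilizationWindowSeconds: 0    # react to spikes immediately
    policies:
      - type: Pods
        value: 4
        periodSeconds: 60            # but add at most 4 pods per minute
  scaleDown:
    stabilizationWindowSeconds: 300  # require 5 minutes of sustained low load
```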

  # the labels on the 'gke-managed-dcgm-exporter' Service.
  selector:
    matchLabels:
      app.kubernetes.io/name: gke-managed-dcgm-exporter
janetkuo (Member) commented:

Does the label value need to be GKE specific? Can this be more generic?
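For comparison, a self-managed DCGM exporter installation typically carries a vendor-neutral label; a more generic selector might look like the following (the label value is an assumption about the user's installation, not something this PR ships):

```yaml
selector:
  matchLabels:
    app.kubernetes.io/name: dcgm-exporter
```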
