4 changes: 2 additions & 2 deletions .cd/README.md
@@ -149,9 +149,9 @@ cd vllm-gaudi/.cd/

```bash
HF_TOKEN=<your huggingface token> \
VLLM_SERVER_CONFIG_FILE=server_configurations/server_text.yaml \
VLLM_SERVER_CONFIG_FILE=server/server_scenarios_text.yaml \
VLLM_SERVER_CONFIG_NAME=llama31_8b_instruct \
VLLM_BENCHMARK_CONFIG_FILE=benchmark_configurations/benchmark_text.yaml \
VLLM_BENCHMARK_CONFIG_FILE=benchmark/benchmark_scenarios_text.yaml \
VLLM_BENCHMARK_CONFIG_NAME=llama31_8b_instruct \
docker compose --profile benchmark up
```
1 change: 1 addition & 0 deletions README.md
@@ -42,6 +42,7 @@ Learn more: 🚀 [vLLM Plugin System Overview](https://docs.vllm.ai/en/latest/de
git clone https://github.com/vllm-project/vllm-gaudi
cd vllm-gaudi
export VLLM_COMMIT_HASH=$(git show "origin/vllm/last-good-commit-for-vllm-gaudi:VLLM_STABLE_COMMIT" 2>/dev/null)
cd ..
```

2. Install vLLM with `pip` or [from source](https://docs.vllm.ai/en/latest/getting_started/installation/gpu/index.html#build-wheel-from-source):
29 changes: 23 additions & 6 deletions docs/README.md
@@ -1,4 +1,4 @@
# Welcome to vLLM x Intel Gaudi
# Intel® Gaudi® vLLM Plugin

<figure markdown="span" style="display: flex; justify-content: center; align-items: center; gap: 10px; margin: auto;">
<img src="./assets/logos/vllm-logo-text-light.png" alt="vLLM" style="width: 30%; margin: 0;"> x
@@ -15,11 +15,28 @@
<a class="github-button" href="https://github.com/vllm-project/vllm-gaudi/fork" data-show-count="true" data-icon="octicon-repo-forked" data-size="large" aria-label="Fork">Fork</a>
</p>

vLLM Gaudi plugin (vllm-gaudi) integrates Intel Gaudi accelerators with vLLM to optimize large language model inference.
Welcome to the **vLLM-Gaudi plugin**, a community-maintained integration layer that enables high-performance large language model (LLM) inference on Intel® Gaudi® AI accelerators.

This plugin follows the [[RFC]: Hardware pluggable](https://github.com/vllm-project/vllm/issues/11162) and [[RFC]: Enhancing vLLM Plugin Architecture](https://github.com/vllm-project/vllm/issues/19161) principles, providing a modular interface for Intel Gaudi hardware.
## 🔍 What is vLLM-Gaudi?

Learn more:
The **vLLM-Gaudi plugin** connects the vLLM serving engine with Intel Gaudi hardware, offering optimized inference capabilities for enterprise-scale LLM workloads. It is developed and maintained by the Intel Gaudi team and follows the Hardware Pluggable [RFC](https://github.com/vllm-project/vllm/issues/11162) and vLLM Plugin Architecture [RFC](https://github.com/vllm-project/vllm/issues/19161) for modular integration.

📚 [Intel Gaudi Documentation](https://docs.habana.ai/en/v1.21.1/index.html)
🚀 [vLLM Plugin System Overview](https://docs.vllm.ai/en/latest/design/plugin_system.html)
## 🚀 Why Use It?

- **Optimized for Gaudi**: Supports advanced features such as the bucketing mechanism, FP8 quantization, and custom graph caching for fast warm-up and efficient memory use.
- **Scalable and Efficient**: Designed to maximize throughput and minimize latency for large-scale deployments, making it ideal for production-grade LLM inference.
- **Community-Ready**: Actively maintained on [GitHub](https://github.com/vllm-project/vllm-gaudi) with contributions from the Intel Gaudi team and the broader vLLM ecosystem.

## ✅ Action Items

To get started with the Intel® Gaudi® vLLM Plugin:

- [ ] **Set up your environment** using the [quickstart](getting_started/quickstart.md) to install the plugin locally or in a containerized environment.
- [ ] **Run inference** using supported models such as Llama 3.1, Mixtral, or DeepSeek (see the example after this list).
- [ ] **Explore advanced features** such as FP8 quantization, recipe caching, and expert parallelism.
- [ ] **Join the community** by contributing to the [vLLM-Gaudi GitHub repo](https://github.com/vllm-project/vllm-gaudi).
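
For the "run inference" item, a minimal sketch of serving and querying a model on Gaudi; the model name, port, and prompt are illustrative placeholders:

```bash
# In one terminal: start the OpenAI-compatible server (model name is illustrative).
vllm serve meta-llama/Llama-3.1-8B-Instruct

# In another terminal, once the server is up: send a chat completion request.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Hello from Gaudi!"}]
      }'
```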

### Learn more

📚 [Intel Gaudi Documentation](https://docs.habana.ai/en/latest/index.html)
📦 [vLLM Plugin System Overview](https://docs.vllm.ai/en/latest/design/plugin_system.html)
36 changes: 21 additions & 15 deletions docs/features/supported_features.md
@@ -1,38 +1,44 @@
---
title: Supported Features
---
[](){ #supported-features }

## Supported Features

| **Feature** | **Description** | **References** |
|--- |--- |--- |
| Offline batched inference | Offline inference using LLM class from vLLM Python API | [Quickstart](https://docs.vllm.ai/en/stable/getting_started/quickstart.html#offline-batched-inference) [Example](https://docs.vllm.ai/en/stable/getting_started/examples/offline_inference.html) |
| Online inference via OpenAI-Compatible Server | Online inference using HTTP server that implements OpenAI Chat and Completions API | [Documentation](https://docs.vllm.ai/en/stable/serving/openai_compatible_server.html) [Example](https://docs.vllm.ai/en/stable/getting_started/examples/openai_chat_completion_client.html) |
| Offline batched inference | Offline inference using LLM class from vLLM Python API | [Quickstart](https://docs.vllm.ai/en/stable/getting_started/quickstart.html#offline-batched-inference) [Example](https://docs.vllm.ai/en/stable/examples/offline_inference/batch_llm_inference.html) |
| Online inference via OpenAI-Compatible Server | Online inference using HTTP server that implements OpenAI Chat and Completions API | [Documentation](https://docs.vllm.ai/en/stable/serving/openai_compatible_server.html) [Example](https://docs.vllm.ai/en/stable/serving/openai_compatible_server.html) |
| HPU autodetection | HPU users do not need to specify the target platform; it is detected automatically at vLLM startup. | N/A |
| Paged KV cache with algorithms enabled for Intel Gaudi accelerators | vLLM HPU backend contains custom Paged Attention and cache operator implementations optimized for Gaudi devices. | N/A |
| Custom Intel Gaudi operator implementations | vLLM HPU backend provides optimized implementations of operators such as prefill attention, Root Mean Square Layer Normalization, Rotary Positional Encoding. | N/A |
| Tensor parallel inference | vLLM HPU backend supports multi-HPU inference with tensor parallelism with multiprocessing. | [Documentation](https://docs.vllm.ai/en/stable/serving/distributed_serving.html) [Example](https://docs.ray.io/en/latest/serve/tutorials/vllm-example.html) [HCCL reference](https://docs.habana.ai/en/latest/API_Reference_Guides/HCCL_APIs/index.html) |
| Tensor parallel inference | vLLM HPU backend supports multi-HPU inference with tensor parallelism with multiprocessing. | [Documentation](https://docs.vllm.ai/en/stable/serving/distributed_serving.html) [HCCL reference](https://docs.habana.ai/en/latest/API_Reference_Guides/HCCL_APIs/index.html) |
| Pipeline parallel inference | vLLM HPU backend supports multi-HPU inference with pipeline parallelism. | [Documentation](https://docs.vllm.ai/en/stable/serving/distributed_serving.html) [Running Pipeline Parallelism](https://vllm-gaudi.readthedocs.io/en/latest/configuration/pipeline_parallelism.html) |
| Inference with HPU Graphs | vLLM HPU backend uses HPU Graphs by default for optimal performance. When HPU Graphs are enabled, execution graphs will be recorded ahead of time and replayed later during inference, significantly reducing host overheads. | [Documentation](https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Inference_Using_HPU_Graphs.html) [vLLM HPU backend execution modes](https://docs.vllm.ai/en/stable/getting_started/gaudi-installation.html#execution-modes) [Optimization guide](https://docs.vllm.ai/en/latest/getting_started/gaudi-installation.html#hpu-graph-capture) |
| Inference with torch.compile | vLLM HPU backend supports inference with `torch.compile`. | [vLLM HPU backend execution modes](https://docs.vllm.ai/en/stable/getting_started/gaudi-installation.html#execution-modes) |
| Inference with HPU Graphs | vLLM HPU backend uses HPU Graphs by default for optimal performance. When HPU Graphs are enabled, execution graphs will be recorded ahead of time and replayed later during inference, significantly reducing host overheads. | [Documentation](https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Inference_Using_HPU_Graphs.html) [Optimization guide](../configuration/optimization.html) |
| Inference with torch.compile | vLLM HPU backend supports inference with `torch.compile`, which is the default execution mode for HPU. | N/A |
| INC quantization | vLLM HPU backend supports FP8 model and KV cache quantization and calibration with Intel Neural Compressor (INC). (Not fully supported with torch.compile execution mode) | [Documentation](https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Inference_Using_FP8.html) |
| AutoAWQ quantization | vLLM HPU backend supports inference with models quantized using AutoAWQ library. | [Library](https://github.com/casper-hansen/AutoAWQ) |
| AutoGPTQ quantization | vLLM HPU backend supports inference with models quantized using AutoGPTQ library. | [Library](https://github.com/AutoGPTQ/AutoGPTQ) |
| LoRA/MultiLoRA support | vLLM HPU backend includes support for LoRA and MultiLoRA on supported models. | [Documentation](https://docs.vllm.ai/en/stable/models/lora.html) [Example](https://docs.vllm.ai/en/stable/getting_started/examples/multilora_inference.html) [vLLM supported models](https://docs.vllm.ai/en/latest/models/supported_models.html) |
| LoRA/MultiLoRA support | vLLM HPU backend includes support for LoRA and MultiLoRA on supported models. | [Documentation](https://docs.vllm.ai/en/stable/models/lora.html) [Example](https://docs.vllm.ai/en/stable/examples/offline_inference/multilora_inference.html) [vLLM supported models](https://docs.vllm.ai/en/latest/models/supported_models.html) |
| Fully async model executor | Allows the model runner to operate asynchronously when async scheduling is used, fully overlapping CPU operations (including `prepare_inputs`) with the model forward pass. Does not support speculative decoding, PP, or guided decoding. Expected speedup is 5-10% over the current async scheduling. | [Feature description](https://github.com/vllm-project/vllm/pull/23569) |
| Automatic prefix caching | vLLM HPU backend includes automatic prefix caching (APC) support for more efficient prefills, configurable by standard `--enable-prefix-caching` parameter. | [Documentation](https://docs.vllm.ai/en/stable/automatic_prefix_caching/apc.html) [Details](https://docs.vllm.ai/en/stable/automatic_prefix_caching/details.html) |
| Speculative decoding (functional release) | vLLM HPU backend includes experimental speculative decoding support for improving inter-token latency in some scenarios, configurable via standard `--speculative_model` and `--num_speculative_tokens` parameters. (Not fully supported with torch.compile execution mode) | [Documentation](https://docs.vllm.ai/en/stable/models/spec_decode.html) [Example](https://docs.vllm.ai/en/stable/getting_started/examples/mlpspeculator.html) |
| Automatic prefix caching | vLLM HPU backend includes automatic prefix caching (APC) support for more efficient prefills, enabled by default. | [Documentation](https://docs.vllm.ai/en/stable/automatic_prefix_caching/apc.html) |
| Speculative decoding (functional release) | vLLM HPU backend includes experimental speculative decoding support for improving inter-token latency in some scenarios, configurable via standard `--speculative_model` and `--num_speculative_tokens` parameters. (Not fully supported with torch.compile execution mode) | [Documentation](https://docs.vllm.ai/en/stable/models/spec_decode.html) [Example](https://docs.vllm.ai/en/stable/examples/offline_inference/spec_decode.html) |
| Multiprocessing backend | Multiprocessing is the default distributed runtime in vLLM. | [Documentation](https://docs.vllm.ai/en/latest/serving/distributed_serving.html) |
| Multimodal | vLLM HPU backend supports the inference for multi-modal models. (Not fully supported with t.compile execution mode) | [Documentation](https://docs.vllm.ai/en/latest/serving/multimodal_inputs.html) |
| Multimodal | vLLM HPU backend supports inference for multimodal models. (Not fully supported with the torch.compile execution mode) | [Documentation](https://docs.vllm.ai/en/latest/features/multimodal_inputs.html) |
| Guided decode | vLLM HPU supports a guided decoding backend for generating structured outputs. | [Documentation](https://docs.vllm.ai/en/latest/features/structured_outputs.html) |
| Exponential bucketing | vLLM HPU supports exponential bucket spacing instead of linear to automate configuration of the bucketing mechanism, enabled by default. It can be disabled via the `VLLM_EXPONENTIAL_BUCKETING=false` environment variable. | N/A |
| Data Parallel support | vLLM HPU supports Data Parallel deployment. | [Documentation](https://docs.vllm.ai/en/stable/serving/data_parallel_deployment.html) [Example](https://docs.vllm.ai/en/latest/examples/offline_inference/data_parallel.html) |
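
As a quick illustration, the launch below combines several options named in the table; the model name and parallelism degree are placeholders, and the bucketing variable is shown only to demonstrate the opt-out documented above:

```bash
# Illustrative launch: tensor parallelism, automatic prefix caching, and the
# exponential-bucketing opt-out variable from the table above.
VLLM_EXPONENTIAL_BUCKETING=false \
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --tensor-parallel-size 2 \
  --enable-prefix-caching
```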

## Discontinued Features

| **Feature** | **Description** | **Reasoning** |
|--- |--- |--- |
| Multi-step scheduling | vLLM HPU backend includes multi-step scheduling support for host overhead reduction. | Replaced by async scheduling, configurable by the standard `--async_scheduling` parameter. |
| Delayed Sampling | Support for delayed sampling scheduling for asynchronous execution. | Replaced by async scheduling, configurable by the standard `--async_scheduling` parameter. |

## Coming Soon

- Sliding window attention
- P/D disaggregate support
- In-place weight update
- MLA with Unified Attention
- Multinode support
- [ ] Sliding window attention
- [ ] P/D disaggregate support
- [ ] In-place weight update
- [ ] MLA with Unified Attention
- [ ] Multinode support
1 change: 1 addition & 0 deletions docs/getting_started/installation.md
@@ -76,6 +76,7 @@ Use the following commands to run a Docker image. Make sure to update the versio
git clone https://github.com/vllm-project/vllm-gaudi
cd vllm-gaudi
export VLLM_COMMIT_HASH=$(git show "origin/vllm/last-good-commit-for-vllm-gaudi:VLLM_STABLE_COMMIT" 2>/dev/null)
cd ..
```

=== "Step 2: Install vLLM"
4 changes: 2 additions & 2 deletions docs/getting_started/quickstart.md
@@ -156,9 +156,9 @@ For most users, the basic setup is sufficient, but advanced users may benefit fr

```bash
HF_TOKEN=<your huggingface token> \
VLLM_SERVER_CONFIG_FILE=server_configurations/server_text.yaml \
VLLM_SERVER_CONFIG_FILE=server/server_scenarios_text.yaml \
VLLM_SERVER_CONFIG_NAME=llama31_8b_instruct \
VLLM_BENCHMARK_CONFIG_FILE=benchmark_configurations/benchmark_text.yaml \
VLLM_BENCHMARK_CONFIG_FILE=benchmark/benchmark_scenarios_text.yaml \
VLLM_BENCHMARK_CONFIG_NAME=llama31_8b_instruct \
docker compose --profile benchmark up
```
132 changes: 130 additions & 2 deletions docs/user_guide/faq.md
@@ -1,6 +1,134 @@
---
title: Frequently Asked Questions
title: "vLLM with Intel Gaudi: Frequently Asked Questions"
---
[](){ #faq }

WIP
## Prerequisites and System Requirements

### What are the system requirements for running vLLM on Intel® Gaudi®?

- Ubuntu 22.04 LTS OS.
- Python 3.10.
- Intel Gaudi 2 or Intel Gaudi 3 AI accelerator.
- Intel Gaudi software version 1.23.0 and above.

### What is the vLLM plugin and where can I find its GitHub repository?

- Intel develops and maintains its own vLLM plugin project, [vllm-gaudi](https://github.com/vllm-project/vllm-gaudi).

### How do I verify that the Intel Gaudi software is installed correctly?

- Run ``hl-smi`` to check if Gaudi accelerators are visible. Refer to [System Verifications and Final Tests](https://docs.habana.ai/en/latest/Installation_Guide/System_Verification_and_Final_Tests.html#system-verification) for more details.

- Run ``apt list --installed | grep habana`` to verify the installed packages. The output should look similar to the following:

```text
$ apt list --installed | grep habana
habanalabs-container-runtime
habanalabs-dkms
habanalabs-firmware-tools
habanalabs-graph
habanalabs-qual
habanalabs-rdma-core
habanalabs-thunk
habanalabs-tools
```

- Check the installed Python packages by running ``pip list | grep habana`` and ``pip list | grep neural``. The output should look similar to the following:

```text
$ pip list | grep habana
habana_gpu_migration 1.19.0.561
habana-media-loader 1.19.0.561
habana-pyhlml 1.19.0.561
habana-torch-dataloader 1.19.0.561
habana-torch-plugin 1.19.0.561
lightning-habana 1.6.0
Pillow-SIMD 9.5.0.post20+habana
$ pip list | grep neural
neural_compressor_pt 3.2
```

### How can I quickly set up the environment for vLLM using Docker?

Use the `Dockerfile.ubuntu.pytorch.vllm` file provided in the `.cd` directory of the [vllm-project/vllm-gaudi](https://github.com/vllm-project/vllm-gaudi) GitHub repo to build and run a container with the latest Intel Gaudi software release.

For more details, see [Quick Start Using Dockerfile](../getting_started/quickstart.md).
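
A hedged sketch of that flow; the image tag, runtime flags, and any required build arguments are assumptions, and the quickstart documents the supported invocation:

```bash
# Build the image from the Dockerfile shipped in .cd (extra build args may be required).
cd vllm-gaudi/.cd
docker build -f Dockerfile.ubuntu.pytorch.vllm -t vllm-gaudi:latest .

# Run it with the Habana container runtime.
docker run -it --runtime=habana \
  -e HABANA_VISIBLE_DEVICES=all \
  -e HF_TOKEN=<your huggingface token> \
  vllm-gaudi:latest
```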

## Building and Installing vLLM

### How can I install vLLM on Intel Gaudi?

- There are three installation methods (a plugin source-install sketch follows this list):

    - (Recommended) Install the stable version from the [HabanaAI/vllm-fork](https://github.com/HabanaAI/vllm-fork) GitHub repo. This version is most suitable for production deployments.

    - Install the latest version from the HabanaAI/vllm-fork GitHub repo. This version is suitable for developers who want to work on experimental code and new features that are still being tested.

    - Install from the main vLLM GitHub repo. This version is suitable for developers who want to work with the official vLLM project but may not have the latest Intel Gaudi features.
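
For the vllm-gaudi plugin itself, a minimal source-install sketch; the clone and commit-pinning steps mirror this repository's README, while the final install command is illustrative (see the [installation guide](../getting_started/installation.md) for the supported steps):

```bash
# Fetch the plugin and the vLLM commit it is validated against
# (mirrors the snippet in this repository's README).
git clone https://github.com/vllm-project/vllm-gaudi
cd vllm-gaudi
export VLLM_COMMIT_HASH=$(git show "origin/vllm/last-good-commit-for-vllm-gaudi:VLLM_STABLE_COMMIT" 2>/dev/null)
cd ..

# Install vLLM (pinned to $VLLM_COMMIT_HASH, via pip or from source), then the
# plugin; the exact commands are documented in the installation guide.
pip install -e ./vllm-gaudi   # illustrative editable install of the plugin
```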

## Examples and Model Support

### Which models and configurations have been validated on Gaudi 2 and Gaudi 3 devices?

- Various Llama 2, Llama 3, and Llama 3.1 models (7B, 8B, and 70B versions). Refer to the Llama 3.1 Jupyter notebook example.

- Mistral and Mixtral models.

- Different tensor parallelism configurations (single HPU, 2x, and 8x HPU).

- See [Supported Configurations](https://github.com/HabanaAI/vllm-fork/blob/v1.22.1/README_GAUDI.md#supported-configurations) for more details.

## Features and Support

### Which key features does vLLM support on Intel Gaudi?

- Offline Batched Inference.

- OpenAI-Compatible Server.

- Paged KV cache optimized for Gaudi devices.

- Speculative decoding (experimental).

- Tensor parallel inference.

- FP8 models and KV Cache quantization and calibration with Intel® Neural Compressor (INC). See [FP8 Calibration and Inference with vLLM](../features/quantization/inc.md) for more details.

- See [Supported Features](../features/supported_features.md) for more details.

## Performance Tuning

### Which execution modes does vLLM support on Intel Gaudi?

- PyTorch eager mode.

- torch.compile (default).

- HPU Graphs (recommended for best performance).

- PyTorch lazy mode.

- See the vLLM HPU backend execution modes documentation for more details, and the sketch below.
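
A hedged sketch of how a mode might be selected at launch; the flag and environment variable shown are common mechanisms rather than an authoritative list, so confirm the supported switches in the execution modes documentation:

```bash
# Default launch: the HPU backend runs with torch.compile.
vllm serve meta-llama/Llama-3.1-8B-Instruct

# Eager execution without graph capture (illustrative use of a standard vLLM flag).
vllm serve meta-llama/Llama-3.1-8B-Instruct --enforce-eager

# PyTorch lazy mode via the Habana runtime variable (illustrative).
PT_HPU_LAZY_MODE=1 vllm serve meta-llama/Llama-3.1-8B-Instruct
```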

### How does the bucketing mechanism work in vLLM for Intel Gaudi?

- The bucketing mechanism optimizes performance by grouping tensor shapes. This reduces the number of required graphs and minimizes compilations during server runtime.

- Buckets are determined by parameters for batch size and sequence length.

- See [Bucketing Mechanism](../features/bucketing_mechanism.md) for more details.

### What should I do if a request exceeds the maximum bucket size?

- Consider increasing the upper bucket boundaries using environment variables to avoid potential latency increases due to graph compilation.
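
A hedged sketch of doing so; the variable names follow the HPU bucketing convention described in the Bucketing Mechanism guide, and the exact names and values should be treated as illustrative:

```bash
# Raise the upper prompt-length and decode-block buckets so long requests land
# inside pre-compiled buckets instead of triggering a new graph compilation.
export VLLM_PROMPT_SEQ_BUCKET_MAX=4096
export VLLM_DECODE_BLOCK_BUCKET_MAX=512
vllm serve meta-llama/Llama-3.1-8B-Instruct
```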

## Troubleshooting

### How to troubleshoot Out-of-Memory errors encountered while running vLLM on Intel Gaudi?

- Increase ``--gpu-memory-utilization`` (default: 0.9) - This addresses insufficient available memory per card.

- Increase ``--tensor-parallel-size`` (default: 1) - This approach shards model weights across the devices and may help in loading a model (which is too big for a single card) across multiple cards.

- Disable HPU Graphs completely (switch to any other execution mode) to maximize KV Cache space allocation.
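
Put together, a sketch of those mitigations as server flags; the model name and values are illustrative:

```bash
# --gpu-memory-utilization: default 0.9; increasing it leaves more memory for the KV cache.
# --tensor-parallel-size: shards the model weights across multiple HPUs.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --gpu-memory-utilization 0.95 \
  --tensor-parallel-size 2
```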