
Commit e2d5b68

Update docs: Executing inference (#412)
Completed Executing inference section of Quickstart doc. Cherry-pick of: #410

Signed-off-by: Paweł Olejniczak <[email protected]>
1 parent 6c4ca1d commit e2d5b68


docs/getting_started/quickstart.md

Lines changed: 347 additions & 20 deletions
---
title: Quickstart
---

## vLLM Quick Start Guide

This guide shows how to quickly launch vLLM on Gaudi using a prebuilt Docker image with Docker Compose; this setup is supported on Ubuntu only. It supports model benchmarking, custom runtime parameters, and a selection of validated models, including Llama, Mistral, and Qwen. Advanced configuration is available via environment variables or YAML files.

## Requirements

To achieve the best performance on HPU, please follow the methods outlined in the
[Optimizing Training Platform Guide](https://docs.habana.ai/en/latest/PyTorch/Model_Optimization_PyTorch/Optimization_in_Training_Platform.html).

## Running vLLM on Gaudi with Docker Compose

Follow the steps below to run the vLLM server or launch benchmarks on Gaudi using Docker Compose.

### 1. Clone the vLLM fork repository and navigate to the appropriate directory

```bash
git clone https://github.com/vllm-project/vllm-gaudi.git
cd vllm-gaudi/.cd/
```

This ensures you have the required files and Docker Compose configurations.

### 2. Set the following environment variables

| **Variable** | **Description** |
| --- | --- |
| `MODEL` | Choose a model name from the [`vllm supported models`][supported-models] list. |
| `HF_TOKEN` | Your Hugging Face token (generate one at <https://huggingface.co>). |
| `DOCKER_IMAGE` | The Docker image name or URL for the vLLM Gaudi container. When using the Gaudi repository, make sure to select Docker images with the *vllm-installer* prefix in the file name. |

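For convenience, you can also export these variables in your shell before invoking Docker Compose. A minimal sketch, using placeholder values (the model, token, and image tag below are illustrative and should be replaced with your own):

```bash
# Placeholder values: substitute your own model, Hugging Face token, and image tag.
export MODEL="Qwen/Qwen2.5-14B-Instruct"
export HF_TOKEN="<your huggingface token>"
export DOCKER_IMAGE="vault.habana.ai/gaudi-docker/|Version|/ubuntu24.04/habanalabs/vllm-installer-|PT_VERSION|:latest"
```

With the variables exported, the `docker compose` commands below can be run without prefixing them each time.
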
### 3. Run the vLLM server using Docker Compose

```bash
MODEL="Qwen/Qwen2.5-14B-Instruct" \
HF_TOKEN="<your huggingface token>" \
DOCKER_IMAGE="vault.habana.ai/gaudi-docker/|Version|/ubuntu24.04/habanalabs/vllm-installer-|PT_VERSION|:latest" \
docker compose up
```

To automatically run benchmarking for a selected model using default settings, add the `--profile benchmark` option:

```bash
MODEL="Qwen/Qwen2.5-14B-Instruct" \
HF_TOKEN="<your huggingface token>" \
DOCKER_IMAGE="vault.habana.ai/gaudi-docker/|Version|/ubuntu24.04/habanalabs/vllm-installer-|PT_VERSION|:latest" \
docker compose --profile benchmark up
```

This command launches the vLLM server and runs the associated benchmark suite.

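Once the server reports it is ready, you can verify that it is reachable. A minimal check, assuming the container publishes the default vLLM port 8000 on the host:

```bash
# List the models served by the OpenAI-compatible endpoint; a JSON response
# containing the configured model indicates the server is up.
curl http://localhost:8000/v1/models
```
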
## Advanced Options

The following steps cover optional advanced configurations for running the vLLM server and benchmark. These allow you to fine-tune performance, memory usage, and request handling using additional environment variables or configuration files. For most users, the basic setup is sufficient, but advanced users may benefit from these customizations.

=== "Run vLLM Using Docker Compose with Custom Parameters"
66+
67+
To override default settings, you can provide additional environment variables when starting the server. This advanced method allows fine-tuning for performance and memory usage.
68+
69+
**Environment variables**
70+
71+
| **Variable** | **Description** |
72+
|---|---|
73+
| `PT_HPU_LAZY_MODE` | Enables Lazy execution mode, potentially improving performance by batching operations. |
74+
| `VLLM_SKIP_WARMUP` | Skips the model warmup phase to reduce startup time (may affect initial latency). |
75+
| `MAX_MODEL_LEN` | Sets the maximum supported sequence length for the model. |
76+
| `MAX_NUM_SEQS` | Specifies the maximum number of sequences processed concurrently. |
77+
| `TENSOR_PARALLEL_SIZE` | Defines the degree of tensor parallelism. |
78+
| `VLLM_EXPONENTIAL_BUCKETING` | Enables or disables exponential bucketing for warmup strategy. |
79+
| `VLLM_DECODE_BLOCK_BUCKET_STEP` | Configures the step size for decode block allocation, affecting memory granularity. |
80+
| `VLLM_DECODE_BS_BUCKET_STEP` | Sets the batch size step for decode operations, impacting how decode batches are grouped. |
81+
| `VLLM_PROMPT_BS_BUCKET_STEP` | Adjusts the batch size step for prompt processing. |
82+
| `VLLM_PROMPT_SEQ_BUCKET_STEP` | Controls the step size for prompt sequence allocation. |
83+
84+
**Example**
85+
86+
```bash
87+
MODEL="Qwen/Qwen2.5-14B-Instruct" \
88+
HF_TOKEN="<your huggingface token>" \
89+
DOCKER_IMAGE="vault.habana.ai/gaudi-docker/|Version|/ubuntu24.04/habanalabs/vllm-installer-|PT_VERSION|:latest" \
90+
TENSOR_PARALLEL_SIZE=1 \
91+
MAX_MODEL_LEN=2048 \
92+
docker compose up
93+
```
94+
95+
=== "Run vLLM and Benchmark with Custom Parameters"
96+
97+
You can customize benchmark behavior by setting additional environment variables before running Docker Compose.
98+
99+
**Benchmark parameters:**
100+
101+
| **Variable** | **Description** |
102+
|---|---|
103+
| `INPUT_TOK` | Number of input tokens per prompt. |
104+
| `OUTPUT_TOK` | Number of output tokens to generate per prompt. |
105+
| `CON_REQ` | Number of concurrent requests during benchmarking. |
106+
| `NUM_PROMPTS`| Total number of prompts to use in the benchmark. |
107+
108+
**Example:**
109+
110+
```bash
111+
MODEL="Qwen/Qwen2.5-14B-Instruct" \
112+
HF_TOKEN="<your huggingface token>" \
113+
DOCKER_IMAGE="vault.habana.ai/gaudi-docker/|Version|/ubuntu24.04/habanalabs/vllm-installer-|PT_VERSION|:latest" \
114+
INPUT_TOK=128 \
115+
OUTPUT_TOK=128 \
116+
CON_REQ=16 \
117+
NUM_PROMPTS=64 \
118+
docker compose --profile benchmark up
119+
```
120+
121+
This launches the vLLM server and runs the benchmark using your specified parameters.
122+
123+
=== "Run vLLM and Benchmark with Combined Custom Parameters"
124+
125+
You can launch the vLLM server and benchmark together, providing any combination of server and benchmark-specific parameters.
126+
127+
**Example:**
128+
129+
```bash
130+
MODEL="Qwen/Qwen2.5-14B-Instruct" \
131+
HF_TOKEN="<your huggingface token>" \
132+
DOCKER_IMAGE="vault.habana.ai/gaudi-docker/|Version|/ubuntu22.04/habanalabs/vllm-installer-|PT_VERSION|:latest" \
133+
TENSOR_PARALLEL_SIZE=1 \
134+
MAX_MODEL_LEN=2048 \
135+
INPUT_TOK=128 \
136+
OUTPUT_TOK=128 \
137+
CON_REQ=16 \
138+
NUM_PROMPTS=64 \
139+
docker compose --profile benchmark up
140+
```
141+
142+
This command starts the server and executes benchmarking with the provided configuration.
143+
144+
=== "Run vLLM and Benchmark Using Configuration Files"
145+
146+
You can also configure the server and benchmark via YAML configuration files. Set the following environment variables:
147+
148+
| **Variable** | **Description** |
149+
|---|---|
150+
| `VLLM_SERVER_CONFIG_FILE` | Path to the server config file inside the Docker container. |
151+
| `VLLM_SERVER_CONFIG_NAME` | Name of the server config section. |
152+
| `VLLM_BENCHMARK_CONFIG_FILE` | Path to the benchmark config file inside the container. |
153+
| `VLLM_BENCHMARK_CONFIG_NAME` | Name of the benchmark config section. |
154+
155+
**Example**
156+
157+
```bash
158+
HF_TOKEN=<your huggingface token> \
159+
VLLM_SERVER_CONFIG_FILE=server_configurations/server_text.yaml \
160+
VLLM_SERVER_CONFIG_NAME=llama31_8b_instruct \
161+
VLLM_BENCHMARK_CONFIG_FILE=benchmark_configurations/benchmark_text.yaml \
162+
VLLM_BENCHMARK_CONFIG_NAME=llama31_8b_instruct \
163+
docker compose --profile benchmark up
164+
```
165+
166+
!!! note
167+
When using configuration files, you do not need to set the `MODEL` variable as the model details are included in the config files. However, the `HF_TOKEN` flag is still required.
168+
169+
=== "Run vLLM Directly Using Docker"
170+
171+
For maximum control, you can run the server directly using the `docker run` command, allowing full customization of Docker runtime settings.
172+
173+
**Example:**
174+
175+
```bash
176+
docker run -it --rm \
177+
-e MODEL=$MODEL \
178+
-e HF_TOKEN=$HF_TOKEN \
179+
-e http_proxy=$http_proxy \
180+
-e https_proxy=$https_proxy \
181+
-e no_proxy=$no_proxy \
182+
--cap-add=sys_nice \
183+
--ipc=host \
184+
--runtime=habana \
185+
-e HABANA_VISIBLE_DEVICES=all \
186+
-p 8000:8000 \
187+
--name vllm-server \
188+
<docker image name>
189+
```
190+
191+
This method provides full flexibility over how the vLLM server is executed within the container.
192+
193+
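Because the container is started with `--name vllm-server`, standard Docker commands can be used to monitor and manage it from another terminal, for example:

```bash
# Follow the server logs to watch startup, warmup, and incoming requests.
docker logs -f vllm-server

# Stop the server container when you are done.
docker stop vllm-server
```
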
---

## Supported Models

| **Model Name** | **Validated TP Size** |
|---|---|
| deepseek-ai/DeepSeek-R1-Distill-Llama-70B | 8 |
| meta-llama/Llama-3.1-70B-Instruct | 4 |
| meta-llama/Llama-3.1-405B-Instruct | 8 |
| meta-llama/Llama-3.1-8B-Instruct | 1 |
| meta-llama/Llama-3.3-70B-Instruct | 4 |
| mistralai/Mistral-7B-Instruct-v0.2 | 1 |
| mistralai/Mixtral-8x7B-Instruct-v0.1 | 2 |
| mistralai/Mixtral-8x22B-Instruct-v0.1 | 4 |
| Qwen/Qwen2.5-7B-Instruct | 1 |
| Qwen/Qwen2.5-VL-7B-Instruct | 1 |
| Qwen/Qwen2.5-14B-Instruct | 1 |
| Qwen/Qwen2.5-32B-Instruct | 1 |
| Qwen/Qwen2.5-72B-Instruct | 4 |
| ibm-granite/granite-8b-code-instruct-4k | 1 |
| ibm-granite/granite-20b-code-instruct-8k | 1 |

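For multi-card models, set `TENSOR_PARALLEL_SIZE` to the validated TP size from the table above. A sketch reusing the Docker Compose setup from this guide (the image tag is a placeholder):

```bash
# Llama-3.1-70B-Instruct is validated with tensor parallelism across 4 cards.
MODEL="meta-llama/Llama-3.1-70B-Instruct" \
HF_TOKEN="<your huggingface token>" \
DOCKER_IMAGE="vault.habana.ai/gaudi-docker/|Version|/ubuntu24.04/habanalabs/vllm-installer-|PT_VERSION|:latest" \
TENSOR_PARALLEL_SIZE=4 \
docker compose up
```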

## Executing inference

=== "Offline Batched Inference"

[](){ #quickstart-offline }

Offline inference processes multiple prompts in a batch without needing a running server. This is ideal for batch jobs and testing.

```python
from vllm import LLM, SamplingParams

def main():
    prompts = [
        "Hello, my name is",
        "The future of AI is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

    outputs = llm.generate(prompts, sampling_params)

    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

if __name__ == "__main__":
    main()
```

=== "Online Inference"

[](){ #quickstart-online }

Online inference provides real-time text generation through a running vLLM server.
First, start the server:

```bash
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --port 8000
```

Then query it from Python:

```python
import requests

def main():
    url = "http://localhost:8000/v1/completions"
    headers = {"Content-Type": "application/json"}

    payload = {
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "prompt": "The future of AI is",
        "max_tokens": 50,
        "temperature": 0.8
    }

    response = requests.post(url, headers=headers, json=payload)
    result = response.json()
    print(result["choices"][0]["text"])

if __name__ == "__main__":
    main()
```

=== "OpenAI Completions API"
286+
287+
[](){ #quickstart-oopenai-completions-api }
288+
289+
vLLM provides an OpenAI-compatible completions API.
290+
Start the server:
291+
292+
```bash
293+
python -m vllm.entrypoints.openai.api_server \
294+
--model meta-llama/Llama-3.1-8B-Instruct \
295+
--host 0.0.0.0 \
296+
--port 8000
297+
```
298+
299+
Use the OpenAI Python client:
300+
301+
```python
302+
from openai import OpenAI
303+
304+
def main():
305+
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")
306+
307+
result = client.completions.create(
308+
model="meta-llama/Llama-3.1-8B-Instruct",
309+
prompt="Explain quantum computing in simple terms:",
310+
max_tokens=100,
311+
temperature=0.7
312+
)
313+
print(result.choices[0].text)
314+
315+
if __name__ == "__main__":
316+
main()
317+
```
318+
319+
Or use curl:
320+
321+
```bash
322+
curl http://localhost:8000/v1/completions \
323+
-H "Content-Type: application/json" \
324+
-d '{
325+
"model": "meta-llama/Llama-3.1-8B-Instruct",
326+
"prompt": "Explain quantum computing in simple terms:",
327+
"max_tokens": 100,
328+
"temperature": 0.7
329+
}'
330+
```
53331

=== "OpenAI Chat Completions API with vLLM"

[](){ #quickstart-openai-chat-completions-api }

vLLM also supports the OpenAI chat completions API format.
Start the server:

```bash
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --port 8000
```

Use the OpenAI Python client:

```python
from openai import OpenAI

def main():
    client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

    chat = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is the capital of France?"}
        ],
        max_tokens=50,
        temperature=0.7
    )
    print(chat.choices[0].message.content)

if __name__ == "__main__":
    main()
```

Or use curl:

```bash
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is the capital of France?"}
        ],
        "max_tokens": 50,
        "temperature": 0.7
    }'
```
