This ensures you have the required files and Docker Compose configurations.

### 2. Set the following environment variables

| **Variable** | **Description** |
| --- | --- |
| `MODEL` | Choose a model name from the [`vllm supported models`][supported-models] list. |
| `HF_TOKEN` | Your Hugging Face token (generate one at <https://huggingface.co>). |
| `DOCKER_IMAGE` | The Docker image name or URL for the vLLM Gaudi container. When using the Gaudi repository, make sure to select Docker images with the *vllm-installer* prefix in the file name. |
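
For example, a minimal shell setup might look like the following; the model shown is one entry from the Supported Models table below, while the token and image values are placeholders you must replace with your own:

```bash
# Example values only; substitute your own token and image.
export MODEL=meta-llama/Llama-3.1-8B-Instruct   # any model from the Supported Models table
export HF_TOKEN=<your_hugging_face_token>       # generated at https://huggingface.co
export DOCKER_IMAGE=<vllm-gaudi-image>          # image name or URL (vllm-installer prefix)
```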
This command launches the vLLM server and runs the associated benchmark suite.
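
The launch command referenced here is part of the repository's Docker Compose workflow. As a rough sketch (the compose file name and location are assumptions), it is typically invoked from the directory containing the compose file:

```bash
# Assumed invocation; run from the directory that holds the repository's compose file.
docker compose up
```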

## Advanced Options

The following steps cover optional advanced configurations for running the vLLM server and benchmark. These allow you to fine-tune performance, memory usage, and request handling using additional environment variables or configuration files. For most users, the basic setup is sufficient, but advanced users may benefit from these customizations.

=== "Run vLLM Using Docker Compose with Custom Parameters"

    To override default settings, you can provide additional environment variables when starting the server. This advanced method allows fine-tuning for performance and memory usage.
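
    As a rough sketch, an override can be passed in the environment of the compose command. The variable names below are illustrative only; check the repository's compose file for the names it actually reads:

    ```bash
    # Illustrative variable names; the compose file defines which ones are honored.
    TENSOR_PARALLEL_SIZE=8 MAX_MODEL_LEN=4096 docker compose up
    ```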

    When using configuration files, you do not need to set the `MODEL` variable, as the model details are included in the config files. However, the `HF_TOKEN` variable is still required.

=== "Run vLLM Directly Using Docker"

    For maximum control, you can run the server directly using the `docker run` command, allowing full customization of Docker runtime settings.

    **Example:**

    ```bash
    docker run -it --rm \
        -e MODEL=$MODEL \
        -e HF_TOKEN=$HF_TOKEN \
        -e http_proxy=$http_proxy \
        -e https_proxy=$https_proxy \
        -e no_proxy=$no_proxy \
        --cap-add=sys_nice \
        --ipc=host \
        --runtime=habana \
        -e HABANA_VISIBLE_DEVICES=all \
        -p 8000:8000 \
        --name vllm-server \
        <docker image name>
    ```

    This method provides full flexibility over how the vLLM server is executed within the container.
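
    Once the server reports it is ready, you can send a quick test request from the host. This assumes the container's entrypoint starts vLLM's OpenAI-compatible API server on the port mapped above:

    ```bash
    # Smoke test against the OpenAI-compatible completions endpoint on port 8000.
    curl http://localhost:8000/v1/completions \
        -H "Content-Type: application/json" \
        -d '{
              "model": "'"$MODEL"'",
              "prompt": "Hello, my name is",
              "max_tokens": 32
            }'
    ```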

---

## Supported Models

| **Model Name** | **Validated TP Size** |
| --- | --- |
| deepseek-ai/DeepSeek-R1-Distill-Llama-70B | 8 |
| meta-llama/Llama-3.1-70B-Instruct | 4 |
| meta-llama/Llama-3.1-405B-Instruct | 8 |
| meta-llama/Llama-3.1-8B-Instruct | 1 |
| meta-llama/Llama-3.3-70B-Instruct | 4 |
| mistralai/Mistral-7B-Instruct-v0.2 | 1 |
| mistralai/Mixtral-8x7B-Instruct-v0.1 | 2 |
| mistralai/Mixtral-8x22B-Instruct-v0.1 | 4 |
| Qwen/Qwen2.5-7B-Instruct | 1 |
| Qwen/Qwen2.5-VL-7B-Instruct | 1 |
| Qwen/Qwen2.5-14B-Instruct | 1 |
| Qwen/Qwen2.5-32B-Instruct | 1 |
| Qwen/Qwen2.5-72B-Instruct | 4 |
| ibm-granite/granite-8b-code-instruct-4k | 1 |
| ibm-granite/granite-20b-code-instruct-8k | 1 |
## Executing inference

=== "Offline Batched Inference"

    [](){ #quickstart-offline }

    Offline inference processes multiple prompts in a batch without needing a running server. This is ideal for batch jobs and testing.