Commit fbc7682

Merge branch 'main' into opt/adp_schedule_optimize
2 parents b449205 + 164acfa commit fbc7682

16 files changed, +549 -139 lines changed

docs/source/blogs/tech_blog/blog4_Scaling_Expert_Parallelism_in_TensorRT-LLM.md

Lines changed: 1 addition & 1 deletion
@@ -503,7 +503,7 @@ Let's use some representative workloads to illustrate the performance impact wit
 </div>
 <p align="center"><sub><em>Figure 24: EP impact over MoE Group GEMM and EP communication</em></sub></p>
 In Figure 24, it can be observed that by increasing the EP size from 4 to 72, the MoE Group GEMM computation time gets reduced, while the EP communication time (for EP4/EP8 Reduce/Scatter is used, while for EP>8 All2All is used) stays almost constant.
-When the EP size increases from 18 to 32, the speed-up diminishes. We are working on optimizing it.
+When the EP size increases from 18 to 72, the speed-up diminishes. We are working on optimizing it.

 Next, let's use some representative workloads to understand the performance impact with EPLB.
 <div align="center">
Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
trtllm-serve
=======================

.. toctree::
   :maxdepth: 1

   trtllm-serve
   run-benchmark-with-trtllm-serve
Lines changed: 222 additions & 0 deletions
@@ -0,0 +1,222 @@
# Run benchmarking with `trtllm-serve`

TensorRT-LLM provides an OpenAI-compatible API via the `trtllm-serve` command.
A complete reference for the API is available in the [OpenAI API Reference](https://platform.openai.com/docs/api-reference).

This step-by-step tutorial covers the following topics for running online serving benchmarking with Llama 3.1 70B:
* Methodology introduction
* Launch the OpenAI-compatible server with the NGC container
* Run the performance benchmark
* Use `extra_llm_api_options`
## Methodology Introduction

The overall performance benchmarking involves:
1. Launch the OpenAI-compatible service with `trtllm-serve`
2. Run the benchmark with [benchmark_serving.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/serve/scripts/benchmark_serving.py)

## Launch the NGC container

TensorRT-LLM distributes pre-built containers on the [NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags).

You can launch the container using the following command:

```bash
docker run --rm --ipc host -p 8000:8000 --gpus all -it nvcr.io/nvidia/tensorrt-llm/release
```
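The command above does not mount any host directories. If your model checkpoint lives on the host, as the `start.sh` script below assumes, a variant that also mounts the checkpoint directory might look like the following sketch; the host path is a placeholder, so adjust it to your setup.

```bash
# Sketch only: mounts a hypothetical host checkpoint directory into the container
# so that /path/to/llama3.1_70B (used by start.sh below) resolves inside it.
docker run --rm --ipc host -p 8000:8000 --gpus all -it \
  -v /path/to/llama3.1_70B:/path/to/llama3.1_70B:ro \
  nvcr.io/nvidia/tensorrt-llm/release
```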
## Start the trtllm-serve service

> [!WARNING]
> The commands and configurations presented in this document are for illustrative purposes only.
> They serve as examples and may not deliver the optimal performance for your specific use case.
> Users are encouraged to tune the parameters based on their hardware and workload.

For benchmarking purposes, first create a bash script using the following code and name it `start.sh`.

```bash
#! /bin/bash
model_path=/path/to/llama3.1_70B
extra_llm_api_file=/tmp/extra-llm-api-config.yml

cat << EOF > ${extra_llm_api_file}
enable_attention_dp: false
print_iter_log: true
cuda_graph_config:
  enable_padding: true
  max_batch_size: 1024
kv_cache_config:
  dtype: fp8
EOF

trtllm-serve ${model_path} \
    --max_batch_size 1024 \
    --max_num_tokens 2048 \
    --max_seq_len 1024 \
    --kv_cache_free_gpu_memory_fraction 0.9 \
    --tp_size 1 \
    --ep_size 1 \
    --trust_remote_code \
    --extra_llm_api_options ${extra_llm_api_file}
```
> [!NOTE]
> `trtllm-llmapi-launch` is a script that launches the LLM-API code on
> Slurm-like systems and supports multi-node and multi-GPU setups,
> e.g., `trtllm-llmapi-launch trtllm-serve ...`.

Run the `start.sh` script in the **background** with the following command:

```bash
bash -x start.sh &
```

Once the server is up, it produces output like the following:

```bash
INFO: Started server process [80833]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://localhost:8000 (Press CTRL+C to quit)
```
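Before launching the benchmark, you can optionally verify that the endpoint responds. The request below is only a sketch: it assumes the server accepts the served model path as the `model` name and exposes the standard OpenAI `/v1/completions` route; adjust both to your setup.

```bash
# Optional sanity check against the OpenAI-compatible endpoint (model name is a placeholder).
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "/path/to/llama3.1_70B", "prompt": "Hello, my name is", "max_tokens": 16}'
```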
## Run the benchmark

Similar to starting `trtllm-serve`, create a script to execute the benchmark using the following code and name it `bench.sh`.

```bash
concurrency_list="1 2 4 8 16 32 64 128 256"
multi_round=5
isl=1024
osl=1024
result_dir=/tmp/llama3.1_output
model_path=/path/to/llama3.1_70B

for concurrency in ${concurrency_list}; do
    num_prompts=$((concurrency * multi_round))
    python -m tensorrt_llm.serve.scripts.benchmark_serving \
        --model ${model_path} \
        --backend openai \
        --dataset-name "random" \
        --random-input-len ${isl} \
        --random-output-len ${osl} \
        --random-prefix-len 0 \
        --num-prompts ${num_prompts} \
        --max-concurrency ${concurrency} \
        --ignore-eos \
        --save-result \
        --result-dir "${result_dir}" \
        --result-filename "concurrency_${concurrency}.json" \
        --percentile-metrics "ttft,tpot,itl,e2el"
done
```

Then run the benchmark using the command below:

```bash
bash -x bench.sh &> output_bench.log
```
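Since the benchmark output is redirected to a log file, you can follow its progress and, once it finishes, inspect the per-concurrency result files written by `--save-result`; the paths below match the scripts above.

```bash
# Follow the benchmark log while it runs.
tail -f output_bench.log

# After completion, list the per-concurrency JSON results.
ls -lh /tmp/llama3.1_output/concurrency_*.json
```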
Below is some example TensorRT-LLM serving benchmark output. Your actual results may vary.

```
============ Serving Benchmark Result ============
Successful requests:                     1
Benchmark duration (s):                  1.64
Total input tokens:                      1024
Total generated tokens:                  1024
Request throughput (req/s):              0.61
Output token throughput (tok/s):         622.56
Total Token throughput (tok/s):          1245.12
User throughput (tok/s):                 623.08
Mean Request AR:                         0.9980
Median Request AR:                       0.9980
---------------Time to First Token----------------
Mean TTFT (ms):                          12.83
Median TTFT (ms):                        12.83
P99 TTFT (ms):                           12.83
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          1.59
Median TPOT (ms):                        1.59
P99 TPOT (ms):                           1.59
---------------Inter-token Latency----------------
Mean ITL (ms):                           1.59
Median ITL (ms):                         1.59
P99 ITL (ms):                            1.77
----------------End-to-end Latency----------------
Mean E2EL (ms):                          1643.44
Median E2EL (ms):                        1643.44
P99 E2EL (ms):                           1643.44
==================================================
```
### Key Metrics

* Median Time to First Token (TTFT)
  * The typical time elapsed from when a request is sent until the first output token is generated.
* Median Time Per Output Token (TPOT)
  * The typical time required to generate each token *after* the first one.
* Median Inter-Token Latency (ITL)
  * The typical time delay between the completion of one token and the completion of the next.
* Median End-to-End Latency (E2EL)
  * The typical total time from when a request is submitted until the final token of the response is received.
* Total Token Throughput
  * The combined rate at which the system processes both input (prompt) tokens and output (generated) tokens.
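As a rough consistency check (an approximation, not a formula reported by the tool), single-request end-to-end latency should be close to TTFT plus (output tokens - 1) times TPOT. Applying it to the example output above:

```bash
# Approximate E2EL from TTFT and TPOT for the single-request example above:
# 12.83 ms + (1024 - 1) * 1.59 ms ~= 1639.40 ms, close to the reported 1643.44 ms.
awk 'BEGIN { printf "%.2f ms\n", 12.83 + (1024 - 1) * 1.59 }'
```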
## About `extra_llm_api_options`

`trtllm-serve` provides the `extra_llm_api_options` knob to **overwrite** the parameters specified by `trtllm-serve`.
Generally, we create a YAML file that contains various performance switches, for example:

```yaml
cuda_graph_config:
  enable_padding: true
print_iter_log: true
kv_cache_config:
  dtype: fp8
enable_attention_dp: true
```
The following is a list of common performance switches.

#### `kv_cache_config`

&emsp;**Description**: A section for configuring the Key-Value (KV) cache.

&emsp;**Options**:

&emsp;&emsp;`dtype`: Sets the data type for the KV cache.

&emsp;&emsp;**Default**: `auto` (uses the data type specified in the model checkpoint).

#### `cuda_graph_config`

&emsp;**Description**: A section for configuring CUDA graphs to optimize performance.

&emsp;**Options**:

&emsp;&emsp;`enable_padding`: If true, input batches are padded to the nearest `cuda_graph_batch_size`. This can significantly improve performance.

&emsp;&emsp;**Default**: `false`

&emsp;&emsp;`max_batch_size`: Sets the maximum batch size for which a CUDA graph will be created.

&emsp;&emsp;**Default**: `0`

&emsp;&emsp;**Recommendation**: Set this to the same value as the `--max_batch_size` command-line option.

&emsp;&emsp;`batch_sizes`: A specific list of batch sizes to create CUDA graphs for.

&emsp;&emsp;**Default**: `None`

#### `moe_config`

&emsp;**Description**: Configuration for Mixture-of-Experts (MoE) models.

&emsp;**Options**:

&emsp;&emsp;`backend`: The backend to use for MoE operations.

&emsp;&emsp;**Default**: `CUTLASS`

#### `attention_backend`

&emsp;**Description**: The backend to use for attention calculations.

&emsp;**Default**: `TRTLLM`
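Putting the switches above together, an `extra-llm-api-config.yml` might look like the following sketch. The values are illustrative, taken from the defaults and examples on this page rather than tuned recommendations.

```yaml
# Illustrative combination of the options documented above; tune per workload.
kv_cache_config:
  dtype: fp8            # default: auto (follows the model checkpoint)
cuda_graph_config:
  enable_padding: true  # default: false
  max_batch_size: 1024  # match the --max_batch_size command-line option
moe_config:
  backend: CUTLASS      # default MoE backend
attention_backend: TRTLLM
print_iter_log: true
```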
See the [TorchLlmArgs class](https://nvidia.github.io/TensorRT-LLM/llm-api/reference.html#tensorrt_llm.llmapi.TorchLlmArgs) for the full list of options that can be used in `extra_llm_api_options`.

docs/source/commands/trtllm-serve.rst renamed to docs/source/commands/trtllm-serve/trtllm-serve.rst

Lines changed: 3 additions & 20 deletions
@@ -175,26 +175,6 @@ TRT-LLM multimodal supports the following modalities and data types (depending o
 ]}

-Benchmark
----------
-
-You can use any benchmark clients compatible with OpenAI API to test serving performance of ``trtllm_serve``, we recommend ``genai-perf`` and here is a benchmarking recipe.
-
-First, install ``genai-perf`` with ``pip``:
-
-.. code-block:: bash
-
-   pip install genai-perf
-
-Then, :ref:`start a server<Starting a Server>` with ``trtllm-serve`` and ``TinyLlama-1.1B-Chat-v1.0``.
-
-Finally, test performance with the following command:
-
-.. literalinclude:: ../../../examples/serve/genai_perf_client.sh
-   :language: bash
-   :linenos:
-
-Refer to `README <https://github.com/triton-inference-server/perf_analyzer/blob/main/genai-perf/README.md>`_ of ``genai-perf`` for more guidance.

 Multi-node Serving with Slurm
 -----------------------------
@@ -278,3 +258,6 @@ Syntax
 .. click:: tensorrt_llm.commands.serve:main
    :prog: trtllm-serve
    :nested: full
+
+Besides the above examples, ``trtllm-serve`` is also used as an entrypoint for performance benchmarking.
+Please refer to `Performance Benchmarking with trtllm-serve <https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/commands/trtllm-serve/trtllm-serve-bench.md>`_ for more details.

docs/source/index.rst

Lines changed: 2 additions & 2 deletions
@@ -75,11 +75,11 @@ Welcome to TensorRT-LLM's Documentation!
 .. toctree::
    :maxdepth: 2
    :caption: Command-Line Reference
-   :hidden:
+   :name: Command-Line Reference

    commands/trtllm-bench
    commands/trtllm-build
-   commands/trtllm-serve
+   commands/trtllm-serve/index


 .. toctree::

docs/source/installation/linux.md

Lines changed: 10 additions & 15 deletions
@@ -9,14 +9,22 @@
 Before the pre-built Python wheel can be installed via `pip`, a few
 prerequisites must be put into place:

+Install CUDA Toolkit following the [CUDA Installation Guide for Linux](https://docs.nvidia.com/cuda/cuda-installation-guide-linux/) and
+make sure `CUDA_HOME` environment variable is properly set.
+
 ```bash
-# Optional step: Only required for Blackwell and Grace Hopper
+# Optional step: Only required for NVIDIA Blackwell GPUs and SBSA platform
 pip3 install torch==2.7.1 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128

+# Optional step: Workaround for deep_gemm installation failure on SBSA platform
+# The actual deep_gemm package and version should be obtained from the requirements.txt file.
+pip3 install 'deep_gemm @ git+https://github.com/zongfeijing/DeepGEMM.git@a9d538ef4dff0326fe521c6ca0bfde115703b56a' \
+    --extra-index-url https://download.pytorch.org/whl/cu128
+
 sudo apt-get -y install libopenmpi-dev
 ```

-PyTorch CUDA 12.8 package is required for supporting NVIDIA Blackwell and Grace Hopper GPUs. On prior GPUs, this extra installation is not required.
+PyTorch CUDA 12.8 package is required for supporting NVIDIA Blackwell GPUs and SBSA platform. On prior GPUs or Linux x86_64 platform, this extra installation is not required.

 ```{tip}
 Instead of manually installing the preqrequisites as described
@@ -55,16 +63,3 @@ There are some known limitations when you pip install pre-built TensorRT-LLM whe
 when OMPI was not configured --with-slurm and we weren't able
 to discover a SLURM installation in the usual places.
 ```
-
-2. CUDA Toolkit
-
-`pip install tensorrt-llm` won't install CUDA toolkit in your system, and the CUDA Toolkit is not required if want to just deploy a TensorRT-LLM engine.
-TensorRT-LLM uses the [ModelOpt](https://nvidia.github.io/TensorRT-Model-Optimizer/) to quantize a model, while the ModelOpt requires CUDA toolkit to jit compile certain kernels which is not included in the pytorch to do quantization effectively.
-Please install CUDA toolkit when you see the following message when running ModelOpt quantization.
-
-```
-/usr/local/lib/python3.10/dist-packages/modelopt/torch/utils/cpp_extension.py:65:
-UserWarning: CUDA_HOME environment variable is not set. Please set it to your CUDA install root.
-Unable to load extension modelopt_cuda_ext and falling back to CPU version.
-```
-The installation of CUDA toolkit can be found in [CUDA Toolkit Documentation](https://docs.nvidia.com/cuda/).
