Commit 7569eb0

zbpateldc3671 authored and committed

[doc] Update perf_overview.md for release 0.21 (NVIDIA#6270)

Signed-off-by: zpatel <[email protected]>

1 parent: 206841e

1 file changed: +104 −85 lines

docs/source/performance/perf-overview.md (104 additions, 85 deletions)
@@ -12,6 +12,8 @@ Tuning batch sizes, parallelism configurations, and other options may lead to im

 For DeepSeek R1 performance, please check out our [performance guide](../blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.md)

+For more information on benchmarking with `trtllm-bench` see this NVIDIA [blog post](https://developer.nvidia.com/blog/llm-inference-benchmarking-performance-tuning-with-tensorrt-llm/).
+
 ## Throughput Measurements

 The below table shows performance data where a local inference client is fed requests at an infinite rate (no delay between messages),
@@ -21,50 +23,64 @@ The performance numbers below were collected using the steps described in this d

 Testing was performed on models with weights quantized using [ModelOpt](https://nvidia.github.io/TensorRT-Model-Optimizer/#) and published by NVIDIA on the [Model Optimizer HuggingFace Collection](https://huggingface.co/collections/nvidia/model-optimizer-66aa84f7966b3150262481a4).

-### FP4 Models:
-```
+### Hardware
+The following GPU variants were used for testing:
+- H100 SXM 80GB (DGX H100)
+- H200 SXM 141GB (DGX H200)
+- GH200 96GB HBM3 (480GB LPDDR5X)
+- B200 180GB (DGX B200)
+- GB200 192GB (GB200 NVL72)
+
+Other hardware variants may have different TDP, memory bandwidth, core count, or other features leading to performance differences on these workloads.
+
+### FP4 Models
+
+```text
 nvidia/Llama-3.3-70B-Instruct-FP4
 nvidia/Llama-3.1-405B-Instruct-FP4
 ```

 #### Llama 3.3 70B FP4

-| | GPU | B200 | | | |
-|:------------------------|:--------|:----------|:----------|:----------|:----------|
-| | TP Size | 1 | 2 | 4 | 8 |
-| ISL, OSL | | | | | |
-| | | | | | |
-| 128, 128 | | 10,994.48 | 17,542.11 | 24,667.31 | 27,272.27 |
-| 128, 2048 | | 9,580.46 | 15,432.35 | 23,568.12 | 31,174.31 |
-| 128, 4096 | | 6,418.39 | 9,841.53 | 17,808.76 | 25,229.25 |
-| 500, 2000 | | 7,343.32 | 11,850.57 | 20,709.67 | 28,038.78 |
-| 1000, 1000 | | 6,752.53 | 10,815.88 | 16,413.04 | 20,060.66 |
-| 1000, 2000 | | 6,670.07 | 9,830.73 | 15,597.49 | 20,672.37 |
-| 1024, 2048 | | 6,636.75 | 9,807.13 | 15,519.23 | 20,617.28 |
-| 2048, 128 | | 1,342.17 | 1,989.41 | 3,033.14 | 4,035.64 |
-| 5000, 500 | | 1,429.67 | 2,419.67 | 3,686.84 | 5,182.96 |
-| 20000, 2000 | | 629.77 | 1,177.01 | 2,120.66 | 3,429.03 |
+| | GPU: | B200 | GB200 |
+|:-----------------------------|:---|:----------|:--------------|
+| | TP Size | 1 | 1 |
+| ISL, OSL | | | |
+| | | | |
+| 128, 128 | | 10,613.84 | 11,100.97 |
+| 128, 2048 | | 9,445.51 | 10,276.05 |
+| 128, 4096 | | 6,276.85 | 7,351.12 |
+| 500, 2000 | | 6,983.27 | 8,194.30 |
+| 1000, 1000 | | 6,434.29 | 7,401.80 |
+| 1000, 2000 | | 6,725.03 | 6,478.72 |
+| 1024, 2048 | | 6,546.61 | 7,922.88 |
+| 2048, 128 | | 1,330.35 | 1,418.47 |
+| 2048, 2048 | | 4,528.48 | 5,326.77 |
+| 5000, 500 | | 1,427.44 | 1,502.44 |
+| 20000, 2000 | | 636.36 | 732.43 |

 #### Llama 3.1 405B FP4

-| | GPU | B200 | |
-|:------------------------|:------- |:---------|:----------|
-| | TP Size | 4 | 8 |
-| ISL, OSL | | | |
-| | | | |
-| 128, 128 | | 6,163.81 | 9,002.90 |
-| 128, 2048 | | 7,081.21 | 10,288.28 |
-| 128, 4096 | | 6,028.37 | 8,713.77 |
-| 500, 2000 | | 5,858.75 | 9,125.86 |
-| 1000, 1000 | | 4,848.00 | 7,582.97 |
-| 1000, 2000 | | 5,375.25 | 7,626.28 |
-| 1024, 2048 | | 5,345.70 | 7,464.03 |
-| 2048, 128 | | 693.55 | 1,086.56 |
-| 5000, 500 | | 947.49 | 1,532.45 |
-| 20000, 2000 | | 641.11 | 1,097.84 |
-
-### FP8 Models:
-```
+| | GPU: | B200 | GB200 |
+|:-----------------------------|:---|:---------|:--------------|
+| | TP Size | 4 | 4 |
+| ISL, OSL | | | |
+| | | | |
+| 128, 128 | | 6,218.89 | 6,598.97 |
+| 128, 2048 | | 7,178.10 | 7,497.40 |
+| 128, 4096 | | 5,890.89 | 5,898.19 |
+| 500, 2000 | | 5,844.37 | 6,198.33 |
+| 1000, 1000 | | 4,958.53 | 5,243.35 |
+| 1000, 2000 | | 4,874.16 | 4,905.51 |
+| 1024, 2048 | | 4,833.19 | 4,686.38 |
+| 2048, 128 | | 737.95 | 761.58 |
+| 2048, 2048 | | 4,024.02 | 4,326.56 |
+| 5000, 500 | | 1,032.40 | 1,078.87 |
+| 20000, 2000 | | 667.39 | 649.95 |
+
+### FP8 Models
+
+```text
 nvidia/Llama-3.1-8B-Instruct-FP8
 nvidia/Llama-3.3-70B-Instruct-FP8
 nvidia/Llama-3.1-405B-Instruct-FP8
@@ -73,61 +89,65 @@ nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8

 #### Llama 3.1 8B FP8

-| | GPU | H200 141GB HBM3 | H100 80GB HBM3 |
-|:-----------------------------|:---|:------------------|:-----------------|
-| | TP Size | 1 | 1 |
-| ISL, OSL | | | |
-| | | | |
-| 128, 128 | | 27,970.14 | 27,688.36 |
-| 128, 2048 | | 23,326.38 | 21,841.15 |
-| 128, 4096 | | 17,508.51 | 13,730.89 |
-| 500, 2000 | | 21,390.41 | 17,833.34 |
-| 1000, 1000 | | 17,366.89 | 15,270.62 |
-| 1000, 2000 | | 16,831.31 | 13,798.08 |
-| 1024, 2048 | | 16,737.03 | 13,385.50 |
-| 2048, 128 | | 3,488.03 | 3,414.67 |
-| 5000, 500 | | 3,813.69 | 3,394.54 |
-| 20000, 2000 | | 1,696.66 | 1,345.42 |
+| | GPU: | GH200 | H100 | H200 |
+|:-----------------------------|:---|:--------------|:-----------------|:------------------|
+| | TP Size | 1 | 1 | 1 |
+| ISL, OSL | | | | |
+| | | | | |
+| 128, 128 | | 27,304.25 | 26,401.48 | 27,027.80 |
+| 128, 2048 | | 24,045.60 | 21,413.21 | 23,102.25 |
+| 128, 4096 | | 15,409.85 | 13,541.54 | 17,396.83 |
+| 500, 2000 | | 20,123.88 | 17,571.01 | 19,759.16 |
+| 1000, 1000 | | 16,352.99 | 14,991.62 | 17,162.49 |
+| 1000, 2000 | | 15,705.82 | 13,505.23 | 16,227.11 |
+| 1024, 2048 | | 16,102.52 | 13,165.91 | 16,057.66 |
+| 2048, 128 | | 3,573.85 | 3,275.55 | 3,390.69 |
+| 2048, 2048 | | 10,767.05 | 9,462.43 | 11,822.14 |
+| 5000, 500 | | 3,584.74 | 3,276.47 | 3,758.08 |
+| 20000, 2000 | | 1,393.31 | 1,340.69 | 1,705.68 |

 #### Llama 3.3 70B FP8

-| | GPU | H200 141GB HBM3 | | | | H100 80GB HBM3 | | | |
-|:-----------------------------|:---|:------------------|:---------|:----------|:----------|:-----------------|:---------|:----------|:----------|
-| | TP Size | 1 | 2 | 4 | 8 | 1 | 2 | 4 | 8 |
-| ISL, OSL | | | | | | | | | |
-| | | | | | | | | | |
-| 128, 128 | | 3,605.47 | 6,427.69 | 10,407.42 | 15,434.37 | 3,128.33 | 6,216.91 | | |
-| 128, 2048 | | 4,315.80 | 8,464.03 | 13,508.59 | 20,759.72 | 756.42 | 5,782.57 | 11,464.94 | 17,424.32 |
-| 128, 4096 | | 2,701.17 | 5,573.55 | 11,458.56 | 16,668.75 | | 3,868.37 | 8,206.39 | 12,624.61 |
-| 500, 2000 | | 3,478.76 | 6,740.06 | 12,200.18 | | | 4,684.06 | 9,903.53 | 14,553.93 |
-| 1000, 1000 | | 2,744.32 | 5,119.72 | 8,685.44 | 12,744.51 | 742.14 | 4,247.19 | 7,435.65 | 11,018.81 |
-| 1000, 2000 | | 2,896.44 | 5,847.26 | 9,031.21 | 13,141.17 | 533.74 | 3,866.53 | 7,611.12 | 11,139.22 |
-| 1024, 2048 | | 2,874.18 | 5,568.61 | 8,946.71 | 13,082.62 | 530.16 | 3,796.68 | 7,575.24 | 11,004.31 |
-| 2048, 128 | | 435.90 | 772.67 | 1,264.76 | | | 736.89 | 1,213.33 | 1,839.22 |
-| 2048, 2048 | | | | | 10,412.85 | | | | |
-| 5000, 500 | | 545.96 | 997.15 | 1,698.22 | 2,655.28 | 204.94 | 862.91 | 1,552.68 | 2,369.84 |
-| 20000, 2000 | | 276.66 | 620.33 | 1,161.29 | 1,985.85 | | 416.13 | 903.66 | 1,554.10 |
+| | GPU: | H100 | H200 |
+|:-----------------------------|:---|:-----------------|:------------------|
+| | TP Size | 2 | 2 |
+| ISL, OSL | | | |
+| | | | |
+| 128, 128 | | 6,092.28 | 6,327.98 |
+| 128, 2048 | | 5,892.94 | 7,467.36 |
+| 128, 4096 | | 3,828.46 | 5,526.42 |
+| 500, 2000 | | 4,654.74 | 6,639.15 |
+| 1000, 1000 | | 4,181.06 | 4,773.33 |
+| 1000, 2000 | | 3,708.93 | 5,790.36 |
+| 1024, 2048 | | 3,785.04 | 5,480.44 |
+| 2048, 128 | | 723.40 | 747.55 |
+| 2048, 2048 | | 2,785.53 | 3,775.80 |
+| 5000, 500 | | 865.55 | 978.28 |
+| 20000, 2000 | | 411.85 | 609.42 |

 #### Llama 3.1 405B FP8
-
-| | GPU | H200 141GB HBM3 | H100 80GB HBM3 |
-|:-----------------------------|:---|:------------------|:-----------------|
-| | TP Size | 8 | 8 |
-| ISL, OSL | | | |
-| | | | |
-| 128, 2048 | | 5,567.87 | |
-| 128, 4096 | | 5,136.85 | |
-| 500, 2000 | | 4,787.61 | 3,673.91 |
-| 1000, 1000 | | 3,286.30 | 3,012.22 |
-| 1000, 2000 | | 3,636.76 | 3,262.20 |
-| 1024, 2048 | | 3,618.66 | 3,109.70 |
-| 2048, 128 | | 443.10 | 449.02 |
-| 5000, 500 | | 645.46 | |
-| 20000, 2000 | | | 372.12 |
+| | GPU: | H100 | H200 |
+|:-----------------------------|:---|:-----------------|:------------------|
+| | TP Size | 8 | 8 |
+| Runtime Input/Output Lengths | | | |
+| | | | |
+| 128, 128 | | | 3,705.18 |
+| 128, 2048 | | 4,517.39 | 4,715.13 |
+| 128, 4096 | | 2,910.31 | 4,475.91 |
+| 500, 2000 | | 3,664.62 | 4,804.10 |
+| 1000, 1000 | | 2,955.50 | 3,208.25 |
+| 1000, 2000 | | 2,884.69 | 3,630.29 |
+| 1024, 2048 | | 3,237.41 | 3,609.50 |
+| 2048, 128 | | 433.47 | 441.35 |
+| 2048, 2048 | | 2,216.55 | 2,840.86 |
+| 5000, 500 | | 579.05 | 645.26 |
+| 20000, 2000 | | 363.27 | 509.87 |

 #### Llama 4 Maverick FP8

-| | GPU | H200 141GB HBM3 | H100 80GB HBM3 |
+Note: Performance for Llama 4 on sequence lengths less than 8,192 tokens is affected by an issue introduced in v0.21. To reproduce the Llama 4 performance noted here, please use v0.20.
+
+| | GPU | H200 | H100 |
 |:-----------------------------|:---|:------------------|:-----------------|
 | | TP Size | 8 | 8 |
 | ISL, OSL | | | |
@@ -140,7 +160,6 @@ nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8
 | 2048, 128 | | 4,364.06 | 3,832.38 |
 | 2048, 2048 | | 12,800.89 | |
 | 5000, 500 | | 5,128.60 | |
-| 20000, 2000 | | 1,764.27 | 1,400.79 |

 ## Reproducing Benchmarked Results

@@ -216,7 +235,7 @@ a model name (HuggingFace reference or path to a local model), a [generated data
 trtllm-bench --model $model_name throughput --dataset $dataset_file --backend pytorch --extra_llm_api_options $llm_options
 ```

-The data collected for the v0.20 benchmarks was run with the following file:
+The data collected for the v0.21 benchmarks was run with the following file:

 `llm_options.yml`
 ```yaml
@@ -240,7 +259,7 @@ cuda_graph_config:
 - 8192
 ```

-In a majority of cases, we also use a higher KV cache percentage by setting `--kv_cache_free_gpu_mem_fraction 0.95` in the benchmark command. This allows us to obtain better performance than the default setting of `0.90`. We fall back to `0.90` if we hit an out of memory issue.
+In many cases, we also use a higher KV cache percentage by setting `--kv_cache_free_gpu_mem_fraction 0.95` in the benchmark command. This allows us to obtain better performance than the default setting of `0.90`. We fall back to `0.90` or lower if out-of-memory errors are encountered.

 The results will be printed to the terminal upon benchmark completion. For example,
