Tuning batch sizes, parallelism configurations, and other options may lead to improved performance.

For DeepSeek R1 performance, please check out our [performance guide](../blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.md).

For more information on benchmarking with `trtllm-bench`, see this NVIDIA [blog post](https://developer.nvidia.com/blog/llm-inference-benchmarking-performance-tuning-with-tensorrt-llm/).
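
The general workflow is to generate a dataset and feed it to `trtllm-bench`. As a rough sketch (the model ID, request count, and paths below are illustrative placeholders, and exact flags can differ between TensorRT-LLM versions):

```shell
# Generate a synthetic dataset with 128 input / 128 output tokens per request
# (placeholder tokenizer and output path), then run a throughput benchmark on it.
python benchmarks/cpp/prepare_dataset.py --stdout \
  --tokenizer meta-llama/Llama-3.1-8B-Instruct \
  token-norm-dist --input-mean 128 --output-mean 128 \
  --input-stdev 0 --output-stdev 0 --num-requests 3000 > /tmp/synthetic_128_128.txt

trtllm-bench --model meta-llama/Llama-3.1-8B-Instruct throughput \
  --dataset /tmp/synthetic_128_128.txt \
  --backend pytorch
```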
## Throughput Measurements

The table below shows performance data where a local inference client is fed requests at an infinite rate (no delay between messages). The performance numbers below were collected using the steps described in this document.

Testing was performed on models with weights quantized using [ModelOpt](https://nvidia.github.io/TensorRT-Model-Optimizer/#) and published by NVIDIA on the [Model Optimizer HuggingFace Collection](https://huggingface.co/collections/nvidia/model-optimizer-66aa84f7966b3150262481a4).
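
These pre-quantized checkpoints can be pulled straight from Hugging Face and passed to `trtllm-bench` by model ID. A minimal sketch, assuming one of the FP8 checkpoints from that collection (substitute the model you intend to benchmark):

```shell
# Download an example ModelOpt-quantized checkpoint published by NVIDIA
# (model ID shown is illustrative; pick the checkpoint you want to test).
huggingface-cli download nvidia/Llama-3.1-70B-Instruct-FP8
```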
### Hardware

The following GPU variants were used for testing:

- H100 SXM 80GB (DGX H100)
- H200 SXM 141GB (DGX H200)
- GH200 96GB HBM3 (480GB LPDDR5X)
- B200 180GB (DGX B200)
- GB200 192GB (GB200 NVL72)

Other hardware variants may have different TDP, memory bandwidth, core count, or other features, leading to performance differences on these workloads.

Note: Performance for Llama 4 on sequence lengths less than 8,192 tokens is affected by an issue introduced in v0.21. To reproduce the Llama 4 performance noted here, please use v0.20.

The data for the v0.20 benchmarks was collected using the following file:

The data for the v0.21 benchmarks was collected using the following file:

`llm_options.yml`
```yaml
cuda_graph_config:
  # ...
  - 8192
```
In many cases, we also use a higher KV cache percentage by setting `--kv_cache_free_gpu_mem_fraction 0.95` in the benchmark command. This allows us to obtain better performance than the default setting of `0.90`. We fall back to `0.90` or lower if out-of-memory errors are encountered.
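
As a sketch of how these options fit together in a benchmark invocation (the model ID and dataset path are placeholders; `llm_options.yml` is the file shown above):

```shell
# Throughput run with the extra LLM API options file and a raised KV cache
# memory fraction; fall back to 0.90 or lower if this runs out of memory.
trtllm-bench --model nvidia/Llama-3.1-70B-Instruct-FP8 throughput \
  --dataset /tmp/synthetic_128_128.txt \
  --backend pytorch \
  --extra_llm_api_options llm_options.yml \
  --kv_cache_free_gpu_mem_fraction 0.95
```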
The results will be printed to the terminal upon benchmark completion. For example,