docs/source/architecture/add-model.md (6 additions, 6 deletions)
@@ -2,19 +2,19 @@
 # Adding a Model

-This document describes how to add a typical decoder-only model in TensorRT-LLM.
+This document describes how to add a typical decoder-only model in TensorRT LLM.

 ## Step 1. Write Modeling Part

-TensorRT-LLM provides different levels of APIs:
+TensorRT LLM provides different levels of APIs:

 - Low-level functions, for example, `concat`, `add`, and `sum`.
 - Basic layers, such as, `Linear` and `LayerNorm`.
 - High-level layers, such as, `MLP` and `Attention`.
 - Base class for typical decoder-only models, such as, `DecoderModelForCausalLM`.

 1. Create a model directory in `tensorrt_llm/models`, for example `my_model`.
-2. Write a `model.py` with TensorRT-LLM's APIs
+2. Write a `model.py` with TensorRT LLM's APIs

 ```python
 class MyDecoderLayer(Module):
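The ```python hunk above is cut off after its first line. As a rough sketch of the structure Step 1 asks for, written in plain PyTorch rather than the TensorRT LLM `Module`/layer APIs (class names, sizes, and the attention/MLP details below are illustrative, not the code from `model.py`):

```python
# Illustrative sketch (not part of the diff): a plain-PyTorch analogue of the
# three classes the modeling step builds. The real model.py would use
# tensorrt_llm.Module, tensorrt_llm layers, and DecoderModelForCausalLM instead.
import torch
from torch import nn


class MyDecoderLayer(nn.Module):
    """One transformer block: pre-norm self-attention followed by a pre-norm MLP."""

    def __init__(self, hidden_size: int, num_heads: int):
        super().__init__()
        self.input_layernorm = nn.LayerNorm(hidden_size)
        self.attention = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.post_layernorm = nn.LayerNorm(hidden_size)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, 4 * hidden_size),
            nn.GELU(),
            nn.Linear(4 * hidden_size, hidden_size),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.input_layernorm(x)
        attn_out, _ = self.attention(h, h, h, need_weights=False)
        x = x + attn_out
        return x + self.mlp(self.post_layernorm(x))


class MyModel(nn.Module):
    """Token embedding + stack of decoder layers + final norm."""

    def __init__(self, vocab_size: int, hidden_size: int, num_heads: int, num_layers: int):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.layers = nn.ModuleList(
            MyDecoderLayer(hidden_size, num_heads) for _ in range(num_layers)
        )
        self.final_norm = nn.LayerNorm(hidden_size)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        x = self.embedding(input_ids)
        for layer in self.layers:
            x = layer(x)
        return self.final_norm(x)


class MyModelForCausalLM(nn.Module):
    """Adds the LM head; mirrors the role of DecoderModelForCausalLM."""

    def __init__(self, vocab_size: int = 32000, hidden_size: int = 256,
                 num_heads: int = 4, num_layers: int = 2):
        super().__init__()
        self.transformer = MyModel(vocab_size, hidden_size, num_heads, num_layers)
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        return self.lm_head(self.transformer(input_ids))


if __name__ == "__main__":
    logits = MyModelForCausalLM()(torch.randint(0, 32000, (1, 8)))
    print(logits.shape)  # torch.Size([1, 8, 32000])
```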
@@ -52,7 +52,7 @@ class MyModelForCausalLM(DecoderModelForCausalLM):
 ## Step 2. Implement Weight Conversion

-The weights from source framework need to be converted and bound to the new added TensorRT-LLM model. Here is an example of converting HuggingFace weights:
+The weights from source framework need to be converted and bound to the new added TensorRT LLM model. Here is an example of converting HuggingFace weights:

 ```python
 class MyModelForCausalLM(DecoderModelForCausalLM):
@@ -62,8 +62,8 @@ class MyModelForCausalLM(DecoderModelForCausalLM):
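The conversion code for Step 2 is elided by the diff. A minimal sketch of the usual pattern, independent of TensorRT LLM's helper APIs: load the HuggingFace state dict, rename each tensor to the target model's parameter naming, and transpose or split tensors where layouts differ. The rename table and parameter names below are illustrative assumptions, not the real mapping:

```python
# Illustrative sketch (not part of the diff): the common shape of a
# HF -> new-model weight conversion. Key names are hypothetical.
import re

import torch


def convert_hf_weights(hf_state_dict: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
    """Map HuggingFace parameter names onto the new model's naming scheme."""
    rename_rules = [
        (r"^model\.embed_tokens\.weight$", "transformer.vocab_embedding.weight"),
        (r"^model\.layers\.(\d+)\.self_attn\.o_proj\.weight$",
         r"transformer.layers.\1.attention.dense.weight"),
        (r"^model\.layers\.(\d+)\.mlp\.down_proj\.weight$",
         r"transformer.layers.\1.mlp.proj.weight"),
        (r"^lm_head\.weight$", "lm_head.weight"),
    ]
    converted = {}
    for hf_name, tensor in hf_state_dict.items():
        for pattern, replacement in rename_rules:
            if re.match(pattern, hf_name):
                # Linear weights stay in (out_features, in_features) layout;
                # any transpose/split/concat for mismatched layouts goes here.
                converted[re.sub(pattern, replacement, hf_name)] = tensor.contiguous()
                break
    return converted


if __name__ == "__main__":
    fake_hf = {"model.embed_tokens.weight": torch.zeros(32, 8),
               "lm_head.weight": torch.zeros(32, 8)}
    print(sorted(convert_hf_weights(fake_hf)))
```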

docs/source/architecture/checkpoint.md (12 additions, 12 deletions)
@@ -1,36 +1,36 @@
-# TensorRT-LLM Checkpoint
+# TensorRT LLM Checkpoint

 ## Overview

-The earlier versions (pre-0.8 version) of TensorRT-LLM were developed with a very aggressive timeline. For those versions, emphasis was not put on defining a unified workflow. Now that TensorRT-LLM has reached some level of feature richness, the development team has decided to put more effort into unifying the APIs and workflow of TensorRT-LLM. This file documents the workflow around TensorRT-LLM checkpoint and the set of CLI tools to generate checkpoint, build engines, and evaluate engines.
+The earlier versions (pre-0.8 version) of TensorRT LLM were developed with a very aggressive timeline. For those versions, emphasis was not put on defining a unified workflow. Now that TensorRT LLM has reached some level of feature richness, the development team has decided to put more effort into unifying the APIs and workflow of TensorRT LLM. This file documents the workflow around TensorRT LLM checkpoint and the set of CLI tools to generate checkpoint, build engines, and evaluate engines.

 There are three steps in the workflow:

-1. Convert weights from different source frameworks into TensorRT-LLM checkpoint.
-2. Build the TensorRT-LLM checkpoint into TensorRT engines with a unified build command.
-3. Load the engines to TensorRT-LLM model runner and evaluate with different evaluation tasks.
+1. Convert weights from different source frameworks into TensorRT LLM checkpoint.
+2. Build the TensorRT LLM checkpoint into TensorRT engines with a unified build command.
+3. Load the engines to TensorRT LLM model runner and evaluate with different evaluation tasks.

-The linear weights in TensorRT-LLM checkpoint always follows (`out_feature`, `in_feature`) shape, whereas some quantized linear in TensorRT-LLM implemented by plugin may use (`in_feature`, `out_fature`) shape. The `trtllm-build` command adds a transpose operation to post-process it.
+The linear weights in TensorRT LLM checkpoint always follows (`out_feature`, `in_feature`) shape, whereas some quantized linear in TensorRT LLM implemented by plugin may use (`in_feature`, `out_fature`) shape. The `trtllm-build` command adds a transpose operation to post-process it.
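A small illustration of that shape convention (not from the diff; the layer sizes are arbitrary): a linear layer mapping 4096 inputs to 11008 outputs is stored as an `(11008, 4096)` tensor in the checkpoint, and a plugin that expects `(in_feature, out_feature)` layout receives the transpose that `trtllm-build` inserts.

```python
# Illustrative sketch: the (out_features, in_features) storage convention and
# the transpose a plugin-style (in_features, out_features) kernel would need.
import torch

in_features, out_features = 4096, 11008

# Checkpoint layout: (out_features, in_features), same as torch.nn.Linear.weight.
w_checkpoint = torch.empty(out_features, in_features)
assert w_checkpoint.shape == (11008, 4096)

# A quantized-linear plugin expecting (in_features, out_features) sees the
# transpose, which is what the build step adds as post-processing.
w_plugin = w_checkpoint.t().contiguous()
assert w_plugin.shape == (4096, 11008)
```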

 ### Example
@@ -218,7 +218,7 @@ Here is the `config.json`:
 ## Build Checkpoint into TensorRT Engine

-TensorRT-LLM provides a unified build command: `trtllm-build`. Before using it,
+TensorRT LLM provides a unified build command: `trtllm-build`. Before using it,

docs/source/architecture/overview.md (5 additions, 5 deletions)
@@ -1,6 +1,6 @@
 # Architecture Overview

-The `LLM` class is a core entry point for the TensorRT-LLM, providing a simplified `generate()` API for efficient large language model inference. This abstraction aims to streamline the user experience, as demonstrated with TinyLlama:
+The `LLM` class is a core entry point for the TensorRT LLM, providing a simplified `generate()` API for efficient large language model inference. This abstraction aims to streamline the user experience, as demonstrated with TinyLlama:

 ```python
 from tensorrt_llm import LLM
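The snippet is truncated after the import. For context, quick-start usage of this API looks roughly like the following sketch, based on the documented `LLM`/`SamplingParams` interface; the exact TinyLlama sample in the file may differ in details such as prompts and sampling values:

```python
# Illustrative sketch of the LLM quick-start flow; prompts and sampling values
# are placeholders, and the snippet in the underlying doc may differ.
from tensorrt_llm import LLM, SamplingParams


def main() -> None:
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
    prompts = ["Hello, my name is", "The capital of France is"]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    for output in llm.generate(prompts, sampling_params):
        print(f"Prompt: {output.prompt!r} -> {output.outputs[0].text!r}")


if __name__ == "__main__":
    main()
```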
@@ -16,7 +16,7 @@ The `LLM` class automatically manages essential pre and post-processing steps, i
 Internally, the `LLM` class orchestrates the creation of a dedicated `PyExecutor(Worker)` process on each rank.
 This `PyExecutor` operates in a continuous background loop, designed for the efficient, asynchronous processing of inference requests.
@@ -45,13 +45,13 @@ During each iteration of its background loop, the `PyExecutor` performs the foll
 ## Runtime Optimizations

-TensorRT-LLM enhances inference throughput and reduces latency by integrating a suite of runtime optimizations, including CUDA Graph, [Overlap Scheduler](../features/overlap-scheduler.md), [Speculative decoding](../features/speculative-decoding.md), etc.
+TensorRT LLM enhances inference throughput and reduces latency by integrating a suite of runtime optimizations, including CUDA Graph, [Overlap Scheduler](../features/overlap-scheduler.md), [Speculative decoding](../features/speculative-decoding.md), etc.

 ### CUDA Graph

 CUDA Graphs drastically reduce the CPU-side overhead associated with launching GPU kernels, which is particularly impactful in PyTorch-based inference where Python's host-side code can be a bottleneck. By capturing a sequence of CUDA operations as a single graph, the entire sequence can be launched with one API call, minimizing CPU-GPU synchronization and driver overhead.

-To maximize the "hit rate" of these cached graphs, TensorRT-LLM employs CUDA Graph padding. If an incoming batch's size doesn't match a captured graph, it's padded to the nearest larger, supported size for which a graph exists. While this incurs minor overhead from computing "wasted" tokens, it's often a better trade-off than falling back to slower eager mode execution. This optimization has a significant impact, demonstrating up to a 22% end-to-end throughput increase on certain models and hardware.
+To maximize the "hit rate" of these cached graphs, TensorRT LLM employs CUDA Graph padding. If an incoming batch's size doesn't match a captured graph, it's padded to the nearest larger, supported size for which a graph exists. While this incurs minor overhead from computing "wasted" tokens, it's often a better trade-off than falling back to slower eager mode execution. This optimization has a significant impact, demonstrating up to a 22% end-to-end throughput increase on certain models and hardware.

 ### Overlap Scheduler
@@ -72,4 +72,4 @@ if self.previous_batch is not None:
     self._process_previous_batch()
 ```

-This approach effectively reduces GPU idle time and improves overall hardware occupancy. While it introduces one extra decoding step into the pipeline, the resulting throughput gain is a significant trade-off. For this reason, the Overlap Scheduler is enabled by default in TensorRT-LLM.
+This approach effectively reduces GPU idle time and improves overall hardware occupancy. While it introduces one extra decoding step into the pipeline, the resulting throughput gain is a significant trade-off. For this reason, the Overlap Scheduler is enabled by default in TensorRT LLM.
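As a toy illustration of the scheduling idea in the fragment above (no relation to the real `PyExecutor` internals): the loop enqueues GPU work for the current batch first, then performs the CPU-side post-processing of the previous batch while that work is in flight.

```python
# Illustrative sketch of overlap scheduling: CPU post-processing of batch N-1
# runs while the GPU works on batch N. Names and classes are placeholders.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Batch:
    batch_id: int


class ToyExecutor:
    def __init__(self) -> None:
        self.previous_batch: Optional[Batch] = None

    def _launch_forward(self, batch: Batch) -> None:
        print(f"GPU: launch forward for batch {batch.batch_id} (async)")

    def _process_previous_batch(self) -> None:
        print(f"CPU: post-process batch {self.previous_batch.batch_id} "
              f"while the GPU is busy")

    def step(self, batch: Batch) -> None:
        self._launch_forward(batch)          # enqueue GPU work first
        if self.previous_batch is not None:  # then overlap CPU work with it
            self._process_previous_batch()
        self.previous_batch = batch


executor = ToyExecutor()
for i in range(3):
    executor.step(Batch(batch_id=i))
```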

docs/source/blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.md (19 additions, 19 deletions)
@@ -1,18 +1,18 @@
-# How to get best performance on DeepSeek-R1 in TensorRT-LLM
+# How to get best performance on DeepSeek-R1 in TensorRT LLM

 NVIDIA has announced world-record DeepSeek-R1 inference performance at NVIDIA GTC 2025. A single NVIDIA DGX system with eight NVIDIA Blackwell GPUs can achieve over 250 tokens per second per user or a maximum throughput of over 30,000 tokens per second on the massive, state-of-the-art 671 billion parameter DeepSeek-R1 model. [NVIDIA Blackwell Delivers World-Record DeepSeek-R1 Inference Performance](https://developer.nvidia.com/blog/nvidia-blackwell-delivers-world-record-deepseek-r1-inference-performance/)

 In this blog, we share the configurations and procedures about how to reproduce the number on both B200 and H200 with PyTorch workflow.

 ## Table of Contents

-- [How to get best performance on DeepSeek-R1 in TensorRT-LLM](#how-to-get-best-performance-on-deepseek-r1-in-tensorrt-llm)
+- [How to get best performance on DeepSeek-R1 in TensorRT LLM](#how-to-get-best-performance-on-deepseek-r1-in-tensorrt-llm)
 - [Table of Contents](#table-of-contents)
-- [Prerequisites: Install TensorRT-LLM and download models](#prerequisites-install-tensorrt-llm-and-download-models)
@@ -34,13 +34,13 @@ In this blog, we share the configurations and procedures about how to reproduce
 - [Out of memory issues](#out-of-memory-issues)

-## Prerequisites: Install TensorRT-LLM and download models
+## Prerequisites: Install TensorRT LLM and download models

-This section can be skipped if you already have TensorRT-LLM installed and have already downloaded the DeepSeek R1 model checkpoint.
+This section can be skipped if you already have TensorRT LLM installed and have already downloaded the DeepSeek R1 model checkpoint.

-#### 1. Download TensorRT-LLM
+#### 1. Download TensorRT LLM

-**You can also find more comprehensive instructions to install TensorRT-LLM in this [TensorRT-LLM installation guide](https://nvidia.github.io/TensorRT-LLM/installation/build-from-source-linux.html), refer to that guide for common issues if you encounter any here.**
+**You can also find more comprehensive instructions to install TensorRT LLM in this [TensorRT LLM installation guide](https://nvidia.github.io/TensorRT-LLM/installation/build-from-source-linux.html), refer to that guide for common issues if you encounter any here.**

 make -C docker run LOCAL_USER=1 DOCKER_RUN_ARGS="-v $YOUR_MODEL_PATH:$YOUR_MODEL_PATH:ro -v $YOUR_WORK_PATH:$YOUR_WORK_PATH"
 ```
 Here we set `LOCAL_USER=1` argument to set up the local user instead of root account inside the container, you can remove it if running as root inside container is fine.

-#### 4. Compile and Install TensorRT-LLM
+#### 4. Compile and Install TensorRT LLM
 Here we compile the source inside the container:

 ```bash
@@ -122,11 +122,11 @@ The command to generate synthetic dataset will be attached to the max throughput
 This section provides the reproducing steps for NVIDIA Blackwell B200 and H200 GPUs, for both min-latency and max-throughput scenarios.

-All the benchmarking is done by the trtllm-bench command line tool provided in the TensorRT-LLM installation, see [TensorRT-LLM Benchmarking](https://nvidia.github.io/TensorRT-LLM/performance/perf-benchmarking.html) for details of this tool.
+All the benchmarking is done by the trtllm-bench command line tool provided in the TensorRT LLM installation, see [TensorRT LLM Benchmarking](https://nvidia.github.io/TensorRT-LLM/performance/perf-benchmarking.html) for details of this tool.

 For brevity, we only provide the commands to reproduce the perf numbers without detailed explanation of the tools and options in this doc.

-All these commands here are assumed to be running inside the container started by `make -C docker run ...` command mentioned in the [Build and run TensorRT-LLM container section](#3-build-and-run-tensorrt-llm-container)
+All these commands here are assumed to be running inside the container started by `make -C docker run ...` command mentioned in the [Build and run TensorRT LLM container section](#3-build-and-run-tensorrt-llm-container)

 ### B200 min-latency
 Our benchmark results are based on **Batch = 1, ISL = 1K, OSL = 2K, num_requests = 10 from real dataset**

-- `trtllm-bench`: A CLI benchmarking utility that aims to make it easier for users to reproduce our officially published. See [TensorRT-LLM Benchmarking](https://nvidia.github.io/TensorRT-LLM/performance/perf-benchmarking.html) for details.
+- `trtllm-bench`: A CLI benchmarking utility that aims to make it easier for users to reproduce our officially published. See [TensorRT LLM Benchmarking](https://nvidia.github.io/TensorRT-LLM/performance/perf-benchmarking.html) for details.
 - `--dataset`: Prompt dataset used to benchmark. Our official benchmark dataset has ISL = 1K, OSL = 2K
 - `--num_requests`: Num requests used for the benchmark.
 - `--concurrency`: Total concurrency for the system.
@@ -186,7 +186,7 @@ Average request latency (ms): 7456.1219
 Due to our evaluation found that FP8 KV cache does not introduce obvious accuracy drop compared to BF16 KV cache. See [Precision strategy](./tech_blog/blog3_Optimizing_DeepSeek_R1_Throughput_on_NVIDIA_Blackwell_GPUs.md#precision-strategy), the latest [DeepSeek-R1-0528-FP4](https://huggingface.co/nvidia/DeepSeek-R1-0528-FP4) checkpoint had enabled FP8 KV cache by-default.

-We are seeing meaningful speedup using FP8 KV cache, thus refreshing the numbers here. The results are reproduced with TensorRT-LLM commit b6261862419c33d6ce2313aff1e7116067d6037d.
+We are seeing meaningful speedup using FP8 KV cache, thus refreshing the numbers here. The results are reproduced with TensorRT LLM commit b6261862419c33d6ce2313aff1e7116067d6037d.

 !! Note that the exact command to reproduce numbers can change as the API/options are refactored, the option and numbers here is a reference at given exact commit.
@@ -239,7 +239,7 @@ Per GPU Output Throughput (tps/gpu): 5393.2755
 ### B200 max-throughput for R1 with FP16 KV cache
 Our benchmark results are based on **Batch = 3072, ISL = 1K, OSL = 2K, num_requests = 49152 from synthetic dataset**.

-The results are reproduced with TensorRT-LLM commit b6261862419c33d6ce2313aff1e7116067d6037d.
+The results are reproduced with TensorRT LLM commit b6261862419c33d6ce2313aff1e7116067d6037d.

 !! Note that the exact command to reproduce numbers can change as the API/options are refactored, the option and numbers here is a reference at given exact commit.
@@ -401,7 +401,7 @@ Average request latency (ms): 181540.5739
 ## Exploring more ISL/OSL combinations

-To benchmark TensorRT-LLM on DeepSeek models with more ISL/OSL combinations, you can use `prepare_dataset.py` to generate the dataset and use similar commands mentioned in the previous section. TensorRT-LLM is working on enhancements that can make the benchmark process smoother.
+To benchmark TensorRT LLM on DeepSeek models with more ISL/OSL combinations, you can use `prepare_dataset.py` to generate the dataset and use similar commands mentioned in the previous section. TensorRT LLM is working on enhancements that can make the benchmark process smoother.

 ### WIP: Enable more features by default

 Currently, there are some features that need to be enabled through a user-defined file `extra-llm-api-config.yml`, such as CUDA graph, overlap scheduler and attention dp. We're working on to enable those features by default, so that users can get good out-of-the-box performance on DeepSeek models.