Commit c3b2ebe

[None][Doc] Rename TensorRT-LLM to TensorRT LLM.
Signed-off-by: nv-guomingz <[email protected]>
1 parent e07fa9d commit c3b2ebe

42 files changed: 305 additions and 305 deletions

docs/source/architecture/add-model.md

Lines changed: 6 additions & 6 deletions
@@ -2,19 +2,19 @@
 
 # Adding a Model
 
-This document describes how to add a typical decoder-only model in TensorRT-LLM.
+This document describes how to add a typical decoder-only model in TensorRT LLM.
 
 ## Step 1. Write Modeling Part
 
-TensorRT-LLM provides different levels of APIs:
+TensorRT LLM provides different levels of APIs:
 
 - Low-level functions, for example, `concat`, `add`, and `sum`.
 - Basic layers, such as, `Linear` and `LayerNorm`.
 - High-level layers, such as, `MLP` and `Attention`.
 - Base class for typical decoder-only models, such as, `DecoderModelForCausalLM`.
 
 1. Create a model directory in `tensorrt_llm/models`, for example `my_model`.
-2. Write a `model.py` with TensorRT-LLM's APIs
+2. Write a `model.py` with TensorRT LLM's APIs
 
 ```python
 class MyDecoderLayer(Module):
@@ -52,7 +52,7 @@ class MyModelForCausalLM(DecoderModelForCausalLM):
 
 ## Step 2. Implement Weight Conversion
 
-The weights from source framework need to be converted and bound to the new added TensorRT-LLM model. Here is an example of converting HuggingFace weights:
+The weights from source framework need to be converted and bound to the new added TensorRT LLM model. Here is an example of converting HuggingFace weights:
 
 ```python
 class MyModelForCausalLM(DecoderModelForCausalLM):
@@ -62,8 +62,8 @@ class MyModelForCausalLM(DecoderModelForCausalLM):
         hf_model_dir,
         dtype='float16',
         mapping: Optional[Mapping] = None) -> MyModelForCausalLM
-        # create a TensorRT-LLM MyModelForCausalLM model object
-        # convert HuggingFace checkpoint to TensorRT-LLM expected weights dict
+        # create a TensorRT LLM MyModelForCausalLM model object
+        # convert HuggingFace checkpoint to TensorRT LLM expected weights dict
         # load the weights to MyModelForCausalLM object
 ```
 
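The last hunk above shows the signature of the conversion classmethod but not its body. As context only, here is a minimal sketch of the convert-then-load pattern the document describes; it is not part of this commit, and the `convert_hf_weights` helper and its renaming rule are hypothetical placeholders rather than the real TensorRT LLM converter for any model.

```python
# Illustrative sketch of "convert HuggingFace checkpoint to the TensorRT LLM
# expected weights dict". The helper name and renaming rule are hypothetical.
from typing import Dict

import torch
from transformers import AutoModelForCausalLM


def convert_hf_weights(hf_model_dir: str, dtype: str = "float16") -> Dict[str, torch.Tensor]:
    """Flatten a HuggingFace checkpoint into a {name: tensor} dict."""
    hf_model = AutoModelForCausalLM.from_pretrained(
        hf_model_dir, torch_dtype=getattr(torch, dtype))
    weights = {}
    for hf_name, tensor in hf_model.state_dict().items():
        # Hypothetical mapping: rename HF parameter names onto the hierarchical
        # names of the new model, e.g. "model.layers.0..." -> "transformer.layers.0...".
        trt_name = hf_name.replace("model.layers", "transformer.layers")
        weights[trt_name] = tensor.contiguous()
    return weights
```

In the documented flow, the classmethod sketched in the last hunk would create the `MyModelForCausalLM` object, convert the HuggingFace checkpoint with a helper like this, and then load the resulting weights dict into the model.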

docs/source/architecture/checkpoint.md

Lines changed: 12 additions & 12 deletions
@@ -1,36 +1,36 @@
-# TensorRT-LLM Checkpoint
+# TensorRT LLM Checkpoint
 
 ## Overview
 
-The earlier versions (pre-0.8 version) of TensorRT-LLM were developed with a very aggressive timeline. For those versions, emphasis was not put on defining a unified workflow. Now that TensorRT-LLM has reached some level of feature richness, the development team has decided to put more effort into unifying the APIs and workflow of TensorRT-LLM. This file documents the workflow around TensorRT-LLM checkpoint and the set of CLI tools to generate checkpoint, build engines, and evaluate engines.
+The earlier versions (pre-0.8 version) of TensorRT LLM were developed with a very aggressive timeline. For those versions, emphasis was not put on defining a unified workflow. Now that TensorRT LLM has reached some level of feature richness, the development team has decided to put more effort into unifying the APIs and workflow of TensorRT LLM. This file documents the workflow around TensorRT LLM checkpoint and the set of CLI tools to generate checkpoint, build engines, and evaluate engines.
 
 There are three steps in the workflow:
 
-1. Convert weights from different source frameworks into TensorRT-LLM checkpoint.
-2. Build the TensorRT-LLM checkpoint into TensorRT engines with a unified build command.
-3. Load the engines to TensorRT-LLM model runner and evaluate with different evaluation tasks.
+1. Convert weights from different source frameworks into TensorRT LLM checkpoint.
+2. Build the TensorRT LLM checkpoint into TensorRT engines with a unified build command.
+3. Load the engines to TensorRT LLM model runner and evaluate with different evaluation tasks.
 
 ```
 NeMo -------------
                   |
 HuggingFace ------
                   |   convert                        build                   load
-Modelopt --------- ----------> TensorRT-LLM Checkpoint --------> TensorRT Engine ------> TensorRT-LLM ModelRunner
+Modelopt --------- ----------> TensorRT LLM Checkpoint --------> TensorRT Engine ------> TensorRT LLM ModelRunner
                   |
 JAX --------------
                   |
 DeepSpeed --------
 ```
 
-## Prepare the TensorRT-LLM Checkpoint
+## Prepare the TensorRT LLM Checkpoint
 
-TensorRT-LLM aims at supporting different sources:
+TensorRT LLM aims at supporting different sources:
 
 1. Trained models from NVIDIA NeMo, Microsoft DeepSpeed, and JAX
 2. Quantized models from NVIDIA Modelopt
 3. Popular models from HuggingFace
 
-TensorRT-LLM defines its own checkpoint format. A checkpoint directory includes:
+TensorRT LLM defines its own checkpoint format. A checkpoint directory includes:
 
 1. One config `json` file, which contains several model hyper-parameters.
 2. One or several rank weights files, each file contains a dictionary of tensors (weights).
@@ -107,7 +107,7 @@ Here is the model specific config list:
 ### Rank Weights
 
 Like PyTorch, the tensor (weight) name is a string containing hierarchical information,
-which is uniquely mapped to a certain parameter of a TensorRT-LLM model.
+which is uniquely mapped to a certain parameter of a TensorRT LLM model.
 
 For example, each transformer layer of the OPT model contains an `Attention` layer, an `MLP` layer. and two `LayerNorm` layers.
 
@@ -169,7 +169,7 @@ Here is the AWQ scaling factors of `mlp.fc` linear layer:
 - `transformer.layers.0.mlp.fc.prequant_scaling_factor`
 
 ```{note}
-The linear weights in TensorRT-LLM checkpoint always follows (`out_feature`, `in_feature`) shape, whereas some quantized linear in TensorRT-LLM implemented by plugin may use (`in_feature`, `out_fature`) shape. The `trtllm-build` command adds a transpose operation to post-process it.
+The linear weights in TensorRT LLM checkpoint always follows (`out_feature`, `in_feature`) shape, whereas some quantized linear in TensorRT LLM implemented by plugin may use (`in_feature`, `out_fature`) shape. The `trtllm-build` command adds a transpose operation to post-process it.
 ```
 
 ### Example
@@ -218,7 +218,7 @@ Here is the `config.json`:
 
 ## Build Checkpoint into TensorRT Engine
 
-TensorRT-LLM provides a unified build command: `trtllm-build`. Before using it,
+TensorRT LLM provides a unified build command: `trtllm-build`. Before using it,
 you may need to add it to the `PATH`.
 
 ```bash
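For context on the checkpoint format described above, here is a small sketch that writes the two ingredients the document lists: one config JSON file and one per-rank weights file. It is not part of this commit; the config fields, the tensor name, and the `rank0.safetensors` filename are illustrative assumptions, so check the checkpoint documentation of your TensorRT LLM version for the exact schema.

```python
# Minimal sketch of a checkpoint directory: config.json plus a per-rank weights
# file holding a dictionary of named tensors. Field names are illustrative.
import json
from pathlib import Path

import torch
from safetensors.torch import save_file

ckpt_dir = Path("tllm_checkpoint")
ckpt_dir.mkdir(exist_ok=True)

# 1. One config json file with several model hyper-parameters.
config = {
    "architecture": "MyModelForCausalLM",  # illustrative values only
    "dtype": "float16",
    "num_hidden_layers": 2,
    "num_attention_heads": 16,
    "hidden_size": 1024,
}
(ckpt_dir / "config.json").write_text(json.dumps(config, indent=4))

# 2. One weights file per rank; each file contains a dictionary of tensors.
#    Linear weights follow the (out_feature, in_feature) shape noted in the doc.
weights = {
    "transformer.layers.0.mlp.fc.weight": torch.zeros(4096, 1024, dtype=torch.float16),
}
save_file(weights, str(ckpt_dir / "rank0.safetensors"))
```

A directory shaped like this is what the `trtllm-build` step described in the diff then compiles into a TensorRT engine.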

docs/source/blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.md

Lines changed: 19 additions & 19 deletions
@@ -1,18 +1,18 @@
-# How to get best performance on DeepSeek-R1 in TensorRT-LLM
+# How to get best performance on DeepSeek-R1 in TensorRT LLM
 
 NVIDIA has announced world-record DeepSeek-R1 inference performance at NVIDIA GTC 2025. A single NVIDIA DGX system with eight NVIDIA Blackwell GPUs can achieve over 250 tokens per second per user or a maximum throughput of over 30,000 tokens per second on the massive, state-of-the-art 671 billion parameter DeepSeek-R1 model. [NVIDIA Blackwell Delivers World-Record DeepSeek-R1 Inference Performance](https://developer.nvidia.com/blog/nvidia-blackwell-delivers-world-record-deepseek-r1-inference-performance/)
 
 In this blog, we share the configurations and procedures about how to reproduce the number on both B200 and H200 with PyTorch workflow.
 
 ## Table of Contents
 
-- [How to get best performance on DeepSeek-R1 in TensorRT-LLM](#how-to-get-best-performance-on-deepseek-r1-in-tensorrt-llm)
+- [How to get best performance on DeepSeek-R1 in TensorRT LLM](#how-to-get-best-performance-on-deepseek-r1-in-tensorrt-llm)
 - [Table of Contents](#table-of-contents)
-- [Prerequisites: Install TensorRT-LLM and download models](#prerequisites-install-tensorrt-llm-and-download-models)
-- [1. Download TensorRT-LLM](#1-download-tensorrt-llm)
+- [Prerequisites: Install TensorRT LLM and download models](#prerequisites-install-tensorrt-llm-and-download-models)
+- [1. Download TensorRT LLM](#1-download-tensorrt-llm)
 - [2. Download the DeepSeek R1 models](#2-download-the-deepseek-r1-models)
-- [3. Build and run TensorRT-LLM container](#3-build-and-run-tensorrt-llm-container)
-- [4. Compile and Install TensorRT-LLM](#4-compile-and-install-tensorrt-llm)
+- [3. Build and run TensorRT LLM container](#3-build-and-run-tensorrt-llm-container)
+- [4. Compile and Install TensorRT LLM](#4-compile-and-install-tensorrt-llm)
 - [5. Optional: Tune GPU clocks](#5-optional-tune-gpu-clocks)
 - [6. Dataset preparation](#6-dataset-preparation)
 - [Reproducing steps](#reproducing-steps)
@@ -34,13 +34,13 @@ In this blog, we share the configurations and procedures about how to reproduce
 - [Out of memory issues](#out-of-memory-issues)
 
 
-## Prerequisites: Install TensorRT-LLM and download models
+## Prerequisites: Install TensorRT LLM and download models
 
-This section can be skipped if you already have TensorRT-LLM installed and have already downloaded the DeepSeek R1 model checkpoint.
+This section can be skipped if you already have TensorRT LLM installed and have already downloaded the DeepSeek R1 model checkpoint.
 
-#### 1. Download TensorRT-LLM
+#### 1. Download TensorRT LLM
 
-**You can also find more comprehensive instructions to install TensorRT-LLM in this [TensorRT-LLM installation guide](https://nvidia.github.io/TensorRT-LLM/installation/build-from-source-linux.html), refer to that guide for common issues if you encounter any here.**
+**You can also find more comprehensive instructions to install TensorRT LLM in this [TensorRT LLM installation guide](https://nvidia.github.io/TensorRT-LLM/installation/build-from-source-linux.html), refer to that guide for common issues if you encounter any here.**
 
 ``` bash
 # Prerequisites
@@ -50,7 +50,7 @@ git lfs install
 # Replace with your actual path
 YOUR_WORK_PATH=<YOUR_WORK_PATH>
 
-# Clone the TensorRT-LLM repository
+# Clone the TensorRT LLM repository
 cd $YOUR_WORK_PATH
 git clone https://github.com/NVIDIA/TensorRT-LLM.git
 cd TensorRT-LLM
@@ -77,15 +77,15 @@ git clone https://huggingface.co/nvidia/DeepSeek-R1-FP4
 git clone https://huggingface.co/deepseek-ai/DeepSeek-R1
 ```
 
-#### 3. Build and run TensorRT-LLM container
+#### 3. Build and run TensorRT LLM container
 
 ``` bash
 cd TensorRT-LLM
 make -C docker run LOCAL_USER=1 DOCKER_RUN_ARGS="-v $YOUR_MODEL_PATH:$YOUR_MODEL_PATH:ro -v $YOUR_WORK_PATH:$YOUR_WORK_PATH"
 ```
 Here we set `LOCAL_USER=1` argument to set up the local user instead of root account inside the container, you can remove it if running as root inside container is fine.
 
-#### 4. Compile and Install TensorRT-LLM
+#### 4. Compile and Install TensorRT LLM
 Here we compile the source inside the container:
 
 ``` bash
@@ -122,11 +122,11 @@ The command to generate synthetic dataset will be attached to the max throughput
 
 This section provides the reproducing steps for NVIDIA Blackwell B200 and H200 GPUs, for both min-latency and max-throughput scenarios.
 
-All the benchmarking is done by the trtllm-bench command line tool provided in the TensorRT-LLM installation, see [TensorRT-LLM Benchmarking](https://nvidia.github.io/TensorRT-LLM/performance/perf-benchmarking.html) for details of this tool.
+All the benchmarking is done by the trtllm-bench command line tool provided in the TensorRT LLM installation, see [TensorRT LLM Benchmarking](https://nvidia.github.io/TensorRT-LLM/performance/perf-benchmarking.html) for details of this tool.
 
 For brevity, we only provide the commands to reproduce the perf numbers without detailed explanation of the tools and options in this doc.
 
-All these commands here are assumed to be running inside the container started by `make -C docker run ...` command mentioned in the [Build and run TensorRT-LLM container section](#3-build-and-run-tensorrt-llm-container)
+All these commands here are assumed to be running inside the container started by `make -C docker run ...` command mentioned in the [Build and run TensorRT LLM container section](#3-build-and-run-tensorrt-llm-container)
 
 ### B200 min-latency
 Our benchmark results are based on **Batch = 1, ISL = 1K, OSL = 2K, num_requests = 10 from real dataset**
@@ -158,7 +158,7 @@ trtllm-bench --model nvidia/DeepSeek-R1-FP4 \
 ```
 
 Explanation:
-- `trtllm-bench`: A CLI benchmarking utility that aims to make it easier for users to reproduce our officially published. See [TensorRT-LLM Benchmarking](https://nvidia.github.io/TensorRT-LLM/performance/perf-benchmarking.html) for details.
+- `trtllm-bench`: A CLI benchmarking utility that aims to make it easier for users to reproduce our officially published. See [TensorRT LLM Benchmarking](https://nvidia.github.io/TensorRT-LLM/performance/perf-benchmarking.html) for details.
 - `--dataset`: Prompt dataset used to benchmark. Our official benchmark dataset has ISL = 1K, OSL = 2K
 - `--num_requests`: Num requests used for the benchmark.
 - `--concurrency`: Total concurrency for the system.
@@ -186,7 +186,7 @@ Average request latency (ms): 7456.1219
 
 Due to our evaluation found that FP8 KV cache does not introduce obvious accuracy drop compared to BF16 KV cache. See [Precision strategy](./tech_blog/blog3_Optimizing_DeepSeek_R1_Throughput_on_NVIDIA_Blackwell_GPUs.md#precision-strategy), the latest [DeepSeek-R1-0528-FP4](https://huggingface.co/nvidia/DeepSeek-R1-0528-FP4) checkpoint had enabled FP8 KV cache by-default.
 
-We are seeing meaningful speedup using FP8 KV cache, thus refreshing the numbers here. The results are reproduced with TensorRT-LLM commit b6261862419c33d6ce2313aff1e7116067d6037d.
+We are seeing meaningful speedup using FP8 KV cache, thus refreshing the numbers here. The results are reproduced with TensorRT LLM commit b6261862419c33d6ce2313aff1e7116067d6037d.
 
 !! Note that the exact command to reproduce numbers can change as the API/options are refactored, the option and numbers here is a reference at given exact commit.
 
@@ -239,7 +239,7 @@ Per GPU Output Throughput (tps/gpu): 5393.2755
 ### B200 max-throughput for R1 with FP16 KV cache
 Our benchmark results are based on **Batch = 3072, ISL = 1K, OSL = 2K, num_requests = 49152 from synthetic dataset**.
 
-The results are reproduced with TensorRT-LLM commit b6261862419c33d6ce2313aff1e7116067d6037d.
+The results are reproduced with TensorRT LLM commit b6261862419c33d6ce2313aff1e7116067d6037d.
 
 !! Note that the exact command to reproduce numbers can change as the API/options are refactored, the option and numbers here is a reference at given exact commit.
 
@@ -401,7 +401,7 @@ Average request latency (ms): 181540.5739
 
 ## Exploring more ISL/OSL combinations
 
-To benchmark TensorRT-LLM on DeepSeek models with more ISL/OSL combinations, you can use `prepare_dataset.py` to generate the dataset and use similar commands mentioned in the previous section. TensorRT-LLM is working on enhancements that can make the benchmark process smoother.
+To benchmark TensorRT LLM on DeepSeek models with more ISL/OSL combinations, you can use `prepare_dataset.py` to generate the dataset and use similar commands mentioned in the previous section. TensorRT LLM is working on enhancements that can make the benchmark process smoother.
 ### WIP: Enable more features by default
 
 Currently, there are some features that need to be enabled through a user-defined file `extra-llm-api-config.yml`, such as CUDA graph, overlap scheduler and attention dp. We're working on to enable those features by default, so that users can get good out-of-the-box performance on DeepSeek models.

docs/source/blogs/Falcon180B-H200.md

Lines changed: 9 additions & 9 deletions
@@ -1,13 +1,13 @@
 # Falcon-180B on a single H200 GPU with INT4 AWQ, and 6.7x faster Llama-70B over A100
 
-H200's large capacity & high memory bandwidth, paired with TensorRT-LLM's
+H200's large capacity & high memory bandwidth, paired with TensorRT LLM's
 optimizations, maximizes inference performance.
 
 ## Falcon-180B on a single H200 with INT4 AWQ
 [Falcon-180B](https://huggingface.co/tiiuae/falcon-180B), one of the largest &
 most accurate open source models available, can run on a *single* H200 GPU.
 
-The 141GB of memory on H200, paired with TensorRT-LLM running INT4 AWQ with
+The 141GB of memory on H200, paired with TensorRT LLM running INT4 AWQ with
 FP8, allows for the entire large language model to fit on a single GPU, where
 previously eight A100s were required. H200 Falcon-180B provides up to **800**
 tok/s and retains high accuracy.
@@ -30,7 +30,7 @@ BS: (in order) 256, 128 </sup>
 
 **Model Accuracy:**
 Often quantization can have adverse impacts on the accuracy of the model,
-however, TensorRT-LLM's AWQ decreases memory footprint of the model by **4x**
+however, TensorRT LLM's AWQ decreases memory footprint of the model by **4x**
 while maintaining high accuracy.
 
 <img src="https://github.com/NVIDIA/TensorRT-LLM/blob/rel/docs/source/blogs/media/Falcon180B-H200_acc.png?raw=true" alt="Falcon-180B accuracy comparison" width="600" height="auto">
@@ -52,18 +52,18 @@ retain higher accuracy than other 4bit methods and reduce memory usage, but
 requires special kernels capable of handling the change in precision
 performantly.
 
-TensorRT-LLM has implemented custom kernels for AWQ, and taken the technique a
+TensorRT LLM has implemented custom kernels for AWQ, and taken the technique a
 step further by performing FP8 computation on Hopper GPUs instead of the
 standard FP16.
 
-Similar examples running Falcon-180B with quantization in TensorRT-LLM are
+Similar examples running Falcon-180B with quantization in TensorRT LLM are
 available in [examples/models/contrib/falcon](/examples/models/contrib/falcon).
 
 ## Llama-70B on H200 up to 6.7x A100
 
-TensorRT-LLM has improved its Group Query Attention (GQA) kernels, in the
+TensorRT LLM has improved its Group Query Attention (GQA) kernels, in the
 generation phase, providing up to 2.4x improvement on Llama-70B over
-TensorRT-LLM v0.5, achieving over **3,800** tok/s/gpu at up to **6.7x** faster
+TensorRT LLM v0.5, achieving over **3,800** tok/s/gpu at up to **6.7x** faster
 than A100.
 
 **H200 6.7x A100**
@@ -106,7 +106,7 @@ BS 192 </sup>
 [**Grouped Query Attention (GQA)**](https://arxiv.org/abs/2305.13245v2)
 (Ainslie et al., 2023), used in Llama-70B, is a variant of Multihead Attention
 (MHA) which groups key-value (KV) heads together, resulting in fewer KV heads
-than query (Q) heads. TensorRT-LLM has a custom implementation of MHA which
+than query (Q) heads. TensorRT LLM has a custom implementation of MHA which
 supports GQA, multi-query attention (MQA) and standard MHA. It leverages Tensor
 Cores, including in the generation phase, and delivers great performance on
 NVIDIA GPUs.
@@ -116,7 +116,7 @@ NVIDIA GPUs.
 These improvements will be published in the `main` branch soon, and will be
 included in the v0.7 & v0.8 releases.
 
-Similar examples running Llama-70B in TensorRT-LLM are published in
+Similar examples running Llama-70B in TensorRT LLM are published in
 [examples/models/core/llama](/examples/models/core/llama).
 
 For more information about H200, please see the [H200 announcement blog](./H200launch.md).
