
Commit f53fb4c

[TRTLLM-5930][doc] 1.0 Documentation. (#6696)
Signed-off-by: nv-guomingz <[email protected]>
Signed-off-by: Wangshanshan <[email protected]>
1 parent 5c616da commit f53fb4c


53 files changed (+4269, -267 lines)

docs/source/advanced/lora.md

Lines changed: 2 additions & 2 deletions
@@ -133,9 +133,9 @@ Next, consider this linear layer is a `RowLinear` layer. When we partition the w

 #### DoRA

-TRTLLM supports DoRA as described in https://arxiv.org/abs/2402.09353 . To enable DoRA, you must add the additional `--dora_plugin enable` flag to the `trtllm-build` command.
+TensorRT-LLM supports DoRA as described in https://arxiv.org/abs/2402.09353. To enable DoRA, you must add the additional `--dora_plugin enable` flag to the `trtllm-build` command.

-The DoRA scales must be normalized before they are submitted to TRTLLM in an inference request. The normalization requires the base model weights. To normalize your adapter you may use the script provided in `tensorrt_llm/examples/dora/normalize_weights.py`.
+The DoRA scales must be normalized before they are submitted to TensorRT-LLM in an inference request. The normalization requires the base model weights. To normalize your adapter, you may use the script provided in `tensorrt_llm/examples/dora/normalize_weights.py`.

 When using DoRA, the format of `LoraWeights` and `LoraConfig` changes slightly.
 The shape of `LoraConfig` becomes `[module_id, layer_idx, adapter_size D (i.e. R value), is_dora]`, with `is_dora` a boolean flag that determines whether the supplied adapter contains DoRA scales or not. If the old config shape is used, it is assumed the adapter does not have DoRA scales.
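For illustration only, a minimal sketch of how the DoRA-aware `LoraConfig` layout described above might be populated as a tensor. The module and layer IDs, the rank, and the `torch.int32` dtype are assumptions made for this example, not values taken from the TensorRT-LLM API:

```python
import torch

# Hypothetical example: one config row per LoRA module, following the
# documented layout [module_id, layer_idx, adapter_size (R), is_dora].
# The IDs and rank below are placeholders, not real TensorRT-LLM values.
lora_config = torch.tensor(
    [
        [0, 0, 8, 1],  # module 0, layer 0, rank 8, DoRA scales included
        [0, 1, 8, 1],  # module 0, layer 1, rank 8, DoRA scales included
    ],
    dtype=torch.int32,
)

# Legacy three-column layout (no is_dora flag): treated as a non-DoRA adapter.
legacy_config = torch.tensor([[0, 0, 8], [0, 1, 8]], dtype=torch.int32)
```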

docs/source/advanced/speculative-decoding.md

Lines changed: 1 addition & 1 deletion
@@ -173,7 +173,7 @@ Similarly to ReDrafter, TensorRT-LLM implements the EAGLE model such that logits

 ### Disaggregated Serving

-[Disaggregated Serving](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/advanced/disaggregated-service.md) with EAGLE-3 using the two-model approach is supported in the PyTorch backend.
+[Disaggregated Serving](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/features/disaggregated-service.md) with EAGLE-3 using the two-model approach is supported in the PyTorch backend. Please refer to the following [Dynamo example](https://github.com/ai-dynamo/dynamo/blob/main/examples/tensorrt_llm/llama4_plus_eagle.md) on how to run EAGLE-3 with Disaggregated Serving for Llama 4 Maverick.

 ## Lookahead Decoding

Lines changed: 67 additions & 10 deletions
@@ -1,18 +1,75 @@
-(architecture-overview)=
+# Architecture Overview

-# TensorRT-LLM Architecture
+The `LLM` class is a core entry point for TensorRT-LLM, providing a simplified `generate()` API for efficient large language model inference. This abstraction aims to streamline the user experience, as demonstrated with TinyLlama:

-TensorRT-LLM is a toolkit to assemble optimized solutions to perform Large Language Model (LLM) inference. It offers a Model Definition API to define models and compile efficient [TensorRT](https://developer.nvidia.com/tensorrt) engines for NVIDIA GPUs. It also contains Python and C++ components to build runtimes to execute those engines as well as backends for the [Triton Inference
-Server](https://developer.nvidia.com/nvidia-triton-inference-server) to easily create web-based services for LLMs. TensorRT-LLM supports multi-GPU and multi-node configurations (through MPI).
+```python
+from tensorrt_llm import LLM

-As a user, the very first step to create an inference solution is to either define your own model or select a pre-defined network architecture (refer to {ref}`models` for the list of models supported by TensorRT-LLM). Once defined, that model must be trained using a training framework (training is outside of the scope of TensorRT-LLM). For pre-defined models, checkpoints can be downloaded from various providers. To illustrate that point, a lot of examples in TensorRT-LLM use model weights obtained from the [Hugging Face](https://huggingface.co) hub and trained using [NVIDIA Nemo](https://developer.nvidia.com/nemo) or [PyTorch](https://pytorch.org).
+# Initialize the LLM with a specified model
+llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

-Equipped with the model definition and the weights, a user must use TensorRT-LLM's Model Definition API to recreate the model in a way that can be compiled by TensorRT into an efficient engine. For ease of use, TensorRT-LLM already supports a handful of standard models.
+# Generate text using the model
+output = llm.generate("Hello, my name is")
+```

-Together with the Model Definition API to describe models, TensorRT-LLM provides users with components to create a runtime that executes the efficient TensorRT engine. Runtime components offer beam-search, along with extensive sampling functionalities such as top-K and top-P sampling. The exhaustive list can be found in the documentation of the {ref}`gpt-runtime`. The C++ runtime is the recommended runtime.
+The `LLM` class automatically manages essential pre- and post-processing steps, including tokenization (encoding input prompts into numerical representations) and detokenization (decoding model outputs back into human-readable text).
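As a follow-up to the snippet above, here is a minimal sketch of reading the generated text back. The attribute names (`output.prompt`, `output.outputs[0].text`) follow the pattern used in the LLM API quick-start examples; treat them as an assumption rather than a definitive reference:

```python
from tensorrt_llm import LLM

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

# generate() also accepts a list of prompts for batched inference.
outputs = llm.generate(["Hello, my name is", "The capital of France is"])

for output in outputs:
    # Attribute names are assumed to match the LLM API quick-start examples.
    print(f"Prompt: {output.prompt!r}")
    print(f"Generated text: {output.outputs[0].text!r}")
```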

-TensorRT-LLM also includes Python and C++ backends for NVIDIA Triton Inference Server to assemble solutions for LLM online serving. The C++ backend implements in-flight batching as explained in the {ref}`executor` documentation and is the recommended backend.
+Internally, the `LLM` class orchestrates the creation of a dedicated `PyExecutor(Worker)` process on each rank.

-## Model Weights
+![TRT-LLM Architecture Overview](../media/TRTLLM_Architecture_Overview.png)

-TensorRT-LLM is a library for LLM inference, and so to use it, you need to supply a set of trained weights. You can either use your own model weights trained in a framework like [NVIDIA NeMo](https://www.nvidia.com/en-us/ai-data-science/generative-ai/nemo-framework/) or pull a set of pretrained weights from repositories like the Hugging Face Hub.
+This `PyExecutor` operates in a continuous background loop, designed for the efficient, asynchronous processing of inference requests.
+
+The `PyExecutor`'s functionality is built upon several key components:
+
+- `Scheduler`: Responsible for determining which active requests are ready for execution at each processing step.
+
+- `KVCacheManager`: Manages the allocation, deallocation, and maintenance of the Key-Value (KV) Cache. This is a critical optimization for Transformer models, significantly enhancing performance during autoregressive text generation by storing previously computed attention keys and values.
+
+- `ModelEngine`: Handles the loading and highly efficient execution of the language model on the GPU hardware.
+
+- `Sampler`: Takes the raw outputs (logits) from the `ModelEngine` and applies appropriate sampling strategies (e.g., greedy, top-k, top-p, beam search) to generate the final output tokens.
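To make the `Sampler`'s role concrete, here is a standalone, illustrative top-k / top-p sampling routine. This is a conceptual sketch only, not TensorRT-LLM's actual `Sampler` implementation; the function name and default values are invented for illustration:

```python
import torch


def sample_next_token(
    logits: torch.Tensor,
    top_k: int = 50,
    top_p: float = 0.9,
    temperature: float = 1.0,
) -> int:
    """Illustrative top-k / top-p sampling over a 1-D logits vector.

    Conceptual sketch of what a sampler does with logits; not TensorRT-LLM code.
    """
    logits = logits / temperature

    # Top-k: mask out everything below the k-th largest logit.
    if top_k > 0:
        kth_value = torch.topk(logits, min(top_k, logits.numel())).values[-1]
        logits = torch.where(logits < kth_value, torch.full_like(logits, float("-inf")), logits)

    # Top-p (nucleus): keep the smallest prefix of tokens whose cumulative probability >= top_p.
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    mask = cumulative > top_p
    mask[1:] = mask[:-1].clone()  # shift so the first token crossing the threshold is kept
    mask[0] = False               # always keep the most likely token
    sorted_probs[mask] = 0.0
    sorted_probs = sorted_probs / sorted_probs.sum()

    # Draw one token id from the filtered distribution.
    choice = torch.multinomial(sorted_probs, num_samples=1)
    return int(sorted_idx[choice])


# Example: sample from a toy 8-token vocabulary.
next_token = sample_next_token(torch.randn(8), top_k=5, top_p=0.9)
```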

+During each iteration of its background loop, the `PyExecutor` performs the following sequence of operations:
+
+- Request Fetching: Retrieves new inference requests from an internal request queue, if available.
+
+- Scheduling: Interacts with the `Scheduler` to identify and prioritize requests that are ready to be processed in the current step.
+
+- Resource Preparation: Coordinates with the `KVCacheManager` to ensure that the necessary Key-Value (KV) Cache resources are allocated for the selected requests.
+
+- Model Execution: Invokes the `ModelEngine` to perform a forward pass on the scheduled requests, predicting the next output tokens.
+
+- Output Handling: Updates the partial outputs for ongoing requests and finalizes the results for any requests that have reached completion, returning them to the user.
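The iteration described above can be summarized as a simplified loop. The sketch below is purely conceptual: the class and method names (`drain`, `schedule`, `prepare`, `forward`, `append_token`, and so on) are illustrative placeholders, not the actual `PyExecutor` interfaces:

```python
# Conceptual sketch of one PyExecutor iteration; all names are illustrative placeholders.
class SimplifiedExecutorLoop:
    def __init__(self, scheduler, kv_cache_manager, model_engine, sampler, request_queue):
        self.scheduler = scheduler
        self.kv_cache_manager = kv_cache_manager
        self.model_engine = model_engine
        self.sampler = sampler
        self.request_queue = request_queue
        self.active_requests = []

    def step(self):
        # 1. Request fetching: pull newly submitted requests, if any.
        self.active_requests.extend(self.request_queue.drain())

        # 2. Scheduling: pick the requests that will run in this iteration.
        scheduled = self.scheduler.schedule(self.active_requests)

        # 3. Resource preparation: make sure KV cache blocks are allocated.
        self.kv_cache_manager.prepare(scheduled)

        # 4. Model execution: one forward pass yields logits for each scheduled request.
        logits = self.model_engine.forward(scheduled)

        # 5. Output handling: sample next tokens, then update or finalize requests.
        for request, token in zip(scheduled, self.sampler.sample(logits)):
            request.append_token(token)
            if request.is_finished():
                request.respond()
                self.active_requests.remove(request)
```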

+## Runtime Optimizations
+
+TensorRT-LLM enhances inference throughput and reduces latency by integrating a suite of runtime optimizations, including CUDA Graphs, the [Overlap Scheduler](../features/overlap-scheduler.md), and [speculative decoding](../features/speculative-decoding.md), among others.
+
+### CUDA Graph
+
+CUDA Graphs drastically reduce the CPU-side overhead associated with launching GPU kernels, which is particularly impactful in PyTorch-based inference where Python's host-side code can be a bottleneck. By capturing a sequence of CUDA operations as a single graph, the entire sequence can be launched with one API call, minimizing CPU-GPU synchronization and driver overhead.
+
+To maximize the "hit rate" of these cached graphs, TensorRT-LLM employs CUDA Graph padding. If an incoming batch's size doesn't match a captured graph, it is padded to the nearest larger, supported size for which a graph exists. While this incurs minor overhead from computing "wasted" tokens, it is often a better trade-off than falling back to slower eager-mode execution. This optimization has a significant impact, demonstrating up to a 22% end-to-end throughput increase on certain models and hardware.
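The padding policy described above amounts to rounding a batch up to the nearest captured size. A minimal sketch, assuming a hypothetical list of captured batch sizes (not TensorRT-LLM's actual configuration or API):

```python
import bisect
from typing import Optional

# Hypothetical set of batch sizes for which CUDA graphs were captured.
CAPTURED_BATCH_SIZES = [1, 2, 4, 8, 16, 32, 64, 128]


def select_graph_batch_size(actual_batch_size: int) -> Optional[int]:
    """Return the nearest captured batch size >= the actual batch size.

    None means no suitable graph exists, so execution would fall back to
    eager mode. Purely illustrative of the padding policy described above.
    """
    idx = bisect.bisect_left(CAPTURED_BATCH_SIZES, actual_batch_size)
    return CAPTURED_BATCH_SIZES[idx] if idx < len(CAPTURED_BATCH_SIZES) else None


# A batch of 20 requests is padded to 32 and replayed with the graph captured
# for batch size 32; the 12 padded slots are the "wasted" tokens mentioned above.
assert select_graph_batch_size(20) == 32
assert select_graph_batch_size(1) == 1
assert select_graph_batch_size(200) is None  # larger than any captured graph: eager fallback
```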

+### Overlap Scheduler
+
+The Overlap Scheduler maximizes GPU utilization by hiding CPU-bound latency behind GPU computation.
+
+The key strategy is to launch the GPU's work for the next step (n+1) immediately, without waiting for the CPU to finish processing the results of the current step (n). This allows the CPU to handle tasks like checking stop criteria or updating responses for one batch while the GPU is already executing the model for the subsequent batch.
+
+This concurrent execution pipeline is illustrated in the `PyExecutor`'s logic:
+
+```python
+# Schedule and launch GPU work for the current step (n)
+scheduled_batch, _, _ = self._schedule()
+batch_outputs = self._forward_step(scheduled_batch, previous_tensors_device)
+sample_state = self._sample_async(scheduled_batch, batch_outputs)
+
+# While the GPU is busy, process the CPU-bound results from the previous step (n-1)
+if self.previous_batch is not None:
+    self._process_previous_batch()
+```
+
+This approach effectively reduces GPU idle time and improves overall hardware occupancy. While it introduces one extra decoding step into the pipeline, the resulting throughput gain makes this a worthwhile trade-off. For this reason, the Overlap Scheduler is enabled by default in TensorRT-LLM.

docs/source/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.md

Lines changed: 1 addition & 1 deletion
@@ -172,7 +172,7 @@ To minimize KV cache transmission latency, TensorRT-LLM currently uses direct tr
 </div>
 <p align="center"><sub><em>Figure 8. KV cache layout conversion</em></sub></p>

-The optimizations required for KV cache transmission vary depending on whether it's single-node multi-GPU, multi-node multi-GPU, or different GPU models. To accommodate this, TensorRT-LLM provides a set of environment variables for selection in different environments. Please refer to [this document](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/advanced/disaggregated-service.md) for details.
+The optimizations required for KV cache transmission vary depending on whether it's single-node multi-GPU, multi-node multi-GPU, or different GPU models. To accommodate this, TensorRT-LLM provides a set of environment variables for selection in different environments. Please refer to [this document](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/features/disagg-serving.md) for details.

 ## Performance Studies
Lines changed: 89 additions & 0 deletions
@@ -0,0 +1,89 @@
+trtllm-eval
+===========
+
+About
+-----
+
+The ``trtllm-eval`` command provides developers with a unified entry point for accuracy evaluation. It shares the core evaluation logic with the `accuracy test suite <https://github.com/NVIDIA/TensorRT-LLM/tree/main/tests/integration/defs/accuracy>`_ of TensorRT-LLM.
+
+``trtllm-eval`` is built on the offline API -- LLM API. Compared to the online ``trtllm-serve``, the offline API provides clearer error messages and simplifies the debugging workflow.
+
+The following tasks are currently supported:
+
+.. list-table::
+   :header-rows: 1
+   :widths: 20 25 15 15 15
+
+   * - Dataset
+     - Task
+     - Metric
+     - Default ISL
+     - Default OSL
+   * - CNN Dailymail
+     - summarization
+     - rouge
+     - 924
+     - 100
+   * - MMLU
+     - QA; multiple choice
+     - accuracy
+     - 4,094
+     - 2
+   * - GSM8K
+     - QA; regex matching
+     - accuracy
+     - 4,096
+     - 256
+   * - GPQA
+     - QA; multiple choice
+     - accuracy
+     - 32,768
+     - 4,096
+   * - JSON mode eval
+     - structured generation
+     - accuracy
+     - 1,024
+     - 512
+
+.. note::
+
+   ``trtllm-eval`` originates from the TensorRT-LLM accuracy test suite and serves as a lightweight utility for verifying and debugging accuracy. At this time, ``trtllm-eval`` is intended solely for development and is not recommended for production use.
+
+Usage and Examples
+------------------
+
+Some evaluation tasks (e.g., GSM8K and GPQA) depend on the ``lm_eval`` package. To run these tasks, you need to install ``lm_eval`` with:
+
+.. code-block:: bash
+
+   pip install -r requirements-dev.txt
+
+Alternatively, you can install the ``lm_eval`` version specified in ``requirements-dev.txt``.
+
+Here are some examples:
+
+.. code-block:: bash
+
+   # Evaluate Llama-3.1-8B-Instruct on MMLU
+   trtllm-eval --model meta-llama/Llama-3.1-8B-Instruct mmlu
+
+   # Evaluate Llama-3.1-8B-Instruct on GSM8K
+   trtllm-eval --model meta-llama/Llama-3.1-8B-Instruct gsm8k
+
+   # Evaluate Llama-3.3-70B-Instruct on GPQA Diamond
+   trtllm-eval --model meta-llama/Llama-3.3-70B-Instruct gpqa_diamond
+
+The ``--model`` argument accepts either a Hugging Face model ID or a local checkpoint path. By default, ``trtllm-eval`` runs the model with the PyTorch backend; you can pass ``--backend tensorrt`` to switch to the TensorRT backend.
+
+Alternatively, the ``--model`` argument also accepts a local path to pre-built TensorRT engines. In this case, you should pass the Hugging Face tokenizer path to the ``--tokenizer`` argument.
+
+For more details, see ``trtllm-eval --help`` and ``trtllm-eval <task> --help``.
+
+
+Syntax
+------
+
+.. click:: tensorrt_llm.commands.eval:main
+   :prog: trtllm-eval
+   :nested: full

docs/source/conf.py

Lines changed: 1 addition & 1 deletion
@@ -106,7 +106,7 @@
 [GitHub pre-release or release](https://github.com/NVIDIA/TensorRT-LLM/releases)
 (see also [NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags)).
 ```
-""",
+"""
 }

 autosummary_generate = True
Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+Model Recipes
+================
+
+.. toctree::
+   :maxdepth: 1
+   :caption: Model Recipes
+   :name: Model Recipes
+
+   quick-start-recipe-for-deepseek-r1-on-trtllm.md
+   quick-start-recipe-for-llama3.3-70b-on-trtllm.md
+   quick-start-recipe-for-llama4-scout-on-trtllm.md

docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md

Lines changed: 1 addition & 1 deletion
@@ -275,7 +275,7 @@ To run the evaluation harness exec into the running TensorRT-LLM container and i
 ```shell
 docker exec -it tensorrt_llm /bin/bash

-pip install lm_eval
+pip install -U lm-eval
 ```

 FP8 command for GSM8K:
