
Commit cabda24

[TRTLLM-5930][doc] 1.0 Documentation. (#6696)
Signed-off-by: nv-guomingz <[email protected]>
1 parent 4612906 commit cabda24


58 files changed, +4609 −234 lines

docs/source/advanced/lora.md

Lines changed: 2 additions & 2 deletions
@@ -133,9 +133,9 @@ Next, consider this linear layer is a `RowLinear` layer. When we partition the w
 #### DoRA

-TRTLLM supports DoRA as described in https://arxiv.org/abs/2402.09353 . To enable DoRA, you must add the additional `--dora_plugin enable` flag to the `trtllm-build` command.
+TensorRT-LLM supports DoRA as described in https://arxiv.org/abs/2402.09353 . To enable DoRA, you must add the additional `--dora_plugin enable` flag to the `trtllm-build` command.

-The DoRA scales must be normalized before they are submitted to TRTLLM in an inference request. The normalization requires the base model weights. To normalize your adapter you may use the script provided in `tensorrt_llm/examples/dora/normalize_weights.py`.
+The DoRA scales must be normalized before they are submitted to TensorRT-LLM in an inference request. The normalization requires the base model weights. To normalize your adapter you may use the script provided in `tensorrt_llm/examples/dora/normalize_weights.py`.

 When using DoRA, the format of `LoraWeights` and `LoraConfig` changes slightly.
 The shape of `LoraConfig` becomes `[module_id, layer_idx, adapter_size D (i.e. R value), is_dora]`, with `is_dora` a boolean flag that determines whether the supplied adapter contains DoRA scales or not. If the old config shape is used, it is assumed the adapter does not have DoRA scales.
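For intuition, here is a rough sketch of the kind of normalization that second paragraph refers to, assuming the common DoRA convention in which a per-output-channel magnitude vector is divided by the column norm of the merged weight. The function name and the exact convention are assumptions for illustration only; the `normalize_weights.py` script referenced above is the authoritative implementation.

```python
import torch

def normalize_dora_magnitude(
    base_weight: torch.Tensor,  # [out_features, in_features], frozen base weight
    lora_a: torch.Tensor,       # [rank, in_features]
    lora_b: torch.Tensor,       # [out_features, rank]
    magnitude: torch.Tensor,    # [out_features], learned DoRA magnitude vector
    scaling: float = 1.0,       # typically lora_alpha / rank
) -> torch.Tensor:
    """Illustrative only: fold the base-weight norm into the DoRA scales."""
    # Merge the low-rank update into the frozen base weight.
    merged = base_weight + scaling * (lora_b @ lora_a)
    # Per-output-channel L2 norm of the merged weight.
    weight_norm = torch.linalg.norm(merged, dim=1)
    # The returned scales no longer require the base weights at inference time.
    return magnitude / weight_norm
```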

docs/source/advanced/speculative-decoding.md

Lines changed: 1 addition & 1 deletion
@@ -172,7 +172,7 @@ Similarly to ReDrafter, TensorRT-LLM implements the EAGLE model such that logits
 ### Disaggregated Serving

-[Disaggregated Serving](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/advanced/disaggregated-service.md) with EAGLE3 using the two model approach is supported in the Pytorch backend. Please refer to the following [Dynamo example](https://github.com/ai-dynamo/dynamo/blob/main/examples/tensorrt_llm/llama4_plus_eagle.md) on how to run EAGLE3 with Disaggregated Serving for Llama 4 Maverick.
+[Disaggregated Serving](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/features/disaggregated-service.md) with EAGLE3 using the two model approach is supported in the Pytorch backend. Please refer to the following [Dynamo example](https://github.com/ai-dynamo/dynamo/blob/main/examples/tensorrt_llm/llama4_plus_eagle.md) on how to run EAGLE3 with Disaggregated Serving for Llama 4 Maverick.

 ## Lookahead Decoding

Lines changed: 67 additions & 10 deletions
@@ -1,18 +1,75 @@
-(architecture-overview)=
+# Architecture Overview

-# TensorRT-LLM Architecture
+The `LLM` class is a core entry point for TensorRT-LLM, providing a simplified `generate()` API for efficient large language model inference. This abstraction aims to streamline the user experience, as demonstrated with TinyLlama:

-TensorRT-LLM is a toolkit to assemble optimized solutions to perform Large Language Model (LLM) inference. It offers a Model Definition API to define models and compile efficient [TensorRT](https://developer.nvidia.com/tensorrt) engines for NVIDIA GPUs. It also contains Python and C++ components to build runtimes to execute those engines as well as backends for the [Triton Inference
-Server](https://developer.nvidia.com/nvidia-triton-inference-server) to easily create web-based services for LLMs. TensorRT-LLM supports multi-GPU and multi-node configurations (through MPI).
+```python
+from tensorrt_llm import LLM

-As a user, the very first step to create an inference solution is to either define your own model or select a pre-defined network architecture (refer to {ref}`models` for the list of models supported by TensorRT-LLM). Once defined, that model must be trained using a training framework (training is outside of the scope of TensorRT-LLM). For pre-defined models, checkpoints can be downloaded from various providers. To illustrate that point, a lot of examples in TensorRT-LLM use model weights obtained from the [Hugging Face](https://huggingface.co) hub and trained using [NVIDIA Nemo](https://developer.nvidia.com/nemo) or [PyTorch](https://pytorch.org).
+# Initialize the LLM with a specified model
+llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

-Equipped with the model definition and the weights, a user must use TensorRT-LLM's Model Definition API to recreate the model in a way that can be compiled by TensorRT into an efficient engine. For ease of use, TensorRT-LLM already supports a handful of standard models.
+# Generate text using the model
+output = llm.generate("Hello, my name is")
+```

-Together with the Model Definition API to describe models, TensorRT-LLM provides users with components to create a runtime that executes the efficient TensorRT engine. Runtime components offer beam-search, along with extensive sampling functionalities such as top-K and top-P sampling. The exhaustive list can be found in the documentation of the {ref}`gpt-runtime`. The C++ runtime is the recommended runtime.
+The `LLM` class automatically manages essential pre- and post-processing steps, including tokenization (encoding input prompts into numerical representations) and detokenization (decoding model outputs back into human-readable text).

-TensorRT-LLM also includes Python and C++ backends for NVIDIA Triton Inference Server to assemble solutions for LLM online serving. The C++ backend implements in-flight batching as explained in the {ref}`executor` documentation and is the recommended backend.
+Internally, the `LLM` class orchestrates the creation of a dedicated `PyExecutor` worker process on each rank.

-## Model Weights
+![TRT-LLM Architecture Overview](../media/TRTLLM_Architecture_Overview.png)

-TensorRT-LLM is a library for LLM inference, and so to use it, you need to supply a set of trained weights. You can either use your own model weights trained in a framework like [NVIDIA NeMo](https://www.nvidia.com/en-us/ai-data-science/generative-ai/nemo-framework/) or pull a set of pretrained weights from repositories like the Hugging Face Hub.
+This `PyExecutor` operates in a continuous background loop, designed for the efficient, asynchronous processing of inference requests.
+
+The `PyExecutor`'s functionality is built upon several key components:
+
+- `Scheduler`: Determines which active requests are ready for execution at each processing step.
+
+- `KVCacheManager`: Manages the allocation, deallocation, and maintenance of the Key-Value (KV) Cache. This is a critical optimization for Transformer models, significantly enhancing performance during autoregressive text generation by storing previously computed attention keys and values.
+
+- `ModelEngine`: Handles the loading and highly efficient execution of the language model on the GPU hardware.
+
+- `Sampler`: Takes the raw outputs (logits) from the `ModelEngine` and applies the appropriate sampling strategy (e.g., greedy, top-k, top-p, beam search) to generate the final output tokens.
+
+During each iteration of its background loop, the `PyExecutor` performs the following sequence of operations (sketched in code after the list):
+
+- Request Fetching: Retrieves new inference requests from an internal request queue, if available.
+
+- Scheduling: Interacts with the `Scheduler` to identify and prioritize the requests that are ready to be processed in the current step.
+
+- Resource Preparation: Coordinates with the `KVCacheManager` to ensure that the necessary KV Cache resources are allocated for the selected requests.
+
+- Model Execution: Invokes the `ModelEngine` to perform a forward pass on the scheduled requests, predicting the next output tokens.
+
+- Output Handling: Updates the partial outputs for ongoing requests and finalizes the results for any requests that have reached completion, returning them to the user.
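Putting the pieces together, the loop can be pictured roughly as follows. This is a minimal sketch for exposition only; the object and method names (`request_queue.drain()`, `scheduler.schedule()`, and so on) are assumptions and do not mirror the actual `PyExecutor` implementation.

```python
# Illustrative sketch of a PyExecutor-style background loop (hypothetical API).
def executor_loop(request_queue, scheduler, kv_cache_manager, model_engine, sampler):
    active_requests = []
    while True:
        # 1. Request fetching: pull any newly submitted requests.
        active_requests.extend(request_queue.drain())

        # 2. Scheduling: decide which requests run in this step.
        scheduled_batch = scheduler.schedule(active_requests)
        if not scheduled_batch:
            continue

        # 3. Resource preparation: make sure KV cache blocks are allocated.
        kv_cache_manager.prepare(scheduled_batch)

        # 4. Model execution: a single forward pass produces logits for the batch.
        logits = model_engine.forward(scheduled_batch)

        # 5. Output handling: sample next tokens, update partial outputs, and
        #    return any requests that have reached completion.
        for request, token in zip(scheduled_batch, sampler.sample(logits)):
            request.append_token(token)
            if request.is_finished():
                request.respond()
                active_requests.remove(request)
```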
+## Runtime Optimizations
+
+TensorRT-LLM enhances inference throughput and reduces latency through a suite of runtime optimizations, including CUDA Graphs, the [Overlap Scheduler](../features/overlap-scheduler.md), and [speculative decoding](../features/speculative-decoding.md).
+
+### CUDA Graph
+
+CUDA Graphs drastically reduce the CPU-side overhead associated with launching GPU kernels, which is particularly impactful in PyTorch-based inference where Python's host-side code can be a bottleneck. By capturing a sequence of CUDA operations as a single graph, the entire sequence can be launched with one API call, minimizing CPU-GPU synchronization and driver overhead.
+
+To maximize the "hit rate" of these cached graphs, TensorRT-LLM employs CUDA Graph padding. If an incoming batch's size doesn't match a captured graph, it is padded to the nearest larger supported size for which a graph exists. While this incurs minor overhead from computing "wasted" tokens, it is often a better trade-off than falling back to slower eager-mode execution. This optimization has a significant impact, demonstrating up to a 22% end-to-end throughput increase on certain models and hardware.
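As a toy illustration of the padding decision, assume graphs were captured for a fixed list of batch sizes; the sizes and helper function below are hypothetical, not the actual TensorRT-LLM configuration.

```python
# Hypothetical capture list; real deployments configure their own sizes.
CAPTURED_BATCH_SIZES = [1, 2, 4, 8, 16, 32, 64, 128]

def select_graph_batch_size(batch_size: int):
    """Return the smallest captured size that fits the batch, or None for eager mode."""
    for captured in CAPTURED_BATCH_SIZES:
        if batch_size <= captured:
            return captured  # pad the batch up to this size and replay the graph
    return None  # no captured graph is large enough; fall back to eager execution

print(select_graph_batch_size(13))   # -> 16 (3 padded "wasted" sequences)
print(select_graph_batch_size(200))  # -> None (eager fallback)
```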
+### Overlap Scheduler
+
+The Overlap Scheduler maximizes GPU utilization by hiding CPU-bound latency behind GPU computation.
+
+The key strategy is to launch the GPU's work for the next step (n+1) immediately, without waiting for the CPU to finish processing the results of the current step (n). This allows the CPU to handle tasks like checking stop criteria or updating responses for one batch while the GPU is already executing the model for the subsequent batch.
+
+This concurrent execution pipeline is illustrated in the `PyExecutor`'s logic:
+
+```python
+# Schedule and launch GPU work for the current step (n)
+scheduled_batch, _, _ = self._schedule()
+batch_outputs = self._forward_step(scheduled_batch, previous_tensors_device)
+sample_state = self._sample_async(scheduled_batch, batch_outputs)
+
+# While the GPU is busy, process the CPU-bound results from the previous step (n-1)
+if self.previous_batch is not None:
+    self._process_previous_batch()
+```
+
+This approach effectively reduces GPU idle time and improves overall hardware occupancy. While it introduces one extra decoding step into the pipeline, the resulting throughput gain makes it a worthwhile trade-off. For this reason, the Overlap Scheduler is enabled by default in TensorRT-LLM.
