**`docs/source/advanced/lora.md`** (2 additions, 2 deletions)
@@ -133,9 +133,9 @@ Next, consider this linear layer is a `RowLinear` layer. When we partition the w
#### DoRA
- TensorRT-LLM supports DoRA as described in https://arxiv.org/abs/2402.09353 . To enable DoRA, you must add the additional `--dora_plugin enable` flag to the `trtllm-build` command.
+ TensorRTLLM supports DoRA as described in https://arxiv.org/abs/2402.09353 . To enable DoRA, you must add the additional `--dora_plugin enable` flag to the `trtllm-build` command.
- The DoRA scales must be normalized before they are submitted to TensorRT-LLM in an inference request. The normalization requires the base model weights. To normalize your adapter you may use the script provided in `tensorrt_llm/examples/dora/normalize_weights.py`.
+ The DoRA scales must be normalized before they are submitted to TensorRTLLM in an inference request. The normalization requires the base model weights. To normalize your adapter you may use the script provided in `tensorrt_llm/examples/dora/normalize_weights.py`.
When using DoRA, the format of `LoraWeights` and `LoraConfig` changes slightly.
The shape of `LoraConfig` becomes `[module_id, layer_idx, adapter_size D (i.e. R value), is_dora]`, with `is_dora` a boolean flag that determines whether the supplied adapter contains DoRA scales or not. If the old config shape is used, it is assumed the adapter does not have DoRA scales.
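
For illustration, a minimal sketch of a DoRA-aware `LoraConfig` tensor following the column layout described above; the module ids, layer index, rank, and dtype are hypothetical values, not taken from any real adapter.

```python
import torch

# One row per adapted module:
# [module_id, layer_idx, adapter_size D (the R value), is_dora]
lora_config = torch.tensor(
    [
        [0, 0, 8, 1],  # hypothetical module 0, layer 0, rank 8, with DoRA scales
        [1, 0, 8, 1],  # hypothetical module 1, layer 0, rank 8, with DoRA scales
    ],
    dtype=torch.int32,  # dtype is an assumption, for illustration only
)
# A legacy 3-column config (without the is_dora flag) is treated as a non-DoRA adapter.
```
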
- The linear weights in TensorRT-LLM checkpoint always follows (`out_feature`, `in_feature`) shape, whereas some quantized linear in TensorRT-LLM implemented by plugin may use (`in_feature`, `out_fature`) shape. The `trtllm-build` command adds a transpose operation to post-process it.
+ The linear weights in TensorRTLLM checkpoint always follows (`out_feature`, `in_feature`) shape, whereas some quantized linear in TensorRTLLM implemented by plugin may use (`in_feature`, `out_feature`) shape. The `trtllm-build` command adds a transpose operation to post-process it.
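
As a quick illustration of the shape convention above, the following sketch (dimensions made up) shows the transpose that `trtllm-build` effectively applies when a plugin expects the (`in_feature`, `out_feature`) layout:

```python
import torch

out_features, in_features = 4096, 11008  # hypothetical layer dimensions

# Checkpoint convention: (out_features, in_features)
w_checkpoint = torch.randn(out_features, in_features)

# Plugin convention for some quantized linears: (in_features, out_features);
# trtllm-build inserts the equivalent of this transpose as post-processing.
w_plugin = w_checkpoint.t().contiguous()
assert w_plugin.shape == (in_features, out_features)
```
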
<p align="center"><sub><em>Figure 7: Pareto frontier analysis showing throughput-latency trade-offs across different ADP Balance configurations</em></sub></p>
@@ -364,4 +364,4 @@ The Pareto frontier analysis provides critical insights for real-world deploymen
## Acknowledgement
- The ADP Balance strategy was a great team effort, covering system performance analysis and optimization. While we cannot thank every contributor individually, we are proud to acknowledge the dedicated team of engineers whose collective expertise has propelled TensorRT-LLM to new heights of performance. Through this collaborative effort, we have gained valuable insights into improving GPU utilization for large language model inference. We hope the techniques and experiences shared in this blog post will empower the developer community to better leverage the performance of NVIDIA GPUs in their mission-critical LLM inference applications.
+ The ADP Balance strategy was a great team effort, covering system performance analysis and optimization. While we cannot thank every contributor individually, we are proud to acknowledge the dedicated team of engineers whose collective expertise has propelled TensorRTLLM to new heights of performance. Through this collaborative effort, we have gained valuable insights into improving GPU utilization for large language model inference. We hope the techniques and experiences shared in this blog post will empower the developer community to better leverage the performance of NVIDIA GPUs in their mission-critical LLM inference applications.

**`docs/source/blogs/tech_blog/blog11_GPT_OSS_Eagle3.md`** (4 additions, 4 deletions)
@@ -1,4 +1,4 @@
- ## Running GPT-OSS-120B with Eagle3 Speculative Decoding on GB200/B200 (TensorRT-LLM)
+ ## Running GPT-OSS-120B with Eagle3 Speculative Decoding on GB200/B200 (TensorRTLLM)
This guide sets up a production endpoint that uses Eagle3 speculative decoding on NVIDIA GB200 or B200 GPUs only. It replaces the low‑latency flow from the previous guide and intentionally omits max‑throughput, Hopper, and benchmarking content.
@@ -17,7 +17,7 @@ Expected directory layout on the host (example):
└─ eagle/ # Eagle3 speculative decoding assets
```
- ### Get the TensorRT-LLM Container (1.1.0rc0)
+ ### Get the TensorRTLLM Container (1.1.0rc0)
If required by your environment, log into NGC and pull the image:
Run the container and bind-mount your models directory to `/config/models` inside the container:
@@ -122,7 +122,7 @@ When `Status: 200` is returned, the endpoint is ready to serve requests.
### Sample Chat Completions Request
- Note: This Eagle3 + TensorRT-LLM endpoint currently supports only greedy sampling. The following Chat Completions parameters are ignored (no-ops): `temperature`, `top_p`, `top_k`, and `seed`.
+ Note: This Eagle3 + TensorRTLLM endpoint currently supports only greedy sampling. The following Chat Completions parameters are ignored (no-ops): `temperature`, `top_p`, `top_k`, and `seed`.
Send a simple OpenAI-compatible Chat Completions request to the running server:
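
For reference, a minimal client-side sketch of such a request using `requests`; the port, model name, and prompt are assumptions and should match your `trtllm-serve` invocation.

```python
import requests

url = "http://localhost:8000/v1/chat/completions"  # assumed host and port
payload = {
    "model": "openai/gpt-oss-120b",  # assumption: the model name used at serve time
    "messages": [{"role": "user", "content": "Give me one fun fact about GPUs."}],
    "max_tokens": 128,
    # temperature, top_p, top_k, and seed are ignored by this endpoint (greedy only).
}

response = requests.post(url, json=payload, timeout=120)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```
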

**`docs/source/deployment-guide/quick-start-recipe-for-llama4-scout-on-trtllm.md`** (5 additions, 5 deletions)
@@ -49,11 +49,11 @@ Note:
* The command also maps port `8000` from the container to your host so you can access the LLM API endpoint from your host
* See the <https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags> for all the available containers. The containers published in the main branch weekly have `rcN` suffix, while the monthly release with QA tests has no `rcN` suffix. Use the `rc` release to get the latest model and feature support.
- If you want to use latest main branch, you can choose to build from source to install TensorRT-LLM, the steps refer to <https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html>.
+ If you want to use latest main branch, you can choose to build from source to install TensorRTLLM, the steps refer to [https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html](https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html)
### Creating the TRT-LLM Server config
- We create a YAML configuration file `/tmp/config.yml` for the TensorRT-LLM Server and populate it with the following recommended performance settings.
+ We create a YAML configuration file /tmp/config.yml for the TensorRTLLM Server and populate it with the following recommended performance settings.
```shell
EXTRA_LLM_API_FILE=/tmp/config.yml
@@ -108,7 +108,7 @@ These options are used directly on the command line when you start the `trtllm-s
#### `--backend pytorch`
- ***Description:** Tells TensorRT-LLM to use the **pytorch** backend.
+ **Description:** Tells TensorRTLLM to use the **pytorch** backend.
#### `--max_batch_size`
@@ -124,7 +124,7 @@ These options are used directly on the command line when you start the `trtllm-s
#### `--trust_remote_code`
- ***Description:** Allows TensorRT-LLM to download models and tokenizers from Hugging Face. This flag is passed directly to the Hugging Face API.
+ **Description:** Allows TensorRTLLM to download models and tokenizers from Hugging Face. This flag is passed directly to the Hugging Face API.
#### Extra LLM API Options (YAML Configuration)
@@ -264,7 +264,7 @@ Sample result in Blackwell
## Benchmarking Performance
- To benchmark the performance of your TensorRT-LLM server you can leverage the built-in `benchmark_serving.py` script. To do this first creating a wrapper `bench.sh` script.
+ To benchmark the performance of your TensorRTLLM server you can leverage the built-in `benchmark_serving.py` script. To do this first creating a wrapper `bench.sh`(http://bench.sh) script.

**`docs/source/developer-guide/perf-benchmarking.md`** (2 additions, 2 deletions)
@@ -7,8 +7,8 @@ This benchmarking suite is a work in progress.
Expect breaking API changes.
```
- TensorRT-LLM provides the `trtllm-bench` CLI, a packaged benchmarking utility that aims to make it
- easier for users to reproduce our officially published [performance overiew](./perf-overview.md#throughput-measurements). `trtllm-bench` provides the follows:
+ TensorRTLLM provides the `trtllm-bench` CLI, a packaged benchmarking utility that aims to make it
+ easier for users to reproduce our officially published [performance overview](./perf-overview.md#throughput-measurements). `trtllm-bench` provides the follows:
- A streamlined way to build tuned engines for benchmarking for a variety of models and platforms.
- KV cache behavior is set by providing the optional argument ```kv_cache_config``` when LLM engine is created. Consider the quickstart example (found in examples/pytorch/quickstart.py):
+ Set KV cache behavior by providing the optional ```kv_cache_config``` argument when you create the LLM engine. Consider the quickstart example found in ```examples/pytorch/quickstart.py```:
- This example runs with default KV cache properties. The default for ```free_gpu_memory_fraction``` is 0.9, which means TensorRT-LLM will try to allocate 90% of free GPU memory for KV cache. Depending on your system, this may be too aggressive, so you decide to dial that back to 0.7. This is done by adding the following lines to the quickstart example:
+ This example runs with default KV cache properties. The default value for `free_gpu_memory_fraction` is 0.9, which means TensorRT-LLM tries to allocate 90% of free GPU memory (after loading weights) for KV cache. Depending on your use case, this allocation can be too aggressive. You can reduce this value to 0.7 by adding the following lines to the quickstart example:
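
A minimal sketch of what those added lines could look like in the quickstart, assuming `KvCacheConfig` is imported from `tensorrt_llm.llmapi` and using a placeholder model name:

```python
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import KvCacheConfig

# Cap the KV cache at 70% of free GPU memory instead of the 0.9 default.
kv_cache_config = KvCacheConfig(free_gpu_memory_fraction=0.7)

llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder model
    kv_cache_config=kv_cache_config,
)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
```
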
- Block priority can be changed by providing the optional argument ```kv_cache_retention_config``` when a request is submitted to LLM engine. Consider the quickstart example (found in examples/pytorch/quickstart.py):
+ You can change block priority by providing the optional ```kv_cache_retention_config``` argument when you submit a request to the LLM engine. Consider the quick start example found in ```examples/pytorch/quickstart.py```:
- The blocks from the prompts will be stored for reuse with the default priority of 35 (on a scale from 1 to 100 where 100 is highest and 1 is lowest priority). Assume you know that the first four tokens of each prompt is a system prompt that should be stored with high priority (100). You do this by providing a kv cache retention config object when you submit the prompts for generation:
+ The blocks from the prompts are stored for reuse with the default priority of 35 on a scale from 1 to 100, where 100 is highest priority and 1 is lowest priority. Assume you know that the first four tokens of each prompt represent a system prompt that should be stored with high priority (100). You can achieve this by providing a KV cache retention config object when you submit the prompts for generation:
- ```
+ ```python
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import KvCacheRetentionConfig
- Here we used a single kv_cache_retention_config object for all the prompts. Alternatively, you can also provide a list, the list must have the same length as the list of prompts.
+ This example uses a single ```kv_cache_retention_config``` object for all the prompts. You can also provide a list that must have the same length as the list of prompts.
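
A rough sketch of building and passing such a retention config at request time; the constructor and keyword names below follow the executor bindings and are assumptions to check against the LLM API reference.

```python
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import KvCacheRetentionConfig

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # placeholder model
prompts = ["You are a helpful assistant. What is the capital of France?"]

# Keep the first four tokens (the assumed system prompt) at priority 100;
# everything else stays at the default priority of 35.
retention = KvCacheRetentionConfig(
    token_range_retention_configs=[
        KvCacheRetentionConfig.TokenRangeRetentionConfig(
            token_start=0, token_end=4, priority=100
        )
    ],
    decode_retention_priority=35,
)

outputs = llm.generate(
    prompts,
    SamplingParams(max_tokens=32),
    kv_cache_retention_config=retention,  # or a list with one entry per prompt
)
```
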

**`docs/source/features/disagg-serving.md`** (3 additions, 1 deletion)
@@ -212,6 +212,8 @@ There are some other useful environment variables that may help when encounterin
* `NCCL_GRAPH_MIXING_SUPPORT`: With the default value `1`, the CUDA driver may create too many CUDA streams while working with one CUDA graph, leading to performance drop. Setting it to `0` will reduce the number of CUDA streams, but please make sure there are no other NCCL ops outside the one CUDA graph, otherwise it's unsafe.
+ * `UCX_MAX_RNDV_RAILS`: With the default value 2, UCX attempts to use two InfiniBand (IB) NIC devices per GPU for Rendezvous (RNDV) transfers. When both the context and generation instances enable tensor- and expert-parallel (TEP), multiple TP ranks may transfer KV cache concurrently. Because each TP rank can use up to two NIC devices, some NIC devices can be shared across GPUs, causing contention and reduced throughput. Setting UCX_MAX_RNDV_RAILS=1 can reduce contention in this case.
+
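
These variables are normally exported in the environment that launches the context and generation workers; purely as an illustration, and assuming it runs before UCX/NCCL initialize, the equivalent in Python is:

```python
import os

# Limit each TP rank to a single IB rail to reduce NIC contention during KV cache transfer.
os.environ.setdefault("UCX_MAX_RNDV_RAILS", "1")

# Fewer CUDA streams when working with a single CUDA graph; only safe if no other
# NCCL ops run outside that graph.
os.environ.setdefault("NCCL_GRAPH_MIXING_SUPPORT", "0")
```
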
## Troubleshooting and FAQ
### General FAQs
@@ -254,7 +256,7 @@ A. Yes, TRT-LLM supports using GPU direct RDMA for inter-node KV cache transfer.
A. The communication for kvCache transfer between executors are established dynamically. The connection establishment process incurs significant overhead, which explains the apparently lower kvCache transfer bandwidth observed during the initial requests after service startup. This lower bandwidth reflects the inclusion of connection establishment overhead. When conducting benchmarks, it is recommended to perform a warm-up phase to ensure accurate performance measurements.
- *Q. When my servers are running on different NVLink domains, some servers hang or have a lower performance. How to fix that?
+ *Q. When my servers are running on different NVLink domains, some servers hang or have a lower performance. How to fix that?*
A. NVLink domain can be found with `nvidia-smi -q` in the `Fabric.ClusterUUID` field. A few UCX environment variables can be adjusted when your servers have different NVLink domains: