
Commit 35dac55

[None][doc] Update kvcache part (#7549)
Signed-off-by: nv-guomingz <[email protected]>
Signed-off-by: Wangshanshan <[email protected]>
1 parent f53fb4c commit 35dac55

16 files changed: +72 -62 lines changed

docs/source/advanced/lora.md

Lines changed: 2 additions & 2 deletions
@@ -133,9 +133,9 @@ Next, consider this linear layer is a `RowLinear` layer. When we partition the w
 
 #### DoRA
 
-TensorRT-LLM supports DoRA as described in https://arxiv.org/abs/2402.09353 . To enable DoRA, you must add the additional `--dora_plugin enable` flag to the `trtllm-build` command.
+TensorRT LLM supports DoRA as described in https://arxiv.org/abs/2402.09353 . To enable DoRA, you must add the additional `--dora_plugin enable` flag to the `trtllm-build` command.
 
-The DoRA scales must be normalized before they are submitted to TensorRT-LLM in an inference request. The normalization requires the base model weights. To normalize your adapter you may use the script provided in `tensorrt_llm/examples/dora/normalize_weights.py`.
+The DoRA scales must be normalized before they are submitted to TensorRT LLM in an inference request. The normalization requires the base model weights. To normalize your adapter you may use the script provided in `tensorrt_llm/examples/dora/normalize_weights.py`.
 
 When using DoRA, the format of `LoraWeights` and `LoraConfig` changes slightly.
 The shape of `LoraConfig` becomes `[module_id, layer_idx, adapter_size D (i.e. R value), is_dora]`, with `is_dora` a boolean flag that determines whether the supplied adapter contains DoRA scales or not. If the old config shape is used, it is assumed the adapter does not have DoRA scales.
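For illustration only (not part of this commit), a minimal numpy sketch of per-module `LoraConfig` rows under the DoRA layout described above; the module IDs, layer indices, rank, and dtype are hypothetical placeholder values.

```python
import numpy as np

# Hypothetical LoraConfig rows using the DoRA layout described above:
# [module_id, layer_idx, adapter_size (R value), is_dora]
lora_config = np.array(
    [
        [0, 0, 8, 1],  # module 0, layer 0, rank 8, DoRA scales present
        [0, 1, 8, 1],  # module 0, layer 1, rank 8, DoRA scales present
    ],
    dtype=np.int32,
)

# The older three-column layout (without the is_dora flag) is treated as
# "adapter has no DoRA scales".
legacy_config = lora_config[:, :3]
```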

docs/source/architecture/checkpoint.md

Lines changed: 1 addition & 1 deletion
@@ -169,7 +169,7 @@ Here is the AWQ scaling factors of `mlp.fc` linear layer:
 - `transformer.layers.0.mlp.fc.prequant_scaling_factor`
 
 ```{note}
-The linear weights in TensorRT-LLM checkpoint always follows (`out_feature`, `in_feature`) shape, whereas some quantized linear in TensorRT-LLM implemented by plugin may use (`in_feature`, `out_fature`) shape. The `trtllm-build` command adds a transpose operation to post-process it.
+The linear weights in TensorRT LLM checkpoint always follows (`out_feature`, `in_feature`) shape, whereas some quantized linear in TensorRT LLM implemented by plugin may use (`in_feature`, `out_feature`) shape. The `trtllm-build` command adds a transpose operation to post-process it.
 ```
 
 ### Example
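To make the shape convention in that note concrete, here is a small illustrative numpy sketch (not taken from the checkpoint code; the sizes are arbitrary) of a weight stored as (`out_feature`, `in_feature`) and the transposed view a plugin expecting (`in_feature`, `out_feature`) would consume.

```python
import numpy as np

out_features, in_features = 4096, 1024

# Checkpoint convention: linear weights are stored as (out_feature, in_feature).
checkpoint_weight = np.random.randn(out_features, in_features).astype(np.float16)

# A quantized linear plugin expecting (in_feature, out_feature) consumes the
# transposed tensor; trtllm-build inserts this transpose as a post-processing step.
plugin_weight = checkpoint_weight.T
assert plugin_weight.shape == (in_features, out_features)
```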

docs/source/blogs/tech_blog/blog10_ADP_Balance_Strategy.md

Lines changed: 9 additions & 9 deletions
@@ -1,6 +1,6 @@
 # ADP Balance Strategy
 
-By NVIDIA TensorRT-LLM team
+By NVIDIA TensorRT LLM team
 
 ## Table of Contents
 - [ADP Balance Strategy](#adp-balance-strategy)
@@ -96,7 +96,7 @@ The conventional approach employs a global load balancing strategy that sorts in
 
 <div align="center">
 <figure>
-<img src="./../media/tech_blog10_baseline_round_robin_strategy.png">
+<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog10_baseline_round_robin_strategy.png">
 </figure>
 </div>
 <p align="center"><sub><em>Figure 1: Baseline round-robin strategy balances context request tokens across ranks through sorting and cyclic distribution</em></sub></p>
@@ -179,7 +179,7 @@ We evaluate our approach using a comprehensive dataset comprising 16,000 inferen
 
 <div align="center">
 <figure>
-<img src="./../media/tech_blog10_dataset_token_distribution.png">
+<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog10_dataset_token_distribution.png">
 </figure>
 </div>
 <p align="center"><sub><em>Figure 2: Distribution of input and output token lengths</em></sub></p>
@@ -225,7 +225,7 @@ Figure 3 provides comprehensive insight into baseline system behavior, displayin
 
 <div align="center">
 <figure>
-<img src="./../media/tech_blog10_baseline_performance_overview.png">
+<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog10_baseline_performance_overview.png">
 </figure>
 </div>
 <p align="center"><sub><em>Figure 3: Baseline performance overview showing token distribution and balance ratios across all iterations</em></sub></p>
@@ -239,7 +239,7 @@ Figure 4 zooms into the critical imbalance period [100-12,000], revealing the dr
 
 <div align="center">
 <figure>
-<img src="./../media/tech_blog10_baseline_performance_detail.png">
+<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog10_baseline_performance_detail.png">
 </figure>
 </div>
 <p align="center"><sub><em>Figure 4: Detailed baseline analysis for iterations 100-12,000 showing severe balance fluctuations</em></sub></p>
@@ -260,7 +260,7 @@ The Context Wait mechanism (`timeout_iters=50`) demonstrates the effectiveness o
 
 <div align="center">
 <figure>
-<img src="./../media/tech_blog10_context_wait_performance.png">
+<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog10_context_wait_performance.png">
 </figure>
 </div>
 <p align="center"><sub><em>Figure 5: Context Wait performance showing improved balance stability for iterations 100-12,000</em></sub></p>
@@ -300,7 +300,7 @@ The effectiveness of our complete ADP Balance implementation is clearly demonstr
 
 <div align="center">
 <figure>
-<img src="./../media/tech_blog10_full_strategy_performance.png">
+<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog10_full_strategy_performance.png">
 </figure>
 </div>
 <p align="center"><sub><em>Figure 6: Full ADP Balance strategy demonstrating superior balance stability for iterations 100-12,000</em></sub></p>
@@ -324,7 +324,7 @@ Understanding the performance trade-offs inherent in our ADP Balance strategy is
 
 <div align="center">
 <figure>
-<img src="./../media/tech_blog10_tps_ttft_pareto_curve.png">
+<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog10_tps_ttft_pareto_curve.png">
 </figure>
 </div>
 <p align="center"><sub><em>Figure 7: Pareto frontier analysis showing throughput-latency trade-offs across different ADP Balance configurations</em></sub></p>
@@ -364,4 +364,4 @@ The Pareto frontier analysis provides critical insights for real-world deploymen
 
 ## Acknowledgement
 
-The ADP Balance strategy was a great team effort, covering system performance analysis and optimization. While we cannot thank every contributor individually, we are proud to acknowledge the dedicated team of engineers whose collective expertise has propelled TensorRT-LLM to new heights of performance. Through this collaborative effort, we have gained valuable insights into improving GPU utilization for large language model inference. We hope the techniques and experiences shared in this blog post will empower the developer community to better leverage the performance of NVIDIA GPUs in their mission-critical LLM inference applications.
+The ADP Balance strategy was a great team effort, covering system performance analysis and optimization. While we cannot thank every contributor individually, we are proud to acknowledge the dedicated team of engineers whose collective expertise has propelled TensorRT LLM to new heights of performance. Through this collaborative effort, we have gained valuable insights into improving GPU utilization for large language model inference. We hope the techniques and experiences shared in this blog post will empower the developer community to better leverage the performance of NVIDIA GPUs in their mission-critical LLM inference applications.

docs/source/blogs/tech_blog/blog11_GPT_OSS_Eagle3.md

Lines changed: 4 additions & 4 deletions
@@ -1,4 +1,4 @@
-## Running GPT-OSS-120B with Eagle3 Speculative Decoding on GB200/B200 (TensorRT-LLM)
+## Running GPT-OSS-120B with Eagle3 Speculative Decoding on GB200/B200 (TensorRT LLM)
 
 This guide sets up a production endpoint that uses Eagle3 speculative decoding on NVIDIA GB200 or B200 GPUs only. It replaces the low‑latency flow from the previous guide and intentionally omits max‑throughput, Hopper, and benchmarking content.
 
@@ -17,7 +17,7 @@ Expected directory layout on the host (example):
 └─ eagle/ # Eagle3 speculative decoding assets
 ```
 
-### Get the TensorRT-LLM Container (1.1.0rc0)
+### Get the TensorRT LLM Container (1.1.0rc0)
 
 If required by your environment, log into NGC and pull the image:
 
@@ -30,7 +30,7 @@ docker login nvcr.io
 docker pull nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc0
 ```
 
-### Start the TensorRT-LLM Container
+### Start the TensorRT LLM Container
 
 Run the container and bind-mount your models directory to `/config/models` inside the container:
 
@@ -122,7 +122,7 @@ When `Status: 200` is returned, the endpoint is ready to serve requests.
 
 ### Sample Chat Completions Request
 
-Note: This Eagle3 + TensorRT-LLM endpoint currently supports only greedy sampling. The following Chat Completions parameters are ignored (no-ops): `temperature`, `top_p`, `top_k`, and `seed`.
+Note: This Eagle3 + TensorRT LLM endpoint currently supports only greedy sampling. The following Chat Completions parameters are ignored (no-ops): `temperature`, `top_p`, `top_k`, and `seed`.
 
 Send a simple OpenAI-compatible Chat Completions request to the running server:
 
docs/source/deployment-guide/index.rst

Lines changed: 1 addition & 0 deletions
@@ -9,3 +9,4 @@ Model Recipes
 quick-start-recipe-for-deepseek-r1-on-trtllm.md
 quick-start-recipe-for-llama3.3-70b-on-trtllm.md
 quick-start-recipe-for-llama4-scout-on-trtllm.md
+quick-start-recipe-for-gpt-oss-on-trtllm.md

docs/source/deployment-guide/quick-start-recipe-for-llama4-scout-on-trtllm.md

Lines changed: 5 additions & 5 deletions
@@ -49,11 +49,11 @@ Note:
 * The command also maps port `8000` from the container to your host so you can access the LLM API endpoint from your host
 * See the <https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags> for all the available containers. The containers published in the main branch weekly have `rcN` suffix, while the monthly release with QA tests has no `rcN` suffix. Use the `rc` release to get the latest model and feature support.
 
-If you want to use latest main branch, you can choose to build from source to install TensorRT-LLM, the steps refer to <https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html>.
+If you want to use the latest main branch, you can build TensorRT LLM from source; for the steps, refer to [https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html](https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html).
 
 ### Creating the TRT-LLM Server config
 
-We create a YAML configuration file `/tmp/config.yml` for the TensorRT-LLM Server and populate it with the following recommended performance settings.
+We create a YAML configuration file `/tmp/config.yml` for the TensorRT LLM Server and populate it with the following recommended performance settings.
 
 ```shell
 EXTRA_LLM_API_FILE=/tmp/config.yml
@@ -108,7 +108,7 @@ These options are used directly on the command line when you start the `trtllm-s
 
 #### `--backend pytorch`
 
-* **Description:** Tells TensorRT-LLM to use the **pytorch** backend.
+&emsp;**Description:** Tells TensorRT LLM to use the **pytorch** backend.
 
 #### `--max_batch_size`
 
@@ -124,7 +124,7 @@ These options are used directly on the command line when you start the `trtllm-s
 
 #### `--trust_remote_code`
 
-* **Description:** Allows TensorRT-LLM to download models and tokenizers from Hugging Face. This flag is passed directly to the Hugging Face API.
+&emsp;**Description:** Allows TensorRT LLM to download models and tokenizers from Hugging Face. This flag is passed directly to the Hugging Face API.
 
 
 #### Extra LLM API Options (YAML Configuration)
@@ -264,7 +264,7 @@ Sample result in Blackwell
 
 ## Benchmarking Performance
 
-To benchmark the performance of your TensorRT-LLM server you can leverage the built-in `benchmark_serving.py` script. To do this first creating a wrapper `bench.sh` script.
+To benchmark the performance of your TensorRT LLM server you can leverage the built-in `benchmark_serving.py` script. To do this, first create a wrapper `bench.sh` script.
 
 ```shell
 cat <<EOF > bench.sh

docs/source/developer-guide/perf-benchmarking.md

Lines changed: 2 additions & 2 deletions
@@ -7,8 +7,8 @@ This benchmarking suite is a work in progress.
 Expect breaking API changes.
 ```
 
-TensorRT-LLM provides the `trtllm-bench` CLI, a packaged benchmarking utility that aims to make it
-easier for users to reproduce our officially published [performance overiew](./perf-overview.md#throughput-measurements). `trtllm-bench` provides the follows:
+TensorRT LLM provides the `trtllm-bench` CLI, a packaged benchmarking utility that aims to make it
+easier for users to reproduce our officially published [performance overview](./perf-overview.md#throughput-measurements). `trtllm-bench` provides the follows:
 
 - A streamlined way to build tuned engines for benchmarking for a variety of models and platforms.
 - An entirely Python workflow for benchmarking.
Lines changed: 14 additions & 8 deletions
@@ -1,9 +1,11 @@
-# How To Change KV Cache Behavior
+# How to Change KV Cache Behavior
 
-KV cache behavior is set by providing the optional argument ```kv_cache_config``` when LLM engine is created. Consider the quickstart example (found in examples/pytorch/quickstart.py):
+Set KV cache behavior by providing the optional ```kv_cache_config``` argument when you create the LLM engine. Consider the quickstart example found in ```examples/pytorch/quickstart.py```:
 
-```
+```python
 from tensorrt_llm import LLM, SamplingParams
+
+
 def main():
     prompts = [
         "Hello, my name is",
@@ -12,30 +14,34 @@ def main():
         "The future of AI is",
     ]
     sampling_params = SamplingParams(max_tokens=32)
+
     llm = LLM(model='TinyLlama/TinyLlama-1.1B-Chat-v1.0')
     outputs = llm.generate(prompts, sampling_params)
+
     for i, output in enumerate(outputs):
         prompt = output.prompt
         generated_text = output.outputs[0].text
         print(f"[{i}] Prompt: {prompt!r}, Generated text: {generated_text!r}")
+
+
 if __name__ == '__main__':
     main()
 ```
 
-This example runs with default KV cache properties. The default for ```free_gpu_memory_fraction``` is 0.9, which means TensorRT-LLM will try to allocate 90% of free GPU memory for KV cache. Depending on your system, this may be too aggressive, so you decide to dial that back to 0.7. This is done by adding the following lines to the quickstart example:
+This example runs with default KV cache properties. The default value for `free_gpu_memory_fraction` is 0.9, which means TensorRT-LLM tries to allocate 90% of free GPU memory (after loading weights) for KV cache. Depending on your use case, this allocation can be too aggressive. You can reduce this value to 0.7 by adding the following lines to the quickstart example:
 
-```
+```python
 from tensorrt_llm.llmapi import KvCacheConfig
 kv_cache_config = KvCacheConfig(free_gpu_memory_fraction=0.7)
 llm = LLM(model='TinyLlama/TinyLlama-1.1B-Chat-v1.0', kv_cache_config=kv_cache_config)
 ```
 
-You can also set properties after you create KvCacheConfig, for instance
+You can also set properties after you create ```KvCacheConfig```. For example:
 
-```
+```python
 kv_cache_config = KvCacheConfig()
 kv_cache_config.enable_block_reuse = False
 llm = LLM(model='TinyLlama/TinyLlama-1.1B-Chat-v1.0', kv_cache_config=kv_cache_config)
 ```
 
-will disable block reuse for the quickstart example.
+This code disables block reuse for the quick start example.
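Putting the two snippets from the updated page together, a self-contained sketch (assuming the same TinyLlama checkpoint used in the quick start) that caps the KV cache at 70% of free GPU memory and disables block reuse could look like this:

```python
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import KvCacheConfig


def main():
    prompts = ["Hello, my name is", "The future of AI is"]
    sampling_params = SamplingParams(max_tokens=32)

    # Combine both settings shown above: constructor argument plus attribute assignment.
    kv_cache_config = KvCacheConfig(free_gpu_memory_fraction=0.7)
    kv_cache_config.enable_block_reuse = False

    llm = LLM(model='TinyLlama/TinyLlama-1.1B-Chat-v1.0', kv_cache_config=kv_cache_config)
    outputs = llm.generate(prompts, sampling_params)
    for i, output in enumerate(outputs):
        print(f"[{i}] Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")


if __name__ == '__main__':
    main()
```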
Lines changed: 19 additions & 6 deletions
@@ -1,9 +1,11 @@
-# How To Change Block Priorities
+# How to Change Block Priorities
 
-Block priority can be changed by providing the optional argument ```kv_cache_retention_config``` when a request is submitted to LLM engine. Consider the quickstart example (found in examples/pytorch/quickstart.py):
+You can change block priority by providing the optional ```kv_cache_retention_config``` argument when you submit a request to the LLM engine. Consider the quick start example found in ```examples/pytorch/quickstart.py```:
 
-```
+```python
 from tensorrt_llm import LLM, SamplingParams
+
+
 def main():
     prompts = [
         "Hello, my name is",
@@ -12,21 +14,27 @@ def main():
         "The future of AI is",
     ]
     sampling_params = SamplingParams(max_tokens=32)
+
     llm = LLM(model='TinyLlama/TinyLlama-1.1B-Chat-v1.0')
     outputs = llm.generate(prompts, sampling_params)
+
     for i, output in enumerate(outputs):
         prompt = output.prompt
         generated_text = output.outputs[0].text
         print(f"[{i}] Prompt: {prompt!r}, Generated text: {generated_text!r}")
+
+
 if __name__ == '__main__':
     main()
 ```
 
-The blocks from the prompts will be stored for reuse with the default priority of 35 (on a scale from 1 to 100 where 100 is highest and 1 is lowest priority). Assume you know that the first four tokens of each prompt is a system prompt that should be stored with high priority (100). You do this by providing a kv cache retention config object when you submit the prompts for generation:
+The blocks from the prompts are stored for reuse with the default priority of 35 on a scale from 1 to 100, where 100 is highest priority and 1 is lowest priority. Assume you know that the first four tokens of each prompt represent a system prompt that should be stored with high priority (100). You can achieve this by providing a KV cache retention config object when you submit the prompts for generation:
 
-```
+```python
 from tensorrt_llm import LLM, SamplingParams
 from tensorrt_llm.llmapi import KvCacheRetentionConfig
+
+
 def main():
     prompts = [
         "Hello, my name is",
@@ -35,7 +43,9 @@ def main():
         "The future of AI is",
     ]
     sampling_params = SamplingParams(max_tokens=32)
+
     llm = LLM(model='TinyLlama/TinyLlama-1.1B-Chat-v1.0')
+
     # Set priority for first 4 prompt tokens to 100. All other tokens set to default (35) priority.
     # This policy never lapses.
     tokenRangeRetentionConfig = KvCacheRetentionConfig.TokenRangeRetentionConfig(0, 4, 100, None)
@@ -44,12 +54,15 @@ def main():
         decode_retention_priority=35, # Set generated tokens to default priority
         decode_duration_ms=None)
     outputs = llm.generate(prompts, sampling_params, kv_cache_retention_config=kv_cache_retention_config)
+
     for i, output in enumerate(outputs):
         prompt = output.prompt
         generated_text = output.outputs[0].text
         print(f"[{i}] Prompt: {prompt!r}, Generated text: {generated_text!r}")
+
+
 if __name__ == '__main__':
     main()
 ```
 
-Here we used a single kv_cache_retention_config object for all the prompts. Alternatively, you can also provide a list, the list must have the same length as the list of prompts.
+This example uses a single ```kv_cache_retention_config``` object for all the prompts. You can also provide a list that must have the same length as the list of prompts.
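The closing sentence notes that a list of retention configs can be passed instead of a single object. A minimal sketch of that variant follows; it reuses only the constructor arguments visible in the diff above, and because the keyword that takes the token-range list is not shown in this hunk, the name `token_range_retention_configs` is an assumption here.

```python
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import KvCacheRetentionConfig


def main():
    prompts = [
        "Hello, my name is",
        "The future of AI is",
    ]
    sampling_params = SamplingParams(max_tokens=32)
    llm = LLM(model='TinyLlama/TinyLlama-1.1B-Chat-v1.0')

    # One retention config per prompt; the list length must match the prompt list.
    # First prompt: keep its first four tokens at the highest priority (100).
    # Second prompt: keep its first four tokens at a lower-than-default priority (20).
    def make_config(priority):
        return KvCacheRetentionConfig(
            token_range_retention_configs=[  # assumed keyword, see note above
                KvCacheRetentionConfig.TokenRangeRetentionConfig(0, 4, priority, None)],
            decode_retention_priority=35,  # generated tokens keep the default priority
            decode_duration_ms=None)

    retention_configs = [make_config(100), make_config(20)]

    outputs = llm.generate(prompts, sampling_params,
                           kv_cache_retention_config=retention_configs)
    for i, output in enumerate(outputs):
        print(f"[{i}] Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")


if __name__ == '__main__':
    main()
```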

docs/source/features/disagg-serving.md

Lines changed: 3 additions & 1 deletion
@@ -212,6 +212,8 @@ There are some other useful environment variables that may help when encounterin
 
 * `NCCL_GRAPH_MIXING_SUPPORT`: With the default value `1`, the CUDA driver may create too many CUDA streams while working with one CUDA graph, leading to performance drop. Setting it to `0` will reduce the number of CUDA streams, but please make sure there are no other NCCL ops outside the one CUDA graph, otherwise it's unsafe.
 
+* `UCX_MAX_RNDV_RAILS`: With the default value `2`, UCX attempts to use two InfiniBand (IB) NIC devices per GPU for Rendezvous (RNDV) transfers. When both the context and generation instances enable tensor- and expert-parallel (TEP), multiple TP ranks may transfer KV cache concurrently. Because each TP rank can use up to two NIC devices, some NIC devices can be shared across GPUs, causing contention and reduced throughput. Setting `UCX_MAX_RNDV_RAILS=1` can reduce contention in this case.
+
 ## Troubleshooting and FAQ
 
 ### General FAQs
@@ -254,7 +256,7 @@ A. Yes, TRT-LLM supports using GPU direct RDMA for inter-node KV cache transfer.
 
 A. The communication for kvCache transfer between executors are established dynamically. The connection establishment process incurs significant overhead, which explains the apparently lower kvCache transfer bandwidth observed during the initial requests after service startup. This lower bandwidth reflects the inclusion of connection establishment overhead. When conducting benchmarks, it is recommended to perform a warm-up phase to ensure accurate performance measurements.
 
-*Q. When my servers are running on different NVLink domains, some servers hang or have a lower performance. How to fix that?
+*Q. When my servers are running on different NVLink domains, some servers hang or have a lower performance. How to fix that?*
 
 A. NVLink domain can be found with `nvidia-smi -q` in the `Fabric.ClusterUUID` field. A few UCX environment variables can be adjusted when your servers have different NVLink domains:
 
