
Commit f813bb2

nv-guomingz authored and dominicshanshan committed
[None][doc] Update kvcache part (NVIDIA#7549)
Signed-off-by: nv-guomingz <[email protected]>
Signed-off-by: Wangshanshan <[email protected]>
1 parent 88820f0 commit f813bb2

File tree: 4 files changed, +39 -27 lines changed


docs/source/examples/kvcacheconfig.md

Lines changed: 14 additions & 8 deletions
@@ -1,9 +1,11 @@
-# How To Change KV Cache Behavior
+# How to Change KV Cache Behavior
 
-KV cache behavior is set by providing the optional argument ```kv_cache_config``` when LLM engine is created. Consider the quickstart example (found in examples/pytorch/quickstart.py):
+Set KV cache behavior by providing the optional ```kv_cache_config``` argument when you create the LLM engine. Consider the quickstart example found in ```examples/pytorch/quickstart.py```:
 
-```
+```python
 from tensorrt_llm import LLM, SamplingParams
+
+
 def main():
     prompts = [
         "Hello, my name is",
@@ -12,30 +14,34 @@ def main():
         "The future of AI is",
     ]
     sampling_params = SamplingParams(max_tokens=32)
+
     llm = LLM(model='TinyLlama/TinyLlama-1.1B-Chat-v1.0')
     outputs = llm.generate(prompts, sampling_params)
+
     for i, output in enumerate(outputs):
         prompt = output.prompt
         generated_text = output.outputs[0].text
         print(f"[{i}] Prompt: {prompt!r}, Generated text: {generated_text!r}")
+
+
 if __name__ == '__main__':
     main()
 ```
 
-This example runs with default KV cache properties. The default for ```free_gpu_memory_fraction``` is 0.9, which means TensorRT-LLM will try to allocate 90% of free GPU memory for KV cache. Depending on your system, this may be too aggressive, so you decide to dial that back to 0.7. This is done by adding the following lines to the quickstart example:
+This example runs with default KV cache properties. The default value for `free_gpu_memory_fraction` is 0.9, which means TensorRT-LLM tries to allocate 90% of free GPU memory (after loading weights) for KV cache. Depending on your use case, this allocation can be too aggressive. You can reduce this value to 0.7 by adding the following lines to the quickstart example:
 
-```
+```python
 from tensorrt_llm.llmapi import KvCacheConfig
 kv_cache_config = KvCacheConfig(free_gpu_memory_fraction=0.7)
 llm = LLM(model='TinyLlama/TinyLlama-1.1B-Chat-v1.0', kv_cache_config=kv_cache_config)
 ```
 
-You can also set properties after you create KvCacheConfig, for instance
+You can also set properties after you create ```KvCacheConfig```. For example:
 
-```
+```python
 kv_cache_config = KvCacheConfig()
 kv_cache_config.enable_block_reuse = False
 llm = LLM(model='TinyLlama/TinyLlama-1.1B-Chat-v1.0', kv_cache_config=kv_cache_config)
 ```
 
-will disable block reuse for the quickstart example.
+This code disables block reuse for the quickstart example.
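
Both properties shown in this file can be combined on a single `KvCacheConfig`. Below is a minimal sketch, using only the fields demonstrated in the document above; the 0.7 fraction and the single prompt are just the values from the example:

```python
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import KvCacheConfig

# Cap the KV cache at 70% of free GPU memory and disable block reuse,
# combining the two properties shown in the updated document above.
kv_cache_config = KvCacheConfig(free_gpu_memory_fraction=0.7)
kv_cache_config.enable_block_reuse = False

llm = LLM(model='TinyLlama/TinyLlama-1.1B-Chat-v1.0', kv_cache_config=kv_cache_config)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```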
docs/source/examples/kvcacheretentionconfig.md

Lines changed: 19 additions & 6 deletions

@@ -1,9 +1,11 @@
-# How To Change Block Priorities
+# How to Change Block Priorities
 
-Block priority can be changed by providing the optional argument ```kv_cache_retention_config``` when a request is submitted to LLM engine. Consider the quickstart example (found in examples/pytorch/quickstart.py):
+You can change block priority by providing the optional ```kv_cache_retention_config``` argument when you submit a request to the LLM engine. Consider the quickstart example found in ```examples/pytorch/quickstart.py```:
 
-```
+```python
 from tensorrt_llm import LLM, SamplingParams
+
+
 def main():
     prompts = [
         "Hello, my name is",
@@ -12,21 +14,27 @@ def main():
         "The future of AI is",
     ]
     sampling_params = SamplingParams(max_tokens=32)
+
     llm = LLM(model='TinyLlama/TinyLlama-1.1B-Chat-v1.0')
     outputs = llm.generate(prompts, sampling_params)
+
     for i, output in enumerate(outputs):
         prompt = output.prompt
         generated_text = output.outputs[0].text
         print(f"[{i}] Prompt: {prompt!r}, Generated text: {generated_text!r}")
+
+
 if __name__ == '__main__':
     main()
 ```
 
-The blocks from the prompts will be stored for reuse with the default priority of 35 (on a scale from 1 to 100 where 100 is highest and 1 is lowest priority). Assume you know that the first four tokens of each prompt is a system prompt that should be stored with high priority (100). You do this by providing a kv cache retention config object when you submit the prompts for generation:
+The blocks from the prompts are stored for reuse with the default priority of 35 on a scale from 1 to 100, where 100 is highest priority and 1 is lowest priority. Assume you know that the first four tokens of each prompt represent a system prompt that should be stored with high priority (100). You can achieve this by providing a KV cache retention config object when you submit the prompts for generation:
 
-```
+```python
 from tensorrt_llm import LLM, SamplingParams
 from tensorrt_llm.llmapi import KvCacheRetentionConfig
+
+
 def main():
     prompts = [
         "Hello, my name is",
@@ -35,7 +43,9 @@ def main():
         "The future of AI is",
     ]
     sampling_params = SamplingParams(max_tokens=32)
+
     llm = LLM(model='TinyLlama/TinyLlama-1.1B-Chat-v1.0')
+
     # Set priority for first 4 prompt tokens to 100. All other tokens set to default (35) priority.
     # This policy never lapses.
     tokenRangeRetentionConfig = KvCacheRetentionConfig.TokenRangeRetentionConfig(0, 4, 100, None)
@@ -44,12 +54,15 @@ def main():
         decode_retention_priority=35, # Set generated tokens to default priority
         decode_duration_ms=None)
     outputs = llm.generate(prompts, sampling_params, kv_cache_retention_config=kv_cache_retention_config)
+
     for i, output in enumerate(outputs):
         prompt = output.prompt
         generated_text = output.outputs[0].text
         print(f"[{i}] Prompt: {prompt!r}, Generated text: {generated_text!r}")
+
+
 if __name__ == '__main__':
     main()
 ```
 
-Here we used a single kv_cache_retention_config object for all the prompts. Alternatively, you can also provide a list, the list must have the same length as the list of prompts.
+This example uses a single ```kv_cache_retention_config``` object for all the prompts. You can also provide a list that must have the same length as the list of prompts.
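
The closing line above notes that, instead of a single object, you can pass one retention config per prompt. A minimal sketch of that variant follows, reusing the calls from the example. The `token_range_retention_configs` keyword is an assumption here, since the line that constructs `KvCacheRetentionConfig` falls outside the hunks shown, and the per-prompt priorities (100 and 50) are purely illustrative:

```python
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import KvCacheRetentionConfig

prompts = [
    "Hello, my name is",
    "The future of AI is",
]

def make_retention_config(priority):
    # Same pattern as the example above: first 4 prompt tokens at `priority`,
    # generated tokens at the default (35), and no expiration.
    token_range = KvCacheRetentionConfig.TokenRangeRetentionConfig(0, 4, priority, None)
    return KvCacheRetentionConfig(
        token_range_retention_configs=[token_range],  # keyword assumed, see note above
        decode_retention_priority=35,
        decode_duration_ms=None)

# One config per prompt; the list must be the same length as `prompts`.
retention_configs = [make_retention_config(100), make_retention_config(50)]

llm = LLM(model='TinyLlama/TinyLlama-1.1B-Chat-v1.0')
outputs = llm.generate(prompts, SamplingParams(max_tokens=32),
                       kv_cache_retention_config=retention_configs)
for i, output in enumerate(outputs):
    print(f"[{i}] {output.outputs[0].text!r}")
```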

docs/source/features/kvcache.md

Lines changed: 5 additions & 5 deletions
@@ -22,15 +22,15 @@ One caveat in the current code is that only leaf blocks can be evicted (leaves a
 
 ### Retention Policy
 
-Blocks are assigned priority in line with the [retention policy](https://nvidia.github.io/TensorRT-LLM/llm-api/reference.html#tensorrt_llm.llmapi.KvCacheRetentionConfig) of the request. The retention policy is a list of [TokenRangeRetentionConfig](https://nvidia.github.io/TensorRT-LLM/llm-api/reference.html#tensorrt_llm.llmapi.KvCacheRetentionConfig.TokenRangeRetentionConfig) objects, each specifying priority for a given range of tokens, such as "assign priority X to tokens 10 through 61". You can also assign a duration in milliseconds for this to remain in effect, priority will revert to the default after a period of ```duration_ms``` has elapsed from the first time the block was made available for reuse. TokenRangeRetentionConfig only applies to input (prompt) tokens. The property ```decode_retention_policy``` specifies what priority to assign to blocks with generated (decoded) tokens and ```decode_duration_ms``` specifies how long this should remain in effect, after which priority will revert to the default. Default priority is 35. Any property that expects a duration can be set to None, which indicates retention policy never expires.
+Blocks are assigned priority in line with the [retention policy](https://nvidia.github.io/TensorRT-LLM/llm-api/reference.html#tensorrt_llm.llmapi.KvCacheRetentionConfig) of the request. Blocks with lower priority scores will be freed preferentially to blocks with higher priority. The retention policy is a list of [TokenRangeRetentionConfig](https://nvidia.github.io/TensorRT-LLM/llm-api/reference.html#tensorrt_llm.llmapi.KvCacheRetentionConfig.TokenRangeRetentionConfig) objects, each specifying priority for a given range of tokens, such as "assign priority X to tokens 10 through 61". You can also assign a duration in milliseconds for this to remain in effect. Priority reverts to the default of 35 after a period of ```duration_ms``` has elapsed from the first time the block was made available for reuse. TokenRangeRetentionConfig only applies to input (prompt) tokens. The property ```decode_retention_policy``` specifies what priority to assign to blocks with generated (decoded) tokens and ```decode_duration_ms``` specifies how long this should remain in effect. Priority reverts to the default after expiration. Any property that expects a duration can be set to None, which indicates that part of the retention policy never expires.
 
 Not in use: ```transfer_mode``` is a debug option and should not be used.
 
-See [this example](../examples/kvcacheretentionconfig.md) of how to change block priorities of specific requests by altering their retention policy.
+See [this example](../examples/kvcacheretentionconfig.md) for how to change block priorities of specific requests by altering their retention policy.
 
 ### Speculative Decoding
 
-Reuse across requests is only supported for one model MTP, all other [speculative decoding](speculative-decoding.md) algorithms must disable block reuse.
+Reuse across requests is supported by all speculative decoding models. See [speculative decoding](speculative-decoding.md) for more details.
 
 ## Limited Attention Window Size
 
@@ -42,9 +42,9 @@ TensorRT-LLM takes advantage of grouped query attention in order to save memory.
 
 ## Controlling KV Cache Behavior
 
-Many of the features in the KV cache system are optional or have user defined properties that alter how they work. Users can control KV cache features through class [KVCacheConfig](https://nvidia.github.io/TensorRT-LLM/llm-api/reference.html#tensorrt_llm.llmapi.KvCacheConfig). The remainder of this section describes how to change the most important behaviors of KV cache system.
+Many of the features in the KV cache system are optional or have user-defined properties that alter how they work. Users can control KV cache features through the [KvCacheConfig](https://nvidia.github.io/TensorRT-LLM/llm-api/reference.html#tensorrt_llm.llmapi.KvCacheConfig) class. The remainder of this section describes how to change the most important behaviors of the KV cache system.
 
-See [this example](../examples/kvcacheconfig.md) of how to use KvCacheConfig to control KV cache behavior.
+See [this example](../examples/kvcacheconfig.md) for how to use KvCacheConfig to control KV cache behavior.
 
 ### Datatype
 
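
The retention-policy paragraph in the hunk above also describes time-limited priorities via ```duration_ms```. A minimal sketch of such a policy follows, with the same assumption as the earlier sketch (the `token_range_retention_configs` keyword), plus one more: the duration is written as plain milliseconds based on the parameter name, while the documented example only ever passes None:

```python
from tensorrt_llm.llmapi import KvCacheRetentionConfig

# Keep the first 4 prompt tokens at priority 100, but only for 30 seconds
# (30000, assumed to be a plain millisecond count) after their blocks first
# become available for reuse; afterwards they revert to the default priority of 35.
token_range = KvCacheRetentionConfig.TokenRangeRetentionConfig(0, 4, 100, 30000)

kv_cache_retention_config = KvCacheRetentionConfig(
    token_range_retention_configs=[token_range],  # keyword assumed
    decode_retention_priority=35,  # generated tokens stay at the default priority
    decode_duration_ms=None)       # the decode part of the policy never expires
```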
docs/source/features/speculative-decoding.md

Lines changed: 1 addition & 8 deletions
@@ -29,9 +29,6 @@ The two model implementation supports the following speculative decoding algorit
 
 For all speculation algorithms, when speculation is enabled, a single sequence of draft tokens with length `max_draft_len` is created for every request. There is currently no way to dynamically disable speculation, thus speed ups are only observable at low batch sizes.
 
-All two-model based speculation implementations have the following additional constraints:
-* KV cache reuse must be disabled (this occurs implicitly).
-* Overlap scheduling must be disabled.
 
 ### Draft/Target
 
@@ -99,8 +96,6 @@ MTP is currently only supported by Deepseek. MTP can be tuned with the following
 * `relaxed_topk`: The top K tokens are sampled from the target model's logits to create the initial candidate set for relaxed decoding.
 * `relaxed_delta`: Used to further filter the top K candidate set for relaxed decoding. We remove tokens `t` for which `log(P(top 1 token)) - log(P(t)) > relaxed_delta`.
 
-Unlike the other speculation algorithms, MTP supports the overlap scheduler and KV cache reuse.
-
 ```python
 from tensorrt_llm.llmapi import MTPDecodingConfig
 
@@ -190,9 +185,7 @@ draft tokens to be attached to the `py_draft_tokens` field of request that specu
 
 ## Two Model Speculative Decoding Architecture
 
-Note that there are currently a few limitations on the two model implementation:
-* KV cache reuse must be disabled.
-* Overlap scheduling must be disabled.
+Two-model based speculation implementations do not support the overlap scheduler; it is disabled automatically.
 
 In this approach, there are two new steps to the `PyExecutor`'s `_executor_loop`.
 * `_prepare_draft_requests`
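
For the MTP knobs listed in the second hunk above (`relaxed_topk`, `relaxed_delta`), here is a minimal sketch of how an `MTPDecodingConfig` might be built and handed to the engine. The field names come from the bullets above; the `speculative_config` parameter and the DeepSeek model id are assumptions, since the corresponding lines are truncated out of this diff:

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import MTPDecodingConfig

# Relaxed-acceptance tuning described above: sample the top-10 target-model
# tokens, then drop candidates whose log-probability is more than 0.5 below
# the top-1 token. Values are illustrative only.
mtp_config = MTPDecodingConfig(relaxed_topk=10, relaxed_delta=0.5)

# Assumption: the config is passed via `speculative_config`, and the model id
# is a placeholder for a DeepSeek checkpoint (MTP is DeepSeek-only per the doc).
llm = LLM(model='deepseek-ai/DeepSeek-V3', speculative_config=mtp_config)
```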
