Set KV cache behavior by providing the optional `kv_cache_config` argument when you create the LLM engine. Consider the quickstart example found in `examples/pytorch/quickstart.py`:
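The quickstart script itself is not reproduced here; the sketch below shows the general shape of such an LLM-API script (the model name and prompts are illustrative, not taken from the actual file):

```python
from tensorrt_llm import LLM, SamplingParams


def main():
    # Illustrative prompts and model; the real quickstart example may differ.
    prompts = [
        "Hello, my name is",
        "The capital of France is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    # No kv_cache_config is passed, so the default KV cache behavior is used.
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

    for output in llm.generate(prompts, sampling_params):
        print(output.outputs[0].text)


if __name__ == "__main__":
    main()
```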
This example runs with default KV cache properties. The default value for `free_gpu_memory_fraction` is 0.9, which means TensorRT-LLM tries to allocate 90% of free GPU memory (after loading weights) for KV cache. Depending on your use case, this allocation can be too aggressive. You can reduce this value to 0.7 by adding the following lines to the quickstart example:
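The added lines are not shown here; a minimal sketch of the intended change, assuming `KvCacheConfig` from `tensorrt_llm.llmapi`, is:

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

# Cap the KV cache at 70% of free GPU memory instead of the default 90%.
kv_cache_config = KvCacheConfig(free_gpu_memory_fraction=0.7)

# Pass the config when the engine is created.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
          kv_cache_config=kv_cache_config)
```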
You can change block priority by providing the optional `kv_cache_retention_config` argument when you submit a request to the LLM engine. Consider the quickstart example found in `examples/pytorch/quickstart.py`:
The blocks from the prompts are stored for reuse with the default priority of 35, on a scale from 1 to 100 where 100 is the highest priority and 1 is the lowest. Assume you know that the first four tokens of each prompt represent a system prompt that should be stored with high priority (100). You can achieve this by providing a KV cache retention config object when you submit the prompts for generation:
```python
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import KvCacheRetentionConfig
```
This example uses a single `kv_cache_retention_config` object for all the prompts. You can also provide a list, which must have the same length as the list of prompts.
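The rest of the example is elided above; a minimal sketch of how such a retention config might be built and passed at generation time follows. The constructor and keyword names (`token_range_retention_configs`, `token_start`, `token_end`, `priority`) are assumed from the `KvCacheRetentionConfig` API reference and should be verified against your TensorRT-LLM version:

```python
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import KvCacheRetentionConfig

prompts = [  # illustrative prompts
    "Hello, my name is",
    "The capital of France is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Keep the blocks holding the first four prompt tokens at the highest
# priority (100); all other blocks keep the default priority of 35.
retention_config = KvCacheRetentionConfig(
    token_range_retention_configs=[
        KvCacheRetentionConfig.TokenRangeRetentionConfig(
            token_start=0, token_end=4, priority=100),
    ])

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
outputs = llm.generate(prompts, sampling_params,
                       kv_cache_retention_config=retention_config)
for output in outputs:
    print(output.outputs[0].text)
```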
File: `docs/source/features/kvcache.md` (5 additions, 5 deletions)
### Retention Policy
Blocks are assigned priority in line with the [retention policy](https://nvidia.github.io/TensorRT-LLM/llm-api/reference.html#tensorrt_llm.llmapi.KvCacheRetentionConfig) of the request. Blocks with lower priority are freed preferentially to blocks with higher priority. The retention policy is a list of [TokenRangeRetentionConfig](https://nvidia.github.io/TensorRT-LLM/llm-api/reference.html#tensorrt_llm.llmapi.KvCacheRetentionConfig.TokenRangeRetentionConfig) objects, each specifying the priority for a given range of tokens, such as "assign priority X to tokens 10 through 61". You can also assign a duration in milliseconds for this to remain in effect; priority reverts to the default of 35 after `duration_ms` has elapsed from the first time the block was made available for reuse. TokenRangeRetentionConfig applies only to input (prompt) tokens. The property `decode_retention_policy` specifies what priority to assign to blocks with generated (decoded) tokens, and `decode_duration_ms` specifies how long this remains in effect; priority reverts to the default after expiration. Any property that expects a duration can be set to `None`, in which case that part of the retention policy never expires.
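As an illustration of how these fields combine, the sketch below builds a policy that boosts the first 64 prompt tokens for one minute. The field names are taken from the description above and the linked reference; they are assumptions and should be checked against your TensorRT-LLM version:

```python
from tensorrt_llm.llmapi import KvCacheRetentionConfig

TokenRange = KvCacheRetentionConfig.TokenRangeRetentionConfig

# Priority 80 for prompt tokens 0-63, expiring 60 seconds after the blocks
# first become available for reuse; afterwards they revert to the default (35).
policy = KvCacheRetentionConfig(
    token_range_retention_configs=[
        TokenRange(token_start=0, token_end=64, priority=80, duration_ms=60000),
    ])
```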
Not in use: `transfer_mode` is a debug option and should not be used.
See [this example](../examples/kvcacheretentionconfig.md) for how to change block priorities of specific requests by altering their retention policy.
### Speculative Decoding
Reuse across requests is supported by all speculative decoding models. See [speculative decoding](speculative-decoding.md) for more details.
## Limited Attention Window Size
## Controlling KV Cache Behavior
Many of the features in the KV cache system are optional or have user-defined properties that alter how they work. Users can control KV cache features through the [KvCacheConfig](https://nvidia.github.io/TensorRT-LLM/llm-api/reference.html#tensorrt_llm.llmapi.KvCacheConfig) class. The remainder of this section describes how to change the most important behaviors of the KV cache system.
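For example, a config that turns off block reuse and puts a hard cap on cache size might look like the sketch below; the field names `enable_block_reuse` and `max_tokens` follow the KvCacheConfig reference, and the values are illustrative:

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

kv_cache_config = KvCacheConfig(
    enable_block_reuse=False,  # disable cross-request block reuse
    max_tokens=16384,          # upper bound on tokens held in the KV cache
)

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
          kv_cache_config=kv_cache_config)
```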
See [this example](../examples/kvcacheconfig.md) for how to use `KvCacheConfig` to control KV cache behavior.
File: `docs/source/features/speculative-decoding.md` (1 addition, 8 deletions)
For all speculation algorithms, when speculation is enabled, a single sequence of draft tokens with length `max_draft_len` is created for every request. There is currently no way to dynamically disable speculation, so speedups are only observable at low batch sizes.
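`max_draft_len` is set on the speculative config object passed when the engine is created. A hedged sketch using a two-model draft/target setup is shown below; the class and argument names (`DraftTargetDecodingConfig`, `speculative_model_dir`, `speculative_config`) are assumptions based on the LLM API, and the model paths are placeholders:

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import DraftTargetDecodingConfig

# Placeholder model identifiers; substitute your own target and draft models.
speculative_config = DraftTargetDecodingConfig(
    max_draft_len=4,  # one sequence of 4 draft tokens per request
    speculative_model_dir="path/to/draft-model",
)

llm = LLM(model="path/to/target-model", speculative_config=speculative_config)
```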
### Draft/Target
MTP is currently only supported by Deepseek. MTP can be tuned with the following options:
* `relaxed_topk`: The top K tokens are sampled from the target model's logits to create the initial candidate set for relaxed decoding.
* `relaxed_delta`: Used to further filter the top K candidate set for relaxed decoding. We remove tokens `t` for which `log(P(top 1 token)) - log(P(t)) > relaxed_delta`.
```python
from tensorrt_llm.llmapi import MTPDecodingConfig
```
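The rest of this snippet is elided above; a sketch of how these tuning knobs might be passed is shown below. The argument names are assumed from the MTPDecodingConfig reference, and the layer count, delta value, and model name are illustrative:

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import MTPDecodingConfig

# Assumed argument names; check MTPDecodingConfig in your TensorRT-LLM version.
speculative_config = MTPDecodingConfig(
    num_nextn_predict_layers=1,  # number of MTP modules run per step
    use_relaxed_acceptance_for_thinking=True,
    relaxed_topk=10,
    relaxed_delta=0.6,
)

llm = LLM(model="deepseek-ai/DeepSeek-V3", speculative_config=speculative_config)
```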
## Two Model Speculative Decoding Architecture
Two-model based speculation implementations do not support the overlap scheduler; it is disabled automatically.
In this approach, there are two new steps in the `PyExecutor`'s `_executor_loop`.