Set KV cache behavior by providing the optional `kv_cache_config` argument when you create the LLM engine. Consider the quickstart example found in `examples/pytorch/quickstart.py`:
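The quickstart script itself is not reproduced here; the sketch below shows the general shape of such an LLM-API script (the model name and prompts are illustrative, not taken from the actual file):

```python
from tensorrt_llm import LLM, SamplingParams


def main():
    # Illustrative prompts and model; the real quickstart example may differ.
    prompts = [
        "Hello, my name is",
        "The capital of France is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

    # No kv_cache_config is passed, so the default KV cache behavior is used.
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

    for output in llm.generate(prompts, sampling_params):
        print(output.outputs[0].text)


if __name__ == "__main__":
    main()
```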
This example runs with default KV cache properties. The default value for `free_gpu_memory_fraction` is 0.9, which means TensorRT-LLM tries to allocate 90% of free GPU memory (after loading weights) for KV cache. Depending on your use case, this allocation can be too aggressive. You can reduce this value to 0.7 by adding the following lines to the quickstart example:
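The added lines are not shown here; a minimal sketch of the intended change, assuming `KvCacheConfig` from `tensorrt_llm.llmapi`, is:

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

# Cap the KV cache at 70% of free GPU memory instead of the default 90%.
kv_cache_config = KvCacheConfig(free_gpu_memory_fraction=0.7)

# Pass the config when the engine is created.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
          kv_cache_config=kv_cache_config)
```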
You can change block priority by providing the optional `kv_cache_retention_config` argument when you submit a request to the LLM engine. Consider the quickstart example found in `examples/pytorch/quickstart.py`:
The blocks from the prompts are stored for reuse with the default priority of 35, on a scale from 1 to 100 where 100 is the highest priority and 1 is the lowest. Assume you know that the first four tokens of each prompt represent a system prompt that should be stored with high priority (100). You can achieve this by providing a KV cache retention config object when you submit the prompts for generation:
```python
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import KvCacheRetentionConfig
```
This example uses a single `kv_cache_retention_config` object for all the prompts. You can also provide a list, which must have the same length as the list of prompts.
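The rest of the example is elided above; a minimal sketch of how such a retention config might be built and passed at generation time follows. The constructor and keyword names (`token_range_retention_configs`, `token_start`, `token_end`, `priority`) are assumed from the `KvCacheRetentionConfig` API reference and should be verified against your TensorRT-LLM version:

```python
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import KvCacheRetentionConfig

prompts = [  # illustrative prompts
    "Hello, my name is",
    "The capital of France is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Keep the blocks holding the first four prompt tokens at the highest
# priority (100); all other blocks keep the default priority of 35.
retention_config = KvCacheRetentionConfig(
    token_range_retention_configs=[
        KvCacheRetentionConfig.TokenRangeRetentionConfig(
            token_start=0, token_end=4, priority=100),
    ])

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
outputs = llm.generate(prompts, sampling_params,
                       kv_cache_retention_config=retention_config)
for output in outputs:
    print(output.outputs[0].text)
```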
File: `docs/source/features/kvcache.md` (5 additions, 5 deletions)
### Retention Policy
Blocks are assigned priority in line with the [retention policy](https://nvidia.github.io/TensorRT-LLM/llm-api/reference.html#tensorrt_llm.llmapi.KvCacheRetentionConfig) of the request. Blocks with lower priority are freed preferentially to blocks with higher priority. The retention policy is a list of [TokenRangeRetentionConfig](https://nvidia.github.io/TensorRT-LLM/llm-api/reference.html#tensorrt_llm.llmapi.KvCacheRetentionConfig.TokenRangeRetentionConfig) objects, each specifying the priority for a given range of tokens, such as "assign priority X to tokens 10 through 61". You can also assign a duration in milliseconds for this to remain in effect; priority reverts to the default of 35 after `duration_ms` has elapsed from the first time the block was made available for reuse. TokenRangeRetentionConfig applies only to input (prompt) tokens. The property `decode_retention_policy` specifies what priority to assign to blocks with generated (decoded) tokens, and `decode_duration_ms` specifies how long this remains in effect; priority reverts to the default after expiration. Any property that expects a duration can be set to `None`, in which case that part of the retention policy never expires.
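As an illustration of how these fields combine, the sketch below builds a policy that boosts the first 64 prompt tokens for one minute. The field names are taken from the description above and the linked reference; they are assumptions and should be checked against your TensorRT-LLM version:

```python
from tensorrt_llm.llmapi import KvCacheRetentionConfig

TokenRange = KvCacheRetentionConfig.TokenRangeRetentionConfig

# Priority 80 for prompt tokens 0-63, expiring 60 seconds after the blocks
# first become available for reuse; afterwards they revert to the default (35).
policy = KvCacheRetentionConfig(
    token_range_retention_configs=[
        TokenRange(token_start=0, token_end=64, priority=80, duration_ms=60000),
    ])
```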
Not in use: `transfer_mode` is a debug option and should not be used.
See [this example](../examples/kvcacheretentionconfig.md) for how to change block priorities of specific requests by altering their retention policy.
### Speculative Decoding
Reuse across requests is supported by all speculative decoding models. See [speculative decoding](speculative-decoding.md) for more details.
## Limited Attention Window Size
## Controlling KV Cache Behavior
Many of the features in the KV cache system are optional or have user-defined properties that alter how they work. Users can control KV cache features through the [KvCacheConfig](https://nvidia.github.io/TensorRT-LLM/llm-api/reference.html#tensorrt_llm.llmapi.KvCacheConfig) class. The remainder of this section describes how to change the most important behaviors of the KV cache system.
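For example, a config that turns off block reuse and puts a hard cap on cache size might look like the sketch below; the field names `enable_block_reuse` and `max_tokens` follow the KvCacheConfig reference, and the values are illustrative:

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

kv_cache_config = KvCacheConfig(
    enable_block_reuse=False,  # disable cross-request block reuse
    max_tokens=16384,          # upper bound on tokens held in the KV cache
)

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
          kv_cache_config=kv_cache_config)
```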
See [this example](../examples/kvcacheconfig.md) for how to use `KvCacheConfig` to control KV cache behavior.
File: `docs/source/features/speculative-decoding.md` (1 addition, 8 deletions)
For all speculation algorithms, when speculation is enabled, a single sequence of draft tokens with length `max_draft_len` is created for every request. There is currently no way to dynamically disable speculation, so speedups are only observable at low batch sizes.
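`max_draft_len` is set on the speculative config object passed when the engine is created. A hedged sketch using a two-model draft/target setup is shown below; the class and argument names (`DraftTargetDecodingConfig`, `speculative_model_dir`, `speculative_config`) are assumptions based on the LLM API, and the model paths are placeholders:

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import DraftTargetDecodingConfig

# Placeholder model identifiers; substitute your own target and draft models.
speculative_config = DraftTargetDecodingConfig(
    max_draft_len=4,  # one sequence of 4 draft tokens per request
    speculative_model_dir="path/to/draft-model",
)

llm = LLM(model="path/to/target-model", speculative_config=speculative_config)
```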
### Draft/Target
MTP is currently only supported by Deepseek. MTP can be tuned with the following options:
* `relaxed_topk`: The top K tokens are sampled from the target model's logits to create the initial candidate set for relaxed decoding.
* `relaxed_delta`: Used to further filter the top K candidate set for relaxed decoding. We remove tokens `t` for which `log(P(top 1 token)) - log(P(t)) > relaxed_delta`.
```python
from tensorrt_llm.llmapi import MTPDecodingConfig
```
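The rest of this snippet is elided above; a sketch of how these tuning knobs might be passed is shown below. The argument names are assumed from the MTPDecodingConfig reference, and the layer count, delta value, and model name are illustrative:

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import MTPDecodingConfig

# Assumed argument names; check MTPDecodingConfig in your TensorRT-LLM version.
speculative_config = MTPDecodingConfig(
    num_nextn_predict_layers=1,  # number of MTP modules run per step
    use_relaxed_acceptance_for_thinking=True,
    relaxed_topk=10,
    relaxed_delta=0.6,
)

llm = LLM(model="deepseek-ai/DeepSeek-V3", speculative_config=speculative_config)
```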
## Two Model Speculative Decoding Architecture
Two-model based speculation implementations do not support the overlap scheduler; it is disabled automatically.
In this approach, there are two new steps in the `PyExecutor`'s `_executor_loop`.