
Commit 35dac55

[None][doc] Update kvcache part (#7549)
Signed-off-by: nv-guomingz <[email protected]>
Signed-off-by: Wangshanshan <[email protected]>
1 parent f53fb4c commit 35dac55

16 files changed: +72 -62 lines changed

docs/source/advanced/lora.md

Lines changed: 2 additions & 2 deletions
@@ -133,9 +133,9 @@ Next, consider this linear layer is a `RowLinear` layer. When we partition the w
 
 #### DoRA
 
-TensorRT-LLM supports DoRA as described in https://arxiv.org/abs/2402.09353 . To enable DoRA, you must add the additional `--dora_plugin enable` flag to the `trtllm-build` command.
+TensorRT LLM supports DoRA as described in https://arxiv.org/abs/2402.09353 . To enable DoRA, you must add the additional `--dora_plugin enable` flag to the `trtllm-build` command.
 
-The DoRA scales must be normalized before they are submitted to TensorRT-LLM in an inference request. The normalization requires the base model weights. To normalize your adapter you may use the script provided in `tensorrt_llm/examples/dora/normalize_weights.py`.
+The DoRA scales must be normalized before they are submitted to TensorRT LLM in an inference request. The normalization requires the base model weights. To normalize your adapter you may use the script provided in `tensorrt_llm/examples/dora/normalize_weights.py`.
 
 When using DoRA, the format of `LoraWeights` and `LoraConfig` changes slightly.
 The shape of `LoraConfig` becomes `[module_id, layer_idx, adapter_size D (i.e. R value), is_dora]`, with `is_dora` a boolean flag that determines whether the supplied adapter contains DoRA scales or not. If the old config shape is used, it is assumed the adapter does not have DoRA scales.
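For illustration only (not part of this commit), a minimal numpy sketch of per-module `LoraConfig` rows under the DoRA layout described above; the module IDs, layer indices, rank, and dtype are hypothetical placeholder values.

```python
import numpy as np

# Hypothetical LoraConfig rows using the DoRA layout described above:
# [module_id, layer_idx, adapter_size (R value), is_dora]
lora_config = np.array(
    [
        [0, 0, 8, 1],  # module 0, layer 0, rank 8, DoRA scales present
        [0, 1, 8, 1],  # module 0, layer 1, rank 8, DoRA scales present
    ],
    dtype=np.int32,
)

# The older three-column layout (without the is_dora flag) is treated as
# "adapter has no DoRA scales".
legacy_config = lora_config[:, :3]
```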

docs/source/architecture/checkpoint.md

Lines changed: 1 addition & 1 deletion
@@ -169,7 +169,7 @@ Here is the AWQ scaling factors of `mlp.fc` linear layer:
 - `transformer.layers.0.mlp.fc.prequant_scaling_factor`
 
 ```{note}
-The linear weights in TensorRT-LLM checkpoint always follows (`out_feature`, `in_feature`) shape, whereas some quantized linear in TensorRT-LLM implemented by plugin may use (`in_feature`, `out_fature`) shape. The `trtllm-build` command adds a transpose operation to post-process it.
+The linear weights in TensorRT LLM checkpoint always follows (`out_feature`, `in_feature`) shape, whereas some quantized linear in TensorRT LLM implemented by plugin may use (`in_feature`, `out_feature`) shape. The `trtllm-build` command adds a transpose operation to post-process it.
 ```
 
 ### Example
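To make the shape convention in that note concrete, here is a small illustrative numpy sketch (not taken from the checkpoint code; the sizes are arbitrary) of a weight stored as (`out_feature`, `in_feature`) and the transposed view a plugin expecting (`in_feature`, `out_feature`) would consume.

```python
import numpy as np

out_features, in_features = 4096, 1024

# Checkpoint convention: linear weights are stored as (out_feature, in_feature).
checkpoint_weight = np.random.randn(out_features, in_features).astype(np.float16)

# A quantized linear plugin expecting (in_feature, out_feature) consumes the
# transposed tensor; trtllm-build inserts this transpose as a post-processing step.
plugin_weight = checkpoint_weight.T
assert plugin_weight.shape == (in_features, out_features)
```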

docs/source/blogs/tech_blog/blog10_ADP_Balance_Strategy.md

Lines changed: 9 additions & 9 deletions
@@ -1,6 +1,6 @@
 # ADP Balance Strategy
 
-By NVIDIA TensorRT-LLM team
+By NVIDIA TensorRT LLM team
 
 ## Table of Contents
 - [ADP Balance Strategy](#adp-balance-strategy)
@@ -96,7 +96,7 @@ The conventional approach employs a global load balancing strategy that sorts in
 
 <div align="center">
 <figure>
-<img src="./../media/tech_blog10_baseline_round_robin_strategy.png">
+<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog10_baseline_round_robin_strategy.png">
 </figure>
 </div>
 <p align="center"><sub><em>Figure 1: Baseline round-robin strategy balances context request tokens across ranks through sorting and cyclic distribution</em></sub></p>
@@ -179,7 +179,7 @@ We evaluate our approach using a comprehensive dataset comprising 16,000 inferen
 
 <div align="center">
 <figure>
-<img src="./../media/tech_blog10_dataset_token_distribution.png">
+<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog10_dataset_token_distribution.png">
 </figure>
 </div>
 <p align="center"><sub><em>Figure 2: Distribution of input and output token lengths</em></sub></p>
@@ -225,7 +225,7 @@ Figure 3 provides comprehensive insight into baseline system behavior, displayin
 
 <div align="center">
 <figure>
-<img src="./../media/tech_blog10_baseline_performance_overview.png">
+<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog10_baseline_performance_overview.png">
 </figure>
 </div>
 <p align="center"><sub><em>Figure 3: Baseline performance overview showing token distribution and balance ratios across all iterations</em></sub></p>
@@ -239,7 +239,7 @@ Figure 4 zooms into the critical imbalance period [100-12,000], revealing the dr
 
 <div align="center">
 <figure>
-<img src="./../media/tech_blog10_baseline_performance_detail.png">
+<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog10_baseline_performance_detail.png">
 </figure>
 </div>
 <p align="center"><sub><em>Figure 4: Detailed baseline analysis for iterations 100-12,000 showing severe balance fluctuations</em></sub></p>
@@ -260,7 +260,7 @@ The Context Wait mechanism (`timeout_iters=50`) demonstrates the effectiveness o
 
 <div align="center">
 <figure>
-<img src="./../media/tech_blog10_context_wait_performance.png">
+<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog10_context_wait_performance.png">
 </figure>
 </div>
 <p align="center"><sub><em>Figure 5: Context Wait performance showing improved balance stability for iterations 100-12,000</em></sub></p>
@@ -300,7 +300,7 @@ The effectiveness of our complete ADP Balance implementation is clearly demonstr
 
 <div align="center">
 <figure>
-<img src="./../media/tech_blog10_full_strategy_performance.png">
+<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog10_full_strategy_performance.png">
 </figure>
 </div>
 <p align="center"><sub><em>Figure 6: Full ADP Balance strategy demonstrating superior balance stability for iterations 100-12,000</em></sub></p>
@@ -324,7 +324,7 @@ Understanding the performance trade-offs inherent in our ADP Balance strategy is
 
 <div align="center">
 <figure>
-<img src="./../media/tech_blog10_tps_ttft_pareto_curve.png">
+<img src="https://github.com/NVIDIA/TensorRT-LLM/raw/main/docs/source/blogs/media/tech_blog10_tps_ttft_pareto_curve.png">
 </figure>
 </div>
 <p align="center"><sub><em>Figure 7: Pareto frontier analysis showing throughput-latency trade-offs across different ADP Balance configurations</em></sub></p>
@@ -364,4 +364,4 @@ The Pareto frontier analysis provides critical insights for real-world deploymen
 
 ## Acknowledgement
 
-The ADP Balance strategy was a great team effort, covering system performance analysis and optimization. While we cannot thank every contributor individually, we are proud to acknowledge the dedicated team of engineers whose collective expertise has propelled TensorRT-LLM to new heights of performance. Through this collaborative effort, we have gained valuable insights into improving GPU utilization for large language model inference. We hope the techniques and experiences shared in this blog post will empower the developer community to better leverage the performance of NVIDIA GPUs in their mission-critical LLM inference applications.
+The ADP Balance strategy was a great team effort, covering system performance analysis and optimization. While we cannot thank every contributor individually, we are proud to acknowledge the dedicated team of engineers whose collective expertise has propelled TensorRT LLM to new heights of performance. Through this collaborative effort, we have gained valuable insights into improving GPU utilization for large language model inference. We hope the techniques and experiences shared in this blog post will empower the developer community to better leverage the performance of NVIDIA GPUs in their mission-critical LLM inference applications.

docs/source/blogs/tech_blog/blog11_GPT_OSS_Eagle3.md

Lines changed: 4 additions & 4 deletions
@@ -1,4 +1,4 @@
-## Running GPT-OSS-120B with Eagle3 Speculative Decoding on GB200/B200 (TensorRT-LLM)
+## Running GPT-OSS-120B with Eagle3 Speculative Decoding on GB200/B200 (TensorRT LLM)
 
 This guide sets up a production endpoint that uses Eagle3 speculative decoding on NVIDIA GB200 or B200 GPUs only. It replaces the low‑latency flow from the previous guide and intentionally omits max‑throughput, Hopper, and benchmarking content.
 
@@ -17,7 +17,7 @@ Expected directory layout on the host (example):
 └─ eagle/ # Eagle3 speculative decoding assets
 ```
 
-### Get the TensorRT-LLM Container (1.1.0rc0)
+### Get the TensorRT LLM Container (1.1.0rc0)
 
 If required by your environment, log into NGC and pull the image:
 
@@ -30,7 +30,7 @@ docker login nvcr.io
 docker pull nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc0
 ```
 
-### Start the TensorRT-LLM Container
+### Start the TensorRT LLM Container
 
 Run the container and bind-mount your models directory to `/config/models` inside the container:
 
@@ -122,7 +122,7 @@ When `Status: 200` is returned, the endpoint is ready to serve requests.
 
 ### Sample Chat Completions Request
 
-Note: This Eagle3 + TensorRT-LLM endpoint currently supports only greedy sampling. The following Chat Completions parameters are ignored (no-ops): `temperature`, `top_p`, `top_k`, and `seed`.
+Note: This Eagle3 + TensorRT LLM endpoint currently supports only greedy sampling. The following Chat Completions parameters are ignored (no-ops): `temperature`, `top_p`, `top_k`, and `seed`.
 
 Send a simple OpenAI-compatible Chat Completions request to the running server:
 
docs/source/deployment-guide/index.rst

Lines changed: 1 addition & 0 deletions
@@ -9,3 +9,4 @@ Model Recipes
 quick-start-recipe-for-deepseek-r1-on-trtllm.md
 quick-start-recipe-for-llama3.3-70b-on-trtllm.md
 quick-start-recipe-for-llama4-scout-on-trtllm.md
+quick-start-recipe-for-gpt-oss-on-trtllm.md

docs/source/deployment-guide/quick-start-recipe-for-llama4-scout-on-trtllm.md

Lines changed: 5 additions & 5 deletions
@@ -49,11 +49,11 @@ Note:
 * The command also maps port `8000` from the container to your host so you can access the LLM API endpoint from your host
 * See the <https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags> for all the available containers. The containers published in the main branch weekly have `rcN` suffix, while the monthly release with QA tests has no `rcN` suffix. Use the `rc` release to get the latest model and feature support.
 
-If you want to use latest main branch, you can choose to build from source to install TensorRT-LLM, the steps refer to <https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html>.
+If you want to use the latest main branch, you can build TensorRT LLM from source; for the steps, refer to [https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html](https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html).
 
 ### Creating the TRT-LLM Server config
 
-We create a YAML configuration file `/tmp/config.yml` for the TensorRT-LLM Server and populate it with the following recommended performance settings.
+We create a YAML configuration file `/tmp/config.yml` for the TensorRT LLM Server and populate it with the following recommended performance settings.
 
 ```shell
 EXTRA_LLM_API_FILE=/tmp/config.yml
@@ -108,7 +108,7 @@ These options are used directly on the command line when you start the `trtllm-s
 
 #### `--backend pytorch`
 
-* **Description:** Tells TensorRT-LLM to use the **pytorch** backend.
+&emsp;**Description:** Tells TensorRT LLM to use the **pytorch** backend.
 
 #### `--max_batch_size`
 
@@ -124,7 +124,7 @@ These options are used directly on the command line when you start the `trtllm-s
 
 #### `--trust_remote_code`
 
-* **Description:** Allows TensorRT-LLM to download models and tokenizers from Hugging Face. This flag is passed directly to the Hugging Face API.
+&emsp;**Description:** Allows TensorRT LLM to download models and tokenizers from Hugging Face. This flag is passed directly to the Hugging Face API.
 
 
 #### Extra LLM API Options (YAML Configuration)
@@ -264,7 +264,7 @@ Sample result in Blackwell
 
 ## Benchmarking Performance
 
-To benchmark the performance of your TensorRT-LLM server you can leverage the built-in `benchmark_serving.py` script. To do this first creating a wrapper `bench.sh` script.
+To benchmark the performance of your TensorRT LLM server you can leverage the built-in `benchmark_serving.py` script. To do this, first create a wrapper `bench.sh` script.
 
 ```shell
 cat <<EOF > bench.sh

docs/source/developer-guide/perf-benchmarking.md

Lines changed: 2 additions & 2 deletions
@@ -7,8 +7,8 @@ This benchmarking suite is a work in progress.
 Expect breaking API changes.
 ```
 
-TensorRT-LLM provides the `trtllm-bench` CLI, a packaged benchmarking utility that aims to make it
-easier for users to reproduce our officially published [performance overiew](./perf-overview.md#throughput-measurements). `trtllm-bench` provides the follows:
+TensorRT LLM provides the `trtllm-bench` CLI, a packaged benchmarking utility that aims to make it
+easier for users to reproduce our officially published [performance overview](./perf-overview.md#throughput-measurements). `trtllm-bench` provides the follows:
 
 - A streamlined way to build tuned engines for benchmarking for a variety of models and platforms.
 - An entirely Python workflow for benchmarking.
Lines changed: 14 additions & 8 deletions
@@ -1,9 +1,11 @@
-# How To Change KV Cache Behavior
+# How to Change KV Cache Behavior
 
-KV cache behavior is set by providing the optional argument ```kv_cache_config``` when LLM engine is created. Consider the quickstart example (found in examples/pytorch/quickstart.py):
+Set KV cache behavior by providing the optional ```kv_cache_config``` argument when you create the LLM engine. Consider the quickstart example found in ```examples/pytorch/quickstart.py```:
 
-```
+```python
 from tensorrt_llm import LLM, SamplingParams
+
+
 def main():
     prompts = [
         "Hello, my name is",
@@ -12,30 +14,34 @@ def main():
         "The future of AI is",
     ]
     sampling_params = SamplingParams(max_tokens=32)
+
     llm = LLM(model='TinyLlama/TinyLlama-1.1B-Chat-v1.0')
     outputs = llm.generate(prompts, sampling_params)
+
     for i, output in enumerate(outputs):
         prompt = output.prompt
         generated_text = output.outputs[0].text
         print(f"[{i}] Prompt: {prompt!r}, Generated text: {generated_text!r}")
+
+
 if __name__ == '__main__':
     main()
 ```
 
-This example runs with default KV cache properties. The default for ```free_gpu_memory_fraction``` is 0.9, which means TensorRT-LLM will try to allocate 90% of free GPU memory for KV cache. Depending on your system, this may be too aggressive, so you decide to dial that back to 0.7. This is done by adding the following lines to the quickstart example:
+This example runs with default KV cache properties. The default value for `free_gpu_memory_fraction` is 0.9, which means TensorRT-LLM tries to allocate 90% of free GPU memory (after loading weights) for KV cache. Depending on your use case, this allocation can be too aggressive. You can reduce this value to 0.7 by adding the following lines to the quickstart example:
 
-```
+```python
 from tensorrt_llm.llmapi import KvCacheConfig
 kv_cache_config = KvCacheConfig(free_gpu_memory_fraction=0.7)
 llm = LLM(model='TinyLlama/TinyLlama-1.1B-Chat-v1.0', kv_cache_config=kv_cache_config)
 ```
 
-You can also set properties after you create KvCacheConfig, for instance
+You can also set properties after you create ```KvCacheConfig```. For example:
 
-```
+```python
 kv_cache_config = KvCacheConfig()
 kv_cache_config.enable_block_reuse = False
 llm = LLM(model='TinyLlama/TinyLlama-1.1B-Chat-v1.0', kv_cache_config=kv_cache_config)
 ```
 
-will disable block reuse for the quickstart example.
+This code disables block reuse for the quick start example.
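Putting the two snippets from the updated page together, a self-contained sketch (assuming the same TinyLlama checkpoint used in the quick start) that caps the KV cache at 70% of free GPU memory and disables block reuse could look like this:

```python
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import KvCacheConfig


def main():
    prompts = ["Hello, my name is", "The future of AI is"]
    sampling_params = SamplingParams(max_tokens=32)

    # Combine both settings shown above: constructor argument plus attribute assignment.
    kv_cache_config = KvCacheConfig(free_gpu_memory_fraction=0.7)
    kv_cache_config.enable_block_reuse = False

    llm = LLM(model='TinyLlama/TinyLlama-1.1B-Chat-v1.0', kv_cache_config=kv_cache_config)
    outputs = llm.generate(prompts, sampling_params)
    for i, output in enumerate(outputs):
        print(f"[{i}] Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")


if __name__ == '__main__':
    main()
```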
Lines changed: 19 additions & 6 deletions
@@ -1,9 +1,11 @@
-# How To Change Block Priorities
+# How to Change Block Priorities
 
-Block priority can be changed by providing the optional argument ```kv_cache_retention_config``` when a request is submitted to LLM engine. Consider the quickstart example (found in examples/pytorch/quickstart.py):
+You can change block priority by providing the optional ```kv_cache_retention_config``` argument when you submit a request to the LLM engine. Consider the quick start example found in ```examples/pytorch/quickstart.py```:
 
-```
+```python
 from tensorrt_llm import LLM, SamplingParams
+
+
 def main():
     prompts = [
         "Hello, my name is",
@@ -12,21 +14,27 @@ def main():
         "The future of AI is",
     ]
     sampling_params = SamplingParams(max_tokens=32)
+
     llm = LLM(model='TinyLlama/TinyLlama-1.1B-Chat-v1.0')
     outputs = llm.generate(prompts, sampling_params)
+
     for i, output in enumerate(outputs):
         prompt = output.prompt
         generated_text = output.outputs[0].text
         print(f"[{i}] Prompt: {prompt!r}, Generated text: {generated_text!r}")
+
+
 if __name__ == '__main__':
     main()
 ```
 
-The blocks from the prompts will be stored for reuse with the default priority of 35 (on a scale from 1 to 100 where 100 is highest and 1 is lowest priority). Assume you know that the first four tokens of each prompt is a system prompt that should be stored with high priority (100). You do this by providing a kv cache retention config object when you submit the prompts for generation:
+The blocks from the prompts are stored for reuse with the default priority of 35 on a scale from 1 to 100, where 100 is highest priority and 1 is lowest priority. Assume you know that the first four tokens of each prompt represent a system prompt that should be stored with high priority (100). You can achieve this by providing a KV cache retention config object when you submit the prompts for generation:
 
-```
+```python
 from tensorrt_llm import LLM, SamplingParams
 from tensorrt_llm.llmapi import KvCacheRetentionConfig
+
+
 def main():
     prompts = [
         "Hello, my name is",
@@ -35,7 +43,9 @@ def main():
         "The future of AI is",
     ]
     sampling_params = SamplingParams(max_tokens=32)
+
     llm = LLM(model='TinyLlama/TinyLlama-1.1B-Chat-v1.0')
+
     # Set priority for first 4 prompt tokens to 100. All other tokens set to default (35) priority.
     # This policy never lapses.
     tokenRangeRetentionConfig = KvCacheRetentionConfig.TokenRangeRetentionConfig(0, 4, 100, None)
@@ -44,12 +54,15 @@ def main():
         decode_retention_priority=35, # Set generated tokens to default priority
         decode_duration_ms=None)
     outputs = llm.generate(prompts, sampling_params, kv_cache_retention_config=kv_cache_retention_config)
+
     for i, output in enumerate(outputs):
         prompt = output.prompt
         generated_text = output.outputs[0].text
         print(f"[{i}] Prompt: {prompt!r}, Generated text: {generated_text!r}")
+
+
 if __name__ == '__main__':
     main()
 ```
 
-Here we used a single kv_cache_retention_config object for all the prompts. Alternatively, you can also provide a list, the list must have the same length as the list of prompts.
+This example uses a single ```kv_cache_retention_config``` object for all the prompts. You can also provide a list that must have the same length as the list of prompts.
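The closing sentence notes that a list of retention configs can be passed instead of a single object. A minimal sketch of that variant follows; it reuses only the constructor arguments visible in the diff above, and because the keyword that takes the token-range list is not shown in this hunk, the name `token_range_retention_configs` is an assumption here.

```python
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import KvCacheRetentionConfig


def main():
    prompts = [
        "Hello, my name is",
        "The future of AI is",
    ]
    sampling_params = SamplingParams(max_tokens=32)
    llm = LLM(model='TinyLlama/TinyLlama-1.1B-Chat-v1.0')

    # One retention config per prompt; the list length must match the prompt list.
    # First prompt: keep its first four tokens at the highest priority (100).
    # Second prompt: keep its first four tokens at a lower-than-default priority (20).
    def make_config(priority):
        return KvCacheRetentionConfig(
            token_range_retention_configs=[  # assumed keyword, see note above
                KvCacheRetentionConfig.TokenRangeRetentionConfig(0, 4, priority, None)],
            decode_retention_priority=35,  # generated tokens keep the default priority
            decode_duration_ms=None)

    retention_configs = [make_config(100), make_config(20)]

    outputs = llm.generate(prompts, sampling_params,
                           kv_cache_retention_config=retention_configs)
    for i, output in enumerate(outputs):
        print(f"[{i}] Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")


if __name__ == '__main__':
    main()
```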

docs/source/features/disagg-serving.md

Lines changed: 3 additions & 1 deletion
@@ -212,6 +212,8 @@ There are some other useful environment variables that may help when encounterin
 
 * `NCCL_GRAPH_MIXING_SUPPORT`: With the default value `1`, the CUDA driver may create too many CUDA streams while working with one CUDA graph, leading to performance drop. Setting it to `0` will reduce the number of CUDA streams, but please make sure there are no other NCCL ops outside the one CUDA graph, otherwise it's unsafe.
 
+* `UCX_MAX_RNDV_RAILS`: With the default value `2`, UCX attempts to use two InfiniBand (IB) NIC devices per GPU for Rendezvous (RNDV) transfers. When both the context and generation instances enable tensor- and expert-parallel (TEP), multiple TP ranks may transfer KV cache concurrently. Because each TP rank can use up to two NIC devices, some NIC devices can be shared across GPUs, causing contention and reduced throughput. Setting `UCX_MAX_RNDV_RAILS=1` can reduce contention in this case.
+
 ## Troubleshooting and FAQ
 
 ### General FAQs
@@ -254,7 +256,7 @@ A. Yes, TRT-LLM supports using GPU direct RDMA for inter-node KV cache transfer.
 
 A. The communication for kvCache transfer between executors are established dynamically. The connection establishment process incurs significant overhead, which explains the apparently lower kvCache transfer bandwidth observed during the initial requests after service startup. This lower bandwidth reflects the inclusion of connection establishment overhead. When conducting benchmarks, it is recommended to perform a warm-up phase to ensure accurate performance measurements.
 
-*Q. When my servers are running on different NVLink domains, some servers hang or have a lower performance. How to fix that?
+*Q. When my servers are running on different NVLink domains, some servers hang or have a lower performance. How to fix that?*
 
 A. NVLink domain can be found with `nvidia-smi -q` in the `Fabric.ClusterUUID` field. A few UCX environment variables can be adjusted when your servers have different NVLink domains:
 
