[None][doc] Rename TensorRT-LLM to TensorRT LLM. #7554
Conversation
📝 Walkthrough
Documentation-wide rename: all occurrences of “TensorRT-LLM” → “TensorRT LLM” across many docs, plus targeted doc additions (attention backends, Hybrid ETP, FP4/FP8 quantization notes, GPU benchmarking prep, example YAML/here-doc). No code, API, or runtime logic changes.
Sequence Diagram(s)
sequenceDiagram
autonumber
participant User
participant Config as PyTorchConfig
participant Runtime as TRTLLM Runtime
participant Backend as Attention Backend
rect rgb(240,248,255)
User->>Config: set attn_backend (Vanilla/FlashInfer/Trtllm)
end
Config->>Runtime: provide config at init
Runtime->>Backend: select & initialize backend
Backend-->>Runtime: backend ready
Runtime->>Backend: execute attention (query, key, value)
Backend-->>Runtime: attention output
Runtime->>User: inference results
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Actionable comments posted: 9
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (11)
docs/source/features/speculative-decoding.md (4)
52-52
: Straggler: replace “TRT-LLM” with “TensorRT LLM”.This PR aims for consistent rebrand; keep the repo slug as-is in URLs only.
-TRT-LLM supports a modified version of the algorithm presented in the paper: ... +TensorRT LLM supports a modified version of the algorithm presented in the paper: ...
75-75
: Straggler: replace “TRT-LLM” with “TensorRT LLM”.-When the NGram algorithm is used, TRT-LLM will maintain a map from token prefixes... +When the NGram algorithm is used, TensorRT LLM will maintain a map from token prefixes...
116-117
: Clean up dangling list refs “7.)” and “4.)” after links.They read like copy artifacts and hinder readability.
-* `drafter`: An object of type `Drafter` that implements the `prepare_draft_tokens` method (see [Developer Guide](speculative-decoding.md#developer-guide) 7.) -* `resource_manager`: An optional `ResourceManager` object (see [Developer Guide](speculative-decoding.md#developer-guide) 4.) +* `drafter`: An object of type `Drafter` that implements the `prepare_draft_tokens` method (see [Developer Guide](speculative-decoding.md#developer-guide)). +* `resource_manager`: An optional `ResourceManager` object (see [Developer Guide](speculative-decoding.md#developer-guide)).
237-239
: Typo: “drat tokens” → “draft tokens”.User-facing docs—fix spelling.
-... where `d1`, `d2`, and `d3` are drat tokens. Suppose the token after `t` is `d1` +... where `d1`, `d2`, and `d3` are draft tokens. Suppose the token after `t` is `d1`
docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md (1)
33-44
: Update NGC container tag to a published release
The image `nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc6` isn’t available on NGC (NGC only lists up to `1.0.0rc4`, and no GA `1.0.0` tag exists yet). Change the tag to the latest published rc (e.g., `release:1.0.0rc4`) or add a note instructing users to verify and update the tag once `1.0.0rc6` or a GA release appears on NGC.
docs/source/features/attention.md (2)
22-36
: Fix branding and casing: “TRT-LLM backend” → “TensorRT LLM backend”; standardize FlashInfer casing.This PR’s goal is the rebrand. These lines still use the old “TRT-LLM” name and mixed “Flashinfer/FlashInfer”. Align terminology and keep the string value attn_backend="flashinfer" as-is.
-There are currently three available attention backends: the vanilla backend, the TRT-LLM backend, and the Flashinfer backend. +There are currently three available attention backends: the Vanilla backend, the TensorRT LLM backend, and the FlashInfer backend. -You can specify the desired attention backend using `PyTorchConfig.attn_backend`. For instance, to utilize the Flashinfer backend, you can pass `attn_backend="flashinfer"` to the `LLM` constructor as follows: `LLM(attn_backend="flashinfer")`. This will enable the use of the Flashinfer backend for your model. +You can specify the desired attention backend using `PyTorchConfig.attn_backend`. For instance, to use the FlashInfer backend, pass `attn_backend="flashinfer"` to the `LLM` constructor, e.g., `LLM(attn_backend="flashinfer")`. -The vanilla backend, `VanillaAttention`, is a reference implementation designed primarily for inflight batching and linear KV cache support. While it serves as a useful baseline, it is not recommended for production use due to its limited optimizations. +The Vanilla backend, `VanillaAttention`, is a reference implementation designed primarily for inflight batching and linear KV cache support. While it serves as a useful baseline, it is not recommended for production use due to limited optimizations. -In contrast, the Flashinfer backend, `FlashInferAttention`, is performance-optimized and supports both inflight batching and paged KV cache. It also includes the following advanced features: +In contrast, the FlashInfer backend, `FlashInferAttention`, is performance-optimized and supports both inflight batching and paged KV cache. It also includes the following advanced features: -The TRT-LLM backend, `TrtllmAttention`, serves as the default backend and supports all the features available in the Flashinfer backend while being further optimized for enhanced performance. +The TensorRT LLM backend, `TrtllmAttention`, serves as the default backend and supports all the features available in the FlashInfer backend while being further optimized for performance.
22-24
: Sync docs with actual `attn_backend` values
docs/source/features/attention.md (22–24): replace “three available backends: vanilla, TRT-LLM, Flashinfer” with the complete set of supported identifiers—`"TRTLLM"` (default), `"FLASHINFER"`, `"FLASHINFER_STAR_ATTENTION"`, `"VANILLA"`, and (in auto-deploy) `"triton"` and `"torch"`.
docs/source/blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.md (1)
88-103
: Avoid shadowing the installed wheel with PYTHONPATH.
Setting PYTHONPATH to repo root can override the installed wheel. Prefer leaving it unset or appending.
-pip install --user build/tensorrt_llm*.whl -export PATH=${HOME}/.local/bin:${PATH} -export PYTHONPATH=`pwd` +pip install --user build/tensorrt_llm*.whl +export PATH="${HOME}/.local/bin:${PATH}" +# Optional: only append repo if you intentionally want local sources first +# export PYTHONPATH="$(pwd):${PYTHONPATH}"
docs/source/architecture/checkpoint.md (1)
176-185
: Output dir shows TP=2 but example path lists 1‑gpu later.
Mismatch between the convert command (`--tp_size 2`, `2-gpu/`) and the directory listing below (`1-gpu/`). Correct to `2-gpu/`.
-./opt/125M/trt_ckpt/fp16/1-gpu/ +./opt/125M/trt_ckpt/fp16/2-gpu/
docs/source/features/parallel-strategy.md (2)
72-81
: YAML example combines mutually exclusive configs; copy-paste will be invalid.
Split into two separate blocks or comment one out to avoid duplicate keys.-```bash -cat <<EOF > parallel_config.yaml -# TP-8 -tensor_parallel_size: 8 -enable_attention_dp: false # default -# DP-8 -tensor_parallel_size: 8 -enable_attention_dp: true -EOF -``` +```yaml +# TP-8 (choose one block) +tensor_parallel_size: 8 +enable_attention_dp: false +``` + +```yaml +# DP-8 (alternative) +tensor_parallel_size: 8 +enable_attention_dp: true +```
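For completeness, a sketch of how a reader would apply one of the split variants. This is illustrative only: the `--extra_llm_api_options` and `--backend pytorch` flags follow the deployment guides referenced elsewhere in this review, and the model argument is a placeholder.

```bash
# Sketch only: pick ONE of the YAML variants above and pass it via the extra-options file.
cat <<EOF > parallel_config.yaml
# TP-8
tensor_parallel_size: 8
enable_attention_dp: false
EOF

trtllm-serve <model-path-or-hf-id> \
  --backend pytorch \
  --extra_llm_api_options parallel_config.yaml
```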
103-118
: Same issue for MoE config: present as separate options.
Avoid conflicting keys in a single YAML.-```bash -cat <<EOF > parallel_config.yaml -# TP only -tensor_parallel_size: 8 -moe_tensor_parallel_size: 8 - -# EP only -tensor_parallel_size: 8 -moe_expert_parallel_size: 8 - -# Hybrid (TP-4 × EP-2) -tensor_parallel_size: 8 # 4 × 2 -moe_tensor_parallel_size: 4 -moe_expert_parallel_size: 2 -EOF -``` +```yaml +# TP only +tensor_parallel_size: 8 +moe_tensor_parallel_size: 8 +``` + +```yaml +# EP only +tensor_parallel_size: 8 +moe_expert_parallel_size: 8 +``` + +```yaml +# Hybrid (TP-4 × EP-2) +tensor_parallel_size: 8 # 4 × 2 +moe_tensor_parallel_size: 4 +moe_expert_parallel_size: 2 +```
♻️ Duplicate comments (1)
docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md (1)
322-323
: Final branding pass for sample output line.-Sample TensorRT LLM serving benchmark output. Your results may vary due to ongoing software optimizations. +Sample TensorRT LLM serving benchmark output. Your results may vary due to ongoing software optimizations.(Ensure “TensorRT LLM” matches elsewhere; if already updated, ignore.)
🧹 Nitpick comments (99)
docs/source/developer-guide/perf-analysis.md (1)
7-7
: Grammar/clarity tweak in intro sentence.Fix “Large Languages Models” → “large language models” and tighten phrasing.
-Given the potential long runtimes of Large Languages Models (LLMs) and the diversity of workloads a model may experience during a single inference pass or binary execution, NVIDIA has added features to TensorRT LLM to get the most out of Nsight Systems capabilities. This document outlines those features as well as provides examples of how to best utilize them to understand your application. +Given the potentially long runtimes of large language models (LLMs) and the diversity of workloads a model may experience during a single inference pass or binary execution, NVIDIA has added features to TensorRT LLM to make the most of Nsight Systems. This document outlines those features and provides examples of how to use them to understand your application.docs/source/examples/customization.md (1)
5-5
: Polish quantization paragraph for correctness and flow.Minor grammar and style cleanup; no semantic change.
-TensorRT LLM can quantize the Hugging Face model automatically. By setting the appropriate flags in the `LLM` instance. For example, to perform an Int4 AWQ quantization, the following code triggers the model quantization. Please refer to complete list of [supported flags](https://nvidia.github.io/TensorRT-LLM/_modules/tensorrt_llm/quantization/mode.html#QuantAlgo) and acceptable values. +TensorRT LLM can automatically quantize Hugging Face models by setting the appropriate flags on the `LLM` instance. For example, to perform INT4 AWQ quantization, use the following code. See the complete list of [supported flags](https://nvidia.github.io/TensorRT-LLM/_modules/tensorrt_llm/quantization/mode.html#QuantAlgo) and acceptable values.docs/source/features/checkpoint-loading.md (1)
19-19
: Tighten phrasing (“model’s definition”).Use a more natural construction.
-- **Weight Mappers**: Map and transform loaded weights to TensorRT LLM model's definition +- **Weight Mappers**: Map and transform loaded weights to the TensorRT LLM model definitiondocs/source/examples/kvcacheconfig.md (1)
25-25
: Minor wording improvement for default KV cache description.Improves readability; keeps meaning intact.
-This example runs with default KV cache properties. The default for ```free_gpu_memory_fraction``` is 0.9, which means TensorRT LLM will try to allocate 90% of free GPU memory for KV cache. Depending on your system, this may be too aggressive, so you decide to dial that back to 0.7. This is done by adding the following lines to the quickstart example: +This example runs with the default KV cache properties. By default, ```free_gpu_memory_fraction``` is 0.9, which means TensorRT LLM attempts to allocate 90% of the free GPU memory for the KV cache. Depending on your system, this may be too aggressive, so you might dial that back to 0.7 by adding the following lines to the quickstart example:docs/source/torch.md (1)
9-9
: Simplify version statement to avoid drift
Change “available in version 0.17 and later” to “available since v0.17” (or confirm the exact first-supported release) — docs/source/torch.md:9
docs/source/features/speculative-decoding.md (1)
128-146
: YAML option name consistency check. You use `decoding_type: Eagle` but the text calls it “EAGLE 3”. Consider using “Eagle” vs “EAGLE 3” consistently to avoid user confusion.
docs/source/features/multi-modality.md (3)
3-3
: Conciseness: “a variety of” → “many”.-TensorRT LLM supports a variety of multimodal models, enabling efficient inference with inputs beyond just text. +TensorRT LLM supports many multimodal models, enabling efficient inference with inputs beyond just text.
41-41
: Simplify relative path to docs command page.Path currently walks outside and back into source. Prefer a simpler relative path.
-### OpenAI-Compatible Server via [`trtllm-serve`](../../source/commands/trtllm-serve/trtllm-serve.rst) +### OpenAI-Compatible Server via [`trtllm-serve`](../commands/trtllm-serve/trtllm-serve.rst)
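If a concrete request helps alongside the link fix, a minimal call against the OpenAI-compatible endpoint could look like the sketch below. Port and model name are placeholders; multimodal payloads additionally use the OpenAI `image_url` content format for models that support image input.

```bash
# Minimal sketch: query a running trtllm-serve instance through its OpenAI-compatible API.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "<served-model-name>",
        "messages": [{"role": "user", "content": "Describe the attached image in one sentence."}],
        "max_tokens": 64
      }'
```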
35-35
: Use consistent “LLM API” style & verify `source:` link role
- Replace backtick-quoted `LLM-API` with `LLM API` to match the project’s terminology.
- Confirm that the existing `source:` link role resolves correctly in the Sphinx build (it’s used extensively elsewhere).
docs/source/dev-on-cloud/build-image-to-dockerhub.md (1)
6-6
: Subheading branding fix LGTM; also “DockerHub” → “Docker Hub”.-## Build the TensorRT LLM Docker Image and Upload to DockerHub +## Build the TensorRT LLM Docker Image and Upload to Docker Hubdocs/source/installation/linux.md (3)
24-29
: Fix typo in tip block (“preqrequisites” → “prerequisites”).
Small spelling issue in the tip.Apply:
- Instead of manually installing the preqrequisites as described + Instead of manually installing the prerequisites as described
49-49
: Tighten grammar for “pip install” sentence.
Current: “There are some known limitations when you pip install pre-built TensorRT LLM wheel package.”Use one of:
-There are some known limitations when you pip install pre-built TensorRT LLM wheel package. +There are some known limitations when installing the pre-built TensorRT LLM wheel via pip.
53-55
: Polish Slurm/MPI guidance (capitalization, clarity).
Improve readability and correctness.- If you encounter an error while running TensorRT LLM in a Slurm-managed cluster, you need to reconfigure the MPI installation to work with Slurm. - The setup methods depends on your slurm configuration, pls check with your admin. This is not a TensorRT LLM specific, rather a general mpi+slurm issue. + If you encounter an error while running TensorRT LLM in a Slurm-managed cluster, you may need to reconfigure the MPI installation to work with Slurm. + The setup method depends on your Slurm configuration; please check with your administrator. This is not specific to TensorRT LLM—it is a general MPI + Slurm issue.docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md (6)
53-54
: Grammar: “the latest main branch” + phrasing.-If you want to use latest main branch, you can choose to build from source to install TensorRT LLM, the steps refer to ... +If you want to use the latest main branch, build from source following ...
55-59
: Consistency: drop hyphen in “TRT-LLM Server”.
Use “TensorRT LLM Server” (or “TRT LLM Server”) to align with rebrand.-### Creating the TRT-LLM Server config +### Creating the TensorRT LLM Server config -... for the TensorRT LLM Server ... +... for the TensorRT LLM server ...
72-89
: Consistency: also update subsequent “TRT-LLM server” mentions.
There are a few remaining hyphenated instances.-Below is an example command to launch the TRT-LLM server +Below is an example command to launch the TensorRT LLM server -After the TRT-LLM server is set up +After the TensorRT LLM server is set up
110-114
: Capitalize “PyTorch”.
Brand capitalization.- **Description:** Tells TensorRT LLM to use the **pytorch** backend. + **Description:** Tells TensorRT LLM to use the **PyTorch** backend.
187-209
: Consistency: remaining “TRT-LLM server” in Testing section.
Align with “TensorRT LLM server”.-Start a new terminal on the host to test the TensorRT LLM server you just launched. +Start a new terminal on the host to test the TensorRT LLM server you just launched. -Here is an example response, showing that the TRT-LLM server returns +Here is an example response, showing that the TensorRT LLM server returns
274-275
: Grammar: fix phrasing for benchmarking wrapper step.-To benchmark the performance of your TensorRT LLM server you can leverage the built-in “benchmark_serving.py” script. To do this first creating a wrapper bench.sh script. +To benchmark your TensorRT LLM server, use the built-in “benchmark_serving.py” script. First, create a simple wrapper script (bench.sh).docs/source/features/quantization.md (4)
5-5
: Grammar: “used to reduces” → “used to reduce”.-Quantization is a technique used to reduces memory footprint and computational cost +Quantization is a technique used to reduce memory footprint and computational cost
62-66
: Rename “Model Supported Matrix” → “Model Support Matrix”.
Common phrasing.-## Model Supported Matrix +## Model Support Matrix
88-91
: Spacing and clarity in multimodal note.-The vision component of multi-modal models(BLIP2-OPT/BLIP2-T5/LLaVA/VILA/Nougat) uses FP16 by default. +The vision component of multi-modal models (BLIP2-OPT/BLIP2-T5/LLaVA/VILA/Nougat) uses FP16 by default.
103-105
: Hyphenation in note.-FP8 block wise scaling GEMM kernels +FP8 block-wise scaling GEMM kernelsdocs/source/quick-start-guide.md (2)
97-106
: Minor polish and consistency.
- Keep “TensorRT LLM” consistently capitalized in bullets.
- CLI bullets look correct; no rename needed for tool names.
If desired, add terminal periods for parallel punctuation across bullets.
10-12
: Pin container tag for reproducibility
Use the latest GA release tag (0.18.2) instead of the floatingrelease
tag to avoid breaking changes when it moves. (nvidia.github.io)docker run --ipc host --gpus all -p 8000:8000 -it nvcr.io/nvidia/tensorrt-llm/release:0.18.2docs/source/features/attention.md (4)
158-161
: Tighten wording.Use “can” instead of “is able to” and remove filler.
-The generation phase is implemented using a single kernel called the masked -multi-head attention in TensorRT LLM. That kernel is able to apply -pre-processing on the Q, K, and V elements on-the-fly: it adds the QKV bias, applies -RoPE, and performs dequantization and quantization. TensorRT LLM will continue to add (or -enable) additional features in future releases, such as enabling support for IA3. +The generation phase uses a single masked multi-head attention kernel in TensorRT LLM. +The kernel can apply preprocessing on Q, K, and V on the fly: it adds QKV bias, applies +RoPE, and performs dequantization and quantization. Additional features (e.g., IA3) may be added in future releases.
205-212
: Unify “inflight” vs “in-flight”.This file uses both. Prefer one form (project docs often use “inflight”).
-### In-flight Batching -TensorRT LLM supports in-flight batching of requests (also known as continuous +### Inflight Batching +TensorRT LLM supports inflight batching of requests (also known as continuous batching or iteration-level batching) for higher serving throughput. With this feature,
287-291
: Capitalize KV consistently and fix grammar.-TensorRT LLM has a feature called `Cyclic KV Cache`, which treats the kv cache -as a circular buffer. This means that it only stores the kv cache for the last N +TensorRT LLM has a feature called `Cyclic KV Cache`, which treats the KV cache +as a circular buffer. This means that it stores the KV cache for only the last N tokens, where N is determined by the `attention_window_size` parameter in `TrtllmAttention.forward`. When the cache is full, new tokens’ kv cache will -overwrite the "least recently used" caches. +overwrite the least recently used entries.
307-309
: Minor grammar tweak.-Similar to the cyclic KV cache feature in TensorRT LLM, `attention_window_size` -parameter is used to determine `N`. Different from the cyclic KV cache feature, +Similar to the cyclic KV cache feature in TensorRT LLM, the `attention_window_size` +parameter determines `N`. Unlike the cyclic KV cache feature,docs/source/architecture/add-model.md (1)
55-56
: Polish grammar in the weight-conversion intro.-The weights from source framework need to be converted and bound to the new added TensorRT LLM model. Here is an example of converting HuggingFace weights: +The weights from the source framework need to be converted and bound to the newly added TensorRT LLM model. Here is an example of converting HuggingFace weights:docs/source/blogs/tech_blog/blog6_Llama4_maverick_eagle_guide.md (3)
1-6
: Standardize Eagle naming (use “Eagle-3” consistently) and trim extra space.Matches usage elsewhere in the docs and improves consistency.
-# How to launch Llama4 Maverick + Eagle3 TensorRT LLM server +# How to launch Llama 4 Maverick + Eagle-3 TensorRT LLM server - -Artificial Analysis has benchmarked the Llama4 Maverick with Eagle3 enabled TensorRT LLM server running at over [...] +Artificial Analysis has benchmarked the Llama 4 Maverick with Eagle-3–enabled TensorRT LLM server running at over [...] - -In the guide below, we will walk you through how to launch your own high-performance Llama4 Maverick with Eagle3 enabled TensorRT LLM server, from build to deployment. (Note that your specific performance numbers may vary—speculative decoding speedups depend upon the dataset!) +In the guide below, we walk through how to launch your own high-performance Llama 4 Maverick with Eagle-3–enabled TensorRT LLM server, from build to deployment. (Note that your specific performance numbers may vary—speculative decoding speedups depend on the dataset.)
39-44
: Container tag guidance: prefer published NGC tags over floating “latest”.To avoid readers pulling unpublished or mismatched images, recommend an explicit published tag.
-#### Option 1. Use weekly release NGC docker image -TensorRT LLM provides weekly release [docker image](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release) +#### Option 1. Use weekly release NGC Docker image +TensorRT LLM provides weekly release [Docker images](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release). Use a published tag (for example, `:24.09`) instead of `:latest`.
60-66
: Mirror Eagle-3 naming in Step 4 note.-**Important:** Replace `/path/to/maverick` and `/path/to/eagle` with the actual paths to your Maverick and Eagle3 model checkpoints on your host machine, downloaded in the [Download Artifacts](#download-artifacts) stage +**Important:** Replace `/path/to/maverick` and `/path/to/eagle` with the actual paths to your Maverick and Eagle-3 model checkpoints on your host, downloaded in the [Download Artifacts](#download-artifacts) stage.docs/source/overview.md (2)
49-54
: Minor punctuation: add missing space after comma; consistent series formatting.-- **NVIDIA Hopper**: H100, H200,GH200 with FP8 acceleration +- **NVIDIA Hopper**: H100, H200, GH200 with FP8 acceleration
57-58
: Add missing space after period.-Whether you're building the next generation of AI applications, optimizing existing LLM deployments, or exploring the frontiers of large language model technology, TensorRT LLM provides the tools, performance, and flexibility you need to succeed in the era of generative AI.To get started, refer to the {ref}`quick-start-guide`. +Whether you're building the next generation of AI applications, optimizing existing LLM deployments, or exploring the frontiers of large language model technology, TensorRT LLM provides the tools, performance, and flexibility you need to succeed in the era of generative AI. To get started, refer to the {ref}`quick-start-guide`.docs/source/blogs/tech_blog/blog2_DeepSeek_R1_MTP_Implementation_and_Optimization.md (2)
110-111
: Remove “very” and standardize FP8 casing.-Attention is also a very important component in supporting MTP inference. +Attention is also an important component in supporting MTP inference. @@ -In our implementation, TensorRT LLM will use the fp8 flashMLA generation kernel on Hopper GPU, while using TRTLLM customized attention kernels on Blackwell for better performance. +In our implementation, TensorRT LLM uses the FP8 flashMLA generation kernel on Hopper GPUs, and TensorRT LLM–customized attention kernels on Blackwell for better performance.
2-2
: Optional: add a period to the byline.-by NVIDIA TensorRT LLM team +by NVIDIA TensorRT LLM team.docs/source/blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.md (8)
1-1
: Polish the title for grammar ("the best performance").
Suggestion below keeps SEO while reading naturally.-# How to get best performance on DeepSeek-R1 in TensorRT LLM +# How to get the best performance on DeepSeek-R1 in TensorRT LLM
53-59
: Clarify repo name vs. product name to avoid confusion.
We renamed the product to “TensorRT LLM” but the repo/folder remains “TensorRT-LLM”. Add a one-liner note.# Clone the TensorRT LLM repository cd $YOUR_WORK_PATH git clone https://github.com/NVIDIA/TensorRT-LLM.git cd TensorRT-LLM git submodule update --init --recursive git lfs pull +# Note: The repository directory is named "TensorRT-LLM" (hyphen) even though the product name is "TensorRT LLM".
80-87
: Add explicit container image tag guidance (avoid unpublished tags).
To align with prior guidance, suggest pinning to a published NGC tag or documenting how the Makefile chooses the image.Do you want me to add a short subsection showing how to override the container tag via DOCKER_IMAGE/DOCKER_TAG and list the latest published tags?
125-131
: Tighten wording and cross-reference.
Add terminal period and link the earlier section explicitly.-All the benchmarking is done by the trtllm-bench command line tool provided in the TensorRT LLM installation, see [TensorRT LLM Benchmarking](https://nvidia.github.io/TensorRT-LLM/performance/perf-benchmarking.html) for details of this tool. +All benchmarking is done with the trtllm-bench CLI provided by TensorRT LLM. See [TensorRT LLM Benchmarking](https://nvidia.github.io/TensorRT-LLM/performance/perf-benchmarking.html) for details.
160-169
: Fix minor grammar and missing phrase.-- `trtllm-bench`: A CLI benchmarking utility that aims to make it easier for users to reproduce our officially published. See [TensorRT LLM Benchmarking](https://nvidia.github.io/TensorRT-LLM/performance/perf-benchmarking.html) for details. +- `trtllm-bench`: A CLI benchmarking utility to reproduce officially published results. See [TensorRT LLM Benchmarking](https://nvidia.github.io/TensorRT-LLM/performance/perf-benchmarking.html) for details.
189-191
: Link the exact commit SHA.
Adding a permalink improves reproducibility.-We are seeing meaningful speedup using FP8 KV cache, thus refreshing the numbers here. The results are reproduced with TensorRT LLM commit b6261862419c33d6ce2313aff1e7116067d6037d. +We see meaningful speedup with FP8 KV cache; numbers are reproduced at TensorRT LLM commit b6261862419c33d6ce2313aff1e7116067d6037d (permalink).
242-244
: Do the same commit permalink here.-The results are reproduced with TensorRT LLM commit b6261862419c33d6ce2313aff1e7116067d6037d. +Results reproduced at commit b6261862419c33d6ce2313aff1e7116067d6037d (permalink).
404-414
: Grammar and style tune-ups.-To benchmark TensorRT LLM on DeepSeek models with more ISL/OSL combinations, you can use `prepare_dataset.py` to generate the dataset and use similar commands mentioned in the previous section. TensorRT LLM is working on enhancements that can make the benchmark process smoother. +To benchmark TensorRT LLM on DeepSeek models with more ISL/OSL combinations, generate datasets with `prepare_dataset.py` and reuse the commands above. We are working on enhancements to make benchmarking smoother.-We're working on to enable those features by default, +We're working to enable those features by default,-It's possible seeing OOM issues on some cases. +It's possible to see OOM issues in some cases.docs/source/blogs/tech_blog/blog3_Optimizing_DeepSeek_R1_Throughput_on_NVIDIA_Blackwell_GPUs.md (7)
20-22
: Smooth the opening paragraph.
Minor phrasing fixes.-The open source DeepSeek R1 model's innovative architecture including the multi-head latent attention (MLA) and large sparse Mixture-of-Experts (MoE) significantly improved the inference efficiency of the LLM models. +The open-source DeepSeek R1 model’s architecture—including multi-head latent attention (MLA) and large sparse Mixture-of-Experts (MoE)—significantly improves LLM inference efficiency.
27-31
: “what is used” → “that is used”; bullet casing.-The mixed precision recipe for DeepSeek R1 throughput scenario is almost the same as [what](... ) is used for latency oriented scenario, with the following differences: +The mixed-precision recipe for the throughput scenario is almost the same as [that](... ) used for the latency-oriented scenario, with the following differences:
34-39
: Fix table formatting and note grammar.
The table renders inconsistently; also “slight” → “slightly”.-| Precision | GPQA Diamond | MATH-500 -| :-- | :-- | :-- | -| TensorRT LLM FP8 | 0.697 | 0.954 | -| TensorRT LLM FP4 | 0.705 | 0.96 | +| Precision | GPQA Diamond | MATH-500 | +| :----------------- | :----------- | :------- | +| TensorRT LLM FP8 | 0.697 | 0.954 | +| TensorRT LLM FP4 | 0.705 | 0.960 | -** Note there are some run-to-run variance for these evaluations, so FP4 data is slight higher here. We think FP4 has comparable accuracy with FP8 on these datasets. +Note: There is some run-to-run variance in these evaluations, so FP4 appears slightly higher here. FP4 is comparable to FP8 on these datasets.
65-71
: Tighten MLA paragraph and add missing period.-... Based on roofline analysis, non absorbed version is beneficial for the prefill phase with input length 256 or larger -The TensorRT LLM MLA implementation chooses ... +... Based on roofline analysis, the non‑absorbed version is beneficial for the prefill phase with input length ≥256. +The TensorRT LLM MLA implementation chooses ...
148-161
: CUDA Graphs section: clarify padding trade-off.
Minor wording; keep guidance crisp.-There is a feature called CUDA Graph padding in TensorRT LLM, which is a good trade-off between the number of CUDA Graphs and the CUDA Graph hit ratio; it tries to pad a batch to the nearest one with a captured CUDA Graph. +CUDA Graph padding trades off the number of captured graphs vs. hit ratio by padding batches to the nearest captured size.
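To anchor the padding discussion, this is roughly how the feature is toggled through the extra LLM API options file used in the deployment guides elsewhere in this PR. Treat the key names and batch-size list as illustrative, not authoritative.

```bash
# Sketch: enable CUDA Graph padding via the extra LLM API options file.
cat <<EOF > extra_llm_api_options.yaml
cuda_graph_config:
  enable_padding: true
  batch_sizes: [1, 2, 4, 8, 16, 32, 64, 128]
EOF
```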
162-166
: “opt-out this feature by set” → “by setting”.-... users can still opt-out this feature by set *disable_overlap_scheduler* to true. +... users can still opt out of this feature by setting `disable_overlap_scheduler: true`.
182-183
: Acknowledgment paragraph reads long; split.
Improve readability with a break.-The substantial throughput advancements for DeepSeek R1 on Blackwell GPUs, as detailed in this post, are the fruit of a dedicated and collaborative engineering effort. Achieving nearly a 2.3x increase in TPS/GPU required a deep dive into MLA layers, MoE layers, and runtime optimizations. +The substantial throughput advancements for DeepSeek R1 on Blackwell GPUs are the result of a dedicated, collaborative engineering effort. +Achieving nearly a 2.3× increase in TPS/GPU required a deep dive into MLA layers, MoE layers, and runtime optimizations.docs/source/blogs/tech_blog/blog8_Scaling_Expert_Parallelism_in_TensorRT-LLM_part2.md (5)
5-5
: Satisfy markdownlint MD036 by avoiding emphasis-as-heading.
Convert to a normal paragraph or a blockquote.-*By NVIDIA TensorRT LLM Team* +By the NVIDIA TensorRT LLM team.
31-33
: Consistent naming of prior blog.
Add “TensorRT LLM” once, then use “we”.-... the TensorRT LLM team has focused on optimizing the large EP implementation to improve performance. +... we focused on optimizing the large‑EP implementation to improve performance.
248-260
: Environment variables block: add brief context line.
Small clarification helps readers.-To enable powerful NVTX markers for easier analysis of host overheads, TensorRT LLM provides several useful environment variables: +To analyze host overheads with NVTX markers, set these environment variables:
320-323
: Grammar: possessive “users’ productive scenarios”.-Large-scale Expert Parallelism represents one of the important workloads for users productive scenarios, +Large-scale Expert Parallelism represents an important workload for users’ productive scenarios,
310-311
: Repro pointer: consider adding direct example command.
A one-liner to run the SLURM script improves usability.I can add a minimal sbatch example (variables for partition, nodes, GPUs) if you’d like.
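Something along these lines would do; it is a generic skeleton only, and the partition, node/GPU counts, and benchmark script path are all placeholders to adapt to the cluster.

```bash
#!/usr/bin/env bash
# Generic sbatch skeleton; all values below are placeholders.
#SBATCH --job-name=trtllm-wide-ep-bench
#SBATCH --partition=<partition>
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:4
#SBATCH --time=01:00:00

srun --mpi=pmix bash /path/to/run_wide_ep_benchmark.sh
```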
docs/source/architecture/checkpoint.md (2)
219-223
: Section title and PATH tip.
Minor polish; “Run Evaluation” is clearer; quote PATH assignment.-## Make Evaluation +## Run Evaluation-export PATH=/usr/local/bin:$PATH +export PATH="/usr/local/bin:${PATH}"
5-6
: Minor tense/wording cleanup.-The earlier versions (pre-0.8 version) of TensorRT LLM were developed with a very aggressive timeline. For those versions, emphasis was not put on defining a unified workflow. +Earlier (pre‑0.8) versions of TensorRT LLM were developed on an aggressive timeline, so a unified workflow was not prioritized.docs/source/features/parallel-strategy.md (3)
7-14
: Nit: hyphenate “shard/distribute/replicate” consistently.
Optional wording clarity; no functional change.-* **Tensor Parallel (TP)** - Shards model weights across GPUs -* **Pipeline Parallel (PP)** - Distributes model layers across GPUs -* **Data Parallel (DP)** - Replicates model across GPUs for different requests +* **Tensor Parallel (TP)** — shards model weights across GPUs +* **Pipeline Parallel (PP)** — distributes model layers across GPUs +* **Data Parallel (DP)** — replicates the model across GPUs for different requests
119-121
: Note rendering: use fenced admonition for clarity.-```{note} -The product of `moe_tensor_parallel_size` and `moe_expert_parallel_size` must equal `tensor_parallel_size`. -``` +```{note} +The product of `moe_tensor_parallel_size` and `moe_expert_parallel_size` must equal `tensor_parallel_size`. +```(Ensure the site’s Sphinx/Myst config renders admonitions; otherwise keep as-is.)
125-136
: Wide‑EP intro: trim marketing, keep crisp.-Wide Expert Parallelism (Wide-EP) is TensorRT LLM's advanced solution for large-scale MoE model inference. It addresses the challenges of traditional expert parallelism through intelligent load balancing and expert replication strategies. +Wide Expert Parallelism (Wide‑EP) addresses large‑scale MoE inference challenges via expert replication and load balancing.docs/source/deployment-guide/quick-start-recipe-for-llama4-scout-on-trtllm.md (7)
41-43
: Verify NGC tag exists and is publicly published
Please confirm `nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc6` is currently published on NGC. If not, switch to the latest published tag to avoid 404 pulls per the team’s doc policy. I can suggest a safe fallback tag once you confirm what’s published.
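One low-effort way to check before merging (assumes the image is public; run `docker login nvcr.io` first if the registry requires it):

```bash
# Sketch: confirm a tag resolves on NGC before documenting it.
TAG=1.0.0rc6
docker manifest inspect "nvcr.io/nvidia/tensorrt-llm/release:${TAG}" > /dev/null 2>&1 \
  && echo "tag ${TAG} is published" \
  || echo "tag ${TAG} not found; pick a published tag from the NGC tags page"
```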
50-51
: Tighten wording around rc vs releaseSmall clarity tweak and avoid implying weekly “main branch” images appear under “release” if they actually publish under a different repo/tag family.
-... The containers published in the main branch weekly have “rcN” suffix, while the monthly release with QA tests has no “rcN” suffix. +... Weekly builds carry an “rcN” suffix; monthly QA’d releases omit the “rcN” suffix.
52-53
: Grammar: add article and fix link textAdd “the” and consider a friendlier link phrase.
-If you want to use latest main branch, you can choose to build from source to install TensorRT LLM, the steps refer to https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html +If you want to use the latest main branch, you can build from source to install TensorRT LLM. See: https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html
54-58
: Consistency: prefer “TensorRT LLM server” over “TRT-LLM Server/server”Branding update should use the product name consistently in headings and prose. CLI names (e.g.,
trtllm-serve
) remain unchanged.-### Creating the TRT-LLM Server config +### Creating the TensorRT LLM server config -### Launch the TRT-LLM Server +### Launch the TensorRT LLM server -... the TRT-LLM server ... +... the TensorRT LLM server ...Also applies to: 71-88, 186-198
196-205
: Example response: align server nameAlign narrative text with branding change.
-Here is an example response, showing that the TRT-LLM server returns ... +Here is an example response, showing that the TensorRT LLM server returns ...
225-233
: Pin lm-eval for reproducibilityUnpinned install inside container can break over time. Suggest pinning a known-good version or adding a note to use a tested version matrix.
-pip install -U lm-eval +pip install -U "lm-eval==<tested_version>" +# or add a note listing tested versions
269-297
: Benchmark wrapper: add shebang, strict bash, and base URLImprove robustness and avoid relying on environment for OpenAI backend base URL.
-cat <<EOF > bench.sh +cat <<'EOF' > bench.sh +#!/usr/bin/env bash +set -euo pipefail ... python -m tensorrt_llm.serve.scripts.benchmark_serving \ --model nvidia/Llama-4-Scout-17B-16E-Instruct-FP8 \ --backend openai \ + --base-url http://localhost:8000/v1 \ ... EOF chmod +x bench.shdocs/source/developer-guide/perf-benchmarking.md (5)
10-16
: Typos and clarity in introFix “overiew” and “follows”, and tighten one bullet.
-... reproduce our officially published [performance overiew] ... -... `trtllm-bench` provides the follows: +... reproduce our officially published [performance overview] ... +... `trtllm-bench` provides the following: - An entirely Python workflow for benchmarking. + A Python-only workflow for benchmarking.
68-71
: Grammar: validated networks sentence-While `trtllm-bench` should be able to run any network that TensorRT LLM supports, the following are the list -that have been validated extensively and is the same listing as seen on the +While `trtllm-bench` should run any network TensorRT LLM supports, the following networks have been validated extensively and match the listing on the
170-199
: Sample output: branding of “TensorRT-LLM Version”If the tool output has been updated to “TensorRT LLM Version”, refresh the sample; otherwise consider a note that the label reflects current CLI output.
-TensorRT-LLM Version: 0.17.0 +TensorRT LLM Version: 0.17.0
40-59
: Caution on “boost-slider” and power settingsThese commands are GPU- and driver-specific. Add a one-liner noting availability varies by GPU/driver and to restore settings post-benchmark.
-If supported, enable the boost slider ... +If supported on your GPU/driver, enable the boost slider ... +# After benchmarking, restore defaults (e.g., `sudo nvidia-smi -rgc`).
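A hedged sketch of the lock/restore flow; the clock value is an example only and supported values differ per GPU and driver, so query them first and run the privileged commands only where allowed.

```bash
# Illustrative only: lock clocks for stable benchmarking, then restore driver defaults.
nvidia-smi -q -d SUPPORTED_CLOCKS | head -n 40   # list supported graphics/memory clocks
sudo nvidia-smi -lgc 1980,1980                   # lock graphics clock (example value only)
# ... run trtllm-bench ...
sudo nvidia-smi -rgc                             # reset graphics clocks to default
```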
372-379
: Mark numeric values as placeholders“Issue Rate” and some metrics look placeholder-like. Add a brief “example output; values illustrative” note above the block to avoid confusion.
-# Example output +# Example output (values illustrative; your results will differ)docs/source/installation/build-from-source-linux.md (6)
5-6
: Version claim may drift; suggest softening or verifyingPyTorch linkage version on PyPI changes over time. Either verify 2.7.0 is still accurate or qualify it.
-... pre-built TensorRT LLM wheel on PyPI is linked against PyTorch 2.7.0 and subsequent versions ... +... pre-built TensorRT LLM wheel on PyPI is currently linked against recent PyTorch releases (e.g., 2.7.x). Verify your environment’s ABI compatibility.
55-56
: Minor grammar-The `make` command supports the `LOCAL_USER=1` argument to switch to the local user account instead of `root` inside the container. The examples of TensorRT LLM are installed in the `/app/tensorrt_llm/examples` directory. +The `make` command supports the `LOCAL_USER=1` argument to switch to the local user account instead of `root` inside the container. The TensorRT LLM examples are installed in `/app/tensorrt_llm/examples`.
117-119
: Linking: prefer relative doc link over raw GitHub pathKeeps docs portable across branches.
-For more information on building and running various TensorRT LLM container images, -check <https://github.com/NVIDIA/TensorRT-LLM/tree/main/docker>. +For more information on building and running TensorRT LLM container images, see {doc}`containers`.
150-158
: Pluralization and clarity-... a semicolon separated list of CUDA architecture as shown ... +... a semicolon-separated list of CUDA architectures, for example:
182-191
: Tone: simplify “prior to” and clarify ABI note-... older versions of `torch` (prior to 2.7.0) due to the Dual ABI ... +... older `torch` versions (before 2.7.0) due to GCC Dual ABI ...
214-219
: Precompiled wheel flow: add checksum and explicit wheel py tag exampleSuggest adding a checksum step and clarifying the wheel tag to reduce mismatch risk.
I can draft a short snippet with `sha256sum` validation and notes on matching cp/abi tags.
Also applies to: 224-225, 231-233
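For reference, a minimal version of that snippet could look like this; the wheel filename, the checksum source, and the interpreter check are placeholders to adapt.

```bash
# Sketch: verify the downloaded wheel and check interpreter/ABI compatibility before installing.
sha256sum tensorrt_llm-*.whl   # compare against the checksum published with the release
python3 -c "import platform, sys; print(sys.version_info[:2], platform.machine())"  # must match the cp/abi tags in the wheel name
pip install ./tensorrt_llm-*.whl
```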
docs/source/blogs/Falcon180B-H200.md (4)
10-14
: Clarify “INT4 AWQ with FP8” phrasingClarify weights vs compute precision to avoid ambiguity.
-... running INT4 AWQ with FP8, allows for the entire large language model ... +... using INT4 AWQ weights with FP8 compute allows the entire model ...
24-29
: Footnotes still use “TensorRT-LLM”For consistency with rebranding, update to “TensorRT LLM” unless quoting historical labels from figures.
-TensorRT-LLM v0.7a | +TensorRT LLM v0.7a |Also applies to: 87-92, 100-105
20-21
: Image links target “rel” branch on GitHubGitHub “rel” branch path may change; prefer relative repo paths or images hosted under docs’ media with relative links to avoid future 404s.
-<img src="https://github.com/NVIDIA/TensorRT-LLM/blob/rel/docs/source/blogs/media/Falcon180B-H200_tps.png?raw=true" ... +<img src="../media/Falcon180B-H200_tps.png" ...Also applies to: 36-37, 71-72, 96-97
55-58
: Minor style: tighten sentence-TensorRT LLM has implemented custom kernels for AWQ, and taken the technique a -step further by performing FP8 computation on Hopper GPUs instead of the -standard FP16. +TensorRT LLM implements custom AWQ kernels and runs FP8 compute on Hopper GPUs instead of the usual FP16.docs/source/features/paged-attention-ifb-scheduler.md (6)
5-12
: Hyphenation consistency: “in‑flight”Use “in‑flight” consistently. Also fix “inflight” occurrence.
-TensorRT LLM supports in-flight batching of requests (also known as continuous -batching or iteration-level batching) ... -For efficiency reasons (1), the support for inflight batching ***requires... +TensorRT LLM supports in‑flight batching of requests (also known as continuous +batching or iteration‑level batching) ... +For efficiency reasons (1), in‑flight batching ***requires...
41-44
: Default explanation: add articleMinor grammar polish.
-Use default `max_seq_len` (which is `max_position_embeddings`), no need to tune it ... +Use the default `max_seq_len` (which is `max_position_embeddings`); no need to tune it ...
49-53
: Typo: “re-buliding” → “rebuilding”-... without re-buliding the engine. +... without rebuilding the engine.
61-64
: Flow: two short sentences read better-Using a more realistic value for `max_num_tokens` allows TensorRT LLM to -allocate more memory to store the KV cache and execute more requests together. -It leads to an increased efficiency. +Using a realistic `max_num_tokens` lets TensorRT LLM allocate more memory to the KV cache and execute more requests together, improving efficiency.
83-88
: KV cache paragraph: remove double space and tighten wording-already been computed. That cache is known as the KV cache. TensorRT LLM uses +already been computed. That cache is the KV cache. TensorRT LLM uses
109-113
: Alt text and captions still say “TRT-LLM”Optional rename in alt text/captions for consistency; file names can remain.
-![TRT-LLM Scheduler Visualization 1] +![TensorRT LLM Scheduler Visualization 1]Also applies to: 136-141
docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md (9)
38-39
: Avoid pinning to a possibly unpublished/stale NGC tag; point to tags page or use a placeholder.Per team preference, docs should reference published NGC tags only. Replace the hard-coded rc tag and instruct users to pick a tag from the NGC tags page to reduce churn.
-nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc6 \ +nvcr.io/nvidia/tensorrt-llm/release:<choose-a-tag-from-NGC-tags-page> \Also verify the tag is published before merging this doc (README badges can lead ahead of NGC). Do you want me to scan recent NGC tags and suggest a currently published one?
Also applies to: 47-47
49-50
: Grammar + clarity: add “the”; keep URL (path with hyphen) but update visible name.-If you want to use latest main branch, you can choose to build from source to install TensorRT LLM, the steps refer to ... +If you want to use the latest main branch, you can build from source to install TensorRT LLM; see ...
72-74
: Clarify MoE note; it’s about MoE models (e.g., DeepSeek R1), not FP8 itself.The extra moe_config is tied to MoE architecture rather than the FP8 dtype per se. Reword to avoid confusion.
-For FP8 model, we need extra `moe_config`: +For MoE models (e.g., DeepSeek R1), add an extra `moe_config` (example below also shows FP8 KV cache):Also applies to: 88-91
75-91
: Overwriting config file twice may surprise users; note overwrite or combine into one block.Second here-doc replaces the first. Either mention that explicitly or present a single consolidated example to avoid accidental loss of settings.
-cat << EOF > ${EXTRA_LLM_API_FILE} +# Note: The following command replaces ${EXTRA_LLM_API_FILE}. +cat << EOF > ${EXTRA_LLM_API_FILE}
148-151
: Add a short security caution for--trust_remote_code
.Flag the risks briefly since this pulls and executes model repo code.
- **Description:** Allows TensorRT LLM to download models and tokenizers from Hugging Face. This flag is passed directly to the Hugging Face API. + **Description:** Allows TensorRT LLM to download models and tokenizers from Hugging Face. This flag is passed directly to the Hugging Face APIs. Use with caution and only for trusted model repositories.
203-204
: Fix stray backticks and punctuation.-See the [TorchLlmArgs class](https://nvidia.github.io/TensorRT-LLM/llm-api/reference.html#tensorrt_llm.llmapi.TorchLlmArgs) for the full list of options which can be used in the extra\_llm\_api\_options`.` +See the [TorchLlmArgs class](https://nvidia.github.io/TensorRT-LLM/llm-api/reference.html#tensorrt_llm.llmapi.TorchLlmArgs) for the full list of options that can be used in `extra_llm_api_options`.
296-297
: Minor grammar tweak in benchmarking intro.-To benchmark the performance of your TensorRT LLM server you can leverage the built-in “benchmark_serving.py” script. To do this first creating a wrapper [bench.sh] script. +To benchmark the performance of your TensorRT LLM server, use the built-in “benchmark_serving.py” script. To do this, first create a wrapper `bench.sh` script.
330-334
: Ensure result directory exists when saving outputs.--save-result \ ---result-dir "${result_dir}" \ +--result-dir "${result_dir}" \ --result-filename "concurrency_${concurrency}.json" + +# Create the directory before running the benchmark (if using --save-result): +mkdir -p "${result_dir}"
310-321
: Clarify client/server protocol in benchmark.The benchmark uses the OpenAI-compatible client (
--backend openai
), which targets the server’s OpenAI APIs. Add a brief note so users don’t confuse this with the server’s--backend pytorch
option.--backend openai \ + # Uses the OpenAI-compatible client to query the TensorRT LLM server’s OpenAI endpoints.
Resolved review threads (2 each) on: docs/source/blogs/tech_blog/blog2_DeepSeek_R1_MTP_Implementation_and_Optimization.md and docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md.
Force-pushed from ce0cdd0 to 3bfa93b (Compare).
Actionable comments posted: 8
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (10)
docs/source/blogs/tech_blog/blog4_Scaling_Expert_Parallelism_in_TensorRT-LLM.md (2)
241-241
: Fix broken bold markup and spelling (“Placement”).-For the **Update Weights \& Placemen**t component, we identified two design choices: +For the **Update Weights & Placement** component, we identified two design choices:
399-399
: Fix typo and stabilize documentation link
- Remove the duplicate “how” and rephrase to “…as well as how to run the Offline EP Load Balancer end-to-end.”
- Update the `feat/large-ep` branch URL to a stable reference (e.g. `main` or a specific commit SHA) to prevent future 404s.
docs/source/features/checkpoint-loading.md (1)
1-327
: Replace all prose occurrences of “TRTLLM” and “TensorRT-LLM” with “TensorRT LLM”
The grep run revealed numerous prose instances across the docs (e.g. docs/source/architecture/overview.md, docs/source/features/checkpoint-loading.md, docs/source/reference/support-matrix.md, etc.). Leave code identifiers unchanged; only update natural-language text. Use a regex like `/\bTRTLLM\b|\bTensorRT-LLM\b/` to locate and correct.
docs/source/commands/trtllm-serve/run-benchmark-with-trtllm-serve.md (1)
8-8
: Fix spelling and style in list item.- * Launch the OpenAI-Compatibale Server with NGC container + * Launch the OpenAI-compatible server with the NGC containerdocs/source/blogs/H200launch.md (1)
1-60
: Replace broken GitHub URLs
- docs/source/blogs/H200launch.md (lines 38, 47): update both
<img>
src URLs fromhttps://github.com/NVIDIA/TensorRT LLM/...
tohttps://github.com/NVIDIA/TensorRT-LLM/...
.- docs/source/blogs/tech_blog/blog1_Pushing_Latency_Boundaries_Optimizing_DeepSeek-R1_Performance_on_NVIDIA_B200_GPUs.md (line 68): update the FP8 Attention link to
https://github.com/NVIDIA/TensorRT-LLM/tree/main/...
to restore correct navigation.docs/source/installation/linux.md (1)
15-22
: Pin torchvision and torchaudio to specific cu128 versions
Update the pip install command in docs/source/installation/linux.md (lines 15–22) to:pip3 install torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1 --index-url https://download.pytorch.org/whl/cu128
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (1)
270-338
: Sanitize example completion; remove chain-of-thought/internal text.The example embeds analysis/reasoning content that shouldn’t be shown to users. Replace with a concise, plausible response body.
-```bash -{ - "id": "chatcmpl-c440e2a3e7e14cd699295afc3739bf42", - "object": "chat.completion", - "created": 1754358426, - "model": "openai/gpt-oss-120b", - "choices": [ - { - "index": 0, - "message": { - "role": "assistant", - "content": "<|channel|>analysis<|message|>The user asks: \"What is NVIDIA's advantage for inference?\" ...", - "reasoning_content": null, - "tool_calls": [] - }, - "logprobs": null, - "finish_reason": "length", - "stop_reason": null, - "disaggregated_params": null - } - ], - "usage": { "...": "..." } -} -``` +```bash +{ + "id": "chatcmpl-123", + "object": "chat.completion", + "created": 1754358426, + "model": "openai/gpt-oss-120b", + "choices": [ + { + "index": 0, + "message": { + "role": "assistant", + "content": "NVIDIA’s advantage is the tight integration of Tensor Core GPUs with the TensorRT software stack (compilation, kernel fusion, quantization) and Triton/TensorRT LLM serving, delivering high throughput and low latency at scale." + }, + "finish_reason": "stop" + } + ] +} +``` </blockquote></details> <details> <summary>docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md (1)</summary><blockquote> `58-92`: **Avoid overwriting the same config file; clarify “pick one” or use separate files.** ```diff -EXTRA_LLM_API_FILE=/tmp/config.yml +EXTRA_LLM_API_FILE=/tmp/config.yml ... -For FP8 model, we need extra `moe_config`: +For FP8 model, use a separate file (or replace the earlier config; do not run both blocks): ```shell -EXTRA_LLM_API_FILE=/tmp/config.yml +EXTRA_LLM_API_FILE=/tmp/config_fp8.yml</blockquote></details> <details> <summary>docs/source/blogs/Falcon180B-H200.md (1)</summary><blockquote> `24-29`: **Branding consistency: change “TensorRT-LLM” → “TensorRT LLM” in captions/headers.** ```diff -TensorRT-LLM v0.7a | +TensorRT LLM v0.7a | ... -**TensorRT-LLM GQA now 2.4x faster on H200** +**TensorRT LLM GQA now 2.4x faster on H200** ... -TensorRT-LLM v0.7a vs TensorRT-LLM v0.6a | +TensorRT LLM v0.7a vs TensorRT LLM v0.6a |
Also applies to: 41-45, 94-94, 100-101
docs/source/blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.md (1)
348-353
: Dataset inconsistency: section says “real dataset” but generates synthetic.Either change text to “synthetic dataset” or swap to a real dataset path.
-Our benchmark results are based on **Batch = 1024, ISL = 1K, OSL = 2K, num_requests = 5120 from real dataset** +Our benchmark results are based on **Batch = 1024, ISL = 1K, OSL = 2K, num_requests = 5120 from a synthetic dataset**Also applies to: 362-383
♻️ Duplicate comments (8)
docs/source/dev-on-cloud/build-image-to-dockerhub.md (1)
4-4
: Grammar, capitalization, and naming polish.Prefer product name, add missing article, and fix “Docker Hub” capitalization.
-When you develop trt-llm on cloud platform such as runpod, you may need to provide a docker image for the platform. So you firstly need to upload the image to dockerhub. +When you develop TensorRT LLM on a cloud platform such as RunPod, you may need to provide a Docker image for the platform. First, upload the image to Docker Hub.docs/source/overview.md (1)
11-13
: Polish: fix link, PyTorch casing, spacing, and a role-like “source:” artifact.
- “Pytorch” → “PyTorch”.
- Replace
source:...
with a proper hyperlink or plain code.- Remove leading dot from the DeepSeek R1 URL (already flagged previously).
- Capitalize “DeepSeek”.
- Add missing space after comma in “H200,GH200”.
- Add missing space after period in “AI.To”.
-### 🔥 **Architected on Pytorch** +### 🔥 **Architected on PyTorch** @@ -TensorRT LLM is designed to be modular and easy to modify. Its PyTorch-native architecture allows developers to experiment with the runtime or extend functionality. Several popular models are also pre-defined and can be customized using [native PyTorch code](source:tensorrt_llm/_torch/models/modeling_deepseekv3.py), making it easy to adapt the system to specific needs. +TensorRT LLM is designed to be modular and easy to modify. Its PyTorch‑native architecture allows developers to experiment with the runtime or extend functionality. Several popular models are also pre‑defined and can be customized using native PyTorch code (see `tensorrt_llm/_torch/models/modeling_deepseekv3.py`), making it easy to adapt the system to specific needs. @@ -- **DeepSeek R1**: [World-record inference performance on Blackwell GPUs](.https://developer.nvidia.com/blog/nvidia-blackwell-delivers-world-record-deepseek-r1-inference-performance/) +- **DeepSeek R1**: [World-record inference performance on Blackwell GPUs](https://developer.nvidia.com/blog/nvidia-blackwell-delivers-world-record-deepseek-r1-inference-performance/) @@ -- **Language Models**: GPT-OSS, Deepseek-R1/V3, Llama 3/4, Qwen2/3, Gemma 3, Phi 4... +- **Language Models**: GPT-OSS, DeepSeek‑R1/V3, Llama 3/4, Qwen2/3, Gemma 3, Phi 4... @@ -- **NVIDIA Hopper**: H100, H200,GH200 with FP8 acceleration +- **NVIDIA Hopper**: H100, H200, GH200 with FP8 acceleration @@ -Whether you're building the next generation of AI applications, optimizing existing LLM deployments, or exploring the frontiers of large language model technology, TensorRT LLM provides the tools, performance, and flexibility you need to succeed in the era of generative AI.To get started, refer to the {ref}`quick-start-guide`. +Whether you're building the next generation of AI applications, optimizing existing LLM deployments, or exploring the frontiers of large language model technology, TensorRT LLM provides the tools, performance, and flexibility you need to succeed in the era of generative AI. To get started, refer to the {ref}`quick-start-guide`.Also applies to: 15-16, 21-23, 26-31, 49-54, 55-58
docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md (2)
7-8
: Fix container narrative: use TensorRT LLM container (not PyTorch) to match commands. Intro says PyTorch container; the docker run uses the TensorRT LLM image. Align the text.
-The guide is intended for developers and practitioners seeking high-throughput or low-latency inference using NVIDIA’s accelerated stack—starting with the PyTorch container from NGC, then installing TensorRT LLM for model serving, FlashInfer for optimized CUDA kernels, and ModelOpt to enable FP8 and NVFP4 quantized execution. +The guide is intended for developers and practitioners seeking high-throughput or low-latency inference using NVIDIA’s accelerated stack—starting with the TensorRT LLM container from NGC (or building from source), using FlashInfer for optimized CUDA kernels, and ModelOpt to enable FP8 and NVFP4 quantized execution.
51-51
: Branding: replace “TRT-LLM” with “TensorRT LLM” in headings/text (keep CLI/tool tokens).
-### Creating the TRT-LLM Server config
+### Creating the TensorRT LLM Server config
...
-### Launch the TRT-LLM Server
+### Launch the TensorRT LLM Server
...
-After the TRT-LLM server is set up and shows Application startup complete, you can send requests to the server.
+After the TensorRT LLM server is set up and shows “Application startup complete,” you can send requests to the server.
...
-Here is an example response, showing that the TRT-LLM server returns “New York ...
+Here is an example response, showing that the TensorRT LLM server returns “New York ...

Also applies to: 94-96, 219-230
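As a quick check after applying the rename, a search along these lines (assuming `ripgrep` is available) surfaces any remaining prose occurrences while leaving repo slugs and image names for manual review:

```bash
# List any remaining "TRT-LLM" spellings in this guide; keep only intentional
# tokens such as repository slugs, image names, or CLI commands.
rg -n 'TRT-LLM' docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md
```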
docs/source/architecture/checkpoint.md (1)
171-173
: Fix typos and agreement in the note (out_fature, follows).
- The linear weights in TensorRT LLM checkpoint always follows (`out_feature`, `in_feature`) shape, whereas some quantized linear in TensorRT LLM implemented by plugin may use (`in_feature`, `out_fature`) shape. The `trtllm-build` command adds a transpose operation to post-process it.
+ The linear weights in a TensorRT LLM checkpoint always follow (`out_feature`, `in_feature`) shape, whereas some quantized linear layers implemented by plugins may use (`in_feature`, `out_feature`) shape. The `trtllm-build` command adds a transpose operation to post‑process them.

docs/source/blogs/tech_blog/blog2_DeepSeek_R1_MTP_Implementation_and_Optimization.md (2)
79-79
: Fix both MTP code links (point to the right symbols; use commit-pinned URLs). Both links currently point to the same line on main. Pin them to the stable commit used elsewhere in this doc and target the correct definitions.
-TensorRT LLM has two different paths for MTP, one for [MTP Vanilla](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/_torch/speculative/mtp.py#L1047) and another for [MTP Eagle](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/_torch/speculative/mtp.py#L1047). MTP Eagle is the default path for DeepSeek-V3 and DeepSeek-R1 models. +TensorRT LLM has two different paths for MTP, one for [MTP Vanilla](https://github.com/NVIDIA/TensorRT-LLM/blob/338744fba6a91147b739b7f02d19b37bc19aa17a/tensorrt_llm/_torch/speculative/mtp.py#L326) and another for [MTP Eagle](https://github.com/NVIDIA/TensorRT-LLM/blob/338744fba6a91147b739b7f02d19b37bc19aa17a/tensorrt_llm/_torch/speculative/mtp.py#L1047). MTP Eagle is the default path for DeepSeek-V3 and DeepSeek-R1 models.
240-245
: Fix typos and standardize Eagle naming. Correct “Ealge2” and use consistent hyphenation.
-TensorRT LLM PyTorch backend can only support chain-based speculative decoding now, both MTP Vanilla and MTP Eagle. However, the tree-based speculative decoding technique is widely used in previous advanced methods, such as Ealge2 and Eagle3, to increase the acceptance rate. +TensorRT LLM PyTorch backend can only support chain-based speculative decoding now, both MTP Vanilla and MTP Eagle. However, the tree-based speculative decoding technique is widely used in previous advanced methods, such as Eagle-2 and Eagle-3, to increase the acceptance rate. @@ -Since TensorRT LLM already has [Eagle-3 support](https://github.com/NVIDIA/TensorRT-LLM/pull/3035) now, in the future, we also want to train an Eagle3 head to support DeepSeek-V3/R1+Eagle3 to achieve better speedup. +Since TensorRT LLM already has [Eagle-3 support](https://github.com/NVIDIA/TensorRT-LLM/pull/3035) now, in the future we also want to train an Eagle-3 head to support DeepSeek-V3/R1 + Eagle-3 to achieve better speedup.docs/source/installation/build-from-source-linux.md (1)
12-13
: Fix typo and tighten phrasing; anchors already exist.
-If you intend to build any TensortRT-LLM artifacts, such as any of the container images (note that there exist pre-built [develop](#build-from-source-tip-develop-container) and [release](#build-from-source-tip-release-container) container images in NGC), or the TensorRT LLM Python wheel, you first need to clone the TensorRT LLM repository:
+If you intend to build any TensorRT LLM artifacts—such as container images (pre-built [develop](#build-from-source-tip-develop-container) and [release](#build-from-source-tip-release-container) images are available on NGC)—or the TensorRT LLM Python wheel, first clone the TensorRT LLM repository:
🧹 Nitpick comments (99)
docs/source/dev-on-cloud/build-image-to-dockerhub.md (5)
6-6
: Capitalize “Docker Hub”. Use the official product capitalization.
-## Build the TensorRT LLM Docker Image and Upload to DockerHub
+## Build the TensorRT LLM Docker Image and Upload to Docker Hub
17-25
: SSH hardening and startup reliability. Current snippet enables root SSH and assumes entrypoint.d is sourced. Suggest hardening and making startup explicit.
FROM tensorrt_llm/devel:latest RUN apt update && apt install openssh-server -y RUN mkdir -p /run/sshd && chmod 755 /run/sshd RUN mkdir -p /root/.ssh && chmod 700 /root/.ssh && touch /root/.ssh/authorized_keys && chmod 600 /root/.ssh/authorized_keys -# add sshd to entrypoint script -RUN echo "sshd -E /opt/sshd.log" >> /opt/nvidia/entrypoint.d/99-start-sshd.sh +# Optional: create a non-root user for SSH +RUN useradd -m -s /bin/bash appuser && mkdir -p /home/appuser/.ssh && chmod 700 /home/appuser/.ssh && touch /home/appuser/.ssh/authorized_keys && chmod 600 /home/appuser/.ssh/authorized_keys && chown -R appuser:appuser /home/appuser/.ssh +# Harden SSH: disable root password login; key-only auth +RUN sed -i 's/^#\?PermitRootLogin .*/PermitRootLogin prohibit-password/' /etc/ssh/sshd_config && \ + sed -i 's/^#\?PasswordAuthentication .*/PasswordAuthentication no/' /etc/ssh/sshd_config +EXPOSE 22 +# Start sshd in foreground for Docker +CMD ["/usr/sbin/sshd","-D","-e"]
41-45
: Use token via stdin for non-interactive login. Improves CI/automation ergonomics and avoids echoing prompts.
-```bash -docker login -u <your_dockerhub_username> -``` - -Enter the token to the console. +```bash +echo '<your_token>' | docker login -u <your_dockerhub_username> --password-stdin +```
13-15
: Optional: capitalize SSH in headings/sentences. Consistent with common style guides.
-### Enable ssh access to the container
-Since the default docker image doesn’t have ssh support, we can’t ssh into it. We need to add ssh support to the container.
+### Enable SSH access to the container
+Since the default Docker image doesn’t have SSH support, we can’t SSH into it. We need to add SSH support to the container.
18-18
: Clarify base image source. Make it explicit this is the locally built image tag to avoid confusion with NGC images.
-FROM tensorrt_llm/devel:latest
+FROM tensorrt_llm/devel:latest  # locally built image from `make -C docker build`

docs/source/blogs/tech_blog/blog8_Scaling_Expert_Parallelism_in_TensorRT-LLM_part2.md (3)
5-5
: Fix markdownlint MD036 (emphasis used as heading) without changing style. Suppress the rule locally to keep the byline format.
+# <!-- markdownlint-disable-next-line MD036 --> *By NVIDIA TensorRT LLM Team*
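To confirm the suppression takes effect without changing the byline, the file can be linted locally. This is only a sketch and assumes the `markdownlint-cli` npm package is acceptable in your workflow:

```bash
# Should no longer report MD036 for the byline once the
# markdownlint-disable-next-line comment is in place.
npx markdownlint-cli docs/source/blogs/tech_blog/blog8_Scaling_Expert_Parallelism_in_TensorRT-LLM_part2.md
```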
250-252
: Minor clarity nit: simplify sentence on IPC/serialization overhead. Current sentence reads heavy. Consider a small wording tweak.
-TensorRT LLM is designed to be composed of both C++ and Python code, so that C++ can handle the most performance-sensitive parts while Python handles higher-level logic. As we try to put more logic into Python to make the program easier to read and debug, there are still frequent conversations through binding interfaces between C++ and Python. Besides, since most of the logic is implemented in Python, there are several layers of implementation that communicate with each other through inter-process communication overhead. +TensorRT LLM combines C++ for performance‑critical paths with Python for higher‑level logic. As more logic moves to Python for readability and debuggability, frequent C++/Python binding calls and multi‑layer inter‑process communication (with serialization/deserialization) introduce measurable overhead.
320-323
: Tone nit: consider trimming superlatives in Acknowledgements. If you follow a neutral technical tone in blogs, consider softening marketing language.
-This work represents an outstanding example of collaborative engineering excellence within the TensorRT LLM team. The successful implementation and optimization of large-scale Expert Parallelism required coordinated efforts across multiple domains - from low-level CUDA kernel optimizations to high-level system architecture design. The dedication and technical expertise demonstrated by our team members throughout this project has been truly remarkable.
+This work reflects cross‑functional collaboration within the TensorRT LLM team, spanning low‑level CUDA kernel changes through system‑level design.

docs/source/deployment-guide/quick-start-recipe-for-llama4-scout-on-trtllm.md (9)
5-8
: Clarify NVFP4 vs FP4 and container wording. Minor wording fixes to avoid confusion with the PyTorch container and alias FP4.
-This deployment guide provides step-by-step instructions for running the Llama-4-Scout-17B-16E-Instruct model using TensorRT LLM with FP8 and NVFP4 quantization, optimized for NVIDIA GPUs. It covers the complete setup required; from accessing model weights and preparing the software environment to configuring TensorRT LLM parameters, launching the server, and validating inference output. +This deployment guide provides step-by-step instructions for running the Llama-4-Scout-17B-16E-Instruct model using TensorRT LLM with FP8 and NVFP4 (FP4) quantization, optimized for NVIDIA GPUs. It covers the complete setup required; from accessing model weights and preparing the software environment to configuring TensorRT LLM parameters, launching the server, and validating inference output.-The guide is intended for developers and practitioners seeking high-throughput or low-latency inference using NVIDIA’s accelerated stack—starting with the PyTorch container from NGC, then installing TensorRT LLM for model serving, FlashInfer for optimized CUDA kernels, and ModelOpt to enable FP8 and NVFP4 quantized execution. +The guide is intended for developers and practitioners seeking high-throughput or low-latency inference using NVIDIA’s accelerated stack—starting from the TensorRT LLM container from NGC, together with FlashInfer for optimized CUDA kernels and ModelOpt to enable FP8 and NVFP4 quantized execution.
32-43
: Use a published NGC tag; avoid hardcoding rc. Replace 1.0.0rc6 with a known published tag or an env var (e.g., TRTLLM_TAG) and point readers to the NGC tags page. This prevents broken pulls when rc tags roll.
Note from team history: docs should only reference published NGC container images; README badges may be ahead.
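A minimal sketch of the env-var approach; the `TRTLLM_TAG` variable name and its default are placeholders to be replaced with a tag verified on the NGC tags page:

```bash
# Pin the container tag in one place so the docs never hardcode an unpublished rc.
# The default below is an assumption; substitute a tag confirmed on NGC.
TRTLLM_TAG="${TRTLLM_TAG:-1.0.0rc4}"

docker run --rm -it \
  --ipc=host \
  --gpus all \
  -p 8000:8000 \
  -v ~/.cache:/root/.cache:rw \
  --name tensorrt_llm \
  "nvcr.io/nvidia/tensorrt-llm/release:${TRTLLM_TAG}" \
  /bin/bash
```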
52-52
: Grammar: add article and tighten phrasing.
-If you want to use latest main branch, you can choose to build from source to install TensorRT LLM, the steps refer to https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html
+If you want to use the latest main branch, build from source: https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html
71-88
: Standardize “TRT-LLM server” → “TensorRT LLM server”. Several headings/sentences still use “TRT-LLM” (e.g., Lines 71, 73, 196, 207). Align with the new branding in prose while keeping repo/image names unchanged.
127-127
: Add security caveat for trust_remote_code. Recommend caution when enabling remote code.
- **Description:** Allows TensorRT LLM to download models and tokenizers from Hugging Face. This flag is passed directly to the Hugging Face API.
+ **Description:** Allows TensorRT LLM to download models and tokenizers from Hugging Face. This flag is passed directly to the Hugging Face API. Enable only for trusted model repos.
225-231
: Prefer python -m pip for reliability. Minor robustness improvement inside the container.
-pip install -U lm-eval
+python -m pip install -U lm-eval
269-269
: Grammar: “first creating” → “first create”.
-To benchmark the performance of your TensorRT LLM server you can leverage the built-in `benchmark_serving.py` script. To do this first creating a wrapper [bench.sh](http://bench.sh) script.
+To benchmark the performance of your TensorRT LLM server you can leverage the built-in `benchmark_serving.py` script. To do this, first create a wrapper [bench.sh](http://bench.sh) script.
47-49
: Tiny UX tweak: mkdir example. Use -p and code formatting for the path.
-* The command mounts your user .cache directory to save the downloaded model checkpoints which are saved to `~/.cache/huggingface/hub/` by default. This prevents having to redownload the weights each time you rerun the container. If the `~/.cache` directory doesn’t exist please create it using mkdir `~/.cache`. +* The command mounts your user .cache directory to save the downloaded model checkpoints which are saved to `~/.cache/huggingface/hub/` by default. This prevents having to redownload the weights each time you rerun the container. If the `~/.cache` directory doesn’t exist, create it with: `mkdir -p ~/.cache`.
23-27
: Optional: clarify NVFP4 vs FP4 repo naming. HF repos use “FP4” in names while docs refer to “NVFP4”. Add a short note here to state FP4 (repo name) corresponds to NVFP4 in TensorRT LLM docs.
docs/source/blogs/tech_blog/blog4_Scaling_Expert_Parallelism_in_TensorRT-LLM.md (8)
42-42
: Remove stray spaces before punctuation. There are extra spaces before “.” and “,” after “TensorRT LLM”.
-In this tech blog, we will introduce the details of the design and implementation to support E2E large-scale EP in TensorRT LLM . This blog post mainly covers the following:
+In this tech blog, we will introduce the details of the design and implementation to support E2E large-scale EP in TensorRT LLM. This blog post mainly covers the following:
- * Some preliminary data demonstrating the effectiveness of the current implementation in TensorRT LLM .
+ * Some preliminary data demonstrating the effectiveness of the current implementation in TensorRT LLM.
-Even if, in this tech blog, we focus on TensorRT LLM , we believe the core ideas and implementation can also be applied to other inference engines to help the inference performance on NVIDIA GPUs. Also, with the help of the community, we would like to figure out how to better modularize the current TensorRT LLM large-scale EP implementation and make it more easily reusable by the community.
+Even if, in this tech blog, we focus on TensorRT LLM, we believe the core ideas and implementation can also be applied to other inference engines to help the inference performance on NVIDIA GPUs. Also, with the help of the community, we would like to figure out how to better modularize the current TensorRT LLM large-scale EP implementation and make it more easily reusable by the community.

Also applies to: 51-51, 60-60
54-54
: Tighten phrasing for clarity.
-* The introduction of performance tuning and optimization for TensorRT LLM large-scale EP GB200 implementation.
+* Performance tuning and optimization for the GB200 implementation of TensorRT LLM large-scale EP.
224-237
: Figure reference likely off-by-one. The caption labels the architecture figure as “Figure 12”, but the paragraph below refers to “Figure 11” for Dispatch/Combine. That seems to point to this architecture diagram, not the earlier dataset figure.
-* EP communication kernels. In Figure 11, those are the **Dispatch** and **Combine** components.
+* EP communication kernels. In Figure 12, those are the **Dispatch** and **Combine** components.
309-309
: Standardize capitalization: “CUDA Graph”.
-... does not provide CUDA graph compatibility for all the scenarios. We believe that CUDA graph is needed ...
+... does not provide CUDA Graph compatibility for all the scenarios. We believe that CUDA Graph is needed ...
367-367
: Fix spacing before colon and tighten sentence.
-Based on these core utilities, there are two versions of EP Load Balancer in TensorRT LLM : Offline EP Load Balancer and Online EP Load Balancer.
+Based on these core utilities, there are two versions of the EP Load Balancer in TensorRT LLM: Offline EP Load Balancer and Online EP Load Balancer.
694-694
: Minor grammar polish.
- * Further performance tuning and optimizations. **We are actively working on this now.**
+ * Further performance tuning and optimization. **We are actively working on this now.**
697-697
: Improve flow and remove filler.
- * More validation with workloads close to production traffic. **Here we highly welcome the community’s feedback to help us calibrate TensorRT LLM large-scale EP implementation based on more concrete workloads.**
+ * More validation with workloads close to production traffic. **We welcome community feedback to help calibrate the TensorRT LLM large-scale EP implementation based on concrete workloads.**
710-710
: Avoid repetitive sentence openings (“We … We … We …”).
-We believe the current implementation can be viewed as a reasonable E2E large-scale EP implementation and we encourage the community to try new ideas and performance validation. We encourage the community to share feedback to help us move fast in this area. We are actively tracking the TensorRT LLM large-scale EP execution in [this](https://github.com/NVIDIA/TensorRT-LLM/issues/4127) GitHub issue to ensure transparency to the community.
+We believe the current implementation is a reasonable E2E large-scale EP baseline and encourage the community to try new ideas and validate performance. Please share feedback to help us move faster. We are actively tracking progress in [this](https://github.com/NVIDIA/TensorRT-LLM/issues/4127) GitHub issue to ensure transparency.

docs/source/features/checkpoint-loading.md (1)
19-19
: Tighten wording for clarity and style. Prefer “the … model definition” over the possessive.
-**Weight Mappers**: Map and transform loaded weights to TensorRT LLM model's definition
+**Weight Mappers**: Map and transform loaded weights to the TensorRT LLM model definition

docs/source/examples/customization.md (1)
5-5
: Fix grammar and tighten phrasing. Removes the sentence fragment, improves article use, and standardizes INT4 casing.
-TensorRT LLM can quantize the Hugging Face model automatically. By setting the appropriate flags in the `LLM` instance. For example, to perform an Int4 AWQ quantization, the following code triggers the model quantization. Please refer to complete list of [supported flags](https://nvidia.github.io/TensorRT-LLM/_modules/tensorrt_llm/quantization/mode.html#QuantAlgo) and acceptable values.
+TensorRT LLM can automatically quantize Hugging Face models by setting the appropriate flags on the `LLM` instance. For example, to perform INT4 AWQ quantization, use the following code. Please refer to the complete list of [supported flags](https://nvidia.github.io/TensorRT-LLM/_modules/tensorrt_llm/quantization/mode.html#QuantAlgo) and acceptable values.

docs/source/commands/trtllm-serve/run-benchmark-with-trtllm-serve.md (2)
163-174
: Unify option names with earlier example (enable_padding
,kv_cache_config: { dtype }
).The YAML here uses different keys (
padding_enabled
,kv_cache_dtype
) than the earlier start.sh snippet. Align to one schema to avoid user confusion.```yaml - cuda_graph_config: - padding_enabled: true - print_iter_log: true - kv_cache_dtype: fp8 - enable_attention_dp: true + cuda_graph_config: + enable_padding: true + print_iter_log: true + kv_cache_config: + dtype: fp8 + enable_attention_dp: true ```
222-222
: Remove stray backticks and tweak wording.-See the [TorchLlmArgs class](https://nvidia.github.io/TensorRT-LLM/llm-api/reference.html#tensorrt_llm.llmapi.TorchLlmArgs) for the full list of options which can be used in the extra\_llm\_api\_options`.` +See the [TorchLlmArgs class](https://nvidia.github.io/TensorRT-LLM/llm-api/reference.html#tensorrt_llm.llmapi.TorchLlmArgs) for the full list of options that can be used in `extra_llm_api_options`.docs/source/features/multi-modality.md (2)
3-3
: Slightly more concise phrasing.-TensorRT LLM supports a variety of multimodal models, enabling efficient inference with inputs beyond just text. +TensorRT LLM supports many multimodal models, enabling efficient inference with inputs beyond text.
31-31
: Trim verbosity in examples intro.-The following examples demonstrate how to use TensorRT LLM's multimodal support in various scenarios, including quick run examples, serving endpoints, and performance benchmarking. +The following examples show how to use TensorRT LLM's multimodal support in quick starts, serving, and benchmarking.docs/source/blogs/H200launch.md (1)
58-58
: Tighten grammar, remove repetition.-H200 is the newest addition to NVIDIA’s data center GPU portfolio. To maximize that compute performance, H200 is the first GPU with HBM3e memory with 4.8TB/s of memory bandwidth, a 1.4X increase over H100. H200 also expands GPU memory capacity nearly 2X to 141 gigabytes (GB). The combination of faster and larger HBM memory accelerates performance of LLM model inference performance with faster throughput and tokens per second. These results are measured and preliminary, more updates expected as optimizations for H200 continue with TensorRT LLM. +H200 is the newest addition to NVIDIA’s data center GPU portfolio. To maximize compute performance, H200 is the first GPU with HBM3e memory (4.8 TB/s bandwidth, a 1.4× increase over H100) and nearly 2× the memory capacity (141 GB). The combination of faster and larger HBM accelerates LLM inference throughput and tokens per second. These results are preliminary; more updates are expected as TensorRT LLM optimizations for H200 continue.docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md (6)
53-54
: Grammar fix (“the latest main branch”).-If you want to use latest main branch, you can choose to build from source to install TensorRT LLM, the steps refer to ... +If you want to use the latest main branch, build from source to install TensorRT LLM; see ...
57-70
: YAML snippet: good defaults; consider noting fp4 when using FP4 model.
Add a tip that dtype should match the chosen model (fp4 for FP4 flow).
187-198
: Naming consistency: “TRT-LLM server” appears elsewhere.
The doc mostly says “TensorRT LLM”; ensure occurrences like “TRT-LLM server” (e.g., headings near Lines 72, 74, 197) are aligned.
226-233
: Minor style: “exec into” → “exec into the”.
Optional clarity tweak.-To run the evaluation harness exec into the running TensorRT LLM container and install with this command: +To run the evaluation harness, exec into the running TensorRT LLM container and install with this command:
274-302
: Fix wording and harden bench.sh.
- “first creating” → “first create”
- Add shebang and safe bash options
- Optionally integrate save-result flags so users don’t copy a second block.
-To benchmark the performance of your TensorRT LLM server you can leverage the built-in “benchmark_serving.py” script. To do this first creating a wrapper [bench.sh](http://bench.sh) script. +To benchmark the performance of your TensorRT LLM server, use the built-in “benchmark_serving.py” script. First, create a wrapper bench.sh script. @@ -cat <<EOF > bench.sh +cat <<'EOF' > bench.sh +#!/usr/bin/env bash +set -euo pipefail concurrency_list="1 2 4 8 16 32 64 128 256" multi_round=5 isl=1024 osl=1024 result_dir=/tmp/llama3.3_output @@ --percentile-metrics "ttft,tpot,itl,e2el" + # Uncomment to persist results: + # --save-result --result-dir "${result_dir}" --result-filename "concurrency_${concurrency}.json" done EOF chmod +x bench.sh
33-44
: Use a valid NGC tag or placeholder
- Replace
nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc6
with the latest published tag (1.0.0rc4
) (catalog.ngc.nvidia.com), or use a<TAG>
placeholder and retain RC notes below.- Verify the final tag exists on NGC before merging.
docs/source/blogs/H100vsA100.md (2)
5-8
: Use consistent branding and tighten phrasing.
Replace “TRT-LLM” with “TensorRT LLM” and prefer “can” over “is able to”.-# H100 has 4.6x A100 Performance in TensorRT LLM, achieving 10,000 tok/s at 100ms to first token +# H100 has 4.6x A100 performance in TensorRT LLM, achieving 10,000 tok/s at 100 ms to first token @@ -TensorRT LLM evaluated on both Hopper and Ampere shows **H100 FP8 is up to 4.6x max throughput and 4.4x faster 1st token latency than A100**. H100 FP8 is able to achieve over 10,000 output tok/s at peak throughput for 64 concurrent requests, while maintaining a 1st token latency of 100ms. For min-latency applications, TRT-LLM H100 can achieve less than 10ms to 1st token latency. +TensorRT LLM evaluated on both Hopper and Ampere shows **H100 FP8 is up to 4.6x max throughput and 4.4x faster 1st token latency than A100**. H100 FP8 can achieve over 10,000 output tok/s at peak throughput for 64 concurrent requests, while maintaining a 1st token latency of 100 ms. For min-latency applications, TensorRT LLM on H100 can achieve less than 10 ms to 1st token latency.
29-31
: Punctuation fix.
Remove the extra period after version.-<sub>FP8 H100, FP16 A100, SXM 80GB GPUs, TP1, ISL/OSL's provided, TensorRT LLM v0.5.0., TensorRT 9.1</sub> +<sub>FP8 H100, FP16 A100, SXM 80GB GPUs, TP1, ISL/OSL provided, TensorRT LLM v0.5.0, TensorRT 9.1.</sub>docs/source/quick-start-guide.md (1)
80-88
: LLM API paragraph reads well; minor style tweak optional.
Consider “directly in Python” instead of “directly within Python.”-The LLM API is a Python API designed to facilitate setup and inference with TensorRT LLM directly within Python. +The LLM API is a Python API designed to facilitate setup and inference with TensorRT LLM directly in Python.docs/source/features/quantization.md (2)
3-5
: Fix grammar (“reduces” → “reduce”).-Quantization is a technique used to reduces memory footprint and computational cost ... +Quantization is a technique used to reduce memory footprint and computational cost ...
7-18
: Tighten phrasing.
“A variety of” → “several”.-TensorRT LLM offers a variety of quantization recipes to optimize LLM inference. +TensorRT LLM offers several quantization recipes to optimize LLM inference.Additionally, consider standardizing names to “Per‑tensor” and “Row‑wise” for consistency across docs.
docs/source/architecture/add-model.md (4)
5-6
: Intro branding LGTM; suggest improving Step 1 heading (unchanged lines).
If you touch headings later, “Write Modeling Part” → “Implement the model.”
9-10
: Wording nit.
Consider “APIs at different abstraction levels:” for flow.-TensorRT LLM provides different levels of APIs: +TensorRT LLM provides APIs at different abstraction levels:
18-18
: Minor phrasing.-2. Write a `model.py` with TensorRT LLM's APIs +2. Write a `model.py` using TensorRT LLM APIs
55-56
: Grammar tweak.-The weights from source framework need to be converted and bound to the new added TensorRT LLM model. Here is an example of converting HuggingFace weights: +Weights from the source framework need to be converted and bound to the newly added TensorRT LLM model. Here is an example of converting Hugging Face weights:docs/source/installation/linux.md (3)
25-27
: Fix typo and tighten wording in tip.
- “preqrequisites” → “prerequisites”
- Minor clarity tweaks.
Apply:
- Instead of manually installing the preqrequisites as described + Instead of manually installing the prerequisites as described above, it is also possible to use the pre-built [TensorRT LLM Develop container image hosted on NGC](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/devel) (see [here](containers) for information on container tags).
49-49
: Polish grammar in “Known limitations” lead-in.-There are some known limitations when you pip install pre-built TensorRT LLM wheel package. +There are some known limitations when installing the pre-built TensorRT LLM wheel via pip.
53-55
: Improve Slurm/MPI wording and capitalization.- If you encounter an error while running TensorRT LLM in a Slurm-managed cluster, you need to reconfigure the MPI installation to work with Slurm. - The setup methods depends on your slurm configuration, pls check with your admin. This is not a TensorRT LLM specific, rather a general mpi+slurm issue. + If you encounter an error while running TensorRT LLM in a Slurm-managed cluster, reconfigure your MPI installation to work with Slurm. + The setup method depends on your Slurm configuration; please check with your administrator. This is not TensorRT LLM–specific, but a general MPI+Slurm issue.docs/source/blogs/tech_blog/blog6_Llama4_maverick_eagle_guide.md (2)
36-44
: Tighten Step 2 headings and link sentence; fix minor grammar.-### Step 2: Prepare the TensorRT LLM release Docker image +### Step 2: Prepare the TensorRT LLM release Docker image -#### Option 1. Use weekly release NGC docker image -TensorRT LLM provides weekly release [docker image](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release) +#### Option 1. Use the weekly release NGC Docker image +TensorRT LLM provides a weekly release [Docker image](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release). -#### Option 2. Build TensorRT LLM Docker image (Alternative way) -If you want to compile a specific TensorRT LLM commit, you can build the docker image by checking out the specific branch or commit and running a make command. This may take 15-30 minutes depending on your system. +#### Option 2. Build the TensorRT LLM Docker image (alternative) +To compile a specific TensorRT LLM commit, build the Docker image by checking out the branch or commit and running the make command. This may take 15–30 minutes depending on your system.
67-78
: Optional: split long docker run for readability.Consider using backslashes consistently or a here-doc for c.yaml to improve maintainability.
docs/source/features/attention.md (7)
9-16
: Branding consistency: prefer “TensorRT LLM” over “TRT-LLM”; fix “Flashinfer” capitalization.-models in TensorRT LLM's PyTorch backend. +models in TensorRT LLM’s PyTorch backend. @@ -TensorRT LLM provides several implementations using different backends in `tensorrt_llm/_torch/attention_backend/`. +TensorRT LLM provides several implementations using different backends in `tensorrt_llm/_torch/attention_backend/`.And below:
-There are currently three available attention backends: the vanilla backend, the TRT-LLM backend, and the Flashinfer backend. +There are currently three available attention backends: the vanilla backend, the TensorRT LLM backend, and the FlashInfer backend.
22-36
: Capitalize “FlashInfer” and streamline wording; avoid “is able to”.-You can specify the desired attention backend using `PyTorchConfig.attn_backend`. For instance, to utilize the Flashinfer backend, you can pass `attn_backend="flashinfer"` to the `LLM` constructor as follows: `LLM(attn_backend="flashinfer")`. This will enable the use of the Flashinfer backend for your model. +You can select the backend via `PyTorchConfig.attn_backend`. For example, use `LLM(attn_backend="flashinfer")` to enable the FlashInfer backend. @@ -In contrast, the Flashinfer backend, `FlashInferAttention`, is performance-optimized and supports both inflight batching and paged KV cache. +In contrast, the FlashInfer backend, `FlashInferAttention`, is performance‑optimized and supports both in‑flight batching and paged KV cache. @@ -The TRT-LLM backend, `TrtllmAttention`, serves as the default backend +The TensorRT LLM backend, `TrtllmAttention`, serves as the default backend @@ -1. **Fused QKV Input**: It can accept a single QKV tensor as input, which is more efficient compared to using separate Q, K, and V tensors. +1. **Fused QKV input**: Accepts a single QKV tensor, which is more efficient than separate Q, K, and V tensors. -2. **FP8 Output**: It supports outputting the attention result in FP8 format, fusing quantization into the attention computation process. +2. **FP8 output**: Can emit FP8, fusing quantization into the attention computation.
118-121
: Minor grammar tweak around packed mode sentence.-To overcome that problem, TensorRT LLM supports a mode without padding where +To overcome that problem, TensorRT LLM supports a no‑padding mode where
158-162
: Concise phrasing.-multi-head attention in TensorRT LLM. That kernel is able to apply +multi‑head attention in TensorRT LLM. That kernel can apply @@ -RoPE, and performs dequantization and quantization. TensorRT LLM will continue to add (or -enable) additional features in future releases, such as enabling support for IA3. +RoPE, and perform dequantization and quantization. TensorRT LLM will continue to add features in future releases, such as support for IA3.
205-212
: Hyphenate “in‑flight batching” and emphasize requirement once.-TensorRT LLM supports in-flight batching of requests (also known as continuous +TensorRT LLM supports in‑flight batching of requests (also known as continuous @@ -For efficiency reasons (1), the support for inflight batching ***requires the +For efficiency reasons (1), in‑flight batching ***requires the
287-296
: Capitalize KV consistently and tighten wording.-TensorRT LLM has a feature called `Cyclic KV Cache`, which treats the kv cache +TensorRT LLM has a feature called `Cyclic KV Cache`, which treats the KV cache @@ -overwrite the "least recently used" caches. +overwrite the least‑recently‑used entries.
307-313
: Minor grammar in StreamingLLM paragraph.-Similar to the cyclic KV cache feature in TensorRT LLM, `attention_window_size` +Similar to the cyclic KV cache feature in TensorRT LLM, the `attention_window_size` @@ -StreamingLLM use the positions within the cache rather than those in the original text. +StreamingLLM uses positions within the cache rather than those in the original text.docs/source/developer-guide/perf-benchmarking.md (5)
10-16
: Fix grammar and clarity in intro bullets.-TensorRT LLM provides the `trtllm-bench` CLI, a packaged benchmarking utility that aims to make it -easier for users to reproduce our officially published [performance overiew](./perf-overview.md#throughput-measurements). `trtllm-bench` provides the follows: +TensorRT LLM provides the `trtllm-bench` CLI, a packaged benchmarking utility that makes it +easier to reproduce our officially published [performance overview](./perf-overview.md#throughput-measurements). `trtllm-bench` provides the following: @@ -- Ability to benchmark various flows and features within TensorRT LLM. +- Ability to benchmark various flows and features in TensorRT LLM.
17-20
: Link wording nit.-the [in-flight batching section](../advanced/gpt-attention.md#in-flight-batching) that describes the concept -in further detail. +the [in‑flight batching section](../advanced/gpt-attention.md#in-flight-batching) for further details.
68-71
: Grammar: validated networks sentence.-While `trtllm-bench` should be able to run any network that TensorRT LLM supports, the following are the list -that have been validated extensively and is the same listing as seen on the +While `trtllm-bench` can run any network that TensorRT LLM supports, the following models have been validated extensively and match the listing on the [Performance Overview](./perf-overview.md) page.
101-107
: Tip wording.-Although TensorRT LLM supports more quantization modes than listed above, `trtllm-bench` currently only configures for -a smaller subset. +Although TensorRT LLM supports more quantization modes than listed above, `trtllm-bench` currently configures only a subset.
156-163
: Capitalize PyTorch and minor clarity.-Note that CUDA graph is enabled by default. You can add additional pytorch config with +Note that CUDA Graph is enabled by default. You can add additional PyTorch config with `--extra_llm_api_options` followed by the path to a YAML file.docs/source/blogs/tech_blog/blog3_Optimizing_DeepSeek_R1_Throughput_on_NVIDIA_Blackwell_GPUs.md (5)
32-39
: Fix table formatting and note grammar.-The accuracy score of common dataset on this FP4 checkpoint and TensorRT LLM implementations are: +Accuracy on common datasets for this FP4 checkpoint and TensorRT LLM implementations: @@ -| TensorRT LLM FP8 | 0.697 | 0.954 | -| TensorRT LLM FP4 | 0.705 | 0.96 | +| TensorRT LLM FP8 | 0.697 | 0.954 | +| TensorRT LLM FP4 | 0.705 | 0.960 | @@ -** Note there are some run-to-run variance for these evaluations, so FP4 data is slight higher here. We think FP4 has comparable accuracy with FP8 on these datasets. +**Note:** There is run‑to‑run variance in these evaluations, so FP4 appears slightly higher here. FP4 is comparable to FP8 on these datasets.
63-71
: Tighten MLA paragraph; minor punctuation and hyphenation.-... equivalent to multiple query attention(MQA). +... equivalent to multiple‑query attention (MQA). @@ -... with input length 256 or larger -The TensorRT LLM MLA implementation chooses... +... with input length ≥256. +The TensorRT LLM MLA implementation chooses...
150-161
: Improve CUDA Graph padding wording.-There is a feature called CUDA Graph padding in TensorRT LLM, which is a good trade-off between the number of CUDA Graphs and the CUDA Graph hit ratio; it tries to pad a batch to the nearest one with a captured CUDA Graph. +CUDA Graph padding in TensorRT LLM trades off the number of captured graphs and hit ratio by padding a batch to the nearest captured size. @@ -Users can opt-out the CUDA Graph padding feature to see the perf benefits, by setting the `cuda_graph_config:\n enable_padding: False`, see API here [Pytorch backend config](...) +Users can opt out of CUDA Graph padding to evaluate performance by setting: +`cuda_graph_config:\n enable_padding: False`. See the [PyTorch backend config](...) for details.
164-165
: Minor grammar.-... users can still opt-out this feature by set *disable_overlap_scheduler* to true. +... users can still opt out by setting `disable_overlap_scheduler: true`.
32-46
: Public HF repo confirmed; clarify MoE-specific FP4 quantization details
Thenvidia/DeepSeek-R1-FP4
repository is public and shows FP4 quantization via NVIDIA Model Optimizer, but it doesn’t describe MoE-specific techniques—consider linking to relevant research (e.g., MoEQuant) or adding a footnote on per-expert calibration.docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (4)
27-28
: Fix headings/wording: “Serve” → “Server”; grammar nits.
- Use “following” (not “follow”).
- “Each requires different configurations and commands.” (plural agreement)
- Two section headers read “Launching TensorRT LLM Serve” — should be “Server”.
-Run the follow docker command to start the TensorRT LLM container in interactive mode: +Run the following docker command to start the TensorRT LLM container in interactive mode: -Each requires a different configurations and commands to run. +Each requires different configurations and commands to run. -#### Launching TensorRT LLM Serve +#### Launching the TensorRT LLM Server -#### Launching TensorRT LLM Serve +#### Launching the TensorRT LLM ServerAlso applies to: 55-58, 80-83, 156-159
25-26
: Verify NGC image tag is published; avoid ephemeral/dev tags in docs.Per prior guidance, docs should reference only published NGC images. Please confirm
nvcr.io/nvidia/tensorrt-llm/release:gpt-oss-dev
exists and is publicly available, or replace it with a publishedrelease:<version>
tag. I can help swap in the correct tag once confirmed.Also applies to: 38-39
35-37
: Environment variable prefix inconsistency (“TRT_LLM_” vs “TRTLLM_”).Project uses the
TRTLLM_*
prefix (e.g.,TRTLLM_ENABLE_PDL
). Align the disable flag for consistency.- -e TRT_LLM_DISABLE_LOAD_WEIGHTS_IN_PARALLEL=True \ + -e TRTLLM_DISABLE_LOAD_WEIGHTS_IN_PARALLEL=True \
247-267
: Minor copyedit: apostrophe and casing.
- “NVIDIA’s” needs an apostrophe.
- Keep examples concise to avoid wrapping.
- "content": "What is NVIDIAs advantage for inference?" + "content": "What is NVIDIA's advantage for inference?"Also applies to: 262-263
docs/source/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.md (2)
52-53
: Tighten terminology and grammar; prefer “OpenAI‑compatible” phrasing.
- Use “disaggregated” consistently (no hyphen).
- “to perform disaggregated LLM inference” reads better than “to do disaggregation…”.
- “OpenAI‑compatible REST API” avoids implying it’s the official OpenAI API.
-<p align="center"><sub><em>Figure 2. The execution timeline of dis-aggregated LLM serving</em></sub></p> +<p align="center"><sub><em>Figure 2. The execution timeline of disaggregated LLM serving</em></sub></p> -There are three different approaches to do disaggregation LLM inference with TensorRT LLM, +There are three different approaches to perform disaggregated LLM inference with TensorRT LLM, -...dispatches them to the appropriate context and generation servers via OpenAI REST API. +...dispatches them to the appropriate context and generation servers via an OpenAI‑compatible REST API.Also applies to: 60-66, 108-111
131-131
: Unordered list style (markdownlint MD004).Repo style prefers dashes over asterisks. Consider normalizing list bullets in this file.
Also applies to: 284-290
docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md (4)
38-39
: Verify NGC tag exists; prefer a published tag.Confirm nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc6 is published. If not, switch to a published rc or monthly release tag. Avoid “latest”.
49-49
: Grammar nit: “use latest main branch” → “use the latest main branch”.-If you want to use latest main branch, you can choose to build from source to install TensorRT LLM, the steps refer to ... +If you want to use the latest main branch, you can build from source to install TensorRT LLM. See ...
134-135
: Clarify backend choice.“--backend pytorch” means CPU/GPU via PyTorch backend, not TensorRT engines. Add a one‑liner to set expectations.
- **Description:** Tells TensorRT LLM to use the **pytorch** backend. + **Description:** Uses the **pytorch** backend (no TensorRT engine build required). For TensorRT runtime, use the default backend or the appropriate flag per release notes.
296-296
: Grammar nit: “first creating a wrapper” → “first create a wrapper”.-To do this first creating a wrapper [bench.sh] script. +To do this, first create a wrapper [bench.sh] script.docs/source/architecture/checkpoint.md (2)
11-11
: Minor phrasing: “Load the engines to” → “Load the engines into”.-3. Load the engines to TensorRT LLM model runner and evaluate with different evaluation tasks. +3. Load the engines into the TensorRT LLM ModelRunner and evaluate with different tasks.
221-223
: Tighten wording around PATH.-TensorRT LLM provides a unified build command: `trtllm-build`. Before using it, -you may need to add it to the `PATH`. +TensorRT LLM provides a unified build command: `trtllm-build`. If the command is not found, add its directory to `PATH`.docs/source/features/paged-attention-ifb-scheduler.md (4)
41-42
: Fix typo: “re-buliding” → “rebuilding”.-... You can tune it using the runtime `max_num_tokens` without re-buliding the engine. ... +... You can tune it using the runtime `max_num_tokens` without rebuilding the engine. ...
109-110
: Consistency: “max-batch size” → “max batch size”.-... schedules requests based on max-batch size and max-num tokens. +... schedules requests based on max batch size and max num tokens.
111-111
: Branding in alt text: replace “TRT-LLM” with “TensorRT LLM” (keep file names).- + ... - + ... - + ... - +Also applies to: 118-118, 130-130, 142-142
140-141
: Normalize “max num tokens” formatting.-... over the max-num tokens limit. +... over the max num tokens limit.docs/source/blogs/Falcon180B-H200.md (2)
20-21
: Image links point to “blob/rel/...”; verify branch or use relative paths.Switch to relative paths under docs (preferred) or a valid branch (e.g., main or a release branch). “rel” is likely invalid.
-<img src="https://github.com/NVIDIA/TensorRT-LLM/blob/rel/docs/source/blogs/media/Falcon180B-H200_tps.png?raw=true" ... +<img src="../media/Falcon180B-H200_tps.png" ...Also applies to: 36-36, 71-71, 96-96
10-13
: Style: combine “INT4 AWQ with FP8” phrasing.-The 141GB of memory on H200, paired with TensorRT LLM running INT4 AWQ with -FP8, allows ... +The 141GB of memory on H200, paired with TensorRT LLM running INT4‑AWQ with FP8, allows ...docs/source/blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.md (6)
83-87
: Container run step OK; consider mounting HF cache for speed.-make -C docker run LOCAL_USER=1 DOCKER_RUN_ARGS="-v $YOUR_MODEL_PATH:$YOUR_MODEL_PATH:ro -v $YOUR_WORK_PATH:$YOUR_WORK_PATH" +make -C docker run LOCAL_USER=1 DOCKER_RUN_ARGS="-v $YOUR_MODEL_PATH:$YOUR_MODEL_PATH:ro -v $YOUR_WORK_PATH:$YOUR_WORK_PATH -v $HOME/.cache/huggingface:/root/.cache/huggingface"
89-95
: Ensure correct working dir inside container; minor shell nits.-Here we compile the source inside the container: +Here we compile the source inside the container (from the repo root): ```bash +cd "$YOUR_WORK_PATH/TensorRT-LLM" python3 ./scripts/build_wheel.py --trt_root /usr/local/tensorrt --benchmarks --cuda_architectures "90-real;100-real" --python_bindings --clean...
pip install --user build/tensorrt_llm*.whl export PATH=${HOME}/.local/bin:${PATH} -export PYTHONPATH=`pwd` +export PYTHONPATH="$(pwd)"Also applies to: 99-102 --- `160-169`: **Clarify trtllm-bench description (“published numbers” sentence).** ```diff -... aims to make it easier for users to reproduce our officially published. +... aims to make it easier for users to reproduce our officially published performance numbers.
189-214
: Unify KV‑cache option name with other docs (kv_cache_config vs kv_cache_dtype).Use one style across docs; TorchLlmArgs commonly uses kv_cache_config.dtype. Update here for consistency.
-kv_cache_dtype: fp8 +kv_cache_config: + dtype: fp8
279-289
: Use consistent CLI flag form for model (“--model” instead of “-m”).-trtllm-bench -m nvidia/DeepSeek-R1-FP4 \ +trtllm-bench --model nvidia/DeepSeek-R1-FP4 \ ... -trtllm-bench -m deepseek-ai/DeepSeek-R1 \ +trtllm-bench --model deepseek-ai/DeepSeek-R1 \Also applies to: 372-384
404-405
: Wording polish.-TensorRT LLM is working on enhancements that can make the benchmark process smoother. +TensorRT LLM is working on enhancements to make the benchmarking process smoother.docs/source/blogs/tech_blog/blog2_DeepSeek_R1_MTP_Implementation_and_Optimization.md (5)
2-2
: Tighten byline grammar.-by NVIDIA TensorRT LLM team +By the NVIDIA TensorRT LLM team
4-10
: Fix unordered list indentation to satisfy MD007.Make top-level bullets flush-left; indent sub-items by two spaces.
- - [Table of Contents](#table-of-contents) - - [MTP for inference](#mtp-for-inference) + - [Table of Contents](#table-of-contents) + - [MTP for inference](#mtp-for-inference) - [Background](#background) - [MTP Vanilla](#mtp-vanilla) - [MTP Eagle](#mtp-eagle) - - [MTP implementation in TensorRT LLM](#mtp-implementation-in-tensorrt-llm) + - [MTP implementation in TensorRT LLM](#mtp-implementation-in-tensorrt-llm)
28-28
: Consider pinning the “previous blog post” link to a commit for stability.Avoid link rot as docs evolve on main.
110-110
: Avoid weak intensifier.-Attention is also a very important component in supporting MTP inference. +Attention is also a critical component in supporting MTP inference.
248-248
: Polish closing sentence grammar.-... And a special thanks goes to the DeepSeek team ... +... A special thank-you goes to the DeepSeek team ...docs/source/installation/build-from-source-linux.md (1)
182-191
: Tighten wording; minor readability nits.-... with older versions of `torch` (prior to 2.7.0) due to the ... +... with older versions of `torch` (before 2.7.0) due to the ... @@ -By default, TensorRT LLM uses `cpp/build` as the build directory, but you can specify a different location with the `--build_dir` option. +By default, TensorRT LLM uses `cpp/build` as the build directory; you can specify a different location with `--build_dir`.
Resolved review threads:
...g/blog1_Pushing_Latency_Boundaries_Optimizing_DeepSeek-R1_Performance_on_NVIDIA_B200_GPUs.md (outdated)
docs/source/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.md
docs/source/blogs/tech_blog/blog8_Scaling_Expert_Parallelism_in_TensorRT-LLM_part2.md
Force-pushed from 3bfa93b to c3b2ebe
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (11)
docs/source/features/paged-attention-ifb-scheduler.md (2)
39-50
: Fix grammar and a typo (“re-buliding” → “rebuilding”). Small doc issues that will ship to users; please fix.
-`max_seq_len` defines the maximum sequence length of single request +`max_seq_len` defines the maximum sequence length of a single request. @@ -`max_num_tokens` is set to 8192 by default starting from v0.11. You can tune it using the runtime `max_num_tokens` without re-buliding the engine. It is recommended to tune `--max_num_tokens` for better performance. +`max_num_tokens` is set to 8192 by default starting from v0.11. You can tune it using the runtime `max_num_tokens` without rebuilding the engine. It is recommended to tune `--max_num_tokens` for better performance.
109-134
: Complete branding and parameter formatting inpaged-attention-ifb-scheduler.md
- Replace all instances of “TRT-LLM” or “TRTLLM” in alt text (lines 111, 118, 124, 130, 142) with “TensorRT LLM”
- Replace “max-batch size” and “max-num tokens” mentions (lines 109, 140, 144) with backticked flags
max_batch_size
andmax_num_tokens
- Standardize all “kv-cache” references (lines 76, 77, 120, 128, 134, 144) to “KV cache” or the backticked flag
kv_cache
Re-run:
rg -n 'TRT-LLM|TRTLLM' docs/source/features/paged-attention-ifb-scheduler.md rg -n 'max-num tokens' docs/source/features/paged-attention-ifb-scheduler.md rg -n 'kv-cache' docs/source/features/paged-attention-ifb-scheduler.mdto confirm no occurrences remain.
docs/source/developer-guide/perf-benchmarking.md (2)
17-19
: Fix in-flight batching cross-reference.This link likely 404s; prefer the section label added in the attention doc.
-`trtllm-bench` executes all benchmarks using [in-flight batching] -- for more information see -the [in-flight batching section](../advanced/gpt-attention.md#in-flight-batching) that describes the concept +`trtllm-bench` executes all benchmarks using in‑flight batching — for more information see +the {ref}`inflight-batching` section that describes the concept
117-119
: “input_ids” are token IDs, not logits.This will confuse users.
-| `input_ids` | Y* | List[Integer] | List of logits that make up the request prompt. | +| `input_ids` | Y* | List[Integer] | Token IDs that make up the request prompt. |docs/source/blogs/tech_blog/blog4_Scaling_Expert_Parallelism_in_TensorRT-LLM.md (1)
539-542
: Broken link path to benchmarking doc.The project moved the page under developer-guide; this link likely 404s.
-Prepare a dataset following the [benchmarking documentation](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/performance/perf-benchmarking.md#preparing-a-dataset) and save it as `./dataset.json`. +Prepare a dataset following the [benchmarking documentation](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/developer-guide/perf-benchmarking.md#preparing-a-dataset) and save it as `./dataset.json`.docs/source/blogs/tech_blog/blog6_Llama4_maverick_eagle_guide.md (1)
71-77
: Fix non-exported env var and make YAML creation POSIX-safe.
TRT_LLM_DISABLE_LOAD_WEIGHTS_IN_PARALLEL=True
on its own line won’t be exported totrtllm-serve
. Prefix it to the command orexport
it.echo -e
is not portable with/bin/sh
. Use a heredoc withprintf
/cat
.Apply:
- docker.io/<username>/tensorrt_llm:main sh \ - -c "echo -e 'enable_autotuner: false\nenable_attention_dp: false\nenable_min_latency: true\ncuda_graph_config:\n max_batch_size: 8\nspeculative_config:\n decoding_type: Eagle\n max_draft_len: 3\n speculative_model_dir: /config/models/eagle\n eagle3_one_model: true\nkv_cache_config:\n enable_block_reuse: false' > c.yaml && \ - TRT_LLM_DISABLE_LOAD_WEIGHTS_IN_PARALLEL=True \ - trtllm-serve /config/models/maverick \ + docker.io/<username>/tensorrt_llm:main sh \ + -c "cat > c.yaml <<'EOF' +enable_autotuner: false +enable_attention_dp: false +enable_min_latency: true +cuda_graph_config: + max_batch_size: 8 +speculative_config: + decoding_type: Eagle + max_draft_len: 3 + speculative_model_dir: /config/models/eagle + eagle3_one_model: true +kv_cache_config: + enable_block_reuse: false +EOF + TRT_LLM_DISABLE_LOAD_WEIGHTS_IN_PARALLEL=True trtllm-serve /config/models/maverick \ --host 0.0.0.0 --port 8000 \ --tp_size 8 --ep_size 1 \ --trust_remote_code --extra_llm_api_options c.yaml \ --kv_cache_free_gpu_memory_fraction 0.75"docs/source/features/parallel-strategy.md (2)
72-81
: Example YAML is invalid as-is (duplicate keys); split TP/DP configs.Copy-pasting this block yields one file where later keys override earlier ones. Provide separate examples or files.
-```bash -cat <<EOF > parallel_config.yaml -# TP-8 -tensor_parallel_size: 8 -enable_attention_dp: false # default -# DP-8 -tensor_parallel_size: 8 -enable_attention_dp: true -EOF -``` +```bash +# TP example +cat > parallel_config_tp.yaml <<'EOF' +tensor_parallel_size: 8 +enable_attention_dp: false +EOF +# DP example +cat > parallel_config_dp.yaml <<'EOF' +tensor_parallel_size: 8 +enable_attention_dp: true +EOF +```
103-118
: Same issue for MoE: split TP/EP/Hybrid into separate examples.One YAML with repeated keys is misleading.
-```bash -cat <<EOF > parallel_config.yaml -# TP only -tensor_parallel_size: 8 -moe_tensor_parallel_size: 8 - -# EP only -tensor_parallel_size: 8 -moe_expert_parallel_size: 8 - -# Hybrid (TP-4 × EP-2) -tensor_parallel_size: 8 # 4 × 2 -moe_tensor_parallel_size: 4 -moe_expert_parallel_size: 2 -EOF -``` +```bash +# TP-only +cat > parallel_config_moe_tp.yaml <<'EOF' +tensor_parallel_size: 8 +moe_tensor_parallel_size: 8 +EOF +# EP-only +cat > parallel_config_moe_ep.yaml <<'EOF' +tensor_parallel_size: 8 +moe_expert_parallel_size: 8 +EOF +# Hybrid (TP-4 × EP-2) +cat > parallel_config_moe_hybrid.yaml <<'EOF' +tensor_parallel_size: 8 +moe_tensor_parallel_size: 4 +moe_expert_parallel_size: 2 +EOF +```docs/source/deployment-guide/quick-start-recipe-for-llama4-scout-on-trtllm.md (1)
34-43
: Parameterize NGC tag with default to published 1.0.0rc4+TRTLLM_TAG=${TRTLLM_TAG:-1.0.0rc4} # Latest published on NGC; verify on the NGC tags page docker run --rm -it \ --ipc=host \ --gpus all \ -p 8000:8000 \ -v ~/.cache:/root/.cache:rw \ --name tensorrt_llm \ -nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc6 \ +nvcr.io/nvidia/tensorrt-llm/release:${TRTLLM_TAG} \ /bin/bashdocs/source/blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.md (2)
214-224
: Missing dataset definition before using ${YOUR_DATA_PATH}.This section references ${YOUR_DATA_PATH} without defining it. Add dataset preparation (real or synthetic) before the benchmark.
cat >./extra-llm-api-config.yml <<EOF @@ EOF -trtllm-bench --model nvidia/DeepSeek-R1-0528-FP4 +# Prepare a dataset (synthetic example shown; replace with your real dataset if needed) +python ${YOUR_WORK_PATH}/benchmarks/cpp/prepare_dataset.py \ + --stdout \ + --tokenizer nvidia/DeepSeek-R1-0528-FP4 \ + token-norm-dist --input-mean 1024 --output-mean 1024 \ + --input-stdev 0 --output-stdev 0 \ + --num-requests 114688 > dataset.txt +YOUR_DATA_PATH=./dataset.txt + +trtllm-bench --model nvidia/DeepSeek-R1-0528-FP4 throughput - --dataset ${YOUR_DATA_PATH} + --dataset ${YOUR_DATA_PATH}
347-354
: Dataset inconsistency: text says “real dataset” but commands generate synthetic.Align the prose with the commands.
-Our benchmark results are based on **Batch = 1024, ISL = 1K, OSL = 2K, num_requests = 5120 from real dataset** +Our benchmark results are based on **Batch = 1024, ISL = 1K, OSL = 2K, num_requests = 5120 from a synthetic dataset** (commands below generate it).Also applies to: 351-361
♻️ Duplicate comments (17)
docs/source/commands/trtllm-serve/run-benchmark-with-trtllm-serve.md (1)
3-3
: Fix spelling and article usage (“OpenAI-compatible”).Also add “the” before the command.
-TensorRT LLM provides the OpenAI-compatiable API via `trtllm-serve` command. +TensorRT LLM provides an OpenAI-compatible API via the `trtllm-serve` command.docs/source/blogs/H200launch.md (2)
38-38
: Fix broken image URL (space in repo name).Use the correct repo slug with a hyphen so the image loads.
-<img src="https://github.com/NVIDIA/TensorRT LLM/blob/rel/docs/source/blogs/media/H200launch_tps.png?raw=true" alt="H200 TPS" width="500" height="auto"> +<img src="https://github.com/NVIDIA/TensorRT-LLM/blob/rel/docs/source/blogs/media/H200launch_tps.png?raw=true" alt="H200 TPS" width="500" height="auto">
47-47
: Fix broken image URL (same issue).Replace space with hyphen in repo name.
-<img src="https://github.com/NVIDIA/TensorRT LLM/blob/rel/docs/source/blogs/media/H200launch_H200vsH100_tps.png?raw=true" alt="max throughput llama sweep" width="500" height="auto"> +<img src="https://github.com/NVIDIA/TensorRT-LLM/blob/rel/docs/source/blogs/media/H200launch_H200vsH100_tps.png?raw=true" alt="max throughput llama sweep" width="500" height="auto">docs/source/quick-start-guide.md (1)
100-105
: Fix “Deployment Guide” link target (broken path).Use the deployment-guide index instead of a non-existent examples path. Matches prior review guidance.
-- **[Deployment Guide](examples/llm_api_examples)** - Comprehensive examples for deploying LLM inference in various scenarios +- **[Deployment Guide](deployment-guide/index.rst)** - Comprehensive examples for deploying LLM inference in various scenariosdocs/source/blogs/tech_blog/blog1_Pushing_Latency_Boundaries_Optimizing_DeepSeek-R1_Performance_on_NVIDIA_B200_GPUs.md (1)
68-68
: Broken GitHub link: stray space in repo path.Replace “TensorRT LLM” with “TensorRT-LLM” in the URL.
-*TensorRT LLM already supports [FP8 Attention](https://github.com/NVIDIA/TensorRT LLM/tree/main/examples/models/core/deepseek_v3#fp8-kv-cache-and-mla) ... +*TensorRT LLM already supports [FP8 Attention](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/deepseek_v3#fp8-kv-cache-and-mla) ...docs/source/overview.md (1)
21-23
: Broken URL (leading dot).Remove the leading dot before
https://
.-- **DeepSeek R1**: [World-record inference performance on Blackwell GPUs](.https://developer.nvidia.com/blog/nvidia-blackwell-delivers-world-record-deepseek-r1-inference-performance/) +- **DeepSeek R1**: [World-record inference performance on Blackwell GPUs](https://developer.nvidia.com/blog/nvidia-blackwell-delivers-world-record-deepseek-r1-inference-performance/)docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md (4)
7-8
: Align container narrative with commands.Intro says “start with the PyTorch container,” but the instructions use the TensorRT LLM container. Make the intro consistent.
-The guide is intended for developers ... —starting with the PyTorch container from NGC, then installing TensorRT LLM for model serving, FlashInfer for optimized CUDA kernels, and ModelOpt to enable FP8 and NVFP4 quantized execution. +The guide is intended for developers ... —starting with the TensorRT LLM container from NGC (or building from source), using FlashInfer for optimized CUDA kernels, and ModelOpt to enable FP8 and NVFP4 quantized execution.
51-54
: Branding: “TRT-LLM Server” → “TensorRT LLM Server”.Update heading and surrounding text; keep CLI/tool names unchanged.
-### Creating the TRT-LLM Server config +### Creating the TensorRT LLM Server config
94-101
: Branding: replace “TRT-LLM server”.-### Launch the TRT-LLM Server +### Launch the TensorRT LLM Server - -Below is an example command to launch the TRT-LLM server ... +Below is an example command to launch the TensorRT LLM server ...
209-216
: Branding: fix remaining “TRT-LLM server” mentions.-Start a new terminal on the host to test the TensorRT LLM server you just launched. +Start a new terminal on the host to test the TensorRT LLM server you just launched. -After the TRT-LLM server is set up ... +After the TensorRT LLM server is set up ... -... showing that the TRT-LLM server returns ... +... showing that the TensorRT LLM server returns ...Also applies to: 219-231
docs/source/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.md (1)
144-145
: Fix typo: “NIXL” → “NCCL”; adjust recommendation.-... MPI, UCX, and NIXL—are all supported ... -... we recommend using UCX and NIXL backends ... +... MPI, UCX, and NCCL—are all supported ... +... we recommend using UCX and NCCL backends ...docs/source/architecture/checkpoint.md (1)
171-173
: Fix typos and agreement in note (“out_fature”, “follows”).- The linear weights in TensorRT LLM checkpoint always follows (`out_feature`, `in_feature`) shape, whereas some quantized linear in TensorRT LLM implemented by plugin may use (`in_feature`, `out_fature`) shape. The `trtllm-build` command adds a transpose operation to post-process it. + The linear weights in a TensorRT LLM checkpoint always follow (`out_feature`, `in_feature`) shape, whereas some quantized linear layers implemented by plugins may use (`in_feature`, `out_feature`) shape. The `trtllm-build` command adds a transpose operation to post‑process them.docs/source/installation/build-from-source-linux.md (2)
12-12
: Fix typo and tighten phrasing; ensure anchors exist.Replace “TensortRT-LLM” and streamline the sentence. Also confirm the two fragment anchors exist later in the doc.
-If you intend to build any TensortRT-LLM artifacts, such as any of the container images (note that there exist pre-built [develop](#build-from-source-tip-develop-container) and [release](#build-from-source-tip-release-container) container images in NGC), or the TensorRT LLM Python wheel, you first need to clone the TensorRT LLM repository: +If you intend to build any TensorRT LLM artifacts—such as container images (pre-built [develop](#build-from-source-tip-develop-container) and [release](#build-from-source-tip-release-container) images are available on NGC)—or the TensorRT LLM Python wheel, first clone the TensorRT LLM repository:
36-36
: Broken self-link after header rename; add explicit anchor and align link text.The link at Line 36 points to the old autogenerated slug. Add an explicit label before the Option 2 header and adjust the earlier link text to match the new header wording.
-TensorRT LLM contains a simple command to create a Docker image. Note that if you plan to develop on TensorRT LLM, we recommend using [Option 2: Build TensorRT LLM Step-By-Step](#option-2-build-tensorrt-llm-step-by-step). +TensorRT LLM contains a simple command to create a Docker image. Note that if you plan to develop on TensorRT LLM, we recommend using [Option 2: Container for building TensorRT LLM Step-by-Step](#option-2-build-tensorrt-llm-step-by-step).+ (option-2-build-tensorrt-llm-step-by-step)= -### Option 2: Container for building TensorRT LLM Step-by-Step +### Option 2: Container for building TensorRT LLM Step-by-StepAlso applies to: 59-61
docs/source/blogs/tech_blog/blog2_DeepSeek_R1_MTP_Implementation_and_Optimization.md (3)
244-245
: Consistent hyphenation and concision for Eagle-3 paragraph.-Another important method is Eagle3. From the [Eagle3 paper](https://arxiv.org/pdf/2503.01840), the promising results show that it can help greatly increase the acceptance rate by leveraging different levels’ hidden states to predict draft tokens. Since TensorRT LLM already has [Eagle-3 support](https://github.com/NVIDIA/TensorRT-LLM/pull/3035) now, in the future, we also want to train an Eagle3 head to support DeepSeek-V3/R1+Eagle3 to achieve better speedup. +Another important method is Eagle-3. The [Eagle-3 paper](https://arxiv.org/pdf/2503.01840) shows it can increase acceptance rates by leveraging multi-level hidden states to predict draft tokens. Since TensorRT LLM already has [Eagle-3 support](https://github.com/NVIDIA/TensorRT-LLM/pull/3035), we plan to train an Eagle-3 head to support DeepSeek-V3/R1 + Eagle-3 for better speedup.
77-79
: Fix both MTP code links; point to the correct symbols with a pinned commit.Both links currently point to the same line on main. Link Vanilla and Eagle separately using the same commit used later in the doc.
-TensorRT LLM has two different paths for MTP, one for [MTP Vanilla](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/_torch/speculative/mtp.py#L1047) and another for [MTP Eagle](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/_torch/speculative/mtp.py#L1047). MTP Eagle is the default path for DeepSeek-V3 and DeepSeek-R1 models. +TensorRT LLM has two different paths for MTP, one for [MTP Vanilla](https://github.com/NVIDIA/TensorRT-LLM/blob/338744fba6a91147b739b7f02d19b37bc19aa17a/tensorrt_llm/_torch/speculative/mtp.py#L326) and another for [MTP Eagle](https://github.com/NVIDIA/TensorRT-LLM/blob/338744fba6a91147b739b7f02d19b37bc19aa17a/tensorrt_llm/_torch/speculative/mtp.py#L1047). MTP Eagle is the default path for DeepSeek-V3 and DeepSeek-R1 models.
240-241
: Fix typo and standardize Eagle naming.Correct “Ealge2” and keep hyphenation consistent.
-...such as Ealge2 and Eagle3, to increase the acceptance rate. +...such as Eagle-2 and Eagle-3, to increase the acceptance rate.
🧹 Nitpick comments (81)
docs/source/features/paged-attention-ifb-scheduler.md (4)
5-11
: Hyphenation/style: “in‑flight batching” and minor tighteningUse consistent hyphenation and drop the extra “the” for smoother reading.
-For efficiency reasons (1), the support for inflight batching ***requires the -input tensors to be packed (no padding)***. +For efficiency reasons (1), support for in-flight batching ***requires +input tensors to be packed (no padding)***.
61-64
: Tighten wordingImprove flow; no meaning change.
-Using a more realistic value for `max_num_tokens` allows TensorRT LLM to -allocate more memory to store the KV cache and execute more requests together. -It leads to an increased efficiency. +Using a realistic value for `max_num_tokens` lets TensorRT LLM allocate more memory to the KV cache and execute more requests together, improving efficiency.
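For readers tuning this knob from the LLM API rather than the CLI, a minimal sketch follows; it assumes `max_batch_size` and `max_num_tokens` are accepted as `LLM` constructor arguments, and the values are illustrative only.

```python
# Minimal sketch (assumption: max_num_tokens and max_batch_size are accepted
# by the LLM constructor; the model path and values are placeholders).
from tensorrt_llm import LLM

llm = LLM(
    model="/path/to/model",
    max_batch_size=64,     # upper bound on requests scheduled per iteration
    max_num_tokens=8192,   # upper bound on total packed tokens per iteration
)
```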
83-87
: Define acronym on first use; remove double spaceMHA should be expanded on first mention; also remove extra space after period.
-In the generation phase, a common optimization is to provide the MHA kernel +In the generation phase, a common optimization is to provide the multi-head attention (MHA) kernel @@ -already been computed. That cache is known as the KV cache. TensorRT LLM uses +already been computed. That cache is known as the KV cache. TensorRT LLM uses
138-141
: Use the backticked flag namemax_num_tokens
Keep terminology consistent with earlier sections and runtime flag names.
-The [visualization](#the-schedulers) of the TensorRT LLM scheduler showed that initially Request 3 couldn't be scheduled because it would put the scheduler over the max-num tokens limit. However, with context chunking, this is no longer the case, and the first chunk of Request 3 can be scheduled. +The [visualization](#the-schedulers) of the TensorRT LLM scheduler showed that initially Request 3 couldn't be scheduled because it would put the scheduler over the `max_num_tokens` limit. However, with context chunking, this is no longer the case, and the first chunk of Request 3 can be scheduled.docs/source/features/checkpoint-loading.md (1)
19-19
: Grammar: drop the possessive.Prefer “the TensorRT LLM model definition.”
-- **Weight Mappers**: Map and transform loaded weights to TensorRT LLM model's definition +- **Weight Mappers**: Map and transform loaded weights to the TensorRT LLM model definitiondocs/source/examples/customization.md (1)
5-5
: Tighten grammar and articles; fix phrasing.Use an active construction and add “the.”
-TensorRT LLM can quantize the Hugging Face model automatically. By setting the appropriate flags in the `LLM` instance. For example, to perform an Int4 AWQ quantization, the following code triggers the model quantization. Please refer to complete list of [supported flags](https://nvidia.github.io/TensorRT-LLM/_modules/tensorrt_llm/quantization/mode.html#QuantAlgo) and acceptable values. +TensorRT LLM can automatically quantize Hugging Face models by setting the appropriate flags on the `LLM` instance. For example, to perform INT4 AWQ quantization, use the following. Please refer to the complete list of [supported flags](https://nvidia.github.io/TensorRT-LLM/_modules/tensorrt_llm/quantization/mode.html#QuantAlgo) and acceptable values.docs/source/commands/trtllm-serve/run-benchmark-with-trtllm-serve.md (2)
22-22
: NGC phrasing + suggest pinning to a published tag.Use “in NGC Catalog,” and consider showing a concrete published tag to avoid ambiguity.
-TensorRT LLM distributes the pre-built container on [NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags). +TensorRT LLM distributes a pre-built container in the [NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags).Example command (non-diff) with a pinned tag:
docker run --rm --ipc host -p 8000:8000 --gpus all -it nvcr.io/nvidia/tensorrt-llm/release:<published_tag>To keep docs accurate per published images, replace <published_tag> with an actual tag listed on NGC for the release channel.
117-117
: Article fix.Use “an example of …”.
-Below is some example TensorRT LLM serving benchmark output. Your actual results may vary. +Below is an example of TensorRT LLM serving benchmark output. Your actual results may vary.docs/source/features/multi-modality.md (1)
35-35
: Styling: prefer “LLM API” (no hyphen).Align with usage elsewhere in docs.
-Quickly try out TensorRT LLM's multimodal support using our `LLM-API` and a ready-to-run [example](source:examples/llm-api/quickstart_multimodal.py): +Quickly try out TensorRT LLM's multimodal support using our `LLM API` and a ready-to-run [example](source:examples/llm-api/quickstart_multimodal.py):docs/source/dev-on-cloud/build-image-to-dockerhub.md (1)
6-6
: Use “Docker Hub” capitalization.Also keep heading casing consistent.
-## Build the TensorRT LLM Docker Image and Upload to DockerHub
+## Build the TensorRT LLM Docker Image and Upload to Docker Hub

docs/source/blogs/H200launch.md (2)
1-1
: Verify target of “updated Llama performance” link.The anchor text says “Llama performance” but links to Falcon180B-H200.md. Confirm intent or update the link.
58-58
: Tighten grammar and remove repetition (“performance … performance”).-H200 is the newest addition to NVIDIA’s data center GPU portfolio. To maximize that compute performance, H200 is the first GPU with HBM3e memory with 4.8TB/s of memory bandwidth, a 1.4X increase over H100. H200 also expands GPU memory capacity nearly 2X to 141 gigabytes (GB). The combination of faster and larger HBM memory accelerates performance of LLM model inference performance with faster throughput and tokens per second. These results are measured and preliminary, more updates expected as optimizations for H200 continue with TensorRT LLM. +H200 is the newest addition to NVIDIA’s data center GPU portfolio. To maximize that compute performance, H200 is the first GPU with HBM3e memory with 4.8TB/s of bandwidth, a 1.4× increase over H100. H200 also expands GPU memory capacity nearly 2× to 141 GB. The combination of faster and larger HBM memory accelerates LLM inference, improving throughput and tokens per second. These results are preliminary; more updates are expected as H200 optimizations continue with TensorRT LLM.docs/source/blogs/H100vsA100.md (2)
7-7
: Branding consistency: replace “TRT-LLM” with “TensorRT LLM” and minor style tweak.-TensorRT LLM evaluated on both Hopper and Ampere shows **H100 FP8 is up to 4.6x max throughput and 4.4x faster 1st token latency than A100**. H100 FP8 is able to achieve over 10,000 output tok/s at peak throughput for 64 concurrent requests, while maintaining a 1st token latency of 100ms. For min-latency applications, TRT-LLM H100 can achieve less than 10ms to 1st token latency. +TensorRT LLM evaluated on both Hopper and Ampere shows **H100 FP8 is up to 4.6x max throughput and 4.4x faster 1st token latency than A100**. H100 FP8 can achieve over 10,000 output tok/s at peak throughput for 64 concurrent requests, while maintaining a 1st token latency of 100 ms. For min-latency applications, TensorRT LLM on H100 can achieve less than 10 ms to 1st token latency.
31-31
: End sentence with a period for consistency.-The full data behind these charts & tables and including larger models with higher TP values can be found in TensorRT LLM's [Performance Documentation](https://nvidia.github.io/TensorRT-LLM/latest/performance/perf-overview.html) +The full data behind these charts & tables and including larger models with higher TP values can be found in TensorRT LLM's [Performance Documentation](https://nvidia.github.io/TensorRT-LLM/latest/performance/perf-overview.html).docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md (7)
53-54
: Grammar: “use the latest main branch”.-If you want to use latest main branch, you can choose to build from source to install TensorRT LLM, the steps refer to [https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html](https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html) +If you want to use the latest main branch, you can build from source to install TensorRT LLM. See: [https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html](https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html)
55-58
: Branding: update section title to “TensorRT LLM server config”.-### Creating the TRT-LLM Server config -We create a YAML configuration file /tmp/config.yml for the TensorRT LLM Server and populate it with the following recommended performance settings. +### Creating the TensorRT LLM server config +Create a YAML configuration file at /tmp/config.yml for the TensorRT LLM server and populate it with the following recommended performance settings.
72-76
: Branding: update launch section title.-### Launch the TRT-LLM Server +### Launch the TensorRT LLM server
197-209
: Branding: replace “TRT-LLM server” with “TensorRT LLM server”.-After the TRT-LLM server is set up and shows Application startup complete, you can send requests to the server. +After the TensorRT LLM server is set up and shows “Application startup complete”, you can send requests to the server. - -Here is an example response, showing that the TRT-LLM server returns “New York is a state located in the northeastern United States. It is bordered by”, completing the input sequence. +Here is an example response, showing that the TensorRT LLM server returns “New York is a state located in the northeastern United States. It is bordered by…”, completing the input sequence.
255-256
: Fix model name capitalization (“Llama”, not “LLama”).-* Note: The tokenizer will add BOS before input prompt by default, which leads to accuracy regression on GSM8K task for LLama 3.3 70B instruction model. So set add\_special\_tokens=False to avoid it. +* Note: The tokenizer will add BOS before the input prompt by default, which leads to accuracy regression on the GSM8K task for Llama 3.3 70B instruction model. Set add\_special\_tokens=False to avoid it.
274-275
: Grammar: “first create a wrapper script” and avoid external link to bench.sh.-To benchmark the performance of your TensorRT LLM server you can leverage the built-in “benchmark\_serving.py” script. To do this first creating a wrapper [bench.sh](http://bench.sh) script. +To benchmark the performance of your TensorRT LLM server, use the built-in “benchmark\_serving.py” script. First, create a wrapper script named `bench.sh`.
314-314
: Prefer repository path over full GitHub URL in docs.Use a relative docs link or Sphinx ref to benchmark_serving.py to keep links stable across branches.
docs/source/features/quantization.md (1)
7-7
: Concise wording (“several” vs “a variety of”).-TensorRT LLM offers a variety of quantization recipes to optimize LLM inference. These recipes can be broadly categorized as follows: +TensorRT LLM offers several quantization recipes to optimize LLM inference. These can be categorized as follows:docs/source/architecture/add-model.md (2)
9-18
: Tighten wording and punctuation in “Step 1” bullets.
- Remove the extra comma after “such as”.
- Drop the possessive in “TensorRT LLM’s APIs”.
- - Basic layers, such as, `Linear` and `LayerNorm`. - - High-level layers, such as, `MLP` and `Attention`. + - Basic layers, such as `Linear` and `LayerNorm`. + - High-level layers, such as `MLP` and `Attention`. -2. Write a `model.py` with TensorRT LLM's APIs +2. Write a `model.py` with TensorRT LLM APIs
55-56
: Grammar: clarify weight-conversion sentence.Small readability fix.
-The weights from source framework need to be converted and bound to the new added TensorRT LLM model. Here is an example of converting HuggingFace weights: +The weights from the source framework must be converted and bound to the newly added TensorRT LLM model. For example, to convert Hugging Face weights:docs/source/installation/linux.md (4)
24-29
: Fix typos and ensure container guidance matches published NGC tags.
- Spelling: “preqrequisites” → “prerequisites”.
Also, per prior learnings, confirm the “Develop container” tag(s) referenced here are actually published on NGC.
-```{tip} -Instead of manually installing the preqrequisites as described +```{tip} +Instead of manually installing the prerequisites as described
31-34
: Consistent naming.Use the new branding in the heading.
-### Install pre-built TensorRT LLM wheel +### Install the pre-built TensorRT LLM wheel
33-37
: Minor copy edit.-Once all prerequisites are in place, TensorRT LLM can be installed as follows: +Once all prerequisites are in place, install TensorRT LLM as follows:
49-55
: Polish “Known limitations” phrasing and Slurm note.Capitalize Slurm/MPI and fix grammar.
-There are some known limitations when you pip install pre-built TensorRT LLM wheel package. +There are some known limitations when you install the pre-built TensorRT LLM wheel via pip. -If you encounter an error while running TensorRT LLM in a Slurm-managed cluster, you need to reconfigure the MPI installation to work with Slurm. -The setup methods depends on your slurm configuration, pls check with your admin. This is not a TensorRT LLM specific, rather a general mpi+slurm issue. +If you encounter an error while running TensorRT LLM in a Slurm-managed cluster, you may need to reconfigure the MPI installation to work with Slurm. +The setup method depends on your Slurm configuration—please check with your admin. This is not TensorRT LLM‑specific; it is a general MPI+Slurm issue.docs/source/features/attention.md (5)
22-36
: Normalize backend names and casing.Use “TensorRT LLM backend” (brand) and “FlashInfer” (library) consistently; keep class names as-is.
-There are currently three available attention backends: the vanilla backend, the TRT-LLM backend, and the Flashinfer backend. +There are currently three available attention backends: the vanilla backend, the TensorRT LLM backend, and the FlashInfer backend. @@ -In contrast, the Flashinfer backend, `FlashInferAttention`, is performance-optimized... +In contrast, the FlashInfer backend, `FlashInferAttention`, is performance-optimized... @@ -The TRT-LLM backend, `TrtllmAttention`, serves as the default backend... +The TensorRT LLM backend, `TrtllmAttention`, serves as the default backend...
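To make the backend choice concrete, here is a small sketch; it assumes the `attn_backend` constructor argument and the string values named in this comment, with a placeholder model path.

```python
# Minimal sketch (assumption: the PyTorch-backend LLM constructor exposes
# attn_backend and accepts these string values; the model path is a placeholder).
from tensorrt_llm import LLM

# "TRTLLM" is the default backend; "FLASHINFER" or "VANILLA" select the others.
llm = LLM(model="/path/to/model", attn_backend="FLASHINFER")
```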
118-121
: Hyphenation and e.g./i.e. style.-In the `TrtllmAttention` backend, the attention operator supports the packed (i.e. non padded) QKV inputs. +In the `TrtllmAttention` backend, the attention operator supports packed (i.e., non‑padded) QKV inputs.
156-162
: Concise wording.-That kernel is able to apply +That kernel can apply @@ -TensorRT LLM will continue to add (or -enable) additional features in future releases, such as enabling support for IA3. +TensorRT LLM will continue to add (or enable) features in future releases, such as support for IA3.
287-301
: KV cache capitalization and agreement.Capitalize “KV cache” and fix subject–verb agreement.
-TensorRT LLM has a feature called `Cyclic KV Cache`, which treats the kv cache +TensorRT LLM has a feature called `Cyclic KV Cache`, which treats the KV cache @@ -_Note that the cyclic kv cache feature doesn't work with beam searching currently as -the context kv cache are shared across beams. +_Note that the cyclic KV cache feature doesn't work with beam search currently, as +the context KV cache is shared across beams._
305-319
: Subject–verb agreement in StreamingLLM section.-Similar to the cyclic KV cache feature in TensorRT LLM, `attention_window_size` -parameter is used to determine `N`. Different from the cyclic KV cache feature, +Similar to the cyclic KV cache feature in TensorRT LLM, the `attention_window_size` +parameter is used to determine `N`. Unlike the cyclic KV cache feature, @@ -StreamingLLM use the positions within the cache rather than those in the original text. +StreamingLLM uses the positions within the cache rather than those in the original text.docs/source/developer-guide/perf-benchmarking.md (3)
10-16
: Typos and phrasing near intro.
- “overiew” → “overview”
- “the follows” → “the following”
-easier for users to reproduce our officially published [performance overiew](./perf-overview.md#throughput-measurements). `trtllm-bench` provides the follows: +easier for users to reproduce our officially published [performance overview](./perf-overview.md#throughput-measurements). `trtllm-bench` provides the following:
33-38
: Benchmarking clocks guidance: consider adding max-lock option.For reproducibility, many users pin SM/mem clocks during benchmarks. Suggest adding an optional snippet.
### GPU Clock Management @@ -Reset GPU clocks using: +Reset GPU clocks using: @@ sudo nvidia-smi -rgc @@ +Alternatively, to minimize variance, you can lock clocks to their maximum supported values (query first, then set): +```shell +sudo nvidia-smi -q -d SUPPORTED_CLOCKS +sudo nvidia-smi -lgc <max_graphics_clock> +sudo nvidia-smi -lmc <max_mem_clock> +```
68-71
: Subject–verb agreement in “Validated Networks.”-While `trtllm-bench` should be able to run any network that TensorRT LLM supports, the following are the list -that have been validated extensively and is the same listing as seen on the +While `trtllm-bench` should be able to run any network that TensorRT LLM supports, the following list +has been validated extensively and matches the listing on thedocs/source/blogs/tech_blog/blog4_Scaling_Expert_Parallelism_in_TensorRT-LLM.md (6)
42-52
: Remove extra spaces before punctuation and after brand.Minor cleanup for readability.
-In this tech blog, we will introduce the details of the design and implementation to support E2E large-scale EP in TensorRT LLM . This blog post mainly covers the following: +In this tech blog, we introduce the design and implementation to support E2E large‑scale EP in TensorRT LLM. This blog post mainly covers: @@ - * Some preliminary data demonstrating the effectiveness of the current implementation in TensorRT LLM . + * Some preliminary data demonstrating the effectiveness of the current implementation in TensorRT LLM.
60-61
: Trailing space before punctuation.-Even if, in this tech blog, we focus on TensorRT LLM , we believe the core ideas and implementation can also be applied to other inference engines to help the inference performance on NVIDIA GPUs. Also, with the help of the community, we would like to figure out how to better modularize the current TensorRT LLM large-scale EP implementation and make it more easily reusable by the community. +Even if, in this tech blog, we focus on TensorRT LLM, we believe the core ideas and implementation can also be applied to other inference engines to help inference performance on NVIDIA GPUs. Also, with the help of the community, we would like to figure out how to better modularize the current TensorRT LLM large‑scale EP implementation and make it more easily reusable by the community.
224-225
: Figure caption capitalization.Optional style tweak.
-<em>Figure 12: the high-level design of TensorRT LLM large-scale EP</em> +<em>Figure 12: High‑level design of TensorRT LLM large‑scale EP</em>
367-368
: Branding: extra space before colon.-Based on these core utilities, there are two versions of EP Load Balancer in TensorRT LLM : Offline EP Load Balancer and Online EP Load Balancer. +Based on these core utilities, there are two versions of EP Load Balancer in TensorRT LLM: Offline EP Load Balancer and Online EP Load Balancer.
710-711
: Reduce repeated sentence starts (“We … We … We …”).Optional rephrase to avoid repetition.
-We encourage the community to try new ideas and performance validation. We encourage the community to share feedback to help us move fast in this area. We are actively tracking the TensorRT LLM large-scale EP execution in [this](https://github.com/NVIDIA/TensorRT-LLM/issues/4127) GitHub issue to ensure transparency to the community. +We encourage the community to try new ideas and performance validation and to share feedback to help us move fast in this area. We are actively tracking the TensorRT LLM large‑scale EP execution in [this](https://github.com/NVIDIA/TensorRT-LLM/issues/4127) GitHub issue to ensure transparency.
715-716
: Trailing space after brand.-... performance in TensorRT LLM . +... performance in TensorRT LLM.docs/source/blogs/tech_blog/blog6_Llama4_maverick_eagle_guide.md (1)
39-44
: Minor copyedit: “Docker image(s)” capitalization and plurality.
- Use “Docker images” (plural) and capitalize “Docker”.
-#### Option 1. Use weekly release NGC docker image -TensorRT LLM provides weekly release [docker image](...) +#### Option 1. Use weekly release NGC Docker images +TensorRT LLM provides weekly release [Docker images](...)docs/source/features/parallel-strategy.md (1)
45-46
: Name style nit: “LLaMA4” → “Llama 4”.Align with model naming used elsewhere in docs.
-**Best for:** Large-scale MoE models like DeepSeek-V3/R1, LLaMA4, Qwen3
+**Best for:** Large-scale MoE models like DeepSeek-V3/R1, Llama 4, Qwen3

docs/source/blogs/tech_blog/blog1_Pushing_Latency_Boundaries_Optimizing_DeepSeek-R1_Performance_on_NVIDIA_B200_GPUs.md (1)
250-250
: Avoid bare URL; render as a proper link.-https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.md#b200-min-latency +[Best perf practice on DeepSeek-R1 (B200 min-latency)](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.md#b200-min-latency)docs/source/deployment-guide/quick-start-recipe-for-llama4-scout-on-trtllm.md (4)
52-53
: Grammar and phrasing cleanup.-If you want to use latest main branch, you can choose to build from source to install TensorRT LLM, the steps refer to ... +If you want to use the latest main branch, build from source to install TensorRT LLM. See ...
54-57
: Consistent naming: “TRT-LLM” → “TensorRT LLM”.-### Creating the TRT-LLM Server config +### Creating the TensorRT LLM server config - We create a YAML configuration file /tmp/config.yml for the TensorRT LLM Server + We create a YAML configuration file /tmp/config.yml for the TensorRT LLM server
71-75
: Consistent naming in section title.-### Launch the TRT-LLM Server +### Launch the TensorRT LLM server
186-198
: Consistent naming and minor wording.-Start a new terminal on the host to test the TensorRT LLM server you just launched. +Start a new terminal on the host to test the TensorRT LLM server you launched. -After the TRT-LLM server is set up and shows Application startup complete, you can send requests to the server. +After the TensorRT LLM server shows “Application startup complete”, you can send requests.docs/source/overview.md (4)
11-11
: Spelling: “Pytorch” → “PyTorch”.-### 🔥 **Architected on Pytorch** +### 🔥 **Architected on PyTorch**
13-14
: Questionable “NVIDIA Dynamo” link; verify or remove.The GitHub link looks non-canonical. Recommend removing the Dynamo reference unless a correct official link is available.
-TensorRT LLM ... integrates seamlessly with the broader inference ecosystem, including NVIDIA [Dynamo](https://github.com/ai-dynamo/dynamo) and the [Triton Inference Server](https://github.com/triton-inference-server/server). +TensorRT LLM ... integrates seamlessly with the broader inference ecosystem, including the [Triton Inference Server](https://github.com/triton-inference-server/server).
51-53
: Spacing/typo: add missing space after comma.-- **NVIDIA Hopper**: H100, H200,GH200 with FP8 acceleration +- **NVIDIA Hopper**: H100, H200, GH200 with FP8 acceleration
57-58
: Missing space after period.-... generative AI.To get started, refer to the {ref}`quick-start-guide`. +... generative AI. To get started, refer to the {ref}`quick-start-guide`.docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md (4)
49-50
: Grammar tweak and brand/link clarity.-If you want to use latest main branch, you can choose to build from source to install TensorRT LLM, the steps refer to ... +If you want to use the latest main branch, build from source to install TensorRT LLM. See ...
72-92
: Clarify YAML overwrite vs. append.The second here-doc overwrites /tmp/config.yml. Either state that it replaces the previous file or use a different filename to avoid confusion.
-EXTRA_LLM_API_FILE=/tmp/config.yml +EXTRA_LLM_API_FILE=/tmp/config_fp8.yml
296-304
: Grammar: minor tweak.-To benchmark the performance of your TensorRT LLM server you can leverage the built-in “benchmark_serving.py” script. To do this first creating a wrapper bench.sh script. +To benchmark your TensorRT LLM server, use the built-in benchmark_serving.py. First, create a wrapper bench.sh script.
47-48
: Container cadence note reads fine; add stability guidance.Consider adding: “Use monthly release tags (no rc suffix) for reproducibility; use rc tags for latest features.”
docs/source/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.md (1)
60-61
: Wording: “disaggregated LLM inference”.-There are three different approaches to do disaggregation LLM inference ... +There are three approaches to perform disaggregated LLM inference ...docs/source/architecture/checkpoint.md (4)
13-23
: Specify fenced block language to satisfy MD040.
-```
+```text
 NeMo ------------- | HuggingFace ------ ...
27-33
: Concise phrasing.
-TensorRT LLM aims at supporting different sources: +TensorRT LLM aims to support different sources:
9-12
: Improve list grammar.-3. Load the engines to TensorRT LLM model runner and evaluate with different evaluation tasks. +3. Load the engines into the TensorRT LLM model runner and evaluate with different tasks.
221-223
: Heading/title polish.Consider “Build the Checkpoint into a TensorRT Engine”. Current content otherwise looks good.
docs/source/blogs/Falcon180B-H200.md (5)
10-13
: Clarify AWQ vs compute dtype phrasing.-The 141GB of memory on H200, paired with TensorRT LLM running INT4 AWQ with -FP8, allows for the entire large language model to fit on a single GPU, where +The 141GB of memory on H200, paired with TensorRT LLM running INT4 AWQ with +FP8 compute, allows the entire large language model to fit on a single GPU, where
20-21
: Prefer relative image paths for docs builds.Use relative paths so Sphinx/offline builds don’t depend on GitHub query params.
-<img src="https://github.com/NVIDIA/TensorRT-LLM/blob/rel/docs/source/blogs/media/Falcon180B-H200_tps.png?raw=true" ... +<img src="../media/Falcon180B-H200_tps.png" ...-<img src="https://github.com/NVIDIA/TensorRT-LLM/blob/rel/docs/source/blogs/media/Falcon180B-H200_acc.png?raw=true" ... +<img src="../media/Falcon180B-H200_acc.png" ...Also applies to: 36-37
24-29
: Branding in superscripts: “TensorRT-LLM” → “TensorRT LLM”.-TensorRT-LLM v0.7a | +TensorRT LLM v0.7a |-TensorRT-LLM v0.7a vs TensorRT-LLM v0.6a | +TensorRT LLM v0.7a vs TensorRT LLM v0.6a |Also applies to: 41-45, 87-91, 100-105
119-121
: Make the examples link resilient.Use a GitHub link (or relative path within repo) to avoid site-root breaks.
-Similar examples running Llama-70B in TensorRT LLM are published in [examples/models/core/llama](/examples/models/core/llama). +Similar examples running Llama-70B in TensorRT LLM are published in [examples/models/core/llama](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/models/core/llama).
116-118
: Forward-looking statements can go stale.If these versions are already released, replace “will be published” with the concrete release versions/dates, or remove the sentence.
docs/source/blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.md (3)
195-214
: Option name check: kv_cache_dtype vs kv_cache_config.dtype. Recent docs typically use kv_cache_config: { dtype: fp8 }. Verify kv_cache_dtype is still supported or switch to kv_cache_config for consistency.
-kv_cache_dtype: fp8
+kv_cache_config:
+  dtype: fp8
-kv_cache_dtype: fp8 +kv_cache_config: + dtype: fp8
404-405
: Grammar/clarity nits.-To benchmark TensorRT LLM on DeepSeek models with more ISL/OSL combinations, you can use `prepare_dataset.py` to generate the dataset and use similar commands mentioned in the previous section. +To benchmark additional ISL/OSL combinations, generate datasets with `prepare_dataset.py` and reuse the commands above.-We're working on to enable those features by default, +We're working to enable those features by default,Also applies to: 407-409
420-423
: Grammar: OOM tip.-It's possible seeing OOM issues on some cases. Considering reducing `kv_cache_free_gpu_mem_fraction` to a smaller value as a workaround. +You might see OOM in some cases. Consider reducing `kv_cache_free_gpu_mem_fraction` as a workaround.docs/source/installation/build-from-source-linux.md (5)
36-36
: Standardize capitalization: “Step-by-Step”.Use “Step-by-Step” consistently (not “Step-By-Step”).
-[Option 2: Build TensorRT LLM Step-By-Step] +[Option 2: Build TensorRT LLM Step-by-Step]Also applies to: 59-59
5-5
: Version/ABI statement likely to drift; verify against current wheel build matrix.“Pre-built wheel on PyPI is linked against PyTorch 2.7.0+ and uses the new CXX11 ABI” may change. Please confirm the exact PyTorch version(s) and ABI used by the current published wheels.
I can draft a tiny script to parse the wheel metadata and torch dependency from PyPI/NGC if helpful.
117-118
: Tighten grammar and prefer Sphinx cross-ref.Minor grammar polish; optionally switch the raw URL to a {doc} or {ref} if a local page exists.
-For more information on building and running various TensorRT LLM container images, -check <https://github.com/NVIDIA/TensorRT-LLM/tree/main/docker>. +For more information on building and running TensorRT LLM container images, +see <https://github.com/NVIDIA/TensorRT-LLM/tree/main/docker>.
182-190
: Light copy edit and clarity for Dual ABI note.Use plain “before 2.7.0” and break up the long sentence for readability.
-This is particularly useful for avoiding linking issues that may arise with older versions of `torch` (prior to 2.7.0) due to the [Dual ABI support in GCC](https://gcc.gnu.org/onlinedocs/libstdc++/manual/using_dual_abi.html). The `--clean` option removes the build directory before starting a new build. By default, TensorRT LLM uses `cpp/build` as the build directory, but you can specify a different location with the `--build_dir` option. For a complete list of available build options, run `python3 ./scripts/build_wheel.py --help`. +This is useful for avoiding linking issues with `torch` versions before 2.7.0 caused by GCC’s [dual ABI](https://gcc.gnu.org/onlinedocs/libstdc++/manual/using_dual_abi.html). +The `--clean` option removes the build directory before starting a new build. By default, TensorRT LLM uses `cpp/build` as the build directory, but you can specify a different location with `--build_dir`. For all options, run `python3 ./scripts/build_wheel.py --help`.
33-34
: Cross-links to “containers” page: prefer stable Sphinx refs.Consider switching “(containers)” links to a Sphinx {ref} pointing at a labeled target in the containers doc to avoid breakage if the filename or heading changes.
-...[use the pre-built TensorRT LLM Release container images](containers). +...{ref}`use the pre-built TensorRT LLM Release container images <containers>`.Also applies to: 66-68
docs/source/blogs/tech_blog/blog3_Optimizing_DeepSeek_R1_Throughput_on_NVIDIA_Blackwell_GPUs.md (4)
70-71
: Use stable links or commit pins for code references.The relative link to attention.py can drift. Consider pinning to a commit or using a permalinker.
-see [MLA](../../../../tensorrt_llm/_torch/modules/attention.py). +see [MLA](https://github.com/NVIDIA/TensorRT-LLM/blob/<commit-sha>/tensorrt_llm/_torch/modules/attention.py).
150-161
: Present CUDA Graph padding config as YAML; fix “PyTorch” capitalization.Inline backslash-newlines are hard to read. Show a minimal YAML snippet and fix capitalization.
-Users can opt-out the CUDA Graph padding feature to see the perf benefits, by setting the `cuda_graph_config:\n enable_padding: False`, see API here [Pytorch backend config](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/_torch/pyexecutor/config.py#L41) +Users can opt out of CUDA Graph padding by setting: + +```yaml +cuda_graph_config: + enable_padding: false +``` +See the [PyTorch backend config](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/_torch/pyexecutor/config.py#L41).
152-152
: Unify list marker style to satisfy markdownlint (MD004).Switch asterisks to dashes to match the rest of the doc.
-* CUDA Graph +- CUDA Graph @@ -* Overlap Scheduler: +- Overlap Scheduler:Also applies to: 162-162
164-165
: Minor grammar and emphasis tweak.-Showed a **4% E2E performance impact** and should generally **always be used**. +Shows a **4% E2E performance impact** and should generally be enabled.docs/source/blogs/tech_blog/blog2_DeepSeek_R1_MTP_Implementation_and_Optimization.md (2)
110-111
: Tone/branding polish and GPU naming.Avoid “very” and keep branding consistent.
-Attention is also a very important component in supporting MTP inference. The changes are mainly in the attention kernels for the generation phase. ... TensorRT LLM will use the fp8 flashMLA generation kernel on Hopper GPU, while using TRTLLM customized attention kernels on Blackwell for better performance. +Attention is also an important component for MTP inference. The changes are mainly in the attention kernels for the generation phase. ... TensorRT LLM uses the FP8 flashMLA generation kernel on Hopper, and TensorRT LLM–customized attention kernels on Blackwell for better performance.
248-248
: Tighten acknowledgment sentence.Minor clarity and punctuation.
-This was a remarkable cross-team effort to support and optimize MTP in TensorRT LLM. We would like to extend our gratitude to everyone who contributed to making this possible, as it involved a typical system/algorithm co-design approach spanning multiple technical layers—including kernel optimization, runtime enhancements, algorithmic improvements, and performance measurement & analysis. And a special thanks goes to the DeepSeek team for developing the MTP method, which lays down the foundation of this blog. +This was a cross-team effort to support and optimize MTP in TensorRT LLM. We thank everyone involved across kernel optimization, runtime enhancements, algorithmic improvements, and performance measurement and analysis. Special thanks to the DeepSeek team for developing MTP, which underpins this blog.
a9db493 to 5a20225
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (7)
docs/source/blogs/tech_blog/blog6_Llama4_maverick_eagle_guide.md (1)
71-77
: Environment variable is not exported to `trtllm-serve`. The assignment occurs on a separate line; it won't be in the child process environment. Inline it or export before invoking.
- TRT_LLM_DISABLE_LOAD_WEIGHTS_IN_PARALLEL=True \ - trtllm-serve /config/models/maverick \ + export TRT_LLM_DISABLE_LOAD_WEIGHTS_IN_PARALLEL=True && \ + trtllm-serve /config/models/maverick \docs/source/blogs/Falcon180B-H200.md (1)
20-28
: Fix likely-broken image links (use raw or relative paths instead of blob/rel).The URLs point to blob/rel which likely doesn’t exist. Use raw/main or repo-relative paths.
-<img src="https://github.com/NVIDIA/TensorRT-LLM/blob/rel/docs/source/blogs/media/Falcon180B-H200_tps.png?raw=true" alt="Falcon-180B performance comparison" width="450" height="auto"> +<img src="https://raw.githubusercontent.com/NVIDIA/TensorRT-LLM/main/docs/source/blogs/media/Falcon180B-H200_tps.png" alt="Falcon-180B performance comparison" width="450" height="auto">Also apply the same fix to lines 36 and 71, 96.
docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md (1)
38-39
: Use a published NGC container tag: replacerelease:1.0.0rc6
withrelease:1.0.0rc4
(latest published tag) in the snippets at lines 38–39 and 47. (catalog.ngc.nvidia.com)docs/source/developer-guide/perf-benchmarking.md (1)
113-123
: Correct “logits” to “token IDs” in dataset schema and tips.
input_ids
are token IDs, not logits. This is a user-facing correctness issue.Apply:
-| `input_ids` | Y* | List[Integer] | List of logits that make up the request prompt. | +| `input_ids` | Y* | List[Integer] | Token IDs that make up the request prompt. | @@ -\* Specifying `prompt` or `input_ids` is required. However, you can not have both prompts and logits (`input_ids`) +\* Specifying `prompt` or `input_ids` is required. However, you cannot have both `prompt` and `input_ids` defined at the same time. If you specify `input_ids`, the `prompt` entry is ignored for request generation. @@ -- Entries which contain logits. +- Entries which contain token IDs.Also applies to: 134-139
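For clarity, here is a small sketch of what a dataset file with these fields could look like. Only `prompt` and `input_ids` come from the table above; the `task_id` and `output_tokens` field names and the token values are assumptions for illustration.

```python
# Minimal sketch: write two JSON-lines dataset entries, one using "prompt" and
# one using "input_ids" (token IDs). Field names other than prompt/input_ids
# are illustrative assumptions, not confirmed schema.
import json

entry_with_prompt = {"task_id": 0, "prompt": "Explain KV caching in one sentence.", "output_tokens": 128}
entry_with_token_ids = {"task_id": 1, "input_ids": [101, 2054, 2003, 1037, 17953, 2239], "output_tokens": 128}

with open("dataset.jsonl", "w") as f:
    for entry in (entry_with_prompt, entry_with_token_ids):
        f.write(json.dumps(entry) + "\n")
```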
docs/source/features/paged-attention-ifb-scheduler.md (1)
49-54
: Typo: “re-buliding” → “rebuilding.”User-facing doc.
Apply:
-You can tune it using the runtime `max_num_tokens` without re-buliding the engine. +You can tune it using the runtime `max_num_tokens` without rebuilding the engine.docs/source/deployment-guide/quick-start-recipe-for-llama4-scout-on-trtllm.md (1)
71-89
: Normalize branding: replace “TRT-LLM” with “TensorRT LLM.”Consistent with PR objectives.
-### Launch the TRT-LLM Server +### Launch the TensorRT LLM Server @@ -Below is an example command to launch the TRT-LLM server... +Below is an example command to launch the TensorRT LLM server... @@ -After the TRT-LLM server is set up... +After the TensorRT LLM server is set up... @@ -Here is an example response, showing that the TRT-LLM server returns... +Here is an example response, showing that the TensorRT LLM server returns... @@ -Sample TensorRT LLM serving benchmark output. +Sample TensorRT LLM serving benchmark output.Also applies to: 186-207, 317-317
docs/source/features/attention.md (1)
22-33
: Use consistent “FlashInfer” naming and correct the example value.-There are currently three available attention backends: the vanilla backend, the TRT-LLM backend, and the Flashinfer backend. +There are currently three available attention backends: the vanilla backend, the TRT-LLM backend, and the FlashInfer backend. -You can specify the desired attention backend using `PyTorchConfig.attn_backend`. For instance, to utilize the Flashinfer backend, you can pass `attn_backend="flashinfer"` to the `LLM` constructor as follows: `LLM(attn_backend="flashinfer")`. This will enable the use of the Flashinfer backend for your model. +You can specify the desired attention backend via the `attn_backend` parameter of the `LLM` constructor (which sets `PyTorchConfig.attn_backend`). For example, to use the FlashInfer backend, instantiate with: +```python +LLM(attn_backend="FLASHINFER") +``` -In contrast, the Flashinfer backend, `FlashInferAttention`, is performance-optimized and supports both inflight batching and paged KV cache. It also includes the following advanced features: +In contrast, the FlashInfer backend, `FlashInferAttention`, is performance-optimized and supports both inflight batching and paged KV cache. It also includes the following advanced features:
♻️ Duplicate comments (16)
docs/source/commands/trtllm-serve/run-benchmark-with-trtllm-serve.md (1)
3-3
: Fix spelling and article usage (“OpenAI-compatible”).-TensorRT LLM provides the OpenAI-compatiable API via `trtllm-serve` command. +TensorRT LLM provides an OpenAI-compatible API via the `trtllm-serve` command.docs/source/dev-on-cloud/build-image-to-dockerhub.md (1)
4-4
: Grammar and naming polish (duplicate of prior feedback).Apply article/capitalization fixes and use product name.
-When you develop trt-llm on cloud platform such as runpod, you may need to provide a docker image for the platform. So you firstly need to upload the image to dockerhub. +When you develop TensorRT LLM on a cloud platform such as RunPod, you may need to provide a Docker image for the platform. First, upload the image to Docker Hub.docs/source/quick-start-guide.md (2)
91-92
: Provide explicit link texts and fix the examples link target.-To learn more about the LLM API, check out the [](llm-api/index) and [](examples/llm_api_examples). +To learn more about the LLM API, check out the [LLM API docs](llm-api/index) and the [Deployment Guide](deployment-guide/index.rst).
102-105
: Fix “Deployment Guide” link to the correct target.-- **[Deployment Guide](examples/llm_api_examples)** - Comprehensive examples for deploying LLM inference in various scenarios +- **[Deployment Guide](deployment-guide/index.rst)** - Comprehensive examples for deploying LLM inference in various scenariosdocs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md (2)
51-54
: Branding consistency: replace “TRT-LLM Server” with “TensorRT LLM Server” in prose.Keep CLI/tool names as-is (e.g., trtllm-serve).
-### Creating the TRT-LLM Server config +### Creating the TensorRT LLM Server config-### Launch the TRT-LLM Server +### Launch the TensorRT LLM Server-Start a new terminal on the host to test the TensorRT LLM server you just launched. +Start a new terminal on the host to test the TensorRT LLM server you just launched.-After the TRT-LLM server is set up and shows Application startup complete, you can send requests to the server. +After the TensorRT LLM server is set up and shows Application startup complete, you can send requests to the server.-Here is an example response, showing that the TRT-LLM server returns “New York ... +Here is an example response, showing that the TensorRT LLM server returns “New York ...Also applies to: 94-96, 209-210, 219-230
7-8
: Fix container narrative to match commands.Intro says start with PyTorch container, but commands use the TensorRT LLM container. Align text with actual steps.
-The guide is intended for developers ... —starting with the PyTorch container from NGC, then installing TensorRT LLM for model serving, FlashInfer for optimized CUDA kernels, and ModelOpt to enable FP8 and NVFP4 quantized execution. +The guide is intended for developers ... —starting with the TensorRT LLM container from NGC (or building from source), using FlashInfer for optimized CUDA kernels, and ModelOpt to enable FP8 and NVFP4 quantized execution.Also applies to: 29-40
docs/source/overview.md (2)
7-7
: Link TensorRT LLM to the project docs, not generic TensorRT.-[TensorRT LLM](https://developer.nvidia.com/tensorrt) +[TensorRT LLM](https://nvidia.github.io/TensorRT-LLM/)
21-23
: Fix broken URL (leading dot).-[World-record inference performance on Blackwell GPUs](.https://developer.nvidia.com/blog/nvidia-blackwell-delivers-world-record-deepseek-r1-inference-performance/) +[World-record inference performance on Blackwell GPUs](https://developer.nvidia.com/blog/nvidia-blackwell-delivers-world-record-deepseek-r1-inference-performance/)docs/source/architecture/checkpoint.md (1)
171-173
: Fix typos and agreement in note (out_fature, follows).- The linear weights in TensorRT LLM checkpoint always follows (`out_feature`, `in_feature`) shape, whereas some quantized linear in TensorRT LLM implemented by plugin may use (`in_feature`, `out_fature`) shape. The `trtllm-build` command adds a transpose operation to post-process it. + The linear weights in a TensorRT LLM checkpoint always follow (`out_feature`, `in_feature`) shape, whereas some quantized linear layers implemented by plugins may use (`in_feature`, `out_feature`) shape. The `trtllm-build` command adds a transpose operation to post‑process it.docs/source/blogs/tech_blog/blog8_Scaling_Expert_Parallelism_in_TensorRT-LLM_part2.md (1)
262-272
: Document the default for stream_interval and where it's consumed. Add "Default: 1" and briefly note tokenizer/executor consumption; this was requested previously.
Apply:
```yaml
stream_interval: 4
```
Default: 1. The value is consumed by the tokenizer (to batch detokenization) and the Python executor to emit streamed responses every N iterations.
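For context, this is roughly how a client observes the cadence that `stream_interval` controls when talking to the OpenAI-compatible endpoint; a minimal sketch in which the server URL and model name are placeholders.

```python
# Minimal sketch: stream tokens from a trtllm-serve OpenAI-compatible endpoint.
# With stream_interval: 4, streamed chunks arrive roughly every 4 generation
# iterations. The base_url and model name below are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")
stream = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```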
docs/source/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.md (1)
144-145
: Replace “NIXL” with “NCCL” and adjust the recommendation.
This was flagged earlier; please correct both mentions. Apply:
-Currently, mainstream communication protocols—MPI, UCX, and NIXL—are all supported by TensorRT LLM, and the underlying communication protocols utilize RDMA / NVLink. Currently, we recommend using UCX and NIXL backends,
+Currently, mainstream communication protocols—MPI, UCX, and NCCL—are supported by TensorRT LLM, and the underlying communication protocols utilize RDMA/NVLink. Currently, we recommend using the UCX and NCCL backends,
docs/source/blogs/tech_blog/blog2_DeepSeek_R1_MTP_Implementation_and_Optimization.md (2)
240-245
: Fix “Ealge2” typo and standardize naming.Repeat of a previously raised issue.
Apply:
-such as Ealge2 and Eagle3, to increase the acceptance rate. +such as Eagle‑2 and Eagle‑3, to increase the acceptance rate. @@ -Since TensorRT LLM already has [Eagle-3 support](https://github.com/NVIDIA/TensorRT-LLM/pull/3035) now, in the future, we also want to train an Eagle3 head to support DeepSeek-V3/R1+Eagle3 to achieve better speedup. +Since TensorRT LLM already has [Eagle‑3 support](https://github.com/NVIDIA/TensorRT-LLM/pull/3035), in the future we also want to train an Eagle‑3 head to support DeepSeek‑V3/R1 + Eagle‑3 to achieve better speedup.
79-80
: Fix MTP source links and pin to a commit (consistency with later section).Both links currently point to the same line on main. Use the same pinned SHA you already use below.
Apply:
-TensorRT LLM has two different paths for MTP, one for [MTP Vanilla](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/_torch/speculative/mtp.py#L1047) and another for [MTP Eagle](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/_torch/speculative/mtp.py#L1047). MTP Eagle is the default path for DeepSeek-V3 and DeepSeek-R1 models. +TensorRT LLM has two different paths for MTP, one for [MTP Vanilla](https://github.com/NVIDIA/TensorRT-LLM/blob/338744fba6a91147b739b7f02d19b37bc19aa17a/tensorrt_llm/_torch/speculative/mtp.py#L326) and another for [MTP Eagle](https://github.com/NVIDIA/TensorRT-LLM/blob/338744fba6a91147b739b7f02d19b37bc19aa17a/tensorrt_llm/_torch/speculative/mtp.py#L1047). MTP Eagle is the default path for DeepSeek‑V3 and DeepSeek‑R1 models.docs/source/installation/build-from-source-linux.md (1)
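For reference, a minimal sketch of enabling MTP from the LLM API; it assumes the `MTPDecodingConfig` helper and its `num_nextn_predict_layers` field exist as named here, and the model path is a placeholder.

```python
# Minimal sketch (assumptions: MTPDecodingConfig and its
# num_nextn_predict_layers field; the model path is a placeholder).
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import MTPDecodingConfig

spec_config = MTPDecodingConfig(num_nextn_predict_layers=3)  # number of MTP modules to use
llm = LLM(model="/path/to/DeepSeek-R1", speculative_config=spec_config)
```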
12-13
: Fix typo and streamline phrasing; ensure anchor names remain valid.“TensortRT-LLM” → “TensorRT LLM”, reduce redundancy, and keep existing anchors.
Apply:
-If you intend to build any TensortRT-LLM artifacts, such as any of the container images (note that there exist pre-built [develop](#build-from-source-tip-develop-container) and [release](#build-from-source-tip-release-container) container images in NGC), or the TensorRT LLM Python wheel, you first need to clone the TensorRT LLM repository: +If you intend to build any TensorRT LLM artifacts—such as container images (pre-built [develop](#build-from-source-tip-develop-container) and [release](#build-from-source-tip-release-container) images are available on NGC)—or the TensorRT LLM Python wheel, first clone the TensorRT LLM repository:docs/source/blogs/tech_blog/blog3_Optimizing_DeepSeek_R1_Throughput_on_NVIDIA_Blackwell_GPUs.md (1)
32-39
: Fix Markdown table (tabs, trailing pipes) and grammar.Apply:
-The accuracy score of common dataset on this FP4 checkpoint and TensorRT LLM implementations are: +The accuracy on common datasets for this FP4 checkpoint with TensorRT LLM is: @@ -| Precision | GPQA Diamond | MATH-500 -| :-- | :-- | :-- | -| TensorRT LLM FP8 | 0.697 | 0.954 | -| TensorRT LLM FP4 | 0.705 | 0.96 | +| Precision | GPQA Diamond | MATH-500 | +| :--------------- | :----------- | :------- | +| TensorRT LLM FP8 | 0.697 | 0.954 | +| TensorRT LLM FP4 | 0.705 | 0.960 | @@ -** Note there are some run-to-run variance for these evaluations, so FP4 data is slight higher here. We think FP4 has comparable accuracy with FP8 on these datasets. +Note: There is run-to-run variance in these evaluations, so FP4 appears slightly higher here. FP4 accuracy is comparable to FP8 on these datasets.docs/source/deployment-guide/quick-start-recipe-for-llama4-scout-on-trtllm.md (1)
269-270
: Accidental external link to bench.sh domain.Remove hyperlink; use plain filename.
-...first creating a wrapper [bench.sh](http://bench.sh) script. +...first create a wrapper script named `bench.sh`.
🧹 Nitpick comments (75)
docs/source/features/auto_deploy/auto-deploy.md (7)
12-12
: Tighten caption wording (“relation with” → “relationship to”; avoid doubled “LLM”).Apply:
-<sub><em>AutoDeploy overview and relation with TensorRT LLM's LLM API</em></sub> +<sub><em>AutoDeploy overview and relationship to the TensorRT LLM API</em></sub>
14-14
: Unify KV cache terminology and trim the long sentence.
Use “KV caches” consistently (you have “KV-caches” here and “KV-cache” below) and split for readability.-AutoDeploy provides an alternative method for deploying models using the LLM API without requiring code changes to the source model (for example, Hugging Face Transformers models) or manual implementation of inference optimizations, such as KV-caches, multi-GPU parallelism, or quantization. Instead, AutoDeploy extracts a computation graph from the source model and applies inference optimizations through a series of automated graph transformations. AutoDeploy generates an inference-optimized graph that can be directly executed in the TensorRT LLM PyTorch runtime and leverages various runtime optimizations including in-flight batching, paging, and overlap scheduling. +AutoDeploy provides an alternative method for deploying models using the LLM API without requiring code changes to the source model (for example, Hugging Face Transformers models) or manual implementation of inference optimizations such as KV caches, multi‑GPU parallelism, or quantization. Instead, AutoDeploy extracts a computation graph from the source model and applies inference optimizations through a series of automated graph transformations. It then generates an inference‑optimized graph that can be executed in the TensorRT LLM PyTorch runtime and leverages runtime optimizations including in‑flight batching, paging, and overlap scheduling.
51-51
: Clarify the “compile backends” reference; “torch-opt” may be ambiguous.
If you mean torch.compile backends (e.g., Inductor), prefer the generic phrasing below or name the supported backends explicitly.-The exported graph then undergoes a series of automated transformations, including graph sharding, KV-cache insertion, and GEMM fusion, to optimize model performance. After these transformations, the graph is compiled using one of the supported compile backends (like `torch-opt`), followed by deploying it via the TensorRT LLM runtime. +The exported graph then undergoes a series of automated transformations, including graph sharding, KV‑cache insertion, and GEMM fusion, to optimize model performance. After these transformations, the graph is compiled using one of the supported torch.compile backends, then deployed via the TensorRT LLM runtime.If specific backends are supported, consider linking to a definitive list in your docs.
28-28
: Update lingering “TRT-LLM” to “TensorRT LLM” for branding consistency.
This PR centers on the rename; keep this page fully aligned.-AutoDeploy is included with the TRT-LLM installation. +AutoDeploy is included with the TensorRT LLM installation.
34-34
: Adjust link text to match the new name (keep URL as-is if path hasn’t changed).-You can refer to [TRT-LLM installation guide](../../installation/linux.md) for more information. +You can refer to the [TensorRT LLM installation guide](../../installation/linux.md) for more information.
38-38
: Fix product capitalization: “Llama” (not “LLama”).-You are now ready to run an in-framework LLama Demo. +You are now ready to run an in‑framework Llama demo.
40-40
: Punctuation/grammar tweak (“script, Checkpoints” → “script. Checkpoints”).-The general entry point for running the AutoDeploy demo is the `build_and_run_ad.py` script, Checkpoints are loaded directly from Huggingface (HF) or a local HF-like directory: +The general entry point for running the AutoDeploy demo is the `build_and_run_ad.py` script. Checkpoints are loaded directly from Hugging Face (HF) or a local HF‑like directory:docs/source/features/checkpoint-loading.md (1)
19-19
: Fix possessive usage (“model definition”).Use “model definition” (no apostrophe) for correct grammar and consistency with other docs.
-**Weight Mappers**: Map and transform loaded weights to TensorRT LLM model's definition +**Weight Mappers**: Map and transform loaded weights to the TensorRT LLM model definitiondocs/source/examples/customization.md (1)
5-5
: Tighten wording and fix sentence fragment.Minor grammar/style fixes and clearer phrasing.
-TensorRT LLM can quantize the Hugging Face model automatically. By setting the appropriate flags in the `LLM` instance. For example, to perform an Int4 AWQ quantization, the following code triggers the model quantization. Please refer to complete list of [supported flags](https://nvidia.github.io/TensorRT-LLM/_modules/tensorrt_llm/quantization/mode.html#QuantAlgo) and acceptable values. +TensorRT LLM can automatically quantize Hugging Face models by setting the appropriate flags on the `LLM` instance. For example, to perform INT4 AWQ quantization, use the following code. Please refer to the complete list of [supported flags](https://nvidia.github.io/TensorRT-LLM/_modules/tensorrt_llm/quantization/mode.html#QuantAlgo) and acceptable values.docs/source/architecture/overview.md (2)
48-48
: Pluralize “CUDA Graphs”.Small grammar tweak.
-TensorRT LLM enhances inference throughput and reduces latency by integrating a suite of runtime optimizations, including CUDA Graph, [Overlap Scheduler](../features/overlap-scheduler.md), [Speculative decoding](../features/speculative-decoding.md), etc. +TensorRT LLM enhances inference throughput and reduces latency by integrating a suite of runtime optimizations, including CUDA Graphs, [Overlap Scheduler](../features/overlap-scheduler.md), [Speculative decoding](../features/speculative-decoding.md), etc.
54-54
: Remove article before percentage (“up to 22%”).Minor grammar fix.
-This optimization has a significant impact, demonstrating up to a 22% end-to-end throughput increase on certain models and hardware. +This optimization has a significant impact, demonstrating up to 22% end-to-end throughput increase on certain models and hardware.docs/source/commands/trtllm-serve/run-benchmark-with-trtllm-serve.md (1)
22-22
: Pin container to a published NGC tag for reproducibility.Per team guidance, reference only published NGC tags; avoid untagged “latest”.
-TensorRT LLM distributes the pre-built container on [NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags). +TensorRT LLM distributes the pre-built container on the [NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release/tags). Use a published tag for reproducibility (for example, :24.08); do not rely on an untagged image.Example command:
-docker run --rm --ipc host -p 8000:8000 --gpus all -it nvcr.io/nvidia/tensorrt-llm/release +docker run --rm --ipc host -p 8000:8000 --gpus all -it nvcr.io/nvidia/tensorrt-llm/release:<published-tag>docs/source/features/multi-modality.md (3)
3-3
: Tighten phrasing (“various” instead of “a variety of”).Concise style; no change in meaning.
-TensorRT LLM supports a variety of multimodal models, enabling efficient inference with inputs beyond just text. +TensorRT LLM supports various multimodal models, enabling efficient inference with inputs beyond just text.
17-17
: Streamline wording (“includes key optimizations”).-TensorRT LLM incorporates some key optimizations to enhance the performance of multimodal inference: +TensorRT LLM includes key optimizations to enhance multimodal inference performance:
35-35
: Use “LLM API” (no hyphen) and make the example reference resolvable.Consistent terminology and fix the nonstandard “source:” reference.
-Quickly try out TensorRT LLM's multimodal support using our `LLM-API` and a ready-to-run [example](source:examples/llm-api/quickstart_multimodal.py): +Quickly try out TensorRT LLM's multimodal support using our LLM API and a ready-to-run example (see `examples/llm-api/quickstart_multimodal.py`):docs/source/dev-on-cloud/build-image-to-dockerhub.md (1)
3-6
: Use “Docker Hub” branding and keep rename consistent in headings.Rename “DockerHub” → “Docker Hub” and keep “TensorRT LLM” as updated.
-# Build the TensorRT LLM Docker Image +# Build the TensorRT LLM Docker Image -## Build the TensorRT LLM Docker Image and Upload to DockerHub +## Build the TensorRT LLM Docker Image and Upload to Docker Hubdocs/source/blogs/H200launch.md (1)
58-58
: Tighten the closing sentence.Minor readability improvement.
-H200 is the newest addition to NVIDIA’s data center GPU portfolio. To maximize that compute performance, H200 is the first GPU with HBM3e memory with 4.8TB/s of memory bandwidth, a 1.4X increase over H100. H200 also expands GPU memory capacity nearly 2X to 141 gigabytes (GB). The combination of faster and larger HBM memory accelerates performance of LLM model inference performance with faster throughput and tokens per second. These results are measured and preliminary, more updates expected as optimizations for H200 continue with TensorRT LLM. +H200 is the newest addition to NVIDIA’s data center GPU portfolio. To maximize compute performance, H200 is the first GPU with HBM3e, providing 4.8 TB/s of memory bandwidth (1.4× over H100) and expanding capacity nearly 2× to 141 GB. The combination of faster and larger HBM accelerates LLM inference throughput and tokens/s. These results are preliminary; more updates are expected as TensorRT LLM optimizations continue.docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md (5)
33-44
: Verify NGC container tag exists and prefer published images.The doc uses nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc6. Please confirm this tag is publicly available on NGC and, per doc standards, prefer a published non-rc tag when targeting a release branch.
If needed, replace with the latest published release tag and add a note explaining rc vs monthly releases.
53-54
: Minor grammar fix.-If you want to use latest main branch, you can choose to build from source to install TensorRT LLM, the steps refer to https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html +If you want to use the latest main branch, you can build from source to install TensorRT LLM. See: https://nvidia.github.io/TensorRT-LLM/latest/installation/build-from-source-linux.html
55-58
: Normalize “TRT-LLM” vs “TensorRT LLM” usage.The doc mixes “TRT-LLM” and “TensorRT LLM.” Consider using “TensorRT LLM” for product name and reserving “trtllm-serve” for the CLI.
-### Creating the TRT-LLM Server config +### Create the TensorRT LLM server config -We create a YAML configuration file /tmp/config.yml for the TensorRT LLM Server and populate it with the following recommended performance settings. +Create a YAML configuration file at /tmp/config.yml for the TensorRT LLM server and populate it with the following recommended performance settings. -### Launch the TRT-LLM Server +### Launch the TensorRT LLM server -Start a new terminal on the host to test the TensorRT LLM server you just launched. +Start a new terminal on the host to test the TensorRT LLM server you just launched. -After the TRT-LLM server is set up and shows Application startup complete, you can send requests to the server. +After the TensorRT LLM server is set up and shows “Application startup complete,” you can send requests to the server.Also applies to: 72-75, 187-188, 197-199
181-181
: Fix docs URL to the reference page.Add “latest” to be consistent with other links.
-See the [TorchLlmArgs](https://nvidia.github.io/TensorRT-LLM/llm-api/reference.html#tensorrt_llm.llmapi.TorchLlmArgs) class +See the [TorchLlmArgs](https://nvidia.github.io/TensorRT-LLM/latest/llm-api/reference.html#tensorrt_llm.llmapi.TorchLlmArgs) class
239-246
: Add language to fenced block and minor grammar in benchmarking section.Also avoid linking to “http://bench.sh”; just show the filename.
-``` +```shell MODEL_PATH=nvidia/Llama-3.3-70B-Instruct-FP8 @@```diff -To benchmark the performance of your TensorRT LLM server you can leverage the built-in “benchmark_serving.py” script. To do this first creating a wrapper [bench.sh] script. +To benchmark the performance of your TensorRT LLM server, use the built-in “benchmark_serving.py” script. First, create a wrapper script named `bench.sh`.
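For context on the wrapper suggestion, a minimal bench.sh sketch is shown below. It reuses the MODEL_PATH value from the quoted snippet; the benchmark_serving.py flags (--model, --host, --port, --dataset-name, --num-prompts) are assumptions based on that script's common interface and should be checked against the deployment guide before use.

```bash
#!/usr/bin/env bash
# Hypothetical bench.sh wrapper; flag names are assumptions and may need
# adjusting to match the benchmark_serving.py shipped with the container.
MODEL_PATH=${MODEL_PATH:-nvidia/Llama-3.3-70B-Instruct-FP8}

python benchmark_serving.py \
  --model "${MODEL_PATH}" \
  --host localhost \
  --port 8000 \
  --dataset-name random \
  --num-prompts 128
```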
docs/source/blogs/H100vsA100.md (2)
5-7
: Consistent product name and concise wording.Use “TensorRT LLM” instead of “TRT-LLM” in prose and simplify “is able to”.
-# H100 has 4.6x A100 Performance in TensorRT LLM, achieving 10,000 tok/s at 100ms to first token +# H100 has 4.6x A100 performance in TensorRT LLM, achieving 10,000 tok/s at 100 ms to first token -TensorRT LLM evaluated on both Hopper and Ampere shows **H100 FP8 is up to 4.6x max throughput and 4.4x faster 1st token latency than A100**. H100 FP8 is able to achieve over 10,000 output tok/s at peak throughput for 64 concurrent requests, while maintaining a 1st token latency of 100ms. For min-latency applications, TRT-LLM H100 can achieve less than 10ms to 1st token latency. +TensorRT LLM evaluated on both Hopper and Ampere shows **H100 FP8 is up to 4.6× max throughput and 4.4× faster 1st token latency than A100**. H100 FP8 can achieve over 10,000 output tok/s at peak throughput for 64 concurrent requests, while maintaining a 1st token latency of 100 ms. For min-latency applications, TensorRT LLM on H100 can achieve <10 ms to first token.
31-31
: Add terminal punctuation.Optional readability tweak.
-The full data behind these charts & tables and including larger models with higher TP values can be found in TensorRT LLM's [Performance Documentation](https://nvidia.github.io/TensorRT-LLM/latest/performance/perf-overview.html) +The full data behind these charts and tables, including larger models with higher TP values, can be found in TensorRT LLM's [Performance Documentation](https://nvidia.github.io/TensorRT-LLM/latest/performance/perf-overview.html).docs/source/architecture/add-model.md (4)
9-15
: Polish “such as” lists; remove extraneous commas.-- Low-level functions, for example, `concat`, `add`, and `sum`. -- Basic layers, such as, `Linear` and `LayerNorm`. -- High-level layers, such as, `MLP` and `Attention`. -- Base class for typical decoder-only models, such as, `DecoderModelForCausalLM`. +- Low-level functions, for example `concat`, `add`, and `sum`. +- Basic layers, such as `Linear` and `LayerNorm`. +- High-level layers, such as `MLP` and `Attention`. +- Base class for typical decoder-only models, such as `DecoderModelForCausalLM`.
16-18
: Minor clarity and punctuation.-1. Create a model directory in `tensorrt_llm/models`, for example `my_model`. -2. Write a `model.py` with TensorRT LLM's APIs +1. Create a model directory in `tensorrt_llm/models`, for example, `my_model`. +2. Write a `model.py` with TensorRT LLM APIs.
55-56
: Fix grammar in weight conversion intro.-The weights from source framework need to be converted and bound to the new added TensorRT LLM model. Here is an example of converting HuggingFace weights: +Weights from the source framework need to be converted and bound to the newly added TensorRT LLM model. Here is an example of converting Hugging Face weights:
95-95
: Fix link formatting.-It's recommended to read the workflow[./workflow.md] and checkpoint[./checkpoint.md] documents for more details. +It's recommended to read the [workflow](./workflow.md) and [checkpoint](./checkpoint.md) documents for more details.docs/source/installation/linux.md (3)
25-29
: Fix typo in tip block ("preqrequisites" → "prerequisites").Apply:
- Instead of manually installing the preqrequisites as described + Instead of manually installing the prerequisites as described
49-49
: Polish phrasing for Known limitations lead-in.-There are some known limitations when you pip install pre-built TensorRT LLM wheel package. +There are some known limitations when you install the pre-built TensorRT LLM wheel via pip.
53-55
: Clarify and correct Slurm/MPI guidance and capitalization.- If you encounter an error while running TensorRT LLM in a Slurm-managed cluster, you need to reconfigure the MPI installation to work with Slurm. - The setup methods depends on your slurm configuration, pls check with your admin. This is not a TensorRT LLM specific, rather a general mpi+slurm issue. + If you encounter an error while running TensorRT LLM in a Slurm-managed cluster, reconfigure your MPI installation to work with Slurm. + The setup method depends on your Slurm configuration; please check with your administrator. This is not TensorRT LLM–specific, but a general MPI+Slurm issue.docs/source/quick-start-guide.md (1)
11-12
: Pin a published NGC tag instead of the floating default. To avoid ambiguity and align with docs policy, replace .../tensorrt-llm/release with an explicit published tag from NGC (for example, .../tensorrt-llm/release:<YYYY.MM>). Please confirm the correct current tag on NGC before updating.
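For illustration, a pinned invocation might look like the following sketch; it mirrors the docker command quoted earlier in this review, and the tag is a placeholder to be replaced with one actually listed on the NGC tags page.

```bash
# Sketch only: substitute a tag that is actually published on NGC.
docker run --rm --ipc host -p 8000:8000 --gpus all -it \
  nvcr.io/nvidia/tensorrt-llm/release:<published-tag>
```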
5-5
: Grammar: “used to reduces” → “used to reduce”.-Quantization is a technique used to reduces memory footprint and computational cost +Quantization is a technique used to reduce memory footprint and computational cost
7-7
: Concise wording (“offers a variety of” → “provides”).-TensorRT LLM offers a variety of quantization recipes to optimize LLM inference. +TensorRT LLM provides several quantization recipes to optimize LLM inference.
88-91
: Minor terminology/style fixes (“multi-modal”, spacing, hyphenation).-The vision component of multi-modal models(BLIP2-OPT/BLIP2-T5/LLaVA/VILA/Nougat) uses FP16 by default. +The vision component of multimodal models (BLIP2-OPT/BLIP2-T5/LLaVA/VILA/Nougat) uses FP16 by default.-FP8 block wise scaling GEMM kernels for sm100 are using MXFP8 recipe +FP8 block-wise scaling GEMM kernels for sm100 use the MXFP8 recipeAlso applies to: 104-105
62-66
: Standardize section titles (“Support Matrix”).-## Model Supported Matrix +## Model Support Matrix-## Hardware Support Matrix +## Hardware Support MatrixAlso applies to: 94-103
docs/source/features/parallel-strategy.md (2)
72-81
: Split TP and DP examples to avoid conflicting keys in a single YAML.Current here-doc defines two alternative configs in one file; only the last values apply. Provide two separate snippets or comments clarifying one-at-a-time usage.
-cat <<EOF > parallel_config.yaml -# TP-8 -tensor_parallel_size: 8 -enable_attention_dp: false # default -# DP-8 -tensor_parallel_size: 8 -enable_attention_dp: true -EOF +cat <<EOF > tp_parallel_config.yaml +tensor_parallel_size: 8 +enable_attention_dp: false +EOF + +cat <<EOF > dp_parallel_config.yaml +tensor_parallel_size: 8 +enable_attention_dp: true +EOF
173-176
: List style consistency in References.Use a consistent unordered list marker (e.g., asterisks) per markdownlint and local style.
-- [Technical Blog: Scaling Expert Parallelism in TensorRT LLM](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/tech_blog/blog4_Scaling_Expert_Parallelism_in_TensorRT-LLM.md) +- [Technical Blog: Scaling Expert Parallelism in TensorRT LLM](https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/tech_blog/blog4_Scaling_Expert_Parallelism_in_TensorRT-LLM.md)docs/source/blogs/tech_blog/blog6_Llama4_maverick_eagle_guide.md (2)
40-41
: Pin a published NGC image tag.To follow the NGC tagging guidance, replace the generic “release” reference with a specific, published tag (for example,
.../release:<YYYY.MM>
). Please verify the current recommended tag on NGC.
64-65
: Minor punctuation.-**Important:** Replace `/path/to/maverick` and `/path/to/eagle` with the actual paths to your Maverick and Eagle3 model checkpoints on your host machine, downloaded in the [Download Artifacts](#download-artifacts) stage +**Important:** Replace `/path/to/maverick` and `/path/to/eagle` with the actual paths to your Maverick and Eagle3 model checkpoints on your host machine, downloaded in the [Download Artifacts](#download-artifacts) stage.docs/source/blogs/Falcon180B-H200.md (3)
24-28
: Align branding in footnotes: “TensorRT-LLM” → “TensorRT LLM”.For consistency with this PR’s objective, update hyphenated mentions in these superscripts.
-TensorRT-LLM v0.7a | +TensorRT LLM v0.7a |Also applies to: 41-45, 87-91, 100-104
10-13
: Tighten the AWQ + FP8 sentence.Minor grammar/flow.
-The 141GB of memory on H200, paired with TensorRT LLM running INT4 AWQ with -FP8, allows for the entire large language model to fit on a single GPU, where +The 141GB of memory on H200, paired with TensorRT LLM running INT4 AWQ with FP8, +allows the entire large language model to fit on a single GPU, whereas previously eight A100s were required.
31-34
: Minor grammar and punctuation fixes.-however, TensorRT LLM's AWQ decreases memory footprint of the model by **4x** +however, TensorRT LLM's AWQ decreases the model’s memory footprint by **4x**-TensorRT LLM has implemented custom kernels for AWQ, and taken the technique a +TensorRT LLM has implemented custom kernels for AWQ and taken the technique a step further by performing FP8 computation on Hopper GPUs instead of the standard FP16.-TensorRT LLM has improved its Group Query Attention (GQA) kernels, in the +TensorRT LLM has improved its Grouped Query Attention (GQA) kernels in the generation phase, providing up to 2.4x improvement on Llama-70B over-than query (Q) heads. TensorRT LLM has a custom implementation of MHA which +than query (Q) heads. TensorRT LLM has a custom implementation of MHA that supports GQA, multi-query attention (MQA) and standard MHA.Also applies to: 55-57, 64-67, 109-112
docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md (3)
49-50
: Grammar: “use latest main branch” → “use the latest main branch”.-If you want to use latest main branch, you can choose to build from source to install TensorRT LLM, the steps refer to ... +If you want to use the latest main branch, you can build from source to install TensorRT LLM. See ...
150-151
: Security note for --trust_remote_code.Consider adding a caution that this executes untrusted model code from Hugging Face; enable only for trusted sources.
- **Description:** Allows TensorRT LLM to download models and tokenizers from Hugging Face. This flag is passed directly to the Hugging Face API. + **Description:** Allows execution of custom model code from Hugging Face (trust_remote_code=True). Enable only for trusted sources; it may run arbitrary Python from the model repo.
296-306
: Minor: “first creating” → “first create”; fix link text.-To benchmark the performance of your TensorRT LLM server you can leverage the built-in “benchmark_serving.py” script. To do this first creating a wrapper [bench.sh](http://bench.sh) script. +To benchmark the performance of your TensorRT LLM server, use the built-in benchmark_serving.py script. First, create a wrapper script (bench.sh).docs/source/overview.md (3)
11-11
: Capitalization: “Pytorch” → “PyTorch”.-### 🔥 **Architected on Pytorch** +### 🔥 **Architected on PyTorch**
49-53
: Minor punctuation/spacing fixes.-- **NVIDIA Hopper**: H100, H200,GH200 with FP8 acceleration +- **NVIDIA Hopper**: H100, H200, GH200 with FP8 acceleration-- **NVIDIA Ada Lovelace**: L40/L40S, RTX 40 series with FP8 acceleration +- **NVIDIA Ada Lovelace**: L40/L40S; RTX 40 series with FP8 acceleration
55-58
: Fix run-on sentence and spacing.-Whether you're building the next generation of AI applications, optimizing existing LLM deployments, or exploring the frontiers of large language model technology, TensorRT LLM provides the tools, performance, and flexibility you need to succeed in the era of generative AI.To get started, refer to the {ref}`quick-start-guide`. +Whether you're building the next generation of AI applications, optimizing existing LLM deployments, or exploring the frontiers of large language model technology, TensorRT LLM provides the tools, performance, and flexibility you need to succeed in the era of generative AI. To get started, refer to the {ref}`quick-start-guide`.docs/source/blogs/tech_blog/blog4_Scaling_Expert_Parallelism_in_TensorRT-LLM.md (1)
42-52
: Clean up spacing/typos and consistent naming.-In this tech blog, we will introduce the details of the design and implementation to support E2E large-scale EP in TensorRT LLM . This blog post mainly covers the following: +In this tech blog, we introduce the design and implementation to support E2E large-scale EP in TensorRT LLM. This post covers:- * Some preliminary data demonstrating the effectiveness of the current implementation in TensorRT LLM . + * Preliminary data demonstrating the effectiveness of the current implementation in TensorRT LLM.-* The introduction of performance tuning and optimization for TensorRT LLM large-scale EP GB200 implementation. +* Performance tuning and optimization for TensorRT LLM large-scale EP on GB200.-Even if, in this tech blog, we focus on TensorRT LLM , we believe +Even though this tech blog focuses on TensorRT LLM, we believe-<p align="center"><sub><em>Figure 12: the high-level design of TensorRT LLM large-scale EP</em></sub></p> +<p align="center"><sub><em>Figure 12: High-level design of TensorRT LLM large-scale EP</em></sub></p>-* DeepEP is a great piece of work done by the DeepSeek team. When we started the TensorRT LLM large-scale EP efforts, our first focus was on GB200. We chose to implement our own custom EP communication kernels as it was easier to introduce optimizations requiring the GB200 MNNVL capability. Also, based on our current evaluation, DeepEP does not provide CUDA graph compatibility for all the scenarios. We believe that CUDA graph is needed for the scenario we are interested in. +* DeepEP is a great piece of work by the DeepSeek team. When we started TensorRT LLM large-scale EP, our first focus was GB200. We implemented custom EP communication kernels to optimize for GB200 MNNVL and because DeepEP lacked CUDA graph compatibility in all scenarios.-Based on these core utilities, there are two versions of EP Load Balancer in TensorRT LLM : Offline EP Load Balancer and Online EP Load Balancer. +Based on these utilities, TensorRT LLM provides two EP Load Balancers: Offline and Online.-The current TensorRT LLM large-scale EP implementation is not perfect and there are still known limitations (community contributions are welcome to help us improve). For example, we need: +The current TensorRT LLM large-scale EP implementation has known limitations (community contributions welcome). For example, we need:- * More validation with workloads close to production traffic. **Here we highly welcome the community’s feedback to help us calibrate TensorRT LLM large-scale EP implementation based on more concrete workloads.** + * More validation with production-like workloads. **Community feedback will help calibrate TensorRT LLM large-scale EP implementation.**-We encourage the community to try new ideas and performance validation. We encourage the community to share feedback to help us move fast in this area. We are actively tracking the TensorRT LLM large-scale EP execution in [this](https://github.com/NVIDIA/TensorRT-LLM/issues/4127) GitHub issue to ensure transparency to the community. +We encourage the community to try new ideas and share feedback. We are tracking TensorRT LLM large-scale EP in [this](https://github.com/NVIDIA/TensorRT-LLM/issues/4127) issue to ensure transparency.-... performance in TensorRT LLM . +... performance in TensorRT LLM.Also applies to: 54-58, 60-61, 224-225, 336-336, 367-367, 690-691, 697-699, 710-710, 715-716
docs/source/architecture/checkpoint.md (2)
13-23
: Add language to fenced code block (markdownlint MD040).
-``` +```text
NeMo ------------- | HuggingFace ------ ... TensorRT LLM ModelRunner
5-12
: Minor grammar and clarity improvements.
-This file documents the workflow around TensorRT LLM checkpoint and the set of CLI tools to generate checkpoint, build engines, and evaluate engines. +This file documents the TensorRT LLM checkpoint workflow and the CLI tools to generate checkpoints, build engines, and evaluate engines.
-## Prepare the TensorRT LLM Checkpoint +## Prepare the TensorRT LLM checkpoint-which is uniquely mapped to a certain parameter of a TensorRT LLM model. +which uniquely maps to a specific parameter of a TensorRT LLM model.-TensorRT LLM provides a unified build command: `trtllm-build`. Before using it, -you may need to add it to the `PATH`. +TensorRT LLM provides a unified build command: `trtllm-build`. Before using it, add it to your `PATH`.Also applies to: 25-34, 110-111, 221-223
docs/source/developer-guide/perf-benchmarking.md (4)
10-16
: Fix grammar and wording in CLI intro.
- “performance overiew” → “performance overview”
- “provides the follows” → “provides the following”
- Tighten bullets for clarity.
Apply:
- TensorRT LLM provides the `trtllm-bench` CLI, a packaged benchmarking utility that aims to make it -easier for users to reproduce our officially published [performance overiew](./perf-overview.md#throughput-measurements). `trtllm-bench` provides the follows: + TensorRT LLM provides the `trtllm-bench` CLI, a packaged benchmarking utility that makes it + easier to reproduce our officially published [performance overview](./perf-overview.md#throughput-measurements). `trtllm-bench` provides the following: @@ -- Ability to benchmark various flows and features within TensorRT LLM. +- The ability to benchmark various flows and features within TensorRT LLM.
66-71
: Grammar: “the following are the list … is” → “the following networks … are.”Apply:
-While `trtllm-bench` should be able to run any network that TensorRT LLM supports, the following are the list -that have been validated extensively and is the same listing as seen on the +While `trtllm-bench` can run any network that TensorRT LLM supports, the following networks +have been validated extensively and match the listing on the
176-188
: Keep sample output branding consistent.“TensorRT-LLM Version” should reflect the new branding.
Apply:
-TensorRT-LLM Version: 0.17.0 +TensorRT LLM Version: 0.17.0
419-469
: Tighten quantization prose; fix fragment and table caption.
- Merge the sentence fragment after the code block.
- Minor clarity on “mappings.”
Apply:
-The checkpoints above are quantized to run with a compute precision of `FP8` and default to no KV cache quantization (full -`FP16` cache). When running `trtllm-bench throughput`. The benchmark will select a KV cache quantization that is best suited +The checkpoints above use `FP8` compute and default to no KV cache quantization (full `FP16` cache). +When running `trtllm-bench throughput`, the benchmark will select a KV cache quantization that is best suited @@ -The following are the mappings that `trtllm-bench` will -follow when a checkpoint does not specify a KV cache quantization algorithm: +When a checkpoint does not specify a KV cache quantization algorithm, `trtllm-bench` applies the following mapping:docs/source/blogs/tech_blog/blog8_Scaling_Expert_Parallelism_in_TensorRT-LLM_part2.md (1)
320-321
: Minor grammar.“users productive scenarios” → “users’ productive scenarios”.
Apply:
-Large-scale Expert Parallelism represents one of the important workloads for users productive scenarios, +Large-scale Expert Parallelism represents one of the important workloads for users’ productive scenarios,docs/source/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.md (1)
60-66
: Grammar: “approaches to do disaggregation LLM inference” → “approaches to perform disaggregated LLM inference.”Apply:
-There are three different approaches to do disaggregation LLM inference with TensorRT LLM, where each approach offers distinct architectural and operational characteristics suited to different deployment scenarios. +There are three approaches to perform disaggregated LLM inference with TensorRT LLM, each suited to different deployment scenarios.docs/source/features/paged-attention-ifb-scheduler.md (3)
10-18
: Standardize “in‑flight” wording and clarify the constraint text.Use “in‑flight” consistently; tighten the packed-input constraint sentence.
Apply:
-For efficiency reasons (1), the support for inflight batching ***requires the +For efficiency reasons (1), in‑flight batching ***requires the input tensors to be packed (no padding)***. @@ -***In the current implementation, the sequences that are going through the +***In the current implementation, sequences going through the context phase must come before the sequences in the generation phase in the input tensor. For example, for sequences `S0`, `S1` and `S2`, if `S0` and `S2` are in context phase (and `S1` in generation), tokens from `S0` and `S2` must appear before the tokens of `S1` in the input tensor***. The constraint may or may not be relaxed in a future version.
109-111
: Hyphenation consistency.Use “max batch size” (no hyphen) to match the rest of the doc.
Apply:
-This section visualizes how TensorRT LLM schedules requests based on max-batch size and max-num tokens. +This section visualizes how TensorRT LLM schedules requests based on max batch size and max num tokens.
83-87
: Minor punctuation.Remove double space after period.
Apply:
-already been computed. That cache is known as the KV cache. +already been computed. That cache is known as the KV cache.docs/source/blogs/tech_blog/blog2_DeepSeek_R1_MTP_Implementation_and_Optimization.md (2)
2-2
: Capitalize byline.Apply:
-by NVIDIA TensorRT LLM team +By NVIDIA TensorRT LLM Team
217-218
: Clarify “1.03x average latency.”Likely meant “1.03x speedup” or “0.97x latency.” Rephrase to avoid ambiguity.
Apply:
-while the overlap scheduler can achieve 1.03x average latency. +while the overlap scheduler shows a ~1.03× speedup (~3% lower latency).docs/source/installation/build-from-source-linux.md (2)
233-234
: Verify external wheel URL and version.Ensure the example URL exists and matches a published artifact; otherwise, prefer a stable “latest”/template placeholder to avoid rot.
Example:
-TRTLLM_PRECOMPILED_LOCATION=https://pypi.nvidia.com/tensorrt-llm/tensorrt_llm-0.16.0-cp312-cp312-linux_x86_64.whl pip install -e . +TRTLLM_PRECOMPILED_LOCATION=<URL_OR_PATH_TO_PUBLISHED_WHEEL> pip install -e .
191-191
: Minor wording nit.-...with older versions of `torch` (prior to 2.7.0)... +...with older versions of `torch` (earlier than 2.7.0)...docs/source/blogs/tech_blog/blog3_Optimizing_DeepSeek_R1_Throughput_on_NVIDIA_Blackwell_GPUs.md (1)
160-166
: Avoid hardcoding commit links in user docs.Commit URLs rot; prefer a docs reference or a short note “defaulted on since mid-2025” if needed.
-The overlap schedule is already defaulted on in TensorRT LLM by [commit](https://github.com/NVIDIA/TensorRT-LLM/commit/b4e5df0e...). +The overlap scheduler is defaulted on in recent TensorRT LLM releases.docs/source/deployment-guide/quick-start-recipe-for-llama4-scout-on-trtllm.md (2)
32-43
: Container tag likely to drift; point to a published tag or variable.Per team guidance, reference only published NGC tags. Replace hardcoded rc tag with a placeholder or verified release tag.
-nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc6 \ +nvcr.io/nvidia/tensorrt-llm/release:<PUBLISHED_TAG> \Also add a note to pick a tag from the tags page below.
52-53
: Grammar nit.-If you want to use latest main branch, you can choose to build from source... +If you want to use the latest main branch, build from source...docs/source/features/attention.md (4)
66-67
: Typo (“store” → “stored”).-| seq_lens_cuda | Tensor | A copy of `seq_lens` store on the GPU. | +| seq_lens_cuda | Tensor | A copy of `seq_lens` stored on the GPU. |
146-154
: Option name check for FP8 Context FMHA.Confirm
use_paged_context_fmha
is the public option name exposed to users; if internal, document the user-facing flag/API.
158-161
: Tighten wording; avoid “able to.”-That kernel is able to apply +That kernel can apply @@ -TensorRT LLM will continue to add (or enable) additional features... +TensorRT LLM will continue to add additional features...
287-293
: Capitalize “KV cache” consistently.-...treats the kv cache +...treats the KV cache @@ -overwrite the "least recently used" caches. +overwrite the “least recently used” caches.docs/source/blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.md (3)
9-16
: List indentation (MD007) and heading level (MD001).Fix bullet indentation under TOC and avoid skipping heading levels (h3→h4).
Example:
- - [1. Download TensorRT LLM](#1-download-tensorrt-llm) + - [1. Download TensorRT LLM](#1-download-tensorrt-llm)And promote “#### 1. Download TensorRT LLM” to “###” if the preceding heading is “## Prerequisites…”.
125-130
: Clarify tool reference and context.Add a short parenthetical on trtllm-bench install/source to help new users.
-...see [TensorRT LLM Benchmarking](https://nvidia.github.io/TensorRT-LLM/performance/perf-benchmarking.html) for details of this tool. +...see [TensorRT LLM Benchmarking](https://nvidia.github.io/TensorRT-LLM/performance/perf-benchmarking.html) for details. (Installed with TensorRT LLM.)
189-193
: Commit pin is fine; add date for reproducibility.Add a date to contextualize the commit-based results.
-The results are reproduced with TensorRT LLM commit b6261862419c33d6ce2313aff1e7116067d6037d. +The results are reproduced with TensorRT LLM commit b6261862419c33d6ce2313aff1e7116067d6037d (Aug/Sep 2025).
Signed-off-by: nv-guomingz <[email protected]>
5a20225 to c95b320
/bot skip --comment "docs change only"
PR_Github #17762 [ skip ] triggered by Bot
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (5)
docs/source/blogs/tech_blog/blog9_Deploying_GPT_OSS_on_TRTLLM.md (2)
223-239
: CLI name mismatch: use trtllm‑serve consistently.The text says “trtllm-server” but earlier commands use “trtllm-serve”.
-Alternatively the TRITON backend can be enabled by passing the CLI flag to the trtllm-server command at runtime: +Alternatively, the TRITON backend can be enabled by passing this flag to the trtllm-serve command at runtime: ```bash --moe_backend TRITON--- `271-338`: **Sanitize the example output; remove chain‑of‑thought/internal content.** The sample includes internal reasoning tokens and meta‑instructions, which should not appear in user‑visible docs. Replace with a minimal, realistic response body. ```diff -```bash -{ - "id": "chatcmpl-c440e2a3e7e14cd699295afc3739bf42", - "object": "chat.completion", - "created": 1754358426, - "model": "openai/gpt-oss-120b", - "choices": [ - { - "index": 0, - "message": { - "role": "assistant", - "content": "<|channel|>analysis<|message|>The user asks: \"What is NVIDIA's advantage for inference?\" The user wants: - \"What is NVIDIA's advantage for inference?\" Likely they want a detailed answer about why NVIDIA has advantages for - inference tasks (e.g., GPU acceleration, TensorRT, software stack, Tensor Cores, hardware, performance, low latency, - etc.). The user wants a short answer? Not specified. The user wants a straightforward answer. Probably a brief - explanation: hardware features like Tensor cores, optimized libraries (TensorRT), ecosystem, software stack, - integration with frameworks, cuDNN, inference GPU architecture, high throughput, low latency, FP16, INT8, etc.\n\nWe - need to produce a response: Provide NVIDIA's advantage for inference. Provide specifics: hardware (Tensor cores, RT - cores, Ampere architecture), software (TensorRT, CUDA, cuDNN, Triton Inference Server), developer ecosystem, network - effects, etc.\n\nAlso mention inference performance vs competition, DPUs, etc.\n\nWe'll answer concisely but thorough. - \n\nWe also must not mention we are ChatGPT or mention policy. Provide final answer.\n\nLet's produce a clear, - well-structured answer.\n\n<|end|><|start|>assistant<|channel|>final<|message|>**NVIDIA’s edge in AI inference comes - from a tightly‑integrated stack of hardware, software, and ecosystem tools that together deliver the - highest‑throughput, lowest‑latency execution of deep‑learning models.**\n\n| **Layer** | **Key Advantages** | **Why - It Matters for Inference** |\n|-----------|-------------------|---------------------------------|\n| **GPU - Architecture** | • **Tensor Cores** (FP16/FP8/INT8) – dedicated matrix‑multiply units that give 2–8× higher ops/W - than classic CUDA cores. <br>• **CUDA SMs with mixed‑precision support** – can run FP16/INT8/INT4 with minimal - accuracy loss. <br>• **NVLink / NVSwitch** – high‑bandwidth, low‑latency inter‑GPU communication for multi‑GPU - serving. | Maximizes raw arithmetic density, letting a single GPU handle many concurrent requests. |\n| - **Inference‑focused Silicon** | • **A100 / H100 “Tensor‑Core” GPUs** – FP8 support (H100) and sparsity acceleration - (up to 2× throughput). <br>• **T4/Ti, L40, L4, Jetson edge modules** – power‑optimized variants for data‑center, - edge, and robotics. | Provides the right performance‑per‑watt for cloud, on‑prem, or edge deployments. |\n| - **Software Stack** | • **TensorRT** – a compiler & runtime that fuses layers, applies precision calibration, and - auto‑tunes kernels for each GPU. <br>• **CUDA, cuDNN, cuBLAS** – low‑level libraries tuned for every generation of - GPU. <br>• **Triton Inference Server** – model‑agnostic serving, model‑versioning, batching, GPU sharing, and - scaling across CPUs/GPUs. 
| Turns raw GPU power into production‑ready, low‑latency services with minimal engineering - effort. |\n| **Model Optimizations** | • **Quantization (INT8/FP8) & Structured Sparsity** – supported natively by - TensorRT and the hardware. <br>• **Automatic Mixed‑Precision (AMP)** – retains accuracy while cutting compute. | - Reduces memory bandwidth and compute cost while keeping accuracy within acceptable bounds. |\n| **Ecosystem & - Compatibility** | • **Broad framework support** – TensorFlow, PyTorch, ONNX, JAX, etc., all compile to TensorRT. - <br>• **NVIDIA NGC** – pre‑optimized model zoo, containers, and reference pipelines. <br>• **MLOps tools** – - NVIDIA Merlin, Clara, Metropolis, etc., for recommendation, medical, vision pipelines. | Engineers can - plug‑and‑play, accelerate, and ship models faster. |\n| **Scalability & Deployment Flexibility** | • **DGX Cloud, - EGX, Jetson, and Orin** – end‑to‑end solutions from cloud to edge. <br>• **Multi‑Instance GPU (MIG)** – partition - a single A100 into up to 7 isolated inference instances. <br>• **NVIDIA AI Enterprise** – managed software suite - for on‑prem data‑centers. | Allows the same code to run on a laptop, an edge device, or a massive data‑center - cluster. |\n| **Performance Benchmarks** | • **Industry‑leading latency/throughput** on MLPerf Inference (FP8, - INT8). <br>• **Sparsity‑aware kernels** give >2× speedup on H100 with < 0.1 % accuracy loss. | Demonstrates - real‑world advantage in the most respected benchmark suite. |\n|", - "reasoning_content": null, - "tool_calls": [] - }, - "logprobs": null, - "finish_reason": "length", - "stop_reason": null, - "disaggregated_params": null - } - ], - "usage": { - "prompt_tokens": 17, - "total_tokens": 1041, - "completion_tokens": 1024 - }, - "prompt_token_ids": null -} - -``` +```bash +{ + "id": "chatcmpl-1234", + "object": "chat.completion", + "created": 1754358426, + "model": "openai/gpt-oss-120b", + "choices": [ + { + "index": 0, + "message": { + "role": "assistant", + "content": "NVIDIA’s advantage in inference comes from Tensor Cores, an optimized software stack (TensorRT, Triton), and high-bandwidth interconnects (NVLink/NVSwitch) that deliver low latency and high throughput at scale." + }, + "finish_reason": "stop" + } + ], + "usage": { + "prompt_tokens": 17, + "completion_tokens": 48, + "total_tokens": 65 + } +} +```
docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md (1)
41-44
: Update NGC image tag to published version
The referenced container tag 1.0.0rc6 isn't available on NVIDIA NGC (latest is 1.0.0rc4 as of September 5, 2025). Update the docs to use a published tag (e.g. nvcr.io/nvidia/tensorrt-llm/release:1.0.0rc4) or wait until 1.0.0rc6 is released.
Location: docs/source/deployment-guide/quick-start-recipe-for-llama3.3-70b-on-trtllm.md:41-44docs/source/blogs/tech_blog/blog4_Scaling_Expert_Parallelism_in_TensorRT-LLM.md (1)
224-236
: Figure reference consistency + typo.
- Caption is fine; but the paragraph below refers to “Figure 11” when discussing Dispatch/Combine. That should reference Figure 12 here.
- Fix “Placement” split in prior bullets.
- * EP communication kernels. In Figure 11, those are the **Dispatch** and **Combine** components. + * EP communication kernels. In Figure 12, these are the **Dispatch** and **Combine** components.Also fix earlier occurrences of “Update Weights & Placemen t” to “Update Weights & Placement.”
docs/source/developer-guide/perf-benchmarking.md (1)
113-123
: Incorrect definition: input_ids are token IDs, not logits. Also “prompts and logits” wording. This is user-facing schema; correctness matters.
-| `input_ids` | Y* | List[Integer] | List of logits that make up the request prompt. | +| `input_ids` | Y* | List[Integer] | List of token IDs that make up the request prompt. | @@ -\* Specifying `prompt` or `input_ids` is required. However, you can not have both prompts and logits (`input_ids`) +\* Specifying `prompt` or `input_ids` is required. However, you can not have both prompts and token IDs (`input_ids`)
♻️ Duplicate comments (15)
docs/source/commands/trtllm-serve/run-benchmark-with-trtllm-serve.md (1)
3-3
: Fix spelling and article (“OpenAI-compatible”).Use correct spelling and articles.
-TensorRT LLM provides the OpenAI-compatiable API via `trtllm-serve` command. +TensorRT LLM provides an OpenAI-compatible API via the `trtllm-serve` command.docs/source/dev-on-cloud/build-image-to-dockerhub.md (1)
4-4
: Grammar and naming polish (already noted).-When you develop trt-llm on cloud platform such as runpod, you may need to provide a docker image for the platform. So you firstly need to upload the image to dockerhub. +When you develop TensorRT LLM on a cloud platform such as RunPod, you may need to provide a Docker image for the platform. First, upload the image to Docker Hub.docs/source/blogs/H200launch.md (1)
38-48
: Broken image URLs were fixed—good catch.Both image src paths now use “TensorRT-LLM” (hyphenated) and should render correctly on GitHub.
docs/source/quick-start-guide.md (1)
102-105
: Fix “Deployment Guide” link target (path is missing).-- **[Deployment Guide](examples/llm_api_examples)** - Comprehensive examples for deploying LLM inference in various scenarios +- **[Deployment Guide](deployment-guide/index.rst)** - Comprehensive examples for deploying LLM inference in various scenariosdocs/source/overview.md (2)
7-7
: Point the “TensorRT LLM” link to the project docs, not generic TensorRT.Use the project’s documentation URL.
-[TensorRT LLM](https://developer.nvidia.com/tensorrt) is NVIDIA's comprehensive open-source library for accelerating and optimizing inference performance of the latest large language models (LLMs) on NVIDIA GPUs. +[TensorRT LLM](https://nvidia.github.io/TensorRT-LLM/) is NVIDIA's comprehensive open-source library for accelerating and optimizing inference performance of the latest large language models (LLMs) on NVIDIA GPUs.
21-23
: Remove leading dot in URL (broken link).The DeepSeek R1 link won’t work with a leading dot.
-- **DeepSeek R1**: [World-record inference performance on Blackwell GPUs](.https://developer.nvidia.com/blog/nvidia-blackwell-delivers-world-record-deepseek-r1-inference-performance/) +- **DeepSeek R1**: [World-record inference performance on Blackwell GPUs](https://developer.nvidia.com/blog/nvidia-blackwell-delivers-world-record-deepseek-r1-inference-performance/)docs/source/blogs/tech_blog/blog3_Optimizing_DeepSeek_R1_Throughput_on_NVIDIA_Blackwell_GPUs.md (1)
32-39
: Fix table formatting and grammar; remove tabs and trailing pipes.Re-apply the earlier correction; tabs still present and phrasing is clunky.
-The checkpoint used in this blog is hosted in [nvidia/DeepSeek-R1-FP4](https://huggingface.co/nvidia/DeepSeek-R1-FP4), generated by [NVIDIA Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer). The accuracy score of common dataset on this FP4 checkpoint and TensorRT LLM implementations are: +The checkpoint used in this blog is hosted in [nvidia/DeepSeek-R1-FP4](https://huggingface.co/nvidia/DeepSeek-R1-FP4), generated by [NVIDIA Model Optimizer](https://github.com/NVIDIA/TensorRT-Model-Optimizer). The accuracy on common datasets for this FP4 checkpoint with TensorRT LLM is: -| Precision | GPQA Diamond | MATH-500 -| :-- | :-- | :-- | -| TensorRT LLM FP8 | 0.697 | 0.954 | -| TensorRT LLM FP4 | 0.705 | 0.96 | +| Precision | GPQA Diamond | MATH-500 | +| :--------------- | :----------- | :------- | +| TensorRT LLM FP8 | 0.697 | 0.954 | +| TensorRT LLM FP4 | 0.705 | 0.960 | -** Note there are some run-to-run variance for these evaluations, so FP4 data is slight higher here. We think FP4 has comparable accuracy with FP8 on these datasets. +Note: There is run-to-run variance in these evaluations, so FP4 appears slightly higher here. FP4 accuracy is comparable to FP8 on these datasets.docs/source/deployment-guide/quick-start-recipe-for-deepseek-r1-on-trtllm.md (1)
7-8
: Align container narrative with commands (use TensorRT LLM container).Intro says “start with PyTorch container,” but the run command uses the TensorRT LLM container.
-The guide is intended for developers and practitioners seeking high-throughput or low-latency inference using NVIDIA’s accelerated stack—starting with the PyTorch container from NGC, then installing TensorRT LLM for model serving, FlashInfer for optimized CUDA kernels, and ModelOpt to enable FP8 and NVFP4 quantized execution. +The guide is intended for developers and practitioners seeking high-throughput or low-latency inference using NVIDIA’s accelerated stack—starting with the TensorRT LLM container from NGC (or building from source), using FlashInfer for optimized CUDA kernels, and ModelOpt to enable FP8 and NVFP4 quantized execution.docs/source/blogs/tech_blog/blog5_Disaggregated_Serving_in_TensorRT-LLM.md (1)
144-145
: Likely typo: “NIXL” → “NCCL”; recommendation wording.Mainstream stacks are MPI, UCX, and NCCL; align the recommendation to UCX/NCCL.
-Currently, mainstream communication protocols—MPI, UCX, and NIXL—are all supported by TensorRT LLM, ... -Currently, we recommend using UCX and NIXL backends, ... +Currently, mainstream communication protocols—MPI, UCX, and NCCL—are all supported by TensorRT LLM, ... +Currently, we recommend using the UCX and NCCL backends, ...docs/source/architecture/checkpoint.md (1)
171-173
: Fix typos and agreement in note (out_fature, follows).- The linear weights in TensorRT LLM checkpoint always follows (`out_feature`, `in_feature`) shape, whereas some quantized linear in TensorRT LLM implemented by plugin may use (`in_feature`, `out_fature`) shape. The `trtllm-build` command adds a transpose operation to post-process it. + The linear weights in a TensorRT LLM checkpoint always follow the (`out_feature`, `in_feature`) shape, whereas some quantized linear layers implemented by plugins may use (`in_feature`, `out_feature`). The `trtllm-build` command adds a transpose to post‑process this.docs/source/deployment-guide/quick-start-recipe-for-llama4-scout-on-trtllm.md (1)
269-270
: Accidental external link to bench.sh domain.Make “bench.sh” a filename, not a URL.
-To benchmark the performance of your TensorRT LLM server you can leverage the built-in `benchmark_serving.py` script. To do this first creating a wrapper [bench.sh](http://bench.sh) script. +To benchmark the performance of your TensorRT LLM server you can leverage the built-in `benchmark_serving.py` script. First, create a wrapper script named `bench.sh`.docs/source/installation/build-from-source-linux.md (2)
12-12
: Fix typo and tighten phrasing; keep anchors.Replace “TensortRT-LLM” and streamline the sentence per earlier suggestion.
-If you intend to build any TensortRT-LLM artifacts, such as any of the container images (note that there exist pre-built [develop](#build-from-source-tip-develop-container) and [release](#build-from-source-tip-release-container) container images in NGC), or the TensorRT LLM Python wheel, you first need to clone the TensorRT LLM repository: +If you intend to build any TensorRT LLM artifacts—such as container images (pre-built [develop](#build-from-source-tip-develop-container) and [release](#build-from-source-tip-release-container) images are available on NGC)—or the TensorRT LLM Python wheel, first clone the TensorRT LLM repository:
181-186
: Broken self-link for “C++ Runtime”; add explicit label and retarget link.Auto anchors don’t handle “c++” reliably. Add a label and link to it.
-This feature will not be enabled when [`building only the C++ runtime`](#link-with-the-tensorrt-llm-c++-runtime). +This feature will not be enabled when [`building only the C++ runtime`](#link-with-tensorrt-llm-cpp-runtime). -#### Linking with the TensorRT LLM C++ Runtime +(link-with-tensorrt-llm-cpp-runtime)= +#### Linking with the TensorRT LLM C++ RuntimeAlso applies to: 183-183
docs/source/blogs/tech_blog/blog2_DeepSeek_R1_MTP_Implementation_and_Optimization.md (2)
240-245
: Fix typos and standardize Eagle naming.Correct “Ealge2” and use a consistent “Eagle-2/Eagle-3” style across the paragraph.
-... such as Ealge2 and Eagle3, ... +... such as Eagle-2 and Eagle-3, ... @@ -... DeepSeek-V3/R1+Eagle3 ... +... DeepSeek-V3/R1 + Eagle-3 ...
77-80
: Fix both MTP code links and pin to a stable commit.Both links currently point to the same line on main. Link Vanilla and Eagle to their respective definitions using the same pinned SHA used later in this doc.
-TensorRT LLM has two different paths for MTP, one for [MTP Vanilla](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/_torch/speculative/mtp.py#L1047) and another for [MTP Eagle](https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/_torch/speculative/mtp.py#L1047). +TensorRT LLM has two different paths for MTP, one for [MTP Vanilla](https://github.com/NVIDIA/TensorRT-LLM/blob/338744fba6a91147b739b7f02d19b37bc19aa17a/tensorrt_llm/_torch/speculative/mtp.py#L326) and another for [MTP Eagle](https://github.com/NVIDIA/TensorRT-LLM/blob/338744fba6a91147b739b7f02d19b37bc19aa17a/tensorrt_llm/_torch/speculative/mtp.py#L1047).
PR_Github #17762 [ skip ] completed with state |
Signed-off-by: nv-guomingz <[email protected]>
Signed-off-by: nv-guomingz <[email protected]> Signed-off-by: Wangshanshan <[email protected]>
Summary by CodeRabbit
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run
/bot [-h|--help]
to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id
(OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test
(OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast
(OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test
(OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx"
(OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe"
(OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp"
(OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test
(OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test
(OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test
(OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge
(OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"
(OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log
(OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug
(OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.

kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request.
--comment "Reason for skipping build/test"
is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
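For illustration only, here are a couple of hypothetical invocations composed from the options documented above; the stage name reuses the "A10-PyTorch-1" example from the help text, and the comment string is a placeholder:

/bot run --disable-fail-fast --stage-list "A10-PyTorch-1"
/bot skip --comment "docs-only rename, no code changes"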