
Conversation


@yuxianq yuxianq commented Aug 3, 2025

Summary by CodeRabbit

  • Documentation
    • Added a comprehensive deployment and usage guide for running the DeepSeek R1 model on NVIDIA GPUs with TensorRT-LLM, including setup, configuration, testing, evaluation, and benchmarking instructions.
    • Updated configuration to exclude Markdown files from trailing whitespace checks.

Description

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with the Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is given. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages that don't match the specified backends. Supported backends: pytorch, cpp, tensorrt, triton. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can break the top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can break the top of tree.
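For illustration, typical /bot comments using only the flags documented above might look like this (the stage and GPU names are the examples given in the help text):

```text
/bot run
/bot run --disable-fail-fast --skip-test
/bot run --stage-list "A10-PyTorch-1" --detailed-log
/bot run --gpu-type "A30, H100_PCIe" --post-merge
/bot kill
/bot skip --comment "Docs-only change"
/bot reuse-pipeline
```

Each command is posted as a standalone PR comment; a new /bot run kills any previously running jobs for the pull request.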

@yuxianq yuxianq requested a review from nv-guomingz August 3, 2025 16:10

coderabbitai bot commented Aug 3, 2025

📝 Walkthrough

Walkthrough

The changes update the pre-commit configuration to exclude Markdown files from the trailing whitespace check and introduce a comprehensive Markdown guide detailing deployment, testing, evaluation, and benchmarking procedures for running the DeepSeek R1 model on TensorRT-LLM.

Changes

  • Pre-commit Hook Configuration (.pre-commit-config.yaml): Expanded the exclusion pattern for the trailing-whitespace hook to skip both .patch and .md files.
  • Documentation: DeepSeek R1 on TensorRT-LLM (examples/models/core/deepseek_v3/quick-start-recipe-for-deepseek-r1-on-trt-llm.md): Added a detailed deployment, usage, troubleshooting, evaluation, and benchmarking guide for DeepSeek R1 on TensorRT-LLM.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Docker
    participant TensorRT-LLM Server
    participant GPU
    participant HTTP Client

    User->>Docker: Launch TensorRT-LLM container with DeepSeek R1
    Docker->>TensorRT-LLM Server: Start server with config and model
    TensorRT-LLM Server->>GPU: Load model, allocate resources
    HTTP Client->>TensorRT-LLM Server: Send inference request (prompt)
    TensorRT-LLM Server->>GPU: Run inference
    GPU-->>TensorRT-LLM Server: Return output tokens
    TensorRT-LLM Server-->>HTTP Client: Respond with completion
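The "Send inference request" step in the diagram maps to a plain HTTP call. As a sketch (assuming the OpenAI-compatible API that trtllm-serve exposes on its default port 8000; the model name and prompt are illustrative, and a running server is required):

```shell
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-ai/DeepSeek-R1",
        "prompt": "Explain what a KV cache is in one sentence.",
        "max_tokens": 64
      }'
```

The server responds with a JSON completion object containing the generated tokens, corresponding to the final "Respond with completion" arrow above.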

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~7 minutes

Possibly related PRs

Suggested labels

Community want to contribute

Suggested reviewers

  • nv-guomingz
  • kaiyux



🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai generate unit tests to generate unit tests for this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai or @coderabbitai title anywhere in the PR title to generate the title automatically.

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (5)
examples/models/core/deepseek_v3/deployment-guide-for-trt-llm-deepseek-r1.md (5)

3-4: Fix heading level hierarchy

# Introduction should be an H2 (##) so that heading levels increment by one from the main title, eliminating the MD001 linter error.

-# Introduction
+## Introduction

194-195: Remove stray back-tick at end of sentence

The closing back-tick after extra_llm_api_options is orphaned and appears to be a typo.

-... which can be used in the extra_llm_api_options`.`
+... which can be used in the `extra_llm_api_options`.

251-256: Specify a language for fenced code block

Linter MD040 flags this block. Add a language (e.g., shell) after the opening back-ticks.

-```
+```shell
 MODEL_PATH=deepseek-ai/DeepSeek-R1
 ...

259-264: Add language hint to result table block

The fenced block containing the sample results lacks a language tag. Use text to silence MD040 and preserve monospace alignment.

-```
+```text
 |Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
 ...

337-346: Add language spec to benchmark output block

Same MD040 issue for the benchmark sample.

-```
+```text
 ============ Serving Benchmark Result ============
 ...
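The MD040 fixes above can also be checked mechanically. A small sketch (not the markdownlint implementation, just an illustrative scan with the same intent) that flags opening fences lacking a language tag:

```python
def unlabeled_fences(markdown: str):
    """Return line numbers of opening ``` fences that have no language tag."""
    flagged = []
    in_fence = False
    for i, line in enumerate(markdown.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith("```"):
            if not in_fence:
                # Opening fence: any text after the backticks is the language.
                if stripped == "```":
                    flagged.append(i)
                in_fence = True
            else:
                # Closing fence.
                in_fence = False
    return flagged

doc = "# Title\n```\nMODEL_PATH=deepseek-ai/DeepSeek-R1\n```\n```shell\nls\n```\n"
print(unlabeled_fences(doc))  # the bare fence on line 2 is flagged
```

Running markdownlint itself (or the pre-commit hook) remains the authoritative check; this only shows why lines 251, 259, and 337 were reported.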
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3f7abf8 and 423692c.

📒 Files selected for processing (2)
  • .pre-commit-config.yaml (1 hunks)
  • examples/models/core/deepseek_v3/deployment-guide-for-trt-llm-deepseek-r1.md (1 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-01T07:34:42.734Z
Learning: Applies to **/*.py : The code developed for TensorRT-LLM should conform to Python 3.8+.
Learnt from: yiqingy0
PR: NVIDIA/TensorRT-LLM#5198
File: jenkins/mergeWaiveList.py:0-0
Timestamp: 2025-07-22T08:33:49.109Z
Learning: In the TensorRT-LLM waive list merging system, removed lines are always located at the end of the merge waive lists, which is why the mergeWaiveList.py script uses reverse traversal - it's an optimization for this specific domain constraint.
📚 Learning: in tensorrt-llm, examples directory can have different dependency versions than the root requirement...
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.

Applied to files:

  • examples/models/core/deepseek_v3/deployment-guide-for-trt-llm-deepseek-r1.md
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • examples/models/core/deepseek_v3/deployment-guide-for-trt-llm-deepseek-r1.md
🪛 LanguageTool
examples/models/core/deepseek_v3/deployment-guide-for-trt-llm-deepseek-r1.md

[style] ~128-~128: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...ch** backend. #### --max_batch_sizeDescription: The maximum number of ...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~132-~132: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...r processing. #### --max_num_tokensDescription: The maximum total numb...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~136-~136: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...scheduled batch. #### --max_seq_lenDescription: The maximum possible s...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~140-~140: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...ut tokens. #### --trust_remote_codeDescription: Allows TensorRT-LLM to...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~155-~155: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...: Sets the data type for the KV cache.   Default: auto (uses the data ...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~159-~159: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...checkpoint). #### cuda_graph_configDescription: A section for configur...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~161-~161: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...g CUDA graphs to optimize performance.  Options:   enable_paddi...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~169-~169: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...or which a CUDA graph will be created.   Default: 0   **Rec...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~173-~173: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...-max_batch_size command-line option.   batch_sizes: A specific list of ...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~175-~175: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...batch sizes to create CUDA graphs for.   Default: None #### `moe_conf...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~181-~181: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...n for Mixture-of-Experts (MoE) models.  Options:   backend: The ...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~185-~185: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...The backend to use for MoE operations.   Default: CUTLASS #### `atten...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~191-~191: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...end to use for attention calculations.  Default: TRTLLM See the [https://g...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~316-~316: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...l,e2el" done EOF chmod +x bench.sh ``` To benchmark the FP4 model, replace --mod...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)

🪛 markdownlint-cli2 (0.17.2)
examples/models/core/deepseek_v3/deployment-guide-for-trt-llm-deepseek-r1.md

109-109: Heading levels should only increment by one level at a time
Expected: h3; Actual: h4

(MD001, heading-increment)


251-251: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


259-259: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


337-337: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
.pre-commit-config.yaml (1)

30-30: Trailing-whitespace hook exclusion looks correct

Expanding the regex to \.(patch|md)$ cleanly prevents markdown docs (like the one added in this PR) from being flagged while still keeping other file types covered. No further action needed.
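To see why the widened pattern behaves as described, here is a quick check of the regex against a few filenames (a sketch; pre-commit applies exclude patterns with Python's re module in the same search-anywhere fashion):

```python
import re

# The exclude pattern from .pre-commit-config.yaml discussed above.
exclude = re.compile(r"\.(patch|md)$")

for name in [
    "docs/guide.md",   # excluded: ends with .md
    "fix.patch",       # excluded: ends with .patch
    "README.md.bak",   # still checked: suffix is .bak, not .md
    "src/main.py",     # still checked: neither suffix matches
]:
    status = "excluded" if exclude.search(name) else "checked"
    print(f"{name}: {status}")
```

Because the pattern is anchored with $, only true .patch and .md suffixes are skipped; files that merely contain those strings mid-path are still checked for trailing whitespace.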

Signed-off-by: Yuxian Qiu <[email protected]>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (2)
examples/models/core/deepseek_v3/deployment-guide-for-trt-llm-deepseek-r1.md (2)

190-196: Remove stray back-tick at the end of the hyperlink sentence

The dangling back-tick breaks inline-code formatting and renders a literal “`” in GitHub preview.

-... list of options which can be used in the extra_llm_api_options`.
+... list of options which can be used in the `extra_llm_api_options`.

249-256: Add language identifiers to fenced code blocks

markdownlint (MD040) warns when a fence lacks a language.
Specify shell for commands and text for sample outputs/tables to enable proper syntax highlighting.

-```
+```shell
 MODEL_PATH=deepseek-ai/DeepSeek-R1
@@
-```
+```text
 |Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
@@
-```
+```text
 ============ Serving Benchmark Result ============
@@
Repeat for every unlabeled block (three occurrences in this range).  



Also applies to: 259-264, 337-364


📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 423692ce6f5e3024cc16494b17b507265d125439 and ecbd20b8c3cea90c5d7ed4d0e7bd33fa3b40d686.

📒 Files selected for processing (1)
  • examples/models/core/deepseek_v3/deployment-guide-for-trt-llm-deepseek-r1.md (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check

Signed-off-by: Yuxian Qiu <[email protected]>
@nv-guomingz nv-guomingz requested a review from chenopis August 4, 2025 02:40

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
examples/models/core/deepseek_v3/deployment-guide-for-trt-llm-deepseek-r1.md (1)

251-256: Add language identifiers to fenced code blocks to satisfy markdownlint (MD040)

The three blocks below lack a language tag:

  • Lines 251-255 (lm_eval …) – use bash
  • Lines 259-264 (table output) – use text
  • Lines 337-363 (benchmark sample output) – use text
-```
+```bash      # for the command block

-```
+```text      # for plain text / table output

This removes the current MD040 violations and enables proper syntax highlighting.

Also applies to: 259-264, 337-363

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ecbd20b and 531a7ca.

📒 Files selected for processing (1)
  • examples/models/core/deepseek_v3/deployment-guide-for-trt-llm-deepseek-r1.md (1 hunks)
Context: ...or which a CUDA graph will be created.   Default: 0   **Rec...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~173-~173: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...-max_batch_size command-line option.   batch_sizes: A specific list of ...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~175-~175: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...batch sizes to create CUDA graphs for.   Default: None #### `moe_conf...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~181-~181: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...n for Mixture-of-Experts (MoE) models.  Options:   backend: The ...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~185-~185: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...The backend to use for MoE operations.   Default: CUTLASS #### `atten...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~191-~191: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...end to use for attention calculations.  Default: TRTLLM See the [https://g...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)


[style] ~316-~316: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...l,e2el" done EOF chmod +x bench.sh ``` To benchmark the FP4 model, replace --mod...

(ENGLISH_WORD_REPEAT_BEGINNING_RULE)

🪛 markdownlint-cli2 (0.17.2)
examples/models/core/deepseek_v3/deployment-guide-for-trt-llm-deepseek-r1.md

251-251: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


259-259: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


337-337: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
examples/models/core/deepseek_v3/deployment-guide-for-trt-llm-deepseek-r1.md (1)

106-144: Heading hierarchy now correct – nice cleanup

The section was demoted to ### and option headings to ####, fixing the previous MD001 complaint.
No further action needed.

Signed-off-by: Yuxian Qiu <[email protected]>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (2)
examples/models/core/deepseek_v3/deployment-guide-for-trt-llm-deepseek-r1.md (2)

106-188: Heading levels still skip H3 (MD001)

All option headings (####) are children of an H2 section, so they should be ###.
Same feedback was given earlier and remains unresolved.
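What MD001 enforces can be illustrated with a small stand-in check (a sketch only, not markdownlint itself; the sample headings echo this guide's own section names):

```shell
# Flag any ATX heading that jumps more than one level past its predecessor --
# the condition markdownlint reports as MD001.
printf '%s\n' \
  '## Configs and Parameters' \
  '#### `--tp_size`' \
  '### `--ep_size`' > /tmp/doc.md

awk '/^#+ /{
  level = length($1)            # $1 is the run of leading "#"
  if (prev && level > prev + 1)
    printf "line %d: H%d follows H%d (MD001)\n", NR, level, prev
  prev = level
}' /tmp/doc.md
# prints: line 2: H4 follows H2 (MD001)
```

Promoting the option headings one level (H4 to H3) clears the warning.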


55-66: Config file is overwritten & FP4 config missing – creates runtime confusion

You still create /tmp/config.yml twice and both times hard-code dtype: fp8.
The second cat clobbers the first one and there is no standalone FP4 config, so the FP4 workflow silently falls back to FP8.

-EXTRA_LLM_API_FILE=/tmp/config.yml
+EXTRA_LLM_API_FILE=/tmp/config_fp4.yml        # dedicated file for FP4
@@
-kv_cache_config:
-  dtype: fp8
+kv_cache_config:
+  dtype: fp4
-EXTRA_LLM_API_FILE=/tmp/config.yml
+EXTRA_LLM_API_FILE=/tmp/config_fp8.yml        # dedicated file for FP8
@@
 kv_cache_config:
   dtype: fp8

Remember to pass the correct file in trtllm-serve --extra_llm_api_options.
Without this fix users will get unexpected accuracy/perf results.

Also applies to: 70-83
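The fix described above can be sketched as two separate heredocs (the file names and the fp4 dtype come from the suggested diff; the YAML keys mirror the guide, and the `trtllm-serve` invocation is elided):

```shell
# One config file per quantization, so neither heredoc clobbers the other.
cat <<'EOF' > /tmp/config_fp4.yml
kv_cache_config:
  dtype: fp4
EOF

cat <<'EOF' > /tmp/config_fp8.yml
kv_cache_config:
  dtype: fp8
EOF

# Each launch then points at its own file, e.g.
#   trtllm-serve ... --extra_llm_api_options /tmp/config_fp4.yml
grep 'dtype' /tmp/config_fp4.yml /tmp/config_fp8.yml
```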

🧹 Nitpick comments (1)
examples/models/core/deepseek_v3/deployment-guide-for-trt-llm-deepseek-r1.md (1)

249-264: Specify a language for fenced code blocks (MD040)

Add shell or text after the opening back-ticks to silence markdown-lint and enable syntax highlighting.

-```
+```shell

Apply to the GSM8K result snippets and the benchmark sample too.

Also applies to: 266-283, 337-364

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 531a7ca and d94211e.

📒 Files selected for processing (1)
  • examples/models/core/deepseek_v3/deployment-guide-for-trt-llm-deepseek-r1.md (1 hunks)

@yuxianq yuxianq changed the title doc: Add DeepSeek R1 deployment guide. doc: [TRTLLM-6859] Add DeepSeek R1 deployment guide. Aug 4, 2025
Signed-off-by: Yuxian Qiu <[email protected]>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (2)
examples/models/core/deepseek_v3/deployment-guide-for-trt-llm-deepseek-r1.md (2)

114-154: Heading levels still skip H3 under “Configs and Parameters”

Configs and Parameters is an H2, yet the option names start at H4 (####).
Promote them to H3 (###) to satisfy markdown-lint MD001 and keep a logical hierarchy.

-#### `--tp_size`
-#### `--ep_size`
-#### `--kv_cache_free_gpu_memory_fraction`
+### `--tp_size`
+### `--ep_size`
+### `--kv_cache_free_gpu_memory_fraction`

Apply the same promotion to every option heading in this section.


55-90: Config file is still overwritten and NVFP4 dtype hard-coded to fp8

The two cat << EOF > ${EXTRA_LLM_API_FILE} blocks write to the same /tmp/config.yml; the second block fully clobbers the first.
Additionally, the first block keeps kv_cache_config.dtype: fp8, so NVFP4 runs will silently fall back to FP8.

-EXTRA_LLM_API_FILE=/tmp/config.yml
+EXTRA_LLM_API_FILE=/tmp/config_fp4.yml
 ...
-kv_cache_config:
-  dtype: fp8
+kv_cache_config:
+  dtype: fp4

And for the FP8 case:

-EXTRA_LLM_API_FILE=/tmp/config.yml
+EXTRA_LLM_API_FILE=/tmp/config_fp8.yml

Update the subsequent trtllm-serve ... --extra_llm_api_options invocation(s) to reference the correct file.

🧹 Nitpick comments (2)
examples/models/core/deepseek_v3/deployment-guide-for-trt-llm-deepseek-r1.md (2)

259-272: Add language identifiers to fenced code blocks (MD040)

Markdown-lint flags these blocks because no language is specified.
Use shell, text, or another suitable identifier:

-```
+```shell
   # command

and

-```
+```text
   |Tasks|Version|…

Do the same for the benchmark sample output at Lines 345-372.

Also applies to: 345-372


200-203: Dangling back-tick breaks Markdown rendering

The closing sentence ends with a stray back-tick after the final period, which leaves the inline code span around `extra_llm_api_options` unclosed.

-…extra_llm_api_options`.`
+…`extra_llm_api_options`.
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between d94211e and 37f5d3c.

📒 Files selected for processing (1)
  • examples/models/core/deepseek_v3/deployment-guide-for-trt-llm-deepseek-r1.md (1 hunks)

Signed-off-by: Yuxian Qiu <[email protected]>

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (2)
examples/models/core/deepseek_v3/deployment-guide-for-trt-llm-deepseek-r1.md (2)

114-131: Heading level skips break Markdown-lint (MD001)

Configs and Parameters is an H2, but option names start at H4 (####).
Promote them to H3 to maintain a one-level increment and keep autogenerated TOCs correct.

-#### `--tp_size`
-#### `--ep_size`
-#### `--kv_cache_free_gpu_memory_fraction`
-#### `--backend pytorch`
-#### `--max_batch_size`
-#### `--max_num_tokens`
-#### `--max_seq_len`
-#### `--trust_remote_code`
+### `--tp_size`
+### `--ep_size`
+### `--kv_cache_free_gpu_memory_fraction`
+### `--backend pytorch`
+### `--max_batch_size`
+### `--max_num_tokens`
+### `--max_seq_len`
+### `--trust_remote_code`

56-70: Config file is overwritten & NVFP4 still uses fp8 dtype

The same ${EXTRA_LLM_API_FILE} path (/tmp/config.yml) is re-used twice.
The second cat <<EOF > … completely clobbers the first YAML, so the FP4 settings are lost.
In addition, kv_cache_config.dtype is hard-coded to fp8, meaning the “FP4 guide” still runs FP8 under the hood.

-EXTRA_LLM_API_FILE=/tmp/config.yml
-
-kv_cache_config:
-  dtype: fp8          # <- wrong for FP4
+EXTRA_LLM_API_FILE=/tmp/config_fp4.yml
+
+kv_cache_config:
+  dtype: fp4

-EXTRA_LLM_API_FILE=/tmp/config.yml   # second block now becomes FP8-specific
+EXTRA_LLM_API_FILE=/tmp/config_fp8.yml

Then launch the server with the matching file:

--extra_llm_api_options ${EXTRA_LLM_API_FILE}   # adjust per model

Failing to separate the files silently produces wrong results and makes debugging painful.
Please split the configs or use the -a >> append operator if you truly intend a single file.

Also applies to: 75-91
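For completeness, the single-file alternative mentioned above relies on `>>` append redirection (standard shell; the section names mirror the guide's YAML and are otherwise illustrative):

```shell
# The first heredoc creates /tmp/config.yml; the second APPENDS with >>
# instead of truncating with >, so both YAML sections survive.
cat <<'EOF' > /tmp/config.yml
cuda_graph_config:
  enable_padding: true
EOF

cat <<'EOF' >> /tmp/config.yml
kv_cache_config:
  dtype: fp8
EOF

cat /tmp/config.yml   # both sections are present
```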

🧹 Nitpick comments (2)
examples/models/core/deepseek_v3/deployment-guide-for-trt-llm-deepseek-r1.md (2)

200-203: Stray back-tick breaks the sentence

The line ends with "...options which can be used in the `extra_llm_api_options`" followed by a stray back-tick.
Remove the trailing back-tick (or move it before the period).

-… extra_llm_api_options`.`
+… extra_llm_api_options.

259-273: Add language identifiers to fenced code blocks (MD040)

Markdown-lint flags code blocks without a language.
Use text for plain console output to silence the warning.

-```
+```text
|Tasks|Version| …

Apply the same change to the benchmark sample block below.

Also applies to: 345-351

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 37f5d3c and 2ee475a.

📒 Files selected for processing (1)
  • examples/models/core/deepseek_v3/deployment-guide-for-trt-llm-deepseek-r1.md (1 hunks)

@yuxianq yuxianq changed the title doc: [TRTLLM-6859] Add DeepSeek R1 deployment guide. [TRTLLM-6859][doc] Add DeepSeek R1 deployment guide. Aug 4, 2025
@yuxianq
Collaborator Author

yuxianq commented Aug 4, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #13928 [ run ] triggered by Bot

Signed-off-by: Yuxian Qiu <[email protected]>
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
examples/models/core/deepseek_v3/deployment-guide-for-trt-llm-deepseek-r1.md (1)

55-90: Config file is overwritten and FP4 dtype is wrong – create two distinct YAMLs

Both `cat` heredoc blocks write to /tmp/config.yml, so the second block clobbers the first and you never get an FP4-specific config. In addition, the “FP4” block still hard-codes dtype: fp8. Readers who copy-paste will silently end up running FP8 in all cases.

-EXTRA_LLM_API_FILE=/tmp/config.yml
+EXTRA_LLM_API_FILE=/tmp/config_fp4.yml

-kv_cache_config:
-  dtype: fp8
+kv_cache_config:
+  dtype: fp4
…
 # For FP8 model
-EXTRA_LLM_API_FILE=/tmp/config.yml
+EXTRA_LLM_API_FILE=/tmp/config_fp8.yml

and update the subsequent trtllm-serve … --extra_llm_api_options examples to point at the correct file (config_fp4.yml or config_fp8.yml).
This avoids the overwrite, sets the right dtype for FP4, and makes the guide copy-paste-safe.
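The suggested split can be sketched end-to-end. This is a hypothetical sketch: the file names and the `fp4` dtype value are taken from the review diff, not verified against TensorRT-LLM's accepted config keys.

```shell
# Hypothetical layout: write each quantization mode to its own file so
# neither heredoc clobbers the other. Values mirror the review diff.
CONFIG_DIR=$(mktemp -d)

cat > "${CONFIG_DIR}/config_fp4.yml" <<'EOF'
kv_cache_config:
  dtype: fp4
EOF

cat > "${CONFIG_DIR}/config_fp8.yml" <<'EOF'
kv_cache_config:
  dtype: fp8
EOF

# Each serve command then points at its own file, e.g.:
#   trtllm-serve ... --extra_llm_api_options "${CONFIG_DIR}/config_fp4.yml"
#   trtllm-serve ... --extra_llm_api_options "${CONFIG_DIR}/config_fp8.yml"
grep -H dtype "${CONFIG_DIR}"/config_*.yml
```

Because the two files never share a path, re-running either block is idempotent and copy-paste order no longer matters.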

🧹 Nitpick comments (2)
examples/models/core/deepseek_v3/deployment-guide-for-trt-llm-deepseek-r1.md (2)

259-272: Specify a language for the fenced code block (MD040)

Markdown-lint flags the evaluation result block because it lacks a language tag.

-```
+```text

345-372: Add language identifier for the sample benchmark output

Same MD040 issue here—declare the block as plain text so linters pass and syntax highlighting is disabled.

-```
+```text
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2ee475a and e6696e1.

📒 Files selected for processing (1)
  • examples/models/core/deepseek_v3/deployment-guide-for-trt-llm-deepseek-r1.md (1 hunks)
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-04T02:12:17.582Z
Learning: Applies to **/*.py : The code developed for TensorRT-LLM should conform to Python 3.8+.
📚 Learning: in tensorrt-llm, examples directory can have different dependency versions than the root requirement...
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.

Applied to files:

  • examples/models/core/deepseek_v3/deployment-guide-for-trt-llm-deepseek-r1.md
📚 Learning: in tensorrt-llm testing, it's common to have both cli flow tests (test_cli_flow.py) and pytorch api ...
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • examples/models/core/deepseek_v3/deployment-guide-for-trt-llm-deepseek-r1.md

@tensorrt-cicd
Copy link
Collaborator

PR_Github #13928 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #10490 completed with status: 'FAILURE'

Signed-off-by: Yuxian Qiu <[email protected]>
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (2)
examples/models/core/deepseek_v3/deployment-guide-for-trt-llm-deepseek-r1.md (2)

55-91: Config file is still overwritten; fp4 config remains impossible
The two cat << EOF > ${EXTRA_LLM_API_FILE} blocks target the same /tmp/config.yml, so the first YAML is discarded when the second block executes. The fp4 use-case therefore never sees its intended settings and the KV-cache dtype stays fp8.

This was flagged in a previous review and is still unresolved.

-EXTRA_LLM_API_FILE=/tmp/config.yml
+EXTRA_LLM_API_FILE=/tmp/config_fp4.yml        # first block

 # … fp4-specific YAML …

 # -------------------------------------------------
-EXTRA_LLM_API_FILE=/tmp/config.yml
+EXTRA_LLM_API_FILE=/tmp/config_fp8.yml        # second block

 # … fp8-specific YAML …

Ensure the subsequent trtllm-serve command references the correct file for each quantization mode.


114-147: Heading level jumps still violate MD001
Configs and Parameters is an H2 (##) but the option headings start at H4 (####). They must be promoted to H3 (###) to maintain a one-level increment.

This exact issue was raised earlier but the markdown remains unchanged.

-#### `--tp_size`
+### `--tp_size`

Apply the same change to every option heading in this section.

🧹 Nitpick comments (2)
examples/models/core/deepseek_v3/deployment-guide-for-trt-llm-deepseek-r1.md (2)

259-264: Specify language for fenced code block (MD040)
Markdown-lint flags code blocks without a language hint. Add shell here.

-```
+```shell
 MODEL_PATH=deepseek-ai/DeepSeek-R1-0528

345-372: Second unlabeled code block needs a language tag
The sample benchmark output block also violates MD040. Use text or console.

-```
+```text
 ============ Serving Benchmark Result ============
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e6696e1 and 9ecd493.

📒 Files selected for processing (1)
  • examples/models/core/deepseek_v3/deployment-guide-for-trt-llm-deepseek-r1.md (1 hunks)
Signed-off-by: Yuxian Qiu <[email protected]>
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (2)
examples/models/core/deepseek_v3/deployment-guide-for-trt-llm-deepseek-r1.md (2)

55-66: Config file is overwritten & NVFP4 still hard-coded to FP8

Both cat <<EOF > ${EXTRA_LLM_API_FILE} blocks target the same /tmp/config.yml.
The second block replaces the first, so the NVFP4-intended settings are lost and kv_cache_config.dtype remains fp8, which will silently disable FP4 execution.

-EXTRA_LLM_API_FILE=/tmp/config.yml
+EXTRA_LLM_API_FILE=/tmp/config_fp4.yml   # first block

 ...
-kv_cache_config:
-  dtype: fp8
+kv_cache_config:
+  dtype: fp4
 ...
-EXTRA_LLM_API_FILE=/tmp/config.yml
+EXTRA_LLM_API_FILE=/tmp/config_fp8.yml   # second block

Then launch the server with the appropriate file:

--extra_llm_api_options ${EXTRA_LLM_API_FILE}   # point to *_fp4.yml or *_fp8.yml

Without this separation, FP4 users will unknowingly run an FP8 config and MoE users lose the extra settings.

Also applies to: 75-91


118-150: Heading levels still skip H3, violating MD001

Configs and Parameters is an H2. Option headings start at ####, jumping two levels.
Promote them uniformly to H3 (###) to restore semantic hierarchy and avoid markdown-lint failures.

-#### `--tp_size`
+### `--tp_size`

Apply to every option heading in this section.

🧹 Nitpick comments (1)
examples/models/core/deepseek_v3/deployment-guide-for-trt-llm-deepseek-r1.md (1)

260-264: Specify the language on fenced code blocks (MD040)

Add shell (or none) after the triple back-ticks so tools render and lint correctly:

-```
+```shell

Do the same for the benchmark result block (use `text`).

Also applies to: 346-373
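For context, the MD040 check being flagged can be approximated in a few lines. This is a simplified sketch of the rule's intent, not markdownlint's actual implementation (it ignores indented and `~~~` fences):

```python
def fences_missing_language(markdown: str) -> list[int]:
    """Return line numbers of opening ``` fences that lack a language tag."""
    missing = []
    in_fence = False
    for lineno, line in enumerate(markdown.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith("```"):
            if not in_fence:
                # An opening fence that is exactly ``` has no language hint.
                if stripped == "```":
                    missing.append(lineno)
                in_fence = True
            else:
                in_fence = False
    return missing

sample = "intro\n```\ncode\n```\n```text\nok\n```\n"
print(fences_missing_language(sample))  # -> [2]
```

Running a check like this locally before pushing avoids the lint round-trip on every review iteration.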

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9ecd493 and b3f7dfe.

📒 Files selected for processing (1)
  • examples/models/core/deepseek_v3/deployment-guide-for-trt-llm-deepseek-r1.md (1 hunks)

Signed-off-by: Yuxian Qiu <[email protected]>
@yuxianq yuxianq requested a review from a team as a code owner August 6, 2025 08:22
@yuxianq yuxianq requested review from QiJune and kevinch-nv August 6, 2025 08:22
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (4)
examples/models/core/deepseek_v3/quick-start-recipe-for-deepseek-r1-on-trt-llm.md (4)

214-215: Clarify container hostname vs. host access

Inside the container the server binds to 0.0.0.0:8000, but from the host, localhost:8000 works only if the docker run command used -p 8000:8000.
If users change the port mapping, the curl example will fail. Add a brief reminder such as:

“Replace 8000 with the host-side port you mapped in docker run.”


260-266: Specify language for fenced code block – fixes MD040

Markdown-lint flags this block because it lacks a language hint.

-```
+```shell
 MODEL_PATH=deepseek-ai/DeepSeek-R1-0528
 ...

Do the same for other shell blocks to keep lint clean.


346-373: Missing language spec on sample benchmark output

Add text (or none) to silence MD040 and keep syntax highlighters from mis-detecting numbers as code.

-```
+```text
 ============ Serving Benchmark Result ============
 ...
==================================================

137-150: Stylistic: consecutive sentences start with “The”

LanguageTool flags several clusters (lines 137-150, etc.). While not critical, varying sentence openings improves readability; for example, replace some with “This option…” or “It sets…”.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b3f7dfe and efa0cf4.

📒 Files selected for processing (1)
  • examples/models/core/deepseek_v3/quick-start-recipe-for-deepseek-r1-on-trt-llm.md (1 hunks)

260-260: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


346-346: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check

Comment on lines +58 to +70
cat << EOF > ${EXTRA_LLM_API_FILE}
enable_attention_dp: true
cuda_graph_config:
enable_padding: true
max_batch_size: 128
kv_cache_config:
dtype: fp8
stream_interval: 10
speculative_config:
decoding_type: MTP
num_nextn_predict_layers: 1
EOF
```
⚠️ Potential issue

Config file is silently overwritten – FP8 example wipes the generic settings

Both cat << EOF > ${EXTRA_LLM_API_FILE} blocks target the same /tmp/config.yml.
Running the second block (FP8 + moe_config) discards the first block’s contents, so users following the tutorial verbatim will only keep the FP8-specific YAML and lose the common settings.

-EXTRA_LLM_API_FILE=/tmp/config.yml
-cat << EOF > ${EXTRA_LLM_API_FILE}
+# General config
+EXTRA_LLM_API_FILE=/tmp/config_fp4.yml
+cat << EOF > "${EXTRA_LLM_API_FILE}"
   ...
 EOF
 ...
-EXTRA_LLM_API_FILE=/tmp/config.yml
-cat << EOF > ${EXTRA_LLM_API_FILE}
+# FP8-specific config
+EXTRA_LLM_API_FILE=/tmp/config_fp8.yml
+cat << EOF > "${EXTRA_LLM_API_FILE}"
   ...
 EOF

Then pass the appropriate file via --extra_llm_api_options.
At minimum, call out in text that the second block replaces the first.

Also applies to: 75-92

🤖 Prompt for AI Agents
In
examples/models/core/deepseek_v3/quick-start-recipe-for-deepseek-r1-on-trt-llm.md
around lines 58 to 70, the second cat command overwrites the same config file as
the first, causing the initial generic settings to be lost. To fix this, either
merge the contents of both config blocks into a single file before writing or
write to separate files and clearly document that the second file replaces the
first when passed via --extra_llm_api_options. Also add a note in the tutorial
text explaining this replacement behavior to avoid confusion. Repeat the same
fix for lines 75 to 92.
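The separate-files variant of the suggested fix can be sketched in shell. The file names follow the reviewer's example (`config_fp4.yml` / `config_fp8.yml`), and the YAML bodies below are abbreviated placeholders, not the full settings from the recipe:

```shell
# Sketch of the suggested fix: write each config to its own file so the
# second heredoc no longer clobbers the first. YAML bodies abbreviated.
CONFIG_FP4=/tmp/config_fp4.yml
CONFIG_FP8=/tmp/config_fp8.yml

cat << 'EOF' > "${CONFIG_FP4}"
enable_attention_dp: true
cuda_graph_config:
  enable_padding: true
  max_batch_size: 128
EOF

cat << 'EOF' > "${CONFIG_FP8}"
enable_attention_dp: true
kv_cache_config:
  dtype: fp8
EOF

# Later, point trtllm-serve at whichever file matches the checkpoint:
#   --extra_llm_api_options "${CONFIG_FP8}"
echo "wrote ${CONFIG_FP4} and ${CONFIG_FP8}"
```

With both files on disk, neither tutorial step can silently wipe the other's settings.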

Comment on lines +99 to +111
trtllm-serve deepseek-ai/DeepSeek-R1-0528 \
--host 0.0.0.0 \
--port 8000 \
--backend pytorch \
--max_batch_size 1024 \
--max_num_tokens 3200 \
--max_seq_len 2048 \
--kv_cache_free_gpu_memory_fraction 0.8 \
--tp_size 8 \
--ep_size 8 \
--trust_remote_code \
--extra_llm_api_options ${EXTRA_LLM_API_FILE}
```

🛠️ Refactor suggestion

--tp_size 8 and --ep_size 8 imply 64 GPUs – highlight or lower defaults

Setting both flags to 8 requires 8 × 8 = 64 GPUs for one model instance. Most users running a “quick-start” will not have that scale, and the server will abort at runtime.

Recommend either:

---tp_size 8 \
---ep_size 8 \
+# Adjust parallelism for your GPU count (e.g. 2 GPUs → --tp_size 2 --ep_size 1)
+--tp_size 1 \
+--ep_size 1 \

or add an explicit note explaining the requirement.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
trtllm-serve deepseek-ai/DeepSeek-R1-0528 \
--host 0.0.0.0 \
--port 8000 \
--backend pytorch \
--max_batch_size 1024 \
--max_num_tokens 3200 \
--max_seq_len 2048 \
--kv_cache_free_gpu_memory_fraction 0.8 \
--tp_size 8 \
--ep_size 8 \
--trust_remote_code \
--extra_llm_api_options ${EXTRA_LLM_API_FILE}
```
# Adjust parallelism for your GPU count (e.g. 2 GPUs → --tp_size 2 --ep_size 1)
trtllm-serve deepseek-ai/DeepSeek-R1-0528 \
--host 0.0.0.0 \
--port 8000 \
--backend pytorch \
--max_batch_size 1024 \
--max_num_tokens 3200 \
--max_seq_len 2048 \
--kv_cache_free_gpu_memory_fraction 0.8 \
--tp_size 1 \
--ep_size 1 \
--trust_remote_code \
--extra_llm_api_options ${EXTRA_LLM_API_FILE}
🤖 Prompt for AI Agents
In
examples/models/core/deepseek_v3/quick-start-recipe-for-deepseek-r1-on-trt-llm.md
around lines 99 to 111, the flags --tp_size 8 and --ep_size 8 imply a total of
64 GPUs, which is likely beyond the capacity of most users running a
quick-start. To fix this, either lower the default values of --tp_size and
--ep_size to reflect a smaller GPU count or add a clear note explicitly stating
that these settings require 64 GPUs and the server will abort if insufficient
GPUs are available.
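One way to make the defaults resilient, sketched here under the assumption that `--tp_size` should not exceed the number of visible GPUs (the variable names are illustrative, not part of the recipe):

```shell
# Pick a tensor-parallel size no larger than the visible GPU count.
# Falls back to 1 when nvidia-smi is unavailable (e.g. CPU-only hosts).
NGPU="$(nvidia-smi -L 2>/dev/null | wc -l)"
[ "${NGPU}" -gt 0 ] || NGPU=1
TP_SIZE=$(( NGPU < 8 ? NGPU : 8 ))
echo "visible GPUs: ${NGPU}; using --tp_size ${TP_SIZE}"
```

The computed value can then be passed as `--tp_size ${TP_SIZE}` (together with an `--ep_size` that is valid for it) instead of a hard-coded 8.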

@yuxianq
Collaborator Author

yuxianq commented Aug 6, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #14276 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #14276 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #10781 completed with status: 'FAILURE'

@litaotju
Collaborator

litaotju commented Aug 6, 2025

Bypass and merging. The doc change won't affect any CI. The failures are known in existing CI.

@litaotju litaotju merged commit 3a71ddf into NVIDIA:main Aug 6, 2025
3 of 4 checks passed
jain-ria pushed a commit to jain-ria/TensorRT-LLM that referenced this pull request Aug 7, 2025