Auto-enable ngram with concurrency <= 32. #6232


Open · wants to merge 4 commits into base: main

Conversation

@SimengLiu-nv (Collaborator) commented Jul 21, 2025

Summary by CodeRabbit

  • New Features

    • Added an automatic speculative decoding mode that selects optimal decoding parameters based on concurrency and batch size.
    • Introduced heuristic adjustments for speculative decoding configurations to improve performance for smaller batch sizes.
    • Added support for the new "AUTO" option in speculative decoding arguments and configuration settings.
    • Added a new test to validate the automatic speculative decoding mode with specific model configurations.
  • Bug Fixes

    • Updated default decoding parameter values to enhance reliability and consistency across different scenarios.

Description

Test Coverage

test_e2e.py::test_ptp_quickstart_advanced_auto

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.


coderabbitai bot commented Jul 21, 2025

Walkthrough

This update adds a new automatic speculative decoding configuration, AutoDecodingConfig, with concurrency-based heuristics for parameters. It extends related enums and validation logic to support the "AUTO" mode, updates CLI argument choices accordingly, and adjusts internal model-building logic to apply heuristics based on batch size. Additionally, it refactors an executor method to simplify its interface by using internal state directly.
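
As a quick orientation, here is a minimal usage sketch (not taken from this PR's diff) of how the new AUTO mode could be enabled through the LLM API; the model path and batch size are placeholders, and the exact constructor arguments are assumed from the summary above.

from tensorrt_llm import LLM
from tensorrt_llm.llmapi import AutoDecodingConfig

# Placeholder model path; max_batch_size <= 32 is the regime where this PR's
# heuristic replaces AUTO with an NGram speculative decoding configuration.
llm = LLM(
    model="/path/to/model",
    speculative_config=AutoDecodingConfig(),
    max_batch_size=4,
)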

Changes

  • examples/llm-api/quickstart_advanced.py: Added "AUTO" to the --spec_decode_algo choices; imported AutoDecodingConfig; added handling for the "AUTO" mode in setup_llm.
  • tensorrt_llm/llmapi/llm_args.py: Added the AutoDecodingConfig class with concurrency heuristics; updated NGramDecodingConfig fields and defaults; extended dispatch and validation to support "AUTO".
  • tensorrt_llm/llmapi/llm.py: Added heuristic logic in _TorchLLM._build_model to adjust the speculative decoding config for "AUTO" mode with batch size ≤ 32, disabling the overlap scheduler and setting n-gram parameters.
  • tensorrt_llm/_torch/pyexecutor/py_executor.py: Refactored _prepare_draft_requests to remove the requests parameter and operate directly on self.active_requests.
  • tensorrt_llm/_torch/speculative/interface.py, tensorrt_llm/models/modeling_utils.py: Added an AUTO member to the SpeculativeDecodingMode enum; updated the static method to handle the "auto" string argument.
  • tensorrt_llm/llmapi/__init__.py: Added AutoDecodingConfig to imports and __all__ exports.
  • tests/integration/defs/test_e2e.py: Added a new integration test for the "AUTO" speculative decoding algorithm using quickstart_advanced.py.
  • tests/unittest/api_stability/references_committed/llm.yaml: Updated type annotations to include AutoDecodingConfig in the speculative config parameter union.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant CLI
    participant LLMArgs
    participant LLM
    participant PyExecutor

    User->>CLI: Run with/without --spec_decode_algo "AUTO"
    CLI->>LLMArgs: Parse arguments including AUTO option
    LLMArgs->>LLM: Pass speculative decoding config (may be AutoDecodingConfig)
    LLM->>LLM: _build_model applies AUTO heuristic if applicable
    LLM->>PyExecutor: update_executor_config with resolved speculative config

Estimated code review effort

3 (~45 minutes)


Suggested reviewers

  • HuiGao-NV
  • litaotju

Poem

🐇 With twitchy nose and eager paws,
AUTO decoding breaks old laws.
Heuristics guide the draft and gram,
Smarter choices, yes we can!
A hop, a skip, the code’s anew,
Faster runs for me and you! 🐰✨



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
tensorrt_llm/llmapi/llm.py (1)

967-1001: The NGram auto-enabling logic is sound, but it needs a few style improvements.

The change correctly implements the heuristic for auto-enabling NGram speculative decoding based on batch size thresholds. The logic properly:

  1. Checks for NGram mode and batch size constraints
  2. Disables overlap scheduler when conditions are met
  3. Applies appropriate default values based on batch size ranges
  4. Logs the changes for debugging

However, there are several style issues that should be addressed:

Apply these fixes to improve code style and readability:

-                logger.info(
-                    "Disable overlap scheduler to enable NGram speculative decoding."
-                )
+                logger.info("Disable overlap scheduler to enable NGram speculative decoding.")
                 # From benchmark results, we found that NGram speculative decoding provides better performance than overlap scheduler with low concurrency <= 32.
                 # Therefore, we disable overlap scheduler to enable NGram speculative decoding.
                 self.args.disable_overlap_scheduler = True

-            if spec_config.max_draft_len != 0 and spec_config.max_matching_ngram_size != 0:
+            if spec_config.max_draft_len != 0 and spec_config.max_matching_ngram_size != 0:
                 pass
             else:
                 if max_batch_size <= 4:
-                    spec_config.max_draft_len = 5 if spec_config.max_draft_len == 0 else spec_config.max_draft_len
-                    spec_config.max_matching_ngram_size = 3 if spec_config.max_matching_ngram_size == 0 else spec_config.max_matching_ngram_size
+                    spec_config.max_draft_len = (
+                        5 if spec_config.max_draft_len == 0 else spec_config.max_draft_len
+                    )
+                    spec_config.max_matching_ngram_size = (
+                        3 if spec_config.max_matching_ngram_size == 0 else spec_config.max_matching_ngram_size
+                    )
                 elif max_batch_size <= 32:
-                    spec_config.max_draft_len = 3 if spec_config.max_draft_len == 0 else spec_config.max_draft_len
-                    spec_config.max_matching_ngram_size = 5 if spec_config.max_matching_ngram_size == 0 else spec_config.max_matching_ngram_size
+                    spec_config.max_draft_len = (
+                        3 if spec_config.max_draft_len == 0 else spec_config.max_draft_len
+                    )
+                    spec_config.max_matching_ngram_size = (
+                        5 if spec_config.max_matching_ngram_size == 0 else spec_config.max_matching_ngram_size
+                    )
                 logger.info(
-                    f"Apply heuristic to incomplete NGramDecodingConfig: max_draft_len={spec_config.max_draft_len}, max_matching_ngram_size={spec_config.max_matching_ngram_size}"
+                    f"Apply heuristic to incomplete NGramDecodingConfig: "
+                    f"max_draft_len={spec_config.max_draft_len}, "
+                    f"max_matching_ngram_size={spec_config.max_matching_ngram_size}"
                 )
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9645814 and bcb72e0.

📒 Files selected for processing (4)
  • examples/llm-api/quickstart_advanced.py (2 hunks)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py (2 hunks)
  • tensorrt_llm/llmapi/llm.py (1 hunks)
  • tensorrt_llm/llmapi/llm_args.py (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
tensorrt_llm/llmapi/llm.py (1)

Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

🪛 Ruff (0.12.2)
tensorrt_llm/llmapi/llm.py

977-977: Line too long (161 > 120)

(E501)


986-986: Line too long (144 > 120)

(E501)


989-989: Line too long (144 > 120)

(E501)


991-991: Line too long (178 > 120)

(E501)

🔇 Additional comments (5)
examples/llm-api/quickstart_advanced.py (2)

111-113: LGTM! Default value changes align with heuristic approach.

The change from hardcoded defaults (1 and 5) to 0 allows the system to use heuristics based on concurrency levels, which aligns with the PR objective of auto-enabling ngram with concurrency <= 32.


155-158: Conditional logic correctly handles non-NGRAM algorithms.

The implementation properly sets spec_decode_max_draft_len to 1 for non-NGRAM algorithms when the user hasn't explicitly set a value (i.e., when it's 0). This ensures backward compatibility while allowing NGRAM to use the new heuristic logic.
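
A hypothetical restatement of that fallback (argument names are inferred from the CLI flags mentioned above; this is not the PR's actual code):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--spec_decode_algo", default=None)
parser.add_argument("--spec_decode_max_draft_len", type=int, default=0)
args = parser.parse_args([])  # empty list: use the defaults for this sketch

# 0 acts as a "not set by the user" sentinel; restore the old default of 1 for
# algorithms other than NGRAM so their behavior is unchanged by this PR.
if args.spec_decode_algo != "NGRAM" and args.spec_decode_max_draft_len == 0:
    args.spec_decode_max_draft_len = 1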

tensorrt_llm/llmapi/llm_args.py (1)

388-393: Excellent documentation and parameter structure for heuristic implementation.

The comments clearly document the heuristic behavior:

  • Concurrency ≤ 4: max_draft_len = 5, max_matching_ngram_size = 3
  • Concurrency ≤ 32: max_draft_len = 3, max_matching_ngram_size = 5

Setting both parameters to default value 0 properly indicates "use heuristic" and aligns with the changes in the example file.
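
Illustrative only (not the class's full definition): the sentinel-default pattern described here, where 0 means "let the heuristic pick a value" rather than being a usable setting.

from dataclasses import dataclass

@dataclass
class NGramSettingsSketch:
    # 0 -> filled in later by the concurrency heuristic (5/3 or 3/5); any other
    # value is treated as an explicit user choice and left untouched.
    max_draft_len: int = 0
    max_matching_ngram_size: int = 0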

tensorrt_llm/_torch/pyexecutor/py_executor.py (2)

925-925: LGTM! Method call updated correctly.

The method call has been properly updated to match the simplified signature without the requests parameter.


1012-1016: LGTM! Good refactoring to simplify the method interface.

The changes properly simplify the _prepare_draft_requests method by:

  • Removing the unnecessary requests parameter from the method signature
  • Directly iterating over self.active_requests instead of relying on an external parameter
  • Ensuring consistent access to the current active requests stored in the instance

This refactoring improves code clarity and reduces potential inconsistencies between the passed parameter and the actual active requests state.
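
A rough, self-contained sketch of the interface change being described; the class and field contents are invented for illustration, and the real PyExecutor does considerably more per request.

class ExecutorSketch:
    """Stand-in for PyExecutor, holding only the state relevant to this sketch."""

    def __init__(self, active_requests):
        self.active_requests = active_requests

    # Before: def _prepare_draft_requests(self, requests) iterated a passed-in list.
    # After: no parameter; the method reads the executor's own active_requests,
    # so callers cannot hand in a stale or different list.
    def _prepare_draft_requests(self):
        for req in self.active_requests:
            print(f"preparing draft tokens for request {req}")  # placeholder work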


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🔭 Outside diff range comments (1)
tensorrt_llm/llmapi/llm_args.py (1)

256-271: Missing "AUTO" decoding type in config class mapping.

The DecodingBaseConfig.from_dict method needs to include the new AutoDecodingConfig to properly handle deserialization of "AUTO" decoding type configurations.

Add the missing mapping:

         config_classes = {
             "MTP": MTPDecodingConfig,
             "Medusa": MedusaDecodingConfig,
             "Eagle": EagleDecodingConfig,
             "Lookahead": LookaheadDecodingConfig,
             "NGram": NGramDecodingConfig,
             "DraftTarget": DraftTargetDecodingConfig,
             "UserProvided": UserProvidedDecodingConfig,
+            "AUTO": AutoDecodingConfig,
         }
🧹 Nitpick comments (3)
tensorrt_llm/llmapi/llm_args.py (1)

461-482: Well-designed auto-configuration class with minor docstring formatting issue.

The implementation follows established patterns and clearly documents the heuristic logic. The backend restriction to PyTorch is appropriate for ngram decoding.

Fix the docstring formatting to comply with PEP 257:

 class AutoDecodingConfig(DecodingBaseConfig):
     """
     Configuration for auto speculative decoding.
+
     This config is used to automatically select the best speculative decoding algorithm.
tensorrt_llm/llmapi/llm.py (2)

978-978: Fix line length violation.

The line exceeds the 120 character limit.

Apply this diff to fix the line length:

-                    "Disable overlap scheduler to enable Auto speculative decoding with Ngram."
+                    "Disable overlap scheduler to enable Auto speculative "
+                    "decoding with Ngram."

991-993: Fix line length violation.

The log message line exceeds the 120 character limit.

Apply this diff to fix the line length:

-                f"Apply heuristic to incomplete NGramDecodingConfig: max_draft_len={spec_config.max_draft_len}, max_matching_ngram_size={spec_config.max_matching_ngram_size}"
+                f"Apply heuristic to incomplete NGramDecodingConfig: "
+                f"max_draft_len={spec_config.max_draft_len}, "
+                f"max_matching_ngram_size={spec_config.max_matching_ngram_size}"
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bcb72e0 and d69e813.

📒 Files selected for processing (3)
  • examples/llm-api/quickstart_advanced.py (3 hunks)
  • tensorrt_llm/llmapi/llm.py (2 hunks)
  • tensorrt_llm/llmapi/llm_args.py (2 hunks)
🧠 Learnings (1)
tensorrt_llm/llmapi/llm.py (1)

Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

🪛 Ruff (0.12.2)
tensorrt_llm/llmapi/llm.py

978-978: Line too long (161 > 120)

(E501)


991-991: Line too long (174 > 120)

(E501)

tensorrt_llm/llmapi/llm_args.py

477-479: 1 blank line required between summary line and description

(D205)

🚧 Files skipped from review as they are similar to previous changes (1)
  • examples/llm-api/quickstart_advanced.py

🔇 Additional comments (2)
tensorrt_llm/llmapi/llm_args.py (1)

388-393: LGTM! Good documentation of heuristic behavior.

The changes to use default values of 0 with clear comments explaining the heuristic logic align well with the PR objective of auto-enabling ngram for low concurrency scenarios.

tensorrt_llm/llmapi/llm.py (1)

967-994: LGTM! Well-implemented heuristic for auto-enabling ngram speculative decoding.

The conditional logic correctly implements the PR objective to auto-enable ngram with concurrency <= 32. The batch size-based parameter selection and overlap scheduler disabling are well-reasoned based on benchmark results mentioned in comments.

@SimengLiu-nv SimengLiu-nv marked this pull request as ready for review July 22, 2025 06:38
@SimengLiu-nv SimengLiu-nv requested review from a team as code owners July 22, 2025 06:38
@SimengLiu-nv SimengLiu-nv requested a review from nv-guomingz July 22, 2025 06:38
@SimengLiu-nv
Collaborator Author

/bot run --disable-fail-fast


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
tensorrt_llm/llmapi/llm_args.py (1)

462-483: Fix docstring formatting issue.

The AutoDecodingConfig class implementation is well-designed and follows the established patterns. However, there's a docstring formatting issue that needs to be addressed.

Apply this fix for the docstring formatting:

 class AutoDecodingConfig(DecodingBaseConfig):
     """
     Configuration for auto speculative decoding.
+
     This config is used to automatically select the best speculative decoding algorithm.
 
     According to benchmark results, the best algorithm in general is NGRAM with low concurrency <= 32.
     Default heuristic:
         With concurrency <= 4, max_draft_len = 5, max_matching_ngram_size = 3
         With concurrency <= 32, max_draft_len = 3, max_matching_ngram_size = 5
         With concurrency > 32, speculative decoding is disabled.
     """
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d69e813 and 21cfc3d.

📒 Files selected for processing (7)
  • examples/llm-api/quickstart_advanced.py (3 hunks)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py (2 hunks)
  • tensorrt_llm/_torch/speculative/interface.py (1 hunks)
  • tensorrt_llm/llmapi/__init__.py (1 hunks)
  • tensorrt_llm/llmapi/llm.py (2 hunks)
  • tensorrt_llm/llmapi/llm_args.py (5 hunks)
  • tensorrt_llm/models/modeling_utils.py (2 hunks)
🪛 Ruff (0.12.2)
tensorrt_llm/llmapi/llm.py

970-970: Line too long (161 > 120)

(E501)


983-983: Line too long (174 > 120)

(E501)

tensorrt_llm/llmapi/llm_args.py

478-480: 1 blank line required between summary line and description

(D205)

✅ Files skipped from review due to trivial changes (3)
  • tensorrt_llm/llmapi/__init__.py
  • tensorrt_llm/_torch/speculative/interface.py
  • tensorrt_llm/models/modeling_utils.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • examples/llm-api/quickstart_advanced.py

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (7)
tensorrt_llm/llmapi/llm.py (3)

34-35: Import addition looks good.

The NGramDecodingConfig import is properly added and required for the new auto-configuration logic.


993-993: Parameter passing looks correct.

The speculative_config=spec_config parameter correctly uses the potentially modified config from the heuristic logic above.


960-961: Speculative config decoding_type attribute is valid
Verified that every speculative config class in tensorrt_llm/llmapi/llm_args.py defines decoding_type: ClassVar[str] = "AUTO" (and other types), so accessing spec_config.decoding_type is safe and requires no change.

tensorrt_llm/llmapi/llm_args.py (4)

389-394: Good implementation of heuristic-based defaults.

The changes to NGramDecodingConfig properly implement the concurrency-based heuristic mentioned in the PR objectives. The comments clearly explain the default behavior and the fields now use 0 as defaults to indicate when heuristics should be applied.


265-265: Proper integration of AutoDecodingConfig in dispatch dictionary.

The addition of "AUTO": AutoDecodingConfig to the dispatch dictionary correctly enables the factory method to instantiate the new configuration class.


788-788: Proper type alias extension.

The addition of AutoDecodingConfig to the SpeculativeConfig type alias correctly extends the union type to include the new configuration class.


1534-1538: Correct validation logic for AutoDecodingConfig.

The validation logic properly handles AutoDecodingConfig instances by:

  • Checking backend compatibility (pytorch or _autodeploy)
  • Setting the appropriate speculative decoding mode to AUTO
  • Assigning the max_draft_len from the configuration

The implementation follows the established pattern for other decoding config types.
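
A self-contained approximation (not the library's actual code) of the branch described above: given an AutoDecodingConfig, confirm the backend and hand back the AUTO mode together with the configured draft length.

def resolve_auto_spec_config(backend: str, config_max_draft_len: int):
    # AUTO is only meaningful on the PyTorch-based backends, per the review comment.
    assert backend in ("pytorch", "_autodeploy"), (
        "AutoDecodingConfig requires the pytorch or _autodeploy backend")
    return "AUTO", config_max_draft_len  # (speculative decoding mode, max_draft_len)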

Comment on lines +966 to +986
        if spec_config is not None and spec_config.decoding_type == "AUTO" and max_batch_size <= 32:
            if not self.args.disable_overlap_scheduler:
                logger.info(
                    "Disable overlap scheduler to enable Auto speculative decoding with Ngram."
                )
                # From benchmark results, we found that NGram speculative decoding provides better performance than overlap scheduler with low concurrency <= 32.
                # Therefore, we disable overlap scheduler to enable NGram speculative decoding.
                self.args.disable_overlap_scheduler = True

            spec_config = NGramDecodingConfig(
                max_draft_len=5 if max_batch_size <= 4 else 3,
                max_matching_ngram_size=3 if max_batch_size <= 4 else 5,
                is_keep_all=True,
                is_use_oldest=True,
                is_public_pool=True,
            )

            logger.info(
                f"Apply heuristic to incomplete NGramDecodingConfig: max_draft_len={spec_config.max_draft_len}, max_matching_ngram_size={spec_config.max_matching_ngram_size}"
            )

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Logic implementation looks correct but has style issues.

The heuristic logic for auto-enabling ngram speculative decoding is well-structured and addresses the PR objective. However, there are a few concerns:

  1. Line length violations: Lines 970 and 983 exceed 120 characters (flagged by static analysis)
  2. Magic numbers: The batch size thresholds (4, 32) and ngram parameters should be documented or made configurable

Apply this diff to fix the line length issues:

-                logger.info(
-                    "Disable overlap scheduler to enable Auto speculative decoding with Ngram."
-                )
+                logger.info(
+                    "Disable overlap scheduler to enable Auto speculative "
+                    "decoding with Ngram."
+                )
-            logger.info(
-                f"Apply heuristic to incomplete NGramDecodingConfig: max_draft_len={spec_config.max_draft_len}, max_matching_ngram_size={spec_config.max_matching_ngram_size}"
-            )
+            logger.info(
+                f"Apply heuristic to NGramDecodingConfig: "
+                f"max_draft_len={spec_config.max_draft_len}, "
+                f"max_matching_ngram_size={spec_config.max_matching_ngram_size}"
+            )

@tensorrt-cicd
Collaborator

PR_Github #12532 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #12532 [ run ] completed with state FAILURE

@SimengLiu-nv
Collaborator Author

/bot run --disable-fail-fast


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
tensorrt_llm/llmapi/llm.py (1)

959-986: Fix line length violations and consider documenting magic numbers.

The heuristic logic correctly implements the PR objective but has style issues that need addressing:

  1. Line length violations: Lines 970 and 983 exceed 120 characters (flagged by static analysis)
  2. Magic numbers: The batch size thresholds (4, 32) should be documented for better maintainability

Apply this diff to fix the line length issues:

-                logger.info(
-                    "Disable overlap scheduler to enable Auto speculative decoding with Ngram."
-                )
+                logger.info(
+                    "Disable overlap scheduler to enable Auto speculative "
+                    "decoding with Ngram."
+                )
-            logger.info(
-                f"Apply heuristic to incomplete NGramDecodingConfig: max_draft_len={spec_config.max_draft_len}, max_matching_ngram_size={spec_config.max_matching_ngram_size}"
-            )
+            logger.info(
+                f"Apply heuristic to NGramDecodingConfig: "
+                f"max_draft_len={spec_config.max_draft_len}, "
+                f"max_matching_ngram_size={spec_config.max_matching_ngram_size}"
+            )

Consider adding constants or comments explaining the threshold values (4, 32) and parameter choices.
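
For instance, a standalone helper capturing the documented thresholds might look like this (illustrative only; names and structure are not from the PR, which keeps the values inline):

LOW_CONCURRENCY_THRESHOLD = 4    # <= 4 concurrent requests: deeper drafts, shorter ngrams
MAX_AUTO_CONCURRENCY = 32        # above this, AUTO leaves speculative decoding off

def ngram_params_for_concurrency(concurrency: int):
    """Return (max_draft_len, max_matching_ngram_size), or None to keep speculation off."""
    if concurrency <= LOW_CONCURRENCY_THRESHOLD:
        return 5, 3
    if concurrency <= MAX_AUTO_CONCURRENCY:
        return 3, 5
    return None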

🧹 Nitpick comments (1)
tensorrt_llm/llmapi/llm_args.py (1)

462-483: Fix docstring formatting issue.

The AutoDecodingConfig class is well-implemented with clear heuristics documentation. However, there's a formatting issue in the docstring.

Apply this diff to fix the docstring formatting:

 class AutoDecodingConfig(DecodingBaseConfig):
     """
     Configuration for auto speculative decoding.
+
     This config is used to automatically select the best speculative decoding algorithm.
 
     According to benchmark results, the best algorithm in general is NGRAM with low concurrency <= 32.
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 21cfc3d and 093c394.

📒 Files selected for processing (6)
  • examples/llm-api/quickstart_advanced.py (3 hunks)
  • tensorrt_llm/_torch/speculative/interface.py (1 hunks)
  • tensorrt_llm/llmapi/__init__.py (2 hunks)
  • tensorrt_llm/llmapi/llm.py (2 hunks)
  • tensorrt_llm/llmapi/llm_args.py (5 hunks)
  • tensorrt_llm/models/modeling_utils.py (2 hunks)
🪛 Ruff (0.12.2)
tensorrt_llm/llmapi/llm.py

970-970: Line too long (161 > 120)

(E501)


983-983: Line too long (174 > 120)

(E501)

tensorrt_llm/llmapi/llm_args.py

478-480: 1 blank line required between summary line and description

(D205)

🚧 Files skipped from review as they are similar to previous changes (4)
  • tensorrt_llm/_torch/speculative/interface.py
  • tensorrt_llm/models/modeling_utils.py
  • examples/llm-api/quickstart_advanced.py
  • tensorrt_llm/llmapi/__init__.py

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (6)
tensorrt_llm/llmapi/llm.py (2)

34-35: Import addition looks good.

The addition of NGramDecodingConfig to the import statement is necessary for the new auto-decoding heuristic logic implemented below.


993-993: Correct usage of modified speculative config.

The change to use spec_config instead of self.args.speculative_config is correct, ensuring the heuristic-adjusted configuration (if applied) is passed to update_executor_config.

tensorrt_llm/llmapi/llm_args.py (4)

265-265: LGTM: Dispatch dictionary correctly updated.

The addition of "AUTO": AutoDecodingConfig to the dispatch dictionary follows the existing pattern and enables proper instantiation of the new AutoDecodingConfig class.


788-788: LGTM: Type alias properly extended.

Adding AutoDecodingConfig to the SpeculativeConfig type alias ensures type consistency and allows the new configuration to be used wherever speculative configurations are expected.


389-394: Verify that AUTO→NGram heuristic is applied in llm.py

The default 0 values in NGramDecodingConfig act as sentinels for “AUTO.” The heuristic is implemented in tensorrt_llm/llmapi/llm.py:

  • When decoding_type == "AUTO" and max_batch_size <= 32, a new NGramDecodingConfig is constructed with
    max_draft_len = 5 (if <=4) or 3 (else)
    max_matching_ngram_size = 3 (if <=4) or 5 (else)

This matches the comments in llm_args.py and ensures zero‐defaults are overridden. No further action is needed.

Likely an incorrect or invalid review comment.


1534-1537: Speculative AUTO mode always yields a non-zero max_draft_len via the NGram heuristic, so no change needed.

The AutoDecodingConfig branch in validate_speculative_config is never invoked with a bare AUTO config at runtime—before validation the LLM API converts any AUTO setting into a NGramDecodingConfig with max_draft_len=5 (for ≤4) or 3 (for ≤32). The CLI build path doesn’t even expose an “auto” option. No further action is required here.

@tensorrt-cicd
Collaborator

PR_Github #12534 [ run ] triggered by Bot

Collaborator

@wili-65535 wili-65535 left a comment


LGTM.

# Default heuristic will be use
# With concurrency <= 4, max_draft_len = 5, max_matching_ngram_size = 3
# With concurrency <= 32, max_draft_len = 3, max_matching_ngram_size = 5
max_draft_len: int = 0
Collaborator


Could we just use the max_draft_len from class DecodingBaseConfig?

Collaborator Author


Good point, removed this line.

@tensorrt-cicd
Collaborator

PR_Github #12534 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #9317 completed with status: 'FAILURE'

@schetlur-nv
Collaborator

Where/how are we testing these changes?

Signed-off-by: Simeng Liu <[email protected]>
Signed-off-by: Simeng Liu <[email protected]>
@SimengLiu-nv SimengLiu-nv requested a review from a team as a code owner July 22, 2025 23:44
@SimengLiu-nv
Collaborator Author

PR_Github #12534 [ run ] completed with state SUCCESS /LLM/main/L0_MergeRequest_PR pipeline #9317 completed with status: 'FAILURE'

Added test_e2e.py::test_ptp_quickstart_advanced_auto test.

@SimengLiu-nv
Collaborator Author

/bot run --disable-fail-fast


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
tensorrt_llm/llmapi/llm_args.py (1)

457-478: Fix docstring formatting and implementation looks good.

The AutoDecodingConfig class is well-implemented with clear documentation of its purpose and heuristics. The backend restriction to PyTorch is appropriate for this feature.

Apply this diff to fix the docstring formatting issue flagged by static analysis:

 class AutoDecodingConfig(DecodingBaseConfig):
     """
     Configuration for auto speculative decoding.
+
     This config is used to automatically select the best speculative decoding algorithm.
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 093c394 and 496f3ea.

📒 Files selected for processing (4)
  • examples/llm-api/quickstart_advanced.py (2 hunks)
  • tensorrt_llm/llmapi/llm_args.py (5 hunks)
  • tests/integration/defs/test_e2e.py (1 hunks)
  • tests/unittest/api_stability/references_committed/llm.yaml (1 hunks)
🪛 Ruff (0.12.2)
tensorrt_llm/llmapi/llm_args.py

473-475: 1 blank line required between summary line and description

(D205)

✅ Files skipped from review due to trivial changes (1)
  • tests/unittest/api_stability/references_committed/llm.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
  • examples/llm-api/quickstart_advanced.py

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (5)
tests/integration/defs/test_e2e.py (1)

1768-1791: AUTO mode fallback to NGram is validated

The --spec_decode_algo AUTO test runs with --max_batch_size=4 (≤32), which triggers the AUTO heuristic in tensorrt_llm/llmapi/llm.py to select an NGram speculative decoding configuration. The observed 27 GiB memory profile matches the existing NGram test, confirming the AUTO mode behavior. No changes needed.
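
A hypothetical stand-in (not the repository's actual test helper) for what the new end-to-end check exercises: run the quickstart example with the AUTO algorithm and a small batch size so the NGram heuristic path is taken. The --model_dir flag name and the model path are placeholders.

import subprocess
import sys

def run_quickstart_auto(model_dir: str) -> None:
    cmd = [
        sys.executable, "examples/llm-api/quickstart_advanced.py",
        "--model_dir", model_dir,
        "--spec_decode_algo", "AUTO",
        "--max_batch_size", "4",
    ]
    subprocess.run(cmd, check=True)  # raises if the example exits non-zero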

tensorrt_llm/llmapi/llm_args.py (4)

265-265: LGTM!

The addition of the "AUTO" mapping to the dispatch dictionary correctly follows the established pattern for routing decoding types to their respective configuration classes.


389-389: Verify the impact of changing the default value.

Changing the default max_matching_ngram_size from 4 to 0 could be a breaking change for existing users who rely on the previous default behavior. Please ensure this change is intentional and consider:

  1. Whether existing code that doesn't explicitly set this parameter will be affected
  2. If documentation updates are needed to reflect this change
  3. Whether this should be mentioned in release notes as a breaking change

783-783: LGTM!

Adding AutoDecodingConfig to the SpeculativeConfig type alias is necessary for proper type checking and ensures the new configuration can be used wherever speculative configurations are expected.


1529-1532: LGTM! Validation logic is consistent.

The validation logic for AutoDecodingConfig correctly follows the established pattern:

  • Backend assertion matches the supports_backend method
  • Appropriate speculative decoding mode assignment
  • Consistent with other decoding configuration validations

The actual heuristics application (based on concurrency/batch size) appears to be handled elsewhere in the system, which is a reasonable separation of concerns.

@tensorrt-cicd
Collaborator

PR_Github #12619 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #12619 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #9388 completed with status: 'FAILURE'
