
Conversation

zheyuf
Collaborator

@zheyuf zheyuf commented Aug 21, 2025

Description

Previously, speculation was disabled solely when len(active_requests) > max_concurrency, which was overly conservative. We now gate on a better estimate of per-step capacity, min(len(active_requests), max_batch_size, max_num_tokens // (1 + draft_len)), so speculation isn’t unnecessarily turned off when the schedulable batch is small. This respects the real decoding limits, improving GPU utilization and latency.
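
As a rough sketch of the new gate (a standalone function for illustration only; the actual logic lives in Drafter.should_use_spec_decode and may include additional guards):

from typing import Optional


def should_use_spec_decode_sketch(num_active_requests: int,
                                  max_batch_size: int,
                                  max_num_tokens: int,
                                  max_draft_len: int,
                                  max_concurrency: Optional[int]) -> bool:
    """Illustrative per-step gate: speculation stays on while the schedulable
    batch, capped by batch size and token budget, fits under max_concurrency."""
    if max_concurrency is None:
        # No cap configured: speculation is always on.
        return True
    if num_active_requests == 0 or max_batch_size <= 0 or max_num_tokens <= 0:
        return False
    # Each request consumes one target token plus up to max_draft_len draft tokens.
    token_cap = max_num_tokens // (1 + max_draft_len)
    num_effective_requests = min(num_active_requests, max_batch_size, token_cap)
    return 0 < num_effective_requests <= max_concurrency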

Test Coverage

There are existing unit tests in TensorRTLLM/tests/unittest/_torch/speculative/test_dynamic_spec_decode.py; some quick unit tests covering the new gating logic were also added to that file.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

Summary by CodeRabbit

  • New Features

    • Speculative decoding now factors batch size, overall token budget, and draft length when deciding to enable speculation, improving efficiency across workloads.
  • Refactor

    • Decision gate changed from a pure concurrency check to a token-budget- and batch-aware approach for more predictable and constrained speculation behavior.
  • Tests

    • Unit tests updated and a new test added to cover varied scenarios and edge cases for the expanded decision criteria.

Contributor

coderabbitai bot commented Aug 21, 2025

Caution

Review failed

The head commit changed during the review from a01bc8d to c134cf1.

📝 Walkthrough

Walkthrough

PyExecutor now calls Drafter.should_use_spec_decode with four arguments (active_requests, max_batch_size, max_num_tokens, max_draft_len). Drafter's signature and gating logic were updated to enforce token-budget and batch-size constraints; unit tests were updated and a new test added to validate the gating behavior.
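
For orientation, the call-site shape looks roughly like this (paraphrased from the snippets quoted later in this review, not the verbatim source):

# Sketch of the PyExecutor call site (paraphrased, not verbatim source):
if self.drafter is not None:
    self.use_spec_decode = self.drafter.should_use_spec_decode(
        self.active_requests, self.max_batch_size,
        self.model_engine.max_num_tokens,
        self.model_engine.spec_config.max_draft_len)
    self.model_engine.enable_spec_decode = self.use_spec_decode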

Changes

Cohort / File(s) Summary
Executor + Drafter
tensorrt_llm/_torch/pyexecutor/py_executor.py, tensorrt_llm/_torch/speculative/drafter.py
PyExecutor invokes Drafter.should_use_spec_decode(active_requests, max_batch_size, max_num_tokens, max_draft_len). Drafter.should_use_spec_decode signature changed and logic now applies a token-budget and batch-size gate (compute tokens_per_request, token_cap, num_effective_requests = min(len(requests), max_batch_size, token_cap), require 0 < num_effective_requests <= max_concurrency). Result assigned to self.use_spec_decode and passed to model_engine.enable_spec_decode; _prepare_draft_requests() still runs when speculation is enabled.
Unit tests
tests/unittest/_torch/speculative/test_dynamic_spec_decode.py
Mock updated to accept (requests, max_batch_size, max_num_tokens, max_draft_len). Added test_should_use_spec_decode() exercising multiple scenarios (varying active_requests, small batch, small token budget, OFF, None, token-cap-zero) and retained existing speculative-path assertions.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant C as Client
  participant E as PyExecutor
  participant D as Drafter
  participant M as ModelEngine

  C->>E: submit requests
  E->>M: query model limits (max_num_tokens, max_draft_len)
  E->>E: compute max_batch_size, active_requests
  E->>D: should_use_spec_decode(active_requests, max_batch_size, max_num_tokens, max_draft_len)
  D-->>E: use_spec (bool)

  alt use_spec == true
    E->>M: enable_spec_decode(true)
    note right of E #eef6ff: initialize per-request draft state
    E->>E: _prepare_draft_requests()
    E->>E: prepare drafting & schedule speculative batch
  else
    E->>M: enable_spec_decode(false)
    E->>E: schedule standard decoding
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • Naveassaf
  • ziyixiong-nv
  • achartier

Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (6)
tensorrt_llm/_torch/speculative/drafter.py (2)

1-1: Add NVIDIA copyright header (2025).

All Python source files should start with the standard NVIDIA copyright header per repository guidelines.

Apply at file top:

+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.

29-37: Docstring should describe new parameters and return semantics.

The signature expanded, but the docstring didn’t. Adding Args/Returns will help downstream users and maintainers.

Apply this diff to enrich the docstring:

         """
-        You probably don't want to override this. ModelEngine
-        assumes that speculation is always on if max_concurrency
-        is not specified by the user's spec config.
+        Decide whether to enable speculative decoding for this step.
+
+        You probably don't want to override this. ModelEngine assumes that
+        speculation is always on if max_concurrency is not specified by the
+        user's spec config.
+
+        Args:
+            requests: Active requests on this rank.
+            max_batch_size: Upper bound on sequences schedulable this step.
+            max_num_tokens: Per-step token budget available for generation
+                (including draft tokens).
+            max_draft_len: Maximum number of draft tokens per request.
+
+        Returns:
+            True iff speculative decoding should be attempted for this step.
         """
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

923-933: Option: gate after scheduling using actual batch size.

Gating before scheduling uses len(self.active_requests), which can be stricter/looser than the scheduled batch, especially with pausing or KV-transfer constraints. Consider moving gating after _schedule() and using scheduled_batch.batch_size and the number of generation requests to compute the effective token demand for this step.

If you pursue this, keep a fast-path when max_concurrency is None to avoid extra overhead.
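
A hypothetical shape for that alternative, assuming a scheduled_batch object exposing batch_size as mentioned above (the field and attribute names here are assumptions, not the current code):

# Hypothetical post-scheduling gate (names are assumptions, not current code):
scheduled_batch = self._schedule()
if self.drafter is not None:
    if self.drafter.max_concurrency is None:
        use_spec = True  # fast path: no concurrency cap configured
    else:
        token_cap = self.model_engine.max_num_tokens // (
            1 + self.model_engine.spec_config.max_draft_len)
        num_effective = min(scheduled_batch.batch_size, token_cap)
        use_spec = 0 < num_effective <= self.drafter.max_concurrency
    self.model_engine.enable_spec_decode = use_spec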

tests/unittest/_torch/speculative/test_dynamic_spec_decode.py (3)

1-1: Add NVIDIA copyright header (2025).

Tests are also Python sources; please prepend the standard NVIDIA header.

Apply at file top:

+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.

55-61: Strengthen the mock: assert plumbed values match expectations.

Adding lightweight assertions in the mock helps catch regression in plumbing the new parameters without making the test brittle.

Apply this diff:

-    def mock_should_use_spec_decode(self, requests, max_batch_size,
-                                    max_num_tokens, draft_len):
+    def mock_should_use_spec_decode(self, requests, max_batch_size,
+                                    max_num_tokens, draft_len):
+        # Sanity-check the plumbed values
+        assert isinstance(max_batch_size, int) and max_batch_size >= 1
+        assert draft_len == max_draft_len
+        assert isinstance(max_num_tokens, int) and max_num_tokens > 0
         if not hasattr(mock_should_use_spec_decode, 'call_count'):
             mock_should_use_spec_decode.call_count = 0
         mock_should_use_spec_decode.call_count += 1
         return mock_should_use_spec_decode.call_count <= 2

17-53: Add focused unit tests for should_use_spec_decode logic (no GPU).

Current test validates end-to-end behavior with a heavy model. Add a lightweight unit test that constructs a minimal Drafter subclass and exercises edge cases:

  • token_cap == 0 (should return False with max_concurrency set)
  • requests == [] (should return False)
  • num_effective_requests just above/below max_concurrency

I can draft such a test that runs on CPU and completes quickly if you’d like.
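
A CPU-only sketch of such a test might look like the following (the _DummyDrafter subclass and the Drafter(max_concurrency=...) constructor mirror the pattern used elsewhere in this PR; the expected booleans assume the gating formula described in the PR description):

from tensorrt_llm._torch.speculative.drafter import Drafter


class _DummyDrafter(Drafter):
    """Minimal drafter that only exercises the gating logic."""

    def prepare_draft_tokens(self, *args, **kwargs) -> None:
        # No-op: drafting itself is not under test here.
        return


def test_should_use_spec_decode_gating():
    drafter = _DummyDrafter(max_concurrency=6)

    # Empty request list: nothing to speculate on.
    assert drafter.should_use_spec_decode([], max_batch_size=8,
                                          max_num_tokens=1024,
                                          max_draft_len=4) is False

    # Token budget too small for even one request (token_cap == 0).
    assert drafter.should_use_spec_decode([object()] * 3, max_batch_size=8,
                                          max_num_tokens=4,
                                          max_draft_len=4) is False

    # Effective batch at or below max_concurrency: speculation stays on.
    assert drafter.should_use_spec_decode([object()] * 5, max_batch_size=8,
                                          max_num_tokens=1024,
                                          max_draft_len=4) is True

    # Effective batch above max_concurrency: speculation turns off.
    assert drafter.should_use_spec_decode([object()] * 10, max_batch_size=16,
                                          max_num_tokens=1024,
                                          max_draft_len=4) is False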

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 2d40e87 and 06875c2.

📒 Files selected for processing (3)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py (1 hunks)
  • tensorrt_llm/_torch/speculative/drafter.py (1 hunks)
  • tests/unittest/_torch/speculative/test_dynamic_spec_decode.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in __init__
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/_torch/speculative/drafter.py
  • tests/unittest/_torch/speculative/test_dynamic_spec_decode.py
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/_torch/speculative/drafter.py
  • tests/unittest/_torch/speculative/test_dynamic_spec_decode.py
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
🔇 Additional comments (2)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

924-927: Verified — ModelEngine attributes and spec-decode callsites are correct

Quick summary: PyTorchModelEngine exposes max_num_tokens and derives max_draft_len from spec_config; drafter.should_use_spec_decode signature matches the py_executor call and no other outdated call sites were found.

Points of evidence:

  • tensorrt_llm/_torch/pyexecutor/model_engine.py — PyTorchModelEngine.__init__ sets self.max_num_tokens (class at ~line 260) and sets self.max_draft_len from spec_config (around lines 391–410).
  • tensorrt_llm/_torch/speculative/drafter.py — should_use_spec_decode signature expects (requests, max_batch_size, max_num_tokens, max_draft_len). (definition at ~line 30)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py — call site at lines ~923–926 passes model_engine.max_num_tokens and model_engine.max_draft_len into should_use_spec_decode.
  • Ripgrep search returned no other should_use_spec_decode call sites.

Conclusion: No code changes required for the call in py_executor.py; attributes are present on the concrete engine implementation.

tests/unittest/_torch/speculative/test_dynamic_spec_decode.py (1)

55-65: Verified patch target is correct

  • The ModelDrafter class is defined in tensorrt_llm/_torch/speculative/model_drafter.py (line 38), matching the module path under test.
  • The should_use_spec_decode method is implemented on the base Drafter class in tensorrt_llm/_torch/speculative/drafter.py (line 30) and inherited by ModelDrafter, so patching ModelDrafter.should_use_spec_decode will override the behavior invoked by LLM (e.g., in py_executor.py:923).

No changes needed here.

@svc-trtllm-gh-bot svc-trtllm-gh-bot added the "Community want to contribute" (PRs initiated from Community) label Aug 21, 2025
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (2)
tensorrt_llm/_torch/speculative/drafter.py (1)

39-47: Add explicit guards for zero/invalid capacity; clamp draft length to avoid div-by-zero for negatives.

Semantics are currently correct due to 0 < num_effective_requests, but early guards improve readability and avoid computing min() when capacity is zero. Also clamp max_draft_len to tolerate unexpected negatives.

-        # Inputs validated upstream: max_batch_size>0, max_num_tokens>0, max_draft_len>0
+        # Inputs typically validated upstream: max_batch_size>0, max_num_tokens>0, max_draft_len>=0

         if self.max_concurrency is None:
             return True

-        token_cap = max_num_tokens // (1 + max_draft_len)
-        num_effective_requests = min(len(requests), max_batch_size, token_cap)
-
-        return 0 < num_effective_requests <= self.max_concurrency
+        # Defensive guards; keep behavior explicit for zero/empty cases
+        if max_batch_size <= 0 or max_num_tokens <= 0 or not requests:
+            return False
+
+        tokens_per_request = 1 + max(0, max_draft_len)
+        token_cap = max_num_tokens // tokens_per_request
+        if token_cap <= 0:
+            return False
+
+        num_effective_requests = min(len(requests), max_batch_size, token_cap)
+        return num_effective_requests <= self.max_concurrency
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

922-927: Good fix: gating now uses spec_config.max_draft_len (consistent with draft allocation).

This resolves the prior inconsistency between gating and _prepare_draft_requests (which uses spec_config.max_draft_len). Thanks for aligning the sources.

🧹 Nitpick comments (8)
tensorrt_llm/_torch/speculative/drafter.py (3)

1-4: Add NVIDIA copyright header (2025) per repo guidelines.

All source files must carry the NVIDIA copyright header.

+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+"""
+This file is part of TensorRT-LLM.
+"""

30-37: Clarify and expand the method docstring; include args/returns and the gating formula.

Make the contract explicit and Sphinx/Google-style so downstream users know exactly how the decision is computed.

-    def should_use_spec_decode(self, requests: List[LlmRequest],
-                               max_batch_size: int, max_num_tokens: int,
-                               max_draft_len: int) -> bool:
-        """
-        You probably don't want to override this. ModelEngine
-        assumes that speculation is always on if max_concurrency
-        is not specified by the user's spec config.
-        """
+    def should_use_spec_decode(self, requests: List[LlmRequest],
+                               max_batch_size: int, max_num_tokens: int,
+                               max_draft_len: int) -> bool:
+        """
+        Decide whether to enable speculative decoding in this step.
+
+        Speculation is enabled when:
+            - max_concurrency is None (unbounded), or
+            - 0 < min(len(requests), max_batch_size,
+                      max_num_tokens // (1 + max(0, max_draft_len))) <= max_concurrency
+
+        Args:
+            requests (List[LlmRequest]): Active requests this step.
+            max_batch_size (int): Upper bound on batch size for this step.
+            max_num_tokens (int): Token budget available for this step.
+            max_draft_len (int): Max draft tokens per request used by the drafter.
+
+        Returns:
+            bool: True to enable speculative decoding; False to disable.
+        """

30-47: Optional: surface observability hook.

If you’d prefer not to log at call sites, consider emitting a debug-level trace here (behind a cheap bool) with inputs and the decision for troubleshooting. Decline if you’d rather keep the base class silent.

tensorrt_llm/_torch/pyexecutor/py_executor.py (2)

1-3: Add NVIDIA copyright header (2025) at file top.

This file is missing the required header.

+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+"""
+This file is part of TensorRT-LLM.
+"""

922-928: Log the gating decision for visibility (debug-level).

A one-line debug log helps diagnose unexpected toggling without stepping through. Cheap to compute and guarded by logger’s level.

         if self.drafter is not None:
             self.use_spec_decode = self.drafter.should_use_spec_decode(
                 self.active_requests, self.max_batch_size,
                 self.model_engine.max_num_tokens,
                 self.model_engine.spec_config.max_draft_len)
             self.model_engine.enable_spec_decode = self.use_spec_decode
+            logger.debug(
+                "spec gating: active=%d, max_batch_size=%d, max_num_tokens=%d, "
+                "max_draft_len=%d -> use_spec_decode=%s",
+                len(self.active_requests),
+                self.max_batch_size,
+                self.model_engine.max_num_tokens,
+                self.model_engine.spec_config.max_draft_len,
+                str(self.use_spec_decode),
+            )
tests/unittest/_torch/speculative/test_dynamic_spec_decode.py (3)

1-4: Add NVIDIA copyright header (2025).

Tests are also subject to the header policy.

+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

91-147: Strengthen unit coverage with unbounded concurrency case; minor tidy-ups.

  • Add a case where max_concurrency=None (should always enable speculation).
  • Optional: the earlier sampling_params = SamplingParams(max_tokens=128, ...) assignment is unused; drop it.
  • Optional: consider using a tiny lightweight LlmRequest-like stub (with only what’s needed) rather than object() to future-proof the test if the method ever inspects attributes.
@@
     drafter = _DummyDrafter(max_concurrency=6)
@@
     assert drafter.should_use_spec_decode(active_requests,
                                           max_batch_size=8,
                                           max_num_tokens=4,
                                           max_draft_len=4) is False
 
+    # Unbounded concurrency ON case (should always enable)
+    drafter_unbounded = _DummyDrafter(max_concurrency=None)
+    active_requests = [object()] * 100
+    assert drafter_unbounded.should_use_spec_decode(
+        active_requests,
+        max_batch_size=8,
+        max_num_tokens=16,
+        max_draft_len=7) is True

And the small cleanup in the first test:

-        sampling_params = SamplingParams(max_tokens=128, temperature=0)

91-147: Optionally parameterize the truth-table to reduce duplication.

A table-driven test with pytest parametrization would make it easier to extend scenarios without repetitive boilerplate. Happy to provide a parametrized version if you want it in this PR.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 06875c2 and c970d04.

📒 Files selected for processing (3)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py (1 hunks)
  • tensorrt_llm/_torch/speculative/drafter.py (1 hunks)
  • tests/unittest/_torch/speculative/test_dynamic_spec_decode.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in __init__
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/_torch/speculative/drafter.py
  • tests/unittest/_torch/speculative/test_dynamic_spec_decode.py
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/_torch/speculative/drafter.py
  • tests/unittest/_torch/speculative/test_dynamic_spec_decode.py
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
🔇 Additional comments (2)
tests/unittest/_torch/speculative/test_dynamic_spec_decode.py (2)

55-56: Mock signature update looks correct.

Matches the production method’s (requests, max_batch_size, max_num_tokens, draft_len) and maintains the original behavior (first two calls True, then False).


1-150: Sanity Check Complete: All should_use_spec_decode Call Sites Updated

All invocations of should_use_spec_decode in the repository now supply at least four arguments, matching the updated signature. No further changes are required.

Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (1)
tensorrt_llm/_torch/speculative/drafter.py (1)

39-55: Gating logic aligns with PR goals; add an optional guard for zero draft length.

  • The zero/empty/budget guards and token-cap gating are correct and address earlier feedback. Nice.
  • Optional: if max_draft_len == 0, speculation is effectively a no-op; returning False avoids unnecessary toggling of enable_spec_decode and avoids extra drafter work.

(Complements the earlier “guard insufficient token budget” thread that you addressed.)

         # Defensive guards; keep behavior explicit for zero/empty cases
-        if not requests or max_batch_size <= 0 or max_num_tokens <= 0:
+        if not requests or max_batch_size <= 0 or max_num_tokens <= 0:
             return False
+        # No drafts → no speculation benefit; avoid enabling spec decode
+        if max_draft_len <= 0:
+            return False
🧹 Nitpick comments (6)
tensorrt_llm/_torch/speculative/drafter.py (2)

1-1: Add NVIDIA Apache-2.0 header (2025).

Per repository guidelines, prepend the NVIDIA Apache-2.0 copyright header with the current year to this source file.

+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#     http://www.apache.org/licenses/LICENSE-2.0
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.

33-37: Document the new parameters and behavior.

Expand the docstring to a Google-style docstring so external users know exactly how gating is computed.

-        """
-        You probably don't want to override this. ModelEngine
-        assumes that speculation is always on if max_concurrency
-        is not specified by the user's spec config.
-        """
+        """
+        Decide whether to enable speculative decoding for this step.
+
+        You probably don't want to override this. ModelEngine assumes that
+        speculation is always on if max_concurrency is not specified by the
+        user's spec config.
+
+        Args:
+            requests (List[LlmRequest]): Currently active requests.
+            max_batch_size (int): Executor's per-step batch-size limit.
+            max_num_tokens (int): Executor's per-step token budget for decoding.
+            max_draft_len (int): Maximum draft length per request for speculative decoding.
+
+        Returns:
+            bool: True if speculation should be enabled given concurrency, batch size,
+            and token budget constraints; otherwise False.
+        """
tensorrt_llm/_torch/pyexecutor/py_executor.py (2)

1319-1320: Align dummy-request draft allocation with spec_config.

The ADP dummy-request path uses self.max_draft_len, which may diverge from spec_config.max_draft_len used elsewhere (gating and draft prep). Use the same source to avoid inconsistent allocations.

-                max_num_draft_tokens=self.max_draft_len,
+                max_num_draft_tokens=self.model_engine.spec_config.max_draft_len,

961-970: Minor: avoid calling drafter.prepare_draft_tokens when no drafts (max_draft_len == 0).

If you accept the optional guard in Drafter.should_use_spec_decode for zero draft length, this will naturally skip. If not, you can add a cheap local check to avoid a no-op invocation.

-                    if self.drafter is not None and self.use_spec_decode:
+                    if (self.drafter is not None and self.use_spec_decode
+                            and self.model_engine.spec_config.max_draft_len > 0):
tests/unittest/_torch/speculative/test_dynamic_spec_decode.py (2)

1-1: Add NVIDIA Apache-2.0 header (2025) to test file.

+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#     http://www.apache.org/licenses/LICENSE-2.0
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.

91-146: Good direct unit coverage; add a few targeted edge cases.

Nice focused test for the new gating logic. Consider adding:

  • max_draft_len == 0 should be False (if adopting the optional guard).
  • max_concurrency=None path should always return True.
  • max_batch_size <= 0 should be False (already guarded in code but not asserted here).

Minimal diff to extend the test:

@@
     drafter = _DummyDrafter(max_concurrency=6)
@@
     assert drafter.should_use_spec_decode(active_requests,
                                           max_batch_size=8,
                                           max_num_tokens=4,
                                           max_draft_len=4) is False
 
+    # Edge case - Zero draft length OFF (no speculation benefit)
+    active_requests = [object()] * 3
+    assert drafter.should_use_spec_decode(active_requests,
+                                          max_batch_size=8,
+                                          max_num_tokens=128,
+                                          max_draft_len=0) is False
+
+    # Edge case - max_batch_size <= 0 OFF
+    active_requests = [object()] * 3
+    assert drafter.should_use_spec_decode(active_requests,
+                                          max_batch_size=0,
+                                          max_num_tokens=128,
+                                          max_draft_len=4) is False
+
+    # Behavior when concurrency is unspecified: always ON
+    drafter_unbounded = _DummyDrafter(max_concurrency=None)
+    active_requests = [object()] * 100
+    assert drafter_unbounded.should_use_spec_decode(active_requests,
+                                                    max_batch_size=1,
+                                                    max_num_tokens=1,
+                                                    max_draft_len=4) is True

If you prefer, I can also parametrize this test with pytest to reduce duplication.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between c970d04 and 556438b.

📒 Files selected for processing (3)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py (1 hunks)
  • tensorrt_llm/_torch/speculative/drafter.py (1 hunks)
  • tests/unittest/_torch/speculative/test_dynamic_spec_decode.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent with 4 spaces; do not use tabs
Preserve module namespace when importing: from package.subpackage import foo; then use foo.SomeClass()
Python filenames use snake_case (e.g., some_file.py)
Class names use PascalCase
Function and method names use snake_case
Local variables use snake_case; prefix k for names starting with a number (e.g., k_99th_percentile)
Global variables are UPPER_SNAKE_CASE prefixed with G (e.g., G_MY_GLOBAL)
Constants are UPPER_SNAKE_CASE
Avoid shadowing variables from an outer scope
Initialize all externally visible members of a class in __init__
For interfaces used outside a file, prefer docstrings over comments; comments for internal code or local interfaces
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Attributes and variables can be documented inline with trailing docstrings under the class or module
Avoid using reflection when easily avoidable; prefer explicit parameters/constructs over dict(**locals())
In try/except, catch the narrowest exception types possible
For duck-typing try/except, keep try body minimal and place logic in else after attribute existence checks

Files:

  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • tensorrt_llm/_torch/speculative/drafter.py
  • tests/unittest/_torch/speculative/test_dynamic_spec_decode.py
**/*.{h,hpp,hxx,hh,c,cc,cpp,cxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend NVIDIA Apache-2.0 copyright header with current year to all source files

Files:

  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • tensorrt_llm/_torch/speculative/drafter.py
  • tests/unittest/_torch/speculative/test_dynamic_spec_decode.py
🧬 Code graph analysis (3)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
tests/unittest/llmapi/apps/_test_openai_misc.py (1)
  • max_batch_size (30-31)
tensorrt_llm/_torch/speculative/drafter.py (1)
tensorrt_llm/_torch/pyexecutor/llm_request.py (1)
  • LlmRequest (282-424)
tests/unittest/_torch/speculative/test_dynamic_spec_decode.py (1)
tensorrt_llm/_torch/speculative/drafter.py (3)
  • Drafter (9-54)
  • prepare_draft_tokens (16-27)
  • should_use_spec_decode (30-54)
🔇 Additional comments (3)
tensorrt_llm/_torch/speculative/drafter.py (1)

30-32: Verified: no overrides or stale callers for should_use_spec_decode

All occurrences of def should_use_spec_decode are in tensorrt_llm/_torch/speculative/drafter.py (line 30); no subclass redefinitions were found. All call sites were scanned and confirmed to pass at least four arguments to the new signature.

– Definition only at:
• tensorrt_llm/_torch/speculative/drafter.py:30
– No subclasses override should_use_spec_decode
– All invocations supply ≥ 4 arguments

tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

889-891: ⚠️ Verify max_num_tokens semantics

It looks like model_engine.max_num_tokens isn’t set anywhere in the Python ModelEngine class itself—it’s imported from the underlying TensorRT engine’s build-time configuration rather than computed per forward call. Please confirm that this value truly represents your per-step token budget (i.e. target + draft tokens) and doesn’t include context-only tokens or a global theoretical maximum. If it does, the gating calculation in PyExecutor may allow speculation when it shouldn’t.

  • No Python assignment to self.max_num_tokens in tensorrt_llm/_torch/pyexecutor/model_engine.py
  • Ensure your engine build’s max_num_tokens matches the desired per-iteration limit for speculation gating
tests/unittest/_torch/speculative/test_dynamic_spec_decode.py (1)

55-56: Mock signature update matches the new API.

The patched mock now accepts (requests, max_batch_size, max_num_tokens, max_draft_len). Good.

Signed-off-by: Zheyu Fu <[email protected]>
@zheyuf zheyuf marked this pull request as ready for review August 25, 2025 20:54
@zheyuf zheyuf requested review from a team as code owners August 25, 2025 20:54
@zheyuf zheyuf requested review from yweng0828 and Naveassaf August 25, 2025 20:54
@mikeiovine mikeiovine self-requested a review August 25, 2025 21:34
@mikeiovine
Collaborator

/bot run

@tensorrt-cicd
Collaborator

PR_Github #16475 [ run ] triggered by Bot

@zheyuf zheyuf enabled auto-merge (squash) August 25, 2025 22:16
@tensorrt-cicd
Collaborator

PR_Github #16475 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12376 completed with status: 'FAILURE'

@zheyuf
Collaborator Author

zheyuf commented Aug 26, 2025

/bot run

Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/speculative/drafter.py (1)

1-4: Compliance: add the NVIDIA copyright header (2025)

The repository guidelines require the NVIDIA header on all source files; missing headers can trip CI/license checks.

Apply at the very top of the file:

+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0
+# Unless required by applicable law or agreed to in writing, software distributed under
+# the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND,
+# either express or implied. See the License for the specific language governing permissions and
+# limitations under the License.

If the repo uses a different boilerplate, please paste the project-standard header instead.

🧹 Nitpick comments (3)
tensorrt_llm/_torch/speculative/drafter.py (3)

41-47: Guard ordering: avoid returning True with invalid/degenerate inputs when concurrency is unlimited

When max_concurrency is None, the function returns True before hitting the defensive guards. That enables speculation even when requests is empty or max_batch_size/max_num_tokens are non-positive. Suggest validating inputs and token_cap first; then apply the “unlimited concurrency” fast-path only when at least one request can actually be scheduled.

Apply this minimal reorder so True is only reachable when something fits:

-        if self.max_concurrency is None:
-            return True
-
-        # Defensive guards; keep behavior explicit for zero/empty cases
-        if not requests or max_batch_size <= 0 or max_num_tokens <= 0:
-            return False
+        # Defensive guards; keep behavior explicit for zero/empty cases
+        if not requests or max_batch_size <= 0 or max_num_tokens <= 0:
+            return False

And move the unlimited-concurrency path after computing num_effective_requests:

-        num_effective_requests = min(len(requests), max_batch_size, token_cap)
-        return num_effective_requests <= self.max_concurrency
+        num_effective_requests = min(len(requests), max_batch_size, token_cap)
+        if self.max_concurrency is None:
+            # Unlimited concurrency: we've already established at least one request can fit.
+            return True
+        return num_effective_requests <= self.max_concurrency

This keeps the documented behavior while preventing “spec enabled” on no-op steps.


30-33: Add a Google-style docstring to clarify inputs, semantics, and edge cases

Documenting expectations (e.g., whether requests includes only target requests, how max_num_tokens is measured, and the “unlimited concurrency” behavior) will reduce ambiguity for future maintainers and test authors.

Apply:

 def should_use_spec_decode(self, requests: List[LlmRequest],
                            max_batch_size: int, max_num_tokens: int,
                            max_draft_len: int) -> bool:
-        """
-        You probably don't want to override this. ModelEngine
-        assumes that speculation is always on if max_concurrency
-        is not specified by the user's spec config.
-        """
+        """Decide whether to enable speculative decoding for this step.
+
+        Args:
+            requests: Active target requests under consideration this step.
+            max_batch_size: Upper bound of sequences schedulable this step (> 0).
+            max_num_tokens: Per-step token budget for decoding (targets + draft) (> 0).
+            max_draft_len: Max proposed draft tokens per request (>= 0).
+
+        Returns:
+            True if speculation should be enabled given the per-step token budget,
+            batch-size cap, and the configured concurrency limit. If
+            max_concurrency is None (unlimited), speculation is enabled as long
+            as at least one request fits the budget.
+        """

30-33: Broaden input type to Sequence to avoid over-constraining callers

The method only reads length and iterates; accepting Sequence[LlmRequest] makes it friendlier to upstream schedulers without forcing a list.

Apply:

-from typing import List, Optional, final
+from typing import Optional, Sequence, final
@@
-    def should_use_spec_decode(self, requests: List[LlmRequest],
+    def should_use_spec_decode(self, requests: Sequence[LlmRequest],
                                max_batch_size: int, max_num_tokens: int,
                                max_draft_len: int) -> bool:
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between a9be135 and 394ccc8.

📒 Files selected for processing (3)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py (1 hunks)
  • tensorrt_llm/_torch/speculative/drafter.py (1 hunks)
  • tests/unittest/_torch/speculative/test_dynamic_spec_decode.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • tests/unittest/_torch/speculative/test_dynamic_spec_decode.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in __init__
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures

Files:

  • tensorrt_llm/_torch/speculative/drafter.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • tensorrt_llm/_torch/speculative/drafter.py
🧬 Code graph analysis (1)
tensorrt_llm/_torch/speculative/drafter.py (1)
tensorrt_llm/_torch/pyexecutor/llm_request.py (1)
  • LlmRequest (282-424)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
tensorrt_llm/_torch/speculative/drafter.py (1)

30-33: Smarter capacity gate + API expansion look good

The new signature and the per-step gating via min(len(requests), max_batch_size, token_cap) align with the PR objective and should reduce unnecessary speculation disables. The @final choice also makes sense here.

@zheyuf
Collaborator Author

zheyuf commented Aug 26, 2025

/bot run

zheyuf added 2 commits August 26, 2025 10:06
Signed-off-by: Zheyu Fu <[email protected]>
@zheyuf
Collaborator Author

zheyuf commented Aug 26, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #16582 [ run ] triggered by Bot

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

1318-1321: Use the same draft length source for ADP dummy requests (keep gating and allocation consistent).

Here, the ADP dummy-request path passes max_num_draft_tokens=self.max_draft_len, while elsewhere we gate and allocate drafts using spec_config.max_draft_len. If self.max_draft_len differs from spec_config.max_draft_len, ADP dummy requests will be sized inconsistently with the gating decision.

Apply this small change:

-                max_num_draft_tokens=self.max_draft_len,
+                max_num_draft_tokens=self.model_engine.spec_config.max_draft_len,

Follow-up (optional): consider deprecating PyExecutor.max_draft_len in favor of model_engine.spec_config.max_draft_len to avoid future drift.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 394ccc8 and d0adbbd.

📒 Files selected for processing (3)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py (1 hunks)
  • tensorrt_llm/_torch/speculative/drafter.py (1 hunks)
  • tests/unittest/_torch/speculative/test_dynamic_spec_decode.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • tensorrt_llm/_torch/speculative/drafter.py
  • tests/unittest/_torch/speculative/test_dynamic_spec_decode.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in __init__
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures

Files:

  • tensorrt_llm/_torch/pyexecutor/py_executor.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • tensorrt_llm/_torch/pyexecutor/py_executor.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (2)
tensorrt_llm/_torch/pyexecutor/py_executor.py (2)

888-897: LGTM: smarter gating call wired correctly and consistent draft length source.

Passing active_requests, max_batch_size, max_num_tokens, and spec_config.max_draft_len into Drafter.should_use_spec_decode aligns the executor’s decision with batch-size and token-budget limits. Using spec_config.max_draft_len here matches _prepare_draft_requests (Line 1016+), eliminating the prior inconsistency.


889-891: All should_use_spec_decode call sites are up-to-date

I ran the provided ripgrep checks across the repository and found only:

  • tensorrt_llm/_torch/pyexecutor/py_executor.py – updated to pass all four parameters
  • tests/unittest/_torch/speculative/test_dynamic_spec_decode.py – all assertions invoke should_use_spec_decode with the new multi-arg signature (and rely on the default for any omitted parameter)

No stale single-argument calls remain.

@tensorrt-cicd
Collaborator

PR_Github #16582 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12450 completed with status: 'FAILURE'

@zheyuf
Collaborator Author

zheyuf commented Aug 26, 2025

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #16590 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #16590 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12456 completed with status: 'FAILURE'

@mikeiovine
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #16738 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #16738 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12562 completed with status: 'FAILURE'

@mikeiovine mikeiovine removed the "Community want to contribute" (PRs initiated from Community) label Aug 28, 2025
@zheyuf
Collaborator Author

zheyuf commented Aug 29, 2025

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #16958 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #16958 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12729 completed with status: 'FAILURE'

@zheyuf
Collaborator Author

zheyuf commented Aug 29, 2025

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #17013 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #17013 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12776 completed with status: 'FAILURE'

@zheyuf
Collaborator Author

zheyuf commented Aug 30, 2025

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #17062 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #17062 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12822 completed with status: 'FAILURE'
