[None][feat] Make the should_use_spec_decode logic a bit smarter #7112
Conversation
📝 Walkthrough

PyExecutor now calls Drafter.should_use_spec_decode with four arguments (active_requests, max_batch_size, max_num_tokens, max_draft_len). Drafter's signature and gating logic were updated to enforce token-budget and batch-size constraints; unit tests were updated and a new test was added to validate the gating behavior.
Sequence Diagram(s)

```mermaid
sequenceDiagram
autonumber
participant C as Client
participant E as PyExecutor
participant D as Drafter
participant M as ModelEngine
C->>E: submit requests
E->>M: query model limits (max_num_tokens, max_draft_len)
E->>E: compute max_batch_size, active_requests
E->>D: should_use_spec_decode(active_requests, max_batch_size, max_num_tokens, max_draft_len)
D-->>E: use_spec (bool)
alt use_spec == true
E->>M: enable_spec_decode(true)
note right of E #eef6ff: initialize per-request draft state
E->>E: _prepare_draft_requests()
E->>E: prepare drafting & schedule speculative batch
else
E->>M: enable_spec_decode(false)
E->>E: schedule standard decoding
end
```
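For orientation, here is a minimal sketch of the gating call summarized above; the attribute names (active_requests, max_batch_size, spec_config) follow the walkthrough and the call-site snippet quoted later in the review, so treat them as illustrative rather than authoritative:

```python
# Sketch of the per-step gating decision in PyExecutor, per the walkthrough above.
# The surrounding executor code is not reproduced here.
if self.drafter is not None:
    self.use_spec_decode = self.drafter.should_use_spec_decode(
        self.active_requests,                          # active requests on this rank
        self.max_batch_size,                           # per-step batch-size limit
        self.model_engine.max_num_tokens,              # per-step token budget
        self.model_engine.spec_config.max_draft_len)   # draft tokens per request
    self.model_engine.enable_spec_decode = self.use_spec_decode
```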
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Actionable comments posted: 2
🧹 Nitpick comments (6)
tensorrt_llm/_torch/speculative/drafter.py (2)
1-1: Add NVIDIA copyright header (2025). All Python source files should start with the standard NVIDIA copyright header per repository guidelines.
Apply at file top:
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
29-37: Docstring should describe new parameters and return semantics. The signature expanded, but the docstring didn't. Adding Args/Returns will help downstream users and maintainers.
Apply this diff to enrich the docstring:
""" - You probably don't want to override this. ModelEngine - assumes that speculation is always on if max_concurrency - is not specified by the user's spec config. + Decide whether to enable speculative decoding for this step. + + You probably don't want to override this. ModelEngine assumes that + speculation is always on if max_concurrency is not specified by the + user's spec config. + + Args: + requests: Active requests on this rank. + max_batch_size: Upper bound on sequences schedulable this step. + max_num_tokens: Per-step token budget available for generation + (including draft tokens). + max_draft_len: Maximum number of draft tokens per request. + + Returns: + True iff speculative decoding should be attempted for this step. """tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
923-933: Option: gate after scheduling using actual batch size. Gating before scheduling uses len(self.active_requests), which can be stricter/looser than the scheduled batch, especially with pausing or KV-transfer constraints. Consider moving gating after _schedule() and using scheduled_batch.batch_size and the number of generation requests to compute the effective token demand for this step. If you pursue this, keep a fast path when max_concurrency is None to avoid extra overhead.

tests/unittest/_torch/speculative/test_dynamic_spec_decode.py (3)
1-1: Add NVIDIA copyright header (2025). Tests are also Python sources; please prepend the standard NVIDIA header.
Apply at file top:
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
55-61: Strengthen the mock: assert plumbed values match expectations. Adding lightweight assertions in the mock helps catch regressions in plumbing the new parameters without making the test brittle.
Apply this diff:
-    def mock_should_use_spec_decode(self, requests, max_batch_size,
-                                    max_num_tokens, draft_len):
+    def mock_should_use_spec_decode(self, requests, max_batch_size,
+                                    max_num_tokens, draft_len):
+        # Sanity-check the plumbed values
+        assert isinstance(max_batch_size, int) and max_batch_size >= 1
+        assert draft_len == max_draft_len
+        assert isinstance(max_num_tokens, int) and max_num_tokens > 0
         if not hasattr(mock_should_use_spec_decode, 'call_count'):
             mock_should_use_spec_decode.call_count = 0
         mock_should_use_spec_decode.call_count += 1
         return mock_should_use_spec_decode.call_count <= 2
17-53: Add focused unit tests for should_use_spec_decode logic (no GPU). The current test validates end-to-end behavior with a heavy model. Add a lightweight unit test that constructs a minimal Drafter subclass and exercises edge cases:
- token_cap == 0 (should return False with max_concurrency set)
- requests == [] (should return False)
- num_effective_requests just above/below max_concurrency
I can draft such a test that runs on CPU and completes quickly if you’d like.
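For example, a CPU-only sketch along those lines (the _StubDrafter helper, its constructor argument, and the chosen numbers are illustrative assumptions, not the test that was ultimately added):

```python
from tensorrt_llm._torch.speculative.drafter import Drafter


class _StubDrafter(Drafter):
    """Minimal concrete Drafter so the base-class gating logic can be exercised."""

    # NOTE: passing max_concurrency to the base constructor mirrors the dummy
    # drafter used later in this PR's tests (assumption).
    def prepare_draft_tokens(self, *args, **kwargs) -> None:
        pass  # drafting itself is not under test here


def test_should_use_spec_decode_gating_edge_cases():
    drafter = _StubDrafter(max_concurrency=6)

    # token_cap == 0: a budget of 4 tokens cannot fit 1 target + 4 draft tokens
    assert drafter.should_use_spec_decode([object()] * 3,
                                          max_batch_size=8,
                                          max_num_tokens=4,
                                          max_draft_len=4) is False

    # requests == []: nothing to speculate on
    assert drafter.should_use_spec_decode([],
                                          max_batch_size=8,
                                          max_num_tokens=128,
                                          max_draft_len=4) is False

    # num_effective_requests just below max_concurrency -> speculation on
    assert drafter.should_use_spec_decode([object()] * 5,
                                          max_batch_size=8,
                                          max_num_tokens=128,
                                          max_draft_len=4) is True

    # num_effective_requests above max_concurrency (capped by batch size 8 > 6) -> off
    assert drafter.should_use_spec_decode([object()] * 20,
                                          max_batch_size=8,
                                          max_num_tokens=128,
                                          max_draft_len=4) is False
```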
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (1 hunks)
- tensorrt_llm/_torch/speculative/drafter.py (1 hunks)
- tests/unittest/_torch/speculative/test_dynamic_spec_decode.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in __init__
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tensorrt_llm/_torch/speculative/drafter.py
tests/unittest/_torch/speculative/test_dynamic_spec_decode.py
tensorrt_llm/_torch/pyexecutor/py_executor.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tensorrt_llm/_torch/speculative/drafter.py
tests/unittest/_torch/speculative/test_dynamic_spec_decode.py
tensorrt_llm/_torch/pyexecutor/py_executor.py
🔇 Additional comments (2)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
924-927: Verified — ModelEngine attributes and spec-decode call sites are correct. Quick summary: PyTorchModelEngine exposes max_num_tokens and derives max_draft_len from spec_config; drafter.should_use_spec_decode's signature matches the py_executor call, and no other outdated call sites were found.
Points of evidence:
- tensorrt_llm/_torch/pyexecutor/model_engine.py — PyTorchModelEngine.__init__ sets self.max_num_tokens (class at ~line 260) and sets self.max_draft_len from spec_config (around lines 391–410).
- tensorrt_llm/_torch/speculative/drafter.py — should_use_spec_decode signature expects (requests, max_batch_size, max_num_tokens, max_draft_len). (definition at ~line 30)
- tensorrt_llm/_torch/pyexecutor/py_executor.py — call site at lines ~923–926 passes model_engine.max_num_tokens and model_engine.max_draft_len into should_use_spec_decode.
- Ripgrep search returned no other should_use_spec_decode call sites.
Conclusion: No code changes required for the call in py_executor.py; attributes are present on the concrete engine implementation.
tests/unittest/_torch/speculative/test_dynamic_spec_decode.py (1)
55-65: Verified patch target is correct.
- The ModelDrafter class is defined in tensorrt_llm/_torch/speculative/model_drafter.py (line 38), matching the module path under test.
- The should_use_spec_decode method is implemented on the base Drafter class in tensorrt_llm/_torch/speculative/drafter.py (line 30) and inherited by ModelDrafter, so patching ModelDrafter.should_use_spec_decode will override the behavior invoked by LLM (e.g., in py_executor.py:923).
No changes needed here.
Signed-off-by: Zheyu Fu <[email protected]>
Actionable comments posted: 0
♻️ Duplicate comments (2)
tensorrt_llm/_torch/speculative/drafter.py (1)
39-47: Add explicit guards for zero/invalid capacity; clamp draft length to avoid div-by-zero for negatives. Semantics are currently correct due to 0 < num_effective_requests, but early guards improve readability and avoid computing min() when capacity is zero. Also clamp max_draft_len to tolerate unexpected negatives.

-        # Inputs validated upstream: max_batch_size>0, max_num_tokens>0, max_draft_len>0
+        # Inputs typically validated upstream: max_batch_size>0, max_num_tokens>0, max_draft_len>=0
         if self.max_concurrency is None:
             return True

-        token_cap = max_num_tokens // (1 + max_draft_len)
-        num_effective_requests = min(len(requests), max_batch_size, token_cap)
-
-        return 0 < num_effective_requests <= self.max_concurrency
+        # Defensive guards; keep behavior explicit for zero/empty cases
+        if max_batch_size <= 0 or max_num_tokens <= 0 or not requests:
+            return False
+
+        tokens_per_request = 1 + max(0, max_draft_len)
+        token_cap = max_num_tokens // tokens_per_request
+        if token_cap <= 0:
+            return False
+
+        num_effective_requests = min(len(requests), max_batch_size, token_cap)
+        return num_effective_requests <= self.max_concurrency

tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
922-927: Good fix: gating now uses spec_config.max_draft_len (consistent with draft allocation). This resolves the prior inconsistency between gating and _prepare_draft_requests (which uses spec_config.max_draft_len). Thanks for aligning the sources.
🧹 Nitpick comments (8)
tensorrt_llm/_torch/speculative/drafter.py (3)
1-4: Add NVIDIA copyright header (2025) per repo guidelines. All source files must carry the NVIDIA copyright header.

+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+"""
+This file is part of TensorRT-LLM.
+"""
30-37: Clarify and expand the method docstring; include args/returns and the gating formula. Make the contract explicit and Sphinx/Google-style so downstream users know exactly how the decision is computed.
- def should_use_spec_decode(self, requests: List[LlmRequest], - max_batch_size: int, max_num_tokens: int, - max_draft_len: int) -> bool: - """ - You probably don't want to override this. ModelEngine - assumes that speculation is always on if max_concurrency - is not specified by the user's spec config. - """ + def should_use_spec_decode(self, requests: List[LlmRequest], + max_batch_size: int, max_num_tokens: int, + max_draft_len: int) -> bool: + """ + Decide whether to enable speculative decoding in this step. + + Speculation is enabled when: + - max_concurrency is None (unbounded), or + - 0 < min(len(requests), max_batch_size, + max_num_tokens // (1 + max(0, max_draft_len))) <= max_concurrency + + Args: + requests (List[LlmRequest]): Active requests this step. + max_batch_size (int): Upper bound on batch size for this step. + max_num_tokens (int): Token budget available for this step. + max_draft_len (int): Max draft tokens per request used by the drafter. + + Returns: + bool: True to enable speculative decoding; False to disable. + """
30-47: Optional: surface observability hook. If you'd prefer not to log at call sites, consider emitting a debug-level trace here (behind a cheap bool) with inputs and the decision for troubleshooting. Decline if you'd rather keep the base class silent.
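One possible shape for such a hook, using only the standard logging module (the helper name and message format are illustrative; the project's own logger may differ):

```python
import logging

_logger = logging.getLogger(__name__)


def _trace_spec_decode_decision(num_requests: int, max_batch_size: int,
                                max_num_tokens: int, max_draft_len: int,
                                decision: bool) -> None:
    """Cheap, debug-gated trace of the gating inputs and the resulting decision."""
    if _logger.isEnabledFor(logging.DEBUG):
        _logger.debug(
            "should_use_spec_decode: requests=%d batch_cap=%d token_budget=%d "
            "draft_len=%d -> %s", num_requests, max_batch_size, max_num_tokens,
            max_draft_len, decision)
```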
tensorrt_llm/_torch/pyexecutor/py_executor.py (2)
1-3: Add NVIDIA copyright header (2025) at file top. This file is missing the required header.

+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+"""
+This file is part of TensorRT-LLM.
+"""
922-928: Log the gating decision for visibility (debug-level). A one-line debug log helps diagnose unexpected toggling without stepping through. Cheap to compute and guarded by the logger's level.

         if self.drafter is not None:
             self.use_spec_decode = self.drafter.should_use_spec_decode(
                 self.active_requests, self.max_batch_size,
                 self.model_engine.max_num_tokens,
                 self.model_engine.spec_config.max_draft_len)
             self.model_engine.enable_spec_decode = self.use_spec_decode
+            logger.debug(
+                "spec gating: active=%d, max_batch_size=%d, max_num_tokens=%d, "
+                "max_draft_len=%d -> use_spec_decode=%s",
+                len(self.active_requests),
+                self.max_batch_size,
+                self.model_engine.max_num_tokens,
+                self.model_engine.spec_config.max_draft_len,
+                str(self.use_spec_decode),
+            )

tests/unittest/_torch/speculative/test_dynamic_spec_decode.py (3)
1-4: Add NVIDIA copyright header (2025). Tests are also subject to the header policy.
+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
91-147: Strengthen unit coverage with an unbounded-concurrency case; minor tidy-ups.
- Add a case where max_concurrency=None (should always enable speculation).
- Optional: the earlier sampling_params = SamplingParams(max_tokens=128, ...) assignment is unused; drop it.
- Optional: consider using a tiny lightweight LlmRequest-like stub (with only what's needed) rather than object() to future-proof the test if the method ever inspects attributes.

@@ drafter = _DummyDrafter(max_concurrency=6) @@
     assert drafter.should_use_spec_decode(active_requests,
                                           max_batch_size=8,
                                           max_num_tokens=4,
                                           max_draft_len=4) is False
+    # Unbounded concurrency ON case (should always enable)
+    drafter_unbounded = _DummyDrafter(max_concurrency=None)
+    active_requests = [object()] * 100
+    assert drafter_unbounded.should_use_spec_decode(
+        active_requests,
+        max_batch_size=8,
+        max_num_tokens=16,
+        max_draft_len=7) is True

And the small cleanup in the first test:
- sampling_params = SamplingParams(max_tokens=128, temperature=0)
91-147: Optionally parametrize the truth table to reduce duplication. A table-driven test with pytest parametrization would make it easier to extend scenarios without repetitive boilerplate. Happy to provide a parametrized version if you want it in this PR.
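For illustration, a parametrized version could look roughly like this (the case table mirrors the scenarios above; _DummyDrafter and the exact values are assumptions about the helpers in this test file):

```python
import pytest


@pytest.mark.parametrize(
    "num_requests,max_batch_size,max_num_tokens,max_draft_len,expected",
    [
        (5, 8, 128, 4, True),     # fits: min(5, 8, 128 // 5) = 5 <= max_concurrency
        (3, 8, 4, 4, False),      # token budget too small: 4 // (1 + 4) == 0
        (0, 8, 128, 4, False),    # no active requests
        (100, 8, 128, 4, False),  # capped at batch size 8, still above max_concurrency=6
    ],
)
def test_should_use_spec_decode_table(num_requests, max_batch_size,
                                      max_num_tokens, max_draft_len, expected):
    # _DummyDrafter is assumed to be the helper defined earlier in this test module.
    drafter = _DummyDrafter(max_concurrency=6)
    active_requests = [object()] * num_requests
    assert drafter.should_use_spec_decode(active_requests, max_batch_size,
                                          max_num_tokens,
                                          max_draft_len) is expected
```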
📜 Review details
📒 Files selected for processing (3)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (1 hunks)
- tensorrt_llm/_torch/speculative/drafter.py (1 hunks)
- tests/unittest/_torch/speculative/test_dynamic_spec_decode.py (2 hunks)
🔇 Additional comments (2)
tests/unittest/_torch/speculative/test_dynamic_spec_decode.py (2)
55-56: Mock signature update looks correct. Matches the production method's (requests, max_batch_size, max_num_tokens, draft_len) and maintains the original behavior (first two calls True, then False).
1-150: Sanity check complete: all should_use_spec_decode call sites updated. All invocations of should_use_spec_decode in the repository now supply at least four arguments, matching the updated signature. No further changes are required.
Signed-off-by: Zheyu Fu <[email protected]>
Signed-off-by: Zheyu Fu <[email protected]>
Actionable comments posted: 0
♻️ Duplicate comments (1)
tensorrt_llm/_torch/speculative/drafter.py (1)
39-55: Gating logic aligns with PR goals; add an optional guard for zero draft length.
- The zero/empty/budget guards and token-cap gating are correct and address earlier feedback. Nice.
- Optional: if max_draft_len == 0, speculation is effectively a no-op; returning False avoids unnecessary toggling of enable_spec_decode and avoids extra drafter work.
(Complements the earlier “guard insufficient token budget” thread that you addressed.)
         # Defensive guards; keep behavior explicit for zero/empty cases
-        if not requests or max_batch_size <= 0 or max_num_tokens <= 0:
+        if not requests or max_batch_size <= 0 or max_num_tokens <= 0:
             return False
+        # No drafts → no speculation benefit; avoid enabling spec decode
+        if max_draft_len <= 0:
+            return False
🧹 Nitpick comments (6)
tensorrt_llm/_torch/speculative/drafter.py (2)
1-1: Add NVIDIA Apache-2.0 header (2025). Per repository guidelines, prepend the NVIDIA Apache-2.0 copyright header with the current year to this source file.
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# http://www.apache.org/licenses/LICENSE-2.0 +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License.
33-37: Document the new parameters and behavior. Expand the docstring to a Google-style docstring so external users know exactly how gating is computed.

-        """
-        You probably don't want to override this. ModelEngine
-        assumes that speculation is always on if max_concurrency
-        is not specified by the user's spec config.
-        """
+        """
+        Decide whether to enable speculative decoding for this step.
+
+        You probably don't want to override this. ModelEngine assumes that
+        speculation is always on if max_concurrency is not specified by the
+        user's spec config.
+
+        Args:
+            requests (List[LlmRequest]): Currently active requests.
+            max_batch_size (int): Executor's per-step batch-size limit.
+            max_num_tokens (int): Executor's per-step token budget for decoding.
+            max_draft_len (int): Maximum draft length per request for speculative decoding.
+
+        Returns:
+            bool: True if speculation should be enabled given concurrency, batch size,
+                and token budget constraints; otherwise False.
+        """

tensorrt_llm/_torch/pyexecutor/py_executor.py (2)
1319-1320: Align dummy-request draft allocation with spec_config. The ADP dummy path uses self.max_draft_len, which may diverge from spec_config.max_draft_len used elsewhere (gating and draft prep). Use the same source to avoid inconsistent allocations.

-                max_num_draft_tokens=self.max_draft_len,
+                max_num_draft_tokens=self.model_engine.spec_config.max_draft_len,
961-970: Minor: avoid calling drafter.prepare_draft_tokens when no drafts (max_draft_len == 0). If you accept the optional guard in Drafter.should_use_spec_decode for zero draft length, this will naturally skip. If not, you can add a cheap local check to avoid a no-op invocation.

-        if self.drafter is not None and self.use_spec_decode:
+        if (self.drafter is not None and self.use_spec_decode
+                and self.model_engine.spec_config.max_draft_len > 0):

tests/unittest/_torch/speculative/test_dynamic_spec_decode.py (2)
1-1
: Add NVIDIA Apache-2.0 header (2025) to test file.+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# http://www.apache.org/licenses/LICENSE-2.0 +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License.
91-146: Good direct unit coverage; add a few targeted edge cases. Nice focused test for the new gating logic. Consider adding:
- max_draft_len == 0 should be False (if adopting the optional guard).
- max_concurrency=None path should always return True.
- max_batch_size <= 0 should be False (already guarded in code but not asserted here).
Minimal diff to extend the test:
@@ drafter = _DummyDrafter(max_concurrency=6) @@
     assert drafter.should_use_spec_decode(active_requests,
                                           max_batch_size=8,
                                           max_num_tokens=4,
                                           max_draft_len=4) is False
+    # Edge case - Zero draft length OFF (no speculation benefit)
+    active_requests = [object()] * 3
+    assert drafter.should_use_spec_decode(active_requests,
+                                          max_batch_size=8,
+                                          max_num_tokens=128,
+                                          max_draft_len=0) is False
+
+    # Edge case - max_batch_size <= 0 OFF
+    active_requests = [object()] * 3
+    assert drafter.should_use_spec_decode(active_requests,
+                                          max_batch_size=0,
+                                          max_num_tokens=128,
+                                          max_draft_len=4) is False
+
+    # Behavior when concurrency is unspecified: always ON
+    drafter_unbounded = _DummyDrafter(max_concurrency=None)
+    active_requests = [object()] * 100
+    assert drafter_unbounded.should_use_spec_decode(active_requests,
+                                                    max_batch_size=1,
+                                                    max_num_tokens=1,
+                                                    max_draft_len=4) is True

If you prefer, I can also parametrize this test with pytest to reduce duplication.
📜 Review details
📒 Files selected for processing (3)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (1 hunks)
- tensorrt_llm/_torch/speculative/drafter.py (1 hunks)
- tests/unittest/_torch/speculative/test_dynamic_spec_decode.py (2 hunks)
🧬 Code graph analysis (3)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
tests/unittest/llmapi/apps/_test_openai_misc.py (1)
max_batch_size (30-31)
tensorrt_llm/_torch/speculative/drafter.py (1)
tensorrt_llm/_torch/pyexecutor/llm_request.py (1)
LlmRequest (282-424)
tests/unittest/_torch/speculative/test_dynamic_spec_decode.py (1)
tensorrt_llm/_torch/speculative/drafter.py (3)
Drafter (9-54), prepare_draft_tokens (16-27), should_use_spec_decode (30-54)
🔇 Additional comments (3)
tensorrt_llm/_torch/speculative/drafter.py (1)
30-32: Verified: no overrides or stale callers for should_use_spec_decode. All occurrences of def should_use_spec_decode are in tensorrt_llm/_torch/speculative/drafter.py (line 30); no subclass redefinitions were found. All call sites were scanned and confirmed to pass at least four arguments to the new signature.
– Definition only at: tensorrt_llm/_torch/speculative/drafter.py:30
– No subclasses override should_use_spec_decode
– All invocations supply ≥ 4 arguments

tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
889-891: ⚠️ Verify max_num_tokens semantics. It looks like model_engine.max_num_tokens isn't set anywhere in the Python ModelEngine class itself; it's imported from the underlying TensorRT engine's build-time configuration rather than computed per forward call. Please confirm that this value truly represents your per-step token budget (i.e. target + draft tokens) and doesn't include context-only tokens or a global theoretical maximum. If it does, the gating calculation in PyExecutor may allow speculation when it shouldn't.
- No Python assignment to self.max_num_tokens in tensorrt_llm/_torch/pyexecutor/model_engine.py
- Ensure your engine build's max_num_tokens matches the desired per-iteration limit for speculation gating

tests/unittest/_torch/speculative/test_dynamic_spec_decode.py (1)
55-56: Mock signature update matches the new API. The patched mock now accepts (requests, max_batch_size, max_num_tokens, max_draft_len). Good.
Signed-off-by: Zheyu Fu <[email protected]>
/bot run |
PR_Github #16475 [ run ] triggered by Bot |
Signed-off-by: Zheyu Fu <[email protected]>
Force-pushed 9a28cda to 0b54ebe
Force-pushed 0b54ebe to a9be135
PR_Github #16475 [ run ] completed with state |
/bot run |
Force-pushed a9be135 to 394ccc8
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/speculative/drafter.py (1)
1-4: Compliance: add the NVIDIA copyright header (2025). The repository guidelines require the NVIDIA header on all source files; missing headers can trip CI/license checks.
Apply at the very top of the file:
+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0
+# Unless required by applicable law or agreed to in writing, software distributed under
+# the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND,
+# either express or implied. See the License for the specific language governing permissions and
+# limitations under the License.

If the repo uses a different boilerplate, please paste the project-standard header instead.
🧹 Nitpick comments (3)
tensorrt_llm/_torch/speculative/drafter.py (3)
41-47: Guard ordering: avoid returning True with invalid/degenerate inputs when concurrency is unlimited. With max_concurrency is None, the function returns True before hitting the defensive guards. That enables speculation even when requests is empty or max_batch_size/max_num_tokens are non-positive. Suggest validating inputs and token_cap first; then apply the "unlimited concurrency" fast path only when at least one request can actually be scheduled.
Apply this minimal reorder so True is only reachable when something fits:
-        if self.max_concurrency is None:
-            return True
-
-        # Defensive guards; keep behavior explicit for zero/empty cases
-        if not requests or max_batch_size <= 0 or max_num_tokens <= 0:
-            return False
+        # Defensive guards; keep behavior explicit for zero/empty cases
+        if not requests or max_batch_size <= 0 or max_num_tokens <= 0:
+            return False

And move the unlimited-concurrency path after computing num_effective_requests:
-        num_effective_requests = min(len(requests), max_batch_size, token_cap)
-        return num_effective_requests <= self.max_concurrency
+        num_effective_requests = min(len(requests), max_batch_size, token_cap)
+        if self.max_concurrency is None:
+            # Unlimited concurrency: we've already established at least one request can fit.
+            return True
+        return num_effective_requests <= self.max_concurrency

This keeps the documented behavior while preventing "spec enabled" on no-op steps.
30-33: Add a Google-style docstring to clarify inputs, semantics, and edge cases. Documenting expectations (e.g., whether requests includes only target requests, how max_num_tokens is measured, and the "unlimited concurrency" behavior) will reduce ambiguity for future maintainers and test authors.
Apply:
def should_use_spec_decode(self, requests: List[LlmRequest], max_batch_size: int, max_num_tokens: int, max_draft_len: int) -> bool: - """ - You probably don't want to override this. ModelEngine - assumes that speculation is always on if max_concurrency - is not specified by the user's spec config. - """ + """Decide whether to enable speculative decoding for this step. + + Args: + requests: Active target requests under consideration this step. + max_batch_size: Upper bound of sequences schedulable this step (> 0). + max_num_tokens: Per-step token budget for decoding (targets + draft) (> 0). + max_draft_len: Max proposed draft tokens per request (>= 0). + + Returns: + True if speculation should be enabled given the per-step token budget, + batch-size cap, and the configured concurrency limit. If + max_concurrency is None (unlimited), speculation is enabled as long + as at least one request fits the budget. + """
30-33: Broaden input type to Sequence to avoid over-constraining callers. The method only reads length and iterates; accepting Sequence[LlmRequest] makes it friendlier to upstream schedulers without forcing a list.
Apply:
-from typing import List, Optional, final
+from typing import Optional, Sequence, final
@@
-    def should_use_spec_decode(self, requests: List[LlmRequest],
+    def should_use_spec_decode(self, requests: Sequence[LlmRequest],
                                max_batch_size: int, max_num_tokens: int,
                                max_draft_len: int) -> bool:
📜 Review details
📒 Files selected for processing (3)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (1 hunks)
- tensorrt_llm/_torch/speculative/drafter.py (1 hunks)
- tests/unittest/_torch/speculative/test_dynamic_spec_decode.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- tensorrt_llm/_torch/pyexecutor/py_executor.py
- tests/unittest/_torch/speculative/test_dynamic_spec_decode.py
🧬 Code graph analysis (1)
tensorrt_llm/_torch/speculative/drafter.py (1)
tensorrt_llm/_torch/pyexecutor/llm_request.py (1)
LlmRequest (282-424)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (1)
tensorrt_llm/_torch/speculative/drafter.py (1)
30-33: Smarter capacity gate + API expansion look good. The new signature and the per-step gating via min(len(requests), max_batch_size, token_cap) align with the PR objective and should reduce unnecessary speculation disables. The @final choice also makes sense here.
/bot run |
Signed-off-by: Zheyu Fu <[email protected]>
Signed-off-by: Zheyu Fu <[email protected]>
Force-pushed 394ccc8 to d0adbbd
/bot run |
PR_Github #16582 [ run ] triggered by Bot |
Actionable comments posted: 1
🧹 Nitpick comments (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
1318-1321: Use the same draft length source for ADP dummy requests (keep gating and allocation consistent). Here: max_num_draft_tokens=self.max_draft_len. Elsewhere we gate and allocate drafts using spec_config.max_draft_len. If self.max_draft_len differs from spec_config.max_draft_len, ADP dummy requests will be sized inconsistently vs. the gating decision.
Apply this small change:
-                max_num_draft_tokens=self.max_draft_len,
+                max_num_draft_tokens=self.model_engine.spec_config.max_draft_len,

Follow-up (optional): consider deprecating PyExecutor.max_draft_len in favor of model_engine.spec_config.max_draft_len to avoid future drift.
📜 Review details
📒 Files selected for processing (3)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (1 hunks)
- tensorrt_llm/_torch/speculative/drafter.py (1 hunks)
- tests/unittest/_torch/speculative/test_dynamic_spec_decode.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- tensorrt_llm/_torch/speculative/drafter.py
- tests/unittest/_torch/speculative/test_dynamic_spec_decode.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (2)
tensorrt_llm/_torch/pyexecutor/py_executor.py (2)
888-897: LGTM: smarter gating call wired correctly and consistent draft length source. Passing active_requests, max_batch_size, max_num_tokens, and spec_config.max_draft_len into Drafter.should_use_spec_decode aligns the executor's decision with batch-size and token-budget limits. Using spec_config.max_draft_len here matches _prepare_draft_requests (Line 1016+), eliminating the prior inconsistency.
889-891: All should_use_spec_decode call sites are up-to-date. I ran the provided ripgrep checks across the repository and found only:
- tensorrt_llm/_torch/pyexecutor/py_executor.py – updated to pass all four parameters
- tests/unittest/_torch/speculative/test_dynamic_spec_decode.py – all assertions invoke should_use_spec_decode with the new multi-arg signature (and rely on the default for any omitted parameter)
No stale single-argument calls remain.
PR_Github #16582 [ run ] completed with state |
/bot run --disable-fail-fast |
PR_Github #16590 [ run ] triggered by Bot |
PR_Github #16590 [ run ] completed with state |
/bot run --disable-fail-fast |
PR_Github #16738 [ run ] triggered by Bot |
PR_Github #16738 [ run ] completed with state |
Signed-off-by: Zheyu Fu <[email protected]>
…_spec Signed-off-by: Zheyu Fu <[email protected]>
/bot run --disable-fail-fast |
PR_Github #16958 [ run ] triggered by Bot |
PR_Github #16958 [ run ] completed with state |
/bot run --disable-fail-fast |
PR_Github #17013 [ run ] triggered by Bot |
PR_Github #17013 [ run ] completed with state |
/bot run --disable-fail-fast |
PR_Github #17062 [ run ] triggered by Bot |
PR_Github #17062 [ run ] completed with state |
Description
Previously, we disabled speculation solely when len(active_requests) > max_concurrency, which was conservative. We now gate by a better estimated per-step capacity (min(len(active_requests), max_batch_size, max_num_tokens // (1 + draft_len))), so speculation isn’t unnecessarily turned off when the schedulable batch is small. This respects real decoding limits, improving GPU utilization and latency.
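For reference, a minimal restatement of the new gating rule in plain Python (illustrative only; the real method lives on Drafter in tensorrt_llm/_torch/speculative/drafter.py, takes max_concurrency from the spec config, and gained extra defensive guards during review):

```python
def should_use_spec_decode(active_requests, max_batch_size, max_num_tokens,
                           max_draft_len, max_concurrency):
    """Illustrative form of the gating rule described above."""
    if max_concurrency is None:
        # No concurrency cap configured: speculation stays on.
        return True
    # Each speculating request consumes one target token plus its draft tokens.
    token_cap = max_num_tokens // (1 + max_draft_len)
    # Estimated per-step capacity, bounded by active requests, batch size, and token budget.
    num_effective_requests = min(len(active_requests), max_batch_size, token_cap)
    return 0 < num_effective_requests <= max_concurrency
```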
Test Coverage
There are existing unit tests in TensorRTLLM/tests/unittest/_torch/speculative/test_dynamic_spec_decode.py. Also added some quick unit tests in this file.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id
(OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test
(OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast
(OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test
(OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx"
(OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe"
(OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp"
(OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test
(OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test
(OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test
(OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge
(OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"
(OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log
(OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug
(OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.

kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request.
--comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
Summary by CodeRabbit
New Features
Refactor
Tests