[None][feat] Make the should_use_spec_decode logic a bit smarter #7112
Conversation
📝 Walkthrough

PyExecutor now calls Drafter.should_use_spec_decode with four arguments (active_requests, max_batch_size, max_num_tokens, max_draft_len). Drafter's signature and gating logic were updated to enforce token-budget and batch-size constraints; unit tests were updated and a new test was added to validate the gating behavior.
Sequence Diagram(s)

```mermaid
sequenceDiagram
autonumber
participant C as Client
participant E as PyExecutor
participant D as Drafter
participant M as ModelEngine
C->>E: submit requests
E->>M: query model limits (max_num_tokens, max_draft_len)
E->>E: compute max_batch_size, active_requests
E->>D: should_use_spec_decode(active_requests, max_batch_size, max_num_tokens, max_draft_len)
D-->>E: use_spec (bool)
alt use_spec == true
E->>M: enable_spec_decode(true)
note right of E #eef6ff: initialize per-request draft state
E->>E: _prepare_draft_requests()
E->>E: prepare drafting & schedule speculative batch
else
E->>M: enable_spec_decode(false)
E->>E: schedule standard decoding
end
```
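For orientation, here is a minimal sketch of the gating call summarized above; the attribute names (active_requests, max_batch_size, spec_config) follow the walkthrough and the call-site snippet quoted later in the review, so treat them as illustrative rather than authoritative:

```python
# Sketch of the per-step gating decision in PyExecutor, per the walkthrough above.
# The surrounding executor code is not reproduced here.
if self.drafter is not None:
    self.use_spec_decode = self.drafter.should_use_spec_decode(
        self.active_requests,                          # active requests on this rank
        self.max_batch_size,                           # per-step batch-size limit
        self.model_engine.max_num_tokens,              # per-step token budget
        self.model_engine.spec_config.max_draft_len)   # draft tokens per request
    self.model_engine.enable_spec_decode = self.use_spec_decode
```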
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Actionable comments posted: 2
🧹 Nitpick comments (6)
tensorrt_llm/_torch/speculative/drafter.py (2)
1-1: Add NVIDIA copyright header (2025). All Python source files should start with the standard NVIDIA copyright header per repository guidelines.
Apply at file top:
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
29-37: Docstring should describe new parameters and return semantics. The signature expanded, but the docstring didn't. Adding Args/Returns will help downstream users and maintainers.
Apply this diff to enrich the docstring:
""" - You probably don't want to override this. ModelEngine - assumes that speculation is always on if max_concurrency - is not specified by the user's spec config. + Decide whether to enable speculative decoding for this step. + + You probably don't want to override this. ModelEngine assumes that + speculation is always on if max_concurrency is not specified by the + user's spec config. + + Args: + requests: Active requests on this rank. + max_batch_size: Upper bound on sequences schedulable this step. + max_num_tokens: Per-step token budget available for generation + (including draft tokens). + max_draft_len: Maximum number of draft tokens per request. + + Returns: + True iff speculative decoding should be attempted for this step. """tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
923-933: Option: gate after scheduling using actual batch size. Gating before scheduling uses len(self.active_requests), which can be stricter/looser than the scheduled batch, especially with pausing or KV-transfer constraints. Consider moving gating after _schedule() and using scheduled_batch.batch_size and the number of generation requests to compute the effective token demand for this step. If you pursue this, keep a fast path when max_concurrency is None to avoid extra overhead.

tests/unittest/_torch/speculative/test_dynamic_spec_decode.py (3)
1-1: Add NVIDIA copyright header (2025). Tests are also Python sources; please prepend the standard NVIDIA header.
Apply at file top:
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
55-61: Strengthen the mock: assert plumbed values match expectations. Adding lightweight assertions in the mock helps catch regressions in plumbing the new parameters without making the test brittle.
Apply this diff:
-    def mock_should_use_spec_decode(self, requests, max_batch_size,
-                                    max_num_tokens, draft_len):
+    def mock_should_use_spec_decode(self, requests, max_batch_size,
+                                    max_num_tokens, draft_len):
+        # Sanity-check the plumbed values
+        assert isinstance(max_batch_size, int) and max_batch_size >= 1
+        assert draft_len == max_draft_len
+        assert isinstance(max_num_tokens, int) and max_num_tokens > 0
         if not hasattr(mock_should_use_spec_decode, 'call_count'):
             mock_should_use_spec_decode.call_count = 0
         mock_should_use_spec_decode.call_count += 1
         return mock_should_use_spec_decode.call_count <= 2
17-53: Add focused unit tests for should_use_spec_decode logic (no GPU). The current test validates end-to-end behavior with a heavy model. Add a lightweight unit test that constructs a minimal Drafter subclass and exercises edge cases:
- token_cap == 0 (should return False with max_concurrency set)
- requests == [] (should return False)
- num_effective_requests just above/below max_concurrency
I can draft such a test that runs on CPU and completes quickly if you’d like.
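For example, a CPU-only sketch along those lines (the _StubDrafter helper, its constructor argument, and the chosen numbers are illustrative assumptions, not the test that was ultimately added):

```python
from tensorrt_llm._torch.speculative.drafter import Drafter


class _StubDrafter(Drafter):
    """Minimal concrete Drafter so the base-class gating logic can be exercised."""

    # NOTE: passing max_concurrency to the base constructor mirrors the dummy
    # drafter used later in this PR's tests (assumption).
    def prepare_draft_tokens(self, *args, **kwargs) -> None:
        pass  # drafting itself is not under test here


def test_should_use_spec_decode_gating_edge_cases():
    drafter = _StubDrafter(max_concurrency=6)

    # token_cap == 0: a budget of 4 tokens cannot fit 1 target + 4 draft tokens
    assert drafter.should_use_spec_decode([object()] * 3,
                                          max_batch_size=8,
                                          max_num_tokens=4,
                                          max_draft_len=4) is False

    # requests == []: nothing to speculate on
    assert drafter.should_use_spec_decode([],
                                          max_batch_size=8,
                                          max_num_tokens=128,
                                          max_draft_len=4) is False

    # num_effective_requests just below max_concurrency -> speculation on
    assert drafter.should_use_spec_decode([object()] * 5,
                                          max_batch_size=8,
                                          max_num_tokens=128,
                                          max_draft_len=4) is True

    # num_effective_requests above max_concurrency (capped by batch size 8 > 6) -> off
    assert drafter.should_use_spec_decode([object()] * 20,
                                          max_batch_size=8,
                                          max_num_tokens=128,
                                          max_draft_len=4) is False
```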
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (1 hunks)
- tensorrt_llm/_torch/speculative/drafter.py (1 hunks)
- tests/unittest/_torch/speculative/test_dynamic_spec_decode.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in __init__
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tensorrt_llm/_torch/speculative/drafter.py
tests/unittest/_torch/speculative/test_dynamic_spec_decode.py
tensorrt_llm/_torch/pyexecutor/py_executor.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tensorrt_llm/_torch/speculative/drafter.py
tests/unittest/_torch/speculative/test_dynamic_spec_decode.py
tensorrt_llm/_torch/pyexecutor/py_executor.py
🔇 Additional comments (2)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
924-927: Verified — ModelEngine attributes and spec-decode call sites are correct. Quick summary: PyTorchModelEngine exposes max_num_tokens and derives max_draft_len from spec_config; drafter.should_use_spec_decode's signature matches the py_executor call, and no other outdated call sites were found.
Points of evidence:
- tensorrt_llm/_torch/pyexecutor/model_engine.py — PyTorchModelEngine.__init__ sets self.max_num_tokens (class at ~line 260) and sets self.max_draft_len from spec_config (around lines 391–410).
- tensorrt_llm/_torch/speculative/drafter.py — should_use_spec_decode signature expects (requests, max_batch_size, max_num_tokens, max_draft_len). (definition at ~line 30)
- tensorrt_llm/_torch/pyexecutor/py_executor.py — call site at lines ~923–926 passes model_engine.max_num_tokens and model_engine.max_draft_len into should_use_spec_decode.
- Ripgrep search returned no other should_use_spec_decode call sites.
Conclusion: No code changes required for the call in py_executor.py; attributes are present on the concrete engine implementation.
tests/unittest/_torch/speculative/test_dynamic_spec_decode.py (1)
55-65: Verified patch target is correct.
- The ModelDrafter class is defined in tensorrt_llm/_torch/speculative/model_drafter.py (line 38), matching the module path under test.
- The should_use_spec_decode method is implemented on the base Drafter class in tensorrt_llm/_torch/speculative/drafter.py (line 30) and inherited by ModelDrafter, so patching ModelDrafter.should_use_spec_decode will override the behavior invoked by LLM (e.g., in py_executor.py:923).
No changes needed here.
Signed-off-by: Zheyu Fu <[email protected]>
Actionable comments posted: 0
♻️ Duplicate comments (2)
tensorrt_llm/_torch/speculative/drafter.py (1)
39-47: Add explicit guards for zero/invalid capacity; clamp draft length to avoid div-by-zero for negatives. Semantics are currently correct due to 0 < num_effective_requests, but early guards improve readability and avoid computing min() when capacity is zero. Also clamp max_draft_len to tolerate unexpected negatives.

-        # Inputs validated upstream: max_batch_size>0, max_num_tokens>0, max_draft_len>0
+        # Inputs typically validated upstream: max_batch_size>0, max_num_tokens>0, max_draft_len>=0
         if self.max_concurrency is None:
             return True

-        token_cap = max_num_tokens // (1 + max_draft_len)
-        num_effective_requests = min(len(requests), max_batch_size, token_cap)
-
-        return 0 < num_effective_requests <= self.max_concurrency
+        # Defensive guards; keep behavior explicit for zero/empty cases
+        if max_batch_size <= 0 or max_num_tokens <= 0 or not requests:
+            return False
+
+        tokens_per_request = 1 + max(0, max_draft_len)
+        token_cap = max_num_tokens // tokens_per_request
+        if token_cap <= 0:
+            return False
+
+        num_effective_requests = min(len(requests), max_batch_size, token_cap)
+        return num_effective_requests <= self.max_concurrency

tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
922-927: Good fix: gating now uses spec_config.max_draft_len (consistent with draft allocation). This resolves the prior inconsistency between gating and _prepare_draft_requests (which uses spec_config.max_draft_len). Thanks for aligning the sources.
🧹 Nitpick comments (8)
tensorrt_llm/_torch/speculative/drafter.py (3)
1-4: Add NVIDIA copyright header (2025) per repo guidelines. All source files must carry the NVIDIA copyright header.

+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+"""
+This file is part of TensorRT-LLM.
+"""
30-37: Clarify and expand the method docstring; include args/returns and the gating formula. Make the contract explicit and Sphinx/Google-style so downstream users know exactly how the decision is computed.
- def should_use_spec_decode(self, requests: List[LlmRequest], - max_batch_size: int, max_num_tokens: int, - max_draft_len: int) -> bool: - """ - You probably don't want to override this. ModelEngine - assumes that speculation is always on if max_concurrency - is not specified by the user's spec config. - """ + def should_use_spec_decode(self, requests: List[LlmRequest], + max_batch_size: int, max_num_tokens: int, + max_draft_len: int) -> bool: + """ + Decide whether to enable speculative decoding in this step. + + Speculation is enabled when: + - max_concurrency is None (unbounded), or + - 0 < min(len(requests), max_batch_size, + max_num_tokens // (1 + max(0, max_draft_len))) <= max_concurrency + + Args: + requests (List[LlmRequest]): Active requests this step. + max_batch_size (int): Upper bound on batch size for this step. + max_num_tokens (int): Token budget available for this step. + max_draft_len (int): Max draft tokens per request used by the drafter. + + Returns: + bool: True to enable speculative decoding; False to disable. + """
30-47: Optional: surface observability hook. If you'd prefer not to log at call sites, consider emitting a debug-level trace here (behind a cheap bool) with inputs and the decision for troubleshooting. Decline if you'd rather keep the base class silent.
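One possible shape for such a hook, using only the standard logging module (the helper name and message format are illustrative; the project's own logger may differ):

```python
import logging

_logger = logging.getLogger(__name__)


def _trace_spec_decode_decision(num_requests: int, max_batch_size: int,
                                max_num_tokens: int, max_draft_len: int,
                                decision: bool) -> None:
    """Cheap, debug-gated trace of the gating inputs and the resulting decision."""
    if _logger.isEnabledFor(logging.DEBUG):
        _logger.debug(
            "should_use_spec_decode: requests=%d batch_cap=%d token_budget=%d "
            "draft_len=%d -> %s", num_requests, max_batch_size, max_num_tokens,
            max_draft_len, decision)
```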
tensorrt_llm/_torch/pyexecutor/py_executor.py (2)
1-3: Add NVIDIA copyright header (2025) at file top. This file is missing the required header.

+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+#
+"""
+This file is part of TensorRT-LLM.
+"""
922-928: Log the gating decision for visibility (debug-level). A one-line debug log helps diagnose unexpected toggling without stepping through. Cheap to compute and guarded by the logger's level.

         if self.drafter is not None:
             self.use_spec_decode = self.drafter.should_use_spec_decode(
                 self.active_requests, self.max_batch_size,
                 self.model_engine.max_num_tokens,
                 self.model_engine.spec_config.max_draft_len)
             self.model_engine.enable_spec_decode = self.use_spec_decode
+            logger.debug(
+                "spec gating: active=%d, max_batch_size=%d, max_num_tokens=%d, "
+                "max_draft_len=%d -> use_spec_decode=%s",
+                len(self.active_requests),
+                self.max_batch_size,
+                self.model_engine.max_num_tokens,
+                self.model_engine.spec_config.max_draft_len,
+                str(self.use_spec_decode),
+            )

tests/unittest/_torch/speculative/test_dynamic_spec_decode.py (3)
1-4: Add NVIDIA copyright header (2025). Tests are also subject to the header policy.
+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
91-147: Strengthen unit coverage with an unbounded-concurrency case; minor tidy-ups.
- Add a case where max_concurrency=None (should always enable speculation).
- Optional: the earlier sampling_params = SamplingParams(max_tokens=128, ...) assignment is unused; drop it.
- Optional: consider using a tiny lightweight LlmRequest-like stub (with only what's needed) rather than object() to future-proof the test if the method ever inspects attributes.

@@ drafter = _DummyDrafter(max_concurrency=6) @@
     assert drafter.should_use_spec_decode(active_requests,
                                           max_batch_size=8,
                                           max_num_tokens=4,
                                           max_draft_len=4) is False
+    # Unbounded concurrency ON case (should always enable)
+    drafter_unbounded = _DummyDrafter(max_concurrency=None)
+    active_requests = [object()] * 100
+    assert drafter_unbounded.should_use_spec_decode(
+        active_requests,
+        max_batch_size=8,
+        max_num_tokens=16,
+        max_draft_len=7) is True

And the small cleanup in the first test:
- sampling_params = SamplingParams(max_tokens=128, temperature=0)
91-147: Optionally parametrize the truth table to reduce duplication. A table-driven test with pytest parametrization would make it easier to extend scenarios without repetitive boilerplate. Happy to provide a parametrized version if you want it in this PR.
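For illustration, a parametrized version could look roughly like this (the case table mirrors the scenarios above; _DummyDrafter and the exact values are assumptions about the helpers in this test file):

```python
import pytest


@pytest.mark.parametrize(
    "num_requests,max_batch_size,max_num_tokens,max_draft_len,expected",
    [
        (5, 8, 128, 4, True),     # fits: min(5, 8, 128 // 5) = 5 <= max_concurrency
        (3, 8, 4, 4, False),      # token budget too small: 4 // (1 + 4) == 0
        (0, 8, 128, 4, False),    # no active requests
        (100, 8, 128, 4, False),  # capped at batch size 8, still above max_concurrency=6
    ],
)
def test_should_use_spec_decode_table(num_requests, max_batch_size,
                                      max_num_tokens, max_draft_len, expected):
    # _DummyDrafter is assumed to be the helper defined earlier in this test module.
    drafter = _DummyDrafter(max_concurrency=6)
    active_requests = [object()] * num_requests
    assert drafter.should_use_spec_decode(active_requests, max_batch_size,
                                          max_num_tokens,
                                          max_draft_len) is expected
```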
📜 Review details
📒 Files selected for processing (3)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (1 hunks)
- tensorrt_llm/_torch/speculative/drafter.py (1 hunks)
- tests/unittest/_torch/speculative/test_dynamic_spec_decode.py (2 hunks)
🔇 Additional comments (2)
tests/unittest/_torch/speculative/test_dynamic_spec_decode.py (2)
55-56: Mock signature update looks correct. Matches the production method's (requests, max_batch_size, max_num_tokens, draft_len) and maintains the original behavior (first two calls True, then False).
1-150: Sanity check complete: all should_use_spec_decode call sites updated. All invocations of should_use_spec_decode in the repository now supply at least four arguments, matching the updated signature. No further changes are required.
Signed-off-by: Zheyu Fu <[email protected]>
Signed-off-by: Zheyu Fu <[email protected]>
Actionable comments posted: 0
♻️ Duplicate comments (1)
tensorrt_llm/_torch/speculative/drafter.py (1)
39-55: Gating logic aligns with PR goals; add an optional guard for zero draft length.
- The zero/empty/budget guards and token-cap gating are correct and address earlier feedback. Nice.
- Optional: if max_draft_len == 0, speculation is effectively a no-op; returning False avoids unnecessary toggling of enable_spec_decode and avoids extra drafter work.
(Complements the earlier “guard insufficient token budget” thread that you addressed.)
         # Defensive guards; keep behavior explicit for zero/empty cases
-        if not requests or max_batch_size <= 0 or max_num_tokens <= 0:
+        if not requests or max_batch_size <= 0 or max_num_tokens <= 0:
             return False
+        # No drafts → no speculation benefit; avoid enabling spec decode
+        if max_draft_len <= 0:
+            return False
🧹 Nitpick comments (6)
tensorrt_llm/_torch/speculative/drafter.py (2)
1-1: Add NVIDIA Apache-2.0 header (2025). Per repository guidelines, prepend the NVIDIA Apache-2.0 copyright header with the current year to this source file.
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# http://www.apache.org/licenses/LICENSE-2.0 +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License.
33-37: Document the new parameters and behavior. Expand the docstring to a Google-style docstring so external users know exactly how gating is computed.

-        """
-        You probably don't want to override this. ModelEngine
-        assumes that speculation is always on if max_concurrency
-        is not specified by the user's spec config.
-        """
+        """
+        Decide whether to enable speculative decoding for this step.
+
+        You probably don't want to override this. ModelEngine assumes that
+        speculation is always on if max_concurrency is not specified by the
+        user's spec config.
+
+        Args:
+            requests (List[LlmRequest]): Currently active requests.
+            max_batch_size (int): Executor's per-step batch-size limit.
+            max_num_tokens (int): Executor's per-step token budget for decoding.
+            max_draft_len (int): Maximum draft length per request for speculative decoding.
+
+        Returns:
+            bool: True if speculation should be enabled given concurrency, batch size,
+                and token budget constraints; otherwise False.
+        """

tensorrt_llm/_torch/pyexecutor/py_executor.py (2)
1319-1320: Align dummy-request draft allocation with spec_config. The ADP dummy path uses self.max_draft_len, which may diverge from spec_config.max_draft_len used elsewhere (gating and draft prep). Use the same source to avoid inconsistent allocations.

-                max_num_draft_tokens=self.max_draft_len,
+                max_num_draft_tokens=self.model_engine.spec_config.max_draft_len,
961-970: Minor: avoid calling drafter.prepare_draft_tokens when no drafts (max_draft_len == 0). If you accept the optional guard in Drafter.should_use_spec_decode for zero draft length, this will naturally skip. If not, you can add a cheap local check to avoid a no-op invocation.

-        if self.drafter is not None and self.use_spec_decode:
+        if (self.drafter is not None and self.use_spec_decode
+                and self.model_engine.spec_config.max_draft_len > 0):

tests/unittest/_torch/speculative/test_dynamic_spec_decode.py (2)
1-1
: Add NVIDIA Apache-2.0 header (2025) to test file.+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# http://www.apache.org/licenses/LICENSE-2.0 +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License.
91-146: Good direct unit coverage; add a few targeted edge cases. Nice focused test for the new gating logic. Consider adding:
- max_draft_len == 0 should be False (if adopting the optional guard).
- max_concurrency=None path should always return True.
- max_batch_size <= 0 should be False (already guarded in code but not asserted here).
Minimal diff to extend the test:
@@ drafter = _DummyDrafter(max_concurrency=6) @@
     assert drafter.should_use_spec_decode(active_requests,
                                           max_batch_size=8,
                                           max_num_tokens=4,
                                           max_draft_len=4) is False
+    # Edge case - Zero draft length OFF (no speculation benefit)
+    active_requests = [object()] * 3
+    assert drafter.should_use_spec_decode(active_requests,
+                                          max_batch_size=8,
+                                          max_num_tokens=128,
+                                          max_draft_len=0) is False
+
+    # Edge case - max_batch_size <= 0 OFF
+    active_requests = [object()] * 3
+    assert drafter.should_use_spec_decode(active_requests,
+                                          max_batch_size=0,
+                                          max_num_tokens=128,
+                                          max_draft_len=4) is False
+
+    # Behavior when concurrency is unspecified: always ON
+    drafter_unbounded = _DummyDrafter(max_concurrency=None)
+    active_requests = [object()] * 100
+    assert drafter_unbounded.should_use_spec_decode(active_requests,
+                                                    max_batch_size=1,
+                                                    max_num_tokens=1,
+                                                    max_draft_len=4) is True

If you prefer, I can also parametrize this test with pytest to reduce duplication.
📜 Review details
📒 Files selected for processing (3)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (1 hunks)
- tensorrt_llm/_torch/speculative/drafter.py (1 hunks)
- tests/unittest/_torch/speculative/test_dynamic_spec_decode.py (2 hunks)
🧬 Code graph analysis (3)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
tests/unittest/llmapi/apps/_test_openai_misc.py (1)
max_batch_size (30-31)
tensorrt_llm/_torch/speculative/drafter.py (1)
tensorrt_llm/_torch/pyexecutor/llm_request.py (1)
LlmRequest (282-424)
tests/unittest/_torch/speculative/test_dynamic_spec_decode.py (1)
tensorrt_llm/_torch/speculative/drafter.py (3)
Drafter (9-54), prepare_draft_tokens (16-27), should_use_spec_decode (30-54)
🔇 Additional comments (3)
tensorrt_llm/_torch/speculative/drafter.py (1)
30-32: Verified: no overrides or stale callers for should_use_spec_decode. All occurrences of def should_use_spec_decode are in tensorrt_llm/_torch/speculative/drafter.py (line 30); no subclass redefinitions were found. All call sites were scanned and confirmed to pass at least four arguments to the new signature.
– Definition only at: tensorrt_llm/_torch/speculative/drafter.py:30
– No subclasses override should_use_spec_decode
– All invocations supply ≥ 4 arguments

tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
889-891: ⚠️ Verify max_num_tokens semantics. It looks like model_engine.max_num_tokens isn't set anywhere in the Python ModelEngine class itself; it's imported from the underlying TensorRT engine's build-time configuration rather than computed per forward call. Please confirm that this value truly represents your per-step token budget (i.e. target + draft tokens) and doesn't include context-only tokens or a global theoretical maximum. If it does, the gating calculation in PyExecutor may allow speculation when it shouldn't.
- No Python assignment to self.max_num_tokens in tensorrt_llm/_torch/pyexecutor/model_engine.py
- Ensure your engine build's max_num_tokens matches the desired per-iteration limit for speculation gating

tests/unittest/_torch/speculative/test_dynamic_spec_decode.py (1)
55-56: Mock signature update matches the new API. The patched mock now accepts (requests, max_batch_size, max_num_tokens, max_draft_len). Good.
Signed-off-by: Zheyu Fu <[email protected]>
/bot run |
PR_Github #16475 [ run ] triggered by Bot |
Signed-off-by: Zheyu Fu <[email protected]>
Force-pushed 9a28cda to 0b54ebe
Force-pushed 0b54ebe to a9be135
PR_Github #16475 [ run ] completed with state |
/bot run |
Force-pushed a9be135 to 394ccc8
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/speculative/drafter.py (1)
1-4: Compliance: add the NVIDIA copyright header (2025). The repository guidelines require the NVIDIA header on all source files; missing headers can trip CI/license checks.
Apply at the very top of the file:
+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0
+# Unless required by applicable law or agreed to in writing, software distributed under
+# the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND,
+# either express or implied. See the License for the specific language governing permissions and
+# limitations under the License.

If the repo uses a different boilerplate, please paste the project-standard header instead.
🧹 Nitpick comments (3)
tensorrt_llm/_torch/speculative/drafter.py (3)
41-47: Guard ordering: avoid returning True with invalid/degenerate inputs when concurrency is unlimited. With max_concurrency is None, the function returns True before hitting the defensive guards. That enables speculation even when requests is empty or max_batch_size/max_num_tokens are non-positive. Suggest validating inputs and token_cap first; then apply the "unlimited concurrency" fast path only when at least one request can actually be scheduled.
Apply this minimal reorder so True is only reachable when something fits:
-        if self.max_concurrency is None:
-            return True
-
-        # Defensive guards; keep behavior explicit for zero/empty cases
-        if not requests or max_batch_size <= 0 or max_num_tokens <= 0:
-            return False
+        # Defensive guards; keep behavior explicit for zero/empty cases
+        if not requests or max_batch_size <= 0 or max_num_tokens <= 0:
+            return False

And move the unlimited-concurrency path after computing num_effective_requests:
-        num_effective_requests = min(len(requests), max_batch_size, token_cap)
-        return num_effective_requests <= self.max_concurrency
+        num_effective_requests = min(len(requests), max_batch_size, token_cap)
+        if self.max_concurrency is None:
+            # Unlimited concurrency: we've already established at least one request can fit.
+            return True
+        return num_effective_requests <= self.max_concurrency

This keeps the documented behavior while preventing "spec enabled" on no-op steps.
30-33: Add a Google-style docstring to clarify inputs, semantics, and edge cases. Documenting expectations (e.g., whether requests includes only target requests, how max_num_tokens is measured, and the "unlimited concurrency" behavior) will reduce ambiguity for future maintainers and test authors.
Apply:
def should_use_spec_decode(self, requests: List[LlmRequest], max_batch_size: int, max_num_tokens: int, max_draft_len: int) -> bool: - """ - You probably don't want to override this. ModelEngine - assumes that speculation is always on if max_concurrency - is not specified by the user's spec config. - """ + """Decide whether to enable speculative decoding for this step. + + Args: + requests: Active target requests under consideration this step. + max_batch_size: Upper bound of sequences schedulable this step (> 0). + max_num_tokens: Per-step token budget for decoding (targets + draft) (> 0). + max_draft_len: Max proposed draft tokens per request (>= 0). + + Returns: + True if speculation should be enabled given the per-step token budget, + batch-size cap, and the configured concurrency limit. If + max_concurrency is None (unlimited), speculation is enabled as long + as at least one request fits the budget. + """
30-33: Broaden input type to Sequence to avoid over-constraining callers. The method only reads length and iterates; accepting Sequence[LlmRequest] makes it friendlier to upstream schedulers without forcing a list.
Apply:
-from typing import List, Optional, final
+from typing import Optional, Sequence, final
@@
-    def should_use_spec_decode(self, requests: List[LlmRequest],
+    def should_use_spec_decode(self, requests: Sequence[LlmRequest],
                                max_batch_size: int, max_num_tokens: int,
                                max_draft_len: int) -> bool:
📜 Review details
📒 Files selected for processing (3)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (1 hunks)
- tensorrt_llm/_torch/speculative/drafter.py (1 hunks)
- tests/unittest/_torch/speculative/test_dynamic_spec_decode.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- tensorrt_llm/_torch/pyexecutor/py_executor.py
- tests/unittest/_torch/speculative/test_dynamic_spec_decode.py
🧬 Code graph analysis (1)
tensorrt_llm/_torch/speculative/drafter.py (1)
tensorrt_llm/_torch/pyexecutor/llm_request.py (1)
LlmRequest (282-424)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (1)
tensorrt_llm/_torch/speculative/drafter.py (1)
30-33: Smarter capacity gate + API expansion look good. The new signature and the per-step gating via min(len(requests), max_batch_size, token_cap) align with the PR objective and should reduce unnecessary speculation disables. The @final choice also makes sense here.
/bot run |
Signed-off-by: Zheyu Fu <[email protected]>
Signed-off-by: Zheyu Fu <[email protected]>
Force-pushed 394ccc8 to d0adbbd
/bot run |
PR_Github #16582 [ run ] triggered by Bot |
Actionable comments posted: 1
🧹 Nitpick comments (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
1318-1321: Use the same draft length source for ADP dummy requests (keep gating and allocation consistent). Here: max_num_draft_tokens=self.max_draft_len. Elsewhere we gate and allocate drafts using spec_config.max_draft_len. If self.max_draft_len differs from spec_config.max_draft_len, ADP dummy requests will be sized inconsistently vs. the gating decision.
Apply this small change:
-                max_num_draft_tokens=self.max_draft_len,
+                max_num_draft_tokens=self.model_engine.spec_config.max_draft_len,

Follow-up (optional): consider deprecating PyExecutor.max_draft_len in favor of model_engine.spec_config.max_draft_len to avoid future drift.
📜 Review details
📒 Files selected for processing (3)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (1 hunks)
- tensorrt_llm/_torch/speculative/drafter.py (1 hunks)
- tests/unittest/_torch/speculative/test_dynamic_spec_decode.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- tensorrt_llm/_torch/speculative/drafter.py
- tests/unittest/_torch/speculative/test_dynamic_spec_decode.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (2)
tensorrt_llm/_torch/pyexecutor/py_executor.py (2)
888-897: LGTM: smarter gating call wired correctly and consistent draft length source. Passing active_requests, max_batch_size, max_num_tokens, and spec_config.max_draft_len into Drafter.should_use_spec_decode aligns the executor's decision with batch-size and token-budget limits. Using spec_config.max_draft_len here matches _prepare_draft_requests (Line 1016+), eliminating the prior inconsistency.
889-891: All should_use_spec_decode call sites are up-to-date. I ran the provided ripgrep checks across the repository and found only:
- tensorrt_llm/_torch/pyexecutor/py_executor.py – updated to pass all four parameters
- tests/unittest/_torch/speculative/test_dynamic_spec_decode.py – all assertions invoke should_use_spec_decode with the new multi-arg signature (and rely on the default for any omitted parameter)
No stale single-argument calls remain.
PR_Github #16582 [ run ] completed with state |
/bot run --disable-fail-fast |
PR_Github #16590 [ run ] triggered by Bot |
PR_Github #16590 [ run ] completed with state |
/bot run --disable-fail-fast |
PR_Github #16738 [ run ] triggered by Bot |
PR_Github #16738 [ run ] completed with state |
Signed-off-by: Zheyu Fu <[email protected]>
…_spec Signed-off-by: Zheyu Fu <[email protected]>
/bot run --disable-fail-fast |
PR_Github #16958 [ run ] triggered by Bot |
PR_Github #16958 [ run ] completed with state |
/bot run --disable-fail-fast |
PR_Github #17013 [ run ] triggered by Bot |
PR_Github #17013 [ run ] completed with state |
/bot run --disable-fail-fast |
PR_Github #17062 [ run ] triggered by Bot |
PR_Github #17062 [ run ] completed with state |
Description
Previously, we disabled speculation solely when len(active_requests) > max_concurrency, which was conservative. We now gate by a better estimated per-step capacity (min(len(active_requests), max_batch_size, max_num_tokens // (1 + draft_len))), so speculation isn’t unnecessarily turned off when the schedulable batch is small. This respects real decoding limits, improving GPU utilization and latency.
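For reference, a minimal restatement of the new gating rule in plain Python (illustrative only; the real method lives on Drafter in tensorrt_llm/_torch/speculative/drafter.py, takes max_concurrency from the spec config, and gained extra defensive guards during review):

```python
def should_use_spec_decode(active_requests, max_batch_size, max_num_tokens,
                           max_draft_len, max_concurrency):
    """Illustrative form of the gating rule described above."""
    if max_concurrency is None:
        # No concurrency cap configured: speculation stays on.
        return True
    # Each speculating request consumes one target token plus its draft tokens.
    token_cap = max_num_tokens // (1 + max_draft_len)
    # Estimated per-step capacity, bounded by active requests, batch size, and token budget.
    num_effective_requests = min(len(active_requests), max_batch_size, token_cap)
    return 0 < num_effective_requests <= max_concurrency
```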
Test Coverage
There are existing unit tests in TensorRTLLM/tests/unittest/_torch/speculative/test_dynamic_spec_decode.py. Also added some quick unit tests in this file.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id
(OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test
(OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast
(OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test
(OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx"
(OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe"
(OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp"
(OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test
(OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test
(OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test
(OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge
(OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"
(OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log
(OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug
(OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.

kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request.
--comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
Summary by CodeRabbit
New Features
Refactor
Tests