Auto-enable ngram with concurrency <= 32. #6232
base: main
Conversation
Walkthrough
This update adds a new automatic speculative decoding configuration (AutoDecodingConfig).
Sequence Diagram(s)
sequenceDiagram
participant User
participant CLI
participant LLMArgs
participant LLM
participant PyExecutor
User->>CLI: Run with/without --spec_decode_algo "AUTO"
CLI->>LLMArgs: Parse arguments including AUTO option
LLMArgs->>LLM: Pass speculative decoding config (may be AutoDecodingConfig)
LLM->>LLM: _build_model applies AUTO heuristic if applicable
LLM->>PyExecutor: update_executor_config with resolved speculative config
Estimated code review effort: 3 (~45 minutes)
Actionable comments posted: 0
🧹 Nitpick comments (1)
tensorrt_llm/llmapi/llm.py (1)
967-1001: Approve the NGram auto-enabling logic, with style improvements needed.
The implementation correctly applies the heuristic for auto-enabling NGram speculative decoding based on batch-size thresholds. The logic properly:
- Checks for NGram mode and batch size constraints
- Disables overlap scheduler when conditions are met
- Applies appropriate default values based on batch size ranges
- Logs the changes for debugging
However, there are several style issues that should be addressed:
Apply these fixes to improve code style and readability:
-        logger.info(
-            "Disable overlap scheduler to enable NGram speculative decoding."
-        )
+        logger.info("Disable overlap scheduler to enable NGram speculative decoding.")
         # From benchmark results, we found that NGram speculative decoding provides better performance than overlap scheduler with low concurrency <= 32.
         # Therefore, we disable overlap scheduler to enable NGram speculative decoding.
         self.args.disable_overlap_scheduler = True
-        if spec_config.max_draft_len != 0 and spec_config.max_matching_ngram_size != 0:
+        if spec_config.max_draft_len != 0 and spec_config.max_matching_ngram_size != 0:
             pass
         else:
             if max_batch_size <= 4:
-                spec_config.max_draft_len = 5 if spec_config.max_draft_len == 0 else spec_config.max_draft_len
-                spec_config.max_matching_ngram_size = 3 if spec_config.max_matching_ngram_size == 0 else spec_config.max_matching_ngram_size
+                spec_config.max_draft_len = (
+                    5 if spec_config.max_draft_len == 0 else spec_config.max_draft_len
+                )
+                spec_config.max_matching_ngram_size = (
+                    3 if spec_config.max_matching_ngram_size == 0 else spec_config.max_matching_ngram_size
+                )
             elif max_batch_size <= 32:
-                spec_config.max_draft_len = 3 if spec_config.max_draft_len == 0 else spec_config.max_draft_len
-                spec_config.max_matching_ngram_size = 5 if spec_config.max_matching_ngram_size == 0 else spec_config.max_matching_ngram_size
+                spec_config.max_draft_len = (
+                    3 if spec_config.max_draft_len == 0 else spec_config.max_draft_len
+                )
+                spec_config.max_matching_ngram_size = (
+                    5 if spec_config.max_matching_ngram_size == 0 else spec_config.max_matching_ngram_size
+                )
         logger.info(
-            f"Apply heuristic to incomplete NGramDecodingConfig: max_draft_len={spec_config.max_draft_len}, max_matching_ngram_size={spec_config.max_matching_ngram_size}"
+            f"Apply heuristic to incomplete NGramDecodingConfig: "
+            f"max_draft_len={spec_config.max_draft_len}, "
+            f"max_matching_ngram_size={spec_config.max_matching_ngram_size}"
         )
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- examples/llm-api/quickstart_advanced.py (2 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (2 hunks)
- tensorrt_llm/llmapi/llm.py (1 hunks)
- tensorrt_llm/llmapi/llm_args.py (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
tensorrt_llm/llmapi/llm.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
🪛 Ruff (0.12.2)
tensorrt_llm/llmapi/llm.py
977-977: Line too long (161 > 120)
(E501)
986-986: Line too long (144 > 120)
(E501)
989-989: Line too long (144 > 120)
(E501)
991-991: Line too long (178 > 120)
(E501)
🔇 Additional comments (5)
examples/llm-api/quickstart_advanced.py (2)
111-113: LGTM! Default value changes align with the heuristic approach.
The change from hardcoded defaults (1 and 5) to 0 allows the system to apply heuristics based on the concurrency level, which aligns with the PR objective of auto-enabling ngram with concurrency <= 32.
155-158: Conditional logic correctly handles non-NGRAM algorithms.
The implementation properly sets spec_decode_max_draft_len to 1 for non-NGRAM algorithms when the user hasn't explicitly set a value (i.e., when it's 0). This ensures backward compatibility while allowing NGRAM to use the new heuristic logic.
tensorrt_llm/llmapi/llm_args.py (1)
388-393: Excellent documentation and parameter structure for the heuristic implementation.
The comments clearly document the heuristic behavior:
- Concurrency ≤ 4: max_draft_len = 5, max_matching_ngram_size = 3
- Concurrency ≤ 32: max_draft_len = 3, max_matching_ngram_size = 5
Setting both parameters to default value 0 properly indicates "use heuristic" and aligns with the changes in the example file.
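For readers skimming the thread, a minimal sketch of the sentinel-default resolution these comments describe (a hypothetical standalone helper, not the actual code in llm.py):

def resolve_ngram_defaults(max_batch_size: int,
                           max_draft_len: int = 0,
                           max_matching_ngram_size: int = 0) -> tuple[int, int]:
    """Hypothetical helper: 0 means 'let the heuristic choose'."""
    if max_batch_size <= 4:
        default_draft, default_ngram = 5, 3   # low concurrency: longer drafts, shorter match
    elif max_batch_size <= 32:
        default_draft, default_ngram = 3, 5   # moderate concurrency: shorter drafts, longer match
    else:
        # The heuristic is not applied above concurrency 32.
        return max_draft_len, max_matching_ngram_size
    return (max_draft_len or default_draft,
            max_matching_ngram_size or default_ngram)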
tensorrt_llm/_torch/pyexecutor/py_executor.py (2)
925-925: LGTM! Method call updated correctly.
The method call has been properly updated to match the simplified signature without the requests parameter.
1012-1016: LGTM! Good refactoring to simplify the method interface.
The changes properly simplify the _prepare_draft_requests method by:
- Removing the unnecessary requests parameter from the method signature
- Directly iterating over self.active_requests instead of relying on an external parameter
- Ensuring consistent access to the current active requests stored in the instance
This refactoring improves code clarity and reduces potential inconsistencies between the passed parameter and the actual active requests state.
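Illustration only: a rough sketch of the simplified interface described above; the real method in py_executor.py contains the actual draft-token preparation logic.

class PyExecutor:
    def _prepare_draft_requests(self) -> None:
        # No `requests` parameter anymore: iterate the executor's own
        # active_requests so draft preparation always sees the current state.
        for request in self.active_requests:
            ...  # set up draft tokens for speculative decoding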
Actionable comments posted: 1
🔭 Outside diff range comments (1)
tensorrt_llm/llmapi/llm_args.py (1)
256-271: Missing "AUTO" decoding type in config class mapping.
The DecodingBaseConfig.from_dict method needs to include the new AutoDecodingConfig to properly handle deserialization of "AUTO" decoding type configurations. Add the missing mapping:
 config_classes = {
     "MTP": MTPDecodingConfig,
     "Medusa": MedusaDecodingConfig,
     "Eagle": EagleDecodingConfig,
     "Lookahead": LookaheadDecodingConfig,
     "NGram": NGramDecodingConfig,
     "DraftTarget": DraftTargetDecodingConfig,
     "UserProvided": UserProvidedDecodingConfig,
+    "AUTO": AutoDecodingConfig,
 }
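For context, a hedged sketch of how such a dispatch table is typically consumed, assuming the config_classes mapping above is in scope; the real DecodingBaseConfig.from_dict in llm_args.py may differ in key names and error handling.

def decoding_config_from_dict(data: dict):
    # Hypothetical free-function rendering of the from_dict factory.
    decoding_type = data.get("decoding_type")
    config_cls = config_classes.get(decoding_type)
    if config_cls is None:
        # Without the "AUTO" entry above, an AUTO config would fail right here.
        raise ValueError(f"Unsupported decoding_type: {decoding_type!r}")
    return config_cls(**{k: v for k, v in data.items() if k != "decoding_type"})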
🧹 Nitpick comments (3)
tensorrt_llm/llmapi/llm_args.py (1)
461-482: Well-designed auto-configuration class with a minor docstring formatting issue.
The implementation follows established patterns and clearly documents the heuristic logic. The backend restriction to PyTorch is appropriate for ngram decoding.
Fix the docstring formatting to comply with PEP 257:
 class AutoDecodingConfig(DecodingBaseConfig):
     """
     Configuration for auto speculative decoding.
+
     This config is used to automatically select the best speculative decoding algorithm.
tensorrt_llm/llmapi/llm.py (2)
978-978: Fix line length violation.
The line exceeds the 120 character limit.
Apply this diff to fix the line length:
- "Disable overlap scheduler to enable Auto speculative decoding with Ngram." + "Disable overlap scheduler to enable Auto speculative " + "decoding with Ngram."
991-993: Fix line length violation.
The log message line exceeds the 120 character limit.
Apply this diff to fix the line length:
- f"Apply heuristic to incomplete NGramDecodingConfig: max_draft_len={spec_config.max_draft_len}, max_matching_ngram_size={spec_config.max_matching_ngram_size}" + f"Apply heuristic to incomplete NGramDecodingConfig: " + f"max_draft_len={spec_config.max_draft_len}, " + f"max_matching_ngram_size={spec_config.max_matching_ngram_size}"
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- examples/llm-api/quickstart_advanced.py (3 hunks)
- tensorrt_llm/llmapi/llm.py (2 hunks)
- tensorrt_llm/llmapi/llm_args.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- examples/llm-api/quickstart_advanced.py
🧰 Additional context used
🧠 Learnings (1)
tensorrt_llm/llmapi/llm.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
🪛 Ruff (0.12.2)
tensorrt_llm/llmapi/llm.py
978-978: Line too long (161 > 120)
(E501)
991-991: Line too long (174 > 120)
(E501)
tensorrt_llm/llmapi/llm_args.py
477-479: 1 blank line required between summary line and description
(D205)
🔇 Additional comments (2)
tensorrt_llm/llmapi/llm_args.py (1)
388-393: LGTM! Good documentation of heuristic behavior.
The changes to use default values of 0, with clear comments explaining the heuristic logic, align well with the PR objective of auto-enabling ngram for low-concurrency scenarios.
tensorrt_llm/llmapi/llm.py (1)
967-994: LGTM! Well-implemented heuristic for auto-enabling ngram speculative decoding.
The conditional logic correctly implements the PR objective to auto-enable ngram with concurrency <= 32. The batch-size-based parameter selection and overlap-scheduler disabling are well reasoned, based on the benchmark results mentioned in the comments.
Signed-off-by: Simeng Liu <[email protected]>
d69e813 to 21cfc3d (Compare)
/bot run --disable-fail-fast
Actionable comments posted: 1
🧹 Nitpick comments (1)
tensorrt_llm/llmapi/llm_args.py (1)
462-483: Fix docstring formatting issue.
The AutoDecodingConfig class implementation is well designed and follows the established patterns. However, there's a docstring formatting issue that needs to be addressed. Apply this fix for the docstring formatting:
 class AutoDecodingConfig(DecodingBaseConfig):
     """
     Configuration for auto speculative decoding.
+
     This config is used to automatically select the best speculative decoding algorithm.
     According to benchmark results, the best algorithm in general is NGRAM with low concurrency <= 32.
     Default heuristic:
     With concurrency <= 4, max_draft_len = 5, max_matching_ngram_size = 3
     With concurrency <= 32, max_draft_len = 3, max_matching_ngram_size = 5
     With concurrency > 32, speculative decoding is disabled.
     """
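For context, a minimal usage sketch, assuming AutoDecodingConfig is exported from tensorrt_llm.llmapi (the __init__.py change in this PR suggests so) and that the LLM constructor accepts these keyword arguments; the model path is a placeholder.

from tensorrt_llm.llmapi import LLM, AutoDecodingConfig

llm = LLM(
    model="/path/to/model",                   # placeholder
    speculative_config=AutoDecodingConfig(),  # let the heuristic pick the algorithm
    max_batch_size=8,                         # <= 32, so NGram is auto-enabled (draft_len=3, ngram=5)
)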
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- examples/llm-api/quickstart_advanced.py (3 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (2 hunks)
- tensorrt_llm/_torch/speculative/interface.py (1 hunks)
- tensorrt_llm/llmapi/__init__.py (1 hunks)
- tensorrt_llm/llmapi/llm.py (2 hunks)
- tensorrt_llm/llmapi/llm_args.py (5 hunks)
- tensorrt_llm/models/modeling_utils.py (2 hunks)
✅ Files skipped from review due to trivial changes (3)
- tensorrt_llm/llmapi/__init__.py
- tensorrt_llm/_torch/speculative/interface.py
- tensorrt_llm/models/modeling_utils.py
🚧 Files skipped from review as they are similar to previous changes (2)
- tensorrt_llm/_torch/pyexecutor/py_executor.py
- examples/llm-api/quickstart_advanced.py
🧰 Additional context used
🪛 Ruff (0.12.2)
tensorrt_llm/llmapi/llm.py
970-970: Line too long (161 > 120)
(E501)
983-983: Line too long (174 > 120)
(E501)
tensorrt_llm/llmapi/llm_args.py
478-480: 1 blank line required between summary line and description
(D205)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (7)
tensorrt_llm/llmapi/llm.py (3)
34-35: Import addition looks good.
The NGramDecodingConfig import is properly added and is required for the new auto-configuration logic.
993-993: Parameter passing looks correct.
The speculative_config=spec_config parameter correctly uses the potentially modified config from the heuristic logic above.
960-961: Speculative config decoding_type attribute is valid.
Verified that every speculative config class in tensorrt_llm/llmapi/llm_args.py defines decoding_type as a ClassVar[str] (with values such as "AUTO"), so accessing spec_config.decoding_type is safe and requires no change.
tensorrt_llm/llmapi/llm_args.py (4)
389-394: Good implementation of heuristic-based defaults.
The changes to NGramDecodingConfig properly implement the concurrency-based heuristic mentioned in the PR objectives. The comments clearly explain the default behavior, and the fields now use 0 as defaults to indicate when heuristics should be applied.
265-265: Proper integration of AutoDecodingConfig in the dispatch dictionary.
The addition of "AUTO": AutoDecodingConfig to the dispatch dictionary correctly enables the factory method to instantiate the new configuration class.
788-788: Proper type alias extension.
The addition of AutoDecodingConfig to the SpeculativeConfig type alias correctly extends the union type to include the new configuration class.
1534-1538: Correct validation logic for AutoDecodingConfig.
The validation logic properly handles AutoDecodingConfig instances by:
- Checking backend compatibility (pytorch or _autodeploy)
- Setting the appropriate speculative decoding mode to AUTO
- Assigning the max_draft_len from the configuration
The implementation follows the established pattern for other decoding config types.
if spec_config is not None and spec_config.decoding_type == "AUTO" and max_batch_size <= 32:
    if not self.args.disable_overlap_scheduler:
        logger.info(
            "Disable overlap scheduler to enable Auto speculative decoding with Ngram."
        )
        # From benchmark results, we found that NGram speculative decoding provides better performance than overlap scheduler with low concurrency <= 32.
        # Therefore, we disable overlap scheduler to enable NGram speculative decoding.
        self.args.disable_overlap_scheduler = True

    spec_config = NGramDecodingConfig(
        max_draft_len=5 if max_batch_size <= 4 else 3,
        max_matching_ngram_size=3 if max_batch_size <= 4 else 5,
        is_keep_all=True,
        is_use_oldest=True,
        is_public_pool=True,
    )

    logger.info(
        f"Apply heuristic to incomplete NGramDecodingConfig: max_draft_len={spec_config.max_draft_len}, max_matching_ngram_size={spec_config.max_matching_ngram_size}"
    )
Logic implementation looks correct but has style issues.
The heuristic logic for auto-enabling ngram speculative decoding is well-structured and addresses the PR objective. However, there are a few concerns:
- Line length violations: Lines 970 and 983 exceed 120 characters (flagged by static analysis)
- Magic numbers: The batch size thresholds (4, 32) and ngram parameters should be documented or made configurable
Apply this diff to fix the line length issues:
- logger.info(
- "Disable overlap scheduler to enable Auto speculative decoding with Ngram."
- )
+ logger.info(
+ "Disable overlap scheduler to enable Auto speculative "
+ "decoding with Ngram."
+ )
- logger.info(
- f"Apply heuristic to incomplete NGramDecodingConfig: max_draft_len={spec_config.max_draft_len}, max_matching_ngram_size={spec_config.max_matching_ngram_size}"
- )
+ logger.info(
+ f"Apply heuristic to NGramDecodingConfig: "
+ f"max_draft_len={spec_config.max_draft_len}, "
+ f"max_matching_ngram_size={spec_config.max_matching_ngram_size}"
+ )
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
if spec_config is not None and spec_config.decoding_type == "AUTO" and max_batch_size <= 32:
    if not self.args.disable_overlap_scheduler:
        logger.info(
            "Disable overlap scheduler to enable Auto speculative "
            "decoding with Ngram."
        )
        # From benchmark results, we found that NGram speculative decoding provides better performance than overlap scheduler with low concurrency <= 32.
        # Therefore, we disable overlap scheduler to enable NGram speculative decoding.
        self.args.disable_overlap_scheduler = True

    spec_config = NGramDecodingConfig(
        max_draft_len=5 if max_batch_size <= 4 else 3,
        max_matching_ngram_size=3 if max_batch_size <= 4 else 5,
        is_keep_all=True,
        is_use_oldest=True,
        is_public_pool=True,
    )

    logger.info(
        f"Apply heuristic to NGramDecodingConfig: "
        f"max_draft_len={spec_config.max_draft_len}, "
        f"max_matching_ngram_size={spec_config.max_matching_ngram_size}"
    )
🧰 Tools
🪛 Ruff (0.12.2)
970-970: Line too long (161 > 120)
(E501)
983-983: Line too long (174 > 120)
(E501)
🤖 Prompt for AI Agents
In tensorrt_llm/llmapi/llm.py around lines 966 to 986, fix the style issues by
breaking long lines at 970 and 983 to keep them under 120 characters.
Additionally, replace the magic numbers 4 and 32 with named constants or
configuration parameters, and add comments explaining their purpose to improve
code readability and maintainability.
PR_Github #12532 [ run ] triggered by Bot
PR_Github #12532 [ run ] completed with state
…ith Ngram. Signed-off-by: Simeng Liu <[email protected]>
21cfc3d to 093c394 (Compare)
/bot run --disable-fail-fast
Actionable comments posted: 0
♻️ Duplicate comments (1)
tensorrt_llm/llmapi/llm.py (1)
959-986: Fix line length violations and consider documenting the magic numbers.
The heuristic logic correctly implements the PR objective but has style issues that need addressing:
- Line length violations: Lines 970 and 983 exceed 120 characters (flagged by static analysis)
- Magic numbers: The batch size thresholds (4, 32) should be documented for better maintainability
Apply this diff to fix the line length issues:
-            logger.info(
-                "Disable overlap scheduler to enable Auto speculative decoding with Ngram."
-            )
+            logger.info(
+                "Disable overlap scheduler to enable Auto speculative "
+                "decoding with Ngram."
+            )
-        logger.info(
-            f"Apply heuristic to incomplete NGramDecodingConfig: max_draft_len={spec_config.max_draft_len}, max_matching_ngram_size={spec_config.max_matching_ngram_size}"
-        )
+        logger.info(
+            f"Apply heuristic to NGramDecodingConfig: "
+            f"max_draft_len={spec_config.max_draft_len}, "
+            f"max_matching_ngram_size={spec_config.max_matching_ngram_size}"
+        )
Consider adding constants or comments explaining the threshold values (4, 32) and parameter choices.
🧹 Nitpick comments (1)
tensorrt_llm/llmapi/llm_args.py (1)
462-483: Fix docstring formatting issue.
The AutoDecodingConfig class is well implemented with clear heuristics documentation. However, there's a formatting issue in the docstring.
Apply this diff to fix the docstring formatting:
 class AutoDecodingConfig(DecodingBaseConfig):
     """
     Configuration for auto speculative decoding.
+
     This config is used to automatically select the best speculative decoding algorithm.
     According to benchmark results, the best algorithm in general is NGRAM with low concurrency <= 32.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- examples/llm-api/quickstart_advanced.py (3 hunks)
- tensorrt_llm/_torch/speculative/interface.py (1 hunks)
- tensorrt_llm/llmapi/__init__.py (2 hunks)
- tensorrt_llm/llmapi/llm.py (2 hunks)
- tensorrt_llm/llmapi/llm_args.py (5 hunks)
- tensorrt_llm/models/modeling_utils.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
- tensorrt_llm/_torch/speculative/interface.py
- tensorrt_llm/models/modeling_utils.py
- examples/llm-api/quickstart_advanced.py
- tensorrt_llm/llmapi/__init__.py
🧰 Additional context used
🪛 Ruff (0.12.2)
tensorrt_llm/llmapi/llm.py
970-970: Line too long (161 > 120)
(E501)
983-983: Line too long (174 > 120)
(E501)
tensorrt_llm/llmapi/llm_args.py
478-480: 1 blank line required between summary line and description
(D205)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (6)
tensorrt_llm/llmapi/llm.py (2)
34-35: Import addition looks good.
The addition of NGramDecodingConfig to the import statement is necessary for the new auto-decoding heuristic logic implemented below.
993-993: Correct usage of the modified speculative config.
The change to use spec_config instead of self.args.speculative_config is correct, ensuring the heuristic-adjusted configuration (if applied) is passed to update_executor_config.
tensorrt_llm/llmapi/llm_args.py (4)
265-265: LGTM: Dispatch dictionary correctly updated.
The addition of "AUTO": AutoDecodingConfig to the dispatch dictionary follows the existing pattern and enables proper instantiation of the new AutoDecodingConfig class.
788-788: LGTM: Type alias properly extended.
Adding AutoDecodingConfig to the SpeculativeConfig type alias ensures type consistency and allows the new configuration to be used wherever speculative configurations are expected.
389-394: Verify that the AUTO→NGram heuristic is applied in llm.py.
The default 0 values in NGramDecodingConfig act as sentinels for "AUTO." The heuristic is implemented in tensorrt_llm/llmapi/llm.py:
- When decoding_type == "AUTO" and max_batch_size <= 32, a new NGramDecodingConfig is constructed with:
  - max_draft_len = 5 (if <= 4) or 3 (otherwise)
  - max_matching_ngram_size = 3 (if <= 4) or 5 (otherwise)
This matches the comments in llm_args.py and ensures the zero defaults are overridden. No further action is needed.
Likely an incorrect or invalid review comment.
1534-1537: Speculative AUTO mode always yields a non-zero max_draft_len via the NGram heuristic, so no change is needed.
The AutoDecodingConfig branch in validate_speculative_config is never invoked with a bare AUTO config at runtime: before validation, the LLM API converts any AUTO setting into an NGramDecodingConfig with max_draft_len=5 (for concurrency ≤ 4) or 3 (for concurrency ≤ 32). The CLI build path doesn't even expose an "auto" option. No further action is required here.
PR_Github #12534 [ run ] triggered by Bot
LGTM.
tensorrt_llm/llmapi/llm_args.py
Outdated
# Default heuristic will be use
# With concurrency <= 4, max_draft_len = 5, max_matching_ngram_size = 3
# With concurrency <= 32, max_draft_len = 3, max_matching_ngram_size = 5
max_draft_len: int = 0
Could we just use the max_draft_len from class DecodingBaseConfig?
Good point, removed this line.
PR_Github #12534 [ run ] completed with state
Where/how are we testing these changes?
Signed-off-by: Simeng Liu <[email protected]>
Signed-off-by: Simeng Liu <[email protected]>
Added
/bot run --disable-fail-fast
Actionable comments posted: 0
🧹 Nitpick comments (1)
tensorrt_llm/llmapi/llm_args.py (1)
457-478: Fix the docstring formatting; the implementation looks good.
The AutoDecodingConfig class is well implemented, with clear documentation of its purpose and heuristics. The backend restriction to PyTorch is appropriate for this feature. Apply this diff to fix the docstring formatting issue flagged by static analysis:
 class AutoDecodingConfig(DecodingBaseConfig):
     """
     Configuration for auto speculative decoding.
+
     This config is used to automatically select the best speculative decoding algorithm.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- examples/llm-api/quickstart_advanced.py (2 hunks)
- tensorrt_llm/llmapi/llm_args.py (5 hunks)
- tests/integration/defs/test_e2e.py (1 hunks)
- tests/unittest/api_stability/references_committed/llm.yaml (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- tests/unittest/api_stability/references_committed/llm.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
- examples/llm-api/quickstart_advanced.py
🧰 Additional context used
🪛 Ruff (0.12.2)
tensorrt_llm/llmapi/llm_args.py
473-475: 1 blank line required between summary line and description
(D205)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (5)
tests/integration/defs/test_e2e.py (1)
1768-1791: AUTO mode fallback to NGram is validated.
The --spec_decode_algo AUTO test runs with --max_batch_size=4 (≤ 32), which triggers the AUTO heuristic in tensorrt_llm/llmapi/llm.py to select an NGram speculative decoding configuration. The observed 27 GiB memory profile matches the existing NGram test, confirming the AUTO mode behavior. No changes needed.
tensorrt_llm/llmapi/llm_args.py (4)
265-265: LGTM!
The addition of the "AUTO" mapping to the dispatch dictionary correctly follows the established pattern for routing decoding types to their respective configuration classes.
389-389: Verify the impact of changing the default value.
Changing the default max_matching_ngram_size from 4 to 0 could be a breaking change for existing users who rely on the previous default behavior. Please ensure this change is intentional and consider:
- Whether existing code that doesn't explicitly set this parameter will be affected
- If documentation updates are needed to reflect this change
- Whether this should be mentioned in release notes as a breaking change
783-783: LGTM!
Adding AutoDecodingConfig to the SpeculativeConfig type alias is necessary for proper type checking and ensures the new configuration can be used wherever speculative configurations are expected.
1529-1532: LGTM! Validation logic is consistent.
The validation logic for AutoDecodingConfig correctly follows the established pattern:
- Backend assertion matches the supports_backend method
- Appropriate speculative decoding mode assignment
- Consistent with other decoding configuration validations
The actual heuristics application (based on concurrency/batch size) appears to be handled elsewhere in the system, which is a reasonable separation of concerns.
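A hedged reconstruction of what that validation branch amounts to, rendered here as a standalone function; the attribute and mode names are assumptions, and the real code lives inside validate_speculative_config and uses the library's speculative-decoding mode enum.

def validate_auto_speculative(backend: str, config) -> dict:
    # Mirrors the bullets above; simplified, not the library code.
    assert backend in ("pytorch", "_autodeploy"), (
        "AUTO speculative decoding is only supported on PyTorch-based backends")
    return {
        "speculative_decoding_mode": "AUTO",    # stands in for the real enum value
        "max_draft_len": config.max_draft_len,  # taken from the AutoDecodingConfig
    }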
PR_Github #12619 [ run ] triggered by Bot
PR_Github #12619 [ run ] completed with state
Summary by CodeRabbit
New Features
Bug Fixes
Description
Test Coverage
test_e2e.py::test_ptp_quickstart_advanced_auto
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user-friendly way for developers to interact with a Jenkins server.
Run
/bot [-h|--help]
to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id
(OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
--disable-reuse-test
(OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
--disable-fail-fast
(OPTIONAL) : Disable fail fast on build/tests/infra failures.
--skip-test
(OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
--stage-list "A10-PyTorch-1, xxx"
(OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
--gpu-type "A30, H100_PCIe"
(OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
--test-backend "pytorch, cpp"
(OPTIONAL) : Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
--only-multi-gpu-test
(OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
--disable-multi-gpu-test
(OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
--add-multi-gpu-test
(OPTIONAL) : Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.
--post-merge
(OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"
(OPTIONAL) : Run the ordinary L0 pre-merge pipeline and the specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
--detailed-log
(OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
--debug
(OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.
For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.
kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request.
--comment "Reason for skipping build/test"
is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.