[TRTLLM-6685][feat] Add speculative metrics for trt llm bench #6476
Conversation
📝 Walkthrough
The changes introduce enhanced tracking and reporting of speculative decoding metrics in the benchmarking utilities. New statistics related to draft tokens, accepted drafts, acceptance rates, and acceptance lengths are collected, aggregated, and reported. Data models are updated to support these metrics, and utility methods are added for configuration retrieval and statistics formatting.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
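For orientation, a minimal self-contained sketch of how such per-request metrics might be collected and reduced to percentiles; this is not the PR's code. The summarize helper merely stands in for the benchmark's PercentileStats model, and the per-request math follows the formulas discussed later in this review.

```python
import statistics
from typing import Dict, List, Tuple

def summarize(values: List[float]) -> Dict[str, float]:
    """Stand-in for PercentileStats: reduce a metric list to summary stats."""
    q = statistics.quantiles(values, n=100, method="inclusive")
    return {"minimum": min(values), "maximum": max(values),
            "average": statistics.fmean(values),
            "p50": q[49], "p90": q[89], "p95": q[94], "p99": q[98]}

def speculative_metrics(requests: List[Tuple[int, int]],
                        max_draft_tokens: int) -> Dict[str, Dict[str, float]]:
    """Collect per-request draft metrics from (decode_iteration, num_generated_tokens) pairs."""
    draft, accepted, rate, length = [], [], [], []
    for decode_iteration, num_generated_tokens in requests:
        iterations = decode_iteration + 1
        draft.append(max_draft_tokens * iterations)         # tokens proposed
        accepted.append(num_generated_tokens - iterations)  # drafts kept
        rate.append(accepted[-1] / draft[-1])               # acceptance rate
        length.append(num_generated_tokens / iterations)    # tokens per iteration
    return {"num_draft_tokens": summarize(draft),
            "num_accepted_drafts": summarize(accepted),
            "draft_acceptance_rate": summarize(rate),
            "acceptance_length": summarize(length)}

# Example: two requests given as (decode_iteration, num_generated_tokens) pairs.
print(speculative_metrics([(9, 30), (4, 12)], max_draft_tokens=4))
```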
Actionable comments posted: 3
🧹 Nitpick comments (1)
tensorrt_llm/bench/dataclasses/reporting.py (1)
443-454: Refactor long lines to comply with the 120-character limit.
Several lines exceed the 120-character limit specified in the coding guidelines. Consider extracting the model_dump calls into a helper to improve readability:
```diff
-"draft_tokens_percentiles":
-    self.statistics.num_draft_tokens_percentiles.model_dump(
-        exclude_none=True, by_alias=True, mode='json') if self.statistics.num_draft_tokens_percentiles else None,
+def _format_percentiles(stats):
+    return stats.model_dump(exclude_none=True, by_alias=True, mode='json') if stats else None
+
+"draft_tokens_percentiles": _format_percentiles(self.statistics.num_draft_tokens_percentiles),
+"num_accepted_drafts_percentiles": _format_percentiles(self.statistics.num_accepted_drafts_percentiles),
+"draft_acceptance_rate_percentiles": _format_percentiles(self.statistics.draft_acceptance_rate_percentiles),
+"acceptance_length_percentiles": _format_percentiles(self.statistics.acceptance_length_percentiles),
```
📒 Files selected for processing (2)
- tensorrt_llm/bench/dataclasses/reporting.py (9 hunks)
- tensorrt_llm/bench/dataclasses/statistics.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
The code developed for TensorRT-LLM should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a file, prefer docstrings over comments in Python.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without reflection.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
tensorrt_llm/bench/dataclasses/statistics.py
tensorrt_llm/bench/dataclasses/reporting.py
**/*.{cpp,h,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
tensorrt_llm/bench/dataclasses/statistics.py
tensorrt_llm/bench/dataclasses/reporting.py
🪛 Ruff (0.12.2)
tensorrt_llm/bench/dataclasses/reporting.py
119-119: Line too long (147 > 120)
(E501)
122-122: Undefined name num_accepted_draft_tokens
(F821)
123-123: Undefined name num_accepted_draft_tokens
(F821)
144-144: Undefined name draft_tokens
(F821)
145-145: Undefined name global_acceptance_rate
(F821)
444-444: Line too long (125 > 120)
(E501)
447-447: Line too long (128 > 120)
(E501)
450-450: Line too long (130 > 120)
(E501)
453-453: Line too long (125 > 120)
(E501)
🔇 Additional comments (4)
tensorrt_llm/bench/dataclasses/statistics.py (1)
124-124: LGTM! Well-structured data model extensions for speculative decoding metrics. The new fields appropriately capture draft token statistics and acceptance metrics with proper initialization defaults.
Also applies to: 131-131, 141-145
tensorrt_llm/bench/dataclasses/reporting.py (3)
121-127: Design consideration: the uniform draft token assumption may not reflect reality. The current implementation assumes all requests use the same num_draft_tokens value. In practice, draft token counts may vary per request based on speculative decoding dynamics. Consider tracking draft tokens per request for more accurate metrics; a rough sketch of that idea follows. Would you like me to propose an alternative implementation that supports per-request draft token tracking?
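A rough sketch of that alternative, under the assumption that the runtime could report an actual per-request draft count; the per_request_draft_tokens attribute here is hypothetical, not an existing field:

```python
from typing import Iterable, List, Tuple

def per_request_draft_metrics(request_records: Iterable,
                              max_draft_tokens: int) -> Tuple[List[int], List[int]]:
    """Prefer a real per-request draft count over the uniform budget.

    Falls back to the uniform max_draft_tokens * iterations assumption
    when the hypothetical per-request field is absent.
    """
    num_draft_tokens: List[int] = []
    num_accepted_drafts: List[int] = []
    for entry in request_records:
        drafted = getattr(entry, "per_request_draft_tokens", None)  # hypothetical field
        if drafted is None:
            drafted = max_draft_tokens * (entry.decode_iteration + 1)  # uniform fallback
        num_draft_tokens.append(drafted)
        num_accepted_drafts.append(
            entry.num_generated_tokens - entry.decode_iteration - 1)
    return num_draft_tokens, num_accepted_drafts
```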
597-633: Well-structured statistics reporting with proper null handling. The implementation correctly checks for None values before accessing percentile statistics and provides clear, formatted output for all speculative decoding metrics.
647-671: Robust configuration retrieval with proper fallback handling. The method correctly handles both C++ and Python backend configurations with appropriate error handling for missing files or fields.
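The comment above describes the fallback order only in prose. A sketch of the shape of such a layered lookup, with the config file name and JSON keys treated as illustrative assumptions rather than verified repository details:

```python
import json
from pathlib import Path
from typing import Optional

def get_max_draft_len(engine_dir: Optional[str], speculative_config) -> int:
    """Layered lookup: engine config file, then in-memory config, then 0.

    The "build_config"/"max_draft_len" keys are assumptions for illustration.
    """
    if engine_dir is not None:
        try:
            with open(Path(engine_dir) / "config.json") as f:
                config = json.load(f)
            return config["build_config"]["max_draft_len"] or 0
        except (FileNotFoundError, KeyError, json.JSONDecodeError):
            pass  # fall through to the in-memory config
    if speculative_config is not None:
        return speculative_config.max_draft_len or 0
    return 0
```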
Force-pushed d2d24a4 to 619560b
Actionable comments posted: 0
🔭 Outside diff range comments (1)
tensorrt_llm/bench/dataclasses/reporting.py (1)
127-165: Fix duplicate keyword argument and formatting issues.
A critical syntax error and formatting issues need to be addressed:
- Line 158 has a duplicate acceptance_length keyword argument, causing a syntax error
- Line 129 contains trailing whitespace
Apply these fixes:
```diff
 global_acceptance_length = sum(acceptance_length) / len(acceptance_length)
 queue_time_total = last_queue_time - start_time
- 
+
```
And remove the duplicate keyword argument on line 158:
```diff
 token_percentiles=PercentileStats.from_iterable(output_tokens),
 issue_rate_ns=queue_time_total / num_requests,
-acceptance_length=global_acceptance_length,
 num_draft_tokens_percentiles=num_draft_tokens_percentiles,
```
♻️ Duplicate comments (2)
tensorrt_llm/bench/dataclasses/reporting.py (2)
75-75: Critical: Variable name collision between parameter and local variable.
The parameter num_draft_tokens conflicts with the local list variable of the same name on line 98. This will cause the parameter value to be shadowed and inaccessible. Consider renaming the parameter to avoid the collision:
```diff
-def generate_statistics_summary(self, num_draft_tokens: int) -> None:
+def generate_statistics_summary(self, max_draft_tokens: int) -> None:
```
Then update the usage on lines 120-125 to use the renamed parameter.
119-126: Fix multiple critical issues in speculative metrics calculation.
Several issues need to be addressed:
- Variable shadowing prevents access to the num_draft_tokens parameter
- Line 122 references the undefined variable num_accepted_draft_tokens instead of num_accepted_drafts
- Line length violation on line 119 (147 > 120 characters)
Apply these fixes:
```diff
 # For speculative decoding, we need to track the number of draft tokens per request and the number of accepted draft tokens per request
-if num_draft_tokens > 0:
-    num_draft_tokens.append(num_draft_tokens * (entry.decode_iteration + 1))
-    num_accepted_draft_tokens.append(entry.num_generated_tokens - entry.decode_iteration - 1)
-    draft_acceptance_rate.append(float(num_accepted_draft_tokens[-1]) / float(num_draft_tokens[-1]))
-    acceptance_length.append(entry.num_generated_tokens / (entry.decode_iteration +
-                                                           1))
+if max_draft_tokens > 0:  # Use renamed parameter
+    num_draft_tokens.append(max_draft_tokens * (entry.decode_iteration + 1))
+    num_accepted_drafts.append(entry.num_generated_tokens - entry.decode_iteration - 1)
+    draft_acceptance_rate.append(float(num_accepted_drafts[-1]) / float(num_draft_tokens[-1]))
+    acceptance_length.append(entry.num_generated_tokens / (entry.decode_iteration + 1))
```
🧹 Nitpick comments (3)
tensorrt_llm/bench/dataclasses/reporting.py (3)
640-663: Good implementation with minor formatting issues.
The method correctly implements a fallback strategy to retrieve the max draft length from multiple configuration sources with proper error handling. Fix the trailing whitespace issues:
```diff
     except (FileNotFoundError, KeyError, json.JSONDecodeError):
         pass
-    
+
     # Try to get from speculative_config
     if ("speculative_config" in self.kwargs
             and self.kwargs["speculative_config"] is not None):
         return self.kwargs["speculative_config"].max_draft_len or 0
-    
+
     return 0
```
438-449: Fix line length violations while maintaining functionality.
The logic correctly adds speculative decoding percentile statistics, but several lines exceed the 120-character limit. Apply formatting fixes to comply with the line length limits:
```diff
 "num_draft_tokens_percentiles":
-    self.statistics.num_draft_tokens_percentiles.model_dump(
-        exclude_none=True, by_alias=True, mode='json') if self.statistics.num_draft_tokens_percentiles else None,
+    (self.statistics.num_draft_tokens_percentiles.model_dump(
+        exclude_none=True, by_alias=True, mode='json')
+     if self.statistics.num_draft_tokens_percentiles else None),
 "num_accepted_drafts_percentiles":
-    self.statistics.num_accepted_drafts_percentiles.model_dump(
-        exclude_none=True, by_alias=True, mode='json') if self.statistics.num_accepted_drafts_percentiles else None,
+    (self.statistics.num_accepted_drafts_percentiles.model_dump(
+        exclude_none=True, by_alias=True, mode='json')
+     if self.statistics.num_accepted_drafts_percentiles else None),
 "draft_acceptance_rate_percentiles":
-    self.statistics.draft_acceptance_rate_percentiles.model_dump(
-        exclude_none=True, by_alias=True, mode='json') if self.statistics.draft_acceptance_rate_percentiles else None,
+    (self.statistics.draft_acceptance_rate_percentiles.model_dump(
+        exclude_none=True, by_alias=True, mode='json')
+     if self.statistics.draft_acceptance_rate_percentiles else None),
 "acceptance_length_percentiles":
-    self.statistics.acceptance_length_percentiles.model_dump(
-        exclude_none=True, by_alias=True, mode='json') if self.statistics.acceptance_length_percentiles else None
+    (self.statistics.acceptance_length_percentiles.model_dump(
+        exclude_none=True, by_alias=True, mode='json')
+     if self.statistics.acceptance_length_percentiles else None)
```
589-627: Excellent implementation of speculative decoding statistics reporting.
The reporting format is consistent with existing patterns and provides comprehensive coverage of all speculative decoding metrics with clear abbreviations and formatting. Fix the trailing whitespace:
```diff
 if decoding is not None:
     decoding = stats_dict["decoding_stats"]
-    
+
     num_draft_tokens = decoding["num_draft_tokens_percentiles"]
```
📒 Files selected for processing (2)
- tensorrt_llm/bench/dataclasses/reporting.py (7 hunks)
- tensorrt_llm/bench/dataclasses/statistics.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tensorrt_llm/bench/dataclasses/statistics.py
🪛 Ruff (0.12.2)
tensorrt_llm/bench/dataclasses/reporting.py
119-119: Line too long (147 > 120)
(E501)
129-129: Blank line contains whitespace
(W293)
158-158: SyntaxError: Duplicate keyword argument "acceptance_length"
440-440: Line too long (125 > 120)
(E501)
443-443: Line too long (128 > 120)
(E501)
446-446: Line too long (130 > 120)
(E501)
449-449: Line too long (125 > 120)
(E501)
589-589: Blank line contains whitespace
(W293)
657-657: Blank line contains whitespace
(W293)
662-662: Blank line contains whitespace
(W293)
🔇 Additional comments (2)
tensorrt_llm/bench/dataclasses/reporting.py (2)
98-101: LGTM: Proper variable declarations for speculative metrics. The variable declarations follow proper naming conventions and are appropriately initialized as empty lists for collecting per-request metrics.
191-192: LGTM: Proper integration of the max draft length parameter. The constructor correctly calls the new get_max_draft_len() method to provide the required parameter to generate_statistics_summary().
Force-pushed 619560b to 4fb07c6
Actionable comments posted: 2
🧹 Nitpick comments (2)
tensorrt_llm/bench/dataclasses/reporting.py (2)
119-126: Fix line length and verify calculations.
The speculative decoding logic appears mathematically sound, but there's a line length issue that needs addressing. Apply this diff to fix the line length:
```diff
-            acceptance_length.append(entry.num_generated_tokens / (entry.decode_iteration +
-                                                                   1))
+            acceptance_length.append(
+                entry.num_generated_tokens / (entry.decode_iteration + 1))
```
The calculations look correct:
- num_draft_tokens: total tokens proposed across all decode iterations
- num_accepted_draft_tokens: generated tokens minus base tokens per iteration
- draft_acceptance_rate: ratio of accepted to proposed tokens
- acceptance_length: average tokens per decode iteration
437-448: Fix line length issues while maintaining logic.
The new percentile statistics are correctly added with proper null checking, but several lines exceed the 120-character limit. Apply this diff to improve readability:
```diff
 "num_draft_tokens_percentiles":
-    self.statistics.num_draft_tokens_percentiles.model_dump(
-        exclude_none=True, by_alias=True, mode='json') if self.statistics.num_draft_tokens_percentiles else None,
+    (self.statistics.num_draft_tokens_percentiles.model_dump(
+        exclude_none=True, by_alias=True, mode='json')
+     if self.statistics.num_draft_tokens_percentiles else None),
 "num_accepted_draft_tokens_percentiles":
-    self.statistics.num_accepted_draft_tokens_percentiles.model_dump(
-        exclude_none=True, by_alias=True, mode='json') if self.statistics.num_accepted_draft_tokens_percentiles else None,
+    (self.statistics.num_accepted_draft_tokens_percentiles.model_dump(
+        exclude_none=True, by_alias=True, mode='json')
+     if self.statistics.num_accepted_draft_tokens_percentiles else None),
 "draft_acceptance_rate_percentiles":
-    self.statistics.draft_acceptance_rate_percentiles.model_dump(
-        exclude_none=True, by_alias=True, mode='json') if self.statistics.draft_acceptance_rate_percentiles else None,
+    (self.statistics.draft_acceptance_rate_percentiles.model_dump(
+        exclude_none=True, by_alias=True, mode='json')
+     if self.statistics.draft_acceptance_rate_percentiles else None),
 "acceptance_length_percentiles":
-    self.statistics.acceptance_length_percentiles.model_dump(
-        exclude_none=True, by_alias=True, mode='json') if self.statistics.acceptance_length_percentiles else None
+    (self.statistics.acceptance_length_percentiles.model_dump(
+        exclude_none=True, by_alias=True, mode='json')
+     if self.statistics.acceptance_length_percentiles else None)
```
📒 Files selected for processing (2)
- tensorrt_llm/bench/dataclasses/reporting.py (7 hunks)
- tensorrt_llm/bench/dataclasses/statistics.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tensorrt_llm/bench/dataclasses/statistics.py
🪛 Ruff (0.12.2)
tensorrt_llm/bench/dataclasses/reporting.py
118-118: Line too long (147 > 120)
(E501)
438-438: Line too long (125 > 120)
(E501)
441-441: Line too long (134 > 120)
(E501)
444-444: Line too long (130 > 120)
(E501)
447-447: Line too long (125 > 120)
(E501)
🔇 Additional comments (6)
tensorrt_llm/bench/dataclasses/reporting.py (6)
75-75: LGTM! Variable name collision resolved. The parameter name has been changed from num_draft_tokens to max_draft_tokens, which resolves the previous variable shadowing issue with the local list variable.
98-101: LGTM! Well-structured metric tracking variables. The new metric tracking lists are properly initialized and follow consistent naming conventions for speculative decoding statistics.
127-127: LGTM! Correct global acceptance length calculation. The calculation properly computes the average tokens generated per decoding iteration across all requests, which is the appropriate global metric for speculative decoding performance.
130-137: LGTM! Proper percentile statistics calculation. The percentile calculations correctly handle both populated and empty metric lists, ensuring robust statistics generation regardless of whether speculative decoding is enabled.
190-191: LGTM! Proper statistics initialization. The constructor correctly stores raw statistics and generates processed statistics with the appropriate max draft length parameter, enabling proper speculative decoding metrics calculation.
639-662: LGTM! Well-implemented configuration retrieval. The method properly handles both C++ and PyTorch backend configurations with appropriate fallbacks and error handling. The logic flow correctly prioritizes the engine config over the speculative config and provides a sensible default.
/bot run
Force-pushed 4fb07c6 to 51d5ed5
Actionable comments posted: 0
♻️ Duplicate comments (1)
tensorrt_llm/bench/dataclasses/reporting.py (1)
588-625: Add missing null checks for robust reporting.
The reporting logic for speculative decoding statistics is comprehensive but lacks null checks that were identified in previous reviews; this could cause runtime errors when percentile statistics are unavailable. The statistics formatting assumes all percentile objects exist, but they can be None when no speculative decoding is used. Add null checks before accessing each percentile statistics dictionary to prevent runtime errors.
🧹 Nitpick comments (2)
tensorrt_llm/bench/dataclasses/reporting.py (2)
119-126: Verify the draft token calculation logic.
The speculative decoding metrics calculation appears sound, but please verify:
- Line 121: max_draft_tokens * (entry.decode_iteration + 1) assumes each iteration generates exactly max_draft_tokens draft tokens
- Line 122: entry.num_generated_tokens - entry.decode_iteration - 1 calculates accepted tokens by subtracting iterations from total generated tokens
- Lines 124-125: the acceptance length calculation divides total generated tokens by iterations
Please confirm these calculations align with the actual speculative decoding implementation. Apply this diff to fix the line length issue flagged by static analysis:
```diff
-        # For speculative decoding, we need to track the number of draft tokens per request and the number of accepted draft tokens per request
+        # For speculative decoding, we need to track draft tokens and
+        # accepted draft tokens per request
```
437-448: Fix line length violations in the statistics dictionary.
The logic for including speculative decoding percentile statistics is correct with proper null checks, but several lines exceed the 120-character limit. Apply this diff to fix the line length issues:
```diff
-"num_draft_tokens_percentiles":
-    self.statistics.num_draft_tokens_percentiles.model_dump(
-        exclude_none=True, by_alias=True, mode='json') if self.statistics.num_draft_tokens_percentiles else None,
+"num_draft_tokens_percentiles":
+    (self.statistics.num_draft_tokens_percentiles.model_dump(
+        exclude_none=True, by_alias=True, mode='json')
+     if self.statistics.num_draft_tokens_percentiles else None),
-"num_accepted_draft_tokens_percentiles":
-    self.statistics.num_accepted_draft_tokens_percentiles.model_dump(
-        exclude_none=True, by_alias=True, mode='json') if self.statistics.num_accepted_draft_tokens_percentiles else None,
+"num_accepted_draft_tokens_percentiles":
+    (self.statistics.num_accepted_draft_tokens_percentiles.model_dump(
+        exclude_none=True, by_alias=True, mode='json')
+     if self.statistics.num_accepted_draft_tokens_percentiles else None),
-"draft_acceptance_rate_percentiles":
-    self.statistics.draft_acceptance_rate_percentiles.model_dump(
-        exclude_none=True, by_alias=True, mode='json') if self.statistics.draft_acceptance_rate_percentiles else None,
+"draft_acceptance_rate_percentiles":
+    (self.statistics.draft_acceptance_rate_percentiles.model_dump(
+        exclude_none=True, by_alias=True, mode='json')
+     if self.statistics.draft_acceptance_rate_percentiles else None),
-"acceptance_length_percentiles":
-    self.statistics.acceptance_length_percentiles.model_dump(
-        exclude_none=True, by_alias=True, mode='json') if self.statistics.acceptance_length_percentiles else None
+"acceptance_length_percentiles":
+    (self.statistics.acceptance_length_percentiles.model_dump(
+        exclude_none=True, by_alias=True, mode='json')
+     if self.statistics.acceptance_length_percentiles else None)
```
📒 Files selected for processing (2)
- tensorrt_llm/bench/dataclasses/reporting.py (7 hunks)
- tensorrt_llm/bench/dataclasses/statistics.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tensorrt_llm/bench/dataclasses/statistics.py
🪛 Ruff (0.12.2)
tensorrt_llm/bench/dataclasses/reporting.py
118-118: Line too long (147 > 120)
(E501)
438-438: Line too long (125 > 120)
(E501)
441-441: Line too long (134 > 120)
(E501)
444-444: Line too long (130 > 120)
(E501)
447-447: Line too long (125 > 120)
(E501)
🔇 Additional comments (6)
tensorrt_llm/bench/dataclasses/reporting.py (6)
75-75: LGTM: Method signature updated for speculative decoding support. The parameter name max_draft_tokens clearly indicates its purpose and avoids the variable collision issues mentioned in previous reviews.
98-101: LGTM: Proper initialization of speculative decoding metrics. The four lists for tracking draft token metrics are well-named and clearly serve their purpose in speculative decoding analysis.
130-137: LGTM: Robust percentile statistics creation. The conditional creation of percentile statistics objects with proper null checks ensures the code handles cases where no speculative decoding metrics are available.
157-161: LGTM: Complete integration of speculative decoding metrics. All four new percentile statistics objects are properly passed to the BenchmarkStatistics constructor, ensuring comprehensive tracking of speculative decoding performance.
190-191: LGTM: Clean integration of speculative decoding support. The constructor properly calls generate_statistics_summary with the maximum draft length retrieved from configuration, maintaining backward compatibility while adding new functionality.
639-662: LGTM: Well-designed configuration retrieval method. The get_max_draft_len method provides excellent abstraction for retrieving the maximum draft length from different backend configurations:
- Handles both the C++ backend (engine config file) and the PyTorch backend (speculative_config)
- Includes proper exception handling for file operations
- Provides a sensible default value of 0
- Clean separation of concerns
The implementation is robust and follows good error handling practices.
Force-pushed 347d461 to b8ba61d
Actionable comments posted: 0
♻️ Duplicate comments (2)
tensorrt_llm/bench/dataclasses/reporting.py (2)
588-627: Add null checks for robustness in percentile statistics formatting.
The reporting logic is comprehensive and well-formatted, but should include null checks to prevent runtime errors when percentile statistics are not available. Add null checks before accessing each percentile dictionary:
```diff
 if self.get_max_draft_len() > 0:
-    num_draft_tokens = decoding["num_draft_tokens_percentiles"]
-    num_draft_tokens_stats = "\n".join(
-        f"[DT] {key.upper():<7}: {num_draft_tokens[key]:.2f}" for key in
-        ["minimum", "maximum", "average", "p50", "p90", "p95", "p99"])
+    num_draft_tokens_stats = ""
+    if decoding["num_draft_tokens_percentiles"]:
+        num_draft_tokens = decoding["num_draft_tokens_percentiles"]
+        num_draft_tokens_stats = "\n".join(
+            f"[DT] {key.upper():<7}: {num_draft_tokens[key]:.2f}" for key in
+            ["minimum", "maximum", "average", "p50", "p90", "p95", "p99"])
```
Apply similar null checks for the other three statistics types to ensure robust reporting when some metrics are unavailable.
157-161: Missing total_draft_tokens field in the BenchmarkStatistics construction.
The new percentile fields are correctly added, but based on past review comments, the total_draft_tokens field is still missing from the constructor. Add the missing field:
```diff
 issue_rate_ns=queue_time_total / num_requests,
+total_draft_tokens=sum(num_draft_tokens) if num_draft_tokens else 0,
 acceptance_length=global_acceptance_length,
```
🧹 Nitpick comments (2)
tensorrt_llm/bench/dataclasses/reporting.py (2)
119-126: LGTM! Speculative decoding metrics calculation is mathematically sound.
The logic correctly calculates:
- Total draft tokens per request
- Accepted draft tokens per request
- Draft acceptance rate as a ratio
- Average acceptance length per iteration
Note: lines 124-125 exceed the 120-character limit. Consider breaking them into multiple lines:
```diff
-            acceptance_length.append(entry.num_generated_tokens / (entry.decode_iteration +
-                                                                   1))
+            acceptance_length.append(
+                entry.num_generated_tokens / (entry.decode_iteration + 1)
+            )
```
437-448: LGTM! The statistics dictionary properly includes the new percentile metrics.
The conditional model_dump calls correctly handle None values and serialize the statistics appropriately. Consider breaking long lines to improve readability:
```diff
-"num_draft_tokens_percentiles":
-    self.statistics.num_draft_tokens_percentiles.model_dump(
-        exclude_none=True, by_alias=True, mode='json') if self.statistics.num_draft_tokens_percentiles else None,
+"num_draft_tokens_percentiles": (
+    self.statistics.num_draft_tokens_percentiles.model_dump(
+        exclude_none=True, by_alias=True, mode='json'
+    ) if self.statistics.num_draft_tokens_percentiles else None
+),
```
Apply similar formatting to the other three percentile entries.
📒 Files selected for processing (2)
- tensorrt_llm/bench/dataclasses/reporting.py (7 hunks)
- tensorrt_llm/bench/dataclasses/statistics.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tensorrt_llm/bench/dataclasses/statistics.py
🪛 Ruff (0.12.2)
tensorrt_llm/bench/dataclasses/reporting.py
118-118: Line too long (147 > 120)
(E501)
438-438: Line too long (125 > 120)
(E501)
441-441: Line too long (134 > 120)
(E501)
444-444: Line too long (130 > 120)
(E501)
447-447: Line too long (125 > 120)
(E501)
🔇 Additional comments (6)
tensorrt_llm/bench/dataclasses/reporting.py (6)
75-75: LGTM! Method signature properly updated for speculative decoding support. The max_draft_tokens parameter name clearly indicates its purpose and the type annotation is correct.
98-101: LGTM! Well-named variables for speculative decoding metrics. The variable names clearly indicate their purpose and follow proper Python naming conventions.
127-127: LGTM! Global acceptance length calculation is correct. This properly calculates the overall average acceptance length across all requests as total output tokens divided by total decoding iterations.
130-137: LGTM! Percentile statistics creation is robust. The conditional creation pattern ensures statistics are only generated when data exists, preventing errors when speculative decoding is not enabled.
190-191: LGTM! The constructor properly integrates speculative decoding support. The call to generate_statistics_summary now correctly passes the maximum draft length, enabling proper statistics collection.
639-662: LGTM! Well-implemented method for retrieving the maximum draft length. The method correctly handles both C++ and PyTorch backends, gracefully handles file reading errors, and returns a sensible default value of 0 when speculative decoding is not configured. The implementation properly:
- Checks the C++ backend engine config first
- Falls back to the PyTorch speculative config
- Handles FileNotFoundError, KeyError, and JSONDecodeError exceptions
- Returns 0 as a safe default
Force-pushed 7ae33b0 to 74c6b46
Actionable comments posted: 0
♻️ Duplicate comments (2)
tensorrt_llm/bench/dataclasses/reporting.py (2)
588-627: Add null checks for robustness in statistics formatting.
While the condition checks get_max_draft_len() > 0, the code directly accesses percentile dictionaries without verifying they're not None. This could cause runtime errors if percentile statistics are unavailable. Add null checks before accessing each percentile dictionary:
```diff
 if self.get_max_draft_len() > 0:
-    num_draft_tokens = decoding["num_draft_tokens_percentiles"]
-    num_draft_tokens_stats = "\n".join(
-        f"[DT] {key.upper():<7}: {num_draft_tokens[key]:.2f}" for key in
-        ["minimum", "maximum", "average", "p50", "p90", "p95", "p99"])
+    num_draft_tokens_stats = ""
+    if decoding["num_draft_tokens_percentiles"]:
+        num_draft_tokens = decoding["num_draft_tokens_percentiles"]
+        num_draft_tokens_stats = "\n".join(
+            f"[DT] {key.upper():<7}: {num_draft_tokens[key]:.2f}" for key in
+            ["minimum", "maximum", "average", "p50", "p90", "p95", "p99"])
```
Apply similar null checks for the other three percentile statistics.
157-161: Missing total_draft_tokens field in the BenchmarkStatistics construction.
Based on previous review feedback, the total_draft_tokens field appears to be missing from the BenchmarkStatistics constructor. The percentile fields are correctly added, but consider adding the total as well:
```diff
 issue_rate_ns=queue_time_total / num_requests,
+total_draft_tokens=sum(num_draft_tokens) if num_draft_tokens else 0,
 acceptance_length=global_acceptance_length,
```
🧹 Nitpick comments (1)
tensorrt_llm/bench/dataclasses/reporting.py (1)
118-118: Format long lines for better readability.
Several lines exceed the 120-character limit. Example for line 118:
```diff
-            acceptance_length.append(entry.num_total_output_tokens / (entry.decode_iteration +
-                                                                      1))
+            acceptance_length.append(
+                entry.num_total_output_tokens / (entry.decode_iteration + 1)
+            )
```
Apply similar formatting to the other long lines in the model_dump() calls.
Also applies to: 438-438, 441-441, 444-444, 447-447
📒 Files selected for processing (1)
- tensorrt_llm/bench/dataclasses/reporting.py (7 hunks)
🧰 Additional context used
🪛 Ruff (0.12.2)
tensorrt_llm/bench/dataclasses/reporting.py
118-118: Line too long (147 > 120)
(E501)
438-438: Line too long (125 > 120)
(E501)
441-441: Line too long (134 > 120)
(E501)
444-444: Line too long (130 > 120)
(E501)
447-447: Line too long (125 > 120)
(E501)
🔇 Additional comments (5)
tensorrt_llm/bench/dataclasses/reporting.py (5)
75-75: LGTM! Parameter naming issue resolved. The method signature correctly uses the max_draft_tokens parameter name, which resolves the variable shadowing issue identified in previous reviews.
98-137: LGTM! Comprehensive speculative decoding metrics implementation. The metrics collection is well-implemented:
- Proper conditional logic to only collect metrics when speculative decoding is enabled
- Correct calculation of draft tokens, accepted draft tokens, acceptance rates, and acceptance lengths
- Appropriate null checks when creating percentile statistics
The metrics provide valuable insights into speculative decoding performance.
190-191: LGTM! Clean constructor implementation. The constructor properly stores raw statistics and generates the summary with the appropriate max_draft_len parameter. Good separation of concerns.
437-448: LGTM! Robust statistics dictionary implementation. The decoding statistics are properly added with appropriate null checks and consistent formatting. The defensive programming approach prevents runtime errors when percentile statistics are unavailable.
639-662: LGTM! Well-implemented configuration retrieval method. The get_max_draft_len method properly handles multiple configuration sources with appropriate fallbacks:
- Tries the C++ backend engine config first
- Falls back to speculative_config from kwargs
- Includes proper exception handling for file operations
- Returns a sensible default of 0
The implementation is robust and follows good practices.
Force-pushed 74c6b46 to 43f708a
/mbot run
/bot run
Actionable comments posted: 0
♻️ Duplicate comments (2)
tensorrt_llm/bench/dataclasses/reporting.py (2)
157-161: Missing total_draft_tokens field in the BenchmarkStatistics construction.
Based on previous review feedback, the total_draft_tokens field is still missing from the BenchmarkStatistics constructor. Add the missing field:
```diff
 issue_rate_ns=queue_time_total / num_requests,
+total_draft_tokens=sum(num_draft_tokens) if num_draft_tokens else 0,
 acceptance_length=global_acceptance_length,
```
588-627: Comprehensive speculative decoding statistics logging with a potential null safety issue.
The detailed logging provides excellent visibility into speculative decoding performance with clear labeling and full percentile breakdowns. However, this duplicates a previous review concern about null checking. Add null checks before accessing percentile dictionaries to prevent runtime errors when statistics are unavailable:
```diff
 if self.get_max_draft_len() > 0:
-    num_draft_tokens = decoding["num_draft_tokens_percentiles"]
-    num_draft_tokens_stats = "\n".join(
-        f"[DT] {key.upper():<7}: {num_draft_tokens[key]:.2f}" for key in
-        ["minimum", "maximum", "average", "p50", "p90", "p95", "p99"])
+    num_draft_tokens_stats = ""
+    if decoding["num_draft_tokens_percentiles"]:
+        num_draft_tokens = decoding["num_draft_tokens_percentiles"]
+        num_draft_tokens_stats = "\n".join(
+            f"[DT] {key.upper():<7}: {num_draft_tokens[key]:.2f}" for key in
+            ["minimum", "maximum", "average", "p50", "p90", "p95", "p99"])
```
🧹 Nitpick comments (1)
tensorrt_llm/bench/dataclasses/reporting.py (1)
437-448: LGTM! Comprehensive percentile statistics with proper null safety.
The updated dictionary correctly includes all four new percentile statistics with appropriate null checks. However, consider breaking the long lines for better readability:
```diff
-"num_draft_tokens_percentiles":
-    self.statistics.num_draft_tokens_percentiles.model_dump(
-        exclude_none=True, by_alias=True, mode='json') if self.statistics.num_draft_tokens_percentiles else None,
+"num_draft_tokens_percentiles": (
+    self.statistics.num_draft_tokens_percentiles.model_dump(
+        exclude_none=True, by_alias=True, mode='json')
+    if self.statistics.num_draft_tokens_percentiles else None
+),
```
📒 Files selected for processing (2)
- tensorrt_llm/bench/dataclasses/reporting.py (7 hunks)
- tensorrt_llm/bench/dataclasses/statistics.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tensorrt_llm/bench/dataclasses/statistics.py
🧰 Additional context used
🪛 Ruff (0.12.2)
tensorrt_llm/bench/dataclasses/reporting.py
118-118: Line too long (147 > 120)
(E501)
438-438: Line too long (125 > 120)
(E501)
441-441: Line too long (134 > 120)
(E501)
444-444: Line too long (130 > 120)
(E501)
447-447: Line too long (125 > 120)
(E501)
🔇 Additional comments (5)
tensorrt_llm/bench/dataclasses/reporting.py (5)
75-75: LGTM! Variable shadowing issue resolved. The parameter rename from num_draft_tokens to max_draft_tokens correctly addresses the previous variable shadowing issue with the local list variable.
98-126: LGTM! Comprehensive speculative decoding metrics collection. The logic correctly collects key speculative decoding metrics:
- Total draft tokens generated per request
- Accepted draft tokens per request
- Draft acceptance rate calculations
- Acceptance length per iteration
The conditional collection based on max_draft_tokens > 0 ensures metrics are only computed when speculative decoding is enabled.
130-137: LGTM! Proper percentile statistics generation with null safety. The logic correctly generates percentile statistics for each metric type, with appropriate null checks to handle empty lists when speculative decoding is not active.
190-191: LGTM! Proper integration of speculative decoding metrics. The constructor correctly calls generate_statistics_summary with the maximum draft length to enable collection of speculative decoding metrics when appropriate.
639-662: LGTM! Robust configuration retrieval for multiple backends. The method correctly handles both C++ and PyTorch backends with appropriate fallback logic:
- Reads from the engine config JSON for the C++ backend
- Falls back to speculative_config kwargs for the PyTorch backend
- Includes proper exception handling for file operations
- Returns a sensible default of 0 when configuration is unavailable
Actionable comments posted: 0
♻️ Duplicate comments (2)
tensorrt_llm/bench/dataclasses/reporting.py (2)
157-161: Add the missing total_draft_tokens field.
Based on past review comments, the total_draft_tokens field is still missing from the BenchmarkStatistics construction. Add the missing field before line 157:
```diff
 issue_rate_ns=queue_time_total / num_requests,
+total_draft_tokens=sum(num_draft_tokens) if num_draft_tokens else 0,
 acceptance_length=global_acceptance_length,
```
588-627: Add null checks for robustness.
The formatting logic correctly displays speculative decoding statistics, but should include null checks to prevent potential runtime errors when percentile statistics are not available. Add null checks before formatting each statistic type:
```diff
 if self.get_max_draft_len() > 0:
-    num_draft_tokens = decoding["num_draft_tokens_percentiles"]
-    num_draft_tokens_stats = "\n".join(
-        f"[DT] {key.upper():<7}: {num_draft_tokens[key]:.2f}" for key in
-        ["minimum", "maximum", "average", "p50", "p90", "p95", "p99"])
+    num_draft_tokens_stats = ""
+    if decoding["num_draft_tokens_percentiles"]:
+        num_draft_tokens = decoding["num_draft_tokens_percentiles"]
+        num_draft_tokens_stats = "\n".join(
+            f"[DT] {key.upper():<7}: {num_draft_tokens[key]:.2f}" for key in
+            ["minimum", "maximum", "average", "p50", "p90", "p95", "p99"])
```
Apply similar patterns for the other three statistic types to ensure robustness when some metrics are unavailable.
🧹 Nitpick comments (2)
tensorrt_llm/bench/dataclasses/reporting.py (2)
119-126: Fix line length violation.
The logic for collecting speculative decoding statistics is correct, but line 118 exceeds the 120-character limit. Apply this diff to fix the line length:
```diff
-        # For speculative decoding, we need to track the number of draft tokens per request and the number of accepted draft tokens per request
+        # For speculative decoding, we need to track draft tokens and accepted draft tokens per request
```
437-448: Fix line length violations.
The logic for adding percentile statistics to the decoding stats dictionary is correct, but several lines exceed the 120-character limit. Apply these diffs to fix the violations:
```diff
-"num_draft_tokens_percentiles":
-    self.statistics.num_draft_tokens_percentiles.model_dump(
-        exclude_none=True, by_alias=True, mode='json') if self.statistics.num_draft_tokens_percentiles else None,
+"num_draft_tokens_percentiles": (
+    self.statistics.num_draft_tokens_percentiles.model_dump(
+        exclude_none=True, by_alias=True, mode='json')
+    if self.statistics.num_draft_tokens_percentiles else None
+),
-"num_accepted_draft_tokens_percentiles":
-    self.statistics.num_accepted_draft_tokens_percentiles.model_dump(
-        exclude_none=True, by_alias=True, mode='json') if self.statistics.num_accepted_draft_tokens_percentiles else None,
+"num_accepted_draft_tokens_percentiles": (
+    self.statistics.num_accepted_draft_tokens_percentiles.model_dump(
+        exclude_none=True, by_alias=True, mode='json')
+    if self.statistics.num_accepted_draft_tokens_percentiles else None
+),
-"draft_acceptance_rate_percentiles":
-    self.statistics.draft_acceptance_rate_percentiles.model_dump(
-        exclude_none=True, by_alias=True, mode='json') if self.statistics.draft_acceptance_rate_percentiles else None,
+"draft_acceptance_rate_percentiles": (
+    self.statistics.draft_acceptance_rate_percentiles.model_dump(
+        exclude_none=True, by_alias=True, mode='json')
+    if self.statistics.draft_acceptance_rate_percentiles else None
+),
-"acceptance_length_percentiles":
-    self.statistics.acceptance_length_percentiles.model_dump(
-        exclude_none=True, by_alias=True, mode='json') if self.statistics.acceptance_length_percentiles else None
+"acceptance_length_percentiles": (
+    self.statistics.acceptance_length_percentiles.model_dump(
+        exclude_none=True, by_alias=True, mode='json')
+    if self.statistics.acceptance_length_percentiles else None
+)
```
📒 Files selected for processing (2)
- tensorrt_llm/bench/dataclasses/reporting.py (7 hunks)
- tensorrt_llm/bench/dataclasses/statistics.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tensorrt_llm/bench/dataclasses/statistics.py
🧰 Additional context used
🪛 Ruff (0.12.2)
tensorrt_llm/bench/dataclasses/reporting.py
118-118: Line too long (147 > 120)
(E501)
438-438: Line too long (125 > 120)
(E501)
441-441: Line too long (134 > 120)
(E501)
444-444: Line too long (130 > 120)
(E501)
447-447: Line too long (125 > 120)
(E501)
🔇 Additional comments (5)
tensorrt_llm/bench/dataclasses/reporting.py (5)
75-75: LGTM! The method signature change to accept the max_draft_tokens parameter is correct and addresses the previous variable name collision issue.
98-101: LGTM! The new variable declarations for tracking speculative decoding metrics are well-named and properly initialized.
130-137: LGTM! The percentile calculations are correctly implemented with proper null checks to handle cases where speculative decoding is not used.
190-191: LGTM! The constructor correctly integrates the new speculative decoding metrics by calling generate_statistics_summary with the maximum draft length.
639-662: LGTM! The new get_max_draft_len method is well-implemented with proper error handling and covers both C++ and Python backend configurations. The fallback to 0 is appropriate when no speculative decoding configuration is found.
/bot run
1 similar comment
/bot run
PR_Github #13915 [ run ] triggered by Bot
PR_Github #13915 [ run ] completed with state
/bot run
PR_Github #13952 [ run ] triggered by Bot
PR_Github #13952 [ run ] completed with state
As discussed offline with @lfr-0531, the trtllm-bench tensorrt backend is not well supported for speculative decoding. For this PR, do you think it is okay to report detailed speculative metrics for the pytorch backend only? @mikeiovine @longlee0622
I think it's OK to focus on the Pytorch backend for now.
Force-pushed 52a73c4 to 5ffabdf
Actionable comments posted: 0
♻️ Duplicate comments (1)
tensorrt_llm/bench/dataclasses/reporting.py (1)
599-653: Add null checks for robust reporting.
The reporting logic assumes all percentile statistics exist but should handle cases where they might be None.
```diff
-        if self.get_max_draft_len() > 0:
-            num_draft_tokens = decoding["num_draft_tokens_percentiles"]
+        if (self.get_max_draft_len() > 0 and
+                decoding["num_draft_tokens_percentiles"] is not None):
+            num_draft_tokens = decoding["num_draft_tokens_percentiles"]
```
Apply similar null checks for num_accepted_draft_tokens_percentiles, draft_acceptance_rate_percentiles, and acceptance_length_percentiles before accessing their keys, to prevent potential AttributeError or TypeError exceptions.
🧹 Nitpick comments (1)
tensorrt_llm/bench/dataclasses/reporting.py (1)
118-118: Fix line length violation.
The line exceeds the 120-character limit flagged by static analysis.
```diff
-        # For speculative decoding, we need to track the number of draft tokens per request and the number of accepted draft tokens per request
+        # For speculative decoding, we need to track the number of draft tokens
+        # per request and the number of accepted draft tokens per request
```
📒 Files selected for processing (2)
- tensorrt_llm/bench/dataclasses/reporting.py (7 hunks)
- tensorrt_llm/bench/dataclasses/statistics.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tensorrt_llm/bench/dataclasses/statistics.py
🪛 Ruff (0.12.2)
tensorrt_llm/bench/dataclasses/reporting.py
118-118: Line too long (147 > 120)
(E501)
🔇 Additional comments (9)
tensorrt_llm/bench/dataclasses/reporting.py (9)
75-75: LGTM: Method signature updated appropriately. The parameter name change from num_draft_tokens to max_draft_tokens resolves the variable shadowing issue identified in previous reviews and clearly indicates the parameter's purpose.
98-101: LGTM: Variable declarations are well-structured. The new variables for tracking speculative decoding metrics are clearly named and appropriately initialized as empty lists for per-request data collection.
131-132: LGTM: Global acceptance length calculation is correct. The calculation sum(output_tokens) / total_decoding_iterations appropriately computes the average acceptance length across all requests.
135-142: LGTM: Defensive programming with proper null checks. The percentile statistics generation correctly handles empty lists by returning None, preventing runtime errors when no speculative decoding data is available.
162-167: LGTM: BenchmarkStatistics construction updated correctly. All new speculative decoding fields are properly added to the statistics object construction, maintaining consistency with the data model.
196-198: LGTM: Constructor integration is well-implemented. The integration of get_max_draft_len() into the statistics generation call is clean and maintains the existing initialization pattern.
444-460: LGTM: Statistics dictionary properly handles null values. The dictionary construction correctly uses conditional expressions to handle None values for percentile statistics, preventing serialization issues when speculative decoding data is not available.
665-672: LGTM: Helper method is well-implemented. The get_max_draft_len() method correctly extracts the configuration value with appropriate fallbacks and null safety checks.
119-130: Verify speculative decoding calculations (tensorrt_llm/bench/dataclasses/reporting.py, lines 119-130). Please confirm that the following formulas match the intended speculative decoding specification:
Draft tokens: num_draft_tokens = max_draft_tokens * (entry.decode_iteration + 1) (assumes each iteration consumes the full draft token budget)
Accepted draft tokens: num_accepted_draft_tokens = entry.num_total_output_tokens - (entry.decode_iteration + 1) (assumes one baseline token per iteration)
Acceptance length: acceptance_length = entry.num_total_output_tokens / (entry.decode_iteration + 1) (calculates average output tokens per iteration)
Ensure these calculations align with the algorithm's documentation or reference implementation.
/bot run |
PR_Github #13976 [ run ] triggered by Bot |
PR_Github #13976 [ run ] completed with state |
Signed-off-by: linquanh <[email protected]>
5ffabdf to 903bd76 (Compare)
I have tested the PyTorch backend using DeepSeek, and it works as expected. TensorRT backend support has been removed for now. |
/bot run |
PR_Github #13996 [ run ] triggered by Bot |
Actionable comments posted: 0
♻️ Duplicate comments (1)
tensorrt_llm/bench/dataclasses/reporting.py (1)
599-653: Good implementation, but add null checks for robustness. The reporting logic is well-structured and provides comprehensive speculative decoding metrics. However, consider adding null checks before accessing percentile data to prevent potential runtime errors.
Add null checks before formatting each statistic type:

-            num_draft_tokens = decoding["num_draft_tokens_percentiles"]
-            num_draft_tokens_stats = "\n".join(
-                f"[DT] {key.upper():<7}: {num_draft_tokens[key]:.2f}"
-                for key in [
-                    "minimum", "maximum", "average", "p50", "p90", "p95",
-                    "p99"
-                ])
+            num_draft_tokens_stats = ""
+            if decoding["num_draft_tokens_percentiles"]:
+                num_draft_tokens = decoding["num_draft_tokens_percentiles"]
+                num_draft_tokens_stats = "\n".join(
+                    f"[DT] {key.upper():<7}: {num_draft_tokens[key]:.2f}"
+                    for key in [
+                        "minimum", "maximum", "average", "p50", "p90", "p95",
+                        "p99"
+                    ])

Apply similar patterns for the other three statistic types.
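Sketching that suggestion as a single loop over all four statistic types, assuming the dict key names used elsewhere in this PR; the [DT] tag comes from the diff above, while the AD/AR/AL tags are illustrative:

SPEC_STAT_KEYS = (
    ("DT", "num_draft_tokens_percentiles"),
    ("AD", "num_accepted_draft_tokens_percentiles"),
    ("AR", "draft_acceptance_rate_percentiles"),
    ("AL", "acceptance_length_percentiles"),
)

PCT_KEYS = ("minimum", "maximum", "average", "p50", "p90", "p95", "p99")


def format_speculative_stats(decoding: dict) -> str:
    """Build the report lines, silently skipping statistics that are None."""
    lines = []
    for tag, key in SPEC_STAT_KEYS:
        entry = decoding.get(key)
        if not entry:
            continue
        lines.extend(f"[{tag}] {k.upper():<7}: {entry[k]:.2f}"
                     for k in PCT_KEYS)
    return "\n".join(lines)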
🧹 Nitpick comments (1)
tensorrt_llm/bench/dataclasses/reporting.py (1)
119-129: Logic looks correct, but fix line length violation. The speculative decoding metrics calculations are mathematically sound and properly gated behind the max_draft_tokens > 0 condition. However, lines 121-122 exceed the 120-character limit. Consider breaking them:

-                num_draft_tokens.append(max_draft_tokens *
-                                        (entry.decode_iteration + 1))
+                num_draft_tokens.append(
+                    max_draft_tokens * (entry.decode_iteration + 1))
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
tensorrt_llm/bench/dataclasses/reporting.py (7 hunks)
tensorrt_llm/bench/dataclasses/statistics.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tensorrt_llm/bench/dataclasses/statistics.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Files:
tensorrt_llm/bench/dataclasses/reporting.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Files:
tensorrt_llm/bench/dataclasses/reporting.py
🪛 Ruff (0.12.2)
tensorrt_llm/bench/dataclasses/reporting.py
118-118: Line too long (147 > 120)
(E501)
🔇 Additional comments (8)
tensorrt_llm/bench/dataclasses/reporting.py (8)
75-75: LGTM! Parameter naming issue resolved. The method signature change correctly addresses the previous variable shadowing issue by using max_draft_tokens instead of num_draft_tokens.
98-101: LGTM! Well-named variables for speculative metrics. The variable declarations follow Python naming conventions and are descriptive for their purpose.
131-132: LGTM! Correct global acceptance length calculation. The calculation properly computes the average acceptance length across all requests by dividing total output tokens by total decoding iterations.
135-142: LGTM! Proper percentile statistics creation with null safety. The code correctly handles empty lists and follows the established pattern for creating percentile statistics.
162-167: LGTM! Complete BenchmarkStatistics constructor update. All new percentile statistics fields are properly added to the constructor, and the acceptance_length field is correctly updated.
196-198: LGTM! Proper integration of max draft length parameter. The constructor correctly calls the updated generate_statistics_summary method with the max draft length configuration.
665-672: LGTM! Robust configuration extraction method. The method safely extracts the max_draft_len from speculative_config with proper null checks and sensible defaults.
444-460: LGTM! Complete statistics dictionary integration. The new percentile statistics are properly integrated into the decoding_stats section with appropriate null checking.
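For reference, the helper described above might look roughly like this; self.config and the attribute names are assumptions based on the review text, not the PR's verbatim code:

def get_max_draft_len(self) -> int:
    """Return the configured max draft length, or 0 when speculative decoding is off."""
    spec_config = getattr(self.config, "speculative_config", None)  # assumed attribute
    if spec_config is None:
        return 0
    return getattr(spec_config, "max_draft_len", 0) or 0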
PR_Github #13996 [ run ] completed with state |
LGTM, should just need a stamp from @FrankD412
LGTM as well -- merging. |
…#6476) Signed-off-by: linquanh <[email protected]> Signed-off-by: Lanyu Liao <[email protected]>
…#6476) Signed-off-by: linquanh <[email protected]>
Summary by CodeRabbit
New Features
Improvements
Description
Test Coverage
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run
/bot [-h|--help]
to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id
(OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
--disable-reuse-test
(OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests are run regardless of previous successes.
--disable-fail-fast
(OPTIONAL) : Disable fail fast on build/tests/infra failures.
--skip-test
(OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
--stage-list "A10-PyTorch-1, xxx"
(OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
--gpu-type "A30, H100_PCIe"
(OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
--test-backend "pytorch, cpp"
(OPTIONAL) : Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
--only-multi-gpu-test
(OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
--disable-multi-gpu-test
(OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
--add-multi-gpu-test
(OPTIONAL) : Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.
--post-merge
(OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"
(OPTIONAL) : Run the ordinary L0 pre-merge pipeline and the specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
--detailed-log
(OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
--debug
(OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.
For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.
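For example, to launch a pipeline that runs only one stage without fail-fast, combine the options above:

/bot run --disable-fail-fast --stage-list "A10-PyTorch-1"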
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for the latest commit on the pull request.
--comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
reuse-pipeline
Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.