
[TRTLLM-6685][feat] Add speculative metrics for trt llm bench #6476


Merged

Conversation


@kris1025 kris1025 commented Jul 30, 2025

Summary by CodeRabbit

  • New Features

    • Added comprehensive tracking and reporting of speculative decoding draft token metrics, including the number of draft tokens, the number of accepted draft tokens, acceptance rates, and acceptance lengths.
    • Introduced detailed percentile statistics for these new metrics in benchmark reports.
    • Added the ability to retrieve the maximum draft length configuration for speculative decoding.
  • Improvements

    • Enhanced benchmark statistics output with more granular and informative speculative decoding metrics.

Description

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages that don't match the specified backends. Only [pytorch, cpp, tensorrt, triton] are supported. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
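
As an illustration, a typical PR comment combining a few of the options documented above might look like the following (the stage and GPU names here are only examples drawn from the option descriptions):

```
/bot run --stage-list "A10-PyTorch-1" --gpu-type "A30" --disable-fail-fast
```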

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

@kris1025 kris1025 requested a review from a team as a code owner July 30, 2025 06:55
@kris1025 kris1025 requested a review from FrankD412 July 30, 2025 06:55

coderabbitai bot commented Jul 30, 2025

📝 Walkthrough

Walkthrough

The changes introduce enhanced tracking and reporting of speculative decoding metrics in the benchmarking utilities. New statistics related to draft tokens, accepted drafts, acceptance rates, and acceptance lengths are collected, aggregated, and reported. Data models are updated to support these metrics, and utility methods are added for configuration retrieval and statistics formatting.

Changes

Cohort / File(s) Change Summary
Speculative Decoding Statistics Collection & Reporting
tensorrt_llm/bench/dataclasses/reporting.py
Enhanced StatsKeeper.generate_statistics_summary to accept max_draft_tokens and track draft token statistics (number of draft tokens, accepted drafts, acceptance rates, acceptance lengths). Updated ReportUtility to retrieve max draft length from config and incorporate new statistics in reporting and dictionary outputs. Added get_max_draft_len method.
Benchmark Statistics Data Model Update
tensorrt_llm/bench/dataclasses/statistics.py
Extended BenchmarkStatistics with acceptance_length replacing acceptance_rate. Removed single acceptance_percentiles and added four new percentile fields for draft tokens, accepted drafts, draft acceptance rates, and acceptance lengths, providing more granular speculative decoding statistics.
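
The extended data model can be pictured roughly as follows. This is a plain-dataclass sketch, not the project's actual code (which uses Pydantic models); the field names are taken from the change summary above, and the PercentileStats shape is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PercentileStats:
    """Simplified stand-in for the benchmark's percentile container."""
    p50: float = 0.0
    p95: float = 0.0
    p99: float = 0.0


@dataclass
class BenchmarkStatistics:
    """Sketch of the extended model: acceptance_length replaces the old
    acceptance_rate scalar, and the single acceptance_percentiles field
    gives way to four optional per-metric percentile fields, which stay
    None when speculative decoding is disabled."""
    acceptance_length: float = 0.0
    num_draft_tokens_percentiles: Optional[PercentileStats] = None
    num_accepted_drafts_percentiles: Optional[PercentileStats] = None
    draft_acceptance_rate_percentiles: Optional[PercentileStats] = None
    acceptance_length_percentiles: Optional[PercentileStats] = None
```

Keeping the percentile fields optional lets reporting code distinguish "speculation off" (None) from "speculation on with zero acceptance".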

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • FrankD412
  • yilin-void
  • zbpatel
  • kaiyux



@kris1025 kris1025 marked this pull request as draft July 30, 2025 06:55
@kris1025 kris1025 changed the title from "add speculative metrics" to "add speculative metrics for trt llm bench" Jul 30, 2025

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (1)
tensorrt_llm/bench/dataclasses/reporting.py (1)

443-454: Refactor long lines to comply with 120-character limit.

Several lines exceed the 120-character limit specified in coding guidelines.

Consider extracting the model_dump calls to improve readability:

-"draft_tokens_percentiles":
-self.statistics.num_draft_tokens_percentiles.model_dump(
-    exclude_none=True, by_alias=True, mode='json') if self.statistics.num_draft_tokens_percentiles else None,
+def _format_percentiles(stats):
+    return stats.model_dump(exclude_none=True, by_alias=True, mode='json') if stats else None
+
+"draft_tokens_percentiles": _format_percentiles(self.statistics.num_draft_tokens_percentiles),
+"num_accepted_drafts_percentiles": _format_percentiles(self.statistics.num_accepted_drafts_percentiles),
+"draft_acceptance_rate_percentiles": _format_percentiles(self.statistics.draft_acceptance_rate_percentiles),
+"acceptance_length_percentiles": _format_percentiles(self.statistics.acceptance_length_percentiles),
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1f39a11 and d2d24a4.

📒 Files selected for processing (2)
  • tensorrt_llm/bench/dataclasses/reporting.py (9 hunks)
  • tensorrt_llm/bench/dataclasses/statistics.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a file, prefer docstrings over comments in Python.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without reflection.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

  • tensorrt_llm/bench/dataclasses/statistics.py
  • tensorrt_llm/bench/dataclasses/reporting.py
**/*.{cpp,h,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • tensorrt_llm/bench/dataclasses/statistics.py
  • tensorrt_llm/bench/dataclasses/reporting.py
🪛 Ruff (0.12.2)
tensorrt_llm/bench/dataclasses/reporting.py

119-119: Line too long (147 > 120)

(E501)


122-122: Undefined name num_accepted_draft_tokens

(F821)


123-123: Undefined name num_accepted_draft_tokens

(F821)


144-144: Undefined name draft_tokens

(F821)


145-145: Undefined name global_acceptance_rate

(F821)


444-444: Line too long (125 > 120)

(E501)


447-447: Line too long (128 > 120)

(E501)


450-450: Line too long (130 > 120)

(E501)


453-453: Line too long (125 > 120)

(E501)

🔇 Additional comments (4)
tensorrt_llm/bench/dataclasses/statistics.py (1)

124-124: LGTM! Well-structured data model extensions for speculative decoding metrics.

The new fields appropriately capture draft token statistics and acceptance metrics with proper initialization defaults.

Also applies to: 131-131, 141-145

tensorrt_llm/bench/dataclasses/reporting.py (3)

121-127: Design consideration: Uniform draft token assumption may not reflect reality.

The current implementation assumes all requests use the same num_draft_tokens value. In practice, draft token counts may vary per request based on speculative decoding dynamics.

Consider tracking draft tokens per request for more accurate metrics. Would you like me to propose an alternative implementation that supports per-request draft token tracking?
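
One way such per-request tracking could be sketched is below; the function and key names are hypothetical, chosen only to illustrate the idea of carrying a per-request draft count instead of a single uniform value.

```python
from typing import Dict, List


def collect_draft_stats(requests: List[Dict[str, int]]) -> Dict[str, List[float]]:
    """Accumulate speculative-decoding metrics per request.

    Each request dict is assumed to carry its own draft token count,
    rather than relying on one global max_draft_tokens value.
    """
    stats: Dict[str, List[float]] = {
        "num_draft_tokens": [],
        "num_accepted_drafts": [],
        "draft_acceptance_rate": [],
    }
    for req in requests:
        drafted = req["num_draft_tokens"]      # per-request, not uniform
        accepted = req["num_accepted_drafts"]
        stats["num_draft_tokens"].append(drafted)
        stats["num_accepted_drafts"].append(accepted)
        # Guard against division by zero for requests with no drafts.
        stats["draft_acceptance_rate"].append(
            accepted / drafted if drafted else 0.0)
    return stats
```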


597-633: Well-structured statistics reporting with proper null handling.

The implementation correctly checks for None values before accessing percentile statistics and provides clear, formatted output for all speculative decoding metrics.


647-671: Robust configuration retrieval with proper fallback handling.

The method correctly handles both C++ and Python backend configurations with appropriate error handling for missing files or fields.

@svc-trtllm-gh-bot svc-trtllm-gh-bot added the Community want to contribute PRs initiated from Community label Jul 30, 2025
@kris1025 kris1025 force-pushed the feature-add-speculative-metric-linquanh branch from d2d24a4 to 619560b Compare July 30, 2025 07:31
@coderabbitai coderabbitai bot requested review from kaiyux, yilin-void and zbpatel July 30, 2025 07:32

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🔭 Outside diff range comments (1)
tensorrt_llm/bench/dataclasses/reporting.py (1)

127-165: Fix duplicate keyword argument and formatting issues.

Critical syntax error and formatting issues need to be addressed:

  1. Line 158 has duplicate acceptance_length keyword argument causing syntax error
  2. Line 129 contains trailing whitespace

Apply these fixes:

 global_acceptance_length = sum(acceptance_length) / len(acceptance_length)
 queue_time_total = last_queue_time - start_time
-        
+

And remove the duplicate keyword argument on line 158:

             token_percentiles=PercentileStats.from_iterable(output_tokens),
             issue_rate_ns=queue_time_total / num_requests,
-            acceptance_length=global_acceptance_length,
             num_draft_tokens_percentiles=num_draft_tokens_percentiles,
♻️ Duplicate comments (2)
tensorrt_llm/bench/dataclasses/reporting.py (2)

75-75: Critical: Variable name collision between parameter and local variable.

The parameter num_draft_tokens conflicts with the local list variable of the same name on line 98. This will cause the parameter value to be shadowed and inaccessible.

Consider renaming the parameter to avoid the collision:

-def generate_statistics_summary(self, num_draft_tokens: int) -> None:
+def generate_statistics_summary(self, max_draft_tokens: int) -> None:

Then update the usage on lines 120-125 to use the renamed parameter.


119-126: Fix multiple critical issues in speculative metrics calculation.

Several issues need to be addressed:

  1. Variable shadowing prevents access to the num_draft_tokens parameter
  2. Line 122 references undefined variable num_accepted_draft_tokens instead of num_accepted_drafts
  3. Line length violation on line 119 (147 > 120 characters)

Apply these fixes:

 # For speculative decoding, we need to track the number of draft tokens per request and the number of accepted draft tokens per request
-if num_draft_tokens > 0:
-    num_draft_tokens.append(num_draft_tokens * (entry.decode_iteration + 1))
-    num_accepted_draft_tokens.append(entry.num_generated_tokens - entry.decode_iteration - 1)
-    draft_acceptance_rate.append(float(num_accepted_draft_tokens[-1]) / float(num_draft_tokens[-1]))
-    acceptance_length.append(entry.num_generated_tokens / (entry.decode_iteration +
-                                           1))
+if max_draft_tokens > 0:  # Use renamed parameter
+    num_draft_tokens.append(max_draft_tokens * (entry.decode_iteration + 1))
+    num_accepted_drafts.append(entry.num_generated_tokens - entry.decode_iteration - 1)
+    draft_acceptance_rate.append(float(num_accepted_drafts[-1]) / float(num_draft_tokens[-1]))
+    acceptance_length.append(entry.num_generated_tokens / (entry.decode_iteration + 1))
🧹 Nitpick comments (3)
tensorrt_llm/bench/dataclasses/reporting.py (3)

640-663: Good implementation with minor formatting issues.

The method correctly implements a fallback strategy to retrieve max draft length from multiple configuration sources with proper error handling.

Fix trailing whitespace issues:

                 except (FileNotFoundError, KeyError, json.JSONDecodeError):
                     pass
-        
+
         # Try to get from speculative_config
         if ("speculative_config" in self.kwargs
                 and self.kwargs["speculative_config"] is not None):
             return self.kwargs["speculative_config"].max_draft_len or 0
-        
+
         return 0

438-449: Fix line length violations while maintaining functionality.

The logic correctly adds speculative decoding percentile statistics, but several lines exceed the 120-character limit.

Apply formatting fixes to comply with line length limits:

                 "num_draft_tokens_percentiles":
-                self.statistics.num_draft_tokens_percentiles.model_dump(
-                    exclude_none=True, by_alias=True, mode='json') if self.statistics.num_draft_tokens_percentiles else None,
+                (self.statistics.num_draft_tokens_percentiles.model_dump(
+                    exclude_none=True, by_alias=True, mode='json')
+                 if self.statistics.num_draft_tokens_percentiles else None),
                 "num_accepted_drafts_percentiles":
-                self.statistics.num_accepted_drafts_percentiles.model_dump(
-                    exclude_none=True, by_alias=True, mode='json') if self.statistics.num_accepted_drafts_percentiles else None,
+                (self.statistics.num_accepted_drafts_percentiles.model_dump(
+                    exclude_none=True, by_alias=True, mode='json')
+                 if self.statistics.num_accepted_drafts_percentiles else None),
                 "draft_acceptance_rate_percentiles":
-                self.statistics.draft_acceptance_rate_percentiles.model_dump(
-                    exclude_none=True, by_alias=True, mode='json') if self.statistics.draft_acceptance_rate_percentiles else None,
+                (self.statistics.draft_acceptance_rate_percentiles.model_dump(
+                    exclude_none=True, by_alias=True, mode='json')
+                 if self.statistics.draft_acceptance_rate_percentiles else None),
                 "acceptance_length_percentiles":
-                self.statistics.acceptance_length_percentiles.model_dump(
-                    exclude_none=True, by_alias=True, mode='json') if self.statistics.acceptance_length_percentiles else None
+                (self.statistics.acceptance_length_percentiles.model_dump(
+                    exclude_none=True, by_alias=True, mode='json')
+                 if self.statistics.acceptance_length_percentiles else None)

589-627: Excellent implementation of speculative decoding statistics reporting.

The reporting format is consistent with existing patterns and provides comprehensive coverage of all speculative decoding metrics with clear abbreviations and formatting.

Fix trailing whitespace:

         if decoding is not None:
             decoding = stats_dict["decoding_stats"]
-            
+
             num_draft_tokens = decoding["num_draft_tokens_percentiles"]
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d2d24a4 and 619560b.

📒 Files selected for processing (2)
  • tensorrt_llm/bench/dataclasses/reporting.py (7 hunks)
  • tensorrt_llm/bench/dataclasses/statistics.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tensorrt_llm/bench/dataclasses/statistics.py
🪛 Ruff (0.12.2)
tensorrt_llm/bench/dataclasses/reporting.py

119-119: Line too long (147 > 120)

(E501)


129-129: Blank line contains whitespace

(W293)


158-158: SyntaxError: Duplicate keyword argument "acceptance_length"


440-440: Line too long (125 > 120)

(E501)


443-443: Line too long (128 > 120)

(E501)


446-446: Line too long (130 > 120)

(E501)


449-449: Line too long (125 > 120)

(E501)


589-589: Blank line contains whitespace

(W293)


657-657: Blank line contains whitespace

(W293)


662-662: Blank line contains whitespace

(W293)

🔇 Additional comments (2)
tensorrt_llm/bench/dataclasses/reporting.py (2)

98-101: LGTM: Proper variable declarations for speculative metrics.

The variable declarations follow proper naming conventions and are appropriately initialized as empty lists for collecting per-request metrics.


191-192: LGTM: Proper integration of max draft length parameter.

The constructor correctly calls the new get_max_draft_len() method to provide the required parameter to generate_statistics_summary().

@kris1025 kris1025 force-pushed the feature-add-speculative-metric-linquanh branch from 619560b to 4fb07c6 Compare July 30, 2025 08:20
@kris1025 kris1025 marked this pull request as ready for review July 30, 2025 08:21

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (2)
tensorrt_llm/bench/dataclasses/reporting.py (2)

119-126: Fix line length and verify calculations.

The speculative decoding logic appears mathematically sound, but there's a line length issue that needs addressing.

Apply this diff to fix the line length:

-                acceptance_length.append(entry.num_generated_tokens / (entry.decode_iteration +
-                                                       1))
+                acceptance_length.append(
+                    entry.num_generated_tokens / (entry.decode_iteration + 1))

The calculations look correct:

  • num_draft_tokens: Total tokens proposed across all decode iterations
  • num_accepted_draft_tokens: Generated tokens minus base tokens per iteration
  • draft_acceptance_rate: Ratio of accepted to proposed tokens
  • acceptance_length: Average tokens per decode iteration
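
The four per-request formulas listed above can be checked with a tiny worked example (the helper name and the concrete values are hypothetical): with max_draft_tokens = 3, decode_iteration = 4, and num_generated_tokens = 11, the request ran 5 decode iterations, proposed 15 draft tokens, and accepted 6 of them.

```python
def speculative_metrics(max_draft_tokens: int, decode_iteration: int,
                        num_generated_tokens: int):
    """Compute the four per-request speculative metrics described above."""
    iters = decode_iteration + 1                 # decode iterations run
    drafted = max_draft_tokens * iters           # total tokens proposed
    accepted = num_generated_tokens - iters      # minus one base token/iter
    rate = accepted / drafted                    # draft acceptance rate
    accept_len = num_generated_tokens / iters    # avg tokens per iteration
    return drafted, accepted, rate, accept_len
```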

437-448: Fix line length issues while maintaining logic.

The new percentile statistics are correctly added with proper null checking, but several lines exceed the 120-character limit.

Apply this diff to improve readability:

                "num_draft_tokens_percentiles":
-                self.statistics.num_draft_tokens_percentiles.model_dump(
-                    exclude_none=True, by_alias=True, mode='json') if self.statistics.num_draft_tokens_percentiles else None,
+                (self.statistics.num_draft_tokens_percentiles.model_dump(
+                    exclude_none=True, by_alias=True, mode='json') 
+                 if self.statistics.num_draft_tokens_percentiles else None),
                "num_accepted_draft_tokens_percentiles":
-                self.statistics.num_accepted_draft_tokens_percentiles.model_dump(
-                    exclude_none=True, by_alias=True, mode='json') if self.statistics.num_accepted_draft_tokens_percentiles else None,
+                (self.statistics.num_accepted_draft_tokens_percentiles.model_dump(
+                    exclude_none=True, by_alias=True, mode='json') 
+                 if self.statistics.num_accepted_draft_tokens_percentiles else None),
                "draft_acceptance_rate_percentiles":
-                self.statistics.draft_acceptance_rate_percentiles.model_dump(
-                    exclude_none=True, by_alias=True, mode='json') if self.statistics.draft_acceptance_rate_percentiles else None,
+                (self.statistics.draft_acceptance_rate_percentiles.model_dump(
+                    exclude_none=True, by_alias=True, mode='json') 
+                 if self.statistics.draft_acceptance_rate_percentiles else None),
                "acceptance_length_percentiles":
-                self.statistics.acceptance_length_percentiles.model_dump(
-                    exclude_none=True, by_alias=True, mode='json') if self.statistics.acceptance_length_percentiles else None
+                (self.statistics.acceptance_length_percentiles.model_dump(
+                    exclude_none=True, by_alias=True, mode='json') 
+                 if self.statistics.acceptance_length_percentiles else None)
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 619560b and 4fb07c6.

📒 Files selected for processing (2)
  • tensorrt_llm/bench/dataclasses/reporting.py (7 hunks)
  • tensorrt_llm/bench/dataclasses/statistics.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tensorrt_llm/bench/dataclasses/statistics.py
🪛 Ruff (0.12.2)
tensorrt_llm/bench/dataclasses/reporting.py

118-118: Line too long (147 > 120)

(E501)


438-438: Line too long (125 > 120)

(E501)


441-441: Line too long (134 > 120)

(E501)


444-444: Line too long (130 > 120)

(E501)


447-447: Line too long (125 > 120)

(E501)

🔇 Additional comments (6)
tensorrt_llm/bench/dataclasses/reporting.py (6)

75-75: LGTM! Variable name collision resolved.

The parameter name has been changed from num_draft_tokens to max_draft_tokens, which resolves the previous variable shadowing issue with the local list variable.


98-101: LGTM! Well-structured metric tracking variables.

The new metric tracking lists are properly initialized and follow consistent naming conventions for speculative decoding statistics.


127-127: LGTM! Correct global acceptance length calculation.

The calculation properly computes the average tokens generated per decoding iteration across all requests, which is the appropriate global metric for speculative decoding performance.


130-137: LGTM! Proper percentile statistics calculation.

The percentile calculations correctly handle both populated and empty metric lists, ensuring robust statistics generation regardless of whether speculative decoding is enabled.
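As a rough, hypothetical sketch of that pattern (not the actual `PercentileStats` implementation in this PR), computing percentile summaries only when data exists might look like:

```python
import math

def percentile_stats(values):
    """Return summary statistics for a metric list, or None when it is empty."""
    if not values:
        return None  # No speculative decoding data was collected for this run.
    ordered = sorted(values)

    def pct(p):
        # Nearest-rank percentile over the sorted values.
        rank = max(math.ceil(p / 100 * len(ordered)), 1)
        return ordered[rank - 1]

    return {
        "minimum": ordered[0],
        "maximum": ordered[-1],
        "average": sum(ordered) / len(ordered),
        "p50": pct(50), "p90": pct(90), "p95": pct(95), "p99": pct(99),
    }
```

Callers can then pass the result (or `None`) straight into the statistics model, mirroring the conditional construction described here.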


190-191: LGTM! Proper statistics initialization.

The constructor correctly stores raw statistics and generates processed statistics with the appropriate max draft length parameter, enabling proper speculative decoding metrics calculation.


639-662: LGTM! Well-implemented configuration retrieval.

The method properly handles both C++ and PyTorch backend configurations with appropriate fallbacks and error handling. The logic flow correctly prioritizes engine config over speculative config and provides a sensible default.

@kris1025
Collaborator Author

/bot run

@kris1025 kris1025 force-pushed the feature-add-speculative-metric-linquanh branch from 4fb07c6 to 51d5ed5 Compare July 30, 2025 08:28
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
tensorrt_llm/bench/dataclasses/reporting.py (1)

588-625: Add missing null checks for robust reporting.

The reporting logic for speculative decoding statistics is comprehensive but lacks null checks that were identified in previous reviews. This could cause runtime errors when percentile statistics are unavailable.

The statistics formatting assumes all percentile objects exist, but they can be None when no speculative decoding is used. Add null checks before accessing each percentile statistics dictionary to prevent runtime errors.

🧹 Nitpick comments (2)
tensorrt_llm/bench/dataclasses/reporting.py (2)

119-126: Verify the draft token calculation logic.

The speculative decoding metrics calculation appears sound, but please verify:

  1. Line 121: max_draft_tokens * (entry.decode_iteration + 1) - This assumes each iteration generates exactly max_draft_tokens draft tokens
  2. Line 122: entry.num_generated_tokens - entry.decode_iteration - 1 - This calculates accepted tokens by subtracting iterations from total generated tokens
  3. Lines 124-125: The acceptance length calculation divides total generated tokens by iterations

Please confirm these calculations align with the actual speculative decoding implementation.
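For concreteness, the three formulas listed above can be sketched per request as follows. The field names mirror the review's references (`decode_iteration`, `num_generated_tokens`), and the formulas are exactly the ones awaiting confirmation, not verified semantics:

```python
def speculative_metrics(max_draft_tokens, decode_iteration, num_generated_tokens):
    """Per-request speculative decoding metrics, using the formulas under review."""
    iterations = decode_iteration + 1
    # 1) Assumes every iteration proposes exactly max_draft_tokens draft tokens.
    num_draft_tokens = max_draft_tokens * iterations
    # 2) Accepted drafts = generated tokens minus one target token per iteration.
    num_accepted = num_generated_tokens - iterations
    # 3) Average tokens produced per decoding iteration.
    acceptance_length = num_generated_tokens / iterations
    acceptance_rate = num_accepted / num_draft_tokens if num_draft_tokens else 0.0
    return num_draft_tokens, num_accepted, acceptance_rate, acceptance_length
```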

Apply this diff to fix the line length issue flagged by static analysis:

-            # For speculative decoding, we need to track the number of draft tokens per request and the number of accepted draft tokens per request
+            # For speculative decoding, we need to track draft tokens and 
+            # accepted draft tokens per request

437-448: Fix line length violations in statistics dictionary.

The logic for including speculative decoding percentile statistics is correct with proper null checks, but several lines exceed the 120-character limit.

Apply this diff to fix the line length issues:

-                "num_draft_tokens_percentiles":
-                self.statistics.num_draft_tokens_percentiles.model_dump(
-                    exclude_none=True, by_alias=True, mode='json') if self.statistics.num_draft_tokens_percentiles else None,
-                "num_accepted_draft_tokens_percentiles":
-                self.statistics.num_accepted_draft_tokens_percentiles.model_dump(
-                    exclude_none=True, by_alias=True, mode='json') if self.statistics.num_accepted_draft_tokens_percentiles else None,
-                "draft_acceptance_rate_percentiles":
-                self.statistics.draft_acceptance_rate_percentiles.model_dump(
-                    exclude_none=True, by_alias=True, mode='json') if self.statistics.draft_acceptance_rate_percentiles else None,
-                "acceptance_length_percentiles":
-                self.statistics.acceptance_length_percentiles.model_dump(
-                    exclude_none=True, by_alias=True, mode='json') if self.statistics.acceptance_length_percentiles else None
+                "num_draft_tokens_percentiles":
+                (self.statistics.num_draft_tokens_percentiles.model_dump(
+                    exclude_none=True, by_alias=True, mode='json') 
+                 if self.statistics.num_draft_tokens_percentiles else None),
+                "num_accepted_draft_tokens_percentiles":
+                (self.statistics.num_accepted_draft_tokens_percentiles.model_dump(
+                    exclude_none=True, by_alias=True, mode='json') 
+                 if self.statistics.num_accepted_draft_tokens_percentiles else None),
+                "draft_acceptance_rate_percentiles":
+                (self.statistics.draft_acceptance_rate_percentiles.model_dump(
+                    exclude_none=True, by_alias=True, mode='json') 
+                 if self.statistics.draft_acceptance_rate_percentiles else None),
+                "acceptance_length_percentiles":
+                (self.statistics.acceptance_length_percentiles.model_dump(
+                    exclude_none=True, by_alias=True, mode='json') 
+                 if self.statistics.acceptance_length_percentiles else None)
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4fb07c6 and 51d5ed5.

📒 Files selected for processing (2)
  • tensorrt_llm/bench/dataclasses/reporting.py (7 hunks)
  • tensorrt_llm/bench/dataclasses/statistics.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tensorrt_llm/bench/dataclasses/statistics.py
🔇 Additional comments (6)
tensorrt_llm/bench/dataclasses/reporting.py (6)

75-75: LGTM: Method signature updated for speculative decoding support.

The parameter name max_draft_tokens clearly indicates its purpose and avoids the variable collision issues mentioned in previous reviews.


98-101: LGTM: Proper initialization of speculative decoding metrics.

The four lists for tracking draft token metrics are well-named and clearly serve their purpose in speculative decoding analysis.


130-137: LGTM: Robust percentile statistics creation.

The conditional creation of percentile statistics objects with proper null checks ensures the code handles cases where no speculative decoding metrics are available.


157-161: LGTM: Complete integration of speculative decoding metrics.

All four new percentile statistics objects are properly passed to the BenchmarkStatistics constructor, ensuring comprehensive tracking of speculative decoding performance.


190-191: LGTM: Clean integration of speculative decoding support.

The constructor properly calls generate_statistics_summary with the maximum draft length retrieved from configuration, maintaining backward compatibility while adding new functionality.


639-662: LGTM: Well-designed configuration retrieval method.

The get_max_draft_len method provides excellent abstraction for retrieving maximum draft length from different backend configurations:

  1. Handles both C++ backend (engine config file) and PyTorch backend (speculative_config)
  2. Includes proper exception handling for file operations
  3. Provides sensible default value of 0
  4. Clean separation of concerns

The implementation is robust and follows good error handling practices.

@kris1025 kris1025 force-pushed the feature-add-speculative-metric-linquanh branch 2 times, most recently from 347d461 to b8ba61d Compare July 30, 2025 08:49
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (2)
tensorrt_llm/bench/dataclasses/reporting.py (2)

588-627: Add null checks for robustness in percentile statistics formatting.

The reporting logic is comprehensive and well-formatted, but should include null checks to prevent runtime errors when percentile statistics are not available.

Add null checks before accessing each percentile dictionary:

            if self.get_max_draft_len() > 0:
-                num_draft_tokens = decoding["num_draft_tokens_percentiles"]
-                num_draft_tokens_stats = "\n".join(
-                    f"[DT] {key.upper():<7}: {num_draft_tokens[key]:.2f}" for key in
-                    ["minimum", "maximum", "average", "p50", "p90", "p95", "p99"])
+                num_draft_tokens_stats = ""
+                if decoding["num_draft_tokens_percentiles"]:
+                    num_draft_tokens = decoding["num_draft_tokens_percentiles"]
+                    num_draft_tokens_stats = "\n".join(
+                        f"[DT] {key.upper():<7}: {num_draft_tokens[key]:.2f}" for key in
+                        ["minimum", "maximum", "average", "p50", "p90", "p95", "p99"])

Apply similar null checks for the other three statistics types to ensure robust reporting when some metrics are unavailable.


157-161: Missing total_draft_tokens field in BenchmarkStatistics construction.

The new percentile fields are correctly added, but based on past review comments, the total_draft_tokens field is still missing from the constructor.

Add the missing field:

            issue_rate_ns=queue_time_total / num_requests,
+            total_draft_tokens=sum(num_draft_tokens) if num_draft_tokens else 0,
            acceptance_length=global_acceptance_length,
🧹 Nitpick comments (2)
tensorrt_llm/bench/dataclasses/reporting.py (2)

119-126: LGTM! Speculative decoding metrics calculation is mathematically sound.

The logic correctly calculates:

  • Total draft tokens per request
  • Accepted draft tokens per request
  • Draft acceptance rate as a ratio
  • Average acceptance length per iteration

Note: Lines 124-125 exceed the 120-character limit. Consider breaking them into multiple lines:

-                acceptance_length.append(entry.num_generated_tokens / (entry.decode_iteration +
-                                                       1))
+                acceptance_length.append(
+                    entry.num_generated_tokens / (entry.decode_iteration + 1)
+                )

437-448: LGTM! Statistics dictionary properly includes new percentile metrics.

The conditional model_dump calls correctly handle None values and serialize the statistics appropriately.

Consider breaking long lines to improve readability:

-                "num_draft_tokens_percentiles":
-                self.statistics.num_draft_tokens_percentiles.model_dump(
-                    exclude_none=True, by_alias=True, mode='json') if self.statistics.num_draft_tokens_percentiles else None,
+                "num_draft_tokens_percentiles": (
+                    self.statistics.num_draft_tokens_percentiles.model_dump(
+                        exclude_none=True, by_alias=True, mode='json'
+                    ) if self.statistics.num_draft_tokens_percentiles else None
+                ),

Apply similar formatting to the other three percentile entries.

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 347d461 and b8ba61d.

📒 Files selected for processing (2)
  • tensorrt_llm/bench/dataclasses/reporting.py (7 hunks)
  • tensorrt_llm/bench/dataclasses/statistics.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tensorrt_llm/bench/dataclasses/statistics.py
🔇 Additional comments (6)
tensorrt_llm/bench/dataclasses/reporting.py (6)

75-75: LGTM! Method signature properly updated for speculative decoding support.

The parameter name max_draft_tokens clearly indicates its purpose and the type annotation is correct.


98-101: LGTM! Well-named variables for speculative decoding metrics.

The variable names clearly indicate their purpose and follow proper Python naming conventions.


127-127: LGTM! Global acceptance length calculation is correct.

This properly calculates the overall average acceptance length across all requests as total output tokens divided by total decoding iterations.


130-137: LGTM! Percentile statistics creation is robust.

The conditional creation pattern ensures statistics are only generated when data exists, preventing errors when speculative decoding is not enabled.


190-191: LGTM! Constructor properly integrates speculative decoding support.

The call to generate_statistics_summary now correctly passes the maximum draft length, enabling proper statistics collection.


639-662: LGTM! Well-implemented method for retrieving maximum draft length.

The method correctly handles both C++ and PyTorch backends, gracefully handles file reading errors, and returns a sensible default value of 0 when speculative decoding is not configured.

The implementation properly:

  • Checks C++ backend engine config first
  • Falls back to PyTorch speculative config
  • Handles FileNotFoundError, KeyError, and JSONDecodeError exceptions
  • Returns 0 as a safe default

@kris1025 kris1025 marked this pull request as draft July 30, 2025 12:15
@kris1025 kris1025 force-pushed the feature-add-speculative-metric-linquanh branch from 7ae33b0 to 74c6b46 Compare July 31, 2025 09:04
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (2)
tensorrt_llm/bench/dataclasses/reporting.py (2)

588-627: Add null checks for robustness in statistics formatting.

While the condition checks get_max_draft_len() > 0, the code directly accesses percentile dictionaries without verifying they're not None. This could cause runtime errors if percentile statistics are unavailable.

Add null checks before accessing each percentile dictionary:

            if self.get_max_draft_len() > 0:
-                num_draft_tokens = decoding["num_draft_tokens_percentiles"]
-                num_draft_tokens_stats = "\n".join(
-                    f"[DT] {key.upper():<7}: {num_draft_tokens[key]:.2f}" for key in
-                    ["minimum", "maximum", "average", "p50", "p90", "p95", "p99"])
+                num_draft_tokens_stats = ""
+                if decoding["num_draft_tokens_percentiles"]:
+                    num_draft_tokens = decoding["num_draft_tokens_percentiles"]
+                    num_draft_tokens_stats = "\n".join(
+                        f"[DT] {key.upper():<7}: {num_draft_tokens[key]:.2f}" for key in
+                        ["minimum", "maximum", "average", "p50", "p90", "p95", "p99"])

Apply similar null checks for the other three percentile statistics.


157-161: Missing total_draft_tokens field in BenchmarkStatistics construction.

Based on previous review feedback, the total_draft_tokens field appears to be missing from the BenchmarkStatistics constructor. The percentile fields are correctly added, but consider adding the total as well.

Add the missing field:

            issue_rate_ns=queue_time_total / num_requests,
+            total_draft_tokens=sum(num_draft_tokens) if num_draft_tokens else 0,
            acceptance_length=global_acceptance_length,
🧹 Nitpick comments (1)
tensorrt_llm/bench/dataclasses/reporting.py (1)

118-118: Format long lines for better readability.

Several lines exceed the 120-character limit. Consider breaking them for better readability.

Example for line 118:

-                acceptance_length.append(entry.num_total_output_tokens / (entry.decode_iteration +
-                                                       1))
+                acceptance_length.append(
+                    entry.num_total_output_tokens / (entry.decode_iteration + 1)
+                )

Apply similar formatting to the other long lines in the model_dump() calls.

Also applies to: 438-438, 441-441, 444-444, 447-447

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between b8ba61d and 7ae33b0.

📒 Files selected for processing (1)
  • tensorrt_llm/bench/dataclasses/reporting.py (7 hunks)
🔇 Additional comments (5)
tensorrt_llm/bench/dataclasses/reporting.py (5)

75-75: LGTM! Parameter naming issue resolved.

The method signature correctly uses max_draft_tokens parameter name, which resolves the variable shadowing issue identified in previous reviews.


98-137: LGTM! Comprehensive speculative decoding metrics implementation.

The speculative decoding metrics collection is well-implemented:

  • Proper conditional logic to only collect metrics when speculative decoding is enabled
  • Correct calculation of draft tokens, accepted draft tokens, acceptance rates, and acceptance lengths
  • Appropriate null checks when creating percentile statistics

The metrics provide valuable insights into speculative decoding performance.


190-191: LGTM! Clean constructor implementation.

The constructor properly stores raw statistics and generates the summary with the appropriate max_draft_len parameter. Good separation of concerns.


437-448: LGTM! Robust statistics dictionary implementation.

The decoding statistics are properly added with appropriate null checks and consistent formatting. The defensive programming approach prevents runtime errors when percentile statistics are unavailable.


639-662: LGTM! Well-implemented configuration retrieval method.

The get_max_draft_len method properly handles multiple configuration sources with appropriate fallbacks:

  • Tries C++ backend engine config first
  • Falls back to speculative_config from kwargs
  • Includes proper exception handling for file operations
  • Returns sensible default of 0

The implementation is robust and follows good practices.

@kris1025 kris1025 force-pushed the feature-add-speculative-metric-linquanh branch from 74c6b46 to 43f708a Compare July 31, 2025 09:05
@kris1025 kris1025 marked this pull request as ready for review July 31, 2025 09:05
@kris1025
Collaborator Author

/mbot run

@kris1025
Collaborator Author

/bot run

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (2)
tensorrt_llm/bench/dataclasses/reporting.py (2)

157-161: Missing total_draft_tokens field in BenchmarkStatistics construction.

Based on previous review feedback, the total_draft_tokens field is still missing from the BenchmarkStatistics constructor.

Add the missing field:

            issue_rate_ns=queue_time_total / num_requests,
+            total_draft_tokens=sum(num_draft_tokens) if num_draft_tokens else 0,
            acceptance_length=global_acceptance_length,

588-627: Comprehensive speculative decoding statistics logging with potential null safety issue.

The detailed logging provides excellent visibility into speculative decoding performance with clear labeling and full percentile breakdowns. However, this duplicates a previous review concern about null checking.

Add null checks before accessing percentile dictionaries to prevent runtime errors when statistics are unavailable:

            if self.get_max_draft_len() > 0:
-                num_draft_tokens = decoding["num_draft_tokens_percentiles"]
-                num_draft_tokens_stats = "\n".join(
-                    f"[DT] {key.upper():<7}: {num_draft_tokens[key]:.2f}" for key in
-                    ["minimum", "maximum", "average", "p50", "p90", "p95", "p99"])
+                num_draft_tokens_stats = ""
+                if decoding["num_draft_tokens_percentiles"]:
+                    num_draft_tokens = decoding["num_draft_tokens_percentiles"]
+                    num_draft_tokens_stats = "\n".join(
+                        f"[DT] {key.upper():<7}: {num_draft_tokens[key]:.2f}" for key in
+                        ["minimum", "maximum", "average", "p50", "p90", "p95", "p99"])
🧹 Nitpick comments (1)
tensorrt_llm/bench/dataclasses/reporting.py (1)

437-448: LGTM! Comprehensive percentile statistics with proper null safety.

The updated dictionary correctly includes all four new percentile statistics with appropriate null checks. However, consider breaking the long lines for better readability.

Consider reformatting the long lines:

-                "num_draft_tokens_percentiles":
-                self.statistics.num_draft_tokens_percentiles.model_dump(
-                    exclude_none=True, by_alias=True, mode='json') if self.statistics.num_draft_tokens_percentiles else None,
+                "num_draft_tokens_percentiles": (
+                    self.statistics.num_draft_tokens_percentiles.model_dump(
+                        exclude_none=True, by_alias=True, mode='json')
+                    if self.statistics.num_draft_tokens_percentiles else None
+                ),
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 7ae33b0 and 74c6b46.

📒 Files selected for processing (2)
  • tensorrt_llm/bench/dataclasses/reporting.py (7 hunks)
  • tensorrt_llm/bench/dataclasses/statistics.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tensorrt_llm/bench/dataclasses/statistics.py
🔇 Additional comments (5)
tensorrt_llm/bench/dataclasses/reporting.py (5)

75-75: LGTM! Variable shadowing issue resolved.

The parameter rename from num_draft_tokens to max_draft_tokens correctly addresses the previous variable shadowing issue with the local list variable.


98-126: LGTM! Comprehensive speculative decoding metrics collection.

The logic correctly collects key speculative decoding metrics:

  • Total draft tokens generated per request
  • Accepted draft tokens per request
  • Draft acceptance rate calculations
  • Acceptance length per iteration

The conditional collection based on max_draft_tokens > 0 ensures metrics are only computed when speculative decoding is enabled.


130-137: LGTM! Proper percentile statistics generation with null safety.

The logic correctly generates percentile statistics for each metric type with appropriate null checks to handle empty lists when speculative decoding is not active.


190-191: LGTM! Proper integration of speculative decoding metrics.

The constructor correctly calls generate_statistics_summary with the maximum draft length to enable collection of speculative decoding metrics when appropriate.


639-662: LGTM! Robust configuration retrieval for multiple backends.

The method correctly handles both C++ and PyTorch backends with appropriate fallback logic:

  • Reads from engine config JSON for C++ backend
  • Falls back to speculative_config kwargs for PyTorch backend
  • Includes proper exception handling for file operations
  • Returns sensible default of 0 when configuration is unavailable

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (2)
tensorrt_llm/bench/dataclasses/reporting.py (2)

157-161: Add missing total_draft_tokens field.

Based on past review comments, the total_draft_tokens field is still missing from the BenchmarkStatistics construction.

Add the missing field before line 157:

            issue_rate_ns=queue_time_total / num_requests,
+            total_draft_tokens=sum(num_draft_tokens) if num_draft_tokens else 0,
            acceptance_length=global_acceptance_length,

588-627: Add null checks for robustness.

The formatting logic correctly displays speculative decoding statistics, but should include null checks to prevent potential runtime errors when percentile statistics are not available.

Add null checks before formatting each statistic type:

            if self.get_max_draft_len() > 0:
-                num_draft_tokens = decoding["num_draft_tokens_percentiles"]
-                num_draft_tokens_stats = "\n".join(
-                    f"[DT] {key.upper():<7}: {num_draft_tokens[key]:.2f}" for key in
-                    ["minimum", "maximum", "average", "p50", "p90", "p95", "p99"])
+                num_draft_tokens_stats = ""
+                if decoding["num_draft_tokens_percentiles"]:
+                    num_draft_tokens = decoding["num_draft_tokens_percentiles"]
+                    num_draft_tokens_stats = "\n".join(
+                        f"[DT] {key.upper():<7}: {num_draft_tokens[key]:.2f}" for key in
+                        ["minimum", "maximum", "average", "p50", "p90", "p95", "p99"])

Apply similar patterns for the other three statistic types to ensure robustness when some metrics are unavailable.
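One way to avoid repeating that null check four times is a small formatting helper; this is a hypothetical refactor of the suggestion above, not code from the PR:

```python
def format_percentiles(label, stats):
    """Render one percentile report block, or "" when the stats are missing."""
    if not stats:
        return ""
    keys = ["minimum", "maximum", "average", "p50", "p90", "p95", "p99"]
    return "\n".join(f"[{label}] {key.upper():<7}: {stats[key]:.2f}" for key in keys)
```

Each of the four statistic types would then call the helper once, e.g. `format_percentiles("DT", decoding["num_draft_tokens_percentiles"])`.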

🧹 Nitpick comments (2)
tensorrt_llm/bench/dataclasses/reporting.py (2)

119-126: Fix line length violation.

The logic for collecting speculative decoding statistics is correct, but Line 118 exceeds the 120-character limit.

Apply this diff to fix the line length:

-            # For speculative decoding, we need to track the number of draft tokens per request and the number of accepted draft tokens per request
+            # For speculative decoding, we need to track draft tokens and accepted draft tokens per request

437-448: Fix line length violations.

The logic for adding percentile statistics to the decoding stats dictionary is correct, but several lines exceed the 120-character limit.

Apply these diffs to fix the line length violations:

-                "num_draft_tokens_percentiles":
-                self.statistics.num_draft_tokens_percentiles.model_dump(
-                    exclude_none=True, by_alias=True, mode='json') if self.statistics.num_draft_tokens_percentiles else None,
+                "num_draft_tokens_percentiles": (
+                    self.statistics.num_draft_tokens_percentiles.model_dump(
+                        exclude_none=True, by_alias=True, mode='json') 
+                    if self.statistics.num_draft_tokens_percentiles else None
+                ),
-                "num_accepted_draft_tokens_percentiles":
-                self.statistics.num_accepted_draft_tokens_percentiles.model_dump(
-                    exclude_none=True, by_alias=True, mode='json') if self.statistics.num_accepted_draft_tokens_percentiles else None,
+                "num_accepted_draft_tokens_percentiles": (
+                    self.statistics.num_accepted_draft_tokens_percentiles.model_dump(
+                        exclude_none=True, by_alias=True, mode='json') 
+                    if self.statistics.num_accepted_draft_tokens_percentiles else None
+                ),
-                "draft_acceptance_rate_percentiles":
-                self.statistics.draft_acceptance_rate_percentiles.model_dump(
-                    exclude_none=True, by_alias=True, mode='json') if self.statistics.draft_acceptance_rate_percentiles else None,
+                "draft_acceptance_rate_percentiles": (
+                    self.statistics.draft_acceptance_rate_percentiles.model_dump(
+                        exclude_none=True, by_alias=True, mode='json') 
+                    if self.statistics.draft_acceptance_rate_percentiles else None
+                ),
-                "acceptance_length_percentiles":
-                self.statistics.acceptance_length_percentiles.model_dump(
-                    exclude_none=True, by_alias=True, mode='json') if self.statistics.acceptance_length_percentiles else None
+                "acceptance_length_percentiles": (
+                    self.statistics.acceptance_length_percentiles.model_dump(
+                        exclude_none=True, by_alias=True, mode='json') 
+                    if self.statistics.acceptance_length_percentiles else None
+                )
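The repeated `model_dump(...) if ... else None` expression could likewise be factored into a helper. A minimal sketch, assuming the percentile objects are pydantic models (the helper name is hypothetical):

```python
def dump_percentiles(stats):
    """Serialize an optional pydantic percentile model to a plain dict.

    Returns None when the statistic was never computed (e.g. when
    speculative decoding is disabled), matching the existing behavior.
    """
    if stats is None:
        return None
    return stats.model_dump(exclude_none=True, by_alias=True, mode='json')
```

Each dictionary entry then becomes `"num_draft_tokens_percentiles": dump_percentiles(self.statistics.num_draft_tokens_percentiles),` which stays comfortably under the 120-character limit.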
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 74c6b46 and 43f708a.

📒 Files selected for processing (2)
  • tensorrt_llm/bench/dataclasses/reporting.py (7 hunks)
  • tensorrt_llm/bench/dataclasses/statistics.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tensorrt_llm/bench/dataclasses/statistics.py
🧰 Additional context used
🪛 Ruff (0.12.2)
tensorrt_llm/bench/dataclasses/reporting.py

118-118: Line too long (147 > 120)

(E501)


438-438: Line too long (125 > 120)

(E501)


441-441: Line too long (134 > 120)

(E501)


444-444: Line too long (130 > 120)

(E501)


447-447: Line too long (125 > 120)

(E501)

🔇 Additional comments (5)
tensorrt_llm/bench/dataclasses/reporting.py (5)

75-75: LGTM!

The method signature change to accept max_draft_tokens parameter is correct and addresses the previous variable name collision issue.


98-101: LGTM!

The new variable declarations for tracking speculative decoding metrics are well-named and properly initialized.


130-137: LGTM!

The percentile calculations are correctly implemented with proper null checks to handle cases where speculative decoding is not used.


190-191: LGTM!

The constructor correctly integrates the new speculative decoding metrics by calling generate_statistics_summary with the maximum draft length.


639-662: LGTM!

The new get_max_draft_len method is well-implemented with proper error handling and covers both C++ and Python backend configurations. The fallback to 0 is appropriate when no speculative decoding configuration is found.

@kris1025
Collaborator Author

kris1025 commented Aug 4, 2025

/bot run

1 similar comment
@ziyixiong-nv
Collaborator

/bot run

@tensorrt-cicd
Collaborator

PR_Github #13915 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #13915 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #10477 completed with status: 'FAILURE'

@kris1025
Collaborator Author

kris1025 commented Aug 4, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #13952 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #13952 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #10508 completed with status: 'FAILURE'

@kris1025
Collaborator Author

kris1025 commented Aug 4, 2025

Overall, this looks fine to me -- just a double check, you've verified these changes work for both paths and produced expected results?

As discussed offline with @lfr-0531, speculative decoding is not well supported by the trtllm-bench TensorRT backend.

For this PR, do you think it is okay to report detailed speculative metrics only for the PyTorch backend? @mikeiovine @longlee0622

@longlee0622
Collaborator

Overall, this looks fine to me -- just a double check, you've verified these changes work for both paths and produced expected results?

As discussed offline with @lfr-0531, speculative decoding is not well supported by the trtllm-bench TensorRT backend.

For this PR, do you think it is okay to report detailed speculative metrics only for the PyTorch backend? @mikeiovine @longlee0622

I think it's OK to focus on the PyTorch backend for now.

@kris1025 kris1025 force-pushed the feature-add-speculative-metric-linquanh branch from 52a73c4 to 5ffabdf Compare August 4, 2025 09:17
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
tensorrt_llm/bench/dataclasses/reporting.py (1)

599-653: Add null checks for robust reporting.

The reporting logic assumes all percentile statistics exist but should handle cases where they might be None.

-            if self.get_max_draft_len() > 0:
-                num_draft_tokens = decoding["num_draft_tokens_percentiles"]
+            if (self.get_max_draft_len() > 0 and 
+                decoding["num_draft_tokens_percentiles"] is not None):
+                num_draft_tokens = decoding["num_draft_tokens_percentiles"]

Apply similar null checks for num_accepted_draft_tokens_percentiles, draft_acceptance_rate_percentiles, and acceptance_length_percentiles before accessing their keys to prevent potential AttributeError or TypeError exceptions.

🧹 Nitpick comments (1)
tensorrt_llm/bench/dataclasses/reporting.py (1)

118-118: Fix line length violation.

The line exceeds the 120-character limit specified by static analysis.

-            # For speculative decoding, we need to track the number of draft tokens per request and the number of accepted draft tokens per request
+            # For speculative decoding, we need to track the number of draft tokens
+            # per request and the number of accepted draft tokens per request
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 52a73c4 and 5ffabdf.

📒 Files selected for processing (2)
  • tensorrt_llm/bench/dataclasses/reporting.py (7 hunks)
  • tensorrt_llm/bench/dataclasses/statistics.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tensorrt_llm/bench/dataclasses/statistics.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile = ...).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL = ...).
Python constants should use upper snake_case (e.g., MY_CONSTANT = ...).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a file, prefer docstrings over comments in Python.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for classes and functions in Python, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

  • tensorrt_llm/bench/dataclasses/reporting.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • tensorrt_llm/bench/dataclasses/reporting.py
🪛 Ruff (0.12.2)
tensorrt_llm/bench/dataclasses/reporting.py

118-118: Line too long (147 > 120)

(E501)

🔇 Additional comments (9)
tensorrt_llm/bench/dataclasses/reporting.py (9)

75-75: LGTM: Method signature updated appropriately.

The parameter name change from num_draft_tokens to max_draft_tokens resolves the variable shadowing issue identified in previous reviews and clearly indicates the parameter's purpose.


98-101: LGTM: Variable declarations are well-structured.

The new variables for tracking speculative decoding metrics are clearly named and appropriately initialized as empty lists for per-request data collection.


131-132: LGTM: Global acceptance length calculation is correct.

The calculation sum(output_tokens) / total_decoding_iterations appropriately computes the average acceptance length across all requests.


135-142: LGTM: Defensive programming with proper null checks.

The percentile statistics generation correctly handles empty lists by returning None, preventing runtime errors when no speculative decoding data is available.


162-167: LGTM: BenchmarkStatistics construction updated correctly.

All new speculative decoding fields are properly added to the statistics object construction, maintaining consistency with the data model.


196-198: LGTM: Constructor integration is well-implemented.

The integration of get_max_draft_len() into the statistics generation call is clean and maintains the existing initialization pattern.


444-460: LGTM: Statistics dictionary properly handles null values.

The dictionary construction correctly uses conditional expressions to handle None values for percentile statistics, preventing serialization issues when speculative decoding data is not available.


665-672: LGTM: Helper method is well-implemented.

The get_max_draft_len() method correctly extracts the configuration value with appropriate fallbacks and null safety checks.
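The safe-extraction pattern described here can be sketched as follows; the attribute names are assumptions for illustration, not the PR's exact code:

```python
def get_max_draft_len(config) -> int:
    """Return the configured maximum draft length, falling back to 0
    when no speculative decoding configuration is present.
    """
    spec = getattr(config, "speculative_config", None)
    if spec is None:
        return 0
    # A missing or None max_draft_len also falls back to 0.
    return getattr(spec, "max_draft_len", 0) or 0
```

Callers can then gate all speculative-metric reporting on `get_max_draft_len(config) > 0` without any further null checks.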


119-130: Verify speculative decoding calculations

File: tensorrt_llm/bench/dataclasses/reporting.py Lines 119–130

Please confirm that the following formulas match the intended speculative decoding specification:

  • Draft tokens
    num_draft_tokens = max_draft_tokens × (entry.decode_iteration + 1)
    (assumes each iteration consumes the full draft token budget)

  • Accepted draft tokens
    num_accepted_draft_tokens = entry.num_total_output_tokens − (entry.decode_iteration + 1)
    (assumes one baseline token per iteration)

  • Acceptance length
    acceptance_length = entry.num_total_output_tokens / (entry.decode_iteration + 1)
    (calculates average output tokens per iteration)

Ensure these calculations align with the algorithm’s documentation or reference implementation.
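Under those stated assumptions (a full draft budget every iteration, exactly one baseline token per iteration, zero-based `decode_iteration`), the per-request metrics reduce to a few lines. This sketch illustrates the formulas only and is not the PR's implementation:

```python
def speculative_metrics(max_draft_tokens, decode_iteration,
                        num_total_output_tokens):
    """Compute per-request speculative decoding metrics.

    Assumes `decode_iteration` is zero-based, each iteration proposes
    the full `max_draft_tokens` budget, and each iteration emits exactly
    one non-draft (baseline) token.
    """
    iterations = decode_iteration + 1
    num_draft = max_draft_tokens * iterations
    num_accepted = num_total_output_tokens - iterations
    acceptance_length = num_total_output_tokens / iterations
    acceptance_rate = num_accepted / num_draft if num_draft else 0.0
    return num_draft, num_accepted, acceptance_length, acceptance_rate
```

For example, with a draft budget of 4, ten decode iterations, and 30 output tokens, 40 draft tokens were proposed and 20 accepted, giving an acceptance length of 3.0 and an acceptance rate of 0.5.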

@kris1025
Collaborator Author

kris1025 commented Aug 4, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #13976 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #13976 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #10528 completed with status: 'FAILURE'

Signed-off-by: linquanh <[email protected]>
Signed-off-by: linquanh <[email protected]>
Signed-off-by: linquanh <[email protected]>
@kris1025 kris1025 force-pushed the feature-add-speculative-metric-linquanh branch from 5ffabdf to 903bd76 Compare August 4, 2025 12:22
@kris1025
Collaborator Author

kris1025 commented Aug 4, 2025

@FrankD412

Overall, this looks fine to me -- just a double check, you've verified these changes work for both paths and produced expected results?

I have tested the PyTorch backend using DeepSeek, and it works as expected. TensorRT backend support has been removed for now.

@kris1025
Collaborator Author

kris1025 commented Aug 4, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #13996 [ run ] triggered by Bot

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
tensorrt_llm/bench/dataclasses/reporting.py (1)

599-653: Good implementation, but add null checks for robustness.

The reporting logic is well-structured and provides comprehensive speculative decoding metrics. However, consider adding null checks before accessing percentile data to prevent potential runtime errors.

Add null checks before formatting each statistic type:

-            num_draft_tokens = decoding["num_draft_tokens_percentiles"]
-            num_draft_tokens_stats = "\n".join(
-                f"[DT] {key.upper():<7}: {num_draft_tokens[key]:.2f}"
-                for key in [
-                    "minimum", "maximum", "average", "p50", "p90", "p95",
-                    "p99"
-                ])
+            num_draft_tokens_stats = ""
+            if decoding["num_draft_tokens_percentiles"]:
+                num_draft_tokens = decoding["num_draft_tokens_percentiles"]
+                num_draft_tokens_stats = "\n".join(
+                    f"[DT] {key.upper():<7}: {num_draft_tokens[key]:.2f}"
+                    for key in [
+                        "minimum", "maximum", "average", "p50", "p90", "p95",
+                        "p99"
+                    ])

Apply similar patterns for the other three statistic types.

🧹 Nitpick comments (1)
tensorrt_llm/bench/dataclasses/reporting.py (1)

119-129: Logic looks correct, but fix line length violation.

The speculative decoding metrics calculations are mathematically sound and properly gated behind the max_draft_tokens > 0 condition.

However, lines 121-122 exceed the 120-character limit. Consider breaking them:

-                num_draft_tokens.append(max_draft_tokens *
-                                        (entry.decode_iteration + 1))
+                num_draft_tokens.append(
+                    max_draft_tokens * (entry.decode_iteration + 1))
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5ffabdf and 903bd76.

📒 Files selected for processing (2)
  • tensorrt_llm/bench/dataclasses/reporting.py (7 hunks)
  • tensorrt_llm/bench/dataclasses/statistics.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tensorrt_llm/bench/dataclasses/statistics.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile = ...).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL = ...).
Python constants should use upper snake_case (e.g., MY_CONSTANT = ...).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a file, prefer docstrings over comments in Python.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for classes and functions in Python, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

  • tensorrt_llm/bench/dataclasses/reporting.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • tensorrt_llm/bench/dataclasses/reporting.py
🪛 Ruff (0.12.2)
tensorrt_llm/bench/dataclasses/reporting.py

118-118: Line too long (147 > 120)

(E501)

🔇 Additional comments (8)
tensorrt_llm/bench/dataclasses/reporting.py (8)

75-75: LGTM! Parameter naming issue resolved.

The method signature change correctly addresses the previous variable shadowing issue by using max_draft_tokens instead of num_draft_tokens.


98-101: LGTM! Well-named variables for speculative metrics.

The variable declarations follow Python naming conventions and are descriptive for their purpose.


131-132: LGTM! Correct global acceptance length calculation.

The calculation properly computes the average acceptance length across all requests by dividing total output tokens by total decoding iterations.


135-142: LGTM! Proper percentile statistics creation with null safety.

The code correctly handles empty lists and follows the established pattern for creating percentile statistics.


162-167: LGTM! Complete BenchmarkStatistics constructor update.

All new percentile statistics fields are properly added to the constructor, and the acceptance_length field is correctly updated.


196-198: LGTM! Proper integration of max draft length parameter.

The constructor correctly calls the updated generate_statistics_summary method with the max draft length configuration.


665-672: LGTM! Robust configuration extraction method.

The method safely extracts the max_draft_len from speculative_config with proper null checks and sensible defaults.


444-460: LGTM! Complete statistics dictionary integration.

The new percentile statistics are properly integrated into the decoding_stats section with appropriate null checking.

@tensorrt-cicd
Collaborator

PR_Github #13996 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #10545 completed with status: 'SUCCESS'

Collaborator

@mikeiovine mikeiovine left a comment

LGTM, should just need a stamp from @FrankD412

@FrankD412
Collaborator

LGTM as well -- merging.

@FrankD412 FrankD412 merged commit 6a3a921 into NVIDIA:main Aug 4, 2025
5 checks passed
lancelly pushed a commit to lancelly/TensorRT-LLM that referenced this pull request Aug 6, 2025
jain-ria pushed a commit to jain-ria/TensorRT-LLM that referenced this pull request Aug 7, 2025
Labels
Community want to contribute PRs initiated from Community
7 participants