
Conversation

yunruis
Contributor

@yunruis yunruis commented Aug 27, 2025

Summary by CodeRabbit

  • New Features

    • Optional batch-wait controls for scheduling: iteration-based timeout and token-ratio threshold (disabled by default).
    • Configurable max token limit (default 8192) to guide batching thresholds.
    • Scheduler can defer context processing until iteration or token thresholds are met to improve GPU utilization.
  • Bug Fixes / Validation

    • Input validation added: non-negative iterations and token ratio constrained to 0–1.
  • Tests

    • Added integration coverage for batch-waiting; duplicate test entries were also introduced.

Description

To enable the feature, add the following parameters to config.yaml. For example:

batch_wait_timeout_iters: 20
batch_wait_max_tokens_ratio: 0.75
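For orientation, the release condition these two knobs control can be expressed as a small standalone predicate. This is a hypothetical helper, not the PR's actual API (the real logic lives inside the executor's scheduling path):

```python
# Self-contained sketch of the batch-wait release condition; the function name
# and free-function form are illustrative only.
def should_release(wait_iters: int, scheduled_tokens: int,
                   timeout_iters: int = 20,
                   max_tokens_ratio: float = 0.75,
                   max_num_tokens: int = 8192) -> bool:
    """Release deferred context requests once either threshold trips."""
    return (wait_iters >= timeout_iters
            or scheduled_tokens >= max_tokens_ratio * max_num_tokens)
```

With the example config above, context requests are released after at most 20 waited iterations, or sooner once the scheduled tokens reach 75% of max_num_tokens.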

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

@yunruis yunruis requested review from a team as code owners August 27, 2025 08:21
@yunruis yunruis requested review from syuoni and HuiGao-NV August 27, 2025 08:21
Contributor

coderabbitai bot commented Aug 27, 2025

📝 Walkthrough

Adds three batch-waiting configuration fields to PyTorchConfig and TorchLlmArgs, wires defaults in ADEngine shim, implements internal batch-waiting state and a _waiting_requests helper in PyExecutor to defer or release context scheduling based on iteration and token-ratio thresholds, and adds parameterized integration tests and API-stability references.

Changes

Cohort / File(s) — Summary of changes

  • PyTorch backend config — tensorrt_llm/_torch/pyexecutor/config.py:
    Added fields to PyTorchConfig: max_num_tokens: int = 8192, batch_wait_timeout_iters: int = 0, batch_wait_max_tokens_ratio: float = 0.
  • Executor scheduling logic — tensorrt_llm/_torch/pyexecutor/py_executor.py:
    Added batch-waiting state (max_num_tokens, batch_wait_timeout_iters, batch_wait_max_tokens_ratio, enable_batch_waiting, batch_wait_iters_count), implemented internal helper _waiting_requests(context_requests, generation_requests) to defer or release context scheduling based on iteration and token-ratio thresholds, integrated into _schedule when attention-DP is disabled and generation requests exist (guard added to avoid waiting when no generation requests).
  • LLM args and validation — tensorrt_llm/llmapi/llm_args.py:
    Added TorchLlmArgs fields batch_wait_timeout_iters and batch_wait_max_tokens_ratio with validators (batch_wait_timeout_iters >= 0, 0 <= batch_wait_max_tokens_ratio <= 1) and propagated them into PyTorchConfig via get_pytorch_backend_config.
  • Auto-deploy shim — tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py:
    Initialized defaults on pytorch_backend_config: batch_wait_timeout_iters = 0, batch_wait_max_tokens_ratio = 0.0, max_num_tokens = 8192.
  • Integration tests — tests/integration/defs/accuracy/test_llm_api_pytorch.py:
    Added parameterized test_nvfp4_batch_waiting in DeepSeekV3Lite tests exercising batch_wait_timeout_iters and batch_wait_max_tokens_ratio under NVFP4 MOE configs; test appears duplicated in file.
  • Test roster — tests/integration/test_lists/qa/llm_function_full.txt:
    Inserted new test roster entry for DeepSeekV3Lite including batch_wait_timeout_iters=10 and batch_wait_max_tokens_ratio=0.75 (duplicated entry).
  • API stability reference — tests/unittest/api_stability/references/llm.yaml:
    Recorded new __init__ parameters: batch_wait_timeout_iters (int, default 0) and batch_wait_max_tokens_ratio (float, default 0).
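The validation rules recorded above amount to two simple checks. A standalone sketch follows (a hypothetical function; the PR implements these checks as TorchLlmArgs field validators):

```python
# Standalone sketch of the validators described above; the function name and
# free-function form are illustrative, not the PR's actual wiring.
def validate_batch_wait_args(batch_wait_timeout_iters: int,
                             batch_wait_max_tokens_ratio: float) -> None:
    if batch_wait_timeout_iters < 0:
        raise ValueError(
            "batch_wait_timeout_iters must be greater than or equal to 0")
    if not 0.0 <= batch_wait_max_tokens_ratio <= 1.0:
        raise ValueError(
            "batch_wait_max_tokens_ratio must be in [0, 1]")
```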

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Client
  participant TorchLlmArgs
  participant PyTorchConfig
  participant PyExecutor
  participant Scheduler

  Client->>TorchLlmArgs: construct with batch_wait_* params
  TorchLlmArgs->>TorchLlmArgs: validate params
  TorchLlmArgs->>PyTorchConfig: get_pytorch_backend_config(..., batch_wait_*, max_num_tokens)
  PyTorchConfig-->>PyExecutor: config on init
  PyExecutor->>PyExecutor: set enable_batch_waiting, batch_wait_iters_count=0

  loop scheduling step
    PyExecutor->>Scheduler: collect generation_requests + context_requests
    alt attention_dp disabled AND enable_batch_waiting AND generation_requests exist
      PyExecutor->>PyExecutor: _waiting_requests(context_requests, generation_requests)
      alt stop conditions met (iters >= timeout_iters OR total_tokens >= ratio * max_num_tokens)
        Note right of PyExecutor: release deferred context requests\nreset batch_wait_iters_count
        PyExecutor-->>Scheduler: schedule context + generation
      else still waiting
        Note right of PyExecutor: increment batch_wait_iters_count\ndefer context scheduling
        PyExecutor-->>Scheduler: schedule generation only
      end
    else
      PyExecutor-->>Scheduler: schedule context + generation (no waiting)
    end
  end
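The wait/release loop in the diagram can be condensed into a small stateful sketch. The class name and method are illustrative, and token accounting is simplified to per-iteration costs; this is not the PR's exact implementation:

```python
class BatchWaiter:
    """Stateful sketch of the wait/release loop shown in the diagram."""

    def __init__(self, timeout_iters: int, max_tokens_ratio: float,
                 max_num_tokens: int):
        self.timeout_iters = timeout_iters
        self.max_tokens_ratio = max_tokens_ratio
        self.max_num_tokens = max_num_tokens
        self.wait_iters = 0

    def release_context(self, ctx_tokens: int, num_gen_requests: int) -> bool:
        """True: schedule context + generation; False: generation only."""
        total = ctx_tokens + num_gen_requests  # ~1 token per gen request
        if (self.wait_iters >= self.timeout_iters
                or total >= self.max_tokens_ratio * self.max_num_tokens):
            self.wait_iters = 0  # release and reset the counter
            return True
        self.wait_iters += 1  # keep deferring context scheduling
        return False
```

Context requests are thus deferred for at most timeout_iters iterations, or until the pending work fills the token budget, whichever comes first.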

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • syuoni
  • netanel-haber
  • litaotju
  • pcastonguay
  • lucaslie

@litaotju litaotju requested a review from Kefeng-Duan August 27, 2025 08:29
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

183-195: Guard against None/invalid max_num_tokens when TP waiting is enabled.

If max_num_tokens is None or <= 0, the wait threshold math will raise or misbehave.

Apply this diff to fail fast:

         self.max_draft_len = max_draft_len
-        self.max_num_tokens = model_engine.pytorch_backend_config.max_num_tokens
+        self.max_num_tokens = model_engine.pytorch_backend_config.max_num_tokens
         self.print_log = model_engine.pytorch_backend_config.print_iter_log
@@
-        self.attention_tp_enable_waiting = model_engine.pytorch_backend_config.attention_tp_enable_waiting
+        self.attention_tp_enable_waiting = model_engine.pytorch_backend_config.attention_tp_enable_waiting
         self.attention_tp_timeout_iters = model_engine.pytorch_backend_config.attention_tp_timeout_iters
         self.attention_tp_max_token_ratio = model_engine.pytorch_backend_config.attention_tp_max_token_ratio
+        if self.attention_tp_enable_waiting and (self.max_num_tokens is None or self.max_num_tokens <= 0):
+            raise ValueError(
+                "attention_tp_enable_waiting requires a positive max_num_tokens. "
+                "Set TorchLlmArgs.max_num_tokens or ensure build_config.max_num_tokens is propagated.")
🧹 Nitpick comments (7)
tensorrt_llm/_torch/pyexecutor/config.py (2)

1-1: Missing NVIDIA copyright header (2025).

Please prepend the standard NVIDIA copyright header to comply with repo guidelines.


53-57: Default max_num_tokens looks off; ensure it's intentional (4086 vs 4096).

If this is a typo, fix it; otherwise document why 4086 is chosen. Prefer deriving from BuildConfig to avoid drift.

Example fix:

-    max_num_tokens: int = 4086
+    # Derive at runtime via TorchLlmArgs/build_config; keep a sane default.
+    max_num_tokens: int = 4096
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

1-1: Missing NVIDIA copyright header (2025).

Please add the standard header.

tensorrt_llm/llmapi/llm_args.py (3)

1-1: Missing NVIDIA copyright header (2025).

Please add the standard header.


216-229: LGTM: AttentionTPConfig added with sane defaults and from_dict.

Consider clarifying in the docstring that TP waiting applies only when attention_dp is disabled.


2206-2210: Line too long; tighten field description (E501).

Shorten to avoid lint failures.

Apply:

-    attention_tp_config: Optional[AttentionTPConfig] = Field(
-        default=None,
-        description="Optimized load for the TP Attention.",
-        status="beta")
+    attention_tp_config: Optional[AttentionTPConfig] = Field(
+        default=None,
+        description="Token-parallel attention waiting controls.",
+        status="beta")
tensorrt_llm/llmapi/__init__.py (1)

1-1: Missing NVIDIA copyright header (2025).

Please add the standard header.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between f167b1f and f26d6e9.

📒 Files selected for processing (4)
  • tensorrt_llm/_torch/pyexecutor/config.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py (5 hunks)
  • tensorrt_llm/llmapi/__init__.py (2 hunks)
  • tensorrt_llm/llmapi/llm_args.py (5 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures

Files:

  • tensorrt_llm/llmapi/llm_args.py
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • tensorrt_llm/_torch/pyexecutor/config.py
  • tensorrt_llm/llmapi/__init__.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • tensorrt_llm/llmapi/llm_args.py
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • tensorrt_llm/_torch/pyexecutor/config.py
  • tensorrt_llm/llmapi/__init__.py
🧠 Learnings (1)
📚 Learning: 2025-08-26T09:37:10.431Z
Learnt from: jiaganc
PR: NVIDIA/TensorRT-LLM#7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.431Z
Learning: In TensorRT-LLM, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which can contain default `cuda_graph_config` values, so `llm_args` may already have this config before the extra options processing.

Applied to files:

  • tensorrt_llm/llmapi/llm_args.py
🧬 Code graph analysis (3)
tensorrt_llm/llmapi/llm_args.py (2)
tensorrt_llm/builder.py (2)
  • default (50-58)
  • from_dict (572-677)
tensorrt_llm/_torch/pyexecutor/config.py (1)
  • PyTorchConfig (17-120)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
tensorrt_llm/_torch/pyexecutor/llm_request.py (1)
  • LlmRequest (282-424)
tensorrt_llm/llmapi/__init__.py (1)
tensorrt_llm/llmapi/llm_args.py (1)
  • AttentionTPConfig (211-222)
🪛 Ruff (0.12.2)
tensorrt_llm/llmapi/llm_args.py

2210-2210: Line too long (165 > 120)

(E501)

tensorrt_llm/_torch/pyexecutor/py_executor.py

1277-1277: Line too long (172 > 120)

(E501)


1284-1284: Line too long (282 > 120)

(E501)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (5)
tensorrt_llm/_torch/pyexecutor/py_executor.py (2)

239-240: Good: initialized TP waiting iteration counter.


1298-1302: Gating TP waiting only when attention DP is disabled is correct.

tensorrt_llm/llmapi/llm_args.py (1)

2649-2650: LGTM: wiring attention_tp_config through update_llm_args_with_extra_dict.

tensorrt_llm/llmapi/__init__.py (2)

7-8: LGTM: exported AttentionTPConfig from llmapi.


61-62: LGTM: added AttentionTPConfig to all.

Comment on lines 1265 to 1295
def _waiting_tp_requests(self, context_requests: list[LlmRequest],
                         generation_requests: list[LlmRequest]):
    if not self.attention_tp_enable_waiting:
        return context_requests

    waited_context_requests = []
    stop_waiting = False
    num_scheduled_ctx_tokens = sum(
        len(ctx_req.get_tokens(0)) for ctx_req in context_requests)
    num_scheduled_gen_tokens = sum(
        len(gen_req.get_tokens(0)) for gen_req in generation_requests)
    num_scheduled_tokens = num_scheduled_ctx_tokens + num_scheduled_gen_tokens
    stop_waiting = self.tp_ctx_waiting_iters_count >= self.attention_tp_timeout_iters or num_scheduled_tokens >= self.attention_tp_max_token_ratio * self.max_num_tokens
    if stop_waiting:
        waited_context_requests = context_requests
        self.tp_ctx_waiting_iters_count = 0
    else:
        self.tp_ctx_waiting_iters_count += 1
    # logger.info(
    #     f"waited_context_requests: {waited_context_requests}, tp_ctx_waiting_iters_count: {self.tp_ctx_waiting_iters_count}, num_scheduled_tokens: {num_scheduled_tokens}, max_num_tokens: {self.max_num_tokens}, attention_tp_max_token_ratio: {self.attention_tp_max_token_ratio}"
    # )
    return waited_context_requests


🛠️ Refactor suggestion

⚠️ Potential issue

Token accounting is incorrect; use per-iteration costs (ctx chunk size, 1 token per gen). Also fix long lines (E501).

Using len(get_tokens(0)) overcounts historical tokens and can stall or prematurely stop waiting. Count:

  • context: sum of request.context_chunk_size
  • generation: number of generation requests (optionally times beam width)

Apply:

-    def _waiting_tp_requests(self, context_requests: list[LlmRequest],
-                             generation_requests: list[LlmRequest]):
+    def _waiting_tp_requests(self, context_requests: list[LlmRequest],
+                             generation_requests: list[LlmRequest]):
         if not self.attention_tp_enable_waiting:
             return context_requests
 
         waited_context_requests = []
-        stop_waiting = False
-        num_scheduled_ctx_tokens = sum(
-            len(ctx_req.get_tokens(0)) for ctx_req in context_requests)
-        num_scheduled_gen_tokens = sum(
-            len(gen_req.get_tokens(0)) for gen_req in generation_requests)
-        num_scheduled_tokens = num_scheduled_ctx_tokens + num_scheduled_gen_tokens
-        stop_waiting = self.tp_ctx_waiting_iters_count >= self.attention_tp_timeout_iters or num_scheduled_tokens >= self.attention_tp_max_token_ratio * self.max_num_tokens
+        # Per-iteration token costs:
+        num_scheduled_ctx_tokens = sum(
+            req.context_chunk_size for req in context_requests
+        )
+        # Approximate 1 token per generation request this iteration
+        num_scheduled_gen_tokens = len(generation_requests)
+        num_scheduled_tokens = (
+            num_scheduled_ctx_tokens + num_scheduled_gen_tokens
+        )
+        stop_waiting = (
+            self.tp_ctx_waiting_iters_count >= self.attention_tp_timeout_iters
+            or num_scheduled_tokens >= int(
+                self.attention_tp_max_token_ratio * self.max_num_tokens
+            )
+        )
         if stop_waiting:
             waited_context_requests = context_requests
             self.tp_ctx_waiting_iters_count = 0
         else:
             self.tp_ctx_waiting_iters_count += 1
-        # logger.info(
-        #     f"waited_context_requests: {waited_context_requests}, tp_ctx_waiting_iters_count: {self.tp_ctx_waiting_iters_count}, num_scheduled_tokens: {num_scheduled_tokens}, max_num_tokens: {self.max_num_tokens}, attention_tp_max_token_ratio: {self.attention_tp_max_token_ratio}"
-        # )
         return waited_context_requests
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

Before:

def _waiting_tp_requests(self, context_requests: list[LlmRequest],
                         generation_requests: list[LlmRequest]):
    if not self.attention_tp_enable_waiting:
        return context_requests

    waited_context_requests = []
    stop_waiting = False
    num_scheduled_ctx_tokens = sum(
        len(ctx_req.get_tokens(0)) for ctx_req in context_requests)
    num_scheduled_gen_tokens = sum(
        len(gen_req.get_tokens(0)) for gen_req in generation_requests)
    num_scheduled_tokens = num_scheduled_ctx_tokens + num_scheduled_gen_tokens
    stop_waiting = self.tp_ctx_waiting_iters_count >= self.attention_tp_timeout_iters or num_scheduled_tokens >= self.attention_tp_max_token_ratio * self.max_num_tokens
    if stop_waiting:
        waited_context_requests = context_requests
        self.tp_ctx_waiting_iters_count = 0
    else:
        self.tp_ctx_waiting_iters_count += 1
    # logger.info(
    #     f"waited_context_requests: {waited_context_requests}, tp_ctx_waiting_iters_count: {self.tp_ctx_waiting_iters_count}, num_scheduled_tokens: {num_scheduled_tokens}, max_num_tokens: {self.max_num_tokens}, attention_tp_max_token_ratio: {self.attention_tp_max_token_ratio}"
    # )
    return waited_context_requests

After:

def _waiting_tp_requests(self, context_requests: list[LlmRequest],
                         generation_requests: list[LlmRequest]):
    if not self.attention_tp_enable_waiting:
        return context_requests

    waited_context_requests = []
    # Per-iteration token costs:
    num_scheduled_ctx_tokens = sum(
        req.context_chunk_size for req in context_requests
    )
    # Approximate 1 token per generation request this iteration
    num_scheduled_gen_tokens = len(generation_requests)
    num_scheduled_tokens = (
        num_scheduled_ctx_tokens + num_scheduled_gen_tokens
    )
    stop_waiting = (
        self.tp_ctx_waiting_iters_count >= self.attention_tp_timeout_iters
        or num_scheduled_tokens >= int(
            self.attention_tp_max_token_ratio * self.max_num_tokens
        )
    )
    if stop_waiting:
        waited_context_requests = context_requests
        self.tp_ctx_waiting_iters_count = 0
    else:
        self.tp_ctx_waiting_iters_count += 1
    return waited_context_requests
🧰 Tools
🪛 Ruff (0.12.2)

1277-1277: Line too long (172 > 120)

(E501)


1284-1284: Line too long (282 > 120)

(E501)

🤖 Prompt for AI Agents
In tensorrt_llm/_torch/pyexecutor/py_executor.py around lines 1265-1287, replace
the incorrect token counting that uses len(req.get_tokens(0)) with per-iteration
cost accounting: sum the context chunk cost by summing each context request's
context_chunk_size (e.g., ctx_req.context_chunk_size) and count generation cost
as the number of active generation requests (optionally multiplied by beam width
if applicable) instead of token history; compute num_scheduled_tokens =
num_scheduled_ctx_tokens + num_scheduled_gen_tokens, keep the existing
stop_waiting logic but ensure tp_ctx_waiting_iters_count is updated the same
way, and shorten any long lines to satisfy E501 (break calculations and
conditions across multiple lines).
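The per-iteration accounting the reviewer asks for can be illustrated with stand-in request objects. The real LlmRequest exposes context_chunk_size per the review; the dataclass and function here are illustrative stand-ins, not the repository's types:

```python
from dataclasses import dataclass


@dataclass
class FakeContextRequest:
    # Stand-in for LlmRequest: tokens this request processes this iteration.
    context_chunk_size: int


def scheduled_tokens_this_iter(context_requests, generation_requests) -> int:
    # Per-iteration cost: chunk size per context request, and roughly one new
    # token per generation request (beam width ignored in this sketch).
    num_ctx = sum(r.context_chunk_size for r in context_requests)
    num_gen = len(generation_requests)
    return num_ctx + num_gen
```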

@yunruis yunruis changed the title [None][Opt] opt: add tp attn waiting [None][Opt] add batch waiting on scheduling Aug 27, 2025
@yunruis yunruis changed the title [None][Opt] add batch waiting on scheduling [None][Opt] Add batch waiting on scheduling Aug 27, 2025
@yunruis yunruis changed the title [None][Opt] Add batch waiting on scheduling [None][Opt] Add batch waiting when scheduling Aug 27, 2025
@yunruis yunruis force-pushed the syr/attn_tp_config branch from f26d6e9 to 4e65519 Compare August 27, 2025 13:13
@yunruis yunruis requested a review from a team as a code owner August 27, 2025 13:13
@yunruis yunruis requested a review from netanel-haber August 27, 2025 13:13
@yunruis yunruis changed the title [None][Opt] Add batch waiting when scheduling [None][opt] Add batch waiting when scheduling Aug 27, 2025
@yunruis yunruis force-pushed the syr/attn_tp_config branch from 4e65519 to f58f5a6 Compare August 27, 2025 13:20
@yunruis
Contributor Author

yunruis commented Aug 27, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #16692 [ run ] triggered by Bot

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (2)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

1266-1287: Fix token accounting and NVTX label in _waiting_requests (and wrap long line).

  • NVTX label is incorrect ("_schedule" on _waiting_requests), hurting trace readability.
  • Token counting uses len(req.get_tokens(0)), which overcounts historical tokens and can prematurely release or stall waiting. Count per-iteration cost instead:
    • Context: sum context_chunk_size for scheduled context requests this iter.
    • Generation: approximate 1 token/request (optionally times beam width).
  • Also address E501 long line.

Apply:

-@nvtx_range("_schedule")
+@nvtx_range("_waiting_requests")
 def _waiting_requests(self, context_requests: list[LlmRequest],
                       generation_requests: list[LlmRequest]):
     if not self.enable_batch_waiting:
         return context_requests
 
     waited_context_requests = []
-    stop_waiting = False
-    num_scheduled_ctx_tokens = sum(
-        len(ctx_req.get_tokens(0)) for ctx_req in context_requests)
-    num_scheduled_gen_tokens = sum(
-        len(gen_req.get_tokens(0)) for gen_req in generation_requests)
-    num_scheduled_tokens = num_scheduled_ctx_tokens + num_scheduled_gen_tokens
-
-    stop_waiting = self.batch_wait_iters_count >= self.batch_wait_timeout_iters or num_scheduled_tokens >= self.batch_wait_max_tokens_ratio * self.max_num_tokens
+    # Per-iteration token costs:
+    num_scheduled_ctx_tokens = sum(
+        getattr(req, "context_chunk_size", 1) for req in context_requests
+    )
+    # Approximate 1 token per generation request this iteration.
+    # If desired, incorporate beam width:
+    # num_scheduled_gen_tokens = sum(getattr(req.sampling_config, "beam_width", 1) for req in generation_requests)
+    num_scheduled_gen_tokens = len(generation_requests)
+    num_scheduled_tokens = (
+        num_scheduled_ctx_tokens + num_scheduled_gen_tokens
+    )
+
+    stop_waiting = (
+        self.batch_wait_iters_count >= self.batch_wait_timeout_iters
+        or num_scheduled_tokens >= int(
+            self.batch_wait_max_tokens_ratio * self.max_num_tokens
+        )
+    )
     if stop_waiting:
         waited_context_requests = context_requests
         self.batch_wait_iters_count = 0
     else:
         self.batch_wait_iters_count += 1
     return waited_context_requests
tensorrt_llm/llmapi/llm_args.py (1)

2612-2615: Propagate a non-None max_num_tokens to the PyTorch backend

Ensure backend receives a concrete value (fallback to build_config) so ratio-based waiting can compute thresholds.

             attention_dp_batching_wait_iters=self.attention_dp_config.
             batching_wait_iters if self.attention_dp_config is not None else
             AttentionDpConfig.model_fields['batching_wait_iters'].default,
+            max_num_tokens=self.max_num_tokens
+            if self.max_num_tokens is not None
+            else self.build_config.max_num_tokens,
             batch_wait_timeout_ms=self.batch_wait_timeout_ms,
             batch_wait_timeout_iters=self.batch_wait_timeout_iters,
             batch_wait_max_tokens_ratio=self.batch_wait_max_tokens_ratio,
🧹 Nitpick comments (5)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

1298-1305: Guard condition reads correctly; consider explicit short-circuit for readability.

No functional change needed. Optional clarity:

-        if not self.enable_attention_dp and self.enable_batch_waiting and len(
-                scheduler_output.context_requests) > 0 and len(
-                    scheduler_output.generation_requests) > 0:
+        if (not self.enable_attention_dp and self.enable_batch_waiting
+                and scheduler_output.context_requests
+                and scheduler_output.generation_requests):
tensorrt_llm/llmapi/llm_args.py (4)

2243-2254: Use int for iteration-based timeout; tighten description

Iterations should be integral and the description can be clearer.

Apply:

-    batch_wait_timeout_iters: float = Field(
+    batch_wait_timeout_iters: int = Field(
         default=0,
         description=
-        "Maximum number of iterations the scheduler will wait to accumulate new coming requests for improved GPU utilization efficiency. If greater than 0, the scheduler will delay batch processing to gather more requests up to the specified iteration limit. If 0, disables timeout-iters-based batching delays.",
+        "Maximum number of scheduler iterations to wait to accumulate incoming requests for better GPU utilization. If > 0, the scheduler may delay batch processing up to this budget; if 0, disables iteration-based batching delays.",
         status="prototype")
 
     batch_wait_max_tokens_ratio: float = Field(
         default=0,
         description=
-        "Token accumulation threshold ratio for batch scheduling optimization. If greater than 0, the scheduler will accumulate requests locally until the total token count reaches batch_wait_max_tokens_ratio * max_num_tokens. This mechanism enhances GPU utilization efficiency by ensuring adequate batch sizes.If 0 disables token-based batching delays.",
+        "Token accumulation threshold ratio for batch scheduling. If > 0, accumulate requests until total tokens reach batch_wait_max_tokens_ratio * max_num_tokens. This improves GPU utilization by ensuring adequate batch sizes. If 0, disables token-based batching delays.",
         status="prototype")

2519-2521: Fix validation message (0 is allowed)

Error text should say “greater than or equal to 0.”

-        if self.batch_wait_timeout_ms < 0:
-            raise ValueError("batch_wait_timeout_ms must be greater than 0")
+        if self.batch_wait_timeout_ms < 0:
+            raise ValueError("batch_wait_timeout_ms must be greater than or equal to 0")

2523-2528: Fix validation message (0 is allowed) for iters

Keep logic, update message; aligns with docs.

-        if self.batch_wait_timeout_iters < 0:
-            raise ValueError("batch_wait_timeout_iters must be greater than 0")
+        if self.batch_wait_timeout_iters < 0:
+            raise ValueError("batch_wait_timeout_iters must be greater than or equal to 0")

2243-2254: Clarify independent batching thresholds and their execution order

The two parameters govern separate stages of the PyTorch executor’s batching logic and do not override one another. Specifically:

  • batch_wait_timeout_ms applies in ExecutorRequestQueue._get_from_request_queue (tensorrt_llm/_torch/pyexecutor/executor_request_queue.py). When > 0, the request queue will wait up to that many milliseconds (or until max_batch_size is reached) before flushing items to the executor.
  • batch_wait_timeout_iters (and batch_wait_max_tokens_ratio) applies later in PyTorchExecutor._waiting_requests (tensorrt_llm/_torch/pyexecutor/py_executor.py). When enable_batch_waiting is true, the scheduler will hold generation requests for up to the configured number of iterations (or until token-count threshold) before releasing them.

Because these delays occur at different points in the pipeline, both will be applied sequentially if set > 0 (queue-level ms wait first, then scheduler-level iters wait). Please update the field descriptions in llm_args.py to:

  • Call out that they control distinct phases
  • Describe the order in which they are evaluated
  • Note that setting both introduces two successive batching delays; neither setting overrides the other

This will ensure users understand how the executor actually behaves.
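
Since the two delays act at different pipeline stages, their worst-case latencies add. A minimal sketch of that additive behavior (the `iter_duration_ms` parameter is a hypothetical stand-in for the average scheduler iteration time, not an actual config field):

```python
def total_batching_delay(batch_wait_timeout_ms: float,
                         batch_wait_timeout_iters: int,
                         iter_duration_ms: float) -> float:
    """Worst-case added latency when both knobs are > 0 (sketch).

    The queue-level ms wait and the scheduler-level iteration wait are
    applied at different pipeline stages, so their worst-case delays add
    rather than one overriding the other.
    """
    queue_delay = max(0.0, batch_wait_timeout_ms)
    scheduler_delay = max(0, batch_wait_timeout_iters) * iter_duration_ms
    return queue_delay + scheduler_delay
```

For example, with `batch_wait_timeout_ms=50`, `batch_wait_timeout_iters=20`, and an assumed 10 ms per iteration, a request could wait up to 250 ms before execution.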

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between f26d6e9 and f58f5a6.

📒 Files selected for processing (5)
  • tensorrt_llm/_torch/pyexecutor/config.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py (5 hunks)
  • tensorrt_llm/llmapi/llm_args.py (3 hunks)
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py (1 hunks)
  • tests/unittest/api_stability/references/llm.yaml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tensorrt_llm/_torch/pyexecutor/config.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures

Files:

  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • tensorrt_llm/llmapi/llm_args.py
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • tensorrt_llm/llmapi/llm_args.py
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
🧠 Learnings (1)
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • tensorrt_llm/llmapi/llm_args.py
🧬 Code graph analysis (2)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
tensorrt_llm/_utils.py (1)
  • nvtx_range (843-862)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (5)
tests/integration/defs/conftest.py (2)
  • parametrize_with_ids (1786-1811)
  • llm_models_root (77-83)
tensorrt_llm/llmapi/llm_args.py (7)
  • KvCacheConfig (946-1077)
  • TorchCompileConfig (2108-2155)
  • CudaGraphConfig (108-165)
  • MoeConfig (168-196)
  • MTPDecodingConfig (521-556)
  • quant_config (2328-2331)
  • quant_config (2334-2335)
tensorrt_llm/llmapi/llm.py (1)
  • LLM (1011-1027)
tensorrt_llm/quantization/mode.py (1)
  • QuantAlgo (23-46)
tests/integration/defs/accuracy/accuracy_core.py (3)
  • GSM8K (293-308)
  • evaluate (147-206)
  • evaluate (707-717)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/py_executor.py

1280-1280: Line too long (165 > 120)

(E501)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (3)
tests/unittest/api_stability/references/llm.yaml (1)

134-141: LGTM: new batch-waiting knobs exposed in API reference.

Defaults and statuses look consistent with prototype rollout.

tests/integration/defs/accuracy/test_llm_api_pytorch.py (1)

1512-1519: Use the parametrized fp8kv instead of hardcoding.

The conditional below already respects fp8kv. Removing the hardcoded assignment (per the previous diff) is sufficient; no further change needed.

tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

183-195: Ensure max_num_tokens is always positive when ratio-based waiting is enabled.

If max_num_tokens is 0/None, the ratio threshold becomes 0 and context requests are released immediately. If that’s unintended, clamp or disable ratio-based waiting when max_num_tokens <= 0.

Would you like a follow-up patch to guard enable_batch_waiting with self.max_num_tokens > 0?

Comment on lines 2529 to 2535
@model_validator(mode='after')
def validate_batch_wait_max_tokens_ratio(self) -> 'TorchLlmArgs':
if self.batch_wait_max_tokens_ratio < 0 or self.batch_wait_max_tokens_ratio > 1:
raise ValueError(
"batch_wait_max_tokens_ratio must be greater than or equal to 0 and less than or equal to 1"
)
return self

Copy link
Contributor


🛠️ Refactor suggestion

Require max_num_tokens when token-ratio waiting is enabled

Without an effective max_num_tokens (> 0), ratio-based waiting can’t compute a threshold and will fail or behave incorrectly at runtime.

     @model_validator(mode='after')
     def validate_batch_wait_max_tokens_ratio(self) -> 'TorchLlmArgs':
         if self.batch_wait_max_tokens_ratio < 0 or self.batch_wait_max_tokens_ratio > 1:
             raise ValueError(
                 "batch_wait_max_tokens_ratio must be greater than or equal to 0 and less than or equal to 1"
             )
+        if self.batch_wait_max_tokens_ratio > 0:
+            effective_max_tokens = self.max_num_tokens or (
+                self.build_config.max_num_tokens if self.build_config else None
+            )
+            if effective_max_tokens is None or effective_max_tokens <= 0:
+                raise ValueError(
+                    "batch_wait_max_tokens_ratio requires max_num_tokens to be set (> 0). "
+                    "Set TorchLlmArgs.max_num_tokens or build_config.max_num_tokens."
+                )
         return self
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
@model_validator(mode='after')
def validate_batch_wait_max_tokens_ratio(self) -> 'TorchLlmArgs':
if self.batch_wait_max_tokens_ratio < 0 or self.batch_wait_max_tokens_ratio > 1:
raise ValueError(
"batch_wait_max_tokens_ratio must be greater than or equal to 0 and less than or equal to 1"
)
return self
@model_validator(mode='after')
def validate_batch_wait_max_tokens_ratio(self) -> 'TorchLlmArgs':
if self.batch_wait_max_tokens_ratio < 0 or self.batch_wait_max_tokens_ratio > 1:
raise ValueError(
"batch_wait_max_tokens_ratio must be greater than or equal to 0 and less than or equal to 1"
)
if self.batch_wait_max_tokens_ratio > 0:
effective_max_tokens = self.max_num_tokens or (
self.build_config.max_num_tokens if self.build_config else None
)
if effective_max_tokens is None or effective_max_tokens <= 0:
raise ValueError(
"batch_wait_max_tokens_ratio requires max_num_tokens to be set (> 0). "
"Set TorchLlmArgs.max_num_tokens or build_config.max_num_tokens."
)
return self
🤖 Prompt for AI Agents
In tensorrt_llm/llmapi/llm_args.py around lines 2529 to 2536, the validator only
checks that batch_wait_max_tokens_ratio is in [0,1]; add a check that if
batch_wait_max_tokens_ratio > 0 then self.max_num_tokens must be > 0 and raise a
ValueError with a clear message if not; keep the existing ratio bounds
validation and return self as before.
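The combined validation logic suggested above can be mirrored as a plain-Python helper (a hypothetical standalone function for illustration, not the actual pydantic model on TorchLlmArgs):

```python
def validate_batch_wait_args(batch_wait_timeout_iters: int,
                             batch_wait_max_tokens_ratio: float,
                             max_num_tokens) -> None:
    """Check batch-wait knobs the way the suggested validators would (sketch)."""
    if batch_wait_timeout_iters < 0:
        raise ValueError(
            "batch_wait_timeout_iters must be greater than or equal to 0")
    if not 0 <= batch_wait_max_tokens_ratio <= 1:
        raise ValueError(
            "batch_wait_max_tokens_ratio must be in [0, 1]")
    # Ratio-based waiting needs a concrete token budget to compute a threshold.
    if batch_wait_max_tokens_ratio > 0 and not (max_num_tokens
                                                and max_num_tokens > 0):
        raise ValueError(
            "batch_wait_max_tokens_ratio requires max_num_tokens > 0")
```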

@yunruis yunruis force-pushed the syr/attn_tp_config branch from f58f5a6 to f6778f2 Compare August 27, 2025 13:52
@yunruis
Copy link
Contributor Author

yunruis commented Aug 27, 2025

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16692 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #12530 completed with status: 'FAILURE'

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16696 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16696 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #12531 completed with status: 'FAILURE'

@yunruis yunruis force-pushed the syr/attn_tp_config branch from f6778f2 to a4ea57f Compare August 28, 2025 02:09
@yunruis
Copy link
Contributor Author

yunruis commented Aug 28, 2025

/bot run

@yunruis yunruis force-pushed the syr/attn_tp_config branch from a4ea57f to 42002a7 Compare August 28, 2025 02:13
@tensorrt-cicd
Copy link
Collaborator

PR_Github #16769 [ run ] triggered by Bot

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

183-195: Harden config defaults to avoid TypeError and make enable_batch_waiting robust

If any of these fields are None in pytorch_backend_config, comparisons like > 0 raise a TypeError. Cast with safe fallbacks.

-        self.max_num_tokens = model_engine.pytorch_backend_config.max_num_tokens
+        self.max_num_tokens = int(getattr(
+            model_engine.pytorch_backend_config, "max_num_tokens", 0) or 0)
@@
-        self.batch_wait_timeout_iters = model_engine.pytorch_backend_config.batch_wait_timeout_iters
-        self.batch_wait_max_tokens_ratio = model_engine.pytorch_backend_config.batch_wait_max_tokens_ratio
-        self.enable_batch_waiting = self.batch_wait_timeout_iters > 0 or self.batch_wait_max_tokens_ratio > 0
+        self.batch_wait_timeout_iters = int(getattr(
+            model_engine.pytorch_backend_config, "batch_wait_timeout_iters", 0) or 0)
+        self.batch_wait_max_tokens_ratio = float(getattr(
+            model_engine.pytorch_backend_config, "batch_wait_max_tokens_ratio", 0.0) or 0.0)
+        self.enable_batch_waiting = (
+            self.batch_wait_timeout_iters > 0 or self.batch_wait_max_tokens_ratio > 0.0
+        )
♻️ Duplicate comments (3)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

1266-1286: Fix token accounting; count per-iteration costs (ctx chunk size, ~1 token/gen); also guard when max_num_tokens <= 0

Using len(req.get_tokens(0)) counts the full token history rather than this iteration's cost, which breaks the threshold; it can cause premature release or no-op waiting. Also break long lines (E501).

-    def _waiting_requests(self, context_requests: list[LlmRequest],
-                          generation_requests: list[LlmRequest]):
+    def _waiting_requests(self, context_requests: list[LlmRequest],
+                          generation_requests: list[LlmRequest]):
         if not self.enable_batch_waiting:
             return context_requests
 
         waited_context_requests = []
-        stop_waiting = False
-        num_scheduled_ctx_tokens = sum(
-            len(ctx_req.get_tokens(0)) for ctx_req in context_requests)
-        num_scheduled_gen_tokens = sum(
-            len(gen_req.get_tokens(0)) for gen_req in generation_requests)
-        num_scheduled_tokens = num_scheduled_ctx_tokens + num_scheduled_gen_tokens
-
-        stop_waiting = self.batch_wait_iters_count >= self.batch_wait_timeout_iters or num_scheduled_tokens >= self.batch_wait_max_tokens_ratio * self.max_num_tokens
+        # Per-iteration token costs:
+        num_scheduled_ctx_tokens = sum(
+            req.context_chunk_size for req in context_requests
+        )
+        # Approximate 1 token per generation request this iteration
+        num_scheduled_gen_tokens = len(generation_requests)
+        num_scheduled_tokens = (
+            num_scheduled_ctx_tokens + num_scheduled_gen_tokens
+        )
+        # Ratio threshold only meaningful if both ratio > 0 and max_num_tokens > 0
+        ratio_threshold = (
+            int(self.batch_wait_max_tokens_ratio * self.max_num_tokens)
+            if (self.batch_wait_max_tokens_ratio > 0.0 and
+                self.max_num_tokens > 0)
+            else float("inf")
+        )
+        stop_waiting = (
+            self.batch_wait_iters_count >= self.batch_wait_timeout_iters
+            or num_scheduled_tokens >= ratio_threshold
+        )
         if stop_waiting:
             waited_context_requests = context_requests
             self.batch_wait_iters_count = 0
         else:
             self.batch_wait_iters_count += 1
         return waited_context_requests
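
The corrected waiting logic can be modeled as a small stateful class (a simplified sketch under the review's assumptions — per-iteration context chunk cost plus ~1 token per generation request — not the actual PyExecutor implementation):

```python
class BatchWaiter:
    """Simplified model of the scheduler-level batch-waiting decision."""

    def __init__(self, timeout_iters: int, max_tokens_ratio: float,
                 max_num_tokens: int):
        self.timeout_iters = timeout_iters
        self.max_tokens_ratio = max_tokens_ratio
        self.max_num_tokens = max_num_tokens
        self.wait_iters = 0  # mirrors batch_wait_iters_count

    def release(self, ctx_chunk_sizes: list, num_gen_requests: int) -> bool:
        """Return True when accumulated context requests should be released."""
        # Per-iteration cost: current context chunks plus ~1 token per gen req.
        scheduled = sum(ctx_chunk_sizes) + num_gen_requests
        # Ratio threshold is only meaningful when ratio and max tokens are > 0.
        threshold = (self.max_tokens_ratio * self.max_num_tokens
                     if self.max_tokens_ratio > 0 and self.max_num_tokens > 0
                     else float("inf"))
        if self.wait_iters >= self.timeout_iters or scheduled >= threshold:
            self.wait_iters = 0
            return True
        self.wait_iters += 1
        return False
```

With `timeout_iters=2` and `max_tokens_ratio=0.75` over `max_num_tokens=8192`, small batches are held for two iterations, while a batch crossing 6144 scheduled tokens is released immediately.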
tensorrt_llm/llmapi/llm_args.py (2)

2529-2536: Require max_num_tokens when ratio-based waiting is enabled

Ratio-based waiting needs a concrete threshold; enforce effective max_num_tokens (> 0). This prevents runtime issues in the executor.

Apply:

     @model_validator(mode='after')
     def validate_batch_wait_max_tokens_ratio(self) -> 'TorchLlmArgs':
         if self.batch_wait_max_tokens_ratio < 0 or self.batch_wait_max_tokens_ratio > 1:
             raise ValueError(
                 "batch_wait_max_tokens_ratio must be greater than or equal to 0 and less than or equal to 1"
             )
+        if self.batch_wait_max_tokens_ratio > 0:
+            effective_max_tokens = self.max_num_tokens or (
+                self.build_config.max_num_tokens if self.build_config else None
+            )
+            if effective_max_tokens is None or effective_max_tokens <= 0:
+                raise ValueError(
+                    "batch_wait_max_tokens_ratio requires max_num_tokens to be set (> 0). "
+                    "Set TorchLlmArgs.max_num_tokens or build_config.max_num_tokens."
+                )
         return self

2612-2615: Propagate a non-None max_num_tokens to PyTorchConfig

Backend needs a concrete threshold when ratio-based waiting is used; pass max_num_tokens with a build_config fallback.

Apply:

             attention_dp_batching_wait_iters=self.attention_dp_config.
             batching_wait_iters if self.attention_dp_config is not None else
             AttentionDpConfig.model_fields['batching_wait_iters'].default,
+            max_num_tokens=self.max_num_tokens
+            if self.max_num_tokens is not None
+            else self.build_config.max_num_tokens,
             batch_wait_timeout_ms=self.batch_wait_timeout_ms,
             batch_wait_timeout_iters=self.batch_wait_timeout_iters,
             batch_wait_max_tokens_ratio=self.batch_wait_max_tokens_ratio,

Run to verify PyTorchConfig signature and types:

#!/bin/bash
# Verify PyTorchConfig fields exist and types are as expected.
rg -nP 'class\s+PyTorchConfig\b' -A120 tensorrt_llm/_torch/pyexecutor/config.py
rg -nP '\bmax_num_tokens\b|\bbatch_wait_timeout_iters\b|\bbatch_wait_max_tokens_ratio\b' tensorrt_llm/_torch/pyexecutor/config.py
🧹 Nitpick comments (4)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (1)

1486-1527: Ensure ratio-based waiting is actually exercised: set max_num_tokens when batch_wait_max_tokens_ratio > 0

If max_num_tokens defaults to 0/None, the condition tokens >= ratio * max_num_tokens collapses to tokens >= 0 and stops waiting immediately; the test won’t meaningfully cover ratio behavior. Guard by setting a sane max_num_tokens only when ratio > 0, and (optionally) assert the backend picked up the values.

@@
-        with LLM(f"{llm_models_root()}/DeepSeek-V3-Lite/nvfp4_moe_only_mtp",
+        llm_kwargs = {}
+        if batch_wait_max_tokens_ratio > 0:
+            # Keep moderate to avoid OOM while making the ratio threshold meaningful.
+            llm_kwargs["max_num_tokens"] = 16384
+        with LLM(f"{llm_models_root()}/DeepSeek-V3-Lite/nvfp4_moe_only_mtp",
                  kv_cache_config=kv_cache_config,
                  **pytorch_config,
                  enable_attention_dp=False,
-                 speculative_config=mtp_config) as llm:
+                 speculative_config=mtp_config,
+                 **llm_kwargs) as llm:
             assert llm.args.quant_config.quant_algo == QuantAlgo.NVFP4
+            # Optional: verify wiring
+            assert llm.args.pytorch_backend_config.batch_wait_timeout_iters == batch_wait_timeout_iters
+            assert llm.args.pytorch_backend_config.batch_wait_max_tokens_ratio == batch_wait_max_tokens_ratio
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

1297-1304: Gate waiting only when both ctx and gen requests exist: OK; prefer truthiness checks over len(...) > 0 for clarity

Logic is correct; minor readability improvement below.

-        if not self.enable_attention_dp and self.enable_batch_waiting and len(
-                scheduler_output.context_requests) > 0 and len(
-                    scheduler_output.generation_requests) > 0:
-            scheduled_context_requests = self._waiting_requests(
-                scheduler_output.context_requests,
-                scheduler_output.generation_requests)
+        if (not self.enable_attention_dp and self.enable_batch_waiting
+                and scheduler_output.context_requests
+                and scheduler_output.generation_requests):
+            scheduled_context_requests = self._waiting_requests(
+                scheduler_output.context_requests,
+                scheduler_output.generation_requests
+            )
tensorrt_llm/llmapi/llm_args.py (2)

2249-2254: Nit: fix spacing/punctuation in description

Minor grammar/spacing for clarity.

Apply:

-        "Token accumulation threshold ratio for batch scheduling optimization. If greater than 0, the scheduler will accumulate requests locally until the total token count reaches batch_wait_max_tokens_ratio * max_num_tokens. This mechanism enhances GPU utilization efficiency by ensuring adequate batch sizes.If 0 disables token-based batching delays.",
+        "Token accumulation threshold ratio for batch scheduling optimization. If > 0, the scheduler will accumulate requests locally until the total token count reaches batch_wait_max_tokens_ratio * max_num_tokens. This mechanism enhances GPU utilization by ensuring adequate batch sizes. If 0, disables token-based batching delays.",

2523-2528: Validator message should say '>= 0'; optionally warn if both ms- and iter-based waits are enabled

  • Error text should match the logic (0 is allowed).
  • Optional: warn if both knobs are > 0 to avoid compounded delays.

Apply:

     @model_validator(mode='after')
     def validate_batch_wait_timeout_iters(self) -> 'TorchLlmArgs':
-        if self.batch_wait_timeout_iters < 0:
-            raise ValueError("batch_wait_timeout_iters must be greater than 0")
+        if self.batch_wait_timeout_iters < 0:
+            raise ValueError("batch_wait_timeout_iters must be greater than or equal to 0")
+        # Guardrail: avoid unintentionally enabling two waiting mechanisms.
+        if self.batch_wait_timeout_ms > 0 and self.batch_wait_timeout_iters > 0:
+            logger.warning(
+                "Both batch_wait_timeout_ms and batch_wait_timeout_iters are > 0; this may introduce compounded delays. "
+                "Consider enabling only one waiting mechanism.")
         return self
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between f6778f2 and 42002a7.

📒 Files selected for processing (5)
  • tensorrt_llm/_torch/pyexecutor/config.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py (5 hunks)
  • tensorrt_llm/llmapi/llm_args.py (3 hunks)
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py (1 hunks)
  • tests/unittest/api_stability/references/llm.yaml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • tests/unittest/api_stability/references/llm.yaml
  • tensorrt_llm/_torch/pyexecutor/config.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures

Files:

  • tensorrt_llm/llmapi/llm_args.py
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • tensorrt_llm/llmapi/llm_args.py
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
🧠 Learnings (2)
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.

Applied to files:

  • tensorrt_llm/llmapi/llm_args.py
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
🧬 Code graph analysis (2)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (4)
tests/integration/defs/conftest.py (2)
  • parametrize_with_ids (1786-1811)
  • llm_models_root (77-83)
tensorrt_llm/llmapi/llm_args.py (7)
  • KvCacheConfig (946-1077)
  • TorchCompileConfig (2108-2155)
  • CudaGraphConfig (108-165)
  • MoeConfig (168-196)
  • MTPDecodingConfig (521-556)
  • quant_config (2328-2331)
  • quant_config (2334-2335)
tensorrt_llm/quantization/mode.py (1)
  • QuantAlgo (23-46)
tests/integration/defs/accuracy/accuracy_core.py (3)
  • GSM8K (293-308)
  • evaluate (147-206)
  • evaluate (707-717)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
tensorrt_llm/_torch/pyexecutor/llm_request.py (1)
  • LlmRequest (282-424)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/py_executor.py

1279-1279: Line too long (165 > 120)

(E501)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

240-240: Good: initialize batch_wait_iters_count state

The counter is required to enforce iteration-based timeouts.

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16925 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16926 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16925 [ run ] completed with state ABORTED

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16927 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16926 [ run ] completed with state ABORTED

@yunruis
Copy link
Contributor Author

yunruis commented Aug 29, 2025

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16933 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16927 [ run ] completed with state ABORTED

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16933 [ run ] completed with state FAILURE

@litaotju
Copy link
Collaborator

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16937 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16937 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12723 completed with status: 'FAILURE'

@yunruis
Copy link
Contributor Author

yunruis commented Aug 29, 2025

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16945 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16945 [ run ] completed with state FAILURE

@yunruis
Copy link
Contributor Author

yunruis commented Aug 29, 2025

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16947 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16947 [ run ] completed with state FAILURE

@litaotju
Copy link
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16949 [ run ] triggered by Bot

@yunruis yunruis force-pushed the syr/attn_tp_config branch from 0cf28c0 to c3feac4 Compare August 29, 2025 07:31
@yunruis
Copy link
Contributor Author

yunruis commented Aug 29, 2025

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16955 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #16949 [ run ] completed with state ABORTED

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

183-195: Sanitize config and compute enable flag robustly

Guard against invalid values and clamp ratio to [0.0, 1.0]. This avoids accidental “always release” when ratio=0 and protects against negatives.

Apply:

```diff
-        self.max_num_tokens = model_engine.pytorch_backend_config.max_num_tokens
+        self.max_num_tokens = max(
+            1, int(model_engine.pytorch_backend_config.max_num_tokens)
+        )
         self.print_log = model_engine.pytorch_backend_config.print_iter_log
@@
-        self.batch_wait_timeout_iters = model_engine.pytorch_backend_config.batch_wait_timeout_iters
-        self.batch_wait_max_tokens_ratio = model_engine.pytorch_backend_config.batch_wait_max_tokens_ratio
-        self.enable_batch_waiting = self.batch_wait_timeout_iters > 0 or self.batch_wait_max_tokens_ratio > 0
+        _timeout = model_engine.pytorch_backend_config.batch_wait_timeout_iters
+        _ratio = model_engine.pytorch_backend_config.batch_wait_max_tokens_ratio
+        self.batch_wait_timeout_iters = max(0, int(_timeout))
+        self.batch_wait_max_tokens_ratio = max(0.0, min(1.0, float(_ratio)))
+        self.enable_batch_waiting = (
+            self.batch_wait_timeout_iters > 0
+            or self.batch_wait_max_tokens_ratio > 0.0
+        )
```
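Outside the diff, the sanitizing behavior being suggested can be exercised as a standalone sketch; the function name and tuple return are hypothetical, not part of TensorRT-LLM's config API:

```python
def sanitize_batch_wait_config(max_num_tokens, timeout_iters, tokens_ratio):
    """Clamp raw config values into safe ranges and derive the enable flag.

    Mirrors the review suggestion: floor max_num_tokens at 1, floor the
    iteration timeout at 0, clamp the ratio to [0.0, 1.0], and enable
    batch waiting only when at least one threshold is active.
    """
    max_num_tokens = max(1, int(max_num_tokens))
    timeout_iters = max(0, int(timeout_iters))
    tokens_ratio = max(0.0, min(1.0, float(tokens_ratio)))
    enabled = timeout_iters > 0 or tokens_ratio > 0.0
    return max_num_tokens, timeout_iters, tokens_ratio, enabled
```

For example, a negative timeout and an out-of-range ratio are silently clamped rather than propagated into the scheduler.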
♻️ Duplicate comments (3)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

1275-1295: Fix token accounting, zero-threshold semantics, and long line (E501)

Current logic overcounts tokens using full history and treats ratio=0 as “always release,” nullifying timeout-only waiting. Replace with per-iteration costs and guard zero thresholds.

Apply:

```diff
-    def _waiting_requests(self, context_requests: list[LlmRequest],
-                          generation_requests: list[LlmRequest]):
+    def _waiting_requests(self, context_requests: list[LlmRequest],
+                          generation_requests: list[LlmRequest]):
         if not self.enable_batch_waiting:
             return context_requests

         waited_context_requests = []
-        stop_waiting = False
-        num_scheduled_ctx_tokens = sum(
-            len(ctx_req.get_tokens(0)) for ctx_req in context_requests)
-        num_scheduled_gen_tokens = sum(
-            len(gen_req.get_tokens(0)) for gen_req in generation_requests)
-        num_scheduled_tokens = num_scheduled_ctx_tokens + num_scheduled_gen_tokens
-
-        stop_waiting = self.batch_wait_iters_count >= self.batch_wait_timeout_iters or num_scheduled_tokens >= self.batch_wait_max_tokens_ratio * self.max_num_tokens
+        # Per-iteration costs: context consumes current chunk size; generation ~1 token per req
+        num_scheduled_ctx_tokens = sum(req.context_chunk_size
+                                       for req in context_requests)
+        num_scheduled_gen_tokens = len(generation_requests)
+        num_scheduled_tokens = (num_scheduled_ctx_tokens
+                                + num_scheduled_gen_tokens)
+        # Only apply a check when the corresponding threshold is enabled (> 0)
+        ratio_enabled = self.batch_wait_max_tokens_ratio > 0.0
+        timeout_enabled = self.batch_wait_timeout_iters > 0
+        ratio_threshold = int(self.batch_wait_max_tokens_ratio
+                              * self.max_num_tokens) if ratio_enabled else 0
+        stop_waiting = (
+            (timeout_enabled
+             and self.batch_wait_iters_count >= self.batch_wait_timeout_iters)
+            or (ratio_enabled and num_scheduled_tokens >= ratio_threshold)
+        )
         if stop_waiting:
             waited_context_requests = context_requests
             self.batch_wait_iters_count = 0
         else:
             self.batch_wait_iters_count += 1
         return waited_context_requests
```
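The gating rule proposed above can be isolated as a pure function, which makes the zero-threshold semantics easy to check; the name and signature here are a hypothetical sketch, not the executor's actual method:

```python
def should_stop_waiting(waited_iters, scheduled_tokens, timeout_iters,
                        tokens_ratio, max_num_tokens):
    """Decide whether to release accumulated context requests.

    A gate only fires when its threshold is enabled (> 0), so ratio=0 no
    longer short-circuits timeout-only waiting, and vice versa.
    """
    timeout_hit = timeout_iters > 0 and waited_iters >= timeout_iters
    ratio_hit = (tokens_ratio > 0.0
                 and scheduled_tokens >= int(tokens_ratio * max_num_tokens))
    return timeout_hit or ratio_hit
```

With ratio disabled (0.0), only the iteration timeout can release the batch; with timeout disabled (0), only the token threshold can.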
tensorrt_llm/llmapi/llm_args.py (2)

2529-2535: Require effective max_num_tokens when ratio-based waiting is enabled.

Without a concrete max_num_tokens, the token-ratio threshold is undefined and will misbehave at runtime.

```diff
     def validate_batch_wait_max_tokens_ratio(self) -> 'TorchLlmArgs':
-        if self.batch_wait_max_tokens_ratio < 0 or self.batch_wait_max_tokens_ratio > 1:
-            raise ValueError(
-                "batch_wait_max_tokens_ratio must be between 0 and 1")
+        if self.batch_wait_max_tokens_ratio < 0 or self.batch_wait_max_tokens_ratio > 1:
+            raise ValueError("batch_wait_max_tokens_ratio must be in [0, 1]")
+
+        if self.batch_wait_max_tokens_ratio > 0:
+            effective_max_tokens = self.max_num_tokens or (
+                self.build_config.max_num_tokens if self.build_config else None
+            )
+            if effective_max_tokens is None or effective_max_tokens <= 0:
+                raise ValueError(
+                    "batch_wait_max_tokens_ratio requires max_num_tokens to be set (> 0). "
+                    "Set TorchLlmArgs.max_num_tokens or build_config.max_num_tokens."
+                )
         return self
```
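The combined validation rules (non-negative timeout, ratio in [0, 1], and a concrete token limit whenever the ratio gate is on) can be sketched as plain checks; this helper is hypothetical and stands in for the Pydantic validators:

```python
def validate_batch_wait_args(timeout_iters, tokens_ratio, max_num_tokens=None):
    """Raise ValueError for invalid batch-wait settings, per the review.

    Returns True when the combination is usable at runtime.
    """
    if timeout_iters < 0:
        raise ValueError("batch_wait_timeout_iters must be >= 0")
    if not 0 <= tokens_ratio <= 1:
        raise ValueError("batch_wait_max_tokens_ratio must be in [0, 1]")
    # The ratio threshold is ratio * max_num_tokens, so it is undefined
    # without a concrete positive max_num_tokens.
    if tokens_ratio > 0 and (max_num_tokens is None or max_num_tokens <= 0):
        raise ValueError(
            "batch_wait_max_tokens_ratio requires max_num_tokens to be set (> 0)")
    return True
```

Notably, ratio-based waiting without a token limit fails fast at config time instead of misbehaving inside the scheduler.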

2611-2614: Propagate a non-None max_num_tokens to PyTorchConfig.

Ensure backend gets a concrete threshold for ratio-based waiting; fall back to build_config when needed.

```diff
         return PyTorchConfig(
             extra_resource_managers=self.extra_resource_managers,
@@
             mamba_ssm_cache_dtype=self.kv_cache_config.mamba_ssm_cache_dtype,
+            max_num_tokens=self.max_num_tokens
+            if self.max_num_tokens is not None
+            else self.build_config.max_num_tokens,
             enable_iter_perf_stats=self.enable_iter_perf_stats,
```
🧹 Nitpick comments (4)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (1)

1519-1523: Make ratio-threshold path deterministic

Without bounding max_num_tokens, the 0.75 ratio may rarely trigger before timeout in CI-sized runs. Recommend lowering max_num_tokens to force the ratio gate to engage reliably.

Apply:

```diff
 with LLM(f"{llm_models_root()}/DeepSeek-V3-Lite/nvfp4_moe_only_mtp",
          kv_cache_config=kv_cache_config,
+         max_num_tokens=512,
          **pytorch_config,
          enable_attention_dp=False,
          speculative_config=mtp_config) as llm:
```
tensorrt_llm/llmapi/llm_args.py (3)

2243-2247: Wording: “new coming” → “incoming” (minor doc polish).

Tighten the field description for clarity.

```diff
-        "Maximum number of iterations the scheduler will wait to accumulate new coming requests for improved GPU utilization efficiency. If greater than 0, the scheduler will delay batch processing to gather more requests up to the specified iteration limit. If 0, disables timeout-iters-based batching delays.",
+        "Maximum number of iterations the scheduler will wait to accumulate incoming requests for improved GPU utilization. If > 0, the scheduler may delay batch processing to gather more requests up to this iteration limit. If 0, disables iter-based batching delays.",
```

2249-2254: Fix typo and tighten phrasing in description.

Add missing space after sentence and simplify wording.

```diff
-        "Token accumulation threshold ratio for batch scheduling optimization. If greater than 0, the scheduler will accumulate requests locally until the total token count reaches batch_wait_max_tokens_ratio * max_num_tokens. This mechanism enhances GPU utilization efficiency by ensuring adequate batch sizes.If 0 disables token-based batching delays.",
+        "Token-accumulation threshold ratio for batch scheduling. If > 0, the scheduler accumulates requests until total scheduled tokens reach batch_wait_max_tokens_ratio * max_num_tokens. This helps ensure adequate batch sizes. If 0, disables token-based batching delays.",
```

2523-2528: Validator message conflicts with allowed value 0; make it “>= 0”.

The check allows 0, but the message says “> 0”.

```diff
-        if self.batch_wait_timeout_iters < 0:
-            raise ValueError("batch_wait_timeout_iters must be greater than 0")
+        if self.batch_wait_timeout_iters < 0:
+            raise ValueError("batch_wait_timeout_iters must be >= 0")
```
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 6488677 and c3feac4.

📒 Files selected for processing (7)
  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/config.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py (5 hunks)
  • tensorrt_llm/llmapi/llm_args.py (3 hunks)
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py (1 hunks)
  • tests/integration/test_lists/qa/llm_function_full.txt (1 hunks)
  • tests/unittest/api_stability/references/llm.yaml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • tensorrt_llm/_torch/pyexecutor/config.py
  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
  • tests/unittest/api_stability/references/llm.yaml
  • tests/integration/test_lists/qa/llm_function_full.txt
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{cpp,cc,cxx,cu,py,h,hpp,hh,hxx,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use spaces only; indent 4 spaces

Files:

  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
  • tensorrt_llm/llmapi/llm_args.py
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Indent with 4 spaces; do not use tabs
Preserve module namespaces in imports; prefer from package.subpackage import module then module.Symbol
Python file names use snake_case (e.g., some_file.py)
Class names use PascalCase
Function and method names use snake_case
Local variable names use snake_case; if starting with a number, prefix k (e.g., k_99th_percentile)
Global variables use G_ prefix and UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE
Avoid shadowing variables from an outer scope
Initialize all externally visible members of a class in init
For interfaces used outside a file, prefer docstrings; reserve comments for internal code or local interfaces
Use Google-style docstrings for classes and functions; document attributes/variables inline as shown
Avoid reflection when simple, explicit code suffices (e.g., prefer def make_complex(x,y) over locals()/dict tricks)
Catch the narrowest exceptions possible in try/except
For duck-typing try/except, keep try body minimal and use else for main logic

Files:

  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
  • tensorrt_llm/llmapi/llm_args.py
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
**/*.{cpp,cc,cxx,cu,h,hpp,hh,hxx,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
  • tensorrt_llm/llmapi/llm_args.py
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
🧠 Learnings (1)
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.

Applied to files:

  • tensorrt_llm/llmapi/llm_args.py
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
🧬 Code graph analysis (2)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (5)
tests/integration/defs/conftest.py (2)
  • parametrize_with_ids (1786-1811)
  • llm_models_root (77-83)
tensorrt_llm/llmapi/llm_args.py (7)
  • KvCacheConfig (946-1077)
  • TorchCompileConfig (2108-2155)
  • CudaGraphConfig (108-165)
  • MoeConfig (168-196)
  • MTPDecodingConfig (521-556)
  • quant_config (2328-2331)
  • quant_config (2334-2335)
tensorrt_llm/llmapi/llm.py (1)
  • LLM (1011-1027)
tensorrt_llm/quantization/mode.py (1)
  • QuantAlgo (23-46)
tests/integration/defs/accuracy/accuracy_core.py (3)
  • GSM8K (293-308)
  • evaluate (147-206)
  • evaluate (707-717)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
tensorrt_llm/_torch/pyexecutor/llm_request.py (1)
  • LlmRequest (282-424)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/py_executor.py

1288-1288: Line too long (165 > 120)

(E501)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (3)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (1)

1486-1513: Batch-waiting test matrix looks solid

Good coverage of timeout-only, ratio-only, and combined cases; disabling attention_dp ensures the new path is exercised.

tensorrt_llm/_torch/pyexecutor/py_executor.py (2)

240-240: LGTM: initialize waiting-iteration counter

Counter is correctly initialized.


1306-1313: Release guard to avoid dead-wait is correct

Conditionally applying waiting only when there are both ctx and gen requests avoids stalls.

@tensorrt-cicd
Collaborator

PR_Github #16955 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12727 completed with status: 'FAILURE'

@yunruis
Contributor Author

yunruis commented Aug 29, 2025

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #16980 [ run ] triggered by Bot
