[None][opt] Add batch waiting when scheduling #7287
Conversation
📝 Walkthrough

Adds three batch-waiting configuration fields to PyTorchConfig and TorchLlmArgs, wires defaults in the ADEngine shim, and implements internal batch-waiting state and a …

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
autonumber
participant Client
participant TorchLlmArgs
participant PyTorchConfig
participant PyExecutor
participant Scheduler
Client->>TorchLlmArgs: construct with batch_wait_* params
TorchLlmArgs->>TorchLlmArgs: validate params
TorchLlmArgs->>PyTorchConfig: get_pytorch_backend_config(..., batch_wait_*, max_num_tokens)
PyTorchConfig-->>PyExecutor: config on init
PyExecutor->>PyExecutor: set enable_batch_waiting, batch_wait_iters_count=0
loop scheduling step
PyExecutor->>Scheduler: collect generation_requests + context_requests
alt attention_dp disabled AND enable_batch_waiting AND generation_requests exist
PyExecutor->>PyExecutor: _waiting_requests(context_requests, generation_requests)
alt stop conditions met (iters >= timeout_iters OR total_tokens >= ratio * max_num_tokens)
Note right of PyExecutor: release deferred context requests\nreset batch_wait_iters_count
PyExecutor-->>Scheduler: schedule context + generation
else still waiting
Note right of PyExecutor: increment batch_wait_iters_count\ndefer context scheduling
PyExecutor-->>Scheduler: schedule generation only
end
else
PyExecutor-->>Scheduler: schedule context + generation (no waiting)
end
end
```
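The waiting decision in the diagram boils down to a small state machine: release deferred context requests when either the iteration budget is spent or enough tokens have accumulated. A minimal standalone sketch (the `BatchWaiter` class is hypothetical; the parameter names mirror the PR's `batch_wait_timeout_iters`, `batch_wait_max_tokens_ratio`, and `max_num_tokens` knobs):

```python
class BatchWaiter:
    """Toy model of the batch-waiting stop condition, not the real executor."""

    def __init__(self, timeout_iters, max_tokens_ratio, max_num_tokens):
        self.timeout_iters = timeout_iters
        self.max_tokens_ratio = max_tokens_ratio
        self.max_num_tokens = max_num_tokens
        self.wait_iters_count = 0

    def release_context(self, num_scheduled_tokens):
        """Return True when deferred context requests should be released."""
        stop = (self.wait_iters_count >= self.timeout_iters
                or num_scheduled_tokens >= self.max_tokens_ratio * self.max_num_tokens)
        if stop:
            self.wait_iters_count = 0  # reset budget once context is released
        else:
            self.wait_iters_count += 1  # keep deferring, burn one iteration
        return stop


waiter = BatchWaiter(timeout_iters=2, max_tokens_ratio=0.5, max_num_tokens=1000)
print(waiter.release_context(100))  # False: below token threshold, 0 iters waited
print(waiter.release_context(100))  # False: 1 iter waited, still under budget
print(waiter.release_context(100))  # True: iteration budget spent, counter resets
print(waiter.release_context(600))  # True: 600 >= 0.5 * 1000
```

Either condition alone releases the batch, which is why the review below pushes on getting the token count right: a wrong count silently turns one of the two exits on or off.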
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

183-195: Guard against None/invalid max_num_tokens when TP waiting is enabled. If max_num_tokens is None or <= 0, the wait threshold math will raise or misbehave.

Apply this diff to fail fast:

```diff
     self.max_draft_len = max_draft_len
-    self.max_num_tokens = model_engine.pytorch_backend_config.max_num_tokens
+    self.max_num_tokens = model_engine.pytorch_backend_config.max_num_tokens
     self.print_log = model_engine.pytorch_backend_config.print_iter_log
@@
-    self.attention_tp_enable_waiting = model_engine.pytorch_backend_config.attention_tp_enable_waiting
+    self.attention_tp_enable_waiting = model_engine.pytorch_backend_config.attention_tp_enable_waiting
     self.attention_tp_timeout_iters = model_engine.pytorch_backend_config.attention_tp_timeout_iters
     self.attention_tp_max_token_ratio = model_engine.pytorch_backend_config.attention_tp_max_token_ratio
+    if self.attention_tp_enable_waiting and (self.max_num_tokens is None
+                                             or self.max_num_tokens <= 0):
+        raise ValueError(
+            "attention_tp_enable_waiting requires a positive max_num_tokens. "
+            "Set TorchLlmArgs.max_num_tokens or ensure build_config.max_num_tokens is propagated.")
```
🧹 Nitpick comments (7)

tensorrt_llm/_torch/pyexecutor/config.py (2)

1-1: Missing NVIDIA copyright header (2025). Please prepend the standard NVIDIA copyright header to comply with repo guidelines.

53-57: Default max_num_tokens looks off; ensure it's intentional (4086 vs 4096). If this is a typo, fix it; otherwise document why 4086 is chosen. Prefer deriving from BuildConfig to avoid drift. Example fix:

```diff
-    max_num_tokens: int = 4086
+    # Derive at runtime via TorchLlmArgs/build_config; keep a sane default.
+    max_num_tokens: int = 4096
```

tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

1-1: Missing NVIDIA copyright header (2025). Please add the standard header.

tensorrt_llm/llmapi/llm_args.py (3)

1-1: Missing NVIDIA copyright header (2025). Please add the standard header.

216-229: LGTM: AttentionTPConfig added with sane defaults and from_dict. Consider clarifying in the docstring that TP waiting applies only when attention_dp is disabled.

2206-2210: Line too long; tighten field description (E501). Shorten to avoid lint failures. Apply:

```diff
-    attention_tp_config: Optional[AttentionTPConfig] = Field(
-        default=None,
-        description="Optimized load for the TP Attention.",
-        status="beta")
+    attention_tp_config: Optional[AttentionTPConfig] = Field(
+        default=None,
+        description="Token-parallel attention waiting controls.",
+        status="beta")
```

tensorrt_llm/llmapi/__init__.py (1)

1-1: Missing NVIDIA copyright header (2025). Please add the standard header.
📜 Review details

Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📒 Files selected for processing (4)
- tensorrt_llm/_torch/pyexecutor/config.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (5 hunks)
- tensorrt_llm/llmapi/__init__.py (2 hunks)
- tensorrt_llm/llmapi/llm_args.py (5 hunks)
🧰 Additional context used

📓 Path-based instructions (2)

**/*.py — 📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
- Code must target Python 3.8+
- Indent Python code with 4 spaces; do not use tabs
- Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
- Python file names should be snake_case
- Python class names should be PascalCase
- Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
- Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
- Avoid shadowing variables from outer scopes; initialize all externally visible members in __init__
- Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
- Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
- Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
- In try/except, catch the narrowest exceptions possible
- For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures

Files:
- tensorrt_llm/llmapi/llm_args.py
- tensorrt_llm/_torch/pyexecutor/py_executor.py
- tensorrt_llm/_torch/pyexecutor/config.py
- tensorrt_llm/llmapi/__init__.py

**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py} — 📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
- Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:
- tensorrt_llm/llmapi/llm_args.py
- tensorrt_llm/_torch/pyexecutor/py_executor.py
- tensorrt_llm/_torch/pyexecutor/config.py
- tensorrt_llm/llmapi/__init__.py
🧠 Learnings (1)

📚 Learning: 2025-08-26T09:37:10.431Z
Learnt from: jiaganc
PR: NVIDIA/TensorRT-LLM#7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Learning: In TensorRT-LLM, the `get_pytorch_perf_config()` method returns `self.pytorch_config`, which can contain default `cuda_graph_config` values, so `llm_args` may already have this config before the extra options processing.

Applied to files:
- tensorrt_llm/llmapi/llm_args.py
🧬 Code graph analysis (3)

tensorrt_llm/llmapi/llm_args.py (2)
- tensorrt_llm/builder.py (2): default (50-58), from_dict (572-677)
- tensorrt_llm/_torch/pyexecutor/config.py (1): PyTorchConfig (17-120)

tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
- tensorrt_llm/_torch/pyexecutor/llm_request.py (1): LlmRequest (282-424)

tensorrt_llm/llmapi/__init__.py (1)
- tensorrt_llm/llmapi/llm_args.py (1): AttentionTPConfig (211-222)
🪛 Ruff (0.12.2)

tensorrt_llm/llmapi/llm_args.py
- 2210: Line too long (165 > 120) (E501)

tensorrt_llm/_torch/pyexecutor/py_executor.py
- 1277: Line too long (172 > 120) (E501)
- 1284: Line too long (282 > 120) (E501)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (5)

tensorrt_llm/_torch/pyexecutor/py_executor.py (2)
- 239-240: Good: initialized TP waiting iteration counter.
- 1298-1302: Gating TP waiting only when attention DP is disabled is correct.

tensorrt_llm/llmapi/llm_args.py (1)
- 2649-2650: LGTM: wiring attention_tp_config through update_llm_args_with_extra_dict.

tensorrt_llm/llmapi/__init__.py (2)
- 7-8: LGTM: exported AttentionTPConfig from llmapi.
- 61-62: LGTM: added AttentionTPConfig to __all__.
```python
    def _waiting_tp_requests(self, context_requests: list[LlmRequest],
                             generation_requests: list[LlmRequest]):
        if not self.attention_tp_enable_waiting:
            return context_requests

        waited_context_requests = []
        stop_waiting = False
        num_scheduled_ctx_tokens = sum(
            len(ctx_req.get_tokens(0)) for ctx_req in context_requests)
        num_scheduled_gen_tokens = sum(
            len(gen_req.get_tokens(0)) for gen_req in generation_requests)
        num_scheduled_tokens = num_scheduled_ctx_tokens + num_scheduled_gen_tokens
        stop_waiting = self.tp_ctx_waiting_iters_count >= self.attention_tp_timeout_iters or num_scheduled_tokens >= self.attention_tp_max_token_ratio * self.max_num_tokens
        if stop_waiting:
            waited_context_requests = context_requests
            self.tp_ctx_waiting_iters_count = 0
        else:
            self.tp_ctx_waiting_iters_count += 1
        # logger.info(
        #     f"waited_context_requests: {waited_context_requests}, tp_ctx_waiting_iters_count: {self.tp_ctx_waiting_iters_count}, num_scheduled_tokens: {num_scheduled_tokens}, max_num_tokens: {self.max_num_tokens}, attention_tp_max_token_ratio: {self.attention_tp_max_token_ratio}"
        # )
        return waited_context_requests
```
🛠️ Refactor suggestion
Token accounting is incorrect; use per-iteration costs (ctx chunk size, 1 token per gen). Also fix long lines (E501).
Using len(get_tokens(0)) overcounts historical tokens and can stall or prematurely stop waiting. Count:
- context: sum of request.context_chunk_size
- generation: number of generation requests (optionally times beam width)
Apply:
```diff
-    def _waiting_tp_requests(self, context_requests: list[LlmRequest],
-                             generation_requests: list[LlmRequest]):
+    def _waiting_tp_requests(self, context_requests: list[LlmRequest],
+                             generation_requests: list[LlmRequest]):
         if not self.attention_tp_enable_waiting:
             return context_requests
         waited_context_requests = []
-        stop_waiting = False
-        num_scheduled_ctx_tokens = sum(
-            len(ctx_req.get_tokens(0)) for ctx_req in context_requests)
-        num_scheduled_gen_tokens = sum(
-            len(gen_req.get_tokens(0)) for gen_req in generation_requests)
-        num_scheduled_tokens = num_scheduled_ctx_tokens + num_scheduled_gen_tokens
-        stop_waiting = self.tp_ctx_waiting_iters_count >= self.attention_tp_timeout_iters or num_scheduled_tokens >= self.attention_tp_max_token_ratio * self.max_num_tokens
+        # Per-iteration token costs:
+        num_scheduled_ctx_tokens = sum(
+            req.context_chunk_size for req in context_requests
+        )
+        # Approximate 1 token per generation request this iteration
+        num_scheduled_gen_tokens = len(generation_requests)
+        num_scheduled_tokens = (
+            num_scheduled_ctx_tokens + num_scheduled_gen_tokens
+        )
+        stop_waiting = (
+            self.tp_ctx_waiting_iters_count >= self.attention_tp_timeout_iters
+            or num_scheduled_tokens >= int(
+                self.attention_tp_max_token_ratio * self.max_num_tokens
+            )
+        )
         if stop_waiting:
             waited_context_requests = context_requests
             self.tp_ctx_waiting_iters_count = 0
         else:
             self.tp_ctx_waiting_iters_count += 1
-        # logger.info(
-        #     f"waited_context_requests: {waited_context_requests}, tp_ctx_waiting_iters_count: {self.tp_ctx_waiting_iters_count}, num_scheduled_tokens: {num_scheduled_tokens}, max_num_tokens: {self.max_num_tokens}, attention_tp_max_token_ratio: {self.attention_tp_max_token_ratio}"
-        # )
         return waited_context_requests
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
    def _waiting_tp_requests(self, context_requests: list[LlmRequest],
                             generation_requests: list[LlmRequest]):
        if not self.attention_tp_enable_waiting:
            return context_requests
        waited_context_requests = []
        # Per-iteration token costs:
        num_scheduled_ctx_tokens = sum(
            req.context_chunk_size for req in context_requests
        )
        # Approximate 1 token per generation request this iteration
        num_scheduled_gen_tokens = len(generation_requests)
        num_scheduled_tokens = (
            num_scheduled_ctx_tokens + num_scheduled_gen_tokens
        )
        stop_waiting = (
            self.tp_ctx_waiting_iters_count >= self.attention_tp_timeout_iters
            or num_scheduled_tokens >= int(
                self.attention_tp_max_token_ratio * self.max_num_tokens
            )
        )
        if stop_waiting:
            waited_context_requests = context_requests
            self.tp_ctx_waiting_iters_count = 0
        else:
            self.tp_ctx_waiting_iters_count += 1
        return waited_context_requests
```
🧰 Tools

🪛 Ruff (0.12.2)
- 1277: Line too long (172 > 120) (E501)
- 1284: Line too long (282 > 120) (E501)
🤖 Prompt for AI Agents

In tensorrt_llm/_torch/pyexecutor/py_executor.py around lines 1265-1287, replace the incorrect token counting that uses len(req.get_tokens(0)) with per-iteration cost accounting: sum the context chunk cost by summing each context request's context_chunk_size (e.g., ctx_req.context_chunk_size) and count generation cost as the number of active generation requests (optionally multiplied by beam width if applicable) instead of token history; compute num_scheduled_tokens = num_scheduled_ctx_tokens + num_scheduled_gen_tokens, keep the existing stop_waiting logic but ensure tp_ctx_waiting_iters_count is updated the same way, and shorten any long lines to satisfy E501 (break calculations and conditions across multiple lines).
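To see why the suggested accounting differs from the original, here is a toy comparison (`FakeContextRequest` is a made-up stand-in; only `context_chunk_size` and `get_tokens` mimic the fields the review discusses). A chunked context request with a long prompt reports its full token history every iteration, while its actual per-iteration cost is just the current chunk:

```python
class FakeContextRequest:
    """Hypothetical stand-in for LlmRequest; not the real class."""

    def __init__(self, prompt_tokens, context_chunk_size):
        self._tokens = list(range(prompt_tokens))  # full token history
        self.context_chunk_size = context_chunk_size  # cost this iteration

    def get_tokens(self, beam):
        return self._tokens


reqs = [FakeContextRequest(prompt_tokens=4096, context_chunk_size=512)]

# What the original code counts: the whole prompt history, every iteration.
historical = sum(len(r.get_tokens(0)) for r in reqs)
# What the suggestion counts: only the chunk actually scheduled this step.
per_iteration = sum(r.context_chunk_size for r in reqs)

print(historical, per_iteration)  # 4096 512
```

With a 0.5 ratio and max_num_tokens of 4096, the historical count (4096) trips the threshold immediately while the per-iteration count (512) does not, which is exactly the premature-release behavior the comment describes.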
Force-pushed f26d6e9 to 4e65519.
Force-pushed 4e65519 to f58f5a6.
/bot run

PR_Github #16692 [ run ] triggered by Bot
Actionable comments posted: 3
♻️ Duplicate comments (2)

tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

1266-1287: Fix token accounting and NVTX label in _waiting_requests (and wrap long line).
- NVTX label is incorrect ("_schedule" on _waiting_requests), hurting trace readability.
- Token counting uses len(req.get_tokens(0)), which overcounts historical tokens and can prematurely release or stall waiting. Count per-iteration cost instead:
  - Context: sum context_chunk_size for scheduled context requests this iter.
  - Generation: approximate 1 token/request (optionally times beam width).
- Also address the E501 long line.

Apply:

```diff
-@nvtx_range("_schedule")
+@nvtx_range("_waiting_requests")
 def _waiting_requests(self, context_requests: list[LlmRequest],
                       generation_requests: list[LlmRequest]):
     if not self.enable_batch_waiting:
         return context_requests
     waited_context_requests = []
-    stop_waiting = False
-    num_scheduled_ctx_tokens = sum(
-        len(ctx_req.get_tokens(0)) for ctx_req in context_requests)
-    num_scheduled_gen_tokens = sum(
-        len(gen_req.get_tokens(0)) for gen_req in generation_requests)
-    num_scheduled_tokens = num_scheduled_ctx_tokens + num_scheduled_gen_tokens
-
-    stop_waiting = self.batch_wait_iters_count >= self.batch_wait_timeout_iters or num_scheduled_tokens >= self.batch_wait_max_tokens_ratio * self.max_num_tokens
+    # Per-iteration token costs:
+    num_scheduled_ctx_tokens = sum(
+        getattr(req, "context_chunk_size", 1) for req in context_requests
+    )
+    # Approximate 1 token per generation request this iteration.
+    # If desired, incorporate beam width:
+    # num_scheduled_gen_tokens = sum(getattr(req.sampling_config, "beam_width", 1) for req in generation_requests)
+    num_scheduled_gen_tokens = len(generation_requests)
+    num_scheduled_tokens = (
+        num_scheduled_ctx_tokens + num_scheduled_gen_tokens
+    )
+
+    stop_waiting = (
+        self.batch_wait_iters_count >= self.batch_wait_timeout_iters
+        or num_scheduled_tokens >= int(
+            self.batch_wait_max_tokens_ratio * self.max_num_tokens
+        )
+    )
     if stop_waiting:
         waited_context_requests = context_requests
         self.batch_wait_iters_count = 0
     else:
         self.batch_wait_iters_count += 1
     return waited_context_requests
```

tensorrt_llm/llmapi/llm_args.py (1)

2612-2615: Propagate a non-None max_num_tokens to the PyTorch backend. Ensure the backend receives a concrete value (fallback to build_config) so ratio-based waiting can compute thresholds.

```diff
             attention_dp_batching_wait_iters=self.attention_dp_config.
             batching_wait_iters if self.attention_dp_config is not None else
             AttentionDpConfig.model_fields['batching_wait_iters'].default,
+            max_num_tokens=self.max_num_tokens
+            if self.max_num_tokens is not None
+            else self.build_config.max_num_tokens,
             batch_wait_timeout_ms=self.batch_wait_timeout_ms,
             batch_wait_timeout_iters=self.batch_wait_timeout_iters,
             batch_wait_max_tokens_ratio=self.batch_wait_max_tokens_ratio,
```
🧹 Nitpick comments (5)

tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

1298-1305: Guard condition reads correctly; consider explicit short-circuit for readability. No functional change needed. Optional clarity:

```diff
-        if not self.enable_attention_dp and self.enable_batch_waiting and len(
-                scheduler_output.context_requests) > 0 and len(
-                    scheduler_output.generation_requests) > 0:
+        if (not self.enable_attention_dp and self.enable_batch_waiting
+                and scheduler_output.context_requests
+                and scheduler_output.generation_requests):
```

tensorrt_llm/llmapi/llm_args.py (4)

2243-2254: Use int for the iteration-based timeout; tighten description. Iterations should be integral and the description can be clearer. Apply:

```diff
-    batch_wait_timeout_iters: float = Field(
+    batch_wait_timeout_iters: int = Field(
         default=0,
         description=
-        "Maximum number of iterations the scheduler will wait to accumulate new coming requests for improved GPU utilization efficiency. If greater than 0, the scheduler will delay batch processing to gather more requests up to the specified iteration limit. If 0, disables timeout-iters-based batching delays.",
+        "Maximum number of scheduler iterations to wait to accumulate incoming requests for better GPU utilization. If > 0, the scheduler may delay batch processing up to this budget; if 0, disables iteration-based batching delays.",
         status="prototype")
     batch_wait_max_tokens_ratio: float = Field(
         default=0,
         description=
-        "Token accumulation threshold ratio for batch scheduling optimization. If greater than 0, the scheduler will accumulate requests locally until the total token count reaches batch_wait_max_tokens_ratio * max_num_tokens. This mechanism enhances GPU utilization efficiency by ensuring adequate batch sizes.If 0 disables token-based batching delays.",
+        "Token accumulation threshold ratio for batch scheduling. If > 0, accumulate requests until total tokens reach batch_wait_max_tokens_ratio * max_num_tokens. This improves GPU utilization by ensuring adequate batch sizes. If 0, disables token-based batching delays.",
         status="prototype")
```

2519-2521: Fix validation message (0 is allowed). Error text should say "greater than or equal to 0."

```diff
-        if self.batch_wait_timeout_ms < 0:
-            raise ValueError("batch_wait_timeout_ms must be greater than 0")
+        if self.batch_wait_timeout_ms < 0:
+            raise ValueError("batch_wait_timeout_ms must be greater than or equal to 0")
```

2523-2528: Fix validation message (0 is allowed) for iters. Keep logic, update message; aligns with docs.

```diff
-        if self.batch_wait_timeout_iters < 0:
-            raise ValueError("batch_wait_timeout_iters must be greater than 0")
+        if self.batch_wait_timeout_iters < 0:
+            raise ValueError("batch_wait_timeout_iters must be greater than or equal to 0")
```

2243-2254: Clarify independent batching thresholds and their execution order. The two parameters govern separate stages of the PyTorch executor's batching logic and do not override one another. Specifically:
- batch_wait_timeout_ms applies in ExecutorRequestQueue._get_from_request_queue (tensorrt_llm/_torch/pyexecutor/executor_request_queue.py). When > 0, the request queue will wait up to that many milliseconds (or until max_batch_size is reached) before flushing items to the executor.
- batch_wait_timeout_iters (and batch_wait_max_tokens_ratio) applies later in PyTorchExecutor._waiting_requests (tensorrt_llm/_torch/pyexecutor/py_executor.py). When enable_batch_waiting is true, the scheduler will hold context requests for up to the configured number of iterations (or until the token-count threshold) before releasing them.

Because these delays occur at different points in the pipeline, both will be applied sequentially if set > 0 (queue-level ms wait first, then scheduler-level iters wait). Please update the field descriptions in llm_args.py to:
- Call out that they control distinct phases
- Describe the order in which they are evaluated
- Note that setting both introduces two successive batching delays rather than one "winning" over the other

This will ensure users understand how the executor actually behaves.
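Under the description above, the two stages compose roughly like this. Both functions are simplified stand-ins for ExecutorRequestQueue and PyExecutor behavior, not the real implementation; the timing loop is a 1 ms-per-step toy:

```python
def queue_stage(pending, batch_wait_timeout_ms, max_batch_size):
    """Stage 1 (queue level): flush when the batch fills or the ms budget elapses."""
    waited_ms = 0
    while len(pending) < max_batch_size and waited_ms < batch_wait_timeout_ms:
        waited_ms += 1  # stand-in for a real 1 ms poll; no new requests arrive here
    return pending, waited_ms


def scheduler_stage(iters_waited, batch_wait_timeout_iters):
    """Stage 2 (scheduler level): release deferred context once the iter budget is spent."""
    return iters_waited >= batch_wait_timeout_iters


# One lonely request: stage 1 burns its full ms budget waiting for company,
# then stage 2 independently defers context scheduling for its own iterations.
batch, waited = queue_stage(["req0"], batch_wait_timeout_ms=5, max_batch_size=4)
print(waited)                 # 5: timed out waiting for a fuller batch
print(scheduler_stage(0, 2))  # False: still deferring context
print(scheduler_stage(2, 2))  # True: budget spent, schedule context
```

The point of the sketch is the sequencing: a request can pay the millisecond delay and then the iteration delay, so enabling both knobs adds the two waits back to back.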
📒 Files selected for processing (5)
- tensorrt_llm/_torch/pyexecutor/config.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (5 hunks)
- tensorrt_llm/llmapi/llm_args.py (3 hunks)
- tests/integration/defs/accuracy/test_llm_api_pytorch.py (1 hunks)
- tests/unittest/api_stability/references/llm.yaml (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)
- tensorrt_llm/_torch/pyexecutor/config.py
🧰 Additional context used

📓 Path-based instructions (2): the same **/*.py and **/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py} CODING_GUIDELINES.md rules as above, applied to:
- tensorrt_llm/_torch/pyexecutor/py_executor.py
- tensorrt_llm/llmapi/llm_args.py
- tests/integration/defs/accuracy/test_llm_api_pytorch.py
🧠 Learnings (1)

📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.

Applied to files:
- tensorrt_llm/_torch/pyexecutor/py_executor.py
- tensorrt_llm/llmapi/llm_args.py
🧬 Code graph analysis (2)

tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
- tensorrt_llm/_utils.py (1): nvtx_range (843-862)

tests/integration/defs/accuracy/test_llm_api_pytorch.py (5)
- tests/integration/defs/conftest.py (2): parametrize_with_ids (1786-1811), llm_models_root (77-83)
- tensorrt_llm/llmapi/llm_args.py (7): KvCacheConfig (946-1077), TorchCompileConfig (2108-2155), CudaGraphConfig (108-165), MoeConfig (168-196), MTPDecodingConfig (521-556), quant_config (2328-2331), quant_config (2334-2335)
- tensorrt_llm/llmapi/llm.py (1): LLM (1011-1027)
- tensorrt_llm/quantization/mode.py (1): QuantAlgo (23-46)
- tests/integration/defs/accuracy/accuracy_core.py (3): GSM8K (293-308), evaluate (147-206), evaluate (707-717)
🪛 Ruff (0.12.2)

tensorrt_llm/_torch/pyexecutor/py_executor.py
- 1280: Line too long (165 > 120) (E501)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (3)

tests/unittest/api_stability/references/llm.yaml (1)
- 134-141: LGTM: new batch-waiting knobs exposed in the API reference. Defaults and statuses look consistent with the prototype rollout.

tests/integration/defs/accuracy/test_llm_api_pytorch.py (1)
- 1512-1519: Use the parametrized fp8kv instead of hardcoding. The conditional below already respects fp8kv. Removing the hardcoded assignment (per the previous diff) is sufficient; no further change needed.

tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
- 183-195: Ensure max_num_tokens is always positive when ratio-based waiting is enabled. If max_num_tokens were 0/None, the ratio threshold becomes 0, releasing immediately. If that's unintended, clamp or disable ratio-based waiting when max_num_tokens <= 0. Would you like a follow-up patch to guard enable_batch_waiting with self.max_num_tokens > 0?
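The degenerate threshold this comment warns about is easy to demonstrate with plain arithmetic. The `stop_waiting` function below is a stripped-down stand-in for the executor's check, not the real code:

```python
def stop_waiting(num_scheduled_tokens, ratio, max_num_tokens):
    """Toy version of the ratio-based release check; None collapses to 0 here."""
    return num_scheduled_tokens >= ratio * (max_num_tokens or 0)


print(stop_waiting(1, 0.8, 4096))  # False: 1 < 3276.8, keep waiting
print(stop_waiting(1, 0.8, 0))     # True: threshold collapses to 0
print(stop_waiting(1, 0.8, None))  # True: None behaves like 0 in this sketch
```

With a zero or missing max_num_tokens the waiting mechanism silently becomes a no-op rather than an error, which is why the comment suggests clamping or disabling it up front.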
tensorrt_llm/llmapi/llm_args.py (Outdated)

```python
    @model_validator(mode='after')
    def validate_batch_wait_max_tokens_ratio(self) -> 'TorchLlmArgs':
        if self.batch_wait_max_tokens_ratio < 0 or self.batch_wait_max_tokens_ratio > 1:
            raise ValueError(
                "batch_wait_max_tokens_ratio must be greater than or equal to 0 and less than or equal to 1"
            )
        return self
```
🛠️ Refactor suggestion
Require max_num_tokens when token-ratio waiting is enabled
Without an effective max_num_tokens (> 0), ratio-based waiting can’t compute a threshold and will fail or behave incorrectly at runtime.
```diff
 @model_validator(mode='after')
 def validate_batch_wait_max_tokens_ratio(self) -> 'TorchLlmArgs':
     if self.batch_wait_max_tokens_ratio < 0 or self.batch_wait_max_tokens_ratio > 1:
         raise ValueError(
             "batch_wait_max_tokens_ratio must be greater than or equal to 0 and less than or equal to 1"
         )
+    if self.batch_wait_max_tokens_ratio > 0:
+        effective_max_tokens = self.max_num_tokens or (
+            self.build_config.max_num_tokens if self.build_config else None
+        )
+        if effective_max_tokens is None or effective_max_tokens <= 0:
+            raise ValueError(
+                "batch_wait_max_tokens_ratio requires max_num_tokens to be set (> 0). "
+                "Set TorchLlmArgs.max_num_tokens or build_config.max_num_tokens."
+            )
     return self
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
    @model_validator(mode='after')
    def validate_batch_wait_max_tokens_ratio(self) -> 'TorchLlmArgs':
        if self.batch_wait_max_tokens_ratio < 0 or self.batch_wait_max_tokens_ratio > 1:
            raise ValueError(
                "batch_wait_max_tokens_ratio must be greater than or equal to 0 and less than or equal to 1"
            )
        if self.batch_wait_max_tokens_ratio > 0:
            effective_max_tokens = self.max_num_tokens or (
                self.build_config.max_num_tokens if self.build_config else None
            )
            if effective_max_tokens is None or effective_max_tokens <= 0:
                raise ValueError(
                    "batch_wait_max_tokens_ratio requires max_num_tokens to be set (> 0). "
                    "Set TorchLlmArgs.max_num_tokens or build_config.max_num_tokens."
                )
        return self
```
🤖 Prompt for AI Agents

In tensorrt_llm/llmapi/llm_args.py around lines 2529 to 2536, the validator only checks that batch_wait_max_tokens_ratio is in [0, 1]; add a check that if batch_wait_max_tokens_ratio > 0 then self.max_num_tokens must be > 0 and raise a ValueError with a clear message if not; keep the existing ratio bounds validation and return self as before.
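The suggested validation can be sketched as a plain function, so it runs standalone without pydantic; it is a stand-in for the model_validator above, with parameter names mirroring the TorchLlmArgs fields:

```python
def validate(batch_wait_max_tokens_ratio, max_num_tokens=None,
             build_config_max_num_tokens=None):
    """Stand-in for validate_batch_wait_max_tokens_ratio; not the real validator."""
    # Existing bounds check: ratio must stay in [0, 1].
    if not 0 <= batch_wait_max_tokens_ratio <= 1:
        raise ValueError("batch_wait_max_tokens_ratio must be in [0, 1]")
    # Added check: a non-zero ratio needs an effective max_num_tokens > 0.
    if batch_wait_max_tokens_ratio > 0:
        effective = max_num_tokens or build_config_max_num_tokens
        if effective is None or effective <= 0:
            raise ValueError(
                "batch_wait_max_tokens_ratio requires max_num_tokens > 0")
    return True


print(validate(0.0))                       # True: ratio disabled, no token limit needed
print(validate(0.5, max_num_tokens=8192))  # True: ratio enabled with a usable limit
try:
    validate(0.5)                          # ratio enabled but no max_num_tokens anywhere
except ValueError as exc:
    print("rejected:", exc)
```

Running it against the model_validator's intent shows the new failure mode: a ratio of 0 passes with no token limit, but any positive ratio without an effective max_num_tokens is rejected at construction time instead of misbehaving in the scheduler.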
f58f5a6
to
f6778f2
Compare
/bot run |
PR_Github #16692 [ run ] completed with state |
PR_Github #16696 [ run ] triggered by Bot |
PR_Github #16696 [ run ] completed with state |
f6778f2
to
a4ea57f
Compare
/bot run |
a4ea57f
to
42002a7
Compare
PR_Github #16769 [ run ] triggered by Bot |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
183-195
: Harden config defaults to avoid TypeError and make enable_batch_waiting robust
If any of these fields are None in pytorch_backend_config, comparisons like > 0 raise. Cast with safe fallbacks.
- self.max_num_tokens = model_engine.pytorch_backend_config.max_num_tokens + self.max_num_tokens = int(getattr( + model_engine.pytorch_backend_config, "max_num_tokens", 0) or 0) @@ - self.batch_wait_timeout_iters = model_engine.pytorch_backend_config.batch_wait_timeout_iters - self.batch_wait_max_tokens_ratio = model_engine.pytorch_backend_config.batch_wait_max_tokens_ratio - self.enable_batch_waiting = self.batch_wait_timeout_iters > 0 or self.batch_wait_max_tokens_ratio > 0 + self.batch_wait_timeout_iters = int(getattr( + model_engine.pytorch_backend_config, "batch_wait_timeout_iters", 0) or 0) + self.batch_wait_max_tokens_ratio = float(getattr( + model_engine.pytorch_backend_config, "batch_wait_max_tokens_ratio", 0.0) or 0.0) + self.enable_batch_waiting = ( + self.batch_wait_timeout_iters > 0 or self.batch_wait_max_tokens_ratio > 0.0 + )
♻️ Duplicate comments (3)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
1266-1286
: Fix token accounting; count per-iteration costs (ctx chunk size, ~1 token/gen); also guard when max_num_tokens <= 0
Using len(req.get_tokens(0)) counts history and breaks the threshold; it can cause premature stop or no-op waiting. Also break long lines (E501).
- def _waiting_requests(self, context_requests: list[LlmRequest], - generation_requests: list[LlmRequest]): + def _waiting_requests(self, context_requests: list[LlmRequest], + generation_requests: list[LlmRequest]): if not self.enable_batch_waiting: return context_requests waited_context_requests = [] - stop_waiting = False - num_scheduled_ctx_tokens = sum( - len(ctx_req.get_tokens(0)) for ctx_req in context_requests) - num_scheduled_gen_tokens = sum( - len(gen_req.get_tokens(0)) for gen_req in generation_requests) - num_scheduled_tokens = num_scheduled_ctx_tokens + num_scheduled_gen_tokens - - stop_waiting = self.batch_wait_iters_count >= self.batch_wait_timeout_iters or num_scheduled_tokens >= self.batch_wait_max_tokens_ratio * self.max_num_tokens + # Per-iteration token costs: + num_scheduled_ctx_tokens = sum( + req.context_chunk_size for req in context_requests + ) + # Approximate 1 token per generation request this iteration + num_scheduled_gen_tokens = len(generation_requests) + num_scheduled_tokens = ( + num_scheduled_ctx_tokens + num_scheduled_gen_tokens + ) + # Ratio threshold only meaningful if both ratio > 0 and max_num_tokens > 0 + ratio_threshold = ( + int(self.batch_wait_max_tokens_ratio * self.max_num_tokens) + if (self.batch_wait_max_tokens_ratio > 0.0 and + self.max_num_tokens > 0) + else float("inf") + ) + stop_waiting = ( + self.batch_wait_iters_count >= self.batch_wait_timeout_iters + or num_scheduled_tokens >= ratio_threshold + ) if stop_waiting: waited_context_requests = context_requests self.batch_wait_iters_count = 0 else: self.batch_wait_iters_count += 1 return waited_context_requeststensorrt_llm/llmapi/llm_args.py (2)
2529-2536
: Require max_num_tokens when ratio-based waiting is enabled
Ratio-based waiting needs a concrete threshold; enforce effective max_num_tokens (> 0). This prevents runtime issues in the executor.
Apply:
@model_validator(mode='after') def validate_batch_wait_max_tokens_ratio(self) -> 'TorchLlmArgs': if self.batch_wait_max_tokens_ratio < 0 or self.batch_wait_max_tokens_ratio > 1: raise ValueError( "batch_wait_max_tokens_ratio must be greater than or equal to 0 and less than or equal to 1" ) + if self.batch_wait_max_tokens_ratio > 0: + effective_max_tokens = self.max_num_tokens or ( + self.build_config.max_num_tokens if self.build_config else None + ) + if effective_max_tokens is None or effective_max_tokens <= 0: + raise ValueError( + "batch_wait_max_tokens_ratio requires max_num_tokens to be set (> 0). " + "Set TorchLlmArgs.max_num_tokens or build_config.max_num_tokens." + ) return self
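The suggested check can be illustrated outside pydantic as a plain function (a hypothetical sketch; the field names follow the diff above, but the real validator lives on `TorchLlmArgs`):

```python
from typing import Optional


def check_batch_wait_max_tokens_ratio(ratio: float,
                                      max_num_tokens: Optional[int]) -> None:
    """Raise ValueError when the ratio is out of range or lacks a token budget."""
    if ratio < 0 or ratio > 1:
        raise ValueError("batch_wait_max_tokens_ratio must be in [0, 1]")
    # A positive ratio is meaningless without a concrete max_num_tokens.
    if ratio > 0 and (max_num_tokens is None or max_num_tokens <= 0):
        raise ValueError(
            "batch_wait_max_tokens_ratio requires max_num_tokens to be set (> 0)")
```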
2612-2615
: Propagate a non-None max_num_tokens to PyTorchConfig
Backend needs a concrete threshold when ratio-based waiting is used; pass max_num_tokens with a build_config fallback.
Apply:
attention_dp_batching_wait_iters=self.attention_dp_config. batching_wait_iters if self.attention_dp_config is not None else AttentionDpConfig.model_fields['batching_wait_iters'].default, + max_num_tokens=self.max_num_tokens + if self.max_num_tokens is not None + else self.build_config.max_num_tokens, batch_wait_timeout_ms=self.batch_wait_timeout_ms, batch_wait_timeout_iters=self.batch_wait_timeout_iters, batch_wait_max_tokens_ratio=self.batch_wait_max_tokens_ratio,Run to verify PyTorchConfig signature and types:
#!/bin/bash # Verify PyTorchConfig fields exist and types are as expected. rg -nP 'class\s+PyTorchConfig\b' -A120 tensorrt_llm/_torch/pyexecutor/config.py rg -nP '\bmax_num_tokens\b|\bbatch_wait_timeout_iters\b|\bbatch_wait_max_tokens_ratio\b' tensorrt_llm/_torch/pyexecutor/config.py
🧹 Nitpick comments (4)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (1)
1486-1527
: Ensure ratio-based waiting is actually exercised: set max_num_tokens when batch_wait_max_tokens_ratio > 0
If max_num_tokens defaults to 0/None, the condition tokens >= ratio * max_num_tokens collapses to tokens >= 0 and stops waiting immediately; the test won't meaningfully cover ratio behavior. Guard by setting a sane max_num_tokens only when ratio > 0, and (optionally) assert the backend picked up the values.
@@ - with LLM(f"{llm_models_root()}/DeepSeek-V3-Lite/nvfp4_moe_only_mtp", + llm_kwargs = {} + if batch_wait_max_tokens_ratio > 0: + # Keep moderate to avoid OOM while making the ratio threshold meaningful. + llm_kwargs["max_num_tokens"] = 16384 + with LLM(f"{llm_models_root()}/DeepSeek-V3-Lite/nvfp4_moe_only_mtp", kv_cache_config=kv_cache_config, **pytorch_config, enable_attention_dp=False, - speculative_config=mtp_config) as llm: + speculative_config=mtp_config, + **llm_kwargs) as llm: assert llm.args.quant_config.quant_algo == QuantAlgo.NVFP4 + # Optional: verify wiring + assert llm.args.pytorch_backend_config.batch_wait_timeout_iters == batch_wait_timeout_iters + assert llm.args.pytorch_backend_config.batch_wait_max_tokens_ratio == batch_wait_max_tokens_ratiotensorrt_llm/_torch/pyexecutor/py_executor.py (1)
1297-1304
: Gate waiting only when both ctx and gen exist: OK, but reference local scheduled_context_requests for clarity
Logic is correct; minor readability improvement below.
- if not self.enable_attention_dp and self.enable_batch_waiting and len( - scheduler_output.context_requests) > 0 and len( - scheduler_output.generation_requests) > 0: - scheduled_context_requests = self._waiting_requests( - scheduler_output.context_requests, - scheduler_output.generation_requests) + if (not self.enable_attention_dp and self.enable_batch_waiting + and scheduler_output.context_requests + and scheduler_output.generation_requests): + scheduled_context_requests = self._waiting_requests( + scheduler_output.context_requests, + scheduler_output.generation_requests + )tensorrt_llm/llmapi/llm_args.py (2)
2249-2254
: Nit: fix spacing/punctuation in description
Minor grammar/spacing for clarity.
Apply:
- "Token accumulation threshold ratio for batch scheduling optimization. If greater than 0, the scheduler will accumulate requests locally until the total token count reaches batch_wait_max_tokens_ratio * max_num_tokens. This mechanism enhances GPU utilization efficiency by ensuring adequate batch sizes.If 0 disables token-based batching delays.", + "Token accumulation threshold ratio for batch scheduling optimization. If > 0, the scheduler will accumulate requests locally until the total token count reaches batch_wait_max_tokens_ratio * max_num_tokens. This mechanism enhances GPU utilization by ensuring adequate batch sizes. If 0, disables token-based batching delays.",
2523-2528
: Validator message should say '>= 0'; optionally warn if both ms- and iter-based waits are enabled
- Error text should match the logic (0 is allowed).
- Optional: warn if both knobs are > 0 to avoid compounded delays.
Apply:
@model_validator(mode='after') def validate_batch_wait_timeout_iters(self) -> 'TorchLlmArgs': - if self.batch_wait_timeout_iters < 0: - raise ValueError("batch_wait_timeout_iters must be greater than 0") + if self.batch_wait_timeout_iters < 0: + raise ValueError("batch_wait_timeout_iters must be greater than or equal to 0") + # Guardrail: avoid unintentionally enabling two waiting mechanisms. + if self.batch_wait_timeout_ms > 0 and self.batch_wait_timeout_iters > 0: + logger.warning( + "Both batch_wait_timeout_ms and batch_wait_timeout_iters are > 0; this may introduce compounded delays. " + "Consider enabling only one waiting mechanism.") return self
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (5)
tensorrt_llm/_torch/pyexecutor/config.py
(1 hunks)tensorrt_llm/_torch/pyexecutor/py_executor.py
(5 hunks)tensorrt_llm/llmapi/llm_args.py
(3 hunks)tests/integration/defs/accuracy/test_llm_api_pytorch.py
(1 hunks)tests/unittest/api_stability/references/llm.yaml
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- tests/unittest/api_stability/references/llm.yaml
- tensorrt_llm/_torch/pyexecutor/config.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures
Files:
tensorrt_llm/llmapi/llm_args.py
tests/integration/defs/accuracy/test_llm_api_pytorch.py
tensorrt_llm/_torch/pyexecutor/py_executor.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)
Files:
tensorrt_llm/llmapi/llm_args.py
tests/integration/defs/accuracy/test_llm_api_pytorch.py
tensorrt_llm/_torch/pyexecutor/py_executor.py
🧠 Learnings (2)
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Applied to files:
tensorrt_llm/llmapi/llm_args.py
tensorrt_llm/_torch/pyexecutor/py_executor.py
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/integration/defs/accuracy/test_llm_api_pytorch.py
🧬 Code graph analysis (2)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (4)
tests/integration/defs/conftest.py (2)
parametrize_with_ids
(1786-1811)llm_models_root
(77-83)tensorrt_llm/llmapi/llm_args.py (7)
KvCacheConfig
(946-1077)TorchCompileConfig
(2108-2155)CudaGraphConfig
(108-165)MoeConfig
(168-196)MTPDecodingConfig
(521-556)quant_config
(2328-2331)quant_config
(2334-2335)tensorrt_llm/quantization/mode.py (1)
QuantAlgo
(23-46)tests/integration/defs/accuracy/accuracy_core.py (3)
GSM8K
(293-308)evaluate
(147-206)evaluate
(707-717)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
tensorrt_llm/_torch/pyexecutor/llm_request.py (1)
LlmRequest
(282-424)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/py_executor.py
1279-1279: Line too long (165 > 120)
(E501)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
240-240
: Good: initialize batch_wait_iters_count state
The counter is required to enforce iteration-based timeouts.
PR_Github #16925 [ run ] triggered by Bot |
PR_Github #16926 [ run ] triggered by Bot |
PR_Github #16925 [ run ] completed with state |
PR_Github #16927 [ run ] triggered by Bot |
PR_Github #16926 [ run ] completed with state |
/bot run |
PR_Github #16933 [ run ] triggered by Bot |
PR_Github #16927 [ run ] completed with state |
PR_Github #16933 [ run ] completed with state |
/bot run |
PR_Github #16937 [ run ] triggered by Bot |
PR_Github #16937 [ run ] completed with state |
/bot run |
PR_Github #16945 [ run ] triggered by Bot |
PR_Github #16945 [ run ] completed with state |
/bot run --disable-fail-fast |
PR_Github #16947 [ run ] triggered by Bot |
PR_Github #16947 [ run ] completed with state |
/bot run --disable-fail-fast |
PR_Github #16949 [ run ] triggered by Bot |
Signed-off-by: yunruis <[email protected]>
0cf28c0
to
c3feac4
Compare
/bot run --disable-fail-fast |
PR_Github #16955 [ run ] triggered by Bot |
PR_Github #16949 [ run ] completed with state |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
183-195
: Sanitize config and compute enable flag robustly
Guard against invalid values and clamp ratio to [0.0, 1.0]. This avoids accidental "always release" when ratio=0 and protects against negatives.
Apply:
- self.max_num_tokens = model_engine.pytorch_backend_config.max_num_tokens + self.max_num_tokens = max( + 1, int(model_engine.pytorch_backend_config.max_num_tokens) + ) self.print_log = model_engine.pytorch_backend_config.print_iter_log @@ - self.batch_wait_timeout_iters = model_engine.pytorch_backend_config.batch_wait_timeout_iters - self.batch_wait_max_tokens_ratio = model_engine.pytorch_backend_config.batch_wait_max_tokens_ratio - self.enable_batch_waiting = self.batch_wait_timeout_iters > 0 or self.batch_wait_max_tokens_ratio > 0 + _timeout = model_engine.pytorch_backend_config.batch_wait_timeout_iters + _ratio = model_engine.pytorch_backend_config.batch_wait_max_tokens_ratio + self.batch_wait_timeout_iters = max(0, int(_timeout)) + self.batch_wait_max_tokens_ratio = max(0.0, min(1.0, float(_ratio))) + self.enable_batch_waiting = ( + self.batch_wait_timeout_iters > 0 + or self.batch_wait_max_tokens_ratio > 0.0 + )
♻️ Duplicate comments (3)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
1275-1295
: Fix token accounting, zero-threshold semantics, and long line (E501)
Current logic overcounts tokens using full history and treats ratio=0 as "always release," nullifying timeout-only waiting. Replace with per-iteration costs and guard zero thresholds.
Apply:
- def _waiting_requests(self, context_requests: list[LlmRequest], - generation_requests: list[LlmRequest]): + def _waiting_requests(self, context_requests: list[LlmRequest], + generation_requests: list[LlmRequest]): if not self.enable_batch_waiting: return context_requests waited_context_requests = [] - stop_waiting = False - num_scheduled_ctx_tokens = sum( - len(ctx_req.get_tokens(0)) for ctx_req in context_requests) - num_scheduled_gen_tokens = sum( - len(gen_req.get_tokens(0)) for gen_req in generation_requests) - num_scheduled_tokens = num_scheduled_ctx_tokens + num_scheduled_gen_tokens - - stop_waiting = self.batch_wait_iters_count >= self.batch_wait_timeout_iters or num_scheduled_tokens >= self.batch_wait_max_tokens_ratio * self.max_num_tokens + # Per-iteration costs: context consumes current chunk size; generation ~1 token per req + num_scheduled_ctx_tokens = sum(req.context_chunk_size + for req in context_requests) + num_scheduled_gen_tokens = len(generation_requests) + num_scheduled_tokens = (num_scheduled_ctx_tokens + + num_scheduled_gen_tokens) + # Only apply a check when the corresponding threshold is enabled (> 0) + ratio_enabled = self.batch_wait_max_tokens_ratio > 0.0 + timeout_enabled = self.batch_wait_timeout_iters > 0 + ratio_threshold = int(self.batch_wait_max_tokens_ratio + * self.max_num_tokens) if ratio_enabled else 0 + stop_waiting = ( + (timeout_enabled + and self.batch_wait_iters_count >= self.batch_wait_timeout_iters) + or (ratio_enabled and num_scheduled_tokens >= ratio_threshold) + ) if stop_waiting: waited_context_requests = context_requests self.batch_wait_iters_count = 0 else: self.batch_wait_iters_count += 1 return waited_context_requeststensorrt_llm/llmapi/llm_args.py (2)
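The difference between the two accounting schemes can be made concrete with simplified stand-in request objects (hypothetical sketch; `context_chunk_size` and the full-history token list mirror the fields referenced in the diff above):

```python
class FakeReq:
    """Simplified stand-in for LlmRequest with just the fields we need."""

    def __init__(self, history_len: int, chunk: int):
        self._history = list(range(history_len))  # all tokens seen so far
        self.context_chunk_size = chunk           # tokens scheduled this iteration

    def get_tokens(self, beam: int):
        return self._history


ctx = [FakeReq(history_len=4096, chunk=512)]
gen = [FakeReq(history_len=900, chunk=0) for _ in range(8)]

# History-based counting (original code): inflated by everything already processed.
history_tokens = sum(len(r.get_tokens(0)) for r in ctx + gen)   # 4096 + 8*900 = 11296

# Per-iteration counting (suggested fix): chunk per context req, ~1 per gen req.
iter_tokens = sum(r.context_chunk_size for r in ctx) + len(gen)  # 512 + 8 = 520
```

Against a typical max_num_tokens budget, the history-based count trips the ratio threshold almost immediately, which is why the waiting mechanism degenerates to a no-op without this fix.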
2529-2535
: Require effective max_num_tokens when ratio-based waiting is enabled.
Without a concrete max_num_tokens, the token-ratio threshold is undefined and will misbehave at runtime.
def validate_batch_wait_max_tokens_ratio(self) -> 'TorchLlmArgs': - if self.batch_wait_max_tokens_ratio < 0 or self.batch_wait_max_tokens_ratio > 1: - raise ValueError( - "batch_wait_max_tokens_ratio must be between 0 and 1") + if self.batch_wait_max_tokens_ratio < 0 or self.batch_wait_max_tokens_ratio > 1: + raise ValueError("batch_wait_max_tokens_ratio must be in [0, 1]") + + if self.batch_wait_max_tokens_ratio > 0: + effective_max_tokens = self.max_num_tokens or ( + self.build_config.max_num_tokens if self.build_config else None + ) + if effective_max_tokens is None or effective_max_tokens <= 0: + raise ValueError( + "batch_wait_max_tokens_ratio requires max_num_tokens to be set (> 0). " + "Set TorchLlmArgs.max_num_tokens or build_config.max_num_tokens." + ) return self
2611-2614
: Propagate a non-None max_num_tokens to PyTorchConfig.Ensure backend gets a concrete threshold for ratio-based waiting; fall back to build_config when needed.
return PyTorchConfig( extra_resource_managers=self.extra_resource_managers, @@ mamba_ssm_cache_dtype=self.kv_cache_config.mamba_ssm_cache_dtype, + max_num_tokens=self.max_num_tokens + if self.max_num_tokens is not None + else self.build_config.max_num_tokens, enable_iter_perf_stats=self.enable_iter_perf_stats,
🧹 Nitpick comments (4)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (1)
1519-1523
: Make the ratio-threshold path deterministic
Without bounding max_num_tokens, the 0.75 ratio may rarely trigger before timeout in CI-sized runs. Recommend lowering max_num_tokens to force the ratio gate to engage reliably.
Apply:
with LLM(f"{llm_models_root()}/DeepSeek-V3-Lite/nvfp4_moe_only_mtp", kv_cache_config=kv_cache_config, + max_num_tokens=512, **pytorch_config, enable_attention_dp=False, speculative_config=mtp_config) as llm:
tensorrt_llm/llmapi/llm_args.py (3)
2243-2247
: Wording: "new coming" → "incoming" (minor doc polish).
Tighten the field description for clarity.
- "Maximum number of iterations the scheduler will wait to accumulate new coming requests for improved GPU utilization efficiency. If greater than 0, the scheduler will delay batch processing to gather more requests up to the specified iteration limit. If 0, disables timeout-iters-based batching delays.", + "Maximum number of iterations the scheduler will wait to accumulate incoming requests for improved GPU utilization. If > 0, the scheduler may delay batch processing to gather more requests up to this iteration limit. If 0, disables iter-based batching delays.",
2249-2254
: Fix typo and tighten phrasing in description.
Add the missing space after the sentence and simplify wording.
- "Token accumulation threshold ratio for batch scheduling optimization. If greater than 0, the scheduler will accumulate requests locally until the total token count reaches batch_wait_max_tokens_ratio * max_num_tokens. This mechanism enhances GPU utilization efficiency by ensuring adequate batch sizes.If 0 disables token-based batching delays.", + "Token-accumulation threshold ratio for batch scheduling. If > 0, the scheduler accumulates requests until total scheduled tokens reach batch_wait_max_tokens_ratio * max_num_tokens. This helps ensure adequate batch sizes. If 0, disables token-based batching delays.",
2523-2528
: Validator message conflicts with allowed value 0; make it ">= 0".
The check allows 0, but the message says "> 0".
- if self.batch_wait_timeout_iters < 0: - raise ValueError("batch_wait_timeout_iters must be greater than 0") + if self.batch_wait_timeout_iters < 0: + raise ValueError("batch_wait_timeout_iters must be >= 0")
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (7)
tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
(1 hunks)tensorrt_llm/_torch/pyexecutor/config.py
(1 hunks)tensorrt_llm/_torch/pyexecutor/py_executor.py
(5 hunks)tensorrt_llm/llmapi/llm_args.py
(3 hunks)tests/integration/defs/accuracy/test_llm_api_pytorch.py
(1 hunks)tests/integration/test_lists/qa/llm_function_full.txt
(1 hunks)tests/unittest/api_stability/references/llm.yaml
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
- tensorrt_llm/_torch/pyexecutor/config.py
- tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
- tests/unittest/api_stability/references/llm.yaml
- tests/integration/test_lists/qa/llm_function_full.txt
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{cpp,cc,cxx,cu,py,h,hpp,hh,hxx,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use spaces only; indent 4 spaces
Files:
tests/integration/defs/accuracy/test_llm_api_pytorch.py
tensorrt_llm/llmapi/llm_args.py
tensorrt_llm/_torch/pyexecutor/py_executor.py
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+
Indent with 4 spaces; do not use tabs
Preserve module namespaces in imports; prefer from package.subpackage import module then module.Symbol
Python file names use snake_case (e.g., some_file.py)
Class names use PascalCase
Function and method names use snake_case
Local variable names use snake_case; if starting with a number, prefix k (e.g., k_99th_percentile)
Global variables use G_ prefix and UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE
Avoid shadowing variables from an outer scope
Initialize all externally visible members of a class in init
For interfaces used outside a file, prefer docstrings; reserve comments for internal code or local interfaces
Use Google-style docstrings for classes and functions; document attributes/variables inline as shown
Avoid reflection when simple, explicit code suffices (e.g., prefer def make_complex(x,y) over locals()/dict tricks)
Catch the narrowest exceptions possible in try/except
For duck-typing try/except, keep try body minimal and use else for main logic
Files:
tests/integration/defs/accuracy/test_llm_api_pytorch.py
tensorrt_llm/llmapi/llm_args.py
tensorrt_llm/_torch/pyexecutor/py_executor.py
**/*.{cpp,cc,cxx,cu,h,hpp,hh,hxx,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)
Files:
tests/integration/defs/accuracy/test_llm_api_pytorch.py
tensorrt_llm/llmapi/llm_args.py
tensorrt_llm/_torch/pyexecutor/py_executor.py
🧠 Learnings (1)
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Applied to files:
tensorrt_llm/llmapi/llm_args.py
tensorrt_llm/_torch/pyexecutor/py_executor.py
🧬 Code graph analysis (2)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (5)
tests/integration/defs/conftest.py (2)
parametrize_with_ids
(1786-1811)llm_models_root
(77-83)tensorrt_llm/llmapi/llm_args.py (7)
KvCacheConfig
(946-1077)TorchCompileConfig
(2108-2155)CudaGraphConfig
(108-165)MoeConfig
(168-196)MTPDecodingConfig
(521-556)quant_config
(2328-2331)quant_config
(2334-2335)tensorrt_llm/llmapi/llm.py (1)
LLM
(1011-1027)tensorrt_llm/quantization/mode.py (1)
QuantAlgo
(23-46)tests/integration/defs/accuracy/accuracy_core.py (3)
GSM8K
(293-308)evaluate
(147-206)evaluate
(707-717)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
tensorrt_llm/_torch/pyexecutor/llm_request.py (1)
LlmRequest
(282-424)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/py_executor.py
1288-1288: Line too long (165 > 120)
(E501)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (3)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (1)
1486-1513
: Batch-waiting test matrix looks solid
Good coverage of timeout-only, ratio-only, and combined cases; disabling attention_dp ensures the new path is exercised.
tensorrt_llm/_torch/pyexecutor/py_executor.py (2)
240-240
: LGTM: initialize waiting-iteration counter
The counter is correctly initialized.
1306-1313
: Release guard to avoid dead-wait is correct
Conditionally applying waiting only when there are both ctx and gen requests avoids stalls.
PR_Github #16955 [ run ] completed with state |
/bot run --disable-fail-fast |
PR_Github #16980 [ run ] triggered by Bot |
Summary by CodeRabbit
New Features
Bug Fixes / Validation
Tests
Description
To enable the feature, add the parameters to config.yaml. For example:
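A hypothetical snippet (the field names come from this PR; the values are illustrative only):

```yaml
# Enable batch waiting: release deferred context requests after at most
# 4 scheduler iterations, or earlier once scheduled tokens reach
# 0.75 * max_num_tokens. Set either knob to 0 to disable that condition.
batch_wait_timeout_iters: 4
batch_wait_max_tokens_ratio: 0.75
max_num_tokens: 8192   # required when batch_wait_max_tokens_ratio > 0
```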
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run
/bot [-h|--help]
to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id
(OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
--disable-reuse-test
(OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
--disable-fail-fast
(OPTIONAL) : Disable fail fast on build/tests/infra failures.
--skip-test
(OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
--stage-list "A10-PyTorch-1, xxx"
(OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
--gpu-type "A30, H100_PCIe"
(OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
--test-backend "pytorch, cpp"
(OPTIONAL) : Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
--only-multi-gpu-test
(OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
--disable-multi-gpu-test
(OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
--add-multi-gpu-test
(OPTIONAL) : Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.
--post-merge
(OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"
(OPTIONAL) : Run the ordinary L0 pre-merge pipeline and the specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
--detailed-log
(OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
--debug
(OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.
For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.
kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request.
--comment "Reason for skipping build/test"
is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.