Conversation

liji-nv (Collaborator) commented Aug 8, 2025

Summary by CodeRabbit

  • New Features

    • Padded-sequence support added across attention and MLA with explicit output-buffer handling and zeroing of padded tails.
    • Per-request piecewise CUDA graph control and expanded configurable piecewise batch sizes; warmup now builds granular requests.
  • Performance

    • More consistent CUDA-graph captures and execution for padded/batched workloads, improving inference efficiency.
  • Bug Fixes

    • Fixed shape/alignment issues in MTP/speculative paths when padding is present.
  • Tests

    • Tests updated to enable piecewise CUDA graphs under attention DP and tighten custom-op coverage.

Description

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is given. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

coderabbitai bot (Contributor) commented Aug 8, 2025

📝 Walkthrough

Adds per-request piecewise CUDA-graph controls and padding-aware attention/MLA handling, marks a custom op argument as in-place alias in TorchScript, updates warmup/request construction and attention-DP padding propagation, and adjusts tests to reflect piecewise behavior.

Changes

  • Custom op schema & tests (cpp/tensorrt_llm/thop/mlaPreprocessOp.cpp, tests/unittest/_torch/test_custom_ops.py): Marked merged_attn as aliasing/in-place in the TorchScript schema (Tensor(a!)) for merge_chunked_attention_for_mla. Removed the fake-op waiver so the op is now checked in tests.
  • Attention metadata and modules, padding support (tensorrt_llm/_torch/attention_backend/interface.py, tensorrt_llm/_torch/modules/attention.py): Added padded_num_tokens to AttentionMetadata. Modules now handle padded sequences by slicing inputs/outputs to num_tokens, writing into explicit output buffers, and zeroing padded tails. Multiple MLA method signatures updated to accept a non-optional output and adjusted parameter ordering.
  • Piecewise CUDA-graph controls & warmup (tensorrt_llm/_torch/utils.py, tensorrt_llm/_torch/compilation/piecewise_optimizer.py, tensorrt_llm/_torch/pyexecutor/model_engine.py, tests/integration/defs/accuracy/test_llm_api_pytorch.py): Added per-request piecewise flag getters/setters. PiecewiseRunner now gates on both the global and per-request flags. ModelEngine: introduced piecewise_batch_sizes and a new warmup request builder, propagates the per-request flag, pads inputs for attention-DP to captured sizes, and stores padded_num_tokens. Tests changed to enable piecewise based only on cuda_graph.
  • Models, MTP / speculative adjustments for padding (tensorrt_llm/_torch/models/modeling_deepseekv3.py, tensorrt_llm/_torch/models/modeling_speculative.py): When padded_num_tokens is set, crop hidden_states to num_tokens and pass cropped input_ids/position_ids to MTP/spec workers to keep shapes consistent.
  • Piecewise optimizer import change (tensorrt_llm/_torch/compilation/piecewise_optimizer.py): Imported get_per_request_piecewise_cuda_graph_flag and added it to the early-return gating logic in PiecewiseRunner.__call__.
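The per-request gating described above can be sketched as follows. This is a minimal stand-in, not the real classes: it assumes a thread-local flag store and a default of True when the flag was never set, matching the utils changes summarized here; `PiecewiseRunnerSketch` and its constructor arguments are hypothetical illustrations.

```python
import threading

# Thread-local store standing in for the module's global attributes
# (assumption: the flag defaults to True when unset for a thread).
_global_attrs = threading.local()


def set_per_request_piecewise_cuda_graph_flag(enable: bool):
    """Enable or disable piecewise CUDA graphs for the current request."""
    _global_attrs.per_request_piecewise_cuda_graph_flag = enable


def get_per_request_piecewise_cuda_graph_flag() -> bool:
    """Return the per-request flag, defaulting to True if never set."""
    return getattr(_global_attrs, "per_request_piecewise_cuda_graph_flag", True)


class PiecewiseRunnerSketch:
    """Hypothetical runner showing the early-return gating in __call__."""

    def __init__(self, captured_sizes, run_graph, run_eager, global_flag=True):
        self.captured_sizes = set(captured_sizes)
        self.run_graph = run_graph
        self.run_eager = run_eager
        self.global_flag = global_flag

    def __call__(self, num_tokens):
        # Fall back to the default callable unless the token count was
        # captured AND both the global and per-request flags allow replay.
        if (num_tokens not in self.captured_sizes
                or not self.global_flag
                or not get_per_request_piecewise_cuda_graph_flag()):
            return self.run_eager(num_tokens)
        return self.run_graph(num_tokens)
```

For example, a runner capturing sizes {8, 16} replays the graph for 8 tokens, but falls back to eager for an uncaptured size or when the per-request flag is disabled.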

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant ModelEngine
  participant Backend
  participant PiecewiseRunner

  Client->>ModelEngine: request(inputs, attn_metadata)
  ModelEngine->>ModelEngine: _prepare_tp_inputs()
  alt Attention DP enabled
    ModelEngine->>ModelEngine: check per-rank eligibility
    alt All ranks eligible
      ModelEngine->>ModelEngine: pad tokens to captured batch size
      ModelEngine->>ModelEngine: set_per_request_piecewise_cuda_graph_flag(True)
    else
      ModelEngine->>ModelEngine: set_per_request_piecewise_cuda_graph_flag(False)
    end
  end
  ModelEngine->>Backend: run attention (may include padded_num_tokens)
  Backend->>PiecewiseRunner: __call__(runtime_num_of_token)
  PiecewiseRunner->>PiecewiseRunner: check entries && global && per-request flags
  alt All checks pass
    PiecewiseRunner-->>Backend: run captured graph
  else Fallback
    PiecewiseRunner-->>Backend: run default callable
  end
  Backend-->>Client: outputs
sequenceDiagram
  participant Module as Attention/MLA
  participant Backend

  Module->>Module: see attn_metadata.padded_num_tokens
  alt padded_num_tokens set
    Module->>Module: assert padded shapes, slice q/k/v/hidden_states to num_tokens
  end
  Module->>Backend: call attention backend (output buffer possibly sliced)
  Backend-->>Module: fill output buffer
  alt padded_num_tokens set
    Module->>Module: zero out tail beyond num_tokens in outputs
  end
  Module-->>Caller: final output
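The slice-then-zero pattern in the diagram can be sketched as below; `attn_fn` is a hypothetical stand-in for the attention backend, assumed to write into a caller-provided output buffer as the PR describes.

```python
import torch


def attention_with_padding(hidden_states: torch.Tensor, num_tokens: int,
                           attn_fn) -> torch.Tensor:
    """Sketch of padding-aware attention: slice to real tokens, zero the tail.

    `hidden_states` is padded to the captured CUDA-graph size
    (padded_num_tokens rows); only the first `num_tokens` rows are real.
    """
    padded_num_tokens = hidden_states.shape[0]
    assert num_tokens <= padded_num_tokens
    output = torch.empty_like(hidden_states)
    # Run attention only over the real tokens, writing into a sliced view
    # of the explicit output buffer.
    attn_fn(hidden_states[:num_tokens], output[:num_tokens])
    # Zero the padded tail so downstream consumers see deterministic values.
    output[num_tokens:].zero_()
    return output
```

The key point is that the output buffer keeps the padded shape expected by the captured graph, while the padded rows are guaranteed to be zero rather than uninitialized memory.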

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

SW Architecture

Suggested reviewers

  • shaharmor98
  • HuiGao-NV
  • nv-guomingz
  • pcastonguay


liji-nv (Collaborator, Author) commented Aug 8, 2025

/bot run

tensorrt-cicd (Collaborator)
PR_Github #14601 [ run ] triggered by Bot

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 6

🔭 Outside diff range comments (9)
cpp/tensorrt_llm/thop/mlaPreprocessOp.cpp (3)

749-761: Alias annotation incomplete: merged_softmax_stats is also mutated in-place

mergeChunkedAttentionForMLA writes into merged_softmax_stats. Only merged_attn is annotated as in-place, which can confuse alias analysis and schedulers.

Annotate merged_softmax_stats as an in-place alias too:

 m.def(
   "merge_chunked_attention_for_mla("
  "Tensor(a!) merged_attn"
   ", Tensor temp_attn"
-  ", Tensor merged_softmax_stats"
+  ", Tensor(b!) merged_softmax_stats"
   ", Tensor temp_softmax_stats"
   ", int num_requests"
   ", Tensor cu_q_seq_lens"
   ", int max_q_seq_len"
   ", Tensor merge_op"
   ", int num_heads"
   ", int head_size"
   ") -> ()");

670-683: Also mark output as in-place for set_*_kv_cache_for_mla ops

Both set_paged_kv_cache_for_mla and set_chunked_kv_cache_for_mla mutate their output tensor, yet the TS schema doesn’t annotate them as aliasing. This can lead to incorrect graph transformations.

Apply alias annotations:

 m.def(
   "set_paged_kv_cache_for_mla("
-  "Tensor output"
+  "Tensor(a!) output"
   ", Tensor k"
   ", Tensor v"
   ", Tensor k_pe"
   ", int num_requests"
   ", Tensor cu_seq_lens"
   ", int max_input_seq_len"
   ", int num_heads"
   ", int kv_dim"
   ", int rope_dim"
   ", int kv_cache_tokens_per_block"
   ") -> Tensor");

 m.def(
   "set_chunked_kv_cache_for_mla("
-  "Tensor output"
+  "Tensor(a!) output"
   ", Tensor kv"
   ", Tensor k_pe"
   ", int num_requests"
   ", Tensor cu_seq_lens"
   ", int num_heads"
   ", int kv_dim"
   ", int rope_dim"
   ", int kv_cache_tokens_per_block"
   ", int max_seq_len"
   ") -> Tensor");

Also applies to: 692-705


715-739: Consider aliasing for MLA rope op in-place args

MLARopeAppendPagedKVAssignQ writes into q and latent_cache (passed by reference). The schema should reflect that for correctness under functionalization/compilers.

Update schema to:

 m.def(
   "mla_rope_append_paged_kv_assign_q("
-  "Tensor q"
-  ", Tensor latent_cache"
+  "Tensor(a!) q"
+  ", Tensor(b!) latent_cache"
   ", int num_contexts"
   ", Tensor cu_ctx_cached_kv_lens"
   ", Tensor cu_seq_lens"
   ", int max_input_uncached_seq_len"
   ", Tensor cos_sin_cache"
   ", int head_num"
   ", int nope_size"
   ", int rope_size"
   ", int lora_size"
   ", Tensor kv_cache_block_offsets"
   ", Tensor host_kv_cache_block_offsets"
   ", Tensor host_kv_cache_pool_pointers"
   ", Tensor host_kv_cache_pool_mapping"
   ", Tensor? kv_scale_orig_quant"
   ", Tensor? kv_scale_quant_orig"
   ", int layer_idx"
   ", int tokens_per_block"
   ", int attention_window_size"
   ", int sink_token_length"
   ", int beam_width"
   ", int quant_mode"
   ") -> ()");
tensorrt_llm/_torch/compilation/piecewise_optimizer.py (3)

1-1: Missing NVIDIA OSS header

Per repo guidelines, production Python files must include the NVIDIA copyright header with the current year.


214-216: Bug: output is undefined outside capture path

In the replay path, output isn’t defined; use entry.output instead when collecting runtime output addresses.

Apply this diff:

-            runtime_output_addresses = [
-                i.data_ptr() for i in output if isinstance(i, torch.Tensor)
-            ]
+            runtime_output_addresses = [
+                i.data_ptr() for i in entry.output if isinstance(i, torch.Tensor)
+            ]

238-238: Parameter shadowing: graph_pool_handle argument is ignored

piecewise_optimizer overwrites the passed-in graph_pool_handle. Either drop the parameter or respect it when provided.

Minimal fix:

-    graph_pool_handle = torch.cuda.graph_pool_handle()
+    # Respect provided handle; create one if None
+    if graph_pool_handle is None:
+        graph_pool_handle = torch.cuda.graph_pool_handle()

You’ll also need to update the signature to accept Optional:

-def piecewise_optimizer(..., graph_pool_handle: tuple[int, int], ...):
+from typing import Optional, Tuple
+def piecewise_optimizer(..., graph_pool_handle: Optional[Tuple[int, int]], ...):
tensorrt_llm/_torch/utils.py (2)

1-1: Missing NVIDIA OSS header

Per repo guidelines, production Python files must include the NVIDIA copyright header with the current year.


265-271: Ensure per-request flag is set in every branch

It looks like in tensorrt_llm/_torch/pyexecutor/model_engine.py you call

  • True in the first if branch (line 1571)
  • False in the inner else (line 1589)
  • False in the final else (line 1596)

but you never set it in the elif num_ctx_requests != 0: branch (around line 1591). Add an explicit disable there to avoid carrying over stale state on thread reuse.

• File: tensorrt_llm/_torch/pyexecutor/model_engine.py
Location: around line 1591

Suggested change:

     elif num_ctx_requests != 0:
         padded_num_tokens = get_padded_piecewise_tokens(
             total_num_tokens)
+        # Disable piecewise CUDA graph for context-only requests
+        set_per_request_piecewise_cuda_graph_flag(False)
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)

1533-1600: Piecewise-CG flag is set to True unconditionally – can leak incorrect state across ranks

set_per_request_piecewise_cuda_graph_flag(True) is invoked before verifying that every rank can actually run piecewise CG.
If later logic disables the feature (set_per_request_piecewise_cuda_graph_flag(False)), there is a window where other code executed in the same call stack (e.g., logging, debug hooks) will observe the flag as True, leading to inconsistent behaviour.

Set the flag only after the conditions are validated:

- set_per_request_piecewise_cuda_graph_flag(True)
...
- else:
-     set_per_request_piecewise_cuda_graph_flag(False)
+ enable_pw = False
+ if attn_all_rank_num_tokens is not None:
+     enable_pw = (num_ctx_requests != 0
+                  and max(attn_all_rank_num_tokens) <= max_sizes)
+     enable_pw = enable_pw and all(self.dist.tp_allgather(enable_pw))
+ elif num_ctx_requests != 0:
+     enable_pw = True
+ set_per_request_piecewise_cuda_graph_flag(enable_pw)

This avoids transient incorrect global state.

🧹 Nitpick comments (9)
tensorrt_llm/_torch/attention_backend/interface.py (1)

138-140: Document semantics of padded_num_tokens and align with num_tokens invariants

The field is fine, but please tighten the docs to define it precisely relative to num_tokens and how downstream should use it (slice to num_tokens if padded_num_tokens is set). Also, it's easy to carry stale values across runs; consider resetting it when building CUDA graph metadata to avoid misuse.

Apply the following clarifying comments:

-    # The number of tokens in the padded sequence.
+    # If not None, indicates the temporary padded length used to stage inputs for piecewise CUDA graph batching.
+    # Downstream should treat `num_tokens` as the real token count and slice inputs/outputs to `num_tokens`
+    # when `padded_num_tokens` is set.
     padded_num_tokens: Optional[int] = None

And reset it when creating CUDA-graph metadata to prevent stale carry-over:

@@ def create_cuda_graph_metadata(self,
-        cuda_graph_metadata.is_cuda_graph = True
+        cuda_graph_metadata.is_cuda_graph = True
+        # Reset any batch-local padding info for graph templates.
+        cuda_graph_metadata.padded_num_tokens = None
tensorrt_llm/_torch/speculative/eagle3.py (1)

283-288: raw_logits naming is misleading after cropping

raw_logits currently points to cropped logits, not the raw input. Either rename it or assign before cropping. Renaming is minimal and clear.

-        raw_logits = logits
+        logits_cropped = logits
@@
-            'logits': raw_logits,
+            'logits': logits_cropped,

Also applies to: 382-388

tensorrt_llm/_torch/modules/logits_processor.py (1)

19-23: Tighten comment and (optionally) assert bounds

  • The inline comment is truncated; clarify intent.
  • Optional: assert that num_tokens does not exceed hidden_states rows to fail fast on bad metadata.
-        # Slicing out the actual
+        # Drop padded rows and keep only the actual tokens
         if (attn_metadata is not None
                 and attn_metadata.padded_num_tokens is not None):
+            assert attn_metadata.num_tokens <= hidden_states.size(0), \
+                "attn_metadata.num_tokens exceeds hidden_states length"
             hidden_states = hidden_states[:attn_metadata.num_tokens, ...]
tensorrt_llm/_torch/compilation/piecewise_optimizer.py (1)

14-16: Import style deviates from namespace guideline

Guidelines request keeping module namespaces rather than importing individual functions. Consider importing the module and referencing its members (optional cleanup for consistency).

tensorrt_llm/_torch/utils.py (1)

265-271: Add docstrings and ensure per-request flag is restored to previous value

  • Add concise docstrings to clarify semantics and default behavior.
  • Consider a context manager to set and automatically restore the prior value to avoid leakage across requests when threads are reused.

Apply this diff to add docstrings:

-def set_per_request_piecewise_cuda_graph_flag(enable: bool):
-    _global_attrs.per_request_piecewise_cuda_graph_flag = enable
+def set_per_request_piecewise_cuda_graph_flag(enable: bool):
+    """Set per-request piecewise CUDA graph enablement.
+    Args:
+        enable: True to enable piecewise CUDA graph for this request, False to disable.
+    """
+    _global_attrs.per_request_piecewise_cuda_graph_flag = enable
@@
-def get_per_request_piecewise_cuda_graph_flag() -> bool:
-    return getattr(_global_attrs, 'per_request_piecewise_cuda_graph_flag', True)
+def get_per_request_piecewise_cuda_graph_flag() -> bool:
+    """Get per-request piecewise CUDA graph enablement.
+    Returns:
+        The current per-request flag; defaults to True if unset.
+    """
+    return getattr(_global_attrs, 'per_request_piecewise_cuda_graph_flag', True)

Additionally, consider adding this context manager in the same module:

import contextlib

@contextlib.contextmanager
def per_request_piecewise_flag(enabled: bool):
    """Context manager to set and restore per-request piecewise flag."""
    prev = getattr(_global_attrs, 'per_request_piecewise_cuda_graph_flag', True)
    _global_attrs.per_request_piecewise_cuda_graph_flag = enabled
    try:
        yield
    finally:
        _global_attrs.per_request_piecewise_cuda_graph_flag = prev
tests/integration/defs/accuracy/test_llm_api_pytorch.py (1)

1373-1378: Out-of-date inline comment

The comment says piece-wise CUDA-graph cannot be enabled for NVFP4 + attention-DP, yet the code right below does exactly that when the cuda_graph flag is set.
Please align the comment with the new behaviour or restore the guard.

tensorrt_llm/_torch/pyexecutor/model_engine.py (2)

583-650: Parameter naming & counting in get_warmup_request are misleading

num_gen_tokens is treated as number of generation requests (token_nums=[1] * num_gen_tokens) rather than “tokens”, while num_tokens represents the total tokens across context + generation.
The mismatch makes the helper hard to reason about and risks future call-site bugs.

Recommendation: rename the second parameter to num_gen_reqs (or actually generate by token count), adjust docstring, and add an assertion explaining the contract.


768-787: Duplicate captures when piecewise_batch_sizes contains repeated values

Because self.piecewise_batch_sizes may include duplicates (see earlier comment), the warm-up loop will capture identical graphs multiple times, wasting several seconds per warm-up and additional GPU memory:

for seq_lens in self.piecewise_batch_sizes:
    ...

Iterate over a set or the deduplicated list suggested above to avoid redundant work.
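The deduplication suggested here can be sketched as a small helper. The function name is hypothetical; iterating largest-first is an assumption (a common convention so the biggest capture allocates its memory pool first), not something the PR specifies.

```python
def dedup_capture_sizes(batch_sizes):
    """Drop duplicate capture sizes so each graph is captured exactly once.

    Returns the unique sizes sorted largest-first (assumed ordering).
    """
    return sorted(set(batch_sizes), reverse=True)
```

For example, a list like [1, 2, 4, 4, 8, 8] collapses to [8, 4, 2, 1], so the warm-up loop captures four graphs instead of six.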

tensorrt_llm/_torch/modules/attention.py (1)

990-992: Redundant runtime assert
The earlier padding branch already guarantees output is not None; re-asserting here is unnecessary and slightly hurts JITability.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ebdc43e and c05e43f.

📒 Files selected for processing (12)
  • cpp/tensorrt_llm/thop/mlaPreprocessOp.cpp (1 hunks)
  • tensorrt_llm/_torch/attention_backend/interface.py (1 hunks)
  • tensorrt_llm/_torch/compilation/piecewise_optimizer.py (2 hunks)
  • tensorrt_llm/_torch/modules/attention.py (14 hunks)
  • tensorrt_llm/_torch/modules/logits_processor.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/config.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/model_engine.py (7 hunks)
  • tensorrt_llm/_torch/speculative/eagle3.py (1 hunks)
  • tensorrt_llm/_torch/speculative/mtp.py (2 hunks)
  • tensorrt_llm/_torch/utils.py (1 hunks)
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py (6 hunks)
  • tests/unittest/_torch/test_custom_ops.py (0 hunks)
💤 Files with no reviewable changes (1)
  • tests/unittest/_torch/test_custom_ops.py
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the class docstring.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

  • tensorrt_llm/_torch/modules/logits_processor.py
  • tensorrt_llm/_torch/attention_backend/interface.py
  • tensorrt_llm/_torch/compilation/piecewise_optimizer.py
  • tensorrt_llm/_torch/pyexecutor/config.py
  • tensorrt_llm/_torch/speculative/eagle3.py
  • tensorrt_llm/_torch/utils.py
  • tensorrt_llm/_torch/speculative/mtp.py
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
  • tensorrt_llm/_torch/modules/attention.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • tensorrt_llm/_torch/modules/logits_processor.py
  • tensorrt_llm/_torch/attention_backend/interface.py
  • tensorrt_llm/_torch/compilation/piecewise_optimizer.py
  • tensorrt_llm/_torch/pyexecutor/config.py
  • tensorrt_llm/_torch/speculative/eagle3.py
  • tensorrt_llm/_torch/utils.py
  • tensorrt_llm/_torch/speculative/mtp.py
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
  • cpp/tensorrt_llm/thop/mlaPreprocessOp.cpp
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
  • tensorrt_llm/_torch/modules/attention.py
**/*.{cpp,h,hpp,cc,cxx}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,h,hpp,cc,cxx}: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo).
Prefer const or constexpr variables over #defines whenever possible.
A variable that is not modified after its initialization should be declared as const.
Except 0 (used for checking signness/existence/emptiness), nullptr, true, false, all other literals should only be used for variable initialization.
Use the Allman indentation style for braces in C++ code.
Put the semicolon for an empty for or while loop in a new line.
The statement forming the body of a switch, while, do..while, or for statement shall be a compound statement (use brace-delimited statements).
If and else should always be followed by brace-delimited statements, even if empty or a single statement.
C++ filenames should use camel case with the first letter lowercase (e.g., thisIsAFilename.cpp), and all files involved in a compilation target must have case-insensitive unique filenames.
All types (including class names) should use camel case with uppercase first letter (e.g., FooBarClass).
Local variables, methods, and namespaces should use camel case with first letter lowercase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not defined in anonymous namespace should use camel case prefixed by 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number global variables that are static or defined in an anonymous namespace should use camel case prefixed by 's' (e.g., sMutableStaticGlobal).
Locally visible static variables should use camel case with lowercase prefix 's' as the first letter (e.g., static std::once_flag sFlag;).
Class member variables should use camel case prefixed with 'm' (e.g., mNbFooValues). Public member variables do not require the 'm' prefix but it is encouraged for clarity.
Enumerations, global constants, static constants at class-scope, and function-scope magic-number/literal constants should be uppercase snake case with prefix...

Files:

  • cpp/tensorrt_llm/thop/mlaPreprocessOp.cpp
🧠 Learnings (8)
📓 Common learnings
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6728
File: cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp:966-966
Timestamp: 2025-08-08T04:10:18.987Z
Learning: TensorRT plugins currently don't support padding functionality, and TensorRT is not getting new features (in maintenance mode). This means that duplicating parameters like mExpertHiddenSize in function calls, even with TODO comments, can be acceptable as pragmatic solutions within these constraints.
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
📚 Learning: 2025-08-08T04:10:18.987Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6728
File: cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp:966-966
Timestamp: 2025-08-08T04:10:18.987Z
Learning: TensorRT plugins currently don't support padding functionality, and TensorRT is not getting new features (in maintenance mode). This means that duplicating parameters like mExpertHiddenSize in function calls, even with TODO comments, can be acceptable as pragmatic solutions within these constraints.

Applied to files:

  • tensorrt_llm/_torch/modules/logits_processor.py
  • tensorrt_llm/_torch/attention_backend/interface.py
  • tensorrt_llm/_torch/speculative/eagle3.py
  • tensorrt_llm/_torch/speculative/mtp.py
  • cpp/tensorrt_llm/thop/mlaPreprocessOp.cpp
  • tensorrt_llm/_torch/modules/attention.py
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • tensorrt_llm/_torch/modules/logits_processor.py
  • tensorrt_llm/_torch/compilation/piecewise_optimizer.py
  • cpp/tensorrt_llm/thop/mlaPreprocessOp.cpp
  • tensorrt_llm/_torch/modules/attention.py
📚 Learning: 2025-07-22T09:22:14.726Z
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.

Applied to files:

  • tensorrt_llm/_torch/modules/logits_processor.py
  • tensorrt_llm/_torch/compilation/piecewise_optimizer.py
  • tensorrt_llm/_torch/speculative/mtp.py
  • cpp/tensorrt_llm/thop/mlaPreprocessOp.cpp
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tensorrt_llm/_torch/compilation/piecewise_optimizer.py
  • tensorrt_llm/_torch/pyexecutor/config.py
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
  • tensorrt_llm/_torch/modules/attention.py
📚 Learning: 2025-08-01T15:14:45.673Z
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.

Applied to files:

  • tensorrt_llm/_torch/compilation/piecewise_optimizer.py
  • tensorrt_llm/_torch/modules/attention.py
📚 Learning: 2025-07-22T08:33:49.109Z
Learnt from: yiqingy0
PR: NVIDIA/TensorRT-LLM#5198
File: jenkins/mergeWaiveList.py:0-0
Timestamp: 2025-07-22T08:33:49.109Z
Learning: In the TensorRT-LLM waive list merging system, removed lines are always located at the end of the merge waive lists, which is why the mergeWaiveList.py script uses reverse traversal - it's an optimization for this specific domain constraint.

Applied to files:

  • cpp/tensorrt_llm/thop/mlaPreprocessOp.cpp
📚 Learning: 2025-08-08T05:06:31.537Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:36-36
Timestamp: 2025-08-08T05:06:31.537Z
Learning: CUTLASS extension files (under cpp/tensorrt_llm/cutlass_extensions/) follow CUTLASS coding style conventions, including using #pragma once instead of TRTLLM_ prefixed header guards, even though they are .hpp files.

Applied to files:

  • cpp/tensorrt_llm/thop/mlaPreprocessOp.cpp
🔇 Additional comments (5)
tensorrt_llm/_torch/pyexecutor/config.py (1)

70-72: Ignore default flips for iteration stats/logs in PyExecutor config

The enable_iter_req_stats and print_iter_log defaults you spotted in tensorrt_llm/_torch/pyexecutor/config.py only apply to the low-level PyExecutor’s standalone config and are overridden by the CLI‐level defaults (in llm_args.py) and by the autodeploy shim. Moreover, because enable_iter_perf_stats remains False by default, request stats aren’t collected unless explicitly enabled alongside perf stats. All CLI, tests, examples, and docs still use the previous behavior—no changes to tests or docs are necessary.

Likely an incorrect or invalid review comment.

tensorrt_llm/_torch/compilation/piecewise_optimizer.py (1)

159-163: LGTM: correct gating with per-request flag

Adding the per-request flag to the fallback condition is correct and keeps semantics consistent with the global flag.

tests/integration/defs/accuracy/test_llm_api_pytorch.py (1)

92-95: Re-check piecewise CUDA‐graph + attention-DP support and remove stale comment

  • tests/integration/defs/accuracy/test_llm_api_pytorch.py line 1369: remove or update the outdated
    “# Picewise Cuda Graph cannot be enabled for nvfp4 attention dp.” comment
  • Audit every test that now unconditionally passes enable_piecewise_cuda_graph (e.g. all torch_compile_config blocks in tests for bfloat16, fp8, auto-dtype, nvfp4, nvfp4_4gpus, MoE, etc.) and confirm the runtime truly supports
    enable_attention_dp=True + enable_piecewise_cuda_graph=True for all quant/attn backends (NVFP4, FP8, MoE, …)
  • For any unsupported combination, add or adjust skip/xfail marks; otherwise remove or update any other stale comments to keep the matrix accurate
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)

1601-1605: Potential mismatch between padded inputs and metadata sequence lengths

input_ids_cuda / position_ids_cuda are zero-padded up to padded_num_tokens, but:

  • attn_metadata.seq_lens (when not using a CUDA graph) and
  • sequence_lengths handed to spec_metadata

still reflect the unpadded lengths.
Downstream kernels that rely on padded_num_tokens for slicing but use seq_lens for loops may read past valid data.

Verify that all kernels exclusively use padded_num_tokens when it is set, or update seq_lens accordingly.
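The invariant being asked for can be sketched with a toy consistency check (all names here are illustrative stand-ins, not the real buffers):

```python
# Toy model of the padding invariant discussed above: inputs are zero-padded
# up to padded_num_tokens, while seq_lens still describes the unpadded batch.
# Any kernel that slices with padded_num_tokens but loops with seq_lens must
# only ever touch the valid prefix sum(seq_lens) tokens long.

seq_lens = [5, 3, 4]                # unpadded per-request lengths
num_tokens = sum(seq_lens)          # 12 valid tokens
padded_num_tokens = 16              # padded up to a captured graph size

# Zero-padded flat input buffer, as input_ids_cuda would be.
input_ids = list(range(1, num_tokens + 1)) + [0] * (padded_num_tokens - num_tokens)

assert len(input_ids) == padded_num_tokens
valid = input_ids[:num_tokens]      # region described by seq_lens
tail = input_ids[num_tokens:]       # padded region: must stay zero
```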

tensorrt_llm/_torch/modules/attention.py (1)

932-947: Update docstring & return type after making output mandatory
forward_impl now returns None and mutates output in-place, but the docstring (Lines 936-947) and type hint still claim it returns torch.Tensor. This can mislead callers and static analysers.

Please adjust:

-    def forward_impl(..., output: torch.Tensor) -> torch.Tensor:
+    def forward_impl(..., output: torch.Tensor) -> None:
 ...
-        Returns:
-            torch.Tensor: The output tensor.
+        Returns:
+            None. Results are written in-place to `output`.

@tensorrt-cicd
Collaborator

PR_Github #14601 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11030 completed with status: 'FAILURE'

@liji-nv liji-nv force-pushed the dev-liji-padding-piecewise branch 2 times, most recently from 367a9b1 to efa94ee Compare August 11, 2025 06:19
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🔭 Outside diff range comments (1)
tensorrt_llm/_torch/modules/attention.py (1)

932-959: MLA: Slice position_ids along the last dimension under padding

position_ids is typically [1, T]. Using [:num_tokens, ...] slices the first dimension. Prefer ellipsis slicing.

-        if padded_num_tokens is not None:
-            hidden_states = hidden_states[:num_tokens, ...]
-            if position_ids is not None:
-                position_ids = position_ids[:num_tokens, ...]
+        if padded_num_tokens is not None:
+            hidden_states = hidden_states[:num_tokens, ...]
+            if position_ids is not None:
+                position_ids = position_ids[..., :num_tokens]
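The difference between the two slices is easy to demonstrate on a [1, T] array (numpy shares torch's ellipsis-slicing semantics, so a torch-free sketch suffices):

```python
import numpy as np

# position_ids in this code path is typically shaped [1, T] (batch-major).
# Slicing the FIRST dimension with [:num_tokens, ...] is a no-op on a
# 1-row tensor, while [..., :num_tokens] correctly trims the token axis.

num_tokens = 3
position_ids = np.arange(8).reshape(1, 8)   # [1, T] with T = 8

wrong = position_ids[:num_tokens, ...]      # still shape (1, 8): nothing trimmed
right = position_ids[..., :num_tokens]      # shape (1, 3): tokens trimmed
```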
♻️ Duplicate comments (1)
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)

332-337: Defensive-copy and dedupe piecewise batch sizes to avoid side-effects and duplicates

Same issue as previously reported: assigning and mutating the caller-provided list can cause hidden side effects and duplicates.

-        self._torch_compile_piecewise_cuda_graph = pytorch_backend_config.torch_compile_piecewise_cuda_graph
-
-        self.piecewise_batch_sizes = pytorch_backend_config.cuda_graph_batch_sizes
-        if self.piecewise_batch_sizes is not None:
-            self.piecewise_batch_sizes += [2048, 8192]
+        self._torch_compile_piecewise_cuda_graph = (
+            pytorch_backend_config.torch_compile_piecewise_cuda_graph
+        )
+
+        base_sizes = pytorch_backend_config.cuda_graph_batch_sizes or []
+        # Own a copy, remove dups, keep ascending order.
+        self.piecewise_batch_sizes = sorted({*base_sizes, 2048, 8192})
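The aliasing hazard behind this suggestion can be shown in isolation (`config_sizes` is a stand-in for the caller's `cuda_graph_batch_sizes` list):

```python
# Why the defensive copy matters: `+=` on a list mutates it in place, so the
# caller's config object silently grows (and gains duplicates) every time the
# engine is constructed.

config_sizes = [1, 2, 4, 2048]

aliased = config_sizes
aliased += [2048, 8192]               # in-place extend: caller's list changed

fresh = [1, 2, 4, 2048]
owned = sorted({*fresh, 2048, 8192})  # new list, deduped, ascending order
```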
🧹 Nitpick comments (1)
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)

781-802: Piecewise CUDA graph warmup loop is good; ensure capture set is unique/ordered

Warmup iteration over piecewise sizes is sound. Sorting/uniqueness is important to avoid redundant captures; the previous comment’s fix to dedupe/sort self.piecewise_batch_sizes will make this robust.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c05e43f and efa94ee.

📒 Files selected for processing (10)
  • cpp/tensorrt_llm/thop/mlaPreprocessOp.cpp (1 hunks)
  • tensorrt_llm/_torch/attention_backend/interface.py (1 hunks)
  • tensorrt_llm/_torch/compilation/piecewise_optimizer.py (2 hunks)
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py (2 hunks)
  • tensorrt_llm/_torch/models/modeling_speculative.py (2 hunks)
  • tensorrt_llm/_torch/modules/attention.py (14 hunks)
  • tensorrt_llm/_torch/pyexecutor/model_engine.py (7 hunks)
  • tensorrt_llm/_torch/utils.py (1 hunks)
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py (6 hunks)
  • tests/unittest/_torch/test_custom_ops.py (0 hunks)
💤 Files with no reviewable changes (1)
  • tests/unittest/_torch/test_custom_ops.py
🚧 Files skipped from review as they are similar to previous changes (5)
  • cpp/tensorrt_llm/thop/mlaPreprocessOp.cpp
  • tensorrt_llm/_torch/attention_backend/interface.py
  • tensorrt_llm/_torch/compilation/piecewise_optimizer.py
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
  • tensorrt_llm/_torch/utils.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the class docstring.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
  • tensorrt_llm/_torch/models/modeling_speculative.py
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
  • tensorrt_llm/_torch/modules/attention.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
  • tensorrt_llm/_torch/models/modeling_speculative.py
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
  • tensorrt_llm/_torch/modules/attention.py
🧬 Code Graph Analysis (3)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (3)
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (1)
  • hidden_states (2160-2160)
tensorrt_llm/_torch/attention_backend/interface.py (1)
  • num_tokens (277-278)
cpp/tensorrt_llm/thop/dsv3RouterGemmOp.cpp (4)
  • num_tokens (35-47)
  • num_tokens (35-36)
  • num_tokens (53-65)
  • num_tokens (53-54)
tensorrt_llm/_torch/models/modeling_speculative.py (2)
cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (1)
  • hidden_states (2160-2160)
tensorrt_llm/_torch/attention_backend/interface.py (1)
  • num_tokens (277-278)
tensorrt_llm/_torch/pyexecutor/model_engine.py (7)
tensorrt_llm/_torch/utils.py (3)
  • set_per_request_piecewise_cuda_graph_flag (265-266)
  • set_torch_compiling (27-29)
  • with_model_extra_attrs (61-71)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (2)
  • add_dummy_requests (67-68)
  • add_dummy_requests (429-486)
tensorrt_llm/_torch/attention_backend/trtllm.py (2)
  • max_seq_len (565-575)
  • max_seq_len (578-582)
tensorrt_llm/_torch/pyexecutor/scheduler.py (2)
  • batch_size (35-36)
  • ScheduledRequests (18-39)
tensorrt_llm/_torch/speculative/mtp.py (3)
  • add_dummy_requests (83-85)
  • all_rank_num_seqs (148-149)
  • all_rank_num_seqs (152-155)
tensorrt_llm/_torch/distributed/communicator.py (1)
  • tp_allgather (138-139)
tensorrt_llm/_torch/speculative/interface.py (2)
  • all_rank_num_tokens (192-193)
  • all_rank_num_tokens (196-199)
🔇 Additional comments (3)
tensorrt_llm/_torch/models/modeling_speculative.py (1)

384-386: Hidden states crop on padding looks correct

Cropping to attn_metadata.num_tokens ensures logits indexing matches seq lengths. No issues here.

tensorrt_llm/_torch/models/modeling_deepseekv3.py (1)

1187-1189: Hidden states crop on padding looks correct

Aligns token count with downstream MTP/logits processing. Good.

tensorrt_llm/_torch/pyexecutor/model_engine.py (1)

594-661: Warmup request builder: verify KV space checks for multi-request context payloads

You gate by total token budget (get_num_available_tokens) and batch_size, which is generally fine. However, the old path accounted for tokens_per_block per request to prevent underestimation. Please verify that get_num_available_tokens semantics match multi-request context allocations; otherwise, consider incorporating a per-request blocks check for ctx requests like before.

Would you like me to generate a quick instrumentation patch to assert sufficient blocks per ctx request at warmup time?

@liji-nv liji-nv force-pushed the dev-liji-padding-piecewise branch from efa94ee to c03de1e Compare August 13, 2025 02:05
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (4)
tensorrt_llm/_torch/pyexecutor/model_engine.py (2)

337-341: Fix list mutation to avoid side effects.

The code directly mutates the list from pytorch_backend_config.cuda_graph_batch_sizes, which can cause unexpected side effects on the caller's config object and allows duplicates.

Apply this fix to avoid side effects and ensure uniqueness:

-        self.piecewise_batch_sizes = list(
-            pytorch_backend_config.cuda_graph_batch_sizes)
-        if self.piecewise_batch_sizes is not None:
-            self.piecewise_batch_sizes += [2048, 8192]
+        base_sizes = pytorch_backend_config.cuda_graph_batch_sizes or []
+        # Create a copy, deduplicate, and maintain sorted order
+        self.piecewise_batch_sizes = sorted(set(base_sizes) | {2048, 8192}) if base_sizes else None

1599-1643: Handle edge case when tokens exceed all captured sizes.

When total_num_tokens exceeds all captured batch sizes and we're not using DP, get_padded_piecewise_tokens returns the original token count unchanged. The per-request flag should be disabled in this case to avoid attempting to use a non-existent piecewise graph.

Apply this fix to handle the edge case:

             elif num_ctx_requests != 0:
-                padded_num_tokens = get_padded_piecewise_tokens(
-                    total_num_tokens)
+                padded_num_tokens = get_padded_piecewise_tokens(total_num_tokens)
+                # Disable per-request flag if no captured graph can handle this token count
+                if padded_num_tokens == total_num_tokens and total_num_tokens > max(
+                    self._torch_compile_backend.cuda_graph_batch_sizes):
+                    set_per_request_piecewise_cuda_graph_flag(False)
tensorrt_llm/_torch/modules/attention.py (2)

340-354: Fix potential issues with padding and NVFP4 output handling.

When padding is enabled:

  1. The code asserts that output is not None, but forward() may call _attn_impl with output=None in the non-custom-op path
  2. The assertion output_sf is None prevents NVFP4 scale factor handling when padding is used

Consider this more robust approach:

         if padded_num_tokens is not None:
             assert q.shape[0] == padded_num_tokens
             q = q[:num_tokens, :]
             if k is not None:
                 assert k.shape[0] == padded_num_tokens
                 k = k[:num_tokens, :]
             if v is not None:
                 assert v.shape[0] == padded_num_tokens
                 v = v[:num_tokens, :]
-            assert output is not None
-            assert output_sf is None
+            # When padding is enabled, we need an output buffer
+            if output is None:
+                output = self.create_output(q)
+            # Disable NVFP4 scale output when padding to avoid complexity
+            enable_attn_nvfp4_output = False

385-397: Incorrect tensor used for NVFP4 scale zero-padding.

The code zeros output_sf (the input parameter) instead of the returned scale tensor when attn_output is a tuple. This means the padded region of the actual returned scale tensor is never zeroed.

Fix the zero-padding logic:

         if isinstance(attn_output, tuple):
             assert len(
                 attn_output
             ) == 2, "attn_output should be a tuple of (output, output_sf)"
-            return attn_output[0], attn_output[1]
+            output_tensor, scale_tensor = attn_output
+            if padded_num_tokens is not None and scale_tensor is not None:
+                pad_up = fp4_utils.pad_up(num_tokens, 128)
+                if scale_tensor.shape[0] > pad_up:
+                    scale_tensor[pad_up:].fill_(0)
+            return output_tensor, scale_tensor
         if output is not None and output.shape[0] != num_tokens:
             output[num_tokens:].fill_(0)
-        if output_sf is not None and output_sf.shape[0] != fp4_utils.pad_up(
-                num_tokens, 128):
-            output_sf[fp4_utils.pad_up(num_tokens, 128):].fill_(0)
         return attn_output, None
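The pad_up arithmetic used in the fix above can be sketched without torch; `pad_up` here is assumed to round up to a multiple, mirroring `fp4_utils.pad_up`, and the scale buffer is modeled as a plain list:

```python
# Minimal sketch of zeroing the padded tail of an NVFP4 scale buffer.
# Assumption: fp4_utils.pad_up(n, align) rounds n up to a multiple of align.

def pad_up(n: int, align: int) -> int:
    return (n + align - 1) // align * align

num_tokens = 200
scale_rows = 512                       # padded scale-factor buffer length
scale = [1.0] * scale_rows

boundary = pad_up(num_tokens, 128)     # 256: first row past the valid region
if scale_rows > boundary:
    scale[boundary:] = [0.0] * (scale_rows - boundary)  # zero the padded tail
```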
🧹 Nitpick comments (4)
tensorrt_llm/_torch/pyexecutor/model_engine.py (3)

602-681: Simplify warmup request construction logic.

The nested get_warmup_request function has complex logic with multiple early returns and calculations. Consider extracting some of the calculations into helper methods for better readability and testability.

Consider extracting the block calculation logic:

def _calculate_blocks_needed(self, num_tokens: int, kv_cache_manager) -> int:
    """Calculate the number of blocks needed for the given number of tokens."""
    max_seq_len = self.max_seq_len - 1
    num_full_seqs = num_tokens // max_seq_len
    num_leftover = num_tokens % max_seq_len
    
    blocks = num_full_seqs * math.ceil(max_seq_len / kv_cache_manager.tokens_per_block)
    if num_leftover > 0:
        blocks += math.ceil(num_leftover / kv_cache_manager.tokens_per_block)
    
    return blocks

1605-1613: Extract padding logic to improve readability.

The get_padded_piecewise_tokens function could be moved outside as a utility function or class method since it doesn't depend on local variables.

Extract as a static method:

@staticmethod
def _get_padded_piecewise_tokens(tokens: int, captured_sizes: list[int]) -> int:
    """Find the smallest captured size that can accommodate the token count."""
    for size in sorted(captured_sizes):
        if size >= tokens:
            return size
    return tokens
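How such a helper behaves, including the fall-through flagged in the duplicate comment above (the `captured` values are illustrative):

```python
def get_padded_piecewise_tokens(tokens, captured_sizes):
    """Smallest captured size that fits; falls through to the raw count
    when the batch exceeds every captured graph."""
    for size in sorted(captured_sizes):
        if size >= tokens:
            return size
    return tokens

captured = [64, 256, 2048, 8192]
small = get_padded_piecewise_tokens(100, captured)     # pads up to 256
exact = get_padded_piecewise_tokens(2048, captured)    # exact hit, no padding
too_big = get_padded_piecewise_tokens(9000, captured)  # unchanged: 9000
```

The `too_big` case is the one that needs the per-request flag disabled, since no captured graph can serve it.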

1692-1702: Simplify metadata gathering for attention DP.

The spec metadata gathering logic for attention DP could be simplified by using tuple unpacking more effectively.

Simplify the list comprehension:

-            if self.enable_attention_dp:
-                all_rank_num_tokens = self.dist.tp_allgather(
-                    [spec_metadata.num_tokens,
-                     len(sequence_lengths)])
-
-                spec_all_rank_num_tokens = [
-                    item[0] for item in all_rank_num_tokens
-                ]
-                all_rank_num_seqs = [item[1] for item in all_rank_num_tokens]
+            if self.enable_attention_dp:
+                gathered_data = self.dist.tp_allgather(
+                    [spec_metadata.num_tokens, len(sequence_lengths)])
+                spec_all_rank_num_tokens, all_rank_num_seqs = zip(*gathered_data)
+                spec_all_rank_num_tokens = list(spec_all_rank_num_tokens)
+                all_rank_num_seqs = list(all_rank_num_seqs)
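What the `zip(*...)` rewrite does, shown on a stand-in for `tp_allgather`'s result (each rank contributes a `[num_tokens, num_seqs]` pair, and `zip(*pairs)` transposes the list of pairs into two per-rank sequences):

```python
# One [tokens, seqs] pair per rank, as returned by the allgather.
gathered_data = [[128, 4], [96, 3], [160, 5]]

spec_all_rank_num_tokens, all_rank_num_seqs = zip(*gathered_data)
spec_all_rank_num_tokens = list(spec_all_rank_num_tokens)
all_rank_num_seqs = list(all_rank_num_seqs)
```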
tensorrt_llm/_torch/modules/attention.py (1)

932-935: Enforce output requirement in MLA forward_impl.

The signature change to require output: torch.Tensor is good, but the assertion message could be more informative about why output is required.

Improve the assertion message:

-        assert output is not None, "output must be provided"
+        assert output is not None, "Pre-allocated output buffer is required for MLA forward_impl to support piecewise CUDA graphs and padding"
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between efa94ee and c03de1e.

📒 Files selected for processing (10)
  • cpp/tensorrt_llm/thop/mlaPreprocessOp.cpp (1 hunks)
  • tensorrt_llm/_torch/attention_backend/interface.py (1 hunks)
  • tensorrt_llm/_torch/compilation/piecewise_optimizer.py (2 hunks)
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py (2 hunks)
  • tensorrt_llm/_torch/models/modeling_speculative.py (2 hunks)
  • tensorrt_llm/_torch/modules/attention.py (14 hunks)
  • tensorrt_llm/_torch/pyexecutor/model_engine.py (7 hunks)
  • tensorrt_llm/_torch/utils.py (1 hunks)
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py (6 hunks)
  • tests/unittest/_torch/test_custom_ops.py (0 hunks)
💤 Files with no reviewable changes (1)
  • tests/unittest/_torch/test_custom_ops.py
🚧 Files skipped from review as they are similar to previous changes (6)
  • tensorrt_llm/_torch/attention_backend/interface.py
  • tensorrt_llm/_torch/utils.py
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
  • cpp/tensorrt_llm/thop/mlaPreprocessOp.cpp
  • tensorrt_llm/_torch/models/modeling_speculative.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/_torch/compilation/piecewise_optimizer.py
  • tensorrt_llm/_torch/modules/attention.py
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/_torch/compilation/piecewise_optimizer.py
  • tensorrt_llm/_torch/modules/attention.py
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (3)
tensorrt_llm/_torch/compilation/piecewise_optimizer.py (2)

14-16: LGTM!

The import of get_per_request_piecewise_cuda_graph_flag is correctly added alongside other related utility functions, maintaining consistent import organization.


159-163: LGTM! Per-request piecewise CUDA graph control properly implemented.

The early-return condition now properly checks both global and per-request piecewise CUDA graph flags, allowing for fine-grained control of piecewise CUDA graph execution on a per-request basis.

tensorrt_llm/_torch/modules/attention.py (1)

1028-1030: Good padding handling for MLA output.

The zero-filling of the padded region in MLA output ensures that padded tokens don't affect downstream computations.

@liji-nv liji-nv force-pushed the dev-liji-padding-piecewise branch 3 times, most recently from b3d1008 to dff6f2a Compare August 19, 2025 09:18
@liji-nv liji-nv marked this pull request as ready for review August 19, 2025 09:19
@liji-nv liji-nv requested review from a team as code owners August 19, 2025 09:19
@liji-nv liji-nv requested review from hlu1, hyukn and 2ez4bz August 19, 2025 09:19
@liji-nv
Collaborator Author

liji-nv commented Aug 19, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #15759 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #15759 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11846 completed with status: 'FAILURE'

Member

@yizhang-nv yizhang-nv left a comment


LGTM for compile part and padding part handling

@liji-nv liji-nv force-pushed the dev-liji-padding-piecewise branch from dff6f2a to c3215fc Compare August 19, 2025 10:51
@liji-nv
Collaborator Author

liji-nv commented Aug 19, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #15773 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #15773 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11858 completed with status: 'FAILURE'

@liji-nv
Collaborator Author

liji-nv commented Aug 25, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #16354 [ run ] triggered by Bot

@liji-nv liji-nv force-pushed the dev-liji-padding-piecewise branch from d6a5194 to 09dca6a Compare August 25, 2025 08:10
@liji-nv liji-nv force-pushed the dev-liji-padding-piecewise branch 2 times, most recently from 350dc67 to be1ca8b Compare August 25, 2025 11:02
@liji-nv
Collaborator Author

liji-nv commented Aug 25, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #16437 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #16354 [ run ] completed with state ABORTED

@tensorrt-cicd
Collaborator

PR_Github #16437 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12349 completed with status: 'FAILURE'

Collaborator

@2ez4bz 2ez4bz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per discussion, Yuxian reviewed the pyexecutor changes - approving the changes to modeling_speculative.py on that basis.

@liji-nv liji-nv force-pushed the dev-liji-padding-piecewise branch from be1ca8b to 902b408 Compare August 26, 2025 06:03
@liji-nv liji-nv force-pushed the dev-liji-padding-piecewise branch from 902b408 to 8dbea32 Compare August 26, 2025 06:06
@liji-nv
Collaborator Author

liji-nv commented Aug 26, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #16527 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #16527 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12411 completed with status: 'FAILURE'

Signed-off-by: Jin Li <[email protected]>
@liji-nv
Collaborator Author

liji-nv commented Aug 26, 2025

/bot run

@tensorrt-cicd
Collaborator

PR_Github #16549 [ run ] triggered by Bot

@liji-nv liji-nv enabled auto-merge (squash) August 26, 2025 10:43
@tensorrt-cicd
Collaborator

PR_Github #16549 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12428 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

@liji-nv liji-nv merged commit 0282354 into NVIDIA:main Aug 26, 2025
4 checks passed