[TRTLLM-6633][feat] Padding for piecewise cudagraph #6750
Conversation
📝 Walkthrough

Adds per-request piecewise CUDA-graph controls and padding-aware attention/MLA handling, marks a custom op argument as an in-place alias in TorchScript, updates warmup/request construction and attention-DP padding propagation, and adjusts tests to reflect piecewise behavior.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant ModelEngine
    participant Backend
    participant PiecewiseRunner
    Client->>ModelEngine: request(inputs, attn_metadata)
    ModelEngine->>ModelEngine: _prepare_tp_inputs()
    alt Attention DP enabled
        ModelEngine->>ModelEngine: check per-rank eligibility
        alt All ranks eligible
            ModelEngine->>ModelEngine: pad tokens to captured batch size
            ModelEngine->>ModelEngine: set_per_request_piecewise_cuda_graph_flag(True)
        else
            ModelEngine->>ModelEngine: set_per_request_piecewise_cuda_graph_flag(False)
        end
    end
    ModelEngine->>Backend: run attention (may include padded_num_tokens)
    Backend->>PiecewiseRunner: __call__(runtime_num_of_token)
    PiecewiseRunner->>PiecewiseRunner: check entries && global && per-request flags
    alt All checks pass
        PiecewiseRunner-->>Backend: run captured graph
    else Fallback
        PiecewiseRunner-->>Backend: run default callable
    end
    Backend-->>Client: outputs
```
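The PiecewiseRunner gating described above can be sketched as a simple predicate. This is an illustrative sketch; the function and variable names are hypothetical and not the actual TensorRT-LLM API:

```python
def should_run_captured_graph(num_tokens, captured_sizes,
                              global_flag, per_request_flag):
    """All three gates must pass; any failure falls back to the default callable."""
    return (num_tokens in captured_sizes) and global_flag and per_request_flag
```

A token count that was never captured, or either flag being False, routes the call to eager execution instead of the captured graph.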
```mermaid
sequenceDiagram
    participant Module as Attention/MLA
    participant Backend
    Module->>Module: see attn_metadata.padded_num_tokens
    alt padded_num_tokens set
        Module->>Module: assert padded shapes, slice q/k/v/hidden_states to num_tokens
    end
    Module->>Backend: call attention backend (output buffer possibly sliced)
    Backend-->>Module: fill output buffer
    alt padded_num_tokens set
        Module->>Module: zero out tail beyond num_tokens in outputs
    end
    Module-->>Caller: final output
```
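The slice-then-zero flow in the second diagram can be sketched in NumPy. This is a hypothetical helper under the stated assumptions; the real modules operate on torch tensors and attention metadata objects:

```python
import numpy as np

def attention_with_padding(hidden_states, num_tokens, padded_num_tokens, attention_fn):
    # Inputs arrive staged at the padded length; compute only on the real tokens.
    if padded_num_tokens is not None:
        assert hidden_states.shape[0] == padded_num_tokens
        hidden_states = hidden_states[:num_tokens, ...]
    out = attention_fn(hidden_states)
    if padded_num_tokens is None:
        return out
    # Re-pad the output and zero the tail beyond num_tokens.
    padded = np.zeros((padded_num_tokens,) + out.shape[1:], dtype=out.dtype)
    padded[:num_tokens] = out
    return padded
```

Zeroing the tail matters because the padded rows feed later layers; leaving stale values there could corrupt downstream reductions.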
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
/bot run

PR_Github #14601 [ run ] triggered by Bot
Actionable comments posted: 6
🔭 Outside diff range comments (9)
cpp/tensorrt_llm/thop/mlaPreprocessOp.cpp (3)
749-761: Alias annotation incomplete: merged_softmax_stats is also mutated in-place

mergeChunkedAttentionForMLA writes into merged_softmax_stats, but only merged_attn is annotated as in-place, which can confuse alias analysis and schedulers. Annotate merged_softmax_stats as an in-place alias too:

```diff
 m.def(
     "merge_chunked_attention_for_mla("
     "Tensor(a!) merged_attn"
     ", Tensor temp_attn"
-    ", Tensor merged_softmax_stats"
+    ", Tensor(b!) merged_softmax_stats"
     ", Tensor temp_softmax_stats"
     ", int num_requests"
     ", Tensor cu_q_seq_lens"
     ", int max_q_seq_len"
     ", Tensor merge_op"
     ", int num_heads"
     ", int head_size"
     ") -> ()");
```
670-683: Also mark output as in-place for set_*_kv_cache_for_mla ops

Both set_paged_kv_cache_for_mla and set_chunked_kv_cache_for_mla mutate their output tensor, yet the TS schema doesn't annotate them as aliasing. This can lead to incorrect graph transformations. Apply alias annotations:

```diff
 m.def(
     "set_paged_kv_cache_for_mla("
-    "Tensor output"
+    "Tensor(a!) output"
     ", Tensor k"
     ", Tensor v"
     ", Tensor k_pe"
     ", int num_requests"
     ", Tensor cu_seq_lens"
     ", int max_input_seq_len"
     ", int num_heads"
     ", int kv_dim"
     ", int rope_dim"
     ", int kv_cache_tokens_per_block"
     ") -> Tensor");
 m.def(
     "set_chunked_kv_cache_for_mla("
-    "Tensor output"
+    "Tensor(a!) output"
     ", Tensor kv"
     ", Tensor k_pe"
     ", int num_requests"
     ", Tensor cu_seq_lens"
     ", int num_heads"
     ", int kv_dim"
     ", int rope_dim"
     ", int kv_cache_tokens_per_block"
     ", int max_seq_len"
     ") -> Tensor");
```

Also applies to: 692-705
715-739: Consider aliasing for MLA rope op in-place args

MLARopeAppendPagedKVAssignQ writes into q and latent_cache (passed by reference). The schema should reflect that for correctness under functionalization/compilers. Update the schema to:

```diff
 m.def(
     "mla_rope_append_paged_kv_assign_q("
-    "Tensor q"
-    ", Tensor latent_cache"
+    "Tensor(a!) q"
+    ", Tensor(b!) latent_cache"
     ", int num_contexts"
     ", Tensor cu_ctx_cached_kv_lens"
     ", Tensor cu_seq_lens"
     ", int max_input_uncached_seq_len"
     ", Tensor cos_sin_cache"
     ", int head_num"
     ", int nope_size"
     ", int rope_size"
     ", int lora_size"
     ", Tensor kv_cache_block_offsets"
     ", Tensor host_kv_cache_block_offsets"
     ", Tensor host_kv_cache_pool_pointers"
     ", Tensor host_kv_cache_pool_mapping"
     ", Tensor? kv_scale_orig_quant"
     ", Tensor? kv_scale_quant_orig"
     ", int layer_idx"
     ", int tokens_per_block"
     ", int attention_window_size"
     ", int sink_token_length"
     ", int beam_width"
     ", int quant_mode"
     ") -> ()");
```

tensorrt_llm/_torch/compilation/piecewise_optimizer.py (3)
1-1: Missing NVIDIA OSS header

Per repo guidelines, production Python files must include the NVIDIA copyright header with the current year.
214-216: Bug: output is undefined outside capture path

In the replay path, output isn't defined; use entry.output instead when collecting runtime output addresses. Apply this diff:

```diff
-        runtime_output_addresses = [
-            i.data_ptr() for i in output if isinstance(i, torch.Tensor)
-        ]
+        runtime_output_addresses = [
+            i.data_ptr() for i in entry.output if isinstance(i, torch.Tensor)
+        ]
```
238-238: Parameter shadowing: graph_pool_handle argument is ignored

piecewise_optimizer overwrites the passed-in graph_pool_handle. Either drop the parameter or respect it when provided. Minimal fix:

```diff
-    graph_pool_handle = torch.cuda.graph_pool_handle()
+    # Respect a provided handle; create one only if None.
+    if graph_pool_handle is None:
+        graph_pool_handle = torch.cuda.graph_pool_handle()
```

You'll also need to update the signature to accept Optional:

```diff
-def piecewise_optimizer(..., graph_pool_handle: tuple[int, int], ...):
+from typing import Optional, Tuple
+def piecewise_optimizer(..., graph_pool_handle: Optional[Tuple[int, int]], ...):
```

tensorrt_llm/_torch/utils.py (2)
1-1: Missing NVIDIA OSS header

Per repo guidelines, production Python files must include the NVIDIA copyright header with the current year.
265-271: Ensure per-request flag is set in every branch

It looks like in tensorrt_llm/_torch/pyexecutor/model_engine.py you call set_per_request_piecewise_cuda_graph_flag with:

- True in the first if branch (line 1571)
- False in the inner else (line 1589)
- False in the final else (line 1596)

but you never set it in the elif num_ctx_requests != 0: branch (around line 1591). Add an explicit disable there to avoid carrying over stale state on thread reuse.

- File: tensorrt_llm/_torch/pyexecutor/model_engine.py
- Location: around line 1591

Suggested change:

```diff
 elif num_ctx_requests != 0:
     padded_num_tokens = get_padded_piecewise_tokens(total_num_tokens)
+    # Disable piecewise CUDA graph for context-only requests
+    set_per_request_piecewise_cuda_graph_flag(False)
```

tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
1533-1600: Piecewise-CG flag is set to True unconditionally and can leak incorrect state across ranks

set_per_request_piecewise_cuda_graph_flag(True) is invoked before verifying that every rank can actually run piecewise CG. If later logic disables the feature (set_per_request_piecewise_cuda_graph_flag(False)), there is a window where other code executed in the same call stack (e.g., logging, debug hooks) will observe the flag as True, leading to inconsistent behaviour. Set the flag only after the conditions are validated:

```diff
-        set_per_request_piecewise_cuda_graph_flag(True)
         ...
-        else:
-            set_per_request_piecewise_cuda_graph_flag(False)
+        enable_pw = False
+        if attn_all_rank_num_tokens is not None:
+            enable_pw = (num_ctx_requests != 0
+                         and max(attn_all_rank_num_tokens) <= max_sizes)
+            enable_pw = enable_pw and all(self.dist.tp_allgather(enable_pw))
+        elif num_ctx_requests != 0:
+            enable_pw = True
+        set_per_request_piecewise_cuda_graph_flag(enable_pw)
```

This avoids transient incorrect global state.
🧹 Nitpick comments (9)
tensorrt_llm/_torch/attention_backend/interface.py (1)
138-140: Document semantics of padded_num_tokens and align with num_tokens invariants

The field is fine, but please tighten the docs to define it precisely relative to num_tokens and how downstream should use it (slice to num_tokens if padded_num_tokens is set). Also, it's easy to carry stale values across runs; consider resetting it when building CUDA graph metadata to avoid misuse. Apply the following clarifying comments:

```diff
-    # The number of tokens in the padded sequence.
+    # If not None, the temporary padded length used to stage inputs for piecewise CUDA graph batching.
+    # Downstream should treat `num_tokens` as the real token count and slice inputs/outputs to
+    # `num_tokens` whenever `padded_num_tokens` is set.
     padded_num_tokens: Optional[int] = None
```

And reset it when creating CUDA-graph metadata to prevent stale carry-over:

```diff
 @@ def create_cuda_graph_metadata(self,
-        cuda_graph_metadata.is_cuda_graph = True
+        cuda_graph_metadata.is_cuda_graph = True
+        # Reset any batch-local padding info for graph templates.
+        cuda_graph_metadata.padded_num_tokens = None
```

tensorrt_llm/_torch/speculative/eagle3.py (1)
283-288: raw_logits naming is misleading after cropping

raw_logits currently points to cropped logits, not the raw input. Either rename it or assign before cropping. Renaming is minimal and clear.

```diff
-        raw_logits = logits
+        logits_cropped = logits
 @@
-            'logits': raw_logits,
+            'logits': logits_cropped,
```

Also applies to: 382-388
tensorrt_llm/_torch/modules/logits_processor.py (1)
19-23: Tighten comment and (optionally) assert bounds

- The inline comment is truncated; clarify intent.
- Optional: assert that num_tokens does not exceed hidden_states rows to fail fast on bad metadata.

```diff
-        # Slicing out the actual
+        # Drop padded rows and keep only the actual tokens.
         if (attn_metadata is not None
                 and attn_metadata.padded_num_tokens is not None):
+            assert attn_metadata.num_tokens <= hidden_states.size(0), \
+                "attn_metadata.num_tokens exceeds hidden_states length"
             hidden_states = hidden_states[:attn_metadata.num_tokens, ...]
```

tensorrt_llm/_torch/compilation/piecewise_optimizer.py (1)
14-16: Import style deviates from namespace guideline

Guidelines request keeping module namespaces rather than importing individual functions. Consider importing the module and referencing its members (optional cleanup for consistency).
tensorrt_llm/_torch/utils.py (1)
265-271: Add docstrings and ensure per-request flag is restored to previous value

- Add concise docstrings to clarify semantics and default behavior.
- Consider a context manager to set and automatically restore the prior value to avoid leakage across requests when threads are reused.

Apply this diff to add docstrings:

```diff
 def set_per_request_piecewise_cuda_graph_flag(enable: bool):
+    """Set per-request piecewise CUDA graph enablement.
+
+    Args:
+        enable: True to enable piecewise CUDA graph for this request, False to disable.
+    """
     _global_attrs.per_request_piecewise_cuda_graph_flag = enable
 @@
 def get_per_request_piecewise_cuda_graph_flag() -> bool:
+    """Get per-request piecewise CUDA graph enablement.
+
+    Returns:
+        The current per-request flag; defaults to True if unset.
+    """
     return getattr(_global_attrs, 'per_request_piecewise_cuda_graph_flag', True)
```

Additionally, consider adding this context manager in the same module:

```python
import contextlib

@contextlib.contextmanager
def per_request_piecewise_flag(enabled: bool):
    """Context manager to set and restore the per-request piecewise flag."""
    prev = getattr(_global_attrs, 'per_request_piecewise_cuda_graph_flag', True)
    _global_attrs.per_request_piecewise_cuda_graph_flag = enabled
    try:
        yield
    finally:
        _global_attrs.per_request_piecewise_cuda_graph_flag = prev
```

tests/integration/defs/accuracy/test_llm_api_pytorch.py (1)
1373-1378: Out-of-date inline comment

The comment says piecewise CUDA-graph cannot be enabled for NVFP4 + attention-DP, yet the code right below does exactly that when the cuda_graph flag is set. Please align the comment with the new behaviour or restore the guard.

tensorrt_llm/_torch/pyexecutor/model_engine.py (2)
583-650: Parameter naming & counting in get_warmup_request are misleading

num_gen_tokens is treated as the number of generation requests (token_nums=[1] * num_gen_tokens) rather than "tokens", while num_tokens represents the total tokens across context + generation. The mismatch makes the helper hard to reason about and risks future call-site bugs.

Recommendation: rename the second parameter to num_gen_reqs (or actually generate by token count), adjust the docstring, and add an assertion explaining the contract.
768-787: Duplicate captures when piecewise_batch_sizes contains repeated values

Because self.piecewise_batch_sizes may include duplicates (see earlier comment), the warm-up loop will capture identical graphs multiple times, wasting several seconds per warm-up and additional GPU memory:

```python
for seq_lens in self.piecewise_batch_sizes:
    ...
```

Iterate over a set or the deduplicated list suggested above to avoid redundant work.

tensorrt_llm/_torch/modules/attention.py (1)
990-992: Redundant runtime assert

The earlier padding branch already guarantees output is not None; re-asserting here is unnecessary and slightly hurts JITability.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (12)
- cpp/tensorrt_llm/thop/mlaPreprocessOp.cpp (1 hunks)
- tensorrt_llm/_torch/attention_backend/interface.py (1 hunks)
- tensorrt_llm/_torch/compilation/piecewise_optimizer.py (2 hunks)
- tensorrt_llm/_torch/modules/attention.py (14 hunks)
- tensorrt_llm/_torch/modules/logits_processor.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/config.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/model_engine.py (7 hunks)
- tensorrt_llm/_torch/speculative/eagle3.py (1 hunks)
- tensorrt_llm/_torch/speculative/mtp.py (2 hunks)
- tensorrt_llm/_torch/utils.py (1 hunks)
- tests/integration/defs/accuracy/test_llm_api_pytorch.py (6 hunks)
- tests/unittest/_torch/test_custom_ops.py (0 hunks)
💤 Files with no reviewable changes (1)
- tests/unittest/_torch/test_custom_ops.py
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the class docstring.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
tensorrt_llm/_torch/modules/logits_processor.py
tensorrt_llm/_torch/attention_backend/interface.py
tensorrt_llm/_torch/compilation/piecewise_optimizer.py
tensorrt_llm/_torch/pyexecutor/config.py
tensorrt_llm/_torch/speculative/eagle3.py
tensorrt_llm/_torch/utils.py
tensorrt_llm/_torch/speculative/mtp.py
tests/integration/defs/accuracy/test_llm_api_pytorch.py
tensorrt_llm/_torch/pyexecutor/model_engine.py
tensorrt_llm/_torch/modules/attention.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
tensorrt_llm/_torch/modules/logits_processor.py
tensorrt_llm/_torch/attention_backend/interface.py
tensorrt_llm/_torch/compilation/piecewise_optimizer.py
tensorrt_llm/_torch/pyexecutor/config.py
tensorrt_llm/_torch/speculative/eagle3.py
tensorrt_llm/_torch/utils.py
tensorrt_llm/_torch/speculative/mtp.py
tests/integration/defs/accuracy/test_llm_api_pytorch.py
cpp/tensorrt_llm/thop/mlaPreprocessOp.cpp
tensorrt_llm/_torch/pyexecutor/model_engine.py
tensorrt_llm/_torch/modules/attention.py
**/*.{cpp,h,hpp,cc,cxx}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,h,hpp,cc,cxx}
: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo).
Prefer const or constexpr variables over #defines whenever possible.
A variable that is not modified after its initialization should be declared as const.
Except 0 (used for checking signness/existence/emptiness), nullptr, true, false, all other literals should only be used for variable initialization.
Use the Allman indentation style for braces in C++ code.
Put the semicolon for an empty for or while loop in a new line.
The statement forming the body of a switch, while, do..while, or for statement shall be a compound statement (use brace-delimited statements).
If and else should always be followed by brace-delimited statements, even if empty or a single statement.
C++ filenames should use camel case with the first letter lowercase (e.g., thisIsAFilename.cpp), and all files involved in a compilation target must have case-insensitive unique filenames.
All types (including class names) should use camel case with uppercase first letter (e.g., FooBarClass).
Local variables, methods, and namespaces should use camel case with first letter lowercase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not defined in anonymous namespace should use camel case prefixed by 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number global variables that are static or defined in an anonymous namespace should use camel case prefixed by 's' (e.g., sMutableStaticGlobal).
Locally visible static variables should use camel case with lowercase prefix 's' as the first letter (e.g., static std::once_flag sFlag;).
Class member variables should use camel case prefixed with 'm' (e.g., mNbFooValues). Public member variables do not require the 'm' prefix but it is encouraged for clarity.
Enumerations, global constants, static constants at class-scope, and function-scope magic-number/literal constants should be uppercase snake case with prefix...
Files:
cpp/tensorrt_llm/thop/mlaPreprocessOp.cpp
🧠 Learnings (8)
📓 Common learnings
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6728
File: cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp:966-966
Timestamp: 2025-08-08T04:10:18.987Z
Learning: TensorRT plugins currently don't support padding functionality, and TensorRT is not getting new features (in maintenance mode). This means that duplicating parameters like mExpertHiddenSize in function calls, even with TODO comments, can be acceptable as pragmatic solutions within these constraints.
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
📚 Learning: 2025-08-08T04:10:18.987Z
Learnt from: djns99
PR: NVIDIA/TensorRT-LLM#6728
File: cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp:966-966
Timestamp: 2025-08-08T04:10:18.987Z
Learning: TensorRT plugins currently don't support padding functionality, and TensorRT is not getting new features (in maintenance mode). This means that duplicating parameters like mExpertHiddenSize in function calls, even with TODO comments, can be acceptable as pragmatic solutions within these constraints.
Applied to files:
tensorrt_llm/_torch/modules/logits_processor.py
tensorrt_llm/_torch/attention_backend/interface.py
tensorrt_llm/_torch/speculative/eagle3.py
tensorrt_llm/_torch/speculative/mtp.py
cpp/tensorrt_llm/thop/mlaPreprocessOp.cpp
tensorrt_llm/_torch/modules/attention.py
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Applied to files:
tensorrt_llm/_torch/modules/logits_processor.py
tensorrt_llm/_torch/compilation/piecewise_optimizer.py
cpp/tensorrt_llm/thop/mlaPreprocessOp.cpp
tensorrt_llm/_torch/modules/attention.py
📚 Learning: 2025-07-22T09:22:14.726Z
Learnt from: yechank-nvidia
PR: NVIDIA/TensorRT-LLM#6254
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:1201-1204
Timestamp: 2025-07-22T09:22:14.726Z
Learning: In TensorRT-LLM's multimodal processing pipeline, shared tensor recovery using `from_shared_tensor()` is only needed during the context phase. Generation requests reuse the already-recovered tensor data and only need to call `strip_for_generation()` to remove unnecessary multimodal data while preserving the recovered tensors. This avoids redundant tensor recovery operations during generation.
Applied to files:
tensorrt_llm/_torch/modules/logits_processor.py
tensorrt_llm/_torch/compilation/piecewise_optimizer.py
tensorrt_llm/_torch/speculative/mtp.py
cpp/tensorrt_llm/thop/mlaPreprocessOp.cpp
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tensorrt_llm/_torch/compilation/piecewise_optimizer.py
tensorrt_llm/_torch/pyexecutor/config.py
tests/integration/defs/accuracy/test_llm_api_pytorch.py
tensorrt_llm/_torch/pyexecutor/model_engine.py
tensorrt_llm/_torch/modules/attention.py
📚 Learning: 2025-08-01T15:14:45.673Z
Learnt from: yibinl-nvidia
PR: NVIDIA/TensorRT-LLM#6506
File: examples/models/core/mixtral/requirements.txt:3-3
Timestamp: 2025-08-01T15:14:45.673Z
Learning: In TensorRT-LLM, examples directory can have different dependency versions than the root requirements.txt file. Version conflicts between root and examples dependencies are acceptable because examples are designed to be standalone and self-contained.
Applied to files:
tensorrt_llm/_torch/compilation/piecewise_optimizer.py
tensorrt_llm/_torch/modules/attention.py
📚 Learning: 2025-07-22T08:33:49.109Z
Learnt from: yiqingy0
PR: NVIDIA/TensorRT-LLM#5198
File: jenkins/mergeWaiveList.py:0-0
Timestamp: 2025-07-22T08:33:49.109Z
Learning: In the TensorRT-LLM waive list merging system, removed lines are always located at the end of the merge waive lists, which is why the mergeWaiveList.py script uses reverse traversal - it's an optimization for this specific domain constraint.
Applied to files:
cpp/tensorrt_llm/thop/mlaPreprocessOp.cpp
📚 Learning: 2025-08-08T05:06:31.537Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:36-36
Timestamp: 2025-08-08T05:06:31.537Z
Learning: CUTLASS extension files (under cpp/tensorrt_llm/cutlass_extensions/) follow CUTLASS coding style conventions, including using #pragma once instead of TRTLLM_ prefixed header guards, even though they are .hpp files.
Applied to files:
cpp/tensorrt_llm/thop/mlaPreprocessOp.cpp
🔇 Additional comments (5)
tensorrt_llm/_torch/pyexecutor/config.py (1)
70-72: Ignore default flips for iteration stats/logs in PyExecutor config

The enable_iter_req_stats and print_iter_log defaults you spotted in tensorrt_llm/_torch/pyexecutor/config.py only apply to the low-level PyExecutor's standalone config and are overridden by the CLI-level defaults (in llm_args.py) and by the autodeploy shim. Moreover, because enable_iter_perf_stats remains False by default, request stats aren't collected unless explicitly enabled alongside perf stats. All CLI, tests, examples, and docs still use the previous behavior; no changes to tests or docs are necessary.

Likely an incorrect or invalid review comment.
tensorrt_llm/_torch/compilation/piecewise_optimizer.py (1)
159-163: LGTM: correct gating with per-request flag

Adding the per-request flag to the fallback condition is correct and keeps semantics consistent with the global flag.
tests/integration/defs/accuracy/test_llm_api_pytorch.py (1)
92-95: Re-check piecewise CUDA-graph + attention-DP support and remove stale comment

- tests/integration/defs/accuracy/test_llm_api_pytorch.py line 1369: remove or update the outdated "# Picewise Cuda Graph cannot be enabled for nvfp4 attention dp." comment
- Audit every test that now unconditionally passes enable_piecewise_cuda_graph (e.g. all torch_compile_config blocks in tests for bfloat16, fp8, auto-dtype, nvfp4, nvfp4_4gpus, MoE, etc.) and confirm the runtime truly supports enable_attention_dp=True + enable_piecewise_cuda_graph=True for all quant/attn backends (NVFP4, FP8, MoE, …)
- For any unsupported combination, add or adjust skip/xfail marks; otherwise remove or update any other stale comments to keep the matrix accurate

tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
1601-1605: Potential mismatch between padded inputs and metadata sequence lengths

input_ids_cuda/position_ids_cuda are zero-padded up to padded_num_tokens, but attn_metadata.seq_lens (when not using a CUDA graph) and the sequence_lengths handed to spec_metadata still reflect the unpadded lengths. Downstream kernels that rely on padded_num_tokens for slicing but use seq_lens for loops may read past valid data.

Verify that all kernels exclusively use padded_num_tokens when it is set, or update seq_lens accordingly.

tensorrt_llm/_torch/modules/attention.py (1)
932-947: Update docstring & return type after making output mandatory

forward_impl now returns None and mutates output in-place, but the docstring (lines 936-947) and type hint still claim it returns torch.Tensor. This can mislead callers and static analysers. Please adjust:

```diff
     def forward_impl(..., output: torch.Tensor) -> None:
         ...
-    Returns:
-        torch.Tensor: The output tensor.
+    Returns:
+        None. Results are written in-place to `output`.
```
PR_Github #14601 [ run ] completed with state
367a9b1 to efa94ee
Actionable comments posted: 4
🔭 Outside diff range comments (1)
tensorrt_llm/_torch/modules/attention.py (1)
932-959: MLA: Slice position_ids along the last dimension under padding

position_ids is typically [1, T]. Using [:num_tokens, ...] slices the first dimension. Prefer ellipsis slicing.

```diff
     if padded_num_tokens is not None:
         hidden_states = hidden_states[:num_tokens, ...]
         if position_ids is not None:
-            position_ids = position_ids[:num_tokens, ...]
+            position_ids = position_ids[..., :num_tokens]
```
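The difference is easy to see in NumPy, whose basic slicing semantics match torch here (the shapes below are illustrative):

```python
import numpy as np

position_ids = np.arange(8).reshape(1, 8)  # typical [1, T] layout

# Leading-dim slice: dim 0 has size 1, so nothing is trimmed.
assert position_ids[:4, ...].shape == (1, 8)

# Ellipsis slice trims the last (token) dimension, as intended.
assert position_ids[..., :4].shape == (1, 4)
```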
♻️ Duplicate comments (1)
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
332-337: Defensive-copy and dedupe piecewise batch sizes to avoid side-effects and duplicates

Same issue as previously reported: assigning and mutating the caller-provided list can cause hidden side effects and duplicates.

```diff
-        self._torch_compile_piecewise_cuda_graph = pytorch_backend_config.torch_compile_piecewise_cuda_graph
-
-        self.piecewise_batch_sizes = pytorch_backend_config.cuda_graph_batch_sizes
-        if self.piecewise_batch_sizes is not None:
-            self.piecewise_batch_sizes += [2048, 8192]
+        self._torch_compile_piecewise_cuda_graph = (
+            pytorch_backend_config.torch_compile_piecewise_cuda_graph
+        )
+
+        base_sizes = pytorch_backend_config.cuda_graph_batch_sizes or []
+        # Own a copy, remove dups, keep ascending order.
+        self.piecewise_batch_sizes = sorted({*base_sizes, 2048, 8192})
```
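A quick illustration of why the set-based rewrite is safer; the list values here are made up:

```python
base_sizes = [1, 2, 4, 8, 2048]  # stands in for cuda_graph_batch_sizes

# An in-place `+=` would mutate the caller's list and leave a duplicate 2048.
# The set comprehension owns a copy, dedupes, and keeps ascending order.
piecewise_batch_sizes = sorted({*base_sizes, 2048, 8192})

assert piecewise_batch_sizes == [1, 2, 4, 8, 2048, 8192]
assert base_sizes == [1, 2, 4, 8, 2048]  # caller's list untouched
```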
🧹 Nitpick comments (1)
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
781-802: Piecewise CUDA graph warmup loop is good; ensure capture set is unique/ordered

Warmup iteration over piecewise sizes is sound. Sorting/uniqueness is important to avoid redundant captures; the previous comment's fix to dedupe/sort self.piecewise_batch_sizes will make this robust.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
- cpp/tensorrt_llm/thop/mlaPreprocessOp.cpp (1 hunks)
- tensorrt_llm/_torch/attention_backend/interface.py (1 hunks)
- tensorrt_llm/_torch/compilation/piecewise_optimizer.py (2 hunks)
- tensorrt_llm/_torch/models/modeling_deepseekv3.py (2 hunks)
- tensorrt_llm/_torch/models/modeling_speculative.py (2 hunks)
- tensorrt_llm/_torch/modules/attention.py (14 hunks)
- tensorrt_llm/_torch/pyexecutor/model_engine.py (7 hunks)
- tensorrt_llm/_torch/utils.py (1 hunks)
- tests/integration/defs/accuracy/test_llm_api_pytorch.py (6 hunks)
- tests/unittest/_torch/test_custom_ops.py (0 hunks)
💤 Files with no reviewable changes (1)
- tests/unittest/_torch/test_custom_ops.py
🚧 Files skipped from review as they are similar to previous changes (5)
- cpp/tensorrt_llm/thop/mlaPreprocessOp.cpp
- tensorrt_llm/_torch/attention_backend/interface.py
- tensorrt_llm/_torch/compilation/piecewise_optimizer.py
- tests/integration/defs/accuracy/test_llm_api_pytorch.py
- tensorrt_llm/_torch/utils.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the class docstring.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
tensorrt_llm/_torch/models/modeling_deepseekv3.py
tensorrt_llm/_torch/models/modeling_speculative.py
tensorrt_llm/_torch/pyexecutor/model_engine.py
tensorrt_llm/_torch/modules/attention.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
tensorrt_llm/_torch/models/modeling_deepseekv3.py
tensorrt_llm/_torch/models/modeling_speculative.py
tensorrt_llm/_torch/pyexecutor/model_engine.py
tensorrt_llm/_torch/modules/attention.py
🧬 Code Graph Analysis (3)
tensorrt_llm/_torch/models/modeling_deepseekv3.py (3)
- cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (1): hidden_states (2160-2160)
- tensorrt_llm/_torch/attention_backend/interface.py (1): num_tokens (277-278)
- cpp/tensorrt_llm/thop/dsv3RouterGemmOp.cpp (4): num_tokens (35-47), num_tokens (35-36), num_tokens (53-65), num_tokens (53-54)

tensorrt_llm/_torch/models/modeling_speculative.py (2)
- cpp/tests/unit_tests/kernels/mixtureOfExpertsTest.cu (1): hidden_states (2160-2160)
- tensorrt_llm/_torch/attention_backend/interface.py (1): num_tokens (277-278)

tensorrt_llm/_torch/pyexecutor/model_engine.py (7)
- tensorrt_llm/_torch/utils.py (3): set_per_request_piecewise_cuda_graph_flag (265-266), set_torch_compiling (27-29), with_model_extra_attrs (61-71)
- tensorrt_llm/_torch/pyexecutor/resource_manager.py (2): add_dummy_requests (67-68), add_dummy_requests (429-486)
- tensorrt_llm/_torch/attention_backend/trtllm.py (2): max_seq_len (565-575), max_seq_len (578-582)
- tensorrt_llm/_torch/pyexecutor/scheduler.py (2): batch_size (35-36), ScheduledRequests (18-39)
- tensorrt_llm/_torch/speculative/mtp.py (3): add_dummy_requests (83-85), all_rank_num_seqs (148-149), all_rank_num_seqs (152-155)
- tensorrt_llm/_torch/distributed/communicator.py (1): tp_allgather (138-139)
- tensorrt_llm/_torch/speculative/interface.py (2): all_rank_num_tokens (192-193), all_rank_num_tokens (196-199)
🔇 Additional comments (3)
tensorrt_llm/_torch/models/modeling_speculative.py (1)
Lines 384-386: Hidden states crop on padding looks correct.

Cropping to `attn_metadata.num_tokens` ensures logits indexing matches sequence lengths. No issues here.
tensorrt_llm/_torch/models/modeling_deepseekv3.py (1)
Lines 1187-1189: Hidden states crop on padding looks correct.

Aligns the token count with downstream MTP/logits processing. Good.
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
Lines 594-661: Warmup request builder: verify KV space checks for multi-request context payloads.

You gate by total token budget (`get_num_available_tokens`) and `batch_size`, which is generally fine. However, the old path accounted for `tokens_per_block` per request to prevent underestimation. Please verify that the `get_num_available_tokens` semantics match multi-request context allocations; otherwise, consider incorporating a per-request blocks check for ctx requests as before.

Would you like me to generate a quick instrumentation patch to assert sufficient blocks per ctx request at warmup time?
Force-pushed from efa94ee to c03de1e
Actionable comments posted: 0
♻️ Duplicate comments (4)
tensorrt_llm/_torch/pyexecutor/model_engine.py (2)
Lines 337-341: Fix list mutation to avoid side effects.

The code directly mutates the list from `pytorch_backend_config.cuda_graph_batch_sizes`, which can cause unexpected side effects on the caller's config object and allows duplicates.

Apply this fix to avoid side effects and ensure uniqueness:

```diff
-        self.piecewise_batch_sizes = list(
-            pytorch_backend_config.cuda_graph_batch_sizes)
-        if self.piecewise_batch_sizes is not None:
-            self.piecewise_batch_sizes += [2048, 8192]
+        base_sizes = pytorch_backend_config.cuda_graph_batch_sizes or []
+        # Create a copy, deduplicate, and maintain sorted order
+        self.piecewise_batch_sizes = sorted(set(base_sizes) | {2048, 8192}) if base_sizes else None
```
Lines 1599-1643: Handle the edge case when tokens exceed all captured sizes.

When `total_num_tokens` exceeds all captured batch sizes and we're not using DP, `get_padded_piecewise_tokens` returns the original token count unchanged. The per-request flag should be disabled in this case to avoid attempting to use a non-existent piecewise graph.

Apply this fix to handle the edge case:

```diff
         elif num_ctx_requests != 0:
-            padded_num_tokens = get_padded_piecewise_tokens(
-                total_num_tokens)
+            padded_num_tokens = get_padded_piecewise_tokens(total_num_tokens)
+            # Disable per-request flag if no captured graph can handle this token count
+            if padded_num_tokens == total_num_tokens and total_num_tokens > max(
+                    self._torch_compile_backend.cuda_graph_batch_sizes):
+                set_per_request_piecewise_cuda_graph_flag(False)
```

tensorrt_llm/_torch/modules/attention.py (2)
Lines 340-354: Fix potential issues with padding and NVFP4 output handling.

When padding is enabled:
- The code asserts that `output` is not None, but `forward()` may call `_attn_impl` with `output=None` in the non-custom-op path.
- The assertion `output_sf is None` prevents NVFP4 scale factor handling when padding is used.

Consider this more robust approach:

```diff
         if padded_num_tokens is not None:
             assert q.shape[0] == padded_num_tokens
             q = q[:num_tokens, :]
             if k is not None:
                 assert k.shape[0] == padded_num_tokens
                 k = k[:num_tokens, :]
             if v is not None:
                 assert v.shape[0] == padded_num_tokens
                 v = v[:num_tokens, :]
-            assert output is not None
-            assert output_sf is None
+            # When padding is enabled, we need an output buffer
+            if output is None:
+                output = self.create_output(q)
+            # Disable NVFP4 scale output when padding to avoid complexity
+            enable_attn_nvfp4_output = False
```
Lines 385-397: Incorrect tensor used for NVFP4 scale zero-padding.

The code zeros `output_sf` (the input parameter) instead of the returned scale tensor when `attn_output` is a tuple. This means the padded region of the actual returned scale tensor is never zeroed.

Fix the zero-padding logic:

```diff
         if isinstance(attn_output, tuple):
             assert len(
                 attn_output
             ) == 2, "attn_output should be a tuple of (output, output_sf)"
-            return attn_output[0], attn_output[1]
+            output_tensor, scale_tensor = attn_output
+            if padded_num_tokens is not None and scale_tensor is not None:
+                pad_up = fp4_utils.pad_up(num_tokens, 128)
+                if scale_tensor.shape[0] > pad_up:
+                    scale_tensor[pad_up:].fill_(0)
+            return output_tensor, scale_tensor
         if output is not None and output.shape[0] != num_tokens:
             output[num_tokens:].fill_(0)
-        if output_sf is not None and output_sf.shape[0] != fp4_utils.pad_up(
-                num_tokens, 128):
-            output_sf[fp4_utils.pad_up(num_tokens, 128):].fill_(0)
         return attn_output, None
```
🧹 Nitpick comments (4)
tensorrt_llm/_torch/pyexecutor/model_engine.py (3)
Lines 602-681: Simplify warmup request construction logic.

The nested `get_warmup_request` function has complex logic with multiple early returns and calculations. Consider extracting some of the calculations into helper methods for better readability and testability, e.g. the block calculation:

```python
def _calculate_blocks_needed(self, num_tokens: int, kv_cache_manager) -> int:
    """Calculate the number of blocks needed for the given number of tokens."""
    max_seq_len = self.max_seq_len - 1
    num_full_seqs = num_tokens // max_seq_len
    num_leftover = num_tokens % max_seq_len
    blocks = num_full_seqs * math.ceil(max_seq_len / kv_cache_manager.tokens_per_block)
    if num_leftover > 0:
        blocks += math.ceil(num_leftover / kv_cache_manager.tokens_per_block)
    return blocks
```
Lines 1605-1613: Extract padding logic to improve readability.

The `get_padded_piecewise_tokens` function could be moved out as a utility function or class method since it doesn't depend on local variables. Extract as a static method:

```python
@staticmethod
def _get_padded_piecewise_tokens(tokens: int, captured_sizes: list[int]) -> int:
    """Find the smallest captured size that can accommodate the token count."""
    for size in sorted(captured_sizes):
        if size >= tokens:
            return size
    return tokens
```
Lines 1692-1702: Simplify metadata gathering for attention DP.

The spec metadata gathering logic for attention DP could be simplified by using tuple unpacking more effectively:

```diff
-            if self.enable_attention_dp:
-                all_rank_num_tokens = self.dist.tp_allgather(
-                    [spec_metadata.num_tokens,
-                     len(sequence_lengths)])
-
-                spec_all_rank_num_tokens = [
-                    item[0] for item in all_rank_num_tokens
-                ]
-                all_rank_num_seqs = [item[1] for item in all_rank_num_tokens]
+            if self.enable_attention_dp:
+                gathered_data = self.dist.tp_allgather(
+                    [spec_metadata.num_tokens, len(sequence_lengths)])
+                spec_all_rank_num_tokens, all_rank_num_seqs = zip(*gathered_data)
+                spec_all_rank_num_tokens = list(spec_all_rank_num_tokens)
+                all_rank_num_seqs = list(all_rank_num_seqs)
```

tensorrt_llm/_torch/modules/attention.py (1)
Lines 932-935: Enforce the output requirement in MLA forward_impl.

The signature change to require `output: torch.Tensor` is good, but the assertion message could be more informative about why output is required:

```diff
-        assert output is not None, "output must be provided"
+        assert output is not None, "Pre-allocated output buffer is required for MLA forward_impl to support piecewise CUDA graphs and padding"
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
- cpp/tensorrt_llm/thop/mlaPreprocessOp.cpp (1 hunks)
- tensorrt_llm/_torch/attention_backend/interface.py (1 hunks)
- tensorrt_llm/_torch/compilation/piecewise_optimizer.py (2 hunks)
- tensorrt_llm/_torch/models/modeling_deepseekv3.py (2 hunks)
- tensorrt_llm/_torch/models/modeling_speculative.py (2 hunks)
- tensorrt_llm/_torch/modules/attention.py (14 hunks)
- tensorrt_llm/_torch/pyexecutor/model_engine.py (7 hunks)
- tensorrt_llm/_torch/utils.py (1 hunks)
- tests/integration/defs/accuracy/test_llm_api_pytorch.py (6 hunks)
- tests/unittest/_torch/test_custom_ops.py (0 hunks)
💤 Files with no reviewable changes (1)
- tests/unittest/_torch/test_custom_ops.py
🚧 Files skipped from review as they are similar to previous changes (6)
- tensorrt_llm/_torch/attention_backend/interface.py
- tensorrt_llm/_torch/utils.py
- tensorrt_llm/_torch/models/modeling_deepseekv3.py
- tests/integration/defs/accuracy/test_llm_api_pytorch.py
- cpp/tensorrt_llm/thop/mlaPreprocessOp.cpp
- tensorrt_llm/_torch/models/modeling_speculative.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in `__init__`
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tensorrt_llm/_torch/compilation/piecewise_optimizer.py
tensorrt_llm/_torch/modules/attention.py
tensorrt_llm/_torch/pyexecutor/model_engine.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tensorrt_llm/_torch/compilation/piecewise_optimizer.py
tensorrt_llm/_torch/modules/attention.py
tensorrt_llm/_torch/pyexecutor/model_engine.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (3)
tensorrt_llm/_torch/compilation/piecewise_optimizer.py (2)
Lines 14-16: LGTM!

The import of `get_per_request_piecewise_cuda_graph_flag` is correctly added alongside the other related utility functions, maintaining consistent import organization.
Lines 159-163: LGTM! Per-request piecewise CUDA graph control properly implemented.

The early-return condition now checks both the global and per-request piecewise CUDA graph flags, allowing fine-grained control of piecewise CUDA graph execution on a per-request basis.
tensorrt_llm/_torch/modules/attention.py (1)
Lines 1028-1030: Good padding handling for MLA output.

The zero-filling of the padded region in the MLA output ensures that padded tokens don't affect downstream computations.
Force-pushed from b3d1008 to dff6f2a
/bot run
PR_Github #15759 [ run ] triggered by Bot
PR_Github #15759 [ run ] completed with state
LGTM for compile part and padding part handling
Force-pushed from dff6f2a to c3215fc
/bot run
PR_Github #15773 [ run ] triggered by Bot
PR_Github #15773 [ run ] completed with state
/bot run
PR_Github #16354 [ run ] triggered by Bot
Force-pushed from d6a5194 to 09dca6a
Force-pushed from 350dc67 to be1ca8b
/bot run
PR_Github #16437 [ run ] triggered by Bot
PR_Github #16354 [ run ] completed with state
PR_Github #16437 [ run ] completed with state
Per discussion, Yuxian reviewed the pyexecutor changes - approving the changes to modeling_speculative.py on that basis.
Force-pushed from be1ca8b to 902b408
Signed-off-by: Jin Li <[email protected]>
Force-pushed from 902b408 to 8dbea32
/bot run
PR_Github #16527 [ run ] triggered by Bot
PR_Github #16527 [ run ] completed with state
Signed-off-by: Jin Li <[email protected]>
/bot run
PR_Github #16549 [ run ] triggered by Bot
PR_Github #16549 [ run ] completed with state
Summary by CodeRabbit
New Features
Performance
Bug Fixes
Tests
Description
Test Coverage
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run
/bot [-h|--help]
to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
- `--reuse-test (optional)pipeline-id` (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- `--disable-reuse-test` (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests are run regardless of previous successes.
- `--disable-fail-fast` (OPTIONAL) : Disable fail fast on build/tests/infra failures.
- `--skip-test` (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- `--stage-list "A10-PyTorch-1, xxx"` (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- `--gpu-type "A30, H100_PCIe"` (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- `--test-backend "pytorch, cpp"` (OPTIONAL) : Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- `--only-multi-gpu-test` (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--disable-multi-gpu-test` (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--add-multi-gpu-test` (OPTIONAL) : Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.
- `--post-merge` (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- `--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"` (OPTIONAL) : Run the ordinary L0 pre-merge pipeline plus the specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- `--detailed-log` (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- `--debug` (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the `--stage-list` parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see `docs/source/reference/ci-overview.md` and the `scripts/test_to_stage_mapping.py` helper.

kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request.
`--comment "Reason for skipping build/test"` is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.