[#6120][feat] AutoDeploy: flexible args for sequence interface + AD multi-modal input processor + llama4 VLM example #7221
Conversation
📝 Walkthrough

Refactors AutoDeploy to add multimodal input support and dynamic-shape hooks: SequenceInfo internal redesign, factory-driven input/processor hooks, ADInputProcessor, multimodal data plumbing through executors, a vision patch for Llama4, prompt-format normalization, export patch selection changes, and related tests/typing updates.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    actor User
    participant LLM
    participant Factory as ModelFactory
    participant Processor as ADInputProcessor
    participant Tokenizer as HF Processor/Tokenizer
    User->>LLM: instantiate(args)
    LLM->>Factory: factory = args.create_factory()
    LLM->>Factory: init_tokenizer()
    Factory-->>LLM: tokenizer
    LLM->>Factory: init_processor()
    Factory-->>LLM: processor
    LLM->>Processor: _create_input_processor(tokenizer, processor)
    User->>LLM: generate(inputs, sampling_params)
    LLM->>Processor: __call__(inputs, sampling_params)
    Processor->>Tokenizer: apply_chat_template / encode
    Tokenizer-->>Processor: token_ids[, pixel_values]
    Processor-->>LLM: token_ids, extra_multimodal
```

```mermaid
sequenceDiagram
    autonumber
    participant Engine as ADEngine/DemoEngine
    participant Factory as ModelFactory
    participant SeqInfo as SequenceInfo
    participant KV as KV Cache Manager
    Engine->>Factory: get_extra_inputs()
    Factory-->>Engine: {name: (none_input, dyn_shape_cb)}*
    Engine->>SeqInfo: add_extra_arg(name, none_input, cb)*
    Engine->>KV: get_cache_indices(request)
    KV-->>Engine: page_assignments
    Engine->>SeqInfo: nest_sequences(input_ids, input_pos, page_assignments, **extra_args)
    alt new tokens produced
        Engine->>SeqInfo: rescatter_input_ids(new_tokens, gather_idx, scatter_ref)
    end
```

```mermaid
sequenceDiagram
    autonumber
    participant Export as export_to_gm
    participant IFace as Export Interface
    participant Patch as Patch Registry
    Export->>IFace: apply_export_patches(patch_configs, patch_list)
    IFace->>IFace: resolve configs or derive from patch_list / defaults
    IFace->>Patch: create patches from resolved configs
    IFace->>Patch: apply patches (context)
    Export-->>Export: perform export under patch context
    IFace-->>Patch: revert patches on exit
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (8)
tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py (1)
74-83
: Bug: enable_gqa argument is ignored (hard-coded to False).The public parameter is never honored; you always pass enable_gqa=False to F.scaled_dot_product_attention. This silently disables GQA even when requested by callers.
Minimal fix:
```diff
     scale=scale,
-    enable_gqa=False,
+    enable_gqa=enable_gqa,
 )
```

Optional backward-compatible safeguard (if older torch may not accept the kwarg):
```diff
-    return F.scaled_dot_product_attention(
-        query.contiguous(),
-        key.contiguous(),
-        value.contiguous(),
-        attn_mask=attn_mask,
-        dropout_p=dropout_p,
-        is_causal=is_causal,
-        scale=scale,
-        enable_gqa=enable_gqa,
-    )
+    try:
+        return F.scaled_dot_product_attention(
+            query.contiguous(),
+            key.contiguous(),
+            value.contiguous(),
+            attn_mask=attn_mask,
+            dropout_p=dropout_p,
+            is_causal=is_causal,
+            scale=scale,
+            enable_gqa=enable_gqa,
+        )
+    except TypeError:
+        if enable_gqa:
+            raise NotImplementedError(
+                "enable_gqa=True requested, but installed torch does not support the 'enable_gqa' kwarg."
+            )
+        # Fallback: call without the kwarg
+        return F.scaled_dot_product_attention(
+            query.contiguous(),
+            key.contiguous(),
+            value.contiguous(),
+            attn_mask=attn_mask,
+            dropout_p=dropout_p,
+            is_causal=is_causal,
+            scale=scale,
+        )
```

tensorrt_llm/executor/worker.py (1)
433-436
: Potential AttributeError when treating "_autodeploy" as PyTorch-like.Now that _is_pytorch_backend is True for "_autodeploy", accessing self._executor_config.pytorch_backend_config may fail if that attribute is absent on autodeploy configs. Guard with getattr and default to enabling overlap unless explicitly disabled.
```diff
-        is_overlap_enabled = self._is_pytorch_backend and not self._executor_config.pytorch_backend_config.disable_overlap_scheduler
+        pytorch_cfg = getattr(self._executor_config, "pytorch_backend_config", None)
+        disable_overlap = getattr(pytorch_cfg, "disable_overlap_scheduler", False)
+        is_overlap_enabled = self._is_pytorch_backend and not disable_overlap
```

Optionally, add a short comment indicating the default behavior for autodeploy when this sub-config is missing.
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py (1)
1-1
: Add NVIDIA Apache-2.0 copyright header (2025) at file top.Per repo guidelines, prepend the standard NVIDIA Apache-2.0 header to all Python sources, including tests.
Apply this diff at the very top:
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# http://www.apache.org/licenses/LICENSE-2.0 +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License.tests/unittest/_torch/auto_deploy/integration/test_llama4_vlm_export.py (1)
1-1
: Add NVIDIA Apache-2.0 copyright header (2025) at file top.Please prepend the standard NVIDIA Apache-2.0 header.
Apply this diff at the top:
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# http://www.apache.org/licenses/LICENSE-2.0 +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License.tensorrt_llm/_torch/auto_deploy/transform/library/export_to_gm.py (1)
1-1
: Add NVIDIA Apache-2.0 copyright header (2025) at file top.Please prepend the standard header.
Apply this diff at the top:
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# http://www.apache.org/licenses/LICENSE-2.0 +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License.tests/unittest/_torch/auto_deploy/unit/singlegpu/shim/test_llm_config.py (1)
1-1
: Add NVIDIA Apache-2.0 copyright header (2025) at file top.Please prepend the standard header.
Apply this diff at the top:
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# http://www.apache.org/licenses/LICENSE-2.0 +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License.tensorrt_llm/_torch/auto_deploy/models/factory.py (1)
1-1
: Add NVIDIA Apache-2.0 copyright header (2025) at file top.Please prepend the standard header.
Apply this diff at the top:
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# http://www.apache.org/licenses/LICENSE-2.0 +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License.tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (1)
1-10
: Add NVIDIA Apache-2.0 header (compliance).Per repo guidelines, prepend the NVIDIA Apache-2.0 copyright header with the current year to all source files.
Apply this diff at the top of the file:
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + """ Attention Interface to handle various attention operators and cache operations.
🧹 Nitpick comments (30)
tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py (2)
1-2
: Add NVIDIA Apache-2.0 header (2025) at file top.Per repo guidelines, prepend the NVIDIA Apache-2.0 copyright header with the current year.
Apply:
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# http://www.apache.org/licenses/LICENSE-2.0 +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License.
86-91
: Fake op signature parity looks good; consider documenting enable_gqa here too.The fake now accepts enable_gqa, which keeps tracing and shape propagation stable. Add a brief note in the fake’s docstring that the flag is ignored and only affects real kernel selection.
tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_build_small_single.py (1)
1-1
: Add NVIDIA Apache-2.0 header at file top.Per repo guidelines, prepend the standard NVIDIA Apache-2.0 copyright header (year 2025) before the module docstring.
Apply this diff:
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# http://www.apache.org/licenses/LICENSE-2.0 +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License.tensorrt_llm/executor/worker.py (4)
1-1
: Add NVIDIA Apache-2.0 header at file top.Please prepend the standard header (year 2025).
Apply this diff:
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# http://www.apache.org/licenses/LICENSE-2.0 +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License.
84-86
: Minor: avoid list re-allocation for backend check.Use a constant set/tuple for membership to avoid per-init list construction; makes intent explicit.
```diff
-        self._is_pytorch_backend = getattr(self._executor_config, "backend",
-                                           None) in ["pytorch", "_autodeploy"]
+        PYTORCH_LIKE_BACKENDS = ("pytorch", "_autodeploy")
+        self._is_pytorch_backend = getattr(self._executor_config, "backend", None) in PYTORCH_LIKE_BACKENDS
```
520-521
: Optional: document py_scheduling_params shape/contract.Since PyTorch-like backends now include "_autodeploy", document constraints/expected fields for request.scheduling_params to avoid divergence between backends.
84-86
: Add a quick unit test to cover the autodeploy path. Consider adding a small test that instantiates GenerationExecutorWorker with executor_config.backend="_autodeploy" and validates that _is_pytorch_backend is True and that the overlap-flag logic does not raise.
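A minimal pytest sketch of that idea, exercising the guarded overlap logic as a standalone helper rather than the real GenerationExecutorWorker (the helper mirrors the diff suggested above; wiring it into the actual class is left to the repo's fixtures):

```python
from types import SimpleNamespace


def is_overlap_enabled(executor_config, is_pytorch_backend: bool) -> bool:
    """Guarded overlap check as proposed in the review (standalone sketch)."""
    pytorch_cfg = getattr(executor_config, "pytorch_backend_config", None)
    disable_overlap = getattr(pytorch_cfg, "disable_overlap_scheduler", False)
    return is_pytorch_backend and not disable_overlap


def test_autodeploy_overlap_flag_does_not_raise():
    # "_autodeploy" configs may not carry a pytorch_backend_config sub-config at all
    cfg = SimpleNamespace(backend="_autodeploy")
    assert is_overlap_enabled(cfg, is_pytorch_backend=True) is True


def test_pytorch_overlap_flag_respects_disable():
    cfg = SimpleNamespace(
        backend="pytorch",
        pytorch_backend_config=SimpleNamespace(disable_overlap_scheduler=True),
    )
    assert is_overlap_enabled(cfg, is_pytorch_backend=True) is False
```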
tensorrt_llm/_torch/auto_deploy/transformations/library/kvcache.py (4)
1-1
: Add NVIDIA Apache-2.0 header at file top.Please prepend the standard header (year 2025).
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# http://www.apache.org/licenses/LICENSE-2.0 +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License.
73-77
: Solid move to ordered named args; add defensive check for missing placeholders.Indexing input_nodes_mapping with unknown keys will raise a KeyError. Add a clear error if the graph is missing required placeholders—helps debugging miswired graphs/models.
- input_nodes_mapping = {n.target: n for n in input_nodes} - - # filtered and sorted for SequenceInfo arguments (input_ids, position_ids, etc.) - input_nodes_from_info = [input_nodes_mapping[k] for k in cm.info.named_standard_args.keys()] + input_nodes_mapping = {n.target: n for n in input_nodes} + required = list(cm.info.named_standard_args.keys()) + missing = [k for k in required if k not in input_nodes_mapping] + if missing: + raise ValueError(f"Missing required input placeholders in graph: {missing}. " + f"Available: {list(input_nodes_mapping.keys())}") + # filtered and sorted for SequenceInfo arguments (input_ids, position_ids, etc.) + input_nodes_from_info = [input_nodes_mapping[k] for k in required]
80-84
: Insertion anchor: prefer the first attention node over input_nodes[-1].next.Anchoring prepare_metadata right before the first source attention node makes dependency intent explicit and avoids corner-cases when graph input ordering changes.
- with graph.inserting_before(input_nodes[-1].next): + insertion_anchor = source_attn_nodes[0] if source_attn_nodes else input_nodes[-1].next + with graph.inserting_before(insertion_anchor): ret_node = graph.call_function( get_metadata, - args=(*input_nodes_from_info, *cm.info.extra_args_for_prepare_metadata), + args=(*input_nodes_from_info, *cm.info.extra_args_for_prepare_metadata), )
161-176
: KV cache resize: synchronize before/after measuring memory and clamp pages.Without synchronizing, mem_get_info can be noisy around kernel launches; also guard against degenerate division and non-positive page counts.
try: # Let's run a forward pass to get the memory usage - cm.info.set_max_num_tokens_sample() - free_mem_pre, _ = _get_mem_info_in_mb() + cm.info.set_max_num_tokens_sample() + torch.cuda.synchronize() + free_mem_pre, _ = _get_mem_info_in_mb() ad_logger.info(f"Free memory before forward pass (MB): {free_mem_pre}") egm(*cm.args) - free_mem_post, _ = _get_mem_info_in_mb() + torch.cuda.synchronize() + free_mem_post, _ = _get_mem_info_in_mb() ad_logger.info(f"Free memory after forward pass (MB): {free_mem_post}") memory_for_forward_pass = free_mem_pre - free_mem_post ad_logger.info(f"Memory for forward pass (MB): {memory_for_forward_pass}") - new_cache_size = free_mem_post * 1024 * 1024 * free_mem_ratio + current_cache_size - new_num_pages = int(new_cache_size // (current_cache_size // current_num_pages)) + bytes_per_mb = 1024 * 1024 + new_cache_size = int(free_mem_post * bytes_per_mb * free_mem_ratio + current_cache_size) + denom = max(1, current_cache_size // max(1, current_num_pages)) + new_num_pages = max(1, int(new_cache_size // denom))examples/auto_deploy/build_and_run_ad.py (3)
1-1
: Add NVIDIA Apache-2.0 header at file top.Please prepend the standard header (year 2025).
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# http://www.apache.org/licenses/LICENSE-2.0 +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License.
75-97
: Normalization is clear; avoid aliasing nested lists across the batch.Multiplying the list duplicates references to the same nested message objects. If any later logic mutates message dicts, this can cause surprising cross-sample coupling. Consider copying list/dicts during expansion.
- queries = self.queries if isinstance(self.queries, list) else [self.queries] + queries = self.queries if isinstance(self.queries, list) else [self.queries] batch_size = self.batch_size - queries = queries * (batch_size // len(queries) + 1) + # Expand without aliasing nested structures + expanded = [] + for _ in range(batch_size): + for q in queries: + if isinstance(q, list): + expanded.append([dict(m) for m in q]) # shallow-copy each message + elif isinstance(q, dict): + expanded.append(dict(q)) + else: + expanded.append(q) + if len(expanded) == batch_size: + break + if len(expanded) == batch_size: + break - queries = queries[:batch_size] + queries = expanded[:batch_size]
269-309
: Ensure LLM shutdown on exceptions.Wrap generate/benchmark in try/finally to guarantee llm.shutdown() even if an exception is raised.
- llm = build_llm_from_config(config) - - # prompt the model and print its output - ad_logger.info("Running example prompts...") - - # now let's try piping through multimodal data - - outs = llm.generate( - config.prompt.queries, - sampling_params=SamplingParams(**config.prompt.sp_kwargs), - ) - results = {"prompts_and_outputs": print_outputs(outs)} + llm = build_llm_from_config(config) + try: + # prompt the model and print its output + ad_logger.info("Running example prompts...") + # now let's try piping through multimodal data + outs = llm.generate( + config.prompt.queries, + sampling_params=SamplingParams(**config.prompt.sp_kwargs), + ) + results = {"prompts_and_outputs": print_outputs(outs)} - # run a benchmark for the model with batch_size == config.benchmark_bs - if config.benchmark.enabled and config.args.runtime != "trtllm": - ad_logger.info("Running benchmark...") - keys_from_args = ["compile_backend", "attn_backend", "mla_backend"] - fields_to_show = [f"benchmark={config.benchmark}"] - fields_to_show.extend([f"{k}={getattr(config.args, k)}" for k in keys_from_args]) - results["benchmark_results"] = benchmark( - func=lambda: llm.generate( - torch.randint(0, 100, (config.benchmark.bs, config.benchmark.isl)).tolist(), - sampling_params=SamplingParams( - max_tokens=config.benchmark.osl, - top_k=None, - ignore_eos=True, - ), - use_tqdm=False, - ), - num_runs=config.benchmark.num, - log_prefix="Benchmark with " + ", ".join(fields_to_show), - results_path=config.benchmark.results_path, - ) - elif config.benchmark.enabled: - ad_logger.info("Skipping simple benchmarking for trtllm...") - - if config.benchmark.store_results: - store_benchmark_results(results, config.benchmark.results_path) - - llm.shutdown() + # run a benchmark for the model with batch_size == config.benchmark_bs + if config.benchmark.enabled and config.args.runtime != "trtllm": + ad_logger.info("Running benchmark...") + keys_from_args = ["compile_backend", "attn_backend", "mla_backend"] + fields_to_show = [f"benchmark={config.benchmark}"] + fields_to_show.extend([f"{k}={getattr(config.args, k)}" for k in keys_from_args]) + results["benchmark_results"] = benchmark( + func=lambda: llm.generate( + torch.randint(0, 100, (config.benchmark.bs, config.benchmark.isl)).tolist(), + sampling_params=SamplingParams( + max_tokens=config.benchmark.osl, + top_k=None, + ignore_eos=True, + ), + use_tqdm=False, + ), + num_runs=config.benchmark.num, + log_prefix="Benchmark with " + ", ".join(fields_to_show), + results_path=config.benchmark.results_path, + ) + elif config.benchmark.enabled: + ad_logger.info("Skipping simple benchmarking for trtllm...") + + if config.benchmark.store_results: + store_benchmark_results(results, config.benchmark.results_path) + finally: + llm.shutdown()tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py (1)
41-43
: Forward signature alignment looks good; add a quick guard and silence the unused parameter.
- position_ids is intentionally unused; make that explicit to avoid lints.
- Add a divisibility/assert guard to prevent silent shape errors when hidden_size is not divisible by num_heads.
Apply:
def forward( - self, input_ids: torch.Tensor, position_ids: Optional[torch.Tensor] = None + self, input_ids: torch.Tensor, position_ids: Optional[torch.Tensor] = None ) -> torch.Tensor: @@ - b, s, _ = input_ids.shape + b, s, d_model = input_ids.shape + _ = position_ids # unused; kept to match interface + assert ( + d_model % self.num_heads == 0 + ), f"hidden_size ({d_model}) must be divisible by num_heads ({self.num_heads})" @@ - q = self.q_proj(input_ids) # [b, s, n*h_d] - k = self.k_proj(input_ids) # [b, s, n_kv*h_d] - v = self.v_proj(input_ids) # [b, s, n_kv*h_d] + q = self.q_proj(input_ids) # [b, s, n*h_d] + k = self.k_proj(input_ids) # [b, s, n_kv*h_d] + v = self.v_proj(input_ids) # [b, s, n_kv*h_d]Optional outside-range improvement (constructor robustness): prefer explicit parameters over unpacking args for readability and safety.
class GQAWithSdpa(GQA): def __init__(self, num_attention_heads: int, hidden_size: int, num_key_value_heads: int, *args, **kwargs): super().__init__(num_attention_heads, hidden_size, num_key_value_heads, *args, **kwargs) self.num_heads = num_attention_heads self.num_kv_heads = num_key_value_heads self.head_dim = hidden_size // self.num_heads self.num_key_value_groups = None if self.num_heads == self.num_kv_heads else self.num_heads // self.num_kv_headsAlso applies to: 48-48, 51-54
tests/unittest/_torch/auto_deploy/unit/singlegpu/shim/test_llm_config.py (1)
95-95
: Patch target update to LLM._create_input_processor is correct.This aligns with the refactor that moved input-processor creation into the LLM class.
To reduce redundancy, you can remove the inner patch.object(..., "_build_model", ...) since it's already patched via the decorator. Also, in the non-executor branch add assert_not_called() for completeness:
@@ def test_config_flow(...): - else: - # For LLM with _autodeploy backend, executor should not be called directly - pass + else: + # For LLM with _autodeploy backend, executor should not be called directly + mock_executor.assert_not_called()tensorrt_llm/_torch/auto_deploy/models/factory.py (3)
116-124
: New hook: init_processor() — clear and useful for multimodal processors.No issues. Consider adding a brief Google-style docstring “Args/Returns” section like other methods for consistency.
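For reference, a minimal Google-style stub in the spirit of the other factory hooks (the return annotation and wording are placeholders, not the actual signature):

```python
from typing import Any, Optional


def init_processor(self) -> Optional[Any]:
    """Initialize the (multimodal) processor for the model.

    Returns:
        The initialized processor instance, or None if the model has no processor.
    """
    ...
```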
218-229
: New hook: get_example_inputs() — fits export_to_gm usage.Looks good. To avoid surprises, document that the returned tensors can be on CPU; SequenceInfo will move them to the correct device.
62-80
: Docstring still assumes only (input_ids, position_ids) despite multimodal support.Minor inconsistency: the class-level expectations and build_model docstring mention only input_ids/position_ids. With this PR adding extra_args (e.g., pixel_values), consider updating the docstring to acknowledge optional extra inputs.
tensorrt_llm/_torch/auto_deploy/shim/demollm.py (1)
98-117
: Consider validating multimodal data consistencyThe code correctly collects multimodal data from requests, but there's no validation that all requests have the same set of multimodal keys. This could lead to issues if requests have heterogeneous multimodal data structures.
Consider adding validation to ensure all requests have consistent multimodal data keys:
# set up sequence info object for decode phase sequence_info = self.cache_seq_interface.info sequence_info.reset() input_ids = [] total_lens = [] extra_args: Dict[str, List[torch.Tensor]] = defaultdict(list) +multimodal_keys = set() for request in requests: total_lens.append(len(request.prompt_token_ids)) input_ids.append(request.prompt_token_ids) if request.multimodal_params is not None: + current_keys = set(request.multimodal_params.multimodal_data.keys()) + if multimodal_keys and current_keys != multimodal_keys: + raise ValueError(f"Heterogeneous multimodal data keys are not supported. Expected {multimodal_keys}, got {current_keys}") + multimodal_keys = current_keys for k, v in request.multimodal_params.multimodal_data.items(): extra_args[k].append(v)tensorrt_llm/_torch/auto_deploy/llm.py (2)
44-51
: Consider making the sanity check optional in productionThe TODO comment indicates this sanity check may be removed for performance. Since this creates an intermediate string that's not actually used (only for validation), it could impact performance for large batches.
Consider adding a debug flag to make this check optional:
# check for messages field and if yes, use the apply_chat_template method if "messages" in inputs: - # TODO: we don't really need this but it makes for a good sanity check. Consider - # removing this in the future if we need to speed things up. - prompt = self.processor.apply_chat_template( - inputs["messages"], - add_generation_prompt=True, - tokenize=False, - ) - inputs["prompt"] = prompt + if ad_logger.isEnabledFor(logging.DEBUG): + # Sanity check: generate the prompt for debugging + prompt = self.processor.apply_chat_template( + inputs["messages"], + add_generation_prompt=True, + tokenize=False, + ) + inputs["prompt"] = prompt + ad_logger.debug(f"Generated prompt from messages: {prompt[:100]}...")
121-125
: Consider more specific type checkingThe assertions verify instance types but could be more specific about what's expected.
Consider adding more informative error messages:
# now correct input processor -assert isinstance(self.input_processor, DefaultInputProcessor) -assert self.tokenizer is None or isinstance(self.tokenizer, TransformersTokenizer) +assert isinstance(self.input_processor, DefaultInputProcessor), ( + f"Expected DefaultInputProcessor, got {type(self.input_processor)}" +) +assert self.tokenizer is None or isinstance(self.tokenizer, TransformersTokenizer), ( + f"Expected TransformersTokenizer or None, got {type(self.tokenizer)}" +) self.input_processor = self._create_input_processor()tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py (3)
37-37
: Document why image_sizes is set to NoneWhile the comment mentions that image_sizes is set to None, it would be helpful to explain that this is required because torch.cond branches must have fixed argument types during export.
- image_sizes: torch.Tensor = None, # image_sizes set as None + image_sizes: torch.Tensor = None, # Set to None for torch.export compatibility (cond branches need fixed types)
89-96
: Consider enabling the validation check conditionallyThe commented-out validation check for matching token counts could be useful for debugging. Consider making it conditional based on a debug flag rather than commenting it out entirely.
-# This condition statement breaks torch.export: -# TODO: sanity check on the inputs for this -# if num_tokens_to_fill != projected_vision_flat.size(0): -# raise ValueError( -# f"Mismatch: final_mask wants {num_tokens_to_fill} embeddings, " -# f"but multi_modal_projector returned {projected_vision_flat.size(0)}" -# ) +# Skip runtime validation during export to avoid breaking torch.export +# This check is valuable for debugging but incompatible with torch.export's constraints +if not torch.compiler.is_compiling(): + num_tokens_to_fill = final_mask_1d.sum() + if num_tokens_to_fill != projected_vision_flat.size(0): + raise ValueError( + f"Mismatch: final_mask wants {num_tokens_to_fill} embeddings, " + f"but multi_modal_projector returned {projected_vision_flat.size(0)}" + )
158-158
: Document why image_hidden_states is NoneThe comment mentions skipping this "for simplicity" but it would be clearer to explain that image_hidden_states is computed inside the vision branch and not accessible outside torch.cond.
- image_hidden_states=None, # skip outputting this for simplicity + image_hidden_states=None, # Not accessible outside torch.cond vision branchtensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (5)
139-156
: Default dtypes: prefer clarity and consistency. `input_ids` is initialized with `torch.int` while `position_ids` uses `torch.long`. If `input_ids` represents token IDs by default, `torch.long` is the safer choice (embedding/lookup ops require int64). If `input_ids` may also carry embeddings later (as tests do), consider documenting that and defaulting to `torch.long` only for the token-ID path.

Would you like me to push a docstring tweak and a small follow-up change to default `input_ids` to `torch.long`?
248-252
: Return host-side `_input_positions` for consistency (and to avoid device sync). The TODO indicates intent to use the host-side list. After fixing `_update_input_pos`, returning `_input_positions` avoids an extra device->host sync on every call.

```diff
 @property
 def input_positions(self) -> List[int]:
-    # TODO: correctly set _input_positions and use it here
-    # return self._input_positions
-    return self.input_pos[: self.num_sequences].tolist()
+    return self._input_positions
```
465-475
: Make host tensors pinned for non_blocking copies and clear tails. These copies are marked `non_blocking=True` but the source tensors aren't pinned, making them synchronous. Also, clear the trailing slots to avoid stale data.

```diff
 def _assign_pages_per_seq(self, page_assignments: Sequence[Sequence[int]]) -> None:
     """Set the cache location and pages_per_seq tensors from page assignments."""
     assert len(page_assignments) == self.num_sequences
-    cache_loc_flat = torch.tensor(
-        [p_idx for pages in page_assignments for p_idx in pages], dtype=torch.int
-    )
-    self.cache_loc[: len(cache_loc_flat)].copy_(cache_loc_flat, non_blocking=True)
-
-    pages_per_seq = torch.tensor([len(p) for p in page_assignments], dtype=torch.int)
-    self.pages_per_seq[: len(pages_per_seq)].copy_(pages_per_seq, non_blocking=True)
+    cache_loc_flat = torch.tensor(
+        [p_idx for pages in page_assignments for p_idx in pages],
+        dtype=torch.int,
+        pin_memory=True,
+    )
+    self.cache_loc[: cache_loc_flat.numel()].copy_(cache_loc_flat, non_blocking=True)
+    if self.cache_loc.numel() > cache_loc_flat.numel():
+        self.cache_loc[cache_loc_flat.numel() :].zero_()
+
+    pages_per_seq_host = torch.tensor([len(p) for p in page_assignments],
+                                      dtype=torch.int, pin_memory=True)
+    self.pages_per_seq[: pages_per_seq_host.numel()].copy_(pages_per_seq_host, non_blocking=True)
+    if self.pages_per_seq.numel() > pages_per_seq_host.numel():
+        self.pages_per_seq[pages_per_seq_host.numel() :].zero_()
```
372-379
: Non-blocking reset copies: consider pinned memory for consistency. `cache_loc_host` is pinned (good). For symmetry and performance, also consider zeroing the `input_pos` and `pages_per_seq` tails here to avoid stale values when the active batch shrinks after a larger one.

I can add a small follow-up to zero the tails here; let me know if you want it in this PR.
202-233
: Lazy dynamic-shape maps reuse the same dict instance per arg. `bs_seq_len_shape` is shared across uncached args. That's fine if never mutated; if you expect any per-arg tweak later, switch to a shallow copy per key to avoid aliasing surprises.

```diff
-        self._uncached_dynamic_shapes = {k: bs_seq_len_shape for k in self._uncached_arg_names}
+        self._uncached_dynamic_shapes = {k: dict(bs_seq_len_shape) for k in self._uncached_arg_names}
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (19)
- examples/auto_deploy/.gitignore (1 hunks)
- examples/auto_deploy/build_and_run_ad.py (4 hunks)
- tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (8 hunks)
- tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py (2 hunks)
- tensorrt_llm/_torch/auto_deploy/llm.py (4 hunks)
- tensorrt_llm/_torch/auto_deploy/models/factory.py (3 hunks)
- tensorrt_llm/_torch/auto_deploy/models/hf.py (4 hunks)
- tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py (1 hunks)
- tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (4 hunks)
- tensorrt_llm/_torch/auto_deploy/shim/demollm.py (8 hunks)
- tensorrt_llm/_torch/auto_deploy/transform/library/export_to_gm.py (1 hunks)
- tensorrt_llm/_torch/auto_deploy/transformations/library/kvcache.py (2 hunks)
- tensorrt_llm/executor/worker.py (1 hunks)
- tests/unittest/_torch/auto_deploy/_utils_test/_model_test_utils.py (0 hunks)
- tests/unittest/_torch/auto_deploy/integration/test_llama4_vlm_export.py (1 hunks)
- tests/unittest/_torch/auto_deploy/unit/singlegpu/shim/test_engine.py (0 hunks)
- tests/unittest/_torch/auto_deploy/unit/singlegpu/shim/test_llm_config.py (1 hunks)
- tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_build_small_single.py (1 hunks)
- tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py (3 hunks)
💤 Files with no reviewable changes (2)
- tests/unittest/_torch/auto_deploy/_utils_test/_model_test_utils.py
- tests/unittest/_torch/auto_deploy/unit/singlegpu/shim/test_engine.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Code must target Python 3.8+
Indent with 4 spaces; do not use tabs
Preserve module namespace when importing: from package.subpackage import foo; then use foo.SomeClass()
Python filenames use snake_case (e.g., some_file.py)
Class names use PascalCase
Function and method names use snake_case
Local variables use snake_case; prefix k for names starting with a number (e.g., k_99th_percentile)
Global variables are UPPER_SNAKE_CASE prefixed with G (e.g., G_MY_GLOBAL)
Constants are UPPER_SNAKE_CASE
Avoid shadowing variables from an outer scope
Initialize all externally visible members of a class in init
For interfaces used outside a file, prefer docstrings over comments; comments for internal code or local interfaces
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Attributes and variables can be documented inline with trailing docstrings under the class or module
Avoid using reflection when easily avoidable; prefer explicit parameters/constructs over dict(**locals())
In try/except, catch the narrowest exception types possible
For duck-typing try/except, keep try body minimal and place logic in else after attribute existence checks
Files:
tests/unittest/_torch/auto_deploy/integration/test_llama4_vlm_export.py
tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_build_small_single.py
tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py
tensorrt_llm/executor/worker.py
examples/auto_deploy/build_and_run_ad.py
tensorrt_llm/_torch/auto_deploy/models/factory.py
tests/unittest/_torch/auto_deploy/unit/singlegpu/shim/test_llm_config.py
tensorrt_llm/_torch/auto_deploy/transformations/library/kvcache.py
tensorrt_llm/_torch/auto_deploy/transform/library/export_to_gm.py
tensorrt_llm/_torch/auto_deploy/llm.py
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py
tensorrt_llm/_torch/auto_deploy/shim/demollm.py
tensorrt_llm/_torch/auto_deploy/models/hf.py
tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
**/*.{h,hpp,hxx,hh,c,cc,cpp,cxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend NVIDIA Apache-2.0 copyright header with current year to all source files
Files:
tests/unittest/_torch/auto_deploy/integration/test_llama4_vlm_export.py
tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_build_small_single.py
tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py
tensorrt_llm/executor/worker.py
examples/auto_deploy/build_and_run_ad.py
tensorrt_llm/_torch/auto_deploy/models/factory.py
tests/unittest/_torch/auto_deploy/unit/singlegpu/shim/test_llm_config.py
tensorrt_llm/_torch/auto_deploy/transformations/library/kvcache.py
tensorrt_llm/_torch/auto_deploy/transform/library/export_to_gm.py
tensorrt_llm/_torch/auto_deploy/llm.py
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py
tensorrt_llm/_torch/auto_deploy/shim/demollm.py
tensorrt_llm/_torch/auto_deploy/models/hf.py
tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
🧠 Learnings (1)
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Applied to files:
tensorrt_llm/_torch/auto_deploy/transformations/library/kvcache.py
🧬 Code graph analysis (11)
examples/auto_deploy/build_and_run_ad.py (1)
tensorrt_llm/llmapi/llm_args.py (1)
Field
(67-94)
tensorrt_llm/_torch/auto_deploy/models/factory.py (2)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (1)
CacheConfig
(26-29)tensorrt_llm/_torch/auto_deploy/models/hf.py (3)
init_processor
(351-355)get_example_inputs
(375-418)get_extra_inputs
(420-439)
tensorrt_llm/_torch/auto_deploy/transformations/library/kvcache.py (2)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (5)
named_standard_args
(188-190)get_prepare_metadata_op
(716-738)args
(193-195)extra_args_for_prepare_metadata
(198-200)set_max_num_tokens_sample
(401-413)tensorrt_llm/_torch/auto_deploy/shim/interface.py (1)
args
(24-26)
tensorrt_llm/_torch/auto_deploy/transform/library/export_to_gm.py (4)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (3)
set_example_sequence
(380-399)args
(193-195)dynamic_shapes
(235-237)tensorrt_llm/_torch/auto_deploy/models/hf.py (1)
get_example_inputs
(375-418)tensorrt_llm/_torch/auto_deploy/models/factory.py (2)
get_example_inputs
(218-228)model
(43-45)tensorrt_llm/_torch/auto_deploy/export/export.py (1)
torch_export_to_gm
(198-284)
tensorrt_llm/_torch/auto_deploy/llm.py (6)
tensorrt_llm/inputs/registry.py (1)
DefaultInputProcessor
(44-111)tensorrt_llm/llmapi/llm.py (4)
tokenizer
(681-685)tokenizer
(688-689)prompt
(79-80)LLM
(1079-1095)tensorrt_llm/_torch/auto_deploy/models/factory.py (5)
tokenizer
(48-50)ModelFactory
(15-243)init_tokenizer
(107-114)init_processor
(116-123)prefetch_checkpoint
(125-136)tensorrt_llm/llmapi/tokenizer.py (2)
TokenizerBase
(24-25)TransformersTokenizer
(28-267)tensorrt_llm/sampling_params.py (1)
SamplingParams
(125-486)tensorrt_llm/_torch/auto_deploy/llm_args.py (2)
LlmArgs
(226-330)create_factory
(204-215)
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py (2)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (2)
nest_sequences
(506-592)reset
(366-378)tests/unittest/_torch/auto_deploy/_utils_test/_torch_test_utils.py (1)
all_close
(6-21)
tensorrt_llm/_torch/auto_deploy/shim/demollm.py (3)
tensorrt_llm/executor/request.py (1)
GenerationRequest
(84-129)tensorrt_llm/inputs/multimodal.py (1)
MultimodalParams
(152-410)tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (6)
reset
(366-378)nest_sequences
(506-592)page_assignments
(272-279)input_positions
(248-251)sequence_lengths
(244-245)args
(193-195)
tensorrt_llm/_torch/auto_deploy/models/hf.py (2)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (1)
CacheConfig
(26-29)tensorrt_llm/_torch/auto_deploy/models/factory.py (6)
tokenizer
(48-50)init_tokenizer
(107-114)init_processor
(116-123)model
(43-45)get_example_inputs
(218-228)get_extra_inputs
(230-243)
tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (6)
tensorrt_llm/_torch/auto_deploy/llm_args.py (1)
create_factory
(204-215)tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (3)
device
(163-164)page_assignments
(272-279)nest_sequences
(506-592)tensorrt_llm/_torch/auto_deploy/models/hf.py (1)
get_extra_inputs
(420-439)tensorrt_llm/_torch/auto_deploy/models/factory.py (1)
get_extra_inputs
(230-243)tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
get_cache_indices
(572-583)tensorrt_llm/_torch/pyexecutor/llm_request.py (1)
is_dummy
(379-380)
tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py (2)
tensorrt_llm/_torch/models/modeling_llama.py (1)
Llama4ForConditionalGeneration
(1179-1240)tensorrt_llm/_torch/auto_deploy/export/interface.py (2)
BaseExportPatch
(38-130)ExportPatchRegistry
(166-211)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (2)
tensorrt_llm/_torch/auto_deploy/shim/interface.py (3)
args
(24-26)dynamic_shapes
(29-31)to
(33-37)tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py (1)
set_example_sequence
(45-53)
🔇 Additional comments (37)
tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py (1)
66-67
: Verify torch's `scaled_dot_product_attention` supports `enable_gqa` and update docs. The new `enable_gqa` parameter is a great parity enhancement with `F.scaled_dot_product_attention`, but we need to:

- Update the function's docstring to document the `enable_gqa` argument.
- Confirm that the version of PyTorch used in CI/runtime actually exposes this keyword (to avoid a runtime `TypeError`).

Our sandbox couldn't import `torch`, so please manually verify in your local or CI environment:

```bash
python - << 'PY'
import inspect, torch
from torch.nn import functional as F
sig = inspect.signature(F.scaled_dot_product_attention)
print("Has 'enable_gqa':", 'enable_gqa' in sig.parameters)
PY
```

Once confirmed, append to the docstring under `Args:`:

```
enable_gqa (bool): If True, request grouped-query attention handling from the PyTorch
    scaled_dot_product_attention implementation. Forwarded when supported by the installed
    torch version; otherwise ignored or raises if unsupported.
```

tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_build_small_single.py (1)
68-71
: LGTM on compile_backend switch for Llama-4-Scout.The move to compile_backend="torch-simple" with attn_backend="flashinfer" is consistent with other entries and the new "_autodeploy" PyTorch-like path. No issues spotted.
examples/auto_deploy/.gitignore (1)
5-6
: No tracked YAML files in examples/auto_deploy; the ignore rule is safe. Ran the suggested check and found no tracked `*.yaml` files under `examples/auto_deploy`. The current `.gitignore` entry won't hide any existing sample configs, so it can remain as-is.

tensorrt_llm/executor/worker.py (2)
125-139
: Backend selection path looks good. Ensure parity for LoRA and scheduling knobs.The new "_autodeploy" branch uses create_autodeploy_executor; confirm that required knobs (e.g., kv_cache_transceiver, scheduling params, multimodal handling) are wired consistently with the PyTorch path. No code change requested here—just a heads-up to verify feature parity as you expand autodeploy capabilities.
433-441
: Overlap + disaggregated guard reads well; re-check constraints for autodeploy.With overlap enabled and disaggregated serving, context-only requests are disallowed. Confirm autodeploy engine implements kv_cache_transceiver consistently so this condition holds. No code change requested.
tensorrt_llm/_torch/auto_deploy/transformations/library/kvcache.py (1)
79-84
: Good API alignment.Using cm.info.extra_args_for_prepare_metadata keeps the call signature decoupled from concrete extras (e.g., page_size). This is the right abstraction.
examples/auto_deploy/build_and_run_ad.py (2)
29-31
: LGTM: PromptInput union captures the intended formats.The alias cleanly models str/TRT-LLM dict/HF messages. This aligns with the multimodal and chat-template flow.
41-62
: Nice: richer prompt examples and clear guidance.The expanded defaults exercise all supported input shapes (text, TRT-LLM dict, HF messages) and will help users discover capabilities quickly.
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py (3)
190-193
: Good: using SequenceInfo.nest_sequences with input_pos keeps positions consistent.Passing input_pos to let SequenceInfo derive position_ids is aligned with the new interface. No change needed.
202-202
: Autoregressive baseline reset is correct before no-cache run.Resetting before the all-at-once call is fine and mirrors a cold start scenario.
208-211
: Per-token KV-cache loop is correct and exercises the transformation path.Chunking one token at a time with the corresponding input_pos is the right way to validate cache behavior.
tests/unittest/_torch/auto_deploy/integration/test_llama4_vlm_export.py (1)
193-196
: Message update LGTM.Changing the prompt text does not affect semantics; assertions still validate text-only vs. text+image divergence and export parity.
tensorrt_llm/_torch/auto_deploy/transform/library/export_to_gm.py (2)
66-67
: Good change: source example inputs from the factory.Passing factory.get_example_inputs() through to cm.info.set_example_sequence() aligns export inputs with model factories and enables multimodal scaffolding.
66-67
: Cross-file: fix invalid PIL.Image.new usage in HF factory get_example_inputs.When this transform calls cm.info.set_example_sequence(**factory.get_example_inputs()), the HF factory’s second sample creates an image with Image.new("RGB", (0, 255, 0)) (missing size, color passed as size), which will fail. Please correct that in tensorrt_llm/_torch/auto_deploy/models/hf.py.
Suggested fix in models/hf.py (context lines 374-417 in your snippet):
- _prep_seq( - "What are the main differences between these two images?", - Image.new("RGB", (16, 16), color=(255, 0, 0)), - Image.new("RGB", (0, 255, 0)), - ), + _prep_seq( + "What are the main differences between these two images?", + Image.new("RGB", (16, 16), color=(255, 0, 0)), + Image.new("RGB", (16, 16), color=(0, 255, 0)), + ),tensorrt_llm/_torch/auto_deploy/models/factory.py (2)
11-11
: Import of DynamicShapeCallback is correct and future-proofs extra input wiring.Good alignment with SequenceInfo’s dynamic-shapes path.
230-244
: Verified new hook get_extra_inputs() — API shape and naming align with SequenceInfo._extra_argsNo further action needed; the extra‐input keys (e.g., “pixel_values” in the HF factory) correctly match the keys used by SequenceInfo, so set_example_sequence() will not trigger its missing‐keys assertion.
tensorrt_llm/_torch/auto_deploy/shim/demollm.py (4)
39-43
: Verify multimodal data propagationThe multimodal parameters are being directly assigned to the request object. Since the
GenerationRequest
class supports multimodal parameters as shown in the external code context, this implementation is correct.
53-83
: Page assignment strategy looks correctThe new signature returning
List[List[int]]
instead of mutating internal state is a cleaner API design. The implementation correctly:
- Extracts current page assignments from sequence info
- Calculates free pages
- Allocates additional pages based on total length requirements
The logic correctly handles the case where
extra_tokens > 0
but< page_size
by adding one extra page.
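A rough sketch of that allocation strategy, assuming hypothetical helper names (existing_pages, free_pages) and integer math only; it mirrors the "one extra page when extra_tokens > 0" rule rather than the actual implementation:

```python
from typing import List


def assign_pages(
    total_lens: List[int],
    existing_pages: List[List[int]],
    free_pages: List[int],
    page_size: int,
) -> List[List[int]]:
    """Extend each sequence's page list so it can hold total_lens[i] tokens."""
    assignments: List[List[int]] = []
    for total_len, pages in zip(total_lens, existing_pages):
        pages = list(pages)  # copy: do not mutate caller state
        extra_tokens = total_len - len(pages) * page_size
        # one extra page covers any remainder 0 < extra_tokens <= page_size
        num_new_pages = max(0, -(-extra_tokens // page_size))  # ceil division
        for _ in range(num_new_pages):
            pages.append(free_pages.pop())
        assignments.append(pages)
    return assignments


# e.g. page_size=4: a sequence growing from 8 to 9 tokens needs exactly one more page
print(assign_pages([9], [[0, 1]], free_pages=[7, 8, 9], page_size=4))  # [[0, 1, 9]]
```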
133-142
: LGTM! Clean state management for decode stepsThe implementation correctly:
- Computes next positions by adding current positions and sequence lengths
- Calculates total lengths for the next step
- Updates sequence info with new tokens and page assignments
The flow maintains proper state transitions between generation steps.
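Conceptually, the per-step update reduces to element-wise arithmetic on the sequence metadata; a hedged sketch (tensor names are illustrative, not the engine's actual fields):

```python
import torch

# current state for a batch of 3 sequences
input_pos = torch.tensor([5, 12, 0])   # first position of the tokens fed last step
seq_lens = torch.tensor([3, 1, 7])     # number of tokens processed last step

# next decode step: positions advance by the tokens just processed,
# and each sequence needs capacity for one more (newly generated) token
next_input_pos = input_pos + seq_lens        # tensor([ 8, 13,  7])
total_lens_next = next_input_pos + 1         # used to size the next page assignments
print(next_input_pos.tolist(), total_lens_next.tolist())
```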
280-281
: LGTM! Explicit multimodal params passingThe implementation correctly passes
multimodal_params
from the request to avoid losing the reference when the request's multimodal_params are cleared after submission (as noted in lines 336-338).tensorrt_llm/_torch/auto_deploy/llm.py (2)
15-73
: Well-structured ADInputProcessor implementationThe
ADInputProcessor
class provides a clean abstraction for handling both text-only and multimodal inputs:
- Correctly inherits from
DefaultInputProcessor
- Properly handles HF's chat template system for multimodal inputs
- Falls back to standard tokenization for text-only inputs
- Appropriately validates and returns multimodal data when present
The implementation correctly handles truncation parameters and special tokens configuration.
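The text-vs-multimodal dispatch can be pictured roughly like this (a simplified sketch, not the class's actual code; the kwargs follow the transformers chat-template API used by recent multimodal processors and may differ by version):

```python
def process(inputs: dict, processor, tokenizer):
    """Return (token_ids, extra_multimodal_data) for a single prompt dict."""
    if "messages" in inputs:
        # multimodal / chat-style input: let the HF processor apply the chat
        # template and produce pixel_values alongside the token ids
        batch = processor.apply_chat_template(
            inputs["messages"],
            add_generation_prompt=True,
            tokenize=True,
            return_dict=True,
            return_tensors="pt",
        )
        extra = {k: v for k, v in batch.items() if k not in ("input_ids", "attention_mask")}
        return batch["input_ids"][0].tolist(), extra or None
    # plain text: fall back to standard tokenization
    return tokenizer.encode(inputs["prompt"]), None
```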
81-85
: Good lazy initialization pattern for factoryThe lazy initialization of the factory is well-implemented, ensuring it's only created when needed.
tensorrt_llm/_torch/auto_deploy/models/hf.py (4)
344-356
: Clean processor initialization patternThe
init_processor
method correctly initializes the AutoProcessor and returns it, whileinit_tokenizer
properly extracts the tokenizer from the processor. This is a clean separation of concerns for multimodal model support.
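The separation can be illustrated with the plain transformers API (a sketch with an example checkpoint name; the factory's real implementation may differ in details such as trust_remote_code handling):

```python
from transformers import AutoProcessor


def init_processor(model_name_or_path: str):
    # one call loads the full multimodal processor (image processor + tokenizer)
    return AutoProcessor.from_pretrained(model_name_or_path)


def init_tokenizer(processor):
    # the text tokenizer is a sub-component of the multimodal processor
    return processor.tokenizer


processor = init_processor("meta-llama/Llama-4-Scout-17B-16E-Instruct")
tokenizer = init_tokenizer(processor)
```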
358-374
: LGTM! Multimodal forward signatureThe
_simple_forward
method correctly extends the base implementation to includepixel_values
parameter, enabling multimodal input processing during model execution.
375-419
: Well-structured example inputs generationThe
get_example_inputs
method correctly:
- Creates synthetic image data for testing
- Uses the processor's chat template system
- Generates properly formatted batch inputs
- Returns both input_ids and pixel_values
The use of small 16x16 images for examples is appropriate to minimize memory usage during export.
432-434
: Could you clarify which export target you have in mind—ONNX/TensorRT dynamic-shape profiles for inference, or torch.jit/torch.export tracing? Knowing the runtime will let us recommend appropriate min/opt/max defaults for batch size, height, and width.tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (4)
98-103
: LGTM! Clean separation of concernsThe SequenceInfo initialization no longer takes a device parameter, which is a cleaner design. The device is now handled separately and passed to the ADEngine constructor.
105-117
: Good integration of extra inputs from factoryThe code correctly retrieves extra inputs from the factory and adds them to the sequence info with their dynamic shape callbacks. This enables proper handling of multimodal inputs with dynamic shapes.
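A hedged sketch of that wiring, following the second sequence diagram above (the get_extra_inputs/add_extra_arg names come from this PR; DummySeqInfo and the shapes are illustrative):

```python
import torch
from torch.export import Dim


# factory side: name -> (placeholder used when the input is absent, dynamic-shape callback)
def get_extra_inputs():
    none_pixel_values = torch.zeros(0, 3, 16, 16)
    dyn_shape_cb = lambda: {0: Dim("num_images", max=8)}
    return {"pixel_values": (none_pixel_values, dyn_shape_cb)}


# engine side: register every extra input with the sequence interface
class DummySeqInfo:
    def __init__(self):
        self.extra = {}

    def add_extra_arg(self, name, none_input, dyn_shape_cb=None):
        self.extra[name] = (none_input, dyn_shape_cb)


seq_info = DummySeqInfo()
for name, (none_input, dyn_shape_cb) in get_extra_inputs().items():
    seq_info.add_extra_arg(name, none_input, dyn_shape_cb)
print(list(seq_info.extra))  # ['pixel_values']
```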
182-209
: Clean multimodal data collection patternThe refactored code properly:
- Converts device tensors to CPU list for indexing
- Collects page assignments per request
- Aggregates multimodal data into extra_args dictionary
- Maintains backward compatibility for non-multimodal requests
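A condensed sketch of that collection pattern, which also previews the consolidated nest_sequences call discussed in the next comment (request/SequenceInfo objects are stand-ins; only the defaultdict aggregation mirrors the described code):

```python
from collections import defaultdict


def gather_batch(requests):
    input_ids, input_pos, extra_args = [], [], defaultdict(list)
    for req in requests:
        input_ids.append(req["prompt_token_ids"])
        input_pos.append(req.get("input_pos", 0))
        for k, v in (req.get("multimodal_data") or {}).items():
            extra_args[k].append(v)
    return input_ids, input_pos, dict(extra_args)


reqs = [
    {"prompt_token_ids": [1, 2, 3], "multimodal_data": {"pixel_values": "tensor_0"}},
    {"prompt_token_ids": [4, 5]},
]
ids, pos, extra = gather_batch(reqs)
print(ids, pos, extra)  # [[1, 2, 3], [4, 5]] [0, 0] {'pixel_values': ['tensor_0']}

# single atomic update (hypothetical call mirroring nest_sequences' signature):
# sequence_info.nest_sequences(ids, pos, page_assignments, **extra)
```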
231-236
: Excellent API consolidationThe single call to
nest_sequences
with all parameters (input_ids, input_pos, page_assignments, and extra_args) is much cleaner than the previous approach of multiple separate updates. This consolidates all sequence info updates in one atomic operation.tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py (4)
13-19
: Good documentation of modificationsThe comment clearly documents the three key modifications made to the original HuggingFace implementation, which helps with maintenance and understanding the differences.
60-65
: Good input validationThe XOR check for input_ids vs inputs_embeds and the mutual exclusivity check for pixel_values and inputs_embeds ensure proper input configuration.
105-113
: Clever use of torch.cond for export compatibilityThe use of
torch.cond
with a runtime check for non-zero pixel values is an elegant solution to enable both text-only and multimodal paths during export. This allows the exported model to handle both input types dynamically.
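For readers unfamiliar with the construct, a self-contained toy showing why torch.cond keeps both branches traceable under torch.export on recent PyTorch (the predicate and branches here are deliberately trivial, not the Llama4 patch itself):

```python
import torch
from torch.export import export


class TextOrVision(torch.nn.Module):
    def forward(self, embeds: torch.Tensor, pixel_values: torch.Tensor) -> torch.Tensor:
        def vision_branch(embeds, pixel_values):
            # stand-in for "project vision features and merge into embeddings"
            return embeds + pixel_values.mean()

        def text_branch(embeds, pixel_values):
            return embeds.clone()

        has_image = pixel_values.abs().sum() > 0
        # both branches are captured in the exported graph; the choice is made at runtime
        return torch.cond(has_image, vision_branch, text_branch, (embeds, pixel_values))


m = TextOrVision()
embeds = torch.randn(1, 4, 8)
print(m(embeds, torch.zeros(1, 3, 16, 16)).shape)            # text-only path at runtime
ep = export(m, (embeds, torch.rand(1, 3, 16, 16)))           # exported graph contains both paths
```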
162-186
: Well-structured patch implementation. The patch class follows the established pattern correctly (see the sketch after this list):
- Properly decorated with ExportPatchRegistry
- Stores original method before patching
- Restores original method on revert
- Clear documentation of purpose
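The store/patch/restore contract can be boiled down to a context manager; a generic, self-contained sketch (the repo's BaseExportPatch and ExportPatchRegistry supply this structure with their own method names, so everything below is illustrative):

```python
import contextlib


class DummyModel:
    def forward(self, x):
        return x + 1


def _export_friendly_forward(self, x):
    return x + 2  # stand-in for the torch.cond-based forward


@contextlib.contextmanager
def apply_vision_patch():
    original = DummyModel.forward          # store original method before patching
    DummyModel.forward = _export_friendly_forward
    try:
        yield
    finally:
        DummyModel.forward = original      # restore original method on revert


m = DummyModel()
with apply_vision_patch():
    assert m.forward(1) == 3               # patched behavior during export
assert m.forward(1) == 2                   # original behavior restored afterwards
```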
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (3)
162-181
: Keep arg ordering of dynamic_shapes in lockstep with args (export stability).
named_args
builds in the order: uncached -> extra -> cached.named_dynamic_shapes
merges dicts in the same order, which should preserve ordering. Given export sensitivity, add a unit test to guard this contract (args[i] aligns with dynamic_shapes[i]).I can add a small unit test that instantiates
SequenceInfo
, registers two extra args, toggles cached mode, and asserts zipped(name, shape) ordering matches acrossnamed_args
andnamed_dynamic_shapes
.Also applies to: 202-238
401-414
: Example sequence with max_num_tokens: extra-arg handling is TODO.
set_max_num_tokens_sample
skips extra-args alignment. If extra-args are required by the graph, this will produce mismatches in export/tests.I can wire this to accept an
extra_args: Optional[Dict[str,Tensor]] = None
and populate_extra_args
with placeholders or provided values. Want a patch?
150-156
: Meta dtypes: align across fields and kernels.Seq metadata (
seq_len
,input_pos
,cache_loc
,pages_per_seq
) aretorch.int
(int32). Ensure downstream custom ops and TorchScript/exporters expect int32 consistently. If any expects int64, we’ll need cast points.I can scan the kernels and prepare_metadata ops and add safe casts if needed.
Inline review threads (resolved/outdated):
- tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (outdated, resolved)
- tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (outdated, resolved)
- tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (outdated, resolved)
- tensorrt_llm/_torch/auto_deploy/transform/library/export_to_gm.py (outdated, resolved)
- tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py (resolved)
/bot reuse-pipeline --help
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (1)
1-10: Prepend NVIDIA 2025 license header to attention_interface.py
We've confirmed that tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py does not include any NVIDIA copyright header, whereas other Python modules do:
- No "NVIDIA" header found in attention_interface.py
- Existing headers in the repo: modeling_multimodal_encoder.py, modeling_multimodal_utils.py, modeling_vila.py use a 2024 header; float8_python_api.py uses: # Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
Please prepend the canonical NVIDIA header (updated to 2025) at the very top of attention_interface.py. For example, copying the style from float8_python_api.py:
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
If the project has standardized on the more detailed Apache 2.0–style header, instead use:
+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+# http://www.apache.org/licenses/LICENSE-2.0
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
Ensure the header exactly matches the repo-standard text and update the year to 2025.
♻️ Duplicate comments (3)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (3)
441-458: Bug: scalar copy into slice when input_pos is int; also tail not cleared.
Passing an int builds a 0-D tensor, and copying it into a 1-D slice causes a shape mismatch. Copy the expanded host list and zero the tail to avoid stale values. This was raised earlier; still present.
-        # non-blocking copy
-        self._non_blocking_copy_to_device(
-            self.input_pos[: self.num_sequences], input_pos, torch.int
-        )
+        # non-blocking copy from the expanded host list
+        self._non_blocking_copy_to_device(
+            self.input_pos[: self.num_sequences], self._input_positions, torch.int
+        )
+        # clear any tail beyond active sequences
+        if self.input_pos.numel() > self.num_sequences:
+            self.input_pos[self.num_sequences :].zero_()
542-547: Uncached SDPA needs uniform seq_len; otherwise view() in _shape_for_forward will fail.
When self._is_cached_attn is False, you reshape to [B, S]; ragged inputs make B*S != total_num_tokens. Enforce uniform lengths at nest time (or pad/truncate).
         seq_lens = [len(ids) for ids in input_ids]
-        self._sequence_lengths = seq_lens
+        # Uncached attention requires rectangular [B, S, ...]
+        if not self._is_cached_attn:
+            assert len(set(seq_lens)) == 1, (
+                f"Uncached attention requires uniform seq_len per batch; got {seq_lens}"
+            )
+        self._sequence_lengths = seq_lens
582-585: unnest_sequences splits along wrong dim for uncached inputs.
Current logic assumes cached-mode shapes. Make it mode-aware to avoid mis-splitting in uncached prefill.
-    def unnest_sequences(self, t_nested: torch.Tensor) -> List[torch.Tensor]:
-        t_squeezed = t_nested.squeeze(1) if self.is_generate else t_nested.squeeze(0)
-        return list(torch.split(t_squeezed, self.sequence_lengths))
+    def unnest_sequences(self, t_nested: torch.Tensor) -> List[torch.Tensor]:
+        if self._is_cached_attn:
+            # cached mode: [B,1,...] for decode or [1,S_total,...] for prefill
+            t_squeezed = t_nested.squeeze(1) if self.is_generate else t_nested.squeeze(0)
+            return list(torch.split(t_squeezed, self.sequence_lengths, dim=0))
+        # uncached SDPA: [B, S, ...] -> split along S
+        return list(torch.split(t_nested, self.sequence_lengths, dim=1))
🧹 Nitpick comments (6)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (6)
6-7: Nit: small grammar fix in module docstring.
"…pass it on the the various…" → "…pass it on to the various…".
-    is also responsible for functionalizing information about the sequence and pass it on the the
+    is also responsible for functionalizing information about the sequence and pass it on to the
150-160: Consider standardizing input_ids dtype to torch.long.
Token/ID tensors in PyTorch models are typically torch.long; using torch.int can cause dtype mismatches in downstream ops and custom kernels.
-        self._uncached_args_flat: Dict[str, torch.Tensor] = {
-            "input_ids": torch.ones(self.max_num_tokens, dtype=torch.int),
-            "position_ids": torch.zeros(self.max_num_tokens, dtype=torch.long),
-        }
-        self.input_ids = torch.ones(self.max_batch_size, 1, dtype=torch.int)
+        self._uncached_args_flat: Dict[str, torch.Tensor] = {
+            "input_ids": torch.ones(self.max_num_tokens, dtype=torch.long),
+            "position_ids": torch.zeros(self.max_num_tokens, dtype=torch.long),
+        }
+        self.input_ids = torch.ones(self.max_batch_size, 1, dtype=torch.long)
         self.position_ids = torch.zeros(self.max_batch_size, 1, dtype=torch.long)
If some backends require int32, we can convert at the boundary or accept a dtype parameter. Please confirm expected dtype across the stack.
226-235: Avoid shared dict aliasing in dynamic shapes map.
Each entry currently references the same bs_seq_len_shape dict. Copy per key to prevent accidental shared-state mutations.
-        self._uncached_dynamic_shapes = {k: bs_seq_len_shape for k in self._uncached_arg_names}
+        self._uncached_dynamic_shapes = {
+            k: bs_seq_len_shape.copy() for k in self._uncached_arg_names
+        }
415-423: Potential inconsistency: pages_per_seq filled without matching cache_loc.
set_max_num_tokens_sample sets pages_per_seq but doesn't set cache_loc/page_assignments. Downstream consumers may read inconsistent page metadata.
Options:
- Remove the pre-fill and let callers pass page_assignments when needed, or
- Generate trivial page_assignments consistent with the fill.
Minimal change:
-        self.pages_per_seq.fill_(seq_len // self.page_size)
-        self.nest_sequences(input_ids)
+        # Let callers provide page_assignments if needed; avoid partial state here.
+        self.pages_per_seq.zero_()
+        self.nest_sequences(input_ids)
399-414: Example sequence arg check is inverted for extras.
The intent seems to be: require that all registered extra args are present in kwargs. The current assertion uses self._extra_args.keys(), which may include runtime values and not just the declared names. Consider checking against self._extra_none_inputs (declared extras) to avoid false positives.
-        assert self._extra_args.keys() <= kwargs.keys(), (
-            f"Missing extra args: {self._extra_args.keys() - kwargs.keys()}"
-        )
+        declared = set(self._extra_none_inputs.keys())
+        provided = set(kwargs.keys())
+        assert declared <= provided, f"Missing extra args: {declared - provided}"
482-503: Guard _shape_for_forward against ragged inputs in uncached mode (secondary safety).
Even with the nest-time assertion, adding a defensive check here yields clearer errors and prevents silent mis-shapes if callers bypass nest_sequences.
-        if not self._is_cached_attn or self.is_generate:
-            bs = len(self.sequence_lengths)
-            sl = self.sequence_lengths[0]
+        if not self._is_cached_attn or self.is_generate:
+            bs = len(self.sequence_lengths)
+            if bs > 0:
+                sl = self.sequence_lengths[0]
+                # sanity: uncached path expects uniform S (generate-only implies S==1)
+                if not self.is_generate:
+                    assert all(s == sl for s in self.sequence_lengths), (
+                        f"Uncached attention expects uniform seq_len; got {self.sequence_lengths}"
+                    )
+            else:
+                sl = 0
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
(7 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures
Files:
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)
Files:
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
🧬 Code graph analysis (1)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (1)
tensorrt_llm/_torch/auto_deploy/shim/interface.py (3)
args (24-26), dynamic_shapes (29-31), to (33-37)
Actionable comments posted: 0
♻️ Duplicate comments (4)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (3)
189-201
: Ragged inputs break view in uncached mode (uses seq_len[0]).As-is,
_shape_for_forward
will mis-reshape when uncached SDPA receives non-uniform sequence lengths. Enforce uniformseq_len
whenself._is_cached_attn
is False (see suggested diff in nest_sequences comment).
543-546
: Assert uniform seq_len for uncached SDPA to guarantee valid reshape.This guarantees
_shape_for_forward
can safelyview(bs, sl, ...)
.- self._store_arg("seq_len", [len(ids) for ids in input_ids], reset=True) + seq_lens = [len(ids) for ids in input_ids] + if not self._is_cached_attn: + assert len(set(seq_lens)) == 1, f"Uncached attention requires uniform seq_len; got {seq_lens}" + self._store_arg("seq_len", seq_lens, reset=True)
616-619
: Split along the correct dimension for cached vs uncached modes.Current split defaults to dim=0 and will be wrong for uncached SDPA ([B, S, ...]).
- def unnest_sequences(self, t_nested: torch.Tensor) -> List[torch.Tensor]: - t_squeezed = t_nested.squeeze(1) if self.is_generate else t_nested.squeeze(0) - return list(torch.split(t_squeezed, self.seq_len)) + def unnest_sequences(self, t_nested: torch.Tensor) -> List[torch.Tensor]: + if self._is_cached_attn: + # cached mode: decode [B,1,...] or prefill [1,S_total,...] => split along dim 0 + t_squeezed = t_nested.squeeze(1) if self.is_generate else t_nested.squeeze(0) + return list(torch.split(t_squeezed, self.seq_len, dim=0)) + # uncached SDPA: [B, S, ...] => split along sequence dimension + return list(torch.split(t_nested, self.seq_len, dim=1))tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py (1)
199-206
: Test 3 resets cache before continuation; invalidates expected equality.Reset wipes context; continuing with only tail tokens can’t match full-context ground truth.
Minimal fix (don’t reset before continuation):
- # Continue inference from previous context - cm.info.reset() - for i_p in range(x.shape[1] - num_reset_steps, x.shape[1]): + # Continue inference from previous context (do not reset here) + for i_p in range(x.shape[1] - num_reset_steps, x.shape[1]): y_with_cache[:, i_p : i_p + 1] = _call_and_unnest(x[:, i_p : i_p + 1], i_p)If you want a “resume after reset” scenario, re-prime the cache with the true prefix and compare only the tail; I can provide that patch.
🧹 Nitpick comments (7)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (3)
1-10
: Add NVIDIA SPDX header.Source files must start with the NVIDIA copyright header.
+# SPDX-FileCopyrightText: (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 """ Attention Interface to handle various attention operators and cache operations.
243-259
: Avoid dict aliasing in dynamic shapes.Using the same
bs_seq_len_shape
dict for multiple keys risks accidental cross-mutation later.- self._dynamic_shapes = {k: bs_seq_len_shape for k in self._uncached_arg_names} + self._dynamic_shapes = {k: dict(bs_seq_len_shape) for k in self._uncached_arg_names}
495-520
: Broadentnsr_like
typing/doc; supports list-of-lists for embeddings.Current hint
List[int | float]
is inaccurate when passing embeddings (list[list[float]]).Proposed signature/doc update (outside selected lines):
def _store_arg(self, name: str, tnsr_like: Sequence[Union[int, float, Sequence[float]]], reset: bool = False) -> None: """tnsr_like may be 1D (ids) or 2D (embeddings)."""
tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py (3)
1-1
: Add NVIDIA SPDX header.+# SPDX-FileCopyrightText: (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 import copy
55-66
: Host mirror type forinput_ids
is inconsistent (tensor vs list).You set
_args_host["input_ids"]
to a tensor; elsewhere host mirrors are lists. Low risk today but diverges from convention.- self._args_host["input_ids"] = self._args_device["input_ids"].cpu() + self._args_host["input_ids"] = self._args_device["input_ids"].detach().cpu().tolist()Alternatively, leave unset if not consumed.
67-80
: Optional: deterministic test setup.
_add_hidden_dim
usestorch.rand
, making baseline state nondeterministic. Consider seeding or using zeros/normal for reproducible tests.tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py (1)
1-1
: Add NVIDIA SPDX header.+# SPDX-FileCopyrightText: (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 from typing import Optional
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (6)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
(7 hunks)tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
(5 hunks)tensorrt_llm/_torch/auto_deploy/shim/demollm.py
(8 hunks)tensorrt_llm/_torch/auto_deploy/transformations/library/kvcache.py
(2 hunks)tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py
(2 hunks)tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py
(5 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- tensorrt_llm/_torch/auto_deploy/transformations/library/kvcache.py
- tensorrt_llm/_torch/auto_deploy/shim/demollm.py
- tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures
Files:
tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)
Files:
tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py
🧬 Code graph analysis (3)
tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py (2)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (4)
device
(177-178)args
(228-230)reset
(479-485)nest_sequences
(521-588)tensorrt_llm/_torch/auto_deploy/shim/interface.py (1)
args
(24-26)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (2)
tensorrt_llm/_torch/auto_deploy/shim/interface.py (3)
args
(24-26)dynamic_shapes
(29-31)to
(33-37)tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py (1)
nest_sequences
(67-79)
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py (2)
tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py (2)
SequenceEmbeddingInfo
(41-79)nest_sequences
(67-79)tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (3)
input_pos
(271-272)nest_sequences
(521-588)reset
(479-485)
🔇 Additional comments (1)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (1)
591-615
:rescatter_input_ids
assumes token-ids; undefined for embeddings.
masked_scatter_(input_ids_device == scatter_ref, ...)
won’t work with float embeddings.def rescatter_input_ids(self, ungathered_input_ids: torch.Tensor, gather_idx: List[int], scatter_ref: int): @@ - input_ids_device = self._args_device["input_ids"] - input_ids_device.masked_scatter_(input_ids_device == scatter_ref, packed_input_ids) + input_ids_device = self._args_device["input_ids"] + # Guard: only supported for 1D token-id layout + assert input_ids_device.ndim == 1, "rescatter_input_ids supports 1D token-ids only" + input_ids_device.masked_scatter_(input_ids_device == scatter_ref, packed_input_ids)If embedding support is required, we can add a 2D branch using a boolean mask over a dedicated sentinel channel or explicit indices. Want a follow-up patch?
Actionable comments posted: 9
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
tensorrt_llm/_torch/auto_deploy/shim/demollm.py (1)
53-83
: Add bounds checking for page allocationThe _assign_pages method pops from free_pages without checking if enough pages are available.
def _assign_pages(self, total_lens: List[int]) -> List[List[int]]: """A simple heuristic to assign pages based on current sequence info. ...docstring... """ si = self.cache_seq_interface.info page_assignments = si.page_assignments free_pages = set(range(si.num_pages)) - {i for pages in page_assignments for i in pages} updated_assignments = [] for t_l, pages in zip(total_lens, page_assignments): extra_tokens = t_l - len(pages) * si.page_size num_extra_pages = (extra_tokens // si.page_size) + (extra_tokens > 0) + if num_extra_pages > len(free_pages): + raise RuntimeError( + f"Insufficient free pages: need {num_extra_pages}, " + f"but only {len(free_pages)} available" + ) updated_assignments.append(pages + [free_pages.pop() for _ in range(num_extra_pages)]) return updated_assignmentstensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (2)
398-413
: Return an int from _get_sanitized_num_sequences.Currently returns a 0-d tensor; callers expect an
int
.- if s > 1: - num_seq = torch.sum(seq_len > 0) + if s > 1: + num_seq = int(torch.sum(seq_len > 0).item()) else: - num_seq = b + num_seq = int(b) return num_seq
1-1
: Add NVIDIA copyright header to attention_interface.pyPer the TensorRT-LLM coding guidelines and existing source files, every Python source must begin with the standard NVIDIA copyright and Apache 2.0 license header. Please prepend the following exact header (updating the year to 2025) to the very top of
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
before the module docstring:# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-License-Identifier: Apache-2.0 # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, softwareEnsure there is a blank line after the header before the existing triple-quoted docstring. Use the same template as seen in
benchmarks/cpp/prepare_dataset.py
.
♻️ Duplicate comments (2)
tensorrt_llm/_torch/auto_deploy/models/hf.py (1)
438-439
: Verify the placeholder pixel_values tensor shapeThe none_pixel_values tensor uses shape
(0, 3, 336, 336)
where 336x336 appears to be a specific resolution. This might not be appropriate for all vision models.Consider making the placeholder shape configurable or deriving it from the model config:
- none_pixel_values = torch.zeros(0, 3, 336, 336) + # Get default image size from processor/model config if available + processor = self.init_processor() + if processor and hasattr(processor, 'image_processor'): + size_dict = getattr(processor.image_processor, 'size', {}) + default_size = size_dict.get('height', size_dict.get('shortest_edge', 336)) + else: + default_size = 336 + none_pixel_values = torch.zeros(0, 3, default_size, default_size)tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (1)
616-619
: Fix unnesting for uncached vs cached modes.Current logic mis-splits in uncached prefill. Use
_is_cached_attn
to select split dimension.- def unnest_sequences(self, t_nested: torch.Tensor) -> List[torch.Tensor]: - t_squeezed = t_nested.squeeze(1) if self.is_generate else t_nested.squeeze(0) - return list(torch.split(t_squeezed, self.seq_len)) + def unnest_sequences(self, t_nested: torch.Tensor) -> List[torch.Tensor]: + if self._is_cached_attn: + # cached: [B,1,...] for decode or [1,S_total,...] for prefill; split along sequence axis + t_squeezed = t_nested.squeeze(1) if self.is_generate else t_nested.squeeze(0) + return list(torch.split(t_squeezed, self.seq_len, dim=0)) + # uncached SDPA: [B, S, ...] -> split along S + return list(torch.split(t_nested, self.seq_len, dim=1))
🧹 Nitpick comments (13)
tensorrt_llm/executor/worker.py (1)
84-85
: Consider extracting the backend check logic for better maintainabilityThe backend check using a list membership is correct, but hardcoding the backend names inline makes it harder to maintain. Consider extracting this to a constant or helper method.
+# Define PyTorch-compatible backends at module level +PYTORCH_COMPATIBLE_BACKENDS = {"pytorch", "_autodeploy"} + class GenerationExecutorWorker(GenerationExecutor): ... self._executor_config = executor_config - self._is_pytorch_backend = getattr(self._executor_config, "backend", - None) in ["pytorch", "_autodeploy"] + self._is_pytorch_backend = getattr(self._executor_config, "backend", + None) in PYTORCH_COMPATIBLE_BACKENDSexamples/auto_deploy/build_and_run_ad.py (1)
29-31
: Consider adding type hints to the PromptInput type aliasWhile the type alias is functional, consider documenting the expected structure of the Dict and List[Dict] types for better clarity.
# simple string, TRT-LLM style text-only prompt or full-scale HF message template -PromptInput = Union[str, Dict, List[Dict]] +from typing import TypedDict, NotRequired + +class MessageDict(TypedDict): + role: str + content: Union[str, List[Dict[str, Any]]] + +class PromptDict(TypedDict, total=False): + prompt: str + messages: NotRequired[List[MessageDict]] + +PromptInput = Union[str, PromptDict, List[MessageDict]]tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py (2)
60-65
: Improve error messages for mutually exclusive parametersThe XOR check on line 60 and the mutual exclusivity check on lines 62-65 could provide more informative error messages about what was actually passed.
- if (input_ids is None) ^ (inputs_embeds is not None): - raise ValueError("You must specify exactly one of input_ids or inputs_embeds") + if (input_ids is None) ^ (inputs_embeds is not None): + has_input_ids = input_ids is not None + has_inputs_embeds = inputs_embeds is not None + raise ValueError( + f"You must specify exactly one of input_ids or inputs_embeds. " + f"Got: input_ids={'provided' if has_input_ids else 'None'}, " + f"inputs_embeds={'provided' if has_inputs_embeds else 'None'}" + ) if pixel_values is not None and inputs_embeds is not None: raise ValueError( - "You cannot specify both pixel_values and inputs_embeds at the same time, and must specify either one" + "You cannot specify both pixel_values and inputs_embeds at the same time. " + "Please provide only one of them." )
102-103
: Document the no-vision branch behaviorThe
_no_vision_branch
function simply returns the inputs unchanged. Consider adding a docstring to clarify this is the expected behavior for text-only inputs.def _no_vision_branch(inputs_embeds, pixel_values, input_ids): + """Pass through embeddings unchanged for text-only inputs.""" return inputs_embeds
tensorrt_llm/_torch/auto_deploy/models/hf.py (2)
344-349
: Add error handling for processor initializationThe tokenizer initialization depends on the processor, but there's no error handling if the processor fails to initialize.
def init_tokenizer(self) -> Optional[Any]: """Initialize the tokenizer—either a custom name or the model's default.""" processor = self.init_processor() if processor is None: return None + if not hasattr(processor, 'tokenizer'): + raise AttributeError( + f"Processor {type(processor).__name__} does not have a tokenizer attribute" + ) return processor.tokenizer
375-418
: Extract image creation to a helper methodThe example inputs method creates multiple test images inline. Consider extracting image creation to improve readability.
+ @staticmethod + def _create_test_image(color: Tuple[int, int, int], size: Tuple[int, int] = (16, 16)) -> Image: + """Create a test image with the specified color and size.""" + return Image.new("RGB", size, color=color) + def get_example_inputs(self) -> Dict[str, torch.Tensor]: """Return a dictionary of example inputs for the model.""" def _prep_seq(text, img1, img2): return [ { "role": "user", "content": [ {"type": "image", "image": img1}, {"type": "image", "image": img2}, {"type": "text", "text": text}, ], } ] # Create a batch of conversations (batch_size = 2) batch_messages = [ _prep_seq( "Describe what you see in the two images and their differences.", - Image.new("RGB", (16, 16), color=(128, 128, 128)), - Image.new("RGB", (16, 16), color=(64, 64, 64)), + self._create_test_image((128, 128, 128)), + self._create_test_image((64, 64, 64)), ), _prep_seq( "What are the main differences between these two images?", - Image.new("RGB", (16, 16), color=(255, 0, 0)), - Image.new("RGB", (16, 16), color=(0, 255, 0)), + self._create_test_image((255, 0, 0)), + self._create_test_image((0, 255, 0)), ), ]tensorrt_llm/_torch/auto_deploy/shim/demollm.py (2)
39-44
: Add parameter validation for multimodal_paramsThe multimodal_params is directly assigned to the request without validation. Consider adding type checking.
def __call__( self, requests: GenerationRequest, multimodal_params: Optional[MultimodalParams] ) -> mp.Queue: """Generate tokens and put the results in a queue and return the queue.""" + if multimodal_params is not None and not isinstance(multimodal_params, MultimodalParams): + raise TypeError(f"Expected MultimodalParams, got {type(multimodal_params).__name__}") requests.multimodal_params = multimodal_params
101-118
: Consider validating multimodal data compatibilityWhen collecting multimodal data from multiple requests, there's no validation that the data is compatible for batching (e.g., same keys, compatible shapes).
input_ids = [] total_lens = [] extra_args: Dict[str, List[torch.Tensor]] = defaultdict(list) + expected_keys = None for request in requests: total_lens.append(len(request.prompt_token_ids)) input_ids.append(request.prompt_token_ids) if request.multimodal_params is not None: + current_keys = set(request.multimodal_params.multimodal_data.keys()) + if expected_keys is None: + expected_keys = current_keys + elif current_keys != expected_keys: + raise ValueError( + f"Inconsistent multimodal data keys across batch: " + f"expected {expected_keys}, got {current_keys}" + ) for k, v in request.multimodal_params.multimodal_data.items(): extra_args[k].append(v)tensorrt_llm/_torch/auto_deploy/llm.py (1)
46-52
: Optional: avoid double apply_chat_template.You build the string prompt and then re-tokenize; consider skipping the first call unless another component relies on
inputs["prompt"]
being populated.Also applies to: 53-63, 67-69
tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (2)
186-191
: Sanity-check gather/scatter counts in generate path.Add a lightweight assertion to catch scheduler/shape mismatches during development.
if new_tokens is not None: + # Optional sanity check: number of tokens to scatter should match gather indices + try: + expected = int(new_tokens.numel()) + assert expected == len(flat_gather_idx), ( + f"rescatter mismatch: new_tokens={expected}, gather_idx={len(flat_gather_idx)}" + ) + except Exception: + pass self.cache_seq_interface.info.rescatter_input_ids( ungathered_input_ids=new_tokens.flatten(), # ensure it's flattened gather_idx=flat_gather_idx, scatter_ref=dummy_token, )Also applies to: 217-220, 238-244
161-163
: Global RNG seeding inside engine constructor.Hard-coding
torch.manual_seed(1234)
alters global state and can affect reproducibility across tests/runs. Prefer making this optional/config-driven.- torch.manual_seed(1234) + if getattr(self, "seed", None) is not None: + torch.manual_seed(int(self.seed))tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (2)
180-201
: Guard _shape_for_forward against ragged uncached inputs.As an extra safety (beyond the
nest_sequences
assertion), fail fast if uncached and ragged.- if not self._is_cached_attn or self.is_generate: + if not self._is_cached_attn or self.is_generate: bs = len(self.seq_len) - sl = self.seq_len[0] + sls = self.seq_len + assert len(set(sls)) == 1, f"Ragged seq_len not supported in uncached mode: {sls}" + sl = sls[0]
243-259
: Dynamic shapes dict aliasing.
bs_seq_len_shape
is shared across keys; fine if immutable, but easy to accidentally mutate. Consider cloning per key to avoid hidden coupling.- self._dynamic_shapes = {k: bs_seq_len_shape for k in self._uncached_arg_names} + self._dynamic_shapes = {k: dict(bs_seq_len_shape) for k in self._uncached_arg_names}
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (20)
examples/auto_deploy/.gitignore
(1 hunks)examples/auto_deploy/build_and_run_ad.py
(3 hunks)tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
(7 hunks)tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py
(2 hunks)tensorrt_llm/_torch/auto_deploy/llm.py
(4 hunks)tensorrt_llm/_torch/auto_deploy/models/factory.py
(3 hunks)tensorrt_llm/_torch/auto_deploy/models/hf.py
(4 hunks)tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py
(1 hunks)tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
(5 hunks)tensorrt_llm/_torch/auto_deploy/shim/demollm.py
(8 hunks)tensorrt_llm/_torch/auto_deploy/transform/library/export_to_gm.py
(1 hunks)tensorrt_llm/_torch/auto_deploy/transformations/library/kvcache.py
(2 hunks)tensorrt_llm/executor/worker.py
(1 hunks)tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py
(2 hunks)tests/unittest/_torch/auto_deploy/_utils_test/_model_test_utils.py
(0 hunks)tests/unittest/_torch/auto_deploy/integration/test_llama4_vlm_export.py
(1 hunks)tests/unittest/_torch/auto_deploy/unit/singlegpu/shim/test_engine.py
(0 hunks)tests/unittest/_torch/auto_deploy/unit/singlegpu/shim/test_llm_config.py
(1 hunks)tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_build_small_single.py
(1 hunks)tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py
(5 hunks)
💤 Files with no reviewable changes (2)
- tests/unittest/_torch/auto_deploy/unit/singlegpu/shim/test_engine.py
- tests/unittest/_torch/auto_deploy/_utils_test/_model_test_utils.py
🚧 Files skipped from review as they are similar to previous changes (10)
- tests/unittest/_torch/auto_deploy/integration/test_llama4_vlm_export.py
- examples/auto_deploy/.gitignore
- tests/unittest/_torch/auto_deploy/unit/singlegpu/shim/test_llm_config.py
- tensorrt_llm/_torch/auto_deploy/models/factory.py
- tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_build_small_single.py
- tensorrt_llm/_torch/auto_deploy/transformations/library/kvcache.py
- tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py
- tensorrt_llm/_torch/auto_deploy/transform/library/export_to_gm.py
- tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py
- tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures
Files:
tensorrt_llm/executor/worker.py
tensorrt_llm/_torch/auto_deploy/llm.py
tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py
tensorrt_llm/_torch/auto_deploy/models/hf.py
examples/auto_deploy/build_and_run_ad.py
tensorrt_llm/_torch/auto_deploy/shim/demollm.py
tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)
Files:
tensorrt_llm/executor/worker.py
tensorrt_llm/_torch/auto_deploy/llm.py
tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py
tensorrt_llm/_torch/auto_deploy/models/hf.py
examples/auto_deploy/build_and_run_ad.py
tensorrt_llm/_torch/auto_deploy/shim/demollm.py
tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
🧬 Code graph analysis (7)
tensorrt_llm/_torch/auto_deploy/llm.py (6)
tensorrt_llm/inputs/registry.py (1)
DefaultInputProcessor
(44-111)tensorrt_llm/_torch/auto_deploy/models/factory.py (5)
tokenizer
(48-50)ModelFactory
(15-243)init_tokenizer
(107-114)init_processor
(116-123)prefetch_checkpoint
(125-136)tensorrt_llm/llmapi/tokenizer.py (2)
TokenizerBase
(24-25)TransformersTokenizer
(28-267)tensorrt_llm/sampling_params.py (1)
SamplingParams
(125-486)tensorrt_llm/_torch/auto_deploy/llm_args.py (1)
create_factory
(204-215)tensorrt_llm/_torch/auto_deploy/models/hf.py (3)
init_tokenizer
(188-192)init_tokenizer
(344-349)init_processor
(351-355)
tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py (2)
tensorrt_llm/_torch/models/modeling_llama.py (1)
Llama4ForConditionalGeneration
(1179-1240)tensorrt_llm/_torch/auto_deploy/export/interface.py (2)
BaseExportPatch
(38-130)ExportPatchRegistry
(166-211)
tensorrt_llm/_torch/auto_deploy/models/hf.py (2)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (1)
CacheConfig
(26-29)tensorrt_llm/_torch/auto_deploy/models/factory.py (6)
tokenizer
(48-50)init_tokenizer
(107-114)init_processor
(116-123)model
(43-45)get_example_inputs
(218-228)get_extra_inputs
(230-243)
examples/auto_deploy/build_and_run_ad.py (1)
tensorrt_llm/llmapi/llm_args.py (1)
Field
(67-94)
tensorrt_llm/_torch/auto_deploy/shim/demollm.py (3)
tensorrt_llm/executor/request.py (1)
GenerationRequest
(84-129)tensorrt_llm/inputs/multimodal.py (1)
MultimodalParams
(152-410)tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (6)
reset
(479-485)nest_sequences
(521-588)input_pos
(271-272)page_assignments
(309-311)seq_len
(267-268)args
(228-230)
tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (6)
tensorrt_llm/_torch/auto_deploy/llm_args.py (1)
create_factory
(204-215)tensorrt_llm/_torch/auto_deploy/models/factory.py (1)
get_extra_inputs
(230-243)tensorrt_llm/_torch/auto_deploy/models/hf.py (1)
get_extra_inputs
(420-439)tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (5)
add_extra_arg
(620-643)page_assignments
(309-311)input_pos
(271-272)nest_sequences
(521-588)rescatter_input_ids
(590-614)tensorrt_llm/_torch/auto_deploy/transformations/transform.py (1)
InferenceOptimizer
(25-142)tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
get_cache_indices
(572-583)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (2)
tensorrt_llm/_torch/auto_deploy/shim/interface.py (3)
args
(24-26)dynamic_shapes
(29-31)to
(33-37)tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py (1)
nest_sequences
(67-79)
🔇 Additional comments (5)
examples/auto_deploy/build_and_run_ad.py (1)
75-96
: LGTM! Well-structured prompt normalization logicThe implementation correctly handles multiple prompt formats (strings, dicts, and HF message templates) and normalizes them to a consistent structure. The fake prompt approach for HF templates clearly separates the different input types.
tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py (1)
89-96
: I’ve added a script to verify the availability of FX‐compatible assertion primitives and JIT flags in the current PyTorch build. Once we have the results, we can decide on the best pattern (e.g., usingtorch._assert
) for a torch.export‐compatible validation.tensorrt_llm/_torch/auto_deploy/shim/demollm.py (1)
337-341
: LGTM! Good handling of multimodal_params referenceThe comment clearly explains why the multimodal_params needs to be passed explicitly to maintain the reference during asynchronous submission.
tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (2)
111-117
: LGTM: wiring factory-provided extra inputs into SequenceInfo.Clean pass-through of
none_input
and dynamic-shape callbacks.
200-208
: Ensure multimodal tensors are on the correct device.If
request.py_multimodal_data
can be CPU tensors, consider moving them toself._device
here to avoid device mismatch later (SequenceInfo will.to(...)
but only after added).Would you like a small patch to move/stack these tensors to
self._device
before passing tonest_sequences
?
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (1)
1-1
: Add NVIDIA 2025 copyright header.Required for source files; place above module docstring.
+ # Copyright 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved. + # + # Redistribution and use in source and binary forms, with or without + # modification, are permitted provided that the following conditions + # are met: + # * Redistributions of source code must retain the above copyright + # notice, this list of conditions and the following disclaimer. + # * Redistributions in binary form must reproduce the above copyright + # notice, this list of conditions and the following disclaimer in the + # documentation and/or other materials provided with the distribution. + # * Neither the name of NVIDIA CORPORATION nor the names of its + # contributors may be used to endorse or promote products derived + # from this software without specific prior written permission. + # + # THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS “AS IS” AND ANY + # EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + # IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + # PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + # CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + # EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + # PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + # PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + # OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
♻️ Duplicate comments (8)
tensorrt_llm/_torch/auto_deploy/llm.py (5)
1-1
: Add NVIDIA 2025 copyright header.Mandatory per repo guidelines; prepend standard header above imports.
+ # Copyright 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved. + # + # Redistribution and use in source and binary forms, with or without + # modification, are permitted provided that the following conditions + # are met: + # * Redistributions of source code must retain the above copyright + # notice, this list of conditions and the following disclaimer. + # * Redistributions in binary form must reproduce the above copyright + # notice, this list of conditions and the following disclaimer in the + # documentation and/or other materials provided with the distribution. + # * Neither the name of NVIDIA CORPORATION nor the names of its + # contributors may be used to endorse or promote products derived + # from this software without specific prior written permission. + # + # THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS “AS IS” AND ANY + # EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + # IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + # PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + # CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + # EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + # PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + # PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + # OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
30-31
: Do not require a processor for text-only inputs.Gate the processor check under the messages path.
- if self.processor is None: - raise ValueError("processor is required to tokenize inputs") + # Only required for chat-style/multimodal inputs + if "messages" in inputs and self.processor is None: + raise ValueError("processor is required to tokenize chat-style inputs") @@ - if "messages" in inputs: + if "messages" in inputs:Also applies to: 43-44
37-41
: Preserve add_special_tokens when adding truncation.Use update() instead of overwriting kwargs.
- if sampling_params.truncate_prompt_tokens is not None: - kwargs = { - "truncation": True, - "max_length": sampling_params.truncate_prompt_tokens, - } + if sampling_params.truncate_prompt_tokens is not None: + kwargs.update({ + "truncation": True, + "max_length": sampling_params.truncate_prompt_tokens, + })
75-76
: Add tiktoken-compatible fallback for encode.Mirror DefaultInputProcessor to avoid TypeError with tokenizers that don’t accept add_special_tokens.
- token_ids = self.tokenizer.encode(inputs["prompt"], **kwargs) + try: + token_ids = self.tokenizer.encode(inputs["prompt"], **kwargs) + except TypeError: + # e.g., tiktoken path without add_special_tokens + token_ids = self.tokenizer.encode(inputs["prompt"])
126-129
: Relax tokenizer type check and use explicit exceptions instead of assert.Allow any TokenizerBase and raise errors even under -O.
- assert isinstance(self.input_processor, DefaultInputProcessor) - assert self.tokenizer is None or isinstance(self.tokenizer, TransformersTokenizer) + if not isinstance(self.input_processor, DefaultInputProcessor): + raise TypeError("input_processor must be a DefaultInputProcessor") + if self.tokenizer is not None and not isinstance(self.tokenizer, TokenizerBase): + raise TypeError("tokenizer must implement TokenizerBase")tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (3)
548-553
: Avoid stale tail values in device tensors.Zero-fill before writes for input_pos/pages_per_seq to prevent leftover entries affecting later runs.
- self._store_arg( + self._store_arg( "input_pos", [input_pos] * self.num_sequences if isinstance(input_pos, int) else input_pos, - ) + , reset=True) @@ - self._store_arg("pages_per_seq", pages_per_seq) + self._store_arg("pages_per_seq", pages_per_seq, reset=True)Also applies to: 559-561
543-546
: Uncached path: enforce uniform seq_len or pad; current view() will mis-shape ragged inputs.Add a guard when not using cached attention.
- self._store_arg("seq_len", [len(ids) for ids in input_ids], reset=True) + seq_lens_list = [len(ids) for ids in input_ids] + if not self._is_cached_attn: + assert len(set(seq_lens_list)) == 1, ( + f"Uncached attention requires uniform seq_len; got {seq_lens_list}" + ) + self._store_arg("seq_len", seq_lens_list, reset=True)
616-619
: unnest_sequences fails for uncached SDPA; split along correct dimension.Handle cached vs uncached by choosing split axis.
- def unnest_sequences(self, t_nested: torch.Tensor) -> List[torch.Tensor]: - t_squeezed = t_nested.squeeze(1) if self.is_generate else t_nested.squeeze(0) - return list(torch.split(t_squeezed, self.seq_len)) + def unnest_sequences(self, t_nested: torch.Tensor) -> List[torch.Tensor]: + if self._is_cached_attn: + t_squeezed = t_nested.squeeze(1) if self.is_generate else t_nested.squeeze(0) + return list(torch.split(t_squeezed, self.seq_len, dim=0)) + # uncached SDPA: [B, S, ...] -> split along seq dim + return list(torch.split(t_nested, self.seq_len, dim=1))
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (4)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
(7 hunks)tensorrt_llm/_torch/auto_deploy/llm.py
(4 hunks)tensorrt_llm/_torch/auto_deploy/transform/library/export_to_gm.py
(1 hunks)tests/unittest/_torch/auto_deploy/integration/test_llama4_vlm_export.py
(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- tests/unittest/_torch/auto_deploy/integration/test_llama4_vlm_export.py
- tensorrt_llm/_torch/auto_deploy/transform/library/export_to_gm.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures
Files:
tensorrt_llm/_torch/auto_deploy/llm.py
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)
Files:
tensorrt_llm/_torch/auto_deploy/llm.py
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
🧠 Learnings (1)
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Applied to files:
tensorrt_llm/_torch/auto_deploy/llm.py
🧬 Code graph analysis (2)
tensorrt_llm/_torch/auto_deploy/llm.py (7)
tensorrt_llm/inputs/registry.py (1)
DefaultInputProcessor
(44-111)tensorrt_llm/llmapi/llm.py (4)
tokenizer
(681-685)tokenizer
(688-689)prompt
(79-80)LLM
(1079-1095)tensorrt_llm/_torch/auto_deploy/models/factory.py (5)
tokenizer
(48-50)ModelFactory
(15-243)init_tokenizer
(107-114)init_processor
(116-123)prefetch_checkpoint
(125-136)tensorrt_llm/llmapi/tokenizer.py (2)
TokenizerBase
(24-25)TransformersTokenizer
(28-267)tensorrt_llm/sampling_params.py (1)
SamplingParams
(125-486)tensorrt_llm/_torch/auto_deploy/llm_args.py (1)
create_factory
(204-215)tensorrt_llm/_torch/auto_deploy/shim/demollm.py (1)
DemoGenerationExecutor
(246-346)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (2)
tensorrt_llm/_torch/auto_deploy/shim/interface.py (3)
args
(24-26)dynamic_shapes
(29-31)to
(33-37)tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py (1)
nest_sequences
(67-79)
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py (3)
1-1
: Add NVIDIA 2025 copyright header.Per repo policy, prepend the standard NVIDIA header before the module docstring.
Apply:
+ # Copyright 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved. + # + # Redistribution and use in source and binary forms, with or without + # modification, are permitted provided that the following conditions + # are met: + # * Redistributions of source code must retain the above copyright + # notice, this list of conditions and the following disclaimer. + # * Redistributions in binary form must reproduce the above copyright + # notice, this list of conditions and the following disclaimer in the + # documentation and/or other materials provided with the distribution. + # * Neither the name of NVIDIA CORPORATION nor the names of its + # contributors may be used to endorse or promote products derived + # from this software without specific prior written permission. + # + # THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS “AS IS” AND ANY + # EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + # IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + # PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + # CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + # EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + # PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + # PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + # OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
18-31
: Bug: boolean mask adds +1 to allowed logits. Should be 0.0.Mapping True→1.0 skews attention; allowed positions must add 0.0, blocked −inf.
Apply:
- if attn_mask.dtype == torch.bool: - float_mask = torch.zeros_like(attn_mask, dtype=dtype) - float_mask = float_mask.masked_fill(attn_mask, 1.0) # True -> 1.0 - float_mask = float_mask.masked_fill(~attn_mask, float("-inf")) # False -> -inf - return float_mask + if attn_mask.dtype == torch.bool: + float_mask = torch.zeros_like(attn_mask, dtype=dtype) + float_mask = float_mask.masked_fill(~attn_mask, float("-inf")) # False -> -inf + return float_mask
121-125
: Guard GQA head divisibility.Avoid silent shape errors when n_heads % n_kv_heads != 0.
Apply:
- if n_heads != n_kv_heads: + if n_heads != n_kv_heads: + if n_heads % n_kv_heads != 0: + raise ValueError(f"n_heads ({n_heads}) must be an integer multiple of n_kv_heads ({n_kv_heads})") n_rep = n_heads // n_kv_heads key_t = repeat_kv(key_t, n_rep) value_t = repeat_kv(value_t, n_rep)tensorrt_llm/_torch/auto_deploy/transform/library/export_to_gm.py (1)
1-1
: Add NVIDIA 2025 copyright header.Required for non-test Python sources.
Apply:
+ # Copyright 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved. + # + # Redistribution and use in source and binary forms, with or without + # modification, are permitted provided that the following conditions + # are met: + # * Redistributions of source code must retain the above copyright + # notice, this list of conditions and the following disclaimer. + # * Redistributions in binary form must reproduce the above copyright + # notice, this list of conditions and the following disclaimer in the + # documentation and/or other materials provided with the distribution. + # * Neither the name of NVIDIA CORPORATION nor the names of its + # contributors may be used to endorse or promote products derived + # from this software without specific prior written permission. + # + # THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS “AS IS” AND ANY + # EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + # IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + # PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + # CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + # EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + # PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + # PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + # OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
♻️ Duplicate comments (9)
tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py (1)
66-67
: Explicit schema over *args/kwargs for custom_op. custom_op needs a stable schema for export; keeping explicit parameters is correct here. The added enable_gqa is fine.
tensorrt_llm/_torch/auto_deploy/transform/library/export_to_gm.py (1)
69-77
: Remove debugger-only dead branch; always export. The if False block prevents export and leaves gm empty.
Apply:
- if False: - gm = torch_export_to_gm( - model, - args=cm.args, - dynamic_shapes=cm.dynamic_shapes, - clone=self.config.clone_state_dict, - strict=self.config.strict, - patch_list=self.config.patch_list, - ) + gm = torch_export_to_gm( + model, + args=cm.args, + dynamic_shapes=cm.dynamic_shapes, + clone=self.config.clone_state_dict, + strict=self.config.strict, + patch_list=self.config.patch_list, + )tensorrt_llm/_torch/auto_deploy/llm.py (6)
75-76
: tiktoken-compatible encode fallback. Some tokenizers reject the add_special_tokens kwarg.
Apply:
- token_ids = self.tokenizer.encode(inputs["prompt"], **kwargs) + try: + token_ids = self.tokenizer.encode(inputs["prompt"], **kwargs) + except TypeError: + token_ids = self.tokenizer.encode(inputs["prompt"])
1-1
: Add NVIDIA 2025 copyright header. Mandatory for source files.
Apply header block as used elsewhere in the repo.
30-31
: Do not require processor for text-only inputs. Gate the check under the messages branch to keep text-only working.
Apply:
- if self.processor is None: - raise ValueError("processor is required to tokenize inputs") + # Processor is only required for chat/multimodal inputs + if "messages" in inputs and self.processor is None: + raise ValueError("processor is required to tokenize chat-style inputs")
37-41
: Preserve existing kwargs when adding truncation. Use update(); don't overwrite.
Apply:
- if sampling_params.truncate_prompt_tokens is not None: - kwargs = { - "truncation": True, - "max_length": sampling_params.truncate_prompt_tokens, - } + if sampling_params.truncate_prompt_tokens is not None: + kwargs.update({ + "truncation": True, + "max_length": sampling_params.truncate_prompt_tokens, + })
66-74
: Accept 1D/2D input_ids; avoid assert. Handle tokenizer outputs robustly and error clearly on unexpected shapes.
Apply:
- token_ids = all_args.pop("input_ids") - assert token_ids.shape[0] == 1, "messages should be unbatched at this point." + token_ids = all_args.pop("input_ids") + if token_ids.ndim == 2: + token_ids = token_ids[0] + elif token_ids.ndim != 1: + raise ValueError(f"Unexpected token_ids shape: {tuple(token_ids.shape)}")
125-129
: Loosen tokenizer type check; do not rely on assert. Allow any TokenizerBase; raise ValueError instead of assert.
Apply:
- assert isinstance(self.input_processor, DefaultInputProcessor) - assert self.tokenizer is None or isinstance(self.tokenizer, TransformersTokenizer) + if not isinstance(self.input_processor, DefaultInputProcessor): + raise TypeError("input_processor must be DefaultInputProcessor-compatible") + if self.tokenizer is not None and not isinstance(self.tokenizer, TokenizerBase): + raise TypeError("tokenizer must implement TokenizerBase")tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py (1)
1-1
: Add NVIDIA copyright header. Required for new source files.
Apply the standard 2025 NVIDIA header at the top of the file.
🧹 Nitpick comments (5)
tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py (1)
169-187
: Align sinks implementation with comments (explicit sink column). Current code adjusts the normalizer only; either fix the comments or implement the concatenation path. Suggest implementing the explicit sink column for clarity.
Apply:
- if sinks is not None: - # Concatenate sinks to attention scores following the reference implementation - # sinks should have n_heads elements, each head gets its own sink value - # Expand sinks to [b, n_heads, s_q, 1] - one sink column per head - sinks_expanded = sinks.reshape(1, -1, 1, 1).expand( - b, n_heads, s_q, 1 - ) # [b, n_heads, s_q, 1] - - # Concatenate along the key dimension (last dimension) - logits_max = torch.max(attn_scores, dim=-1, keepdim=True).values - sinks = torch.exp(sinks_expanded - logits_max) - unnormalized_scores = torch.exp(attn_scores - logits_max) - normalizer = unnormalized_scores.sum(dim=-1, keepdim=True) + sinks - scores = unnormalized_scores / normalizer - # Use only the non-sink portion for computing output - # We added exactly 1 column, so remove exactly 1 column - attn_out = torch.matmul(scores, value_t) # [b, n_heads, s_q, v_head_dim] + if sinks is not None: + # Expand sinks to [b, n_heads, s_q, 1] and concatenate as an extra "sink" key + sinks_expanded = sinks.reshape(1, -1, 1, 1).expand(b, n_heads, s_q, 1) + attn_scores_ext = torch.cat([attn_scores, sinks_expanded], dim=-1) # [b, n, s_q, s_k+1] + attn_weights_ext = torch.softmax(attn_scores_ext, dim=-1, dtype=torch.float32).to(query.dtype) + attn_weights = attn_weights_ext[..., :-1] # drop sink column for value matmul + attn_out = torch.matmul(attn_weights, value_t) # [b, n_heads, s_q, v_head_dim]tests/unittest/_torch/auto_deploy/integration/test_llama4_vlm_export.py (2)
130-136
: Remove unused local n_image_tokens
.Silences F841; variable is unused.
Apply:
- n_image_tokens = special_image_mask.sum() special_image_mask = ( special_image_mask.unsqueeze(-1).expand_as(inputs_embeds).to(inputs_embeds.device) )
144-151
: Optional: robust image-presence predicate for torch.cond.all-zero images yield False with
torch.any(pixel_values != 0)
. Consider a shape-based predicate.Apply:
- has_image: torch.Tensor = torch.any(pixel_values != 0) + # Treat any non-empty batch as "has image" + has_image: torch.Tensor = (torch.tensor(pixel_values.shape[0], device=pixel_values.device) > 0)tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py (2)
105-113
: Prefer shape-based predicate for torch.cond to handle black images. torch.any(pixel_values != 0) fails for valid all-black images.
Apply:
- has_image: torch.Tensor = torch.any(pixel_values != 0) + # Consider any non-empty batch as "has image" + has_image: torch.Tensor = (torch.tensor(pixel_values.shape[0], device=pixel_values.device) > 0)
162-169
: Consider registering the patch. Uncomment registration so export flows can use patch_list=["hf_llama4_vision"].
Apply:
-# @ExportPatchRegistry.register("hf_llama4_vision")
+# @ExportPatchRegistry.register("hf_llama4_vision")  # class header remains unchanged

If registration is intentionally deferred, ignore.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (6)
tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py (2 hunks)
tensorrt_llm/_torch/auto_deploy/llm.py (4 hunks)
tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py (1 hunks)
tensorrt_llm/_torch/auto_deploy/transform/library/export_to_gm.py (1 hunks)
tests/unittest/_torch/auto_deploy/integration/test_llama4_vlm_export.py (8 hunks)
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py (0 hunks)
💤 Files with no reviewable changes (1)
- tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures
Files:
tensorrt_llm/_torch/auto_deploy/llm.py
tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py
tests/unittest/_torch/auto_deploy/integration/test_llama4_vlm_export.py
tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py
tensorrt_llm/_torch/auto_deploy/transform/library/export_to_gm.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)
Files:
tensorrt_llm/_torch/auto_deploy/llm.py
tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py
tests/unittest/_torch/auto_deploy/integration/test_llama4_vlm_export.py
tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py
tensorrt_llm/_torch/auto_deploy/transform/library/export_to_gm.py
🧠 Learnings (1)
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Applied to files:
tensorrt_llm/_torch/auto_deploy/llm.py
🧬 Code graph analysis (4)
tensorrt_llm/_torch/auto_deploy/llm.py (6)
tensorrt_llm/inputs/registry.py (1)
DefaultInputProcessor
(44-111)tensorrt_llm/llmapi/llm.py (4)
tokenizer
(681-685)tokenizer
(688-689)prompt
(79-80)LLM
(1079-1095)tensorrt_llm/_torch/auto_deploy/models/factory.py (5)
tokenizer
(48-50)ModelFactory
(15-243)init_tokenizer
(107-114)init_processor
(116-123)prefetch_checkpoint
(125-136)tensorrt_llm/llmapi/tokenizer.py (3)
TokenizerBase
(24-25)TransformersTokenizer
(28-267)tokenizer_factory
(270-290)tensorrt_llm/sampling_params.py (1)
SamplingParams
(125-486)tensorrt_llm/_torch/auto_deploy/llm_args.py (1)
create_factory
(204-215)
tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py (2)
tensorrt_llm/_torch/models/modeling_llama.py (1)
Llama4ForConditionalGeneration
(1179-1240)tensorrt_llm/_torch/auto_deploy/export/interface.py (1)
BaseExportPatch
(38-130)
tests/unittest/_torch/auto_deploy/integration/test_llama4_vlm_export.py (4)
tensorrt_llm/_torch/auto_deploy/export/interface.py (1)
apply_export_patches
(215-249)tensorrt_llm/_torch/auto_deploy/export/export.py (1)
torch_export_to_gm
(198-284)tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py (1)
_forward_with_cond
(20-159)tensorrt_llm/_torch/auto_deploy/transformations/_graph.py (1)
move_to_device
(132-139)
tensorrt_llm/_torch/auto_deploy/transform/library/export_to_gm.py (3)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (3)
set_example_sequence
(440-466)args
(228-230)dynamic_shapes
(262-264)tensorrt_llm/_torch/auto_deploy/models/factory.py (2)
get_example_inputs
(218-228)model
(43-45)tensorrt_llm/_torch/auto_deploy/export/export.py (1)
torch_export_to_gm
(198-284)
🪛 Ruff (0.12.2)
tests/unittest/_torch/auto_deploy/integration/test_llama4_vlm_export.py
130-130: Local variable n_image_tokens is assigned to but never used. Remove assignment to unused variable n_image_tokens (F841)
🔇 Additional comments (2)
tensorrt_llm/_torch/auto_deploy/transform/library/export_to_gm.py (1)
66-67
: Good: example inputs threaded into SequenceInterface. Passing factory.get_example_inputs() into set_example_sequence() matches the new interface.
tensorrt_llm/_torch/auto_deploy/llm.py (1)
99-100
: Tokenizer factory wiring looks good. Delegating to factory.init_tokenizer() aligns with the new ModelFactory flow.
Actionable comments posted: 8
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
tensorrt_llm/_torch/auto_deploy/export/interface.py (1)
1-4
: Add NVIDIA 2025 copyright header
Prepend the standard NVIDIA header (identical wording and license used in other tensorrt_llm source files) above the existing module docstring in tensorrt_llm/_torch/auto_deploy/export/interface.py.
tensorrt_llm/_torch/auto_deploy/models/hf.py (1)
1-1
: Prepend NVIDIA 2025 copyright header. Add the standard NVIDIA header block at the top of tensorrt_llm/_torch/auto_deploy/models/hf.py as specified in CODING_GUIDELINES.md.
tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py (1)
110-117
: Fix None-handling and zero-image detection for torch.cond. Current code errors on pixel_values=None and misclassifies all-black images. Use a batch-size check and produce a 0-d bool tensor.
- # decide by whether there is any non-zero pixel_values - has_image: torch.Tensor = torch.any(pixel_values != 0) + # decide by presence (batch dimension) of images + if pixel_values is None: + has_image = torch.tensor(False, dtype=torch.bool, device=inputs_embeds.device) + else: + has_image = torch.tensor(pixel_values.shape[0] > 0, dtype=torch.bool, device=inputs_embeds.device)tensorrt_llm/_torch/auto_deploy/export/export.py (1)
1-1
: Prepend NVIDIA 2025 copyright header. Add the standard NVIDIA header before the module docstring at the top of export.py.
♻️ Duplicate comments (8)
tensorrt_llm/_torch/auto_deploy/llm.py (5)
75-76
: Add tiktoken-compatible fallback for encode. Some tokenizers don't accept add_special_tokens; retry without kwargs.
- token_ids = self.tokenizer.encode(inputs["prompt"], **kwargs) + try: + token_ids = self.tokenizer.encode(inputs["prompt"], **kwargs) + except TypeError: + token_ids = self.tokenizer.encode(inputs["prompt"])
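A self-contained illustration of the fallback pattern shown in the diff above, using a stand-in tokenizer whose encode() takes no keyword arguments (tiktoken-style); the class and helper names here are hypothetical, not part of the PR.

class KwargFreeTokenizer:
    """Stand-in for a tokenizer that rejects add_special_tokens/truncation kwargs."""

    def encode(self, text):
        return [ord(c) for c in text]

def safe_encode(tokenizer, prompt, **kwargs):
    try:
        return tokenizer.encode(prompt, **kwargs)
    except TypeError:
        # Tokenizer does not understand the extra kwargs; retry without them.
        return tokenizer.encode(prompt)

print(safe_encode(KwargFreeTokenizer(), "hi", add_special_tokens=False))  # [104, 105]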
1-1
: Missing NVIDIA 2025 header. This file still lacks the required header.
+ # Copyright 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved. + # + # Redistribution and use in source and binary forms, with or without + # modification, are permitted provided that the following conditions + # are met: + # * Redistributions of source code must retain the above copyright + # notice, this list of conditions and the following disclaimer. + # * Redistributions in binary form must reproduce the above copyright + # notice, this list of conditions and the following disclaimer in the + # documentation and/or other materials provided with the distribution. + # * Neither the name of NVIDIA CORPORATION nor the names of its + # contributors may be used to endorse or promote products derived + # from this software without specific prior written permission. + # + # THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS “AS IS” AND ANY + # EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE + # IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR + # PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR + # CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, + # EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, + # PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR + # PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY + # OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
30-32
: Require processor only for chat-style inputs. Unconditionally requiring processor breaks text-only. Move the check under the messages path and preserve kwargs when adding truncation.
- if self.processor is None: - raise ValueError("processor is required to tokenize inputs") + # Only required for chat-style/multimodal inputs @@ - if sampling_params.truncate_prompt_tokens is not None: - kwargs = { - "truncation": True, - "max_length": sampling_params.truncate_prompt_tokens, - } + if sampling_params.truncate_prompt_tokens is not None: + kwargs.update({ + "truncation": True, + "max_length": sampling_params.truncate_prompt_tokens, + }) @@ - if "messages" in inputs: + if "messages" in inputs: + if self.processor is None: + raise ValueError("processor is required to tokenize chat-style inputs")Also applies to: 43-46, 53-63
67-73
: Handle 1D/2D token_ids and avoid assert. apply_chat_template can return [S] or [1,S]; accept both and raise on unexpected shapes.
- token_ids = all_args.pop("input_ids") - assert token_ids.shape[0] == 1, "messages should be unbatched at this point." + token_ids = all_args.pop("input_ids") + if token_ids.ndim == 2: + if token_ids.shape[0] != 1: + raise ValueError(f"Expected batch size 1, got {tuple(token_ids.shape)}") + token_ids = token_ids[0] + elif token_ids.ndim != 1: + raise ValueError(f"Unexpected token_ids shape: {tuple(token_ids.shape)}") @@ - return token_ids[0].tolist(), extra_processed_inputs + return token_ids.tolist(), extra_processed_inputs
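A standalone sketch (plain PyTorch, hypothetical helper name) of the shape normalization suggested above: accept [S] or [1, S] token tensors and reject anything else.

import torch

def normalize_token_ids(token_ids: torch.Tensor) -> list:
    if token_ids.ndim == 2:
        if token_ids.shape[0] != 1:
            raise ValueError(f"Expected batch size 1, got {tuple(token_ids.shape)}")
        token_ids = token_ids[0]
    elif token_ids.ndim != 1:
        raise ValueError(f"Unexpected token_ids shape: {tuple(token_ids.shape)}")
    return token_ids.tolist()

print(normalize_token_ids(torch.tensor([[1, 2, 3]])))  # [1, 2, 3]
print(normalize_token_ids(torch.tensor([4, 5])))       # [4, 5]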
126-129
: Replace asserts with explicit validation; widen tokenizer type. Use explicit raises (asserts can be optimized out) and accept any TokenizerBase.
- assert isinstance(self.input_processor, DefaultInputProcessor) - assert self.tokenizer is None or isinstance(self.tokenizer, TransformersTokenizer) + if not isinstance(self.input_processor, DefaultInputProcessor): + raise TypeError(f"input_processor must be DefaultInputProcessor, got {type(self.input_processor)}") + if self.tokenizer is not None and not isinstance(self.tokenizer, TokenizerBase): + raise TypeError(f"tokenizer must be a TokenizerBase or None, got {type(self.tokenizer)}")tensorrt_llm/_torch/auto_deploy/models/hf.py (1)
432-441
: Make placeholder pixel_values shape configurable. Hardcoding (0,3,336,336) is brittle across VLMs. Derive from config where possible and fall back to (0,C,H,W).
- none_pixel_values = torch.zeros(0, 3, 336, 336) + channels = getattr(getattr(self, "model_config", None), "num_channels", 3) + size = getattr(getattr(self, "model_config", None), "image_size", 336) + none_pixel_values = torch.zeros(0, channels, size, size)tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py (2)
1-1
: Missing NVIDIA header. Add the required header block per repo standards.
86-95
: Robust image-token attribute access. Configs may expose image_token_id or image_token_index. Use a fallback getter to avoid AttributeError.
- if input_ids is None: + image_token = getattr(self.config, "image_token_id", + getattr(self.config, "image_token_index", None)) + if image_token is None: + raise ValueError("Model config must define image_token_id or image_token_index") + if input_ids is None: special_image_mask = inputs_embeds == self.get_input_embeddings()( - torch.tensor( - self.config.image_token_id, dtype=torch.long, device=inputs_embeds.device - ) + torch.tensor(image_token, dtype=torch.long, device=inputs_embeds.device) ) special_image_mask = special_image_mask.all(-1) else: - special_image_mask = input_ids == self.config.image_token_id + special_image_mask = input_ids == image_token
🧹 Nitpick comments (12)
tensorrt_llm/_torch/auto_deploy/export/interface.py (4)
186-188
: Improve unknown-patch error for debuggability. Include available patch names in the exception to speed up troubleshooting.
- if not cls.has(name): - raise ValueError(f"Unknown patch: {name}") + if not cls.has(name): + available = ", ".join(cls.list_patches()) + raise ValueError(f"Unknown patch: {name}. Available patches: [{available}]")
217-220
: Update docstring to reflect new API (patch_list, defaults). The public docstring omits patch_list and the "apply all registered patches" behavior. Please document both to avoid surprises for callers.
def apply_export_patches( patch_configs: Optional[Dict[str, Union[ExportPatchConfig, Dict[str, Any]]]] = None, patch_list: Optional[List[str]] = None, ): - """Context manager to apply multiple patches. - - Args: - patch_configs: Dict mapping patch names to their configurations. - """ + """Context manager to apply multiple patches. + + Args: + patch_configs: Optional mapping {patch_name -> config dict or ExportPatchConfig}. + If None and patch_list is None, all registered patches are applied with defaults. + patch_list: Optional list of patch names to apply with default configs. + Mutually exclusive with patch_configs. + """Also applies to: 221-225
231-237
: Default "apply all patches" may be hazardous. Auto-applying every registered patch can introduce unintended side effects (e.g., vision patches on text-only models). Consider requiring explicit opt-in or gating with a boolean, e.g., apply_all_by_default=False.
-def apply_export_patches(...): +def apply_export_patches(..., apply_all_by_default: bool = False): @@ - elif patch_configs is None: - # Default patch configurations - apply all registered patches with default settings - patch_configs = {patch_name: {} for patch_name in ExportPatchRegistry.list_patches()} + elif patch_configs is None: + if apply_all_by_default: + patch_configs = {p: {} for p in ExportPatchRegistry.list_patches()} + else: + patch_configs = {}
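A minimal usage sketch of the two call styles under discussion, assuming the module path and patch names referenced elsewhere in this review; anything beyond what the review shows (exact defaults, available patch names in a given install) is an assumption, not verified here.

from tensorrt_llm._torch.auto_deploy.export.interface import apply_export_patches

# Names only: each listed patch gets its default config.
with apply_export_patches(patch_list=["sdpa", "torch_where"]):
    pass  # run torch_export_to_gm(...) inside the patched context

# Explicit per-patch configs; mutually exclusive with patch_list
# (passing both raises ValueError per the guard noted in this review).
with apply_export_patches(patch_configs={"sdpa": {}, "torch_where": {}}):
    pass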
239-240
: Deterministic order of patch application. If order matters, preserve user intent for patch_list and provide a stable order for dict-based configs.
- patches = [ExportPatchRegistry.create_patch(k, conf) for k, conf in patch_configs.items()] + # Preserve order: explicit patch_list order first, otherwise alphabetical + names = list(patch_configs.keys()) + if patch_list is None: + names = sorted(names) + patches = [ExportPatchRegistry.create_patch(k, patch_configs[k]) for k in names]Also applies to: 255-258
tensorrt_llm/_torch/auto_deploy/models/hf.py (3)
342-352
: Processor-first init for IT2T models: good, but guard for None. init_tokenizer forwarding to processor.tokenizer is fine; add a clear error if the processor/model name is missing.
def init_processor(self) -> Optional[Any]: """Initialize the processor for the model.""" - if self.tokenizer is None: - return None + if self.tokenizer is None: + return None return AutoProcessor.from_pretrained(self.tokenizer, **self.tokenizer_kwargs)Follow-up: ensure callers either handle None or raise early when IT2T factory is selected without a valid tokenizer name.
Also applies to: 353-358
377-405
: Prefer using self.init_processor() and handle missing processor. Avoid re-instantiating AutoProcessor; reuse init_processor and validate presence.
- processor = AutoProcessor.from_pretrained(self.tokenizer, **self.tokenizer_kwargs) + processor = self.init_processor() + if processor is None: + raise ValueError("Processor is required to generate example inputs for IT2T models.")Also applies to: 406-421
114-115
: Nit: call the bound method directly. type(model).forward(model, ...) is unusual; model.forward(...) is clearer unless you're intentionally bypassing monkey-patching.
tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py (1)
133-134
: Prefer named attributes on outputs. Use outputs.logits (and peers) for clarity and stability.
- logits = outputs[0] + logits = outputs.logits @@ - return Llama4CausalLMOutputWithPast( - loss=loss, - logits=logits, - past_key_values=outputs.past_key_values, - hidden_states=outputs.hidden_states, - attentions=outputs.attentions, - image_hidden_states=None, # skip outputting this for simplicity - ) + return Llama4CausalLMOutputWithPast( + loss=loss, + logits=logits, + past_key_values=outputs.past_key_values, + hidden_states=outputs.hidden_states, + attentions=outputs.attentions, + image_hidden_states=None, + )Also applies to: 160-167
tensorrt_llm/_torch/auto_deploy/export/export.py (1)
206-231
: Docstring aligns with new behavior; add explicit cross-reference. Mention that mutual exclusivity and default-all behavior are enforced by apply_export_patches to avoid confusion.
- patch_configs: Optional patch configurations. If None, all registered patches - will be applied with default settings. - patch_list: Optional list of patch names to apply with default settings. - Cannot be used together with patch_configs. + patch_configs: Optional patch configurations. If None (and patch_list is None), + apply_export_patches may apply all registered patches with defaults. + patch_list: Optional list of patch names to apply with default settings. + Cannot be used together with patch_configs (enforced in apply_export_patches).tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_llama4_vlm_patch.py (3)
82-95
: Follow import refactor for export/move helpers. Use the module-qualified names introduced earlier.
Apply:
- gm = torch_export_to_gm( + gm = ad_export.torch_export_to_gm( model, args=(input_ids, position_ids, pixel_values), patch_list=[ @@ ], ) - move_to_device(gm, model.device) + graph.move_to_device(gm, model.device)
59-73
: Make logits extraction robust across Module/GraphModule/dict returns. GM may return dicts/tuples without a logits attribute. A tiny helper avoids brittle assumptions.
Apply this helper above _run_with_and_without_image and use it in the return:
+ def _get_logits(out): + if hasattr(out, "logits"): + return out.logits + if isinstance(out, dict) and "logits" in out: + return out["logits"] + if isinstance(out, (tuple, list)): + return out[0] + return out @@ - return {"no_images": out_no_images.logits, "with_images": out_with_images.logits} + return {"no_images": _get_logits(out_no_images), "with_images": _get_logits(out_with_images)}
85-93
: Avoid duplication: define a single PATCH_LIST constant. Reduces drift in future edits.
Apply:
+ PATCH_LIST = [ + "transformers_sdpa_mask", + "autocast_noop", + "torch_where", + "tensor_meta_device", + "sdpa_kernel_noop", + "sdpa", + "hf_llama4_vision", + ] @@ - patch_list=[ - "transformers_sdpa_mask", - "autocast_noop", - "torch_where", - "tensor_meta_device", - "sdpa_kernel_noop", - "sdpa", - "hf_llama4_vision", - ], + patch_list=PATCH_LIST,
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (8)
tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py (2 hunks)
tensorrt_llm/_torch/auto_deploy/export/export.py (2 hunks)
tensorrt_llm/_torch/auto_deploy/export/interface.py (3 hunks)
tensorrt_llm/_torch/auto_deploy/llm.py (4 hunks)
tensorrt_llm/_torch/auto_deploy/models/hf.py (5 hunks)
tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py (5 hunks)
tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_llama4_vlm_patch.py (1 hunks)
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py (0 hunks)
💤 Files with no reviewable changes (1)
- tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py
🚧 Files skipped from review as they are similar to previous changes (1)
- tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{cpp,cc,cxx,cu,py,h,hpp,hh,hxx,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use spaces only; no tabs; indent by 4 spaces
Files:
tensorrt_llm/_torch/auto_deploy/export/interface.py
tensorrt_llm/_torch/auto_deploy/llm.py
tensorrt_llm/_torch/auto_deploy/export/export.py
tensorrt_llm/_torch/auto_deploy/models/hf.py
tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_llama4_vlm_patch.py
tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Code must target Python 3.8+
Indent with 4 spaces; no tabs
Keep module namespace on import; import the module, not individual names; use module.symbol
Python filenames use snake_case (e.g., some_file.py)
Class names use PascalCase
Function and method names use snake_case
Local variables use snake_case; if starting with a number, prefix with k_ (e.g., k_99th_percentile)
Global variables use UPPER_SNAKE with G_ prefix (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE
Avoid shadowing variables from outer scopes
Initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with trailing docstrings
Avoid reflection when simple alternatives exist (e.g., avoid dict(**locals()) patterns)
Limit except clauses to specific exceptions; avoid bare except
When duck-typing with try/except, keep try body minimal and use else for logic
Files:
tensorrt_llm/_torch/auto_deploy/export/interface.py
tensorrt_llm/_torch/auto_deploy/llm.py
tensorrt_llm/_torch/auto_deploy/export/export.py
tensorrt_llm/_torch/auto_deploy/models/hf.py
tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_llama4_vlm_patch.py
tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py
**/*.{cpp,cc,cxx,cu,h,hpp,hh,hxx,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tensorrt_llm/_torch/auto_deploy/export/interface.py
tensorrt_llm/_torch/auto_deploy/llm.py
tensorrt_llm/_torch/auto_deploy/export/export.py
tensorrt_llm/_torch/auto_deploy/models/hf.py
tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_llama4_vlm_patch.py
tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py
🧠 Learnings (2)
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Applied to files:
tensorrt_llm/_torch/auto_deploy/llm.py
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_llama4_vlm_patch.py
🧬 Code graph analysis (3)
tensorrt_llm/_torch/auto_deploy/models/hf.py (2)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (3)
CacheConfig
(26-29)device
(177-178)to
(431-438)tensorrt_llm/llmapi/llm.py (2)
tokenizer
(681-685)tokenizer
(688-689)
tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_llama4_vlm_patch.py (5)
tests/unittest/_torch/auto_deploy/_utils_test/_model_test_utils.py (1)
get_small_model_config
(440-478)examples/auto_deploy/build_and_run_ad.py (1)
ExperimentConfig
(124-236)tensorrt_llm/_torch/auto_deploy/llm_args.py (2)
LlmArgs
(226-330)create_factory
(204-215)tensorrt_llm/_torch/auto_deploy/transformations/_graph.py (1)
move_to_device
(132-139)tensorrt_llm/_torch/auto_deploy/models/hf.py (1)
init_processor
(353-357)
tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py (2)
tensorrt_llm/_torch/models/modeling_llama.py (1)
Llama4ForConditionalGeneration
(1179-1240)tensorrt_llm/functional.py (1)
masked_scatter
(2539-2574)
🔇 Additional comments (8)
tensorrt_llm/_torch/auto_deploy/export/interface.py (1)
226-229
: Mutual-exclusivity check: good guardrail. Clear error early; behavior matches the function doc. LGTM.
tensorrt_llm/_torch/auto_deploy/llm.py (1)
85-90
: Factory wiring and processor creation look good. Lazy factory init, tokenizer delegation, and ADInputProcessor creation are consistent with the new factory flow.
Also applies to: 105-107, 144-147
tensorrt_llm/_torch/auto_deploy/models/hf.py (2)
156-163
: Device handling after init_empty_weights: good separation. post_init() on meta and .to(device) otherwise; this prevents silent dtype/device mismatches. LGTM.
190-195
: Tokenizer init path is fine. Returns None when unspecified; otherwise respects tokenizer_kwargs. LGTM.
tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py (1)
178-187
: Patch apply/revert logic is correct. Original forward is saved/restored and replacement is scoped to the patch lifetime. LGTM.
Also applies to: 188-194
tensorrt_llm/_torch/auto_deploy/export/export.py (1)
21-22
: Delegating selection to apply_export_patches: good cleanup. Centralizing validation/selection inside interface.apply_export_patches simplifies export.py. LGTM.
Also applies to: 233-235
tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_llama4_vlm_patch.py (2)
1-1
: Add required NVIDIA copyright header. Per coding guidelines, prepend the standard NVIDIA copyright header (current year).
Apply this diff at the top of the file:
+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0

⛔ Skipped due to learnings
Learnt from: galagam PR: NVIDIA/TensorRT-LLM#6487 File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12 Timestamp: 2025-08-06T13:58:07.506Z Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
48-50
: Fix dtype conversion: .to(torch.bfloat16) passes dtype as device. BatchFeature/BatchEncoding expects dtype as a keyword; current call can error or be ignored.
Apply:
- .to(model.device)
- .to(torch.bfloat16)
+ .to(model.device)
+ .to(dtype=torch.bfloat16)

Likely an incorrect or invalid review comment.
8504bbd to c263049 (Compare)
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
tensorrt_llm/_torch/auto_deploy/export/interface.py (1)
1-1
: Add NVIDIA 2025 copyright header (required). Per repository guidelines, prepend the standard NVIDIA header to this source file.
+ # Copyright 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved. + # Redistribution and use in source and binary forms, with or without + # modification, are permitted provided that the following conditions are met: + # * Redistributions of source code must retain the above copyright notice, + # this list of conditions and the following disclaimer. + # * Redistributions in binary form must reproduce the above copyright notice, + # this list of conditions and the following disclaimer in the documentation + # and/or other materials provided with the distribution. + # * Neither the name of NVIDIA CORPORATION nor the names of its contributors + # may be used to endorse or promote products derived from this software + # without specific prior written permission. + # THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS “AS IS” AND ANY EXPRESS + # OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES + # OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN + # NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, + # INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES; HOWEVER + # CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, + # OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (1)
1-1
: Add NVIDIA 2025 copyright header (required). This production Python module must include the standard NVIDIA header.
+ # Copyright 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved. + # Redistribution and use in source and binary forms, with or without + # modification, are permitted provided that the following conditions are met: + # * Redistributions of source code must retain the above copyright notice, + # this list of conditions and the following disclaimer. + # * Redistributions in binary form must reproduce the above copyright notice, + # this list of conditions and the following disclaimer in the documentation + # and/or other materials provided with the distribution. + # * Neither the name of NVIDIA CORPORATION nor the names of its contributors + # may be used to endorse or promote products derived from this software + # without specific prior written permission. + # THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS “AS IS” AND ANY EXPRESS + # OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES + # OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN + # NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, + # INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES; HOWEVER + # CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, + # OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.tensorrt_llm/_torch/auto_deploy/llm_args.py (1)
1-1
: Add NVIDIA 2025 copyright header (required). Please prepend the standard header block to this file.
+ # Copyright 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved. + # Redistribution and use in source and binary forms, with or without + # modification, are permitted provided that the following conditions are met: + # * Redistributions of source code must retain the above copyright notice, + # this list of conditions and the following disclaimer. + # * Redistributions in binary form must reproduce the above copyright notice, + # this list of conditions and the following disclaimer in the documentation + # and/or other materials provided with the distribution. + # * Neither the name of NVIDIA CORPORATION nor the names of its contributors + # may be used to endorse or promote products derived from this software + # without specific prior written permission. + # THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS “AS IS” AND ANY EXPRESS + # OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES + # OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN + # NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, + # INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES; HOWEVER + # CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, + # OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
♻️ Duplicate comments (8)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (2)
543-546
: Enforce uniform seq_len for uncached SDPA to avoid view/shape errors.
_shape_for_forward()
assumes a singlesl
for uncached mode, butnest_sequences()
accepts raggedinput_ids
. Add a uniformity check (or pad) whennot self._is_cached_attn
.- self._store_arg("seq_len", [len(ids) for ids in input_ids], reset=True) + seq_lens = [len(ids) for ids in input_ids] + if not self._is_cached_attn: + assert len(set(seq_lens)) == 1, ( + f"Uncached attention requires uniform seq_len; got {seq_lens}" + ) + self._store_arg("seq_len", seq_lens, reset=True)
616-619
: Fix split dimension in unnest_sequences for uncached vs cached modes.Current logic uses
is_generate
to pick squeeze axis and then splits on dim 0, which breaks uncached SDPA ([B, S, ...]). Branch on_is_cached_attn
and split along the actual sequence dimension.- def unnest_sequences(self, t_nested: torch.Tensor) -> List[torch.Tensor]: - t_squeezed = t_nested.squeeze(1) if self.is_generate else t_nested.squeeze(0) - return list(torch.split(t_squeezed, self.seq_len)) + def unnest_sequences(self, t_nested: torch.Tensor) -> List[torch.Tensor]: + if self._is_cached_attn: + # cached mode: [B,1,...] for decode or [1,S_total,...] for prefill + t_squeezed = t_nested.squeeze(1) if self.is_generate else t_nested.squeeze(0) + return list(torch.split(t_squeezed, self.seq_len, dim=0)) + # uncached SDPA: [B, S, ...] -> split along sequence length (dim=1) + return list(torch.split(t_nested, self.seq_len, dim=1))tensorrt_llm/_torch/auto_deploy/llm.py (6)
75-76
: Tokenizer encode fallback for tiktoken-style implementations.Guard against
TypeError
whenadd_special_tokens
is unsupported.- token_ids = self.tokenizer.encode(inputs["prompt"], **kwargs) + try: + token_ids = self.tokenizer.encode(inputs["prompt"], **kwargs) + except TypeError: + token_ids = self.tokenizer.encode(inputs["prompt"])
1-1
: Add NVIDIA 2025 copyright header (required).Header is missing.
+ # Copyright 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved. + # Redistribution and use in source and binary forms, with or without + # modification, are permitted provided that the following conditions are met: + # * Redistributions of source code must retain the above copyright notice, + # this list of conditions and the following disclaimer. + # * Redistributions in binary form must reproduce the above copyright notice, + # this list of conditions and the following disclaimer in the documentation + # and/or other materials provided with the distribution. + # * Neither the name of NVIDIA CORPORATION nor the names of its contributors + # may be used to endorse or promote products derived from this software + # without specific prior written permission. + # THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS “AS IS” AND ANY EXPRESS + # OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES + # OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN + # NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, + # INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES; HOWEVER + # CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, + # OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE + # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
30-31
: Don't require processor for text-only inputs.Only enforce
processor
whenmessages
are provided; otherwise fall back to tokenizer.- if self.processor is None: - raise ValueError("processor is required to tokenize inputs") + # Only required for chat-style/multimodal inputs + if "messages" in inputs and self.processor is None: + raise ValueError("processor is required to tokenize chat-style inputs")
37-41
: Preserve add_special_tokens when adding truncation options.Using assignment drops previously set keys. Use update().
- if sampling_params.truncate_prompt_tokens is not None: - kwargs = { - "truncation": True, - "max_length": sampling_params.truncate_prompt_tokens, - } + if sampling_params.truncate_prompt_tokens is not None: + kwargs.update({ + "truncation": True, + "max_length": sampling_params.truncate_prompt_tokens, + })
66-74
: Handle 1D vs 2Dinput_ids
from apply_chat_template.Some processors return
[S]
instead of[1, S]
. Avoid the strict assert and normalize shape.- token_ids = all_args.pop("input_ids") - assert token_ids.shape[0] == 1, "messages should be unbatched at this point." + token_ids = all_args.pop("input_ids") + if token_ids.ndim == 2: + token_ids = token_ids[0] + elif token_ids.ndim != 1: + raise ValueError(f"Unexpected token_ids shape: {tuple(token_ids.shape)}") @@ - return token_ids[0].tolist(), extra_processed_inputs + return token_ids.tolist(), extra_processed_inputs
125-129
: Broaden tokenizer assertion to TokenizerBase (or remove assert).Restricting to
TransformersTokenizer
blocks validTokenizerBase
implementations (e.g., tiktoken wrapper).- assert isinstance(self.input_processor, DefaultInputProcessor) - assert self.tokenizer is None or isinstance(self.tokenizer, TransformersTokenizer) + assert isinstance(self.input_processor, DefaultInputProcessor) + assert self.tokenizer is None or isinstance(self.tokenizer, TokenizerBase)
🧹 Nitpick comments (3)
tensorrt_llm/_torch/auto_deploy/export/interface.py (2)
217-237
: Document new parameters and behaviors in the docstring.
apply_export_patches()
now acceptspatch_list
and mutually excludes it withpatch_configs
, and defaults to “all registered”. Update the docstring to reflect args, mutual exclusion, defaults, and error cases.def apply_export_patches( patch_configs: Optional[Dict[str, Union[ExportPatchConfig, Dict[str, Any]]]] = None, patch_list: Optional[List[str]] = None, ): - """Context manager to apply multiple patches. - - Args: - patch_configs: Dict mapping patch names to their configurations. - """ + """Context manager to apply multiple patches. + + Args: + patch_configs: Dict mapping patch name -> config (dict or ExportPatchConfig). + patch_list: List of patch names to apply with default configs. + Raises: + ValueError: If both `patch_configs` and `patch_list` are provided. + Behavior: + - If `patch_list` is provided, it is converted to default `patch_configs`. + - If neither is provided, all registered patches are applied with defaults. + """
239-261
: Preserve deterministic apply order and validate early.Order follows dict insertion. If
patch_configs
comes from an external source, consider using anOrderedDict
or sorting for reproducibility, and validate unknown names up front for a clearer error.- patches = [ExportPatchRegistry.create_patch(k, conf) for k, conf in patch_configs.items()] + # Optional: keep deterministic order if caller provided a plain dict on <3.7 or for reproducibility + items = list(patch_configs.items()) + # Early validation for clearer error location + for name, _ in items: + ExportPatchRegistry.get(name) + patches = [ExportPatchRegistry.create_patch(k, conf) for k, conf in items]tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (1)
180-201
: Guard _shape_for_forward against ragged uncached inputs.Given the
view(bs, sl, ...)
on uncached paths, ensuresl
is consistent across batch or derivesl
from the actual tensor shape in uncached mode. This ties to the uniformity check above.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (24)
examples/auto_deploy/.gitignore
(1 hunks)examples/auto_deploy/build_and_run_ad.py
(3 hunks)tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
(7 hunks)tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py
(2 hunks)tensorrt_llm/_torch/auto_deploy/export/export.py
(2 hunks)tensorrt_llm/_torch/auto_deploy/export/interface.py
(3 hunks)tensorrt_llm/_torch/auto_deploy/llm.py
(4 hunks)tensorrt_llm/_torch/auto_deploy/llm_args.py
(1 hunks)tensorrt_llm/_torch/auto_deploy/models/factory.py
(3 hunks)tensorrt_llm/_torch/auto_deploy/models/hf.py
(5 hunks)tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py
(5 hunks)tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
(5 hunks)tensorrt_llm/_torch/auto_deploy/shim/demollm.py
(8 hunks)tensorrt_llm/_torch/auto_deploy/transform/library/export_to_gm.py
(1 hunks)tensorrt_llm/_torch/auto_deploy/transformations/library/kvcache.py
(2 hunks)tensorrt_llm/executor/worker.py
(1 hunks)tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py
(2 hunks)tests/unittest/_torch/auto_deploy/_utils_test/_model_test_utils.py
(0 hunks)tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_llama4_vlm_patch.py
(1 hunks)tests/unittest/_torch/auto_deploy/unit/singlegpu/shim/test_engine.py
(0 hunks)tests/unittest/_torch/auto_deploy/unit/singlegpu/shim/test_llm_config.py
(1 hunks)tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_build_small_single.py
(1 hunks)tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py
(0 hunks)tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py
(5 hunks)
💤 Files with no reviewable changes (3)
- tests/unittest/_torch/auto_deploy/unit/singlegpu/shim/test_engine.py
- tests/unittest/_torch/auto_deploy/_utils_test/_model_test_utils.py
- tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py
🚧 Files skipped from review as they are similar to previous changes (17)
- examples/auto_deploy/.gitignore
- tensorrt_llm/_torch/auto_deploy/transform/library/export_to_gm.py
- tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_build_small_single.py
- tensorrt_llm/_torch/auto_deploy/transformations/library/kvcache.py
- tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_llama4_vlm_patch.py
- tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py
- examples/auto_deploy/build_and_run_ad.py
- tensorrt_llm/executor/worker.py
- tensorrt_llm/_torch/auto_deploy/models/hf.py
- tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
- tensorrt_llm/_torch/auto_deploy/models/factory.py
- tensorrt_llm/_torch/auto_deploy/export/export.py
- tensorrt_llm/_torch/auto_deploy/shim/demollm.py
- tests/unittest/_torch/auto_deploy/unit/singlegpu/shim/test_llm_config.py
- tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py
- tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py
- tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{cpp,cc,cxx,cu,py,h,hpp,hh,hxx,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use spaces only; no tabs; indent by 4 spaces
Files:
tensorrt_llm/_torch/auto_deploy/llm_args.py
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
tensorrt_llm/_torch/auto_deploy/llm.py
tensorrt_llm/_torch/auto_deploy/export/interface.py
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Code must target Python 3.8+
Indent with 4 spaces; no tabs
Keep module namespace on import; import the module, not individual names; use module.symbol
Python filenames use snake_case (e.g., some_file.py)
Class names use PascalCase
Function and method names use snake_case
Local variables use snake_case; if starting with a number, prefix with k_ (e.g., k_99th_percentile)
Global variables use UPPER_SNAKE with G_ prefix (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE
Avoid shadowing variables from outer scopes
Initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with trailing docstrings
Avoid reflection when simple alternatives exist (e.g., avoid dict(**locals()) patterns)
Limit except clauses to specific exceptions; avoid bare except
When duck-typing with try/except, keep try body minimal and use else for logic
Files:
tensorrt_llm/_torch/auto_deploy/llm_args.py
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
tensorrt_llm/_torch/auto_deploy/llm.py
tensorrt_llm/_torch/auto_deploy/export/interface.py
**/*.{cpp,cc,cxx,cu,h,hpp,hh,hxx,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tensorrt_llm/_torch/auto_deploy/llm_args.py
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
tensorrt_llm/_torch/auto_deploy/llm.py
tensorrt_llm/_torch/auto_deploy/export/interface.py
🧠 Learnings (1)
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Applied to files:
tensorrt_llm/_torch/auto_deploy/llm.py
🧬 Code graph analysis (4)
tensorrt_llm/_torch/auto_deploy/llm_args.py (1)
tensorrt_llm/llmapi/llm_args.py (1)
Field
(67-94)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (2)
tensorrt_llm/_torch/auto_deploy/shim/interface.py (3)
args
(24-26)dynamic_shapes
(29-31)to
(33-37)tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py (1)
nest_sequences
(67-79)
tensorrt_llm/_torch/auto_deploy/llm.py (5)
tensorrt_llm/inputs/registry.py (1)
DefaultInputProcessor
(44-111)tensorrt_llm/_torch/auto_deploy/models/factory.py (5)
tokenizer
(48-50)ModelFactory
(15-243)init_tokenizer
(107-114)init_processor
(116-123)prefetch_checkpoint
(125-136)tensorrt_llm/llmapi/tokenizer.py (3)
TokenizerBase
(24-25)TransformersTokenizer
(28-267)tokenizer_factory
(270-290)tensorrt_llm/sampling_params.py (1)
SamplingParams
(125-486)tensorrt_llm/_torch/auto_deploy/llm_args.py (1)
create_factory
(204-215)
tensorrt_llm/_torch/auto_deploy/export/interface.py (2)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (1)
has
(833-834)tensorrt_llm/_torch/auto_deploy/models/factory.py (1)
has
(263-264)
🔇 Additional comments (2)
tensorrt_llm/_torch/auto_deploy/export/interface.py (1)
186-188
: Changing to ValueError is a breaking API change; confirm callers/tests.
ExportPatchRegistry.get()
now raisesValueError
for unknown patches. Verify downstream code/tests expectingKeyError
or assertions are updated accordingly.tensorrt_llm/_torch/auto_deploy/llm.py (1)
105-111
: Factory init: propagate tokenizer None safely.
tokenizer_factory(self.factory.init_tokenizer())
handles None, LGTM.
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (6)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (2)
1-10
: Add NVIDIA SPDX header at file top. Required by repo guidelines. Place SPDX lines before the module docstring.
Apply:
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 """ Attention Interface to handle various attention operators and cache operations.
399-413
: Bug: slice index is a tensor; must convert to int.
num_seq = torch.sum(seq_len > 0)
yields a 0-d tensor. Slicing with it raises TypeError.Apply:
- if s > 1: - num_seq = torch.sum(seq_len > 0) - assert seq_len[num_seq:].sum() == 0, "seq_len should be zero-padded" + if s > 1: + num_seq_t = torch.count_nonzero(seq_len) + num_seq = int(num_seq_t.item()) + assert int(seq_len[num_seq:].sum().item()) == 0, "seq_len should be zero-padded"tensorrt_llm/_torch/auto_deploy/models/hf.py (3)
1-2
: Add NVIDIA SPDX header.Apply:
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 """Interface to initialize and load HF models."""
82-85
: Python 3.8+: replaceQuantConfigReader | None
.Apply:
- self._quant_config_reader: QuantConfigReader | None = None + self._quant_config_reader: Optional[QuantConfigReader] = NoneEnsure
Optional
is imported (it is).
239-241
: Avoid bare except; narrow to HF hub exceptions.Catching everything masks real errors.
Apply:
- except: # noqa: E722 - pass + except Exception: + # Swallow hub inspection errors only; proceed conservatively. + passPrefer specific HF exceptions if available in your dependency set.
examples/auto_deploy/build_and_run_ad.py (1)
1-1
: Missing required NVIDIA 2025 Apache-2.0 header.Insert at file top:
# Copyright (c) 2025, NVIDIA CORPORATION. # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # http://www.apache.org/licenses/LICENSE-2.0 # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License.
♻️ Duplicate comments (7)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (2)
495-505
: Python 3.8+ compatibility: replace PEP 604 unions.
List[int | float]
breaks on 3.8. UseList[Union[int, float]]
.Apply:
- tnsr_like: List[int | float], + tnsr_like: List[Union[int, float]],Imports already include Union.
616-619
: Make unnest_sequences robust across cached/uncached.Current logic assumes cached shapes and mis-splits SDPA tensors [B, S, …].
Apply:
- def unnest_sequences(self, t_nested: torch.Tensor) -> List[torch.Tensor]: - t_squeezed = t_nested.squeeze(1) if self.is_generate else t_nested.squeeze(0) - return list(torch.split(t_squeezed, self.seq_len)) + def unnest_sequences(self, t_nested: torch.Tensor) -> List[torch.Tensor]: + if self._is_cached_attn: + t_squeezed = t_nested.squeeze(1) if self.is_generate else t_nested.squeeze(0) + return list(torch.split(t_squeezed, self.seq_len, dim=0)) + # uncached SDPA: [B, S, ...] + return list(torch.split(t_nested, self.seq_len, dim=1))tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (1)
1-1
: Add NVIDIA SPDX header.Apply:
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 from collections import defaultdicttensorrt_llm/_torch/auto_deploy/models/hf.py (1)
473-475
: Make placeholder pixel_values shape configurable (not hardcoded 336x336).Apply:
- none_pixel_values = torch.zeros(0, 3, 336, 336) + # Derive sensible defaults from config/processor when available + channels = getattr(getattr(self, "model_config", None), "num_channels", 3) + default_size = getattr(getattr(self, "model_config", None), "image_size", 336) + none_pixel_values = torch.zeros(0, channels, default_size, default_size)Alternatively, read from
processor.image_processor
if available.examples/auto_deploy/build_and_run_ad.py (1)
75-79
: Guard against empty queries; clarify/clean up replication logic.
len(queries) == 0
raises ZeroDivisionError.- Past feedback questioned repeating prompts to fill batch. If intentional, keep but make explicit and ceiling the repeat count.
- queries = queries * (batch_size // len(queries) + 1) - queries = queries[:batch_size] + if not queries: + raise ValueError("PromptConfig.queries must not be empty.") + # repeat queries to cover batch_size, then trim + repeats = (batch_size + len(queries) - 1) // len(queries) # ceil(batch_size/len) + queries = (queries * repeats)[:batch_size]If replication is optional, consider a flag (e.g.,
repeat_to_batch: bool = True
) with docstring.tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py (2)
110-110
: Consider more robust image detection logic.Using
torch.any(pixel_values != 0)
to detect images may fail for edge cases where valid images have all-zero pixels (e.g., completely black images). The previous review suggested checking the shape instead.Consider using shape-based detection:
# decide by whether there is any non-zero pixel_values - has_image: torch.Tensor = torch.any(pixel_values != 0) + # Check if pixel_values has actual image data (batch dimension > 0) + has_image: torch.Tensor = pixel_values.shape[0] > 0
1-11
: Add copyright headerAll source files should include the NVIDIA copyright header as per coding guidelines.
Add the copyright header at the beginning of the file:
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. """A patch to handle vision branch in Llama4ForConditionalGeneration.""" from typing import List, Optional, Tuple, Union
🧹 Nitpick comments (7)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (2)
603-615
: Ensure dtype match in masked_scatter path.
packed_input_ids
may be Long; destinationinput_ids
is Long after earlier fix, but keep this robust.Apply:
- input_ids_device.masked_scatter_(input_ids_device == scatter_ref, packed_input_ids) + input_ids_device.masked_scatter_( + input_ids_device == scatter_ref, packed_input_ids.to(input_ids_device.dtype) + )
86-92
: Constructor args shadow public members; consider trailing underscores per guideline.Parameters like
max_seq_len
→self.max_seq_len
violate the repo rule.Apply (API-touching; defer if churn is high):
- def __init__(self, max_seq_len: int = 1, max_batch_size: int = 1, page_size: int = 0, max_num_tokens: Optional[int] = None): + def __init__(self, max_seq_len_: int = 1, max_batch_size_: int = 1, page_size_: int = 0, max_num_tokens_: Optional[int] = None): @@ - self.max_seq_len = max_seq_len - self.max_batch_size = max_batch_size - self.page_size = page_size if page_size > 0 else max_seq_len + self.max_seq_len = max_seq_len_ + self.max_batch_size = max_batch_size_ + self.page_size = page_size_ if page_size_ > 0 else max_seq_len_ @@ - if max_num_tokens is None or max_num_tokens < 1: + if max_num_tokens_ is None or max_num_tokens_ < 1: self.max_num_tokens = self.max_batch_size * max_seq_len_adjusted else: - self.max_num_tokens = max_num_tokens + self.max_num_tokens = max_num_tokens_Propagate call-site updates accordingly.
Also applies to: 113-118
tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (1)
65-69
: Fix logger usage: message formatting drops num_pages.Pass a single formatted string.
Apply:
- ad_logger.info("Using fake cache manager with head_dim=0 and num pages:", self.num_blocks) + ad_logger.info(f"Using fake cache manager with head_dim=0 and num pages: {self.num_blocks}")tensorrt_llm/_torch/auto_deploy/models/hf.py (2)
439-454
: Reuse initialized processor/tokenizer to avoid duplicate downloads.Minor: prefer
processor = self.init_processor()
to honor overrides and caching.Apply:
- processor = AutoProcessor.from_pretrained(self.tokenizer, **self.tokenizer_kwargs) + processor = self.init_processor() + assert processor is not None, "Processor not initialized"
111-118
: Bound-method call style is intentional; add a brief comment to preempt confusion.
type(model).forward(model, …)
bypasses potential overrides on the instance.Apply:
- return type(model).forward(model, input_ids=input_ids, position_ids=position_ids) + # Call the class's bound forward to avoid instance-level patches interfering with export. + return type(model).forward(model, input_ids=input_ids, position_ids=position_ids)examples/auto_deploy/build_and_run_ad.py (2)
29-31
: Tighten typing for PromptInput.Use explicit key/value typing:
-PromptInput = Union[str, Dict, List[Dict]] +PromptInput = Union[str, Dict[str, Any], List[Dict[str, Any]]]
41-62
: Polish example prompts (typos/whitespace).
- Remove trailing spaces to avoid accidental token bias.
- Fix grammar: “What is the capital of Iceland?”
- "How big is the universe? ", + "How big is the universe?", @@ - {"prompt": "In simple words and a single sentence, explain the concept of gravity: "}, + {"prompt": "In simple words and a single sentence, explain the concept of gravity:"}, @@ - {"prompt": "Where is the capital of Iceland? "}, + {"prompt": "What is the capital of Iceland?"},
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (12)
- examples/auto_deploy/.gitignore (1 hunks)
- examples/auto_deploy/build_and_run_ad.py (3 hunks)
- tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (7 hunks)
- tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py (2 hunks)
- tensorrt_llm/_torch/auto_deploy/export/export.py (2 hunks)
- tensorrt_llm/_torch/auto_deploy/export/interface.py (3 hunks)
- tensorrt_llm/_torch/auto_deploy/llm.py (4 hunks)
- tensorrt_llm/_torch/auto_deploy/llm_args.py (1 hunks)
- tensorrt_llm/_torch/auto_deploy/models/factory.py (3 hunks)
- tensorrt_llm/_torch/auto_deploy/models/hf.py (5 hunks)
- tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py (5 hunks)
- tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (5 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
- tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py
- tensorrt_llm/_torch/auto_deploy/llm_args.py
- tensorrt_llm/_torch/auto_deploy/llm.py
- tensorrt_llm/_torch/auto_deploy/models/factory.py
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cc,cxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{h,hpp,hh,hxx,cpp,cc,cxx,cu,cuh,py}
: If a constructor parameter name conflicts with a public member, add a trailing underscore to the parameter (e.g., foo_).
Use uppercase literal suffixes (e.g., 1234L not 1234l).
Use spaces, not tabs; indent by 4 spaces.
Files:
tensorrt_llm/_torch/auto_deploy/export/interface.py
tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
examples/auto_deploy/build_and_run_ad.py
tensorrt_llm/_torch/auto_deploy/export/export.py
tensorrt_llm/_torch/auto_deploy/models/hf.py
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Target Python 3.8+ for all Python code.
Indent with 4 spaces; do not use tabs.
Maintain module namespace on imports; import the module/submodule, not individual classes/functions (e.g., from package.subpackage import foo; foo.SomeClass()).
Python filenames use snake_case (e.g., some_file.py).
Class names use PascalCase.
Function and method names use snake_case.
Local variable names use snake_case; if starting with a number, prefix with k_ (e.g., k_99th_percentile).
Global variables use UPPER_SNAKE_CASE and prefix G_ (e.g., G_MY_GLOBAL).
Constants use UPPER_SNAKE_CASE.
Avoid shadowing outer-scope variables.
Initialize all externally visible members of a class in init.
Prefer docstrings for interfaces used outside a file; use comments for local code within functions or local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline with docstrings placed under the definition.
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns).
Limit except clauses to specific exceptions; avoid bare except.
For duck-typing try/except, keep try blocks minimal and use else for the main logic.
Files:
tensorrt_llm/_torch/auto_deploy/export/interface.py
tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
examples/auto_deploy/build_and_run_ad.py
tensorrt_llm/_torch/auto_deploy/export/export.py
tensorrt_llm/_torch/auto_deploy/models/hf.py
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py
**/*.{cpp,cc,cxx,cu,h,hpp,hh,hxx,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA 2025 Apache-2.0 copyright header block at the top of all source files (.cpp, .h, .cu, .py).
Files:
tensorrt_llm/_torch/auto_deploy/export/interface.py
tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
examples/auto_deploy/build_and_run_ad.py
tensorrt_llm/_torch/auto_deploy/export/export.py
tensorrt_llm/_torch/auto_deploy/models/hf.py
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py
🧬 Code graph analysis (7)
tensorrt_llm/_torch/auto_deploy/export/interface.py (2)
- tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (1): has (833-834)
- tensorrt_llm/_torch/auto_deploy/models/factory.py (1): has (277-278)
tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (6)
- tensorrt_llm/_torch/auto_deploy/llm.py (1): factory (86-89)
- tensorrt_llm/_torch/auto_deploy/llm_args.py (1): create_factory (215-226)
- tensorrt_llm/_torch/auto_deploy/models/factory.py (1): get_extra_inputs (244-257)
- tensorrt_llm/_torch/auto_deploy/models/hf.py (1): get_extra_inputs (455-474)
- tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (5): add_extra_arg (620-643), page_assignments (309-311), input_pos (271-272), nest_sequences (521-588), rescatter_input_ids (590-614)
- tensorrt_llm/_torch/pyexecutor/resource_manager.py (1): get_cache_indices (572-583)
examples/auto_deploy/build_and_run_ad.py (1)
- tensorrt_llm/llmapi/llm_args.py (1): Field (69-96)
tensorrt_llm/_torch/auto_deploy/export/export.py (2)
- tensorrt_llm/_torch/auto_deploy/export/interface.py (1): apply_export_patches (217-260)
- tensorrt_llm/_torch/auto_deploy/transformations/_graph.py (1): lift_to_meta (76-89)
tensorrt_llm/_torch/auto_deploy/models/hf.py (2)
- tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (3): CacheConfig (26-29), device (177-178), to (431-438)
- tensorrt_llm/_torch/auto_deploy/models/factory.py (6): model (53-55), tokenizer (58-60), init_tokenizer (121-128), init_processor (130-137), get_example_inputs (232-242), get_extra_inputs (244-257)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (2)
- tensorrt_llm/_torch/auto_deploy/shim/interface.py (3): args (24-26), dynamic_shapes (29-31), to (33-37)
- tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py (1): nest_sequences (67-79)
tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py (2)
- tensorrt_llm/_torch/models/modeling_llama.py (1): Llama4ForConditionalGeneration (1179-1240)
- tensorrt_llm/_torch/auto_deploy/export/interface.py (7): BaseExportPatch (38-130), ExportPatchRegistry (166-213), register (172-181), _apply_patch (123-125), _apply_patch (154-157), _revert_patch (128-130), _revert_patch (159-163)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (8)
tensorrt_llm/_torch/auto_deploy/export/export.py (2)
- 21-21: LGTM! Clean separation of concerns. The import change aligns well with the refactored patch selection logic now being handled inside apply_export_patches.
- 234-234: LGTM! Improved API flexibility. The change to pass both patch_configs and patch_list to apply_export_patches provides better flexibility for patch configuration while maintaining backward compatibility.
tensorrt_llm/_torch/auto_deploy/export/interface.py (3)
- 8-8: LGTM! Necessary import for new optional parameters. The addition of the Optional import supports the new optional parameters in the refactored API.
- 186-188: Good error handling improvement. Raising a descriptive ValueError for unknown patches improves debugging and prevents silent failures.
- 217-260: Well-designed patch configuration logic. The refactored apply_export_patches function provides excellent flexibility:
  - Clear mutual exclusivity validation
  - Convenient patch_list to patch_configs conversion
  - Sensible default behavior when no patches are specified
  - Clean early exit when no patches to apply
  The implementation is robust and user-friendly; a short usage sketch follows below.
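To make the call patterns described above concrete, here is a minimal sketch. It assumes apply_export_patches is usable as a context manager (as the export flow suggests), that per-patch configs are plain dicts keyed by patch name, and it uses a placeholder patch name that is not from this PR:

```python
from tensorrt_llm._torch.auto_deploy.export.interface import apply_export_patches

# "example_patch" is a placeholder; substitute a registered patch name,
# since unknown names now raise ValueError.
with apply_export_patches(patch_list=["example_patch"]):
    pass  # run torch.export / the graph transforms under the patched context

# Equivalent explicit form; patch_configs and patch_list are mutually exclusive.
with apply_export_patches(patch_configs={"example_patch": {}}):
    pass
```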
tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py (3)
- 79-81: LGTM! Correct device and dtype handling. The conversion of projected vision features to match the embeddings' device and dtype ensures compatibility and prevents runtime errors.
- 137-149: Good handling of attention mask in loss calculation. The code correctly handles the 2D attention mask for shifted logits/labels calculation, with proper cropping for cases like PrefixTuning. The implementation is robust and well-documented.
- 170-194: Well-structured patch implementation. The Llama4VisionPatch class follows the patch interface correctly (a minimal sketch of this pattern follows below):
  - Properly registered with the ExportPatchRegistry
  - Clean separation between apply and revert operations
  - Stores original values for safe restoration
  - Clear documentation of the patch's purpose
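For readers unfamiliar with the patch interface, here is a minimal, self-contained sketch of the register/apply/revert pattern the reviewer describes. The registry name, the decorator-style use of ExportPatchRegistry.register, and the toy _TinyModule target are assumptions for illustration, not code from this PR:

```python
from tensorrt_llm._torch.auto_deploy.export.interface import (
    BaseExportPatch,
    ExportPatchRegistry,
)


class _TinyModule:
    """Stand-in for the real model class a patch would target."""

    def forward(self, x):
        return x + 1


def _export_friendly_forward(self, x):
    # Simplified, branch-free forward used only while exporting.
    return x + 1


@ExportPatchRegistry.register("tiny_module_forward")  # illustrative name
class TinyModuleForwardPatch(BaseExportPatch):
    """Swap _TinyModule.forward for an export-friendly version, then restore it."""

    def _apply_patch(self):
        # Keep the original so _revert_patch can restore it safely.
        self._saved_forward = _TinyModule.forward
        _TinyModule.forward = _export_friendly_forward

    def _revert_patch(self):
        _TinyModule.forward = self._saved_forward
```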
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
tensorrt_llm/_torch/auto_deploy/shim/demollm.py (2)
53-83
: Make page assignment deterministic and handle page exhaustion.Set iteration order and guard when free pages are insufficient.
- free_pages = set(range(si.num_pages)) - {i for pages in page_assignments for i in pages} + free_pages = sorted(set(range(si.num_pages)) - {i for pages in page_assignments for i in pages}) updated_assignments = [] for t_l, pages in zip(total_lens, page_assignments): extra_tokens = t_l - len(pages) * si.page_size - num_extra_pages = (extra_tokens // si.page_size) + (extra_tokens > 0) - updated_assignments.append(pages + [free_pages.pop() for _ in range(num_extra_pages)]) + num_extra_pages = max(0, (extra_tokens // si.page_size) + (1 if extra_tokens > 0 else 0)) + if num_extra_pages > len(free_pages): + raise RuntimeError( + f"Insufficient free pages: need {num_extra_pages}, have {len(free_pages)}" + ) + take = [free_pages.pop(0) for _ in range(num_extra_pages)] + updated_assignments.append(pages + take) return updated_assignments
92-97
: Replace asserts with explicit errors for API constraints.Asserts can be optimized out; raise exceptions.
- assert all(r.sampling_params == sampling_params for r in requests), ( - "Heterogeneous sampling params are not supported." - ) - assert sampling_params.best_of == 1, "Best-of is not supported." + if not all(r.sampling_params == sampling_params for r in requests): + raise ValueError("Heterogeneous sampling params are not supported.") + if sampling_params.best_of != 1: + raise NotImplementedError("Best-of is not supported.")tensorrt_llm/_torch/auto_deploy/models/hf.py (2)
1-1
: Add NVIDIA 2025 Apache-2.0 header.Required for all production sources.
+ # Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved. + # + # Licensed under the Apache License, Version 2.0 (the "License"); + # you may not use this file except in compliance with the License. + # You may obtain a copy of the License at + # http://www.apache.org/licenses/LICENSE-2.0 + # + # Unless required by applicable law or agreed to in writing, software + # distributed under the License is distributed on an "AS IS" BASIS, + # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + # See the License for the specific language governing permissions and + # limitations under the License. + """Interface to initialize and load HF models."""
82-85
: Python 3.8+ compatibility: avoid PEP 604 unions.Use Optional[...] instead of QuantConfigReader | None.
- self._quant_config_reader: QuantConfigReader | None = None + from typing import Optional + self._quant_config_reader: Optional[QuantConfigReader] = None
♻️ Duplicate comments (13)
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py (1)
208-216
: Test 3 still resets before continuation, breaking cache semantics.Reset wipes sequence context (seq_len/input_pos/pages), so equality with full-context y_model is not guaranteed. Continue without reset or re-prime the cache before comparing only the tail.
Option A (no reset):
- # Continue inference from previous context - cm.info.reset() for i_p in range(x.shape[1] - num_reset_steps, x.shape[1]): y_with_cache[:, i_p : i_p + 1] = _call_and_unnest(x[:, i_p : i_p + 1], i_p) assert all_close(y_model, y_with_cache, atol=atol, rtol=rtol)Option B (reset + re-prime, compare tail):
- cm.info.reset() - for i_p in range(x.shape[1] - num_reset_steps, x.shape[1]): - y_with_cache[:, i_p : i_p + 1] = _call_and_unnest(x[:, i_p : i_p + 1], i_p) - assert all_close(y_model, y_with_cache, atol=atol, rtol=rtol) + cm.info.reset() + for i_p in range(0, x.shape[1] - num_reset_steps): + _call_and_unnest(x[:, i_p : i_p + 1], i_p) + for i_p in range(x.shape[1] - num_reset_steps, x.shape[1]): + y_with_cache[:, i_p : i_p + 1] = _call_and_unnest(x[:, i_p : i_p + 1], i_p) + assert all_close( + y_model[:, -num_reset_steps:], y_with_cache[:, -num_reset_steps:], atol=atol, rtol=rtol + )tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (4)
148-157
: Use torch.long for indices.input_ids and _gather_idx participate in indexing/gather; int32 can break on some backends. Switch to long.
- "input_ids": torch.ones(self.max_num_tokens, dtype=torch.int), + "input_ids": torch.ones(self.max_num_tokens, dtype=torch.long), @@ - "_gather_idx": torch.empty(self.max_num_tokens, dtype=torch.int), + "_gather_idx": torch.empty(self.max_num_tokens, dtype=torch.long),
543-552
: Zero tail for input_pos and enforce uniform S in uncached mode.
- Stale device-tail values may persist when batch shrinks; pass reset=True.
- Uncached path relies on rectangular [B,S,...]; assert uniform seq_len to prevent view() errors.
- self._store_arg("seq_len", [len(ids) for ids in input_ids], reset=True) + seq_lens = [len(ids) for ids in input_ids] + if not self._is_cached_attn: + assert len(set(seq_lens)) == 1, f"Uncached attention requires uniform seq_len; got {seq_lens}" + self._store_arg("seq_len", seq_lens, reset=True) @@ - self._store_arg( + self._store_arg( "input_pos", [input_pos] * self.num_sequences if isinstance(input_pos, int) else input_pos, - ) + reset=True, + )
616-619
: unnest_sequences splits along the wrong dim in uncached mode.Current logic assumes cached shapes. Make it mode-aware to avoid mis-splitting prefill outputs.
- def unnest_sequences(self, t_nested: torch.Tensor) -> List[torch.Tensor]: - t_squeezed = t_nested.squeeze(1) if self.is_generate else t_nested.squeeze(0) - return list(torch.split(t_squeezed, self.seq_len)) + def unnest_sequences(self, t_nested: torch.Tensor) -> List[torch.Tensor]: + if self._is_cached_attn: + t_squeezed = t_nested.squeeze(1) if self.is_generate else t_nested.squeeze(0) + return list(torch.split(t_squeezed, self.seq_len, dim=0)) + # uncached SDPA: [B, S, ...] -> split along sequence length + return list(torch.split(t_nested, self.seq_len, dim=1))
495-500
: PEP 604 unions are Py3.10+; target is 3.8+.Replace List[int | float] with List[Union[int, float]].
- tnsr_like: List[int | float], + tnsr_like: List[Union[int, float]],tensorrt_llm/_torch/auto_deploy/llm_args.py (1)
60-63
: Validate model_factory against the registry to catch typos early.Type widened to str; add a Pydantic validator to fail fast with a clear message.
class AutoDeployConfig(DynamicYamlMixInForSettings, BaseSettings): @@ model_factory: str = Field( default="AutoModelForCausalLM", description="The model factory to use for loading the model.", ) + @field_validator("model_factory") + @classmethod + def _validate_model_factory(cls, v: str): + if not ModelFactoryRegistry.has(v): + available = ", ".join(ModelFactoryRegistry._registry.keys()) + raise ValueError(f"Unknown model_factory '{v}'. Available: {available}") + return vtensorrt_llm/_torch/auto_deploy/llm.py (6)
1-1
: Add NVIDIA 2025 Apache-2.0 header.Per coding guidelines, prepend the NVIDIA 2025 Apache-2.0 header above imports.
Apply:
+ # Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved. + # + # Licensed under the Apache License, Version 2.0 (the "License"); + # you may not use this file except in compliance with the License. + # You may obtain a copy of the License at + # http://www.apache.org/licenses/LICENSE-2.0 + # + # Unless required by applicable law or agreed to in writing, software + # distributed under the License is distributed on an "AS IS" BASIS, + # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + # See the License for the specific language governing permissions and + # limitations under the License. + import types
30-31
: Don’t require a processor for text-only inputs.Gate the processor check behind the messages path.
- if self.processor is None: - raise ValueError("processor is required to tokenize inputs") + # Only required for chat-style/multimodal inputs- if "messages" in inputs: + if "messages" in inputs: + if self.processor is None: + raise ValueError("processor is required to tokenize chat-style inputs")Also applies to: 43-44
33-41
: Preserve add_special_tokens when adding truncation params.Don’t clobber kwargs.
kwargs = { "add_special_tokens": sampling_params.add_special_tokens, } if sampling_params.truncate_prompt_tokens is not None: - kwargs = { - "truncation": True, - "max_length": sampling_params.truncate_prompt_tokens, - } + kwargs.update({ + "truncation": True, + "max_length": sampling_params.truncate_prompt_tokens, + })
67-74
: Handle both [S] and [1, S] from apply_chat_template; avoid assert.Avoid assert (can be optimized-away) and support 1D outputs.
- token_ids = all_args.pop("input_ids") - assert token_ids.shape[0] == 1, "messages should be unbatched at this point." + token_ids = all_args.pop("input_ids") + # Accept either [1, S] or [S] + if token_ids.ndim == 2: + token_ids = token_ids[0] + elif token_ids.ndim != 1: + raise ValueError(f"Unexpected token_ids shape: {tuple(token_ids.shape)}") if all_args: extra_processed_inputs = {"multimodal_data": all_args} else: extra_processed_inputs = None - return token_ids[0].tolist(), extra_processed_inputs + return token_ids.tolist(), extra_processed_inputs
74-76
: Text-only path: add tiktoken fallback and support optional query.Replicate DefaultInputProcessor behavior.
- else: - token_ids = self.tokenizer.encode(inputs["prompt"], **kwargs) - return token_ids, None + else: + toktoken_special_tokens = { + "<|startoftext|>", "<|endoftext|>", "<|reserved_200000|>", "<|reserved_200001|>", + "<|return|>", "<|constrain|>", "<|reserved_200004|>", "<|channel|>", "<|start|>", + "<|end|>", "<|message|>", "<|reserved_200009|>", "<|reserved_200010|>", + "<|reserved_200011|>", "<|call|>", "<|reserved_200013|>", + } + try: + token_ids = self.tokenizer.encode( + inputs["prompt"], **kwargs + ) + except TypeError: + # tiktoken-style path + token_ids = self.tokenizer.encode( + inputs["prompt"], allowed_special=toktoken_special_tokens + ) + + if "query" in inputs: + try: + query_token_ids = self.tokenizer.encode( + inputs["query"], **kwargs + ) + except TypeError: + query_token_ids = self.tokenizer.encode( + inputs["query"], allowed_special=toktoken_special_tokens + ) + return token_ids, {"query_token_ids": query_token_ids} + + return token_ids, None
125-128
: Prefer explicit checks over assert; allow any TokenizerBase.Let non-Transformers tokenizers (e.g., tiktoken) pass.
- assert isinstance(self.input_processor, DefaultInputProcessor) - assert self.tokenizer is None or isinstance(self.tokenizer, TransformersTokenizer) - self.input_processor = self._create_input_processor() + if not isinstance(self.input_processor, DefaultInputProcessor): + raise TypeError("input_processor must be DefaultInputProcessor before AD patching") + if self.tokenizer is not None and not isinstance(self.tokenizer, TokenizerBase): + raise TypeError("tokenizer must be a TokenizerBase (or None)") + self.input_processor = self._create_input_processor()tensorrt_llm/_torch/auto_deploy/models/hf.py (1)
473-475
: Derive placeholder pixel_values shape; avoid hardcoded 336x336.Make it configurable or read from config.
- none_pixel_values = torch.zeros(0, 3, 336, 336) + # Derive channels/size from config if available; fall back to (3, 336, 336) + cfg = getattr(getattr(self, "model_config", None), "vision_config", None) + channels = getattr(cfg, "num_channels", 3) if cfg else 3 + size = getattr(cfg, "image_size", 336) if cfg else 336 + none_pixel_values = torch.zeros(0, channels, size, size)
🧹 Nitpick comments (5)
tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py (2)
47-53
: Type/use mismatch in _add_hidden_dim annotation.The method operates on tensors (uses .shape), but is annotated as Sequence[Sequence[Any]]. Tighten the type to accept torch.Tensor (and optionally Sequence) or handle both.
- def _add_hidden_dim(self, input_ids: Sequence[Sequence[Any]]) -> torch.Tensor: - return torch.rand( - *input_ids.shape, + def _add_hidden_dim(self, input_ids: torch.Tensor) -> torch.Tensor: + return torch.rand( + *tuple(input_ids.shape), self.hidden_size, device=self.device, dtype=self.dtype, )
55-66
: Avoid transiently breaking host invariant for input_ids.Assigning a CPU tensor to _args_host["input_ids"] deviates from the “host is list-like” invariant and is immediately overwritten by reset(). Drop the assignment.
- self._args_host["input_ids"] = self._args_device["input_ids"].cpu()
tensorrt_llm/_torch/auto_deploy/llm.py (1)
7-8
: Import style nit.Guidelines prefer module imports over symbols; consider switching to module-level imports in new code.
tensorrt_llm/_torch/auto_deploy/shim/demollm.py (1)
134-143
: Ensure input_ids type matches SequenceInterface expectations.Pass Python lists for portability.
- sequence_info.nest_sequences( - token_ids, + sequence_info.nest_sequences( + token_ids.tolist(), input_pos=input_pos_next, page_assignments=self._assign_pages(total_lens_next), )tensorrt_llm/_torch/auto_deploy/models/hf.py (1)
239-240
: Avoid bare except.Limit to Exception to prevent masking system-exiting exceptions.
- except: # noqa: E722 + except Exception: # noqa: E722 pass
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (24)
- examples/auto_deploy/.gitignore (1 hunks)
- examples/auto_deploy/build_and_run_ad.py (3 hunks)
- tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (7 hunks)
- tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py (2 hunks)
- tensorrt_llm/_torch/auto_deploy/export/export.py (2 hunks)
- tensorrt_llm/_torch/auto_deploy/export/interface.py (3 hunks)
- tensorrt_llm/_torch/auto_deploy/llm.py (4 hunks)
- tensorrt_llm/_torch/auto_deploy/llm_args.py (1 hunks)
- tensorrt_llm/_torch/auto_deploy/models/factory.py (3 hunks)
- tensorrt_llm/_torch/auto_deploy/models/hf.py (5 hunks)
- tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py (5 hunks)
- tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (5 hunks)
- tensorrt_llm/_torch/auto_deploy/shim/demollm.py (8 hunks)
- tensorrt_llm/_torch/auto_deploy/transform/library/export_to_gm.py (1 hunks)
- tensorrt_llm/_torch/auto_deploy/transform/library/kvcache.py (2 hunks)
- tensorrt_llm/executor/worker.py (1 hunks)
- tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py (2 hunks)
- tests/unittest/_torch/auto_deploy/_utils_test/_model_test_utils.py (0 hunks)
- tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_llama4_vlm_patch.py (1 hunks)
- tests/unittest/_torch/auto_deploy/unit/singlegpu/shim/test_engine.py (0 hunks)
- tests/unittest/_torch/auto_deploy/unit/singlegpu/shim/test_llm_config.py (1 hunks)
- tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_build_small_single.py (1 hunks)
- tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py (0 hunks)
- tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py (4 hunks)
💤 Files with no reviewable changes (3)
- tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py
- tests/unittest/_torch/auto_deploy/unit/singlegpu/shim/test_engine.py
- tests/unittest/_torch/auto_deploy/_utils_test/_model_test_utils.py
🚧 Files skipped from review as they are similar to previous changes (13)
- tensorrt_llm/_torch/auto_deploy/models/factory.py
- tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_llama4_vlm_patch.py
- tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_build_small_single.py
- examples/auto_deploy/.gitignore
- tensorrt_llm/_torch/auto_deploy/transform/library/export_to_gm.py
- tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py
- tensorrt_llm/_torch/auto_deploy/export/export.py
- tensorrt_llm/_torch/auto_deploy/export/interface.py
- tests/unittest/_torch/auto_deploy/unit/singlegpu/shim/test_llm_config.py
- tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
- tensorrt_llm/executor/worker.py
- tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py
- examples/auto_deploy/build_and_run_ad.py
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cc,cxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.{h,hpp,hh,hxx,cpp,cc,cxx,cu,cuh,py}
: If a constructor parameter name conflicts with a public member, add a trailing underscore to the parameter (e.g., foo_).
Use uppercase literal suffixes (e.g., 1234L not 1234l).
Use spaces, not tabs; indent by 4 spaces.
Files:
tensorrt_llm/_torch/auto_deploy/llm_args.py
tensorrt_llm/_torch/auto_deploy/transform/library/kvcache.py
tensorrt_llm/_torch/auto_deploy/shim/demollm.py
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py
tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py
tensorrt_llm/_torch/auto_deploy/llm.py
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
tensorrt_llm/_torch/auto_deploy/models/hf.py
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Target Python 3.8+ for all Python code.
Indent with 4 spaces; do not use tabs.
Maintain module namespace on imports; import the module/submodule, not individual classes/functions (e.g., from package.subpackage import foo; foo.SomeClass()).
Python filenames use snake_case (e.g., some_file.py).
Class names use PascalCase.
Function and method names use snake_case.
Local variable names use snake_case; if starting with a number, prefix with k_ (e.g., k_99th_percentile).
Global variables use UPPER_SNAKE_CASE and prefix G_ (e.g., G_MY_GLOBAL).
Constants use UPPER_SNAKE_CASE.
Avoid shadowing outer-scope variables.
Initialize all externally visible members of a class in init.
Prefer docstrings for interfaces used outside a file; use comments for local code within functions or local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline with docstrings placed under the definition.
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns).
Limit except clauses to specific exceptions; avoid bare except.
For duck-typing try/except, keep try blocks minimal and use else for the main logic.
Files:
tensorrt_llm/_torch/auto_deploy/llm_args.py
tensorrt_llm/_torch/auto_deploy/transform/library/kvcache.py
tensorrt_llm/_torch/auto_deploy/shim/demollm.py
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py
tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py
tensorrt_llm/_torch/auto_deploy/llm.py
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
tensorrt_llm/_torch/auto_deploy/models/hf.py
**/*.{cpp,cc,cxx,cu,h,hpp,hh,hxx,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA 2025 Apache-2.0 copyright header block at the top of all source files (.cpp, .h, .cu, .py).
Files:
tensorrt_llm/_torch/auto_deploy/llm_args.py
tensorrt_llm/_torch/auto_deploy/transform/library/kvcache.py
tensorrt_llm/_torch/auto_deploy/shim/demollm.py
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py
tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py
tensorrt_llm/_torch/auto_deploy/llm.py
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
tensorrt_llm/_torch/auto_deploy/models/hf.py
🧠 Learnings (1)
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Applied to files:
tensorrt_llm/_torch/auto_deploy/llm.py
🧬 Code graph analysis (8)
tensorrt_llm/_torch/auto_deploy/llm_args.py (1)
- tensorrt_llm/llmapi/llm_args.py (1): Field (69-96)
tensorrt_llm/_torch/auto_deploy/transform/library/kvcache.py (1)
- tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (5): named_standard_args (223-225), extra_args_for_prepare_metadata (233-235), get_prepare_metadata_op (738-760), args (228-230), set_max_num_tokens_sample (468-473)
tensorrt_llm/_torch/auto_deploy/shim/demollm.py (3)
- tensorrt_llm/executor/request.py (1): GenerationRequest (84-129)
- tensorrt_llm/inputs/multimodal.py (1): MultimodalParams (152-410)
- tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (6): reset (479-485), nest_sequences (521-588), input_pos (271-272), page_assignments (309-311), seq_len (267-268), args (228-230)
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py (3)
- tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (3): input_pos (271-272), nest_sequences (521-588), reset (479-485)
- tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py (1): nest_sequences (67-79)
- tests/unittest/_torch/auto_deploy/_utils_test/_torch_test_utils.py (1): all_close (6-21)
tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py (2)
- tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (4): device (177-178), args (228-230), reset (479-485), nest_sequences (521-588)
- tensorrt_llm/_torch/auto_deploy/shim/interface.py (1): args (24-26)
tensorrt_llm/_torch/auto_deploy/llm.py (5)
- tensorrt_llm/inputs/registry.py (1): DefaultInputProcessor (44-111)
- tensorrt_llm/llmapi/llm.py (4): tokenizer (691-695), tokenizer (698-699), prompt (79-80), LLM (1011-1027)
- tensorrt_llm/_torch/auto_deploy/models/factory.py (4): tokenizer (58-60), ModelFactory (23-257), init_tokenizer (121-128), init_processor (130-137)
- tensorrt_llm/llmapi/tokenizer.py (3): TokenizerBase (24-25), TransformersTokenizer (28-267), tokenizer_factory (270-290)
- tensorrt_llm/_torch/auto_deploy/llm_args.py (1): create_factory (215-226)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (2)
- tensorrt_llm/_torch/auto_deploy/shim/interface.py (3): args (24-26), dynamic_shapes (29-31), to (33-37)
- tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py (1): nest_sequences (67-79)
tensorrt_llm/_torch/auto_deploy/models/hf.py (2)
- tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (3): CacheConfig (26-29), device (177-178), to (431-438)
- tensorrt_llm/_torch/auto_deploy/models/factory.py (6): model (53-55), tokenizer (58-60), init_tokenizer (121-128), init_processor (130-137), get_example_inputs (232-242), get_extra_inputs (244-257)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (7)
tensorrt_llm/_torch/auto_deploy/transform/library/kvcache.py (1)
- 245-245: API rename fix looks good. cm.info.set_max_num_tokens_sample() aligns with the new public API.
tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py (1)
- 67-79: LGTM on tokenizer-free embedding flow. Shimming list-of-ids to embeddings is minimal and contained; super().nest_sequences will receive the correct 3D inputs.
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py (1)
- 56-87: Forward signature change is consistent. Switching to input_ids and passing through position_ids matches transform expectations.
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (1)
- 189-201: Ragged seq_len with uncached attention will .view() incorrectly. sl = self.seq_len[0] assumes uniform S. The assert in nest_sequences above mitigates this; ensure callers use nest_sequences before args shaping. If ragged prefill is intended, add padding instead of view().
  Would you like me to follow up with a padding helper to support ragged prefill safely? (A rough sketch of such a helper follows below.)
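A rough, self-contained sketch of what such a padding helper could look like; this is an assumption, not code from the PR, and the name pad_ragged_sequences and the pad_value default are illustrative:

```python
from typing import List, Tuple

import torch


def pad_ragged_sequences(
    sequences: List[torch.Tensor], pad_value: int = 0
) -> Tuple[torch.Tensor, torch.Tensor]:
    """Right-pad 1-D token tensors to a rectangular [B, S_max] batch.

    Returns the padded batch and the original lengths so downstream code can
    mask or unpad per sequence instead of relying on a uniform seq_len.
    """
    seq_lens = torch.tensor([s.numel() for s in sequences], dtype=torch.long)
    max_len = int(seq_lens.max().item())
    batch = torch.full((len(sequences), max_len), pad_value, dtype=sequences[0].dtype)
    for i, seq in enumerate(sequences):
        batch[i, : seq.numel()] = seq
    return batch, seq_lens


# Example: two ragged prompts become one [2, 4] tensor plus lengths [4, 2].
padded, lengths = pad_ragged_sequences([torch.tensor([1, 2, 3, 4]), torch.tensor([5, 6])])
```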
tensorrt_llm/_torch/auto_deploy/shim/demollm.py (1)
- 112-118: Confirm extra_args lifecycle across decode steps. nest_sequences resets extra inputs when omitted; if multimodal extra tensors are required beyond prefill, they must be passed again each step or cached inside the interface.
  Would you like a follow-up patch to persist extra_args automatically across decode?
tensorrt_llm/_torch/auto_deploy/models/hf.py (2)
- 159-166: Forward build flow: minor improvement. Using model.forward(...) is clearer than type(model).forward(...).
  Do we need type(model).forward to bypass patched wrappers, or can we simplify to model.forward(...) safely here?
- 379-385: Tokenizer via processor: verify None path. If tokenizer resolves to None, init_processor returns None and init_tokenizer returns None; ensure callers handle it (LLM._try_load_tokenizer does). LGTM.
PR_Github #17462 [ run ] completed with state |
Signed-off-by: Lucas Liebenwein <[email protected]>
Signed-off-by: Lucas Liebenwein <[email protected]>
/bot run |
PR_Github #17479 [ run ] triggered by Bot |
PR_Github #17479 [ run ] completed with state |
/bot run |
PR_Github #17541 [ run ] triggered by Bot |
PR_Github #17541 [ run ] completed with state |
/bot run |
PR_Github #17564 [ run ] triggered by Bot |
Signed-off-by: Lucas Liebenwein <[email protected]>
/bot run |
PR_Github #17574 [ run ] triggered by Bot |
PR_Github #17564 [ run ] completed with state |
Looks good for the LLM API part
PR_Github #17574 [ run ] completed with state |
/bot run |
/bot run |
PR_Github #17635 [ run ] triggered by Bot |
PR_Github #17635 [ run ] completed with state |
Summary by CodeRabbit
New Features
Refactor
Tests
Chores
Description
Installation instruction for this branch:
Bench results
AutoDeploy Now (ll/vlm_arch, commit
6c20308a0
)AutoDeploy Before (without llm/vlm_arch, commit
42697ea32
)PyTorch Backend (commit
6c20308a0
)Test Coverage
llama4.yaml
config:Run
Output with
transformers==4.53.1
(4.55.1
has unknown accuracy issue)GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run
/bot [-h|--help]
to print this help message.See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id
(OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.--disable-reuse-test
(OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.--disable-fail-fast
(OPTIONAL) : Disable fail fast on build/tests/infra failures.--skip-test
(OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.--stage-list "A10-PyTorch-1, xxx"
(OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.--gpu-type "A30, H100_PCIe"
(OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.--test-backend "pytorch, cpp"
(OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.--only-multi-gpu-test
(OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.--disable-multi-gpu-test
(OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.--add-multi-gpu-test
(OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.--post-merge
(OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"
(OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".--detailed-log
(OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.--debug
(OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in thestage-list
parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.For guidance on mapping tests to stage names, see
docs/source/reference/ci-overview.md
and the
scripts/test_to_stage_mapping.py
helper.kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request.
--comment "Reason for skipping build/test"
is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.