Conversation

@lucaslie (Member) commented Aug 25, 2025

Summary by CodeRabbit

  • New Features

    • Multimodal inputs via chat templates and processors; image-text-to-text path enabled.
    • Llama-4 vision patch for export/runtime, preserving logits across modes.
    • Prompts accept strings, dicts, and chat message lists with automatic normalization.
    • SDPA custom op gains an enable_gqa flag; the autodeploy backend is treated like PyTorch.
  • Refactor

    • Factory-driven LLM lifecycle and input processor; revamped sequence management and extra-input hooks.
    • Simplified export patch selection and patch application flow.
  • Tests

    • Added Llama-4 VLM patch test; broadened SDPA coverage and updated KV-cache tests.
  • Chores

    • Ignore example YAML configs; relaxed model_factory typing.

Description

Installation instructions for this branch:

TRTLLM_PRECOMPILED_LOCATION=https://urm.nvidia.com/artifactory/sw-tensorrt-generic/llm-artifacts/LLM/main/L0_PostMerge/2275/x86_64-linux-gnu/tensorrt_llm-1.1.0rc3-cp312-cp312-linux_x86_64.whl pip install -e ".[devel]"

Bench results

AutoDeploy Now (ll/vlm_arch, commit 6c20308a0)

===========================================================
= PERFORMANCE OVERVIEW 
===========================================================
Request Throughput (req/sec):                     117.1104
Total Output Throughput (tokens/sec):             14990.1327
Total Token Throughput (tokens/sec):              29980.2654
Total Latency (ms):                               8538.9504
Average request latency (ms):                     6648.1838
Per User Output Throughput [w/ ctx] (tps/user):   20.7469
Per GPU Output Throughput (tps/gpu):              14990.1327

AutoDeploy Before (without llm/vlm_arch, commit 42697ea32)

===========================================================
= PERFORMANCE OVERVIEW
===========================================================
Request Throughput (req/sec):                     117.0791
Total Output Throughput (tokens/sec):             14986.1241
Total Token Throughput (tokens/sec):              29972.2481
Total Latency (ms):                               8541.2345
Average request latency (ms):                     6628.9203
Per User Output Throughput [w/ ctx] (tps/user):   20.8218
Per GPU Output Throughput (tps/gpu):              14986.1241

PyTorch Backend (commit 6c20308a0)

===========================================================
= PERFORMANCE OVERVIEW 
===========================================================
Request Throughput (req/sec):                     124.6204
Total Output Throughput (tokens/sec):             15951.4081
Total Token Throughput (tokens/sec):              31902.8162
Total Latency (ms):                               8024.3700
Average request latency (ms):                     6194.1235
Per User Output Throughput [w/ ctx] (tps/user):   22.3377
Per GPU Output Throughput (tps/gpu):              15951.4081

Test Coverage

llama4.yaml config:

model: meta-llama/Llama-4-Scout-17B-16E-Instruct
args:
  world_size: 4
  runtime: trtllm
  compile_backend: torch-simple
  attn_page_size: 64
  max_input_len: 4096
  max_seq_len: 8192
  attn_backend: flashinfer
  model_factory: AutoModelForImageTextToText
  skip_loading_weights: false
  model_kwargs:
    {}
    # _attn_implementation: eager
prompt:
  batch_size: 4
  queries:
    - "How big is the universe? "
    - {
        "prompt": "In simple words and a single sentence, explain the concept of gravity: ",
      }
    # for the chat template format, see: https://huggingface.co/docs/transformers/en/chat_templating_multimodal
    - - role: user
        content:
          - type: text
            text: How to fix slicing in golf?
    - - role: user
        content:
          - type: text
            text: Please describe the natural scenery you see in the following images
          - type: image
            url: https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/seashore.png
          - type: image
            url: https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png

Run

python build_and_run_ad.py --yaml-configs llama4.yaml

Output with transformers==4.53.1 (4.55.1 has an unknown accuracy issue)

[08/29/2025-15:29:14] [TRT-LLM AUTO-DEPLOY] [I] [PROMPT 0] How big is the universe? :  The universe is really, really big!  It’s so big that we can’t even imagine it.  Scientists have been trying to figure out just how big it is for a long time, and they’re still learning. 
 The observable universe, which is the part of the universe that we can see, is estimated to be around 93 billion light-years across.  A light-year is the distance that light travels in one year, which is about 6 trillion miles (10 trillion kilometers).
[08/29/2025-15:29:14] [TRT-LLM AUTO-DEPLOY] [I] [PROMPT 1] In simple words and a single sentence, explain the concept of gravity: : assistant

Gravity is a natural force that pulls everything towards each other, which is why things fall down when you drop them and why planets orbit around stars.
[08/29/2025-15:29:14] [TRT-LLM AUTO-DEPLOY] [I] [PROMPT 2] <|begin_of_text|><|header_start|>user<|header_end|>

How to fix slicing in golf?<|eot|><|header_start|>assistant<|header_end|>

: The slice! One of the most common and frustrating issues in golf. Fixing a slice requires a combination of understanding the causes, making adjustments to your swing, and practicing new techniques. Here's a step-by-step guide to help you fix your slice:

**Understanding the causes of a slice:**

A slice occurs when the ball curves to the right (for right-handed golfers) due to a combination of factors, including:

1. **Open clubface**: The clubface is pointing to the right of the
[08/29/2025-15:29:14] [TRT-LLM AUTO-DEPLOY] [I] [PROMPT 3] <|begin_of_text|><|header_start|>user<|header_end|>

Please describe the natural scenery you see in the following images<|image|><|image|><|eot|><|header_start|>assistant<|header_end|>

: The natural scenery in the first image is a stormy ocean with large waves crashing against the shore. The sky is overcast and dark, with thick gray clouds that suggest an impending storm. The water is choppy and rough, with white foam forming on the surface of the waves as they break.

In contrast, the second image depicts a serene and peaceful landscape featuring a mountainous terrain with a prominent rock formation in the center. The sky above is blue and clear, with only a few wispy

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages that don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
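
For illustration, a typical invocation combining some of the flags documented above might look like this (the flag values are examples only):

/bot run --disable-fail-fast --gpu-type "A30, H100_PCIe"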

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since skipping validation without due care can break the top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since reusing results without due care and validation can break the top of tree.

coderabbitai bot (Contributor) commented Aug 25, 2025

📝 Walkthrough

Refactors AutoDeploy to add multimodal input support and dynamic-shape hooks: SequenceInfo internal redesign, factory-driven input/processor hooks, ADInputProcessor, multimodal data plumbing through executors, vision patch for Llama4, prompt-format normalization, export patch selection changes, and related tests/typing updates.

Changes

Cohort / File(s) and summary:

  • Examples: prompts and gitignore (examples/auto_deploy/build_and_run_ad.py, examples/auto_deploy/.gitignore)
    Expand prompt input types to accept strings, dicts, and HF-style message lists with normalization; add a .gitignore rule to ignore *.yaml.
  • Custom ops: attention interface and SDPA (tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py, tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py)
    Replace the SequenceInfo dataclass with a stateful class supporting named extra args, dynamic shapes, and callbacks; add APIs for registering extra args and rescatter; add an enable_gqa param to scaled_dot_product_attention (and its fake variant).
  • LLM core and factory hooks (tensorrt_llm/_torch/auto_deploy/llm.py, tensorrt_llm/_torch/auto_deploy/models/factory.py, tensorrt_llm/_torch/auto_deploy/models/hf.py)
    Introduce ADInputProcessor and a factory-driven lifecycle: ModelFactory gains init_processor, get_example_inputs, get_extra_inputs; HF factories wired for processor/tokenizer and image inputs with dynamic-shape callbacks.
  • Vision patch for Llama4 (tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py)
    Add Llama4VisionPatch and a registry entry to swap in a vision-aware forward; insert projected vision features via masked_scatter using image_token_id; apply/revert lifecycle implemented.
  • Shim/executor plumbing (tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py, tensorrt_llm/_torch/auto_deploy/shim/demollm.py, tensorrt_llm/executor/worker.py)
    Wire factory get_extra_inputs into SequenceInfo via add_extra_arg; nest_sequences accepts extra_args; thread multimodal params through DemoEngine/Executor; treat _autodeploy as a PyTorch-like backend; SequenceInfo no longer takes a device at init.
  • Export interface and flow (tensorrt_llm/_torch/auto_deploy/export/export.py, tensorrt_llm/_torch/auto_deploy/export/interface.py)
    Move patch selection into apply_export_patches(patch_configs, patch_list) with mutual-exclusivity checks and defaults; ExportPatchRegistry.get now raises on unknown patches.
  • Transform/export integration (tensorrt_llm/_torch/auto_deploy/transform/library/export_to_gm.py)
    set_example_sequence now passes factory-provided example inputs to the export helpers.
  • Config typing (tensorrt_llm/_torch/auto_deploy/llm_args.py)
    Relax model_factory from a Literal union to a plain str (default unchanged).
  • KV-cache and transform library (tensorrt_llm/_torch/auto_deploy/transform/library/kvcache.py)
    Remove the strict two-input-node assertion; assemble prepare-metadata args from named standard and extra args; call the public set_max_num_tokens_sample.
  • Unit tests: harness and embeddings (tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py, tests/unittest/_torch/auto_deploy/_utils_test/_model_test_utils.py)
    SequenceEmbeddingInfo now constructs embedding tensors (hidden_size/dtype) and overrides nest_sequences; small Llama-4 test config vision_config keys simplified.
  • Unit tests: engine/LLM/shims (tests/unittest/_torch/auto_deploy/unit/singlegpu/shim/test_engine.py, tests/unittest/_torch/auto_deploy/unit/singlegpu/shim/test_llm_config.py, tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_build_small_single.py)
    Remove explicit SequenceInfo sync calls; mock the input processor via LLM._create_input_processor; adjust one compile_backend param.
  • Unit tests: KV cache usage (tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py)
    Rename the GQAWithSdpa forward arg from x to input_ids; tests updated to pass an explicit input_pos to nest_sequences.
  • Unit tests: attention matcher (tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py)
    Remove the early skip so the sdpa path is exercised by the test.
  • Unit tests: Llama4 VLM patch (tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_llama4_vlm_patch.py)
    New test validating logits consistency across original, patched, and GM-exported Llama-4 VLM runs.
  • Misc tests adjustments (tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py, tests/unittest/_torch/auto_deploy/unit/singlegpu/shim/test_engine.py)
    Minor test-harness updates to align with new APIs (removed sync calls, changed mocking target).

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor User
  participant LLM
  participant Factory as ModelFactory
  participant Processor as ADInputProcessor
  participant Tokenizer as HF Processor/Tokenizer

  User->>LLM: instantiate(args)
  LLM->>Factory: factory = args.create_factory()
  LLM->>Factory: init_tokenizer()
  Factory-->>LLM: tokenizer
  LLM->>Factory: init_processor()
  Factory-->>LLM: processor
  LLM->>Processor: _create_input_processor(tokenizer, processor)
  User->>LLM: generate(inputs, sampling_params)
  LLM->>Processor: __call__(inputs, sampling_params)
  Processor->>Tokenizer: apply_chat_template / encode
  Tokenizer-->>Processor: token_ids[, pixel_values]
  Processor-->>LLM: token_ids, extra_multimodal
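
To make the processor step in the diagram above concrete, here is a minimal standalone sketch of the dispatch it describes; the function name and return format are illustrative, not the actual ADInputProcessor API:

from typing import Any, Dict


def process_inputs_sketch(inputs: Dict[str, Any], tokenizer, processor) -> Dict[str, Any]:
    """Illustrative only: tokenize either a chat message list or a plain prompt string."""
    if "messages" in inputs:
        # Multimodal/chat path: let the HF processor's chat template build the prompt
        # (and, for image messages, the pixel inputs) in one call.
        encoded = processor.apply_chat_template(
            inputs["messages"],
            add_generation_prompt=True,
            tokenize=True,
            return_dict=True,
            return_tensors="pt",
        )
        token_ids = encoded["input_ids"][0].tolist()
        # Anything beyond token ids (e.g. pixel_values) is treated as extra multimodal data.
        extra = {k: v for k, v in encoded.items() if k not in ("input_ids", "attention_mask")}
    else:
        # Text-only path: plain tokenization of the prompt string.
        token_ids = tokenizer.encode(inputs["prompt"])
        extra = {}
    return {"token_ids": token_ids, "multimodal_data": extra}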
sequenceDiagram
  autonumber
  participant Engine as ADEngine/DemoEngine
  participant Factory as ModelFactory
  participant SeqInfo as SequenceInfo
  participant KV as KV Cache Manager

  Engine->>Factory: get_extra_inputs()
  Factory-->>Engine: {name: (none_input, dyn_shape_cb)}*
  Engine->>SeqInfo: add_extra_arg(name, none_input, cb)*
  Engine->>KV: get_cache_indices(request)
  KV-->>Engine: page_assignments
  Engine->>SeqInfo: nest_sequences(input_ids, input_pos, page_assignments, **extra_args)
  alt new tokens produced
    Engine->>SeqInfo: rescatter_input_ids(new_tokens, gather_idx, scatter_ref)
  end
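
A sketch of the extra-input handshake shown above, using the names from the walkthrough (get_extra_inputs, add_extra_arg, nest_sequences); the tensor shape and the dynamic-shape format are assumptions for illustration only:

import torch
from torch.export import Dim


def get_extra_inputs_sketch():
    """Illustrative factory hook: one extra model input named "pixel_values"."""
    # Placeholder used for text-only batches (zero images); dims are example values only.
    none_pixel_values = torch.zeros(0, 3, 336, 336)

    def dynamic_shape_callback():
        # Only the leading (number-of-images) dimension is dynamic at export time.
        # The exact format SequenceInfo expects from this callback is an assumption here.
        return {0: Dim("num_images")}

    return {"pixel_values": (none_pixel_values, dynamic_shape_callback)}


extra_inputs = get_extra_inputs_sketch()
# Engine side (per the diagram above), roughly:
#   for name, (none_input, cb) in extra_inputs.items():
#       sequence_info.add_extra_arg(name, none_input, cb)
#   sequence_info.nest_sequences(input_ids, input_pos=..., page_assignments=..., pixel_values=...)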
sequenceDiagram
  autonumber
  participant Export as export_to_gm
  participant IFace as Export Interface
  participant Patch as Patch Registry

  Export->>IFace: apply_export_patches(patch_configs, patch_list)
  IFace->>IFace: resolve configs or derive from patch_list / defaults
  IFace->>Patch: create patches from resolved configs
  IFace->>Patch: apply patches (context)
  Export-->>Export: perform export under patch context
  IFace-->>Patch: revert patches on exit
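
To summarize the patch flow above in code, here is a self-contained sketch of the apply/revert lifecycle and the mutual-exclusivity check; the patch name and the torch_export_to_gm call are assumptions, not the repository's exact API:

from contextlib import contextmanager
from typing import Dict, List, Optional


@contextmanager
def apply_export_patches_sketch(
    patch_configs: Optional[Dict[str, dict]] = None,
    patch_list: Optional[List[str]] = None,
):
    """Illustrative re-implementation of the selection flow described above (not the real API)."""
    if patch_configs is not None and patch_list is not None:
        raise ValueError("patch_configs and patch_list are mutually exclusive")
    if patch_configs is None:
        # Derive default (empty) configs from the name list; in the real interface an
        # omitted list falls back to registry defaults.
        patch_configs = {name: {} for name in (patch_list or [])}
    applied = []
    try:
        for name, cfg in patch_configs.items():
            print(f"applying export patch {name!r} with config {cfg}")  # registry lookup in reality
            applied.append(name)
        yield
    finally:
        for name in reversed(applied):
            print(f"reverting export patch {name!r}")


# The export itself would run inside the patch context:
with apply_export_patches_sketch(patch_list=["llama4_vision"]):  # patch name is illustrative
    pass  # torch_export_to_gm(model, args=..., dynamic_shapes=...) in the real flow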

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested labels

AutoDeploy

Suggested reviewers

  • Fridah-nv
  • suyoggupta
  • netanel-haber
  • syuoni
  • Superjomn
  • QiJune

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (8)
tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py (1)

74-83: Bug: enable_gqa argument is ignored (hard-coded to False).

The public parameter is never honored; you always pass enable_gqa=False to F.scaled_dot_product_attention. This silently disables GQA even when requested by callers.

Minimal fix:

         scale=scale,
-        enable_gqa=False,
+        enable_gqa=enable_gqa,
     )

Optional backward-compatible safeguard (in case an older torch build does not accept the kwarg):

-    return F.scaled_dot_product_attention(
-        query.contiguous(),
-        key.contiguous(),
-        value.contiguous(),
-        attn_mask=attn_mask,
-        dropout_p=dropout_p,
-        is_causal=is_causal,
-        scale=scale,
-        enable_gqa=enable_gqa,
-    )
+    try:
+        return F.scaled_dot_product_attention(
+            query.contiguous(),
+            key.contiguous(),
+            value.contiguous(),
+            attn_mask=attn_mask,
+            dropout_p=dropout_p,
+            is_causal=is_causal,
+            scale=scale,
+            enable_gqa=enable_gqa,
+        )
+    except TypeError:
+        if enable_gqa:
+            raise NotImplementedError(
+                "enable_gqa=True requested, but installed torch does not support the 'enable_gqa' kwarg."
+            )
+        # Fallback: call without the kwarg
+        return F.scaled_dot_product_attention(
+            query.contiguous(),
+            key.contiguous(),
+            value.contiguous(),
+            attn_mask=attn_mask,
+            dropout_p=dropout_p,
+            is_causal=is_causal,
+            scale=scale,
+        )
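
For reference, a minimal standalone illustration of what honoring the flag enables in stock PyTorch (requires a torch build whose F.scaled_dot_product_attention accepts enable_gqa, roughly 2.5+); this is not the repository's custom op:

import torch
import torch.nn.functional as F

# Grouped-query attention: more query heads than key/value heads.
q = torch.randn(2, 8, 16, 64)  # [batch, num_q_heads, seq_len, head_dim]
k = torch.randn(2, 2, 16, 64)  # [batch, num_kv_heads, seq_len, head_dim]
v = torch.randn(2, 2, 16, 64)

# With enable_gqa=True, SDPA broadcasts the 2 KV heads across the 8 query heads
# instead of requiring the caller to repeat_interleave them beforehand.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True, enable_gqa=True)
print(out.shape)  # torch.Size([2, 8, 16, 64])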
tensorrt_llm/executor/worker.py (1)

433-436: Potential AttributeError when treating "_autodeploy" as PyTorch-like.

Now that _is_pytorch_backend is True for "_autodeploy", accessing self._executor_config.pytorch_backend_config may fail if that attribute is absent on autodeploy configs. Guard with getattr and default to enabling overlap unless explicitly disabled.

-        is_overlap_enabled = self._is_pytorch_backend and not self._executor_config.pytorch_backend_config.disable_overlap_scheduler
+        pytorch_cfg = getattr(self._executor_config, "pytorch_backend_config", None)
+        disable_overlap = getattr(pytorch_cfg, "disable_overlap_scheduler", False)
+        is_overlap_enabled = self._is_pytorch_backend and not disable_overlap

Optionally, add a short comment indicating the default behavior for autodeploy when this sub-config is missing.

tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py (1)

1-1: Add NVIDIA Apache-2.0 copyright header (2025) at file top.

Per repo guidelines, prepend the standard NVIDIA Apache-2.0 header to all Python sources, including tests.

Apply this diff at the very top:

+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#     http://www.apache.org/licenses/LICENSE-2.0
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
tests/unittest/_torch/auto_deploy/integration/test_llama4_vlm_export.py (1)

1-1: Add NVIDIA Apache-2.0 copyright header (2025) at file top.

Please prepend the standard NVIDIA Apache-2.0 header.

Apply this diff at the top:

+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#     http://www.apache.org/licenses/LICENSE-2.0
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
tensorrt_llm/_torch/auto_deploy/transform/library/export_to_gm.py (1)

1-1: Add NVIDIA Apache-2.0 copyright header (2025) at file top.

Please prepend the standard header.

Apply this diff at the top:

+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#     http://www.apache.org/licenses/LICENSE-2.0
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
tests/unittest/_torch/auto_deploy/unit/singlegpu/shim/test_llm_config.py (1)

1-1: Add NVIDIA Apache-2.0 copyright header (2025) at file top.

Please prepend the standard header.

Apply this diff at the top:

+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#     http://www.apache.org/licenses/LICENSE-2.0
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
tensorrt_llm/_torch/auto_deploy/models/factory.py (1)

1-1: Add NVIDIA Apache-2.0 copyright header (2025) at file top.

Please prepend the standard header.

Apply this diff at the top:

+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#     http://www.apache.org/licenses/LICENSE-2.0
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (1)

1-10: Add NVIDIA Apache-2.0 header (compliance).

Per repo guidelines, prepend the NVIDIA Apache-2.0 copyright header with the current year to all source files.

Apply this diff at the top of the file:

+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 """
 Attention Interface to handle various attention operators and cache operations.
🧹 Nitpick comments (30)
tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py (2)

1-2: Add NVIDIA Apache-2.0 header (2025) at file top.

Per repo guidelines, prepend the NVIDIA Apache-2.0 copyright header with the current year.

Apply:

+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#     http://www.apache.org/licenses/LICENSE-2.0
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.

86-91: Fake op signature parity looks good; consider documenting enable_gqa here too.

The fake now accepts enable_gqa, which keeps tracing and shape propagation stable. Add a brief note in the fake’s docstring that the flag is ignored and only affects real kernel selection.

tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_build_small_single.py (1)

1-1: Add NVIDIA Apache-2.0 header at file top.

Per repo guidelines, prepend the standard NVIDIA Apache-2.0 copyright header (year 2025) before the module docstring.

Apply this diff:

+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#     http://www.apache.org/licenses/LICENSE-2.0
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
tensorrt_llm/executor/worker.py (4)

1-1: Add NVIDIA Apache-2.0 header at file top.

Please prepend the standard header (year 2025).

Apply this diff:

+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#     http://www.apache.org/licenses/LICENSE-2.0
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.

84-86: Minor: avoid list re-allocation for backend check.

Use a constant set/tuple for membership to avoid per-init list construction; makes intent explicit.

-        self._is_pytorch_backend = getattr(self._executor_config, "backend",
-                                           None) in ["pytorch", "_autodeploy"]
+        PYTORCH_LIKE_BACKENDS = ("pytorch", "_autodeploy")
+        self._is_pytorch_backend = getattr(self._executor_config, "backend", None) in PYTORCH_LIKE_BACKENDS

520-521: Optional: document py_scheduling_params shape/contract.

Since PyTorch-like backends now include "_autodeploy", document constraints/expected fields for request.scheduling_params to avoid divergence between backends.


84-86: Add a quick unit test to cover the autodeploy path.

Consider adding a small test that instantiates GenerationExecutorWorker with executor_config.backend="_autodeploy" and validates that _is_pytorch_backend is True and that the overlap-flag logic does not raise.

tensorrt_llm/_torch/auto_deploy/transformations/library/kvcache.py (4)

1-1: Add NVIDIA Apache-2.0 header at file top.

Please prepend the standard header (year 2025).

+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#     http://www.apache.org/licenses/LICENSE-2.0
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.

73-77: Solid move to ordered named args; add defensive check for missing placeholders.

Indexing input_nodes_mapping with unknown keys will raise a KeyError. Add a clear error if the graph is missing required placeholders—helps debugging miswired graphs/models.

-    input_nodes_mapping = {n.target: n for n in input_nodes}
-
-    # filtered and sorted for SequenceInfo arguments (input_ids, position_ids, etc.)
-    input_nodes_from_info = [input_nodes_mapping[k] for k in cm.info.named_standard_args.keys()]
+    input_nodes_mapping = {n.target: n for n in input_nodes}
+    required = list(cm.info.named_standard_args.keys())
+    missing = [k for k in required if k not in input_nodes_mapping]
+    if missing:
+        raise ValueError(f"Missing required input placeholders in graph: {missing}. "
+                         f"Available: {list(input_nodes_mapping.keys())}")
+    # filtered and sorted for SequenceInfo arguments (input_ids, position_ids, etc.)
+    input_nodes_from_info = [input_nodes_mapping[k] for k in required]

80-84: Insertion anchor: prefer the first attention node over input_nodes[-1].next.

Anchoring prepare_metadata right before the first source attention node makes dependency intent explicit and avoids corner-cases when graph input ordering changes.

-    with graph.inserting_before(input_nodes[-1].next):
+    insertion_anchor = source_attn_nodes[0] if source_attn_nodes else input_nodes[-1].next
+    with graph.inserting_before(insertion_anchor):
         ret_node = graph.call_function(
             get_metadata,
             args=(*input_nodes_from_info, *cm.info.extra_args_for_prepare_metadata),
         )

161-176: KV cache resize: synchronize before/after measuring memory and clamp pages.

Without synchronizing, mem_get_info can be noisy around kernel launches; also guard against degenerate division and non-positive page counts.

     try:
         # Let's run a forward pass to get the memory usage
-        cm.info.set_max_num_tokens_sample()
-        free_mem_pre, _ = _get_mem_info_in_mb()
+        cm.info.set_max_num_tokens_sample()
+        torch.cuda.synchronize()
+        free_mem_pre, _ = _get_mem_info_in_mb()
         ad_logger.info(f"Free memory before forward pass (MB): {free_mem_pre}")
 
         egm(*cm.args)
 
-        free_mem_post, _ = _get_mem_info_in_mb()
+        torch.cuda.synchronize()
+        free_mem_post, _ = _get_mem_info_in_mb()
         ad_logger.info(f"Free memory after forward pass (MB): {free_mem_post}")
 
         memory_for_forward_pass = free_mem_pre - free_mem_post
         ad_logger.info(f"Memory for forward pass (MB): {memory_for_forward_pass}")
 
-        new_cache_size = free_mem_post * 1024 * 1024 * free_mem_ratio + current_cache_size
-        new_num_pages = int(new_cache_size // (current_cache_size // current_num_pages))
+        bytes_per_mb = 1024 * 1024
+        new_cache_size = int(free_mem_post * bytes_per_mb * free_mem_ratio + current_cache_size)
+        denom = max(1, current_cache_size // max(1, current_num_pages))
+        new_num_pages = max(1, int(new_cache_size // denom))
examples/auto_deploy/build_and_run_ad.py (3)

1-1: Add NVIDIA Apache-2.0 header at file top.

Please prepend the standard header (year 2025).

+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#     http://www.apache.org/licenses/LICENSE-2.0
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.

75-97: Normalization is clear; avoid aliasing nested lists across the batch.

Multiplying the list duplicates references to the same nested message objects. If any later logic mutates message dicts, this can cause surprising cross-sample coupling. Consider copying list/dicts during expansion.

         queries = self.queries if isinstance(self.queries, list) else [self.queries]
         batch_size = self.batch_size
-        queries = queries * (batch_size // len(queries) + 1)
+        # Expand without aliasing nested structures
+        expanded = []
+        for _ in range(batch_size):
+            for q in queries:
+                if isinstance(q, list):
+                    expanded.append([dict(m) for m in q])  # shallow-copy each message
+                elif isinstance(q, dict):
+                    expanded.append(dict(q))
+                else:
+                    expanded.append(q)
+                if len(expanded) == batch_size:
+                    break
+            if len(expanded) == batch_size:
+                break
-        queries = queries[:batch_size]
+        queries = expanded[:batch_size]

269-309: Ensure LLM shutdown on exceptions.

Wrap generate/benchmark in try/finally to guarantee llm.shutdown() even if an exception is raised.

-    llm = build_llm_from_config(config)
-
-    # prompt the model and print its output
-    ad_logger.info("Running example prompts...")
-
-    # now let's try piping through multimodal data
-
-    outs = llm.generate(
-        config.prompt.queries,
-        sampling_params=SamplingParams(**config.prompt.sp_kwargs),
-    )
-    results = {"prompts_and_outputs": print_outputs(outs)}
+    llm = build_llm_from_config(config)
+    try:
+        # prompt the model and print its output
+        ad_logger.info("Running example prompts...")
+        # now let's try piping through multimodal data
+        outs = llm.generate(
+            config.prompt.queries,
+            sampling_params=SamplingParams(**config.prompt.sp_kwargs),
+        )
+        results = {"prompts_and_outputs": print_outputs(outs)}
 
-    # run a benchmark for the model with batch_size == config.benchmark_bs
-    if config.benchmark.enabled and config.args.runtime != "trtllm":
-        ad_logger.info("Running benchmark...")
-        keys_from_args = ["compile_backend", "attn_backend", "mla_backend"]
-        fields_to_show = [f"benchmark={config.benchmark}"]
-        fields_to_show.extend([f"{k}={getattr(config.args, k)}" for k in keys_from_args])
-        results["benchmark_results"] = benchmark(
-            func=lambda: llm.generate(
-                torch.randint(0, 100, (config.benchmark.bs, config.benchmark.isl)).tolist(),
-                sampling_params=SamplingParams(
-                    max_tokens=config.benchmark.osl,
-                    top_k=None,
-                    ignore_eos=True,
-                ),
-                use_tqdm=False,
-            ),
-            num_runs=config.benchmark.num,
-            log_prefix="Benchmark with " + ", ".join(fields_to_show),
-            results_path=config.benchmark.results_path,
-        )
-    elif config.benchmark.enabled:
-        ad_logger.info("Skipping simple benchmarking for trtllm...")
-
-    if config.benchmark.store_results:
-        store_benchmark_results(results, config.benchmark.results_path)
-
-    llm.shutdown()
+        # run a benchmark for the model with batch_size == config.benchmark_bs
+        if config.benchmark.enabled and config.args.runtime != "trtllm":
+            ad_logger.info("Running benchmark...")
+            keys_from_args = ["compile_backend", "attn_backend", "mla_backend"]
+            fields_to_show = [f"benchmark={config.benchmark}"]
+            fields_to_show.extend([f"{k}={getattr(config.args, k)}" for k in keys_from_args])
+            results["benchmark_results"] = benchmark(
+                func=lambda: llm.generate(
+                    torch.randint(0, 100, (config.benchmark.bs, config.benchmark.isl)).tolist(),
+                    sampling_params=SamplingParams(
+                        max_tokens=config.benchmark.osl,
+                        top_k=None,
+                        ignore_eos=True,
+                    ),
+                    use_tqdm=False,
+                ),
+                num_runs=config.benchmark.num,
+                log_prefix="Benchmark with " + ", ".join(fields_to_show),
+                results_path=config.benchmark.results_path,
+            )
+        elif config.benchmark.enabled:
+            ad_logger.info("Skipping simple benchmarking for trtllm...")
+
+        if config.benchmark.store_results:
+            store_benchmark_results(results, config.benchmark.results_path)
+    finally:
+        llm.shutdown()
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py (1)

41-43: Forward signature alignment looks good; add a quick guard and silence the unused parameter.

  • position_ids is intentionally unused; make that explicit to avoid lints.
  • Add a divisibility/assert guard to prevent silent shape errors when hidden_size is not divisible by num_heads.

Apply:

 def forward(
         self, input_ids: torch.Tensor, position_ids: Optional[torch.Tensor] = None
     ) -> torch.Tensor:
@@
-        b, s, _ = input_ids.shape
+        b, s, d_model = input_ids.shape
+        _ = position_ids  # unused; kept to match interface
+        assert (
+            d_model % self.num_heads == 0
+        ), f"hidden_size ({d_model}) must be divisible by num_heads ({self.num_heads})"
@@
         q = self.q_proj(input_ids)  # [b, s, n*h_d]
         k = self.k_proj(input_ids)  # [b, s, n_kv*h_d]
         v = self.v_proj(input_ids)  # [b, s, n_kv*h_d]

Optional outside-range improvement (constructor robustness): prefer explicit parameters over unpacking args for readability and safety.

class GQAWithSdpa(GQA):
    def __init__(self, num_attention_heads: int, hidden_size: int, num_key_value_heads: int, *args, **kwargs):
        super().__init__(num_attention_heads, hidden_size, num_key_value_heads, *args, **kwargs)
        self.num_heads = num_attention_heads
        self.num_kv_heads = num_key_value_heads
        self.head_dim = hidden_size // self.num_heads
        self.num_key_value_groups = None if self.num_heads == self.num_kv_heads else self.num_heads // self.num_kv_heads

Also applies to: 48-48, 51-54

tests/unittest/_torch/auto_deploy/unit/singlegpu/shim/test_llm_config.py (1)

95-95: Patch target update to LLM._create_input_processor is correct.

This aligns with the refactor that moved input-processor creation into the LLM class.

To reduce redundancy, you can remove the inner patch.object(..., "_build_model", ...) since it's already patched via the decorator. Also, in the non-executor branch add assert_not_called() for completeness:

@@ def test_config_flow(...):
-    else:
-        # For LLM with _autodeploy backend, executor should not be called directly
-        pass
+    else:
+        # For LLM with _autodeploy backend, executor should not be called directly
+        mock_executor.assert_not_called()
tensorrt_llm/_torch/auto_deploy/models/factory.py (3)

116-124: New hook: init_processor() — clear and useful for multimodal processors.

No issues. Consider adding a brief Google-style docstring “Args/Returns” section like other methods for consistency.


218-229: New hook: get_example_inputs() — fits export_to_gm usage.

Looks good. To avoid surprises, document that the returned tensors can be on CPU; SequenceInfo will move them to the correct device.


62-80: Docstring still assumes only (input_ids, position_ids) despite multimodal support.

Minor inconsistency: the class-level expectations and build_model docstring mention only input_ids/position_ids. With this PR adding extra_args (e.g., pixel_values), consider updating the docstring to acknowledge optional extra inputs.

tensorrt_llm/_torch/auto_deploy/shim/demollm.py (1)

98-117: Consider validating multimodal data consistency

The code correctly collects multimodal data from requests, but there's no validation that all requests have the same set of multimodal keys. This could lead to issues if requests have heterogeneous multimodal data structures.

Consider adding validation to ensure all requests have consistent multimodal data keys:

 # set up sequence info object for decode phase
 sequence_info = self.cache_seq_interface.info
 sequence_info.reset()

 input_ids = []
 total_lens = []
 extra_args: Dict[str, List[torch.Tensor]] = defaultdict(list)
+multimodal_keys = set()

 for request in requests:
     total_lens.append(len(request.prompt_token_ids))
     input_ids.append(request.prompt_token_ids)
     if request.multimodal_params is not None:
+        current_keys = set(request.multimodal_params.multimodal_data.keys())
+        if multimodal_keys and current_keys != multimodal_keys:
+            raise ValueError(f"Heterogeneous multimodal data keys are not supported. Expected {multimodal_keys}, got {current_keys}")
+        multimodal_keys = current_keys
         for k, v in request.multimodal_params.multimodal_data.items():
             extra_args[k].append(v)
tensorrt_llm/_torch/auto_deploy/llm.py (2)

44-51: Consider making the sanity check optional in production

The TODO comment indicates this sanity check may be removed for performance. Since this creates an intermediate string that's not actually used (only for validation), it could impact performance for large batches.

Consider adding a debug flag to make this check optional:

 # check for messages field and if yes, use the apply_chat_template method
 if "messages" in inputs:
-    # TODO: we don't really need this but it makes for a good sanity check. Consider
-    # removing this in the future if we need to speed things up.
-    prompt = self.processor.apply_chat_template(
-        inputs["messages"],
-        add_generation_prompt=True,
-        tokenize=False,
-    )
-    inputs["prompt"] = prompt
+    if ad_logger.isEnabledFor(logging.DEBUG):
+        # Sanity check: generate the prompt for debugging
+        prompt = self.processor.apply_chat_template(
+            inputs["messages"],
+            add_generation_prompt=True,
+            tokenize=False,
+        )
+        inputs["prompt"] = prompt
+        ad_logger.debug(f"Generated prompt from messages: {prompt[:100]}...")

121-125: Consider more specific type checking

The assertions verify instance types but could be more specific about what's expected.

Consider adding more informative error messages:

 # now correct input processor
-assert isinstance(self.input_processor, DefaultInputProcessor)
-assert self.tokenizer is None or isinstance(self.tokenizer, TransformersTokenizer)
+assert isinstance(self.input_processor, DefaultInputProcessor), (
+    f"Expected DefaultInputProcessor, got {type(self.input_processor)}"
+)
+assert self.tokenizer is None or isinstance(self.tokenizer, TransformersTokenizer), (
+    f"Expected TransformersTokenizer or None, got {type(self.tokenizer)}"
+)
 self.input_processor = self._create_input_processor()
tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py (3)

37-37: Document why image_sizes is set to None

While the comment mentions that image_sizes is set to None, it would be helpful to explain that this is required because torch.cond branches must have fixed argument types during export.

-    image_sizes: torch.Tensor = None,  # image_sizes set as None
+    image_sizes: torch.Tensor = None,  # Set to None for torch.export compatibility (cond branches need fixed types)

89-96: Consider enabling the validation check conditionally

The commented-out validation check for matching token counts could be useful for debugging. Consider making it conditional based on a debug flag rather than commenting it out entirely.

-# This condition statement breaks torch.export:
-# TODO: sanity check on the inputs for this
-# if num_tokens_to_fill != projected_vision_flat.size(0):
-#     raise ValueError(
-#         f"Mismatch: final_mask wants {num_tokens_to_fill} embeddings, "
-#         f"but multi_modal_projector returned {projected_vision_flat.size(0)}"
-#     )
+# Skip runtime validation during export to avoid breaking torch.export
+# This check is valuable for debugging but incompatible with torch.export's constraints
+if not torch.compiler.is_compiling():
+    num_tokens_to_fill = final_mask_1d.sum()
+    if num_tokens_to_fill != projected_vision_flat.size(0):
+        raise ValueError(
+            f"Mismatch: final_mask wants {num_tokens_to_fill} embeddings, "
+            f"but multi_modal_projector returned {projected_vision_flat.size(0)}"
+        )
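
To make the count relationship concrete, here is a standalone sketch of the masked_scatter insertion the patch performs; tensor names and sizes are illustrative, not the actual Llama4 module code:

import torch

batch, seq, hidden = 2, 6, 8
image_token_id = 99

input_ids = torch.randint(0, 50, (batch, seq))
input_ids[0, 2] = image_token_id  # pretend there are two <|image|> placeholder tokens
input_ids[1, 4] = image_token_id
inputs_embeds = torch.randn(batch, seq, hidden)

# One projected vision feature row per image token in the batch; the check discussed
# above is exactly that this row count matches the number of masked token positions.
projected_vision_flat = torch.randn(2, hidden)

special_image_mask = (input_ids == image_token_id).unsqueeze(-1).expand_as(inputs_embeds)
inputs_embeds = inputs_embeds.masked_scatter(special_image_mask, projected_vision_flat)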

158-158: Document why image_hidden_states is None

The comment mentions skipping this "for simplicity" but it would be clearer to explain that image_hidden_states is computed inside the vision branch and not accessible outside torch.cond.

-        image_hidden_states=None,  # skip outputting this for simplicity
+        image_hidden_states=None,  # Not accessible outside torch.cond vision branch
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (5)

139-156: Default dtypes: prefer clarity and consistency.

  • input_ids is initialized with torch.int while position_ids uses torch.long. If input_ids represents token IDs by default, torch.long is the safer choice (embedding/lookup ops require int64). If input_ids may also carry embeddings later (as tests do), consider documenting that and defaulting to torch.long only for the token-ID path.

Would you like me to push a docstring tweak and a small follow-up change to default input_ids to torch.long?


248-252: Return host-side _input_positions for consistency (and avoid device sync).

The TODO indicates intent to use the host-side list. After fixing _update_input_pos, returning _input_positions avoids extra device->host sync on every call.

 @property
 def input_positions(self) -> List[int]:
-    # TODO: correctly set _input_positions and use it here
-    # return self._input_positions
-    return self.input_pos[: self.num_sequences].tolist()
+    return self._input_positions

465-475: Make host tensors pinned for non_blocking copies and clear tails.

These copies are marked non_blocking=True but the source tensors aren’t pinned, making them synchronous. Also, clear the trailing slots to avoid stale data.

 def _assign_pages_per_seq(self, page_assignments: Sequence[Sequence[int]]) -> None:
     """Set the cache location and pages_per_seq tensors from page assignments."""
     assert len(page_assignments) == self.num_sequences
-    cache_loc_flat = torch.tensor(
-        [p_idx for pages in page_assignments for p_idx in pages], dtype=torch.int
-    )
-    self.cache_loc[: len(cache_loc_flat)].copy_(cache_loc_flat, non_blocking=True)
-
-    pages_per_seq = torch.tensor([len(p) for p in page_assignments], dtype=torch.int)
-    self.pages_per_seq[: len(pages_per_seq)].copy_(pages_per_seq, non_blocking=True)
+    cache_loc_flat = torch.tensor(
+        [p_idx for pages in page_assignments for p_idx in pages],
+        dtype=torch.int,
+        pin_memory=True,
+    )
+    self.cache_loc[: cache_loc_flat.numel()].copy_(cache_loc_flat, non_blocking=True)
+    if self.cache_loc.numel() > cache_loc_flat.numel():
+        self.cache_loc[cache_loc_flat.numel() :].zero_()
+
+    pages_per_seq_host = torch.tensor([len(p) for p in page_assignments],
+                                      dtype=torch.int, pin_memory=True)
+    self.pages_per_seq[: pages_per_seq_host.numel()].copy_(pages_per_seq_host, non_blocking=True)
+    if self.pages_per_seq.numel() > pages_per_seq_host.numel():
+        self.pages_per_seq[pages_per_seq_host.numel() :].zero_()

372-379: Non-blocking reset copies: consider pinned memory for consistency.

cache_loc_host is pinned (good). For symmetry and performance, also consider zeroing input_pos and pages_per_seq tails here to avoid stale values when the active batch shrinks after a larger one.

I can add a small follow-up to zero the tails here; let me know if you want it in this PR.


202-233: Lazy dynamic-shape maps reuse the same dict instance per-arg.

bs_seq_len_shape is shared across uncached args. That’s fine if never mutated. If you expect any per-arg tweak later, switch to a shallow copy per key to avoid aliasing surprises.

- self._uncached_dynamic_shapes = {k: bs_seq_len_shape for k in self._uncached_arg_names}
+ self._uncached_dynamic_shapes = {k: dict(bs_seq_len_shape) for k in self._uncached_arg_names}
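
A quick generic illustration of the aliasing pitfall noted above (plain Python, unrelated to the repository code):

shared = {"min": 1, "max": 8}

# Comprehension reusing one dict object: every key points at the same instance.
per_arg = {name: shared for name in ("input_ids", "position_ids")}
per_arg["input_ids"]["max"] = 16
print(per_arg["position_ids"]["max"])  # 16: the "other" entry changed too

# Shallow-copying per key keeps the entries independent.
per_arg_copy = {name: dict(shared) for name in ("input_ids", "position_ids")}
per_arg_copy["input_ids"]["max"] = 32
print(per_arg_copy["position_ids"]["max"])  # still 16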
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between 6a44e5b and a30236a.

📒 Files selected for processing (19)
  • examples/auto_deploy/.gitignore (1 hunks)
  • examples/auto_deploy/build_and_run_ad.py (4 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (8 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py (2 hunks)
  • tensorrt_llm/_torch/auto_deploy/llm.py (4 hunks)
  • tensorrt_llm/_torch/auto_deploy/models/factory.py (3 hunks)
  • tensorrt_llm/_torch/auto_deploy/models/hf.py (4 hunks)
  • tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (4 hunks)
  • tensorrt_llm/_torch/auto_deploy/shim/demollm.py (8 hunks)
  • tensorrt_llm/_torch/auto_deploy/transform/library/export_to_gm.py (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/transformations/library/kvcache.py (2 hunks)
  • tensorrt_llm/executor/worker.py (1 hunks)
  • tests/unittest/_torch/auto_deploy/_utils_test/_model_test_utils.py (0 hunks)
  • tests/unittest/_torch/auto_deploy/integration/test_llama4_vlm_export.py (1 hunks)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/shim/test_engine.py (0 hunks)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/shim/test_llm_config.py (1 hunks)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_build_small_single.py (1 hunks)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py (3 hunks)
💤 Files with no reviewable changes (2)
  • tests/unittest/_torch/auto_deploy/_utils_test/_model_test_utils.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/shim/test_engine.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent with 4 spaces; do not use tabs
Preserve module namespace when importing: from package.subpackage import foo; then use foo.SomeClass()
Python filenames use snake_case (e.g., some_file.py)
Class names use PascalCase
Function and method names use snake_case
Local variables use snake_case; prefix k for names starting with a number (e.g., k_99th_percentile)
Global variables are UPPER_SNAKE_CASE prefixed with G (e.g., G_MY_GLOBAL)
Constants are UPPER_SNAKE_CASE
Avoid shadowing variables from an outer scope
Initialize all externally visible members of a class in __init__
For interfaces used outside a file, prefer docstrings over comments; comments for internal code or local interfaces
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Attributes and variables can be documented inline with trailing docstrings under the class or module
Avoid using reflection when easily avoidable; prefer explicit parameters/constructs over dict(**locals())
In try/except, catch the narrowest exception types possible
For duck-typing try/except, keep try body minimal and place logic in else after attribute existence checks

Files:

  • tests/unittest/_torch/auto_deploy/integration/test_llama4_vlm_export.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_build_small_single.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py
  • tensorrt_llm/executor/worker.py
  • examples/auto_deploy/build_and_run_ad.py
  • tensorrt_llm/_torch/auto_deploy/models/factory.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/shim/test_llm_config.py
  • tensorrt_llm/_torch/auto_deploy/transformations/library/kvcache.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/export_to_gm.py
  • tensorrt_llm/_torch/auto_deploy/llm.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py
  • tensorrt_llm/_torch/auto_deploy/shim/demollm.py
  • tensorrt_llm/_torch/auto_deploy/models/hf.py
  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
  • tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
**/*.{h,hpp,hxx,hh,c,cc,cpp,cxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend NVIDIA Apache-2.0 copyright header with current year to all source files

Files:

  • tests/unittest/_torch/auto_deploy/integration/test_llama4_vlm_export.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_build_small_single.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py
  • tensorrt_llm/executor/worker.py
  • examples/auto_deploy/build_and_run_ad.py
  • tensorrt_llm/_torch/auto_deploy/models/factory.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/shim/test_llm_config.py
  • tensorrt_llm/_torch/auto_deploy/transformations/library/kvcache.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/export_to_gm.py
  • tensorrt_llm/_torch/auto_deploy/llm.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py
  • tensorrt_llm/_torch/auto_deploy/shim/demollm.py
  • tensorrt_llm/_torch/auto_deploy/models/hf.py
  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
  • tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
🧠 Learnings (1)
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.

Applied to files:

  • tensorrt_llm/_torch/auto_deploy/transformations/library/kvcache.py
🧬 Code graph analysis (11)
examples/auto_deploy/build_and_run_ad.py (1)
tensorrt_llm/llmapi/llm_args.py (1)
  • Field (67-94)
tensorrt_llm/_torch/auto_deploy/models/factory.py (2)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (1)
  • CacheConfig (26-29)
tensorrt_llm/_torch/auto_deploy/models/hf.py (3)
  • init_processor (351-355)
  • get_example_inputs (375-418)
  • get_extra_inputs (420-439)
tensorrt_llm/_torch/auto_deploy/transformations/library/kvcache.py (2)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (5)
  • named_standard_args (188-190)
  • get_prepare_metadata_op (716-738)
  • args (193-195)
  • extra_args_for_prepare_metadata (198-200)
  • set_max_num_tokens_sample (401-413)
tensorrt_llm/_torch/auto_deploy/shim/interface.py (1)
  • args (24-26)
tensorrt_llm/_torch/auto_deploy/transform/library/export_to_gm.py (4)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (3)
  • set_example_sequence (380-399)
  • args (193-195)
  • dynamic_shapes (235-237)
tensorrt_llm/_torch/auto_deploy/models/hf.py (1)
  • get_example_inputs (375-418)
tensorrt_llm/_torch/auto_deploy/models/factory.py (2)
  • get_example_inputs (218-228)
  • model (43-45)
tensorrt_llm/_torch/auto_deploy/export/export.py (1)
  • torch_export_to_gm (198-284)
tensorrt_llm/_torch/auto_deploy/llm.py (6)
tensorrt_llm/inputs/registry.py (1)
  • DefaultInputProcessor (44-111)
tensorrt_llm/llmapi/llm.py (4)
  • tokenizer (681-685)
  • tokenizer (688-689)
  • prompt (79-80)
  • LLM (1079-1095)
tensorrt_llm/_torch/auto_deploy/models/factory.py (5)
  • tokenizer (48-50)
  • ModelFactory (15-243)
  • init_tokenizer (107-114)
  • init_processor (116-123)
  • prefetch_checkpoint (125-136)
tensorrt_llm/llmapi/tokenizer.py (2)
  • TokenizerBase (24-25)
  • TransformersTokenizer (28-267)
tensorrt_llm/sampling_params.py (1)
  • SamplingParams (125-486)
tensorrt_llm/_torch/auto_deploy/llm_args.py (2)
  • LlmArgs (226-330)
  • create_factory (204-215)
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py (2)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (2)
  • nest_sequences (506-592)
  • reset (366-378)
tests/unittest/_torch/auto_deploy/_utils_test/_torch_test_utils.py (1)
  • all_close (6-21)
tensorrt_llm/_torch/auto_deploy/shim/demollm.py (3)
tensorrt_llm/executor/request.py (1)
  • GenerationRequest (84-129)
tensorrt_llm/inputs/multimodal.py (1)
  • MultimodalParams (152-410)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (6)
  • reset (366-378)
  • nest_sequences (506-592)
  • page_assignments (272-279)
  • input_positions (248-251)
  • sequence_lengths (244-245)
  • args (193-195)
tensorrt_llm/_torch/auto_deploy/models/hf.py (2)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (1)
  • CacheConfig (26-29)
tensorrt_llm/_torch/auto_deploy/models/factory.py (6)
  • tokenizer (48-50)
  • init_tokenizer (107-114)
  • init_processor (116-123)
  • model (43-45)
  • get_example_inputs (218-228)
  • get_extra_inputs (230-243)
tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (6)
tensorrt_llm/_torch/auto_deploy/llm_args.py (1)
  • create_factory (204-215)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (3)
  • device (163-164)
  • page_assignments (272-279)
  • nest_sequences (506-592)
tensorrt_llm/_torch/auto_deploy/models/hf.py (1)
  • get_extra_inputs (420-439)
tensorrt_llm/_torch/auto_deploy/models/factory.py (1)
  • get_extra_inputs (230-243)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
  • get_cache_indices (572-583)
tensorrt_llm/_torch/pyexecutor/llm_request.py (1)
  • is_dummy (379-380)
tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py (2)
tensorrt_llm/_torch/models/modeling_llama.py (1)
  • Llama4ForConditionalGeneration (1179-1240)
tensorrt_llm/_torch/auto_deploy/export/interface.py (2)
  • BaseExportPatch (38-130)
  • ExportPatchRegistry (166-211)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (2)
tensorrt_llm/_torch/auto_deploy/shim/interface.py (3)
  • args (24-26)
  • dynamic_shapes (29-31)
  • to (33-37)
tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py (1)
  • set_example_sequence (45-53)
🔇 Additional comments (37)
tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py (1)

66-67: Verify torch’s scaled_dot_product_attention supports enable_gqa and update docs

The new enable_gqa parameter is a great parity enhancement with F.scaled_dot_product_attention, but we need to:

  • Update the function’s docstring to document the enable_gqa argument.
  • Confirm that the version of PyTorch used in CI/runtime actually exposes this keyword (to avoid runtime TypeError).

Our sandbox couldn’t import torch, so please manually verify in your local or CI environment:

python - << 'PY'
import inspect, torch
from torch.nn import functional as F
sig = inspect.signature(F.scaled_dot_product_attention)
print("Has 'enable_gqa':", 'enable_gqa' in sig.parameters)
PY

Once confirmed, append to the docstring under Args::

    enable_gqa (bool): If True, request grouped-query attention handling from the
        PyTorch scaled_dot_product_attention implementation. Forwarded when supported
        by the installed torch version; otherwise ignored or raises if unsupported.
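As a hedged sketch (not the custom op in this PR), the keyword can be forwarded behind a one-time signature check so older torch builds keep working; the helper name sdpa_maybe_gqa and the guard constant are illustrative:

import inspect

import torch.nn.functional as F

# Detect once whether the installed torch exposes the keyword.
SDPA_HAS_ENABLE_GQA = "enable_gqa" in inspect.signature(F.scaled_dot_product_attention).parameters


def sdpa_maybe_gqa(q, k, v, attn_mask=None, dropout_p=0.0, is_causal=False, enable_gqa=False):
    """Call F.scaled_dot_product_attention, forwarding enable_gqa only when supported."""
    kwargs = dict(attn_mask=attn_mask, dropout_p=dropout_p, is_causal=is_causal)
    if enable_gqa and SDPA_HAS_ENABLE_GQA:
        kwargs["enable_gqa"] = True
    return F.scaled_dot_product_attention(q, k, v, **kwargs)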
tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_build_small_single.py (1)

68-71: LGTM on compile_backend switch for Llama-4-Scout.

The move to compile_backend="torch-simple" with attn_backend="flashinfer" is consistent with other entries and the new "_autodeploy" PyTorch-like path. No issues spotted.

examples/auto_deploy/.gitignore (1)

5-6: No tracked YAML files in examples/auto_deploy; ignore rule is safe

Ran the suggested check and found no tracked *.yaml files under examples/auto_deploy. The current .gitignore entry won’t hide any existing sample configs, so it can remain as-is.

tensorrt_llm/executor/worker.py (2)

125-139: Backend selection path looks good. Ensure parity for LoRA and scheduling knobs.

The new "_autodeploy" branch uses create_autodeploy_executor; confirm that required knobs (e.g., kv_cache_transceiver, scheduling params, multimodal handling) are wired consistently with the PyTorch path. No code change requested here—just a heads-up to verify feature parity as you expand autodeploy capabilities.


433-441: Overlap + disaggregated guard reads well; re-check constraints for autodeploy.

With overlap enabled and disaggregated serving, context-only requests are disallowed. Confirm autodeploy engine implements kv_cache_transceiver consistently so this condition holds. No code change requested.

tensorrt_llm/_torch/auto_deploy/transformations/library/kvcache.py (1)

79-84: Good API alignment.

Using cm.info.extra_args_for_prepare_metadata keeps the call signature decoupled from concrete extras (e.g., page_size). This is the right abstraction.

examples/auto_deploy/build_and_run_ad.py (2)

29-31: LGTM: PromptInput union captures the intended formats.

The alias cleanly models str/TRT-LLM dict/HF messages. This aligns with the multimodal and chat-template flow.


41-62: Nice: richer prompt examples and clear guidance.

The expanded defaults exercise all supported input shapes (text, TRT-LLM dict, HF messages) and will help users discover capabilities quickly.
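For reference, a minimal sketch of the three accepted shapes (the values and the image URL are made up; the message layout follows the HF multimodal chat-template format):

example_prompts = [
    # 1) plain string
    "Describe the scene in one sentence.",
    # 2) TRT-LLM style text-only dict
    {"prompt": "List three prime numbers."},
    # 3) HF chat-template message list (multimodal)
    [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": "https://example.com/sample.png"},
                {"type": "text", "text": "What is shown in this image?"},
            ],
        }
    ],
]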

tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py (3)

190-193: Good: using SequenceInfo.nest_sequences with input_pos keeps positions consistent.

Passing input_pos to let SequenceInfo derive position_ids is aligned with the new interface. No change needed.


202-202: Autoregressive baseline reset is correct before no-cache run.

Resetting before the all-at-once call is fine and mirrors a cold start scenario.


208-211: Per-token KV-cache loop is correct and exercises the transformation path.

Chunking one token at a time with the corresponding input_pos is the right way to validate cache behavior.

tests/unittest/_torch/auto_deploy/integration/test_llama4_vlm_export.py (1)

193-196: Message update LGTM.

Changing the prompt text does not affect semantics; assertions still validate text-only vs. text+image divergence and export parity.

tensorrt_llm/_torch/auto_deploy/transform/library/export_to_gm.py (2)

66-67: Good change: source example inputs from the factory.

Passing factory.get_example_inputs() through to cm.info.set_example_sequence() aligns export inputs with model factories and enables multimodal scaffolding.


66-67: Cross-file: fix invalid PIL.Image.new usage in HF factory get_example_inputs.

When this transform calls cm.info.set_example_sequence(**factory.get_example_inputs()), the HF factory’s second sample creates an image with Image.new("RGB", (0, 255, 0)) (missing size, color passed as size), which will fail. Please correct that in tensorrt_llm/_torch/auto_deploy/models/hf.py.

Suggested fix in models/hf.py (context lines 374-417 in your snippet):

-            _prep_seq(
-                "What are the main differences between these two images?",
-                Image.new("RGB", (16, 16), color=(255, 0, 0)),
-                Image.new("RGB", (0, 255, 0)),
-            ),
+            _prep_seq(
+                "What are the main differences between these two images?",
+                Image.new("RGB", (16, 16), color=(255, 0, 0)),
+                Image.new("RGB", (16, 16), color=(0, 255, 0)),
+            ),
tensorrt_llm/_torch/auto_deploy/models/factory.py (2)

11-11: Import of DynamicShapeCallback is correct and future-proofs extra input wiring.

Good alignment with SequenceInfo’s dynamic-shapes path.


230-244: Verified new hook get_extra_inputs() — API shape and naming align with SequenceInfo._extra_args

No further action needed; the extra-input keys (e.g., “pixel_values” in the HF factory) correctly match the keys used by SequenceInfo, so set_example_sequence() will not trigger its missing-keys assertion.

tensorrt_llm/_torch/auto_deploy/shim/demollm.py (4)

39-43: Verify multimodal data propagation

The multimodal parameters are being directly assigned to the request object. Since the GenerationRequest class supports multimodal parameters as shown in the external code context, this implementation is correct.


53-83: Page assignment strategy looks correct

The new signature returning List[List[int]] instead of mutating internal state is a cleaner API design. The implementation correctly:

  1. Extracts current page assignments from sequence info
  2. Calculates free pages
  3. Allocates additional pages based on total length requirements

The logic correctly handles the case where extra_tokens > 0 but < page_size by adding one extra page.
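To make the arithmetic concrete, a standalone sketch (the function name and the numbers are illustrative; it simply mirrors the heuristic described above):

def extra_pages_needed(total_len: int, num_assigned_pages: int, page_size: int) -> int:
    """Pages to add so total_len tokens fit; any remainder rounds up to one extra page."""
    extra_tokens = total_len - num_assigned_pages * page_size
    return (extra_tokens // page_size) + (extra_tokens > 0)


# page_size=64: growing from 1 assigned page to 130 tokens overflows by 66 tokens
assert extra_pages_needed(130, 1, 64) == 2  # one full page plus a remainder page
assert extra_pages_needed(64, 1, 64) == 0   # still fits exactly in the assigned page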


133-142: LGTM! Clean state management for decode steps

The implementation correctly:

  1. Computes next positions by adding current positions and sequence lengths
  2. Calculates total lengths for the next step
  3. Updates sequence info with new tokens and page assignments

The flow maintains proper state transitions between generation steps.


280-281: LGTM! Explicit multimodal params passing

The implementation correctly passes multimodal_params from the request to avoid losing the reference when the request's multimodal_params are cleared after submission (as noted in lines 336-338).

tensorrt_llm/_torch/auto_deploy/llm.py (2)

15-73: Well-structured ADInputProcessor implementation

The ADInputProcessor class provides a clean abstraction for handling both text-only and multimodal inputs:

  1. Correctly inherits from DefaultInputProcessor
  2. Properly handles HF's chat template system for multimodal inputs
  3. Falls back to standard tokenization for text-only inputs
  4. Appropriately validates and returns multimodal data when present

The implementation correctly handles truncation parameters and special tokens configuration.
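A rough sketch of that dispatch, assuming a transformers processor/tokenizer pair (the helper name and keyword arguments below are illustrative, not the exact calls made by ADInputProcessor):

from typing import Any, Dict, List, Tuple, Union


def route_prompt(processor, tokenizer, prompt: Union[str, List[Dict[str, Any]]]) -> Tuple[List[int], Dict[str, Any]]:
    """Send HF message lists through the chat template; tokenize plain text directly."""
    if isinstance(prompt, list):  # list of {"role": ..., "content": [...]} messages
        encoded = processor.apply_chat_template(
            prompt,
            add_generation_prompt=True,
            tokenize=True,
            return_dict=True,
            return_tensors="pt",
        )
        token_ids = encoded["input_ids"][0].tolist()
        extra = {k: v for k, v in encoded.items() if k not in ("input_ids", "attention_mask")}
        return token_ids, extra  # extra carries e.g. pixel_values for multimodal requests
    return tokenizer.encode(prompt), {}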


81-85: Good lazy initialization pattern for factory

The lazy initialization of the factory is well-implemented, ensuring it's only created when needed.
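The pattern in generic form (class and attribute names are invented; it only assumes an args object exposing create_factory()):

class LazyFactoryHolder:
    """Create the model factory on first access and cache it afterwards."""

    def __init__(self, args):
        self.args = args
        self._factory = None  # deferred until first use

    @property
    def factory(self):
        if self._factory is None:
            self._factory = self.args.create_factory()
        return self._factory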

tensorrt_llm/_torch/auto_deploy/models/hf.py (4)

344-356: Clean processor initialization pattern

The init_processor method correctly initializes the AutoProcessor and returns it, while init_tokenizer properly extracts the tokenizer from the processor. This is a clean separation of concerns for multimodal model support.


358-374: LGTM! Multimodal forward signature

The _simple_forward method correctly extends the base implementation to include pixel_values parameter, enabling multimodal input processing during model execution.


375-419: Well-structured example inputs generation

The get_example_inputs method correctly:

  1. Creates synthetic image data for testing
  2. Uses the processor's chat template system
  3. Generates properly formatted batch inputs
  4. Returns both input_ids and pixel_values

The use of small 16x16 images for examples is appropriate to minimize memory usage during export.


432-434: Could you clarify which export target you have in mind—ONNX/TensorRT dynamic-shape profiles for inference, or torch.jit/torch.export tracing? Knowing the runtime will let us recommend appropriate min/opt/max defaults for batch size, height, and width.

tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (4)

98-103: LGTM! Clean separation of concerns

The SequenceInfo initialization no longer takes a device parameter, which is a cleaner design. The device is now handled separately and passed to the ADEngine constructor.


105-117: Good integration of extra inputs from factory

The code correctly retrieves extra inputs from the factory and adds them to the sequence info with their dynamic shape callbacks. This enables proper handling of multimodal inputs with dynamic shapes.


182-209: Clean multimodal data collection pattern

The refactored code properly:

  1. Converts device tensors to CPU list for indexing
  2. Collects page assignments per request
  3. Aggregates multimodal data into extra_args dictionary
  4. Maintains backward compatibility for non-multimodal requests

231-236: Excellent API consolidation

The single call to nest_sequences with all parameters (input_ids, input_pos, page_assignments, and extra_args) is much cleaner than the previous approach of multiple separate updates. This consolidates all sequence info updates in one atomic operation.

tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py (4)

13-19: Good documentation of modifications

The comment clearly documents the three key modifications made to the original HuggingFace implementation, which helps with maintenance and understanding the differences.


60-65: Good input validation

The XOR check for input_ids vs inputs_embeds and the mutual exclusivity check for pixel_values and inputs_embeds ensure proper input configuration.


105-113: Clever use of torch.cond for export compatibility

The use of torch.cond with a runtime check for non-zero pixel values is an elegant solution to enable both text-only and multimodal paths during export. This allows the exported model to handle both input types dynamically.
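A toy sketch of that pattern (not the Llama-4 patch itself; the vision branch is a stand-in for the real feature projection):

import torch


def merge_vision_if_present(inputs_embeds: torch.Tensor, pixel_values: torch.Tensor) -> torch.Tensor:
    """Pick the vision or text-only branch at runtime while staying export-traceable."""

    def _with_vision(inputs_embeds, pixel_values):
        # stand-in for projecting image features and folding them into the embeddings
        return inputs_embeds + pixel_values.mean()

    def _no_vision(inputs_embeds, pixel_values):
        # clone so the branch output does not alias its input under tracing
        return inputs_embeds.clone()

    has_image = (pixel_values != 0).any()
    return torch.cond(has_image, _with_vision, _no_vision, (inputs_embeds, pixel_values))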


162-186: Well-structured patch implementation

The patch class follows the established pattern correctly:

  1. Properly decorated with ExportPatchRegistry
  2. Stores original method before patching
  3. Restores original method on revert
  4. Clear documentation of purpose
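The same store/patch/revert discipline in generic form (this is not the BaseExportPatch API; the class and method names are invented for illustration):

class MethodPatch:
    """Swap a method on a class, keeping the original so the patch can be reverted."""

    def __init__(self, owner: type, name: str, replacement):
        self.owner = owner
        self.name = name
        self.replacement = replacement
        self._original = None

    def apply(self):
        self._original = getattr(self.owner, self.name)
        setattr(self.owner, self.name, self.replacement)

    def revert(self):
        if self._original is not None:
            setattr(self.owner, self.name, self._original)
            self._original = None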
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (3)

162-181: Keep arg ordering of dynamic_shapes in lockstep with args (export stability).

named_args builds in the order: uncached -> extra -> cached. named_dynamic_shapes merges dicts in the same order, which should preserve ordering. Given export sensitivity, add a unit test to guard this contract (args[i] aligns with dynamic_shapes[i]).

I can add a small unit test that instantiates SequenceInfo, registers two extra args, toggles cached mode, and asserts zipped(name, shape) ordering matches across named_args and named_dynamic_shapes.

Also applies to: 202-238


401-414: Example sequence with max_num_tokens: extra-arg handling is TODO.

set_max_num_tokens_sample skips extra-args alignment. If extra-args are required by the graph, this will produce mismatches in export/tests.

I can wire this to accept an extra_args: Optional[Dict[str,Tensor]] = None and populate _extra_args with placeholders or provided values. Want a patch?


150-156: Meta dtypes: align across fields and kernels.

Seq metadata (seq_len, input_pos, cache_loc, pages_per_seq) are torch.int (int32). Ensure downstream custom ops and TorchScript/exporters expect int32 consistently. If any expects int64, we’ll need cast points.

I can scan the kernels and prepare_metadata ops and add safe casts if needed.

@lucaslie
Member Author

/bot reuse-pipeline --help

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (1)

1-10: Prepend NVIDIA 2025 license header to attention_interface.py

We’ve confirmed that tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py does not include any NVIDIA copyright header, whereas other Python modules do:

  • No “NVIDIA” header found in attention_interface.py
  • Existing headers in the repo:
    • modeling_multimodal_encoder.py, modeling_multimodal_utils.py, modeling_vila.py use a 2024 header
    • float8_python_api.py uses:
      # Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.

Please prepend the canonical NVIDIA header (updated to 2025) at the very top of attention_interface.py. For example, copying the style from float8_python_api.py:

+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.

If the project has standardized on the more detailed Apache 2.0–style header, instead use:

+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#     http://www.apache.org/licenses/LICENSE-2.0
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.

Ensure the header exactly matches the repo-standard text and update the year to 2025.

♻️ Duplicate comments (3)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (3)

441-458: Bug: scalar copy into slice when input_pos is int; also tail not cleared.

Passing an int builds a 0-D tensor and copies it into a 1-D slice -> shape mismatch. Copy the expanded host list and zero the tail to avoid stale values. This was raised earlier; it is still present.

-        # non-blocking copy
-        self._non_blocking_copy_to_device(
-            self.input_pos[: self.num_sequences], input_pos, torch.int
-        )
+        # non-blocking copy from the expanded host list
+        self._non_blocking_copy_to_device(
+            self.input_pos[: self.num_sequences], self._input_positions, torch.int
+        )
+        # clear any tail beyond active sequences
+        if self.input_pos.numel() > self.num_sequences:
+            self.input_pos[self.num_sequences :].zero_()

542-547: Uncached SDPA needs uniform seq_len; otherwise view() in _shape_for_forward will fail.

When self._is_cached_attn is False, you reshape to [B, S]; ragged inputs make B*S != total_num_tokens. Enforce uniform lengths at nest time (or pad/truncate).

         seq_lens = [len(ids) for ids in input_ids]
-        self._sequence_lengths = seq_lens
+        # Uncached attention requires rectangular [B, S, ...]
+        if not self._is_cached_attn:
+            assert len(set(seq_lens)) == 1, (
+                f"Uncached attention requires uniform seq_len per batch; got {seq_lens}"
+            )
+        self._sequence_lengths = seq_lens

582-585: unnest_sequences splits along wrong dim for uncached inputs.

Current logic assumes cached-mode shapes. Make it mode-aware to avoid mis-splitting in uncached prefill.

-    def unnest_sequences(self, t_nested: torch.Tensor) -> List[torch.Tensor]:
-        t_squeezed = t_nested.squeeze(1) if self.is_generate else t_nested.squeeze(0)
-        return list(torch.split(t_squeezed, self.sequence_lengths))
+    def unnest_sequences(self, t_nested: torch.Tensor) -> List[torch.Tensor]:
+        if self._is_cached_attn:
+            # cached mode: [B,1,...] for decode or [1,S_total,...] for prefill
+            t_squeezed = t_nested.squeeze(1) if self.is_generate else t_nested.squeeze(0)
+            return list(torch.split(t_squeezed, self.sequence_lengths, dim=0))
+        # uncached SDPA: [B, S, ...] -> split along S
+        return list(torch.split(t_nested, self.sequence_lengths, dim=1))
🧹 Nitpick comments (6)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (6)

6-7: Nit: small grammar fix in module docstring.

“…pass it on the the various…” → “…pass it on to the various…”.

- is also responsible for functionalizing information about the sequence and pass it on the the
+ is also responsible for functionalizing information about the sequence and pass it on to the

150-160: Consider standardizing input_ids dtype to torch.long.

Token/ID tensors in PyTorch models are typically torch.long; using torch.int can cause dtype mismatches in downstream ops and custom kernels.

-        self._uncached_args_flat: Dict[str, torch.Tensor] = {
-            "input_ids": torch.ones(self.max_num_tokens, dtype=torch.int),
-            "position_ids": torch.zeros(self.max_num_tokens, dtype=torch.long),
-        }
-        self.input_ids = torch.ones(self.max_batch_size, 1, dtype=torch.int)
+        self._uncached_args_flat: Dict[str, torch.Tensor] = {
+            "input_ids": torch.ones(self.max_num_tokens, dtype=torch.long),
+            "position_ids": torch.zeros(self.max_num_tokens, dtype=torch.long),
+        }
+        self.input_ids = torch.ones(self.max_batch_size, 1, dtype=torch.long)
         self.position_ids = torch.zeros(self.max_batch_size, 1, dtype=torch.long)

If some backends require int32, we can convert at the boundary or accept a dtype parameter. Please confirm expected dtype across the stack.


226-235: Avoid shared dict aliasing in dynamic shapes map.

Each entry currently references the same bs_seq_len_shape dict. Copy per key to prevent accidental shared-state mutations.

-            self._uncached_dynamic_shapes = {k: bs_seq_len_shape for k in self._uncached_arg_names}
+            self._uncached_dynamic_shapes = {
+                k: bs_seq_len_shape.copy() for k in self._uncached_arg_names
+            }

415-423: Potential inconsistency: pages_per_seq filled without matching cache_loc.

set_max_num_tokens_sample sets pages_per_seq but doesn’t set cache_loc/page_assignments. Downstream consumers may read inconsistent page metadata.

Options:

  • Remove the pre-fill and let callers pass page_assignments when needed, or
  • Generate trivial page_assignments consistent with the fill.

Minimal change:

-        self.pages_per_seq.fill_(seq_len // self.page_size)
-        self.nest_sequences(input_ids)
+        # Let callers provide page_assignments if needed; avoid partial state here.
+        self.pages_per_seq.zero_()
+        self.nest_sequences(input_ids)

399-414: Example sequence arg check is inverted for extras.

The intent seems to be: require that all registered extra args are present in kwargs. The current assertion uses self._extra_args.keys(), which may include runtime values and not just the declared names. Consider checking against self._extra_none_inputs (declared extras) to avoid false positives.

-        assert self._extra_args.keys() <= kwargs.keys(), (
-            f"Missing extra args: {self._extra_args.keys() - kwargs.keys()}"
-        )
+        declared = set(self._extra_none_inputs.keys())
+        provided = set(kwargs.keys())
+        assert declared <= provided, f"Missing extra args: {declared - provided}"

482-503: Guard _shape_for_forward against ragged inputs in uncached mode (secondary safety).

Even with the nest-time assertion, adding a defensive check here yields clearer errors and prevents silent mis-shapes if callers bypass nest_sequences.

-        if not self._is_cached_attn or self.is_generate:
-            bs = len(self.sequence_lengths)
-            sl = self.sequence_lengths[0]
+        if not self._is_cached_attn or self.is_generate:
+            bs = len(self.sequence_lengths)
+            if bs > 0:
+                sl = self.sequence_lengths[0]
+                # sanity: uncached path expects uniform S (generate-only implies S==1)
+                if not self.is_generate:
+                    assert all(s == sl for s in self.sequence_lengths), (
+                        f"Uncached attention expects uniform seq_len; got {self.sequence_lengths}"
+                    )
+            else:
+                sl = 0
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between a30236a and fd5d5f3.

📒 Files selected for processing (1)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (7 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in __init__
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures

Files:

  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
🧬 Code graph analysis (1)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (1)
tensorrt_llm/_torch/auto_deploy/shim/interface.py (3)
  • args (24-26)
  • dynamic_shapes (29-31)
  • to (33-37)

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (4)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (3)

189-201: Ragged inputs break view in uncached mode (uses seq_len[0]).

As-is, _shape_for_forward will mis-reshape when uncached SDPA receives non-uniform sequence lengths. Enforce uniform seq_len when self._is_cached_attn is False (see suggested diff in nest_sequences comment).


543-546: Assert uniform seq_len for uncached SDPA to guarantee valid reshape.

This guarantees _shape_for_forward can safely view(bs, sl, ...).

-        self._store_arg("seq_len", [len(ids) for ids in input_ids], reset=True)
+        seq_lens = [len(ids) for ids in input_ids]
+        if not self._is_cached_attn:
+            assert len(set(seq_lens)) == 1, f"Uncached attention requires uniform seq_len; got {seq_lens}"
+        self._store_arg("seq_len", seq_lens, reset=True)

616-619: Split along the correct dimension for cached vs uncached modes.

Current split defaults to dim=0 and will be wrong for uncached SDPA ([B, S, ...]).

-    def unnest_sequences(self, t_nested: torch.Tensor) -> List[torch.Tensor]:
-        t_squeezed = t_nested.squeeze(1) if self.is_generate else t_nested.squeeze(0)
-        return list(torch.split(t_squeezed, self.seq_len))
+    def unnest_sequences(self, t_nested: torch.Tensor) -> List[torch.Tensor]:
+        if self._is_cached_attn:
+            # cached mode: decode [B,1,...] or prefill [1,S_total,...] => split along dim 0
+            t_squeezed = t_nested.squeeze(1) if self.is_generate else t_nested.squeeze(0)
+            return list(torch.split(t_squeezed, self.seq_len, dim=0))
+        # uncached SDPA: [B, S, ...] => split along sequence dimension
+        return list(torch.split(t_nested, self.seq_len, dim=1))
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py (1)

199-206: Test 3 resets cache before continuation; invalidates expected equality.

Reset wipes context; continuing with only tail tokens can’t match full-context ground truth.

Minimal fix (don’t reset before continuation):

-    # Continue inference from previous context
-    cm.info.reset()
-    for i_p in range(x.shape[1] - num_reset_steps, x.shape[1]):
+    # Continue inference from previous context (do not reset here)
+    for i_p in range(x.shape[1] - num_reset_steps, x.shape[1]):
         y_with_cache[:, i_p : i_p + 1] = _call_and_unnest(x[:, i_p : i_p + 1], i_p)

If you want a “resume after reset” scenario, re-prime the cache with the true prefix and compare only the tail; I can provide that patch.

🧹 Nitpick comments (7)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (3)

1-10: Add NVIDIA SPDX header.

Source files must start with the NVIDIA copyright header.

+# SPDX-FileCopyrightText: (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
 """
 Attention Interface to handle various attention operators and cache operations.

243-259: Avoid dict aliasing in dynamic shapes.

Using the same bs_seq_len_shape dict for multiple keys risks accidental cross-mutation later.

-            self._dynamic_shapes = {k: bs_seq_len_shape for k in self._uncached_arg_names}
+            self._dynamic_shapes = {k: dict(bs_seq_len_shape) for k in self._uncached_arg_names}

495-520: Broaden tnsr_like typing/doc; supports list-of-lists for embeddings.

Current hint List[int | float] is inaccurate when passing embeddings (list[list[float]]).

Proposed signature/doc update (outside selected lines):

def _store_arg(self, name: str, tnsr_like: Sequence[Union[int, float, Sequence[float]]], reset: bool = False) -> None:
    """tnsr_like may be 1D (ids) or 2D (embeddings)."""
tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py (3)

1-1: Add NVIDIA SPDX header.

+# SPDX-FileCopyrightText: (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
 import copy

55-66: Host mirror type for input_ids is inconsistent (tensor vs list).

You set _args_host["input_ids"] to a tensor; elsewhere host mirrors are lists. Low risk today but diverges from convention.

-        self._args_host["input_ids"] = self._args_device["input_ids"].cpu()
+        self._args_host["input_ids"] = self._args_device["input_ids"].detach().cpu().tolist()

Alternatively, leave unset if not consumed.


67-80: Optional: deterministic test setup.

_add_hidden_dim uses torch.rand, making baseline state nondeterministic. Consider seeding or using zeros/normal for reproducible tests.

tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py (1)

1-1: Add NVIDIA SPDX header.

+# SPDX-FileCopyrightText: (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
 from typing import Optional
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between fd5d5f3 and 5be954c.

📒 Files selected for processing (6)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (7 hunks)
  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (5 hunks)
  • tensorrt_llm/_torch/auto_deploy/shim/demollm.py (8 hunks)
  • tensorrt_llm/_torch/auto_deploy/transformations/library/kvcache.py (2 hunks)
  • tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py (2 hunks)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py (5 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • tensorrt_llm/_torch/auto_deploy/transformations/library/kvcache.py
  • tensorrt_llm/_torch/auto_deploy/shim/demollm.py
  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in __init__
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures

Files:

  • tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py
🧬 Code graph analysis (3)
tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py (2)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (4)
  • device (177-178)
  • args (228-230)
  • reset (479-485)
  • nest_sequences (521-588)
tensorrt_llm/_torch/auto_deploy/shim/interface.py (1)
  • args (24-26)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (2)
tensorrt_llm/_torch/auto_deploy/shim/interface.py (3)
  • args (24-26)
  • dynamic_shapes (29-31)
  • to (33-37)
tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py (1)
  • nest_sequences (67-79)
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py (2)
tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py (2)
  • SequenceEmbeddingInfo (41-79)
  • nest_sequences (67-79)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (3)
  • input_pos (271-272)
  • nest_sequences (521-588)
  • reset (479-485)
🔇 Additional comments (1)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (1)

591-615: rescatter_input_ids assumes token-ids; undefined for embeddings.

masked_scatter_(input_ids_device == scatter_ref, ...) won’t work with float embeddings.

 def rescatter_input_ids(self, ungathered_input_ids: torch.Tensor, gather_idx: List[int], scatter_ref: int):
@@
-        input_ids_device = self._args_device["input_ids"]
-        input_ids_device.masked_scatter_(input_ids_device == scatter_ref, packed_input_ids)
+        input_ids_device = self._args_device["input_ids"]
+        # Guard: only supported for 1D token-id layout
+        assert input_ids_device.ndim == 1, "rescatter_input_ids supports 1D token-ids only"
+        input_ids_device.masked_scatter_(input_ids_device == scatter_ref, packed_input_ids)

If embedding support is required, we can add a 2D branch using a boolean mask over a dedicated sentinel channel or explicit indices. Want a follow-up patch?

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 9

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
tensorrt_llm/_torch/auto_deploy/shim/demollm.py (1)

53-83: Add bounds checking for page allocation

The _assign_pages method pops from free_pages without checking if enough pages are available.

     def _assign_pages(self, total_lens: List[int]) -> List[List[int]]:
         """A simple heuristic to assign pages based on current sequence info.
         
         ...docstring...
         """
         si = self.cache_seq_interface.info
         page_assignments = si.page_assignments

         free_pages = set(range(si.num_pages)) - {i for pages in page_assignments for i in pages}
         updated_assignments = []
         for t_l, pages in zip(total_lens, page_assignments):
             extra_tokens = t_l - len(pages) * si.page_size
             num_extra_pages = (extra_tokens // si.page_size) + (extra_tokens > 0)
+            if num_extra_pages > len(free_pages):
+                raise RuntimeError(
+                    f"Insufficient free pages: need {num_extra_pages}, "
+                    f"but only {len(free_pages)} available"
+                )
             updated_assignments.append(pages + [free_pages.pop() for _ in range(num_extra_pages)])
         return updated_assignments
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (2)

398-413: Return an int from _get_sanitized_num_sequences.

Currently returns a 0-d tensor; callers expect an int.

-        if s > 1:
-            num_seq = torch.sum(seq_len > 0)
+        if s > 1:
+            num_seq = int(torch.sum(seq_len > 0).item())
         else:
-            num_seq = b
+            num_seq = int(b)
         return num_seq

1-1: Add NVIDIA copyright header to attention_interface.py

Per the TensorRT-LLM coding guidelines and existing source files, every Python source must begin with the standard NVIDIA copyright and Apache 2.0 license header. Please prepend the following exact header (updating the year to 2025) to the very top of tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py before the module docstring:

# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Ensure there is a blank line after the header before the existing triple-quoted docstring. Use the same template as seen in benchmarks/cpp/prepare_dataset.py.

♻️ Duplicate comments (2)
tensorrt_llm/_torch/auto_deploy/models/hf.py (1)

438-439: Verify the placeholder pixel_values tensor shape

The none_pixel_values tensor uses shape (0, 3, 336, 336) where 336x336 appears to be a specific resolution. This might not be appropriate for all vision models.

Consider making the placeholder shape configurable or deriving it from the model config:

-        none_pixel_values = torch.zeros(0, 3, 336, 336)
+        # Get default image size from processor/model config if available
+        processor = self.init_processor()
+        if processor and hasattr(processor, 'image_processor'):
+            size_dict = getattr(processor.image_processor, 'size', {})
+            default_size = size_dict.get('height', size_dict.get('shortest_edge', 336))
+        else:
+            default_size = 336
+        none_pixel_values = torch.zeros(0, 3, default_size, default_size)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (1)

616-619: Fix unnesting for uncached vs cached modes.

Current logic mis-splits in uncached prefill. Use _is_cached_attn to select split dimension.

-    def unnest_sequences(self, t_nested: torch.Tensor) -> List[torch.Tensor]:
-        t_squeezed = t_nested.squeeze(1) if self.is_generate else t_nested.squeeze(0)
-        return list(torch.split(t_squeezed, self.seq_len))
+    def unnest_sequences(self, t_nested: torch.Tensor) -> List[torch.Tensor]:
+        if self._is_cached_attn:
+            # cached: [B,1,...] for decode or [1,S_total,...] for prefill; split along sequence axis
+            t_squeezed = t_nested.squeeze(1) if self.is_generate else t_nested.squeeze(0)
+            return list(torch.split(t_squeezed, self.seq_len, dim=0))
+        # uncached SDPA: [B, S, ...] -> split along S
+        return list(torch.split(t_nested, self.seq_len, dim=1))
🧹 Nitpick comments (13)
tensorrt_llm/executor/worker.py (1)

84-85: Consider extracting the backend check logic for better maintainability

The backend check using a list membership is correct, but hardcoding the backend names inline makes it harder to maintain. Consider extracting this to a constant or helper method.

+# Define PyTorch-compatible backends at module level
+PYTORCH_COMPATIBLE_BACKENDS = {"pytorch", "_autodeploy"}
+
 class GenerationExecutorWorker(GenerationExecutor):

...

         self._executor_config = executor_config
-        self._is_pytorch_backend = getattr(self._executor_config, "backend",
-                                           None) in ["pytorch", "_autodeploy"]
+        self._is_pytorch_backend = getattr(self._executor_config, "backend",
+                                           None) in PYTORCH_COMPATIBLE_BACKENDS
examples/auto_deploy/build_and_run_ad.py (1)

29-31: Consider adding type hints to the PromptInput type alias

While the type alias is functional, consider documenting the expected structure of the Dict and List[Dict] types for better clarity.

 # simple string, TRT-LLM style text-only prompt or full-scale HF message template
-PromptInput = Union[str, Dict, List[Dict]]
+from typing import TypedDict
+
+class MessageDict(TypedDict):
+    role: str
+    content: Union[str, List[Dict[str, Any]]]
+
+# total=False keeps both keys optional without NotRequired (not available in typing on Python 3.8)
+class PromptDict(TypedDict, total=False):
+    prompt: str
+    messages: List[MessageDict]
+
+PromptInput = Union[str, PromptDict, List[MessageDict]]
tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py (2)

60-65: Improve error messages for mutually exclusive parameters

The XOR check on line 60 and the mutual exclusivity check on lines 62-65 could provide more informative error messages about what was actually passed.

-    if (input_ids is None) ^ (inputs_embeds is not None):
-        raise ValueError("You must specify exactly one of input_ids or inputs_embeds")
+    if (input_ids is None) ^ (inputs_embeds is not None):
+        has_input_ids = input_ids is not None
+        has_inputs_embeds = inputs_embeds is not None
+        raise ValueError(
+            f"You must specify exactly one of input_ids or inputs_embeds. "
+            f"Got: input_ids={'provided' if has_input_ids else 'None'}, "
+            f"inputs_embeds={'provided' if has_inputs_embeds else 'None'}"
+        )
     if pixel_values is not None and inputs_embeds is not None:
         raise ValueError(
-            "You cannot specify both pixel_values and inputs_embeds at the same time, and must specify either one"
+            "You cannot specify both pixel_values and inputs_embeds at the same time. "
+            "Please provide only one of them."
         )

102-103: Document the no-vision branch behavior

The _no_vision_branch function simply returns the inputs unchanged. Consider adding a docstring to clarify this is the expected behavior for text-only inputs.

     def _no_vision_branch(inputs_embeds, pixel_values, input_ids):
+        """Pass through embeddings unchanged for text-only inputs."""
         return inputs_embeds
tensorrt_llm/_torch/auto_deploy/models/hf.py (2)

344-349: Add error handling for processor initialization

The tokenizer initialization depends on the processor, but there's no error handling if the processor fails to initialize.

     def init_tokenizer(self) -> Optional[Any]:
         """Initialize the tokenizer—either a custom name or the model's default."""
         processor = self.init_processor()
         if processor is None:
             return None
+        if not hasattr(processor, 'tokenizer'):
+            raise AttributeError(
+                f"Processor {type(processor).__name__} does not have a tokenizer attribute"
+            )
         return processor.tokenizer

375-418: Extract image creation to a helper method

The example inputs method creates multiple test images inline. Consider extracting image creation to improve readability.

+    @staticmethod
+    def _create_test_image(color: Tuple[int, int, int], size: Tuple[int, int] = (16, 16)) -> Image.Image:
+        """Create a test image with the specified color and size."""
+        return Image.new("RGB", size, color=color)
+
     def get_example_inputs(self) -> Dict[str, torch.Tensor]:
         """Return a dictionary of example inputs for the model."""

         def _prep_seq(text, img1, img2):
             return [
                 {
                     "role": "user",
                     "content": [
                         {"type": "image", "image": img1},
                         {"type": "image", "image": img2},
                         {"type": "text", "text": text},
                     ],
                 }
             ]

         # Create a batch of conversations (batch_size = 2)
         batch_messages = [
             _prep_seq(
                 "Describe what you see in the two images and their differences.",
-                Image.new("RGB", (16, 16), color=(128, 128, 128)),
-                Image.new("RGB", (16, 16), color=(64, 64, 64)),
+                self._create_test_image((128, 128, 128)),
+                self._create_test_image((64, 64, 64)),
             ),
             _prep_seq(
                 "What are the main differences between these two images?",
-                Image.new("RGB", (16, 16), color=(255, 0, 0)),
-                Image.new("RGB", (16, 16), color=(0, 255, 0)),
+                self._create_test_image((255, 0, 0)),
+                self._create_test_image((0, 255, 0)),
             ),
         ]
tensorrt_llm/_torch/auto_deploy/shim/demollm.py (2)

39-44: Add parameter validation for multimodal_params

The multimodal_params is directly assigned to the request without validation. Consider adding type checking.

     def __call__(
         self, requests: GenerationRequest, multimodal_params: Optional[MultimodalParams]
     ) -> mp.Queue:
         """Generate tokens and put the results in a queue and return the queue."""
+        if multimodal_params is not None and not isinstance(multimodal_params, MultimodalParams):
+            raise TypeError(f"Expected MultimodalParams, got {type(multimodal_params).__name__}")
         requests.multimodal_params = multimodal_params

101-118: Consider validating multimodal data compatibility

When collecting multimodal data from multiple requests, there's no validation that the data is compatible for batching (e.g., same keys, compatible shapes).

         input_ids = []
         total_lens = []
         extra_args: Dict[str, List[torch.Tensor]] = defaultdict(list)
+        expected_keys = None

         for request in requests:
             total_lens.append(len(request.prompt_token_ids))
             input_ids.append(request.prompt_token_ids)
             if request.multimodal_params is not None:
+                current_keys = set(request.multimodal_params.multimodal_data.keys())
+                if expected_keys is None:
+                    expected_keys = current_keys
+                elif current_keys != expected_keys:
+                    raise ValueError(
+                        f"Inconsistent multimodal data keys across batch: "
+                        f"expected {expected_keys}, got {current_keys}"
+                    )
                 for k, v in request.multimodal_params.multimodal_data.items():
                     extra_args[k].append(v)
tensorrt_llm/_torch/auto_deploy/llm.py (1)

46-52: Optional: avoid double apply_chat_template.

You build the string prompt and then re-tokenize; consider skipping the first call unless another component relies on inputs["prompt"] being populated.

Also applies to: 53-63, 67-69

tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (2)

186-191: Sanity-check gather/scatter counts in generate path.

Add a lightweight assertion to catch scheduler/shape mismatches during development.

         if new_tokens is not None:
+            # Sanity check: number of tokens to scatter should match gather indices
+            expected = int(new_tokens.numel())
+            assert expected == len(flat_gather_idx), (
+                f"rescatter mismatch: new_tokens={expected}, gather_idx={len(flat_gather_idx)}"
+            )
             self.cache_seq_interface.info.rescatter_input_ids(
                 ungathered_input_ids=new_tokens.flatten(),  # ensure it's flattened
                 gather_idx=flat_gather_idx,
                 scatter_ref=dummy_token,
             )

Also applies to: 217-220, 238-244


161-163: Global RNG seeding inside engine constructor.

Hard-coding torch.manual_seed(1234) alters global state and can affect reproducibility across tests/runs. Prefer making this optional/config-driven.

-        torch.manual_seed(1234)
+        if getattr(self, "seed", None) is not None:
+            torch.manual_seed(int(self.seed))
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (2)

180-201: Guard _shape_for_forward against ragged uncached inputs.

As an extra safety (beyond the nest_sequences assertion), fail fast if uncached and ragged.

-        if not self._is_cached_attn or self.is_generate:
+        if not self._is_cached_attn or self.is_generate:
             bs = len(self.seq_len)
-            sl = self.seq_len[0]
+            sls = self.seq_len
+            assert len(set(sls)) == 1, f"Ragged seq_len not supported in uncached mode: {sls}"
+            sl = sls[0]

243-259: Dynamic shapes dict aliasing.

bs_seq_len_shape is shared across keys; fine if immutable, but easy to accidentally mutate. Consider cloning per key to avoid hidden coupling.

-            self._dynamic_shapes = {k: bs_seq_len_shape for k in self._uncached_arg_names}
+            self._dynamic_shapes = {k: dict(bs_seq_len_shape) for k in self._uncached_arg_names}
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between 5be954c and 57e20fe.

📒 Files selected for processing (20)
  • examples/auto_deploy/.gitignore (1 hunks)
  • examples/auto_deploy/build_and_run_ad.py (3 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (7 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py (2 hunks)
  • tensorrt_llm/_torch/auto_deploy/llm.py (4 hunks)
  • tensorrt_llm/_torch/auto_deploy/models/factory.py (3 hunks)
  • tensorrt_llm/_torch/auto_deploy/models/hf.py (4 hunks)
  • tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (5 hunks)
  • tensorrt_llm/_torch/auto_deploy/shim/demollm.py (8 hunks)
  • tensorrt_llm/_torch/auto_deploy/transform/library/export_to_gm.py (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/transformations/library/kvcache.py (2 hunks)
  • tensorrt_llm/executor/worker.py (1 hunks)
  • tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py (2 hunks)
  • tests/unittest/_torch/auto_deploy/_utils_test/_model_test_utils.py (0 hunks)
  • tests/unittest/_torch/auto_deploy/integration/test_llama4_vlm_export.py (1 hunks)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/shim/test_engine.py (0 hunks)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/shim/test_llm_config.py (1 hunks)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_build_small_single.py (1 hunks)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py (5 hunks)
💤 Files with no reviewable changes (2)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/shim/test_engine.py
  • tests/unittest/_torch/auto_deploy/_utils_test/_model_test_utils.py
🚧 Files skipped from review as they are similar to previous changes (10)
  • tests/unittest/_torch/auto_deploy/integration/test_llama4_vlm_export.py
  • examples/auto_deploy/.gitignore
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/shim/test_llm_config.py
  • tensorrt_llm/_torch/auto_deploy/models/factory.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_build_small_single.py
  • tensorrt_llm/_torch/auto_deploy/transformations/library/kvcache.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/export_to_gm.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py
  • tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures

Files:

  • tensorrt_llm/executor/worker.py
  • tensorrt_llm/_torch/auto_deploy/llm.py
  • tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py
  • tensorrt_llm/_torch/auto_deploy/models/hf.py
  • examples/auto_deploy/build_and_run_ad.py
  • tensorrt_llm/_torch/auto_deploy/shim/demollm.py
  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • tensorrt_llm/executor/worker.py
  • tensorrt_llm/_torch/auto_deploy/llm.py
  • tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py
  • tensorrt_llm/_torch/auto_deploy/models/hf.py
  • examples/auto_deploy/build_and_run_ad.py
  • tensorrt_llm/_torch/auto_deploy/shim/demollm.py
  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
🧬 Code graph analysis (7)
tensorrt_llm/_torch/auto_deploy/llm.py (6)
tensorrt_llm/inputs/registry.py (1)
  • DefaultInputProcessor (44-111)
tensorrt_llm/_torch/auto_deploy/models/factory.py (5)
  • tokenizer (48-50)
  • ModelFactory (15-243)
  • init_tokenizer (107-114)
  • init_processor (116-123)
  • prefetch_checkpoint (125-136)
tensorrt_llm/llmapi/tokenizer.py (2)
  • TokenizerBase (24-25)
  • TransformersTokenizer (28-267)
tensorrt_llm/sampling_params.py (1)
  • SamplingParams (125-486)
tensorrt_llm/_torch/auto_deploy/llm_args.py (1)
  • create_factory (204-215)
tensorrt_llm/_torch/auto_deploy/models/hf.py (3)
  • init_tokenizer (188-192)
  • init_tokenizer (344-349)
  • init_processor (351-355)
tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py (2)
tensorrt_llm/_torch/models/modeling_llama.py (1)
  • Llama4ForConditionalGeneration (1179-1240)
tensorrt_llm/_torch/auto_deploy/export/interface.py (2)
  • BaseExportPatch (38-130)
  • ExportPatchRegistry (166-211)
tensorrt_llm/_torch/auto_deploy/models/hf.py (2)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (1)
  • CacheConfig (26-29)
tensorrt_llm/_torch/auto_deploy/models/factory.py (6)
  • tokenizer (48-50)
  • init_tokenizer (107-114)
  • init_processor (116-123)
  • model (43-45)
  • get_example_inputs (218-228)
  • get_extra_inputs (230-243)
examples/auto_deploy/build_and_run_ad.py (1)
tensorrt_llm/llmapi/llm_args.py (1)
  • Field (67-94)
tensorrt_llm/_torch/auto_deploy/shim/demollm.py (3)
tensorrt_llm/executor/request.py (1)
  • GenerationRequest (84-129)
tensorrt_llm/inputs/multimodal.py (1)
  • MultimodalParams (152-410)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (6)
  • reset (479-485)
  • nest_sequences (521-588)
  • input_pos (271-272)
  • page_assignments (309-311)
  • seq_len (267-268)
  • args (228-230)
tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (6)
tensorrt_llm/_torch/auto_deploy/llm_args.py (1)
  • create_factory (204-215)
tensorrt_llm/_torch/auto_deploy/models/factory.py (1)
  • get_extra_inputs (230-243)
tensorrt_llm/_torch/auto_deploy/models/hf.py (1)
  • get_extra_inputs (420-439)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (5)
  • add_extra_arg (620-643)
  • page_assignments (309-311)
  • input_pos (271-272)
  • nest_sequences (521-588)
  • rescatter_input_ids (590-614)
tensorrt_llm/_torch/auto_deploy/transformations/transform.py (1)
  • InferenceOptimizer (25-142)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
  • get_cache_indices (572-583)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (2)
tensorrt_llm/_torch/auto_deploy/shim/interface.py (3)
  • args (24-26)
  • dynamic_shapes (29-31)
  • to (33-37)
tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py (1)
  • nest_sequences (67-79)
🔇 Additional comments (5)
examples/auto_deploy/build_and_run_ad.py (1)

75-96: LGTM! Well-structured prompt normalization logic

The implementation correctly handles multiple prompt formats (strings, dicts, and HF message templates) and normalizes them to a consistent structure. The fake prompt approach for HF templates clearly separates the different input types.
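
For readers skimming the diff, the normalization boils down to something like the following sketch (names and the fake-prompt placeholder are illustrative, not the exact code in build_and_run_ad.py):

    def normalize_query(query):
        if isinstance(query, str):
            return {"prompt": query}
        if isinstance(query, dict) and "prompt" in query:
            return dict(query)
        if isinstance(query, list):  # HF chat-template message list
            # fake prompt placeholder; the messages drive tokenization downstream
            return {"prompt": "<chat_template>", "messages": query}
        raise TypeError(f"Unsupported query type: {type(query)}")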

tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py (1)

89-96: I've added a script to verify which FX-compatible assertion primitives and JIT flags are available in the current PyTorch build. Once the results are in, we can decide on the best pattern (e.g., torch._assert) for a torch.export-compatible validation.
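
If torch._assert turns out to be available, the export-compatible validation could look roughly like this (a sketch under that assumption; the actual check and message are still undecided):

    import torch

    def _validate_pixel_values(pixel_values: torch.Tensor) -> torch.Tensor:
        # torch._assert is traceable by torch.export, unlike a plain Python `assert`.
        torch._assert(pixel_values.dim() == 4, "pixel_values must be [batch, channels, H, W]")
        return pixel_values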

tensorrt_llm/_torch/auto_deploy/shim/demollm.py (1)

337-341: LGTM! Good handling of multimodal_params reference

The comment clearly explains why the multimodal_params needs to be passed explicitly to maintain the reference during asynchronous submission.

tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (2)

111-117: LGTM: wiring factory-provided extra inputs into SequenceInfo.

Clean pass-through of none_input and dynamic-shape callbacks.


200-208: Ensure multimodal tensors are on the correct device.

If request.py_multimodal_data can contain CPU tensors, consider moving them to self._device here to avoid a device mismatch later (SequenceInfo will .to(...) them, but only after they have been added).

Would you like a small patch to move/stack these tensors to self._device before passing to nest_sequences?
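
A minimal sketch of such a patch (assuming request.py_multimodal_data is a dict of tensors; the real structure may differ):

    multimodal_data = getattr(request, "py_multimodal_data", None) or {}
    multimodal_data = {
        k: v.to(self._device, non_blocking=True) if isinstance(v, torch.Tensor) else v
        for k, v in multimodal_data.items()
    }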

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (1)

1-1: Add NVIDIA 2025 copyright header.

Required for source files; place above module docstring.

+ # Copyright 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ #
+ # Redistribution and use in source and binary forms, with or without
+ # modification, are permitted provided that the following conditions
+ # are met:
+ #  * Redistributions of source code must retain the above copyright
+ #    notice, this list of conditions and the following disclaimer.
+ #  * Redistributions in binary form must reproduce the above copyright
+ #    notice, this list of conditions and the following disclaimer in the
+ #    documentation and/or other materials provided with the distribution.
+ #  * Neither the name of NVIDIA CORPORATION nor the names of its
+ #    contributors may be used to endorse or promote products derived
+ #    from this software without specific prior written permission.
+ #
+ # THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS “AS IS” AND ANY
+ # EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ # IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+ # PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+ # CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+ # EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+ # PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+ # PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+ # OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
♻️ Duplicate comments (8)
tensorrt_llm/_torch/auto_deploy/llm.py (5)

1-1: Add NVIDIA 2025 copyright header.

Mandatory per repo guidelines; prepend standard header above imports.

+ # Copyright 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ #
+ # Redistribution and use in source and binary forms, with or without
+ # modification, are permitted provided that the following conditions
+ # are met:
+ #  * Redistributions of source code must retain the above copyright
+ #    notice, this list of conditions and the following disclaimer.
+ #  * Redistributions in binary form must reproduce the above copyright
+ #    notice, this list of conditions and the following disclaimer in the
+ #    documentation and/or other materials provided with the distribution.
+ #  * Neither the name of NVIDIA CORPORATION nor the names of its
+ #    contributors may be used to endorse or promote products derived
+ #    from this software without specific prior written permission.
+ #
+ # THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS “AS IS” AND ANY
+ # EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ # IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+ # PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+ # CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+ # EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+ # PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+ # PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+ # OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

30-31: Do not require a processor for text-only inputs.

Gate the processor check under the messages path.

-        if self.processor is None:
-            raise ValueError("processor is required to tokenize inputs")
+        # Only required for chat-style/multimodal inputs
+        if "messages" in inputs and self.processor is None:
+            raise ValueError("processor is required to tokenize chat-style inputs")
@@
-        if "messages" in inputs:
+        if "messages" in inputs:

Also applies to: 43-44


37-41: Preserve add_special_tokens when adding truncation.

Use update() instead of overwriting kwargs.

-        if sampling_params.truncate_prompt_tokens is not None:
-            kwargs = {
-                "truncation": True,
-                "max_length": sampling_params.truncate_prompt_tokens,
-            }
+        if sampling_params.truncate_prompt_tokens is not None:
+            kwargs.update({
+                "truncation": True,
+                "max_length": sampling_params.truncate_prompt_tokens,
+            })

75-76: Add tiktoken-compatible fallback for encode.

Mirror DefaultInputProcessor to avoid TypeError with tokenizers that don’t accept add_special_tokens.

-            token_ids = self.tokenizer.encode(inputs["prompt"], **kwargs)
+            try:
+                token_ids = self.tokenizer.encode(inputs["prompt"], **kwargs)
+            except TypeError:
+                # e.g., tiktoken path without add_special_tokens
+                token_ids = self.tokenizer.encode(inputs["prompt"])

126-129: Relax tokenizer type check and use explicit exceptions instead of assert.

Allow any TokenizerBase and raise errors even under -O.

-        assert isinstance(self.input_processor, DefaultInputProcessor)
-        assert self.tokenizer is None or isinstance(self.tokenizer, TransformersTokenizer)
+        if not isinstance(self.input_processor, DefaultInputProcessor):
+            raise TypeError("input_processor must be a DefaultInputProcessor")
+        if self.tokenizer is not None and not isinstance(self.tokenizer, TokenizerBase):
+            raise TypeError("tokenizer must implement TokenizerBase")
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (3)

548-553: Avoid stale tail values in device tensors.

Zero-fill before writes for input_pos/pages_per_seq to prevent leftover entries affecting later runs.

-            self._store_arg(
+            self._store_arg(
                 "input_pos",
                 [input_pos] * self.num_sequences if isinstance(input_pos, int) else input_pos,
-            )
+                reset=True,
+            )
@@
-            self._store_arg("pages_per_seq", pages_per_seq)
+            self._store_arg("pages_per_seq", pages_per_seq, reset=True)

Also applies to: 559-561


543-546: Uncached path: enforce uniform seq_len or pad; current view() will mis-shape ragged inputs.

Add a guard when not using cached attention.

-        self._store_arg("seq_len", [len(ids) for ids in input_ids], reset=True)
+        seq_lens_list = [len(ids) for ids in input_ids]
+        if not self._is_cached_attn:
+            assert len(set(seq_lens_list)) == 1, (
+                f"Uncached attention requires uniform seq_len; got {seq_lens_list}"
+            )
+        self._store_arg("seq_len", seq_lens_list, reset=True)

616-619: unnest_sequences fails for uncached SDPA; split along correct dimension.

Handle cached vs uncached by choosing split axis.

-    def unnest_sequences(self, t_nested: torch.Tensor) -> List[torch.Tensor]:
-        t_squeezed = t_nested.squeeze(1) if self.is_generate else t_nested.squeeze(0)
-        return list(torch.split(t_squeezed, self.seq_len))
+    def unnest_sequences(self, t_nested: torch.Tensor) -> List[torch.Tensor]:
+        if self._is_cached_attn:
+            t_squeezed = t_nested.squeeze(1) if self.is_generate else t_nested.squeeze(0)
+            return list(torch.split(t_squeezed, self.seq_len, dim=0))
+        # uncached SDPA: [B, S, ...] -> split along seq dim
+        return list(torch.split(t_nested, self.seq_len, dim=1))
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 57e20fe and 504afd2.

📒 Files selected for processing (4)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (7 hunks)
  • tensorrt_llm/_torch/auto_deploy/llm.py (4 hunks)
  • tensorrt_llm/_torch/auto_deploy/transform/library/export_to_gm.py (1 hunks)
  • tests/unittest/_torch/auto_deploy/integration/test_llama4_vlm_export.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • tests/unittest/_torch/auto_deploy/integration/test_llama4_vlm_export.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/export_to_gm.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures

Files:

  • tensorrt_llm/_torch/auto_deploy/llm.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • tensorrt_llm/_torch/auto_deploy/llm.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
🧠 Learnings (1)
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • tensorrt_llm/_torch/auto_deploy/llm.py
🧬 Code graph analysis (2)
tensorrt_llm/_torch/auto_deploy/llm.py (7)
tensorrt_llm/inputs/registry.py (1)
  • DefaultInputProcessor (44-111)
tensorrt_llm/llmapi/llm.py (4)
  • tokenizer (681-685)
  • tokenizer (688-689)
  • prompt (79-80)
  • LLM (1079-1095)
tensorrt_llm/_torch/auto_deploy/models/factory.py (5)
  • tokenizer (48-50)
  • ModelFactory (15-243)
  • init_tokenizer (107-114)
  • init_processor (116-123)
  • prefetch_checkpoint (125-136)
tensorrt_llm/llmapi/tokenizer.py (2)
  • TokenizerBase (24-25)
  • TransformersTokenizer (28-267)
tensorrt_llm/sampling_params.py (1)
  • SamplingParams (125-486)
tensorrt_llm/_torch/auto_deploy/llm_args.py (1)
  • create_factory (204-215)
tensorrt_llm/_torch/auto_deploy/shim/demollm.py (1)
  • DemoGenerationExecutor (246-346)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (2)
tensorrt_llm/_torch/auto_deploy/shim/interface.py (3)
  • args (24-26)
  • dynamic_shapes (29-31)
  • to (33-37)
tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py (1)
  • nest_sequences (67-79)

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py (3)

1-1: Add NVIDIA 2025 copyright header.

Per repo policy, prepend the standard NVIDIA header before the module docstring.

Apply:

+ # Copyright 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ #
+ # Redistribution and use in source and binary forms, with or without
+ # modification, are permitted provided that the following conditions
+ # are met:
+ #  * Redistributions of source code must retain the above copyright
+ #    notice, this list of conditions and the following disclaimer.
+ #  * Redistributions in binary form must reproduce the above copyright
+ #    notice, this list of conditions and the following disclaimer in the
+ #    documentation and/or other materials provided with the distribution.
+ #  * Neither the name of NVIDIA CORPORATION nor the names of its
+ #    contributors may be used to endorse or promote products derived
+ #    from this software without specific prior written permission.
+ #
+ # THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS “AS IS” AND ANY
+ # EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ # IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+ # PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+ # CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+ # EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+ # PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+ # PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+ # OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

18-31: Bug: boolean mask adds +1 to allowed logits. Should be 0.0.

Mapping True→1.0 skews attention; allowed positions must add 0.0 and blocked positions −inf.

Apply:

-    if attn_mask.dtype == torch.bool:
-        float_mask = torch.zeros_like(attn_mask, dtype=dtype)
-        float_mask = float_mask.masked_fill(attn_mask, 1.0)  # True -> 1.0
-        float_mask = float_mask.masked_fill(~attn_mask, float("-inf"))  # False -> -inf
-        return float_mask
+    if attn_mask.dtype == torch.bool:
+        float_mask = torch.zeros_like(attn_mask, dtype=dtype)
+        float_mask = float_mask.masked_fill(~attn_mask, float("-inf"))  # False -> -inf
+        return float_mask
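
A quick standalone check of the convention (illustrative only): with SDPA, a boolean mask means True = attend, and the equivalent float mask adds 0.0 for allowed positions and -inf for blocked ones.

    import torch
    import torch.nn.functional as F

    q = k = v = torch.randn(1, 2, 4, 8)
    bool_mask = torch.ones(4, 4).tril().bool().view(1, 1, 4, 4)   # True = attend
    float_mask = torch.zeros(1, 1, 4, 4).masked_fill(~bool_mask, float("-inf"))
    out_bool = F.scaled_dot_product_attention(q, k, v, attn_mask=bool_mask)
    out_float = F.scaled_dot_product_attention(q, k, v, attn_mask=float_mask)
    assert torch.allclose(out_bool, out_float, atol=1e-6)
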

121-125: Guard GQA head divisibility.

Avoid silent shape errors when n_heads % n_kv_heads != 0.

Apply:

-    if n_heads != n_kv_heads:
+    if n_heads != n_kv_heads:
+        if n_heads % n_kv_heads != 0:
+            raise ValueError(f"n_heads ({n_heads}) must be an integer multiple of n_kv_heads ({n_kv_heads})")
         n_rep = n_heads // n_kv_heads
         key_t = repeat_kv(key_t, n_rep)
         value_t = repeat_kv(value_t, n_rep)
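
For context, repeat_kv typically follows the standard GQA replication pattern (a sketch; the repo's helper may differ in details):

    import torch

    def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
        # [batch, kv_heads, seq, head_dim] -> [batch, kv_heads * n_rep, seq, head_dim]
        b, n_kv, s, d = x.shape
        if n_rep == 1:
            return x
        return x[:, :, None, :, :].expand(b, n_kv, n_rep, s, d).reshape(b, n_kv * n_rep, s, d)
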
tensorrt_llm/_torch/auto_deploy/transform/library/export_to_gm.py (1)

1-1: Add NVIDIA 2025 copyright header.

Required for non-test Python sources.

Apply:

+ # Copyright 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ #
+ # Redistribution and use in source and binary forms, with or without
+ # modification, are permitted provided that the following conditions
+ # are met:
+ #  * Redistributions of source code must retain the above copyright
+ #    notice, this list of conditions and the following disclaimer.
+ #  * Redistributions in binary form must reproduce the above copyright
+ #    notice, this list of conditions and the following disclaimer in the
+ #    documentation and/or other materials provided with the distribution.
+ #  * Neither the name of NVIDIA CORPORATION nor the names of its
+ #    contributors may be used to endorse or promote products derived
+ #    from this software without specific prior written permission.
+ #
+ # THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS “AS IS” AND ANY
+ # EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ # IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+ # PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+ # CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+ # EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+ # PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+ # PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+ # OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
♻️ Duplicate comments (9)
tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py (1)

66-67: **Explicit schema over *args/kwargs for custom_op.

custom_op needs a stable schema for export; keeping explicit parameters is correct here. The added enable_gqa is fine.

tensorrt_llm/_torch/auto_deploy/transform/library/export_to_gm.py (1)

69-77: Remove debugger-only dead branch; always export.

The if False block prevents export and leaves gm empty.

Apply:

-        if False:
-            gm = torch_export_to_gm(
-                model,
-                args=cm.args,
-                dynamic_shapes=cm.dynamic_shapes,
-                clone=self.config.clone_state_dict,
-                strict=self.config.strict,
-                patch_list=self.config.patch_list,
-            )
+        gm = torch_export_to_gm(
+            model,
+            args=cm.args,
+            dynamic_shapes=cm.dynamic_shapes,
+            clone=self.config.clone_state_dict,
+            strict=self.config.strict,
+            patch_list=self.config.patch_list,
+        )
tensorrt_llm/_torch/auto_deploy/llm.py (6)

75-76: tiktoken-compatible encode fallback.

Some tokenizers reject add_special_tokens kwarg.

Apply:

-            token_ids = self.tokenizer.encode(inputs["prompt"], **kwargs)
+            try:
+                token_ids = self.tokenizer.encode(inputs["prompt"], **kwargs)
+            except TypeError:
+                token_ids = self.tokenizer.encode(inputs["prompt"])

1-1: Add NVIDIA 2025 copyright header.

Mandatory for source files.

Apply header block as used elsewhere in the repo.


30-31: Do not require processor for text-only inputs.

Gate the check under the messages branch to keep text-only working.

Apply:

-        if self.processor is None:
-            raise ValueError("processor is required to tokenize inputs")
+        # Processor is only required for chat/multimodal inputs
+        if "messages" in inputs and self.processor is None:
+            raise ValueError("processor is required to tokenize chat-style inputs")

37-41: Preserve existing kwargs when adding truncation.

Use update(), don't overwrite.

Apply:

-        if sampling_params.truncate_prompt_tokens is not None:
-            kwargs = {
-                "truncation": True,
-                "max_length": sampling_params.truncate_prompt_tokens,
-            }
+        if sampling_params.truncate_prompt_tokens is not None:
+            kwargs.update({
+                "truncation": True,
+                "max_length": sampling_params.truncate_prompt_tokens,
+            })

66-74: Accept 1D/2D input_ids; avoid assert.

Handle tokenizer outputs robustly and error clearly on unexpected shapes.

Apply:

-            token_ids = all_args.pop("input_ids")
-            assert token_ids.shape[0] == 1, "messages should be unbatched at this point."
+            token_ids = all_args.pop("input_ids")
+            if token_ids.ndim == 2:
+                token_ids = token_ids[0]
+            elif token_ids.ndim != 1:
+                raise ValueError(f"Unexpected token_ids shape: {tuple(token_ids.shape)}")

125-129: Loosen tokenizer type check; do not rely on assert.

Allow any TokenizerBase; raise ValueError instead of assert.

Apply:

-        assert isinstance(self.input_processor, DefaultInputProcessor)
-        assert self.tokenizer is None or isinstance(self.tokenizer, TransformersTokenizer)
+        if not isinstance(self.input_processor, DefaultInputProcessor):
+            raise TypeError("input_processor must be DefaultInputProcessor-compatible")
+        if self.tokenizer is not None and not isinstance(self.tokenizer, TokenizerBase):
+            raise TypeError("tokenizer must implement TokenizerBase")
tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py (1)

1-1: Add NVIDIA copyright header.

Required for new source files.

Apply the standard 2025 NVIDIA header at the top of the file.

🧹 Nitpick comments (5)
tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py (1)

169-187: Align sinks implementation with comments (explicit sink column).

Current code adjusts the normalizer only; either fix comments or implement the concatenation path. Suggest implementing the explicit sink column for clarity.

Apply:

-    if sinks is not None:
-        # Concatenate sinks to attention scores following the reference implementation
-        # sinks should have n_heads elements, each head gets its own sink value
-        # Expand sinks to [b, n_heads, s_q, 1] - one sink column per head
-        sinks_expanded = sinks.reshape(1, -1, 1, 1).expand(
-            b, n_heads, s_q, 1
-        )  # [b, n_heads, s_q, 1]
-
-        # Concatenate along the key dimension (last dimension)
-        logits_max = torch.max(attn_scores, dim=-1, keepdim=True).values
-        sinks = torch.exp(sinks_expanded - logits_max)
-        unnormalized_scores = torch.exp(attn_scores - logits_max)
-        normalizer = unnormalized_scores.sum(dim=-1, keepdim=True) + sinks
-        scores = unnormalized_scores / normalizer
-        # Use only the non-sink portion for computing output
-        # We added exactly 1 column, so remove exactly 1 column
-        attn_out = torch.matmul(scores, value_t)  # [b, n_heads, s_q, v_head_dim]
+    if sinks is not None:
+        # Expand sinks to [b, n_heads, s_q, 1] and concatenate as an extra "sink" key
+        sinks_expanded = sinks.reshape(1, -1, 1, 1).expand(b, n_heads, s_q, 1)
+        attn_scores_ext = torch.cat([attn_scores, sinks_expanded], dim=-1)  # [b, n, s_q, s_k+1]
+        attn_weights_ext = torch.softmax(attn_scores_ext, dim=-1, dtype=torch.float32).to(query.dtype)
+        attn_weights = attn_weights_ext[..., :-1]  # drop sink column for value matmul
+        attn_out = torch.matmul(attn_weights, value_t)  # [b, n_heads, s_q, v_head_dim]
tests/unittest/_torch/auto_deploy/integration/test_llama4_vlm_export.py (2)

130-136: Remove unused local n_image_tokens.

Silences F841; variable is unused.

Apply:

-        n_image_tokens = special_image_mask.sum()
         special_image_mask = (
             special_image_mask.unsqueeze(-1).expand_as(inputs_embeds).to(inputs_embeds.device)
         )

144-151: Optional: robust image-presence predicate for torch.cond.

all-zero images yield False with torch.any(pixel_values != 0). Consider a shape-based predicate.

Apply:

-    has_image: torch.Tensor = torch.any(pixel_values != 0)
+    # Treat any non-empty batch as "has image"
+    has_image: torch.Tensor = (torch.tensor(pixel_values.shape[0], device=pixel_values.device) > 0)
tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py (2)

105-113: Prefer shape-based predicate for torch.cond to handle black images.

torch.any(pixel_values != 0) fails for valid all-black images.

Apply:

-    has_image: torch.Tensor = torch.any(pixel_values != 0)
+    # Consider any non-empty batch as "has image"
+    has_image: torch.Tensor = (torch.tensor(pixel_values.shape[0], device=pixel_values.device) > 0)

162-169: Consider registering the patch.

Uncomment registration so export flows can use patch_list=["hf_llama4_vision"].

Apply:

-# @ExportPatchRegistry.register("hf_llama4_vision")
+@ExportPatchRegistry.register("hf_llama4_vision")
 # class header remains unchanged

If registration is intentionally deferred, ignore.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 504afd2 and abb6285.

📒 Files selected for processing (6)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py (2 hunks)
  • tensorrt_llm/_torch/auto_deploy/llm.py (4 hunks)
  • tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/transform/library/export_to_gm.py (1 hunks)
  • tests/unittest/_torch/auto_deploy/integration/test_llama4_vlm_export.py (8 hunks)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py (0 hunks)
💤 Files with no reviewable changes (1)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures

Files:

  • tensorrt_llm/_torch/auto_deploy/llm.py
  • tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py
  • tests/unittest/_torch/auto_deploy/integration/test_llama4_vlm_export.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/export_to_gm.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • tensorrt_llm/_torch/auto_deploy/llm.py
  • tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py
  • tests/unittest/_torch/auto_deploy/integration/test_llama4_vlm_export.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/export_to_gm.py
🧠 Learnings (1)
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • tensorrt_llm/_torch/auto_deploy/llm.py
🧬 Code graph analysis (4)
tensorrt_llm/_torch/auto_deploy/llm.py (6)
tensorrt_llm/inputs/registry.py (1)
  • DefaultInputProcessor (44-111)
tensorrt_llm/llmapi/llm.py (4)
  • tokenizer (681-685)
  • tokenizer (688-689)
  • prompt (79-80)
  • LLM (1079-1095)
tensorrt_llm/_torch/auto_deploy/models/factory.py (5)
  • tokenizer (48-50)
  • ModelFactory (15-243)
  • init_tokenizer (107-114)
  • init_processor (116-123)
  • prefetch_checkpoint (125-136)
tensorrt_llm/llmapi/tokenizer.py (3)
  • TokenizerBase (24-25)
  • TransformersTokenizer (28-267)
  • tokenizer_factory (270-290)
tensorrt_llm/sampling_params.py (1)
  • SamplingParams (125-486)
tensorrt_llm/_torch/auto_deploy/llm_args.py (1)
  • create_factory (204-215)
tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py (2)
tensorrt_llm/_torch/models/modeling_llama.py (1)
  • Llama4ForConditionalGeneration (1179-1240)
tensorrt_llm/_torch/auto_deploy/export/interface.py (1)
  • BaseExportPatch (38-130)
tests/unittest/_torch/auto_deploy/integration/test_llama4_vlm_export.py (4)
tensorrt_llm/_torch/auto_deploy/export/interface.py (1)
  • apply_export_patches (215-249)
tensorrt_llm/_torch/auto_deploy/export/export.py (1)
  • torch_export_to_gm (198-284)
tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py (1)
  • _forward_with_cond (20-159)
tensorrt_llm/_torch/auto_deploy/transformations/_graph.py (1)
  • move_to_device (132-139)
tensorrt_llm/_torch/auto_deploy/transform/library/export_to_gm.py (3)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (3)
  • set_example_sequence (440-466)
  • args (228-230)
  • dynamic_shapes (262-264)
tensorrt_llm/_torch/auto_deploy/models/factory.py (2)
  • get_example_inputs (218-228)
  • model (43-45)
tensorrt_llm/_torch/auto_deploy/export/export.py (1)
  • torch_export_to_gm (198-284)
🪛 Ruff (0.12.2)
tests/unittest/_torch/auto_deploy/integration/test_llama4_vlm_export.py

130-130: Local variable n_image_tokens is assigned to but never used

Remove assignment to unused variable n_image_tokens

(F841)

🔇 Additional comments (2)
tensorrt_llm/_torch/auto_deploy/transform/library/export_to_gm.py (1)

66-67: Good: example inputs threaded into SequenceInterface.

Passing factory.get_example_inputs() into set_example_sequence() matches the new interface.

tensorrt_llm/_torch/auto_deploy/llm.py (1)

99-100: Tokenizer factory wiring looks good.

Delegating to factory.init_tokenizer() aligns with the new ModelFactory flow.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 8

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
tensorrt_llm/_torch/auto_deploy/export/interface.py (1)

1-4: Add NVIDIA 2025 copyright header
Prepend the standard NVIDIA header (identical wording and license used in other tensorrt_llm source files) above the existing module docstring in tensorrt_llm/_torch/auto_deploy/export/interface.py.

tensorrt_llm/_torch/auto_deploy/models/hf.py (1)

1-1: Prepend NVIDIA 2025 copyright header

Add the standard NVIDIA header block at the top of tensorrt_llm/_torch/auto_deploy/models/hf.py as specified in CODING_GUIDELINES.md.

tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py (1)

110-117: Fix None-handling and zero-image detection for torch.cond

Current code errors on pixel_values=None and misclassifies all-black images. Use a batch-size check and produce a 0-d bool tensor.

-    # decide by whether there is any non-zero pixel_values
-    has_image: torch.Tensor = torch.any(pixel_values != 0)
+    # decide by presence (batch dimension) of images
+    if pixel_values is None:
+        has_image = torch.tensor(False, dtype=torch.bool, device=inputs_embeds.device)
+    else:
+        has_image = torch.tensor(pixel_values.shape[0] > 0, dtype=torch.bool, device=inputs_embeds.device)
tensorrt_llm/_torch/auto_deploy/export/export.py (1)

1-1: Prepend NVIDIA 2025 copyright header
Add the standard NVIDIA header before the module docstring at the top of export.py.

♻️ Duplicate comments (8)
tensorrt_llm/_torch/auto_deploy/llm.py (5)

75-76: Add tiktoken-compatible fallback for encode

Some tokenizers don’t accept add_special_tokens; retry without kwargs.

-            token_ids = self.tokenizer.encode(inputs["prompt"], **kwargs)
+            try:
+                token_ids = self.tokenizer.encode(inputs["prompt"], **kwargs)
+            except TypeError:
+                token_ids = self.tokenizer.encode(inputs["prompt"])

1-1: Missing NVIDIA 2025 header

This file still lacks the required header.

+ # Copyright 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ #
+ # Redistribution and use in source and binary forms, with or without
+ # modification, are permitted provided that the following conditions
+ # are met:
+ #  * Redistributions of source code must retain the above copyright
+ #    notice, this list of conditions and the following disclaimer.
+ #  * Redistributions in binary form must reproduce the above copyright
+ #    notice, this list of conditions and the following disclaimer in the
+ #    documentation and/or other materials provided with the distribution.
+ #  * Neither the name of NVIDIA CORPORATION nor the names of its
+ #    contributors may be used to endorse or promote products derived
+ #    from this software without specific prior written permission.
+ #
+ # THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS “AS IS” AND ANY
+ # EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ # IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+ # PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
+ # CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+ # EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+ # PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+ # PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
+ # OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ # (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

30-32: Require processor only for chat-style inputs

Unconditionally requiring a processor breaks text-only inputs. Move the check under the messages path and preserve existing kwargs when adding truncation.

-        if self.processor is None:
-            raise ValueError("processor is required to tokenize inputs")
+        # Only required for chat-style/multimodal inputs
@@
-        if sampling_params.truncate_prompt_tokens is not None:
-            kwargs = {
-                "truncation": True,
-                "max_length": sampling_params.truncate_prompt_tokens,
-            }
+        if sampling_params.truncate_prompt_tokens is not None:
+            kwargs.update({
+                "truncation": True,
+                "max_length": sampling_params.truncate_prompt_tokens,
+            })
@@
-        if "messages" in inputs:
+        if "messages" in inputs:
+            if self.processor is None:
+                raise ValueError("processor is required to tokenize chat-style inputs")

Also applies to: 43-46, 53-63


67-73: Handle 1D/2D token_ids and avoid assert

apply_chat_template can return [S] or [1,S]; accept both and raise on unexpected shapes.

-            token_ids = all_args.pop("input_ids")
-            assert token_ids.shape[0] == 1, "messages should be unbatched at this point."
+            token_ids = all_args.pop("input_ids")
+            if token_ids.ndim == 2:
+                if token_ids.shape[0] != 1:
+                    raise ValueError(f"Expected batch size 1, got {tuple(token_ids.shape)}")
+                token_ids = token_ids[0]
+            elif token_ids.ndim != 1:
+                raise ValueError(f"Unexpected token_ids shape: {tuple(token_ids.shape)}")
@@
-            return token_ids[0].tolist(), extra_processed_inputs
+            return token_ids.tolist(), extra_processed_inputs

126-129: Replace asserts with explicit validation; widen tokenizer type

Use explicit raises (asserts can be optimized out) and accept any TokenizerBase.

-        assert isinstance(self.input_processor, DefaultInputProcessor)
-        assert self.tokenizer is None or isinstance(self.tokenizer, TransformersTokenizer)
+        if not isinstance(self.input_processor, DefaultInputProcessor):
+            raise TypeError(f"input_processor must be DefaultInputProcessor, got {type(self.input_processor)}")
+        if self.tokenizer is not None and not isinstance(self.tokenizer, TokenizerBase):
+            raise TypeError(f"tokenizer must be a TokenizerBase or None, got {type(self.tokenizer)}")
tensorrt_llm/_torch/auto_deploy/models/hf.py (1)

432-441: Make placeholder pixel_values shape configurable

Hardcoding (0,3,336,336) is brittle across VLMs. Derive from config where possible and fall back to (0,C,H,W).

-        none_pixel_values = torch.zeros(0, 3, 336, 336)
+        channels = getattr(getattr(self, "model_config", None), "num_channels", 3)
+        size = getattr(getattr(self, "model_config", None), "image_size", 336)
+        none_pixel_values = torch.zeros(0, channels, size, size)
tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py (2)

1-1: Missing NVIDIA header

Add the required header block per repo standards.


86-95: Robust image-token attribute access

Configs may expose image_token_id or image_token_index. Use a fallback getter to avoid AttributeError.

-        if input_ids is None:
+        image_token = getattr(self.config, "image_token_id",
+                              getattr(self.config, "image_token_index", None))
+        if image_token is None:
+            raise ValueError("Model config must define image_token_id or image_token_index")
+        if input_ids is None:
             special_image_mask = inputs_embeds == self.get_input_embeddings()(
-                torch.tensor(
-                    self.config.image_token_id, dtype=torch.long, device=inputs_embeds.device
-                )
+                torch.tensor(image_token, dtype=torch.long, device=inputs_embeds.device)
             )
             special_image_mask = special_image_mask.all(-1)
         else:
-            special_image_mask = input_ids == self.config.image_token_id
+            special_image_mask = input_ids == image_token
🧹 Nitpick comments (12)
tensorrt_llm/_torch/auto_deploy/export/interface.py (4)

186-188: Improve unknown-patch error for debuggability

Include available patch names in the exception to speed up troubleshooting.

-        if not cls.has(name):
-            raise ValueError(f"Unknown patch: {name}")
+        if not cls.has(name):
+            available = ", ".join(cls.list_patches())
+            raise ValueError(f"Unknown patch: {name}. Available patches: [{available}]")

217-220: Update docstring to reflect new API (patch_list, defaults)

The public docstring omits patch_list and “apply all registered patches” behavior. Please document both to avoid surprises for callers.

 def apply_export_patches(
     patch_configs: Optional[Dict[str, Union[ExportPatchConfig, Dict[str, Any]]]] = None,
     patch_list: Optional[List[str]] = None,
 ):
-    """Context manager to apply multiple patches.
-
-    Args:
-        patch_configs: Dict mapping patch names to their configurations.
-    """
+    """Context manager to apply multiple patches.
+
+    Args:
+        patch_configs: Optional mapping {patch_name -> config dict or ExportPatchConfig}.
+            If None and patch_list is None, all registered patches are applied with defaults.
+        patch_list: Optional list of patch names to apply with default configs.
+            Mutually exclusive with patch_configs.
+    """

Also applies to: 221-225


231-237: Default “apply all patches” may be hazardous

Auto-applying every registered patch can introduce unintended side effects (e.g., vision patches on text-only models). Consider requiring explicit opt-in or gating with a boolean, e.g., apply_all_by_default=False.

-def apply_export_patches(...):
+def apply_export_patches(..., apply_all_by_default: bool = False):
@@
-    elif patch_configs is None:
-        # Default patch configurations - apply all registered patches with default settings
-        patch_configs = {patch_name: {} for patch_name in ExportPatchRegistry.list_patches()}
+    elif patch_configs is None:
+        if apply_all_by_default:
+            patch_configs = {p: {} for p in ExportPatchRegistry.list_patches()}
+        else:
+            patch_configs = {}

239-240: Deterministic order of patch application

If order matters, preserve user intent for patch_list and provide a stable order for dict-based configs.

-    patches = [ExportPatchRegistry.create_patch(k, conf) for k, conf in patch_configs.items()]
+    # Preserve order: explicit patch_list order first, otherwise alphabetical
+    names = list(patch_configs.keys())
+    if patch_list is None:
+        names = sorted(names)
+    patches = [ExportPatchRegistry.create_patch(k, patch_configs[k]) for k in names]

Also applies to: 255-258

tensorrt_llm/_torch/auto_deploy/models/hf.py (3)

342-352: Processor-first init for IT2T models: good, but guard for None

init_tokenizer forwarding to processor.tokenizer is fine; add a clear error if processor/model name is missing.

     def init_processor(self) -> Optional[Any]:
         """Initialize the processor for the model."""
         if self.tokenizer is None:
             return None
         return AutoProcessor.from_pretrained(self.tokenizer, **self.tokenizer_kwargs)

Follow-up: ensure callers either handle None or raise early when IT2T factory is selected without a valid tokenizer name.

Also applies to: 353-358


377-405: Prefer using self.init_processor() and handle missing processor

Avoid re-instantiating AutoProcessor; reuse init_processor and validate presence.

-        processor = AutoProcessor.from_pretrained(self.tokenizer, **self.tokenizer_kwargs)
+        processor = self.init_processor()
+        if processor is None:
+            raise ValueError("Processor is required to generate example inputs for IT2T models.")

Also applies to: 406-421


114-115: Nit: call bound method directly

type(model).forward(model, ...) is unusual; model.forward(...) is clearer unless you’re intentionally bypassing monkey-patching.
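
A self-contained illustration of the difference (not code from this repo):

    import torch
    from torch import nn

    model = nn.Identity()
    x = torch.ones(2)
    model.forward = lambda t: t + 1           # instance-level monkey patch
    print(model.forward(x))                   # tensor([2., 2.]) -- patched path
    print(type(model).forward(model, x))      # tensor([1., 1.]) -- class forward, patch bypassed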

tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py (1)

133-134: Prefer named attributes on outputs

Use outputs.logits (and peers) for clarity and stability.

-    logits = outputs[0]
+    logits = outputs.logits
@@
-    return Llama4CausalLMOutputWithPast(
-        loss=loss,
-        logits=logits,
-        past_key_values=outputs.past_key_values,
-        hidden_states=outputs.hidden_states,
-        attentions=outputs.attentions,
-        image_hidden_states=None,  # skip outputting this for simplicity
-    )
+    return Llama4CausalLMOutputWithPast(
+        loss=loss,
+        logits=logits,
+        past_key_values=outputs.past_key_values,
+        hidden_states=outputs.hidden_states,
+        attentions=outputs.attentions,
+        image_hidden_states=None,
+    )

Also applies to: 160-167

tensorrt_llm/_torch/auto_deploy/export/export.py (1)

206-231: Docstring aligns with new behavior; add explicit cross-reference

Mention that mutual exclusivity and default-all behavior are enforced by apply_export_patches to avoid confusion.

-        patch_configs: Optional patch configurations. If None, all registered patches
-                      will be applied with default settings.
-        patch_list: Optional list of patch names to apply with default settings.
-                   Cannot be used together with patch_configs.
+        patch_configs: Optional patch configurations. If None (and patch_list is None),
+                      apply_export_patches may apply all registered patches with defaults.
+        patch_list: Optional list of patch names to apply with default settings.
+                   Cannot be used together with patch_configs (enforced in apply_export_patches).
tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_llama4_vlm_patch.py (3)

82-95: Follow import refactor for export/move helpers.

Use the module-qualified names introduced earlier.

Apply:

-    gm = torch_export_to_gm(
+    gm = ad_export.torch_export_to_gm(
         model,
         args=(input_ids, position_ids, pixel_values),
         patch_list=[
@@
         ],
     )
-    move_to_device(gm, model.device)
+    graph.move_to_device(gm, model.device)

59-73: Make logits extraction robust across Module/GraphModule/dict returns.

GM may return dicts/tuples without a logits attribute. A tiny helper avoids brittle assumptions.

Apply this helper above _run_with_and_without_image and use it in the return:

+    def _get_logits(out):
+        if hasattr(out, "logits"):
+            return out.logits
+        if isinstance(out, dict) and "logits" in out:
+            return out["logits"]
+        if isinstance(out, (tuple, list)):
+            return out[0]
+        return out
@@
-            return {"no_images": out_no_images.logits, "with_images": out_with_images.logits}
+            return {"no_images": _get_logits(out_no_images), "with_images": _get_logits(out_with_images)}

85-93: Avoid duplication: define a single PATCH_LIST constant.

Reduces drift in future edits.

Apply:

+    PATCH_LIST = [
+        "transformers_sdpa_mask",
+        "autocast_noop",
+        "torch_where",
+        "tensor_meta_device",
+        "sdpa_kernel_noop",
+        "sdpa",
+        "hf_llama4_vision",
+    ]
@@
-        patch_list=[
-            "transformers_sdpa_mask",
-            "autocast_noop",
-            "torch_where",
-            "tensor_meta_device",
-            "sdpa_kernel_noop",
-            "sdpa",
-            "hf_llama4_vision",
-        ],
+        patch_list=PATCH_LIST,
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between abb6285 and e38f7fb.

📒 Files selected for processing (8)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py (2 hunks)
  • tensorrt_llm/_torch/auto_deploy/export/export.py (2 hunks)
  • tensorrt_llm/_torch/auto_deploy/export/interface.py (3 hunks)
  • tensorrt_llm/_torch/auto_deploy/llm.py (4 hunks)
  • tensorrt_llm/_torch/auto_deploy/models/hf.py (5 hunks)
  • tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py (5 hunks)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_llama4_vlm_patch.py (1 hunks)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py (0 hunks)
💤 Files with no reviewable changes (1)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{cpp,cc,cxx,cu,py,h,hpp,hh,hxx,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use spaces only; no tabs; indent by 4 spaces

Files:

  • tensorrt_llm/_torch/auto_deploy/export/interface.py
  • tensorrt_llm/_torch/auto_deploy/llm.py
  • tensorrt_llm/_torch/auto_deploy/export/export.py
  • tensorrt_llm/_torch/auto_deploy/models/hf.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_llama4_vlm_patch.py
  • tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent with 4 spaces; no tabs
Keep module namespace on import; import the module, not individual names; use module.symbol
Python filenames use snake_case (e.g., some_file.py)
Class names use PascalCase
Function and method names use snake_case
Local variables use snake_case; if starting with a number, prefix with k_ (e.g., k_99th_percentile)
Global variables use UPPER_SNAKE with G_ prefix (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE
Avoid shadowing variables from outer scopes
Initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with trailing docstrings
Avoid reflection when simple alternatives exist (e.g., avoid dict(**locals()) patterns)
Limit except clauses to specific exceptions; avoid bare except
When duck-typing with try/except, keep try body minimal and use else for logic

Files:

  • tensorrt_llm/_torch/auto_deploy/export/interface.py
  • tensorrt_llm/_torch/auto_deploy/llm.py
  • tensorrt_llm/_torch/auto_deploy/export/export.py
  • tensorrt_llm/_torch/auto_deploy/models/hf.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_llama4_vlm_patch.py
  • tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py
**/*.{cpp,cc,cxx,cu,h,hpp,hh,hxx,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/_torch/auto_deploy/export/interface.py
  • tensorrt_llm/_torch/auto_deploy/llm.py
  • tensorrt_llm/_torch/auto_deploy/export/export.py
  • tensorrt_llm/_torch/auto_deploy/models/hf.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_llama4_vlm_patch.py
  • tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py
🧠 Learnings (2)
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • tensorrt_llm/_torch/auto_deploy/llm.py
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_llama4_vlm_patch.py
🧬 Code graph analysis (3)
tensorrt_llm/_torch/auto_deploy/models/hf.py (2)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (3)
  • CacheConfig (26-29)
  • device (177-178)
  • to (431-438)
tensorrt_llm/llmapi/llm.py (2)
  • tokenizer (681-685)
  • tokenizer (688-689)
tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_llama4_vlm_patch.py (5)
tests/unittest/_torch/auto_deploy/_utils_test/_model_test_utils.py (1)
  • get_small_model_config (440-478)
examples/auto_deploy/build_and_run_ad.py (1)
  • ExperimentConfig (124-236)
tensorrt_llm/_torch/auto_deploy/llm_args.py (2)
  • LlmArgs (226-330)
  • create_factory (204-215)
tensorrt_llm/_torch/auto_deploy/transformations/_graph.py (1)
  • move_to_device (132-139)
tensorrt_llm/_torch/auto_deploy/models/hf.py (1)
  • init_processor (353-357)
tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py (2)
tensorrt_llm/_torch/models/modeling_llama.py (1)
  • Llama4ForConditionalGeneration (1179-1240)
tensorrt_llm/functional.py (1)
  • masked_scatter (2539-2574)
🔇 Additional comments (8)
tensorrt_llm/_torch/auto_deploy/export/interface.py (1)

226-229: Mutual-exclusivity check: good guardrail

Clear error early; behavior matches the function doc. LGTM.
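
For reference, a minimal sketch of how such a guard and the list-to-config conversion can fit together (function name and bodies are illustrative assumptions, not the shipped implementation):

from contextlib import contextmanager
from typing import Any, Dict, List, Optional

@contextmanager
def apply_export_patches_sketch(
    patch_configs: Optional[Dict[str, Any]] = None,
    patch_list: Optional[List[str]] = None,
):
    # Reject ambiguous input early with a single, clear error.
    if patch_configs is not None and patch_list is not None:
        raise ValueError("Provide either `patch_configs` or `patch_list`, not both.")
    # A bare name list becomes a config dict with default (empty) settings.
    if patch_list is not None:
        patch_configs = {name: {} for name in patch_list}
    # None means "apply all registered patches"; registry lookup omitted in this sketch.
    patch_configs = patch_configs or {}
    # The real implementation builds patch objects from the registry and enters them here.
    yield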

tensorrt_llm/_torch/auto_deploy/llm.py (1)

85-90: Factory wiring and processor creation look good

Lazy factory init, tokenizer delegation, and ADInputProcessor creation are consistent with the new factory flow.

Also applies to: 105-107, 144-147
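
As a rough illustration of the lazy pattern being approved here (class and attribute names are hypothetical; the real code routes this through LlmArgs.create_factory):

class _LazyLlmSketch:
    """Hypothetical stand-in showing lazy factory init and tokenizer delegation."""

    def __init__(self, args):
        self.args = args          # expected to expose create_factory()
        self._factory = None      # built on first access only

    @property
    def factory(self):
        if self._factory is None:
            self._factory = self.args.create_factory()
        return self._factory

    @property
    def tokenizer(self):
        # Tokenizer is delegated to the factory instead of being loaded separately.
        return self.factory.init_tokenizer()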

tensorrt_llm/_torch/auto_deploy/models/hf.py (2)

156-163: Device handling after init_empty_weights: good separation

post_init() on the meta-device path and .to(device) otherwise; this prevents silent dtype/device mismatches. LGTM.
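
A minimal sketch of that branch, assuming an HF PreTrainedModel built under accelerate's init_empty_weights (illustration only, not the file's exact code):

def _finalize_model_device(model, device: str):
    """Sketch: post_init() for meta weights, .to(device) for concrete ones."""
    if any(p.is_meta for p in model.parameters()):
        # Meta tensors carry no storage yet; run post_init() and let the later
        # weight-loading step place real tensors instead of calling .to(device) here.
        model.post_init()          # assumes an HF PreTrainedModel
    else:
        model = model.to(device)   # concrete weights: move them now
    return model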


190-195: Tokenizer init path is fine

Returns None when unspecified; otherwise respects tokenizer_kwargs. LGTM.
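
The shape of that contract, sketched with a hypothetical helper (the actual factory method differs in details):

from typing import Optional

from transformers import AutoTokenizer

def init_tokenizer_sketch(tokenizer_name: Optional[str], **tokenizer_kwargs):
    # No tokenizer configured: return None and let callers skip detokenization.
    if tokenizer_name is None:
        return None
    return AutoTokenizer.from_pretrained(tokenizer_name, **tokenizer_kwargs)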

tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py (1)

178-187: Patch apply/revert logic is correct

Original forward is saved/restored and replacement is scoped to the patch lifetime. LGTM.

Also applies to: 188-194
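
The save/replace/restore shape being approved, in generic form (class and attribute names are illustrative, not the actual Llama4VisionPatch):

class _ForwardPatchSketch:
    """Swap a class's forward for export and put the original back afterwards."""

    def __init__(self, target_cls, replacement_forward):
        self.target_cls = target_cls
        self.replacement_forward = replacement_forward
        self._original_forward = None

    def _apply_patch(self):
        # Save the original so revert is always possible.
        self._original_forward = self.target_cls.forward
        self.target_cls.forward = self.replacement_forward

    def _revert_patch(self):
        # Restore unconditionally so the patch never outlives its scope.
        self.target_cls.forward = self._original_forward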

tensorrt_llm/_torch/auto_deploy/export/export.py (1)

21-22: Delegating selection to apply_export_patches: good cleanup

Centralizing validation/selection inside interface.apply_export_patches simplifies export.py. LGTM.

Also applies to: 233-235
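
Roughly, the caller side now reduces to forwarding both arguments (a sketch under the assumption that export.py does no selection of its own; the real export path also lifts to meta, handles dynamic shapes, etc.):

import torch

from tensorrt_llm._torch.auto_deploy.export.interface import apply_export_patches

def export_with_patches(model, example_args: tuple, patch_configs=None, patch_list=None):
    # Validation and patch selection happen inside apply_export_patches;
    # the export entry point just forwards whatever the caller provided.
    with apply_export_patches(patch_configs=patch_configs, patch_list=patch_list):
        return torch.export.export(model, args=example_args)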

tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_llama4_vlm_patch.py (2)

1-1: Add required NVIDIA copyright header.

Per coding guidelines, prepend the standard NVIDIA copyright header (current year).

Apply this diff at the top of the file:

+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
⛔ Skipped due to learnings
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

48-50: Fix dtype conversion: .to(torch.bfloat16) passes dtype as device.

BatchFeature/BatchEncoding expects dtype as a keyword; current call can error or be ignored.

Apply:

-        .to(model.device)
-        .to(torch.bfloat16)
+        .to(model.device)
+        .to(dtype=torch.bfloat16)

Likely an incorrect or invalid review comment.

@lucaslie lucaslie force-pushed the ll/vlm_arch branch 2 times, most recently from 8504bbd to c263049 Compare August 28, 2025 22:38
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
tensorrt_llm/_torch/auto_deploy/export/interface.py (1)

1-1: Add NVIDIA 2025 copyright header (required).

Per repository guidelines, prepend the standard NVIDIA header to this source file.

+ # Copyright 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ # Redistribution and use in source and binary forms, with or without
+ # modification, are permitted provided that the following conditions are met:
+ #  * Redistributions of source code must retain the above copyright notice,
+ #    this list of conditions and the following disclaimer.
+ #  * Redistributions in binary form must reproduce the above copyright notice,
+ #    this list of conditions and the following disclaimer in the documentation
+ #    and/or other materials provided with the distribution.
+ #  * Neither the name of NVIDIA CORPORATION nor the names of its contributors
+ #    may be used to endorse or promote products derived from this software
+ #    without specific prior written permission.
+ # THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS “AS IS” AND ANY EXPRESS
+ # OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
+ # OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN
+ # NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
+ # INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES; HOWEVER
+ # CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+ # OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (1)

1-1: Add NVIDIA 2025 copyright header (required).

This production Python module must include the standard NVIDIA header.

+ # Copyright 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ # Redistribution and use in source and binary forms, with or without
+ # modification, are permitted provided that the following conditions are met:
+ #  * Redistributions of source code must retain the above copyright notice,
+ #    this list of conditions and the following disclaimer.
+ #  * Redistributions in binary form must reproduce the above copyright notice,
+ #    this list of conditions and the following disclaimer in the documentation
+ #    and/or other materials provided with the distribution.
+ #  * Neither the name of NVIDIA CORPORATION nor the names of its contributors
+ #    may be used to endorse or promote products derived from this software
+ #    without specific prior written permission.
+ # THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS “AS IS” AND ANY EXPRESS
+ # OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
+ # OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN
+ # NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
+ # INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES; HOWEVER
+ # CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+ # OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
tensorrt_llm/_torch/auto_deploy/llm_args.py (1)

1-1: Add NVIDIA 2025 copyright header (required).

Please prepend the standard header block to this file.

+ # Copyright 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ # Redistribution and use in source and binary forms, with or without
+ # modification, are permitted provided that the following conditions are met:
+ #  * Redistributions of source code must retain the above copyright notice,
+ #    this list of conditions and the following disclaimer.
+ #  * Redistributions in binary form must reproduce the above copyright notice,
+ #    this list of conditions and the following disclaimer in the documentation
+ #    and/or other materials provided with the distribution.
+ #  * Neither the name of NVIDIA CORPORATION nor the names of its contributors
+ #    may be used to endorse or promote products derived from this software
+ #    without specific prior written permission.
+ # THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS “AS IS” AND ANY EXPRESS
+ # OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
+ # OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN
+ # NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
+ # INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES; HOWEVER
+ # CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+ # OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
♻️ Duplicate comments (8)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (2)

543-546: Enforce uniform seq_len for uncached SDPA to avoid view/shape errors.

_shape_for_forward() assumes a single sl for uncached mode, but nest_sequences() accepts ragged input_ids. Add a uniformity check (or pad) when not self._is_cached_attn.

-        self._store_arg("seq_len", [len(ids) for ids in input_ids], reset=True)
+        seq_lens = [len(ids) for ids in input_ids]
+        if not self._is_cached_attn:
+            assert len(set(seq_lens)) == 1, (
+                f"Uncached attention requires uniform seq_len; got {seq_lens}"
+            )
+        self._store_arg("seq_len", seq_lens, reset=True)

616-619: Fix split dimension in unnest_sequences for uncached vs cached modes.

Current logic uses is_generate to pick squeeze axis and then splits on dim 0, which breaks uncached SDPA ([B, S, ...]). Branch on _is_cached_attn and split along the actual sequence dimension.

-    def unnest_sequences(self, t_nested: torch.Tensor) -> List[torch.Tensor]:
-        t_squeezed = t_nested.squeeze(1) if self.is_generate else t_nested.squeeze(0)
-        return list(torch.split(t_squeezed, self.seq_len))
+    def unnest_sequences(self, t_nested: torch.Tensor) -> List[torch.Tensor]:
+        if self._is_cached_attn:
+            # cached mode: [B,1,...] for decode or [1,S_total,...] for prefill
+            t_squeezed = t_nested.squeeze(1) if self.is_generate else t_nested.squeeze(0)
+            return list(torch.split(t_squeezed, self.seq_len, dim=0))
+        # uncached SDPA: [B, S, ...] -> split along sequence length (dim=1)
+        return list(torch.split(t_nested, self.seq_len, dim=1))
tensorrt_llm/_torch/auto_deploy/llm.py (6)

75-76: Tokenizer encode fallback for tiktoken-style implementations.

Guard against TypeError when add_special_tokens is unsupported.

-            token_ids = self.tokenizer.encode(inputs["prompt"], **kwargs)
+            try:
+                token_ids = self.tokenizer.encode(inputs["prompt"], **kwargs)
+            except TypeError:
+                token_ids = self.tokenizer.encode(inputs["prompt"])

1-1: Add NVIDIA 2025 copyright header (required).

Header is missing.

+ # Copyright 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ # Redistribution and use in source and binary forms, with or without
+ # modification, are permitted provided that the following conditions are met:
+ #  * Redistributions of source code must retain the above copyright notice,
+ #    this list of conditions and the following disclaimer.
+ #  * Redistributions in binary form must reproduce the above copyright notice,
+ #    this list of conditions and the following disclaimer in the documentation
+ #    and/or other materials provided with the distribution.
+ #  * Neither the name of NVIDIA CORPORATION nor the names of its contributors
+ #    may be used to endorse or promote products derived from this software
+ #    without specific prior written permission.
+ # THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS “AS IS” AND ANY EXPRESS
+ # OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
+ # OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN
+ # NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT,
+ # INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES; HOWEVER
+ # CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+ # OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

30-31: Don't require processor for text-only inputs.

Only enforce processor when messages are provided; otherwise fall back to tokenizer.

-        if self.processor is None:
-            raise ValueError("processor is required to tokenize inputs")
+        # Only required for chat-style/multimodal inputs
+        if "messages" in inputs and self.processor is None:
+            raise ValueError("processor is required to tokenize chat-style inputs")

37-41: Preserve add_special_tokens when adding truncation options.

Using assignment drops previously set keys. Use update().

-        if sampling_params.truncate_prompt_tokens is not None:
-            kwargs = {
-                "truncation": True,
-                "max_length": sampling_params.truncate_prompt_tokens,
-            }
+        if sampling_params.truncate_prompt_tokens is not None:
+            kwargs.update({
+                "truncation": True,
+                "max_length": sampling_params.truncate_prompt_tokens,
+            })

66-74: Handle 1D vs 2D input_ids from apply_chat_template.

Some processors return [S] instead of [1, S]. Avoid the strict assert and normalize shape.

-            token_ids = all_args.pop("input_ids")
-            assert token_ids.shape[0] == 1, "messages should be unbatched at this point."
+            token_ids = all_args.pop("input_ids")
+            if token_ids.ndim == 2:
+                token_ids = token_ids[0]
+            elif token_ids.ndim != 1:
+                raise ValueError(f"Unexpected token_ids shape: {tuple(token_ids.shape)}")
@@
-            return token_ids[0].tolist(), extra_processed_inputs
+            return token_ids.tolist(), extra_processed_inputs

125-129: Broaden tokenizer assertion to TokenizerBase (or remove assert).

Restricting to TransformersTokenizer blocks valid TokenizerBase implementations (e.g., tiktoken wrapper).

-        assert isinstance(self.input_processor, DefaultInputProcessor)
-        assert self.tokenizer is None or isinstance(self.tokenizer, TransformersTokenizer)
+        assert isinstance(self.input_processor, DefaultInputProcessor)
+        assert self.tokenizer is None or isinstance(self.tokenizer, TokenizerBase)
🧹 Nitpick comments (3)
tensorrt_llm/_torch/auto_deploy/export/interface.py (2)

217-237: Document new parameters and behaviors in the docstring.

apply_export_patches() now accepts patch_list and mutually excludes it with patch_configs, and defaults to “all registered”. Update the docstring to reflect args, mutual exclusion, defaults, and error cases.

 def apply_export_patches(
     patch_configs: Optional[Dict[str, Union[ExportPatchConfig, Dict[str, Any]]]] = None,
     patch_list: Optional[List[str]] = None,
 ):
-    """Context manager to apply multiple patches.
-
-    Args:
-        patch_configs: Dict mapping patch names to their configurations.
-    """
+    """Context manager to apply multiple patches.
+
+    Args:
+        patch_configs: Dict mapping patch name -> config (dict or ExportPatchConfig).
+        patch_list: List of patch names to apply with default configs.
+    Raises:
+        ValueError: If both `patch_configs` and `patch_list` are provided.
+    Behavior:
+        - If `patch_list` is provided, it is converted to default `patch_configs`.
+        - If neither is provided, all registered patches are applied with defaults.
+    """

239-261: Preserve deterministic apply order and validate early.

Order follows dict insertion. If patch_configs comes from an external source, consider using an OrderedDict or sorting for reproducibility, and validate unknown names up front for a clearer error.

-    patches = [ExportPatchRegistry.create_patch(k, conf) for k, conf in patch_configs.items()]
+    # Optional: materialize items once for a stable, reproducible apply order
+    items = list(patch_configs.items())
+    # Early validation for clearer error location
+    for name, _ in items:
+        ExportPatchRegistry.get(name)
+    patches = [ExportPatchRegistry.create_patch(k, conf) for k, conf in items]
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (1)

180-201: Guard _shape_for_forward against ragged uncached inputs.

Given the view(bs, sl, ...) on uncached paths, ensure sl is consistent across batch or derive sl from the actual tensor shape in uncached mode. This ties to the uniformity check above.
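
One way to make that concrete (a sketch only; the real _shape_for_forward covers more cases and the helper name here is hypothetical):

import torch
from typing import List

def _shape_for_forward_uncached(flat_ids: torch.Tensor, seq_len: List[int]) -> torch.Tensor:
    # Uncached SDPA expects a rectangular [B, S, ...] view, so the per-sequence
    # lengths must agree; fail loudly instead of mis-viewing the data.
    if len(set(seq_len)) != 1:
        raise ValueError(f"Uncached attention requires uniform seq_len; got {seq_len}")
    bs, sl = len(seq_len), seq_len[0]
    return flat_ids.view(bs, sl, *flat_ids.shape[1:])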

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between e38f7fb and c263049.

📒 Files selected for processing (24)
  • examples/auto_deploy/.gitignore (1 hunks)
  • examples/auto_deploy/build_and_run_ad.py (3 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (7 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py (2 hunks)
  • tensorrt_llm/_torch/auto_deploy/export/export.py (2 hunks)
  • tensorrt_llm/_torch/auto_deploy/export/interface.py (3 hunks)
  • tensorrt_llm/_torch/auto_deploy/llm.py (4 hunks)
  • tensorrt_llm/_torch/auto_deploy/llm_args.py (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/models/factory.py (3 hunks)
  • tensorrt_llm/_torch/auto_deploy/models/hf.py (5 hunks)
  • tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py (5 hunks)
  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (5 hunks)
  • tensorrt_llm/_torch/auto_deploy/shim/demollm.py (8 hunks)
  • tensorrt_llm/_torch/auto_deploy/transform/library/export_to_gm.py (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/transformations/library/kvcache.py (2 hunks)
  • tensorrt_llm/executor/worker.py (1 hunks)
  • tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py (2 hunks)
  • tests/unittest/_torch/auto_deploy/_utils_test/_model_test_utils.py (0 hunks)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_llama4_vlm_patch.py (1 hunks)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/shim/test_engine.py (0 hunks)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/shim/test_llm_config.py (1 hunks)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_build_small_single.py (1 hunks)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py (0 hunks)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py (5 hunks)
💤 Files with no reviewable changes (3)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/shim/test_engine.py
  • tests/unittest/_torch/auto_deploy/_utils_test/_model_test_utils.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py
🚧 Files skipped from review as they are similar to previous changes (17)
  • examples/auto_deploy/.gitignore
  • tensorrt_llm/_torch/auto_deploy/transform/library/export_to_gm.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_build_small_single.py
  • tensorrt_llm/_torch/auto_deploy/transformations/library/kvcache.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_llama4_vlm_patch.py
  • tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py
  • examples/auto_deploy/build_and_run_ad.py
  • tensorrt_llm/executor/worker.py
  • tensorrt_llm/_torch/auto_deploy/models/hf.py
  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
  • tensorrt_llm/_torch/auto_deploy/models/factory.py
  • tensorrt_llm/_torch/auto_deploy/export/export.py
  • tensorrt_llm/_torch/auto_deploy/shim/demollm.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/shim/test_llm_config.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py
  • tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{cpp,cc,cxx,cu,py,h,hpp,hh,hxx,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use spaces only; no tabs; indent by 4 spaces

Files:

  • tensorrt_llm/_torch/auto_deploy/llm_args.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
  • tensorrt_llm/_torch/auto_deploy/llm.py
  • tensorrt_llm/_torch/auto_deploy/export/interface.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent with 4 spaces; no tabs
Keep module namespace on import; import the module, not individual names; use module.symbol
Python filenames use snake_case (e.g., some_file.py)
Class names use PascalCase
Function and method names use snake_case
Local variables use snake_case; if starting with a number, prefix with k_ (e.g., k_99th_percentile)
Global variables use UPPER_SNAKE with G_ prefix (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE
Avoid shadowing variables from outer scopes
Initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with trailing docstrings
Avoid reflection when simple alternatives exist (e.g., avoid dict(**locals()) patterns)
Limit except clauses to specific exceptions; avoid bare except
When duck-typing with try/except, keep try body minimal and use else for logic

Files:

  • tensorrt_llm/_torch/auto_deploy/llm_args.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
  • tensorrt_llm/_torch/auto_deploy/llm.py
  • tensorrt_llm/_torch/auto_deploy/export/interface.py
**/*.{cpp,cc,cxx,cu,h,hpp,hh,hxx,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/_torch/auto_deploy/llm_args.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
  • tensorrt_llm/_torch/auto_deploy/llm.py
  • tensorrt_llm/_torch/auto_deploy/export/interface.py
🧠 Learnings (1)
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • tensorrt_llm/_torch/auto_deploy/llm.py
🧬 Code graph analysis (4)
tensorrt_llm/_torch/auto_deploy/llm_args.py (1)
tensorrt_llm/llmapi/llm_args.py (1)
  • Field (67-94)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (2)
tensorrt_llm/_torch/auto_deploy/shim/interface.py (3)
  • args (24-26)
  • dynamic_shapes (29-31)
  • to (33-37)
tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py (1)
  • nest_sequences (67-79)
tensorrt_llm/_torch/auto_deploy/llm.py (5)
tensorrt_llm/inputs/registry.py (1)
  • DefaultInputProcessor (44-111)
tensorrt_llm/_torch/auto_deploy/models/factory.py (5)
  • tokenizer (48-50)
  • ModelFactory (15-243)
  • init_tokenizer (107-114)
  • init_processor (116-123)
  • prefetch_checkpoint (125-136)
tensorrt_llm/llmapi/tokenizer.py (3)
  • TokenizerBase (24-25)
  • TransformersTokenizer (28-267)
  • tokenizer_factory (270-290)
tensorrt_llm/sampling_params.py (1)
  • SamplingParams (125-486)
tensorrt_llm/_torch/auto_deploy/llm_args.py (1)
  • create_factory (204-215)
tensorrt_llm/_torch/auto_deploy/export/interface.py (2)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (1)
  • has (833-834)
tensorrt_llm/_torch/auto_deploy/models/factory.py (1)
  • has (263-264)
🔇 Additional comments (2)
tensorrt_llm/_torch/auto_deploy/export/interface.py (1)

186-188: Changing to ValueError is a breaking API change; confirm callers/tests.

ExportPatchRegistry.get() now raises ValueError for unknown patches. Verify downstream code/tests expecting KeyError or assertions are updated accordingly.
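
For example, a test that previously asserted a KeyError would need an update along these lines (illustrative pytest snippet, not an existing test):

import pytest

from tensorrt_llm._torch.auto_deploy.export.interface import ExportPatchRegistry

def test_unknown_patch_raises_value_error():
    # Callers and tests that expected KeyError (or a failed assert) should now
    # expect a descriptive ValueError for unknown patch names.
    with pytest.raises(ValueError):
        ExportPatchRegistry.get("definitely_not_a_registered_patch")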

tensorrt_llm/_torch/auto_deploy/llm.py (1)

105-111: Factory init: propagate tokenizer None safely.

tokenizer_factory(self.factory.init_tokenizer()) handles None, LGTM.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (6)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (2)

1-10: Add NVIDIA SPDX header at file top.

Required by repo guidelines. Place SPDX lines before the module docstring.

Apply:

+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
 """
 Attention Interface to handle various attention operators and cache operations.

399-413: Bug: slice index is a tensor; must convert to int.

num_seq = torch.sum(seq_len > 0) yields a 0-d tensor. Slicing with it raises TypeError.

Apply:

-        if s > 1:
-            num_seq = torch.sum(seq_len > 0)
-            assert seq_len[num_seq:].sum() == 0, "seq_len should be zero-padded"
+        if s > 1:
+            num_seq_t = torch.count_nonzero(seq_len)
+            num_seq = int(num_seq_t.item())
+            assert int(seq_len[num_seq:].sum().item()) == 0, "seq_len should be zero-padded"
tensorrt_llm/_torch/auto_deploy/models/hf.py (3)

1-2: Add NVIDIA SPDX header.

Apply:

+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
 """Interface to initialize and load HF models."""

82-85: Python 3.8+: replace QuantConfigReader | None.

Apply:

-        self._quant_config_reader: QuantConfigReader | None = None
+        self._quant_config_reader: Optional[QuantConfigReader] = None

Ensure Optional is imported (it is).


239-241: Avoid bare except; narrow to HF hub exceptions.

Catching everything masks real errors.

Apply:

-        except:  # noqa: E722
-            pass
+        except Exception:
+            # Swallow hub inspection errors only; proceed conservatively.
+            pass

Prefer specific HF exceptions if available in your dependency set.
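
If the installed huggingface_hub version exposes them, the narrower form could look like this (the exact import location varies across releases, so treat this as a sketch):

import huggingface_hub
from huggingface_hub.utils import HfHubHTTPError, RepositoryNotFoundError

def _try_model_info(model_name: str):
    try:
        return huggingface_hub.model_info(model_name)
    except (RepositoryNotFoundError, HfHubHTTPError):
        # Hub inspection failed; proceed conservatively rather than masking all errors.
        return None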

examples/auto_deploy/build_and_run_ad.py (1)

1-1: Missing required NVIDIA 2025 Apache-2.0 header.

Insert at file top:

# Copyright (c) 2025, NVIDIA CORPORATION.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#     http://www.apache.org/licenses/LICENSE-2.0
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
♻️ Duplicate comments (7)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (2)

495-505: Python 3.8+ compatibility: replace PEP 604 unions.

List[int | float] breaks on 3.8. Use List[Union[int, float]].

Apply:

-        tnsr_like: List[int | float],
+        tnsr_like: List[Union[int, float]],

Imports already include Union.


616-619: Make unnest_sequences robust across cached/uncached.

Current logic assumes cached shapes and mis-splits SDPA tensors [B, S, …].

Apply:

-    def unnest_sequences(self, t_nested: torch.Tensor) -> List[torch.Tensor]:
-        t_squeezed = t_nested.squeeze(1) if self.is_generate else t_nested.squeeze(0)
-        return list(torch.split(t_squeezed, self.seq_len))
+    def unnest_sequences(self, t_nested: torch.Tensor) -> List[torch.Tensor]:
+        if self._is_cached_attn:
+            t_squeezed = t_nested.squeeze(1) if self.is_generate else t_nested.squeeze(0)
+            return list(torch.split(t_squeezed, self.seq_len, dim=0))
+        # uncached SDPA: [B, S, ...]
+        return list(torch.split(t_nested, self.seq_len, dim=1))
tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (1)

1-1: Add NVIDIA SPDX header.

Apply:

+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
 from collections import defaultdict
tensorrt_llm/_torch/auto_deploy/models/hf.py (1)

473-475: Make placeholder pixel_values shape configurable (not hardcoded 336x336).

Apply:

-        none_pixel_values = torch.zeros(0, 3, 336, 336)
+        # Derive sensible defaults from config/processor when available
+        channels = getattr(getattr(self, "model_config", None), "num_channels", 3)
+        default_size = getattr(getattr(self, "model_config", None), "image_size", 336)
+        none_pixel_values = torch.zeros(0, channels, default_size, default_size)

Alternatively, read from processor.image_processor if available.

examples/auto_deploy/build_and_run_ad.py (1)

75-79: Guard against empty queries; clarify/clean up replication logic.

  • len(queries) == 0 raises ZeroDivisionError.
  • Past feedback questioned repeating prompts to fill batch. If intentional, keep but make explicit and ceiling the repeat count.
-        queries = queries * (batch_size // len(queries) + 1)
-        queries = queries[:batch_size]
+        if not queries:
+            raise ValueError("PromptConfig.queries must not be empty.")
+        # repeat queries to cover batch_size, then trim
+        repeats = (batch_size + len(queries) - 1) // len(queries)  # ceil(batch_size/len)
+        queries = (queries * repeats)[:batch_size]

If replication is optional, consider a flag (e.g., repeat_to_batch: bool = True) with docstring.

tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py (2)

110-110: Consider more robust image detection logic.

Using torch.any(pixel_values != 0) to detect images may fail for edge cases where valid images have all-zero pixels (e.g., completely black images). The previous review suggested checking the shape instead.

Consider using shape-based detection:

     # decide by whether there is any non-zero pixel_values
-    has_image: torch.Tensor = torch.any(pixel_values != 0)
+    # Check if pixel_values has actual image data (batch dimension > 0)
+    has_image: torch.Tensor = pixel_values.shape[0] > 0

1-11: Add copyright header

All source files should include the NVIDIA copyright header as per coding guidelines.

Add the copyright header at the beginning of the file:

+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
 """A patch to handle vision branch in Llama4ForConditionalGeneration."""

 from typing import List, Optional, Tuple, Union
🧹 Nitpick comments (7)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (2)

603-615: Ensure dtype match in masked_scatter path.

packed_input_ids may be Long; destination input_ids is Long after earlier fix, but keep this robust.

Apply:

-        input_ids_device.masked_scatter_(input_ids_device == scatter_ref, packed_input_ids)
+        input_ids_device.masked_scatter_(
+            input_ids_device == scatter_ref, packed_input_ids.to(input_ids_device.dtype)
+        )

86-92: Constructor args shadow public members; consider trailing underscores per guideline.

Parameters like max_seq_len → self.max_seq_len violate the repo rule.

Apply (API-touching; defer if churn is high):

-    def __init__(self, max_seq_len: int = 1, max_batch_size: int = 1, page_size: int = 0, max_num_tokens: Optional[int] = None):
+    def __init__(self, max_seq_len_: int = 1, max_batch_size_: int = 1, page_size_: int = 0, max_num_tokens_: Optional[int] = None):
@@
-        self.max_seq_len = max_seq_len
-        self.max_batch_size = max_batch_size
-        self.page_size = page_size if page_size > 0 else max_seq_len
+        self.max_seq_len = max_seq_len_
+        self.max_batch_size = max_batch_size_
+        self.page_size = page_size_ if page_size_ > 0 else max_seq_len_
@@
-        if max_num_tokens is None or max_num_tokens < 1:
+        if max_num_tokens_ is None or max_num_tokens_ < 1:
             self.max_num_tokens = self.max_batch_size * max_seq_len_adjusted
         else:
-            self.max_num_tokens = max_num_tokens
+            self.max_num_tokens = max_num_tokens_

Propagate call-site updates accordingly.

Also applies to: 113-118

tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (1)

65-69: Fix logger usage: message formatting drops num_pages.

Pass a single formatted string.

Apply:

-        ad_logger.info("Using fake cache manager with head_dim=0 and num pages:", self.num_blocks)
+        ad_logger.info(f"Using fake cache manager with head_dim=0 and num pages: {self.num_blocks}")
tensorrt_llm/_torch/auto_deploy/models/hf.py (2)

439-454: Reuse initialized processor/tokenizer to avoid duplicate downloads.

Minor: prefer processor = self.init_processor() to honor overrides and caching.

Apply:

-        processor = AutoProcessor.from_pretrained(self.tokenizer, **self.tokenizer_kwargs)
+        processor = self.init_processor()
+        assert processor is not None, "Processor not initialized"

111-118: Bound-method call style is intentional; add a brief comment to preempt confusion.

type(model).forward(model, …) bypasses potential overrides on the instance.

Apply:

-        return type(model).forward(model, input_ids=input_ids, position_ids=position_ids)
+        # Call the class-level forward directly so instance-level patches don't interfere with export.
+        return type(model).forward(model, input_ids=input_ids, position_ids=position_ids)
examples/auto_deploy/build_and_run_ad.py (2)

29-31: Tighten typing for PromptInput.

Use explicit key/value typing:

-PromptInput = Union[str, Dict, List[Dict]]
+PromptInput = Union[str, Dict[str, Any], List[Dict[str, Any]]]

41-62: Polish example prompts (typos/whitespace).

  • Remove trailing spaces to avoid accidental token bias.
  • Fix grammar: “What is the capital of Iceland?”
-            "How big is the universe? ",
+            "How big is the universe?",
@@
-            {"prompt": "In simple words and a single sentence, explain the concept of gravity: "},
+            {"prompt": "In simple words and a single sentence, explain the concept of gravity:"},
@@
-            {"prompt": "Where is the capital of Iceland? "},
+            {"prompt": "What is the capital of Iceland?"},
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between c263049 and 34ffebd.

📒 Files selected for processing (12)
  • examples/auto_deploy/.gitignore (1 hunks)
  • examples/auto_deploy/build_and_run_ad.py (3 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (7 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py (2 hunks)
  • tensorrt_llm/_torch/auto_deploy/export/export.py (2 hunks)
  • tensorrt_llm/_torch/auto_deploy/export/interface.py (3 hunks)
  • tensorrt_llm/_torch/auto_deploy/llm.py (4 hunks)
  • tensorrt_llm/_torch/auto_deploy/llm_args.py (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/models/factory.py (3 hunks)
  • tensorrt_llm/_torch/auto_deploy/models/hf.py (5 hunks)
  • tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py (5 hunks)
  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (5 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py
  • tensorrt_llm/_torch/auto_deploy/llm_args.py
  • tensorrt_llm/_torch/auto_deploy/llm.py
  • tensorrt_llm/_torch/auto_deploy/models/factory.py
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cc,cxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{h,hpp,hh,hxx,cpp,cc,cxx,cu,cuh,py}: If a constructor parameter name conflicts with a public member, add a trailing underscore to the parameter (e.g., foo_).
Use uppercase literal suffixes (e.g., 1234L not 1234l).
Use spaces, not tabs; indent by 4 spaces.

Files:

  • tensorrt_llm/_torch/auto_deploy/export/interface.py
  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
  • examples/auto_deploy/build_and_run_ad.py
  • tensorrt_llm/_torch/auto_deploy/export/export.py
  • tensorrt_llm/_torch/auto_deploy/models/hf.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
  • tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Target Python 3.8+ for all Python code.
Indent with 4 spaces; do not use tabs.
Maintain module namespace on imports; import the module/submodule, not individual classes/functions (e.g., from package.subpackage import foo; foo.SomeClass()).
Python filenames use snake_case (e.g., some_file.py).
Class names use PascalCase.
Function and method names use snake_case.
Local variable names use snake_case; if starting with a number, prefix with k_ (e.g., k_99th_percentile).
Global variables use UPPER_SNAKE_CASE and prefix G_ (e.g., G_MY_GLOBAL).
Constants use UPPER_SNAKE_CASE.
Avoid shadowing outer-scope variables.
Initialize all externally visible members of a class in init.
Prefer docstrings for interfaces used outside a file; use comments for local code within functions or local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline with docstrings placed under the definition.
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns).
Limit except clauses to specific exceptions; avoid bare except.
For duck-typing try/except, keep try blocks minimal and use else for the main logic.

Files:

  • tensorrt_llm/_torch/auto_deploy/export/interface.py
  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
  • examples/auto_deploy/build_and_run_ad.py
  • tensorrt_llm/_torch/auto_deploy/export/export.py
  • tensorrt_llm/_torch/auto_deploy/models/hf.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
  • tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py
**/*.{cpp,cc,cxx,cu,h,hpp,hh,hxx,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA 2025 Apache-2.0 copyright header block at the top of all source files (.cpp, .h, .cu, .py).

Files:

  • tensorrt_llm/_torch/auto_deploy/export/interface.py
  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
  • examples/auto_deploy/build_and_run_ad.py
  • tensorrt_llm/_torch/auto_deploy/export/export.py
  • tensorrt_llm/_torch/auto_deploy/models/hf.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
  • tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py
🧬 Code graph analysis (7)
tensorrt_llm/_torch/auto_deploy/export/interface.py (2)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (1)
  • has (833-834)
tensorrt_llm/_torch/auto_deploy/models/factory.py (1)
  • has (277-278)
tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (6)
tensorrt_llm/_torch/auto_deploy/llm.py (1)
  • factory (86-89)
tensorrt_llm/_torch/auto_deploy/llm_args.py (1)
  • create_factory (215-226)
tensorrt_llm/_torch/auto_deploy/models/factory.py (1)
  • get_extra_inputs (244-257)
tensorrt_llm/_torch/auto_deploy/models/hf.py (1)
  • get_extra_inputs (455-474)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (5)
  • add_extra_arg (620-643)
  • page_assignments (309-311)
  • input_pos (271-272)
  • nest_sequences (521-588)
  • rescatter_input_ids (590-614)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
  • get_cache_indices (572-583)
examples/auto_deploy/build_and_run_ad.py (1)
tensorrt_llm/llmapi/llm_args.py (1)
  • Field (69-96)
tensorrt_llm/_torch/auto_deploy/export/export.py (2)
tensorrt_llm/_torch/auto_deploy/export/interface.py (1)
  • apply_export_patches (217-260)
tensorrt_llm/_torch/auto_deploy/transformations/_graph.py (1)
  • lift_to_meta (76-89)
tensorrt_llm/_torch/auto_deploy/models/hf.py (2)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (3)
  • CacheConfig (26-29)
  • device (177-178)
  • to (431-438)
tensorrt_llm/_torch/auto_deploy/models/factory.py (6)
  • model (53-55)
  • tokenizer (58-60)
  • init_tokenizer (121-128)
  • init_processor (130-137)
  • get_example_inputs (232-242)
  • get_extra_inputs (244-257)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (2)
tensorrt_llm/_torch/auto_deploy/shim/interface.py (3)
  • args (24-26)
  • dynamic_shapes (29-31)
  • to (33-37)
tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py (1)
  • nest_sequences (67-79)
tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py (2)
tensorrt_llm/_torch/models/modeling_llama.py (1)
  • Llama4ForConditionalGeneration (1179-1240)
tensorrt_llm/_torch/auto_deploy/export/interface.py (7)
  • BaseExportPatch (38-130)
  • ExportPatchRegistry (166-213)
  • register (172-181)
  • _apply_patch (123-125)
  • _apply_patch (154-157)
  • _revert_patch (128-130)
  • _revert_patch (159-163)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (8)
tensorrt_llm/_torch/auto_deploy/export/export.py (2)

21-21: LGTM! Clean separation of concerns.

The import change aligns well with the refactored patch selection logic now being handled inside apply_export_patches.


234-234: LGTM! Improved API flexibility.

The change to pass both patch_configs and patch_list to apply_export_patches provides better flexibility for patch configuration while maintaining backward compatibility.

tensorrt_llm/_torch/auto_deploy/export/interface.py (3)

8-8: LGTM! Necessary import for new optional parameters.

The addition of Optional import supports the new optional parameters in the refactored API.


186-188: Good error handling improvement.

Raising a descriptive ValueError for unknown patches improves debugging and prevents silent failures.


217-260: Well-designed patch configuration logic.

The refactored apply_export_patches function provides excellent flexibility:

  • Clear mutual exclusivity validation
  • Convenient patch_list to patch_configs conversion
  • Sensible default behavior when no patches are specified
  • Clean early exit when no patches to apply

The implementation is robust and user-friendly.
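
In practice the two entry points look something like this (call forms only; the inner config dicts are left empty because their options are patch-specific):

from tensorrt_llm._torch.auto_deploy.export.interface import apply_export_patches

# Select patches by name; each gets its default configuration.
with apply_export_patches(patch_list=["sdpa", "hf_llama4_vision"]):
    ...

# Or pass per-patch configuration dicts explicitly.
with apply_export_patches(patch_configs={"sdpa": {}, "hf_llama4_vision": {}}):
    ...

# Passing both raises ValueError; passing neither applies all registered patches.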

tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py (3)

79-81: LGTM! Correct device and dtype handling.

The conversion of projected vision features to match the embeddings' device and dtype ensures compatibility and prevents runtime errors.
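
I.e., the usual alignment pattern, shown generically (names are placeholders):

import torch

def _align_vision_features(vision_flat: torch.Tensor, inputs_embeds: torch.Tensor) -> torch.Tensor:
    # Match both device and dtype of the text embeddings before scattering the
    # vision features into them; mismatches would fail at masked_scatter time.
    return vision_flat.to(device=inputs_embeds.device, dtype=inputs_embeds.dtype)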


137-149: Good handling of attention mask in loss calculation.

The code correctly handles the 2D attention mask for shifted logits/labels calculation, with proper cropping for cases like PrefixTuning. The implementation is robust and well-documented.
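
The pattern in question, paraphrased from the common HF vision-LM recipe (a sketch, not copied from the patched file):

import torch
import torch.nn.functional as F

def _masked_shift_loss(logits, labels, attention_mask, vocab_size):
    if attention_mask is not None:
        # Crop the mask to the logits length (e.g., PrefixTuning prepends tokens),
        # then keep only attended positions when shifting by one.
        shift_mask = attention_mask[:, -logits.shape[1]:].to(logits.device)
        shift_logits = logits[..., :-1, :][shift_mask[..., :-1] != 0]
        shift_labels = labels[..., 1:][shift_mask[..., 1:] != 0]
    else:
        shift_logits = logits[..., :-1, :].contiguous()
        shift_labels = labels[..., 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, vocab_size), shift_labels.view(-1).to(shift_logits.device)
    )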


170-194: Well-structured patch implementation.

The Llama4VisionPatch class follows the patch interface correctly:

  • Properly registered with the ExportPatchRegistry
  • Clean separation between apply and revert operations
  • Stores original values for safe restoration
  • Clear documentation of the patch's purpose

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
tensorrt_llm/_torch/auto_deploy/shim/demollm.py (2)

53-83: Make page assignment deterministic and handle page exhaustion.

Set iteration order and guard when free pages are insufficient.

-        free_pages = set(range(si.num_pages)) - {i for pages in page_assignments for i in pages}
+        free_pages = sorted(set(range(si.num_pages)) - {i for pages in page_assignments for i in pages})
         updated_assignments = []
         for t_l, pages in zip(total_lens, page_assignments):
             extra_tokens = t_l - len(pages) * si.page_size
-            num_extra_pages = (extra_tokens // si.page_size) + (extra_tokens > 0)
-            updated_assignments.append(pages + [free_pages.pop() for _ in range(num_extra_pages)])
+            num_extra_pages = max(0, (extra_tokens // si.page_size) + (1 if extra_tokens > 0 else 0))
+            if num_extra_pages > len(free_pages):
+                raise RuntimeError(
+                    f"Insufficient free pages: need {num_extra_pages}, have {len(free_pages)}"
+                )
+            take = [free_pages.pop(0) for _ in range(num_extra_pages)]
+            updated_assignments.append(pages + take)
         return updated_assignments

92-97: Replace asserts with explicit errors for API constraints.

Asserts can be optimized out; raise exceptions.

-        assert all(r.sampling_params == sampling_params for r in requests), (
-            "Heterogeneous sampling params are not supported."
-        )
-        assert sampling_params.best_of == 1, "Best-of is not supported."
+        if not all(r.sampling_params == sampling_params for r in requests):
+            raise ValueError("Heterogeneous sampling params are not supported.")
+        if sampling_params.best_of != 1:
+            raise NotImplementedError("Best-of is not supported.")
tensorrt_llm/_torch/auto_deploy/models/hf.py (2)

1-1: Add NVIDIA 2025 Apache-2.0 header.

Required for all production sources.

+ # Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+
  """Interface to initialize and load HF models."""

82-85: Python 3.8+ compatibility: avoid PEP 604 unions.

Use Optional[...] instead of QuantConfigReader | None.

-        self._quant_config_reader: QuantConfigReader | None = None
+        self._quant_config_reader: Optional[QuantConfigReader] = None

Optional is already imported at module level, so no additional import is needed.
♻️ Duplicate comments (13)
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py (1)

208-216: Test 3 still resets before continuation, breaking cache semantics.

Reset wipes sequence context (seq_len/input_pos/pages), so equality with full-context y_model is not guaranteed. Continue without reset or re-prime the cache before comparing only the tail.

Option A (no reset):

-    # Continue inference from previous context
-    cm.info.reset()
     for i_p in range(x.shape[1] - num_reset_steps, x.shape[1]):
         y_with_cache[:, i_p : i_p + 1] = _call_and_unnest(x[:, i_p : i_p + 1], i_p)
     assert all_close(y_model, y_with_cache, atol=atol, rtol=rtol)

Option B (reset + re-prime, compare tail):

-    cm.info.reset()
-    for i_p in range(x.shape[1] - num_reset_steps, x.shape[1]):
-        y_with_cache[:, i_p : i_p + 1] = _call_and_unnest(x[:, i_p : i_p + 1], i_p)
-    assert all_close(y_model, y_with_cache, atol=atol, rtol=rtol)
+    cm.info.reset()
+    for i_p in range(0, x.shape[1] - num_reset_steps):
+        _call_and_unnest(x[:, i_p : i_p + 1], i_p)
+    for i_p in range(x.shape[1] - num_reset_steps, x.shape[1]):
+        y_with_cache[:, i_p : i_p + 1] = _call_and_unnest(x[:, i_p : i_p + 1], i_p)
+    assert all_close(
+        y_model[:, -num_reset_steps:], y_with_cache[:, -num_reset_steps:], atol=atol, rtol=rtol
+    )
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (4)

148-157: Use torch.long for indices.

input_ids and _gather_idx participate in indexing/gather; int32 can break on some backends. Switch to long.

-            "input_ids": torch.ones(self.max_num_tokens, dtype=torch.int),
+            "input_ids": torch.ones(self.max_num_tokens, dtype=torch.long),
@@
-            "_gather_idx": torch.empty(self.max_num_tokens, dtype=torch.int),
+            "_gather_idx": torch.empty(self.max_num_tokens, dtype=torch.long),

543-552: Zero tail for input_pos and enforce uniform S in uncached mode.

  • Stale device-tail values may persist when batch shrinks; pass reset=True.
  • Uncached path relies on rectangular [B,S,...]; assert uniform seq_len to prevent view() errors.
-        self._store_arg("seq_len", [len(ids) for ids in input_ids], reset=True)
+        seq_lens = [len(ids) for ids in input_ids]
+        if not self._is_cached_attn:
+            assert len(set(seq_lens)) == 1, f"Uncached attention requires uniform seq_len; got {seq_lens}"
+        self._store_arg("seq_len", seq_lens, reset=True)
@@
-            self._store_arg(
+            self._store_arg(
                 "input_pos",
                 [input_pos] * self.num_sequences if isinstance(input_pos, int) else input_pos,
-            )
+                reset=True,
+            )

616-619: unnest_sequences splits along the wrong dim in uncached mode.

Current logic assumes cached shapes. Make it mode-aware to avoid mis-splitting prefill outputs.

-    def unnest_sequences(self, t_nested: torch.Tensor) -> List[torch.Tensor]:
-        t_squeezed = t_nested.squeeze(1) if self.is_generate else t_nested.squeeze(0)
-        return list(torch.split(t_squeezed, self.seq_len))
+    def unnest_sequences(self, t_nested: torch.Tensor) -> List[torch.Tensor]:
+        if self._is_cached_attn:
+            t_squeezed = t_nested.squeeze(1) if self.is_generate else t_nested.squeeze(0)
+            return list(torch.split(t_squeezed, self.seq_len, dim=0))
+        # uncached SDPA: [B, S, ...] -> split along sequence length
+        return list(torch.split(t_nested, self.seq_len, dim=1))

495-500: PEP 604 unions are Py3.10+; target is 3.8+.

Replace List[int | float] with List[Union[int, float]].

-        tnsr_like: List[int | float],
+        tnsr_like: List[Union[int, float]],
tensorrt_llm/_torch/auto_deploy/llm_args.py (1)

60-63: Validate model_factory against the registry to catch typos early.

Type widened to str; add a Pydantic validator to fail fast with a clear message.

 class AutoDeployConfig(DynamicYamlMixInForSettings, BaseSettings):
@@
     model_factory: str = Field(
         default="AutoModelForCausalLM",
         description="The model factory to use for loading the model.",
     )
+    @field_validator("model_factory")
+    @classmethod
+    def _validate_model_factory(cls, v: str):
+        if not ModelFactoryRegistry.has(v):
+            available = ", ".join(ModelFactoryRegistry._registry.keys())
+            raise ValueError(f"Unknown model_factory '{v}'. Available: {available}")
+        return v
tensorrt_llm/_torch/auto_deploy/llm.py (6)

1-1: Add NVIDIA 2025 Apache-2.0 header.

Per coding guidelines, prepend the NVIDIA 2025 Apache-2.0 header above imports.

Apply:

+ # Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+
  import types

30-31: Don’t require a processor for text-only inputs.

Gate the processor check behind the messages path.

-        if self.processor is None:
-            raise ValueError("processor is required to tokenize inputs")
+        # Only required for chat-style/multimodal inputs
         if "messages" in inputs:
+            if self.processor is None:
+                raise ValueError("processor is required to tokenize chat-style inputs")

Also applies to: 43-44


33-41: Preserve add_special_tokens when adding truncation params.

Don’t clobber kwargs.

         kwargs = {
             "add_special_tokens": sampling_params.add_special_tokens,
         }
         if sampling_params.truncate_prompt_tokens is not None:
-            kwargs = {
-                "truncation": True,
-                "max_length": sampling_params.truncate_prompt_tokens,
-            }
+            kwargs.update({
+                "truncation": True,
+                "max_length": sampling_params.truncate_prompt_tokens,
+            })

67-74: Handle both [S] and [1, S] from apply_chat_template; avoid assert.

Asserts can be stripped under python -O, so raise an explicit error instead, and support 1D outputs as well as [1, S].

-            token_ids = all_args.pop("input_ids")
-            assert token_ids.shape[0] == 1, "messages should be unbatched at this point."
+            token_ids = all_args.pop("input_ids")
+            # Accept either [1, S] or [S]
+            if token_ids.ndim == 2:
+                token_ids = token_ids[0]
+            elif token_ids.ndim != 1:
+                raise ValueError(f"Unexpected token_ids shape: {tuple(token_ids.shape)}")
             if all_args:
                 extra_processed_inputs = {"multimodal_data": all_args}
             else:
                 extra_processed_inputs = None
-            return token_ids[0].tolist(), extra_processed_inputs
+            return token_ids.tolist(), extra_processed_inputs

74-76: Text-only path: add tiktoken fallback and support optional query.

Replicate DefaultInputProcessor behavior.

-        else:
-            token_ids = self.tokenizer.encode(inputs["prompt"], **kwargs)
-            return token_ids, None
+        else:
+            tiktoken_special_tokens = {
+                "<|startoftext|>", "<|endoftext|>", "<|reserved_200000|>", "<|reserved_200001|>",
+                "<|return|>", "<|constrain|>", "<|reserved_200004|>", "<|channel|>", "<|start|>",
+                "<|end|>", "<|message|>", "<|reserved_200009|>", "<|reserved_200010|>",
+                "<|reserved_200011|>", "<|call|>", "<|reserved_200013|>",
+            }
+            try:
+                token_ids = self.tokenizer.encode(
+                    inputs["prompt"], **kwargs
+                )
+            except TypeError:
+                # tiktoken-style path
+                token_ids = self.tokenizer.encode(
+                    inputs["prompt"], allowed_special=tiktoken_special_tokens
+                )
+
+            if "query" in inputs:
+                try:
+                    query_token_ids = self.tokenizer.encode(
+                        inputs["query"], **kwargs
+                    )
+                except TypeError:
+                    query_token_ids = self.tokenizer.encode(
+                        inputs["query"], allowed_special=tiktoken_special_tokens
+                    )
+                return token_ids, {"query_token_ids": query_token_ids}
+
+            return token_ids, None
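
If the fallback lands, it may be worth factoring the duck-typing into a small helper so each try block stays minimal (in line with the try/except guideline cited below); a rough sketch, assuming allowed_special is the tiktoken-style keyword:

from typing import Any, Dict, List, Set

def encode_with_fallback(tokenizer, text: str, hf_kwargs: Dict[str, Any], allowed_special: Set[str]) -> List[int]:
    """Try the HF-style encode signature first; fall back to a tiktoken-style call on TypeError."""
    try:
        return tokenizer.encode(text, **hf_kwargs)
    except TypeError:
        return tokenizer.encode(text, allowed_special=allowed_special)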

125-128: Prefer explicit checks over assert; allow any TokenizerBase.

Let non-Transformers tokenizers (e.g., tiktoken) pass.

-        assert isinstance(self.input_processor, DefaultInputProcessor)
-        assert self.tokenizer is None or isinstance(self.tokenizer, TransformersTokenizer)
-        self.input_processor = self._create_input_processor()
+        if not isinstance(self.input_processor, DefaultInputProcessor):
+            raise TypeError("input_processor must be DefaultInputProcessor before AD patching")
+        if self.tokenizer is not None and not isinstance(self.tokenizer, TokenizerBase):
+            raise TypeError("tokenizer must be a TokenizerBase (or None)")
+        self.input_processor = self._create_input_processor()
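
For context on why the asserts are worth replacing: under python -O they are compiled away entirely, so they offer no runtime protection. A quick self-contained demonstration:

import subprocess
import sys

check = "assert False, 'never evaluated under -O'"
# With -O the failing assert is stripped and the child process exits cleanly.
print(subprocess.run([sys.executable, "-O", "-c", check]).returncode)  # 0
# Without -O the same snippet raises AssertionError (non-zero exit code).
print(subprocess.run([sys.executable, "-c", check]).returncode)        # 1
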
tensorrt_llm/_torch/auto_deploy/models/hf.py (1)

473-475: Derive placeholder pixel_values shape; avoid hardcoded 336x336.

Make it configurable or read from config.

-        none_pixel_values = torch.zeros(0, 3, 336, 336)
+        # Derive channels/size from config if available; fall back to (3, 336, 336)
+        cfg = getattr(getattr(self, "model_config", None), "vision_config", None)
+        channels = getattr(cfg, "num_channels", 3) if cfg else 3
+        size = getattr(cfg, "image_size", 336) if cfg else 336
+        none_pixel_values = torch.zeros(0, channels, size, size)
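
As a quick sanity check on the attribute names, most HF vision configs expose num_channels and image_size (shown here with CLIPVisionConfig defaults constructed locally; Llama-4's vision config may differ, which is why the getattr fallbacks above still matter):

import torch
from transformers import CLIPVisionConfig

cfg = CLIPVisionConfig()  # constructed locally from defaults, no download needed
channels = getattr(cfg, "num_channels", 3)
size = getattr(cfg, "image_size", 336)
print(torch.zeros(0, channels, size, size).shape)  # empty placeholder batch of images
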
🧹 Nitpick comments (5)
tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py (2)

47-53: Type/use mismatch in _add_hidden_dim annotation.

The method operates on tensors (uses .shape), but is annotated as Sequence[Sequence[Any]]. Tighten the type to accept torch.Tensor (and optionally Sequence) or handle both.

-    def _add_hidden_dim(self, input_ids: Sequence[Sequence[Any]]) -> torch.Tensor:
-        return torch.rand(
-            *input_ids.shape,
+    def _add_hidden_dim(self, input_ids: torch.Tensor) -> torch.Tensor:
+        return torch.rand(
+            *tuple(input_ids.shape),
             self.hidden_size,
             device=self.device,
             dtype=self.dtype,
         )

55-66: Avoid transiently breaking host invariant for input_ids.

Assigning a CPU tensor to _args_host["input_ids"] deviates from the “host is list-like” invariant and is immediately overwritten by reset(). Drop the assignment.

-        self._args_host["input_ids"] = self._args_device["input_ids"].cpu()
tensorrt_llm/_torch/auto_deploy/llm.py (1)

7-8: Import style nit.

Guidelines prefer module imports over symbols; consider switching to module-level imports in new code.

tensorrt_llm/_torch/auto_deploy/shim/demollm.py (1)

134-143: Ensure input_ids type matches SequenceInterface expectations.

Pass Python lists for portability.

-            sequence_info.nest_sequences(
-                token_ids,
+            sequence_info.nest_sequences(
+                token_ids.tolist(),
                 input_pos=input_pos_next,
                 page_assignments=self._assign_pages(total_lens_next),
             )
tensorrt_llm/_torch/auto_deploy/models/hf.py (1)

239-240: Avoid bare except.

Limit to Exception to prevent masking system-exiting exceptions.

-        except:  # noqa: E722
+        except Exception:
             pass
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro


📥 Commits

Reviewing files that changed from the base of the PR and between 34ffebd and 60b2bad.

📒 Files selected for processing (24)
  • examples/auto_deploy/.gitignore (1 hunks)
  • examples/auto_deploy/build_and_run_ad.py (3 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (7 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py (2 hunks)
  • tensorrt_llm/_torch/auto_deploy/export/export.py (2 hunks)
  • tensorrt_llm/_torch/auto_deploy/export/interface.py (3 hunks)
  • tensorrt_llm/_torch/auto_deploy/llm.py (4 hunks)
  • tensorrt_llm/_torch/auto_deploy/llm_args.py (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/models/factory.py (3 hunks)
  • tensorrt_llm/_torch/auto_deploy/models/hf.py (5 hunks)
  • tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py (5 hunks)
  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (5 hunks)
  • tensorrt_llm/_torch/auto_deploy/shim/demollm.py (8 hunks)
  • tensorrt_llm/_torch/auto_deploy/transform/library/export_to_gm.py (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/transform/library/kvcache.py (2 hunks)
  • tensorrt_llm/executor/worker.py (1 hunks)
  • tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py (2 hunks)
  • tests/unittest/_torch/auto_deploy/_utils_test/_model_test_utils.py (0 hunks)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_llama4_vlm_patch.py (1 hunks)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/shim/test_engine.py (0 hunks)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/shim/test_llm_config.py (1 hunks)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_build_small_single.py (1 hunks)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py (0 hunks)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py (4 hunks)
💤 Files with no reviewable changes (3)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_attention_matcher_hf.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/shim/test_engine.py
  • tests/unittest/_torch/auto_deploy/_utils_test/_model_test_utils.py
🚧 Files skipped from review as they are similar to previous changes (13)
  • tensorrt_llm/_torch/auto_deploy/models/factory.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_llama4_vlm_patch.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_build_small_single.py
  • examples/auto_deploy/.gitignore
  • tensorrt_llm/_torch/auto_deploy/transform/library/export_to_gm.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/torch_attention.py
  • tensorrt_llm/_torch/auto_deploy/export/export.py
  • tensorrt_llm/_torch/auto_deploy/export/interface.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/shim/test_llm_config.py
  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
  • tensorrt_llm/executor/worker.py
  • tensorrt_llm/_torch/auto_deploy/models/patches/llama4.py
  • examples/auto_deploy/build_and_run_ad.py
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cc,cxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.{h,hpp,hh,hxx,cpp,cc,cxx,cu,cuh,py}: If a constructor parameter name conflicts with a public member, add a trailing underscore to the parameter (e.g., foo_).
Use uppercase literal suffixes (e.g., 1234L not 1234l).
Use spaces, not tabs; indent by 4 spaces.

Files:

  • tensorrt_llm/_torch/auto_deploy/llm_args.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/kvcache.py
  • tensorrt_llm/_torch/auto_deploy/shim/demollm.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py
  • tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py
  • tensorrt_llm/_torch/auto_deploy/llm.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
  • tensorrt_llm/_torch/auto_deploy/models/hf.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Target Python 3.8+ for all Python code.
Indent with 4 spaces; do not use tabs.
Maintain module namespace on imports; import the module/submodule, not individual classes/functions (e.g., from package.subpackage import foo; foo.SomeClass()).
Python filenames use snake_case (e.g., some_file.py).
Class names use PascalCase.
Function and method names use snake_case.
Local variable names use snake_case; if starting with a number, prefix with k_ (e.g., k_99th_percentile).
Global variables use UPPER_SNAKE_CASE and prefix G_ (e.g., G_MY_GLOBAL).
Constants use UPPER_SNAKE_CASE.
Avoid shadowing outer-scope variables.
Initialize all externally visible members of a class in init.
Prefer docstrings for interfaces used outside a file; use comments for local code within functions or local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline with docstrings placed under the definition.
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns).
Limit except clauses to specific exceptions; avoid bare except.
For duck-typing try/except, keep try blocks minimal and use else for the main logic.

Files:

  • tensorrt_llm/_torch/auto_deploy/llm_args.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/kvcache.py
  • tensorrt_llm/_torch/auto_deploy/shim/demollm.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py
  • tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py
  • tensorrt_llm/_torch/auto_deploy/llm.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
  • tensorrt_llm/_torch/auto_deploy/models/hf.py
**/*.{cpp,cc,cxx,cu,h,hpp,hh,hxx,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA 2025 Apache-2.0 copyright header block at the top of all source files (.cpp, .h, .cu, .py).

Files:

  • tensorrt_llm/_torch/auto_deploy/llm_args.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/kvcache.py
  • tensorrt_llm/_torch/auto_deploy/shim/demollm.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py
  • tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py
  • tensorrt_llm/_torch/auto_deploy/llm.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
  • tensorrt_llm/_torch/auto_deploy/models/hf.py
🧠 Learnings (1)
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • tensorrt_llm/_torch/auto_deploy/llm.py
🧬 Code graph analysis (8)
tensorrt_llm/_torch/auto_deploy/llm_args.py (1)
tensorrt_llm/llmapi/llm_args.py (1)
  • Field (69-96)
tensorrt_llm/_torch/auto_deploy/transform/library/kvcache.py (1)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (5)
  • named_standard_args (223-225)
  • extra_args_for_prepare_metadata (233-235)
  • get_prepare_metadata_op (738-760)
  • args (228-230)
  • set_max_num_tokens_sample (468-473)
tensorrt_llm/_torch/auto_deploy/shim/demollm.py (3)
tensorrt_llm/executor/request.py (1)
  • GenerationRequest (84-129)
tensorrt_llm/inputs/multimodal.py (1)
  • MultimodalParams (152-410)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (6)
  • reset (479-485)
  • nest_sequences (521-588)
  • input_pos (271-272)
  • page_assignments (309-311)
  • seq_len (267-268)
  • args (228-230)
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py (3)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (3)
  • input_pos (271-272)
  • nest_sequences (521-588)
  • reset (479-485)
tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py (1)
  • nest_sequences (67-79)
tests/unittest/_torch/auto_deploy/_utils_test/_torch_test_utils.py (1)
  • all_close (6-21)
tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py (2)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (4)
  • device (177-178)
  • args (228-230)
  • reset (479-485)
  • nest_sequences (521-588)
tensorrt_llm/_torch/auto_deploy/shim/interface.py (1)
  • args (24-26)
tensorrt_llm/_torch/auto_deploy/llm.py (5)
tensorrt_llm/inputs/registry.py (1)
  • DefaultInputProcessor (44-111)
tensorrt_llm/llmapi/llm.py (4)
  • tokenizer (691-695)
  • tokenizer (698-699)
  • prompt (79-80)
  • LLM (1011-1027)
tensorrt_llm/_torch/auto_deploy/models/factory.py (4)
  • tokenizer (58-60)
  • ModelFactory (23-257)
  • init_tokenizer (121-128)
  • init_processor (130-137)
tensorrt_llm/llmapi/tokenizer.py (3)
  • TokenizerBase (24-25)
  • TransformersTokenizer (28-267)
  • tokenizer_factory (270-290)
tensorrt_llm/_torch/auto_deploy/llm_args.py (1)
  • create_factory (215-226)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (2)
tensorrt_llm/_torch/auto_deploy/shim/interface.py (3)
  • args (24-26)
  • dynamic_shapes (29-31)
  • to (33-37)
tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py (1)
  • nest_sequences (67-79)
tensorrt_llm/_torch/auto_deploy/models/hf.py (2)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (3)
  • CacheConfig (26-29)
  • device (177-178)
  • to (431-438)
tensorrt_llm/_torch/auto_deploy/models/factory.py (6)
  • model (53-55)
  • tokenizer (58-60)
  • init_tokenizer (121-128)
  • init_processor (130-137)
  • get_example_inputs (232-242)
  • get_extra_inputs (244-257)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (7)
tensorrt_llm/_torch/auto_deploy/transform/library/kvcache.py (1)

245-245: API rename fix looks good.

cm.info.set_max_num_tokens_sample() aligns with the new public API.

tests/unittest/_torch/auto_deploy/_utils_test/_graph_test_helpers.py (1)

67-79: LGTM on tokenizer-free embedding flow.

Shimming list-of-ids to embeddings is minimal and contained; super().nest_sequences will receive the correct 3D inputs.

tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_kv_cache.py (1)

56-87: Forward signature change is consistent.

Switching to input_ids and passing through position_ids matches transform expectations.

tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (1)

189-201: Ragged seq_len with uncached attention will .view() incorrectly.

sl = self.seq_len[0] assumes a uniform sequence length across the batch. The assert in nest_sequences above mitigates this, but callers must go through nest_sequences before args shaping. If ragged prefill is intended, pad to a common length instead of relying on view().

Would you like me to follow up with a padding helper to support ragged prefill safely?
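
For reference, a rough sketch of what such a helper could look like (names are hypothetical, not existing SequenceInfo API): pad a ragged list of per-sequence tensors to a dense [B, S_max, ...] batch plus a validity mask, instead of relying on view().

from typing import List, Tuple

import torch
from torch.nn.utils.rnn import pad_sequence

def pad_ragged(seqs: List[torch.Tensor], pad_value: float = 0.0) -> Tuple[torch.Tensor, torch.Tensor]:
    """Pad [S_i, ...] tensors to [B, S_max, ...] and return a [B, S_max] bool mask of valid tokens."""
    lengths = torch.tensor([s.shape[0] for s in seqs])
    padded = pad_sequence(seqs, batch_first=True, padding_value=pad_value)
    mask = torch.arange(padded.shape[1])[None, :] < lengths[:, None]
    return padded, mask

padded, mask = pad_ragged([torch.randn(3, 8), torch.randn(5, 8)])
print(padded.shape, mask.sum(dim=1))  # torch.Size([2, 5, 8]) tensor([3, 5])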

tensorrt_llm/_torch/auto_deploy/shim/demollm.py (1)

112-118: Confirm extra_args lifecycle across decode steps.

nest_sequences resets extra inputs when omitted; if multimodal extra tensors are required beyond prefill, they must be passed again each step or cached inside the interface.

Would you like a follow-up patch to persist extra_args automatically across decode?
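
If persistence inside the interface is the preferred route, a minimal sketch of the idea (class and method names are hypothetical, not the current API): remember the last non-empty extra inputs and re-supply them whenever a decode step omits them.

from typing import Any, Dict, Optional

class ExtraArgsCache:
    """Cache extra (e.g., multimodal) inputs from prefill and reuse them on later decode steps."""

    def __init__(self) -> None:
        self._cached: Optional[Dict[str, Any]] = None

    def resolve(self, extra_args: Optional[Dict[str, Any]]) -> Optional[Dict[str, Any]]:
        if extra_args:  # prefill or explicit refresh: remember the new extras
            self._cached = dict(extra_args)
        return self._cached  # decode steps that omit extras fall back to the cached ones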

tensorrt_llm/_torch/auto_deploy/models/hf.py (2)

159-166: Forward build flow: minor improvement.

Using model.forward(...) is clearer than type(model).forward(...).

Do we need type(model).forward to bypass patched wrappers, or can we simplify to model.forward(...) safely here?


379-385: Tokenizer via processor: verify None path.

If tokenizer resolves to None, init_processor returns None and init_tokenizer returns None; ensure callers handle it (LLM._try_load_tokenizer does). LGTM.

@lucaslie lucaslie moved this from Backlog to In review in AutoDeploy Board Aug 29, 2025
@lucaslie lucaslie marked this pull request as ready for review August 29, 2025 04:08
@lucaslie lucaslie requested review from a team as code owners August 29, 2025 04:08
@tensorrt-cicd
Copy link
Collaborator

PR_Github #17462 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #13122 completed with status: 'FAILURE'

Signed-off-by: Lucas Liebenwein <[email protected]>
Signed-off-by: Lucas Liebenwein <[email protected]>
@lucaslie
Copy link
Member Author

lucaslie commented Sep 3, 2025

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #17479 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #17479 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #13133 completed with status: 'FAILURE'

@lucaslie
Copy link
Member Author

lucaslie commented Sep 3, 2025

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #17541 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #17541 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #13187 completed with status: 'FAILURE'

@lucaslie
Copy link
Member Author

lucaslie commented Sep 3, 2025

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #17564 [ run ] triggered by Bot

@suyoggupta suyoggupta self-requested a review September 3, 2025 18:59
@lucaslie
Copy link
Member Author

lucaslie commented Sep 3, 2025

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #17574 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #17564 [ run ] completed with state ABORTED

Copy link
Collaborator

@syuoni syuoni left a comment


Looks good for the LLM API part

@lucaslie lucaslie enabled auto-merge (squash) September 3, 2025 23:26
@tensorrt-cicd
Copy link
Collaborator

PR_Github #17574 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #13216 completed with status: 'FAILURE'

@lucaslie
Copy link
Member Author

lucaslie commented Sep 4, 2025

/bot run

@lucaslie
Copy link
Member Author

lucaslie commented Sep 4, 2025

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #17635 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #17635 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #13256 completed with status: 'ABORTED'

Labels
None yet
Projects
Status: In review
Development

Successfully merging this pull request may close these issues.

[AutoDeploy] Architecture for VLM support
8 participants