Skip to content

[TRTLLM-6746]Enable two-model spec dec for MTP Eagle #7001

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open · wants to merge 1 commit into base: main

Conversation

@sunnyqgg (Collaborator) commented Aug 18, 2025

Summary by CodeRabbit

  • New Features
    • Added MTP Eagle decoding with both one-model and two-model options.
    • Introduced configurable Eagle capture layers and relaxed acceptance controls (top-k, delta).
    • Enabled MTP draft decoding with automatic routing based on configuration.
    • Draft prompts now drop the first token in Eagle/MTP Eagle modes for improved alignment.
  • Refactor
    • Modularized weight loading for improved reliability, dequantization support, and tensor-parallel handling.
    • Dynamic capture-layer handling reduces memory usage and scales with configuration.
  • Chores
    • Updated internal mode predicates and configuration wiring to support new decoding variants.

Description

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed. (See the usage examples after the option descriptions below.)

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
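
For example, a default pre-merge run, a targeted rerun, and a kill might look like this (the stage and GPU names are the placeholders from the option descriptions above):

/bot run
/bot run --stage-list "A10-PyTorch-1" --disable-fail-fast
/bot run --gpu-type "A30, H100_PCIe"
/bot run --post-merge
/bot kill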

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

coderabbitai bot (Contributor) commented Aug 18, 2025

📝 Walkthrough

Introduces MTP Eagle variants and a separate MTP draft model. Adds new decoding modes, config fields, and dynamic Eagle3 layer capture. Implements DeepseekV3 weight loading via a dedicated loader. Adjusts auto-model mapping, executor draft-layer override, and example setup. Updates speculative utilities and interfaces accordingly.

Changes

Cohort / File(s) Summary
Example setup (Quickstart)
examples/llm-api/quickstart_advanced.py
Reworks MTP branch: switches to EagleDecodingConfig for two-model style; augments one-model MTP config; updates messaging and relaxed-acceptance knobs (see the config sketch after this table).
Auto-model routing
tensorrt_llm/_torch/models/modeling_auto.py
Remaps DeepseekV3 arch to MTPDraftModelForCausalLM when max_draft_len==0 before class lookup.
DeepseekV3 weights
tensorrt_llm/_torch/models/modeling_deepseekv3.py
Replaces DeepseekV3MTPHead with DeepseekV3WeightLoader; centralizes weight loading, TP splits, KV proj (and dequant) handling; invoked from model load path.
Speculative modeling (draft + flow)
tensorrt_llm/_torch/models/modeling_speculative.py
Adds MTPDraftModel and MTPDraftModelForCausalLM; updates MTP/Eagle3 one-model checks; adjusts Eagle3DraftModel FC init; integrates DeepseekV3WeightLoader for draft.
Spec interfaces and modes
tensorrt_llm/_torch/speculative/interface.py
Adds MTP_EAGLE_ONE_MODEL; replaces is_mtp with is_mtp_one_model; adds is_mtp_eagle_one_model; updates helpers; adds SpecMetadata.is_layer_capture.
Spec utils and resource wiring
tensorrt_llm/_torch/speculative/utils.py
Refactors mode checks for MTP/Eagle variants; propagates layers_to_capture and is_mtp_eagle; updates resource manager calls and extra-KV logic.
Eagle3 capture dynamics
tensorrt_llm/_torch/speculative/eagle3.py
Makes capture-layer count dynamic; adds is_mtp_eagle; computes layers_to_capture lazily; resizes hidden-state buffers based on capture count.
Drafter prompt handling
tensorrt_llm/_torch/speculative/model_drafter.py
Drops first token for Eagle3 and MTP Eagle modes in draft prompt preparation.
Executor tweaks
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
For DeepseekV3 draft model, overrides num_hidden_layers to 1 before KV cache and weight wiring.
Decoding config/API
tensorrt_llm/llmapi/llm_args.py
Extends EagleDecodingConfig with capture/relaxed-acceptance fields and computed num_capture_layers; extends MTPDecodingConfig with mtp_eagle_one_model, sets max_draft_len in from_dict; updates spec_dec_mode branches.
MTP metadata/flow
tensorrt_llm/_torch/speculative/mtp.py
Switches to new one-model predicates; updates metadata prep and cross-rank token handling for MTP Eagle one-model vs two-model.
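
For orientation, the two-model MTP Eagle path in the quickstart (first row of the table above) boils down to an EagleDecodingConfig roughly like the following. This mirrors the hunk quoted in the review comments below; note the reviewers argue speculative_model_dir should point at a draft model directory rather than args.model_dir:

spec_config = EagleDecodingConfig(
    max_draft_len=args.spec_decode_max_draft_len,
    speculative_model_dir=args.model_dir,  # reviewers suggest args.draft_model_dir here
    eagle3_one_model=args.use_one_model,
    is_mtp_eagle=True,
    use_relaxed_acceptance_for_thinking=args.use_relaxed_acceptance_for_thinking,
    relaxed_topk=args.relaxed_topk,
    relaxed_delta=args.relaxed_delta,
)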

Sequence Diagram(s)

sequenceDiagram
  participant User
  participant Quickstart
  participant Config
  participant AutoModel
  participant DraftModel
  participant TargetModel

  User->>Quickstart: setup_llm(args)
  Quickstart->>Config: Build MTP/Eagle config (incl. one-model flags)
  Quickstart->>AutoModel: create_draft_model(spec_mode)
  AutoModel->>AutoModel: remap arch (DeepseekV3 & max_draft_len==0 -> MTPDraftModelForCausalLM)
  AutoModel-->>DraftModel: instantiate draft model
  Quickstart->>TargetModel: instantiate target model
  DraftModel->>DraftModel: load_weights (DeepseekV3WeightLoader)
  DraftModel->>TargetModel: load shared embeddings/lm_head (if needed)
  User->>TargetModel: generate(spec_metadata, inputs)
  TargetModel->>DraftModel: request draft logits/hidden states
  DraftModel-->>TargetModel: draft outputs
  TargetModel-->>User: final tokens

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

Suggested labels

Community want to contribute

Suggested reviewers

  • yweng0828
  • nv-guomingz
  • mikeiovine
  • hlu1
  • Wanli-Jiang

sunnyqgg changed the title from "TRTLLM-6746]Enable two-model spec dec for MTP Eagle" to "[TRTLLM-6746]Enable two-model spec dec for MTP Eagle" on Aug 18, 2025
coderabbitai bot (Contributor) left a comment

Actionable comments posted: 10

🔭 Outside diff range comments (8)
tensorrt_llm/_torch/speculative/mtp.py (3)

1-1: Add NVIDIA copyright header.

Per coding guidelines, prepend the NVIDIA copyright header (current year) to all source files.

Apply at top of file:

+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.

70-71: Scalar copy_ misuse — use zero_/fill_ instead.

copy_ expects a Tensor, not a Python scalar. This will error at runtime on some PyTorch versions.

Use zero_ or fill_ for in-place scalar writes:

-                    self.mtp_relaxed_delta_pool[slot_id].copy_(
-                        0, non_blocking=True)
+                    self.mtp_relaxed_delta_pool[slot_id].zero_()
-            self.mtp_relaxed_delta_pool[free_slot_id].copy_(0,
-                                                            non_blocking=True)
+            self.mtp_relaxed_delta_pool[free_slot_id].zero_()

Also applies to: 79-81


1168-1196: Draft loop calls first MTP layer repeatedly (likely bug).

Inside the loop over self.mtp_num_modules, you always invoke draft_model.mtp_layers[0] rather than indexing by i. This will ignore all but the first layer and produce incorrect tokens/hidden states when num_nextn_predict_layers > 1.

Use the ith layer:

-            if i == 0:
-                hidden_states = draft_model.mtp_layers[0](
+            if i == 0:
+                hidden_states = draft_model.mtp_layers[i](
                     embed_tokens=draft_model.embed_tokens,
                     all_rank_num_tokens=spec_metadata.all_rank_num_tokens,
                     all_rank_max_num_tokens=spec_metadata.
                     all_rank_max_num_tokens,
                     **inputs)
...
-            else:
-                hidden_states = draft_model.mtp_layers[0](
+            else:
+                hidden_states = draft_model.mtp_layers[i](
                     embed_tokens=draft_model.embed_tokens,
                     all_rank_num_tokens=spec_metadata.
                     subseq_all_rank_num_tokens,
                     all_rank_max_num_tokens=max(
                         spec_metadata.subseq_all_rank_num_tokens)
                     if spec_metadata.subseq_all_rank_num_tokens is not None else
                     None,
                     **inputs)
examples/llm-api/quickstart_advanced.py (1)

1-1: Add NVIDIA copyright header.

Per coding guidelines, prepend the NVIDIA copyright header (current year) to all source files.

+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
tensorrt_llm/_torch/speculative/eagle3.py (1)

1-1: Add NVIDIA copyright header.

Per coding guidelines, prepend the NVIDIA copyright header (current year) to all source files.

+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
tensorrt_llm/_torch/speculative/utils.py (1)

1-1: Add NVIDIA copyright header.

Per coding guidelines, prepend the NVIDIA copyright header (current year) to all source files.

+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
tensorrt_llm/llmapi/llm_args.py (2)

1-1: Add NVIDIA copyright header.

Per coding guidelines, prepend the NVIDIA copyright header (current year) to all source files.

+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.

1676-1691: EagleDecodingConfig validate: reinforces need for draft_model_dir in quickstart.

Given speculative_model_dir is required for Eagle-based paths, the quickstart must pass --draft_model_dir in two-model MTP Eagle.

See related comment in quickstart_advanced.py with a concrete diff to fix this wiring mistake.

🧹 Nitpick comments (15)
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (1)

264-266: Avoid post-construction mutation of num_hidden_layers; guard against MTPDraftModel remap

Mutating pretrained_config.num_hidden_layers after the draft model is already instantiated risks mismatched assumptions (e.g., KV cache sizing, layer-indexed logic) and will also trigger even when AutoModel remapped to MTPDraftModelForCausalLM (architectures remains "DeepseekV3ForCausalLM"). Add guards so this only applies to an actual DeepseekV3ForCausalLM draft engine. Also handle missing/empty architectures gracefully.

Apply this minimal guard:

-            if draft_model_engine.model.model_config.pretrained_config.architectures[
-                    0] == "DeepseekV3ForCausalLM":
-                draft_model_engine.model.model_config.pretrained_config.num_hidden_layers = 1
+            archs = getattr(
+                draft_model_engine.model.model_config.pretrained_config,
+                "architectures", None)
+            model_arch = archs[0] if archs and len(archs) > 0 else None
+            # Only force to 1 layer when the instantiated class is DeepseekV3ForCausalLM
+            if (model_arch == "DeepseekV3ForCausalLM"
+                    and draft_model_engine.model.__class__.__name__
+                    == "DeepseekV3ForCausalLM"):
+                draft_model_engine.model.model_config.pretrained_config.num_hidden_layers = 1

If the intent was to influence KV cache allocation only, consider moving this into a draft-engine construction-time knob (or pass through PyTorchModelEngine is_draft_model logic) rather than mutating config after instantiation.

Also, per repository guidelines, prepend the NVIDIA copyright header to this file.

# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
tensorrt_llm/_torch/speculative/interface.py (1)

99-109: Verify SM gating in extend_ctx; consider >=100 instead of ==100

extend_ctx disables capture for TrtllmAttention only when get_sm_version() == 100. If future Blackwell revisions report >100, this may degrade unexpectedly. Consider using >= 100 unless there’s a known incompatibility.

Would you like me to adjust the predicate to >= 100?

tensorrt_llm/_torch/models/modeling_deepseekv3.py (4)

132-141: New DeepseekV3WeightLoader improves modularity; add a brief docstring

The separation of weight-loading concerns into a dedicated class is a solid move. Add a short docstring to clarify scope and expected inputs per guidelines.

Example:

class DeepseekV3WeightLoader:
    """TP-aware weight loader for DeepseekV3 variants.

    Args:
        model: The target model instance (e.g., DeepseekV3ForCausalLM or draft variant).
        is_draft_model: Whether weights are being loaded for the draft path.
    """

Also, add the NVIDIA header to this file in addition to preserving upstream license:

# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.

514-535: Wrap long line; comply with Ruff E501 (<=120 chars)

predicted_tokens_per_seq assignment exceeds 120 chars.

-        predicted_tokens_per_seq = model_config.spec_config.max_draft_len + 1 if model_config.spec_config is not None else 1
+        spec_cfg = model_config.spec_config
+        predicted_tokens_per_seq = (spec_cfg.max_draft_len + 1) if spec_cfg is not None else 1

566-574: Prefer torch.sigmoid over torch.nn.functional.sigmoid (deprecation)

torch.nn.functional.sigmoid is deprecated; use torch.sigmoid for clarity and future-proofing.

-        scores = F.sigmoid(logits)
+        scores = torch.sigmoid(logits)

659-667: Ruff E501: wrap long call to DeepseekV3Gate constructor

A few of the argument lines exceed 120 chars. Wrap to satisfy E501.

For example, break routed_scaling_factor line:

-            routed_scaling_factor=config.routed_scaling_factor,
+            routed_scaling_factor=config.routed_scaling_factor,

and ensure each argument is on its own line with hanging indentation (PEP8).

If you want, I can run through the file and submit a style-only patch for the flagged lines.

tensorrt_llm/_torch/speculative/model_drafter.py (1)

32-35: Add NVIDIA copyright header to file

Per repo guidelines, prepend the header.

# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
tensorrt_llm/_torch/speculative/mtp.py (2)

495-497: None check order in boolean guard.

Accessing attn_metadata.is_cuda_graph before verifying attn_metadata is not None risks AttributeError. Reverse the condition order.

-        if attn_metadata.is_cuda_graph and attn_metadata is not None:
+        if attn_metadata is not None and attn_metadata.is_cuda_graph:

292-296: Minor performance nit: avoid clone() when not needed.

Shifting input buffers uses clone() unnecessarily. copy_ achieves the same semantics with less overhead.

-            input_ids[:-1] = input_ids[1:].clone()
+            input_ids[:-1].copy_(input_ids[1:])
...
-            draft_hidden_states[:-1] = draft_hidden_states[1:].clone()
+            draft_hidden_states[:-1].copy_(draft_hidden_states[1:])
tensorrt_llm/_torch/speculative/eagle3.py (1)

95-105: Default layer-capture pattern: “last-4” could be fragile.

Using (1, mid-1, last-4) is a notable change. For shallow models (<=5 layers) you error out; for 6–7 layers, mid-1 and last-4 may collide or yield very early layers. Please confirm this is intentional and measured. Consider clamping and deduping, or fall back to capturing the last layer when spacing becomes too tight.

If intended, add a short inline comment explaining the rationale for last-4 vs last-1.
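
A clamp-and-dedup fallback along those lines might look like the sketch below; num_layers and the list-of-indices return contract are assumptions for illustration, not the module's actual API:

def default_layers_to_capture(num_layers: int) -> list:
    # Capture after the first layer, roughly the middle, and near the end; clamp into
    # range and dedup so shallow models still yield a valid, strictly increasing set.
    if num_layers <= 1:
        return [max(num_layers - 1, 0)]
    candidates = (1, num_layers // 2 - 1, num_layers - 4)
    clamped = {min(max(c, 0), num_layers - 1) for c in candidates}
    return sorted(clamped)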

tensorrt_llm/_torch/models/modeling_speculative.py (5)

382-385: Use LongTensor for token indices in type hints.

Token IDs and position IDs are conventionally torch.LongTensor (int64). The current hints use torch.IntTensor in some places.

Apply this diff:

-        input_ids: torch.IntTensor,
-        position_ids: torch.IntTensor,
+        input_ids: torch.LongTensor,
+        position_ids: torch.LongTensor,

And:

-                input_ids: torch.IntTensor = None,
-                position_ids: torch.IntTensor = None,
+                input_ids: torch.LongTensor = None,
+                position_ids: torch.LongTensor = None,

Also applies to: 437-439


370-375: Align Embedding instantiation with TP mapping (optional).

Elsewhere embeddings are usually constructed with mapping/tensor-parallel settings and gathered outputs. Consider matching that here to avoid surprises in TP runs.

Apply this diff:

-        self.embed_tokens = Embedding(
-            self.config.vocab_size,
-            self.config.hidden_size,
-            dtype=self.config.torch_dtype,
-        )
+        self.embed_tokens = Embedding(
+            self.config.vocab_size,
+            self.config.hidden_size,
+            dtype=self.config.torch_dtype,
+            mapping=model_config.mapping,
+            tensor_parallel_mode=TensorParallelMode.COLUMN,
+            gather_output=True,
+        )

431-433: Embedding sharing likely never happens; decide whether to always share or always own.

self.model.embed_tokens is created in MTPDraftModel.__init__, so this if ... is None branch will never run. If the intent is to share embeddings with the target model to save memory and keep weights in sync, set it unconditionally; otherwise, remove the dead branch.

Proposed change to share:

-        if self.model.embed_tokens is None:
-            self.model.embed_tokens = target_model.model.embed_tokens
+        # Share embeddings with target model to save memory and keep weights in sync.
+        self.model.embed_tokens = target_model.model.embed_tokens

Please confirm whether the draft model should own separate embeddings (loaded via weight loader) or share with the target model.


353-359: Optional: Add concise docstring to new public class.

Add a short, Google-style docstring to clarify the per-layer draft behavior and expected inputs.

Example:

 class MTPDraftModel(nn.Module):
 
     def __init__(self, model_config: ModelConfig[PretrainedConfig],
                  layer_idx: int, aux_stream_dict: Dict[AuxStreamType,
                                                        torch.cuda.Stream]):
         super().__init__()
+        """
+        Per-layer MTP draft block used in the two-model MTP-Eagle flow.
+        
+        Args:
+            model_config: Global model configuration.
+            layer_idx: Index of the draft layer relative to the target stack.
+            aux_stream_dict: CUDA aux streams for attention/MoE overlap.
+        """

403-411: Optional: Add docstring to the public MTPDraftModelForCausalLM wrapper.

Clarify the two-stream setup and how weights are loaded.

Example:

 @register_auto_model("MTPDraftModelForCausalLM")
 class MTPDraftModelForCausalLM(DecoderModelForCausalLM[MTPDraftModel,
                                                        PretrainedConfig]):
 
     def __init__(self, model_config: ModelConfig[PretrainedConfig]):
-        self.model_config = model_config
+        """
+        Causal LM wrapper around a per-layer MTP draft model for two-model
+        MTP-Eagle speculative decoding.
+
+        Sets up CUDA aux streams and delegates weight loading to the DeepseekV3 loader.
+        """
+        self.model_config = model_config
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 1ce2354 and 9ae0936.

📒 Files selected for processing (11)
  • examples/llm-api/quickstart_advanced.py (1 hunks)
  • tensorrt_llm/_torch/models/modeling_auto.py (1 hunks)
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py (9 hunks)
  • tensorrt_llm/_torch/models/modeling_speculative.py (5 hunks)
  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (1 hunks)
  • tensorrt_llm/_torch/speculative/eagle3.py (4 hunks)
  • tensorrt_llm/_torch/speculative/interface.py (6 hunks)
  • tensorrt_llm/_torch/speculative/model_drafter.py (1 hunks)
  • tensorrt_llm/_torch/speculative/mtp.py (3 hunks)
  • tensorrt_llm/_torch/speculative/utils.py (10 hunks)
  • tensorrt_llm/llmapi/llm_args.py (6 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in __init__
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/_torch/speculative/model_drafter.py
  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
  • tensorrt_llm/_torch/speculative/mtp.py
  • tensorrt_llm/_torch/models/modeling_auto.py
  • examples/llm-api/quickstart_advanced.py
  • tensorrt_llm/_torch/speculative/interface.py
  • tensorrt_llm/llmapi/llm_args.py
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
  • tensorrt_llm/_torch/models/modeling_speculative.py
  • tensorrt_llm/_torch/speculative/eagle3.py
  • tensorrt_llm/_torch/speculative/utils.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/_torch/speculative/model_drafter.py
  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
  • tensorrt_llm/_torch/speculative/mtp.py
  • tensorrt_llm/_torch/models/modeling_auto.py
  • examples/llm-api/quickstart_advanced.py
  • tensorrt_llm/_torch/speculative/interface.py
  • tensorrt_llm/llmapi/llm_args.py
  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
  • tensorrt_llm/_torch/models/modeling_speculative.py
  • tensorrt_llm/_torch/speculative/eagle3.py
  • tensorrt_llm/_torch/speculative/utils.py
🧠 Learnings (1)
📚 Learning: 2025-08-14T06:36:40.681Z
Learnt from: timlee0212
PR: NVIDIA/TensorRT-LLM#6886
File: tensorrt_llm/_torch/models/modeling_deepseekv3.py:0-0
Timestamp: 2025-08-14T06:36:40.681Z
Learning: In DeepSeek V3 model (tensorrt_llm/_torch/models/modeling_deepseekv3.py), the disagreement between AllReduce.__init__ guard and _compute_mlp_tp_size logic for MNNVL usage is expected by design. The AllReduce component and MLP TP-size computation intentionally use different criteria for MNNVL availability decisions.

Applied to files:

  • tensorrt_llm/_torch/models/modeling_deepseekv3.py
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/models/modeling_deepseekv3.py

514-514: Line too long (124 > 120)

(E501)


659-659: Line too long (122 > 120)

(E501)


797-797: Line too long (125 > 120)

(E501)


803-803: Line too long (136 > 120)

(E501)

tensorrt_llm/_torch/models/modeling_speculative.py

463-463: Comparison to None should be cond is not None

Replace with cond is not None

(E711)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (24)
tensorrt_llm/_torch/speculative/interface.py (3)

24-29: Helper predicates for one-model MTP look good

is_mtp_one_model and is_mtp_eagle_one_model are clear and simplify downstream checks. No issues.


39-41: use_one_engine correctly covers MTP one-model

The updated predicate now includes MTP one-model alongside EAGLE3 one-model, which is consistent with the intended architecture split.


196-202: Default SpecMetadata.is_layer_capture stub is fine

Providing an overridable hook with a safe default is appropriate.
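
For reference, the default hook being approved here could look roughly like this; the exact signature in interface.py is not shown in the diff, so the layer_idx argument is an assumption:

class SpecMetadata:
    def is_layer_capture(self, layer_idx: int) -> bool:
        # Safe default: capture nothing; Eagle3/MTP-Eagle metadata subclasses override this.
        return False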

tensorrt_llm/_torch/models/modeling_deepseekv3.py (1)

1496-1499: Weight loader integration looks correct

Switching to DeepseekV3WeightLoader.load_weights simplifies the class and centralizes logic. LGTM.

tensorrt_llm/_torch/speculative/model_drafter.py (1)

32-35: Drop-first-token logic extended to MTP_EAGLE (two-model) — LGTM

Aligns with EAGLE3 behavior and the broader MTP Eagle flow. Safe for short prompts (returns empty tensor when len==1).
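
A minimal sketch of the drop-first-token behavior described above (illustrative only; the drafter's actual method name and surrounding logic differ):

import torch

def shift_draft_prompt(prompt_ids: torch.Tensor) -> torch.Tensor:
    # Eagle3 / two-model MTP Eagle: the draft model consumes the prompt shifted left by
    # one token, so the first token is dropped; a length-1 prompt yields an empty tensor.
    return prompt_ids[1:]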

tensorrt_llm/_torch/speculative/mtp.py (3)

154-156: One-model Eagle gating: confirm behavior.

Setting subseq_all_rank_num_tokens only for is_mtp_eagle_one_model looks intentional for the MTPEagle one-model path. Please confirm that two-model Eagle does not rely on subseq_all_rank_num_tokens here (it should be using Eagle3 metadata instead). If not, add the appropriate branch.


171-173: Double-check num_tokens adjustment gating.

MTP vanilla uses total max_draft_len tokens, while MTP Eagle’s first step uses max_draft_len+1 and subsequent steps 1. The change to gate on not is_mtp_eagle_one_model() seems correct for the one-model Eagle path. Please verify that the two-model Eagle path won’t traverse this code (i.e., uses Eagle3 metadata exclusively).


181-202: Vanilla-only hidden-state updates gating looks correct.

Updating hidden-state pointers only under is_mtp_one_model() aligns with excluding the MTP Eagle one-model path, which has its own flow.

examples/llm-api/quickstart_advanced.py (1)

186-193: One-model MTP path wiring looks consistent.

Using MTPDecodingConfig with num_nextn_predict_layers and mtp_eagle_one_model mirrors the new spec_dec_mode predicates.
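
For contrast with the two-model example earlier, the one-model wiring described here amounts to something like the following; the mapping of CLI arguments to fields is an assumption based on the walkthrough, not verified against the file:

spec_config = MTPDecodingConfig(
    num_nextn_predict_layers=args.spec_decode_max_draft_len,  # assumed CLI mapping
    mtp_eagle_one_model=True,
    use_relaxed_acceptance_for_thinking=args.use_relaxed_acceptance_for_thinking,
    relaxed_topk=args.relaxed_topk,
    relaxed_delta=args.relaxed_delta,
)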

tensorrt_llm/_torch/speculative/eagle3.py (2)

39-41: Capacity of hidden_states now depends on config.num_capture_layers — verify alignment.

Resource allocation uses EagleDecodingConfig.num_capture_layers, while metadata can derive layers_to_capture dynamically. Ensure config.num_capture_layers and metadata.layers_to_capture length always match, or document the assumption. Mismatch can cause OOB writes during capture.
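
One way to make that assumption explicit is a guard at allocation time along these lines (attribute names mirror the discussion and are not verified against the module):

# Hedged sketch: fail fast if the config-driven buffer capacity and the dynamically
# derived capture set ever diverge, instead of risking out-of-bounds writes.
assert len(spec_metadata.layers_to_capture) == spec_config.num_capture_layers, (
    "layers_to_capture must match num_capture_layers")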


187-219: Eagle3OneModelSpecMetadata mirrors capture logic — good parity.

The mirrored lazy init and allocation for one-model Eagle3 aligns with the two-model path.

tensorrt_llm/_torch/speculative/utils.py (6)

22-30: One-model MTP metadata unification looks correct.

Routing both vanilla MTP and MTP Eagle one-model through MTPSpecMetadata via is_mtp_one_model() simplifies the stack.


75-85: Resource manager for MTP Eagle one-model only when relaxed acceptance is enabled.

This conditional avoids unnecessary buffers. Please confirm MTPEagleWorker never expects mtp_* pools when use_relaxed_acceptance_for_thinking=False. Otherwise, allocate conditionally per usage site.


92-101: Two-model Eagle and MTP Eagle share Eagle3ResourceManager — assert message updated.

The assertion covers both modes. Good consistency.


111-118: Decoder selection via is_mtp_one_model() is fine if it includes Eagle one-model.

Ensure SpeculativeDecodingMode.is_mtp_one_model() returns True for both vanilla MTP and MTP Eagle one-model. If not, add an explicit is_mtp_eagle_one_model() branch to return MTPSampler.
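
The expectation is roughly the predicate below; MTP_EAGLE_ONE_MODEL is confirmed by this PR, while the name of the vanilla MTP member is assumed for illustration:

def is_mtp_one_model(self) -> bool:
    # True for both vanilla MTP and MTP Eagle one-model, so both route to MTPSampler.
    return self in (SpeculativeDecodingMode.MTP,
                    SpeculativeDecodingMode.MTP_EAGLE_ONE_MODEL)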


162-170: Worker selection split is clear and correct.

MTP Vanilla → MTPWorker; MTP Eagle one-model → MTPEagleWorker; Eagle3 one-model → Eagle3OneModelWorker.


186-191: MTP one-model max_draft_len reconciliation is necessary — good.

Ensures low-level APIs expecting max_draft_len receive a consistent value derived from num_nextn_predict_layers.

tensorrt_llm/llmapi/llm_args.py (4)

445-452: num_capture_layers logic: confirm alignment with eagle3.py defaults.

If eagle3_layers_to_capture is unset, you return 1 for MTP Eagle and 3 otherwise. This should match eagle3.py’s computed layers_to_capture length. Please verify consistency to avoid allocation/usage mismatches.
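
Reconstructed from that description, the property reads roughly as follows (illustrative; the real code in EagleDecodingConfig may be shaped differently):

@property
def num_capture_layers(self) -> int:
    # Explicit capture list wins; otherwise 1 for MTP Eagle, 3 for standard Eagle3.
    if self.eagle3_layers_to_capture is not None:
        return len(self.eagle3_layers_to_capture)
    return 1 if self.is_mtp_eagle else 3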


457-461: Spec mode mapping for MTP Eagle is correct.

EAGLE3_ONE_MODEL takes priority; otherwise MTP_EAGLE if flagged.


542-547: from_dict sets max_draft_len — good for downstream APIs.

Keeps legacy code that reads max_draft_len working.


557-561: MTP_EAGLE vs MTP_EAGLE_ONE_MODEL selection looks right.

Using num_nextn_predict_layers_from_model_config and mtp_eagle_one_model disambiguates both modes cleanly.

tensorrt_llm/_torch/models/modeling_speculative.py (3)

22-22: LGTM: AuxStreamType import is appropriate.

Import looks correct and aligns with the new aux-stream wiring.


340-343: LGTM: MTP layer count computation is consistent with the new predicates.

Using is_mtp_one_model() and is_mtp_eagle_one_model() reads clearly, and the fallback to num_nextn_predict_layers makes sense.


468-474: LGTM: New MTP-Eagle branch in get_draft_model.

Adding the is_mtp_eagle() path to return MTPDraftModelForCausalLM is consistent with enabling the two-model spec-dec for MTP Eagle.

Comment on lines 173 to 184
         if not args.use_one_model:
-            print(
-                "MTP only supports one model style spec decode; ignoring default use_one_model=False"
+            print("Running MTP eagle with two model style.")
+            spec_config = EagleDecodingConfig(
+                max_draft_len=args.spec_decode_max_draft_len,
+                speculative_model_dir=args.model_dir,
+                eagle3_one_model=args.use_one_model,
+                is_mtp_eagle=True,
+                use_relaxed_acceptance_for_thinking=args.
+                use_relaxed_acceptance_for_thinking,
+                relaxed_topk=args.relaxed_topk,
+                relaxed_delta=args.relaxed_delta,
+            )

⚠️ Potential issue

Two-model MTP Eagle should use draft_model_dir, not model_dir.

EagleDecodingConfig.validate requires a draft model path. Using args.model_dir here is likely incorrect and will fail if EAGLE weights aren’t colocated with the main model.

Enforce draft_model_dir and wire it:

-        if not args.use_one_model:
-            print("Running MTP eagle with two model style.")
-            spec_config = EagleDecodingConfig(
-                max_draft_len=args.spec_decode_max_draft_len,
-                speculative_model_dir=args.model_dir,
+        if not args.use_one_model:
+            print("Running MTP eagle with two model style.")
+            if args.draft_model_dir is None:
+                raise ValueError("For two-model MTP Eagle, --draft_model_dir must be provided.")
+            spec_config = EagleDecodingConfig(
+                max_draft_len=args.spec_decode_max_draft_len,
+                speculative_model_dir=args.draft_model_dir,
                 eagle3_one_model=args.use_one_model,
                 is_mtp_eagle=True,
                 use_relaxed_acceptance_for_thinking=args.
                 use_relaxed_acceptance_for_thinking,
                 relaxed_topk=args.relaxed_topk,
                 relaxed_delta=args.relaxed_delta,
             )
🤖 Prompt for AI Agents
In examples/llm-api/quickstart_advanced.py around lines 173 to 184, the config
for two-model MTP Eagle incorrectly passes args.model_dir as the
draft/speculative model path; change it to use the draft model path
(args.draft_model_dir) and/or the proper EagleDecodingConfig parameter name
(draft_model_dir or speculative_model_dir as defined in the config API) so the
spec/draft weights are read from the draft model directory instead of the main
model_dir; update the argument passed to EagleDecodingConfig accordingly and
ensure the CLI/argparser exposes args.draft_model_dir.

Comment on lines +23 to +24
        if model_arch == "DeepseekV3ForCausalLM" and config.spec_config.max_draft_len == 0:
            model_arch = "MTPDraftModelForCausalLM"

⚠️ Potential issue

Fix: guard config.spec_config None to avoid AttributeError

Accessing config.spec_config.max_draft_len without a None-check will raise if spec_config is None. Add a guard.

-        if model_arch == "DeepseekV3ForCausalLM" and config.spec_config.max_draft_len == 0:
+        if (model_arch == "DeepseekV3ForCausalLM"
+                and config.spec_config is not None
+                and config.spec_config.max_draft_len == 0):
             model_arch = "MTPDraftModelForCausalLM"

Also, per repository guidelines, prepend the NVIDIA copyright header:

# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/models/modeling_auto.py around lines 23-24, accessing
config.spec_config.max_draft_len can raise AttributeError when spec_config is
None; add a None-check so you only read max_draft_len if config.spec_config is
not None (e.g., guard with if config.spec_config and
config.spec_config.max_draft_len == 0) and update the branch accordingly; also
prepend the NVIDIA copyright header line "# Copyright (c) 2025, NVIDIA
CORPORATION. All rights reserved." at the top of the file.

Comment on lines +773 to +809
    def _compute_shared_expert_tp_size(self, intermediate_size: int,
                                       block_size: int) -> int:
        """
        In the case of Deepseek-R1, the TP size of MLP is capped by intermediate_size // block_size.
        For example, when the intermediate_size is 2048 and block scaling size is 128,
        TP sizes are limited to {1, 2, 4, 8, 16} because of 2048/128 = 16.

        Args:
            intermediate_size (int): MLP intermediate size.
            block_size (int): The quantization block scale size. In the case of Deepseek FP8 recipe,
                it's 128. For NVFP4, it's 16.

        Returns:
            int: The computed tp_size.
        """

        assert intermediate_size % block_size == 0, "intermediate_size must be divisible by block_size."

        shared_output_scale = None
        # The block scale size is 128, which requires shared_expert_intermediate_size to be divisible by 128.
        if self.use_dp:
            # If using attention DP, the shared experts also use DP instead of TP.
            shared_tp_size = 1
        else:
            # Due to the restriction of block scale size (i.e., 128), the supported TP sizes only include 1, 2, 4, 8, and 16.
            # The math.gcd operation ensures that shared_tp_size falls in the supported TP sizes.
            shared_tp_size = math.gcd(
                intermediate_size // block_size,
                self.mapping.tp_size,
            )
        # If shared_tp_size has been overridden, the output of shared experts needs to be scaled down accordingly before all-reduce.
        if shared_tp_size != self.mapping.tp_size:
            shared_output_scale = shared_tp_size / self.mapping.tp_size

        return shared_tp_size, shared_output_scale

    def compute_routed_output(self, hidden_states, hidden_states_fp4,

⚠️ Potential issue

Fix type annotation and docstring: function returns tuple, not int

_compute_shared_expert_tp_size returns a tuple (shared_tp_size, shared_output_scale), but the annotation and docstring say int. Align both.

-    def _compute_shared_expert_tp_size(self, intermediate_size: int,
-                                       block_size: int) -> int:
+    def _compute_shared_expert_tp_size(
+        self, intermediate_size: int, block_size: int
+    ) -> Tuple[int, Optional[float]]:
@@
-        Returns:
-            int: The computed tp_size.
+        Returns:
+            Tuple[int, Optional[float]]: (tp_size, shared_output_scale)
🧰 Tools
🪛 Ruff (0.12.2)

797-797: Line too long (125 > 120)

(E501)


803-803: Line too long (136 > 120)

(E501)

🤖 Prompt for AI Agents
In tensorrt_llm/_torch/models/modeling_deepseekv3.py around lines 773 to 809,
the _compute_shared_expert_tp_size function actually returns a tuple
(shared_tp_size, shared_output_scale) but is annotated and documented as
returning an int; update the function signature to return the correct typing
(e.g., Tuple[int, Optional[float]]), add necessary imports from typing (Tuple,
Optional) if not present, and modify the docstring Returns section to describe
the two-tuple (shared_tp_size: int, shared_output_scale: Optional[float]) and
their meanings.

@@ -1,4 +1,4 @@
-from typing import Any, Dict, Generic, Optional, Tuple
+from typing import Any, Dict, Generic, List, Optional, Tuple

🛠️ Refactor suggestion

Add the required NVIDIA copyright header (2025).

Per the coding guidelines, prepend the 2025 NVIDIA copyright header to all source files.

Apply this diff at the top of the file:

+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
 from typing import Any, Dict, Generic, List, Optional, Tuple

If the repository uses a standardized multi-line header or SPDX tag, prefer that exact format.

🤖 Prompt for AI Agents
In tensorrt_llm/_torch/models/modeling_speculative.py at the top of the file
(around line 1), the required NVIDIA 2025 copyright header is missing; prepend
the repository's standard NVIDIA copyright header (or SPDX tag if the repo uses
that standardized format) as the first lines of the file, ensuring correct year
(2025), owner, and any required license identifier and preserving a blank line
after the header.

Comment on lines +159 to 165
        if self.spec_config.num_capture_layers > 1:
            self.fc = Linear(self.hidden_size_in *
                             self.spec_config.num_capture_layers,
                             config.hidden_size,
                             bias=getattr(config, "bias", False),
                             dtype=config.torch_dtype)


⚠️ Potential issue

Potential AttributeError/use-after-uninitialized: fc may be missing for single-capture-layer dimensionality mismatch.

apply_eagle3_fc later calls self.model.fc(...) when hidden size mismatches, but self.fc is only created when num_capture_layers > 1. If num_capture_layers == 1 and hidden_size_in != config.hidden_size, this will raise AttributeError. Initialize fc whenever a projection is needed.

Apply this diff to make fc exist when needed and with the correct input features:

-        if self.spec_config.num_capture_layers > 1:
-            self.fc = Linear(self.hidden_size_in *
-                             self.spec_config.num_capture_layers,
-                             config.hidden_size,
-                             bias=getattr(config, "bias", False),
-                             dtype=config.torch_dtype)
+        if (self.spec_config.num_capture_layers > 1
+                or self.hidden_size_in != config.hidden_size):
+            in_features = (self.hidden_size_in *
+                           self.spec_config.num_capture_layers
+                           if self.spec_config.num_capture_layers > 1
+                           else self.hidden_size_in)
+            self.fc = Linear(
+                in_features,
+                config.hidden_size,
+                bias=getattr(config, "bias", False),
+                dtype=config.torch_dtype,
+            )

Comment on lines +362 to +368
        mtp_layer = DeepseekV3MTP(model_config,
                                  layer_idx,
                                  aux_stream_dict,
                                  is_separate_draft_engine=True)
        setattr(self, f"layers.{layer_idx}", mtp_layer)
        self.layers = mtp_layer
        self.layer_idx = layer_idx

⚠️ Potential issue

Illegal module name: attribute names cannot contain '.' in nn.Module.

setattr(self, f"layers.{layer_idx}", mtp_layer) will fail in PyTorch with “module name can't contain '.'”. You already assign self.layers = mtp_layer below; the setattr line should be removed.

Apply this diff:

-        setattr(self, f"layers.{layer_idx}", mtp_layer)
-        self.layers = mtp_layer
+        self.layers = mtp_layer  # single MTP layer for this draft engine
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/models/modeling_speculative.py around lines 362 to 368,
remove the invalid setattr call that attempts to set an attribute name
containing a dot: `setattr(self, f"layers.{layer_idx}", mtp_layer)` — this is
illegal for nn.Module and will raise "module name can't contain '.'"; keep the
existing assignment `self.layers = mtp_layer` (and `self.layer_idx = layer_idx`)
and delete the setattr line so the module is registered correctly.

Comment on lines +389 to +401
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        hidden_states = self.layers(
            input_ids,
            position_ids,
            hidden_states,
            embed_tokens=self.embed_tokens,
            attn_metadata=attn_metadata,
            all_rank_num_tokens=all_rank_num_tokens,
            all_rank_max_num_tokens=all_rank_max_num_tokens,
        )

        return hidden_states


⚠️ Potential issue

Return type mismatch: forward is annotated to return a Tuple but returns a single tensor.

The annotation says -> Tuple[torch.Tensor, torch.Tensor], but the body returns a single hidden_states. This will confuse type checkers and readers.

Apply this diff:

-    ) -> Tuple[torch.Tensor, torch.Tensor]:
+    ) -> torch.Tensor:
@@
-        return hidden_states
+        return hidden_states
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/models/modeling_speculative.py around lines 389 to 401,
the forward method is annotated to return Tuple[torch.Tensor, torch.Tensor] but
only returns a single tensor (hidden_states); update the signature to return
torch.Tensor (remove Tuple[...] and second tensor), and ensure any callers or
type hints expecting a tuple are adjusted accordingly (or if a second tensor is
actually required, modify the return to return a tuple with the additional
tensor instead).

Comment on lines +423 to +428
    def load_weights(self, weights: Dict):
        # Import here to avoid circular import
        from .modeling_deepseekv3 import DeepseekV3WeightLoader
        weight_loader = DeepseekV3WeightLoader(self, is_draft_model=True)
        weight_loader.load_weights(weights)


⚠️ Potential issue

Signature mismatch will raise TypeError in load_draft_weights path.

SpecDecOneEngineForCausalLM.load_draft_weights calls self.draft_model.load_weights(weights=..., weight_mapper=...). Here, MTPDraftModelForCausalLM.load_weights doesn't accept weight_mapper, causing a TypeError. Allow the optional param to maintain the common interface.

Apply this diff:

-    def load_weights(self, weights: Dict):
+    def load_weights(self, weights: Dict, weight_mapper: Optional[BaseWeightMapper] = None) -> None:
         # Import here to avoid circular import
         from .modeling_deepseekv3 import DeepseekV3WeightLoader
         weight_loader = DeepseekV3WeightLoader(self, is_draft_model=True)
         weight_loader.load_weights(weights)
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/models/modeling_speculative.py around lines 423 to 428,
the load_weights signature lacks an optional weight_mapper parameter expected by
callers; update the method to accept weight_mapper: Optional[Callable]=None (or
appropriate type) and forward it to the DeepseekV3WeightLoader.load_weights call
(e.g., weight_loader.load_weights(weights, weight_mapper=weight_mapper)) so the
method matches the common interface and avoids a TypeError.
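
A toy reproduction of the TypeError described above; the class and call here are placeholders, not the real TensorRT-LLM classes:

```python
from typing import Dict


class ToyDraftModel:
    # Placeholder for a draft model whose load_weights lacks the optional
    # weight_mapper parameter its caller passes.
    def load_weights(self, weights: Dict):
        print(f"loaded {len(weights)} tensors")


draft = ToyDraftModel()
# Mirrors the caller pattern described above and raises:
# TypeError: load_weights() got an unexpected keyword argument 'weight_mapper'
draft.load_weights(weights={}, weight_mapper=None)
```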

Comment on lines +445 to +453
        hidden_states = spec_metadata.get_hidden_states()
        output = self.model(
            input_ids=input_ids,
            position_ids=position_ids,
            hidden_states=hidden_states,
            attn_metadata=attn_metadata,
            all_rank_num_tokens=attn_metadata.all_rank_num_tokens,
            all_rank_max_num_tokens=attn_metadata.all_rank_max_num_tokens,
            **kwargs)

⚠️ Potential issue

Guard against missing spec_metadata and support explicit hidden_states.

If spec_metadata is None, this will crash. It’s safer to accept hidden_states directly when provided and fall back to spec_metadata otherwise.

Apply this diff:

-        hidden_states = spec_metadata.get_hidden_states()
+        if hidden_states is None:
+            if spec_metadata is None:
+                raise ValueError(
+                    "spec_metadata must be provided, or hidden_states must be passed"
+                )
+            hidden_states = spec_metadata.get_hidden_states()
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/models/modeling_speculative.py around lines 445 to 453,
the code unconditionally calls spec_metadata.get_hidden_states(), which will
crash if spec_metadata is None and prevents callers from passing explicit
hidden_states; change the logic to prefer an explicit hidden_states argument if
provided (e.g., from a local variable or kwargs) and only call
spec_metadata.get_hidden_states() when hidden_states is not provided and
spec_metadata is not None, otherwise set hidden_states to None before calling
self.model; ensure you do not assume spec_metadata is present when accessing
get_hidden_states.
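
One way to express the guard described above as a small helper; the helper name is hypothetical and not part of the PR, shown only to make the fallback order explicit:

```python
from typing import Optional

import torch


def resolve_hidden_states(hidden_states: Optional[torch.Tensor],
                          spec_metadata) -> torch.Tensor:
    # Prefer an explicitly supplied hidden_states; otherwise fall back to
    # spec_metadata, and fail loudly if neither source is available.
    if hidden_states is not None:
        return hidden_states
    if spec_metadata is None:
        raise ValueError(
            "spec_metadata must be provided, or hidden_states must be passed")
    return spec_metadata.get_hidden_states()
```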

return_context_logits,
)


def get_draft_model(model_config, draft_config, lm_head, model):
    assert getattr(model_config, 'spec_config', None) != None

⚠️ Potential issue

Use identity comparison for None (ruff E711).

Replace != None with is not None.

Apply this diff:

-    assert getattr(model_config, 'spec_config', None) != None
+    assert getattr(model_config, 'spec_config', None) is not None
🧰 Tools
🪛 Ruff (0.12.2)

463-463: Comparison to None should be cond is not None

Replace with cond is not None

(E711)

🤖 Prompt for AI Agents
In tensorrt_llm/_torch/models/modeling_speculative.py around line 463, the
assertion uses `!= None`; change it to use identity comparison by replacing that
check with `is not None` so it reads as an identity comparison against None
(i.e., assert getattr(model_config, 'spec_config', None) is not None).
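
Beyond the style rule, `!= None` can give wrong answers for objects that override `__eq__`; a self-contained illustration with a toy class (not from the codebase):

```python
class AlwaysEqual:
    # Toy class whose __eq__ always returns True, so equality against None lies.
    def __eq__(self, other):
        return True


obj = AlwaysEqual()
print(obj != None)      # False: != falls back to the overloaded __eq__
print(obj is not None)  # True: identity comparison ignores operator overloading
```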

@svc-trtllm-gh-bot svc-trtllm-gh-bot added the Community want to contribute PRs initiated from Community label Aug 18, 2025
@mikeiovine mikeiovine self-requested a review August 18, 2025 14:38
@@ -20,6 +20,8 @@ def from_config(
            "")  # Strip the appended EAGLE3
        if hasattr(config.pretrained_config, "draft_vocab_size"):
            model_arch = "EAGLE3" + model_arch
        if model_arch == "DeepseekV3ForCausalLM" and config.spec_config.max_draft_len == 0:

Could config.spec_config be None?
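
If spec_config can indeed be None, one defensive way to write that check is sketched below; the helper name is hypothetical, for illustration only:

```python
def is_deepseek_without_mtp_layers(config, model_arch: str) -> bool:
    # Same check as the new line above, but tolerant of spec_config being
    # absent or None.
    spec_config = getattr(config, "spec_config", None)
    return (model_arch == "DeepseekV3ForCausalLM"
            and spec_config is not None
            and spec_config.max_draft_len == 0)
```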

Comment on lines 157 to +158
else:
hidden_states = hidden_states[-1].unsqueeze(0)
return torch.chunk(v, tp_size, dim=dim)[idx].contiguous()

else is not necessary.

Suggested change
-        else:
-            hidden_states = hidden_states[-1].unsqueeze(0)
-            return torch.chunk(v, tp_size, dim=dim)[idx].contiguous()
+        return torch.chunk(v, tp_size, dim=dim)[idx].contiguous()
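
The early-return shape this comment asks for, shown on a hypothetical tensor-parallel split helper; the function name and the surrounding signature are assumptions, only the chunk line comes from the diff:

```python
import torch


def split_for_tp(v: torch.Tensor, tp_size: int, idx: int,
                 dim: int = 0) -> torch.Tensor:
    # Early-return style: once the first branch returns, no else is needed.
    if tp_size == 1:
        return v
    return torch.chunk(v, tp_size, dim=dim)[idx].contiguous()
```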


        if not do_finalize:
            return [shared_output, *routed_output]
        else:

else is not necessary.

Comment on lines +102 to +106
    if get_sm_version() == 100 and issubclass(attention_backend,
                                              TrtllmAttention):
        return False
    else:
        return True

Could be simplified to return not (get_sm_version() == ....
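
A minimal sketch of the suggested single-expression form; the wrapper function and its parameters are placeholders standing in for get_sm_version() and the TrtllmAttention class used in the snippet above:

```python
def supports_spec_dec(sm_version: int, attention_backend: type,
                      trtllm_attention_cls: type) -> bool:
    # Single-expression form of the if/else above.
    return not (sm_version == 100
                and issubclass(attention_backend, trtllm_attention_cls))
```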

@mikeiovine mikeiovine left a comment


Have not reviewed fully. Will come back to it once #6210 is merged; looks like this one has a lot of overlap with it.

print("Running MTP eagle with two model style.")
spec_config = EagleDecodingConfig(
max_draft_len=args.spec_decode_max_draft_len,
speculative_model_dir=args.model_dir,

Are the weights for MTP included in the target model checkpoint?

-        print(
-            "MTP only supports one model style spec decode; ignoring default use_one_model=False"
+        print("Running MTP eagle with two model style.")
+        spec_config = EagleDecodingConfig(

Does it have to be EagleDecodingConfig? On the UX side, it's not great that you have to use different config objects for MTP_EAGLE 2-model vs 1-model.
