
Conversation

Collaborator

@leslie-fang25 leslie-fang25 commented Aug 26, 2025

Summary by CodeRabbit

  • New Features

    • Explicit max sequence length is now honored across the executor and sampling, so generation-length limits are enforced consistently.
    • AutoDeploy backend is supported in worker and execution flows.
    • Executor creation accepts optional tokenizer, logits post-processor, and parallelization settings.
  • Refactor

    • Executor and worker now use unified/base LLM argument objects for configuration.
    • Executor creation is derived from LLM args, simplifying wiring and configuration propagation.

Description

This PR mainly refactors the interface of create_py_executor by replacing executor config with llm args.
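For orientation, here is a minimal before/after sketch of the entry point (parameter names come from the change summary below; the defaults and the elided body are illustrative, not the actual implementation):

# Before: callers built an ExecutorConfig and passed it in
# def create_py_executor(executor_config, ...): ...

# After: LLM args drive creation; the executor config is derived internally
def create_py_executor(llm_args,                             # TorchLlmArgs
                       checkpoint_dir=None,
                       tokenizer=None,                       # new, optional
                       lora_config=None,
                       logits_post_processor_config=None,    # new, optional
                       parallel_config=None):                # new, optional
    ...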

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages that don't match the specified backends. Only [pytorch, cpp, tensorrt, triton] are supported. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
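
As an illustration, a typical invocation combining a few of the options above (the stage and GPU names are the examples used in this help text):

/bot run --disable-fail-fast --stage-list "A10-PyTorch-1"
/bot run --post-merge --gpu-type "A30, H100_PCIe"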

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since a lack of user care and validation can break the top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since a lack of user care and validation can break the top of tree.

Contributor

coderabbitai bot commented Aug 26, 2025

📝 Walkthrough


Executor creation was refactored to accept LLM-args objects (BaseLlmArgs/TorchLlmArgs/ADLlmArgs) across the worker, autodeploy, and PyTorch paths; max_seq_len is computed from the model engine and propagated through sampler creation into PyExecutor; several public signatures were updated accordingly.

Changes

  • AutoDeploy executor API (tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py): the create_autodeploy_executor signature changed to accept ad_config: LlmArgs directly; ExecutorConfig usage/derivation removed. Core ADEngine flow unchanged.
  • PyExecutor utils: max_seq_len plumbing (tensorrt_llm/_torch/pyexecutor/_util.py): create_py_executor_instance gains max_seq_len: Optional[int]; create_torch_sampler_args is now keyword-only with *, max_seq_len, enable_mixed_sampler (see the sketch after this list); instantiate_sampler forwards engine.max_seq_len into the sampler args.
  • PyExecutor constructor (tensorrt_llm/_torch/pyexecutor/py_executor.py): PyExecutor.__init__ adds an optional max_seq_len: Optional[int] and assigns self.max_seq_len.
  • PyExecutor creator API and flow (tensorrt_llm/_torch/pyexecutor/py_executor_creator.py): create_py_executor now accepts llm_args: TorchLlmArgs (replacing a direct ExecutorConfig); optional tokenizer, logits_post_processor_config, and parallel_config added. It builds executor_config from llm_args, computes and propagates max_seq_len, and passes it to create_py_executor_instance. The garbage-collection threshold is sourced from llm_args.
  • Executor type surface (tensorrt_llm/executor/executor.py): the llm_args parameter type of GenerationExecutor.create changed to Optional[BaseLlmArgs] (import/signature update).
  • Worker refactor: LlmArgs-first and backend dispatch (tensorrt_llm/executor/worker.py): Worker and worker_main now accept Optional[BaseLlmArgs]. _create_py_executor builds the executor from llm_args, supports the pytorch and autodeploy backends (using ADLlmArgs for autodeploy), stores the mapping/checkpoint loader, derives and stores max_seq_len, and updates shutdown and token deduction to use llm_args.
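A minimal sketch of the keyword-only tightening referenced above (the signature is reduced to the pieces named in this summary; the body is elided and the annotations are assumptions):

from typing import Optional

def create_torch_sampler_args(executor_config, mapping, *,
                              max_seq_len: Optional[int],
                              enable_mixed_sampler: bool):
    # Arguments after "*" must be passed by keyword, e.g.:
    #   create_torch_sampler_args(executor_config, mapping,
    #                             max_seq_len=engine.max_seq_len,
    #                             enable_mixed_sampler=cfg.enable_mixed_sampler)
    ...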

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Client
  participant Worker as GenerationExecutorWorker
  participant Creator as py_executor_creator.create_py_executor
  participant Util as _util.create_py_executor_instance
  participant PyEx as PyExecutor
  participant AD as AutoDeploy

  Client->>Worker: start(llm_args: BaseLlmArgs, ...)
  alt llm_args.backend == "pytorch"
    Worker->>Creator: create_py_executor(llm_args, checkpoint_dir, tokenizer, lora_config, logits_post_processor_config, parallel_config)
    note right of Creator #E6F4EA: build executor_config from llm_args\ncompute max_seq_len
    Creator->>Util: create_py_executor_instance(..., max_seq_len)
    Util->>PyEx: __init__(..., max_seq_len)
    PyEx-->>Creator: PyExecutor instance
    Creator-->>Worker: PyExecutor
  else llm_args.backend == "autodeploy"
    Worker->>Worker: ad_config = llm_args.to_ad_config()
    Worker->>AD: create_autodeploy_executor(ad_config)
    note right of AD #FFF4E6: uses ad_config directly (signature change)
    AD-->>Worker: Executor
  end
  Worker-->>Client: ready
sequenceDiagram
  autonumber
  participant Engine as ModelEngine
  participant Util as _util.instantiate_sampler
  participant Sampler as TorchSampler

  Engine->>Util: instantiate_sampler(engine, pytorch_backend_config, ...)
  note right of Util #E8F0FF: create_torch_sampler_args(max_seq_len=engine.max_seq_len, enable_mixed_sampler=cfg.enable_mixed_sampler)
  Util->>Sampler: Args(max_seq_len, ...)
  Sampler-->>Util: Sampler instance

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested labels

AutoDeploy

Suggested reviewers

  • Superjomn
  • syuoni
  • QiJune
  • suyoggupta

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (6)
tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (1)

1-1: Missing NVIDIA copyright header.

Per repo guidelines, prepend the NVIDIA copyright header (2025) to all Python sources.

I can generate a standardized header if you confirm the preferred form.

tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

1-1: Missing NVIDIA copyright header.

Please add the 2025 NVIDIA copyright header at the top.

tensorrt_llm/executor/executor.py (1)

1-1: Missing NVIDIA copyright header.

Please prepend the standard 2025 NVIDIA header.

tensorrt_llm/_torch/pyexecutor/_util.py (1)

1-1: Missing NVIDIA copyright header.

Please add the 2025 NVIDIA header.

tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (1)

1-1: Missing NVIDIA copyright header.

Please add the 2025 NVIDIA header.

tensorrt_llm/executor/worker.py (1)

1-1: Add NVIDIA copyright header (2025) per repo guidelines

This source file is missing the required NVIDIA copyright header.

Apply at the very top:

+# Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
🧹 Nitpick comments (15)
tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (4)

261-266: Update docstring to match new signature (no checkpoint_dir).

Docstring still mentions "checkpoint directory" even though the function now only accepts ad_config: LlmArgs.

Apply this diff to align the docstring:

-def create_autodeploy_executor(ad_config: LlmArgs):
-    """Create an AutoDeploy executor from the given configuration and checkpoint directory.
-
-    This is the entrypoint API to the _autodeploy backend.
-    """
+def create_autodeploy_executor(ad_config: LlmArgs):
+    """Create an AutoDeploy executor from the given LlmArgs configuration.
+
+    This is the entrypoint API to the _autodeploy backend.
+    """

276-279: Assertion message is stale; prefer explicit type check error.

The message references "pytorch_backend_config" but the parameter is now ad_config. Consider raising a TypeError for clarity.

Apply this diff:

-    msg = "pytorch_backend_config must be an AD LlmArgs object"
-    assert isinstance(ad_config, LlmArgs), msg
+    if not isinstance(ad_config, LlmArgs):
+        raise TypeError("ad_config must be an AutoDeploy LlmArgs object")

330-343: Propagate garbage_collection_gen0_threshold and max_seq_len to PyExecutor.

The PyExecutor now accepts max_seq_len and garbage_collection_gen0_threshold. In the Torch path (_util.create_py_executor_instance), both are passed; mirror that here for consistency and to avoid silently using defaults.

Apply this diff:

     py_executor = PyExecutor(
         resource_manager,
         scheduler,
         model_engine=engine,
         sampler=sampler,
         dist=mpi_dist,
         max_num_sequences=max_num_sequences,
         disable_overlap_scheduler=ad_config.disable_overlap_scheduler,
         max_input_len=ad_config.max_input_len,
         max_batch_size=ad_config.max_batch_size,
         max_draft_len=max_draft_len,
         max_beam_width=ad_config.max_beam_width,
+        garbage_collection_gen0_threshold=ad_config.garbage_collection_gen0_threshold,
+        max_seq_len=ad_config.max_seq_len,
     )

267-275: Optional: Avoid calling get_free_port() on every rank.

Currently, all ranks call dist.get_free_port() and then broadcast. Minor improvement: only rank 0 should acquire the port; other ranks pass a placeholder to broadcast.

Example:

port = dist.get_free_port() if rank == 0 else None
port = mpi_dist.broadcast(port, root=0)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

156-158: New max_seq_len argument: document intent and usage.

The constructor stores max_seq_len but does not reference it elsewhere. If this is intentionally for external consumers (e.g., tooling/tests), add a short docstring note; otherwise, consider wiring it to where sequence-length decisions are made in executor code.

Would you like me to draft a concise constructor docstring including the new parameter?
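
A possible starting point, assuming the parameter stays informational for now (the constructor parameters are abbreviated and the wording is illustrative):

from typing import Optional

class PyExecutor:
    def __init__(self, *args, max_seq_len: Optional[int] = None, **kwargs):
        """Python executor driving the generation loop.

        Args:
            max_seq_len: Maximum sequence length configured for this executor.
                Stored as self.max_seq_len for external consumers (e.g.,
                worker-side max_tokens deduction); not used internally yet.
        """
        self.max_seq_len = max_seq_len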

tensorrt_llm/_torch/pyexecutor/_util.py (2)

1-20: Duplicate/ambiguous imports (minor cleanup).

ModelConfig is imported twice (tensorrt_llm._torch.model_config and ..model_config). Also, is_mla is imported from both ._util and .config_utils in other files. Consider deduplicating to avoid confusion.


592-601: Confirm max_seq_len semantics in instantiate_sampler

Audit of the two sampler‐instantiation paths reveals a divergence:

  • In the AutoDeploy shim (tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py:321–323), the sampler is created with
      max_seq_len=ad_config.max_seq_len
  • In the PyExecutor util (tensorrt_llm/_torch/pyexecutor/_util.py:592–601), it uses
      max_seq_len=engine.max_seq_len

Since create_py_executor adjusts executor_config.max_seq_len (adding 1 for overlap scheduling and max_draft_len for speculative decoding), you’ll want to verify whether the sampler should use that inflated limit rather than the raw engine.max_seq_len.

If the sampler must account for those extra tokens, update the util path to pass executor_config.max_seq_len; otherwise, leave it using engine.max_seq_len as originally intended.

• File: tensorrt_llm/_torch/pyexecutor/_util.py, lines 592–601
• Suggested diff:

-    sampler_args = create_torch_sampler_args(
-        executor_config,
-        mapping,
-        max_seq_len=engine.max_seq_len,
+    sampler_args = create_torch_sampler_args(
+        executor_config,
+        mapping,
+        max_seq_len=executor_config.max_seq_len,
         enable_mixed_sampler=pytorch_backend_config.enable_mixed_sampler)
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (2)

210-217: Public API shift to llm_args is clear; consider adding a short docstring.

Given the signature change, a brief docstring noting that executor_config is now derived from llm_args would help downstream users.

I can draft a Google-style docstring if helpful.
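
Something along these lines could work (the descriptions are illustrative and would need checking against the actual behavior):

def create_py_executor(llm_args, checkpoint_dir=None, tokenizer=None,
                       lora_config=None, logits_post_processor_config=None,
                       parallel_config=None):
    """Create a PyExecutor for the PyTorch backend.

    The executor configuration is derived internally from llm_args; callers no
    longer pass an ExecutorConfig. max_seq_len is computed from the model
    engine and propagated into the created executor.

    Args:
        llm_args: TorchLlmArgs carrying backend, runtime, and GC settings.
        checkpoint_dir: Optional model checkpoint directory.
        tokenizer: Optional tokenizer.
        lora_config: Optional LoRA configuration.
        logits_post_processor_config: Optional logits post-processor settings.
        parallel_config: Optional parallelization settings.

    Returns:
        A configured PyExecutor instance.
    """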


13-20: Imports: ensure minimal necessary and consistent types.

Adding LogitsPostProcessorConfig and ParallelConfig is fine. Minor nit: in this file, is_mla is imported twice (from ._util and .config_utils) elsewhere—dedupe when convenient.

tensorrt_llm/executor/worker.py (6)

93-99: Type hint for engine contradicts runtime handling of list

Runtime accepts a list of engines (per-rank), but the type hint is Union[Path, Engine]. Consider reflecting this in the signature to avoid confusion for callers and static analyzers.

Apply:

-        engine: Union[Path, Engine],
+        engine: Union[Path, Engine, List[Union[Path, Engine]]],

96-104: Guard against zero CUDA devices to avoid ZeroDivisionError

device_id modulo torch.cuda.device_count() will crash on systems without visible GPUs (e.g., CI, CPU-only dev nodes).

Apply:

-        def _get_comm_ranks_device_id():
-            device_id = self.global_rank % torch.cuda.device_count()
+        def _get_comm_ranks_device_id():
+            num_devices = torch.cuda.device_count()
+            if num_devices <= 0:
+                raise RuntimeError("CUDA device not found. Ensure GPUs are visible before initializing executor.")
+            device_id = self.global_rank % num_devices
             torch.cuda.set_device(device_id)

107-115: Replace asserts with explicit exceptions (production code path) and fix long-line lint

Asserts may be stripped with -O and are not ideal for input validation; also flagged by E501 later. Use explicit checks and raise a clear exception.

Apply:

-            assert hasattr(
-                self.llm_args, "backend"
-            ), "llm_args should be with backend in _create_py_executor"
+            if not hasattr(self.llm_args, "backend"):
+                raise ValueError("llm_args must define a 'backend' attribute in _create_py_executor()")

@@
-            else:
-                raise ValueError(
-                    f"Unsupported backend config: {self.llm_args.backend}")
+            else:
+                raise ValueError(f"Unsupported backend config: {self.llm_args.backend}")

Also applies to: 131-135


126-133: Type check for AutoDeploy args should be explicit exception (not assert)

Avoid assert for runtime type-checking; raise TypeError with a concise message.

Apply:

-                assert isinstance(self.llm_args, ADLlmArgs)
-                args["ad_config"] = self.llm_args.get_pytorch_backend_config()
+                if not isinstance(self.llm_args, ADLlmArgs):
+                    raise TypeError("For backend '_autodeploy', llm_args must be an instance of AutoDeploy LlmArgs.")
+                args["ad_config"] = self.llm_args.get_pytorch_backend_config()

653-666: Use explicit exception instead of assert during shutdown; also resolves long line lint

Asserts may be stripped at runtime; make the contract explicit and fix E501 on line 654.

Apply:

-            if self.llm_args is not None:
-                assert self._executor_config is None, "An empty executor_config is expected in shutdown when LLM arguments are defined."
+            if self.llm_args is not None:
+                if self._executor_config is not None:
+                    raise RuntimeError("executor_config must be None during shutdown when LLM arguments are defined.")
                 if (self.llm_args.backend == "pytorch"
                         and hasattr(self, "checkpoint_loader")
                         and self.checkpoint_loader is not None):
                     self.checkpoint_loader.cleanup()
                     self.checkpoint_loader = None

694-712: Consider documenting new parameters (llm_args) in public interfaces

worker_main is a public entrypoint. Add a concise Google-style docstring describing llm_args expectations and backend behavior.

Example:

 def worker_main(
@@
-    llm_args: Optional[BaseLlmArgs] = None,
+    llm_args: Optional[BaseLlmArgs] = None,
 ) -> None:
+    """Worker process main loop.
+
+    Args:
+        engine: Engine path or Engine object for TRT-LLM runtime.
+        worker_queues: IPC addresses for worker-proxy communication.
+        log_level: Logging level string for the leader process.
+        executor_config: Legacy executor config; must be None when llm_args is provided.
+        batched_logits_processor: Optional batched logits processor.
+        worker_cls: Worker class to instantiate.
+        tracer_init_kwargs: Optional VizTracer initialization kwargs.
+        _torch_model_class_mapping: Optional mapping to augment model registry.
+        postproc_worker_config: Post-processing worker settings.
+        ready_signal: Message to send on successful init.
+        is_llm_executor: True if this worker handles request submission.
+        lora_config: LoRA configuration when enabling LoRA.
+        hf_model_dir: HuggingFace model directory for checkpoints.
+        tokenizer: Tokenizer instance.
+        llm_args: BaseLlmArgs-derived configuration. When provided, selects the backend
+            via llm_args.backend ('pytorch' or '_autodeploy') and must be used instead
+            of executor_config.
+    """
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 4f84a45 and 37a6357.

📒 Files selected for processing (6)
  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (2 hunks)
  • tensorrt_llm/_torch/pyexecutor/_util.py (2 hunks)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py (2 hunks)
  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (4 hunks)
  • tensorrt_llm/executor/executor.py (2 hunks)
  • tensorrt_llm/executor/worker.py (8 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Preserve module namespaces when importing; import modules/packages and access members via the module (e.g., from package.subpackage import foo; foo.SomeClass())
Python file names should be snake_case
Python class names should be PascalCase
Python functions/methods and local variables should be snake_case; variables beginning with a number should be prefixed with k_ (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE prefixed with G_ (e.g., G_MY_GLOBAL); constants should be UPPER_SNAKE_CASE
Avoid shadowing variables from outer scopes; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; comments should be reserved for in-function or file-local interfaces
Use Google-style docstrings for classes and functions; attributes and variables may be documented inline with trailing string literals
Avoid reflection when simpler, explicit code suffices (e.g., avoid dict(**locals()) patterns)
In try/except, catch the narrowest exceptions possible
For duck-typing patterns, keep the try body minimal and move logic to else to avoid masking unrelated failures

Files:

  • tensorrt_llm/executor/executor.py
  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
  • tensorrt_llm/executor/worker.py
  • tensorrt_llm/_torch/pyexecutor/_util.py
**/*.{c,cc,cpp,cxx,h,hh,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • tensorrt_llm/executor/executor.py
  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
  • tensorrt_llm/executor/worker.py
  • tensorrt_llm/_torch/pyexecutor/_util.py
🧠 Learnings (1)
📚 Learning: 2025-08-20T06:56:02.889Z
Learnt from: eopXD
PR: NVIDIA/TensorRT-LLM#6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:577-579
Timestamp: 2025-08-20T06:56:02.889Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, maxSequenceLength is now enforced as a non-optional argument in the BlockManager constructor, so concerns about std::nullopt defaulting to 0 are not applicable. When windowSize > maxSequenceLength, a warning should be added instead of handling optional parameter cases.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • tensorrt_llm/_torch/pyexecutor/_util.py
🧬 Code graph analysis (5)
tensorrt_llm/executor/executor.py (1)
tensorrt_llm/llmapi/llm_args.py (1)
  • BaseLlmArgs (1136-1913)
tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (1)
tensorrt_llm/_torch/auto_deploy/llm_args.py (1)
  • LlmArgs (237-351)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
tensorrt_llm/_torch/attention_backend/trtllm.py (2)
  • max_seq_len (558-568)
  • max_seq_len (571-575)
tensorrt_llm/executor/worker.py (5)
tensorrt_llm/llmapi/llm_args.py (5)
  • BaseLlmArgs (1136-1913)
  • TorchLlmArgs (2122-2544)
  • get_pytorch_backend_config (2479-2544)
  • parallel_config (1356-1357)
  • to_mapping (282-294)
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (1)
  • create_py_executor (210-493)
tensorrt_llm/_torch/auto_deploy/llm_args.py (1)
  • LlmArgs (237-351)
tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (1)
  • create_autodeploy_executor (261-344)
tensorrt_llm/_torch/pyexecutor/config.py (1)
  • _construct_checkpoint_loader (175-197)
tensorrt_llm/_torch/pyexecutor/_util.py (2)
tensorrt_llm/_torch/attention_backend/trtllm.py (2)
  • max_seq_len (558-568)
  • max_seq_len (571-575)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
  • PyExecutor (138-1784)
🪛 Ruff (0.12.2)
tensorrt_llm/executor/worker.py

494-494: Line too long (140 > 120)

(E501)


654-654: Line too long (136 > 120)

(E501)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (17)
tensorrt_llm/executor/executor.py (1)

24-24: Verified downstream compatibility for BaseLlmArgs

  • In executor.py, the create_executor factory (lines 362–364) now takes llm_args: Optional[BaseLlmArgs] as expected.
  • In worker.py, GenerationExecutorWorker’s constructor signature (lines 64–67) has been updated to accept BaseLlmArgs.
  • In proxy.py, all accesses to worker_kwargs["llm_args"] remain valid against the BaseLlmArgs API (e.g. garbage_collection_gen0_threshold).

No further changes required—everything downstream already supports BaseLlmArgs.

tensorrt_llm/_torch/pyexecutor/_util.py (3)

426-428: Forwarding max_seq_len into create_py_executor_instance is correct.

Good addition to keep PyExecutor aware of the configured sequence length; aligns with the new constructor contract.


574-576: PyExecutor construction passes new knobs as intended.

garbage_collection_gen0_threshold and max_seq_len are forwarded—consistent with PR goals. LGTM.


578-590: All create_torch_sampler_args calls use keyword-only arguments
I’ve confirmed that the only invocation of create_torch_sampler_args sits at line 596 in tensorrt_llm/_torch/pyexecutor/_util.py and explicitly passes both max_seq_len= and enable_mixed_sampler=. There are no other call sites in the repo, so this API tightening is safe and won’t break positional usages.

  • Call site: _util.py:596 uses keyword args for both parameters
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (4)

219-224: Wiring logits_post_processor_config and parallel_config through executor_config looks correct.

No functional issues spotted.


299-313: max_seq_len plumbing: matches design described in PR summary.

The computed (possibly inflated) max_seq_len is stored in executor_config and used downstream. Good.
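
For readers skimming the review, the inflation mentioned here (and in the _util.py comment above) amounts to roughly the following; whether each term is gated on its feature being enabled is an assumption of this sketch:

def compute_executor_max_seq_len(model_max_seq_len, max_draft_len,
                                 disable_overlap_scheduler=False):
    # +1 for overlap scheduling and +max_draft_len for speculative decoding,
    # per the "adding 1 ... and max_draft_len" note earlier in this review.
    max_seq_len = model_max_seq_len
    if not disable_overlap_scheduler:
        max_seq_len += 1
    return max_seq_len + max_draft_len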


443-444: Pass-through of computed max_seq_len to create_py_executor_instance is correct.

Both the first and the post-estimation code paths correctly forward max_seq_len.

Also applies to: 487-488


210-217: Verify all callers use the new create_py_executor signature

The signature of create_py_executor in
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (now taking llm_args first and no longer accepting executor_config) has changed. Please audit and update every call site:

tensorrt_llm/executor/worker.py:
  • Implementation of _create_py_executor (around line 105): ensure it calls create_py_executor(llm_args, …) rather than passing an executor_config directly.
  • Invocation at line 175: replace any use of _create_py_executor() that relies on the old parameter order.

Search for any other calls to create_py_executor( across the codebase to confirm none still pass executor_config first.

tensorrt_llm/executor/worker.py (9)

21-21: Import set looks correct for new LLM-args pathway

Importing BaseLlmArgs and TorchLlmArgs is appropriate for the mixed backend code paths; PybindMirror is used later for lookahead_config. No action needed.


66-66: Constructor now accepts BaseLlmArgs — good generalization

Broader type improves reuse across backends and keeps worker decoupled from Torch-specific args.


105-153: LLM-args-based executor creation is well structured

Clean backend dispatch to PyTorch and AutoDeploy, proper propagation of tokenizer/lora/logits config, and capturing mapping/checkpoint_loader for downstream use all look solid.


137-146: Nice: Prepare mapping/checkpoint_loader and capture max_seq_len from executor

Storing mapping and harmonizing max_seq_len with model engine avoids duplicated inference elsewhere. Good forward-compat hook.

Also applies to: 148-152


175-177: Engine selection logic is clear and preserves legacy path

Using llm_args gate to choose PyExecutor vs TRT C++ executor keeps behavior consistent.


199-211: LoRA integration for PyTorch backend is correct

Fetching PEFT cache manager from resource manager and wiring through LoraManager is appropriate. Asserts presence of lora_model_config, which is fine given guarded path.


522-528: Forwarding llm_args into _deduce_max_tokens is correct

Ensures new pathway drives token deduction. Good.


494-494: Resolve Ruff E501 (line too long)

Long assert/message lines are flagged. The diffs above replace asserts and wrap messages, addressing these lint errors.

Also applies to: 654-654


695-713: Review Python version requirements and PEP 604 union usage

It looks like you’ve applied PEP 604 unions (e.g. Path | Engine) in many places across the repo. That syntax is only valid under Python 3.10+ grammar, but the project’s setup still indicates Python 3.8+ support. Please verify which path you intend and adjust accordingly:

• If you are now targeting Python 3.10 or newer, update your python_requires in setup.py to at least ">=3.10" so the use of A | B parses correctly.
• If you must maintain Python 3.8 compatibility, you’ll need to replace all occurrences of PEP 604 unions with typing.Union[...] (or Optional[...]) project-wide—fixing only the one signature in worker.py won’t be sufficient.

In tensorrt_llm/executor/worker.py (lines 695–713), for example, you could apply:

- def worker_main(
-     engine: Path | Engine,
+ def worker_main(
+     engine: Union[Path, Engine],
      worker_queues: WorkerCommIpcAddrs,
      log_level: str,
      executor_config: Optional[tllm.ExecutorConfig] = None,
      …
 ) -> None:
     mpi_comm().barrier()

But please confirm your overall strategy—either bump the minimum Python version in setup.py or convert all PEP 604 unions across the codebase.

@leslie-fang25
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #16515 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #16515 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12405 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

@leslie-fang25 leslie-fang25 marked this pull request as ready for review August 26, 2025 23:27
@leslie-fang25 leslie-fang25 requested review from a team as code owners August 26, 2025 23:27
@leslie-fang25 leslie-fang25 force-pushed the leslie/use_llm_args_for_py_executor_part2_new branch from 37a6357 to 1c25084 Compare August 28, 2025 08:36
@leslie-fang25
Collaborator Author

/bot run

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (5)
tensorrt_llm/executor/worker.py (3)

1-1: Add NVIDIA copyright header (2025).

Required by repo guidelines; missing in this file.

Apply:

+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.

56-67: Validate mutually exclusive inputs early: llm_args vs executor_config.

Fail fast if both are provided; avoids later asserts and ambiguous paths.

     def __init__(
         self,
         engine: Union[Path, Engine],
         executor_config: Optional[tllm.ExecutorConfig] = None,
@@
         tokenizer: Optional[TokenizerBase] = None,
         llm_args: Optional[BaseLlmArgs] = None,
     ) -> None:
+        if llm_args is not None and executor_config is not None:
+            raise ValueError("Pass either llm_args or executor_config, not both.")

96-104: Guard against zero CUDA devices in rank/device setup
Check torch.cuda.device_count() before using it in a modulo to avoid a ZeroDivisionError and raise a clear error when no GPUs are available:

 def _get_comm_ranks_device_id():
-    device_id = self.global_rank % torch.cuda.device_count()
+    device_count = torch.cuda.device_count()
+    if device_count == 0:
+        raise RuntimeError("No CUDA devices found; a GPU is required to create the executor.")
+    device_id = self.global_rank % device_count
     torch.cuda.set_device(device_id)
     # Make sure C++ executor would use same devices/ranks as py_executor
     global_rank = global_mpi_rank()
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (2)

1-1: Add NVIDIA copyright header (2025).

Required by repo guidelines; missing in this file.

+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.

246-247: Fix logging call: incorrect usage of logger.info causes formatting error.

Use %s placeholder; current code passes extra arg without placeholder.

-    logger.info("ATTENTION RUNTIME FEATURES: ", attn_runtime_features)
+    logger.info("ATTENTION RUNTIME FEATURES: %s", attn_runtime_features)
♻️ Duplicate comments (1)
tensorrt_llm/executor/worker.py (1)

482-516: Fix deduction of max_tokens: handle None max_seq_len, ceil split per CP, and cp_size validity.

Prevents arithmetic with None, divide-by-zero, and underestimation due to floor division. Also shortens long lines (Ruff E501).

-        def _deduce_max_tokens(request: GenerationRequest,
-                               executor_config: tllm.ExecutorConfig,
-                               llm_args: Optional[BaseLlmArgs] = None) -> int:
+        def _deduce_max_tokens(
+                request: GenerationRequest,
+                executor_config: Optional[tllm.ExecutorConfig],
+                llm_args: Optional[BaseLlmArgs] = None) -> int:
@@
-            cp_size = 1
-            max_seq_len = None
+            cp_size = 1
+            max_seq_len = None
             if llm_args is not None:
-                # deduce max_tokens by llm args
-                assert executor_config is None, "An empty executor_config in _deduce_max_tokens is expected when LLM arguments are defined."
-                if hasattr(self,
-                           "mapping") and self.mapping.cp_size is not None:
+                # deduce max_tokens by llm args
+                if executor_config is not None:
+                    raise ValueError(
+                        "executor_config must be None when llm_args are provided to _deduce_max_tokens().")
+                if hasattr(self, "mapping") and getattr(self.mapping, "cp_size", None) is not None:
                     cp_size = self.mapping.cp_size
-                max_seq_len = getattr(self, "max_seq_len", None)
+                if not hasattr(self, "max_seq_len") or self.max_seq_len is None:
+                    raise RuntimeError(
+                        "max_tokens is not set and cannot be deduced: "
+                        "max_seq_len unavailable from llm_args/executor.")
+                max_seq_len = self.max_seq_len
             else:
                 # deduce max_tokens by executor config
-                if hasattr(executor_config, "mapping"
-                           ) and executor_config.mapping.cp_size is not None:
-                    cp_size = executor_config.mapping.cp_size
-                max_seq_len = getattr(executor_config, "max_seq_len", None)
-            if max_seq_len is None:
-                logger.warning("`default_max_tokens` cannot be deduced")
-                if max_tokens is None:
-                    raise ValueError(
-                        "`max_tokens` must be set when `default_max_tokens` cannot be deduced"
-                    )
-                else:
-                    # use max_tokens if can't deduce default_max_tokens
-                    return max_tokens
-            splited_prompt_len = int(len(prompt_token_ids) / cp_size)
-            default_max_tokens = max_seq_len - splited_prompt_len - query_token_len
+                if hasattr(executor_config, "mapping") and getattr(executor_config.mapping, "cp_size", None) is not None:
+                    cp_size = executor_config.mapping.cp_size
+                if not hasattr(executor_config, "max_seq_len") or executor_config.max_seq_len is None:
+                    raise RuntimeError(
+                        "max_tokens is not set and cannot be deduced: "
+                        "executor_config.max_seq_len is missing.")
+                max_seq_len = executor_config.max_seq_len
+            if not isinstance(cp_size, int) or cp_size < 1:
+                raise ValueError(f"Invalid cp_size={cp_size}; expected integer >= 1.")
+            # ceil(len(prompt_token_ids) / cp_size)
+            split_prompt_len = (len(prompt_token_ids) + cp_size - 1) // cp_size
+            default_max_tokens = max_seq_len - split_prompt_len - query_token_len
             if default_max_tokens <= 0:
-                logger.warning(
-                    f"`default_max_tokens` ({default_max_tokens}) should be greater than 0, "
-                    f"`default_max_tokens` ({default_max_tokens}) = max_seq_len ({max_seq_len})"
-                    f" - `splited_prompt_len` ({splited_prompt_len}) - `query_token_len` ({query_token_len})"
-                )
+                logger.warning(
+                    "default_max_tokens (%s) <= 0; max_seq_len(%s) - split_prompt_len(%s) - "
+                    "query_token_len(%s).",
+                    default_max_tokens, max_seq_len, split_prompt_len, query_token_len)
                 if max_tokens is None:
                     raise ValueError(
                         "`max_tokens` must be set when `default_max_tokens` is illegal"
                     )
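
A quick numeric check of the suggested formula (values are made up for illustration):

max_seq_len, cp_size, query_token_len = 8192, 2, 0
prompt_token_ids = list(range(1001))                                   # 1001 prompt tokens
split_prompt_len = (len(prompt_token_ids) + cp_size - 1) // cp_size    # ceil -> 501
default_max_tokens = max_seq_len - split_prompt_len - query_token_len  # 8192 - 501 - 0 = 7691
# Floor division would give 500 and overstate the available budget by one token.
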
🧹 Nitpick comments (2)
tensorrt_llm/executor/worker.py (1)

199-211: LoRA setup may NPE if PEFT cache manager missing.

Defensive None-check prevents AttributeError when lora_config is given but manager isn’t present.

         if self.llm_args and getattr(
                 self.llm_args, "backend",
                 "") == "pytorch" and lora_config is not None:
             from tensorrt_llm._torch.pyexecutor.resource_manager import \
                 ResourceManagerType
             peft_cache_manager = self.engine.resource_manager.resource_managers.get(
                 ResourceManagerType.PEFT_CACHE_MANAGER)
-            self._lora_manager = LoraManager(
-                cpp_peft_cache_manager=peft_cache_manager.impl)
+            if peft_cache_manager is None:
+                raise RuntimeError("PEFT cache manager not available; cannot enable LoRA.")
+            self._lora_manager = LoraManager(cpp_peft_cache_manager=peft_cache_manager.impl)
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (1)

30-35: Duplicate import of is_mla; remove one to avoid shadowing/confusion.

Keep a single source (config_utils) or _util, not both.

-from ._util import (KvCacheCreator, _adjust_torch_mem_fraction,
-                    create_py_executor_instance, instantiate_sampler, is_mla)
+from ._util import (KvCacheCreator, _adjust_torch_mem_fraction,
+                    create_py_executor_instance, instantiate_sampler)
 from .config_utils import is_mla
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 37a6357 and 1c25084.

📒 Files selected for processing (6)
  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (2 hunks)
  • tensorrt_llm/_torch/pyexecutor/_util.py (2 hunks)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py (2 hunks)
  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (4 hunks)
  • tensorrt_llm/executor/executor.py (2 hunks)
  • tensorrt_llm/executor/worker.py (9 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • tensorrt_llm/executor/executor.py
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
  • tensorrt_llm/_torch/pyexecutor/_util.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent with 4 spaces; do not use tabs
Preserve module namespaces in imports: import the subpackage/module, not the symbol (from package.subpackage import foo; foo.SomeClass())
Naming: files snake_case; classes PascalCase; functions/methods snake_case; local variables snake_case (k_ prefix if starting with a number); globals G_ + UPPER_SNAKE_CASE; constants UPPER_SNAKE_CASE
Avoid shadowing outer-scope variables; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; reserve comments for function-internal or file-local interfaces
Use Google-style docstrings for classes and functions; inline docstrings for attributes/variables are allowed
Avoid reflection when straightforward code suffices (e.g., prefer explicit parameters over dict(**locals()))
Use narrow except clauses (e.g., catch FileNotFoundError instead of bare except)
For duck-typing try/except, keep try body minimal and use else for the main logic

Files:

  • tensorrt_llm/executor/worker.py
  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
**/*.{cpp,cc,cxx,cu,h,hpp,hh,hxx,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header with current year to all source files

Files:

  • tensorrt_llm/executor/worker.py
  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
🧠 Learnings (2)
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
PR: NVIDIA/TensorRT-LLM#7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which can contain default `cuda_graph_config` values, so `llm_args` may already have this config before the extra options processing.

Applied to files:

  • tensorrt_llm/executor/worker.py
  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
PR: NVIDIA/TensorRT-LLM#7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM's bench configuration, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which is a Dict[str, Any] that can contain default values including `cuda_graph_config`, making the fallback `llm_args["cuda_graph_config"]` safe to use.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
🧬 Code graph analysis (1)
tensorrt_llm/executor/worker.py (5)
tensorrt_llm/llmapi/llm_args.py (5)
  • BaseLlmArgs (1172-1949)
  • PybindMirror (579-723)
  • TorchLlmArgs (2158-2586)
  • get_pytorch_backend_config (2521-2586)
  • parallel_config (1392-1393)
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (1)
  • create_py_executor (210-493)
tensorrt_llm/_torch/auto_deploy/llm_args.py (1)
  • LlmArgs (237-351)
tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (1)
  • create_autodeploy_executor (261-344)
tensorrt_llm/_torch/pyexecutor/config.py (1)
  • _construct_checkpoint_loader (175-197)
🪛 Ruff (0.12.2)
tensorrt_llm/executor/worker.py

494-494: Line too long (140 > 120)

(E501)


669-669: Line too long (136 > 120)

(E501)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (8)
tensorrt_llm/executor/worker.py (4)

137-146: Checkpoint loader construction: limit to pytorch backend (LGTM), but prefer explicit None check.

Current logic is fine; clear separation by backend.


148-152: Propagate max_seq_len from executor (LGTM).

Good: respects engine-updated max_seq_len.


541-543: Passing llm_args into _deduce_max_tokens (LGTM).

Correctly routes through new deduction path.


709-727: worker_main: new llm_args arg (LGTM).

Signature change consistent with worker.init and overall flow.

tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (4)

13-20: Imports extended for new surface (LGTM).

New types (LogitsPostProcessorConfig, ParallelConfig, TorchLlmArgs, TokenizerBase) are appropriate here.


210-217: Public API refactor of create_py_executor (LGTM).

Clean switch to llm_args; parameters read clearly reflect the new surface.


219-224: ExecutorConfig derivation and GC threshold wiring (LGTM).

Correctly derives executor_config from llm_args and plumbs additional configs.


443-444: Pass max_seq_len to create_py_executor_instance (LGTM).

Keeps runtime in sync with engine-updated sequence length in both paths.

Also applies to: 487-488

@tensorrt-cicd
Collaborator

PR_Github #16831 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #16831 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #12636 completed with status: 'FAILURE'

@leslie-fang25
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #16838 [ run ] triggered by Bot

@leslie-fang25 leslie-fang25 force-pushed the leslie/use_llm_args_for_py_executor_part2_new branch from 1c25084 to c11aad7 Compare August 28, 2025 09:43
@leslie-fang25
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #16840 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #16838 [ run ] completed with state ABORTED

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (1)

261-279: Add explicit return type and enforce runtime validation
In tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py:

  • Change
    def create_autodeploy_executor(ad_config: LlmArgs):
    to
    def create_autodeploy_executor(ad_config: LlmArgs) -> PyExecutor:
  • Replace the type‐check assert
    msg = "pytorch_backend_config must be an AD LlmArgs object"
    assert isinstance(ad_config, LlmArgs), msg
    with
    if not isinstance(ad_config, LlmArgs):
        raise TypeError("ad_config must be an AutoDeploy LlmArgs instance")
  • Replace the beam‐width assert
    assert ad_config.max_beam_width <= 1, "_autodeploy + beam_search is not supported"
    with
    if ad_config.max_beam_width > 1:
        raise ValueError("beam_search is not supported in AutoDeploy")

Optional: to avoid out-of-range GPU errors on multi-node setups, use

torch.cuda.set_device(rank % torch.cuda.device_count())
♻️ Duplicate comments (3)
tensorrt_llm/executor/worker.py (3)

482-536: Fix max_tokens deduction: None-handling, CP ceil split, and robust validation.

Keep logic consistent with llm-args flow; avoid asserts and floor-division underestimation.

-        def _deduce_max_tokens(request: GenerationRequest,
-                               executor_config: tllm.ExecutorConfig,
-                               llm_args: Optional[BaseLlmArgs] = None) -> int:
+        def _deduce_max_tokens(
+                request: GenerationRequest,
+                executor_config: Optional[tllm.ExecutorConfig],
+                llm_args: Optional[BaseLlmArgs] = None) -> int:
@@
-            cp_size = 1
-            max_seq_len = None
+            cp_size = 1
+            max_seq_len = None
             if llm_args is not None:
-                # deduce max_tokens by llm args
-                assert executor_config is None, "An empty executor_config in _deduce_max_tokens is expected when LLM arguments are defined."
-                if hasattr(self,
-                           "mapping") and self.mapping.cp_size is not None:
+                # deduce max_tokens by llm args
+                if executor_config is not None:
+                    raise ValueError("executor_config must be None when llm_args are provided to _deduce_max_tokens().")
+                if hasattr(self, "mapping") and getattr(self.mapping, "cp_size", None) is not None:
                     cp_size = self.mapping.cp_size
-                max_seq_len = getattr(self, "max_seq_len", None)
+                max_seq_len = getattr(self, "max_seq_len", None)
             else:
                 # deduce max_tokens by executor config
-                if hasattr(executor_config, "mapping"
-                           ) and executor_config.mapping.cp_size is not None:
+                if hasattr(executor_config, "mapping") and getattr(executor_config.mapping, "cp_size", None) is not None:
                     cp_size = executor_config.mapping.cp_size
-                max_seq_len = getattr(executor_config, "max_seq_len", None)
-            if max_seq_len is None:
-                logger.warning("`default_max_tokens` cannot be deduced")
-                if max_tokens is None:
-                    raise ValueError(
-                        "`max_tokens` must be set when `default_max_tokens` cannot be deduced"
-                    )
-                else:
-                    # use max_tokens if can't deduce default_max_tokens
-                    return max_tokens
-            splited_prompt_len = int(len(prompt_token_ids) / cp_size)
-            default_max_tokens = max_seq_len - splited_prompt_len - query_token_len
+                max_seq_len = getattr(executor_config, "max_seq_len", None)
+            if not isinstance(cp_size, int) or cp_size < 1:
+                raise ValueError(f"Invalid cp_size={cp_size}; expected integer >= 1.")
+            if max_seq_len is None:
+                if max_tokens is None:
+                    raise ValueError("max_tokens is required: max_seq_len unavailable to deduce a default.")
+                return max_tokens
+            # ceil(len(prompt_token_ids)/cp_size) without importing math
+            split_prompt_len = (len(prompt_token_ids) + cp_size - 1) // cp_size
+            default_max_tokens = max_seq_len - split_prompt_len - query_token_len
             if default_max_tokens <= 0:
-                logger.warning(
-                    f"`default_max_tokens` ({default_max_tokens}) should be greater than 0, "
-                    f"`default_max_tokens` ({default_max_tokens}) = max_seq_len ({max_seq_len})"
-                    f" - `splited_prompt_len` ({splited_prompt_len}) - `query_token_len` ({query_token_len})"
-                )
+                logger.warning(
+                    f"default_max_tokens ({default_max_tokens}) <= 0; "
+                    f"per-CP prompt length {split_prompt_len} + query length {query_token_len} "
+                    f"exceeds max_seq_len {max_seq_len}."
+                )
                 if max_tokens is None:
                     raise ValueError(
-                        "`max_tokens` must be set when `default_max_tokens` is illegal"
+                        "max_tokens must be set when deduced default_max_tokens is invalid."
                     )
             # default_max_tokens is the biggest available value
             if max_tokens is None:
                 return default_max_tokens
             elif max_tokens > default_max_tokens:
                 logger.warning(
-                    f"User-specified `max_tokens` ({max_tokens}) is greater than deduced "
-                    f"`default_max_tokens` ({default_max_tokens}), using default_max_tokens instead."
+                    f"User max_tokens ({max_tokens}) > deduced default_max_tokens ({default_max_tokens}); using default."
                 )
                 return default_max_tokens
             return max_tokens

This also addresses Ruff E501 at the previous assert line.


105-137: Don’t use asserts for runtime validation in executor creation.

Replace both asserts with explicit errors; keeps checks in optimized runs and fixes Ruff E501 risk.

-        def _create_py_executor():
+        def _create_py_executor():
             args = {}
-            assert hasattr(
-                self.llm_args, "backend"
-            ), "llm_args should be with backend in _create_py_executor"
+            if not hasattr(self.llm_args, "backend"):
+                raise ValueError("llm_args must have 'backend' in _create_py_executor().")
@@
-            elif self.llm_args.backend == "_autodeploy":
+            elif self.llm_args.backend == "_autodeploy":
                 from tensorrt_llm._torch.auto_deploy.llm_args import \
                     LlmArgs as ADLlmArgs
@@
-                assert isinstance(self.llm_args, ADLlmArgs)
+                if not isinstance(self.llm_args, ADLlmArgs):
+                    raise TypeError("llm_args must be AutoDeploy LlmArgs for '_autodeploy' backend.")

668-675: Replace assert in shutdown with explicit error; wrap long condition (E501).

Prevents assert stripping and fixes long line.

-            if self.llm_args is not None:
-                assert self._executor_config is None, "An empty executor_config is expected in shutdown when LLM arguments are defined."
-                if (self.llm_args.backend == "pytorch"
-                        and hasattr(self, "checkpoint_loader")
-                        and self.checkpoint_loader is not None):
+            if self.llm_args is not None:
+                if self._executor_config is not None:
+                    raise ValueError(
+                        "executor_config must be None in shutdown when LLM arguments are defined."
+                    )
+                if (
+                    self.llm_args.backend == "pytorch"
+                    and hasattr(self, "checkpoint_loader")
+                    and self.checkpoint_loader is not None
+                ):
                     self.checkpoint_loader.cleanup()
                     self.checkpoint_loader = None
🧹 Nitpick comments (2)
tensorrt_llm/executor/worker.py (1)

66-67: Constructor change: ensure docstring and param validation.

Add a brief docstring note that llm_args drives executor creation when provided.
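
One possible wording for that note, with a deliberately simplified signature (the real parameter list is not reproduced here):

    def __init__(self, engine, llm_args=None, **kwargs):  # signature simplified for illustration
        """Initialize the worker.

        Args:
            llm_args: Optional LLM arguments (e.g. a BaseLlmArgs instance). When
                provided, executor creation is driven by these arguments instead of
                an executor_config; callers should not supply both.
        """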

tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (1)

219-224: executor_config.parallel_config is ignored when deriving the mapping
_get_mapping only reads executor_config.mapping, so the executor_config.parallel_config assignment at py_executor_creator.py:221 has no effect. Either remove the dead assignment or derive the mapping from it, e.g. set executor_config.mapping = parallel_config.to_mapping() before _get_mapping runs.
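
A minimal sketch of the second option, assuming the to_mapping() helper referenced in the code graph below (exact attribute spelling may differ from the actual file):

    # Sketch only: let parallel_config feed the mapping before it is derived.
    if executor_config.parallel_config is not None:
        executor_config.mapping = executor_config.parallel_config.to_mapping()
    mapping = _get_mapping(executor_config)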

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 1c25084 and c11aad7.

📒 Files selected for processing (6)
  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (2 hunks)
  • tensorrt_llm/_torch/pyexecutor/_util.py (2 hunks)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py (2 hunks)
  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (4 hunks)
  • tensorrt_llm/executor/executor.py (2 hunks)
  • tensorrt_llm/executor/worker.py (9 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • tensorrt_llm/executor/executor.py
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • tensorrt_llm/_torch/pyexecutor/_util.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code must target Python 3.8+
Indent with 4 spaces; do not use tabs
Preserve module namespaces in imports: import the subpackage/module, not the symbol (from package.subpackage import foo; foo.SomeClass())
Naming: files snake_case; classes PascalCase; functions/methods snake_case; local variables snake_case (k_ prefix if starting with a number); globals G_ + UPPER_SNAKE_CASE; constants UPPER_SNAKE_CASE
Avoid shadowing outer-scope variables; initialize all externally visible members in init
Prefer docstrings for interfaces used outside a file; reserve comments for function-internal or file-local interfaces
Use Google-style docstrings for classes and functions; inline docstrings for attributes/variables are allowed
Avoid reflection when straightforward code suffices (e.g., prefer explicit parameters over dict(**locals()))
Use narrow except clauses (e.g., catch FileNotFoundError instead of bare except)
For duck-typing try/except, keep try body minimal and use else for the main logic

Files:

  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
  • tensorrt_llm/executor/worker.py
  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
**/*.{cpp,cc,cxx,cu,h,hpp,hh,hxx,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header with current year to all source files

Files:

  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
  • tensorrt_llm/executor/worker.py
  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
🧠 Learnings (2)
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
PR: NVIDIA/TensorRT-LLM#7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which can contain default `cuda_graph_config` values, so `llm_args` may already have this config before the extra options processing.

Applied to files:

  • tensorrt_llm/executor/worker.py
  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
PR: NVIDIA/TensorRT-LLM#7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM's bench configuration, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which is a Dict[str, Any] that can contain default values including `cuda_graph_config`, making the fallback `llm_args["cuda_graph_config"]` safe to use.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
🧬 Code graph analysis (2)
tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (1)
tensorrt_llm/_torch/auto_deploy/llm_args.py (1)
  • LlmArgs (237-351)
tensorrt_llm/executor/worker.py (5)
tensorrt_llm/llmapi/llm_args.py (5)
  • BaseLlmArgs (1172-1949)
  • TorchLlmArgs (2158-2586)
  • get_pytorch_backend_config (2521-2586)
  • parallel_config (1392-1393)
  • to_mapping (282-294)
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (1)
  • create_py_executor (210-493)
tensorrt_llm/auto_parallel/parallelization.py (1)
  • ParallelConfig (51-77)
tensorrt_llm/_torch/auto_deploy/llm_args.py (1)
  • LlmArgs (237-351)
tensorrt_llm/_torch/pyexecutor/config.py (1)
  • _construct_checkpoint_loader (175-197)
🪛 Ruff (0.12.2)
tensorrt_llm/executor/worker.py

494-494: Line too long (140 > 120)

(E501)


669-669: Line too long (136 > 120)

(E501)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (5)
tensorrt_llm/executor/worker.py (3)

21-21: Import surface looks good.

Switch to BaseLlmArgs/TorchLlmArgs/PybindMirror aligns with the refactor.


199-202: LoRA manager gating per backend is correct.

Condition ensures PyTorch-only LoRA path. LGTM.
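
For readers without the diff at hand, the gate being approved has roughly this shape (condition and helper names are guesses, not quoted from worker.py):

    # Sketch: LoRA setup only on the PyTorch backend path.
    if (llm_args is not None
            and getattr(llm_args, "backend", None) == "pytorch"
            and lora_config is not None):
        self._setup_lora(lora_config)  # hypothetical helper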


726-727: worker_main now accepts llm_args: OK.

Signature aligns with constructor; proxy wiring remains consistent.

tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (2)

210-217: API change: create_py_executor(llm_args, …): Good migration.

Signature is clear and typed; matches worker usage.


443-444: Propagating max_seq_len to PyExecutor: LGTM.

This keeps worker.max_seq_len in sync with engine-derived limits.

Also applies to: 487-488

@tensorrt-cicd
Collaborator

PR_Github #16840 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12643 completed with status: 'FAILURE'

@leslie-fang25 leslie-fang25 force-pushed the leslie/use_llm_args_for_py_executor_part2_new branch from c11aad7 to a2b4248 on August 29, 2025 06:56
Collaborator

@LinPoly LinPoly left a comment

LGTM

@leslie-fang25
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #16967 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #16967 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #12736 completed with status: 'FAILURE'

@leslie-fang25 leslie-fang25 force-pushed the leslie/use_llm_args_for_py_executor_part2_new branch from de88835 to ec3ed20 on August 29, 2025 10:47
@leslie-fang25
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #16982 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #16982 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12750 completed with status: 'FAILURE'

@leslie-fang25
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #17025 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #17025 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12788 completed with status: 'FAILURE'

@leslie-fang25
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #17029 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #17029 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12791 completed with status: 'FAILURE'

@leslie-fang25
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #17053 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #17053 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12814 completed with status: 'SUCCESS'

Collaborator

@Superjomn Superjomn left a comment

LGTM

@leslie-fang25 leslie-fang25 requested a review from QiJune September 1, 2025 05:01
Collaborator

@QiJune QiJune left a comment

LGTM

@leslie-fang25 leslie-fang25 enabled auto-merge (squash) September 1, 2025 05:37
@leslie-fang25
Collaborator Author

Hi @lucaslie, @suyoggupta, could you help take a look at this PR? It seems to need your approval to get landed.
