
Conversation

Collaborator

@dominicshanshan commented Aug 13, 2025

Summary by CodeRabbit

  • New Features

    • Examples adopt KV-cache configuration with optional FP8; backend selection made test-configurable.
    • LLM temporarily forces verbose backend logging during initialization.
  • Bug Fixes

    • More reliable GPU workspace zero-initialization for Top-K.
    • Improved executor startup error reporting and handshake semantics.
  • Tests

    • Added FP8 prequantized tests, multi-instance disaggregated-serving, backend-logging QA, 2‑GPU multimodal E2E; added unit coverage for executor failures; updated accuracy refs and test matrices.
  • Chores

    • Multimodal builder compatibility for newer Transformer layouts.
    • Build script resolves absolute project root path.

Description

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
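
For example, a typical invocation that runs a single test stage with fail-fast disabled (stage name illustrative):

/bot run --stage-list "A10-PyTorch-1" --disable-fail-fast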

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause the top of tree to break.

Contributor

coderabbitai bot commented Aug 13, 2025

📝 Walkthrough


Adjusts CUDA workspace memset base, standardizes executor init payloads and error logging, temporarily forces INFO during LLM init and logs backend, adopts KvCacheConfig/FP8 test paths, adds transformer-structure compatibility, propagates per-submodel quant_config for Gemma, and expands integration/unit tests and test manifests.

Changes

Cohort / File(s) / Change summary
CUDA kernel workspace init
cpp/tensorrt_llm/kernels/topkLastDim.cu
Use aligned_pointers[0] as destination for cudaMemsetAsync when zeroing workspace (size unchanged).
Executor init handshake & error propagation
tensorrt_llm/executor/proxy.py, tensorrt_llm/executor/worker.py
Worker now enqueues 2-tuple (status, traceback); proxy unpacks (ready_signal, error_trace), logs error_trace on failure, aborts MPI and raises.
LLM init logging guard & backend logs
tensorrt_llm/llmapi/llm.py
Temporarily set logger level to INFO during BaseLLM.__init__, restore in finally; add explicit backend selection log messages.
Example: KvCacheConfig adoption
examples/llm-api/star_attention.py
Replace QuantConfig path with KvCacheConfig(...); pass kv_cache_config into LLM constructor.
Gemma3VLM per-submodel quant_config
tensorrt_llm/_torch/models/modeling_gemma3vl.py
Provide quant_config to the text submodel; set the vision submodel quant_config=None to leave vision weights unquantized.
Transformer version compatibility
tensorrt_llm/tools/multimodal_builder.py
Detect nested model.model.vision_model/model.model.multi_modal_projector vs legacy top-level attributes and use resolved components.
FP8 accuracy references
tests/integration/defs/accuracy/references/*.yaml
Add FP8 entries (quant_algo: FP8, kv_cache_quant_algo: FP8) for Gemma models while retaining baselines.
Disaggregated serving: multi-instance & health
tests/integration/defs/accuracy/test_disaggregated_serving.py
Launch per-instance context/gen servers (computed ports/GPUs), add multi_popen, health check (/health), update run_parallel_test signature to include ctx_instances/gen_instances, add test_multi_instance.
LLM API PyTorch tests & cuda_graph_config arg
tests/integration/defs/accuracy/test_llm_api_pytorch.py
Add FP8 prequantized tests for Gemma; switch tests to pass cuda_graph_config (replacing use_cuda_graph boolean).
Backend selector in test runner
tests/integration/defs/llmapi/_run_llmapi_llm.py
Add --backend option; choose TrtLLM vs LLM at runtime; pass build_config only for TensorRT backend.
QA tests for backend types & logs
tests/integration/defs/llmapi/test_llm_api_qa.py
New tests asserting default vs TensorRT backend types and presence of backend log messages.
Perf config: cuda_graph_config schema
tests/integration/defs/perf/pytorch_model_config.py
Replace flat CUDA graph flags with nested cuda_graph_config dict (enable_padding, batch_sizes) across patterns.
E2E multimodal 2-GPU test
tests/integration/defs/test_e2e.py
Add parameterized test_ptp_quickstart_multimodal_2gpu for three models; duplicate definition present in patch.
Disaggregated benchmark tweak
tests/integration/defs/disaggregated/test_disaggregated.py
Lower max_seq_len from 384 to 320 in context and generation server configs.
Test manifests and waivers
tests/integration/test_lists/*, tests/integration/test_lists/waives.txt, tests/integration/test_lists/test-db/*.yml
Add/adjust entries for new tests and FP8 variants; swap SKIP entries; some duplicate manifest lines observed.
Unit test: executor failure path
tests/unittest/llmapi/test_llm_pytorch.py
Add FailingExecutorWorker/FailingExecutor and test_llm_with_proxy_error asserting LLM construction surfaces executor-init errors.
Triton build script path resolution
triton_backend/inflight_batcher_llm/scripts/build.sh
Compute LLM_ROOT relative to BUILD_DIR and canonicalize via cd/pwd before invoking tag script.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor WorkerProcess
  participant Worker as Executor Worker
  participant Queue as worker_init_status_queue
  participant Proxy as Executor Proxy
  participant MPI as MPI Session

  Worker->>Queue: put (ready_signal, error_trace)
  Proxy->>Queue: get()
  Queue-->>Proxy: (ready_signal, error_trace)
  alt ready_signal == READY
    Proxy-->>WorkerProcess: proceed
  else
    Proxy->>Proxy: logger.error(error_trace)
    Proxy->>MPI: Abort()
    Proxy-->>WorkerProcess: raise RuntimeError
  end
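A minimal, self-contained Python sketch of the handshake above; READY, initialize_engine, and the plain in-process queue are stand-ins for the real TensorRT-LLM worker/proxy plumbing:

import traceback
from queue import Queue  # stands in for the real inter-process init-status queue

READY = "ready"  # hypothetical ready signal


def initialize_engine() -> None:
    """Placeholder for real engine setup; raises to simulate a startup failure."""
    raise RuntimeError("engine setup failed")


def worker_init(status_queue: Queue) -> None:
    # Worker side: always enqueue a 2-tuple (status, traceback_text).
    try:
        initialize_engine()
    except Exception as e:
        status_queue.put((e, traceback.format_exc()))
    else:
        status_queue.put((READY, None))


def proxy_wait(status_queue: Queue) -> None:
    # Proxy side: unpack (ready_signal, error_trace) and surface failures.
    ready_signal, error_trace = status_queue.get()
    if ready_signal != READY:
        print(error_trace)  # stands in for logger.error(error_trace) plus MPI abort
        raise RuntimeError("Executor worker failed to initialize")


q = Queue()
worker_init(q)
try:
    proxy_wait(q)
except RuntimeError:
    pass  # expected here, since initialize_engine always fails in this sketch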
sequenceDiagram
  autonumber
  actor Caller
  participant LLM as BaseLLM.__init__
  participant Logger

  Caller->>LLM: construct(...)
  LLM->>Logger: save current level
  LLM->>Logger: set level INFO
  alt backend == "pytorch"
    LLM->>Logger: log "Using LLM with PyTorch backend"
  else backend == "_autodeploy"
    LLM->>Logger: log "Using LLM with AutoDeploy backend"
  else
    LLM->>Logger: log "Using LLM with TensorRT backend"
  end
  LLM->>Logger: restore original level (finally)
  LLM-->>Caller: return / raise
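A minimal sketch of the logging guard in the second diagram, using the standard logging module as a stand-in for TensorRT-LLM's logger; the backend strings mirror the summary above, and the function name is illustrative:

import logging
from typing import Optional

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger("llm_init_sketch")


def init_llm(backend: Optional[str]) -> None:
    original_level = logger.level          # save the caller's level
    logger.setLevel(logging.INFO)          # temporarily force INFO so the message shows
    try:
        if backend == "pytorch":
            logger.info("Using LLM with PyTorch backend")
        elif backend == "_autodeploy":
            logger.info("Using LLM with AutoDeploy backend")
        else:
            logger.info("Using LLM with TensorRT backend")
        # ... rest of initialization would go here ...
    finally:
        logger.setLevel(original_level)    # always restore, even if init raises


init_llm("pytorch")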

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~30 minutes

Possibly related PRs

Suggested labels

Release Blocker

Suggested reviewers

  • litaotju
  • chzblych
  • Tabrizian
  • Superjomn
  • mikeiovine


@dominicshanshan
Collaborator Author

/bot run --disable-fail-fast

@dominicshanshan
Collaborator Author

/bot kill

@tensorrt-cicd
Collaborator

PR_Github #15121 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #15122 [ ] completed with state ABORTED

Contributor

@coderabbitai bot left a comment


Actionable comments posted: 11

🔭 Outside diff range comments (2)
tensorrt_llm/_torch/models/modeling_gemma3vl.py (1)

20-27: Avoid shadowing ModelConfig with two different imports

ModelConfig is imported twice from different modules (..model_config and .modeling_utils). This shadowing is error-prone and can cause subtle typing/runtime issues if the two definitions diverge. Import ModelConfig from a single source and only import filter_weights/register_auto_model from modeling_utils.

Apply this diff to avoid the shadowing:

-from ..model_config import ModelConfig
+from ..model_config import ModelConfig
@@
-from .modeling_utils import ModelConfig, filter_weights, register_auto_model
+from .modeling_utils import filter_weights, register_auto_model
tests/integration/test_lists/qa/llm_function_full.txt (1)

713-719: Duplicate test entry detected

The test llmapi/test_llm_api_qa.py::TestLlmDefaultBackend::test_llm_args_logging appears twice (lines 715 and 719). This duplication will cause the test to run multiple times unnecessarily.

Remove the duplicate entry:

 # llm-api promote pytorch to default
 llmapi/test_llm_api_qa.py::TestLlmDefaultBackend::test_llm_args_logging
 llmapi/test_llm_api_qa.py::TestLlmDefaultBackend::test_llm_args_type_tensorrt
 llmapi/test_llm_api_qa.py::TestLlmDefaultBackend::test_llm_args_type_default
-llmapi/test_llm_api_qa.py::TestLlmDefaultBackend::test_llm_args_logging
🧹 Nitpick comments (19)
cpp/tensorrt_llm/kernels/topkLastDim.cu (1)

1278-1279: Nit: clarify intent and guard zero-byte memset

Add a brief comment about which regions are zero-initialized, and skip cudaMemsetAsync when the computed byte count is zero. This documents the contract and avoids an unnecessary runtime call for edge cases (e.g., batch_size == 0).

-        cudaMemsetAsync(aligned_pointers[0], 0,
-            static_cast<char*>(aligned_pointers[2]) - static_cast<char*>(aligned_pointers[0]), stream);
+        // Zero-initialize counters and histograms only (aligned region [aligned_pointers[0], aligned_pointers[2])).
+        const auto zeroBytes = static_cast<size_t>(
+            static_cast<char*>(aligned_pointers[2]) - static_cast<char*>(aligned_pointers[0]));
+        if (zeroBytes > 0)
+        {
+            cudaMemsetAsync(aligned_pointers[0], 0, zeroBytes, stream);
+        }
tensorrt_llm/executor/worker.py (2)

1-1: Add NVIDIA copyright header

Per repo guidelines, prepend the current-year NVIDIA copyright header to all source files.

Apply at file top:

+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.

777-778: Avoid sending raw Exception objects over IPC; add a safe fallback

Transmitting Exception instances via ZeroMQ/pickle can fail (custom exceptions may be unpicklable). Since the proxy only logs the traceback and uses the exception as a cause, add a fallback to send a picklable payload if the first send fails.

Apply:

-            worker_init_status_queue.put((e, traceback.format_exc()))
+            try:
+                worker_init_status_queue.put((e, traceback.format_exc()))
+            except Exception:
+                # Fallback to a picklable payload (no exception instance)
+                worker_init_status_queue.put((repr(e), traceback.format_exc()))

Follow-up: see proxy-side defensive unpacking/logging suggestion to handle both Exception and str causes.

tensorrt_llm/executor/proxy.py (1)

1-1: Add NVIDIA copyright header

Per repo guidelines, prepend the current-year NVIDIA copyright header to all source files.

Apply at file top:

+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
tests/unittest/llmapi/test_llm_pytorch.py (3)

1-2: Add NVIDIA copyright header

Per repo guidelines, prepend the current-year NVIDIA copyright header to all source files.

Apply at file top:

+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.

821-827: Remove unused FailingExecutor type

The dynamically created FailingExecutor is unused. Simplify by removing it.

Apply:

-FailingExecutor = type(
-    "FailingExecutor", (), {
-        "create":
-        classmethod(
-            lambda cls, *args, **kwargs: FailingExecutorWorker(*args, **kwargs))
-    })

839-847: Fix lint (F841) and simplify the patching

  • Avoid assigning to an unused variable.
  • You can raise directly via side_effect without constructing FailingExecutorWorker, which reduces coupling to the worker class.

Apply:

-    with patch('tensorrt_llm.executor.executor.GenerationExecutor.create',
-               side_effect=lambda *args, **kwargs: FailingExecutorWorker(
-                   *args, **kwargs)):
+    with patch('tensorrt_llm.executor.executor.GenerationExecutor.create',
+               side_effect=RuntimeError(
+                   "Mock GenerationExecutorWorker initialization failed")):
         with pytest.raises(
                 RuntimeError,
                 match="Mock GenerationExecutorWorker initialization failed"):
-            llm = LLM(model=llama_model_path,
-                      kv_cache_config=global_kvcache_config)
+            LLM(model=llama_model_path, kv_cache_config=global_kvcache_config)
tensorrt_llm/tools/multimodal_builder.py (1)

1195-1201: Consider improving the version detection logic.

The current check relies on attribute existence (hasattr) which may not be the most robust approach for detecting transformers version differences. Consider checking the transformers version directly for more explicit version handling.

-    # Check if the model structure is updated to transformers >= 4.52.0
-    if hasattr(model, 'model') and hasattr(model.model, 'vision_model'):
+    # Check transformers version for model structure compatibility
+    import transformers
+    if hasattr(transformers, '__version__') and transformers.__version__ >= '4.52.0':
         vision_model = model.model.vision_model
         multi_modal_projector = model.model.multi_modal_projector
     else:
-        # transformers < 4.52.0
+        # Fallback for transformers < 4.52.0
         vision_model = model.vision_model
         multi_modal_projector = model.multi_modal_projector

However, the current approach using hasattr is actually more defensive and will work even if the version detection fails, so both approaches have merit.

triton_backend/inflight_batcher_llm/scripts/build.sh (1)

56-58: LLM_ROOT resolution via BUILD_DIR is fine; make minor hardening to avoid re-run failures.

Two small robustness tweaks:

  • Line 43 uses mkdir without -p; re-runs will fail if the directory exists.
  • Consider set -euo pipefail at the top to fail fast on errors.

Outside this hunk, suggested changes:

# At the top (after #!/bin/bash)
set -euo pipefail

# Around current line 43
mkdir -p "$BUILD_DIR"
docker/Makefile (1)

4-5: More robust ARG extraction for TRITON_* (handle leading whitespace).

Your grep-based extraction works if ARG is at column 0. To make it resilient to indentation, prefer awk with a whitespace-tolerant pattern.

Apply:

-TRITON_IMAGE       ?= $(shell grep '^ARG TRITON_IMAGE=' Dockerfile.multi | grep -o '=.*' | tr -d '="')
-TRITON_BASE_TAG    ?= $(shell grep '^ARG TRITON_BASE_TAG=' Dockerfile.multi | grep -o '=.*' | tr -d '="')
+TRITON_IMAGE       ?= $(shell awk -F= '/^[[:space:]]*ARG[[:space:]]+TRITON_IMAGE=/ {gsub(/[ "\t]/,"",$2); print $$2}' Dockerfile.multi)
+TRITON_BASE_TAG    ?= $(shell awk -F= '/^[[:space:]]*ARG[[:space:]]+TRITON_BASE_TAG=/ {gsub(/[ "\t]/,"",$2); print $$2}' Dockerfile.multi)
jenkins/BuildDockerImage.groovy (1)

344-351: Use the retry wrapper for the custom tag build for consistency and resilience.

All other builds use llmExecStepWithRetry; mirror that here.

-                sh """
-                cd ${LLM_ROOT} && make -C docker ${target}_${action} \
-                BASE_IMAGE=${BASE_IMAGE} \
-                TRITON_IMAGE=${TRITON_IMAGE} \
-                TORCH_INSTALL_TYPE=${torchInstallType} \
-                IMAGE_WITH_TAG=${customImageWithTag} \
-                STAGE=${dockerfileStage} \
-                BUILD_WHEEL_OPTS='-j ${build_jobs}' ${args}
-                """
+                trtllm_utils.llmExecStepWithRetry(this, script: """
+                    cd ${LLM_ROOT} && make -C docker ${target}_${action} \
+                    BASE_IMAGE=${BASE_IMAGE} \
+                    TRITON_IMAGE=${TRITON_IMAGE} \
+                    TORCH_INSTALL_TYPE=${torchInstallType} \
+                    IMAGE_WITH_TAG=${customImageWithTag} \
+                    STAGE=${dockerfileStage} \
+                    BUILD_WHEEL_OPTS='-j ${build_jobs}' ${args}
+                """, sleepInSecs: 300, shortCommondRunTimeMax: 7200)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (1)

816-834: New Gemma3 27B FP8 prequantized test is correct and well-scoped

  • Disables KV-cache reuse to account for Gemma3’s non-inclusive sliding window support gaps.
  • Forces FLASHINFER backend for custom mask support.
  • Asserts QuantAlgo.FP8 and hits CNN Dailymail, GSM8K, and MMLU: good task coverage.

One nit to fix: the long inline comment on Line 817 exceeds the 120-char limit flagged by Ruff. Suggest wrapping it across lines.

Apply this diff to satisfy E501:

-        # Disabling kv cache reuse as a WAR to deal with gaps in kernel support for Gemma3's non-inclusive sliding window size.
+        # Disabling kv cache reuse as a WAR to deal with gaps in kernel support
+        # for Gemma3's non-inclusive sliding window size.
tests/integration/defs/test_e2e.py (2)

2440-2542: New 2-GPU multimodal quickstart test is well put together; fix long assert line

  • Uses tp_size=2 and adapts model-specific flags (FLASHINFER + KV-reuse off for Gemma; LoRA + short rope for Phi-4-MM).
  • Validates output via parse_output and keyword ratios. Good approach.

Address Ruff’s E501 on the long assertion message by splitting it across lines.

Apply this diff to break the long assert message:

-        assert obs_match_ratio >= match_ratio, f"Incorrect output!\nGenerated \"{prompt_output}\"\nExpected keywords \"{prompt_keywords}\"\n Matched keywords: {matches}\n Observed match ratio {obs_match_ratio} below threshold {match_ratio}"
+        assert obs_match_ratio >= match_ratio, (
+            "Incorrect output!\n"
+            f"Generated \"{prompt_output}\"\n"
+            f"Expected keywords \"{prompt_keywords}\"\n"
+            f"Matched keywords: {matches}\n"
+            f"Observed match ratio {obs_match_ratio} below threshold {match_ratio}"
+        )

1-1: Update copyright year to include 2025

Per coding guidelines, use the current year in headers for modified source files.

Apply this diff:

-# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
tests/integration/defs/llmapi/test_llm_api_qa.py (3)

1-1: Consider removing or expanding the non-standard comment

The comment "Confirm that the default backend is changed" doesn't follow standard Python documentation conventions and lacks context about what default backend is being changed.

Either remove the comment or expand it to provide more context:

-# Confirm that the default backend is changed
+"""Test suite to verify that the default LLM backend is PyTorch in v1.0."""

12-14: Fix docstring formatting

The docstring doesn't follow Python conventions - it should either be a one-liner or properly formatted multi-line docstring.

-    """
-    Check that the default backend is PyTorch for v1.0 breaking change
-    """
+    """Check that the default backend is PyTorch for v1.0 breaking change."""

60-60: Consider extracting backend keywords as constants

The backend keywords are hardcoded as string literals in assertion messages. Consider defining them as class-level constants for better maintainability.

 class TestLlmDefaultBackend:
     """Check that the default backend is PyTorch for v1.0 breaking change."""
+    
+    PYTORCH_BACKEND_LOG = "Using LLM with PyTorch backend"
+    TENSORRT_BACKEND_LOG = "Using LLM with TensorRT backend"
 
     def test_llm_args_type_default(self, llm_root, llm_venv):
         # ... existing code ...
 
     def test_llm_args_logging(self, llm_root, llm_venv):
         # ... existing code ...
-        assert "Using LLM with PyTorch backend" in pytorch_output, f"Expected 'pytorch' in logs, got: {pytorch_output}"
+        assert self.PYTORCH_BACKEND_LOG in pytorch_output, f"Expected 'pytorch' in logs, got: {pytorch_output}"
         
         # ... existing code ...
-        assert "Using LLM with TensorRT backend" in tensorrt_output, f"Expected 'tensorrt' in logs, got: {tensorrt_output}"
+        assert self.TENSORRT_BACKEND_LOG in tensorrt_output, f"Expected 'tensorrt' in logs, got: {tensorrt_output}"

Also applies to: 70-70

tests/integration/defs/accuracy/test_disaggregated_serving.py (2)

250-250: Break long line for better readability

The error message line exceeds the recommended line length limit.

Split the long error message across multiple lines:

-            f"Not enough devices for {ctx_instances} ctx instances (ctx_pp={ctx_pp}*ctx_tp={ctx_tp}) + {gen_instances} gen instances (gen_pp={gen_pp}*gen_tp={gen_tp}), total: {total_ctx_gpus + total_gen_gpus}"
+            f"Not enough devices for {ctx_instances} ctx instances "
+            f"(ctx_pp={ctx_pp}*ctx_tp={ctx_tp}) + {gen_instances} gen instances "
+            f"(gen_pp={gen_pp}*gen_tp={gen_tp}), total: {total_ctx_gpus + total_gen_gpus}"

169-187: Consider extracting the multi_popen helper to a shared utility

The multi_popen context manager provides useful functionality for managing multiple subprocess lifecycles. Since similar patterns appear in test_disaggregated.py, consider extracting this to a shared test utility module to promote code reuse.

Consider creating a shared test utility module (e.g., tests/integration/defs/disaggregated_utils.py) and moving reusable functions like multi_popen there. This would reduce code duplication and make maintenance easier.

Would you like me to help create this shared utility module?

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fe7dda8 and 7cad59d.

📒 Files selected for processing (31)
  • cpp/tensorrt_llm/kernels/topkLastDim.cu (1 hunks)
  • docker/Makefile (3 hunks)
  • examples/llm-api/star_attention.py (3 hunks)
  • jenkins/BuildDockerImage.groovy (3 hunks)
  • jenkins/L0_Test.groovy (1 hunks)
  • jenkins/current_image_tags.properties (1 hunks)
  • tensorrt_llm/_torch/models/modeling_gemma3vl.py (1 hunks)
  • tensorrt_llm/executor/proxy.py (1 hunks)
  • tensorrt_llm/executor/worker.py (2 hunks)
  • tensorrt_llm/llmapi/llm.py (2 hunks)
  • tensorrt_llm/serve/scripts/benchmark_serving.py (1 hunks)
  • tensorrt_llm/tools/multimodal_builder.py (1 hunks)
  • tests/integration/defs/accuracy/references/cnn_dailymail.yaml (1 hunks)
  • tests/integration/defs/accuracy/references/gsm8k.yaml (1 hunks)
  • tests/integration/defs/accuracy/references/mmlu.yaml (1 hunks)
  • tests/integration/defs/accuracy/test_disaggregated_serving.py (6 hunks)
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py (6 hunks)
  • tests/integration/defs/disaggregated/test_disaggregated.py (2 hunks)
  • tests/integration/defs/llmapi/_run_llmapi_llm.py (1 hunks)
  • tests/integration/defs/llmapi/test_llm_api_qa.py (1 hunks)
  • tests/integration/defs/perf/pytorch_model_config.py (1 hunks)
  • tests/integration/defs/test_e2e.py (1 hunks)
  • tests/integration/test_lists/qa/llm_function_full.txt (5 hunks)
  • tests/integration/test_lists/qa/llm_function_sanity.txt (2 hunks)
  • tests/integration/test_lists/test-db/l0_dgx_b200.yml (1 hunks)
  • tests/integration/test_lists/test-db/l0_dgx_h100.yml (3 hunks)
  • tests/integration/test_lists/test-db/l0_h100.yml (1 hunks)
  • tests/integration/test_lists/waives.txt (1 hunks)
  • tests/unittest/llmapi/test_llm_pytorch.py (2 hunks)
  • tests/unittest/test_pip_install.py (0 hunks)
  • triton_backend/inflight_batcher_llm/scripts/build.sh (1 hunks)
💤 Files with no reviewable changes (1)
  • tests/unittest/test_pip_install.py
🧰 Additional context used
📓 Path-based instructions (5)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in __init__
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/serve/scripts/benchmark_serving.py
  • tensorrt_llm/tools/multimodal_builder.py
  • tensorrt_llm/executor/proxy.py
  • tensorrt_llm/executor/worker.py
  • tensorrt_llm/_torch/models/modeling_gemma3vl.py
  • tests/unittest/llmapi/test_llm_pytorch.py
  • tensorrt_llm/llmapi/llm.py
  • tests/integration/defs/perf/pytorch_model_config.py
  • tests/integration/defs/llmapi/_run_llmapi_llm.py
  • tests/integration/defs/test_e2e.py
  • tests/integration/defs/llmapi/test_llm_api_qa.py
  • tests/integration/defs/disaggregated/test_disaggregated.py
  • examples/llm-api/star_attention.py
  • tests/integration/defs/accuracy/test_disaggregated_serving.py
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/serve/scripts/benchmark_serving.py
  • tensorrt_llm/tools/multimodal_builder.py
  • tensorrt_llm/executor/proxy.py
  • tensorrt_llm/executor/worker.py
  • tensorrt_llm/_torch/models/modeling_gemma3vl.py
  • tests/unittest/llmapi/test_llm_pytorch.py
  • tensorrt_llm/llmapi/llm.py
  • cpp/tensorrt_llm/kernels/topkLastDim.cu
  • tests/integration/defs/perf/pytorch_model_config.py
  • tests/integration/defs/llmapi/_run_llmapi_llm.py
  • tests/integration/defs/test_e2e.py
  • tests/integration/defs/llmapi/test_llm_api_qa.py
  • tests/integration/defs/disaggregated/test_disaggregated.py
  • examples/llm-api/star_attention.py
  • tests/integration/defs/accuracy/test_disaggregated_serving.py
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

  • cpp/tensorrt_llm/kernels/topkLastDim.cu
**/*.{cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)

Files:

  • cpp/tensorrt_llm/kernels/topkLastDim.cu
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/tensorrt_llm/kernels/topkLastDim.cu
🧠 Learnings (2)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/integration/test_lists/test-db/l0_dgx_b200.yml
  • tests/unittest/llmapi/test_llm_pytorch.py
  • tests/integration/test_lists/qa/llm_function_sanity.txt
  • tests/integration/defs/llmapi/_run_llmapi_llm.py
  • tests/integration/test_lists/test-db/l0_h100.yml
  • tests/integration/defs/test_e2e.py
  • tests/integration/defs/llmapi/test_llm_api_qa.py
  • tests/integration/test_lists/test-db/l0_dgx_h100.yml
  • tests/integration/defs/accuracy/test_disaggregated_serving.py
  • tests/integration/test_lists/qa/llm_function_full.txt
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • tests/integration/test_lists/qa/llm_function_full.txt
🧬 Code Graph Analysis (10)
tensorrt_llm/executor/proxy.py (3)
tensorrt_llm/executor/ipc.py (2)
  • get (157-176)
  • get (278-279)
tensorrt_llm/logger.py (1)
  • error (125-126)
tensorrt_llm/executor/executor.py (1)
  • _handle_background_error (244-273)
tensorrt_llm/executor/worker.py (2)
tensorrt_llm/executor/utils.py (1)
  • put (119-120)
tensorrt_llm/executor/ipc.py (2)
  • put (116-126)
  • put (270-276)
tests/unittest/llmapi/test_llm_pytorch.py (3)
tensorrt_llm/executor/worker.py (1)
  • GenerationExecutorWorker (50-626)
tensorrt_llm/llmapi/llm.py (1)
  • LLM (1087-1103)
tensorrt_llm/_torch/llm.py (1)
  • LLM (4-9)
tensorrt_llm/llmapi/llm.py (3)
tensorrt_llm/llmapi/llm_args.py (2)
  • TorchLlmArgs (2006-2386)
  • TrtLlmArgs (1824-1962)
tensorrt_llm/_torch/auto_deploy/llm_args.py (1)
  • LlmArgs (224-316)
tests/microbenchmarks/build_time_benchmark.py (1)
  • log_level (145-148)
tests/integration/defs/llmapi/_run_llmapi_llm.py (3)
tensorrt_llm/llmapi/llm.py (1)
  • LLM (1087-1103)
tensorrt_llm/_torch/llm.py (1)
  • LLM (4-9)
tensorrt_llm/evaluate/json_mode_eval.py (1)
  • command (125-138)
tests/integration/defs/test_e2e.py (4)
tests/integration/defs/conftest.py (3)
  • llm_root (180-181)
  • llm_models_root (77-83)
  • llm_venv (707-723)
tests/integration/defs/triton_server/conftest.py (1)
  • llm_models_root (24-33)
tests/integration/defs/common.py (1)
  • parse_output (665-681)
tests/integration/defs/examples/test_llama.py (1)
  • parse_output (129-139)
tests/integration/defs/llmapi/test_llm_api_qa.py (3)
tests/integration/defs/conftest.py (1)
  • llm_venv (707-723)
tensorrt_llm/llmapi/llm.py (1)
  • LLM (1087-1103)
tensorrt_llm/llmapi/llm_args.py (1)
  • TorchLlmArgs (2006-2386)
tests/integration/defs/disaggregated/test_disaggregated.py (2)
tests/integration/defs/conftest.py (5)
  • llm_models_root (77-83)
  • llm_root (180-181)
  • disaggregated_test_root (2339-2344)
  • disaggregated_example_root (270-275)
  • llm_venv (707-723)
tests/integration/defs/trt_test_alternative.py (3)
  • check_call (230-238)
  • check_output (241-267)
  • popen (179-198)
tests/integration/defs/accuracy/test_disaggregated_serving.py (1)
tests/integration/defs/conftest.py (1)
  • get_device_count (1941-1943)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (4)
tensorrt_llm/llmapi/llm_args.py (2)
  • CudaGraphConfig (106-163)
  • KvCacheConfig (929-1024)
tests/integration/defs/conftest.py (1)
  • llm_models_root (77-83)
tensorrt_llm/llmapi/llm.py (1)
  • LLM (1087-1103)
tensorrt_llm/quantization/mode.py (1)
  • QuantAlgo (23-46)
🪛 Ruff (0.12.2)
tests/unittest/llmapi/test_llm_pytorch.py

844-844: Local variable llm is assigned to but never used

Remove assignment to unused variable llm

(F841)

tests/integration/defs/test_e2e.py

2529-2529: Line too long (240 > 120)

(E501)

tests/integration/defs/llmapi/test_llm_api_qa.py

12-13: One-line docstring should fit on one line

Reformat to one line

(D200)


12-13: First line should end with a period, question mark, or exclamation point

Add closing punctuation

(D415)


69-69: Line too long (123 > 120)

(E501)

tests/integration/defs/accuracy/test_disaggregated_serving.py

250-250: Line too long (209 > 120)

(E501)

tests/integration/defs/accuracy/test_llm_api_pytorch.py

817-817: Line too long (127 > 120)

(E501)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (36)
cpp/tensorrt_llm/kernels/topkLastDim.cu (1)

1278-1279: Aligned zero-init base is correct and safer

Zeroing from aligned_pointers[0] up to aligned_pointers[2] precisely clears counters and histograms at their aligned bases, avoiding any pre-alignment padding. This is the right fix for correctness and avoids UB risks from writing into potentially unaligned/slack regions.

tensorrt_llm/executor/worker.py (1)

795-796: Handshake tuple shape is clear and aligned with proxy

Pushing (ready_signal, None) on success is consistent and keeps the payload shape stable. No issues.

tests/unittest/llmapi/test_llm_pytorch.py (1)

812-819: Optional: Patch the worker’s __init__ to cover the MPI handshake path

If the intent is to specifically exercise the proxy’s new tuple handshake path, consider patching GenerationExecutorWorker.__init__ to raise, so the failure originates inside worker_main and travels over IPC. The current patch may fail earlier depending on how GenerationExecutor.create is used.

I can generate a targeted test variant if you want to cover that path explicitly.
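
If that variant is wanted, a hedged sketch of its shape, reusing the module's existing LLM import and the llama_model_path / global_kvcache_config fixtures; the patch target path and error message are assumptions, not the actual code:

import pytest
from unittest.mock import patch


def test_llm_worker_init_failure_sketch():
    # Make the failure originate inside the worker's __init__ so it has to
    # travel back to the proxy through the init-status queue.
    with patch(
            "tensorrt_llm.executor.worker.GenerationExecutorWorker.__init__",
            side_effect=RuntimeError("Mock worker initialization failed")):
        with pytest.raises(RuntimeError, match="initialization failed"):
            LLM(model=llama_model_path, kv_cache_config=global_kvcache_config)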

tensorrt_llm/tools/multimodal_builder.py (1)

1194-1203: LGTM! Well-designed compatibility layer for transformers version differences.

The implementation correctly handles the structural changes in the MLLaMA model between transformers versions by checking for the existence of the newer nested structure (model.model.vision_model) before falling back to the older direct structure (model.vision_model). This ensures backward compatibility while supporting the latest transformers versions.

jenkins/current_image_tags.properties (1)

14-17: Tag updates LGTM and align with URM usage.

The refreshed tags and corrected repository path for the SBSA image look good and consistent with the broader registry migration.

docker/Makefile (2)

85-87: Passing TRITON_IMAGE/TRITON_BASE_TAG through to docker build is a solid improvement.

This ensures Dockerfile.multi ARGs are honored during build; nice addition.


194-204: Base image source unified to nvcr.io; consistent with internal mirror handling upstream.

The base image overrides for rockylinux8 and ubuntu22 targets look correct and consistent with the Jenkins pipeline logic that rewrites to URM during builds.

tests/integration/defs/accuracy/references/cnn_dailymail.yaml (1)

8-10: LGTM! Valid addition of FP8 quantization reference.

The new entry adds FP8 quantization results for google/gemma-3-27b-it with both quant_algo and kv_cache_quant_algo set to FP8, showing expected accuracy degradation (27.90 vs 28.90 baseline). The format and structure are consistent with other model entries in the file.

tests/integration/defs/accuracy/references/mmlu.yaml (1)

119-121: LGTM! Consistent FP8 quantization reference addition.

The new FP8 quantization entry for google/gemma-3-27b-it follows the same pattern as other reference files, with expected accuracy impact (76.80 vs 77.80 baseline). This is part of a coordinated update across multiple accuracy reference files for the same model.

tests/integration/test_lists/waives.txt (1)

295-298: LGTM! Proper test waiving with bug tracking.

All four SKIP entries correctly reference bug tracking numbers and align with newly added tests in the codebase. The waived tests cover different areas (accuracy, disaggregated serving, benchmarking, and multimodal) with appropriate justification via bug references.

tensorrt_llm/serve/scripts/benchmark_serving.py (1)

593-594: LGTM! Defensive programming improvement for metrics handling.

The change from unconditional results[k] access to conditional for k in metrics if k in results prevents KeyError exceptions when some metrics are missing from the results dictionary. This is particularly important for disaggregated/multi-backend scenarios where not all metrics may be available.
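
In isolation, the guarded pattern looks like this (metric names and values are illustrative):

metrics = ["ttft", "tpot", "itl"]        # illustrative metric names
results = {"ttft": 12.3, "itl": 4.5}     # "tpot" missing in this run

# Old behavior: results[k] for every k would raise KeyError on "tpot".
# New behavior: only keep metrics that are actually present.
selected = {k: results[k] for k in metrics if k in results}
assert selected == {"ttft": 12.3, "itl": 4.5}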

tests/integration/defs/perf/pytorch_model_config.py (1)

162-169: LGTM! Configuration structure modernization.

The change from flat CUDA graph configuration keys to a nested cuda_graph_config structure improves configuration organization. The batch_sizes values remain identical, ensuring no functional change while providing better structure alignment with other parts of the codebase that use nested CUDA graph configuration.
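
Schematically, the new shape of the perf config entry (enable_padding and batch_sizes come from the change summary; the values are illustrative):

pytorch_config = {
    "cuda_graph_config": {
        "enable_padding": True,
        "batch_sizes": [1, 2, 4, 8],
    },
}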

tests/integration/defs/accuracy/references/gsm8k.yaml (3)

146-148: FP8 entry for gemma-3-1b-it looks consistent with new test coverage

Adding kv_cache_quant_algo: FP8 with updated accuracy is aligned with the FP8 prequantized tests introduced in this PR. No issues found.


151-153: FP8 entry for gemma-3-27b-it is correct and matches new tests

The FP8 + FP8 KV entry and accuracy value correspond to the new FP8 prequantized test for 27B. Looks good.


156-158: Ministral-8B-Instruct-2410 FP8 metadata extension is reasonable

Including kv_cache_quant_algo: FP8 and updating accuracy is consistent with wider FP8 coverage. No concerns.

tensorrt_llm/_torch/models/modeling_gemma3vl.py (1)

197-206: Per-submodel quant_config handling is correct (vision unquantized, text inherits)

Explicitly setting quant_config=None for vision prevents vision weight-loading from interpreting quant params, while flowing the parent config to the text sub-model. Also, selecting FLASHINFER for text aligns with the need for custom masks. No issues.
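
The split can be pictured with a toy config object; SubConfig and the "FP8" value are stand-ins, not the real ModelConfig plumbing:

import copy
from dataclasses import dataclass
from typing import Optional


@dataclass
class SubConfig:
    quant_config: Optional[str] = None


parent = SubConfig(quant_config="FP8")   # illustrative parent quantization setting

text_config = copy.deepcopy(parent)      # text sub-model inherits the parent quant_config
vision_config = copy.deepcopy(parent)
vision_config.quant_config = None        # vision sub-model is explicitly left unquantized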

tests/integration/defs/accuracy/test_llm_api_pytorch.py (5)

564-565: Switch to cuda_graph_config is clear and future-proof

The conditional CudaGraphConfig vs None improves readability and centralizes CUDA graph settings. Good change.


588-589: Same cuda_graph_config pattern here: LGTM

Consistent with the refactor elsewhere.


709-710: Consistent use of cuda_graph_config toggle

Keeps tests uniform. No issues.


732-733: Consistent cuda_graph_config toggle (FP4 path)

Matches the FP8 path; good consistency.


868-869: Expanded coverage for 1B FP8 prequantized: LGTM

Adding GSM8K between CNN Dailymail and MMLU improves breadth without increasing config complexity.

tests/integration/test_lists/test-db/l0_h100.yml (1)

209-209: Include new Gemma3 27B FP8 prequantized test in l0_h100 post-merge list

The added entry lines up with the new test in the suite. Placement after the 1B FP8 prequantized test is appropriate.

tests/integration/test_lists/qa/llm_function_sanity.txt (2)

49-49: Test entry addition looks correct.

The addition of accuracy/test_llm_api_pytorch.py::TestMistralSmall24B::test_auto_dtype follows the established format and is consistent with other test entries in the file.


114-115: Test entries for nixl backend testing are properly structured.

Both test entries follow the established format:

  • accuracy/test_disaggregated_serving.py::TestQwen3_8B::test_nixl_backend
  • accuracy/test_disaggregated_serving.py::TestDeepSeekV3Lite::test_nixl_backend

These align with the nixl backend functionality being tested as mentioned in the AI summary.

tests/integration/test_lists/test-db/l0_dgx_b200.yml (1)

82-87: YAML test entries are correctly formatted and consistent.

The six new test entries are properly structured:

  • Four disaggregated benchmark tests with appropriate backend parameters
  • Two nixl backend tests for disaggregated serving

All entries follow the established YAML format and are consistent with the test file structure.

tensorrt_llm/llmapi/llm.py (3)

125-127: Log level management implementation is solid.

The code temporarily elevates the logger to "info" level to ensure backend selection messages are displayed, storing the original level for proper restoration.


130-140: Backend-specific logging is well implemented.

The conditional logging clearly communicates which backend is being used:

  • PyTorch backend: "Using LLM with PyTorch backend"
  • AutoDeploy backend: "Using LLM with AutoDeploy backend"
  • TensorRT backend (default): "Using LLM with TensorRT backend"

This provides valuable debugging information and aligns with the test assertions mentioned in the relevant code snippets.


169-171: Proper exception handling with log level restoration.

The finally block ensures the original log level is restored regardless of whether initialization succeeds or fails, preventing log level pollution across LLM instances.

examples/llm-api/star_attention.py (3)

10-10: Import change aligns with the new KvCacheConfig approach.

The import of KvCacheConfig from tensorrt_llm.llmapi.llm_args replaces the previous quantization-based approach, which is consistent with the PR's objective to migrate from QuantConfig to KvCacheConfig for FP8 KV cache configuration.


57-57: KvCacheConfig usage is correct and clear.

The configuration properly sets the dtype based on the fp8_kv_cache parameter:

  • "fp8" when FP8 KV cache is enabled
  • "auto" for automatic dtype selection

This is a cleaner approach than the previous QuantConfig-based method.


69-69: LLM constructor update is consistent with the new configuration approach.

Passing kv_cache_config instead of quant_config to the LLM constructor is the correct way to configure FP8 KV cache with the new API design.
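
Putting the three pieces together, the example's new flow looks roughly like this; the model path and flag value are placeholders, and the import paths follow the points above:

from tensorrt_llm.llmapi import LLM
from tensorrt_llm.llmapi.llm_args import KvCacheConfig

fp8_kv_cache = True                                  # placeholder for the CLI flag
kv_cache_config = KvCacheConfig(
    dtype="fp8" if fp8_kv_cache else "auto")         # FP8 KV cache or automatic selection

llm = LLM(model="/path/to/model",                    # placeholder model path
          kv_cache_config=kv_cache_config)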

tests/integration/defs/llmapi/_run_llmapi_llm.py (4)

3-3: Optional import addition is appropriate.

The import of Optional from typing is needed for the new backend parameter type annotation.


7-8: Backend class imports are correctly structured.

The import structure clearly differentiates between the two LLM classes:

  • TrtLLM from tensorrt_llm._tensorrt_engine for TensorRT backend
  • LLM from tensorrt_llm.llmapi for PyTorch backend (default)

15-16: CLI parameter and function signature updates are consistent.

The addition of the --backend CLI option with default None and the corresponding function parameter with proper type annotation (Optional[str]) follows good practices.


22-29: Backend selection logic is well-implemented.

The implementation correctly:

  1. Defaults to "tensorrt" when no backend is specified
  2. Validates that only "pytorch" or "tensorrt" are accepted
  3. Selects the appropriate LLM class based on the backend
  4. Conditionally passes build_config only for the TensorRT backend (PyTorch backend doesn't use it)

This design aligns with the different initialization requirements of the two backends as shown in the relevant code snippets.
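
Condensed, the selection logic amounts to roughly the following; the function name and defaults are illustrative, while the import paths come from the points above:

from typing import Optional


def build_llm(model: str, backend: Optional[str] = None, build_config=None):
    backend = backend or "tensorrt"                  # default when --backend is omitted
    if backend not in ("pytorch", "tensorrt"):
        raise ValueError(f"Unsupported backend: {backend}")
    if backend == "tensorrt":
        from tensorrt_llm._tensorrt_engine import LLM as TrtLLM
        return TrtLLM(model=model, build_config=build_config)  # build_config only here
    from tensorrt_llm.llmapi import LLM
    return LLM(model=model)                          # PyTorch backend ignores build_config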

tests/integration/test_lists/test-db/l0_dgx_h100.yml (1)

218-218: LGTM!

The addition of auto_trigger: others for the Triton backend post_merge block aligns with the trigger patterns used in other backend blocks.

@dominicshanshan changed the title from "chore: Mass integration of release/1.0" to "[None][chore] Mass integration of release/1.0" on Aug 13, 2025
@dominicshanshan
Collaborator Author

/bot run --disable-fail-fast

Contributor

@coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
tests/integration/defs/accuracy/test_disaggregated_serving.py (1)

736-771: Remove redundant duplicate lines.

Lines 749-750 contain duplicate dictionary access without any operation, which appears to be a copy-paste error that should be removed.

Remove the redundant lines:

-        ctx_server_config["cache_transceiver_config"]
-        ctx_server_config["cache_transceiver_config"]
🧹 Nitpick comments (1)
tests/integration/defs/accuracy/test_disaggregated_serving.py (1)

243-251: Fix line length violation and improve error message readability.

The error message on Line 250 exceeds the 120-character limit and could benefit from better formatting for readability.

Apply this diff to fix the line length and improve readability:

-        pytest.fail(
-            f"Not enough devices for {ctx_instances} ctx instances (ctx_pp={ctx_pp}*ctx_tp={ctx_tp}) + {gen_instances} gen instances (gen_pp={gen_pp}*gen_tp={gen_tp}), total: {total_ctx_gpus + total_gen_gpus}"
-        )
+        pytest.fail(
+            f"Not enough devices for {ctx_instances} ctx instances "
+            f"(ctx_pp={ctx_pp}*ctx_tp={ctx_tp}) + {gen_instances} gen instances "
+            f"(gen_pp={gen_pp}*gen_tp={gen_tp}), total: {total_ctx_gpus + total_gen_gpus}"
+        )
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7cad59d and 73b903a.

📒 Files selected for processing (1)
  • tests/integration/defs/accuracy/test_disaggregated_serving.py (6 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in __init__
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tests/integration/defs/accuracy/test_disaggregated_serving.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tests/integration/defs/accuracy/test_disaggregated_serving.py
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/integration/defs/accuracy/test_disaggregated_serving.py
🪛 Ruff (0.12.2)
tests/integration/defs/accuracy/test_disaggregated_serving.py

250-250: Line too long (209 > 120)

(E501)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Check PR Title Format
  • GitHub Check: Pre-commit Check
🔇 Additional comments (11)
tests/integration/defs/accuracy/test_disaggregated_serving.py (11)

118-123: LGTM! URL parsing and port extraction logic is correct.

The implementation correctly extracts URLs from the disaggregated server config and parses port numbers from them using string splitting.
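
In isolation, that extraction boils down to (URL value illustrative):

url = "localhost:8001"                    # illustrative entry from the disagg config
host, port_str = url.rsplit(":", 1)
port = int(port_str)
assert (host, port) == ("localhost", 8001)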


124-145: LGTM! Context server configuration logic is well-structured.

The per-instance GPU allocation and environment setup for context servers is correctly implemented, including proper CUDA_VISIBLE_DEVICES management and conditional max_num_tokens handling.


146-166: LGTM! Generation server configuration mirrors context server logic appropriately.

The generation server setup follows the same pattern as context servers with proper GPU allocation continuation and environment configuration.


167-185: LGTM! Multi-process orchestration is well-designed.

The multi_popen context manager correctly handles parallel process startup and cleanup using ExitStack, with proper exception handling and cleanup coordination.
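
A minimal standard-library sketch of that pattern; the real helper wraps the repo's popen utility and also sets per-server environments:

import contextlib
import subprocess


@contextlib.contextmanager
def multi_popen(commands):
    """Start one process per command; terminate all of them on exit, even on error."""
    with contextlib.ExitStack() as stack:
        procs = []
        for cmd in commands:
            proc = subprocess.Popen(cmd)
            stack.callback(proc.terminate)  # cleanups run in reverse order on exit
            procs.append(proc)
        yield procs

Used as with multi_popen(ctx_cmds + gen_cmds) as procs: ..., a failure while launching any server still tears down the ones already started.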


186-202: LGTM! Server startup and health check orchestration is robust.

The nested context manager approach properly coordinates the startup of all servers followed by the disaggregated server, with a robust health check loop to ensure readiness before proceeding.
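
The readiness loop can be pictured as follows; the /health path comes from the summary, while the function name and timeouts are illustrative:

import time

import requests


def wait_for_health(base_url: str, timeout_s: float = 120.0) -> None:
    """Poll GET {base_url}/health until it returns 200 or the deadline passes."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            if requests.get(f"{base_url}/health", timeout=5).status_code == 200:
                return
        except requests.RequestException:
            pass                            # server not up yet; retry
        time.sleep(1)
    raise TimeoutError(f"{base_url} did not become healthy within {timeout_s}s")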


203-240: LGTM! OpenAI client integration and request handling is well-implemented.

The OpenAI client setup and the send_request/generate_async functions correctly translate between OpenAI API and TensorRT-LLM's request/response format, maintaining compatibility with existing test harnesses.


276-291: LGTM! Multi-instance URL generation and configuration is correct.

The URL generation for multiple instances and the disaggregated server configuration structure properly support the new multi-instance functionality.
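
Tying the per-instance URLs and the router config together, a hedged sketch of roughly what gets generated; the even/odd port scheme and the key names are inferred from later comments in this review and may not match the file exactly:

ctx_instances, gen_instances = 2, 2

# Context servers take odd ports (8001, 8003, ...), generation servers even ones (8002, 8004, ...).
ctx_urls = [f"localhost:{8001 + 2 * i}" for i in range(ctx_instances)]
gen_urls = [f"localhost:{8002 + 2 * i}" for i in range(gen_instances)]

disaggregated_server_config = {
    "context_servers": {"num_instances": ctx_instances, "urls": ctx_urls},
    "generation_servers": {"num_instances": gen_instances, "urls": gen_urls},
}
print(disaggregated_server_config)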


534-536: LGTM! Test parameter updates support new multi-instance functionality.

The updated calls to run_parallel_test correctly pass the new ctx_instances and gen_instances parameters.


545-547: LGTM! Asymmetric test configuration is correctly updated.

The test properly exercises different combinations of context pipeline parallelism and generation tensor parallelism with the new instance parameters.


548-553: LGTM! Multi-instance test provides good coverage.

The new test method exercises 2 context instances and 2 generation instances, providing valuable test coverage for the multi-instance functionality.


600-633: LGTM! nixl backend test is well-structured.

The test properly configures both context and generation servers to use the "nixl" backend for cache transceiver and validates functionality with both MMLU and GSM8K tasks.
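
For context, a hedged sketch of the per-server wiring this refers to; the nested key layout is an assumption based on the ctx_server_config references elsewhere in this review:

# Hypothetical per-server config dictionaries assembled by the accuracy test.
ctx_server_config = {}
gen_server_config = {}

# Route KV-cache transfers through the "nixl" backend on both sides.
ctx_server_config["cache_transceiver_config"] = {"backend": "nixl"}
gen_server_config["cache_transceiver_config"] = {"backend": "nixl"}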

@tensorrt-cicd
Collaborator

PR_Github #15123 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #15121 [ run ] completed with state ABORTED

@dominicshanshan
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #15124 [ run ] triggered by Bot

@dominicshanshan
Collaborator Author

/bot run --disable-fail-fast

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
tests/integration/defs/accuracy/test_disaggregated_serving.py (3)

1-1: Prepend the NVIDIA 2025 copyright header required by repo guidelines.

All Python source files must include the NVIDIA copyright header (current year). Please add it at the top of this file.

Apply this diff at the top of the file:

+ # Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+

55-55: Fix Python 3.8 compatibility: avoid PEP 585 built-in generics.

The annotation list[...] is a PEP 585 built-in generic, which requires Python 3.9+. Our guidelines target Python 3.8+, so use typing.List[...] and quote the Future parameterization so nothing in the annotation relies on 3.9-only runtime subscripting.

-        self.futures: list[concurrent.futures.Future[RequestOutput]] = []
+        self.futures: List["concurrent.futures.Future[RequestOutput]"] = []

57-66: Ensure thread pool is shut down if any future raises during clean exit.

If future.result() raises while exc_type is None, super().__exit__ won’t run and the pool may leak threads. Wrap the loop to always shutdown on errors.

     def __exit__(self, exc_type, exc_val, exc_tb):
-        if exc_type is None:
-            for future in self.futures:
-                future.result()
-            return super().__exit__(exc_type, exc_val, exc_tb)
+        if exc_type is None:
+            try:
+                for future in self.futures:
+                    future.result()
+            except Exception:
+                for future in self.futures:
+                    future.cancel()
+                # cancel_futures requires Python 3.9+; the loop above already cancels pending futures
+                self.shutdown(wait=True)
+                raise
+            return super().__exit__(exc_type, exc_val, exc_tb)
♻️ Duplicate comments (3)
tests/integration/defs/accuracy/test_disaggregated_serving.py (2)

271-273: Correct tuple order for DuckLLM (args, tokenizer, generate_async).

This fixes the previously reported mismatch. Good catch.
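
For readers outside this test, the point is simply that namedtuple fields must be supplied in declaration order; a sketch under the assumption that DuckLLM is declared with exactly these three fields:

from collections import namedtuple

# Assumed field order, matching the (args, tokenizer, generate_async) noted above.
DuckLLM = namedtuple("DuckLLM", ["args", "tokenizer", "generate_async"])

def generate_async(prompt):
    """Stand-in for the OpenAI-backed implementation in the test."""
    raise NotImplementedError

llm = DuckLLM(args=None, tokenizer=None, generate_async=generate_async)
assert llm.generate_async is generate_async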


752-753: Previous duplicate no-op lines are no longer present.

The earlier reported duplicated ctx_server_config["cache_transceiver_config"] appears resolved.

tests/unittest/llmapi/test_llm_pytorch.py (1)

857-858: Fix Ruff F841: assigned to but never used.

You don’t need to bind the constructed LLM; the exception is raised during construction.

-            llm = LLM(model=llama_model_path,
-                      kv_cache_config=global_kvcache_config)
+            LLM(
+                model=llama_model_path,
+                kv_cache_config=global_kvcache_config,
+            )
🧹 Nitpick comments (4)
tests/integration/defs/accuracy/test_disaggregated_serving.py (2)

135-137: Make port parsing robust to future URL formats.

Splitting by the first “:” breaks if a scheme is introduced (e.g., http://). Use rsplit to extract the last segment.

-    ctx_ports = [int(url.split(":")[1]) for url in ctx_urls]
-    gen_ports = [int(url.split(":")[1]) for url in gen_urls]
+    ctx_ports = [int(url.rsplit(":", 1)[-1]) for url in ctx_urls]
+    gen_ports = [int(url.rsplit(":", 1)[-1]) for url in gen_urls]

181-199: Simplify multi_popen with ExitStack and improve debuggability.

You can remove the intermediate processes list and enter all contexts in one expression. Also consider logging the command on failure.

-    @contextlib.contextmanager
-    def multi_popen(server_configs):
-        processes = []
-        try:
-            for env, args in server_configs:
-                proc = popen(args, env=env)
-                processes.append(proc)
-
-            with contextlib.ExitStack() as stack:
-                opened_processes = [
-                    stack.enter_context(proc) for proc in processes
-                ]
-                yield opened_processes
-        except Exception as e:
-            print(
-                f"Failed to start disaggregated server processes in multi_popen: {e}"
-            )
-            raise
+    @contextlib.contextmanager
+    def multi_popen(server_configs):
+        try:
+            with contextlib.ExitStack() as stack:
+                procs = [
+                    stack.enter_context(popen(args, env=env))
+                    for env, args in server_configs
+                ]
+                yield procs
+        except Exception as e:
+            cmds = [args for _, args in server_configs]
+            print(f"Failed to start disaggregated server processes: {e}\nCommands: {cmds}")
+            raise
tests/unittest/llmapi/test_llm_pytorch.py (2)

824-831: Optional: avoid subclassing the worker; simulate failure by raising directly.

The mock subclass works, but you can simplify by making GenerationExecutor.create raise directly. This reduces coupling to internal class locations and removes the need for the extra import/class.

If you adopt the direct-raise approach in the test (see suggestion on Lines 851-853), you can remove this class:

-class FailingExecutorWorker(GenerationExecutorWorker):
-    """Mock worker that fails during initialization to test error handling."""
-
-    def __init__(self, *args, **kwargs):
-        # Simulate a constructor failure
-        raise RuntimeError(
-            "Mock GenerationExecutorWorker initialization failed")

833-839: Remove unused FailingExecutor helper.

FailingExecutor is defined but never used. Drop it to keep the test lean.

-FailingExecutor = type(
-    "FailingExecutor", (), {
-        "create":
-        classmethod(
-            lambda cls, *args, **kwargs: FailingExecutorWorker(*args, **kwargs))
-    })
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between f987d2d and 28a4732.

📒 Files selected for processing (7)
  • tensorrt_llm/executor/proxy.py (1 hunks)
  • tensorrt_llm/executor/worker.py (2 hunks)
  • tests/integration/defs/accuracy/test_disaggregated_serving.py (6 hunks)
  • tests/integration/test_lists/qa/llm_function_full.txt (4 hunks)
  • tests/integration/test_lists/test-db/l0_dgx_h100.yml (1 hunks)
  • tests/integration/test_lists/waives.txt (1 hunks)
  • tests/unittest/llmapi/test_llm_pytorch.py (2 hunks)
✅ Files skipped from review due to trivial changes (1)
  • tests/integration/test_lists/waives.txt
🚧 Files skipped from review as they are similar to previous changes (4)
  • tensorrt_llm/executor/worker.py
  • tensorrt_llm/executor/proxy.py
  • tests/integration/test_lists/test-db/l0_dgx_h100.yml
  • tests/integration/test_lists/qa/llm_function_full.txt
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tests/integration/defs/accuracy/test_disaggregated_serving.py
  • tests/unittest/llmapi/test_llm_pytorch.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tests/integration/defs/accuracy/test_disaggregated_serving.py
  • tests/unittest/llmapi/test_llm_pytorch.py
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/unittest/llmapi/test_llm_pytorch.py
🧬 Code Graph Analysis (2)
tests/integration/defs/accuracy/test_disaggregated_serving.py (4)
tests/integration/defs/trt_test_alternative.py (1)
  • popen (179-198)
tests/unittest/_torch/auto_deploy/unit/singlegpu/utils/test_config.py (1)
  • temp_dir (67-70)
tensorrt_llm/llmapi/utils.py (1)
  • get_device_count (83-84)
tests/integration/defs/conftest.py (1)
  • get_device_count (1941-1943)
tests/unittest/llmapi/test_llm_pytorch.py (2)
tensorrt_llm/executor/worker.py (1)
  • GenerationExecutorWorker (50-626)
tensorrt_llm/llmapi/llm.py (1)
  • LLM (1090-1106)
🪛 Ruff (0.12.2)
tests/integration/defs/accuracy/test_disaggregated_serving.py

290-290: Line too long (209 > 120)

(E501)

tests/unittest/llmapi/test_llm_pytorch.py

856-856: Local variable llm is assigned to but never used

Remove assignment to unused variable llm

(F841)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (8)
tests/integration/defs/accuracy/test_disaggregated_serving.py (6)

132-180: Per-instance GPU assignment looks correct.

Contiguous GPU blocks per process with a running offset is a sensible and deterministic approach. This pairs well with the upstream device count validation in run_parallel_test.


315-317: Ports allocation strategy LGTM.

Alternating 8001/8002 per instance avoids collisions between ctx/gen servers. Clear and predictable.


323-329: Disaggregated server config matches per-instance URLs.

Including both num_instances and explicit urls is consistent and unambiguous for the router.


578-578: Updated call site now passes instance counts explicitly.

Moving to (ctx_instances=1, gen_instances=1) maintains previous behavior while aligning with the new signature.


580-590: Asymmetric test updates look consistent.

Instance counts added with no behavior change. Guard still skips when capacity is insufficient.


591-595: New multi-instance test adds valuable coverage.

Exercising (2, 2) instances is a good sanity check for the orchestration and GPU allocation logic.

tests/unittest/llmapi/test_llm_pytorch.py (2)

841-859: Good addition: validates that executor creation failures surface at LLM construction.

The test clearly asserts the error path and message. This improves coverage of initialization error propagation.


7-7: Import re-export verified: no change required

Confirmed that GenerationExecutorWorker is explicitly re-exported in tensorrt_llm/executor/__init__.py, so the existing import from tensorrt_llm.executor is valid and sufficiently robust.

Comment on lines +201 to +206
thread_pool, temp_dir, multi_popen(ctx_servers + gen_servers),
popen([
trtllm_serve_path, "disaggregated", "-c",
disaggregated_serving_config_path, "--server_start_timeout",
"3600"
]) as disaggregated_server):
])):

🛠️ Refactor suggestion

Add timeouts to the health check to prevent long hangs.

requests.get() has no default timeout and can hang indefinitely if the socket is established but never responds. Add a reasonable timeout and handle read timeouts.

-                response = requests.get("http://localhost:8000/health")
+                response = requests.get("http://localhost:8000/health", timeout=5)
                 if response.status_code == 200:
                     break
-            except requests.exceptions.ConnectionError:
+            except (requests.exceptions.ConnectionError, requests.exceptions.ReadTimeout):
                 continue

Also applies to: 207-216

🤖 Prompt for AI Agents
In tests/integration/defs/accuracy/test_disaggregated_serving.py around lines
201-216, the health check calls use requests.get without timeouts which can hang
indefinitely; update those requests.get calls to include a reasonable timeout
(e.g., timeout=(connect_timeout, read_timeout) or a single timeout value) and
wrap them to catch requests.exceptions.Timeout and
requests.exceptions.RequestException, treating timeouts as failed health checks
(return False or raise a controlled error) and logging or retrying as
appropriate so the test doesn't hang.

Comment on lines +281 to +292
ctx_instances: int,
gen_instances: int,
test_sets: List[LlmapiAccuracyTestHarness],
ctx_model: str = None,
gen_model: str = None):
if ctx_tp * ctx_pp + gen_tp * gen_pp > get_device_count():
total_ctx_gpus = ctx_tp * ctx_pp * ctx_instances
total_gen_gpus = gen_tp * gen_pp * gen_instances
if total_ctx_gpus + total_gen_gpus > get_device_count():
pytest.fail(
f"Not enough devices for ctx_pp={ctx_pp}+ctx_tp={ctx_tp} and gen_pp={gen_pp}+gen_tp={gen_tp} test"
f"Not enough devices for {ctx_instances} ctx instances (ctx_pp={ctx_pp}*ctx_tp={ctx_tp}) + {gen_instances} gen instances (gen_pp={gen_pp}*gen_tp={gen_tp}), total: {total_ctx_gpus + total_gen_gpus}"
)


🛠️ Refactor suggestion

Enforce line length and consider skipping instead of failing for device scarcity.

  • The pytest.fail(...) message exceeds 120 chars (E501). Rewrap the string.
  • Optional: For insufficient devices, pytest.skip(...) is consistent with other tests and avoids turning capacity issues into failures.
-    if total_ctx_gpus + total_gen_gpus > get_device_count():
-        pytest.fail(
-            f"Not enough devices for {ctx_instances} ctx instances (ctx_pp={ctx_pp}*ctx_tp={ctx_tp}) + {gen_instances} gen instances (gen_pp={gen_pp}*gen_tp={gen_tp}), total: {total_ctx_gpus + total_gen_gpus}"
-        )
+    if total_ctx_gpus + total_gen_gpus > get_device_count():
+        msg = (
+            "Not enough devices for "
+            f"{ctx_instances} ctx instances (ctx_pp={ctx_pp}*ctx_tp={ctx_tp}) "
+            f"+ {gen_instances} gen instances (gen_pp={gen_pp}*gen_tp={gen_tp}), "
+            f"total: {total_ctx_gpus + total_gen_gpus}"
+        )
+        # Consider skipping instead of failing to align with other tests:
+        # pytest.skip(msg)
+        pytest.fail(msg)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
ctx_instances: int,
gen_instances: int,
test_sets: List[LlmapiAccuracyTestHarness],
ctx_model: str = None,
gen_model: str = None):
if ctx_tp * ctx_pp + gen_tp * gen_pp > get_device_count():
total_ctx_gpus = ctx_tp * ctx_pp * ctx_instances
total_gen_gpus = gen_tp * gen_pp * gen_instances
if total_ctx_gpus + total_gen_gpus > get_device_count():
pytest.fail(
f"Not enough devices for ctx_pp={ctx_pp}+ctx_tp={ctx_tp} and gen_pp={gen_pp}+gen_tp={gen_tp} test"
f"Not enough devices for {ctx_instances} ctx instances (ctx_pp={ctx_pp}*ctx_tp={ctx_tp}) + {gen_instances} gen instances (gen_pp={gen_pp}*gen_tp={gen_tp}), total: {total_ctx_gpus + total_gen_gpus}"
)
total_ctx_gpus = ctx_tp * ctx_pp * ctx_instances
total_gen_gpus = gen_tp * gen_pp * gen_instances
if total_ctx_gpus + total_gen_gpus > get_device_count():
msg = (
"Not enough devices for "
f"{ctx_instances} ctx instances (ctx_pp={ctx_pp}*ctx_tp={ctx_tp}) "
f"+ {gen_instances} gen instances (gen_pp={gen_pp}*gen_tp={gen_tp}), "
f"total: {total_ctx_gpus + total_gen_gpus}"
)
# Consider skipping instead of failing to align with other tests:
# pytest.skip(msg)
pytest.fail(msg)
🧰 Tools
🪛 Ruff (0.12.2)

290-290: Line too long (209 > 120)

(E501)

🤖 Prompt for AI Agents
In tests/integration/defs/accuracy/test_disaggregated_serving.py around lines
281 to 292, the pytest.fail call uses a single long string exceeding 120 chars
and should be skipped rather than failed when devices are scarce; change
pytest.fail(...) to pytest.skip(...) and rewrap the message into a shorter
concatenated or formatted string (or split into multiple shorter f-strings) so
no line exceeds 120 characters while preserving the same contextual details
about ctx/gen instances, pp/tp and total devices.

Comment on lines +851 to +853
    with patch('tensorrt_llm.executor.executor.GenerationExecutor.create',
               side_effect=lambda *args, **kwargs: FailingExecutorWorker(
                   *args, **kwargs)):

💡 Verification agent

🧩 Analysis chain

Patch the symbol at the use site to harden the test.

Ensure you’re patching the exact attribute looked up by LLM at runtime. In many cases, patching in the consumer’s module (e.g., tensorrt_llm.llmapi.llm.GenerationExecutor.create) is more robust than patching the defining module.

To verify the call site, run:

Optionally, simplify the failure simulation and avoid the custom worker class:

-with patch('tensorrt_llm.executor.executor.GenerationExecutor.create',
-           side_effect=lambda *args, **kwargs: FailingExecutorWorker(
-               *args, **kwargs)):
+with patch('tensorrt_llm.executor.executor.GenerationExecutor.create',
+           side_effect=RuntimeError(
+               "Mock GenerationExecutorWorker initialization failed")):

If you adopt this, you can also remove the GenerationExecutorWorker import (Line 7) and the FailingExecutorWorker class (Lines 824-831).


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Find all places where GenerationExecutor.create is invoked
rg -nP -C3 '\bGenerationExecutor\s*\.\s*create\s*\(' --type=py tensorrt_llm

# Check what LLM imports/binds for GenerationExecutor
rg -nP -C5 'from\s+tensorrt_llm\.executor(?:\.executor)?\s+import\s+GenerationExecutor\b|import\s+tensorrt_llm\.executor\.executor\s+as' tensorrt_llm/llmapi/llm.py

Length of output: 97


To pinpoint the correct patch target, let’s double-check how GenerationExecutor is imported and invoked in the LLM module.


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# 1. How is GenerationExecutor brought into the LLM API?
rg -nP -C3 'from\s+tensorrt_llm\.executor\.executor\s+import\s+GenerationExecutor' --type=py tensorrt_llm/llmapi/llm.py
rg -nP -C3 'import\s+tensorrt_llm\.executor\.executor\s+as' --type=py tensorrt_llm/llmapi/llm.py

# 2. Does the LLM code ever bind it under a different alias?
rg -nP -C3 'GenerationExecutor\s*=' --type=py tensorrt_llm/llmapi/llm.py

# 3. Where is create() actually called at runtime?
rg -nP -C3 'GenerationExecutor\s*\.\s*create' --type=py .

Length of output: 141


Let’s search the LLM API module to see exactly how GenerationExecutor is imported and where create() is called:


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# 1. Any direct import of GenerationExecutor in the LLM API
rg -nP --type=py 'GenerationExecutor' -n -C3 tensorrt_llm/llmapi/llm.py

# 2. Any aliasing of the executor module
rg -nP --type=py 'import\s+tensorrt_llm\.executor' -n -C3 tensorrt_llm/llmapi/llm.py

# 3. Where create() is invoked in the LLM API
rg -nP --type=py 'create\s*\(' -n -C3 tensorrt_llm/llmapi/llm.py

Length of output: 1639


To pinpoint the correct patch target in your test, let’s inspect how you’re using patch(...) in test_llm_pytorch.py. Please run:


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Show the `with patch(...)` blocks around where FailingExecutorWorker is used
rg -nP 'with\s+patch' -n -C10 --type=py tests/unittest/llmapi/test_llm_pytorch.py

Length of output: 1065


Let’s inspect how _executor_cls is used in the LLM API to see exactly how the executor is instantiated:


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Show where _executor_cls is used to create the executor
rg -nP '_executor_cls\.' -n -C5 --type=py tensorrt_llm/llmapi/llm.py

Length of output: 1241


Let’s inspect the test file’s imports to pinpoint where GenerationExecutorWorker is brought in and prepare the correct removals:


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Show the first 80 lines (imports and helper classes) of the test file
sed -n '1,80p' tests/unittest/llmapi/test_llm_pytorch.py

Length of output: 3209


Patch GenerationExecutor.create at the LLM import site

– In tests/unittest/llmapi/test_llm_pytorch.py, change the patch target to where LLM actually looks it up:

- with patch('tensorrt_llm.executor.executor.GenerationExecutor.create',
-            side_effect=lambda *args, **kwargs: FailingExecutorWorker(
-                *args, **kwargs)):
+ with patch('tensorrt_llm.llmapi.llm.GenerationExecutor.create',
+            side_effect=lambda *args, **kwargs: FailingExecutorWorker(
+                *args, **kwargs)):

Optionally simplify the failure injection and remove the custom worker entirely:

- with patch('tensorrt_llm.llmapi.llm.GenerationExecutor.create',
-            side_effect=lambda *args, **kwargs: FailingExecutorWorker(
-                *args, **kwargs)):
+ with patch('tensorrt_llm.llmapi.llm.GenerationExecutor.create',
+            side_effect=RuntimeError(
+                "Mock GenerationExecutorWorker initialization failed")):

If you adopt the above, you can also remove from the top of the file:

  • the from tensorrt_llm.executor import GenerationExecutorWorker import
  • the class FailingExecutorWorker helper (around lines 824–831)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-    with patch('tensorrt_llm.executor.executor.GenerationExecutor.create',
-               side_effect=lambda *args, **kwargs: FailingExecutorWorker(
-                   *args, **kwargs)):
+    with patch('tensorrt_llm.llmapi.llm.GenerationExecutor.create',
+               side_effect=lambda *args, **kwargs: FailingExecutorWorker(
+                   *args, **kwargs)):
🤖 Prompt for AI Agents
In tests/unittest/llmapi/test_llm_pytorch.py around lines 851–853, the test
currently patches GenerationExecutor.create at tensorrt_llm.executor.executor
but LLM looks up that symbol from its own import site; change the patch target
to the module where the LLM imports GenerationExecutor (e.g.,
tensorrt_llm.llm.GenerationExecutor.create) so the monkeypatch takes effect.
Replace the custom side_effect that constructs FailingExecutorWorker with a
simpler side_effect that raises the desired Exception (or a simple lambda that
raises) and then remove the now-unused import of GenerationExecutorWorker and
the FailingExecutorWorker helper class around lines 824–831.

@tensorrt-cicd
Collaborator

PR_Github #15915 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #15915 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11962 completed with status: 'FAILURE'

@dominicshanshan
Collaborator Author

/bot run --disable-fail-fast

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/unittest/llmapi/test_llm_pytorch.py (1)

841-859: Patch the symbol where LLM looks it up and fix Ruff F841 (unused variable)

Patch GenerationExecutor.create at the import site used by LLM so the monkeypatch takes effect, and don’t assign the constructed LLM to a variable (fixes F841). Also, directly raise the desired RuntimeError—no need to construct a failing worker.

 def test_llm_with_proxy_error():
@@
-    # Test that the error is properly caught and re-raised by LLM
-    # We patch GenerationExecutor.create directly to return our failing worker
-    with patch('tensorrt_llm.executor.executor.GenerationExecutor.create',
-               side_effect=lambda *args, **kwargs: FailingExecutorWorker(
-                   *args, **kwargs)):
-        with pytest.raises(
-                RuntimeError,
-                match="Mock GenerationExecutorWorker initialization failed"):
-            llm = LLM(model=llama_model_path,
-                      kv_cache_config=global_kvcache_config)
+    # Patch where LLM resolves GenerationExecutor to ensure the side effect is hit
+    with patch('tensorrt_llm.llmapi.llm.GenerationExecutor.create',
+               side_effect=RuntimeError(
+                   "Mock GenerationExecutorWorker initialization failed")):
+        with pytest.raises(RuntimeError,
+                           match="Mock GenerationExecutorWorker initialization failed"):
+            LLM(model=llama_model_path, kv_cache_config=global_kvcache_config)
♻️ Duplicate comments (1)
tests/integration/test_lists/waives.txt (1)

305-305: Duplicate SKIP entry for the same test; keep a single canonical line

This line duplicates Line 266 for the same test. Please keep only one entry (prefer the earlier one that already references both bug IDs) to avoid ambiguity in waives and to keep the list tidy.

Apply:

-accuracy/test_llm_api_pytorch.py::TestQwen3_235B_A22B::test_nvfp4[latency_moe_trtllm_eagle3] SKIP (https://nvbugs/5437384)
🧹 Nitpick comments (5)
tests/unittest/llmapi/test_llm_pytorch.py (2)

824-839: Remove unnecessary failing worker scaffolding (simplify the test and avoid heavy imports)

FailingExecutorWorker and FailingExecutor aren’t needed if you patch GenerationExecutor.create to raise directly. Drop these to reduce test brittleness and dependencies.

-class FailingExecutorWorker(GenerationExecutorWorker):
-    """Mock worker that fails during initialization to test error handling."""
-
-    def __init__(self, *args, **kwargs):
-        # Simulate a constructor failure
-        raise RuntimeError(
-            "Mock GenerationExecutorWorker initialization failed")
-
-
-FailingExecutor = type(
-    "FailingExecutor", (), {
-        "create":
-        classmethod(
-            lambda cls, *args, **kwargs: FailingExecutorWorker(*args, **kwargs))
-    })

7-7: Drop unused heavy import

After simplifying the test, GenerationExecutorWorker is no longer needed. Remove the import to avoid unnecessary module loading.

-from tensorrt_llm.executor import GenerationExecutorWorker
tests/integration/defs/accuracy/test_disaggregated_serving.py (3)

132-137: More robust port parsing

url.split(":")[1] is brittle if a scheme or extra colons ever appear. Using rsplit(":", 1) is safer.

-    ctx_ports = [int(url.split(":")[1]) for url in ctx_urls]
-    gen_ports = [int(url.split(":")[1]) for url in gen_urls]
+    ctx_ports = [int(url.rsplit(":", 1)[1]) for url in ctx_urls]
+    gen_ports = [int(url.rsplit(":", 1)[1]) for url in gen_urls]

200-216: Add request timeouts to health check to prevent hangs

requests.get() without a timeout can hang indefinitely. Add a timeout and catch ReadTimeout as well.

         start_time = time.time()
         while time.time() - start_time < 3600:
             time.sleep(1)
             try:
                 print("Checking health endpoint")
-                response = requests.get("http://localhost:8000/health")
+                response = requests.get("http://localhost:8000/health", timeout=5)
                 if response.status_code == 200:
                     break
-            except requests.exceptions.ConnectionError:
+            except (requests.exceptions.ConnectionError, requests.exceptions.ReadTimeout):
                 continue

286-292: Prefer skip over fail for capacity issues and fix E501 long line

Capacity scarcity should skip the test (consistent with other tests), not fail the suite. Also rewrap the long message to satisfy line-length linting.

-    if total_ctx_gpus + total_gen_gpus > get_device_count():
-        pytest.fail(
-            f"Not enough devices for {ctx_instances} ctx instances (ctx_pp={ctx_pp}*ctx_tp={ctx_tp}) + {gen_instances} gen instances (gen_pp={gen_pp}*gen_tp={gen_tp}), total: {total_ctx_gpus + total_gen_gpus}"
-        )
+    if total_ctx_gpus + total_gen_gpus > get_device_count():
+        msg = (
+            "Not enough devices for "
+            f"{ctx_instances} ctx instances (ctx_pp={ctx_pp}*ctx_tp={ctx_tp}) "
+            f"+ {gen_instances} gen instances (gen_pp={gen_pp}*gen_tp={gen_tp}), "
+            f"total: {total_ctx_gpus + total_gen_gpus}"
+        )
+        pytest.skip(msg)
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 28a4732 and 0e0eeed.

📒 Files selected for processing (7)
  • tensorrt_llm/executor/proxy.py (1 hunks)
  • tensorrt_llm/executor/worker.py (2 hunks)
  • tests/integration/defs/accuracy/test_disaggregated_serving.py (7 hunks)
  • tests/integration/test_lists/qa/llm_function_full.txt (4 hunks)
  • tests/integration/test_lists/test-db/l0_dgx_h100.yml (1 hunks)
  • tests/integration/test_lists/waives.txt (1 hunks)
  • tests/unittest/llmapi/test_llm_pytorch.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • tensorrt_llm/executor/worker.py
  • tensorrt_llm/executor/proxy.py
  • tests/integration/test_lists/test-db/l0_dgx_h100.yml
  • tests/integration/test_lists/qa/llm_function_full.txt
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tests/integration/defs/accuracy/test_disaggregated_serving.py
  • tests/unittest/llmapi/test_llm_pytorch.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tests/integration/defs/accuracy/test_disaggregated_serving.py
  • tests/unittest/llmapi/test_llm_pytorch.py
🧠 Learnings (3)
📚 Learning: 2025-07-22T08:33:49.109Z
Learnt from: yiqingy0
PR: NVIDIA/TensorRT-LLM#5198
File: jenkins/mergeWaiveList.py:0-0
Timestamp: 2025-07-22T08:33:49.109Z
Learning: In the TensorRT-LLM waive list merging system, removed lines are always located at the end of the merge waive lists, which is why the mergeWaiveList.py script uses reverse traversal - it's an optimization for this specific domain constraint.

Applied to files:

  • tests/integration/test_lists/waives.txt
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • tests/integration/test_lists/waives.txt
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/integration/test_lists/waives.txt
  • tests/unittest/llmapi/test_llm_pytorch.py
🧬 Code Graph Analysis (2)
tests/integration/defs/accuracy/test_disaggregated_serving.py (2)
tests/integration/defs/trt_test_alternative.py (1)
  • popen (179-198)
tests/integration/defs/conftest.py (1)
  • get_device_count (1941-1943)
tests/unittest/llmapi/test_llm_pytorch.py (1)
tensorrt_llm/executor/worker.py (1)
  • GenerationExecutorWorker (50-626)
🪛 Ruff (0.12.2)
tests/integration/defs/accuracy/test_disaggregated_serving.py

290-290: Line too long (209 > 120)

(E501)

tests/unittest/llmapi/test_llm_pytorch.py

856-856: Local variable llm is assigned to but never used

Remove assignment to unused variable llm

(F841)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check

@tensorrt-cicd
Collaborator

PR_Github #15949 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #15949 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11986 completed with status: 'FAILURE'

@dominicshanshan
Collaborator Author

/bot run --disable-fail-fast

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
tensorrt_llm/executor/worker.py (1)

772-779: Avoid sending Exception instances over IPC; send a simple, reliably serializable status tuple instead

IpcQueue.put serializes with pickle; many Exception instances (esp. custom ones with non-serializable attrs) can fail pickling, causing initialization handshakes to break. You already carry the full traceback string; make the first element a simple sentinel (e.g., None or "ERROR") to eliminate this risk.

Apply:

-            worker_init_status_queue.put((e, traceback.format_exc()))
+            # Use a simple sentinel to avoid pickling Exceptions across IPC
+            worker_init_status_queue.put((None, traceback.format_exc()))

If there are any other error-path puts to worker_init_status_queue, make the same change there for consistency.
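
The failure mode being avoided is easy to reproduce: any exception that carries a non-picklable attribute cannot cross a pickle-based queue. A small illustration; the exception class is made up:

import pickle
import threading

class CustomInitError(RuntimeError):
    """Hypothetical error that captures a live lock object."""

    def __init__(self, message, lock):
        super().__init__(message)
        self.lock = lock  # threading locks cannot be pickled

err = CustomInitError("worker init failed", threading.Lock())

try:
    pickle.dumps(err)
except TypeError as exc:
    print(f"pickling failed: {exc}")  # e.g. "cannot pickle '_thread.lock' object"

# A plain sentinel plus the formatted traceback string always serializes cleanly.
payload = (None, "Traceback (most recent call last): ...")
assert pickle.loads(pickle.dumps(payload)) == payload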

tests/integration/defs/accuracy/test_disaggregated_serving.py (3)

53-56: Fix Python 3.8 compatibility: list[...] generics are 3.9+ only

The annotation list[concurrent.futures.Future[RequestOutput]] relies on built-in generics, which Python 3.8 does not support. Use typing.List[...] and quote the Future parameterization so the annotation stays 3.8-safe.

Apply these changes:

-        self.futures: list[concurrent.futures.Future[RequestOutput]] = []
+        self.futures: List["concurrent.futures.Future[RequestOutput]"] = []

No new import is needed (see next comment).


13-13: No additional typing import is required

Note that typing does not export Future, so from typing import Future would fail with an ImportError. With the quoted annotation above, the existing import already suffices:

from typing import Any, Dict, List, Optional

894-907: Add required ctx_instances and gen_instances to test_mixed_ctx_gen_model call

The test_mixed_ctx_gen_model invocation is missing the two new mandatory parameters—ctx_instances and gen_instances—and will raise a TypeError. Update the call to include them (using 1 for each to match existing defaults) before test_sets:

--- a/tests/integration/defs/accuracy/test_disaggregated_serving.py
+++ b/tests/integration/defs/accuracy/test_disaggregated_serving.py
@@ -896,13 +896,18 @@ class TestDisaggregatedServing:
     @pytest.mark.skip_less_device(4)
     @pytest.mark.parametrize("ctx_pp,gen_tp", [(2, 2)], ids=["ctxpp2gentp2"])
     def test_mixed_ctx_gen_model(self, ctx_pp, gen_tp):
         ctx_model = self.FP4_MODEL
         gen_model = self.FP8_MODEL
-        return run_parallel_test("Qwen3/Qwen3-30B-A3B",
-                                 ctx_model,
-                                 ctx_pp=ctx_pp,
-                                 ctx_tp=1,
-                                 gen_pp=1,
-                                 gen_tp=gen_tp,
-                                 test_sets=[GSM8K, MMLU],
-                                 ctx_model=ctx_model,
-                                 gen_model=gen_model)
+        return run_parallel_test(
+            "Qwen3/Qwen3-30B-A3B",
+            ctx_model,
+            ctx_pp=ctx_pp,
+            ctx_tp=1,
+            gen_pp=1,
+            gen_tp=gen_tp,
+            ctx_instances=1,
+            gen_instances=1,
+            test_sets=[GSM8K, MMLU],
+            ctx_model=ctx_model,
+            gen_model=gen_model,
+        )

This aligns the call with the updated signature:

def run_parallel_test(model_name, model_path,
                      ctx_pp, ctx_tp, gen_pp, gen_tp,
                      ctx_instances, gen_instances,
                      test_sets, ctx_model=None, gen_model=None):
    ...

No other call sites were missing these arguments.

♻️ Duplicate comments (3)
tests/unittest/llmapi/test_llm_pytorch.py (1)

841-859: Harden the failure injection: patch at the LLM import site and raise directly; remove unused assignment (Ruff F841)

  • Patch the symbol where LLM actually looks it up (tensorrt_llm.llmapi.llm.GenerationExecutor.create) to guarantee the monkeypatch takes effect across import indirections.
  • No need to construct a failing subclass; a side_effect that raises RuntimeError is simpler and decouples the test from production internals.
  • Remove the unused llm variable to satisfy Ruff F841.

Apply:

-    with patch('tensorrt_llm.executor.executor.GenerationExecutor.create',
-               side_effect=lambda *args, **kwargs: FailingExecutorWorker(
-                   *args, **kwargs)):
-        with pytest.raises(
-                RuntimeError,
-                match="Mock GenerationExecutorWorker initialization failed"):
-            llm = LLM(model=llama_model_path,
-                      kv_cache_config=global_kvcache_config)
+    with patch('tensorrt_llm.llmapi.llm.GenerationExecutor.create',
+               side_effect=RuntimeError(
+                   "Mock GenerationExecutorWorker initialization failed")):
+        with pytest.raises(
+                RuntimeError,
+                match="Mock GenerationExecutorWorker initialization failed"):
+            LLM(model=llama_model_path, kv_cache_config=global_kvcache_config)
tests/integration/defs/accuracy/test_disaggregated_serving.py (2)

200-207: Health-check loop can hang indefinitely without request timeouts

A prior review flagged this and it was added in #6660; current code dropped the timeout. requests.get() without timeout may hang despite the 3600s outer loop.

Apply:

-                response = requests.get("http://localhost:8000/health")
+                response = requests.get("http://localhost:8000/health", timeout=(2, 5))
                 if response.status_code == 200:
                     break
-            except requests.exceptions.ConnectionError:
+            except (requests.exceptions.ConnectionError, requests.exceptions.Timeout):
                 continue

281-292: Ruff E501: overlong pytest.fail(...) message will fail CI

Line 290 is 209 chars (>120). Rewrap to satisfy the linter. Keeping fail behavior unchanged here; see next comment for an alternative using skip.

-    if total_ctx_gpus + total_gen_gpus > get_device_count():
-        pytest.fail(
-            f"Not enough devices for {ctx_instances} ctx instances (ctx_pp={ctx_pp}*ctx_tp={ctx_tp}) + {gen_instances} gen instances (gen_pp={gen_pp}*gen_tp={gen_tp}), total: {total_ctx_gpus + total_gen_gpus}"
-        )
+    if total_ctx_gpus + total_gen_gpus > get_device_count():
+        msg = (
+            "Not enough devices for "
+            f"{ctx_instances} ctx instances (ctx_pp={ctx_pp}*ctx_tp={ctx_tp}) "
+            f"+ {gen_instances} gen instances (gen_pp={gen_pp}*gen_tp={gen_tp}), "
+            f"total: {total_ctx_gpus + total_gen_gpus}"
+        )
+        pytest.fail(msg)
🧹 Nitpick comments (10)
tensorrt_llm/executor/worker.py (1)

1-1: Add NVIDIA copyright header (policy compliance)

This source file lacks the required NVIDIA header. Please add it (current year) above imports.

Apply:

+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
tests/unittest/llmapi/test_llm_pytorch.py (3)

824-839: Remove test-only subclass and helper; they’re unnecessary after switching to a direct side_effect

The FailingExecutorWorker subclass and dynamic FailingExecutor type add coupling to production code and import weight. After adopting a direct side_effect that raises, these can be removed.

Apply:

-class FailingExecutorWorker(GenerationExecutorWorker):
-    """Mock worker that fails during initialization to test error handling."""
-
-    def __init__(self, *args, **kwargs):
-        # Simulate a constructor failure
-        raise RuntimeError(
-            "Mock GenerationExecutorWorker initialization failed")
-
-
-FailingExecutor = type(
-    "FailingExecutor", (), {
-        "create":
-        classmethod(
-            lambda cls, *args, **kwargs: FailingExecutorWorker(*args, **kwargs))
-    })
+# Removed: use direct side_effect to raise instead of subclassing the worker.

7-7: Drop unused import after removing the test-only subclass

GenerationExecutorWorker import becomes unused once the subclass is deleted.

Apply:

-from tensorrt_llm.executor import GenerationExecutorWorker

1-3: Add NVIDIA copyright header (policy compliance)

Tests are also subject to the header requirement.

Apply:

+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
tests/integration/defs/accuracy/test_disaggregated_serving.py (6)

1-1: Add required NVIDIA copyright header (2025) at file top

Repo guidelines mandate an NVIDIA copyright header on all Python sources. Add it above the current comments to avoid license/pre-commit failures.

+ # Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
+ # Licensed under the NVIDIA Software License (or the project’s declared license). See LICENSE for details.

132-137: Make URL-to-port parsing robust (handles extra colons/IPv6)

url.split(":")[1] breaks if the host contains additional colons (e.g., IPv6) or if the URL includes a scheme. Prefer rsplit(":", 1).

-    ctx_ports = [int(url.split(":")[1]) for url in ctx_urls]
-    gen_ports = [int(url.split(":")[1]) for url in gen_urls]
+    ctx_ports = [int(url.rsplit(":", 1)[1]) for url in ctx_urls]
+    gen_ports = [int(url.rsplit(":", 1)[1]) for url in gen_urls]

138-180: GPU allocation logic looks correct; consider a safety check for direct callers

Contiguous allocation with an offset per-server is fine. Since launch_disaggregated_llm is also called directly by tests in this file (bypassing run_parallel_test’s device-count guard), consider asserting that current_gpu_offset never exceeds get_device_count() to fail fast with a clear error when misconfigured.

@@
-    current_gpu_offset = 0
+    current_gpu_offset = 0
+    total_needed = (len(ctx_ports) * ctx_total_gpus) + (len(gen_ports) * gen_total_gpus)
+    if total_needed > get_device_count():
+        raise RuntimeError(
+            f"Disaggregated serving requires {total_needed} GPUs "
+            f"(ctx={len(ctx_ports)}*{ctx_total_gpus}, gen={len(gen_ports)}*{gen_total_gpus}), "
+            f"but only {get_device_count()} available."
+        )

281-292: Consider pytest.skip(...) instead of failure on scarce devices

Other tests skip when capacity is insufficient; aligning behavior avoids turning lab capacity constraints into failures.

-        pytest.fail(msg)
+        # pytest.skip(msg)
+        pytest.fail(msg)  # keep failure if this is intended; otherwise prefer skip

591-596: New multi-instance test is a good coverage addition

Covers the 2x2 instance case. Consider marking with a higher device/memory skip if it flakes in CI, but fine to start with 4-device skip.


1-5: Replace top-of-file narrative comments with a proper module docstring

Tests benefit from a succinct docstring over free-form comments. Optional, but improves consistency.

-# I want to create accuracy tests for disaggregated serving.
-# I need to to this by creating a new class that mimics LLM class. Instead of implementing the
-# actual methods it will send OAI requests to the disaggregated serving endpoint.
-# Please take a look at the existing test_llm_api_pytorch.py file for reference.
+"""Accuracy tests for disaggregated serving using an OpenAI-compatible client.
+
+These tests launch per-instance context and generation servers plus a
+disaggregated router, then evaluate standard accuracy suites (GSM8K, MMLU, etc.).
+"""
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 0e0eeed and a69120d.

📒 Files selected for processing (7)
  • tensorrt_llm/executor/proxy.py (1 hunks)
  • tensorrt_llm/executor/worker.py (2 hunks)
  • tests/integration/defs/accuracy/test_disaggregated_serving.py (6 hunks)
  • tests/integration/test_lists/qa/llm_function_full.txt (4 hunks)
  • tests/integration/test_lists/test-db/l0_dgx_h100.yml (1 hunks)
  • tests/integration/test_lists/waives.txt (1 hunks)
  • tests/unittest/llmapi/test_llm_pytorch.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • tests/integration/test_lists/test-db/l0_dgx_h100.yml
  • tensorrt_llm/executor/proxy.py
  • tests/integration/test_lists/waives.txt
  • tests/integration/test_lists/qa/llm_function_full.txt
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/executor/worker.py
  • tests/integration/defs/accuracy/test_disaggregated_serving.py
  • tests/unittest/llmapi/test_llm_pytorch.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • tensorrt_llm/executor/worker.py
  • tests/integration/defs/accuracy/test_disaggregated_serving.py
  • tests/unittest/llmapi/test_llm_pytorch.py
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/unittest/llmapi/test_llm_pytorch.py
🧬 Code Graph Analysis (2)
tensorrt_llm/executor/worker.py (2)
tensorrt_llm/executor/utils.py (1)
  • put (119-120)
tensorrt_llm/executor/ipc.py (2)
  • put (116-126)
  • put (270-276)
tests/unittest/llmapi/test_llm_pytorch.py (2)
tensorrt_llm/executor/worker.py (1)
  • GenerationExecutorWorker (50-626)
tensorrt_llm/llmapi/llm.py (1)
  • LLM (1090-1106)
🪛 Ruff (0.12.2)
tests/integration/defs/accuracy/test_disaggregated_serving.py

290-290: Line too long (209 > 120)

(E501)

tests/unittest/llmapi/test_llm_pytorch.py

856-856: Local variable llm is assigned to but never used

Remove assignment to unused variable llm

(F841)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (9)
tensorrt_llm/executor/worker.py (2)

792-797: 2-tuple init-status protocol change looks good; verify consumer side strictly unpacks (signal, error_trace)

On success you now send (ready_signal, None). Ensure the proxy/consumer relies exclusively on the tuple form, never expects a bare object, and that no producer still sends the old format. Also confirm ready_signal is set to a constant (e.g., "READY") and compared by value, not truthiness, to avoid ambiguity if None is used in the error case.

Would you like me to add a unit test that mocks worker_init_status_queue.get() to validate the new protocol and error logging path end-to-end?
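
To make the expected consumer behavior concrete, a hedged sketch of the proxy-side handshake under the (signal, error_trace) protocol described above; the queue object, the READY constant, and the error type are placeholders:

import queue

READY_SIGNAL = "READY"  # placeholder for whatever constant the worker actually sends

def wait_for_worker_init(worker_init_status_queue):
    """Unpack the 2-tuple handshake and fail loudly if the worker reported an error."""
    signal, error_trace = worker_init_status_queue.get()
    if signal == READY_SIGNAL and error_trace is None:
        return  # worker came up cleanly
    raise RuntimeError(f"Executor worker failed during initialization:\n{error_trace}")

# Minimal usage with a stand-in queue.
q = queue.Queue()
q.put((READY_SIGNAL, None))
wait_for_worker_init(q)  # returns without raising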


731-734: Typing uses PEP 585 built-in generics; ensure Py3.8 compatibility or gate with future annotations

This line (and others like dict[...]) uses built-in generics (tuple[str, ...], dict[...]), which require Python 3.9+ unless you enable postponed evaluation. Given the repo targets Python 3.8+, either:

  • add from __future__ import annotations at the top of the file, or
  • switch to typing.Tuple / typing.Dict.

Apply one of:

+from __future__ import annotations

or:

-    proxy_result_queue: tuple[str, Optional[bytes]] = worker_queues.result_queue_addr
+    proxy_result_queue: Tuple[str, Optional[bytes]] = worker_queues.result_queue_addr

If the project minimum is actually 3.9+, please ignore.

tests/unittest/llmapi/test_llm_pytorch.py (1)

1-1: Add future annotations for Py3.8 compatibility (built-in generics used elsewhere in this file)

This test file uses list[int], etc. To keep compatibility with Python 3.8, add postponed evaluation.

Apply:

+from __future__ import annotations

If CI runs on Python ≥3.9 exclusively, this is optional.

tests/integration/defs/accuracy/test_disaggregated_serving.py (6)

181-199: Nice use of ExitStack for multi-process lifecycle

multi_popen correctly composes multiple process contexts and ensures cleanup on exceptions. This prevents orphaned processes. LGTM.


271-273: Correct DuckLLM construction (args, tokenizer, generate_async)

This fixes the earlier mismatch with the namedtuple definition. Good catch.


315-317: Port series generation is clear and deterministic

Even/odd split for ctx/gen avoids collisions under the same host. Looks good for test scaffolding.


323-329: disaggregated_server_config structure now matches multi-instance orchestration

The config wiring with num_instances and urls aligns with the new launcher. LGTM.


577-579: Updated call site to the new run_parallel_test signature

Passes ctx_instances and gen_instances as positional arguments. Good.


580-590: Asymmetric test updated correctly; consistent with new signature

Named args + instances = 1 keep prior behavior. LGTM.

@tensorrt-cicd
Collaborator

PR_Github #15964 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #15964 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #11997 completed with status: 'FAILURE'

@dominicshanshan
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #16033 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #16033 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12054 completed with status: 'FAILURE'

@dominicshanshan
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #16062 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #16062 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #12077 completed with status: 'SUCCESS'

@joyang-nv joyang-nv merged commit 6f245ec into NVIDIA:main Aug 22, 2025
5 checks passed