[None][chore] Mass integration of release/1.0 #6864
Conversation
📝 Walkthrough

Adjusts CUDA workspace memset base, standardizes executor init payloads and error logging, temporarily forces INFO during LLM init and logs the backend, adopts KvCacheConfig/FP8 test paths, adds transformer-structure compatibility, propagates per-submodel quant_config for Gemma, and expands integration/unit tests and test manifests.
Sequence Diagram(s)

sequenceDiagram
autonumber
actor WorkerProcess
participant Worker as Executor Worker
participant Queue as worker_init_status_queue
participant Proxy as Executor Proxy
participant MPI as MPI Session
Worker->>Queue: put (ready_signal, error_trace)
Proxy->>Queue: get()
Queue-->>Proxy: (ready_signal, error_trace)
alt ready_signal == READY
Proxy-->>WorkerProcess: proceed
else
Proxy->>Proxy: logger.error(error_trace)
Proxy->>MPI: Abort()
Proxy-->>WorkerProcess: raise RuntimeError
end
sequenceDiagram
autonumber
actor Caller
participant LLM as BaseLLM.__init__
participant Logger
Caller->>LLM: construct(...)
LLM->>Logger: save current level
LLM->>Logger: set level INFO
alt backend == "pytorch"
LLM->>Logger: log "Using LLM with PyTorch backend"
else backend == "_autodeploy"
LLM->>Logger: log "Using LLM with AutoDeploy backend"
else
LLM->>Logger: log "Using LLM with TensorRT backend"
end
LLM->>Logger: restore original level (finally)
LLM-->>Caller: return / raise
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~30 minutes
/bot run --disable-fail-fast

/bot kill

PR_Github #15121 [ run ] triggered by Bot

PR_Github #15122 [ ] completed with state
Actionable comments posted: 11
🔭 Outside diff range comments (2)
tensorrt_llm/_torch/models/modeling_gemma3vl.py (1)
20-27
Avoid shadowing ModelConfig with two different imports

ModelConfig is imported twice from different modules (..model_config and .modeling_utils). This shadowing is error-prone and can cause subtle typing/runtime issues if the two definitions diverge. Import ModelConfig from a single source and only import filter_weights/register_auto_model from modeling_utils.
Apply this diff to avoid the shadowing:
-from ..model_config import ModelConfig
+from ..model_config import ModelConfig
@@
-from .modeling_utils import ModelConfig, filter_weights, register_auto_model
+from .modeling_utils import filter_weights, register_auto_model

tests/integration/test_lists/qa/llm_function_full.txt (1)
713-719
Duplicate test entry detected

The test llmapi/test_llm_api_qa.py::TestLlmDefaultBackend::test_llm_args_logging appears twice (lines 715 and 719). This duplication will cause the test to run multiple times unnecessarily. Remove the duplicate entry:

# llm-api promote pytorch to default
llmapi/test_llm_api_qa.py::TestLlmDefaultBackend::test_llm_args_logging
llmapi/test_llm_api_qa.py::TestLlmDefaultBackend::test_llm_args_type_tensorrt
llmapi/test_llm_api_qa.py::TestLlmDefaultBackend::test_llm_args_type_default
-llmapi/test_llm_api_qa.py::TestLlmDefaultBackend::test_llm_args_logging
🧹 Nitpick comments (19)
cpp/tensorrt_llm/kernels/topkLastDim.cu (1)
1278-1279
Nit: clarify intent and guard zero-byte memset

Add a brief comment about which regions are zero-initialized, and skip cudaMemsetAsync when the computed byte count is zero. This documents the contract and avoids an unnecessary runtime call for edge cases (e.g., batch_size == 0).
- cudaMemsetAsync(aligned_pointers[0], 0,
-     static_cast<char*>(aligned_pointers[2]) - static_cast<char*>(aligned_pointers[0]), stream);
+ // Zero-initialize counters and histograms only (aligned region [aligned_pointers[0], aligned_pointers[2])).
+ const auto zeroBytes = static_cast<size_t>(
+     static_cast<char*>(aligned_pointers[2]) - static_cast<char*>(aligned_pointers[0]));
+ if (zeroBytes > 0)
+ {
+     cudaMemsetAsync(aligned_pointers[0], 0, zeroBytes, stream);
+ }

tensorrt_llm/executor/worker.py (2)
1-1
Add NVIDIA copyright header

Per repo guidelines, prepend the current-year NVIDIA copyright header to all source files.
Apply at file top:
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
777-778
Avoid sending raw Exception objects over IPC; add a safe fallback

Transmitting Exception instances via ZeroMQ/pickle can fail (custom exceptions may be unpicklable). Since the proxy only logs the traceback and uses the exception as a cause, add a fallback to send a picklable payload if the first send fails.
Apply:
- worker_init_status_queue.put((e, traceback.format_exc()))
+ try:
+     worker_init_status_queue.put((e, traceback.format_exc()))
+ except Exception:
+     # Fallback to a picklable payload (no exception instance)
+     worker_init_status_queue.put((repr(e), traceback.format_exc()))

Follow-up: see the proxy-side defensive unpacking/logging suggestion to handle both Exception and str causes.
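As an aside, here is a minimal, self-contained sketch (not code from this PR; CustomInitError and make_payload are hypothetical names) of the failure mode this fallback guards against: an exception whose extra constructor argument breaks the pickle round-trip that an IPC queue performs.

import pickle
import traceback


class CustomInitError(Exception):
    """Hypothetical exception whose extra constructor argument breaks unpickling."""

    def __init__(self, code, message):
        super().__init__(message)  # args becomes (message,), so `code` is lost on reconstruction
        self.code = code


def make_payload(exc):
    """Prefer the exception object; fall back to repr() if it cannot survive pickling."""
    try:
        pickle.loads(pickle.dumps(exc))  # same round-trip an IPC queue would perform
        return exc, traceback.format_exc()
    except Exception:
        return repr(exc), traceback.format_exc()


try:
    raise CustomInitError(3, "worker failed")
except CustomInitError as err:
    payload = make_payload(err)  # ends up as (repr string, formatted traceback)

The round-trip check stands in for what happens when the queue serializes the object; the suggested worker change simply retries the put with repr(e) after the first attempt raises.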
tensorrt_llm/executor/proxy.py (1)
1-1
Add NVIDIA copyright header

Per repo guidelines, prepend the current-year NVIDIA copyright header to all source files.
Apply at file top:
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
tests/unittest/llmapi/test_llm_pytorch.py (3)
1-2
Add NVIDIA copyright header

Per repo guidelines, prepend the current-year NVIDIA copyright header to all source files.
Apply at file top:
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
821-827
Remove unused FailingExecutor type

The dynamically created FailingExecutor is unused. Simplify by removing it.
Apply:
-FailingExecutor = type(
-    "FailingExecutor", (), {
-        "create":
-        classmethod(
-            lambda cls, *args, **kwargs: FailingExecutorWorker(*args, **kwargs))
-    })
839-847
: Fix lint (F841) and simplify the patching
- Avoid assigning to an unused variable.
- You can raise directly via side_effect without constructing FailingExecutorWorker, which reduces coupling to the worker class.
Apply:
- with patch('tensorrt_llm.executor.executor.GenerationExecutor.create',
-            side_effect=lambda *args, **kwargs: FailingExecutorWorker(
-                *args, **kwargs)):
+ with patch('tensorrt_llm.executor.executor.GenerationExecutor.create',
+            side_effect=RuntimeError(
+                "Mock GenerationExecutorWorker initialization failed")):
      with pytest.raises(
              RuntimeError,
              match="Mock GenerationExecutorWorker initialization failed"):
-         llm = LLM(model=llama_model_path,
-                   kv_cache_config=global_kvcache_config)
+         LLM(model=llama_model_path, kv_cache_config=global_kvcache_config)

tensorrt_llm/tools/multimodal_builder.py (1)
1195-1201
Consider improving the version detection logic.

The current check relies on attribute existence (hasattr), which may not be the most robust approach for detecting transformers version differences. Consider checking the transformers version directly for more explicit version handling.

- # Check if the model structure is updated to transformers >= 4.52.0
- if hasattr(model, 'model') and hasattr(model.model, 'vision_model'):
+ # Check transformers version for model structure compatibility
+ import transformers
+ if hasattr(transformers, '__version__') and transformers.__version__ >= '4.52.0':
      vision_model = model.model.vision_model
      multi_modal_projector = model.model.multi_modal_projector
  else:
-     # transformers < 4.52.0
+     # Fallback for transformers < 4.52.0
      vision_model = model.vision_model
      multi_modal_projector = model.multi_modal_projector

However, the current approach using hasattr is actually more defensive and will work even if the version detection fails, so both approaches have merit.

triton_backend/inflight_batcher_llm/scripts/build.sh (1)
56-58
LLM_ROOT resolution via BUILD_DIR is fine; apply minor hardening to avoid re-run failures.

Two small robustness tweaks:
- Line 43 uses mkdir without -p; re-runs will fail if the directory exists.
- Consider set -euo pipefail at the top to fail fast on errors.
Outside this hunk, suggested changes:
# At the top (after #!/bin/bash)
set -euo pipefail

# Around current line 43
mkdir -p "$BUILD_DIR"

docker/Makefile (1)
4-5
More robust ARG extraction for TRITON_* (handle leading whitespace)

Your grep-based extraction works if ARG is at column 0. To make it resilient to indentation, prefer awk with a whitespace-tolerant pattern.
Apply:
-TRITON_IMAGE ?= $(shell grep '^ARG TRITON_IMAGE=' Dockerfile.multi | grep -o '=.*' | tr -d '="')
-TRITON_BASE_TAG ?= $(shell grep '^ARG TRITON_BASE_TAG=' Dockerfile.multi | grep -o '=.*' | tr -d '="')
+TRITON_IMAGE ?= $(shell awk -F= '/^[[:space:]]*ARG[[:space:]]+TRITON_IMAGE=/ {gsub(/[ "\t]/,"",$2); print $$2}' Dockerfile.multi)
+TRITON_BASE_TAG ?= $(shell awk -F= '/^[[:space:]]*ARG[[:space:]]+TRITON_BASE_TAG=/ {gsub(/[ "\t]/,"",$2); print $$2}' Dockerfile.multi)

jenkins/BuildDockerImage.groovy (1)
344-351
Use the retry wrapper for the custom tag build for consistency and resilience.

All other builds use llmExecStepWithRetry; mirror that here.
- sh """ - cd ${LLM_ROOT} && make -C docker ${target}_${action} \ - BASE_IMAGE=${BASE_IMAGE} \ - TRITON_IMAGE=${TRITON_IMAGE} \ - TORCH_INSTALL_TYPE=${torchInstallType} \ - IMAGE_WITH_TAG=${customImageWithTag} \ - STAGE=${dockerfileStage} \ - BUILD_WHEEL_OPTS='-j ${build_jobs}' ${args} - """ + trtllm_utils.llmExecStepWithRetry(this, script: """ + cd ${LLM_ROOT} && make -C docker ${target}_${action} \ + BASE_IMAGE=${BASE_IMAGE} \ + TRITON_IMAGE=${TRITON_IMAGE} \ + TORCH_INSTALL_TYPE=${torchInstallType} \ + IMAGE_WITH_TAG=${customImageWithTag} \ + STAGE=${dockerfileStage} \ + BUILD_WHEEL_OPTS='-j ${build_jobs}' ${args} + """, sleepInSecs: 300, shortCommondRunTimeMax: 7200)tests/integration/defs/accuracy/test_llm_api_pytorch.py (1)
816-834
: New Gemma3 27B FP8 prequantized test is correct and well-scoped
- Disables KV-cache reuse to account for Gemma3’s non-inclusive sliding window support gaps.
- Forces FLASHINFER backend for custom mask support.
- Asserts QuantAlgo.FP8 and hits CNN Dailymail, GSM8K, and MMLU: good task coverage.
One nit to fix: the long inline comment on Line 817 exceeds the 120-char limit flagged by Ruff. Suggest wrapping it across lines.
Apply this diff to satisfy E501:
- # Disabling kv cache reuse as a WAR to deal with gaps in kernel support for Gemma3's non-inclusive sliding window size.
+ # Disabling kv cache reuse as a WAR to deal with gaps in kernel support
+ # for Gemma3's non-inclusive sliding window size.

tests/integration/defs/test_e2e.py (2)
2440-2542
: New 2-GPU multimodal quickstart test is well put together; fix long assert line
- Uses tp_size=2 and adapts model-specific flags (FLASHINFER + KV-reuse off for Gemma; LoRA + short rope for Phi-4-MM).
- Validates output via parse_output and keyword ratios. Good approach.
Address Ruff’s E501 on the long assertion message by splitting it across lines.
Apply this diff to break the long assert message:
- assert obs_match_ratio >= match_ratio, f"Incorrect output!\nGenerated \"{prompt_output}\"\nExpected keywords \"{prompt_keywords}\"\n Matched keywords: {matches}\n Observed match ratio {obs_match_ratio} below threshold {match_ratio}"
+ assert obs_match_ratio >= match_ratio, (
+     "Incorrect output!\n"
+     f"Generated \"{prompt_output}\"\n"
+     f"Expected keywords \"{prompt_keywords}\"\n"
+     f"Matched keywords: {matches}\n"
+     f"Observed match ratio {obs_match_ratio} below threshold {match_ratio}"
+ )
1-1
Update copyright year to include 2025

Per coding guidelines, use the current year in headers for modified source files.
Apply this diff:
-# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.

tests/integration/defs/llmapi/test_llm_api_qa.py (3)
1-1
Consider removing or expanding the non-standard comment

The comment "Confirm that the default backend is changed" doesn't follow standard Python documentation conventions and lacks context about what default backend is being changed.
Either remove the comment or expand it to provide more context:
-# Confirm that the default backend is changed
+"""Test suite to verify that the default LLM backend is PyTorch in v1.0."""
12-14
Fix docstring formatting

The docstring doesn't follow Python conventions; it should either be a one-liner or a properly formatted multi-line docstring.
- """ - Check that the default backend is PyTorch for v1.0 breaking change - """ + """Check that the default backend is PyTorch for v1.0 breaking change."""
60-60
Consider extracting backend keywords as constants

The backend keywords are hardcoded as string literals in assertion messages. Consider defining them as class-level constants for better maintainability.
  class TestLlmDefaultBackend:
      """Check that the default backend is PyTorch for v1.0 breaking change."""
+
+     PYTORCH_BACKEND_LOG = "Using LLM with PyTorch backend"
+     TENSORRT_BACKEND_LOG = "Using LLM with TensorRT backend"

      def test_llm_args_type_default(self, llm_root, llm_venv):
          # ... existing code ...

      def test_llm_args_logging(self, llm_root, llm_venv):
          # ... existing code ...
-         assert "Using LLM with PyTorch backend" in pytorch_output, f"Expected 'pytorch' in logs, got: {pytorch_output}"
+         assert self.PYTORCH_BACKEND_LOG in pytorch_output, f"Expected 'pytorch' in logs, got: {pytorch_output}"
          # ... existing code ...
-         assert "Using LLM with TensorRT backend" in tensorrt_output, f"Expected 'tensorrt' in logs, got: {tensorrt_output}"
+         assert self.TENSORRT_BACKEND_LOG in tensorrt_output, f"Expected 'tensorrt' in logs, got: {tensorrt_output}"

Also applies to: 70-70
tests/integration/defs/accuracy/test_disaggregated_serving.py (2)
250-250
Break long line for better readability

The error message line exceeds the recommended line length limit.
Split the long error message across multiple lines:
- f"Not enough devices for {ctx_instances} ctx instances (ctx_pp={ctx_pp}*ctx_tp={ctx_tp}) + {gen_instances} gen instances (gen_pp={gen_pp}*gen_tp={gen_tp}), total: {total_ctx_gpus + total_gen_gpus}" + f"Not enough devices for {ctx_instances} ctx instances " + f"(ctx_pp={ctx_pp}*ctx_tp={ctx_tp}) + {gen_instances} gen instances " + f"(gen_pp={gen_pp}*gen_tp={gen_tp}), total: {total_ctx_gpus + total_gen_gpus}"
169-187
Consider extracting the multi_popen helper to a shared utility

The multi_popen context manager provides useful functionality for managing multiple subprocess lifecycles. Since similar patterns appear in test_disaggregated.py, consider extracting this to a shared test utility module to promote code reuse.

Consider creating a shared test utility module (e.g., tests/integration/defs/disaggregated_utils.py) and moving reusable functions like multi_popen there. This would reduce code duplication and make maintenance easier.

Would you like me to help create this shared utility module?
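If such a module were created, a minimal sketch of what it could hold is below; the module path and names are suggestions only, and it uses plain subprocess.Popen rather than the repo's popen wrapper.

# tests/integration/defs/disaggregated_utils.py (hypothetical location)
import contextlib
import subprocess


@contextlib.contextmanager
def multi_popen(server_configs):
    """Start one subprocess per (env, args) pair and terminate all of them on exit."""
    with contextlib.ExitStack() as stack:
        procs = []
        for env, args in server_configs:
            proc = subprocess.Popen(args, env=env)
            # Register cleanup immediately so earlier processes are torn down even if a later start fails.
            stack.callback(proc.terminate)
            procs.append(proc)
        yield procs

This mirrors the ExitStack-based simplification suggested later in this review; both test files could then import the same helper.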
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (31)
cpp/tensorrt_llm/kernels/topkLastDim.cu
(1 hunks)docker/Makefile
(3 hunks)examples/llm-api/star_attention.py
(3 hunks)jenkins/BuildDockerImage.groovy
(3 hunks)jenkins/L0_Test.groovy
(1 hunks)jenkins/current_image_tags.properties
(1 hunks)tensorrt_llm/_torch/models/modeling_gemma3vl.py
(1 hunks)tensorrt_llm/executor/proxy.py
(1 hunks)tensorrt_llm/executor/worker.py
(2 hunks)tensorrt_llm/llmapi/llm.py
(2 hunks)tensorrt_llm/serve/scripts/benchmark_serving.py
(1 hunks)tensorrt_llm/tools/multimodal_builder.py
(1 hunks)tests/integration/defs/accuracy/references/cnn_dailymail.yaml
(1 hunks)tests/integration/defs/accuracy/references/gsm8k.yaml
(1 hunks)tests/integration/defs/accuracy/references/mmlu.yaml
(1 hunks)tests/integration/defs/accuracy/test_disaggregated_serving.py
(6 hunks)tests/integration/defs/accuracy/test_llm_api_pytorch.py
(6 hunks)tests/integration/defs/disaggregated/test_disaggregated.py
(2 hunks)tests/integration/defs/llmapi/_run_llmapi_llm.py
(1 hunks)tests/integration/defs/llmapi/test_llm_api_qa.py
(1 hunks)tests/integration/defs/perf/pytorch_model_config.py
(1 hunks)tests/integration/defs/test_e2e.py
(1 hunks)tests/integration/test_lists/qa/llm_function_full.txt
(5 hunks)tests/integration/test_lists/qa/llm_function_sanity.txt
(2 hunks)tests/integration/test_lists/test-db/l0_dgx_b200.yml
(1 hunks)tests/integration/test_lists/test-db/l0_dgx_h100.yml
(3 hunks)tests/integration/test_lists/test-db/l0_h100.yml
(1 hunks)tests/integration/test_lists/waives.txt
(1 hunks)tests/unittest/llmapi/test_llm_pytorch.py
(2 hunks)tests/unittest/test_pip_install.py
(0 hunks)triton_backend/inflight_batcher_llm/scripts/build.sh
(1 hunks)
💤 Files with no reviewable changes (1)
- tests/unittest/test_pip_install.py
🧰 Additional context used
📓 Path-based instructions (5)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tensorrt_llm/serve/scripts/benchmark_serving.py
tensorrt_llm/tools/multimodal_builder.py
tensorrt_llm/executor/proxy.py
tensorrt_llm/executor/worker.py
tensorrt_llm/_torch/models/modeling_gemma3vl.py
tests/unittest/llmapi/test_llm_pytorch.py
tensorrt_llm/llmapi/llm.py
tests/integration/defs/perf/pytorch_model_config.py
tests/integration/defs/llmapi/_run_llmapi_llm.py
tests/integration/defs/test_e2e.py
tests/integration/defs/llmapi/test_llm_api_qa.py
tests/integration/defs/disaggregated/test_disaggregated.py
examples/llm-api/star_attention.py
tests/integration/defs/accuracy/test_disaggregated_serving.py
tests/integration/defs/accuracy/test_llm_api_pytorch.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tensorrt_llm/serve/scripts/benchmark_serving.py
tensorrt_llm/tools/multimodal_builder.py
tensorrt_llm/executor/proxy.py
tensorrt_llm/executor/worker.py
tensorrt_llm/_torch/models/modeling_gemma3vl.py
tests/unittest/llmapi/test_llm_pytorch.py
tensorrt_llm/llmapi/llm.py
cpp/tensorrt_llm/kernels/topkLastDim.cu
tests/integration/defs/perf/pytorch_model_config.py
tests/integration/defs/llmapi/_run_llmapi_llm.py
tests/integration/defs/test_e2e.py
tests/integration/defs/llmapi/test_llm_api_qa.py
tests/integration/defs/disaggregated/test_disaggregated.py
examples/llm-api/star_attention.py
tests/integration/defs/accuracy/test_disaggregated_serving.py
tests/integration/defs/accuracy/test_llm_api_pytorch.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...
Files:
cpp/tensorrt_llm/kernels/topkLastDim.cu
**/*.{cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu}
: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)
Files:
cpp/tensorrt_llm/kernels/topkLastDim.cu
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Parameter names must be consistent between declarations and definitions
Files:
cpp/tensorrt_llm/kernels/topkLastDim.cu
🧠 Learnings (2)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/integration/test_lists/test-db/l0_dgx_b200.yml
tests/unittest/llmapi/test_llm_pytorch.py
tests/integration/test_lists/qa/llm_function_sanity.txt
tests/integration/defs/llmapi/_run_llmapi_llm.py
tests/integration/test_lists/test-db/l0_h100.yml
tests/integration/defs/test_e2e.py
tests/integration/defs/llmapi/test_llm_api_qa.py
tests/integration/test_lists/test-db/l0_dgx_h100.yml
tests/integration/defs/accuracy/test_disaggregated_serving.py
tests/integration/test_lists/qa/llm_function_full.txt
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Applied to files:
tests/integration/test_lists/qa/llm_function_full.txt
🧬 Code Graph Analysis (10)
tensorrt_llm/executor/proxy.py (3)
tensorrt_llm/executor/ipc.py (2)
get
(157-176)get
(278-279)tensorrt_llm/logger.py (1)
error
(125-126)tensorrt_llm/executor/executor.py (1)
_handle_background_error
(244-273)
tensorrt_llm/executor/worker.py (2)
tensorrt_llm/executor/utils.py (1)
put
(119-120)tensorrt_llm/executor/ipc.py (2)
put
(116-126)put
(270-276)
tests/unittest/llmapi/test_llm_pytorch.py (3)
tensorrt_llm/executor/worker.py (1)
GenerationExecutorWorker
(50-626)tensorrt_llm/llmapi/llm.py (1)
LLM
(1087-1103)tensorrt_llm/_torch/llm.py (1)
LLM
(4-9)
tensorrt_llm/llmapi/llm.py (3)
tensorrt_llm/llmapi/llm_args.py (2)
TorchLlmArgs
(2006-2386)TrtLlmArgs
(1824-1962)tensorrt_llm/_torch/auto_deploy/llm_args.py (1)
LlmArgs
(224-316)tests/microbenchmarks/build_time_benchmark.py (1)
log_level
(145-148)
tests/integration/defs/llmapi/_run_llmapi_llm.py (3)
tensorrt_llm/llmapi/llm.py (1)
LLM
(1087-1103)tensorrt_llm/_torch/llm.py (1)
LLM
(4-9)tensorrt_llm/evaluate/json_mode_eval.py (1)
command
(125-138)
tests/integration/defs/test_e2e.py (4)
tests/integration/defs/conftest.py (3)
llm_root
(180-181)llm_models_root
(77-83)llm_venv
(707-723)tests/integration/defs/triton_server/conftest.py (1)
llm_models_root
(24-33)tests/integration/defs/common.py (1)
parse_output
(665-681)tests/integration/defs/examples/test_llama.py (1)
parse_output
(129-139)
tests/integration/defs/llmapi/test_llm_api_qa.py (3)
tests/integration/defs/conftest.py (1)
llm_venv
(707-723)tensorrt_llm/llmapi/llm.py (1)
LLM
(1087-1103)tensorrt_llm/llmapi/llm_args.py (1)
TorchLlmArgs
(2006-2386)
tests/integration/defs/disaggregated/test_disaggregated.py (2)
tests/integration/defs/conftest.py (5)
llm_models_root
(77-83)llm_root
(180-181)disaggregated_test_root
(2339-2344)disaggregated_example_root
(270-275)llm_venv
(707-723)tests/integration/defs/trt_test_alternative.py (3)
check_call
(230-238)check_output
(241-267)popen
(179-198)
tests/integration/defs/accuracy/test_disaggregated_serving.py (1)
tests/integration/defs/conftest.py (1)
get_device_count
(1941-1943)
tests/integration/defs/accuracy/test_llm_api_pytorch.py (4)
tensorrt_llm/llmapi/llm_args.py (2)
CudaGraphConfig
(106-163)KvCacheConfig
(929-1024)tests/integration/defs/conftest.py (1)
llm_models_root
(77-83)tensorrt_llm/llmapi/llm.py (1)
LLM
(1087-1103)tensorrt_llm/quantization/mode.py (1)
QuantAlgo
(23-46)
🪛 Ruff (0.12.2)
tests/unittest/llmapi/test_llm_pytorch.py
844-844: Local variable llm
is assigned to but never used
Remove assignment to unused variable llm
(F841)
tests/integration/defs/test_e2e.py
2529-2529: Line too long (240 > 120)
(E501)
tests/integration/defs/llmapi/test_llm_api_qa.py
12-13: One-line docstring should fit on one line
Reformat to one line
(D200)
12-13: First line should end with a period, question mark, or exclamation point
Add closing punctuation
(D415)
69-69: Line too long (123 > 120)
(E501)
tests/integration/defs/accuracy/test_disaggregated_serving.py
250-250: Line too long (209 > 120)
(E501)
tests/integration/defs/accuracy/test_llm_api_pytorch.py
817-817: Line too long (127 > 120)
(E501)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (36)
cpp/tensorrt_llm/kernels/topkLastDim.cu (1)
1278-1279
Aligned zero-init base is correct and safer

Zeroing from aligned_pointers[0] up to aligned_pointers[2] precisely clears counters and histograms at their aligned bases, avoiding any pre-alignment padding. This is the right fix for correctness and avoids UB risks from writing into potentially unaligned/slack regions.
tensorrt_llm/executor/worker.py (1)
795-796
Handshake tuple shape is clear and aligned with proxy

Pushing (ready_signal, None) on success is consistent and keeps the payload shape stable. No issues.
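For illustration only, a toy version of the (ready_signal, error_trace) handshake shown in the sequence diagram above, using an in-process queue instead of the real IPC queue (names simplified; this is not the actual worker/proxy code):

import traceback
from queue import Queue

READY = "READY"


def worker_main(status_queue):
    """Worker side: report (READY, None) on success, (exception, traceback) on failure."""
    try:
        pass  # engine setup would happen here
    except Exception as e:
        status_queue.put((e, traceback.format_exc()))
        return
    status_queue.put((READY, None))


def proxy_wait(status_queue):
    """Proxy side: proceed on READY, otherwise surface the worker's traceback."""
    ready_signal, error_trace = status_queue.get()
    if ready_signal != READY:
        print(error_trace)  # the real proxy logs this and aborts the MPI session
        raise RuntimeError("Executor worker initialization failed")


q = Queue()
worker_main(q)
proxy_wait(q)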
tests/unittest/llmapi/test_llm_pytorch.py (1)
812-819
Optional: Patch the worker's __init__ to cover the MPI handshake path

If the intent is to specifically exercise the proxy's new tuple handshake path, consider patching GenerationExecutorWorker.__init__ to raise, so the failure originates inside worker_main and traverses over IPC. The current patch may fail earlier depending on how GenerationExecutor.create is used.
I can generate a targeted test variant if you want to cover that path explicitly.
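A rough sketch of what such a variant might look like; the patch target, fixture names, and import path are assumptions based on the snippets quoted in this review, and a monkeypatch only reaches the worker when it runs in the same (or a forked) process, so treat this as illustrative rather than drop-in:

from unittest.mock import patch

import pytest

from tensorrt_llm.llmapi import LLM


def test_llm_surfaces_worker_init_failure(llama_model_path, global_kvcache_config):
    """llama_model_path / global_kvcache_config are assumed to be available as fixtures
    or module-level constants, as in the existing test file."""
    with patch(
            "tensorrt_llm.executor.worker.GenerationExecutorWorker.__init__",
            side_effect=RuntimeError(
                "Mock GenerationExecutorWorker initialization failed")):
        with pytest.raises(
                RuntimeError,
                match="Mock GenerationExecutorWorker initialization failed"):
            LLM(model=llama_model_path, kv_cache_config=global_kvcache_config)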
tensorrt_llm/tools/multimodal_builder.py (1)
1194-1203
: LGTM! Well-designed compatibility layer for transformers version differences.The implementation correctly handles the structural changes in the MLLaMA model between transformers versions by checking for the existence of the newer nested structure (
model.model.vision_model
) before falling back to the older direct structure (model.vision_model
). This ensures backward compatibility while supporting the latest transformers versions.jenkins/current_image_tags.properties (1)
14-17
: Tag updates LGTM and align with URM usage.The refreshed tags and corrected repository path for the SBSA image look good and consistent with the broader registry migration.
docker/Makefile (2)
85-87
: Passing TRITON_IMAGE/TRITON_BASE_TAG through to docker build is a solid improvement.This ensures Dockerfile.multi ARGs are honored during build; nice addition.
194-204
: Base image source unified to nvcr.io; consistent with internal mirror handling upstream.The base image overrides for rockylinux8 and ubuntu22 targets look correct and consistent with the Jenkins pipeline logic that rewrites to URM during builds.
tests/integration/defs/accuracy/references/cnn_dailymail.yaml (1)
8-10
: LGTM! Valid addition of FP8 quantization reference.The new entry adds FP8 quantization results for google/gemma-3-27b-it with both
quant_algo
andkv_cache_quant_algo
set to FP8, showing expected accuracy degradation (27.90 vs 28.90 baseline). The format and structure are consistent with other model entries in the file.tests/integration/defs/accuracy/references/mmlu.yaml (1)
119-121
: LGTM! Consistent FP8 quantization reference addition.The new FP8 quantization entry for google/gemma-3-27b-it follows the same pattern as other reference files, with expected accuracy impact (76.80 vs 77.80 baseline). This is part of a coordinated update across multiple accuracy reference files for the same model.
tests/integration/test_lists/waives.txt (1)
295-298
: LGTM! Proper test waiving with bug tracking.All four SKIP entries correctly reference bug tracking numbers and align with newly added tests in the codebase. The waived tests cover different areas (accuracy, disaggregated serving, benchmarking, and multimodal) with appropriate justification via bug references.
tensorrt_llm/serve/scripts/benchmark_serving.py (1)
593-594
: LGTM! Defensive programming improvement for metrics handling.The change from unconditional
results[k]
access to conditionalfor k in metrics if k in results
prevents KeyError exceptions when some metrics are missing from the results dictionary. This is particularly important for disaggregated/multi-backend scenarios where not all metrics may be available.tests/integration/defs/perf/pytorch_model_config.py (1)
162-169
: LGTM! Configuration structure modernization.The change from flat CUDA graph configuration keys to a nested
cuda_graph_config
structure improves configuration organization. Thebatch_sizes
values remain identical, ensuring no functional change while providing better structure alignment with other parts of the codebase that use nested CUDA graph configuration.tests/integration/defs/accuracy/references/gsm8k.yaml (3)
146-148
: FP8 entry for gemma-3-1b-it looks consistent with new test coverageAdding kv_cache_quant_algo: FP8 with updated accuracy is aligned with the FP8 prequantized tests introduced in this PR. No issues found.
151-153
: FP8 entry for gemma-3-27b-it is correct and matches new testsThe FP8 + FP8 KV entry and accuracy value correspond to the new FP8 prequantized test for 27B. Looks good.
156-158
: Ministral-8B-Instruct-2410 FP8 metadata extension is reasonableIncluding kv_cache_quant_algo: FP8 and updating accuracy is consistent with wider FP8 coverage. No concerns.
tensorrt_llm/_torch/models/modeling_gemma3vl.py (1)
197-206
: Per-submodel quant_config handling is correct (vision unquantized, text inherits)Explicitly setting quant_config=None for vision prevents vision weight-loading from interpreting quant params, while flowing the parent config to the text sub-model. Also, selecting FLASHINFER for text aligns with the need for custom masks. No issues.
tests/integration/defs/accuracy/test_llm_api_pytorch.py (5)
564-565
: Switch to cuda_graph_config is clear and future-proofThe conditional CudaGraphConfig vs None improves readability and centralizes CUDA graph settings. Good change.
588-589
: Same cuda_graph_config pattern here: LGTMConsistent with the refactor elsewhere.
709-710
: Consistent use of cuda_graph_config toggleKeeps tests uniform. No issues.
732-733
: Consistent cuda_graph_config toggle (FP4 path)Matches the FP8 path; good consistency.
868-869
: Expanded coverage for 1B FP8 prequantized: LGTMAdding GSM8K between CNN Dailymail and MMLU improves breadth without increasing config complexity.
tests/integration/test_lists/test-db/l0_h100.yml (1)
209-209
: Include new Gemma3 27B FP8 prequantized test in l0_h100 post-merge listThe added entry lines up with the new test in the suite. Placement after the 1B FP8 prequantized test is appropriate.
tests/integration/test_lists/qa/llm_function_sanity.txt (2)
49-49
: Test entry addition looks correct.The addition of
accuracy/test_llm_api_pytorch.py::TestMistralSmall24B::test_auto_dtype
follows the established format and is consistent with other test entries in the file.
114-115
: Test entries for nixl backend testing are properly structured.Both test entries follow the established format:
accuracy/test_disaggregated_serving.py::TestQwen3_8B::test_nixl_backend
accuracy/test_disaggregated_serving.py::TestDeepSeekV3Lite::test_nixl_backend
These align with the nixl backend functionality being tested as mentioned in the AI summary.
tests/integration/test_lists/test-db/l0_dgx_b200.yml (1)
82-87
: YAML test entries are correctly formatted and consistent.The six new test entries are properly structured:
- Four disaggregated benchmark tests with appropriate backend parameters
- Two nixl backend tests for disaggregated serving
All entries follow the established YAML format and are consistent with the test file structure.
tensorrt_llm/llmapi/llm.py (3)
125-127
Log level management implementation is solid

The code temporarily elevates the logger to "info" level to ensure backend selection messages are displayed, storing the original level for proper restoration.
130-140
Backend-specific logging is well implemented

The conditional logging clearly communicates which backend is being used:
- PyTorch backend: "Using LLM with PyTorch backend"
- AutoDeploy backend: "Using LLM with AutoDeploy backend"
- TensorRT backend (default): "Using LLM with TensorRT backend"
This provides valuable debugging information and aligns with the test assertions mentioned in the relevant code snippets.
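The underlying pattern is easy to see in isolation: save the level, force INFO, log the backend, restore in a finally block. A generic, self-contained sketch using the standard logging module (not the actual llm.py code):

import logging

logger = logging.getLogger("llm_init_demo")
logging.basicConfig()


def announce_backend(backend):
    """Temporarily force INFO so the backend message is visible, then restore the level."""
    original_level = logger.level
    logger.setLevel(logging.INFO)
    try:
        if backend == "pytorch":
            logger.info("Using LLM with PyTorch backend")
        elif backend == "_autodeploy":
            logger.info("Using LLM with AutoDeploy backend")
        else:
            logger.info("Using LLM with TensorRT backend")
        # ...the rest of initialization would run here...
    finally:
        logger.setLevel(original_level)


announce_backend("pytorch")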
169-171
: Proper exception handling with log level restoration.The
finally
block ensures the original log level is restored regardless of whether initialization succeeds or fails, preventing log level pollution across LLM instances.examples/llm-api/star_attention.py (3)
10-10
: Import change aligns with the new KvCacheConfig approach.The import of
KvCacheConfig
fromtensorrt_llm.llmapi.llm_args
replaces the previous quantization-based approach, which is consistent with the PR's objective to migrate from QuantConfig to KvCacheConfig for FP8 KV cache configuration.
57-57
: KvCacheConfig usage is correct and clear.The configuration properly sets the dtype based on the
fp8_kv_cache
parameter:
- "fp8" when FP8 KV cache is enabled
- "auto" for automatic dtype selection
This is a cleaner approach than the previous QuantConfig-based method.
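Put together, the new pattern in star_attention.py looks roughly like this (a condensed sketch assembled from the snippets quoted in this review, with a placeholder model path; it is not the full example script):

from tensorrt_llm.llmapi import LLM
from tensorrt_llm.llmapi.llm_args import KvCacheConfig

fp8_kv_cache = True  # stands in for the script's CLI flag

kv_cache_config = KvCacheConfig(dtype="fp8" if fp8_kv_cache else "auto")
llm = LLM(model="/path/to/model", kv_cache_config=kv_cache_config)  # placeholder path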
69-69
: LLM constructor update is consistent with the new configuration approach.Passing
kv_cache_config
instead ofquant_config
to the LLM constructor is the correct way to configure FP8 KV cache with the new API design.tests/integration/defs/llmapi/_run_llmapi_llm.py (4)
3-3
: Optional import addition is appropriate.The import of
Optional
from typing is needed for the new backend parameter type annotation.
7-8
: Backend class imports are correctly structured.The import structure clearly differentiates between the two LLM classes:
TrtLLM
fromtensorrt_llm._tensorrt_engine
for TensorRT backendLLM
fromtensorrt_llm.llmapi
for PyTorch backend (default)
15-16
: CLI parameter and function signature updates are consistent.The addition of the
--backend
CLI option with defaultNone
and the corresponding function parameter with proper type annotation (Optional[str]
) follows good practices.
22-29
: Backend selection logic is well-implemented.The implementation correctly:
- Defaults to "tensorrt" when no backend is specified
- Validates that only "pytorch" or "tensorrt" are accepted
- Selects the appropriate LLM class based on the backend
- Conditionally passes
build_config
only for the TensorRT backend (PyTorch backend doesn't use it)This design aligns with the different initialization requirements of the two backends as shown in the relevant code snippets.
tests/integration/test_lists/test-db/l0_dgx_h100.yml (1)
218-218
: LGTM!The addition of
auto_trigger: others
for the Triton backend post_merge block aligns with the trigger patterns used in other backend blocks.
/bot run --disable-fail-fast
Actionable comments posted: 0
♻️ Duplicate comments (1)
tests/integration/defs/accuracy/test_disaggregated_serving.py (1)
736-771
: Remove redundant duplicate lines.Lines 749-750 contain duplicate dictionary access without any operation, which appears to be a copy-paste error that should be removed.
Remove the redundant lines:
- ctx_server_config["cache_transceiver_config"] - ctx_server_config["cache_transceiver_config"]
🧹 Nitpick comments (1)
tests/integration/defs/accuracy/test_disaggregated_serving.py (1)
243-251
Fix line length violation and improve error message readability.

The error message on Line 250 exceeds the 120-character limit and could benefit from better formatting for readability.
Apply this diff to fix the line length and improve readability:
- pytest.fail( - f"Not enough devices for {ctx_instances} ctx instances (ctx_pp={ctx_pp}*ctx_tp={ctx_tp}) + {gen_instances} gen instances (gen_pp={gen_pp}*gen_tp={gen_tp}), total: {total_ctx_gpus + total_gen_gpus}" - ) + pytest.fail( + f"Not enough devices for {ctx_instances} ctx instances " + f"(ctx_pp={ctx_pp}*ctx_tp={ctx_tp}) + {gen_instances} gen instances " + f"(gen_pp={gen_pp}*gen_tp={gen_tp}), total: {total_ctx_gpus + total_gen_gpus}" + )
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
tests/integration/defs/accuracy/test_disaggregated_serving.py
(6 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tests/integration/defs/accuracy/test_disaggregated_serving.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tests/integration/defs/accuracy/test_disaggregated_serving.py
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/integration/defs/accuracy/test_disaggregated_serving.py
🪛 Ruff (0.12.2)
tests/integration/defs/accuracy/test_disaggregated_serving.py
250-250: Line too long (209 > 120)
(E501)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Check PR Title Format
- GitHub Check: Pre-commit Check
🔇 Additional comments (11)
tests/integration/defs/accuracy/test_disaggregated_serving.py (11)
118-123
: LGTM! URL parsing and port extraction logic is correct.The implementation correctly extracts URLs from the disaggregated server config and parses port numbers from them using string splitting.
124-145
: LGTM! Context server configuration logic is well-structured.The per-instance GPU allocation and environment setup for context servers is correctly implemented, including proper CUDA_VISIBLE_DEVICES management and conditional max_num_tokens handling.
146-166
: LGTM! Generation server configuration mirrors context server logic appropriately.The generation server setup follows the same pattern as context servers with proper GPU allocation continuation and environment configuration.
167-185
: LGTM! Multi-process orchestration is well-designed.The
multi_popen
context manager correctly handles parallel process startup and cleanup usingExitStack
, with proper exception handling and cleanup coordination.
186-202
: LGTM! Server startup and health check orchestration is robust.The nested context manager approach properly coordinates the startup of all servers followed by the disaggregated server, with a robust health check loop to ensure readiness before proceeding.
203-240
: LGTM! OpenAI client integration and request handling is well-implemented.The OpenAI client setup and the
send_request
/generate_async
functions correctly translate between OpenAI API and TensorRT-LLM's request/response format, maintaining compatibility with existing test harnesses.
276-291
: LGTM! Multi-instance URL generation and configuration is correct.The URL generation for multiple instances and the disaggregated server configuration structure properly support the new multi-instance functionality.
534-536
: LGTM! Test parameter updates support new multi-instance functionality.The updated calls to
run_parallel_test
correctly pass the newctx_instances
andgen_instances
parameters.
545-547
: LGTM! Asymmetric test configuration is correctly updated.The test properly exercises different combinations of context pipeline parallelism and generation tensor parallelism with the new instance parameters.
548-553
: LGTM! Multi-instance test provides good coverage.The new test method exercises 2 context instances and 2 generation instances, providing valuable test coverage for the multi-instance functionality.
600-633
: LGTM! nixl backend test is well-structured.The test properly configures both context and generation servers to use the "nixl" backend for cache transceiver and validates functionality with both MMLU and GSM8K tasks.
PR_Github #15123 [ run ] triggered by Bot

PR_Github #15121 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #15124 [ run ] triggered by Bot
f987d2d to 28a4732 (Compare)
/bot run --disable-fail-fast
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
tests/integration/defs/accuracy/test_disaggregated_serving.py (3)
1-1
: Prepend the NVIDIA 2025 copyright header required by repo guidelines.All Python source files must include the NVIDIA copyright header (current year). Please add it at the top of this file.
Apply this diff at the top of the file:
+ # Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. +
55-55
: Fix Python 3.8 compatibility: avoid PEP 585 built-in generics.The annotation
list[...]
requires Python 3.9+. Our guidelines target Python 3.8+, so usetyping.List[...]
instead to avoid a SyntaxError on 3.8.- self.futures: list[concurrent.futures.Future[RequestOutput]] = [] + self.futures: List[concurrent.futures.Future[RequestOutput]] = []
57-66
: Ensure thread pool is shut down if any future raises during clean exit.If
future.result()
raises whileexc_type is None
,super().__exit__
won’t run and the pool may leak threads. Wrap the loop to always shutdown on errors.def __exit__(self, exc_type, exc_val, exc_tb): - if exc_type is None: - for future in self.futures: - future.result() - return super().__exit__(exc_type, exc_val, exc_tb) + if exc_type is None: + try: + for future in self.futures: + future.result() + except Exception: + for future in self.futures: + future.cancel() + self.shutdown(wait=True, cancel_futures=True) + raise + return super().__exit__(exc_type, exc_val, exc_tb)
♻️ Duplicate comments (3)
tests/integration/defs/accuracy/test_disaggregated_serving.py (2)
271-273
: Correct tuple order for DuckLLM (args, tokenizer, generate_async).This fixes the previously reported mismatch. Good catch.
752-753
: Previous duplicate no-op lines are no longer present.The earlier reported duplicated
ctx_server_config["cache_transceiver_config"]
appears resolved.tests/unittest/llmapi/test_llm_pytorch.py (1)
857-858
: Fix Ruff F841: assigned to but never used.You don’t need to bind the constructed
LLM
; the exception is raised during construction.- llm = LLM(model=llama_model_path, - kv_cache_config=global_kvcache_config) + LLM( + model=llama_model_path, + kv_cache_config=global_kvcache_config, + )
🧹 Nitpick comments (4)
tests/integration/defs/accuracy/test_disaggregated_serving.py (2)
135-137
: Make port parsing robust to future URL formats.Splitting by the first “:” breaks if a scheme is introduced (e.g., http://). Use rsplit to extract the last segment.
- ctx_ports = [int(url.split(":")[1]) for url in ctx_urls] - gen_ports = [int(url.split(":")[1]) for url in gen_urls] + ctx_ports = [int(url.rsplit(":", 1)[-1]) for url in ctx_urls] + gen_ports = [int(url.rsplit(":", 1)[-1]) for url in gen_urls]
181-199
: Simplify multi_popen with ExitStack and improve debuggability.You can remove the intermediate
processes
list and enter all contexts in one expression. Also consider logging the command on failure.- @contextlib.contextmanager - def multi_popen(server_configs): - processes = [] - try: - for env, args in server_configs: - proc = popen(args, env=env) - processes.append(proc) - - with contextlib.ExitStack() as stack: - opened_processes = [ - stack.enter_context(proc) for proc in processes - ] - yield opened_processes - except Exception as e: - print( - f"Failed to start disaggregated server processes in multi_popen: {e}" - ) - raise + @contextlib.contextmanager + def multi_popen(server_configs): + try: + with contextlib.ExitStack() as stack: + procs = [ + stack.enter_context(popen(args, env=env)) + for env, args in server_configs + ] + yield procs + except Exception as e: + cmds = [args for _, args in server_configs] + print(f"Failed to start disaggregated server processes: {e}\nCommands: {cmds}") + raisetests/unittest/llmapi/test_llm_pytorch.py (2)
824-831
: Optional: avoid subclassing the worker; simulate failure by raising directly.The mock subclass works, but you can simplify by making
GenerationExecutor.create
raise directly. This reduces coupling to internal class locations and removes the need for the extra import/class.If you adopt the direct-raise approach in the test (see suggestion on Lines 851-853), you can remove this class:
-class FailingExecutorWorker(GenerationExecutorWorker): - """Mock worker that fails during initialization to test error handling.""" - - def __init__(self, *args, **kwargs): - # Simulate a constructor failure - raise RuntimeError( - "Mock GenerationExecutorWorker initialization failed")
833-839
: Remove unused FailingExecutor helper.
FailingExecutor
is defined but never used. Drop it to keep the test lean.-FailingExecutor = type( - "FailingExecutor", (), { - "create": - classmethod( - lambda cls, *args, **kwargs: FailingExecutorWorker(*args, **kwargs)) - })
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (7)
tensorrt_llm/executor/proxy.py
(1 hunks)tensorrt_llm/executor/worker.py
(2 hunks)tests/integration/defs/accuracy/test_disaggregated_serving.py
(6 hunks)tests/integration/test_lists/qa/llm_function_full.txt
(4 hunks)tests/integration/test_lists/test-db/l0_dgx_h100.yml
(1 hunks)tests/integration/test_lists/waives.txt
(1 hunks)tests/unittest/llmapi/test_llm_pytorch.py
(2 hunks)
✅ Files skipped from review due to trivial changes (1)
- tests/integration/test_lists/waives.txt
🚧 Files skipped from review as they are similar to previous changes (4)
- tensorrt_llm/executor/worker.py
- tensorrt_llm/executor/proxy.py
- tests/integration/test_lists/test-db/l0_dgx_h100.yml
- tests/integration/test_lists/qa/llm_function_full.txt
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tests/integration/defs/accuracy/test_disaggregated_serving.py
tests/unittest/llmapi/test_llm_pytorch.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tests/integration/defs/accuracy/test_disaggregated_serving.py
tests/unittest/llmapi/test_llm_pytorch.py
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/unittest/llmapi/test_llm_pytorch.py
🧬 Code Graph Analysis (2)
tests/integration/defs/accuracy/test_disaggregated_serving.py (4)
tests/integration/defs/trt_test_alternative.py (1)
popen
(179-198)tests/unittest/_torch/auto_deploy/unit/singlegpu/utils/test_config.py (1)
temp_dir
(67-70)tensorrt_llm/llmapi/utils.py (1)
get_device_count
(83-84)tests/integration/defs/conftest.py (1)
get_device_count
(1941-1943)
tests/unittest/llmapi/test_llm_pytorch.py (2)
tensorrt_llm/executor/worker.py (1)
GenerationExecutorWorker
(50-626)tensorrt_llm/llmapi/llm.py (1)
LLM
(1090-1106)
🪛 Ruff (0.12.2)
tests/integration/defs/accuracy/test_disaggregated_serving.py
290-290: Line too long (209 > 120)
(E501)
tests/unittest/llmapi/test_llm_pytorch.py
856-856: Local variable llm
is assigned to but never used
Remove assignment to unused variable llm
(F841)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (8)
tests/integration/defs/accuracy/test_disaggregated_serving.py (6)
132-180
: Per-instance GPU assignment looks correct.Contiguous GPU blocks per process with a running offset is a sensible and deterministic approach. This pairs well with the upstream device count validation in
run_parallel_test
.
315-317
: Ports allocation strategy LGTM.Alternating 8001/8002 per instance avoids collisions between ctx/gen servers. Clear and predictable.
323-329
: Disaggregated server config matches per-instance URLs.Including both num_instances and explicit urls is consistent and unambiguous for the router.
578-578
: Updated call site now passes instance counts explicitly.Moving to
(ctx_instances=1, gen_instances=1)
maintains previous behavior while aligning with the new signature.
580-590
: Asymmetric test updates look consistent.Instance counts added with no behavior change. Guard still skips when capacity is insufficient.
591-595
: New multi-instance test adds valuable coverage.Exercising
(2, 2)
instances is a good sanity check for the orchestration and GPU allocation logic.tests/unittest/llmapi/test_llm_pytorch.py (2)
841-859
: Good addition: validates that executor creation failures surface at LLM construction.The test clearly asserts the error path and message. This improves coverage of initialization error propagation.
7-7
: Import re-export verified: no change requiredConfirmed that
GenerationExecutorWorker
is explicitly re-exported intensorrt_llm/executor/__init__.py
, so the existing import fromtensorrt_llm.executor
is valid and sufficiently robust.
  thread_pool, temp_dir, multi_popen(ctx_servers + gen_servers),
      popen([
          trtllm_serve_path, "disaggregated", "-c",
          disaggregated_serving_config_path, "--server_start_timeout",
          "3600"
-     ]) as disaggregated_server):
+     ])):
🛠️ Refactor suggestion
Add timeouts to the health check to prevent long hangs.
requests.get() has no default timeout and can hang indefinitely if the socket is established but never responds. Add a reasonable timeout and handle read timeouts.
- response = requests.get("http://localhost:8000/health")
+ response = requests.get("http://localhost:8000/health", timeout=5)
if response.status_code == 200:
break
- except requests.exceptions.ConnectionError:
+ except (requests.exceptions.ConnectionError, requests.exceptions.ReadTimeout):
continue
Also applies to: 207-216
🤖 Prompt for AI Agents
In tests/integration/defs/accuracy/test_disaggregated_serving.py around lines
201-216, the health check calls use requests.get without timeouts which can hang
indefinitely; update those requests.get calls to include a reasonable timeout
(e.g., timeout=(connect_timeout, read_timeout) or a single timeout value) and
wrap them to catch requests.exceptions.Timeout and
requests.exceptions.RequestException, treating timeouts as failed health checks
(return False or raise a controlled error) and logging or retrying as
appropriate so the test doesn't hang.
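For reference, a bounded polling loop along these lines could look like the following sketch; it assumes only the requests dependency and the local health endpoint used by the test, and is not the repository's code.
import time

import requests

def wait_until_healthy(url: str = "http://localhost:8000/health",
                       deadline_s: float = 3600.0) -> bool:
    start = time.time()
    while time.time() - start < deadline_s:
        time.sleep(1)
        try:
            # (connect timeout, read timeout) so a silent socket cannot block forever
            response = requests.get(url, timeout=(5, 10))
        except (requests.exceptions.ConnectionError,
                requests.exceptions.Timeout):
            continue
        if response.status_code == 200:
            return True
    return False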
                      ctx_instances: int,
                      gen_instances: int,
                      test_sets: List[LlmapiAccuracyTestHarness],
                      ctx_model: str = None,
                      gen_model: str = None):
-    if ctx_tp * ctx_pp + gen_tp * gen_pp > get_device_count():
+    total_ctx_gpus = ctx_tp * ctx_pp * ctx_instances
+    total_gen_gpus = gen_tp * gen_pp * gen_instances
+    if total_ctx_gpus + total_gen_gpus > get_device_count():
         pytest.fail(
-            f"Not enough devices for ctx_pp={ctx_pp}+ctx_tp={ctx_tp} and gen_pp={gen_pp}+gen_tp={gen_tp} test"
+            f"Not enough devices for {ctx_instances} ctx instances (ctx_pp={ctx_pp}*ctx_tp={ctx_tp}) + {gen_instances} gen instances (gen_pp={gen_pp}*gen_tp={gen_tp}), total: {total_ctx_gpus + total_gen_gpus}"
         )
🛠️ Refactor suggestion
Enforce line length and consider skipping instead of failing for device scarcity.
- The pytest.fail(...) message exceeds 120 chars (E501). Rewrap the string.
- Optional: for insufficient devices, pytest.skip(...) is consistent with other tests and avoids turning capacity issues into failures.
- if total_ctx_gpus + total_gen_gpus > get_device_count():
- pytest.fail(
- f"Not enough devices for {ctx_instances} ctx instances (ctx_pp={ctx_pp}*ctx_tp={ctx_tp}) + {gen_instances} gen instances (gen_pp={gen_pp}*gen_tp={gen_tp}), total: {total_ctx_gpus + total_gen_gpus}"
- )
+ if total_ctx_gpus + total_gen_gpus > get_device_count():
+ msg = (
+ "Not enough devices for "
+ f"{ctx_instances} ctx instances (ctx_pp={ctx_pp}*ctx_tp={ctx_tp}) "
+ f"+ {gen_instances} gen instances (gen_pp={gen_pp}*gen_tp={gen_tp}), "
+ f"total: {total_ctx_gpus + total_gen_gpus}"
+ )
+ # Consider skipping instead of failing to align with other tests:
+ # pytest.skip(msg)
+ pytest.fail(msg)
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
                      ctx_instances: int,
                      gen_instances: int,
                      test_sets: List[LlmapiAccuracyTestHarness],
                      ctx_model: str = None,
                      gen_model: str = None):
-    if ctx_tp * ctx_pp + gen_tp * gen_pp > get_device_count():
+    total_ctx_gpus = ctx_tp * ctx_pp * ctx_instances
+    total_gen_gpus = gen_tp * gen_pp * gen_instances
+    if total_ctx_gpus + total_gen_gpus > get_device_count():
         pytest.fail(
-            f"Not enough devices for ctx_pp={ctx_pp}+ctx_tp={ctx_tp} and gen_pp={gen_pp}+gen_tp={gen_tp} test"
+            f"Not enough devices for {ctx_instances} ctx instances (ctx_pp={ctx_pp}*ctx_tp={ctx_tp}) + {gen_instances} gen instances (gen_pp={gen_pp}*gen_tp={gen_tp}), total: {total_ctx_gpus + total_gen_gpus}"
         )
total_ctx_gpus = ctx_tp * ctx_pp * ctx_instances | |
total_gen_gpus = gen_tp * gen_pp * gen_instances | |
if total_ctx_gpus + total_gen_gpus > get_device_count(): | |
msg = ( | |
"Not enough devices for " | |
f"{ctx_instances} ctx instances (ctx_pp={ctx_pp}*ctx_tp={ctx_tp}) " | |
f"+ {gen_instances} gen instances (gen_pp={gen_pp}*gen_tp={gen_tp}), " | |
f"total: {total_ctx_gpus + total_gen_gpus}" | |
) | |
# Consider skipping instead of failing to align with other tests: | |
# pytest.skip(msg) | |
pytest.fail(msg) |
🧰 Tools
🪛 Ruff (0.12.2)
290-290: Line too long (209 > 120) (E501)
🤖 Prompt for AI Agents
In tests/integration/defs/accuracy/test_disaggregated_serving.py around lines
281 to 292, the pytest.fail call uses a single long string exceeding 120 chars
and should be skipped rather than failed when devices are scarce; change
pytest.fail(...) to pytest.skip(...) and rewrap the message into a shorter
concatenated or formatted string (or split into multiple shorter f-strings) so
no line exceeds 120 characters while preserving the same contextual details
about ctx/gen instances, pp/tp and total devices.
    with patch('tensorrt_llm.executor.executor.GenerationExecutor.create',
               side_effect=lambda *args, **kwargs: FailingExecutorWorker(
                   *args, **kwargs)):
💡 Verification agent
🧩 Analysis chain
Patch the symbol at the use site to harden the test.
Ensure you’re patching the exact attribute looked up by LLM at runtime. In many cases, patching in the consumer’s module (e.g., tensorrt_llm.llmapi.llm.GenerationExecutor.create) is more robust than patching the defining module.
To verify the call site, run:
Optionally, simplify the failure simulation and avoid the custom worker class:
-with patch('tensorrt_llm.executor.executor.GenerationExecutor.create',
- side_effect=lambda *args, **kwargs: FailingExecutorWorker(
- *args, **kwargs)):
+with patch('tensorrt_llm.executor.executor.GenerationExecutor.create',
+ side_effect=RuntimeError(
+ "Mock GenerationExecutorWorker initialization failed")):
If you adopt this, you can also remove the GenerationExecutorWorker import (Line 7) and the FailingExecutorWorker class (Lines 824-831).
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Find all places where GenerationExecutor.create is invoked
rg -nP -C3 '\bGenerationExecutor\s*\.\s*create\s*\(' --type=py tensorrt_llm
# Check what LLM imports/binds for GenerationExecutor
rg -nP -C5 'from\s+tensorrt_llm\.executor(?:\.executor)?\s+import\s+GenerationExecutor\b|import\s+tensorrt_llm\.executor\.executor\s+as' tensorrt_llm/llmapi/llm.py
Length of output: 97
To pinpoint the correct patch target, let’s double-check how GenerationExecutor is imported and invoked in the LLM module.
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# 1. How is GenerationExecutor brought into the LLM API?
rg -nP -C3 'from\s+tensorrt_llm\.executor\.executor\s+import\s+GenerationExecutor' --type=py tensorrt_llm/llmapi/llm.py
rg -nP -C3 'import\s+tensorrt_llm\.executor\.executor\s+as' --type=py tensorrt_llm/llmapi/llm.py
# 2. Does the LLM code ever bind it under a different alias?
rg -nP -C3 'GenerationExecutor\s*=' --type=py tensorrt_llm/llmapi/llm.py
# 3. Where is create() actually called at runtime?
rg -nP -C3 'GenerationExecutor\s*\.\s*create' --type=py .
Length of output: 141
Let’s search the LLM API module to see exactly how GenerationExecutor is imported and where create() is called:
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# 1. Any direct import of GenerationExecutor in the LLM API
rg -nP --type=py 'GenerationExecutor' -n -C3 tensorrt_llm/llmapi/llm.py
# 2. Any aliasing of the executor module
rg -nP --type=py 'import\s+tensorrt_llm\.executor' -n -C3 tensorrt_llm/llmapi/llm.py
# 3. Where create() is invoked in the LLM API
rg -nP --type=py 'create\s*\(' -n -C3 tensorrt_llm/llmapi/llm.py
Length of output: 1639
To pinpoint the correct patch target in your test, let’s inspect how you’re using patch(...) in test_llm_pytorch.py. Please run:
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Show the `with patch(...)` blocks around where FailingExecutorWorker is used
rg -nP 'with\s+patch' -n -C10 --type=py tests/unittest/llmapi/test_llm_pytorch.py
Length of output: 1065
Let’s inspect how _executor_cls is used in the LLM API to see exactly how the executor is instantiated:
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Show where _executor_cls is used to create the executor
rg -nP '_executor_cls\.' -n -C5 --type=py tensorrt_llm/llmapi/llm.py
Length of output: 1241
Let’s inspect the test file’s imports to pinpoint where GenerationExecutorWorker is brought in and prepare the correct removals:
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Show the first 80 lines (imports and helper classes) of the test file
sed -n '1,80p' tests/unittest/llmapi/test_llm_pytorch.py
Length of output: 3209
Patch GenerationExecutor.create at the LLM import site. In tests/unittest/llmapi/test_llm_pytorch.py, change the patch target to where LLM actually looks it up:
- with patch('tensorrt_llm.executor.executor.GenerationExecutor.create',
- side_effect=lambda *args, **kwargs: FailingExecutorWorker(
- *args, **kwargs)):
+ with patch('tensorrt_llm.llmapi.llm.GenerationExecutor.create',
+ side_effect=lambda *args, **kwargs: FailingExecutorWorker(
+ *args, **kwargs)):
Optionally simplify the failure injection and remove the custom worker entirely:
- with patch('tensorrt_llm.llmapi.llm.GenerationExecutor.create',
- side_effect=lambda *args, **kwargs: FailingExecutorWorker(
- *args, **kwargs)):
+ with patch('tensorrt_llm.llmapi.llm.GenerationExecutor.create',
+ side_effect=RuntimeError(
+ "Mock GenerationExecutorWorker initialization failed")):
If you adopt the above, you can also remove from the top of the file:
- the from tensorrt_llm.executor import GenerationExecutorWorker import
- the class FailingExecutorWorker helper (around lines 824–831)
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
-    with patch('tensorrt_llm.executor.executor.GenerationExecutor.create',
-               side_effect=lambda *args, **kwargs: FailingExecutorWorker(
-                   *args, **kwargs)):
+    with patch('tensorrt_llm.llmapi.llm.GenerationExecutor.create',
+               side_effect=lambda *args, **kwargs: FailingExecutorWorker(
+                   *args, **kwargs)):
🤖 Prompt for AI Agents
In tests/unittest/llmapi/test_llm_pytorch.py around lines 851–853, the test
currently patches GenerationExecutor.create at tensorrt_llm.executor.executor
but LLM looks up that symbol from its own import site; change the patch target
to the module where the LLM imports GenerationExecutor (e.g.,
tensorrt_llm.llmapi.llm.GenerationExecutor.create) so the monkeypatch takes effect.
Replace the custom side_effect that constructs FailingExecutorWorker with a
simpler side_effect that raises the desired Exception (or a simple lambda that
raises) and then remove the now-unused import of GenerationExecutorWorker and
the FailingExecutorWorker helper class around lines 824–831.
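As a general aside on the advice above: for module-level names bound with a from-import, only patching the consumer's namespace is guaranteed to be seen. A self-contained sketch with made-up module names (demo_lib, demo_consumer):
import sys
import types
from unittest.mock import patch

# Build two toy modules in memory; demo_consumer does `from demo_lib import create`.
lib = types.ModuleType("demo_lib")
lib.create = lambda: "real"
sys.modules["demo_lib"] = lib

consumer = types.ModuleType("demo_consumer")
exec("from demo_lib import create\n"
     "def build():\n"
     "    return create()\n", consumer.__dict__)
sys.modules["demo_consumer"] = consumer

with patch("demo_lib.create", return_value="mocked"):
    print(consumer.build())   # 'real': the consumer still holds its original binding

with patch("demo_consumer.create", return_value="mocked"):
    print(consumer.build())   # 'mocked': patched where the name is actually looked up
In this PR's test the target is a classmethod on GenerationExecutor itself, so if LLM resolves create through that shared class object (as the _executor_cls usage quoted above suggests), either patch target should take effect; patching at tensorrt_llm.llmapi.llm is simply the more defensive habit.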
PR_Github #15915 [ run ] triggered by Bot
PR_Github #15915 [ run ] completed with state
Branch updated from 28a4732 to 0e0eeed
/bot run --disable-fail-fast
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/unittest/llmapi/test_llm_pytorch.py (1)
841-859: Patch the symbol where LLM looks it up and fix Ruff F841 (unused variable). Patch GenerationExecutor.create at the import site used by LLM so the monkeypatch takes effect, and don’t assign the constructed LLM to a variable (fixes F841). Also, directly raise the desired RuntimeError; no need to construct a failing worker.
 def test_llm_with_proxy_error():
 @@
-    # Test that the error is properly caught and re-raised by LLM
-    # We patch GenerationExecutor.create directly to return our failing worker
-    with patch('tensorrt_llm.executor.executor.GenerationExecutor.create',
-               side_effect=lambda *args, **kwargs: FailingExecutorWorker(
-                   *args, **kwargs)):
-        with pytest.raises(
-                RuntimeError,
-                match="Mock GenerationExecutorWorker initialization failed"):
-            llm = LLM(model=llama_model_path,
-                      kv_cache_config=global_kvcache_config)
+    # Patch where LLM resolves GenerationExecutor to ensure the side effect is hit
+    with patch('tensorrt_llm.llmapi.llm.GenerationExecutor.create',
+               side_effect=RuntimeError(
+                   "Mock GenerationExecutorWorker initialization failed")):
+        with pytest.raises(RuntimeError,
+                           match="Mock GenerationExecutorWorker initialization failed"):
+            LLM(model=llama_model_path, kv_cache_config=global_kvcache_config)
♻️ Duplicate comments (1)
tests/integration/test_lists/waives.txt (1)
305-305: Duplicate SKIP entry for the same test; keep a single canonical line. This line duplicates Line 266 for the same test. Please keep only one entry (prefer the earlier one that already references both bug IDs) to avoid ambiguity in waives and to keep the list tidy.
Apply:
-accuracy/test_llm_api_pytorch.py::TestQwen3_235B_A22B::test_nvfp4[latency_moe_trtllm_eagle3] SKIP (https://nvbugs/5437384)
🧹 Nitpick comments (5)
tests/unittest/llmapi/test_llm_pytorch.py (2)
824-839
: Remove unnecessary failing worker scaffolding (simplify the test and avoid heavy imports)
FailingExecutorWorker
andFailingExecutor
aren’t needed if you patchGenerationExecutor.create
to raise directly. Drop these to reduce test brittleness and dependencies.-class FailingExecutorWorker(GenerationExecutorWorker): - """Mock worker that fails during initialization to test error handling.""" - - def __init__(self, *args, **kwargs): - # Simulate a constructor failure - raise RuntimeError( - "Mock GenerationExecutorWorker initialization failed") - - -FailingExecutor = type( - "FailingExecutor", (), { - "create": - classmethod( - lambda cls, *args, **kwargs: FailingExecutorWorker(*args, **kwargs)) - })
7-7
: Drop unused heavy importAfter simplifying the test,
GenerationExecutorWorker
is no longer needed. Remove the import to avoid unnecessary module loading.-from tensorrt_llm.executor import GenerationExecutorWorker
tests/integration/defs/accuracy/test_disaggregated_serving.py (3)
132-137: More robust port parsing. url.split(":")[1] is brittle if a scheme or extra colons ever appear. Using rsplit(":", 1) is safer.
-        ctx_ports = [int(url.split(":")[1]) for url in ctx_urls]
-        gen_ports = [int(url.split(":")[1]) for url in gen_urls]
+        ctx_ports = [int(url.rsplit(":", 1)[1]) for url in ctx_urls]
+        gen_ports = [int(url.rsplit(":", 1)[1]) for url in gen_urls]
200-216
: Add request timeouts to health check to prevent hangs
requests.get()
without a timeout can hang indefinitely. Add a timeout and catchReadTimeout
as well.start_time = time.time() while time.time() - start_time < 3600: time.sleep(1) try: print("Checking health endpoint") - response = requests.get("http://localhost:8000/health") + response = requests.get("http://localhost:8000/health", timeout=5) if response.status_code == 200: break - except requests.exceptions.ConnectionError: + except (requests.exceptions.ConnectionError, requests.exceptions.ReadTimeout): continue
286-292
: Prefer skip over fail for capacity issues and fix E501 long lineCapacity scarcity should skip the test (consistent with other tests), not fail the suite. Also rewrap the long message to satisfy line-length linting.
- if total_ctx_gpus + total_gen_gpus > get_device_count(): - pytest.fail( - f"Not enough devices for {ctx_instances} ctx instances (ctx_pp={ctx_pp}*ctx_tp={ctx_tp}) + {gen_instances} gen instances (gen_pp={gen_pp}*gen_tp={gen_tp}), total: {total_ctx_gpus + total_gen_gpus}" - ) + if total_ctx_gpus + total_gen_gpus > get_device_count(): + msg = ( + "Not enough devices for " + f"{ctx_instances} ctx instances (ctx_pp={ctx_pp}*ctx_tp={ctx_tp}) " + f"+ {gen_instances} gen instances (gen_pp={gen_pp}*gen_tp={gen_tp}), " + f"total: {total_ctx_gpus + total_gen_gpus}" + ) + pytest.skip(msg)
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (7)
- tensorrt_llm/executor/proxy.py (1 hunks)
- tensorrt_llm/executor/worker.py (2 hunks)
- tests/integration/defs/accuracy/test_disaggregated_serving.py (7 hunks)
- tests/integration/test_lists/qa/llm_function_full.txt (4 hunks)
- tests/integration/test_lists/test-db/l0_dgx_h100.yml (1 hunks)
- tests/integration/test_lists/waives.txt (1 hunks)
- tests/unittest/llmapi/test_llm_pytorch.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
- tensorrt_llm/executor/worker.py
- tensorrt_llm/executor/proxy.py
- tests/integration/test_lists/test-db/l0_dgx_h100.yml
- tests/integration/test_lists/qa/llm_function_full.txt
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tests/integration/defs/accuracy/test_disaggregated_serving.py
tests/unittest/llmapi/test_llm_pytorch.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tests/integration/defs/accuracy/test_disaggregated_serving.py
tests/unittest/llmapi/test_llm_pytorch.py
🧠 Learnings (3)
📚 Learning: 2025-07-22T08:33:49.109Z
Learnt from: yiqingy0
PR: NVIDIA/TensorRT-LLM#5198
File: jenkins/mergeWaiveList.py:0-0
Timestamp: 2025-07-22T08:33:49.109Z
Learning: In the TensorRT-LLM waive list merging system, removed lines are always located at the end of the merge waive lists, which is why the mergeWaiveList.py script uses reverse traversal - it's an optimization for this specific domain constraint.
Applied to files:
tests/integration/test_lists/waives.txt
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
PR: NVIDIA/TensorRT-LLM#6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Applied to files:
tests/integration/test_lists/waives.txt
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/integration/test_lists/waives.txt
tests/unittest/llmapi/test_llm_pytorch.py
🧬 Code Graph Analysis (2)
tests/integration/defs/accuracy/test_disaggregated_serving.py (2)
tests/integration/defs/trt_test_alternative.py (1)
- popen (179-198)
tests/integration/defs/conftest.py (1)
- get_device_count (1941-1943)
tests/unittest/llmapi/test_llm_pytorch.py (1)
tensorrt_llm/executor/worker.py (1)
- GenerationExecutorWorker (50-626)
🪛 Ruff (0.12.2)
tests/integration/defs/accuracy/test_disaggregated_serving.py
290-290: Line too long (209 > 120) (E501)
tests/unittest/llmapi/test_llm_pytorch.py
856-856: Local variable llm is assigned to but never used; remove the assignment to the unused variable llm (F841)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
PR_Github #15949 [ run ] triggered by Bot
PR_Github #15949 [ run ] completed with state
…VIDIA#6674) Signed-off-by: raayandhar <[email protected]> Signed-off-by: Wangshanshan <[email protected]>
Signed-off-by: Yanchao Lu <[email protected]> Co-authored-by: Zhanrui Sun <[email protected]> Signed-off-by: Wangshanshan <[email protected]>
…NVIDIA#6793) Signed-off-by: Superjomn <[email protected]> Signed-off-by: Wangshanshan <[email protected]>
Signed-off-by: Balaram Buddharaju <[email protected]> Signed-off-by: Wangshanshan <[email protected]>
…A#6826) Signed-off-by: Yuxian Qiu <[email protected]> Signed-off-by: Wangshanshan <[email protected]>
Branch updated from 0e0eeed to a69120d
/bot run --disable-fail-fast
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
tensorrt_llm/executor/worker.py (1)
772-779
: Avoid sending Exception instances over IPC; send a simple, reliably serializable status tuple insteadIpcQueue.put serializes with pickle; many Exception instances (esp. custom ones with non-serializable attrs) can fail pickling, causing initialization handshakes to break. You already carry the full traceback string; make the first element a simple sentinel (e.g., None or "ERROR") to eliminate this risk.
Apply:
- worker_init_status_queue.put((e, traceback.format_exc())) + # Use a simple sentinel to avoid pickling Exceptions across IPC + worker_init_status_queue.put((None, traceback.format_exc()))If there are any other error-path puts to worker_init_status_queue, make the same change there for consistency.
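A small standalone illustration of the failure mode described here; it is purely illustrative and does not touch IpcQueue or the worker code.
# Exception instances that drag non-picklable state break pickle-based IPC,
# while a (sentinel, traceback_string) tuple always round-trips.
import pickle
import threading
import traceback

class CustomError(Exception):
    def __init__(self, msg, lock):
        super().__init__(msg)
        self.lock = lock  # e.g. a lock or socket kept for debugging

try:
    raise CustomError("init failed", threading.Lock())
except CustomError as exc:
    trace = traceback.format_exc()
    try:
        pickle.dumps(exc)
    except TypeError as err:
        print(f"exception instance is not picklable: {err}")
    payload = (None, trace)  # simple sentinel plus a plain string
    assert pickle.loads(pickle.dumps(payload)) == payload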
tests/integration/defs/accuracy/test_disaggregated_serving.py (3)
53-56
: Fix Python 3.8 compatibility:list[...]
generics are 3.9+ only
self.futures: list[concurrent.futures.Future[RequestOutput]]
will raise at import time on Python 3.8. Usetyping.Future
withList[...]
instead.Apply these changes:
- self.futures: list[concurrent.futures.Future[RequestOutput]] = [] + self.futures: List[Future[RequestOutput]] = []And update imports (see next comment).
13-13
: ImportFuture
for the type hint fixComplement to previous comment: bring
Future
fromtyping
.-from typing import Any, Dict, List, Optional +from typing import Any, Dict, List, Optional, Future
894-907
: Add requiredctx_instances
andgen_instances
totest_mixed_ctx_gen_model
callThe
test_mixed_ctx_gen_model
invocation is missing the two new mandatory parameters—ctx_instances
andgen_instances
—and will raise aTypeError
. Update the call to include them (using1
for each to match existing defaults) beforetest_sets
:--- a/tests/integration/defs/accuracy/test_disaggregated_serving.py +++ b/tests/integration/defs/accuracy/test_disaggregated_serving.py @@ -896,13 +896,18 @@ class TestDisaggregatedServing: @pytest.mark.skip_less_device(4) @pytest.mark.parametrize("ctx_pp,gen_tp", [(2, 2)], ids=["ctxpp2gentp2"]) def test_mixed_ctx_gen_model(self, ctx_pp, gen_tp): ctx_model = self.FP4_MODEL gen_model = self.FP8_MODEL - return run_parallel_test("Qwen3/Qwen3-30B-A3B", - ctx_model, - ctx_pp=ctx_pp, - ctx_tp=1, - gen_pp=1, - gen_tp=gen_tp, - test_sets=[GSM8K, MMLU], - ctx_model=ctx_model, - gen_model=gen_model) + return run_parallel_test( + "Qwen3/Qwen3-30B-A3B", + ctx_model, + ctx_pp=ctx_pp, + ctx_tp=1, + gen_pp=1, + gen_tp=gen_tp, + ctx_instances=1, + gen_instances=1, + test_sets=[GSM8K, MMLU], + ctx_model=ctx_model, + gen_model=gen_model, + )This aligns the call with the updated signature:
def run_parallel_test(model_name, model_path, ctx_pp, ctx_tp, gen_pp, gen_tp, ctx_instances, gen_instances, test_sets, ctx_model=None, gen_model=None): ...No other call sites were missing these arguments.
♻️ Duplicate comments (3)
tests/unittest/llmapi/test_llm_pytorch.py (1)
841-859
: Harden the failure injection: patch at the LLM import site and raise directly; remove unused assignment (Ruff F841)
- Patch the symbol where LLM actually looks it up (tensorrt_llm.llmapi.llm.GenerationExecutor.create) to guarantee the monkeypatch takes effect across import indirections.
- No need to construct a failing subclass; a side_effect that raises RuntimeError is simpler and decouples the test from production internals.
- Remove the unused llm variable to satisfy Ruff F841.
Apply:
- with patch('tensorrt_llm.executor.executor.GenerationExecutor.create', - side_effect=lambda *args, **kwargs: FailingExecutorWorker( - *args, **kwargs)): - with pytest.raises( - RuntimeError, - match="Mock GenerationExecutorWorker initialization failed"): - llm = LLM(model=llama_model_path, - kv_cache_config=global_kvcache_config) + with patch('tensorrt_llm.llmapi.llm.GenerationExecutor.create', + side_effect=RuntimeError( + "Mock GenerationExecutorWorker initialization failed")): + with pytest.raises( + RuntimeError, + match="Mock GenerationExecutorWorker initialization failed"): + LLM(model=llama_model_path, kv_cache_config=global_kvcache_config)tests/integration/defs/accuracy/test_disaggregated_serving.py (2)
200-207
: Health-check loop can hang indefinitely without request timeoutsA prior review flagged this and it was added in #6660; current code dropped the timeout.
requests.get()
withouttimeout
may hang despite the 3600s outer loop.Apply:
- response = requests.get("http://localhost:8000/health") + response = requests.get("http://localhost:8000/health", timeout=(2, 5)) if response.status_code == 200: break - except requests.exceptions.ConnectionError: + except (requests.exceptions.ConnectionError, requests.exceptions.Timeout): continue
281-292
: Ruff E501: overlongpytest.fail(...)
message will fail CILine 290 is 209 chars (>120). Rewrap to satisfy the linter. Keeping
fail
behavior unchanged here; see next comment for an alternative usingskip
.- if total_ctx_gpus + total_gen_gpus > get_device_count(): - pytest.fail( - f"Not enough devices for {ctx_instances} ctx instances (ctx_pp={ctx_pp}*ctx_tp={ctx_tp}) + {gen_instances} gen instances (gen_pp={gen_pp}*gen_tp={gen_tp}), total: {total_ctx_gpus + total_gen_gpus}" - ) + if total_ctx_gpus + total_gen_gpus > get_device_count(): + msg = ( + "Not enough devices for " + f"{ctx_instances} ctx instances (ctx_pp={ctx_pp}*ctx_tp={ctx_tp}) " + f"+ {gen_instances} gen instances (gen_pp={gen_pp}*gen_tp={gen_tp}), " + f"total: {total_ctx_gpus + total_gen_gpus}" + ) + pytest.fail(msg)
🧹 Nitpick comments (10)
tensorrt_llm/executor/worker.py (1)
1-1
: Add NVIDIA copyright header (policy compliance)This source file lacks the required NVIDIA header. Please add it (current year) above imports.
Apply:
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. +# SPDX-License-Identifier: Apache-2.0tests/unittest/llmapi/test_llm_pytorch.py (3)
824-839
: Remove test-only subclass and helper; they’re unnecessary after switching to a direct side_effectThe FailingExecutorWorker subclass and dynamic FailingExecutor type add coupling to production code and import weight. After adopting a direct side_effect that raises, these can be removed.
Apply:
-class FailingExecutorWorker(GenerationExecutorWorker): - """Mock worker that fails during initialization to test error handling.""" - - def __init__(self, *args, **kwargs): - # Simulate a constructor failure - raise RuntimeError( - "Mock GenerationExecutorWorker initialization failed") - - -FailingExecutor = type( - "FailingExecutor", (), { - "create": - classmethod( - lambda cls, *args, **kwargs: FailingExecutorWorker(*args, **kwargs)) - }) +# Removed: use direct side_effect to raise instead of subclassing the worker.
7-7
: Drop unused import after removing the test-only subclassGenerationExecutorWorker import becomes unused once the subclass is deleted.
Apply:
-from tensorrt_llm.executor import GenerationExecutorWorker
1-3
: Add NVIDIA copyright header (policy compliance)Tests are also subject to the header requirement.
Apply:
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. +# SPDX-License-Identifier: Apache-2.0tests/integration/defs/accuracy/test_disaggregated_serving.py (6)
1-1
: Add required NVIDIA copyright header (2025) at file topRepo guidelines mandate an NVIDIA copyright header on all Python sources. Add it above the current comments to avoid license/pre-commit failures.
+ # Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. + # Licensed under the NVIDIA Software License (or the project’s declared license). See LICENSE for details.
132-137
: Make URL-to-port parsing robust (handles extra colons/IPv6)
url.split(":")[1]
breaks if the host contains additional colons (e.g., IPv6) or if the URL includes a scheme. Preferrsplit(":", 1)
.- ctx_ports = [int(url.split(":")[1]) for url in ctx_urls] - gen_ports = [int(url.split(":")[1]) for url in gen_urls] + ctx_ports = [int(url.rsplit(":", 1)[1]) for url in ctx_urls] + gen_ports = [int(url.rsplit(":", 1)[1]) for url in gen_urls]
138-180
: GPU allocation logic looks correct; consider a safety check for direct callersContiguous allocation with an offset per-server is fine. Since
launch_disaggregated_llm
is also called directly by tests in this file (bypassingrun_parallel_test
’s device-count guard), consider asserting thatcurrent_gpu_offset
never exceedsget_device_count()
to fail fast with a clear error when misconfigured.@@ - current_gpu_offset = 0 + current_gpu_offset = 0 + total_needed = (len(ctx_ports) * ctx_total_gpus) + (len(gen_ports) * gen_total_gpus) + if total_needed > get_device_count(): + raise RuntimeError( + f"Disaggregated serving requires {total_needed} GPUs " + f"(ctx={len(ctx_ports)}*{ctx_total_gpus}, gen={len(gen_ports)}*{gen_total_gpus}), " + f"but only {get_device_count()} available." + )
281-292
: Considerpytest.skip(...)
instead of failure on scarce devicesOther tests skip when capacity is insufficient; aligning behavior avoids turning lab capacity constraints into failures.
- pytest.fail(msg) + # pytest.skip(msg) + pytest.fail(msg) # keep failure if this is intended; otherwise prefer skip
591-596
: New multi-instance test is a good coverage additionCovers the 2x2 instance case. Consider marking with a higher device/memory skip if it flakes in CI, but fine to start with 4-device skip.
1-5
: Replace top-of-file narrative comments with a proper module docstringTests benefit from a succinct docstring over free-form comments. Optional, but improves consistency.
-# I want to create accuracy tests for disaggregated serving. -# I need to to this by creating a new class that mimics LLM class. Instead of implementing the -# actual methods it will send OAI requests to the disaggregated serving endpoint. -# Please take a look at the existing test_llm_api_pytorch.py file for reference. +"""Accuracy tests for disaggregated serving using an OpenAI-compatible client. + +These tests launch per-instance context and generation servers plus a +disaggregated router, then evaluate standard accuracy suites (GSM8K, MMLU, etc.). +"""
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (7)
- tensorrt_llm/executor/proxy.py (1 hunks)
- tensorrt_llm/executor/worker.py (2 hunks)
- tests/integration/defs/accuracy/test_disaggregated_serving.py (6 hunks)
- tests/integration/test_lists/qa/llm_function_full.txt (4 hunks)
- tests/integration/test_lists/test-db/l0_dgx_h100.yml (1 hunks)
- tests/integration/test_lists/waives.txt (1 hunks)
- tests/unittest/llmapi/test_llm_pytorch.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
- tests/integration/test_lists/test-db/l0_dgx_h100.yml
- tensorrt_llm/executor/proxy.py
- tests/integration/test_lists/waives.txt
- tests/integration/test_lists/qa/llm_function_full.txt
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tensorrt_llm/executor/worker.py
tests/integration/defs/accuracy/test_disaggregated_serving.py
tests/unittest/llmapi/test_llm_pytorch.py
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
tensorrt_llm/executor/worker.py
tests/integration/defs/accuracy/test_disaggregated_serving.py
tests/unittest/llmapi/test_llm_pytorch.py
🧠 Learnings (1)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
PR: NVIDIA/TensorRT-LLM#6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/unittest/llmapi/test_llm_pytorch.py
🧬 Code Graph Analysis (2)
tensorrt_llm/executor/worker.py (2)
tensorrt_llm/executor/utils.py (1)
- put (119-120)
tensorrt_llm/executor/ipc.py (2)
- put (116-126)
- put (270-276)
tests/unittest/llmapi/test_llm_pytorch.py (2)
tensorrt_llm/executor/worker.py (1)
- GenerationExecutorWorker (50-626)
tensorrt_llm/llmapi/llm.py (1)
- LLM (1090-1106)
🪛 Ruff (0.12.2)
tests/integration/defs/accuracy/test_disaggregated_serving.py
290-290: Line too long (209 > 120) (E501)
tests/unittest/llmapi/test_llm_pytorch.py
856-856: Local variable llm is assigned to but never used; remove the assignment to the unused variable llm (F841)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (9)
tensorrt_llm/executor/worker.py (2)
792-797
: 2-tuple init-status protocol change looks good; verify consumer side strictly unpacks (signal, error_trace)On success you now send (ready_signal, None). Ensure the proxy/consumer exclusively relies on the tuple form and never expects a bare object, or mixed producers. Also confirm ready_signal is set to a constant (e.g., "READY") and compared by value, not truthiness, to avoid ambiguity if None is used in the error case.
Would you like me to add a unit test that mocks worker_init_status_queue.get() to validate the new protocol and error logging path end-to-end?
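A hedged sketch of the consumer-side check being asked for; the names (READY_SIGNAL, the status queue, the MPI session) are placeholders, not the actual proxy API:
READY_SIGNAL = "READY"

def wait_for_worker_ready(worker_init_status_queue, mpi_session, logger):
    signal, error_trace = worker_init_status_queue.get()  # always a 2-tuple now
    if signal == READY_SIGNAL:  # compare by value, never by truthiness
        return
    logger.error(error_trace or "worker init failed without a traceback")
    mpi_session.abort()
    raise RuntimeError("Executor worker initialization failed")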
731-734
: Typing uses PEP 585 built-in generics; ensure Py3.8 compatibility or gate with future annotationsThis line (and others like dict[...]) use built-in generics (tuple[str, ...], dict[...]) which require Python 3.9+ unless you enable postponed evaluation. Given the repo targets Python 3.8+, either:
- add
from __future__ import annotations
at the top of the file, or- switch to typing.Tuple / typing.Dict.
Apply one of:
+from __future__ import annotations
or:
- proxy_result_queue: tuple[str, Optional[bytes]] = worker_queues.result_queue_addr + proxy_result_queue: Tuple[str, Optional[bytes]] = worker_queues.result_queue_addrIf the project minimum is actually 3.9+, please ignore.
tests/unittest/llmapi/test_llm_pytorch.py (1)
1-1
: Add future annotations for Py3.8 compatibility (built-in generics used elsewhere in this file)This test file uses list[int], etc. To keep compatibility with Python 3.8, add postponed evaluation.
Apply:
+from __future__ import annotations
If CI runs on Python ≥3.9 exclusively, this is optional.
tests/integration/defs/accuracy/test_disaggregated_serving.py (6)
181-199
: Nice use ofExitStack
for multi-process lifecycle
multi_popen
correctly composes multiple process contexts and ensures cleanup on exceptions. This prevents orphaned processes. LGTM.
271-273
: CorrectDuckLLM
construction (args, tokenizer, generate_async)This fixes the earlier mismatch with the namedtuple definition. Good catch.
315-317
: Port series generation is clear and deterministicEven/odd split for ctx/gen avoids collisions under the same host. Looks good for test scaffolding.
323-329
:disaggregated_server_config
structure now matches multi-instance orchestrationThe config wiring with
num_instances
andurls
aligns with the new launcher. LGTM.
577-579
: Updated call site to the newrun_parallel_test
signaturePasses
ctx_instances
andgen_instances
as positional arguments. Good.
580-590
: Asymmetric test updated correctly; consistent with new signatureNamed args + instances = 1 keep prior behavior. LGTM.
PR_Github #15964 [ run ] triggered by Bot
PR_Github #15964 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #16033 [ run ] triggered by Bot
PR_Github #16033 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #16062 [ run ] triggered by Bot
PR_Github #16062 [ run ] completed with state
Summary by CodeRabbit
New Features
Bug Fixes
Tests
Chores
Description
Test Coverage
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.
--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.
For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.
kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request.
--comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
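As a usage illustration built only from the flags listed above: a typical invocation might be /bot run --disable-fail-fast --gpu-type "H100_PCIe" to run the pre-merge pipeline on H100 PCIe stages without fail-fast, or /bot reuse-pipeline to revalidate the current commit against the previous pipeline.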