[None][opt] ADP schedule balance optimization #6061
Conversation
/bot --disable-fail-fast
GitHub Bot Help
Provide a user friendly way for developers to interact with a Jenkins server. See details below for each supported subcommand.
run
Launch build/test pipelines. All previously running jobs will be killed.
kill
Kill all running builds associated with pull request.
skip
Skip testing for latest commit on pull request.
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
cdf234a to 8c0a5fb (Compare)
/bot --disable-fail-fast
/bot run --disable-fail-fast
PR_Github #11998 [ run ] triggered by Bot
8c0a5fb to 01303b3 (Compare)
PR_Github #11998 [ run ] completed with state
01303b3 to 35da779 (Compare)
📝 Walkthrough
This change introduces new configuration options and balancing logic for attention data parallelism (attention_dp) in the PyTorch backend and API layers. It adds new configuration fields, validation, and propagation mechanisms for attention_dp balancing, implements request balancing logic in the PyExecutor scheduler, updates CLI and API interfaces, and adds corresponding integration and API stability tests.
Sequence Diagram(s)
sequenceDiagram
participant User
participant CLI/Example
participant TorchLlmArgs
participant PyTorchConfig
participant PyExecutor
User->>CLI/Example: Provide attention_dp_* CLI args
CLI/Example->>TorchLlmArgs: Construct with AttentionDpConfig
TorchLlmArgs->>PyTorchConfig: get_pytorch_backend_config()
PyTorchConfig->>PyExecutor: Pass config params
PyExecutor->>PyExecutor: _balance_adp_requests() during scheduling
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~15–25 minutes
Possibly related PRs
Suggested reviewers
📜 Recent review details
Configuration used: .coderabbit.yaml
📒 Files selected for processing (9)
🚧 Files skipped from review as they are similar to previous changes (8)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Files:
**/*.{cpp,h,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Files:
🪛 Ruff (0.12.2)
tensorrt_llm/llmapi/llm_args.py
2008-2008: Line too long (138 > 120) (E501)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
🔇 Additional comments (3)
Actionable comments posted: 3
🧹 Nitpick comments (2)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
1129-1169: Consider refactoring complex scheduling logic for better maintainability.
The attention DP scheduling logic is complex and would benefit from being broken down into smaller, well-documented helper methods.
Consider extracting the statistics gathering and decision logic into separate methods:
def _gather_attention_dp_stats(self, context_requests, scheduler_output):
    """Gather scheduling statistics across all tensor parallel ranks."""
    num_scheduled_context_requests = len(context_requests)
    num_scheduled_generation_requests = len(scheduler_output.generation_requests)
    responses_list = self.dist.tp_allgather([
        num_scheduled_context_requests, num_scheduled_generation_requests
    ])
    return {
        'all_ranks_num_scheduled_context_requests': [r[0] for r in responses_list],
        'all_ranks_num_scheduled_generation_requests': [r[1] for r in responses_list]
    }

def _should_wait_for_context_balance(self, stats):
    """Determine if we should wait for better context request balance across ranks."""
    all_ranks_have_free_ctx_slots = all([
        num_gen < self.max_batch_size
        for num_gen in stats['all_ranks_num_scheduled_generation_requests']
    ])
    all_ranks_have_ctx_requests = all([
        num_ctx > 0
        for num_ctx in stats['all_ranks_num_scheduled_context_requests']
    ])
    return all_ranks_have_free_ctx_slots and all_ranks_have_ctx_requests

This would make the main scheduling logic much more readable and easier to test.
tensorrt_llm/llmapi/llm_args.py (1)
92-95: Fix docstring formatting to comply with style guidelines.
The docstring should be formatted as a single line since it's short and simple.
- """ - Configuration for attention DP. - """ + """Configuration for attention DP."""
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
tensorrt_llm/_torch/pyexecutor/config.py (1 hunks)
tensorrt_llm/_torch/pyexecutor/py_executor.py (3 hunks)
tensorrt_llm/llmapi/llm_args.py (5 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
tensorrt_llm/_torch/pyexecutor/config.py
tensorrt_llm/_torch/pyexecutor/py_executor.py
tensorrt_llm/llmapi/llm_args.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
tensorrt_llm/_torch/pyexecutor/config.py
tensorrt_llm/_torch/pyexecutor/py_executor.py
tensorrt_llm/llmapi/llm_args.py
🧠 Learnings (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/py_executor.py
1147-1147: Local variable all_ranks_num_scheduled_tokens
is assigned to but never used
Remove assignment to unused variable all_ranks_num_scheduled_tokens
(F841)
1155-1155: Local variable all_ranks_have_multi_gen
is assigned to but never used
Remove assignment to unused variable all_ranks_have_multi_gen
(F841)
1183-1183: Line too long (132 > 120)
(E501)
tensorrt_llm/llmapi/llm_args.py
92-93: One-line docstring should fit on one line
Reformat to one line
(D200)
🔇 Additional comments (5)
tensorrt_llm/_torch/pyexecutor/config.py (1)
49-51: LGTM! Well-structured configuration additions.
The new attention DP configuration fields are properly typed, follow Python naming conventions, and have sensible default values. The implementation integrates well with the existing configuration structure.
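For readers skimming the thread, the shape of such backend-config fields is roughly as below. This is a standalone, hypothetical sketch rather than the actual config.py, and the default values are placeholders.

```python
from dataclasses import dataclass


@dataclass
class PyTorchBackendConfigSketch:
    """Hypothetical stand-in for the backend config; names mirror the CLI flags in this PR."""
    attention_dp_enable_balance: bool = False   # placeholder default
    attention_dp_time_out_iters: int = 50       # placeholder default
    attention_dp_batching_wait_iters: int = 10  # placeholder default
```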
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
179-181
: LGTM! Configuration extraction follows existing patterns.The new attention DP configuration fields are properly extracted from the backend configuration and assigned to instance variables, following the established pattern in the constructor.
tensorrt_llm/llmapi/llm_args.py (3)
1837-1840
: LGTM!The field definition follows the established pattern for optional configuration fields and is properly typed and documented.
2175-2182
: LGTM!The integration of attention DP configuration parameters follows the established pattern used for other config parameters in this method, with proper null-checking and default value handling.
2199-2199
: LGTM!The addition to the field mapping dictionary correctly enables handling of
attention_dp_config
in the argument update logic.
tensorrt_llm/llmapi/llm_args.py
Outdated
@model_validator(mode='after')
def validate_attention_dp_config(self) -> 'TorchLlmArgs':
    """Validate attention DP configuration.

    Ensures that:
    1. If attention_dp_config.enable_balance is true, attention_dp_config.batching_wait_iters must be greater than 0
    2. If attention_dp_config.enable_balance is true, attention_dp_config.timeout_iters must be greater than 0
    """
    if self.attention_dp_config is None:
        return self

    config = self.attention_dp_config
    if config.enable_balance:
        if config.batching_wait_iters < 0:
            raise ValueError(
                "attention_dp_config.batching_wait_iters must be greater than 0 when enable_balance is true"
            )
        if config.timeout_iters < 0:
            raise ValueError(
                "attention_dp_config.timeout_iters must be greater than 0 when enable_balance is true"
            )
    return self
Fix validation logic inconsistency.
The validation conditions check < 0
but the error messages state "must be greater than 0". This creates a logical inconsistency where a value of 0 would pass validation despite the error message suggesting it shouldn't be allowed.
For timeout and batching wait iterations in a balancing context, 0 values typically don't make sense. Apply this fix:
- if config.batching_wait_iters < 0:
+ if config.batching_wait_iters <= 0:
raise ValueError(
- "attention_dp_config.batching_wait_iters must be greater than 0 when enable_balance is true"
+ "attention_dp_config.batching_wait_iters must be greater than 0 when enable_balance is true"
)
- if config.timeout_iters < 0:
+ if config.timeout_iters <= 0:
raise ValueError(
- "attention_dp_config.timeout_iters must be greater than 0 when enable_balance is true"
+ "attention_dp_config.timeout_iters must be greater than 0 when enable_balance is true"
)
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
@model_validator(mode='after')
def validate_attention_dp_config(self) -> 'TorchLlmArgs':
    """Validate attention DP configuration.

    Ensures that:
    1. If attention_dp_config.enable_balance is true, attention_dp_config.batching_wait_iters must be greater than 0
    2. If attention_dp_config.enable_balance is true, attention_dp_config.timeout_iters must be greater than 0
    """
    if self.attention_dp_config is None:
        return self

    config = self.attention_dp_config
    if config.enable_balance:
        if config.batching_wait_iters <= 0:
            raise ValueError(
                "attention_dp_config.batching_wait_iters must be greater than 0 when enable_balance is true"
            )
        if config.timeout_iters <= 0:
            raise ValueError(
                "attention_dp_config.timeout_iters must be greater than 0 when enable_balance is true"
            )
    return self
🤖 Prompt for AI Agents
In tensorrt_llm/llmapi/llm_args.py around lines 2100 to 2122, the validation
checks for batching_wait_iters and timeout_iters use '< 0' but the error
messages say these values must be greater than 0, causing inconsistency. Update
the conditions to check for '<= 0' instead of '< 0' to ensure zero values are
also rejected, aligning the logic with the error messages.
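To make the suggested behavior concrete, here is a minimal, self-contained pydantic sketch (not the actual TorchLlmArgs source; field defaults are placeholders) showing that a zero value is rejected once the comparison uses <= 0:

```python
from typing import Optional

from pydantic import BaseModel, model_validator


class AttentionDpConfig(BaseModel):
    """Configuration for attention DP (field names follow the reviewed code; defaults are placeholders)."""
    enable_balance: bool = False
    timeout_iters: int = 50
    batching_wait_iters: int = 10


class ArgsSketch(BaseModel):
    attention_dp_config: Optional[AttentionDpConfig] = None

    @model_validator(mode='after')
    def validate_attention_dp_config(self) -> 'ArgsSketch':
        if self.attention_dp_config is None:
            return self
        config = self.attention_dp_config
        if config.enable_balance:
            # '<= 0' (rather than '< 0') rejects zero, matching the error message.
            if config.batching_wait_iters <= 0:
                raise ValueError(
                    "attention_dp_config.batching_wait_iters must be greater than 0 when enable_balance is true")
            if config.timeout_iters <= 0:
                raise ValueError(
                    "attention_dp_config.timeout_iters must be greater than 0 when enable_balance is true")
        return self


# With enable_balance=True, a zero timeout now fails validation instead of slipping through:
# ArgsSketch(attention_dp_config=AttentionDpConfig(enable_balance=True, timeout_iters=0))  # validation error
```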
35da779 to dbcd7d2 (Compare)
Actionable comments posted: 0
♻️ Duplicate comments (3)
tensorrt_llm/_torch/pyexecutor/py_executor.py (2)
220-222: Fix syntax error with duplicate self. prefix.
Line 221 has a duplicate self. prefix that will cause a runtime error. Apply this diff to fix it:
- self.self.adp_ctx_waiting_iters_count = 0
+ self.adp_ctx_waiting_iters_count = 0
1097-1156: Fix multiple syntax errors and address code quality issues.
The attention DP scheduling logic has several critical issues that need to be addressed:
- Syntax errors with duplicate self. prefixes on lines 1141, 1150, 1152-1153
- Unused variables identified by static analysis: all_ranks_num_scheduled_tokens and all_ranks_have_multi_gen
- Line length violation on line 1152
Apply this diff to fix the syntax errors:
- if all_ranks_have_free_ctx_slots and all_ranks_have_ctx_requests:
-     self.self.adp_ctx_waiting_iters_count = 0
+ if all_ranks_have_free_ctx_slots and all_ranks_have_ctx_requests:
+     self.adp_ctx_waiting_iters_count = 0
- else:
-     self.self.adp_ctx_waiting_iters_count += 1
-     context_requests = []
-     if self.self.adp_ctx_waiting_iters_count >= self.attention_dp_time_out_iters or not all_ranks_have_gen_requests:
-         self.self.adp_ctx_waiting_iters_count = 0
+ else:
+     self.adp_ctx_waiting_iters_count += 1
+     context_requests = []
+     if self.adp_ctx_waiting_iters_count >= self.attention_dp_time_out_iters or not all_ranks_have_gen_requests:
+         self.adp_ctx_waiting_iters_count = 0

Consider removing the unused variables identified by static analysis:
- all_ranks_num_scheduled_tokens = [
-     response[2] for response in responses_list
- ]
-
- all_ranks_have_multi_gen = all([
-     num_gen > 1
-     for num_gen in all_ranks_num_scheduled_generation_requests
- ])

Also consider breaking the long line 1152 for better readability:
- if self.adp_ctx_waiting_iters_count >= self.attention_dp_time_out_iters or not all_ranks_have_gen_requests:
+ timeout_reached = self.adp_ctx_waiting_iters_count >= self.attention_dp_time_out_iters
+ if timeout_reached or not all_ranks_have_gen_requests:

tensorrt_llm/llmapi/llm_args.py (1)
2198-2220: Fix validation logic inconsistency.
The validation conditions check < 0 but the error messages state "must be greater than 0". This creates a logical inconsistency where a value of 0 would pass validation despite the error message suggesting it shouldn't be allowed.
For timeout and batching wait iterations in a balancing context, 0 values typically don't make sense. Apply this fix:
- if config.batching_wait_iters < 0:
+ if config.batching_wait_iters <= 0:
      raise ValueError(
          "attention_dp_config.batching_wait_iters must be greater than 0 when enable_balance is true"
      )
- if config.timeout_iters < 0:
+ if config.timeout_iters <= 0:
      raise ValueError(
          "attention_dp_config.timeout_iters must be greater than 0 when enable_balance is true"
      )
🧹 Nitpick comments (2)
tensorrt_llm/llmapi/llm_args.py (2)
123-138: Well-structured configuration class with minor docstring improvement needed.
The AttentionDpConfig class follows established patterns in the codebase and provides clear field definitions with appropriate defaults. Consider reformatting the docstring to one line as suggested by static analysis:
- """
- Configuration for attention DP.
- """
+ """Configuration for attention DP."""
1896-1900: Field definition is correct with minor formatting improvement needed.
The attention_dp_config field is properly defined as an optional configuration parameter. Consider breaking the long description line to comply with the 120-character limit:
  attention_dp_config: Optional[AttentionDpConfig] = Field(
      default=None,
-     description=
-     "Attention DP config. If true, use attention DP optimized scheduler.")
+     description="Attention DP config. If true, use attention DP optimized scheduler.")
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
tensorrt_llm/_torch/pyexecutor/config.py (1 hunks)
tensorrt_llm/_torch/pyexecutor/py_executor.py (3 hunks)
tensorrt_llm/llmapi/llm_args.py (5 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tensorrt_llm/_torch/pyexecutor/config.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: The code developed for TensorRT-LLM should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile = ...).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL = ...).
Python constants should use upper snake_case (e.g., MY_CONSTANT = ...).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a file, prefer docstrings over comments in Python.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
tensorrt_llm/_torch/pyexecutor/py_executor.py
tensorrt_llm/llmapi/llm_args.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
tensorrt_llm/_torch/pyexecutor/py_executor.py
tensorrt_llm/llmapi/llm_args.py
🧠 Learnings (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/py_executor.py
1116-1116: Local variable all_ranks_num_scheduled_tokens
is assigned to but never used
Remove assignment to unused variable all_ranks_num_scheduled_tokens
(F841)
1124-1124: Local variable all_ranks_have_multi_gen
is assigned to but never used
Remove assignment to unused variable all_ranks_have_multi_gen
(F841)
1152-1152: Line too long (132 > 120)
(E501)
tensorrt_llm/llmapi/llm_args.py
138-139: One-line docstring should fit on one line
Reformat to one line
(D200)
1898-1898: Line too long (138 > 120)
(E501)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (3)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
179-181
: LGTM: Configuration fields properly initialized.The new attention DP configuration fields are correctly accessed from the PyTorch backend config and will provide the necessary parameters for the scheduling logic.
tensorrt_llm/llmapi/llm_args.py (2)
2273-2280
: LGTM! Proper integration of attention DP configuration.The method correctly propagates attention DP configuration parameters to the PyTorch backend config, following established patterns with appropriate fallback to defaults when the config is not provided.
2297-2297
: LGTM! Proper field mapping for attention DP configuration.
The addition of "attention_dp_config": AttentionDpConfig to the field mapping enables correct instantiation from dictionary format, consistent with other configuration fields.
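As a rough, standalone illustration of what such a name-to-class mapping buys during dict-based updates (a toy example, not the actual llm_args.py logic; the dataclass and its defaults are placeholders):

```python
from dataclasses import dataclass


@dataclass
class AttentionDpConfig:
    enable_balance: bool = False
    timeout_iters: int = 50        # placeholder default
    batching_wait_iters: int = 10  # placeholder default


# Maps an argument name to the config class used to instantiate nested dict values.
FIELD_MAPPING = {"attention_dp_config": AttentionDpConfig}


def coerce_config_fields(args: dict) -> dict:
    """Replace plain dicts with typed config objects before updating the args object."""
    for name, cls in FIELD_MAPPING.items():
        value = args.get(name)
        if isinstance(value, dict):
            args[name] = cls(**value)
    return args


print(coerce_config_fields({"attention_dp_config": {"enable_balance": True, "timeout_iters": 20}}))
# {'attention_dp_config': AttentionDpConfig(enable_balance=True, timeout_iters=20, batching_wait_iters=10)}
```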
dbcd7d2 to 3a49091 (Compare)
/bot run --disable-fail-fast
Actionable comments posted: 0
♻️ Duplicate comments (1)
tensorrt_llm/llmapi/llm_args.py (1)
2112-2119: Fix validation logic inconsistency.
The validation conditions check < 0 but the error messages state "must be greater than 0". This creates a logical inconsistency where a value of 0 would pass validation despite the error message suggesting it shouldn't be allowed.
For timeout and batching wait iterations in a balancing context, 0 values typically don't make sense. Apply this fix:
- if config.batching_wait_iters < 0:
+ if config.batching_wait_iters <= 0:
      raise ValueError(
          "attention_dp_config.batching_wait_iters must be greater than 0 when enable_balance is true"
      )
- if config.timeout_iters < 0:
+ if config.timeout_iters <= 0:
      raise ValueError(
          "attention_dp_config.timeout_iters must be greater than 0 when enable_balance is true"
      )
🧹 Nitpick comments (4)
tensorrt_llm/_torch/pyexecutor/py_executor.py (3)
1147-1149: Remove unused variable to improve code quality.
The all_ranks_num_scheduled_tokens variable is computed but never used in the scheduling logic.
- all_ranks_num_scheduled_tokens = [
-     response[2] for response in responses_list
- ]
1155-1158: Remove unused variable to improve code quality.
The all_ranks_have_multi_gen variable is computed but never used in the scheduling logic, making it dead code.
- all_ranks_have_multi_gen = all([
-     num_gen > 1
-     for num_gen in all_ranks_num_scheduled_generation_requests
- ])
1169-1186: Address line length violation and verify logical correctness.
The attention DP balancing logic implements a two-phase waiting mechanism but has a line length violation and should be verified for correctness.
Line length issue:
- if self.adp_ctx_waiting_iters_count >= self.attention_dp_time_out_iters or not all_ranks_have_gen_requests:
+ timeout_reached = self.adp_ctx_waiting_iters_count >= self.attention_dp_time_out_iters
+ if timeout_reached or not all_ranks_have_gen_requests:

Verification needed: The logic appears to implement a complex waiting strategy where:
- If all ranks have free slots and context requests → reset waiting, possibly delay further if all have gen requests
- Otherwise → increment waiting counter, block context requests unless timeout or no gen requests
Please verify this implements the intended ADP scheduling behavior, particularly the interaction between
adp_ctx_waiting_iters_count
andadp_ctx_batching_wait_iters_count
counters.tensorrt_llm/llmapi/llm_args.py (1)
92-95: Fix docstring formatting.
The docstring should be formatted as a one-line docstring since it's short and simple.
- """
- Configuration for attention DP.
- """
+ """Configuration for attention DP."""
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
examples/llm-api/quickstart_advanced.py
(4 hunks)tensorrt_llm/_torch/pyexecutor/config.py
(1 hunks)tensorrt_llm/_torch/pyexecutor/py_executor.py
(3 hunks)tensorrt_llm/llmapi/__init__.py
(2 hunks)tensorrt_llm/llmapi/llm_args.py
(5 hunks)tests/integration/defs/test_e2e.py
(1 hunks)
✅ Files skipped from review due to trivial changes (1)
- tensorrt_llm/llmapi/__init__.py
🚧 Files skipped from review as they are similar to previous changes (1)
- tensorrt_llm/_torch/pyexecutor/config.py
🧰 Additional context used
🧠 Learnings (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/py_executor.py
1147-1147: Local variable all_ranks_num_scheduled_tokens
is assigned to but never used
Remove assignment to unused variable all_ranks_num_scheduled_tokens
(F841)
1155-1155: Local variable all_ranks_have_multi_gen
is assigned to but never used
Remove assignment to unused variable all_ranks_have_multi_gen
(F841)
1183-1183: Line too long (127 > 120)
(E501)
tensorrt_llm/llmapi/llm_args.py
92-93: One-line docstring should fit on one line
Reformat to one line
(D200)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (11)
tests/integration/defs/test_e2e.py (1)
1776-1813: LGTM! Well-structured integration test for ADP balancing feature.
The test function properly exercises the new attention data parallelism balancing functionality by:
- Using appropriate pytest decorators for hardware requirements
- Configuring the necessary command-line arguments for ADP balancing (--attention_dp_enable_balance, --attention_dp_time_out_iters, --attention_dp_batching_wait_iters)
- Including proper memory usage verification for 8 GPUs
- Following the established testing pattern in this file
examples/llm-api/quickstart_advanced.py (4)
4-7
: LGTM! Correct import addition for AttentionDpConfig.The import statement properly includes the new
AttentionDpConfig
class alongside other configuration classes.
57-63: LGTM! Well-defined command-line arguments for ADP balancing.
The new arguments follow the established naming convention and have appropriate types and defaults:
- --attention_dp_enable_balance: Boolean flag for enabling balancing
- --attention_dp_time_out_iters: Integer timeout iterations
- --attention_dp_batching_wait_iters: Integer batching wait iterations
199-203: LGTM! Proper AttentionDpConfig instance creation.
The configuration object is correctly created using the command-line arguments, following the same pattern as other configuration objects in the file.
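Putting the flag definitions and the constructor comment below together, the end-to-end flow is roughly as follows. This is a condensed, hypothetical sketch rather than the actual quickstart_advanced.py; --model_dir and the default values are placeholders.

```python
import argparse

from tensorrt_llm.llmapi import LLM, AttentionDpConfig  # AttentionDpConfig export assumed from this PR

parser = argparse.ArgumentParser()
parser.add_argument("--model_dir", type=str, required=True)  # placeholder for the model argument
parser.add_argument("--attention_dp_enable_balance", action="store_true")
parser.add_argument("--attention_dp_time_out_iters", type=int, default=50)
parser.add_argument("--attention_dp_batching_wait_iters", type=int, default=10)
args = parser.parse_args()

# Build the config from the parsed CLI arguments and hand it to the LLM constructor.
attention_dp_config = AttentionDpConfig(
    enable_balance=args.attention_dp_enable_balance,
    timeout_iters=args.attention_dp_time_out_iters,
    batching_wait_iters=args.attention_dp_batching_wait_iters,
)
llm = LLM(model=args.model_dir, attention_dp_config=attention_dp_config)
```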
227-227
: LGTM! Correct integration of attention_dp_config into LLM constructor.The
attention_dp_config
parameter is properly passed to the LLM constructor, maintaining consistency with other configuration parameters.tensorrt_llm/_torch/pyexecutor/py_executor.py (3)
179-181
: LGTM: Configuration parameters correctly extracted from backend config.The attention DP configuration parameters are properly extracted from the model engine's PyTorch backend configuration and stored as instance variables for use in scheduling logic.
220-222
: LGTM: Counter variables properly initialized.The ADP context waiting and batching wait iteration counters are correctly initialized to zero, providing clean starting state for the scheduling algorithm.
1187-1187
: LGTM: Context requests properly assigned after balancing logic.The processed
context_requests
(which may be filtered by the ADP balancing logic) are correctly assigned to the scheduled requests structure.tensorrt_llm/llmapi/llm_args.py (3)
1836-1839
: LGTM!The field addition follows the established pattern for optional configuration fields in the class.
2172-2181
: LGTM!The configuration propagation logic correctly handles both the presence/absence of the config and properly falls back to default values. The boolean logic for
attention_dp_enable_balance
appropriately checks both conditions.
2198-2198
: LGTM!The field mapping addition follows the established pattern and enables proper handling of
attention_dp_config
in dictionary-based updates.
PR_Github #13198 [ run ] triggered by Bot
PR_Github #13198 [ run ] completed with state
/bot run
PR_Github #14014 [ run ] triggered by Bot
PR_Github #14014 [ run ] completed with state
/bot run
PR_Github #14048 [ run ] triggered by Bot
PR_Github #14048 [ run ] completed with state
/bot run
PR_Github #14102 [ run ] triggered by Bot
PR_Github #14102 [ run ] completed with state
/bot run
PR_Github #14126 [ run ] triggered by Bot
Signed-off-by: yunruis <[email protected]>
Head branch was pushed to by a user without write access
fbc7682 to 7ba91cb (Compare)
/bot run
PR_Github #14136 [ run ] triggered by Bot
PR_Github #14126 [ run ] completed with state
PR_Github #14136 [ run ] completed with state
/bot run
PR_Github #14157 [ run ] triggered by Bot
PR_Github #14157 [ run ] completed with state
@lucaslie @NVIDIA/trt-llm-torch-autodeploy-devs If you have any concern/comment/suggestion, please let us know and we can submit another PR. Is that OK for you? Thanks
Signed-off-by: yunruis <[email protected]> Signed-off-by: Lanyu Liao <[email protected]>
Signed-off-by: yunruis <[email protected]>
This PR introduces scheduler optimizations to improve load balancing across ranks during the attention phase, aiming to evenly distribute both the num_requests and the num_tokens among ranks.

Evaluated on a dataset with an input sequence length (ISL) of 803 and output sequence length (OSL) of 3653, the optimization delivers a 45% improvement in throughput.
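As a rough illustration of the per-iteration decision this adds to the scheduler, here is a standalone toy model pieced together from the review discussion above (not the PyExecutor implementation; names and thresholds are assumptions):

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AdpContextGate:
    """Toy gate deciding whether this rank schedules its context requests in the current iteration."""
    max_batch_size: int
    timeout_iters: int
    waiting_iters: int = 0

    def admit_context(self, all_ranks_num_ctx: List[int], all_ranks_num_gen: List[int]) -> bool:
        # Inputs mimic allgathered per-rank counts of scheduled context/generation requests.
        all_ranks_have_free_slots = all(n < self.max_batch_size for n in all_ranks_num_gen)
        all_ranks_have_ctx = all(n > 0 for n in all_ranks_num_ctx)
        all_ranks_have_gen = all(n > 0 for n in all_ranks_num_gen)
        if all_ranks_have_free_slots and all_ranks_have_ctx:
            # Every rank has context work and room for it: schedule now and reset the wait counter.
            self.waiting_iters = 0
            return True
        # Otherwise hold context requests back to stay balanced, unless we have waited
        # too long or there is no generation work to overlap with.
        self.waiting_iters += 1
        if self.waiting_iters >= self.timeout_iters or not all_ranks_have_gen:
            self.waiting_iters = 0
            return True
        return False


gate = AdpContextGate(max_batch_size=8, timeout_iters=50)
print(gate.admit_context(all_ranks_num_ctx=[2, 0, 1, 3], all_ranks_num_gen=[4, 5, 4, 6]))  # False: rank 1 has no ctx
```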
How to enable: For trtllm-bench, add the following configuration to extra_config.yaml.
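The exact snippet is not reproduced here; its shape is roughly as below, where the keys follow the AttentionDpConfig fields discussed in this PR (enable_balance, timeout_iters, batching_wait_iters) and the values are illustrative only.

```yaml
attention_dp_config:
  enable_balance: true
  timeout_iters: 50
  batching_wait_iters: 10
```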
Limitations:
There may be a slight increase in TPOT (Time Per Output Token) and TTFT (Time To First Token). The parameters can be adjusted based on workload patterns and SLO requirements.
Summary by CodeRabbit
New Features
Tests
Documentation