
Conversation

@yunruis (Contributor) commented Jul 15, 2025

This PR introduces scheduler optimizations that improve load balancing across ranks during the attention phase, aiming to distribute both num_requests and num_tokens evenly among the ranks.

Evaluated on a dataset with an input sequence length (ISL) of 803 and output sequence length (OSL) of 3653, the optimization delivers a 45% improvement in throughput.

How to enable: for trtllm-bench, add the following configuration to extra_config.yaml:

attention_dp_config:
    enable_balance: true
    batching_wait_iters: 50
    timeout_iters: 10
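
The same settings can also be passed through the Python LLM API. A minimal sketch, assuming the AttentionDpConfig class exported from tensorrt_llm.llmapi and the new attention_dp_config argument on LLM added by this PR (the model path is a placeholder):

    from tensorrt_llm.llmapi import LLM, AttentionDpConfig

    # Mirrors the YAML above. Attention DP itself still needs to be enabled the
    # usual way for your deployment; this config only tunes the balancing.
    adp_config = AttentionDpConfig(
        enable_balance=True,
        batching_wait_iters=50,
        timeout_iters=10,
    )

    llm = LLM(
        model="<model path>",  # placeholder
        attention_dp_config=adp_config,
    )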

Limitations:

There may be a slight increase in TPOT (Time Per Output Token) and TTFT (Time To First Token). The parameters can be adjusted based on workload patterns and SLO requirements.

Summary by CodeRabbit

  • New Features

    • Added configurable options for balancing attention data parallelism, including enable/disable flags and tuning of timeout and batching wait iterations.
    • Introduced new command-line arguments and configuration fields to control attention data parallelism balancing.
    • Added support for validating and propagating attention data parallelism configuration throughout the API and backend.
    • Exported the new AttentionDpConfig configuration class for user access.
  • Tests

    • Added an integration test to verify attention data parallelism balancing with specific configurations.
  • Documentation

    • Updated API stability references to include the new attention data parallelism configuration parameter.

@yunruis requested review from a team as code owners on July 15, 2025 14:38
@yunruis (Contributor, Author) commented Jul 15, 2025

/bot --disable-fail-fast


GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--disable-fail-fast --skip-test --stage-list "A10-1, xxx" --gpu-type "A30, H100_PCIe" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-[Post-Merge]-1, xxx"]

Launch build/test pipelines. All previously running jobs will be killed.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests. Will also run L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-[Post-Merge]-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-[Post-Merge]-1, xxx".

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

@yunruis force-pushed the opt/adp_schedule_optimize branch from cdf234a to 8c0a5fb on July 15, 2025 14:49
@yunruis (Contributor, Author) commented Jul 15, 2025

/bot --disable-fail-fast


@yunruis (Contributor, Author) commented Jul 16, 2025

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #11998 [ run ] triggered by Bot

@yunruis force-pushed the opt/adp_schedule_optimize branch from 8c0a5fb to 01303b3 on July 16, 2025 02:56
@tensorrt-cicd (Collaborator)

PR_Github #11998 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #8908 completed with status: 'FAILURE'

@yunruis force-pushed the opt/adp_schedule_optimize branch from 01303b3 to 35da779 on July 28, 2025 05:53
@coderabbitai bot commented Jul 28, 2025

📝 Walkthrough

This change introduces new configuration options and balancing logic for attention data parallelism (attention_dp) in the PyTorch backend and API layers. It adds new configuration fields, validation, and propagation mechanisms for attention_dp balancing, implements request balancing logic in the PyExecutor scheduler, updates CLI and API interfaces, and adds corresponding integration and API stability tests.

Changes

  • Attention DP Config Propagation (tensorrt_llm/llmapi/llm_args.py, tensorrt_llm/llmapi/__init__.py, tests/unittest/api_stability/references/llm.yaml): Adds the AttentionDpConfig model, validation, and propagation through TorchLlmArgs and the API exports. Updates the API stability reference to include the new config parameter.
  • PyTorch Backend Config (tensorrt_llm/_torch/pyexecutor/config.py, tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py): Adds three new attention_dp balancing fields to the PyTorch config dataclass and backend config namespace.
  • PyExecutor Attention DP Balancing Logic (tensorrt_llm/_torch/pyexecutor/py_executor.py): Implements request balancing logic for attention_dp in PyExecutor, including new config parameters, counters, and a private balancing method integrated into the scheduling flow.
  • CLI and Example Update (examples/llm-api/quickstart_advanced.py): Adds CLI arguments for attention_dp balancing and passes the config into LLM setup.
  • Integration Test (tests/integration/defs/test_e2e.py, tests/integration/test_lists/test-db/l0_dgx_h100.yml): Adds a new integration test for attention_dp balancing and updates the test exclusion list for specific hardware.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant CLI/Example
    participant TorchLlmArgs
    participant PyTorchConfig
    participant PyExecutor

    User->>CLI/Example: Provide attention_dp_* CLI args
    CLI/Example->>TorchLlmArgs: Construct with AttentionDpConfig
    TorchLlmArgs->>PyTorchConfig: get_pytorch_backend_config()
    PyTorchConfig->>PyExecutor: Pass config params
    PyExecutor->>PyExecutor: _balance_adp_requests() during scheduling

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~15–25 minutes

Possibly related PRs

Suggested reviewers

  • litaotju
  • pcastonguay
  • HuiGao-NV
  • jinyangyuan-nvidia
  • Superjomn



📜 Recent review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fbc7682 and 7ba91cb.

📒 Files selected for processing (9)
  • examples/llm-api/quickstart_advanced.py (4 hunks)
  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/config.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py (3 hunks)
  • tensorrt_llm/llmapi/__init__.py (2 hunks)
  • tensorrt_llm/llmapi/llm_args.py (5 hunks)
  • tests/integration/defs/test_e2e.py (1 hunks)
  • tests/integration/test_lists/test-db/l0_dgx_h100.yml (1 hunks)
  • tests/unittest/api_stability/references/llm.yaml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (8)
  • tensorrt_llm/llmapi/__init__.py
  • tests/unittest/api_stability/references/llm.yaml
  • tests/integration/defs/test_e2e.py
  • tensorrt_llm/_torch/pyexecutor/config.py
  • examples/llm-api/quickstart_advanced.py
  • tests/integration/test_lists/test-db/l0_dgx_h100.yml
  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a file, prefer docstrings over comments in Python.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without reflection.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

  • tensorrt_llm/llmapi/llm_args.py
**/*.{cpp,h,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • tensorrt_llm/llmapi/llm_args.py
🪛 Ruff (0.12.2)
tensorrt_llm/llmapi/llm_args.py

2008-2008: Line too long (138 > 120)

(E501)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (3)
tensorrt_llm/llmapi/llm_args.py (3)

190-205: LGTM! Well-structured configuration class.

The AttentionDpConfig class follows established patterns with appropriate field definitions and default values for attention data parallel balancing.


2351-2360: LGTM! Proper parameter passing to PyTorchConfig.

The attention DP configuration parameters are correctly passed to the PyTorchConfig constructor with appropriate null handling and default value fallbacks.


2377-2377: LGTM! Proper field mapping addition.

The AttentionDpConfig is correctly added to the field mapping, enabling proper instantiation from external configuration dictionaries.



@coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (2)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

1129-1169: Consider refactoring complex scheduling logic for better maintainability.

The attention DP scheduling logic is complex and would benefit from being broken down into smaller, well-documented helper methods.

Consider extracting the statistics gathering and decision logic into separate methods:

def _gather_attention_dp_stats(self, context_requests, scheduler_output):
    """Gather scheduling statistics across all tensor parallel ranks."""
    num_scheduled_context_requests = len(context_requests)
    num_scheduled_generation_requests = len(scheduler_output.generation_requests)
    
    responses_list = self.dist.tp_allgather([
        num_scheduled_context_requests,
        num_scheduled_generation_requests
    ])
    
    return {
        'all_ranks_num_scheduled_context_requests': [r[0] for r in responses_list],
        'all_ranks_num_scheduled_generation_requests': [r[1] for r in responses_list]
    }

def _should_wait_for_context_balance(self, stats):
    """Determine if we should wait for better context request balance across ranks."""
    all_ranks_have_free_ctx_slots = all([
        num_gen < self.max_batch_size
        for num_gen in stats['all_ranks_num_scheduled_generation_requests']
    ])
    all_ranks_have_ctx_requests = all([
        num_ctx > 0
        for num_ctx in stats['all_ranks_num_scheduled_context_requests']
    ])
    
    return all_ranks_have_free_ctx_slots and all_ranks_have_ctx_requests

This would make the main scheduling logic much more readable and easier to test.

tensorrt_llm/llmapi/llm_args.py (1)

92-95: Fix docstring formatting to comply with style guidelines.

The docstring should be formatted as a single line since it's short and simple.

-    """
-    Configuration for attention DP.
-    """
+    """Configuration for attention DP."""
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 97f7e12 and 35da779.

📒 Files selected for processing (3)
  • tensorrt_llm/_torch/pyexecutor/config.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py (3 hunks)
  • tensorrt_llm/llmapi/llm_args.py (5 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

  • tensorrt_llm/_torch/pyexecutor/config.py
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • tensorrt_llm/llmapi/llm_args.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • tensorrt_llm/_torch/pyexecutor/config.py
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • tensorrt_llm/llmapi/llm_args.py
🧠 Learnings (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/py_executor.py

1147-1147: Local variable all_ranks_num_scheduled_tokens is assigned to but never used

Remove assignment to unused variable all_ranks_num_scheduled_tokens

(F841)


1155-1155: Local variable all_ranks_have_multi_gen is assigned to but never used

Remove assignment to unused variable all_ranks_have_multi_gen

(F841)


1183-1183: Line too long (132 > 120)

(E501)

tensorrt_llm/llmapi/llm_args.py

92-93: One-line docstring should fit on one line

Reformat to one line

(D200)

🔇 Additional comments (5)
tensorrt_llm/_torch/pyexecutor/config.py (1)

49-51: LGTM! Well-structured configuration additions.

The new attention DP configuration fields are properly typed, follow Python naming conventions, and have sensible default values. The implementation integrates well with the existing configuration structure.

tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

179-181: LGTM! Configuration extraction follows existing patterns.

The new attention DP configuration fields are properly extracted from the backend configuration and assigned to instance variables, following the established pattern in the constructor.

tensorrt_llm/llmapi/llm_args.py (3)

1837-1840: LGTM!

The field definition follows the established pattern for optional configuration fields and is properly typed and documented.


2175-2182: LGTM!

The integration of attention DP configuration parameters follows the established pattern used for other config parameters in this method, with proper null-checking and default value handling.


2199-2199: LGTM!

The addition to the field mapping dictionary correctly enables handling of attention_dp_config in the argument update logic.

Comment on lines 2100 to 2300
    @model_validator(mode='after')
    def validate_attention_dp_config(self) -> 'TorchLlmArgs':
        """Validate attention DP configuration.

        Ensures that:
        1. If attention_dp_config.enable_balance is true, attention_dp_config.batching_wait_iters must be greater than 0
        2. If attention_dp_config.enable_balance is true, attention_dp_config.timeout_iters must be greater than 0
        """
        if self.attention_dp_config is None:
            return self

        config = self.attention_dp_config
        if config.enable_balance:
            if config.batching_wait_iters < 0:
                raise ValueError(
                    "attention_dp_config.batching_wait_iters must be greater than 0 when enable_balance is true"
                )
            if config.timeout_iters < 0:
                raise ValueError(
                    "attention_dp_config.timeout_iters must be greater than 0 when enable_balance is true"
                )
        return self


⚠️ Potential issue

Fix validation logic inconsistency.

The validation conditions check < 0 but the error messages state "must be greater than 0". This creates a logical inconsistency where a value of 0 would pass validation despite the error message suggesting it shouldn't be allowed.

For timeout and batching wait iterations in a balancing context, 0 values typically don't make sense. Apply this fix:

-            if config.batching_wait_iters < 0:
+            if config.batching_wait_iters <= 0:
                 raise ValueError(
-                    "attention_dp_config.batching_wait_iters must be greater than 0 when enable_balance is true"
+                    "attention_dp_config.batching_wait_iters must be greater than 0 when enable_balance is true"
                 )
-            if config.timeout_iters < 0:
+            if config.timeout_iters <= 0:
                 raise ValueError(
-                    "attention_dp_config.timeout_iters must be greater than 0 when enable_balance is true"
+                    "attention_dp_config.timeout_iters must be greater than 0 when enable_balance is true"
                 )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
    @model_validator(mode='after')
    def validate_attention_dp_config(self) -> 'TorchLlmArgs':
        """Validate attention DP configuration.

        Ensures that:
        1. If attention_dp_config.enable_balance is true, attention_dp_config.batching_wait_iters must be greater than 0
        2. If attention_dp_config.enable_balance is true, attention_dp_config.timeout_iters must be greater than 0
        """
        if self.attention_dp_config is None:
            return self
        config = self.attention_dp_config
        if config.enable_balance:
            if config.batching_wait_iters <= 0:
                raise ValueError(
                    "attention_dp_config.batching_wait_iters must be greater than 0 when enable_balance is true"
                )
            if config.timeout_iters <= 0:
                raise ValueError(
                    "attention_dp_config.timeout_iters must be greater than 0 when enable_balance is true"
                )
        return self
🤖 Prompt for AI Agents
In tensorrt_llm/llmapi/llm_args.py around lines 2100 to 2122, the validation
checks for batching_wait_iters and timeout_iters use '< 0' but the error
messages say these values must be greater than 0, causing inconsistency. Update
the conditions to check for '<= 0' instead of '< 0' to ensure zero values are
also rejected, aligning the logic with the error messages.

@yunruis force-pushed the opt/adp_schedule_optimize branch from 35da779 to dbcd7d2 on July 28, 2025 11:37
@coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (3)
tensorrt_llm/_torch/pyexecutor/py_executor.py (2)

220-222: Fix syntax error with duplicate self. prefix.

Line 221 has a syntax error with duplicate self. that will cause a runtime error.

Apply this diff to fix the syntax error:

-        self.self.adp_ctx_waiting_iters_count = 0
+        self.adp_ctx_waiting_iters_count = 0

1097-1156: Fix multiple syntax errors and address code quality issues.

The attention DP scheduling logic has several critical issues that need to be addressed:

  1. Syntax errors with duplicate self. prefixes on lines 1141, 1150, 1152-1153
  2. Unused variables identified by static analysis: all_ranks_num_scheduled_tokens and all_ranks_have_multi_gen
  3. Line length violation on line 1152

Apply this diff to fix the syntax errors:

-                if all_ranks_have_free_ctx_slots and all_ranks_have_ctx_requests:
-                    self.self.adp_ctx_waiting_iters_count = 0
+                if all_ranks_have_free_ctx_slots and all_ranks_have_ctx_requests:
+                    self.adp_ctx_waiting_iters_count = 0
-                else:
-                    self.self.adp_ctx_waiting_iters_count += 1
-                    context_requests = []
-                    if self.self.adp_ctx_waiting_iters_count >= self.attention_dp_time_out_iters or not all_ranks_have_gen_requests:
-                        self.self.adp_ctx_waiting_iters_count = 0
+                else:
+                    self.adp_ctx_waiting_iters_count += 1
+                    context_requests = []
+                    if self.adp_ctx_waiting_iters_count >= self.attention_dp_time_out_iters or not all_ranks_have_gen_requests:
+                        self.adp_ctx_waiting_iters_count = 0

Consider removing the unused variables identified by static analysis:

-            all_ranks_num_scheduled_tokens = [
-                response[2] for response in responses_list
-            ]
-
-            all_ranks_have_multi_gen = all([
-                num_gen > 1 
-                for num_gen in all_ranks_num_scheduled_generation_requests
-            ])

Also consider breaking the long line 1152 for better readability:

-                    if self.adp_ctx_waiting_iters_count >= self.attention_dp_time_out_iters or not all_ranks_have_gen_requests:
+                    timeout_reached = self.adp_ctx_waiting_iters_count >= self.attention_dp_time_out_iters
+                    if timeout_reached or not all_ranks_have_gen_requests:
tensorrt_llm/llmapi/llm_args.py (1)

2198-2220: Fix validation logic inconsistency.

The validation conditions check < 0 but the error messages state "must be greater than 0". This creates a logical inconsistency where a value of 0 would pass validation despite the error message suggesting it shouldn't be allowed.

For timeout and batching wait iterations in a balancing context, 0 values typically don't make sense. Apply this fix:

-            if config.batching_wait_iters < 0:
+            if config.batching_wait_iters <= 0:
                 raise ValueError(
                     "attention_dp_config.batching_wait_iters must be greater than 0 when enable_balance is true"
                 )
-            if config.timeout_iters < 0:
+            if config.timeout_iters <= 0:
                 raise ValueError(
                     "attention_dp_config.timeout_iters must be greater than 0 when enable_balance is true"
                 )
🧹 Nitpick comments (2)
tensorrt_llm/llmapi/llm_args.py (2)

123-138: Well-structured configuration class with minor docstring improvement needed.

The AttentionDpConfig class follows established patterns in the codebase and provides clear field definitions with appropriate defaults.

Consider reformatting the docstring to one line as suggested by static analysis:

-    """
-    Configuration for attention DP.
-    """
+    """Configuration for attention DP."""

1896-1900: Field definition is correct with minor formatting improvement needed.

The attention_dp_config field is properly defined as an optional configuration parameter.

Consider breaking the long description line to comply with the 120-character limit:

     attention_dp_config: Optional[AttentionDpConfig] = Field(
         default=None,
-        description=
-        "Attention DP config. If true, use attention DP optimized scheduler.")
+        description="Attention DP config. If true, use attention DP optimized scheduler.")
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 35da779 and dbcd7d2.

📒 Files selected for processing (3)
  • tensorrt_llm/_torch/pyexecutor/config.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py (3 hunks)
  • tensorrt_llm/llmapi/llm_args.py (5 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tensorrt_llm/_torch/pyexecutor/config.py
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile = ...).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL = ...).
Python constants should use upper snake_case (e.g., MY_CONSTANT = ...).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a file, prefer docstrings over comments in Python.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • tensorrt_llm/llmapi/llm_args.py
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • tensorrt_llm/llmapi/llm_args.py
🧠 Learnings (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/py_executor.py

1116-1116: Local variable all_ranks_num_scheduled_tokens is assigned to but never used

Remove assignment to unused variable all_ranks_num_scheduled_tokens

(F841)


1124-1124: Local variable all_ranks_have_multi_gen is assigned to but never used

Remove assignment to unused variable all_ranks_have_multi_gen

(F841)


1152-1152: Line too long (132 > 120)

(E501)

tensorrt_llm/llmapi/llm_args.py

138-139: One-line docstring should fit on one line

Reformat to one line

(D200)


1898-1898: Line too long (138 > 120)

(E501)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (3)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

179-181: LGTM: Configuration fields properly initialized.

The new attention DP configuration fields are correctly accessed from the PyTorch backend config and will provide the necessary parameters for the scheduling logic.

tensorrt_llm/llmapi/llm_args.py (2)

2273-2280: LGTM! Proper integration of attention DP configuration.

The method correctly propagates attention DP configuration parameters to the PyTorch backend config, following established patterns with appropriate fallback to defaults when the config is not provided.


2297-2297: LGTM! Proper field mapping for attention DP configuration.

The addition of "attention_dp_config": AttentionDpConfig to the field mapping enables correct instantiation from dictionary format, consistent with other configuration fields.

@yunruis (Contributor, Author) commented Jul 28, 2025

/bot run --disable-fail-fast

@coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
tensorrt_llm/llmapi/llm_args.py (1)

2112-2119: Fix validation logic inconsistency.

The validation conditions check < 0 but the error messages state "must be greater than 0". This creates a logical inconsistency where a value of 0 would pass validation despite the error message suggesting it shouldn't be allowed.

For timeout and batching wait iterations in a balancing context, 0 values typically don't make sense. Apply this fix:

-            if config.batching_wait_iters < 0:
+            if config.batching_wait_iters <= 0:
                 raise ValueError(
                     "attention_dp_config.batching_wait_iters must be greater than 0 when enable_balance is true"
                 )
-            if config.timeout_iters < 0:
+            if config.timeout_iters <= 0:
                 raise ValueError(
                     "attention_dp_config.timeout_iters must be greater than 0 when enable_balance is true"
                 )
🧹 Nitpick comments (4)
tensorrt_llm/_torch/pyexecutor/py_executor.py (3)

1147-1149: Remove unused variable to improve code quality.

The all_ranks_num_scheduled_tokens variable is computed but never used in the scheduling logic.

-            all_ranks_num_scheduled_tokens = [
-                response[2] for response in responses_list
-            ]

1155-1158: Remove unused variable to improve code quality.

The all_ranks_have_multi_gen variable is computed but never used in the scheduling logic, making it dead code.

-            all_ranks_have_multi_gen = all([
-                num_gen > 1 
-                for num_gen in all_ranks_num_scheduled_generation_requests
-            ])

1169-1186: Address line length violation and verify logical correctness.

The attention DP balancing logic implements a two-phase waiting mechanism but has a line length violation and should be verified for correctness.

Line length issue:

-                    if self.adp_ctx_waiting_iters_count >= self.attention_dp_time_out_iters or not all_ranks_have_gen_requests:
+                    timeout_reached = self.adp_ctx_waiting_iters_count >= self.attention_dp_time_out_iters
+                    if timeout_reached or not all_ranks_have_gen_requests:

Verification needed: The logic appears to implement a complex waiting strategy where:

  1. If all ranks have free slots and context requests → reset waiting, possibly delay further if all have gen requests
  2. Otherwise → increment waiting counter, block context requests unless timeout or no gen requests

Please verify this implements the intended ADP scheduling behavior, particularly the interaction between adp_ctx_waiting_iters_count and adp_ctx_batching_wait_iters_count counters.
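
For reference, a pseudocode sketch of that two-phase waiting strategy, assembled from the diffs above (illustrative only, not the exact implementation):

    if all_ranks_have_free_ctx_slots and all_ranks_have_ctx_requests:
        # Every rank can take context work: schedule it and reset the wait counter.
        # (A further delay tracked by adp_ctx_batching_wait_iters_count may apply
        # when all ranks also have generation requests.)
        self.adp_ctx_waiting_iters_count = 0
    else:
        # Hold context requests back this iteration and count how long we waited.
        self.adp_ctx_waiting_iters_count += 1
        context_requests = []
        timeout_reached = (self.adp_ctx_waiting_iters_count
                           >= self.attention_dp_time_out_iters)
        if timeout_reached or not all_ranks_have_gen_requests:
            # Give up waiting and let the next iteration schedule context again.
            self.adp_ctx_waiting_iters_count = 0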

tensorrt_llm/llmapi/llm_args.py (1)

92-95: Fix docstring formatting.

The docstring should be formatted as a one-line docstring since it's short and simple.

-    """
-    Configuration for attention DP.
-    """
+    """Configuration for attention DP."""
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dbcd7d2 and 3a49091.

📒 Files selected for processing (6)
  • examples/llm-api/quickstart_advanced.py (4 hunks)
  • tensorrt_llm/_torch/pyexecutor/config.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py (3 hunks)
  • tensorrt_llm/llmapi/__init__.py (2 hunks)
  • tensorrt_llm/llmapi/llm_args.py (5 hunks)
  • tests/integration/defs/test_e2e.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • tensorrt_llm/llmapi/__init__.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • tensorrt_llm/_torch/pyexecutor/config.py
🧰 Additional context used
🧠 Learnings (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/py_executor.py

1147-1147: Local variable all_ranks_num_scheduled_tokens is assigned to but never used

Remove assignment to unused variable all_ranks_num_scheduled_tokens

(F841)


1155-1155: Local variable all_ranks_have_multi_gen is assigned to but never used

Remove assignment to unused variable all_ranks_have_multi_gen

(F841)


1183-1183: Line too long (127 > 120)

(E501)

tensorrt_llm/llmapi/llm_args.py

92-93: One-line docstring should fit on one line

Reformat to one line

(D200)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (11)
tests/integration/defs/test_e2e.py (1)

1776-1813: LGTM! Well-structured integration test for ADP balancing feature.

The test function properly exercises the new attention data parallelism balancing functionality by:

  • Using appropriate pytest decorators for hardware requirements
  • Configuring the necessary command-line arguments for ADP balancing (--attention_dp_enable_balance, --attention_dp_time_out_iters, --attention_dp_batching_wait_iters)
  • Including proper memory usage verification for 8 GPUs
  • Following the established testing pattern in this file
examples/llm-api/quickstart_advanced.py (4)

4-7: LGTM! Correct import addition for AttentionDpConfig.

The import statement properly includes the new AttentionDpConfig class alongside other configuration classes.


57-63: LGTM! Well-defined command-line arguments for ADP balancing.

The new arguments follow the established naming convention and have appropriate types and defaults:

  • --attention_dp_enable_balance: Boolean flag for enabling balancing
  • --attention_dp_time_out_iters: Integer timeout iterations
  • --attention_dp_batching_wait_iters: Integer batching wait iterations

199-203: LGTM! Proper AttentionDpConfig instance creation.

The configuration object is correctly created using the command-line arguments, following the same pattern as other configuration objects in the file.


227-227: LGTM! Correct integration of attention_dp_config into LLM constructor.

The attention_dp_config parameter is properly passed to the LLM constructor, maintaining consistency with other configuration parameters.
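
A hypothetical invocation combining these flags (only the attention_dp_* flags are taken from this PR; the remaining arguments are placeholders):

    python examples/llm-api/quickstart_advanced.py <usual model arguments> \
        --attention_dp_enable_balance \
        --attention_dp_time_out_iters 10 \
        --attention_dp_batching_wait_iters 50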

tensorrt_llm/_torch/pyexecutor/py_executor.py (3)

179-181: LGTM: Configuration parameters correctly extracted from backend config.

The attention DP configuration parameters are properly extracted from the model engine's PyTorch backend configuration and stored as instance variables for use in scheduling logic.


220-222: LGTM: Counter variables properly initialized.

The ADP context waiting and batching wait iteration counters are correctly initialized to zero, providing clean starting state for the scheduling algorithm.


1187-1187: LGTM: Context requests properly assigned after balancing logic.

The processed context_requests (which may be filtered by the ADP balancing logic) are correctly assigned to the scheduled requests structure.

tensorrt_llm/llmapi/llm_args.py (3)

1836-1839: LGTM!

The field addition follows the established pattern for optional configuration fields in the class.


2172-2181: LGTM!

The configuration propagation logic correctly handles both the presence/absence of the config and properly falls back to default values. The boolean logic for attention_dp_enable_balance appropriately checks both conditions.


2198-2198: LGTM!

The field mapping addition follows the established pattern and enables proper handling of attention_dp_config in dictionary-based updates.

@tensorrt-cicd (Collaborator)

PR_Github #13198 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #13198 [ run ] completed with state FAILURE

@yunruis (Contributor, Author) commented Aug 4, 2025

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #14014 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #14014 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #10563 completed with status: 'FAILURE'

@Kefeng-Duan enabled auto-merge (squash) on August 4, 2025 23:26
@Kefeng-Duan (Collaborator)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #14048 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #14048 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #10596 completed with status: 'FAILURE'

@yunruis (Contributor, Author) commented Aug 5, 2025

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #14102 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #14102 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #10641 completed with status: 'FAILURE'

@yunruis (Contributor, Author) commented Aug 5, 2025

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #14126 [ run ] triggered by Bot

auto-merge was automatically disabled August 5, 2025 10:01

Head branch was pushed to by a user without write access

@yunruis force-pushed the opt/adp_schedule_optimize branch from fbc7682 to 7ba91cb on August 5, 2025 10:01
@yunruis (Contributor, Author) commented Aug 5, 2025

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #14136 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #14126 [ run ] completed with state ABORTED

@tensorrt-cicd (Collaborator)

PR_Github #14136 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #10665 completed with status: 'FAILURE'

@yunruis (Contributor, Author) commented Aug 5, 2025

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #14157 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #14157 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #10683 completed with status: 'SUCCESS'

@litaotju merged commit 3ff4f50 into NVIDIA:main on Aug 6, 2025
5 checks passed
@Kefeng-Duan (Collaborator)

@lucaslie @NVIDIA/trt-llm-torch-autodeploy-devs
Hi all,
Sorry, this PR was urgent for some important scenarios (MLPerf, etc.), and we merged it once the CI was ready.
Specifically, we needed to add some configs to the interface and modify the code of ad_executor.py.

image

If you have any concerns, comments, or suggestions, please let us know and we can submit another PR. Is that OK with you?

Thanks
