
Conversation

@yiz-liu (Collaborator) commented Aug 20, 2025

What this PR does / why we need it?

Fixes a hang that occurs when the batch size is smaller than the DP size.

Does this PR introduce any user-facing change?

None.

How was this patch tested?

After this change, the function works correctly in the DP case.

@gemini-code-assist (Contributor) bot left a comment

Code Review

This pull request aims to fix a service hang when the batch size is smaller than the data parallelism (DP) size by adding a missing collective operation in _dummy_run to synchronize with _process_reqs. While adding the get_dp_padding call is the correct step to resolve the hang, the implementation introduces an inconsistency. The padding calculated is not applied to num_tokens in _dummy_run, whereas it is applied in _process_reqs when using ACL graphs. This can lead to a mismatch between the captured graph and runtime execution, potentially causing errors or incorrect behavior. My review includes a suggestion to address this inconsistency.
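
For context, here is a minimal sketch of what a DP padding helper of this kind typically looks like, assuming a torch.distributed-style collective. The names get_dp_padding, num_tokens_across_dp, and dp_group mirror the discussion above, but the body is illustrative only and is not the vllm-ascend implementation. The point it demonstrates is that the all_reduce is a collective call: if a rank with no work skips it (as _dummy_run did before this fix), the ranks that do reach it block indefinitely.

import torch
import torch.distributed as dist

def get_dp_padding(num_tokens: int, dp_group) -> tuple[int, torch.Tensor]:
    # Hypothetical sketch: gather each DP rank's token count and pad every rank
    # up to the DP-wide maximum so all ranks run the same-sized forward pass.
    world_size = dist.get_world_size(group=dp_group)
    rank = dist.get_rank(group=dp_group)
    num_tokens_across_dp = torch.zeros(world_size, dtype=torch.int32)
    num_tokens_across_dp[rank] = num_tokens
    # Collective: every DP rank must reach this line, including ranks that are
    # only doing a dummy run; otherwise the other ranks hang here.
    dist.all_reduce(num_tokens_across_dp, op=dist.ReduceOp.MAX, group=dp_group)
    max_tokens = int(num_tokens_across_dp.max().item())
    return max_tokens - num_tokens, num_tokens_across_dp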

Comment on lines 1933 to 1918
num_pad, num_tokens_across_dp_native = self.get_dp_padding(
    num_tokens)
# num_tokens += num_pad ## Uncomment this after TorchAir is removed

critical

The addition of get_dp_padding correctly introduces the necessary collective call to prevent hangs. However, the calculated num_pad is not applied to num_tokens because the line num_tokens += num_pad is commented out. This creates an inconsistency with the _process_reqs method (line 1075), which does apply this padding when use_aclgraph is true. Since _dummy_run is used for capturing ACL graphs, this discrepancy can lead to a mismatch between the captured graph's expected input size and the actual input size at runtime, which is a critical issue. The padding should be applied to ensure consistency.

Suggested change
-   num_pad, num_tokens_across_dp_native = self.get_dp_padding(
-       num_tokens)
-   # num_tokens += num_pad ## Uncomment this after TorchAir is removed
+   num_pad, num_tokens_across_dp_native = self.get_dp_padding(
+       num_tokens)
+   num_tokens += num_pad
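
To make the mismatch concrete, here is a tiny hypothetical example, assuming the graph capture/replay machinery requires fixed input sizes (as ACL graphs do); the numbers are made up:

num_tokens = 3                           # tokens on this DP rank
num_pad = 2                              # get_dp_padding result when the DP-wide maximum is 5

replay_size = num_tokens + num_pad       # _process_reqs with use_aclgraph applies the pad
capture_without_pad = num_tokens         # _dummy_run with "num_tokens += num_pad" commented out
capture_with_pad = num_tokens + num_pad  # _dummy_run with the suggestion applied

print(capture_without_pad == replay_size)  # False: the capture/replay mismatch flagged above
print(capture_with_pad == replay_size)     # True: capture and replay sizes stay consistent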

@MengqingCao (Collaborator)

Thanks for the fix, could you add an e2e test to prevent this from breaking again?


👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by other future PRs.
  • Write the commit message by fulfilling the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@wangxiyuan (Collaborator)

please make CI happy

@yiz-liu (Collaborator, Author) commented Aug 21, 2025

please make CI happy

@wangxiyuan Ready, waiting for other fixes.

…implementation specific to TorchAir. Make sure the server does not hang when batch size < DP size.

Signed-off-by: Yizhou Liu <[email protected]>
@yiz-liu (Collaborator, Author) commented Aug 22, 2025

Thanks for the fix, could you add an e2e test to prevent this from breaking again?

@MengqingCao Has the refactoring of the ModelRunner class been completed?


codecov bot commented Aug 22, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 78.04%. Comparing base (3629bc4) to head (e659504).
⚠️ Report is 11 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2454   +/-   ##
=======================================
  Coverage   78.04%   78.04%           
=======================================
  Files         132      132           
  Lines       17557    17557           
=======================================
  Hits        13702    13702           
  Misses       3855     3855           
Flag        Coverage Δ
unittests   78.04% <ø> (ø)

Flags with carried forward coverage won't be shown.


@MengqingCao (Collaborator)

@MengqingCao Has the refactoring of the ModelRunner class been completed?

The refactor of ModelRunner is included in #2445, but I'm not sure whether that covers all of the refactoring; maybe @weiguihua2 could give more info.

@Yikun (Collaborator) commented Aug 23, 2025

LGTM except:

  1. what's the regression e2e test plan as mengqing required?
  2. quick question: we already have vllm_ascend/torchair/torchair_model_runner.py, so why is there still TorchAir-specific code in vllm_ascend/worker/model_runner_v1.py?

@yiz-liu (Collaborator, Author) commented Aug 23, 2025

LGTM except:

  1. what's the regression e2e test plan as mengqing required?
  2. quick question: we already have vllm_ascend/torchair/torchair_model_runner.py, so why is there still TorchAir-specific code in vllm_ascend/worker/model_runner_v1.py?

  1. Working on it; tests will be added with or after [2/N][Feat] Add MC2 communication method for MoE layers #2469.
  2. @linfeng-yuan PLS take a look at this.

num_pad, num_tokens_across_dp_native = self.get_dp_padding(num_tokens)
# num_tokens += num_pad ## Uncomment this after TorchAir is removed

# Padding for DP (for TorchAir)
Collaborator

This _get_forward_metadata_across_dp_and_pad function is not for TorchAir; the method can be rewritten in the TorchAir runner instead.

Collaborator Author

Sure, will figure out a way to merge these two paths.
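
Regarding merging the two paths, here is a rough sketch of the kind of split suggested above, assuming a plain subclass override; the class names, method signature, and bodies below are illustrative guesses, not the actual vllm-ascend code.

class NPUModelRunner:
    """Base runner: keeps only the native (non-TorchAir) DP padding path."""

    def get_dp_padding(self, num_tokens: int):
        # Placeholder for the DP padding collective discussed above.
        return 0, None

    def _get_forward_metadata_across_dp_and_pad(self, num_tokens, with_prefill, enable_dbo):
        # Native path: pad to the DP-wide maximum; no TorchAir branching here.
        num_pad, num_tokens_across_dp = self.get_dp_padding(num_tokens)
        return num_tokens + num_pad, num_tokens_across_dp, with_prefill, enable_dbo

class NPUTorchairModelRunner(NPUModelRunner):
    """TorchAir runner: overrides the helper instead of branching in the base class."""

    def _get_forward_metadata_across_dp_and_pad(self, num_tokens, with_prefill, enable_dbo):
        # TorchAir-specific padding (e.g. rounding up to a captured graph size)
        # would live here, keeping model_runner_v1.py free of TorchAir code.
        return super()._get_forward_metadata_across_dp_and_pad(
            num_tokens, with_prefill, enable_dbo)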

@Yikun (Collaborator) commented Aug 23, 2025

CI seems to have failed due to a flaky HCCL timeout; I retried, let's see the latest results:
https://github.com/vllm-project/vllm-ascend/actions/runs/17171726635/job/48723250542?pr=2454

@wangxiyuan (Collaborator)

We need to fix this issue first. For the pad logic, let's create an RFC to improve the performance.

@wangxiyuan merged commit 99bf25a into vllm-project:main on Aug 25, 2025 (36 of 41 checks passed).
@yiz-liu deleted the fix-dp branch on August 25, 2025 at 12:02.