[main][bugfix] Fix bugs and refactor cached mask generation logic #2442
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request refactors the attention mask generation logic to keep the cached mask on the CPU and only move it to the device when needed. This is a good optimization that simplifies the logic and reduces device memory pressure. The changes in vllm_ascend/worker/model_runner_v1.py are consistent with this refactoring, correctly preparing and passing CPU tensors for mask creation.
I've found one critical issue: a typo in a method call within get_splitfuse_attn_mask that will cause an AttributeError at runtime. Please see the specific comment for details.
current_row += q_len
return attn_mask.to(device, non_blocking=True)
self.update_attn_cache(max_seq_len, dtype)
There's a typo in the method call here. The method is named _update_attn_cache (with a leading underscore), but it's being called as update_attn_cache. This will raise an AttributeError at runtime.
Suggested change:
- self.update_attn_cache(max_seq_len, dtype)
+ self._update_attn_cache(max_seq_len, dtype)
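To make the cached-mask pattern concrete, here is a minimal sketch in the spirit of the snippets above. The method names mirror what is visible in the diff (`_update_attn_cache`, the CPU-resident cache, the final `.to(device, non_blocking=True)`), but the class layout and mask contents are illustrative assumptions, not the actual vLLM Ascend implementation:

```python
import torch


class AttentionMaskBuilder:
    """Illustrative cached-mask builder; the cache lives on the CPU."""

    def __init__(self, max_seq_len: int, dtype: torch.dtype):
        self._seq_len_cached = 0
        self.attn_mask_cache = torch.empty(0)
        self._update_attn_cache(max_seq_len, dtype)

    def _update_attn_cache(self, max_seq_len: int, dtype: torch.dtype) -> None:
        # Rebuild the cached causal mask on the CPU when it is too small
        # or the requested dtype changed.
        self._seq_len_cached = max_seq_len
        self.attn_mask_cache = torch.triu(
            torch.ones(max_seq_len, max_seq_len, dtype=dtype,
                       device=torch.device("cpu")),
            diagonal=1)

    def get_attn_mask(self, max_seq_len: int, dtype: torch.dtype,
                      device: torch.device) -> torch.Tensor:
        if (max_seq_len > self._seq_len_cached
                or dtype != self.attn_mask_cache.dtype):
            self._update_attn_cache(max_seq_len, dtype)
        # Slice the CPU cache and copy only the needed part to the device.
        attn_mask = self.attn_mask_cache[:max_seq_len, :max_seq_len]
        return attn_mask.to(device, non_blocking=True)
```

The key point is that the potentially large mask tensor is built and grown only on the CPU, and only the slice needed for the current batch is transferred to the NPU.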
This pull request has conflicts; please resolve them before we can evaluate the pull request.
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #2442 +/- ##
==========================================
- Coverage 77.99% 77.96% -0.03%
==========================================
Files 134 134
Lines 18498 18474 -24
==========================================
- Hits 14427 14403 -24
Misses 4071 4071
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Signed-off-by: rjg-lyh <[email protected]>
lgtm
@ApsarasX please double check the response; if it's fine, feel free to merge this.
device=torch.device("cpu"),
)
def test_mask_value_cleanliness(self):
I have added this test here. @ApsarasX
OK, I see.
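The exact assertions of the new test are not shown in this thread; below is a hypothetical sketch of what a mask "value cleanliness" check might verify, assuming the simple upper-triangular mask from the sketch above: the cache stays on the CPU, allowed positions are exactly zero, and masked positions all share a single fill value.

```python
import unittest

import torch


class TestAttentionMaskCache(unittest.TestCase):

    def test_mask_value_cleanliness(self):
        max_seq_len = 8
        # Build a causal mask the same way as the sketch above.
        mask = torch.triu(
            torch.ones(max_seq_len, max_seq_len, dtype=torch.float32,
                       device=torch.device("cpu")),
            diagonal=1)
        # The cached mask should stay on the CPU until explicitly moved.
        self.assertEqual(mask.device, torch.device("cpu"))
        # Allowed (lower-triangular, incl. diagonal) positions are exactly 0.
        self.assertTrue(torch.all(torch.tril(mask) == 0).item())
        # Masked (strictly upper-triangular) positions all share one value.
        upper = torch.triu(torch.ones_like(mask, dtype=torch.bool), diagonal=1)
        self.assertTrue(torch.all(mask[upper] == 1).item())


if __name__ == "__main__":
    unittest.main()
```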
What this PR does / why we need it?
This PR fixes bugs and refactors the cached mask generation logic. The cached mask is now pre-constructed and kept on the CPU instead of on the NPU device, and is moved to the device only when needed.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
CI passed with newly added and existing tests.