[main][bugfix] Fix bugs and refactor cached mask generation logic #2442
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request refactors the attention mask generation logic to keep the cached mask on the CPU and only move it to the device when needed. This is a good optimization that simplifies the logic and reduces device memory pressure. The changes in vllm_ascend/worker/model_runner_v1.py are consistent with this refactoring, correctly preparing and passing CPU tensors for mask creation.
I've found one critical issue: a typo in a method call within get_splitfuse_attn_mask that will cause an AttributeError at runtime. Please see the specific comment for details.
current_row += q_len
return attn_mask.to(device, non_blocking=True)
self.update_attn_cache(max_seq_len, dtype)
There's a typo in the method call here. The method is named _update_attn_cache (with a leading underscore), but it's being called as update_attn_cache. This will raise an AttributeError at runtime.
Suggested change:
- self.update_attn_cache(max_seq_len, dtype)
+ self._update_attn_cache(max_seq_len, dtype)
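To make the cached-mask pattern concrete, here is a minimal sketch in the spirit of the snippets above. The method names mirror what is visible in the diff (`_update_attn_cache`, the CPU-resident cache, the final `.to(device, non_blocking=True)`), but the class layout and mask contents are illustrative assumptions, not the actual vLLM Ascend implementation:

```python
import torch


class AttentionMaskBuilder:
    """Illustrative cached-mask builder; the cache lives on the CPU."""

    def __init__(self, max_seq_len: int, dtype: torch.dtype):
        self._seq_len_cached = 0
        self.attn_mask_cache = torch.empty(0)
        self._update_attn_cache(max_seq_len, dtype)

    def _update_attn_cache(self, max_seq_len: int, dtype: torch.dtype) -> None:
        # Rebuild the cached causal mask on the CPU when it is too small
        # or the requested dtype changed.
        self._seq_len_cached = max_seq_len
        self.attn_mask_cache = torch.triu(
            torch.ones(max_seq_len, max_seq_len, dtype=dtype,
                       device=torch.device("cpu")),
            diagonal=1)

    def get_attn_mask(self, max_seq_len: int, dtype: torch.dtype,
                      device: torch.device) -> torch.Tensor:
        if (max_seq_len > self._seq_len_cached
                or dtype != self.attn_mask_cache.dtype):
            self._update_attn_cache(max_seq_len, dtype)
        # Slice the CPU cache and copy only the needed part to the device.
        attn_mask = self.attn_mask_cache[:max_seq_len, :max_seq_len]
        return attn_mask.to(device, non_blocking=True)
```

The key point is that the potentially large mask tensor is built and grown only on the CPU, and only the slice needed for the current batch is transferred to the NPU.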
This pull request has conflicts; please resolve them before we can evaluate the pull request.
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #2442 +/- ##
==========================================
- Coverage 77.99% 77.96% -0.03%
==========================================
Files 134 134
Lines 18498 18474 -24
==========================================
- Hits 14427 14403 -24
Misses 4071 4071
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Signed-off-by: rjg-lyh <[email protected]>
lgtm
@ApsarasX please double check the response; if it's fine, feel free to merge this.
device=torch.device("cpu"),
)
def test_mask_value_cleanliness(self):
I have added this test here. @ApsarasX
OK, I see.
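The exact assertions of the new test are not shown in this thread; below is a hypothetical sketch of what a mask "value cleanliness" check might verify, assuming the simple upper-triangular mask from the sketch above: the cache stays on the CPU, allowed positions are exactly zero, and masked positions all share a single fill value.

```python
import unittest

import torch


class TestAttentionMaskCache(unittest.TestCase):

    def test_mask_value_cleanliness(self):
        max_seq_len = 8
        # Build a causal mask the same way as the sketch above.
        mask = torch.triu(
            torch.ones(max_seq_len, max_seq_len, dtype=torch.float32,
                       device=torch.device("cpu")),
            diagonal=1)
        # The cached mask should stay on the CPU until explicitly moved.
        self.assertEqual(mask.device, torch.device("cpu"))
        # Allowed (lower-triangular, incl. diagonal) positions are exactly 0.
        self.assertTrue(torch.all(torch.tril(mask) == 0).item())
        # Masked (strictly upper-triangular) positions all share one value.
        upper = torch.triu(torch.ones_like(mask, dtype=torch.bool), diagonal=1)
        self.assertTrue(torch.all(mask[upper] == 1).item())


if __name__ == "__main__":
    unittest.main()
```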
What this PR does / why we need it?
This PR fixes bugs and refactors the cached mask generation logic. The cached mask is now pre-constructed and kept on the CPU instead of on the NPU device, and is moved to the device only when needed.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
CI passed with newly added and existing tests.