Conversation

@socrahow (Contributor) commented Aug 19, 2025


What this PR does / why we need it?

Fix an AddRMSNormW8A8Quant init bug and optimize the performance of the GemmaRMSNorm operator of the Gemma3 model on NPU.

Before the fix, running the model raised "TypeError: wrapper_rmsnorm_init.<locals>.init() takes 2 positional arguments but 6 were given". After the fix, it runs smoothly.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Tested by running the Gemma3 model.

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing, smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for the Gemma3 model on Ascend NPUs, including performance optimizations for the GemmaRMSNorm operator and a fix for an initialization bug in AddRMSNormW8A8Quant. The changes are well-structured, adding the necessary model definitions, registration, and a test case.

However, I've found a critical issue in the new AscendGemma3DecoderLayer implementation where an incorrect layer is used for quantization fusion in the MLP block, likely due to a copy-paste error. This would lead to incorrect model execution. A fix is suggested in the detailed comments.

```python
               AscendW8A8LinearMethod):
    self.pre_feedforward_layernorm = AddRMSNormW8A8Quant(
        config.hidden_size,
        layer=self.self_attn.qkv_proj,
```
gemini-code-assist bot (critical):

There appears to be a copy-paste error. The pre_feedforward_layernorm is applied just before the MLP block. For the AddRMSNormW8A8Quant fusion to work correctly, it needs to be linked with the subsequent linear layer, which is self.mlp.gate_up_proj, not self.self_attn.qkv_proj.

Suggested change:

```diff
-        layer=self.self_attn.qkv_proj,
+        layer=self.mlp.gate_up_proj,
```

socrahow (Contributor Author) replied:

Fixed.

@realliujiaxu (Contributor) commented:

Please briefly describe the optimization principles, solutions, and the optimization results.

@socrahow (Contributor Author) replied:

> Please briefly describe the optimization principles, solutions, and the optimization results.

Before optimizing, the RMSNorm time in one decoding step was 531.5 µs. After optimizing, it is 105 µs.

@socrahow changed the title: "Fix AddRMSNormW8A8Quant init bug and optimize the performance of the gemmarmsnorm operator of the gemma3 model on NPU" → "[main] Fix AddRMSNormW8A8Quant init bug and optimize the performance of the gemmarmsnorm operator of the gemma3 model on NPU" on Aug 19, 2025
The fix changes the `super().__init__` call from positional to keyword arguments, so they are routed through the wrapper's `**extra_args` instead of overflowing its positional parameters:

```diff
         dtype: Optional[torch.dtype] = None,
     ) -> None:
-        super().__init__(hidden_size, eps, var_hidden_size, has_weight, dtype)
+        super().__init__(hidden_size=hidden_size, eps=eps, var_hidden_size=var_hidden_size, has_weight=has_weight, dtype=dtype)
```
A Collaborator commented:

I suggest splitting this PR into two PRs: one focused on the bugfix, and the other focused on the new model.

socrahow (Contributor Author) replied:

OK, I deleted the new model part.

@socrahow changed the title: "[main] Fix AddRMSNormW8A8Quant init bug and optimize the performance of the gemmarmsnorm operator of the gemma3 model on NPU" → "[main] Fix AddRMSNormW8A8Quant init bug" on Aug 20, 2025
@ApsarasX (Collaborator) commented:

Maybe we should modify the wrapper_rmsnorm_init method as follows?

```diff
 def wrapper_rmsnorm_init(func):

-    def init(self, hidden_size: int, **extra_args) -> None:
-        func(self, hidden_size, **extra_args)
+    def init(self, hidden_size: int, *args, **kwargs) -> None:
+        func(self, hidden_size, *args, **kwargs)
         self.ignore_anti = True
         self.bias = torch.nn.Parameter(torch.zeros(hidden_size),
                                        requires_grad=False)

     return init
```

codecov bot commented Aug 22, 2025

Codecov Report

❌ Patch coverage is 50.00000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 77.71%. Comparing base (2bb7e55) to head (7223570).
⚠️ Report is 195 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| vllm_ascend/ops/layernorm.py | 50.00% | 1 Missing ⚠️ |

❌ Your patch check has failed because the patch coverage (50.00%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
```
@@            Coverage Diff             @@
##             main    #2440      +/-   ##
==========================================
+ Coverage   76.18%   77.71%   +1.52%
==========================================
  Files         120      132      +12
  Lines       13532    17520    +3988
==========================================
+ Hits        10310    13615    +3305
- Misses       3222     3905     +683
```

| Flag | Coverage Δ |
| --- | --- |
| unittests | 77.71% <50.00%> (+1.52%) ⬆️ |

Flags with carried forward coverage won't be shown.


This pull request has conflicts; please resolve them before we can evaluate the pull request.
