self.gate dtype update for GLM-4.5 #22203
Conversation
Signed-off-by: zRzRzRzRzRzRzR <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Code Review
This pull request aims to ensure the `self.gate` module in `Glm4MoeBlock` operates in `float32` for performance reasons, as stated in the description. The change to initialize the gate's weights in `float32` is correct. However, the explicit casting of `hidden_states` to `float32` in the `forward` method is redundant and introduces unnecessary performance overhead. PyTorch's linear layer implementation automatically handles type promotion, ensuring the computation is performed in `float32` when the weights are `float32`.

I've recommended removing the explicit cast to avoid the performance penalty of an extra memory copy.
```diff
@@ -180,7 +181,7 @@ def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
         if self.n_shared_experts is not None:
             shared_output = self.shared_experts(hidden_states)
-        router_logits, _ = self.gate(hidden_states)
+        router_logits, _ = self.gate(hidden_states.to(dtype=torch.float32))
```
The explicit cast `hidden_states.to(dtype=torch.float32)` is redundant and introduces an unnecessary memory copy, which can negatively impact performance. Since `self.gate.weight` is already of `dtype=torch.float32` (due to the change in `__init__`), `torch.nn.functional.linear` (which is called internally by `ColumnParallelLinear`) will automatically perform the matrix multiplication in `float32` by upcasting the `hidden_states` tensor. This implicit type promotion is more efficient than an explicit cast.

Removing the explicit cast will rely on this standard PyTorch behavior and avoid the overhead, while still achieving the goal of performing the gate computation in `float32`.
Suggested change:

```diff
- router_logits, _ = self.gate(hidden_states.to(dtype=torch.float32))
+ router_logits, _ = self.gate(hidden_states)
```
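For readers outside the vLLM codebase, here is a minimal, self-contained PyTorch sketch of the pattern under discussion. It uses a plain `nn.Linear` stand-in rather than vLLM's `ColumnParallelLinear`, and all class names and shapes are illustrative, not taken from the actual GLM-4.5 implementation:

```python
import torch
import torch.nn as nn


class MoEGateSketch(nn.Module):
    """Illustrative stand-in for an MoE router gate kept in float32."""

    def __init__(self, hidden_size: int, n_experts: int):
        super().__init__()
        # Gate weights live in float32 even if the rest of the model runs in bf16/fp16.
        self.gate = nn.Linear(hidden_size, n_experts, bias=False, dtype=torch.float32)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Explicit upcast of the activations before routing, mirroring the PR's change.
        router_logits = self.gate(hidden_states.to(dtype=torch.float32))
        return router_logits


# Usage: a bf16 activation tensor routed through the float32 gate.
x = torch.randn(4, 64, dtype=torch.bfloat16)
logits = MoEGateSketch(hidden_size=64, n_experts=8)(x)
print(logits.dtype)  # torch.float32
```

In plain eager-mode PyTorch the explicit cast is needed in this sketch, since `nn.Linear` generally requires input and weight dtypes to match; whether the cast is redundant inside vLLM's `ColumnParallelLinear` is exactly the question the review comment above raises.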
Signed-off-by: zRzRzRzRzRzRzR <[email protected]>
Head branch was pushed to by a user without write access
Also, the final GLM-V model name has changed.
Signed-off-by: zRzRzRzRzRzRzR <[email protected]> Signed-off-by: x22x22 <[email protected]>
@zRzRzRzRzRzRzR What kind of benchmarks showed degraded performance without this change? There is a discussion taking place on the pull request that was merged into llama.cpp to introduce support for these models. We are wondering whether llama.cpp would need a similar change, and in which use cases it would help. A perplexity test did not show any improvements when this was changed to float32. The pull request in question can be found here: ggml-org/llama.cpp#14939. Thank you very much in advance for your clarification!
Cherry-pick: vllm-project@6fa41e0 Signed-off-by: zRzRzRzRzRzRzR <[email protected]>
Signed-off-by: zRzRzRzRzRzRzR <[email protected]> Signed-off-by: jingyu <[email protected]>
Signed-off-by: zRzRzRzRzRzRzR <[email protected]> Signed-off-by: Jinzhen Lin <[email protected]>
Signed-off-by: zRzRzRzRzRzRzR <[email protected]> Signed-off-by: Noam Gat <[email protected]>
Signed-off-by: zRzRzRzRzRzRzR <[email protected]> Signed-off-by: Avery Yingyi Huang <[email protected]>
The entire `self.gate` module needs to remain in float32 to preserve benchmark accuracy for GLM-4.5 and GLM-4.5V during forward propagation.
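As a rough illustration of why router precision can matter, the small self-contained script below (synthetic data; the shapes and expert counts are made up and do not reflect GLM-4.5's real configuration) compares top-k expert selection when the gate matmul is done in bfloat16 versus float32. Small numeric differences in the logits can flip which experts are selected for a token, which is one plausible way reduced gate precision could show up in downstream benchmark quality:

```python
import torch

torch.manual_seed(0)

hidden_size, n_experts, top_k, n_tokens = 4096, 128, 8, 1024

# Synthetic activations and gate weights (illustrative only).
hidden_states = torch.randn(n_tokens, hidden_size, dtype=torch.float32)
gate_weight = torch.randn(n_experts, hidden_size, dtype=torch.float32) * 0.02

# Router logits computed in float32 (reference) vs. bfloat16 (reduced precision).
logits_fp32 = hidden_states @ gate_weight.t()
logits_bf16 = (hidden_states.bfloat16() @ gate_weight.bfloat16().t()).float()

# Top-k expert indices under each precision.
topk_fp32 = logits_fp32.topk(top_k, dim=-1).indices
topk_bf16 = logits_bf16.topk(top_k, dim=-1).indices

# Fraction of tokens whose selected expert set differs between the two precisions.
mismatch = (topk_fp32.sort(dim=-1).values != topk_bf16.sort(dim=-1).values).any(dim=-1)
print(f"tokens with a different expert set: {mismatch.float().mean().item():.1%}")
```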