self.gate dtype update for GLM-4.5 #22203
Conversation
Signed-off-by: zRzRzRzRzRzRzR <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Code Review
This pull request aims to ensure the `self.gate` module in `Glm4MoeBlock` operates in `float32` for performance reasons, as stated in the description. The change to initialize the gate's weights in `float32` is correct. However, the explicit casting of `hidden_states` to `float32` in the `forward` method is redundant and introduces unnecessary performance overhead. PyTorch's linear layer implementation automatically handles type promotion, ensuring the computation is performed in `float32` when the weights are `float32`.

I've recommended removing the explicit cast to avoid the performance penalty of an extra memory copy.
```diff
@@ -180,7 +181,7 @@ def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
         if self.n_shared_experts is not None:
             shared_output = self.shared_experts(hidden_states)
-        router_logits, _ = self.gate(hidden_states)
+        router_logits, _ = self.gate(hidden_states.to(dtype=torch.float32))
```
The explicit cast `hidden_states.to(dtype=torch.float32)` is redundant and introduces an unnecessary memory copy, which can negatively impact performance. Since `self.gate.weight` is already of `dtype=torch.float32` (due to the change in `__init__`), `torch.nn.functional.linear` (which is called internally by `ColumnParallelLinear`) will automatically perform the matrix multiplication in `float32` by upcasting the `hidden_states` tensor. This implicit type promotion is more efficient than an explicit cast.

Removing the explicit cast will rely on this standard PyTorch behavior and avoid the overhead, while still achieving the goal of performing the gate computation in `float32`.
Suggested change:

```diff
- router_logits, _ = self.gate(hidden_states.to(dtype=torch.float32))
+ router_logits, _ = self.gate(hidden_states)
```
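For readers outside the vLLM codebase, here is a minimal, self-contained PyTorch sketch of the pattern under discussion. It uses a plain `nn.Linear` stand-in rather than vLLM's `ColumnParallelLinear`, and all class names and shapes are illustrative, not taken from the actual GLM-4.5 implementation:

```python
import torch
import torch.nn as nn


class MoEGateSketch(nn.Module):
    """Illustrative stand-in for an MoE router gate kept in float32."""

    def __init__(self, hidden_size: int, n_experts: int):
        super().__init__()
        # Gate weights live in float32 even if the rest of the model runs in bf16/fp16.
        self.gate = nn.Linear(hidden_size, n_experts, bias=False, dtype=torch.float32)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Explicit upcast of the activations before routing, mirroring the PR's change.
        router_logits = self.gate(hidden_states.to(dtype=torch.float32))
        return router_logits


# Usage: a bf16 activation tensor routed through the float32 gate.
x = torch.randn(4, 64, dtype=torch.bfloat16)
logits = MoEGateSketch(hidden_size=64, n_experts=8)(x)
print(logits.dtype)  # torch.float32
```

In plain eager-mode PyTorch the explicit cast is needed in this sketch, since `nn.Linear` generally requires input and weight dtypes to match; whether the cast is redundant inside vLLM's `ColumnParallelLinear` is exactly the question the review comment above raises.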
Signed-off-by: zRzRzRzRzRzRzR <[email protected]>
Head branch was pushed to by a user without write access
Also, the final GLM-V model name has changed.
Signed-off-by: zRzRzRzRzRzRzR <[email protected]> Signed-off-by: x22x22 <[email protected]>
@zRzRzRzRzRzRzR What kind of benchmarks showed degraded performance without this change? There is a discussion taking place on the pull request that was merged into llama.cpp to introduce support for these models. We are wondering whether llama.cpp would need a similar change, and in which use cases it would help. A perplexity test did not show any improvements when this was changed to float32. The pull request in question can be found here: ggml-org/llama.cpp#14939. Thank you very much in advance for your clarification!
Cherry-pick: vllm-project@6fa41e0 Signed-off-by: zRzRzRzRzRzRzR <[email protected]>
Signed-off-by: zRzRzRzRzRzRzR <[email protected]> Signed-off-by: jingyu <[email protected]>
Signed-off-by: zRzRzRzRzRzRzR <[email protected]> Signed-off-by: Jinzhen Lin <[email protected]>
Signed-off-by: zRzRzRzRzRzRzR <[email protected]> Signed-off-by: Noam Gat <[email protected]>
Signed-off-by: zRzRzRzRzRzRzR <[email protected]> Signed-off-by: Avery Yingyi Huang <[email protected]>
The entire `self.gate` module needs to remain in float32 to preserve benchmark accuracy for GLM-4.5 and GLM-4.5V during forward propagation.
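As a rough illustration of why router precision can matter, the small self-contained script below (synthetic data; the shapes and expert counts are made up and do not reflect GLM-4.5's real configuration) compares top-k expert selection when the gate matmul is done in bfloat16 versus float32. Small numeric differences in the logits can flip which experts are selected for a token, which is one plausible way reduced gate precision could show up in downstream benchmark quality:

```python
import torch

torch.manual_seed(0)

hidden_size, n_experts, top_k, n_tokens = 4096, 128, 8, 1024

# Synthetic activations and gate weights (illustrative only).
hidden_states = torch.randn(n_tokens, hidden_size, dtype=torch.float32)
gate_weight = torch.randn(n_experts, hidden_size, dtype=torch.float32) * 0.02

# Router logits computed in float32 (reference) vs. bfloat16 (reduced precision).
logits_fp32 = hidden_states @ gate_weight.t()
logits_bf16 = (hidden_states.bfloat16() @ gate_weight.bfloat16().t()).float()

# Top-k expert indices under each precision.
topk_fp32 = logits_fp32.topk(top_k, dim=-1).indices
topk_bf16 = logits_bf16.topk(top_k, dim=-1).indices

# Fraction of tokens whose selected expert set differs between the two precisions.
mismatch = (topk_fp32.sort(dim=-1).values != topk_bf16.sort(dim=-1).values).any(dim=-1)
print(f"tokens with a different expert set: {mismatch.float().mean().item():.1%}")
```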