
Conversation


@suexu1025 commented on Sep 19, 2025

Description

Land the sharding strategy for the MoE layer.

  • DeepSeek-V3 (dsv3) step time decreases from 47s to 43s.
  • No change for the Mixtral 8x7B model.



Tests

Please describe how you tested this change, and include any instructions and/or
commands to reproduce.

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed.

@suexu1025 changed the title from "Better Sharding for dsv3 moe layer" to "Better sharding for dsv3 moe layer" on Sep 19, 2025

@gobbleturk left a comment


@richjames0 for thoughts =D

@@ -300,8 +300,13 @@ def __init__(
    self.quant = quant
    self.rngs = rngs

    self.wi_kernel_axes = ("exp", "embed_no_exp", "mlp")
    self.wo_kernel_axes = ("exp", "mlp", "embed_no_exp")
    # special sharding for dsv3
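For context, a minimal sketch of how logical kernel axes like the ones in this hunk are typically attached to a parameter in Flax. This is illustrative code only, not MaxText's actual module; the class name, shapes, and einsum are invented for the example.

import jax.numpy as jnp
import flax.linen as nn


class TinyExpertMLP(nn.Module):
  # Illustrative only: attach the quoted logical axis names
  # ("exp", "embed_no_exp", "mlp") to an expert weight at creation time.
  num_experts: int
  emb: int
  mlp: int

  @nn.compact
  def __call__(self, x):  # x: [num_experts, tokens, emb]
    wi = self.param(
        "wi",
        nn.with_logical_partitioning(
            nn.initializers.lecun_normal(), ("exp", "embed_no_exp", "mlp")),
        (self.num_experts, self.emb, self.mlp),
        jnp.float32,
    )
    # Per-expert projection into the MLP hidden dimension.
    return jnp.einsum("etd,edm->etm", x, wi)

The logical names only become a concrete device layout once they are resolved against the mesh through logical-axis rules, which is why changing these tuples changes the sharding without touching the math.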


@richjames0 for all sharding changes =D

    self.wi_kernel_axes = ("exp", "embed_no_exp", "mlp")
    self.wo_kernel_axes = ("exp", "mlp", "embed_no_exp")
    # special sharding for dsv3
    if self.config.num_experts == 256:


We need better logic than such a specific conditional. It should be controllable via a base.yml config argument, something like "expert_first_dim".

It doesn't surprise me that expert-first is the most performant. Are there any downsides to just flipping it here, other than having to modify all of our checkpoint conversion scripts? =D
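A sketch of what that gate could look like, purely illustrative: expert_first_dim is only the name floated in this comment (it is not an existing base.yml key), and the special-case axes are left out because the quoted diff does not show them.

def pick_moe_kernel_axes(config):
  # Hypothetical helper: gate the layout on a config flag instead of a
  # hard-coded expert count.
  wi_axes = ("exp", "embed_no_exp", "mlp")
  wo_axes = ("exp", "mlp", "embed_no_exp")
  if getattr(config, "expert_first_dim", False):
    # Expert-first special case; the concrete axes are not visible in the
    # quoted hunk, so this branch is left as a placeholder.
    pass
  return wi_axes, wo_axes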


@gobbleturk commented on Sep 19, 2025


If modifying the checkpoint conversion scripts is really the hardest part, we can support both options (with a base.yml config) for a while, with a deprecation warning on the current behavior so users migrate to the new one.
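A sketch of the migration path being described, with an invented option name (moe_kernel_layout) standing in for whatever base.yml key would actually be added:

import warnings


def maybe_warn_legacy_moe_layout(config):
  # Hypothetical: keep both layouts selectable for a while and warn on the
  # legacy one so users have time to re-convert checkpoints.
  if getattr(config, "moe_kernel_layout", "legacy") == "legacy":
    warnings.warn(
        "The legacy MoE kernel layout is deprecated; set "
        "moe_kernel_layout='expert_first' and re-run checkpoint conversion.",
        DeprecationWarning,
    )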


@RissyRan left a comment


Thanks Qinwen! Could you also attach test results in the description?

w1_pspec = nn.logical_to_mesh_axes(("exp", "embed_tensor_transpose", "mlp_no_fsdp"))
wo_pspec = nn.logical_to_mesh_axes(("exp", "mlp_no_fsdp", "embed_tensor_transpose"))
# special sharding for dsv3 to remove overhead between gmm/AG
if self.config.num_experts == 256:
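For readers unfamiliar with the call in the quoted lines: nn.logical_to_mesh_axes resolves a tuple of logical axis names into a jax.sharding.PartitionSpec using the logical-axis rules in scope. A toy example with made-up rules (MaxText defines its real mapping under logical_axis_rules in base.yml):

import flax.linen as nn

example_rules = (
    ("exp", "expert"),
    ("embed_tensor_transpose", "fsdp"),
    ("mlp_no_fsdp", "tensor"),
)

w1_pspec = nn.logical_to_mesh_axes(
    ("exp", "embed_tensor_transpose", "mlp_no_fsdp"), rules=example_rules)
print(w1_pspec)  # PartitionSpec('expert', 'fsdp', 'tensor')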

Similar comment to Matt's. For instance, we could have a flag similar to expert_shard_attention_option, e.g. expert_shard_mlp_option or similar.

The condition self.config.num_experts == 256 may raise questions: what about 128 or 512 experts?
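A sketch of the option-style gate being suggested; the name expert_shard_mlp_option and its values are hypothetical, modeled on the existing expert_shard_attention_option mentioned above.

def validate_expert_shard_mlp_option(config):
  # Hypothetical: pick the MLP sharding layout from an explicit option rather
  # than matching a specific expert count such as 256.
  valid = ("default", "expert_first")  # invented values
  option = getattr(config, "expert_shard_mlp_option", "default")
  if option not in valid:
    raise ValueError(
        f"expert_shard_mlp_option must be one of {valid}, got {option!r}")
  return option

An explicit option would behave the same for 128, 256, or 512 experts and make the intended layout visible in the run config.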
