[EP] add support for ETP=1 #1555

tianyu-l · 2025-08-12T05:49:31Z

This is a follow-up to the original EP support in #1324.

PR summary

[TBA] description + figure

numerics verification

setup

optimizer Adam
steps 100, warmup_steps 20
seed 42

comparison set

FSDP 2
FSDP 2, CP 2, TP 2, EP 8, ETP 1
FSDP 2 (EP 2), PP 2, TP 2 (ETP 2)

sanketpurandare · 2025-08-13T17:35:22Z

torchtitan/config/job_config.py

+    Expert parallelism degree. 1 means disabled. No effect for non-MoE models.
+    Currently, it is supported with the following constraints:
+    - when etp = tp: cp * tp <= ep <= dp_shard * cp * tp
+    - when etp = 1: cp * tp <= ep <= dp_shard * cp * tp


Can we also add some comments about the divisibility constraints? For instance, we require ep % cp == 0 and dp_shard * cp % ep == 0. Or do you think this is not needed, since people usually use powers of 2?

Sounds good, I can add something. Fwiw I used to use | symbol to denote mod x == 0 but people don't seem to understand what it means.

Yeah, I think the % symbol is more common?
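
For illustration, the range and divisibility checks discussed in this thread could be sketched as a small validator. This is a hedged sketch, not the code in torchtitan/config/job_config.py: the function name `check_ep_degree` is hypothetical, the range bound is the one quoted in the diff above, and the inclusion of the `tp` factor in the two `%` checks is an assumption that mirrors that range (the review comment itself only mentions `ep % cp` and `dp_shard * cp % ep`).

```python
from typing import List


def check_ep_degree(ep: int, dp_shard: int, cp: int, tp: int) -> List[str]:
    """Return the list of violated constraints; an empty list means ep passes.

    Hypothetical helper for illustration only. Bounds follow the docstring
    quoted in the diff: cp * tp <= ep <= dp_shard * cp * tp.
    """
    if ep < 1:
        return [f"ep must be a positive integer, got {ep}"]
    if ep == 1:
        # Per the docstring, 1 means expert parallelism is disabled.
        return []

    lower = cp * tp
    upper = dp_shard * cp * tp
    problems = []

    if not (lower <= ep <= upper):
        problems.append(f"range: need {lower} <= ep <= {upper}, got ep={ep}")
    # "a | b" in math notation is the same as "b % a == 0" here.
    if ep % lower != 0:
        problems.append(f"divisibility: ep % {lower} must be 0, got {ep % lower}")
    if upper % ep != 0:
        problems.append(f"divisibility: {upper} % ep must be 0, got {upper % ep}")
    return problems


# The "FSDP 2, CP 2, TP 2, EP 8, ETP 1" entry from the comparison set passes.
print(check_ep_degree(ep=8, dp_shard=2, cp=2, tp=2))
# ep=6 is inside the range but fails both divisibility checks.
print(check_ep_degree(ep=6, dp_shard=2, cp=2, tp=2))
```

With power-of-2 degrees the divisibility checks are implied by the range check, which is why they only matter for unusual configurations such as ep=6.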

sanketpurandare · 2025-08-13T17:49:49Z

torchtitan/models/moe.py

@@ -143,16 +143,17 @@ def init_weights(self, init_std: float):
        nn.init.trunc_normal_(self.w3, mean=0.0, std=init_std)


-class TokenChoiceTopKRouter(nn.Module):
+class TokenRouter(nn.Module):


I feel the earlier name was more descriptive, since we are doing the torch.topk operation in this module. TokenRouter seems more like just the routing part of token choice.

sanketpurandare · 2025-08-13T17:58:25Z

I added a couple of nit comments. The loss curves look consistent. LGTM!

vwxyzjn · 2025-08-14T00:01:12Z

@tianyu-l this is very cool. Could you help clarify the difference between ETP and EP + TP at a high level? Do you have an example config to run it, please?


…tants (#160805) Used in pytorch/torchtitan#1555 Pull Request resolved: #160805 Approved by: https://github.com/StrongerXi, https://github.com/mlazos

tianyu-l requested a review from sanketpurandare August 12, 2025 05:49

tianyu-l requested review from fegin, wwwjn and wconstab as code owners August 12, 2025 05:49

meta-cla bot added the CLA Signed label Aug 12, 2025

tianyu-l force-pushed the etp branch 3 times, most recently from 8143949 to 79b2934 Compare August 13, 2025 01:51

sanketpurandare reviewed Aug 13, 2025


sanketpurandare approved these changes Aug 13, 2025


wwwjn approved these changes Aug 13, 2025


[EP] add support for ETP=1

ba020a3

tianyu-l force-pushed the etp branch from 79b2934 to ba020a3 Compare August 13, 2025 21:04

tianyu-l merged commit aeb3a4b into main Aug 13, 2025
6 of 7 checks passed

tianyu-l deleted the etp branch August 13, 2025 21:07

xmfan mentioned this pull request Aug 16, 2025

[dynamo][dist] trace DeviceMesh's get_local_rank and get_rank as constants pytorch/pytorch#160805

Closed

tianyu-l mentioned this pull request Aug 18, 2025

[EP] bug fixes #1586

Merged

tianyu-l added a commit that referenced this pull request Aug 18, 2025

[EP] bug fixes (#1586)

9233d83

fixes bug introduced in #1555
