@tianyu-l tianyu-l commented Aug 12, 2025

This is a follow-up to the original EP support in #1324.

PR summary

[TBA] description + figure

numerics verification

setup

  • optimizer Adam
  • steps 100, warmup_steps 20
  • seed 42

comparison set

  • FSDP 2
  • FSDP 2, CP 2, TP 2, EP 8, ETP 1
  • FSDP 2 (EP 2), PP 2, TP 2 (ETP 2)
[image]

@meta-cla meta-cla bot added the CLA Signed label Aug 12, 2025
@tianyu-l tianyu-l force-pushed the etp branch 3 times, most recently from 8143949 to 79b2934 Compare August 13, 2025 01:51
Expert parallelism degree. 1 means disabled. No effect for non-MoE models.
Currently, it is supported with the following constraints:
- when etp = tp: cp <= ep <= dp_shard * cp
- when etp = 1: cp * tp <= ep <= dp_shard * cp * tp
Contributor

Can we also add some comments about the divisibility constraints? For instance, we require ep % cp == 0 and dp_shard * cp % ep == 0. Or do you think this is not needed, since people usually use powers of 2?

Contributor Author

Sounds good, I can add something. FWIW, I used to use the | symbol to denote divisibility (x | y meaning y % x == 0), but people don't seem to understand what it means.

Contributor

Yeah, I think the % symbol is more common?
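
The constraints discussed in this thread can be sketched as a small validation function. This is a minimal illustration, not code from the PR: the helper name `check_ep_degree` is hypothetical, the etp = tp branch assumes the reviewer's example (ep % cp == 0 and dp_shard * cp % ep == 0) applies to that case, and the divisibility checks for etp = 1 are an assumed analogue with cp * tp in place of cp.

```python
def check_ep_degree(ep: int, dp_shard: int, cp: int, tp: int, etp: int) -> list[str]:
    """Return the violated constraints; an empty list means the degrees look valid.

    Hypothetical helper for illustration only; not part of the PR.
    """
    if ep == 1:
        # Per the docstring under review: 1 means EP is disabled.
        return []

    if etp == tp:
        # Assumption: the reviewer's example constraints apply to this case.
        unit = cp
    elif etp == 1:
        unit = cp * tp
    else:
        return [f"etp must be 1 or equal to tp ({tp}), got {etp}"]

    upper = dp_shard * unit
    errors = []
    if not (unit <= ep <= upper):
        errors.append(f"need {unit} <= ep <= {upper}, got ep = {ep}")
    if ep % unit != 0:
        errors.append(f"need ep % {unit} == 0, got ep = {ep}")
    if upper % ep != 0:
        errors.append(f"need {upper} % ep == 0, got ep = {ep}")
    return errors


# The two EP configurations from the comparison set in the PR description.
print(check_ep_degree(ep=8, dp_shard=2, cp=2, tp=2, etp=1))  # []
print(check_ep_degree(ep=2, dp_shard=2, cp=1, tp=2, etp=2))  # []
```

With power-of-2 degrees the range checks and the divisibility checks agree, which is the reviewer's point; a value such as ep = 6 with dp_shard = 4, cp = 2, tp = 1 is in range but fails the divisibility check.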

@@ -143,16 +143,17 @@ def init_weights(self, init_std: float):
         nn.init.trunc_normal_(self.w3, mean=0.0, std=init_std)
 
 
-class TokenChoiceTopKRouter(nn.Module):
+class TokenRouter(nn.Module):
Contributor

I feel the earlier name was more descriptive, since we are doing the torch.topk operation in this module. TokenRouter seems more like just the routing part of token choice.
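
For readers unfamiliar with the term, the "token choice top-k" idea behind the earlier name can be sketched as follows. This is a pure-Python stand-in for a `torch.topk` call along the expert dimension, not the module's actual code; the function name `route_top_k` and the toy scores are hypothetical.

```python
def route_top_k(scores: list[list[float]], top_k: int) -> list[list[int]]:
    """For each token (one row of scores), pick the top_k highest-scoring experts.

    scores[t][e] is the router score of expert e for token t.
    Returns, per token, the chosen expert indices in descending score order.
    """
    routed = []
    for token_scores in scores:
        # Rank expert indices by this token's scores, highest first.
        ranked = sorted(
            range(len(token_scores)),
            key=lambda e: token_scores[e],
            reverse=True,
        )
        routed.append(ranked[:top_k])
    return routed


# Two tokens, three experts, each token choosing its top 2 experts.
toy_scores = [
    [0.1, 0.7, 0.2],
    [0.5, 0.1, 0.4],
]
print(route_top_k(toy_scores, top_k=2))  # [[1, 2], [0, 2]]
```

Each token chooses its own experts, which is what "token choice" refers to, and the selection step is the top-k operation the comment mentions.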

@sanketpurandare
Contributor

I added a couple of nit comments. The loss curves look consistent. LGTM!

@tianyu-l tianyu-l merged commit aeb3a4b into main Aug 13, 2025
6 of 7 checks passed
@tianyu-l tianyu-l deleted the etp branch August 13, 2025 21:07
@vwxyzjn

vwxyzjn commented Aug 14, 2025

@tianyu-l this is very cool. Could you help clarify the difference between ETP and EP + TP at a high level? Do you have an example config to run it, please?

@tianyu-l tianyu-l mentioned this pull request Aug 18, 2025
tianyu-l added a commit that referenced this pull request Aug 18, 2025
fixes bug introduced in #1555
pytorchmergebot pushed a commit to pytorch/pytorch that referenced this pull request Aug 20, 2025