Make token group alignment size configurable #1503
Conversation
@@ -59,6 +59,13 @@ def __init__(self, job_config: JobConfig, parallel_dims: ParallelDims):
            and job_config.parallelism.tensor_parallel_degree > 1
        ), "TP not yet supported with torch.compile for mxfp8"

        # For MoE training with mxfp8, token group sizes must be multiples of 32
        if job_config.mx.moe_fqns_prototype:
            from torchtitan.experiments.llama4.infra.expert_parallel import set_token_group_alignment_size
Looks OK to me. I will do some refactoring to move `expert_parallel` into `torchtitan/distributed`. Then the import will look nicer.
if job_config.mx.moe_fqns_prototype:
    from torchtitan.experiments.llama4.infra.expert_parallel import set_token_group_alignment_size
    mxfp8_block_size = 32
    set_token_group_alignment_size(mxfp8_block_size)
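For readers of this thread, a minimal sketch of what such a setter might boil down to, assuming it simply validates and updates the module-level `TOKEN_GROUP_ALIGN_SIZE_M` constant shown in the second diff below (the actual torchtitan implementation may differ):

```python
# Hypothetical sketch, not the actual torchtitan code: the setter just
# validates and overrides the module-level alignment constant that the
# token-group padding logic reads.
TOKEN_GROUP_ALIGN_SIZE_M = 8  # default; sufficient for bf16


def set_token_group_alignment_size(alignment_size: int) -> None:
    """Require per-expert token group sizes (dim M) to be multiples of alignment_size."""
    global TOKEN_GROUP_ALIGN_SIZE_M
    if alignment_size <= 0:
        raise ValueError(f"alignment size must be positive, got {alignment_size}")
    TOKEN_GROUP_ALIGN_SIZE_M = alignment_size
```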
Don't we need to do this for Float8 as well? IIRC it supports grouped GEMM too.
Yes, but the default (16) is what's needed for float8, so there's no need to set it manually.
I'm not sure we should use 16 as the default. For bf16, is 16 needed, or is 8 enough?
I think we should still set it explicitly, in case the default changes later.
Actually, yeah, I think you're right.
- For bf16, 8 is enough (16-byte alignment / 2 bytes per elem = 8 elements).
- For fp8, 16-byte alignment / 1 byte per elem = 16 elements.
- For mxfp8, we need 32 (or `block_size`), because the scaling block size is (1 x 32), so when doing per-token-group quantization on each logically distinct subtensor, we need to ensure the contracting dim is divisible by block_size. In the backward pass, `grad_weight = (grad_output_t @ input).t()` has GEMM dims (N, M) @ (M, K), so M is the contracting dim, and group offsets are along M, so we need 32-element alignment.

Updated this accordingly.
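To make the arithmetic above concrete, here is an illustrative sketch (hypothetical helper names, not torchtitan APIs and not part of this PR) of how the per-dtype alignment and the group padding fall out:

```python
import torch

# Illustrative only: hypothetical helpers, not part of this PR or torchtitan.

def alignment_for_dtype(dtype: torch.dtype, mxfp8: bool = False, block_size: int = 32) -> int:
    """Minimum multiple for per-expert token group sizes along the M dim."""
    if mxfp8:
        # mxfp8 scales in (1 x block_size) blocks along the contracting dim,
        # so each token group must be a multiple of block_size (default 32).
        return block_size
    # Otherwise: 16-byte alignment divided by element size.
    # bf16 -> 16 / 2 = 8 elements, fp8 -> 16 / 1 = 16 elements.
    return 16 // (torch.finfo(dtype).bits // 8)


def pad_group_size(num_tokens: int, align_m: int) -> int:
    """Round a token group size up to the next multiple of align_m."""
    return ((num_tokens + align_m - 1) // align_m) * align_m


assert alignment_for_dtype(torch.bfloat16) == 8
assert alignment_for_dtype(torch.float8_e4m3fn) == 16
assert pad_group_size(70, 32) == 96  # mxfp8: a 70-token group pads up to 96
```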
@tianyu-l I addressed your comments and did a test run with the llama4 debug model in bf16 to make sure it runs correctly with the new default. However, I keep getting linter errors in CI despite pre-commit passing locally. I uninstalled the requirements-dev.txt packages, re-installed, and ran pre-commit again, but it still reports no errors locally and fails in CI. Any thoughts on how to proceed?
Please address the comments before merging.
@@ -24,6 +24,29 @@
from torch.distributed.tensor.placement_types import Placement


TOKEN_GROUP_ALIGN_SIZE_M = 8
OK for now. Later we may want to make this a private field and provide a getter function too.
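A sketch of what that follow-up could look like (hypothetical, not part of this PR): keep the alignment in a private module-level field and expose it through accessors, so callers stop reading the global directly:

```python
# Hypothetical refactor sketch, not part of this PR.
_TOKEN_GROUP_ALIGN_SIZE_M = 8  # private; default works for bf16


def set_token_group_alignment_size(alignment_size: int) -> None:
    global _TOKEN_GROUP_ALIGN_SIZE_M
    if alignment_size <= 0:
        raise ValueError(f"alignment size must be positive, got {alignment_size}")
    _TOKEN_GROUP_ALIGN_SIZE_M = alignment_size


def get_token_group_alignment_size() -> int:
    return _TOKEN_GROUP_ALIGN_SIZE_M
```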
The error says it's extra whitespace in expert_parallel.py; is it not legit?
Maybe it's legit, but my confusion is how the local linter and the CI linter are out of sync, even after re-installation.
## Summary

- For mxfp8, token group sizes must be multiples of "block_size": in the backward pass for `grad_weight = grad_output_t @ input`, the "M" (token) dimension is the contracting dimension, and each token group is a logically distinct subtensor that is scaled separately, so each token group's contracting dimension must be divisible by the mxfp8 block_size (default 32). Here is a diagram showing the problem: https://www.internalfb.com/excalidraw/EX521879
- To solve this, this PR makes the token group M alignment size configurable.

## Test plan

- Integration test with torchao passes: pytorch/ao#2642
- Did a manual test run with the llama4 debug model using bf16.