[moe training] set token group alignment size to 16 for fp8 training test #2678
In pytorch/torchtitan#1503 the default TOKEN_GROUP_ALIGNMENT_SIZE_M was changed from 16 (required for fp8) to 8 (the minimum for bf16). See that PR's description for details.
Thus, in our fp8 training tests, we need to set it to 16. This is required so that each logically distinct gemm in the grouped gemm
grad_weight = grad_output_t @ input
has a contraction dim divisible by 16. 16-byte alignment is required for the dim with stride 1, and at 1 byte per fp8 element that works out to 16 bytes / 1 byte per element = 16 elements.
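As a rough illustration of that arithmetic (not the torchao implementation), the sketch below pads hypothetical per-expert token counts so that each group's contraction dim is a multiple of 16 for fp8:

```python
# Hedged illustration only: why per-expert token group sizes must be padded
# to a multiple of 16 for fp8. fp8 elements are 1 byte each and the stride-1
# (contraction) dim needs 16-byte alignment, so 16 bytes / 1 byte = 16 elements.

FP8_ALIGNMENT_M = 16  # elements; 8 would suffice for bf16 (2 bytes per element)


def round_up(n: int, multiple: int) -> int:
    """Round n up to the nearest multiple."""
    return ((n + multiple - 1) // multiple) * multiple


# Hypothetical per-expert token counts produced by routing.
token_counts = [5, 23, 40, 12]

# Pad each logical group so its size (the contraction dim of
# grad_weight = grad_output_t @ input) is divisible by 16.
padded_counts = [round_up(c, FP8_ALIGNMENT_M) for c in token_counts]
assert all(c % FP8_ALIGNMENT_M == 0 for c in padded_counts)
print(padded_counts)  # [16, 32, 48, 16]
```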
Test plan
Test:
pytest test/prototype/moe_training/test_training.py
Without the change, the test fails; with the change, the tests pass.
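For reference, a minimal sketch of what the test change amounts to, assuming a hypothetical setter named set_token_group_alignment_size and an assumed import path (the actual torchao helper may differ):

```python
# Minimal sketch only; the setter name and import path are assumptions,
# not necessarily the actual torchao API used in the test.
import pytest


@pytest.fixture(autouse=True)
def use_fp8_token_group_alignment():
    # fp8 needs 16-element alignment on the contraction dim; the bf16
    # default of 8 is not enough for the fp8 grouped gemm backward.
    from torchao.prototype import moe_training  # assumed import path
    moe_training.set_token_group_alignment_size(16)  # hypothetical helper
    yield
```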