
Conversation

@danielvegamyhre (Contributor) commented Sep 20, 2025

Summary

Microbenchmarks

M,N,K,G                  recipe                  bf16_fwd_bwd_us    scaled_fwd_bwd_us  scaled_fwd_bwd_speedup      bf16_fwd_us    scaled_fwd_us  scaled_fwd_speedup
-----------------------  --------------------  -----------------  -------------------  ------------------------  -------------  ---------------  --------------------
(16384, 8192, 5120, 1)   MoEScalingType.MXFP8           4257.76              2928.22   1.454x                         1236.03           790.56   1.563x
(16384, 8192, 5120, 2)   MoEScalingType.MXFP8           4202.05              3236.9    1.298x                         1255.55           793.776  1.582x
(16384, 8192, 5120, 4)   MoEScalingType.MXFP8           4661.22              3249.68   1.434x                         1108.03           834.592  1.328x
(16384, 8192, 5120, 8)   MoEScalingType.MXFP8           4124.86              3656.8    1.128x                         1060.9           1073.74   0.988x
(128000, 8192, 5120, 1)  MoEScalingType.MXFP8          32988.7              23543.7    1.401x                        14311.9           6367.14   2.248x
(128000, 8192, 5120, 2)  MoEScalingType.MXFP8          39088.2              25779.2    1.516x                        18585.6           6288.45   2.956x
(128000, 8192, 5120, 4)  MoEScalingType.MXFP8          41717.3              24252.9    1.72x                         10254.1           6420.38   1.597x
(128000, 8192, 5120, 8)  MoEScalingType.MXFP8          44263.5              25204.8    1.756x                        10200             5816.16   1.754x
(16384, 1536, 5120, 1)   MoEScalingType.MXFP8            799.744             1003.49   0.797x                          250.944          273.44   0.918x
(16384, 1536, 5120, 2)   MoEScalingType.MXFP8            815.168             1037.36   0.786x                          248.832          285.824  0.871x
(16384, 1536, 5120, 4)   MoEScalingType.MXFP8            787.424              932.704  0.844x                          216.096          244.784  0.883x
(16384, 1536, 5120, 8)   MoEScalingType.MXFP8            828.416              956.464  0.866x                          244.704          259.264  0.944x
(128000, 1536, 5120, 1)  MoEScalingType.MXFP8           7694.38              7609.34   1.011x                         2100.16          1956.83   1.073x
(128000, 1536, 5120, 2)  MoEScalingType.MXFP8           6830.18              6708.77   1.018x                         3091.31          1771.42   1.745x
(128000, 1536, 5120, 4)  MoEScalingType.MXFP8           7140.34              6703.1    1.065x                         2339.94          1738.75   1.346x
(128000, 1536, 5120, 8)  MoEScalingType.MXFP8           6538.21              6231.54   1.049x                         2672.67          1600.54   1.67x
(16384, 2048, 7168, 1)   MoEScalingType.MXFP8           1458.74              1436.51   1.015x                          425.248          396.224  1.073x
(16384, 2048, 7168, 2)   MoEScalingType.MXFP8           1375.2               1498.22   0.918x                          405.44           400.512  1.012x
(16384, 2048, 7168, 4)   MoEScalingType.MXFP8           1503.79              1494.08   1.007x                          468.16           400.608  1.169x
(16384, 2048, 7168, 8)   MoEScalingType.MXFP8           1484.86              1544.26   0.962x                          427.04           429.088  0.995x
(128000, 2048, 7168, 1)  MoEScalingType.MXFP8          18171.9              11094      1.638x                         3728.38          2947.07   1.265x
(128000, 2048, 7168, 2)  MoEScalingType.MXFP8          12609.5              10944.5    1.152x                         5013.07          3030.16   1.654x
(128000, 2048, 7168, 4)  MoEScalingType.MXFP8          13331.4              11065.4    1.205x                         3588.24          2879.07   1.246x
(128000, 2048, 7168, 8)  MoEScalingType.MXFP8          13108.7              11143.3    1.176x                         3685.86          2667.09   1.382x
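
For reference, each speedup column above is the ratio of the bf16 time to the MXFP8 scaled time for the same (M, N, K, G) shape, with all times in microseconds. A minimal sketch of that calculation, using the values from the first row of the table, is below.

```python
# Sketch of how the speedup columns above are derived (values taken from the
# (M, N, K, G) = (16384, 8192, 5120, 1) row; all times in microseconds).
bf16_fwd_bwd_us = 4257.76
mxfp8_fwd_bwd_us = 2928.22
bf16_fwd_us = 1236.03
mxfp8_fwd_us = 790.56

fwd_bwd_speedup = bf16_fwd_bwd_us / mxfp8_fwd_bwd_us  # ~1.454x
fwd_speedup = bf16_fwd_us / mxfp8_fwd_us              # ~1.563x

print(f"fwd+bwd speedup: {fwd_bwd_speedup:.3f}x")
print(f"fwd speedup:     {fwd_speedup:.3f}x")
```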

Torchtitan Llama4 e2e training benchmarks (single node, FSDP2 only)

BF16 baseline:

rm -rf /tmp/torchinductor_danvm; TORCHTITAN_ROOT=/home/danvm/torchtitan CUDA_VISIBLE_DEVICES="2,3,4,5" NGPU=4 EXTRA_ARGS="--parallelism.data_parallel_shard_degree=4 --parallelism.tensor_parallel_degree=1 --model.print-after-conversion --metrics.log_freq=10 --training.steps=100 --compile.enable --training.seq_len=8192" ./llama4.sh

Median Tokens/Second (excluding step 1): 65611.0
Max Memory Usage: 109.22 GiB

MXFP8 DENSE ONLY: 13.4% speedup over bf16

rm -rf /tmp/torchinductor_danvm; TORCHTITAN_ROOT=/home/danvm/torchtitan CUDA_VISIBLE_DEVICES="2,3,4,5" NGPU=4 EXTRA_ARGS="--parallelism.data_parallel_shard_degree=4 --parallelism.tensor_parallel_degree=1 --model.converters="mx" --mx.recipe_name="mxfp8_cublas" --mx.filter_fqns="output,router.gate,wk,wv" --model.print-after-conversion --metrics.log_freq=10 --training.steps=100 --compile.enable --training.seq_len=8192" ./llama4.sh

Median Tokens/Second (excluding step 1): 74409.5
Max Memory Usage: 109.41 GiB

MXFP8 MOE + DENSE: 29.9% speedup over bf16

rm -rf /tmp/torchinductor_danvm; TORCHTITAN_ROOT=/home/danvm/torchtitan CUDA_VISIBLE_DEVICES="2,3,4,5" NGPU=4 EXTRA_ARGS="--parallelism.data_parallel_shard_degree=4 --parallelism.tensor_parallel_degree=1 --model.converters="mx" --mx.recipe_name="mxfp8_cublas" --mx.filter_fqns="output,router.gate,wk,wv" --mx.moe_fqns_prototype="experts" --model.print-after-conversion --metrics.log_freq=10 --training.steps=100 --compile.enable --training.seq_len=8192" ./llama4.sh

Median Tokens/Second (excluding step 1): 85253.0
Max Memory Usage: 107.41 GiB
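
The speedup percentages quoted above follow directly from the median tokens/second numbers; a quick sanity check of that arithmetic:

```python
# Sanity check of the e2e speedups from the median tokens/sec numbers reported above.
bf16_tps = 65611.0             # bf16 baseline
mxfp8_dense_tps = 74409.5      # MXFP8 dense only
mxfp8_moe_dense_tps = 85253.0  # MXFP8 MoE + dense

dense_speedup = mxfp8_dense_tps / bf16_tps - 1.0          # ~0.134 -> 13.4%
moe_dense_speedup = mxfp8_moe_dense_tps / bf16_tps - 1.0  # ~0.299 -> 29.9%

print(f"MXFP8 dense only:  {dense_speedup:.1%} over bf16")
print(f"MXFP8 MoE + dense: {moe_dense_speedup:.1%} over bf16")
```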


pytorch-bot bot commented Sep 20, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3037

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 85ecd0b with merge base d2fae7a:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla bot added the CLA Signed label Sep 20, 2025
@danielvegamyhre added the mx, moe, and topic: not user facing labels and removed the CLA Signed label Sep 20, 2025
@meta-cla bot re-added the CLA Signed label Sep 20, 2025
@drisspg (Contributor) left a comment:

This should only have an impact on bwd perf, correct?

@danielvegamyhre (Contributor, Author) replied:

"This should only have an impact on bwd perf, correct?"

That's correct.

@danielvegamyhre merged commit 9d88c16 into main Sep 22, 2025
18 checks passed
Labels: CLA Signed, moe, mx, topic: not user facing