
Conversation

@danielvegamyhre (Contributor) commented Sep 20, 2025

Summary

Microbenchmarks

M,N,K,G                  recipe                  bf16_fwd_bwd_us    scaled_fwd_bwd_us  scaled_fwd_bwd_speedup      bf16_fwd_us    scaled_fwd_us  scaled_fwd_speedup
-----------------------  --------------------  -----------------  -------------------  ------------------------  -------------  ---------------  --------------------
(16384, 8192, 5120, 1)   MoEScalingType.MXFP8           4257.76              2928.22   1.454x                         1236.03           790.56   1.563x
(16384, 8192, 5120, 2)   MoEScalingType.MXFP8           4202.05              3236.9    1.298x                         1255.55           793.776  1.582x
(16384, 8192, 5120, 4)   MoEScalingType.MXFP8           4661.22              3249.68   1.434x                         1108.03           834.592  1.328x
(16384, 8192, 5120, 8)   MoEScalingType.MXFP8           4124.86              3656.8    1.128x                         1060.9           1073.74   0.988x
(128000, 8192, 5120, 1)  MoEScalingType.MXFP8          32988.7              23543.7    1.401x                        14311.9           6367.14   2.248x
(128000, 8192, 5120, 2)  MoEScalingType.MXFP8          39088.2              25779.2    1.516x                        18585.6           6288.45   2.956x
(128000, 8192, 5120, 4)  MoEScalingType.MXFP8          41717.3              24252.9    1.72x                         10254.1           6420.38   1.597x
(128000, 8192, 5120, 8)  MoEScalingType.MXFP8          44263.5              25204.8    1.756x                        10200             5816.16   1.754x
(16384, 1536, 5120, 1)   MoEScalingType.MXFP8            799.744             1003.49   0.797x                          250.944          273.44   0.918x
(16384, 1536, 5120, 2)   MoEScalingType.MXFP8            815.168             1037.36   0.786x                          248.832          285.824  0.871x
(16384, 1536, 5120, 4)   MoEScalingType.MXFP8            787.424              932.704  0.844x                          216.096          244.784  0.883x
(16384, 1536, 5120, 8)   MoEScalingType.MXFP8            828.416              956.464  0.866x                          244.704          259.264  0.944x
(128000, 1536, 5120, 1)  MoEScalingType.MXFP8           7694.38              7609.34   1.011x                         2100.16          1956.83   1.073x
(128000, 1536, 5120, 2)  MoEScalingType.MXFP8           6830.18              6708.77   1.018x                         3091.31          1771.42   1.745x
(128000, 1536, 5120, 4)  MoEScalingType.MXFP8           7140.34              6703.1    1.065x                         2339.94          1738.75   1.346x
(128000, 1536, 5120, 8)  MoEScalingType.MXFP8           6538.21              6231.54   1.049x                         2672.67          1600.54   1.67x
(16384, 2048, 7168, 1)   MoEScalingType.MXFP8           1458.74              1436.51   1.015x                          425.248          396.224  1.073x
(16384, 2048, 7168, 2)   MoEScalingType.MXFP8           1375.2               1498.22   0.918x                          405.44           400.512  1.012x
(16384, 2048, 7168, 4)   MoEScalingType.MXFP8           1503.79              1494.08   1.007x                          468.16           400.608  1.169x
(16384, 2048, 7168, 8)   MoEScalingType.MXFP8           1484.86              1544.26   0.962x                          427.04           429.088  0.995x
(128000, 2048, 7168, 1)  MoEScalingType.MXFP8          18171.9              11094      1.638x                         3728.38          2947.07   1.265x
(128000, 2048, 7168, 2)  MoEScalingType.MXFP8          12609.5              10944.5    1.152x                         5013.07          3030.16   1.654x
(128000, 2048, 7168, 4)  MoEScalingType.MXFP8          13331.4              11065.4    1.205x                         3588.24          2879.07   1.246x
(128000, 2048, 7168, 8)  MoEScalingType.MXFP8          13108.7              11143.3    1.176x                         3685.86          2667.09   1.382x
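
For reference, each speedup column above is the ratio of the bf16 time to the MXFP8 scaled time for the same (M, N, K, G) shape, with all times in microseconds. A minimal sketch of that calculation, using the values from the first row of the table, is below.

```python
# Sketch of how the speedup columns above are derived (values taken from the
# (M, N, K, G) = (16384, 8192, 5120, 1) row; all times in microseconds).
bf16_fwd_bwd_us = 4257.76
mxfp8_fwd_bwd_us = 2928.22
bf16_fwd_us = 1236.03
mxfp8_fwd_us = 790.56

fwd_bwd_speedup = bf16_fwd_bwd_us / mxfp8_fwd_bwd_us  # ~1.454x
fwd_speedup = bf16_fwd_us / mxfp8_fwd_us              # ~1.563x

print(f"fwd+bwd speedup: {fwd_bwd_speedup:.3f}x")
print(f"fwd speedup:     {fwd_speedup:.3f}x")
```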

Torchtitan Llama4 e2e training benchmarks (single node, FSDP2 only)

BF16 baseline:

rm -rf /tmp/torchinductor_danvm; TORCHTITAN_ROOT=/home/danvm/torchtitan CUDA_VISIBLE_DEVICES="2,3,4,5" NGPU=4 EXTRA_ARGS="--parallelism.data_parallel_shard_degree=4 --parallelism.tensor_parallel_degree=1 --model.print-after-conversion --metrics.log_freq=10 --training.steps=100 --compile.enable --training.seq_len=8192" ./llama4.sh

Median Tokens/Second (excluding step 1): 65611.0
Max Memory Usage: 109.22 GiB

MXFP8 DENSE ONLY: 13.4% speedup over bf16

rm -rf /tmp/torchinductor_danvm; TORCHTITAN_ROOT=/home/danvm/torchtitan CUDA_VISIBLE_DEVICES="2,3,4,5" NGPU=4 EXTRA_ARGS="--parallelism.data_parallel_shard_degree=4 --parallelism.tensor_parallel_degree=1 --model.converters="mx" --mx.recipe_name="mxfp8_cublas" --mx.filter_fqns="output,router.gate,wk,wv" --model.print-after-conversion --metrics.log_freq=10 --training.steps=100 --compile.enable --training.seq_len=8192" ./llama4.sh

Median Tokens/Second (excluding step 1): 74409.5
Max Memory Usage: 109.41 GiB

MXFP8 MOE + DENSE: 29.9% speedup over bf16

rm -rf /tmp/torchinductor_danvm; TORCHTITAN_ROOT=/home/danvm/torchtitan CUDA_VISIBLE_DEVICES="2,3,4,5" NGPU=4 EXTRA_ARGS="--parallelism.data_parallel_shard_degree=4 --parallelism.tensor_parallel_degree=1 --model.converters="mx" --mx.recipe_name="mxfp8_cublas" --mx.filter_fqns="output,router.gate,wk,wv" --mx.moe_fqns_prototype="experts" --model.print-after-conversion --metrics.log_freq=10 --training.steps=100 --compile.enable --training.seq_len=8192" ./llama4.sh

Median Tokens/Second (excluding step 1): 85253.0
Max Memory Usage: 107.41 GiB
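
The speedup percentages quoted above follow directly from the median tokens/second numbers; a quick sanity check of that arithmetic:

```python
# Sanity check of the e2e speedups from the median tokens/sec numbers reported above.
bf16_tps = 65611.0             # bf16 baseline
mxfp8_dense_tps = 74409.5      # MXFP8 dense only
mxfp8_moe_dense_tps = 85253.0  # MXFP8 MoE + dense

dense_speedup = mxfp8_dense_tps / bf16_tps - 1.0          # ~0.134 -> 13.4%
moe_dense_speedup = mxfp8_moe_dense_tps / bf16_tps - 1.0  # ~0.299 -> 29.9%

print(f"MXFP8 dense only:  {dense_speedup:.1%} over bf16")
print(f"MXFP8 MoE + dense: {moe_dense_speedup:.1%} over bf16")
```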


pytorch-bot bot commented Sep 20, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3037

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 85ecd0b with merge base d2fae7a:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla bot added the CLA Signed label Sep 20, 2025
@danielvegamyhre added the mx, moe, and topic: not user facing labels and removed the CLA Signed label Sep 20, 2025
@meta-cla bot re-added the CLA Signed label Sep 20, 2025
@drisspg (Contributor) left a comment:

This should only have an impact on bwd perf, correct?

@danielvegamyhre (Contributor, Author) replied:

"This should only have an impact on bwd perf, correct?"

That's correct.

@danielvegamyhre merged commit 9d88c16 into main Sep 22, 2025
18 checks passed
Labels: CLA Signed, moe, mx, topic: not user facing