[moe training] add bench script for fp8 rowwise kernels and update autotune configs #2697
base: danielvegamyhre/stack/30
Conversation
🔗 Helpful Links: see artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2697
```python
print(tabulate(rows, headers=headers))

def benchmark_cuda_function_in_microseconds(f, *args):
```
we have so many of these, maybe reuse in a separate PR?
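For reference, a minimal sketch of what a shared helper like this might look like, timing with CUDA events. The function name matches the snippet above, but the warmup and iteration counts are assumptions, and the actual implementation in this PR may differ (e.g. it could instead wrap triton.testing.do_bench):

```python
import torch

def benchmark_cuda_function_in_microseconds(f, *args):
    # Warm up so one-time costs (compilation, autotuning,
    # cache population) don't pollute the measurement.
    for _ in range(3):
        f(*args)
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    n_iters = 10  # assumed iteration count
    start.record()
    for _ in range(n_iters):
        f(*args)
    end.record()
    torch.cuda.synchronize()

    # Event.elapsed_time returns milliseconds; convert to
    # microseconds per iteration.
    return start.elapsed_time(end) * 1e3 / n_iters
```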
Stacked PRs (branch: danielvegamyhre/stack/31):
[moe training] add bench script for fp8 rowwise kernels and update autotune configs
Performance vs torch.compile
It's faster than torch.compile for the Llama 4 shape (16, 5120, 4*5120), but slower for skinny shapes.
There is more we can do. For example, writing row-major outputs was roughly 2x faster, but we need the outputs in column-major layout. I can probably look at this with NCU and figure out what's going on, but for now this is a start.
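To make the layout point concrete, here is a small illustration (not code from this PR) of the difference between row-major and column-major outputs in PyTorch. The assumption here is that a downstream scaled GEMM such as torch._scaled_mm wants its second operand column-major, which would be why the quantization kernel must write col-major even though row-major writes benchmarked faster:

```python
import torch

# Llama 4-style shape from the benchmark above.
M, K = 16, 5120
x = torch.randn(M, K, device="cuda")

# Row-major (contiguous) fp8 output: stride (K, 1).
row_major = x.to(torch.float8_e4m3fn)
assert row_major.stride() == (K, 1)

# Column-major fp8 output: same logical shape, stride (1, M).
# The transpose/contiguous/transpose dance is one eager-mode way to
# build this layout; a custom kernel would write directly into it.
col_major = row_major.t().contiguous().t()
assert col_major.stride() == (1, M)
```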