Add Triton kernels for fp8 blockwise quantization and GEMMs #2617
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2617
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 14420ef with merge base 0e00df3. This comment was automatically generated by Dr. CI and updates every 15 minutes.
ref_fp32 = ref_fp8.to(torch.float32)

# Check that the quantized tensors are close
assert torch.allclose(triton_fp32, ref_fp32, rtol=1e-3, atol=1e-3), (
shouldn't this be bit exact? If it's not bit exact and there's no exact reason why not, I wouldn't really trust the triton kernel.
IMO I would just go with torch native kernels for everything except the gemms for now (since they are the easiest to verify numerical correctness for), and leave writing triton kernels for quant of weights/activations as a future thing
Yes, updated the tests to use torch.equal instead of allclose to assert bitwise equivalence, let me know what you think.
sqnr = compute_error(C, C_q)
min_sqnr = 28.0
print(f"blockwise_fp8_gemm_1x128_128x128 ({M},{N},{K}) SQNR: {sqnr}")
remove the prints before landing
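For context on the threshold above, compute_error returns an SQNR in dB. A minimal sketch of how such a helper is typically defined (an assumption for illustration, not necessarily torchao's exact implementation):

```python
import torch

def compute_error_sketch(ref: torch.Tensor, test: torch.Tensor) -> torch.Tensor:
    # SQNR in dB: ratio of signal power to quantization-noise power.
    signal = torch.linalg.norm(ref.float())
    noise = torch.linalg.norm(ref.float() - test.float())
    return 20 * torch.log10(signal / noise)

# The test then requires compute_error(C, C_q) >= min_sqnr (28 dB above).
```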
]

@pytest.mark.skipif(not torch.cuda.is_available(), reason="CUDA not available")
check for sm90?
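One way to implement the suggested check, sketched here; the exact predicate and skip message in the final tests may differ:

```python
import pytest
import torch

# Skip unless running on an SM90 (Hopper) GPU, since the fp8 blockwise
# kernels target that architecture. `is_sm90` is a name used only for
# this sketch.
is_sm90 = torch.cuda.is_available() and torch.cuda.get_device_capability() >= (9, 0)

@pytest.mark.skipif(not is_sm90, reason="requires CUDA compute capability 9.0 (sm90)")
def test_blockwise_fp8_kernels_placeholder():
    ...
```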
ref_fp32 = ref_fp8.to(torch.float32)

# Check that the quantized tensors are close
assert torch.equal(triton_fp32, ref_fp32), (
nit: torch.testing.assert_close(..., rtol=0, atol=0) everywhere
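A minimal standalone illustration of the suggested pattern (toy tensors here, not the PR's actual test variables):

```python
import torch

x = torch.randn(4, 4)
y = x.clone()

# rtol=0, atol=0 enforces exact value equality (like torch.equal) while
# still printing a descriptive mismatch report on failure.
torch.testing.assert_close(x, y, rtol=0, atol=0)
```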
Thanks for the review @vkuzo, I finished addressing your comments; this is ready for another look.
Stacked PRs:
Add Triton kernels for fp8 blockwise quantization and GEMMs
GEMMs (reference math for these is sketched below):
- blockwise_fp8_gemm_1x128_128x128, used for:
  - out = input @ weight.T
  - grad_input = grad_output @ weight
- blockwise_fp8_gemm_1x128_128x1, used for:
  - grad_weight = grad_output.T @ input
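For reference, a plain (unquantized) sketch of the three matmuls the two GEMMs above cover in a linear layer's forward and backward. The kernel names are from the list above, but the shapes here are made up and this is only the math, not the kernel API:

```python
import torch

M, K, N = 256, 512, 128  # example shapes, not from the PR
input = torch.randn(M, K, dtype=torch.bfloat16)
weight = torch.randn(N, K, dtype=torch.bfloat16)
grad_output = torch.randn(M, N, dtype=torch.bfloat16)

# blockwise_fp8_gemm_1x128_128x128 covers these two:
out = input @ weight.T             # forward
grad_input = grad_output @ weight  # backward w.r.t. input

# blockwise_fp8_gemm_1x128_128x1 covers this one:
grad_weight = grad_output.T @ input  # backward w.r.t. weight
```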
Quantization (a pure-PyTorch sketch of the 1x128 scaling scheme follows this list):
- fp8_blockwise_act_quant_lhs
- fp8_blockwise_act_quant_rhs
- fp8_blockwise_act_quant_transposed_lhs
- fp8_blockwise_weight_quant_rhs
- fp8_blockwise_weight_quant_transposed_rhs
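As referenced above, a hedged pure-PyTorch sketch of 1x128 groupwise quantization to float8_e4m3fn, i.e. the kind of scaling the activation-quant kernels implement in Triton; the actual kernel signatures, output dtypes, and scale layouts in this PR may differ:

```python
import torch

def quant_1x128_reference(x: torch.Tensor, block_size: int = 128):
    # x: (M, K) with K divisible by block_size. Each row is split into
    # 1x128 groups; each group gets one scale mapping its absmax to fp8 max.
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    M, K = x.shape
    xb = x.reshape(M, K // block_size, block_size)
    amax = xb.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    scale = amax / fp8_max                        # dequant scale per group
    x_fp8 = (xb / scale).to(torch.float8_e4m3fn).reshape(M, K)
    return x_fp8, scale.squeeze(-1)               # scales: (M, K // block_size)

x = torch.randn(4, 256)
x_fp8, scales = quant_1x128_reference(x)
```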
Test plan
pytest test/prototype/blockwise_fp8_training/test_blockwise_kernels.py
Attempted usage of DeepGEMM cutlass kernels
Unfortunately the GEMM APIs in @vkuzo's PoC here no longer exist in DeepGEMM. I tried using the new GEMM APIs (fp8_gemm_nt, etc.), and:

Attempted usage of torch._scaled_mm
I also tried using torch._scaled_mm in torch nightly, and the error messages indicate it does not support a groupwise scaled "A" tensor with a blockwise scaled "B" tensor, so I went ahead and finished writing these Triton GEMMs.
Today, however, I talked to Luca and it seems the error message is inaccurate and this is indeed supported, but it will require some changes to scale strides and alignment to adhere to the API requirements.
If we want to make this prototype more performant, I can make these changes and swap out the GEMMs to use torch._scaled_mm (and update the PT core error message to be more accurate).
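For reference, the general shape of such a call is sketched below. The scale shapes/layouts for the groupwise-A x blockwise-B case are exactly the open question about strides and alignment, so treat them as assumptions; as written this may still raise the error described above:

```python
import torch

M, K, N = 256, 512, 128
a_fp8 = torch.randn(M, K, device="cuda").to(torch.float8_e4m3fn)
# _scaled_mm expects the second operand in column-major layout.
b_fp8 = torch.randn(N, K, device="cuda").to(torch.float8_e4m3fn).t()  # (K, N)

# Assumed scale layouts: 1x128 groupwise for A, 128x128 blockwise for B;
# the required strides/alignment here are the unresolved part.
scale_a = torch.ones(M, K // 128, device="cuda", dtype=torch.float32)
scale_b = torch.ones(K // 128, N // 128, device="cuda", dtype=torch.float32)

out = torch._scaled_mm(a_fp8, b_fp8, scale_a, scale_b, out_dtype=torch.bfloat16)
```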