Add Float8BlockwiseLinear for training #2618
Conversation
stack-info: PR: #2618, branch: danielvegamyhre/stack/18
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2618
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 90376ea with merge base 6b82931.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Force-pushed: 4c0250f → 89357e5
Force-pushed: c1683b6 → 21a36dc
Force-pushed: 21a36dc → eca0126
Force-pushed: eca0126 → 5bfc200
The integration looks good, but IMO we should use PyTorch native code for quantization of weights/activations, and only add Triton kernels when we're confident they match the PyTorch native code with bitwise accuracy. Without this, it's hard to trust and debug the accuracy of this prototype.
Force-pushed: 5bfc200 → cb92b94
Makes sense. I updated the prior PR to assert bitwise equivalence, let me know what you think.
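For illustration, a minimal sketch (not the actual test in this PR) of asserting bitwise equivalence between a Triton blockwise-fp8 quantization kernel and a plain-PyTorch reference; the kernel name `triton_quantize_fp8_blockwise` and the scaling convention below are assumptions:

```python
import torch

def ref_quantize_fp8_blockwise(x: torch.Tensor, block_size: int = 128):
    # Plain-PyTorch reference: scale each (block_size x block_size) tile by its
    # absmax so the tile fits the float8_e4m3fn range.
    M, K = x.shape
    tiles = x.reshape(M // block_size, block_size, K // block_size, block_size)
    amax = tiles.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-12)
    scale = torch.finfo(torch.float8_e4m3fn).max / amax.float()
    x_fp8 = (tiles.float() * scale).to(torch.float8_e4m3fn).reshape(M, K)
    return x_fp8, scale.squeeze()

x = torch.randn(256, 512, device="cuda", dtype=torch.bfloat16)
ref_fp8, ref_scale = ref_quantize_fp8_blockwise(x)

# With a Triton kernel exposed as e.g. triton_quantize_fp8_blockwise (hypothetical name),
# bitwise equivalence can be asserted by comparing the raw fp8 bytes:
# kernel_fp8, kernel_scale = triton_quantize_fp8_blockwise(x)
# assert torch.equal(kernel_fp8.view(torch.uint8), ref_fp8.view(torch.uint8))
# assert torch.equal(kernel_scale, ref_scale)
```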
from torchao.prototype.blockwise_fp8_training.linear import Float8BlockwiseLinear

@pytest.mark.skipif(not torch.cuda.is_available(), reason="CUDA not available")
sm90?
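A minimal sketch of what an SM90 gate could look like, assuming `torch.cuda.get_device_capability()` is used for the check (the test name is illustrative):

```python
import pytest
import torch

# Skip unless an SM90 (e.g. H100) GPU is present, since the fp8 kernels target that arch.
is_sm90 = torch.cuda.is_available() and torch.cuda.get_device_capability() >= (9, 0)

@pytest.mark.skipif(not is_sm90, reason="requires CUDA capability 9.0+ (e.g. H100)")
def test_fp8_blockwise_linear_smoke():
    ...
```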
if in_features % block_size != 0 or out_features % block_size != 0:
    pytest.skip(f"Dimensions must be divisible by block_size={block_size}")

torch.random.manual_seed(0)
Usually people do this once next to the imports, and then use copy.deepcopy to create copies of models. It's unusual to set the seed twice to get the same effect, even if it works.
Makes sense, updated.
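For reference, a minimal sketch of the suggested pattern (seed set once next to the imports, copy.deepcopy for the model copies); the layer shape here is illustrative:

```python
import copy

import torch

# Seed once, near the imports, so every test sees the same RNG state.
torch.random.manual_seed(0)

ref_model = torch.nn.Linear(4096, 4096, bias=False, device="cuda", dtype=torch.bfloat16)
# Identical starting weights without re-seeding.
test_model = copy.deepcopy(ref_model)
```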
Force-pushed: cb92b94 → 2ffbc4f
Force-pushed: 2ffbc4f → 33255a2
Force-pushed: 33255a2 → ef21071
Force-pushed: ef21071 → 90376ea
Stacked PRs:
Add Float8BlockwiseLinear for training
Test plan
pytest test/prototype/blockwise_fp8_training/test_blockwise_linear.py
Limitations
Performance and next steps
Perf is poor (4.4k TPS vs 6.3k TPS for bf16), largely due to slow GEMMs. As mentioned in #2617, if we want to improve perf, I can make the changes necessary for compatibility with torch._scaled_mm and see if perf improves (very likely, I'd say).
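For context, a minimal sketch of a torch._scaled_mm call with per-tensor scales; the blockwise-scaled variant would pass non-scalar scale tensors, and the exact scale layouts required for blockwise scaling are an assumption here, not something this PR implements:

```python
import torch

M, K, N = 256, 512, 1024
a = torch.randn(M, K, device="cuda", dtype=torch.bfloat16)
b = torch.randn(K, N, device="cuda", dtype=torch.bfloat16)

# _scaled_mm expects a row-major fp8 LHS and a column-major fp8 RHS.
a_fp8 = a.to(torch.float8_e4m3fn)
b_fp8 = b.t().contiguous().t().to(torch.float8_e4m3fn)

# Per-tensor scales for illustration; real code would compute these from amax.
scale_a = torch.tensor(1.0, device="cuda")
scale_b = torch.tensor(1.0, device="cuda")

out = torch._scaled_mm(a_fp8, b_fp8, scale_a=scale_a, scale_b=scale_b, out_dtype=torch.bfloat16)
```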