[not for land] float8 blockwise scaling training prototype using deep_gemm #2386

vkuzo · 2025-06-16T20:01:08Z

Since this is a common community request, I did a test drive of how we could integrate deep_gemm into an e2e training workflow. deep_gemm (https://github.com/deepseek-ai/DeepGEMM) provides the following things:

fwd and bwd gemms for dense linear (tested in this PR)
fwd and bwd grouped gemms for MoE (not tested yet)

What I saw:

gemm performance: good on my H100! (seeing ~60% of peak TOPs)
toy linear with correct numerics: done
e2e performance work: not started, and currently not planned. I did notice that 128x128 scaling of a single tensor is not torch.compile friendly as written - it results in 3 kernels per tensor.
trying this e2e in torchtitan: [not for land] testing out float8 128_1_128_128 blockwise scaling torchtitan#1317
- if we use the 128_1_128_1 gemm, currently crashes during the backward with illegal memory access, seems like this is specific to the result of grad_weight: https://gist.github.com/vkuzo/6e9cacb226593f7e5f27ac5cd5e79fb1. For now, work around this by leaving the gemm to calculate grad_weight in bf16. Something is funky with how we are wrapping the 128_1_128_1 gemm.

If we were to integrate this, here is the path forward:

either make deep_gemms 128_1_128_1 gemm work properly, or write our own, or just leave this matmul in bf16
get a fast version of 128x128 scaling going with a handwritten kernel, and file a compile issue to have compile catch up. This seems generally useful for other formats as well.
optimize performance
integrate as a recipe into the main float8 training path

pytorch-bot · 2025-06-16T20:01:11Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2386

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 10 New Failures

As of commit c2115b5 with merge base 5bdc25d ():

NEW FAILURES - The following jobs have failed:

Code Analysis with Ruff / build (3.9) (gh)
Process completed with exit code 1.
PR Label Check / Check PR Labels (gh)
Process completed with exit code 1.
Run Regression Tests / test (CPU 2.5.1, linux.4xlarge, torch==2.5.1 --index-url https://download.pytorch.org/whl/cpu, cp... / linux-job (gh)
RuntimeError: Command docker exec -t a7b6612e45f28a6da743a22e0a98d2f0c86f706b8aeb3f661806620aa47122de /exec failed with exit code 2
Run Regression Tests / test (CPU 2.6, linux.4xlarge, torch==2.6.0 --index-url https://download.pytorch.org/whl/cpu, cpu) / linux-job (gh)
RuntimeError: Command docker exec -t 4db0da10221f31928ca9bd94602648a21948c8e88e49c3963b79493b69fddf38 /exec failed with exit code 2
Run Regression Tests / test (CPU 2.7, linux.4xlarge, torch==2.7.0 --index-url https://download.pytorch.org/whl/cpu, cpu) / linux-job (gh)
RuntimeError: Command docker exec -t 582d46c9128d9f24ee3ae34f5c08cda8529efcaab53d27c760e5b5e4baa3094c /exec failed with exit code 2
Run Regression Tests / test (CUDA 2.5.1, linux.g5.12xlarge.nvidia.gpu, torch==2.5.1 --index-url https://download.pytorch... / linux-job (gh)
RuntimeError: Command docker exec -t 95fd11bbbfdeb7c568e69e23457c01266a98f1966d7cac04724ea77c21c368a9 /exec failed with exit code 2
Run Regression Tests / test (CUDA 2.6, linux.g5.12xlarge.nvidia.gpu, torch==2.6.0, cuda, 12.6) / linux-job (gh)
RuntimeError: Command docker exec -t d88186ff7b0afc1bf8a0279aa027202ad1dbed328c0cfedba426877706596833 /exec failed with exit code 2
Run Regression Tests / test (CUDA 2.7, linux.g5.12xlarge.nvidia.gpu, torch==2.7.0, cuda, 12.6) / linux-job (gh)
RuntimeError: Command docker exec -t e71f51c68df6d20165ce61844146371275b80fd5194fd15ba557c1a30ffa7294 /exec failed with exit code 2
Run Regression Tests / test-nightly (CPU Nightly, linux.4xlarge, --pre torch --index-url https://download.pytorch.org/wh... / linux-job (gh)
RuntimeError: Command docker exec -t fc020600418e892622d90a9d7e64cbd9ef1fbc34c557a271b3544d3f90b3c73e /exec failed with exit code 2
Run Regression Tests / test-nightly (CUDA Nightly, linux.g5.12xlarge.nvidia.gpu, --pre torch --index-url https://downloa... / linux-job (gh)
RuntimeError: Command docker exec -t ca908711b7b88bea803315f1f8c805eed98ef15e28a8d34d8d479542e027cd5e /exec failed with exit code 2

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:

Summary: Test drive of pytorch/ao#2386, not for land Test Plan: ```bash with-proxy CONFIG_FILE="./torchtitan/models/llama3/train_configs/debug_model.toml" ./run_train.sh --model.converters float8 --model.print_after_conversion ``` Reviewers: Subscribers: Tasks: Tags:

facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 16, 2025

vkuzo changed the title ~~20250616 deepgemm hack~~ [not for land] 20250616 deepgemm hack Jun 16, 2025

vkuzo force-pushed the 20250616_deepgemm_hack branch 10 times, most recently from c4df31a to a2a31eb Compare June 17, 2025 19:54

vkuzo changed the title ~~[not for land] 20250616 deepgemm hack~~ [not for land] try blockwise scaling using deep_gemm Jun 18, 2025

vkuzo changed the title ~~[not for land] try blockwise scaling using deep_gemm~~ [not for land] float8 blockwise scaling training prototype using deep_gemm Jun 18, 2025

not for land: deepgemm hack

c2115b5

Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:

vkuzo force-pushed the 20250616_deepgemm_hack branch from a2a31eb to c2115b5 Compare June 18, 2025 12:44

vkuzo mentioned this pull request Jun 18, 2025

[not for land] testing out float8 128_1_128_128 blockwise scaling pytorch/torchtitan#1317

Open

danielvegamyhre mentioned this pull request Jul 25, 2025

make fp8 blockwise linear differentiable; use new kernels #2602

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[not for land] float8 blockwise scaling training prototype using deep_gemm #2386

[not for land] float8 blockwise scaling training prototype using deep_gemm #2386

Uh oh!

vkuzo commented Jun 16, 2025 •

edited

Loading

Uh oh!

pytorch-bot bot commented Jun 16, 2025 •

edited

Loading

Uh oh!

Uh oh!

[not for land] float8 blockwise scaling training prototype using deep_gemm #2386

Are you sure you want to change the base?

[not for land] float8 blockwise scaling training prototype using deep_gemm #2386

Uh oh!

Conversation

vkuzo commented Jun 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pytorch-bot bot commented Jun 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2386

❌ 10 New Failures

Uh oh!

Uh oh!

vkuzo commented Jun 16, 2025 •

edited

Loading

pytorch-bot bot commented Jun 16, 2025 •

edited

Loading