## MXFP8 Training on B200 GPUs

MXFP8 training can provide substantial training speedups for models where the majority of GEMMs are sufficiently large. MXFP8 is a microscaling format from the [MX OCP spec](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf) that uses block-based scaling to maintain numerical accuracy while leveraging low-precision tensor cores. On NVIDIA B200 GPUs, MXFP8 training achieves up to **28% speedup** over the bfloat16 baseline with minimal accuracy degradation.

> **📖 For a comprehensive case study of using TorchTitan MXFP8 to train dense models at scale**, see our blog post: [Accelerating 2K+ Scale Pre-training up to 1.28x with TorchAO MXFP8 and TorchTitan on Crusoe B200 Cluster](https://pytorch.org/blog/accelerating-2k-scale-pre-training-up-to-1-28x-with-torchao-mxfp8-and-torchtitan-on-crusoe-b200-cluster/)

### Table of Contents

- [Requirements](#requirements)
- [How MXFP8 Works](#how-mxfp8-works)
- [MXFP8 for Linear Modules](#mxfp8-for-linear-modules)
  - [Usage](#usage)
- [MXFP8 for Grouped GEMMs (MoE)](#mxfp8-for-grouped-gemms-moe)
  - [Usage](#usage-1)
- [Example TOML Configuration](#example-toml-configuration)
- [Performance](#performance)
  - [Dense Models](#dense-models)
  - [MoE models](#moe-models)
- [Composability](#composability)
- [Known Limitations](#known-limitations)
- [Additional Resources](#additional-resources)

### Requirements

- NVIDIA B200 (SM100 or SM100a)
- PyTorch nightly
- TorchAO v0.14.0 or newer ([TorchAO Installation Guide](https://github.com/pytorch/ao#installation))

Note: GB200 is also supported but requires building torchao from source (see installation guide above).

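As a quick, illustrative sanity check of the hardware requirement (not part of TorchTitan), you can verify that PyTorch reports the Blackwell compute capability, since SM100 corresponds to capability 10.0:

```python
import torch

# B200 (SM100) reports compute capability (10, 0); native MXFP8 GEMMs need 10.0+.
major, minor = torch.cuda.get_device_capability()
print(torch.cuda.get_device_name(0), f"sm_{major}{minor}")
assert (major, minor) >= (10, 0), "MXFP8 training requires B200-class (SM100) or newer GPUs"
```
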
### How MXFP8 Works

MXFP8 differs from standard Float8 training in its scaling approach:

- **Block-based scaling**: Instead of using a single scale factor per tensor (tensorwise) or per row/column (rowwise), MXFP8 uses block-based scaling with a default block size of 1x32 elements. Each block of 32 elements shares a common scale factor (see the sketch after this list). The data dtype is `torch.float8_e4m3fn`, and the scale factor dtype is `torch.float8_e8m0fnu`.
- **Native hardware support**: On NVIDIA B200 (Blackwell) GPUs, MXFP8 GEMMs are accelerated using cuBLAS kernels exposed via `torch._scaled_mm`, achieving up to 2x speedup over bfloat16 on common shapes.
- **Dynamic quantization**: Both activations and weights are dynamically quantized to MXFP8 during forward and backward passes, with high-precision accumulation.

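To make the block-based scaling concrete, below is a minimal Python sketch of a 1x32-block quantization. The function name and the simplified power-of-two exponent rounding are illustrative assumptions, not torchao's implementation: the production kernels are fused, handle padding, and store scales as `torch.float8_e8m0fnu` rather than fp32.

```python
import torch

def mxfp8_quantize_rowwise_blocks(x: torch.Tensor, block_size: int = 32):
    """Illustrative 1x32-block MXFP8 quantization (not the torchao kernel)."""
    rows, cols = x.shape
    assert cols % block_size == 0, "this sketch assumes no padding is needed"
    blocks = x.reshape(rows, cols // block_size, block_size).float()

    # Pick a power-of-two scale per block so the block's absolute max lands
    # within the float8_e4m3fn range (max normal value = 448). Rounding the
    # exponent up (ceiling) guarantees no overflow; the recipes differ in
    # exactly how this rounding is done.
    amax = blocks.abs().amax(dim=-1, keepdim=True).clamp(min=2.0**-126)
    scale = torch.exp2(torch.ceil(torch.log2(amax / 448.0)))

    data = (blocks / scale).to(torch.float8_e4m3fn).reshape(rows, cols)
    return data, scale.squeeze(-1)  # scales kept in fp32 here for readability

x = torch.randn(4, 128, dtype=torch.bfloat16)
data, scales = mxfp8_quantize_rowwise_blocks(x)
print(data.dtype, scales.shape)  # torch.float8_e4m3fn, torch.Size([4, 4])
```
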
### MXFP8 for Linear Modules

#### Usage

To enable MXFP8 training for linear layers, launch your training job with the following command (or, alternatively, set the corresponding options in your TOML config file):

```bash
CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" ./run_train.sh \
  --model.converters="quantize.linear.mx" \
  --quantize.linear.mx.recipe_name="mxfp8_cublas" \
  --compile.enable
```

**Configuration Options:**

* `--model.converters="quantize.linear.mx"`: Swap `nn.Linear` modules for `MXLinear` so their matmuls run in MXFP8.
* `--quantize.linear.mx.recipe_name="mxfp8_cublas"`: Use the cuBLAS-based MXFP8 recipe for best performance on B200 GPUs. Alternative: `"mxfp8_cublas_rceil"` uses round-ceiling mode for scale calculation.
* `--quantize.linear.mx.mxfp8_dim1_cast_kernel_choice="triton"`: Choose the kernel for dimension-1 quantization. Options: `"triton"` (default), `"cuda"`, or `"torch"`.
* `--quantize.linear.mx.filter_fqns="..."` (optional): Comma-separated list of fully qualified names of modules that should *not* be converted to MXFP8 (see the sketch after this list for roughly how the conversion and filtering are applied).
  * Example: `--quantize.linear.mx.filter_fqns="attention.wq,attention.wk,attention.wv,output"`
  * This allows you to selectively apply MXFP8 only to layers that will benefit from it.
* `--compile.enable` (required for competitive performance): Use `torch.compile` to fuse the MXFP8 scaling/casting kernels.

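Under the hood, the converter is roughly equivalent to the following TorchAO-based sketch. The class and function names come from TorchAO's prototype MX API at the time of writing and may change between releases; the tiny `model` and the skip-list are placeholders for your own model and `filter_fqns`:

```python
import torch
from torchao.quantization import quantize_
from torchao.prototype.mx_formats import MXLinearConfig

# Placeholder model and skip-list (the list mirrors --quantize.linear.mx.filter_fqns).
model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096, bias=False, device="cuda", dtype=torch.bfloat16)
)
filter_fqns = ["output", "router.gate"]

def module_filter_fn(mod: torch.nn.Module, fqn: str) -> bool:
    # Convert only nn.Linear modules whose fully qualified name is not filtered out.
    return isinstance(mod, torch.nn.Linear) and not any(skip in fqn for skip in filter_fqns)

# Swaps matching nn.Linear modules for MXLinear using the chosen recipe.
config = MXLinearConfig.from_recipe_name("mxfp8_cublas")
quantize_(model, config=config, filter_fn=module_filter_fn)
```
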
**Hardware Requirements:**

MXFP8 training requires NVIDIA B200 (SM100) or newer GPUs. The implementation uses native cuBLAS MXFP8 kernels available on these architectures.

### MXFP8 for Grouped GEMMs (MoE)

For Mixture-of-Experts (MoE) models, MXFP8 can accelerate the expert computation through dynamically quantized grouped GEMMs. This is particularly beneficial for MoE models where multiple experts are processed in parallel.

#### Usage

To enable MXFP8 for MoE expert layers:

```bash
CONFIG_FILE="./torchtitan/models/llama4/train_configs/llama4_17bx16e.toml" ./run_train.sh \
  --model.converters="quantize.grouped_mm.mx" \
  --quantize.grouped_mm.mx.fqns="experts" \
  --quantize.grouped_mm.mx.recipe_name="mxfp8" \
  --compile.enable \
  --model.print_after_conversion
```

**Combined usage**: You can use MXFP8 for both linear modules and grouped GEMMs simultaneously by specifying both converters:

```bash
--model.converters="quantize.linear.mx,quantize.grouped_mm.mx"
```

**Configuration Options:**

* `--model.converters="quantize.grouped_mm.mx"`: Enable MXFP8 grouped GEMM conversion for MoE layers.
* `--quantize.grouped_mm.mx.fqns="experts"`: Comma-separated list of fully qualified names of MoE modules whose grouped GEMM operations should use MXFP8 dynamic quantization. A module matching an FQN is converted if it (1) represents its experts as 3D `nn.Parameter` instances (which is the case for TorchTitan MoEs), and (2) performs the actual routed expert computation with a `torch._grouped_mm` op over those 3D expert weights (see the sketch after this list for the shapes involved).
  * You can specify multiple FQNs to target different MoE layers in your model.
* `--quantize.grouped_mm.mx.recipe_name="mxfp8"`: Quantization recipe for grouped GEMMs (currently only `"mxfp8"` is supported).
* `--compile.enable`: Use `torch.compile` for best performance.

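To illustrate what gets quantized, here is a bfloat16 sketch of the routed-expert grouped GEMM. The shapes follow TorchTitan's MoE layout, and `torch._grouped_mm` is a private PyTorch op whose signature may change across nightlies, so treat this as illustrative only:

```python
import torch

num_experts, dim, hidden = 4, 256, 512
# Token counts per expert; each group size is a multiple of 32 (the MXFP8 block size).
tokens_per_expert = torch.tensor([64, 32, 96, 64], device="cuda")
offs = torch.cumsum(tokens_per_expert, dim=0).to(torch.int32)  # group end offsets

x = torch.randn(int(tokens_per_expert.sum()), dim, device="cuda", dtype=torch.bfloat16)
w = torch.randn(num_experts, dim, hidden, device="cuda", dtype=torch.bfloat16)  # 3D expert weights

# Tokens in x[offs[i-1]:offs[i]] are multiplied by expert i's weight matrix.
out = torch._grouped_mm(x, w, offs=offs)  # -> [total_tokens, hidden]
print(out.shape)  # torch.Size([256, 512])
```

With MXFP8 grouped GEMMs enabled, `x` and the expert weights are dynamically quantized to MXFP8 before this op for the matched modules.
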
**Important Notes:**

* **Token group alignment**: For MoE training with MXFP8, token group sizes must be multiples of 32 (the MXFP8 block size). This is automatically configured [here](https://github.com/pytorch/torchtitan/blob/b39377f9fe33865fefb9bf64a33f6d74a598be87/torchtitan/components/quantization/mx.py#L131) when you enable MXFP8 grouped GEMMs in TorchTitan; the small helper sketched after this list illustrates the rounding involved.

* **torch.compile recommendation**: All benchmarks in this document were run with `torch.compile` enabled. We recommend using `torch.compile` for best performance.

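For intuition, the alignment constraint amounts to padding each expert's token-group size up to the next multiple of the 32-element block, along the lines of this illustrative (non-TorchTitan) helper:

```python
def align_group_size(size: int, block_size: int = 32) -> int:
    # Round a token-group size up to the next multiple of the MXFP8 block size.
    return ((size + block_size - 1) // block_size) * block_size

assert align_group_size(100) == 128
assert align_group_size(96) == 96
```
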
### Example TOML Configuration

Here's an example configuration for MXFP8 training in a TOML file:

```toml
[model]
converters = ["quantize.linear.mx", "quantize.grouped_mm.mx"]

[quantize.linear.mx]
recipe_name = "mxfp8_cublas"
mxfp8_dim1_cast_kernel_choice = "cuda"
filter_fqns = ["output", "router.gate"]

[quantize.grouped_mm.mx]
recipe_name = "mxfp8"
fqns = ["experts"]

[compile]
enable = true
components = ["model"]
```

### Performance

#### Dense Models

Single-node training on 8x power-limited B200 GPUs, batch size 1, sequence length 8192, 100 steps, with torch.compile, FSDP2, and per-op SAC:

| Scaling Method     | Peak Memory (GB) | Median tokens/s | Speedup over BF16 |
|--------------------|------------------|-----------------|-------------------|
| None (bfloat16)    | 33.71            | 8307.5          | -                 |
| mxfp8_cublas       | 33.88            | 9969.0          | +20.0%            |
| mxfp8_cublas_rceil | 33.88            | 9642.0          | +16.1%            |
| float8 tensorwise  | 33.38            | 10417.0         | +25.4%            |

- PyTorch version: `2.9.0.dev20250815+cu128`
- torchao version: `0.13.0+gite4e681be`
- torchtitan commit: `6fc499f6f5b32151a799188be2208cfb09faed30`

*Source: [TorchAO MX Formats Benchmarks](https://github.com/pytorch/ao/tree/main/torchao/prototype/mx_formats#training-e2e-benchmarks-on-nvidia-b200)*

#### MoE models

512-GPU training run on a 64-node GB200 cluster:

| Scaling Method  | Median tokens/s | Speedup over BF16 |
|-----------------|-----------------|-------------------|
| None (bfloat16) | 6169            | -                 |
| mxfp8           | 7401            | +20.3%            |

Training runs on a 64-node GB200 cluster with TorchTitan Llama4 Scout show that MXFP8 MoE training converges equivalently to the bfloat16 baseline. In fact, after 3,000 steps it finishes with slightly *lower* loss than bfloat16! This is consistent with our scaling experiments with [MXFP8 training for dense models](https://pytorch.org/blog/accelerating-2k-scale-pre-training-up-to-1-28x-with-torchao-mxfp8-and-torchtitan-on-crusoe-b200-cluster/).

*Training loss curves over 3,000 steps showing MXFP8 achieves equivalent convergence to the bfloat16 baseline.*

Training and model configurations for this run:
- Model: Llama4 Scout
- Dataset: C4
- Sequence length: 8192
- Local batch size: 10
- Learning rate: 1e-4
- LR scheduler warmup steps: 2000
- Parallelisms (64 nodes of 4 devices each = 256 chips):
  - FSDP=256 (on attention layers, shared experts, and dense-layer FFNs) and 256/4=64 (on routed experts)
  - EP=16 (on routed experts)
- Activation checkpointing mode: `none` (ideally this would use selective per-op AC, but a bug at the time prevented us from using it)
- `torch.compile` enabled
- `mxfp8` applied to the routed experts computation (grouped GEMMs)
- `mxfp8` applied to all linear layers except `output`, `router.gate`, `attention.wk`, and `attention.wv` (the Wk and Wv projections are too small to benefit from MXFP8)

### Composability

For distributed training, MXFP8 is compatible with:
- `torch.compile`
- FSDP2/TP/EP/PP
- Full activation checkpointing

All distributed communication for MXFP8 training is currently done in high precision.

### Known Limitations

- Currently in the prototype stage; no backward-compatibility (BC) guarantees.
- Requires PyTorch nightly; important bug fixes have landed since 2.9.1.
- For GB200s, requires building torchao from source.

### Additional Resources

- [Accelerating 2K+ Scale Pre-training up to 1.28x with TorchAO MXFP8 and TorchTitan on Crusoe B200 Cluster](https://pytorch.org/blog/accelerating-2k-scale-pre-training-up-to-1-28x-with-torchao-mxfp8-and-torchtitan-on-crusoe-b200-cluster/) - Blog post on accelerating dense model training with MXFP8
- [TorchAO MX Formats Documentation](https://github.com/pytorch/ao/tree/main/torchao/prototype/mx_formats)
- [TorchAO MoE Training Documentation](https://github.com/pytorch/ao/tree/main/torchao/prototype/moe_training)