
Conversation

jananisriram
Contributor

Summary:
X-link: pytorch/pytorch#161442

Validate exhaustive autotuning for FP8 Inductor templates: scaled MM templates require `block_k >= 32`. Previously, exhaustive autotuning fell back to a limited set of configs, because the constraints on exhaustive autotuning for FP8 shapes had not been tested.
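As a rough illustration of the constraint above, here is a minimal, hypothetical sketch of how a `block_k >= 32` filter over an exhaustive config sweep could look. The names (`GemmConfig`, `exhaustive_configs`, the tile-size ranges) are assumptions for illustration, not Inductor's actual API:

```python
# Hypothetical sketch: exhaustive autotune candidates for a scaled MM
# (FP8) template, with block_k < 32 filtered out as invalid.
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class GemmConfig:
    block_m: int
    block_n: int
    block_k: int
    num_stages: int
    num_warps: int

def exhaustive_configs():
    # Cartesian product over an illustrative tile-size search space.
    for bm, bn, bk, stages, warps in product(
        [16, 32, 64, 128], [16, 32, 64, 128], [16, 32, 64, 128], [2, 3, 4], [4, 8]
    ):
        yield GemmConfig(bm, bn, bk, stages, warps)

def scaled_mm_configs():
    # Scaled MM templates require block_k >= 32, so smaller K tiles are
    # dropped up front rather than compiled and benchmarked.
    return [c for c in exhaustive_configs() if c.block_k >= 32]
```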

Differential Revision: D80958642

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D80958642

jananisriram added a commit to jananisriram/pytorch that referenced this pull request Aug 25, 2025
…ates (pytorch#161442)

Summary:
X-link: meta-pytorch/tritonbench#355

Pull Request resolved: pytorch#161442

Validate exhaustive autotuning for FP8 Inductor templates: scaled MM templates require `block_k >= 32`. Previously, exhaustive autotuning fell back to a limited set of configs, because the constraints on exhaustive autotuning for FP8 shapes had not been tested.

Test Plan:
```
CUDA_VISIBLE_DEVICES=0 TRITON_PRINT_AUTOTUNING=1 TRITON_ALWAYS_COMPILE=1 \
TORCH_LOGS=+inductor TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 \
ENABLE_PERSISTENT_TMA_MATMUL=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=1 \
TORCHINDUCTOR_MAX_AUTOTUNE_GEMM_SEARCH_SPACE=DEFAULT \
buck2 run mode/{opt,inplace} pytorch/tritonbench:run -- \
  --op fp8_gemm --only torch_fp8_gemm,pt2_fp8_gemm \
  --metrics tflops,accuracy \
  --input-loader=/home/jananisriram/personal/exhaustive_autotune_rowwise_persistent_tma/json_files/rowwise_ptma_0.json \
  --output="/home/jananisriram/personal/exhaustive_autotune_rowwise_persistent_tma/autotune/gpu0_bench.csv" \
  --atol=1e-2 --rtol=0.5 \
  2>&1 | tee ~/personal/exhaustive_autotune_rowwise_persistent_tma/autotune/gpu0.log
```
The benchmark autotunes over the maximum set of configs available, rather than the defaults, and skips configs that are not compatible with TMA.
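For context, here is a minimal sketch of the kind of compiled FP8 GEMM the `pt2_fp8_gemm` benchmark exercises, assuming a rowwise-scaled `torch._scaled_mm` on an FP8-capable GPU; the shapes and unit scales are illustrative, not the benchmark's actual inputs:

```python
# Minimal sketch: rowwise-scaled FP8 matmul compiled with max-autotune,
# the mode under which the template configs above are benchmarked.
import torch

def fp8_gemm(a, b, scale_a, scale_b):
    return torch._scaled_mm(a, b, scale_a=scale_a, scale_b=scale_b,
                            out_dtype=torch.bfloat16)

compiled = torch.compile(fp8_gemm, mode="max-autotune")

M, K, N = 1024, 512, 2048
a = torch.randn(M, K, device="cuda").to(torch.float8_e4m3fn)
# _scaled_mm expects the second operand in column-major layout.
b = torch.randn(N, K, device="cuda").to(torch.float8_e4m3fn).t()
scale_a = torch.ones(M, 1, device="cuda")  # rowwise scales for a
scale_b = torch.ones(1, N, device="cuda")  # columnwise scales for b
out = compiled(a, b, scale_a, scale_b)
```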

Rollback Plan:

Differential Revision: D80958642
jananisriram added a commit to jananisriram/pytorch that referenced this pull request Aug 26, 2025
…ates (pytorch#161442)

Summary:
X-link: meta-pytorch/tritonbench#355


Validate exhaustive autotuning for FP8 Inductor templates: scaled MM templates require `block_k >= 32`. Previously, exhaustive autotuning fell back to a limited set of configs, because the constraints on exhaustive autotuning for FP8 shapes had not been tested.

Test Plan:
```
CUDA_VISIBLE_DEVICES=0 TRITON_PRINT_AUTOTUNING=1 TRITON_ALWAYS_COMPILE=1 \
TORCH_LOGS=+inductor TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 \
ENABLE_PERSISTENT_TMA_MATMUL=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=1 \
TORCHINDUCTOR_MAX_AUTOTUNE_GEMM_SEARCH_SPACE=DEFAULT \
buck2 run mode/{opt,inplace} pytorch/tritonbench:run -- \
  --op fp8_gemm --only torch_fp8_gemm,pt2_fp8_gemm \
  --metrics tflops,accuracy \
  --input-loader=/home/jananisriram/personal/exhaustive_autotune_rowwise_persistent_tma/json_files/rowwise_ptma_0.json \
  --output="/home/jananisriram/personal/exhaustive_autotune_rowwise_persistent_tma/autotune/gpu0_bench.csv" \
  --atol=1e-2 --rtol=0.5 \
  2>&1 | tee ~/personal/exhaustive_autotune_rowwise_persistent_tma/autotune/gpu0.log
```
The benchmark autotunes over the maximum set of configs available, rather than the defaults, and skips configs that are not compatible with TMA.

Rollback Plan:

Differential Revision: D80958642
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D80958642

jananisriram added a commit that referenced this pull request Aug 26, 2025
Summary:
Pull Request resolved: #355

X-link: pytorch/pytorch#161442

Validate exhaustive autotuning for FP8 Inductor templates: scaled MM templates require `block_k >= 32`. Previously, exhaustive autotuning fell back to a limited set of configs, because the constraints on exhaustive autotuning for FP8 shapes had not been tested.

Differential Revision: D80958642
jananisriram added a commit to jananisriram/pytorch that referenced this pull request Aug 26, 2025
…ates (pytorch#161442)

Summary:
X-link: meta-pytorch/tritonbench#355

Pull Request resolved: pytorch#161442

Validate exhaustive autotuning for FP8 Inductor templates: scaled MM templates require `block_k >= 32`. Previously, exhaustive autotuning fell back to a limited set of configs, because the constraints on exhaustive autotuning for FP8 shapes had not been tested.

Test Plan:
```
CUDA_VISIBLE_DEVICES=0 TRITON_PRINT_AUTOTUNING=1 TRITON_ALWAYS_COMPILE=1 \
TORCH_LOGS=+inductor TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 \
ENABLE_PERSISTENT_TMA_MATMUL=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=1 \
TORCHINDUCTOR_MAX_AUTOTUNE_GEMM_SEARCH_SPACE=DEFAULT \
buck2 run mode/{opt,inplace} pytorch/tritonbench:run -- \
  --op fp8_gemm --only torch_fp8_gemm,pt2_fp8_gemm \
  --metrics tflops,accuracy \
  --input-loader=/home/jananisriram/personal/exhaustive_autotune_rowwise_persistent_tma/json_files/rowwise_ptma_0.json \
  --output="/home/jananisriram/personal/exhaustive_autotune_rowwise_persistent_tma/autotune/gpu0_bench.csv" \
  --atol=1e-2 --rtol=0.5 \
  2>&1 | tee ~/personal/exhaustive_autotune_rowwise_persistent_tma/autotune/gpu0.log
```
The benchmark autotunes over the maximum set of configs available, rather than the defaults, and skips configs that are not compatible with TMA.

Rollback Plan:

Differential Revision: D80958642
facebook-github-bot pushed a commit that referenced this pull request Aug 27, 2025
Summary:

X-link: pytorch/pytorch#161442

Validate exhaustive autotuning for FP8 Inductor templates: scaled MM templates require `block_k >= 32`. Previously, exhaustive autotuning fell back to a limited set of configs, because the constraints on exhaustive autotuning for FP8 shapes had not been tested.

Reviewed By: coconutruben

Differential Revision: D80958642
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D80958642

jananisriram added a commit to jananisriram/pytorch that referenced this pull request Aug 27, 2025
…ates (pytorch#161442)

Summary:
X-link: meta-pytorch/tritonbench#355


Validate exhaustive autotuning for FP8 Inductor templates: scaled MM templates require `block_k >= 32`. Previously, exhaustive autotuning fell back to a limited set of configs, because the constraints on exhaustive autotuning for FP8 shapes had not been tested.

Test Plan:
```
CUDA_VISIBLE_DEVICES=0 TRITON_PRINT_AUTOTUNING=1 TRITON_ALWAYS_COMPILE=1 \
TORCH_LOGS=+inductor TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 \
ENABLE_PERSISTENT_TMA_MATMUL=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=1 \
TORCHINDUCTOR_MAX_AUTOTUNE_GEMM_SEARCH_SPACE=DEFAULT \
buck2 run mode/{opt,inplace} pytorch/tritonbench:run -- \
  --op fp8_gemm --only torch_fp8_gemm,pt2_fp8_gemm \
  --metrics tflops,accuracy \
  --input-loader=/home/jananisriram/personal/exhaustive_autotune_rowwise_persistent_tma/json_files/rowwise_ptma_0.json \
  --output="/home/jananisriram/personal/exhaustive_autotune_rowwise_persistent_tma/autotune/gpu0_bench.csv" \
  --atol=1e-2 --rtol=0.5 \
  2>&1 | tee ~/personal/exhaustive_autotune_rowwise_persistent_tma/autotune/gpu0.log
```
The benchmark autotunes over the maximum set of configs available, rather than the defaults, and skips configs that are not compatible with TMA.

Rollback Plan:

Reviewed By: coconutruben

Differential Revision: D80958642
facebook-github-bot pushed a commit that referenced this pull request Aug 27, 2025
Summary:

X-link: pytorch/pytorch#161442

Validate exhaustive autotuning for FP8 Inductor templates: scaled MM templates require `block_k >= 32`. Previously, exhaustive autotuning fell back to a limited set of configs, because the constraints on exhaustive autotuning for FP8 shapes had not been tested.

Reviewed By: coconutruben

Differential Revision: D80958642
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D80958642

jananisriram added a commit to jananisriram/pytorch that referenced this pull request Aug 27, 2025
…ates (pytorch#161442)

Summary:
X-link: meta-pytorch/tritonbench#355


Validate exhaustive autotuning for FP8 Inductor templates: scaled MM templates require `block_k >= 32`. Previously, exhaustive autotuning fell back to a limited set of configs, because the constraints on exhaustive autotuning for FP8 shapes had not been tested.

Test Plan:
```
CUDA_VISIBLE_DEVICES=0 TRITON_PRINT_AUTOTUNING=1 TRITON_ALWAYS_COMPILE=1 \
TORCH_LOGS=+inductor TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 \
ENABLE_PERSISTENT_TMA_MATMUL=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=1 \
TORCHINDUCTOR_MAX_AUTOTUNE_GEMM_SEARCH_SPACE=DEFAULT \
buck2 run mode/{opt,inplace} pytorch/tritonbench:run -- \
  --op fp8_gemm --only torch_fp8_gemm,pt2_fp8_gemm \
  --metrics tflops,accuracy \
  --input-loader=/home/jananisriram/personal/exhaustive_autotune_rowwise_persistent_tma/json_files/rowwise_ptma_0.json \
  --output="/home/jananisriram/personal/exhaustive_autotune_rowwise_persistent_tma/autotune/gpu0_bench.csv" \
  --atol=1e-2 --rtol=0.5 \
  2>&1 | tee ~/personal/exhaustive_autotune_rowwise_persistent_tma/autotune/gpu0.log
```
The benchmark autotunes over the maximum set of configs available, rather than the defaults, and skips configs that are not compatible with TMA.

Rollback Plan:

Reviewed By: coconutruben

Differential Revision: D80958642
Summary:

X-link: pytorch/pytorch#161442

Validate exhaustive autotuning for FP8 Inductor templates: scaled MM templates require `block_k >= 32`. Previously, exhaustive autotuning fell back to a limited set of configs, because the constraints on exhaustive autotuning for FP8 shapes had not been tested.

Reviewed By: coconutruben

Differential Revision: D80958642
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D80958642

pytorch-bot bot pushed a commit to pytorch/pytorch that referenced this pull request Aug 28, 2025
…ates (#161442)

Summary:
X-link: meta-pytorch/tritonbench#355


Validate exhaustive autotuning for FP8 Inductor templates: scaled MM templates require `block_k >= 32`. Previously, exhaustive autotuning fell back to a limited set of configs, because the constraints on exhaustive autotuning for FP8 shapes had not been tested.

Test Plan:
```
CUDA_VISIBLE_DEVICES=0 TRITON_PRINT_AUTOTUNING=1 TRITON_ALWAYS_COMPILE=1 \
TORCH_LOGS=+inductor TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 \
ENABLE_PERSISTENT_TMA_MATMUL=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=1 \
TORCHINDUCTOR_MAX_AUTOTUNE_GEMM_SEARCH_SPACE=DEFAULT \
buck2 run mode/{opt,inplace} pytorch/tritonbench:run -- \
  --op fp8_gemm --only torch_fp8_gemm,pt2_fp8_gemm \
  --metrics tflops,accuracy \
  --input-loader=/home/jananisriram/personal/exhaustive_autotune_rowwise_persistent_tma/json_files/rowwise_ptma_0.json \
  --output="/home/jananisriram/personal/exhaustive_autotune_rowwise_persistent_tma/autotune/gpu0_bench.csv" \
  --atol=1e-2 --rtol=0.5 \
  2>&1 | tee ~/personal/exhaustive_autotune_rowwise_persistent_tma/autotune/gpu0.log
```
The benchmark autotunes over the maximum set of configs available, rather than the defaults, and skips configs that are not compatible with TMA.

Rollback Plan:

Reviewed By: coconutruben

Differential Revision: D80958642
pytorch-bot bot pushed a commit to pytorch/pytorch that referenced this pull request Sep 2, 2025
…ates (#161442)

Summary:
X-link: meta-pytorch/tritonbench#355

Pull Request resolved: #161442

Validate exhaustive autotuning for FP8 Inductor templates: scaled MM templates require `block_k >= 32`. Previously, exhaustive autotuning fell back to a limited set of configs, because the constraints on exhaustive autotuning for FP8 shapes had not been tested.

Test Plan:
```
CUDA_VISIBLE_DEVICES=0 TRITON_PRINT_AUTOTUNING=1 TRITON_ALWAYS_COMPILE=1 \
TORCH_LOGS=+inductor TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 \
ENABLE_PERSISTENT_TMA_MATMUL=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=1 \
TORCHINDUCTOR_MAX_AUTOTUNE_GEMM_SEARCH_SPACE=DEFAULT \
buck2 run mode/{opt,inplace} pytorch/tritonbench:run -- \
  --op fp8_gemm --only torch_fp8_gemm,pt2_fp8_gemm \
  --metrics tflops,accuracy \
  --input-loader=/home/jananisriram/personal/exhaustive_autotune_rowwise_persistent_tma/json_files/rowwise_ptma_0.json \
  --output="/home/jananisriram/personal/exhaustive_autotune_rowwise_persistent_tma/autotune/gpu0_bench.csv" \
  --atol=1e-2 --rtol=0.5 \
  2>&1 | tee ~/personal/exhaustive_autotune_rowwise_persistent_tma/autotune/gpu0.log
```
The benchmark autotunes over the maximum set of configs available, rather than the defaults, and skips configs that are not compatible with TMA.

Rollback Plan:

Reviewed By: coconutruben

Differential Revision: D80958642