Add Sparsify overhead benchmark #3021
base: main
This PR adds a sparsify overhead benchmark that was omitted from the ICLR workshop paper: https://arxiv.org/abs/2503.16672. In the paper, the benchmark has two parts: 1) sparsify operation overhead, and 2) sparse-GEMM kernel performance. Part 1) was omitted from the original benchmark script, so this PR adds the missing sparsify-only benchmark comparing `torchao.sparse24_sm90_sparsify` against the `torch._cslt_compress` (cuSPARSELt) baseline. Test plan: CI
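The sparsify-only measurement boils down to timing each compression op in isolation and reporting microseconds. Below is a hedged, pure-Python sketch: `benchmark_microseconds` here is a hypothetical stand-in for the helper the PR's script actually uses (its real implementation is not shown in this thread), and the commented usage mirrors the call pattern visible in the diff — the real ops require torchao, cuSPARSELt, and an SM90 (H100) GPU.

```python
# Hypothetical minimal stand-in for the PR's benchmark_microseconds helper.
# This sketch only illustrates the median-of-N timing idea in plain Python;
# the script's real helper is assumed to do proper GPU-aware timing.
import time

def benchmark_microseconds(fn, *args, warmup=5, iters=50):
    """Return the median wall-clock time of fn(*args) in microseconds."""
    for _ in range(warmup):  # warm up caches / lazy initialization
        fn(*args)
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn(*args)
        samples.append((time.perf_counter() - t0) * 1e6)
    samples.sort()
    return samples[len(samples) // 2]  # median is robust to outliers

# Intended usage, mirroring the PR's diff (requires CUDA/H100 + torchao):
#   ao_time = benchmark_microseconds(
#       lambda: torch.ops.torchao.sparse24_sm90_sparsify(
#           input_tensor, "cutlass", "identity", "largest",
#           dtype=torch.float8_e4m3fn, scale=None,
#       )
#   )
#   cusparselt_time = benchmark_microseconds(torch._cslt_compress, input_tensor)
```

Note that for real CUDA kernels, naive wall-clock timing like this is misleading without synchronization; `torch.utils.benchmark.Timer` or CUDA events handle that correctly, which is presumably why the script uses a dedicated helper.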
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3021.
Note: Links to docs will display an error until the docs builds have completed. This comment was automatically generated by Dr. CI and updates every 15 minutes.
@jcaip Please review this PR, thanks.
@namgyu-youn Can you share the results of your benchmark script?
@jcaip Unfortunately I don't have access to an H100, so please feel free to add the benchmark results.
A couple of nits but otherwise looks good - thanks for adding!
lambda: torch.ops.torchao.sparse24_sm90_sparsify(
    input_tensor,
    "cutlass",
    "srelu",
this should be "identity" here instead
please update this :)
Thanks for the reminder; I missed it.
        scale=X_scale,
    )
)
cusparse_time = benchmark_microseconds(lambda: torch._cslt_compress(input_tensor))
Do you need this lambda? Can you just pass it in directly, like we do on L41 above: `cusparse_time = benchmark_microseconds(torch._cslt_compress, input_tensor)`
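The two call styles the review compares measure the same work. A tiny self-contained demo below — the `benchmark_microseconds` here is a hypothetical minimal version that forwards extra positional args to `fn` (assumed to match the real helper's signature, since the review suggests that calling convention), and `compress` is a stand-in for `torch._cslt_compress`:

```python
# Demo of the two equivalent call styles from the review. This
# benchmark_microseconds is a hypothetical minimal version that forwards
# *args to fn, which is what makes the lambda-free form work.
import time

def benchmark_microseconds(fn, *args, iters=100):
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    return (time.perf_counter() - t0) * 1e6 / iters  # mean us per call

def compress(x):
    # Stand-in for torch._cslt_compress: any single-argument callable works.
    return [v * 2 for v in x]

data = list(range(64))
t_lambda = benchmark_microseconds(lambda: compress(data))  # closure wraps the call
t_direct = benchmark_microseconds(compress, data)          # fn + args, no closure
# Both measure the same work; the direct form is shorter and skips one
# extra Python frame per call introduced by the lambda.
```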
# Sparsify-only benchmarks
X_scale = torch.empty([num_tokens, 1], device="cuda", dtype=torch.float32)
ao_cusparse_time = benchmark_microseconds(
    lambda: torch.ops.torchao.sparse24_sm90_sparsify(
same nit as below
"srelu", | ||
"largest", | ||
dtype=torch.float8_e4m3fn, | ||
scale=X_scale, |
I think you can pass in `None` to `scale` for the fairest comparison.
"fp8_c_time (us)": fp8_c_time, | ||
"fp8_c_sparse_time (us)": fp8_c_sparse_time, | ||
"fp8_c_activation_sparse_time (us)": fp8_c_activation_sparse_time, | ||
"ao_cusparse_time (us)": ao_cusparse_time, |
nit: I think something like `ao_fast_sparsification_time` is a better var name.
"fp8_c_sparse_time (us)": fp8_c_sparse_time, | ||
"fp8_c_activation_sparse_time (us)": fp8_c_activation_sparse_time, | ||
"ao_cusparse_time (us)": ao_cusparse_time, | ||
"cusparse_compress_time (us)": cusparse_time, |
`cusparselt*` instead of `cusparse*` so we don't get confused :)
lambda: torch.ops.torchao.sparse24_sm90_sparsify(
    input_tensor,
    "cutlass",
    "srelu",
please update this :)
"fp8_c_sparse_time (us)": fp8_c_sparse_time, | ||
"fp8_c_activation_sparse_time (us)": fp8_c_activation_sparse_time, | ||
"ao_fast_sparsification_time (us)": ao_fast_sparsification_time, | ||
"cusparse*_compress_time (us)": cusparse_time, |
`cusparse_time` isn't a good name for this because there is a separate cuSPARSE library, aside from cuSPARSELt. Please use `cusparselt` here instead.
Also, it looks like there's a typo in the string?
Thanks, I didn't know that due to my lack of background.
Summary:
This PR adds a sparsify overhead benchmark that was omitted from the ICLR workshop paper: https://arxiv.org/abs/2503.16672 (`sparse24_sm90_sparsify` overhead #2612).
In the paper, the benchmark has two parts: 1) sparsify operation overhead, and 2) sparse-GEMM kernel performance. Part 1) was omitted from the original benchmark script, so this PR adds the missing sparsify-only benchmark comparing `torchao.sparse24_sm90_sparsify` against the `torch._cslt_compress` (cuSPARSELt) baseline.
Test plan: CI