[CPU] add Float8OpaqueTensor for dynamic float8 act float8 weight #3075
Conversation
See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3075. ✅ No failures as of commit d460134 with merge base 4013764. (This comment was automatically generated by Dr. CI and updates every 15 minutes.)
CC @mingfeima for review. Thanks.
Hi @mingfeima @jerryzh168 @andrewor14 Could you please review this PR? Thanks.
@common_utils.parametrize(
    "x_granularity",
    [PerTensor(), PerRow(), PerGroup(32), PerGroup(64), PerGroup(128)],
)
does torch.ao support per block quantization, e.g. deepseek style?
Thanks for the comment. The supported granularity varies among different quantization methods in Torchao. For float8 da8w8 on CPU, it does not support the block-wise quantization used in DeepSeek.
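As a rough illustration (a hypothetical helper, not torchao code), the granularities above imply differently shaped scale tensors for a 2-D tensor of shape [N, K]; DeepSeek-style block-wise quantization would instead need a 2-D grid of scales over both N and K:

```python
# Hypothetical sketch: the scale shapes implied by each granularity for a
# 2-D tensor of shape [N, K]. This is NOT torchao's implementation.
def scale_shape(n, k, granularity, group_size=None):
    if granularity == "per_tensor":
        return (1,)                  # one scale for the whole tensor
    if granularity == "per_row":
        return (n, 1)                # one scale per row (output channel)
    if granularity == "per_group":
        assert group_size is not None and k % group_size == 0
        return (n, k // group_size)  # one scale per group of K elements
    raise ValueError(f"unsupported granularity: {granularity}")

# DeepSeek-style block-wise would instead be (n // block_n, k // block_k),
# i.e. the case not supported by float8 da8w8 on CPU.
```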
class Float8OpaqueTensor(TorchAOBaseTensor):
    """
    Float8 dynamic activation float8 weight on CPU. The weight tensor is reordered to a blocked layout
should it be Float8 dynamic quantized float8 weight on CPU
The expression here parses as "Float8 dynamic activation" + "float8 weight" on CPU, meaning the activation is dynamically quantized to float8 and the weight is statically quantized to float8. This is aligned with the naming used elsewhere in torchao.
    [block_k, block_n] may be further reordered to VNNI layout depending on supported CPU ISA.

    Tensor Attributes:
        qdata: Reordered float8 weight on CPU with shape = [N/block_n, K/block_k, block_k, block_n].
are we computing with float32 here, as the weight is not packed in vnni2 format.
We are computing with bf16 or fp8, depending on ISA. The exposed shape does not have the VNNI dimension but the memory layout is VNNI-2 or VNNI-4 if ISA is supported.
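A small, purely illustrative sketch (not torchao's actual packing code) of the two reorderings discussed above: the [N, K] → [N/block_n, K/block_k, block_k, block_n] blocking from the docstring, and VNNI-2 packing of one [block_k, block_n] tile, where the two K-elements a dot-product instruction consumes together are stored contiguously. Block sizes below are illustrative.

```python
# Illustrative sketch only; not torchao's implementation.

def to_blocked(w, block_n, block_k):
    """Reorder an [N, K] weight (list of lists) to
    [N/block_n, K/block_k, block_k, block_n], as in the docstring."""
    n, k = len(w), len(w[0])
    assert n % block_n == 0 and k % block_k == 0
    return [
        [
            [
                [w[bi * block_n + nn][bj * block_k + kk] for nn in range(block_n)]
                for kk in range(block_k)
            ]
            for bj in range(k // block_k)
        ]
        for bi in range(n // block_n)
    ]

def vnni2_pack(tile):
    """Interleave pairs of rows along K in one [block_k, block_n] tile
    (VNNI-2; VNNI-4 would interleave four rows)."""
    bk, bn = len(tile), len(tile[0])
    assert bk % 2 == 0
    return [
        [tile[k2 + kk][nn] for nn in range(bn) for kk in range(2)]
        for k2 in range(0, bk, 2)
    ]

w = [[8 * i + j for j in range(8)] for i in range(8)]
blocked = to_blocked(w, block_n=4, block_k=4)
# w[7][2] lands in block (row 1, col 0) at in-block position [kk=2, nn=3].
assert blocked[1][0][2][3] == w[7][2]

tile = [[0, 1], [2, 3], [4, 5], [6, 7]]
assert vnni2_pack(tile) == [[0, 2, 1, 3], [4, 6, 5, 7]]
```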
Hi @mingfeima @jerryzh168 @andrewor14, though this PR depends on #3100, could you please review it? Thanks.
Summary
We split the original big PR #2505 into the following smaller ones:
Test plan