Add palettization/codebook support to CoreML backend #13051
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/13051
Note: Links to docs will display an error until the docs builds have been completed.
❌ 2 New Failures, 2 Unrelated Failures
As of commit 5ee46af with merge base 1709a83:
NEW FAILURES - The following jobs have failed:
BROKEN TRUNK - The following jobs failed but were present on the merge base:
👉 Rebase onto the `viable/strict` branch to avoid these failures
This comment was automatically generated by Dr. CI and updates every 15 minutes.
nbits = inputs[2].val

# information in block_size is redundant with codebook.shape
block_size = inputs[3].val  # noqa: F841
@YifanShenSZ is there any restriction needed on the block size here?
I'm not aware of any, and I don't see one in our constexpr_lut_to_dense op doc.
Nice! 💯
Speaking of the pin, we have released coremltools 9.0b1.
    torch_alias=["quant::dequantize_codebook", "quant.dequantize_codebook"],
    override=False,
)
def dequantize_codebook(context, node):
qq: it seems that "codebook" corresponds to our look-up table (LUT)?
Yes, codebook is the same as the LUT and codes are the same as the indices.
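To make the correspondence concrete, here's a minimal numpy sketch of what codebook dequantization amounts to (illustrative only; the shapes and names below are assumptions, not the op's actual signature):

```python
import numpy as np

# Assumed example: a 3-bit codebook (LUT) with 2**3 = 8 float entries, and
# uint8 "codes" holding per-element indices into that codebook.
codebook = np.linspace(-1.0, 1.0, 8, dtype=np.float32)          # the LUT
codes = np.random.randint(0, 8, size=(4, 16), dtype=np.uint8)   # the indices

# Dequantization is a table lookup: each code selects its codebook entry.
dequantized = codebook[codes]   # shape (4, 16), dtype float32
assert dequantized.shape == codes.shape
```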
Nice! Where can I learn more about what's new?
0fa6302 to c4ca106 (Compare)
@cccclai @digantdesai can I get a stamp here? @YifanShenSZ has approved, but I need an approver in PyTorch to merge.
    torch_alias=["quant::dequantize_codebook", "quant.dequantize_codebook"],
    override=False,
)
def dequantize_codebook(context, node):
We don't have a quantize variant because the weights are always folded?
I'm enabling weight-only right now, so there is no quantize variant.
    CodebookWeightOnlyConfig(dtype=torch.uint2, block_size=[-1, 16]),
)
ep = torch.export.export(model, example_inputs)
print("ORIGINAL MODEL", ep)
Nit: remove?
Do you want to assert that dequantize_codebook is present in the graph?
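If it helps, a hedged sketch of such an assertion, assuming the quantized weights surface in the exported program as calls to a quant.dequantize_codebook op (per the torch_alias above) and reusing the model and example_inputs from this test:

```python
# Illustrative check only; the exact op target string is an assumption.
ep = torch.export.export(model, example_inputs)
has_codebook_dequant = any(
    node.op == "call_function" and "dequantize_codebook" in str(node.target)
    for node in ep.graph_module.graph.nodes
)
assert has_codebook_dequant, "expected a dequantize_codebook node in the exported graph"
```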
@@ -8,6 +8,7 @@
# coremltools than is used by ExecuTorch. Each op registered here should have a link to a PR in coremltools that adds
# the op to the coremltools library.

import numpy as np
Any constraint on the numpy version?
I think most versions would work. I only use it for np.int8.
model, example_inputs = self._get_test_model()
quantize_(
    model,
    CodebookWeightOnlyConfig(dtype=torch.uint3, block_size=[-1, 16]),
So does coremltools recognize _construct_constexpr_lut_op followed by an embedding lookup as a special pattern for quantized embedding that gets optimized? Same for, say, a LUT-quantized linear?
From @cymbalrush: during on-device compilation, Core ML fuses the dequant ops with linear ops into one kernel.
f"Core ML ignores output_dtype {out_np_dtype} on torchao.dequantize_affine and instead uses the native precision." | ||
) | ||
|
||
output = _utils._construct_constexpr_lut_op( |
I don't follow the constexpr thing here, though. What does that mean?
This translates the dequantize_codebook op to one of the following CoreML ops:
- iOS16.constexpr_ops.constexpr_lut_to_dense (https://fburl.com/hccppb8q)
- iOS18.compression.constexpr_lut_to_dense (https://fburl.com/51xpft2d)
These ops get fused with the following linear op at runtime.
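For readers unfamiliar with the coremltools torch frontend, a rough sketch of what such a translator can look like. The import paths, input ordering, and the _construct_constexpr_lut_op signature below are assumptions for illustration, not the code in this PR:

```python
# Import paths are assumptions based on coremltools' torch frontend layout.
from coremltools.converters.mil.frontend import _utils
from coremltools.converters.mil.frontend.torch.ops import _get_inputs
from coremltools.converters.mil.frontend.torch.torch_op_registry import register_torch_op


@register_torch_op(
    torch_alias=["quant::dequantize_codebook", "quant.dequantize_codebook"],
    override=False,
)
def dequantize_codebook(context, node):
    # Assumed input order: codes (indices), codebook (LUT), nbits, block_size.
    inputs = _get_inputs(context, node, expected=4)
    codes, codebook = inputs[0], inputs[1]

    # Emit a constexpr_lut_to_dense (the helper picks the iOS16 or iOS18 variant);
    # Core ML can then fuse it with the consuming linear op at runtime.
    output = _utils._construct_constexpr_lut_op(codes.val, codebook.val, name=node.name)
    context.add(output, torch_name=node.name)
```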
17a2728 to 5ee46af (Compare)
This adds palettization support for embedding/linear layers in the CoreML backend using TorchAO's quantize_ API.
Note: this needs pytorch/ao#2648 to land in ao, plus a pin bump in ET, before landing.
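For context, a hedged end-to-end sketch of the intended flow, pieced together from the test snippets above (the toy model and the CodebookWeightOnlyConfig import path are assumptions; the real tests live in the CoreML backend test suite):

```python
import torch
import torch.nn as nn
from torchao.quantization import quantize_
# Import path is an assumption; CodebookWeightOnlyConfig comes from torchao
# (the pytorch/ao PR referenced above).
from torchao.prototype.quantization.codebook_coreml import CodebookWeightOnlyConfig

# Toy linear model standing in for the embedding/linear layers targeted here.
model = nn.Sequential(nn.Linear(64, 32)).eval()
example_inputs = (torch.randn(1, 64),)

# LUT/codebook weight-only quantization: 3-bit codes, one LUT per block of 16 columns.
quantize_(model, CodebookWeightOnlyConfig(dtype=torch.uint3, block_size=[-1, 16]))

# Export; the dequantize_codebook nodes in this graph are what the CoreML backend
# translates to constexpr_lut_to_dense when lowering.
ep = torch.export.export(model, example_inputs)
print(ep)
```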