Add int4 Quantization Support #21435
Conversation
Codecov Report: coverage diff

| | master | #21435 | +/- |
|---|---:|---:|---:|
| Coverage | 74.94% | 82.78% | +7.83% |
| Files | 565 | 565 | |
| Lines | 55224 | 55404 | +180 |
| Branches | 8610 | 8635 | +25 |
| Hits | 41386 | 45864 | +4478 |
| Misses | 11880 | 7425 | -4455 |
| Partials | 1958 | 2115 | +157 |
Thanks for the PR! The code generally looks good to me. What is the performance profile? How did you benchmark the change?
I hadn't yet benchmarked the code. I've now created two micro-benchmarks and linked them in the PR description. Please take a look!
Summary
This PR introduces support for `int4` weight-only quantization for the `Dense` layer. The implementation includes the necessary logic for packing and unpacking `int4` values, performing the quantized matrix multiplication, and ensuring compatibility with features like LoRA. The code currently implements the W4A8 quantization scheme.
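For context, here is a minimal usage sketch. It assumes the new mode plugs into the existing `layer.quantize(...)` API the same way the `int8` and `float8` modes do; the layer size, input shape, and data are purely illustrative.

```python
import numpy as np
import keras

# Illustrative only: assumes a Keras build that includes this PR, so that
# "int4" is accepted alongside the existing "int8"/"float8" modes.
layer = keras.layers.Dense(16)
layer.build((None, 8))

x = np.random.rand(2, 8).astype("float32")
y_float = layer(x)

layer.quantize("int4")  # packs the kernel to int4 and adds a kernel_scale variable
y_int4 = layer(x)

# The two outputs should agree up to quantization error.
err = np.max(np.abs(
    keras.ops.convert_to_numpy(y_float) - keras.ops.convert_to_numpy(y_int4)
))
print("max abs difference:", err)
```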
Description
The core changes include:

- Support for an `int4` quantization mode.
- **Packing and Unpacking Utilities** (see the NumPy sketch after this list):
  - `pack_int4` takes an `int8` tensor (representing `int4` values) and packs two 4-bit values into a single `int8` byte.
  - `unpack_int4` performs the reverse operation, unpacking the `int8` tensor back into an `int8` tensor of `int4` values.
- **`Dense` Layer Modifications:**
  - `_int4_build`: builds a packed `kernel` of `int8` dtype and a `kernel_scale` variable. The original input dimension is saved in `_orig_input_dim` to handle unpacking correctly.
  - `_int4_call`: defines the forward pass for the `int4`-quantized layer. It uses a `custom_gradient` to perform the matrix multiplication with the unpacked kernel and correctly computes the gradients with respect to the original inputs (see the forward-pass sketch after this list).
  - The `quantize` method now handles `mode="int4"`. It quantizes the float weights to `int4` values and then packs them using `pack_int4`.
  - The `enable_lora` method correctly determines the input dimension for the LoRA matrices when the layer is `int4`-quantized by using the saved `_orig_input_dim`.
  - The `_get_kernel_with_merged_lora` method handles the unpacking of the `int4` kernel before merging the LoRA weights, followed by re-quantization and re-packing.
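To make the packing scheme concrete, here is a small, self-contained NumPy sketch of the idea behind `pack_int4` / `unpack_int4`. The functions in the PR operate on Keras tensors and, judging by the tests, support packing along different axes; this sketch packs along the last axis only, and the signatures are illustrative rather than the PR's actual API.

```python
import numpy as np

def pack_int4(x):
    """Pack int4 values (stored in an int8 array, range [-8, 7]) two per byte
    along the last axis, which must have an even length."""
    bits = x.astype(np.uint8)              # reinterpret the two's-complement bit patterns
    low = bits[..., 0::2] & 0x0F           # even elements -> low nibble
    high = (bits[..., 1::2] & 0x0F) << 4   # odd elements  -> high nibble
    return (low | high).astype(np.int8)

def unpack_int4(packed):
    """Reverse of pack_int4: recover the sign-extended int4 values as int8."""
    bits = packed.astype(np.uint8)
    low = ((bits & 0x0F) << 4).astype(np.int8) >> 4  # arithmetic shift sign-extends
    high = packed.astype(np.int8) >> 4
    out = np.empty(packed.shape[:-1] + (packed.shape[-1] * 2,), dtype=np.int8)
    out[..., 0::2] = low
    out[..., 1::2] = high
    return out

values = np.array([[-8, 7, -1, 3]], dtype=np.int8)
packed = pack_int4(values)                 # shape (1, 2): half the bytes
assert np.array_equal(unpack_int4(packed), values)
```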
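And here is a minimal sketch of the forward-pass idea behind `_int4_call`, assuming the Keras convention that the dequantized kernel is the quantized kernel divided by `kernel_scale`. For brevity the kernel is taken as already unpacked to int8, whereas the PR unpacks it from the packed representation first; names and signatures are illustrative, not the PR's actual code.

```python
import keras
from keras import ops

def weight_only_quantized_matmul(inputs, kernel_int8, kernel_scale):
    """Matmul against a frozen, dequantized kernel, with a custom gradient
    so the backward pass only produces d(loss)/d(inputs)."""

    @ops.custom_gradient
    def _matmul(inputs):
        # Dequantize: cast the int8 weights to the compute dtype and rescale.
        float_kernel = ops.divide(ops.cast(kernel_int8, "float32"), kernel_scale)

        def grad_fn(*args, upstream=None):
            if upstream is None:
                (upstream,) = args
            # Gradient w.r.t. the inputs, computed with the same dequantized kernel.
            return ops.matmul(upstream, ops.transpose(float_kernel))

        return ops.matmul(inputs, float_kernel), grad_fn

    return _matmul(inputs)
```

In the actual layer, the kernel is first unpacked from its packed int4 form and `kernel_scale` is the scale saved when the weights were quantized.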
Testing
- Tests for `int4` quantization in `dense_test.py`. These tests cover basic correctness, serialization (saving/loading models), behavior with LoRA enabled, and various edge cases.
- Tests for the `pack_int4` and `unpack_int4` functions in `quantizers_test.py` to ensure they work correctly for various tensor shapes and axes.
Benchmarking
Note: Results collected with warmed-up GPUs and pre-loaded models and kernels.
Micro Benchmark with OPT 125M using KerasHub
[colab link]
Micro Benchmark with BERT Classifier using KerasHub
[colab link]
Limitation
The current implementation unpacks the kernel on every forward pass (the int4 kernel is stored in a packed int8 representation where each byte holds two nibbles). This means we lose some of the memory savings at runtime and incur a performance penalty.
We may be able to work around this in the future by writing custom kernels that operate directly on the packed int4 representation.
Further work