Add int4 Quantization Support #21435


Open
wants to merge 12 commits into
base: master

Conversation

@JyotinderSingh (Collaborator) commented Jun 29, 2025

Summary

This PR introduces support for int4 weight-only quantization for the Dense layer. The implementation includes the necessary logic for packing and unpacking int4 values, performing the quantized matrix multiplication, and ensuring compatibility with features like LoRA.

The code currently implements a W4A8 quantization scheme: int4 weights with int8 activations.
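As a rough illustration of the weight side of W4A8, here is a minimal numpy sketch of symmetric quantization to the signed int4 range [-8, 7]. The helper name and the per-tensor scale are illustrative only; the actual implementation uses Keras quantizer utilities and may compute scales per output channel.

```python
import numpy as np

def quantize_int4(w):
    # Map the largest magnitude to 7 so every value lands in the
    # signed int4 range [-8, 7]; `scale` dequantizes back to float.
    scale = np.max(np.abs(w)) / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

w = np.array([0.5, -1.2, 0.03, 0.9], dtype=np.float32)
q, scale = quantize_int4(w)
w_hat = q.astype(np.float32) * scale  # dequantized approximation
```

Note that the quantized values still occupy one int8 byte each at this point; the memory savings only materialize after packing two values per byte.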

Description

The core changes include:

  • Support for int4 quantization mode.

  • Packing and Unpacking Utilities:

    • pack_int4 takes an int8 tensor (representing int4 values) and packs two 4-bit values into a single int8 byte.
    • unpack_int4 performs the reverse operation, recovering the original int4 values (stored in an int8 tensor) from the packed representation.
  • Dense Layer Modifications:

    • _int4_build: Builds a packed kernel of int8 dtype and a kernel_scale variable. The original input dimension is saved in _orig_input_dim to handle unpacking correctly.
    • _int4_call: Defines the forward pass for the int4 quantized layer. It uses a custom_gradient to perform the matrix multiplication with the unpacked kernel and correctly computes the gradients with respect to the original inputs.
    • The quantize method now handles mode="int4". It quantizes the float weights to int4 values and then packs them using pack_int4.
    • LoRA Compatibility:
      • The enable_lora method correctly determines the input dimension for the LoRA matrices when the layer is int4 quantized by using the saved _orig_input_dim.
      • The _get_kernel_with_merged_lora method handles the unpacking of the int4 kernel before merging the LoRA weights, followed by re-quantization and re-packing.
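The packing scheme described above can be sketched in numpy as follows. This is illustrative only: the real pack_int4/unpack_int4 in keras.quantizers operate on backend tensors, support arbitrary axes, and handle odd-length axes, while this sketch assumes a 1-D tensor of even length.

```python
import numpy as np

def pack_int4(q):
    # Pack pairs of int4 values (stored as int8 in [-8, 7]) into single
    # bytes: even indices go to the low nibble, odd indices to the high.
    low = (q[0::2] & 0x0F).astype(np.uint8)
    high = (q[1::2] & 0x0F).astype(np.uint8)
    return ((high << 4) | low).astype(np.int8)

def unpack_int4(packed):
    # Reverse of pack_int4: split each byte into two nibbles and
    # sign-extend them back to signed int4 values stored as int8.
    u = packed.astype(np.uint8)
    low = (u & 0x0F).astype(np.int8)
    high = (u >> 4).astype(np.int8)
    low = np.where(low >= 8, low - 16, low).astype(np.int8)
    high = np.where(high >= 8, high - 16, high).astype(np.int8)
    out = np.empty(2 * packed.shape[0], dtype=np.int8)
    out[0::2] = low
    out[1::2] = high
    return out

q = np.array([3, -7, 0, 5], dtype=np.int8)
packed = pack_int4(q)          # half the bytes of q
restored = unpack_int4(packed)
```

The sign extension step is the subtle part: a stored nibble of 9 must be read back as -7, which is why the unpacked values cannot simply be masked out of the byte.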

Testing

  • Added tests for int4 quantization in dense_test.py. These tests cover basic correctness, serialization (saving/loading models), behavior with LoRA enabled, and various edge cases.
  • Added unit tests for the pack_int4 and unpack_int4 functions in quantizers_test.py to ensure they work correctly for various tensor shapes and axes.

Benchmarking

Note: Results collected with warmed-up GPUs and pre-loaded models and kernels.

Micro Benchmark with OPT 125M using KerasHub

[colab link]

(image: opt_benchmark results)

Micro Benchmark with BERT Classifier using KerasHub

[colab link]

(image: bert_benchmark results)

Limitation

The current implementation unpacks the kernel on every forward pass (to recover the int4 values from its packed int8 representation, where each byte stores two nibbles). This means we lose some of the memory savings at runtime and incur a performance penalty.

We may be able to work around this in the future by writing custom kernels which operate directly on the packed int4 representation.
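For concreteness, here is a numpy sketch of the per-call unpack path. The function and argument names are hypothetical, and activation quantization is omitted for simplicity; the point is that the full-size kernel is rebuilt on every call, which is exactly the overhead described above.

```python
import numpy as np

def int4_dense_forward(x, packed_kernel, kernel_scale):
    # Unpack two int4 weights per byte along the input axis; this
    # runs on every call, which is the runtime cost noted above.
    u = packed_kernel.astype(np.uint8)
    low = (u & 0x0F).astype(np.int8)
    high = (u >> 4).astype(np.int8)
    low = np.where(low >= 8, low - 16, low)
    high = np.where(high >= 8, high - 16, high)
    kernel = np.empty(
        (2 * packed_kernel.shape[0],) + packed_kernel.shape[1:], dtype=np.int8
    )
    kernel[0::2] = low
    kernel[1::2] = high
    # Dequantize and multiply; a fused packed-int4 kernel could skip
    # materializing the full-size unpacked `kernel` entirely.
    return x @ (kernel.astype(np.float32) * kernel_scale)

# Packed (2, 1) kernel encoding the int4 column [3, -7, 0, 5].
packed = np.array([[-109], [80]], dtype=np.int8)
y = int4_dense_forward(np.ones((1, 4), dtype=np.float32), packed, 0.5)
```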

Further work

  1. Exploring calibration methods discussed in the AWQ (Activation-aware Weight Quantization) and GPTQ papers, which could potentially be used to expose new APIs that allow better inference performance.

@JyotinderSingh JyotinderSingh changed the title [DRAFT] int4 quantization support [DRAFT] Add int4 Quantization Support to Dense Layers and DType Policies Jun 29, 2025
@codecov-commenter commented Jun 29, 2025

Codecov Report

Attention: Patch coverage is 90.51724% with 11 lines in your changes missing coverage. Please review.

Project coverage is 82.78%. Comparing base (744b8be) to head (f187306).
Report is 9 commits behind head on master.

Files with missing lines                           Patch %   Lines
keras/src/layers/core/dense.py                     92.18%    1 Missing and 4 partials ⚠️
keras/src/quantizers/quantizers.py                 90.00%    2 Missing and 2 partials ⚠️
keras/api/_tf_keras/keras/quantizers/__init__.py    0.00%    2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #21435      +/-   ##
==========================================
+ Coverage   74.94%   82.78%   +7.83%     
==========================================
  Files         565      565              
  Lines       55224    55404     +180     
  Branches     8610     8635      +25     
==========================================
+ Hits        41386    45864    +4478     
+ Misses      11880     7425    -4455     
- Partials     1958     2115     +157     
Flag                Coverage Δ
keras               82.59% <90.51%> (+7.82%) ⬆️
keras-jax           63.38% <87.06%> (+0.04%) ⬆️
keras-numpy         58.59% <70.68%> (?)
keras-openvino      33.73% <10.34%> (?)
keras-tensorflow    63.82% <90.51%> (+0.07%) ⬆️
keras-torch         63.49% <87.06%> (+0.12%) ⬆️

Flags with carried forward coverage won't be shown.

@JyotinderSingh JyotinderSingh changed the title [DRAFT] Add int4 Quantization Support to Dense Layers and DType Policies [DRAFT] Add int4 Quantization Support to Dense Layer Jun 29, 2025
@gbaned gbaned requested a review from mattdangerw June 30, 2025 08:18
@gbaned gbaned added this to PR Queue Jun 30, 2025
@github-project-automation github-project-automation bot moved this to Assigned Reviewer in PR Queue Jun 30, 2025
@JyotinderSingh JyotinderSingh changed the title [DRAFT] Add int4 Quantization Support to Dense Layer [DRAFT - DO NOT REVIEW] Add int4 Quantization Support to Dense Layer Jun 30, 2025
@JyotinderSingh JyotinderSingh changed the title [DRAFT - DO NOT REVIEW] Add int4 Quantization Support to Dense Layer [DRAFT] Add int4 Quantization Support to Dense Layer Jun 30, 2025
@JyotinderSingh JyotinderSingh changed the title [DRAFT] Add int4 Quantization Support to Dense Layer Add int4 Quantization Support to Dense Layer Jul 1, 2025
@JyotinderSingh JyotinderSingh changed the title Add int4 Quantization Support to Dense Layer Add int4 Quantization Support Jul 1, 2025
@fchollet (Collaborator) left a comment

Thanks for the PR! The code generally looks good to me. What is the performance profile? How did you benchmark the change?

@JyotinderSingh (Collaborator, Author) replied, quoting the comment above:
Thanks for the PR! The code generally looks good to me. What is the performance profile? How did you benchmark the change?

I hadn't yet benchmarked the code. I've now created two micro-benchmarks and linked them in the PR description; please take a look!
