Align Int4Tensor implementation details with the design of Float8Tensor #2687
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2687
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 28bd29c with merge base c086ade.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
```diff
 res = torch.ops.fbgemm.bf16i4bf16_rowwise(
     input_tensor,
-    weight_tensor._data.contiguous(),
+    weight_tensor.qdata.contiguous(),
```
Is it expected that the tensors are not contiguous? If not, can we assert for this instead of calling contiguous?
The non-contiguity comes from the reshape ops like transpose and view, I think, but the kernel will need these to be contiguous. I can try changing these to asserts and doing the contiguous operation on the user side to see if it works.
I would have expected the weights to be stored in a format aligned with what the kernel needs, without any need for just-in-time layout transforms. Does this match how the current code works?
Normally it is, but the weights also go through some transformations, like the ones we listed in test_moe_weight_reshape_ops, which make the weight / scale etc. non-contiguous, I think. I can try calling contiguous in user code instead; that might be cleaner.
Turns out contiguous was not implemented properly; I just fixed that, so we can remove the contiguous calls in linear/bmm now.
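For concreteness, a minimal sketch of what "implementing contiguous properly" for a wrapper tensor subclass can look like: route aten.contiguous to the inner tensors and rebuild the wrapper, so the linear/bmm ops can drop their just-in-time .contiguous() calls. The attribute names (qdata, scale, zero_point) follow the summary in this PR, but the handler below is an illustrative pattern, not the code merged here, and block_size is an assumed piece of metadata.

```python
# Illustrative sketch only: a dispatch handler for aten.contiguous.default on a
# packed-int4 wrapper subclass. It makes every inner tensor contiguous and
# rebuilds the wrapper, so downstream kernels can assert contiguity instead of
# converting inside the op.
def _int4_contiguous_impl(func, types, args, kwargs):
    self = args[0]  # the wrapper tensor subclass instance
    return self.__class__(
        self.qdata.contiguous(),       # packed int4 payload, [N, K/2]
        self.scale.contiguous(),       # [K/group_size, N]
        self.zero_point.contiguous(),  # [K/group_size, N]
        self.block_size,               # assumed extra metadata on the wrapper
    )
```

With something like this registered for the subclass, the matmul paths could simply `assert weight_tensor.qdata.is_contiguous()` and leave any layout conversion to the caller.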
Summary: Int4Tensor is the non-preshuffled version of int4 quantized Tensor, data is [N, K/2], scale/zero_point has shape: [K/group_size, N] Multiple fixes for Int4Tensor to align with the design of Float8Tensor (only calling fbgemm ops) * Added VERSION 2 for Int4WeightOnlyConfig * Migrated op implementation and tests from #2387 Test Plan: python test/quantization/quantize_/workflows/int4/test_int4_tensor.py Reviewers: Subscribers: Tasks: Tags: stack-info: PR: #2687, branch: jerryzh168/stack/16
1beccb0
to
5f6306e
Compare
```python
@register_quantize_module_handler(TestOnlyMoEQuantConfig)
def moe_quant_fn(module, config: TestOnlyMoEQuantConfig):
```
This is really confusing, could you share the result of print(model) after this function has been applied? If it's going to print the model with parameters wrapped in Int4Tensor, can we just wrap the parameters directly without all of these layers of abstraction?
If this is working around the fact that quantize_ needs to work on modules, IMO we should change quantize_ to handle this instead of working around it? Seems important for MoEs.
Yeah, the parameters are wrapped in Int4Tensor; this is just applying quantization to each of the MoE weights: w1, w2 and w3. I can inline these for now, and can follow up with how to have an API for weights + configs separately.
Probably not worth changing the API right now since MoE quant is also moving; let me know if the current code looks good.
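To make "just wrap the parameters directly" concrete, here is a hedged sketch of quantizing the stacked expert weights without a module-level config handler. The import path, the from_hp constructor, and the block_size convention are assumptions for illustration, not the exact API in this PR.

```python
import torch.nn as nn

# Import path and constructor name are assumptions for this sketch.
from torchao.quantization import Int4Tensor


def quantize_expert_weights(experts: nn.Module, group_size: int = 128) -> nn.Module:
    # Replace each stacked expert weight (w1, w2, w3) with a quantized tensor
    # subclass wrapped back into an nn.Parameter, i.e. "wrap the parameters
    # directly" rather than going through a test-only module-level config.
    for name in ("w1", "w2", "w3"):
        hp_weight = getattr(experts, name)
        # Assumed convention: per-group quantization along the last dimension.
        block_size = [1] * (hp_weight.dim() - 1) + [group_size]
        q_weight = Int4Tensor.from_hp(hp_weight.detach(), block_size)  # assumed ctor
        setattr(experts, name, nn.Parameter(q_weight, requires_grad=False))
    return experts
```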
```diff
@@ -177,3 +178,63 @@ def create_model_and_input_data(
     else:
         raise ValueError(f"Unknown model type: {model_type}")
     return model, input_data
+
+
+class Experts(nn.Module):
```
Maybe call it something like FeedForwardWithExperts? Experts is ambiguous.
This is adapted from https://github.com/meta-llama/llama-models/blob/a9c89c471f793423afd4cc3ca8671d6e56fe64cb/models/llama4/moe.py#L22; how about renaming it to LLama4Experts to make it more specific?
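For readers without the llama-models reference open, a minimal sketch of the kind of grouped-experts feed-forward being discussed. The dimension names and the SwiGLU-style combination are illustrative assumptions, not a copy of the test helper added in this PR.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Llama4ExpertsSketch(nn.Module):
    """Grouped experts holding stacked 3D weights, one slice per expert."""

    def __init__(self, num_experts: int, dim: int, hidden_dim: int):
        super().__init__()
        # w1/w3 project dim -> hidden_dim, w2 projects back to dim.
        self.w1 = nn.Parameter(torch.randn(num_experts, dim, hidden_dim))
        self.w2 = nn.Parameter(torch.randn(num_experts, hidden_dim, dim))
        self.w3 = nn.Parameter(torch.randn(num_experts, dim, hidden_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [num_experts, tokens_per_expert, dim]; each expert sees its own
        # token slice, so the whole layer is three batched matmuls (bmm),
        # which is why bmm support for the quantized weights matters here.
        h = F.silu(torch.bmm(x, self.w1)) * torch.bmm(x, self.w3)
        return torch.bmm(h, self.w2)
```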
Stacked PRs:
* optional_tensor_names in TorchAOBaseTensor #2710
* Align Int4Tensor implementation details with the design of Float8Tensor (this PR)
Summary:
Int4Tensor is the non-preshuffled version of the int4 quantized Tensor; data is [N, K/2], and scale/zero_point have shape [K/group_size, N].
Multiple fixes for Int4Tensor to align with the design of Float8Tensor (only calling fbgemm ops):
* Added VERSION 2 for Int4WeightOnlyConfig
* Migrated op implementation and tests from #2387
Test Plan:
python test/quantization/quantize_/workflows/int4/test_int4_tensor.py
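As an illustration of the workflow this config change targets, a hedged end-to-end sketch: quantize_ and Int4WeightOnlyConfig are existing torchao APIs, but the exact keyword for selecting the new version (version=2 below) and the inner attribute names qdata / scale are assumptions based on the summary above.

```python
import torch
import torch.nn as nn

from torchao.quantization import Int4WeightOnlyConfig, quantize_

# Requires a CUDA device and the fbgemm int4 ops to be available.
N, K, group_size = 256, 512, 128
model = nn.Sequential(
    nn.Linear(K, N, bias=False, dtype=torch.bfloat16, device="cuda")
)
quantize_(model, Int4WeightOnlyConfig(group_size=group_size, version=2))

w = model[0].weight
print(w.qdata.shape)  # expected [N, K/2]: two int4 values packed per byte
print(w.scale.shape)  # expected [K/group_size, N]
```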