[Bugfix] Fix Per-Token Dynamic Activation Quantization #393

max410011 · 2025-07-14T17:56:04Z

Summary

This PR fixes the activation quantization issue described in Issue #394, where the input scale shape was incorrect when using the Dynamic TOKEN strategy.

Fix

Corrected the reduction dimensions so that only the hidden dimension is reduced.
The input scale shape is now (batch_size, seq_len, 1) instead of (1, seq_len, hidden_dim).
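The shape change can be checked with a small pure-Python sketch. It assumes that the `dim` set in `compute_dynamic_scales_and_zp` names the dimensions that are kept, and that every other dimension is reduced with `keep_dims = True`; that reading is consistent with the before/after shapes quoted here but is not shown in full in this PR. The helper `scale_shape` and the example sizes are hypothetical and not part of compressed-tensors.

```python
def scale_shape(value_shape, kept_dims):
    """Shape of a min/max reduction that keeps reduced dims as size 1.

    Assumption: `kept_dims` lists the dimensions that are preserved; every
    other dimension is reduced.
    """
    ndim = len(value_shape)
    reduce_dims = tuple(idx for idx in range(ndim) if idx not in kept_dims)
    return tuple(
        1 if idx in reduce_dims else size
        for idx, size in enumerate(value_shape)
    )


# Hypothetical activation shape: (batch_size, seq_len, hidden_dim)
activation_shape = (4, 16, 64)

before = scale_shape(activation_shape, {1, 2})  # old value: batch dim is reduced
after = scale_shape(activation_shape, {0, 1})   # new value: hidden dim is reduced

print(before)  # (1, 16, 64)
print(after)   # (4, 16, 1)
```

With the old set, each scale is shared across the batch rather than across the hidden dimension, so it is not a per-token scale.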

brian-dellabetta · 2025-07-15T21:53:34Z

Hi @max410011, I appreciate the thorough detail in the issue! I tried your PR, and both the original main and your branch seem to work: the resulting models can be loaded and run in vllm, which surprises me. This is some old code, and per-token vs. per-channel always trips me up. I will ask around to see if your reasoning in the issue description is correct.

brian-dellabetta

I validated that this gives the shape described in #394, and after internal conversations, this is correct. This is only an issue when running outside of vllm.

kylesayrs

Could you add a test to demonstrate and verify that these changes are correct? Awesome catch and resolution, thanks!

kylesayrs · 2025-07-23T15:20:49Z

src/compressed_tensors/quantization/utils/helpers.py

@@ -167,7 +167,7 @@ def compute_dynamic_scales_and_zp(

    keep_dims = True
    if args.strategy == QuantizationStrategy.TOKEN:
-        dim = {1, 2}
+        dim = {0, 1}


Shouldn't this be generalized to reflect all dims except the last? There are cases where activations are passed with 4 or 5 dimensions, not just 3.

max410011 · 2025-07-23

As far as I know, the input/output to a linear layer in typical LLMs usually has the shape (bs, seq_len, hidden_dim).

Other activation shapes, such as (bs, num_heads, seq_len, head_dim), generally appear in the attention computation (e.g., Q @ K^T, attention_weights @ V). If we intend for this function to support quantization in those parts as well, then yes, it makes sense to generalize it accordingly.
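One way to read the generalization suggested above is to reduce only the last axis, whatever the rank of the activation. The sketch below illustrates that idea with NumPy standing in for torch; `per_token_min_max` and the example shapes are hypothetical, and this is not the change made in this PR.

```python
import numpy as np


def per_token_min_max(value):
    """Reduce only the last (hidden) axis, keeping it as a size-1 axis."""
    min_val = np.amin(value, axis=-1, keepdims=True)
    max_val = np.amax(value, axis=-1, keepdims=True)
    return min_val, max_val


# Hypothetical 3-D, 4-D and 5-D activation shapes.
shapes = [(2, 8, 32), (2, 4, 8, 16), (2, 3, 4, 8, 16)]

result_shapes = []
for shape in shapes:
    value = np.zeros(shape, dtype=np.float32)
    min_val, max_val = per_token_min_max(value)
    result_shapes.append(min_val.shape)

print(result_shapes)
```

Every leading dimension is preserved and only the final one collapses to 1, so the 3-D case matches the (batch_size, seq_len, 1) shape from the fix.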

Fix per-token dynamic quant

b8c5a91

max410011 mentioned this pull request Jul 14, 2025

Unexpected Input Scale Shape for Dynamic Per-Token Activation Quantization #394

Open

brian-dellabetta approved these changes Jul 22, 2025

brian-dellabetta requested review from markurtz, kylesayrs, dsikka and shanjiaz July 22, 2025 21:15

kylesayrs reviewed Jul 23, 2025

brian-dellabetta mentioned this pull request Jul 31, 2025

[Quantization][Decompression] Fix QDQ for dynamic quant; Update NVFP4 Compression Params #407

Merged
