[Observers] Refactor for better FP4 support, static and memoryless observers #1903
Conversation
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: this is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.
Looks good to me, thanks for fixing the weird observer_kwargs setup.
The cleanup looks good, nice that we can prune so much code in src. Some clarifying questions/nits. Would be good to have @dsikka or @rahul-tuli check this out too, since they know the observer logic better than me.
Does this remove the need for #1840?
@brian-dellabetta Yes, this adds support for FP4 + MSE in a more direct way, and has a test to validate this.
One suggestion on adding tests for the new static observer classes
SUMMARY: Pick up compressed-tensors 0.12.2 for patch release 0.8.1
TEST PLAN: All tests
Signed-off-by: Dan Huang <[email protected]>
Just an FYI: the Qwen3 VL MoE NVFP4 example has OOM errors on this branch, for both the static_minmax and minmax observers.
@dsikka The OOM errors were due to gradients being calculated; gradient calculations are now disabled for observers. The bad calibrations you saw with the MSE observer were due to the MSE observer using the uninitialized global_scale when computing the MSE loss. I was able to replicate and fix the issue by recomputing the global scale for each iteration (note that this is only done when computing global_scales, not scales/zps).
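To illustrate the gradient point above, here is a minimal sketch (not the repository's actual calibration code) of wrapping the observer call in `torch.no_grad()` so calibration does not build an autograd graph; the `observer` callable and the `calibrate_weight` helper are hypothetical.

```python
import torch

def calibrate_weight(module: torch.nn.Module, observer) -> tuple:
    """Hypothetical calibration step: compute qparams for a module's weight
    without recording autograd history."""
    # Without no_grad(), every tensor op inside the observer is tracked by
    # autograd, keeping intermediates alive and inflating calibration memory.
    with torch.no_grad():
        scale, zero_point = observer(module.weight)
    return scale, zero_point
```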
I think the change to not have a static version of MSE makes sense given the lack of a clear algorithm. LGTM
very nice! looks so much cleaner
[Observers] Refactor for better FP4 support, static and memoryless observers (#1903)

## Purpose ##

* FP4
  * Fix bug discovered [here](#1830 (comment)) where `dynamic="local"` nvfp4 calculations would increment the observer twice as fast as normal
  * Enable the MSE observer to be used with FP4 (see the sketch after this section)
    ```pseudocode
    mse_quant_error := mean((x - fake_quant(x))**2)
    global_scale <- min[min_vals, max_vals, global_scale](mse_quant_error(x))
    scale, zp <- min[min_vals, max_vals](mse_quant_error(x, global_scale))
    ```
* Simplification
  * Make supporting attention calibration easier by separating out weight/activation/attention reshaping
  * Improve readability of the observer code by removing many levels of function indirection
  * Drop support for calibration with non-divisible group sizes. This is not really a loss, since [forward passes](https://github.com/neuralmagic/compressed-tensors/blob/main/src/compressed_tensors/quantization/lifecycle/forward.py#L279) also make this assumption
* New observers
  * `memoryless_minmax` computes min and max values on the fly in a dynamic-quantization style. This observer is useful for PTQ weight quantization
  * `static_minmax` computes absolute min and max values across all observations. This observer is useful for PTQ activation quantization
  * `memoryless_mse` computes the best qparams w.r.t. MSE loss for each observation. This observer is useful for PTQ weight quantization
* Memory improvements
  * All observers no longer store copies of scales and zero points, reducing the amount of required memory
  * Newly introduced "memoryless" observers do not store any quantization parameters, which greatly reduces the memory requirements for PTQ weight quantization of very large models

| Diagrams |
| - |
| Before |
| <img width="886" height="595" alt="before" src="https://github.com/user-attachments/assets/660d94c2-3ac8-4e05-9e9b-53d21145abac" /> |
| After |
| <img width="1527" height="595" alt="after" src="https://github.com/user-attachments/assets/51a0107e-3fbd-413c-a7a6-03ddc3612169" /> |
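As a rough illustration of the MSE pseudocode above, the following is a minimal sketch of an MSE-style range search. It is not the repository's observer implementation; `fake_quant` is an assumed helper that quantizes and dequantizes `x` with the candidate range (and, for FP4, with a global scale derived from that candidate).

```python
import torch

def mse_range_search(x: torch.Tensor, fake_quant, steps: int = 100, shrink: float = 0.2):
    """Sketch of an MSE qparam search: shrink the observed min/max range and
    keep the candidate that minimizes mean((x - fake_quant(x))**2).
    For FP4, the same loop can be run first over candidate global scales and
    then over scales/zero points given the chosen global scale."""
    min_val, max_val = x.amin(), x.amax()
    best_err, best_range = float("inf"), (min_val, max_val)
    for i in range(int(steps * shrink), steps + 1):
        frac = i / steps
        cand_min, cand_max = min_val * frac, max_val * frac
        x_dq = fake_quant(x, cand_min, cand_max)   # quantize + dequantize with candidate range
        err = torch.mean((x - x_dq) ** 2)          # MSE quantization error
        if err < best_err:
            best_err, best_range = err, (cand_min, cand_max)
    return best_range
```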
## Changes ##

* Standardize reshaping using `flatten_for_calibration` (see the sketch after this section)
  * This function reshapes all observed values to `(num_observations, *qparams_shape, group_size)`
  * This function removes the complexity associated with passing "reduce dims" and trying to handle weights, activations, and attention states all in the same function
  * In the future, this function could be applied to the quantization forward pass, although there's probably no need to outside of standardization
* Implement `get_global_scale` on the `Observer` base class
  * This function decouples minmax calculations from regular qparam calculations (avoiding the double-increment bug)
  * This function enables the MSE observer to be used with FP4 global scales
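The following is a small sketch of the reshaping idea for the group-quantized weight case only; the function name and signature are illustrative, not the actual `flatten_for_calibration` API.

```python
import torch

def flatten_weight_for_calibration(weight: torch.Tensor, group_size: int) -> torch.Tensor:
    """Illustrative reshape to (num_observations, *qparams_shape, group_size):
    for a group-quantized weight, qparams_shape is (out_features, num_groups)
    and a single weight tensor counts as one observation."""
    out_features, in_features = weight.shape
    # Calibration assumes divisible group sizes, matching the forward pass.
    assert in_features % group_size == 0
    num_groups = in_features // group_size
    return weight.reshape(1, out_features, num_groups, group_size)

# Reducing over the last dim yields one statistic per quantization group.
w = torch.randn(16, 64)
flat = flatten_weight_for_calibration(w, group_size=16)   # shape (1, 16, 4, 16)
group_absmax = flat.abs().amax(dim=-1)                    # shape (1, 16, 4)
```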
## Testing ##

* Added additional minmax tests which check exact values of scales. These tests pass on both main and this branch, demonstrating that minmax observer behavior remains unchanged
* Added additional MSE tests which check exact values of MSE losses. These tests pass on both main and this branch, demonstrating that MSE observer behavior remains unchanged
* Added an FP4 MSE test
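As an illustration of what an "exact value" scale check looks like, here is a self-contained sketch using plain PyTorch rather than the repository's observer classes; `symmetric_int8_scale` is a stand-in for a symmetric minmax scale computation, not the project's test code.

```python
import torch

def symmetric_int8_scale(x: torch.Tensor) -> torch.Tensor:
    # Symmetric minmax scale: map the largest magnitude onto the int8 extreme.
    return x.abs().amax() / 127.0

def test_minmax_scale_exact_value():
    # A fixed input makes the expected scale fully deterministic, so the test
    # pins exact numerical behavior rather than just shapes or dtypes.
    x = torch.tensor([[-2.0, 0.5, 1.0], [0.25, -0.75, 1.5]])
    expected = torch.tensor(2.0 / 127.0)
    assert torch.allclose(symmetric_int8_scale(x), expected)
```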
## Evaluation ##

```
nvfp4-static-minmax
| Tasks  |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|--------|------:|------|-----:|--------|---|-----:|---|------|
|mmmu_val|      0|none  |     0|mmmu_acc|↑  |0.6167|±  |   N/A|
```

```
nvfp4-minmax
| Tasks  |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|--------|------:|------|-----:|--------|---|-----:|---|------|
|mmmu_val|      0|none  |     0|mmmu_acc|↑  |0.6011|±  |   N/A|
```

---------

Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Dan Huang <[email protected]>
Co-authored-by: dhuangnm <[email protected]>
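For reference, choosing one of the new observers comes down to setting the observer name on the quantization config. The snippet below is an assumed usage sketch using compressed-tensors' `QuantizationArgs`; the specific field values are illustrative and not taken from this PR's examples.

```python
# Assumed usage sketch: the observer is selected by name on the quantization
# config. Field names follow compressed-tensors' QuantizationArgs; the values
# below (int4, group size 128) are illustrative only.
from compressed_tensors.quantization import QuantizationArgs

weight_args = QuantizationArgs(
    num_bits=4,
    type="int",
    strategy="group",
    group_size=128,
    observer="memoryless_minmax",  # or "static_minmax" / "memoryless_mse"
)
```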