
Conversation

kylesayrs (Collaborator) commented Oct 7, 2025

Purpose

  • FP4
    • Fix a bug discovered here where dynamic="local" NVFP4 calculations incremented the observer twice as fast as normal
    • Enable the MSE observer to be used with FP4 (a hedged Python sketch follows this list):
      mse_quant_error(x) := mean((x - fake_quant(x))**2)
      global_scale <- argmin over (min_vals, max_vals, global_scale) of mse_quant_error(x)
      scale, zp <- argmin over (min_vals, max_vals) of mse_quant_error(x), with global_scale held fixed
      
  • Simplification
    • Make supporting attention calibration easier by separating out weight/activation/attention reshaping
    • Improve readability of the observer code by removing many levels of function indirection
    • Drop support for calibration with non-divisible group sizes. This is not really a loss, since forward passes already make this assumption
  • New observers (hedged sketches follow this list)
    • memoryless_minmax computes min and max values on the fly in a dynamic-quantization style. This observer is useful for PTQ weight quantization
    • static_minmax computes absolute min and max values across all observations. This observer is useful for PTQ activation quantization
    • memoryless_mse computes best qparams w.r.t. MSE loss for each observation. This observer is useful for PTQ weight quantization
  • Memory improvements
    • Observers no longer store copies of scales and zero points, reducing the amount of required memory
    • The newly introduced "memoryless" observers do not store any quantization parameters at all, which greatly reduces the memory required for PTQ weight quantization of very large models
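
The FP4 + MSE flow above can be made concrete with a short sketch. This is a minimal illustration under stated assumptions, not the llm-compressor implementation: fake_quant, the shrink-factor grid, and the FP4_MAX constant are hypothetical stand-ins.

```python
import torch

FP4_MAX = 6.0  # largest magnitude representable in the FP4 (e2m1) format

def fake_quant(x: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Stand-in fake quantizer: uniform round-to-nearest with clamping.
    # Real FP4 has non-uniform step sizes, so this is only illustrative.
    return torch.clamp(torch.round(x / scale), -FP4_MAX, FP4_MAX) * scale

def mse_quant_error(x: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return torch.mean((x - fake_quant(x, scale)) ** 2)

@torch.no_grad()  # observer math never needs an autograd graph
def search_global_scale(x: torch.Tensor, steps: int = 100) -> torch.Tensor:
    # Stage 1 of the pseudocode: try progressively clipped abs-max candidates
    # and keep the global scale whose fake-quantization MSE is lowest
    amax = x.abs().amax()
    best_scale, best_err = amax / FP4_MAX, float("inf")
    for i in range(steps):
        shrink = 1.0 - i / (2 * steps)  # candidate clipping ratio
        scale = ((shrink * amax) / FP4_MAX).clamp(min=1e-12)
        err = mse_quant_error(x, scale)
        if err < best_err:
            best_scale, best_err = scale, err
    return best_scale
```

Stage 2 then repeats the same grid search over per-group min/max values with the chosen global_scale held fixed, yielding scale and zp.

The new observer flavors also reduce to a few lines each. The classes below are hedged stand-ins for the behavior described in the bullets (the names match the registered observers, but the implementations are illustrative):

```python
import torch

class MemorylessMinMax:
    # Computes fresh min/max per observation, dynamic-quantization style.
    # Nothing is retained between calls, which is what saves memory during
    # PTQ weight quantization.
    def observe(self, x: torch.Tensor):
        return x.amin(dim=-1), x.amax(dim=-1)

class StaticMinMax:
    # Tracks absolute min/max across all observations, suited to PTQ
    # activation quantization.
    def __init__(self):
        self.min_vals = None
        self.max_vals = None

    def observe(self, x: torch.Tensor):
        min_vals, max_vals = x.amin(dim=-1), x.amax(dim=-1)
        if self.min_vals is None:
            self.min_vals, self.max_vals = min_vals, max_vals
        else:
            self.min_vals = torch.minimum(self.min_vals, min_vals)
            self.max_vals = torch.maximum(self.max_vals, max_vals)
        return self.min_vals, self.max_vals
```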
Diagrams

Before: [diagram of the observer call flow before this refactor]
After: [diagram of the observer call flow after this refactor]

Changes

  • Standardize reshaping using flatten_for_calibration
    • This function reshapes all observed values to (num_observations, *qparams_shape, group_size); see the sketch after this list
    • This function removes the complexity associated with passing "reduce dims" and trying to handle weights, activations, and attention states all in the same function
    • In the future, this function could be applied to the quantization forward pass, although there's probably no need to do so beyond standardization
  • Implement get_global_scale on the Observer base class
    • This function decouples minmax calculations from regular qparam calculations (avoiding the double increment bug)
    • This function enables the MSE observer to be used with FP4 global scales
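
The reshaping contract referenced above is easiest to see on a group-quantized weight. Below is a hedged sketch of what flatten_for_calibration might do in that single case; the signature is assumed, and the real function also handles activations and attention states.

```python
import torch

def flatten_for_calibration(weight: torch.Tensor, group_size: int) -> torch.Tensor:
    # Sketch of the group-quantized weight case only (hypothetical signature)
    out_features, in_features = weight.shape
    # Calibration with non-divisible group sizes is no longer supported
    assert in_features % group_size == 0
    num_groups = in_features // group_size
    # One observation; qparams_shape is (out_features, num_groups)
    return weight.reshape(1, out_features, num_groups, group_size)

w = torch.randn(128, 512)
flat = flatten_for_calibration(w, group_size=64)
print(flat.shape)  # torch.Size([1, 128, 8, 64])

# Statistics reduced over the last dim now line up one-to-one with the
# qparam grid, with no "reduce dims" bookkeeping required:
min_vals = flat.amin(dim=-1)  # shape (1, 128, 8)
max_vals = flat.amax(dim=-1)  # shape (1, 128, 8)
```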

Testing

  • Added additional minmax tests which check the exact values of scales. These tests pass on both main and this branch, demonstrating that minmax observer behavior remains unchanged
  • Added additional MSE tests which check the exact values of MSE losses. These tests pass on both main and this branch, demonstrating that MSE observer behavior remains unchanged
  • Added FP4 MSE test

Evaluation

nvfp4-static-minmax
| Tasks  |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|--------|------:|------|-----:|--------|---|-----:|---|------|
|mmmu_val|      0|none  |     0|mmmu_acc|↑  |0.6167|±  |   N/A|
nvfp4-minmax
| Tasks  |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|--------|------:|------|-----:|--------|---|-----:|---|------|
|mmmu_val|      0|none  |     0|mmmu_acc|↑  |0.6011|±  |   N/A|


github-actions bot commented Oct 7, 2025

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

Signed-off-by: Kyle Sayers <[email protected]>
kylesayrs force-pushed the kylesayrs/observers-refactor branch from 0027707 to 79c7e86 on October 7, 2025 at 21:41
kylesayrs changed the title from "[Observers] Refactor to fix gparam bug, support attention, readability" to "[Observers] Refactor for better FP4 support, easier attention support" on Oct 7, 2025
kylesayrs marked this pull request as ready for review on October 7, 2025 at 21:46
Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Kyle Sayers <[email protected]>
kylesayrs added the "ready" label on Oct 9, 2025
shanjiaz (Collaborator) previously approved these changes on Oct 9, 2025 and left a comment:

Looks good to me, thanks for fixing the weird observer_kwargs setup.

Signed-off-by: Kyle Sayers <[email protected]>
brian-dellabetta (Collaborator) left a comment:

The cleanup looks good, nice that we can prune so much code in src. Some clarifying questions/nits. Would be good to have @dsikka or @rahul-tuli check this out too, since they know the observer logic better than me.

Does this remove the need for #1840?

kylesayrs (Collaborator, Author) replied:

@brian-dellabetta Yes, this adds support for FP4 + MSE in a more direct way, and has a test to validate this.

Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Kyle Sayers <[email protected]>
kylesayrs changed the title from "[Observers] Refactor for better FP4 support, easier attention support" to "[Observers] Refactor for better FP4 support, static observers" on Oct 10, 2025
brian-dellabetta (Collaborator) left a comment:

One suggestion on adding tests for the new static observer classes

dhuangnm and others added 2 commits October 10, 2025 14:32
SUMMARY:
Pick up compressed-tensors 0.12.2 for patch release 0.8.1


TEST PLAN:
All tests

Signed-off-by: Dan Huang <[email protected]>
Signed-off-by: Kyle Sayers <[email protected]>
kylesayrs marked this pull request as draft on October 10, 2025 at 21:41
kylesayrs removed the "ready" label on Oct 10, 2025
dsikka (Collaborator) commented Oct 13, 2025

Just an FYI - the Qwen3 VL MoE NVFP4 example has OOM errors on this branch, for both the static_minmax and minmax observers.

Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Kyle Sayers <[email protected]>
kylesayrs added the "ready" label on Oct 13, 2025
kylesayrs changed the title from "[Observers] Refactor for better FP4 support, static observers" to "[Observers] Refactor for better FP4 support, static and memoryless observers" on Oct 13, 2025
kylesayrs (Collaborator, Author) replied:

@dsikka The OOM errors were due to gradients being calculated during calibration. Gradient calculation is now disabled for observers (a sketch follows this comment).

The bad calibrations you saw with the MSE observer were due to the observer using the uninitialized global_scale when computing the MSE loss. I was able to replicate and fix the issue by recomputing the global scale on each iteration (note that this is done only when computing global_scales, not scales/zps).
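
A minimal illustration of the gradient fix, assuming a PyTorch module-style observer (the class below is a stand-in, not the actual llm-compressor code): wrapping observer math in torch.no_grad() keeps calibration from ever building an autograd graph.

```python
import torch

class MinMaxObserver(torch.nn.Module):
    @torch.no_grad()  # observer statistics never need gradients
    def forward(self, observed: torch.Tensor):
        # observed has shape (num_observations, *qparams_shape, group_size)
        return observed.amin(dim=-1), observed.amax(dim=-1)
```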

kylesayrs marked this pull request as ready for review on October 13, 2025 at 20:07
HDCharles (Collaborator) left a comment:

I think the change to not have a static version of MSE makes sense given the lack of a clear algorithm. LGTM

brian-dellabetta (Collaborator) left a comment:

very nice! looks so much cleaner

kylesayrs enabled auto-merge (squash) on October 14, 2025 at 19:01
kylesayrs merged commit d7d1b45 into main on Oct 14, 2025 (11 of 12 checks passed)
kylesayrs deleted the kylesayrs/observers-refactor branch on October 14, 2025 at 19:42
kylesayrs added a commit that referenced this pull request on Oct 14, 2025: …servers (#1903)
ronantakizawa pushed a commit to ronantakizawa/llm-compressor that referenced this pull request on Oct 15, 2025: …servers (vllm-project#1903)
cajeonrh pushed a commit to cajeonrh/llm-compressor that referenced this pull request on Oct 16, 2025: …servers (vllm-project#1903)
zhanglei1172 pushed a commit to zhanglei1172/llm-compressor that referenced this pull request on Oct 17, 2025: …servers (vllm-project#1903)