
Implement QuantizationMixin #1351


Merged
merged 12 commits into main on May 2, 2025

Conversation

@kylesayrs (Collaborator) commented Apr 15, 2025

Purpose

  • Abstract the functionality that allows modifiers to act as quantization configs into a mixin called QuantizationMixin
    • This gives Pipeline Extraction #1279 an interface to properly infer which pipeline to use based on the recipe (if a recipe contains modifiers that require calibration, then use the "basic" or "sequential" pipelines)
    • This enables future modifiers to act as quantization modifiers (in the same way that GPTQ does now)
  • Related to the issue "problem cause when doing w8a8kvfp8 for qwen2.5-vl" (#1354), where the previous logic would attempt to add a QuantizedKVCache for dynamic kv_quant

Changes

  • Implement QuantizationMixin, which exposes five public methods (a usage sketch follows this Changes list)

    • Lifecycle methods
      • initialize_quantization is used to apply a config and attach observers to a model
        • quantization is disabled so that modules aren't quantized before they're calibrated
      • start_calibration is used to initialize calibration hooks and status
        • quantization is enabled, since we currently quantize as we calibrate, although this decision is somewhat arbitrary
      • end_calibration is used to remove calibration hooks and apply the frozen status
        • quantization remains enabled, since we want future forward passes to simulate quantization
    • Recipe-related methods
      • has_config returns true if a config was specified, used for checking against duplicate configs in the recipe
      • resolve_quantization_config returns the quantization config specified by the modifier fields
  • QuantizationModifier inherits from QuantizationMixin

  • GPTQModifier inherits from QuantizationMixin

    • Unlike QuantizationModifier, GPTQ disables quantization during calibration. As noted before, this is a somewhat arbitrary choice, but one which matches the current implementation
  • Calibration utils

    • Replace set_unset_kv_cache with initialize_quantized_kv_cache and freeze_module_quantization
      • Treat the QuantizedKVCache as analogous to another observer
    • Pull setting the calibration status out of update_weight_zp_scale
      • This better matches the lifecycle detailed in QuantizationMixin description
    • Implement reset_quantization_status which is used to remove any existing quantization configs before the current config is applied by initialize_quantization
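
For illustration, here is a minimal sketch of the lifecycle above as a modifier built on QuantizationMixin might drive it. The on_initialize/on_start/on_end hooks, import paths, and method signatures are assumed here for the sketch, not copied from the actual implementation.

```python
# Hypothetical sketch only; import paths and signatures are assumptions.
from llmcompressor.modifiers import Modifier
from llmcompressor.modifiers.quantization import QuantizationMixin


class ExampleQuantizingModifier(Modifier, QuantizationMixin):
    def on_initialize(self, state, **kwargs) -> bool:
        # Apply the resolved config and attach observers; quantization stays
        # disabled so modules are not quantized before they are calibrated.
        self.initialize_quantization(state.model)
        return True

    def on_start(self, state, event, **kwargs):
        # Attach calibration hooks and enable quantization
        # (we currently quantize as we calibrate).
        self.start_calibration(state.model)

    def on_end(self, state, event, **kwargs):
        # Remove calibration hooks and apply the frozen status; quantization
        # remains enabled so later forward passes simulate quantization.
        self.end_calibration(state.model)
```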

Remove Support

  • Remove support for recipes with multiple quantization modifiers active at the same time (a check for this will be added by Pipeline Extraction #1279)
  • Remove num_calibration_steps, quantize, disable_quantization_observer_epoch, and min_tokens_per_module (an example recipe without these arguments follows this list)
    • num_calibration_steps is already controlled by num_calibration_samples=dataset_args.num_calibration_samples (see https://github.com/vllm-project/llm-compressor/blob/42b62f5283d0234b26623fe1f1bf02a77c6e4019/src/llmcompressor/datasets/utils.py#L106)
    • quantize was implemented as a workaround for GPTQ's modifier builder. Similar functionality may be required to support SpinQuant + GPTQ, but such functionality should exist at a higher level
    • disable_quantization_observer_epoch seems to implement functionality where a model's observers are removed but quantization remains active. This functionality is maintained by setting an "end" epoch for QuantizationModifier
    • min_tokens_per_module requires that the modifier have references to the calibration dataset, which is disallowed by Pipeline Extraction #1279. This information is already printed in GPTQ's logs. If research still wants this tool specifically for QuantizationModifier, then it can be reimplemented to avoid using references to the calibration dataset
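
For reference, a minimal recipe after this change specifies everything through the modifier fields; the removed arguments simply no longer appear. This is a sketch only: the import paths, model name, and scheme below are placeholders and may differ by version.

```python
# Sketch of a post-change recipe; names and paths are illustrative placeholders.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# The quantization config is resolved entirely from modifier fields;
# num_calibration_steps, quantize, disable_quantization_observer_epoch,
# and min_tokens_per_module are no longer accepted.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

oneshot(model="Qwen/Qwen2.5-0.5B-Instruct", recipe=recipe)
```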

Testing

  • Updated tests to reflect new mixin
  • Ran a set of GPTQ and QuantizationModifier examples to completion
  • CI tests pass


👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: this is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.

@kylesayrs mentioned this pull request Apr 15, 2025
@kylesayrs force-pushed the kylesayrs/quantization-mixin branch from 8906be2 to 18f8341 on April 15, 2025 16:05
@kylesayrs added the ready label (When a PR is ready for review) Apr 15, 2025
Signed-off-by: Kyle Sayers <[email protected]>
@kylesayrs force-pushed the kylesayrs/quantization-mixin branch from 3213a7d to fa75986 on April 16, 2025 16:14
@brian-dellabetta (Collaborator) left a comment

This LGTM! slowly understanding more

@coolKeen

Purpose

  • Abstract functionality which allows modifiers to act as quantization configs into a mixin called QuantizationMixin

    • This gives Pipeline Extraction #1279 an interface to properly infer which pipeline to use based on the recipe (if a recipe contains modifiers that require calibration, then use the "basic" or "sequential" pipelines)
    • This enables future modifiers to act as quantization modifiers (in the same way that GPTQ does now)
  • Related to the issue "problem cause when doing w8a8kvfp8 for qwen2.5-vl" (#1354), where the previous logic would attempt to add a QuantizedKVCache for dynamic kv_quant

Changes

  • Implement QuantizationMixin which implements four public methods

    • attach_scheme_and_observers is used to apply the quantization config to the model and attach observers
    • register_calibration_hooks adds calibration hooks, which calibrate the observers that in turn calibrate the scales
    • resolve_quantization_config returns the quantization config specified by the modifier fields
    • has_config returns true if a config was specified, used for checking against duplicate configs
  • QuantizationModifier inherits from QuantizationMixin

    • The scheme and observers are attached on initialization

    • The activations and weights are calibrated on start via hooks and update_weight_zp_scale

      • Add a tqdm for weight calibration, which is useful for larger models
    • The observers and hooks are removed on finalize

  • GPTQModifier inherits from QuantizationMixin

    • Implements a similar lifecycle to QuantizationModifier
    • Remove attached QuantizationModifier logic
  • Replace set_unset_kv_cache with initialize_quantized_kv_cache and freeze_module_quantization

    • Treat the QuantizedKVCache as analogous to another observer
  • Pull setting the calibration status out of update_weight_zp_scale

    • This better matches the lifecycle detailed in QuantizationMixin description

Remove Support

  • Remove support for recipes with multiple quantization modifiers active at the same time (a check for this will be added by Pipeline Extraction #1279)

  • Remove num_calibration_steps, quantize, disable_quantization_observer_epoch and min_tokens_per_module

    • num_calibration_steps is already controlled by
      num_calibration_samples=dataset_args.num_calibration_samples,
    • quantize was implemented as a workaround for GPTQ's modifier builder. Similar functionality may be required to support SpinQuant + GPTQ, but such functionality should exist at a higher level
    • disable_quantization_observer_epoch seems to implement functionality where a model's observers are removed but quantization remains active. This is currently implemented by setting an "end" epoch
    • min_tokens_per_module requires that the modifier have references to the calibration dataset, which is disallowed by Pipeline Extraction #1279. This information is already printed in GPTQ's logs. If research still wants this tool specifically for QuantizationModifier, then it can be reimplemented to avoid using references to the calibration dataset

Testing

  • Updated tests to reflect new mixin
  • Ran a set of GPTQ and QuantizationModifier examples to completion
  • CI tests pass

Hi, I see that you mention SpinQuant + GPTQ? Does llm-compressor support such a recipe now?

@kylesayrs (Collaborator, Author)

Hi @coolKeen! Transforms support is actively being worked on; you can see the WIP PRs here!

Signed-off-by: Kyle Sayers <[email protected]>
dsikka previously requested changes Apr 21, 2025

@rahul-tuli (Collaborator) left a comment

LGTM pending comment; Thank you, nice work!

@brian-dellabetta (Collaborator) left a comment

Just make sure your other PR goes in first, or drop the changes to kv cache tests from this one

@brian-dellabetta (Collaborator) left a comment

good stuff. always nice when adding an abstraction leads to more lines removed than added 🔥

rahul-tuli previously approved these changes Apr 30, 2025

@rahul-tuli (Collaborator) left a comment

Synced offline on scope of testing; Changes introduced by this diff are tested by #1279 by virtue of lm_eval tests! Changes look good.

@kylesayrs dismissed stale reviews from rahul-tuli and brian-dellabetta via e4debea May 1, 2025 16:38
kylesayrs added 2 commits May 1, 2025 12:42
Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Kyle Sayers <[email protected]>
@kylesayrs enabled auto-merge (squash) May 2, 2025 16:14
@kylesayrs merged commit dce5e81 into main May 2, 2025
8 checks passed
@kylesayrs deleted the kylesayrs/quantization-mixin branch May 2, 2025 16:51
kylesayrs added a commit that referenced this pull request May 4, 2025
kylesayrs added a commit that referenced this pull request May 7, 2025
## Purpose ##
* Extract data pipelines from modifiers to enable multiple modifiers to
be active at the same time
  * This enables faster compression of larger models
* This enables more memory efficient compression of larger models (not
limited to just GPTQ/SGPT)

## Prerequisites ##
* #1351
* #1298

## Callback Changes ##
* Implement `calibration_epoch_start`
* This callback should be called at the start of every calibration
pipeline
  * This callback causes modifiers to attach hooks
* Implement `sequential_epoch_end`
* This callback should be called after one sequential layer has been
calibrated with one epoch
* This callback triggers compression and replaces passing a
`callback_modifier`
* Implement `calibration_epoch_end`
* This callback triggers at the end of a calibration epoch, and is used
to *trigger compression* in between pipelines composed using the
independent pipeline and *remove hooks* in between independent pipelines

## Lifecycle Changes ##
* Oneshot modifiers implement on_end, which removes hooks when
calibration finishes
* In the future, calibration_epoch_start is treated like batch_start,
where it is an opportunity for modifiers to start
* In the future, calibration_epoch_end is treated like batch_end, where
it is an opportunity for modifiers to end
* In the future, finalize is treated like batch_end, where it is an
opportunity for modifiers to end
* Right now, these opportunities are implemented manually on each
oneshot modifier, rather than being a lifecycle rule

## Data Pipeline Changes ##
* Implement data pipeline registry
* Inferred pipeline is selected using modifiers and can be overridden by
user
* Implement independent pipeline
* This pipeline treats each modifier as a separate stage and assigns a
pipeline to each modifier
  * Meant to replicate current LC behavior
* Originally, these compression events were triggered by reaching the
end of each module’s initialize function. Now a separate event is
required
* Implement `session.get_modifiers`
* In order to perform data pipeline inference and other sequential
pipeline inference, these functions must get the list of active
modifiers before they initialize
* This function gets all the active modifiers across all
`ModifierStages`
* Prepare smoothquant for pipeline extraction
* Trigger `_apply_smoothing` on the `sequential_epoch_end` and
`calibration_epoch_end`
* Add a
[guard](https://github.com/vllm-project/llm-compressor/pull/1244/files#diff-90bb5efcbf5f23ba1db62664a91f6b2d6492a909c387cd82c1589f45d5e8615cR285)
which allows the `_apply_smoothing` function to be called multiple times
per session (as is required by sequential pipeline)

## Testing ##
* Quantized llama3-8b using both the independent (basic + sequential)
and sequential pipelines
* There was no accuracy regression from using a shared pipeline,
although we keep the `independent` pipeline as the default for now
* Transformers tests pass
*
https://github.com/neuralmagic/llm-compressor-testing/actions/runs/14622080074

---------

Signed-off-by: Kyle Sayers <[email protected]>
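
The callback ordering described in the commit message above can be sketched roughly as follows. The pipeline function and helper names here are hypothetical, illustrating the described flow rather than the actual implementation.

```python
# Hypothetical sketch of the callback ordering; names are illustrative.
def run_sequential_pipeline(model, dataloader, subgraphs, callbacks):
    callbacks.calibration_epoch_start()       # modifiers attach their hooks
    for subgraph in subgraphs:                # one sequential layer at a time
        for batch in dataloader:
            subgraph.forward(model, **batch)  # calibrate this layer for one epoch
        callbacks.sequential_epoch_end()      # trigger compression for this layer
    callbacks.calibration_epoch_end()         # finish compression and remove hooks
```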
shanjiaz pushed a commit that referenced this pull request May 7, 2025