Pipeline Extraction #1279

Merged
merged 15 commits into main from kylesayrs/shared-pipelines
May 7, 2025

Conversation

kylesayrs
Collaborator

@kylesayrs kylesayrs commented Mar 24, 2025

Purpose

  • Extract data pipelines from modifiers to enable multiple modifiers to be active at the same time
    • This enables faster compression of larger models
    • This enables more memory efficient compression of larger models (not limited to just GPTQ/SGPT)

Prerequisites

  • #1351
  • #1298

Callback Changes

  • Implement calibration_epoch_start
    • This callback should be called at the start of every calibration pipeline
    • This callback causes modifiers to attach hooks
  • Implement sequential_epoch_end
    • This callback should be called after one sequential layer has been calibrated with one epoch
    • This callback triggers compression and replaces passing a callback_modifier
  • Implement calibration_epoch_end
    • This callback triggers at the end of a calibration epoch, and is used to trigger compression in between pipelines composed using the independent pipeline and remove hooks in between independent pipelines
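
The callback lifecycle described above can be sketched as follows. This is a minimal illustration, not the actual llm-compressor API: the `CalibrationEvent` enum, the `Modifier` class, and `run_calibration` are hypothetical stand-ins; only the three event names come from this PR.

```python
from enum import Enum, auto


class CalibrationEvent(Enum):
    """Hypothetical event types mirroring the callbacks described above."""
    CALIBRATION_EPOCH_START = auto()  # start of every calibration pipeline
    SEQUENTIAL_EPOCH_END = auto()     # one sequential layer calibrated with one epoch
    CALIBRATION_EPOCH_END = auto()    # end of a calibration epoch


class Modifier:
    """Minimal stand-in for a modifier reacting to calibration events."""

    def __init__(self, name: str):
        self.name = name
        self.hooks_attached = False
        self.compress_count = 0

    def on_event(self, event: CalibrationEvent) -> None:
        if event is CalibrationEvent.CALIBRATION_EPOCH_START:
            self.hooks_attached = True   # attach observation hooks
        elif event is CalibrationEvent.SEQUENTIAL_EPOCH_END:
            self.compress_count += 1     # compress the just-calibrated layer
        elif event is CalibrationEvent.CALIBRATION_EPOCH_END:
            self.compress_count += 1     # final compression opportunity
            self.hooks_attached = False  # remove hooks between independent pipelines


def run_calibration(modifiers, num_layers: int) -> None:
    """Drive the callback lifecycle for one calibration epoch."""
    for m in modifiers:
        m.on_event(CalibrationEvent.CALIBRATION_EPOCH_START)
    for _ in range(num_layers):
        for m in modifiers:
            m.on_event(CalibrationEvent.SEQUENTIAL_EPOCH_END)
    for m in modifiers:
        m.on_event(CalibrationEvent.CALIBRATION_EPOCH_END)
```

Note how both active modifiers receive every event, which is the point of extracting the pipeline out of any single modifier.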

Lifecycle Changes

  • Oneshot modifiers implement on_end, which removes hooks when calibration finishes
    • In the future, calibration_epoch_start is treated like batch_start, where it is an opportunity for modifiers to start
    • In the future, calibration_epoch_end is treated like batch_end, where it is an opportunity for modifiers to end
    • In the future, finalize is treated like batch_end, where it is an opportunity for modifiers to end
    • Right now, these opportunities are implemented manually on each oneshot modifier, rather than being a lifecycle rule
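
A minimal sketch of the `on_end` behavior, assuming hypothetical names throughout (the real modifiers register torch forward hooks; here a plain callable stands in for a hook's remove handle):

```python
class OneshotModifier:
    """Sketch of a oneshot modifier that manages its hooks manually."""

    def __init__(self):
        self._hook_handles = []

    def register_hook(self, remove_fn):
        # remove_fn plays the role of e.g. torch's RemovableHandle.remove
        self._hook_handles.append(remove_fn)

    def on_end(self):
        # Called when calibration finishes. Today each oneshot modifier
        # does this manually; per the notes above it may later become a
        # lifecycle rule tied to calibration_epoch_end / finalize.
        for remove_fn in self._hook_handles:
            remove_fn()
        self._hook_handles.clear()
```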

Data Pipeline Changes

  • Implement data pipeline registry
    • Inferred pipeline is selected using modifiers and can be overridden by user
  • Implement independent pipeline
    • This pipeline treats each modifier as a separate stage and assigns a pipeline to each modifier
    • Meant to replicate current LC behavior
    • Originally, these compression events were triggered by reaching the end of each module’s initialize function. Now a separate event is required
  • Implement session.get_modifiers
    • In order to perform data pipeline inference and other sequential pipeline inference, these functions must get the list of active modifiers before they initialize
    • This function gets all the active modifiers across all ModifierStages
  • Prepare smoothquant for pipeline extraction
    • Trigger _apply_smoothing on the sequential_epoch_end and calibration_epoch_end
    • Add a guard which allows the _apply_smoothing function to be called multiple times per session (as is required by sequential pipeline)
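
The registry-with-inference pattern can be sketched like this. All names and the inference rule (`requires_sequential`) are assumptions for illustration; the PR only establishes that the pipeline is inferred from the modifiers and can be overridden by the user.

```python
class PipelineRegistry:
    """Hypothetical sketch of the data pipeline registry."""

    _registry = {}

    @classmethod
    def register(cls, name):
        def decorator(pipeline_cls):
            cls._registry[name] = pipeline_cls
            return pipeline_cls
        return decorator

    @classmethod
    def infer(cls, modifiers, user_override=None):
        # A user-supplied pipeline name always wins over inference
        if user_override is not None:
            return cls._registry[user_override]
        # Assumed rule: any modifier needing layer-by-layer calibration
        # (GPTQ-style) selects the sequential pipeline; otherwise fall
        # back to the independent pipeline.
        if any(getattr(m, "requires_sequential", False) for m in modifiers):
            return cls._registry["sequential"]
        return cls._registry["independent"]


@PipelineRegistry.register("independent")
class IndependentPipeline:
    """Runs each modifier as its own stage (replicates prior LC behavior)."""


@PipelineRegistry.register("sequential")
class SequentialPipeline:
    """Calibrates one layer at a time, firing sequential_epoch_end after each."""
```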

Testing

  • Quantized llama3-8b using both the independent (basic + sequential) and sequential pipelines
  • There was no accuracy regression from using a shared pipeline, although we keep the independent pipeline as the default for now
  • Transformers tests pass
  • https://github.com/neuralmagic/llm-compressor-testing/actions/runs/14622080074

@kylesayrs kylesayrs changed the title [WIP] Shared Pipeline Extraction [WIP] Pipeline Extraction Mar 25, 2025
@vllm-project vllm-project deleted a comment from github-actions bot Mar 25, 2025
@kylesayrs kylesayrs changed the title [WIP] Pipeline Extraction Pipeline Extraction Mar 25, 2025
@kylesayrs kylesayrs marked this pull request as ready for review March 25, 2025 04:43
Collaborator

@brian-dellabetta brian-dellabetta left a comment

Definitely looks cleaner this way! Leaving comments rather than approving, as I am still getting up to speed with pipelines

@kylesayrs kylesayrs added the ready When a PR is ready for review label Mar 27, 2025
Collaborator

@brian-dellabetta brian-dellabetta left a comment

I know you're looking for feedback on this, but I'm not sure I understand it enough to approve. I do like the removal of all the try/catch code in GPTQ. Maybe we can have a deep dive session on this next week?

dsikka pushed a commit that referenced this pull request Apr 1, 2025
## Purpose ##
* Revert the behavior regression introduced as a result of #1114
* When calibrating a model using the `QuantizationModifier`, quantization should be enabled when calibrating

## Changes ##
* Remove "disabling quantization" from the calibration forward pass
* Add "disabling quantization" to the sequential pipelines in order to continue to disable quantization during calibration for GPTQ and SGPT
* When [calibration pipelines become shared between modifiers](#1279), the decision of whether to disable quantization during calibration will have to be moved to the calibration pipelines themselves. Some work needs to be done to demonstrate that GPTQ and SGPT do not suffer accuracy regression from enabling activation quantization during calibration (in theory, the change should increase accuracy)

---------

Signed-off-by: Kyle Sayers <[email protected]>
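
The "disabling quantization" behavior discussed in the commit above fits naturally as a context manager that the pipeline (rather than the forward pass) chooses to enter. The sketch below uses hypothetical names (`disable_quantization`, `FakeQuantLayer`); it only illustrates the save/disable/restore pattern.

```python
from contextlib import contextmanager


class FakeQuantLayer:
    """Stand-in for a layer whose fake-quantization can be toggled."""
    def __init__(self):
        self.quantization_enabled = True


@contextmanager
def disable_quantization(layers):
    """Temporarily disable quantization on each layer, restoring the
    previous per-layer state afterwards, even if calibration raises."""
    saved = [layer.quantization_enabled for layer in layers]
    try:
        for layer in layers:
            layer.quantization_enabled = False
        yield layers
    finally:
        for layer, state in zip(layers, saved):
            layer.quantization_enabled = state
```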
@kylesayrs kylesayrs removed the ready When a PR is ready for review label Apr 2, 2025
@kylesayrs kylesayrs marked this pull request as draft April 2, 2025 05:53
@kylesayrs kylesayrs force-pushed the kylesayrs/shared-pipelines branch 2 times, most recently from ee33c44 to 46f6811 Compare April 15, 2025 16:47
@kylesayrs kylesayrs changed the base branch from main to kylesayrs/quantization-mixin April 15, 2025 17:08
@kylesayrs kylesayrs force-pushed the kylesayrs/shared-pipelines branch from 46f6811 to 2182705 Compare April 15, 2025 17:21
@kylesayrs kylesayrs force-pushed the kylesayrs/quantization-mixin branch from 3213a7d to fa75986 Compare April 16, 2025 16:14
@kylesayrs kylesayrs force-pushed the kylesayrs/shared-pipelines branch from 2182705 to 92c9dee Compare April 16, 2025 17:21
@kylesayrs kylesayrs added the ready When a PR is ready for review label Apr 16, 2025
@kylesayrs kylesayrs marked this pull request as ready for review April 16, 2025 17:44
@kylesayrs kylesayrs changed the base branch from kylesayrs/quantization-mixin to main April 16, 2025 17:45
@kylesayrs kylesayrs force-pushed the kylesayrs/shared-pipelines branch from b9c91e7 to 3fdbb8d Compare April 22, 2025 19:43
@kylesayrs
Collaborator Author

Looks like there's one perplexity failure, although I wasn't able to reproduce it locally: https://github.com/neuralmagic/llm-compressor-testing/actions/runs/14622080074/job/41024772318#step:13:31981

@kylesayrs kylesayrs marked this pull request as draft May 1, 2025 13:58
@kylesayrs kylesayrs marked this pull request as ready for review May 1, 2025 16:27
Signed-off-by: Kyle Sayers <[email protected]>
Collaborator

@brian-dellabetta brian-dellabetta left a comment

really cool! excited to try this out. Should we run the e2e/lmeval tests before merging this in? Lots of moving pieces, they might catch something

Signed-off-by: Kyle Sayers <[email protected]>
kylesayrs added 3 commits May 5, 2025 12:35
Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Kyle Sayers <[email protected]>
@kylesayrs
Collaborator Author

Reran job: https://github.com/neuralmagic/llm-compressor-testing/actions/runs/14839620945

Signed-off-by: Kyle Sayers <[email protected]>
@kylesayrs
Collaborator Author

kylesayrs commented May 6, 2025

I've validated the previously failing awq e2e test passes locally

Collaborator

@brian-dellabetta brian-dellabetta left a comment

👍 the refactor from pipelines as functions to classes looks really good to me!

Collaborator

@rahul-tuli rahul-tuli left a comment

Looks good overall, left some minor comments.
One change/resolution/explanation requested for independent pipelines.

Generally I see a lot of similar TODOs sprinkled across multiple files; I'd like to address, delete, or link them out to tickets or issues before merge.

Great job on this!

@kylesayrs
Collaborator Author

@brian-dellabetta
Collaborator

https://github.com/neuralmagic/llm-compressor-testing/actions/runs/14870766869

Looks like tests passed, but there was an issue reporting timings, possibly expected for this manual run and unrelated to these changes. So I think we're good to go on this!

@kylesayrs kylesayrs enabled auto-merge (squash) May 7, 2025 14:59
@kylesayrs kylesayrs merged commit a54b3d1 into main May 7, 2025
8 checks passed
@kylesayrs kylesayrs deleted the kylesayrs/shared-pipelines branch May 7, 2025 15:00
shanjiaz pushed a commit that referenced this pull request May 7, 2025
## Purpose ##
* Extract data pipelines from modifiers to enable multiple modifiers to be active at the same time
  * This enables faster compression of larger models
  * This enables more memory efficient compression of larger models (not limited to just GPTQ/SGPT)

## Prerequisites ##
* #1351
* #1298

## Callback Changes ##
* Implement `calibration_epoch_start`
  * This callback should be called at the start of every calibration pipeline
  * This callback causes modifiers to attach hooks
* Implement `sequential_epoch_end`
  * This callback should be called after one sequential layer has been calibrated with one epoch
  * This callback triggers compression and replaces passing a `callback_modifier`
* Implement `calibration_epoch_end`
  * This callback triggers at the end of a calibration epoch, and is used to *trigger compression* in between pipelines composed using the independent pipeline and *remove hooks* in between independent pipelines

## Lifecycle Changes ##
* Oneshot modifiers implement `on_end`, which removes hooks when calibration finishes
  * In the future, `calibration_epoch_start` is treated like `batch_start`, where it is an opportunity for modifiers to start
  * In the future, `calibration_epoch_end` is treated like `batch_end`, where it is an opportunity for modifiers to end
  * In the future, `finalize` is treated like `batch_end`, where it is an opportunity for modifiers to end
  * Right now, these opportunities are implemented manually on each oneshot modifier, rather than being a lifecycle rule

## Data Pipeline Changes ##
* Implement data pipeline registry
  * Inferred pipeline is selected using modifiers and can be overridden by user
* Implement independent pipeline
  * This pipeline treats each modifier as a separate stage and assigns a pipeline to each modifier
  * Meant to replicate current LC behavior
  * Originally, these compression events were triggered by reaching the end of each module’s initialize function. Now a separate event is required
* Implement `session.get_modifiers`
  * In order to perform data pipeline inference and other sequential pipeline inference, these functions must get the list of active modifiers before they initialize
  * This function gets all the active modifiers across all `ModifierStages`
* Prepare smoothquant for pipeline extraction
  * Trigger `_apply_smoothing` on the `sequential_epoch_end` and `calibration_epoch_end`
  * Add a [guard](https://github.com/vllm-project/llm-compressor/pull/1244/files#diff-90bb5efcbf5f23ba1db62664a91f6b2d6492a909c387cd82c1589f45d5e8615cR285) which allows the `_apply_smoothing` function to be called multiple times per session (as is required by sequential pipeline)

## Testing ##
* Quantized llama3-8b using both the independent (basic + sequential) and sequential pipelines
* There was no accuracy regression from using a shared pipeline, although we keep the `independent` pipeline as the default for now
* Transformers tests pass
* https://github.com/neuralmagic/llm-compressor-testing/actions/runs/14622080074

---------

Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: shanjiaz <[email protected]>