Pipeline Extraction #1279
Definitely looks cleaner this way! Leaving comments rather than approving, as I am still getting up to speed with pipelines
I know you're looking for feedback on this, but I'm not sure I understand it enough to approve. I do like the removal of all the try/catch code in GPTQ. Maybe we can have a deep dive session on this next week?
## Purpose ##
* Revert the behavior regression introduced as a result of #1114
* When calibrating a model using the `QuantizationModifier`, quantization should be enabled when calibrating

## Changes ##
* Remove "disabling quantization" from the calibration forward pass
* Add "disabling quantization" to the sequential pipelines in order to continue to disable quantization during calibration for GPTQ and SGPT (a rough sketch of this pipeline-level disabling follows below)
* When [calibration pipelines become shared between modifiers](#1279), the decision of whether to disable quantization during calibration will have to be moved to the calibration pipelines themselves. Some work needs to be done to demonstrate that GPTQ and SGPT do not suffer accuracy regression from enabling activation quantization during calibration (in theory, the change should increase accuracy)

---------
Signed-off-by: Kyle Sayers <[email protected]>
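A minimal sketch of what moving "disabling quantization" into the pipeline could look like. The names `disable_quantization` and `run_sequential`, and the `quantization_enabled` attribute, are illustrative assumptions rather than the actual llm-compressor API:

```python
import contextlib

import torch


@contextlib.contextmanager
def disable_quantization(model: torch.nn.Module):
    """Temporarily disable fake quantization so calibration sees
    full-precision activations (hypothetical attribute name)."""
    quantized = [m for m in model.modules() if hasattr(m, "quantization_enabled")]
    for module in quantized:
        module.quantization_enabled = False
    try:
        yield
    finally:
        for module in quantized:
            module.quantization_enabled = True


def run_sequential(model: torch.nn.Module, dataloader):
    # The pipeline, rather than each modifier's forward pass, now decides
    # whether quantization is active during calibration
    with disable_quantization(model):
        for batch in dataloader:
            model(**batch)
```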
Looks like there's one perplexity failure, although I wasn't able to reproduce it locally: https://github.com/neuralmagic/llm-compressor-testing/actions/runs/14622080074/job/41024772318#step:13:31981
Really cool! Excited to try this out. Should we run the e2e/lmeval tests before merging this in? Lots of moving pieces; they might catch something.
I've validated that the previously failing awq e2e test passes locally.
👍 the refactor from pipelines as functions to classes looks really good to me!
Looks good overall, left some minor comments.
One change/resolution/explanation requested for independent pipelines.
Generally I see a lot of similar TODOs scattered across multiple files; I'd like to address, delete, or link them to tickets or issues before merge.
Great job on this!
Looks like tests passed, but there was an issue reporting timings, possibly expected for this manual run and unrelated to these changes. So I think we're good to go on this!
## Purpose ##
* Extract data pipelines from modifiers to enable multiple modifiers to be active at the same time
  * This enables faster compression of larger models
  * This enables more memory-efficient compression of larger models (not limited to just GPTQ/SGPT)

## Prerequisites ##
* `QuantizationMixin` #1351
* `align_module_device` util #1298

## Callback Changes ##
* Implement `calibration_epoch_start` (see the callback sketch below)
  * This callback should be called at the start of every calibration pipeline
  * This callback causes modifiers to attach hooks
* Implement `sequential_epoch_end`
  * This callback should be called after one sequential layer has been calibrated with one epoch
  * This callback triggers compression and replaces passing a `callback_modifier`
* Implement `calibration_epoch_end`
  * This callback triggers at the end of a calibration epoch, and is used to *trigger compression* between pipelines composed using the independent pipeline and to *remove hooks* between independent pipelines

## Lifecycle Changes ##
* Oneshot modifiers implement `on_end`, which removes hooks when calibration finishes
* In the future, `calibration_epoch_start` will be treated like `batch_start`, where it is an opportunity for modifiers to start
* In the future, `calibration_epoch_end` will be treated like `batch_end`, where it is an opportunity for modifiers to end
* In the future, `finalize` will be treated like `batch_end`, where it is an opportunity for modifiers to end
* Right now, these opportunities are implemented manually on each oneshot modifier, rather than being a lifecycle rule

## Data Pipeline Changes ##
* Implement a data pipeline registry (see the registry sketch below)
  * The inferred pipeline is selected using the active modifiers and can be overridden by the user
* Implement the independent pipeline
  * This pipeline treats each modifier as a separate stage and assigns a pipeline to each modifier
  * Meant to replicate current LC behavior
  * Originally, these compression events were triggered by reaching the end of each modifier's `initialize` function. Now a separate event is required
* Implement `session.get_modifiers`
  * In order to perform data pipeline inference and other sequential pipeline inference, these functions must get the list of active modifiers before they initialize
  * This function gets all the active modifiers across all `ModifierStages`
* Prepare smoothquant for pipeline extraction
  * Trigger `_apply_smoothing` on `sequential_epoch_end` and `calibration_epoch_end`
  * Add a [guard](https://github.com/vllm-project/llm-compressor/pull/1244/files#diff-90bb5efcbf5f23ba1db62664a91f6b2d6492a909c387cd82c1589f45d5e8615cR285) which allows the `_apply_smoothing` function to be called multiple times per session (as required by the sequential pipeline)

## Testing ##
* Quantized llama3-8b using both the independent (basic + sequential) and sequential pipelines
* There was no accuracy regression from using a shared pipeline, although we keep the `independent` pipeline as the default for now
* Transformers tests pass: https://github.com/neuralmagic/llm-compressor-testing/actions/runs/14622080074

---------
Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: shanjiaz <[email protected]>
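A condensed sketch of the callback flow described above. The event names match the PR (`calibration_epoch_start`, `sequential_epoch_end`, `calibration_epoch_end`), but the `LifecycleCallbacks` class and the pipeline body are simplified assumptions, not the exact implementation:

```python
class LifecycleCallbacks:
    """Fans lifecycle events out to all active modifiers (simplified)."""

    def __init__(self, modifiers):
        self.modifiers = modifiers

    def calibration_epoch_start(self):
        for mod in self.modifiers:
            mod.on_calibration_epoch_start()  # modifiers attach hooks here

    def sequential_epoch_end(self):
        for mod in self.modifiers:
            mod.on_sequential_epoch_end()  # compress the just-calibrated layer

    def calibration_epoch_end(self):
        for mod in self.modifiers:
            mod.on_calibration_epoch_end()  # final compression, remove hooks


def sequential_pipeline(subgraphs, dataloader, callbacks: LifecycleCallbacks):
    callbacks.calibration_epoch_start()
    for subgraph in subgraphs:           # one transformer layer at a time
        for batch in dataloader:
            subgraph(batch)              # calibrate using captured activations
        callbacks.sequential_epoch_end()
    callbacks.calibration_epoch_end()
```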
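And a rough sketch of the data pipeline registry and inference described under "Data Pipeline Changes". The registration decorator and the `requires_sequential` flag are assumptions for illustration; only the `independent` and `sequential` pipeline names come from the PR itself:

```python
from typing import Callable, Dict, List, Optional

# name -> pipeline implementation (hypothetical registry)
PIPELINES: Dict[str, Callable] = {}


def register_pipeline(name: str):
    """Register a pipeline implementation under a short name."""
    def decorator(fn: Callable) -> Callable:
        PIPELINES[name] = fn
        return fn
    return decorator


def infer_pipeline(modifiers: List, user_override: Optional[str] = None) -> str:
    """Pick a pipeline from the active modifiers; a user override wins."""
    if user_override is not None:
        return user_override
    # modifiers that compress layer by layer (GPTQ/SGPT) need sequential
    if any(getattr(mod, "requires_sequential", False) for mod in modifiers):
        return "sequential"
    # default: run each modifier with its own pipeline, one after another
    return "independent"
```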