Support loading for static quant weight fp8 act fp8 #730
Conversation
Pull Request Overview
This PR adds support for loading static quantized models with FP8 weights and FP8 activations by implementing a new quantized linear layer class and updating the model conversion infrastructure.
Key changes:
- Implemented `WeightFP8ActFP8StaticQuantLinear` class for handling FP8 weight and activation quantization
- Updated model conversion logic to detect and handle FP8 static quantization configurations
- Enhanced test coverage to verify both export and loading functionality for static FP8 quantization
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| test/test_cpu/test_export.py | Extended test to verify loading of static FP8 quantized models and renamed test method |
| auto_round/inference/convert_model.py | Added support for `act_dynamic` parameter and FP8 static quantization detection in model conversion |
| auto_round/inference/backend.py | Added FP8 static quantization detection function and updated dynamic import logic |
| auto_round/export/export_to_autoround/export_to_fp8_woq.py | Implemented new `WeightFP8ActFP8StaticQuantLinear` class with quantization/dequantization methods |
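For context, here is a minimal sketch of what a static W8A8-FP8 linear layer of this kind could look like, assuming a PyTorch version that provides `torch.float8_e4m3fn`. The class name, buffer names, and the per-tensor scaling are illustrative assumptions, not the actual `WeightFP8ActFP8StaticQuantLinear` implementation in `export_to_fp8_woq.py`.

```python
# Hypothetical sketch of a static FP8-weight / FP8-activation linear layer.
# Names (FP8StaticQuantLinearSketch, weight_scale, input_scale) and the
# per-tensor scaling are illustrative assumptions, not the PR's actual code.
import torch
import torch.nn as nn


class FP8StaticQuantLinearSketch(nn.Module):
    def __init__(self, in_features: int, out_features: int, bias: bool = True):
        super().__init__()
        self.register_buffer("weight", torch.empty(out_features, in_features, dtype=torch.float8_e4m3fn))
        self.register_buffer("weight_scale", torch.ones(()))  # per-tensor weight scale
        self.register_buffer("input_scale", torch.ones(()))   # static (calibrated) activation scale
        if bias:
            self.register_buffer("bias", torch.zeros(out_features))
        else:
            self.bias = None

    @staticmethod
    def _quant_fp8(t: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
        # Scale, clamp to the e4m3 representable range, then cast to FP8.
        finfo = torch.finfo(torch.float8_e4m3fn)
        return (t / scale).clamp(finfo.min, finfo.max).to(torch.float8_e4m3fn)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Reference (non-fused) path: quantize activations with the static scale,
        # dequantize both operands, and run a regular matmul.
        x_dq = self._quant_fp8(x, self.input_scale).to(x.dtype) * self.input_scale
        w_dq = self.weight.to(x.dtype) * self.weight_scale
        return nn.functional.linear(x_dq, w_dq, self.bias)
```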
This PR is unnecessary for now; you need to work with Heng to fix the FP8.

@wenhuach21 The purpose of this PR is to support loading an existing qmodel from disk and then evaluating its accuracy. cc @n1ck-guo

Yes, but the primary purpose is evaluation, which the fake model should cover well (#731). This is not a product feature, and it involves changes to critical product code. As discussed earlier, please hold this PR for now, or move the code elsewhere without modifying the important HF model inference code.
@wenhuach21 Please help review it again, thanks.
```python
# Merge per-layer overrides from extra_config with the defaults from
# default_quant_scheme, then wrap the result as a QuantizationScheme.
layer_extra_config = extra_config.get(layer_name, {})
for scheme_attr in quant_scheme_attrs:
    layer_config[scheme_attr] = layer_extra_config.get(scheme_attr, getattr(default_quant_scheme, scheme_attr))
layer_configs[layer_name] = QuantizationScheme.from_dict(layer_config)
```
Hi @n1ck-guo @WeiweiZhang1, I wrapped the `layer_config` dict as a `QuantizationScheme` and propagated it to the backend check. Please take a look and let me know if I missed any corner cases. Thanks.
```python
priority=0,
feature_checks=[torch_fp8_static_check],
alias=["auto_round", "torch"],
requirements=["auto-round>0.6.0"],
```
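For illustration, the `torch_fp8_static_check` referenced above could look roughly like the following; the parameter names and matching rules are assumptions, not the actual function in `auto_round/inference/backend.py`.

```python
# Hypothetical sketch of an FP8 static feature check; parameter names and
# matching rules are assumptions, not the actual torch_fp8_static_check.
def torch_fp8_static_check_sketch(bits, data_type, act_bits, act_data_type, act_dynamic) -> bool:
    """Match only weight-FP8 / activation-FP8 configs with static activation scales."""
    weight_is_fp8 = bits == 8 and str(data_type).startswith("fp8")
    act_is_fp8 = act_bits == 8 and str(act_data_type).startswith("fp8")
    return weight_is_fp8 and act_is_fp8 and not act_dynamic
```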
Since this PR modifies the core Transformers inference code, I’d prefer to merge it after the 0.6.1 release.
This feature is targeted for INC 3.6, and several loading-related features depend on it, such as NVFP4/MXFP8/MXFP4 and possibly non-transformers model loading (cc @mengniwang95).
What's the release plan for version 0.6.1? Could we create the 0.6.1 RC branch first and merge this PR into `main` as soon as possible?
We will start the pre-release test after merging PR 781.
Summary of changes:
- Add the `auto_round:torch_fp8_static` backend for W8AFP8 loading and inference
- Update `QuantizationScheme` to support dict-style access
- Wrap the `layer_config` dict as a `QuantizationScheme` and propagate it to the backend check
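As a usage illustration of the loading path this PR targets (not code from the PR), a minimal sketch that loads an exported checkpoint from disk and runs a quick generation; the checkpoint path is a placeholder and the integration details are assumptions.

```python
# Hypothetical end-to-end loading sketch; "./qmodel-fp8-static" is a placeholder
# path for a checkpoint exported by AutoRound's FP8 static export.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "./qmodel-fp8-static"  # placeholder
model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype=torch.bfloat16, device_map="cpu")
tokenizer = AutoTokenizer.from_pretrained(model_dir)

inputs = tokenizer("There is a girl who likes adventure,", return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```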