[Quantization] Support w4afp8 MoE dynamic quantization #5282
Conversation
Thanks for your contribution!
Codecov Report

Additional details and impacted files:

@@           Coverage Diff            @@
## develop #5282 +/- ##
==========================================
Coverage ? 59.87%
==========================================
Files ? 324
Lines ? 39826
Branches ? 5991
==========================================
Hits ? 23844
Misses ? 14080
Partials ? 1902
  self.default_dtype = layer._helper.get_default_dtype()
- if layer.ep_size > 1 and not layer.moe_quant_config.moe_dynamic_quant:
+ if layer.ep_size > 1 and layer.is_quantized and not layer.moe_quant_config.moe_dynamic_quant:
is_quantized and moe_dynamic_quant: how about consolidating them into a single field later? If the user sets is_quantized and the type is w4afp8, you can just check the weight dtype and shape; if they don't match, fall back to dynamic quantization.
> is_quantized and moe_dynamic_quant: how about consolidating them into a single field later? If the user sets is_quantized and the type is w4afp8, you can just check the weight dtype and shape; if they don't match, fall back to dynamic quantization.
- is_quantized is not only used here; the model-building code also uses it to decide whether the field given in weight_key_map is "weight" or "quant_weight". It already exists in the framework, so I think reusing it here is fine.
- moe_dynamic_quant is kept because we want to support three modes at the same time: 1. both weights and activations dynamically quantized (not is_quantized); 2. static weights with dynamic activations (is_quantized and moe_dynamic_quant); 3. both weights and activations static (is_quantized). If W4AFP8 activations always go through dynamic quantization in the future, moe_dynamic_quant will no longer be needed. (A rough sketch of these three modes follows below.)
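For illustration only, a minimal sketch of how the two flags discussed above map onto the three modes. Only `layer.is_quantized` and `layer.moe_quant_config.moe_dynamic_quant` appear in the diff; the enum and helper names are hypothetical.

```python
# Hypothetical sketch, not the actual FastDeploy code: only the two flags
# checked below come from the diff; the mode names are illustrative.
from enum import Enum


class MoEQuantMode(Enum):
    FULLY_DYNAMIC = 1        # weights and activations both quantized at runtime
    STATIC_W_DYNAMIC_A = 2   # offline-quantized weights, dynamic activations
    FULLY_STATIC = 3         # offline-quantized weights and activation scales


def resolve_moe_quant_mode(layer) -> MoEQuantMode:
    if not layer.is_quantized:
        # Mode 1: nothing was quantized offline, so everything is dynamic.
        return MoEQuantMode.FULLY_DYNAMIC
    if layer.moe_quant_config.moe_dynamic_quant:
        # Mode 2: weights are static (offline), activations quantized per step.
        return MoEQuantMode.STATIC_W_DYNAMIC_A
    # Mode 3: both weights and activation scales come from the offline pass.
    return MoEQuantMode.FULLY_STATIC
```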
  quantization_config["moe_quant_type"] = "wint4"
  quantization_config["quantization"] = "mix_quant"
  quant_config_name = "mix_quant"
  # Special handling for moe w4afp8 dynamic quant
I'd suggest not hard-coding this here.
> I'd suggest not hard-coding this here.
Then how should this be written? Setting it from the command line isn't flexible enough at the moment.
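For context, a minimal sketch of the special-case handling discussed in this thread, assuming a small helper that rewrites the parsed quantization config. The function name and signature are hypothetical; only the dictionary keys and values ("moe_quant_type", "quantization", "mix_quant", "wint4") come from the diff above.

```python
# Hypothetical helper, not the actual FastDeploy code: it wraps the hard-coded
# override shown in the diff so the shape of the special case is visible.
def apply_moe_w4afp8_dynamic_overrides(quantization_config: dict) -> str:
    """Rewrite the parsed quantization config for MoE w4afp8 dynamic quant."""
    # Special handling for moe w4afp8 dynamic quant
    quantization_config["moe_quant_type"] = "wint4"
    quantization_config["quantization"] = "mix_quant"
    quant_config_name = "mix_quant"
    return quant_config_name
```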
For Usage or Command, please organize it by usage method: method 1: command line only; method 2: via config; method 3: some command-line examples combined with KV cache quantization, etc. Usage documentation needs to be added later.

OK. For method 3, I'll double-check whether the KV cache quantization method can currently be specified on the command line.
Motivation
With both weights and activations using dynamic quantization, there is no longer any dependency on outputs from an offline-quantized W4AFP8 model.
Modifications
- Fixed `apply_ep_prefill` issues in `fused_moe_cutlass_backend` by adding a stream sync (a toy illustration of the stream-sync idea follows this list).
- Related changes in `permute_x_kernel`.
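A toy sketch of the stream-sync idea only, not the actual kernel-side fix: the point is that asynchronously launched work must be synchronized before a consumer on another stream reads its results. `run_moe_prefill` and its tensors are hypothetical; `paddle.device.cuda.synchronize()` is a real Paddle API.

```python
# Illustrative only; the real fix lives in fused_moe_cutlass_backend.
import paddle


def run_moe_prefill(x: paddle.Tensor) -> paddle.Tensor:
    # ... dispatch tokens to experts, launch quant/GEMM kernels asynchronously ...
    y = x * 2.0  # stand-in for the asynchronous expert computation
    # Wait for all queued GPU work to finish before a downstream consumer
    # (possibly on a different stream) reads `y`.
    paddle.device.cuda.synchronize()
    return y
```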
Usage or Command

Use command line:
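A possible launch sketch. The server module path and the `--model`, `--tensor-parallel-size`, and `--quantization` options exist in FastDeploy's OpenAI-compatible server, but the `w4afp8` value and the other argument values here are assumptions for illustration.

```bash
# Hypothetical example; flag values are illustrative, not the PR's exact command.
python -m fastdeploy.entrypoints.openai.api_server \
    --model /path/to/model \
    --tensor-parallel-size 8 \
    --quantization w4afp8
```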
Use config.json:
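A sketch of what the quantization section of the model's config.json might look like. The "quantization"/"moe_quant_type" keys and the "mix_quant" value appear in the diff above, and `moe_dynamic_quant` appears as a config attribute in the code; the remaining field names and all values are assumptions.

```json
{
  "quantization_config": {
    "quantization": "mix_quant",
    "dense_quant_type": "wint8",
    "moe_quant_type": "w4afp8",
    "moe_dynamic_quant": true
  }
}
```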
Accuracy Tests
Checklist
- Add at least a tag in the PR title, chosen from: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Format your code: run `pre-commit` before commit.
- If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch first, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.