[Quantization] Support w4afp8 MoE dynamic quantization #5282
Conversation
Thanks for your contribution!
Codecov Report

Additional details and impacted files:

@@           Coverage Diff            @@
## develop #5282 +/- ##
==========================================
Coverage ? 59.87%
==========================================
Files ? 324
Lines ? 39826
Branches ? 5991
==========================================
Hits ? 23844
Misses ? 14080
Partials ? 1902
  self.default_dtype = layer._helper.get_default_dtype()
- if layer.ep_size > 1 and not layer.moe_quant_config.moe_dynamic_quant:
+ if layer.ep_size > 1 and layer.is_quantized and not layer.moe_quant_config.moe_dynamic_quant:
is_quantized and moe_dynamic_quant: how about consolidating them into a single field later? If the user sets is_quantized and the type is w4afp8, you can just check the weight dtype and shape; if they don't match, fall back to dynamic quantization.
> is_quantized and moe_dynamic_quant: how about consolidating them into a single field later? If the user sets is_quantized and the type is w4afp8, you can just check the weight dtype and shape; if they don't match, fall back to dynamic quantization.
- is_quantized is not only used here; the model-building code also uses it to decide whether the field given in weight_key_map is "weight" or "quant_weight". It already exists in the framework, so I think reusing it here is fine.
- moe_dynamic_quant is kept because we want to support three modes at the same time: 1. both weights and activations dynamically quantized (not is_quantized); 2. static weights with dynamic activations (is_quantized and moe_dynamic_quant); 3. both weights and activations static (is_quantized). If W4AFP8 activations always go through dynamic quantization in the future, moe_dynamic_quant will no longer be needed. (A rough sketch of these three modes follows below.)
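For illustration only, a minimal sketch of how the two flags discussed above map onto the three modes. Only `layer.is_quantized` and `layer.moe_quant_config.moe_dynamic_quant` appear in the diff; the enum and helper names are hypothetical.

```python
# Hypothetical sketch, not the actual FastDeploy code: only the two flags
# checked below come from the diff; the mode names are illustrative.
from enum import Enum


class MoEQuantMode(Enum):
    FULLY_DYNAMIC = 1        # weights and activations both quantized at runtime
    STATIC_W_DYNAMIC_A = 2   # offline-quantized weights, dynamic activations
    FULLY_STATIC = 3         # offline-quantized weights and activation scales


def resolve_moe_quant_mode(layer) -> MoEQuantMode:
    if not layer.is_quantized:
        # Mode 1: nothing was quantized offline, so everything is dynamic.
        return MoEQuantMode.FULLY_DYNAMIC
    if layer.moe_quant_config.moe_dynamic_quant:
        # Mode 2: weights are static (offline), activations quantized per step.
        return MoEQuantMode.STATIC_W_DYNAMIC_A
    # Mode 3: both weights and activation scales come from the offline pass.
    return MoEQuantMode.FULLY_STATIC
```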
  quantization_config["moe_quant_type"] = "wint4"
  quantization_config["quantization"] = "mix_quant"
  quant_config_name = "mix_quant"
  # Special handling for moe w4afp8 dynamic quant
I'd suggest not hard-coding this here.
> I'd suggest not hard-coding this here.
Then how should this be written? Setting it from the command line isn't flexible enough at the moment.
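For context, a minimal sketch of the special-case handling discussed in this thread, assuming a small helper that rewrites the parsed quantization config. The function name and signature are hypothetical; only the dictionary keys and values ("moe_quant_type", "quantization", "mix_quant", "wint4") come from the diff above.

```python
# Hypothetical helper, not the actual FastDeploy code: it wraps the hard-coded
# override shown in the diff so the shape of the special case is visible.
def apply_moe_w4afp8_dynamic_overrides(quantization_config: dict) -> str:
    """Rewrite the parsed quantization config for MoE w4afp8 dynamic quant."""
    # Special handling for moe w4afp8 dynamic quant
    quantization_config["moe_quant_type"] = "wint4"
    quantization_config["quantization"] = "mix_quant"
    quant_config_name = "mix_quant"
    return quant_config_name
```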
For Usage or Command, please organize it by usage method: method 1: command line only; method 2: via config; method 3: some command-line examples combined with KV cache quantization, etc. Usage documentation needs to be added later.

OK. For method 3, I'll double-check whether the KV cache quantization method can currently be specified on the command line.
Motivation
With both weights and activations using dynamic quantization, there is no longer any dependency on outputs from an offline-quantized W4AFP8 model.
Modifications
- Fixed `apply_ep_prefill` issues in `fused_moe_cutlass_backend` by adding a stream sync (a toy illustration of the stream-sync idea follows this list).
- Related changes in `permute_x_kernel`.
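A toy sketch of the stream-sync idea only, not the actual kernel-side fix: the point is that asynchronously launched work must be synchronized before a consumer on another stream reads its results. `run_moe_prefill` and its tensors are hypothetical; `paddle.device.cuda.synchronize()` is a real Paddle API.

```python
# Illustrative only; the real fix lives in fused_moe_cutlass_backend.
import paddle


def run_moe_prefill(x: paddle.Tensor) -> paddle.Tensor:
    # ... dispatch tokens to experts, launch quant/GEMM kernels asynchronously ...
    y = x * 2.0  # stand-in for the asynchronous expert computation
    # Wait for all queued GPU work to finish before a downstream consumer
    # (possibly on a different stream) reads `y`.
    paddle.device.cuda.synchronize()
    return y
```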
Usage or Command

Use command line:
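A possible launch sketch. The server module path and the `--model`, `--tensor-parallel-size`, and `--quantization` options exist in FastDeploy's OpenAI-compatible server, but the `w4afp8` value and the other argument values here are assumptions for illustration.

```bash
# Hypothetical example; flag values are illustrative, not the PR's exact command.
python -m fastdeploy.entrypoints.openai.api_server \
    --model /path/to/model \
    --tensor-parallel-size 8 \
    --quantization w4afp8
```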
Use config.json:
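A sketch of what the quantization section of the model's config.json might look like. The "quantization"/"moe_quant_type" keys and the "mix_quant" value appear in the diff above, and `moe_dynamic_quant` appears as a config attribute in the code; the remaining field names and all values are assumptions.

```json
{
  "quantization_config": {
    "quantization": "mix_quant",
    "dense_quant_type": "wint8",
    "moe_quant_type": "w4afp8",
    "moe_dynamic_quant": true
  }
}
```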
Accuracy Tests
Checklist
- Add at least a tag in the PR title, chosen from: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Format your code: run `pre-commit` before commit.
- If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch first, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.