Support DeepSeekV3-style block FP8 quantization (clean) (#1675)
SUMMARY:
Fixes [1475](#1475)
This was originally PR [#1607](#1607), but the commit history got messy. I cherry-picked Michael's original commit 451219a and updated from there.
TEST PLAN:
Tested locally and generated the quantized model.
---------
Signed-off-by: mgoin <[email protected]>
Signed-off-by: shanjiaz <[email protected]>
Co-authored-by: mgoin <[email protected]>
Co-authored-by: Kyle Sayers <[email protected]>
- Uses block-wise quantization to compress weights to FP8 in tiles (commonly 128×128), and dynamic per-token-group (group size 128) quantization for activations. Does not require a calibration dataset. Activation quantization is carried out at inference time on vLLM.
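For intuition, here is a minimal PyTorch sketch of the two quantization steps described above. The function names and the simple amax-based scaling are illustrative assumptions, not the implementation added by this PR; shapes are assumed to be divisible by the block/group size for brevity.

```python
import torch

FP8_DTYPE = torch.float8_e4m3fn
FP8_MAX = torch.finfo(FP8_DTYPE).max  # 448.0 for e4m3fn


def quantize_weight_blockwise(weight: torch.Tensor, block: int = 128):
    """Quantize a 2-D weight to FP8 with one scale per block x block tile.

    Hypothetical helper: a sketch of DeepSeekV3-style block weight
    quantization, done once offline (no calibration data needed).
    """
    rows, cols = weight.shape
    # View the weight as a grid of (block x block) tiles.
    tiles = weight.reshape(rows // block, block, cols // block, block)
    # One scale per tile, chosen so the tile's absolute max maps to FP8_MAX.
    amax = tiles.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-12)
    scales = amax / FP8_MAX
    q = (tiles / scales).to(FP8_DTYPE).reshape(rows, cols)
    return q, scales.reshape(rows // block, cols // block)


def quantize_activation_per_group(x: torch.Tensor, group: int = 128):
    """Dynamically quantize activations with one scale per group of
    `group` consecutive channels in each token.

    Hypothetical helper: this is the step performed on the fly during
    inference, which is why no calibration dataset is required.
    """
    tokens, hidden = x.shape
    groups = x.reshape(tokens, hidden // group, group)
    amax = groups.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
    scales = amax / FP8_MAX
    q = (groups / scales).to(FP8_DTYPE).reshape(tokens, hidden)
    return q, scales.squeeze(-1)
```

Conceptually, dequantization is just `q.to(torch.float32)` multiplied by the matching tile or group scale; in practice vLLM fuses this scaling into its FP8 matmul kernels rather than materializing dequantized tensors.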
## Sparsification
Sparsification reduces model complexity by pruning selected weight values to zero while retaining essential weights in a subset of parameters. Supported formats include: