ggml : add ggml_mul_mat_id_set_prec and use in llama.cpp #15619
Conversation
How broadly is this enabled? F32 accumulators are half the speed of F16 on GeForce.
This PR updates to use F32 accumulators for:
Here's a quick before/after; it is a significant slowdown across a lot of models:
I'd prefer to apply this only to models that need it. Or we could try other fixes, like clamping to finite values, or scaling/unscaling to avoid infinities.
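To make the clamping idea concrete: the accumulator can be saturated to the largest finite FP16 value (65504) after every fused multiply-add, so it never overflows to infinity when later stored in half precision. A minimal CPU-side sketch, assuming the real change would live in the Vulkan shaders (`clamped_fma_accumulate` is a hypothetical name, not an existing ggml function):

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical sketch: a dot product whose running sum is saturated to the
// finite FP16 range after each step, so it can never become +/-inf when the
// result is later stored in half precision.
float clamped_fma_accumulate(const float *a, const float *b, int n) {
    const float FP16_MAX = 65504.0f; // largest finite fp16 value
    float acc = 0.0f;
    for (int i = 0; i < n; ++i) {
        acc = std::fma(a[i], b[i], acc);
        acc = std::clamp(acc, -FP16_MAX, FP16_MAX); // saturate instead of overflowing
    }
    return acc;
}
```

Saturation changes the numeric result for out-of-range sums, but as noted below the observed output quality seems unaffected.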
The main problem is that we don't really have a way to reliably determine which models/tensors need F32 accumulators. Sometimes the FP range issues occur only at large contexts with specific content. I am pretty confident that the existing checks for GLM4 are not enough and that there are other models that need F32 accumulators and flew under the radar - nobody has reported them yet, so we don't know.

We could also consider enabling F32 accumulators as proposed and whitelisting F16 accumulators only once we have run enough tests at large context to be confident that they don't break anything. For now I'll update the PR to whitelist F32 accumulators only for GLM and GPT-OSS, and we can decide later how to improve this.

Just waiting on #15274 (comment) to confirm that they used the correct branch. Otherwise it would mean there are more tensors that need F32 to make GPT-OSS run with Vulkan.
I'd also be curious to know whether we really need F32 precision for these models, or if we're just running into infinities and we might be able to get away with clamping to the max value.
Maybe it's possible to build a specific test that runs a full-size prompt through the model and checks each mul_mat, mul_mat_id, flash attention, conv2d, and whatever else could use an fp16 accumulator for NaNs/infinities. Then tuning each model would be simple.
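A building block for such a test could be a per-op scan of the output tensor for non-finite values; a sketch under that assumption (`has_non_finite` is an illustrative helper, not an existing ggml API):

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical helper: return true if any element of an op's F32 output
// is NaN or +/-inf, which would indicate the fp16 accumulator overflowed.
bool has_non_finite(const float *data, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        if (!std::isfinite(data[i])) {
            return true;
        }
    }
    return false;
}
```

Running this over every mul_mat/mul_mat_id/FA output during a full-size prompt would flag exactly which tensors need F32 accumulators for a given model.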
I was able to reproduce the failure from #15274. I verified that clamping infinities to +/-max fp16 in the mul_mat and mul_mat_id shaders was sufficient to fix it (I didn't do it in FA, but that should be possible too). The quality of the output seems fine (see below). This is something we could enable all the time, and it has negligible effect on performance.
I pushed a draft change to clamp values to finite in #15652.
I'm OK to proceed with the proposed #15652 if you prefer it, although I continue to have some doubts about the usage of F16 accumulators. What I plan to do is to run the AIME25 benchmark and compare the results between the CUDA and Vulkan backends. I recently got an RTX 5090 so I should be able to run the eval with Ideally it would be better to run the AIME25 eval for the large
Replaced with #15652 |
fix #15274, #15517, #15516
Force F32 accumulators for attention and ffn output matrix multiplications.
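To illustrate why the F32 accumulators matter here: an FP16 accumulator overflows to infinity as soon as a partial sum exceeds 65504, while the same sum stays finite in FP32. A self-contained simulation of the range effect only (fp16 precision loss is ignored; all names are illustrative, not ggml APIs):

```cpp
#include <cmath>

// Model only the *range* of fp16: any magnitude above 65504 becomes +/-inf.
// (Real fp16 also loses precision, which this sketch ignores.)
float to_fp16_range(float x) {
    return std::fabs(x) > 65504.0f ? std::copysign(INFINITY, x) : x;
}

// Accumulate the same partial products with an fp16-range accumulator and a
// full fp32 accumulator; the fp16 one can overflow mid-sum.
void accumulate_both(const float *products, int n, float *acc16, float *acc32) {
    *acc16 = 0.0f;
    *acc32 = 0.0f;
    for (int i = 0; i < n; ++i) {
        *acc16 = to_fp16_range(*acc16 + products[i]); // goes to inf past 65504
        *acc32 += products[i];                        // stays finite
    }
}
```

Once the fp16-range accumulator hits infinity it never recovers, which is the NaN/garbage-output failure mode the affected models hit at large contexts.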